Institute of Mathematical Statistics LECTURE NOTES-MONOGRAPH SERIES Volume 32
Selected Proceedings of the Symposium on Estimating Functions Ishwar V. Basawa, V.P. Godambe and Robert L. Taylor, Editors
Institute of Mathematical Statistics Hayward, California
Institute of Mathematical Statistics Lecture Notes-Monograph Series Editorial Board Andrew A. Barbour, Joseph Newton, and David Ruppert (Editor)
The production of the IMS Lecture Notes-Monograph Series is managed by the IMS Business Office: Miriam Gasko Donoho, IMS Treasurer, and James H. Sanders, IMS Business Manager.
Library of Congress Catalog Card Number: 97-077203 International Standard Book Number 0-940600-44-7 Copyright © 1997 Institute of Mathematical Statistics All rights reserved Printed in the United States of America
TABLE OF CONTENTS

An Overview of the Symposium
  I.V. Basawa, V.P. Godambe and R.L. Taylor ... 1

Estimating Functions: A Synthesis of Least Squares and Maximum Likelihood Methods
  V.P. Godambe ... 5

SECTION 1: LIKELIHOOD AND RELATED TOPICS ... 17

Partial Likelihood and Estimating Equations
  P. Greenwood and W. Wefelmeyer ... 19

Avoiding the Likelihood
  C.C. Heyde ... 35

Likelihood and Pseudo-likelihood Estimation Based on Response-Biased Observation
  J.F. Lawless ... 43

Likelihood From Estimating Functions
  P.A. Mykland ... 57

SECTION 2: GENERAL THEORY ... 63

Estimating Functions in Semiparametric Statistical Models
  S. Amari and M. Kawanabe ... 65

Estimating Functions, Partial Sufficiency and Q-Sufficiency in the Presence of Nuisance Parameters
  V.P. Bhapkar ... 83

Estimating Functions and Higher Order Significance
  D.A.S. Fraser, N. Reid and J. Wu ... 105

On Consistency of Generalized Estimating Equations
  B. Li ... 115

SECTION 3: QUASILIKELIHOOD ... 137

Extended Quasilikelihood and Estimating Equations
  J.A. Nelder and Y. Lee ... 139

Quasilikelihood Regression Models for Markov Chains
  W. Wefelmeyer ... 149

SECTION 4: APPLICATIONS TO LINEAR MODELS AND ECONOMETRICS ... 175

Optimal Instrumental Variable Estimation for Linear Models With Stochastic Regressors Using Estimating Functions
  A.C. Singh and R.P. Rao ... 177

On Estimating Function Approach in the Generalized Linear Mixed Model
  B.C. Sutradhar and V.P. Godambe ... 193

Using Godambe-Durbin Estimating Functions in Econometrics
  H.D. Vinod ... 215

Estimating Functions and Over-Identified Models
  T. Wirjanto ... 239

SECTION 5: APPLICATIONS TO TIME SERIES, BIOSTATISTICS AND STOCHASTIC PROCESSES ... 257

On the Prediction for Some Nonlinear Time Series Models Using Estimating Functions
  B. Abraham, A. Thavaneswaran and S. Peiris ... 259

Estimating Function Methods of Inference for Queueing Parameters
  I.V. Basawa, R. Lund and U.N. Bhat ... 269

Optimal Estimating Equations for State Vectors in Non-Gaussian and Nonlinear State Space Time Series Models
  J. Durbin ... 285

Estimating Functions in Failure Time Data Analysis
  R.L. Prentice and L. Hsu ... 293

Estimating Functions for Discretely Observed Diffusions: A Review
  M. Sorensen ... 305

Fitting Diffusion Models in Finance
  D.L. McLeish and A.W. Kolkiewicz ... 327

SECTION 6: APPLICATIONS TO SPATIAL STATISTICS ... 351

Prediction Functions and Geostatistics
  A.F. Desmond ... 353

Efficiency of the Pseudo-Likelihood Estimate in a One Dimensional Lattice Gas
  J.L. Jensen ... 369

Estimating Functions for Semivariogram Estimation
  S. Lele ... 381

SECTION 7: NONPARAMETRICS, ROBUST INFERENCE AND BOOTSTRAP ... 397

Estimating Covariance Matrices Using Estimating Functions in Nonparametric and Semiparametric Regression
  R.J. Carroll, S.J. Iturria and R.G. Gutierrez ... 399

Estimating Equations and the Bootstrap
  F. Hu and J.D. Kalbfleisch ... 405

Estimating Functions: Nonparametrics and Robustness
  P.K. Sen ... 417

SECTION 8: FURTHER TOPICS ... 437

Inference From Stable Distributions
  H. El Barmi and P.I. Nelson ... 439

Separate Optimum Estimating Function for the Ruled Exponential Family
  T. Yanagimoto and Y. Hiejima ... 457
AN OVERVIEW OF THE SYMPOSIUM ON ESTIMATING FUNCTIONS

I.V. Basawa, University of Georgia
V.P. Godambe, University of Waterloo
R.L. Taylor, University of Georgia

The Symposium on Estimating Functions was held at the University of Georgia from March 21 to March 23, 1996. The Symposium was cosponsored by the Institute of Mathematical Statistics and the Statistical Society of Canada and represented continuing efforts by the two professional societies to focus special attention on some of the more prominent directions in probability and statistics. Partial funding by the University of Georgia's "State-of-the-Art" Conference Program and a National Security Agency grant contributed to the success of the Symposium and is gratefully acknowledged.

The Symposium attracted 119 registered participants from several countries including Australia, Canada, Denmark, England, Germany, Hong Kong, India, Japan, Kuwait, Sweden, and the United States of America. The program consisted of 13 sessions with 35 invited speakers and 14 contributed talks.

The main theme of the Symposium can be summarized as "Statistics at a Juncture of a Synthesis." The likelihood function has provided a basic methodology for parametric inference for decades, while semi-parametric inference has for even longer been based primarily on the least-squares methodology. The unification and extension of these two methodologies, achieved in recent years through estimating functions, was the main theme of the opening talk of the Symposium by V.P. Godambe. The discussions and presentations which followed were energetic and covered a wide variety of topics: C.C. Heyde suggested avoiding the likelihood; J.A. Nelder presented extensions to quasilikelihood; N. Reid discussed higher order significance; J. Durbin discussed applications to nonlinear state space time series; J.D. Kalbfleisch presented bootstrap methods using estimating functions. There appeared to be a consensus at the Symposium that Statistics was at
a juncture of a synthesis of two of its main methodologies, namely the likelihood and least squares, brought about by estimating functions.

Many of the major results which were presented at the Symposium are summarized in these selected proceedings. Specifically, the papers are organized into eight sections. A general historical paper by V.P. Godambe precedes the following sections:

1. 'Likelihood', with papers by P. Greenwood & W. Wefelmeyer, C.C. Heyde, J.F. Lawless, and P.A. Mykland. Topics central to this section include likelihood, partial likelihood, pseudo-likelihood and other alternatives to the likelihood.

2. 'General Theory', with papers by S. Amari & M. Kawanabe, V.P. Bhapkar, D.A.S. Fraser, N. Reid & J. Wu, and B. Li. Papers in this section deal with problems on estimating functions in semiparametric models, nuisance parameters, higher order significance and consistency.

3. 'Quasilikelihood', with papers by J.A. Nelder & Y. Lee, and W. Wefelmeyer. Extended quasilikelihood and regression models for Markov chains are discussed in this section.

4. 'Applications to Linear Models and Econometrics', with papers by A.C. Singh & R.P. Rao, B.C. Sutradhar & V.P. Godambe, H.D. Vinod, and T. Wirjanto. Papers in this section address problems in instrumental variable estimation, generalized linear mixed models, Godambe-Durbin estimating functions in econometrics and over-identified models.

5. 'Applications to Time Series, Biostatistics and Stochastic Processes', with papers by B. Abraham, A. Thavaneswaran & S. Peiris, I.V. Basawa, R.B. Lund & U.N. Bhat, J. Durbin, R.L. Prentice & L. Hsu, M. Sorensen, and D.L. McLeish & A.W. Kolkiewicz. Applications in this section are to nonlinear state space models, prediction, failure time data analysis, queueing parameter estimation, diffusion processes and models in finance.

6. 'Applications to Spatial Statistics', with papers by A.F. Desmond, J.L. Jensen, and S. Lele. Prediction in geostatistics, pseudo-likelihood estimation for lattice models and semivariogram estimation are covered in this section.

7. 'Nonparametrics, Robust Inference and Bootstrap', with papers by R.J. Carroll, S.J. Iturria & R.G. Gutierrez, F. Hu & J.D. Kalbfleisch, and P.K. Sen. These papers contain results on estimating covariance matrices, nonparametrics and robustness, and bootstrap techniques.

8. 'Further Topics', with papers by H. El Barmi & P.I. Nelson, and T. Yanagimoto & Y. Hiejima. The two papers in this final section are concerned with inference from stable distributions and inference for the ruled exponential family.

The editors of these selected proceedings of 29 papers are very grateful to the numerous referees who carefully and critically reviewed all papers which
were submitted for publication in the proceedings. Also, the willingness of individual authors to limit the number of pages of their articles helped make it possible to produce this volume in the IMS Lecture Notes Series. Special thanks go to Connie Durden for the preparation of this volume. Ms. Durden worked tirelessly and patiently with the various authors, securing the source files of their papers and providing uniformity of margins, spacing and similar editorial details, which greatly enhanced the general appearance of this volume.
ESTIMATING FUNCTIONS: A SYNTHESIS OF LEAST SQUARES AND MAXIMUM LIKELIHOOD METHODS

V.P. Godambe, University of Waterloo

ABSTRACT

The development of the modern theory of estimating functions is traced from its inception. It is shown that this development has brought about a synthesis of the two historically important methodologies of estimation, namely 'least squares' and 'maximum likelihood'.

Key Words: Estimating functions; likelihood; score function.
1 Introduction
As with most historical investigations, it is difficult to trace the origin of the subject of this conference, 'Estimating Functions'. However, over the last two centuries there are clearly three important precursors of the modern theory of estimating functions (EF): in 1805, Legendre introduced the least squares (LS) method; at the turn of the last century Pearson proposed the method of moments; and in 1925 Fisher put forward the maximum likelihood (ML) equations. Of these three, the method of moments faded out in time because it lacked any sound theoretical justification. The other two methods, LS and ML, still play an important role in statistical methodology, and it is these two that will concern us in the following.

The LS method was justified by what today is called the Gauss-Markoff (GM) theorem: the estimates obtained from the LS equations are 'optimal' in the sense that they have minimum variance in the class of linear unbiased estimates. This was a finite sample justification. At about the same time Laplace provided a different, 'asymptotic' justification for the method. Fisher justified ML estimation on the grounds that it produced estimates which are asymptotically unbiased with smallest variance. This left open the question: is there a finite sample justification for ML estimation corresponding to the GM theorem justification for LS estimation? The modern EF theory provided such a justification. According to the 'optimality criterion' of the EF theory, the score function (SF) is 'optimal'.
2 SF Optimality
To state the result just mentioned formally, we briefly introduce some notation. Let $X = \{x\}$ be the sample (observation) space and let a class of possible distributions (densities) on $X$ be given by $\{f(\cdot\,|\theta),\ \theta \in \Omega\}$, $\Omega$ being the parameter space, which we assume here to be the real line. If the function $f$ is completely specified up to the (unknown) parameter $\theta$, $f(\cdot\,|\theta)$ is called a parametric model. For this model the score function is $\mathrm{SF} = \partial \log f(\cdot\,|\theta)/\partial\theta$. Any real function of $x$ and $\theta$, say $g(x,\theta)$, is called an estimating function (EF). It is said to be unbiased if its mean value is zero for all $\theta \in \Omega$: $\mathcal{E}(g) = 0$. Further, for reasons which will become clear later, corresponding to every EF $g$ we define a standardized version $g/\mathcal{E}(\partial g/\partial\theta)$. Now in a class $\mathcal{G} = \{g\}$ of unbiased estimating functions, $g^*$ is said to be 'optimal' if the variance of the standardized EF is minimized at $g = g^*$:

$$\mathcal{E}(g^*)^2\big/\{\mathcal{E}(\partial g^*/\partial\theta)\}^2 \;\le\; \mathcal{E}(g)^2\big/\{\mathcal{E}(\partial g/\partial\theta)\}^2, \qquad \theta \in \Omega,\ g \in \mathcal{G}. \tag{2.1}$$
SF Theorem (Godambe, 1960). For a parametric model $f(\cdot\,|\theta)$, granting some regularity conditions, in the class of all unbiased EFs the optimal estimating function is given by the SF, i.e. $g^* = \partial \log f(\cdot\,|\theta)/\partial\theta$.

The optimality of the SF given by the above theorem should be distinguished from the optimality of the LS estimates based on the GM theorem. The SF optimality (though, with some additional assumptions, it implies asymptotic optimality of the ML estimate) is essentially optimality of the 'estimating function', while the LS optimality is optimality of the 'estimate'. The concept underlying the optimality criterion of the EF theory became more vivid and compelling in relation to the problem of nuisance parameters.
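The criterion (2.1) is easy to check numerically. The following small Monte Carlo sketch (my own addition, not part of the paper; the Poisson model and the competing quadratic estimating function are purely illustrative choices) compares the standardized variance of the score with that of another unbiased EF:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 50, 20000

def standardized_var(g, dg):
    # E(g^2) / {E(dg/dtheta)}^2, the quantity minimized in (2.1)
    return np.mean(g**2) / np.mean(dg)**2

x = rng.poisson(theta, size=(reps, n))
s = x.sum(axis=1)

# score function for a Poisson(theta) sample: (sum x_i - n*theta)/theta
g_sf, dg_sf = (s - n*theta)/theta, -s/theta**2

# a competing unbiased EF: sum x_i^2 - n(theta + theta^2)
g_q = (x**2).sum(axis=1) - n*(theta + theta**2)
dg_q = np.full(reps, -n*(1 + 2*theta))

print(standardized_var(g_sf, dg_sf))  # ~ theta/n = 0.040
print(standardized_var(g_q, dg_q))    # ~ 0.046, strictly larger
```

Both functions are unbiased, but the score attains the smaller standardized variance, as the SF Theorem asserts.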
3 Conditional SF Optimality
Now let the parameter $\theta$ consist of two components $\theta_1$ and $\theta_2$, $\theta = (\theta_1,\theta_2)$, and let the parametric model be $f(\cdot\,|\theta_1,\theta_2)$, where $\theta_1$ is real and $\theta_2$ is a vector; $\theta \in \Omega$, $\theta_1 \in \Omega_1$, $\theta_2 \in \Omega_2$ and $\Omega = \Omega_1 \times \Omega_2$. Further suppose we want to estimate only $\theta_1$ (the interest parameter), ignoring $\theta_2$ (the nuisance parameter). How should we proceed? To this question ML estimation provides no satisfactory answer. If $\hat\theta_1$ and $\hat\theta_2$ are the joint ML estimates for $\theta_1$ and $\theta_2$, then, as is well known, the estimate $\hat\theta_1$ can be inconsistent (unacceptable) in case the dimensionality of the parameter $\theta_2$ goes on increasing with the number of observations (cf. Neyman and Scott, 1948). The EF theory, for the present
situation, implies restricting to that part of the likelihood function which is governed by the interest parameter $\theta_1$ only. Formally, for the parametric model $f(\cdot\,|\theta_1,\theta_2)$, let $\mathcal{G}_1$ be the class of all unbiased EFs $g(x,\theta_1)$, that is, functions of $x$ and $\theta_1$ only:

$$\mathcal{G}_1 = \{g :\ g = g(x,\theta_1),\ \mathcal{E}(g) = 0,\ \theta \in \Omega\}.$$

Further let $t$ be a complete sufficient statistic for the parameter $\theta_2$, for every fixed $\theta_1$. Assuming the statistic $t$ is independent of the parameter $\theta_1$, we have

Conditional SF Theorem (Godambe, 1976). Granting some regularity conditions, in the class of EFs $\mathcal{G}_1$ the 'optimal' EF $g^*$ is given by the conditional SF, i.e. $g^* = \partial \log f(\cdot\,|t;\theta_1)/\partial\theta_1$.

Note that in the above theorem the definition of optimality is obtained from (2.1) just by replacing $\mathcal{G}$ by $\mathcal{G}_1$ and consequently $\mathcal{E}(\partial g/\partial\theta)$ by $\mathcal{E}(\partial g/\partial\theta_1)$. That is, the criterion of optimality is unconditional. In the case of the Neyman-Scott example, unlike the ML estimate $\hat\theta_1$, the equation 'conditional SF $= 0$' provides a consistent estimate of $\theta_1$.

Further, the EF optimality criterion suggests a definition of 'conditional SF' in case the statistic $t$ depends on the parameter $\theta_1$. If $t(\theta_{10})$ is the value of $t$ at $\theta_1 = \theta_{10}$, then we define the conditional SF by $g^*$ where

$$g^* = \{\partial \log f(\cdot\,|t(\theta_{10});\theta_1,\theta_2)/\partial\theta_1\}_{\theta_{10}=\theta_1}. \tag{3.1}$$

This definition is motivated as follows. The EF $g^*$ in (3.1) is in $\mathcal{G}_1$, though it depends on $\theta_2$. It further is 'optimal' in $\mathcal{G}_1$, though only locally at $\theta_2$ (Lindsay, 1982). Unlike the previous situation, where the sufficient statistic $t$ was independent of $\theta_1$, now no universally optimal $g^*$ (i.e. for all $\theta_2 \in \Omega_2$) exists in $\mathcal{G}_1$. Further, though the EF $g^*$ in (3.1) depends on $\theta_2$, it is orthogonal to the marginal SF of the sufficient statistic $t$; hence substituting in it an estimate $\hat\theta_2$ derived from the latter would still leave it nearly optimal for large samples (Lindsay, 1982; Godambe, 1991; Small and McLeish, 1994; Liang and Zeger, 1995). The equation

$$g^*(x,\theta_1,\hat\theta_2) = 0$$

would provide a (nearly optimal) consistent estimate of $\theta_1$.

Note that in the foregoing discussion, conditioning is used just as a 'technique' to obtain (unconditionally) 'optimum' EFs; it is not used as a principle of inference. In fact, without invoking any conditioning at all, Godambe and Thompson (1974) established, in the case of the normal distribution $N(\theta_1,\theta_2)$, the optimality of the EF $(s^2 - \theta_2)$ for the interest parameter $\theta_2$, ignoring the nuisance parameter $\theta_1$. How this (unconditional) optimality leads to a very 'flexible conditioning' will be discussed later.
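The Neyman-Scott phenomenon mentioned above is easy to reproduce. In the following sketch (my own, not from the paper; the paired-normal setup is the standard textbook illustration), each pair has its own nuisance mean, the joint ML estimate of the variance converges to half the true value, and the root of the conditional SF is consistent:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, n = 4.0, 5000                  # interest parameter, number of pairs
mu = rng.normal(0.0, 10.0, n)          # one nuisance mean per pair
x = rng.normal(mu[:, None], np.sqrt(sigma2), (n, 2))

rss = ((x - x.mean(axis=1, keepdims=True))**2).sum()

print(rss / (2*n))   # joint ML estimate: tends to sigma2/2 = 2 (inconsistent)
print(rss / n)       # root of the conditional SF: tends to sigma2 = 4
```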
For a general perspective on the topic of conditioning and optimality we refer to Small and McLeish (1988), Lindsay and Waterman (1991) and Lindsay and Li (1995). Lloyd (1987) and Bhapkar (1991) have given results concerning optimality of the 'marginal SF' under 'conditional completeness'.

From the above discussion it is clear that the EF theory has corrected a major deficiency of ML estimation in the case of nuisance parameters. Some earlier references in respect of nuisance parameters are Bartlett (1936), Cox (1958), Barnard (1963), Kalbfleisch and Sprott (1970), Barndorff-Nielsen (1973) and others. Some of these authors tried to obtain conditions under which the marginal distribution of $t$ does not contain any information about $\theta_1$, the parameter of interest. As we have seen, the optimality criterion of the EF theory yields such a condition in terms of 'completeness of the statistic $t$'. Though not universally applicable (as none can be, I suppose), it has by now been commonly used for its mathematical manageability. It also carries greater conviction, for it is derived from an optimality criterion which has proved to be very generally fruitful. In the following we show that the EF theory, just as it corrected ML estimation, also corrects some major inadequacies of LS estimation and the GM theorem.
4 Quasi-Score Function
We now replace the abstract sample (observation) in the discussion by $n$ real variates $x_i$, $i = 1,\ldots,n$, which are assumed to be independently distributed with means $\mu_i(\theta)$ and variances $v_i(\theta)$ ($\mu_i$ and $v_i$ being some specified functions of $\theta$), $i = 1,\ldots,n$. For simplicity let $\theta$ be a scalar parameter. Initially we consider the special case where the $\mu_i$ are linear functions of $\theta$ and the $v_i$ are independent of $\theta$. Here the LS equation is given by $\sum_i (x_i - \mu_i)(\partial\mu_i/\partial\theta)/v_i = 0$. The solution of this equation, as said before, has, according to the GM theorem, smallest variance in the class of all linear unbiased estimates of $\theta$; hence it is 'optimal'. The estimating function $\sum_i (x_i - \mu_i)(\partial\mu_i/\partial\theta)/v_i$ is also 'optimal' according to criterion (2.1) in the class of all EFs of the form

$$g = \sum_{i=1}^n (x_i - \mu_i)\,a_i, \tag{4.1}$$

where the $a_i$ can be arbitrary functions of $\theta$. (Actually here we minimize $\mathcal{E}(g^2)$ subject to holding $\mathcal{E}(\partial g/\partial\theta)$ constant; this explains the standardization of the EF mentioned earlier.) Note that this EF optimality implies more than the GM optimality, for the solutions corresponding to all the equations $g = 0$ include not only all linear unbiased estimates of $\theta$ but many more.
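As a small numerical illustration of the class (4.1) (a sketch of mine, not from the paper; the linear mean and the particular variance pattern are arbitrary assumptions), compare the weights $a_i = (\partial\mu_i/\partial\theta)/v_i$ with the unweighted choice $a_i = \partial\mu_i/\partial\theta$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 1.5, 40, 20000
z = rng.uniform(0.5, 2.0, n)            # fixed regressors: mu_i = theta * z_i
v = 0.2 + z**3                          # known heteroscedastic variances v_i

x = theta*z + np.sqrt(v)*rng.standard_normal((reps, n))

# roots of two members of class (4.1)
theta_opt = (x*z/v).sum(axis=1) / (z*z/v).sum()   # a_i = z_i / v_i (optimal)
theta_unw = (x*z).sum(axis=1) / (z*z).sum()       # a_i = z_i (unweighted)

print(theta_opt.var(), theta_unw.var())  # optimal weights give smaller variance
```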
Now let the means $\mu_i$ and variances $v_i$ be arbitrarily specified functions of $\theta$. Here the LS equation, obtained by minimizing $\sum_i (x_i-\mu_i)^2/v_i$, is given by $\bar g + B = 0$, where

$$\bar g = \sum_{i=1}^n (x_i - \mu_i)\,\frac{\partial\mu_i/\partial\theta}{v_i} \quad\text{and}\quad B = \frac{1}{2}\sum_{i=1}^n (x_i - \mu_i)^2\,\frac{\partial v_i/\partial\theta}{v_i^2}. \tag{4.2}$$

Clearly in (4.2), $\mathcal{E}(\bar g) = 0$ and $\mathcal{E}(B) = \frac{1}{2}\sum_i \partial\log v_i/\partial\theta$. Note that for large $n$, $(\bar g/n) \simeq 0$ while $(B/n)$ could still be very large. Hence, because of the bias term $B$, the LS equation $\bar g + B = 0$ would generally lead to an inconsistent estimate. On the other hand, according to the optimality criterion (2.1) of the EF theory, in the class of EFs given by (4.1) for different functions $a_i(\theta)$, $i = 1,\ldots,n$, the function $\bar g$ given by (4.2) is 'optimal'. Generally the equation $\bar g = 0$ leads to a consistent solution. (Here the GM theorem cannot be of any avail, for the solution of $\bar g = 0$ would generally be a biased estimate.) For reasons to be explained soon, we call the estimating function $\bar g$ a quasi-score function (quasi-SF). Interestingly, the EF optimality of the quasi-SF $\bar g$ was first established in the wider setting of discrete stochastic processes with a martingale structure.

Quasi-SF Theorem (Godambe, 1985). If $\mu_i$ and $v_i$ denote the means and variances of $x_i$ conditional on the past observations $x_{i-1},\ldots,x_0$, i.e. $\mu_i = \mu_i(\theta, x_0,\ldots,x_{i-1})$ and $v_i = v_i(\theta, x_0,\ldots,x_{i-1})$ for $i = 1,\ldots,n$, then

$$\bar g = \sum_{i=1}^n (x_i - \mu_i)\,\frac{\partial\mu_i/\partial\theta}{v_i} \tag{4.3}$$

is the optimal EF in the class of EFs given by (4.1), where now the $a_i$ are functions of $x_{i-1},\ldots,x_0$ in addition to $\theta$.

Among the precursors of the EF $\bar g$ in (4.3) are the following: Durbin (1960) gave a GM theorem analogue for a linear time series model; Klimko and Nelson (1978) obtained conditional LS equations; Kalbfleisch and Lawless (1983) suggested a special case of $\bar g$ for Markov models. For further generalizations of the EF optimality results relating to $\bar g$ in (4.3) we refer to Godambe (1985), Godambe and Heyde (1987) and Godambe and Thompson (1989).

Returning, for simplicity, to the case where the variates $x_i$, $i = 1,\ldots,n$, are independently distributed, we summarize important properties of the EF $\bar g$ given by (4.2). The 'optimality' of $\bar g$ is for the semi-parametric model defined by the means $\mu_i(\theta)$ and variances $v_i(\theta)$. As a special case, when $\mu_i$ is linear in $\theta$ and $v_i$ is independent of $\theta$, the EF optimality of $\bar g$ implies the GM optimality of the LS estimates. In this special case, if the underlying distribution is normal, the LS estimates coincide with the ML estimates. Generally, for exponential family distributions the SF coincides with the optimal EF $\bar g$ given by (4.2) (see the worked check at the end of this section). Even outside the exponential family of distributions, $\bar g$ satisfies a very general property of the SF: just as $\mathcal{E}(\mathrm{SF})^2 = -\mathcal{E}(\partial\,\mathrm{SF}/\partial\theta)$, we have $\mathcal{E}(\bar g)^2 = -\mathcal{E}(\partial\bar g/\partial\theta)$. Further, if SF denotes a generic 'score function' for the class of distributions consistent with the semi-parametric model mentioned above, then $\mathcal{E}(\bar g - \mathrm{SF})^2 \le \mathcal{E}(g - \mathrm{SF})^2$ (the expectation being taken w.r.t. the distribution that corresponds to the SF) for all the EFs $g$ given by (4.1). These are the properties which justify the previously introduced term 'quasi-SF' for $\bar g$. (Even before the EF optimality of $\bar g$ was discovered, the term quasi-likelihood was commonly used in the literature on generalized linear models; McCullagh and Nelder, 1983, 1989.)

As we have seen previously, the EF theory corrected a major deficiency of ML estimation relating to nuisance parameters. The above discussion points to yet another accomplishment of the EF theory. It brought about, via the quasi-score function $\bar g$, a kind of synthesis of two historically distinct methods of estimation: LS for semi-parametric models and ML for parametric models. The same criterion of EF optimality, namely (2.1), is satisfied in the case of the latter by the SF and in the case of the former by the quasi-SF $\bar g$ in (4.2); only the classes of competing EFs are different, taken appropriate to the model (see Godambe and Thompson, 1989, Appendix). Of course, the foregoing discussion also shows that the quasi-SF $\bar g$ provides not only a unification of the two methods LS and ML but much more: a generalization to deal with problems outside the scope of both. As a further contribution of the EF theory to statistics, below we briefly outline a very 'flexible conditioning' that the theory permits, and the consequent incorporation of the Bayesian factor within its methodology.
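Here is the promised worked check of the exponential-family claim (my own addition, with Poisson as the illustrative member): for independent $x_i \sim \mathrm{Poisson}(\mu_i(\theta))$, so that $v_i = \mu_i$,

$$\frac{\partial}{\partial\theta}\sum_i \big(x_i\log\mu_i - \mu_i\big) = \sum_i \Big(\frac{x_i}{\mu_i} - 1\Big)\frac{\partial\mu_i}{\partial\theta} = \sum_i (x_i-\mu_i)\,\frac{\partial\mu_i/\partial\theta}{v_i},$$

which is exactly the quasi-SF $\bar g$ of (4.2).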
5 A Generalization
It was mentioned earlier that, within the framework of martingales and the corresponding filtering, the EF theory suggested the use of weighted conditional least squares estimation, on grounds of its optimality property. But to deal with general spatial processes one needs more flexible conditioning than used before; this was provided by Godambe and Thompson (1989). Let, as before, $X = \{x\}$ be an abstract sample space and $\mathcal{F} = \{F\}$ a class of distributions on $X$. Further let $\theta$ be a real parameter defined on $\mathcal{F}$: $\{\theta(F),\ F \in \mathcal{F}\} = \Omega$. Now suppose $h_j$ is a real function on $X \times \Omega$ and $\mathcal{X}_j$ a specified partition (or a $\sigma$-field generated by a partition) of $X$ such that

$$\mathcal{E}(h_j \mid \mathcal{X}_j) = 0, \qquad j = 1,\ldots,k. \tag{5.1}$$

The functions $h_j$, $j = 1,\ldots,k$, are called the elementary EFs; they are not exhaustive, and their choice is determined by the problem at hand. Now suppose the elementary EFs $h_1,\ldots,h_k$ are mutually orthogonal (Def., Godambe and Thompson, 1989) and the class of underlying distributions $\mathcal{F}$ satisfies certain conditions. Then in the class of all EFs $g$ of the form

$$g = \sum_{j=1}^k q_j h_j, \tag{5.2}$$

where the $q_j$ are real functions on $X \times \Omega$ which are measurable on $\mathcal{X}_j$, $j = 1,\ldots,k$, the 'optimal' one is given by

$$g^* = \sum_{j=1}^k q_j^* h_j, \tag{5.3}$$

where $q_j^* = \{\mathcal{E}(\partial h_j/\partial\theta \mid \mathcal{X}_j)\}/\{\mathcal{E}(h_j^2 \mid \mathcal{X}_j)\}$. Here the criterion of optimality, as always, is unconditional, given by (2.1) for a real parameter $\theta$ (or its appropriate version if $\theta$ is a vector); the expectation is taken with respect to $F \in \mathcal{F}$.

Up to the above results the EF theory was 'restricted' to the classical setup, where distributions on the sample space $X$ for fixed values of the parameters are considered. But the formalism of the EF optimality criterion is flexible enough that this 'restriction' can be set aside if we know something about the prior distribution of $\theta$, for instance its mean ($\theta_0$) and variance ($v_0$). Under such a Bayesian setup the only changes required are as follows: (i) in (5.1), $\mathcal{X}_j$ is not necessarily a partition of just the sample space $X$, but can be a partition of $X \times \Omega$, $\Omega$ as before being the parameter space; (ii) some elementary EFs $h_j$, $j = 1,\ldots,k$, can now be functions exclusively of the parameter $\theta$; (iii) all expectations in the optimality criterion (2.1) are now with respect to the joint distribution of $(x,\theta)$ (and not, as before, with respect to distributions of $x$ given $\theta$).

The following is an illustration. Let the partitions $\mathcal{X}_j$ and the elementary estimating functions $h_j$, $j = 1,\ldots,k$, be the same as in (5.1). Further, as suggested before, let the mean ($\theta_0$) and the variance ($v_0$) of the prior distribution of $\theta$ be known. Now to the set of elementary estimating functions $h_1,\ldots,h_k$ we add one more, namely $h_{k+1} = \theta - \theta_0$. In this case the optimal EF is given by $g^* + (\theta - \theta_0)/v_0$, where $g^*$ is the same as in (5.3). Similarly, the quasi-SF $\bar g$ in (4.2), which was obtained under the assumption '$\theta$ is fixed', will now have to be replaced by $\bar g - (\theta - \theta_0)/v_0$ (Godambe, 1994). The 'optimality' of the EF given by the derivative of the logarithm of the posterior density was established in a 'parametric setup' by Ferreira (1982) and Ghosh (1990). Naik-Nimbalkar and Rajarshi (1995) have established some optimality results in a 'semi-parametric Bayesian setup'.
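A small worked instance of this Bayesian extension (my own illustration, not in the text): for $x_1,\ldots,x_n$ with common mean $\theta$ and known variance $v$, take $\bar g = \sum_i (x_i-\theta)/v$ and combine it with $h_{k+1} = \theta - \theta_0$ as above. Setting the combined EF to zero,

$$\sum_{i=1}^n \frac{x_i-\theta}{v} - \frac{\theta-\theta_0}{v_0} = 0 \quad\Longrightarrow\quad \hat\theta = \frac{n\bar x/v + \theta_0/v_0}{n/v + 1/v_0},$$

the familiar precision-weighted (linear Bayes) combination of the sample mean and the prior mean.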
6 Other Topics
Now follow a few remarks (possibly only tangential) about the likelihoods: empirical, partial, profile, quasi and the like. Basically, when the likelihood function is precisely known, with no nuisance parameters, the likelihood ratio test is 'optimal' in the conventional sense of the term; also, the SF satisfies the EF criterion of 'optimality'. The various likelihoods just mentioned (empirical, partial, quasi) try to 'approximate' the underlying (true, precise) likelihood in situations of nuisance parameters and/or semi-parametric models. In similar situations the EF theory tries to 'approximate' the (true) underlying SF. However, unlike the former, the latter 'approximation' can be assessed with a plausible finite sample criterion. Suppose $g(x,\theta)$ is a real function of the sample $x$ and the parameter of interest $\theta$ such that the expectation $\mathcal{E}(g) = 0$ for all possible underlying distributions $F$, i.e. for $F \in \mathcal{F}$. Let further SF be a score function corresponding to $F$ in $\mathcal{F}$. Then the finite sample criterion for assessing the approximation $g$ of SF is given by $\mathcal{E}(g - \mathrm{SF})^2$, for all $F \in \mathcal{F}$. This criterion, as said before, leads to the 'optimality' criterion (2.1) of the EF theory. As I have previously shown, optimal or approximately optimal EFs are found in many practical problems, and in fact by now they are in common use. While optimal EFs and approximations thereof can provide a handy instrument for constructing confidence intervals and related tests, cf. Rao's test (Rao, 1947; Basawa, 1991), for some other problems some kind of 'approximate likelihood' would be more handy. I think, to be safer, the construction of such approximate likelihoods should be tied to the optimal EFs whenever possible. It is good to note an already strong trend in that direction (Qin and Lawless, 1994).

An often asked question (cf. Liang and Zeger, 1995) is: how does the EF optimality relate to the properties of the corresponding estimate? How good is the estimate? Usually the answer is given in terms of the 'error' of the estimate. Now this 'error' is a somewhat involved concept. Certainly, error is not just the square root of an arbitrary (unbiased or nearly so) estimate of variance. However, for a parametric model the concept is clear. The error is derived from the conditional (or the natural estimate of the) variance of the SF. Thus the error is the inverse of the square root of the observed Fisher information (Efron and Hinkley, 1978). This methodology is formalized and extended by the EF theory. Consider the confidence intervals $\hat\theta \pm \mathrm{const}\,(\mathrm{error})$, where the estimate $\hat\theta$ is obtained from the unbiased estimating equation $g(\theta) = 0$. A more direct way of obtaining confidence intervals is by inverting the distribution of the standardized version (cf. Godambe, 1991, eq. 40) of the EF $g$ around $\theta$. These intervals, compared to the former ones, are easier to compute. Also, if $g$ is the optimal EF, the corresponding intervals are the shortest compared to those of any other unbiased EF (Godambe and Heyde, 1987).
The standardizing factor of the EF $g$ leads directly to the computation of the 'error' of the estimate $\hat\theta$ (Godambe, 1995). For important previous review articles on the subject we refer to Heyde (1989) and Godambe and Kale (1991). The present review highlights some more recent developments and presents older results with different emphasis and interpretations. A further reference along this line is Desmond (1997).
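To make the interval construction concrete, here is a minimal sketch (my own, with a Poisson-type mean-variance relation chosen only for illustration) that inverts the standardized quasi-SF of (4.2):

```python
import numpy as np

rng = np.random.default_rng(3)
theta0, n = 1.5, 200
z = rng.uniform(0.5, 2.0, n)
x = rng.poisson(theta0*z)               # mu_i = theta*z_i, v_i = theta*z_i

def g_standardized(theta):
    # quasi-SF (4.2) divided by the square root of its variance
    g = ((x - theta*z) * z / (theta*z)).sum()
    var_g = (z*z / (theta*z)).sum()     # E(g^2) = sum (dmu_i/dtheta)^2 / v_i
    return g / np.sqrt(var_g)

grid = np.linspace(1.0, 2.2, 1201)
inside = [t for t in grid if abs(g_standardized(t)) <= 1.96]
print(min(inside), max(inside))         # approximate 95% interval for theta
```

The interval is read off directly from the EF, without first computing $\hat\theta$ and an error estimate.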
References

Barnard, G.A. (1963). Some logical aspects of the fiducial argument. J. R. Statist. Soc. B 25, 111-114.
Barndorff-Nielsen, O.E. (1973). On M-ancillarity. Biometrika 60, 447-455.
Bartlett, M.S. (1936). The information available in small samples. Proc. Camb. Phil. Soc. 34, 33-40.
Basawa, I.V. (1991). Generalized score tests for composite hypotheses. In Estimating Functions (ed. V.P. Godambe), Oxford University Press, Oxford, 121-131.
Bhapkar, V.P. (1991). Sufficiency, ancillarity and information in estimating functions. In Estimating Functions (ed. V.P. Godambe), Oxford University Press, Oxford, 240-254.
Cox, D.R. (1958). Some problems connected with statistical inference. Ann. Math. Statist. 29, 357-372.
Desmond, A.F. (1997). Optimal estimating functions, quasi-likelihood and statistical modelling (with discussion). J. Stat. Plan. Inf. 60, 77-121.
Durbin, J. (1960). Estimation of parameters in time series regression models. J. R. Statist. Soc. B 22, 139-153.
Ferreira, P.E. (1982). Multiparametric estimating equations. Ann. Inst. Statist. Math. 34, 423-431.
Fisher, R.A. (1925). Theory of statistical estimation. Proc. Camb. Phil. Soc. 22, 700-706.
Ghosh, M. (1990). On a Bayesian analog of the theory of estimating functions. C.G. Khatri Memorial Volume, Gujarat Statistical Review 17A, 47-52.
Godambe, V.P. (1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist. 31, 1208-1212.
Godambe, V.P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika 63, 277-284.
Godambe, V.P. (1985). The foundations of finite sample estimation in stochastic processes. Biometrika 72, 419-428.
Godambe, V.P. (1991). Orthogonality of estimating functions and nuisance parameters. Biometrika 78, 143-151.
Godambe, V.P. (1994). Linear Bayes and optimal estimation. Tech. Report STAT-94-11, University of Waterloo.
Godambe, V.P. (1995). Discussion of the paper 'Inference based on estimating functions in the presence of nuisance parameters' by Liang, K.Y. and Zeger, S.L. Statistical Science 10, 173-174.
Godambe, V.P. and Heyde, C.C. (1987). Quasi-likelihood and optimal estimation. Int. Stat. Rev. 55, 231-244.
Godambe, V.P. and Kale, B.K. (1991). Estimating functions: an overview. In Estimating Functions (ed. V.P. Godambe), Oxford University Press, Oxford, 1-20.
Godambe, V.P. and Thompson, M.E. (1974). Estimating equations in the presence of nuisance parameters. Ann. Statist. 2, 568-571.
Godambe, V.P. and Thompson, M.E. (1989). An extension of quasi-likelihood estimation (with discussion). J. Stat. Plan. Inf. 22, 137-172.
Heyde, C.C. (1989). Quasi-likelihood and optimality of estimating functions: some current unifying themes. Bull. Int. Stat. Inst., Book 1, 19-29.
Kalbfleisch, J.D. and Sprott, D.A. (1970). Applications of likelihood methods to models involving large numbers of parameters (with discussion). J. R. Statist. Soc. B 32, 175-208.
Kalbfleisch, J.D., Lawless, J.F. and Vollmer, W.M. (1983). Estimation in Markov models from aggregate data. Biometrics 39, 907-919.
Klimko, L.A. and Nelson, P.I. (1978). On conditional least squares estimation for stochastic processes. Ann. Statist. 6, 629-642.
Legendre, A.M. (1805). Nouvelles methodes pour la determination des orbites des cometes. Paris: Courcier.
Liang, K.Y. and Zeger, S.L. (1995). Inference based on estimating functions in the presence of nuisance parameters (with discussion). Statistical Science 10, 158-195.
Lindsay, B. (1982). Conditional score functions: some optimality results. Biometrika 69, 503-512.
Lindsay, B. and Waterman, R.P. (1991). Extending Godambe's method in nuisance parameter problems. Proceedings of a Symposium in Honour of Prof. V.P. Godambe, University of Waterloo, 1-43.
Lindsay, B.G. and Li, B. (1995). Discussion of the paper 'Inference based on estimating functions in the presence of nuisance parameters' by Liang, K.Y. and Zeger, S.L. Statistical Science 10, 175-177.
Lloyd, C.J. (1987). Optimality of marginal likelihood estimating equations. Comm. Stat. Theory and Meth. 16, 1733-1741.
McCullagh, P. and Nelder, J.A. (1983, 1989). Generalized Linear Models (1st and 2nd editions). Chapman and Hall, London.
Naik-Nimbalkar, U.V. and Rajarshi, M.B. (1995). Filtering and smoothing via estimating functions. J. Amer. Statist. Assoc. 90, 301-306.
Neyman, J. and Scott, E.L. (1948). Consistent estimates based on partially consistent observations. Econometrica 16, 1-32.
Qin, J. and Lawless, J.F. (1994). Empirical likelihood and general estimating equations. Annals of Statistics 22, 300-325.
Rao, C.R. (1947). Large sample tests for statistical hypotheses concerning several parameters with applications to problems of estimation. Proc. Camb. Phil. Soc. 44, 50-57.
Small, C. and McLeish, D.L. (1988). The Theory and Applications of Statistical Inference Functions. Lecture Notes in Statistics 44, Springer-Verlag, Heidelberg, New York, London.
Small, C. and McLeish, D.L. (1994). Hilbert Space Methods in Probability and Statistical Inference. John Wiley and Sons, New York.
PARTIAL LIKELIHOOD AND ESTIMATING EQUATIONS

Priscilla E. Greenwood, University of British Columbia
Wolfgang Wefelmeyer, University of Siegen

Abstract

Consider a regression model for discrete-time stochastic processes, with a (partially specified) model for the conditional distribution of the response given the covariate and the past observations. Suppose we also have some knowledge about how the parameter of interest affects the conditional distribution of the covariate given the past. We assume that these two model assumptions give rise to two martingale estimating functions, and determine an optimal combination. We indicate for the case of jump processes how our result carries over to continuous time. The resulting estimators are efficient.
1 Introduction

Suppose we know something about how the parameter of interest in a regression model appears both in the conditional distribution of the response given the covariate, and in the distribution of the covariate. How can we exploit this knowledge?

Let us illustrate our approach in the case of independent and identically distributed observations $(X_i, Y_i)$, with $X_i$ the covariate and $Y_i$ the response. In a regression model one usually specifies the conditional distribution of $Y$ given $X$, either fully, by a parametric model, or partially. An example of a partial specification is a model for the conditional mean of $Y$ given $X$, say $E(Y|X) = \vartheta X$. More generally, we specify a function $g_\vartheta(X,Y)$ such that $E(g_\vartheta(X,Y)|X) = 0$. In the example, $g_\vartheta(X,Y) = Y - \vartheta X$. We assume a similar partial specification of the distribution of the covariate $X$, say $E g_{*\vartheta}(X) = 0$. The two functions $g_\vartheta$ and $g_{*\vartheta}$ give rise to two estimating equations

$$\sum_{i=1}^n g_\vartheta(X_i, Y_i) = 0, \qquad \sum_{i=1}^n g_{*\vartheta}(X_i) = 0.$$

(Supported by NSERC, Canada.)
By the usual Taylor expansion argument, their solutions $\hat\vartheta = \hat\vartheta_n$ are asymptotically normal with variances $I^{-1}$ and $I_*^{-1}$, respectively, where

$$I = \frac{(E g'_\vartheta(X,Y))^2}{E(g_\vartheta(X,Y)^2)}, \qquad I_* = \frac{(E g'_{*\vartheta}(X))^2}{E(g_{*\vartheta}(X)^2)}.$$

Since $g_\vartheta(X_i,Y_i)$ and $g_{*\vartheta}(X_i)$ are uncorrelated, the combined estimating equation

$$\sum_{i=1}^n \big(w\, g_\vartheta(X_i,Y_i) + w_*\, g_{*\vartheta}(X_i)\big) = 0$$

leads to an estimator with asymptotic variance

$$\frac{w^2\, E(g_\vartheta(X,Y)^2) + w_*^2\, E(g_{*\vartheta}(X)^2)}{\big(w\, E g'_\vartheta(X,Y) + w_*\, E g'_{*\vartheta}(X)\big)^2}.$$

Applying the Schwarz inequality to the denominator, one sees that this variance is minimized for

$$w^{\mathrm{opt}} = \frac{E g'_\vartheta(X,Y)}{E(g_\vartheta(X,Y)^2)}, \qquad w_*^{\mathrm{opt}} = \frac{E g'_{*\vartheta}(X)}{E(g_{*\vartheta}(X)^2)}.$$

The minimal asymptotic variance is $(I + I_*)^{-1}$. The weights $w^{\mathrm{opt}}$ and $w_*^{\mathrm{opt}}$ depend on $\vartheta$ and, in general, also on other features of the distribution of $(X,Y)$. In the estimating function, they must be replaced by estimators $\hat w_n^{\mathrm{opt}}$ and $\hat w_{*n}^{\mathrm{opt}}$, say by using empirical estimators for the distributions involved. This does not change the asymptotic variance $(I + I_*)^{-1}$.

Can we do better than using the combined estimating equation? Note that we can multiply $g_\vartheta(X,Y)$ by a function $w(X)$ of $X$ and still have conditional expectation zero, $E(w(X)g_\vartheta(X,Y)|X) = 0$. This leads to new estimating equations

$$\sum_{i=1}^n \big(w(X_i)\, g_\vartheta(X_i,Y_i) + w_*\, g_{*\vartheta}(X_i)\big) = 0 \tag{1.1}$$

with asymptotic variance $(I_w + I_*)^{-1}$, where

$$I_w = \frac{\big(E\big(w(X)\,E(g'_\vartheta(X,Y)|X)\big)\big)^2}{E\big(w(X)^2\,E(g_\vartheta(X,Y)^2|X)\big)}.$$

Applying again the Schwarz inequality, one sees that $I_w$ is maximized by

$$w(X) = \frac{E(g'_\vartheta(X,Y)|X)}{E(g_\vartheta(X,Y)^2|X)}.$$

The weight again depends on $\vartheta$ and, in general, also on other features of the distribution of $(X,Y)$, and must be estimated, say by using appropriate
nonparametric estimators for the conditional expectations involved. With $\hat w_n^{\mathrm{opt}}(X)$ denoting such an estimator, we arrive at the estimating equation

$$\sum_{i=1}^n \big(\hat w_n^{\mathrm{opt}}(X_i)\, g_\vartheta(X_i,Y_i) + \hat w_{*n}^{\mathrm{opt}}\, g_{*\vartheta}(X_i)\big) = 0. \tag{1.2}$$

The asymptotic variance of the estimator corresponding to this equation is $(\bar I + I_*)^{-1}$ with

$$\bar I = E\,\frac{\big(E(g'_\vartheta(X,Y)|X)\big)^2}{E(g_\vartheta(X,Y)^2|X)}.$$

By the Schwarz inequality (2.10) below, $\bar I$ is strictly larger than $I$ unless both conditional expectations do not depend on $X$. For the example given above, $g_\vartheta(X,Y) = Y - \vartheta X$, we have $E(g'_\vartheta(X,Y)|X) = -X$ and $E(g_\vartheta(X,Y)^2|X) = E((Y - \vartheta X)^2|X)$, the conditional variance of $Y$ given $X$.

The estimating equation (1.2) is not only optimal among estimating equations (1.1) but even efficient among all (regular) estimators, as long as we do not impose additional restrictions on the distribution of $(X,Y)$ which involve $\vartheta$. Let us give a sketch of the argument, referring to Bickel, Klaassen, Ritov and Wellner (1993) for an account of the concepts involved. The model is described by all distributions $p(dx,dy) = p_*(dx)\,p(x,dy)$ of $(X,Y)$ such that

$$\int p(x,dy)\,g_\vartheta(x,y) = E(g_\vartheta(x,Y)|x) = 0 \quad\text{for all } x, \tag{1.3}$$

$$\int p_*(dx)\,g_{*\vartheta}(x) = E g_{*\vartheta} = 0 \tag{1.4}$$
I p{x, dy) (l +n~
ι/2
J
gϋ+n-i/2u(x, y) = 0 for all x,
uh(x,y))
p*(dx) (l + n-^uh^x))
g^ΰ+n-i,2u(x)
= 0.
Then h and /ι* must fulfill
j p{x,dy)(h{x,y)g {x,y)+g' {x,y)) ΰ
ϋ
j p*{dx){K(x)g*ϋ(x)+g'^{x))
= 0 for all x,
= 0.
The perturbed p is approximately p{dx,dy) ( l + n~1/2u (h(x,y) + M
(1.5)
(1.6)
This means that the tangent space in the sense of Bickel et al. (1993, p. 50, Definition 2) consists of the functions $u(h + h_*)$ with $h$ and $h_*$ fulfilling (1.5) and (1.6). We view the parameter $\vartheta$ as a function of $p$ and determine its canonical gradient in the sense of Bickel et al. (1993, p. 58). This is a function $\dot v$ in the tangent space such that

$$n^{1/2}\big(\vartheta + n^{-1/2}u - \vartheta\big) = u = u\,E\,\dot v\,(h + h_*) \quad\text{for all } h, h_* \text{ fulfilling (1.5), (1.6)}.$$

According to Bickel et al. (1993, p. 63, Theorem 2B, and p. 65, Theorem 1A), an estimator $\hat\vartheta_n$ is regular and efficient if and only if

$$n^{1/2}(\hat\vartheta_n - \vartheta) = n^{-1/2}\sum_{i=1}^n \dot v(X_i, Y_i) + o_P(1). \tag{1.7}$$

In particular, a lower bound for the asymptotic variance of regular estimators of $\vartheta$ is $E\,\dot v^2$.

Since the tangent space is generated by the affine space of functions $h + h_*$ with $h$ and $h_*$ fulfilling (1.5) and (1.6), we can write the canonical gradient as $\dot v = (E(s + s_*)^2)^{-1}(s + s_*)$, where $s + s_*$ is the optimal score function, minimizing $E(h + h_*)^2$ over all $h$ and $h_*$ fulfilling (1.5) and (1.6). In particular, the lower bound for the asymptotic variance of regular estimators can be written $1/E(s + s_*)^2$. The function $s + s_*$ is characterized by

$$E(s + s_*)(h + h_*) = E(s + s_*)^2 \quad\text{for all } h, h_* \text{ fulfilling (1.5), (1.6)}.$$

Since $h$ and $h_*$ are orthogonal, this is equivalent to

$$E s h = E s^2 \quad\text{for all } h \text{ fulfilling (1.5)},$$

$$E s_* h_* = E s_*^2 \quad\text{for all } h_* \text{ fulfilling (1.6)}.$$

One easily checks that the solution is

$$s(x,y) = -\frac{E(g'_\vartheta(X,Y)|x)}{E(g_\vartheta(X,Y)^2|x)}\, g_\vartheta(x,y), \qquad s_*(x) = -\frac{E g'_{*\vartheta}(X)}{E(g_{*\vartheta}(X)^2)}\, g_{*\vartheta}(x).$$

By the usual Taylor series argument, the solution $\hat\vartheta = \hat\vartheta_n$ of the optimal estimating equation (1.2) is seen to fulfill

$$n^{1/2}(\hat\vartheta_n - \vartheta) = \big(E(s + s_*)^2\big)^{-1}\, n^{-1/2}\sum_{i=1}^n (s + s_*)(X_i, Y_i) + o_P(1).$$

By the characterization (1.7), this means that this estimator is efficient.

In Sections 2 and 3 we show how the calculation of the optimal estimating function carries over to ergodic discrete-time stochastic processes and jump processes, respectively. Efficiency also carries over, but we will not give the details. All results extend immediately to vectors $\vartheta$ and vector-valued functions $g_\vartheta$ and $g_{*\vartheta}$. We do not give precise regularity conditions for our results.
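A quick simulation sketch of the gain from the weight $w^{\mathrm{opt}}$ derived in this section (my own illustration; the uniform covariate and the particular conditional variance are arbitrary assumptions, and the covariate part of the estimating equation is omitted for simplicity):

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 2.0, 100, 20000
X = rng.uniform(0.5, 2.0, (reps, n))
sig2 = 0.1 + X**4                         # conditional variance of Y given X
Y = theta*X + np.sqrt(sig2)*rng.standard_normal((reps, n))

# constant weight: solve sum (Y_i - theta X_i) = 0
th_const = Y.sum(axis=1) / X.sum(axis=1)
# optimal weight w(X) = E(g'|X)/E(g^2|X) = -X/sig2(X):
# solve sum (Y_i - theta X_i) X_i / sig2_i = 0
th_opt = (Y*X/sig2).sum(axis=1) / (X*X/sig2).sum(axis=1)

print(th_const.var(), th_opt.var())       # the weighted version has smaller variance
```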
2 Discrete-time stochastic processes
Suppose we observe a stochastic process $(X_i, Y_i)$ at times $i = 1,\ldots,n$. The law of the process is determined by the conditional distributions $p_i(dx,dy)$ of $(X_i,Y_i)$ given the past observations. Here and in the following, we suppress the dependence of $p_i$ and similar objects on the past, $(X_1,Y_1),\ldots,(X_{i-1},Y_{i-1})$.

As in the i.i.d. case considered in Section 1, we describe a regression model by (partial) specifications of (1) the conditional distribution of the response given the present value of the covariate and now also the past observations, and of (2) the conditional distribution of the covariate, now also given the past. We factor $p_i$ into marginal and conditional,

$$p_i(dx,dy) = p_{*i}(dx)\,p_i(x,dy), \tag{2.1}$$

and specify two functions $g_{i\vartheta}(x,y)$ and $g_{*i\vartheta}(x)$, possibly depending on the past, such that

$$E_i^x g_{i\vartheta} = \int p_i(x,dy)\,g_{i\vartheta}(x,y) = 0 \quad\text{for all } x,$$

$$E_{*i}\, g_{*i\vartheta} = \int p_{*i}(dx)\,g_{*i\vartheta}(x) = 0.$$

They give rise to estimating equations

$$\sum_{i=1}^n g_{i\vartheta}(X_i, Y_i) = 0, \qquad \sum_{i=1}^n g_{*i\vartheta}(X_i) = 0.$$
How can we combine them in an optimal way? Our result holds for geometrically ergodic processes and under appropriate smoothness and moment conditions which can be seen from the sketch of the proof.

Result 1. From estimating equations of the form

$$\sum_{i=1}^n \big(w_i(X_i)\,g_{i\vartheta}(X_i,Y_i) + w_{*i}\,g_{*i\vartheta}(X_i)\big) = 0, \tag{2.2}$$

an estimator with minimal asymptotic variance is obtained using weights which are consistent estimators of

$$w_i(X_i) = E_i^{X_i} g'_{i\vartheta} \big/ E_i^{X_i} g_{i\vartheta}^2, \tag{2.3}$$

$$w_{*i} = E_{*i}\, g'_{*i\vartheta} \big/ E_{*i}\, g_{*i\vartheta}^2. \tag{2.4}$$

The estimator is asymptotically normal. Its asymptotic variance is the limit of

$$n\left(\sum_{i=1}^n \left( E_{*i}\,\frac{(E_i^{X_i} g'_{i\vartheta})^2}{E_i^{X_i} g_{i\vartheta}^2} + \frac{(E_{*i}\, g'_{*i\vartheta})^2}{E_{*i}\, g_{*i\vartheta}^2} \right)\right)^{-1}. \tag{2.5}$$
Sketch of proof. To simplify the notation, we introduce $w = (w, w_*)$ and $g_\vartheta = (g_\vartheta, g_{*\vartheta})$, and write the estimating equation (2.2) as

$$\sum_{i=1}^n w_i g_{i\vartheta} = \sum_{i=1}^n \big(w_i\,g_{i\vartheta} + w_{*i}\,g_{*i\vartheta}\big) = 0.$$

Let $\hat\vartheta$ be a solution. Under appropriate differentiability conditions, a Taylor expansion gives

$$0 = \sum w_i g_{i\hat\vartheta} \approx \sum w_i g_{i\vartheta} + (\hat\vartheta - \vartheta)\sum w_i g'_{i\vartheta}. \tag{2.6}$$

Then

$$n^{1/2}(\hat\vartheta - \vartheta) \approx -\,n^{-1/2}\sum w_i g_{i\vartheta}\Big/\frac{1}{n}\sum w_i g'_{i\vartheta}. \tag{2.7}$$

Conditionally on the past, the martingale increments $w_i(X_i)\,g_{i\vartheta}(X_i,Y_i)$ and $w_{*i}\,g_{*i\vartheta}(X_i)$ are orthogonal:

$$E_i\big(w_i(X_i)\,g_{i\vartheta}(X_i,Y_i)\cdot w_{*i}\,g_{*i\vartheta}(X_i)\big) = w_{*i}\int p_{*i}(dx)\,w_i(x)\,g_{*i\vartheta}(x)\int p_i(x,dy)\,g_{i\vartheta}(x,y) = 0. \tag{2.8}$$

Introduce an inner product

$$(v,w) = \sum_{i=1}^n \big(E_{*i}\,v_i w_i + v_{*i} w_{*i}\big)$$

with corresponding norm $\|w\|^2 = (w,w)$. Interpret products $vw$ of vectors componentwise. Consider first the numerator in (2.7). With (2.8), the predictable quadratic variation of $\sum w_i g_{i\vartheta}$ is $(w^2, v) = \|wv^{1/2}\|^2$, where

$$v_i(x) = E_i^x g_{i\vartheta}^2, \qquad v_{*i} = E_{*i}\, g_{*i\vartheta}^2.$$

Consider now the denominator in (2.7). The compensator of $\sum w_i g'_{i\vartheta}$ is $(w,m)$, where

$$m_i(x) = E_i^x g'_{i\vartheta}, \qquad m_{*i} = E_{*i}\, g'_{*i\vartheta}.$$

Since $\sum w_i g'_{i\vartheta} - (w,m)$ is a martingale, $\frac{1}{n}\big(\sum w_i g'_{i\vartheta} - (w,m)\big)$ is asymptotically negligible, and we may replace $\frac{1}{n}\sum w_i g'_{i\vartheta}$ by $\frac{1}{n}(w,m)$. If the process is ergodic, $\frac{1}{n}(w,m)$ is asymptotically constant. Hence the predictable quadratic variation of $n^{1/2}(\hat\vartheta - \vartheta)$ is approximately $n\|wv^{1/2}\|^2/(w,m)^2$. By the Schwarz inequality,

$$(w,m)^2 = \big(wv^{1/2},\, mv^{-1/2}\big)^2 \le \|wv^{1/2}\|^2\,\|mv^{-1/2}\|^2. \tag{2.9}$$
PARTIAL LIKELIHOOD In other words, w,m)z
Urn?;"1/2!!2
Hence \\wvι/2\\2 / (w,m)2 is minimized by w = mv~1, and the minimum is -1
By an appropriate martingale central limit theorem, nι/2(ϋ — #) is asymptotically normal with variance equal to the limit of nUmv" 1 / 2 !!" 2 , and the assertion follows. D Efficiency of the estimator based on the optimal estimating equation can be proved by an approach similar to that outlined in Section 1 for the i.i.d. case. If we use predictors to estimate Wi(x) and w*i, i.e. estimators involving only the past observations (Xi, YΊ),..., (Xi_i, Yi_i), then the optimal estimating function is a martingale. We may allow weights wι, w*i to depend on ϋ. Then the derivative of ΣwiQiΰ in the expansion (2.6) has a second term Σwi9iΰ I* ι s asymptotically negligible since the g^ are martingale increments. Remark 1. Usually one takes predictable weights to combine two martingale estimating functions; e.g. Heyde (1987). For the estimating functions Σgi#{Xi,Yi) and Σί7*ztf(Xΐ) this would mean using weights Wi rather than i). Then the best weights would be
and w*i as above, and the minimal asymptotic variance would be the limit of an expression of the form (2.5) with E*» (jβ?%ΰ)2/Ίϊfig'ϊ^) replaced by 2 2 the simpler (Eig'ild) /Eig ΰ. The resulting variance is, in general, larger than our minimum variance (2.5) because Eg 2
^ l
Exg2
This inequality follows from the Schwarz inequality:
(E/)
2
=
(2,0,
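To see Result 1 at work, here is a minimal simulation sketch (mine, not the authors'; the autoregressive covariate dynamics and the variance function are illustrative assumptions) in which both the covariate dynamics and the response carry the parameter, so combining the two weighted estimating functions pays off:

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 0.5, 400, 2000
est = np.empty((reps, 3))

for r in range(reps):
    X = np.empty(n + 1); X[0] = 0.0
    for i in range(n):                       # covariate: X_i = theta*X_{i-1} + noise
        X[i+1] = theta*X[i] + rng.standard_normal()
    Xp, Xc = X[:-1], X[1:]
    s2 = 0.5 + Xc**2                         # Var(Y_i | X_i, past) = s2(X_i)
    Y = theta*Xc + np.sqrt(s2)*rng.standard_normal(n)

    # weighted EFs with weights (2.3), (2.4); all three equations are linear in theta
    num = (Xc*Y/s2).sum() + (Xp*Xc).sum()
    den = (Xc*Xc/s2).sum() + (Xp*Xp).sum()
    est[r] = [((Xc*Y/s2).sum())/((Xc*Xc/s2).sum()),   # response EF only
              (Xp*Xc).sum()/(Xp*Xp).sum(),            # covariate EF only
              num/den]                                # optimal combination (2.2)

print(est.var(axis=0))   # the combined estimator has the smallest variance
```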
Remark 2. The weight $w_i$ depends only on $p_i$ and $g_{i\vartheta}$, and $w_{*i}$ depends only on $p_{*i}$ and $g_{*i\vartheta}$. This is due to the orthogonality (2.8). Indeed, the weights (2.3) are optimal for estimating functions of the form $\sum w_i(X_i)\,g_{i\vartheta}(X_i,Y_i)$, and the weights (2.4) are optimal for estimating functions of the form $\sum w_{*i}\,g_{*i\vartheta}(X_i)$.

Remark 3. Suppose we have a parametric model $p_{i\vartheta}(X_i,dy)$ for the conditional distribution of the response $Y_i$ given the present covariate $X_i$ and the past observations. Differentiating under the integral, we obtain

$$0 = \big(E_i^x g_{i\vartheta}\big)' = E_i^x g'_{i\vartheta} + E_i^x \ell_{i\vartheta}\, g_{i\vartheta},$$

where

$$\ell_{i\vartheta}(x,y) = \frac{d}{d\vartheta}\log p_{i\vartheta}(x,y).$$

Hence, by the Schwarz inequality, $(E_i^x g'_{i\vartheta})^2 / E_i^x g_{i\vartheta}^2$ is maximal for $g_{i\vartheta} = \ell_{i\vartheta}$, and the optimal weight (2.3) for $g_{i\vartheta} = \ell_{i\vartheta}$ is $w_i(X_i) = -1$. In particular, the estimating function $\sum -\ell_{i\vartheta}(X_i,Y_i)$ is optimal among estimating functions $\sum w_i(X_i)\,g_{i\vartheta}(X_i,Y_i)$. The optimal estimating function is the partial score function, i.e., the derivative at $\tau = \vartheta$ of the partial likelihood ratio of Cox (1975),

$$\prod_{i=1}^n \frac{dp_{i\tau}}{dp_{i\vartheta}}(X_i,Y_i).$$

Hence the optimal estimating function gives the maximum partial likelihood estimator. If the observations $(X_i,Y_i)$ are independent, the partial likelihood ratio is the conditional likelihood ratio for $Y_1,\ldots,Y_n$ given the covariates $X_1,\ldots,X_n$. Similarly, if there is a parametric model $p_{*i\vartheta}$ for the distribution of the covariate $X_i$ given the past, the optimal $g_{*i\vartheta}$ is

$$g_{*i\vartheta}(x) = \frac{d}{d\vartheta}\log p_{*i\vartheta}(x) =: \ell_{*i\vartheta}(x),$$

and the optimal weight (2.4) is $-1$. Moreover, if there is a fully specified parametric model $p_{i\vartheta}(dx,dy) = p_{*i\vartheta}(dx)\,p_{i\vartheta}(x,dy)$, then the likelihood ratio can be written

$$\prod_{i=1}^n \frac{dp_{*i\tau}}{dp_{*i\vartheta}}(X_i)\,\frac{dp_{i\tau}}{dp_{i\vartheta}}(X_i,Y_i),$$

and the optimal estimating function is the score function $\sum \tilde\ell_{i\vartheta}(X_i,Y_i)$ with

$$\tilde\ell_{i\vartheta}(x,y) := \frac{d}{d\vartheta}\log p_{i\vartheta}(dx,dy) = \ell_{i\vartheta}(x,y) + \ell_{*i\vartheta}(x).$$

Hence the optimal estimator is the maximum likelihood estimator, and its asymptotic variance is the limit of

$$n\left(\sum_{i=1}^n E_i\,\tilde\ell_{i\vartheta}^2\right)^{-1}.$$

Remark 4. For discrete-time processes, it is common to model the conditional distribution of the response given the past and the present value of the covariate. In the continuous-time setting of Section 3 one usually models the conditional distribution of the response given only the past. This is just a convention: we may consider $X_{i-1}$ rather than $X_i$ the 'present' covariate of the response.
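For completeness, the Schwarz step used in Remark 3 can be spelled out in one line (my own expansion of the argument). Since $E_i^x g'_{i\vartheta} = -E_i^x \ell_{i\vartheta}\, g_{i\vartheta}$,

$$\frac{(E_i^x g'_{i\vartheta})^2}{E_i^x g_{i\vartheta}^2} = \frac{(E_i^x \ell_{i\vartheta}\, g_{i\vartheta})^2}{E_i^x g_{i\vartheta}^2} \le E_i^x \ell_{i\vartheta}^2,$$

with equality when $g_{i\vartheta}$ is proportional to $\ell_{i\vartheta}$; the bound is attained at $g_{i\vartheta} = \ell_{i\vartheta}$.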
3 Jump processes

Suppose we observe a jump process $(X,Y) = (X_s, Y_s)_{s\ge 0}$ on a finite time interval $[0,t]$. The corresponding multivariate point process is given by the jump measure

$$\mu(ds,dx,dy) = \sum_{s:\,(\Delta X_s,\,\Delta Y_s) \ne 0} \varepsilon_{(s,\,\Delta X_s,\,\Delta Y_s)}(ds,dx,dy).$$

The law of the process is determined by the compensator of the jump measure. Assume, for simplicity, that the compensator has the form $K_s(dx,dy)\,ds$, so that there are no time points at which the process has a positive probability of jumping. We can write $K_s(dx,dy) = a_s\,p_s(dx,dy)$ with $p_s$ a probability measure, the jump size distribution at time $s$ given the past, and $a_s$ the jump intensity. For the theory of continuous-time processes and limit theorems we refer to Jacod and Shiryaev (1987). The multivariate point process corresponding to the response process $Y$ is

$$\mu^Y(ds,dy) = \sum_{s:\,\Delta Y_s \ne 0} \varepsilon_{(s,\,\Delta Y_s)}(ds,dy).$$
A regression model is given by a (partial) specification of the compensator of $\mu^Y$, say $\bar K_s(dy)\,ds$. As noted in Remark 4, this is not exactly analogous to the discrete-time case. We specify a predictable function $\bar g_{s\vartheta}(y)$ such that

$$\bar K_s \bar g_{s\vartheta} = \int \bar K_s(dy)\,\bar g_{s\vartheta}(y) = 0$$

and obtain a martingale estimating equation

$$\sum_{s \le t:\, \Delta Y_s \ne 0} \bar g_{s\vartheta}(\Delta Y_s) = 0.$$

We want to assume a similar partial specification of the distribution of the covariate process $X$. It will be based on a factorization of $K_s(dx,dy)$ analogous to the factorization (2.1) of the distribution $p_i(dx,dy)$. We must take into account the possibility that $X$ jumps while $Y$ does not. Following Arjas and Haara (1984) and Greenwood and Wefelmeyer (1996), we write

$$K_s(dx,dy) = K_{-0s}(dx,dy) + K_{*s}(dx)\,\varepsilon_0(dy),$$

where $K_{-0s}(dx,dy)$ does not charge the subspace described by $y = 0$. Then $K_{*s}$ governs those jumps of $X$ that do not occur simultaneously with jumps of $Y$. As in (2.1), but with the roles of $X$ and $Y$ interchanged, we factor

$$K_{-0s}(dx,dy) = \bar K_s(dy)\,K_{-*s}(dx,y). \tag{3.1}$$

We note that $K_{-*s}(dx,y)$ is a probability measure, the conditional distribution of the jump size of the covariate given a jump of size $y$ of the response, and given the past. Additional specifications of the model may now be given by predictable functions $g_{*s\vartheta}(x)$ and $g_{-*s\vartheta}(x,y)$ such that

$$K_{*s}\, g_{*s\vartheta} = \int K_{*s}(dx)\,g_{*s\vartheta}(x) = 0,$$

$$K_{-*s}\, g_{-*s\vartheta} = \int K_{-*s}(dx,y)\,g_{-*s\vartheta}(x,y) = 0 \quad\text{for all } y.$$

They give rise to additional martingale estimating equations

$$\sum_{s \le t:\, \Delta X_s \ne 0,\, \Delta Y_s = 0} g_{*s\vartheta}(\Delta X_s) = 0,$$

$$\sum_{s \le t:\, \Delta X_s \ne 0,\, \Delta Y_s \ne 0} g_{-*s\vartheta}(\Delta X_s, \Delta Y_s) = 0.$$

How can we combine the three estimating functions in an optimal way? Again, our result holds for geometrically ergodic processes under appropriate smoothness and moment conditions which can be seen from the sketch of the proof.

Result 2. From estimating equations of the form

$$\sum_{s \le t} \big(w_s\,\bar g_{s\vartheta}(\Delta Y_s) + w_{*s}\,g_{*s\vartheta}(\Delta X_s) + w_{-*s}(\Delta Y_s)\,g_{-*s\vartheta}(\Delta X_s, \Delta Y_s)\big) = 0, \tag{3.2}$$
an estimator with minimal asymptotic variance is obtained using weights which are consistent estimators of

$$w_s = \bar K_s \bar g'_{s\vartheta} \big/ \bar K_s \bar g_{s\vartheta}^2, \tag{3.3}$$

$$w_{*s} = K_{*s}\, g'_{*s\vartheta} \big/ K_{*s}\, g_{*s\vartheta}^2, \tag{3.4}$$

$$w_{-*s}(y) = K_{-*s}\, g'_{-*s\vartheta} \big/ K_{-*s}\, g_{-*s\vartheta}^2. \tag{3.5}$$

The estimator is asymptotically normal. Its asymptotic variance is the limit of

$$t\left(\int_0^t \frac{(\bar K_s \bar g'_{s\vartheta})^2}{\bar K_s \bar g_{s\vartheta}^2}\,ds + \int_0^t \frac{(K_{*s}\, g'_{*s\vartheta})^2}{K_{*s}\, g_{*s\vartheta}^2}\,ds + \int_0^t\!\!\int \bar K_s(dy)\,\frac{(K_{-*s}\, g'_{-*s\vartheta})^2}{K_{-*s}\, g_{-*s\vartheta}^2}\,ds\right)^{-1}. \tag{3.6}$$
Sketch of proof. To simplify the notation, we introduce $w = (w, w_*, w_{-*})$ and $g_\vartheta = (\bar g_\vartheta, g_{*\vartheta}, g_{-*\vartheta})$, and write the estimating equation (3.2) as

$$\sum_s w_s g_{s\vartheta} = \sum_s \big(w_s\,\bar g_{s\vartheta} + w_{*s}\,g_{*s\vartheta} + w_{-*s}\,g_{-*s\vartheta}\big) = 0.$$

Let $\hat\vartheta$ be a solution. Under appropriate differentiability conditions, a Taylor expansion gives

$$0 = \sum w_s g_{s\hat\vartheta} \approx \sum w_s g_{s\vartheta} + (\hat\vartheta - \vartheta)\sum w_s g'_{s\vartheta}. \tag{3.7}$$

Then

$$t^{1/2}(\hat\vartheta - \vartheta) \approx -\,t^{-1/2}\sum w_s g_{s\vartheta}\Big/\frac{1}{t}\sum w_s g'_{s\vartheta}. \tag{3.8}$$

The martingale $\sum w_{*s}\,g_{*s\vartheta}$ is orthogonal to the martingales $\sum w_s\,\bar g_{s\vartheta}$ and $\sum w_{-*s}\,g_{-*s\vartheta}$ because it lives on time points $s$ with $\Delta Y_s = 0$, while the two other martingales do not jump at these time points. Because $K_{-0s}(dx,dy)$ does not charge $y = 0$, we may and will assume that $g_{-*s\vartheta}(x,0) = 0$. Then the compensator of the product of the remaining two martingales is

$$\int_0^t w_s \int \bar K_s(dy)\,\bar g_{s\vartheta}(y)\,w_{-*s}(y)\int K_{-*s}(dx,y)\,g_{-*s\vartheta}(x,y)\,ds = 0. \tag{3.9}$$

Hence $\sum w_s\,\bar g_{s\vartheta}$ and $\sum w_{-*s}\,g_{-*s\vartheta}$ are also orthogonal. Introduce an inner product

$$(v,w) = \int_0^t v_s w_s\,ds + \int_0^t v_{*s} w_{*s}\,ds + \int_0^t\!\!\int \bar K_s(dy)\,v_{-*s}(y)\,w_{-*s}(y)\,ds$$

with corresponding norm $\|w\|^2 = (w,w)$. Consider first the numerator in (3.8). With (3.9) and the orthogonality of $\sum w_{*s}\,g_{*s\vartheta}$ and $\sum w_s\,\bar g_{s\vartheta}$, the predictable quadratic variation of $\sum w_s g_{s\vartheta}$ is $(w^2,v) = \|wv^{1/2}\|^2$, where

$$v_s = \bar K_s \bar g_{s\vartheta}^2, \qquad v_{*s} = K_{*s}\, g_{*s\vartheta}^2, \qquad v_{-*s}(y) = K_{-*s}\, g_{-*s\vartheta}^2.$$

Consider now the denominator in (3.8). The compensator of $\sum w_s g'_{s\vartheta}$ is $(w,m)$, where

$$m_s = \bar K_s \bar g'_{s\vartheta}, \qquad m_{*s} = K_{*s}\, g'_{*s\vartheta}, \qquad m_{-*s}(y) = K_{-*s}\, g'_{-*s\vartheta}.$$

Since $\sum w_s g'_{s\vartheta} - (w,m)$ is a martingale, $\frac{1}{t}\big(\sum w_s g'_{s\vartheta} - (w,m)\big)$ is asymptotically negligible, and we may replace $\frac{1}{t}\sum w_s g'_{s\vartheta}$ by $\frac{1}{t}(w,m)$. If the process is ergodic, $\frac{1}{t}(w,m)$ is asymptotically constant. Hence the predictable quadratic variation of $t^{1/2}(\hat\vartheta - \vartheta)$ is approximately $t\|wv^{1/2}\|^2/(w,m)^2$. By the inequality (2.9), this is minimized by $w = mv^{-1}$, and the minimum is

$$t\left(\int_0^t \frac{m_s^2}{v_s}\,ds + \int_0^t \frac{m_{*s}^2}{v_{*s}}\,ds + \int_0^t\!\!\int \bar K_s(dy)\,\frac{m_{-*s}(y)^2}{v_{-*s}(y)}\,ds\right)^{-1}.$$
By an appropriate martingale central limit theorem, $t^{1/2}(\hat\vartheta - \vartheta)$ is asymptotically normal with variance equal to the limit of $t\|mv^{-1/2}\|^{-2}$, and the assertion follows. $\Box$

As in Remark 2, the weight (3.3) is optimal for estimating equations

$$\sum w_s\,\bar g_{s\vartheta}(\Delta Y_s) = 0,$$

and the weights (3.4) and (3.5) have analogous optimality properties on their own.

Remark 5. Suppose we have a parametric model $\bar K_{s\vartheta}(dy)$ for the compensator of the jump measure $\mu^Y$ of the response. Write

$$V_{s\vartheta\tau} = \frac{d\bar K_{s\tau}}{d\bar K_{s\vartheta}}, \qquad \ell_{s\vartheta} = \frac{d}{d\tau}\Big|_{\tau=\vartheta} V_{s\vartheta\tau}.$$

When the intensity $\bar K_{s\vartheta}(\mathbb{R})$ of the response depends on $\vartheta$, then $\bar K_{s\vartheta}\ell_{s\vartheta}$ will not be zero in general. This differs from the discrete-time case. Differentiating under the integral, we obtain

$$0 = \big(\bar K_{s\vartheta}\bar g_{s\vartheta}\big)' = \bar K_{s\vartheta}\bar g'_{s\vartheta} + \bar K_{s\vartheta}\ell_{s\vartheta}\bar g_{s\vartheta} = \bar K_{s\vartheta}\bar g'_{s\vartheta} + \bar K_{s\vartheta}\big(\ell_{s\vartheta} - \bar K_{s\vartheta}\ell_{s\vartheta}\big)\bar g_{s\vartheta}.$$

Using the Schwarz inequality, we see that

$$\frac{(\bar K_{s\vartheta}\bar g'_{s\vartheta})^2}{\bar K_{s\vartheta}\bar g_{s\vartheta}^2}$$

is maximal for $\bar g_{s\vartheta} = \ell_{s\vartheta} - \bar K_{s\vartheta}\ell_{s\vartheta}$, and then the optimal weight is $w_s = -1$.
In particular,

$$\sum_{s \le t:\, \Delta Y_s \ne 0} \ell_{s\vartheta}(\Delta Y_s) - \int_0^t \bar K_{s\vartheta}\ell_{s\vartheta}\,ds \tag{3.10}$$

is optimal among estimating functions of the form

$$\sum w_s\,\bar g_{s\vartheta}(\Delta Y_s),$$

and the asymptotic variance of the optimal estimator is the limit of

$$t\left(\int_0^t \bar K_{s\vartheta}\big(\ell_{s\vartheta} - \bar K_{s\vartheta}\ell_{s\vartheta}\big)^2\,ds\right)^{-1}.$$

As in Remark 3, the optimal estimating function turns out to be the partial score function in the following sense. A partial likelihood ratio for jump processes was introduced by Arjas and Haara (1984) as

$$\prod_{s \le t:\, \Delta Y_s \ne 0} V_{s\vartheta\tau}(\Delta Y_s)\,\exp\left(-\int_0^t \bar K_{s\vartheta}\big(V_{s\vartheta\tau} - 1\big)\,ds\right).$$
See also Andersen, Borgan, Gill and Keiding (1993) and, for general semimartingales, Jacod (1987) and (1990). The partial score function is the derivative of the partial likelihood ratio at $\tau = \vartheta$. Using $V_{s\vartheta\vartheta} = 1$ we see that the derivative equals (3.10).

Remark 6. Suppose we have a parametric model $K_{*s\vartheta}(dx)\,ds$ for the compensator of the jump measure of those jumps of the covariate $X$ that do not occur simultaneously with jumps of the response $Y$. Write

$$V_{*s\vartheta\tau} = \frac{dK_{*s\tau}}{dK_{*s\vartheta}}, \qquad \ell_{*s\vartheta} = \frac{d}{d\tau}\Big|_{\tau=\vartheta} V_{*s\vartheta\tau}.$$

As in Remark 5, the best $g_{*s\vartheta}$ is $\ell_{*s\vartheta} - K_{*s\vartheta}\ell_{*s\vartheta}$, and then the optimal weight (3.4) is $w_{*s} = -1$. In particular,

$$\sum_{s \le t:\, \Delta X_s \ne 0,\, \Delta Y_s = 0} \ell_{*s\vartheta}(\Delta X_s) - \int_0^t K_{*s\vartheta}\ell_{*s\vartheta}\,ds \tag{3.11}$$

is optimal among estimating functions of the form

$$\sum w_{*s}\,g_{*s\vartheta}(\Delta X_s).$$
Remark 7. Suppose we have a parametric model $K_{-*s\vartheta}(dx,y)$ for the conditional jump size distribution at time $s$ of $X$, given a jump of size $y$ of $Y$ and the past. Write

$$V_{-*s\vartheta\tau} = \frac{dK_{-*s\tau}}{dK_{-*s\vartheta}}, \qquad \ell_{-*s\vartheta} = \frac{d}{d\tau}\Big|_{\tau=\vartheta} V_{-*s\vartheta\tau}.$$

Since $K_{-*s\vartheta}(dx,y)$ is a probability measure, we have $K_{-*s\vartheta}\ell_{-*s\vartheta} = 0$. As in Remark 5, the best $g_{-*s\vartheta}$ is $\ell_{-*s\vartheta}$, and then the optimal weight (3.5) is $w_{-*s} = -1$. In particular,

$$\sum_{s \le t:\, \Delta X_s \ne 0,\, \Delta Y_s \ne 0} \ell_{-*s\vartheta}(\Delta X_s, \Delta Y_s) \tag{3.12}$$

is optimal among estimating functions of the form

$$\sum w_{-*s}(\Delta Y_s)\,g_{-*s\vartheta}(\Delta X_s, \Delta Y_s).$$
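As a tiny concrete instance of the parametric response model in Remark 5 (my own sketch; the constant-intensity Poisson setup is the simplest possible assumption), take $\bar K_{s\vartheta}(dy) = \vartheta\,\varepsilon_1(dy)$, so all response jumps have size one. Then $V_{s\vartheta\tau} = \tau/\vartheta$, $\ell_{s\vartheta} = 1/\vartheta$, and (3.10) becomes $N_t/\vartheta - t$, giving $\hat\vartheta = N_t/t$:

```python
import numpy as np

rng = np.random.default_rng(6)
theta, t, reps = 2.0, 50.0, 10000

# number of response jumps on [0, t] under constant intensity theta
N = rng.poisson(theta*t, reps)

# root of the partial score (3.10): N_t/theta - t = 0
theta_hat = N / t
print(theta_hat.mean(), theta_hat.var())   # ~ theta and ~ theta/t
```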
Remark 8. Suppose we have a fully parametric model $\bar K_{s\vartheta}$, $K_{*s\vartheta}$, $K_{-*s\vartheta}$. According to Result 2 and Remarks 5 to 7, the best estimating equation sets the sum of the three optimal estimating functions (3.10), (3.11) and (3.12) equal to zero. To show that this gives the maximum likelihood estimator, we recall a representation of Greenwood and Wefelmeyer (1996) of the likelihood ratio,

$$\prod_{s \le t:\, \Delta Y_s \ne 0} V_{s\vartheta\tau}(\Delta Y_s)\,\exp\left(-\int_0^t \bar K_{s\vartheta}\big(V_{s\vartheta\tau} - 1\big)\,ds\right)$$
$$\times \prod_{s \le t:\, \Delta X_s \ne 0,\, \Delta Y_s = 0} V_{*s\vartheta\tau}(\Delta X_s)\,\exp\left(-\int_0^t K_{*s\vartheta}\big(V_{*s\vartheta\tau} - 1\big)\,ds\right)$$
$$\times \prod_{s \le t:\, \Delta X_s \ne 0,\, \Delta Y_s \ne 0} V_{-*s\vartheta\tau}(\Delta X_s, \Delta Y_s). \tag{3.13}$$

For a heuristic derivation in terms of product integrals, see Andersen et al. (1993, p. 107). We have already noted in Remark 5 that the derivative of the first factor, the partial likelihood ratio, equals the partial score function
(3.10). The derivative of the second factor is obtained similarly. Finally, $\frac{d}{d\tau}\big|_{\tau=\vartheta} V_{-*s\vartheta\tau} = \ell_{-*s\vartheta}$ by definition. The representation (3.13) of the likelihood can be used in the partially specified model of Result 2 to prove that the optimal estimating function obtained there is efficient, as long as no additional restrictions involving $\vartheta$ are imposed on the model. The arguments are similar to those outlined in Section 1 for the i.i.d. case. In Greenwood and Wefelmeyer (1996) a representation analogous to (3.13) is given for general semimartingales and can be used to generalize the results obtained here to partially specified semimartingale regression models.

References

Andersen, P.K., Borgan, Ø., Gill, R.D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer-Verlag, New York.
Arjas, E. and Haara, P. (1984). A marked point process approach to censored failure data with complicated covariates. Scand. J. Statist. 11, 193-209.
Bickel, P.J., Klaassen, C.A.J., Ritov, Y. and Wellner, J.A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore.
Cox, D.R. (1975). Partial likelihood. Biometrika 62, 269-276.
Greenwood, P.E. and Wefelmeyer, W. (1996). Cox's factoring of regression model likelihoods for continuous time processes. To appear in Bernoulli.
Heyde, C.C. (1987). On combining quasi-likelihood estimating functions. Stochastic Process. Appl. 25, 281-287.
Jacod, J. (1987). Partial likelihood process and asymptotic normality. Stochastic Process. Appl. 26, 47-71.
Jacod, J. (1990). Sur le processus de vraisemblance partielle. Ann. Inst. Henri Poincaré 26, 299-329.
Jacod, J. and Shiryaev, A.N. (1987). Limit Theorems for Stochastic Processes. Grundlehren der mathematischen Wissenschaften 288, Springer-Verlag, Berlin.
AVOIDING THE LIKELIHOOD

C.C. Heyde
Columbia University and Australian National University

ABSTRACT

For the estimation of a finite dimensional parameter in a stochastic model it has become increasingly clear that it is usually possible to replace likelihood based techniques by quasi-likelihood alternatives in which only assumptions about means and covariances are made in order to obtain estimators. If it is available, the likelihood does provide a basis for benchmarking of alternative approaches, but not more than that. The challenge is to see whether everything that can be done via likelihoods has a corresponding quasi-likelihood approach from which the likelihood based results can be recovered, if they are available. It is conjectured that this is the case. In this paper, various illustrations are sketched of avoiding the likelihood in contexts where alternative approaches have not been obvious.

Key Words: Quasi-likelihood; E-M algorithm; constrained estimation; nuisance parameters; diffusions; REML estimation.
1 Introduction
This paper is concerned with promoting the thesis that: For parameter inference (1) it is advantageous to make minimalist assumptions on models (initially concerning only means and covariance structure), and (2) there is a sensible quasi-likelihood (QL) alternative/generalization of any likelihood based methodology, at least to the first order of asymptotics. We have also come to rely on the full distribution theory as a basis for a wide range of statistical procedures. Indeed, questions of appropriateness of the model are often suppressed in order to make use of easy analytical methods (as with the Black-Scholes model in Finance). However, many ostensibly likelihood based methods do not actually require full distributional assumptions. They can readily be extended to the estimating functions context when there is a conservative quasi-score, that is, an estimating function which is the gradient of a scalar objective function which plays the role of the
likelihood if and when it exists. However, it is argued that a scalar objective function for which the quasi-score is the gradient is inessential. These pronouncements may be regarded as controversial. However, they are motivated by a wish to promote serious consideration and debate and not out of intrinsic dogmatism. We shall give a smorgasbord of examples to elucidate the point of view described above. After discussing the general QL framework (Section 2), we shall describe some Projection-Solution (P-S) methods, namely in the contexts of constrained parameter estimation, nuisance parameters and the E-M algorithm (Section 3). Then we shall discuss bypassing the likelihood through examples for diffusion processes and REML estimation (Section 4). There are, of course, many other areas in which there is substantial progress towards the use of estimating functions without direct recourse to likelihood ideas. These include the areas of multiple roots (e.g. Heyde and Morton (1996b)), likelihood ratio tests (e.g. Li (1993)) and Bayesian analysis (e.g. Godambe (1994)). A much broader perspective will soon be available in Heyde (1997).
2 General QL Principles
Suppose we have a sample {X_t, t ∈ T} of vectors of dimension r, T possibly being discrete, continuous or lattice. The possible probability measures {P_θ} for {X_t} are the union of families of models and θ = (θ_1, ..., θ_p)' to be estimated is a vector of dimension p. The approach is via the set of p-dimensional vector estimating functions G = {G_T(θ) = G_T({X_t, t ∈ T}, θ)} which are functions of the data and θ, for which E G_T(θ) = 0 for each P_θ, and for which the matrices
$$\dot E\,G_T(\theta) = \big(E\,\partial G_{T,i}(\theta)/\partial\theta_j\big)$$
and E G_T(θ)G_T(θ)' are nonsingular, the prime denoting transpose. The QL theory focuses on suitably chosen subsets of G and involves choice of an estimating function G_T to maximize, in the partial order of non-negative definite (nnd) matrices, the information criterion
$$\mathcal{E}(G_T) = (\dot E\,G_T)'\big(E\,G_T G_T'\big)^{-1}(\dot E\,G_T),$$
which is a natural generalization of Fisher information. We have the following definition.

Definition. Suppose that G*_T ∈ H ⊂ G. If
$$\mathcal{E}(G^*_T) - \mathcal{E}(G_T)$$
is nnd for all G_T ∈ H, we say that G*_T is a quasi-score estimating function (QSEF) within H. The choice of the family H is completely open and should be tailored to the particular application. The estimator θ*_T obtained from G*_T(θ*_T) = 0, which is termed a quasi-likelihood estimator, has, under broad conditions, certain minimum size asymptotic confidence zone properties for θ, at least within H. Indeed, the basic properties are those of the maximum likelihood estimator, but restricted to the class H. The theory does not require a parametric setting, let alone the existence of a likelihood score function U_T(θ). However, if U_T ∈ H, as can ordinarily be arranged in exponential family problems, then U_T is the QSEF within H and can easily be calculated without using likelihoods. It is not usually practicable to find a quasi-score estimating function directly from the definition. However, the criterion given in the following proposition is easy to use in practice.

Proposition 1. Let H ⊂ G. Then G*_T ∈ H is a quasi-score estimating function within H if
$$(\dot E\,G_T)^{-1}E\,G_T G_T^{*\prime} = C_T \tag{2.1}$$
for all G_T ∈ H, where C_T is a fixed matrix. Conversely, if H is convex and G*_T is a quasi-score estimating function then (2.1) holds.
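As a concrete numerical illustration of Proposition 1 (a sketch added here, not from the paper; the mean and variance functions are arbitrary choices): for independent y_i with mean μ_i(θ) and variance V_i(θ), take H to be the linear family G(θ) = Σ a_i(y_i − μ_i(θ)). The quasi-score is G*(θ) = Σ μ̇_i(θ)(y_i − μ_i(θ))/V_i(θ), and (2.1) holds with constant −1 for every member of the family.

```python
import numpy as np

# Illustrative model (assumption, not from the paper): scalar theta,
# mu_i(theta) = exp(theta * t_i), V_i(theta) = mu_i(theta)**2.
rng = np.random.default_rng(0)
t = np.linspace(0.1, 1.0, 8)
theta0 = 0.7

def mu(theta):    return np.exp(theta * t)
def mudot(theta): return t * np.exp(theta * t)
def V(theta):     return mu(theta) ** 2

# Quasi-score G*(theta) = sum_i mudot_i (y_i - mu_i) / V_i.
def quasi_score(theta, y):
    return np.sum(mudot(theta) * (y - mu(theta)) / V(theta))

# Check criterion (2.1): for G_a(theta) = sum_i a_i (y_i - mu_i),
# (E dG_a/dtheta)^{-1} E[G_a G*] should not depend on the weights a.
def criterion(a, theta):
    EdG = -np.sum(a * mudot(theta))       # E dG_a/dtheta
    EGGstar = np.sum(a * mudot(theta))    # E G_a G* = sum a_i mudot_i (V_i cancels)
    return EGGstar / EdG

for a in (np.ones_like(t), t, 1.0 / V(theta0)):
    print(criterion(a, theta0))           # the same constant, -1, each time

# Solve the quasi-score estimating equation by bisection on simulated data.
y = mu(theta0) + np.sqrt(V(theta0)) * rng.standard_normal(t.size)
lo, hi = -2.0, 3.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if quasi_score(mid, y) > 0: lo = mid
    else: hi = mid
print("quasi-likelihood estimate:", 0.5 * (lo + hi))
```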
3 Projection Based Methods
Many problems for which the use of likelihood based ideas is standard, and for which there is not an obvious quasi-likelihood analogue in the absence of a conservative quasi-score, can be dealt with via projection based methods. We give three illustrations in this section.
3.1 Constrained Parameter Estimation

Here we wish to estimate θ subject to the constraint F'θ = d, F being a p × q matrix which does not depend on the data or θ. Suppose we have an unconstrained quasi-score Q(θ) ∈ H and, using a minus to denote generalized inverse, define the projection matrix
$$P = F(F'V^{-1}F)^{-}F'V^{-1}$$
for V = E QQ'. In the case where a likelihood L(θ) is available, the usual procedure is to use the method of Lagrange multipliers and maximize L(θ) + λ'(F'θ − d), where λ is determined by the constraint. Thus, we differentiate with respect to θ and solve the equations
$$U(\theta) + F\lambda = 0, \qquad F'\theta = d,$$
for λ and θ, U being the score function. The striking thing is that this procedure works in general for quasi-likelihood, even in the non-conservative case. We solve the equations
$$Q(\theta) + F\lambda = 0, \qquad F'\theta = d,$$
for λ and θ, that is, (I − P)Q(θ) = 0, F'θ = d. Optimality is preserved. For details see Heyde and Morton (1993). (A numerical sketch of these projected equations follows Section 3.2 below.)

3.2 Nuisance Parameters

Here we have θ' = (φ', ψ') where φ is the parameter of interest and ψ is a nuisance parameter. Then, supposing that we have a quasi-score Q(θ) ∈ H, we use the partitioned forms
$$Q(\theta) = \begin{pmatrix} Q_\phi \\ Q_\psi \end{pmatrix}, \qquad V = E\,QQ' = \begin{pmatrix} V_{\phi\phi} & V_{\phi\psi} \\ V_{\psi\phi} & V_{\psi\psi} \end{pmatrix},$$
and write P_ψ for the projection onto the components associated with ψ, so that P_ψQ = V_{·ψ}V_{ψψ}^{-}Q_ψ. The projection identifies the information about ψ for φ given, and the estimating equation (I − P_ψ)Q = 0 is optimal for the estimation of φ in the presence of the nuisance parameter ψ. The sensitive dependence of Q on ψ has been removed in the sense that
$$E\big\{\partial(I - P_\psi)Q/\partial\psi'\big\} = 0.$$
This is a first order approach. In the language of McLeish and Small (1988), (I − P_ψ)Q is locally E-ancillary for ψ and P_ψQ is locally E-sufficient for ψ.
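The projected equations are straightforward to implement. The following sketch (illustrative code added here, with an arbitrary Gaussian linear model supplying the quasi-score; none of the specifics come from the paper) solves the Section 3.1 equations Q(θ) + Fλ = 0, F'θ = d, and confirms that the solution satisfies the projected form (I − P)Q(θ̂) = 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 50, 3, 1
X = rng.standard_normal((n, p))
theta_true = np.array([1.0, -0.5, 0.25])
y = X @ theta_true + 0.1 * rng.standard_normal(n)

# Unconstrained quasi-score for the linear model (illustrative choice):
# Q(theta) = X'(y - X theta), with V = E[QQ'] proportional to X'X.
F = np.array([[1.0], [1.0], [1.0]])        # constraint F' theta = d
d = np.array([1.0])

# Solve Q(theta) + F lambda = 0, F' theta = d as one linear system.
A = np.block([[X.T @ X, F], [F.T, np.zeros((q, q))]])
b = np.concatenate([X.T @ y, d])
sol = np.linalg.solve(A, b)
theta_hat, lam = sol[:p], sol[p:]

# Check the projected form (I - P) Q(theta_hat) = 0 with
# P = F (F' V^{-1} F)^- F' V^{-1}, V = X'X (the scale factor cancels).
Vinv = np.linalg.inv(X.T @ X)
P = F @ np.linalg.pinv(F.T @ Vinv @ F) @ F.T @ Vinv
Q = X.T @ (y - X @ theta_hat)
print(theta_hat, F.T @ theta_hat)          # estimate satisfies the constraint
print((np.eye(p) - P) @ Q)                 # ~ 0, as the theory predicts
```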
3.3 E-M Algorithm Generalization

The E-M method is used for parameter estimation where there is missing data. In the first (E) step, one takes the conditional expectation of the complete data likelihood with respect to the available data. In the second (M) step, one maximizes over possible distributions. However, it is possible to avoid the likelihood completely by introducing a project-solve (P-S) method. Suppose that the full data is denoted by x, the observed data by y, and θ is the parameter of interest. We seek to adapt a quasi-score Q(θ; x) ∈ H_x to obtain a quasi-score Q*(θ; y) ∈ H_y. In fact
$$E(Q^* - Q)(Q^* - Q)' = \inf_{G\in H_y} E(G - Q)(G - Q)'$$
and Q* is the element of H_y with minimum dispersion distance from Q ∈ H_x.
If the likelihood score U ∈ H_x, then Q* = E(U | y) provided this belongs to H_y, as in the E-M case. However, Q* is given in general just as a least squares predictor, and mostly
$$Q^*(\theta; y) \ne E_\theta\big(Q(\theta; x)\,|\,y\big).$$
A detailed discussion of the method can be found in Heyde and Morton (1996a). There is an algorithm for solving Q*(θ; y) = 0 along the lines of the E-M algorithm. This usually gives a first order rate of convergence. However, the equation can often be solved directly with a second order rate of convergence, for example using Fisher's method of scoring.
4 Bypassing the Likelihood
In this section we give examples of the derivations of score functions without having first to find a likelihood to differentiate. In such cases there is the added advantage of being useful under distinctly broader distributional conditions than are imposed by the need to prescribe a likelihood.
4.1 Parameters in Diffusion Type Models

Here we have a model described by the stochastic differential equation
$$dX_t = a(t, X_t, \theta)\,dt + b^{1/2}(t, X_t)\,dW_t,$$
where a, b are known functions and W_t is standard Brownian motion. The usual approach to estimation of θ is to obtain an appropriate Radon-Nikodym derivative. This is tedious from first principles. Differentiation with respect to θ then gives the likelihood score.
Alternatively, one may consider the family of martingale estimating functions
$$\mathcal{H} = \Big\{\int_0^T k_t(\theta)\big(dX_t - a(t, X_t, \theta)\,dt\big),\ k_t\ \text{predictable}\Big\}.$$
The quasi-score estimating function from this family can be written down almost immediately using Proposition 1 as
$$\int_0^T \big(\dot a(t, X_t, \theta)\big)'\big(b(t, X_t)\big)^{-1}\big(dX_t - a(t, X_t, \theta)\,dt\big),$$
and this is equivalent to the likelihood score. The explanation is straightforward. Note that the elements of H are (martingale) stochastic integrals with respect to W_t. Also, a likelihood score is a martingale under modest regularity conditions. Furthermore, all square integrable martingales living on the same probability space as this process can be described as stochastic integrals with respect to the Brownian motion. Thus H contains the likelihood score and the QSEF will pick it out.

Now the quasi-likelihood method goes much further. One does not have to perturb a diffusion type model much to destroy the likelihood. For example, in the Cox-Ingersoll-Ross model used for interest rates in financial modelling,
$$dX_t = \alpha(\beta - X_t)\,dt + \sigma X_t^{1/2}\,dW_t,$$
the Radon-Nikodym derivative will not exist if the volatility σ is rate dependent. However, the QSEF is unaffected. For more details see Heyde (1994a).

4.2 Restricted (or Residual) Maximum Likelihood

Here the problem is of estimating dispersion in a linear model; the n × 1 vector y has the multivariate normal distribution MVN(Xβ, V(θ)) with mean Xβ, covariance V(θ), and θ is to be estimated. Take the rank of X as r, the dimension of β, and let A be any matrix with n rows and rank n − r satisfying A'X = 0. Then A'y has the MVN(0, A'VA) distribution. The striking thing here is that the likelihood function does not depend on A. Indeed, for all A, A(A'VA)^{-}A' = V^{-1}Q where Q = I − P, P being the projector onto the subspace R(X) (the range space of X) with respect to the inner product a'Vb. The likelihood function of θ based on A'y is, omitting a constant multiplier,
$$\Big(\prod_i \lambda_i\Big)^{1/2}\exp\Big(-\tfrac{1}{2}\,y'V^{-1}Q\,y\Big),$$
where the λ_i are the non-zero eigenvalues of V^{-1}Q. This is not a straightforward calculation, nor is the differentiation with respect to θ required to obtain the REML estimating equations
$$\operatorname{tr}\Big(V^{-1}Q\,\frac{\partial V}{\partial\theta_i}\Big) = y'\,V^{-1}Q\,\frac{\partial V}{\partial\theta_i}\,V^{-1}Q\,y, \qquad i = 1, 2, \ldots, p,$$
tr denoting trace. For the quasi-likelihood approach we no longer require multivariate normality but instead that y has mean vector Xβ, covariance matrix V(θ), and that each y_i has kurtosis 3. The crucial step is taken by noting that we expect to use quadratic functions of the data to estimate covariances. For fixed A, let z = A'y and take θ as a scalar for clarity. Now introduce the family of estimating functions
$$\mathcal{H} = \{G(S) = z'Sz - E\,z'Sz,\ S\ \text{symmetric}\}.$$
Write W = A'VA and Ẇ = ∂W/∂θ. Then
$$E\,G(S)G(S^*) = 2\operatorname{tr}(WSWS^*), \qquad E\,z'Sz = \operatorname{tr}(WS), \qquad \dot E\,G(S) = -\operatorname{tr}(\dot W S),$$
and we see via Proposition 1 that S* for the QSEF is given by
$$S^* = W^{-1}\dot W\,W^{-1}.$$
The REML estimating equations then follow since AW^{-}A' = V^{-1}Q. For more details see Heyde (1994b).
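The algebra above is easy to verify numerically. In the sketch below (an added illustration assuming a single variance component V(θ) = θK + I with K a known kernel), the quasi-score z'S*z − tr(WS*) built from S* = W⁻¹ẆW⁻¹ agrees exactly with the REML estimating function y'V⁻¹QKV⁻¹Qy − tr(V⁻¹QK), via AW⁻¹A' = V⁻¹Q.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, theta = 8, 2, 0.6
Xd = rng.standard_normal((n, r))                 # design matrix, rank r
K = rng.standard_normal((n, n)); K = K @ K.T     # known p.s.d. kernel
V = theta * K + np.eye(n)                        # V(theta), so dV/dtheta = K
y = rng.standard_normal(n)                       # any data vector works here

# A: n x (n-r) with A'X = 0 (orthonormal basis of the null space of X').
U, s, _ = np.linalg.svd(Xd, full_matrices=True)
A = U[:, r:]

z = A.T @ y
W = A.T @ V @ A
Wdot = A.T @ K @ A
Winv = np.linalg.inv(W)

# Quasi-score from S* = W^{-1} Wdot W^{-1}:  z'S*z - tr(W S*).
Sstar = Winv @ Wdot @ Winv
g_quasi = z @ Sstar @ z - np.trace(W @ Sstar)

# REML estimating function, with V^{-1}Q = A (A'VA)^{-1} A'.
VinvQ = A @ Winv @ A.T
g_reml = y @ VinvQ @ K @ VinvQ @ y - np.trace(VinvQ @ K)

print(g_quasi, g_reml)   # identical up to rounding error
```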
References
Godambe, V.P. (1994). Linear Bayes and optimal estimation. Technical Report STAT-94-11, University of Waterloo, Canada.

Heyde, C.C. (1994a). A quasi-likelihood approach to estimating parameters in diffusion-type processes. Studies in Applied Probability, J. Appl. Prob. 31A, 283-290.

Heyde, C.C. (1994b). A quasi-likelihood approach to the REML estimating equations. Statistics and Probability Letters 21, 381-384.

Heyde, C.C. (1997). Quasi-Likelihood Theory and its Application. Springer, New York.

Heyde, C.C. and Morton, R. (1993). On constrained quasi-likelihood estimation. Biometrika 80, 755-761.

Heyde, C.C. and Morton, R. (1996a). Quasi-likelihood and generalizing the E-M algorithm. J. R. Statist. Soc. Ser. B 58, 317-327.

Heyde, C.C. and Morton, R. (1996b). Multiple roots and dimension reduction issues for general estimating equations. Unpublished manuscript.

Li, B. (1993). A deviance function for the quasi-likelihood method. Biometrika 80, 741-753.

McLeish, D.L. and Small, C.G. (1988). The Theory and Applications of Statistical Inference Functions. Lecture Notes in Statistics Vol. 44, Springer, New York.
LIKELIHOOD AND PSEUDO LIKELIHOOD ESTIMATION BASED ON RESPONSE-BIASED OBSERVATION

J.F. Lawless
University of Waterloo
ABSTRACT

Response-biased observation refers to situations where the probability that a unit is observed depends on the value of a response associated with that unit. We discuss the construction of estimating equations for parametric regression models through likelihood and pseudo likelihoods, for situations in which responses are stratified and sampling is stratum-specific. Properties of the resulting estimators are reviewed and an illustration involving field reliability data is presented.
1 Introduction
In many observational studies the probability that a specific individual or unit is observed or selected in a sample depends upon responses or covariates associated with that unit. That is, if units in some population have associated response variables y and covariates x, then the probability that unit i is selected depends upon the values (y_i, x_i) for that unit. When the probability of selection depends upon y_i, we call the observation scheme response-selective, or response-biased. For simplicity of exposition I will focus mainly on situations where the probability of selection depends solely on y_i. However, as described at the end of Section 2 and in Section 4, situations where the probability of selection depends on both y_i and x_i may also be handled using the methods considered here. Examples of response-selective observation are abundant. In socio-economic studies based on samples drawn from administrative records, selection is often response-related (e.g. Hausman and Wise 1981). Similarly, in a study of factors affecting low birth weight of humans, one might select newborns over a period of time and measure covariates so as to over-sample babies with
low birth weights. Extreme forms of response-selection are embodied in case-control or choice-based sampling used in epidemiology and economics (e.g. Breslow and Cain 1988; Hsieh, Manski and McFadden 1985); in this case the response variable is categorical and covariates are observed for samples of individuals selected from each response category. Examples of selection bias in more complicated settings are given by Hoem (1985) and Kalbfleisch and Lawless (1988a), who discuss the observation of life history events for human populations. When the probability of selection p(y) for a unit is a known function of y, methods that weight log likelihood or estimating function components according to p(y)^{-1} are often used. Several other approaches can also be used in various contexts. This article discusses four methods of estimation for a rather broad class of situations. The approaches described are not new, but questions remain about their properties. Our purpose is to review the four methods and recent investigations into their properties, and to indicate connections with other areas. It is assumed that (y, x) values for individuals or units in some "population" from which units will be sampled are generated from a probability distribution with density or mass function
$$f(y\,|\,x;\theta)\,g(x), \qquad y\in\mathcal{Y},\ x\in\mathcal{X}, \tag{1}$$
and that our objective is to estimate the p-dimensional parameter θ. We wish to avoid strong parametric assumptions about g(x) and the corresponding distribution function G(x), as is common in regression modelling. Sampling is response-selective in the following sense: the range of y is partitioned into strata S_1, ..., S_k and if y_i ∈ S_j then unit i is sampled (selected) with probability p_j. In other words, p(y_i) = Σ_{j=1}^{k} p_j I(y_i ∈ S_j), where I(A) is the indicator function which equals 1 if event A is true and 0 otherwise. More specifically, we assume that a finite population of N units has values (y_i, x_i), i = 1, ..., N, generated as independent realizations from (1). In survey sampling terminology, we have a stratified population and wish to estimate parameters in the superpopulation model (1). Samples may be selected in various ways. We consider two, termed basic stratified sampling (BSS) and basic variable probability sampling (BVPS). In BSS a simple random sample of specified size n_j is selected from units in the jth stratum (i.e. with y_i ∈ S_j). In BVPS each unit in the population is considered for selection independent of every other unit. The jth stratum size (number of units with y_i ∈ S_j) is denoted by N_j (j = 1, ..., k). In the case of BSS the fixed sample size from stratum j is n_j = N_j p_j, whereas in the case of BVPS the size of the sample from the jth stratum is random. It is an important feature of our framework that the stratum sizes N_1, ..., N_k are known, or observable. This latter feature is not present in some applications, e.g. for
many instances of case-control or choice-based sampling. The remainder of the paper is as follows. Section 2 makes the observational framework precise and then describes methods of estimation based both on likelihood and on pseudo likelihood functions. Section 3 discusses asymptotic properties and variance estimation, and Section 4 illustrates the methodology. Section 5 concludes with remarks on extensions and relationships with missing data methods.
2 Likelihood and Pseudo Likelihood Estimating Functions
Suppose that individual pairs (y_i, x_i), i = 1, ..., N, are generated independently from (1), and that N_j units have y_i ∈ S_j (j = 1, ..., k). Units are selected by either BSS or BVPS as described in Section 1 and (y_i, x_i) observed. Let R_i = I(unit i is selected), and D_j = {i : R_i = 1, y_i ∈ S_j} denote the units selected from stratum j, where |D_j| = n_j. As is customary, Y and X are used to represent the random variables of which y_i and x_i are realizations. For simplicity I consider only situations where the population size N and stratum sizes N_j are fixed at the time units are selected. It is also assumed that the N_j's are known and further, that for units not selected all that is known is which stratum they are in. In some contexts such as the birth weight study, the values of N and the N_j's may be unknown until the end of the sampling period and the y_i's (but not the x_i's) may be known for units not selected. Such features may also be dealt with via the methods discussed here; Lawless, Wild and Kalbfleisch (1997) consider a variety of response-selective sampling schemes. Under BSS or BVPS in the framework described, the data include the stratum sizes N_1, ..., N_k, the pairs (y_i, x_i), i ∈ D_j (j = 1, ..., k) for the selected units, and for BVPS, the sample sizes n_1, ..., n_k. In either case the probability density function for the observed data defines a likelihood function for the unknown parameters θ and G (the distribution function of X) which is proportional to
$$L_F(\theta, G) = \prod_{j=1}^{k}\Big\{Q_j(\theta,G)^{N_j - n_j}\prod_{i\in D_j} f(y_i\,|\,x_i;\theta)\,dG(x_i)\Big\}, \tag{2}$$
where
$$Q_j(\theta,G) = \Pr(Y\in S_j;\theta,G) = \int \Pr(Y\in S_j\,|\,x;\theta)\,dG(x). \tag{3}$$
It is recognized in the notation that (2) depends on both θ and G. From our point of view G is a nuisance parameter but because of (3) it is necessary to estimate it in order to estimate θ by maximizing (2). One approach is to maximize the semiparametric likelihood (2) jointly with respect to θ and G. This is feasible when Y is categorical (Wild 1991, Scott and Wild 1997) or when G is discrete with relatively few points of support (Hsieh et al. 1985), and recent work suggests it is feasible quite generally. A second line of attack is to maximize L_F(θ, Ĝ), where Ĝ is a simple nonparametric estimate of G; this is an extension of the parametric pseudo likelihood idea of Gong and Samaniego (1981) to a semiparametric setting. Noting that
$$G(x) = \sum_{j=1}^{k}\Pr(X \le x\,|\,Y\in S_j)\,\Pr(Y\in S_j),$$
we propose to use the estimate
$$\hat G(x) = \sum_{j=1}^{k}\frac{N_j}{N}\,\hat G_j(x), \tag{4}$$
where Ĝ_j(x) is the empirical cumulative distribution function (cdf) based on the x_i's for units i ∈ D_j sampled from the jth stratum. Inserting (4) into (3) and taking ∂log L(θ, Ĝ)/∂θ, we obtain the pseudo score function
$$S_P(\theta) = \sum_{j=1}^{k}\sum_{i\in D_j}\frac{\partial\log f(y_i\,|\,x_i;\theta)}{\partial\theta} + \sum_{j=1}^{k}(N_j - n_j)\,\frac{\partial\log \hat Q_j(\theta)}{\partial\theta}, \tag{5}$$
where $\hat Q_j(\theta) = \int \Pr(Y\in S_j\,|\,x;\theta)\,d\hat G(x)$ and
$$Q_j(x;\theta) = \Pr(Y\in S_j\,|\,x;\theta). \tag{6}$$
We estimate θ by solving the equation S_P(θ) = 0. This idea has been used in other contexts by Pepe and Fleming (1991), and Hu and Lawless (1997). A third possibility is to weight score contributions for sampled units to give the weighted pseudo score
$$S_W(\theta) = \sum_{j=1}^{k} p_j^{-1}\sum_{i\in D_j}\frac{\partial\log f(y_i\,|\,x_i;\theta)}{\partial\theta}. \tag{7}$$
It is apparent that S_W(θ) is unbiased (has mean 0) under BSS or BVPS, where expectation is with respect to (1) and the selection scheme. Weighting is common in survey sampling (e.g. Holt, Smith and Winter 1980, Binder 1983, Binder and Patak 1994) and has been considered for maximum likelihood methods by Hsieh et al. (1985), Scott and Wild (1986), Kalbfleisch and Lawless (1988ab) and others. Robins et al. (1994, 1995) consider weighted pseudo score functions when the selection probabilities p(y_i) may depend on unknown parameters. The final method considered is based on the observation that in the case of BVPS the distribution of the observed responses, conditional on the values of R_i and x_i (i = 1, ..., N), yields the conditional, or selection-biased likelihood
$$L_C(\theta) = \prod_{j=1}^{k}\prod_{i\in D_j}\frac{p_j\,f(y_i\,|\,x_i;\theta)}{\sum_{l=1}^{k}p_l\,Q_l(x_i;\theta)}. \tag{8}$$
The corresponding score function is
$$S_C(\theta) = \sum_{j=1}^{k}\sum_{i\in D_j}\Big\{\frac{\partial\log f(y_i\,|\,x_i;\theta)}{\partial\theta} - \frac{\partial}{\partial\theta}\log\sum_{l=1}^{k}p_l\,Q_l(x_i;\theta)\Big\}. \tag{9}$$
Straightforward calculation shows that S_C(θ) is unbiased under BSS as well as under BVPS. If p_1 = ··· = p_k then S_W(θ) = 0 and S_C(θ) = 0 yield the same estimate, but otherwise the estimators obtained from the estimating equations S_P(θ) = 0, S_W(θ) = 0 and S_C(θ) = 0 appear to be distinct. General results concerning the relative efficiencies of the estimators θ̂_P, θ̂_W and θ̂_C obtained from these equations, and θ̂_F obtained by maximization of L_F(θ, G) with respect to θ and G, are not available, though results of Robins et al. (1994, 1995) show that the pseudo-likelihood methods are asymptotically inefficient. Another important point is that in the case of BVPS it is preferable to use p̂_j = n_j/N_j rather than prior selection probabilities. In Section 3 we discuss asymptotic properties of the estimates and in Sections 4 and 5 some limited simulation results. We conclude this section by noting that analogous estimating functions may be given for cases in which the probability of selection for a unit depends
on both y and x. In particular, suppose that k strata are defined according to whether (y_i, x_i) ∈ S_j, where S_1, ..., S_k partitions 𝒴 × 𝒳. The likelihood function based on the observed data is now given by (2), with Q_j(θ, G) redefined as
$$Q_j(\theta,G) = \Pr\{(Y,X)\in S_j;\theta,G\} = \int\Pr\{(Y,x)\in S_j\,|\,x;\theta\}\,dG(x). \tag{10}$$
A pseudo score S_P(θ) corresponding to (5) is obtained by estimating (10) via (4), where Ĝ(x) is based, as before, on the empirical cdfs of the x_i's for units i ∈ D_j. A weighted pseudo score function is given by (7) once again, and a conditional score is given by (9) with Q_l(x_i; θ) replaced by
$$Q_l(x;\theta) = \Pr\{(Y,x)\in S_l\,|\,x;\theta\}. \tag{11}$$

3 Asymptotic Properties, Variance Estimates and Confidence Limits
By taking limits as N → ∞ and with p_j = n_j/N_j (j = 1, ..., k) fixed positive values, we may show that under mild conditions the estimators θ̂ obtained by solving S_W(θ) = 0 or S_C(θ) = 0 are consistent and asymptotically normal. Special cases have been considered by Kalbfleisch and Lawless (1988b), Wild (1991) and Scott and Wild (1997). Asymptotics under BVPS may also be obtained. A rigorous development of asymptotics for the case of full maximum likelihood based on (2) or for the estimating equation S_P(θ) = 0 is more difficult. For the former Wild (1991) and Scott and Wild (1997) deal with the case of categorical responses, and for the latter Hu and Lawless (1997) deal with the special problem described in Section 4. We outline the asymptotic normal results for S_W(θ) and S_C(θ) given by (7) and (9), respectively; Lawless et al. (1997) give a fuller treatment. Both (7) and (9) may be written in the form
$$\sum_{i=1}^{N} R_i\,U(y_i; x_i;\theta). \tag{12}$$
For S_W(θ), for example,
$$U(y_i; x_i;\theta) = \sum_{j=1}^{k} p_j^{-1}\,I(y_i\in S_j)\,\partial\log f(y_i\,|\,x_i;\theta)/\partial\theta. \tag{13}$$
Assume plim(N_j/N) = π_j > 0 as N → ∞ and define
$$A(\theta) = \operatorname{plim} A_N(\theta), \qquad B(\theta) = \operatorname{plim} B_N(\theta).$$
Under mild regularity conditions on the model (1), we have
$$\sqrt{N}(\hat\theta - \theta_0) \xrightarrow{\ d\ } N\big(0,\ A(\theta_0)^{-1}B(\theta_0)A(\theta_0)^{-1}\big),$$
where θ_0 represents the true value of θ. The asymptotic variance of θ̂ may be estimated as Â⁻¹B̂Â⁻¹, where Â and B̂ are consistent estimates of A(θ_0) and B(θ_0). The matrix Â = A_N(θ̂) may be used to estimate A(θ_0). For S_C(θ) in the case of BVPS the estimating function is a likelihood score function, so A(θ_0) = B(θ_0). For the other cases we require a consistent estimator B̂, which is not hard to obtain. For S_W(θ), for example, extending the approach of Kalbfleisch and Lawless (1988b) and defining v_i(θ) = ∂log f(y_i | x_i; θ)/∂θ and v̄^{(j)}(θ) = Σ_{i: y_i∈S_j} v_i(θ)/N_j, we get
$$\operatorname{Var}\{S_W(\theta)\} = \operatorname{Var}_{Y|X}E_{R|YX}\{S_W(\theta)\} + E_{Y|X}\operatorname{Var}_{R|YX}\{S_W(\theta)\} = E\Big(\sum_{i=1}^{N} v_i(\theta)v_i(\theta)'\Big) + C(\theta), \tag{14}$$
where for BSS we have
$$C(\theta) = E_{Y|X}\sum_{j=1}^{k}\frac{N_j(1-p_j)}{p_j(N_j-1)}\sum_{i:\,y_i\in S_j}\big[v_i(\theta)-\bar v^{(j)}(\theta)\big]\big[v_i(\theta)-\bar v^{(j)}(\theta)\big]'. \tag{15}$$
Since N^{-1}E(−∂S_W/∂θ') = A(θ), equations (14)-(15) indicate that B(θ) may be estimated by B̂ = Â + Ĉ, where
$$\hat C = \frac{1}{N}\sum_{j=1}^{k}\frac{N_j(1-\hat p_j)}{\hat p_j(n_j-1)}\sum_{i\in D_j}\big[v_i(\hat\theta)-\bar v^{(j)}(\hat\theta)\big]\big[v_i(\hat\theta)-\bar v^{(j)}(\hat\theta)\big]', \tag{16}$$
where v̄^{(j)}(θ̂) is now computed from the sampled units i ∈ D_j.
Confidence intervals for parameters may be obtained by treating √N(θ̂ − θ) as approximately normal with a suitably estimated covariance matrix. An alternative would be to use some form of bootstrap. Investigation of specific problems by simulation is needed to gain insight into the adequacy of confidence interval procedures for different sample and population sizes. As noted in Section 5, relatively little is known about the efficiency of S_W(θ), S_C(θ) and S_P(θ) in general situations; this too deserves investigation.
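To make the sandwich recipe concrete, here is a schematic simulation (all modelling choices are illustrative assumptions, not from the paper): a normal regression with two response strata under BVPS, the weighted pseudo score (7), Â = A_N(θ̂), and, in place of the BSS formulas (14)-(16), the simpler BVPS variance estimate B̂ = N⁻¹ Σ_selected u_i u_i'/p_j².

```python
import numpy as np

rng = np.random.default_rng(4)
N, theta0, cut = 2000, 1.5, 1.0
x = rng.standard_normal(N)
y = theta0 * x + rng.standard_normal(N)

# Two response strata, BVPS selection: p = 0.1 for y <= cut, 1.0 for y > cut.
p = np.where(y <= cut, 0.1, 1.0)
R = rng.random(N) < p
w = R / p                                # Horvitz-Thompson case weights

# Weighted pseudo score (7) for the normal regression score x(y - theta x).
theta_hat = np.sum(w * x * y) / np.sum(w * x * x)

u = x * (y - theta_hat * x)              # score contributions v_i(theta_hat)
A_hat = np.sum(w * x * x) / N            # N^{-1} estimate of E(-dS_W/dtheta)
B_hat = np.sum(R * (u / p) ** 2) / N     # BVPS variance estimate for S_W
se = np.sqrt(B_hat / A_hat**2 / N)       # sandwich A^{-1} B A^{-1} / N
print(theta_hat, "+/-", 1.96 * se)
```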
4 An Example
Kalbfleisch and Lawless (1988b), Hu and Lawless (1996) and others have considered problems in epidemiology and reliability in which a response time y_i > 0 and covariates z_i for an individual or unit in some population are always observed if y_i does not exceed an associated censoring time τ_i > 0. The number of units for which the response time is censored (i.e. y_i > τ_i) is known but the values of τ_i and z_i are not, so a fraction p_2 of the censored units are sampled, and their τ_i and z_i values are obtained. The objective is to estimate the distribution f(y_i | z_i; θ), where it is assumed that Y_i and T_i are independent, given Z_i. This problem involves response-selective sampling of the type described at the end of Section 2. In particular, let x_i = (τ_i, z_i) be an extended covariate vector representing the censoring time τ_i and covariates z_i for unit i. The data for N units i = 1, ..., N are assumed to come from (1), where f(y_i | x_i; θ) = f(y_i | z_i; θ). Consider two strata for (y, τ, z) defined by S_1 = {(y, τ, z) : y ≤ τ} and S_2 = {(y, τ, z) : y > τ}. Units with (y_i, x_i) ∈ S_1 are selected with probability p_1 = 1 and those with (y_i, x_i) ∈ S_2 are selected with probability p_2 < 1. We extend the four estimation procedures of Section 2 slightly to deal with the stratification on both y and x, as described at the end of Section 2, and to reflect the fact that for units i ∈ D_2 we know only that y_i > τ_i, and not y_i's exact value. We obtain the likelihood function corresponding to (2) as
$$L_F(\theta,G) = \prod_{i\in D_1} f(y_i\,|\,x_i;\theta)\,dG(x_i)\ \cdot\ Q_2(\theta,G)^{N_2-n_2}\prod_{i\in D_2}\bar F(\tau_i\,|\,x_i;\theta)\,dG(x_i), \tag{17}$$
where $\bar F(\tau\,|\,x;\theta) = \int_\tau^\infty f(y\,|\,x;\theta)\,dy$ and
$$Q_2(\theta,G) = \int\bar F(\tau\,|\,x;\theta)\,dG(x).$$
The pseudo score S_P(θ) corresponding to (5) is then
$$S_P(\theta) = \sum_{i\in D_1}\frac{\partial\log f(y_i\,|\,x_i;\theta)}{\partial\theta} + \sum_{i\in D_2}\frac{\partial\log\bar F(\tau_i\,|\,x_i;\theta)}{\partial\theta} + (N - n_1 - n_2)\,\frac{\partial\log\hat Q_2(\theta)}{\partial\theta}.$$
The weighted pseudo score corresponding to (7) is
$$S_W(\theta) = \sum_{i\in D_1}\frac{\partial\log f(y_i\,|\,x_i;\theta)}{\partial\theta} + \frac{1}{p_2}\sum_{i\in D_2}\frac{\partial\log\bar F(\tau_i\,|\,x_i;\theta)}{\partial\theta}.$$
Finally, the conditional (pseudo) score corresponding to (9) is
$$S_C(\theta) = \sum_{i\in D_1}\Big\{\frac{\partial\log f(y_i\,|\,x_i;\theta)}{\partial\theta} - \frac{\partial}{\partial\theta}\log\big[F(\tau_i\,|\,x_i;\theta) + p_2\,\bar F(\tau_i\,|\,x_i;\theta)\big]\Big\} + \sum_{i\in D_2}\Big\{\frac{\partial\log\bar F(\tau_i\,|\,x_i;\theta)}{\partial\theta} - \frac{\partial}{\partial\theta}\log\big[F(\tau_i\,|\,x_i;\theta) + p_2\,\bar F(\tau_i\,|\,x_i;\theta)\big]\Big\},$$
where F(τ | x; θ) = 1 − F̄(τ | x; θ). Hu and Lawless (1996, 1997) illustrate the use of the four estimation methods on problems involving automobile warranty data, and compare the methods in a simulation study. In their context N is large (N = 4000 in the simulation) and the proportion of the population falling into stratum 1 is .25 or smaller. They found that with selection probabilities p_2 in the range .05 - .20, the four estimation methods were all close to unbiased and gave estimators with roughly the same variance. In addition, normal approximations for θ̂ were adequate for the range of population and sample sizes considered. The estimators based on S_W(θ) and S_C(θ) are easier to deal with in terms of variance estimation, and S_W(θ) has the added convenience of being computable with standard censored lifetime data software that allows variable case weights.
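As a sketch of that last remark (an added illustration assuming an exponential lifetime model with no covariates, which is not the model used by Hu and Lawless): with f(y; θ) = θe^{−θy}, so that ∂log f/∂θ = 1/θ − y and ∂log F̄(τ; θ)/∂θ = −τ, the weighted pseudo score has a closed-form root, and the 1/p₂ factors play exactly the role of variable case weights.

```python
import numpy as np

rng = np.random.default_rng(5)
N, theta0, p2 = 4000, 0.8, 0.10
y = rng.exponential(1 / theta0, N)       # response times
tau = rng.exponential(2.0, N)            # independent censoring times

cens = y > tau                           # stratum 2: censored units
R = ~cens | (rng.random(N) < p2)         # stratum 1 kept with p1 = 1

# Weighted pseudo score for f(y; theta) = theta * exp(-theta y):
#   S_W(theta) = sum_{D1} (1/theta - y_i) - (1/p2) sum_{D2} tau_i = 0.
D1 = R & ~cens
D2 = R & cens
theta_hat = D1.sum() / (y[D1].sum() + tau[D2].sum() / p2)
print("true:", theta0, " estimate:", theta_hat)
```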
5 Additional Remarks
This article reviews several approaches to estimation of parameters when sampling is response-selective with known selection probabilities. Discussion was restricted to two common selection procedures (BSS and BVPS), but extensions to other schemes are possible. For example, in some applications, such as the birth weight study mentioned in Section 1, quota sampling may be used so that the total size N of the population assumed to be generated by (1) is random. Information about the relative efficiencies of the different estimation procedures is at present quite limited. For the scenario described in Section 4 Hu and Lawless (1997) found all four methods to be comparable. However, in the case of binary responses Wild (1991) and others have found that estimators based on the weighted pseudo score S_W(θ) can be considerably less efficient than those based on S_C(θ) or on L_F(θ, G). Wild also found that S_C(θ) gave estimates very nearly as efficient as those obtained
from the full likelihood L_F(θ, G). A limited simulation study by Robins et al. (1994) in a binary response problem, however, revealed situations where all of S_W(θ), S_C(θ) and S_P(θ) were rather inefficient relative to L_F(θ, G). Further investigation is desirable. It should be mentioned that another feature of weighted pseudo scores is their applicability in more complex probability sampling situations (e.g. Binder 1983, Binder and Patak 1994) and in situations where only moments of Y given X are modelled, rather than f(y | x; θ) (O'Hara Hines 1997). However, other more efficient pseudo likelihood methods can also be developed (Robins et al. 1994). There is a close connection between the methods discussed here and methods for dealing with missing data. Indeed, the present framework can be viewed as one in which covariate values are missing for units that are not selected. The approaches to estimation used here may also be applied with more general missing data problems. Robins et al. (1994, 1995) and Carroll et al. (1995, Chapter 9) provide wide-ranging discussions. Hu and Lawless (1997) also provide general discussion, and some simulation results. Robins et al. deal with very general problems in which the probability an observation is incomplete (has data missing) may depend upon unknown parameter values. They obtain asymptotically optimal estimators of θ within semiparametric models but their methods are generally difficult to implement. As remarked earlier, it is of considerable interest to compare the various approaches in more detail. Lawless et al. (1997) give some results. Finally, we note another method of estimation that is suggested by the use of the EM algorithm to maximize L_F(θ, G). By considering the "complete" data log likelihood based on knowledge of all x_i's (i = 1, ..., N),
$$\sum_{i=1}^{N}\Big[R_i\,\log f(y_i\,|\,x_i;\theta) + (1 - R_i)\,\log Q_{j_i}(x_i;\theta) + \log dG(x_i)\Big],$$
with j_i denoting the stratum containing unit i,
we obtain the following E-M algorithm, which leads to a stationary point of L_F(θ, G): let x*_1, ..., x*_m denote the distinct x_i's observed, and denote g_r = dG(x*_r). Then we have

E-step: Given current estimates θ̂, Ĝ = (ĝ_1, ..., ĝ_m), compute
$$w_{rj} = \frac{\hat g_r\,Q_j(x^*_r;\hat\theta)}{\sum_{s=1}^{m}\hat g_s\,Q_j(x^*_s;\hat\theta)}. \tag{18}$$

M-step: Obtain the updated estimate of θ by solving
$$\sum_{i:R_i=1}\frac{\partial\log f(y_i\,|\,x_i;\theta)}{\partial\theta} + \sum_{j=1}^{k}(N_j - n_j)\sum_{r=1}^{m} w_{rj}\,\frac{\partial\log Q_j(x^*_r;\theta)}{\partial\theta} = 0. \tag{19}$$

Obtain the updated estimate of G from
$$\hat g_r = \frac{d_r + \sum_{j=1}^{k}(N_j - n_j)\,w_{rj}}{N},$$
where d_r = Σ_{i=1}^{N} I(R_i = 1, x_i = x*_r). If instead of (18) we use the empirical estimates
$$\hat g_r = \sum_{j=1}^{k}\frac{N_j}{N}\,\frac{1}{n_j}\sum_{i\in D_j} I(x_i = x^*_r)$$
in (19), we obtain an estimating equation S_M(θ) = 0.
A similar idea has been used in a different context by Reilly and Pepe (1995), and it would be of interest to see how it performs in the current framework. It is easily seen that for the example in Section 4, S_M(θ) is identical with S_W(θ), and it is also identical when y is categorical with strata corresponding to categories. More generally, however, it is different.
Acknowledgements

This work was partially supported by grants from the Natural Sciences and Engineering Research Council of Canada.
References

Binder, D.A. (1983). On the variance of asymptotically normal estimators from complex surveys. Int. Statist. Rev. 51, 279-292.

Binder, D.A. and Patak, Z. (1994). Use of estimating functions for estimation from complex surveys. J. Amer. Statist. Assoc. 89, 1035-1043.

Breslow, N.E. and Cain, K.C. (1988). Logistic regression for two-stage case-control data. Biometrika 75, 11-20.

Carroll, R.J., Ruppert, D. and Stefanski, L.A. (1995). Measurement Error in Nonlinear Models. London: Chapman and Hall.

Gong, G. and Samaniego, F.J. (1981). Pseudo maximum likelihood estimation: theory and applications. Ann. Statist. 9, 861-869.
Hausman, J.A. and Wise, D.A. (1981). Stratification on endogenous variables and estimation: The Gary Income Maintenance Experiment. In Structural Analysis of Discrete Data with Econometric Applications, eds. C.F. Manski and D. McFadden, Cambridge, MA: MIT Press, pp. 364-391.

Hoem, J.M. (1985). Weighting, misclassification and other issues in the analysis of survey samples of life histories. Chapter 5 in Longitudinal Analysis of Labor Market Data, eds. J.J. Heckman and B. Singer. Cambridge, UK: Cambridge University Press.

Holt, D., Smith, T.M.F. and Winter, P.D. (1980). Regression analysis of data from complex surveys. J. Roy. Statist. Soc. A 143, 474-487.

Hsieh, D.A., Manski, C.F. and McFadden, D. (1985). Estimation of response probabilities from augmented retrospective observations. J. Amer. Statist. Assoc. 80, 651-662.

Hu, X.J. and Lawless, J.F. (1996). Estimation from truncated lifetime data with supplementary information on covariates and censoring times. Biometrika 83, 747-761.

Hu, X.J. and Lawless, J.F. (1997). Pseudo likelihood estimation in a class of problems with response-related missing covariates. To appear in Canad. J. Statistics.

Kalbfleisch, J.D. and Lawless, J.F. (1988a). Likelihood analysis of multi-state models for disease incidence and mortality. Statist. Med. 7, 149-160.

Kalbfleisch, J.D. and Lawless, J.F. (1988b). Estimation of reliability from field performance studies (with discussion). Technometrics 30, 365-388.

Lawless, J.F., Wild, C.J. and Kalbfleisch, J.D. (1997). Likelihood and pseudo likelihood estimation for response-stratified data. U. of Waterloo Technical Report Stat-97-07.

O'Hara Hines, R.J. (1997). Fitting generalized linear models to retrospectively sampled clusters with categorical responses. To appear in Canad. J. Statistics.

Pepe, M.S. and Fleming, T.R. (1991). A nonparametric method for dealing with mismeasured covariate data. J. Amer. Statist. Assoc. 86, 108-113.
Reilly, M. and Pepe, M.S. (1995). A mean score method for missing and auxiliary covariate data in regression models. Biometrika 82, 299-314.

Robins, J.M., Rotnitzky, A. and Zhao, L.P. (1994). Estimation of regression coefficients when some regressors are not always observed. J. Amer. Statist. Assoc. 89, 846-866.

Robins, J.M., Hsieh, F. and Newey, W. (1995). Semiparametric efficient estimation of a conditional density with missing or mismeasured covariates. J. Roy. Statist. Soc. B 57, 409-424.

Scott, A.J. and Wild, C.J. (1986). Fitting logistic regression models under case-control or choice-based sampling. J. Roy. Statist. Soc. B 48, 170-182.

Scott, A.J. and Wild, C.J. (1997). Fitting regression models to case-control data by maximum likelihood. Biometrika 84, 57-71.

Wild, C.J. (1991). Fitting prospective regression models to case-control data. Biometrika 78, 705-717.
LIKELIHOOD FROM ESTIMATING FUNCTIONS

P. A. Mykland
University of Chicago

ABSTRACT

Likelihoods based on estimating equations have typically been designed to retain some, but not all, of the desirable properties of true likelihood. We discuss how these differences in intention affect a number of commonly used procedures, and, in particular, how various likelihoods live up to the ideal: true likelihood.

Key Words: Accuracy, adaptive estimation, conditionality, dual likelihood, efficiency, empirical likelihood, projective likelihood, quasi-likelihood.
1 Introduction
Likelihood inference has the nice property that it solves several different problems at the same time. Such inference is first order efficient. It has nice accuracy properties, in particular when using the likelihood ratio statistic, its signed square root R, and the associated R* (Barndorff-Nielsen (1986)) statistic. The accuracy is both unconditional and conditional (McCullagh (1984), Jensen (1992, 1997)). One gets a likelihood surface. And likelihood incorporates notions of inferential correctness from both the frequentist and Bayesian viewpoints. This concatenation of desirable features would seem hard to replicate in a less than full parametric setting. Most approaches attempt to solve one of the above problems rather than all of them at the same time. Quasi- and projective likelihood (Godambe (1960), Wedderburn (1974), Godambe and Heyde (1987), McLeish and Small (1992)) and adaptive inference (going back to Beran (1974), Sacks (1975) and Stone (1975); see Bickel, Klaassen, Ritov and Wellner (1993) for a quite comprehensive account) are, essentially, solutions to the efficiency problem. Dual likelihood and empirical likelihood for estimating equations (Kolaczyk (1994), Qin and Lawless (1994), Mykland (1995)) are solutions to the unconditional accuracy problem. The original empirical likelihood (Owen (1988, 1990), see also DiCiccio and Romano (1989) and DiCiccio, Hall and Romano (1991)) appears to have been motivated by both considerations.
To what extent do these methods cope with the problems which they were not designed to solve?
2 Unintended Properties
As far as the author is aware, this question has not been heavily studied. In the following, we summarize what appears to be known about the properties mentioned above.

(i) Optimality. This is the high ground of quasi-likelihood and adaptive inference. If an empirical or dual likelihood is based on a quasi-score, it will have the same asymptotic efficiency as the score itself (Kolaczyk (1994), Mykland (1995)). One can also base such likelihoods on other scores, but this would be unnatural if one knows what the second moment structure is like. We are not aware of any work concerning any possible connection between adaptive inference and empirical or dual likelihood.

(ii) Unconditional Accuracy. This is what empirical and dual likelihood are good at, though things start breaking down in the presence of nuisance parameters (Lazar (1996), Lazar and Mykland (1996), Mykland (1996)). The quasi-log likelihood does not typically satisfy Bartlett identities of order higher than 2, so the accuracy properties of the R statistic and its cousins do not hold (cf. Mykland (1996)). If one wishes these properties, one can instead use projective likelihood (McLeish and Small (1992)), which is based on the same inferential ideas as quasi-likelihood. By virtue of the projective likelihood being a true Radon-Nikodym derivative, accuracy will be as for likelihood. There is no free lunch, however, as we shall see next. Little is known about the accuracy of adaptive inference beyond first order.

(iii) Likelihood Surface. This exists for quasi- and empirical likelihood, and can presumably also be defined in the context of adaptive inference. For the other two approaches which we are discussing, the issue is more problematic. Projective likelihood needs a reference parameter value; the log is of the form l_{θ_0}(θ), and, typically, l_{θ_0}(θ_1) + l_{θ_1}(θ_2) is not the same as l_{θ_0}(θ_2). One can locate the reference value at the MPLE, but this is not a completely satisfying solution. For dual likelihood, this question is not fully explored. If it is a dual criterion function to a nonparametric likelihood (empirical or point process, cf. Section 6 in Mykland (1995)), one can presumably use the surface from the nonparametric quantity. For a 'pure' dual likelihood (based on an estimating equation only, with no nonparametric counterpart, such as Aalen's (1980, 1989) linear regression), we do not know whether a likelihood surface exists.
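For orientation, the empirical likelihood mentioned repeatedly above can be computed in a few lines for the simplest estimating equation, the mean (a standard construction following Owen (1988), sketched here with illustrative data; it is not code from this paper). The dual formulation, profiling the weights out via a Lagrange multiplier, is what ties it to dual likelihood.

```python
import numpy as np
from scipy.optimize import brentq

def emp_loglik_ratio(x, mu):
    """-2 log empirical likelihood ratio for the hypothesis E X = mu."""
    z = x - mu
    # Dual problem: find lambda with sum z_i / (1 + lambda z_i) = 0,
    # keeping every weight w_i = 1 / (n (1 + lambda z_i)) in (0, 1].
    lo = (1 / len(z) - 1) / z.max()
    hi = (1 / len(z) - 1) / z.min()
    lam = brentq(lambda l: np.sum(z / (1 + l * z)), lo + 1e-10, hi - 1e-10)
    return 2 * np.sum(np.log1p(lam * z))

rng = np.random.default_rng(6)
x = rng.exponential(1.0, 80)
print(emp_loglik_ratio(x, x.mean()))   # 0 at the maximum EL estimate
print(emp_loglik_ratio(x, 1.3))        # approx. chi-square(1) under E X = 1.3
```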
(iv) Conditional Properties. We are not aware of any work in this direction, so here is a first stab at this. Let us suppose that the quasi-score and its derivative are correctly specified, in the sense that they coincide with the first two derivatives of the 'true' (unknown) log likelihood. In this case, the quasi and projective R statistics are, obviously, second order locally sufficient in the sense of McCullagh (1984). This, however, is not the case for empirical or dual R. In regular cases, the argument goes as follows: by the Hajek-LeCam convolution theorem (see, e.g., Hajek (1969)), $A = [\dot l, \dot l](\theta) + \ddot l(\theta)$ is first order ancillary. In view of McCullagh (1984), there is a $B$, $B = O_p(1)$, so that $\tilde A = A + B$ is second order ancillary. Hence, by McCullagh (1984),
$$\operatorname{cov}(\tilde A, \dot l(\theta)) + \operatorname{cum}(\tilde A, \dot l(\theta), \dot l(\theta)) = o(n),$$
and so
$$\operatorname{cov}(\tilde A, -[\dot l, \dot l](\theta)) + \operatorname{cum}(\tilde A, \dot l(\theta), \dot l(\theta)) = -\operatorname{var}(\tilde A) + o(n).$$
This expression is $o(n)$ only if $\operatorname{var}(\tilde A)$ is $o(n)$, which, under standard assumptions, translates into $\tilde A = O_p(1)$. Since $\tilde A = [\dot l_d, \dot l_d](\theta) + \ddot l_d(\theta) + O_p(1)$, this means in turn that
$$l_d(\theta + \delta) - l_d(\theta) = l(\theta + \mu) - l(\theta) + O_p(\mu^3),$$
where $l_d$ and $l$ are the dual and true log likelihood, respectively, and where $\mu = \delta - \delta^2/2$. Hence, by McCullagh (1984), the dual (and hence the empirical) R statistic is second order locally sufficient only if the dual likelihood coincides with the true one (and hence with the quasi-likelihood) to second order locally at θ. This leads, inter alia, to the conclusion that the dual R is unconditionally more accurate than the quasi-R, but conditionally less so! What if the second order structure is not known? One may then have a choice between overdispersed quasi- and empirical/dual likelihood. In this case, things are less clear. The issue is pursued in Lazar (1996). There are a number of other questions here. To mention a few: What about adaptive estimation? And the above only tackles second order local sufficiency. What about the large deviation properties documented in Skovgaard (1990, 1996), Jensen (1992, 1997) and Barndorff-Nielsen and Wood (1995)?
3 Conclusion
It has hopefully been illustrated in the above that there are a substantial number of unresolved issues in this area. Even more fundamentally, there are also more questions which need to be asked. If likelihood is the gold
standard, then what are the properties of likelihood anyway? New ones keep being discovered, as the rich recent literature on the subject can testify. And are there criterion functions yet to be discovered which come closer to the gold standard than the ones we have discussed?
Acknowledgements

This research was supported in part by National Science Foundation grants DMS 93-05601 and DMS 96-26266, and Army Research Office grant DAAH04-95-1-0105. The manuscript was prepared using computer facilities supported in part by the National Science Foundation grants DMS 89-05292 and DMS 87-03942 awarded to the Department of Statistics at The University of Chicago, and by The University of Chicago Block Fund.
References

Aalen, O. (1980). A model for nonparametric regression analysis of counting processes. Lecture Notes in Statistics 2, 1-25. Springer, New York.

Aalen, O. (1989). A linear regression model for the analysis of life times. Statistics in Medicine 8, 907-925.

Barndorff-Nielsen, O.E. (1986). Inference on full or partial parameters based on the standardized signed log likelihood ratio. Biometrika 73, 307-322.

Barndorff-Nielsen, O. E. and Wood, A. T. A. (to appear). On large deviations and choice of ancillary for p* and the modified directed likelihood. Bernoulli.

Beran, R. (1974). Asymptotically efficient adaptive rank estimates in location models. Ann. Statist. 2, 63-74.

Bickel, P.J., Klaassen, C.A.J., Ritov, Y. and Wellner, J.A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore.

DiCiccio, T.J., Hall, P. and Romano, J.P. (1991). Empirical likelihood is Bartlett-correctable. Ann. Statist. 19, 1053-1061.

DiCiccio, T.J. and Romano, J.P. (1989). On adjustments based on the signed root of the empirical likelihood ratio statistic. Biometrika 76, 447-456.

Godambe, V.P. (1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist. 31, 1208-1212.

Godambe, V.P. and Heyde, C.C. (1987). Quasi-likelihood and optimal estimation. Int. Statist. Rev. 55, 231-244.

Hajek, J. (1969). A characterization of limiting distributions of regular estimates. Z. Wahrsch. verw. Gebiete 14, 323-330.

Jensen, J. L. (1992). The modified signed likelihood statistic and saddlepoint approximations. Biometrika 79, 693-703.
Jensen, J. L. (1997). A simple derivation of r* for curved exponential families. Scand. J. Statist. 24, 33-46.

Kolaczyk, E.D. (1994). Empirical likelihood for generalized linear models. Statistica Sinica 4, 199-218.

Lazar, N. (1996). Some Inferential Aspects of Empirical Likelihood. Ph.D. dissertation, University of Chicago.

Lazar, N. and Mykland, P. A. (1996). Empirical likelihood in the presence of nuisance parameters. Technical Report no. 400, Dept. of Statistics, University of Chicago.

McCullagh, P. (1984). Local sufficiency. Biometrika 71, 233-244.

McLeish, D.L. and Small, C.G. (1992). A projected likelihood function for semiparametric models. Biometrika 79, 93-102.

Mykland, P.A. (1995). Dual likelihood. Ann. Statist. 23, 396-421.

Mykland, P.A. (1996). The accuracy of likelihood. Technical Report no. 420, Department of Statistics, University of Chicago.

Owen, A.B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75, 237-249.

Owen, A.B. (1990). Empirical likelihood ratio confidence regions. Ann. Statist. 18, 90-120.

Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations. Ann. Statist. 22, 300-325.

Sacks, J. (1975). An asymptotically efficient sequence of estimators of a location parameter. Ann. Statist. 3, 285-298.

Skovgaard, I. M. (1990). On the density of minimum contrast estimators. Ann. Statist. 18, 779-789.

Skovgaard, I. M. (1996). An explicit large-deviation approximation to one-parameter tests. Bernoulli 2, 145-165.

Stone, C.J. (1975). Adaptive maximum likelihood estimators of a location parameter. Ann. Statist. 3, 267-284.

Wedderburn, R.W.M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61, 439-447.
ESTIMATING FUNCTIONS IN SEMIPARAMETRIC STATISTICAL MODELS

S. Amari
The Institute of Physical and Chemical Research, Saitama, Japan

M. Kawanabe
University of Tokyo

ABSTRACT

The geometrical structure of estimating functions is elucidated by information geometry in the framework of semiparametric statistical models. A condition which guarantees the existence of an estimating function is given. Moreover, the set of all the estimating functions is obtained explicitly when it is not null. The optimal estimating function is derived, and the maximum Godambe information is explicitly given. A geometrical condition is given which guarantees that the Godambe information is the maximal available information.

Key Words: Semiparametric models; estimating functions; dual parallel transport; Godambe information; m-curvature free; efficient score function.
1 Introduction
Godambe (1960, 1976) proposed the estimating function method as a generalization of the maximum likelihood method for parameter estimation. An estimating function gives a √n-consistent estimator by a simple and tractable procedure under certain regularity conditions. Moreover, the method is applicable even to semiparametric models. However, the class of estimators derived from estimating functions might not necessarily include the Fisher efficient estimator. Therefore, it is important to study efficiency of estimators derived from the estimating function method. It is another important problem to know how we can obtain the optimal estimating function. Recently, research on estimating functions has developed and has been applied to semiparametric models. It has been naturally understood that the
optimal estimating function is given by projecting the score function to the linear space consisting of all the estimating functions (Small and McLeish (1989), Waterman and Lindsay (1996), Durairajan (1996), Chan and Ghosh (1996), Li (1996)). However, there still remain many important problems to be studied further. They are, for example, as follows:

1. To obtain a condition which guarantees the existence of estimating functions.
2. To obtain the linear spaces of all the estimating functions explicitly.
3. To obtain the amount of information, called the Godambe information, included in the optimal estimating function.
4. To obtain a condition which guarantees that the Godambe information is full, that is, equal to the maximal information in the general sense.

The present paper studies these problems by using the Hilbert bundle formalism of information geometry in the semiparametric context (Amari (1985), Amari and Kumon (1988)). This is a simplified version of the paper Amari and Kawanabe (1996), but it includes some new developments and a new example.
2 Estimating Functions in Semiparametric Models
Let p(x, θ, φ) be a probability density function of a random variable x with respect to a common dominating measure μ(dx), specified by two kinds of parameters θ = (θ^1, ..., θ^m) and φ, where θ ∈ Θ is a finite-dimensional vector, Θ is an open set of R^m and φ ∈ Φ is a finite or an infinite dimensional parameter, typically living in a space of functions. The set of distributions S = {p(x, θ, φ)} is called a statistical model with a nuisance parameter, and in particular a semiparametric statistical model when Φ is infinite-dimensional. Here θ is called the parameter of interest and φ is called the nuisance parameter. Let y(x, θ) = (y_i(x, θ)), i = 1, ..., m, be a vector-valued smooth function of θ, not depending on φ, of the same dimension as θ. Such a function is called an estimating function (Godambe (1976, 1991)) when it satisfies the following conditions:
$$E_{\theta,\varphi}[y(x,\theta)] = 0, \tag{2.1}$$
$$\det\big|E_{\theta,\varphi}[\partial_\theta y(x,\theta)]\big| \ne 0, \tag{2.2}$$
$$E_{\theta,\varphi}\big[\|y(x,\theta)\|^2\big] < \infty, \qquad E_{\theta,\varphi}\big[\|\partial_\theta y(x,\theta)\|^2\big] < \infty, \tag{2.3}$$
for all θ and φ, where E_{θ,φ} denotes the expectation with respect to the distribution p(x, θ, φ), and ∂_θ y is the gradient of y with respect to θ, i.e., the
SEMIPARAMETRIC MODELS
matrix whose elements are (dyi/dθi) in the component form, det | | denotes 2 the determinant of a matrix, and || y || is the squared norm of the vector 2 2 V > \\y\\ = ΣXί/ΐ) We further need that Jypdμ is differentiate with respect to θ and that integration and differentiation are interchangeable. The condition (2.1) is called the unbiasedness condition. When the above conditions (2.1) — (2.3) hold in a neighborhood N(φo) of φo, such y(x,θ) is called a local estimating function at ψQ. When an estimating function y(x, θ) exists, by replacing the expectation in (2.1) by the empirical sum, we have an estimator θ of θ by solving the estimating equation -,0)=O,
(2.4)
where a?i, ,a;n are n independently and identically distributed observations. This is called the estimating equation and such an estimator is called an M-estimator. It might be thought that the additive form (2.4) is too restrictive for obtaining good estimators. However, we shall prove that the Fisher efficient estimator is included in this class in the m-curvature free models. The asymptotic behavior of an M-estimator θ is known by the following theorem (Godambe (1976), McLeish and Small (1988) for example). Theorem 1 Under the ordinary regularity conditions, the estimator θ obtained from an estimating function y(x,θ) is consistent and is asymptotically normally distributed, with the asymptotic covariance matrix AV[Θ; y] = A-ιEθφ[yyΎ}{Aτ)-\
(2.5)
where the asymptotic covariance matrix is defined by AV[θ;y] = lir^nEgJiθ
- θ)(θ - Θ)Ύ],
(2.6)
A is the matrix defined by A =
Eθtφ[dθy(x,θ)]t
and the superfix T denotes the transposition of a vector or a matrix. Let T(θ) be a non-singular m x m matrix smoothly depending on θ. It should be noted that y*(x,θ) — T(θ)y(x,θ) gives an estimating function equivalent to y in the sense of yielding the same estimator. The present paper aims at obtaining the minimum value of AV[0;y]. Before that, we give two examples of semiparametric statistical models.
68
AMARI AND KAWANABE
Neyman-Scott problem and mixture models : Let {q(x,θ,ζ)} be a regular statistical model, where both the parameter of interest θ and the nuisance parameter £ are of finite dimensions. Let X{, i = 1,2, ,n, be n independent observations from q(xi >θ->£>%)-> where θ is common but ^{ takes a different value at each observation. Then, estimating θ from observations x = (xι, " >χn) is called the Neyman-Scott problem, where the underlying probability distribution
includes the nuisance parameters ξ 1 ? , ξ n as large as the number of observations. This problem can be treated by the following semiparametric model. Let us assume that the unknown ^ are independently generated subject to a common but unknown probability distribution having a density function φ(ζ) Then, the X{ are regarded as independent observations from the semiparametric model (2.7) where φ(£) is the nuisance parameter of function-degrees of freedom. This model is called the mixture model. This type of problems was studied by Neyman and Scott (1948) and has attracted many researchers (Andersen (1970), Lindsay (1982), Kumon and Amari (1984), Amari and Kumon (1988), Pfanzagl (1990) etc.). There are a lot of interesting and important examples in this class. A typical example is the following class of distributions of the form, q(x, θ, ξ) = exp{ξ s(x, θ) + r(s, θ) - ψ(θ, £)},
(2.8)
where s(x, θ) is a vector not depending on £ and is the inner product. Here, the distribution is of exponential type for £ when θ is fixed. 2. Blind separation of mixture signals : Let 5 α , α = 1,2, ,r, be r signal sources which produce r time serieses s α (t), t = 1,2, We assume that each sa{t) is an ergodic time series having the probability density qa{sa) at any ί. Moreover, si, , sr are assumed to be independent. Then, their joint probability is written as Ψ(S) = Π Qa(Sa)
(2.9)
α=l
at any t, where s = (s_1, ..., s_r). We assume that we cannot directly observe the r signals s_a(t) but
we can observe their mixtures,
$$x_i(t) = \sum_{a=1}^{r} M_{ia}\,s_a(t), \qquad i = 1, \ldots, r, \tag{2.10}$$
where M = (M_{ia}) is an r × r non-singular matrix consisting of fixed mixing coefficients M_{ia}. Then, the joint probability density function of x = (x_1, ..., x_r)' is given by
$$p(x) = |W|\,\psi(Wx), \tag{2.11}$$
where
$$W = M^{-1}. \tag{2.12}$$
If we know M or W, the original source signals s(t) are recovered from the observed x(t) by
$$s(t) = W\,x(t). \tag{2.13}$$
When we do not know M or W, we should estimate W from the observed x(t), t = 1, 2, ..., where the density functions q_1(s_1), ..., q_r(s_r) are usually unknown. Such a problem often occurs in medical or communication signal processing, and is called the blind separation of sources. See Amari et al. (1996). This gives a typical semiparametric statistical model,
$$p(x, W, \varphi) = |W|\,\varphi(Wx), \tag{2.14}$$
where W is the parameter of interest and
$$\varphi(s) = \prod_{a=1}^{r} q_a(s_a)$$
is the nuisance function.
3
Hubert Tangent Spaces and Score Functions
Given a probability density function p(x), let us consider a one-parameter statistical model p ( M ) = p ( * ) { l + tα(αO}, where t (0 < t < ε) is the parameter. The constraint Ep[a{x)} = 0
ΐ3-1)
70
AMARI AND KAWANABE
holds where Ep is the expectation with respect to p(#), because of fp(x,t)dμ(x) = ίp(x){l + ta(x)}dμ(x) = 1.
(3.2)
When t is small, p(x, t) is a small deviation in the direction of a(x) from p(x). The model (3.1) is a curve parameterized by t in the set of all the probability density functions. Let us consider the linear space of functions a(x) which satisfy
Ep[α(z)] = 0,
2
E p [{a(x)} ] < oo.
(3.3)
The set of all such a(x) is a Hubert space Hp with the inner product of a(x) and b(x) defined by
(3-4)
)Kx)]
The Hubert space Hp consists of all the deviations a(x) of probability distribution from p{x). The random variable (3.5)
a χ
( ) = T:
t=o
is the tangent vector of the curve (3.1) at p(x). This is the score function for the one-dimensional statistical model (3.1) parameterized by t. Given a semiparametric model S = {p(x, 0, <£>)}, we construct the Hubert space HQ denoting the set of all the deviations iτomp(x) = p(x, θ, φ). Since we have interest in estimating functions, we define it by 2
Hθψ = {a(x) \EθJa(x)} = 0, E ^ , [{a(x)} } < oo for all ψ'} restricting the space such that it consists of functions a(x) which are square f integrable at all p(x, 0, φ ) even when it is defined at (0, φ). The tangent directions along the parameter of interest are the score functions (x, θ,φ) = -^i logp{x, θ, φ).
(3.6)
Eβjμt] = 0
(3.7)
Ui
Obviously, and we further assume that U{ is square-integrable at any {θ,φ'). Then it belongs to HQ . We call the subspace spanned by these u^s the tangent subspace TQ along the parameter of interest. The vector score function is U = (lil,-
,ϋm).
We next define the tangent directions along the nuisance parameter. Let us consider a curve c(t) connecting functions ψ and φ' such that c(0) = ψ and
71
SEMIPARAMETRIC MODELS
c(ίo) = φr. We then have the one-dimensional statistical model p{x, 0, c(t)} parameterized by t. Its score function is given by
(3.8)
v(x,θ,φ,c) = — t=o
which we assume to belong to the HQ . This v is the tangent vector along c(t) of the nuisance parameter. There are infinitely many curves c(t) and the corresponding υ's, when φ is a function. Let τ£ be the smallest closed subspace including all such v's. We call it the nuisance tangent space. This is a closed subspace of HQ . Now, let us project the score function uι to the subspace orthogonal to T ^ , that is to (τ$
)
which is the orthogonal complement of T$ . The
result is the function uf = U{ — v that minimizes EQ [\u{ — v\2],
v e TQ .
E
The vector function u = (uf) is called the efficient score function and the uf are called the components of the efficient score function (see Begun et al. (1983), Amari and Kumon (1988), Small and McLeish (1989)). Let T$ be the subspace of HQ spanned by the components uf of the efficient score function. Let Tit be the orthogonal complement of TQ θ Tp . It is called the ancillary subspace and spans directions orthogonal to any changes in the parameter of interest and the nuisance parameter. We thus have the orthogonal decomposition of the Hubert space (see Amari (1987), Amari and Kumon (1988), see also Small and McLeish (1988)), (3 9)
H
θ«=τiφ®τL®τL
E
The matrix G
= (gfj) defined by using the efficient score function
(3.10)
g?j(θ,φ)=EθJufuf]
is called the efficient Fisher information matrix. Begun et al. (1983) proved that GE gives the Cramer-Rao bound of the asymptotic covariance of estimators 0, Ύ
lim nE \(θ - θ){θ - Θ) ] >
(GE)~1
(3.11)
for any asymptotically normally distributed unbiased estimators in a semiparametric model. There is, however, no guarantee that this bound is asymptotically attainable by choosing an estimating function even when φ is finitedimensional. (This bound is attainable when φ is finite dimensional by taking the joint m.l.e. (θ, ψ) of θ and ψ.) So we need to search for a new bound, called the Godambe information bound, attainable by an estimating function explicitly.
72
4
AMARI AND KAWANABE
Global Decomposition of Hubert Spaces
Let us temporarily fix a φo When φo is the true nuisance parameter,
gives a good estimator. However, uE{xi, θ, φo) is not an estimating function in general, because it does not satisfy the condition
An estimating function y{x,θ) should satisfy the unbiasedness condition (2.1) for all φ. Such a global structure is elucidated by introducing two parallel transports of the Hubert spaces along the nuisance space. Let a(x) be a random variable belonging to HQ . Let us fix 0, and consider the subset SQ = {p{x,θ,φ)\φ G Φ}. We define two parallel transports of a vector a(x) from HQ to HQ , (Amari (1987)). The following (e)
Uia(x)
= a(x)-Έθφt[a(x)],
(4.1)
are called the e-parallel transport and the m-parallel transport of a(x) from (#,<£>) to (0,<^/), respectively. The parallel transports are generalizations of the dual geometrical structures derived from the underlying e- and m-connections or e- and m-covariant derivatives (Amari (1985), see also Amari and Kumon (1988)), but we do not go into mathematical details of differential geometry. The following lemma shows an important property connecting the two parallel transports. The proof is immediate and hence is omitted. Lemma 1 The two parallel transports are dual in the sense that, for any two a(x),b(x) € HQ , the inner product is kept invariant when one is etransported and the other is m-transported to HQ ,,
Π^'Π^)
>
(4 3)
where the suffix {θ,φ) denotes that the inner product or the expectation is taken with respect to p(x, 0, φ).
SEMIPARAMETRIC
73
MODELS
It is remarked that an estimating function is e-invariant, (β)
(e)
because of (2.1) where J J operates componentwise. Now we rewrite the unbiased condition by using the parallel transport. Let us consider a curve ψ = φ{t), φo = <£(0), in the nuisance space. By differentiating (2.1) with respect to t along the curve φ = φ(t), we have
— Jp{x,θ,φ(t)}y{x,θ)dμ{x)
_
= J v{x, θ, φo}p{x, θ, φo}y{x, θ) dμ{x)
where
v= —\ogp{x,θ,φ{t)} t=o
is the nuisance tangent direction at φo along the curve φ(t). This holds for any ψo so that any estimating function y(x,θ) is orthogonal to v(θ,φ) at any point (θ,φ). However, from /(m)
(e)
ι(m)
v
( J » ) ^»)
v
/ θ,φ
'
(44)
/θ,φ 1
where v is a nuisance tangent direction at φ , the orthogonality condition at φ1 is transferred to that at φ by the m-parallel transports of v &t φ1. This shows that an estimating function y is orthogonal, not only to the nuisance tangent direction at any φ, but to the m-parallel transports from ψ1 to φ of the nuisance tangent directions at any φ'. To incorporate with this global structure, we define the enlarged nuisance tangent space Fg by
that is, FQ is the subspace of HQ spanned by the m-parallel transports of TQ , from φ1 to φ for all ψ1. It might occur that the m-parallel transport
74
AMARI AND KAWANABE
of a(x) e Tβ , does not belong to HQ because of
= OO.
E
θ,φ
In this case, just ignore it. We next project TQφ or Tξ
to the subspace orthogonal to F$ . The
resultant subspace is called the information subspace and is denoted by Fk . The subspace orthogonal to Fβ
and Ffo is called the shrinked ancillary
space FQ . We then have the following orthogonal decomposition of HQ
:
Obviously,
T' W
5
r- I?N
rpA
—v τ?A
Estimating Functions and Godambe Information
The decomposition (4.5) of the Hubert space HQ makes it possible to characterize the set of all the estimating functions. We first show an important lemma (see Amari and Kawanabe (1996) for the details of the proof). Lemma 2 A necessary and sufficient condition for a function w(x, θ) to be e-inυariant is that it belongs to Fk Θ FQ for some φ. Proof
When w(x,θ) is e-invariant, we have Eθ%φ[w(x,θ)]=0.
We have already shown that w belongs to FQ 0 FQ
for any φ in this case.
On the other hand, let w(x, θ) be a function belonging to FQ
θ FQ .
In order to show that it is e-invariant, we consider a path φ = φ(t) in the nuisance space and put f(t)=EθMt)[w(x,θ)]. Obviously, /(0) = 0 where φ(0) = <po> By differentiating this with respect to ί, we can prove that
for any t (see Amari and Kawanabe (1996)), showing that f(t) = 0 for any t. It is also proved that, when w belongs to FQ ®FQ for a <ρ, it automatically
belongs to Ffo θ Ffi at all φ'.
Π
75
SEMIPARAMETRIC MODELS
Lemma 3 Let y{x,θ) be an estimating function and let y((x^θ) be the projection of the i-th component ofy(x, θ) to Fk . Then y/(#, 0), i — 1, , ra, span Fk
at any φ.
Proof
By differentiating (2.1) with respect to 0, we have []
=0
at all φ, where (u,y) is a matrix whose elements are {u^yj). Prom this, we have where the components u{ of the information score u1 are the projections of the components of the score u to Fk . When y is an estimating function, this is non-degenerate from (2.2), proving the lemma. D Combining the above lemmas, we have the following fundamental theorem, which gives the set of all estimating functions. Theorem 2 Any estimating function y{x,θ) posed at any φ as a sum
= {yi(x>θ)} can be decom-
y(x, θ) = T{θ, φ)ur(x, θ, φ) + a(x, θ, φ),
(5.1)
where the component ai(x,θ,φ) of a belongs to Fn and T(θ,φ) is a nonsingular matrix. Conversely, any function y{x,θ) defined in the form of (5.1) at a fixed φo gives an estimating function provided the projections of the components yι{x,θ) to Fk , span Fk , at every ψ1. It is possible to choose a basis for the information scores such that T(0, ψ) becomes the identity at a φ. The theorem also shows a condition for the existence of an estimating function. Theorem 3 A local estimating function at φo exists when and only when Fk is non-degenerate, that is, m-dimensional. A necessary condition for the existence of a global estimating function is that Fk
is non-degenerate
at all φ. Proof The necessary condition follows immediately from (2.2) and (5.1). When Fk is non-degenerate, u\x^ 0, φo) is a local estimating function in a neighborhood of φoD Remark
When we treat local estimating functions, the definition of FQ
and hence the decomposition (4.5) should be defined locally.
76
AMARI AND KAWANABE
We now derive the optimal estimating function and the amount of information (Godambe information) derived thereby. Let us define G\θ, ψ) = EQJU^X, θ, φHu'ix, θ, φ)}τ].
(5.2)
Given y, we also define GA(θ,φ;a) = EθJaaT],
(5.3)
where any estimating function can be decomposed as y(x, θ) = u*(x, θ, φ) + a(x, θ, φ) where α G Ffi . It is immediate to show
Hence, we have the following result. Theorem 4 The asymptotic covarίance matrix derived from an estimating function is AV[£; y] = ( G 7 ) - 1 + (GI)~lGA{GIy1. (5.4) The estimating function u^x^θ^φo) where φo is fixed is the optimal estimating function at φo and the Godambe information is given by G1.
6 Curvature-freeness The information score u1 is different from the efficient score uE in general, and GE > Gι. (6.1) The quantity
GE - G1 = EθJ(uE - u1)^ - uψ]
(6.2)
is the loss of information caused by using estimating functions. However, in many cases, uE — u1 and GE = G1 for any θ and ψ. In this case, the estimating function method is fully efficient, if the optimal y is chosen. When does this happen? To answer this question, we consider the statistical submodel SQ = {p(x, θ, φ)} by fixing θ where φ G Φ is the only free parameter. The tangent vectors of SQ compose the nuisance tangent space Γ ^ . Let us consider the m-parallel transports of TQ , from (0, φ') to (0, φ) and see how it is
77
SEMIPARAMETRIC MODELS
different from TQ . A manifold in general is said to be flat or curvature-free when its tangent directions are the same at all the points. In the present case, we can compare two tangent spaces TQ and TQ , by the m-parallel transport of one to the other. We give formal definitions of m-flatness and m-information-curvature-freeness. Definition 1 A semiparametric statistical model S is said to be m-flat or m-convex, when the TQ , are invariant under the m-parallel transports, that is, (m)
63
Λ^ί
( >
for any φ, φ' and θ. When the m-parallel transports °fTn
, from (θ,φ') to
(θ, φ) does not include the TQ components for any ψ, ψ' and θ, that is, (m)
X ^ ί
)
(6.4)
the model S is said to be m-curυature free in the information directions, or shortly, m-information-curvature free. It is easy to see that, when S is m-flat, it is m-information-curvature free. When Sβ is not m-flat, SQ is curved in general, because its tangent directions change as ψ changes. Theorem 5 When S is rn-information-curvature free, GE = G1 for any θ and φ. Moreover, y(x,θ) — u^x^θ^φo) = uE(x,θ,φo) is the optimal estimating function at φo and is efficient at φo. It should be noted that most semiparametric models so far treated by many researchers are m-flat. The important role of the m-flatness in the estimation function method is noted by Amari and Kumon (1988), Amari (1987), and also by Bickel et al. (1993) under the name of convexity. The present result shows that the m-information-curvature freeness is essential, establishing a necessary and sufficient condition that the estimating function method is fully efficient. However, the optimal estimating function depends on the true φ so that there is still a serious problem of choosing a good estimate φo from observed data to derive a good estimating function. It is a merit of estimating functions that, even if we misspecify the true φ and choose a wrong <£>o, the estimator is still -y/n-consistent. A practical method of choosing a good estimating function is given by Amari and Kawanabe (1996).
78
7
AMARI AND KAWANABE
Examples
Example 1. Mixture model A mixture model is m-flat and is hence ra-information-curvature free. In particular, when a model is given by (2.7) and (2.8), the 0-score, the information score u 7 , and the nuisance score in direction α(ξ) can be calculated explicitly. The 0-score u is given by 1 r u = ——-—- / (ΘQS ξ + dnr — dβψ)φ(ξ) exp{£ s + r — ψ}dξ. p\x,u,φ) J
(7.1)
Noting that the conditional distribution p(ξ\s) of £ conditioned on s is written as
Pttk) =
™'~™
ψ
> ,
(7.2)
J φ(ξ)exp{ξ s-ψ}dξ the 0-score may be written as u = dθs EK|θ] + dθr - E[dθψ\s], (7.3) where E[ \s] is the conditional expectation. Similarly, the nuisance score in the direction of α(ξ) is given by (7.4) Therefore, v[a] depends on x only through s so that the nuisance subspace Tβ is generated by the random variable s(x,0). It is known that the projection of a random variable t to the space generated by Si is given by the conditional expectation E[φj] and the projection to the orthogonal complement is t — Έ[t\si\. Hence, the efficient score, which is the same as the information score in this case, is given by u1 = uE
=
u-
E[u|β]
s)l
(7.5)
where the vector notation should be understood appropriately. This gives the efficient estimating function. Example 2. Blind separation of mixture signals In order to assure the identifiability, we put further restrictions E[8a] = 0 2
E[(sa) } = 1
(7.6) (7.7)
SEMIPARAMETRIC
79
MODELS
for the source distributions (α = 1, are
, r). Then, the score functions (matrix)
U =
where
In the component form, the score functions at the distribution (W,φ) are uιa(x, W, φ) = lfa(sa)xi
-f MQI
(7-8)
where Mιa are the components of the mixing matrix M. We next search for the nuisance scores. Let us consider a small change in the form of ψ. We can write it as qa(s,τ)=qa(s){l + τaa(s)},
(7.9)
where r is the parameter to denote the changes of functions qa in the direction of α α (θ). The score function in the direction of the nuisance parameter φ in the direction of α = (aa) is given by the score function υ{a)
=
— ,
\ogp{x,W,φ{s,τ)} , r=o
(7.10)
The linear space spanned by the functions υ(a) is called the nuisance tangent space, T # ι V = span{v(α)}. (7.11) We then have the important observation that this model is m-informationcurvature free. So the information score is given by the efficient score. The information score is given by 2
F^φ = span{ζ(s α )s 6 , (s α ) - casa - 1},
(7.12)
where ca = E[(s α ) 3 ]. The above consideration gives a very effective learning algorithm to this problem of blind signal separation. See Amari and Cardoso (1997) for details.
80
AMARI AND KAWANABE
References [1] Amari, S. (1985). Differential-Geometrical Method in Statistics. Lecture Notes in Statistics Vol.28, Springer, New York. [2] Amari, S. (1987). Dual connections on the Hubert bundles of statistical models, in Dodson, C.T.J. (ed.) Geometrization of Statistical Theory. 123 - 152. ULDM, Lancaster. [3] Amari, S. and Cardoso, J. -P. (1997). Blind source separation - Semiparametric statistical approach. IEEE Trans, on Signal Processing, accepted. [4] Amari, S., Cichocki, A., and Yang, M. (1996). A new learning algorithm for blind signal separation, in NIPS'95 Vol.8, MIT Press, 757-763. [5] Amari, S. and Kawanabe, M. (1996). Information geometry of estimating functions in semiparametric statistical models. Bernoulli, 3, 29-54. [6] Amari, S. and Kumon, M. (1988). Estimation in the presence of infinitely many nuisance parameters — geometry of estimating functions. Ann.Statist. 16, 1044-1068. [7] Andersen, E. B. (1970). Asymptotic properties of conditional maximum likelihood estimators. J. R. Statist. Soc. B 32, 283 - 301. [8] Begun, J. M., Hall, W. J., Huang, W. M., and Wellner, J. A. (1983). Information and asymptotic efficiency in parametric-nonparametric models. Ann. Statist. 11, 432 - 452. [9] Bickel, P. J., Klaassen, C. A. J., Ritov, Y , and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore. [10] Chan, S. and Ghosh, M. (1996). Estimating functions: a linear space approach. Symposium on Estimating Functions, Univ. of Georgia. [11] Durairajan, T.M. (1996). Optimal estimating function for estimation and predictions in semiparametric models. Symposium on Estimating Functions, Univ. of Georgia. [12] Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist. 31, 1208 - 1212. [13] Godambe, V. P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika 63, 277 - 284.
SEMIPARAMETRIC
MODELS
81
[14] Godambe, V. P. (ed.) (1991). Estimating Functions. Oxford University Press, New York. [15] Kumon, M. and Amari, S. (1984). Estimation of a structural parameter in the presence of a large number of nuisance parameters. Biometrika 71, 445 - 459. [16] Li, B. (1996). A minimax approach to consistency and efficiency for estimating equations. Symposium on Estimating Functions, Univ. of Georgia. [17] Lindsay, B. G. (1982). Conditional score functions : Some optimality results. Biometrika 69, 503 - 512. [18] Lindsay, B. G. (1985). Using empirical partially Bayes inference for increased efficiency. Ann. Statist. 13, 914 - 931. [19] McLeish, D. L. and Small, C. G. (1988). The theory and applications of statistical inference functions, Lecture Notes in Statistics 44. Springer, New York. [20] Nagaoka, H. and Amari, S. (1982). Differential geometry of smooth families of probability distributions. Technical Report 82 - 7, University of Tokyo. [21] Neyman, J. and Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica 32, 1 - 32. [22] Pfanzagl, J. (1990). Estimation in Semiparametric Models: Some Recent Developments. Lecture Notes in Statistics Vol.63, Springer, New York. [23] Small, C. G. and McLeish, D. L. (1988). Generalization of ancillarity, completeness and sufficiency in an inference function space. Ann. Statist 16, 534 - 551. [24] Small, C. G. and McLeish, D. L. (1989). Projection as a method for increasing sensitivity and eliminating nuisance parameters. Biometrika 73, 693 - 703. [25] Waterman, R. P. and Lindsay, B. G. (1996). Projection score methods for approximating conditional scores. Biometrika 83, 1 - 13.
83
Institute of Mathematical Statistics
LECTURE NOTES — MONOGRAPH SERIES
Estimating Functions, Partial Sufficiency and Q-Sufficiency in the Presence of Nuisance Parameters Vasant P. Bhapkar University of Kentucky Abstract When there exists a statistic which has its distribution free of nuisance parameters, the optimality of the marginal score function can be investigated in the context of generalized Fisher information for parameters of interest. In the case of a partially sufficient statistic, i.e. a statistic sufficient for parameters of interest, the marginal score function is the optimal estimating function. With the new concept of q-sufficiency for parameters of interest, the marginal score function is operationally equivalent to the optimal estimating function. AMS Subject Classification: 62A20, 62F99 Key Words: Generalized Fisher information matrix, completeness of a family, partial sufficiency.
1
INTRODUCTION
The optimality of the conditional score function as an estimating function for the parameter of interest, in the presence of unknown nuisance parameters, was established by Godambe (1976) in the situation where there exists a statistic which is ancillary for the parameter of interest and which is also complete for nuisance parameters. Such a statistic has been termed a complete p-ancillary statistic in an earlier paper (Bhapkar, 1989). Refer to Liang and Zeger (1995) for a review of estimating functions theory, some discussion of optimality of the conditional score function under the conditions assumed by Godambe, and references to further work (for example, Lindsay 1982) to find approximately optimal estimating functions in more general situations. The analogous question concerning optimality of the marginal score function in the complete case, has not yet been satisfactorily resolved. Lloyd (1987) considered this question; however, his assertion of optimality of the
84
BHAPKAR
marginal score function has been shown recently by Bhapkar (1995) to be invalid. In a special case where the two appropriate components of the sufficient statistic happen to be independent, the marginal score function does turn out to be optimal i.e., the property (6.2) holds. In such a special case, the appropriate component possesses the property of p-sufficiency, symbolized by the relation (6.1); this p-sufficiency then implies the optimality of the marginal score. However, in the general case where the two components of the sufficient statistic are not necessarily independent, the property of p-sufficiency does not necessarily hold (as shown by the counter-example by Bhapkar, 1995). This paper now establishes a certain weaker property, referred in this paper as q-sufficiency; this property is symbolized by the relation (7.7). For example, in the case of a random sample X = (X\,..., Xn) from a normal population with mean μ and variance σ2, T = X is a complete pancillary statistic for σ 2 and 5 = Σi(Xi ~^)2 i s p-sufficient for σ 2 . Here S satisfies the property (6.1). However, in examples (4.1) and (5.1) S satisfies only the weaker relation (7.6), but not (6.1). Thus, here the marginal score function of 5 satisfies only the weaker optimality relation (7.7), but not (6.2), which is satisfied in the first example. In both these examples, the family of distributions of Γ, given S = s and 0, is complete for the nuisance parameter
φ. The optimality of marginal score function, in its weaker or stronger forms, is shown to be related to certain generalized Fisher information functions. Section 2 introduces the basic terminology, Section 3 considers information in a statistic (or its marginal distribution) as well as the information in the conditional distribution given the statistic. Section 4 discusses the special case where the marginal distribution of a statistic is free of the nuisance parameters, while Section 5 deals with the optimality of estimating functions. The case where the statistic happens to be sufficient for parameter of interest (i.e. p-sufficient) is described in Section 6 while the last section establishes a somewhat weaker property (viz. q-sufficiency) that holds in general in the complete case, and in another specific situation.
2
INFORMATION IN ESTIMATING FUNCTION
Suppose that the random variable X has probability distribution Pω and the probability density function (pdf) p(x; ω) with respect to a α-finite measure μ over the measurable space (χ,A). The parameter wGίl and assume that ω = (0, φ), where θ is the parameter of interest, φ is the nuisance parameter and 0, φ are variation independent in the sense that Ω = Θ x Φ, where Ω is an open interval in the d-dimensional space Rd, and d\ is the dimensionality
85
PARTIAL SUFFICIENCY oΐθ.
Let l'ω{X) = [dlogp(x]ω)/dω], be the row-vector of ω-scores, and its components [lβ(X),l'φ(X)] are referred to as 0-scores and φ-scores of X, respectively. We assume the standard Cramer-Rao type regularity conditions: R:
i. J p(x;ω)dμ(x) = 1 can be differentiated twice under the integral sign with respect to elements of ω; ii. The Fisher information matrix I(ω) — Eω[lω(X)l'ω(X)] definite (pd).
is positive
If θ is real-valued, g — g(x\ θ) is said to be a regular unbiased estimating function (RUEF) for θ if it satisfies the regularity conditions: RG:
i. Eωg{X;θ)=0,Eωg2{X;θ)
< oo;
ii. f g(x;θ)p(x;ω)dμ(x) = 0 can be differentiated under the integral sign with respect to elements of ω. Operationally, g(x; θ) = 0 is to be solved to produce an estimate θ = θ(x). In the d\-dimensional case, we shall still use the term estimating function for a real-valued function g = g(x; θ) if it satisfies conditions RQ', however to solve for θ we need an estimating equation g{x',θ) = 0, all elements of which satisfy conditions RQ. Furthermore, in order to ensure that the equation leads to a solution θ = θ(x), we also need the conditions: (a) the covariance matrix σg(ω) = Eω[ggf] is pd. (b) G(ω) = Eω[dg{X;θ)/dθ]
is nonsingular.
In any case, we denote by G the space of real-valued functions satisfying the regularity conditions RQ The generalized Fisher information for 0, when ω = {θ^φ) is the full parameter in the distribution Pω of X is defined (Godambe, 1984) in the one-dimensional case as 2
IG{θ;ω) = minEω [lθ(X) - u(X ω)} .
(2.1)
uζU
Here U is the space of real-valued functions u = u(x\ ω) such that (i) Eωu(X\ω) (ii)
= 0, Eωu\X\ώ)
< oo
(2.2)
Eω[u{X ω)g(X; θ)] = 0, for all g E G.
On the other hand, the information concerning θ in the RUEF g = g(x\ θ) is defined as
86
BHAPKAR
gι is then said to be more efficient than g if Igi(θ\ω) > Ig(θ\ω) for all ω with strict inequality for at least one ω. We have, then, the generalized information inequality (Godambe, 1984) (2.4)
Ig(θ;ω)
for all g EG. The multi-dimensional analogs of (2.1), (2.3) and (2.4) are (see, Bhapkar, 1989): IG(θ;ω)
= mmEω[lθ(X)-u][lθ(X)-u}'
Ig(θ;ω) Ig(θ;ω)
= G'(ω)σg (ω)G(ω), < IG(Θ]ω).
1
,
(2.5)
(2.6) (2.7)
Here, of course, A < B for non-negative-definite (nnd) matrices means B — A is nnd, min denotes the minimal matrix M* = M(u*) in the class of nnd matrices {M(u)} such that M{u) > M* for all tx, and u denotes a vector with all elements u G ZY, which satisfy (2.2). It has been proved (Bhapkar and Srinivasan, 1994) that IQ exists; in fact IG(θ',ω) = Eω[g*g*'},
(2.8)
where g* is the projection of IΘ{X) onto the space spanned by G. This G-form indeed generalizes the usual Fisher information matrix in the sense that ; θ) = I(θ) = Eθ[lθ(X)l'θ(X)]. (2.9) In order to prove (2.8) and the results in this paper we find the Hubert space technique quite useful. Another motivation for this is the need to tackle the identifiability problem posed by the distinction between a function involving parameter 0; e.g. g = g{x\θ) and a function involving full parameter ω, e.g. u = n(x α ), for the given ω. To treat ω as a variable, along with x, Bhapkar and Srinivasan (1994) introduced an arbitrary probability distribution π over the measurable space (Ω,/3) for "random variable" ω, and consider the joint distribution Pω x π of (X,ω) over
(Xxίϊ).
Thus in the Appendix the space C of real-valued functions c = c(x; ω) is defined such that 1. Ec(X]ω) = /c(x',ω)p(x;ω)dμ(x)dπ(ω) 2. Ec2(X]ω)
=0
87
PARTIAL SUFFICIENCY
Here we write E for joint expectation, while we write Eω for expectation, given ω. The technical discussion in this context, along with proofs where necessary, is given in the Appendix, in order to provide validity to statements in sections 4, 5 and 7 of the text of the paper, which specialize the general results for arbitrary π to the one-point distribution π at ω. Thus Cω denotes the space of real-valued functions c = c(x; ω) such that 1. Eωc(X\ώ) ΞΞ f c(x\ω)p{x\ώ)dμ{x) = 0 2. Eωc2(X;ω)
Similarly, Q denotes the closure G of the subspace G of functions g = g(x\ 0), which depend on ω only through θ. Then Qω denotes the space with π a one-point distribution at ω. The orthogonal complement of Qω in Cω is then
3
INFORMATION IN A STATISTIC
Suppose now (5, T) is a minimal sufficient statistic for the family {Pω ) : ω G Ω} of probability distributions Pύ ' of X. Without loss of generality, we replace hereafter X by (S,T) in view of the anticipated result stated here as a lemma (Bhapkar, 1991): Lemma 3.1 Under regularity conditions R for the distribution of X, and conditions RQ, I{GX\θ;ω)
= I{GS'T)(θ;ω).
(3.1)
Now we want to investigate the information contained in S alone, in relation to (5,T); similarly we want to find out the maximum information, concerning 0, for estimating function based on S alone, in relation to such maximum information in the functions based on both S and T. Let then /(s; ω) denote the marginal pdf of 5 with respect to measure ι/, and h(t\ ω\s) the pdf of conditional distribution of Γ given 5, the value of 5, with respect to measure τ?s, when (5, T) has pdf p(s, t; ω) with respect to μ. In view of regularity conditions R for p, we assume the following conditions for / and h: β*:
1. f f(sm,ω)dv(s) = 1 can be differentiated, twice under the integral sign, with respect to elements of ω; 2. f h(t;ω\s)dηs(t) = 1 can be differentiated twice with respect to elements of ω under the integral sign for almost all (y)s\
88
BHAPKAR 3. The Fisher information matrix J ( 5 ) (α;) = Eω[lω(S)l'ω(S)] exists where l'ω(S) = [d\ogf{s\ω)/dω\ is the marginal ω-score of 5; 4. For almost all (v)s, the information matrix of T, given s, viz I™{ω) = Eω[lω(T\s)l'ω(T\s)] exists where lω(T\s) = [dlogh{T;ω\s)/dω] is the conditional ω-score of T.
Let Cω denote the Hubert space of real functions c = c(s,t\ω) such that Eωc = 0 and Eωc2 < oo (see Lemma A.2), and Qω = Gω the closed linear subspace, where Gω is the set of functions g = g(s, t; θ) in Cω, which satisfy regularity conditions as in RQ. Hereafter we drop the subscript ω from Cω<>GωMω >- for simplicity of notation. Thus, we have (3.2)
C = G®U
where U is the orthogonal complement of Q in C; the elements u = u(s^ t\ ω) are orthogonal to g G Q i.e. Eω[gu] = 0 for all g G GNow we denote by C(S) the subset of functions c = c(s ω) in C, which depend on (s,t) only through 5. Similarly G{S) = G(S) denotes the closure of G(S), the subset of G depending on (s,t) only through s. Thus, G{S) = GnC{S). Note that C(S) is itself a Hubert space, which is a subspace of C, and the inner-products (or norms) for both spaces coincide for cχ,C2 G ^(5), in view of the fact that
G(S) is a closed linear subspace of C(S) and we denote the orthogonal complement of G{S) by V(5); thus
C(S)=£(S)ΘV(S).
(3.3)
Observe that V(S) contains both functions u — u(s\ω) in ZY, not depending on ί, and functions v = u(s;ω) = Eω[u(s,T]ω\s] for the remaining u in ZY, since Eω[gu] — 0 for all g eG(S) for any uGW. We can then define the generalized Fisher information matrix for 5, with respect to parameter of interest θ when ω is the full parameter, as = mmEω[lθ(S) in analogy with (2.5).
- u][lβ(S) - «/]',
(3.4)
89
PARTIAL SUFFICIENCY
Similarly, for each value s of 5, we consider the conditional distribution of T, given s, and define the Hubert space Cs of real functions cs = cs(t;ω) satisfying Eω(cs\s) = 0, Eω(c2s\s) < oo, (3.5) in view of Proposition A.2. The closed linear subspace £ s = (5S, and its orthogonal complement ys in Cs, are defined like Cs above; thus we have (3.6)
Cs = Gs®ys,
for each value 5 of S. The generalized information in the conditional distribution Pω \ given s, is defined now as
- ys] [lθ(T\s)
- ys]'\s} ,
(3.7)
in analogy with (2.5). Finally we define the subspace C(T\S) in C of functions c(s,£;u;) = cs(£;ω), G(T\S) as the subspace in G of functions (g(s,t]θ) — #s(£;0), G{T\S) = G i r l S ) , and y{T\S) as the subspace in C(T\S) of functions y(s,t\ω) — ys(t;ω). Thus these functions satisfy (3.5) a.e. (Pi ; ) and, thus, Q (T |S) C C{T \S) and {y(T \S) C C{T \S)). We also note that {G{T \S)
= gnc(τ\s)). We now have the following lemma from Bhapkar and Srinivasan (1994), which is the special version of Lemma A.3. Lemma 3.2 C = C{S)®C{T\S)
.
(3.8)
PROOF It is easy to see that if c E C(T \S), then c _L C(S). We want to prove the converse that if c _L C(5), then c G C(T \S). Let then c ± C(5), and Eω ( φ , T ; ω ) \s) = c*(β;ω). Since c* 6 C(5), and c ± C(5), we have Eω(cc*) = 0. But Eω{cc*) = ^ [c*(5;ω)β(l,(c|Sf)] = J5ω[c*2(5;ω)]. Hence Eω{c*2) = 0, which implies c*(5;ω) = 0, a.e. ( P ^ ) . Thus, c G C(T 15) and the lemma is established. Note that the decomposition of the space C into C(S) and C(T \S) corresponds to the fact that, for any ceC, Eω{c\s) E C(S), while c - Eω(c\s) E C{T\S). In view of Lemma A.4 we have C{T\S) = Q{T\S) Θ y{T\S). The generalized information for θ in the conditional distribution of T, given 5, is defined as (0; ω) = ^ e m m | s ) £ ω [i,(T |5) - y] [Iβ(T \S) - y]' .
(3.9)
90
BHAPKAR
Lemma 3.3 In view of (3.7) and (3.9), {
lS)
(3.10)
I G (θ;ω)>Eω[IG(θ;ω\S)] PROOF. The proof is immediately obtained by noting
min Eω \[as - ys] [as with a(s, t\ ω) = lθ{T \S) and as = lθ(T \s). We are now in a position to prove the following theorem; Theorem 3.1 Assume the regularity conditions R* for the marginal distribution of S and the conditional distribution of T, given S. Then
> i£V;") + -rgΊ5)(0;ω);
(3.u)
ω) > I(g)(θ;ω) + Eω[IG(θ )ω\S)}.
(3.12)
(i) ifτ\θ-,ω) and
(ii) I^Hθ
PROOF. Let n(u) = lo{S,T) - ix, m(i/) = lθ{S) - v and r(w) = IQ{T \S) — iϋ, where 1/(5; ω) = Eω[u \s] and w = u — v for any u eU (i.e. every element of u belongs to U). Then v 6 V(5), since u _L Q{S) implies E^ftep] = Eω\gu] = 0 for all g e G(S). On the other hand, if g = 0 is the only g in ^(5), then 1/ 6 V{S) in view of (3.3). Also, for all g e G{T |5), Eω[gw] = £?ω[flf(tι - v)] = ^[ptt] —JSα b^] = 0? a n d then tϋ E [V(T |5) in view of Lemma A.4. It is also true that w e y(T\S) if g = 0 is the only g in G(T\S). Now n(ϊz) = m(i/) + r(w) and we have Eω[n{u)n'(uj\ = Eω[m(u)m'{u)\ +
Eω[r(w)r'(w)].
Since we have 1/ G V(*S) and iϋ E y{T \S) for every tx € W, it follows that mmEω[n(u)ri(u)]
> min. Eω[m(u)mf(u))
+
min
i.e. inequality (i) holds. The inequality (ii) follows immediately from Lemma (3.3). Thus the theorem is established. Remarks, (a) The relation (i) would be a strict equality if there exists u — u* for which Eω[u* \s] = v* minimizes Eω[m(υ)m'(υ)] for v € V(S) and simultaneously y* = u* — υ* minimizes Eω[r(y)rf(y)] for y E y(T \S). The relation (ii) would be a strict equality if, furthermore, y* = y*s(t;w) where y* minimizes Eω[r(ys)rf(ys) \s] for ys E ys a.e. (Pώ )•
PARTIAL SUFFICIENCY
91
(b) The inequality (3.12) was earlier given as Theorem 4.2 in Bhapkar (1991), subject to the condition that Ic{θ\ω) exists. Since existence of such a generalized information matrix has been established by Bhapkar and Srinivasan (1994), such an existence qualification is no longer needed. Furthermore, the present proof is mathematically more rigorous. (c) For the special case discussed in next section, when S has distribution depending on ω only through 0, VG' (0; ω) reduces to 1^ (θ) in the inequality (3.12). Such an inequality was earlier established as Theorem 3.1 in Bhapkar (1991) subject to existence qualification. The previous comments (a) and (b) apply to this case as well. (d) The fact that we need a possible inequality in (3.11), rather than a strict equality, is seen from the following example: Suppose X = (X\1X2)1 where X\,X2 are independent N(φ, θ) variables. Then IG{Θ; ω) = 1/202, but forS = Xi,T = X 2 , wehave4 5 ) (0;ω) = 0, and J™(0;u;) = I{P{θ;ω) = 0 as seen from Example 4.2 Bhapkar (1991).
4
SPECIAL CASE
Now we deal with the special case where the marginal distribution of S depends on ω only through θ. Then every element of the marginal score function of 5, viz IQ(S) is a RUEF in view of condition R*{ϊ). We can now characterize all the RUEF in G in view of Lemma A.6 in the Appendix. Theorem 4.1 Assume that the joint distribution o/(5, T), given ω, satisfies the regularity conditions R* in Section 3, the marginal distribution of S depends only on θ, and the estimating functions g = g(s, t; θ), for parameter of interest θ, satisfy the regularity assumptions RQ Then any RUEF in G can be expressed in the orthogonal decomposition: g(s, t; θ) = b'(s, t; θ)lθ(s) + «>(*; θ)
(4.1)
where Eω[b(s,T]θ) \s] = α(ω), and g0 e G0(S), the set of RUEF in g(S) uncorrelated with IΘ{S). Remark (a) Although we are assuming here that ω is fixed (i.e. π is a one-point distribution at the given α;), as in most of the text, the strict validity of the statements with ω = (θ, φ) as variables follows from the proofs given in Appendix (e.g. Lemma A.6 for Theorem 4.1) where (5,T;ω) are variables. (b) Observe from Lemma A.6 that Q = Q{S,T) Θ Go(S), where Q(S,T) is defined by (A.10), and QX(S) C G{S,T). In particular, if G(T\S) is empty, G is not necessarily equal to G(S). See Example 4.1 below in this regard.
92
BHAPKAR
Example 4.1. Let X = (Xχ,...,X n ) be independent pairs Xi — (Yi,Zi), where Y{,Z{ are independent exponentially distributed variables with means φ and φθ respectively. Letting Y = £ ^ Y{ and Z = Σ% Z% ? ( ^ ^0 is minimally sufficient for ω = {θ,φ). Equivalently, (S,T) is a minimal sufficient statistic with S = Z/Y and T = Z. The marginal distribution of S is
free of >, and the conditional distribution of T, given s, has 2n-l
t
-t/δ(s)
e
where 5(s) = sθ φ/(s + θ). Here G(T \S) is empty (in view of completeness of Γ, given s, for (/> with fixed θ. We have lθ(s) = n(s - θ)/θ{s + θ). But there exists g = g{s,t]θ) =
t(s-θ)/θ2s
5
= b(s, t; θ)lθ(s) where 6(5, t; 0) = t{s+θ)/nθs so that E^ftls = 20.
OPTIMALITY OF ESTIMATING FUNCTIONS
A RUEF g* = #*(s, t; θ) in G is said to be optimal (Godambe and Thompson, 1974) for estimating the one-dimensional parameter θ, in the presence of unknown 0, if Ig*(θ;ω) > Ig(θ\ω) for all g G G, where Ig is the information concerning 0, defined by (2.3). The matrix analog of (2.3) is given by (2.6) for the case of multi-dimensional θ. Now we wish to consider the marginal score function of 5, when the distribution of S is free of the nuisance parameter φ. The general form of j G G has been shown to be (4.1) in this case. For simplicity of proofs here we consider the case of one-dimensional θ. We now show that go{s;θ) is the non-informative part of g in (4.1) in the sense that the information Igo(θ;ω) is zero; also if g(s, t\ θ) = g*(s, t; θ) + go(s-, θ) in (4.1), then Ig(θ; ω) < Ig* (0; ω) with strict inequality unless go(s; θ) ΞΞO.
In view of Theorem 4.1, g0 is a RUEF belonging to G(5), and it is uncorrelated with IΘ(S). Differentiating /go(s; θ) f (s; θ)dιy(s) = 0 with respect to θ under the integral sign we have
Hence Igo(θ;ω) = 0 in view of (2.3).
93
PARTIAL SUFFICIENCY
More generally, for g = g* + g0, where Eω[g*g0] = 0, we have Eω[g2] = Eω[g*2] + Eω[g2]. Also Eω[dg/dθ] = Eω[dg*/dθ]
+
Eω[dg0/dθ] = Eω[dg*/dθ].
Hence >
Λ
(Θ
IAΘM
2
~
Thus we have proved
Eω[g* ]
* K^ΪTEM)=l9iθ;ω)
τ ( a
λ
Proposition 5.1 If g(s,t\θ) = g*(s,t',θ)+go{s 1θ), where go(s;θ) is uncorrelated with IΘ{S), and also with g*, then Ig(θ;ω) < Ig+(θ;ω), with strict inequality unless go{s]θ) = 0. Now we consider the case where G(T\S) is empty. First suppose that S and T are independent. We plan then to show that Ig{θ;ω) < Iιθ{s){θ\ω) for any g G G. If G(T\S) is empty, then we may assume g = 6(s, ί; Θ)IQ{S) where £?α;[6|5] = a(ω) φ 0, in view of (4.1) and Proposition 5.1. But b(s,t]θ) = b*(t;θ) for some b* in view of independence of S and T. Thus, g — b*(T',θ)lβ(S), so that Eω{g2) = Eω(b*2)Eθ(l2θ{S)) = Eω{b*2)l(s\θ), where I^{θ) is the usual Fisher information in S. Also, dg/dθ = {db*/dθ)lθ(s) + b*{dlθ/dθ) so that
£
=
Eω[b*(T;θ)}Eθ
d2 log f(S-,θ) θθ2
Hence, from (2.3),
Since jEg(6*) < ^ ( ^ * 2 ) , we have Proposition 5.2 7/5 α n d T are independent, the distribution of S depends only on θ and G(T\S) is empty, then Ig(θ;ω) < The matrix analogs of Propositions 5.1 and 5.2 can be developed for the general case of vector θ in terms of the information matrix Ig(θ;ω), given by (2.6). Thus the marginal score function of 5, viz. IΘ{S), is the optimal RUEF for θ (uniquely except for multiplication by functions of θ alone), when G(T\S) is empty, if S and T are independent.
94
BHAPKAR
If the family of conditional distributions {P^ \φ G Φ} of T, given s foτφeΦ is complete for every fixed 0, a.e. (Pω ), then G(T \S) is empty. Is then IQ(S) an optimal RUEF if the distribution of S depends only on 0? Proposition 5.2 requires the additional assumption of independence of S and T to prove optimality of lβ(s). Lloyd (1987) asserted optimality of IΘ(S), without requiring independence, in the case of one-dimensional 0; a similar assertion was made by Bhapkar and Srinivasan (1993) for the general case. Both these assertions are now seen to be invalid in view of the following counter-example (Bhapkar, 1995). Example 5.1. (continuation of Example 4.1) In Example 4.1, for given 0 and 5, T is complete for φ and hence G(T\S) is empty. However /^(5)(0; ω) = n 2 /0 2 (2n + 1), while Ig(θ;ω) = n/20 2 > I^s\θ) = Ilθ{s)(θ;ω), for g = g(s,t θ) =t{s-θ)/θ2s. Note that although, here, both le(s) and g produce the same estimate, viz. θ = S,IQ and g are distinct functions, i.e., g is not of the type k(θ)lo(s), and these distinct functions lead to different information functions.
6
PARTIAL SUFFICIENCY OF S FOR θ
If the marginal distribution of S depends on ω only through 0, S has been termed partially sufficient (p-sufficient) for 0 if I{GX\θ',ω) = lW(θ).
(6.1)
Thus, it has been shown by Bhapkar (Theorem 3.3, 1991) that 5 is psufficient for 0 if the following condition holds: (i) The conditional pdf of T, given s, depends on ω = (0, φ) only through a parametric function δ = δ(ω), which is differentiate and is such that ω is a one-to-one function of η = (0,5), for almost all (PQ )S. Earlier the term partial sufficiency had been used by some authors to describe the property of statistic Γ when δ(ω) in (i) is φ itself. The property of partial sufficiency of Γ was, then, also referred to as S-sufficiency by Basu (1977), among others. There is another situation where the property (6.1) holds; in this situation, condition (i) is replaced by the following condition: (ii) The family of conditional probability distributions {Pω : φ E Φ} of T, given s, for φ G Φ is complete for each 0 a.e. {PQ ) , and (5,T) are independent. The proof that the equality (6.1) holds, under the regularity conditions R*,RG and the additional condition (ii), when S has pdf depending only
PARTIAL SUFFICIENCY
95
on 0, follows essentially from Proposition 5.2, in view of inequality (2.4). A direct proof has been given by Bhapkar (1990) under a condition (ii'), which is equivalent to (ii). In condition (ii7) T has been termed a complete p-ancillary statistic for θ. The independence of (5, T) in the completeness condition (ii) is crucial, as shown by Example 5.1, and thus the assertions in Lloyd (1987) and Bhapkar and Srinivasan (1993) appear to be invalid. If T is complete for 0, given s and 0, then the second terms on the right hand sides of both (3.11) and (3.12) are seen to vanish in view of (3.7) and (3.9). However, it does not follow that Ic{θ\ω) = IQ\0',UJ), in view of comment (d) after Theorem (3.1). Thus the equality (6.1) is not necessarily true when S-distribution depends on ω only through 0, and Γ is complete for >, given s and 0. When the relation (6.1) holds, i.e. when S is p-sufficient for 0, then the marginal score function of 5, viz. Zfl(s), is the optimal RUEF for 0, in the sense that Ig(θ;ω)
7
Q-SUFFICIENCY OF S FOR θ
If T is complete for >, given s and 0, then G(T\S) is empty. Lemma A.7 shows that all RUEF's for 0 can be represented in the form g(sjt θ) = bf{s,t;θ)lθ(s) +go{s), where Eω[b(s,T;θ)\s] = a(ω). Operationally, a dλvector g = g(s, ί; θ) can be considered an estimating function for d\-dimensional θ if the equation g{s,t\θ) = 0 can be solved to produce an estimate θ = θ(s,t). Thus, if Γ is complete for φ, given s and 0. g(8, t; θ) = B(8, t; θ)lθ(s) + go(s),
(7.1)
where all the elements of gr0 belong to Qo{S) and Eω[B(SjT;θ\s] — A(ω). In order to find the optimal g, which maximizes the information matrix Ig(θ\ ω)
96
BHAPKAR
given by (2.6), one may assume without any loss of generality that g0 — 0, in view of Proposition 5.2. Now solving the estimating equation B(s,t',θ)lθ(s)
=0
(7.2)
for producing an estimator θ = θ(s,t) is operationally equivalent, in fact, to solving the regular unbiased estimating equation IΘ{S) = 0. For instance, in the case d\ = l,b(s,t\θ) is not an unbiased estimating function since Eω[b(s,T',θ)\s] = a{ω) implies Eωb(S,T\θ) = a{ω) and a{ω) Φ 0 in view of the assumption that G(T\S) is empty. One can then define a weaker concept (or property) of sufficiency for the parameters of interest, viz, q-sufficiency, if there exists statistic S with distribution depending on ω only through θ such that the inequality Ig{θ;ω)
(7.3)
for all RUEF g holds for some g* of the form (7.2), which are operationally equivalent to IΘ{S), in the sense that the estimator θ = θ(s,t) is produced by the Z#(s) part. Note that in (7.2) the part 6(s,t;0), in the case d\ = 1, is not a RUEF itself in the sense that Eωb Φ 0. In the d\-dimensional case no row of B($, t\ θ) in (7.2) can be a RUEF itself. Thus, if T is complete for φ, given s and 0, then S is at least q-sufficient. If S and T happen to be independent, then S enjoys the stronger property of p-sufficiency. If S happens to be q-sufficient, one could then define the q-information matrix IQ{θ;ω) = mmEω[lθ(S,T) - u][lθ(S,T) - u]f], (7.4) where now u = u(S,T;ω) is permitted to have elements in U* which is the space of functions in C orthogonal to 5*, which is obtained from Q by excluding g(s,t;θ) = b'(s,t\θ) lθ{s) such that Eω[b(s,T;θ)\s] = a(ω), which is not of the form a*(θ). Thus Q* is the sub-space of functions g* = a'{θ)lθ{s) +go{s-,θ), where g0 e Go{S); hence Q* = G{S). Since Q* C Q, we have W D U, so that IQ(θ]ω)
(7.6)
we then have Ig(θ;ω)
< / | # ( f ) (0;ω) = I ( S ) (θ) - IQ(θ;ω)
(7.7)
97
PARTIAL SUFFICIENCY
for all g with elements in G* Another situation where S happens to be q-sufficient, but not necessarily p-sufficient, is somewhat similar to the one covered by condition (i) in Section 6 with the difference that the parametric function δ = δ(ω) does depend on s as well. Specifically, for the statistic (5,T), which is sufficient for ω, we have S-distribution depending on ω only through 0, and (iii) the conditional pdf of T, given 5, depends o n ω = {θ,φ) only through δs = δ(ω; s), which is differentiate such that ω is a one-to-one function of (0, £ s ), given s, for almost all s. Under condition (iii), the conditional pdf h(t;ω\s) = h*(t;δs\s). Then for the case d\ — 1 we have =
dθ
ί
dφ
Jd ( ω ; S)'
(7 8)
'
for a suitable function d( ; s), arguing as in Theorem 2.2 of Bhapkar (1991). In view of regularity assumption R* (ii) every element of dlogh/dφ belongs to the space ys, given s. The same is true oΐdlogh/dθ, i.e. IQ(T\S), because of the relation (7.8). Since Gs®ys is the orthogonal decomposition of C5, for every gs{t\ 0) E Gs we have Eθ[lθ(T\s)gs(T',θ)\s} = 0. (7.9) We have, thus, Eω[dgs(T',θ)/dθ\s} = 01
(7.10)
again in view of R* (ii); thus Eω[dg(a,T',θ)/dθ\8]=0 for every g eG(T\S). It then follows that Eω[dg(S,T;θ)/dθ]=0
(7.11)
for every g eG{T\S). Therefore, under assumption (iii), Ig(θ;ω) = 0 for g e G{T\S). Hence g is optimal only if g has the form 3(5, t; 0) = 6(s, t; Θ)IQ(S), in view of Proposition 5.1, where Eωb{S,T;θ) = α(ω) φ 0. Thus, arguing as in the complete case, S happens to be q-sufficient for 0 in the sense that the optimal g*(s,t;θ) is a RUEF operationally equivalent to lθ{s), and the property (7.7) holds, when assumption (iii) is satisfied.
98
BHAPKAR References
Basu, D. (1977). On the elimination of nuisance parameters. Journal of the American Statistical Association, 72, 355-366. Bhapkar, V.P. (1989). Conditioning on ancillary statistics and loss of information in the presence of nuisance parameters. Journal of Statistical Planning and Inference, 21, 139-160. Bhapkar, V.P. (1990). Conditioning, marginalization and Fisher information functions. Proceedings of the R.C. Bose Symposium, R.R. Bahadur (editor), Wiley Eastern Ltd., 123-136. Bhapkar, V.P. (1991). Loss of information in the presence of nuisance parameters and partial sufficiency. Journal of Statistical Planning and Inference, 28, 185-203. Bhapkar, V.P. (1995). Completeness and optimality of marginal likelihood estimating equations. Communications in Statistics, Theory and Methods, 24, 945-952. Bhapkar, V.P. and Srinivasan, C. (1993). Estimating functions: Fisher information and optimality. Probability and Statistics, Editors: S.K. Basu and B.K. Sinha, Narosa Publishing House, New Delhi, India, 165-172. Bhapkar, V.P. and Srinivasan, C. (1994). On Fisher Information inequalities in the presence of nuisance parameters. Annals of the Institute of Statistical Mathematics, 46, 593-604. Godambe, V.P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika, 63, 277-284. Godambe, V.P. (1984). On ancillarity and Fisher information in the presence of a nuisance parameter. Biometrika, 71, 626-629. Godambe, V.P. and Thompson, M.E. (1974). Estimating equations in the presence of a nuisance parameter. Annals of Statistics, 2, 568-571. Liang, K.Y. and Zeger, S.L. (1995). Inference based on estimating functions in the presence of nuisance parameters. Statistical Science, 10, 158-173. Lindsay, B. (1982). Conditional score functions: some optimality results. Biometrika, 69, 503-512. Lloyd, C.J. (1987). Optimality of marginal likelihood estimating equations. Communications in Statistics, Theory and Methods, 16, 1733-1741. Rudin, Walter (1974). Real and Complex Analysis, McGraw-Hill Book Company, New York.
Appendix Proposition A.I Let (X,A,ψ) be a σ-finite measure space and L2{X the set of all real-valued functions f for which j f2(x)dφ(x) < oo; then is a Hilbert space.
99
PARTIAL SUFFICIENCY
The inner product < /i,/2 > is then given by / fιf2dψ and the norm is defined by | | / | | 2 = J72ώ/> The proof is given in mathematical texts (see, e.g., p. 81, Rudin, 1974). Such a proof can be modified, as appropriate, to establish Lemma A.I Let L(X,ψ) be the set of all real-valued functions f for which f f(x)dψ(x) = 0 and f f2(x)dψ(x) < oo. Then L(X,φ) is a Hilbert space with the inner product and norm defined as before. Lemma A.2 Let C = C(X x Ω,μ x π) be the space of real-valued functions 2 c = c(x\ω) satisfying E(c) = /c(x;ω)p(x;ω)dμ(x)dπ(ω) = 0 and E(c ) = 2 f c pdμdπ < oo, where p(x;ω) is the conditional pdf of X, given ω £ Ω, with respect to a σ-finite measure μ, and π an arbitrary probability measure over Ω. Then C is a Hilbert space with the inner product < ci,C2 > defined as E(c\C2). Proof. The set C forms a linear space over the field R of real numbers. It remains to prove that C is complete, i.e. to show that every Cauchy sequence {cn} in C converges to an element c in C with respect to the norm ||c|| defined by ||c|| 2 = E(c2) = ίc2pdμdπ
= ί d2{χ-,ω)d<ψ{χ ω),
(A.I)
where d(x\ω) = c{x',ω)\p{x\ω)]ιl2 and ψ = μ x π. If {en} is Cauchy, there exists a subsequence {cnk},rtι < n 2 < . . . such that -cnk\\<2-\k
= \χ...
1
Let oo '
a
=
In view of the triangle inequality, and (A.2), we have k
Applying Fatou's lemma to {α^}, // hminf liminf
/
k
\\
{1
liminf liminf
r
I k -> oo a{ 1 pdφ
oo / a{pdφ < 1,
i.e. ||α|| < 1. Thus a{x\ω) < oo a.e. (Ψ), and the series oo
ii+i(α;ω) - c n i ( ^ ; ω ) }
(A.2)
100
BHAPKAR
converges absolutely, say to c{x\ω) a.e^). Thus for cnk(x;ω) = cnι(x\ω) + lim cnk (x ω) = c(x; ω), α.e.(^).
(A.3)
We now show that c£C, and \\cn — c\\ -> 0 as n -> oo. For any e > 0 there exists N such that ||c n — cm \\ < e for m,n greater than N. Applying Fatou's lemma, for m > N we have r
/
(c - Cm) pdφ
=
liminf
2
/ k -> oo (c n i b - c m ) pdt/; liminf /• 2
2
< fc -> oo y (c n , - c m ) p # < e .
(A.4)
It follows that E(c2) = / c 2 p # < oo. Also - Cm + Cm\ = \E{c1 / 2- Cm)\ = 1\ J{c / 2
ΓΓ ΓΓ
1
- Cm)pd<ψ\
[J
in view of Cauchy-Schwartz inequality. Since fpdψ = f pdμdπ = f dπ = 1, we have |£7(c)| < e, in view of (A.4). Since e is arbitrary, this proves that E(c) = 0. Thus c eC, and ||c m — c\\ -» 0 as m -> oo in view of (A.4). Thus the lemma is established. Assume the regularity conditions #*, in Section 2, for the joint distribution of (5,T), given ω. Consider now the probability distribution π = πi x π2 of CJ = {β, φ) over θ x Φ. C is the Hubert space of real-valued functions c = c(s,£;ω), which satisfy E(c) = E(c 2 )
/ c(s, ί; ω)p(5, ί; ω)dμ(s, t)dπ(ω) = 0 / c2pdμdπ < oo .
=
See Lemma A.2 for the proof of assertion that C is a Hubert space. The subspaces (and their closures in C, as needed) considered later on are defined below: g is differentiable with respect to elements of θ } g U C(S) G{S) G(S) V(S)
= = = = = =
G
g1- so that C = G®U {c:c = φ ; ω) and ceC} {g:g G(S) ±
101
PARTIAL SUFFICIENCY For a given value s of 5, define Cs
= {cs : cs = c5(£;ω) ,-Eα,(cs |s) =
/ csh(t; ω \s )dηs (ί) = 0, a.e. (π) E(<ξ \s) = ίc2shdηsdπ{ω)
< oo} .
(A.5)
The proof that Cs is itself a Hubert space, given 5, is very similar to that of Lemma A.2; first we have Proposition A.2 Every Cauchy sequence {cs,nk},nι < n^ < . . . in Cs converges to an element in Cs; hence Cs is a Hilbert space. Proof. Arguing as in the proof of Lemma A.2, we have lim cSiTlk (t; ω) = cs(t; ω), a.e. (ηs x π)
(A.6)
>oo
in view of (A.3). It remains to verify that Eω[cs(T;ω) \s] = 0, a.e.(π); t h e remainder of the proof goes along the lines of Lemma A.2 proof. Now
Eω[cs \s] = J(cs - c s , m + ca,m)dPW8)
= J(cs - cs,
J(
(cs,nk
-
by applying Fatou's lemma to {cs,njc} as k -> oo. Hence JSω[cθ|s] < 0, in view of (A.6), noting that (A.6) holds a.e.(π). A similar argument applied to — cs gives the inequality — J E ^ C ^ S ] < 0. Thus, E^jcsls] = 0, a.e.(π). The proposition is established by arguing as in the proof of Lemma A.2. In C5, we define the subspaces Qs and ys as given below. = {9s : 9s = 9s (*; θ), gs G Cs and gs is differentiate with respect to θ} Gs = Gs ys = GsLmCs,i.e.cs = gs®ys
Gs
Finally we define in C C(T\S) G(T\S)
= {c:c = φ , ί ; w ) , c 6 C and c(s,t;ω) = cs(t;ω),cs e Cs a.e.(z/ x π)} = {g: g = g(s,t;θ),geG and g{s,t;θ) =gs{t',θ),gs e Gs a.e.(i/ x π)}
y(s,P,ω) = ys(t;ω),ys Lemma A.3 C = C{S) Θ C(T\S).
6 ^ 5 a.e.(ι/ x π)
102
BHAPKAR
PROOF Let c* = c*(s;ω) G C(S) and c = c(s,t;ω) G C(T\S). Then
E{cc*) = Jc*(s;ω) [|' c(a,t;u>)dP™] dP^dπ(ω) = 0 and thus C(S) ± C{T\S). Now we show that C(S)1 is C(T\S). Suppose now c J_ C(S), and let c*{s;ω) = Eω(c\s). Since c* G C(5), E{cc*) = 0. But
Hence c*(s;ω) = 0, a.e.(z/ x π). Thus c G C(T|5), and the lemma is proved. Lemma A.4 C(T\S) = C?(T|5) ®y(T\S). PROOF. If y G y(T\S) and5 G ί?(Γ|5), then
E(gy) = J gydP^dPWdπiω) = J [ | 5 ί y s dPf Is)] dP^WM = 0 Thus £(T|S) j . y(T|5). Suppose now c G CCΠS) and c ± ^(Γ|5). Since c G C(T\S), c = c(s,t;ω) = cs(t;ω), where cs G Cs, a.e.(^ x π). Consider the orthogonal decomposition of cs in Cs, viz
cs(t;ω)=g*s(t;θ) + y*s(t;ω).
Consider now g*(s,t;θ) = gl(t;θ); then 3* G G(T\S) and, hence, c ± g*, i.e. £?(<#*) = 0. But
E(cg*) = J'cg*dP^dP^d-κ{ω)
= J [J'cs(t;ω)g*s(t;θ)dP
Hence g*(s,t;θ) = 0, a.e.(μ x TΓI). Then gζ(t]θ) = 0 a.e.(^ x TΓI), which implies cs = y* a.e.(v x π) so that c(s,t;ω) = y*(s,t;ω), where y*(s,t;ω) = y*(t;ω). Thus c G ^( Γ| 5 ) , and the lemma is proved. Lemma A.5 Suppose that the distribution of S, given ω, depends on ω only through θ, Gι(S) = {gι:gi=gi(s;θ) = α'(θ)lθ(s)eg(S) for some α(θ)},
(A.7)
and £o(#) is the sub-space in £(S) orthogonal to Gi(S). Then G*(S,T),
(A.8)
103
PARTIAL SUFFICIENCY where G*(S,T) =
{g*:g*=g*(s,t;θ)eg<md
k*(s;θ) = Jg*(s,t;θ)dP™dπ2(φ)eGι(S),
a.e. (1/x πi)}.(A.9)
PROOF. When the distribution of 5, given ω, depends only on 0, then each component of IQ(S) belongs to G(S) in view of regularity condition R* (i). Since Q(S) is complete, it is a Hubert space with decomposition We also note that Q is a complete subspace of C and, hence, Q is a Hubert space. G*(S,T) is seen to be orthogonal to Go(S). It remains to show that G*(S,T) is the orthogonal complement of Go(S) in G> Let then g € G and suppose g ± Go{S). Then for all go G Go(S)
0 = E[gog] = J =
E\gok*].
Thus k* JL ^0(5). Since k* <Ξ G(S), it follows that k* G £i(S). Thus the lemma is established. Lemma A.6 Under the assumptions of Lemma A.5, let g{S,T)
that JBw[6(β,T;0)|s]
= {g:g = g{s,t;θ) = b'{sΛθ)lθ{s)
= α(ω), and g e G} .
such
(A.10)
(i) If there is a 5* e G*(S,T) orthogonal to £(S,T), thenfc*(5,6>)= 0, a.e. ( ) , and (ii) for every j G ? ( 5 , T ) , we have the orthogonal decomposition g = a*'(θ)lθ(s) + [b'(8,t;θ) - a*'(θ)]lθ(s),
(A.ll)
where Eω[b{s ,T;θ\s)] = α(ω), and /α(ω)cίπ2(<£) = a*{θ). Furthermore, (iii) if π is a one-point distribution at ω, then ^ has the orthogonal decomposition G = Go(S)®G(S,T). PROOF. Note first that G(S,T) C 0*(S,T). Also observe that ^i(5) C G{S,T). Hence if there is a g* G ζ?*(5,T) orthogonal to 5(5,T), then we have for all ^1 € ( ) 0 = E\gιg*] =
Jg1(s,θ)k^s]θ)dP{θS)dπι(θ)
in view of (A.9). However fc* G £1(5) in view of lemma A.5 and (i) follows. To prove (ii), note that the two components of g in (A.ll) are orthogonal. We need to show that the subspace spanned by the second component is
104
BHAPKAR
the orthogonal complement of Gi{S) in G(S,T). Let then g G G(S,T) and suppose g ± Gι(S). Then for all gx G Gι(S) 0 = E\gιg]
Jgi(s;θ)af(ω)lθ(s)dP{ΘS)dπ(ω)
= = =
J
a[(θ)lθ(s)ϊθ(s)a(ω)dP{θS)dπ(ω) ίa[(θ)I^(θ)a{ω)dπ{ω),
in view of (A.7) and (A. 10). Since this is true for all αi(0), we have Ja(ω)dπ2{φ) = 0. Thus (ii) follows. Finally, if π is a one-point distribution at ω, then ^[ff*^] = 0 and, thus, g* G G(T\S), which is a subset of G{S,T). Since 3* is orthogonal to £(S,T), it follows that g* — 0. Thus we have assertion (iii). Lemma A.7. Under the assumptions of Lemma A.5 and A.6, suppose π is a one-point distribution at ω. If now G{T\S) is empty, every g G G has a representation where 50 e (zoOS) and £„[&($,T;0)|s] = α(ω) 7^ 0. Remark. Although (A. 12) gives the general representation of Q for the case where G{T\S) is empty and π is a one-point distribution at ω, the decomposition (A.ll) shows that g in (S,T) belongs to Gi(S) only if Eω[b{s,Γ; θ)\s] = o*(0) for some α*, i.e. 6(s,ί;θ) = α*(θ) in view of assumption that (?(T|S) is empty. Then g is outside Q(S) only if J5w[6(s,T;0)|3] = a*(ω) for some function α* which depends also on φ in a non-trivial manner. The orthogonal decomposition of Q(S,T) into Gi{S) and its complement in G(S,T) as given in (A.ll) is now possible only as g(s, t; 0) = a'(ω)lθ(s) + [b'(s, t; 0) - α'(ω)]I,(*) where α(α ), for the given ω, depends non-trivially on co-ordinates φ of given ω.
105
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
ESTIMATING FUNCTIONS AND HIGHER ORDER SIGNIFICANCE D.A.S. Eraser, N. Reid, and Jianrong Wu University of Toronto ABSTRACT Estimating functions provide inference methods based on a model for specified functions of the response variable, such as the mean and variance. The inference methods can use the limiting distribution of the estimating function, or the derived limiting distribution of the estimating function roots, or the derived limiting distribution of the quasi-likelihood function. In fully specified parametric models more accurate inference can be obtained by using recently developed higher order approximations based on likelihood asymptotics. We consider the recent third order methods in the estimating function context using the quasi-likelihood function. We focus on inference for a scalar parameter of interest: profile quasi-likelihood for the scalar parameter is defined and we describe a method for using this to approximate significance probabilities. Key Words: Ancillary directions; asymptotic inference; estimating functions; profile likelihood; quasi-likelihood; tail area approximations.
1
Introduction
In a fully specified parametric model, quantities for testing a parameter or parameter component can be constructed from the score function, the maximum likelihood estimator, or the likelihood ratio statistic. These quantities are equivalent in first order asymptotic theory, although examples tend to indicate that the likelihood ratio statistic provides the most reliable assessment, and the score statistic the least reliable assessment of a parameter of interest. It is also possible by considering higher order asymptotics to derive a modification of the likelihood ratio statistic with much better inferential properties. In the estimating equations context, optimally weighted estimating functions serve as an equivalent to, or extension of, the score function, and the quasi-likelihood function serves as an equivalent to, or extension of, the likelihood function. Our goal in this paper is to explore the extension of
106
FRASER, REID AND WU
likelihood asymptotics to the estimating equation context. We assume that the parameter of interest is a scalar, as this is central to the methods of likelihood asymptotics. In Section 2 we summarize the main results for third order inference based on the recent likelihood based asymptotics, with emphasis on the approach developed in Eraser and Reid (1995). In Section 3 we discuss the application of these methods to the estimating equation context. Some examples are discussed in Section 4 and limitations of the current work and possible extensions are outlined in Section 5.
2
Likelihood and significance
We assume in this section that we have a log-likelihood function ί(θ) = t(θ\y), based on a continuous variable y with n coordinates, and that the parameter is partitioned as θ = (λ, φ), with φ a scalar parameter of interest and λ a vector of p — 1 nuisance parameters. We will denote the observed information function —i"(θ) by j(θ) and the nuisance parameter submatrix by jχχ{θ). The profile log-likelihood function is ίp{φ) = i(\φ,φ), where λ^ is the maximum likelihood estimate of λ with φ fixed. We will often write θφ for {\φ,φ)> and i(θφ) for the profile log-likelihood function. In fairly wide generality, an approximate p-value for testing the hypothesis HQ : φ = φo can be computed from either of the following formulas: Φ1(r,q) = Φ(r) + φ(r)(r-1-q-1)
, Φ2(r,g) = Φ{r - r " 1 log(r/ς)}
(1)
where φ and Φ are the standard normal density and distribution functions. These are approximations to the distribution function of r, typically with relative error O(n~ 3 / 2 ). The first formula, an extension of the Lugannani and Rice (1980) approximation, often gives better accuracy with exponential models, while the second formula, due to Barndorff-Nielsen (1986, 1991) avoids anomalous values outside [0,1]. In (1) the quantity r — r(φo) is usually the signed square root of the profile log-likelihood ratio statistic: r = r(φ) = sgn(φ - φ) • [2{£(0;y) - tφtfy)}]*
(2)
and q is a complementary first order quantity, the explicit form of which is determined by the problem. For example, in the case of a canonical exponential family model with no nuisance parameters, the first version of (1) is the Lugannani and Rice (1980) approximation, with q taken to be the standardized maximum likelihood departure for the canonical parameter (φ — φ){j{φ)}λ^2 In the general one parameter setting q can be taken to be
HIGHER ORDER SIGNIFICANCE
107
where it is assumed that there is a one-to-one transformation from y to (?/>, α), with a exactly or approximately ancillary, and ty(ψ',y) = dί(ψ]y)/dy and t"ψ]y(Ψ',y) = dly(φ;y)/dφ, with α held fixed. In the presence of nuisance parameters a general formula for q was established in Barndorff-Nielsen (1986, 1991) under the assumption that there is available an explicit exact or approximate ancillary statistic a such that the conditioned variable given a is a one to one function of θ. In the special case τ of a canonical exponential family /(y; θ) = exp{ψs4-λ £—c(λ, ψ)—d(y)}, the minimal sufficient statistic is a one-to-one function of θ and the expression for q can be simplified to
where ρ2{ψ,ψ) = |jλλ( Equation (1) with (2) also gives an approximation to the Bayes posterior marginal cumulative distribution function with the choice q= An alternative expression for q was derived in Eraser and Reid (1995) which provides a more easily implemented calculation for the case with effective variable of the same dimension as the parameter as covered by BarndorffNielsen (1986, 1991) and also handles the general case where the dimension of the effective variable is larger than the dimension of θ. The dimension reduction from y to θ is effected by constructing a new parametrization ?, which is obtained as the gradient of the log likelihood function at the observed data taken in p directions V = {v\... υp).
V
=
dyτ dy τ
dθ
(2,0,00)
(4)
The latter differentiation is for fixed values of appropriate pivotal quantities, as discussed in Eraser and Reid (1995). The quantity q can be viewed as a standardized maximum likelihood departure q = q(φ) = sgn(V> - φ) • \χ(θ) - χ(θφ)\{\j{θθ)(θ)\/\j{Xλ)(θφ)\}ϊ
(5)
where 3{βθ)Φ) ιs the observed full information matrix and j(χ\){θψ) is the nuisance information matrix, and both are recalibrated in terms of the new
108
FRASER, REID AND WU
parameterization φ(θ): see eq. (17) below. The scalar parameter χ(0) replacing φ(θ) is linear in φ(θ)
3
Quasi-likelihood and significance
The quasi-score estimating function (Wedderburn, 1974) is denoted here by UQ to suggest its nominal role as a derivative with respect to 0, uθ(θ;y)=μJ(θ)Σ-1(θ){y-μ(θ)}.
(7)
This is based on n observations y with mean μ(θ), variance matrix Σ(0), and location gradient μe(β) = (d/dθτ)μ(θ). The more general optimally weighted estimating function (McCullagh & Nelder, 1989) uθ(θ;y) = μJ(θ)Σ-\θ)d(y;θ)
(8)
is based on a vector d(y; 0) recording some version of departure of y from 0 with mean E{d(y;θ)\θ} = 0, variance matrix Σ(0), and location gradient μβ(θ) = E{(d/dθ) d(y;0);0}. A further extension to handle conditional means and variances given a conditioning quantity A(θ) is discussed in Hanfelt and Liang (1995). Under reasonable conditions a root θ of the quasi-score estimating equation uβ(θ;y) = 0 is asymptotically normal with mean θ and variance I^{θ) where = μJ(θ)Σ-1(θ)μθ(θ) (9) is the variance of the estimating function. A quasi-likelihood ratio for 02 versus 0χ is obtained (Wedderburn, 1974) as a line integral (10) this will in general be path dependent if the quasi-score does not form an integrable vector field. We examine quasi-likelihood as constructed from estimating equations (7) and (8). With 0 expressed as (λ,^), the equations have coordinates corresponding to λ and φ:
109
HIGHER ORDER SIGNIFICANCE
The roots θ and θψ of the estimating equations will typically be obtained by iterative solution of UQ = 0 and u\ = 0, for example as
λj
+1)
)
1
)
)
= λ!ί + / λ - λ (^ K(λ!ί ,V;?/)
(12)
Now consider the interest parameter ψ. As a definition of the corresponding profile log quasi-likelihood we take £PQ(Ψ) to be Hλ^ o ,^o)
= L .
nψ(λψ ? ^;y)#
(13)
where the effective integration curve C — {θψ} = {(λψ,^)} is along the path of constrained solutions θψ as suggested in Barndorff-Nielsen (1995). It would typically be calculated iteratively from the overall solution θ. The signed likelihood root is then given uniquely by r{φ) = s g n ^ - <ψ)[2{ίPQ{ψ) - ePQ(φ)}]ϊ
(14)
The nominal reparameterization φ(θ) is needed only along the profile curve C and is obtained by an integral paralleling (13); it uses a n n x p matrix of ancillary directions V = (υ\... vp) which will be discussed later in this section. Again integrating along C, we define VτΣ-1(θ)μθ{θ)dθ
in the estimating equation case (7), and Vτd^(y;θ)Σ-1(θ)μθ(θ)dθ
(15)
in the more general case (8) with the notation dy(y;θ) = (d/dyτ)d(y;θ) .
This involves an integration for each of p coordinates but most of the calculations are common and related to (13). The Jacobian of the parameter change from θ to φ. φθτ(θ) = VT(ξ(y]θ)Σ-1(θ)μθ(θ)
,
(16)
is needed only at θ and θψ and is a by product of the calculation for (15). The inverse ofφΘτ(θψ) gives the coefficients ψφτ(θψ) for the linear parameter χ{θ) defined by (6).
110
FRASER, REID AND WU
The nominal information matrices JΘΘΦ) and j\\{θψ) are calculated from gradients of the quasi-scores jθθ(θ)
= - M Θ )
(Θ)
The recalibrated information matrices for use in (7) are then obtained using (16): \W)0)\
=
\Jθθ(θ)\\φθτ(θ)\-\ 1
\j(χχ)Φψ)\ = \3χχ(θ^)\\ψlΦΦ)φχτ(θΦ)\-
(17)
We now have all the ingredients for using the third order formula (1) with (2), (5), and (6) except the n x p matrix of ancillary directions V = (υi,..., υp) at the observed data point. In most inference contexts the number of variables n will exceed the number of parameters p and some procedure is needed to effectively reduce the number of variables to p. Higher order asymptotics indicates that the appropriate procedure is to condition on an ancillary of dimension n — p, thus giving p free variables. The higher order approximation described in Section 2 needs only the value of the likelihood function at the observed data point, and the gradient of the likelihood in p directions tangent to the ancillary, which give the new parameter φ in (4). The only information needed concerning the ancillary is thus the array V of tangent directions. For third order inference the vectors V need to be tangent to just a second order ancillary (Fraser and Reid, 1995; Skovgaard, 1986) and for second order inference the vectors V need to be tangent to just a first order ancillary. In the estimating equations context where the model is specifed as Eyi = μ^, vaπ/^ = Σ(μi), and the components are independent, a first order ancillary can be derived using results from curved exponential family theory (Amari, 1985). The resulting directions in this case are given by V = μθτ(θ) = ^μ(θ)\§.
(18)
In the case of the more general estimating equation (8), a separate argument is needed to establish V.
4
Examples
As a first example we consider a mean and variance function corresponding to exponential regression: the coordinates yι have mean μι and variance μf
111
HIGHER ORDER SIGNIFICANCE
with μι = exp{α + β{x% — x)}. We assume interest centers on the regression parameter β. The corresponding estimating equations for a and β are ua
=
uβ
= Σ(xi - x)μ~l(yi - μi)
Σμ^iyi-μi) (19)
These integrate on a path-free basis to give the quasi-likelihood
£{a,β) = -Σexp{-α - β(xi - x)}y{ - na which in fact coincides with the actual likelihood for the exponential regression model. The corresponding profile quasi-likelihood for β thus coincides with the ordinary profile ίP{β)
= Σyiμ'1
+nά- ϋyφjβ
- nάβ
where μι = exp{ά + β(xi — x)} and fiiβ = exp{άβ + β{x{ — x)}. For this example the zth row of the matrix V of ancillary directions (18) for second order inference is
The ancillary directions for third order inference are also available (Eraser et a l , 1994): the ith row of V is
The implied statistical model has in fact location model structure with an exact ancillary so exact p-values are available for comparison. We consider a random sample of size 5 from the Fiegl and Zelen leukemia data given as Set U in Cox and Snell (1981). Table 1 shows the exact and quasi-likelihood based p-values for selected values of /?, the regression coefficient. The full sample size is n = 17, and for the full sample there is almost no difference among the first order, quasi-likelihood and third order methods. The first order method here refers to using the normal approximation for the profile log-likelihood root. Table 1. Exact and approximate p-υalues: exponential regression.
β -4.7 -4.3 -3.8 -3.4 -2.9 1.5 1.0 2.0
first order 0.9954 0.9899 0.9747 0.9505 0.8956 0.0177 0.0370 0.0096
using (18) 0.9952 0.9896 0.9747 0.9514 0.8994 0.0240 0.0466 0.0134
exact 0.9960 0.9911 0.9776 0.9564 0.8953 0.0249 0.0513 0.0109
112
FRASER, REID AND WU Our second example is a binomial regression model: we assume that
Eyi = μι, vary; = rriiμi(l — μ;), with rrii known and μι = a + βx{.
We
are using a non-canonical parametrization, as there is no exact conditional method for inference about β in this setting. For illustration we fit this model to data from Example 1.5 of Cox and Snell (1989). The data values are I
Xi
1 2 3 4 5
1.0 1.7 2.2 2.8 4.0
y% 4 4 2 1 1
TH
110 105 62 65 45
As y is discrete the arguments of Section 2 do not apply directly, but the ancillary directions given by (18) are easily computable. Table 2 compares the significance functions for the first order and higher order approximations Table 2. First order and second order p-values: binomial example.
β -0.0245 -0.0225 -0.02 -0.018 -0.0155 -0.0115 -0.0065 0.000 0.0055 0.009 0.0125 0.017 0.0195
first order 0.9953 0.9894 0.9735 0.9487 0.8948 0.7454 0.4956 0.2413 0.0956 0.0512 0.0259 0.0010 0.0056
using(lί 0.9939 0.9867 0.9678 0.9398 0.8826 0.7375 0.5061 0.2587 0.1074 0.0591 0.0307 0.0122 0.0071
In both these examples the quasi-likelihood obtained from integrating the score equation is identical to the log-likelihood. It is necessary to introduce some dependence in the score equations, along the lines of Liang and Zeger's generalized estimating equations, or to use an optimally weighted estimating function as at (8), to derive a quasi-likelihood that is not also a log-likelihood. Unfortunately, we have not as yet determined a way to compute the ancillary directions V for these more general problems. Further comment on this point is given in the next section.
HIGHER ORDER SIGNIFICANCE
5
113
Discussion
In the application of formula (1) the choice of q seems to be crucial, and seems to need an argument based on approximate ancillarity in the nominal model leading to the estimating equation. For example, Hanfelt and Liang (1995) give an example using a moment type estimating equation for the shape parameter of a gamma distribution, and compare the first order approximations based on the quasi-likelihood ratio statistic and the standardized root of the estimating equation. We tried combining the two statistics, essentially using the standardized root as the q in (1), but the approximation was worse than either of the first order approximations. The derivation of V above uses the fact that μι is a location parameter, and we do not at the moment have an argument that applies to non-location parameters, such as over-dispersion parameters or additional dependence parameters in Σ. It may be possible to incorporate over-dispersion parameters using Nelder and Pregibon's (1987) extended quasi-likelihood, which essentially gives a REML-type marginal likelihood for the scale parameters. Alternatively it might be possible to substitute a consistent estimate of the scale parameter into the profile log quasi-likelihood, and still have an improved approximation, but we have not investigated this. The recent third order asymptotic methods address various model features not handled by typical first order methods: (i) Third and fourth order moments of the distributions of the component variables are implicitly involved. (ii) The coordinates means μi{θ) can measure θ and thus ψ on differing measurement scales. (iii) Non linearity in the parameter of interest as a function of θ or μ(θ) is allowed for. While the first of these was the initial stimulus for recent asymptotics, (ii) and (iii) are perhaps of greater importance. In the estimating equation context, (ii) and (iii) are of particular interest while information for (i) is generally not available. Accordingly we have adapted the asymptotic methods to the estimating equation context primarily to handle the complications (ii) and (iii). However concerning (i) we note that the application of the asymptotic methods starts with the quasi likelihood and we feel it is appropriate to make the most accurate extraction of significance probabilities from that likelihood.
114
FRASER, REID AND WU Acknowledgements
This work was partially supported by the Natural Sciences and Engineering Research Council of Canada. References Amari, S.-I. (1985) Differential Geometric Methods in Statistics. New York: Springer-Verlag. Barndorff-Nielsen, O.E. (1986). Inference on full or partial parameters based on the standardized signed log likelihood ratio. Biometrika 73, 307-322. Barndorff-Nielsen, O.E. (1991). Modified signed log likelihood ratio. Biometrika 78, 557—64. Barndorff-Nielsen, O.E. (1995). Pseudo profile and directed likelihoods from estimating equations. Ann. Inst. Statist. Math. 47, 461-464. Cox, D.R. and Snell, E.J. (1981). Applied Statistics. London: Chapman and Hall. Cox, D.R. and Snell, E.J. (1989). The Analysis of Binary Data. London: Chapman and Hall. Eraser, D.A.S. and Reid, N. (1995). Ancillaries and third order significance. Utilitas Mathematica 47, 33-53. Eraser, D.A.S., Monette, C , Ng, K.W., and Wong, A. (1994). Higher order approximations with generalized linear models. Multivariate Analysis and its Applications, eds. T.W. Anderson, K.T. Fang and I. Olkin. IMS Lecture Notes Monograph Series, 24, 253-262. Hanfelt, J.J. and K.-Y. Liang (1995). Approximate likelihood ratios for general estimating functions. Biometrika 82, 461-477. Lugannani, R. and Rice, S.O. (1980). Saddlepoint approximation for the distribution of the sums of independent random variables. Adv. Appl. Prob. 12, 475-490. McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models. 2nd ed. London: Chapman and Hall. Nelder, J.A. and Pregibon, D. (1987). An extended quasi-likelihood function. Biometrika 74, 221-232. Skovgaard, I.M. (1986). Successive improvements of the order of ancillarity. Biometrika 74, 516-519. Wedderburn, R.W.M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 82, 439-447.
115
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
ON THE CONSISTENCY OF GENERALIZED ESTIMATING EQUATIONS BY BING LI
Pennsylvania State University
ABSTRACT We study the consistency of generalized estimating equations. Our consistency result differs from the known results in two respects. First, it identifies a specific sequence of consistent solutions to be the minimax point of a deviance function; this is stronger than the known consistency results, which assert only the asymptotic existence of a consistent sequence. Second, the minimax procedure applies and gives consistent estimate even when the generalized estimating equation itself is not defined, as would be the case if the mean function is not differentiate, or if the support of the random observations depend on the parameters. We also provide two practical criteria based on which we can decide whether a solution is consistent by fairly simple computations.
Key words and phrases: Quasi likelihood estimation; generalized estimating equations; deviance; projected likelihood ratio; Doob-Wald approach to consistency.
116
LI
1. INTRODUCTION Many data sets that arise from scientific research consist of repeated, or clustered, measurements, which results in dependence among the observations. It is therefore necessary to incorporate the dependence into the estimation of the parameters and the assessment of errors. The generalized estimating equations and related techniques are very effective for such purposes. See Liang & Zeger (1986), and Zeger &; Liang (1986). In the present paper we will establish the consistency of generalized estimating equations, and provide verifiable conditions under which consistency holds. Applying Crowder's general theory (Crowder 1986; Small & McLeish 1994, page 96), it is not difficult to demonstrate that, with probability tending to one, a generalized estimating equation has a consistent solution. However, here we aim at a more specific statement of consistency; that is, a specific sequence of solutions, which can be identified in practice, is consistent. This latter statement is important because, in many applications, a generalized estimating equation may either have multiple solutions or have none at all; See McCullagh (1990), Hanfelt k Liang (1995), and Li (1996). In the classical theory of maximum likelihood estimation, the more specific statement of consistency is known as the Doob-Wald statement, demonstrated by Doob (1934), Wald (1949), and Wolfowitz (1949). It is established via a property of likelihood functions, which states that the maximizer of expected log likelihood is the true parameter value. This implies that, under certain regularity assumptions, the maximizer of the averaged log likelihood converges in probability to the true parameter value. For estimating equations, however, there is often no such likelihood functions to which the Doob-Wald argument can be directly applied. This is because, unlike a likelihood score function which is by definition the gradient of the log likelihood, an estimating equation need not be the gradient of any potential function, and hence we have no function to maximize. See McCullagh (1990), McCullagh & Nelder (1989), Firth & Harris (1991), and Li & McCullagh (1994). However, Li (1996) pointed out that (i) the likelihood property used in Doob-Wald argument can be restated as: the minimax of the expectation of the likelihood ratio is the true parameter value, (ii) for many estimating equations it is possible to construct a function which behaves similarly as the log likelihood ratio, so that the minimax of the expectation of this function is always the true parameter value. Based on these observations Li (1996)
CONSISTENCY OF GEE
117
demonstrated that every minimax point of this function is necessarily a consistent solution. This leaves no ambiguity when the estimating equation has multiple solutions or has no solution at all. Moreover, the consistency of the minimax holds even when the estimating equation itself is not defined, as would be the case for quasi likelihood equation if the mean function is not diίferentiable or if the support of the random observations depend on the parameter. In this note we will extend the minimax approach of Li (1996) to demonstrate the consistency of generalized estimating equations. In addition, we will provide two practical criteria by which one can judge, via fairly simple computation, whether a solution is consistent. In §3 we demonstrate the consistency in the special cases in which the estimating equations do integrate to potential functions. In §4, we describe the minimax approach to consistency, and introduce a deviance function that will facilitate this approach. §5 contains all the technical preparations towards the proof of consistency: assumptions, lemmas, and examples. The consistency in general cases will then be demonstrated in §6. In §7, we provide two practical criteria for consistency. Finally, in §8, a numerical example is carried out to illustrate the use and effectiveness of these criteria.
2. GENERALIZED ESTIMATING EQUATIONS AND EANCILLARITY The data sets with which we shall be concerned are of the form {Yit : t = l,...,7iΐ,z = l,...,ϋf}. The observations are independent for different z, but dependent for different t within the same i. For example, {Yit : t = 1,..., n^} may be the measurements from the same subject at different times. Associated with each Yit are a p-dimensional explanatory variable Xu and a p-dimensional regression parameter /?, which together determine the expectation of Yit. We will denote the whole vector {Yit} by Y, and the sample space of Y by y. Let μit(β) and φ~ιVit(β) be the mean and variance of the observation Yit. We shall assume that μit{β) = μ{Xjtβ) and Vit(β) = V{X%β) for some known functions μ( ) and V( ). The dispersion parameter φ is always taken to be positive. The inverse of μ is often called the link function, and Xftβ the linear predictor. Typically, we take μ""1( ) to be the natural link function from a linear exponential family and V(-) to depend on μ according to that family, but in principal they can be other functions as well. See McCullagh & Nelder (1989). The dependence within each i is modeled by correlation matrices R(a) — {Rw(a) : t,£' = 1, ...,τij}, where a is an s dimensional parameter and Rtt'{ )
118
LI
known functions. Across i, Rte(ά) is assumed to remain constant, in so far as both observations Yit and Yit» are present in the cluster i. In other words, R(a) is the same for all i except for its dimension. In practice, the matrices R(a) need not be correctly assumed, and is called the working correlation matrices. Whether or not R(a) is correctly assumed, we can formally calculate the covariance matrices of Yi = {Yit : t = 1,..., Πi} based on i?(α), as
These are called the working covariance matrices. If the working correlation assumption were correct, then for each fixed φ and α, the optimal linear combination of {Yn — μu(β)} which yields the highest information about /?, in the sense of Godambe (1960), is q{β, φ, a) = φ~ι Σ{μi{β)}T{Wi{β,
φ, α ) } " 1 ^ " Mi(/*)} = 0,
(1)
where μι(β) is the rij x 1 vector {μu{β) : t — l,...n»}, /x»(/?) is the Πi x p dimensional gradient matrix of μi(β) Equation (1), considered as the estimating equation for β given φ and α, combined with any y/Kconsistent estimate φ(β) of φ, and V^-consistent estimate ά{β, φ(β)} of α, is the generalized estimating equation for β. In other words, we estimate β by solving the equation
If R(a) is the true correlation matrix, and if a is known, equation (2) reduces to the quasi-likelihood equation as defined by Wedderburn (1974) and McCullagh (1983). In this case, it is known that any solution to (2), if it is consistent, is asymptotically normally distributed, and is efficient among the solutions to linear estimating equations. See also Jarrett (1984), McLeish (1984), Godambe & Heyde (1987). In the sequel, we will denote the nuisance parameter (φ,aτ)τ by τ τ 7, and the combined parameter (β,j ) by θ. The parameter space for β will be written as B. Notice that if the working assumption is incorrect, then the nuisance parameter 7 may have nothing to do with the underlying distribution P. Therefore we will use Eβ to represent the expectation under the distribution P for which β = β(P)- We denote the true value of the regression parameter as /?0, and will often abbreviate EβQ by E.
119
CONSISTENCY OF GEE
What is remarkable is the way in which the nuisance parameters 7 enter into the equation (2). Specifically, Eβ{q{β,>y)} = Q for all 7.
(3)
Property (3) is called the E-ancillarity of the estimating equation q relative to the parameter 7; See Small k McLeish (1988, 1989). This is a fundamental ingredient of a generalized estimating equation, out of which arise most of the desirable properties of the method. In particular, as implicit in Liang & Zeger (1986), if R(-) is correct, substituting \fK consistent estimates of φ and a into (1) does not impair the efficiency of the estimate of/?; even if i?( ) is incorrect, in which case ά may not estimate anything, substituting φ and ά into (1) does not impair the consistency and asymptotic normality of /3, as long as y/K(ά — a) is bounded in probability for some a. Property (3) also plays a vital role in our demonstration of consistency.
3. DOOB-WALD CONSISTENCY IN SPECIAL CASES If an estimating equation does integrate to a potential function, then, under certain conditions, the Doob-Wald argument can be extended to verify that the global maximum of the potential function is consistent. In this section we sketch the proof of the consistency of quasi likelihood estimation in these simple cases. The argument can be extended to generalized estimating equation (2), provided that, with respect to the regression parameter /?, it integrates to a potential function for each fixed 7. This extension is fairly straightforward and will be omitted. For further studies of the potential functions of estimating equations, see Li & McCullagh (1994). Let Y = {Yi : i — 1, ...,n} be independent observations with mean μi(β) and variance Vχ(β) where, similarly as in §2, μι{β) = μ{Xiβ) and Vi(β) = V(Xiβ) for some known functions μ and V. Let s(β, Y) be the quasi score function
s{β,Y) = ΣμiWVΓHβM
- μi(β))
(4)
i=l
PROPOSITION 1. Suppose that the quasi score integrates to some potential function /(/?, Y); in other words (dl/dβ)(β, Y) = s(β, Y), and that there is a measure v with respect to which the integral of 1^^ does not depend on β; that is
ί
(5)
120
LI
Then, under certain regularity conditions, the global maximum ofl(β, Y) is a consistent estimate of βo. PROOF. The argument is similar in spirit to that used in Gourieroux, Monfort, and Trognon (1984). For each β, define a probability measure dQβ(y) = eι^^du(y)/c. Differentiating equation (5) with respect /?, we find that μ(β) = f ydQβ(y). Since /(/?, y) is a linear function of y, it follows that El(β, Y) = f /(/?, y)dQβo{y) for all β. Hence
E{l{β,Y) - l(βo,Y)} = J{l(β,y) < log J
l(βoMdQβo(y)
eι^~ι^dQβo{y)
= log J{dQβ(y)/dQβ0(y)}dQβ0(y) = 0. That is, the expectation El(β,Y) is maximized at β0. The rest of the argument follows Wald (1949), with appropriate regularity conditions imposed to ensure that n~ιl(β, Y) converges in probability to n~ιEl{β, Y) uniformly in β. D
4. A DEVIANCE FUNCTION AND A MINIMAX APPROACH The argument of the last section depends on the existence of the potential function /(/?, Y), and is inapplicable if no such function exists, as is the case for many applications. Thus the questions arise: What is the crucial element in Doob-Wald argument? Must one have something like a likelihood function to maximize in order to apply this argument? At first sight, the existence of a likelihood function seems essential: if there is a function /(/?, y ) , whose expectation is uniquely maximized at /?o, and if f(β,Y) converges in probability to Ef(β,Y) uniformly in /?, then the maximizer of f(β,Y) will be a consistent estimator of the maximizer of Ef(β,Y). But a more careful look at the DoobWald argument reveals that the maximization of a likelihood is not indispensable: if we can uniquely identify the true parameter value βo by examining the function Ef(β,Y), may it be the maximum, the minimum, the turning point, or the minimax, then the β that can be identified in the same way by empirical version of Ef(β,Y), namely /(/?, y ) , should be a consistent estimator of βo. To illustrate this idea, let {Yί,..., Yn} be independent and identically distributed observations with a common density Pβ(yi). Since βo is the maximizer of Eβo logp^(Yi), it is also the minimax point of the function
121
CONSISTENCY OF GEE lo
γ
γ
Eβo &{Pβ'{ \)l Pβ( ι)}
I n
sup Eβo log{pβ>(Yι)/pβo(Y1)} 'eB
β
symbols, = inf sup Eβo \og{pβ>(Y1)/pβ(Y1)}. βεBβ'B ι
This suggests that the minimax of n~ Σ\og{pβι(Yi)/pβ(Yi)}, namely the β defined by the relation ^ _
_
, .„.,#)} = irfsup^iogW(^)/^(^)},
should be a consistent estimator of β0. This is indeed the case: it can be easily verified that β defined above is identical to the maximum likelihood estimate in this case. See Li (1996). The passage from the maximum of a likelihood to the minimax of a likelihood ratio is important because, unlike the likelihood, the likelihood ratio can be generalized to many estimating equations, so that the minimax argument applies very generally. Li (1993a) introduced such a generalization to quasi likelihood equations. This is then further extended in Li (1993b) to generalized estimating equations, which is now recorded below. DEFINITION 1. Let, in the notation of 12, θλ = (βuφuaj)τ and θ2 = (β2,φ2,a2)τ be two points in the parameter space θ. Let ι D : θ x θ x y H> R be the function D(θuθ2) =
The deviance function of the generalized estimating equation (2) is a mapping R : B x B x y *-ϊ Rι defined by
R(βu β2) = D{βx, φifa), ά(βu φiβ,))- fa φ(β2), ά(ft, φih))}The centering function of R(βι,β2) is the mapping J : B x B H-> j?1 defined by J(βuβ2)
=
ED{βι,Eφ(β1),Eά(β1,Eφ(β1)); β2,Eφ(β2),Eά(β2,Eφ(β2))}.
122
LI
Notice that the dependence of D and R on the random observations is suppressed from the notation. The next proposition provides some elementary properties of J, which can be verified along the lines of Li (1993a). P R O P O S I T I O N 2. The centering function J(βuβ2) lowing properties:
(i) Jtfufo) (ii) J(βo,β)
has the fol-
= "(&, βι) for all β1, β2 in B is the negative quadratic form
, Eφ(β), Eά(β, Eφ(β))}{μi(β)
-
(βo)}.
μi
If the matrices W's in (ii) are all positive definite, and if β is identifiable by the assumption of the mean; that is, different βs correspond to different sets of means {μit(β)}, then J(βo,β) < 0 for all β in B. This, combined with (i), implies that supJ(/%,/?)=inf Thus, intuitively, if R(β,β') converges to J(β,β'), satisfies the minimax relation
then any β that
sup #(/?,/?) = inf sup R(β,β') should be a consistent estimator of /?0 This will be proved rigorously in the next two sections.
5. ASSUMPTIONS AND LEMMAS We shall frequently use the condition of stochastic equicontinuity, which can be found in Pollard (1984, page 139). The following definition is a combination of the condition Cl and Lemma 2.1 of Crowder (1986); it is to assume that a sequence of random functions and the corresponding sequence of centered random functions are both stochastically equicontinuous. Let T be a compact set in a Euclidean space, let {/ n (ί;Xχ,...,X n ) : n = 1,2,...}, abbreviated as {/ n (ί,X)}, be a sequence of random functions defined on T, and let {fn{t) : n = 1,2,...} be a sequence of (deterministic) functions of ί, considered as the "centering" functions of {/n}.
123
CONSISTENCY OF GEE
DEFINITION 2. Let T be a compact set in a Euclidean space. The sequence of random functions, {fn(t,X)}, and the associated centering sequence, {/n(*)}, o,re said to obey condition Cl if the following are satisfied: (i) {fn(t,X)} is stochastically equicontinuous on T; that is, for each e > 0, η > 0 there is a positive number δ > 0 such that, limsupPί sup
\fn(t,X)-fn{s,X)\>e}<η
,
(ti) {fn{t)} is equicontinuous on T; (in) There is t0 G T such that {/n(ίo) n = 1, 2,...} is bounded. By the Arzela-Ascoli theorem (Conway 1990, page 175) and the compactness of T, (ii) and (iii) imply that {fn(t)} is totally bounded. Therefore, if {φ(β)} and {Eφ(β)J obey Cl, then the set Fo = {Eφ(β) : β e B,n = 1,2,...} is contained in a compact set, F say. And, without loss of generality we assume that F is such that inf{\φ — φ'\ : φ e F o , φ1 G Fc} > CQ for some e0 > 0. Suppose that, in addition to Cl, we have the convergence φ(β) - Eφ(β) 4 0.
(6)
By Lemma 3.2 of Crowder (1986), the pointwise convergence (6), together with assumptions (i) and (ii), implies the the uniform convergence snpβeB \φ(β) — Eφ(β)\ A 0. It follows that, with probability tending to one, the set {φ(β) : β G B} lies in F , because P{φ(β) e F for all β} > P{\φ(β) - Eφ{β)\ < e0 for all β} =
P{8xιp\φ(β)-Eφ{β)\<eo}-+l.
This proves the following result. LEMMA 1. Suppose {φ(-),Eφ( ) : n = 1,2,...} obeys Cl, and suppose (6) holds. Then, with probability tending to one, {φ{β) : β G B} is contained in a compact set F. With this construction, the condition Cl is easily passed from simple functions to composite functions, a passage of importance since we are to study the limit behaviour of the substituted estimates such as ά(β, φ(β)) as a random function of β. LEMMA 2. Suppose {<£(•), Eφ{ )} obeys Cl on B, and {ά( , •), Eά{; •)} obeys Cl on B x F. Then {ά( ,φ(-)),Eά(ΊEφ( ))} obeys Cl on B.
124
LI
PROOF. By Billingsley (1968, page 221), conditions (ii) and (iii), as applied to {φ( ),Eφ( )} and {ά( , ),Eά( , •)}, imply that the sets {Eφ(β) : β E B,n = 1,2,...} and {Eά(β,φ) : (β,φ) <E B x F,n = 1,2,...} are bounded. So for any β £ B, sup\Eά(β,Eφ{β))\ < sxxpsup\Eά(β,φ)\ < supsupsup|£ά:(/3,(/>)| < oo. «
n φeF
n βς-B φeF
Hence (iii) holds for {Eά( ,Eφ(-))}. To prove (ii), let e > 0, and let δo > 0 be such that limsupsup{|£ά(/?, φ) - Ea(β', φ')\ : \\β - β'\\ < δ0, \\φ - φ'\\ < δ0} < e. n—ϊoo
This is possible by stochastic equicontinuity of {ά( , •)} and compactness of B x F. Let δι > 0 be such that limsupn_^oosup{|£l(^(i5) — Eφ(β')\ : ||/3 - β'\\ < δx} < δ0. It follows that - Eά{β',Eφ{β'))\ : \\β - β'\\
limsnpsnip{\Eά(β,Eφ(β)) n—> o o
To prove (iii), let e > 0, η > 0. Since lim P{φ(β) G F, for all β G B} = 1,
71—> OO
we can, and do, assume φ(β) G F without altering our limit argument. Let δ0 > 0 be such that limsupP{sup \ά(β, φ) - a(β', φ')\ > e} < η/2, n—>oo
supremum being over the set {(/?, φ, βf, φ1) : \\β - β'\\ < δ0, \\φ - φ'\\ < δo} Let δι > 0 be such that - φ{β')\ > δ0} < η/2,
n—>oo
supremum being over the set {(β,β') : \\β - β'\\ < ίi} min{<5o, δι}. It follows that limsupP{ sup n->oo l^ΊK
ά(β,φ(β))-ά(β',φ{β'))
<»
\\β-β'\\<S
ά(β,φ(β))-ά(β',φ(β'))
Put δ =
125
CONSISTENCY OF GEE
sup
φ(β) - φ(β')
S-β'\\<δ
< limsupP {sup ά(β, φ) - ά(β', φ') > e} + % < η, n—y<x>
where the supremum on the last line is over the set {\\β — β'\\ < δo,\\φ —
Φ'\\ < δ0}.
•
L E M M A 3. Suppose (a) for each β G B and φ G F, φ(β) Eφ(β) A 0 and ά(β,φ) - Eά(β,φ) 4 0, and (b) {φ(β),Eφ(β)} obeys Cl on B, and {ά(β,φ),Eά(β,φ)} obeys Cl on B x F. Then the sequence of random functions {ά(;φ(-))-Eά(;Eφ(-)):n
= l,2,...}
(7)
converges weakly in C(B) to a random function degenerated at constant 0; where C(B) is the class of continuous functions defined on the compact set B. P R O O F . First, we show that for each β e B, ά(β, φ(β))-Eά(β, Eφ(β)) converges in probability to 0. In other words, the finite dimensional distributions of the sequence of random functions (7) converge to those of the random function degenerated at constant 0. Let e > 0, and let δ > 0 be such that limsupPJ
sup
\ά(β,φ) - ά{β,φ')\ > e/2J - 0.
(8)
From the discussion preceding Lemma 1, assumptions (a) and (b) imply that the sequence {ά( , ) : n = l,2,...} converges weakly to constant 0 in C(B x F). It follows that limsupP {\ά{β, Eφ(β)) - Eά{β, Eφ(β))\ > e/2} n—xx)
\ά(β,φ) - Eά{β,φ)\ > e/2} = 0. (9)
βeB,φ£F
Therefore,
hmsnpP{\ά(β,φ(β)) - Eά(β,Eφ(β))\ > e} = \imsupP{\ά(β,φ(β)) - Ea(β,Eφ(β))\ > e, \φ(β) - Eφ(β)\ < δ}
126
LI
[by assumption (a)] < limsupP{\ά(β,φ(β)) - Eά(β,Eφ(β))\ > e, \φ(β) - Eφ(β)\ < δ, n—>oo
\ά(β, Eφ(β)) - Eά(β,Eφ(β))\
< e/2} [by (9)]
-ά{β,Eφ{β))\>e/2, n-ϊoo
\φ{β) - Eφ(β)\ < δ] = 0,
(10)
where the right most limit equals 0 because it is no more than the left hand side of (8). By Lemma 2 we know that the sequence (7) obeys Cl. This, together with (10), implies that the sequence (7) is tight in C(B). This proves the asserted result. • At this stage it is helpful to see through an example just what assumption Cl means, and how it can be verified for a specific moment structure. We do so by look into the moment estimate of φ suggested by Liang & Zeger (1986). The assumptions about the estimates of a suggested may be investigated following a similar procedure. E X A M P L E 1. For simplicity, we assume that πi = n for all i. Let B be a compact set in RP. Liang & Zeger (1986) suggested estimating φ by averaging over the residues based on the marginal moment assumptions, as follows
Actually, Liang & Zeger used (K — p)n as the denominator; we can ignore the finite number p without affecting the limit argument. Let β be an arbitrary but fixed parameter value, and βf a point in an open ball Oβ centered at /?, whose radius is yet to be determined. Let e and η be two positive numbers. By direct calculation,
\μu(β') +μu(β)\} * \μ*(P) ~ μ*(β)\
We now verify that {φ(β)} is stochastically equicontinuous if the following conditions are satisfied
127
CONSISTENCY OF GEE (i) For each ί, the sequences {μui') ' i — 1,2,...} and {Vit( ) : i — 1,2,...} are equicontinuous and the second sequence is bounded away from 0. (ii) There are parameter values βa and βb for which {μit{βa) '• i = 1,2,...} and {Vit(βb) : i = 1,2,...} are bounded.
By (i) and (ii), for each ί, \μit(β)\ is bounded for i = 1,2,..., and β e B. Therefore, since n is finite, \μa(β)\ < Mx. Similarly, l/Vit{β) < M2. Hence (11) implies that 1.
sup \φ(β') - φ(β)\ < - Λ Σ {\Yit\ x sup \μit(β') - μit(β)\}
β'eo0
Kn
K n
β>eO
i t
Σ SUP iMit(^) - μu(β)\ Σ SUP
Σ { l y « - ^(β)\2 x ^up |vit(/?') - vit{β)\). Assumptions (i) and (ii) also imply that the sequence
+ /4(A>)} : K = 1, 2,...} is bounded, and so both {{Kn)~ι Σiit \Yit\ : i — 1,2,...} and {(Kn)~ιΣiyt \Yit - μ%t{β)\2 i = 1,2,...} are bounded in probability. In other words, there is a positive number M3 for which or
Σ\Yit-μit{β)\2>Mz}<e-
By assumption (i) we can find δ > 0 as the radius of Oβ such that
max { sup \μit(βf) - μit(β)l
sup \Vit(β') - Vit(β)\\
for all ί = l,2,...,n;i = 1,2,.... Then, l i m s u p ^ ^ P ί l ^ ' ) - φ{β)\ > η) < e and, hence, {>(•)} is stochastic equicontinuous. Evidently assumptions (i) and (ii) also imply that {Eφ(-)} is a equicontinuous and bounded sequence. Therefore {φ(β),Eφ(β)} satisfies C l . •
128
LI
From Lemma 3 and the argument preceding Lemma 1, it is easy to see that, with probability tending to one, the random set {ά(β, φ{β)) : s β e B} is contained in a compact set in R . We write the set as A, and write the compact set B x F x A in RP+1+S as θ . It now takes only a small step further to obtain the limit behaviour of the deviance function introduced in the last section. L E M M A 4. Suppose that, in addition to the assumptions of Lemma ι ι 3, the sequence of random functions {rΓ D{θι,θ2) — n~ ED(θι,θ2) : 2 2 n = 1,2,...} obeys Cl in θ , and that for each (θuθ2) G θ , 1
n- D{θuθ2)
1
- n- ED(θuθ2)
4 0.
Then the sequences {n~1i?( , ),n~ 1 J( , •) : n = 1,2,...} obey Cl in B2, and n~1{i?( , •) — J ( , •)} converges weakly in C(B2) to a constant function 0. The proof is similar to that of lemma 3, and will be omitted. 6. CONSISTENCY OF GENERALIZED ESTIMATING EQUATIONS We are now ready to prove that any minimax point of the function R{β\ >β2) is consistent. As we shall see in the next section, under mild conditions, any minimax points are solutions to generalized estimating equation (2). In other words, we can identify the consistent solutions of (2) by verifying that they are minimax points. Two numerically more convenient criteria will be presented in the next section. The function J(β, β0) plays the same role as the Kullback-Leibler information number in the Wald's proof of consistency of maximum likelihood estimate, and R(βuβ2) plays the role of log likelihood ratio. However, in our case, neither R(βι,β2) nor J(β\,β2) has the form f(β2) — f{β\)' This is why a minimax deviance procedure must be used in place of the maximum likelihood procedure used in the Wald's proof. THEOREM 1. Suppose (a) The sequence {φ{β), Eφ(β)} obeys Cl in B, {ά(β, >), Eά(β, φ)} obeys Cl inBxF, and{n-ιD(θuθ2),n-ιED{θuθ2)} obeys Cl in θ 2 ; (b) For each fleθ, φ(β) - Eφ(β) A 0, ά(β, φ) - Eά(β, φ) 4 0, and rΓ1D(θuθ2) - rΓιED(θuθ2) 4 0;
129
CONSISTENCY OF GEE (c) For each β in B, β φ βo, the sequence of quadratic forms 1 {n~ J(β,βo)} satisfies
>0. Then any parameter value β that satisfy the relation sup R(β, β) = inf sup R(β, β')
(12)
is a consistent estimate o Notice that the estimates 0, ά need not be consistent; in fact, they need not converge in probability at all. Furthermore, the theorem asserts that all minimax points (if there are more than one) are consistent. The proof of the theorem is along the lines of Theorem 1 of Li (1996); here we only describe the idea briefly and highlight the difference. PROOF. Let Oβ0 be an arbitrary but fixed open ball centered at the true parameter value βo. Let β φ βo and Oβ be an open ball whose closure does not contain βo. By assumptions (a), (b), and Corollary 1, the sequence n~ι{R(β, βo) — J(β, βo)} converges weakly in C(B) to the random function degenerate at constant 0. This, together with assumption (c), implies that with probability tending to 1, inf^o^ n~ιR(β\ β0) > δ for some positive δ that may be dependent on β. Now the class of such open balls {Oβ : β G B \ Oβ0} form an open cover of B \ Oβ0. By compactness of B \ Oβ0 there is a finite subcover {Oi : ί = 1, ...,&}. Now on each O^ there is a ^ > 0 such that, with probability tending to one, inϊβ'zOi n~1R(βf, βo) > δι. It follows that, with probability tending to 1, mΐβt^Oβ0r^~1R(β^ βo) > δ for some positive δ. Since n-ιR(βu β2) > udβ>ίo0o n~lR(P, βo), we see that lim P { inf sup rΓιR{β, β') > δ\ = 1.
(13)
Meanwhile, by a similar argument, one can show that for every positive δ>0 lim P {inf sup rΓιR{β, β1) < δ] = 1. n->oo
^βEBβ,eB
(14)
>
However, (13) and (14) imply that, with probability tending to one, any minimax point β of R(β, β1) is in Oβo. In other words, β converges in probability to βo Π
LI
130
7. TWO PRACTICAL CRITERIA FOR CONSISTENCY Theoretically, Theorem 1 solves the consistency problem. Practically, the search for the global minimax of R(β, β') may be numerically difficult, and so we need simpler criteria for consistency. In this section we will introduce two such criteria. To achieve these, we need to assume that the function R(β, β') do not behave too irregularly along the straight line β2 = β\, specifically, that, with probability tending to one, the following condition is satisfied inf sup R(β, β') = sup inf R(β, β'). B
β£
(15)
βεB
β'eB
Roughly, the condition requires that, as β moves pass the true parameter value βo, the mode of the function R(β, •) moves continuously. For further discussion of this point, see Li (1996). In practice, such continuity is almost always satisfied; It is a challenge to find a counter example. Nevertheless, condition (15) is not implied by the antisymmetry of the function of i?, nor assumption (iii) of Theorem 1. The latter two conditions only guarantee that Pin"1] 1
inf sup R{β,β') - sup inf R(β,β')\ β€B β'€B
P€BP€B
< e) -> 1 for each e > 0, }
as can be seen from the proof of Theorem 1. Condition (15) is not crucial from a theoretical point of view because, as we have seen, the minimax β defined in (12) is consistent whether or not (15) holds. And, without requiring condition (15), we can show that β is efficient using a method similar to that used in the proof of Theorem 2 of Li (1996). However, the condition does simplify the computation and discussion, because it guarantees that the minimax point of R is necessarily a solution to the generalized estimating equation. Let B be the set of all solutions to the generalized estimating equation (2). C O R O L L A R Y 1. Suppose (a) the condition (15) is satisfied, (b) the minimax of R is in the interior of B, (c) i?( , •) is a differentiate everywhere in B2. Then, under the assumptions of Theorem 1, any solution β of equation (2) that satisfies )=0
(16)
βeB
is consistent. PROOF. Let β be a (any) minimax of R. Under the assumptions (a), (b), and (c), it is easy to show that β is in B. The argument is similar to
131
CONSISTENCY OF GEE
the proof of Theorem 3(b) of Li (1996), and the detail is omitted. Now let β be a solution to (2) that satisfies (16). Since β is the minimax, and since β € B, we have
0 < sup R(β, β) < sup R(β, β) = 0; so sup R(β, β) = 0. βB
Since βeB, the above implies R(β,β) < 0. But, since β € J3, (16) implies R(β, β) < 0. Hence, by the antisymmetry of R, R(β, β) = 0. Now let βo be the true value of the parameter /?, let e > 0 and p > 0 be arbitrary but fixed, and let Oβ0 be the open ball of radius p centered at βo. By an argument similar to that used in the proof of Theorem 1, there is a positive number η for which P { sup n~ιR(βOj β) < -η\ -> 1, as n -> oo. ι
(17)
}
βίoβ
Now by Lemma 4, the sequence {n~1i2( , •) : n = 1,2,...} is stochastically uniformly equicontinuous, therefore there is a δ > 0 for which \imsvLpP{sup\n-ιR(βuβ2)-n-ιR(β[,β2)\
> η/2} < e,
where the supremum is taken over the set {(βuβ[,β2) β[\\ <δ}. By Theorem 1,
(18)
G B3 : \\βι —
C δ) = 1.
(19)
Combining (18) and (19), it follows that ),β)-R(βo,β)\>η/2}<e.
(20)
Hence, by (17) and (20), n-too
< lim sup P{/? i Oβo, rΓι sup R(βo,β) < -η, n-*oo 1
β$Oβ0
n- suv\R(β,β)-R(β0,β)\<η/2}
+e
< lim sup P {n-ιR{βQ, β) < -η, n" 1 \R(β, β) - R(β0, β)\ < η/2} + e n—>oo
< lim sup P {nΓιR{β, β) < -η/2} + e = e. 71—>OO
132
LI
Since e is arbitrary we conclude that P(β £ Oβ0) -> 0.
•
If we do not want to find all the solutions of (2), we can use the following immediate consequence of Corollary 1. C O R O L L A R Y 2. Under the conditions of Corollary 1, any solution β of (2) that satisfies ()
(21)
=0
βeB
is consistent. Thus, in order to determine whether a solution β belongs to a consistent sequence, it suffices to check whether #(/?, β) < 0 for all β. We now apply the minimax-deviance approach to a numerical example. 8. A NUMERICAL EXAMPLE EXAMPLE 2. Let {Yit : t = 1,2; i = 1,..., 30} be thirty bivariate observations. For each z, (lα, l y follows a bivariate normal distribution with expectation located at an unknown point of a cardioid and correlation matrix completely unspecified. That is ( μ\ \
( 2cos/?-cos2/3 \
V μf J = [ 2 8m/? - sin2/3 J
(x/
v
λ
^rθ(YιUYi2)
./ 1
=φ^
a
a\ χ
J,
where θ = (/?, 0, a) are unknown and β is the parameter of interest. Thirty observations are generated with β = τr/4, φ = 0.3 and a = 0.3, and are presented in Figure 1. There are four solutions to the generalized estimating equation. The solution β = 0 will be ignored since it has nothing to do with data. The data and the solutions are presented in Figure 1. The three non-trivial solutions are βQ = 0.28τr, βι = 0.91π, and β2 = 1.92π. In this simple example, the likelihood function is available, and it takes a global maximum at /30, a local maximum at β2, and a global minimum at β\. Plotted in Figure 2 is the curve l(β) — sup^/ 6 B R(β, /?'); so the minimum point of l(β) is the minimax of R(β,βr). Thus by Theorem 1, β0 is a consistent solution. Figure 2 also indicates that lφ0) — inf/^β l(β) = 0; in other words sup^ G β R(βo> β) = 0. So Corollary 2 also tells us that βo is consistent. ACKNOWLEDGEMENT I wish to thank a referee for his (or her) very useful comments. The research is supported by the National Science Foundation Grant DMS-9306738.
CONSISTENCY OF GEE
133
Figure 1: Multiple solutions for the cardioid model. bO, bl, b2 are βo, βu $2- The three +'s mark the positions that correspond to the three solutions.
LI
134
8 -
saddle point = consistent solution
200
300
beta
Figure 2: Saddle point of the deviance function. The curve is supβ,eBR(β,β') as a function of β. The minimum point corresponds to the saddle point of R.
CONSISTENCY OF GEE
135 REFERENCES
[I]
P. (1968). Convergence of Probability Measures. John Wiley & Sons, Inc.
[2]
CONWAY, J.B. (1990). A Course in Functional Analysis. 2nd edition. New York: Springer.
[3]
DOOB,
[4]
CROWDER, M. (1986). On consistency and inconsistency of estimating equations. Econometric Theory 3, 305-30.
[5]
FIRTH, D. & HARRIS, I.R. (1991). Quasi-likelihood for multiplicative random effects. Biometrika 78, 545-555.
[6]
GODAMBE,
[7]
GODAMBE,
[8]
C., MONFORT, A., & TROGNON, A. (1984). Pseudo maximum likelihood methods: Theory. Econometrica 52, 681700.
[9]
R.G. (1984). Bounds and expansions for Fisher information when moments are known. Biometrika 74, 233-245.
BILLINGSLEY,
J.S. (1934). Probability and Statistics. Math. Soc. 36, 759-775.
Trans.
Amer.
V.P. (1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist. 31, 1208-1211. V.P. and HEYDE, C.C. (1987) Quasi-likelihood and optimal estimation. Int. Statist. Rev. 55, 231-244. GOURIEROUX,
JARRETT,
[10] Li, B. (1993a). A deviance function for the quasi likelihood method. Biometrika 80, 741-753. [II] Li, B. (1993b). Deviance functions for generalized estimating equations, unpublished manuscript. [12] Li, B. & MCCULLAGH, P. (1994). Potential functions and conservative estimating functions. Ann. Statist. 22, 340-356. [13] Li, B. (1996). A minimax approach to consistency and efficiency for estimating equations. To appear in Ann. Statist. 24. [14]
LIANG,
K.Y. & ZEGER, S.L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13-22.
[15]
MCCULLAGH,
11, 59-67.
P. (1983). Quasi-likelihood functions. Ann.
Statist.
136
LI
[16]
MCCULLAGH, P. (1990). Quasi-likelihood and estimating functions. In Statistical Theory and Modelling: in honour of Sir David Cox. Ed by D.V.Hinkley, N.Reid, and E.J.Snell. London: Chapman & Hall.
[17]
MCCULLAGH,
[18]
D.L. (1984). Estimation for aggregate models: the aggregate Markov chain. Can. J. Statist. 12, 265-282.
[19]
D. (1984). Convergence of Stochastic Processes. New York: Springer.
[20]
C.G. k MCLEISH, D.L. (1988). Generalization of ancillarity, completeness and sufficiency in an inference function space, Ann. Statist 16, 534-551.
[21]
C.G. & MCLEISH, D.L. (1989). Projection as a method for increasing sensitivity and eliminating nuisance parameters, Biometrika 76, 693-703.
[22]
SMALL,
[23]
WALD,
[24]
WEDDERBURN,
[25]
WOLFOWITZ,
P. & NELDER, J.A. (1989). Generalized Linear Models. 2nd edition. London: Chapman & Hall.
MCLEISH,
POLLARD,
SMALL,
SMALL,
C.G. & MCLEISH, D.L. (1994). Hubert'Space Methods in Probability and Statistical Inference. New York: John Wiley. A. (1949). Note on the consistency of maximum likelihood estimate, Ann. Math. Statist. 20, 595-601. R.W.M. (1974). Quasi-likelihood, generalized linear models, and the Gauss-Newton method. Biometrika 61, 439447. J. (1949) On Wald's proof of the consistency of the maximum likelihood estimate. Ann. Math. Statist. 20, 601-602.
139
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
EXTENDED QUASI-LIKELIHOOD AND ESTIMATING EQUATIONS J. A. Nelder Imperial College, London Y. Lee Seoul National University, Korea
1
INTRODUCTION
This paper compares the estimation of mean and dispersion parameters using (i) extended quasi-likelihood (EQL) and (ii) optimum estimating equations (OEE). 1.1
Extended quasi-likelihood (EQL)
Quasi-likelihood (QL) was introduced by Wedderburn (1974) as a way of weakening the distributional assumptions of GLMs (McCullagh and Nelder, 1989) by specifying only the form of the linear parameters β and the variance function V(μ), which expressed the variance as a function of μ. The linear score function dl/dμ = (y-μ)/V(μ) of GLMs was preserved in the wider class of models, and estimates obtained from maximizing the QL have many properties analogous to ML estimators. The importance of QLs is that they allow us to extend the ideas of GLMs where no exponential family exists to supply an error structure of the original form. EQL was introduced by Nelder and Pregibon (1987) to allow comparison of different variance functions on the same data. It is the key to methods for the joint modelling of mean and dispersion based on QL ideas. A QL model defines a deviance component d{ for observation i with mean μi given by
140
NELDER AND LEE
The deviance D = Σdi. EQL is most simply stated in terms of its extended deviance ^
)
.
(2)
Here the dispersion parameter φ is allowed to vary over the observations. For distributions of the GLM family, the EQL can be derived as a saddlepoint approximation in which all the factorials are replaced by their Stirling approximations. It is thus exact for the Normal and inverse gamma distributions, which have no factorials, and for the gamma differs only in the normalizing factor. Note that the EQL corresponds to an exact distribution of the kind Jorgensen (1996) calls a regular proper dispersion model whenever the normalizing factor is exactly one. The quantity D+ can be used as a criterion to fit models when both the mean and dispersion are assumed functions of explanatory variables.
1.2
Optimum estimating equations (OEE)
If a function g(Y θ) of a random variable Y, having a distribution depending on 0, has zero mean for all 0, then g(Y θ) is an estimating function. If θ is a scalar then the optimum estimating equation (OEE) is based on
h = g.E(dg/dθ)/E(g2) and for a sample of n independent Ys the OEE is given by
More generally if θ is a vector, V is the conditional covariance matrix of 3, given a model matrix A and E(g\A) = 0, the optimum estimating function is given by where
If U is linear (quadratic) in y we have an optimum linear (quadratic) estimating function (Godambe, 1991). It may happen that we require more than one estimating function, and this occurs if we require to estimate both mean and dispersion parameters in a model. Godambe and Thompson (1989) derive such joint estimating equations, and use components (y — μ) and (y — μ) 2 , orthogonalizing the second with respect to the first. We discuss this case further in Section 2.2 below.
141
EXTENDED QUASI-LIKELIHOOD
2 2.1
JOINT ESTIMATION OF MEAN AND DISPERSION The EQL criterion
For φi given, the estimates of β obtained from (2) are just those given by the Wedderburn QL equations
Σ
y
μjdμ dβ ΦiVi dβ
while for given μ, D+ is the same as would be obtained from a QL model with response variable d and variance function φ2, i.e. that of the gamma distribution. That this may be a good approximation even when y has a nonnormal distribution may be seen from the result of Pierce and Schafer (1986) that the deviance function is nearly an optimal normalizing transform. Figure 1 shows for the Poisson distribution the expectation and variance of d, which should be 1 and 2 respectively, and also the correlation of (y — μ) and d, which should be zero. For μ > 3 the approximations are acceptable, but, not surprisingly, break down for small μ. We are thus led to form two interlinked GLMs, one for the mean and one for the dispersion. That for the mean has response y, prior weight φ~ι, a linear predictor η = ΈβjXj, and some suitable link and variance functions. For the dispersion we have response d, a linear predictor ξ — Σjk^k with explanatory variables ui,^2,..., a, variance function φ2 and a suitable link, usually chosen as log. An obvious algorithm for fitting the joint model is a 'see-saw' one, in which the model for the mean is fitted first with current estimates of φι, then the model for the dispersion is fitted given current estimates of μι and hence of d{. Three cycles, starting with 0 = 1, are often sufficient tofitthe joint model. Standard GLM techniques may be used to check both models for internal consistency. 2.2
The OEE criterion
Godambe and Thompson (1989) derive what they call an extended quasiscore function which gives estimating equations of the form ^
0,
(3)
for the mean parameters β where, hi = {Vi - μi)2 - φV(μi) - Ίu{φV{μi))ll2{yi
- μi)
(4)
142
NELDER AND LEE
and Σhi = 0, for the dispersion parameter φ. In these equations ju and 72; are the standardized third and fourth cumulants and 1/2
2
α* = (7ii - Ίu)/{Φ V(μif (Ί2i
2
+ 2 - 7ii) },
where
is the exponential skewness. For simplicity these equations are given assuming a constant dispersion. If φ is structured they can be extended in a standard way. The form in which α; is defined shows immediately that for distributions of the GLM family aι is identically zero, because 71 = 7J, so that (3) reduces to the QL equations for β. Thus for GLM models, the optimum linear, the optimum quadratic and the quasi-likelihood estimating equations for β are all the same. In (4) the last term is identically zero for normal errors, or for gamma errors with a log link, and is usually much smaller than the first; thus, approximately, (4) is equivalent to equating the Pearson X2 to its expectation, uncorrected for d.f. lost in fitting β.
3
COMPARISON OF EQL AND OEE CRITERIA
There are two basic (but connected) differences in the equations for the joint estimation of mean and dispersion produced by the use of EQL and OEE. The first concerns the response variable for the dispersion; the use of the function (y — μ)2 in OEE equations leads to the Pearson X 2 , whereas the EQL equations use the deviance component. The second difference, which follows from the first, is that the OEE equations, in general, require knowledge of 71 and 72, or equivalently the third and fourth cumulants, whereas the EQL equations do not. For normal models 71 = 72 = 0, and the EQL and OEE equations are identical. For GLM non-normal models the equations for the mean parameters are the same, but those for the dispersion differ, while for nonGLM models both sets of equations are different. What reasons are there for preferring one method to the other?
3.1
Pseudo-likelihood v quasi-likelihood
Nelder and Lee (1992) compared estimates from the EQL having D+ = Σdi/φi
+
Σlog(2πφiV(yi)),
EXTENDED QUASI-LIKELIHOOD
143
with pseudo-likelihood (PL) estimates obtained from minimizing Dp = ΣXf/φi + Σ\og(2πφiV(μι)) for three non-GLM models. Model (1) involved the NBα distribution, a form of negative binomial distribution obtained mixing the Poisson with a gamma where the shape parameter v varies with μ, instead of the scale parameter a. The response y has υar(y) = μ(l + α), i.e. looks like an overdispersed Poisson. We did a 5-factor simulation in a 2 5 " 1 fractional factorial, the factors being aspects of the configuration of the means we thought might be important. Model (2) was a mixture of two Poisson samples with a ratio of means of 4:1; the variance function was assumed to have the form >μα, and estimates of a were of interest. Model (3) was a Poisson-Inverse Gaussian mixture with the inverse Gaussian distribution parameterized so that the variance function had the form μ+aμ2 (the IG-2 distribution). The experimental factors were sample size and value of α. For detailed results see our paper. The main conclusion was that though the bias of the maximum EQL (MEQL) estimator was usually larger than that of the maximum PL (MPL) estimator, this was more than offset in moderate sample sizes by the larger variance of the latter; the result was that in terms of MSE, the MEQL estimator was never appreciably inferior to the MPL estimator, and was often much better. An interesting result was that in finite samples the MEQL estimate was frequently better than the ML estimate. 3.2
The value of knowing 71 and 72
The OEE for mean and dispersion require knowledge of 71 and 72 for nonGLM models, and knowledge of these can improve estimates. For example, Lee and Nelder (unpublished) consider a model with log link and NBα errors, and two groups having means μ and 2μ. Table 1 shows the asymptotic variance ratios for estimates of the group difference for various μ and φ. The ML estimates are equivalent to those derived by knowing all the cumulants, the estimates based on quadratic estimating functions (QEF) to knowing the third and fourth cumulants, and the MEQL estimates to assuming an exponential-family pattern for the cumulants. For large overdispersion (φ = 5) there is considerable loss of efficiency of MEQL relative to the ML estimates, with the QEF results showing that a considerable part of this loss can be recovered if 71 and 72 are known. Similar results were found with another example using the IG-2 distribution for errors. While these results are interesting, they assume that the distribution is known at least up to the fourth cumulant. In practice this is almost never so.
144
NELDER AND LEE
Attempts to estimate 71 and 72 from moderate amounts of data may lead to estimates with considerable errors; for example, for a sample of 10 from a Poisson distribution with μ = 5, 100 simulated samples gave 33% of estimates of 71 with negative values, i.e. of the wrong sign. We need to know, therefore, what loss of information arises with various estimators when we assume values p\ and p2 for the unknown 71 and 72. We shall be mainly concerned with the optimum quadratic estimating function, the ML estimator and the MEQL estimator. For simplicity we restrict the argument to independent Y{. The QL equations assume a GLM pattern of cumulants. The optimum QEF equations improve efficiency by using information from the third and fourth cumulants and the ML equations from the all the cumulants. So the ML estimator will be most informative if the true model is known. But it is often not consistent if a model is wrongly chosen. Consider the trace or determinant of normalized asymptotic variance rΓ l coυφ) as the risk. For a given V(μ) with μ fixed, the MEQL estimator is a mini-max estimator among QEF estimators, since its risk remains constant, i.e. does not depend upon the true values of 71 and 72; it attains the minimum risk under the GLM skewness. The MEQL estimator is also the ML estimator if a GLM family exists. If so it is again a minimax estimator among ML estimators for the class of distributions with a given V(μ). Therefore, the MEQL estimator would be most conservative (in the sense that the possible maximum risk is minimal) against a possible misspecification of either the likelihood or cumulants of the model. λ 2
Godambe and Thompson's joint QEFs with p\ — ρ2 = 0 lead to the Normal ML estimator for β. For Normal heteroscedastic linear models, assuming only the first moment is correctly specified, Carroll and Ruppert (1982) showed that the MEQL estimator is robust against a small variancefunction mis-specification compared with the Normal ML estimator. Under mild regularity conditions, the consistency of the MEQL estimator depends only upon the correct specification of the regression while that of the optimum QEF estimator requires also the correct specification of V(μ)\ see Crowder (1987). The QEF equations (2) become the QL equations when a{ = 0. When V(μ) is mis-specified, the QEF estimator is no longer consistent unless aι = 0; see (2). So the MEQL estimator is most robust among QEF estimators against a mis-specification of V(μ). We illustrate the nature of the minimax property of the MQL estimator by two of the examples from Section 3.1 (Lee and Nelder, unpublished). In the first the unknown true distribution is the NBα distribution. For 1/2 1 2 this 71 = 0i/μ and 72 = 02/μ, where θλ = (1 + 2α)/(l + a) ' and 2 0 2 = (1 + 6α + 6α )/(l + a). Suppose we assume values p\ = λi/μ 1 / 2 and p2 = \2Jμ with λi and λ2 being values of Θ2 and Θ2 for which a = 0.2, 1 and
EXTENDED QUASI-LIKELIHOOD
145
5, corresponding to small, moderate and large amounts of overdispersion. Figure 2 shows the lower bound of the variance ratio for the QEF estimator with respect to the MEQL estimator over the θ\ scale, when μ = 1, an unfavourable value where the MEQL estimator is known from simulation to have low efficiency. It is clear that having the assumed value of θ\ too high can lead to much greater loss of efficiency than the corresponding gain in efficiency when the correct value is chosen. For larger μ this effect would be even more marked. In the second example the unknown true distribution is the IG-2 distri1 2 2 bution. Here 71 = 3/v / and 72 = 15/^, where var(Y) = μ + μ /v. Let assumed values p\ and P2 be 71 and 72 values at v = 0.5, 1, and 2, so that the corresponding values of pi are 4.243, 3, and 2.121 respectively. Figure 3 shows similar curves to Figure 2 over 71, the true skewness scale. Here the losses from using the MEQL estimator when p\ and p2 are nearly right are much less than the gains when they are too large.
4
CONCLUSION
We can derive the EQL equations for mean and dispersion from the OEEs by replacing the hi of equation (4) by d{ — φi, where d is the deviance component and φ the dispersion parameter, which may be structured. If we make three approximations, namely that E(d) = >, υar(d) = 2φ2 and corr(y — μ d) = 0, the resulting estimating equations are given by
and these are just those obtained from the use of EQL. Note that the three approximations become exact as μ -> 00; however, their joint effects for small μ need further investigation. The properties of the MEQL estimator give it a special place among estimators in the joint estimation of mean and dispersion.
References
CARROLL, R. J. and RUPPERT, D. (1982). A comparison between maximum likelihood and generalized least squares in a heteroscedastic linear model. J. Am Statist Assoc, 77, 878-882.
146
NELDER AND LEE
COX, D. R. (1993). Some remarks on overdispersion. Biometrika, 70, 279274. CROWDER, M. (1986). On consistency and inconsistency of estimating equations. Econometrics Theory, 3, 305-330. CROWDER, M. (1987). On linear and quadratic estimating functions. Biometrika, 74, 591-597. FIRTH, D. (1987). On the efficiency of quasi-likelihood estimation. Biometrika, 74, 2333-245. FIRTH, D. (1988). Multiplicative errors: lognormal or gamma? J. R. Statist Soc. B, 50, 266-268. GODAMBE, V. P. and THOMPSON, M. E. (1989). An extension of quasilikelihood estimation. J. Statist. Plann. Inference, 22, 137-152. GODAMBE, V. P. (Ed.) (1991). Estimating Functions. Oxford: Clarendon Press. JORGENSEN, B. (1996). Proper dispersion models. Braz. J. Probab. Statist, (to appear). McCULLAGH, P. (1983). Quasi-likelihood functions. Ann. Statist, 11, 59-67. McCULLACH, P. and NELDER, J. A. (1989). Generalized Linear Models, 2nd edn. London: Chapman and Hall. NELDER, J. A. (1989). Discussion of the paper by Godambe and Thompson. J. Statist. Plann. Inference, 22, 158-160. NELDER, J. A. and LEE, Y. (1992). Likelihood, quasi-likelihood and pseudo-likelihood: some comparisons. J. R. Statist. Soc. B, 54, 273284. NELDER, J. A. and PREGIBON, D. (1987). An extended quasi-likelihood function. Biometrika, 74, 221-231. PIERCE, D. A. and SCHAFER, D. W. (186). Residuals in generalized linear models. J. Am. Statist Assoc, 81, 977-986. WEDDERBURN, R. W. M. (1974). Quasi-likelihood functions, generalized linear models and the Gauss-Newton method. Biometrika, 61, 439-447.
EXTENDED QUASI-LIKELIHOOD
147
Table 1: Asymptotic variance ratios of the ML, MQL and optimum QEF estimators of group difference for NBa distribution
ML/MQL QEF/MQL
1 .875 .900
φ=2 5 .963 .966
10 .980 .981
50 .996 .996
1 .608 .768
φ=5 5 .805 .865
10 .885 .911
50 .974 .976
2.5-
var(d)
2.0-!
1.5-
Eld)
1.0-;
cov(y-y; d)
0.0-1 0.0
5.0
75
10.0 12.5 15.0 17.5
20.0
Figure 1: Properties of the Poisson deviance as a function of μ.
148
NELDER AND LEE
Figure 2: Asymptotic variance ratio of QEF/MEQL estimators as a function of θ example with NBα distribution. Curves with increasing λ; correspond to increasing amounts of actual overdispersion. Abscissa is assumed amount of overdispersion.
Figure 3: Asymptotic variance ratio of QEF/MEQL estimators as a function of 7i example with IG-2 distribution. Curves with increasing p - 1 correspond to increasing amounts of actual overdispersion. Abscissa is assumed amount of overdispersion.
149
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH
SERIES
QUASI-LIKELIHOOD REGRESSION MODELS FOR MARKOV CHAINS Wolfgang Wefelmeyer University of Siegen Abstract We consider regression models in which covariates and responses jointly form a higher order Markov chain. A quasi-likelihood model specifies parametric models for the conditional means and variances of the responses given the past observations. A simple estimator for the parameter is the maximum quasi-likelihood estimator. We show that it does not use the information in the model for the conditional variances, and construct an efficient estimating function which involves estimators for the third and fourth centered conditional moments of the responses. In many applications one assumes that the innovations are not arbitrary martingale increments but independently and identically distributed. We determine how much additional information about the parameter such an assumption contains. To make the exposition more readable, we first treat the case in which only the conditional mean is specified. AMS 1991 subject classifications. Primary 62G20; secondary 62M05. Key words and Phrases. Efficient estimator, quasi-likelihood, Markov chain, weighted nonlinear least squares, autoregression.
1
Introduction
Suppose we observe covariates Xι and responses Yi which jointly form a homogeneous p-order Markov chain Z{ = (Xi,Yi). We write Q(Zi_i, . . . , Zi-p,dz) for its transition distribution, and for the conditional mean and variance of the response we write =
/ / Q(zp-ι,...,zo,dx,dy)y,
=
J J Q{zp-u
, *o,dx, dy){y - m{zp-U . . . , z0))2
150
WEFELMEYER
If we have a parametric model m = m^ for the conditional mean of the response, we can introduce a large class of martingale estimating functions ...,
U
Zi-P)),
(1.1)
i=p
with w#(zp-ι, ...,ZQ) an arbitrary weight function. The corresponding estimators for ϋ are defined as solutions of M$n — 0. We indicate in Section 3 that an efficient estimator is obtained with the choice wϋ(Zi-U
. . . , Zi-p) = ΰi-iίZi-i,..., Z i - p J - ^ ί Z i - i , . . . , Zi_p),
(1.2)
with ϋi-ι an estimator for υ based on the observations up to time i — 1. By efficiency we mean asymptotic optimality among all regular estimators in the sense of an appropriate version of Hajek's (1970) convolution theorem, not just optimality within some class of estimating functions. For the case of no covariates, a first-order chain and a one-dimensional parameter, a rigorous proof is given in Wefelmeyer (1996a). The estimating function is an adaptive version of the quasi-score function. The model is described by all transition distributions Q which fulfill m = πiϋ for some ΰ. It could be interpreted as a semiparametric model by writing Q(zp_i,..., ZQ, dx, dy) = M(zp-u
••-,*()> dx, dy - m#(zp-ι,...,
* 0 ))
with / / M ( z p - ι , . . . ,zo,dx,dy)y = 0, and considering M as nuisance parameter. In many applications one uses more specific models, Yi = mϋ(Zi-ι,...,
Zi-p) + si,
where the ε» are i.i.d. with mean zero and known or unknown distribution, rather than arbitrary martingale increments. We call such models regressionautoregression models. Again, m — πiβ. We show that the specific structure contains additional information about ΰ, except when the ει are normal. The efficient estimators that have been constructed for specific such models are, however, not based on estimating functions. If we have, in addition to m = ra^, a parametric model v = v# for the conditional variance of the response, with the same parameter ΰ, then the model is called a quasi-likelihood model. The best weight in (3.1) is l then w$ = Vfi rhu, giving the maximum quasi-likelihood estimator. It is as good as the estimator corresponding to the estimated weights (1.2). This implies that the maximum quasi-likelihood estimator does not use any of the information in the model assumption υ = υ#. We also note that if the
151
QUASI-LIKELIHOOD REGRESSION
model v = υ# is misspecified, then the weights (1.2) lead to a strictly better estimator. In a quasi-likelihood model one can introduce further martingale estimating functions 2
i - mϋ{Zi-U ..., Zi.p))
-
We show in Section 4 that an appropriate combination with (3.1), with weights involving estimators for the conditional centered third and fourth moments of the response, gives an efficient estimator. For the case of no covariates, a first-order chain and a one-dimensional parameter, a rigorous proof is given in Wefelmeyer (1996b). The estimating function is an adaptive version of the extended quasi-score function. Recent reviews of quasi-likelihood methods are McCullagh (1991) and Firth (1993). Again, in many applications one uses more specific models Yi = mt{Zi-u
, Zi-p) +
p
where the e% are i.i.d. with mean zero and variance one. We call such models heteroscedastic regression-autoregression models and show again that the specific structure contains additional information about ϋ. We do not give precise regularity conditions for our results. They can be obtained by fairly straightforward, if tedious, modifications of Wefelmeyer (1996a, 1996b).
2
Notation
We observe fc-dimensional covariates X{ and real-valued responses Yi. We suppose that Z{ — (X^, Yi) form a homogeneous and ergodic p-order Markov chain. For the p values of the chain preceding Zi we write Zi-\ = (Zi-ι,... ,Zi-p)'. The chain starts with an initial value Z p _i = (Zp-ι,... ,Zo)f. For the transition distribution of Zi given Z;_i = t we write Q(t,dz). Here and in the following, we will always write z = (x,y) for the variables corresponding to the random variables Zi = (Xi,Yi), and t = (r,s) corresponding to Z»_i = pfj_i, Y<_i). Boldface letters denote corresponding p-dimensional vectors, with components numbered backwards. The conditional distribution of the response Yi given the past depends only on the value Zi_χ = t and is given by the marginal of the transition distribution of Z^
152
WEFELMEYER
For the conditional mean and variance of the response given the past observations we write m (t) = j Qr{t,dy)y,
v(t) =
jQr(t,dy)(y-m(t))2.
Let π(dz) denote the stationary law of Z[. For the expectation of a function /(Zi) under π we write
πf = j π(dz)f(z). Similarly, for the expectation of a function /(Zj_i, Y{) under the stationary law π ® Qr we write
π®Qr/ = 11 π(Λ)QΓ(t,dy)/(t,y). 3
Modeling the conditional mean of the response
3.1. Estimating functions. Suppose we have a parametric model m = m$ for the conditional mean of the response, where ϋ is a ς-dimensional parameter. Recall that a large class of martingale estimating functions can be constructed as follows. Note that Yi-m^(Zi-i) are martingale increments with respect to the filtration generated by the Z{. Choose a g-dimensional vector w#(t) of weight functions. Then w$(Zi-i) is predictable, and the components of the vector w$(Zi-\)(Yi — m^{Zi-\)) are again martingale increments, so that the estimating functions Mϋn ^ΣwviZi^Yi-m^Zi^))
(3.1)
i=p
form a martingale. An estimator is obtained as solution ϋ = ϋn of the estimating equation M$n = 0. We do not give conditions for existence and uniqueness here. Call an estimator Tn for ϋ asymptotically linear with influence function /(t,y)if
nιl2{Tn - ϋ) = n" 1 / 2 JΓ f(Zi^Yi)
+ oP(l)
i=p
and /Qr(t^y)/(t,y) = 0 for all t. Then the components of the vector f(Zi-ι,Yi) are martingale increments. If the components of / are π ® Qrsquare integrable, a martingale central limit theorem holds, and Tn is asymptotically normal with covariance matrix π ® Qrf f- See, e.g., Billingsley (1968, p. 206).
153
QUASI-LIKELIHOOD REGRESSION
Let us recall how one shows that the solution ΰ = ϋn of the estimating equation M# n = 0 is asymptotically linear. We use a dot on top of a vector of functions to denote the matrix of partial derivatives with respect to ΰ. A Taylor expansion gives 0 = MKn
= Mΰn + Mΰn{ΰn
- 0) + •
with matrix of partial derivatives
i=p
i=p
Note that m$ is a row vector. Since the entries of the matrix — m#(Zi-ι)) are mean zero martingales, Yi — m$(Zi-ι)) is negligible if the entries of the matrix are π ® Qr-square integrable. Furthermore,
n .
The above arguments show that i9n has influence function /(t,y) = ( π ^ m τ ? ) - 1 ^ ( t ) ( y - m ? ? ( t ) )
(3.2)
and asymptotic covariance matrix 1
nvw$w# (πrh^w1^)'1.
(3.3)
For dependent observations, weak conditions for asymptotic linearity of estimators may be found in Hosoya (1989), Andrews and Pollard (1994) and Andrews (1994). Remark 1. We have restricted attention to weights w^(Zi-ι) which depend only on the p previous observations Zi_i,..., Z{-v of the p-order Markov chain. Let us show that there is no point in using weights w$ which depend on observations preceding Zi_ p . Note first that with such weights we would also get a covariance matrix of the form (3.3), with π now denoting the stationary law of more than p successive observations. Let WQ denote the weight function in which the additional arguments appearing in w$ have been integrated out. Then nw^m^ equals πm#m#. By Jensen's ineqality, φ — πvw$Wβ is positive semi-definite. Hence
is positive semi-definite.
•
154
WEFELMEYER
3.2. Known conditional variance. Suppose, for the moment, that the conditional variance v of the response is known. Then we can determine a weight function which is optimal in the sense that it minimizes the asymptotic covariance matrix (3.3). Recall (e.g., Horn and Johnson, 1985, p. 472) that if a block matrix I ι
, - I is symmetric and positive definite, so is ι
ι
1
C - B'A~ B and hence also B'~ CB~ - A' . Applying this result to the l 2 l 2 covariance matrix of {v~ / rn!d^v / w^y under π, we see that 4
ι
1
(7nu#rhtf)~ πvwφWφ { nm'^w'^)~ — (πv^m^m^)'
is positive semi-definite.
This means that the covariance matrix (3.3) is minimized for wΰ = v~ιrn!ϋ,
(3.4)
and that the minimal covariance matrix is ι
ι
(3.5)
.
This result is well known in the context of quasi-likelihood models; see Subsection 4.1. The influence function (3.2) of the estimator corresponding to the optimal weight w$ = v"ιm'^ is /(t, y) = (nv^rh'trht)-1 v{t)'ln^(tY{y
- m*(t)).
(3.6)
3.3. An adaptive estimating function. If υ is not known, we can construct an estimating function which is adaptive in the sense that for each v it is asymptotically as good as the best estimating function (3.1) for known υ, with weight w# = υ~ιrn!ϋ. It suffices to replace the conditional variance v by an estimator; compare Wefelmeyer (1996a). Specifically, let θj_i(t) be estimators for υ(t) based only on the observations Zo,..., Z%-\ up to time i — 1. For the construction of such estimators see, e.g., Collomb (1984) and Truong and Stone (1992). We obtain the adaptive estimating function
Σϋi-ΛZi-iΓ^iZi^yiYi^mviZi^)).
(3.7)
i=p
This estimating function is an adaptive version of the quasi-score function discussed in Subsection 4.1. Since the weight is predictable, the estimating function is again a martingale. If the Ό^—i are strongly consistent, a Taylor expansion as in Subsection 3.1 shows that the corresponding estimator is asymptotically normal, its asymptotic covariance matrix is again (3.5), and its influence function is again (3.6). For the case of a one-dimensional
155
QUASI-LIKELIHOOD REGRESSION
parameter, and when there are no covariates, a rigorous proof is given in Wefelmeyer (1996a). 3.4. Efficiency of the adaptive estimating function. Does the adaptive estimating function (3.7) lead to an efficient estimator? In other words, is this estimator optimal not only among estimators based on estimating functions of the form (3.1), but also in the larger class of regular estimators? To answer this question, we must indicate that the model given by m = m# is locally asymptotically normal in an appropriate sense, and determine a bound for the asymptotic covariance matrices of regular estimators of ϋ in the sense of a convolution theorem. The basic reference for this theory in the i.i.d. case is Bickel et al. (1993). For the case of a one-dimensional parameter, and when there are no covariates, a rigorous proof of the efficiency of the adaptive estimating function is in Wefelmeyer (1996a). To accomodate covariates, we recall that by Cox (1972) the likelihood factors into two terms. The first is the partial likelihood and depends only on the conditional law Qr{t,dy) of the responses. The second depends only on the conditional law of the covariates given the past observations and the present responses. Our model m = ra$ is a condition on Qr only. Hence the second factor of the likelihood varies independently of ϋ. This means that the bound for the asymptotic covariance matrices can be determined from the partial likelihood. Fix Qr{t,dy). The model is described by a parametric family of side conditions m = m$ . To introduce a local model, we perturb Qr(t,cfy) such that the perturbed transition distribution is still in the model. This means that a perturbed condition m = ra#, with ϋ replaced by ϋ + n~ιl2u, say, holds. Such perturbations are conveniently described as follows. Consider the affine space of g-dimensional vectors h(t,y) of functions with
J Qr(t,dy)h(t,y)
J Qr(t,dy)yh(t,y)
= 0,
=
(3.8)
ro*(t)'.
(3.9)
These vectors will play the role of score functions. Set Qnrhu{t,dy) = Q r (t,dy)(l + n- 1 / 2 Mt,y)'u). Then
jQ?hu{t,dy)y
= mΰ{t) + n-ιl2 j ' Qr{t,dy)yh{t,y)'u =
mΰ+n-U2u(t)
(3.10)
156
WEFELMEYER
This shows that Q™hu is indeed (approximately) in the model. The partial likelihood ratio is
By a Taylor expansion, the partial likelihood ratio is shown to be locally asymptotically normal. n
1
1
•*
2
logT ^ = n- / Σ h{Zi-UYi)'u - -u'n ® Qrhh'u + o P (l), with n " 1 / 2 ^JLp/iίZj-i, YJ) asymptotically normal with mean zero and covariance matrix π ® Qrhh!. By the convolution theorem, an estimator is regular and efficient if and only if it is asymptotically linear with influence function Σ - 1 s , where Σ = π ® Qrss' and s is the efficient score function, minimizing π ® Qrhhf over the affine space of vectors h fulfilling (3.8) and (3.9). It is characterized by π ® Qrsh' = Σ for all h. It is straightforward to check that the solution is a(t,y) =υ(tΓιmΰ(t)(y-mΰ(t)).
(3.11)
Hence Σ = πυ^πi^rh^, so that the efficient influence function is (3.6). In particular, the minimal asymptotic covariance matrix for regular estimators of ϋ is (3.5). The estimator based on the adaptive estimating function (3.7) also has influence function (3.6) and is therefore efficient. 3.5. Regression-autoregression models. have an autoregressive structure,
Suppose that the responses
where the Si are i.i.d. with known or unknown mean zero density g(y). Then the conditional distribution of the response Y{ given Z{-\ — t has the form Qr(t,dy)=g(y-mϋ(t))dy,
(3.12)
with conditional mean m#(t). We call it a regression-autoregression model. It is a submodel of the model given by m =ra#.Conditions for (geometric) ergodicity are given in Bhattacharya and Lee (1995). The question arises whether in this submodel there are even better estimators than the one based on the adaptive estimating function (3.7). We show that the minimal asymptotic covariance matrix of regular estimators of ΰ is, in general, strictly smaller than (3.5). The regressionautoregression model (3.12) is a semiparametric model, with nuisance parameter g. The local model can be obtained by perturbing ΰ and g. Consider
157
QUASI-LIKELIHOOD REGRESSION the linear space of functions k(y) with Efc(e)=0,
(3.13)
Eεk(ε)=0.
(3.14)
Then gnk(y) = g{y){l + n~ιl2k(y)) is again a mean zero probability density. Set Q?ku(t,dy) = gnk(y mΰ+n.1/2u{t))(dy). Write £' for the logarithmic derivative g'/g of g. By a Taylor expansion, Qΐhu(t,dy) = Qr(t,dy) (l + n-V2(k(y - m* (t)) - mϋ{t)ui'(y - m*(t))))
The perturbation is seen to be (approximately) of the form (3.10), with /i(t,y)'u replaced by k(y — ra#(t)) — rn$(t)uί!(y —ra#(t)). Hence the corresponding partial likelihood ratio is locally asymptotically normal with variance
I J π{dt)g{y - mϋ{t))dy{k{y - mϋ{t)) - mΰ{t)ut!{y - m#{t))f π(dt)g(y)dy (k(y) - rhΰ(t)uέf(y))2
.
(3.15)
For the parametric case, g known and hence k = 0, see Hwang and Basawa (1993, 1994). For the semiparametric case considered here, see Koul and Schick (1996). These references do not consider covariates. To simplify the calculations, we will now assume that ϋ and g are locally orthogonal in the sense that the mixed term in the variance (3.15) vanishes, or equivalently, EJfe(ε)^(ε)=0
for all A;,
or
ππιΰ = 0.
(3.16)
The first condition is fulfilled if the density of ε is assumed symmetric. Then £' is odd, and since both g and gnk are symmetric, k must be even. The second holds in many applications; see also Examples 1 and 2 below. If (3.16) holds, then we can estimate ϋ asymptotically as well not knowing g as knowing g. We say that the model is adaptive with respect to g. We refer to Drost et al. (1994) and Drost and Klaassen (1995) for a discussion of adaptivity for general semiparametric GARCH models. Under (3.16), the variance (3.15) reduces to 2
Efc(ε) + Eί'(ε)
2
uπm'ϋmΰu,
158
WEFELMEYER
and the efficient score function, as defined at the end of Subsection 3.4, is
Hence Σ = E£'(ε)2 πτh'ΰrh.β, the efficient influence function is
and the minimal asymptotic covariance matrix of regular estimators of ΰ is
Of course, this covariance matrix cannot be larger than the minimal asymptotic covariance matrix (3.5) for the larger model m = m$. To check this, note first that in the regression-autoregression model we have
«(t) = j 9{y ~ mt(t))dy (y - mΰ(t))2 = Eε 2 . Hence (3.5) is E ε 2 (πmj^m^)"1. To prove the desired inequality, it suffices to recall that E^'(ε) 2 is the Fisher information for location, and that its inverse is not larger than E ε 2 . Hence E ε 2 (πm#rhΰ)~~1 — ( E ^ ( ε ) 2 ) " 1 (ππiβrhΰ)"1
is positive semi-definite.
We note that the difference between the two matrices is proportional to the difference between the asymptotic variance E ε 2 of the empirical estimator for the mean of g and the asymptotic variance (E^'(ε) 2 )" 1 of the maximum likelihood estimator for the mean in the location model generated by g . The inequality is strict unless ί'{y) is proportional to y. In particular, for normal ε», the adaptive estimating function (3.7) gives an efficient estimator in the regression-autoregression model. To summarize: The regression-autoregression model is a quasi-likelihood model with the additional restriction that the conditional law of the response does not depend on the past except through the mean. The additional restriction can be exploited to construct an estimator with asymptotic covariance matrix reduced by the factor ( E ^ ε ^ E ε 2 ) " 1 as compared to the adaptive estimating function. The reduction can be considerable if the density g is far from normal. On the negative side, the construction requires 1 estimating the logarithmic derivative I of g, see Koul and Schick (1996) when there are no covariates, and the estimator is inconsistent if in reality the additional restriction does not hold. Example 1. Set m#(Zi-ι) = tf'Yi-i. Then the conditional mean of the response does not depend on the covariates, but the conditional variance
159
QUASI-LIKELIHOOD REGRESSION
may still depend on them. An efficient estimating function is (3.7); here it has the form i=p
It gives the weighted least squares estimator
i=p
j
i-p
The corresponding regression-autoregression model is the p-order autoregression model Yi =
ΰ'Yi-l+εi,
where the Z{ are i.i.d. with mean zero density g. Here v(Zi-ι) = E ε 2 does not depend on the observations, and the weighted least squares estimator reduces to the ordinary least squares estimator
i=p
It is not efficient in the autoregression model unless the Si are normal. Huang (1986) proves local asymptotic normality of the autoregression model. An efficient estimator is constructed by Kreiss (1987a) for symmetric g, and by Kreiss (1987b) for arbitrary mean zero g. o Example 2. Set mΰ{Z^ι) = α'X;_i + /3'YVi dimension q = k + p . Write Si-\ = {Xi-\,Yi-\). function is (3.7); here it has the form
Then ϋ = (α,/?)' is of An efficient estimating
i=p
It gives the weighted least squares estimator
i=P
The corresponding regression-autoregression model is the p-order autoregression model with A -dimensional linear regression trend Yi = a'Xi-x + β'Yi-χ + εi = ΰ'Si-ι + ε<,
160
WEFELMEYER
where the Si are i.i.d. with mean zero density g. As in Example 1, v(Zi-ι) = E ε 2 does not depend on the observations, and the weighted least squares estimator reduces to the ordinary least squares estimator
\i=p
)
i=p
Swensen (1985) proves local asymptotic normality for the case of nonrandom Xi. See Garel and Hallin (1995) for a recent more general version and references. D
4
Quasi-likelihood models
4.1. The quasi-score function. A quasi-likelihood model is given by parametric models m = m$ and v — v$ for the conditional mean and variance of the response, with ΰ a common g-dimensional parameter. Consider again the estimating functions (3.1),
i=p
Exactly as in the case of a known conditional variance υ, Subsection 3.2, the best weight is determined as w$ = v^rh'^. It gives the quasi-score function
ύ).
(4.1)
A version of this result for general discrete-time processes is in Godambe (1985). For continuous time see Thavaneswaran and Thompson (1986), Hutton and Nelson (1986) and Godambe and Heyde (1987). The corresponding estimator is the maximum quasi-likelihood estimator. Its asymptotic covariance matrix is (3.5) with υ = υ^,
the inverse of the quasi-Fisher information matrix. Its influence function is (3.6) with v = -Utf, /(t,y) = (πυ^m^mtf)" 1 ^(t)~ l rh 1 ? (t) / (y -
mΰ(t)).
The quasi-score function is asymptotically as good as the adaptive estimating function (3.7). This implies that it does not use any of the information in the model assumption υ = i>#.
161
QUASI-LIKELIHOOD REGRESSION
By the arguments of Subsection 3.1, the quasi-score function can be used even if the model is not true. In this sense it is robust against misspecification of the conditional variance of the response. If the true conditional variance is v, then by (3.3) for w$ = v^ιm'ΰ the maximum quasi-likelihood estimator has asymptotic covariance matrix
However, unless v = w#, this covariance matrix is strictly larger than the lt ι covariance matrix (πυ~ m!ϋm^)~ which is attained by the estimator based on the adaptive estimating equation. 4.2. Further estimating functions. Note that (Y{ — ra#(Zi_i))2 — Vϋ(Zi-ι) are martingale increments with respect to the filtration generated by the Z{. We obtain martingale estimating functions i))
(j
(4.2)
i=p
which we can combine with estimating functions (3.1) to get estimating functions of the form (4-3)
ι=p
It will be convenient to introduce the q x 2 matrix of weights w# = and the two-dimensional vector of martingale increments Mt,y) = (y - m*(t), (y - mϋ{t))2
-
and to rewrite the estimating function (4.3) as
We also introduce the 2 x q matrix of derivatives d# = (m^,^)'. For the conditional centered third and fourth moments of the response we write Mj(t) = JQΛt,dy){y
- ro*(t))', j = 3,4.
The conditional covariance matrix of the martingale increments i# is
C=ίV«
μ
*2).
(4.4)
162
WEFELMEYER
As in Subsection 3.1, the estimator corresponding to the estimating equation (4.2) is shown to be asymptotically linear, with influence function
and asymptotic covariance matrix ^w^)" 1 .
(4.5)
4.3. Known conditional centered third and fourth moments. Suppose, for the moment, that we know the conditional centered third and fourth moments μ% and μ± of the response. The weights w\$ and W2>d which minimize the asymptotic covariance matrix (4.5) are (4.6)
πΰ = dbC-\ and the minimal asymptotic covariance matrix is
i
1
1
.
(4.7)
The optimal weights are determined by Crowder (1986, 1987) for independent observations, and by Godambe (1987) and Godambe and Thompson (1989) for discrete-time stochastic processes. These authors restrict attention to the special case of conditionally orthogonal martingale increments, i.e. μ3 = 0. The general case, also for continuous time, is treated in Heyde (1987). A different derivation may be found in Kessler (1995). The influence function of the estimator corresponding to the optimal weight is
/(t,y) = (πd'βC^dor'doityCitrH^y).
(4.8)
4.4. An extended adaptive estimating function. If the conditional centered third and fourth moments μ% and μ\ of the response are not known, we can construct an estimating function which is adaptive in the sense that for each μ$ and μ± it is asymptotically as good as the best estimating function (4.3) for known μ 3 and μ^ with weight (4.6). Similarly as in Subsection 3.3, replace, in (4.6), the matrix C(t) by an estimator Ci_i(t), using estimators βjj-iit) for μj{t) based on the observations Zo,...,Zi-i. This gives the extended adaptive estimating function ).
(4.9)
The estimating function is an adaptive version of the extended quasi-score function discussed in Remark 4 below. It gives an estimator whose influence function is (4.8).
163
QUASI-LIKELIHOOD REGRESSION
The extended adaptive estimating function can be written more explicitly. Estimate the determinant of C(Z{-ι) by
and write the estimating function as
ι=p n
In the important special case of orthogonal martingale increments, μ% = 0, the extended adaptive estimating function can be replaced by the simpler version
i=p n
ι=p
4.5. Efficiency of the extended adaptive estimating function. To show that the extended adaptive estimating function (4.9) leads to an efficient estimator, we must determine the lower bound for the asymptotic covariance matrices of regular estimators of ϋ. We follow the arguments in Subsection 3.4, adding the model assumption v = υ#. For the case of a onedimensional parameter, and when there are no covariates, a rigorous proof hu is in Wefelmeyer (1996b). We perturb Q^ as in (3.10), with h fulfilling (3.8) and (3.9) and also 2
jQr(t,dy)(y-mΰ(t)) h(t,y)
=vΰ(t)'.
Then Q?hu fulfills (3.11) and also
J
2
- mΰ+n-1/2u(t))
= vΰ+n-ι/2u(t)
+o^
(4.10)
164
WEFELMEYER
The efficient score function s again minimizes πQrhh', now over the smaller affine space of functions h fulfilling (3.8), (3.9) and (4.10). The solution is /
(t)i*(t,y).
(4.11)
To see this, note that s fulfills (3.8), (3.9) and (4.10) since
I Qr (t, dy)s{t, y)t*(t, y)' = d*(t)', and that s fulfills πQrsh! = πQrss' since h fulfills (3.8), (3.9) and (4.10). Hence the efficient influence function is (4.8), and the minimal asymptotic covariance matrix for regular estimators of ΰ is (4.7). The estimator based on the extended adaptive estimating function (4.9) also has influence function (4.8) and is therefore efficient. Remark 2. We have shown that our adaptive estimating function (3.7) is as good as the best estimating function (3.1) for known υ, with weight (3.4). This does not mean that the estimator based on (3.7) remains efficient in the class of all regular estimators if υ is assumed known. This is only true if the vectors Λ(t,y) fulfill, besides (3.8) and (3.9), Qr{t,dy){y-mΰ{t))2h{t,y)=0. This condition is not fulfilled by the score function (3.11) unless μ$ = 0, i.e., unless the two estimating functions (3.1) and (4.2) are orthogonal in the sense that μ$ = 0. • Remark 3. In some applications the conditional mean m$ of the response does not depend on ϋ. Then rh# — 0, and πw#dtf is not invertible, so that the calculations in Subsection 4.2 are not valid. In this case, the estimating functions (3.1) are useless in the sense that they do not lead to estimators with finite asymptotic variance. In particular, the quasi-score function (4.1) is useless. One possible alternative is to restrict attention to estimating functions (4.2) and proceed as in Subsections 3.1 to 3.3, with the model m = m$ replaced by the model v = v#. As in Subsection 3.1, the estimator corresponding to the estimating function (4.2) is shown to be asymptotically linear with influence function /(t, y) = (πwϋύϋ)~ι
tity(t) ((y - mΰ(t))2
and asymptotic covariance matrix ύtf)"1 π(μ 4 - v
-
165
QUASI-LIKELIHOOD REGRESSION If μ4 is known, this covariance matrix is minimized for ι
wϋ = (μ4 - vl)~ v'ϋ, and the minimal asymptotic covariance matrix is
A good estimating function is
τ=p
It would be efficient if we had not specified m at all. In general, however, the assumption that rn$ does not depend on ϋ contains information about ϋ. Condition (3.9) on h now reads
j Qr(t,dy)yh(t,y)=O.
(4.13)
The score function of the above estimator is
θ(t,y) = (μ 4 (t) - vΰ(t)2)-ιvΰ(tY
((y - mύ(t))2
- v*
For this score function to be efficient, condition (4.13) must hold for h = s. This is not true unless μ% — 0. An analogous result with interchanged roles of ra and υ was noted in Remark 2. We note that although the estimating functions (3.1) are useless on their own, they can be used in combination with estimating functions (4.2): For rh$ = 0 the efficient score function (4.11) reduces to t, y) ι
= D(t)- vϋ(t)'
2
(-μs(t)(y - m*(t)) + υϋ(t) ((y - m^(t)) - vd(t)))
,
where D = υ#(μ4 — υ | ) — μ | is the determinant of C. The corresponding extended adaptive estimating function is (4.14)
For μ3 = 0 this gives again (4.12).
•
Remark 4. An extended quasi-likelihood model is given by parametric models m =ra#,v = υ^, μ3 = μ3# and μ4 = μ^. Similarly as in Subsection
166
WEFELMEYER
4.1, the best estimating function (4.3) is seen to have weights (4.6), now with A*3 = μzϋ and μ± = μ^. This gives the extended quasi-score function
i=p
with
μ
» A
μw - v$ J It is asymptotically as good as the estimator given by the extended adaptive estimating function (4.9). Hence it does not use the information in the specifications μs = μ^ϋ and μ± = μ±$. It is robust against misspecification of μs and μ±, but then the extended adaptive estimating function is strictly better. • 4.6. Heteroscedastic regression-autoregression models. Suppose that the responses have a heteroscedastic autoregressive structure, Yi = m^Zi-x) + V where the ε» are i.i.d. with known or unknown mean zero density 5, We may and will also assume that the Si have variance one. The conditional distribution of the responses given Z{-\ = t has the form Qr(t,dy) = υϋ{t)-ι'2g {vϋ{t)'ιl2{y
- mΰ(t))) dy,
with conditional mean m$ and conditional variance v$. We call it a heteroscedastic regression-autoregression model. It is a submodel of the quasilikelihood model given by m = m# and v = v^. We show that the lower bound for the asymptotic covariance matrices of regular estimators of ϋ is, in general, strictly smaller than the lower bound (4.7) in the quasi-likelihood model. We follow the arguments of Subsection 2 3.5, now with heteroscedasticity. Since E ε = 1, the functions k fulfill not only (3.13) and (3.14) but also Eε2k(ε) = 0 . 1 2
nk
With g (y) = g(y)(l + n~ / A;(y)) as before we set Qfu(t,dy)
= v^n-U2u{t)-1'2
By a Taylor expansion, Q?ku(t,dy)
gnk {vΰ+n^2u(tr1/2(y
- ^ + n - 1 / 2 J t ) ) ) dy.
167
QUASI-LIKELIHOOD REGRESSION
ι 2
= Qr(t,dy)(l + n- ' {k ( M t Γ
1/2
( y " rr,
- m*(t))) y - mϋ{t))t' (υϋ{t)-1/2(y - m*(t))) + 1))]
Hence the corresponding partial likelihood ratio is locally asymptotically normal with variance π(dt)9(y)dy(k(y)-vϋ(tr1/2rhϋ(t)ue'(y)
(4.15)
The model is adaptive with respect to g if ΰ and g are locally orthogonal in the sense that k(y) is orthogonal to ^(t)~ 1 / / 2 m^(t)£ / (y) + i^(t)~ 1 ύtf(t) (y^(2/) + 1). This condition is rarely fulfilled. For a discussion see Drost et al. (1994) and Drost and Klaassen (1995). To simplify the calculations, we will assume that g is known, and calculate the minimal asymptotic covariance matrix for regular estimators in that case. It equals the minimal asymptotic covariance matrix for an adaptive model and is a lower bound for the non-adaptive situation. If g is known, the variance (4.15) reduces to where
/ J
- {
E£'(ε)2 \Eεί'{ε)2
\Eεi'{ε)2 i(Eε2£'(ε)2-l)
and V$ is the matrix ( ) with v = υ$. Hence the efficient score function is
s(t,y) = dϋ(t)'Viιeϋ(t,y) with
- mϋ{t))t' (Mt)~ 1 / 2 (y - mϋ(t))) and the minimal asymptotic covariance matrix of regular estimators of ϋ in the heteroscedastic regression-autoregression model is 1
-
(4-16)
168
WEFELMEYER
This matrix cannot be larger than the minimal asymptotic covariance matrix (4.7) in the larger model m = ra# and υ = v$, the quasi-likelihood model. To check this, note first that in the heteroscedastic regressionautoregression model the μι are of the form Mi(t)
= v'1'2 =
jg(y-ll\y-mϋ{t)))dy{y-mϋ{t)γ
vj/2Ee>, j = 3,4.
Hence the matrix (4.4) can be written C(t) = Vϋ(t)FVϋ(t) with F-(
\
* "l^Eε3 Eε4-1 J ' and the minimal asymptotic covariance matrix (4.7) is To prove that this matrix is larger than the minimal asymptotic covariance matrix (4.16) in the heteroscedastic regression-autoregression model, it suffices to show that F — J~ι is positive semi-definite. This is a well-known result. We recall it briefly. Consider the location-scale model generated by the density g with mean zero and variance one, and the problem of estimating mean and variance based on i.i.d. observations ε i , . . . ,ε n . If the true distribution has mean zero and variance one, the Fisher information matrix is J , and an efficient estimator, say the maximum likelihood estimator, has asymptotic covariance matrix J~ι. If we do not know the density , then the model is completely nonparametric, and an efficient estimator is the empirical estimator for the mean and the variance. If the true distribution has mean zero and variance one, its asymptotic covariance matrix is F. It must be larger than J " 1 . The inequality is strict unless ί'(y) is proportional to y. In particular, if the ε* are normal, then the extended adaptive estimating function (4.9) gives an efficient estimator in the heteroscedastic regression-autoregression model. Example 3. Set m = 0 and υ#(Zi-ι) = σ 2 ( l + βiY^i + + βpY?-p). Then ϋ = (σ 2 ,/?i,... ,/?p)' is of dimension q = 1 + p. As noted in Remark 3, a good estimator is obtained from the estimating function (4.12). With Yli = (Y?-iτ--,Y?-py it reads
ι=p
169
QUASI-LIKELIHOOD REGRESSION
An efficient estimating function is the extended adaptive estimating function (4.14). It is obtained from (4.17) by adding to the martingale increment Y? - σ 2 ( l + β'Yi_x) the increment
The corresponding heteroscedastic autoregression model is the p-order ARCH model introduced in Engle (1982), Yi = σ(l + βΎUΫ'2ei, where the E{ are i.i.d. with a density g which has mean and variance one. In this model we have
and the estimating function σ-^Eε4-!)-1,
(4.17) is, up to an irrelevant
( ^yV ) { i=p
\
* 1
)
factor
(4-18)
/
For normal ε; this gives the maximum likelihood estimator. A review of ARCH models is Bollerslev et al. (1992). Efficient estimators in this model are constructed in Engle and Gonzalez-Rivera (1991), Linton (1993) and Drost et al. (1994) under increasingly weaker assumptions. Example 4. Set m = 0 and vϋ(Zi-ι)
= σ 2 (l + βx (y^x - a'Xi-r)2 +
+ βp(Yi-p - c*%_ p ) 2 ) .
Then ΰ = (σ 2 , α i , . . . , α^, /3i,..., βp)f is of dimension q = 1 + k + p. As in Example 3, the quasi-score function (4.1) is useless, and a good estimator is obtained from the estimating function (4.12). We write ity(t) = <τ2(l + β'(s - α'r) 2 ) with s 2 = ( 5 2 _ υ . . . , s20)' and β'r = (β'rp-U . . . , βfrQY, and obtain / l +/3'(s-a'r)2 ^(t)= -2σ2βr{s-afr)r \ σ2{s-a'r)2 Hence the estimating function (4.12) is Γi-0 - σ 4 (l + β'{Yi-ι - a'Xi-i)2)2 (
i+/mvi-«'Xi-i) 2
\ 2
^-a (l-
(4.19)
170
WEFELMEYER
The corresponding heteroscedastic regression-autoregression model is the p-order ARCH model with /c-dimensional linear regression trend introduced in Engle (1982),
where the ε* are i.i.d. with a density g which has mean and variance one. In this model we have 4
ί
2
2
4
μ 4 ( t ) = σ ( l + i 9 (8-£/r) ) Eε , and the estimating function (4.19) is, up to the irrelevant factor σ" 4 (Eε 4 I)"1, n i=p
Acknowledgment The author thanks Feike Drost for helpful discussions on ARCH models, and the referee for numerous suggestions which made the paper more readable.
References Andrews, D. W. K. (1994). Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica 62, 43-72. Andrews, D. W. K. and Pollard, D. (1994). An introduction to functional central limit theorems for dependent stochastic processes. Internet. Statist Rev. 62, 119-132. Bhattacharya, R. and Lee, C. (1995). On geometric ergodicity of nonlinear autoregressive models. Statist. Probab. Lett. 22, 311-315. Bickel, P. J. , Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore. Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.
QUASI-LIKELIHOOD REGRESSION
171
OPTIMAL INSTRUMENTAL VARIABLE ESTIMATION FOR LINEAR MODELS WITH STOCHASTIC REGRESSORS USING ESTIMATING FUNCTIONS

A. C. Singh, Statistics Canada and Carleton University
R. P. Rao, Sri Sathya Sai Institute of Higher Learning, Prasanthi Nilayam, India

ABSTRACT

In the usual Gauss-Markov (GM) framework of structural linear models, the GM estimators of the regression parameters become inconsistent if at least one of the regressors is correlated with the model error. The reason for this is that the transformation matrix in the GM estimating equation, which transforms the data to the parameter space (and happens to coincide with the design matrix X), cannot be regarded as conditionally fixed. Using a generalization of the method of estimating functions of Godambe and Thompson (1989) to structural models, it is shown that an asymptotically consistent and optimal (in a restricted sense) estimator can be obtained by replacing the transformation matrix X by E_c(X), the linear regression of X on a given set of conditioning variables; the optimality is restricted in that it depends on the conditioning set. The matrix E_c(X) can be viewed as a working (because of restricted optimality) transformation matrix with the desirable property of being uncorrelated with the model error but correlated with X. Although finding an unrestricted optimal transformation matrix is not generally feasible in practice, it is shown using the estimating function framework that a lower bound to the asymptotic covariance can be found. This bound is then used to propose a measure of asymptotic efficiency of the estimator. It is observed that the concept of a working transformation matrix is equivalent to that obtained from the method of instrumental variables. Through examples from different areas of modelling, such as simultaneous equations, latent variables, and measurement errors, it is illustrated that the structural model estimating function provides a unifying principle which recovers existing results as well as leads to new results.
KEY WORDS: Conditioning variables; consistency; matrix Cauchy-Schwarz inequality.
1 INTRODUCTION
We consider a linear semiparametric model y = Xθ + ε where ε ~ (0, Γ), Γ = σ_ε²I, X is an n × p matrix with rank p, and θ is a p-vector of fixed parameters. In general, Γ may be σ²Λ, where Λ is assumed to be nonsingular and known except for its possible dependence on θ. If some of the p regressors, x_1, ..., x_p, are stochastic, then such a model is termed 'structural', while if all the x's are fixed, it is termed 'functional', as defined in Fuller (1987, p. 2). For the case of functional linear models, it is known that when the covariance Γ depends on θ, the Gauss-Markov (GM) approach, or more generally the least squares (LS) approach, may fail in that the resulting estimate may be inconsistent. However, the method of estimating functions of Godambe (1960) and Godambe and Thompson (1989, henceforth GT) does give an optimal and consistent estimate; for a good review see Godambe and Kale (1991). For the case of structural linear models also, the GM or the LS approach may fail when at least one of the regressors is correlated with the model error. Some examples are the cases of latent variable, simultaneous equation, and measurement error models, which are often used in econometrics. In these situations, the method of generalized instrumental variable estimation (GIVE) is commonly used; see, e.g., Harvey (1981, Ch. 2, p. 80). In this paper, for structural models, we first make a connection between the concept of instrumental variables and that of the conditioning variables used in the GT methodology, and then propose a generalization of the GT estimator, termed the structural model estimating function (SMEF) estimator. The SMEF estimator can be used when the conditional expectations in GT are not specified but can be approximated by a linear regression function. Different specifications of the conditioning or instrumental variables give rise to different SMEF estimators. For a given set of conditioning variables, it is shown that the SMEF and GIVE methodologies yield identical estimates, and thus the optimality of GIVE estimators can be justified from the optimality of estimating functions. The optimality of each GIVE estimator is restricted in that it depends on the conditioning variables. It is also shown that a GIVE method can be improved by including a variable that is identically one in the set of instrumental variables. This improved version of GIVE arises naturally within the SMEF framework. Apparently, the distinction between instrumental variables with and without the inclusion of the constant (i.e., 1) has not been emphasized in the literature. Also, since the GIVE estimators are optimal only in a restricted sense, it is useful to have a measure of the asymptotic efficiency of the GIVE estimator. Such a measure
is proposed by finding a lower bound to the asymptotic covariance. The lower bound, however, is generally not attainable, because the corresponding optimal instruments are not obtainable in practice. The organization of the paper is as follows. Section 2 provides motivation for the proposed SMEF method of optimal instrumental variable estimation. For this purpose, both GM and GT estimators are first reviewed from the functional and structural model perspectives. The SMEF method is presented in Section 3. Several illustrative examples are given in Section 4. Finally, Section 5 contains concluding remarks.
2 MOTIVATION OF THE PROPOSED METHOD WITH REVIEW
It will be helpful to review methods for functional linear models, i.e., models with fixed regressors. We will first consider the GM theorem (which gives the BLUE, or best linear unbiased estimator) and later the GT theorem, which generalizes GM and gives the optimal method of estimating functions. In parallel, problems arising from structural models will be discussed to motivate the proposed method.

2.1 Gauss-Markov Theorem
For the GM set-up, it is assumed that the x-variables are fixed and Γ does not depend on θ. According to the GM theorem, the optimal (BLUE) estimator θ̂_BLUE is obtained as a solution of the estimating equation

X'Γ⁻¹(y - Xθ) = 0,   (2.1)

and is given by

θ̂_BLUE = (X'Γ⁻¹X)⁻¹X'Γ⁻¹y.   (2.2)

The estimator θ̂_BLUE has the small sample optimality of BLUE. Also, under standard regularity conditions, we have, as n → ∞,

θ̂_BLUE →_d N_p[θ, (X'Γ⁻¹X)⁻¹].   (2.3)

Note that we have used the above notation for the asymptotic distribution somewhat loosely, because the covariance matrix depends on n. Now observe that the estimating equation (2.1) has four components: (i) the n-vector of zero functions y - Xθ; a zero function is a function of y or θ (or both) such that it is zero in expectation; (ii) the n × n inverse covariance matrix Γ⁻¹, which gives differential weights to the zero functions depending on their precision;
(iii) the p × n transformation matrix X', which transforms the zero function vector from the data space (of dimension n) to the parameter space (of dimension p); and (iv) the p-vector of zeros on the right hand side of the equation; the transformed zero functions from the left hand side are set equal to zero, which is closest under the mean squared error norm. It may be of interest to note that the solution of the estimating equation (2.1) enjoys robustness (in the sense that the estimator remains unbiased and consistent) when Γ represents a working covariance matrix. This occurs in situations where the true covariance is difficult to specify or approximate. This robustness property holds more generally, as was observed by Liang and Zeger (1986) in defining generalized estimating equations. If Σ denotes the working covariance, we have a sandwich-type variance for the suboptimal estimator θ̂ as n → ∞, given by

θ̂ →_d N_p[θ, Q⁻¹X'Σ⁻¹ΓΣ⁻¹XQ⁻¹],   Q = X'Σ⁻¹X.   (2.4)

In the above, there is quite a bit of flexibility in choosing the working covariance matrix (it can be stochastic, for example), the main requirement being that the covariance term in the normal approximation (2.4) must be O_p(n⁻¹). Similarly, we will define the concept of a working transformation matrix, which will be useful in dealing with structural linear models in Section 3. If a (working) transformation matrix F' other than the optimal one X' is used, then the resulting suboptimal estimator θ̂(F) is also robust, with a different sandwich-type variance expression. We have, as n → ∞,

θ̂(F) →_d N_p[θ, (F'Γ⁻¹X)⁻¹(F'Γ⁻¹F)(X'Γ⁻¹F)⁻¹].   (2.5)
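The working-matrix ideas in (2.1)-(2.5) are easy to check numerically. The following sketch (not from the paper; the names and the simulated design are illustrative assumptions) solves the estimating equation (2.1) under a deliberately misspecified working covariance Σ and evaluates the sandwich covariance (2.4).

```python
# Minimal numerical sketch of the estimating-equation view of GM estimation.
# `Sigma` is a *working* covariance and `Gamma` the true one, so the sandwich
# formula (2.4) can be checked directly. All names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))                    # fixed regressors
theta = np.array([1.0, -0.5, 2.0])
Gamma = np.diag(rng.uniform(0.5, 2.0, size=n)) # true error covariance
y = X @ theta + rng.multivariate_normal(np.zeros(n), Gamma)

Sigma = np.eye(n)                              # misspecified working covariance

# Solve the estimating equation X' Sigma^{-1} (y - X theta) = 0
Si = np.linalg.inv(Sigma)
theta_hat = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)

# Sandwich covariance (2.4): Q^{-1} X' Sigma^{-1} Gamma Sigma^{-1} X Q^{-1}
Q = X.T @ Si @ X
Qi = np.linalg.inv(Q)
sandwich = Qi @ (X.T @ Si @ Gamma @ Si @ X) @ Qi
print(theta_hat, np.sqrt(np.diag(sandwich)))
```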
So far the covariates X were considered fixed, i.e., as a matrix of constants. However, if X is random, as in the case of structural models, then provided that X is independent of ε, the GM theorem remains valid conditional on X. Alternatively, if X is only uncorrelated with, but not necessarily independent of, ε, then the optimality of the GM estimator becomes asymptotic; see Section 3.3. Now, for structural models, X is often correlated with ε = y - Xθ. For example, in the case of the distributed lag model, we have, for 1 ≤ t ≤ T,

y_t = x_t'β + αy_{t-1} + ε_t,   (2.6)

where, given {x_t, y_{t-1}}, ε_t ~ (0, σ_ε²) and the ε_t are uncorrelated over t. A typical example from econometrics may define the variable y_t as the rate of consumption and x_t as disposable income at time t. Here GM fails because the set of variables {y_{t-1} : 1 ≤ t ≤ T}, which are part of the covariates, are not independent
of the model errors {ε_t : 1 ≤ t ≤ T}. Using the concept of estimating equations, Durbin (1960) showed that the least squares method for the model (2.6) does provide consistent estimates, using the fact that for each time t the elementary estimating equation is unbiased, because ε_t is uncorrelated with the covariates x_t and y_{t-1}. However, Durbin did not establish optimality of the least squares estimate, although he did define an optimality criterion for estimating equations and showed that the score equation for parametric models satisfies it. Interestingly enough, around the same time, Godambe (1960) established a stronger optimality property of the score function. Further developments on Godambe's optimal estimating functions led to a general result of GT (1989), from which the optimality of the least squares estimate for the distributed lag model easily follows.

2.2 Godambe-Thompson Theorem
For functional linear models, the GT theorem is more general than GM in that the variance is allowed to depend on the mean parameters. For structural linear models, the GT theorem generalizes the GM theorem by allowing hierarchical conditioning with respect to a set of variables related to the random covariates. Consider, for example, the case of the distributed lag model (2.6). The conditioning variables are defined in a hierarchical manner for the vector of zero functions {g_t := y_t - x_t'β - αy_{t-1}, 1 ≤ t ≤ T} by the increasing sequence of conditioning sets A_t = {x_{t'}, y_{t'-1} : 1 ≤ t' ≤ t} ∪ A_0 for 1 ≤ t ≤ T, where A_0 denotes the initial conditioning set, i.e., all those x's which are independent of the error. Here the corresponding σ-fields define the conditional expectation operator, to be denoted E_c(·). For t ≥ 1, define E_t(·) as E(·|A_{t-1}). Then for a random variable involving t, we define E_c as E_t. Now, for a prespecified E_c(·), the zero functions g_t are required to be conditional zero functions, i.e., E_c(g_t) = 0; this holds for the above example. We need the conditional covariance of the T-vector g, which for our example is easily seen to be the diagonal matrix σ_ε²I using the hierarchical conditioning argument, and also the conditional transformation matrix -E_c(∂g/∂θ')' for θ = (β', α)', which is simply (z_1, ..., z_T), where z_t = (x_t', y_{t-1})'. Now, the GT optimal estimating function has the same form as that of GM (see (2.1)), except that the covariance and transformation matrices are replaced by the corresponding conditional ones. It is easily seen that for the model (2.6), this leads to the LS estimating equation. To define the GT optimality criterion with respect to a given E_c(·), consider in general K subsets A_1 ⊂ ... ⊂ A_K of the conditioning variables corresponding to the K subsets of the conditional zero function vector g(y, θ) of dimension n. Now, denoting by G_c' the conditional transformation matrix of gradients, -E_c(∂g/∂θ')', and by Γ_c the conditional covariance of g, which will be block diagonal with K blocks, the optimal estimating function of GT
for estimating θ is given by

G_c'Γ_c⁻¹g(y, θ) = 0.   (2.7)

It is assumed that G_c has full rank and that a unique solution θ̂_MEF exists, where MEF denotes the method of estimating functions. Clearly, the resulting estimator depends on E_c, i.e., on the conditioning variables. Note that in the particular case A_k = A_1, 1 ≤ k ≤ K, i.e., when the conditioning variables are common for all g_i, 1 ≤ i ≤ n (which implies that E_c = E_2), there are two special cases which give rise to GM: first, when all covariates are constants, i.e., A_1 corresponds to the trivial conditioning variable for the sure event; and second, when all covariates are conditioned on, i.e., A_1 = X. Now, consider the class of estimating functions F'Γ_c⁻¹g defined by transformation matrices F such that F'Γ_c⁻¹G_c is nonsingular. This is a linear class of unbiased estimating functions, except that the coefficients of the linear combination are allowed to depend on the parameter θ and the conditioning variables. Then the GT theorem states that the optimal F is given by G_c, in the sense that it "minimizes" the expression

V(F) := [E_1(F'Γ_c⁻¹G_c)]⁻¹[E_1(F'Γ_c⁻¹F)][E_1(G_c'Γ_c⁻¹F)]⁻¹   (2.8)

with respect to the partial order of nonnegative definite matrices. This criterion is referred to as the small sample optimality criterion of estimating functions. A simple proof of the GT theorem is as follows. First observe that it is enough to show that

E_1[F'Γ_c⁻¹F] - E_1(F'Γ_c⁻¹G_c)[E_1(G_c'Γ_c⁻¹G_c)]⁻¹E_1(G_c'Γ_c⁻¹F) ≥ 0,   (2.9)

i.e., that the left hand side is nonnegative definite. This requirement follows easily from the matrix version of the Cauchy-Schwarz inequality, namely Σ_11 - Σ_12Σ_22⁻¹Σ_21 ≥ 0 for a partitioned covariance matrix Σ with diagonal blocks Σ_11, Σ_22 and off-diagonal blocks Σ_12 and Σ_21. In our case, Σ corresponds to the covariance of [(F'Γ_c⁻¹g)', (G_c'Γ_c⁻¹g)']'. The expression (2.8) is the asymptotic covariance of the estimator θ̂(F); see equation (2.12) below and compare it with the sandwich-type finite sample covariance of θ̂(F) in (2.5) when a working transformation matrix is used for GM. For scalar θ, the optimality criterion V(F) reduces to E_1[E_c(g*)²]/[E_1E_c(∂g*/∂θ)]², where g* = F'Γ_c⁻¹g. Except for the conditioning variables, this is the same as the original criterion of Godambe (1960). For the multiparameter optimality criterion considered here, see also the important contributions of Durbin (1960), Kale (1962), and Bhapkar (1972). An illuminating interpretation of the large sample optimality of the estimating function G_c'Γ_c⁻¹g comes from the projection approach of McLeish (1984); see also McLeish and Small (1988). If a complete parametric model is
postulated, then the corresponding score vector ψ_θ(y, θ) leads to asymptotically optimal maximum likelihood estimates. The optimal EF turns out to be closest to ψ_θ, under the covariance norm, in the conditionally linear class generated from g (in which the coefficients may depend on the conditioning variables used in E_c), because the orthogonal projection of ψ_θ on g is

E_c(ψ_θ g')Γ_c⁻¹g = -E_c(∂g'/∂θ)Γ_c⁻¹g,   (2.10)

using the fact that 0 = ∂(E_c g)/∂θ' = E_c(∂g/∂θ') + E_c(g ψ_θ'). The asymptotic optimality of the estimator θ̂_MEF follows from the approximate representation (see Godambe and Heyde, 1987)

F'Γ_c⁻¹g(y, θ) = [E_1(F'Γ_c⁻¹G_c)](θ̂ - θ) + o_p(1),   (2.11)

which implies that

θ̂(F) - θ →_d N_p[0, V(F)].   (2.12)
Note that a consistent estimate of V(F) can be obtained from the expression (2.8) by dropping the E_1 operator and substituting consistent estimates for θ if necessary. The small sample optimality of the estimating function follows from the corresponding property of the score function ψ_θ (see Godambe, 1960), which seems natural in view of the projection argument. The only difference is that the general class of functions used in defining the optimality of the score function is reduced to a linear class, within which the criterion V(F) is minimized. Now, the small sample optimality of the estimator θ̂_MEF is unknown in general. However, if we assume that G_c = -E_c(∂g/∂θ') is equal to -(∂g/∂θ'), i.e., without the expectation operator, then the representation (2.11) becomes exact for linear models, which in turn implies that θ̂_MEF has an optimality similar to but stronger than BLUE. The reason is that, unlike BLUE, the linear class for MEF is larger because it allows the coefficients of the linear combination to depend on θ as well as on the conditioning variables; see Godambe (1994) and Singh (1995) for similar results in the context of estimation for linear models with random effects.
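As an illustration of the preceding discussion, the following sketch (an illustrative simulation, not the paper's example) fits the distributed lag model (2.6) by solving the estimating equation built from the conditional transformation rows z_t = (x_t', y_{t-1})'. As noted above, this is exactly the least squares equation, whose consistency goes back to Durbin (1960).

```python
# Sketch: for model (2.6), the GT optimal estimating function with hierarchical
# conditioning reduces to sum_t z_t (y_t - z_t' theta) = 0, z_t = (x_t', y_{t-1})'.
import numpy as np

rng = np.random.default_rng(1)
T, beta, alpha, sig = 500, np.array([1.0, 0.5]), 0.6, 1.0
x = rng.normal(size=(T, 2))
y = np.zeros(T)
for t in range(1, T):
    y[t] = x[t] @ beta + alpha * y[t - 1] + sig * rng.normal()

Z = np.column_stack([x[1:], y[:-1]])   # rows are the conditional gradients z_t'
theta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y[1:])
print(theta_hat)                       # approximately (1.0, 0.5, 0.6)
```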
2.3 Relation Between Conditioning and Instrumental Variables
Consider the methodology of GIVE for estimating the θ-parameters of the structural model y - Xθ = ε ~ (0, Γ). For simplicity, assume Γ does not depend on θ. It is further assumed that, for q ≥ p, the n × q matrix of instrumental variables, W (say), is well correlated with the n × p matrix of model covariates X but uncorrelated with the model error ε. The n × p optimal instrument matrix W_* obtained from W is given by the linear regression of the p-vector x on the q-vector w. Note that the number of instruments
is the same as the number of x-variables, i.e., p, but the number of instrumental variables is q, which is at least as large as p. It is further assumed that the x- and w-variables are such that plim(X'Γ⁻¹W_*/n) is nonsingular. Now, the conditioning variables used in MEF can serve as w-variables, and G_c = -E_c(∂g/∂θ') = E_c(X) as the w-specific optimal instruments. Here only one conditioning set A_1 is involved, so that E_c = E_2. However, under a given semi-parametric modelling, the information may not be sufficient to compute the actual conditional expectation E_c(X) for finding the w-specific optimal instruments. Instead, an approximation given by the linear regression function (E_c(X), say) may be used as the instrument matrix. With this approximation, MEF and the method of instrumental variables will be equivalent if the instrumental variables coincide with the conditioning variables. In the next section, we present a generalization of the GT theorem to address the question of optimality when E_c(X) is used in estimation of structural model parameters; the resulting estimates are termed SMEF estimates. We also consider the problem of finding the asymptotic efficiency of a w-specific SMEF estimator. To this end, a lower bound for the asymptotic covariance is derived. The corresponding matrix of optimal instruments (E_{c*}(X), say) is unfortunately not obtainable in practice. Thus, E_{c*}(X) can be interpreted as the conceptual optimal (in the unrestricted sense) transformation matrix, while E_c(X), corresponding to specified instrumental variables, is the working (because of restricted optimality) transformation matrix. In the following, it will be assumed for simplicity that there is only one conditioning set, i.e., A_k = A_1, 1 ≤ k ≤ K, in the GT framework.
3 PROPOSED METHOD OF STRUCTURAL MODEL ESTIMATING FUNCTION
We propose two SMEF estimators, θ̂_SMEF^(1) and θ̂_SMEF^(2), corresponding to two choices of the conditioning or instrumental variables w: (1) the w-variables are prespecified but do not contain the constant 1; w in this case will be denoted by w_(1); and (2) the w-variables are prespecified and contain the constant 1. In each case, we establish optimality of the SMEF estimator in a suitable class, in a manner analogous to that of the MEF estimator of GT. Next, the question of the asymptotic efficiency of the SMEF estimator is considered in Section 3.3.

3.1 The Estimator θ̂_SMEF^(1)
This case gives rise to commonly used instruments. Here the linear regression function E_c(X) passes through the origin, because the constant 1 is not one of the w-variables. It turns out that the optimality of the estimator using E_c(X) as instruments, derived from estimating functions, is equivalent to the well known property of w-specific optimal instruments in GIVE methodology. To see this, let us first define the operator E_c on the random variable f(y, θ), with respect to a set of q variables w_(1) (q ≥ p), as

E_c(f) = w_(1)'[E_1(w_(1)w_(1)')]⁻¹E_1(w_(1)f) := w_(1)'β,   (3.1)

where β is the q-vector of regression coefficients of f on w_(1). Note that, since the regression function E_c(f) is through the origin, the w-variables are not centered. The operator Cov_c(f) is defined as Cov_1(f - E_c(f)), and is given by

Cov_c(f) = E_1(f²) - E_1[f E_c(f)]
         = E_1(f²) - E_1(f w_(1)')[E_1(w_(1)w_(1)')]⁻¹E_1(w_(1)f).   (3.2)
With this definition of E_c and Cov_c, we have, for the semiparametric structural linear model g = y - Xθ ~ (0, Γ), Γ = σ_ε²I,

E_c(g) = 0,   Γ_c = Cov_c(g) = Γ,   (3.3)

because by definition g is uncorrelated with the conditioning variables w. We now have the following generalization of the GT theorem. Let G_c = E_c(X) be defined elementwise by (3.1). Thus G_c is W_(1)B, where W_(1) is the n × q matrix of observations on the q-vector w_(1), and B is the q × p matrix of regression coefficients [E_1(w_(1)w_(1)')]⁻¹E_1(w_(1)x'), where x is the p-vector of x-variables. Since B is unknown in general, a consistent estimate can be substituted to approximate G_c as

W_(1)* := W_(1)B̂ = W_(1)(W_(1)'W_(1))⁻¹W_(1)'X.   (3.4)
Therefore W_(1)* is the (orthogonal) projection of X on the column space of W_(1). Note that for general Γ, i.e., for Γ = σ²Λ, W_(1)* will be defined as W_(1)(W_(1)'Γ⁻¹W_(1))⁻¹W_(1)'Γ⁻¹X, to ensure invariance of W_(1)* to transformations of y - Xθ and to achieve optimal instruments (see below). Now, in the linear class of estimating functions defined by n × p transformation matrices F (which may depend on the parameters θ and are in the column space of W_(1)) such that F'Γ⁻¹G_c is nonsingular, the optimal F is given by G_c, in the sense that

V(F) := [E_1(F'Γ⁻¹G_c)]⁻¹[E_1(F'Γ⁻¹F)][E_1(G_c'Γ⁻¹F)]⁻¹   (3.5)

is minimized. The term minimization is defined as in the case of the GT criterion (2.8). The estimator θ̂_SMEF^(1) is obtained by solving (when G_c is replaced by its estimate W_(1)*)

W_(1)*'Γ⁻¹(y - Xθ) = 0.   (3.6)
We have, for large samples,

θ̂_SMEF^(1) →_d N_p[θ, (W_(1)*'Γ⁻¹W_(1)*)⁻¹].   (3.7)
In fact the estimated asymptotic covariance matrix takes the sandwich form (W_(1)*'Γ⁻¹X)⁻¹(W_(1)*'Γ⁻¹W_(1)*)(X'Γ⁻¹W_(1)*)⁻¹ (compare with (2.5)), which can be simplified by noting that W_(1)*'Γ⁻¹X is the same as W_(1)*'Γ⁻¹W_(1)*, using the properties of the projection form (3.4). The estimator θ̂_SMEF^(1) is asymptotically optimal in the sense that it minimizes the asymptotic covariance V(F). This optimality is restricted in that the class of transformation matrices F is allowed to depend only on W_(1). The (estimated) asymptotic covariance further simplifies to [(X'Γ⁻¹W_(1))(W_(1)'Γ⁻¹W_(1))⁻¹(W_(1)'Γ⁻¹X)]⁻¹, which can be seen to coincide with the known result for optimal instruments under GIVE methodology (Harvey, 1981, p. 80).
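A compact numerical sketch of the estimator (3.6)-(3.7) under Γ = σ_ε²I follows; the data-generating design is an illustrative assumption chosen so that ordinary least squares would be inconsistent while the instruments w_(1) remain valid.

```python
# Sketch of the theta^(1)_SMEF / GIVE computation of Section 3.1 with Gamma = s^2 I.
# W1 is the n x q matrix of instrumental variables; all names are illustrative.
import numpy as np

def smef1(y, X, W1):
    """Solve W1*' (y - X theta) = 0 with W1* the projection of X on col(W1), eq. (3.4)/(3.6)."""
    B_hat = np.linalg.solve(W1.T @ W1, W1.T @ X)   # q x p regression coefficients
    W1s = W1 @ B_hat                               # working transformation matrix W_(1)*
    theta = np.linalg.solve(W1s.T @ X, W1s.T @ y)
    acov = np.linalg.inv(W1s.T @ W1s)              # (3.7), up to the factor sigma_e^2
    return theta, acov

# toy structural model: x correlated with the error through a common shock u
rng = np.random.default_rng(2)
n = 1000
w = rng.normal(size=(n, 2))                        # instruments
u = rng.normal(size=n)
x = w @ np.array([1.0, 0.7]) + 0.8 * u + 0.3 * rng.normal(size=n)
y = 2.0 * x + u                                    # theta = 2; OLS would be biased
theta, _ = smef1(y, x[:, None], w)
print(theta)                                       # close to 2
```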
3.2 The Estimator θ̂_SMEF^(2)
In this case the linear regression E_c(X) is allowed to have an intercept term, and thus is more general than the E_c(X) of Section 3.1. In fact, E_c(X) now corresponds to the commonly used definition of linear regression, in which the regressors (the w-variables in the present case) are centred. For a random variable f(y, θ), E_c is defined with respect to the conditioning variables w = (1, w_(1)')' as

E_c(f) = E_1(f) + Cov_1(f, w_(1))[Cov_1(w_(1), w_(1))]⁻¹[w_(1) - E_1(w_(1))],   (3.8)

where Cov_1(f, w_(1)), for example, is E_1[(f - E_1(f))(w_(1) - E_1(w_(1)))']. The operator Cov_c(f) is defined as before by Cov_1(f - E_c(f)), except that E_c(f) is different in this case.

Analogous to θ̂_SMEF^(1), the optimal SMEF estimator θ̂_SMEF^(2) can be defined from (3.4) with W_* instead of W_(1)*, and its asymptotic covariance will be (W_*'Γ⁻¹W_*)⁻¹. Note that it will have a stronger optimality property because of the larger linear class of estimating functions due to the introduction of the unit vector in W. Thus, this SMEF estimator will be superior to the usual GIVE estimator unless the constant 1 is already used as one of the instrumental variables. However, its optimality is still restricted, because the class of transformation matrices is allowed to depend only on W.
3.3 Asymptotic Efficiency of the SMEF Estimator
Using the framework of estimating functions, we consider the question of defining an optimal choice of conditioning variables such that they are correlated with the covariates X but uncorrelated with y - Xθ, and such that they give rise to the minimum asymptotic covariance in the class of all SMEF estimators. To this end, let θ⁰ denote the true unknown value of θ and define X_* as the residual of X after regressing on y - Xθ⁰; i.e., for j = 1 to p, the jth column x_cj* of X_* is (the n-vector x_cj denotes the jth column of X)

x_cj* = x_cj - Cov_1(x_cj, y - Xθ⁰)Γ⁻¹(y - Xθ⁰).   (3.9)
Note that if x_cj is uncorrelated with y - Xθ⁰, the corresponding x_cj* will coincide with it. For given θ⁰, the regression coefficients in (3.9) can be estimated consistently as follows. Since Γ = σ_ε²I, it may be reasonable to assume that Cov_1(x_cj, y - Xθ⁰)/σ_ε² is also of the form γ_j(θ⁰)I, where γ_j(θ⁰) = Cov_1(x_ij, y_i - x_i'θ⁰)/σ_ε² for all i = 1 to n. Now γ_j(θ⁰) can be estimated consistently as a function of θ⁰ by using

γ̂_j(θ⁰) = x_cj'(y - Xθ⁰)/nσ_ε².   (3.10)

In an analogous manner σ_ε² (if unknown) can be consistently estimated as a function of θ⁰. Note that for general Γ, γ̂_j(θ⁰) should be modified to x_cj'Γ⁻¹(y - Xθ⁰)/n. Thus for a given θ⁰, X_* can be computed. However, since θ⁰ is unknown, it is not computable in practice. Now, treating X_* as conditioning variables, it easily follows that the optimal transformation matrix E_{c*}(X) is X_*. We will now show that the asymptotic covariance corresponding to X_* provides a lower bound for the asymptotic covariance of an SMEF estimator. To see this, note that for any given set of conditioning variables w, Cov_1(w, y - x'θ⁰) = 0 by definition, and therefore E_1(xw') = E_1(x_*w'), where x_* is the p-vector of x_*-variables. This implies, using the definition E_c(x) = B'w (see (3.4)), that

E_1(x_*x_*') = E_1[E_c(x)E_c(x)'] + E_1[(x_* - E_c(x))(x_* - E_c(x))'].   (3.11)
In terms of the corresponding consistent estimates, we have

X_*'X_* = W_*'W_* + (X_* - W_*)'(X_* - W_*).   (3.12)

Incidentally, the above decomposition also follows easily by noting that W_* is the (orthogonal) projection of X_* on W under the Euclidean norm. Now for general Γ, (3.12) takes the form

X_*'Γ⁻¹X_* = W_*'Γ⁻¹W_* + (X_* - W_*)'Γ⁻¹(X_* - W_*).   (3.13)
It follows from (3.13) that (W_*'Γ⁻¹W_*)⁻¹ - (X_*'Γ⁻¹X_*)⁻¹ is nonnegative definite for any choice of w-variables, and therefore X_* provides optimal instruments. Although X_* is not computable in practice, as it depends on θ⁰, a consistent estimate of the corresponding minimum asymptotic covariance can nevertheless be obtained by replacing θ⁰ in (X_*'Γ⁻¹X_*)⁻¹ by a consistent estimate given by an SMEF estimator. Note that the matrix W_* plays the role of a working transformation matrix. To see this, observe that the covariance matrix (W_*'Γ⁻¹W_*)⁻¹ is asymptotically equivalent to (W_*'Γ⁻¹X)⁻¹(W_*'Γ⁻¹W_*)(X'Γ⁻¹W_*)⁻¹, in view of the comments given below (3.7) and the fact that W_*'Γ⁻¹X is asymptotically equivalent to W_*'Γ⁻¹X_*. We thus propose the following measure of asymptotic efficiency of an SMEF estimator:
asy.eff(θ̂_SMEF) = |(X_*'Γ⁻¹X_*)⁻¹| · |(W_*'Γ⁻¹W_*)⁻¹|⁻¹.   (3.14)
Alternatively, in the above definition of efficiency, trace can be used instead of the determinant operator.
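The bound and the efficiency measure (3.14) can be illustrated numerically. In the sketch below (not from the paper), the true value θ⁰ of the toy simulation is used to form X_*, exactly because θ⁰ would be replaced by a consistent SMEF estimate in practice; the design and all names are illustrative assumptions, and Γ = σ_ε²I throughout.

```python
# Sketch of the efficiency measure (3.14) for a one-parameter toy model.
import numpy as np

rng = np.random.default_rng(3)
n, theta0, sig_e = 2000, 2.0, 1.0
w = np.column_stack([np.ones(n), rng.normal(size=n)])   # instruments incl. constant
u = sig_e * rng.normal(size=n)
x = 1.0 + 0.9 * w[:, 1] + 0.6 * u + 0.2 * rng.normal(size=n)
y = theta0 * x + u
X = x[:, None]
e0 = y - X @ np.array([theta0])

# X_*: residual of X after regressing on y - X theta^0, per (3.9)-(3.10)
gamma_j = (X.T @ e0) / (n * sig_e**2)
X_star = X - np.outer(e0, gamma_j)

# W_*: projection of X on col(w), per (3.4)
W_star = w @ np.linalg.solve(w.T @ w, w.T @ X)

lower = np.linalg.inv(X_star.T @ X_star)   # minimum asymptotic covariance
acov = np.linalg.inv(W_star.T @ W_star)    # SMEF asymptotic covariance
eff = np.linalg.det(lower) / np.linalg.det(acov)   # (3.14); at most about 1
print(eff)
```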
4 EXAMPLES OF SMEF ESTIMATORS

4.1 Latent Variable Models
Consider a linear model y = Xθ + vα + δ, δ ~ (0, Γ_δ), Γ_δ = σ_δ²I, v ~ (0, σ_v²I), with δ uncorrelated with v. Here v corresponds to a latent covariate and is therefore unobserved. A typical example may define y as the production of a company, x as its size, and v as management motivation. If we treat vα + δ as the model error ε with mean 0 and covariance Γ, where Γ = σ_ε²I, σ_ε² = α²σ_v² + σ_δ², then X and ε will be correlated in general, due to the presence of the covariate v in ε. Such problems are often addressed in longitudinal surveys where data can be obtained for a second occasion. Assuming that the parameters θ, α, and the latent variables v do not change over the two occasions, then from the model

y_1 - y_2 = (X_1 - X_2)θ + (δ_1 - δ_2),   (4.1)
STRUCTURAL MODEL ESTIMATING FUNCTIONS find 0gMEp ^
a
189
solution of 1
(Xι - X 2 )*Γ~ (y - Xθ) = 0,
(4.2)
where y - Xθ = [(yι - X\θ)', (y 2 - X20)T> Γ has σ\I on the diagonal and 2 a σ%I on the offdiagonal, and [X\ - X 2 )* is similar to (3.3) except that W(i) is replaced by X i - X 2 Again it is assumed that X\ - X2 has full rank. If the model has an intercept term, then all the linear functions of θ are not estimable. Note that in practice it may be difficult to estimate Γ and therefore a working covariance may be used to obtain suboptimal estimates. Alternatively, using the suboptimal estimates as initial consistent estimates, we can estimate Γ as in Zellner's feasible GLS approach and then obtain optimal estimates in a second step. c a n The above estimator 0QMEF ^ ee a s ^ y improved by adding a column F of ones in X\ — X2 to obtain ^SMEF* Moreover, the second estimator has an additional advantage in that it allows for estimation of the intercept term if it is present in the model. Also the asymptotic efficiencies of SMEF estimators can be computed as given by (3.14).
4.2
Simultaneous Equation Models
Consider a system of two equations 2/(1)
=
V(2) ^
-^(ΐ)β(i) + 2/(2)α(i) +€ (i)
:=
-^"(1)0(1) + € (i)
(2)β(2) + ί/(l)α(2) +c(2)
: =
-^(2)^(2) + €(2)
X
(4-3)
where€(1) - (0,Γ ( 1 ) ),e ( 2 ) - (0,Γ(2)),Xj) = (X(i),y (2 )),0(i) = (β[ι and so on. A typical example may define y^ as consumption expenditure, y^2) as income and x's as other explanatory variables. Since the regressors and model error are obviously correlated, we can't use GM. A commonly used solution is two stage least squares in which W = (X(ι),X(2)) is used as a common set of instrumental variables for each equation. Thus for (4.3a), in stage /, 2/(2) is obtained by regressing the j/(2)-variable on w-variables, and in stage II, parameter estimates are obtained as
where W(i) + denotes the optimal instruments obtained by regressing X ^ on W, i.e., W(i)+ is (X(i),ί/(2)) Iti s easily seen that the SMEF estimator ^SMEF w ^ ^ ^ a s c o n ditioning variables will be identical to the above estimate. Moreover, it can be improved by adding the unit vector in W and its asymptotic efficiency can be calculated as discussed earlier.
We remark that all parameters of the two equations can be estimated simultaneously, but this requires estimation of the covariance between ε_(1) and ε_(2). For this reason, the method of three-stage least squares is commonly used. Using SMEF, alternative estimators can be developed as in the case of two-stage least squares.

4.3 Models with Measurement Error
Consider the model y = Zθ + δ, δ ~ (0, Γ_δ), Γ_δ = σ_δ²I, where Z is subject to measurement error. Thus the observed Z is X, where X = Z + U; U is the measurement error, which is assumed to be uncorrelated with the model error δ. For the jth column u_cj of U, it is assumed that u_cj ~ (0, σ_uj²I) and that the columns are uncorrelated over j, 1 ≤ j ≤ p. A typical example may define y as corn yield and Z as nitrogen content in the soil, as considered by Fuller (1987, p. 2). Now, rewriting the model as y = Xθ + ε where ε = δ - Uθ, we have a structural linear model with X correlated with ε. The covariance Γ of ε depends on the unknown parameters σ_δ² and θ and has the form Γ = σ_ε²I, where σ_ε² = σ_δ² + Σ_j σ_uj²θ_j². Often an instrumental variable is obtained by taking a second independent observation on Z; see, e.g., Fuller (1987, p. 52). Denoting the first and second observations on Z as X_1 and X_2, θ can be estimated from

X_2'Γ⁻¹(y - X_1θ) = 0.   (4.5)

Similarly, using X_1 as the instrumental variable, another estimate can be obtained, and then a final estimate can be obtained by combining the two. The above estimates can alternatively be obtained as θ̂_SMEF^(1), and can be further improved by using θ̂_SMEF^(2).
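The replicate-observation device of (4.5) is sketched below; the scalar design and the simple average used to combine the two estimates are illustrative assumptions.

```python
# Sketch of the replicate-observation IV estimate (4.5) for the measurement
# error model: X1 and X2 are two independent error-prone observations of Z.
import numpy as np

rng = np.random.default_rng(6)
n, theta = 2000, 1.2
Z = rng.normal(size=n)
X1 = Z + 0.5 * rng.normal(size=n)          # first measurement of Z
X2 = Z + 0.5 * rng.normal(size=n)          # second, independent measurement
y = theta * Z + rng.normal(size=n)

theta_a = (X2 @ y) / (X2 @ X1)             # solves X2'(y - X1 theta) = 0, eq. (4.5)
theta_b = (X1 @ y) / (X1 @ X2)             # roles of X1 and X2 exchanged
print(theta_a, theta_b, 0.5 * (theta_a + theta_b))
```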
5 CONCLUDING REMARKS
For structural linear models, two alternative SMEF estimators using instrumental variables were considered, based on a generalization of the GT theory of estimating functions. The first SMEF estimator uses the available instrumental variables but does not include the constant 1 as an instrumental variable. The second SMEF estimator includes the constant as an instrumental variable and was shown to be more efficient than the first. These estimators correspond to the commonly used estimators based on GIVE methodology. However, the advantage of using the constant as an instrumental variable does not seem to have been emphasized in the literature on GIVE methodology. Using the theory of estimating functions, a measure of asymptotic efficiency was proposed by finding a lower bound to the asymptotic covariance of all GIVE estimators. Such a measure is expected to be useful in
practice, as the optimality of each GIVE estimator is restricted to its own class. Through several examples drawn mostly from econometrics, it was shown that the SMEF methodology provides a useful generalization of the GM methodology as well as an important unified statistical technique for dealing with linear models with stochastic regressors.

ACKNOWLEDGEMENT

We are grateful to Professors V. P. Godambe, M. E. Thompson, and Dr. H. J. Mantel for comments and helpful discussions. The first author's research was supported in part by a grant from the Natural Sciences and Engineering Research Council of Canada. He is especially grateful to Sai Baba, Chancellor, Sri Sathya Sai Institute of Higher Learning, for providing the opportunity to visit the institute and conduct this research.
References

Bhapkar, V. P. (1972). On a measure of efficiency of an estimating equation. Sankhyā A 34, 467-72.
Durbin, J. (1960). Estimation of parameters in time series regression models. J. Roy. Statist. Soc. Ser. B 22, 139-53.
Fuller, W. A. (1987). Measurement Error Models. John Wiley and Sons, New York.
Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist. 31, 1208-12.
Godambe, V. P. (1994). Linear Bayes and optimal estimation. STAT-94-11, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, 16 pages.
Godambe, V. P. and Heyde, C. C. (1987). Quasi-likelihood and optimal estimation. Internat. Statist. Rev. 55, 231-44.
Godambe, V. P. and Kale, B. K. (1991). Estimating functions: an overview. In Estimating Functions (ed. V. P. Godambe), Oxford University Press, Oxford, 3-20.
Godambe, V. P. and Thompson, M. E. (1989). An extension of quasi-likelihood estimation (with discussion). J. Statist. Plann. Inference 22, 137-72.
Harvey, A. C. (1981). The Econometric Analysis of Time Series. Philip Allan Publishers Ltd., Oxford.
Kale, B. K. (1962). An extension of the Cramér-Rao inequality for statistical estimation functions. Skand. Aktuar. 45, 60-89.
Liang, K. Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13-22.
McLeish, D. L. (1984). Estimation for aggregate models: the aggregate Markov chain. Canad. J. Statist. 12, 265-282.
McLeish, D. L. and Small, C. G. (1988). The Theory and Applications of Statistical Inference Functions. Lecture Notes in Statistics 44, Springer-Verlag, New York.
Singh, A. C. (1995). Predicting functions for generalization of BLUP to mixed nonlinear models. ASA Proc. Biom. Sec., 300-5.
ON ESTIMATING FUNCTION APPROACH IN THE GENERALIZED LINEAR MIXED MODEL

Brajendra C. Sutradhar, Memorial University of Newfoundland
V. P. Godambe, University of Waterloo

ABSTRACT

Waclawiw and Liang (1993) develop an estimating function-based approach to component estimation in the generalized linear mixed model with univariate random effects and a vector of fixed effects. In their approach they utilize the standard optimal estimating functions to estimate the fixed effects, and a so-called Stein-type form of estimating functions to estimate both the random effects and their variance. In this paper, we provide a semiparametric solution to the estimation problem dealt with by Waclawiw and Liang. The solution is obtained under two set-ups by utilizing the standard theory of optimal estimating functions (Godambe and Thompson, 1989). Under the first set-up, the solution is obtained in three steps. In the first step, the estimating functions for the regression parameters and the random effects are developed by treating the random effects as fixed effects. In the second step, we obtain the prediction of the random effects by taking their true random nature into account. These predicted random effects are then used in the estimating equations for the regression parameters of Step 1 to obtain improved estimates. In the third step, the estimating function for the variance of the random effects is developed based on the true nature of the random effects. Under the second set-up, the estimating functions for the regression parameters, random effects and their variance are developed by utilizing the true nature of the random effects directly. Results of a small simulation study of the performance of the proposed estimating function-based approaches are reported.

Key Words: Random effects; variance component of the random effects; semi-parametric solutions; standard estimating function approach; unconditional and conditional mixed methods; corrected conditional mixed method.
1 INTRODUCTION

The generalized linear model (McCullagh and Nelder (1989)) neatly synthesizes likelihood-based approaches to regression analysis for a variety of outcome measures. The underlying distribution of the outcome variables is assumed to be of the exponential family form, and a link function transformation of the expectation is modelled as a linear function of observed covariates. Several recent extensions of this useful theory involve models with random terms in the linear expectation. Such generalized linear mixed models are useful for accommodating the overdispersion often observed among outcomes that nominally have binomial (Williams (1982)) or Poisson (Breslow (1984)) distributions, and for modelling the dependence among outcome variables inherent in longitudinal or repeated measures designs (cf. Laird and Ware (1982), Stiratelli, Laird and Ware (1984), Zeger, Liang and Albert (1988), Zeger and Karim (1991)).

Consider a set of repeated observations consisting of a response y_ij as the jth (j = 1, ..., n_i) repeated observation on individual i (i = 1, ..., k) and a p × 1 vector x_ij of covariates associated with that response. Let β denote a p × 1 vector of unknown fixed effect parameters associated with the covariate x_ij. Further, let γ_i be a univariate random effect such that, for a given γ_i, the n_i observations on the ith individual are independent. Under the assumption that the conditional density of y_i, the n_i × 1 vector of responses for individual i, given γ_i, is of the exponential form

f(y_i | γ_i) = exp{ Σ_{j=1}^{n_i} [y_ij η_ij - φ(η_ij)] + Σ_{j=1}^{n_i} c(y_ij) },   (1.1)

with η_ij = x_ij'β + γ_i and γ_i ~ N(0, σ²), recently Waclawiw and Liang (1993) have used a three-step iterative procedure to estimate all three unknowns β, γ_i (i = 1, ..., k), and σ². The three steps are:

1. Assuming an initial value for σ², the fixed effects β are computed using the generalized estimating equation of the form
Σ_{i=1}^{k} (∂μ_i'/∂β) V_i⁻¹(β, σ²)(y_i - μ_i(β, σ²)) = 0,   (1.2)

where μ_i(β, σ²) and V_i(β, σ²) are, respectively, the marginal mean and the variance-covariance matrix of the response vector y_i for the ith individual.

2. Assuming that σ² and β are fixed, the equation g_i = 0 is solved for γ_i, where
g_i = Σ_{j=1}^{n_i} a_ij{y_ij - φ'(η_ij)} + b_i γ_i   (1.3)

is a class of estimating functions for γ_i, unbiased in the sense that

E(g_i) = Σ_{j=1}^{n_i} a_ij E{y_ij - φ'(η_ij)} = 0,   (1.4)
where φ'(η_ij) is the first derivative of φ(·) in (1.1) with respect to η_ij, which is, in fact, the conditional mean of y_ij given γ_i. Note that in (1.3) they have obtained a_ij and b_i by following certain optimal criteria due to Godambe (1960) [see also Ferreira (1982)]. Let a*_ij and b*_i be such solutions. An optimal function g*_i is obtained by substituting a*_ij and b*_i in g_i given in (1.3). Next a Stein-type estimate γ*_i of γ_i is obtained by solving g*_i = 0.

3. The variance component of the random effects, σ², is estimated by using the relationship
E(γ_i²) ≈ E(γ*_i²) - E(γ*_i - γ_i)²,   (1.5)
where (γ*_i - γ_i) is obtained by first expanding the optimal estimating function to a first order approximation and then solving the resulting optimal estimating equation. Note that in this step one actually obtains a recursive form for the estimation of σ², which must be updated with changes in β and γ_i. Let σ*² be the estimator of σ².

The above three steps of the iterative procedure describe a complete cycle, or one full iteration. The cycles of iteration continue until convergence (if it exists) is achieved. However, convergence needs to be investigated, and further work is needed to guarantee it; this is attempted neither in Waclawiw and Liang (1993) nor in the present paper. More recently, Sutradhar and Qu (1997) have shown for a Poisson mixed model that, even if γ*_i in Step 2 is computed based on large n_i, the estimator σ*² of σ² obtained from the third step does not converge to σ². These authors proposed a likelihood approximation (valid for small σ²) to estimate all three parameters β, γ_i and σ² of this special model. The estimation is carried out in two steps. In the first step, they utilize a small-σ² approximate likelihood function to estimate the fixed effect parameters and σ². In the second step, they estimate the random effects γ_i by the posterior mean E(γ_i | y_i), which is, in fact, the minimum mean square error prediction of γ_i. It was shown by Sutradhar and Qu (1997) through a simulation study that their likelihood estimation approach performs much better than Waclawiw and Liang's three-step estimation approach in estimating all three
parameters β, γ_i and σ². The computation of the likelihood function is, however, not easy in general. This paper, unlike Waclawiw and Liang (1993) and Sutradhar and Qu (1997), provides a semi-parametric solution to the estimation problem dealt with by these authors. That is, we do not make any distributional assumption for the random effects. The solution is obtained under two set-ups by utilizing the standard theory of optimal estimating functions [cf. Godambe and Thompson (1989), Godambe and Kale (1991)]. Under the first set-up, the optimal estimating functions for the regression parameters and the random effects are developed by treating the random effects γ_i as fixed. The random nature of γ_i is, however, taken into account when the estimating equations are solved for the parameters. Next, the estimating function for σ² is developed based on the true nature of the random effects. Under the second set-up, the estimating functions for the regression parameters, random effects and their variance are developed by treating the γ_i as random effects, as they should be. The performance of the estimators obtained under these two set-ups is also compared through a simulation experiment for Poisson mixed models.
2 OPTIMAL ESTIMATION WHEN RANDOM EFFECTS ARE TREATED INITIALLY AS FIXED EFFECTS
Assume that, given γ_i (i = 1, ..., k), the response y_ij has mean φ'(η_ij) and variance φ''(η_ij) with η_ij = x_ij'β + γ_i, where φ'(η_ij) is the first derivative of φ(·) with respect to η_ij, as in (1.4), and φ''(η_ij) is the second derivative, φ(·) being a known functional form. Further assume that the γ_i's are independently and identically distributed with zero mean and variance σ², but that the specific form of the distribution of γ_i is not known. Now, by holding γ_i (i = 1, ..., k) fixed, we construct the optimal estimating functions for β and γ_i, following Godambe and Thompson (1989). These functions are, respectively, given by

g_1 = Σ_{i=1}^{k} Σ_{j=1}^{n_i} w_1ij h_1ij,   (2.1)

and

g_2i = Σ_{j=1}^{n_i} w_2ij h_1ij,   (2.2)

where the h_1ij are the elementary functions defined as

h_1ij = y_ij - E_2(y_ij | γ_i)   (2.3)
and w_1ij and w_2ij are given by

w_1ij = E_2(∂h_1ij/∂β | β, γ_i)/E_2(h_1ij² | β, γ_i) = -φ''(η_ij)x_ij/φ''(η_ij) = -x_ij,   (2.4)

and

w_2ij = E_2(∂h_1ij/∂γ_i | β, γ_i)/E_2(h_1ij² | β, γ_i) = -φ''(η_ij)/φ''(η_ij) = -1,   (2.5)
respectively. Note that in the equations (2.3)-(2.5), E_2 denotes the conditional expectation of y_ij for given γ_i. Further note that E(g_1) = E(g_2i) = 0 for all i = 1, ..., k, because E(h_1ij | γ_i) = 0. That is, g_1 and g_2i are unbiased estimating functions. Now joint estimation of the parameters β and γ_i can be achieved by solving the estimating equations g_1 = 0 and g_2i = 0 for the observed data (x_ij, y_ij) (i = 1, ..., k; j = 1, ..., n_i). The solutions of g_1 = 0 and g_2i = 0, denoted by β̂ and γ̂_i respectively, may be obtained by the customary Newton-Raphson method. To begin, we assume that γ̂_i = 0 for i = 1, ..., k. Given the value β̂(u) at the uth iteration, β̂(u+1) is obtained as

β̂(u + 1) = β̂(u) - [(∂g_1/∂β)']⁻¹_u [g_1]_u,   (2.6)

where [·]_u denotes that the expression within the brackets is evaluated at β̂(u). Next we use this estimate β̂ in g_2i and obtain γ̂_i(u+1) as

γ̂_i(u + 1) = γ̂_i(u) - [∂g_2i/∂γ_i]⁻¹_u [g_2i]_u,   (2.7)
where [·]_u denotes that the expression within the brackets is evaluated at γ̂_i(u). Notice that the γ_i's are estimated so far by treating them as fixed effects. But, in the present mixed model, they are random by nature. We now propose an ad hoc estimator of the random effect γ_i, say γ̃_i, obtained as

γ̃_i = E(γ_i | γ̂_i) = E(γ_i) + E(γ_iγ̂_i){E(γ̂_i²)}⁻¹(γ̂_i - E(γ̂_i)),   (2.8)

which is the posterior mean of γ_i given the data through γ̂_i, provided γ_i and γ̂_i have a jointly bivariate normal distribution. Now, as

E(γ̂_i - γ_i)² = Eγ̂_i² - 2Eγ̂_iγ_i + σ²,   (2.9)

we may estimate E(γ̂_iγ_i) by

Ê(γ̂_iγ_i) = ½[Ê(γ̂_i²) + σ̂² - Ê(γ̂_i - γ_i)²],   (2.10)
where Ê(γ̂_i - γ_i)² may be obtained easily by expanding g_2i(γ_i) about γ̂_i and noting that g_2i(γ̂_i) = 0. Next, by using Ê(γ̂_i) = Σ_{i=1}^{k} γ̂_i/k = γ̄ and V̂ar(γ̂_i) = Σ_{i=1}^{k} (γ̂_i - γ̄)²/k, it follows from (2.8) and (2.10) that

γ̃_i = ½[Σ_{i=1}^{k} γ̂_i²/k + σ̂² - Ê(γ̂_i - γ_i)²](γ̂_i - γ̄) / {Σ_{i=1}^{k} (γ̂_i - γ̄)²/k},   (2.11)
(2.12) is the optimal estimating function for σ 2 , with \i2i = ΊΪ—σ2 as the elementary function, and ^ as the respective weight given by ^3^ = E(dhsi/dσ2)/E(hli) —1/&4, where k± = #(7/) — σ 2 . Note that to solve 33 = 0 for σ 2 , it is not necessary to know k±, i.e., E{ηf). The solution of 33 = 0 for σ 2 yields σ 2 = Σ-γf/k. We, thus, obtain
where ηi is the prediction of the true random effects, given by (2.11). Notice that σ 2 in (2.13), 4» i n ( 2 H) have to be computed iteratively. As mentioned above, we then go back to (2.6) to improve the estimate of β by using η\ — ηi instead of %. The cycles of iteration continues until convergence is achieved for β and σ 2 . Let /3, 7, and σ 2 be the final estimates.
3
OPTIMAL ESTIMATION WHEN RANDOM EFFECTS ARE TREATED TRULY AS RANDOM EFFECTS
In this approach, optimal estimating functions for the regression parameters are developed under the fact that 7J'S are independently distributed with zero mean and unknown variance σ 2 . For the time being, suppose that we can compute the unconditional mean and variance-covariance matrix of the
ON GENERALIZED
199
MIXED MODEL
response vector yι. That is (β,σ2)
μi
(3.1)
= E(yi) = ElE2(yi\ii)
and
(3.2) are computable, where E2 in (3.1) denotes the conditional expectation of yi for fixed 7$ as in (2.4) and E\ denotes the expectation over 7* when they are random. We now consider an elementary estimating function for β as 2
)
(3.3)
and construct an optimal estimating function as
9l = Σw*ιMi
(3-4)
where For given σ 2 , the estimating equation
V
= 0
(3.5)
is solved for β. Note that the estimating equation g\ = 0 in (3.5) is the same as the estimating equation for β considered by Waclawiw and Liang (1993). Further note that in the manner similar to that for the estimation of /?, we could 2 construct an optimal estimating function for σ , but this will require calculations of higher moments for the response vector, which may not be easy. Consequently, we choose to estimate σ 2 by using the predicted random effects, where the prediction of the random effects is made by exploiting their true randomness nature. More specifically, the joint computational steps for /?, ji and σ2 are as follows. First, for an initial σ2 value, β estimate at the (u + l ) t h iteration is obtained as β*(u + 1) = β*{u) - {{dg\{σ2)ldβ)τ]-uι[gl(σ2)}u
(3.6)
where s ί ( σ 2 ) is the same estimating function as g\ in (3.4) except that now 2 σ has a specified value, and [ ]u denotes that the expression within the brackets is evaluated at β*{u). Next we use this estimate β* in (3.8) below
200
SUTRADHAR AND GODAMBE
to obtain an optimal prediction 7* for 7^ under the assumption that 7i's are truly random. The estimating function for ji is constructed as follows. To begin, 7i's are treated as fixed as in the previous section. Next the prior information that 7i's are independently and identically distributed with 2 zero mean and variance σ , is used to take the randomness nature of ηι into account. The optimal estimating function for 7^ is then written as 92i = 92i +531?
[o.l )
where g2{ is as in (2.2) and g& is given by g^i = ^3^32, with h^ = ji — 2 E(ji) as the elementary function and w$i = E(dhsi/dji)/E(hli) = 1/σ , as its weight (Godambe, 1994; Naik-Nimbalkar and Rajarshi, 1995). It then 2 follows that for β = β* obtained from (3.6) and for known σ , the predicted value of 7$ at the (u + l)th iteration is obtained as
7ί(« + 1) = 7*(u) - [dg*2i{β*^*)ldΊi]-ι[g*2i{β\σ2)\u,
(3.8)
where [ ]u denotes that the expression within the brackets is evaluated at 7*(u), the uth. iteration value of 7^. Notice that it has been assumed in (3.8) that σ 2 is known. When σ 2 is unknown, it may be estimated, as in the previous section, by using the optimal estimating function 33 given in (2.12). The corresponding estimating equation g$ = 0 yields k
ΐ IK
(3.9)
where 7* is obtained from (3.8) for β = β* and for a given value of σ 2 . Now σ* is put back into (3.6) and (3.8) to obtain the improved estimates of β and 7i. The improved estimate of 7, is then used in (3.9) to obtain an improved estimate of σ 2 . The cycle of iteration continues until convergence is achieved for β and σ 2 . Let β, ηi and σ be the final estimates. Now turning back to the issue of computations for the mean and covariance matrix of y^, it is well known that in general, exact expressions for the marginal means and variances may not be easily computable. But for 7< ^ iV(0, σ 2 ), expressions for the marginal means and variances simplify or may be easily approximated for the standard link functions [see Zeger et al (1988)]. For example, for the log link as in the Poisson case, y)
=
EιE2{yij\Ίi) σ 2 ),
and
(3.10)
ON GENERALIZED MIXED MODEL
=
201
(3.11)
vax{E2(yij\ji)} + E!{vΆτ(yφi)}
= expOrξ/3 + \σ2){\ + exp(aξ/? + ^ 2 )(exp(σ 2 ) - 1)}. If it is assumed that for given 7J, Poisson responses yij and y^i are independent, for j φ f, j,f = 1,... ,rij, then unconditional covariance between y^ and yijt is given by yiji) = {exp{xjjβ + -σ2)}{exp(xjfβ
+ -σ2)}
x{ex P (σ 2 )-l}.
(3.12)
The mean vector μι(β, σ2) and variance-covariance matrix Vι{β, σ2) are then easily computed. Note that for the cases when 7* ~ ΛΓ(O, σ 2 ) and σ2 is assumed to be small, one may develop an approximate likelihood function for β and σ2 and compute the likelihood estimate for these parameters [see Sutradhar and Qu (1997)]. The random effects 7; (i = 1,..., k) may be estimated by using the posterior likelihood of ηι given the data. For the general case when 7i's are independently and identically distributed with zero mean and variance σ 2 , one may still obtain the approximate marginal means and variances, provided σ2 is small (cf. Sutradhar and Rao (1996)]. Rewrite the conditional density (1.1) of y^ given 7^ as £
j
j + c ( ^ )},
(3.13)
where 0^'s, with θ\j = xjjβ + 7Ϊ, are independent random variates with E(θ*j) = Oij = xJjβ, and v a r ( ^ ) = σ2. Now by expanding f(yij\θ^) in (3.13) about 0y and taking expectation over the distribution of θψ one first obtains the density function of j/ij, which may then be exploited to compute the marginal means and variances. After some algebra it follows that 2 2
y [(α') {m 3 + raira2} + α"ra 2 ,
(3.14)
with mi = sflα\ m2 = [g" - α"mι]l{α')2, m 3 = -{α')-*[Zα'α"m2 + Q!"mx g'"], where, for example, mi, α! and g1 are used for the functions mi(^j), α!{θij) and g'{θij) respectively, by suppressing their dependence on θij. By similar calculations, one obtains 2
2
σ ) - {E(yij)} ,
(3.15)
202
SUTRADHAR AND GODAMBE
where f
2
2
m
2
ϋ"j °" ) = ( 2 + rn\) + y [(α') {m4 + 2mira 3 + mfm 2 } 2
3raira2 + mf} - #"(m 2 + m )], with m4
= -(
Further, as in this general case we do not specify the joint distribution of ί/ii? ? 2/175 j Virii, one may use the 'working' covariance matrix Σi(β,σ2)=D?R(a)Dl
(3.16)
in place of the true covariance matrix Vi(β,σ2) without sacrificing the consistency of β through the generalized estimating equation approach [Liang and Zeger (1986)]. In (3.16), R(a) is referred to as a 'working' correlation matrix of yi} and D{ = diag[var(y a )..., var(y^)..., var(yίn.)].
4
SIMULATION STUDY
To examine the performance of the proposed approaches, we executed a small simulation study under the Poisson mixed model, with v log{JB(yy | 7 i )} = Σβtxijt
+ Ίu
(4.1)
£=1
for j = 1,..., Πi\ i — 1,..., k. The parameters to be controlled in the simulation study are as follows: (a)fc,the number of independent clusters or individuals; (b) rij, the number of observations under each cluster or individual; (c) β\,... ,/3p, the regression effects of the p covariates; (d) 7^ (i = 1,..., A;), the random effects; and (e) σ 2 , the variance of the random effects. We take the number of clusters as k = 100, and consider two sets of values of ni and p, namely, Ui — 4, p — 4; and πi = 10, p = 2, for all i = 1,..., k. For the first set of values of ni and p, we take βλ = 2.5,
β2 = -1.0,
βs = 1.0 and βA = 0.5;
203
ON GENERALIZED MIXED MODEL and = 1,
for j = 1,...,
n»+ 1
j = 1,...,
and 0:^4 =
For the second set of values of Ui and p, we take the first two values of β% Γ i.e., βι = 2.5 and /?2 = —10; and first two covariates X{j\ and a ^ F° k = 100, the 7i's were independently generated from a normal distribution 2 2 with mean 0 and variance σ . Five values of σ = 0.1, 0.3, 0.50, 0.75 and 1.00 were considered. The responses (yn,... ,yini) for each cluster i were generated as realizations of Poisson model (1.1) with mean and variance equal to exp I Σ βtx%jt + 7< >. The simulated data (yij), j = 1,...,
I Ifei
JJ
i =
1,... fe, and the covariates (xiju), u = 1, ,p; j = 15 5 ^ ; i = 1,... A; were used to compute the estimates of the fixed effect parameters /?, variance component σ 2 of the random effects, and the random effects 7$ (i = 1,..., fc), based on both approaches discussed in Sections 2 and 3. The simulation was repeated 2,000 times in order to obtain the mean value and standard errors of the parameter estimates. For simplicity, we refer to the estimation method discussed in Section 2 as corrected conditional mixed method (CCMM) and the estimation method discussed in Section 3 as unconditional mixed method (UMM). We now spell out the formulas for the estimation of the parameters of the Poisson mixed model by CCMM and UMM. In CCMM, to begin with, the random effects ji are treated as fixed effects and estimating equations for β and 7^ are developed by conditioning on these fixed effects. Folowing (2.6) and (2.7), these estimating equations are
1 >_
-1 V^
V"^
Λ
Λ
ί / J
( 01
/ 7 —
QVΉl
T
(4.2)
and -1
(4-3) where [ ]u in (4.2) and (4.3) denotes that the expression within the brackets are evaluated at ΐith iterated value β(u) and ji(u) respectively. Note that
204
SUTRADHAR AND GODAMBE
7» in (4.3) are computed by treating 7, (i = 1,..., k) as fixed, although in reality, under the present model, they are random. We now make a correction to take this random nature of ji into account and estimate them by (2.11) by noting that for the Poisson model (4.4)
E{Ίi-Ίi? = 3=1
This result in (4.4) is obtained by expanding
about 7i and noting that 321 (7ί) = 0. Now by using (4.4) in (2.11), and exploiting (4.2), (4.3), (2.11) and (2.13) iteratively, we obtain the solutions for /?, 7i and σ 2 . They are referred to as /?, 7$ and <τ2 respectively. In UMM, regression effects β are estimated by using (3.6). For the Poisson model, equation (3.6) reduces to
l)=β*(u)-gfβg*1,
(4.5)
where / j Uijxij
+ Σ?ii
Li=i
and
2
2
2
with α = l/{exp(σ ) - 1}, and λ = l/{exp(σ )(exp(σ ) - 1)}. Next, by (3.8), the iterative equations for X{ (i = l,...,fc) for the Poisson model, reduces to Ί -1
x
(4.6)
205
ON GENERALIZED MIXED MODEL
where [ ]u denotes that the expression within the brackets are evaluated at 2 uth iterative value j*(u). Further, by (3.9), the estimating equation for σ is given by k 4
•**/*>
( 7) 2
which is similar to the estimating equation (2.13) for σ in CCMM. The above three estimating equations (4.5), (4.6) and (4.7) are solved 2 iteratively as follows. For a given value of σ , we first solve (4.5) for β. This estimate of β is then used in (4.6) to obtain the estimate of ηι. In the third step, these values of ηn (i = 1,..., k) are used in (4.7) to obtain an estimate 2 2 of σ . We then put back this estimate of σ in (4.5) and (4.6) to improve the estimates of β and 7*. This cycle of iteration continues until convergence is achieved for β and σ 2 . The final estimates are denoted by β, ηi (i = 1,..., k) and σ . In the simulation study, we also include the results based on the conditional mixed method (CMM). In this method, unlike the CCMM, β and ji are estimated by treating the random effects 7$ as fixed effects, although in reality they are not so. The appropriate estimating equations are (2.6) and (2.7) for β and ηι respectively. Next, the variance of the random components is estimated by treating 7$ as random effects, but using the estimate of the fixed 7J for the unobservable random 7$. The appropriate equation for the variance component is * =Σ,7.7*>
(4.8)
instead of (2.13). In (4.8), 7; is obtained from (2.7) instead of (2.11).
4.1
Estimate of/?
Table 1 reports the simulated values of the percent relative bias (RB% ) of the regression estimators computed by: (1) the conditional mixed method (CMM), (2) the corrected conditional mixed method (CCMM), and (3) the unconditional mixed method (UMM). The percent relative bias (RB% ) of the estimator /3χ, for example, is given by 100 x ΛB(jSi), where
RBφι) = \Eφι)-β1\/σφι), with Eφi) and σ(β\) as the simulated mean and standard errors of the estimator β\. It is clear from the table that in estimating all the regression parameters βu β2, /% and /34 for p = 4, βx and β2 for p=2, the UMM leads to very large reduction in RB relative to the conditional methods CMM and CCMM. Between the two conditional methods, the corrected method
206
SUTRADHAR AND GODAMBE
(CCMM) yields slightly better estimates for the regression parameters, as compared to the uncorrected conditional method (CMM). For large values 2 of σ , both conditional and unconditional methods may have convergence problems. For example, for n2 = 4, p = 4, CMM and CCMM do not converge 2 when σ = .75 and 1.00. The convergence problems are shown by putting '*' in the tables against the parameter values for which convergence are not achieved. Similarly, for U{ — 10, p = 2, the UMM fails to yield the estimates 2 of βι and β2 when σ = 1.00.
ON GENERALIZED MIXED MODEL
207
Table 1. Comparison of percent relative bias (RB%) of the regression estimates for selected values of σ 2 ; k = 100; Ui = 4 (i = 1,... , fc), p = 4; Πi = 10 (i = l,...,fc), p = 2; true values of the regression parameters: βι = 2.5, β2 = -1.0, /33 = 1.0 and /34 = 0.5; 2,000 simulations.
Cluster Size (πi) 4
p 4
Parameter
10
2
σ2 0.10
Method CMM CCMM UMM CMM 0.30 CCMM UMM 0.50 CMM CCMM UMM CMM 0.75 CCMM UMM CMM 1.00 CCMM UMM CMM 0.10 CCMM UMM CMM 0.30 CCMM UMM CMM 0.50 CCMM UMM CCM 0.75 CCMM UMM CMM 1.00 CCMM UMM
βl
2432 2380 54.3 2368 2305 100 1848 1819 148.5 * * 200.0 * * 259.4 5463 4950 123.1 6513 6275 235.7 4360 605.8 320.0 5467 1615 364.7 2422 2300 *
ft
353.7 353.7 1.7 443.8 428.6 1.8 540.5 538.1 3.7 * * 1.9 * * 4.2 705.3 636.4 0 668.4 688.9 0 615.8 221.7 4.5 623.5 536.8 0 564.7 564.7 *
βz 433.3 422.2 0 650.0 650.0 5.0 837.5 837.5 0 * * 0 * * 5.9
285.3 282.4 2.7 321.9 318.8 8.3 375.0 362.1 3.0 * * 1.0 * * 4.4
208
4.2
SUTRADHAR AND GODAMBE
Prediction of the Random Effects
The results of Table 2 show the performance of the predictors of ηι (i = 1,..., k) by all three methods CMM, CCMM and UMM. Here we examine the empirical distribution form of the estimates of 7; (i = 1,..., k), given that 7i's are generated from the normal distribution with mean 0 and vari2 ance σ . This is done by studying the empirical mean, median, skewness and kurtosis of the predictors of 7^. Table 2, in its last column, also reports the simulated total mean square error (TMSE) of the random effect predictors based on the CMM, CCMM, and UMM. Let % be the CMM estimator of the random effect 7^ in the sth (5 = 1,..., 2000) simulation. Then k
(2000
ϊ 2
the TMSE of the CMM predictors is defined by ^
I Σ {% - 7;) /2000 I
where k = 100 is the number of independent clusters. Similarly, the TMSE of k
(2000
^
the CCMM and UMM predictors are defined by ^ < ^ (% - 7i)2/2000 > i=i 15=1 J k
(2000
ϊ 2
and y^ < ^ (7 i s — 7i) /2000 > respectively. It is interesting to note from the i=i L=i J table that the corrected conditional method (CCMM) performs extremely well in the prediction of the random effects as compared to the unconditional method UMM. Between the two conditional mixed methods, as expected, the corrected conditional mixed method performs much better as compared to the uncorrected conditional mixed method. This leads to the fact that correction for randomness of ji (i = 1,..., k) is quite important as the true 7i's are random. When the corrected approach is compared with the uncorrected approach for checking for the normality of the predictors of the random effects, both approaches appear to yield the normal predictors. But, the mean value of the predictors by CMM is quite away from the mean value zero for 7i (i = 1,..., fc), whereas CCMM yields zero mean value similar to that of the distribution of 7$ (i = 1,..., k). Next,
ON GENERALIZED MIXED MODEL
209
Table 2. Comparison of Mean, Median, Skewness, Kurtosis and total mean square errors (TMSE) of the random effect predictions for selected values of σ 2 ; k = 100; n< = 4 (i = 1,..., fc), p = 4; n< = 10 (t = 1,..., fc), p = 2; true values of the regression parameters: /?i = 2.5, fo = —1.0, /?3 = 1.0 and /?4 = 0.5; 2,000 simulations.
Cluster Size 4
p 4
10
2
σ2 0.10
Method Mean Median Skewness Kurtosis TMSE CMM 0.473 0.445 0.320 2.655 23.327 CCMM 0.000 -0.028 0.332 0.178 2.660 UMM 0.000 0.004 0.184 2.284 0.813 CMM 0.30 0.337 0.288 0.321 2.654 12.565 CCMM 0.000 0.324 2.654 0.211 -0.049 UMM 0.000 0.007 0.153 2.264 1.050 0.50 CMM 0.153 0.091 0.324 3.252 2.651 CCMM 0.000 -0.064 0.244 0.325 2.651 UMM 0.002 0.014 0.142 2.252 1.293 * * * * * CMM 0.75 * * * * * CCMM UMM 0.006 0.029 0.135 2.246 1.639 * * * * * CMM 1.00 * * * * CCMM 0.014 UMM 0.038 0.135 2.256 2.046 0.342 2.654 30.817 CMM 0.536 0.507 0.10 CCMM 0.000 0.014 -0.089 1.106 2.833 2.324 1.160 UMM -0.002 0.238 0.000 0.362 CMM 0.460 0.408 2.658 23.793 0.30 2.661 1.315 -0.056 0.429 CCMM 0.000 2.264 1.496 0.183 UUM 0.000 0.005 17.318 0.314 0.363 2.651 CMM 0.380 0.50 1.462 0.386 2.658 -0.069 CCMM 0.000 1.771 0.164 2.252 0.014 UMM 0.001 10.455 2.649 0.274 0.194 0.366 CMM 0.75 1.669 0.372 2.650 -0.081 CCMM 0.000 2.170 2.256 0.159 0.005 0.015 UMM 2.642 5.548 0.074 0.371 CCM 0.168 1.00 2.642 1.845 -0.094 0.373 CCMM 0.000 * * * * * UMM
210
SUTRADHAR
AND
GODAMBE
when the unconditional approach UMM is compared with the corrected conditional mixed method (CCMM) for normality, the UMM appears to produce much better mean, median and skewness values for the predictors of the random effects than those yielded by the CCMM approach. The CCMM, however, appears to yield much better kurtosis value (close to 3) as compared to the unconditional mixed method (UMM).
4.3
Estimate of σ2
Table 3 reports the simulated mean values and standard errors of the estimates of σ2. It is clear from the table that the unconditional mixed method and the corrected conditional mixed method compete each other in estimating the variance σ 2 of the random effects. The uncorrected conditional mixed method performs worse, as expected, as compared to its counterpart CCMM. This is because, unlike the CCMM, the CMM treats the random effects as fixed effects and use them to estimate the variance component of the random effects. Between the CCMM and UMM, σ 2 estimates of the UMM always have the smaller bias but larger standard errors than the estimates of the corrected conditional mixed method (CCMM). Table 3. Comparison of simulated mean values and standard error (SE) of the estimates of variance components of random effects for selected values of σ 2 ; k = 100; n< = 4 (t = 1,..., * ) , p = 4; m = 10 (i = 1,..., fc), p = 2; true values of the regression parameters: β\ = 2.5, /% = —1.0, βs = 1.0 and β\ = 0.5; 2,000 simulations.
Cluster Size 4
2
p 4
Method CMM CCMM
Mean SE Mean
SE UMM 10
2
CMM CCMM UMM
Mean SE Mean SE Mean SE Mean
SE
0.10 0.329 0.006 0.104 0.003 0.095 0.006 0.401 0.004 0.112 0.007 0.090 0.008
0.30 0.427 0.007 0.312 0.005 0.304 0.011 0.530 0.010 0.306 0.011 0.298 0.014
σ 0.50 0.545 0.007 0.520 0.007 0.511 0.015 0.667 0.013 0.512 0.018 0.506 0.020
0.75 * * * * 0.772 0.022 0.853 0.021 0.774 0.026 0.763 0.026
1.00 * * * * 1.032 0.026 1.059 0.027 1.029 0.030 * *
ON GENERALIZED MIXED MODEL
5
211
SUMMARY AND DISCUSSION
Our limited simulation study has shown for the Poisson mixed model that the unconditional mixed method (UMM) is superior to the corrected conditional mixed method (CCMM) in estimating the fixed effects parameters. The CCMM, on the other hand, performs better than the UMM, in predicting the random effects, as the total mean square errors (TMSE) yielded by the CCMM were always found to be smaller than those produced by the UMM. In estimating the variance of the random effects, both UMM and CCMM were found to be almost the same. The CMM always performs poorly in 2 estimating any parameters β or ηι (i = 1,..., k) or σ . This uncorrected CMM, therefore, should not be used in estimating the parameters of the mixed model. Note that it has been assumed in the simulation study that 7$ x~ iV(0, σ 2 ). But, in general, 7* (i = 1,..., k) may also follow non-normal distributions. In such cases, the performance of the adhoc estimates E^ifii) or £7(7117J, computed by pretending that ηι ~ iV(0, σ 2 ), may not be satisfactory. This shows the necessity for further investigation for the construction of a robust corrected predictor for 7, (i = 1,..., fc), irrespective of the distribution of 7^. This problem, however, does not arise in the unconditional mixed method (UMM). But again, it does not mean that the UMM is problem free. This is because, for large σ 2 , it may be extremely difficult to compute the unconditional marginal mean and variance of the response variable, which are required for the construction of the estimating equations for the regression parameters. Further note that when the elementary functions, such as huj in (2.1) and \i2i in (2.12) are conditionally correlated, Durairajan (1992) has given a closed form of optimal estimating function under certain conditions and utilizing this, Bai and Durairajan (1996) obtained optimal estimating function for means and variances of one-type and two-type branching processes. The result of Durairajan (1992) may be used in the generalized linear mixed model also. However, we have not considered this approach in this paper.
Acknowledgements The research was partially supported by grants from the Natural Sciences and Engineering Research Council of Canada. The authors would like to thank a referee for constructive comments.
212
SUTRADHAR AND GODAMBE
References Bai, K. and Durairajan, T. M. (1996). Estimating functions for branching processes. J. Statist Plann. Inference, 53, 21-23. Breslow, N. E. (1984). Extra-Poisson variation in log-linear models. Applied Statistics, 33, 38-44. Durairajan, T. M. (1992). Optimal estimating function for non-orthogonal model. J. Statist Plann. Inference, 33, 381-384. Ferreira, P. E. (1982). Estimating equations in the presence of prior knowledge. Biometrika, 69, 667-669. Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation. Annals of Mathematical Statistics, 31, 1208-1211. Godambe, V. P. (1994). Linear Bayas and optimal estimation. Tech. Report STAT-94-11, University of Waterloo. Godambe, V. P. and Kale, B. K. (1991). Estimating functions: An overview. Estimating Functions, (V. P. Godambe, ed.), Oxford University Press, New York, 47-63. Godambe, V. P. and Thompson, M. E. (1989). An extension of quasilikelihood estimation (with discussion). Journal of Statistical Planning and Inference, 22, 137-72. Laird, N. M. and Ware, J. H. (1982). Random effects models for longitudinal data. Biometrics, 38, 963-974. Liang, K.-Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13-22. McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London: Chapman and Hall. Naik-Nimbalkar, U. V. and Rajarshi, M. D. (1995). Filtering and smoothing via estimating functions. Journal of the American Statistical Association, 90, 301-306. Stiratelli, R., Laird, N. M. and Ware, J. H. (1984). Random effects models for serial observations with binary responses. Biometrics, 40, 961-971. Sutradhar, B. C. and Qu, Z. (1997). On approximate likelihood inference in Poisson mixed model. Canadian Journal of Statistics, to appear.
ON GENERALIZED MIXED MODEL
213
Sutradhar, B. C. and Rao, R. P. (1996). On joint estimation of regression and overdispersion parameters in generalized linear models for longitudinal data. Journal of Multiυariate Analysis, Vol. 56, No. 1, 90-119. Waclawiw, M. A. and Liang, K.-Y. (1993). Prediction of random effects; a Gibbs sampling approach. Journal of the American Statistical Association, 88, 171-178. Williams, D. A. (1982). Extra-binomial variation in logistic linear models. Applied Statistics, 31, 144-148. Zeger, S. L. and Karim, M. R. (1991). Generalized linear models with random effects: A Gibbs sampling approach. Journal of the American Statistical Association, 86, 79-86. Zeger, S. L., Liang, K.-Y. and Albert, P. S. (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics, 44, 1049-1060.
215
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
USING GODAMBE-DURBIN ESTIMATING FUNCTIONS IN ECONOMETRICS H. D. Vinod Fordham University ABSTRACT This paper explains why Godambe-Durbin "estimating functions" (EFs) from 1960 are worthy of attention in econometrics. Godambe and Kale (1991) show the failures of Gauss-Markov and least squares and prove the small-sample superiority of EFs. There are many areas of Econometrics including unit root estimation, generalized method of moments (GMM), panel data models, etc., which can use some simplification, a little greater emphasis on finite sample properties and greater flexibility. We show why statistical inference using the EFs in conjunction with the bootstrap can be superior. For example, compared to the GMM, our EF estimates of the 'risk aversion parameter' are economically more meaningful and have shorter bootstrap confidence intervals. Key Words: Generalized method of moments, bootstrap, confidence intervals, small sample Gauss consistent, optimal estimation
1
Introduction
The aim of this paper is to continue a dialogue between statisticians working with Godambe-Durbin estimating functions (EFs) and econometricians, apparently started by Crowder's (1986) lead article in an econometrics journal. Though Crowder proves consistency of estimates as roots of EFs, he neglects to mention (i) the "main lesson" of EF theory, and (ii) that Durbin's (1960) two-regression (TR) estimator for autoregressive distributed lag (ADL) models is an optimal EF (OptEF) Ordinary least squares (OLS) estimators are roots of "normal equations" and maximum likelihood (ML) estimators are roots of "score equations." The "main lesson" from Godambe's EF theory is to deemphasize the estimates (roots) and focus on the underlying equations called the EFs. One considers the bias and variance of EFs themselves. Minimizing the variance of (standardized) EFs is Godambe's (1960) G-criterion. It provides optimal EF, g* = 0, whose roots are the OptEF estimators. The large and recently growing EF literature, surveyed by Godambe and Kale (1991), Dunlop
216
VINOD
(1994) and Liang and Zeger (1995), gives theory and examples of successful applications in biostatistics, survey sampling and elsewhere. They show that EFs can offer distinct and improved estimates, both with and without normality. It is remarkable that whenever the OLS or ML estimators do not coincide with the OptEF, it is the OptEF that have superior properties in both large and small samples. Rather than another survey, this paper indicates avenues for further research and applications in econometrics by focusing on the regression problem. Section 2 discusses EFs for multiple regression. Section 3 reviews EFs for the GMM and ML, and includes our "results." Section 4 has further results on Instrumental variables and specific suggestions for achieving smallsample optimality in GMM. Section 5 refers to the EFs in further settings, mostly those used in biostatistics. Section 6 discusses statistical testing and inference with new bootstraps based on EF pivots. Section 7 contains some final remarks.
2
Estimating Functions And Multiple Regression
Consider the usual regression model with T observations and p regressors: e, E(e) = 0, E{ee') = σ2Ω
(1)
If the Ω matrix is known, one uses the generalized least squares (GLS) estimator, whose "normal equations" are viewed here as EFs and denoted by: ggis = X'ΩΓιXβ
- X'ϊl-ιy
= X'ίl^e
= 0.
(2)
The log likelihood function Lτ when e ~ JV(O, σ2l) (normal errors and Ω = /) is: 2
2
Lτ = (-T/2)log2π - (T/2)logσ - (l/2σ )(y - Xβ)'{y - Xβ).
(3)
Let ST denote the p x l score vector of partial derivatives of LT with respect to (wrt) β. dLτ/dβ
= ST = (Mσ2)X\y
- Xβ) = (l/σ2)X'e
= 0.
(4)
Under the above assumptions the score equation Sτ = 0 is OptEFg*=0, since it minimizes Godambe's G-criterion defined for our vector case as «)D^\ where Dg* = Edg*/dβj for i, j = l , - , p.
(5)
ESTIMATION IN ECONOMETRICS
217
The OLS normal equations from (4) are, X'e = 0, which are optimal when Ω = /. Let us rewrite them as a sum over t: gols(y, X,β) = Σf=1 X[{yt - Xtβ) = 0, where Xt={xn, , xtp) is a row vector and X[Xt (for t = l , matrices having p equations for each t.
(6) ,T) are pxp
Remark 1: The EFs (as vector random variables) are sums of T terms, but the corresponding (estimators) roots need not be sums. Hence the "central limit theorem" arguments apply directly to EFs, supporting Small and Mcleish's (1988) claim that EFs are "more normal." This justifies what we called the "main lesson" of Godambe's EF theory, asking us to focus on the EFs while deemphasizing the roots. Using the Cauchy-Schwartz inequality, Kendall and Stuart (1979, sec. 17.17) prove that the Cramer-Rao lower bound on the variance of an unbiased estimator y of μ = Xβ is attained, if and only if, β is chosen to satisfy ST = A(X, β, σ 2 ) (y — Xβ) where A is the arbitrary constant (matrix) of proportionality. We refer to this important small sample property as "attaining Cramer-Rao," often readily satisfied by the EFs. Durbin (1960) proved that his simple TR estimator is OptEF for the ADL model, as it "attains Cramer-Rao," even if the autoregressive (AR) parameter has a unit root, or yt is explosive with the root > 1. Hendry (1995, p.232) lists eleven applied econometric models, including partial adjustment, equilibrium correction, leading indicator, etc., which are special cases of ADL models, but ignores Durbin's TR estimator for them. Similarly, most econometricians discuss asymptotics of complicated estimators, which often exclude unit roots or explosive cases. Since EF theorists have recently developed a better understanding of conditioning which makes Durbin's TR estimator OptEF, it deserves a fresh look in econometrics. Gauss's intuitive notion of consistency from 1820's is explained by Sprott (1983). An estimator is said to be small-sample Gauss-consistent (SSGC) if it equals the true value, when all errors are zero. For example, solve the OptEF Xfe = 0 when €*=() for all t in (6), and note that the EFs are obviously SSGC. Small-sample Gauss-consistency (SSGCy) is merely a desirable property, using Gauss's tool for studying the properties of estimators. Note that SSGCy is different from unbiasedness, since it does not involve averaging. Just as the conventional asymptotics is useful, even though T may never be actually infinite, SSGCy can be useful even though all errors may never be actually zero. Kadane (1970) implemented SSGCy as small-sigma asymptotics, which is conveniently discussed later in Remark 7. Wedderburn (1974) defined integral of ST as quasi-likelihood, used in econometrics by White (1982), without referring to Wedderburn. Quasilikelihoods require only the mean and variance and admit the exponential
218
VINOD
family of distributions. If only the first two moments are specified (no normality), the likelihood is unknown and the ML estimator is undefined. By contrast, EFs remain available and "attain Cramer-Rao." Economists' common impression that "under normality, ML is unbeatable" is proved wrong in Godambe (1985) and further explained in Godambe and Kale (1991). Godambe and Heyde(1987) prove that EFs yield asymptotically shortest confidence intervals. Vinod (1996b) explores this with regression examples. Small and McLeish (1988) show that EFs minimize the asymptotic mean squared error (MSE). The EF literature shows that solving an OptEF starting at some initial consistent estimator followed by Fisher's method of scoring, is similar to the Newton-Raphson iterates with readily available algorithms and "more normal" properties. Let the parameter of interest be a smooth function u(β) with nonzero derivatives, (e.g., long-run elasticities in econometrics). The chain rule implies that the new score for v(β) is the original score ST times (dβ/dv). Assuming that ST can be written as a sum over t, the new score equation, which is the new EF, can also be a sum over t, and by Remark 1 it too is "more normal". If we ignore the "main lesson" above and focus on the roots, its sampling distribution can be quite nonnormal. For example, if v{β)—β2^ it is easy to verify that the sampling distribution of the root is χ 2 . Remark 2: The EF literature seems to ignore warnings of numerical mathematicians. For example, finding regression coefficients by solving the normal equations (EFs) directly is not advisable, especially for ill-conditioned problems, due to rounding and truncation errors, Sawitzski (1994). Vinod and Ullah (1981) formulate the regression problem in terms of the singular value decomposition (SVD) of X = HAλ/2Gf, which leads to βols = GA~l/2H'y, as the computationally most reliable technique. ι If the regressors X are correlated with the errors we have X'Qτ e φ 0. This leads to biased EFs from (2), Έggιs φ 0. The familiar method of instrumental variables (IV) assumes that we have data on a Txr matrix Z of r instruments which are (i) correlated with X, and (ii) uncorrelated with ι e. The condition (ii) can be written as: EZΏ~ e = 0, which obviously achieves an unbiased EF of the IV estimator: giυ = Z'Ω-^V " Xβ) = Z'9Γ\y
- μ).
(7)
We will note [after eq. (18)] that replacing X by Z in IV-type estimation leads to "overidentification problems." Instead, Godambe and Thompson (1989) prove that it is "optimal" to replace X in the score equation (4) by c its own expectation, (denoted later by X ), hierarchically "conditional" on exogenous and lagged variables. Singh and Rao (1997) generalize "hierarchical conditioning" in Godambe-Thompson theorem and provide results on
ESTIMATION IN ECONOMETRICS
219
asymptotic efficiency. In (A.8) of our appendix we use AR(3) and AR(4) to replace a regressor logCt by its predicted value.
3
Estimating Functions, GMM And ML Estimators
It is well-known in econometrics texts that OLS, GLS, ML and IV can be viewed as special cases of GMM (e.g., Hamilton, 1994, ch. 14). Since the summation in (6) can be replaced by an expectation, it is obvious that the moment conditions of GMM can be viewed as EFs in (6). Since the GMM is a direct generalization of Pearson's method of moments, it has (conditional) moment equations. The first moment leads to the following EF: E[9mom(y, X, β)] = E[X't(yt - Xtβ)} = 0,
(8)
The GMM solution of (8) is obtained flexibly by minimizing an associated quadratic form gfmomWgmorn, where W is any positive definite weight matrix. Remark 3: The asymptotically optimal GMM minimizes gfmomW gmom, for W = [Limτ->oo TE(gmomg'mom)]-\ Hamilton (1994, p. 413). Unfortunately, this W involves ad hoc choices and possibly impractical evaluation of the Limτ-^oo' By contrast, when available, the score is the OptEF without using asymptotic arguments. Since OptEF "attains Cramer-Rao," optimal GMM must be suboptimal in small samples, if it is different from the OptEF. Dhrymes (1994, p. 368) calls GMM a "minor" modification of standard IV methods. However, GMM is popular in the context of "rational expectations" framework in economics. It can adapt to dynamic macroeconomics, where the Euler equations can be directly used as moment equations. For example, Christiano and Eichenbaum (1992) write the "first order conditions" from the "real business cycle" theory as the moment restrictions of the GMM. An Appendix illustrates another application deriving (A.4) as the EF for Tauchen's (1986) intertemporal utility maximization in a consumption based capital asset pricing model (C-CAPM). See Ogaki (1993) for C-CAPM details. Result 1: Under certain assumptions of the regression model, EF methods yield the same β as the OLS, ML, GLS, IV and GMM estimators. However, OptEF methods may involve different choice of instruments or a different estimate of Ω yielding a different β. Then, OptEF alone has small-sample optimality and assumes no more than existence of first two moments. For example, if due to heteroscedasticity, Var(yt) is a function of /?, Godambe and Kale (1991) prove that the OptEF estimator is superior to GLS and ML.
220
VINOD
It is well known that ML estimators (found by solving the score equation (2) as the EF) enjoy an equivariance property wrt a one-to-one differential transformation, a property not shared by the unbiased minimum variance estimators. Now assume that Ω(>) is unknown, and collect the parameters as θ = (/?,>), where β are parameters of interest and φ are the nuisance parameters. The so-called Neyman and Scott (1948) problem is that the ML estimator obtained by ignoring nuisance parameters φ can be inefficient and inconsistent. Though this was published as a lead article in Econometrica, it is rarely, if ever, cited in econometric literature. The EF literature shows how to avoid the Neyman-Scott problem. Let q denote a complete sufficient statistic for φ for each fixed θ. Assume that q is independent of β and let f denote a generic density. We have (9)
f(y,β,Φ) = f(y\q,β)f(q;β,Φ),
where the first part is a conditional density. Godambe (1976) shows that the conditional score SCnd = dlogf(y\q,θ)/dθ
(10)
=0
is an OptEF, since it maximizes the G-criterion. Lindsey's (1982) conditional score subtracts expected value to achieve unbiasedness. For time series data, Godambe (1985) uses conditional expectations using information till time t1. Similarly Godambe and Thompson's (1989) hierarchical conditioning, and results in Godambe (1991) and Bhapkar (1991) can be used to avoid the Neyman-Scott problem. Our next task is to evaluate the G-criterion of (5) using g*=Scnd=0 of (10) and understand its deeper meaning. Clearly, the Dg* in (5) equals the Fisher information matrix Ip = —E[d2Lτ/{dθdθ')]. Let us denote this "second order partial" (Hessian) form of IF by /2op The E(g*g*') in (5) equals the "outer product of gradients" form, denoted here by Iopg. Thus we denote: Iopg = E{(dLτ/dθi)(dLτ/dθj)},
hop = -Ei</dθidθj},
(i,j = 1,
,/>)• (11)
Result 2: If g* proportional to a score vector similar to Sr, G-criterion = I^}v hpgl^p
(12)
For proof use (5) and (11). White's (1982) "information matrix equivalence theorem" states that: when the model is correctly specified, IF = hop — Iopg
(13)
221
ESTIMATION IN ECONOMETRICS
White further developed (13) into a specification test. Corollary of Result 2: Only when the model is correctly specified in the sense of (13) the G-criterion minimand (12) reduces to J ^ 1 , which is proportional to the variance. Rather than using the normal likelihood of (3), we recall μ=Xβ notation and verify this corollary in a simpler setting of estimating the mean μ of independent and identically distributed (iid) variates: yt~ΠD{μ,σ2).
(14)
Now, the simpler quasi-log-likelihood of an observation is , σ2) = -0.5log2π - O.blogσ2 - 0.5(y* - μ)2/σ2.
(15)
Defining θ — (μ, cr2), the Fisher information lF=hop is a 2x2 diagonal matrix having: (σ~ 2 , 2~ισ~A) along the diagonal. Since its inverse has variances along the diagonal, we have verified that Ip1 is proportional to the variance. Next, we have: /opg -
σ"2 4-^72
(16)
where 71 = E(yt — μ) 3 /σ 3 and 72 =E(yt — μ) 4 /σ 4 — 3, measure skewness and kurtosis. Clearly, only if 71 = 0 and 72 = 0, hop=Iopg are both diagonal matrices satisfying (13). Thus, by avoiding (13) the G-criterion offers (robustness) some protection against misspecification. This discussion of the corollary is intended to help the intuition. Remark 4: The practical success of EF methods reported in biostatistics, sampling and elsewhere may be due to the corollary. The G-criterion of (12) seeks efficiency (low variance), as it adjusts for misspecification arising from hpg Φ hop in finite samples. Minimizing the G-criterion (12) is not directly attempted in econometrics, though minimizing variance (efficient estimation) is ubiquitous. The G-criterion needs only finite mean and variance, and quasi (not actual) likelihood functions. By contrast, the ML is sensitive to misspecification of the likelihood function. In short, EFs provide a more general and flexible estimation theory, which remains applicable in finite samples and is insensitive to possible misspecification or nonexistence of the likelihood function.
4
Further Results On Instrumental Variables And GMM
Consider the class G of all estimating functions of the form g = Ae, where A is a p x T matrix, and Ee = E(y — μ)=0. In this section, we eliminate σ2 for
222
VINOD
brevity and redefine Ω = Eee'=E(y - μ)(y - μ)1. Also, let H(β) denote any pxp nonsingular nonstochastic matrix whose components depend on /?, and let the superscript + denote a generalized or pseudoinverse of the matrix. Result 3: In the class G, g* is an optimum estimating function (OptEF) if and only if g* = H(β)[d(y - μ)/dβ]'n+(y - μ). (17) The proof is analogous to Vijayan (1991), who has proved a special case when Ω is a diagonal matrix, which is of interest in survey sampling. Despite possibly nonzero off-diagonals, essentially the same arguments apply here. In particular, for the GLS we have μ = Xβ, X = d(y - μ)/dβ. Hence the choice A = XΏ+ gives the OptEF. This reduces to (2), the EF for the GLS, when H(β) = I and Ω is nonsingular. Thus (2) is the OptEF under certain assumptions. If the EF is biased, EA(y-μ) φ 0, we replace A by Az = ZΏ+ choosing the instruments Z which satisfy EAz(y - μ) = 0 = Eg*. This is when giυ of (7) is the OptEF. Result 4: If Z is a T x r matrix (of rank r) of instrumental variables, instead of (5) our new G-criterion minimand is (Z'Ω+X) + (ZΏ+Z)(XΏ+Z) + .
(18)
To derive (18), use (7) and write gιυ = Z'Ω + €. Now, verify that Egivg'iv = Z'Ω + Z, and that the Dg* from (5) becomes (Z'Ω + X). Since each instrumental variable is observable (inside the information set), our Az is observable in the linear case. Ogaki's (1993, p. 459) extension to the nonlinear case can be adapted here. The term "overidentification" in econometrics refers to simultaneous equation models when r>p, where r denotes the number of exogenous (predetermined) variables in the complete model, and p denotes the number of parameters in a single equation. The minimand (18) is designed for the overidentified GMM case, where r>p means that there are more (moment conditions) instrumental variables than regression parameters. In the EF theory, since one can combine many EFs into one, overidentifying restrictions can be superfluous. For example, we can combine g\ =0 and g2=0 into g\ +92=0. The "hierarchical conditioning" mentioned earlier obtains OptEF by replacing the X in (2) by a matrix of conditional expectations denoted here by Xc. The i-th column of Xc may be obtained by regressing i-th column of X on a column of ones (intercept) and any or all r columns of Z for.i=l, , p. Now we list some advantages of Xc over the Z matrix of instruments from (7): (i) The column dimension of Z as regressors does not matter, and one need not worry about overidentification, let alone have a complicated formal test for it. (ii) It is easy to guarantee that Xc columns
223
ESTIMATION IN ECONOMETRICS
are highly correlated with the X columns, (iii) A simpler G-criterion can be used to attain optimality. For an example, see our appendix. For the overidentified case, the GMM moment condition (8) becomes E(Zft(yt — μt)) = 0 for any t, where Zt denotes the t-th row of Z. Also, the GMM minimand becomes: (y-μyZWZf(y-μ),
(19)
where W is an r x r matrix of weights. Assuming that X is of full rank p and r>p, i.e., we have enough instruments, then GMM estimator βgmm is a solution of the following p EFs: 99mm = (X'ZWZ'X)β - X'ZWZ'y = 0.
(20)
Recent GMM theory notes that the optimal r x r weight matrix W is: W = STΩ^S'T, where Sτ= Z/Ω,~1(y — Xβ) is the r x l score vector similar to (7). Since score equations are OptEFs this brings GMM and EF theories closer. However, substituting this W in (20) seems to be unnecessarily complicated. Remark 5: Only when the W matrix which minimizes (19) also minimizes the G-criterion of (18) will the GMM solution coincide with the EF solution. Otherwise the GMM is suboptimal, as we have already noted in remark 3. Result 5: Denote the root of an EF by βef, and let the corresponding vector containing T residuals be e = y — μ(βef). Since the GLS normal equations (2) are EFs, their "equation residuals" are (degenerate) identically zero, or XΏ~ιe = 0 . Recall that we may avoid biased EFs by replacing X by X c , a conditional expectation based on Z instruments. Denote Aze = X^Ω^e = 0. To verify SSGCy we set e = 0 and note that this amounts to solving the normal equations. Thus, we always have SSGCy for these unbiased EFs. However, the true unknown "equation errors" g* — Aze are a non-degenerate vector of random variables with zero mean and variance V = Var(g*) = Eg*g*' = EAzee'A'z
= AzVAfz.
(21)
Since Aze = 0, it is incorrect to replace e in (21) by e. Instead, our estimate of variance is: C + C V = X 'Ω X . (22) Remark 6: Equation (22) needs a consistent estimate of Ω+ (or Ω" 1 , if it exists). If regression errors follow an AR(1) process with parameter p, it is well-known that Ω " 1 is proportional to a tridiagonal matrix: It has 1+p 2 along the main diagonal, except for ones in the two corners and a — p along the sub and super diagonal. More generally, we find the parameters of appropriate (by Schwartz criterion) autoregressive integrated moving average
224
VINOD
(ARIMA) process for residual e and indirectly determine autocovariances φj for j = l , , q 0 as σ -> 0, small-σ approximate bias is Bsσ = EAx(μ — μ), linking the bias of EFs with the bias of estimators μ. Writing the bias, variance and MSE as a power series in σ, one obtains small-σ approximations by ignoring high powers of σ. The variance of the biased EF is E(ggf) = (EAxee'A'x) = {EAxuu'Ax)/a2
(23)
If μ = Xβ, small-σ bias is Bsσ = EX'Ω+X(βef - β). If μ = Xβ and X is nonstochastic, E{gg')= (X'Ω+X), where we use Euu' = σ2ίl and cancel the σ 2 in the denominator of (23). More research is needed to explore smallsigma methods for EFs.
5
Estimating Functions In Further Settings
Cox's (1975) partial likelihood score, which is a sum of appropriately conditioned scores (10), is widely used in biostatistics and elsewhere. Godambe 2 2 2 (1985) proves that the optimal coefficient in g* is Et-i {d Scnd/dθ )/Et^ι {Scnd) = —l2op/Var(SCnd) Next, he proves that the EF theory implies optimality of the partial likelihood scores, which have been used by practitioners since the late 1970's. Wedderburn (1974) considers a general linear model (GLM) with independent responses. The GLM is specified by a known (monotonic differentiable) link function v{μt) = v(X[β), and the random part is from the exponential family. The Box-Cox transformation is one kind of link function well-known in econometrics. The quasi-score function for GLM is z? / Ω" 1 (y-μ) = o,
(24)
where μ = v~ι{Xβ), D = dμi/dβj is a pxp matrix, and Ω = diag(Ωt) is T x T diagonal matrix. The logistic model has μt=exp(X[β)l[l+exp(X'tβ)],
225
ESTIMATION IN ECONOMETRICS
dμ/dβj=xtjμt(l-μt)
Its familiar link function μt = log(yt/(l-yt)) provides
a useful simplification and linear estimating functions. Godambe and Heyde (1987) show that the quasi score function (which minimizes the G-criterion) is an optimal estimating function. Godambe and Thompson (1989) extend the quasi score function to include the quadratic wt = (yt — μt)2 and show that the variance matrix for (yt, wt) involves skewness and kurtosis parameters in (16). The GLS estimating functions (2) are a special case when D = X. For repeated k measurements (panel) of correlated data, the D matrix of (24) becomes block diagonal, and we have similar estimating functions with an additional summation over k blocks. These are called generalized estimating equations (GEE) defined by: Ytj—iDjΩ" (yj — μj) = 0,
(25)
where the matrices have subscript j to distinguish them from those of (24) in analogous notation. This will correctly account for the correlations φ incorporated in Ωj(φ) if they are known. Otherwise, one uses the feasible version by replacing them by consistent estimators denoted by Ωj. The asymptotic covariance matrix of GEE estimates of β is given by Varφgee) = σ2A~ιBA~ι, with A = Σ^Dfi^Dj B = Σ^DjΩ^ΩjΩ^Dj.
and (26)
Although repeated measurements are rare in econometrics, the above methodology can be useful. For example, Vinod (1989) uses a fuzzy range of rounding errors around each measurement to artificially create repeated measurements. In biostatistics, GEE has been applied to study various situations where y is continuous, binary or count. Hence econometric applications to similar data (e.g., panel) are worth considering.
6
Statistical Testing And Inference With New Bootstraps
Statistical testing is important in econometrics and the bootstrap has been proved to be useful for difficult inference problems, Hall and Horowitz (1996), Vinod (1993). Denote Fisher's pivotal for i-th regression coefficient as: ep = Φ% — βi)/SEi, where SEi denotes the standard error. This pivot is asymptotically standard normal, N(0,l), and since it is a function of (y,X,/3), it is an EF itself. Inverting Fisher-type pivotals (having βi — βi in the numerator) always yields symmetric confidence intervals. For the 95% case, inversion yields φi qF 1.96£E?i), a symmetric confidence interval for βi.
226
VINOD
Godambe and Heyde( 1987) prove that EFs yield asymptotically shortest confidence intervals. Lele (1991) suggests a symmetric EF bootstrap for dependent time-series data. Vinod (1995) suggests using the double bootstrap to deal with the problem of a lack-of-a-pivot, whereas Vinod (1996b) considers asymmetric intervals and uses the double bootstrap. Thavaneswaran (1991) notes that the sequence of EFs based on Godambe (1985) is a "martingale difference sequence" and provides Wald, Rao's score and nonparametric test statistics involving martingales and semimartingales. His applications include testing for structural change, common in econometrics. Heyde and Morton (1993) show that the restricted OLS estimator (RLS in econometrics texts) can be derived from projections in EF theory without using Lagrangians. For testing linear restrictions, one can use their EF estimators and corresponding Fisher information as covariance matrices for construction of test statistics. However, we propose a different approach where the pivotal is constructed from g and Egg' rather than the usual {0ef-β]x[Var(βef)}-V2. Given Eg = 0 (unbiased g) we seek the square root matrix of the p x p variance matrix V = Eggr of EF if V is known. Otherwise, we use a consistent, robust and nonparametric estimators V of V from (22). For related literature, see Ogaki (1993) and Andrews and Monahan (1992), among others. Now consider a square root decomposition: V = σ£,σ#, if available, or V = GLGR,
(27)
where GL denotes the p x p left-square-root matrix decomposition of the matrix, and GR a similar right-square-root matrix, assuming they exist. The computations may be based on either the Cholesky or symmetric versions. If Eg φ 0 due to nuisance parameters being estimated, Godambe (1991, p.144) suggests using g^g—Eg as the revised EF, where the expectation is conditional on minimal sufficient statistic for the nuisance parameter. For biased estimators, Vinod (1984) argues that it is useful to replace the variance in the denominator of the usual t ratio by the square root of an unbiased estimate of the mean squared error (^/UMSE). However, he shows that the resulting distribution is similar to a noncentral Chi-square, depending on unknown noncentrality; hence the revised t-ratio is only an approximate pivotal. Thus, if g is biased, we may replace the V in (27) by a matrix containing the UMSE. Further work is needed to explore the use of UMSE here. Result 6: An approximate pivotal for the EF estimator βej obtained by solving g = 0 can be ι U = σl g, (28) where we assume E(g) = 0 and use the square root decomposition in (27).
ESTIMATION IN ECONOMETRICS
227
For example, substituting V from (22) and Ώ from Remark 6 in (27) one can incorporate autocorrelations and heteroscedasticity. As a pivotal, we know that asymptotically, U is well behaved and Var(U) -> I. By contrast, a pivotal for βef needs variance of the sampling distribution of the root βef of g = 0, which can be more complicated. When nuisance parameters φ are present, the distribution of g may depend on φ. To solve this problem Morton (1981) considers a pivotal g, where the distribution depends only on the parameters of interest β. Parzen et al (1994) provide two mild regularity conditions on general EFs for asymptotically valid bootstrap confidence intervals. Now we outline typical steps of a bootstrap for time-series regressions, designed to overcome the time dependence (among regression errors). These should be modified to suit specific examples, and all computations should use numerically reliable methods (e.g., use SVD of Remark 2 when applicable). Steps for a Parametric Bootstrap using EFs: (i) Denote i-th columns of X and Xc by X{ and x\, respectively. Let χ9 = Xi for (deterministic or exogenous) variables in X known to be uncorrelated with errors, (y — μ). Regress each of the remaining (correlated) X{ on a column of ones (for intercept) and all relevant instruments Z. Use the predicted value from this regression to complete the Xc matrix. Construct p optimal EFs g* = X c/ Ω + (y — μ(/?)), where we permit some nonlinear functions μ(β)) instead of the usual Xβ. Unless Ω is known, use Ω = /, to obtain a preliminary p x l vector βpre by solving g* = 0. (ii) Use the T residuals et — yt — μ(βPre)t and Remark 6 to construct q autocovariances φj and hence a TxT Toeplitz matrix Ω to approximately represent time dependence. Make further adjustments to the diagonal terms to allow for heteroscedasticity, if desired. (iii) Solve the revised g* = XcfΩ+(y - μ(β)) = 0 to obtain /3e/, new residuals et=yt — μt{βef) > and revised Ω. In some examples it may be useful to repeat this step until βef values converge within a given tolerance. (iv) Obtain a (Cholesky) factorization of the pxp matrix V = XcfΩ+Xc = VLVR, from (22) defining σ^ and GR. (v) Generate J(=999, say) simulated p x l vectors e*j=σLV, of "equation errors," where υ is a p x l vector of N(0,l) unit normal deviates and j = l , J. (vi) Solve J different EFs: e* = Xc/Ω+(y - μ(β)) to yield J estimates of the p x l vector β denoted by β*^. If some function f(β) is of interest, (e.g., long-run elasticity), compute J estimates of that function denoted by
£(*/)• (vii) Order the p components of β separately in an increasing order and denote the order statistics by β*efuy For a 95% two-sided confidence interval
228
VINOD
use /?e/(25) a s the lower limit and βlf,97^ as the upper limit.
Similarly, from
the function estimates fjΦlf) = fj the confidence interval is [/(25), /(975)] Since the EFs are "more normal" than estimators of β (due to averaging), and since we are using EFs as pivots, we have reliable statistical inference according to the bootstrap and EF theories. A null hypothesis regarding functions f(/3), similar to long-run elasticities of Remark 1, can be tested by using appropriate confidence intervals from f(/3*j) in step (vii). The following nonparametric bootstrap can avoid estimation of Ω. A Nonparametric Bootstrap using EFs: Vinod (1996a) suggests using scaled recursive residuals, recently surveyed by Kianifard and Swallow (1996). For regressions with dependent data they are attractive, because they are iid with zero mean and unit variance. Denote by X τ a r x p matrix consisting of the first r rows of X, provided r > p + l . Let y τ denote the r x 1 vector of initial τ observations of y, and let bτ = (Xίj-Xr)"1 X'τyτ denote the corresponding OLS estimator. Denote xτ such that yτ — x'τbτ-ι is the "conditional forecast" of yτ for τ=p + 1, ,T. Instead of using y r , we define recursive residuals in an equivalent simpler form: wr = (yτ - x'τbτ)/στ,
where σ r = [1 - x'^X^X^Xr]1'2,
(29)
where the residual in the numerator, yτ — μ τ , is scaled by σ r to achieve Var(wτ) = l. Kianifard and Swallow (1996, p. 393) provide a computational strategy and write the (T — p) x 1 vector w = wτ=Cy, where C is a (T— p) xT complicated matrix. They show that CX = Q,Ew = 0, Eww' = σ2lτ-p, and w'w = e'e. Denote the residual sum of squares by RSST = (yr — xrτbτy(yτr 2 -x TbT). Since RSSτ = RSSτ-ι+w , this gives a numerically reliable method of estimating σ τ . From the OLS normal equations, the EFs are (T-p) separate sets for ct r = p + 1, ,T: g* = X (yT - μr)loτ, where μτ = x'τβτ. Each r leads to p equations g* = 0 yielding T - p sets of βefτ. However, we are mainly interested in the last solution βef=βefT using all available data. Bootstrap shuffling of the T - p iid wτ residuals avoids the problems with bootstrap for time series (dependent) data. Denote shuffled recursive residuals as wTj for j = l , , J(=999). Substitute wτj in (29) to get yrj = yτ + wTjdT. Next, we assemble yj = (yi, yp,yij, '-yTjrVTJ)', into a Tx 1 vector of y values for the j-th bootstrap resample. Now, solve Xcf{yj - μ(β)) = 0, to yield J estimates of p x l vectors denoted by /3*^. Finally follow the step (viii) mentioned earlier for statistical inference and confidence intervals. An Appendix implements this for the C-CAPM model using the GAUSS and RATS computer languages.
ESTIMATION IN ECONOMETRICS
7
229
Conclusion And Final Remarks
This paper discusses EFs, which are functions of parameters and data, g(y,X,β) = 0, in the regression context of econometrics. The main lesson of the EF theory is to focus attention on EFs while deemphasizing their roots (or estimators) β. The EFs are shown to be "more normal" than the roots, leading to better behaved pivotals for inference and easier conditioning arguments. Applied researchers sometimes wish to avoid the ML estimator, since the likelihood exists only under the (often unverifiable) normality assumption. If they are confident about the existence of the mean and variance, Wedderburn's (1974) quasi likelihood score can be the optimal EF. Using the EFs econometricians can avoid some fancy asymptotic theory and Όveridentifying restrictions'. Also, some recent unit root and other estimators of autoregressive distributed lag (ADL) models are shown to be unnecessarily complicated and suboptimal in small samples compared to Durbin's (1960, p.151) two regression (TR) estimator, further justified by the recent EF theory. We show that the optimum EF estimate β based on the quasi score "attains Cramer-Rao," and is small-sample Gauss-consistent (SSGC). Further advantages of EFs proved in the literature are: minimization of the asymptotic mean square error, ability to yield shortest confidence intervals, solve the Neyman-Scott problem and provide efficient Newton-Raphson iterates. Our references also give examples of practical uses of the EFs. We include an appendix where a familiar econometric example of dynamic rational expectation model (C-CAPM) shows that EF estimates are similar to GMM estimates, while avoiding some of the ad hoc features of the GMM. For inference on this example, we implement the nonparametric bootstrap using the iid recursive-residuals from (29). Our EFs yield shorter confidence intervals and our estimates of both risk aversion and time preference parameters are economically more meaningful than the GMM estimates. This application using commonly available data demonstrates that EFs are practical and potentially valuable tools for econometrics. This paper also includes several "results" and "remarks" which are potentially useful, and provides detailed steps for two new time series bootstraps for possibly dependent data. Since Euler equations or first-order conditions are readily regarded as EFs, there are many potential applications in economic dynamics. We can test complicated dynamic economic theories directly by using the EF methods, and recommend using the bootstrap when inference becomes analytically complicated. We have demonstrated that EFs can improve upon some esoteric econometrics and simplify applications.
230
VINOD
APPENDIX A GMM and EFs for C-CAPM Generalized Method of Moments and Estimating Functions for Estimation of Consumption-Based Capital Asset Pricing Model Consider a representative agent model leading to C-CAPM, see Tauchen (1986) and Ogaki (1993). Denote ct= consumption, pi t =price ex-dividend of i-th asset, and cf^=real dividend. Let an intertemporal budget constraint be: ct + Σfίtfitpit =^iLiQi,t-i(Pit + da), where <^=amount of the i-th asset the agent carries into the next period t+1. Now the intertemporal utility is assumed to be additively separable. Lifetime utility of an agent with infinite time horizon (due to bequests) is
EtΣfβku{ct+k),
(AΛ)
where Et is conditional expectation using all information till time t, and 0< β < 1 is subjective rate of time preference. If the constant subjective interest rate is i β , the parameter β is defined as l/(l+i8). The market interest may determine is and an "instrumental variable" may be based market interest rate data. In (A.I) the first and second derivatives (denoted by primes) of the utility function satisfy: υ! > 0 and u" < 0. A popular utility function is: u(c) = c 1 ~ 7 / ( l — 7), with 7 > 0, hats υ! = c~ 7 and v!1 =—7c" 7 " 1 where u" < 0. Since cu"/uf = —j, is a constant, this u(c) belongs to the constant relative risk aversion (CRRA) family of utility functions. The estimation of risk aversion parameter 7 is more important than that of the discount parameter β. We do not permit the agent to continuously roll over the debt. The already mentioned first-order conditions for maximizing the lifetime utility, subject to the budget constraint, are the Euler equations: pitv!(ct) = βEt[u'{ct+ι)(pitt+1 where (i=l, as
+ di >t+ i)],
(A.2)
,M). Using the CRRA utility function for u, we rewrite (A.2) Pi^Ί
= βEtpi^ιc^ι
(A3)
+ βEtditt+ic£!v
which can be written in terms of observable growth variables defined as ratios dij = dί,tM,t-i, and Ct = Q / Q - I and the price dividend ratio πa = Pi,t/d>i,t for the i-th asset as the estimating function (EF)
= 0,
(4.4)
where all terms are observable, except the parameters denoted by # = (/?, 7). Denote the gross return per dollar invested on the entire market from t to t+1, [(1 + πt + i)/πt]d ί + i, by Rt+i and rewrite (A.4) as: Σj=ιgmom,t(θ)
= Σj=ι[βRt+ιC^\
- 1] = 0,
(A.5)
231
ESTIMATION IN ECONOMETRICS
Typical GMM estimation involves following steps: (i) Choose r instrumental variables defined from lagged C and lagged asset returns to form an r x l vector zt for each t. (ii) Define gt{θ) = gmom^t- (ϋί) Compute time averf ages gτ(θ) = Σf=ιgt(θ)/T. (iv) Compute A = (l/T)Σj=1gtg t with diagonals at. (v) Use the starting Wj for j = l as a diagonal matrix having (1/αt). (vi) Compute the θj for j = l by minimizing a quadratic form gt{θ)'Wjgt(θ). 1 (vii) For j=2 let Wj = [(l/T)ΣjL1gt(θj-ι)gt(θj-iY]- . (win) Compute θj as minimizers of gt{θj-i)'Wjgt(θj-ι)(ix) Repeat steps (vii) and (viii) till convergence and obtain the GMM estimator θgmm. It is assumed that Wj ι converge to WQ = [Egtg't]~ evaluated at the true value θo, and that Egtg[ is nonsingular and symmetric, (x) The final step is to find the covariance matrix Var(θgmrn) = D^WQD^ where D% = EtDt, where Dt matrix contains partials oΐgtφ) with respect to θ. In practice, one substitutes time averages of Dt for Df and replaces Wo by the time average [(l/T)Σj=ιgtgrt]~ι, where the gt are evaluated at the θgmm, Tauchen (1986). Now we turn to our EF estimation. We rewrite (A.5) as E(βRt+iC£j!i)=l, which may be written as a nonlinear regression model with errors (not utilities) denoted by subscripted u: C^
= 1 + ut+u
Eut+ι = 0,
rxt+i = σet+i,
(A6)
where σ refers to small-sigma asymptotics (Remark 7). Now consider limits as σ -> 0. The SSGCy requires that the method of estimation should yield the correct estimates when the model equation errors ut = 0 for all t. In models containing forward-looking economic time series, it is reasonable to extend SSGCy to require that the estimation method performs correctly when "expectational errors" are zero. Log of the right hand side of the regression in (A.6) is log(l + ut+ι)=ut+ι - ^?+i/ 2 H = 0^+1 —σ2 *?+i/2. Since σ -> 0, we can omit terms with σ7 for j> 2. Next, we replace t+1 by t, and σet+ι by ut to write a new C-CAPM linearized regression: logRt = -logβ + ηlogCt + ut.
(A.7)
This new method suggested by the EF approach is potentially useful elsewhere in macroeconomics. Of course, the EFs are explicitly nonlinear functions, and small-σ asymptotics leading to (A.7) is not at all essential for selling the EFs. Since the correlation, corr(ut,logCt), between the error and the (stochastic) regressor of (A.7) may be significant, we use k lagged values as instruments. We construct Xc from predicted value logCt from an AR(k) regression: logCt = &o + Σkj=ιlogCt-j. The k is chosen by the information criteria (AlC-type). It is well-known from the econometric literature dealing with two stage least squares that corr(ut,logCt) -» 0. This supports similar "hierarchical conditioning" arguments from the EF
232
VINOD
TABLE l a GMM estimation by RATS package and Nonlinear Instrumental Variables. (Quarterly data from 1960:01 to 1987:04) GMM estimates of β and 7 with t-statistics in parentheses: Case 1) T(Usable Observations)=110, df( Degrees of Freedom)=108, itera t i o n s ^ , NLAG (Lags used for instruments)=l, Residual Sum of Squares (RSS)= 0.0092, Durbin-Watson statistic (DW)=1.6104. /3=0.99097 (54.86), 7=1.24629 (t-value=0.4823, SE=2.58). Upper limit of confidence interval for 7 (UpL) is much larger than 2 ( U p L » 2). The χ 2 ( l ) = 2.3841 < 3.8415, the tabulated 95 % value for df=l. Case 2) T=109, df=107, iterations=3, NLAG=2, RSS= 0.0092, DW=1.5633. 0=0.99058 (72.83), 7=1.18231 (0.6077, SE=1.94). Again UpL for 7 » 2. The χ 2 ( l ) = 4.0682.
literature mentioned after (7). Here, (7) becomes: 1
(A8)
where Z = [L, logCt], where L is a column of ones, y = logRt, X = [/,, logCt], and θ={-logβ,Ί)'. Table la reports estimates for the GMM (using RATS software, rather than Tauchen's). Tables lb to Id have our EF estimates of (A.7). Both methods use the same quarterly US data from 1960:1 to 1987:4, used by many others and which comes bundled with the RATS software. We find that the EF results are quite comparable in terms of the Durbin-Watson statistics and residual sum of squares as measures of fit. Both methods correctly estimate β < 1, (βef « 0.99). We use subscripts to compare 7 across tables. Using the subscript 0 for the no instrument case in Table lb we have 70=2.26782. Similarly, when AR(3) is the instrument to obtain logCt in Table lc, 7 α r 3 = 1.62983. When AR(4) is used in Table Id, we have %rA = 1.24253. The GMM estimate (Case 1, Table la) when NLAG (the number of lags used to form instruments) equals unity is % m m = 1.24628. This is close to %r4 = 1.24253. Recall from the comments before (7) that good instruments should be (i) highly correlated with the replaced regressors, and (ii) uncorrelated with the errors. In the GMM literature, lagged values of variables are directly used as instruments. The users of GMM software often throw-in many lagged values as instrumental variables, with little attempt to check both requirements of good instruments. Tauchen (1986) notes that Hansen's two-step GMM estimator with lagged endogenous variables as instruments may not "possess moments." He recommends reporting medians and interquartile ranges.
ESTIMATION IN ECONOMETRICS
233
TABLE lb EF estimation (No instrument used) T = l l l , df=109, RSS=0.009, DW=1.635, R2=0.206. άo=intercept=O.OO225 (0.7208), β=exp{-άo)=O.99775. Since the transformation exp(-άo) is needed, only Taylor series approximate t-values for β are possible, 7=2.26782 (5.32).
TABLE lc (Instrument created using AR(3) ) T=108, df=106, RSS=0.011, DW=1.624, i?2=0.006 άo=intercept=O.OO644 (0.4605), β = exp(-άo)=O.99358, 7=1.62983 (0.8241).
TABLE Id (Instrument created using AR(4) model ) T=107, df=105, RSS=0.011, DW=1.619, i?2=0.005 άo=intercept=O.OO921 (0.729), β = ezp(-άo)=O.99O84, 7=1.24253 (0.6966).
They are not needed here, since our C-CAPM does not suffer from the infinite moment problem. In any case, the GMM strategy for choosing instruments is somewhat ad hoc. Since the GMM routinely involves overidentified models (r>p), it needs a χ2 test for overidentifying restrictions. Since the observed χ2=2.3841 for Case 1 in Table la is smaller than the tabulated 95 % value of 3.8415 for df=l, we do not reject the overidentifying restrictions of the GMM. However, Tauchen (1986, p. 412) cautions that these χ2 values are slightly biased toward accepting model specification. The Case 2 of Table la does reject the GMM model restrictions. To avoid problems with nonstationarity of variables most researchers follow the current literature. They use consumption growth Ct = ct/ct-ι and dividend growth in the definition of Rt instead of Ct and dt levels. However, then we have to pay a price: the statistical fits 2 are often poor when level variables are absent. The R values are artificially small and do not imply that the model specification is flawed. Unfortunately, the low R2 values are rarely reported in the GMM papers. Perhaps, some alternatives to the usual i?2-type criteria are needed. Tauchen and others use the "reliability" of confidence intervals, as well as, Varφgmm) for evaluating the reliability of GMM estimation. Plots of simulated sampling distributions in Tauchen seem to be new. They are generally skewed to the right and suggest nonnormality. For EF estimation it is easy to use the standard t-ratios based on Var(βef)=σ2(XrΩ,~ιX), although the nonnormality (skewness) implies that the bootstrap may be
234
VINOD
better. However, when the residuals are shuffled for a bootstrap, they lose the time subscript and fail to retain original time dependence. Hence recent literature suggests that bootstrap of time series residuals is problematic. Following Vinod (1996a) we use the recursive residuals defined in (29) to develop a proper (iid) bootstrap. Table 2 reports the bootstrap 95 % confidence intervals for β and 7. Note that our upper limit (UpL) of the interval for β does not exceed unity. By contrast, for NLAG> 2, the GMM estimate β > 1 is economically meaningless, since it implies perverse discounting (negative interest rate). The bias corrected (BC) upper and lower limits move the confidence intervals slightly to the right, consistent with the positive skewness (the mean over 999 resamples exceeds the median). Our UpL for 7 is 1.9744, while both GMM UpL's (Table la) are very large. Since Mehra and Prescott (1985) concluded that "sensible" values of 7 should be less than 2, our intervals are more "sensible" than the GMM's. Table 2: Bootstrap results based on J=999 resamples Mean 0.99069 1.2204
Median 0.99066 1.2196
Low 0.98630 0.59527
Up 0.99550 1.8924
Low-BC 0.98674 0.65285
Up-BC Coefficient 0.99633 ~β 1.9744 7
where Low and Up are the lower and upper limits of percentile confidence intervals. BC refers to (median) bias correction. Since the median < mean for both, the estimated sampling distributions are skewed to the right (similar to Tauchen's simulations). Acknowledgements I am grateful to Parantap Basu, V. Godambe, Denis Kwiatkowski, Bruce McCullough and two referees for helpful comments. I am especially grateful to Avi Singh for generously taking the time to explain some EF theory, when I presented a version in March 1996 at the Symposium on Estimating Functions at Athens, Georgia. References Andrews D. W. K. and J. C. Monahan (1992). An improved heteroscedasticity and autocorrelation consistent covariance matrix estimator. Econometrica 60 (4), 953-966. Bhapkar, V. P. (1991). Sufficiency, ancillarity and information in estimating functions: an overview, Ch. 18 in V. P. Godambe (ed.) Estimating Functions. Clarendon Press, Oxford.
ESTIMATION IN ECONOMETRICS
235
Christiano, L. J. and M. Eichenbaum (1992). Current real-business-cycle theories and aggregate labor market fluctuations. Amer. Econ. Rev. 82,430-450. Cox, D. R. (1975). Partial likelihood. Biometrika 62, 269-276. Crowder, M. (1986). On consistency and inconsistency of estimating equations. Econometric Theory 2, 305-330. Dhrymes, P.J. (1994). Topics in Advanced Econometrics. Vol. II, Springer Verlag, New York. Dunlop, D. D. (1994). Regression for longitudinal data: A bridge from least squares regression. Amer. Statistician 48, 299-303. Durbin, J. (1960). Estimation of parameters in time-series regression models. J. of the Roy. Statist. Soc. Ser. B, 22, 139-153. Godambe V. P. (1960). An optimum property of regular maximum likelihood estimation. Ann. of Math. Stat. 31, 1208-1212. Godambe, V. P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika 63, 277-284. Godambe, V. P. (1985). The foundations of finite sample estimation in stochastic processes. Biometrika 72, 419-428. Godambe, V. P. (1991). Orthogonality of estimating functions and nuisance parameters. Biometrika 78, 143-151. Godambe, V. P. and C.C. Heyde (1987). Quasilikelihood and optimal estimation. Intern. Statist. Rev. 55, 231-244. Godambe, V. P. and B. K. Kale (1991). Estimating functions: an overview, Ch. 1 in V. P. Godambe (ed.) Estimating Functions. Clarendon Press, Oxford. Godambe, V. P. and M. E. Thompson (1989). An extension of quasilikelihood estimation. J. Statist. Planning and Inf. 22, 137-152. Hall, P. and J. L. Horowitz (1996). Bootstrap critical values for tests based on generalized-method-of-moments estimators. Econometrica 64 (4), 891916. Hamilton, J. D. (1994). Time Series Analysis. Princeton, NJ.
Princeton Univ. Press,
Hansen, L. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50, 1029-1054. Hendry, D. F. (1995). Dynamic Econometrics Oxford Univ. Press, New York. Heyde, C. C. and R. Morton (1993). On constrained quasi-likelihood estimation. Biometrika 80, 755-761. Kadane, J. B. (1970). Testing overidentifying restrictions when disturbances are small. J. Amer. Statist. Assoc. 65, 182-185. Kendall, M. and A. Stuart (1979). The Advanced Theory of Statistics. MacMillan, Vol. 2, Fourth Ed. New York.
236
VINOD
Kianifard, F. and W. H. Swallow (1996). A review of the development and application of recursive residuals in linear models. J. Amer. Statist Assoc. 91, 391-400. Lele, S. (1991V Resampling using estimating functions. Ch. 22 in V. P. Godambe (ed.) Estimating Functions. Clarendon Press, Oxford. Lindsey, B. (1982). Conditional score functions: some optimality results. Biometrika 69, 503-512. Liang K. and S. L. Zeger (1995). Inference based on estimating functions in the presence of nuisance parameters. Statis. Sc. 10, 158-173. Mehra, R. and E. C. Prescott (1985). The equity premium: A puzzle. J. of Monetary Econ. 15, 145-162. Morton, R. (1981). Efficiency of estimating equations and the use of pivots. Biometrika 68, 227-233. Neyman, J. and E. L. Scott (1948). Consistent estimates based on partially consistent observations. Econometrica 16, 1-32. Ogaki, M. (1993). Generalized method of moments estimation, in G. S. Maddala, C. R. Rao, and H. D. Vinod (eds.) Handbook of Statistics: Econometrics. Vol. 11, North Holland Elsevier, New York. Chapter 17, 455-488. Parzen, M. I., L. J. Wei and Z. Ying (1994). A resampling method based on pivotal estimating functions. Biometrika 81, 341-350. Sawitzski, G. (1994). Testing numerical reliability of data analysis systems. Computa. Statist & Data Anal 18, 269-286. Singh, A. C. and R. P. Rao (1997). Optimal instrumental variable estimation for linear models with stochastic regressors using estimating functions, in Godambe V. P (ed) IMS Symposium of Estimating Functions New York. Small, C. and D.L. McLeish (1988). The Theory and Applications of Statistical Inference Functions. Lecture Notes in Statistics No. 44, Springer Verlag, New York. Sprott, D. A. (1983). Gauss Carl Priedrich, in Kotz and Johnson (eds.) Encyclopedia of Statistical Sciences. Vol 3. J. Wiley, New York. 305-308. Tauchen, G. (1987). Statistical properties of generalized method of moments. J. Bus. and Econ. Statist. 4 (4), 397-425. Thavaneswaran A. (1991). Tests based on an optimal estimate. Ch. 13 in V. P. Godambe (ed.) Estimating Functions. Clarendon Press, Oxford. Vijayan, K. (1991). Estimating functions in survey sampling: Estimation of super-population regression parameters. Ch. 17 in V. P. Godambe (ed.) Estimating Functions. Clarendon Press, Oxford. Vinod, H.D. and Ullah, A. (1981). Recent Advances in Regression Methods. Marcel Dekker, New York. Vinod, H. D. (1984). Distribution of a generalized t-ratio for biased estimators. Econ. Letters. 14, 43-52.
ESTIMATION
IN ECONOMETRICS
237
Vinod, H. D. (1989). Resampling fuzzy data and Latin Squares: Application to regression. 1989 Proc. Bus. & Econ. Sec. Amer. Statist. Assoc. Washington, D.C. 304-309. Vinod, H.D. (1993). Bootstrap methods: Applications in econometrics, in G. S. Maddala, C. R. Rao, and H. D. Vinod (eds.) Handbook of Statistics: Econometrics. Vol. 11, North Holland, Elsevier, New York. Chap. 23, 629-661. Vinod, H.D. (1995). Double bootstrap for shrinkage estimators. J. of Econometrics 68, 287-302. Vinod, H.D. (1996a). Comments on "Bootstrapping time series Models". Econometric Reviews 15 (2), 183-190. Vinod, H.D. (1996b). Foundations of Asymptotic Inference Using Modern Computers. Economics Dept., Fordham University, New York, discussion paper, Sept. 1996. Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models and the Gaussian method. Biometrika 61, 439-447. White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 48, 817-838.
239
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
ESTIMATING FUNCTIONS AND OVER-IDENTIFIED MODELS Tony Wirjanto University of Waterloo ABSTRACT Economic theory (particularly with optimizing economic agents) usually imposes a set of moment restrictions on economic data. These restrictions are known as orthogonality conditions, which correspond to a set of unbiased estimating functions with the dimension of the estimating functions often larger than the dimension of the parameters of interest. This paper provides a selected review on the efficient methods of estimating such overidentified models, using the approach of estimating functions (see Godambe, 1960, 1976; Godambe and Heyde, 1987, and Godambe and Thompson, 1974, 1989), as an organizing principle. The discussion in this paper takes place in a random sampling framework and draws heavily from Qin and Lawless (1994), who use the estimating-functions approach to combine the estimating functions in an over-identified model optimally.
1
Introduction.
Let z be a p-dimensional vector with z G Z where Z is a compact subset of W. Let θ be a k-dimensional vector with θ G Q and define the vector-valued estimating function g as g : Z x Q -> W, such that it is unbiased E\g(z,θ)] = 0
(1)
for a unique element θo of Q. The function g(z, θ) is assumed to be twice continuously differentiable with respect to θ. Equation (1) with dim[g] > dim[0] often arises from economic theory with optimizing behavior on the part of economic agents. The parameter vector 0o is assumed to satisfy E[g(z,θo)] = 0, where g(z,θo) is a given vector-valued function of moment conditions implied by economic theory economic examples of this function in time-series context can be found in Hansen and Singleton (1982), Wirjanto (1995; 1996a, 1997), Amano and Wirjanto (1996a,b; 1997a,b,c) etc. Consequently this paper focuses on the
240
WIRJANTO
over-identified case, where dim[g] > dim[0]. Given an identically and independently distributed (i.i.d.) sequence { J ^ } ^ , this paper is interested in estimating the parameter vector θ.
2
Two-step Estimator
One popular approach to combining the estimating functions in econometrics is briefly mentioned in Qin and Lawless (1994, page 315). This approach considers the optimal (in the sense of minimum asymptotic covariance matrix) linear combination of the r estimating functions. This leads one to estimate θ as the solution to the estimating equations (2)
)= 0 t=ι
where M = E[dg(z,θo)/dθτ] has full rank. More generally, an estimate of θ solves the minimization program
= MIN, hr<^,0)j V |χ>(^0)j
(3)
over Q, for some positive semi-definite (rxr) symmetric weighting matrix V. Under standard regularity conditions, the minimand of Q(θ) is a consistent estimator of ΘQ. However it is not efficient in the over-identified case. An efficient estimator can be obtained in this case by minimizing Q(θ) for V = J - 1 , where J = E[g{z,θo)g{z,θo)τ] has full rank (See Hansen, 1982; McCullagh and Nelder, 1989; White, 1984). In practice the inverse of the weighting matrix, J, is unknown and needs to be estimated from the data. A two-step estimation strategy can be used to implement this procedure. In the first step, an initial consistent estimate 0° is obtained by minimizing Q(θ) for an arbitrary choice of J such as the r-dimensional identity matrix Ir. The optimal weighting matrix is then
estimated as J " 1 = [Σ?=i9(θ°) Σΐ=ι9(θ°)T/n}-\
where g(θ°) = g(zuθ°).
In the second step, an efficient estimator θ is obtained from the estimating equations
ΣMτJ-1g(zt,θ)=0
(4)
t=l
More generally, it is obtained by solving the second-stage minimization program
MΐNθQ(θ) = MINe ix>(z t ,0)| J - 1 IΣ9(*t,θ)]
(5)
241
OVER-IDENTIFIED MODELS
For an obvious reason, the resultant estimator is referred to as a two-step estimator (TSE). This two-step estimator is discussed in the time-series context in Hansen (1982), White (1984), etc. If the model is correctly specified, and there exists a unique value ΘQ such that the estimating function is unbiased, E[g(z,θo)] = 0, then ^(0 T S E -0o)Λi\Γ(O,Λ)
(6)
where Λ = (MΊ J~ιM)~ι. The normalized objective function, evaluated at the estimated parameters, converges in distribution to a chi-squared random variable with r — k degrees of freedom. One potential drawback of the two-step estimator discussed above is that in the over-identified case, i.e. when dim[g] > dim[0], the choice of the weighting matrix will affect efficiency considerations in an important way. More specifically in finite samples or in the case of misspecified model, the way in which the optimal weighting matrix is estimated will affect the efficiency of the estimator
3
Maximum Empirical Likelihood Estimator
An alternative to the two-step estimator is the one-step estimator based on solving a set of estimating equations for θ and 7, the r-dimensional normalized Lagrange multiplier. Let Φ = (θ,j)τ. Then the solution Φ M E L E ιs obtained by solving a set of estimating equations ,Φ) = 0
(7)
where l(zu )Φ) Γ ,/ 2 (^t, Ψ) T ] T , with the following estimating functions
This is shown by Qin and Lawless (1994) to be equivalent to solving the following constrained maximization program: n
MAXp.θΣn-^lnipt) - in^" 1 )] t=i
subject to the restrictions
(8)
242
WIRJANTO
which yields the maximum empirical likelihood estimator (MELE). Under regularity conditions, %[ELE i s asymptotically efficient for 0O, i e.
The MELE procedure requires that one solves a system of estimating equations in k+r unknown parameters in one step compared to the procedure for the TSE which requires two steps. However some of the estimating equations in this method may potentially be unstable since the matrix of expected derivatives will not have full rank at the limiting values of the parameters, i.e. at (0,7) = (0o,O), the (k + r) x (kxr) dimensional matrix of derivatives E[dl(z, 0o O)/dΨτ] will have rank r. In practice this may not pose a serious computational problem in specific applications since the empirical likelihood in terms of (0,7) gives (0,7) as a saddle point.
4
Alternative Characterization of the MELE
An alternative characterization of the MELE of Qin and Lawless (1994) is suggested below. It is based on solving a set of generalized estimating equations that takes account of the over-identifying restrictions on the distribution explicitly. In particular for a discrete parameterization of z with known support, this estimator (if it exists) is shown to belong to the MELE. In the theory of estimating functions (see e.g. Godambe and Heyde (1987) and Godambe and Thompson (1989)) an estimating function g*(z, 0) G Q, Q = Z x Q, is optimal in Q if the estimator 0 from g*(z, 0) = 0 has minimum asymptotic covariance matrix. To begin the analysis / partition the vector-valued estimating function g{z(θ) as
where g\ is a k dimensional vector and 52 is a (r - k) dimensional vector. Similarly / partition the M and J matrices conformably to the estimating function g as
I" u
J
\dg2(z,θo)/dθΊ
L
and 9i(z,θo)gi(z,θo)τ ^
9i(z,θo)g2(z,θo)τ 92(z,θo)g2(z,θo)T
243
OVER-IDENTIFIED MODELS
It is assumed that g\ can be used to estimate θ consistently and t2 represents the over-identifying restrictions on the model. The latter is equivalent to 1 assuming that the submatrix M^ exists. It should be pointed out that τ the partition of the g vector-valued function into (gj,g2) will affect the interpretation given to 7 only. However it will have no effect on the estimator of0. τ τ Let 7 be a (r —fc)-dimensionalvector. Let 77 = (η[,ηζ) , η\ = {θ,η) τ and 772 = (μι,μ2) . Define μι .= vec(Mi) for i = 1,2. Then an estimator 77 can be obtained as a solution to the following system of r(k + 1) generalized estimating equaitons
(10)
Σ,h(zt,η)=0
where h(zt,η) = [/ii(zt,η)τ, ...,/i4(zt,77)τ]τ, with the following estimating functions 1 4- ΊT92(Z,Θ) — 7 T M2M 1 1gi{z,ι 1
2-vec{dg2/dθτ)}
-1τM2MΓ19i(z,θγ
The resultant estimator rj defined above is referred to (for the sake of clarity) as a generalized estimating equations estimator (GEEE). Under regularity conditions, the GEEE ήι has the following limiting properties: (i) 771 4 7710, where ηλ0 = (0O,O)T; and (ii) y/n(ήι - 7710) 4 (0, Λi), where Ai =
Λ n = {MτJ-ιM)~ι
n Q
A
I
and Λ22 = [M2(M?)-ιJn(MΪ)-ιMξ
- 2J1T2(M1T)"1
M2 + J22]"1' The proof is given in Appendix A. The above limiting result suggests that solving the generalized estimating equations system in (10) yields an asymptotically efficient estimator, i.e. y/n(θ — ^TSE) = °p(l) However unlike the two-step estimator, this estimator is based on solving a set of estimating equations in one step. Therefore, it allows one to estimate the limiting covariance matrix of the estimator, even when the model is misspecified. Under misspecification where there is not a value for θ such that E\g(z,θ)] = 0, the limiting distribution of the estimator can still be obtained as a solution to the estimating equations ΣJLx h{zuή) = 0 as long as the observations are i.i.d. Let 77 be the unique
244
WIRJANTO
solution to the unbiased estimating functions E[h(η)] = 0. Under misspecification, the solution to 7 in general will not be 7 = 0, even though it exists. The limiting distribution of ή in the case of misspecification of the estimatτ ι ι ing function is given by y/n(ή — ή)-ϊ (0, Λ*), where Λ* = (M* J*~ M*)~ τ τ M* = E[dh{z,η)/dη l and J* = E[h(z,η)h(z,η) ]. The component of the limiting covariance matrix of 77, which corresponds to the variance y/n(θ — 0), is not equal to Λn if there does not exist a value θo at which all r estimating functions (
V9 = lpe V\3Θ V Jg(z,θ)F(dz,p) = θ |
(11)
Once the maximum-likelihood estimate (MLE) p is obtained, the MLE for θ is given by θ defined by /g(z,θ)F(dz,p) = 0. In an over-identified case, it is important to consider the restrictions implied by the difference between Vg and V. Hence consider a sequence of random variables with a known support {z\, z<ι,..., zm} and pj = Pr(z = Zj) for j = 1,2, ...,ra. Then the probability density function is
3=1
245
OVER-IDENTIFIED MODELS m
for p e V = {p e TZ \Pj > 0,Σf=iPj = l,BΘVfg{z,θ)F(dz,p) = θ} where φj(z) = I Ίϊ z = Zj, and = 0 if z φ Zj. Since there exists a value of θ V /g(z,θ)F(dz,p) = 0, p must be restricted to Vg defined in (11). Next / partition g into a k dimensional vector g\ and a,(r—k) dimensional vector g2, such that V\ = {p e V\3Θ such that f gi{z,θ)F(dz,p) = 0} is equal to V and Vp E Pi = P. Then the solution for 0 to the estimating equations /gι(z,θ)F(dz,p) = 0 is unique. Given these assumptions, define θ as a function of p implicitly as m
/
9ι(z,θ(p))F(dz,p) = Σ f t Ji(^j,fl(p)) = 0
(13)
i=i
There are two useful properties that can be obtained by differentiating (13) with respect to pτ m
9i(zτ,θ(p)) + ^2pj(dg1(zj,θ(p))/dθ)(dθ(p)/dPτ)
= 0.
(14)
3=1
These properties are given by m
dθ(p)/dPτ = -[ΣPj(d9i(zj,θ(p))/dθτ)}-1g1(zτ,θ(p))
(15)
and m
(g1(j,(p))/)}
j
i=i
3=1
ΣPMT,(P))
=0
J=I
(16) respectively. Consider the following problem
The assumptions suggest that the maximization can be performed on the set given by
Vg = lp e nm\pj > O,JTPj = 1, J g2(z,θ)F(dz,p) = 0 } i.e. the maximization program can be written as n
m
max > > ΦΛzΛlnypi) t=lj=l
m
s.t. ϋ 7 > U; > pΊ- = 1; and j=l
m
(18)
246
WIRJANTO
in the framework of empirical likelihood of Qin and Lawless (1994). In the above maximization program I have (r — k + 1) nonlinear restrictions, which contain an implicit function that can be solved using a numerical approximation method. However the solution to (19), if exists, will be identical to the solution to the generalized estimating equations in (10), which only requires a solution to a system of r(k + 1) exactly-identified equations. Let t\ be the Lagrange multiplier associated with the restriction Y^iPj, and #2 be the vector of Lagrange multipliers associated with the restrictions g2(zj,θ(p)) = 0 respectively. Then the first-order conditions for the maximization problem in (19) are [a]
Y\φj(zt)lpj\ - h - tξg2(zj, θ(p))
3) = O i = 1,2, ...,m; τ=l m
Multiplying the left-hand side of [a] above by pj and summing over j 1,2,..., m shows that the Lagrange multiplier ίi for the restriction Σφ=ι Pj 1 is n. Thus the solution for (p,ti,t2) is characterized by [a] h = n; n
Pj = t=l
τ=l
m τ=l
7 Given p and ί2, the MELE for θ = θ(p), which is 6» = 6>(p), can be obtained as a solution to the system of equations m
O
(20)
Below I show that the MELE for θ given by (20) is equivalent to the GEEE obtained from (10). To this end, let Mx = Σf=ιpjdgι(zj,θ{p))/dθτ,
247
OVER-IDENTIFIED MODELS and let M2 = Σ?=iPjd92{zj,θ(β))/dθτ. for η is characterized by
Let 7 = t 2 /*i
Then the solution
[c] ft- = n- 1 ΣΓ=i *i( Lastly, substituting the expression for pj, the solution for the estimator ή can be obtained from the system of generalized estimating equations m
0
(21)
Below I show that the MELE or θ given by (20) is equivalent to the GEEE obtained from (10). To this end, let Mx = Σf=1Pjdgι(zj,θ(p))/dθτ, d τ and let M 2 = ΣT=iPj 92(zj,θ(β))/dθ . Let 7 = t2/h- Then the solution for η is characterized by [b]ΣT=iPj92(zj,θ(β))=0; [c] pi = n- 1 Σt=ι Φj(zt)/[1 + iT92(zj,θ(p)) ητM2M^gι{Zj,θ(p))} Lastly, substituting the expression for pj, the solution for the estimator ή can be obtained from the system of generalized estimating equations (22)
)=0 t=ι
which is equation (10). The discussion so far has suggested a link between the GEEE and an estimator of the distribution function. To see this, consider z having a discrete distribution, so that the distribution function can be estimated by (23) which, after substitution of the expression for pj, yields ι ι *(z)==nnΣl{z < z)[l+τ Ίτ{2{zuθ) - M2Mf V M ) } ] " 1 F*(z) Σl{z t t < z)[l+Ί {g
t=ι
(24)
can be stated as follows. The probability qn = pr(z G B) is estimated by q= f I(zeB)F*(dz);
as n -> oo
(25)
The limiting properties of this estimator can be obtained by noticing that to the unbiased elementary estimating functions E[g(z,θ)], I can augment τ τ the estimating functions go(z,q) = lq — I(z E B). Let ψ = (q^θ ) , and let
248
WIRJANTO
Then I partition the estimating functions l(z,φ) as
τ τ
where ji(z,φ) = (go(z,q),gι{z,θ) ) and J2{z,φ) = g2{z,θ). The new estimating functions are therefore given by with ^ — hn(zuψ) h2(zuφ)
h
-\π-I(z€B]\: and
= hι{zt,η)', = h2(zt,η)\
= hs(zuη); =
hA(zuη).
It folows that the 77-component of the solution to the generalized estimating equations ) =0
(26)
t=l
where h(zt,φ) = [hιo{zt,φ)τ,hn(zt,φ)τ,...,h4(zt,φ)τ]τ, is identical to the estimator ή obtained from solving the generalized estimating equations in (10). This implies that the g-component of the solution to the above generalized estimating equations is given by (24), rewritten as q=
=
l
- .
Ί
n " 1 Y I{z G B)]
(27)
The limiting properties of this estimator for the distribution function are: [i] q 4 go; and [ii]Vfi(9)4(0,,), where ίlq = q{\ - q) - gl[J~ι - J~ιM(MτJ~ιM)-ιMτJ-χ]gB, and gB = E[g{z,θn)I{z e B)] = qoE[g{z,θo)\z G β]. The estimate q is fully efficient in the sense that its variance attains the semi-parametric efficiency bound. The proof is given in Appendix B.
5
The Minimum Discriminant Information Adjusted Estimator
The GEEE method in Section 4 solves a system of estimating equations of dimension rx(k+l) much larger than (k+r) x (k+r) as in the MELE procedure, even though the GEEE procedure has the advantage that E[dh(z;θo,O,μιo,
OVER-IDENTIFIED MODELS J
249
τ
l >2θ)/dη ] has full rank, whereas the MELE procedure has the disadvantage τ that E[dl(z\ 0Q, 0)/5Φ ] does not have full rank. In practice this dimensional issue may or may not be a serious handicap to applied workers, depending on specific applications. However given the potentially computational burden involved in the GEEE procedure it is worthwhile looking at alternative procedures. One such procedure is Haberman's (1984) procedure mentioned in Qin and Lawless (1994, example 3, page 314). Instead of maximizing the empirical likelihood as in (7), this estimator is obtained by minimizing the Kullback-Leibler divergence from the estimated distribution to the empirical distribution. Thus the resultant estimator is referred to as the minimum discriminant information adjusted estimator (MDIAE). The MDIAE of Φ can be obtained by solving a set of estimating equations for 0 and 7, the r-dimensional normalized tilting parameter in (28) ί=l
subject to the restrictions
Thus the solution Φ# is obtained by solving a set of estimating equations
£>(*,*) =0
(29)
ί=l
where ω(^,Φ) = [ωi(zt,Φ)τ,α;2(^t,Φ)'Γ]"Γ, with following estimating functions ωλ{zuΦ) = Ίτ{dg{z,θ)/dθτ)exp[7T<7(*,0)]; ω2(zuΦ) = g(z,θ)exp[7T ιs g(z, 0)]. Under regularity conditions, 0MDIAE asymptotically efficient for 0o, i.e. \/^(%DIAE " # T S E ) = M 1 ) There is a number of attractive features of the MDIAE. First, in the MDIAE procedure, the discrepancy between the estimated probabilities pt and the empirical frequency n~ι is weighted using the efficient estimate of these probabilities pf, in contrast in the MELE procedure, the discrepancy is weighted using an inefficient estimate of these probabilities n " 1 . Second, built on Huber (1980), the influence functions of the estimators defined by estimating equations for the MELE and MDIAE are given by E[dl(z,V)/dVτ]-ιl(z,V) and E[dω(z,Φ)/dVτ]-ιω(z,V) respectively. At the limiting values (0,7) = (0o?O) the influence functions for the MELE and MDIAE are identical. However the influence function for MELE, in contrast to the MDIAE, can be unbounded at 7 = e where e 3 0, even if the estimating functions themselves g(z,θ) are bounded. Third, the MDIAE procedure potentially is more attractive in terms of computation than the MELE and
250
WIRJANTO
GEEE procedures. This is because the estimated probabilities in the MDIAE procedure is given by pt = exp[yτg(z, θ)]/ Σΐ=i eMlTΦ> θ)] Replacing pt in (27) with this and re-arranging terms result in the constrained maximization program
t=i
subject to d{ln(Σexp[Ίτg(zt,θ)))
(30)
- ln(n)}/dΊ = 0
t=i
which is computationally easier to solve. It is clear that the estimating equations in (28) amounts to choosing (0, 7) with the first derivatives of the object function in (29) with respect to θ and with respect to 7 set equal to zero. Lastly it also is possible to obtain an estimator that has identical firstorder limiting properties as the GEEE, using the principle of minimizing the Kullback-Leibler divergence measure in Haberman (1984). In terms of computation however, it is potentially inferior to the MDIAE procedure applied to the formulation proposed by Qin and Lawless (1994). For this reason, it is not discussed in this paper.
6
Concluding Remarks
This paper has provided a selected survey on the efficient estimation of overidentified models common in Economic models with optimizing agents, using the theory of estimating functions as an organizing principle. It was argued that the two-step estimator is sensitive to the way in which the optimal weighting matrix is estimated from the data. The MELE of Qin and Lawless (1994) solves this problem by estimating a set of estimating equations in one step. However some of the estimating equation in the MELE procedure may potentially be unstable because the matrix of their expected derivatives does not have full rank at the limiting values. Although in practice this may not pose a computation burden in specific applications, an alternative characterization of the MELE was suggested in this paper. It is based on a set of generalized estimating equations, which does not share the shortcoming of the MELE mentioned above. This set of generalized estimating equations incorporates information provided by the over-identifying restrictions on the distribution function explicitly, resulting in a just-identified set of estimating equations with an exact solution. Unfortunately the computation of the result GEEE appears to be much less attractive than that of the MELE since the estimating equations in GEEE method has the dimension
OVER-IDENTIFIED MODELS
251
much larger than the dimension of the estimating equation in the MELE method. In practice this may or may not pose a computational problem in particular applications since it is possible to use this estimator to compute a saddle-point approximation to the finite-sample distribution. Next an alternative estimator to the MELE based on Haberman's (1984) minimization of Kullback-Leibler divergence measure is discussed. It has a number of appealing theoretical features in terms of finite-sample efficiency and robustness and a priori it has a more tractable computational requirement. It is worthwhile to make two final remarks: [i] the estimating functions approach is a natural framework to use in over-identified models. One of its major advantage over other known approaches, which has not been emphasized much in this literature, is that in this approach it is possible to treat over-identified models as just-identified models essentially be re-defining the estimating functions appropriately; [ii] another asymptotically equivalent estimator to the ones discussed in this paper can be obtained by combining the set of elementary estimating functions gχ(z,θ), which is unbiased, i.e. E[gι(z,θ)] = 0 with the set of auxiliary estimating functions £2(^,0) which represents the over-identifying restrictions. This set of estimating functions is biased with the biased term given by a r-dimensional auxiliary parameter vector λ, i.e. E\g2(z,θ)] = λ. However λ will be a zero vector identically, so that the set of elementary estimating functions will be unbiased, when the over-identifying restrictions are satisfied by the data. Under regularity conditions the resultant estimators will be consistent (i.e. λ will be consistent for a zero vector), asymptotically normal and asymptotically efficient. However the efficiency is obtained from a joint estimation of θ and λ, without using the prior information that λ = 0. This prior information can be used simply by projecting the estimator of θ orthogonally on to the tangent space in which λ = 0. This projection, which will be done approximately at the appropriate tangent space, corresponds to the one-step estimators discussed in this paper. Recently Wirjanto (1996b) has applied this approach to study a collection of generalized linear models in McCullagh and Nelder (1989) with contemporaneous correlations across the regression error terms. The resulting one-step estimator is shown to be not only efficient but also robust to nonconstant variances in each model's error terms as well as nonconstant correlations across the equation error terms.
APPENDIX A Since E[h(ηo)] = £7[/ι(0o,O,μio,μ2θ)T] = 0, there exists a consistent root of the generalized estimating equation
ί=l
252
WIRJANTO
Regularity conditions ensure that
where Λ(£> Γ Σ- 1 £>)- 1 , D = E[dh(z,ηo)/dητ], and Σ = E[h(z,ηo)h(z,ηo)τ]. Next let me partition the matrices D and Σ conformably to η = [ηi;η2]τ, where ηι = (0,7) T , and η2 = (μi,μ 2 ) τ , as
j_ D2χ
£>22
Σ =
Γ Σ Π Σ12
'
Σ21
Σ22
where I 0 u
^
"
λ
M 2 -J22 + π2(Mΐ)- MΪ
0
I' ^ ~ I 0 0
and [ /2211 0 [ W
1
i2222 J
\ Jn
J12
[ ^12
^22
Here, I2211 2tnd /2222 are identity matrices of dimensions (k(k — l))/2 and ((r — k)(r — k — l))/2 respectively. The limiting covariance matrix of ή\ is thus given by
Since I can write
[Mi -J12 + Jii(M 1 T )- 1 M 2 T 1 _ J Γ (MTΓ' [M 2 -J 2 2 + JUMjr'M?
\ ~ [ -Ir_k
the above expression for the matrix Λi simplifies to ι
[
0
J)-χ
0
[M
which yields the intended result. APPENDIX B The minimum bound for the parameter vector Φ is given by Ω(Φ) =
(AτB-χA)-1
where A = E[dj(z, Φ O )/9Φ T ] and B = E\j(z, Ψ 0 )j(*, Φo)Γ]
253
OVER-IDENTIFIED MODELS
The minimum bound for the parameter q can be obtained by noting that A=
1
o ] . o - i = \ q(ί-q)
0
where gB = E[g{z,θn)I(z
'
l
-9B 1
-9B
J
G B)] = qnE[g(z,θn)\z
E B\;
M
ι
\
1
B22 = J- 1 + J-ιgB[q{lq) Therefore I have
\
=
B21
B 22
ΓΩ11
Ω12
[ Ω 21 where
Ω12 = -[ςr(lςf) & % 22 Γ ι Ω = M J-!Af + MJ- gB[q{lq) Since the new estimating function does not affect the variance of the estimator 0, I have Ω22 = (MTJ'1M)'1 regardless of what the set B is. The element of Ω(Φ) corresponding to the parameter q is simply
Therefore, I obtain the result that var(q) = q(l - q)
If I have extra information in the form of unbiased estimating function with θ known, then ι υar(q) = q{\ - q) - gβJ~ gB It remains to show that the variance of the estimator q in (34) attains this minimum bound. Following the steps in section 2 for the new estimating function j(z, Φ) results in a one-step estimator. Using the result in Appendix A, it follows that its variance will be identical to the variance of the twostep estimator. Therefore, it attains the minimum bound and the resultant one-step estimator is q. The key to this result is that
1
= ί ° L2
=
254
WIRJANTO
so that L2LΊ1jι
= M2M{1g1
Therefore I have the results that
which can be rewritten as u (
q-I{ze B] 1 + Ί 92(z,θ) ψM2Mϊι9ι{z,qγ
\
τ
Similarly, it can be shown that hn(zt,η,q)
=
Λ
,
,
τ
J
{z,θ)-vec(d9ι/dθτ)
gι
l+Ίτg2{z,θ)-ΊτM2Mϊιgι(z,θ) g2(z,θ)-vec(dg2/dθτ)
Since Λn{zuη,q) = hι(zuη), h2(zuη,q) = h2(zuη), h{zuη,q) = h3(zuη), and h^zt^η^q) = h$(zt^), the intended results follow. Acknowledgements. Support from the Social Science and Humanities Council of Canada is acknowledged. The first version of this paper was prepared for the Institute of Mathematical Statistic (IMS) Special Topics Meeting Symposium on Estimating Functions, held at the University of Georgia, Athens Georgia, U.S.A. March'96. I am indebted to two anonymous referees and the editor, whose probing questions have substantially improved both the presentation and the content of the paper. The usual disclaimer applies.
REFERENCES
AMANO, R. A. and WIRJANTO, T. S. (1996a). Money stock targetting and money supply: a closer examination of the data. Journal of Applied Econometrics, 11 93-104.
OVER-IDENTIFIED MODELS
255
AMANO, R. A. and WIRJANTO, T. S. (1996b). Intertemporal substitution, imports and the permanent income model. Journal of International Economics, 40 439-457. AMANO, R. A. and WIRJANTO, T. S. (1997a). intratemporal substitution and government spending. The Review of Economics and Statistics (in press). AMANO, R. A. and WIRJANTO, T. S. (1997b). Adjustment costs and import demand behaviour: evidence from Canada and the United States. Journal of International Money and Finance (in press). AMANO, R. A. and WIRJANTO, T. S. (1997c). An empirical study of dynamic labour demand with integraded forcing processes. Journal of Macroeconomics (in press). GODAMBE, V. P. (1960). An optimum property of regular maximum lieklihood estimation. Ann. Math. Statist. 31 1208-1212. GODAMBE, V. P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika, 63 277-284. GODAMBE, V. P. and HEYDE, C. C. (1987). Quasi-likelihood and optimal estimation. Int. tatist. Rev. 55 231-244. GODAMBE, V. P. and THOMPSON, M. E. (1974). Estimating equations in the presence of nuisance parameter. Ann. Statist. 2 568-571. GODAMBE, V. P. and THOMPSON, M. E. (1989). An extension of quasilikelihood (with discussion). J. Statist. Plan. Infer. 22 137-172. HABERMAN, S. J. (1984). Adjustment by minimum discriminant information. Ann. Statist. 12 971-988. HANSEN, L. P. (1982). Large sample properties of generalized method of moment estimators. Econometrica 50 1029-1054. HANSEN, L. P. and SINGLETON, K. J. (1982). Generalized instrumental variables estimation of non-linear rational expectations model. Econometrica 50 1269-1286. HUBER, P. J. (1980). Robust Statistics. Wiley, New York. McCULLAGH, P. and NELDER, J. (1989). Generalized Linear Models, Second ed. Chapman and Hall, London. QIN, J. and LAWLESS, J. (1994). Empirical likelihood and general estimating equations. Ann. Statist. 22 300-325. WHITE, H. (1984). Asymptotic Theory for Econometricians. Academic Press, Orlando. WIRJANTO, T. S. (1995). Aggregate consumption behaviour and liquidity constraints: the Canadian evidence. Canadian Journal of Economics, 28 1135-1152. WIRJANTO, T. S. (1996a). An empirical investigation into the permanent income hypothesis: further evidence from the Canadian data. Applied Economics 28 1451-1461.
256
WIRJANTO
WIRJANTO, T. S. (1996b). On the efficiency of the seemingly unrelated regressions, unpublished manuscript, University of Waterloo, Waterloo, Ontario, Canada. WIRJANTO, T. S. (1997). Aggregate consumption behaviour with time nonseparable preferences and liquidity constraints. Applied Financial Economics 7 107-104.
259
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
ON THE PREDICTION FOR SOME NONLINEAR TIME SERIES MODELS USING ESTIMATING FUNCTIONS by Bovas Abraham University of Waterloo, Canada A. Thavaneswaran University of Manitoba, Canada and Shelton Peiris University of Sydney, Australia Abstract Godambe's (1960, 1985) theorems on optimal estimating equations are applied to some non-linear, non-Gaussian time series prediction problems. (Examples are considered from the usual class of time series models.) Recently many researchers in applied time series analysis attracted the information and valid analysis provided by the estimating equation approach. Therefore this article places an interest of estimating equation (EE) prediction theory and building a link between it and the well-known minimum mean square error (MMSE) prediction methodology. Superiority of this EE prediction method over the MMSE is investigated. In particular a random coefficient autoregressive model is discussed in some detail using these EE and MMSE theories. Keywords: Non-Gaussian models, non-linear time series; optimal estimation; optimal prediction, random coefficient autoregressive, minimum mean square.
260
1
ABRAHAM, THAVANESWARAN AND PEIRIS
Introduction
There are many examples of random vibrations in the real world. For example a ship rolling at a sea, car vibration on the road, brain-wave records in neurophysiology, and so on. Recently, there has been a growing interest in modelling these events as non-linear time series. See for instance Tj0stheim (1986), Abraham and Thavaneswaran (1991). In order to use non-linear time series models in practice, one must be able to fit models to data, estimate the respective parameters and obtain valid predictors. Computational procedures for determining parameters for various model classes, together with the theoretical properties of the resulting estimates are outlined in Tj0stheim (1986), Thavaneswaran and Abraham (1988). The theory of generalized estimation equation (GEE) was originally proposed by Godambe (1960) for identically distributed independent observations and recently extended to discrete time stochastic processes (see Godambe (1985)). The particular statistical relevance and lucidity of the GEE prediction method for statistical models under present study should be appreciated against the fundamental difficulties encounted in (a) likelihood prediction when the variance of the observation error depends on the parameter of interest (Godambe, 1985, §3.2) and (b) when the variance of the observation error becomes infinity and MMSE method does not apply. A special class of nonlinear models called Random Coefficient Autoregressive (RCA) models play an important role in the modern era of time series analysis. The class of RCA models are defined by allowing random additive perturbations of the autoregressive (AR) coefficients of ordinary AR models. That is we assume that the process {Zt} is given by p
i^et,
(1.1)
where φi, i = 2, ,p, are the parameters assumed to be known, {et} and {bi(t)} are zero mean square integrable independent processes and the variances are denoted by σ\ and σ\\ bi(t)(i = 1,2, ,p) are independent
of {et}
and
{Zt-i}.
{bi(t)} may be thought of as incorporating environmental stochasticity. For example, weather conditions might make {bi{t)} random variables having binomial distribution. For RCA models, the superiority of optimal estimate had been demonstrated in Thavaneswaran and Abraham (1988) and the superiority of the
PREDICTION FOR NONLINEAR TIME SERIES
261
interpolation had been given in Abraham and Thavaneswaran (1991). Godamble (1994) had briefly looked at the prediction problem in the Bayesian context assuming the future values as random. Naik-Nimbalkar and Rajarshi (1995) have used the estimating function method to study the smoothing and filtering problem in the Bayesian point of view. In this paper, we shall attempt to develop a more systematic approach and discuss a general framework for finite sample non-linear time series prediction. Our approach yields the most recent forecasting results as special cases and, in fact, we are able to improve the efficiency of the predicting equations. This approach of using estimating function ideas to study the prediction problem is very similar to the one used to study the smoothing problem as in Thavaneswaran and Peiris (1996). In section 2, we present a theorem on optimal forecasting for discrete time stochastic processes based on estimating functions with applications in non-linear, non-Gaussian time series models. Section 3 deals with quasilikelihood non-linear estimating functions and compares the efficiency for MMSE and optimal predictors.
2
A theorem on optimal prediction
Let {Zt : t G /} be a discrete-time stochastic process taking values in 11 and defined on a probability space (Ω, A, F). The index set / is the set of all positive integers. We assume that the observations (Zi, Z2, , Zn) 1 are available and that Zt(l) — E[Zt+i\Ff] is a function of JF /, the σfield generated by Zi, , Zt. Then the following theorem gives the form of the optimal one step ahead forecast of Zn+\ based on observed values Zi, , Zn. Let Z n +i - Z n (l) = α n +i, where {au αt, α n +i} is an iid sequence with probability density function /( ). Assume that /( ) is known and that E [—gj-y /(α)l < 00 and /f^ f(a) da is twice differentiate under the integral sign. Let G be the class of unbiased estimating functions g{an+1) such that E g{an+ι) = 0 . Consider the following theorem. Theorem 2.1 : In the class G, the optimal predictor of α n +i, which . . . Var g(an+ι) . . inimizes g is given by (i) ^ O ( / ) J
where ma{f)
is the mode of / ,
(ii) the optimal predictor of Zn+ι is given by Zn P '(1) = Zn(ΐ) + m α (/),
262
ABRAHAM, THAVANESWARAN AND PEIRIS
(iii) the efficiency of the optimal estimating function, g°, Eff(g°) for α n +i is
Proof: Parts (i) and (ii) of the theorem follow by observing that g° = -^log f(a) is an unbiased estimating function in the class G and using the CauchySchwarz inequality for unbiased estimating functions as in Godambe (1960). 2 It is easy to show that E[g° ] = E(—gj) and hence part (iii) follows.
Note: If
g = identify,
^—-— = σ\.
Thus the minimum mean
square error forecast is a special case of the theorem.
If g = — - » " " * , then
y^5^1)
Hence the maximum likelihood predictor is also a special case of the theorem. It is of interest to note that when the distribution of {at} is stable such as Cauchy, the MMSE predictor cannot be defined but the MLE could be defined and it has finite information. Example 2.1: Consider a stationary time series having moving average (linear filter) representation Zt = Φ(β)αt = at + Φiαt-i + *2«t-2 + •
(2.2)
where {α^}'s are independent mean zero with probability density function /. Existence in mean square requires that {at} have finite variance σ\ and E £ i * i < o o . Let (Zi, , Zn) be n observations from a series {Zt\. Then the optimal one-step ahead forecast is given by MMSE forecast, Z n (l), plus the mode of / i.e. Z £ p t (l) = Z n (l) + m β ( / ) = E\Zn+ι\F^\ +ma(f). For an AR(1) model of the form Zt = φZt-ι + at with \φ\ < 1, the optimal forecast of Zn+\ based on observed values Z\, ••• ,Zn is Zn (1) = 2" n (l)+m α (/), where m α (/) is the mode of the probability density of {at} If o t 's are i.i.d. MMSE forecast.
N(0,σ*)
then m β (/) = 0 and Zn P t "(l) = Zn(l) =
263
PREDICTION FOR NONLINEAR TIME SERIES
Now suppose that / corresponds to a double exponential distribution, with the density f(x) = |e~' χ l, —oo < x < oo. In this case, with =
dlog/(o)^ da
o n e h a g
Eff
= γ a r
,— a ^ - , I = i da J
L
(2
3)
On the other hand, r+oo
z oo
x2 e"lχl dz = /
σ^ = 1/2 /
x2e~xdx = 2.
That is Eff
(SMMSE(<*)) = 2. Thus, Eff.(g^j^jgg) is twice as large as Eff.(g°) and, therefore, the MMSE forecast Zn(l) of Z n +i in (2.1) entails about 50% loss of efficiency in this case. Example 2.2: Consider an ARCH (autoregressive conditionally heterocedastic) model of the form
where {at} is an iid sequence having pdf f(a) and variance σ\. It can be easily shown that the optimal predictor of Zn+\ based on observed values Zi, , Zn is given by
Similarly the two steps ahead forecast is
and the ^-steps ahead forecast is given by
Zn(i) = φeZn + Zl(ί-l)ma(f). Now we consider a more general situation in the next section.
3
Non-linear non-Gaussian models
Theorem 2.1 of this paper gives the optimal predictor when {at} is an i.i.d. sequence with known p.d.f. /(α) In the case when {at} is not an identically distributed independent sequence and when the first two
264
ABRAHAM, THAVANESWARAN AND PEIRIS
conditional moments are specified the following theorem gives the form of the optimal predictor. Let Z\, , Zn be n observations from a series having first two conditional moments
E[Zt\FU\ = f(θ, FU) and Let hu = Zt+ι and
E[Zt+ι\Ftz]
The following theorem reports the MMSE and otpimal predictors of l
Note: The choice of elementary estimating functions is subjective. Theorem 3.1: (a) The MMSE predictor of Z n +i is given by. and
(b) The optimal predictor of Zn+ι is given by
The proof of this theorem follows by taking the elementary estimating function h2n = Z*+ι - E[Z*+ι\F*\ (cf. Godambe (1985)). The following theorem reports the MMSE and optimal predictors of Zn+\ . Example 3.1: Consider the Random Coefficient Autoregressive (RCA) model given in (1.1). By considering a class of estimating functions of the form gn = Σ™=2 a>t-ι h , where ht = Zt - E\Zt\FUλ = Zt -ΣφiZt-i
,
(3.1)
2=1
optimal estimates for the model parameters were obtained in Thavaneswaran and Abraham (1988) and the superiority of the optimal estimate over the least squares had been discussed. Here if we restrict ourselves to a class of estimating functions of the above form then we will get the forecast of the future value of Zn+\ based on the observed values Z 1 ? Z 2 , , Z n as Z n ( l ) = E[Zn+i\Zn,Zn-U ''' ,]• That is whether we have an AR(p) model or RCA(p) model we will get the same linear predictor of Zn+\. However, for the RCA model under consideration we have
PREDICTION FOR NONLINEAR TIME SERIES and
265
p
2=1
Thus the conditional variance is a nonlinear function and hence the RCA model (1.1) may be viewed as a nonlinear time series model. Nicholls and Quinn (1980) studied linear as well as some nonlinear (proposed) predictors by fitting a nonlinear (RCA) model for the lynx data. By giving heuristic reasoning they proposed a nonlinear predictor Zn+ι = sgn(0iZn)[φ\Zn + σ%]2 and have shown empirically that the predictor Zn+ι is a better predictor (having smaller prediction errors when compared with the actual observations) than the linear predictor Zn+χ — φZn, for the lynx data. It is of interest to note that by defining ht = Zf - E[Zf\Ftz_ι\, the optimal predictor for Z n +i can be obtained as Z*(l) = sqrt [ J ^ Z ^ F ^ ] ] = Sgn(φιZn) [{φ\ + σ%)Z% + crf\ *•• ι-e estimating function method could be used to obtain a nonlinear predictor for a nonlinear model by considering a class of elementary martingale estimating functions generated by nonlinear functions of the observations. Using a similar argument we could also propose a nonlinear forecast for the ARCH process. Example 3.2: Doubly stochastic times series Random coefficient autoregressive sequences given in (1.1) are special cases of what Tj0stheim (1986) refers to as doubly stochastic time series models. In the nonlinear case these models are given by zt-θtf(t,Ftz_ι)
= et,
(3.2)
where {θt+bt} of (3.2) is now replaced by a more general stochastic sequence {θt} and zt-ι is replaced by a function of the past, f(t, Ff_λ). When {θt} is a moving average (MA) sequence of the form θt = θ + et + et-u
(3.3)
where {θt} , {et} are square integrable-independent random variables and {βt} consists of zero mean square integrable random variables independent of {et} . In this case E{zt\Fl_ι) depends on the posterior mean, mt = E(et\Ftz), and variance vt = E[(et-mt)2\Ftz] of et . Thus, for the evaluation of mt and vt we further assume that {e*} and {et} are Gaussian and that ZQ = 0. Then mt and v% satisfy the following Kalman-like recursive algorithms (see Shiryayev, 1984, p. 439) : _ σ2ef(t,FtU)[zt 1
- (θ + mt
σ2 + / 2 ( ί F / ) ( i
266
ABRAHAM, THAVANESWARAN AND PEIRIS
and
where UQ = σ\
and mt = 0 . Hence
and
E{hξ\Ff) =
E{[zt-E{zt\FU)?\FU}
= σe2 + / 2 (ί,F t 2 _ 1 )( σ e 2 + ^ _ 1 ) . Now the optimal predictor based on ht is given by
ACKNOWLEDGEMENT This paper was finalised while A. Thavaneswaran was at the University of Sydney. He thanks the School of Mathematics and Statistics for the support during his visit. The authors thank the referee and Professors I.V. Basawa and M. E. Thompson for very useful comments on an earlier version of this paper. References Abraham, B. and Thavaneswaran, A. (1991). A nonliner time series model and estimation of missing observations. Ann. Inst. Statist. Math. 43, 493-504. Godambe, V.P. (1960). An optimal property of regular maximum likelihood equations. Ann. Math. Statist, 31, 1208-1211. Godambe, V.P. (1985). The foundations of finite sample estimation in stochastic processes. Biometrika, 72, 419-428. Godambe, V.P. (1994). Linear Bayes and Optimal Estimation. Technical Report No. 11, University of Waterloo. Naik-Nimbalkar, U.V. and Rajarshi, M.B. (1995). Filtering and Smoothing via Estimating Functions. J. Amer. Statist. Assoc. 90, 301-306. Nicholls, D.F. and Quinn, B.G. (1982) Random Coefficient Autoregressive Models. An Introduction. Springer, New York.
PREDICTION FOR NONLINEAR TIME SERIES
267
Shiryayev, A.N. (1984). Probability. Springer, New York. Thavaneswaran, A. and Abraham, B. (1988) Estimation for nonlinear time series models using estimating equations. J. Time Ser. Anal. 9, 99-108. Thavaneswaran, A. and Peiris, S. (1996). Nonparametric estimation for some nonlinear models. Stats, and Prob. Letters 28, 227-233. Tj0stheim, D. (1986) Estimation in nonlinear time series models. Stochastic Processes and AppL, 21, 251-273.
269
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
Estimating Function Methods Of Inference For Queueing Parameters I. V. Basawa The University of Georgia
Robert Lund The University of Georgia
U. Narayan Bhat Southern Methodist University ABSTRACT This paper develops estimates of the interarrival and service time distribution parameters in a GI/G/1 queueing system from observations of the waiting times of the first N customers. Specifically, if Ik and Sk denote the interarrival and service times of the fcth customer arriving at the queue, then the waiting time sequence {Wk} evolves via the Markovian recursion Wk = max(Wib_i + Sk-\ - 4,0) for k > 2. We first exploit the Markov structure of {Wk} to derive an estimating function equation involving the waiting time data; in principle, this equation can be used to obtain estimates of the parameters governing the distributions of Si and I\. Next, all quantities involved in the estimating function equation are expressed in terms of the distributions of S\ and I\. The above estimating techniques are explored in depth for the M/Ek/l queue; here, explicit computations permit a simulation study of this queueing system. Finally, the consistency and asymptotic normality of the estimating function parameter estimates are established. Key Words: Queue; waiting time; estimating function; maximum likelihood.
270
1
BASAWA, LUND AND BHAT
Introduction
Consider a GI/G/1 queueing system and suppose that successive customers arrive at the queue at the times {T_i, i ≥ 1}. Let S_i denote the service time of the ith customer, i ≥ 1, and define I_i = T_i − T_{i−1} as the interarrival time between the ith and (i−1)st customers (we take T_0 = 0). The waiting time of the ith customer, denoted W_i, is the total amount of time the ith customer spends waiting for his/her service to commence. The waiting time process {W_t, t ≥ 1} evolves via the well-known Lindley recursion

    W_{t+1} = \max(W_t + X_{t+1},\, 0)    (1.1)

for t ≥ 1, where X_t = S_{t−1} − I_t (see Prabhu, 1980). Since X_{t+1} is independent of W_t, (1.1) shows that {W_t} is a Markov chain on the state space [0, ∞). We will henceforth assume that the traffic intensity of the queue is subcritical; that is, ρ = E[S_1]/E[I_1] < 1. When ρ < 1, it is known that W_t converges in total variation as t → ∞ to a random variable W at a geometric rate (cf. Lund, 1996). We denote the measure associated with W by π (the stationary distribution) and comment that the geometric convergence and the inference procedures described below are valid for any initial distribution of W_1 satisfying E[r^{W_1}] < ∞ for some r > 1. Hence, for simplicity, we take W_1 = 0; that is, we assume that the queue is initially empty unless otherwise stated.

The objective of this paper is to develop estimates for the parameters of the distributions of I_1 and S_1 based on the waiting time observations W_t for 1 ≤ t ≤ N. Both maximum likelihood and estimating function approaches are considered and compared; hence, this paper extends the work of Basawa et al. (1996). Most previous parameter inference procedures for queueing models require the observation of all customer interarrival and service times. We refer the reader to Basawa and Prabhu (1981, 1988), Bhat and Rao (1987), Basawa and Bhat (1992), and Thiruvaiyaru and Basawa (1992) for methods and related references on this topic. Unfortunately, the observation of all interarrival and service times is frequently impractical or costly; however, the customer waiting times can easily be measured by putting a "clock" on each customer. Hence, inference procedures based only on waiting time data are often desirable and cost efficient.

In Section 2, we present the relevant theory needed to derive estimating function and maximum likelihood estimates of the interarrival and service time distribution parameters from the waiting time data. Section 3 explicitly computes the relevant quantities appearing in the Section 2 equations for the case of an M/E_k/1 queue. Section 4 establishes the consistency and asymptotic normality of the estimating function estimates. Section 5 uses the results of Section 3 for a simulation study of the M/E_k/1 queue. Finally, Section 6 concludes with a summary and some comments. We refer to Godambe and Heyde (1987), Greenwood and Wefelmeyer (1991), and Hutton and Nelson (1986) for more general treatments of estimating functions for stochastic processes.
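Though not part of the paper, the Lindley recursion (1.1) gives a direct way to generate waiting time data of the kind used in the simulation study of Section 5. The sketch below is a minimal Python illustration of our own, assuming an M/E_k/1 queue with exponential(λ) interarrival times and Erlang(k, μ) service times; all names are hypothetical.

```python
import numpy as np

def simulate_waiting_times(N, lam, mu, k, seed=None):
    """Simulate W_1, ..., W_N for an M/E_k/1 queue via the Lindley
    recursion W_{t+1} = max(W_t + S_t - I_{t+1}, 0), starting from an
    empty queue (W_1 = 0)."""
    rng = np.random.default_rng(seed)
    I = rng.exponential(1.0 / lam, size=N)           # interarrival times
    S = rng.gamma(shape=k, scale=1.0 / mu, size=N)   # Erlang(k, mu) services
    W = np.zeros(N)
    for t in range(N - 1):
        # X_{t+1} = S_t - I_{t+1} (0-based arrays)
        W[t + 1] = max(W[t] + S[t] - I[t + 1], 0.0)
    return W

# Example: a subcritical M/M/1 queue (k = 1) with rho = 1/2
W = simulate_waiting_times(500, lam=1.0, mu=2.0, k=1, seed=42)
```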
2 General Theory
In this section, we present the general theory needed to obtain estimating function and maximum likelihood estimates of the parameters governing the interarrival and service-time distributions. For notation, let θ denote the vector of interarrival and service time distribution parameters. Following Godambe (1985), we define an estimating function of the form

    S_N(θ) = \sum_{t=1}^{N-1} \bigl(W_{t+1} − E_θ[W_{t+1} \mid W_1, \ldots, W_t]\bigr)\, h_t(W_1, \ldots, W_t; θ),    (2.1)

where h_t is a function of W_1, …, W_t and θ for 1 ≤ t ≤ N − 1. Since {W_t} is Markov, E_θ[W_{t+1} | W_1, …, W_t] = E_θ[W_{t+1} | W_t], and the optimal choice of h_t is

    h_t(W_1, \ldots, W_t; θ) = \frac{∂E_θ[W_{t+1} \mid W_t]}{∂θ}\, \mathrm{Var}_θ^{-1}[W_{t+1} \mid W_t].    (2.2)

Hence, an estimating function estimate of θ based on W_1, …, W_N is a solution to the equation

    S_N(θ) = \sum_{t=1}^{N-1} \bigl(W_{t+1} − E_θ[W_{t+1} \mid W_t]\bigr)\, \frac{∂E_θ[W_{t+1} \mid W_t]}{∂θ}\, \mathrm{Var}_θ^{-1}[W_{t+1} \mid W_t] = 0.    (2.3)

A simpler estimating function estimate can also be obtained when the "variance weights" in (2.3) are neglected. Hence, we will also explore solutions to the equation

    S_N^*(θ) = \sum_{t=1}^{N-1} \bigl(W_{t+1} − E_θ[W_{t+1} \mid W_t]\bigr)\, \frac{∂E_θ[W_{t+1} \mid W_t]}{∂θ} = 0.    (2.4)
For (2.3) and (2.4) to be useful, one must be able to compute the expectations and variances appearing in these equations in terms of the interarrival and service time distributions. For this, let X be a random variable whose
distribution is the same as that of S_1 − I_1. Define F_X(x) = P[X ≤ x], note that F_X depends on θ, and use (1.1) to get

    E_θ[W_{t+1} \mid W_t] = W_t\, α(W_t) + β(W_t),    (2.5)

where

    α(W_t) = \int_{\{x > −W_t\}} dF_X(x) \qquad \text{and} \qquad β(W_t) = \int_{\{x > −W_t\}} x\, dF_X(x).    (2.6)

Similar arguments show that

    E_θ[W_{t+1}^2 \mid W_t] = W_t^2\, α(W_t) + 2W_t\, β(W_t) + γ(W_t),    (2.7)

where

    γ(W_t) = \int_{\{x > −W_t\}} x^2\, dF_X(x).    (2.8)

From (2.7) and (2.5), we obtain

    \mathrm{Var}_θ[W_{t+1} \mid W_t] = W_t^2\, α(W_t)[1 − α(W_t)] + 2W_t\, β(W_t)[1 − α(W_t)] + γ(W_t) − β^2(W_t).    (2.9)
Notice that, in principle, (2.5)–(2.9) identify all quantities in (2.3) and (2.4) in terms of the interarrival and service time distributions; in practice, one would need explicit expressions for E_θ[W_{t+1}|W_t], ∂E_θ[W_{t+1}|W_t]/∂θ, and Var_θ[W_{t+1}|W_t] in terms of θ to implement (2.3) and (2.4).

Now consider the method of maximum likelihood. For simplicity, we assume that F_X has the probability density function

    f_X(x) = \frac{d}{dx} F_X(x).    (2.10)

Define the indicator variable

    Z_t = I_{(0,∞)}(W_t);    (2.11)

the Markov property of {W_t} can be used to show that the likelihood function, denoted L(θ; W_1, …, W_N), satisfies

    \log L(θ; W_1, \ldots, W_N) = \sum_{t=1}^{N-1} (1 − Z_{t+1}) \log[1 − α(W_t)] + \sum_{t=1}^{N-1} Z_{t+1} \log f_X(W_{t+1} − W_t)    (2.12)

(see Basawa et al. (1996) for the details). Notice that the quantities in (2.12) are easily expressed in terms of the distribution of X.
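Once α(·) and f_X(·) are available in closed form (as they are for the M/E_k/1 queue of Section 3), the log likelihood (2.12) is straightforward to evaluate numerically. A minimal sketch, with alpha and f_X supplied as callables; the function name and interface are ours, not the paper's:

```python
import numpy as np

def log_likelihood(W, alpha, f_X):
    """Evaluate the waiting time log likelihood (2.12).

    W     : array of observed waiting times W_1, ..., W_N
    alpha : callable, alpha(w) = P[X > -w], as in (2.6)
    f_X   : callable, density of X = S_1 - I_1
    """
    W = np.asarray(W, dtype=float)
    Z = (W[1:] > 0)                      # Z_{t+1} = I(W_{t+1} > 0)
    a = alpha(W[:-1])                    # alpha(W_t)
    jump = f_X(W[1:] - W[:-1])           # f_X(W_{t+1} - W_t)
    terms = np.where(Z, np.log(jump), np.log1p(-a))
    return terms.sum()
```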
Note that S_N(θ) in (2.3) is optimal in the class of estimating functions specified by (2.1). However, if the choice of estimating functions is not restricted to the class in (2.1), Godambe (1960) has shown, under very general conditions, that the likelihood score function, viz. ∂ log L/∂θ, is a (globally) optimal estimating function. In our problem, the likelihood score does not satisfy (2.1), and hence S_N(θ) is "less optimal" than ∂ log L/∂θ in the sense of information content. Consequently, there is some loss of efficiency in using S_N(θ) (or S_N^*(θ)) instead of ∂ log L/∂θ. On the other hand, ∂ log L/∂θ requires knowledge of the density f_X(·), whereas S_N(θ) needs only the conditional mean and variance of {W_t}, and S_N^*(θ) requires the conditional mean only. The simulation results in Section 5 show that the estimates obtained from S_N(θ) and S_N^*(θ) are less biased than the maximum likelihood estimates. Moreover, the loss of efficiency due to using the estimating functions is negligible except when the traffic intensity is large.
3 Computations for the M/E_k/1 Queue
In this section, all quantities appearing in the Section 2 estimating function and likelihood equations (2.3), (2.4), and (2.12) will be explicitly computed in terms of θ for the M/E_k/1 queue; one obtains results for the classical M/M/1 queue by taking k = 1. In the M/E_k/1 queue, the customer interarrival times {I_j} are exponentially distributed random variables with parameter λ and the service times {S_j} have the Erlang (k, μ) density. Hence, the probability density functions of I_1 and S_1, denoted by f_I(x) and f_S(x) respectively, are

    f_I(x) = λ e^{-λx} \ \text{for } x > 0 \qquad \text{and} \qquad f_S(x) = \frac{μ^k x^{k-1} e^{-μx}}{(k-1)!} \ \text{for } x > 0.    (3.1)
Straightforward computations provide the cumulative distribution function of S_1 − I_1:

    F_X(x) = \Bigl(\frac{μ}{λ+μ}\Bigr)^k e^{λx}, \quad x < 0,

and

    F_X(x) = 1 − λ μ^k (λ+μ)^{-k} e^{-μx} \sum_{j=0}^{k-1} \frac{(λ+μ)^j}{μ^{j+1}} \sum_{i=0}^{j} \frac{(μx)^i}{i!}, \quad x ≥ 0.    (3.2)

The probability density function f_X(x) is easily obtained by differentiating (3.2):

    f_X(x) = λ μ^k (λ+μ)^{-k} e^{λx}, \quad x < 0,

and

    f_X(x) = λ μ^k (λ+μ)^{-k} e^{-μx} \sum_{j=0}^{k-1} \frac{((λ+μ)x)^j}{j!}, \quad x ≥ 0.    (3.3)

From the first expression in (3.2), we obtain

    α(W_t) = 1 − F_X(−W_t) = 1 − \Bigl(\frac{μ}{λ+μ}\Bigr)^k e^{-λW_t}.    (3.4)
More tedious computations with the density function in (3.3) give

    β(W_t) = λ^{-1}\Bigl(\frac{μ}{λ+μ}\Bigr)^k e^{-λW_t}(1 + λW_t) + μ^{-1}k − λ^{-1}    (3.5)

and

    γ(W_t) = λ^{-2}\Bigl(\frac{μ}{λ+μ}\Bigr)^k e^{-λW_t}\bigl[-λ^2 W_t^2 − 2λW_t − 2\bigr] + μ^{-2}k(k+1) − 2λ^{-2}(kλ/μ − 1).    (3.6)
From (2.5) and (2.9), we see that E_θ[W_{t+1}|W_t] and Var_θ[W_{t+1}|W_t] are easily obtained in terms of α(W_t), β(W_t), and γ(W_t). To complete the computation of all quantities in (2.3), (2.4), and (2.12), we must evaluate the partial derivatives of E_θ[W_{t+1}|W_t] with respect to λ and μ. Using (2.5), (3.4), (3.5), and the notation E_θ[W_{t+1}|W_t] = E_{λ,μ}[W_{t+1}|W_t], we find that

    \frac{∂}{∂λ} E_{λ,μ}[W_{t+1}|W_t] = λ^{-2} − λ^{-2}\Bigl(\frac{μ}{λ+μ}\Bigr)^k e^{-λW_t}\bigl[λW_t + 1 + (λ+μ)^{-1}λk\bigr];
    \frac{∂}{∂μ} E_{λ,μ}[W_{t+1}|W_t] = kμ^{-2}\Bigl[\Bigl(\frac{μ}{λ+μ}\Bigr)^{k+1} e^{-λW_t} − 1\Bigr].    (3.7)
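As a quick numerical check, (3.4), (3.5), and the conditional mean (2.5) are easily coded; the sketch below is our own, and differencing its output in λ and μ provides a check on (3.7).

```python
import numpy as np

def cond_mean(W, lam, mu, k):
    """E_theta[W_{t+1} | W_t] = W_t alpha(W_t) + beta(W_t) for the
    M/E_k/1 queue, using (3.4) and (3.5)."""
    W = np.asarray(W, dtype=float)
    c = (mu / (lam + mu)) ** k * np.exp(-lam * W)
    alpha = 1.0 - c                                        # (3.4)
    beta = c * (1.0 + lam * W) / lam + k / mu - 1.0 / lam  # (3.5)
    return W * alpha + beta
```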
4 Asymptotic Properties of the Estimates
We now follow Klimko and Nelson (1978), Hutton and Nelson (1986), and Hutton et al. (1991) and establish the consistency and asymptotic normality of the estimates obtained as solutions to (2.3) and (2.4). We will focus on the asymptotic properties of the estimating function estimates only and refer the reader to Basawa et al. (1996) for the asymptotic properties of the maximum likelihood estimates. From (2.3) and (2.4), it is straightforward to show that {S_N(θ)} and {S_N^*(θ)} are mean zero martingales with respect to {W_t}. Two results, Lemmas 4.1 and 4.2 below, that will be helpful later are now stated. The expectations in (4.1)–(4.4) are tacitly assumed to exist and are taken with respect to the stationary measure π.

LEMMA 4.1. Consider the waiting time process {W_t} in (1.1) with ρ < 1. Let S_N(θ) and S_N^*(θ) be defined as in (2.3) and (2.4); then the following convergence takes place in probability as N → ∞.

(i) N^{-1} S_N(θ) → 0 and N^{-1} S_N^*(θ) → 0.

(ii) −N^{-1} ∂S_N(θ)/∂θ' → J(θ) and −N^{-1} ∂S_N^*(θ)/∂θ' → J^*(θ), where

    J(θ) = E\Bigl[\frac{∂E_θ[W_{t+1}|W_t]}{∂θ}\, \mathrm{Var}_θ^{-1}[W_{t+1}|W_t]\, \frac{∂E_θ[W_{t+1}|W_t]}{∂θ'}\Bigr]    (4.1)

and

    J^*(θ) = E\Bigl[\frac{∂E_θ[W_{t+1}|W_t]}{∂θ}\, \frac{∂E_θ[W_{t+1}|W_t]}{∂θ'}\Bigr].    (4.2)
PROOF: Since {W_t} is an ergodic process when ρ < 1, the results in (i) and (ii) will follow from the ergodic theorem and algebraic computations with (2.3) and (2.4). We shall verify the results for S_N(θ); similar arguments can be used for S_N^*(θ). Define

    U_t(θ) = \bigl(W_{t+1} − E_θ[W_{t+1}|W_t]\bigr)\, \frac{∂E_θ[W_{t+1}|W_t]}{∂θ}\, \mathrm{Var}_θ^{-1}[W_{t+1}|W_t];

we have S_N(θ) = \sum_{t=1}^{N-1} U_t(θ). Clearly, E U_t(θ) = 0; the ergodic theorem then gives the result in (i). Also, differentiating U_t(θ) with respect to θ' and taking expectations, it follows readily that E[−∂U_t(θ)/∂θ'] = J(θ). Hence, (ii) follows from the ergodic theorem. □

LEMMA 4.2. Under the notation and assumptions of Lemma 4.1, the following convergence takes place in distribution as N → ∞.

(i) N^{-1/2} S_N(θ) → N(0, F(θ)), where

    F(θ) = E\Bigl[\frac{∂E_θ[W_{t+1}|W_t]}{∂θ}\, \mathrm{Var}_θ^{-1}[W_{t+1}|W_t]\, \frac{∂E_θ[W_{t+1}|W_t]}{∂θ'}\Bigr].    (4.3)

(ii) N^{-1/2} S_N^*(θ) → N(0, F^*(θ)), where

    F^*(θ) = E\Bigl[\mathrm{Var}_θ[W_{t+1}|W_t]\, \frac{∂E_θ[W_{t+1}|W_t]}{∂θ}\, \frac{∂E_θ[W_{t+1}|W_t]}{∂θ'}\Bigr].    (4.4)

PROOF: Notice that S_N(θ) and S_N^*(θ) are sums of stationary ergodic martingale differences with finite second moments. An appeal to the martingale central limit theorem (cf. Billingsley (1961)) easily establishes (i) and (ii). Note that E_θ(U_t(θ) U_t'(θ)) = F(θ); a similar computation holds for F^*(θ). □
Note that F(θ) = J(θ); however, F^*(θ) ≠ J^*(θ). We will now confine our attention to S_N(θ); analogous results for S_N^*(θ) can be obtained by "starring" all quantities in the results below. Consider the following two conditions.

(C.1) Suppose S_N(θ) is continuous in θ and, for all δ > 0,

    P\Bigl\{ \sup_{\|θ − θ_0\| = δ} (θ − θ_0)' S_N(θ) < −ε \Bigr\} → 1    (4.5)

for any ε > 0.

(C.2) Suppose that if {θ_N} is any sequence of estimates such that θ_N → θ_0 in probability, then

    N^{-1} \Bigl\| \frac{∂S_N(θ)}{∂θ'}\Big|_{θ=θ_N} − \frac{∂S_N(θ)}{∂θ'}\Big|_{θ=θ_0} \Bigr\| → 0 \ \text{in probability as } N → ∞.    (4.6)

Condition (C.2) imposes a type of continuous convergence on ∂S_N(θ)/∂θ'. Sufficient conditions for (C.2) to hold can be phrased in terms of expectations of the second derivative of S_N(θ). The interested reader is referred to Klimko and Nelson (1978) for further details. See Hutton et al. (1991) for sufficient conditions for (C.1). We note that (C.1) and (C.2) can be verified for the M/M/1 and M/E_k/1 queues when ρ < 1 from these second derivative conditions and the equations in Sections 2 and 3. Our next two results establish the consistency and asymptotic normality of the estimating function estimates.

THEOREM 4.1. Let {W_t} be the waiting time process in (1.1) with ρ < 1, and suppose that S_N(θ) in (2.3) satisfies (C.1). Then there exists a sequence of estimators θ̂_N such that P_{θ_0}[S_N(θ̂_N) = 0] → 1 as N → ∞ and θ̂_N → θ_0 in probability as N → ∞.
PROOF: See Hutton et al. (1991), or Hutton and Nelson (1986). □

THEOREM 4.2. If ρ < 1, (C.1) and (C.2) are satisfied, and θ̂_N is any consistent solution of S_N(θ) = 0, then

    \sqrt{N}(θ̂_N − θ_0) → N\bigl(0,\ J(θ_0)^{-1} F(θ_0) J(θ_0)^{-1}\bigr) = N\bigl(0,\ J(θ_0)^{-1}\bigr) \ \text{in distribution}

(recall that F(θ) = J(θ)).

PROOF: A Taylor expansion of S_N(θ) at θ_0 gives

    S_N(θ) = S_N(θ_0) + \frac{∂S_N(θ)}{∂θ'}\Big|_{θ=θ^*} (θ − θ_0),    (4.7)

where θ^* lies between θ and θ_0. Replacing θ in (4.7) by θ̂_N, we have

    0 = S_N(θ_0) + \frac{∂S_N(θ)}{∂θ'}\Big|_{θ=θ̂_N^*} (θ̂_N − θ_0),    (4.8)

where θ̂_N^* lies between θ̂_N and θ_0. From (4.8), we obtain

    \sqrt{N}(θ̂_N − θ_0) = −\Bigl[ N^{-1} \frac{∂S_N(θ)}{∂θ'}\Big|_{θ=θ̂_N^*} \Bigr]^{-1} N^{-1/2} S_N(θ_0).    (4.9)

Lemma 4.2 shows that

    N^{-1/2} S_N(θ_0) → N(0, F(θ_0)) \ \text{in distribution}.    (4.10)

From Lemma 4.1(ii) and (C.2), we have

    −N^{-1} \frac{∂S_N(θ)}{∂θ'}\Big|_{θ=θ̂_N^*} → J(θ_0) \ \text{in probability}.    (4.11)

Combining (4.9)–(4.11), we obtain the desired result. □

Following similar arguments, we have

THEOREM 4.3. If (C.1) and (C.2) are satisfied and θ̂_N^* is any consistent solution of S_N^*(θ) = 0, then

    \sqrt{N}(θ̂_N^* − θ_0) → N\bigl(0,\ J^*(θ_0)^{-1} F^*(θ_0) J^*(θ_0)^{-1}\bigr) \ \text{in distribution}.
We comment that it would be a straightforward, albeit tedious, matter to derive explicit expressions for J(θ) and F(θ) for the M/E_k/1 queue. For example, in the M/M/1 queue, one could use the fact that the limiting measure π has an atom at {0} and is exponentially distributed elsewhere (cf. Prabhu (1980)); specifically, π({0}) = 1 − λμ^{-1} and π(dx) = λμ^{-1}(μ − λ) e^{-(μ-λ)x} dx for x > 0. This could be combined with the equations in Sections 2 and 3 to evaluate the stationary expectations of the conditional means, variances, and their derivatives when W_t has distribution π; (4.1)–(4.4) could then be used to compute J(θ) and F(θ). These details are omitted.
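Though the paper omits these closed-form computations, the stationary expectations in (4.1)–(4.4) can also be approximated by ergodic averaging along a long simulated waiting-time path. A minimal Monte Carlo sketch for the unweighted matrix J*(θ) of (4.2), with the gradient of the conditional mean supplied as a callable (our own code and naming):

```python
import numpy as np

def J_star_mc(W, dmean_dtheta):
    """Monte Carlo estimate of J*(theta) in (4.2) via the ergodic theorem:
    average (dE[W_{t+1}|W_t]/dtheta)(dE[W_{t+1}|W_t]/dtheta)' along a long
    simulated path W.

    dmean_dtheta : callable mapping w to the length-2 gradient of
                   E[W_{t+1} | W_t = w] with respect to (lam, mu),
                   e.g. from (3.7).
    """
    grads = np.array([dmean_dtheta(w) for w in W[:-1]])
    return grads.T @ grads / (len(W) - 1)
```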
5 A Simulation Study
In this section, we will compare properties of the estimating function estimates (with and without variance weights) and the maximum likelihood estimates via simulation. Waiting time data were simulated and the performance of the estimating methods was investigated. The results are summarized in Tables 1, 2, and 3 below. Table 1 considers the M/M/1 queue. The parameter pairs λ = 1, μ = 2; λ = 2, μ = 3; and λ = 5, μ = 6 were studied; these parameter pairs yield the increasing traffic intensities ρ = 1/2, 2/3, and 5/6 respectively. One hundred simulations were performed for each (λ, μ) pair and each of the sample sizes N = 100, 250, 500, and 1000. Table 1 shows the sample mean and root mean squared errors for each simulation. A separate subtable is included for each of the three methods of estimation.
TABLE 1: The M/M/1 queue.
Sample mean and root mean squared error of 100 simulated parameter estimates of (λ, μ). Method: estimating function with variance weights.

Sample Size N      100             250             500             1000
λ = 1              0.915 (0.414)   1.007 (0.313)   0.961 (0.209)   1.030 (0.158)
μ = 2              1.917 (0.606)   2.023 (0.426)   1.957 (0.266)   2.044 (0.221)
λ = 2              1.780 (0.859)   1.914 (0.543)   1.969 (0.355)   2.016 (0.243)
μ = 3              2.863 (0.959)   2.901 (0.604)   3.024 (0.416)   3.047 (0.274)
λ = 5              5.034 (3.072)   5.050 (1.767)   4.902 (0.974)   4.992 (0.663)
μ = 6              5.974 (2.557)   6.095 (1.673)   5.949 (0.985)   5.975 (0.760)
Sample mean and root mean squared error of 100 simulated parameter estimates of (λ, μ). Method: maximum likelihood.

Sample Size N      100             250             500             1000
λ = 1              1.439 (0.533)   1.423 (0.464)   1.401 (0.416)   1.407 (0.417)
μ = 2              2.311 (0.466)   2.309 (0.388)   2.285 (0.341)   2.269 (0.289)
λ = 2              2.671 (0.816)   2.609 (0.669)   2.625 (0.648)   2.550 (0.563)
μ = 3              3.488 (0.751)   3.368 (0.497)   3.330 (0.414)   3.285 (0.331)
λ = 5              6.035 (1.302)   5.790 (0.922)   5.706 (0.765)   5.711 (0.739)
μ = 6              6.435 (1.077)   6.386 (0.674)   6.309 (0.497)   6.290 (0.393)
Sample mean and root mean squared error of 100 simulated parameter estimates of (λ, μ). Method: estimating function without variance weights.

Sample Size N      100             250             500             1000
λ = 1              0.978 (0.413)   1.032 (0.289)   0.999 (0.203)   0.970 (0.145)
μ = 2              1.947 (0.522)   2.079 (0.382)   2.017 (0.274)   1.944 (0.206)
λ = 2              1.858 (0.729)   1.889 (0.474)   1.994 (0.363)   1.992 (0.292)
μ = 3              2.850 (0.805)   2.943 (0.539)   3.039 (0.409)   3.004 (0.333)
λ = 5              4.747 (2.286)   5.036 (1.811)   4.737 (0.926)   5.072 (0.775)
μ = 6              5.733 (2.255)   6.110 (1.681)   5.742 (0.938)   6.064 (0.776)
Table 1 shows that the two estimating function methods yield approximately unbiased parameter estimates; in contrast, all maximum likelihood sample means are larger than the true parameter values. Despite this bias, the method of maximum likelihood has a smaller root mean squared error than both estimating function methods for the traffic intensity ρ = 5/6. This, of course, reflects the fact that the maximum likelihood estimates had a much smaller sample variance than their estimating function counterparts. We note that for ρ = 1/2, the root mean squared errors from the estimating function methods are comparable to (sometimes even smaller than) the maximum likelihood root mean squared errors. Inspection of Table 1 shows that the estimating function estimates without variance weights are, overall, about as efficient as the estimating function estimates with variance weights in terms of root mean squared error. Finally, we note that the root mean squared errors of all estimates increase with increasing λ and/or μ.

In terms of computations, the maximum likelihood estimates were the easiest to obtain. The minimum of the negative log likelihood function was rapidly found in all simulations with a gradient search routine. In contrast, difficulties were encountered with the root finding computations needed to compute the estimating function estimates: in a small proportion of the simulations with the smaller series lengths (particularly N = 100), none of the standard root finding numerical methods tried, such as Newton or Broyden, satisfactorily found the roots. The root finding method that worked best in practice proceeded as follows. First, a gradient search routine was used to find "approximate roots" of the estimating function equations by numerically minimizing the sum of squares

    S_N^{(1)}(λ, μ)^2 + S_N^{(2)}(λ, μ)^2

in λ and μ, where S_N^{(i)}(λ, μ) is the ith component of S_N(θ) or S_N^*(θ) for i = 1, 2 (i = 1 corresponds to λ, i = 2 corresponds to μ). These estimates were then refined with Newton's method for systems of non-linear equations. In virtually all simulations, a root (λ̂, μ̂) meeting the prescribed numerical tolerance was found.
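The two-stage search just described is easy to reproduce with standard numerical libraries; the sketch below is our own rendering of the idea (the paper does not specify an implementation), with S_N supplied as a callable returning the two components of the estimating function:

```python
import numpy as np
from scipy.optimize import minimize, root

def solve_estimating_equation(S_N, theta0):
    """Two-stage root search for S_N(theta) = 0, theta = (lam, mu).

    Stage 1: gradient-type search minimizing the sum of squares
             S_N^(1)(theta)^2 + S_N^(2)(theta)^2 for an approximate root.
    Stage 2: Newton-type refinement of the nonlinear system.
    """
    obj = lambda th: float(np.sum(np.asarray(S_N(th)) ** 2))
    stage1 = minimize(obj, theta0, method="BFGS")
    stage2 = root(S_N, stage1.x, method="hybr")
    return stage2.x
```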
Tables 2 and 3 show similar simulations for the M/E_k/1 queue for the cases k = 2 and k = 4 respectively. The parameter values of λ and μ were again selected to yield the increasing traffic intensities ρ = 1/2, 2/3, and 5/6. The bias properties of the estimates in Tables 2 and 3 are similar to those in Table 1. We note that the maximum likelihood estimates, in most cases, have smaller root mean squared errors than their estimating function counterparts. In many cases, the estimating function approach without variance weights yielded a root mean squared error that was comparable to, or only slightly larger than, the root mean squared error of the estimating function approach with variance weights. Hence, little seems to be gained by accounting for variances in the estimating function approach for the M/E_k/1 queue. It should be noted, however, that the gain in efficiency due to accounting for variance weights should increase with larger sample sizes.

TABLE 2: The M/E_2/1 queue.

Sample mean and root mean squared error of 100 simulated parameter estimates of (λ, μ). Method: estimating function with variance weights.
Sample Size N      100             250             500             1000
λ = 1              1.005 (0.395)   0.971 (0.283)   0.965 (0.230)   0.963 (0.150)
μ = 4              4.045 (1.140)   3.990 (0.681)   3.947 (0.599)   3.961 (0.407)
λ = 2              1.913 (0.912)   1.976 (0.541)   1.935 (0.347)   1.986 (0.230)
μ = 6              5.917 (1.974)   5.967 (1.163)   5.888 (0.768)   5.968 (0.496)
λ = 5              4.827 (2.059)   4.925 (1.279)   5.044 (0.990)   4.932 (0.598)
μ = 12             11.975 (4.062)  11.947 (2.484)  12.167 (2.108)  11.922 (1.220)
Sample mean and root mean squared error of 100 simulated parameter estimates of (λ, μ). Method: maximum likelihood.

Sample Size N      100             250             500             1000
λ = 1              1.386 (0.457)   1.411 (0.450)   1.393 (0.410)   1.337 (0.375)
μ = 4              4.447 (0.731)   4.513 (0.661)   4.516 (0.602)   4.504 (0.548)
λ = 2              2.675 (0.775)   2.633 (0.667)   2.596 (0.622)   2.579 (0.592)
μ = 6              6.808 (1.205)   6.640 (0.841)   6.628 (0.738)   6.552 (0.616)
λ = 5              6.046 (1.281)   5.684 (0.981)   5.779 (0.833)   5.760 (0.791)
μ = 12             12.915 (1.677)  12.917 (1.398)  12.643 (0.927)  12.616 (0.783)
Sample mean and root mean squared error of 100 simulated parameter estimates of (λ, μ). Method: estimating function without variance weights.

Sample Size N      100             250             500             1000
λ = 1              1.050 (0.489)   0.955 (0.299)   0.957 (0.202)   0.984 (0.165)
μ = 4              4.129 (1.366)   3.914 (0.824)   3.894 (0.560)   3.964 (0.449)
λ = 2              2.023 (0.836)   1.998 (0.468)   1.994 (0.352)   1.981 (0.255)
μ = 6              6.119 (1.797)   6.036 (1.095)   5.989 (0.787)   5.935 (0.528)
λ = 5              5.249 (2.314)   5.176 (1.324)   5.051 (0.888)   5.030 (0.590)
μ = 12             12.733 (4.046)  12.419 (2.690)  12.016 (1.743)  12.066 (1.193)
TABLE 3: The M/E_4/1 queue.
Sample mean and root mean squared error of 100 simulated parameter estimates of (λ, μ). Method: estimating function with variance weights.

Sample Size N      100             250             500             1000
λ = 1              1.099 (0.495)   0.958 (0.286)   0.969 (0.222)   0.990 (0.166)
μ = 8              8.650 (2.759)   7.889 (1.453)   7.869 (1.143)   7.971 (0.828)
λ = 2              1.811 (0.748)   1.910 (0.481)   1.931 (0.329)   1.946 (0.260)
μ = 12             11.636 (3.522)  11.701 (1.929)  11.775 (1.323)  11.831 (1.123)
λ = 5              4.788 (2.316)   4.842 (1.262)   5.091 (0.948)   4.921 (0.592)
μ = 24             23.549 (7.694)  23.609 (5.074)  24.334 (3.475)  23.850 (2.240)
Sample mean and root mean squared error of 100 simulated parameter estimates of (λ, μ). Method: maximum likelihood.

Sample Size N      100             250             500             1000
λ = 1              1.362 (0.438)   1.324 (0.356)   1.338 (0.355)   1.326 (0.334)
μ = 8              8.903 (1.255)   8.758 (0.996)   8.840 (0.950)   8.795 (0.849)
λ = 2              2.540 (0.645)   2.588 (0.633)   2.544 (0.570)   2.538 (0.549)
μ = 12             13.100 (1.673)  13.049 (1.360)  12.969 (1.143)  12.907 (0.974)
λ = 5              5.926 (1.106)   5.800 (0.895)   5.782 (0.825)   5.785 (0.808)
μ = 24             25.412 (2.811)  25.261 (2.012)  25.290 (1.697)  25.145 (1.393)
Sample mean and root mean squared error of 100 simulated parameter estimates of (λ, μ). Method: estimating function without variance weights.

Sample Size N      100             250             500             1000
λ = 1              1.085 (0.543)   1.024 (0.323)   1.012 (0.350)   1.004 (0.187)
μ = 8              8.241 (2.843)   8.161 (1.766)   7.982 (1.664)   8.036 (1.033)
λ = 2              2.026 (0.949)   1.959 (0.467)   1.972 (0.379)   1.961 (0.243)
μ = 12             12.104 (4.014)  11.837 (2.057)  11.944 (1.662)  11.785 (1.062)
λ = 5              5.065 (1.830)   5.006 (1.418)   4.935 (0.828)   4.841 (0.666)
μ = 24             23.967 (6.196)  24.103 (3.923)  23.975 (3.101)  23.466 (2.633)
6 Summary and Comments
This paper shows how parameter estimates for the interarrival and service time distributions in a GI/G/1 queue can be obtained from customer waiting time data. Both estimating function and maximum likelihood methods of estimation were considered. The simulation study in Section 5 shows that the maximum likelihood estimates can be significantly biased, while the estimating function estimates are approximately unbiased. Despite this bias, the maximum likelihood estimates had a smaller root mean squared error than their estimating function counterparts when the traffic intensity was large; a similar ordering of the root mean squared errors did not hold for moderate traffic intensities. For the sample sizes considered in the simulation, accounting for "variances" in the estimating function produced little gain in estimation efficiency; however, it is expected that the efficiency of the variance weighted estimates would be superior with larger sample sizes. The consistency and asymptotic normality of the estimating function estimates were also established.

Acknowledgements

I. V. Basawa's work was partially supported by grants from the Office of Naval Research and the National Science Foundation. Robert Lund's research was supported by National Science Foundation Grant DMS-9703838. We thank the referee for a careful reading and some constructive suggestions.
References
Basawa, I. V. and Bhat, B. R. (1992). Sequential inference for single server queues. In Queueing and Related Models (U. N. Bhat and I. V. Basawa, eds.), 325-336. Oxford University Press, Oxford.
Basawa, I. V., Bhat, U. N. and Lund, R. B. (1996). Maximum likelihood estimation for single server queues from waiting time data. To appear in Queueing Systems.
Basawa, I. V. and Prabhu, N. U. (1981). Estimation in single server queues. Naval Research Logistics Quarterly 28, 475-487.
Basawa, I. V. and Prabhu, N. U. (1988). Large sample inference from single server queues. Queueing Systems 3, 289-306.
Bhat, U. N. and Rao, S. S. (1987). Statistical analysis of queueing systems. Queueing Systems 1, 217-247.
Billingsley, P. (1961). The Lindeberg-Levy theorem for martingales. Proceedings of the American Mathematical Society 12, 788-792.
Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist. 31, 1208-1212.
Godambe, V. P. (1985). The foundations of finite sample estimation in stochastic processes. Biometrika 72, 419-428.
Godambe, V. P. and Heyde, C. C. (1987). Quasilikelihood and optimal estimation. Int. Statist. Rev. 55, 231-244.
Greenwood, P. E. and Wefelmeyer, W. (1991). On optimal estimating functions for partially specified counting process models. In Estimating Functions (V. P. Godambe, ed.), 147-160. Oxford University Press, Oxford.
Hutton, J. E. and Nelson, P. I. (1986). Quasilikelihood estimation for semimartingales. Stoch. Proc. and Applns. 22, 245-257.
Hutton, J. E., Ogunyemi, O. T. and Nelson, P. I. (1991). Simplified and two-stage quasi-likelihood estimators. In Estimating Functions (V. P. Godambe, ed.), 169-187. Oxford University Press, Oxford.
Klimko, L. and Nelson, P. I. (1978). On conditional least squares estimation for stochastic processes. Annals of Statistics 6, 629-642.
Lund, R. B. (1996). The geometric convergence rate of a Lindley random walk. To appear in Journal of Applied Probability.
Prabhu, N. U. (1980). Stochastic Storage Processes. Springer-Verlag, New York.
Thiruvaiyaru, D. and Basawa, I. V. (1992). Empirical Bayes estimation for queueing systems and networks. Queueing Systems 11, 179-202.
Optimal Estimating Equations for State Vectors in Non-Gaussian and Nonlinear State Space Time Series Models

J. Durbin
London School of Economics and Political Science

ABSTRACT In state space time series models the development over time of the observed series is determined by an unobserved series of state vectors. The paper considers the estimation of these vectors by the mode of the posterior distribution of the state vectors given the data. It is shown that the estimates are the solution of an optimal unbiased estimating equation.

Key words: Nonlinear time series; non-Gaussian time series; posterior mode estimates; estimating functions.
1 Introduction
State space models are a very general class of models which are increasingly used in applied time series analysis. In such models we have a series y_1, …, y_n of vector observations, a series α_1, …, α_n of unobserved state vectors and a vector ψ of parameters which we assume to be known or to have been estimated efficiently. This paper is concerned with the problem of estimating α_1, …, α_n given the observations y_1, …, y_n. Most of the work that has been done on such models hitherto has been based essentially on the linear Gaussian case. See for example the book by Harvey (1989) for a comprehensive treatment of linear Gaussian state space models. However, for many practical applications the assumptions of linearity and normality seem inappropriate. For example, if the data consist of the number of car drivers killed per month in road accidents in a particular region, the Poisson distribution would seem to provide a more appropriate model for the data than the normal distribution. Similarly, if the observations appear to come from distributions with heavy tails, as is common with economic and many other types of data, the t-distribution
with a low number of degrees of freedom would seem a more appropriate model than the normal distribution. A further desirable relaxation is to allow departures from linearity in the model. For the linear Gaussian case the standard estimates of the α_t's are their conditional expectations given the y_t's. As is to be expected, these have an unbiased minimum variance property. For the non-Gaussian or non-linear cases, however, the problem of calculating these conditional expectations by analytical methods is intractable. An alternative considered by some authors is to use the mode instead of the mean of the conditional density of [α_1', …, α_n']' given the y_t's, since it is easier to handle. While this approach is intuitively attractive since the resulting estimates are the most probable values of the α_t's given the observations, it does not lead to estimation errors which are unbiased with minimum variance matrix. However, we shall show in this paper that the estimates have an analogous property, namely that they are the solution of unbiased estimating equations with minimum variance matrix. Our results are derived from the estimating equations approach to the estimation of fixed parameters of Godambe (1960) and Durbin (1960). They are also related to results of Ferreira (1982) on the application of estimating equation theory to the estimation of a single random parameter.

The next section begins by considering the standard linear Gaussian state space model and uses this as a basis for discussing the classes of non-Gaussian and nonlinear models considered in the paper. It introduces the idea of estimating the α_t's by their posterior mode and obtains an estimating equation for it in an appropriate form. Section 3 derives optimal estimating equations for models of the kind under consideration and shows that the estimating equation for the posterior mode belongs to this class.
2 State Space Models for Non-Gaussian and Nonlinear Time Series
The purpose of this section is to outline a broad class of models to which the results of the next section apply. Our starting point is the standard linear Gaussian state space model for an observed vector time series y_1, …, y_n, namely

    y_t = Z_t α_t + ε_t,    ε_t ~ N(0, H_t),    (2.1)
    α_t = T_t α_{t-1} + R_t η_t,    η_t ~ N(0, Q_t),    (2.2)

for t = 1, …, n, where ε_t and η_t are independent error series and Z_t, H_t, T_t, R_t and Q_t are known matrices. The remaining series α_t is an unobserved series of state vectors which represent the development over time of the underlying
system. This is a very general model which includes as special cases many specific models used in time series analysis, such as ARIMA models. For the purpose of this paper we assume that the object of the analysis is to estimate α_1, …, α_n for this and other models discussed later in this section.

Denote the stacked vectors [y_1', …, y_n']' and [α_1', …, α_n']' by y and a; also denote the joint density of a and y by p(a, y) and the conditional density of a given y by p(a|y). For model (2.1) and (2.2) it is standard to estimate a by E(a|y), which we denote by â. This estimation is carried out by the well known Kalman filter and smoother (KFS). We shall consider only departures from normality in the observational part (2.1) of the model, while retaining the linear Gaussian form (2.2) for the development of α_t.

The first class of non-Gaussian observations we shall consider is the exponential family, for example Poisson or binomial observations, for which the density has the general form

    p(y_t | α_t) = \exp[θ_t' y_t − b_t(θ_t) + c_t(y_t)],    (2.3)

where θ_t = Z_t α_t, t = 1, …, n, and where b_t and c_t are known functions. It turns out that the task of calculating â for model (2.3) by analytical techniques is intractable. Fahrmeir (1992) therefore suggested estimating a by the mode ã of p(a|y) and he gave an approximation to ã based on the extended Kalman filter. He called ã the posterior mode estimate (PME). Durbin and Koopman (1993) showed how to compute ã accurately in a few iterations by applying the KFS to a linearised form of the estimating equation for ã.

A second important class of non-Gaussian models retains the same form as equation (2.1) but requires ε_t to have a non-Gaussian distribution, for example a t-distribution or a mixture of normals. Such distributions allow heavy-tailed observational densities to be handled. Again, ã is easily calculated by a few iterations of the KFS as shown by Durbin and Koopman (1993). Finally, we consider nonlinear models where (2.1) is replaced by the equation y_t = Z_t(α_t) + ε_t, where Z_t is a nonlinear function of α_t and ε_t may be Gaussian or non-Gaussian, for example a time series made up of the product of trend and seasonal plus random error.

For all these models we shall assume that ã is the unique solution to the equation ∂ log p(a|y)/∂a = 0. But log p(a|y) = log p(a, y) − log p(y), where p(y) is the marginal density of y. It follows that ã is the solution of the equation ∂ log p(a, y)/∂a = 0. This is the form of the estimating equation for ã that we shall use in this paper.

The estimate ã has the attractive intuitive property that it is the most probable value of a given the data. This might be sufficient grounds for using it for some workers. However, â has the objective optimality property that E(â − a) = 0 and, if a* is any other estimate of a such that E(a* − a) = 0,
with MSE(â) = V and MSE(a*) = V*, then V* − V is non-negative definite. In the next section we shall seek an analogous optimality property for ã based on the theory of optimal unbiased estimating equations.
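To make the estimating equation ∂ log p(a, y)/∂a = 0 concrete, the sketch below is a toy illustration of our own (it is not the Durbin–Koopman linearised KFS algorithm): it computes the posterior mode by Newton's method for a scalar local-level model with Poisson observations, y_t ~ Poisson(e^{a_t}), a_t = a_{t−1} + η_t with η_t ~ N(0, q) and a_1 ~ N(0, p1); the Gaussian state prior contributes a tridiagonal precision matrix.

```python
import numpy as np

def posterior_mode(y, q=0.1, p1=10.0, n_iter=50, tol=1e-10):
    """Posterior mode of (a_1, ..., a_n): Newton's method applied to the
    estimating equation d log p(a, y) / da = 0 for the toy model
        y_t ~ Poisson(exp(a_t)),  a_t = a_{t-1} + eta_t,  eta_t ~ N(0, q),
    with a_1 ~ N(0, p1)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    P = np.zeros((n, n))                 # prior precision (tridiagonal)
    P[0, 0] = 1.0 / p1
    for t in range(1, n):
        P[t, t] += 1.0 / q
        P[t - 1, t - 1] += 1.0 / q
        P[t, t - 1] = P[t - 1, t] = -1.0 / q
    a = np.log(y + 1.0)                  # crude starting value
    for _ in range(n_iter):
        grad = y - np.exp(a) - P @ a     # d log p(a, y) / da
        hess = P + np.diag(np.exp(a))    # minus the Hessian
        step = np.linalg.solve(hess, grad)
        a += step
        if np.max(np.abs(step)) < tol:
            break
    return a
```

In the linear Gaussian case the corresponding iteration converges in a single step, consistent with the observation at the end of Section 3 that the mode then equals the mean.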
3 An Optimality Property of the Posterior Mode Estimate
We begin with some preliminaries. If ã is the unique solution for a of the (mn × 1) vector equation H(a, y) = 0 and if E[H(a, y)] = 0, where expectation is taken with respect to the joint density of a and y, we say that H(a, y) = 0 is an unbiased estimating equation. We want to establish a minimum variance property for such functions H(a, y), but obviously the equation can be multiplied through by an arbitrary nonsingular matrix and still give the same value ã as its solution. We therefore standardise H(a, y) in the usual way in estimating equation theory by multiplying it by [E\{\dot H(a, y)\}]^{-1}, where \dot H(a, y) = ∂H(a, y)/∂a', and we then seek a minimum variance property for the function h(a, y) = [E\{\dot H(a, y)\}]^{-1} H(a, y). Let

    \int H(a, y)\, p(a, y)\, dy = k(a),    (3.1)

where \int indicates integration over the domain of y and where dy = \prod_{t=1}^{n}\prod_{i=1}^{p} dy_{ti}, p being the dimensionality of y_t. Denote the ith element of a by a_{(i)}, i = 1, …, mn, where m is the dimensionality of α_t, and note that k(a) is an (mn × 1) vector. We make the following assumption.

Assumption A. For each i, and for all a_{(j)} fixed, j ≠ i, \lim_{a_{(i)} → ±∞} k(a) = 0.

In view of the normality of the marginal distribution of a, this requirement would appear self-evidently satisfied for reasonable functions H(a, y); however, it seems sensible to make the requirement explicit. It is an appropriate reformulation for the present problem of Ferreira's (1982) condition B(g) = 0. Differentiating (3.1) under the integral sign, and assuming that this operation is valid, we obtain
    \int \frac{∂H(a, y)}{∂a'}\, p(a, y)\, dy + \int H(a, y)\, \frac{∂\log p(a, y)}{∂a'}\, p(a, y)\, dy = \frac{∂k(a)}{∂a'}.    (3.2)

We now wish to integrate this with respect to a_{(1)}, …, a_{(mn)}. Writing

    \frac{∂k(a)}{∂a'} = \Bigl[\frac{∂k(a)}{∂a_{(1)}},\ \ldots,\ \frac{∂k(a)}{∂a_{(mn)}}\Bigr],

it follows from Assumption A that

    \int_{-∞}^{∞} \frac{∂k(a)}{∂a_{(i)}}\, da_{(i)} = 0.

Thus

    \int_{-∞}^{∞} \cdots \int_{-∞}^{∞} \frac{∂k(a)}{∂a_{(i)}}\, da_{(1)} \cdots da_{(mn)} = 0,

and hence

    \int_{-∞}^{∞} \cdots \int_{-∞}^{∞} \frac{∂k(a)}{∂a'}\, da_{(1)} \cdots da_{(mn)} = 0.

Integrating both sides of (3.2) with respect to a_{(1)}, …, a_{(mn)} we therefore have

    E\Bigl[\frac{∂H(a, y)}{∂a'}\Bigr] + E\Bigl[H(a, y)\, \frac{∂\log p(a, y)}{∂a'}\Bigr] = 0.

Multiplying both sides by [E\{\dot H(a, y)\}]^{-1} gives

    I + E\Bigl[h(a, y)\, \frac{∂\log p(a, y)}{∂a'}\Bigr] = 0.    (3.3)
Let Var[h(a, y)] = E[h(a, y) h(a, y)'] and let

    T = E\Bigl[\frac{∂\log p(a, y)}{∂a}\, \frac{∂\log p(a, y)}{∂a'}\Bigr].

Result 1. If E[H(a, y)] = 0, H is differentiable with respect to a and Assumption A holds, then Var[h(a, y)] − T^{-1} is non-negative definite.

We need the following further assumption.

Assumption B. If k(a) = \frac{∂\log p(a)}{∂a}\, p(a), where p(a) is the marginal density of a, then Assumption A is satisfied.

Result 2. If Assumption B holds, the minimum is attained when H(a, y) = \frac{∂\log p(a, y)}{∂a}.

Proof of Result 1. This follows immediately from (3.3) by the Cauchy–Schwarz inequality.

Proof of Result 2. Let p(y|a) be the conditional density of y given a. Then

    \frac{∂\log p(a, y)}{∂a} = \frac{∂\log p(y|a)}{∂a} + \frac{∂\log p(a)}{∂a}.

Substituting this for H(a, y) in (3.1) gives

    p(a) \int \frac{∂\log p(y|a)}{∂a}\, p(y|a)\, dy + p(a)\, \frac{∂\log p(a)}{∂a} = k(a).

Since ∂ log p(y|a)/∂a is the score function when a is regarded as fixed, the first term is zero, so k(a) = p(a)\, ∂\log p(a)/∂a. It follows from Assumption B that Assumption A is satisfied. Also, differentiation of the identity \int p(a, y)\, da\, dy = 1 under the integral sign with respect to a shows that E[∂\log p(a, y)/∂a] = 0. Now if H(a, y) = ∂\log p(a, y)/∂a then E[\dot H(a, y)] = E[∂^2 \log p(a, y)/∂a\, ∂a'] = −T, as is shown by differentiating the identity \int [∂\log p(a, y)/∂a]\, p(a, y)\, da\, dy = 0 under the integral sign with respect to a'. Thus h(a, y) = −T^{-1}\, ∂\log p(a, y)/∂a, so

    Var[h(a, y)] = T^{-1}\, Var[∂\log p(a, y)/∂a]\, T^{-1} = T^{-1}.

This proves Result 2.

When Result 2 holds we say that the estimating equation is an optimal estimating equation. These results can be regarded as an extension of the following result due independently to Godambe (1960) and G. A. Barnard, to whom it is attributed on p. 145 of Durbin (1960). If under suitable regularity conditions x is an observational vector with density f(x, θ) where θ is a fixed scalar parameter, and if G(x, θ) = 0 is an estimating equation for θ satisfying E[G(x, θ)] = 0, then defining g(x, θ) = [E\{∂G(x, θ)/∂θ\}]^{-1} G(x, θ) we have that E[g(x, θ)^2] − T_θ^{-1} is non-negative, where T_θ = E[\{∂\log f(x, θ)/∂θ\}^2], and the minimum is attained when the equation G(x, θ) = 0 is the maximum likelihood equation ∂\log f(x, θ)/∂θ = 0. The straightforward extension to the case where θ is a vector was indicated on p. 145 of Durbin (1960). The Godambe–Barnard result was extended to the Bayesian context in which θ is a random scalar parameter by Ferreira (1982). Thus the present findings could be interpreted as an extension of Ferreira's results, although we have not regarded a as a parameter vector in this paper.

Returning to the PME ã for nonlinear or non-Gaussian state space models, this is the solution of the estimating equation ∂\log p(a, y)/∂a = 0. Since k(a) = p(a)\, ∂\log p(a)/∂a and p(a) is a Gaussian density, Assumption A is satisfied. Thus Result 2 holds and ã is the solution of an optimal estimating equation. When the state space model has the linear Gaussian form (2.1) and (2.2) the estimating equation is linear in a and we have h(a, y) = â − a, so the mode ã is equal to â. This result is in fact obvious since in this case p(a|y) is normal and for a normal distribution the mode is equal to the mean.
References
Durbin, J. (1960). Estimation of parameters in time series regression models. J. R. Statist. Soc. B 22, 139-153.
Durbin, J. and Koopman, S. J. (1993). Filtering, smoothing and estimation for time series models when the observations come from exponential family distributions. LSE discussion paper.
Fahrmeir, L. (1992). Posterior mode estimation by extended Kalman filtering for multivariate dynamic generalised linear models. J. Amer. Statist. Ass. 87, 501-509.
Ferreira, P. E. (1982). Estimating equations in the presence of prior knowledge. Biometrika 69, 667-669.
Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist. 31, 1208-1211.
Harvey, A. C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press.
ESTIMATING FUNCTIONS IN FAILURE TIME DATA ANALYSIS

Ross L. Prentice and Li Hsu
Fred Hutchinson Cancer Research Center, Seattle

ABSTRACT The relative risk model of Cox (1972) has become the standard for the regression analysis of univariate failure time data. Cox's maximum partial likelihood estimator is shown to arise from a mean parameter estimating function for the cumulative baseline hazard variate, after allowing for right censorship and upon inserting the usual baseline hazard function estimator. Mean and covariance parameter estimating functions applied to these same cumulative baseline hazard variates lead to estimation procedures for the regression analysis of multivariate failure time data. For example these estimating procedures may be used to simultaneously estimate marginal hazard ratio parameters and pairwise cross ratio parameters. Such estimation allows assessment of regression effects on marginal hazard functions while providing summary measures of the strength of dependency among pairs of failure time variates. Some additional topics in the analysis of multivariate failure time data are briefly discussed.
1 Introduction
Let T > 0 be a failure time variate. The distribution of T can be represented by the hazard (differential) function Λ(dt) = E{N(dt) | T ≥ t}, where N is the failure time counting process corresponding to T, defined by N(t) = I[T ≤ t]. The survivor function can then be written

    P[T > t] = \prod_{(0, t]} \{1 − Λ(ds)\},

where Π denotes a product integral. Corresponding to the counting process N one can define a martingale, M, by

    M(t) = N(t) − Λ(T ∧ t),
where '∧' denotes minimum. See Fleming and Harrington (1991) and Andersen et al (1993) for further detail and extensions.

Suppose now that an absolutely continuous failure time T is accompanied by a regression p-vector Z = (Z_1, …, Z_p)' and that the Cox (1972) regression model, Λ(dt) = Λ_0(dt)\exp(Z'β), holds, where β is a p-vector of 'relative risk' parameters and Λ_0 is an unspecified baseline hazard (differential) function. The regression parameter β can be estimated by the maximum partial likelihood estimator which solves

    \sum_{k=1}^{K} Z_k \hat M_k(X_k) = 0,    (1)

based on data (X_k, δ_k, Z_k), k = 1, …, K, where X_k = T_k ∧ C_k and δ_k = I[X_k = T_k], and C_k is a right censoring time such that T_k and C_k are independent given Z_k. Also the '^' in (1) indicates that Λ_0 has been replaced by a standard baseline hazard function estimator (Breslow, 1974; Andersen and Gill, 1982)

    \hat Λ_0(dt) = \sum_{k \in R(t)} N_k(dt) \Big/ \sum_{k \in R(t)} e^{Z_k'β},    (2)

where R(t) = \{k : X_k ≥ t\} is the 'risk set' at time t. The estimating equation (1) has been justified via partial likelihood (Cox, 1975), marginal likelihood (Kalbfleisch and Prentice, 1973) and full likelihood (Johansen, 1978) arguments. The solution β̂ has been shown to be semiparametric efficient (Begun et al, 1983). Expression (1) can also be derived from the standard mean parameter estimating function

    \sum_{k=1}^{K} D_k' Σ_k^{-1} (y_k − μ_k) = 0    (3)
295
FAILURE TIME DATA ANALYSIS This equation can also be written κ
κ
κ
Zk Mk(Tk) = Σ
rτk
Zk /
Mk(dt) = 0,
upon noting that Nk(Tk) = 1 and Mk(Tk) = l-Ak(Tk). Under independent right censorship one can integrate the martingale Mk only from zero to Xk giving the estimating function Σ %h Mk(Xk). Insertion of the baseline hazard function estimator (2) then gives the estimating equation (1). One motivation for this development is to provide the basis for extension to the regression analysis of multivariate failure time data, by applying mean or mean and covariance parameter estimating equations to cumulative baseline hazard variates.
2
Estimating Functions for Marginal Hazard Ratio Parameters
While univariate failure time methods, including Kaplan-Meier curves, censored data rank tests, and Cox regression methods are well developed, multivariate failure time methods require much further development. Suppose that there are n absolutely continuous failure time variates (Ti,...,T n ) and that each T{ is accompanied by a regressionp-vector Z^i — 1,... ,n. Much of the work on the regression analysis of multivariate failure time data assumes the T{ to be independent conditional on covariates and on the value of a hypothetical frailty variate, that is usually assumed to affect the hazard function in a multiplicative manner (e.g., Andersen et al, 1993, Chapters 9 and 10; Bickel et al, 1993, Chapters 4, 6 and 7). The joint survivor function for (Tχ,...,T n ) is obtained by integrating out the frailty variable; and dependency among the failure times are characterized by the parameters of the frailty distribution. Cox (1972) models are typically specified for the hazard functions, conditional on frailty and covariates. Though frailty models undoubtedly have a place in multivariate failure time analysis they are limited in the flexibility with which dependencies can be modeled, since the parameters of a single frailty variate characterize the entire dependence structure for Ti,... ,T n . More complicated frailty constructions with certain frailty variates shared by some, but not all, failure times in a correlated set may be able to partially overcome this limitation. A second problem concerns the form of the marginal distributions. Specifically, if Cox model forms are assumed for intensities given the frailties, the marginal intensities will generally no longer be of the Cox model form. Hence, in using a frailty model approach for multivariate data one may find oneself fitting marginal hazard models that differ from those that would be
296
PRENTICE AND HSU
used if only the data on a specific margin were available. For these reasons a modeling approach that focuses on marginal survivor functions and pertinent pairwise dependency functions may be preferred. Consider failure time variates Tk = (T^i,... ,ϊfcn)' that are subject to right censorship by potential censoring variates Ck = (Cfci> iCkn)' such that Tk and Ck are independent given the corresponding matrix Zk = (Zfci,..., ZknY of regression vectors, k = 1,..., K. Suppose now that each Tki has a marginal hazard rate function of Cox model form Aki{dtki) = hoi{dtki)zMZki
β),
where hki conditions on Zk and Ck and on the evolving failure time inforth mation, Nki(u), u < tki, for the (k,i) individual. Define y^ = Aoi(T^), so that yki has mean exp(—Z'kiβ) and variance exp(—2 Zkiβ) for all (A;,i). Denote by pkij = pkij{β,aί) the correlation between y^ and ykj, so that the covariance Σkij = Σkij(βia) between yki and ykj is exp{-Z'kiβ)exp{-Z'kjβ)pkij. Hence the variance matrix for yk can be written Σ* = dmg{exp{-Z'klβ),
,exp(-4 n /?)}Ω* di
- ,exp{-Z'knβ))
where Ω^ = Ω^(^, a) is the correlation matrix {pkij) Also the partial derivative of μk{β) with respect to β' can be written Dk = Z'k diag{exp(-Z^^),
,exV(-Z'knβ)},
so that (3) reduces to K k=l
in the absence of censorship, where Mk(Tk) = {Mk\{Tk\),.. , Mkn(Tkn)yTo accommodate right censorship one can replace Mk{Tk) by Mk{Xk) where Xk = (Xki, ? Xkn)'- To accommodate unknown baseline hazard functions we can insert estimators (2) for each of i = 1,... ,n. It also seems natural to replace Ωfc, the correlation matrix for Mk(Tk) by the correlation matrix, say Δfc = Δfc(/3, α, Ck) for Mk(Xk), giving the estimating equation K
£
Zk Ak1Mk(Xk)
=0
(4)
k=l
for the marginal hazard ratio parameter /3, where the Λ on Δ& again denotes that the baseline hazard function estimators (2) have been inserted. Following Liang and Zeger (1986) we may consider the use of (4) with the correlation matrix replaced by a working correlation matrix, say Rk = Rk(β,a).
Provided a K^{1/2}-consistent estimator α̂ = α̂(β) is available, one may then estimate β as the solution to

    \sum_{k=1}^{K} Z_k' R_k^{-1}\{β, α̂(β)\}\, \hat M_k(X_k) = 0.    (5)
Note, however, that some judgment is required in specifying the working correlation matrix, as α̂ may not be convergent to any fixed parameter as K → ∞ if the working and true matrices are too disparate (Crowder, 1995). Wei, Lin and Weissfeld (1989) considered (5) in this multivariate failure time context with identity working correlation matrix R_k = I_n. Cai and Prentice (1995) considered an estimator that solves (5) with R_k a nonparametric estimate of the correlation matrix for \hat M_k(X_k), for bivariate failure time data (n = 2) and regression matrix Z_k that has finite support. Simulation studies conducted under the bivariate survival model of Clayton (1978) and Clayton and Cuzick (1985) indicated that the inclusion of this weight matrix did not appreciably improve the efficiency of β̂ solving (5) unless the dependency between the failure times (T_{k1}, T_{k2}) was quite strong (e.g., ρ_{k12} > 0.5), and furthermore that right censorship tends to reduce any such efficiency gain. These exercises then suggest that the simple estimating equation of Wei, Lin and Weissfeld, given by

    \sum_{k=1}^{K} Z_k' \hat M_k(X_k) = 0,    (6)

will be efficient enough for marginal hazard ratio estimation in most applications. Asymptotic distribution theory showing β̂ solving (5) or (6) to be consistent and asymptotically normally distributed, and including a 'sandwich' variance estimator for β̂, has been presented (Wei et al, 1989; Cai and Prentice, 1995). Time-varying covariates can be accommodated by generalizing (6) to

    \sum_{k=1}^{K} \int_0^{X_k} Z_k(t)'\, \hat M_k(dt) = 0,

where integration takes place componentwise for the elements of t = (t_1, …, t_n)'.
3 Estimating Functions for Hazard Ratio and Pairwise Cross Ratio Parameters
In many multivariate failure time applications it will be important not only to estimate marginal hazard rates, but also to develop summary measures of the strength of dependence among pairs of failure times. Such dependency measures may be of primary interest in some contexts, for example in studies of disease occurrence among family members in genetic epidemiology. The 'cross-ratio' function (e.g., Oakes, 1982, 1986, 1989)

    CR_{ij}(t_i, t_j; Z) = \frac{F_{ij}(t_i, t_j; Z)\; ∂^2 F_{ij}(t_i, t_j; Z)/∂t_i\, ∂t_j}{\{∂F_{ij}(t_i, t_j; Z)/∂t_i\}\{∂F_{ij}(t_i, t_j; Z)/∂t_j\}},

where F_{ij}(t_i, t_j; Z) = P[T_i > t_i, T_j > t_j; Z] is the pairwise survivor function, provides a useful characterization of the relationship between failure time variates T_i and T_j. Note that CR_{ij} can also be expressed as

    CR_{ij}(t_i, t_j; Z) = λ_i(t_i \mid T_j = t_j; Z)/λ_i(t_i \mid T_j > t_j; Z)
                        = λ_j(t_j \mid T_i = t_i; Z)/λ_j(t_j \mid T_i > t_i; Z),
which has a very natural interpretation in epidemiologic and other contexts. The Clayton model (Clayton, 1978; Clayton and Cuzick, 1985) supposes that the cross ratio is a constant,

    CR_{ij}(t_i, t_j; Z) = 1 + θ_{ij}(Z)

for all (t_i, t_j), in which case θ_{ij}(Z) > −0.5 provides a summary measure of the strength of dependence between T_i and T_j given Z, with positive and negative dependencies given by θ_{ij}(Z) > 0 and θ_{ij}(Z) < 0, respectively.

Now consider joint estimating functions for the marginal hazard ratio parameter β and for an additional parameter α that characterizes the correlations among the cumulative hazard variates y_{ki} = Λ_{0i}(T_{ki}), i = 1, …, n, k = 1, …, K. Let σ_k(β, α) = (Σ_{k11}, Σ_{k12}, …, Σ_{k1n}, Σ_{k22}, …, Σ_{k2n}, …)' denote the variance matrix for y_k = (y_{k1}, …, y_{kn})' in vector form. Under a quadratic exponential model for y_k, k = 1, …, K, the score equations for (β, α) can be written (e.g., Prentice and Zhao, 1991) as

    \sum_{k=1}^{K} D_k' A_k^{-1} f_k = 0,

where

    D_k = \begin{pmatrix} ∂μ_k/∂β' & 0 \\ ∂σ_k/∂β' & ∂σ_k/∂α' \end{pmatrix}, \quad f_k = \begin{pmatrix} y_k − μ_k \\ s_k − σ_k \end{pmatrix}, \quad A_k = \begin{pmatrix} Σ_k & \mathrm{cov}(y_k, s_k') \\ \mathrm{cov}(s_k, y_k') & \mathrm{var}\, s_k \end{pmatrix},

and where s_k = (s_{k11}, …, s_{k1n}, s_{k22}, …, s_{k2n}, …)', with s_{kij} = (y_{ki} − μ_{ki})(y_{kj} − μ_{kj}), is an empirical covariance vector. Note that E(s_k) = σ_k. These mean and covariance estimating equations are attractive in that they arise as maximum likelihood equations under a rich quadratic exponential class for y_k. A drawback, however, is that misspecification of the covariance model σ_k(β, α) can bias the estimator of the hazard ratio parameter β. This can be remedied
by replacing ∂σ_k/∂β' and \mathrm{cov}(y_k, s_k') by zero matrices, giving the simplified estimating equations

    \sum_{k=1}^{K} (∂μ_k'/∂β)\, Σ_k^{-1} (y_k − μ_k) = 0, \qquad \sum_{k=1}^{K} (∂σ_k'/∂α)\, (\mathrm{var}\, s_k)^{-1} (s_k − σ_k) = 0.    (7)

Under Cox-model marginal hazard functions and no censorship, the first of these equations can be written (Prentice and Hsu, 1996), as before, as

    \sum_{k=1}^{K} Z_k' Ω_k^{-1} M_k(T_k) = 0,

while the second equation similarly simplifies to

    \sum_{k=1}^{K} E_k' Φ_k^{-1} L_k(T_k) = 0,

where E_k = ∂ρ_k/∂α', Φ_k is the covariance matrix for \{Λ_{k1}(T_{k1})Λ_{k2}(T_{k2}), Λ_{k1}(T_{k1})Λ_{k3}(T_{k3}), \ldots\} and L_k(T_k) = \{L_{k12}(T_{k1}, T_{k2}), L_{k13}(T_{k1}, T_{k3}), \ldots\}' with

    L_{kij}(T_{ki}, T_{kj}) = M_{ki}(T_{ki})\, M_{kj}(T_{kj}) − ρ_{kij}.
Note that the expectation of both estimating functions is a zero vector even if Ω_k and Φ_k are misspecified. We can adapt these estimating functions to independent right censorship by again inserting baseline hazard function estimators (2) for i = 1, …, n and by replacing T_k by X_k. One may also replace the correlation matrices Ω_k and Φ_k by working covariance matrices, say R_{k1} and R_{k2}, for M_k and L_k respectively, giving the estimating equations

    \sum_{k=1}^{K} Z_k' R_{k1}^{-1} \hat M_k(X_k) = 0, \qquad \sum_{k=1}^{K} E_k' R_{k2}^{-1} \hat L_k(X_k) = 0.    (8)
This notation conceals one important point. For \hat L_k(X_k) to have mean zero under right censorship one must redefine

    L_{kij}(X_{ki}, X_{kj}) = M_{ki}(X_{ki})\, M_{kj}(X_{kj}) − A_{kij}(X_{ki}, X_{kj}),    (9)

where

    A_{kij}(X_{ki}, X_{kj}) = \int_0^{X_{ki}} \int_0^{X_{kj}} Λ_{kij}(dt_i, dt_j)

and

    Λ_{kij}(dt_i, dt_j) = E\{M_{ki}(dt_i)\, M_{kj}(dt_j) \mid T_{ki} ≥ t_i,\ T_{kj} ≥ t_j,\ Z_k\}.

In fact the 'covariance rate function' Λ_{kij}, in conjunction with Λ_{ki} and Λ_{kj}, completely determines the distribution of T_{ki} and T_{kj} given Z_k (Prentice and Cai, 1992). Hence to apply (8) with right censoring one must specify the pairwise survivor functions F_{ij}(t_i, t_j; Z_k). Such a further assumption seems natural, as the cumulative hazard variate correlations are not even identifiable if censorship restricts the support of (X_{ki}, X_{kj}).

One way to implement (8) is to impose constant cross ratio models

    CR_{kij}(t_i, t_j; Z_k) = 1 + θ_{kij}(α)
for each pair of failure time variates (T_{ki}, T_{kj}), i, j = 1, …, n, even though the existence of an overall survivor function F(t_1, …, t_n; Z_k) having these pairwise marginals is yet to be demonstrated. These constant cross ratio assumptions give

    Λ_{kij}(dt_i, dt_j) = A_0\{Λ_{ki}(t_i), Λ_{kj}(t_j); θ_{kij}\}\, Λ_{ki}(dt_i)\, Λ_{kj}(dt_j),    (10)

where

    A_0(v_1, v_2; θ) = (θ + 1)\, e^{v_1 θ} e^{v_2 θ} (e^{v_1 θ} + e^{v_2 θ} − 1)^{-2} − (e^{v_1 θ} + e^{v_2 θ} − 1)^{-1}.

The cumulative hazard correlation ρ_{kij} is linked in a one-to-one fashion to the cross ratio parameter θ_{kij} via

    ρ_{kij} = \int_0^{∞} \int_0^{∞} (e^{v_1 θ_{kij}} + e^{v_2 θ_{kij}} − 1)^{-1/θ_{kij}}\, A_0(v_1, v_2; θ_{kij})\, dv_1\, dv_2,

thereby determining E_k in (8). Independence working matrices R_{k1} = I_n and R_{k2} = I_{n(n−1)/2} can be expected to yield estimators of β and α of acceptable efficiency in most applications. Note that the cross ratio θ_{kij} can be modeled in various ways. For example, one could set θ_{kij} = α_{ij} for all (i, j), could restrict some elements of α_{ij} to be common, or could allow θ_{kij} to depend on Z_k.
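The correspondence between ρ_{kij} and θ_{kij} has no closed form but is easy to evaluate numerically. A small sketch of our own (valid for θ ≠ 0; the infinite range is truncated, so it is reliable only for moderate θ) computing A_0 of (10) and the correlation integral displayed above:

```python
import numpy as np
from scipy.integrate import dblquad

def A0(v1, v2, theta):
    """A_0(v1, v2; theta) as in (10)."""
    s = np.exp(v1 * theta) + np.exp(v2 * theta) - 1.0
    return (theta + 1.0) * np.exp(v1 * theta) * np.exp(v2 * theta) / s**2 - 1.0 / s

def rho_from_theta(theta, upper=40.0):
    """Cumulative hazard correlation rho implied by a constant cross
    ratio 1 + theta (theta != 0); 'upper' truncates the infinite range."""
    surv = lambda v1, v2: (np.exp(v1 * theta) + np.exp(v2 * theta) - 1.0) ** (-1.0 / theta)
    val, _ = dblquad(lambda v2, v1: surv(v1, v2) * A0(v1, v2, theta),
                     0.0, upper, 0.0, upper)
    return val
```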
\hat L_k(X_k) in (8) arises by inserting the baseline estimators (2) into (9). In doing so, simulation studies suggest the use of Kaplan–Meier-type estimators \prod_{u ≤ t_i}\{1 − \hat Λ_{ki}(du)\} in (10), where \hat Λ_{ki}(dt_i) = \exp\{Z_{ki}'β\}\, \hat Λ_{0i}(dt_i). See Prentice and Hsu (1996) for simulations and illustrations of the use of the estimating equations

    \sum_{k=1}^{K} Z_k' \hat M_k(X_k) = 0, \qquad \sum_{k=1}^{K} E_k' \hat L_k(X_k) = 0    (11)
for hazard ratio parameter (β) and pairwise cross ratio parameter (α) estimation. The estimator of β solving (11) is that of Wei, Lin and Weissfeld (1989), while the estimator of α has been shown in simulation studies to have efficiency comparable to the generalized maximum likelihood estimator of Nielsen et al (1992) in the non-regression special case (Hsu and Prentice, 1996). Asymptotic distribution theory, including a consistent variance estimator, is available for solutions to (11) and to the more general estimating equations (8) (Prentice and Hsu, 1996).
4 Discussion
Mean parameter estimating functions can be adapted to allow for right censorship, yielding the maximum partial likelihood estimator of the hazard ratio parameter in Cox's failure time regression model, and yielding the Wei, Lin and Weissfeld (1989) estimator of marginal hazard ratio parameters under an independence working model for a multivariate failure time response. Mean and covariance model estimating functions can extend the regression analysis of multivariate failure time data to the joint estimation of marginal hazard ratio parameters and pairwise cross ratio parameters. These estimating equations (11) involve straightforward computations and the estimated parameters have a ready interpretation.

The principal limitation of the estimating equations (11) relates to the constant cross-ratio assumptions. Under departures from a constant cross ratio the estimates θ̂_{kij}(α̂) presumably have an average cross ratio interpretation, with averaging over the density of (X_{ki}, X_{kj}), k = 1, …, K. As such, the interpretation of θ̂_{kij}(α̂) will unfortunately depend on the censoring distribution, just as the interpretation of the solution to (6) will depend on the censorship under departure from the Cox model. The possibility of estimating average cross ratios, with averaging over the density of the underlying failure times, rather than the observed failure or censoring times, is currently being pursued by graduate student Juan Juan Fan in conjunction with the authors.

Acknowledgement

This work was supported by NIH grant CA-53996.
References

Andersen, P.K. and Gill, R.D. (1982). Cox's regression model for counting processes: A large sample study. Ann. Statist. 10, 1100-1120.
Andersen, P.K., Borgan, O., Gill, R.D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer-Verlag, New York.
Begun, J.M., Hall, W.J., Huang, W.M. and Wellner, J.A. (1983). Information and asymptotic efficiency in parametric-nonparametric models. Ann. Statist. 11, 432-452.
Bickel, P.J., Klassen, C.A., Ritov, Y. and Wellner, J. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins Univ. Press, Baltimore, Maryland.
Breslow, N.E. (1974). Covariance analysis of censored survival data. Biometrics 30, 89-99.
Cai, J. and Prentice, R.L. (1995). Estimating equations for hazard ratio parameters based on correlated failure time data. Biometrika 82, 151-164.
Clayton, D.G. (1978). A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika 65, 144-151.
Clayton, D.G. and Cuzick, J. (1985). Multivariate generalizations of the proportional hazards model (with discussion). J.R. Statist. Soc. A 148, 82-117.
Cox, D.R. (1972). Regression models and life tables (with discussion). J.R. Statist. Soc. B 34, 187-220.
Fleming, T.R. and Harrington, D.P. (1991). Counting Processes and Survival Analysis. Wiley, New York.
Hsu, L. and Prentice, R.L. (1996). On assessing the strength of dependency between failure time variates. Biometrika, in press.
Johansen, S. (1978). The product limit estimator as maximum likelihood estimator. Scand. J. Statist. 5, 195-199.
Kalbfleisch, J.D. and Prentice, R.L. (1973). Marginal likelihoods based on Cox's regression and life model. Biometrika 60, 267-278.
Kalbfleisch, J.D. and Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. Wiley, New York.
Liang, K.Y. and Zeger, S. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13-22.
Nielsen, G.G., Gill, R.D., Andersen, P.K. and Sorensen, T.I.A. (1992). A counting process approach to maximum likelihood estimation in frailty models. Scand. J. Statist. 19, 25-43.
Oakes, D. (1982). A model for association in bivariate survival data. J.R. Statist. Soc. B 44, 414-422.
Oakes, D. (1986). Semi-parametric inference in a model for association in bivariate survival data. Biometrika 73, 353-361.
Oakes, D. (1989). Bivariate survival models induced by frailties. J. Amer. Statist. Assoc. 84, 487-493.
Prentice, R.L. and Cai, J. (1992). Covariance and survivor function estimation using censored multivariate failure time data. Biometrika 79, 495-512.
Prentice, R.L. and Hsu, L. (1996). Regression on hazard ratios and cross ratios in multivariate failure time analysis. Biometrika, in press.
Prentice, R.L. and Zhao, L.P. (1991). Estimating equations for parameters in means and covariances of multivariate discrete and continuous responses. Biometrics 47, 825-839.
Wei, L.J., Lin, D.Y. and Weissfeld, L. (1989). Regression analysis of multivariate incomplete failure time data by modelling marginal distributions. J. Amer. Statist. Assoc. 84, 1065-1073.
Zhao, L.P., Prentice, R.L. and Self, S.G. (1992). Multivariate mean parameter estimation by using a partly exponential model. J.R. Statist. Soc. B 54, 805-811.
305
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
ESTIMATING FUNCTIONS FOR DISCRETELY OBSERVED DIFFUSIONS: A REVIEW Michael S0rensen University of Aarhus Denmark Abstract Several estimating functions for discretely observed diffusion processes are reviewed. First we discuss simple explicit estimating functions based on Gaussian approximations to the transition density. The corresponding estimators often have considerable bias, a problem that can be avoided by using martingale estimating functions. These, on the other hand, are rarely explicit and therefore often require a considerable computational effort. We review results on how to choose an optimal martingale estimating function and on asymptotic properties of the estimators. Martingale estimating functions based on polynomials of the increments of the observed process or on eigenfunctions for the generator of the diffusion model are considered in more detail. The theory is illustrated by examples. In particular, the Cox-Ingersoll-Ross model is considered. Key Words: Approximate likelihood function; asymptotic normality; bias; consistency; Cox-Ingersoll-Ross model; eigenfunctions; inference for diffusion processes; martingale estimating functions; optimal inference; polynomial estimating functions; quasi likelihood.
1
Introduction
Diffusion processes often provide a useful alternative to the discrete time stochastic processes traditionally used in time series analysis as models for observations at discrete time points of a phenomenon that develops dynamically in time. In many fields of application it is natural to model the dynamics in continuous time, whereas dynamic modelling in discrete time contains an element of arbitrariness. This is particularly so when the time between observations is not equidistant. Statistical inference for diffusion processes based on discrete time observations can only rarely be based on the likelihood function as this is usually
306
SORENSEN
not explicitly available. The likelihood function is a product of transition densities, as follows easily from the fact that diffusions are Markov processes, but explicit expressions for the transition densities are only known in some special cases. One way around this problem is to find good approximations to the likelihood function by means of simulation methods for diffusions. This computer-intensive approach has been pursued in Pedersen (1995a, 1995b). Another solution is to base the inference on estimating functions. In this paper we review a number of recent contributions to this approach. The likelihood theory for continuously observed diffusions is well studied. In practice, however, diffusions are not observed continuously, but only at discrete time points or for instance through an electronic filter. There is therefore a need of methods which are applicable in statistical practice, and in recent years this has inspired quite a lot of work on estimation for discretely observed diffusions. The need has been particularly acute in finance where diffusion models must be fitted to time series of stock prices, interest rates or currency exchange rates in order to price derivative assets such as options. In Section 2 we discuss simple explicit estimating functions based on Gaussian approximations to the transition density. The corresponding estimators often have considerable bias, a problem which we discuss in some detail. When the distance between the observation times is sufficiently small, they are, however, useful in practice. Asymptotic results substantiating this claim are reviewed. The bias problems, to a large extend, can be avoided by using martingale estimating functions instead, which are treated in Section 3. Martingale estimating functions are, on the other hand, rarely explicit, and therefore often requires a considerable computational effort. We review results on how to choose an optimal martingale estimating function and on asymptotic properties of the estimators. Martingale estimating functions based on polynomials of the increments of the observed process or on eigenfunctions for the generator of the diffusion model are considered in more detail. A different kind of estimating functions, by which the bias problems discussed in Section 2 can also be avoided, and which have the advantage of being explicit, were recently proposed by Kessler (1996). Unfortunately, these can not be discussed in this relatively short review paper.
2
Simple explicit estimating functions
We consider one-dimensional diffusion processes defined as solutions of the following class of stochastic differential equations (2.1)
307
DIFFUSION PROCESSES
where W is a standard Wiener process. We assume that the drift b and the diffusion coefficient σ are known apart from the parameter θ which varies d in a subset Θ of ΊR . They are assumed to be smooth enough to ensure the existence of a unique weak solution for all θ in θ . The assumption that the drift and the diffusion coefficient do not depend on time is not essential for several of the estimating functions discussed in this paper which can be modified in a straightforward way to diffusions that are not timehomogeneous. Also the assumption that X is one-dimensional is in several cases not needed, but is made to simplify the exposition. The statistical problem considered in this paper is to draw inference about the parameter θ on the basis of observations of the diffusion X at discrete time points: XtQiXti,- ,Xtn, to = 0 < ίi < • < tn. The likelihood function for θ based on Xto, Xtί, • , Xtn is
Ln(θ) = f[p(Ai,Xti_1,Xti;θ),
(2.2)
2=1
where Δ; = tι — tj_i and where y \-ϊ p(Δ, re, y; θ) is the density of X& given XQ = x when θ is the true parameter value. The transition density p is only rarely explicitly known, and when Δ is not small, it can be far from Gaussian. We can, however, obtain a number of useful estimating functions by replacing p by approximations. When Δ is small, we can approximate p by a normal density function. Expressions for the conditional moments of XΔ given XQ can usually not be found, so in order to get an explicit estimating function, the mean value is approximated by x + b(x;θ)A and the variance by σ2(x;θ)A. By using this approximate Gaussian transition density, we obtain an approximate likelihood function, which equals the likelihood function for the Euler-Maruyama approximation (see Kloeden and Platen, 1992) to the solution of (2.1). The corresponding score function is
=Σ where v(x; θ) = σ2{x; 0), and where deb denotes the vector of partial derivatives with respect to θ. Vectors are column vectors. It is, of course, assumed that the partial derivatives in (2.3) exist. Throughout this paper, whenever a derivative appears its existence is implicitly assumed in order to avoid statements of obvious conditions. The cί-dimensional estimating function (2.3) is biased because we have used rather crude approximations for the mean
308
SORENSEN
value and the variance of the transition distribution. Therefore it can only be expected to yield reasonable estimators when the Δj's are small, and we can only expect these estimators to be consistent and asymptotically normal if the asymptotics is not only that the length of the observation interval, £n, goes to infinity, but also that the Δj's go to zero. First consider the estimating function obtained by deleting the quadratic terms from (2.3):
ffi;:ff (2.4) To simplify the exposition, we have here assumed that the observation times are equidistant, i.e. that Δ; = Δ for all i. This is the form (2.3) takes in cases where the diffusion coefficient is completely known, i.e. when it does not depend on 0, but (2.4) can obviously also be used when the diffusion coefficient depends on θ. Another way of obtaining this estimating function is by discretizing the score function based on continuous observation of the diffusion process X in the time interval [0, tn] (see Liptser and Shiryayev, 1977). The discretization is done by replacing Ito-integrals and Riemannintegrals by Ito-Riemann sums. The estimator θn obtained from (2.4), which can also be thought of as a weighted least squares estimator, was studied by
Dorogovicev (1976), Prakasa Rao (1983, 1988) and Florens-Zmirou (1989) in the case where the diffusion coefficient is constant and the parameter θ is one-dimensional. Under various regularity conditions these authors showed that θn is consistent provided Δ n —>> 0 and n Δ n —> oo, where it is assumed that the time between observations, Δ n , depends on the sample size n. Note that n Δ n = tn is the lenght of the observation interval. To prove asymptotic normality a stronger condition is needed. Prakasa Rao (1983, 1988) assumed that Δ n tends to zero sufficiently fast that nΔ^ —> 0, and referred to this condition as a rapidly increasing experimental design assumption. FlorensZmirou made the slightly weaker assumption that nΔ^ —> 0 in her result on asymptotic normality. We shall not state the results of these authors in details as a more general result will be given below. A different type of asymptotics, which has turned out to be relevant in several applications, was studied by Genon-Catalot (1990). She considered the situation where the length of the observation interval n Δ n is fixed and the diffusion coefficient is a constant σ2 tending to zero as the number of observations n tends to infinity. Under reasonable regularity conditions she showed that the estimator θn based on (2.4) is consistent provided σ ~ n~^ where β > 0.5 and asymptotically normal (and asymptotically efficient) under the additional condition β < 1. These various asymptotic results indicate that estimators based on (2.3) or (2.4) behave reasonably well in practice when the time between observa-
309
DIFFUSION PROCESSES
tions Δ is sufficiently small. This has been confirmed in simulation studies, see e.g. Kloeden et al. (1996). However, when Δ is not small, the estimators can be severely biased, as demonstrated in simulation studies by Pedersen (1995a) and Bibby and S0rensen (1995). In practice it can be difficult to determine whether in concrete models Δ is sufficiently small for the estimators to work well. Estimation based on (2.3) or (2.4) has been popular in the econometric literature under the name the generalized method of moments, a somewhat odd name, as the method is obviously not a method of moments, except approximately. The problem with the simple estimating functions (2.4) and (2.3) is that they can be strongly biased. An idea about the magnitude of the bias can be obtained from the expansions (Florens-Zmirou, 1989 and Kessler 1997) E0(XA\Xo
= x) = x
+ Δb{x;θ) + ±A2{b{x;θ)dxb(x;θ)
(2.5)
and
VΆτθ(XΛ\X0 = x) = Aυ(x;θ) + A2[1Ίb(x;θ)dxυ(x;θ) (2.6) 2 3 + υ(x; θ){dxb{x; θ) + \d xv(x; θ)}] + O(Δ ), where EQ and Var# denote expectation and variance, respectively, when θ is the true parameter value, and where d\ denotes the second partial derivative with respect to x. Suppose X is an ergodic diffusion with invariant probability measure μ# when θ is the true parameter value. If XQ ~ μe, we find, by (2.5), the following expression for the bias of the estimating function (2.4) ίA2nEμθ{dθb(θ)[b(θ)dxb(θ)/v(θ)+
Eθ(Hn(θ))=
l (2.7)
For a function (x,θ) «-> g{x;θ) we use the notation Eμθ(g(θ)) = /g(x; θ)dμo(x). When the initial distribution is different from μ#, (2.7) is, by the ergodic theorem (see e.g. Billingsley, 1961 or Florens-Zmirou, 1989), still a good estimate of the bias provided the number of observations is sufficiently large. Under weak standard regularity conditions (e.g. conditions similar to Condition 3.3 below) it follows that the asymptotic bias (n -» oo, Δ fixed) of the estimator θn derived from (2.4) is Eμβ{dθb(θ)[b(θ)dxb(θ)/v(θ)+ 2Eμβ{(dθb(θ))Vv(θ)}
A Δ
\d2xb{θ)}} , +
310
SORENSEN
in the case of a one-dimensional parameter. The expression analogous to (2.7) for the estimating function (2.3) is Eθ(Hn(θ))=
±AnEμβ{dθlogv(θ)
[ \b{θ)dxlogv(θ) + dxb(θ)
(2.9)
as is easily seen from (2.6). The fact that the bias of the estimating function (2.3) is of order nΔ when the diffusion coefficient depends on θ indicates that the corresponding estimator has a considerable bias even for small values of Δ. The reason is that in deriving (2.3) we used an approximation of the variance of the transition distribution that was too crude, see the discussion below. Example 2.1 Consider the Cox-Ingersoll-Ross model, which is widely used in mathematical finance to model interst rates (Cox, Ingersoll and Ross, 1985). The model is given by the stochastic differential equation dXt = (α + ΘXt)dt + σy/XtdWu
Xo = ^o > 0,
where 0 < 0 and σ > 0. The model has also been used in other applications, e.g. mathematical biology, for a long time. The state space is (0, oo). It is not difficult to derive an estimator for the parameter vector (α, θ, σ2) from (2.3). To simplify things we assume equidistant sampling times.
α
~θ
— —
(Xtn ~~ χo)(n Σ?=i Xu-ι)~
— ΣΓ=i Xti.
=
For parameter values where X is ergodic, an expression for the bias of these estimators when n is large can easily be found using the ergodic theorem and the fact that the invariant probability measure for the CoxIngersoll-Ross model is a gamma distribution. The bias can for some parameter values be dramatic even for rather small values of Δ, see Bibby and S0rensen (1995). D
The bias considerations above raise the question whether better estimators can be obtained by improving the approximations of the mean and variance in the Gaussian approximation to the transition distribution. Useful approximations were derived by Kessler (1997) under the following condition.
311
DIFFUSION PROCESSES
Condition 2.2 For every θ the functions b(x θ) and σ(χ-,θ) are K times continuously differentiable with respect to x and the derivatives are of polynomial growth in x uniformly in θ.
In order to formulate Kessler's expansions we need the generator of the diffusion process given by (2.1), i.e. the differential operator
Lθ = b(x;θ)—+
lυ(x;θ)4-ή.
z
dx
With the definition
(2.10)
-1~9
~~ %4f{χ),
(2.ii)
i=0
where f(x) = x, and where L\ denotes i-fold application of the differential operator £#, Kessler (1997) proved that + O(Δ* + 1 ),
o = x)= rk(A,x;θ)
(2.12)
provided k < K/2 + 1. Note that (2.5) is a particular case of (2.12). The dependence of the O-term on x and θ has been suppressed here. Kessler (1997) gave an upper bound for the term O(Δ Λ + 1 ) which is uniform in θ. For fixed x,y and θ the function (y — rfc(Δ,α;;0))2 is a polynomial of order 2k in Δ. Define gJχ θ(y), j = 0,1, , k by
(y-rk(A,x;θ))2
=
i=o
andΓfc(Δ,z;0) by k
k-j
j=Q
r=0
^
r
T
'
Kessler (1997) showed that
EΘ([XA
~ rk{Δ, x θ))2\Xv
= x)= Γ ^ ( Δ , x; θ) + O ( Δ * + 1 )
(2.14)
for k < K/2 + 1. Also in this case he gave an upper bound for the term We can now obtain an approximation to the likelihood function (2.2), which is considerably better than the approximation we used above, by replacing the transition density y \-> p(Δ,£,y;0) by a normal density with
312
SORENSEN
mean value r*fc(Δ, x; θ) and variance I\+i(Δ, x\ θ) with k < K/2. The corresponding estimating function (approximate score function) is
^
%
U
(2.15)
i
We have again allowed the time between observations to vary. Example 2.3 To avoid complicated expressions we consider as an example the Ornstein-Uhlenbeck process dXt = ΘXtdt + σdWu
^o = so,
where θ € IR and σ > 0. Long, but easy, calculations show that the estimators for θ and σ 2 based on Hn are
Θ2,n
= Δ"V2Qn -1-1)
-22
__
n Σ
provided . 1. o
To simplify matters we have assumed that the observation times are equidistant. There are, in fact, two solution for 0, but a moments reflection reveals that the other solution is not a good estimator. D
Suppose X is ergodic with invariant probability measure μg, all moments of which are finite. Then we find, using (2.12) and (2.14), that the bias of k+1 the estimating function Hn is of order O(nA ). This indicates that for k sufficiently large the estimator obtained from Hn ' is only slightly biased when Δ is not too large. This is indeed the case. In order to avoid technical problems Kessler (1997) modified the approximate Gaussian likehood function we used to derive the estimating function (2.15) by replacing the functions \og{Tk+ι{Δi,x',θ)/(Δiv(x;θ))} and Aiv(x;θ)/Γk+ι{Δi,x]θ) by Taylor expansions to order k. The estimating function derived from Kessler's approximate likelihood function of order k
313
DIFFUSION PROCESSES
differs only from Hn by terms of order O(Ak+ι). Therefore the estimator based on Hn behaves in the same way as the estimator based on Kessler's (1997) approximate likelihood function, for which he gave results under (essentially) the following conditions. Condition 2.4 1) For every θ there exists a constant CQ such that \b(x;θ)-b(y;θ)\ +
\σ(x;θ)-σ(y;θ)\
for all x and y in the state space. 2)'mfXiθυ(x,θ)
>0.
3) The functions b(x; θ) and σ(x\ θ) and all their partial x-deriυatives up to order K are three times differentiate with respect to θ for all x in the state space. All these derivatives with respect to θ are of polynomial growth in x uniformly in θ. 4) The process X is ergodic for every θ with invariant probability measure μ$. All polynomial moments of μe are finite. 5) For allp>0
and for all θ supt Eθ(\Xt\p) < oo.
Kessler further assumed that θ = (α, β) belongs to a compact subset Θ of IR2, that the drift depends only on a and that the diffusion coefficient depends only on β. Moreover, he imposed an obvious identifiability condition. The assumption that θ belongs to a compact set is only made to avoid technical problems concerning the existence of a maximum of the approximate likelihood function. Kessler (1997) proved the following result about the asymptotic properties of the estimator θk^n which maximizes his approximate likelihood function. The observation times are assumed to be equidistant with spacing Δ n , which depends on the sample size. Theorem 2.5 Assume that k < K/2 and that Condition 2.2 and Condition 2.4 hold. Then for all θ e θ (2.16) in PQ-probability
as n —>> oo, provided Δ n —> 0 and n Δ n —» oo.
//, in addition, n Δ
2A:+1
—• 0 and θ G int θ , then as n -» oo
(2.17)
314
SORENSEN
in distribution under Pg, where _ ( Eμβ[(dab(a))2/υ(β)}
0
\
The estimating functions considered in this section were all derived from an approximate (or pseudo) likelihood function. This has the advantage that if there are more than one solution to the estimating equation, we can choose the one that is the global maximum point for the pseudo likelihood function. The estimating functions considered in the next section do not generally have this property.
3
Martingale estimating functions
The problems caused by the bias of the estimating functions considered in Section 2 can most conveniently be avoided by using martingale estimating functions. We shall therefore in this section, for the same kind of data as those considered in Section 2, study estimating functions of the form
2=1
where the function g(Δ,£,y;0) satisfies p(Δ, x, y; θ)p{A, x, y; θ)dy = 0
(3.2)
for all x, Δ and θ. Here, as in the previous section, y ι-> p(Δ, x, y; θ) denotes the transition density, i.e. the density of X& given Xo = x. In most cases it is not easy to find g's that satisfy (3.2) since p is usually not known, but such #'s can always be found numerically, as we shall see later. Under (3.2) Gn(θ) is a martingale when θ is the true parameter value. In particular, Gn(θ) is an unbiased estimating function. If θ is d-dimensional, we usually take g to be d-dimensional too. With the bias problem out of the way, the question of how to choose the estimating function in an optimal way becomes more interesting. Godambe and Heyde (1987) gave criteria for choosing within a class of martingale estimating functions the one which is closest to the true (but for diffusion models usually not explicitly known) score function (fixed sample criterion) or the one which has the smallest asymptotic variance as the number of observations tends to infinity (asymptotic criterion). Suppose we have N real valued functions hj(Δ,α;,y;0), j = 1, — , JV, each of which satisfies (3.2) and which are all natural choices for defining a martingale estimating function. Then every function of the form
315
DIFFUSION PROCESSES
N
g(A,x,y;θ) = ^ • ( Δ ^ i=ι
^ Δ ^ y
fl),
(3.3)
where αj(Δ,£;0), j = l, ,iV, are arbitrary functions, can be used to define a martingale estimating function by (3.1). If θ is d-dimensional, we will usually try to find d-dimensional α's. Let Q denote the class of ddimensional martingale estimating functions of the form (3.1) with g given by (3.3). The following result by Kessler (1995) tells how to find the optimal estimating function in the sense of Godambe and Heyde (1987) within the class Q. We need the further assumption that for fixed Δ,a; and θ the functions /ij(Δ,z,y;0), j = 1, • ,iV are square integrable with respect to the transition distribution. Then the set of all real-valued functions of the form (3.3) is a (finite dimensional and hence closed) linear sub-space of L2(p(A,x,y,θ)dy). We denote this subspace by Theorem 3.1 Suppose the transition density p is differentiable with respect to θ and that for all fixed A,x and θ the functions de{ logp, i = 1, ,d, belong to L2(p(A,x,y\θ)dy). Denote by N
A.x.y the projection in L2(p(A,x,y;θ)dy)
(3.4)
θ)
of d$i logp onto Ή(Δ,rr;0), and define
^ ^ p ^
ί),
(3.5)
where ά^ is the d-dimensional vector ( α | l 5 , α ^ ) T (Γ denotes transposition). //g*(Δ,x,y;fl) is continuously differentiable with respect to θ for all fixed Δ, x and y, then G^(θ) is the optimal estimating function within Q with respect to the asymptotic criterion as well as to the fixed sample criterion of Godambe and Heyde (1987). The (Xβ 's are determined by the following linear equations / a*u(A,x;θ) C(A,x;θ)\
\
:
=B<(Δ,x;β),
(3.6)
\a*Ni(A,x;θ) J for i = 1,
, d, where C = {CM} and Bi = (bψ,
ckι(A,x;θ)=
and
, bjj)τ
are given by
I hk(A,x,y;θ)hι{A,x,y;θ)p{A,x,y;θ)dy
(3.7)
316
SORENSEN
bf{Δ,χ ,θ) = I hj(A,x,y θ)dθip(A,x,y θ)dy.
(3.8)
When the functions hj(A,x,y;θ), j = 1, , JV are linearly independent in L 2 (p(Δ,x,y;0)ώ/), the matrix C is obviously invertible. The condition that the estimating function is differentiable with respect to θ is really only a technical matter in the Godambe-Heyde theory, and the estimating function given by (3.5) is no doubt also the most efficient in the class Q under a weaker condition. From (3.6), (3.7) and (3.8) we see that it is not difficult to impose conditions on the functions /ij, j — 1, , TV which ensure that g* is continuously differentible with respect to θ. Note that under weak conditions ensuring that differentiation and integration can be interchanged (e.g. Condition 3.3 below), the 6j's given by (3.8) can also be expressed as (3.9) Results similar to Theorem 3.1 hold for general Markov processes and for more general classes of martingale estimating functions than those given by (3.3), see Kessler (1995). We next give a result about the asymptotic behaviour of the estimator obtained from a general martingale estimating function Gn(θ) of the form (3.1) with g given by (3.3), where the ay's are d-dimensional and the Λj's satisfy (3.2). We do this under the assumption that the diffusion is ergodic, which is ensured by the following condition. Here s(x; θ) denotes the density of the scale measure of X: = exp ( where x# is an arbitrary point in the interior of the state space of X. Condition 3.2 The following holds for all θ G Θ: s(x;θ)dx=
/
s{x\θ)dx — oo
J-oo
and roo
/
[s(x; θ)υ(x; θ^dx
= A{θ) < oo.
./-oo
If the state space of X is not the whole real line, the integration limits —oo and oo should be changed accordingly. Under Condition 3.2 the process X is ergodic with an invariant probability measure μ$ which has density
317
DIFFUSION PROCESSES
[A(θ)s(x;θ)υ(x;θ)]~~1 with respect to the Lebesgue measure. Define a probability measure Qfr on IR2 by Q$(x, y) = μθ{x) x p(Δ, x, y; 0).
(3.11)
For a function # : IR2 4 E w e use the notation Q$(g) = f gdQfi. The predictable quadratic variation of the martingale Gn(θ), when θ is the true parameter value, is
T
(G(θ))n = £ A(A, Xti_, Θ) C(A, Xu_, Θ)A(A, Xu_, θ),
(3.12)
2=1
where A(Δ,α;;0)ij = aij(A,x;θ). As above αij denotes the j ' t h coordinate of the d-dimensional vector OL{. We impose the following condition on the estimating functions. Prom now on ΘQ will denote the true value of θ. Condition 3.3 The following holds for all θ G Θ: 1) The function g is continuously differentiable with respect to θ for all Δ,# and y . The functions (rr,y) H * 9 ^ 5 j ( Δ , r r , 2 / ; ^ ) , i,j = 1, , d , where gj denotes the j'th coordinate of g, are locally dominated square integrable with respect to Q$Q, and the matrix D(ΘQ) given by N
D(θo)iidd = Qh(d = £ Q£ Q£o[aki(A;θo)dθjhk(A Qh(dΘjΘgi(Δ;θ gi(Δ;θ0)) 0))
θo)]
k=l
is invertible. 2) Each coordinate of the function (x,y) •->• ff(Δ,x,y;θ) is in
L2{Q$).
Theorem 3.4 Suppose ΘQ G int Θ and that the Conditions 3.2 and 3.3 hold. Then an estimator θn that solves the estimating equation Gn(θ) = 0
(3.13)
exists with a probability tending to one as n -> oo under PQ0 . Moreover, as n —> o o , θn -> θ0 in probability under P^o, and ^Ψn
- θo) ->
ι
1 τ
N{O,D(θo)- V(θo)(D(θo)- ) )
in distribution under Pβ0, where V(θ0)
= Eμβn (A(A; Θ)TC(A; Θ)A(A; θ)).
(3.14)
318
SORENSEN
Theorem 3.4 can be proved along the same lines as Theorem 3.3 in Bibby and S0rensen (1995), see also Kessler (1995) and Kessler and S0rensen (1995). Similar proofs of similar results can be found in several papers. Here Condition 3.1 (c) in Bibby and S0rensen (1995) has been omitted because Lemma 3.1 in Bibby and S0rensen (1995) remains valid without this condition as follows from Theorem 1.1 in Billingsley (1961a) and the central limit theorem for martingales in Billingsley (1961b). In fact, a multivariate version of the central limit theorem is needed here, but in the relatively simple ergodic case considered here this easily follows from the one-dimensional result by applying the Cramer-Wold device. Under Condition 3.3 the 6j's given by (3.8) can also be expressed by (3.9), so D(θo) = —V(ΘQ) for the optimal estimating function G*(0) since here the α's are given by (3.6). Hence the asymptotic covariance matrix of the estimator based on G*(0) is given by 3.1
Polynomial estimating functions
Let us first consider linear martingale estimating functions, i.e. estimating functions of the type Xu-ι',θ)l
(3.15)
where a is d-dimensional and where F(A,x;θ)
= EΘ(XA\XO = x).
(3.16)
In most cases the mean value of the transition distribution is not explicitly known so that it must be determined numerically. This is, however, relatively easy to do using suitable methods from Kloeden and Platen (1992). It is certainly much easier than to determine the entire transition density numerically. Estimating functions of the type (3.15) were studied in Bibby and S0rensen (1995). The optimal linear estimating function is (Bibby and S0rensen, 1995)
i l
(3.17) where Φ(Δ,x;0) = V
(3.18)
Calculation of a derivative of a function that has to be determined numerically is a considerably more demanding numerical problem than determination of the function itself. Pedersen (1994) proposed a numerical procedure
319
DIFFUSION PROCESSES
for determining dβF(A^x;θ) by simulation, which works in practice, but it is easier to use the following approximation to the optimal estimating function:
Khn(θ)
=Σdθb(Xtι_1;θ)v(Xti_1',θ)-1[Xti
-FiAuXu^
θ)],
(3.19)
i=l
which is obtained from K*n by inserting in the weight function dgF/Φ the first order approximations to F and Φ given by (2.5) and (2.6). The estimating function K\^n can also be obtained from the estimating function (2.4) by subtracting its compensator in order to turn it into a martingale and thus remove its bias, see Bibby and S0rensen (1995). It is very important that we have only made approximations in the weight function and not in the term Xt. - F ί Δ ^ X ^ fl), since such an approximation would destroy the martingale property, and hence the unbiasedness, and would thus reintroduce the problems encountered in Section 2. An approximation of the weights c^F/Φ only implies a certain loss of efficiency. Bibby and S0rensen (1995) showed that expansions in powers of Δ of the asymptotic variances of the estimators based on K{n and K^n agree up to and including terms of order O(Δ 2 ), so for small values of Δ there is not much loss of efficiency in using the approximation. Calculations and simulations for a number of examples indicate that the loss of efficiency is often rather small, see Bibby and S0rensen (1995). The linear estimating functions are useful when mainly the drift depends on the parameter θ. If only the diffusion coefficient depends on 0, while the drift is completely known, the linear estimating equations do not work. If the diffusion coefficient depends considerably on 0, it is an advantage to use second order polynomial estimating functions of the type
K2,n(θ) = Y^{a{^i,Xti_ι-θ)[Xti-F{^Xti_ι
θ))
(3.20)
The optimal estimating function, K^n, of this type is given by n
.*rτ.^ _
dθΦ(x;θ)η(x;θ)-dθF(x;θ)*(x;θ)
and „
^F(rr;fl)»7(a:;β) - dθΦ(x; θ)Φ(x; θ)
(32 2 )
320
SORENSEN
where the Δ's have been omitted, η(x;θ)=Eθ([XA-F(x;θ))z\X0
(3.23)
= x)
and Φ(x;
θ) = EΘ([XA - F(x; 0)]4|XO = x) - Φ(z; θf.
(3.24)
An approximation to the optimal quadratic estimating function is
*(«> = Σ { a ' v fχ t "~'.^ ί*« -f < Δ » *•<-•• *>]
<325 >
This estimating function is similar to (2.3), but it is unbiased and therefore generally gives a far better estimator. It is obtained from the optimal quadratic estimating function, K^^ by using Gaussian approximations to (3.23) and (3.24), i.e. η(x]θ)=0 and Φ(z;0)=2Φ(α;;0)2, and then using the first order approximations given by (2.5) and (2.6). Again it is important that we only make approximations in the weights α and /?, so that the unbiasedness is preserved. Quadratic estimating functions were treated in Bibby (1994) and Bibby and S0rensen (1996). Higher order polynomial estimating functions were investigated by Pedersen (1994) and Kessler (1995). Some times there can be good reasons to omit lower order terms in a polynomial estimating function, for an example of this see Bibby and S0rensen (1997). Example 3.5 Let us return to the Cox-Ingersoll-Ross model considered in Example 2.1. For this model the optimal estimating function given by (3.21) and (3.22) can be explicitly found (Bibby and S0rensen, 1996), but the corresponding estimating equation must be solved numerically. In the case of equidistant sampling times the approximately optimal estimating function (3.25) yields the following explicit estimators (Bibby and S0rensen, 1995, 1996):
Oίn
=
Όl = Tϊ
:? = 1 X^ (Xti - F(Δ, Xu_, an, ~θn)Ϋ «
yγj
•**•
1
I-LL /
A
-wr
~
7\
\
'
321
DIFFUSION PROCESSES
where F(Δ,x\a,θ) = [(a + θx)eΘA - a]/θ and φ*(Δ,x\a,θ) = |[(α + 2θx)e2ΘA - 2(α + θx)eΘA + a]θ~2. The estimators exist provided the expression for eθnA is positive. A simulation study in Bibby and S0rensen (1995) indicates that these estimators are quite good. D
3.2
Estimating equations based on eigenfunctions
The polynomial estimating functions are a generalization of the method of moments to Markov processes. They can also be thought of as approximations to the true score function, which are likely to be good when the time between observations is small enough that the transition density is not too far from being Gaussian. There is therefore no reason to believe that polynomial estimating functions are in general the best possible choise when the time between observations is large and the transition distribution is far from Gaussian. We shall therefore conclude this paper by discussing a type of martingale estimating functions that can be more closely tailored to the type of diffusion model under consideration. These estimating functions were proposed and studied by Kessler and S0rensen (1995). A twice differentiate function φ(x\ θ) is called an eigenfunction for the generator LQ (given by (2.10)) of the diffusion process (2.1) if θ),
(3.26)
where the real number λ(θ) is called the eigenvalue corresponding to φ{x\ θ). Under weak regularity conditions, see Kessler and Sorensen (1995), Eθ(φ(XA]θ)\X0
= x) = e-χWAφ(x;θ).
(3.27)
We can therefore define a martingale estimating function by (3.1) with N
,*,y; θ) = Σ <*i(Δ,x\ 0)to(y; θ) - e" A W A ^(a;; 0)],
(3.28)
where φι( ; 0), , ΦN{Ί θ) are eigenfunctions for LQ with eigenvalues λi(0), ,λ,v(0). The optimal estimating function of this type is given by (3.6) with
ckl(A,χ θ) = I φk(y;θ)φι(y;θ)p(A,x,y]θ)dy
and
(3.29)
322
SORENSEN
bf(A,x;θ) = -fdΘiφj(y,θ)p(A,x^θ)dy
+ dθι[e-χ^Aφj(x,θ)}.
(3.30)
Statistical inference based on this optimal estimating function is invariant under twice continuously differentiable transformations of data, see Kessler and S0rensen (1995). After such a transformation the data are, by Ito's formula, still observations from a certain diffusion process, and the eigenfunctions transform in exactly the way needed to keep the optimal estimating function invariant. Inference based on polynomial estimating functions is obviously not invariant under transformations of the data. Apart from this theoretical advantage, the optimal estimating functions discussed here have clear numerical advantages over the optimal polynomial estimating functions. As discussed earlier, determination of quantities like d$F in (3.17) is a difficult numerical problem. In (3.30) the derivative is under the integral sign, which makes determination of the optimal weights in estimating functions of the type (3.28) a much simpler numerical problem than the similar problem for polynomial estimating functions. Moreover, EQ(Φ(XΔ\Θ)\X§ = x) is explicitly known, so numerical inaccuracies cannot destroy the martingale property and the unbiasedness of these estimating functions. It might in some applications be reasonable to obtain a quick estimator by reducing the numerical accuracy when determining the weights, αy, j = 1, , N. For the estimating equations based on eigenfunctions this only implies a certain loss of efficiency, whereas the consistency of the estimators is preserved. It is also worth noting that for models where all eigenfunctions are polynomials or polynomials of the same function, the optimal weights given by (3.29) and (3.30) can be explicitly calculated, see Kessler and S0rensen (1995). The disadvantage of these estimating functions, on the other hand, is that it is not always possible to find eigenfunction for the generator of a given diffusion model. In such cases the polynomial estimating functions, in particular the quadratic, provide a very useful alternative. Example 3.6 For the Cox-Ingersoll-Ross model the eigenfunctions are the Laguerre polynomials, and we obtain the polynomial estimating functions discussed in the previous subsection, see Example 3.5. D Example 3.7 A more interesting example is the class of diffusions which solve dXt = -0tan(Xt)cJt + dW
u
*o =
For θ > \ the process X is an ergodic diffusion on the interval (-π/2,7r/2), which can be thought of as an Ornstein-Uhlenbeck process on a finite interval. The eigenfunctions are φi{x;θ) = Cf(sin(x)), i = 0,1, , with
323
DIFFUSION PROCESSES
eigenvalues i(θ + i/2), i = 0,1, , where Cf is the Gegenbauer polynomial of order i. The optimal estimating function based on any set of eigenfunctions can be found explicitly, see Kessler and S0rensen (1995). The optimal estimating function based on the first non-trivial eigenfunction, sin(x), is
When Δ is small the optimal estimating function can be approximated by Gn(θ) = 1=1
which yields the explicit estimator
~
θn = - Δ
log
/Σ? =1 sin(^_Jsin(^)\ V
Σi=ism 2 ( A ^ J
J
- 1/2,
provided the numerator is positive. Simulations indicate that this estimator is often almost as efficient as the optimal estimator based on G*, see Kessler and S0rensen (1995). • Acknowledgement: Thanks are due to a referee for a careful reading of the manuscript.
References Bibby, B.M. (1994): Optimal combinations of martingale estimating functions for discretely observed diffusion processes. Research Report No. 298, Department of Theoretical Statistics, University of Aarhus. Bibby, B.M. and S0rensen, M. (1995): Martingale estimating functions for discretely observed diffusion processes. Bernoulli 1, 17-39. Bibby, B.M. and S0rensen, M. (1996): On estimation for discretely observed diffusions: A review. To appear in Theory of Stochastic Processes 2 (18), 49-56. Bibby, B.M. and S0rensen, M. (1997): A hyperbolic diffusion model for stock prices. Finance and Stochastics 1, 25-41. Billingsley, P. (1961a): Statistical Inference for Markov Processes. The University of Chicago Press, Chicago.
324
SORENSEN
Billingsley, P. (1961b): The Lindeberg-Levy theorem for martingales. Proc. Amer. Math. Soc. 12, 788-792. Cox, J.C., Ingersoll, J.E. and Ross, S.A. (1985): A theory of the term structure of interest rates. Econometrica 53, 385-407. Dorogovcev, A. Ja. (1976): The consistency of an estimate of a parameter of a stochastic differential equation. Theor. Probability and Math. Statist. 10, 73-82. Florens-Zmirou, D. (1989): Approximate discrete-time schemes for statistics of diffusion processes. Statistics, 20, 547-557. Genon-Catalot, V. (1990): Maximum contrast estimation for diffusion processes from discrete observations. Statistics 21, 99-116. Godambe, V.P. and Heyde, C.C. (1987): Quasi likelihood and optimal estimation. Int. Statist. Rev. 55, 231-244. Heyde, C.C. (1988): Fixed sample and asymptotic optimality for classes of estimating functions. Contemporary Mathematics 80, 241-247. Kessler, M. (1995): Martingale estimating functions for a Markov chain. Preprint, Laboratoire de Probabilites, Universite Paris VI. Kessler, M. (1996): Simple and explicit estimating functions for a discretely observed diffusion process. Research Report No. 336, Department of Theoretical Statistics, University of Aarhus. Kessler, M. (1997): Estimation of an ergodic diffusion from discrete observations. Scand. J. Statist. 24, 211-229. Kessler, M. and S0rensen, M. (1995): Estimating equations based on eigenfunctions for a discretely observed diffusion process. Research Report No. 332, Department of Theoretical Statistics, University of Aarhus. Kloeden P.E. and Platen, E. (1996): Numerical Solution of Stochastic Differential Equations. Springer-Verlag, New York. Kloeden, P.E., Platen, E., Schurz, H. and S0rensen, M. (1996): On effects of discretization on estimators of drift parameters for diffusion processes. J.
DIFFUSION PROCESSES
325
Appl. Prob. 33, 1061-1076. Liptser, R.S. and Shiryayev, A.N. (1977): Statistics of Random Processes I, II. Springer-Verlag, New York. Pedersen, A.R. (1994): Quasi-likelihood inference for discretely observed diffusion processes. Research Report No. 295, Department of Theoretical Statistics, University of Aarhus. Pedersen, A.R. (1995a): A new approach to maximum likelihood estimation for stochastic differential equations based on discrete observations. Scand. J. Statist 22, 55-71. Pedersen, A.R. (1995b): Consistency and asymptotic normality of an approximate maximum likelihood estimator for discretely observed diffusion processes. Bernoulli 1, 257-279. Prakasa Rao, B.L.S. (1983): Asymptotic theory for non-linear least squares estimator for diffusion processes. Math. Operationsforsch. u. Statist., Ser. Statist. 14, 195-209. Prakasa Rao, B.L.S. (1988): Statistical inference from sampled data for stochastic processes. Contemporary Mathematics 80, 249-284.
327 Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
FITTING DIFFUSION MODELS IN FINANCE Don L. McLeish and Adam W. Kolkiewicz University of Waterloo ABSTRACT This paper is concerned with the problem of estimation for stochastic differential equations based on discrete observations when the likelihood formula is unknown. Often in the financial literature the first order discretetime approximation to the diffusion process is considered adequate for the purpose of simulation, estimation and fitting the model to historical data. We propose methods of estimation based on higher order Ito-Taylor expansions. Different methods of generating optimal estimating functions are considered and a method of quantifying the loss of information due to using lower order approximations is proposed. An important feature of these methods is that an assessment of the goodness of fit to data is possible. These ideas are illustrated using a model which generalizes most of the single factor diffusion models of the short-rate interest rate used in finance. Key Words: Diffusion models, estimating functions, finance.
1
Introduction
Many models common in finance take the form of one or more diffusion equations. Such equations are generally described by means of a stochastic differential equation of the form dXt= a(Xt)dt + σ(Xt)dWt,
0 < t < Γ,
(1.1)
where Wt is an ordinary Wiener process, and the drift coefficient a and the diffusion coefficient σ may depend on unknown parameters. Markov diffusion models have played a pre-eminent role in the theoretical literature on the term structure of interest rates (e.g. see Brennan and Schwartz (1979),
328
MCLEISH AND KOLKIEWICZ
Cox, Ingersoll and Ross (1985), and Longstaff, F.A. (1989)). A few of the most common models are listed in Table 1. Others, such as geometric Brownian motion, are special cases.
Table 1 Model α(x,0) σ{x,θ) Vasicek θι + θ2x θs Cox, Ingersoll, Ross θι + θ2x θsxι/2 Brennan, Schwartz θx + θ2x θzx Black, Karasinski I9\x + θ2x log(x) Θ3X Cox θλx θ2xθs Pearson, Sun θλ + θ2x {θ3 + Θ4X Constantinides-Ingersoll θxzl2 0 In this paper, we discuss the estimation of parameters θ{ in models such as those above for discretely sampled data. That is, on the basis of observations on Xt at discrete time points £χ < i2 < ..., we wish to construct reasonably efficient estimators of the parameters. Let us consider for example a process defined by the following diffusion equation; dXt = (α + βXt)dt + cXΊtdWt. This model generalizes all but one of the diffusions used above and is investigated by Chan, Karolyi, Longstarr and Sanders (1992). When β is negative, the process is mean reverting in the sense that it tends towards the value —α//?, its equilibrium mean. In this case, since the diffusion coefficient is 0 at 0 and the drift term positive in a neighbourhood around 0, the process, if initialized at a positive value Xo, remains positive with probability 1. This is a simple consequence of Theorem 2, page 149 of Gihman and Skorohod (1972). Provided that the function below is integrable, this process has equilibrium distribution given by the probability density function K
and this equilibrium distribution is well-defined, for example, for 7 > 1. Note that although the process is driven by a Brownian motion, the stationary distribution has tails far from Gaussian. Indeed, moments are only finite up to order less than 27 — 1.
329
DIFFUSION MODELS
Now consider a naive discrete time approximation to this process. The obvious discrete approximation is Xt+h-Xt=
a{Xt)h + σ(Xt)et,
where et is a sequence of independent normal random variables with mean 0 and variance h. This simple Euler approximation to the process has some undesirable features. For example, while the original process has an equilibrium distribution and remains positive, the discrete approximation may have neither property. Although the qualitative behaviour of the discrete and continuous processes differ, the maximum likelihood estimator of the drift parameters have similar forms. For example, for the continuous time process observed on the interval [0,T], the maximum likelihood estimator of the parameters a and β are given by the solutions of the two estimating equations
[
τ
Jo
2l
x; dxt
2Ί
- / x; (ά Jo
+ βxt)dt = o
- / xl~2^(ά + βXt)dt = 0 Jo
(1.2) (1.3)
and the maximum likelihood estimator for the discretely observed process is an analogous function of Xnh- Generally, under reasonable conditions, the continuous time estimator is consistent as the period of observation approaches infinity i.e. T —> oo. The discrete time estimators are consistent as T -» oo and h -» 0. The tails of the stationary distribution are large, certainly substantially larger than those of the normal distribution. A diffusion model, although driven by a process with Gaussian tails, can generate a process with tails similar to those, for example, of the stable distributions with index less than two. In order to determine whether diffusion models provide an adequate fit to financial data, we begin with data consisting of the yield of 30 year US bonds, the data obtained daily over the period April 13, 1987 to June 13, 1994. There are a total of 1818 recorded daily observations in this period. We begin by attempting to fit a general diffusion model of the above form (cf. Chan, Karolyi, Longstarr and Sanders (1992)) dXt = (α + βXt)dt + cXJdWt. The yields are plotted in Figure 1. We discuss the estimation of the parameters in the next section. At the moment we simply note that estimated values in this case are ά = 3.1827, β = -0.3962, c = 0.0075, 7 = 2.5813, indicating a tendency for the process to fluctuate around the mean —ot/β of approximately eight
MCLEISH AND KOLKIEWICZ
330
30 YEAR US BOND YIELD, April 87-June 94
Figure 1:
percent. The graph of the fitted equilibrium probability density function in Figure 2 further illustrates that the right tails are distinctly non-Gaussian. Does the fitted diffusion adequately explain the tails of the increments of the process? Consider the standardized "Euler residuals", defined by rz = (AXti-a(Xti)Ati)/σ(Xti), where Ati =. t i + 1 - U and ΔXU = X fc+1 - Xu. Provided that the discretization intervals are sufficiently small, this should be approximately a sequence of independent random normal variables. Figure 3 displays a Normal probability plot of these values against the normal quantiles. This plot shows clear evidence that the residuals are non-normal, indeed have tails more like those of a stable law. In fact if we fit a symmetric stable law to these residuals, we obtain index α =1.69. Almost the same value is obtained if we fit a symmetric stable law to the marginal distribution of the increments ΔXt..
DIFFUSION MODELS
331 STATIONARY DISTRIBUTION, BOND DATA
Figure 2:
Normal Probability Plot
0.003 0.001
Figure 3:
We seek an explanation of these tails in the Ito-Taylor expansion of the diffusion equation. According to the Ito-Taylor expansion (cf. Kloeden and Platen (1992), page 164), AXti can be expressed using a linear combination
332
MCLEISH AND KOLKIEWICZ
of the Hermite polynomials in the normalized increments of the Brownian motion. Indeed, with Z\ = ΔW^/yΔ^, we have ΔXU - E(ΔXU)
= αiZi + 02 (Z? - 1) + α 3 Zi(Z? - 3) + α 4 Z 2
(1.4)
where the coefficients α; are functions of Xt{ and Z 2 is a standard normal random variable independent of Z\. Notice that the distribution of the increment, centered at its expectation is not exactly normal, and indeed if the coefficients α 2 , and a^ are reasonably large compared with αi, they increase the weight in the tails of the distribution. Two questions arise immediately from this observation. • How do we use the representation (1.4) to estimate parameters in a diffusion? • Does this representation adequately explain the increased weight in the tails of the residuals? Despite a vast literature, many practical issues related to the problem of estimation and fitting a model from discretely sampled diffusion processes remain unanswered and only few quantitative results regarding consequences of discretization are available. For example, it is common practice to estimate the parameters using low order approximations to the diffusion process, for example the Euler scheme (e.g. Chan, Karolyi, Longstarr, and Sanders (1992)) or the Milstein scheme (e.g. Chesney, Elliott, Madan, and Yang (1993)). This is justified by asymptotic results which apply when the time intervals t{+ι —t{ converge to zero, but also by a lack of simple estimating procedures based on higher order approximations. It is not always clear whether the observed discretization is fine enough to justify the use of the lowest order approximations or whether higher order approximations will contribute something significant from the perspective of statistical modeling. In this study we propose several methods of estimation based on different order approximations and also methods which allow for assessing the goodness-of-fit. One advantage of the proposed methods is that it is possible to compare their relative performance.
2
Estimation of Parameters
Let us suppose that a process Xt satisfies (1.1), and on the basis of observations at discrete time points t\ < tϊ < ... , we wish to construct estimators of the parameters. There are several reasonable approaches to this problem. (1) When the parameter lies in the drift term, we may construct the continuous time maximum likelihood estimators as in (1.2) and (1.3) above and then approximate the integrals by sums.
333
DIFFUSION MODELS
(2) Again when the parameter lies in the drift term, we may construct the continuous time maximum likelihood estimating functions and then condition these estimating functions on the observed discrete data. (3) We may base the estimation on the discrete data alone, using the exact or approximate score function for discrete observations. This is equivalent to (1.2) in the case of drift parameters. Unfortunately, except for few simple examples, only the first approach is generally feasible. Both the second and third approaches require some simplification to the problem. How should we estimate the parameters if a given sequence of observations, centered at the expected values, has distribution given exactly by equation (1.4)? One possibility is to determine the score function for the distribution, and, using it as an estimating function, obtain the maximum likelihood estimators. Unfortunately, this is a rather difficult task. Alternatively, we may project this score function on some more suitable subspace. Such an approach guarantees the optimal estimating function in the chosen class. For example, the Hermite polynomials hi, given by h{(x) = x, (x2 — 1), (x3 — 3x) respectively for i = 1, 2, 3, provide a reasonable basis for expanding functions of near-normal random variables. These functions have mean zero and variance i\ and they are uncorrelated under a normal assumption. Projecting the score function onto these polynomials is equivalent to projecting onto a space spanned by the powers of AX^. Kessler and Sorensen (1995), for example, choose as basis functions for the space the eigenfunctions of the infinitesimal generator of the Markov process. 2.1
The Ito Taylor Expansion.
Higher order Ito-Taylor expansions may be used to approximate score functions by their projections onto a space spanned by polynomials. Let {ίi,... ,t n } be the points at which we observe the diffusion process {Xt} In the representation of -X*i+1
Xti+1 = XU + Γ + 1 a(Xs, θ)ds + Λ + 1 σ(Xs, θ)dWs Jti
(2.1)
Jti
Ito's lemma can be written in terms of two operators on twice diίferentiable functions / : 2
and
334
MCLEISH AND KOLKIEWICZ
Then for any twice differentiate function /, f(Xti+ι)
= f(Xti) + Γ
+ 1
hi
tl+1
L°f(Xs)ds
+ ί
hi
ι
(2.2)
L f(Xs)dWs.
By substituting in each of the integrands in (2.1) using the above identity and iterating this process we arrive at the Ito-Taylor expansions (e.g. Kloeden and Platen, (1992) page 164). It is easy to observe that terms with nonzero expectation will come only from the first integral and they will be of the form /Λ
\7
. i = l,.
,
(2-3)
where (L0)^) denotes the j-th iteration of the operator L° with (L°)^ being the identity operator. These terms provide successive approximations to the conditional expectations of ΔXti (= Xti+X — Xu)' /
f
3 ^
r = l,....
The first two approximations are m\^ = a(Xtiiθ)Δti,which the first term in the Euler expansion, and m 2 i i = a(Xu,θ)Au
+ l-[aa + \σ2a^){Xti)
(2.4) is equal to
• A2U,
which corresponds to the terms with nonzero expectation in the strong ItoTaylor expansion of order 1.5. In the latter approximation a' and a^ denote the first and second derivatives of a with respect to x, respectively. In general, the difference -mr,i (2.5) will have conditionally on X^ nonzero expectation of order Δ£ + 1 . Using these differences we shall find estimating equations for θ by finding moments of approximations to the distribution of (2.5). The distribution of (2.5) is determined by terms coming from the ItoTaylor expansions of both integrals in (2.1). By gathering these terms, we have the following approximations to the distribution of (2.5 ): (al) - terms of orders up to 0p(Δ ί 2 ): +
σ(Xu,θ) [ hi
dWs =
which together with the term m\ corresponds to the Euler expansion.
335
DIFFUSION MODELS
(a3) - terms of orders up to
σAWti
+ \ +
γ -ZAti)AWti Loσ(Xu)[ΔWuAu-AZu],
where AZti is a normally distributed random variable with mean, variance and correlation
= 0, E((AZti)2)
E(AZti)
= ±Δξ, and E(AWUAZU)
=
\(Ati)2,
respectively. These terms together with m2 correspond to the strong ItoTaylor approximation of order 1.5. By considering more terms we can obtain more accurate approximations to the transition distribution of the discretized process Xt Since it seems that there is no easy method of finding an explicit form of the density function for higher order approximations one may propose to approximate the score functions by their projections onto the space spanned by polynomial functions. It is interesting to notice that the multiple stochastic integrals that arise in Ito-Taylor approximations lead to the Hermite polynomials: i+l
i
Γ ... Γ dWSldWS2...
Jti
Jti
dWSn =
where hn is the Hermite polynomial of degree n. As an example of the proposed method, we shall consider the approximation (a3) and henceforth we shall assume that it provides adequate approximation to the distribution of the difference (2.5). We rewrite the approximation (a3) in terms of two standard normally distributed and uncorrelated random variables Z\ and Z2: + a^Z2, where in the last expression we used the following notation: A
α M = ahi(θ) = σ
1
3/2
^ + -±- [σa1 + aσf + -σV 2 >],
a2)i = a2,i{θ) =
-σσ'Ati
(2.6)
336
MCLEISH AND KOLKIEWICZ
Since Zi, Z? - 1, \Z\ — 3]Zi, and Z 2 are all uncorrelated, it is relatively simple to find higher moments of the random variable (2.6). Using the general theory of estimating functions we can employ these moments to generate optimal estimating equations. We shall denote the variance, skewness and kurtosis of the differences Δ-Xt , conditionally on Xti, by μ2,i, 7i,», and 3 + 72,., respectively. Prom (2.6), we can find the following explicit forms for these moments:
7i,t
=
( — ; ) * [6αf5ία2,i + 36α M α 2 ,iα 3 ) i + 8 α ^ + 10802,*^ J, /^2,i
Ίt2,i = 4 - [ 3 a ί , i +6 0 a 2,i +3 3 4 8 < i + 1296aMa£fi + 2 4 0 ? ^ + 60a? jO^ +252o?)ia§)1 + 57601,10^03,1 + 2232^^01^ + 3 0 ^ +
^ol f i ] - 3. We shall use these moments to project the score functions corresponding to the approximation (2.6) onto the space spanned by the function {1, x, x2}. Now, let us assume that mrj provides, for some value of r, a good approximation to the conditional expectation of ΔXti. Then it is easy to observe that the functions:
and / 2 ) i (ΔX t i ,0) = (AXti - mΓ,i(θ))2 - μ2,i - Ίl^{AXu
- mr,i(θ))
are orthogonal estimating functions and that the projection of the score function for estimating the parameter θj is of the form:
If we decide to estimate the parameters using the Euler scheme, then the score function for estimating the parameter θj can be obtained from the normal distribution of the difference Δxi - m\^ and it is of the form (2.8)
337
DIFFUSION MODELS
where μ2j — σ2(Xti,θ)Ati. By taking 71^ = 0 and 72,1 = 0, which correspond to a normal distribution, we can obtain from (2.7) the projection of the score function onto the space spanned by the functions {1, x, x2}. In this case, the resulting estimating equations are the likelihood equations given by (2.8). Thus, for small Δ^ the optimal estimating function (2.8) will be very close to the true score function. When the time interval Δ^ is not small enough then instead of (2.8 ) we should consider estimating functions given by (2.7), which are based on higher order approximations. Let us observe that the estimating equations (2.7) and (2.8) are of the same form but they have different approximations to the first two moments of ΔXt{ and different weights on the functions f\^ and /2,i. If a higher order expansion (like (2.6)) is a good approximation to the distribution of AX^, then it is possible to quantify the loss of efficiency due to using a lower order approximation, since in this case one may assume that the two estimating functions differ only by their weights in the representation whifu(AXti,θ)
+w2,if2,i (AX*,*).
(2.9)
Another situation when it is possible to generate a number of unbiased estimating equations which differ only by their weights on the functions J\^ and /2,i is the case when either of the first two moments of AX^ is known explicitly. For example, when the drift function α(#, θ) is a linear function of x, then the conditional expectation of AXti can be found explicitly and then it may be of interest to compare efficiency of different estimating functions
based on
fhi(AXti,θ).
Finally, when the time interval Ati is not sufficiently small to justify the use of estimating functions based on low order expansions and when explicit forms of the first two moments are unknown then Monte Carlo simulations provide one more method of generating unbiased estimating function of form (2.9) (Bibby and Sorensen, 1995). In the next section we consider a method of comparison of such estimating equations.
2.2
Estimating Functions of Higher Degree.
So far we have discussed only quadratic estimating functions but the method of projection can be applied to generate polynomial estimating functions. For example, it is possible using only knowledge of the first three moments of AXt{ to project the score function onto the space spanned by the three components of the vector estimating function
fi(θ)=
(AXu-mr>i(θ)) (AXti-mr,i(θ)γ-μ2,i(θ)
338
MCLEISH AND KOLKIEWICZ
provided we settle for an approximation to the covariance of these terms. These covariances are used only in determining the weights on the estimating functions and so while the resulting estimating function may be slightly sub-optimal, it will be unbiased. For the following, in order to simplify the notation slightly, we assume we are estimating a single scalar parameter θ which may be one component of the vector of parameters. Define h
The conditional covariance matrix of fa is obtained by using the normal approximation to the distribution to determine the moments of order > 5 , since the distribution as Δ^ -> 0 is normal. This yields ^)
=D
0 15
2 + j29i
(θ)
Ίhi
D, /
where D denotes a diagonal matrix with diagonal elements O*2,»(0)) > /*2,i(0), (M2,i(#)) In this case the projection of the score function onto the linear space spanned by the three components of fi(θ) is given by
If for small Δ*. we replace both J24(0) and 7i,i(0) in Σ/ by their asymptotic value 0, we obtain an approximation 1
ΣJ w D"
1
/ 5/2
0
-1/2
0 V-l/2
1/2 0
0 1/6
It follows that
+f
D~\
339
DIFFUSION MODELS 2
Now as Δ^. -> 0 , μ2,i(θ) ~ σ (θ)Ati and 71^ -» 0 . Assuming this convergence is sufficiently rapid, the above estimating function is asymptotically equivalent to (2.8). Observe that only the first term is used if the parameter lies only in the drift, and only the second term if it is a diffusion parameter. The normal approximation is not the only alternative to approximating higher order moments of the diffusion. For example, we could assume that (2.6) holds exactly and obtain moments of orders 5 and 6 using this approximation. An alternative approach uses Ito's lemma combined with the approximations (2.3) and (2.4) to the conditional expectation of AXti. If we first apply the Ito's lemma to the processes X$ , X? , and Xf , and then use (2.3) and (2.4), we can approximate, in principle to any degree, the first four conditional moments of ΔX^ For example, we have the following approximations to the conditional second moment
Cti+1 - Xf. \XU] « Σ ( £ ° ) ϋ
J
fi(X£' ^)^=T^' r = 1,...,
where o
^
1 2 ^2
and a(x,θ) = σ_(x,θ) = Since for any admissible function h we have L°h\x2 = L°h(x2), we arrive at
E[Xl+1 - X2 \Xti) « J2(L°)^X2 ^f,
r = 1,...,
Similar approximations can be obtained for higher moments of the process Xf. ) 3=1
r = 1,...,
^ J
'
Prom the differences of moments of Xt, we obtain the moments of AXt{. Although computations involved in these approximations are still complex, they are simpler than for the method which uses approximations to the distribution of AXt{. In addition, now calculations can be carried out using symbolic computer languages, like, for example, Maple. A drawback of this approach is that it does not allow for a simple assessment of the goodnessof-fit of a model.
340
3
MCLEISH AND KOLKIEWICZ
Relative efficiency for estimators based on different order approximations
In this section we use methods of the general theory of estimating equations to compare estimating functions generated from different approximations to the distribution of the increments ΔX^. Let Q be a class of zero mean, square integrable p-dimensional estimating functions gn(X\,... ,Xn',0) which are almost surely differentiate with respect to the components of θ and such that E(gn) = E(^-gnji) and E(gng^) are nonsingular. Suppose also that {gn^n} is & martingale whose quadratic characteristic is {< g > n , !Fn}. Let fi = /(Xi,..., Xi\ 0), 1 < i < n, be specified d-dimensional vectors that are martingale differences and suppose that we want to find an optimal estimating function from the class Λ4 C Q of martingale estimating functions of the form 2=1
with the W{ being matrices which are Tχ-\ measurable. In the theory of optimal estimating equations two criteria are used: the small sample optimality criterion (Oi?-optimality) and the asymptotic optimality criterion (0.4-optimality) (cf. Godambe and Heyde, 1987). Both optimality criteria are satisfied by the same estimating function
wffi,
(3.1)
1=1
with w* = In the one dimensional case (d = 1), the two criteria are equivalent to finding an estimating function which minimizes either
for Ojr-optimality, or *"•«?•'-tf'^-O
(3i3)
for 0,4-optimality. The reciprocal I9n of the quantity (3.3) is called the martingale information in gn. This information occurs as a scale variable in the asymptotic distribution of the estimator obtained as the solution to equation gn(θ) = 0.
341
DIFFUSION MODELS
Thus maximizing I9n leads to asymptotic confidence regions of minimum size. Using this interpretation, it seems reasonable to define the conditional relative information CRI(g\,g2) of an estimating function g\ with respect to a second estimating function p2 as the ratio I9l /I92. When g = g* we can find the martingale information using the formula
Ig =< 9* >n= Σ E{{wf fiΫ\Ti-λ),
(3.4)
i=l
however, for a general estimating function we have to use (3.3). The above concepts can be applied to the problem of estimation described in the previous section. Under the measure V derived from the 1.5 strong approximation (2.6) and under the assumption that mr^ is an adequate approximation to the conditional expectation of AXti, the optimal estimating function #*, which is in the space spanned by the functions f\j and /2,i, is given by (2.7). We shall denote the optimal coefficients by w\ { and wζj. Suppose also that we have another estimating function which is of the form 9n =
where and /ι2,i = {ΔXti - m r ,i) 2 - μ2,z ,
hιyi = ΔXU - m Γji ,
and we would like to compare the martingale information contained in both functions. Note that in the above representation of gn the moments m r? i and μ2,i are the same as in the optimal estimating function #*. These may be different from what we are actually using when building an estimating function based on lower order approximation but when comparing the martingale information we have to deal with unbiased estimating equations. For example, the weights in gn may come from the Euler scheme but mr^ and μ2,i may be based on a higher order scheme or obtained from simulation. For the optimal estimating function (2.7) the martingale information can be determined using (3.4)
i-72,z-7l,ij 2=1
n
dm
λ
r
2
I
342
MCLEISH AND KOLKIEWICZ
where we used notation from the previous section. For the second estimating function gn we have the following formula [Σ?=i ™M J-mr,z + W2,iml*2,i\2 I9n =
ι
ϊl]
Σ L i K , ^ 2 , z + ^2,^2,z(2 + 72,2-)
,o cx (3.5)
Using these formulae we can find the information contained in estimating functions based on different approximations to the transition distribution and in that way we can more easily assess the merits of using higher order expansions for a particular model-data combination. For illustration of the effects of discretization on different estimating equations let us consider again the model dXt = {a + βXt)dt + cXΊtdWu
(3.6)
with the following values of the parameters: a = 3.2,
β = -0.4,
c = 0.01,
and 7 = 2.5,
which are close to the values which are observed in practice (Section 1). The relative conditional efficiencies of the optimal quadratic estimating functions based on the Euler scheme with respect to the quadratic estimating functions based on the strong Ito-Taylor approximation of order 1.5 are plotted in Figure 4 . For simulation purposes we assumed that the process was observed at discrete equidistant points Δ, 2Δ,... ,nΔ, with n = 260. Then we compared the conditional information of the two estimating equations at different values of Δ, with Δ = 1 corresponding to daily observations. The reported values are means of the relative information calculated from five different trajectories of the process. The graphs show that the effect of discretization may be different for different parameters. While for the parameters in the drift term the efficiency of the two methods remains virtually unchanged for all values of Δ, for c and 7 the changes are quite visible suggesting a slightly higher efficiency of the estimating equation based on the higher order approximation. The largest drop in efficiency is about 5% and it occurs for the biweekly observations. Overall, for the given values of the parameters, the two methods of estimation show very similar performance. One may argue that in the above comparison the two estimating equations give rise to estimators with almost the same efficiency because in the estimating equation based on the Euler scheme we used the moments from the higher scheme, "borrowing" in this way efficiency from the more accurate approximation. We now consider a method of comparison which allows us to compare estimating equations without this adjustment.
343
DIFFUSION MODELS
s. -
\\ \\
\ \ \ \ \ \ \ \
- Alpha - Beta
-c
- Gamma
Figure 4:
Suppose that we want to compare the efficiency of the estimating equation (2.7) and (2.8), assuming that {ΔU} are not small enough to use the Euler scheme but the strong approximation of order 1.5 provides a good approximation to the conditional distribution of ΔXt . This would imply that the estimating equation (2.8) generated from the Euler scheme is not unbiased and therefore the previous methods of comparison cannot be applied directly. It is possible, however, to compare the information contained in each of these two estimating equations, at least up to order O(Δ 2 ), if we use the asymptotic results presented by Florens-Zmirou (1989). Suppose that a parameter θ is to be estimated from a discrete equidistant observations, XΛ> ? ^nΔ 5 of the process {Xt} with constant diffusion defined by and we are using estimating equations of the form 9n =
344
MCLEISH AND KOLKIEWICZ
Assume that the process Xt is ergodic and its invariant measure is given Θ by μe Furthermore, let Q t = πf xμβ, where πζ (dy, x) = Pg{Xt £ cίy|X0 = x) is the transition density of Xt. By θ0 we shall denote the true value of the parameter. 2 θ Florens-Zmirou (1989) shows that if h is such that Jh dQ A < oo and a satisfies some regularity conditions then [h(x,y
n
θ
θ)dQ £
2
in L (Pθo),
as n -> oo.
θ
Also, if the equation / h(x, y; θ)dQ £ = 0 has an unique solution, 0 Δ say, and σ and h satisfy some regularity conditions then the estimating equation gn = 0 gives estimates θn such that θn -» #Δ in probability under P^o, and (3.7) where
These results can be generalized to processes with non-constant diffusion term by the well known transformation
*(*)= Γ σ{y,θ)-ιdy. Jo In view of (3.7) and (3.8), it seems reasonable to define information contained in the estimating function gn, which now may be biased, as *>
E
Since we do not know θ0, it is not possible in practice to calculate I*n. We may use, however, the observed information \* which under some regularity conditions and suitably normalized will converge to I*n. The observed information J*n is analogous in form to the martingale information I9n which we used for unbiased estimating equations. The main difference is that the information /*n does not involve conditional expectations calculated under the specified model and therefore can be regarded as robust information (for general discussion about robust versions of information see Barndorff-Nielsen and Sorensen, (1994)). In addition, I*n
DIFFUSION MODELS
345
0.98
1.00
retains its meaning even if θ& does not converge to the true value of the parameter, provided Δ is small enough.
x
\
\
3-
\ \/ \ \ \
s_
0.92
d
—-
-
Alpha Beta C Gamma
g. o
6 Time interval
Figure 5:
We repeated the simulation study for the process (3.6) with the same values of the parameters and the same number of observations. This time each of the four parameters was estimated individually using two methods: the optimal quadratic estimating equation derived from the Euler scheme and the optimal quadratic estimating equation based on the order 1.5 strong Taylor approximation. Then, using the expression (3.9), the observed information was calculated for both methods. The procedure was repeated for different values of the time increments Δ, with Δ = 1 corresponding, as before, to the daily observations. Figure 5 shows the averages of the relative observed informations for the two methods, based on five different trajectories of the process. The relative observed information exhibits a higher variability than in the previous simulation, but, otherwise, the two methods of estimation show very similar performance. It seems that for the given values of the parameters of the process (3.6) and the time intervals Δ which range from daily up to biweekly observations, there is not much gain in efficiency when instead of using the optimal quadratic estimating equations based on the Euler scheme
346
MCLEISH AND KOLKIEWICZ
we use the quadratic estimating equations based on the order 1.5 strong Taylor approximation. Obviously, this may not be true for a different set of the parameters and/or different sampling intervals Δ.
4
Bond Data Example and Model Assessment
We now return to the bond data example and write the estimating functions Ί in more explicit form. Here a(x) — a + βx, σ(x) = cx . For simplicity, we retain only terms up to order (At.) in the distribution on the right side of (2.6). These may influence the efficiency of the estimator but not the consistency. In this case we are able to determine the mean exactly since m(t) = E(Xt) satisfies the linear differential equation m'(t) = (a +
βm(t))Ati.
It follows from the solution of this differential equation that m0o,i =
{(α
Also, as in the Milstein approximation,
μ2>i » σ2Δti{l + i(σ') 2 Δ t J = c2Xf Δti[l + i c V * ? " 2 * * ] . Solving for the root of the estimating functions (2.8) reduces to finding α, β minimizing - m^)2,
Wi =
,
(4.1)
where the weights are held constant while minimizing. For fixed weights, and constant At this has an explicit solution. Similarly, in principle at least, c, 7 can be found by minimizing the sum of squares - moo,i)2 - μ2,i]\
(4.2)
where again the weights are held constant while minimizing. One might substitute an initial consistent estimator of the parameters c, 7 in the weights, followed by the minimization. Unfortunately, there is little information in this data available for estimating both parameters c, 7 as the contour plot of the sum of squares function illustrated in Figure 6 indicates. There are local minima along a wide curved valley of nearly constant depth. Because of this near unidentifiability, convergence is extremely slow but the values 7 = 2.5910, ά = 3.188, β = -.3966 and c = .0052 appear to correspond to a local minimum.
347
DIFFUSION MODELS sum squares for c,g. Initial a, b, c, g= 3.017-0.3885 0.005364 2.58
2.3 0.02
0.025
Figure 6: There remains the question of whether the 1.5 strong order model (1.4) adequately explains the increased weight in the tail observed in Figure 3. Unfortunately, each increment centered at its expectation as in the left side of (1.4) is a function of two independent normal random variables Z\ and Z2, either of which might be considered residuals. Since there is not a unique Z\ for each observed increment, we are unable to directly define and analyze normal residuals in the model of (1.4). We propose one possible solution to this problem. Assume for simplicity that 7 > 1. Consider the transformed process Yt = Xl~Ί. Then by Ito's lemma, Yt satisfies a diffusion equation with constant diffusion term v
dYt = (i-D{[γYt= ά(Yt)dt
-l)dWt,
- βYt]dt + cdWt) say.
(4.3)
It follows from the representation (2.6) for the process Yt that = c(7 -
a2,i = α 3j< = 0,
^ά'(Yt)], c(7-l)(Δti)3/2 a
(4.4)
MCLEISH AND KOLKIEWICZ
348
Now because (2.6) is a linear combination of two independent normal rana a dom variables, if we divide by the standard deviation, \J \,i + \i-> the result is a standard normal variate that can be regarded as a standardized residual. Thus, in this case, the standardized residuals are of the form Δyti-α(Fti)Δ(i-
έW7 -
^ .
(4.5)
The plot of these residuals is in Figure 7. Residuals from transformed process
0
200
400
600
800 1000 1200 index of residual
1400
1600
1800
2000
Figure 7:
They seem to indicate a reasonable fit of the model, although there is possible evidence that the diffusion term is not homogeneous in time. We also generate a normal probability plot for these residuals n Figure 8. Note that these have the same basic character as do the Euler residuals, indicating wide tails more consistent with the stable laws, for example, than the Gaussian assumption. This leads us to speculate that the wide-tail phenomenon is not solved by increasing the order of the Ito-Taylor approximation, but requires developing models defined as stochastic integrals with respect to wider-tailed distributions driving the process than Brownian motion. The most obvious of these, while analytically complex, are the stable processes with index less than two. The transformation, which we use largely so that we are able to define approximately normally distributed residuals from the suggested model, pro-
DIFFUSION MODELS
349
vides alternative estimating functions as well. In general, it is usually possible to transform the original diffusion process so that the diffusion term is constant in Yt, say σ . In this case, the third order Ito-Taylor expansion is particularly simple, and as in this case, it results in a normal distribution since α2,i = a^^ = 0. Therefore, assuming the normal approximation to be accurate, the maximum likelihood estimators of the parameters may be obtained by weighted least squares. For example, for a parameter in the drift term ά only, we minimize - ά(Yti)Ati
-
(4.6)
with weights (4.7) Normal Probability Plot
0.003 0.001
Figure 8:
5
Conclusion
It is common practice in finance to use a particular diffusion model, sometimes with several factors, to model a given process and to price derivatives. In many cases, the choice of model is motivated by analytic convenience. Models such as the CIR model have easy solutions and pricing some derivatives is straightforward. We have shown that the Ito-Taylor expansion can be useful for two purposes. The first is calibrating the model or estimating
350
MCLEISH AND KOLKIEWICZ
the parameters. This may also be effected by first transforming the model to one which has constant variance term. The order of the Ito-Taylor expansion seems to have limited influence on the efficiency of the estimators when data is collected at discrete time intervals, provided these are not too far apart (e.g. daily data). The second application of the Ito-Taylor expansion, perhaps the more important one, is in assessing the goodness of fit of the data to the model. We speculate that many of the standard diffusion models will tend to fit observed data poorly in the tails of the distribution, and these tails may have considerable influence on the price of derivative products. Alternative models constructed as stochastic integrals with respect to wider tailed alternatives such as the stable laws are likely needed to achieve a satisfactory fit. References Barndorff-Nielsen, O.E. and Sorensen, M. (1994) A review of some aspects of asymptotic likelihood theory for stochastic processes. Internαt. Statist Rev. , 61(1), 133-165. Brennan, M.J. and Schwartz, E.S. (1979) A continuous time approach to the pricing of bounds, Journal of Banking and Finance, 3, 133-155. Bibby, B.M. and Sorensen, M. (1995) Martingale estimation functions for discretely observed diffusion processes. Bernoulli, 1(1/2), 17-39. Chan, K.C., Karolyi, G.A., Longstaff, F.A. and Sanders, A.B. (1992) A n empirical comparison of alternative models of the short-term interest rate. The Journal of Finance , Vol. XLVII, 1209-1227. Chesney, M., Elliott, R.J., Madan, D., and Yang H. (1993) Diffusion coefficient estimation and asset pricing when risk premia and sensitivities are time varying, Mathematical Finance, Vol. 3, No. 2, 85-99. Cox, J.C., Ingersoll J.E., and Ross, S. A. (1985) A theory of the term structure of interest rates, Econometrica, 53, 385-406. Florens-Zmirou, D. (1989) Approximate discrete-time schemes for statistics of diffusion processes. Statistics, 20(4), 547-557. Gihman, I.I. and Skorohod, A.V. (1972) Stochastic Differential Equations Springer-Verlag, Berlin. Godambe, V.P. and Heyde, C.C. (1987) Quasi-likelihood and optimal estimation. International Statistical Review, 55, 3, 231-244. Kessler, M. and Sorensen, M. (1995) Estimating equations based on eigenfunctions for a discretely observed diffusion process. Research Report No. 332, Department of Theoretical Statistics, University of Aarhus. Kloeden, P.E. and Platen, E. (1992) Numerical Solution of Stochastic Differential Equations. Springer- Verlag. Longstaff, F.A. (1989) A nonlinear general equilibrium model of the term structure of interest rates, Journal of Financial Economics, 23, 195-224.
353
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
PREDICTION FUNCTIONS AND GEOSTATISTICS A. F. Desmond University of Guelph ABSTRACT We consider analogues of estimating functions for situations in which the prediction of observables is of primary interest. We show that the mode of the predictive density has an optimality property for prediction analogous to a similar optimality property for the mode of the posterior density in the case of parametric estimation. Applications of predictive estimating functions in spatial statistics with particular reference to the geostatistical method known as kriging are developed. Key Words: Predictive estimating functions; kriging; predictive density; spatial statistics.
1
Introduction
Applications of estimating functions have mainly focused on estimation and inference for parameters either in fully parametric or semi-parametric models. While the focus on parameters as indices of probability distributions is at the core of modern statistics ever since Fisher (1922) (See Stigler (1976)), several authors have argued persuasively that prediction of potential observables may sometimes be important. For example, Pearson (1920) refers to this as the Fundamental Problem of Statistics. It is also central to the treatment of de Finetti (1974, 1975) although he prefers the term prevision. Geisser (1993) gives an extensive discussion of predictive inference focusing, however, mainly on the Bayesian approach. In this article we discuss predictive analogues of estimating functions, motivated by similar ideas for parametric estimation. Such analogues might be termed prediction functions, although this term has been used previously in at least two different senses. Mathiasen (1979) uses it to denote a function of the future observable and the current data which ranks values of the future observation in terms of relative plausibility in the light of the data. Such a function has also
354
DESMOND
been termed a predictive likelihood (cf. Bjornstad (1990) for a comprehensive review). The term prediction function or predictor has also been used for a function of the current data used to predict a future observation or a function of several future observations, e.g. in time series. Here, I shall use the term prediction function or predictive estimating function (to emphasize the analogy with estimating functions) to denote a function g(z,x) of data x and an observable z to be predicted. The equation g{z,x) = 0 will be referred to as a predictive estimating equation. In this article I consider predictive inference from the point of view of prediction functions. In section three I consider this from the Bayesian point of view and show that one can obtain optimality properties of modal predictive density estimators based on two optimality criteria analogous to those of Godambe (1960), Ferreira (1982), and Ghosh (1990). These optimality criteria are introduced in section two. In sections four and five I consider applications to spatial prediction and show that the kriging equations of Matheron (1962) can be obtained as special cases. Section six deals with prediction functions related to predictive likelihood, while section seven concludes with a brief discussion.
2
Optimality Criteria for Prediction
Classical approaches to prediction in time series (e.g. Box and Jenkins (1970)) or spatial statistics (Ripley (1981)) invariably focus on mean-squared error as a criterion e.g. in the 1-step ahead prediction problem one consid2 ers E[Yf - h(Yc)] either unconditionally or conditionally on Yc, where Y) denotes a future observation, and h(Yc) is a function of current data Yc. Suppose we are interested in a future value z from a parametric family fz(z]Q) depending on an unknown parameter θ_. Denote the current data by x = (sci,..., xn) a. random sample from fx{x; θ). An unbiased prediction function is a function g of the current data x and a future observable z such that: x)} = 0 (2.1) the expectation being over the joint distribution of z and x. We could also consider conditional unbiasedness: Ez/Mz,x)} = 0.
(2.2)
There is some debate (e.g. Butler (1990)) concerning the appropriateness of unconditional versus conditional assessments of predictive inferences but in my view both measures are important in practical applications. Roughly
PREDICTION AND GEOSTATISTICS
355
speaking, though, it is probably the case that unconditional assessments are more relevant in pre-data considerations, while conditional assessment may be more relevant at the analysis stage. An important difference here with more conventional treatments of prediction is that the point predictor Λ(x), say, obtained as a solution in z of the equation 9(z,x) = 0 (2.3) may or may not be unbiased. (A sufficient condition for the conventional unbiasedness requirement, E(z) = E(h(x)), is linearity of g in z and x). Another departure from classical prediction problems we propose to adopt here is to define optimality of our prediction function in terms of one of the following optimality criteria:
EFF{g)
= E^df)
(24)
or
EFF(g) = im^L {EΛd/)V
(2.5)
where EFF denotes efficiency. For simplicity we restrict attention to scalar z although multivariate analogues of (2.4) and (2.5) are easily obtained. Also, (2.4) and (2.5) are predictive analogues of criteria previously proposed by Ferreira (1982) and Ghosh (1990) respectively. Those authors, however, are concerned with Bayesian estimation of unknown and unobservable parameters, whereas z here represents a random variable which is potentially observable exactly (i.e. without measurement error). Naik-Nimbalkar and Rajarshi (1995) also develop this framework extensively in the context of state-space models but again the state-variables are parameters which are unobservable (although allowing measurement with error). In this paper, I shall restrict attention to the unconditional criteria, (2.1) and (2.4), although a similar development is possible in terms of the conditional criteria.
3
Optimal Prediction Functions
Suppose z is a scalar future observable and x_ = (x\,... ,x n ) is the current data. The ideal object for predictive purposes is a conditional probability density of the future observable z given the current data x, p{z\x) say. (We assume that all random variables are continuous). Lindley (1990) and Geisser (1993) point out that full specification of the marginal density of £, p{x) say, and hence, a fortiori, p(y\x), is difficult in general, but that
356
DESMOND
the introduction of a parametric model q{x\θ) and a prior distribution π(θ) enables us to write
p(x) = J q(x\θ)π(θ)dθ and, hence, the predictive density can be calculated. Also, the de Finetti representation theorem (de Finetti (1937)) shows that if the X{ are exchangeable, such a "lurking" parametric structure is implied. If one accepts the Bayesian argument, the predictive density can be obtained as p(yk) = f p(y\χ),θπ(θ\x)dθ
(3.1)
where π(θ\x) is the posterior density of θ given x, calculated from Bayes theorem. Fisher (1956) and Kalbfleisch (1971) obtained predictive distributions in situations where a fiducial distribution for the parameters is available. The calculation is similar to (3.1) with the posterior density replaced by the fiducial density, although the logic is quite different. Aitchison and Dunsmore (1975) make extensive use of the Bayesian predictive density. We argue here, informally, that if a predictive density is available, then the optimal prediction function is given by: * dlnp{z\x) 9 = . dz
(3.2)
We assume regularity conditions similar to those of Ferreira (1982), but involving conditions on existence of certain derivatives with respect to z rather than θ, and that certain interchanges of differentiation and integration operators are permissible. Denote by G the class of all prediction functions g : Z x X -» 9ϊ, such that EχEz^(g) — 0 and EχEz^(g2) < oo, where Z and X are the obvious sample spaces. Then G is a vector space with respect to the real numbers in the sense that if ci, c<ι are real numbers and 5i 5 52 € G, then c\g\ + C252 £ G. Also, if we define the inner product < 9i?52 > = EχEz\x(9ι92) > G is an inner product space which is complete in 2 the metric defined by the norm ||g|| = < g,g >; hence G is also a Hubert space. Consider an arbitrary element g of G. Denote by g the derivative of g with respect to z. Differentiation under the integral sign in the condition: = 0, and arguing in a similar fashion to Ferreira (1982) yields: E{g)
=
-
where 9
„ _ dlnp(z\x) " dz
357
PREDICTION AND GEOSTATISTICS
is the logarithmic derivative of the predictive density with respect to the future observable z. The Cauchy-Schwarz inequality now implies
Optimality of g* then follows from the condition for equality in the CauchySchwarz inequality. In practice, the predictive density p(z\x) will be unavailable and it is necessary to restrict attention to subclasses of unbiased prediction functions. For example, if G\ is the class of prediction functions linear in z and #, standard arguments suggest that the optimal prediction function in G\ is the projection of g* above into G\. Minimization of (2.4) above is then equivalent to finding g\ G G\ such that
| | 5 ί - 3 Ί I < l l 5 i - 5 Ί I , for all
Sl
GG,.
We give an example of this in spatial statistics in the next section and show that an optimal g* can be found depending only on second-moment assumptions about the data and the "future" observable. I make two remarks at this point. Firstly, by appropriate redefinition of inner products and corresponding norms, the above development can be carried out mutatis mutandis in terms of the conditional quantities (2.2) and (2.5). Secondly, optimality of the mode of the predictive density can also be derived based on maximizing an expected utility function. Aitchison and Dunsmore (1975, p. 46) show that for an all-or-nothing utility structure the optimum point predictor is the mode of the predictive density.
4
Prediction Function Approaches to Kriging
Godambe (1985) investigated finite sample parametric estimation for stochastic processes using estimating functions. The stochastic processes he considered were in discrete time and optimal quasi-score functions based on elementary martingale estimating functions were constructed. Thavaneswaran and Thompson (1986) generalize this to continuous time processes. In the case of spatial statistics such a martingale formulation would appear to be inapplicable. Nevertheless it is of interest to consider what estimating functions may offer. Also, unlike the aforementioned authors who deal with fixed unknown parameters in stochastic models, the type of applications I consider here involve prediction of unobserved (but potentially observable) random variables. We are particularly concerned here with an approach to spatial prediction commonly referred to as kriging, which has found extensive application in such areas as hydrology, soil science, and the mining industry (e.g. Journel and Huijbregts (1978)). Kriging is, in essence, an analogue for spatial
358
DESMOND
processes of the optimal linear prediction theories of Kolmogorov (1941) and Wiener (1949) for time-series and was developed mainly by Matheron and his school in the mining industry. Cressie (1990) gives an interesting historical account of the origins of kriging. As the area is relatively unfamiliar to statisticians we outline briefly some of the main ideas. We assume an underlying two or three-dimensional spatial stochastic process z(gc)^ x G 5ft2 or 5ft3 representing, for example, soil PH in 5ft2 or ore-grade in 5R3. It is desired to estimate or predict Z(XQ) at some unobserved location #o, based on observations of z(x), z(xλ),..., z(xn) at a set of n spatial locations xϊ}..., xn. The optimal predictor in terms of minimizing the mean-squared prediction error E[z(xQ] zfa),..., z(xn)) - z(x0)}2 (4.1) is, as in the Wiener-Kolmogorov theory, given by, the conditional expectation E(z(ΐo)\z(gil),...,z(2n))
(4.2)
but this entails knowledge of an (n + l)-dimensional distribution which may not be available. Kriging focuses on linear predictors of the form n
z{x.o', *(£i),
, z(xn)) = J ^ λiz(xi) + λ0
(4.3)
which satisfy an unbiasedness condition E[z{x0; z{xλ\ . . . , z(xn)) - z{xQ)] = 0,
(4.4)
and seeks a predictor of the form (4.3) which minimizes (4.1) subject to (4.4). The solution to the kriging problem, i.e. determination of the optimal weights λi, λ2 . . . λ n , depends on assumptions about the structure of z(x). Usually, it is assumed that z(x) = θ(x) + e(x) where e(x) has zero mean and is either a stationary process or an intrinsic random function (Matheron (1962)). An intrinsic random function is one for which generalized increments of some order are second-order stationary (e.g. Cressie (1991, p. 300)). There are three common assumptions about θ(x) : (ϊ)θ(x) is a known constant θo, (ii) θ(x) is an unknown constant #o a n d (ΐϋ) θ(x) is a "trend" function of the form Σ j = 1 fj{x)βj where fj(x) are known functions (e.g. low-order polynomials) and βj are unknown parameters. = v e Under assumptions of known covariance structure Σ ( ^ — x 2 ) C° ( (%i)i e(x2))i m the case where e(x) is stationary, or known semivariogram η{x_x — 2 n #2) = 7}E[(e(xi) — e(^ 2 )) ] i the case where e(x) is an intrinsic random
359
PREDICTION AND GEOSTATISTICS
function of order zero, optimal predictions in terms of (4.1) can be obtained in cases (i), (ii) and (iii) and lead, respectively, to what are referred to as simple, ordinary or universal kriging equations for the coefficients Xi in (4.3). Details are given in Ripley (1981) or Cressie (1991). Optimality here rests on the strong assumption of known covariance function or semivariogram. In practice, this needs to be estimated and an enormous body of work in geostatistics has concentrated on its estimation. Often various simple parametric forms for 70 () are assumed, θ estimated in various ad hoc ways, and JQ( ) is inserted into the kriging equations. Consider now a reformulation of the kriging problem from a predictive estimating function point of view. We seek a prediction function g(z(x0), z(xλ), ..., z(xn) which is: (i) unbiased, E{g(z(gU>),z(x1),...,z(xn))}=0
(4.5)
and (ii) minimizes Eg2
where g | is the derivative of g evaluated at Z(XQ) = z. Clearly (4.4) is a special case of (4.5) but (4.5) is more general in that the predictor z(x$) which is a solution to g = 0 need not be unbiased in the sense of (4.4). Prom section 3, the optimal predictor, minimizing (ii) in the class of unbiased predictive functions is = 9
dlnp(z(x0) = φ ( s χ ) , . . . , z{xn)) dz
K
'
}
provided the necessary conditional distribution is available. This is unlike the Wiener-Kolmogorov-Matheron theory which leads to the conditional expectation predictor. It has the same difficulties, which that theory encounters, in that knowledge of the conditional distribution is rarely available. However, if z(x) is a stationary Gaussian process, the modal predictor according to (4.7) coincides with the conditional expectation predictor. To obtain a prediction function not predicated on strong distributional assumptions we restrict the class of competing prediction functions. For example, one possibility is the class: G\
= {g : g = g(z(x0), z(xλ),..., z(xn)) =
g1(zteo)-h{z(xι),...,z(xj))
and E(9l) = 0}
(4.8)
where g\ and h are possibly non-linear functions. Clearly, choice of the identity function for g\ is sensible in many applications. For example, the
360
DESMOND
disjunctive kriging of Matheron (1976) would correspond to g\ the identity function and h{z{xλ),...,
z{xn)) = 2=1
where the functions {hi : i = 1, ...,n} are measurable square-integrable functions. Matheron (1976) shows that minimum mean-squared prediction error predictors can be obtained. The resulting disjunctive kriging equations require knowledge only of the bivariate distributions of (z(x_j), Z(XJ)), provided the process z(-) follows a so-called isofactorial model. This is in contrast to (4.7) which typically involves knowledge of an (n + l)-dimensional distribution. A further special case of (4.8) is the class G2
=
{9'9
= g(z(xo), zUι), »-> z(£n))
= z(x0) - λo - Σ λ ^ ^ ) '
w i t h
E
i9) = 0}
(4.9)
i-l
corresponding to prediction functions linear in the observations z(xj)y 1 < i < n and the unobserved z(xQ). We now show that the optimal prediction function in the class G2 leads to the simple kriging equations of Matheron (1962). The argument is a modification of theorem 2.1 of Thavaneswaran and Thompson (1988). We give the result for second-order stationary processes with known covariance function although it can be modified for the intrinsic random function situation with known semi-variogram. Theorem 4.1. Suppose E(z(x)) = θ(x) is a known function and denote by Σzz the n x n matrix with (ij)th element Cov(z(xi)^z(xj)). Let #o = τ E(z(x0)) and, θ_ = {θ(xι),..., θ(xn)) and d be the n-vector with ith element Coυ(z(aiQ),z(aii)). Let G2 be as in (4.9) above rewritten in the form T
G2 = {9 : 9 = (*0Eo) - θ0) - \ (zn
- θ)}
where λ τ = (λi,...,λ n ) and z_n = (^(^1),...,z(x n )) τ . prediction function minimizing (4.6) is given by τ
9* = (z(x0) - θ0) - d Σ^z(zn
- θ).
Then the optimal
(4.10)
Proof. The proof is an elementary modification of that of Thavaneswaran and Thompson (1988). They point out that a sufficient condition for g* to be optimal is that E(gg*) = KE(g) where g is the derivative with respect to Z(XQ) and K is an arbitrary constant. Since for the class (?2, E(g) is unity,
361
PREDICTION AND GEOSTATISTICS 2
it suffices to show that E(gg*) = E(g* ) for all j G G2. For g* given by (4.10), elementary manipulation yields E(g*g) = Var(z(x0)) - ffΣ^d,
(4.11)
for all 3 G G2. Since the right hand side does not involve λ, and hence not on the choice of #, the result is proven. In this case the optimal predictor Z(XQ) obtained by solving g* = 0 for z(x0) is given by τl (4.12) z(zn-θ). Matheron's derivation involves minimizing the mean-squared error with respect to variation in λo,λi,...,λ n , subject to E(z(x0)) = E(z(x0)). This leads via a Lagrange multiplier construction to the n +1 linear equations for λ o ,λi, . . . , λ n given by: ) + XTΣZiz
=d,i=, ...n
where Σ ^ z is the n-vector whose jth element is Coυ(z(xi), Z(XJ)). The optimal weights are identical to those in (4.12). If the underlying process is Gaussian, standard results for the multivariate Gaussian distribution show that (4.12) is the conditional mean of z(xQ) given z(xλ), ...,z(x n ), so that it is globally optimal (without restriction to G2) with respect to minimization of prediction mean-squared error. Since it is also the conditional mode it is also globally optimal with respect to minimization of (4.6). The treatment given here has reproduced Matheron's result from a predictive estimating function point of view. Also it depends on the unrealistic assumptions that both the mean of the process and its covariance are known. Ordinary kriging and universal kriging extend the theory to parametric linear models for the mean and I consider a more general version of this in the next section, from a predictive estimating function point of view. However, the real advantages of this point of view (and scope for generalization) are, in my opinion, in the possibilities of extension to non-linear prediction. Such an extension is broadly analogous to Godambe and Kale's (1991) extended Gauss-Markov theory for parametric estimation and will be treated in detail elsewhere.
5
Prediction with Unknown Mean
Suppose that the mean function θ(x) is a known function, say, θ(x; β) of a pdimensional vector of unknown parameters β. This includes the special cases of: (a) θ(x) an unknown but constant scalar, for which ordinary kriging has
362
DESMOND
been developed and (b) θ{χ-,β) = ΣPj=ιfj{x)βj corresponding to universal kriging. However, here, we allow the possibility that θ(x; β) is a non-linear function of β. We retain the assumption that the covariance matrix is known but possibly a function of /3, Σ,zz(β) We consider joint estimation of /?, and prediction of Z(X_Q) at an unobserved location x0. The optimal estimating function for β_ is the quasi-score DτΈ-z\{zn-θ_n{Άβ))
(5.1)
where D is the nxp matrix with (i,j) element r^ - , z_n is the data vector as before and 0n(z;/3), is the n-vector (θ(xι]β),... ,0(xn;/?)). The optimal prediction function for z(x0), given /?, is given by (4.10) appropriately modified with 0o = θ(xQ;β), θ_ = 0n(z;/?) and d, Σzz, the specified known functions of β. The optimal prediction function for z(x0) can then be obtained by solving (5.1) for the maximum quasi-likelihood estimate, J3QLI say, and inserting this into the modified version of (4.10). In general, solution of (5.1) will require a numerical solution. However, in the special case where θ{x]β) = Σ j = i fji^βj-, an explicit predictor can be found and shown to correspond to the universal kriging equations of Matheron. However, this derivation is more general in that it allows non-linear functions of β_ and additionally the covariance function may be a function of β. Finally, we remark that a partially Bayes approach similar to that of Godambe (1994) could be developed for spatial prediction. This involves the combination of estimating functions based on prior information about the mean function together with the quasi-score function based on the data. Although Godambe emphasizes parametric estimation he gives a brief illustration how this may be extended to forecasting of a future value of a branching process.
6
Optimal Prediction Functions Based on Likelihood
Whereas the Bayes approach to prediction is logically very appealing except for the sticking point of the prior, likelihood approaches are considerably murkier! Bjornstad (1990) gives a recent extensive survey outlining 14 different predictive likelihoods! This proliferation of definition suggests that predictive likelihood rests on somewhat shaky logical foundations. Bayarri, De Groot and Kadane (1987), in a provocative paper, question whether likelihood itself can be rigorously defined, in general, and much of their critique refers to situations in which prediction is of importance; see also, Berger and Wolpert (1988) for a discussion of this. In a sense, the entire parameter 0 is a nuisance parameter here so it is natural to consider elimination
PREDICTION AND GEOSTATISTICS
363
methods for nuisance parameters analogous to conventional methods for the parameter of interest. Basically, the predictive likelihoods considered by Bjornstad fall into three categories: (i) elimination by profiling, (ii) elimination by conditioning on "sufficient" statistics for the (nuisance) parameter, (iii) elimination by integration. Method (i) entails the usual difficulties entailed in inserting MLEs for the nuisance parameter. Method (iii) is similar to the Bayesian approach and provides a probability distribution for the parameter (e.g. the Bayesian predictive density in section three is an instance of this). However, where a fiducial distribution is available for the parameter, Kalbfleisch (1971) shows how to obtain a predictive distribution for the future observation. This is related to Fisher's (1956) original argument. Method (ii) is a method for ordering the plausibility of future values proposed independently by Lauritzen (1974) and Hinkley (1979) and more generally by Butler (1986). Another method, not discussed by Bjornstad, is that of marginal predictive likelihood discussed very briefly by Butler (1986). In this section we discuss this from the predictive estimation function point of view and conjecture that under certain model assumptions the marginal predictive score function has a certain optimality property. Let x_ = (xi, ...,α;n) be the current data and z a scalar future observable. Suppose there exists a transformation where (T, A) is sufficient for θ in the model fχ,z(x,z',θ), A = A(x, z) is ancillary for θ, i.e. possesses a distribution not involving 0, and T = T(x) is sufficient for θ in the model fχ[x;θ). Suppose also that the likelihood factorizes as
fA(a(x,z))fτlA(T(x)\A;θ).
(6.1)
If, in addition, T(x) is complete given A (conditionally complete) Basu's theorem (1959) implies that T and A are independent so that the final factor in (6.1) is the same as fτ(T(x)\θ). The last two components of the above factorization in essence separate the "data" (#, z) into two components, fτ\λ(') which is relevant to inference about θ based on the sufficient statistics T(#)> and /Λ( ) which provides information relevant to predicting z given x. Classical frequentist prediction intervals often involve inversion of a function such as A(x, z) which is sometimes referred to as a predictive pivot. If the above factorization applied, it seems reasonable to consider the class of prediction functions which involve the data through A alone. We conjecture that the optimal prediction function in this class would correspond to the marginal predictive score function
364
DESMOND
based on differentiation with respect to z. A proof might be constructed motivated by arguments of Lloyd (1987). He considers estimation of a parameter of interest θ in the presence of a nuisance parameter φ and shows that the marginal likelihood based on a maximal ancillary for φ provides the optimal estimating function for 0, if the remainder of the likelihood is conditionally complete. In our case, the nuisance parameter is the entire parameter θ and the quantity of interest is z. As a simple example of a situation to which the factorization applies consider #i, ...,xn and z as independently distributed N(θ, 1), where z is to be predicted. Then (6.1) applies with T(x) = x and A(x, z) = z — x. An alternate factorization to (6.1), which is sometimes available, is when a statistic T(x, z), sufficient for θ, is available in the joint model for x and z. Predictive likelihoods which eliminate θ by conditioning on T can then be constructed along the lines of Hinkley (1979) or more generally Butler (1986). Such a factorization is available, for example, in exponential families. There is a parallel here with similar conditioning arguments for parametric estimation, where nuisance parameters can be eliminated by conditioning with respect to statistics sufficient for the nuisance parameter. In the predictive case, we note that the conditioning statistic Γ is a function of the future observation z, which corresponds to the "parameter" of interest in the estimative case. It is an open question whether optimal prediction functions based on conditional predictive likelihood can be constructed.
7
Discussion
This paper has considered the problem of predicting observables from a different perspective than more classical formulations. In the classical approach the two common desiderata are: (i) unbiasedness of the predictor, and (ii) minimization of prediction mean-squared error. As regards (i), the concept of unbiasedness is different from the usual unbiasedness concept for an estimator of a fixed parameter, as pointed out by Robinson (1991). The "ideal" predictor, if available, is the mean of the predictive distribution of the unobserved variable given those observed. Restrictions to linear unbiased predictors lead to best linear unbiased predictors, commonly referred to as BLUPs. By contrast, in the formulation given here, (i) above is replaced by: (i') unbiasedness of the prediction function and (ii) is replaced by: (ii') minimization of criterion (2.4). Since (Γ) need not imply (i), biased predictors are possible in this formulation. Similarly (ii) and (ii') are equivalent only for a proper subclass of prediction functions so that the formulation via prediction functions is more general. The "ideal" predictor, if available, corresponds to the mode of the predictive distribution. If this is unavailable, projection into various sub-
PREDICTION AND GEOSTATISTICS
365
spaces of prediction functions produces locally optimal prediction functions which are closest to the ideal prediction function in the L2 norm defined by the covariance inner product defined on the space of prediction functions. The classical prediction theory has an analogous projection formulation but with a different inner product. I have shown here that the new formulation reproduces classical prediction results in the subset of cases where they coincide. Godambe and Kale (1991), in the case of parametric estimation, show that the estimating function approach reproduces classical optimality results such as the Gauss-Markov theorem in elementary cases for the linear model, but offers a good deal more generality for non-linear models and quasi-likelihood models where the original Gauss-Markov approach fails. Those authors successfully develop an extended Gauss-Markov theory for these cases which is logically equivalent to the original Gauss-Markov theory for the elementary case. The development of analogous extensions for prediction functions will be treated in a separate publication. Areas of potential application being considered include non-linear geostatistical problems such as disjunctive kriging and transgaussian kriging.
Acknowledgement This research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada, Grant No. A85584.
References
Aitchison, J. and Dunsmore, I. R. (1975). Statistical Prediction Analysis. Cambridge: Cambridge University Press. Bayarri, M. J., De Groot, M. H. and Kadane, J. B. (1987). What is the likelihood function? (with discussion). In Statistical Decision Theory and Related Topics IV, vol. 1, Gupta, S. S. and Berger, J., Eds., SpringerVerlag, New York. Basu, D. (1959). The family of ancillary statistics. Sankhya, 21, 247-256. Berger, J. O. and Wolpert, R. L. (1988). The Likelihood Principle. Hayward, California. Institute of Mathematical Statistics. Bjornstad, J. F. (1990). Predictive likelihood: a review. Statistical Science, 5, 242-265.
366
DESMOND
Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day. Butler, R. W. (1986). Predictive likelihood inference with applications (with discussion). J. Roy. Statist. Soc, B, 48, 1-38. Butler, R. W. (1990). Comment on Bjornstad, J. F. Predictive likelihood: a review. Statistical Science, 5, 255-259. Cressie, N. (1990). The origins of kriging. Mathematical Geology, 22, 239252. Cressie, N. (1991). Statistics for Spatial Data. John Wiley and Sons, New York. De Finetti, B. (1937). La prevision: ses lois logiques, ses sources subjectives. Annales de ΓInstitute Henri Poincare, 7, 1-68. De Finetti, B. (1974, 1975). Theory of Probability, Volumes I and II. Wiley, New York. Ferreira, P. (1982). Estimating equations in the presence of prior knowledge. Biometrika, 69, 667-669. Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Phil. Trans. Roy. Soc, London, Ser. A, 222, 309-368. Fisher, R. A. (1956). Statistical Methods and Scientific Inference. Edinburgh: Oliver and Boyd. Geisser, S. (1993). Predictive Inference: An Introduction. Chapman and Hall, London. Ghosh, M. (1990). On a Bayesian analog of the theory of estimating functions. In C. G. Khatri Memorial Volume of the Gujarat Statistical Review, 17A, 47-52. Godambe, V. P. (1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist, 31, 1208-1211. Godambe, V. P. (1985). The foundations of finite sample estimation in stochastic processes. Biometrika, 72, 419-428. Godambe, V. P. (1994). Linear Bayes and optimal estimation. Technical Report Series, Dept. of Statistics, University of Waterloo, Stat-94-11. Godambe, V. P. and Kale, B. K. (1991). Estimating functions: an overview. In "Estimating Functions", V. P. Godambe, Ed., Clarendon Press, Oxford. Hinkley, D. V. (1979). Predictive likelihood. Ann. Statist, 7, 718-728. Corrigendum 8, 694. Journel, A. G. and Huijbregts, C. J. (1978). Mining Geostatistics. Academic Press, London. Kalbfleisch, J. D. (1971). Likelihood methods of prediction. In Foundations of Statistical Inference, V. P. Godambe and D. A. Sprott, Eds., Holt, Rinehart and Wilson, New York, 378-392. Kolmogorov, A. N. (1941). Interpolation and extrapolation of stationary random sequences. Izvestiia Akademii Nauk SSSR, Seriia Matematish-
PREDICTION
AND GEOSTATISTICS
367
eskiia, 5, 3-14. Lauritzen, S. L. (1974). Sufficiency, prediction and extreme models. Scand. J. Statist, 1, 128-134. Lindley, D. V. (1990). The 1988 Wald memorial lectures: The present position in Bayesian statistics (with discussion). Statistical Science, 5, 44-89. Mathiasen, P. E. (1979). Prediction functions. Scand. J. Statist, 6, 1-21. Matheron, G. (1962). Traite de Geostatistique Appliquee, Tome 1, Memoires du Bureau de Recherche Geologiques et Minieres, No. 14, Editions Technip, Paris. Matheron, G. (1976). A simple substitute for conditional expectation: The disjunctive kriging. In Advanced Geostatistics in the Mining Industry, M. Guarascio, M. David and C. Huijbregts, Eds., Reidel, Dordrecht, 221-236. Naik-Nimbalkar, U. V. and Rajarshi, M. B. (1995). Filtering and smoothing via estimating functions. J. American Stat. Assoc, 90, 301-306. Pearson, K. (1920). The fundamental problem of practical statistics. Biometriko 13, 1-16. Ripley, B. D. (1981). Spatial Statistics. Wiley, New York. Robinson, G. K. (1991). That BLUP is a good thing: The estimation of random effects (with discussion). Statistical Science, 6, 15-51. Stigler, S. M. (1976). Discussion of Savage, L. J., On rereading R. A. Fisher. Ann. Stat, 4, 441-500. Thavaneswaran, A. and Thompson, M. E. (1986). Optimal estimation for semi-martingales. J. Appl. Prob., 23, 409-417. Thavaneswaran, A. N. and Thompson, M. E. (1988). A criterion for filtering in semi-martingale models. Stoch. Processes and their Applications, 28, 259-265. Wiener, N. (1949). Extrapolation, Interpolation and Smoothing of Stationary Time Series. MIT Press, Cambridge, MA.
369
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
EFFICIENCY OF THE PSEUDO-LIKELIHOOD ESTIMATE IN A ONE-DIMENSIONAL LATTICE GAS J. L. Jensen University of Aarhus Denmark Abstract For a simple one-dimensional lattice gas we consider the efficiency properties of the maximum pseudo-likelihood estimate. We show that the pseudo-likelihood estimating function is not optimal within a natural class of estimating functions, although numerical investigations show that it is very close to being optimal. We also show that the pseudo-likelihood is far from being efficient when there is strong dependence in the model. Key words: Efficiency; estimating function; Gibbs model; pseudo-likelihood.
1
Introduction
In the field of stochastic processes it is often not possible to give the likelihood function in an explicit form. Instead one uses estimating functions, and it seems natural to look for an optimal estimating function within a class of such functions. A theory for this has been developed in Heyde (1988). For martingale estimating functions an application of these ideas can be found in Bibby and S0rensen (1996). In this paper we will try to use these ideas in the setting of Gibbs lattice models. Such models are defined through interactions between neighbouring points and typically there is a norming constant in the distribution that d cannot be calculated explicitly. For the lattice Z , d > 1, there is also the possibility of phase transitions and the maximum likelihood estimate need not be asymptotically normally distributed. Due to these problems other estimating procedures have been considered. Besag (1975) introduced the pseudo-likelihood function, which only uses local conditional distributions. It has been shown recently (Guyon and Kΐinsch 1992; Jensen and Kύnsch 1994) that the maximum pseudo-likelihood estimate admits a random norming so that the limiting distribution is normal. The efficiency of the maximum pseudo-likelihood estimate seems largely not to have been investigated. The
370
JENSEN
paper of Guyon and Kύnsch (1992) contains a comparison offiveestimators, including the pseudo-likelihood estimator, for a model slightly different from the model considered in this paper. What we propose to do here is to view the pseudo-likelihood estimating function as one in a class of estimating functions, and then to find the optimal choice within this class. The term pseudo-likelihood is here adopted from Besag (1975) although the contrast function used has no direct relation to a proper likelihood function. Also the reference to martingale estimating functions (Heyde, 1988) is indirect. The estimating function here is not a martingale, but it has the property that the individual terms have conditional mean zero, where the conditioning set consists of all but one variable. As in the martingale case this property ensures the consistency of the estimate. However, the conditional-mean-zero property is not something that is shared with the true likelihood. To be able to perform all the calculations we will only consider a onedimensional Gibbs model. We can then find the optimal estimating function within the class considered, and it turns out that the pseudo-likelihood is not optimal. However, for the model considered, the maximum pseudo-likelihood estimate is very close to being optimal. For the simple model considered it is also possible to compare the pseudo-likelihood with the true likelihood. When the interaction in the model is strong we find that the efficiency of the pseudo-likelihood estimate is very poor. Furthermore, an attempt to improve on this by extending the pseudo-likelihood idea turns out to give only a minor improvement. It is an open problem whether these conclusions carry over to higher dimensions.
2
The Gibbs Model
Let X{ E {—1,1}, Ϊ G Z , be a lattice gas, where X{ interacts with the four nearest neighbours (X;_2,Xi_i,Xi+\,Xi+2) The conditional specifications are given by P{Xi = Xi I (Xi-2,Xi-l,Xi+l,Xi+2) = {xi-2,Xi-l,Xi+l,Xi+2)) (2-1) = {2cosh[/3(a;i_2 + Xi-ι + Xi-ι + Xi+2)]}'1 exp{βxi(xi-2
+ x%-\
and for I > k I (Afc,
„
~
«.
~ \
Xk-2,Xk-l,Xl+l,Xl+2) + 1
i=k
371
PSEUDO-LIKELIHOOD ESTIMATION
With n = l — k+1 the evaluation of Z/_& involves a sum with 2n terms, and so for n large it is not feasible to evaluate £/_&• Instead of using the likelihood function for estimating β we use the pseudo-likelihood function exp(pln(β)). The latter is given as the product of the conditional densities in (2.1), see χ
Besag (1975). Let t h e observations be x _ i , x o ? ^ i 5 ••• , mXn+iiXn+2,
then
- log[2coβh(/fri)]},
ΣxiSi i=l
where Si = X{-2 + Xi-1 + Xi+l + #i
Prom this we find
Pl'n(β) = Σ ^ - t a n h ^ ) ] ^ , -Pl'ή(β) = Ecosh(/? S i Γ 2 S f.
(2.3) (2-4)
2=1
The question we want to investigate is whether the estimating equation (2.3) has some optimality properties? The form of (2.3) suggests a slightly more general class of estimating functions, namely n
Γn(# g) = 5 > i - taah(fi8i)]g{si', β).
(2.5)
This class of estimating functions has the important property that each term in the sum (2.5) has conditional mean zero, that is E{[Xi - t*ήh(βSi)]g(Snβ) I Xj : jφ i) = 0.
(2.6)
This property is essential when proving asymptotic normality of the estimate in related, but more complicated models, where one has the possibility of phase transitions (see Jensen and Kύnsch, 1994). What we want to consider is whether is the optimal choice of g in (2.5). Optimality is here defined in terms of having the smallest asymptotic variance of the estimate. For 2 < i < n - 1 we have from (2.6) that E{[Xt - tanh{βSi)]g(Si', β)Tn(β;g)}
= EYi{Yi-2 + Yχ-ι + Yi + Yi+ι
where Yi = [Xi - ta,nh(βSi)]g(Si\β). We therefore find that
H{g) = J ^ V a r (-Lr n (/?; f f )) = EYX{YX + 2Y2 + 2Y3}.
(2.7)
372
JENSEN
Minus the derivative of (2.5) with respect to β is
We therefore have from (2.6) that Sl9
lim E (-jn(β;9)) = E }?*L®. vy; 2 n-κ» Vn '/ cosh(0SΊ) Combining (2.7) and (2.8) we find J(g)
=
(2.8) v
;
where β is the solution to Tn(β;g) = 0. We want to minimize H(g)/J(g)2 with respect to g. For symmetry reasons in (2.5) we only consider functions with g(—s^; β) = —g{sϊ, β) for Si φ 0. We now argue that the smallest variance is obtained with g(0; β) = 0. Intuitively this is clear from (2.5) as those terms with Si = 0 give no information on β. More precisely, we can argue as follows. Write U{ = 1^1 (Si φ 0) and Vi = Yil{Si = 0). Both these terms have conditional mean zero and therefore lim
Vbi(^=
=
lim Var ( - ^
(2.9) The conditional specification in (2.2) implies that —X ~ X, that is, inverting the signs of all the X^s does not change the distribution. Since Ui(-X) = Ui{X) and Vi(-X) = -Vi(X) we find that the third term in (2.9) is zero. The second term in (2.9) is proportional to g(0;/3)2 and so we get the minimum for g(0; β) = 0. Since also J(g) does not depend on g(0; β) we get that H(g)/J(g)2 is minimized for g(0]β) = 0. Normalizing g by setting g(4; β) = 4 we end up considering the class of estimating functions (2.5) with g belonging to {g : {-4, -2,0,2,4} -> R | g(0) = 0, (-*) = -(*),5(4) = 4}.
(2.10)
We therefore only have one free parameter (2; /3) that we want to choose so that H(g)/J(g)2 is minimized.
373
PSEUDO-LIKELIHOOD ESTIMATION
3
A partial evaluation of the limiting variance
In this section we give a partial derivation of H(g) and J(g) from (2.7) and (2.8), respectively. We will perform the calculation by conditioning on (x~i, XQ, £4, £5). The conditional distribution has 8 states with probabilities given by (2.2), and the conditional means of the terms in (2.7) and (2.8) can be written explicitly. There are 16 possibilities for (X-I,XQ,X4,X$), but these can be paired two by two using that —X ~ X. We therefore basically have 8 different conditional distributions to consider. As an example consider the case (x-ι,xo,X4,xs) = (1,1,1,1). With ξ = g{2\ /?), r = tanh(2/3), and t = tanh(4/?) the conditional mean of Yι(Y\ + 2Y2 + 2Y3) is - t)2 t)2 + 6(1 - r)ψ
- 32£(1 - r)(l +1)} + 6e" 3/? (l + r)2ξ2} .
Calculating all the conditional means and using the notation P(ij; kϊ) = P(X-ι = i, Xo = j , X± = fc, X 5 = 0 we obtain
+ξ [-32e^(l - r)(l + t)] + ξ2 [6e^(l - r ) 2 + β e ^ l + r)2}}
+e 3 / J ((l + r ) 2 - 4(1 - r)(l + r)) + e ^ ί l - r ) 2 + 5e- 5/3 (l + r) 2 ]}
+ξ [8e7/3(l - ί)(l - r) - 16e^(l - r)(l + ί)] + ξ2 [e3β(3(l - r) -2(1 - r)(l + r)) + e-^((l - r ) 2 + (1 + r) 2 ) + be~^(l + r)2]} -r)2 + r)2 + 3(1 - r ) 2 - 2(1 - r)(l + r)) + 3e" 3 ^(l + r) 2 ]}
+ξ2 ^ ( 2 ( 1 - r ) 2 - 4(1 + r)(l - r)) + 6 e " ^ ( l + r) 2 ]} -
ί
)
2
5
'
2
2
374
JENSEN 3
5/3
2
3β
+ξ [8e "(l - r)(l - ί) + 16 e - (l + ί)(l + r)] + ξ [e (l - r)
+e-0((l - r)2 + 3(1 + rf - 6(1 - r)(l + r)) + e'5β(l + r)2}}
2
2
3/3
2
- r ) + (1 + r ) - 2(1 - r)(l + r)) + 3e~ (l + r) ]} 2
2
A similar calculation, with φ = 16/cosh(4/?) and φ = 2/cosh(2^) , gives the formula
^
]
μ
2P(-11; 11) pjβ
+
i P 3j3 _j_ ftp—/? _i_ p—5/9
]
}
(3.2)
+ e3^ + e-' +
- ,i-i;i-i)
+ Writing H(g) = α 0 + a\ξ + a,2ξ2 and J(^) = CQ + c\ξ we find that the asymptotic variance H(g)/J(g)2 is minimized for ξ
-=2α0c1-o1co 2αC αc
For /3 = 0 the X»'s are independent and P( ) = 1/16. Prom (3.1) and 2 (3.2) we find H{g) = 5 + jξ + gξ and J(g) = 1 + ±ξ. The optimal value of £ is then £ = 2, that is, the estimating function (2.3) is optimal in the class (2.10) when β = 0. Let us now evaluate (3.3) in the limit β —» 00. In the limit β -> 00 the distribution of X becomes concentrated in sequences with no change of sign along the sequence. One change of sign reduces the probability by a factor
375
PSEUDO-LIKELIHOOD ESTIMATION exp(—6/3). By such intuitive reasoning we find that as β -» oo 6
P(1M-1) = P(-11;11) ~ ie-W,
P ( l l -l-l) - 2e" ^,
These formulas can be proved directly from the results of the next section. Using these asymptotic relations we find from (3.1) and (3.2) that 8
α 0 - 64e" ^
12
aλ ~ -128e" ^
α2
and therefore from (3.3) We can therefore conclude that for β large the pseudo-likelihood estimating equation (2.3) is not optimal in the class (2.10). In the next section we evaluate precisely the difference between the optimal choice and the pseudolikelihood.
4
The equivalent Markov chain
For a one-dimensional lattice gas of finite range it is possible to express the distribution through a stationary Markov chain. We will use this to evaluate the probabilities P( ; ••) in (3.1) and (3.2). In (2.2) the interaction of X{ with previous values is through X{{x{-2 + Xi-\). If we pair the variables Y{ = (Yn,Yi2) = Pfo-i?-^) the interaction of Yi with previous values is through f{yi-ι,yi) = ya(yi-i,i + 2/2-1,2) + yi2(yi-ι,2 + yn) To get the Gibbs model in (2.2) we need a Markov chain for which the product of the transition probabilities equals exp ί Σ * f{y%-u Vi)) except for an initial term and a final term. This is exactly the construction used in large deviation theory for Markov chains, see e.g. Jensen (1991). We number the possible values of Y{ in the order (1,1), (1,—1),(—1,1) and (-1, -1). We let Q be the matrix with entries exp{/?/(?/i_i,yi)}, that is, /
cΛβ
1
P
-2β
p-2/3
1
1
O~2/?
rfλβ
x
1
e
e
2β e ~-2β
—2β e p-2/3
Λ
1 Λ.
1
1 x rΛβ
Let λ be the largest eigenvalue of Q and let er = (1, υ, v, 1) be a corresponding right eigenvector and e\ = (l,tι,tt, 1) a corresponding left eigenvector. We
JENSEN
376 ι
ι
then define the matrix P by P(y,z) = λ~ er{y)~ Q(y, / 1 P= λ
e
4/3
V
e
ve-W
1
z)er{z),
υ
\
-l e 2ί
1
V
V
e
/
and define the vector μ by (l,w,w,l) 2 + 2uv With these definitions we get
Σ/(w-i,y<) \ , (4.1) i=l
which gives the same model as in (2.2). The stationary distribution of Yΐ is given by μ. It is not difficult to show that
λ = I {e 4 / ? + 1 + 2e~2/3 + λ
2
+ 4e ^ + 9 + 4e~
(4.2)
_ e 4 ^ _ e-2/3
Let now
,i) +P(t,3)P(3,j) +P(t,4)P(4,i). Then we have for P( ; ••) in (3.1) and (3.2) P(ll ll) = P ( l l -ll) = μ(l){ 9 i 2 [P(2,l)+P(2,2)]+ 9 l 4 [P(4,l) P(-ll l l ) = and so forth. We are therefore now in a position to calculate (3.1) and (3.2) and the optimal value (3.3). In Table 1 we have given £ and the asymptotic variance with the optimal choice ξ and with ξ = 2 corresponding to the pseudo-likelihood estimate. It is clear from the table that although the pseudo-likelihood is not optimal in the class (2.10) it is very close to being so, and the difference has no practical importance. Because of the representation of our model as a Markov chain we can find the limiting variance of the maximum likelihood estimate /3o From (4.1) we can find the observed information based on Xχ 5 ...,X n , n = 2k. Dividing
PSEUDO-LIKELIHOOD ESTIMATION 0 0.0 0.2 0.4 0.6 0.8 1.0 1.5
377
ζ £=£ £= 2 mle double 2.00 0.50 0.50 0.50 0.51 2.04 0.33 0.33 0.29 0.33 0.44 2.03 0.46 0.29 0.46 1.86 1.67 1.67 0.69 1.53 1.64 8.41 2.39 7.86 8.36 43.32 1.46 43.78 8.76 41.96 1.19 2496.61 2518.04 205.55 2487.62
Table 1: The column ξ gives the optimal value (3.3) of ξ. The other columns give the asymptotic variance of different estimates. The column ξ = ζ is the estimate obtained from (2.5) with the optimal choice of g from (2.10), ζ = 2 is the pseudo-likelihood estimate obtained from (2.3), "mle" is the maximum likelihood estimate, and "double" is the extended pseudo-likelihood estimate obtained from (5.1).
this by n and taking the limit n-^oowe get \j^ logλ(/3), with λ given in (4.2), and therefore
In Table 1 the limiting variance of the maximum likelihood estimate has been included. As can be seen for β = 0 the pseudo-likelihood estimate is fully efficient, whereas for large values of β the efficiency is quite poor. Actually, for β ->• oo the ratio of the limiting variance of the pseudo-likelihood estimate to the variance of the maximum likelihood estimate is 9exp(2/?)/16. In statistical applications models of the form considered here are often used in situations with a weak interaction, that is, with small values of /?, see e.g. Besag (1974). In such cases we can expect the efficiency of the pseudo-likelihood to be acceptable.
5
An extended pseudo-likelihood
The pseudo-likelihood considered in Section 2 is based on the one-dimensional conditional distribution (2.1). It is therefore not surprising that the efficiency of the pseudo-likelihood estimate is poor when β is large. When β is large the dependency in the chain is very strong and this can not be seen efficiently from the one-dimensional conditional distributions. It seems of interest then to investigate what improvement we get by extending the pseudo-likelihood idea to consider the conditional distributions of two variables, say, given the remaining variables. Precisely, we base a new pseudo-likelihood exp(pZn) on
378
JENSEN
(2.2) with / = k + 1,
β
β
- log [2e cosh(β{s] + s*)) + 2e~ cosh(β(s} - 5?))]} ,(5.1) where Si = Xχ-2 + %i-l + %i+2
n
&d
S{ = Xi-\ + X%+2 + #i+3
As for pln(β) we can calculate the limiting variance of pln{β) and the limiting mean of pln{β) to get the limiting variance of the estimate β satisfying pln(β) = 0. The result can be seen in Table 1 in the column with the heading "double". As can be seen, for this one-dimensional lattice gas, the improvement is quite small and of no practical importance. It seems important to investigate if this conclusion is also true for higher dimensions. Intuitively, one feels that in two dimensions, say, much more about directional differences in the interactions can be learned from a set of nine points, say, than from a single point. To summarize, the pseudo-likelihood is based on local information and this is the basis for simple formulas and for simple asymptotic properties. However, using only local information is not a very efficient procedure when there is strong interaction. It therefore seems likely that if one can construct a more efficient class of estimating functions, one will be faced with complicated asymptotic properties. References Besag (1974): Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Roy. Statist Soc. B, 36, 192-236. Besag, J. (1975): Statistical analysis of non-lattice data. The Statistician, 24, 179-195. Bibby, B.M. and S0rensen, M. (1996): Martingale estimation function for discretely observed diffusion processes. Bernoulli, 1, 17-39. Guyon, X. and Kύnsch, H.R. (1992): Asymptotic comparison of estimators in the Ising model. In Stochastic Models, Statistical Methods and Algorithms in Image Analysis. Eds. P.Barone, A. Prigessi and M. Piccioni, Lecture Notes in Statist, 74, 177-198, Springer, New York. Heyde, C.C. (1988): Fixed sample and asymptotic optimality for classes of estimating functions. Contemporary Mathematics, 80, 241-244.
PSEUDO-LIKELIHOOD ESTIMATION
379
Jensen J.L. (1991): Saddlepoint expansions for sums of Markov dependent variables on a continuous state space. Probab. Th. Rel. Fields, 89, 181-199. Jensen, J.L. and Kunsch, H.R. (1994): On asymptotic normality of pseudo likelihood estimates for pairwise interaction processes. Ann. Inst. Statist. Math., 46 475-486.
381
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
ESTIMATING FUNCTIONS FOR SEMIVARIOGRAM ESTIMATION Subhash Lele The Johns Hopkins University and Montana State University Abstract This paper proposes the use of estimating functions based on composite likelihood for the estimation of semivariogram parameters. These estimating functions eliminate the need for the subjective specification of the range and bin width parameter at the same time retain the model robustness of the classical procedures based on the method of moments. Improvement in the efficiency can be anywhere from 50 to 100%. Further extensions and uses are discussed. They attest to the power and flexibility of the composite likelihood approach. Key words: Composite likelihood, geostatistics, kriging, restricted maximum likelihood, variogram.
1
Introduction
Various scientific disciplines require the collection and prediction of data over space. For example, in mining where the goal is to predict ore concentration over the entire study area, samples are collected at various locations. To predict concentrations at locations where the samples are not collected, geostatistics uses a technique known as kriging. Kriging produces a map of ore concentrations for the entire site which can be used for planning and operating mining activities. This same technique has applications in environmental data collection where the goal is to predict environmental degradation or clean-up based on data collected at a discrete number of monitoring locations at a site. As in mining, a useful tool for site assessment and clean-up of a contaminant site is a contour map of contaminant concentrations over the area of interest. Environmental decision makers then could use this map to identify those areas which should be excavated to protect public health,
382
LELE
those which pose little or no risk, and those where the uncertainty is large enough to warrant additional sampling. The attraction of the kriging procedure in these applications is twofold. First, it offers a statistical justification for the way it takes point data (data from locations that have been sampled) and generates a smooth, interpolated map (i.e., a contour plot) of contaminant concentrations. Second, the kriging procedure generates explicit uncertainty measures (e.g., prediction intervals) for the interpolated and smoothed estimates-both for estimates of concentrations at particular locations, and for estimates of averages within a defined area. These uncertainty measures can be used for building precise margins of safety into a decision rule. We refer the reader to Cressie (1991) or Journel and Huijbregt (1978) for the details of kriging. But the basic idea behind kriging is simple to explain. We observe the phenomenon under study only at a finitely many sampled locations. We want to predict the values of the phenomenon at the unobserved locations. Obviously we need to model the relationship between the data that are observed and the data that need to be predicted. This model generally has two components: the trend component and the spatial association component. Once such a model is specified, it can be used to make the prediction for the observations at the unsampled locations such that the prediction error is minimized. In practice, although a parametric form of a model can be specified with reasonable confidence, one has to estimate the parameters based on the observed data. The purpose of this paper is to introduce the framework of estimating functions for the estimation of the spatial association, modeled by semivariogram, and use it to compare and improve existing methods of estimation. The main results of the paper are: 1. A new class of estimation procedures based on composite likelihood (Lindsay, 1988) is introduced. It is shown that in the case of ordinary kriging, this method leads to an estimating function similar to the popular method of weighted least squares (Cressie, 1991, page 96). 2. A composite likelihood for simple kriging leads to estimating functions, described by Godambe and Thomson (1989), that combine the information in conditional mean and conditional variance structure. This is shown to lead to a substantial improvement in the efficiency. 3. Composite likelihood estimating functions can be interpreted as minimizing prediction error. Kriging is used for prediction. It makes sense that the estimation of the semivariogram should also be based on minimization of prediction error.
383
SEMIVARIOGRAM ESTIMATION
4. The use of the composite likelihood, similar to maximum likelihood and restricted maximum likelihood methods, eliminates the need for subjective choices of range and bin width parameters needed for the classical methods; at the same time, it retains the model robustness properties of the classical methods. 5. Godambe's optimality criterion is used for comparing finite sample performance of different methods. It is shown that the use of composite likelihood can improve efficiency anywhere from 50 to 100% over the classical least squares method. 6. Extensions of these ideas to the case of universal kriging as well as robustness issues are indicated.
2
Classical semivariogram estimation
We assume that the reader is familiar with the basic ideas about kriging and semivariogram. Cressie (1991) or Journel and Huijbregt(1978) are excellent references. The following is a very brief introduction to the required concepts. We use the following notation. s denotes the location coordinates. U(s) denotes the value of the process at location s. Ίu(si, sj) denotes the semivariogram for the U process. It is given by: = var{U(s))
-var(U{si)-U(sj))
= σ
Simple kriging corresponds to E(U(s)) = 0. Ordinary kriging corresponds to E(U(s)) = μ. Isotropic semivariograms: Suppose that Ίu{si,Sj) =Ίu{d(si,Sj)) that is, the semivariogram depends only on the distance between the locations. Such semivariograms are called isotropic semivariograms. The covariance, when it exists, is given by Cu(d{su Sj)) = Ίuί+oo)
384
LELE
It is obvious that in practice one has to estimate the parameters involved in the semivariogram or covariogram models using the available data. There are two statistical questions that become relevant here. One is related to the methods of estimation and the other relates to the properties of the estimators obtained from various methods. This section will describe various methods of estimation. They all can be looked upon as particular cases of the method based on estimating functions. Let us begin the discussion with two simple but important cases of simple 2 and ordinary kriging. For the sake of simplicity, let us assume that σ = 1. The classical estimation procedure for the estimation of the variogram can be described in an algorithmic form as follows. 1. Calculate all pairwise distances between locations.
2. Calculate the squared deviations for the observations (empirical semivariogram) . v{si,Sj) = -(u{si) - U{SJ))2 3. Plot v(si,Sj) (on the y-axis) versus d(si,Sj) (on the x-axis). 4. If the parametric model to be fitted is
then the parameter θ is estimated by minimizing the following
One can also use some robust criterion such as absolute deviations instead of squared deviations; or one can use the weighted least squares criterion. Usually the values of v(si,Sj) (being chisquare random variables) are quite scattered. Some smoothing of these values through the use of local averages is suggested before attempting the fitting. The suggested values of the span in the local averaging are such that there are at least 30 points within a window. This is the bin width. One also has to throw away those pairs which are 'too far' apart, that is are outside the range parameter. This method of estimation is sensitive to the bin width and range parameters.
385
SEMIVARIOGRAM ESTIMATION
Maximum likelihood (ML) and restricted maximum likelihood (REML) approach: If one assumes a Gaussian distribution, one can also use the method of maximum likelihood to estimate the parameter in the covariogram. In this case, the value of 0 is obtained by maximizing the following. L(θ,μ) = ^ log \C(Θ)\ - i(J7 - μ)TC{θ)-\U
(1)
- μ)
where U is the vector of observations, μ is the vector of means and C(0) is the covariance matrix with Cij(θ) = 0d(s*'5i). It is advisable to eliminate the nuisance parameter μ before estimating 0. This is achieved by considering the likelihood of the contrasts. Consider the vector of contrasts uc = {U(si) — U(sι),i = 2,3, ..,n}. It is easy to see that this vector corresponds to multiplying the original data vector U by a matrix A such that its first column consists of -1 and the i-th column consists of zeros except in the i-th place. Thus: V = AU~N(O,AC(Θ)AT)
The likelihood corresponding to V is quite complicated. It is given by L{θ,uc) =
^ expί-^Λ'HflK} y } cS (2) n / 2 |Φ(^)| 1 /2 yX 2 c
(2) K }
where Φ«(0) = 2ηfu(sU8ι]θ) and * y ( 0 ) = 7u(*ή*i;0) +7u(«j,*i;0) — 7u(si, SJ; 0). Notice that this is a function of 0 only. Maximizing this with respect to 0 yields the REML estimator. Following is a summary of the merits and demerits of these classical methods of estimation of variogram parameters. 1. The method of moments estimator does not require specification of the Gaussian or any particular distribution. On the other hand, it requires that the bin width and range be specified. 2. The methods of ML or REML theoretically yield an optimal estimator but require a full specification of the probabilistic model. Moreover, they involve inversion of large matrices. This can be computationally prohibitive. Uniqueness of the maximum is also not always guaranteed. 3. Although the variogram is used for the purpose of prediction, this purpose does not seem to enter the classical estimation procedures. Notice also that classical MLE implicitly includes a term that derives from prediction error.
386
3
LELE
Estimating functions based on composite likelihood
The idea of composite likelihood, although discussed in various disguises such as Pseudolikelihood (Besag, 1975) or Partial Likelihood (Cox, 1975), was developed in its own right by Lindsay (1988). There are two motivations for constructing the composite likelihoods: first, they provide a substitute method of estimation when maximum likelihood is very difficult to calculate; secondly, they sometimes represent that portion of the model we are most comfortable with modelling and the resultant estimators can be consistent even when full maximum likelihood estimators are not, a form of consistency robustness. Let us concentrate on the first aspect for now. It is quite clear that the likelihood function in equation (1) is extremely difficult to deal with computationally. The same holds true for the Restricted Likelihood based on the contrasts. A natural question to ask would be: can we approximate the likelihood function by something that behaves almost like a likelihood but is easy to deal with, both computationally and mathematically? We will start with the simplest case of 'simple kriging' and then generalize it to the case of 'ordinary kriging'. Again we assume that σ 2 = 1. 3.1
Simple kriging:
In this case, we assume that E(U(si)) υar(U(si)) cσυ(U(si),U{sj))
= 0 = 1 = (1 - Ίu(si,Sj θ))
Then, under the Gaussian assumption, it is clear that U ~ N(O,C(Θ)) and the likelihood can be written as L(θ,u) = f(u(sι),u(s2),..-,u(sn)m,θ). Now, following Lindsay (1988), suppose we approximate this likelihood by the product of two dimensional marginal densities, namely:
CLO(Θ,U)
=Π Π / K 4 Φ i ) ; e ) z=l j>i
This is what is called a 'composite likelihood' because it is a composition of two dimensional marginal likelihoods. Consider the estimating function generated by this 'composite likelihood'.
387
SEMIVARIOGRAM ESTIMATION
This is a zero unbiased estimating function. Notice that the composite likelihood involves only two dimensional densities and hence is computationally substantially simpler than the total likelihood L(θ,u). These estimating functions have a very intuitive appeal. For the sake of illustration, consider the case corresponding to the isotropic exponential variogram model. The negative log-composite likelihood is, then, given by y y L — ίuts\ _ ffHs^sj) f Λ)2+ 1 l o g ( 1 _ θ2d(SiiSJh 3 r^ • 2(1 — 02d(β<»βi))v' v ' 2 This is minimized with respect to 0. Notice that the first term of this expression is just the weighted prediction error, where u(s{) is being predicted by θd(Si's^u(sj). The second term can be interpreted as a smoothing factor or the factor that makes the estimating function unbiased. This factor, surprisingly does not depend on the assumption of Gaussianity of the underlying process but only needs existence of the variance. (In fact, a similar justification of minimizing the prediction error can be given to the full likelihood as well.) Kriging is a tool for prediction. It makes sense that we estimate the parameters of the semivariogram based on their prediction performance (Marcotte, 1995). Let us look at the estimating function generated by the composite likelihood for simple kriging. It is simple to check that it has the following form.
- 02φi'Sj)) Notice that this is a linear combination of two estimating functions; the first one uses only the first moment and the second one uses the second moment of the process. This estimating function is zero unbiased when the conditional mean is linear and the marginal variances exist. This holds for probability structures more general than the Gaussian probability structure and thus retains the model robustness of the classical estimators. This, intuitively, is also a better use of the available information than using only the second moment information as done by the classical approaches. See Godambe and Thompson (1989) for a discussion of such estimating functions. 3.2
Ordinary kriging:
In this case, we assume that E(U(Si))
= μ
388
LELE var(U(Si))
= 1
cσυ{U(8i),U{sj)) = U-7«(«i,«,;*)) We can write down a composite likelihood quite simply. Consider the product of the marginal densities of the contrasts Vij = U(si) — U(SJ), namely,
Let us look at this particular case in detail. In the Gaussian case, notice that f(Vij;θ)
=
\
exp{-
l
(U(8i) -
2
U(Sj)) }
Hence, ignoring constant terms, negative log-composite likelihood upto a constant can be written as
The estimating function corresponding to this is given by:
iriM, 'a o) ί Win) - u(sj))2
A
Notice that this estimating function corresponds very closely to the weighted least squares method (Cressie 1991, page 96, equation 2.6.12). Continuing with the theme of composite likelihoods, observe that there are many other composite likelihoods that can be considered also. For example, we may consider two contrasts at a time to get
How this affects statistical efficiency is a question of interest. In practice, one will have to achieve a balance between statistical efficiency and computational efficiency. We will discuss comparison of statistical efficiency of estimating functions in the next section.
4
Efficiency comparisons
Having defined various estimating functions, the next natural question that arises is: which estimating function should be used when? In the following,
SEMIVARIOGRAM ESTIMATION
389
we will define the optimality criterion used by Godambe (1960), now known as Godambe's criterion. We then use it to choose an estimating function for semivariogram estimation. We will illustrate its use in a single parameter case. Godambe's optimality criterion We will not provide all the regularity conditions explicitly here. The details can be found in Godambe (1960). Let Θ denote the parameter space. Let G be the class of all zero unbiased estimating functions, that is, if g G G, then EQ(g(U, θ)) = 0 for all θ G Θ. Let us also assume that g € G are differentiate and all the relevant expectations exist. Then information content in g regarding the parameter θ is given by
Given two zero unbiased estimating functions g\ and #2? it is now easy to compare their performance. One can plot Infn(g\;θ) and Infn(g2m,θ) as a function of θ and choose that estimating function which is uniformly better. However, in most practical situations, there may be a certain part of the parameter space where g\ may be better than g<ι\ and on the other part 52 may be better than g\. In this situation, researcher will have to consider his prior opinion about the most likely parameter value for the data at hand and choose the relevant estimating function. One can also use other approaches such as minimax or non-informative priors or proper priors to calculate 'average information' to select an estimating function in these situations. Appropriateness of these approaches is a foundational issue and will not be discussed here. Comparison of composite likelihood and the classical method In the following we discuss the information comparison for the composite likelihood based estimating function and the classical method of estimation. The details of the model are as follows. We consider the ordinary kriging with exponential semivariogram. That is: E(U(s)) = μ Var(U{s)) = 1 Cσυ{U(3i),U(sj))=θd{ai'8j) We assume the underlying probabilistic model to be Gaussian. Tables 1, 2 show the information comparisons at various values of θ for 4 x 4 and 8 x 8 regular grids (increasing domain asymptotics) and table 3 shows the information comparison for the 8 x 8 grid nested inside a 4 x 4 grid (infill asymptotics). It is quite clear that the composite likelihood
390
LELE
estimating functions are substantially better, an increase in efficiency of about 50%, than the least squares approach. In table 4 we compare the composite likelihood for the simple kriging with the least squares approach. Recall that, in the case of simple kriging, composite likelihood uses both the conditional mean structure and the conditional variance to obtain estimating functions. Prom table 4 it is clear that the efficiency gains are substantial, to the order of 75%. Multiparameter extensions of this are straightforward and are not considered here. Recently Lele and Curriero (1997) have shown that the predictive performance of composite likelihood based estimation of variogram is comparable with the traditional approach.
5
Further extensions
In this section we will discuss various extensions of the use of composite likelihood. Geometric anisotropy and non-euclidean distances Curriero (1996) introduces the use of noneuclidean distances in variogram modelling. Lele and Curriero (1997) extend the use of composite likelihood approach to automatically estimate geometric anisotropy in the data. Universal kriging and Intrinsic Random Function kriging In practice, it is seldom the case that the mean is known. If the mean is constant, we saw how composite likelihood uses the marginal distribution of contrasts U(si) — U(SJ), which is independent of μ, to obtain the semivariogram parameters. This was shown to be related to the weighted least squares approach. Now suppose that the mean structure is given by E(U(s)) = X(s)β where X(s) is a vector of known covariates such as elevation, rock type etc. The vector parameter β is a nuisance parameter. It is well known (Cressie, 1991; page 153) that knowledge of β is inessential for kriging predictor as long as the semivariogram is known completely. Unfortunately that is seldom the case. The semivariogram is estimated based on the residual obtained by first estimating β using least squares approach. This estimate is also known to be biased (Cressie, 1991; page 167). One approach that overcomes this bias is REML. We consider contrasts of the observations such that each contrast has mean 0. Suppose U~N(Xβ,C(θ)) Construct a matrix A such that AAT = I- X{XτX)'ιXτ Then: V = AU~N(0,AC(θ)Aτ)
and ATA = /.
SEMIVARIOGRAM ESTIMATION
391
Thus the distribution of V is independent of β. The likelihood for V can be written as: L(θ, υ) = f(vι,v2,..., υn-p θ) where p is the number of covariates X. Utilizing ideas of composite likelihood, we can approximate the above likelihood in several ways. Since the marginal distribution of V{ depends on θ, the simplest possibility is:
CX2(M)= Uftoθ) Following the discussion of the corresponding composite likelihood for ordinary kriging, it can be seen that this essentially generalizes the weighted least squares approach for semivariogram estimation from ordinary kriging to universal kriging but without the need of estimation of the trend parameters and hence avoiding the bias considerations. Our initial studies show that CL2{θ,v) depends weakly on θ and hence is not very informative. The bivariate composite likelihood described below, however, seems to be fairly informative. As before, we could consider pairs of υ^s instead of singletons, to obtain 1
CL3(θ,v)=1[[l[f(v [[ i,vj',θ) 2=1 j>i
One would expect a substantial gain in efficiency parallel to the gains reported in table 4. Intrinsic random function kriging is very similar to universal kriging in spirit. See Cressie (1991, pages 299-309) for detailed description. The usual method of estimation of the generalized covariance function is based on REML. It is clear that composite likelihood should be applicable to the problem of estimation of the generalized covariance functions. Nonparametric semivariogram estimation Practitioners are reluctant to specify a particular model for semivariogram. Recently there have been several papers (Shapiro and Botha, 1991; Cherry, 1995; Lele, 1995) proposing methods for nonparametric semivariogram estimation. It is known (Schoenberg, 1938) that the class of all semivariograms corresponds to a mixture of some kernel semivariogram. The methodology of composite likelihood is easily applicable in such a situation. Moreover it is possible to fit such models using composite likelihood even in the case of universal kriging or intrinsic random function kriging. Robustness issues A legitimate concern of the practitioners is that these estimating procedures may not be robust against outliers. Lindsay (1994) discusses the issue
392
LELE
of robustness versus efficiency in terms of estimating functions. The weighted estimating functions considered by Lindsay (1994) have a form very similar to the estimating functions obtained from composite likelihoods. Consider equation 3. Let
Then a modified version of the above estimating function may be written as
LΣ If we take A(δ) = <J,we recover the original estimating function. If we take Ά(<ί) = ' + j + i ' ~ 1 ? w e r e c o v e r power weighted divergence family described by Cressie and Read(1984). Different values of λ lead to different robustness properties. These robust equations extend easily to the universal kriging case.
6
Summary
This paper proposes the method of composite likelihood for the estimation of semivariogram parameters. Several advantages of this method are outlined. 1. This method eliminates the need for the specification of the bin width and range parameters, making it automatic and objective, at the same time retains the model robustness of the classical approach. 2. Composite likelihood estimating functions have a very intuitive justification of minimizing the prediction error. The ultimate use of variograms is for prediction. It makes sense to estimate the variograms in such a manner that the prediction error is minimized. 3. The efficiency gains obtained by this method are shown to be substantial. 4. The flexibility of this method can lead to better estimation in the case of universal and intrinsic random function kriging. This flexibility also allows for estimation of mixtures of semivariograms in the presence of trend. Robustness to outliers also can be achieved fairly easily.
SEMIVARIOGRAM
ESTIMATION
393
7 Acknowledgments This work was partially supported by a grant from DOE (DE-FC07-94, Professor Daniel Goodman PI). I would like to thank Professor Goodman for his generous support, encouragement and many insightful comments. References Besag, J. E. (1975) Statistical analysis of non-lattice data. The Statistician, 24, 179-195. Cherry, S. (1995) Nonparametric estimation of variogram. Doctoral dissertation submitted to the Department of Mathematical Sciences, Montana State University, Bozeman, Montana. Cox, D. R. (1975) Partial likelihood. Biometrίka,62, 269-276. Cressie, N.A.C. (1991) Statistics for spatial data. John Wiley and Sons, NY. Cressie, N.A.C. and T.R.C. Read (1984) Multinomial goodness of fit tests. J. Roy. Statist. Soc. Ser. B 46, 440-464. Curriero, F. (1996) The use of non-euclidean distances in geostatistics. Doctoral dissertation, Department of Statistics, Kansas State University, U.S.A. Godambe, V. P. (1960) An optimum property of regular maximum likelihood estimation. Ann. Math.Stat., 31, 1208-1212. Godambe, V. P. (ed.) (1991) Press, NY.
Estimating Functions. Oxford University
Godambe V. P. and Kale, B. K. (1991) Estimating functions: An overview. In Estimating Functions, Godambe V. P. eds. Oxford University Press, NY. Godambe V. P. and M. Thompson (1989) An extension of quasi-likelihood estimation (with discussion). J. Stat. Plan, and Inf. 22, 137-172. Journel A. G. and C. J. Huijbregt (1978) Mining Geostatistics. Academic Press, London. Lele, S. (1995) Inner product matrices, kriging and nonparametric estimation of variogram. Math. Geology 27(5), 673-692. Lele, S. and F. Curriero(1997) A composite likelihood method for semivariogram estimation, (submitted manuscript)
394
LELE
Lindsay, B. G. (1988) Composite likelihood methods. Contemporary Mathematics, 80, 221-239. Lindsay, B. G. (1994) Efficiency versus robustness: The case for minimum Hellinger distance and related methods. Ann. Statist. 22, 1081-1114. Marcotte, Denis (1995) Generalized Cross-Validation for Covariance Model Selection. Math. Geology 27(5), 659-672. Schoenberg, I.J. (1938) Metric spaces and completely monotone functions. Ann. Mathematics 39(4), 811-841 Shapiro A. and J.D. Botha (1991) Variogram fitting with a general class of conditionally nonnegative definite functions. Computational Statistics and Data Analysis 11, 87-96.
SEMIVARIOGRAM ESTIMATION
395
Table 1: Efficiency comparison between the classical least squares method and the composite likelihood based method for a 4 x 4 grid. The second two columns are Monte -Carlo estimates of the information in the estimating functions and the third column is the ratio of the informations or the efficiency gain. Parameter
Least squares
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
4.44 4.55 5.75 6.65 7.79 12.70 24.03 53.74 223.265
Composite likelihood Efficiency gain 4.49 5.39 7.73 9.20 11.83 18.99 37.41 81.73 342.31
1.01 1.18 1.34 1.38 1.52 1.50 1.56 1.52 1.53
Table 2: Efficiency comparison between the classical least squares method and the composite likelihood based method for an 8 x 8 grid (increasing domain). The second two columns are Monte-Carlo estimates of the information in the estimating functions and the third column is the ratio of the informations or the efficiency gain. Parameter
Least squares
Composite likelihood
Efficiency gain
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
14.63 12.75 11.42 10.94 14.12 15.10 24.43 50.76 212.052
15.81 16.11 16.85 18.37 25.95 30.14 48.80 101.32 386.67
1.08 1.26 1.48 1.68 1.84 2.00 2.00 1.99 1.82
396
LELE
Table 3: Efficiency comparison between the classical least squares method and the composite likelihood based method for an 8 x 8 grid at a distance 0.5 (infill asymptotics). The second two columns are MonteCarlo estimates of the information in the estimating functions and the third column is the ratio of the informations or the efficiency gain. Parameter
Least squares
Composite likelihood
Efficiency gain
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
29.83 15.94 11.36 10.42 12.35 15.59 28.14 42.91 253.018
43.79 27.80 21.33 20.94 25.23 28.48 54.22 83.69 412.54
1.47 1.74 1.88 2.00 2.04 1.83 1.93 1.95 1.63
Table 4: Efficiency comparison for a 4 x 4 grid in the case of simple kriging. Here composite likelihood corresponds to a combination of linear and quadratic estimating functions. The second two columns are MonteCarlo estimates of the information in the estimating functions and the third column is the ratio of the informations or the efficiency gain. Parameter
Least squares
Composite likelihood
Efficiency gain
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
4.40 4.34 5.46 6.12 8.49 11.71 21.61 53.364 220.301
23.84 15.64 16.29 12.60 15.01 22.61 34.73 79.22 322.51
5.42 3.60 2.98 2.06 1.77 1.93 1.61 1.48 1.46
399
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
ESTIMATING COVARIANCE MATRICES USING ESTIMATING FUNCTIONS IN NONPARAMETRIC AND SEMIPARAMETRIC REGRESSION R. J. Carroll and S. J. Iturria Texas A&M University R. G. Gutierrez Southern Methodist University ABSTRACT We use ideas from estimating function theory to derive new, simply computed consistent covariance matrix estimates in nonparametric regression and in a class of semiparametric problems. Unlike other estimates in the literature, ours do not require auxiliary or additional nonparametric regressions. Key Words: Estimating equations; kernel regression; nonparametric regression; plug-in semiparametrics; smoothing.
1
Introduction
Estimating functions form a powerful methodology for parametric analyses. Their use in nonparametric and semiparametric problems is less developed. Here we use estimating equations to derive standard error estimates in these contexts. The first problem is ordinary nonparametric local polynomial regression. It has not been generally appreciated that these estimates are in fact solutions to estimating equations, a point which was first noticed by Carroll, Ruppert & Welsh (1996). We show how their looking at this problem via estimating equations leads to a new sandwich-type covariance matrix estimate. The second problem is semiparametric regression, of a type we call "plugin" (defined later in the paper). In semiparametric problems, estimation of a parameter is often of most interest. One way to obtain a covariance matrix for the estimated parameter involves a two-step process: (a) derive an asymptotic expression, usually involving a suite of densities and additional nonparametric regressions; and (b) estimate each term in turn. We show how
400
CARROLL, ITURRIA AND GUTIERREZ
Gutierrez & Carroll (1996) use estimating equations in a one-step process, leading to consistent covariance matrix estimates under minimal assumptions, and without the need for additional nonparametric regressions.
2
Ordinary Nonparametric Regression
Ordinary nonparametric regression is ideally suited to development of estimating functions. For example, consider local polynomial regression of order p, in regression Y on X. Based on a sample of size n, local polynomial estimates of Θ(XQ) = E(Y\X = XQ) are formed by minimizing
ί=l
j=0
where w(x,xo) is a weight function, e.g., kernel weights or loess weights. The estimated function is Θ(XQ) = /?o Defining Gp(x) = (I,a:,ίc 2 ,...,α; p ) t and differentiating, local polynomial regression solves
0 = f > ( X i , soHVi - Σ / W - χo)J}GP(Xi - *o). t=l
(2.1)
j=0
Carroll, et al. (1996) noted that (2.1) is an estimating equation, and they use this fact to develop a general theory of nonparametric regression which includes both much of the current literature as well as many new ideas. The estimating function is not unbiased in the usual sense, because the true mean Θ(XQ) has been replaced by its local polynomial approximation Σϊj=oβj(X ~~ χoV However, asymptotically, as the weights become more concentrated at zo, the estimating function becomes unbiased. Routine application of Godambe's estimating function theory suggests that Θ(XQ) — Θ(XQ) is asymptotically normally distributed with mean zero and variance (1,0,..., O μ - 1 (xo)Bn(xo)A-ι(xo)(l,
0,..., 0)*,
(2.2)
where An(x0)
= E{Σ™(XuXo)Gp(Xi
- z o )G£(Xi - x0)}
(2.3)
t=l
Bn(x0) =
E[f^w\Xi,xo)Gp(Xi-xo)Gl(Xi-xo){Yi-Σβj(Xi-xo)Ψl
<=1
j=0
(2.4)
ESTIMATING COVARIANCES IN SEMIPARAMETRICS
401
As in the usual sandwich methodology, consistent estimation of AU(XQ) and Bn(xo) is accomplished by replacing the expectations in (2.3)-(2.4) by sums over the data. This routine use of well-known parametric theory in nonparametric regression problems appears to be new, and Carroll, et al. (1996) develop this idea into contexts not previously considered in the literature. Ordinarily, researchers either (i) assume homogeneity of variance and replace 2 {Yi — Σ*j=oβj(Xi — xoV} in (2.4) by the constant global variance; or (ii) work out all the details of the asymptotics and then estimate all of the terms. This use of parametric estimating equation theory provides a powerful way of forming estimated variances without having to go through the second alternative. Here we sketch the argument of Carroll, et al. (1996) in this special case, showing that at least for kernels the estimating equation-based standard errors are asymptotically correct. The only caveat concerns bias. Since (2.1) is not an unbiased estimating function, we cannot claim that θ(xo) is consistent for Θ(XQ) without accounting for bias. In fact, estimating this bias even in this simple context has been and remains a problem of considerable interest in the kernel literature (Ruppert, 1997). It is not clear whether, or how, one can use estimating equation methodology to estimate this bias. Here is a sketch of the argument of Carroll, et al. (1996) showing the consistency of (2.2) for local linear regression (p = 1). Let σ2(xo) = Vax(Y\X = xo), which is assumed to be smooth. For kernel weights with bandwidth h, w(Xi,xo) = Kh(X — xo) = h~ιK{(X — xo)/h}, and it is well known (Ruppert and Wand, 1994) that the asymptotic variance of local linear regression is {nhfx(x0)}~1k2σ2(x0),
(2.5)
where fχ( ) is the density of X, k\ = fx2K(x)dx, k2 = f K2(x)dx, and k3 = fx2K2{x)dx. It is easily seen that (2.2) is unchanged if we replace (X - xo) by (X xo)/h and adjust the definition of β\ accordingly, in which case it can be shown that fχ{x0) (h/n)Bn(xo)
ί
Q
Z+ f i x W M
)
k
(
2
Plugging these asymptotic expressions into (2.2), we obtain (2.5) as desired. For polynomials of order p ψ 1, similar arguments apply.
402
3
CARROLL, ITURRIA AND GUTIERREZ
Plug-in Semiparametrics
Estimating equation methodology can also be used in what we call semiparametric plug-in problems to derive easily computed consistent covariance matrix estimates for parameters. These problems are derived as follows. Suppose that an estimating equation for a parameter a depends on vector-valued data Y along with a scalar-valued function Θ(X), where X is a subcomponent of Y. In this case we can write the estimating function for a as Φ{Y,α,0(X)}. By definition, a plug-in problem works as follows: θ( ) can be estimated without reference to a by a local estimating equation based on (Δ,X) and an estimating function χ( )> where Δ is another component of Y, by solving
0 =J 2 ( » o )
p
ι=l
( i
Q
) { i j=0
with θ(xo) = βo; note the similarity with (2.1). We now "plug-in" the estimated function 0( ), and solve the following equation to form an estimate α for the parameter α:
In what follows, we will ignore issues of bias, which are considered in detail by Gutierrez and Carroll (1996) and by Carroll, Fan, Gijbels and Wand (1997). Gutierrez and Carroll (1996) derive the asymptotic distribution of a in this and more general situations. The asymptotic covariance matrix depends as expected on the density of the X's as well as various further nonparametric regressions. They show that the following is a consistent estimate of the asymptotic covariance matrix of δ (the argument appears after the definitions). Remember that a may be vector-valued but that θ( ) is scalar. Define
An(a,θ) =
-Σj-t 2=1
-(l,0,...,0)Σw(Xi,x)Gp{Xi-x)Gtp(Xi-x)
Bn(x,θ) =
X2(A,υ)
= ^
Cn(x,θ) =
J2w(Xi,x)Gp(Xi-x)X{Ai,θ(x)}; 2=1
ESTIMATING COVARIANCES IN SEMIPARAMETRICS
403
X{A,θ(Xi)} n
Dn(a,θ) =
Y/An(Ai,Xi,Ϋi,a,θ)Atn(Ai,Xi,Ϋi,a,θ).
A consistent covariance matrix estimate is ' a,θ).
(3.1)
To justify (3.1), we provide the following sketch based on the arguments of Gutierrez & Carroll (1996). First note that by ordinary estimating equation
theory, θ(x) - θ(x) « B-χ{x,θ)Cn{x,θ).
Then with Φ* =
and Φ β = a - a
1=1
Interchanging indices of summation, a-a w ^ " ^ α , ^) £?=i Λ n (Δj, Jf<, ί^, α, justifying (3.1). While informal, all of these calculations are easily justified in kernel re1 2 gression with bandwidth h. Generally though, in order that n / (δ — a) = 2 4 O p (l), it is required that nh -> oo and n/ι -> 0. Certain problems weaken n/i4 ^ 0 to nh6 -> 0, see Gutierrez and Carroll (1996). Implementation of (3.1) is easy, because all the terms involved are building blocks in the estimation process. We have found in other contexts (Simpson, et al., 1997) that inference is improved if it is based on percentiles of the t-distribution with n - 2(p + q + 1) degrees of freedom, and if (3.1) is multiplied by n/{n — 2 (p + q + 1)}, where q is the dimension of a and p is the size of the local polynomial. Acknowledgment This research was supported by a grant from the National Cancer Institute (CA-57030).
404
CARROLL, ITURRIA AND GUTIERREZ References
Carroll, R. J., Fan, J., Gijbels, I. and Wand, M. P. (1997). Generalized partially linear single-index models. J. Am. Statist. Assoc, to appear. Carroll, R. J., Ruppert, D. & Welsh, A. (1996). Nonparametric calibration: nonparametric regression based on estimating equations. J. Am. Statist Assoc, to appear. Gutierrez, R. G. and Carroll, R. J. (1996). Plug-in semiparametric estimating equations. Preprint. Ruppert, D. (1997). Empirical Bias Bandwidth Selection. J. Am. Statist Assoc., to appear. Ruppert, D. and Wand, M. P. (1994). Multivariate locally weighted least squares regression. Ann. Statist, 22, 1346-1370. Simpson, D. G., Carroll, R. J., Guth, D. & Zhou, H. (1997). Interval censoring and marginal analysis in ordinal regression. J. Agric, Biol. Enυ. Statist, 1, 354-376.
405
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
ESTIMATING EQUATIONS AND THE BOOTSTRAP Feifang Hu National University of Singapore
John D. Kalbfleisch University of Waterloo
ABSTRACT We consider interval estimation of a parameter θ when the estimation of θ is defined by a linear estimating equation based on independent observations. The proposed method involves bootstrap resampling of the estimating function that defines the equation with θ replaced by its estimated value. By this process, the distribution of the estimating function itself can be approximated, a confidence distribution for θ is induced and confidence regions can be simply defined. The procedure is termed the EF (Estimating Function) Bootstrap and, under fairly general conditions, can be shown to yield confidence intervals whose coefficients are accurate to first order. A simple studentized version is also defined and, in many instances, gives a second order approximation. In a number of examples, the method is shown to compare very well with classical bootstrap procedures. The intervals produced are more accurate, the method is more stable, and it has considerable computational advantage when compared to the classical approach. A number of comments and suggestions for future research are also given. Key Words: Bootstrap, estimating functions, common means problem
1
Introduction
Over the past fifteen to twenty years, both estimating equations and the bootstrap have been very influential ideas in theoretical and applied statistics. In this article, we summarize some recent work that combines these two ideas to use bootstrap resampling as the basis of inference for estimating equations. There seems to have been relatively little work in this area. The articles by Lele (1991a,b) and Hu and Zidek (1995) are notable exceptions. A more complete discussion of the present work can be found in Hu and Kalbfleisch (1997a). Estimating equations, the topic for this volume, provides a simple framework for the estimation of parameters. Godambe and Kale (1991) provide a nice recent review of the area. Although the theory leads to important results on optimality and substantial areas of application, methods of inference are primarily based on simple asymptotic approximations with little
406
HU AND KALBFLEISCH
available for inference in small samples nor for example, extensions to higher order asymptotics. Since Efron's (1979) fundamental paper, the bootstrap has been the subject of much discussion and development. Like estimating equations, the idea is simple and straightforward yet it has powerful implications for applications and forms the basis of much theoretical work as well. Recent reviews can be found in the books by Efron and Tibshirani (1993) and Hall (1992). The most studied problem is that of constructing reliable and accurate confidence intervals for a parameter θ of interest. The general approach involves generating the bootstrap distribution for an estimator θ and utilizing that distribution for interval estimation. DiCiccio and Romano (1988) give an excellent summary. In this article, we use bootstrap procedures to construct confidence intervals for a parameter θ when the estimation of θ is based upon a linear estimating equation. The methods are simple to implement and, in a wide class of problems, lead to very accurate confidence intervals with coverage probabilities that are accurate to second order. In section 2, we define a simple bootstrap method for the linear estimating equation and also give a studentized version. Our proposal involves resampling the components of the estimating function with the aim of estimating its distribution rather than the distribution of the estimator θ. This provides the basis to estimate a confidence distribution for θ and to develop approximate confidence regions. The ideas being used are similar to those of Hu and Zidek (1995) who develop related bootstrap methods for estimating equations in the linear model. The paper by Parzen, Wei and Ying (1994) is also closely related. They consider estimating functions whose distributions are very complex and use simulation methods to generate a confidence distribution for θ. Their approach, however, is not based on bootstrap resampling. In Section 3, a few examples are given to compare the method with more classical procedures. Section 4 gives a series of comments and outlines some areas for further investigation.
2 Estimating Equations and Confidence Regions
Suppose that $y_1, y_2, \ldots, y_n$ are independent random variables, and our aim is to estimate an unknown parameter θ taking values in some space Ω. Further, we assume that
$$E\{g_i(y_i, \theta)\} = 0$$
for all i and θ ∈ Ω. The parameter θ is estimated by the root $\hat\theta$ of the unbiased linear estimating equation
$$S_y(\theta) = \sum_{i=1}^n g_i(y_i, \theta) = 0. \qquad (2.1)$$
In what follows, we suppose that $S_y(\theta)$ is a one to one function of θ, and we focus on the construction of confidence intervals for θ. Note that the estimating function (2.1) is taken as given and is assumed to provide an appropriate method for estimation.

If the distribution of $S_y(\theta)$ is the same for all θ ∈ Ω and C is a set that satisfies $P\{S_y(\theta) \in C\} \ge 1 - \alpha$, then $\{\theta \in \Omega : S_y(\theta) \in C\}$ is a confidence interval with coefficient at least 1 − α. Parzen, Wei and Ying (1994) consider this setup and develop methods for simulating the distribution of $S_y(\theta)$ in certain regression examples. More usually, however, the exact distribution of $S_y(\theta)$ depends on θ and is complex or not even specified by the assumed model. In such cases, it is customary to rely on asymptotic results.

Asymptotic Normal Approximations: If θ is a scalar parameter, for example, and certain regularity conditions hold, asymptotic inferences could be based on a central limit theorem applied to $S_y(\theta)$ directly. Alternatively, a Taylor series approximation yields an asymptotic distribution for $\hat\theta$. Thus, in the former case we have
$$n^{-1/2} \sum_{i=1}^n g_i(y_i, \theta) \to_d N(0, V_\theta) \qquad (2.2)$$
and in the latter case
$$n^{1/2}(\hat\theta - \theta) \to_d N(0, V_\theta / W_\theta), \qquad (2.3)$$
where $V_\theta = \lim_{n\to\infty} \frac{1}{n} \sum \mathrm{var}(g_i(y_i, \theta))$ and $W_\theta = \lim_{n\to\infty} \{E(\frac{1}{n} \sum \frac{\partial}{\partial\theta} g_i(y_i, \theta))\}^2$. Using (2.2) or (2.3) with a consistent estimator of $V_\theta$ or $V_\theta/W_\theta$, approximate confidence intervals for θ can easily be obtained. Extensions to vector parameters are, of course, straightforward.

The result (2.3) is the most commonly used method, but it has several drawbacks:
i) The approximation is first order only and convergence can be slow.
ii) The method is not functionally invariant; estimation of λ = h(θ), where h is 1 : 1, can yield different results.
iii) $S_y(\theta)$ must be smooth for the Taylor approximations; this means the method cannot be used in some nonparametric problems of interest.
408
HU AND KALBFLEISCH
iv) There is no adaptation to small samples.
Use of (2.2) instead of (2.3) circumvents drawbacks ii) and iii). The EF Bootstrap discussed below is based on approximations to the distribution of $S_y(\theta)$ and also addresses ii), iii) and iv). In addition, a studentized version leads to higher order approximations.

The EF (Estimating Function) Bootstrap: The aim is to approximate the distribution of $S_y(\theta) = \sum g_i(y_i, \theta)$. For this purpose, let $z_i = g_i(y_i, \hat\theta)$, $i = 1, \ldots, n$, and proceed as follows:
EF1: Draw a bootstrap sample $z_1^*, \ldots, z_n^*$ from $z_1, \ldots, z_n$.
EF2: Let $S^* = z_1^* + \cdots + z_n^*$.
The empirical distribution of $S^*$ is used, after many repetitions of EF1 and EF2, to approximate the distribution of $S_y(\theta)$. This procedure, termed the EF Bootstrap, typically gives a first order approximation to the distribution of $S_y(\theta)$.
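As a concrete illustration, the following is a minimal Python sketch of steps EF1–EF2; the function name `ef_bootstrap` and the use of NumPy are our own choices, not part of the original procedure.

```python
import numpy as np

def ef_bootstrap(z, n_boot=1000, rng=None):
    """EF bootstrap: resample the estimated scores z_i = g_i(y_i, theta_hat)
    and return bootstrap replicates of S* = sum of the resampled scores.

    A minimal sketch for scalar components; z is an (n,) array of
    estimating-function components at the estimate (so z.sum() is ~ 0)."""
    rng = np.random.default_rng(rng)
    n = len(z)
    # EF1: draw bootstrap samples z_1*, ..., z_n* with replacement;
    # EF2: form S* = z_1* + ... + z_n*, once per bootstrap repetition.
    idx = rng.integers(0, n, size=(n_boot, n))
    return z[idx].sum(axis=1)
```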
A higher order approximation can typically be obtained by the following Studentized EF Bootstrap procedure. Instead of approximating the distribution of $S_y(\theta)$, we approximate that of
$$S_y^{(1)}(\theta) = v^{-1/2} S_y(\theta), \qquad (2.4)$$
where $v = \sum z_i z_i'$. Thus, we use
$$S^{(1)*} = v^{*-1/2} S^*, \qquad (2.5)$$
where $v^* = \sum (z_i^* - \bar z^*)(z_i^* - \bar z^*)'$ and $\bar z^* = \sum z_i^*/n$. Details on the order of approximations are discussed in Hu and Kalbfleisch (1997a), and some further comments can be found in Section 4 of this article.

It should be clear that, once the distribution of $S_y(\theta)$ is estimated, approximate confidence intervals or regions can be readily obtained. For example, if θ is a scalar and $S_y(\theta)$ is monotone in θ, the quantiles of the empirical distribution of $S^*$ (or $S^{(1)*}$) can be used to determine confidence intervals for θ. More generally, let $\theta^*$ (or $\theta^{(1)*}$) be the unique value of θ that satisfies $S_y(\theta) = S^*$ (or $S_y^{(1)}(\theta) = S^{(1)*}$). Then from the above EF bootstrap procedure, we obtain $\theta_1^*, \theta_2^*, \ldots$ (or $\theta_1^{(1)*}, \theta_2^{(1)*}, \ldots$), which generates a bootstrap confidence distribution for θ. An approximate 1 − α confidence interval is given by a set A which is chosen so that $P^*(\theta^* \in A) = 1 - \alpha$ or $P^*(\theta^{(1)*} \in A) = 1 - \alpha$. Here $P^*$ refers to the bootstrap probability. In the scalar case, A is most naturally determined as the interval defined by the α/2 and 1 − α/2 quantiles of the bootstrap confidence distribution. This is the method used in Section 3 below.
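A sketch of the studentized replicates (2.5) and of the quantile inversion just described, again in Python and under our own naming; the `brentq` bracket assumes a root lies within `width` of the point estimate, which must be checked in practice.

```python
import numpy as np
from scipy.optimize import brentq

def studentized_ef_bootstrap(z, n_boot=1000, rng=None):
    """Bootstrap replicates of S^(1)* = v*^{-1/2} S*, scalar case of (2.5)."""
    rng = np.random.default_rng(rng)
    n = len(z)
    idx = rng.integers(0, n, size=(n_boot, n))
    zb = z[idx]                                   # resampled scores
    s_star = zb.sum(axis=1)
    v_star = ((zb - zb.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    return s_star / np.sqrt(v_star)

def ef_confidence_interval(S, theta_hat, reps, alpha=0.05, width=10.0):
    """Invert a monotone decreasing estimating function S(theta) at the
    bootstrap quantiles of `reps` (S* or studentized replicates) to get
    an approximate 1 - alpha interval."""
    lo_q, hi_q = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
    lower = brentq(lambda t: S(t) - hi_q, theta_hat - width, theta_hat + width)
    upper = brentq(lambda t: S(t) - lo_q, theta_hat - width, theta_hat + width)
    return lower, upper
```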
The Classical Bootstrap: An alternative approach, in the spirit of the classical bootstrap, proceeds as follows:
C1: Select $g_1^*(y_1^*, \hat\theta), \ldots, g_n^*(y_n^*, \hat\theta)$ as a bootstrap sample from $g_1(y_1, \hat\theta), \ldots, g_n(y_n, \hat\theta)$.
C2: Find $\hat\theta^*$, the solution to $\sum g_i^*(y_i^*, \theta) = 0$.
The sequence $\hat\theta_1^*, \hat\theta_2^*, \ldots$ obtained by repeating this process can similarly be used to generate confidence regions for θ. It should be noted that the classical procedure summarized in C1 and C2 is computationally demanding, much more so than the EF bootstrap. In essence, the classical procedure specifies a new estimating equation with each bootstrap sample. With the EF procedure, however, only the right side of the equation changes while the estimating function $S_y(\theta)$ remains unchanged. In the context of a regression model, for example, the classical method amounts to a new design matrix with each bootstrap sample. The EF procedure, however, maintains the same design matrix throughout.
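For contrast, a sketch of the classical C1–C2 scheme in the same style; note that it re-solves a new estimating equation on every iteration, and the bracket for the root finder is our own assumption (the paper's Example 2 reports that such re-solving can fail to converge for some resamples).

```python
import numpy as np
from scipy.optimize import brentq

def classical_bootstrap(y, g, theta_hat, n_boot=1000, width=10.0, rng=None):
    """Classical bootstrap for a scalar root: g(y, theta) returns the
    vector of components g_i(y_i, theta); each resample defines and
    solves a new estimating equation (steps C1-C2)."""
    rng = np.random.default_rng(rng)
    n = len(y)
    roots = []
    for _ in range(n_boot):
        y_star = y[rng.integers(0, n, size=n)]        # C1: resample the data
        roots.append(brentq(lambda t: g(y_star, t).sum(),   # C2: solve anew
                            theta_hat - width, theta_hat + width))
    return np.array(roots)
```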
3 Some Examples
The interested reader is referred to Hu and Kalbfleisch (1997a) for a more extensive and complete set of examples and discussion. In this article, we give only a brief treatment of two related examples.

Example 1: Suppose that $y_1, \ldots, y_n$ are independent random variables and that $E(y_i) = \mu$, $\mathrm{var}(y_i) = \sigma_i^2$, $i = 1, \ldots, n$, where the $\sigma_i^2$'s are known. The minimum variance unbiased estimating equation is
$$\sum_{i=1}^n \frac{y_i - \mu}{\sigma_i^2} = 0, \qquad (3.1)$$
which yields the estimator $\hat\mu = (\sum y_i/\sigma_i^2)/(\sum 1/\sigma_i^2)$. We compare five approaches:
• The normal approximations to $S_y(\mu)$ or $\hat\mu$
• The classical bootstrap
• The studentized classical bootstrap
• The EF bootstrap
• The studentized EF bootstrap.
It is worth noting that, if all the $\sigma_i$'s are equal, this reduces to the standard benchmark problem of estimating a population mean based on an i.i.d. sample. The estimating equation (3.1) then defines the sample mean. Hu and Kalbfleisch (1997a) note that the EF Bootstrap and the classical bootstrap yield, in this case, identical results, and also have the same studentized versions. In this case, and more generally, the Efron or classical bootstrap can be viewed as a special case of the EF Bootstrap.

In the simulations reported here, however, we consider the case n = 40, μ = 0 and $\sigma_i^2 = 2.2i$, $i = 1, \ldots, 40$, with normal and uniform errors. There were 1000 simulations of 1000 bootstrap samples. Table 1 gives the coverage probabilities for each of the five methods for nominal coverage probabilities of .80, .90 and .95. All five procedures do fairly well, though in the case of normal errors the (exact) normal method and the studentized procedures have more accurate coverage probabilities. With uniform errors, the studentized bootstrap appears to do somewhat better than any of the others.

Table 1. Coverage Percentages and Average Confidence Intervals for Competing Methods
Normal Errors           80%                  90%                  95%
Normal approx.          80  [-0.92, 0.95]    89  [-1.18, 1.21]    95  [-1.42, 1.44]
Classical bootstrap     77  [-0.90, 0.92]    87  [-1.16, 1.19]    93  [-1.39, 1.42]
Studentized classical   80  [-0.92, 0.95]    90  [-1.18, 1.22]    95  [-1.42, 1.44]
EF bootstrap            77  [-0.88, 0.91]    87  [-1.14, 1.17]    93  [-1.36, 1.40]
Studentized EF          80  [-0.92, 0.95]    90  [-1.18, 1.21]    95  [-1.41, 1.45]

Uniform Errors          80%                  90%                  95%
Normal approx.          81  [-0.91, 0.96]    92  [-1.16, 1.23]    96  [-1.41, 1.46]
Classical bootstrap     78  [-0.89, 0.95]    88  [-1.15, 1.23]    94  [-1.38, 1.44]
Studentized classical   81  [-0.92, 0.96]    90  [-1.14, 1.23]    95  [-1.41, 1.46]
EF bootstrap            77  [-0.88, 0.94]    89  [-1.14, 1.19]    94  [-1.36, 1.43]
Studentized EF          80  [-0.91, 0.96]    91  [-1.14, 1.23]    95  [-1.40, 1.46]
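The following Python fragment sketches one replication of this simulation under our reading of the design ($\sigma_i^2 = 2.2i$); because the estimating function (3.1) is linear in μ, the EF bootstrap inversion is available in closed form here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
sigma2 = 2.2 * np.arange(1, n + 1)        # sigma_i^2 = 2.2 i (our reading)
y = rng.normal(0.0, np.sqrt(sigma2))      # one simulated data set, mu = 0
w = 1.0 / sigma2
mu_hat = (w * y).sum() / w.sum()          # root of (3.1)

z = w * (y - mu_hat)                      # estimated EF components, sum ~ 0
idx = rng.integers(0, n, size=(1000, n))
s_star = z[idx].sum(axis=1)               # EF bootstrap replicates of S*
# S(mu) = sum w_i (y_i - mu) is linear, so S(mu) = s* solves exactly:
mu_star = mu_hat - s_star / w.sum()
ci = np.quantile(mu_star, [0.025, 0.975]) # EF bootstrap 95% interval
```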
Example 2: A related example is the common means problem of Neyman and Scott (1948), which has received much attention in the literature (see, for example, Kalbfleisch and Sprott (1970), Bartlett (1936), Cox and Reid (1987) and Barndorff-Nielsen (1983)). Specifically, we have k independent strata and, in the ith stratum, $y_{ij} \sim N(\mu, \sigma_i^2)$, $j = 1, \ldots, n_i$, independently, where $i = 1, \ldots, k$. The variances $\sigma_i^2$ are unknown and interest centres on the estimation of μ. Neyman and Scott show that the maximum likelihood estimate of μ can be asymptotically inefficient. When the stratum sizes are at least 3, they propose the estimating equation
$$\sum_{i=1}^k \frac{n_i(n_i - 2)(\bar y_i - \mu)}{T_i(\mu)} = 0, \qquad (3.2)$$
where $T_i(\mu) = \sum_j (y_{ij} - \mu)^2$ and $\bar y_i = \sum_j y_{ij}/n_i$. This same equation has been proposed by many other authors as well.

More generally, we could consider the case where the $y_{ij}$'s are independent as above, but relax the normality assumption. Thus, we assume only independence and $E(y_{ij}) = \mu$, $\mathrm{var}(y_{ij}) = \sigma_i^2$, $i = 1, \ldots, k$, $j = 1, \ldots, n_i$, with $\sigma_1^2, \ldots, \sigma_k^2$ unknown. In this more general framework, we could still utilize (3.2) for estimation of μ. Hu and Kalbfleisch (1997a) consider various sampling schemes and aspects of this problem. Here, we look only at one approach. Specifically, we let $y_i = (y_{i1}, \ldots, y_{in_i})$ and $g_i(y_i, \mu) = n_i(n_i - 2)(\bar y_i - \mu)/T_i(\mu)$. Thus, (3.2) can be rewritten as
$$\sum_{i=1}^k g_i(y_i, \mu) = 0, \qquad (3.3)$$
exactly of the form (2.1). The EF or classical bootstrap can now be applied in a straightforward manner to (3.3). We compare them and some normal approximations in the following simulation. In total, six methods are compared:
• Normal 1: An asymptotic normal approximation in which the asymptotic variance of $\hat\mu$ is approximated by a plug-in estimate $\hat\sigma_{(1)}^2$ based on the stratum variance estimates $\hat\sigma_i^2 = n_i^{-1}\sum_j (y_{ij} - \hat\mu)^2$.
• Normal 2: An asymptotic normal approximation based on (2.3) to the distribution of $\sqrt{n}(\hat\mu - \mu)$, with variance estimate $\hat\sigma_{(2)}^2$ obtained by estimating $V_\mu$ and $W_\mu$ empirically.
• Classical Bootstrap: This is obtained by resampling $g_i^*(y_i^*, \hat\mu)$ from the $g_i(y_i, \hat\mu)$; let $\hat\mu_c^*$ denote the bootstrap estimator.
• The Studentized Classical Bootstrap: use the variance estimator $\hat\sigma_{(2)}^2$ along with the estimator $\hat\mu_c^*$.
• The EF Bootstrap
• The Studentized EF Bootstrap
For the simulation, we used k = 40, μ = 0, $n_i = 5$ and $\sigma_i^2 = 1 + (i-1)/10$, $i = 1, \ldots, 40$. The errors were taken to be normal in one case and double exponential in the other (with p.d.f. $\exp(-|x - \mu|/2)/4$). Table 2 compares the methods for intervals of nominal 90% coverage.

Table 2. Coverage Percentage and Average Confidence Intervals for Competing Methods
                          Normal 90%              Double Exponential 90%
Normal 1                  69.8  [-0.10, 0.10]     69.5  [-0.06, 0.06]
Normal 2                  84.4  [-0.15, 0.15]     80.9  [-0.09, 0.08]
Classical*                87.3  [-0.16, 0.18]     85.8  [-0.10, 0.10]
Studentized Classical**   87.5  [-0.20, 0.19]     85.2  [-0.11, 0.12]
EF                        89.3  [-0.17, 0.16]     87.8  [-0.09, 0.09]
Studentized EF            89.5  [-0.17, 0.16]     88.5  [-0.10, 0.09]
*10% failures as noted in comment iii) below. **15% failures.
A few comments follow:
i) The first normal approximation works very badly indeed and clearly should not be used. The difficulty here is that there is no consistent estimate of $\sigma_i^2$ with small $n_i$ fixed and $k \to \infty$.
ii) The second normal approximation is less accurate than any of the bootstrap methods. It does work considerably better than the first approximation since it involves, at least, a consistent estimate of the variance.
iii) The classical bootstrap appears to work reasonably well. From the simulations, however, about 10% of the bootstrap samples do not converge, using Newton's method, to a finite estimate of μ from a starting value of 0. The coverage rates and average intervals are based on the subset of bootstrap samples that give estimates for μ.
iv) The classical bootstrap involves, on each bootstrap iteration, the solving of the estimating equation
$$\sum_i g_i^*(y_i^*, \mu) = 0.$$
The computations involved in its implementation greatly exceed those for the EF or the studentized EF Bootstrap.
v) The studentized classical bootstrap does not offer any improvement in this case over the unstudentized version. This could be because the variance estimator $\hat\sigma_{(2)}^2$ is not very accurate or stable.
vi) There is a clear indication that both the EF and the studentized EF bootstrap give more accurate results. In this example, they are also more stable and computationally much simpler than the classical approach.
Many other examples are given in Hu and Kalbfleisch (1997a), including a discussion of the linear model. Although Hu and Zidek (1995) approach the problem from a somewhat different view, the bootstrap procedure they recommend is numerically equivalent to the EF Bootstrap. As they note, the method they propose has excellent robustness properties against heteroscedasticity and compares favourably, even in the homoscedastic case, with classical proposals.
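To make the resampling unit concrete, here is a sketch of the stratum-level component $g_i$ of (3.3) in Python; the names are ours, and the EF bootstrap then simply resamples the values $z_i = g_i(y_i, \hat\mu)$ across strata exactly as in Section 2.

```python
import numpy as np

def g_stratum(y_i, mu):
    """Stratum component g_i(y_i, mu) = n_i (n_i - 2)(ybar_i - mu)/T_i(mu)
    from (3.2)-(3.3); y_i is the vector of observations in stratum i."""
    n_i = len(y_i)
    T_i = ((y_i - mu) ** 2).sum()
    return n_i * (n_i - 2) * (y_i.mean() - mu) / T_i

def S(strata, mu):
    """Estimating function (3.3): sum of the stratum components."""
    return sum(g_stratum(y_i, mu) for y_i in strata)
```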
4 Discussion
In this article, we have attempted to give only a brief introduction to the EF Bootstrap. More detail, further examples and theoretical aspects can be found in Hu and Kalbfleisch (1997a). We conclude this paper with a number of comments and note some areas where further work is needed.

A. This paper, Hu and Kalbfleisch (1997a) and Hu and Zidek (1995) provide considerable empirical evidence that the EF Bootstrap, and especially the studentized version, has very good properties over a wide class of problems. The method gives accurate results, is numerically more stable than most competitors, and appears to have good robustness properties.

B. The EF Bootstrap leads to a simple, straightforward studentized version and avoids the many questions that arise about appropriate studentization of bootstrap samples of complex estimators. By concentrating on the linear estimating equation, the studentized version follows automatically and is easily implemented. It also gives very accurate results in many applications.

C. There is a substantial computational advantage, in many problems, to the EF Bootstrap versus the classical approach. In the EF Bootstrap, the observed estimating function $S_y(\theta)$ is unaltered, and so repeated numerical solution of a new equation with each bootstrap sample is avoided. This also has appeal from the viewpoint of conditionality of the inferential approach; in the context of linear regression, for example, the EF Bootstrap corresponds to maintaining a common design matrix in all bootstrap replications.
D. Hu and Kalbfleisch (1997a) use Edgeworth expansions to investigate the accuracy of the approximations. Under fairly general conditions, they show that the EF Bootstrap provides a first order approximation to the true distribution of $S_y(\theta)$ and the studentized EF Bootstrap provides a second order approximation to the distribution of $S_y^{(1)}(\theta)$. Thus, confidence intervals based on the Studentized EF Bootstrap are accurate to order $n^{-1}$. This theoretical asymptotic accuracy is also reflected in the finite sample simulations.

E. The role of conditionality in bootstrap procedures is one that in general requires further thought and investigation. In the context of the EF Bootstrap, one can ask whether there are aspects of the data upon which one should condition in obtaining the distribution of $S_y(\theta)$ that is suitable for inference. Similar questions arise with the classical bootstrap. In the common means problem, for example, it seems natural to condition on the $n_i$'s in making inferences about the common mean μ. This would suggest defining bootstrap replications in which the $n_i$'s are held fixed. Hu and Kalbfleisch (1997a) explore various possibilities here. Classical bootstrap methods appear to have very poor properties when the $n_i$'s are at all small and the natural conditional approach is used; the EF Bootstrap has better properties for moderate $n_i$, but it too breaks down if the $n_i$'s are very small ($n_i$ = 1 or 2). The estimating function itself may be a poor choice if the $n_i$'s are this small. In general, however, we need to balance considerations of conditionality against the need for a broad set of outcomes in the bootstrap sample so as to obtain good approximations.

F. We have assumed in the above that $S_y(\theta)$ is 1 : 1. In some cases, it may not be 1 : 1 and there may in fact be multiple roots. If a consistent root can be identified, the EF Bootstrap can still be applied. More generally, however, the difficulty with multiple roots is basic to methods based on the estimating function itself and is not a particular difficulty with the EF Bootstrap.

G. In this article, we have assumed that the $y_i$'s are independent. In many applications, however, it is important to relax this assumption.
Various correlation structures could be considered. Hu and Kalbfleisch (1997b) consider extensions to autoregressive processes.

Acknowledgements
J.D. Kalbfleisch's work was partially supported by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC). Some of this work was done while F. Hu held a postdoctoral appointment at the University of Waterloo and was supported by a grant from NSERC.

References
Barndorff-Nielsen, O.E. (1983). On a formula for the distribution of a maximum likelihood estimator. Biometrika 70, 343-65.
Bartlett, M.S. (1936). The information available in small samples. Proc. Camb. Phil. Soc. 34, 33-40.
Cox, D.R. and Reid, N.M. (1987). Parameter orthogonality and approximate conditional inference (with discussion). J. Royal Statist. Soc. B 49, 1-39.
DiCiccio, T.J. and Romano, J.P. (1988). A review of bootstrap confidence intervals. J. Royal Statist. Soc. B 50, 338-354.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist. 7, 1-26.
Efron, B. and Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York.
Godambe, V.P. and Kale, B.K. (1991). Estimating functions: an overview. In Estimating Functions, V.P. Godambe (ed.), Oxford University Press, Oxford, 3-20.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer-Verlag, New York.
Hu, F. and Zidek, J.V. (1995). A bootstrap based on the estimating equations of the linear model. Biometrika 82, 263-275.
Hu, F. and Kalbfleisch, J.D. (1997a). The Estimating Function Bootstrap, submitted.
Hu, F. and Kalbfleisch, J.D. (1997b). A new bootstrap method for autoregression, in preparation.
Kalbfleisch, J.D. and Sprott, D.A. (1970). Application of likelihood methods to models involving large numbers of nuisance parameters (with discussion). J. Royal Statist. Soc. B 32, 175-208.
Lele, S.R. (1991a). Jackknifing linear estimating equations: asymptotic theory and applications in stochastic processes. J. Royal Statist. Soc. B 53, 253-67.
Lele, S.R. (1991b). Resampling using estimating equations. In Estimating Functions, V.P. Godambe (ed.), Oxford University Press, Oxford, 295-304.
Parzen, M.I., Wei, L.J. and Ying, Z. (1994). A resampling method based on pivotal estimating functions. Biometrika 81, 341-50.
ESTIMATING FUNCTIONS: NONPARAMETRICS AND ROBUSTNESS
Pranab K. Sen
University of North Carolina at Chapel Hill

ABSTRACT
In nonparametric and robust inference, estimating functions are based on suitable implicitly or explicitly defined statistical functionals. The interplay of robustness and asymptotic efficiency properties of such typically nonlinear estimators is appraised here by reference to some standard as well as nonstandard problems that arise in statistical applications.

Key words: Adaptive estimation; alignment principle; asymptotic optimality; conditional functionals; estimable parameters; GEE; GLM; Hadamard differentiability; influence functions; L-, M-, R- and WLS estimators; statistical functionals; trimmed mean; U-process; U-statistics.
1 Introduction
In parametrics, estimable parameters generally appear as algebraic constants associated with the underlying distribution function(s) of assumed functional form(s). In nonparametrics, estimable parameters are defined as functionals of the underlying distribution(s), which may not have known functional form(s). This formulation shifts emphasis to validity for a broad class of distributions, wherein efficiency and robustness properties dominate the scenario. The U-statistics are the precursors of such nonparametric estimators; they are of the kernel estimator type, and enjoy good efficiency (and unbiasedness) properties but may not be generally very robust. Moreover, not all parameters in a nonparametric setup are estimable or regular functionals in the Hoeffding (1948) sense; the median or a percentile of a distribution belonging to a broad class is a classical example of such a nonregular functional.

Significant developments in nonparametric and robust estimation theory, covering both kernel type and estimating equation (EE) type estimating functions (EF), have taken place in the recent past. Three important classes of estimators are the following:
(i) L-estimators based on linear functions of order statistics,
(ii) M-estimators allied to the maximum likelihood estimators (MLE), and
(iii) R-estimators based on suitable rank statistics.
There has also been a drive to unify these estimators in terms of differentiable statistical functionals, though that fails to cover all such estimators. A good deal of discussion of such estimators, with due attention to their robustness and asymptotic properties, has appeared in various contemporary research publications; we may refer to Jureckova and Sen (1996) for an up-to-date treatise of this subject matter. Whereas L-estimators are explicitly defined statistical functionals, M- and R-estimators are defined implicitly (as is generally the case with the parametric MLE). Conditional statistical functionals, particularly arising in mixed-effects and multivariate models, have added new dimensions to the scope of study of the (asymptotic) properties of such nonparametric estimators. In this quest, even the very linearity of the model has been challenged (on the grounds of validity and robustness), and hence, nonparametric regression functions have emerged on a better footing than before. Yet, the study of their robustness properties needs further scrutiny with adequate emphasis on their finite sample behavior. In the current study, due emphasis will be placed on such estimating function(al)s.

The basic motivation for estimating functions in some simple nonparametric models is presented in Section 2. We examine the picture in a more general (linear model) setup in Section 3. Some nonstandard models arising in functional estimation problems are introduced in Section 4. The concluding section deals with some general remarks.
2 Nonparametric Estimating Functions
We may remark that the very formulation of the Fisher-consistency criterion of statistical estimators brings out the relevance of functionals of empirical measures in estimation theory. In a simple setup, given n independent and identically distributed random variables (i.i.d.r.v.) $X_1, \ldots, X_n$ from a distribution function (d.f.) F, we may introduce a statistical parameter θ = θ(F) as a functional of F, and a natural (i.e., plug-in) estimator of this parameter is the corresponding sample counterpart
$$T_n = T(X_1, \ldots, X_n) = \theta(F_n), \qquad (2.1)$$
where $F_n$ is the sample (empirical) d.f. The empirical d.f. $F_n$ is known to possess some optimality properties in a traditional setup, and if θ(F) is a linear functional, these properties are shared by $T_n$ as well. The ingenuity of Hoeffding (1948) lies in covering a more general class of functionals where θ(F) can be expressed as
$$\theta(F) = E_F\{g(X_1, \ldots, X_m)\} = \int \cdots \int g(x_1, \ldots, x_m)\, dF(x_1) \cdots dF(x_m), \qquad (2.2)$$
where g(.) is a kernel of (finite) degree m (≥ 1), and without loss of generality, we assume that g(.) is a symmetric function of its m arguments. Hoeffding introduced the U-statistic as an (unbiased) estimator of θ(F):
$$U_n = \binom{n}{m}^{-1} \sum_{1 \le i_1 < \cdots < i_m \le n} g(X_{i_1}, \ldots, X_{i_m}), \quad n \ge m. \qquad (2.3)$$
A closely related estimator is the von Mises (1947) functional
$$V_n = \theta(F_n) = \int \cdots \int g(x_1, \ldots, x_m)\, dF_n(x_1) \cdots dF_n(x_m). \qquad (2.4)$$
For m = 1, $U_n$ and $V_n$ are the same (an average of i.i.d.r.v.'s). But for m ≥ 2, they are generally not the same; whereas $U_n$ is an unbiased estimator, $V_n$ is generally not so. Nevertheless, under quite general regularity conditions, $|V_n - U_n| = O_p(n^{-1})$, so that asymptotically they share the same optimality properties, studied in detail by various researchers; Sen (1981) contains a systematic account of this work. We may note that the above formulation is pivoted to a suitable kernel g(.), and the optimality properties, interpreted in a nonparametric fashion, rest on the adoption of the classical Rao-Blackwell theorem along with sufficiency and completeness of the sample order statistics. Nevertheless, from a robustness perspective the choice of the kernel is very important. Generally, if g(.) is unbounded, neither of these two estimators may be very robust. To illustrate this feature, let us consider the simple situation where θ(F) is the variance of the d.f. F. In this case, the kernel g(.) is given by
$$g(x_1, x_2) = \tfrac{1}{2}(x_1 - x_2)^2, \quad m = 2, \qquad (2.5)$$
so that if F does not have a compact support, the estimators are vulnerable to error contamination, gross errors or outliers. A similar situation arises with the mean functional (where m = 1 and g(x) = x), although there it may be possible to introduce the location parameter by imposing symmetry of F around its median and bypassing some of these technical difficulties. The degree of nonrobustness is likely to be greater with dispersion than with location measures. Robust estimation of location (regression) and scale parameters has its genesis in this feature, and we will discuss this briefly later on. There have been some attempts to introduce more robust dispersion functionals, such as the mean absolute deviation and interquartile range, although they may not belong to the class of estimable parameters in the sense of Hoeffding (1948) and may also lack some of the generality prevailing in the case of location parameters; we refer to Jureckova and Sen (1996) for some discussion.

We may also consider a variant of U- or V-statistics which merits consideration on the ground of robustness. Instead of taking an average over all possible subsample kernels $g(X_{i_1}, \ldots, X_{i_m})$, $1 \le i_1 < \cdots < i_m \le n$, we arrange them in ascending order, and denote the median of these $\binom{n}{m}$ pseudovalues by $\tilde\theta_n$; this can be proposed as an estimator of the median of the distribution of the kernel $g(X_1, \ldots, X_m)$. Thus, whenever for this kernel distribution the median and the mean are the same, a case that holds when this distribution is symmetric, or they differ by a known constant, a robust estimator can be obtained in this manner. In general, we may define a U-process by letting
$$U_n(t) = \binom{n}{m}^{-1} \sum_{1 \le i_1 < \cdots < i_m \le n} I\{g(X_{i_1}, \ldots, X_{i_m}) \le t\}, \quad t \in \mathbb{R}. \qquad (2.6)$$
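As an illustration of (2.3)–(2.6), the following Python sketch enumerates the kernel pseudovalues for the variance kernel (2.5) and forms both the U-statistic and the robust median-of-pseudovalues variant; the helper names are ours, and the brute-force enumeration is only practical for small n.

```python
import numpy as np
from itertools import combinations

def kernel_values(x, g, m):
    """All (n choose m) pseudovalues g(X_{i1},...,X_{im}), i1 < ... < im."""
    return np.array([g(*x[list(c)]) for c in combinations(range(len(x)), m)])

def var_kernel(a, b):
    """Kernel (2.5): g(x1, x2) = (x1 - x2)^2 / 2, unbiased for the variance."""
    return 0.5 * (a - b) ** 2

x = np.random.default_rng(0).normal(size=30)
vals = kernel_values(x, var_kernel, m=2)
U_n = vals.mean()              # the U-statistic (2.3)
theta_med = np.median(vals)    # median of the pseudovalues: robust variant
```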
Virtually everything that has been studied for U-statistics remains true for such U-processes, and they have better robustness perspectives. Thus, (Hadamard differentiable) statistical functionals of such U-processes may be advocated on the ground of robustness as well as other optimality properties. This line of attack initiated the development of U-processes; in a more abstract setting, Nolan and Pollard (1987), and others, have studied related asymptotics. We will not go into further depth on this topic here. There are other types of U-processes, more akin to sequential setups, and we refer to Sen (1981) for some broad coverage of them.

For both location and dispersion measures, linear functions of order statistics, known as L-statistics, have been used extensively in the literature; they are generally efficient, adaptable in parametric as well as nonparametric setups, and generally possess good robustness properties. The estimating equation for L-estimators of location and scale parameters in a (parametric) location-scale family of distributions has its genesis in the theory of best linear unbiased estimators (BLUE), which incorporates the weighted least squares (WLS) methodology on the set of order statistics. Led by this genesis, in a nonparametric setup, a parameter θ(F) is expressed as a functional
$$\theta(F) = \int_{-\infty}^{\infty} g(x)\, J(F(x))\, dF(x), \qquad (2.7)$$
where g(.) is real valued, and $J = \{J(u), u \in (0,1)\}$ is a weight function defined on the unit interval (0,1). It is easy to see that the corresponding sample counterpart $\theta(F_n)$, given by
$$\theta(F_n) = n^{-1} \sum_{i=1}^n g(X_{n:i})\, J_n(i/n), \qquad (2.8)$$
is an L-estimator; here the $X_{n:i}$, $i = 1, \ldots, n$, stand for the order statistics, and $J_n(.)$ is a suitable version converging to J(.) (as n increases) almost everywhere. The flexibility of this approach stems from the choice of g(.) and J(.), for which θ(F) remains invariant, in such a way that, while retaining robustness to a greater extent, not much is compromised on the (asymptotic) efficiency of an estimator of θ(F). For example, in the location model, the trimmed mean, a member of this class of L-estimators, for a small amount of trimming (at both ends) combines robustness with good efficiency properties. The theory of asymptotically best linear unbiased estimators (ABLUE) is geared to this direction; it covers both the case of smooth weight functions and a combination of a selected number of order statistics. For this location model, whenever F is assumed to be symmetric about its median (θ), one may take g(x) = x a.e., and for nonnegative $J_n(i/n)$ (adding up to 1), we have a convex combination of the order statistics as an estimator of θ. Within this class, one may like to choose the weight function in such a way that robustness can be fruitfully combined with high efficiency properties. Generally smoother weight functions are used in this context. For the location-scale problems, the sample quantiles and the interquartile range are particular cases of such L-estimators. It is often possible to express an L-statistic as a U-statistic, and in an asymptotic setup, a first order approximation for L-estimators in terms of U-statistics works out well [viz., Sen (1981, ch. 7)]. A notable example in this context is the rank-weighted mean $T_{n,k}$, which is the average of all the subsample medians of size 2k + 1 from the given sample of size n. As in Sen (1964), we may write this equivalently as
$$T_{n,k} = \binom{n}{2k+1}^{-1} \sum_{i=1}^n \binom{i-1}{k} \binom{n-i}{k} X_{n:i}, \qquad (2.9)$$
so that for k = 0, we have the sample mean, and for k = [(n − 1)/2], we have the sample median. Incidentally, this example, for a k ≥ 1, provides an illustration of the robustness of U-statistics even when the kernel is possibly unbounded.

In order to introduce the salient features of the estimating functions for R- and M-estimators, it may be more convenient to start with the conventional MLE when the assumed (location) model is not necessarily the true one. Assume that $X_1, \ldots, X_n$ are drawn from a population with an absolutely continuous density function f(x − θ), where f(.) is symmetric about the origin. Assume further that the true density function is given by g(x − θ), where g(.) is also absolutely continuous and symmetric about the origin. Then the estimating function for obtaining the MLE $\hat\theta_n$ based on the assumed model is given by
$$\sum_{i=1}^n \{-f'(X_i - \theta)/f(X_i - \theta)\} = 0. \qquad (2.10)$$
Note that under the assumed conditions (on f and g),
$$\int \{-f'(x - \theta)/f(x - \theta)\}\, g(x - \theta)\, dx = 0, \qquad (2.11)$$
so that the MLE based on the assumed model remains pertinent to the entire class of g satisfying the above symmetry condition. Let us denote by
$$A^2(f,g) = \int \{-f'(x)/f(x)\}^2 g(x)\,dx, \qquad \gamma(f,g) = \int \{-f'(x)/f(x)\}\, g'(x)\,dx, \qquad I(g) = \int \{-g'(x)/g(x)\}^2 g(x)\,dx. \qquad (2.12)$$
Then, following standard asymptotics for the MLE, it can be shown that
$$n^{1/2}(\hat\theta_n - \theta) \to_d N(0,\, A^2(f,g)/\gamma^2(f,g)). \qquad (2.13)$$
Let us denote the MLE based on the true model by $\tilde\theta_n$. Then we have the following result:
$$n^{1/2}(\tilde\theta_n - \theta) \to_d N(0,\, [I(g)]^{-1}). \qquad (2.14)$$
Next note that, by the Cauchy-Schwarz inequality,
$$\gamma^2(f,g) \le I(g)\, A^2(f,g), \qquad (2.15)$$
where the equality sign holds only when $\{-f'(x)/f(x)\} = \{-g'(x)/g(x)\}$ almost everywhere (a.e.). Thus, if both f and g belong to the same location-scale family of densities for which the log-derivative is scale-equivariant, as is the case when g is Laplace or normal, then $\hat\theta_n$ and $\tilde\theta_n$ are isomorphic, and there is no loss of efficiency due to the incorrect model assumption; this is the usual parametric orthogonality condition referred to in the literature. This orthogonality condition is not universal for the location-scale family of densities; the Cauchy density is a classical example toward this point. On the other hand, if f and g do not satisfy this condition, the asymptotic relative efficiency (ARE) of $\hat\theta_n$ with respect to $\tilde\theta_n$ is given by
$$e(f,g) = \gamma^2(f,g)/\{I(g)\, A^2(f,g)\} \ (\le 1). \qquad (2.16)$$
This ARE can be quite low depending on the divergence of f and g. For example, if g is Cauchy while f is taken to be normal, $A^2(f,g) = \infty$, and hence e(f,g) = 0. In the above development, we have tacitly assumed that (2.11) holds. Sans the assumed symmetry of f and g, this may not be generally true, and therefore in such a case the MLE $\hat\theta_n$ may have serious bias, and this in turn may make it inconsistent too. In any case, it is clear
that with an incorrect model, the derived MLE cannot attain the Cramér-Rao information bound for its asymptotic mean square error, and hence loses its (asymptotic) optimality properties. The above picture turns out to be far more complex in a general parametric model where θ may not be the location parameter or the density may not be symmetric, and as such it reveals the grave nonrobustness aspects of the classical MLE to plausible model departures. The nonparametric situation is more complex in the sense that f and g may not simply differ in nuisance parameters, and their separability is generally defined in terms of more general metrics. As such, the methodology developed for parametric EF in the presence of nuisance parameters may not be of much use in this more general setup.

In the classical robust inference setup, Huber (1964) introduced various measures of departures from the assumed model, such as the Lévy-distance, Kolmogorov-distance and Prokhorov-distance, and exhibited the possible lack of robustness of the classical MLE. Following his ground-breaking work, we may therefore conceive of a suitable score function $\psi(t)$, $t \in \mathbb{R}$, and for the location model, consider the estimating equation
$$\sum_{i=1}^n \psi(X_i - \theta) = 0; \qquad (2.17)$$
to have good robustness properties of the derived (M-)estimator, the influence function ψ is generally taken to be bounded a.e. In order that the above EE provide a consistent solution, we need that
$$\int \psi(x)\, g(x)\, dx = 0, \qquad (2.18)$$
and a sufficient condition for this is the skew-symmetry of ψ and the symmetry of g (both about 0). Within this broad class, a specific choice of ψ can be made to achieve local efficiency, and we may refer to Hampel et al. (1986) and Jureckova and Sen (1996) for details.

The EF's for R-estimators have a greater appeal from global robustness perspectives. This stems from the basic fact that under suitable hypotheses of invariance, a rank statistic is genuinely distribution-free, so that whenever it has some monotonicity properties with respect to an alignment in the direction of alternative hypotheses, we have a robust EF. For example, in the location model, assuming that the underlying d.f. G is symmetric, and incorporating suitable scores $a_n(1) \le \cdots \le a_n(n)$, we may consider a signed rank statistic
$$\sum_{i=1}^n \mathrm{sign}(X_i)\, a_n(R_{ni}^+) = S_n, \ \text{say}, \qquad (2.19)$$
where $R_{ni}^+$ is the rank of $|X_i|$ among $|X_1|, \ldots, |X_n|$, for $i = 1, \ldots, n$. Note that under the null hypothesis that θ is null, $S_n$ has a known, symmetric
distribution. Moreover, if we replace the $X_i$ by $X_i - a$, for some real a, and denote the corresponding (aligned) signed rank statistic by $S_n(a)$, then it is easy to verify that
$$S_n(a) \ \text{is nonincreasing in} \ a \in \mathbb{R}. \qquad (2.20)$$
Therefore, the EF in this case is $S_n(a)$, and the corresponding EE is
$$S_n(a) \;\text{``}{=}\text{''}\; 0, \qquad (2.21)$$
where, in view of the usual step-function nature of $S_n(a)$, "=" is defined precisely as follows. We let
$$\hat\theta_{n,1} = \sup\{a : S_n(a) > 0\}, \quad \hat\theta_{n,2} = \inf\{a : S_n(a) < 0\}; \qquad \hat\theta_n = \tfrac{1}{2}[\hat\theta_{n,1} + \hat\theta_{n,2}]. \qquad (2.22)$$
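A computational sketch of (2.19)–(2.22) for the Wilcoxon scores $a_n(k) = k$, exploiting the monotonicity (2.20) through bisection; the tolerance and function names are our own choices.

```python
import numpy as np
from scipy.stats import rankdata

def signed_rank_stat(x, a):
    """S_n(a): signed rank statistic (2.19) at shift a, Wilcoxon scores."""
    d = x - a
    r = rankdata(np.abs(d))             # ranks R+_{ni} of |X_i - a|
    return np.sum(np.sign(d) * r)

def r_estimator(x, tol=1e-8):
    """Solve S_n(a) '=' 0 as in (2.21)-(2.22) by bisection, using the
    fact (2.20) that S_n(a) is nonincreasing in a."""
    lo, hi = x.min(), x.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if signed_rank_stat(x, mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)              # approximates (theta1 + theta2)/2
```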
Although in the case of the sign statistic $\hat\theta_n$ turns out to be the sample median, and for the Wilcoxon signed rank statistic it is the median of the mid-ranges, in general an algebraic expression may not be available, and an iterative solution has to be prescribed. Such R-estimators are distribution-free in the sense that they remain valid for the entire class of symmetric distributions, and unlike the case of the MLE, here the absolute continuity of the density function or a finite Fisher information need not be a part of the regularity assumptions. Let us have a closer look at the ranks $R_{ni}^+$, which assume the integer values $1, \ldots, n$. Thus, if an observation is moved to the extreme right (or left), it continues to have the rank n (or 1), no matter how far it is shifted. In that way, the rank scores have good robustness properties against error contaminations and outliers. R-estimators are translation-invariant, robust, consistent and median-unbiased under very general regularity assumptions. Their asymptotic properties have been extensively studied in the literature; see, for example, Jureckova and Sen (1996). Under appropriate regularity conditions, we have
$$n^{1/2}(\hat\theta_n - \theta) \to_d N(0, \nu^2), \qquad (2.23)$$
where
$$\nu^2 = \Big\{\int_0^1 \phi^2(u)\, du\Big\} \Big/ \Big\{\int_0^1 \phi(u)\{-g'(G^{-1}(u))/g(G^{-1}(u))\}\, du\Big\}^2, \qquad (2.24)$$
and φ(.) is the score generating function for the $a_n(k)$. As such, we may conclude that the nonparametric and robustness aspects of R-estimators prevail as long as the score function φ(.) is square integrable inside (0,1); unlike
the case of M-estimators, here we need not confine ourselves to bounded score functions. Within this broad class of score functions, one may choose specific members such that the ARE at a given G is a maximum, so that robustness can also be combined with local optimality. In passing, we may remark that for both M- and R-estimators, the d.f. G is largely treated as a nuisance (functional) parameter, so that the situation differs drastically from the parametric situation, where one generally has a finite (and typically small) number of nuisance parameters. Further, looking at the last equation, we gather that
$$\nu^2 \ge [I(g)]^{-1}, \quad \forall \phi \in L^2(0,1), \qquad (2.25)$$
where the equality sign holds only when $\phi(u) = \{-g'(G^{-1}(u))/g(G^{-1}(u))\}$, $u \in (0,1)$. In this way, the situation is quite comparable to the case of M-estimators (sans the boundedness condition). As a matter of fact, both M- and R-estimators have certain asymptotic equivalence properties (with congruent score functions), and such general equivalence results for L-, M- and R-estimators have been studied in detail in Jureckova and Sen (1996). It follows from their general discussion that all of them are expressible in terms of statistical functionals, L-estimators being defined explicitly, while M- and R-estimators implicitly. For such statistical functionals, suitable modes of differentiability have been incorporated to provide convenient means for the study of general asymptotic properties of such estimators. Among these, the Hadamard differentiability property has been exploited the most. It turns out [viz., Sen, 1996a] that U-statistics with unbounded kernels may not be Hadamard differentiable, although they possess nice (reverse) martingale properties which provide access to various probabilistic tools that can be used to study related asymptotics. Likewise, for L-, M- and R-estimators in a functional mold, one needs bounded score (weight) functions to verify the desired Hadamard differentiability property. For L- and M-estimators, such a boundedness condition does not pose any serious threat (as robustness considerations often prompt bounded influence functions), but bounded scores for R-estimators exclude some important statistics (such as the classical normal scores and log-rank scores statistics), and hence this differentiability approach for R-estimators is not totally appropriate. Fortunately, there are alternative methodologies, studied in detail in Jureckova and Sen (1996, ch. 6), which provide a better resolution, and hence we need not be overconcerned about Hadamard differentiability of EF's for R-estimators.

For all the types of estimators considered above there is a basic query: treating the underlying density g as a nuisance functional belonging to a general class $\mathcal{G}$, is it possible to formulate some EF which yields asymptotically optimal estimators, in the sense of attaining the information bound for the asymptotic mean square error, in a semiparametric setup?
Under fairly general regularity conditions, we have an affirmative answer. We illustrate this point with adaptive R-estimators of location (Huskova and Sen, 1986); a similar situation holds for other functionals as well (Sen, 1996a). Let us define the Fisher-information score generating function by
$$\phi_g(t) = \{-g'(G^{-1}(t))/g(G^{-1}(t))\}, \quad t \in (0,1), \qquad (2.26)$$
and assume that $I(g) = \int_0^1 \phi_g^2(t)\, dt < \infty$. Then we may consider a Fourier series expansion
$$\phi_g(t) = \sum_{k \ge 0} \gamma_k P_k(t), \quad t \in (0,1), \qquad (2.27)$$
where the $\{P_k(.),\ k \ge 0\}$ form a complete orthonormal system [on (0,1)] and the Fourier coefficients $\gamma_k$ are defined as
$$\gamma_k = \int_0^1 \phi_g(t) P_k(t)\, dt, \quad \text{for } k = 0, 1, \ldots. \qquad (2.28)$$
For adaptive R-estimators, Huskova and Sen (1985, 1986) advocated the use of the Legendre polynomial system on (0,1), where for each k (= 0, 1, 2, …),
$$P_k(t) = (2k+1)^{1/2} (-1)^k (k!)^{-1} (d^k/dt^k)\{t(1-t)\}^k, \quad t \in (0,1), \qquad (2.29)$$
so that $P_0(t) = 1$, $P_1(t) = \sqrt{3}(2t - 1)$, $P_2(t) = \sqrt{5}\{1 - 6t(1-t)\}$, and so on. Thus $P_1(.)$ is a variant of the Wilcoxon scores, and when g is a logistic density, it is easy to see that $\phi(.) = P_1(.)$, while for a symmetric g, it follows from the above that $\gamma_{2k} = 0$, $\forall k \ge 0$. In general the Fourier series is an infinite one, though convergent. The basic idea is to truncate this infinite series at a suitable stopping number, say $K_n$, and, for such a truncated version $\sum_{k \le K_n} \gamma_k P_k(t)$, to estimate the Fourier coefficients $\gamma_k$ by the classical Jureckova-linearity method. If we denote these estimates by $\hat\gamma_{k,n}$, $k \le K_n$, then our adaptive version of the Fisher score function is
$$\hat\phi_{g,n}(t) = \sum_{k \le K_n} \hat\gamma_{k,n} P_k(t), \quad t \in (0,1). \qquad (2.30)$$
One may then define a (signed-) rank statistic as before, with the scores $\hat a_n(k)$, $k = 1, \ldots, n$, generated by the adaptive score generating function $\hat\phi_{g,n}(.)$, and, based on the same alignment principle, obtain adaptive R-estimators of location. These estimators are robust and asymptotically efficient for the class of densities with finite Fisher information. Huskova and Sen (1986) considered a suitable sequential version to formulate stopping numbers $\{K_n\}$ for which there are suitable rates of convergence; a simpler algorithm, without appealing to such a sequential scheme, has been worked out in Sen (1996a).
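A small sketch of how the orthonormal shifted Legendre system of (2.29) can be evaluated at the usual rank points $k/(n+1)$; the recurrence-based evaluation is our own implementation choice, and the estimation of the $\hat\gamma_{k,n}$ by the Jureckova-linearity method is not shown.

```python
import numpy as np

def legendre_scores(K, n):
    """Orthonormal shifted Legendre polynomials P_0,...,P_K of (2.29),
    evaluated at t = k/(n+1), k = 1..n; uses P_k(t) = sqrt(2k+1) L_k(2t-1)
    with the ordinary Legendre L_k computed by Bonnet's recurrence."""
    t = np.arange(1, n + 1) / (n + 1.0)
    x = 2.0 * t - 1.0                    # map (0,1) -> (-1,1)
    L = [np.ones_like(x), x]
    for k in range(1, K):
        L.append(((2 * k + 1) * x * L[k] - k * L[k - 1]) / (k + 1))
    return np.array([np.sqrt(2 * k + 1) * L[k] for k in range(K + 1)])
```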
3 Estimating Functions for Linear Models
In linear models, though the observations may not be identically distributed, the error components are assumed to be i.i.d.r.v.'s. Conventionally, we define a vector of observable random variables $Y_n = (Y_1, \ldots, Y_n)'$ by letting
$$Y_n = X_n\beta + e_n; \quad e_n = (e_1, \ldots, e_n)', \qquad (3.1)$$
where $X_n$ is a given (nonstochastic) n × p matrix, $\beta = (\beta_1, \ldots, \beta_p)'$ is a vector of unknown regression parameters, and the errors $e_i$ are assumed to be i.i.d.r.v.'s. For normally distributed errors the ML EE's are linear in $Y_n$, and they agree with the LSE. Specifically, we have
$$\hat\beta_n = (X_n' X_n)^{-1} X_n' Y_n. \qquad (3.2)$$
In a general parametric model with an assumed error density f, the ML EE is given by
$$\sum_{i=1}^n x_i \{-f'(Y_i - x_i'\beta)/f(Y_i - x_i'\beta)\} = 0, \qquad (3.3)$$
where $X_n' = (x_1, \ldots, x_n)$. The situation is quite comparable to the i.i.d. case presented in (2.10) through (2.16), and although the ARE properties are isomorphic, a possible lack of robustness would be more accentuated here because of the nonidentity of the $x_i$. Moreover, for f not belonging to an exponential family, the resulting MLE in (3.3) may not be a linear estimator, and hence a trial and error solution may be necessary.

The usual M-estimators for such linear models are based on EF's that resemble (3.3), along the same line as in Section 2: one would use a score function $\psi(.) : \mathbb{R} \to \mathbb{R}$ satisfying the same regularity conditions as in Section 2, and consider the EE
$$\sum_{i=1}^n x_i\, \psi(Y_i - x_i'\beta) = 0. \qquad (3.4)$$
Jureckova and Sen (1996, ch. 5) contains an extensive treatment of first and second order asymptotic distributional representations for M-estimators in linear models; their treatise exploits mostly the uniform asymptotic linearity results of M-statistics (in the regression parameters). Another approach to this type of representation is based on the Hadamard differentiability of extended statistical functionals, and this has been considered in detail by Ren and Sen (1991, 1994, 1995). As in location models, such M-estimators have good local robustness and optimality properties, but they are not genuinely nonparametric.
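A minimal sketch of solving (3.4) with Huber's bounded ψ by iteratively reweighted least squares; the tuning constant, the MAD scale, and the fixed iteration count are our own illustrative assumptions.

```python
import numpy as np

def huber_m_regression(X, y, c=1.345, n_iter=50):
    """M-estimator solving (3.4) with Huber's psi via IRLS; the scale
    is fixed at the MAD of the initial least-squares residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    s = max(1.4826 * np.median(np.abs(e - np.median(e))), 1e-8)
    for _ in range(n_iter):
        r = (y - X @ beta) / s
        w = c / np.maximum(np.abs(r), c)       # psi(r)/r: 1 inside, c/|r| out
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta
```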
R-estimators of regression parameters are based on suitable linear rank statistics, and they are globally robust and asymptotically optimal for suitable subfamilies of underlying densities. For a given $b \in \mathbb{R}^p$, we define the residuals by
$$Y_i(b) = Y_i - x_i'b, \quad i = 1, \ldots, n, \qquad (3.5)$$
and denote the aligned ranks by
$$R_{ni}(b) = \text{rank of } Y_i(b) \text{ among } Y_1(b), \ldots, Y_n(b), \quad i = 1, \ldots, n. \qquad (3.6)$$
Then a vector of aligned rank statistics is defined by
$$L_n(b) = \sum_{i=1}^n (x_i - \bar x_n)\, a_n(R_{ni}(b)), \quad b \in \mathbb{R}^p, \qquad (3.7)$$
where the scores $a_n(1) \le \cdots \le a_n(n)$ are as before. Jaeckel (1972) considered the rank measure of dispersion
$$D_n(b) = \sum_{i=1}^n Y_i(b)\, [a_n(R_{ni}(b)) - \bar a_n], \quad \bar a_n = n^{-1}\sum_{k=1}^n a_n(k), \qquad (3.8)$$
and proposed to minimize $D_n(b)$ with respect to b to obtain an estimator of β. Since the ranks are translation-invariant, it can easily be shown that $D_n(b)$ is a nonnegative, piecewise linear (and hence continuous) and convex function of b. Thus, $D_n(b)$ is almost everywhere differentiable with respect to b, and $(d/db)\, D_n(b) = -L_n(b)$ at any point of differentiability. Therefore, 'equating $L_n(b)$ to 0' in a suitable norm (such as the $L_1$ norm) yields a convenient R-estimator of β. In this context too, the uniform asymptotic linearity
of $L_n(b)$ in b in a shrinking neighborhood of the true β provides access to the variety of asymptotic results pertaining to robustness and asymptotic representations for R-estimators.

R-estimators in linear models are closely related to regression rank scores (RRS) estimators, developed mostly by Gutenbrunner and Jureckova (1992). To introduce such estimators, we make use of some related EF's, due to Koenker and Bassett (1978), termed the regression quantiles (RQ), which possess a basic regression equivariance property and are variants of L-estimators in linear models. For a given α : 0 < α < 1, define
$$\rho_\alpha(x) = |x|\{(1 - \alpha) I(x < 0) + \alpha I(x > 0)\}, \quad x \in \mathbb{R}, \qquad (3.9)$$
and define the α-RQ estimator $\hat\beta_n(\alpha)$ by
$$\hat\beta_n(\alpha) = \mathrm{Arg\,min}\Big\{\sum_{i=1}^n \rho_\alpha(Y_i - x_i'b) : b \in \mathbb{R}^p\Big\}. \qquad (3.10)$$
Following Koenker and Bassett (1978), it can also be shown that an α-RQ can be characterized as the optimal solution $\hat\beta(\alpha)$ of the linear programming problem
$$\alpha \sum_{i=1}^n r_i^+ + (1 - \alpha) \sum_{i=1}^n r_i^- = \min,$$
$$\sum_{j=1}^p x_{ij}\beta_j + r_i^+ - r_i^- = Y_i, \quad i = 1, \ldots, n;$$
$$\beta_j \in \mathbb{R}, \ j = 1, \ldots, p; \qquad r_i^+ \ge 0, \ r_i^- \ge 0, \ i = 1, \ldots, n; \qquad (3.11)$$
where $r_i^+$ ($r_i^-$) is the positive (negative) part of the residual $Y_i - \beta' x_i$, $i = 1, \ldots, n$. Thus, for a given α : 0 < α < 1/2, usually small, if we define a diagonal matrix $C_n = \mathrm{diag}(c_{n1}, \ldots, c_{nn})$ by letting
$$c_{ni} = I\{x_i'\hat\beta(\alpha) < Y_i < x_i'\hat\beta(1 - \alpha)\}, \quad i = 1, \ldots, n, \qquad (3.12)$$
then an α-trimmed (T)LSE of β can be defined as
$$T_n(\alpha) = (X_n' C_n X_n)^{-1} X_n' C_n Y_n. \qquad (3.13)$$
Here the EF in (3.9)-(3.10) is primarily used to obtain the matrix $C_n$, so that robustness and efficiency considerations are to be related to this basic choice. We refer to Jureckova and Sen (1996, ch. 4) for a detailed discussion of these aspects of RQ and related TLSE. To introduce the RRS estimators, for a given α ∈ (0,1), we define the vector of regression ranks (RR) $\hat a_n(\alpha) = (\hat a_{n1}(\alpha), \ldots, \hat a_{nn}(\alpha))'$ as the optimal solution of the linear programming problem
$$\sum_{i=1}^n Y_i\, \hat a_{ni}(\alpha) = \max,$$
$$\sum_{i=1}^n x_{ij}\, \hat a_{ni}(\alpha) = (1 - \alpha) \sum_{i=1}^n x_{ij}, \quad j = 1, \ldots, p;$$
$$\hat a_{ni}(\alpha) \in [0,1], \ i = 1, \ldots, n; \quad \alpha \in (0,1). \qquad (3.14)$$
Let φ(u), u ∈ (0,1), be a nondecreasing, square integrable score generating function, and let
$$\phi_n(u) = \phi(\alpha^*) I(0 < u < \alpha^*) + \phi(u) I(\alpha^* \le u \le 1 - \alpha^*) + \phi(1 - \alpha^*) I(1 - \alpha^* < u < 1), \qquad (3.15)$$
where $\alpha^* \in (0, 1/2)$ and is usually chosen small. Then the RRS, generated by the score function φ, are taken as
$$\hat b_{ni} = -\int_0^1 \phi_n(\alpha)\, d\hat a_{ni}(\alpha), \quad i = 1, \ldots, n. \qquad (3.16)$$
We can then consider a regression rank measure of dispersion
$$D_n^*(b) = \sum_{i=1}^n (Y_i - b'x_i)\,[\hat b_{ni}(Y - Xb) - \bar\phi_n], \qquad (3.17)$$
where the RRS $(\hat b_{ni}(Y - Xb), i = 1, \ldots, n)$ are computed from the aligned observations Y − Xb and $\bar\phi_n = \int_0^1 \phi_n(u)\, du$. Then the derived RRS estimator is defined as
$$\hat\beta_n = \mathrm{Arg\,min}\{D_n^*(b) : b \in \mathbb{R}^p\}. \qquad (3.18)$$
A similar estimating function works out for the estimation of a subvector of β. The interesting fact is that, under fairly general regularity conditions and based on a common score function φ(.), the classical R-estimator and the RRS estimator are asymptotically equivalent up to the order $n^{-1/2}$; we refer to Section 6.8 of Jureckova and Sen (1996) for details. Robustness aspects of both these types of rank estimators can therefore be studied in a unified manner, without requiring a bounded score generating function. We conclude this section with the note that, as in the case of the semiparametric location model, in semiparametric linear models, whenever the density f admits a finite Fisher information I(f), we may construct adaptive EF's based on adaptive rank scores statistics, and these yield asymptotically optimal estimators of the parameters involved in the parametric part of the model [viz., Huskova and Sen, 1985; Sen, 1996a]. Thus, adaptive EF's in semiparametric linear models, though computationally more complex, yield robust and asymptotically efficient estimators.
4 Estimating Functionals
In traditional and generalized linear models, as well as in other specific forms of nonlinear models, the parameter space is essentially finite dimensional, so that EF's are vector valued. Robustness and nonparametric perspectives often prompt us not to advocate such finite dimensional models; the conditional mean or quantile regression functions in a multivariate nonparametric setup are simple and classical examples of such functionals. While the basic motivations for EF's remain essentially the same in such a functional case, technical manipulations are usually more extensive. For this reason, robustness and efficiency considerations are to be assessed in a somewhat different manner.

In a multivariate setup, let $(Y_i, X_i)$, $i = 1, \ldots, n$, be i.i.d.r.v.'s with a d.f. F defined on $\mathbb{R}^{p+1}$, and let G(y|x) be the conditional d.f. of Y, given X = x. Then typically a regression functional of Y on x is a location functional of the conditional d.f. G(.|x), $x \in \mathbb{R}^p$. Therefore, as long as this conditional d.f. can be estimated consistently and efficiently in a nonparametric fashion, suitable sample counterparts of such location functionals (based on, for example, appropriate L-, M- and R-statistics) can be constructed in a robust manner. Sans multinormality of F, such regression functionals may not generally be linear (in x); moreover, a specific nonlinear form in turn calls for a specific form of F, though in general such functionals may otherwise exhibit good smoothness properties. Therefore, in a nonparametric or semiparametric setup, it seems quite plausible to incorporate appropriate smoothness conditions on such conditional functionals and estimate them in a robust, consistent and efficient manner. Of course, one has to pay a little penalty for choosing an infinite dimensional parameter space when actually a finite dimensional one prevails; but, in the opposite case, a finite dimensional model based statistical analysis may be totally inadequate when a functional parameter space prevails.

Among various possibilities, we may mention specifically two popular approaches to this problem: (i) the nearest neighbor (NN) method, and (ii) kernel smoothing methods. In the NN-method, corresponding to a set pivot $x_0$, usually lying in a convex set $C \subset \mathbb{R}^p$, we define the pseudovariables
$$Z_i = d(X_i, x_0), \quad i = 1, \ldots, n, \qquad (4.1)$$
where d(.) is a suitable metric on $\mathbb{R}^p$ (which may as well be taken as the Euclidean norm), and denote the corresponding order statistics by $Z_{n:1} \le \cdots \le Z_{n:n}$; note that they specifically depend on the base sample as well as the chosen pivot. Also, we define a nondecreasing sequence $\{k_n\}$ of positive integers such that $k_n \to \infty$ but $n^{-1} k_n \to 0$ as $n \to \infty$. Further, we set the antiranks $S_i$ by letting $Z_{n:i}$ correspond to $X_{S_i}$, for $i = 1, \ldots, n$. Then an empirical d.f. at the pivot $x_0$ is defined by
$$G_{n,k_n}(y\,|\,x_0) = k_n^{-1} \sum_{i=1}^{k_n} I(Y_{S_i} \le y). \qquad (4.2)$$
Estimating functionals are then based on the entire set $G_{n,k_n}(\cdot\,|\,x_0)$, $x_0 \in C$. Naturally, robustness considerations dominate the choice of such functionals (viz., Sen 1996b). In a kernel method, we choose a known density φ(x) possessing some smoothness properties, such as unimodality and symmetry around 0, compact support and differentiability up to a certain order, and define a smooth conditional empirical d.f. by integrating $\phi(x - x_0)$ with respect to the empirical d.f. $F_n$. This conditional measure is then incorporated in the formulation of suitable robust functionals. These two methods compare favorably with respect to their asymptotic bias, asymptotic mean squares and robustness pictures. There is, however, a common concern that stems from the fact that such conditional functionals span an infinite dimensional parameter space, and hence, in setting suitable confidence sets, a prescription in terms of a finite number of pivots may not suffice. Weak convergence approaches [viz., Sen 1993, 1995, 1996b] provide viable alternatives, yet retaining robustness to a certain extent.
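A minimal sketch of the NN conditional empirical d.f. of (4.1)-(4.2), using the Euclidean metric; the function name and the closure-based return are our own choices.

```python
import numpy as np

def nn_conditional_cdf(X, Y, x0, k_n):
    """Nearest-neighbour empirical conditional d.f. (4.2) at the pivot x0:
    keep the k_n observations whose X_i are closest to x0 and return
    their empirical d.f. as a function of y."""
    Z = np.linalg.norm(X - x0, axis=1)     # pseudovariables (4.1)
    S = np.argsort(Z)[:k_n]                # antiranks of the k_n nearest
    Y_near = np.sort(Y[S])
    return lambda y: np.searchsorted(Y_near, y, side="right") / k_n
```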
5 General Remarks
In the context of (generalized) linear models (GLM), possibly involving nuisance parameters, EF's have received a good deal of attention, and these developments constitute a major advancement in the research literature. There is, however, a point worth mentioning: the very motivation of retaining the flavor of the exponential family of densities by skillful choice of (canonical) link functions yields (generalized) GEE's that share nonrobustness properties with MLE's and other parametric estimators. In most biomedical applications the response variate is nonnegative and typically has a positively skewed distribution, so often a Box-Cox type transformation is used to induce more symmetry (if not normality); however, this may also distort the inherent linearity or other parametric structure of the underlying dose-response relations. Hence GLM's [viz., McCullagh and Nelder, 1989] may not be universally advocated in such studies. Considerations of model robustness naturally call for nonparametrics or semiparametrics, and as such L-, M- and R- EF's, along with their siblings, come into the picture. In this context, the dimension of the nuisance parameter is often large, if not infinite, and the estimation parameter space may also be large. In a quasi-parametric setup, under the cover of semiparametrics, Godambe (1985) initiated a line of attack for such EF's that are potentially applicable to various stochastic
processes where independence of the observations may not hold. A good deal of extension of his seminal work has taken place during the past ten years. If we have a good feel for the conditional distributions of the observations given the past ones, then Godambe's scores can be obtained in an appropriate manner, and his suggested avenue leads to a finite sample optimality property, interpreted in terms of the smallness of the mean square error of the estimators. However, sans the knowledge of these underlying distributions, we may not be able to decide on a proper choice of the Godambe scores, and this may vitiate his small sample optimality properties when the assumed scores do not correspond to the likelihood based ones; such semiparametric procedures are therefore likely to be quite nonrobust to possible departures of the assumed conditional distributions from the true ones. Therefore, as in Huber (1964), we should consider scores which remain robust in such situations.

In a genuine semiparametric model, we usually allow a finite dimensional estimable parameter space while retaining the infinite dimensionality of the nuisance parameter space; for example, the d.f.'s are unknown and arbitrary, but the linearity of the regression prevails. As such, compared to pure nonparametrics, such semiparametrics may yield more efficient estimators when the postulated model is correct, but are naturally more nonrobust to plausible departures from such assumed models. One other advantage of semiparametrics is that the finite dimensionality of the estimable parameter space usually permits the adoption of adaptive procedures which are asymptotically optimal with respect to the postulated model. We refer again to the semiparametric linear models, for which there are adaptive R- or M-estimators that are asymptotically efficient [viz., Huskova and Sen (1985, 1986)]. In a relatively more general setup of statistical functionals, under a similar semiparametric modeling, such adaptive EF's have also been discussed in Sen (1996a). From this perspective it is clear that the modeling part is a vital task in formulating the estimation space, and EF's are to be considered in the light of the dimensionality and structure of this setup. In this context, robustness and (asymptotic) efficiency considerations are of utmost importance. In most biomedical, clinical and environmental studies, this modeling is generally far more complex, and conventional parametric GLM's may not be that appropriate even following suitable transformations. Therefore, there is a need to focus on the appropriateness of suitable semiparametric and nonparametric models, and model flexibility often favors the latter choice. We refer to Sen (1996c) for some discussion of EF's in GLM's in biostatistical applications.
6 Acknowledgements
The author is grateful to Professor V. P. Godambe and the referees for their useful comments on the manuscript.

References
Cox, D. R. (1972). Regression models and life tables (with discussion). J. Roy. Statist. Soc. Ser. B 34, 187-220.
Godambe, V. P. (1985). The foundation of finite sample estimation in stochastic processes. Biometrika 72, 419-428.
Gutenbrunner, C. and Jureckova, J. (1992). Regression rank scores and regression quantiles. Ann. Statist. 20, 305-330.
Hampel, F. R., Ronchetti, E., Rousseeuw, P. J. and Stahel, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. Statist. 19, 293-325.
Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35, 73-101.
Huskova, M. and Sen, P. K. (1985). On sequentially adaptive asymptotically efficient rank statistics. Sequen. Anal. 4, 125-151.
Huskova, M. and Sen, P. K. (1986). Sequentially adaptive signed rank statistics. Sequen. Anal. 5, 237-251.
Jaeckel, L. A. (1972). Estimating regression coefficients by minimizing the dispersion of the residuals. Ann. Math. Statist. 43, 1449-1458.
Jureckova, J. and Sen, P. K. (1996). Robust Statistical Procedures: Asymptotics and Interrelations. Wiley, New York.
Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica 46, 33-50.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall, London.
Nolan, D. and Pollard, D. (1987). U-processes: rates of convergence. Ann. Statist. 15, 780-789.
Ren, J.-J. and Sen, P. K. (1991). Hadamard differentiability of extended statistical functionals. J. Multivar. Anal. 39, 30-43.
Ren, J.-J. and Sen, P. K. (1994). Asymptotic normality of regression M-estimators: Hadamard differentiability approaches. In Asymptotic Statistics (eds. Mandl, P. and Huskova, M.), Physica-Verlag, Vienna, pp. 131-147.
Ren, J.-J. and Sen, P. K. (1995). Hadamard differentiability on D[0,1]^p. J. Multivar. Anal. 45, 14-28.
Sen, P. K. (1964). On some properties of rank weighted means. J. Ind. Soc. Agri. Statist. 16, 51-61.
Sen, P. K. (1981). Sequential Nonparametrics: Invariance Principles and Statistical Inference. Wiley, New York.
Sen, P. K. (1993). Perspectives in multivariate nonparametrics: conditional functionals and ANOCOVA models. Sankhyā, Ser. A 55, 516-532.
Sen, P. K. (1994). Regression quantiles in nonparametric regression. J. Nonparamet. Statist. 3, 237-253.
Sen, P. K. (1995). Robust and nonparametric methods in linear models with mixed effects. Tatra Mount. Math. Publ. 7, 331-342.
Sen, P. K. (1996a). Statistical functionals, Hadamard differentiability and martingales. In A Festschrift for J. Medhi (eds. Borthakur, A. C. and Chaudhury, H.), New Age Press, Delhi, pp. 29-47.
Sen, P. K. (1996b). Regression rank scores estimation in ANOCOVA. Ann. Statist. 24, 1586-1601.
Sen, P. K. (1996c). An appraisal of generalized linear models in biostatistical applications. J. Appl. Statist. Sc. 5, 61-78.
Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Applications to Statistics. Wiley, New York.
von Mises, R. (1947). On the asymptotic distribution of differentiable statistical functions. Ann. Math. Statist. 18, 309-348.
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
Inference From Stable Distributions
Hammou El Barmi and Paul I. Nelson
Kansas State University

ABSTRACT

We consider linear regression models of the form Y = Xβ + e where the components of the error term have symmetric stable (SαS) distributions centered at zero with index of stability α in the interval (0,2). The tails of these distributions get progressively heavier as α decreases, and their densities have known closed form expressions in only two special cases: α = 2 corresponds to the normal distribution and α = 1 to the Cauchy distribution. The SαS family of distributions has moments of order less than α. Therefore, for α < 1, the components of Xβ are viewed as location parameters. The usual theory of optimal estimating functions does not apply since the variances of the components of Y are not finite. We study the behavior of estimators of β based on three types of estimating equations: (1) least squares, (2) maximum likelihood and (3) optimal norm. The score function from these stable models can also be used to consistently estimate β for a general class of variance mixture error models.

Key Words: Stable distribution, regression, estimating function, consistency, constrained minimization, variance mixture.
1 Introduction

Statistical analyses of regression type models of the form

Y = Xβ + e    (1.1)

typically assume that the error terms have independent normal distributions with common variance and that the components of the full rank design matrix X are constants. Here, we generalize and allow the components of e to have independent symmetric stable distributions with infinite variance. A stable distribution symmetric about μ has a log-characteristic function of the form

φ(t) = −|σt|^α + iμt,    (1.2)
where σ > 0 is a scale parameter and α is called the index of stability, 0 < α ≤ 2. Affine transformations of independent copies of a stable random variable also have stable distributions. This closure property has proved useful in applications to economics and astronomy. The tails of these distributions get progressively heavier as α decreases. Except for α = 2, stable distributions only have moments of order less than α. The normal distribution with variance 2σ² corresponds to α = 2 and the Cauchy distribution to α = 1. For α < 1, the components of Xβ should be viewed as location parameters. We use V ~ SαS(σ) (called symmetric alpha stable) to denote that V has the distribution given in (1.2) with μ = 0. The recent text by Samorodnitsky and Taqqu (1994) is an excellent source of information on stable distributions and processes. We will need one special type of skewed stable distribution with index of stability δ < 1, having log-characteristic function of the form:

φ(t) = −σ^δ |t|^δ (1 − i sgn(t) tan(πδ/2)),    (1.3)

where sgn(t) denotes the sign of t. Random variables having such distributions are supported on the positive axis and called stable subordinators of index δ with scale parameter σ. From (1.2) and (1.3) it follows that if V ~ SαS(σ), α < 2, then, in distribution,

V = √A Z,    (1.4)

where Z ~ N(0, 2σ²), A is a stable subordinator of index α/2 having scale parameter cos(πα/4), and A is independent of Z. See Samorodnitsky and Taqqu (1994). Thus, every SαS distribution is a variance mixture of normals, and in particular we take the components {e_i} of the error term in (1.1) to be independent SαS(1) random variables. The representation of a stable subordinator A given on page 29 of Samorodnitsky and Taqqu (1994) leads to E(1/A^k) < ∞ for k > −α/2. A method for estimating the p×1 vector of parameters β when the error terms {e_i} do not have finite variances, based on transforming the observations Y into bounded complex random variables exp(it y_j), j = 1, 2, ..., n, is given in Chambers and Heathcote (1981) and Paulson and Delehanty (1985). Also see Merkouris (1991) and McLeish and Small (1991). Here, we investigate the performance of estimators obtained from three estimating equations: (1) maximum likelihood, (2) minimum α norm, (3) least squares. Our analyses assume that the index of stability α is known. The value of σ is not needed to estimate the regression parameters β, but would be to construct confidence intervals. In practice both α and σ could be iteratively estimated from residuals. Our simulation study, presented in Section 5, illustrates how this can easily be done for α. The representation given in (1.4) leads to the observation that the form of the score function for SαS errors can also be used to consistently estimate β for a general class of error distributions modeled as variance mixtures. We use boldface to denote column vectors, v^T to indicate the transpose of v and |v| its length. Let x_i = (x_{i1}, x_{i2}, ..., x_{ip})^T denote the ith column of the transpose of the design matrix X, i = 1, 2, ..., n. Let P denote the underlying probability and P_β the probability generated by shifting the random vector e by an amount Xβ. Unsubscripted probabilities and expectations are with respect to P. Note that P = P_0. We follow the common practice of omitting the qualifier "a.e." between almost everywhere equal random variables when context allows.
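The variance-mixture representation (1.4) is easy to check numerically. The sketch below, which assumes that scipy.stats.levy_stable's default parameterization matches (1.2)-(1.3), generates V = √A Z and compares central quantiles with direct SαS draws; all sample sizes are illustrative.

    # Minimal numerical check of (1.4): V = sqrt(A) * Z.
    import numpy as np
    from scipy.stats import levy_stable, norm

    rng = np.random.default_rng(0)
    alpha, n = 1.5, 200_000

    # Direct SaS(1) draws: characteristic function exp(-|t|^alpha).
    v_direct = levy_stable.rvs(alpha, 0.0, size=n, random_state=rng)

    # Subordinator of index alpha/2, totally skewed (beta = 1),
    # with scale cos(pi*alpha/4) as in the text below (1.4).
    a = levy_stable.rvs(alpha / 2, 1.0, scale=np.cos(np.pi * alpha / 4),
                        size=n, random_state=rng)
    z = norm.rvs(scale=np.sqrt(2.0), size=n, random_state=rng)  # N(0, 2)
    v_mixture = np.sqrt(a) * z

    # Compare central quantiles; extreme quantiles are noisy for heavy tails.
    qs = [0.10, 0.25, 0.50, 0.75, 0.90]
    print(np.quantile(v_direct, qs))
    print(np.quantile(v_mixture, qs))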
2 Maximum Likelihood
The lack of closed form expressions for their probability densities makes the use of maximum likelihood with stable distributions very difficult. Zolotarev (1966) provides an integral representation of symmetric stable densities which was used by Brorsen and Yang (1990) to find the maximum likelihood estimates (mle's) of the parameters α, σ and μ. DuMouchel (1971) uses a multinomial approximation to the likelihood equation to estimate these parameters. Feuerverger and McDunnough (1981) employ a fast Fourier transform of the empirical characteristic function to obtain an approximate likelihood. Here, we allow the location parameter μ to depend on covariates and use the score function directly to develop asymptotic properties of the mle of the vector of regression parameters β. Let the components of the error term e given in (1.1) be independent SαS(1), 0 < α < 2. From (1.4) we have that:

e_i = Z_i √A_i,    (2.1)

i = 1, 2, ..., n, where the stable subordinators {A_i} of index α/2 and the mean zero normal random variables {Z_i} with variance 2 are all jointly independent. The probability density f of Y_i in (1.1) and the likelihood L_n(β) are given by:

f(y_i | β) = ∫₀^∞ (2a)^{−1/2} φ((y_i − x_i^T β)/√(2a)) g(a) da,    L_n(β) = ∏_{i=1}^n f(y_i | β),    (2.2)

where g is the density of a stable subordinator as described in (1.3) with δ = α/2, σ = cos(πα/4), and φ is the standard normal density. From Proposition 1.3.1 of Samorodnitsky and Taqqu (1994), f(y|β) is a Cauchy density for α = 1. Hence f(y|β), which is a mixture of log-concave functions of β, is not necessarily log-concave as a function of β. However, we give conditions in
Theorem 2.1 which ensure that as n → ∞, L_n(β) a.e. has a local maximum in every neighborhood of the true regression parameter. Since E(1/A^{1/2}) is finite and φ is bounded, differentiation with respect to the components of β can be passed through the integral in (2.2). The score function l_n(β) then has the form of a weighted least squares estimating function:

l_n(β) = Σ_{i=1}^n x_i (y_i − x_i^T β) w(y_i − x_i^T β)/2,    (2.3)

where the weights are, a.e. P_0, given in terms of conditional expectations of the form:

w(y_i − x_i^T β) = E_β(1/A_i | y_i − x_i^T β).    (2.4)
In a simulation study summarized in Section 5 we were able to effectively find roots of l_n(β) by using Monte Carlo integration to approximate the likelihood function L_n(β) and a grid search to find its local maximum. This process avoids the more difficult task of computing and finding the root of the score function. The score function given in (2.3) can also be used to form an estimating function for a general class of variance mixture models. Suppose that {Z_i} in (2.1) are i.i.d. according to a baseline location-scale family with continuous pdf φ* having mean zero and finite variance, say variance = 2 to be in conformity with the stable case. Further assume that {A_i} are i.i.d. according to any distribution with pdf g* supported on the positive axis such that (2.4) exists. Let G_n(β) represent l_n(β) as given in (2.3) with φ replaced by φ* and g replaced by g*:
G_n(β) = Σ_{i=1}^n x_i (y_i − x_i^T β) E*(1/A_i | y_i − x_i^T β)/2,    (2.5)

where E*(·|·) denotes conditional expectation under φ* and g*. The estimating function G_n(β) may be motivated as follows. Conditional on {A_i}, Q_n(β) = Σ_{i=1}^n x_i (y_i − x_i^T β)/(2A_i) is a Godambe optimal estimating function, and hence optimality holds unconditionally if second moments are finite. However, the variables {A_i} are not observable, so that Q_n(β) cannot be used to estimate β. The estimating function G_n(β) then results by conditioning on the data: G_n(β) = E(Q_n(β) | y_i, i = 1, 2, ..., n). In Theorems 2.1 and 2.2 we give conditions under which roots of G_n(β) = 0 yield consistent estimators of β. We now delete the superscript * from the expectation operator.
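The Monte Carlo strategy mentioned above (and used again in the simulation study of Section 5) can be sketched as follows: approximate the integral in (2.2) by averaging over subordinator draws and maximize the resulting approximate log-likelihood over a grid. The grid, sample sizes, and use of scipy.stats.levy_stable are illustrative assumptions, not the authors' exact implementation.

    # Sketch: Monte Carlo approximation of L_n(beta) from (2.2) plus grid search.
    import numpy as np
    from scipy.stats import levy_stable, norm

    def loglik_grid(y, X, alpha, betas, n_mc=2000, seed=0):
        """Approximate log L_n(beta) on a grid by averaging the N(0, 2a)
        density over Monte Carlo draws of the subordinator A."""
        rng = np.random.default_rng(seed)
        a = levy_stable.rvs(alpha / 2, 1.0, scale=np.cos(np.pi * alpha / 4),
                            size=n_mc, random_state=rng)
        sd = np.sqrt(2.0 * a)
        lls = []
        for b in betas:
            resid = y - X @ b
            # f(y_i | b) ~ average over draws of phi(resid_i / sd_j) / sd_j
            dens = norm.pdf(resid[:, None] / sd[None, :]) / sd[None, :]
            lls.append(np.sum(np.log(dens.mean(axis=1))))
        return np.array(lls)

    # Toy use with p = 1:
    rng = np.random.default_rng(1)
    n, alpha, beta_true = 50, 1.2, 5.0
    x = rng.uniform(0, 5, size=n)
    X = x[:, None]
    e = levy_stable.rvs(alpha, 0.0, size=n, random_state=rng)  # SaS(1) errors
    y = X @ np.array([beta_true]) + e
    grid = [np.array([b]) for b in np.linspace(4.5, 5.5, 101)]
    ll = loglik_grid(y, X, alpha, grid)
    print("MLE on grid:", grid[int(np.argmax(ll))])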
To simplify the derivation of asymptotic properties of roots of G_n(β) = 0, assume without loss of generality that the true β = 0. Using a result given in Aitchison and Silvey (1958), the asymptotic existence and strong consistency as n → ∞ of roots of (2.5) hold under conditions which guarantee that for all sufficiently small δ > 0, a.e. P,

limsup_{n→∞} {sup β^T G_n(β), |β| = δ} < 0.    (2.6)

Obtaining the uniform upper bound on β^T G_n(β) required in (2.6) can be difficult. Key tools of our approach are strong laws of large numbers. First, Neveu (1975) shows that if {U_i} are jointly independent zero mean random variables with finite variances {σ_i²}, then s_n² = Σ_{i=1}^n σ_i² → ∞ implies that a.e.:

Σ_{i=1}^n U_i / h(s_n²) → 0,    (2.7)

for h a nondecreasing positive function with:

∫_1^∞ (h(t))^{−2} dt < ∞.    (2.8)

Chung (1974, page 130) states that if the random variables {U_i} are i.i.d. with E(U_i) = 0, then for any sequence of uniformly bounded constants {c_n}, a.e.:

Σ_{i=1}^n c_i U_i / n → 0.    (2.9)

Theorem 2.1 on consistency will show, for example, that by taking h(t) = t^q, 0.5 < q < 1, asymptotically a strongly consistent root β̂_n of G_n(β) exists a.e. if, for a positive definite matrix Σ, as n → ∞,

X^T X/n → Σ.    (2.10)
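As a quick numerical illustration of the normalization in (2.7): for i.i.d. mean-zero, unit-variance U_i we have s_n² = n, and with h(t) = t^q, 0.5 < q < 1 (so that (2.8) holds), the normalized partial sums drift to zero. This is only a sanity check under assumed Gaussian U_i.

    # Sanity check of the strong law (2.7) with h(t) = t^q, q = 0.75.
    import numpy as np

    rng = np.random.default_rng(0)
    q = 0.75
    u = rng.standard_normal(1_000_000)       # i.i.d., mean 0, variance 1
    partial = np.cumsum(u)
    n = np.arange(1, u.size + 1)
    ratio = partial / n**q                   # h(s_n^2) = n^q
    for k in (10**3, 10**4, 10**5, 10**6):
        print(k, ratio[k - 1])               # shrinks toward 0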
Theorem 2.1 Let e_n(min) and e_n(max) denote the minimum and maximum eigenvalues of X^T X and let h₁(t) and h₂(t) be functions of the type given in (2.8). Then, asymptotically a strongly consistent root of G_n(β) exists a.e. P if, as n → ∞,

e_n(min) → ∞,    ρ = limsup_{n→∞} h₁(e_n(max))/e_n(min) < ∞,    limsup_{n→∞} h₂(σ_n²(j,k))/e_n(min) < ∞,    (2.11)

where the σ_n²(j,k), j, k = 1, ..., p, are defined in the proof. If in addition (2.10) holds, letting β̂_n denote the consistent root, we have that as n → ∞, |β̂_n − β| = O_p(1/√(e_n(min))).
Proof. Let |β| = δ > 0. From (2.5) we have that:

2β^T G_n(β) = Σ_{i=1}^n β^T x_i e_i E(1/A_i | e_i) − Σ_{i=1}^n (β^T x_i)² E(1/A_i | e_i) = C_n − D_n.    (2.12)

Under P₀, C_n is a sum of independent, zero mean random variables with σ_n² = Variance(C_n) = E(e₁² E²(1/A₁|e₁)) β^T X^T X β = τ β^T X^T X β ≥ τ δ² e_n(min) → ∞ by hypothesis, where τ = E(e₁² E²(1/A₁|e₁)). From the second line of (2.11), we then have that a.e., for large n,

|C_n|/β^T X^T X β ≤ |C_n|/δ² e_n(min) ≤ 2ρ|C_n|/δ² h₁(e_n(max)) ≤ 2ρ|C_n|/δ² h₁(σ_n²/τδ²).

Since C_n is linear in β, from (2.7), as n → ∞ we have that uniformly in β, |β| = δ, C_n/β^T X^T X β → 0 a.e. The second quantity on the last line of (2.12) can be written as D_n = β^T Σ_{i=1}^n U_i β + E(1/A₁) β^T X^T X β, where the p×p matrix U_i = (U_i(j,k)) has entries U_i(j,k) = x_{ij} x_{ik} [E(1/A_i|e_i) − E(1/A_i)], j, k = 1, 2, ..., p. For each j, k pair, W_n(j,k) = Σ_{i=1}^n U_i(j,k) is a weighted sum of jointly independent, mean zero random variables with Variance(W_n(j,k)) = Σ_{i=1}^n x_{ij}² x_{ik}² Variance(E(1/A_i|e_i)). If σ_n²(j,k) = Variance(W_n(j,k)/√η_n) is bounded in n, then {W_n(j,k)/√η_n} converges a.e. to a finite limit as n → ∞. Hence, from (2.7) and (2.11),

W_n(j,k)/e_n(min) = {h₂(σ_n²(j,k)) W_n(j,k)}/{h₂(σ_n²(j,k)) e_n(min)} → 0
a.e., uniformly in β, |β| = δ. Finally, uniformly in β, β^T G_n(β) is a.e. asymptotically equivalent to β^T X^T X β(−E(1/A₁) + o(1)) ≤ δ² e_n(min)(−E(1/A₁) + o(1)) < 0, which using (2.6) completes the proof of consistency. The bound on the rate of convergence is obtained by taking δ = 1/√(e_n(min)).

Chung's (1974) strong law as given in (2.9) allows a weakening of the conditions in (2.11) at the expense of placing a uniform bound on the elements of the design matrix and requiring that e_n(min)/n converge to a positive constant.

Theorem 2.2 Let E(1/A₁) < ∞, e_n(min)/n → c, a positive constant, and sup{|x_{ij}(n)|, i ≤ n, j ≤ p, n = 1, 2, ...} ≤ M < ∞, where x_{ij}(n) is the element in row i, column j of the design matrix X based on n observations. Then, a.e. there is asymptotically a consistent root of G_n(β).

Proof: The proof parallels the one given for Theorem 2.1 and is omitted.

Now, assume that E(1/A²) < ∞. Let Ġ_n(β) denote the matrix of partial derivatives of G_n(β) with respect to β and suppose that E(Ġ_n) = −E(G_n G_n^T), which is the case when G_n(β) is the score function l_n(β). The "information" matrix J(n, α) of l_n is then given by:

J(n, α) = X^T X E(e₁² E²(1/A₁|e₁))/4 = X^T X v(α).    (2.13)

If the second order term in the expansion of G_n(β) is suitably well behaved (a matter we have not been able to resolve), it follows from the Lindeberg-Feller central limit theorem that if a consistent root β̂ of G_n(β) exists and max{|x_i|, i ≤ n}/e_n(min) → 0, we then have for large n, approximately,

β̂ ~ N(β, (X^T X)^{−1}/v(α)).    (2.14)

Finally, if (2.10) holds, l_n(β)/√n will converge weakly to a multivariate normal distribution with mean vector 0 and covariance matrix v(α)Σ. Therefore, a test for H₀: β = β₀ can be based on T_n = l_n^T(β₀) Σ^{−1} l_n(β₀)/(n v(α)), which has asymptotically a chi-square distribution with p degrees of freedom under H₀.
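The constant v(α) in (2.13)-(2.14) involves the conditional expectation E(1/A₁|e₁), which has no closed form but can be approximated by nested Monte Carlo. A sketch follows; the sample sizes, chunking, and use of scipy.stats.levy_stable are illustrative assumptions.

    # Sketch: nested Monte Carlo estimate of v(alpha) in (2.13).
    import numpy as np
    from scipy.stats import levy_stable, norm

    def v_alpha(alpha, n_e=20_000, n_a=2_000, seed=0, chunk=1_000):
        rng = np.random.default_rng(seed)
        a = levy_stable.rvs(alpha / 2, 1.0, scale=np.cos(np.pi * alpha / 4),
                            size=n_a, random_state=rng)
        e = levy_stable.rvs(alpha, 0.0, size=n_e, random_state=rng)  # SaS(1)
        sd = np.sqrt(2.0 * a)
        total = 0.0
        for i in range(0, n_e, chunk):
            ei = e[i:i + chunk]
            # f(e | a_j) for each error draw and each subordinator draw
            dens = norm.pdf(ei[:, None] / sd[None, :]) / sd[None, :]
            w = (dens / a[None, :]).mean(axis=1) / dens.mean(axis=1)  # E(1/A|e)
            total += np.sum(ei**2 * w**2)
        return total / (4.0 * n_e)

    alpha = 1.5
    print("v(alpha) =", v_alpha(alpha))
    # Approximate covariance of the mle via (2.14): (X^T X)^{-1} / v(alpha).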
3 Minimum α Norm Estimators
Here, we seek estimators of linear compounds γ = λ^T β of the form γ̂(c) = c^T Y, for a vector of constants c, which are unbiased for γ, so that:

c^T X = λ^T,    (3.1)
and "close" to 7. Instead of using the variance of 7, which is not finite, we define 7 to be close to 7 if the scale parameter of 7 — 7 is small. Blattberg and Sargent(1971) introduced this concept for the one predictor case (p = 1). To further develop and extend these minimum α norm estimators, we first briefly describe what is called the covariation between jointly α stable, symmetric random variables. See Samorodnitsky and Taqqu(1994) for a full treatment of this concept. For α G (0,2], the random variables U = {ί/i, i = 1,2,..., n} are said to be jointly symmetric α stable, denoted SαS, if their log joint characteristic function is given by:
φ(t) = - [ |t Γ srΓ(ds),
(3.2)
Js where Γ is a finite measure, called the spectral measure, on the surface of the unit n-sphere S centered at the origin in TZn. For U having the distribution specified by (3.2), 1 < a < 2, define the covariation of Ui on Uj by [Ui,Uj]a = fsSisf^^Γids), where x<^> = sgn(x)\x\q. Covariation is not in general symmetric; [aUi,bUj]a = ab\Ui, Uj] and for a — 2, [U{,Uj]2 = Covariance(C/i, Uj)/2. Covariation leads to the α norm \\U\\a of a scalar SaS{σ) random variable U defined by \\U\\a = [U,U]Ha. Note that \\cU\\a = \c\\\U\\a and that \\U\\a = 0 if and only if 17 = 0 a.e. If the independent random variables U% ~ SaS(σ),i = 1,2, ...,n, then, for n
a vector of constants c, we have that ||c'U|| α = (]Γ]|ci<7i|α)1/α. Thus, for c 2=1
satisfying (3.1), ||-γ(c) - 7 | | S = il
Now, for 1 < a < 2, define an estimator 7(b*) to be best linear unbiased α-norm (BLUαN) estimator of 7 = λτβ if b* satisfies (3.1) and ||-γ(b*) 7l|α < ||τ(c) - 7 | | Q for all vectors c that satisfy (3.1). For a = 2, BLU2N and BLUE (best linear unbiased estimation) are identical concepts. However, for n
a < 2, it is not even necessarily so that 7(6*) = ^λi/3i, where βi are BLUαN for {$}. Unlike the mle, the BLUαN is defined only for α in (1,2]. A method for computing the BLUαN will be given below. In the scalar case, where β is a single unknown parameter, the BLUαN may be viewed as the solution of an optimal estimating equation in the following sense. Consider the family of estimating functions of the form G(k,/3) = Σ(y* " χiβ)kί-
Cal1
G
( k *>#) optimal if within this class it
i=l
minimizes \\G(k,βm/\E(d(G(k,β)/dβ)\<* = Σ*=i W V l Σ W
T h i s is
equivalent to minimizing Σ_{i=1}^n |k_i|^α subject to Σ_{i=1}^n k_i x_i = 1, which leads to the BLUαN. The concept of James orthogonality can be used to characterize BLUαN estimators. For two jointly symmetric stable random variables U and V, 1 < α ≤ 2, V is said to be James orthogonal to U if for all real τ, ||τU + V||_α ≥ ||V||_α. Samorodnitsky and Taqqu (1994) prove that V is James orthogonal to U if and only if [U, V]_α = 0. Now, let γ̂(b) and γ̂(c) be two linear unbiased estimators of γ as described above. Taking U = γ̂(c) − γ̂(b) and V = γ̂(b) − γ, we have that ||γ̂(c) − γ||_α = ||U + V||_α ≥ ||γ̂(b) − γ||_α if [U, V]_α = 0. Since the components of e are iid SαS(1), letting k = c − b, [U, V]_α = [k^T e, b^T e]_α = Σ_{i=1}^n k_i b_i^{<α−1>}, so [U, V]_α = 0 holds if:

Σ_{i=1}^n k_i b_i^{<α−1>} = 0.    (3.3)

Thus, γ̂(b*) is BLUαN if b = b* satisfies (3.1) and (3.3) holds for all c that satisfy (3.1), or equivalently if for all k in the null space of X^T, Σ_{i=1}^n k_i (b_i*)^{<α−1>} = 0. Finding the BLUαN of γ = λ^T β requires obtaining that vector b* which
minimizes Σ_{i=1}^n |b_i|^α subject to the constraint given in (3.1). In addition to using (3.3), this can be accomplished by using a Fenchel duality type theorem (Rockafellar (1970)) to characterize b*. For b satisfying (3.1), let z = b − z₀, where z₀ = X(X^T X)^{−1} λ = (z₀₁, z₀₂, ..., z₀ₙ)^T, and define the convex function f_α(z) by:

f_α(z) = (1/α) Σ_{i=1}^n |z_i + z₀ᵢ|^α.    (3.4)

It then follows that b* = z* + z₀, where z* minimizes f_α(z) subject to z ∈ N(X^T), the null space of X^T. To find z*, consider the convex conjugate of f_α given by (Rockafellar (1970)):

f_α*(y) = sup_z [Σ_{i=1}^n y_i z_i − f_α(z)].    (3.5)

Since for all y, z, f_α*(y) + f_α(z) ≥ Σ_{i=1}^n y_i z_i, we have that inf[f_α(z), z ∈ N(X^T)] + inf[f_α*(y), y ∈ N⁺(X^T)] ≥ inf[Σ_{i=1}^n y_i z_i, z ∈ N(X^T), y ∈ N⁺(X^T)] = 0, where
N⁺(X^T) is the orthogonal complement of N(X^T). Therefore, if we can find y* ∈ N⁺(X^T) such that f_α(z*) + f_α*(y*) ≤ 0, then z* and y* must respectively solve the problems of finding the infima of [f_α(z), z ∈ N(X^T)], called the primal problem, and of [f_α*(y), y ∈ N⁺(X^T)], called the dual problem. Specifically we would have f_α(z*) = inf[f_α(z), z ∈ N(X^T)] = −f_α*(y*) = −inf[f_α*(y), y ∈ N⁺(X^T)]. The values z*, y* and b* can be found with the aid of the following lemma.

Lemma 3.1 The convex conjugate (3.5) of f_α in (3.4) can be expressed as:

f_α*(y) = −Σ_{i=1}^n z₀ᵢ y_i + [(α − 1)/α] Σ_{i=1}^n |y_i|^{α/(α−1)}.    (3.6)

Moreover, y*, z* and b* must satisfy:

y_i* = (z_i* + z₀ᵢ)^{<α−1>} = (b_i*)^{<α−1>}, i = 1, 2, ..., n,    (3.7)

where b* is as defined above.

The proof of Lemma 3.1 follows from Luenberger (1969, p. 196) and is omitted. Finally, since N⁺(X^T) = {Xd, d ∈ R^p}, from the definition of z₀, we have that y* = Xd, where d achieves the minimization:

inf[−λ^T d + [(α − 1)/α] Σ_{i=1}^n |x_i^T d|^{α/(α−1)}, d ∈ R^p].    (3.8)

A direct computation yields that d is the implicit solution to the system of equations:

Σ_{i=1}^n x_i (x_i^T d)^{<1/(α−1)>} = λ.    (3.9)

The solutions to (3.9) can explicitly be found in 2 special cases. For α = 2, d = (X^T X)^{−1} λ and using (3.7) we obtain b* = X(X^T X)^{−1} λ, the usual least squares estimator. For p = 1, γ = β, d = (Σ_{i=1}^n |x_i|^{α/(α−1)})^{1−α} and

b_i* = (x_i d)^{<1/(α−1)>} = x_i^{<1/(α−1)>} / Σ_{j=1}^n |x_j|^{α/(α−1)}, i = 1, 2, ..., n.
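For general p, the dual problem (3.8) is a smooth convex minimization when 1 < α ≤ 2, so b* can be computed with a generic optimizer and then recovered via (3.7). The following sketch does this and checks the p = 1 closed form above; the optimizer choice (BFGS) and tolerances are assumptions.

    # Sketch: BLUaN weights via the dual problem (3.8) and relation (3.7).
    import numpy as np
    from scipy.optimize import minimize

    def blua_n_weights(X, lam, alpha):
        """b* minimizing sum |b_i|^alpha subject to X^T b = lam, 1 < alpha <= 2."""
        p = X.shape[1]
        r = alpha / (alpha - 1.0)          # conjugate exponent alpha/(alpha-1)

        def dual(d):                       # objective in (3.8)
            return -lam @ d + ((alpha - 1.0) / alpha) * np.sum(np.abs(X @ d)**r)

        d = minimize(dual, np.zeros(p), method="BFGS").x
        t = X @ d
        # (3.7): b*_i = (x_i^T d)^{<1/(alpha-1)>}
        return np.sign(t) * np.abs(t)**(1.0 / (alpha - 1.0))

    # Check against the p = 1 closed form:
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 5, size=20)
    alpha = 1.5
    b = blua_n_weights(x[:, None], np.array([1.0]), alpha)
    b_closed = (np.sign(x) * np.abs(x)**(1 / (alpha - 1))
                / np.sum(np.abs(x)**(alpha / (alpha - 1))))
    print(np.allclose(b, b_closed, atol=1e-4), float(b @ x))  # True, ~1 (unbiased)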
The BLUαN of γ is optimal within the class of linear unbiased estimators in the sense of having maximal probability of being close to γ. Specifically, let γ̂(b*) be BLUαN, γ̂(c) be a linear unbiased estimator of γ and ε be an SαS(1) scalar random variable. For any linear unbiased estimator, (γ̂(c) − γ)/[Σ_{i=1}^n |c_i|^α]^{1/α} is distributed as ε. Hence, for δ > 0,

P(|γ̂(c) − γ| < δ) = P(|ε| < δ/[Σ_{i=1}^n |c_i|^α]^{1/α}) ≤ P(|ε| < δ/[Σ_{i=1}^n |b_i*|^α]^{1/α}) = P(|γ̂(b*) − γ| < δ).

Conditions for the weak consistency of the BLUαN γ̂(b*) can be given in terms of the rate at which e_n(min), the minimum eigenvalue of X^T X, diverges to infinity.
Theorem 3.1 The BLUαN γ̂(b*) of γ = λ^T β converges to γ in probability as n → ∞ if n^{(2/α)−1} |λ|² / e_n(min) → 0.

Proof: Since γ̂(b*) − γ ~ SαS([Σ_{i=1}^n |b_i*|^α]^{1/α}), it suffices to show that Σ_{i=1}^n |b_i*|^α → 0. The least squares estimator of γ is given by γ̂(d) for d^T = λ^T (X^T X)^{−1} X^T, which satisfies (3.1). Hence, since γ̂(b*) is BLUαN and the usual L_p norms are non-decreasing in p,

Σ_{i=1}^n |b_i*|^α ≤ Σ_{i=1}^n |d_i|^α ≤ n^{1−α/2} (Σ_{i=1}^n d_i²)^{α/2} = n^{1−α/2} (λ^T (X^T X)^{−1} λ)^{α/2} ≤ (n^{(2/α)−1} |λ|² / e_n(min))^{α/2} → 0,

by hypothesis, which completes the proof.
4 Least Squares
The usual least squares estimator (LS) of β is given by β̂_LS = (X^T X)^{−1} X^T Y, which has the advantages of simplicity and of not requiring that α be known or estimated. Our simulation study, presented below, indicates that the least squares estimator performs reasonably well compared to the BLUαN for α > 1. For α > 1, β̂_LS is weakly consistent for β under the conditions of Theorem 3.1. Note that β̂_LS − β has an SαS distribution for all sample sizes n. Least squares also plays a role in a special case of the joint symmetric α stable distributions determined by (3.2). Suppose instead of (2.1), e_i = √A Z_i, i = 1, 2, ..., n, where now a single stable subordinator A, distributed as specified in (1.3) with δ = α/2, is multiplied by all the components of e. The random variable A is independent of {Z_i}. Such joint distributions are called sub-Gaussian. By conditioning on A, it is easily seen that the least squares estimator β̂_LS is the mle in this setting. Further, if we let σ̂² = |Y − Xβ̂_LS|²/(n − p), then λ^T(β̂_LS − β)/√(σ̂² λ^T(X^T X)^{−1} λ) has a t-distribution with n − p degrees of freedom, which can be used to construct confidence intervals for λ^T β.
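A minimal sketch of this t-based interval in the sub-Gaussian setting follows; the data-generating step (one subordinator draw, assuming scipy's levy_stable parameterization) is only for illustration.

    # Sketch: 95% t-interval for lambda^T beta under sub-Gaussian errors.
    import numpy as np
    from scipy.stats import levy_stable, norm, t as t_dist

    rng = np.random.default_rng(0)
    n, p, alpha, beta = 30, 2, 1.5, np.array([2.0, -1.0])
    X = rng.uniform(0, 5, size=(n, p))
    A = levy_stable.rvs(alpha / 2, 1.0, scale=np.cos(np.pi * alpha / 4),
                        random_state=rng)            # one subordinator draw
    Y = X @ beta + np.sqrt(A) * norm.rvs(scale=np.sqrt(2), size=n,
                                         random_state=rng)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_ls = XtX_inv @ X.T @ Y
    s2 = np.sum((Y - X @ beta_ls)**2) / (n - p)      # estimates 2A
    lam = np.array([1.0, 1.0])
    half = t_dist.ppf(0.975, n - p) * np.sqrt(s2 * lam @ XtX_inv @ lam)
    print(lam @ beta_ls - half, lam @ beta_ls + half)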
5 Simulation
We carried out a simulation study to investigate and compare the performance of the mle, BLUαN and least squares estimators in the one predictor case (p = 1), y_i = x_i β + e_i, where the {e_i} are iid as defined by (2.1). Samorodnitsky and Taqqu (1994) provide a formula for transforming uniform random variables into symmetric stable random variables and a series expansion in terms of gamma variates which may be used to generate stable subordinators. We used simulation based on the law of large numbers to approximate the integral with respect to the distribution of A given in (2.1) in order to form the likelihood L_n(β). The likelihood turned out to be quadratic in shape in a neighborhood of β, and a grid search was effective in finding the mle. In this case where p = 1, there are explicit expressions for the BLUαN and the familiar least squares estimator, denoted β̂_LS. Starting at the least squares estimator, which doesn't require knowledge of α, we formed residuals ê_i = y_i − x_i β̂_LS, i = 1, 2, ..., n, in order to estimate α. McCulloch (1986) gives expected values of the observed ratio r of specified spacings of iid symmetric α stable random variables, expressed as functions of α. Inverting the tabled values, for α ≥ 0.5, based on an observed r obtained from the residuals {ê_i}, provided a quick, moderately effective way to estimate α. Initially we iterated this scheme, but abandoned this refinement because it was time consuming and did not yield significantly better results. More sophisticated methods for estimating α from iid observations are available; see Arad (1980) for example. Additional study of the problem of estimating α in the context of regression models needs to be carried out. We generated the predictor variables {x_i} from a uniform distribution on the interval (0,5). We simulated data with n = 10, 20 and 50 and several values of α in order to cover a wide range of possible parameter settings. Since the BLUαN is not defined for α ≤ 1, whenever the estimated α was less than or equal to 1 we used α = 1.05 in the formula for computing the BLUαN. We first assess the performance of the 3 estimators in terms of their estimated mean. Table 5.1 contains (rounded to 2 places) sample means and sample mean absolute deviations, MAD = Σ|β̂_i − β|/1000, of the three β̂'s across the 1000 iterations. For α < 1, the responses do not have means and the least squares and BLUαN estimators are highly unstable, sometimes swinging from plus to minus with magnitudes of several thousand. Entries where the MAD is larger than the mean are consequently omitted from Table 5.1. The last row, corresponding to α = 2 where all 3 estimators are identical, is included to provide a basis of comparison for the other α's.
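The inversion of tabled spacing ratios described above can be mimicked as follows. This is a simplified stand-in for McCulloch's tabulation, not his actual table: it calibrates the expected ratio of quantile spacings by simulation over a grid of α and looks up the observed residual ratio; the quantile levels and grid are assumptions.

    # Sketch: simulation-calibrated spacing-ratio estimate of alpha.
    import numpy as np
    from scipy.stats import levy_stable

    def spacing_ratio(sample):
        q05, q25, q75, q95 = np.quantile(sample, [0.05, 0.25, 0.75, 0.95])
        return (q95 - q05) / (q75 - q25)   # grows as tails get heavier

    def estimate_alpha(resid, n_sim=50_000, seed=0):
        rng = np.random.default_rng(seed)
        grid = np.linspace(0.6, 2.0, 29)
        table = np.array([spacing_ratio(
            levy_stable.rvs(a, 0.0, size=n_sim, random_state=rng))
            for a in grid])                # (approximately) decreasing in alpha
        r = spacing_ratio(resid)
        return float(grid[np.argmin(np.abs(table - r))])

    # Example: residuals that are actually SaS(1.3) draws
    rng = np.random.default_rng(1)
    resid = levy_stable.rvs(1.3, 0.0, size=500, random_state=rng)
    print(estimate_alpha(resid))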
Keep in mind that all estimators (except when α = 2) used an estimated value of α and may not retain properties such as unbiasedness. Table 5.1 indicates that all 3 estimators have little bias, get better as the sample size n increases for fixed α, and are relatively unaffected by the value of α (α > 1 for the BLUαN and LS, all α for the mle). The mle is a clear winner for α < 1. In order to get a more detailed picture of the behavior of the 3 estimators, Table 5.2 presents percentages (rounded to 2 places) of times out of 1000 iterations that the estimators were within the specified distances of β. Two standard errors of the entries are no larger than 0.032. Table 5.2 reveals that the mle performs substantially better than both the least squares and BLUαN estimators for the smaller values of α, especially for the interval β ± δ₁. Note that the mle is best for the small values of α. We find the relatively good performance of the least squares estimator compared to the BLUαN for α > 1 in Table 5.2 surprising and comforting in view of its wide use. However, as noted above, the least squares estimator performs poorly for α < 1. To check limiting normality, Table 5.3 compares sample percentiles of simulated mle's to corresponding percentiles based on the asymptotic normality of the mle as given in (2.14). The symbol "S" denotes a sample percentile obtained from simulation and the symbol "A" an asymptotic percentile. Since we generated the carriers {x_i} from a uniform distribution, we globally approximated Σ_{i=1}^n x_i² by nE(x²). For n = 10 and α = 0.5 and 1.0, the "S" column of Table 5.3 indicates that the distribution is skewed right. Otherwise, the sample percentiles are approximately symmetric around the sample median and reasonably close to the asymptotic percentiles. As expected, the normal approximation improves as n increases. More work needs to be done on assessing the role of α on the rate of convergence to normality.
6 Conclusions
Regression models whose error terms have mixture distributions with infinite variance have the potential for important applications. Our study of the case where the errors have symmetric stable distributions indicates that estimation via maximum likelihood is better than least squares and BLUαN, unless it is known that α is close to 2. Future studies should investigate the performance of roots of (2.5) as estimators of β for other families of variance mixture models.
TABLE 5.1
Means and Mean Absolute Deviations of the 3 Estimators (MAD's in Parentheses)
True Value β = 5

                      n = 10         n = 20         n = 50
α     Estimator
0.5   MLE         5.05 (0.21)    5.05 (0.10)    5.00 (0.04)
0.7   MLE         5.06 (0.20)    5.04 (0.11)    5.00 (0.05)
1.0   MLE         5.02 (0.19)    5.01 (0.11)    5.00 (0.06)
      BLUαN       5.33 (1.62)    6.48 (2.19)    5.22 (1.38)
      LS          4.99 (1.30)    5.62 (2.06)    5.10 (1.42)
1.5   MLE         5.02 (0.16)    5.00 (0.11)    5.00 (0.07)
      BLUαN       5.01 (0.32)    5.01 (0.17)    5.01 (0.16)
      LS          5.00 (0.24)    4.98 (0.23)    5.01 (0.12)
1.7   MLE         5.01 (0.16)    5.00 (0.11)    5.00 (0.06)
      BLUαN       4.99 (0.18)    4.98 (0.15)    5.00 (0.10)
      LS          4.99 (0.17)    4.99 (0.13)    5.00 (0.09)
2.0   ALL         5.00 (0.13)    5.00 (0.09)    5.00 (0.06)
TABLE 5.2
Percentages of Times Estimators were within δ of β
δ₁ = 0.025, δ₂ = 0.100, δ₃ = 0.175, δ₄ = 0.250, δ₅ = 0.325

n = 10
α     Estimator   δ₁   δ₂   δ₃   δ₄   δ₅
0.5   MLE         24   59   74   81   85
      BLUαN        5   17   26   31   35
      LS           0    4    7    9   12
0.7   MLE         17   48   68   80   87
      BLUαN        6   23   35   43   49
      LS           1    8   16   23   29
1.0   MLE         14   46   68   82   88
      BLUαN        8   28   45   57   64
      LS           5   21   32   41   49
1.1   MLE         17   48   70   82   88
      BLUαN        7   29   45   57   66
      LS           6   23   37   49   58
1.3   MLE         16   48   70   84   90
      BLUαN        8   36   52   66   75
      LS           8   29   47   63   71
1.5   MLE         17   46   66   82   90
      BLUαN       10   38   58   74   83
      LS          10   36   56   69   79
1.7   MLE         13   42   64   81   89
      BLUαN       11   44   64   79   88
      LS          13   44   67   80   89
2.0   ALL         15   50   73   88   95

n = 20
α     Estimator   δ₁   δ₂   δ₃   δ₄   δ₅
0.5   MLE         39   78   88   93   94
      BLUαN        4   13   21   26   31
      LS           0    3    4    6    7
0.7   MLE         27   70   86   93   96
      BLUαN        5   20   32   39   46
      LS           2    8   13   19   24
1.0   MLE         22   61   84   93   96
      BLUαN        6   26   43   53   61
      LS           6   20   34   43   52
1.1   MLE         20   61   84   93   97
      BLUαN        7   32   51   64   72
      LS           7   26   40   52   60
1.3   MLE         21   57   81   93   97
      BLUαN        9   37   57   71   79
      LS           9   35   53   66   76
1.5   MLE         23   59   82   94   98
      BLUαN       12   44   66   82   88
      LS          13   45   65   78   87
1.7   MLE         21   59   82   93   98
      BLUαN       13   53   78   89   94
      LS          13   53   76   90   95
2.0   ALL         23   65   89   98  100

n = 50
α     Estimator   δ₁   δ₂   δ₃   δ₄   δ₅
0.5   MLE         51   94   99  100  100
      BLUαN        1    7   10   13   15
      LS           0    2    2    3    4
0.7   MLE         39   92   99  100  100
      BLUαN        3   14   22   28   33
      LS           2    7   10   14   18
1.0   MLE         35   86   98  100  100
      BLUαN        8   26   40   51   60
      LS           6   21   34   44   52
1.1   MLE         33   84   97  100  100
      BLUαN        8   32   47   59   66
      LS           8   26   40   55   63
1.3   MLE         30   83   98  100  100
      BLUαN       13   44   66   77   84
      LS          12   43   65   77   84
1.5   MLE         35   84   98  100  100
      BLUαN       18   58   80   89   94
      LS          19   57   79   89   94
1.7   MLE         29   80   97  100  100
      BLUαN       21   69   90   95   97
      LS          22   69   90   95   98
2.0   ALL         35   88   99  100  100
TABLE 5.3
Asymptotic "A" and Sample "S" Percentiles of the MLE

                      n = 10          n = 20          n = 50
α     Percentile     S      A        S      A        S      A
0.5   10           4.85   4.92     4.91   4.94     4.94   4.96
      20           4.96   4.97     4.98   4.98     4.98   4.98
      50           5.01   5.00     5.01   5.00     5.00   5.00
      80           5.09   5.03     5.05   5.02     5.03   5.02
      90           5.39   5.08     5.18   5.06     5.06   5.04
1.0   10           4.78   4.80     4.85   4.86     4.90   4.91
      20           4.91   4.92     4.94   4.94     4.96   4.96
      50           5.00   5.00     5.00   5.00     5.00   5.00
      80           5.09   5.08     5.08   5.06     5.04   5.04
      90           5.26   5.20     5.18   5.14     5.09   5.09
1.5   10           4.78   4.83     4.85   4.88     4.90   4.92
      20           4.91   4.93     4.94   4.95     4.96   4.97
      50           5.01   5.00     5.00   5.00     5.00   5.00
      80           5.11   5.07     5.08   5.05     5.04   5.03
      90           5.26   5.17     5.18   5.12     5.10   5.08
2.0   10           4.80   4.80     4.85   4.86     4.92   4.91
      20           4.93   4.92     4.93   4.94     4.96   4.96
      50           5.00   5.00     5.00   5.00     5.00   5.00
      80           5.09   5.08     5.05   5.06     5.04   5.04
      90           5.21   5.20     5.14   5.14     5.09   5.09
REFERENCES

Aitchison, J. and Silvey, S.D. (1958). "Maximum likelihood estimation of parameters subject to constraints." Ann. Math. Stat., 29, 813-828.
Arad, R.W. (1980). "Parameter estimation for symmetric stable distributions." International Economic Review, 209-220.
Blattberg, R. and Sargent, T. (1971). "Regression with non-Gaussian stable disturbances: some sampling results." Econometrica, 39, 501-510.
Brorsen, W.B. and Yang, S.R. (1990). "Maximum likelihood estimates of symmetric stable distribution parameters." Commun. Statist.-Simula., 1459-1464.
Chambers, R.L. and Heathcote, C.R. (1981). "On the estimation of the slope and the identification of outliers in linear regression." Biometrika, 21-33.
Chung, K.L. (1974). A Course in Probability Theory, Second Edition. Academic Press, N.Y.
DuMouchel, W.H. (1971). "Stable Distributions in Statistical Inference." Unpublished Ph.D. dissertation, Yale University.
Feuerverger, A. and McDunnough, P. (1981). "On efficient inference in symmetric stable laws and processes." In Statistics and Related Topics (eds. Csorgo, M., Dawson, M.A., Rao, J.N.K. and Saleh, A.K.Md.E.), North Holland, Amsterdam.
Luenberger, D.G. (1969). Optimization by Vector Space Methods. Wiley, N.Y.
McCulloch, J.H. (1986). "Simple consistent estimators of stable distribution parameters." Commun. Statist.-Simula., 1109-1136.
McLeish, D.L. and Small, C.G. (1991). "A projected likelihood function for semi-parametric models." In Proceedings of a Symposium in Honour of Professor V.P. Godambe, University of Waterloo, Waterloo, Ontario, Canada.
Merkouris, T. (1991). "A transform method of optimal estimation in stochastic processes: basic aspects." In Proceedings of a Symposium in Honour of V.P. Godambe, University of Waterloo, Waterloo, Ontario, Canada.
Neveu, J. (1975). Discrete Parameter Martingales. North Holland, Amsterdam.
Paulson, A.S. and Delehanty, T.A. (1985). "Modified weighted square error estimation procedures with special emphasis on the stable laws." Commun. Statist.-Simula., 927-972.
Rockafellar, R.T. (1970). Convex Analysis. Princeton University Press, Princeton, N.J.
Samorodnitsky, G. and Taqqu, M.S. (1994). Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. Chapman and Hall, New York.
Zolotarev, V.M. (1966). "On representation of the stable laws by integrals." Selected Translations in Mathematical Statistics and Probability, 6, 84-88. American Mathematical Society, Providence, R.I.
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
Separate Optimum Estimating Function for the Ruled Exponential Family
T. Yanagimoto
Institute of Statistical Mathematics, Tokyo
Y. Hiejima
Graduate University for Advanced Studies, Tokyo

ABSTRACT

A general family of distributions with mutually orthogonal parameters is introduced. The mean parameter is estimated by using the score function, and the dispersion parameter by using the projected estimating function of the score function. Both estimating functions attain the minimum of the sensitivity criterion due to Godambe (1960, 1976).

Key Words: Negative binomial distribution, orthogonality, separate estimating function, unbiasedness
1 Introduction
For many distributions in common practice, the parameter has two types of components. One of them represents the mean, and the other the dispersion. The mean is usually estimated by using the score function, which is essentially free from the remaining component. This fact is part of the theoretical background of the familiar generalized linear model (GLM). The aims of the present paper are to define in a general way a family of distributions having the above properties, and also to discuss the separate estimation of each component of the parameter.
2 Ruled exponential family
Consider first a density function of a random variable x on R^n. For simplicity we will not distinguish between a random variable and a sample of size 1, unless confusion is anticipated. Let t(= t(x)) be a statistic on R^s with s < n, and
m a point in R^s. Define a general family of probability density functions having common support as

F_m = {q(x) | E(t | q(x)) = m and E(e^{βt} | q(x)) exists for β ∈ B},    (2.1)

where the open set B ⊂ R^s includes 0 and may depend on the density function q(x). Consider a subfamily of F_m, F*_m(Δ) = {q(x; δ) | δ ∈ Δ}, where Δ ⊂ R^r with p = r + s < n. Let B(δ) be the parameter space of β given δ, and define the parameter space of θ = (β, δ) as Θ = {(β, δ) | δ ∈ Δ, β ∈ B(δ)}. Then the ruled exponential family is defined as follows.

Definition. Let t be an s-dimensional statistic and the family F*_m(Δ) be a subfamily of F_m in (2.1). Define P(θ) with θ = (β, δ) as

P(θ) = {p(x; β, δ) | p(x; β, δ) = exp(βt) q(x; δ)/κ(β, δ), θ ∈ Θ},    (2.2)

where κ(β, δ) is the normalizing constant. We will call this the ruled exponential family. The function κ(β, δ₀), for fixed δ₀, is the moment generating function of q(x; δ₀). For convenience we suppressed the point m in the notation of the family P(θ). This is because the point concerns the family only marginally; in many familiar examples the family does not depend on m at all. The name of the family comes from the ruled surface in geometry, where a line in the ruled surface corresponds with an exponential family P(θ(δ₀)) with δ fixed at δ₀. Let μ = E(t | p(x; β, δ)). Then we can employ another parametrization θ = (μ, δ). Since this parametrization is more convenient, it will be used throughout. Another regularity condition is that the components of the parameter, μ and δ, are variation independent, that is, Θ = M ⊗ Δ where M and Δ are the parameter spaces of μ and δ. When this condition is satisfied, the family (2.2) does not depend on the choice of m.

Example 2.1. Consider the exponential family of distributions having the density function p(x; θ) = exp{β(θ)t − b(θ) + a(x)}. Consider also the common partition t' = (t'₁, t'₂), β(θ) = (β₁(θ), β₂(θ)), and μ(θ) = (μ₁(θ), μ₂(θ)) with μ(θ) = E(t' | p(x; θ)). Let θ₁ = μ₁(θ) and θ₂ = β₂(θ). Then (θ₁, θ₂) is orthogonal as in Huzurbazar (1956). An exponential family is obviously a ruled exponential family, by setting t = t₁, μ = μ₁(θ) and δ = β₂(θ). Consider a subfamily {p(x; (μ, δ_τ)) | δ_τ ∈ τ(Δ)} for a smooth function τ(·): R^r → R^k with k < r. Then this family is a ruled exponential family, while it is not always an exponential family but a curved exponential family.
3 Some properties
The simple structure of the ruled exponential family yields properties useful for constructing estimators. Write the log-likelihood function as l(x; θ)(= l(x; μ, δ)) and its partial derivatives as l_μ(x; μ, δ) and l_δ(x; μ, δ). Recall that P(θ(δ₀)) is the exponential family with the sufficient statistic t free from δ₀. Thus the following proposition is derived from the theory of the exponential family.

Proposition 1. i) The conditional distribution of x given t is free of μ. ii) The statistic t is complete for μ in P(θ(δ₀)). iii) l_μ(x; μ, δ) = V^{−1}(μ, δ)(t − μ), where V(μ, δ) = Var(t).
(3-1)
for any (μ, δ) and (μ*, δ*) where D(β, 6>*) = £;{logp(a;; (9)/p(rr; β*) | p(x; 0)}. Two existing families in the literature are closely related to the ruled exponential family. They are the generalized power series distribution on the nonegative integers in Patil (1964) and the discrete exponential dispersion model in Jorgensen (1987). The former covers the ruled exponential family, but any study on the structure of the family is not done. It is shown that the latter is covered by the ruled exponential family.
4
Separate estimating function
We begin with discussing a 'separate estimating function' under a general condition before pursuing that in the exponential model. The regularity conditions on an estimating function g(x; 0), x 6 i? n , θ E θ C Rp in Godambe (1976) will be assumed. Consider the common partition g(x; θ) = {g\{x\ 0), 52(^5 θ)) and θ = (#1? Θ2) where the dimensions are 5 and r (s + r = p), respectively. We call an estimating function separate, if g\(x\ θ) and 52(^5 θ) depend only on θ\ and 62, respectively. A practical way to make an estimating function separate is that g\(x\ θ\) = g{x\ 0χ, ^2(^1)) where ^2(^1) is the solution of 32(#; θ\>> Θ2) = 0, and g2{x', #2) is defined similarly. This treatment is
460
YANAGIMOTO AND HIEJIMA
employed in yielding the profile likelihood estimating function. The derived separate estimating function, however, is usually biased, as emphasized in Yanagimoto and Yamamoto (1993). Another conventional way is to discuss g\(x\ 0i, #2θ)and 32(^5 #io> #2) for a fixed 0o = (#io,02θ) The estimating function g\(x; 0χ, #20) is unbiased at 0 G Θ(02θ) = {θ I 02 = 020, 0 Ξ θ}, but is not unbiased globally. A projection method of gι{x; 0i, 02o) on the space of (globally) unbiased estimating functions is developed in Amari and Kawanabe (in press). 1 Now the score function for μ is written as lμ{x; μ, δ) = V"" ^, δ)(t — μ), and is essentially separate from δ. In fact it is equivalent with t — μ. Lindsey (1995) called (μ, δ) estimation orthogonal, when the MLE of μ is free from δ. On the other hand the score function for δ is not separate from μ. Let μo be an arbitrary value. Then l$(x; μo, δ) is unbiased only at 0 G θ(μo). Proposition 1 (i) and (ii) yield the optimality of lc$(x; <$, |t). Proposition 3. i) The estimating function lμ{x\ μ, SQ) is optimum for every Jo, attaining the Cramer-Rao bound. ii) For every μo the projection of l$(x] μo, δ) on the space of unbiased estimating functions is lcs{x; δ | i), which is free from μo It is shown that both the estimating functions in Proposition 3 attain the minimum of the sensitivity criterion by Godambe (1960, 1976). Note that the estimating function Zc$(x; δ \ t) does not depend on μ at all. Thus the two components are estimated in a separate way. Note also that the two estimating functions are orthogonal (Godambe 1991).
5
Examples
The following example introduces a family of possible usefulness in practice. 1
q
ι
Example 5.1. The beta density function is written as Πxf" (l - x%) ~ / Be(p,q) with the support (0,1). The family of beta density functions is an exponential family with the sufficient statistics, logx and log(l — x). The sample mean is known to perform favorably, as an estimator of the population mean p/(p + q). Let α(0 < a < 1) be a constant, and set p = aδ and q = (1 — a)δ. Consider the density function aδ
p(x; β, δ) =
Π
x
ι
- (\
ι a δ ι
χ
-.Xi)( - ) - eP
where M(·; ·; ·) is the confluent hypergeometric function. The family of these density functions is a curved exponential family and also a ruled exponential family. Thus the sample mean is an efficient estimator of the mean.
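As a numerical check of Example 5.1 (for a single observation), the normalizing constant Be(aδ, (1−a)δ) M(aδ; δ; β) should make the exponentially tilted beta density integrate to one; M is evaluated here with scipy.special.hyp1f1, and the parameter values are arbitrary.

    # Check: the tilted beta density of Example 5.1 integrates to 1.
    import numpy as np
    from scipy.special import beta as beta_fn, hyp1f1
    from scipy.integrate import quad

    a, delta, b = 0.3, 4.0, 1.7      # 0 < a < 1; b plays the role of beta
    const = beta_fn(a * delta, (1 - a) * delta) * hyp1f1(a * delta, delta, b)

    def p(x):
        return (x**(a * delta - 1) * (1 - x)**((1 - a) * delta - 1)
                * np.exp(b * x) / const)

    print(quad(p, 0.0, 1.0)[0])      # ~ 1.0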
It is possible to extend a ruled exponential family to one having an infinite dimensional component δ, and also to one having an infinite dimensional statistic t.

Example 5.2. Consider the n-dimensional point process x(t)' = (x₁(t), ..., x_n(t)), 0 ≤ t ≤ T, having the intensity function

λ_i(t) = λ(t) exp{δ z_i(t)}, i = 1, ..., n,

where λ(t) is a positive intensity function and z(t)' = (z₁(t), ..., z_n(t)) is a covariate such that the components are not all identical. Write λ̄(t) = λ(t) Σ_i exp{δ z_i(t)}. Then the density function is expressed as

∏_i [{∏_{t∈I_i} λ_i(t)} exp{−∫₀^T λ_i(s) ds}] = [∏_{t∈I} λ̄(t)] exp{−∫₀^T λ̄(s) ds} ∏_i ∏_{t∈I_i} {λ_i(t)/λ̄(t)},

where λ̄(t) = Σ_i λ_i(t), I_i = {t | x_i(t) = 1, 0 ≤ t ≤ T} and I = ∪ I_i. Set x_T(t) = Σ_i x_i(t), the superposition. Then the intensity function of x_T(t) is Σ_i λ_i(t) = λ̄(t), which is free from δ. Let β(t) = log{λ̄(t)/λ₀(t)}. By regarding the inner product <β(t), x_T(t)> as Σ_{t∈I} β(t), we can show that λ̄(t) and δ are orthogonal in the sense of (3.1).
References

Amari, S.-I. (1985). Differential-Geometrical Methods in Statistics. Springer, Berlin.
Amari, S.-I. and Kawanabe, M. (1997). Information geometry of estimating functions in semiparametric statistical models. Bernoulli, 3, 29-54.
Godambe, V.P. (1960). An optimum property of regular maximum likelihood estimation. Ann. Math. Statist., 31, 1208-1211.
Godambe, V.P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika, 63, 277-284.
Godambe, V.P. (1991). Orthogonality of estimating functions and nuisance parameters. Biometrika, 78, 143-151.
Huzurbazar, V.S. (1956). Sufficient statistics and orthogonal parameters. Sankhya, 17, 217-220.
Jorgensen, B. (1987). Exponential dispersion models (with discussion). J. Roy. Statist. Soc., Ser. B, 49, 127-162.
Lindsey, J.K. (1995). Parametric Statistical Inference. Clarendon Press, Oxford.
Patil, G.P. (1964). Estimation of the generalized power series distribution with two parameters and its application to binomial distribution. In Contributions to Statistics (ed. C.R. Rao), 335-344. Pergamon Press, Oxford.
Yanagimoto, T. and Yamamoto, E. (1991). The role of unbiasedness in estimating equations. In Estimating Functions (ed. V.P. Godambe), 89-101. Oxford University Press, New York.