CBMS-NSF REGIONAL CONFERENCE SERIES IN APPLIED MATHEMATICS

A series of lectures on topics of current research interest in applied mathematics under the direction of the Conference Board of the Mathematical Sciences, supported by the National Science Foundation and published by SIAM.

GARRETT BIRKHOFF, The Numerical Solution of Elliptic Equations
D. V. LINDLEY, Bayesian Statistics, A Review
R. S. VARGA, Functional Analysis and Approximation Theory in Numerical Analysis
R. R. BAHADUR, Some Limit Theorems in Statistics
PATRICK BILLINGSLEY, Weak Convergence of Measures: Applications in Probability
J. L. LIONS, Some Aspects of the Optimal Control of Distributed Parameter Systems
ROGER PENROSE, Techniques of Differential Topology in Relativity
HERMAN CHERNOFF, Sequential Analysis and Optimal Design
J. DURBIN, Distribution Theory for Tests Based on the Sample Distribution Function
SOL I. RUBINOW, Mathematical Problems in the Biological Sciences
P. D. LAX, Hyperbolic Systems of Conservation Laws and the Mathematical Theory of Shock Waves
I. J. SCHOENBERG, Cardinal Spline Interpolation
IVAN SINGER, The Theory of Best Approximation and Functional Analysis
WERNER C. RHEINBOLDT, Methods of Solving Systems of Nonlinear Equations
HANS F. WEINBERGER, Variational Methods for Eigenvalue Approximation
R. TYRRELL ROCKAFELLAR, Conjugate Duality and Optimization
SIR JAMES LIGHTHILL, Mathematical Biofluiddynamics
GERARD SALTON, Theory of Indexing
CATHLEEN S. MORAWETZ, Notes on Time Decay and Scattering for Some Hyperbolic Problems
F. HOPPENSTEADT, Mathematical Theories of Populations: Demographics, Genetics and Epidemics
RICHARD ASKEY, Orthogonal Polynomials and Special Functions
L. E. PAYNE, Improperly Posed Problems in Partial Differential Equations
S. ROSEN, Lectures on the Measurement and Evaluation of the Performance of Computing Systems
HERBERT B. KELLER, Numerical Solution of Two Point Boundary Value Problems
J. P. LASALLE, The Stability of Dynamical Systems; Z. ARTSTEIN, Appendix A: Limiting Equations and Stability of Nonautonomous Ordinary Differential Equations
D. GOTTLIEB AND S. A. ORSZAG, Numerical Analysis of Spectral Methods: Theory and Applications
PETER J. HUBER, Robust Statistical Procedures
HERBERT SOLOMON, Geometric Probability
FRED S. ROBERTS, Graph Theory and Its Applications to Problems of Society
JURIS HARTMANIS, Feasible Computations and Provable Complexity Properties
ZOHAR MANNA, Lectures on the Logic of Computer Programming
ELLIS L. JOHNSON, Integer Programming: Facets, Subadditivity, and Duality for Group and Semi-Group Problems
SHMUEL WINOGRAD, Arithmetic Complexity of Computations
(continued on inside back cover)
BAYESIAN STATISTICS, A REVIEW

D. V. LINDLEY
University College London
SOCIETY for INDUSTRIAL and APPLIED MATHEMATICS
PHILADELPHIA, PENNSYLVANIA 19103
Copyright 1972 by the Society for Industrial and Applied Mathematics. All rights reserved. Second Printing 1978 Third Printing 1980 Fourth Printing 1984 Fifth Printing 1989 Sixth Printing 1995
Printed for the Society for Industrial and Applied Mathematics by Capital City Press, Montpelier, Vermont
SIAM is a registered trademark.
Contents

1. Introduction
2. Coherence
3. Sampling-theory statistics
4. Basic ideas in Bayesian statistics
5. Sequential experimentation
6. Finite population, sampling theory
7. Robustness
8. Multiparameter problems
9. Tolerance regions and predictive distributions
10. Multinomial data
11. Asymptotic results
12.1. Empirical Bayes and multiple decision problems
12.2. Nonparametric statistics
12.3. Multivariate statistics
12.4. Invariance theories
12.5. Comparison of Bayesian and orthodox procedures
12.6. Information
12.7. Probability assessments
Bibliography
Preface

I was invited by the Statistics Department at Oregon State University to give ten lectures on Bayesian Statistics in July 1970. This monograph is a slightly expanded version of the content of those lectures. An adherent of the school of subjective probability might be forgiven for presenting a subjective view of the subject. Although I have tried to give a reasonably complete account of the present position in the study of statistical procedures that use the notion of a probability distribution over parameter space, both the emphasis and choice of topics reflect my own interests. I am most grateful to H. D. Brunk, Lyle D. Calvin and Don A. Pierce, who suggested the idea, and to the National Science Foundation for finance. The encouragement to put my knowledge into some reasonably tidy shape was most welcome.

London, October 1970
D. V. LINDLEY
Bayesian Statistics, A Review

D. V. Lindley

1. Introduction. The mathematical model that has been found convenient for most statistical problems contains a sample space X of elements x endowed with an appropriate σ-field of sets over which is given a family of probability measures. These measures are indexed by a quantity, θ, called a parameter, belonging to the parameter space Θ. The values x are referred to variously as the sample, observations or data. Notice that θ is merely an index for the various probabilities and that as a result this model includes nonparametric statistics, the special techniques in that field being necessitated by the complexity of the space Θ which, for example, may include all distributions on the real line (see § 12.2, below). For almost all problems it is sufficient to suppose that these probability measures are dominated by a σ-finite measure, so that they may be described through their density functions, p(x|θ), with respect to this measure, μ(x), in the sense that

(1.1)    P(A|θ) = ∫_A p(x|θ) dμ(x),
where A is any member of the σ-field and P(A|θ) is the probability of A according to the measure indexed by θ. In this review we shall always suppose this to be so and in (1.1) shall write simply dx for dμ(x). In practice the dominating measure will usually be either Lebesgue or counting measure; an exception arises towards the end of § 4. Furthermore, we shall not usually distinguish between a random variable and the values it takes. When it is necessary to do so we shall use the tilde notation and write x̃ for the variable. Distinction can then be made between P(x̃ < θ) and P(x < θ̃). This, admittedly rather sloppy, notation avoids expressions like p_x̃(x|θ), which are complicated for statistician and printer alike, and, in my experience, enables the meanings of statements to be more easily appreciated. It is worth remarking that this model is not used in a few branches of statistics. For example, some aspects of significance testing need only a single parameter value (or, more generally, a subset of Θ) called the null value, θ₀, and the densities p(x|θ₀), no reference being made to alternative values of θ. The early historical examples of significance tests based on tail-area considerations require only this, but, so far as I am aware, no attempt to formalize the intuitively sensible procedure has been successful. Stone's (1969a) comments are intriguing. Again the likelihood
principle, that will be discussed below, uses only the observed data x and makes no reference to the other elements of the sample space. Despite these qualifications the model described in the first two paragraphs is used in all formal analyses of statistical problems. Most descriptions go a stage further and introduce a decision space D of elements d and a nonnegative loss function L(d, θ) on D × Θ. For example, in estimation problems squared error loss is often assumed, whilst in hypothesis testing a zero-one loss function is used. The Bayesian argument extends the basic model in a different direction and supposes that Θ supports a σ-field and a probability measure over it. This supposition I shall take to be the defining property of what constitutes a Bayesian argument, and a Bayesian solution is one that uses such a distribution. Again it will be convenient to describe the measure through its density function p(θ) with respect to some dominating
measure, written simply dθ. Having introduced the single probability measure over the parameter space, a family of such measures, one for each data point x, can be generated by Bayes' theorem, which in density form reads

(1.2)    p(θ|x) = p(x|θ)p(θ)/p(x),

where

(1.3)    p(x) = ∫ p(x|θ)p(θ) dθ,
all the resulting distributions, p(θ|x), being dominated by the same measure. It is customary to refer to p(θ) as the prior density. This is unfortunate because the property of being prior refers not to a distribution but to a relationship between a distribution and x (or X). We shall therefore speak of p(θ) as describing the distribution (of θ) prior to x. Similarly p(θ|x) is the density posterior to x. If two pieces of data, x₁ and x₂, arise in sequence, then the distribution posterior to x₁ is prior to x₂: today's posterior is tomorrow's prior. For some problems the Bayesian argument goes further and, as above, uses a decision space, and a bounded utility function U(d, θ) over D × Θ which describes (in a sense to be discussed in detail below) the value of decision d when the parametric value is θ. A decision problem is then solved by maximizing the (posterior) expected utility

(1.4)    ∫ U(d, θ)p(θ|x) dθ.
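Since nothing in what follows depends on how these integrals are evaluated, a small computational sketch may help fix ideas. The Python fragment below tabulates (1.2)–(1.4) on a grid; the binomial likelihood, the Beta(2, 2) prior and the quadratic utility are illustrative assumptions, not examples taken from the text.

```python
# A minimal sketch of (1.2)-(1.4) on a discretized parameter space.
import numpy as np

theta = np.linspace(0.001, 0.999, 999)        # grid over the parameter space
dt = theta[1] - theta[0]

prior = theta * (1 - theta)                   # unnormalized Beta(2, 2) density
prior /= prior.sum() * dt

n, r = 20, 14                                 # assumed data: r successes in n trials
likelihood = theta**r * (1 - theta)**(n - r)

# Bayes' theorem (1.2); the normalizing denominator is p(x) of (1.3).
posterior = likelihood * prior
posterior /= posterior.sum() * dt

# Decision problem: maximize (1.4) with the assumed utility U(d, theta) = -(d - theta)^2.
decisions = np.linspace(0, 1, 1001)
exp_util = [(-(d - theta)**2 * posterior).sum() * dt for d in decisions]
d_star = decisions[int(np.argmax(exp_util))]
print(d_star, (theta * posterior).sum() * dt)  # both equal the posterior mean
```

With quadratic utility the maximizer of (1.4) is the posterior mean, which the grid search recovers; any other utility could be substituted without changing the machinery.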
The Bayesian solution to an inference problem is simply obtained by stating the (posterior) distribution given by (1.2), though if θ = (θ₁, θ₂) and only θ₁ is of interest, it will be enough to quote the (marginal) distribution of θ₁. (All this is supposing a given distribution prior to the data.) It should be noticed that these two recipes for solving decision and inferential problems are completely general within the framework described. No other procedure known to me has this degree of generality. The Bayesian argument provides a procedure which is applicable to
all problems within the (X, Θ, D) framework and does not require the introduction of ad hoc procedures for special situations.

The Bayesian argument is not, of course, new. In essentials it goes back to the original 1763 paper of Bayes (1958)¹ and has always commanded respect amongst statisticians and others. Even Karl Pearson, who seems wrongly to have acquired the image of being non-Bayesian, often argued within the Bayesian framework. Fisher, of course, did not, and the apparent success of his methods led to Bayesian methods being discredited. However, of recent years² a new element has entered the discussion, and that is the appreciation that, in a sense, the Bayesian approach is inevitable: that sensible inferences and decisions necessitate the introduction of distributions over Θ and utility functions. The next section of this review discusses this inevitability.

I have found, in discussion with colleagues, that the argument just referred to is often dismissed as being at too basic and philosophical a level to be of importance to practising statisticians. In the third section I therefore try to show that almost all statistical procedures are, at the operational level, unsound when judged against the general argument, and consequently that they should be abandoned as general methods. I know of no field where the foundations are of such practical importance as in statistics.

It is necessary to have a name to describe methods that are not Bayesian. The adjectives classical and orthodox have been suggested. Pratt (1965) proposes the term standard, and Box and Tiao (1968a) suggest, for a reason which will appear later, sampling-theory.

2. Coherence. The idea that leads to the adoption of Bayesian ideas is essentially the following. Suppose we had some inferential procedure that enabled us to say something about the value of θ that generated an observed x, and a decision-making procedure that selected a d; what properties would we require these to have? Reflection suggests that there are a few basic principles that neither of the procedures should violate, in the sense that any violation, once exposed, would be held to be ridiculous and the procedures changed. Consequently these principles can be taken as axioms, and deductions made from them in the usual mathematical way. Notice that these axioms are essentially simple and difficult to dispute. Naturally there is some choice over an axiom system, but all those suggested to date have the consequences that there exists a distribution over Θ, there exists a utility function, and that the decision should be selected by maximizing expected utility, (1.4). Notice that this type of argument runs counter to that common in much statistical work, where a procedure (such as confidence intervals) is first proposed and then its properties investigated: it asks, in reverse order, what properties are required and then finds what procedures have these properties.

¹ In cases where a reference has appeared in several editions the latest English language one has usually been quoted.
² In preparing this review I have not attempted to carry out any historical research and therefore my few remarks about the early work are not necessarily reliable. For example, Ramsey, in the important work described below, quotes a remark of Donkin, and A. P. Dempster has mentioned Donkin's work to me. It may be that the idea, that apparently originated with Ramsey, had earlier manifestations.
An ordered pair, (d, θ), written c, will be called a consequence, and generally we shall refer to a space C of consequences c, some of which are identifiable with a pair (d, θ). Let E be any event of Θ (that is, E is any set in the σ-field over Θ) and let c₁, c₂ be any two consequences. Consider a situation in which a decision-maker (which may conveniently be thought of as "you") will receive c₁ if θ ∈ E and c₂ otherwise. This is called variously a lottery, gamble or option and is written (c₁E, c₂Ē), Ē denoting the complement of E. Generally we have lotteries (c₁E₁, c₂E₂, ⋯, cₙEₙ), where the events Eᵢ are exclusive and exhaustive and the cᵢ are any consequences: cᵢ being received if θ ∈ Eᵢ. The arguments to be presented all hinge on the comparison of gambles. The importance of these comparisons in the statistical situation of § 1 is that the selection of a decision d is equivalent to the choice of a gamble which will give consequence c = (d, θ) if θ obtains. (Here there is a generalization to gambles with a possibly infinite range of consequences.)

The first discussion of this dates from 1926 and is due to Ramsey (1964). He only provided an outline of the proofs, remarking that a full exposition would "be like carrying a result to seven decimal places when it is only valid to two." Good suggests he did this because he underestimated the magnitude of the opposition. Another explanation is that Ramsey was simply acting in the spirit of British applied mathematics of the '20's, which revered ideas above rigor. Today the pendulum has swung in the opposite direction and rigor and generality are respected (and original ideas sometimes denigrated?). In a way this was a pity because we now have practical situations in which some or all of X, D and Θ are complicated and it would be desirable to know just what limitations on them are necessary for the Bayesian approach to be valid. Savage's discussion (1954), which is perhaps that best-known to statisticians, is particularly valuable in that respect because the details are spelt out and careful consideration given to exactly what properties a probability has. (We have supposed it σ-additive with p(X|θ) = 1.) Savage's (1961) later article contains some clear thinking about the foundations.

We now summarize Ramsey's argument. He begins by supposing the existence of an ethically neutral proposition π, in the sense that two consequences differing only in that in one π is true, and in the other it is not true, are equally desirable. Furthermore, he supposes the existence of such a π in which you have degree of belief ½, in the sense that (c₁π, c₂π̄) ≡ (c₁π̄, c₂π) for any c₁, c₂. It is also necessary to suppose that all such π are equivalent, that is, (c₁π₁, c₂π̄₁) ≡ (c₁π₂, c₂π̄₂). An example of such a proposition is "this coin will fall heads uppermost when tossed." This would have the required properties if the coin was judged fair. Ethically neutral propositions of degree of belief ½ are often used in practice: for example, a coin is tossed to decide who shall begin the play of a game. Essentially it is introduced by Ramsey as a standard by which other degrees of belief can be judged. Whenever measurement is contemplated some standard of reference seems essential. His second assumption is, to my mind, the most important one in that it contains the key idea underlying all the different axiom systems.
He supposes that lotteries (or options, to use his term) may be compared in the sense that, for any two options, l₁ and l₂, either l₁ is preferred to l₂, l₂ to l₁, or l₁ and l₂ are judged equivalent.
In mathematical terms there is a total ordering of lotteries, written l₁ ≻ l₂ and read "l₁ is preferred to l₂." Furthermore, it is supposed that this ordering is transitive; that is, l₁ ≻ l₂ and l₂ ≻ l₃ implies l₁ ≻ l₃. If decision-making is to be operational, then such an ordering is essential; transitivity is a modest requirement whose violation has peculiar and obviously undesirable properties. Ramsey introduces a third set of assumptions which are mathematical in form and include, for example, axioms of continuity, an Archimedean axiom and axioms of uniqueness. These are needed to extend degrees of belief to the whole interval [0, 1] in a unique way.

From these assumptions Ramsey deduces the existence of a utility (or value) function for consequences, denoted by u(c). Thus if π is an ethically neutral proposition of belief ½, and c ≡ (c₁π, c₂π̄), then u(c) = ½u(c₁) + ½u(c₂), and generally, by the mathematical axioms, the utility of a lottery is the expectation of the utilities of its consequences. With such a value function it is easy to define a degree of belief in E, given F, where E and F are two events of Θ, by comparing the options (c₁F, c₂F̄) and (c₃EF, c₄ĒF, c₂F̄). If these are equivalent, then the required belief is defined to be [u(c₁) − u(c₄)]/[u(c₃) − u(c₄)], and hence the usual laws of probability are obtained. With both probabilities and utilities derived, the Bayesian argument as outlined in the introduction is complete. If inference statements are interpreted as statements of degrees of belief, the discussion demonstrates that the beliefs must satisfy the probability axioms. Utility is a function of consequences whose expectation with respect to such probabilities determines which decision to take.

Ramsey's work seems to have attracted little attention until Savage (1954) drew attention to it and used a related type of argument (which considered only utility) due to von Neumann (see the appendix to von Neumann and Morgenstern (1947)). One of the simplest and most satisfactory derivations of the Bayesian results from an axiom system is that of Pratt et al. (1964). This is completely rigorous but does have the disadvantage that D and Θ are both supposed finite. It improves on a related type of argument due to Anscombe and Aumann (1963). These authors introduce the idea of a canonical experiment, which is in effect an assumption that there exist random variables u and v which are independent and uniform over [0, 1]. This is similar to Ramsey's assumption about a proposition of belief ½, plus some mathematical axioms. Again this is essentially a standard to which other statements of uncertainty can be compared. They also, like Ramsey, assume lotteries are comparable and that this comparability is transitive. A consequence is a special lottery in which the result is certain, so that C may be ordered and, being finite, has a worst, c∗, and a best, c*, consequence. Any other consequence can be scaled by a unique value u such that c ≡ (c*u, c∗(1 − u)). This value u, written now u(c), is the utility of c. (Here the notation has been abused slightly by replacing an event E of probability u by u itself.) A further assumption is made that lotteries are substitutable. This means that if c ≡ (c₁E, c₂Ē), then (cF, c₃F̄) ≡ (c₁EF, c₂ĒF, c₃F̄); that is, the gamble equivalent to c may be substituted for c in another gamble. (The notation is designed to demonstrate the similarity to Ramsey's device.) From these it is a short step to the derivation of probabilities and the principle of maximization of expected utility.
In their careful discussion they point out that an additional assumption, albeit an obvious and acceptable one, is needed to see how the comparison of lotteries, given additional data, may be effected in terms of an original comparison. Let A be any set in the σ-field over X and let l₁ and l₂ be any two lotteries. Then l₁ ≻ l₂, given A, if l₁ ≻ l₂, provided both lotteries are to be called off if Ā occurs. In other words, you can effect the comparison of l₁ and l₂ either before or after A provided that, in the former case, no change takes place if A does not occur. This leads immediately to Bayes' result.

A recent exposition at text-book level has been provided by De Groot (1970), based on the work of Villegas (1964). He develops an axiom system first for probabilities and then for utilities. He points out that a set of axioms which might appear to be enough are not in fact adequate to derive probabilities, and he adds the assumption of the existence of a random variable uniformly distributed in [0, 1] to complete the argument. The point is discussed in detail by Kraft et al. (1959), who demonstrated that a conjecture of de Finetti was wrong. Fishburn (1969b) argues that it is sometimes difficult to defend the transitivity assumption and explores the possibility of proceeding without it; the result is that only qualitative "probabilities" are obtained.

De Finetti (1964), unaware of Ramsey's work, produced in the mid-1930's an argument which is different in spirit from those so far discussed. Let E₁, E₂, ⋯, Eₙ be n exclusive and exhaustive events held with beliefs p₁, p₂, ⋯, pₙ. These are not yet probabilities but merely numerical measures of belief derived from the consideration that the gambles [(xᵢ/pᵢ − xᵢ)Eᵢ, −xᵢĒᵢ] are all equivalent, at least for reasonable values of xᵢ. Here xᵢ is a stake which returns a prize, xᵢ/pᵢ, if Eᵢ obtains but is otherwise lost, the expectation (xᵢ/pᵢ − xᵢ)pᵢ − xᵢ(1 − pᵢ) being zero. Suppose now that a gambler puts stakes xᵢ on Eᵢ, i = 1, 2, ⋯, n. If Eᵢ occurs he will win the amount xᵢ/pᵢ − Σ_{h=1}^n x_h = gᵢ, say. Considered as linear equations in the stakes, these will have a solution unless the determinant is zero. Consequently, unless this happens, we could choose gᵢ > 0 for all i and determine stakes that would be certain to win (not just expected to win) whatever event occurred. Hence the determinant must be zero, and this easily gives Σ_{h=1}^n p_h = 1. This justifies the addition rule for beliefs; the notion of called-off bets enables the multiplication rule to be derived. We shall see later (§ 8) that de Finetti's notion of a successful gambling system can be used with advantage to criticize some orthodox statistical procedures. His arguments do, however, suffer from the disadvantage that by introducing stakes there is some confusion with utility ideas. A nice treatment has recently been provided by Freedman and Purves (1969), who establish that bookies must be Bayesians. De Finetti's other important argument, concerned with exchangeability, will be discussed later (§ 6).
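De Finetti's argument lends itself to direct computation. The sketch below (a toy setup of my own, not an example from the text) takes announced beliefs summing to less than one and solves the linear equations above for stakes that win whatever happens.

```python
# A numerical sketch of de Finetti's sure-win argument: if the beliefs p_i on
# n exclusive and exhaustive events do not sum to one, the stakes x_i can be
# chosen so that the payoff g_i = x_i / p_i - sum_h x_h is positive for every i.
import numpy as np

p = np.array([0.3, 0.3, 0.3])        # incoherent beliefs: they sum to 0.9, not 1
n = len(p)

# Require g_i = 1 for every i and solve the linear system A x = g for the stakes.
A = np.diag(1.0 / p) - np.ones((n, n))
g = np.ones(n)
x = np.linalg.solve(A, g)            # solvable precisely because sum(p) != 1

payoffs = x / p - x.sum()            # the gambler's payoff under each event
print(x, payoffs)                    # every payoff is +1: a sure win
```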
A decision-maker whose actions agree with these axioms has been variously described as rational, consistent or coherent. We shall use the last term because it effectively captures the idea that the basic principle behind the axioms is that our judgements should fit together, or cohere. The axioms do not refer to single decisions or inferences but to the way in which separate ones cohere, for example, in the transitivity requirement. The concept of coherence has been discussed recently, within the framework of modern statistics, in a particularly illuminating article by Cornfield (1969). A justification for the Bayesian approach of an unusual type, that might appeal to an orthodox statistician, has been provided by Shubert (1969).

A statistical tradition closely related to, but much weaker than, coherence derives from the work of Wald (1950). Unlike the expositions of Ramsey, de Finetti and others, this is expressed in terms of the (X, D, Θ) model discussed in the introduction. A loss function L(d, θ) is used, together with the notion of a decision function δ which maps X onto D, δ(x) being the decision taken if x is observed. The risk function for δ is defined as

(2.1)    R(δ, θ) = ∫ L(δ(x), θ)p(x|θ) dx.
δ is said to be inadmissible if there exists another decision function δ* with R(δ*, θ) ≤ R(δ, θ) for all θ, with strict inequality for some θ; otherwise it is admissible. Wald's major result can be summarized by saying that he proved that δ is admissible only if it is a Bayes solution for some prior density p(θ), though the notion of a probability measure has to be extended somewhat to make this statement rigorous: specifically, improper priors, that is, those for which ∫ p(θ) dθ diverges, have to be included. Improper priors will be discussed below (§ 8). Wald's argument is considerably weaker than the others that have been discussed, mainly because it assumes the existence of a loss function. (On the other hand the mathematical treatment is commendably complete, though it has been criticized by Stein.) In the coherence arguments, by contrast, a utility function is derived in such a way that its expectation is the inevitable and sole criterion by which a decision should be judged. It is by no means obvious that such a function should exist, and the precise meaning of a loss function is obscure.
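The risk function (2.1) and the notion of inadmissibility are easily made concrete. The sketch below uses an invented binomial example (not Wald's): under squared-error loss, the estimator 1.2 · r/n of a binomial θ is dominated by r/n and is therefore inadmissible.

```python
# Exact risk functions (2.1) for two estimators of a binomial theta, and a
# numerical check that one dominates the other.
import numpy as np
from math import comb

n = 10
r = np.arange(n + 1)

def risk(estimate, theta):
    # exact risk E[(estimate(r) - theta)^2 | theta], summing over all r
    probs = np.array([comb(n, k) for k in r]) * theta**r * (1 - theta)**(n - r)
    return float(np.sum(probs * (estimate(r) - theta)**2))

thetas = np.linspace(0, 1, 101)
risk_good = np.array([risk(lambda k: k / n, t) for t in thetas])
risk_bad = np.array([risk(lambda k: 1.2 * k / n, t) for t in thetas])
print(bool(np.all(risk_bad >= risk_good)))   # True: r/n dominates 1.2 * r/n
```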
In applications it typically seems true that

(2.2)    L(d, θ) = sup_{d′} U(d′, θ) − U(d, θ),

and we shall regard it as such. It is often useful to note that in the Bayesian solution of a decision problem it is always permissible to subtract from U(d, θ) any convenient function of θ without affecting the result. This is clear from considering
∫ U(d, θ)p(θ|x) dθ, which will only have a quantity not involving d subtracted from it. The principle applies to (2.2).

The results described in this section can be summarized by saying that any reasonable consideration of the way in which decisions or inferences cohere leads to the existence of p(θ), of U(d, θ) and of the principle of maximization of expected utility. This has been rigorously demonstrated when D and Θ are finite. Savage's work deals with more complicated spaces, but there still remain some points, apparently of detail, but possibly of practical importance, that remain unclear. Presumably the utility function is bounded, since otherwise paradoxes of the St. Petersburg type arise. It is not quite so clear whether the probability measure
should be σ-additive, as we have required, or whether it is enough for it to be finitely additive. A recent general approach is that of Fishburn (1969a). He provides a set of axioms, which includes the idea of a canonical experiment (under the name of extraneous measurement probabilities), that places no real restrictions on C and Θ and establishes the existence of a utility function, a finitely-additive probability measure and the principle of maximization of expected utility. A mild restriction on the probability suffices to make the utility bounded. This conclusion of finite-additivity agrees with de Finetti, but the situation is unclear to me. We shall see below (§ 12.4) that requirements of invariance, that it seems sensible to impose on some statistical problems, would imply the use of improper probability distributions, but that these can cause difficulties. What does seem clear is that the use of a bounded utility function and a proper σ-additive density cannot lead to difficulties. Some coherent decisions and inferences may be possible outside these limits.

It should be particularly noticed, since this affects the use of the ideas, that the arguments establish the existence of a distribution over Θ. One often reads statements in the literature to the effect that "a prior distribution does not exist in this problem." Within the framework of coherence this is demonstrably not true. However much a person may rebel at the thought of it, the fact remains that if his statements are not to be found incoherent, then they will be interpretable in terms of such a prior (to misuse the adjective).

The probability that the axiom system imposes is to be interpreted as a subjective probability possessed by the decision-maker, "you," whose judgements cohere. P(A|B) is the degree of belief you have in A, given B. It should not be confused with the so-called objective probability derived from long-term frequency considerations. For example, suppose we have a coin that is judged to be fair; then the subjective probability for heads will be ½, but on repeated tosses the frequency may be demonstrably not near 0.5. The relationship between the two ideas is explained by de Finetti's notion of exchangeability, to be discussed later (§ 6). The view of probability that emerges from these axiomatic considerations is entirely subjective, and the attitude will be adopted in this review that all probabilities are to be so interpreted.

Objections to this attitude are numerous, but none that I am aware of have gone to the axioms and criticized those. Indeed, it is hard to see how such criticism could be sustained, since the requirements imposed by coherence are so modest. An excellent discussion of Bayesian ideas, Savage et al. (1962), includes contributions from speakers with widely differing viewpoints, though, to me, the eight years that have elapsed since then make much of it seem dated. An excellent, up-to-date critique by one of the contributors is Bartlett (1967). The objections are usually at a nonmathematical level. A common one is that expressed by Le Cam in Barnard et al. (1968), who argues that the results are personalistic and therefore unsuitable for science, which is objective. To reply to this, notice that the theory deals with a single decision-maker whom we have called "you," but equally it could be a firm or even a government.
If science were really objective, then presumably the results could be described as those held by the scientific community, but surely the scientific community should be just as coherent as a single
individual scientist. If so, the scientific community would act as if it had a prior and a utility. In fact science is not objective, as any practising scientist must realize, simply because scientists do not and could not perform as a single decision-maker. The theory does not deal with two or more decision-makers, and does not say how people's ideas should be handled when disagreement exists. It is unreasonable to criticize a theory for not doing what it did not set out to do. My view is that a major gap in our knowledge is the lack of an adequate theory of conflict. Game theory, which only applies to the two-person zero-sum game, and then only to the equilibrium strategy, is not enough. A game should be played to maximize one's expected utility, with the expectation based on one's assessment of the opponent's strategy: thus one should not minimax against an inexperienced player. Another, though weaker, reply to Le Cam's criticism is that the orthodox methods are also personalistic. Thus in Lehmann's (1959) book there is a discussion of the choice of a risk function on intuitive grounds. This will be considered in § 7.

Dempster, in a series of papers, a convenient reference being (1968), and Smith (1961, 1965) have made constructive criticisms, concerned particularly with the "firmness" with which a probability statement may be held, and have suggested that a single probability statement over Θ be replaced by upper and lower probabilities; only if these were equal would an ordinary probability obtain. Smith's theory is not developed at a formal level but Dempster's is, and Aitchison, in the discussion to the paper just referred to, presented the following criticism. Let X = {x₁, x₂, x₃}, Θ = {θ₁, θ₂}, and let the probabilities p(xᵢ|θⱼ) be as in the table
[table of the p(xᵢ|θⱼ) not recoverable from this copy]

and suppose p∗(θ₁) = 0, p*(θ₁) = 1, where the asterisks denote Dempster's lower and upper probabilities by their positions. Then calculations show that the posterior values p∗(θ₁|x₁) and p*(θ₁|x₁) move away from 0 and 1; yet intuition suggests that x₁ gives no information about whether θ₁ or θ₂ is true.

The coherence arguments provide a complete description of the decision problem in terms of (X, D, Θ, p(x|θ), p(θ), U(d, θ)), the laws of probability and the principle of maximization of expected utility, and the formal framework is there for the resolution of any decision situation. The view will be taken in this review that the inference problem is similarly described in terms of (X, Θ, p(θ), p(x|θ)) and solved by calculating p(θ|x) or some margin thereof. Objections to this last resolution have been made on the grounds that inference is not to be confused with decision-making and that our coherence ideas deal with this latter problem. This is not strictly true, since the coherence argument can be applied directly to the events of Θ (see De Groot or de Finetti). My view is that the purpose of an inference is to enable decision problems to be solved using the data upon which the inference is based, though at the time at which the inference is made no decisions may be envisaged. If this is correct, then the posterior must be quoted, since it alone is needed for any decision situation. To quote Ramsey (1964): "A lump
of arsenic is called poisonous not because it actually has killed or will kill anyone, but because it would kill anyone if he ate it." A different distinction between inference and decision-making has been presented by Blyth (1970), without reference to the ideas described in this section. Takeuchi (1970) gives a Bayesian reply.

3. Sampling-theory statistics. In this section the implications of the coherence argument for present-day (orthodox) statistics are discussed. The bulk of the material consists of a series of counterexamples designed to demonstrate the incoherence of most statistical procedures.

One immediate deduction from the coherence ideas is the likelihood principle. This says that if x₁, x₂ are two data sets with the same likelihood function apart from a multiplicative constant (that is, p(x₁|θ) = kp(x₂|θ) for all θ ∈ Θ, where k does not depend on θ), then inferences and decisions should be identical for x₁ and x₂. This principle can be defended with its own axiom system: see, for example, Birnbaum (1962) and Barnard et al. (1962). Further discussion of the principle of conditionality used by Birnbaum in his derivation has been given by Durbin (1970), and replied to by Savage (1970) and Birnbaum (1970) (see also Hartigan (1967) and § 12.4 below). The principle follows from the Bayesian argument since equality of the likelihoods implies p(θ|x₁) = p(θ|x₂) for all θ. It is surprising that many statistical methods violate the principle; indeed, all methods that necessitate reference to some property of X other than the observed x do so. For example, the requirement that an estimate t(x) be unbiased, that is,

(3.1)    ∫ t(x)p(x|θ) dx = θ
for all θ, violates the principle, since t(x) will typically depend on X through the integration involved in (3.1). A simple, oft-quoted example is interesting. Consider a sequence of binomial trials; that is,³ x = (x₁, x₂, ⋯, xₙ), where, given θ, a real number, the xᵢ, all zero or one, are independent with p(xᵢ = 1|θ) = θ. Then if X consists of all such sequences of length n, the only reasonable unbiased estimate of θ is r/n, where r = Σ_{i=1}^n xᵢ. On the other hand, if X consists of all such sequences with fixed r (inverse binomial sampling), the equivalent estimate of θ is (r − 1)/(n − 1). Yet, in both cases, the likelihood function is θʳ(1 − θ)^{n−r}.

³ In describing X or x it will often be convenient to use bold fount, X or x, and reserve italic face for elements of the description.
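The example is easily verified numerically. In the sketch below (with an illustrative uniform grid prior) the two sampling rules give posteriors agreeing to machine precision, while the two unbiased estimates disagree.

```python
# Binomial versus inverse-binomial sampling: proportional likelihoods, hence
# identical posteriors for any prior, but different unbiased estimates.
import numpy as np
from math import comb

n, r = 12, 3
theta = np.linspace(0.001, 0.999, 999)

lik_binomial = comb(n, r) * theta**r * (1 - theta)**(n - r)          # n fixed
lik_inverse = comb(n - 1, r - 1) * theta**r * (1 - theta)**(n - r)   # r fixed

prior = np.ones_like(theta)                 # any prior will do; uniform for brevity
post1 = lik_binomial * prior
post2 = lik_inverse * prior
post1 /= post1.sum()
post2 /= post2.sum()
print(bool(np.allclose(post1, post2)))      # True: the constants cancel
print(r / n, (r - 1) / (n - 1))             # 0.25 versus 0.1818...: the unbiased estimates
```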
Since many statistical procedures utilize the structure of X, the specification of X constitutes a problem for the orthodox statistician. (It is because of this reference to the sample space that Box has introduced the adjective "sampling-theory.") Consider the following practical example due to Edwards (1970). In a mathematical model of the mutations that have produced the present distribution of blood groups in the human population of the world, it is required to estimate, inter alia, θ, the mutation rate. The data x are the numbers with blood of each group. Analysis seems possible at an intuitive level, but what is X? Realization of other possible worlds seems rather strained. My own view is that the orthodox statistician's choice of X has an arbitrariness about it comparable with the arbitrariness in p(θ) of which the Bayesian is often accused.

Our first set of examples will therefore deal with the choice of X. A statistic t(x) (that is, some function of x) is called ancillary if its probability distribution, derived from that of x, does not depend on θ. Sometimes the additional requirement is added that t, when combined with the maximum likelihood estimate, should be sufficient⁴ (see below). The suggestion is often made to make inferences conditional on the observed value of an ancillary statistic. That is, if x₀ is the observed data, restrict X to all x such that t(x) = t(x₀). A standard example is bivariate regression, where x = (x, y), with θ the set of regression parameters; then t(x) = x is ancillary and it is common practice to regard the independent (or regressor) variable x as fixed. The general practice is obvious from the Bayesian viewpoint since

p(x, y|θ) = p(y|x, θ)p(x)

and the two likelihoods are proportional.

⁴ Durbin (1969) has given an example where a natural ancillary is not part of the sufficient statistic. Here x = (x₁, x₂), and x₁ = 0 or 1 with equal probabilities. If x₁ = 0, x₂ is the result of n binomial trials (see above); if x₁ = 1, x₂ is the result of r inverse binomial trials, n and r both having known (that is, not involving θ) distributions. Then x₁ is ancillary but not part of x₂, the sufficient statistic.

Our first example concerns a case where it seems natural to condition on an ancillary and yet the resulting procedures do not have the usual optimum sampling-theory properties. Here (Cox (1958), Basu (1964)) x = (x₁, x₂); x₁ = 0 or 1 with equal probabilities; if x₁ = 0, then⁵ x₂ ~ N(θ, σ₀²); if x₁ = 1, then x₂ ~ N(θ, σ₁²), with σ₁ ≫ σ₀. (θ is measured either by a precise apparatus (σ₀) or an imprecise one (σ₁), the choice of apparatus being decided by the flip of a coin.) Clearly x₁ is ancillary, and yet it can be shown that tests based on restricting X to the observed value of x₁ are not the best possible in the Neyman-Pearson sense (Cornfield (1969)). Even the standard error of x₂, the natural estimate of θ, is unclear, since the computation of a standard error involves X (Buehler (1959)). A variant of this example is to let x₁ be an integer and to consider x₂ ~ N(θ, σ²/n) with n = x₁. Generalizations to include various distributions for n have been discussed by Cohen (1958). Durbin (1969) shows that either the tests with n held fixed, or the unrestricted tests, can be uniformly most powerful depending on the situation, at least asymptotically.

⁵ The relation "~" is to be read: "is distributed as." N(μ, σ²) refers to the normal distribution of mean μ and variance σ².

The most complete study of ancillarity has been made by Basu (1964), and his beautiful counterexamples are worth repeating. Let x be uniformly distributed in the (real) interval [θ, 1 + θ). Then it is easy to see that the fractional part of x is ancillary; in fact, it is uniformly distributed in [0, 1). If one were to condition on it, then x, given the fractional part, has a one-point distribution with rather limited distributional properties!
A second example demonstrates the difficulty that ancillary statistics are typically not unique, and consequently it is not clear which one to condition on.⁶ The following table lists in the first row the six elements of X; the second provides the relevant densities for each θ, −1 ≤ θ ≤ 1; the third and fourth give the values of two ancillary statistics. (The reader will be able easily to construct for himself four other ancillaries.)

    x            1      2      3      4      5      6
    12 p(x|θ)    1−θ    2−θ    3−θ    1+θ    2+θ    3+θ
    t₁           1      2      3      1      2      3
    t₂           1      2      3      2      3      1

To illustrate the difficulty suppose the data x = 5 is observed. If t₁ is used as the ancillary statistic, then the maximum likelihood estimate (here θ̂ = 1) has a distribution on {−1, 1} with probabilities [(2 − θ)/4, (2 + θ)/4]. If t₂ is used the corresponding distribution is quite different, namely [(3 − θ)/5, (2 + θ)/5]. The choice of ancillary, and generally the choice of sample space, presents a major difficulty in orthodox statistics. This difficulty is, from a Bayesian viewpoint, inevitable, since the use of X violates the likelihood principle and is therefore incoherent. An attempt to avoid the difficulty has been made by Fraser (1968 and earlier papers referred to therein), who argues that the model we have used is inadequate and omits certain important requirements. When these are inserted the ancillary is unique and inferences can proceed.⁷ Fraser's work will be discussed below (§ 8).

⁶ The concept of a maximal ancillary, analogous to a minimal sufficient statistic, does not seem to be realizable.
⁷ He does not use this language, but his restriction to orbits is mathematically equivalent to the choice of an ancillary.
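Assuming the table as reconstructed above (chosen to agree with the two conditional distributions just quoted), the following sketch verifies both ancillarity claims and reproduces those distributions.

```python
# Check that t1 and t2 are both ancillary, yet condition the m.l.e. differently.
def p(x, theta):
    # densities 12 p(x|theta) = 1-t, 2-t, 3-t, 1+t, 2+t, 3+t for x = 1..6
    return [1 - theta, 2 - theta, 3 - theta,
            1 + theta, 2 + theta, 3 + theta][x - 1] / 12.0

t1 = {1: 1, 2: 2, 3: 3, 4: 1, 5: 2, 6: 3}   # groups {1,4}, {2,5}, {3,6}
t2 = {1: 1, 2: 2, 3: 3, 4: 2, 5: 3, 6: 1}   # groups {1,6}, {2,4}, {3,5}

for t in (t1, t2):                           # ancillarity: group probabilities
    for g in set(t.values()):                # must not depend on theta
        probs = [sum(p(x, th) for x in t if t[x] == g)
                 for th in (-1.0, -0.5, 0.5, 1.0)]
        assert max(probs) - min(probs) < 1e-12

theta = 0.4                                  # any value in [-1, 1] will do
for name, t in (("t1", t1), ("t2", t2)):
    group = [x for x in t if t[x] == t[5]]   # condition on the group containing x = 5
    total = sum(p(x, theta) for x in group)
    # the m.l.e. is -1 for x <= 3 and +1 for x >= 4
    print(name, [(-1 if x <= 3 else 1, round(p(x, theta) / total, 3))
                 for x in group])
# t1 gives ((2 - theta)/4, (2 + theta)/4); t2 gives ((3 - theta)/5, (2 + theta)/5)
```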
Closely related to the likelihood principle is the method of maximum likelihood. Except in a detail to be mentioned in connection with the asymptotic theory, this does not violate the principle, but nevertheless it can give rise to difficulties. The following example, due to Kiefer and Wolfowitz (1956), is elegant and occurs in practice. Let x = (x₁, x₂, ⋯, xₙ) be a random sample of size n from the density

p(x|θ) = ½(2π)^{−1/2} exp{−(x − μ)²/2} + ½(2πσ²)^{−1/2} exp{−(x − μ)²/2σ²},

where θ = (μ, σ²). (x is either N(μ, 1) or N(μ, σ²), each possibility being equally likely, but, unlike Cox's example, we do not know which.) Let μ = xᵢ; then the likelihood tends to infinity as σ → 0, and this for all i = 1, 2, ⋯, n. Hence there is no maximum⁸ in a strict sense, or there are n of them in a loose sense. Again suppose x = (x₁ᵢ, x₂ᵢ; i = 1, 2, ⋯, n) with xₜᵢ ~ N(μᵢ, σ²), θ = (μ₁, μ₂, ⋯, μₙ, σ²), all xₜᵢ being independent, given θ. (Pairs of measurements of equal precision are made on each μᵢ.) The maximum likelihood (m.l.) estimate of σ² is Σ(x₁ᵢ − x₂ᵢ)²/4n and converges in probability to ½σ² as n → ∞, which is hardly satisfactory.

⁸ A distribution for σ (convergent as σ → 0) would resolve the difficulty and typically there would be a unique mode for the distribution of θ posterior to x.
vant" so that using only dt = xli; — x2i and writing down the likelihood for this, the new m.l. estimate is I.df/2n which does tend to a2. This type of argument is typical of the ad hoc procedures that orthodox statisticians have to resort to in default of the Bayesian argument. A systematic study of this particular form, and a serious attempt to remove the improvization element has been made in their studies of marginal likelihoods by Kalbfleisch and Sprott (1970). A criticism of the argument that inferences should be based on the likelihood alone (and not in conjunction with the prior) will be postponed until §6 when sampling from a finite population is discussed. The phenomenon displayed in the last example is typical of what happens when incidental parameters (like the ji's) appear. A more extreme case arises when fitting a straight line with both variables subject to error. (The model is described in § 7 below.) There it was thought for a long time that the m.l. estimate of the slope was equal to the ratio of the m.l. estimates of the two standard deviations, an absurd situation. In fact Solari (1969) has shown th&t this supposed maximum is only a saddle point. The likelihood function has essential singularities and the likelihood can be made to approach any value between plus and minus infinity in any neighborhood of such points. Continuing with counterexamples, we turn to the topic of significance tests, a branch of statistics which has a more completely developed formal theory than most others (see, for example, Lehmann (1959)). We begin with the test of a simple null hypothesis against a simple alternative where the orthodox theory is most complete. That theory uses a and ft, the errors of the two kinds. Formally, with arbitrary X, 0 = (ftx, 82), D = ( d 1 , d2) and L(dt, Oj) = 0, if i = ;; 1, if i ^ j. Then for a decision function (test) <5, a = P(d — d2\0l), ft = P(d = d]\92)- Again the use of a and ft violates the likelihood principle. The incoherence of proposed procedures is most easily studied through a diagram with the values of a and ft as axes.
The usual method fixes α and selects a test that minimizes β, but does not say how α is to be chosen. One suggestion (Lehmann (1958)) is to specify a series of indifference curves in the (α, β)-diagram and to use these to choose an optimum test. Coherence easily establishes that these "curves" must be a series of parallel
straight lines. Let P, Q (see Fig. 1) be two points on the same indifference curve, so that the tests corresponding to P and Q are judged equivalent. Then the associated gambles have the same expected utility, which will be the same as that of any test corresponding to a point on the line segment joining P and Q, since such a point may be reached by a gamble on P and Q. (For example, the toss of a fair coin will give the midpoint.) Hence the indifference curves must be straight lines. To establish that the lines must be parallel, a similar substitutability argument is used with P and the origin O to obtain R, and Q and O to obtain S. The same mixing probability in the two cases gives RS parallel to PQ. Hence the only coherent way to select a test is to select the slope of this set of lines. For a zero-one loss function this is easily shown to be equivalent to the choice of P(θ₁)/P(θ₂), the odds prior to x.

An alternative method is to select a curve passing through the origin and to select amongst available tests that one on the curve nearest to the origin. If the curve is a straight line at 45° to the α-axis this is equivalent to the minimax method, which puts α = β. (In general the minimax method advocates selecting that δ which minimizes max_θ R(δ, θ); see equation (2.1).) This procedure is easily shown to be incoherent (see Fig. 2) for, using the notation for the comparison of lotteries, if the available "good" tests have (α, β)-values on BAE (as is easily shown to be possible), A ≻ B. Similarly, using CAN, A ≻ C. But by the substitutability argument, if B ≻ C, then B ≻ D, whence A ≻ D, which is absurd since D has both α and β smaller than A. (If C ≻ B, then C ≻ D and the same result follows.) Despite this simple demonstration of the unsoundness of the minimax method,⁹ it continues to be used; I personally find this hard to understand. A slightly modified version of this argument shows that a person who sticks to a fixed value of α (say 5%) is also incoherent. The approach used in this discussion was developed by L. J. Savage and the author in 1955.
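The geometry is easily checked numerically. In the sketch below (with invented error rates and prior probabilities), mixtures of two tests trace the straight segment joining their points in the (α, β)-diagram, and the expected zero-one loss p₁α + p₂β is linear along it, so equivalence along parallel straight lines follows.

```python
# Randomizing between two tests traces a straight segment in the (alpha, beta)
# diagram; with zero-one loss the expected loss p1*alpha + p2*beta is constant
# exactly on straight lines of slope -p1/p2.
import numpy as np

p1, p2 = 0.4, 0.6                    # assumed prior probabilities of theta1, theta2
P = np.array([0.05, 0.60])           # (alpha, beta) of test P
Q = np.array([0.20, 0.50])           # (alpha, beta) of test Q

def expected_loss(t):
    return p1 * t[0] + p2 * t[1]

for gamma in (0.0, 0.25, 0.5, 0.75, 1.0):
    mix = gamma * P + (1 - gamma) * Q      # choose P with probability gamma
    print(mix, expected_loss(mix))
# The losses interpolate linearly, so if P and Q are judged equivalent every
# mixture of them is too: the indifference "curves" are parallel straight lines.
```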
⁹ Another counterargument to minimax lies in its violation of the principle of the irrelevant alternative. This principle says that if l₁ is the better of (l₁, l₂), it should not happen that l₂ is the best of (l₁, l₂, l₃): l₃ is irrelevant to the choice between l₁ and l₂. But this can happen with minimax procedures (Savage (1954)). Of course, it cannot occur when expected utility is maximized.
The extension of these ideas from two to a general finite number of hypotheses, and the consequent increase in dimensionality of the figures, can be used to show that it can be incoherent to use similar regions: that is, to choose tests from amongst those which have P(δ = d₂|θᵢ) = α for all θᵢ for which L(d₂, θᵢ) = 0, α being a preassigned constant. Analogous difficulties arise in finite discrimination and classification problems (Lindley (1953)). In general there is a real danger of incoherence whenever a decision is selected amongst a limited class, rather than by maximizing the expected utility. Constrained optimization can be unsatisfactory. (In minimax the constraint is α = β; unbiased estimation, to be discussed below, provides another example.)

An alternative model for significance tests within the Bayesian framework has been proposed by Jeffreys¹⁰ (1967). If accepted it shows that the conventional significance test can do exactly the opposite of what it is intended to do. We consider only the case where Θ is the real line and we wish to test the null hypothesis that θ = θ₀ against the alternative that θ ≠ θ₀. Jeffreys' suggestion is to suppose the probability of the null value prior to the data is π, say, greater than zero (typically he takes π = ½) and that for θ ≠ θ₀ there is a density p(θ), so that ∫ p(θ) dθ = 1 − π, where θ₀ is excluded from Θ in the integration. The probability of θ₀ posterior to x can then be found, and one minus this is to be interpreted rather like the orthodox significance level. (Another possibility is to use the odds in favour of θ₀ posterior to x.) Curiously enough it can happen, even in the simple situation where x is a random sample from N(θ, 1), that a result which is conventionally significant at, say, 5% can have posterior probability near to 1, so that a hypothesis can be "rejected" when it is highly likely to be true. The calculation is easy, for if x̄ is the sample mean and θ₀ = 0, Bayes' theorem easily gives

p(θ₀|x̄) = π exp(−½nx̄²) / [π exp(−½nx̄²) + ∫ p(θ) exp{−½n(x̄ − θ)²} dθ],
the range of integration being as before. Now suppose x̄ = λ_α n^{−1/2}, where λ_α is chosen to make the result "significant at level α"; then the integral is O(n^{−1/2}) as n → ∞, provided p(θ) is bounded as θ → θ₀ = 0, and hence p(θ₀|x̄) → 1 as n → ∞. Hence for large enough n the posterior odds on θ₀ can be as large as one likes, for a fixed level of significance. The point is discussed by Jeffreys, and also by Lindley (1957) and Bartlett (1957). (See (4.29) below.)
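The calculation can be made vivid numerically. The sketch below assumes π = ½ and, purely for illustration, a N(0, 1) density for p(θ), so that the integral has a closed form; the sample mean is held at the conventional 5% point throughout.

```python
# The posterior probability of the null when the sample mean sits exactly at
# the 5% significance point, for increasing n (often called Lindley's paradox).
import numpy as np

def posterior_prob_null(n, pi=0.5):
    xbar = 1.96 / np.sqrt(n)                 # "just significant at 5%" for every n
    # density of xbar under the null: N(0, 1/n)
    f0 = np.sqrt(n / (2 * np.pi)) * np.exp(-n * xbar**2 / 2)
    # density of xbar under the alternative: N(0, 1 + 1/n), integrating out theta
    v = 1 + 1 / n
    f1 = np.exp(-xbar**2 / (2 * v)) / np.sqrt(2 * np.pi * v)
    return pi * f0 / (pi * f0 + (1 - pi) * f1)

for n in (10, 100, 10_000, 1_000_000):
    print(n, posterior_prob_null(n))
# The posterior probability of the null tends to 1 as n grows, although the
# sampling-theory test rejects at the 5% level for every n.
```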
Other objections of incoherence against significance tests will be only briefly mentioned. A common method of producing tests is the likelihood ratio principle, which uses

λ(x) = sup_{θ∈Θ₀} p(x|θ) / sup_{θ∈Θ} p(x|θ),

where Θ₀ corresponds to the null hypothesis. Stein, quoted by Lehmann (1950),
¹⁰ This book (the first edition appeared in 1939) is not read as widely as it should be. It contains a wealth of information on Bayesian ideas, both theoretical and practical.
has given an example where such a test is worse than one provided by ignoring the data altogether, and yet where a satisfactory test does exist. Stein (1951) and Chernoff (1951) have provided examples where "good" (in Stein's case, uniformly most powerful) tests reject at level α but not at level α′, where α′ > α. In view of the close connection between tests and confidence intervals (the interval being roughly those null values which the data do not reject), these last two examples are embarrassing to an advocate of such intervals, the interval of smaller content not being included in the larger one. But the main attack on confidence intervals (or sets) lies elsewhere.

Let A be a confidence statement, say that θ ∈ I, an interval of the real line, I, or Ĩ, being the random quantity; then we have p(A|θ) = α for all θ (or, more generally, p(A|θ) ≥ α for all θ). This is a quasi degree-of-belief statement about θ and, unless effectively based on a distribution for θ, can be incoherent in a way now to be described. An important criticism of confidence intervals is due to Fisher (1956). A formal expression of his point appears to run as follows. A subset C of X is relevant¹¹ (or recognizable) if p(A|C, θ) ≥ α + ε for all θ and for some ε > 0. The importance of a relevant subset is that whenever x ∈ C we know that the true confidence coefficient is strictly greater than the α-value quoted, which seems absurd. The simplest example arises when A sometimes includes the whole real line: taking C to be the set of x-values for which this happens, then p(A|C, θ) = 1. Thus let x = (x₁, x₂) with xᵢ ~ N(θᵢ, 1) and x₁, x₂ independent. A confidence interval for θ₁/θ₂ is provided by noting that (θ₂x₁ − θ₁x₂)(θ₁² + θ₂²)^{−1/2} ~ N(0, 1) and depends only on θ₁/θ₂. If x₁² + x₂² < λ_α², where λ_α is the upper, two-sided, α-point for the standard normal density, the resulting "interval" includes all values of θ₁/θ₂.

¹¹ A (frequency) theory of probability using the notion of relevant subsets has been developed by Kyburg (1969).

Fisher's original idea in introducing the concept seems to have been to criticize Welch's (1947) solution to Behrens' problem by demonstrating that a recognizable subset exists in that situation. Buehler (1959) has discussed the ideas in detail, and Buehler and Feddersen (1963) have demonstrated the remarkable fact that relevant subsets exist in the common Student-t situation (since this is also a fiducial interval, Fisher's remarks have come full circle). Specifically they show that if x = (x₁, x₂) with the xᵢ independent N(μ, σ²), so that a 50% interval for μ is x_min ≤ μ ≤ x_max, and if C is the set |x₁ − x₂| ≥ 4|x̄|/3 (so that the two readings are rather discrepant), then p(A|C, θ) ≥ 0.5181. Consequently even the most frequently used confidence statement is unsound, and it seems a reasonable conjecture that recognizable subsets exist for almost all situations. Hartigan has pointed out to me that relevant subsets always exist for one-sided confidence intervals on the real line. For let the confidence statement be p(θ > t(x)|θ) = α; then it is easy to demonstrate that the set t(x) < 0 is relevant. Peculiar phenomena that can arise with confidence intervals have been expounded by Pratt (1963).
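A simulation makes the Buehler-Feddersen phenomenon concrete. The sketch below uses the illustrative choice μ = σ = 1 (the bound holds for every choice) and estimates the unconditional and the conditional coverage.

```python
# Conditional coverage of the 50% interval x_min <= mu <= x_max on the
# Buehler-Feddersen relevant subset C.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, trials = 1.0, 1.0, 1_000_000

x = rng.normal(mu, sigma, size=(trials, 2))
covers = (x.min(axis=1) <= mu) & (mu <= x.max(axis=1))
print(covers.mean())                 # close to 0.5: the unconditional coverage

# the relevant subset C: the two readings are discrepant relative to their mean
C = np.abs(x[:, 0] - x[:, 1]) >= 4 * np.abs(x.mean(axis=1)) / 3
print(covers[C].mean())              # exceeds 0.5181, although 50% was claimed
```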
We have already pointed out that in point estimation, unbiased estimates could be incoherent because of their dependence on the sample space. A simple example is provided in Ferguson's text book (1967). Here x is a Poisson variable of mean θ
and an unbiased estimate of e^{−2θ} is required. (We observe a Poisson process for, say, an hour, and require to estimate the chance of no events in a subsequent two-hour period.) To be unbiased we must have

Σ_{x=0}^∞ t(x)e^{−θ}θˣ/x! = e^{−2θ}   for all θ,

or

Σ_{x=0}^∞ t(x)θˣ/x! = e^{−θ} = Σ_{x=0}^∞ (−1)ˣθˣ/x!,

on multiplying both sides by e^θ and using the series for e^{−θ}. By the uniform convergence of the series, it follows that the only unbiased estimate is t(x) = (−1)ˣ. The idea of estimating a probability as −1 is particularly ludicrous.
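The computation takes a few lines to confirm: with an illustrative value of θ, the expectation of (−1)ˣ under the Poisson distribution equals e^{−2θ} exactly, even though the "estimate" is −1 whenever x is odd.

```python
# Verify that t(x) = (-1)^x is the unbiased estimate of exp(-2*theta).
from math import exp, factorial

theta = 1.3                                  # illustrative value
expectation = sum((-1)**x * exp(-theta) * theta**x / factorial(x)
                  for x in range(200))
print(expectation, exp(-2 * theta))          # both equal 0.07427...
```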
and A known. (θ, in the earlier notation, is now (θ, φ).) Suppose AᵀA is nonsingular. For a distribution over parameter space, suppose the θᵢ and log φ to be all uniformly and independently distributed. Then it is easy to show that

    p(θᵣ₊₁, θᵣ₊₂, …, θₛ | x) ∝ {S² + S²(θ)}^{−(n−r)/2},   (4.1)

where S² is the residual sum of squares, namely,

    S² = min_θ (x − Aθ)ᵀ(x − Aθ),

and S²(θ) is a positive-definite quadratic form in θᵣ₊₁, θᵣ₊₂, …, θₛ whose exact form need not concern us. This density is constant on ellipsoids S²(θ) = const., with a maximum at the least squares estimates. The set consisting of the interior of any one of these ellipsoids has the property that the probability for any point inside the set is greater than that for any point exterior to it. Such sets have been called sets of highest posterior density (Box (1965)), Bayesian confidence sets (Lindley (1965)) and credible sets (Edwards et al. (1963)); we shall use the last term.¹² The probability (posterior to the data) of θᵣ₊₁, θᵣ₊₂, …, θₛ lying in the credible ellipsoids can easily be found from (4.1) in terms of the F-distribution. In fact, {S²(θ)/(s − r)}/{S²/(n − s)} is F(s − r, n − s). The set, A_α, with total probability α is a credible set of credibility α. It is easy to see that it has exactly the same form as the confidence set for θᵣ₊₁, θᵣ₊₂, …, θₛ based on the sampling distributions of S²(θ) and S², with confidence coefficient α. In fact we have both p(A_α|x) = α, where the random elements in A_α are the s − r parameter values, and p(A_α|θ) = α, where the random element is x. The normal distribution has the remarkable property that equivalent statements can be made with either X or Θ as the relevant space supporting the probability distributions. A Bayesian interpretation of the common F-test is then available by rephrasing the sampling-theory notion that a null value is significant if the confidence interval does not include it, confidence being replaced by credible. Thus the hypothesis θᵣ₊₁ = θᵣ₊₂ = ⋯ = θₛ = 0 is tested by referring {S²(0)/(s − r)}/{S²/(n − s)} to the F-table on s − r and n − s degrees of freedom in the usual way. Essentially in rejecting the null value we are saying that it has not got high posterior probability (density) in comparison with other values. Although these ideas enable orthodox practice to be interpreted in probability terms, it does not follow that the practice is to be adopted. Inferences should be expressed in the form of a posterior distribution. Practical circumstances may suggest some summary of the distribution because of the difficulties in describing a density, particularly in more than one dimension, but whether intervals are the most convenient forms of summary is unclear. Posterior means, modes or variances may be preferable. Another difficulty associated with the Bayesian description is that it uses improper prior distributions. We shall see later (§ 8) that there is

¹² Even in one dimension such intervals are not always too easy to compute since typically two "tails" with equal bounding ordinates will have to be found. Tiao and Lochner (1967) discuss this for F. An example of the use of these interval estimates in assessing the reliability of systems is provided by Springer and Thompson (1966, 1968), a problem also considered by Bhattacharya (1967).
reason to suspect these, yet a reanalysis using a proper prior will not give orthodox results. The above discussion of least squares ideas can be extended to other orthodox practices. For example, maximum likelihood methods are often sensible for a Bayesian, at least asymptotically, though the posterior mode is perhaps a more reasonable substitute. The usual χ²-tests for goodness-of-fit and for the analysis of contingency tables may also be justified asymptotically, though again, as we shall see below, other methods are more advantageous. We now turn from sampling-theory concepts to an honest Bayesian analysis of a decision problem (and hence of an associated inference problem). There are two ways to proceed.

1. Normal form. Let δ be a decision function mapping X into D and describing the decision δ(x) to be adopted when x is observed. The performance of δ (prior to the data being available) may be assessed for any value of θ by calculating the expected utility conditional on θ; that is, by

    ∫_X U(δ(x), θ) p(x|θ) dx.

(Compare the definition of a risk-function, equation (2.1).) Denote this by U_δ(θ). The Bayesian argument says that δ should be selected by maximizing the expected value of U_δ(θ), the expectation being with respect to the distribution of θ prior to x, that is, by

    max_δ ∫_Θ U_δ(θ) p(θ) dθ.

Essentially this is the Bayesian solution to a decision problem when it is expressed in the sampling-theory form in which the distribution over X is paramount. A simpler analysis is possible.

2. Extensive form. This is the form already given in (1.4) and consists in evaluating

    max_d ∫_Θ U(d, θ) p(θ|x) dθ,   (4.4)

the posterior expected utility. At least if utility is bounded and p(θ) proper the two forms are equivalent. For (4.4), averaged over X, is

    ∫_X { max_d ∫_Θ U(d, θ) p(θ|x) dθ } p(x) dx = max_δ ∫_X ∫_Θ U(δ(x), θ) p(θ|x) p(x) dθ dx
        = max_δ ∫_Θ ∫_X U(δ(x), θ) p(x|θ) p(θ) dx dθ
        = max_δ ∫_Θ U_δ(θ) p(θ) dθ,
where Fubini's theorem has been used twice to interchange double and repeated integrals, and the passage from the second to third lines has been effected by Bayes' theorem, (1.2). The main difference between the normal and extensive forms is that in the former the decision-maker considers the situation before the data is available, whereas in the latter only the decision for that x observed is contemplated. The basic idea of "called-off" bets is relevant. The extensive form is simpler. The terminology is due to Raiffa and Schlaifer (1961), as are most of the ideas which follow in this section. An elementary exposition of some of them is given by Raiffa (1968). In the extensive form no expectation over X is required and the likelihood principle obtains. In the design of experiments, however, X can be selected and expectations are required. A triplet e = (X, Θ, p(x|θ)) is called an experiment. Consider a collection, E, of experiments e having a common Θ, together with a decision space D. Prior to having selected e and observed x, we ask which is the best e to choose from E. The decision is now in two parts, the selection of e and the choice of d, and a general utility function will be of the form¹³ U(d, θ, e, x), allowing for the fact that some experiments will cost more than others. For any e the expected performance of the best decision function is given by one of the equivalent forms in (4.6) and the best e maximizes these. Hence the formal Bayesian solution to the experimental design problem¹⁴ is provided by

    max_{e ∈ E} ∫_X { max_{d ∈ D} ∫_Θ U(d, θ, e, x) p(θ|x, e) dθ } p(x|e) dx.   (4.7)

This is perhaps most easily appreciated by using a decision tree (see Fig. 3). The sequence of events in time order is that e is selected; on performance it yields data x, when d is chosen and finally θ yields the utility U(d, θ, e, x). A decision tree is analyzed in reverse time order. We first average over θ, the appropriate distribution being p(θ|x, e) since, at that time, e and x are available. Then d is selected to maximize the resulting average (or expectation). Next we average over x, the relevant density being p(x|e), and finally e is selected to maximize the resulting expectation. Notice that the operations of expectation and maximization alternate in the sequence. In the decision tree the points where expectation is relevant
¹³ There is no difficulty in including x in the utility function. In the extensive form the quantity to be maximized is then ∫_Θ U(d, θ, x) p(θ|x) dθ. In most applications U does not depend on x.
¹⁴ An alternative approach to experimental design, more in the spirit of inference than decision theory, uses the concept of information (see § 12.6).
have been indicated by circles (and are called random nodes); the others are shown as rectangles, termed decision nodes, and maximization is required. These simple ideas are extremely general and enable the Bayesian ideas to be extended to sequential experimentation, to be described later. The analysis at the last two nodes, max_d ∫ ⋯ dθ, is called terminal analysis; the rest, max_e ∫ ⋯ dx, is called preposterior analysis. Preposterior analysis involves the sample space; terminal analysis does not and uses the likelihood principle. Despite this very general formal solution to the problem of experimental design, few explicit results¹⁵ are available in the field that this title ordinarily covers. However one important consequence is immediately apparent and we pause to discuss this.

Randomization. Let E₀, a subset of E, be the set of experiments¹⁶ satisfying (4.7). If it contains a single member, then this is the best experiment to perform. If it contains more than one member then all e ∈ E₀ are equivalent from a Bayesian viewpoint and any may be selected. Consequently it is never necessary to randomize in experimental design, though randomization over E₀ would not do any harm (nor any good). This goes counter to a popular sampling-theory canon. On reflection the Bayesian conclusion seems correct to me. Certainly I find it hard to see how the fact that a result was obtained by randomization rather than by deliberate choice can have any effect on the subsequent analysis; in particular, the randomization theory of tests seems unconvincing. How can the fact that a different result might have been obtained, but was not, influence you once the data is on view? The point has been well argued by Jeffreys (1967). There might, nevertheless, be some sense in randomizing but then using an orthodox or Bayesian argument. However it is clear that randomization can only be a last resort. If some factor is present which is thought likely to influence the data, then this should be allowed for in the design, for example, by using blocking devices. Randomization, therefore, even to an orthodox statistician, is only used to guard against the unforeseen. The Bayesian could therefore select a haphazard sample: that is, one which, as far as he can see, will provide a good inference and not be disturbed by other effects. At best randomization can only be a convenient device to simplify the subsequent calculations. Stone (1969b) disagrees. We shall return to this topic when discussing sampling from a finite population in § 6.

Sufficiency. One topic on which all statisticians seem to be in complete agreement is that of sufficiency. The Bayesian definition is that t(x) is sufficient if p(θ|t) = p(θ|x) for every x and every distribution, p(θ), prior to x: that is, if the posterior given x is the same as given the statistic. This is easily seen to be equivalent to the orthodox definition. The extension to minimal sufficiency, in terms of sub-σ-fields over X, proceeds exactly as in the sampling theory. Like most writers we shall use sufficiency when strictly minimal sufficiency is meant. In the important
¹⁵ Draper and Hunter (1966, 1967a, 1967b) have discussed the design problem from a Bayesian viewpoint but not using the formal loss structure here described.
¹⁶ We suppose E₀ is not empty.
case of random sampling it is necessary to include the sample size as part of the (minimal) sufficient statistic. Notice that if θ = (θ₁, θ₂), marginal sufficiency for θ₁ is, in general, undefined. (For example, is s² marginally sufficient for σ² in sampling from a normal distribution? The answer would appear to be "no".) If p(x|θ) = p(t₁|θ₁)p(t₂|θ₂) and the prior similarly factors, then t₁(x) is marginally sufficient, but this is a very special case. The point arises in discussing robustness (§ 7).

Exponential family. The case where x = (x₁, x₂, …, xₙ) and p(x|θ) = Πᵢ₌₁ⁿ p(xᵢ|θ), so that x is a random sample of size n from the distribution of density¹⁷ p(xᵢ|θ), is of common occurrence. A special case, where the Bayesian (and orthodox) arguments are rather simpler, arises when the distribution is a member of the exponential family, that is,

    p(xᵢ|θ) = G(θ)^{−1} exp{ Σⱼ₌₁ᵏ φⱼ(θ) tⱼ(xᵢ) + H(xᵢ) }.   (4.8)

Here the φⱼ(θ) are k real functions of the parameter, the tⱼ(xᵢ) and H(xᵢ) are k + 1 statistics, and G(θ) is a normalizing factor defined in terms of the φ's, t's and H to make the density have integral (over X) equal to unity. It is immediately apparent that for x, Σⱼ₌₁ⁿ tᵢ(xⱼ), i = 1, 2, …, k, and n, are sufficient for θ. Consequently, whatever be the size of sample the dimensionality of a sufficient statistic is constant, at k + 1. The importance of this remark in a Bayesian analysis is that the posterior distribution of θ given x will, under these circumstances, depend only on k + 1 values however large the sample is. In fact, if p(θ) is the distribution prior to x, the posterior will be proportional to

    p(θ) G(θ)^{−β} exp{ Σᵢ₌₁ᵏ αᵢ φᵢ(θ) },   (4.9)

with αᵢ = Σⱼ₌₁ⁿ tᵢ(xⱼ), i = 1, 2, …, k, and β = n. As x ranges over X this generates a family of densities all of the form (4.9) depending on hyperparameters α₁, α₂, …, αₖ, β. Consequently not only is the density of x finitely parameterized, so is that of θ. This would not be true without the existence of sufficient statistics of fixed dimensionality. In this connection an important concept is due to Barnard (see Wetherill (1961)). A family ℱ of distributions over Θ is closed under sampling from the distribution with density p(xᵢ|θ) if whenever p(θ) ∈ ℱ, p(θ|x) ∈ ℱ for every x (and n). This means that provided the prior belongs to ℱ any data will result in a posterior distribution in ℱ. If p(xᵢ|θ) is a member of the exponential family, then ℱ will depend on a finite number of hyperparameters. In connection with (4.8) the family with densities proportional to

    G(θ)^{−b} exp{ Σᵢ₌₁ᵏ aᵢ φᵢ(θ) }   (4.10)
¹⁷ The same symbol p has been used for the density of x and for any component xᵢ.
is called the natural conjugate family (to p(xᵢ|θ)). Here a₁, a₂, …, aₖ and b are hyperparameters, with possible restrictions on their values in order that the integral of (4.10) over Θ converges. If p(θ) has this form then, by (4.9), the posterior is of the same form with hyperparameters αᵢ + aᵢ, i = 1, 2, …, k, and β + b replacing aᵢ, b in (4.10). The natural conjugate family is closed under sampling. It occupies an important role in current Bayesian research for no other reason than mathematical convenience. Two examples follow.

Example 1. If xᵢ ~ N(μ, σ²) the likelihood for x₁, x₂, …, xₙ is

    σ^{−n} exp[−{νs² + n(x̄ − μ)²}/2σ²],

where, as usual, x̄ = Σxⱼ/n, νs² = Σ(xⱼ − x̄)² and ν = n − 1. If the prior is proportional to

    σ^{−(ν′+2)} exp[−{ν′t² + n′(μ − m)²}/2σ²],

the correspondence between the two functions is: n → n′, x̄ → m, s² → t², ν → ν′, except in the power of σ. Clearly as (n′, m, t², ν′) vary this gives a family closed under sampling. A convenient interpretation is that ν′t²/σ̃² is χ² on ν′ degrees of freedom and, conditional on σ̃, μ̃ ~ N(m, σ̃²/n′). Here tildes have been used to indicate the random quantities (and thereby prevent confusion with sampling-theory ideas). The power of σ has been arranged to make this interpretation possible. These ideas extend to the multivariate case and a comprehensive account of the distributional theory has been provided by Ando and Kaufmann (1965).

Example 2. If xᵢ is 0 or 1, with p(xᵢ = 1|θ) = θ, the likelihood (see above) is θ^r(1 − θ)^{n−r}, with r = Σxⱼ, the usual combinatorial being unnecessary. The natural conjugate family is the Beta distribution with density proportional to θ^{a−1}(1 − θ)^{b−1}, with a, b > 0. The extension to the case where xᵢ takes k (>2) distinct values leads to the Dirichlet family discussed by Dickey (1968b). Hald (1968a) has studied the dichotomy as n → ∞ with h = r/n fixed for a general prior p(θ). To quote a typical result, he shows that

    E(θ|r, n) = h + {1 − 2h + h(1 − h) p′(h)/p(h)}/n
to order n^{−1}.

Noninformative stopping. Continuing with the case of x, a random sample from p(xᵢ|θ), we have seen that Bayesian (terminal) analysis uses only the likelihood function and that the usual orthodox restriction to fixed n (in order to define X) is not needed. However, care is needed to ensure that the sampling rule does not itself contain information about θ. The following analysis is due to Raiffa and Schlaifer (1961). Define q(n|x₁, x₂, …, xₙ₋₁, θ, ψ) to be the chance, given x₁, x₂, …, xₙ₋₁, θ and a nuisance parameter ψ, of observing another sample, so that q defines the rule for stopping sampling. If x = (x₁, x₂, …, xₙ), then¹⁸

    p(x|θ, ψ) = Πᵢ₌₁ⁿ [ q(i|x₁, …, xᵢ₋₁, θ, ψ) p(xᵢ|θ) ] · {1 − q(n+1|x, θ, ψ)}.
In an obvious notation this expression may be written

    p(x|θ, ψ) = Q p(x|θ),   (4.13)

where Q is the product of all the q-factors and p is as usual. The sampling rule is said to be noninformative if the Q-factor in (4.13) can be ignored: that is, if the posterior for θ given x is unaffected by its exclusion. Sufficient conditions are that Q does not depend on θ, and that θ and ψ are independent prior to x. Two examples follow.

Example 1. Suppose xᵢ ~ N(θ, 1) and the stopping rule is to continue sampling until |x̄| > 2n^{−1/2}. (Sample until the null hypothesis that θ = 0 is conventionally rejected at the 5% level.) This has been discussed by Armitage (1963). Here, perhaps surprisingly, the sampling rule is noninformative and the likelihood is as usual, though, at least when n (now a random quantity) is large, almost all the information is contained in it.

Example 2. The following practical application is due to Roberts (1967). The situation is the capture-recapture analysis that is presumably familiar enough to omit a detailed description. The marriage between the natural notation in this context and that of this review is as follows: θ → N, the size of the population, of which R are tagged; x → r, the number found to be tagged in a second sample of n; ψ → p, the chance of catching a fish (say) in that sample. We make all the usual assumptions; for example, that all fish have the same chance of capture irrespective of whether or not they have been tagged in the first sample. Roberts points out that the sampling rule may reasonably be informative. As usual, we have the likelihood

    p(r|n, N, R) = C(R, r) C(S, s) / C(N, n),

where s = n − r, S = N − R, and C(R, r) denotes the binomial coefficient. But reasonably it might also be true that

    p(n|N, p) = C(N, n) pⁿ (1 − p)^{N−n},

corresponding to the Q-factor in (4.13). If so the full likelihood is proportional to

    C(R, r) C(S, s) pⁿ (1 − p)^{N−n},   (4.15)
¹⁸ Notice that in writing down this formula it has been assumed that p(xᵢ|θ) = p(xᵢ|θ, ψ); that is, given θ, xᵢ is independent of ψ. In Bayesian statistics all quantities are random variables and care is needed in making the probability specification. Usually the most convenient method is through a sequence of conditional probability statements: here p(θ, ψ) (prior to x), the chance of the first sample, the distribution of x₁ (given θ, ψ) and so on in the natural order.
with S and p as the two parameters. Roberts supposes S to be uniform over the nonnegative integers and p to have the conjugate Beta density p^{r′−1}(1 − p)^{R′−r′−1}, the distributions being independent and prior to the data. Integration with respect to p gives

    p(S|r, n) ∝ C(S, s) B(r + s + r′, S − s + R − r + R′ − r′),

with mean R + (R + R′ − 2)(s + 1)/(r + r′ − 2) − 1, compared with the m.l. estimate R + Rs/r. Notice that r′ and R′ may be related to the experience gained in capturing the first sample of R for tagging.

The value of experiments. In expression (4.7) we saw how to solve the experimental design problem within the Bayesian framework. This expression is now studied further in order to assess the value of an experiment e. We suppose U(d, θ, e, x) = U(d, θ) + U(x, e) so that the terminal utility and experimental costs are additive. The expected utility of e before it is performed is

    ∫_X { U(x, e) + max_d ∫_Θ U(d, θ) p(θ|x, e) dθ } p(x|e) dx.   (4.16)
Consider the second of the two terms in the braces. It equals the expected utility of the best decision from e, given that x is observed. Hence the expectation of the utility from e will be the average of this over X. Whereas if e is not performed the best that can be obtained is max_d ∫_Θ U(d, θ) p(θ) dθ. The difference of these two expressions, namely,

    ∫_X { max_d ∫_Θ U(d, θ) p(θ|x, e) dθ } p(x|e) dx − max_d ∫_Θ U(d, θ) p(θ) dθ,

is called the expected value of e, denoted v(e). (Raiffa and Schlaifer call it the expected value of sample information, EVSI.) The expression is clearly nonnegative, since on reversing the orders of integration over X and maximization over d in the first term, an operation which can only decrease the value, the first and second terms become equal, by Bayes' theorem, and the difference is zero; hence v(e) ≥ 0. Hence any experiment is expected to be of value. Of course, when realized the value of x may result in a loss of utility. Writing U(x, e) = −c(x, e), the cost of e and x (in units of utility), the experiment is only worth performing if

    ∫_X c(x, e) p(x|e) dx < v(e),

on comparing with the first term in (4.16). A special case is where e is a perfect experiment; that is, an experiment which is certain to inform you of the correct value of θ. Here p(θ|x, e) becomes a Dirac δ-function and the integration over Θ in (4.16) gives just U(d, θ′), where θ′ is the "revealed" value of θ, so that one obtains max_d U(d, θ′). But θ′ has density p(θ′)
prior to the perfect experiment e*. Hence,

    v(e*) = ∫_Θ max_d U(d, θ) p(θ) dθ − max_d ∫_Θ U(d, θ) p(θ) dθ = min_d ∫_Θ L(d, θ) p(θ) dθ,   (4.17)

in terms of the loss function (2.2). This expression, v(e*), is called the expected value of perfect information, EVPI. Reversal of the orders of integration over Θ and maximization over d in the first term of (4.17) clearly shows that v(e*) ≥ v(e) (which is intuitively obvious). Hence the EVPI is a (useful) upper bound to the value of any experiment. It should be remembered that the exact connection between utility and experimental cost has to be considered carefully and involves considerations of the utility of money (see end of § 7). A detailed discussion has been given by LaValle (1968a, b, c) who discusses, inter alia, the buying and selling prices of a lottery. We next provide some examples designed to illustrate the above ideas.

Example 1. This is a no-data decision problem with Θ the real line and D = (d₁, d₂), the loss functions being linear in θ. Specifically, we suppose that

    L(d₁, θ) = a₁ + b₁(θ₀ − θ)  for θ ≤ θ₀,   L(d₂, θ) = a₂ + b₂(θ − θ₀)  for θ ≥ θ₀,

and otherwise zero, b₁ and b₂ being nonnegative. The value θ₀ is therefore the "break-even" value; for θ > θ₀, d₁ is optimum, for θ < θ₀, d₂ is the better. This is the most general linear-loss form, though without loss of generality we could put a₁ = 0, b₁ = 1. The optimum decision is to select d₁ if it has smaller expected loss, that is, if

    ∫_{−∞}^{θ₀} {a₁ + b₁(θ₀ − θ)} p(θ) dθ < ∫_{θ₀}^{∞} {a₂ + b₂(θ − θ₀)} p(θ) dθ.   (4.20)

(If data were available, p(θ) would be replaced here by p(θ|x).) Write p(θ) = f₀(θ) and define

    fᵢ₊₁(θ) = ∫_{−∞}^{θ} fᵢ(t) dt,   i = 0, 1, ⋯.

Integration by parts enables (4.20) to be written

    a₁ f₁(θ₀) + b₁ f₂(θ₀) < a₂{1 − f₁(θ₀)} + b₂{E(θ) − θ₀ + f₂(θ₀)},

where E(θ) is the expected value of θ. Evaluation of f₁(θ₀) (recognizable as the distribution function) and f₂(θ₀) are necessary for solution of the decision problem. Had the loss functions been polynomials of degree m, then the fᵢ(θ) would be required up to degree m + 1, in general. (If b₁ = b₂, f₂(θ) is not needed.) Notice that the normal distribution is particularly simple since f₀(θ) and f₁(θ) can be expressed in terms of φ(t) and Φ(t), the density and distribution functions of the
standardized normal curve, and the integral of Φ(t) is tΦ(t) + φ(t), so that the expected losses for θ > θ₀ against θ < θ₀ are immediately computable.
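To fix ideas, here is a small numerical sketch of Example 1 for a normal prior; the prior N(1, 1), the break-even value θ₀ = 0 and the unit slopes are assumptions made purely for illustration.

    from scipy.stats import norm

    # Two-action problem with linear losses (Example 1):
    #   L(d1, theta) = a1 + b1*(theta0 - theta) for theta <= theta0, else 0,
    #   L(d2, theta) = a2 + b2*(theta - theta0) for theta >= theta0, else 0.
    def expected_losses(theta0, a1, b1, a2, b2, mu, sigma):
        t = (theta0 - mu) / sigma                        # standardized break-even point
        f1 = norm.cdf(t)                                 # f1(theta0), the distribution function
        lower = sigma * (t * norm.cdf(t) + norm.pdf(t))  # E(theta0 - theta)+, via t*Phi(t) + phi(t)
        upper = lower + (mu - theta0)                    # E(theta - theta0)+
        return a1 * f1 + b1 * lower, a2 * (1 - f1) + b2 * upper

    l1, l2 = expected_losses(theta0=0.0, a1=0.0, b1=1.0, a2=0.0, b2=1.0, mu=1.0, sigma=1.0)
    print("choose d1" if l1 < l2 else "choose d2", l1, l2)

With data, the prior mean and standard deviation would simply be replaced by their posterior values, in accordance with the remark above.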
Example 2. This is again a no-data problem with Θ the real line, but now with D also the real line; again the losses are linear, with

    L(d, θ) = a₁ + b₁(θ − d)  for θ ≥ d,   L(d, θ) = a₂ + b₂(d − θ)  for θ ≤ d.
For a given d the expected loss is

    a₁{1 − f₁(d)} + b₁{E(θ) − d + f₂(d)} + a₂ f₁(d) + b₂ f₂(d),

using the notation introduced in Example 1. Differentiation with respect to d gives an equation involving the density and distribution functions for θ. If a₁ = a₂, the optimum d is the b₁/(b₁ + b₂) fractile. Again, were data available, p(θ) would be replaced by p(θ|x); and polynomial losses would involve fᵢ(θ) for larger values of i. This model may be suitable for a simple inventory situation with no carry-over from one period to the next, θ being the demand and d the amount stocked (or manufactured). More general inventory problems will require the sequential ideas discussed below. It is also a special case of a point-estimation problem, to use sampling-theory language, although there quadratic loss is usual. Notice that our choice of linear losses necessitates the use (when a₁ = a₂) of fractiles, not moments.

Example 3. This is a simple example involving the choice of experiment. E = {e₀, e₁, ⋯}, where eₙ consists in taking a random sample of size n from the exponential density (see (4.8)) on the real line, with θ also real, the cost of eₙ being c₀ + c₁n for n > 0, and zero for n = 0. D is also the real line, the loss being quadratic, (d − θ)²; so that in orthodox language the problem is to determine the optimum size of a sample from (4.23) in order to point estimate θ. In (4.23) we have

    p(xᵢ|θ) = G(θ)^{−1} exp{xᵢ φ(θ) + H(xᵢ)},   (4.23)

the special case k = 1, t₁(x) = x, of (4.8). The conjugate prior (4.10) is

    p(θ) = K_{n₀}(x₀) G(θ)^{−n₀} exp{x₀ φ(θ)},   (4.24)
where

    K_{n₀}(x₀)^{−1} = ∫_Θ G(θ)^{−n₀} exp{x₀ φ(θ)} dθ,

for suitable x₀ and n₀. It easily follows that

    p(θ|x) = K_N(t) G(θ)^{−N} exp{t φ(θ)},

where N = n₀ + n and t = Σᵢ₌₀ⁿ xᵢ (note the lower limit of summation). The expected loss using the optimum decision, namely the mean, θ̂_N(t) = ∫ θ p(θ|x) dθ, of the posterior distribution, is the variance

    V_N(t) = ∫ {θ − θ̂_N(t)}² p(θ|x) dθ.   (4.25)

The terminal analysis in (4.7) is now complete. It is next necessary to find p(x|e), here p(t|e), since t and n are sufficient, and integrate. But

    p(t|e) = ∫_Θ ∫_A Πᵢ₌₁ⁿ p(xᵢ|θ) p(θ) dx dθ,

where A is the set of all x₁, x₂, …, xₙ with Σᵢ₌₀ⁿ xᵢ = t. This is

    L(t, x₀)/K_N(t),

from (4.24), where L(t, x₀) is K_{n₀}(x₀) times the integral over A. The integration of (4.25) with respect to t can now be performed, and on inserting the cost term, we have, following (4.7), to find

    min_n [ ∫ V_N(t) p(t|e) dt + c₀ + c₁n ],   (4.26)

the term c₀ being omitted when n = 0. Notice how the evaluation of L(t, x₀) is effectively the evaluation of the sampling distribution of the sufficient statistic t(x). Consequently the results of the sampling-theory school are of value in preposterior analysis, though, as we have seen, in terminal analysis they are not needed because the likelihood principle obtains. The special case of the estimation of the mean of a normal distribution with known variance illustrates the ideas. Here xᵢ ~ N(θ, 1) and

    p(xᵢ|θ) = {(2π)^{−1/2} e^{−xᵢ²/2}} e^{−θ²/2} e^{xᵢθ},

in the form of (4.23), with φ(θ) = θ and G(θ) = e^{θ²/2}. Equally,

    p(θ) ∝ exp{−n₀(θ − x₀/n₀)²/2}  and  p(θ|x) ∝ exp{−N(θ − t/N)²/2},

with t = x₀ + nx̄ and N = n₀ + n. The posterior mean is t/N with posterior
variance (4.25) N^{−1}, irrespective of the value of t. It is not therefore necessary to evaluate the distribution of t (this case is very special!) and (4.26) gives

    min_n [ (n₀ + n)^{−1} + c₀ + c₁n ],

with c₀ omitted if n = 0. The result is that if n₀ < c₁^{−1/2} and n₀^{−1} < c₀ + 2c₁^{1/2} − n₀c₁, then no experimentation should be performed; but otherwise perform the experiment whose size is the integer nearest to c₁^{−1/2} − n₀ (which may be zero). The reader can easily verify that v(eₙ), the EVSI, is n₀^{−1} − (n + n₀)^{−1}, a value here which exceptionally is always attained.

Example 4. This example is exactly as the previous one except that D = (d₁, d₂), the losses being as in Example 1. In other words we are in a hypothesis-testing type of situation (either θ < θ₀ or θ > θ₀) rather than point estimation. Surprisingly enough this is more difficult, despite the fact that D may reasonably be said to be simpler. The basic reason for the difficulty is that when it comes to the minimization (or maximization with utilities) over D (see (4.7)), the result in the case of D being the real line (estimation) can be expressed as an analytic function; when D contains only two (or generally a finite number of) elements it is only piecewise analytic. We shall not go into details; for these the reader is referred to Raiffa and Schlaifer (1961). The case a₁ = a₂, b₁ = b₂, xᵢ ~ N(θ, 1) has an interesting history. It was first solved by Grundy et al. (1956), at a time when decision ideas were scarcely considered in experimental design, using fiducial probability in lieu of conjugate priors. (In this case the difference is only linguistic.) It has recently been extended to several stages of sampling by Scott (1968); see also Scott (1969). The corresponding problem when the data are lognormal has been examined by Kaufman (1968). The special case of binomial sampling, where xᵢ is 0 or 1 and p(xᵢ = 1|θ) = θ, has been extensively studied by Hald (1967, 1968b). The cost function is cn (so c₀ = 0, c₁ = c) and the loss functions have a₁ = a₂ = 0, b₁ = b₂ = b, say, so we may write¹⁹ L(d, θ) = b|θ − θ₀|. The obvious application is to industrial sampling inspection for attributes,²⁰ where θ₀ is the break-even value. Hald shows that as c → 0 (so that the optimum sample size becomes large) the latter is given by

    n ~ {θ₀(1 − θ₀) p(θ₀) b / 2c}^{1/2},

with optimum expected loss {2θ₀(1 − θ₀)p(θ₀)bc}^{1/2} + O(1). These ideas have also been extended to several stages by Hald and Keiding (1969). For two stages of sizes n₁ and n₂ corresponding asymptotic results hold, expressed in terms of N = c^{−1}.
¹⁹ The marriage between Hald's notation and ours is that our b is his N and our c his sampling cost per observation. He talks of N → ∞; only the ratio b/c is relevant, so this is equivalent to c → 0.
²⁰ It is disappointing that Bayesian decision theory has had so little impact on the whole field of quality control, which is still dependent upon sampling-theory ideas, though there are exceptions; for example, the comprehensive papers by Wetherill and Campling (1966) and Campling (1968).
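Returning for a moment to Example 3, the closed-form rule there is easily checked numerically. In the sketch below the values of n₀, c₀ and c₁ are assumed for illustration; the function implements the preposterior expected loss (n₀ + n)^{−1} + c₀ + c₁n of (4.26) for the normal case.

    def expected_loss(n, n0, c0, c1):
        # preposterior expected loss of experiment e_n (normal mean, known variance)
        return 1.0 / (n0 + n) + (c0 + c1 * n if n > 0 else 0.0)

    def best_n(n0, c0, c1):
        # the loss is convex in n, so it is enough to compare n = 0 with
        # the integers adjacent to the rule-of-thumb optimum c1**(-1/2) - n0
        guess = max(0, round(c1 ** -0.5 - n0))
        candidates = {0, max(0, guess - 1), guess, guess + 1}
        return min(candidates, key=lambda n: expected_loss(n, n0, c0, c1))

    print(best_n(n0=1.0, c0=0.01, c1=1e-4))   # -> 99, the integer nearest c1**(-1/2) - n0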
An interesting problem that arises in medical statistics has been discussed by Anscombe (1963), Colton (1963) and Canner (1970). Here N patients have a disease and two treatments T₁ and T₂ are available. A clinical trial is performed in which n patients are given T₁ and n, T₂. On the basis of the results of the trial the remaining N − 2n patients are treated with what appears to be the better treatment. The result of a trial is either success or failure and beta-priors are appropriate. The problems are how to select n and then T₁ or T₂ for the remaining patients. The loss (or utility) function naturally needs careful consideration. Canner solves the problem by the usual inverse method corresponding to (4.7). He shows, for example, that the optimum value of n is about {(N + 2)/(12c + 2)}^{1/2}, where c is the cost of each patient in the trial. Guthrie and Johns (1959) made an early Bayesian study of sampling from a batch of size N with a single sample of n and discuss the optimum sample size and decision procedure for large N.

We conclude this material on basic ideas by discussing a Bayesian method of hypothesis testing different from those indicated at the beginning of the section and in Example 1. Let H be a subset of Θ and suppose that we wish to see, in the light of the data, whether it is reasonable to suppose θ ∈ H. It is customary to speak of this as testing the null hypothesis, H, that θ ∈ H, where H has been used to denote both the hypothesis and the subset of Θ. The alternative hypothesis, H̄, is that θ ∉ H, that is, θ ∈ H̄. One way of testing is to calculate P(H|x), the probability that θ ∈ H, given the data; or, more conveniently, P(H|x)/P(H̄|x), the posterior odds in favor of H. Now

    P(H|x) = ∫_H p(x|θ) p(θ) dθ / p(x),

with a similar expression for H̄. The posterior odds are therefore given by

    P(H|x)/P(H̄|x) = ∫_H p(x|θ) p(θ) dθ / ∫_{H̄} p(x|θ) p(θ) dθ,

and do not involve p(x). A still more convenient expression is the ratio of posterior to prior odds, which is easily seen to be given by

    {P(H|x)/P(H̄|x)} / {P(H)/P(H̄)} = ∫_H p(x|θ) p(θ|H) dθ / ∫_{H̄} p(x|θ) p(θ|H̄) dθ.   (4.27)

This expression has the advantage that it does not depend on p(H). However, it does involve the distributions of θ conditional on H and on H̄ prior to x. Its use first seems to have been suggested by Jeffreys (1967). A common special case is that of a sharp hypothesis. This arises when θ = (ξ, η), say, and H specifies the value of ξ, = ξ₀, say, without specifying η. H̄ is simply ξ ≠ ξ₀. Then η is a nuisance parameter. An obvious example is where we wish to test whether the mean ξ of a normal distribution is ξ₀ without specifying the variance η. It has been shown by Dickey and Lientz (1970), in an elegant paper that develops a general treatment, that in this case, under a reasonable additional assumption, (4.27) takes on a simple form.
Let us write p(θ|H̄) = p(ξ, η|H̄) = f(ξ, η), say, where f(ξ, η) is defined²¹ as the elementary derivative of the distribution of θ, obtained by taking a sphere of radius ρ about (ξ, η) and letting ρ → 0, and write p(η|ξ₀, H̄) = f(ξ₀, η)/[∫ f(ξ₀, η) dη], the usual conditional form. In words, the conditional distribution of η, given ξ, considered as a function of ξ, is smooth around ξ = ξ₀, so that the only discontinuity in the joint distribution occurs in ξ, having a "concentration" of (prior) probability at the null value. The additional assumption is that

    p(η|H) = p(η|ξ₀, H̄).   (4.28)

Returning then to (4.27), the denominator is simply p(x|H̄) and the numerator may be rewritten, using (4.28), giving the ratio to be

    ∫ p(x|ξ₀, η) p(η|ξ₀, H̄) dη / p(x|H̄).

But

    p(η|ξ₀, H̄) = f(ξ₀, η) / p(ξ₀|H̄),

by Bayes' theorem

    p(x|ξ₀, η) f(ξ₀, η) / p(x|H̄) = p(ξ₀, η|x, H̄),

and

    ∫ p(ξ₀, η|x, H̄) dη = p(ξ₀|x, H̄).

Consequently the ratio of posterior to prior odds is simply

    p(ξ₀|x, H̄) / p(ξ₀|H̄).   (4.29)

According to Dickey and Lientz this result is due to L. J. Savage. The simplicity of (4.29) is due to its containing only the marginal densities of ξ (at ξ₀) before and after the data. If conjugate densities can be used, then these are very simple to calculate, far simpler than the form (4.27). A simple example is where x = (r₁, r₂), θ = (θ₁, θ₂) and rᵢ is the number of successes in nᵢ binomial trials with probability of success θᵢ, i = 1, 2, the two sets of trials being independent, and we wish to test θ₁ = θ₂. If θ₁ and θ₂ have,

²¹ A density is not unique since it may be changed on any set of measure zero under the dominating measure. Its definition here becomes critical since H is such a set.
under H̄, independent prior beta distributions with parameters aᵢ, bᵢ, the posterior distributions will also be beta. The above result can be applied with²² ξ = θ₁ − θ₂, η = ½(θ₁ + θ₂). The calculations of p(ξ₀|H̄) and p(ξ₀|x, H̄) with ξ₀ = 0 follow easily from properties of the beta distributions. If the problem of testing H against H̄ is regarded as a decision one with two decisions d_H and d_H̄, we may, without loss of generality (see the remark after (2.2)), suppose

    U(d_H, θ) = 1 for θ ∈ H, = 0 for θ ∈ H̄;   U(d_H̄, θ) = 1 for θ ∈ H̄, = 0 for θ ∈ H.

The expected utilities are then

    ∫ U(d_H, θ) p(θ|x) dθ = P(H|x) = P(H) ∫_H p(x|θ) p(θ|H) dθ / p(x),

and similarly,

    ∫ U(d_H̄, θ) p(θ|x) dθ = P(H̄|x) = P(H̄) ∫_{H̄} p(x|θ) p(θ|H̄) dθ / p(x),

and the posterior odds are directly relevant to the solution of the decision problem. If R is the ratio of the two integrals, then H is accepted if RΩ exceeds unity, where Ω is the prior odds. The asymptotic theory will be discussed in § 11 below.

5. Sequential experimentation. In the last section the choice of a single experiment was discussed; we now consider the selection of a sequence of experiments. This is a field in which the Bayesian approach promises to be more successful than standard theory, partly because it does not involve the complicated sample space in the same way, and partly because probabilities prior to xₙ seem more acceptable when data x₁, x₂, …, xₙ₋₁ are already available. Consider a finite sequence of possible experimental choices; let E = E₁ × E₂ × ⋯ × Eₙ with Eᵢ = (Xᵢ, Θ, p(xᵢ|θ)) so that Θ is fixed throughout. Let the cost function be additive: that is, c(x, e) = Σᵢ₌₁ⁿ cᵢ(xᵢ, eᵢ). The (terminal) decision space is D and the loss function is L(d, θ), supposed added to the experimental cost. We shall write xₜ = (x₁, x₂, …, xₜ). The idea is that e₁ is selected from E₁, x₁ observed, then e₂ chosen from E₂, and so on, up to xₙ, when finally d is chosen from D. Typically each Eᵢ will include a null experiment, that is, one in which no further data is collected, so that d is immediately taken. We saw that, even in the case of a single experimental choice, the analysis proceeds in a reverse time order (see (4.7) and the related decision tree). Consequently suppose that eₙ₋₁ = (e₁, e₂, …, eₙ₋₁) has been performed, with result xₙ₋₁, so that it is only necessary to consider the choice of eₙ, the value of xₙ and the terminal decision. Then (4.7)
²² There are other possibilities: for example, ξ = log(θ₁/θ₂), η = log(θ₁θ₂), but the results are invariant.
may be applied with the result

    min_{eₙ ∈ Eₙ} ∫_{Xₙ} [ min_{d ∈ D} ∫_Θ L(d, θ) p(θ|xₙ) dθ + cₙ(xₙ, eₙ) ] p(xₙ|xₙ₋₁, eₙ) dxₙ.

Write this Lₙ₋₁(xₙ₋₁, eₙ₋₁); it is the expected loss of the best choice of eₙ and d, given results xₙ₋₁ from eₙ₋₁. The same principle can now be applied to the choice of eₙ₋₁:

    Lₙ₋₂(xₙ₋₂, eₙ₋₂) = min_{eₙ₋₁} ∫ [ Lₙ₋₁(xₙ₋₁, eₙ₋₁) + cₙ₋₁(xₙ₋₁, eₙ₋₁) ] p(xₙ₋₁|xₙ₋₂, eₙ₋₁) dxₙ₋₁,   (5.1)

and so on recurrently back to the initial choice of e₁. Equation (5.1) is the basic dynamic programming result, which follows from the Bayesian approach. An example of the use of these ideas in a bioassay situation is given by Freeman (1970). His problem is to estimate the relationship, supposed parameterized in terms of θ, between Z, the percentage of animals affected by a drug when applied at a dosage of strength e (the notation is chosen to fit with the present theory). For the ith animal the dosage eᵢ may be selected, the response being xᵢ = 1 or 0 according to whether or not the animal is affected. A famous example is that of the two-armed bandit. Here, in formal language, each Eᵢ contains two elements, each of which gives a binomial trial of unknown chance, θ₁ in one case and θ₂ in the other. The losses are expressed naturally in terms of failures on the trials. The topic has an extensive literature (see, for example, De Groot (1970)). Although the recurrence relation derived from (5.1) provides a completely general method of solving the problems of sequential experimentation, in practice the analysis is involved and even the computation of numerical solutions is typically prohibitive. Some simplification is possible if the data come independently from a member of the exponential family because then the dimensionality of the space in which the computation takes place is fixed, irrespective of n. Even then the task may be formidable; thus, the two-armed bandit requires extensive calculations in four-dimensional space. (The beta-family is closed under binomial sampling—see above—and depends on two hyperparameters—there denoted a and b: there are two such families, one for each binomial sequence (arm).) Progress has been made in the special case where at each stage the experimenter has the choice either of performing an experiment (the same for every stage) or of taking a terminal decision from D. (This is the case originally considered by Wald.) Suppose then that, given θ, the xᵢ are independent and identically distributed with a density of the exponential family (4.8), so that the distributions of θ (given xᵢ) are closed under sampling and therefore described in terms of a hyperparameter α of fixed dimensionality. (In the notation of (4.9), α = (α₁, α₂, …, αₖ, β).) Given xᵢ let the distribution of θ have hyperparameter value αᵢ = α(xᵢ) so that p(θ|xᵢ) = p(θ|αᵢ). Let L(α) be the expected loss of the best sequential scheme when the distribution of θ is described by α. Here L(αᵢ) is the quantity denoted above by Lᵢ(xᵢ, eᵢ), the simplification of notation being made possible by the special form of Eᵢ and the existence of distributions closed under sampling, αᵢ summarizing all the relevant information after the ith observation.
Let c be the cost of any observation xᵢ and suppose the costs to be additive. Suppose you are at stage i; there are two possibilities:
(a) Do no further experimentation and select the optimum d; the expected loss will be

    L₀(αᵢ) = min_{d ∈ D} ∫_Θ L(d, θ) p(θ|αᵢ) dθ.   (5.2)

(b) Take a further observation, xᵢ₊₁, and do the best with that. Since xᵢ₊₁ will change αᵢ to αᵢ₊₁, say, the expected loss will be

    c + ∫ L(αᵢ₊₁) p(xᵢ₊₁|αᵢ) dxᵢ₊₁.

If (5.1) is applied we have the recurrence relation

    L(α) = min[ L₀(α), c + ∫ L(α(x)) p(x|α) dx ]   (5.3)

for the unknown function L(α) of k + 1 variables, L₀(α) being known from (5.2). In the case Θ = (θ₁, θ₂), D = (d₁, d₂), so that in standard statistical language one simple hypothesis is being tested sequentially against another, this equation has been solved and yields the original Wald test; but any more general case seems to have so far defied analytic solution. There is a good discussion by Whittle²³ (1964). Cornfield (1966) has discussed three-decision problems, D = (d₁, d₂, d₃), where standard practice often restricts Θ unrealistically to three values, and suggests that this procedure may be unsatisfactory. Bickel and Yahav (1969) have discussed sequential estimation with quadratic loss and have obtained interesting asymptotic results, including the normal and binomial cases. We proceed to illustrate the use of (5.3) in the special case of binomial sampling where xᵢ = 0 or 1, p(xᵢ = 1|θ) = θ, so that the distributions over θ are of the beta-family, proportional to θ^{a−1}(1 − θ)^{b−1}, and the hyperparameter α = (a, b) of dimensionality two. L(α) is now L(a, b). It is easy to verify that the probability p(xᵢ₊₁|αᵢ) required in (5.3) is given by p(xᵢ₊₁ = 1|aᵢ, bᵢ) = aᵢ/(aᵢ + bᵢ), and then α(xᵢ₊₁) = (aᵢ + 1, bᵢ). The probability for xᵢ₊₁ = 0 is of course bᵢ/(aᵢ + bᵢ), and then α(xᵢ₊₁) = (aᵢ, bᵢ + 1). Hence (5.3) becomes

    L(a, b) = min[ L₀(a, b), c + {a/(a + b)} L(a + 1, b) + {b/(a + b)} L(a, b + 1) ].
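A minimal dynamic-programming sketch of this recurrence follows. The break-even value θ₀, the unit linear losses |θ − θ₀| and the cost c are assumptions made for the illustration, and a finite horizon is imposed simply to truncate the backward induction, which the recurrence itself does not require.

    from functools import lru_cache
    from scipy.stats import beta as beta_dist

    THETA0, COST, HORIZON = 0.4, 0.001, 50    # assumed values

    def terminal_loss(a, b):
        # L0(a, b): the better of d1, d2 under losses |theta - theta0|,
        # computed from incomplete-beta integrals, using the identity
        # theta * f(theta; a, b) = {a/(a+b)} * f(theta; a+1, b).
        mean = a / (a + b)
        F, F1 = beta_dist.cdf(THETA0, a, b), beta_dist.cdf(THETA0, a + 1, b)
        loss_d1 = THETA0 * F - mean * F1                  # E(theta0 - theta)+
        loss_d2 = mean * (1 - F1) - THETA0 * (1 - F)      # E(theta - theta0)+
        return min(loss_d1, loss_d2)

    @lru_cache(maxsize=None)
    def L(a, b, remaining):
        stop = terminal_loss(a, b)
        if remaining == 0:
            return stop
        p1 = a / (a + b)                                  # p(x = 1 | a, b)
        go = COST + p1 * L(a + 1, b, remaining - 1) + (1 - p1) * L(a, b + 1, remaining - 1)
        return min(stop, go)

    print(L(1, 1, HORIZON), terminal_loss(1, 1))          # sampling versus stopping at once

Sampling continues while the "go" branch is the smaller, which reproduces the boundaries in the (a, b)-plane described next.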
Lindley and Barnett (1965) have solved this equation numerically for the case when D = (d₁, d₂) and the losses are linear (as in the work of Hald discussed above). The relevant space (of a, b) being two-dimensional, it is possible to produce graphs of boundaries in the (a, b)-plane, these boundaries having the following meaning. The distribution prior to xᵢ₊₁ has coordinate representation (aᵢ, bᵢ),
²³ Whittle and Lane (1967) have discussed sequential estimation and have shown that there are special cases, related to the attainment of the lower bound in the information inequality, when the best sequential procedure is nonsequential, that is, a sample of fixed size is optimum.
i = 0, 1, ⋯. If this lies within the boundaries, xᵢ₊₁ is observed and the representation changes to (aᵢ, bᵢ + 1) or (aᵢ + 1, bᵢ) according as xᵢ₊₁ = 0 or 1, so pursuing a path in the plane. As soon as the boundary is reached sampling stops and a terminal decision is taken. The boundary is divided into two parts corresponding to selection of d₁ and d₂. If c → 0 the sampling costs become small relative to the fixed terminal losses and the sample sizes become large. The binomial sampling may then be replaced by a normal approximation and one is effectively testing whether the drift of a Wiener process (of known dispersion) is less than or above a critical value. With an appropriate transformation in the (a, b)-plane due to Chernoff (1961), expressed in terms of d = a − θ₀(a + b), where θ₀ is the critical value, the boundaries tend to limits which have been calculated and are appropriate for normal sampling. A most systematic and careful study of problems of this type, with particular reference to asymptotic properties, has been made by Chernoff and colleagues (see Chernoff and Ray (1965) and references therein). Further remarks on the asymptotic shape of the testing regions have been made by Schwarz (1969) and Pratt (1966). A similar problem with the Poisson process has been examined by Lechner (1962).

A group of problems which are related to these are concerned with optional stopping. For example, let xᵢ, i = 1, 2, ⋯, be independent and identically distributed according to a known distribution, p(x), and x₀ = 0. The cost of each observation xᵢ (i > 0) is c and you get a reward on stopping at the nth stage of, according to the problem, either xₙ (this is termed without recall) or max_{0≤i≤n} xᵢ (with recall). The problem is to determine the best place to stop. In fact the answer is, in both cases, to continue until you obtain an observation above x̃, where x̃ satisfies the equation

    c = ∫_{x̃}^{∞} (x − x̃) p(x) dx.
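The threshold is easily found numerically. In the sketch below the standard exponential sampling density is an assumed example, chosen because the equation then has the closed-form solution x̃ = −log c against which the numerical root can be checked.

    import math
    from scipy.integrate import quad
    from scipy.optimize import brentq

    c = 0.05                                   # assumed cost per observation

    def expected_excess(threshold):
        # E(x - threshold)+ for the density p(x) = exp(-x), x > 0
        value, _ = quad(lambda x: (x - threshold) * math.exp(-x), threshold, math.inf)
        return value

    x_tilde = brentq(lambda t: expected_excess(t) - c, 0.0, 50.0)
    print(x_tilde, -math.log(c))               # both approximately 2.996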
If the distribution of x is unknown, but parameterized as p(x|θ), the problems are more difficult and closer to those considered above. A good survey is provided in the latter part of De Groot (1970). Smith and Vemuganti (1968) consider a regression problem concerning tool-wear in which the regression has to be estimated and, on the basis of this, the process stopped before the wear becomes excessive.

6. Finite population, sampling theory. This topic is a hazardous one to review at the present time because its foundations are currently being reexamined under the stimulus of Godambe and others. However, the Bayesian approach appears to shed some light on the fundamental difficulties and its application to sampling is a good test of its generality, so that it seems worthwhile to attempt a review even though it must necessarily be inconclusive.
There exists a population²⁴ of N individuals supposed indexed by the integers 1, 2, …, N; these integers will be termed the labels attached to members of the population. They may be real or they may be a fiction invented by the statistician in order to distinguish the members. In accordance with our notation, let θ = (θ₁, θ₂, …, θ_N) be a description of the population, θᵢ being a description of the member labelled i. The statistician's task is to make inferences about some aspect of θ, for example, Σᵢ₌₁ᴺ θᵢ if the θ's are real. A sample s of size n is an ordered set of n labels, s = (i₁, i₂, …, iₙ), iₖ being the label of the kth member of the sample. The labels need not be distinct—as when the sampling is with replacement. According to the Bayesian canon all uncertain quantities are specified by probabilities, so that here there exists p(θ), the distribution of θ (the population description) prior to sampling. Then the sampling rule provides a conditional density p(s|θ). For example, if the rule is to sample randomly without replacement,

    p(s|θ) = {N(N − 1) ⋯ (N − n + 1)}^{−1}

for all s. We saw that within the Bayesian framework randomization is unnecessary; if this is avoided, then p(s|θ) is degenerate, being 1 for the selected s and otherwise zero. Typically p(s|θ) does not depend on θ but sometimes it does, as when sampling fibres, the chance of a fibre being included in s depending on its unknown length. Notice that the model supposes that a single sample is taken; it can be extended to sequential sampling. The sample having been taken, measurements x = (x₁, x₂, …, xₙ) are made on the individuals in the sample. Here xⱼ is the measurement on the jth sample member. The case usually considered is that where xⱼ = θ_{iⱼ}, that is, the measurement is exactly the description of the individual. It is more general, and it fits more clearly with our previous notation, to distinguish the x's and θ's. (I owe the idea to C. R. Rao.) The requisite probability specification here is written p(x|s, θ); in the usual case just referred to, this is degenerate. In sequential sampling we shall have in succession p(s₁|θ), p(x₁|s₁, θ), p(s₂|x₁, s₁, θ) and so on. The probability structure of the model is then completely specified by the product

    p(θ) p(s|θ) p(x|s, θ)   (6.1)

and, given x, s, this is proportional to the (posterior) distribution of θ from which the required inferences can be made.²⁵ However, it is by no means usual for the scientist to observe (x, s). For example, he may not know the labels at all, or he may not observe the sampling in sequence but only as a whole (as when he puts his hand into an urn and withdraws a single handful of n balls). Consequently what is required is not p(θ|x, s) but p(θ|t(x, s)), where t is an (observed) function of x and s. This is easily found (apart from the constant of proportionality) by
²⁴ In applications this is a real population. The sampling-theory concept of a hypothetical population (finite or infinite) does not arise in Bayesian ideas except indirectly in connection with exchangeability, to be discussed below.
²⁵ Notice (in reemphasis of the point already made) how the specification has been made through a sequence of conditional probability statements following the natural time order: θ, s, x.
integrating (with respect to the implied dominating measure—often counting measure) (6.1) over all x, s giving the observed value of t. An important special case is that in which each x-value assumes one of a fixed number T of possible values y₁, y₂, …, y_T (thus, with T = 2, each value is either defective or not) and the only observations are nᵢ, the number of x's equal to yᵢ, i = 1, 2, …, T. In this case it will typically happen that the distributions conditional on θ will depend only on φᵢ, the number of θ's in the population equal to yᵢ, i = 1, 2, …, T. In that case θ may be replaced by φ = (φ₁, φ₂, …, φ_T) and the integration just referred to yields

    p(φ|n) ∝ p(φ) Πᵢ₌₁ᵀ C(φᵢ, nᵢ),   (6.2)

where n = (n₁, n₂, …, n_T). Much of the current discussion seems to revolve around the distinction between (6.1) and (6.2). Godambe (1969, 1970 and references therein) favors (6.1) whereas Hartley and Rao (1968, 1969) use (6.2). The distinction is important because, as Godambe was the first to show, within (6.1) (though without p(θ), since the argument is sampling-theoretic) there do not exist unbiased estimates of minimum variance, whereas there do within (6.2). It is significant that Godambe's (1955) counterexample refers to sampling with replacement, where the labels are informative so that the full specification (6.1) is necessary. It is typically true, of course, that the labels s are not observed. We have already met an example in discussing Roberts' analysis of the capture-recapture situation in § 4. Had the population really been numbered 1, 2, …, N the capture of fish number 186 would be much more informative than the mere capture of a fish.²⁶ When ignorant of the labels the integration referred to becomes necessary. This explains the combinatorial factor in (4.15). Ordinarily the likelihood in the binomial situation is (in the notation of (4.15)) pⁿ(1 − p)^{N−n} (see the discussion of the likelihood principle in § 3). The coin and dies problem discussed by Plackett (1966)²⁷ provides another example.

In continuing the general discussion, let us specialize the model (6.1), as is usually done, by supposing that p(s|θ) does not depend on θ and that xⱼ = θ_{iⱼ}, so that the observation is without error. The likelihood is then either unity (if θ_{iⱼ} = xⱼ for all j = 1, 2, …, n) or is zero. (p(s|θ) disappears from (6.1) and p(x|s, θ) is degenerate.) It has therefore been argued that the likelihood function is completely uninformative. This is not so. In fact, what happens is that, in conjunction with a prior distribution over θ, the effect of this peculiar likelihood is to make the posterior probability zero (whenever the likelihood is zero) and to leave the other probabilities for θ unaltered, except that they have to be scaled up to make
²⁶ Readers familiar with the tram-car problem (see, for example, Jeffreys (1967)), where a traveller arrives in a strange town knowing that the tram-cars are labeled from 1 up to an unknown N and sees tram 186, will know that sensible inferences can be made from a sample of size one.
²⁷ This paper on current trends in statistical inference is well worth reading and provides a more neutral account than the present review.
the total probability one. With enough data all the values of θ acquire zero chance except for the correct one. The following example is due to Cornfield (1965). A scientist has m + n units available for an experiment. On each unit he can either measure the control value αᵢ or the treatment value βᵢ = αᵢ + θ, i = 1, 2, …, m + n. He selects m (at random!) for the control and uses the remaining n for the treatments, obtaining results x₁, x₂, …, xₘ (on the controls) and y₁, y₂, …, yₙ (on the treatments)—xⱼ being the value for the jth control unit and yₖ that for the kth treated unit. The likelihood is

    Π_{control units i} δ(xᵢ, αᵢ) · Π_{treated units i} δ(yᵢ, αᵢ + θ),

where δ(a, b) = 1 if a = b, and is otherwise zero. Clearly this alone provides no information about θ. (Cornfield suggests trying m = n = 2, x₁ = x₂ = 1, y₁ = y₂ = 105.) However, on specifying a reasonable prior for the α's and θ the difficulty disappears. Suppose (and a reason in support of this will be given in a moment) that the αᵢ are normally and independently distributed, independent of θ, whose distribution is unspecified as p(θ). The posterior distribution of θ, given the data, is proportional to

    p(θ) ∫∫ Πⱼ₌₁ᵐ σ^{−1} φ{(xⱼ − μ)/σ} Πₖ₌₁ⁿ σ^{−1} φ{(yₖ − μ − θ)/σ} p(μ, σ) dμ dσ,

where φ(·) is the normal density. Cornfield shows that this leads to the usual inferences based on Student's t, modified by p(θ). It is this type of argument that convinces me that methods of inference that use only the likelihood function cannot be generally successful. Roughly speaking, the more parameters there are, the less satisfactory are the likelihood discussions.

The most complete study of sampling from a Bayesian viewpoint has been carried out by Ericson (1969a), but before introducing his work it is necessary to discuss an important idea of de Finetti's. The infinite sequence E₁, E₂, ⋯ of events is said to be exchangeable if, for any n distinct events, the probability that a specified r of them occur, and the remaining n − r do not occur, depends only on r and n; write it w_r^{(n)}. Then de Finetti²⁸ (1964) proved that necessarily

    w_r^{(n)} = ∫₀¹ θ^r (1 − θ)^{n−r} dP(θ)   (6.5)

for some distribution function P(θ) on [0, 1]. He also extended the result to more general random quantities than those taking only two values (a thorough investigation has been performed by Hewitt and Savage (1955)). In the form stated, the theorem says that exchangeable events constitute a mixed binomial sequence: that is, they are binomial for some θ, with θ having a probability distribution described by P(θ). An alternative paraphrase is to say that exchangeable events are like a binomial likelihood, p(r, n|θ) = θ^r(1 − θ)^{n−r}, with a (prior) distribution

²⁸ The idea dates from the mid 1930's. In general I have confined references to de Finetti to his papers in English. Much of his brilliant work is currently available only in Italian, in particular his important book (1970).
that is given by P(θ). In its extended form it roughly says that an exchangeable sequence of random variables acts like a random sample from some distribution, with a (prior) distribution over the sampled distribution. It is clear therefore that, in a Bayesian development, exchangeability can be a substitute for the randomness concept that is basic to the orthodox school in its interpretation of the only sort of probability it admits as a limiting frequency. The sampling-theory judgement of a random sample can be replaced by that of exchangeability. It is surely undeniable that the latter concept is easier to attach to a sequence than is the former, for all it says is that the order does not matter, whereas the randomness involves a condition of independence which is a very subtle notion. For example, suppose we take a coin and repeatedly toss it; are the resulting events independent? The answer is "yes" if the coin is known to be fair (or generally if its "θ" is known) but otherwise "no", as can easily be seen by remarking that people's judgement of the outcome of future tosses is affected by the earlier ones. Technically, the results are independent, given θ, but not otherwise.

We now turn to Ericson's ideas. To perform the necessary analysis a distribution p(θ) for the population has to be specified. It seems reasonable in many situations to suppose that the individual θᵢ are exchangeable, and that this property would persist for any N (de Finetti's result does not apply to finite sequences). The assumption of exchangeability means that one's opinions about, say, θ₇ are the same as those about any other θᵢ, and similarly for any pairs, triplets and larger groups of θ's. Although a modest assumption it has consequences that parallel closely the orthodox ideas derived from simple random sampling.²⁹ To illustrate this parallelism consider the case where the individual θ's are known but not the labeling. Then if exchangeability holds, all the possible labelings are equally likely. Suppose then that s = (1, 2, …, n) for definiteness; then all possible samples having nonzero probability have the same chance

    {N(N − 1) ⋯ (N − n + 1)}^{−1},

as in random sampling, where the probability statement refers to this random element. Consequently many orthodox results can be used in the Bayesian approach. For example, from the usual formula for the sampling variance of the sample mean x̄, we have in the Bayesian framework,

    E{(x̄ − μ)² | μ, σ²} = (σ²/n)(N − n)/(N − 1),

where now the expectation is over the assumed exchangeable prior distribution given the θ's (but not the labeling) and hence given μ and σ², the mean and variance respectively of the population. Thus exchangeability is closely related to the orthodox concept of randomization (Ericson discusses this (§ 2.4)). Generally, with exchangeability we have

    p(θ) = ∫ Πᵢ₌₁ᴺ p(θᵢ|ψ) p(ψ) dψ,
²⁹ Remember we are here dealing with the case where p(s|θ) does not depend on θ and xⱼ = θ_{iⱼ}.
where ψ is a hyperparameter, by de Finetti's theorem. In other words the finite population is equivalent to a random sample from a hyperpopulation. This assumption has been made by other writers but the justification in terms of exchangeability seems much stronger than any heretofore proposed. The same idea will occur in discussing other multiparameter problems below. Ericson first considers the case where this hyperpopulation is normal, so that p(θᵢ|ψ) is N(μ, σ²) with ψ = (μ, σ²). With vague prior opinion about ψ, a typical result is that the population mean θ̄ = N^{−1} Σᵢ₌₁ᴺ θᵢ has the posterior representation

    θ̄ = x̄ + t̃ₙ₋₁ s {(N − n)/(Nn)}^{1/2},

where t̃ₙ₋₁ is Student's t on n − 1 degrees of freedom and x̄ and s² are the usual sample mean and variance respectively. This is in close agreement with the usual theory. The normality assumption is a severe, and often unreal, one. Ericson therefore also considers the case, mentioned above, where the θ's (and therefore the x's) take only a finite number T of possible values. Exchangeability now means that the φ's are multinomial (with T = 2 we have the binomial result quoted in (6.5)) and the relevant conjugate distribution is Dirichlet (this will be discussed in § 10 when considering Bayesian estimation of multinomial parameters). A typical result is that the posterior variance of the population mean is approximately (N − n)/N · s²/n, in close agreement with orthodox theory. Similar conclusions under an interesting alternative model have been obtained earlier by Hill (1969). It is perhaps surprising that the results for the normal and multinomial assumptions agree so closely. The multinomial case, supported by a Dirichlet prior, has also been described independently by Hoadley (1969). He discusses the distributional problems carefully, and his paper is unusually interesting because it includes a practical application of the results to a telephone problem. A novel use of an old idea is to fit a Pearson curve to the posterior distribution of the required quantity (which is a rather complicated function of the θ's) on the basis of its moments. Ericson also considers the use of auxiliary measurements and in a later (1969c) paper extends the ideas to the stratified case. Here the assumption of exchangeability is unrealistic for the whole population but can still reasonably be applied within strata.

The analysis of finite populations is of interest because it demonstrates the importance of the opinions prior to the sampling. These opinions can be rather vague and the likelihood on its own can be uninformative, but in conjunction they provide considerable knowledge of the population. If the prior distribution is different, for example, if it is nonexchangeable, then an equally informative but quite different analysis may result. I gave an example in the discussion to Ericson's (1969a) paper which may have repercussions in the problem of cluster analysis. Of course, even orthodox statisticians will use their prior knowledge to suggest the type of analysis appropriate to a data set; the only real difference is that the Bayesian approach performs this process in a more formal manner.
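A simulation sketch of the multinomial version of this machinery may be helpful; the value set y, the sample counts and the Dirichlet prior below are all assumed for illustration, and the empirical posterior variance of the population mean is compared with the (N − n)/N · s²/n approximation just quoted.

    import numpy as np

    rng = np.random.default_rng(0)
    y = np.array([0.0, 1.0, 2.0])         # the T = 3 possible values (assumed)
    counts = np.array([30, 50, 20])       # observed n_i in a sample of n = 100
    alpha = np.ones(3)                    # Dirichlet prior hyperparameters (assumed)
    N, n = 10_000, counts.sum()

    draws = []
    for _ in range(20_000):
        phi = rng.dirichlet(alpha + counts)       # posterior for the composition
        unseen = rng.multinomial(N - n, phi)      # the N - n unsampled individuals
        draws.append((counts @ y + unseen @ y) / N)

    x_bar = counts @ y / n
    s2 = counts @ (y - x_bar) ** 2 / (n - 1)
    print(np.var(draws), (N - n) / N * s2 / n)    # the two should be close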
BAYES1AN STATISTICS
41
Although not necessarily concerned with finite population theory, a topic that is related, at least practically, to the subject of this section is that of optimum allocation to different strata in a population. This has also been discussed by Ericson (1965, 1967b) when the within-strata distributions are normal. Intuitively it is not easy to see how to balance the three factors of within-stratum variance, cost of sampling in a stratum and precision in the estimate of the quantity of interest. The computations are tricky, but the first paper gives some illuminating examples, some of which are surprising, and the second provides an algorithm which reduces it to a univariate problem. De Groot and Starr (1969) have considered two-stage stratified problems and Zacks (1970b) deals with the finite population aspect. This material fits into the general design formulation considered in § 4 (e.g., equation (4.7)). Draper and Guttman (1968a, 1968b) describe two approaches, Bayesian and preposterior, which seem to give different answers but their logic is unclear to me. Ericson (1967a) has also examined the problem of nonresponse.
In the course of his work Ericson uses a result which is stated more generally in another paper (1969b). Let x₁, x₂, ..., x_n have E(x_i|θ) = μ(θ), say, for all i, and var(x_i|θ) < ∞. Let the prior distribution for θ have E{μ(θ)} = m, say, and 0 < var{μ(θ)} < ∞. Then if

E{μ(θ)|x} = αx̄ + β,   (6.6)

where α and β do not depend on x, it follows that

E{μ(θ)|x} = (x̄/v + m/w)/(1/v + 1/w),   (6.7)

where v = E{var(x̄|θ)} and w = var{μ(θ)}.
The essential condition is (6.6), that the posterior mean of μ(θ) be linearly dependent on the data only through x̄; the linear form thereby assumed is then a weighted average of the sample mean and the prior mean with weights inversely proportional to appropriate variances or expectations of variances. Notice that the data do not have to be a random sample from a distribution. The result has been known for some time in special cases (for example, with normal and gamma distributions). The idea of shifting an estimate from one value towards another prior value has been discussed by Thompson (1968) and Arnold (1969). Although the result is intuitively appealing, it has been interestingly criticized by Beale (in the discussion to Lindley (1968)) on the grounds that if x̄ and m (the sample and prior means) are very different, the estimate (6.7) is a long way from x̄, whereas common sense would suggest that m was "wrong" in some sense and that the estimate should be near, if not at, x̄. In the case of discrepant means the form of the density p(θ) in the tails becomes important because that is where the observations are, surprisingly, to be found. Some approximate analysis in the case n = 1, x is N(θ, 1)—where Ericson's result applies if p(θ) is also normal—suggests that if p(θ) is of the form of a t-distribution, then the deviation from x̄ (now x) tends to zero as x and m become more discrepant, instead of increasing without limit in the normal prior case. The slight difference between the tails of the normal and t-distributions has an appreciable effect on the posterior distribution.
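The tail behavior Beale's criticism turns on can be checked numerically. A minimal sketch (assumptions mine: n = 1, x ~ N(θ, 1), prior mean m = 0, prior scale 2, and a t₃ prior standing in for "a t-distribution"), computing the posterior mean by quadrature under the two priors:

```python
import numpy as np
from scipy import stats

# Posterior mean of theta by quadrature for a single observation
# x ~ N(theta, 1) and a given prior density.
def posterior_mean(x, prior_pdf):
    theta = np.linspace(-30.0, max(30.0, x + 10.0), 60001)
    w = stats.norm.pdf(x, loc=theta, scale=1.0) * prior_pdf(theta)
    return np.sum(theta * w) / np.sum(w)

normal_prior = lambda t: stats.norm.pdf(t, scale=2.0)
t_prior = lambda t: stats.t.pdf(t / 2.0, df=3) / 2.0  # t_3 prior, scale 2

for x in [2.0, 5.0, 10.0, 20.0]:
    dn = x - posterior_mean(x, normal_prior)
    dt = x - posterior_mean(x, t_prior)
    # Normal prior: the shift x - E(theta|x) grows with x (here x/5);
    # t prior: the shift tends back to zero as x and m = 0 diverge.
    print(f"x={x:5.1f}  shift, normal prior: {dn:6.3f}   t prior: {dt:6.3f}")
```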
The effect of changes in the prior distribution is part of the robustness of Bayes' procedures to changes in the various parts of the formal model, and we therefore turn to a study of this topic.
7. Robustness. Bayesian concepts demand a degree of formalism in the analysis of data that is not always required by other methods; consider, for example, the ideas presented by Tukey (1962). There are three aspects of this: first, the explicit statement of the data structure through the density p(x|θ); second, the quantitative expression of prior knowledge by means of p(θ); and thirdly, if a decision problem is involved, the characterization of the consequences through a utility function. This strict formalism has the great advantage of providing a method of solution of all problems presented within its framework—unlike the ideas of Tukey and his colleagues which demand a degree of improvisation and arbitrariness not often made explicit—but a penalty is paid by the need to describe the model closely, and consequently by a need to see how sensitive the answers are to the assumptions made in the model. It is this last question of sensitivity that we discuss in the present section. An insensitive procedure was called robust by Box, who first studied the problem within the orthodox framework. We now review the studies that have been made of the robustness of Bayes' procedures to changes in p(θ), p(x|θ) and U(d, θ), respectively.
The main work on the robustness with respect to the prior distribution is that of Edwards et al. (1963).30 They consider the case where repeated, independent measurements are made under identical conditions (in orthodox language, where x = (x₁, x₂, ..., x_n) is a random sample of size n from some distribution) and show that it is enough, at least for all but small values of n, to suppose p(θ) "smooth" over the range of θ where the likelihood function is appreciable, and not too large elsewhere. (Their principle of stable estimation gives a precise statement.) Consequently in many cases it is possible to derive a posterior distribution by considering only the likelihood function and certain modest properties of the prior. They only discuss the case where Θ is the real line. Additional difficulties can arise when Θ is of higher dimension (examples will occur later) but clearly the same general principles obtain.
In the case of sampling from members of the exponential family (4.8) it is particularly easy to study the effect of changes in the prior within the conjugate family (4.10) since the posterior is an easily calculable member of the same family. We illustrate with the case of binomial sampling, which has received much attention. The likelihood for r successes in n trials is θ^r(1 − θ)^{n−r} and the conjugate family is Beta with density proportional to θ^{a−1}(1 − θ)^{b−1}. If the prior has hyperparameters (a, b), the posterior values of these are (a + r, b + n − r). It is perhaps easiest to re(hyper)parameterize in terms of n₀ = a + b and θ₀ = a/(a + b): the posterior values are then n + n₀ and (n₀θ₀ + r)/(n + n₀).
30 This long paper is one of the clearest presentations of the ideas behind, and practical implications of, the Bayesian approach.
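The conjugate updating just described is easily made concrete. A minimal sketch (hypothetical data and hyperparameters) of the (n₀, θ₀) reparameterization, showing that for n large relative to n₀ the posterior is insensitive to the prior:

```python
# Beta-binomial conjugate update in the (n0, theta0) parameterization:
# prior Beta(a, b) with n0 = a + b and theta0 = a/(a + b); data r
# successes in n trials. Posterior hyperparameters are n + n0 and
# (n0*theta0 + r)/(n + n0).
def update(n0, theta0, r, n):
    post_n = n0 + n
    post_theta = (n0 * theta0 + r) / post_n
    return post_n, post_theta

r, n = 7, 20  # illustrative data
for n0, theta0 in [(1, 0.5), (5, 0.5), (20, 0.9)]:
    post_n, post_theta = update(n0, theta0, r, n)
    # For n large relative to n0 the posterior mean is close to r/n,
    # whatever the prior: robustness to the prior.
    print(f"prior (n0={n0:2d}, theta0={theta0}): "
          f"posterior (n={post_n}, theta={post_theta:.3f})")
```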
procedure can be criterion nonrobust because it is not allowing for the fact that t, for example, is an inappropriate statistic to use for nonnormal distributions. Box and Tiao discuss two cases using numerical data: that is, their conclusions hold only for the numerical x considered by them. In both cases they generalize the normal model using the symmetric distributions described by

p(x|θ) ∝ σ⁻¹ exp{−½|(x − μ)/σ|^{2/(1+α)}},
the constant not involving x. (Here θ is (μ, σ) and −1 < α ≤ 1.) They are therefore investigating the effect of kurtosis on the normal assumption. For α = 0, we have the normal density; for α = 1, the double exponential; and as α → −1 the density approaches the uniform one. The prior selected for α is proportional to (1 − α²)^{a−1} for some a ≥ 1 and the invariant prior dμ dσ/σ is chosen for the normal parameters, all being independent. In this case the analysis for α = 0 gives essentially t for the comparison of means and F for that of variances (see § 3). Their first example concerns inferences about the mean, and even though the data chosen have a few possible nonnormal features, the final inferences p(μ|x) are essentially the same as when normality is assumed. This is because the distribution p(α|x) concentrates around α = 0, thereby supporting the normal model. Their second example concerns the variances and is more interesting because the F-test is known to be very criterion nonrobust (a fact first pointed out by Box). The data consist of two independent samples, originally supposed to be from normal distributions, N(μ_i, σ_i²), i = 1, 2, but in the extended form to be described by (7.3) with a common value of α. Box and Tiao calculate the densities of σ₂²/σ₁² given the data for various values of α, that is, p(σ₂²/σ₁²|x₁, x₂, α), and find them to change considerably with α. However on integrating over α according to (7.2) they obtain a density p(σ₂²/σ₁²|x₁, x₂) which is much less sensitive; some of their results are reproduced in the table below.
[Table: Changes in significance level induced by departures of the parental α from the α₀ defining the criterion. Rows: α₀ = −0.6, −0.4, −0.2, 0.0, 0.2, the value of α used in selecting the criterion; columns: parental α = −0.6, −0.4, −0.2, 0.0, 0.2, 0.4, 0.6. Entries are significance levels for testing σ₁² = σ₂², some cells not shown.]
(Strictly, Box and Tiao work with standard deviations and not variances.) Each row of the table refers to the value of α used in selecting the criterion. The entries in the table are the usual significance levels for testing the hypothesis σ₁² = σ₂². The row for α₀ = 0 shows how nonrobust F is, varying from 0.03 to 0.14. The diagonal entries vary much less (from 0.06 to 0.09, including one not shown), showing how much more robust the Bayesian analysis33 is. Consequently, for these data, the usual level is misleading in the sampling-theory sense but not in the Bayesian one. It must, however, be remembered that these results only hold for the specific set of observations chosen. It is disappointing that more work has not been done with other data sets to see how general this important result is. Tiao and Lund (1970) have discussed inference robustness for the estimation of μ in (7.3). Interval estimation has been scrutinized by Lindley (1961b).
A discussion along similar lines but in a different context has been given by Tiao and Tan (1966). They consider the one-way analysis of variance, random effects model where, in a familiar notation,

x_ij = μ + a_i + e_ij

for i = 1, 2, ..., k; j = 1, 2, ..., n, with e_ij ~ N(0, σ²) and a_i ~ N(0, σ_a²), all these variables being independent. This had been discussed by them in an earlier paper (1965) and in the one now being described they consider the effect autocorrelation within groups could have on the inferences drawn. Specifically they suppose, with now e_ij ~ N(0, σ²), that adjacent observations within a row have correlation ρ and zero otherwise, so that there is correlation within rows but not otherwise.
33 These levels are approximately the posterior probabilities, given the data and α, that σ₁²/σ₂² > 1.
He considers in detail the special case v_ij = 1 (i = j), v_ij = ρ (|i − j| = 1) and zero otherwise. The effect of changes in the prior density of ρ on the posterior density of contrasts is discussed, with conclusions similar to those of Tiao and Tan.
The most extreme example of a nonrobust procedure known to me occurs when fitting a straight line to data when both the variables are subject to error. Specifically, and in its simplest form, suppose x_i ~ N(ξ_i, φ₁) and y_i ~ N(θξ_i, φ₂) for i = 1, 2, ..., n, all these probabilities being independent. That is, the "true" values of x and y lie on a line through the origin of slope θ, which is the parameter of interest. In some circumstances it may be reasonable to suppose the ξ's exchangeable and the distribution from which they have effectively been sampled to be normal, N(0, τ), say. If so the model is unidentifiable in the sense that two different sets of parameter values will give the same probabilistic description of the data. This is easily seen because the pairs (x_i, y_i) will be random samples from a bivariate normal distribution of zero means (for simplicity) and dispersion matrix

( τ + φ₁    θτ       )
( θτ        θ²τ + φ₂ )
and four parameters (φ₁, φ₂, τ, θ) describe the three distinct elements.34 Lindley and El-Sayyad (1968) have shown that as n → ∞ the posterior distribution of θ tends to a limiting form having nonzero variance, whereas the usual situation is for it to converge onto the true value. However, if the distributions (either of the errors or of the ξ's) are nonnormal, then the problem is identifiable and the usual convergence takes place. This is true even for slight departures from normality, so that the "normal" procedure is extremely nonrobust: a limiting variance changing abruptly from a nonzero to a zero value. The problem of fitting a straight line with both variables subject to error is a fascinating one; another curious feature of it was mentioned in § 3 when discussing maximum likelihood estimation.
An aspect of robustness which was one of the earliest to be discussed concerns the effect a few aberrant observations, usually called outliers, might have on an inference. A Bayesian approach to this problem has been provided by Box and Tiao (1968b). They discuss the general linear model but for simplicity in exposition we describe only the case of real observations x₁, x₂, ..., x_n depending on a single real parameter θ. Suppose that each observation has either density p(x_i|θ, ξ₁) or p(x_i|θ, ξ₂), where ξ₁ indexes the standard distribution and ξ₂ the alternative distribution of outliers. Let R be a subset of the first n integers and let a_R denote the event that x_i, for i ∈ R, comes from the standard distribution, and otherwise from
34 In passing it might be noted that unidentifiability causes no real difficulties in the Bayesian approach. If the likelihood does not involve a particular parameter, θ₁, say, when written in the natural form, then the conditional distribution of θ₁, given the remaining parameters, will be the same before and after the data. This will not typically be true of the marginal distribution of θ₁ because of the change in assessment of the other parameters caused by the data, though if θ₁ is independent of them, it will be. For example, unidentifiable (or unestimable) parameters in linear least squares theory are like θ₁ and do not appear in the likelihood. Notice, however, that with certain types of prior distribution having strong association between θ₁ and the other parameters, data not involving θ₁ can provide a lot of information about it. Effectively this is what happens in the case under discussion.
the alternative. Then the evaluation of p(θ|x, a_R) proceeds straightforwardly. Consequently in order to find p(θ|x) we need to evaluate p(a_R|x), so that we can use the result

p(θ|x) = Σ_R p(θ|x, a_R)p(a_R|x).   (7.6)
But

p(a_R|x) ∝ p(x|a_R)p(a_R),
and since p(a_R) is part of the density prior to the data we only require p(x|a_R). In an obvious notation we can write x = (x_R, x_S), where S = R̄, and hence

p(x|a_R) = p(x_R)p(x_S|x_R)   (7.7)

in view of the description of a_R as specifying that x_R (x_S) is distributed according to the standard (alternative) distribution. The densities occurring on the right-hand side of (7.7) are, in order, marginal (for x_R) and predictive (for x_S, given x_R) and may be evaluated in the usual way.35 The computation can therefore proceed, though it is not quick because, as (7.6) shows, p(θ|x) is a weighted sum of 2ⁿ terms. Box and Tiao discuss in detail the case where x_i ~ N(θ, σ²) for ξ₁ and x_i ~ N(θ, kσ²) for ξ₂ with k known. The prior density of θ and σ is proportional to σ⁻¹ and, independently, each observation has a known chance, α, of being an outlier. The calculations (for a specific set of numerical data) are carried out for various values of k and α and suggest that although fairly insensitive to the choice of k the effect of α is more pronounced. This problem may also be described as one of sampling from the "mixed" distribution with density

(1 − α)p(x_i|θ, ξ₁) + αp(x_i|θ, ξ₂),

so that the analysis provides a solution to the problem of "mixtures".
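The 2ⁿ-term computation in (7.6) can be carried out directly for small n. A sketch (illustrative data; known variances, with a flat prior for θ over a grid standing in for the improper prior of the text):

```python
import itertools
import numpy as np

def norm_pdf(z, mean, var):
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# p(theta|x) under the outlier model: each x_i is N(theta, 1) with
# chance 1 - alpha, or N(theta, k) with chance alpha; the posterior is
# a weighted sum over all 2^n configurations a_R, as in (7.6).
x = np.array([1.2, 0.8, 1.1, 0.9, 4.5])   # last point looks discordant
alpha, k = 0.05, 9.0
theta = np.linspace(-5.0, 10.0, 3001)
dt = theta[1] - theta[0]

post = np.zeros_like(theta)
for variances in itertools.product([1.0, k], repeat=len(x)):
    n_out = sum(v == k for v in variances)
    weight = alpha ** n_out * (1 - alpha) ** (len(x) - n_out)  # p(a_R)
    lik = np.ones_like(theta)
    for xi, vi in zip(x, variances):
        lik *= norm_pdf(xi, theta, vi)
    post += weight * lik
post /= post.sum() * dt  # normalize the flat-prior posterior on the grid

print("posterior mean of theta:", (theta * post).sum() * dt)
```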
so that the analysis provides a solution to the problem of "mixtures". An interesting approach to this problem has been provided by de Finetti (1961). He considers the case of a location parameter where p(x{\9) = f ( x t — 9) with 9 having a uniform prior distribution. The form assumed for / is
for a weight function, w(y) ^ 0,
w(y) dy = 1 over normal distributions. (This is
essentially the case just considered when the weight function is concentrated on two values only.) Various complications are also considered but few specific results obtained. It should be noted in reading de Finetti's paper that he does not refer to θ in quite the way we have done, preferring instead to speak of the probability of x, a future observation, given x₁, x₂, ..., x_n. The case just described is termed by him the case of independence; he also discusses what we would term the
35 Predictive distributions will be discussed in § 9 below.
case of nuisance parameters under the name of exchangeable. As I understand him, he regards the statistician's parameters as artifices produced by his theorem (compare the introduction of θ merely as a variable of integration in (6.5)) and holds that probability statements should refer to observables.
Using the term discordant observations, Hartigan (1968) has developed quite a different approach. Let p(θ) be a density for θ and let x be an observation. We may then calculate p(θ|x), and Hartigan proposes measuring how discordant x is by comparing p(θ) and p(θ|x) using a distance function between two distributions earlier proposed by Jeffreys (1967); the discordance D(x) of x, given p(θ), is defined as this distance.
For a p(θ|x) based on a large number of observations the dependence on a distribution prior to them is negligible in most cases, so that D(x) can, to a good approximation, be expressed in terms of the likelihood function. An application to the detection of discordant judges is included in the paper.
Another way of looking at the robustness problem is to see what transformation of the data will turn the model into a standard one, for example, normal. The parameter α (in (7.2)) corresponds to the transformation employed, for example, power transforms, x^α. A study of this has been made by Box and Cox (1964) from both sampling-theory and Bayesian viewpoints. The latter is puzzling because it uses a density over θ which depends on the data. This difficulty deserves more attention but appears to arise from the use of improper distributions (see § 8). The problem has been reconsidered by Draper and Hunter (1969).
The remaining feature of robustness to be discussed is that of sensitivity to the utility, or loss, function. There appears to be little on this topic, perhaps because work has been mainly confined to the inference, rather than the decision, aspects. Britney and Winkler (1968) have discussed point estimation under various loss functions. Evans (1964) has considered the case of the variance of a normal distribution. Zellner and Geisel (1968) have investigated the regression problem with y_i = βx_i + u_i, i = 1, 2, ..., n, u_i ~ N(0, σ²), where one is required to select x_{n+1} in order to make z = y_{n+1} as near as possible to a prescribed value.36 They conclude that the optimum choice of x_{n+1} is very sensitive to the form of the loss function employed. El-Sayyad (1967) has discussed the estimation of the parameter of an exponential distribution using a variety of loss functions.
We have already pointed out (see (2.2)) that the interpretation of a loss function is not always clear. The argument of § 2 establishes the existence and precise meaning of a utility function, and strictly the Bayesian analysis should be in terms of this. Furthermore the utility, and therefore loss, function should be bounded, which it is not in the papers just mentioned. A complete and satisfactory description of the form utility should take when the consequences are entirely monetary has been given by Pratt (1964). His concern is mainly with the phenomenon known as risk aversion which persuades a decision-maker not to accept a monetarily fair
36 This will be discussed in § 9 under the title of a regulation problem.
gamble. He discusses how this could be measured and the implications this has for the form of u(z), the utility of a monetary amount z. Strong arguments are adduced for using the measure −u″(z)/u′(z), primes denoting differentiation. If this is to decrease with z we have a condition on the third derivative which is not satisfied by quadratic utilities in any region. A utility function which satisfies these requirements and at the same time is likely to be analytically tractable is

u(z) = w(1 − e^{−az}) + (1 − w)(1 − e^{−bz})

for a, b > 0 and 0 ≤ w ≤ 1. No work appears to have been done on the consequences of using this type of loss function in statistical problems.
Little work appears to have been done on robustness in the experimental design problem described in § 4. A useful contribution is that of Antelman (1965) who discusses the choice of size of a single sample and provides two inequalities which respectively relate the expected loss using one size with that of the optimum size, and the division of loss between terminal loss and sampling costs.
8. Multiparameter problems. In the collection of counterexamples included in § 3 no mention was made of Stein's (1956, 1962a) interesting work on the multivariate normal mean because it fits more naturally into the material now to be described. Stein showed that if x = (x₁, x₂, ..., x_m) and x_i ~ N(θ_i, 1), these distributions being independent, and if we require to estimate θ = (θ₁, θ₂, ..., θ_m) with squared-error loss Σ(e_i − θ_i)² for estimate e_i of θ_i, then, provided m > 2, e_i = x_i is inadmissible (see after (2.1)). Hence in dimensions greater than 2 the sample mean is not a satisfactory estimate of the population mean for normal distributions, where by satisfactory we mean according to the canons of the sampling-theory school of statistics.37 The question considered in this section is, how far does Stein's result influence the Bayesian analysis and, in particular, what is a reasonable Bayes estimate of θ_i. It can be argued that the result is of little interest in Bayesian statistics because it uses, in the criterion of admissibility, ideas which are alien to that approach in their use of the sample space, thereby violating the likelihood principle. Hill (1969) has produced a theory of least squares within the Bayesian framework which suggests using the sample mean. Another discussion is that of Box and Draper (1965). However the sample space is relevant to preposterior analysis so that there, at least, the result seems to me to be of significance. Also an alternative Bayesian analysis suggests estimates of the same general character as those proposed by Stein to avoid the difficulties just mentioned.
The particular model (X, Θ, p(x|θ)) described in the last paragraph fits within the general linear model defined at the beginning of § 4, and in the discussion there we saw that a Bayesian argument can be given to support the usual orthodox procedure: in particular to justify the use of x_i as an estimate of θ_i. However, that justification supposes that the θ_i are, prior to x, uniformly distributed. This distribution is improper in the sense that it does not have a finite integral over Θ.
37 Joshi (1969) has established similar results for the ordinary confidence sets in this context.
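Stein's result is easily observed in simulation. A minimal sketch (parameters mine; the shrinkage factor 1 − (m − 2)/Σx_i² is the standard James–Stein form, an estimate of the same general character as those discussed below) comparing risks:

```python
import numpy as np

rng = np.random.default_rng(1)
m, reps = 10, 20000
theta = rng.normal(0.0, 1.0, size=m)  # a fixed true mean vector

se_mle, se_js = 0.0, 0.0
for _ in range(reps):
    x = theta + rng.normal(size=m)  # x_i ~ N(theta_i, 1)
    # James-Stein: shrink x towards 0 by a data-determined factor.
    js = (1.0 - (m - 2) / np.dot(x, x)) * x
    se_mle += np.sum((x - theta) ** 2)
    se_js += np.sum((js - theta) ** 2)

print("risk of e_i = x_i :", se_mle / reps)   # close to m = 10
print("risk of James-Stein:", se_js / reps)   # strictly smaller
```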
It turns out that such distributions are suspect and that therefore the Bayesian argument that uses them is of doubtful value. We therefore pause to discuss these suspicious features. The ideas derive from the work of Buehler (1959), from personal discussions with him at a conference in 1970 and from Cornfield (1969). An operational way of assessing a probability is through the study of relevant gambles (see § 2). Suppose then we have a density p(θ|x) of θ, given data x. A gamble will be accepted or rejected according as its expected utility is positive or negative, and will be judged fair if this quantity is zero. If, with p(θ|x), a gamble is proposed which yields a consequence of utility u(θ, x) if θ obtains, it will be fair if

∫ u(θ, x)p(θ|x) dθ = 0.   (8.1)
Equivalently this may be rewritten by Bayes' theorem as

∫ u(θ, x)p(x|θ)p(θ) dθ = 0.
Now consider a fixed value of θ, say θ₀. The expected yield of the gamble, given θ₀, is

∫ u(θ₀, x)p(x|θ₀) dx.   (8.2)
If p(θ) is proper, that is, if ∫ p(θ) dθ = 1, then (8.2) cannot be nonnegative for all θ₀
and positive for a set of θ's of positive (prior) probability, for if so we could integrate with respect to θ, reverse the orders of integration and contradict (8.1). In particular (8.2) cannot be positive for all θ₀. But if p(θ) is improper this can happen. Consider the following example. Suppose θ is an integer (positive or negative). Let

p(x|θ) = ½ for x = θ − 1 and x = θ + 1

(the dominating measure being counting measure). Hence x is equally likely to be one more or one less than θ. Suppose p(θ) uniform over the integers. Then (8.1) requires

u(n − 1, n) + u(n + 1, n) = 0,

so that if we write u(n − 1, n) = a_n we must have u(n + 1, n) = −a_n. Expression (8.2) is then, with θ₀ = n, equal to

½(a_{n+1} − a_{n−1}).

Consequently if a_n is an increasing, bounded38 function, this is nonnegative. Hence if improper distributions are introduced there is a danger that one could make statements that are fair for any data value, but prior to the observation, have positive expectation. (Equally negative expectation could be arranged, so that one would expect to lose for every parameter value.)
38 To avoid difficulties associated with unbounded utilities.
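The example may be checked numerically. A sketch with the particular choice a_n = arctan n (mine; any increasing bounded function would do), verifying that the gamble is fair given every observed x yet has positive expected yield for every θ₀:

```python
import math

a = lambda n: math.atan(n)  # increasing and bounded

# Utility u(theta, x) when theta obtains and x was observed:
# u(x - 1, x) = a(x) and u(x + 1, x) = -a(x), so that the gamble is
# fair under the posterior (uniform prior: theta = x - 1 or x + 1,
# each with probability 1/2).
def u(theta, x):
    return a(x) if theta == x - 1 else -a(x)

for theta0 in [-3, 0, 5, 100]:
    # Given theta0, x is theta0 - 1 or theta0 + 1, each with chance 1/2:
    # the yield is (a(theta0 + 1) - a(theta0 - 1))/2 > 0.
    yield_given_theta0 = 0.5 * (u(theta0, theta0 - 1) + u(theta0, theta0 + 1))
    fair_given_x = 0.5 * (u(theta0 - 1, theta0) + u(theta0 + 1, theta0))
    print(f"theta0={theta0:4d}: yield given theta0 = {yield_given_theta0:.4f},"
          f" expected utility given x = {fair_given_x:.1f}")
```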
This seems entirely unacceptable and violates the basic axiom concerning "called-off" bets. Another example, this time in connection with the random-effects, one-way analysis of variance model, has been given by Stone and Springer (1965) who show that indiscriminate use of improper priors in that situation can lead to posterior distributions which are unacceptable. The essence of the difficulty lies in the reversal of orders of integration.39
In passing it might be noticed that these remarks impinge on the work of Fraser (1968). Although not within the Bayesian framework many of his interesting results are equivalent to those obtained by a Bayesian argument using improper (invariant) priors. In particular his model x = θ + e, where e is an error with a known distribution, may be rephrased as a location-parameter problem and his solution is equivalent to a Bayesian one with uniform prior on θ. (The example just given is a special case.) If his probability statements about θ have a betting interpretation, then the above criticism might seem to apply. I have discussed Fraser's work elsewhere (Lindley (1969b)).
We return now to the usual linear model described at the beginning of § 3. We have seen that the estimates can be inadmissible, and that the Bayesian justification rests on the use of improper priors. We now consider a different formulation of the least squares problem which might avoid the difficulties (Lindley (1969a)). Let x = (x₁, x₂, ..., x_n) and let θ₁ be a vector of s₁ parameters, with

E(x|θ₁) = A₁θ₁,   (8.3)

A₁ being a (known) design matrix of order n × s₁. Let x have a multivariate normal distribution with known dispersion matrix C₁. (In § 3 C₁ was taken to be φI with φ unknown; suffixes have been added for a reason which will appear in a moment.) The usual least squares estimates are

θ̂₁ = (A₁′C₁⁻¹A₁)⁻¹A₁′C₁⁻¹x,   (8.4)

where (A₁′C₁⁻¹A₁)⁻¹ is their dispersion matrix (containing posterior moments in the Bayesian interpretation, sampling moments in the orthodox explanation). Now it often happens in applications of these results that the θ's themselves have a structure which would influence the choice of prior. For example, in a Latin square design they divide themselves into three groups corresponding to rows, columns and treatments. A simpler example will be pursued in detail below. We consider the case where this structure is expressed by a similar linear model. Specifically suppose

E(θ₁|θ₂) = A₂θ₂,   (8.6)

with (known) matrix A₂, θ₂ being a vector of s₂ hyperparameters. Let θ₁ have a
39 This also disposes of an old problem. If m is the median of a continuous distribution we have p(x > m) = ½. On the evidence of one value of x can we say p(m < x) = ½? The intuitive answer is "yes" but the above analysis shows that it must be "no".
multivariate normal distribution with known dispersion matrix C₂. (The role of the suffixes should now be clear: 1 refers to the data distribution, 2 to the parameter distribution.) This process may be repeated. In all applications so far met it can be concluded at a third stage which expresses the structure of the hyperparameters by supposing

E(θ₂) = A₃μ,   (8.7)

where A₃ and μ are both known, μ containing s₃ elements. The hyperparameters are again supposed normal with known dispersion matrix C₃. Notice that provided the elements of C₂ and C₃ are finite this distribution of θ₁ prior to x is proper. Since the structure, expressed through (8.3), (8.6) and (8.7), is linear, the dispersion matrices known and the distributions normal, it is a simple matter to calculate the posterior density of the parameters given the data. It is normal with mean

D₁{A₁′C₁⁻¹x + (C₂ + A₂C₃A₂′)⁻¹A₂A₃μ}   (8.8)

and dispersion matrix D₁, with

D₁⁻¹ = A₁′C₁⁻¹A₁ + (C₂ + A₂C₃A₂′)⁻¹.

These results take a simpler form in the case where the diagonal elements of C₃ → ∞. Strictly this leads to impropriety, but the dimensionality of C₃, namely s₃, is typically much smaller than s₁ (in many applications s₃ = 1) so that it is reasonable to expect the difficulties not to be so serious as in the form of § 4; alternatively one can take C₃ to be large in (8.8). Then the mean is

θ₁* = D₀A₁′C₁⁻¹x   (8.10)

with dispersion matrix D₀, where

D₀⁻¹ = A₁′C₁⁻¹A₁ + C₂⁻¹ − C₂⁻¹A₂(A₂′C₂⁻¹A₂)⁻¹A₂′C₂⁻¹.   (8.11)

Consequently in this analysis the estimates (8.10) replace the usual least squares estimates given by (8.4). The former may be written in terms of the latter through the relation

θ₁* = θ̂₁ − D₀Fθ̂₁, where F = C₂⁻¹ − C₂⁻¹A₂(A₂′C₂⁻¹A₂)⁻¹A₂′C₂⁻¹,

so that D₀F essentially provides a correction term to the least squares value. A special case of a correction of this form has been suggested by Hoerl and Kennard (1970a, 1970b) in connection with what they call "ridge regression", though their primary justification is not Bayesian.
As a first example of these ideas, consider the simple situation described at the beginning of this section where x_i ~ N(θ_i, 1). It is first necessary to describe the distribution of the θ's, equation (8.6).
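A small numerical sketch of these formulas (design, dispersions and data all hypothetical; it implements (8.4), (8.10) and (8.11) above for a one-way layout with exchangeable means):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s1 = 12, 4
A1 = np.kron(np.eye(s1), np.ones((n // s1, 1)))  # one-way layout
A2 = np.ones((s1, 1))                            # exchangeable theta_i
C1 = np.eye(n)                                   # data dispersion
C2 = 4.0 * np.eye(s1)                            # dispersion about A2*theta2
x = A1 @ np.array([1.0, 2.0, 3.0, 8.0]) + rng.normal(size=n)

inv = np.linalg.inv
ls = inv(A1.T @ inv(C1) @ A1) @ A1.T @ inv(C1) @ x   # least squares (8.4)

# Limiting posterior mean (8.10) with D0 from (8.11).
F = inv(C2) - inv(C2) @ A2 @ inv(A2.T @ inv(C2) @ A2) @ A2.T @ inv(C2)
D0 = inv(A1.T @ inv(C1) @ A1 + F)
est = D0 @ A1.T @ inv(C1) @ x

print("least squares:", np.round(ls, 3))
print("Bayes (8.10): ", np.round(est, 3))  # shrunk towards a common mean
```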
In many applications it seems reasonable to suppose that they are exchangeable. For example, suppose the data are all measurements on the yields of wheat, i referring to the variety; then one's prior views may well satisfy the requirements of exchangeability described in § 6; though if one of the varieties was a control, this might not be true. On the other hand if some of the data referred to wheat and the rest to tomatoes, exchangeability would probably be unreasonable.40 If exchangeability is assumed for any s₁ we may use de Finetti's general theorem and regard the parameters as coming from a distribution. Supposing this to be normal, the above results can be applied with E(θ_i) = ξ and var(θ_i) = σ², involving a single hyperparameter ξ. Using the approximate forms of (8.10) and (8.11) we easily obtain the estimate of θ_i to be

θ_i* = (x_i + σ⁻²x̄)/(1 + σ⁻²),   (8.12)

where x̄ = Σx_i/s₁, the overall mean of the x's. The form of this estimate is a weighted average of the mean for θ_i and the overall mean, the weights being equal to the precisions of the x_i and θ_i respectively. This is of the same form, though with different weights, as the estimate proposed by Stein (1962a). The idea of such a general shift towards the overall mean is not new. It occurs in a celebrated formula in educational statistics due to Kelley, and in factor analysis; I believe it has also been suggested in genetical applications.
An obvious defect in the general analysis is the assumption throughout of known dispersion matrices. In the example this shows itself in the occurrence of σ² in (8.12). It is possible to generalize the results to allow for unknown dispersions but we content ourselves here with indicating the results in special cases.41 Consider the one-way analysis of variance situation with x_ij ~ N(θ_i, σ_W²), i = 1, 2, ..., m; j = 1, 2, ..., n_i, σ_W² denoting the within variance. Combining an assumption of exchangeability on the means, together with one of normality on the consequential distribution, we obtain θ_i ~ N(μ, σ_B²), say, where σ_B² is a between variance. This approach therefore blurs the distinction between Model I and Model II analyses; the between distribution which is regarded as part of the likelihood in Model II appears in mathematically the same way as part of the parameter distribution in Model I.42 Write w_i = n_iσ_B²/(n_iσ_B² + σ_W²); then the estimate (posterior mean) of θ_i is

θ_i* = w_ix̄_i. + (1 − w_i)μ*,   (8.13)

with

μ* = Σw_jx̄_j./Σw_j,

in a simple generalization of (8.12).
40 This point is also relevant in empirical Bayes procedures to be discussed in § 12.1 below. The approach here is closely related to those methods but the justification for the form of the distribution for the parameters is entirely different.
41 A detailed account of these ideas is currently being written by Lindley and A. F. M. Smith, based partly on work done by the former author in connection with some educational problems suggested by Novick.
42 Notice, however, that in Model II the emphasis is usually on the estimation of variance components, whereas we are concerned with the means.
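A sketch of the weighted average (8.13) for an unbalanced layout (group means, sizes and the two variances all hypothetical and, for simplicity, taken as known):

```python
import numpy as np

# Posterior means (8.13) for the one-way layout with known variances:
# w_i = n_i*sB2/(n_i*sB2 + sW2); theta_i* = w_i*xbar_i + (1 - w_i)*mu*,
# with mu* = sum(w_i*xbar_i)/sum(w_i).
xbar = np.array([4.1, 5.6, 5.0, 9.8])   # group sample means
n = np.array([2, 8, 20, 3])             # group sizes
sB2, sW2 = 1.5, 4.0                     # between and within variances

w = n * sB2 / (n * sB2 + sW2)
mu_star = np.sum(w * xbar) / np.sum(w)
theta_star = w * xbar + (1 - w) * mu_star

# Small groups are shrunk hardest towards the overall weighted mean.
for xi, ni, ti in zip(xbar, n, theta_star):
    print(f"n={ni:2d}  xbar={xi:4.1f}  estimate={ti:5.2f}")
```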
This estimate still depends on knowing the variances. If these are to be estimated as well it is necessary to state their distribution prior to the data. Here difficulties arise since the usual improper prior for a variance, namely density proportional to σ⁻², yields an improper posterior distribution for σ_B². The point has been examined by Hill (1965) and Novick (1969). The former provides a complete discussion, but with the main emphasis on the estimation of the variances, and we shall return to it at the end of this section. Box and Tiao (1968a) deal with the case n_i = n, all i, and use a prior density43 proportional to σ_W⁻²(nσ_B² + σ_W²)⁻¹. This has the objection that it depends on n and its generalization to unequal n_i is unclear. It seems better to use the (proper) prior in which σ_B² and σ_W² are independent with ν_tλ_t/σ_t² distributed as χ² on ν_t degrees of freedom (t = B, W). The distribution of θ posterior to x is no longer normal44 and we content ourselves here with considering only its modal value.
The modes of joint distributions have an interesting property first pointed out to me by A. F. M. Smith. Let p(θ, φ) be the joint density of two parameters θ and φ, and suppose, as usual, the modal values obtained by solving the equations ∂p(θ, φ)/∂θ = ∂p(θ, φ)/∂φ = 0. The first of these may be written p(φ) · ∂p(θ|φ)/∂θ = 0 or equivalently, if p(φ) ≠ 0, ∂p(θ|φ)/∂θ = 0. Consequently the modal value for θ can be found by substituting into the modal value for fixed φ, the modal estimate for φ. Applying this result here the estimate of θ_i is given by (8.13) with σ_B², σ_W² replaced by their modal estimates. These are found to be respectively

σ_B*² = {Σ_i(θ_i* − μ*)² + ν_Bλ_B}/(m + ν_B + 2)  and  σ_W*² = {S² + Σ_i n_i(x̄_i. − θ_i*)² + ν_Wλ_W}/(N + ν_W + 2),
where S² = Σ_{i,j}(x_ij − x̄_i.)² is the usual within sum of squares and N = Σn_i. These values contain the estimates θ_i* of θ_i so that the modal values have to be found by successive iteration. Notice the occurrence of the terms ν_tλ_t from the prior knowledge and also that the within variance is not estimated from S² alone: since S² involves deviations from the sample mean x̄_i., which is no longer the estimate of the true mean θ_i, the deviation of x̄_i. from θ_i* has to be included. The estimate of the between variance has an advantage over the usual sampling-theory estimate (based on subtracting an estimate of σ_W² from an estimate of nσ_B² + σ_W² when n_i = n for all i) in that it cannot be negative. These ideas can be generalized to more complicated situations. They are relevant to an interesting example given by Efron (1970).
A further extension is possible. It is usual in sampling-theory statistics to assume some variances to be equal: here, for example, to assume the same within variance for each group (corresponding to i in the above notation). The reason for this is that there is no generally accepted orthodox solution to the Behrens-Fisher problem, that is, to the problem of estimating or testing the difference between two
43 This density has been used by these authors in several papers; though undoubtedly convenient analytically the disadvantages here discussed also apply there.
44 The form is that of a product of t-distributions. These have been discussed by Tiao and Zellner (1964a, b) in connection with the Bayesian estimation of multivariate regression. Further illuminating discussion is provided by Dickey (1967a, b).
means of normal distributions when the corresponding variances are unequal. (A Bayesian solution—but with improper priors—agrees with Fisher's; see Lindley (1965) and Patil (1964).) But although difficulties arise in the Bayesian approach when the variances are unequal, they can be overcome and reasonable estimates obtained. In summary, the argument runs as follows. Suppose now that x_ij ~ N(θ_i, φ_i), all these distributions being independent, φ_i being the within variance for the ith group. The assumption, derived partly from exchangeability, that θ_i ~ N(μ, σ_B²) can be retained, and similar considerations of exchangeability applied to the variances φ_i suggest that they too might be supposed to be independent and identically distributed. Mathematical convenience suggests supposing the distribution implied here to be such that νσ²/φ_i is χ² on ν degrees of freedom for appropriate adjustable hyperparameters ν and σ², this independently of the means. The formulas become somewhat more complicated, though the modal values for θ_i remain, as in (8.13), weighted averages of x̄_i. and an overall mean, the weights w_i being n_iσ_B²/(n_iσ_B² + φ_i*), where φ_i* is an estimate of φ_i. The form of this latter estimate is again a weighted average, this time of the within sum of squares for the ith group and the harmonic mean of the estimates. In other words, just as with the means, there is a shift of the within variances to a common value, the difference being that the center towards which they are shifted is a harmonic, and not arithmetic, average.
A feature of analyses of this type is the assumption of some form of exchangeability, or more generally, of some structure in the prior information. This structure is of a comparatively weak form—it does not, for example, specify any ranges of reasonable values for the parameters—nevertheless, in conjunction with the data, it has a substantial effect upon the final (posterior) analysis. I conjecture that this may turn out to be an important feature of Bayesian analyses, particularly when the dimensionality of the parameter space is large. It seems to me that our understanding of probability distributions in dimensions much above three is seriously inadequate and that interesting advances are possible using Bayesian inference in this field. We return briefly to this theme when discussing the estimation of the dispersion matrix of a multivariate normal distribution in § 12.3.
We have explained that the above model is essentially the same as that used in Model II analysis of variance within the orthodox framework, the main difference being that the latter considers the variances rather than the means, as we have done. The variance problem has been discussed by Hill (1965). The orthodox argument usually deals with unbiased estimation and in this case can lead to negative estimates for σ_B² (since σ_W² and σ_W² + nσ_B² are estimated unbiasedly from the sums of squares, and σ_B² obtained by subtraction). Using prior distributions of inverse χ² form for the two components he discusses the posterior distributions with great care and shows how the Bayesian approach avoids the difficulty. Actually when a negative, orthodox estimate is obtained there is a strong suggestion that the model is at fault, so in a second paper Hill (1967) introduces correlated errors and reexamines the situation. The analysis is not too robust, as Tiao and Tan (1965, 1966) have discovered. Related material will be found in Tiao (1966), Tiao
and Box (1967) and Tiao and Draper (1968). These last five papers use the ignorance priors of Jeffreys (see § 12.4) mentioned above.
9. Tolerance regions and predictive distributions. We have seen (§ 3) that confidence intervals (or regions) play no part in Bayesian analysis because they violate the likelihood principle and even the simplest of them leads to incoherence. Nevertheless they remain an ingenious attempt to make a valid inference statement, and can be thought of as obtained by changing from the Bayesian p(θ̃ ∈ R|x) to p(θ ∈ R̃|θ), the only alteration being in the random variation used, notationally expressed by the position of the tilde, and the conditioning event. Tolerance intervals represent an extension of the confidence concept from a statement about the parameter to one about a future observation, or a set of future observations. With data x (usually a random sample, x₁, x₂, ..., x_n) it is desired to make a statement about additional data y (often x_{n+1}). If θ were known, a region R could be found such that p(y ∈ R|θ) = c, say, a given number usually near one.45 But θ being unknown the region R has to depend on what is known, x, and a tolerance region R(x) is one for which p(y ∈ R(x)|x, θ) ≥ c for all θ ∈ Θ. But not all x will typically have this property. Let G(R(·), c) be the set of all x ∈ X that do. Then the requirement is imposed that

p{x ∈ G(R(·), c)|θ} ≥ q for all θ ∈ Θ.   (9.1)

In words, a proportion q of times the data x will produce a region R(x) which will contain a proportion c at least of future observations. These tortuous statements are difficult to comprehend and are replaced in the Bayesian analysis by a single statement that is much simpler. Green (1969) disagrees. It is easy to see the form of this statement, for y is the unknown, x the data, and consequently we require p(y|x), the density of y given x. The integral over R provides a region of probability content
∫_R p(y|x) dy = c, say. It is the orthodox school's inability to calculate this distribution that necessitates the more involved approach. From a Bayesian viewpoint p(y|x), referring to observables, is more fundamental than statements referring to the unobserved θ (compare the discussion of de Finetti's work on outliers in § 7). The density may be calculated as follows in the usual case where, given θ, the x_i are independent and identically distributed, x = (x₁, x₂, ..., x_n) and y = x_{n+1}:

p(y|x) = ∫ p(y|θ)p(θ|x) dθ,   (9.2)

in terms of the densities for the individual values and p(θ).
45 The sampling-theory school with its interpretation of probability in terms of long-run proportions often refers not to one future observation but to the proportion c of outcomes of future replicates. The change needed is only linguistic.
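For normal sampling the integral (9.2) is available in closed form; a sketch (known variance and a vague uniform prior assumed for simplicity, illustrative data) contrasting the predictive density with the plug-in density that ignores the uncertainty in θ:

```python
import numpy as np

def norm_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Predictive density (9.2) for x_i ~ N(theta, sigma2), sigma2 known,
# with a vague (uniform) prior for theta: p(theta|x) is N(xbar,
# sigma2/n), and integrating gives y|x ~ N(xbar, sigma2*(1 + 1/n)).
x = np.array([2.3, 1.9, 2.8, 2.1, 2.4])  # illustrative sample
sigma2 = 1.0
xbar, n = x.mean(), len(x)

y = np.linspace(-1, 6, 8)
predictive = norm_pdf(y, xbar, sigma2 * (1 + 1 / n))
plug_in = norm_pdf(y, xbar, sigma2)  # treats theta as known

# The predictive density is flatter: it carries the extra variance
# sigma2/n coming from not knowing theta.
for yi, p1, p2 in zip(y, predictive, plug_in):
    print(f"y={yi:5.2f}  predictive={p1:.4f}  plug-in={p2:.4f}")
```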
p(y|x) is often termed the predictive density of y, given x. Guttman and Tiao (1964) call it the density of a future observation. In the case of sampling from the simple exponential density (4.23) it is easy to verify that the predictive density is H(y)K_N(t)/K_{N+1}(t + y), where K_N(t) is given by (4.24) and the rest of the notation is defined in Example 3 of § 4. Guttman (1967) uses the predictive density in goodness-of-fit problems, comparing the predictions with the values actually obtained, in a mixture of Bayesian and orthodox arguments.
Just as there is some considerable latitude in the selection of a confidence region, there are some problems in the choice of a tolerance region. For example, if R = X we can have c = 1 and so in some way we have (as with confidence regions) to consider the "size" of R and make it small in some sense. Again the Bayesian procedure is obvious, for the choice of R is essentially a decision problem and hence a utility function is needed. Since y is the unknown which influences the situation we require a function of R and y. Let V(R, y) denote the utility of region R when the value of the future observation is y. Then in accordance with Bayesian principles of coherence, R is chosen to maximize expected utility, that is, to maximize

∫ V(R, y)p(y|x) dy,   (9.3)
giving R as a function of x. The integral in (9.3) may be rewritten

∫∫ V(R, y)p(y|θ) dy p(θ|x) dθ = ∫ U(R, θ)p(θ|x) dθ.

Here

U(R, θ) = ∫ V(R, y)p(y|θ) dy

is the more usual form of utility function, in terms of the decision, R, and the parameter θ. In applications V(R, y) is probably the more convenient to use. The general problem of Bayesian tolerance regions has been discussed in detail by Aitchison (1964, 1966) and Aitchison and Sculthorpe (1965), and the above discussion is developed from those papers. They describe such regions for many standard distributions and discuss an application with a utility function in which
K(R), a given function, describes the cost of R and λ₁, λ₂, with λ₂ > λ₁ > 0, express quantitatively the success of choosing R correctly to include y.
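As an illustration of choosing R by maximizing (9.3), a sketch with a utility of the general Aitchison type (the constants and the predictive distribution are my own choices): a gain λ if the future y falls in R = [r₁, r₂], less a cost proportional to the length of R:

```python
import numpy as np
from scipy import stats

# Choose R = [r1, r2] to maximize E{V(R, y)|x} as in (9.3), with
# V(R, y) = lam*1(y in R) - cost*(r2 - r1), and an assumed predictive
# distribution y|x ~ N(2, 1.2), purely for illustration.
pred = stats.norm(loc=2.0, scale=np.sqrt(1.2))
lam, cost = 1.0, 0.25

grid = np.linspace(-2.0, 6.0, 401)
cdf = pred.cdf(grid)
best, best_R = -np.inf, None
for i in range(len(grid)):
    for j in range(i + 1, len(grid)):
        eu = lam * (cdf[j] - cdf[i]) - cost * (grid[j] - grid[i])
        if eu > best:
            best, best_R = eu, (grid[i], grid[j])

# The optimum endpoints satisfy lam*p(y|x) = cost, so the region is a
# highest-predictive-density interval.
print("optimal R:", np.round(best_R, 2), " expected utility:", round(best, 3))
```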
This work has been extended by Dunsmore (1966, 1968, 1969) who, for example, uses, in the case where y is one-dimensional, a utility depending on
an interval R = [r₁, r₂] and two positive constants ξ and η. A further extension considered by Dunsmore is to the problem of calibration. Here, on each of a number of objects, two observations are made; one is precise and the other relatively imprecise. On the basis of these data it is required to calibrate the instrument used to make the latter observations, that is, for an imprecise observation on another object to say what the precise one would be. Mathematically let x and y denote the precise and imprecise sets respectively, where (x_i, y_i) is the pair of observations on the ith object, i = 1, 2, ..., n. (The roles of x and y should not be confused with those in the previous paragraphs.) Let p(y_i|x_i, θ) be the likelihood governing the relationship between the measurements—for example, a simple assumption would be y_i ~ N(α + βx_i, σ²) with θ = (α, β, σ²). Let y be the measurement on an additional object and x the corresponding, unknown, precise value. We require to estimate x, given y, x and y. That is, in Bayesian terms, we need p(x|x, y, y) ∝ p(x, y, y|x); the denominator is simply the integral of the numerator over x so it will suffice to consider the latter. This may be evaluated as follows:

p(x, y, y|x) = ∫ p(y, y|x, x, θ)p(θ|x, x)p(x|x) dθ.
Consider the three probabilities in this integral. The first factorizes into the familiar form involving (n + 1) contributions to the likelihood. In most situations the precise observations, on their own, give no information about the relationship between the two sets, so that p(θ|x, x) is simply p(θ). The form taken by the last term depends upon the situation. If the x_i have been selected they contain no information about x and p(x|x) = p(x), the (prior) distribution of the major unknown in the problem. At the other extreme if the x_i are a random sample from some distribution and x is a further such sample, p(x|x) is the predictive density discussed earlier in this section, and may be evaluated in terms of the density p(x_i|φ) dependent on some parameter φ. These last two cases (which are not the only possibilities) need to be carefully distinguished since the analyses that result are substantially different. A recent discussion has been given by Hoadley (1970) who shows that sampling-theory ideas lead to some conceptual difficulties and provides explicit Bayes answers with numerical illustrations.
The special case of calibration where the x_i (and x) can only assume a set of discrete values, usually finite, is usually described as one of classification, the discrete
sets being referred to as classes. The case where the y_i have a multivariate normal distribution within each of the classes has been most completely discussed by Geisser (1964a) and by Geisser and Desu (1968).
Two other related problems have been described by Dunsmore. In regulation the model x, y, x, y remains unaltered but the problem is to select x so that y is as near as possible to a prescribed value. In optimization the selection of x is to be made to make y as large as possible. Both these are decision problems and require a utility function to be introduced before an answer can be obtained. They are also closely related to, if not identical with, some problems in stochastic control theory.46
10. Multinomial data. In this section we discuss the situation where each of N observations can fall into one of k classes. Given θ = (θ₁, θ₂, ..., θ_k) the observations are independent and the chance of falling into the jth class is θ_j, the same for each observation, θ_j ≥ 0, Σ_{j=1}^{k} θ_j = 1. The data can then be described by the sufficient statistic n = (n₁, n₂, ..., n_k), n_j being the number falling in the jth class, N = Σ_{j=1}^{k} n_j. For fixed N the distribution of n is multinomial, the likelihood being ∏θ_j^{n_j}. The conjugate distribution (§ 4) is Dirichlet, with density proportional to ∏θ_j^{a_j} for suitable hyperparameters a_j, which have to exceed −1 for the density to be proper. The posterior density is of the same form with a_j replaced by a_j + n_j.
The Dirichlet distribution is not too easy to handle—even in the simplest binomial case, k = 2, the Beta distribution is awkward to tabulate—and some approximation seems desirable. In the binomial case the distribution is simply related to the F-distribution which is adequately approximated, for all but small degrees of freedom, by a normal density on applying Fisher's z-transformation. This leads us here to consider log_e(θ₁/θ₂) = log_e{θ₁/(1 − θ₁)}, the log-odds, the result being that this is approximately normal with mean log_e{(a₁ + n₁ + ½)/(a₂ + n₂ + ½)} and variance (a₁ + n₁ + 1)⁻¹ + (a₂ + n₂ + 1)⁻¹. It is common to take a₁ = a₂ = −1, although this leads to an improper prior and the warnings of § 8 apply. But a₁ = a₂ = −½ has some merit (see § 12.4) and then a convenient form for the mean is log_e(n₁/n₂), the sample log-odds, and for the variance, n₁⁻¹ + n₂⁻¹, omitting the extraneous ½'s.
The result may be extended to general k in the following way (see Lindley (1964)). Define a contrast in the log θ_i (all logarithms being to the base e) as Σc_i log θ_i with c_i any constants satisfying Σc_i = 0. The log-odds in the case k = 2 is a special case. Then a convenient approximation to the posterior distribution is obtained by taking any k − 1 linearly independent contrasts Σ_i c_pi log θ_i, p = 1, 2, ..., k − 1, and remarking that they have an approximate multivariate normal distribution with means Σ_i c_pi log n_i and covariances (variances when p = q) Σ_i c_pi c_qi n_i⁻¹.
46 It was my original intention to include as part of this review an account of the work done in stochastic control theory, since this subject is almost entirely Bayesian. However the developments there are so extensive that I did not feel myself able adequately to summarize them, especially when their language is typically that of an engineer and not a statistician. The statistician should, however, be aware of what is going on in this field, and a survey would be useful. In the meantime two valuable texts are those by Aoki (1967) and Sawaragi et al. (1967), neither of which is too linguistically difficult for the statistician to follow.
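The adequacy of the normal approximation to a log contrast is easily checked by sampling the Dirichlet posterior. A sketch (hypothetical counts; the prior exponents a_j = −½ of the text, so the Dirichlet parameters are n_j + ½):

```python
import numpy as np

rng = np.random.default_rng(2)
n = np.array([45, 30, 25])  # illustrative multinomial counts

# Exact: sample the posterior Dirichlet with parameters n_j + 1/2 and
# form the contrast log(theta_1) - log(theta_2).
theta = rng.dirichlet(n + 0.5, size=200000)
contrast = np.log(theta[:, 0]) - np.log(theta[:, 1])

# Approximation from the text: mean log(n1/n2), variance 1/n1 + 1/n2.
print("exact mean/var :", contrast.mean(), contrast.var())
print("approx mean/var:", np.log(n[0] / n[1]), 1 / n[0] + 1 / n[1])
```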
Refinements to this approximation have been given by Bloch and Watson (1967). The proof depends on Fisher's ingenious description of the multinomial distribution as the conditional distribution of n₁, n₂, ..., n_k given Σn_i = N, where the n_i are independent Poisson variables with means ψ_i; then θ_i = ψ_i/Σψ_j. This enables the situation to be described in terms of independent variables with consequent simplification. The multivariate normal density is constant on ellipsoids centered at the mean and, as in the first part of § 4, this fact may be used to construct credible intervals and significance tests for the parameters. The latter are distinct from the more familiar χ² tests but in large samples the two forms are equivalent.47
These ideas easily extend to the comparison of several multinomial distributions and to the analysis of contingency tables. Let θ_ij be the probability of an observation falling into the (i, j)-cell of a contingency table, i = 1, 2, ..., r; j = 1, 2, ..., s, and consider n.. such observations falling independently with n_ij in cell (i, j). Then we may write

∏_{i,j} θ_ij^{n_ij} = ∏_{i,j} φ_ij^{n_ij} ∏_i θ_i.^{n_i.},   (10.1)

where φ_ij = θ_ij/θ_i. and dots replace suffixes over which summation has taken place. In words, the total probability can be factorized into the probability given one of the margins of the table, times the probability for that margin. Typically the two sets of parameters on the right-hand side of (10.1) are independent prior to the data and hence inferences about {φ_ij} can treat the marginal values n_i. as fixed—they are ancillary. The contingency table analysis has then been reduced to the comparison of r multinomials, one for each i. These ideas have also been discussed by Gart (1966) and usefully compared with orthodox approaches. Healy (1969a) indicates some of the difficulties for the latter.
We illustrate with the common case of the 2 × 2 table (r = s = 2). If we wish to measure the association between the two characteristics defining the margins of the table, one convenient48 way is through the parameter

ρ = log_e(θ₁₁θ₂₂/θ₁₂θ₂₁),
which is easily seen to be a contrast in the log θ_ij. Consequently, the posterior distribution is approximately normal with mean equal to the sample equivalent of ρ, namely r = log_e(n₁₁n₂₂/n₁₂n₂₁), and variance Σ_{i,j} n_ij⁻¹. This not only provides a significance test for association but also a credible interval for the strength of the association. The extension to larger tables and to tables of higher dimensions (r × s × t, say) is relatively straightforward though the analysis is essentially equivalent to a nonorthogonal (since the variance contributions n_ij⁻¹ are not constant) analysis of variance. The simplest way to extend to higher dimensions is through a sequence of conditional probabilities. The defining characteristics are put into some order, C₁, C₂, ..., C_m, say, in the case of an m-way table, and
47 The Bayesian approximation, like χ², needs the n_i not to be too small.
48 Edwards (1963) has shown that subject to certain mild conditions it is the only way. General measures of association have been discussed by Altham (1970).
then the dependence of C₂ on C₁, of C₃ on C₂ and C₁, and so on up to C_m on all the rest, studied in order.
Returning to the case of the r × s and, in particular, the 2 × 2 table, it should be noticed that although either margin separately can be regarded as ancillary there is no reason for thinking that together they are ancillary. Indeed in certain degenerate cases (e.g., n₁. = N, n₂. = 0, n.₁ = N, n.₂ = 0) the margins describe the table completely. There is therefore no direct Bayesian justification for Fisher's remarkable "exact" test for a 2 × 2 table. However, Altham (1969) has shown that, with a Dirichlet prior proportional to θ₁₂θ₂₁, p(ρ < 0|data) is just Fisher's "exact" probability that the classical statistician would compute when testing ρ = 0 (against ρ < 0). This prior seems to suggest negative association. The precise relation between the two is discussed in the paper. An alternative test of the hypothesis of no association has been given by Jeffreys (1967).
The Dirichlet prior has been justifiably criticized by Good (1965, 1967). It is, for example, clearly inappropriate when the multinomial classes have been derived by grouping a continuous variable because we feel that the θ_i's are, in some sense, smooth. One way out of the difficulty is to treat the hyperparameters in the Dirichlet form as themselves subject to a distribution. Good discusses the case of a symmetric Dirichlet form with a_i constant, so that the density is proportional to (∏θ_i)^a for some a, and then adds a distribution for a. A significance test for a hypothesis H is found using the ratio of posterior to prior odds as given in (4.27). An application of the Dirichlet prior to the trichotomy, prefer A, prefer B, no preference, is given by Draper et al. (1969).
A study of the effect of misclassification on multinomial data has been made by Press (1968). Discrimination problems have been discussed by Stromberg (1969). A related problem is the estimation of the transition probabilities in a finite-state Markov chain: when the data consist of proportions in the various classes (but not of transitions) the estimation has been investigated by Lee et al. (1968).
11. Asymptotic results. Consider the case where x = (x₁, x₂, ..., x_n), the x_i being independent and identically distributed according to a density p(x_i|θ). In this section we see what can be said as n, usually referred to as the sample size, tends to infinity. A heuristic account of what typically happens is easy, but there are exceptional cases and rigorous proofs seem rather involved.
We have seen that in solving a decision problem the Bayesian needs to evaluate (equation (4.5))

max_d ∫ U(d, θ)p(θ|x) dθ,
or, equivalently, using Bayes' theorem and omitting an irrelevant p(x),

max_d ∫ U(d, θ)p(x|θ)p(θ) dθ.   (11.1)
For the purpose of asymptotic analysis, where only p(x|θ) changes with n, we may
write U(d, θ)p(θ) = w(d, θ), a weight function for d and θ. Write x_n = (x₁, x₂, ..., x_n) and

L_n(θ) = Σ_{i=1}^{n} log p(x_i|θ),
the logarithm of the likelihood for θ, given x_n. Then (11.1) becomes

max_d ∫ w(d, θ) exp{L_n(θ)} dθ.
The heuristic argument then requires the log-likelihood to be expanded in a power series in θ about the m.l. value θ̂_n with the result that, in the one-dimensional case for illustration,

L_n(θ) = L_n(θ̂_n) − ½nc(θ̂_n)(θ − θ̂_n)² + ...,   (11.4)

where

c(θ̂_n) = −n⁻¹ d²L_n(θ)/dθ²,   (11.5)

evaluated at the m.l. value. Similarly expanding w(d, θ), the dominant term in the maximization is

w(d, θ̂_n) ∫ exp{L_n(θ)} dθ,
and consequently one chooses that d which maximizes w(d, θ̂_n); in other words one acts as if the m.l. value were the true value of θ. Further terms in the expansion can be obtained at the expense of tedious algebra (Lindley (1961a)).
The argument just presented also tells us something about the behavior of the posterior distribution and so is useful in inference as opposed to decision problems. The expansion of the log-likelihood given in (11.4), combined with the fact that p(θ) does not depend on n, shows that the posterior distribution of θ is approximately normal with mean θ̂_n and variance {nc(θ̂_n)}⁻¹. The method of maximum likelihood is therefore shown to be a reasonably coherent technique, at least in large samples. One slight difference from the method as it is usually employed is the use of the unexpected second derivative (equation (11.5)) rather than the expected (over x_n) value, Fisher's information function. This is inevitable since the use of the latter would violate the likelihood principle. But the difference involved here is slight and the point is more of theoretical than practical interest.
The argument just presented does little more than indicate what might happen. As we have already shown (particularly in §§ 3 and 7), there are important exceptional cases that arise out of practical problems and are not just the product of a pure mathematician's fertile imagination. It therefore becomes important to have a precise statement of the conditions under which the above results are valid and a description of what happens when they do not obtain. These appear to be surprisingly difficult to find and there is room for improvement.
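The quality of the normal approximation can be examined in a case where the exact posterior is available. A sketch (binomial sampling with a uniform prior, illustrative data): the posterior is Beta(r + 1, n − r + 1), while the approximation is normal with mean θ̂_n = r/n and variance {nc(θ̂_n)}⁻¹ = θ̂_n(1 − θ̂_n)/n:

```python
import numpy as np
from scipy import stats

# Exact posterior versus the large-sample normal approximation for a
# Bernoulli parameter: the observed information per observation is
# c(thetahat) = 1/{thetahat*(1 - thetahat)}.
for n in [10, 100, 1000]:
    r = int(0.3 * n)  # illustrative data with thetahat = 0.3
    thetahat = r / n
    exact = stats.beta(r + 1, n - r + 1)
    approx = stats.norm(thetahat, np.sqrt(thetahat * (1 - thetahat) / n))
    grid = np.linspace(0.01, 0.99, 99)
    err = np.max(np.abs(exact.cdf(grid) - approx.cdf(grid)))
    print(f"n={n:5d}: max CDF discrepancy = {err:.4f}")  # shrinks with n
```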
conditions which are not too difficult to verify. Let Fₙ(ξ) be the posterior distribution function for (θ − θ̂ₙ){n c(θ̂ₙ)}^{1/2} and Φ(ξ) the standard normal distribution function. Then

|Fₙ(ξ) − Φ(ξ) − φ(ξ) Σⱼ₌₁ᵏ γⱼ(ξ, xₙ)n^{−j/2}| ≤ Mₖ n^{−(k+1)/2}    (11.6)
for sufficiently large n. Here γⱼ(ξ, xₙ) is a polynomial in ξ having coefficients which are calculable; for example, the first of them involves a quantity a₃ₙ(θ̂ₙ) which arises from additional terms in the expansion (11.4), φ(ξ) being the standard normal density. Walker (1969) and Dawid (1970) also provide proofs of the asymptotic normality of the posterior distributions under reasonable conditions but give no further terms in the expansion. In a series of papers LeCam (1958 is a convenient reference) has discussed the asymptotic properties of m.l. and Bayes' estimates but his proofs are, to quote Walker, "decidedly difficult to follow," and the conditions hard to verify. A recent paper by LeCam (1970) reconsiders the assumptions in the m.l. context. Chao (1970) has shown that, under certain conditions, the Bayes', θₙ*, and m.l. estimates, θ̂ₙ, are asymptotically equivalent in the sense that n^{1/2}(θₙ* − θ̂ₙ) → 0 almost certainly for all θ ∈ Θ. Schwartz (1965) had earlier established that a Bayes' estimate is consistent if there exists a consistent estimate. It must, however, be remembered that exceptional cases can arise. A curious counterexample has been given by Freedman (1963) but the pathology here is only on a set of measure zero with respect to p(θ). Berk (1970) has described conditions under which a sequence of posterior distributions converges weakly to a degenerate distribution. In an earlier paper (1966) he has considered this convergence when the model is incorrect. This is an interesting aspect of robustness (§ 7) and he shows that the distribution tends to be confined to a special set in the parameter space. Blackwell and Dubins (1962) have shown how different "priors" tend to the same limit, described as "merging of opinions."
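The normal approximation with mean θ̂ₙ and variance {n c(θ̂ₙ)}⁻¹ is easily checked numerically. The sketch below is purely illustrative: exponential sampling and the particular smooth prior are convenient assumptions, the observed information is obtained by a finite difference as in (11.5), and the exact posterior is computed by quadrature.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 200
x = rng.exponential(scale=1/1.5, size=n)     # x_i ~ theta * exp(-theta * x), true theta = 1.5

def L(t):                                    # log-likelihood L_n(theta)
    return n * np.log(t) - t * x.sum()

def log_prior(t):                            # a smooth prior, p(theta) proportional to theta * exp(-theta)
    return np.log(t) - t

theta_hat = minimize_scalar(lambda t: -L(t), bounds=(1e-6, 50), method='bounded').x
h = 1e-4                                     # observed information n c(theta_hat), as in (11.5)
nc = -(L(theta_hat + h) - 2 * L(theta_hat) + L(theta_hat - h)) / h**2

Z, _ = quad(lambda t: np.exp(L(t) + log_prior(t) - L(theta_hat)), 1e-9, 50)
posterior = lambda t: np.exp(L(t) + log_prior(t) - L(theta_hat)) / Z

# exact posterior density against N(theta_hat, 1/nc) at a few points
for t in theta_hat + np.array([-0.2, 0.0, 0.2]):
    print(round(t, 3), round(posterior(t), 3), round(norm.pdf(t, theta_hat, nc**-0.5), 3))
```

With n = 200 the two densities agree to two decimal places over the bulk of the distribution; shrinking n shows the approximation degrading as the heuristic argument would suggest.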
The material so far discussed has been concerned with estimation-type problems in which p(θ) is smooth. A different type of asymptotic theory arises when dealing with "sharp" hypotheses as described at the end of § 4. We use the notation of the earlier discussion, the decision formulation given there, and the associated weight functions. If, for illustration, we take the case where both ξ and η are real numbers, expansion (11.6) can be used in connection with H and a bivariate generalization of it applied
to H̄ to show that H should be accepted if a certain inequality, (11.8), holds between the weighted maximized likelihoods under H and H̄. In it Δₙ is the determinant of a matrix whose typical element is, in generalization of (11.5),

cᵢⱼ(θ̂ₙ) = −n⁻¹ ∂²Lₙ(θ)/∂θᵢ ∂θⱼ,

evaluated at the m.l. values ξ̂ₙ, η̂ₙ, and η̂ₙ* is the m.l. estimate of η under H, that is, with ξ = ξ₀. This procedure is therefore equivalent to the usual likelihood-ratio test. There is, however, an interesting distinction between this test as usually performed and the Bayesian test, (11.8), which appears when we study the way the test is affected by n, the sample size. Denote the likelihood-ratio, the left-hand side of (11.8), by Λ(xₙ). Then (11.8) may be written in a form, (11.9), with −2 log Λ(xₙ) on the left-hand side and, on the right, a term involving the weight functions and Δₙ together with a log n term,
where a few obvious abbreviations in notation have been adopted. In the sampling-theory approach the left-hand side is treated as χ² and H accepted if it is less than some constant obtained from tables of that distribution in terms of a significance level. Now it is easy to verify that the first term on the right-hand side of (11.9) tends, as n → ∞, to a constant, so that the main difference between the two approaches lies in the log n term in (11.9),⁴⁹ which does not appear in the orthodox argument. Another way of appreciating the difference is to remark that the two approaches can be brought into asymptotic agreement if the orthodox statistician changes his significance level with n. Detailed calculations (Lindley (1961a)) show that the level should be asymptotically proportional to (n log n)^{−1/2} for equivalence; in particular it should decrease with n. The phenomenon was earlier noticed by Jeffreys (1967). Some decrease may be observed in practice, otherwise large samples will usually give significant results. Notice that this decrease is necessitated by the need to be coherent in the judgments for different values of n.

⁴⁹ There is another difference in that the constant in the sampling-theory approach comes from a significance level, whereas in (11.9) it comes from weight functions, but the distinction can be treated as linguistic.
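The effect of the log n term can be seen in a small numerical sketch. Here the sampling distribution N(μ, 1), the sharp hypothesis μ = 0 and the smooth N(0, 1) prior under the alternative are all merely illustrative assumptions; the sample mean is held exactly on the two-sided 5% significance boundary while n grows.

```python
import numpy as np
from scipy import stats

# H: mu = 0 against a smooth prior N(0, 1) for mu; x_i ~ N(mu, 1).
# Keep the sample mean exactly on the 5% two-sided significance boundary
# and watch the odds in favour of H grow with n.
for n in [10, 100, 1000, 10000, 100000]:
    xbar = 1.96 / np.sqrt(n)                  # always "just significant at 5%"
    # marginal density of xbar: N(0, 1/n) under H, N(0, 1 + 1/n) under the alternative
    bf = stats.norm.pdf(xbar, 0, np.sqrt(1/n)) / stats.norm.pdf(xbar, 0, np.sqrt(1 + 1/n))
    print(n, round(bf, 2))
```

The printed ratio grows roughly like n^{1/2}, so that data which an orthodox statistician with a fixed significance level would always declare "just significant" come, for large n, to support the sharp hypothesis: precisely the incoherence the decreasing level repairs.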
The next, and last, section of this review contains a variety of material. Some of it deals with subjects which are of importance but concerning which the Bayesian argument has little to say. The remainder is devoted to special topics which do not conveniently fit into the other sections.

12.1. Empirical Bayes and multiple decision problems. Let us make it clear at the outset that, despite their title, hardly any empirical Bayes procedures that have been suggested are Bayesian. The problems concern a sequence (Xᵢ, Θᵢ) of sample and parameter spaces with the xᵢ independent given the parameters, and typically with the densities p(xᵢ|θᵢ) having the same functional form for all i.
The parameters θᵢ are themselves supposed to be a random sample from some (unknown) distribution and this fact accounts for the reference to Bayes in the title. On the basis of observation of a finite sequence (x₁, x₂, ⋯, xₙ) it is desired to make inference statements about θₙ (or generally about all the θᵢ). Ordinarily the data, apart from xₙ, provide no information about θₙ, but the assumption that the parameters form a random sample (and this is the novelty in the problem) means that this is no longer so. The multiple decision problem is similar in that there is a sequence (Xᵢ, Dᵢ, Θᵢ) of decision problems with their associated losses. Under certain conditions, puzzling to an orthodox statistician, the ith decision may be improved by using data xⱼ (j ≠ i): the yield of wheat is better estimated by using additional data on tomatoes. A good account of the material, together with a bibliography, has been provided by Maritz (1970).

The Bayesian formulation of the problem is clear, though its execution is none too simple. If the distribution from which the θ's come is unknown, then this is described by a probability distribution over the class of such distributions. Let this class be described by a parameter ξ ∈ Ξ, say, with density p(θ|ξ) for the generation of the θ's and a (prior) density p(ξ) for the new parameter. The probability structure for all the quantities is then simply

p(x, θ, ξ) = ∏ᵢ p(xᵢ|θᵢ) ∏ᵢ p(θᵢ|ξ) p(ξ),    (12.1)
and consequently,

p(θ|x) ∝ ∏ᵢ p(xᵢ|θᵢ) ∫ ∏ᵢ p(θᵢ|ξ)p(ξ) dξ.    (12.2)
The integral that occurs in this expression is the link that serves to connect the various values of θᵢ and causes xⱼ to be used to make inferences about θᵢ. If all probability distributions of the θ's are considered, Ξ is a complicated space and p(ξ) is hard to describe—a point to be discussed below—but if we are prepared to restrict the class of distributions to a normal or some other exponential (say) class, the description is possible and the analysis can proceed. The extension to the multiple decision problem is straightforward. If there is a subjective assessment that the θᵢ's are exchangeable (hence providing the apparent random sample), then we may proceed as indicated by (12.2); if not, then the problems should be dealt with separately. This explains why multiple decision procedures are appropriate when comparing several varieties of wheat but are not when wheat and tomatoes are both being considered. The varieties of wheat might reasonably be exchangeable but this is hardly true for wheat and tomatoes. The reader will notice that the discussion of multiparameter problems in § 8 uses the ideas just mentioned and the solutions described therein may be thought of as solutions to multiple decision problems. A Bayesian treatment of some of these problems has been provided by Rolph (1968). He considers the case where the densities p(xᵢ|θᵢ) are polynomials, so that the distribution p(ξ) can be expressed in terms of the moments of the distribution. The following simpler example may assist the understanding. Let p(xᵢ|θᵢ) be
N(θᵢ, 1), i = 1, 2, ⋯, n, with θᵢ = ±1, and write p(θᵢ = +1|ξ) = ξ. In words, we wish to see whether the means of normal distributions are plus or minus one, each mean having the same chances at the two values. A decision formulation can be included by introducing decisions d₁, d₂, ⋯, dₙ with l(dᵢ, θᵢ) = 0 if dᵢ = θᵢ and 1 otherwise, the losses (i = 1, 2, ⋯, n) being additive. The distribution conjugate to that of the θ's is Beta, so that it is convenient to take p(ξ) ∝ ξᵃ(1 − ξ)ᵇ for selected a and b. Then easy calculation from (12.2) shows that

p(θ|x) ∝ exp(Σᵢ xᵢθᵢ) B(r + a + 1, n − r + b + 1),
where r is the number of θᵢ equal to +1 and B denotes the Beta function, the latter factor providing the connecting link between the n otherwise separate problems. From this distribution the expected loss for any set of decisions can be calculated and hence the optimum decision found. The following numerical example illustrates what happens. Suppose n = 2, x₂ = −2 and we want to select d₁. Then the "natural" d₁, considering that problem on its own, is to select d₁ = +1 if and only if x₁ > 0. Within the formulation above and with a = b = 0 the optimum decision is to select d₁ = +1 if and only if x₁ > 0.35. The point is that the relatively large negative value of x₂ suggests ξ is small and consequently more evidence than usual is needed to convince one that θ₁ = +1, and the decision is only taken when x₁ exceeds 0.35 rather than the natural value, zero.
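The calculation behind this example is easily mechanized. A minimal sketch (the exhaustive enumeration over (±1)ⁿ and the grid search are purely illustrative devices, and a, b are taken to be non-negative integers so that the Beta function can be written with factorials):

```python
import numpy as np
from math import exp, factorial
from itertools import product

def posterior_theta1_plus(x, a=0, b=0):
    # p(theta|x) proportional to exp(sum x_i theta_i) * B(r+a+1, n-r+b+1),
    # theta_i = +-1, r = number of theta_i equal to +1.
    n = len(x)
    def beta_int(r):
        return factorial(r + a) * factorial(n - r + b) / factorial(n + a + b + 1)
    num = den = 0.0
    for theta in product([1, -1], repeat=n):
        r = sum(t == 1 for t in theta)
        w = exp(sum(xi * ti for xi, ti in zip(x, theta))) * beta_int(r)
        den += w
        if theta[0] == 1:
            num += w
    return num / den

# The n = 2, x2 = -2 example: find where P(theta_1 = +1 | x) crosses 1/2.
grid = np.linspace(0.0, 0.6, 601)
thresh = next(x1 for x1 in grid if posterior_theta1_plus([x1, -2.0]) > 0.5)
print("select d1 = +1 once x1 exceeds about", round(thresh, 2))   # a little above 0.33
```

With the 0-1 losses above, d₁ = +1 is optimal exactly when P(θ₁ = +1|x) > ½, so the printed crossing point, a little above 0.33, reproduces the shift away from the natural value zero described in the text.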
12.2. Nonparametric statistics. This is a subject about which the Bayesian method is embarrassingly silent. We have used the word parameter to index the class of probability distributions for the data. However its use is usually restricted to the case where the parameter space is Euclidean and the parameter (index) is a finite set of real numbers. All the examples we have discussed in the review are of this type. Nonparametric statistics usually, though there are variations, refers to cases where the parameter space is more complicated than Euclidean. For example, if X is the real line, then θ may have to index all continuous distributions on that line and Θ is effectively the space of all such distributions. It is then possible to make inferences about, say, the mean of x, without restricting rather sharply the distributions being considered. However the Bayesian method requires a probability distribution to be specified over Θ, and if this space is as complicated as even that of all continuous distributions on the real line no convenient way of doing this is known, apart from some useful work by Rolph (1968) using the moments. Until this technical problem is overcome progress is difficult if not impossible. It is perhaps worth stopping to remark that the problem is a technical one; the Bayesian method embraces nonparametric problems but cannot solve them because the requisite tool is missing. It is an interesting commentary on the present state of pure mathematics that although it can handle many aspects of most complicated spaces it cannot deal with these descriptive problems. On the real line we have several ways of describing a distribution: for example, the distribution function serves to generate probabilities for all Borel sets. But no such result appears to be available in the space of all continuous distributions. The general is often easier than the particular. One possibility is to describe the distributions over Θ indirectly; for example, by saying that, given x₁, x₂, ⋯, xₙ, the distribution of xₙ₊₁ is such that xₙ₊₁ is equally likely to fall into any of the (n + 1) intervals determined by these values. Hill (1968) has shown that this particular device is unfortunately not possible.

Although particular problems cannot be solved the Bayesian viewpoint can say a little about current nonparametric methods. Most of these are based on the use of a statistic whose sampling distribution does not depend on, or is insensitive to, the exact form of the permissible underlying distributions on the null hypothesis. Typically a tail area of the sampling distribution is selected, using informal considerations of alternative hypotheses, and a significance test of the null value obtained. The obvious remark is that this is in direct violation of the likelihood principle and is therefore suspect. It is hard to see how widening the class of distributions from, say, an appropriate exponential family to all continuous distributions could necessitate entirely new considerations, namely the other values of x besides the one that actually arose. It is equally hard to see how these methods could, in some sense, be approximations to Bayesian ones, though Good (1967) claims to see some connection between tail areas and posterior odds.

I am often asked what I do in a practical situation that seems to call for nonparametric methods. My answer is that, if at all possible, I try to fit it into a parametric (in the usual sense) framework, for example, by using normal scores. This has an added advantage of enabling one to use interval and other estimates and not just make significance tests, which are the mainstay of the nonparametricians. Another way is to group the data. It is worth pointing out that if X is finite, then the class of all probability distributions over X is described by points in the simplex (p₁, p₂, ⋯, pₙ: pᵢ ≥ 0, Σpᵢ = 1) and (prior) distributions can be described and developed. Thus the classic χ²-test has its Bayesian counterpart. The conjugate family is Dirichlet and is not always appropriate, but the difficulties of the general case are substantially reduced. An interesting alternative discussion of "goodness-of-fit" is given by Guttman (1967) who compares the observed values with the predicted values from the predictive density (§ 9).

Some work has been done on a truly Bayesian approach to nonparametric problems apart from Hill's work, already mentioned. In a pair of curiously neglected papers Whittle (1957, 1958) has discussed the estimation of a density function and the closely related problem of estimating a spectral density. In both these situations the intuitive idea is that we want to "smooth" the empirical form. Whittle proposes describing this smoothing by a correlation between nearby ordinates of the density function and obtains estimates in terms of such correlations. Whilst it cannot be expected that second order moments will always provide a realistic description of prior knowledge, the idea goes a long way towards an adequate expression and produces interesting answers. The idea has been pursued further by Dickey (1968a, 1969) and Kimeldorf and Wahba (1970) using the fashionable notion of splines. Hartigan (1969) has attempted to derive linear Bayesian methods that parallel the orthodox linear methods and, like them, do not involve specific distributional assumptions. Altham (1969) has developed
a method for investigating whether one distribution is stochastically larger than another when X is finite but ordered. If one distribution is described by {θᵢ} (in the simplex discussed above) and the other by {φᵢ}, she calculates the posterior probability that the first is stochastically larger than the second, using the Dirichlet prior (which therefore ignores any possible smoothing effect).
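A calculation of this kind is easily imitated by simulation. In the sketch below the ordered-category counts are invented, a symmetric Dirichlet prior with unit hyperparameters is assumed, and Monte Carlo sampling stands in for the exact integration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ordered-category counts for two independent samples.
counts_x = np.array([2, 5, 9, 4])
counts_y = np.array([6, 7, 4, 3])

# Posteriors under independent symmetric Dirichlet(1, ..., 1) priors.
theta = rng.dirichlet(counts_x + 1, size=100_000)
phi = rng.dirichlet(counts_y + 1, size=100_000)

# {theta_i} stochastically larger than {phi_i}: every cumulative sum no bigger.
F_theta = np.cumsum(theta, axis=1)[:, :-1]   # final column is always 1; drop it
F_phi = np.cumsum(phi, axis=1)[:, :-1]
p_dominates = np.mean(np.all(F_theta <= F_phi, axis=1))
print("P(first stochastically larger | data) =", round(p_dominates, 3))
```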
12.3. Multivariate statistics. The Bayesian work on multivariate statistics has been confined to the (multivariate) normal distribution, where some interesting problems arise. We begin by establishing some notation. Consider a random sample of size n from a p-variate normal distribution of mean μ and dispersion matrix Σ. Then the likelihood is proportional to

|Σ|^{−n/2} exp[−½n{(x̄ − μ)′Σ⁻¹(x̄ − μ) + tr(Σ⁻¹S)}],    (12.3)
where x̄, S are respectively the sample mean and the sample dispersion matrix and are consequently sufficient statistics. The conjugate distributions to (12.3) involve normal and Wishart components and the resulting theory has been carefully discussed by Ando and Kaufman (1965) and Evans (1965). If the dispersion matrix is known, the estimation of μ has effectively been considered in the multiparameter material of § 8, and it was seen that there is some reason for thinking that x̄ is not a good estimate. The estimation of μ when all its components, μᵢ, are known to be equal has been discussed by Geisser (1965) (see also Zacks (1970a)). Incidentally, it is often easier to work in terms of the precision matrix Σ⁻¹, rather than Σ, its properties being somewhat simpler to describe. With μ known only the trace term remains in the exponential in (12.3) and the conjugate family is Wishart, when expressed through the precision matrix. Special cases that have been considered are prior densities proportional⁵⁰ to |Σ|^{v/2} dΣ⁻¹, or |Σ|^{v/2} dμ dΣ⁻¹ if μ is unknown, for various values of v. These distributions are improper. Geisser and Cornfield (1963) have discussed the resulting posterior distributions. Considering first the mean μ, they show that with v = p + 1 the result is Hotelling's T² and the consequent credible intervals are the same as confidence intervals based on that distribution considered as a sampling distribution. The case v = 2 leads to the Fisher-Cornish fiducial distribution, the margins of which are Student's t. Furthermore, there is no value of v that will lead to Hotelling's distribution for all means and Student's for any single mean. Within that class of priors therefore a person who uses both t and T² is incoherent. One suspects that the result is always true, that is, no prior can give both t and T². They also go on to discuss in the bivariate case the posterior distribution of the variances, covariances and correlation coefficient and, for example, fail to get agreement with Fisher's famous result for the last-named. Brillinger (1963) has extended this

⁵⁰ These are densities for the precision matrix, not the dispersion matrix, and to make this clear the differential element has been included.
result and shown that the joint fiducial distribution of σ₁, σ₂ and ρ (in an obvious notation) is not a Bayes posterior for any prior. The difficulties are reduced in the "uniform" case where all the correlations are equal (see Geisser (1964b)).

Geisser (1965) has continued the discussion of the case v = p + 1. Let s₁₁ denote the sum of squares for the first of the p variables. He shows that s₁₁/σ₁₁ is χ² on n − p degrees of freedom, which is different from the parallel sampling-theory result that s₁₁/σ₁₁ is χ² on n − 1 degrees of freedom. Furthermore, let s₁₁.₂₃⋯ₚ denote the residual sum of squares for the first variable allowing for the remaining p − 1. Then s₁₁.₂₃⋯ₚ/σ₁₁.₂₃⋯ₚ is χ² on n − 1, compared with the sampling-theory result that s₁₁.₂₃⋯ₚ/σ₁₁.₂₃⋯ₚ is χ² on n − p. A more general form of this comparison has been given by Dempster (1963). Suppose that it is required of our inference that it be linearly invariant, that is, if the inference about Σ depends on S through a function H(S), then the inference about CΣC′ is similarly given by H(CSC′) for any nonsingular C. Then he shows that this implies that b′Σ⁻¹b/b′S⁻¹b is stochastically larger than a′Sa/a′Σa, which contradicts the sampling-theory result in which the inequality is the other way round, both quantities being χ² but on n − p and n − 1 degrees of freedom respectively. The case of 101 observations on each of 100 variables shows that the difference is far from trivial. In Geisser's case the invariant inference differs by a factor of 100 from the usual one. Dempster concludes his paper by saying that "the question of what constitutes a reasonable posterior inference about Σ appears to me to be wide open". Essentially the same point is discussed by Healy (1969b) who notes that it is possible to have a bivariate situation in which both x̄₁ and x̄₂ give significant departures from zero, say, when judged by Student's t but the pair (x̄₁, x̄₂) are not significantly different from (0, 0) when Hotelling's T² is used. Fraser and Haq (1969) argue in favour of |Σ|^{(p+1)/2} in the theory of structural inference and another justification is given by Villegas (1969) using invariance. Ignoring these difficulties and using |Σ|^{−3/2}, Tiao and Fienberg (1969) consider the estimation of latent roots and vectors in the bivariate situation. Seven years later Dempster's judgment still seems correct. My own guess is that, as with the means, some form of exchangeability assumption is needed and invariance sacrificed.

12.4. Invariance theories. The discussion of the multivariate normal distribution suggests that the search for invariant procedures should be conducted with care because, both with the mean and the dispersion matrix, the results are not entirely satisfactory. The work of Brown (1966) within the sampling-theory framework shows that invariant estimates are typically inadmissible in dimensions greater than two. The point is connected with the use of improper prior distributions discussed at the beginning of § 8, because the invariant prior distributions that produce the invariant procedures are typically improper. For example, in dealing with scale and location parameters, μ and σ, say, the only prior density invariant under changes of origin and scale is proportional to σ⁻¹. Generally, one of the Haar invariant measures will provide the appropriate prior distribution. A detailed discussion has been given by Stone and von Randow (1968) and the earlier work of Brillinger (1963) is relevant.
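In the simplest case the invariant prior dμ dσ/σ for a normal sample gives a posterior for μ which is Student's t on n − 1 degrees of freedom, so that credible and confidence intervals coincide. A sketch verifying this by simulation (the data, seed and sample size are arbitrary assumptions; the posterior is sampled by the standard composition σ² | x then μ | σ², x):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 15
x = rng.normal(3.0, 2.0, size=n)
xbar, s = x.mean(), x.std(ddof=1)

# Under p(mu, sigma) proportional to 1/sigma:
#   sigma^2 | x ~ (n-1) s^2 / chi^2_{n-1},   mu | sigma, x ~ N(xbar, sigma^2/n).
sig2 = (n - 1) * s**2 / rng.chisquare(n - 1, size=200_000)
mu = rng.normal(xbar, np.sqrt(sig2 / n))

# The standardized posterior draws should match Student's t on n - 1 d.f.
t_draws = (mu - xbar) / (s / np.sqrt(n))
print(np.quantile(t_draws, [0.025, 0.975]))       # simulated 95% interval
print(stats.t.ppf([0.025, 0.975], df=n - 1))      # exact t quantiles
```

The two pairs of quantiles agree to Monte Carlo accuracy; the difficulties described above arise when one asks the analogous question in higher dimensions.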
Even if improper prior distributions are not beyond reproach they have often been defended (by myself amongst others) on the grounds that they are reasonable approximations to a real, proper prior which would be much harder to handle. Stone (1963, 1965) has explored what is meant here by "a reasonable approximation" and the results are both interesting and surprising. Suppose there is a sequence {pₙ(θ)} of prior densities converging to a limiting density p(θ), called the quasi-prior, the likelihood p(x|θ) being fixed.⁵¹ This generates a sequence of posterior densities pₙ(θ|x) ∝ p(x|θ)pₙ(θ) and for fixed x we might ask whether pₙ(θ|x) → p(θ|x) ∝ p(x|θ)p(θ). If so it seems reasonable to use p(θ) instead of pₙ(θ) since, for large enough n, the effect on the posterior distribution will be small. But this is for a given x, and only after the data has been obtained can the validity of the approximation be considered. Stone therefore argues that one should take plim_{n→∞} pₙ(θ|x) and ask whether this is p(θ|x), the sequence being according to the marginal measure pₙ(x) = ∫ p(x|θ)pₙ(θ) dθ. If it is, then this will mean that, given any ε > 0, there will be a prior (or a value of n) such that the prior probability of obtaining data for which the quasi-prior gives a reasonable approximation to the resulting posterior (for that n) exceeds 1 − ε. For the usual posterior t-distribution for N(μ, σ²) derived from the density dμ dσ/σ as quasi-prior the following (proper) densities seem a reasonable sequence to consider: the density is proportional to dμ dσ/σ over the rectangle μ₁ₙ < μ < μ₂ₙ, 0 < σ₁ₙ < σ < σ₂ₙ. In other words, the pₙ(θ) have been obtained by truncation of p(θ). Stone shows that for the probability limit to behave as described it is necessary that ρ₂ₙ/ρ₁ₙ → ∞, ρ₂ₙ → ∞ and lim inf [log ρ₁ₙ/log ρ₂ₙ] ≥ 0, where ρᵢₙ = (μ₂ₙ − μ₁ₙ)/σᵢₙ. In particular, it is not enough that μ₁ₙ → −∞, μ₂ₙ → +∞, σ₁ₙ → 0 and σ₂ₙ → +∞. In the multivariate case discussed in § 12.3 the quasi-prior with v = 2 and p > 1 never provides a suitable limit for a wide class of proper priors, so that the Fisher-Cornish argument is unsatisfactory (Stone (1964)).

The relationship between invariance and the likelihood principle has been interestingly discussed by Hartigan (1967) who argues that the principles apply to different problems, which carries us back to the idea briefly mentioned in § 1 that there may be other modes of inference valid beside the Bayesian one in certain cases—for example, when alternatives are not naturally available.

A rather different form of invariance has been proposed by Jeffreys (1967). To understand this it is first necessary to distinguish between subjective and objective probabilities. The subjective view holds that all probabilities are expressions of personal beliefs. The objective view holds that there exist "reasonable degrees of belief" that any person ought to have on the basis of a given set of data. It is outside the scope of this review to discuss the philosophical bases of these two viewpoints. The only aspect that need concern us here is that, accepting the objective attitude, it will make sense to talk of a reasonable degree of belief given no data; in other words, to produce a probability distribution corresponding to the notion of ignorance. This is what Jeffreys tries to do (see also § 12.6). Suppose we have a
⁵¹ The limiting operation is therefore quite different to that discussed in § 11.
likelihood p(x|θ) and such an objective prior p(θ). Then the likelihood could equally be written in terms of φ, a one-to-one transform of θ, which in turn would yield its objective prior q(φ). Jeffreys' invariance requirement is simply that q(φ) dφ = p(θ) dθ; that is, that the procedure for producing the prior should not depend on the parameterization employed. For a single real parameter θ an appropriate prior has density I(θ)^{1/2}, where

I(θ) = −E{∂² log p(x|θ)/∂θ²},    (12.4)

Fisher's information function. The idea has been extended in the case of the exponential family by Huzurbazar (1955). Unfortunately these ideas, apparently attractive in the case of a single real parameter, do not appear to work too well in the case of several parameters. For example, in the common linear hypothesis situation, the posterior t-distribution for a single parameter which results from the use of Jeffreys' prior has the same number of degrees of freedom irrespective of the number of fitted parameters. This seems unacceptable when compared with the usual practice of losing a single degree of freedom for each additional constant. A more serious objection to my mind lies in the fact that the ignorance prior depends on the likelihood function. Let θ denote some real quantity, such as the mass of a particular object. Experimenter I proposes measuring θ using an instrument which gives measurements x having distribution N(θ, 1). Experimenter II advocates a radioactive procedure giving data y having a Poisson distribution of mean θ. With prior given by (12.4), Experimenter I will take θ to be uniform whereas II will use a density proportional to θ^{−1/2}. Why should one's knowledge, or ignorance, of a quantity depend on the experiment being used to determine it? Incidentally, if this view is accepted it shows a danger in developing "ready-made" Bayesian analyses in which θ is just a parameter. The subjective view would hold that θ corresponds to a real quantity and one's opinion about it depends on this reality.

12.5. Comparison of Bayesian and orthodox procedures. Some aspects of this have been discussed in §§ 3 and 4. Here we describe briefly some of the studies that have been made, mostly with a view to providing a further justification for orthodox procedures by using a Bayesian argument. Lindley (1958) considered the problem of estimating a real-valued parameter in the case where there is a sufficient statistic that is also real, so that both X and Θ are R¹. In this case the fiducial and confidence intervals agree. He showed that unless θ was a location parameter for x, or separate transformations of θ and of x could be made to make this so, the fiducial distribution was not a posterior distribution for any prior. Hence, as we have already seen, the confidence (and fiducial) arguments are incoherent. This is most easily displayed by remarking that if the total data is obtained in two stages, the final result will depend on the order of the stages, save in the transformable location case. Welch and Peers (1963) and Welch (1965) extend the discussion and consider asymptotic results as n, the sample size, tends to infinity. To order n^{−1/2} they show that it is always possible to reach agreement by using the Jeffreys' prior from (12.4).
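Requirement (12.4) is easy to verify mechanically for the two experiments described in § 12.4. A sketch using symbolic differentiation (the helper function is, of course, an illustrative construction, not part of any standard library):

```python
import sympy as sp

theta, x = sp.symbols('theta x', positive=True)

def jeffreys(log_density):
    # I(theta) = -E[d^2 log p(x|theta) / d theta^2], as in (12.4).  Both
    # log-likelihoods below are linear in x, so the expectation over x is
    # taken simply by substituting E(x) = theta.
    d2 = sp.diff(log_density, theta, 2)
    return sp.sqrt(sp.simplify(-d2.subs(x, theta)))

# Experimenter I: x ~ N(theta, 1).
print(jeffreys(-(x - theta)**2 / 2))          # prints 1: theta "uniform"
# Experimenter II: x ~ Poisson(theta); log p(x|theta) = x log(theta) - theta - log(x!).
print(jeffreys(x * sp.log(theta) - theta))    # prints 1/sqrt(theta)
```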
Equivalently this amounts to saying that if we transform from θ to φ = ∫^θ I(t)^{1/2} dt, then φ will effectively be a location parameter. This transformation is interesting because it is often used to stabilize a sampling variance (compare the use of √x if x is Poisson). The proper view here is that it is designed to produce a parameter φ about which the information (in Fisher's sense) is approximately constant. Peers (1965) showed that the argument does not extend to more than one dimension. Peers (1968) has considered two-sided intervals and found for what priors the Bayesian credible intervals have the correct coverage property, to order n⁻¹, demanded by orthodox theory. He uses the relatively invariant priors discussed by Hartigan (1964), the latter author having pointed out (1966) that the original work did not cover the two-sided case.

We have already remarked that whereas the Bayes procedure (in a general problem) uses (see (4.5))

∫ U(d, θ)p(θ|x) dθ,
the orthodox approach considers (see (4.3))

∫ U(d(x), θ)p(x|θ) dx.
A comparison of these two shows that the expected performance of one method is necessarily related to that of the other. For example, if t is an unbiased estimate of θ, E(t|θ) = θ, a concept useful in the sampling-theory approach, we necessarily have E{t − E(θ|x)} = 0, so that, on the average, t is the posterior mean. This type of relationship is explored in detail in a most illuminating paper by Pratt (1965). Bartholomew (1965) has explored the dependence of standard methods on the sampling rule and his discussion is interesting for a Bayesian. A further paper of his (1967) considers the dependence of tests of hypotheses on the sampling rule; a particular aspect of the dependence of that type of inference on the sample space. Durbin (1969) disagrees with some of Bartholomew's conclusions. Kerridge (1963) has shown that if Θ contains a finite number k of values, which have the same probability prior to x, then the frequency with which at the end of sampling the posterior probability of the true value is not greater than α cannot exceed (k − 1)α/(1 − α). A special case was rediscovered by Cornfield (1966). Thatcher (1964) compares Bayesian and confidence limits for the classic "rule of succession" problem. Stein (1962b) has produced an example where confidence and fiducial intervals are very different, but the example has been disputed by Barnard (1962) and Pinkham (1966).
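Kerridge's bound is easy to illustrate by simulation for the special case of a fixed sample size, one particular stopping rule. The values of k, α, the normal means and the sample size below are all arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
thetas = np.array([0.0, 1.0, 2.0, 3.0])      # k = 4 equiprobable values; theta = 0 is true
k, alpha, trials, n = len(thetas), 0.1, 100_000, 3

x = rng.normal(0.0, 1.0, size=(trials, n))   # data generated from the true value
loglik = norm.logpdf(x[:, :, None], thetas, 1.0).sum(axis=1)
post = np.exp(loglik - loglik.max(axis=1, keepdims=True))
post /= post.sum(axis=1, keepdims=True)      # posterior over the k values, uniform prior

freq = np.mean(post[:, 0] <= alpha)          # posterior of the true value small
print(freq, "<=", (k - 1) * alpha / (1 - alpha))
```

The observed frequency falls well below the bound (k − 1)α/(1 − α); a Markov-inequality argument on the sum of likelihood ratios shows why, and extends the result to optional stopping.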
12.6. Information. We have had occasion to mention Fisher's concept of information (see (12.4)). This is not, because of its use of the sample space, acceptable in a Bayesian framework, at least for inference purposes. Furthermore it is not the only possible explicatum for information. Good (1960, 1966) has considered the problem and derived useful results. A recent paper is by Sarndal (1970). Good not only discusses information but related ideas such as weight of evidence and corroboration. If I(H : E|G) is the amount of information concerning H provided by E, given G, he proves on the basis of several reasonable axioms that

I(H : E|G) = log{p(E|H.G)/p(E|G)}.

In our notation, replacing H by θ and E by x, and omitting the general conditions G, we have

I(θ : x) = log{p(x|θ)/p(x)} = log{p(θ|x)/p(θ)}.
This refers to a single value of θ. Taking expectations over θ we have

∫ p(θ|x) log p(θ|x) dθ − ∫ p(θ) log p(θ) dθ,    (12.7)
the difference between the Shannon measures of information before and after the data x. A further expectation of (12.7) over x, with marginal density p(x), leads, after a little manipulation, to the expected information provided by an experiment (X, Θ), namely,

∫∫ p(x, θ) log{p(θ|x)/p(θ)} dθ dx.    (12.8)
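For a model in which the posterior is available exactly, (12.8) can be estimated by Monte Carlo and checked against its closed form. In the normal-normal sketch below the prior N(0, 1) and the two measurement noise levels are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def expected_information(sigma, m=200_000):
    # Monte Carlo estimate of (12.8) for the experiment "observe x ~ N(theta, sigma^2)"
    # under the prior theta ~ N(0, 1), whose posterior is exactly
    # theta | x ~ N(x/(1 + sigma^2), sigma^2/(1 + sigma^2)).
    theta = rng.normal(0.0, 1.0, m)
    x = rng.normal(theta, sigma)
    post_mean = x / (1 + sigma**2)
    post_sd = np.sqrt(sigma**2 / (1 + sigma**2))
    return np.mean(norm.logpdf(theta, post_mean, post_sd) - norm.logpdf(theta, 0.0, 1.0))

# Choosing between two equally expensive instruments by (12.8); the exact value
# for this model is (1/2) log(1 + 1/sigma^2), printed alongside for comparison.
for sigma in [0.5, 2.0]:
    print(sigma, round(expected_information(sigma), 3), round(0.5 * np.log(1 + 1 / sigma**2), 3))
```

The more precise instrument yields the larger expected information, which is how (12.8) would be used to decide between the two designs.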
Both these expressions are useful in the design of experiments. Thus (12.8) provides a number that could be used to decide between a number of experiments, given p(θ), by choosing that one with the largest value. This idea is due to Good and to Lindley (1956). Similarly (12.7) could be used in the design of sequential experiments, data collection proceeding until it had reached a sufficiently high value. An application to the exponential distribution is given by El-Sayyad (1969) and to discrimination between models by Box and Hill (1967), extended by Hill and Hunter (1969). The advantage of these methods over the experimental design procedures discussed in § 4 is that they do not require for their execution a decision space and associated loss function. Many scientific experiments seem to be conducted without any decision problem in mind, rather in the spirit of inference, and these ideas may be relevant there. They cannot, however, provide as complete an account as the full decision-theoretic approach, because adequate information is not defined and costs do not enter. But for a comparison of equally expensive experiments the ideas may have some use. An alternative measure of information based on the expected (over X) posterior variance is suggested by Draper and Guttman (1969) and is used by them to assess the value of prior information.

An ingenious principle for the use of information in developing "ignorance" prior distributions has been put forward by Jaynes (1968) and expounded by Tribus (1969). According to the latter, this consists in choosing that distribution
which maximizes the entropy subject to constraints supplied by the given information. Tribus's stimulating, but mathematically infuriating, book makes out a good case. Another method of deriving an "ignorance" prior has been developed in some detail by Novick (1969). If the likelihood function belongs to the exponential family (4.8) we have seen that a convenient class of distributions over the parameter space is provided by the conjugate family (4.9). These distributions will typically only be proper for restricted ranges of the hyperparameters. Novick uses this family, with hyperparameters chosen outside this range such that the effect of a minimum necessary sample is to take the parameter distribution into the proper range. Such an improper distribution is used as the ignorance prior. It is a matter of judgment what constitutes a minimum necessary sample. In the binomial case, for example, observations of both types are deemed necessary before a proper probability statement is available. The method yields some interesting results, even in the multiparameter case.

12.7. Probability assessments. The argument of § 3 establishes the existence of probabilities for a coherent individual. It does not say how they are to be obtained for a real, coherent man. This is a matter more for a psychologist than a mathematician, though the latter has a role to play, and most of the references appear in psychological journals. I have not, therefore, attempted to survey the literature, important though it is in the practical use of Bayesian ideas. Reference should be made to the work of Winkler (1967) and to the key paper of de Finetti (1965).
Bibliography

The bibliography lists all items referred to in the text; consequently it does not consist entirely of Bayesian material. Neither is it a complete listing of Bayesian papers, though it is reasonably exhaustive on recent material in the usual statistical journals. In compiling it I have been much helped by a bibliography prepared by Robert B. Miller (Technical Report 214 from the Department of Statistics, University of Wisconsin, Madison). A useful bibliography is also to be found in DeGroot's (1970) recent text. To reduce the length the following abbreviations have been used for the commoner statistical journals:

JRSSA (B): Journal of the Royal Statistical Society Series A (B)
Bk: Biometrika
Be: Biometrics
JASA: Journal of the American Statistical Association
AMS: Annals of Mathematical Statistics
Tc: Technometrics
EC: Econometrica

Other journals have had their titles abbreviated according to the scheme described in Mathematical Reviews.

J. AITCHISON (1964) Bayesian tolerance regions, JRSSB, 26, 161-175.
(1966) Expected-cover and linear-utility tolerance intervals, JRSSB, 28, 57-62.
J. AITCHISON AND DIANE SCULTHORPE (1965) Some problems of statistical prediction, Bk, 52, 469-483.
PATRICIA M. E. ALTHAM (1969) Exact Bayesian analysis of a 2 × 2 contingency table, and Fisher's "exact" significance test, JRSSB, 31, 261-269.
(1970) The measurement of association of rows and columns for an r × s contingency table, JRSSB, 32, 63-73.
ALBERT ANDO AND G. M. KAUFMAN (1965) Bayesian analysis of the independent multinormal process—neither mean nor precision known, JASA, 60, 347-358.
F. J. ANSCOMBE (1963) Sequential medical trials, JASA, 58, 365-383.
F. J. ANSCOMBE AND R. J. AUMANN (1963) A definition of subjective probability, AMS, 34, 199-205.
GORDON R. ANTLEMAN (1965) Insensitivity to non-optimal design in Bayesian decision theory, JASA, 60, 584-601.
MASANAO AOKI (1967) Optimization of Stochastic Systems, Academic Press, New York, xv + 354.
P. ARMITAGE (1963) Sequential medical trials: some comments on F. J. Anscombe's paper, JASA, 58, 384-387.
J. C. ARNOLD (1969) A modified technique for improving an estimate of the mean, Be, 25, 588-591.
G. A. BARNARD (1962) Comments on Stein's "A remark on the likelihood principle", JRSSA, 125, 569-573.
(1969) Summary remarks, New Developments in Survey Sampling, N. L. Johnson and H. Smith, Jr., eds., Wiley-Interscience, New York, 696-711.
G. A. BARNARD, G. M. JENKINS AND C. B. WINSTEN (1962) Likelihood inference and time series, JRSSA, 125, 321-372.
G. A. BARNARD, J. C. KIEFER, L. M. LE CAM AND L. J. SAVAGE (1968) Statistical inference, The Future of Statistics, Donald G. Watts, ed., Academic Press, New York, 139-160.
D. J. BARTHOLOMEW (1965) A comparison of some Bayesian and frequentist inferences, Bk, 52, 19-35.
(1967) Hypothesis testing when the sample size is treated as a random variable, JRSSB, 29, 53-82.
M. S. BARTLETT (1957) A comment on D. V. Lindley's statistical paradox, Bk, 44, 533-534.
(1967) Inference and stochastic processes, JRSSA, 130, 457-478.
D. BASU (1964) Recovery of ancillary information, Sankhya Ser. A, 26, 3-16.
THOMAS BAYES (1958) An essay towards solving a problem in the doctrine of chances, Bk, 45, 293-315. (Reprinted from Philos. Trans. Roy. Soc. London, 53 (1763), 370-418, with biographical note by G. A. Barnard.)
ROBERT H. BERK (1966) Limiting behavior of posterior distributions when the model is incorrect, AMS, 37, 51-58.
(1970) Consistency a posteriori, AMS, 41, 894-906.
SAMIR KUMAR BHATTACHARYA (1967) Bayesian approach to life-testing and reliability estimation, JASA, 62, 48-62.
PETER J. BICKEL AND DAVID BLACKWELL (1967) A note on Bayes estimates, AMS, 38, 1907-1911.
PETER J. BICKEL AND JOSEPH A. YAHAV (1969) On an A.P.O. rule in sequential estimation with quadratic loss, AMS, 40, 417-426.
ALLAN BIRNBAUM (1962) On the foundations of statistical inference, JASA, 57, 269-306.
(1970) On Durbin's modified principle of conditionality, JASA, 65, 402-403.
DAVID BLACKWELL AND LESTER DUBINS (1962) Merging of opinions with increasing information, AMS, 33, 882-886.
DANIEL A. BLOCH AND GEOFFREY S. WATSON (1967) A Bayesian study of the multinomial distribution, AMS, 38, 1423-1435.
COLIN R. BLYTH (1970) On the inference and decision models of statistics, AMS, 41, 1034-1049.
G. E. P. BOX AND D. R. COX (1964) An analysis of transformations, JRSSB, 26, 211-252.
GEORGE E. P. BOX AND NORMAN R. DRAPER (1965) The Bayesian estimation of common parameters from several responses, Bk, 52, 355-365.
G. E. P. BOX AND W. J. HILL (1967) Discrimination among mechanistic models, Tc, 9, 57-71.
G. E. P. BOX AND G. C. TIAO (1962) A further look at robustness via Bayes's theorem, Bk, 49, 419-432.
(1964a) A Bayesian approach to the importance of assumptions applied to the comparison of variances, Bk, 51, 153-167.
(1964b) A note on criterion robustness and inference robustness, Bk, 51, 169-173.
(1965) Multiparameter problems from a Bayesian point of view, AMS, 36, 1468-1482.
(1968a) Bayesian estimation of means for the random effect model, JASA, 63, 174-181.
(1968b) A Bayesian approach to some outlier problems, Bk, 55, 119-129.
DAVID R. BRILLINGER (1963) Necessary and sufficient conditions for a statistical problem to be invariant under a Lie group, AMS, 34, 492-500.
ROBERT R. BRITNEY AND ROBERT L. WINKLER (1968) Bayesian point estimation under various loss functions, Amer. Statist. Assoc. Proc. Business and Econom. Statist. Sect. (Pittsburgh, Pa.), 356-364, Amer. Statist. Assoc., Washington, D.C., 1968.
LAURENCE DAVID BROWN (1966) On the admissibility of invariant estimators of one or more location parameters, AMS, 37, 1087-1136.
ROBERT J. BUEHLER (1959) Some validity criteria for statistical inferences, AMS, 30, 845-863.
R. J. BUEHLER AND A. P. FEDDERSEN (1963) Note on a conditional property of Student's t, AMS, 34, 1098-1100.
G. E. G. CAMPLING (1968) Serial sampling acceptance schemes for large batches of items where the mean quality has a normal prior distribution, Bk, 55, 393-399.
PAUL L. CANNER (1970) Selecting one of two treatments when the responses are dichotomous, JASA, 65, 293-306.
M. T. CHAO (1970) The asymptotic behavior of Bayes' estimators, AMS, 41, 601-608.
HERMAN CHERNOFF (1951) A property of some type A regions, AMS, 22, 472-474.
(1961) Sequential tests for the mean of a normal distribution, Proc. Fourth Berkeley Sympos., 1, 79-91, University of California Press, Berkeley.
HERMAN CHERNOFF AND S. N. RAY (1965) A Bayes sequential sampling inspection plan, AMS, 36, 1387-1407.
LEONARD COHEN (1958) On mixed single sample experiments, AMS, 29, 947-971.
THEODORE COLTON (1963) A model for selecting one of two medical treatments, JASA, 58, 388-400.
JEROME CORNFIELD (1965) A note on the likelihood function generated by randomization over a finite set, Bull. Inst. Internat. Statist., 41, 1, 79-81.
(1966) A Bayesian test of some classical hypotheses—with applications to sequential medical trials, JASA, 61, 577-594.
(1969) The Bayesian outlook and its application, Be, 25, 617-657.
D. R. COX (1958) Some problems connected with statistical inference, AMS, 29, 357-372.
A. P. DAWID (1970) On the limiting normality of posterior distributions, Proc. Cambridge Philos. Soc., 67, 625-633.
BRUNO DE FINETTI (1961) The Bayesian approach to the rejection of outliers, Proc. Fourth Berkeley Sympos., 1, 199-210, University of California Press, Berkeley.
(1964) Foresight: its logical laws, its subjective sources, Studies in Subjective Probability, Henry E. Kyburg, Jr. and Howard E. Smokler, eds., Wiley, New York, 93-158. (Translation of La prévision: ses lois logiques, ses sources subjectives, Ann. Inst. H. Poincaré, 7 (1937), 1-68.)
(1965) Methods for discriminating levels of partial knowledge concerning a test item, British J. Math. Statist. Psychology, 18, 87-123.
(1970) Teoria delle probabilità, sintesi introduttiva con appendice critica, Giulio Einaudi, Torino, I, xiv + 347; II, viii + 422.
MORRIS H. DEGROOT (1970) Optimal Statistical Decisions, McGraw-Hill, New York, xvi + 489.
M. H. DEGROOT AND N. STARR (1969) Optimal two-stage stratified sampling, AMS, 40, 575-582.
A. P. DEMPSTER (1963) On a paradox concerning inference about a covariance matrix, AMS, 34, 1414-1418.
(1968) A generalization of Bayesian inference, JRSSB, 30, 205-247.
JAMES M. DICKEY (1967a) Expansions of t densities and related complete integrals, AMS, 38, 503-510.
(1967b) Matricvariate generalizations of the multivariate t distribution and the inverted multivariate t distribution, AMS, 38, 511-518.
(1968a) Smoothed estimates for multinomial cell probabilities, AMS, 39, 561-566.
(1968b) Three multidimensional-integral identities with Bayesian applications, AMS, 39, 1615-1627.
(1969) Smoothing by cheating, AMS, 40, 1477-1482.
JAMES M. DICKEY AND B. P. LIENTZ (1970) The weighted likelihood ratio, sharp hypotheses about chances, the order of a Markov chain, AMS, 41, 214-226.
NORMAN R. DRAPER AND IRWIN GUTTMAN (1968a) Some Bayesian stratified two-phase sampling results, Bk, 55, 131-139.
(1968b) Bayesian stratified two-phase sampling results: k characteristics, Bk, 55, 587-589.
(1969) The value of prior information, New Developments in Survey Sampling, N. L. Johnson and H. Smith, Jr., eds., Wiley-Interscience, New York, 305-325.
NORMAN R. DRAPER AND WILLIAM G. HUNTER (1966) Design of experiments for parameter estimation in multiresponse situations, Bk, 53, 525-533.
(1967a) The use of prior distributions in the design of experiments for parameter estimation in non-linear situations, Bk, 54, 147-153.
(1967b) The use of prior distributions in the design of experiments for parameter estimation in non-linear situations: multiresponse case, Bk, 54, 662-665.
(1969) Transformations: some examples revisited, Tc, 11, 23-40.
NORMAN R. DRAPER, WILLIAM G. HUNTER AND DAVID E. TIERNEY (1969) Which product is better? Tc, 11, 309-320.
I. R. DUNSMORE (1966) A Bayesian approach to classification, JRSSB, 28, 568-577.
(1968) A Bayesian approach to calibration, JRSSB, 30, 396-405.
(1969) Regulation and optimization, JRSSB, 31, 160-170.
J. DURBIN (1969) Inferential aspects of the randomness of sample size in survey sampling, New Developments in Survey Sampling, N. L. Johnson and H. Smith, Jr., eds., Wiley-Interscience, New York, 629-651.
(1970) On Birnbaum's theorem on the relation between sufficiency, conditionality and likelihood, JASA, 65, 395-398.
A. W. F. EDWARDS (1963) The measure of association in a 2 × 2 table, JRSSA, 126, 109-114.
(1970) Estimation of the branch points in a branching diffusion process, JRSSB, 32, 155-174.
WARD EDWARDS, HAROLD LINDMAN AND LEONARD J. SAVAGE (1963) Bayesian statistical inference for psychological research, Psychological Rev., 70, 193-242.
BRADLEY EFRON (1970) Comments on Blyth's paper, AMS, 41, 1049-1054.
G. M. EL-SAYYAD (1967) Estimation of the parameter of an exponential distribution, JRSSB, 29, 525-532.
(1969) Information and sampling from the exponential distribution, Tc, 11, 41-45.
W. A. ERICSON (1965) Optimum stratified sampling using prior information, JASA, 60, 750-771.
(1967a) Optimal sample design with nonresponse, JASA, 62, 63-78.
(1967b) On the economic choice of experiment sizes for decision regarding certain linear combinations, JRSSB, 29, 503-512.
(1969a) Subjective Bayesian models in sampling finite populations, JRSSB, 31, 195-233.
(1969b) A note on the posterior mean of a population mean, JRSSB, 31, 332-334.
(1969c) Subjective Bayesian models in sampling finite populations: stratification, New Developments in Survey Sampling, N. L. Johnson and H. Smith, Jr., eds., Wiley-Interscience, New York, 326-357.
I. G. EVANS (1964) Bayesian estimation of the variance of a normal distribution, JRSSB, 26, 63-68.
(1965) Bayesian estimation of parameters of a multivariate normal distribution, JRSSB, 27, 279-283.
T. S. FERGUSON (1967) Mathematical Statistics, Academic Press, New York, vii + 396.
PETER C. FISHBURN (1969a) A general theory of subjective probabilities and expected utilities, AMS, 40, 1419-1429.
(1969b) Weak qualitative probability on finite sets, AMS, 40, 2118-2126.
RONALD A. FISHER (1956) Statistical Methods and Scientific Inference, Oliver and Boyd, Edinburgh, viii + 175.
D. A. S. FRASER (1968) The Structure of Inference, John Wiley, New York, x + 344.
D. A. S. FRASER AND M. SAFIUL HAQ (1969) Structural probability and prediction for the multivariate model, JRSSB, 31, 317-331.
DAVID A. FREEDMAN (1963) On the asymptotic behavior of Bayes' estimates in the discrete case, AMS, 34, 1386-1403.
DAVID A. FREEDMAN AND ROGER A. PURVES (1969) Bayes' method for bookies, AMS, 40, 1177-1186.
P. R. FREEMAN (1970) Optimal Bayesian sequential estimation of the median effective dose, Bk, 57, 79-89.
JOHN J. GART (1966) Alternative analyses of contingency tables, JRSSB, 28, 164-179.
SEYMOUR GEISSER (1964a) Posterior odds for multivariate normal classifications, JRSSB, 26, 69-76.
(1964b) Estimation in the uniform covariance case, JRSSB, 26, 477-483.
(1965) A Bayes approach for combining correlated estimates, JASA, 60, 602-607.
(1965) Bayesian estimation in multivariate analysis, AMS, 36, 150-159.
SEYMOUR GEISSER AND JEROME CORNFIELD (1963) Posterior distributions for multivariate normal parameters, JRSSB, 25, 368-376.
SEYMOUR GEISSER AND M. M. DESU (1968) Predictive zero-mean uniform discrimination, Bk, 55, 519-524.
V. P. GODAMBE (1955) A unified theory of sampling from finite populations, JRSSB, 17, 269-278.
(1969) Some aspects of the theoretical development in survey-sampling, New Developments in Survey Sampling, N. L. Johnson and H. Smith, Jr., eds., Wiley-Interscience, New York, 27-58.
(1970) Foundations of survey-sampling, Amer. Statist., 24, 33-38.
I. J. GOOD (1950) Probability and the weighing of evidence, Charles Griffin, London, viii + 119.
(1960) Weight of evidence, corroboration, explanatory power, information and the utility of experiments, JRSSB, 22, 319-331. Corrigendum: Ibid., 30 (1968), 203.
(1965) The estimation of probabilities, Research Monograph no. 30, M.I.T. Press, Cambridge, Mass., xii + 109.
(1966) A derivation of the probabilistic explication of information, JRSSB, 28, 578-581.
(1967) A Bayesian significance test for multinomial distributions, JRSSB, 29, 399-431.
(1969) A subjective evaluation of Bode's law and an "objective" test for approximate numerical rationality, JASA, 64, 23-66.
J. R. GREEN (1969) Inference concerning probabilities and quantiles, JRSSB, 31, 310-316.
P. M. GRUNDY, M. J. R. HEALY AND D. H. REES (1956) Economic choice of the amount of experimentation, JRSSB, 18, 32-55.
D. GUTHRIE, JR. AND M. V. JOHNS, JR. (1959) Bayes acceptance sampling procedures for large lots, AMS, 30, 896-925.
IRWIN GUTTMAN (1967) The use of the concept of a future observation in goodness-of-fit problems, JRSSB, 29, 83-100.
IRWIN GUTTMAN AND GEORGE C. TIAO (1964) A Bayesian approach to some best population problems, AMS, 35, 825-835.
A. HALD (1967) Asymptotic properties of Bayesian single sampling plans, JRSSB, 29, 162-173.
(1968a) The mixed binomial distribution and the posterior distribution of p for a continuous prior distribution, JRSSB, 30, 359-367.
(1968b) Bayesian single sampling attribute plans for continuous prior distributions, Tc, 10, 667-683.
A. HALD AND N. KEIDING (1969) Asymptotic properties of Bayesian decision rules for two terminal decisions and multiple sampling, I, JRSSB, 31, 455-471.
J. HARTIGAN (1964) Invariant prior distributions, AMS, 35, 836-845.
J. A. HARTIGAN (1966) Note on the confidence-prior of Welch and Peers, JRSSB, 28, 55-56.
(1967) The likelihood and invariance principles, JRSSB, 29, 533-539.
(1968) Note on discordant observations, JRSSB, 30, 545-550.
(1969) Linear Bayesian methods, JRSSB, 31, 446-454.
H. O. HARTLEY AND J. N. K. RAO (1968) A new estimation theory for sample surveys, Bk, 55, 547-557.
(1969) A new estimation theory for sample surveys, II, New Developments in Survey Sampling, N. L. Johnson and H. Smith, Jr., eds., Wiley-Interscience, New York, 147-169.
M. J. R. HEALY (1969a) Exact tests of significance in contingency tables, Tc, 11, 393-395.
(1969b) Rao's paradox concerning multivariate tests of significance, Be, 25, 411-413.
EDWIN HEWITT AND LEONARD J. SAVAGE (1955) Symmetric measures on cartesian products, Trans. Amer. Math. Soc., 80, 470-501.
BRUCE M. HILL (1965) Inference about variance components in the one-way model, JASA, 60, 806-825.
(1967) Correlated errors in the random model, JASA, 62, 1387-1400.
(1968) Posterior distribution of percentiles: Bayes' theorem for sampling from a population, JASA, 63, 677-691.
(1969) Foundations for the theory of least squares, JRSSB, 31, 89-97.
WILLIAM J. HILL AND WILLIAM G. HUNTER (1969) A note on designs for model discrimination: variance unknown case, Tc, 11, 396-400.
BRUCE HOADLEY (1969) The compound multinomial distribution and Bayesian analysis of categorical data from finite populations, JASA, 64, 216-229.
(1970) A Bayesian look at inverse linear regression, JASA, 65, 356-369.
ARTHUR E. HOERL AND ROBERT W. KENNARD (1970a) Ridge regression: biased estimation for nonorthogonal problems, Tc, 12, 55-67.
(1970b) Ridge regression: applications to nonorthogonal problems, Tc, 12, 69-82.
V. S. HUZURBAZAR (1955) Exact forms of some invariants for distributions admitting sufficient statistics, Bk, 42, 533-537.
EDWIN T. JAYNES (1968) Prior probabilities, IEEE Trans. Systems Science and Cybernetics, SSC-4, 227-241.
HAROLD JEFFREYS (1967) Theory of Probability, 3rd edition (corrected), Clarendon Press, Oxford, xi + 459.
RICHARD A. JOHNSON (1967) An asymptotic expansion for posterior distributions, AMS, 38, 1899-1906.
(1970) Asymptotic expansions associated with posterior distributions, AMS, 41, 851-864.
V. M. JOSHI (1969) Admissibility of the usual confidence sets for the mean of a univariate or bivariate normal population, AMS, 40, 1042-1067.
J. G. KALBFLEISCH AND D. A. SPROTT (1970) Application of likelihood methods to models involving large numbers of parameters, JRSSB, 32, 175-194.
GORDON M. KAUFMAN (1968) Optimal sample size in two-action problems when the sample observations are lognormal and the precision h is known, JASA, 63, 653-659.
D. KERRIDGE (1963) Bounds for the frequency of misleading Bayes inferences, AMS, 34, 1109-1110.
J. KIEFER AND J. WOLFOWITZ (1956) Consistency of the maximum likelihood estimator in the presence of infinitely many nuisance parameters, AMS, 27, 887-906.
GEORGE S. KIMELDORF AND GRACE WAHBA (1970) A correspondence between Bayesian estimation on stochastic processes and smoothing by splines, AMS, 41, 495-502.
CHARLES H. KRAFT, JOHN W. PRATT AND A. SEIDENBERG (1959) Intuitive probability on finite sets, AMS, 30, 408-419.
RICHARD G. KRUTCHKOFF (1969) Classical and inverse regression methods of calibration in extrapolation, Tc, 11, 605-608.
HENRY E. KYBURG, JR. (1969) Probability Theory, Prentice-Hall, Englewood Cliffs, N.J., x + 294.
W. A. LARSEN (1969) The analysis of variance for the two-way classification fixed-effects model with observations within a row serially correlated, Bk, 56, 509-515.
IRVING H. LAVALLE (1968a) On cash equivalents and information evaluation in decisions under uncertainty. Part I: Basic theory, JASA, 63, 252-276.
(1968b) On cash equivalents and information evaluation in decisions under uncertainty. Part II: Incremental information decisions, JASA, 63, 277-284.
(1968c) On cash equivalents and information evaluation in decisions under uncertainty. Part III: Exchanging partition-J for partition-K information, JASA, 63, 285-290.
L. LECAM (1958) Les propriétés asymptotiques des solutions de Bayes, Publ. Inst. Statist. Univ. Paris, 7, 17-35.
(1970) On the assumptions used to prove asymptotic normality of maximum likelihood estimates, AMS, 41, 802-828.
J. A. LECHNER (1962) Optimum decision procedures for a Poisson process parameter, AMS, 33, 1384-1402.
T. C. LEE, G. G. JUDGE AND A. ZELLNER (1968) Maximum likelihood and Bayesian estimation of transition probabilities, JASA, 63, 1162-1179.
E. L. LEHMANN (1950) Some principles of the theory of testing hypotheses, AMS, 21, 1-26.
(1958) Significance level and power, AMS, 29, 1167-1176.
(1959) Testing Statistical Hypotheses, John Wiley, New York, xiii + 369.
D. V. LINDLEY (1953) Statistical inference, JRSSB, 15, 30-76.
(1956) On a measure of the information provided by an experiment, AMS, 27, 986-1005.
(1957) A statistical paradox, Bk, 44, 187-192.
(1958) Fiducial distributions and Bayes' theorem, JRSSB, 20, 102-107.
(1961a) The use of prior probability distributions in statistical inference and decision, Proc. Fourth Berkeley Sympos., 1, 453-468, University of California Press, Berkeley.
(1961b) The robustness of interval estimates, Bull. Inst. Internat. Statist., 38, 4, 209-220.
(1964) The Bayesian analysis of contingency tables, AMS, 35, 1622-1643.
(1965) Introduction to Probability and Statistics from a Bayesian Viewpoint. Pt. 1. Probability. Pt. 2. Inference, University Press, Cambridge, xi + 259, xiii + 292.
(1968) The choice of variables in multiple regression, JRSSB, 30, 31-66.
(1969a) Bayesian least squares, Bull. Inst. Internat. Statist., 43, 2, 152-153.
(1969b) Review of Fraser (1968), Bk, 56, 453-456.
D. V. LINDLEY AND B. N. BARNETT (1965) Sequential sampling: two decision problems with linear losses for binomial and normal random variables, Bk, 52, 507-532.
D. V. LINDLEY AND G. M. EL-SAYYAD (1968) The Bayesian estimation of a linear functional relationship, JRSSB, 30, 190-202.
J. S. MARITZ (1970) Empirical Bayes Methods, Methuen, London, viii + 159.
MELVIN R. NOVICK (1969) Multiparameter Bayesian indifference procedures, JRSSB, 31, 29-64.
V. H. PATIL (1964) The Behrens-Fisher problem and its Bayesian solution, J. Indian Statist. Assoc., 2, 21-31.
H. W. PEERS (1965) On confidence points and Bayesian probability points in the case of several parameters, JRSSB, 27, 9-16.
(1968) Confidence properties of Bayesian interval estimates, JRSSB, 30, 535-544.
R. S. PINKHAM (1966) On a fiducial example of C. Stein, JRSSB, 28, 53-54.
R. L. PLACKETT (1966) Current trends in statistical inference, JRSSA, 129, 249-267.
JOHN W. PRATT (1963) Shorter confidence intervals for the mean of a normal distribution with known variance, AMS, 34, 574-586.
(1964) Risk aversion in the small and in the large, EC, 32, 122-136.
(1965) Bayesian interpretation of standard inference statements, JRSSB, 27, 169-203.
(1966) The outer needle of some Bayes sequential continuation regions, Bk, 53, 455-467.
JOHN W. PRATT, HOWARD RAIFFA AND ROBERT SCHLAIFER (1964) The foundations of decision under uncertainty: an elementary exposition, JASA, 59, 353-375.
S. JAMES PRESS (1968) Estimating from misclassified data, JASA, 63, 123-133.
HOWARD RAIFFA (1968) Decision Analysis. Introductory Lectures on Choices under Uncertainty, Addison-Wesley, Reading, Mass., xviii + 309.
HOWARD RAIFFA AND ROBERT SCHLAIFER (1961) Applied Statistical Decision Theory, Division of Research, Harvard Business School, Boston, xxviii + 356.
FRANK PLUMPTON RAMSEY (1964) Truth and probability, Studies in Subjective Probability, Henry E. Kyburg, Jr. and Howard E. Smokler, eds., John Wiley, New York, 61-92. (Reprinted from The Foundations of Mathematics and Other Essays, Kegan, Paul, Trench, Trubner & Co. Ltd., London (1931), 156-198.)
HARRY V. ROBERTS (1967) Informative stopping rules and inferences about population size, JASA, 62, 763-775.
JOHN E. ROLPH (1968) Bayesian estimation of mixing distributions, AMS, 39, 1289-1302.
CARL-ERIK SARNDAL (1970) A class of explicata for "information" and "weight of evidence", Rev. Inst. Internat. Statist., 38, 223-235.
LEONARD J. SAVAGE (1954) The Foundations of Statistics, John Wiley, New York, xv + 294.
(1961) The foundations of statistics reconsidered, Proc. Fourth Berkeley Sympos., 1, 575-586, University of California Press, Berkeley.
L. J. SAVAGE et al. (1962) The Foundations of Statistical Inference, Methuen, London, 112.
LEONARD J. SAVAGE (1970) Comments on a weakened principle of conditionality, JASA, 65, 399-401.
Y. SAWARAGI, Y. SUNAHARA AND T. NAKAMIZO (1967) Statistical Decision Theory in Adaptive Control Systems, Academic Press, New York, xiii + 216.
LORRAINE SCHWARTZ (1965) On Bayes procedures, Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 4, 10-26.
GIDEON SCHWARZ (1969) A second-order approximation to optimal sampling regions, AMS, 40, 313-315.
A. SCOTT (1968) A multi-stage test for a normal mean, JRSSB, 30, 461-468.
(1969) A note on an allocation problem, JRSSB, 31, 119-122.
BRUNO O. SHUBERT (1969) Bayesian model of decision-making as a result of learning from experience, AMS, 40, 2127-2142.
BARNARD E. SMITH AND RAMAKRISHNA R. VEMUGANTI (1968) A learning model for processes with tool wear, Tc, 10, 379-387.
CEDRIC A. B. SMITH (1961) Consistency in statistical inference and decision, JRSSB, 23, 1-37.
(1965) Personal probability and statistical analysis, JRSSA, 128, 469-499.
MARY E. SOLARI (1969) The "maximum likelihood solution" of the problem of estimating a linear functional relationship, JRSSB, 31, 372-375.
SOLARI (1969) The "maximum likelihood solution" of the problem of estimating a linear functional relationship, JRSSB. 31, 372-375. M. D. SPRINGER AND W. E. THOMPSON (1966) Bayesian confidence limits for the product of N binomial parameters, Bk, 53, 611-613. (1968) Bayesian confidence limits for reliability of redundant systems when tests are terminated at failure, Tc, 10, 29-36.
C. M. STEIN (1951) A property of some tests of composite hypotheses, AMS, 22, 475-476.
(1956) Inadmissibility of the usual estimator for the mean of a multivariate normal distribution, Proc. Third Berkeley Sympos., 1, 197-206, University of California Press, Berkeley.
(1962a) Confidence sets for the mean of a multivariate normal distribution, JRSSB, 24, 265-296.
(1962b) A remark on the likelihood principle, JRSSA, 125, 565-568.
M. STONE (1963) The posterior t distribution, AMS, 34, 568-573.
(1964) Comments on a posterior distribution of Geisser and Cornfield, JRSSB, 26, 274-276.
(1965) Right Haar measure for convergence in probability to quasi posterior distributions, AMS, 36, 440-453.
(1969a) The role of significance testing: some data with a message, Bk, 56, 485-493.
(1969b) The role of experimental randomization in Bayesian statistics: finite sampling and two Bayesians, Bk, 56, 681-683.
M. STONE AND B. G. F. SPRINGER (1965) A paradox involving quasi prior distributions, Bk, 52, 623-627.
M. STONE AND R. VON RANDOW (1968) Statistically inspired conditions on the group structure of invariant experiments and their relationships with other conditions on locally compact topological groups, Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 10, 70-80.
JOHN L. STROMBERG (1969) Distinguishing among multinomial populations: a decision theoretic approach, JRSSB, 31, 376-387.
KEI TAKEUCHI (1970) Comments on Blyth's paper, AMS, 41, 1054-1058.
A. R. THATCHER (1964) Relationships between Bayesian and confidence limits for predictions, JRSSB, 26, 176-210.
JAMES R. THOMPSON (1968) Some shrinkage techniques for estimating the mean, JASA, 63, 113-122.
GEORGE C. TIAO (1966) Bayesian comparison of means of a mixed model with application to regression analysis, Bk, 53, 11-25.
G. C. TIAO AND G. E. P. BOX (1967) Bayesian analysis of a three-component hierarchical design model, Bk, 54, 109-125.
G. C. TIAO AND N. R. DRAPER (1968) Bayesian analysis of linear models with two random components with special reference to the balanced incomplete block design, Bk, 55, 101-117.
G. C. TIAO AND S. FIENBERG (1969) Bayesian estimation of latent roots and vectors with special reference to the bivariate normal distribution, Bk, 56, 97-108.
GEORGE C. TIAO AND ROBERT H. LOCHNER (1967) A note on tables for the comparison of the spread of two normal populations, Bk, 54, 683-684.
G. C. TIAO AND DAVID R. LUND (1970) The use of OLUMV estimators in inference robustness studies of the location parameter of a class of symmetric distributions, JASA, 65, 370-386.
GEORGE C. TIAO AND W. Y. TAN (1965) Bayesian analysis of random-effect models in the analysis of variance. I. Posterior distribution of variance-components, Bk, 52, 37-53.
(1966) Bayesian analysis of random-effect models in the analysis of variance. II. Effect of autocorrelated errors, Bk, 53, 477-495.
GEORGE C. TIAO AND ARNOLD ZELLNER (1964a) Bayes's theorem and the use of prior knowledge in regression analysis, Bk, 51, 219-230.
(1964b) On the Bayesian estimation of multivariate regression, JRSSB, 26, 277-285.
MYRON TRIBUS (1969) Rational Descriptions, Decisions and Designs, Pergamon Press, New York, xix + 478.
JOHN W. TUKEY (1962) The future of data analysis, AMS, 33, 1-67.
C. VILLEGAS (1964) On qualitative probability σ-algebras, AMS, 35, 1787-1796.
(1969) On the a priori distribution of the covariance matrix, AMS, 40, 1098-1099.
JOHN VON NEUMANN AND OSKAR MORGENSTERN (1947) Theory of Games and Economic Behavior, Princeton University Press, Princeton, N.J., xviii + 641.
ABRAHAM WALD (1950) Statistical Decision Functions, John Wiley, New York, ix + 179.
A. M. WALKER (1969) On the asymptotic behaviour of posterior distributions, JRSSB, 31, 80-88.
B. L. WELCH (1947) The generalization of "Student's" problem when several different population variances are involved, Bk, 34, 28-35.
(1965) On comparisons between confidence point procedures in the case of a single parameter, JRSSB, 27, 1-8.
B. L. WELCH AND H. W. PEERS (1963) On formulae for confidence points based on integrals of weighted likelihoods, JRSSB, 25, 318-329.
G. B. WETHERILL (1961) Bayesian sequential analysis, Bk, 48, 281-292.
G. B. WETHERILL AND G. E. G. CAMPLING (1966) The decision theory approach to sampling inspection, JRSSB, 28, 381-416.
P. WHITTLE (1957) Curve and periodogram smoothing, JRSSB, 19, 38-47.
(1958) On the smoothing of probability density functions, JRSSB, 20, 334-343.
(1964) Some general results in sequential analysis, Bk, 51, 123-141.
P. WHITTLE AND R. O. D. LANE (1967) A class of situations in which a sequential estimation procedure is non-sequential, Bk, 54, 229-234.
E. J. WILLIAMS (1969) Regression methods in calibration problems, Bull. Inst. Internat. Statist., 43, 1, 17-28.
ROBERT L. WINKLER (1967) The assessment of prior distributions in Bayesian analysis, JASA, 62, 776-800.
S. ZACKS (1970a) Bayes and fiducial equivariant estimators of the common mean of two normal distributions, AMS, 41, 59-69.
(1970b) Bayesian design of single and double stratified sampling for estimating proportion in finite population, Tc, 12, 119-130.
ARNOLD ZELLNER AND MARTIN S. GEISEL (1968) Sensitivity of control to uncertainty and form of the criterion function, The Future of Statistics, Donald G. Watts, ed., Academic Press, New York, 269-289.
ARNOLD ZELLNER AND GEORGE C. TIAO (1964) Bayesian analysis of the regression model with autocorrelated errors, JASA, 59, 763-778.
As this bibliography goes to press, the entertaining and instructive annotated bibliography by L. J. SAVAGE (1970) has appeared (Amer. Statistician, 24, 4, 23-27).
(continued from inside front cover)
J. F. C. KINGMAN, Mathematics of Genetic Diversity
MORTON E. GURTIN, Topics in Finite Elasticity
THOMAS G. KURTZ, Approximation of Population Processes
JERROLD E. MARSDEN, Lectures on Geometric Methods in Mathematical Physics
BRADLEY EFRON, The Jackknife, the Bootstrap, and Other Resampling Plans
M. WOODROOFE, Nonlinear Renewal Theory in Sequential Analysis
D. H. SATTINGER, Branching in the Presence of Symmetry
R. TEMAM, Navier-Stokes Equations and Nonlinear Functional Analysis
MIKLOS CSORGO, Quantile Processes with Statistical Applications
J. D. BUCKMASTER AND G. S. S. LUDFORD, Lectures on Mathematical Combustion
R. E. TARJAN, Data Structures and Network Algorithms
PAUL WALTMAN, Competition Models in Population Biology
S. R. S. VARADHAN, Large Deviations and Applications
KIYOSI ITO, Foundations of Stochastic Differential Equations in Infinite Dimensional Spaces
ALAN C. NEWELL, Solitons in Mathematics and Physics
PRANAB KUMAR SEN, Theory and Applications of Sequential Nonparametrics
LASZLO LOVASZ, An Algorithmic Theory of Numbers, Graphs and Convexity
E. W. CHENEY, Multivariate Approximation Theory: Selected Topics
JOEL SPENCER, Ten Lectures on the Probabilistic Method
PAUL C. FIFE, Dynamics of Internal Layers and Diffusive Interfaces
CHARLES K. CHUI, Multivariate Splines
HERBERT S. WILF, Combinatorial Algorithms: An Update
HENRY C. TUCKWELL, Stochastic Processes in the Neurosciences
FRANK H. CLARKE, Methods of Dynamic and Nonsmooth Optimization
ROBERT B. GARDNER, The Method of Equivalence and Its Applications
GRACE WAHBA, Spline Models for Observational Data
RICHARD S. VARGA, Scientific Computation on Mathematical Problems and Conjectures
INGRID DAUBECHIES, Ten Lectures on Wavelets
STEPHEN F. MCCORMICK, Multilevel Projection Methods for Partial Differential Equations
HARALD NIEDERREITER, Random Number Generation and Quasi-Monte Carlo Methods
JOEL SPENCER, Ten Lectures on the Probabilistic Method, Second Edition
CHARLES A. MICCHELLI, Mathematical Aspects of Geometric Modeling
ROGER TEMAM, Navier-Stokes Equations and Nonlinear Functional Analysis, Second Edition
For information about SIAM books, journals, conferences, memberships, or activities, contact:
Society for Industrial and Applied Mathematics 3600 University City Science Center Philadelphia, PA 19104-2688 Telephone: 215-382-9800 / Fax: 215-386-7999 / E-mail: [email protected] Science and Industry Advance with Mathematics