π(p) = {Γ(α+β)/(Γ(α)Γ(β))} p^(α−1) (1 − p)^(β−1),  0 < p < 1,  α > 0, β > 0.  (2.4)
This is called a Beta distribution. (Note that for convenience we take p to assume all values between 0 and 1, rather than only 0, 1/N, 2/N, etc.) The prior mean and variance are α/(α+β) and αβ/{(α+β)²(α+β+1)}, respectively. By the Bayes formula, the posterior density can be written as

π(p|X = x) = C(x) p^(α+r−1) (1 − p)^(β+(n−r)−1),  (2.5)

where r = Σᵢ xᵢ = number of red balls, and (C(x))⁻¹ is the denominator in the Bayes formula. A comparison with (2.4) shows the posterior is also a Beta density with α + r in place of α and β + (n − r) in place of β, and

C(x) = Γ(α+β+n)/{Γ(α+r)Γ(β+n−r)}.

The posterior mean and variance are

E(p|x) = (α+r)/(α+β+n),
Var(p|x) = (α+r)(β+n−r)/{(α+β+n)²(α+β+n+1)}.  (2.6)
As indicated earlier, a Bayesian analyst may just report the posterior (2.5), and the posterior mean and variance, which provide an idea of the center and dispersion of the posterior distribution. It will not escape one's attention that if n is large, then the posterior mean is approximately equal to the MLE, p̂ = r/n, and the posterior variance is quite small, so the posterior is concentrated around p̂ for large n. We can interpret this as an illustration of a fact mentioned before: when we have lots of data, the data tend to wash away the influence of the prior. The posterior mean can be rewritten as a weighted average of the prior mean and the MLE:

E(p|x) = {(α+β)/(α+β+n)} · α/(α+β) + {n/(α+β+n)} · r/n.
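The update in (2.5)-(2.6) is easy to check numerically. The sketch below (with made-up values of α, β, r, and n, not taken from the text) computes the posterior mean and variance and verifies the weighted-average decomposition above.

```python
# Beta-binomial update: a Beta(alpha, beta) prior combined with r red
# balls in n draws gives a Beta(alpha + r, beta + n - r) posterior.

def beta_posterior_summary(alpha, beta, r, n):
    """Posterior mean and variance, as in (2.6)."""
    a, b = alpha + r, beta + (n - r)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

# Illustrative (hypothetical) numbers: alpha = beta = 2, r = 7, n = 10.
alpha, beta, r, n = 2.0, 2.0, 7, 10
mean, var = beta_posterior_summary(alpha, beta, r, n)

# Weighted-average decomposition of the posterior mean:
prior_weight = (alpha + beta) / (alpha + beta + n)
data_weight = n / (alpha + beta + n)
weighted = prior_weight * alpha / (alpha + beta) + data_weight * (r / n)
print(mean, var, weighted)  # mean and weighted agree
```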
Once again, the importance of both the prior and the data comes out, the relative importance of the prior and the data being measured by (α+β) and n. Suppose we want to predict the probability of getting a red ball in a new (n+1)-st draw given the above data. This has been called a fundamental problem of science. It would be natural to use E(p|x), the same estimate as above. We list below a number of commonly used priors and the corresponding value of E(p|X₁, X₂, ..., Xₙ). The uniform prior corresponds with α = β = 1, with posterior mean equal to (Σᵢ xᵢ + 1)/(n + 2). This was a favorite of Laplace and Bayes but not so popular anymore. If α = β = 1/2, we have the Jeffreys prior with posterior mean (Σᵢ xᵢ + 1/2)/(n + 1). This prior is very popular in the case of one-dimensional θ as here. It is also a reference prior due to Bernardo (1979). Reference priors are very popular. If we take a Beta density with α = 0, β = 0, it integrates to infinity. Such a prior is called improper. If we still use the Bayes formula to produce a posterior density, the posterior is proper unless r = 0 or n. The posterior mean is exactly equal to the MLE. Objective priors are usually improper. To be usable they must have proper posteriors. It is argued in Chapter 5 that improper priors are best understood through the posteriors they produce. One might examine whether the posterior seems reasonable. Suppose we think of the problem as a representation of production of defective and non-defective items in a factory producing switches; we would take red to mean defective and black to mean a good switch. In this context, there would be some prior information available from the engineers. They may be able to pinpoint the likely value of p, which may be set equal to the prior mean α/(α+β). If one has some knowledge of prior variability also, one would have two equations from which to determine α and β.
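The posterior means just listed are easy to compare with the MLE in the extreme case r = 0. A small sketch (illustrative numbers, not from the text):

```python
# Posterior mean under a Beta(alpha, beta) prior: (r + alpha) / (n + alpha + beta).

def posterior_mean(r, n, alpha, beta):
    return (r + alpha) / (n + alpha + beta)

n, r = 10, 0  # extreme case: no red balls observed
mle = r / n                                 # 0.0
laplace = posterior_mean(r, n, 1.0, 1.0)    # uniform prior: (r + 1)/(n + 2)
jeffreys = posterior_mean(r, n, 0.5, 0.5)   # Jeffreys prior: (r + 1/2)/(n + 1)

# Both objective Bayes estimates move the MLE a little towards 1/2.
print(mle, jeffreys, laplace)
```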
In this particular context, the Jeffreys prior with a lot of mass at the two end points might be adequate if the process maintains a high level of quality (small p) except when it is out of control and has high values of p. The peak of the prior near p = 1
could reflect frequent occurrence of lack of control or a pessimistic prior belief to cope with disasters.

It is worth noting that the uniform, Jeffreys, and reference priors are examples of objective priors and that all of them produce a posterior mean that is very close to the MLE even for small n. Also, all of them make better sense than the MLE in the extreme case when p̂ = 0. In most contexts the estimate p̂ = 0 is absurd; the objective Bayes estimates move it a little towards p = 1/2, which corresponds with total ignorance in some sense. Such a movement is called shrinkage. Agresti and Caffo (2000) and Brown et al. (2003) have shown that such estimates lead to confidence intervals with closer agreement between nominal and true coverage probability than the usual confidence intervals based on normal approximation to p̂ or inversion of tests. In other words, the Bayesian approach seems to lead to a more reasonable point estimate as well as a more reliable confidence interval than the common classical answers based on the MLE.

Example 2.3. This example illustrates the advantages of a Bayesian interpretation of probability of making a wrong inference for given data, as opposed to classical error probabilities over repetitions. In this epidemiological study repetitions don't make sense. The data in Table 2.1 on food poisoning at an outing are taken from Bishop et al. (1975), who provide the original source of the study. Altogether 320 people attended the outing; 304 responded to questionnaires. There was other food also, but only two items, potato salad and crabmeat, attracted suspicion. We focus on the main suspect, namely, potato salad. A partial Bayesian analysis of this example will be presented later in Chapter 4.

Table 2.1. An Epidemiological Study

                     No Crabmeat                      Crabmeat
              Potato Salad  No Potato Salad    Potato Salad  No Potato Salad
  Ill              22              0                120             4
  Not Ill          24             23                 80            31

Example 2.4. Let X₁, X₂, ..., Xₙ be i.i.d. N(μ, σ²) and assume for simplicity σ² is known. As in Chapter 1, μ may be the expected reduction of blood pressure due to a new drug.
You want to test H₀ : μ ≤ μ₀ versus H₁ : μ > μ₀, where μ₀ corresponds with a standard drug already in the market. Let π(μ) be the prior. First calculate the posterior density π(μ|x). Then calculate

P{H₀|x} = ∫_{−∞}^{μ₀} π(μ|x) dμ

and

P{H₁|x} = ∫_{μ₀}^{∞} π(μ|x) dμ = 1 − P{H₀|x}.
One may simply report these numbers or choose one of the two hypotheses if one of the two probabilities is substantially bigger. We provide some calculations when the prior for μ is N(η, τ²). We recall from Example 2.1 that the posterior for μ is normal with mean and variance given by equations (2.2) and (2.3). It follows that

π(μ ≤ μ₀ | x̄) = Φ(z)  and  π(μ > μ₀ | x̄) = 1 − Φ(z),

where Φ is the standard normal distribution function and z = (μ₀ − E(μ|x̄))/√Var(μ|x̄), with E(μ|x̄) and Var(μ|x̄) as given in (2.2) and (2.3).
A conventional choice is to make τ² → ∞ above, which would give the same result as assuming an improper uniform prior π(μ) = c, −∞ < μ < ∞. Any of these would lead to

P{H₀|x̄} = Φ(√n(μ₀ − x̄)/σ).

Suppose we wish to reject H₀ if the posterior odds against H₀ are 19:1 or more, i.e., if the posterior probability of H₀ is ≤ .05. Then we reject H₀ if

√n(μ₀ − x̄)/σ < −1.64, i.e., x̄ > μ₀ + 1.64 σ/√n,

which is exactly the same as the classical test for this problem with α = .05. However, if we had wished to test the sharp null hypothesis H₀ : μ = μ₀ against H₁ : μ ≠ μ₀ or H₁ : μ > μ₀, we would have to choose the prior in a different way, since the prior we chose would assign zero probability to H₀. Moreover, the answers tend to be very different from classical answers, as we shall see in Chapter 6.
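Under the flat prior the posterior probability of H₀ coincides with the classical p-value, which is easy to verify numerically. A sketch with made-up summary statistics (μ₀, σ, n, and x̄ below are hypothetical):

```python
# One-sided normal test under the flat prior pi(mu) = c:
# P(H0 | data) = Phi(sqrt(n) (mu0 - xbar) / sigma), the classical p-value.
import math

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu0, sigma, n, xbar = 0.0, 1.0, 25, 0.4   # hypothetical data summary
z = math.sqrt(n) * (mu0 - xbar) / sigma   # here z = -2.0
post_prob_H0 = std_normal_cdf(z)          # posterior probability of H0

print(post_prob_H0)  # about 0.0228, below the 0.05 threshold, so reject H0
```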
2.3 Advantages of Being a Bayesian

The Bayesian approach provides a fairly explicit solution to common problems of statistical inference (Chapters 2 and 8), new problems of high-dimensional data analysis that are coming up because of the emergence of high-dimensional data sets (Chapters 9 and 10), as well as complex decision problems of real life (Chapter 10). It can handle the presence of prior knowledge or partial prior knowledge, especially constraints like θ ≤ 0.
approaches as far as the decision that is made is concerned; only the objective Bayesian approach has a posterior and hence a data-dependent method of evaluating the performance of the decision. Finally, basic Bayesian ideas and measures are easy to interpret and hence easy to communicate. One may well ask why, in spite of all these advantages, an explosive growth and spread of the Bayesian approach has occurred only recently, in the past fifteen or so years. A major factor has been the arrival of MCMC (Markov chain Monte Carlo) in a big way, and consequent advances in computation of posteriors for high-dimensional θ and many real-life applications. A classic paper that ushered in these changes is Gelfand and Smith (1990).
2.4 Paradoxes in Classical Statistics

The evaluation of performance of an inference procedure in classical statistics is based on expected quantities like the bias or variance of an estimate, error probabilities for a test, and confidence coefficients of a confidence interval. Such measures are obtained by integrating or summing over the sample space of all possible data. Hence they do not answer how good the inference is for a particular data set. The following two examples show how irrelevant the classical answers can be once the data are in hand.

Example 2.5. (Cox (1958)) To estimate μ in N(μ, σ²), toss a fair coin. Take a sample of size n = 2 if it is a head and n = 1000 if it is a tail. An unbiased estimate of μ is X̄ₙ = Σᵢ₌₁ⁿ Xᵢ/n with variance = (1/2){σ²/2 + σ²/1000} ≈ σ²/4. Suppose it was a tail. Would you believe σ²/4 is a measure of accuracy of the estimate?

Example 2.6. (Welch (1939)) Let X₁, X₂ be i.i.d. U(θ − 1/2, θ + 1/2). Let X̄ ± C be a 95% confidence interval, C > 0 being suitably chosen. Suppose X₁ = 2 and X₂ = 1. Then we know for sure θ = (X₁ + X₂)/2 and hence θ ∈ (X̄ − C, X̄ + C). Should we still claim we have only 95% confidence that the confidence interval covers θ? One of us (Ghosh) learned of this example from a seminar of D. Basu at the University of Illinois, Urbana-Champaign, in 1965. Basu pointed out how paradoxical the confidence coefficient is in this example. This perspective doesn't seem to be stressed in Welch (1939). The example has been discussed many times; see Lehmann (1986, Chapter 10, Problems 27 and 28), Pratt (1961), Kiefer (1977), Berger and Wolpert (1988), and Chatterjee and Chattopadhyay (1994).

Fisher was aware of this phenomenon and suggested we could make inference conditional on a suitable ancillary statistic. In Cox's example (Example 2.5), it would be appropriate to condition on the sample size and quote the conditional variance given n = 1000 as a proper measure of accuracy. Note
that n is ancillary; its distribution is free of θ, so conditioning on it doesn't change the likelihood. In Welch's example (Example 2.6), we could give the conditional probability of covering θ, conditional on X₁ − X₂ = 1. Note that X₁ − X₂ is also an ancillary statistic like n in Cox's example; it contains no information about θ - so fixing it would not change the likelihood - but its value, like the value of n, gives us some idea about how much information there is in the data. You are asked to carry out Fisher's suggestion in Problem 4. Suppose you are a classical statistician, and faced with this example you are ready to make conditional inference as recommended by Fisher. Unfortunately, there is a catch. Classical statistics also recommends that inference be based on minimal sufficient statistics. These two principles, namely the conditionality principle (CP) and the sufficiency principle (SP), together have a far-reaching implication. Birnbaum (1962) proved that they imply one must then follow the likelihood principle (LP), which requires that inference be based on the likelihood alone, ignoring the sample space. A precise statement and proof are given in Appendix B. Bayesian analysis satisfies the likelihood principle since the posterior depends on the data only through the likelihood. Most classical inference procedures violate the likelihood principle. Closely related to the violation of LP is the stopping rule paradox in classical inference. There is a hilarious example due to Pratt (Berger, 1985a, pp. 30-31).
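A Monte Carlo sketch of Welch's example makes Fisher's point concrete. Here C = (1 − √0.05)/2 ≈ 0.388 follows from the triangular distribution of X̄ − θ (the seed, sample size, and the cutoff 0.5 for "data far apart" are arbitrary choices for illustration):

```python
# Welch's example: X1, X2 iid U(theta - 1/2, theta + 1/2), interval Xbar +/- C.
# Unconditional coverage is 95%, but conditional on |X1 - X2| large,
# coverage is certain.
import random

random.seed(0)
theta = 0.0
C = (1 - 0.05 ** 0.5) / 2   # chosen so that unconditional coverage is 0.95

N = 200_000
cover_all = cover_far = n_far = 0
for _ in range(N):
    x1 = theta + random.uniform(-0.5, 0.5)
    x2 = theta + random.uniform(-0.5, 0.5)
    xbar = (x1 + x2) / 2
    covered = abs(xbar - theta) <= C
    cover_all += covered
    if abs(x1 - x2) > 0.5:   # data far apart: theta is pinned down sharply
        n_far += 1
        cover_far += covered

print(cover_all / N)        # close to 0.95
print(cover_far / n_far)    # equal to 1.0: coverage is certain for such data
```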
2.5 Elements of Bayesian Decision Theory

We can approach problems of inference in a mathematically more formal way through statistical decision theory. This makes the problems somewhat abstract and divorced from their real-life connotations but, on the other hand, provides a unified conceptual framework for handling very diverse problems. A classical statistical decision problem, vide Section 1.5, has the following ingredients. It has as data the observed value of X, with density f(x|θ), where the parameter θ lies in some subset Θ (known as the parameter space) of the p-dimensional Euclidean space ℝᵖ. It also has a space A of actions or decisions a, and a loss function L(θ, a), which is the loss incurred when the parameter is θ and the action taken is a. The loss function is assumed to be bounded below so that the integrals that appear later are well-defined. Typically, L(θ, a) will be ≥ 0 for all θ and a. We treat actions and decisions as essentially the same in this framework, though in non-statistical decision problems there will be some conceptual difference between a decision and the action it leads to. Finally, it has a collection of decision functions or rules δ(x) that take values in A. Suppose δ(x) = a for given x. Then the statistician who follows this particular rule δ(x) will choose action a given this particular data and incur the loss L(θ, a).
Both estimation and testing are special cases. Suppose the object is to estimate τ(θ), a real-valued function of θ. Then A = ℝ, L(θ, a) = (a − τ(θ))², and a decision function δ(x) is an estimate of τ(θ). If it is a problem of testing H₀ : θ ∈ Θ₀ versus H₁ : θ ∈ Θ₁, say, then A = {a₀, a₁}, where aⱼ means the decision to accept Hⱼ, L(θ, aⱼ) = 0 if θ satisfies Hⱼ, and L(θ, aⱼ) = 1 otherwise. If I(x) is the indicator of a rejection region for H₀, then the corresponding δ(x) is equal to aⱼ if I(x) = j, j = 0, 1. We recall also how one evaluates the performance of δ(x) in classical statistics through the average loss or risk function

R(θ, δ) = E_θ(L(θ, δ(X))).
If δ is an estimate of τ(θ) in an estimation problem, then R(θ, δ) = E_θ(τ(θ) − δ(X))² is the MSE (mean squared error). If δ is the indicator function of an H₀-rejection region, then R(θ, δ) is the probability of error of the first kind if θ ∈ Θ₀ and the probability of error of the second kind if θ ∈ Θ₁. For a Bayesian, θ is a random variable with prior distribution π(θ) before seeing the data, for example, at the planning stage of an experiment. The relevant risk at this stage is the so-called preposterior risk

∫_Θ R(θ, δ) π(θ) dθ = R(π, δ).
It depends on δ and the prior. On the other hand, after the data are in hand, the relevant distribution of θ is given by the posterior density π(θ|x), and the relevant risk is the posterior risk

E(L(θ, a)|x) = ψ(x, a).

The posterior risk associated with δ is ψ(x, δ(x)). So, in principle, there are two Bayesian decision problems.
A. Given X = x, choose an optimal a, i.e., choose an a to minimize ψ(x, a).
B. At the planning stage, choose an optimal δ(X), denoted δ^π and called the Bayes decision rule or simply the Bayes rule, to minimize R(π, δ).

We have the following pleasant fact, which shows in a sense that both problems give the same answer for a given X.

Theorem 2.7. (a) For any δ, R(π, δ) = E(ψ(X, δ(X))).
(b) Suppose a(x) minimizes ψ(x, a), i.e.,

ψ(x, a(x)) = inf_a ψ(x, a).

Then the decision function a(x) minimizes R(π, δ).
Proof. (a) Because E(L(θ, δ(X))|X) = ψ(X, δ(X)) by definition of ψ, the result follows by taking expectations on both sides.
(b) Let a(x), as defined in the theorem, be denoted by δ₀. Then, by part (a) and the definition of a(x),

R(π, δ₀) = E(ψ(X, a(X))) ≤ E(ψ(X, δ(X))) = R(π, δ)  for any δ,

so that R(π, δ₀) = inf_δ R(π, δ), as claimed. □
This fact will be used below in the sections on estimation, interval estimation and testing.
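Theorem 2.7 can be verified exhaustively in a toy discrete problem. The numbers below (two parameter values, two observations, 0-1 loss) are purely illustrative:

```python
# Toy check of Theorem 2.7: the rule built by minimizing the posterior
# risk psi(x, a) pointwise attains the minimum Bayes risk R(pi, delta).
from itertools import product

thetas = (0, 1)
prior = {0: 0.6, 1: 0.4}
lik = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}  # (x, theta) -> f(x|theta)

def loss(theta, a):          # 0-1 loss
    return 0.0 if a == theta else 1.0

def bayes_risk(delta):
    """R(pi, delta) = sum_theta pi(theta) sum_x f(x|theta) L(theta, delta(x))."""
    return sum(prior[t] * lik[(x, t)] * loss(t, delta[x])
               for t in thetas for x in (0, 1))

def posterior_risk(x, a):
    m = sum(prior[t] * lik[(x, t)] for t in thetas)   # marginal density of x
    return sum(prior[t] * lik[(x, t)] / m * loss(t, a) for t in thetas)

# Pointwise rule: for each x, pick the action minimizing the posterior risk.
pointwise = {x: min((0, 1), key=lambda a: posterior_risk(x, a)) for x in (0, 1)}

# Exhaustive search over all four decision rules confirms the theorem.
best = min(({0: a0, 1: a1} for a0, a1 in product((0, 1), repeat=2)),
           key=bayes_risk)
print(pointwise, bayes_risk(pointwise), bayes_risk(best))
```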
2.6 Improper Priors

For point and interval estimates, and to some extent in testing, objective priors are often improper. We have considered an improper prior for μ in N(μ, σ²) earlier in Examples 2.1 and 2.4, but somewhat indirectly. Also, one of the Beta priors in Example 2.2 was improper. We discuss a few basic facts about improper priors. We follow Berger (1985a). An improper prior density π(θ) is non-negative for all θ but

∫_Θ π(θ) dθ = ∞.

Such an improper prior can be used in the Bayes formula for calculating the posterior, provided the denominator is finite for all x (or all but a set of x with zero probability for all θ), i.e.,

∫_Θ π(θ) f(x|θ) dθ < ∞.
Then the posterior density π(θ|X = x) is a proper probability density function and can be used at least in inference problems or the posterior decision problem where we define and minimize ψ(x, a). However, for improper priors, usually R(π, δ) is not used. The most common improper priors are

π₁(μ) = c, −∞ < μ < ∞,  and  π₂(σ) = c/σ, 0 < σ < ∞,

for location and scale parameters. Both improper priors may be interpreted as a sort of limit of proper priors:
π_{1,L}(μ) = 1/(2L) if −L < μ < L, and 0 otherwise;
π_{2,L}(σ) = Δ/σ if 1/L < σ < L, and 0 otherwise,

where Δ = 1/(2 log L), in the sense that the posteriors for π₁ and π₂ may be obtained by letting L → ∞ in π_{i,L}(θ|x). Also, as pointed out by Heath and Sudderth (1978), the posteriors for π_i are the same as the posteriors for suitably chosen proper but finitely additive priors.
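The limit statement for π₁ can be illustrated numerically for a normal mean: under the proper prior π_{1,L} = U(−L, L), the posterior given x̄ from N(μ, σ²/n) is a truncated normal, and its mean approaches the flat-prior answer x̄ as L grows. A sketch with hypothetical summary statistics (x̄ = 1.3, standard error 0.5):

```python
# Posterior mean under a U(-L, L) prior for a normal mean: the posterior
# is N(xbar, se^2) truncated to (-L, L); its mean tends to xbar as L grows.
import math

def phi(z):
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def posterior_mean_uniform_prior(xbar, se, L):
    """Mean of N(xbar, se^2) truncated to (-L, L)."""
    a, b = (-L - xbar) / se, (L - xbar) / se
    return xbar - se * (phi(b) - phi(a)) / (Phi(b) - Phi(a))

xbar, se = 1.3, 0.5   # hypothetical summary statistics
means = [posterior_mean_uniform_prior(xbar, se, L) for L in (2.0, 5.0, 50.0)]
print(means)   # approaches xbar = 1.3 as L increases
```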
2.7 Common Problems of Bayesian Inference

There are three common problems, as in classical statistics, namely, point estimation, interval estimation, and testing. We have already seen examples of point estimates and tests of one-sided hypotheses, so we begin with these two problems and then turn to interval estimates (credible intervals) and testing of a sharp null hypothesis. Testing a sharp null hypothesis will be illustrated with a popular Bayes test for the normal mean due to Jeffreys. We also discuss prediction and a few other topics related to testing and interval estimation. Because the difference between Bayesian inference and Bayesian decision theory is mainly one of nuances, we do not make any sharp distinction between the two approaches. So our treatment of these three problems, as well as other problems later, includes elements of both - loss functions from decision theory as well as evidential descriptive measures from inference.

A full Bayesian study of a problem consists of two stages, the planning or preposterior stage followed by posterior Bayesian analysis of the data collected. At the planning stage one would have problems of choosing an optimum design and optimum sample size. Then the integrated Bayes risk R(π) = inf_δ R(π, δ) plays a central role. In this book we concentrate on the posterior Bayes analysis of data.

2.7.1 Point Estimates

For a real-valued θ, standard Bayes estimates are the posterior mean or the posterior median. The posterior mean is the Bayes estimate corresponding with squared error loss, and the posterior median is the Bayes estimate for absolute deviation loss. Along with the posterior mean one reports the posterior variance or its square root, the posterior standard deviation of θ. If one chooses to work with the posterior median, it would be convenient to report a couple of other posterior quantiles to give an idea of the posterior variability of θ. One could report at least the first and third posterior quartiles.
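As an illustration (hypothetical data, uniform prior), the summaries just described can be computed for a Beta posterior by brute-force numerical integration; a real analysis would use a library's incomplete beta function, so this is only a sketch.

```python
# Posterior mean, median, and quartiles for a Beta(1 + r, 1 + n - r)
# posterior, with quantiles found by accumulating a Riemann sum of the pdf.
import math

def beta_pdf(p, a, b):
    logc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(logc + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def beta_quantile(q, a, b, steps=100_000):
    acc, dp = 0.0, 1.0 / steps
    for i in range(steps):
        p = (i + 0.5) * dp
        acc += beta_pdf(p, a, b) * dp
        if acc >= q:
            return p
    return 1.0

a, b = 1 + 7, 1 + 3   # uniform prior, r = 7 successes in n = 10 trials
post_mean = a / (a + b)
q25, med, q75 = (beta_quantile(q, a, b) for q in (0.25, 0.5, 0.75))
print(post_mean, q25, med, q75)
```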
If the posterior is unimodal, then the posterior mode is another choice. It is similar to the MLE of classical statistics. Indeed, if the prior is uniform, both are identical. Along with the posterior mode one can report a suitable highest posterior density (HPD) credible interval as a measure of posterior variability. If the parameter is a vector, common choices for reporting are the posterior mean vector and the posterior dispersion matrix. Again, if the posterior is unimodal, one can report the posterior mode with a suitable HPD credible set. Problem 14 illustrates this with a multivariate normal model with known dispersion matrix and a multivariate normal or uniform prior for the normal mean vector.

2.7.2 Testing

We want to test

H₀ : θ ∈ Θ₀ versus H₁ : θ ∈ Θ₁.  (2.7)
If Θ₀ and Θ₁ are of the same dimension, as for one-sided null and alternative hypotheses, it is convenient and easy to choose a prior density that assigns positive prior probability to Θ₀ and Θ₁. One then calculates the posterior probabilities P{Θᵢ|x} as well as the posterior odds ratio (or simply posterior odds), namely, P{Θ₀|x}/P{Θ₁|x}, that most people prefer. One would then find a threshold like 1/9 or 1/19, etc., to decide what constitutes evidence against H₀. The Bayes rule for the 0-1 loss is to choose the hypothesis with higher posterior probability. There is a conceptual problem with this approach. If the prior is improper, then the prior probabilities may be undefined - they are, strictly speaking, undefined in the example with one-sided null and alternatives. Even if the prior is proper, the prior probabilities assigned to Θᵢ, i.e., P(Θᵢ), may not be carefully chosen and so may not be satisfactory. Surely, if our attitude to H₀ is still as in classical statistics, namely, that it should not be rejected unless there is compelling evidence to the contrary, then it would be unreasonable to assign less prior probability to Θ₀ than to Θ₁. In fact, an objective or impartial choice would be to assign equal probabilities. These things can be done better if we use the following alternative way of specifying the prior. Let π₀ and 1 − π₀ be the prior probabilities of Θ₀ and Θ₁. Let gᵢ(θ) be the prior p.d.f. of θ under Θᵢ, so that
∫_{Θᵢ} gᵢ(θ) dθ = 1.

The prior in the previous approach is nothing but

π(θ) = π₀ g₀(θ) I{θ ∈ Θ₀} + (1 − π₀) g₁(θ) I{θ ∈ Θ₁}.  (2.8)

We do not require any longer that Θ₀ and Θ₁ are of the same dimension. So in principle, sharp null hypotheses are also covered. We can now proceed as
before and report posterior probabilities or posterior odds. To compute these posterior quantities, note that the marginal density of X under the prior π can be expressed as

m_π(x) = ∫_Θ f(x|θ) π(θ) dθ = π₀ ∫_{Θ₀} f(x|θ) g₀(θ) dθ + (1 − π₀) ∫_{Θ₁} f(x|θ) g₁(θ) dθ,  (2.9)

and hence the posterior density of θ given the data X = x as

π(θ|x) = π₀ f(x|θ) g₀(θ)/m_π(x) if θ ∈ Θ₀,  and  (1 − π₀) f(x|θ) g₁(θ)/m_π(x) if θ ∈ Θ₁.  (2.10)
It follows then that

P(H₀|x) = P(Θ₀|x) = π₀ ∫_{Θ₀} f(x|θ) g₀(θ) dθ / {π₀ ∫_{Θ₀} f(x|θ) g₀(θ) dθ + (1 − π₀) ∫_{Θ₁} f(x|θ) g₁(θ) dθ}

and

P(H₁|x) = P(Θ₁|x) = (1 − π₀) ∫_{Θ₁} f(x|θ) g₁(θ) dθ / {π₀ ∫_{Θ₀} f(x|θ) g₀(θ) dθ + (1 − π₀) ∫_{Θ₁} f(x|θ) g₁(θ) dθ}.
One may also report the Bayes factor, which does not depend on π₀. The Bayes factor of H₀ relative to H₁ is defined as

BF₀₁ = ∫_{Θ₀} f(x|θ) g₀(θ) dθ / ∫_{Θ₁} f(x|θ) g₁(θ) dθ.  (2.11)

Clearly, BF₁₀ = 1/BF₀₁.
The posterior odds ratio of H₀ relative to H₁ is

{π₀/(1 − π₀)} · BF₀₁,

which reduces to BF₀₁ if π₀ = 1/2. Thus, BF₀₁ is an important evidential measure that is free of π₀. The smaller the value of BF₀₁, the stronger the evidence against H₀. Let us consider an example to illustrate some of these measures. It will be extended to include the well-known Jeffreys analysis later.

Example 2.8. Consider a blood test conducted for determining the sugar level of a person with diabetes two hours after he had his breakfast. It is of interest to see if his medication has controlled his blood sugar level. Assume that the test result X is N(θ, 100), where θ is the true level. In the appropriate
population (diabetic but under this treatment), θ is distributed according to a N(100, 900). Then, marginally, X is N(100, 1000), and the posterior distribution of θ given X = x is normal with mean = (900/1000)x + (100/1000)·100 = 0.9x + 10 and variance = 900·100/1000 = 90. Suppose we want to test H₀ : θ ≤ 130 versus H₁ : θ > 130. If the blood test shows a sugar level of 130, what can be concluded? Note that, given this test result, the true mean blood sugar level θ may be assumed to be N(127, 90). Consequently, we obtain

P(θ ≤ 130|X = 130) = Φ((130 − 127)/√90) = Φ(.316) = 0.624,

and hence P(θ > 130|X = 130) = 0.376. Therefore,

posterior odds ratio = 0.624/0.376 = 1.66.

Because π₀ = P^π(θ ≤ 130) = Φ((130 − 100)/√900) = Φ(1), the prior odds ratio is Φ(1)/(1 − Φ(1)) = .8413/.1587 = 5.3, and thus the Bayes factor turns out to be 1.66/5.3 = .313.

It can also be noted here that in one-sided testing situations, when a continuous prior π can be specified readily for the entire parameter space, there is no need to express it in the form π(θ) = π₀g₀(θ)I{θ ∈ Θ₀} + (1 − π₀)g₁(θ)I{θ ∈ Θ₁}. However, the problem of testing a point null hypothesis turns out to be quite different, as shown below.

Testing a Point Null Hypothesis

The problem is to test

H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀.
(2.12)
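Before taking up the point null (2.12), the numbers in Example 2.8 can be reproduced with a short computation (all values as given in the example; the normal-normal posterior formulas are standard):

```python
# Example 2.8 check: prior N(100, 900), observation X = 130 with variance 100.
import math

def Phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

prior_mean, prior_var = 100.0, 900.0
obs, obs_var = 130.0, 100.0
cut = 130.0

post_var = prior_var * obs_var / (prior_var + obs_var)                       # 90
post_mean = (prior_var * obs + obs_var * prior_mean) / (prior_var + obs_var) # 127

p0_post = Phi((cut - post_mean) / math.sqrt(post_var))     # ~ 0.624
posterior_odds = p0_post / (1 - p0_post)                   # ~ 1.66
p0_prior = Phi((cut - prior_mean) / math.sqrt(prior_var))  # Phi(1) ~ .8413
prior_odds = p0_prior / (1 - p0_prior)                     # ~ 5.3
bayes_factor = posterior_odds / prior_odds                 # ~ .313

print(round(p0_post, 3), round(posterior_odds, 2), round(bayes_factor, 3))
```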
Consider the following examples, which indicate when we need to consider point nulls and when we need not.

Example 2.9. In a statistical quality control situation, θ is the size of a unit and acceptable units are those with θ ∈ (θ₀ − δ, θ₀ + δ). Then one would like to test H₀δ : |θ − θ₀| ≤ δ. In this problem the length of the interval, 2δ, can be explicitly specified. On the other hand, this is not the case in the following.

Example 2.10. (i) Suppose we want to test the hypothesis H₀ : Vitamin C has no effect on the common cold. Clearly this is not meant to be thought of as an exact point null; surely vitamin C has some effect, though perhaps a very minuscule effect. Thus, in reality, this is still the case of an interval null hypothesis, with a very small
unspecified interval. However, it would be better represented as a point null hypothesis. (ii) On the other hand, a hypothesis such as H₀ : Astrology cannot predict the future can perhaps be represented as an exact point null.

Since these issues are important, we summarize the main points below. If the interval in an interval null hypothesis, along with π₀, g₀, and g₁, can be specified, it is best to treat the problem as an interval null hypothesis problem and proceed accordingly. However, when the interval around θ₀ is small but unspecified, and g₀ is difficult to specify, it is best to approximate the interval null by a point null. Conceptually testing a point null is not a different problem, but there are complications. First of all, it is not possible to use a continuous prior density, because any such prior will necessarily assign prior probability zero to the null hypothesis. Consequently, the posterior probability of the null hypothesis will also be zero. Intuitively, this is clear: if the null hypothesis is a priori impossible, it will remain so a posteriori also. Therefore, a prior probability of π₀ > 0 needs to be assigned to the point θ₀, and the remaining probability of π₁ = 1 − π₀ will be spread over {θ ≠ θ₀} using a density g₁. Simply take g₀ to be a point mass at θ₀ in (2.8). If the point null hypothesis approximates an interval null hypothesis, H₀ : θ ∈ (θ₀ − δ, θ₀ + δ), then π₀ is the probability assigned to the interval (θ₀ − δ, θ₀ + δ) by a continuous prior. The complication now is that the prior π is of the form

π(θ) = π₀ I{θ = θ₀} + (1 − π₀) g₁(θ) I{θ ≠ θ₀}  (2.13)
and hence has both discrete and continuous parts. However, (2.9) and (2.10) yield

m(x) = π₀ f(x|θ₀) + (1 − π₀) m₁(x),  where  m₁(x) = ∫_{θ≠θ₀} f(x|θ) g₁(θ) dθ.

Therefore, from (2.10),

π(θ₀|x) = π₀ f(x|θ₀)/m(x) = π₀ f(x|θ₀)/{π₀ f(x|θ₀) + (1 − π₀) m₁(x)}
        = {1 + ((1 − π₀)/π₀) · m₁(x)/f(x|θ₀)}⁻¹.  (2.14)
It follows then that the posterior odds ratio is given by

π(θ₀|x)/(1 − π(θ₀|x)) = {π₀/(1 − π₀)} · f(x|θ₀)/m₁(x),  (2.15)

and hence the Bayes factor of H₀ relative to H₁ (which is the ratio of the above posterior odds ratio to the prior odds ratio of π₀/(1 − π₀)) is

B = B(x) = BF₀₁(x) = f(x|θ₀)/m₁(x).  (2.16)

Thus, (2.15) can be expressed as

π(θ₀|x) = {1 + ((1 − π₀)/π₀) · BF₀₁⁻¹(x)}⁻¹.  (2.17)

Example 2.11. Suppose X ~ B(n, θ) and we want to test H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀, a problem similar to checking whether a given coin is biased based on n independent tosses (where θ₀ will be taken to be 0.5). Under the alternative hypothesis, suppose θ is distributed as Beta(α, β). Then m₁(x) is given by

m₁(x) = (n choose x) · {Γ(α+β)/(Γ(α)Γ(β))} · Γ(α+x)Γ(β+n−x)/Γ(α+β+n),

so that

BF₀₁(x) = (n choose x) θ₀ˣ(1 − θ₀)ⁿ⁻ˣ / m₁(x)
        = θ₀ˣ(1 − θ₀)ⁿ⁻ˣ · {Γ(α)Γ(β)/Γ(α+β)} · Γ(α+β+n)/{Γ(α+x)Γ(β+n−x)}.

Hence, we obtain

π(θ₀|x) = {1 + ((1 − π₀)/π₀) · BF₀₁⁻¹(x)}⁻¹
        = {1 + ((1 − π₀)/π₀) · [Γ(α+β)/(Γ(α)Γ(β))] · [Γ(α+x)Γ(β+n−x)/Γ(α+β+n)] / (θ₀ˣ(1 − θ₀)ⁿ⁻ˣ)}⁻¹.
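Example 2.11 is easy to compute in practice via log-gamma functions. A sketch with illustrative numbers (n = 20 tosses, x = 15 heads, θ₀ = 0.5, a uniform Beta(1, 1) prior under H₁, π₀ = 1/2; none of these values are from the text):

```python
# Bayes factor for a binomial point null, as in Example 2.11.
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bf01_binomial(x, n, theta0, a, b):
    """BF01 = f(x|theta0)/m1(x); the binomial coefficient cancels."""
    log_bf = (x * math.log(theta0) + (n - x) * math.log(1 - theta0)
              - (log_beta(a + x, b + n - x) - log_beta(a, b)))
    return math.exp(log_bf)

x, n, theta0 = 15, 20, 0.5
bf = bf01_binomial(x, n, theta0, 1.0, 1.0)
post_prob_H0 = 1.0 / (1.0 + (0.5 / 0.5) / bf)   # (2.17) with pi0 = 1/2
print(bf, post_prob_H0)
```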
Further discussion on hypothesis testing will be deferred to Chapter 6, where basic aspects of model selection will also be considered.

Jeffreys Test for Normal Mean with Unknown σ²

Suppose the data consist of i.i.d. observations X₁, X₂, ..., Xₙ from a normal N(μ, σ²) distribution. We want to test H₀ : μ = μ₀ versus H₁ : μ ≠ μ₀, where
μ₀ is some specified number. Without loss of generality, we assume μ₀ = 0. Note that the parameter σ² is common to the two models corresponding to H₀ and H₁, and μ occurs only in H₁. Also, in this example μ and σ² are orthogonal parameters in the sense that the Fisher information matrix is diagonal. In such situations, Jeffreys (1961) suggests using an (improper) objective prior only for the common parameter and a default proper prior for the other parameter that occurs only in one model. Let us consider the following priors in our example. We take the prior g₀(σ) = 1/σ for σ under H₀. Under H₁, we take the same prior for σ and add a conditional prior for μ given σ, namely

g₁(μ|σ) = (1/σ) g₂(μ/σ),

where g₂(·) is a p.d.f. An initial natural choice for g₂ is N(0, c²). Thus the prior conditional variance of μ is calibrated with respect to σ², as recommended by Jeffreys. Usually, one takes c = 1. Jeffreys points out that one would expect the Bayes factor BF₀₁ to tend to zero if x̄ → ∞ and s² = (1/n)Σ(xᵢ − x̄)² is bounded. He gives an argument implying that unless g₂ has no finite moments, this will not happen. In particular, with g₂ = normal, it can be verified directly (Problem 12) that BF₀₁ doesn't tend to zero as above. Jeffreys therefore suggested we take g₂ to be Cauchy. So the priors recommended by Jeffreys are

g₀(σ) = 1/σ  under H₀

and

g₁(μ, σ) = (1/σ) g₁(μ|σ) = 1/{πσ²(1 + μ²/σ²)}  under H₁.
One may now find the Bayes factor BF₀₁ using (2.11). Let the joint density of X₁, ..., Xₙ under the N(μ, σ²) model be denoted by f(x₁, ..., xₙ|μ, σ²). Then BF₀₁ is given by

BF₀₁ = ∫₀^∞ f(x₁, ..., xₙ|0, σ²) g₀(σ) dσ / ∫₀^∞ ∫_{−∞}^{∞} f(x₁, ..., xₙ|μ, σ²) g₁(μ, σ) dμ dσ,

where g₀(σ) and g₁(μ, σ) are as given above. The integral in the numerator of BF₀₁ can be obtained in closed form. However, no closed form is available for the denominator. To calculate it one can proceed as follows. The Cauchy density g₁(μ|σ) can be written as a Gamma scale mixture of normals,

1/{πσ(1 + μ²/σ²)} = ∫₀^∞ {√τ/(σ√(2π))} e^{−τμ²/(2σ²)} · {1/√(2π)} τ^{−1/2} e^{−τ/2} dτ,

where τ is the mixing Gamma(1/2, 1/2) variable. Then, to calculate the denominator of BF₀₁, one can integrate over μ and σ in closed form. Finally, one is left with a one-dimensional integral over τ.
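The scale-mixture identity above can be checked numerically (this is only a sketch of the identity, not the book's Bayes factor computation; the grid limits are arbitrary):

```python
# Check that integrating N(mu; 0, sigma^2/tau) against the Gamma(1/2, 1/2)
# density in tau recovers the Cauchy(0, sigma) density.
import math

def mixture_density(mu, sigma, tau_max=80.0, steps=80_000):
    """Midpoint Riemann-sum approximation of the tau-integral."""
    dtau, total = tau_max / steps, 0.0
    for i in range(steps):
        tau = (i + 0.5) * dtau
        normal = (math.sqrt(tau) / (sigma * math.sqrt(2 * math.pi))
                  * math.exp(-tau * mu * mu / (2 * sigma * sigma)))
        gamma_half = tau ** -0.5 * math.exp(-tau / 2) / math.sqrt(2 * math.pi)
        total += normal * gamma_half * dtau
    return total

sigma = 1.0
for mu in (0.0, 0.5, 2.0):
    cauchy = 1.0 / (math.pi * sigma * (1 + (mu / sigma) ** 2))
    print(mu, mixture_density(mu, sigma), cauchy)   # the two columns agree
```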
2 Bayesian Inference and Decision Theory
Example 2.12. Einstein's theory of gravitation predicts that light is deflected by gravitation and specifies the amount of deflection. Einstein predicted that light from stars passing near the sun would be deflected by its gravitational pull, but the effect would be visible only during a total solar eclipse, when the deflection can be measured through the apparent change in a star's position. A famous experiment by a team led by the British astrophysicist Eddington, immediately after the First World War (see Gardner, 1997), led to acceptance of Einstein's theory. Though many other better designed experiments have confirmed Einstein's theory since then, Eddington's expedition remains historically important. There are four observations, two collected in 1919 in Eddington's expedition, and two more collected by other groups in 1922 and 1929. The observations are x₁ = 1.98, x₂ = 1.61, x₃ = 1.18, x₄ = 2.24 (all in seconds of arc, as measures of angular deflection). Suppose they are normally distributed around their predicted value μ. Then X₁,…, X₄ are independent and identically distributed as N(μ, σ²). Einstein's prediction is μ = 1.75. We will test H₀: μ = 1.75 versus H₁: μ ≠ 1.75, where σ² is unknown. If we use the conventional priors of Jeffreys to calculate the Bayes factor BF₀₁ in this example, it turns out to be 2.98 (Problem 7). Thus the calculations with the given data lend some support to Einstein's prediction. However, the evidence in the data is not very strong. This particular experiment has not been repeated because of unavoidable experimental errors. There are now better confirmations of Einstein's theory, vide Gardner (1997).

2.7.3 Credible Intervals

Bayesian interval estimates for θ are similar to confidence intervals of classical inference. They are called credible intervals or sets.

Definition 2.13. For 0 < α < 1, a 100(1 − α)% credible set for θ is a subset C ⊂ Θ such that

P{C | X = x} = 1 − α.

Usually C is taken to be an interval.
Let θ be a continuous random variable, and let θ₍₁₎, θ₍₂₎ be the 100α₁% and 100(1 − α₂)% quantiles of the posterior, with α₁ + α₂ = α. Let C = [θ₍₁₎, θ₍₂₎]. Then P{C | X = x} = 1 − α. Usually equal-tailed intervals are chosen, so α₁ = α₂ = α/2. If θ is discrete, it would usually be difficult to find an interval with exact posterior probability 1 − α. There the condition is relaxed to

P{C | X = x} ≥ 1 − α,

with the inequality being as close to an equality as possible. In general, one may use a conservative inequality like this in the continuous case also, if exact posterior probability 1 − α is difficult to attain. Whereas (frequentist) confidence statements do not apply to whether a given interval for a given x covers the "true" θ, this is not the case with
credible intervals. The credibility 1 − α of a credible set does answer a layman's question of whether the given set covers the "true" θ with probability 1 − α. This is because in the Bayesian approach the "true" θ is a random variable with a data dependent probability distribution, namely, the posterior distribution. For arbitrary priors, these probabilities will usually not have any frequency interpretation over repetitions, unlike confidence statements. But for common objective priors, such statements are usually approximately true because of the normal approximation to the posterior distribution (see Chapter 4). Moreover, the approximations are surprisingly accurate for the Jeffreys prior. You are invited to verify this in Problem 8. Some explanation of this comes from the discussion of probability matching priors (Chapter 5). The equal-tailed credible interval need not have the smallest size, namely, length or area or volume, whichever is appropriate. For that one needs an HPD (highest posterior density) interval.

Definition 2.14. Suppose the posterior density for θ is unimodal. Then the HPD interval for θ is the interval

C = {θ : π(θ | X = x) ≥ k},

where k is chosen such that P(C | X = x) = 1 − α.

Example 2.15. Consider a normal prior for the mean of a normal population with known variance σ². The posterior is normal with mean and variance given by equations (2.2) and (2.3). The HPD interval is the same as the equal-tailed interval centered at the posterior mean,

C = posterior mean ± z_{α/2} (posterior s.d.).

Credible intervals are very easy to calculate, unlike confidence intervals, whose construction requires pivotal quantities or inversion of a family of tests (Chapter 1, Section 1.4.3). For a vector θ, one may consider an HPD credible set, especially if the posterior is unimodal. Alternatively, one may have credible intervals for each component. One may also report the probability of simultaneous coverage of all components.

2.7.4 Testing of a Sharp Null Hypothesis Through Credible Intervals

Some Bayesians are in favor of testing, say, H₀: θ = θ₀ versus H₁: θ ≠ θ₀ by accepting H₀ if θ₀ belongs to a chosen credible set. This is similar to the relation between confidence intervals and classical testing, except that there the tests are inverted to get confidence intervals. This must be thought of as
a very informal way of testing. If one really believes that the sharp null is a well-formulated theory and deserves to be tested, one would surely want to attach a posterior probability to it. That is not possible in this approach. Because inference based on credible intervals often has good frequency properties, a test based on them is also similar to a classical test. This is in sharp contrast with inference based on Bayes factors or posterior odds (Section 2.7.2 and Chapter 6).
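For the normal model of Example 2.15 this informal test is a few lines of code. The sketch below (function names ours) uses the standard conjugate updating for a N(η, τ²) prior with known data variance σ²; for a normal posterior the HPD interval coincides with the equal-tailed one.

```python
import numpy as np
from scipy import stats

def normal_posterior(x, sigma2, eta, tau2):
    # conjugate updating: the posterior of mu is normal with
    # precision n/sigma2 + 1/tau2 and precision-weighted mean
    n = len(x)
    prec = n / sigma2 + 1.0 / tau2
    mean = (np.sum(x) / sigma2 + eta / tau2) / prec
    return mean, 1.0 / prec

def hpd_interval(mean, var, alpha=0.05):
    # normal posterior: HPD = equal-tailed = mean +/- z_{alpha/2} * sd
    z = stats.norm.ppf(1 - alpha / 2)
    sd = np.sqrt(var)
    return mean - z * sd, mean + z * sd

def accept_sharp_null(theta0, interval):
    # the informal test: "accept" H0: theta = theta0 iff theta0 lies in C
    lo, hi = interval
    return lo <= theta0 <= hi
```

The posterior precision is the sum of the data and prior precisions, so the credible interval is always narrower than the one based on the prior alone.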
2.8 Prediction of a Future Observation

We have already done this informally earlier. Suppose the data are x₁,…, xₙ, where X₁,…, Xₙ are i.i.d. with density f(x|θ), e.g., N(μ, σ²) with σ² known. We want to predict the unobserved Xₙ₊₁ or set up a predictive credible interval for Xₙ₊₁. Prediction by a single number t(x₁,…, xₙ) based on x₁,…, xₙ with squared error loss amounts to considering the prediction loss

E{(Xₙ₊₁ − t)² | x} = E{((Xₙ₊₁ − E(Xₙ₊₁|x)) − (t − E(Xₙ₊₁|x)))² | x}
                 = E{(Xₙ₊₁ − E(Xₙ₊₁|x))² | x} + (t − E(Xₙ₊₁|x))²,

which is minimized at t = E(Xₙ₊₁|x). To calculate the predictor we need to calculate the predictive distribution

π(xₙ₊₁ | x) = ∫_Θ π(xₙ₊₁ | x, θ) π(θ | x) dθ = ∫_Θ f(xₙ₊₁ | θ) π(θ | x) dθ.

Let μ(θ) = ∫_{−∞}^{∞} x f(x|θ) dx denote the mean of a single observation. It can be shown that

E(Xₙ₊₁ | x) = E(μ(θ) | x) = ∫_Θ μ(θ) π(θ | x) dθ,

which in the N(μ, σ²) example equals ∫_{−∞}^{∞} μ π(μ|x) dμ = posterior
mean of μ. Similarly, in Example 2.2, the predictive probability that the next ball is red is

E(Xₙ₊₁ | x) = E(p | x) = (α + r)/(α + β + n),

where r = Σᵢ xᵢ. A predictive credible interval for Xₙ₊₁ is (c, d), where c and d are the 100α₁% and 100(1 − α₂)% quantiles of the predictive distribution of Xₙ₊₁ given x. Usually one takes α₁ = α₂ = α/2, as for credible intervals.
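For the urn example the predictive quantities are available in closed form, and drawing from the predictive distribution gives the same answer; the sketch below (function names ours) checks the formula E(Xₙ₊₁|x) = (α + r)/(α + β + n) by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def predictive_prob_red(r, n, a=1.0, b=1.0):
    # P(X_{n+1} = 1 | data) = posterior mean of p under a Beta(a, b) prior
    return (a + r) / (a + b + n)

def predictive_prob_red_mc(r, n, a=1.0, b=1.0, m=200_000):
    # Monte Carlo version: draw p from the Beta(a + r, b + n - r)
    # posterior, then X_{n+1} | p ~ Bernoulli(p); average over the draws
    p = rng.beta(a + r, b + n - r, size=m)
    return float(np.mean(rng.random(m) < p))
```

With a = b = 1 this is Laplace's rule of succession, (r + 1)/(n + 2).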
2.9 Examples of Cox and Welch Revisited

In both these problems (see Examples 2.5 and 2.6), the parameter is a location parameter. A common objective prior is π(θ) = constant, −∞ < θ < ∞. You can verify (Problem 11) that the objective Bayesian answers, namely, the posterior variance in Cox's example and the posterior probability in Welch's example, agree with the corresponding conditional frequentist answers recommended by Fisher. This would typically be the case for location and scale parameters.
2.10 Elimination of Nuisance Parameters

In problems of testing and estimation, the main object of interest may be not the full vector θ but one of its components. Which component is important will depend on the context. To fix ideas, let θ = (θ₁, θ₂) and let θ₁ be the parameter of importance. The unimportant parameters θ₂ are called nuisance parameters. Classical statistics has three ways of eliminating nuisance parameters θ₂ and thus simplifying the problem of inference about θ₁. We explain through three examples.

Example 2.16. Suppose X₁ and X₂ are independent Poisson with means λ₁, λ₂. You want to test H₀: λ₁ = λ₂. We can reparameterize (λ₁, λ₂) as θ₁ = λ₁/(λ₁ + λ₂), θ₂ = λ₁ + λ₂. Then θ₁ is the parameter of interest. Under H₀, θ₁ = 1/2, and only θ₂ is unknown. T = X₁ + X₂ is sufficient for λ₁ + λ₂, and the conditional distribution of X₁ given T is binomial(n = T, p = 1/2), which can be used to construct a conditional test.

Example 2.17. In the second example we use an invariance argument. Consider a sample from N(μ, σ²). We want to test H₀: μ = 0 against, say, H₁: μ > 0, which can be reformulated as H₀: μ/σ = 0 and H₁: μ/σ > 0. Again reparameterize as (θ₁ = μ/σ, θ₂ = σ). Note that (X̄, S² = Σᵢ(Xᵢ − X̄)²/(n − 1)) is a sufficient statistic, and X̄/S is invariant under the transformation Xᵢ → cXᵢ, i = 1, 2,…, n. Also X̄/S = Z̄/S_Z, where Zᵢ = Xᵢ/σ, so its distribution depends only on θ₁. The usual t-test is based on X̄/S.

Example 2.18. In the third method one constructs what is called a profile likelihood for θ₁ by maximizing the joint likelihood with respect to θ₂ and then using it as a sort of likelihood for θ₁. Thus the profile likelihood is

L_p(θ₁) = sup_{θ₂} f(x | θ₁, θ₂) = f(x | θ₁, θ̂₂(θ₁)),

where θ̂₂(θ₁) is the MLE of θ₂ when θ₁ is given.
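The conditional test of Example 2.16 is straightforward to implement; here is a sketch (function name ours), using the fact that under H₀ the conditional distribution of X₁ given X₁ + X₂ = t is binomial(t, 1/2).

```python
from scipy import stats

def conditional_poisson_pvalue(x1, x2):
    # two-sided conditional p-value for H0: lambda1 = lambda2;
    # under H0, X1 | X1 + X2 = t ~ binomial(t, 1/2)
    t = x1 + x2
    p_low = stats.binom.cdf(min(x1, x2), t, 0.5)      # P(X1 <= min)
    p_high = stats.binom.sf(max(x1, x2) - 1, t, 0.5)  # P(X1 >= max)
    return min(1.0, p_low + p_high)
```

The nuisance parameter λ₁ + λ₂ disappears entirely from the conditional distribution, which is what makes the test exact.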
In a full Bayesian approach, a nuisance parameter causes no problem. One simply integrates it out of the joint posterior. Suppose, however, that one does not want to do a full Bayesian analysis but rather construct a Bayesian analogue of the profile likelihood, on the basis of which some exploratory Bayesian analysis for θ₁ will be done. Once again, this is easy. One uses
L(θ₁) = ∫ f(x | θ₁, θ₂) π(θ₂ | θ₁) dθ₂.

We give an example to indicate that integration makes better sense than treating the unknown θ₂ as known and equal to the conditional MLE θ̂₂(θ₁).

Example 2.19 (due to Neyman and Scott). Let Xᵢ₁, Xᵢ₂, i = 1, 2,…, n, be 2n independent normal random variables, with Xᵢ₁, Xᵢ₂ being i.i.d. N(μᵢ, σ²). Here σ² = θ₁ is the parameter of interest and (μ₁,…, μₙ) = θ₂ is the nuisance parameter. One may think of a weighing machine with no bias but some variability; μᵢ is the weight of the ith object, and Xᵢ₁, Xᵢ₂ are two measurements of the weight of the ith object. The profile likelihood is
sup
a - 2 " e x p ( - - i 2 y ] { ( ^ ^ l - / ^ ^ ) ^ + (^^2-M^)^} I
Ml, -^Mn
V ^^
oc a - 2 - exp { - ^
J
,^1
J2 {(^^1 - ^ ^ ) ' + (^^2 - X,f]\
,
where Xi = {Xn -\-Xi2)/2. If one maximizes it to get an estimate of ^1, it will be the usual MLE of a'^, namely.
It is easy to show (Problem 13) that this estimate is inconsistent; it converges in probability to σ²/2. If one corrects for the bias by dividing by n instead of 2n, it becomes consistent. To rectify such problems with the profile likelihood, Cox and Reid (1987) have considered an asymptotic conditional likelihood, which behaves better than the profile likelihood. The simple-minded Bayesian likelihood is
L(σ²) = ∫ σ^{−2n} exp( −(1/(2σ²)) Σᵢ {(Xᵢ₁ − μᵢ)² + (Xᵢ₂ − μᵢ)²} ) π(μ | σ²) dμ
     ∝ σ^{−n} exp( −(1/(2σ²)) Σᵢ {(Xᵢ₁ − X̄ᵢ)² + (Xᵢ₂ − X̄ᵢ)²} ),

where μ = (μ₁,…, μₙ) has an improper uniform prior. Maximizing it, one gets a consistent estimate of σ².
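A quick simulation (ours, not from the text) shows the contrast numerically: the profile-likelihood MLE settles near σ²/2, while the corrected estimate, which is also the maximizer of the integrated likelihood above, settles near σ².

```python
import numpy as np

rng = np.random.default_rng(1)

def neyman_scott_estimates(n, sigma=1.0):
    # n pairs X_i1, X_i2 ~ N(mu_i, sigma^2) with distinct nuisance means
    mu = rng.normal(0.0, 5.0, size=n)
    x1 = rng.normal(mu, sigma)
    x2 = rng.normal(mu, sigma)
    xbar = (x1 + x2) / 2.0
    ss = np.sum((x1 - xbar) ** 2 + (x2 - xbar) ** 2)
    mle = ss / (2 * n)    # profile-likelihood MLE: tends to sigma^2 / 2
    corrected = ss / n    # consistent; maximizes the integrated likelihood
    return mle, corrected
```

The bias does not vanish as n grows because each new pair brings a new nuisance parameter along with its two observations.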
Berger et al. (1999) discuss many such examples with subtle problems of lack of identifiability of the parameters in the model. For example, if one has two binomials B(nᵢ, pᵢ), i = 1, 2, where nᵢ is large, pᵢ is small, and n₁p₁ = n₂p₂ = λ, then both will be well approximated by a Poisson with mean λ. So the data would provide a lot of information on λ = nᵢpᵢ but not on (nᵢ, pᵢ). If both parameters of a binomial, namely n and p, are unknown, then they may have identifiability problems in this sense.
2.11 A High-dimensional Example

Examples discussed so far have one thing in common: the dimension of the parameter space is small. We refer to them as low-dimensional. Many of the new problems we have to solve today have a high-dimensional parameter space. We refer to these as high-dimensional. One such example appears below.

Example 2.20. New biological screening experiments, namely microarrays, test simultaneously thousands of genes to identify their functions in a particular context (say, in producing a particular protein or a particular kind of tumor). On the basis of the data some genes, usually in the hundreds, are considered "expressed" if they are thought to have this function. They are taken up for further study by more traditional and time-consuming techniques. Without going into the fascinating biochemistry behind these experiments, we provide a statistical formulation. The data consist of (X̄ᵢ, Sᵢ), i = 1, 2,…, p, where X̄ᵢ, Sᵢ are the sample mean and s.d. based on raw data Xᵢ₁, Xᵢ₂,…, Xᵢᵣ of size r on the ith gene. For fixed i, Xᵢ₁,…, Xᵢᵣ are i.i.d. N(μᵢ, σᵢ²). Further, μᵢ = 0 if the ith gene is not expressed, and μᵢ ≠ 0 if the gene is indeed expressed. Of course, we could carry out a separate t-test for each i, but this ignores additional information that we can get by considering all the genes together in a Bayesian way. Moreover, simple-minded testing for each gene separately would enormously increase the number of false discoveries. For example, if one tests each i at level α = 0.05, then even if no genes are really expressed there would be about Nα false rejections of the null hypothesis of "no expression", N being the number of tests. We put a prior on the μᵢ's and σᵢ²'s as follows. We assume that (μᵢ, σᵢ²), i = 1, 2,…, p, are i.i.d. given certain hyperparameters. The prior distribution for μᵢ given σᵢ² is a mixture of two normals, pN(0, cσᵢ²) + (1 − p)N(θ, cσᵢ²), and the σᵢ² are i.i.d. inverse Gamma.

The prior distribution has five (hyper)parameters, namely p, c, θ, and the shape and scale parameters of the inverse Gamma. If the proportion of genes expected to be functional can be guessed, we would set p accordingly. We would have to put a (second stage) prior on the remaining four parameters, making this an example of hierarchical priors. A somewhat simpler approach (empirical Bayes) is to estimate the (hyper)parameters from the data. We will see in Chapter 9 that there is a lot of information about them in the data. In either case, data about all the genes affect inference about each gene through
these (hyper)parameters that are common to all the genes. Our prior is based on a judgment of exchangeability of (μᵢ, σᵢ²), i = 1, 2,…, p, and de Finetti's theorem in Section 2.12. The inference for each gene is quite simple in the PEB (parametric empirical Bayes) approach. It is more complicated in the hierarchical Bayes setup but doable. Both are discussed in Chapter 9.
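The two-groups model of Example 2.20 is easy to simulate, which is useful for sanity-checking any fitting procedure; the sketch below generates (X̄ᵢ, Sᵢ) from the model, with all hyperparameter values being illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_microarray(p_genes=1000, r=5, p=0.9, c=1.0, theta=2.0,
                        shape=3.0, scale=2.0):
    # sigma_i^2 i.i.d. inverse-Gamma(shape, scale)
    sigma2 = scale / rng.gamma(shape, 1.0, size=p_genes)
    # mu_i | sigma_i^2 ~ p N(0, c sigma_i^2) + (1 - p) N(theta, c sigma_i^2)
    expressed = rng.random(p_genes) > p      # falls in the (1 - p) component
    mu = rng.normal(np.where(expressed, theta, 0.0), np.sqrt(c * sigma2))
    # raw data X_i1, ..., X_ir i.i.d. N(mu_i, sigma_i^2) for each gene
    x = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(p_genes, r))
    return x.mean(axis=1), x.std(axis=1, ddof=1), expressed
```

All genes share the five hyperparameters, which is how, in the Bayesian analysis, data on all the genes inform the inference for each one.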
2.12 Exchangeability

One may often be able to judge that a set of parameters (θ₁,…, θ_p) or a set of observables like (X₁, X₂,…, Xₙ) is exchangeable, i.e., their joint distribution function is left unaltered if the arguments are permuted. Thus if

P{X₁ ≤ x₁,…, Xₙ ≤ xₙ} = P{X₁ ≤ x_{i₁},…, Xₙ ≤ x_{iₙ}}

for all n! permutations x_{i₁},…, x_{iₙ} of x₁,…, xₙ, one says X₁,…, Xₙ are exchangeable. A simple way of generating exchangeable random variables is to choose an indexing random parameter η and have the random variables conditionally i.i.d. given η. In many cases the converse is also true, as shown by de Finetti (1974, 1975) and Hewitt and Savage (1955). We only discuss de Finetti's theorem. We say Xᵢ, i = 1, 2,…, is a sequence of exchangeable random variables if, for every n ≥ 1, X₁, X₂,…, Xₙ are exchangeable.

Theorem 2.21 (de Finetti). Suppose the Xᵢ's constitute an exchangeable sequence and each Xᵢ takes only the values 0 or 1. Then, for some distribution π,

P{X₁ = x₁,…, Xₙ = xₙ} = ∫₀¹ η^{Σᵢxᵢ} (1 − η)^{n − Σᵢxᵢ} dπ(η)

for all n and all x₁,…, xₙ equal to 0 or 1; i.e., given η, X₁,…, Xₙ are conditionally i.i.d. Bernoulli with parameter η, and η has distribution π.

A Bayesian may interpret this as follows. The subjective judgment of exchangeability leads to both a Bernoulli likelihood and the existence of a prior π. If one also has a prediction rule as in Problem 18, π can be specified. Thus, at least in this interpretation, the prior and the likelihood have the same logical status; both arise from a subjective judgment about observables. Hewitt and Savage (1955) show that even if the random variables take values in Rᵏ, or more generally in a nice measurable space, a similar representation as conditionally i.i.d. random variables holds. See Schervish (1995) for a statement and proof. In many practical cases, vide Example 2.20, one may perceive certain parameters θ₁,…, θ_p as exchangeable. Even if the parameters do not form an infinite sequence, it is convenient to represent them as conditionally i.i.d. given a hyperparameter. Often, as in Example 2.20, the form of π(θ|η) is also dictated by operational convenience. We show in Chapter 5 how we can check whether this form is validated by the data.
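De Finetti's construction is easy to see in simulation: mixing i.i.d. Bernoulli(η) sequences over a random η produces variables that are exchangeable but positively correlated, hence not independent. A small sketch (the Beta(2, 2) mixing distribution is an illustrative choice of ours):

```python
import numpy as np

rng = np.random.default_rng(3)

def exchangeable_bernoulli(n_seq, n, a=2.0, b=2.0):
    # draw eta ~ Beta(a, b) once per sequence, then X_1, ..., X_n
    # i.i.d. Bernoulli(eta) given eta: de Finetti's representation
    eta = rng.beta(a, b, size=n_seq)
    return (rng.random((n_seq, n)) < eta[:, None]).astype(int)

x = exchangeable_bernoulli(200_000, 2)
# exchangeability: P(X1 = 1, X2 = 0) equals P(X1 = 0, X2 = 1)
p10 = np.mean((x[:, 0] == 1) & (x[:, 1] == 0))
p01 = np.mean((x[:, 0] == 0) & (x[:, 1] == 1))
# positive correlation: X1 and X2 are exchangeable but not independent
corr = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
```

For Beta(2, 2) mixing, both off-diagonal probabilities equal E[η(1 − η)] = 0.2, and the correlation equals Var(η)/Var(X₁) = 0.2.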
2.13 Normative and Descriptive Aspects of Bayesian Analysis, Elicitation of Probability

Do most people faced with uncertainty make decisions as if they were Bayesian, each with her subjective prior and utility? The answer is generally no. The Bayesian approach is not claimed to be a description of how people tend to make decisions. On the other hand, Bayesians believe, on the basis of various sets of rationality axioms and their consequences (as discussed in Chapter 3), that people should act as if they have a prior and a utility. The Bayesian approach is normative rather than descriptive. There have been empirical as well as philosophical studies of these issues. We refer the interested reader to Raiffa and Schlaifer (1961) and French and Rios Insua (2000). We explore tentatively a couple of issues related to this. It is an odd fact of our intellectual history that the concept of probability, which is so fundamental both in daily life and in science, was developed only during the European Renaissance. It is tempting to speculate that our current inability to behave rationally under uncertainty is related to the late arrival of probability on the intellectual scene. Most Bayesians hope the situation will improve with the passage of time and with attempts to educate ourselves to act rationally. Related to these facts is the inability of most people to express their uncertainty in terms of a well calibrated probability. Probability is still most easily calculated in gambling or similar problems where outcomes are equally likely, in problems like life or medical insurance where empirical calculation based on repetitions is possible, or under exchangeability. Most examples of successful elicitation of subjective probability involve exchangeability in some form. However, there has been some progress in elicitation. Some examples are discussed in Chapter 5. These examples and attempts notwithstanding, full elicitation of subjective probability is still quite rare.
Most priors used in practice are at least partly nonsubjective. They are obtained through some objective, i.e., nonsubjective, algorithms. In some sense they are uniform distributions that take into account what is known, namely some prior moments or quartiles, and the geometry of the parameter space. We discuss objective priors and Bayesian analysis based on them in the next section.
2.14 Objective Priors and Objective Bayesian Analysis

We refer to Bayesian analysis based on objective priors as objective Bayesian analysis. One would expect that as elicitation improves, subjective Bayesian analysis will increasingly be used in place of objective Bayesian analysis. All Bayesians agree that wherever prior information is available, one should try to use a prior reflecting it as far as possible. In fact, one of
the attractions of the Bayesian paradigm is that use of prior expert information is a possibility. Incorporation of prior expert opinion would considerably strengthen purely data based analysis in real-life decision problems, as well as in problems of statistical inference with small sample sizes or high or infinite dimensional parameter spaces. In this approach, use of objective Bayesian analysis has no conflict with the subjectivist approach. It may also have a legitimate place in subjective Bayesian analysis as a reference point or origin with which to compare the role and importance of prior information in a particular Bayesian decision. In a similar spirit, it may also be used to report to general readers or to a group of Bayesians with different priors. We discuss in Chapter 3 algorithms for generating common objective priors such as the Jeffreys, reference, or probability matching priors. We also discuss there common criticisms, such as the fact that these priors are improper and depend on the experiment, as well as our answers to such criticisms. In examples with low-dimensional θ, objective Bayesian analysis has some similarities with frequentist answers, as in Examples 2.2 and 2.4, in that the estimate obtained or hypothesis accepted tends to be very close to what a frequentist would have done. However, the objective Bayesian has a posterior distribution and a data based evaluation of the error or risk associated with the inference or decision, namely, the posterior variance, posterior error, or posterior risk. In high-dimensional problems, e.g., Example 2.20, it is common to use hierarchical priors with an objective prior of the above type used at the highest level of the hierarchy. One then typically uses MCMC without always checking whether the posterior is proper; in fact, checking this mathematically may be very difficult. Truncation of the prior, together with careful examination of the stability of the posterior under the truncation, provides good numerical insight.
However, this is not the only place where an objective prior is used in the hierarchy. In fact, in Example 2.20, the prior for (μᵢ, σᵢ²) arises from a subjective assumption of exchangeability, but the particular form taken is for convenience. This is a non-subjective choice but, as indicated in Chapter 9, some data based validation is possible. Objective Bayesian analysis in high-dimensional problems is also close in spirit to frequentist answers to such problems. Indeed, it is a pleasant fact that, as in low-dimensional problems but for different reasons, the frequentist answers are almost identical to the Bayesian answers. The frequentist answers are based on the parametric empirical Bayes (PEB) approach, in which the parameters in the last stage of the hierarchical prior are estimated from the data rather than given an objective prior. As in the low-dimensional case, objective Bayesian analysis has some advantages over frequentist analysis. The PEB approach used by frequentists tends to underestimate the posterior risk. Though it is implicit in the above discussion, it is worth pointing out that Bayesian analysis can be based on an improper prior only if the posterior is proper. Somewhat surprisingly, the posterior is usually proper when one uses
the Jeffreys prior or a reference prior, but counterexamples exist; see Ghosh (1997) and Section 7.4.7. Because objective priors are improper, the usual type of preposterior analysis cannot be made at the planning stage. In particular, one cannot compare different designs for an experiment and make an optimal choice. For the same reason, choosing an optimal sample size is a problem. It is suggested in Chapter 6 that a partial solution is to take a few observations, say the minimum number needed to make the posterior proper, and use the resulting proper posterior as a proper prior. The additional data can then be used to update it. For an application of these ideas, see Ghosh et al. (2002). Unfortunately, when all the data have been collected by the stage of formulating the prior, one would need to modify this simple procedure.
2.15 Other Paradigms

In earlier sections, we have discussed several aspects of the Bayesian paradigm and its logical advantages. In this context we have also discussed in some detail various problems with the classical frequentist approach. Some of these problems of classical statistics can be resolved, or at least mitigated, by appropriate conditioning. Even though Birnbaum's theorem shows that extensive conditioning and restriction to minimal sufficiency would lead to fundamental changes in the classical paradigm, and it may be quite awkward to find a suitable conditioning, the idea of conditioning makes it possible to reconcile a good deal of objective Bayesian analysis and classical statistics if suitable conditioning is done. At the least, this makes communication between the paradigms relatively easy. There have also been attempts to create a new paradigm of inference based on sufficiency, conditioning, and likelihood. An excellent treatment is available in Sprott (2000). Some of our reservations are listed in Ghosh (2002). One should also mention the belief functions and upper and lower probabilities of Dempster and Shafer (see Dempster (1967), Shafer (1976), and Shafer (1987)). Wasserman and Kadane (1990) have shown that under certain axioms, their approach may be identified with a robust Bayesian point of view. Problems of foundations of probability and inference remain an active area. An entirely different popular approach is data analysis. Data analysis makes few assumptions; it is very innovative and yet easy to communicate. However, it is rather ad hoc and cannot quite be called a paradigm. If machine or statistical learning emerges as a new alternative paradigm for learning from data, then data analysis would find in it the paradigm it currently lacks.
2.16 Remarks

Even though there are several different paradigms, we believe the Bayesian approach is not only the most logical but also very flexible and easy to communicate. Many innovations in computation have led to wide applicability as well as wide acceptance, not only from statisticians but from other scientists. Within the Bayesian paradigm it is relatively easy to use information as well as to solve real-life decision problems. Also, we can now construct our priors with a fair amount of confidence as to what they represent, to what extent they use subjective prior information, and to what extent they are part of an algorithm to produce a posterior. The fact that there are no paradoxes or counterexamples suggests the logical foundations are secure in spite of rapid, vigorous growth, especially in the past two decades. The advantage of a strong logical foundation is that it makes the subject a discipline rather than a collection of methods, like data analysis. It also allows new problems to be approached systematically and therefore with relative ease. Though based on subjective ideas, the paradigm accepts likelihood, frequentist validation in the real world, and the consequent calibration of probabilities, utilities, and likelihood based methods. In other words, it seems to combine many of the conceptual and methodological strengths of both classical statistics and data analysis, while being free from the logical weaknesses of both. Ultimately, each reader has to make up her own mind, but hopefully even a reader not completely convinced of the superiority of Bayesian analysis will learn much that would be useful in the paradigm of her choice. This book is offered in a spirit of reconciliation and exploration of paradigms, even though from a Bayesian point of view. In many ways the current mutual interaction between the three paradigms is reminiscent of the periods of similar rapid growth in the eighteenth, nineteenth, and early twentieth centuries. We have in mind especially the history of least squares, which began as a data analytic tool and then acquired a probabilistic model in the hands of Gauss, Laplace, and others.
The associated inferential probabilities were simultaneously subjective and frequentist. The interested reader may also want to browse through von Mises (1957).
2.17 Exercises

1. (a) (French (1986)) Three prisoners, A, B, and C, are each held in solitary confinement. A knows that two of them will be hanged and one will be set free, but he does not know who will go free. Therefore, he reasons that he has a 1/3 chance of survival. He asks the guard who will go free, but has no success there. Being an intelligent person, he comes up with the following question for the guard: "If two of us must die, then I know that either B or C must die, and possibly both. Therefore, if you tell me the name of one who is to die, I learn nothing about my own fate; further, because we are kept apart, I cannot reveal it to them. So tell me the name of one of them who is to
die." The guard likes this logic and tells A that C will be hanged. A now argues that either he or B will go free, and so now he has a 1/2 chance of survival. Is this reasoning correct?
(b) There are three chambers, one of which has a prize. The master of ceremonies will give you the prize if you guess the right chamber. You first make a random guess. He then shows you one chamber which is empty. You have the option of sticking with your original guess or switching to the remaining chamber. (The chamber you guessed first has not been opened.) What should you do?

2. Suppose X|μ ~ N(μ, σ²), σ² known, and μ ~ N(η, τ²), η and τ² known.
(a) Show that the joint density g(x, μ) of X and μ can be written as g(x, μ) = π(μ) f(x|μ).
(b) From (a), show that the marginal density m(x) of X is

m(x) = (1/√(2π(τ² + σ²))) exp( −(x − η)² / (2(τ² + σ²)) ),

and the posterior density π(μ|x) of μ given X = x is

π(μ|x) = √((τ² + σ²)/(2πτ²σ²)) exp( −((τ² + σ²)/(2τ²σ²)) ( μ − (τ²x + σ²η)/(τ² + σ²) )² ).
(c) What are the posterior mean and posterior s.d. of μ given X = x?
(d) Instead of a single observation X as above, consider a random sample X₁,…, Xₙ. What is the minimal sufficient statistic, and what is the likelihood function for μ now? Work out (b) and (c) in this case.

3. Let X₁,…, Xₙ be i.i.d. N(μ, σ²), σ² known. Consider testing H₀: μ ≤ μ₀ versus H₁: μ > μ₀.
(a) Compute the P-value. Compare it with the posterior probability of H₀ when μ is assumed to have the uniform prior.
(b) Do the same for a sharp H₀.

4. Refer to Welch's problem, Example 2.6. Follow Fisher's suggestion and calculate P{C₁ covers θ | X₁ − X₂}, and verify that it agrees with the objective Bayes solution with the improper uniform prior for θ.
5. (Berger's version of Welch's problem; see Berger (1985b)). Suppose X₁ and X₂ are i.i.d. having the discrete distribution

Xᵢ = θ − 1/2 with probability 1/2; θ + 1/2 with probability 1/2,

where θ is an unknown real number.
(a) Show that the set C given by

C = {(X₁ + X₂)/2} if X₁ ≠ X₂; {X₁ − 1/2} if X₁ = X₂,

is a 75% confidence set for θ.
(b) Calculate P{C covers θ | X₁ − X₂}.

6. Can the Welch paradox occur if X₁, X₂ are i.i.d. N(θ, 1)?

7. (Newton versus Einstein). In Example 2.12, calculate the Bayes factor BF₀₁ for the given data using the Jeffreys priors.

8. Let X₁,…, Xₙ be i.i.d. Bernoulli(p) (i.e., B(1, p)).
(a) Assume p has the Jeffreys prior. Construct the 100(1 − α)% HPD credible interval for p.
(b) Suppose n = 10 and α = 0.05. Calculate the frequentist coverage probability of the interval in (a) using simulation.

9. Consider the same model as in Problem 8. Derive the minimax estimate of p under squared error loss. Plot and compare the mean squared error of this estimate with that of X̄ for n = 10, 50, 100, and 400. (The minimax estimate seems to do better at least up to n = 100.)

10. Let X₁,…, Xₙ be i.i.d. N(μ, σ²), σ² known. Suppose μ has the N(η, τ²) prior distribution with known η and τ².
(a) Construct the 100(1 − α)% HPD credible interval for μ.
(b) Construct a 100(1 − α)% predictive interval for Xₙ₊₁.
(c) Consider the uniform prior for this problem by letting τ² → ∞. Work out (a) and (b) in this case.

11. (a) Refer to Example 2.6. Let C(X₁, X₂) denote the 100(1 − α)% confidence interval for θ. Assume that θ has the Jeffreys prior. Then show that

P{C(X₁, X₂) covers θ | X₁ − X₂} = P{θ ∈ C(X₁, X₂) | X₁, X₂}.

(b) Recall Example 2.5. Assume that μ has the Jeffreys prior. Then show that Var(μ|x) = Var(X̄|n).
12. Let X₁, ..., Xₙ be i.i.d. N(μ, σ²), σ² unknown. Consider Jeffreys test (Section 2.7.2) for testing H₀: μ = μ₀ versus H₁: μ ≠ μ₀. Consider both the normal and Cauchy priors for μ|σ² under H₁. Suppose X̄ → ∞ and s² is bounded. Compute BF₀₁ under both priors and show that BF₀₁ converges to zero for the Cauchy prior but does not converge to zero for the normal prior.
2.17 Exercises
13. (a) Refer to Example 2.19. Show that the usual MLE of σ² is inconsistent. Correct it to get a consistent estimate. (b) Suppose X₁, ..., X_k are i.i.d. B(n, p). Both n and p are unknown, but only n is of interest, so p is a nuisance parameter (see Berger et al. (1999)). Derive the following likelihoods for n: (i) the profile likelihood, (ii) the conditional likelihood, i.e., that obtained from the conditional distribution of the Xᵢ's given their sum (and n), (iii) the integrated likelihood with respect to the uniform prior, and (iv) the integrated likelihood with respect to the Jeffreys prior. (c) Suppose the observations are (17, 19, 21, 28, 30). Plot and compare the different likelihoods in (b) above, and comment.
14. Suppose X|μ ~ N_p(μ, Σ), Σ known, and μ ~ N_p(η, Γ), η and Γ known. (a) Show that the above probability structure is equivalent to X = μ + ε, ε ~ N_p(0, Σ), μ ~ N_p(η, Γ), ε and μ independent, and Σ, η, Γ known. (b) From (a), show that the joint distribution of X and μ is 2p-variate normal with mean vector (η, η) and dispersion matrix having blocks Cov(X) = Σ + Γ, Cov(μ) = Γ, and Cov(X, μ) = Γ.
(c) From (b) and using multivariate normal theory, show that
μ|X = x ~ N_p( Γ(Σ + Γ)⁻¹x + Σ(Σ + Γ)⁻¹η, Γ − Γ(Σ + Γ)⁻¹Γ ).
(d) What are the posterior mean and posterior dispersion matrix of μ? Construct a 100(1 − α)% HPD credible set for μ. (e) Work out (d) with a uniform prior.
15. Let X₁, ..., X_m and Y₁, ..., Yₙ be independent random samples, respectively, from N(μ₁, σ²) and N(μ₂, σ²), where σ² is known. Construct a 100(1 − α)% credible interval for (μ₁ − μ₂) assuming a uniform prior on (μ₁, μ₂).
16. Let X₁, ..., X_m and Y₁, ..., Yₙ be independent random samples, respectively, from N(μ, σ₁²) and N(μ, σ₂²), where both σ₁² and σ₂² are known. Construct a 100(1 − α)% credible interval for the common mean μ assuming a uniform prior. Show that the frequentist 100(1 − α)% confidence interval leads to the same answer.
17. (Behrens-Fisher problem) Let X₁, ..., X_m and Y₁, ..., Yₙ be independent random samples, respectively, from N(μ₁, σ₁²) and N(μ₂, σ₂²), where all four parameters are unknown, but inference on μ₁ − μ₂ is of interest. To derive a confidence interval for μ₁ − μ₂ and also test H₀: μ₁ = μ₂, the Behrens-Fisher solution is to use the statistic
T = (X̄ − Ȳ) / √(s₁²/m + s₂²/n),
where s₁² = Σᵢ₌₁ᵐ (Xᵢ − X̄)²/(m − 1) and s₂² = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)²/(n − 1).
(a) Show that
s₁²/m + s₂²/n ~ (σ₁²/m + σ₂²/n) χ²_ν/ν
approximately, where ν can be estimated by
ν̂ = (s₁²/m + s₂²/n)² / { s₁⁴/(m²(m − 1)) + s₂⁴/(n²(n − 1)) }.
(Hint: If we want to approximate a weighted sum Σᵢ aᵢWᵢ of independent χ² variables by a multiple of a χ²_ν variable, a method of moments estimate for ν is available; see Satterthwaite (1946) and Welch (1949).) (b) Using (a), justify that T is approximately distributed like a Student's t with ν degrees of freedom under H₀. (c) Show numerically that the 100(1 − α)% confidence interval for μ₁ − μ₂ derived using T is conservative, i.e., its confidence coefficient is always ≥ 1 − α. (See Robinson (1976). A Bayesian solution to the Behrens-Fisher problem is discussed in Chapter 8.)
18. Suppose X₁, X₂, ..., Xₙ are i.i.d. Bernoulli(p) and the prediction loss is squared error. Further, suppose that for all n ≥ 1, the Bayes prediction rule is given by
E(Xₙ₊₁ | X₁, ..., Xₙ) = (α + Σᵢ₌₁ⁿ Xᵢ) / (α + β + n)
for some α > 0 and β > 0. Show that this is possible iff the prior on p is Beta(α, β).
19. Suppose (N₁, ..., N_k) have the multinomial distribution with density
f(n₁, ..., n_k | p) = (n! / (n₁! ··· n_k!)) Πⱼ₌₁ᵏ pⱼ^{nⱼ}, Σⱼ nⱼ = n, Σⱼ pⱼ = 1.
Let p have the Dirichlet prior with density
π(p) ∝ Πⱼ₌₁ᵏ pⱼ^{αⱼ − 1}, αⱼ > 0.
(a) Find the posterior distribution of p. (b) Find the posterior mean vector and the dispersion matrix of p. (c) Construct a 100(1 − α)% HPD credible interval for p₁ and also for p₁ + p₂.
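The Satterthwaite degrees-of-freedom formula in Exercise 17(a) is easy to compute. The sketch below (sample sizes and variances hypothetical) evaluates ν̂ for simulated sample variances and also the "population" value of ν obtained by plugging in σ₁² and σ₂²:

```python
import numpy as np

# Sketch for Exercise 17(a): Satterthwaite's estimate of the degrees of
# freedom nu for s1^2/m + s2^2/n (all numbers hypothetical).
def satterthwaite_df(s1sq, s2sq, m, n):
    num = (s1sq / m + s2sq / n) ** 2
    den = s1sq**2 / (m**2 * (m - 1)) + s2sq**2 / (n**2 * (n - 1))
    return num / den

m, n = 8, 12
sig1sq, sig2sq = 4.0, 1.0                  # hypothetical population variances
rng = np.random.default_rng(1)
# sample variances distributed as sigma^2 * chi^2_{k-1} / (k-1)
s1sq = sig1sq * rng.chisquare(m - 1, 50000) / (m - 1)
s2sq = sig2sq * rng.chisquare(n - 1, 50000) / (n - 1)
nu_hat = satterthwaite_df(s1sq, s2sq, m, n)        # estimated nu
nu_true = satterthwaite_df(sig1sq, sig2sq, m, n)   # plug-in "true" nu
```

A useful check: ν̂ always lies between min(m − 1, n − 1) and m + n − 2.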
20. Let p denote the probability of success with a particular drug for some disease. Consider two different experiments to estimate p. In the first experiment, n randomly chosen patients are treated with this drug, and let X denote the number of successes. In the other experiment, patients are treated with this drug one after another until r successes are observed. In this experiment, let Y denote the total number of patients treated with this drug. (a) Construct 100(1 − α)% HPD credible intervals for p under the U(0, 1) and Jeffreys priors when X = x is observed. (b) Construct 100(1 − α)% HPD credible intervals for p under the U(0, 1) and Jeffreys priors when Y = y is observed. (c) Suppose n = 16, X = 6, r = 6, and Y = 16. Now compare (a) and (b) and comment with reference to LP.
21. Let X₁, ..., Xₙ be i.i.d. N(μ, σ²), σ² known. Suppose we want to test H₀: μ = μ₀ versus H₁: μ ≠ μ₀. Let π₀ = P(H₀) = 1/2, and under H₁ let μ ~ N(μ₀, τ²). Show that, unlike in the case of a one-sided alternative, the P-value and the posterior probability of H₀ can be drastically different here.
22. Let X₁, ..., Xₙ be i.i.d. N(μ, σ²), where both μ and σ² are unknown. Take the prior π(μ, σ²) ∝ σ⁻². Consider testing H₀: μ ≤ μ₀ versus H₁: μ > μ₀. Compute the P-value. Compare it with the posterior probability of H₀. Compute the Bayes factor BF₀₁.
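For Exercise 20(c), the two likelihoods are proportional (both ∝ p⁶(1 − p)¹⁰ for the given data), so a fixed prior such as U(0, 1) yields identical posteriors, while Jeffreys priors, which depend on the model, do not. A sketch (assuming scipy, and recalling that the Jeffreys prior for the negative binomial model is proportional to p⁻¹(1 − p)⁻¹ᐟ²):

```python
from scipy.stats import beta

# Sketch for Exercise 20(c): n = 16, x = 6 (binomial) versus r = 6, y = 16
# (negative binomial).  Both likelihoods are proportional to p^6 (1-p)^10.
x, n = 6, 16
post_unif_binom = beta(x + 1, n - x + 1)      # U(0,1) prior -> Beta(7, 11)
post_unif_negb  = beta(x + 1, n - x + 1)      # same posterior: LP respected
post_jeff_binom = beta(x + 0.5, n - x + 0.5)  # Jeffreys Beta(1/2,1/2) -> Beta(6.5, 10.5)
post_jeff_negb  = beta(x, n - x + 0.5)        # Jeffreys p^{-1}(1-p)^{-1/2} -> Beta(6, 10.5)
```

The Jeffreys-prior posteriors Beta(6.5, 10.5) and Beta(6, 10.5) differ across the two experiments, which is the tension with LP the exercise asks about.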
Utility, Prior, and Bayesian Robustness
We begin this chapter with a discussion of rationality axioms for preference and how one may deduce from them the existence of a utility and a prior. Later we explore how robust or sensitive Bayesian analysis is to the choice of prior, utility, and model. In the process, we introduce and examine various quantitative evaluations of robustness.
3.1 Utility, Prior, and Rational Preference

We have introduced in Chapter 2 problems of estimation and testing as Bayesian decision problems. We recall the components of a general Bayesian decision problem. Let 𝒳 be the sample space, Θ the parameter space, f(x|θ) the density of X, and π(θ) the prior probability density. Moreover, there is a space 𝒜 of actions "a" and a loss function L(θ, a). The decision maker (DM) chooses "a" to minimize the posterior risk
ψ(a|x) = ∫_Θ L(θ, a) π(θ|x) dθ,   (3.1)
where π(θ|x) is the posterior density of θ given x. Note that given the loss function and the prior, there is a natural preference ordering a₁ ≼ a₂ (i.e., a₂ is at least as good as a₁) iff ψ(a₂|x) ≤ ψ(a₁|x). There is a long tradition of foundational study, dating back to Ramsey (1926), in which one starts with such a preference relation on 𝒜 satisfying certain rationality axioms (i.e., axioms modeling rational behavior) like transitivity. It can then be shown that such a relation can only be induced as above via a loss function and a prior, i.e., there exist L and π such that
a₁ ≼ a₂ iff ∫_Θ L(θ, a₂) π(θ) dθ ≤ ∫_Θ L(θ, a₁) π(θ) dθ.   (3.2)
In other words, from an objectively verifiable rational preference relation, one can recover the subjective loss function and prior. If there is no sample data,
then π would qualify as a subjective prior for the DM. If we have data x, a likelihood function f(x|θ) is given, and we are examining a preference relation given x, then one can likewise deduce the existence of L and π such that
a₁ ≼ a₂ iff ∫_Θ L(θ, a₂) π(θ|x) dθ ≤ ∫_Θ L(θ, a₁) π(θ|x) dθ   (3.3)
under appropriate axioms. In Section 3.2, we explore the elicitation or construction of a loss function given certain rational preference relations. In the next couple of sections, we discuss a result showing that we must have a (subjective) prior if our preference among actions satisfies certain axioms of rational behavior. Together, they justify (3.2) and throw some light on (3.3). In the remaining sections we examine different aspects of the sensitivity of Bayesian analysis with respect to the prior. Suppose one thinks of the prior as only an approximate quantification of prior belief. In principle, one would then have a whole family of such priors, all approximately quantifying one's prior belief. How much would the Bayesian analysis change as the prior varies over this class? This is a basic question in the study of Bayesian robustness. It turns out that there are some preference relations weaker than those of Section 3.3 that lead to a situation like the one mentioned above, i.e., one can show the existence of a class of priors such that
a₁ ≼ a₂ iff ∫_Θ L(θ, a₂) π(θ) dθ ≤ ∫_Θ L(θ, a₁) π(θ) dθ   (3.4)
for all π in the given class. This preference relation is only a partial ordering, i.e., not all pairs a₁, a₂ can be ordered. The Bayes rule a(x) minimizing ψ(a|x) also minimizes the integrated risk of decision rules δ(x),
r(π, δ) = ∫_Θ R(θ, δ) π(θ) dθ,
where R(θ, δ) is the risk of δ under θ, namely ∫_𝒳 L(θ, δ(x)) f(x|θ) dx. Given a pair of decision rules, we can define a preference relation
δ₁ ≼ δ₂ iff r(π, δ₂(·)) ≤ r(π, δ₁(·)).   (3.5)
One can examine a converse of (3.5) in the same way as we did with (3.2) through (3.4). One can start with a preference relation that orders decision rules (rather than actions) and look for rationality axioms that would guarantee the existence of L and π. For (3.2), (3.3), and (3.5) a good reference is Schervish (1995) or Ferguson (1967). Classic references are Savage (1954) and DeGroot (1970); other references are given later. For (3.4) a good reference is Kadane et al. (1999). A similar but different approach to subjective probability is via coherence, due to de Finetti (1972). We take this up in Section 3.4.
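For a finite parameter space and a grid of actions, the posterior risk minimization in (3.1) is a direct computation. A minimal sketch with hypothetical numbers:

```python
import numpy as np

# Minimal sketch of posterior risk minimization (3.1) for a discrete
# parameter space; all numbers are hypothetical.
thetas = np.array([0.2, 0.5, 0.8])             # parameter values
prior  = np.array([0.3, 0.4, 0.3])             # pi(theta)
x, n = 3, 5                                    # observed successes in n trials
like = thetas**x * (1 - thetas)**(n - x)       # f(x|theta), up to a constant
post = prior * like / np.sum(prior * like)     # pi(theta|x) by Bayes formula

actions = np.linspace(0, 1, 101)               # candidate estimates of theta
loss = (thetas[None, :] - actions[:, None])**2 # squared error L(theta, a)
post_risk = loss @ post                        # psi(a|x) for each action
bayes_action = actions[np.argmin(post_risk)]
```

Under squared error loss the minimizer is the posterior mean, which the grid search recovers up to the grid spacing.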
3.2 Utility and Loss

It is tempting to think of the loss function L(θ, a) and a utility function u(θ, a) = −L(θ, a) as conceptually mirror images of each other. French and Rios Insua (2000) point out that there can be important differences that depend on the context. In most statistical problems the DM (decision maker) is really trying to learn from data rather than implement a decision in the real world that has monetary consequences. For convenience we refer to these as decision problems of Type 1 and Type 2, respectively. In Type 1 problems, i.e., problems without monetary consequences (see Examples 2.1-2.3), for each θ there is usually a correct decision a(θ) that depends on θ, and L(θ, a) is a measure of how far "a" is from a(θ), or a penalty for deviation from a(θ). In a problem of estimating θ, the correct decision a(θ) is θ itself. Common losses are (a − θ)², |a − θ|, etc. In the problem of estimating τ(θ), a(θ) equals τ(θ) and common losses are (τ(θ) − a)², |τ(θ) − a|, etc. In testing a null hypothesis H₀ against H₁, the 0-1 loss assigns no penalty for a correct decision and a unit penalty for an incorrect decision. In Type 2 problems, there is a similarity with gambles where one must evaluate the consequence of a risky decision. Historically, in such contexts one talks of utility rather than loss, even though either could be used. We consider below an axiomatic approach to the existence of a utility for Type 2 problems, but we use the notation of a statistical decision problem by way of illustration. We follow Ferguson (1967) here as well as in the next section. Let 𝒫 denote the space of all consequences, like (θ, a). It is customary to regard them as non-numerical pay-offs. Let 𝒫* be the space of all probability distributions on 𝒫 that put mass on a finite number of points.
The set 𝒫* represents risky decisions with uncertainty quantified by a known element of 𝒫*. Suppose the DM has a preference relation on 𝒫*, namely a total order, i.e., given any pair p₁, p₂ ∈ 𝒫*, either p₁ ≼ p₂ (p₂ is preferred) or p₂ ≼ p₁ (p₁ is preferred) or both. Suppose also the preference relation is transitive, i.e., if p₁ ≼ p₂ and p₂ ≼ p₃, then p₁ ≼ p₃. We refer the reader to French and Rios Insua (2000) for a discussion of how compelling these conditions are. It is clear that one can embed 𝒫 as a subset of 𝒫* by treating each element of 𝒫 as a degenerate element of 𝒫*. Thus the preference relation is also well-defined on 𝒫. Suppose the relation satisfies axioms H₁ and H₂.
H₁ If p₁, p₂ and q ∈ 𝒫*, and 0 < λ < 1, then p₁ ≼ p₂ if and only if λp₁ + (1 − λ)q ≼ λp₂ + (1 − λ)q.
H₂ If p₁, p₂, p₃ are in 𝒫* and p₁ ≺ p₂ ≺ p₃, then there exist numbers 0 < λ < 1, 0 < μ < 1, such that
λp₃ + (1 − λ)p₁ ≺ p₂ ≺ μp₃ + (1 − μ)p₁.
Ferguson (1967) shows that if H₁ and H₂ hold, then there exists a utility u(·) on 𝒫* such that p₁ ≼ p₂ if and only if u(p₁) ≤ u(p₂), where for p* =
Σᵢ₌₁ᵐ λᵢpᵢ, with λᵢ ≥ 0 and Σᵢ₌₁ᵐ λᵢ = 1, u(p*) is defined to be the average
u(p*) = Σᵢ₌₁ᵐ λᵢ u(pᵢ).   (3.6)
The main idea of the proof, which may also be used for eliciting the utility, is to start with a pair p₁ ≺ p₂, i.e., p₁ ≼ p₂ but p₂ ⋠ p₁. (Here ~ denotes the equivalence relation under which the DM is indifferent between two elements.) Consider all p* with p₁ ≼ p* ≼ p₂. Then by the assumptions H₁ and H₂, one can find 0 ≤ λ* ≤ 1 such that the DM is indifferent between p* and (1 − λ*)p₁ + λ*p₂. One can write λ* = u(p*) and verify that p₁ ≼ p₃* ≼ p₄* ≼ p₂ iff λ₃* = u(p₃*) ≤ u(p₄*) = λ₄*, as well as the relation (3.6) above. For p₂ ≼ p*, by a similar argument one can find 0 < μ* < 1 such that p₂ ~ (1 − μ*)p₁ + μ*p*, from which one gets u(p*) = 1/μ*; set λ* = u(p*) as before. In a similar way, one can find a λ* for p* ≼ p₁ and set u(p*) = λ*. In principle, λ* can be elicited for each p*. Incidentally, the utility is not unique; it is unique up to a change of origin and scale. Our version is chosen so that u(p₁) = 0, u(p₂) = 1. French and Rios Insua (2000) point out that most axiomatic approaches to the existence of a utility first exhibit a utility on 𝒫* and then restrict it to 𝒫, whereas intuitively one would want to define u(·) first on 𝒫 and then extend it to 𝒫*. They discuss how this can be done.
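The indifference-point construction above suggests a simple elicitation scheme: fix a worst and a best consequence with utilities 0 and 1, and search for the mixture weight λ* at which the DM is indifferent between the sure consequence and the gamble. A minimal sketch with a simulated DM (the hidden utility 0.7 is, of course, hypothetical):

```python
# Sketch of utility elicitation via the indifference point:
# u(worst) = 0, u(best) = 1, and u(c) is the weight lam* at which the DM is
# indifferent between c for sure and "best w.p. lam, worst w.p. 1 - lam".
# `prefers` is a hypothetical preference oracle supplied by the DM.
def elicit_utility(prefers, tol=1e-6):
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prefers(mid):
            hi = mid   # gamble preferred: indifference point lies below mid
        else:
            lo = mid
    return (lo + hi) / 2

# simulated DM whose hidden utility for c is 0.7
lam_star = elicit_utility(lambda lam: lam > 0.7)
```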
3.3 Rationality Axioms Leading to the Bayesian Approach¹

¹ Section 3.3 may be omitted at first reading.

Consider a decision problem with all the ingredients discussed in Section 3.1 except the prior. If the sample space and the action space are finite, then the number of decision functions (i.e., functions from 𝒳 to 𝒜) is finite. In this case, the decision maker (DM) may be able to order any pair of given decision functions according to her rational preference of one to the other, taking into account the consequences of actions and all inherent uncertainties. Consider a randomized decision rule defined by
δ = p₁δ₁ + p₂δ₂ + ··· + p_kδ_k,
where δ₁, δ₂, ..., δ_k constitute a listing of all the non-randomized decision functions and (p₁, p₂, ..., p_k) is a probability vector, i.e., pᵢ ≥ 0 and Σᵢ₌₁ᵏ pᵢ = 1. The representation means that for each x, the probability distribution δ(x) on the action space is the same as choosing the action δᵢ(x) with probability pᵢ. Suppose the DM can order any pair of randomized decision functions also in a rational way, and that this ordering reduces to her earlier ordering if the randomized decision functions being compared are in fact non-randomized, with one pᵢ equal to 1 and the other pᵢ's equal to zero. Under certain axioms that we explore below, there exists a prior π(θ) such that δ₂* ≼ δ₁*, i.e., the DM prefers δ₁* to δ₂*, if and only if
r(π, δ₁*) = Σ_{θ,a} π(θ) P_θ(a|δ₁*) L(θ, a) ≤ Σ_{θ,a} π(θ) P_θ(a|δ₂*) L(θ, a) = r(π, δ₂*),
where P_θ(a|δ*) is the probability of choosing the action "a" when θ is the value of the parameter and δ* is used, i.e., using the representation δ* = Σᵢ pᵢδᵢ,
P_θ(a|δ*) = Σ_x f(x|θ) Σᵢ pᵢ Iᵢ(x),
and Iᵢ is the indicator function
Iᵢ(x) = 1 if δᵢ(x) = a; 0 if δᵢ(x) ≠ a.
We need to work a little to move from here to the starting point of Ferguson (1967). As far as the preference is concerned, it is only the risk function of δ that matters. Also, δ appears in the risk function only through P_θ(a|δ), which, for each δ, is a probability distribution on the action space. Somewhat trivially, for each θ₀ ∈ Θ, one can also think of it as a probability distribution q on the space 𝒫 of all (θ, a), θ ∈ Θ, a ∈ 𝒜, such that
q(θ, a) = P_{θ₀}(a|δ) if θ = θ₀; q(θ, a) = 0 if θ ≠ θ₀.
As in Section 3.2, let 𝒫* be the set of probability distributions putting probability on a finite number of points in 𝒫. The DM can think of the choice of a δ as a somewhat abstract gamble with pay-off (P_θ(a₁|δ), P_θ(a₂|δ), ...) if θ is true. This pay-off sits on (θ, a₁), (θ, a₂), .... Let 𝒢 be the set of all gambles of this form [p₁, ..., p_m], where pᵢ is a probability distribution on 𝒫 that is the pay-off corresponding to the ith point θᵢ in Θ = {θ₁, θ₂, ..., θ_m}. Further, let 𝒢* be the set of all probability distributions putting mass on a finite number of points in 𝒢. The DM can embed her δ in 𝒢, and suppose she can extend her preference relation to 𝒢 and 𝒢*. If axioms H₁ and H₂ of Section 3.2 hold, then there exists a utility function u_𝒢 on 𝒢* that induces the preference relation ≼_𝒢 on gambles [p₁, ..., p_m].
A₂ If p ≺ p′, then [p, ..., p] ≺ [p′, ..., p′].
To proceed further, we need one more assumption, A₃ of Ferguson (1967). If p₁, ..., p_k are elements of 𝒫 and λ₁, ..., λ_k are non-negative numbers adding up to 1, let (λ₁p₁, ..., λ_kp_k) denote the element of 𝒫* that chooses pay-off pᵢ with probability λᵢ, 1 ≤ i ≤ k. Then A₃ is given by
A₃ (λ₁[p₁₁, ..., p₁ₘ], ..., λ_k[p_k₁, ..., p_kₘ]) ~_𝒢 [(λ₁p₁₁, ..., λ_kp_k₁), ..., (λ₁p₁ₘ, ..., λ_kp_kₘ)],
where ~_𝒢 denotes equivalence under the preference relation on 𝒢*. Then, under these three assumptions, it is shown by Ferguson that ≼_𝒢 is induced by a prior π(θ) and the loss function L(θ, a) as indicated in Section 3.1. The need to extend the preference relation on the space of decision functions to all pairs of elements of 𝒢* is somewhat artificial. It is of course true that in many practical decision problems the space 𝒢* would occur naturally. For example, even in a statistical problem, if the loss or utility arising from the combination (θ, a) doesn't depend on θ, then the extension to 𝒢* would be relatively natural. An illuminating and penetrating discussion of various sets of axioms leading to the existence of utility and prior appears in Chapter 2 of French and Rios Insua (2000). They also provide references to a huge literature and a brief survey.
3.4 Coherence

There is an alternative way of justifying a Bayesian approach to decision making, based on the notion of coherence as modified by Freedman and Purves (1969) and Heath and Sudderth (1978). Coherence was originally introduced by de Finetti to show that any quantification of uncertainty that does not satisfy the axioms of a (finitely additive) probability distribution would lead to sure loss in suitably chosen gambles. This is treated in Appendix C. To return to coherence in the context of decision making, suppose A stands for a set in the space of θ and x values, and A_x = {θ : (θ, x) ∈ A}. Given x, the DM's uncertainty about A is given by q(x, A_x). An MC (master of ceremonies) chooses a betting system (A, b), where A is as above and b is a bounded real-valued function of x. The DM accepts the gamble with pay-off
ψ(θ, x) = b(x)[I_A(θ, x) − q(x, A_x)].
She gets b(x)[1 − q(x, A_x)] or pays b(x)q(x, A_x) according as θ lies in A_x or not. The expected pay-off is
E(θ) = ∫ ψ(θ, x) p(dx|θ).
If she accepts k such gambles, defined as above by (A₍₁₎, b₍₁₎), ..., (A₍ₖ₎, b₍ₖ₎), then her expected pay-off is the sum of the k expected pay-offs. She will face sure loss if
sup_θ Σᵢ₌₁ᵏ E₍ᵢ₎(θ) < 0.
The idea is that if q reflects her uncertainty about θ, then this combination of bets is fair and so acceptable to her. However, any rational choice of q should avoid sure loss as defined above. Such a choice is said to be coherent if no finite combination of acceptable bets can lead to sure loss. The basic result of Freedman and Purves (1969) and Heath and Sudderth (1978) is that in order to be coherent, the DM must act like a Bayesian with a (finitely additive) prior, and q must be the resulting posterior. A similar result is proved by Berti et al. (1991).
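A tiny numerical illustration of sure loss in this sense: if the DM's assessments of an event and its complement do not sum to 1, the MC can pick stakes b (possibly negative) so that the pay-off b(x)[I_A − q] summed over the two bets is negative whatever happens. A sketch with hypothetical numbers:

```python
# Sure-loss sketch: the DM's assessments of an event A and its complement
# sum to less than 1 (incoherent).  With the pay-off b*(1_A - q) for each
# bet, the MC chooses a negative stake and the DM loses in both states.
qA, qAc = 0.4, 0.4         # incoherent: qA + qAc = 0.8 < 1
b = -1.0                   # stake chosen by the MC (may be negative)
payoff_if_A     = b * (1 - qA) + b * (0 - qAc)   # theta in A
payoff_if_not_A = b * (0 - qA) + b * (1 - qAc)   # theta not in A
```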
3.5 Bayesian Analysis with Subjective Prior

We have already discussed the basics of subjective prior Bayesian inference in Chapter 2. In the following, we shall concentrate on some issues related to the robustness of Bayesian inference. The notation used will be mostly as given in Chapter 2, but some of it will be recalled and a few additional notations will be introduced here as needed. Let 𝒳 be the sample space and Θ the parameter space. As before, suppose X has (model) density f(x|θ) and θ has (prior) probability density π(θ). Then the joint density of (X, θ), for x ∈ 𝒳 and θ ∈ Θ, is h(x, θ) = f(x|θ)π(θ). The marginal density of X corresponding with this joint density is
m_π(x) = m(x|π) = ∫_Θ f(x|θ) dπ(θ).
Note that this can be expressed as
m_π(x) = ∫_Θ f(x|θ)π(θ) dθ if θ is continuous; m_π(x) = Σ_{θ∈Θ} f(x|θ)π(θ) if θ is discrete.
Often we shall use m(x) for m_π(x), especially if the prior π being used is clear from the context. Recall that the posterior density of θ given x is given by
π(θ|x) = h(x, θ)/m_π(x) = f(x|θ)π(θ)/m_π(x).
The posterior mean and posterior variance with respect to prior π will be denoted by E^π(θ|x) and V^π(θ|x), respectively. Similarly, the posterior probability of a set A ⊂ Θ given x will be denoted by P^π(A|x).
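The quantities m_π(x) and π(θ|x) defined above can be computed by direct numerical integration and, in conjugate cases, checked against closed forms. A sketch for the normal model with a normal prior (all numbers hypothetical, assuming scipy):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Sketch: m_pi(x) and pi(theta|x) by direct integration, checked against
# the closed-form normal-normal answer (numbers hypothetical).
sigma, mu0, tau = 1.0, 0.0, 2.0   # f(x|theta) = N(theta, sigma^2), pi = N(mu0, tau^2)
x = 1.5

joint = lambda th: norm.pdf(x, th, sigma) * norm.pdf(th, mu0, tau)
m_x, _ = quad(joint, -np.inf, np.inf)     # marginal m_pi(x)
post = lambda th: joint(th) / m_x         # posterior density by Bayes formula

# closed form: X is marginally N(mu0, sigma^2 + tau^2)
m_closed = norm.pdf(x, mu0, np.sqrt(sigma**2 + tau**2))
```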
3.6 Robustness and Sensitivity

Intuitively, robustness means lack of sensitivity of the decision or inference to assumptions in the analysis that involve a certain degree of uncertainty. In an inference problem, the assumptions usually involve the choice of the model and the prior, whereas in a decision problem there is the additional assumption involving the choice of the loss or utility function. An analysis to measure this sensitivity is called a sensitivity analysis. Clearly, robustness with respect to all three of these components is desirable; that is to say, reasonable variations from the choices of model, prior, and loss function used in the analysis should not lead to unreasonable variations in the conclusions arrived at. We shall not, however, discuss robustness with respect to the model and loss function here in any great detail. Instead, we would like to mention that there is a substantial literature on this, and references can be found in sources such as Berger (1984, 1985a, 1990, 1994), Berger et al. (1996), Kadane (1984), Leamer (1978), Rios Insua and Ruggeri (2000), and Wasserman (1992). Because justification from the viewpoint of rational behavior is usually desired for inferential procedures, we would like to cite the work of Nobel laureate Kahneman on Bayesian robustness here. In his joint work with Tversky (see Tversky et al. (1981) and Kahneman et al. (1982)), it was shown in psychological studies that seemingly inconsequential changes in the formulation of choice problems caused significant shifts of preference. These 'inconsistencies' were traced to all the components of decision making. This probably means that robustness of inference cannot be taken for granted but needs to be earned. The following example illustrates why sensitivity to the choice of prior can be an important consideration.

Example 3.1. Suppose we observe X, which follows a Poisson(θ) distribution. Further, it is felt a priori that θ has a continuous distribution with median 2 and upper quartile 4, i.e., P^π(θ ≤ 2) = 0.5 = P^π(θ > 2) and P^π(θ > 4) = 0.25. If these are the only prior inputs available, the following three are candidates for such a prior:
(i) π₁: θ ~ exponential(a) with a = log(2)/2;
(ii) π₂: log(θ) ~ N(log(2), (log(2)/z₀.₂₅)²), where z₀.₂₅ = 0.675 is the upper quartile of N(0, 1); and
(iii) π₃: log(θ) ~ Cauchy(log(2), log(2)).
Then (i) under π₁, θ|x ~ Gamma(x + 1, a + 1) (shape x + 1, rate a + 1), so that the posterior mean is (x + 1)/(a + 1); (ii) under π₂, if we let γ = log(θ) and τ = log(2)/z₀.₂₅ = log(2)/0.675, we obtain
E^{π₂}(θ|x) = E^{π₂}(exp(γ)|x) = [∫ exp(−e^γ) exp(γ(x + 1)) exp(−(γ − log 2)²/(2τ²)) dγ] / [∫ exp(−e^γ) exp(γx) exp(−(γ − log 2)²/(2τ²)) dγ];
and (iii) under π₃, again letting γ = log(θ), we get
Table 3.1. Posterior Means under π₁, π₂, and π₃

 x    0     1     2     3     4     5     10     15     20     50
 π₁  .749 1.485 2.228 2.971 3.713 4.456  8.169 11.882 15.595 37.874
 π₂  .950 1.480 2.106 2.806 3.559 4.353  8.660 13.241 17.945 47.017
 π₃  .761 1.562 2.094 2.633 3.250 3.980  8.867 14.067 19.178 49.402
E^{π₃}(θ|x) = E^{π₃}(exp(γ)|x) = [∫ exp(−e^γ) exp(γ(x + 1)) {1 + ((γ − log 2)/log 2)²}⁻¹ dγ] / [∫ exp(−e^γ) exp(γx) {1 + ((γ − log 2)/log 2)²}⁻¹ dγ].
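The posterior means of this example can be reproduced approximately by one-dimensional quadrature over γ = log(θ), using the ratio formulas displayed above; the integration limits below are illustrative choices, assuming scipy:

```python
import numpy as np
from scipy.integrate import quad

a = np.log(2) / 2            # rate of the exponential prior pi_1
tau = np.log(2) / 0.675      # scale of the normal prior on log(theta)

def post_mean_pi1(x):
    # Gamma(x + 1, a + 1) posterior: mean (x + 1)/(a + 1)
    return (x + 1) / (a + 1)

def post_mean_pi2(x):
    # lognormal prior: ratio of integrals over gamma = log(theta)
    kern = lambda g, k: np.exp(-np.exp(g) + g * k
                               - (g - np.log(2))**2 / (2 * tau**2))
    num, _ = quad(kern, -20, 20, args=(x + 1,))
    den, _ = quad(kern, -20, 20, args=(x,))
    return num / den

def post_mean_pi3(x):
    # log-Cauchy prior: same ratio with the Cauchy kernel
    kern = lambda g, k: (np.exp(-np.exp(g) + g * k)
                         / (1 + ((g - np.log(2)) / np.log(2))**2))
    num, _ = quad(kern, -20, 20, args=(x + 1,))
    den, _ = quad(kern, -20, 20, args=(x,))
    return num / den
```

These functions should reproduce the tabulated posterior means closely, e.g., post_mean_pi1(50) agrees with the π₁ entry at x = 50.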
To see if the choice of prior matters, simply examine the posterior means under the three different priors in Table 3.1. For small or moderate x (x < 10), there is robustness: the choice of prior does not seem to matter too much. For large values of x, the choice does matter. The inference that a conjugate prior obtains is then quite different from what a heavier-tailed prior would obtain. It is now clear that there are situations where it does matter what prior one chooses from a class of priors, each of which is considered reasonable given the available prior information. The above example indicates that there is no escape from investigating prior robustness formally. How does one then reconcile this with the single prior Bayesian argument? It is certainly true that if one has a utility/loss function and a prior distribution, there are compelling reasons for a Bayesian analysis using these. However, this assumes the existence of these two entities, and so it is of interest to know whether one can justify the Bayesian viewpoint for statistics without this assumption. Various axiomatic systems for statistics can be developed (see Fishburn (1981)) involving a preference ordering for statistical procedures together with a set of axioms that any 'coherent' preference ordering must satisfy. Justification for the Bayesian approach then follows from the fact that any rational or coherent preference ordering corresponds to a Bayesian preference ordering (see Berger (1985a)). This means that there must be a loss function and a prior distribution such that this axiom system is compatible with the Bayesian approach corresponding to these. However, even then there are no compelling reasons to be a die-hard single prior Bayesian. The reason is that it is impractical to arrive at a total preference ordering.
If we stop short of this and are only able to come up with a partial preference ordering (see Seidenfeld et al. (1995) and Kadane et al. (1999)), the result will be a Bayesian analysis (again) using a class of prior distributions (and a class of utilities). This is the philosophical justification for a "robust Bayesian" as noted in Berger's book (Berger (1985a)). One could,
of course, argue that a second stage of prior on the class Γ of possible priors is the natural solution for arriving at a single prior, but it is not clear how to arrive at this second-stage prior.
3.7 Classes of Priors

There is a vast literature on how to choose a class Γ of priors to model prior uncertainty appropriately. The goals (see Berger (1994)) are clearly (i) to ensure that as many 'reasonable' priors as possible are included, (ii) to try to eliminate 'unreasonable' priors, (iii) to ensure that Γ does not require prior information that is difficult to elicit, and (iv) to be able to compute measures of robustness without much difficulty. As can be seen, (i) is needed to ensure robustness and (ii) to ensure that one does not erroneously conclude lack of robustness. These are competing goals, and hence they can only be given weights appropriate in the given context. The following example from Berger (1994) is illuminating.

Example 3.2. Suppose θ is a real-valued parameter, prior beliefs about which indicate that it should have a continuous prior distribution, symmetric about 0 and having its third quartile, Q₃, between 1 and 2. Consider, then,
Γ₁ = {N(0, τ²): 2.19 ≤ τ² ≤ 8.76} and
Γ₂ = {symmetric priors with 1 ≤ Q₃ ≤ 2}.
Even though Γ₁ can be appropriate in some cases, it will mostly be considered "rather small" because it contains only sharp-tailed distributions. On the other hand, Γ₂ will typically be "too large," containing priors some of whose shapes will be considered unreasonable. Starting with Γ₂ and imposing reasonable constraints such as unimodality on the priors can lead to sensible classes such as
Γ₃ = {unimodal symmetric priors with 1 ≤ Q₃ ≤ 2} ⊂ Γ₂.
It will be seen that computing measures of robustness is not very difficult for any of these three classes.

3.7.1 Conjugate Class

The class consisting of conjugate priors (discussed in some detail in Chapter 5) is one of the easiest classes of priors to work with. If X ~ N(θ, σ²) with known σ², the conjugate priors for θ are the normal priors N(μ, τ²). So one could consider
Γ_C = {N(μ, τ²): μ₁ ≤ μ ≤ μ₂, τ₁² ≤ τ² ≤ τ₂²}
for some specified values of μ₁, μ₂, τ₁², and τ₂². The advantage of the conjugate class is that posterior quantities can be calculated in closed form
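Minimizing and maximizing a posterior quantity over Γ_C is then a simple search over (μ, τ²); a grid-based sketch for the posterior mean, with hypothetical bounds:

```python
import numpy as np

# Grid sketch: range of the posterior mean over the conjugate class
# Gamma_C = {N(mu, tau^2): -1 <= mu <= 1, 0.5 <= tau^2 <= 4}
# (bounds hypothetical), for X ~ N(theta, sigma^2), sigma^2 = 1, x = 2.
sigma2, x = 1.0, 2.0
mu_grid   = np.linspace(-1.0, 1.0, 201)
tau2_grid = np.linspace(0.5, 4.0, 201)

M, T = np.meshgrid(mu_grid, tau2_grid)
post_mean = (T / (T + sigma2)) * x + (sigma2 / (T + sigma2)) * M
lo, hi = post_mean.min(), post_mean.max()
```

Since the posterior mean is monotone in μ for fixed τ² (and, here, monotone in τ² as well because x lies above the range of μ), the extremes occur at corners of the hyperparameter rectangle.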
(for natural conjugate priors). In the above case, if θ ~ N(μ, τ²), then θ|X = x ~ N(μ*(x), δ²), where
μ*(x) = (τ²/(τ² + σ²))x + (σ²/(τ² + σ²))μ and δ² = τ²σ²/(τ² + σ²).
Minimizing and maximizing posterior quantities then becomes an easy task (see Leamer (1978), Leamer (1982), and Polasek (1985)). The crucial drawback of the conjugate class is that it is usually "too small" to provide robustness. Further, the tails of these prior densities are similar to those of the likelihood function, and hence prior moments greatly influence posterior inferences. Thus, even when the data are in conflict with the specified prior information, the conjugate priors used can have a very pronounced effect (which may be undesirable if the data are to be trusted more). Details on this can be found in Berger (1984, 1985a, 1994). It must be added here that mixtures of conjugate priors, on the other hand, can provide robust inferences. In particular, the Student's t prior, which is a scale mixture of normals and has flat tails, can be a good choice in some cases. We discuss some of these details later (see Section 3.9).

3.7.2 Neighborhood Class

If π₀ is a single elicited prior, then uncertainty in this elicitation can be modeled using the class
Γ_N = {π : π is in a neighborhood of π₀}.
A natural and well-studied class is the ε-contamination class,
Γ_ε = {π : π = (1 − ε)π₀ + εq, q ∈ 𝒬},
ε reflecting the uncertainty in π₀ and 𝒬 specifying the contaminations. Some choices for 𝒬 are: all distributions q; all unimodal distributions with mode θ₀; and all unimodal symmetric distributions with mode θ₀. The ε-contamination class with an appropriate choice of 𝒬 can provide good robustness, as we will see later.

3.7.3 Density Ratio Class

Assuming the existence of densities for all the priors in the class, the density ratio class is defined as
Γ_DR = {π : L(θ) ≤ απ(θ) ≤ U(θ) for some α > 0}
for specified non-negative functions L and U (see DeRobertis and Hartigan (1981)). If we take L = 1 and U = c, then we get
Γ_DR = {π : c⁻¹ ≤ π(θ)/π(θ′) ≤ c for all θ, θ′}.
Some other classes have also been studied. For example, the sub-sigma-field class is obtained by defining the prior on a sub-sigma-field of sets. See Berger (1990) for details and references. Because many distributions are determined by their moments, once the distributional form is specified, bounds are sometimes specified on the moments to arrive at a class of priors (see Berger (1990)).
3.8 Posterior Robustness: Measures and Techniques

Measures of sensitivity are needed to examine the robustness of inference procedures (or decisions) when a class Γ of priors is under consideration. In recent years two types of such measures have been studied: global measures of sensitivity, such as the range of posterior quantities, and local measures, such as derivatives (in a sense to be made clear later) of these quantities. Attempts have also been made to derive robust priors and robust procedures using these measures.

3.8.1 Global Measures of Sensitivity

Example 3.3. Suppose X₁, X₂, ..., Xₙ are i.i.d. N(θ, σ²), with σ² known, and let Γ be the class of all N(0, τ²), τ² > 0, priors for θ. Then the variation in the posterior mean is simply (inf_{τ²>0} E(θ|x), sup_{τ²>0} E(θ|x)). Because, for fixed τ², E(θ|x) = (τ²/(τ² + σ²/n)) x̄, this range can easily be seen to be (0, x̄) or (x̄, 0) according as x̄ > 0 or x̄ < 0. If x̄ is small in magnitude, this range will be small. Thus the robustness of using the posterior mean as the Bayes estimate of θ depends crucially on the magnitude of the observed value of x̄.

As can be seen from the above example, a natural global measure of sensitivity of a Bayesian quantity to the choice of prior is the range of this quantity as the prior varies in the class of priors of interest. Further, as explained in Berger (1990), there are typically three categories of Bayesian quantities of interest.

(i) Linear functionals of the prior: ρ(π) = ∫_Θ h(θ) π(dθ), where h is a given function. If h is taken to be the likelihood function l, we get an important linear functional, the marginal density of the data, i.e., m(π) = ∫_Θ l(θ) π(dθ).

(ii) Ratios of linear functionals of the prior: ρ(π) = (1/m(π)) ∫_Θ h(θ) l(θ) π(dθ) for some given function h. If we take h(θ) = θ, ρ(π) is the posterior mean. For h(θ) = I_C(θ), the indicator function of the set C, we get the posterior probability of C.

(iii) Ratios of nonlinear functionals: ρ(π) = (1/m(π)) ∫_Θ h(θ, φ(π)) l(θ) π(dθ) for some given h. For h(θ, μ(π)) = (θ − μ(π))², where μ(π) is the posterior mean, ρ(π) is the posterior variance.

Note that extreme values of linear functionals of the prior as it varies in a class Γ are easy to compute if the extreme points of Γ can be identified.
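The range computation in Example 3.3 is easy to sketch numerically. In the snippet below (the data summary x̄, n, and σ² are hypothetical illustration values), scanning τ² over a wide grid confirms that the posterior mean sweeps out essentially the whole interval (0, x̄):

```python
# Range of the posterior mean in Example 3.3: X_i ~ N(theta, sigma^2) i.i.d.,
# prior theta ~ N(0, tau^2).  For fixed tau^2 the posterior mean is
#   E(theta | x) = tau^2 / (tau^2 + sigma^2 / n) * xbar,
# which runs from 0 (tau^2 -> 0) to xbar (tau^2 -> infinity).

def posterior_mean(xbar, n, sigma2, tau2):
    return tau2 / (tau2 + sigma2 / n) * xbar

xbar, n, sigma2 = 1.5, 10, 4.0                        # hypothetical data summary
grid = [10 ** (k / 10.0) for k in range(-60, 61)]     # tau^2 from 1e-6 to 1e6
means = [posterior_mean(xbar, n, sigma2, t2) for t2 in grid]

lo, hi = min(means), max(means)
print(lo, hi)   # close to the theoretical range (0, xbar)
```

As τ² → 0 the prior dominates and the estimate collapses to the prior mean 0; as τ² → ∞ it approaches the MLE x̄.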
Example 3.4. Suppose X ~ N(θ, σ²), with σ² known, and the class Γ of interest is

Γ_SU = {all symmetric unimodal distributions with mode θ₀}.

Then, φ denoting the standard normal density,

m(π) = ∫ (1/σ) φ((x − θ)/σ) π(θ) dθ.

Note that any unimodal symmetric (about θ₀) density π is a mixture of uniform densities symmetric about θ₀. Thus the extreme points of Γ_SU are the U(θ₀ − r, θ₀ + r) distributions. Therefore,

inf_{π∈Γ_SU} m(π) = inf_{r>0} (1/2r) ∫_{θ₀−r}^{θ₀+r} (1/σ) φ((x − θ)/σ) dθ
                 = inf_{r>0} (1/2r) {Φ((x − θ₀ + r)/σ) − Φ((x − θ₀ − r)/σ)},   (3.8)

sup_{π∈Γ_SU} m(π) = sup_{r>0} (1/2r) {Φ((x − θ₀ + r)/σ) − Φ((x − θ₀ − r)/σ)}.   (3.9)

In empirical Bayes problems (to be discussed later), for example, maximization of the above kind is needed to select a prior. This is called Type II maximum likelihood (see Good (1965)).

To study ratios of linear functionals, the following results from Sivaganesan and Berger (1989) are useful.

Lemma 3.5. Suppose C_T is a set of probability measures on the real line given by C_T = {ν_t : t ∈ T}, T ⊂ ℝ, and let C be the convex hull of C_T. Further suppose h₁ and h₂ are real-valued functions defined on ℝ such that ∫ |h_i(x)| dF(x) < ∞ for all F ∈ C, and K + h₂(x) > 0 for all x for some constant K. Then, for any k,

sup_{F∈C} (k + ∫ h₁(x) dF(x)) / (K + ∫ h₂(x) dF(x)) = sup_{t∈T} (k + ∫ h₁(x) ν_t(dx)) / (K + ∫ h₂(x) ν_t(dx)),   (3.10)

inf_{F∈C} (k + ∫ h₁(x) dF(x)) / (K + ∫ h₂(x) dF(x)) = inf_{t∈T} (k + ∫ h₁(x) ν_t(dx)) / (K + ∫ h₂(x) ν_t(dx)).   (3.11)
Proof. Because ∫ h₁(x) dF(x) = ∫ h₁(x) ∫_T ν_t(dx) μ(dt) for some probability measure μ on T, using Fubini's theorem,

k + ∫ h₁(x) dF(x) = ∫ (k + h₁(x)) ∫_T ν_t(dx) μ(dt) = ∫_T ( ∫ (k + h₁(x)) ν_t(dx) ) μ(dt),

and similarly K + ∫ h₂(x) dF(x) = ∫_T ( ∫ (K + h₂(x)) ν_t(dx) ) μ(dt).
Therefore,

(k + ∫ h₁(x) dF(x)) / (K + ∫ h₂(x) dF(x)) ≤ sup_{t∈T} ( ∫ (k + h₁(x)) ν_t(dx) ) / ( ∫ (K + h₂(x)) ν_t(dx) ).

However, because C ⊃ C_T,

sup_{F∈C} (k + ∫ h₁(x) dF(x)) / (K + ∫ h₂(x) dF(x)) ≥ sup_{t∈T} ( ∫ (k + h₁(x)) ν_t(dx) ) / ( ∫ (K + h₂(x)) ν_t(dx) ).

Hence the proof for the supremum; the proof for the infimum is along the same lines. □

Theorem 3.6. Consider the class Γ_SU of all symmetric unimodal prior distributions with mode θ₀. Then

sup_{π∈Γ_SU} E^π(g(θ)|x) = sup_{r>0} ( ∫_{θ₀−r}^{θ₀+r} g(θ) f(x|θ) dθ ) / ( ∫_{θ₀−r}^{θ₀+r} f(x|θ) dθ ),   (3.12)

inf_{π∈Γ_SU} E^π(g(θ)|x) = inf_{r>0} ( ∫_{θ₀−r}^{θ₀+r} g(θ) f(x|θ) dθ ) / ( ∫_{θ₀−r}^{θ₀+r} f(x|θ) dθ ).   (3.13)

Proof. Note that

E^π(g(θ)|x) = ( ∫ g(θ) f(x|θ) π(θ) dθ ) / ( ∫ f(x|θ) π(θ) dθ ),

where f(x|θ) is the density of the data x. Now Lemma 3.5 can be applied upon recalling that any unimodal symmetric distribution is a mixture of symmetric uniform distributions. □

Example 3.7. Suppose X|θ ~ N(θ, σ²) and robustness of the posterior mean with respect to Γ_SU is of interest. The range of the posterior mean over this class can then be computed easily using Theorem 3.6. We thus obtain

sup_{π∈Γ_SU} E^π(θ|x) = sup_{r>0} ( ∫_{θ₀−r}^{θ₀+r} θ (1/σ) φ((x − θ)/σ) dθ ) / ( ∫_{θ₀−r}^{θ₀+r} (1/σ) φ((x − θ)/σ) dθ )
                     = x + σ sup_{r>0} ( φ((x − θ₀ + r)/σ) − φ((x − θ₀ − r)/σ) ) / ( Φ((x − θ₀ + r)/σ) − Φ((x − θ₀ − r)/σ) ),

and similarly,

inf_{π∈Γ_SU} E^π(θ|x) = x + σ inf_{r>0} ( φ((x − θ₀ + r)/σ) − φ((x − θ₀ − r)/σ) ) / ( Φ((x − θ₀ + r)/σ) − Φ((x − θ₀ − r)/σ) ).
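Theorem 3.6 makes the computation in Example 3.7 mechanical: the posterior mean under each uniform extreme point U(θ₀ − r, θ₀ + r) is available in closed form, and the range over Γ_SU is obtained by scanning r. A sketch with hypothetical values x = 2, θ₀ = 0, σ = 1, cross-checked against direct numerical integration:

```python
from math import erf, exp, pi, sqrt

def Phi(z):  # standard normal cdf
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def phi(z):  # standard normal density
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

x, theta0, sigma = 2.0, 0.0, 1.0   # hypothetical observation and prior mode

def post_mean_uniform(r):
    """Posterior mean under the extreme-point prior U(theta0 - r, theta0 + r)."""
    up, lo_ = (x - theta0 + r) / sigma, (x - theta0 - r) / sigma
    return x + sigma * (phi(up) - phi(lo_)) / (Phi(up) - Phi(lo_))

# cross-check the closed form against direct numerical integration at r = 1
r, n = 1.0, 20000
thetas = [theta0 - r + 2 * r * (i + 0.5) / n for i in range(n)]
w = [phi((x - t) / sigma) for t in thetas]
num_mean = sum(t * wi for t, wi in zip(thetas, w)) / sum(w)

rs = [0.01 * 1.2 ** k for k in range(60)]          # r from 0.01 up to ~470
means = [post_mean_uniform(r_) for r_ in rs]
inf_mean, sup_mean = min(means), max(means)        # range over Gamma_SU
```

For x = 2 the scan shows the posterior mean confined to (essentially) the interval (θ₀, x) = (0, 2), the limits corresponding to r → 0 and r → ∞.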
Example 3.8. Suppose X|θ ~ N(θ, σ²) and it is of interest to test H₀ : θ ≤ θ₀ versus H₁ : θ > θ₀. Again, suppose that Γ_SU is the class of priors under consideration and robustness over this class is to be examined. Because
P^π(H₀|x) = P^π(θ ≤ θ₀|x),

we can apply Theorem 3.6 here as well. We get

sup_{π∈Γ_SU} P^π(H₀|x) = sup_{r>0} ( ∫_{θ₀−r}^{θ₀} (1/σ) φ((x − θ)/σ) dθ ) / ( ∫_{θ₀−r}^{θ₀+r} (1/σ) φ((x − θ)/σ) dθ ),

and similarly,

inf_{π∈Γ_SU} P^π(H₀|x) = inf_{r>0} ( ∫_{θ₀−r}^{θ₀} (1/σ) φ((x − θ)/σ) dθ ) / ( ∫_{θ₀−r}^{θ₀+r} (1/σ) φ((x − θ)/σ) dθ ).
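The extremes of this ratio are easy to examine numerically. In the sketch below (hypothetical values x = 2, θ₀ = 0, σ = 1), the supremum over r approaches 1/2 and the infimum approaches the classical P-value:

```python
from math import erf, sqrt

def Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

x, theta0, sigma = 2.0, 0.0, 1.0   # hypothetical observation; H0: theta <= 0

def post_prob_H0(r):
    """P(H0 | x) under the uniform extreme-point prior U(theta0 - r, theta0 + r)."""
    num = Phi((x - theta0 + r) / sigma) - Phi((x - theta0) / sigma)  # mass on (theta0 - r, theta0]
    den = Phi((x - theta0 + r) / sigma) - Phi((x - theta0 - r) / sigma)
    return num / den

rs = [0.001 * 1.3 ** k for k in range(50)]
probs = [post_prob_H0(r) for r in rs]
sup_p, inf_p = max(probs), min(probs)
p_value = Phi((theta0 - x) / sigma)   # classical P-value, about 0.0228
```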
It can be seen that the above bounds are, respectively, 0.5 and α, where α = Φ((θ₀ − x)/σ) is the P-value.

We shall now consider the density ratio class that was mentioned earlier in (3.7) and is given by

Γ_DR = {π : L(θ) ≤ απ(θ) ≤ U(θ) for some α > 0},

for specified non-negative functions L and U. For π ∈ Γ_DR and any real-valued π-integrable function h on the parameter space Θ, let π(h) = ∫_Θ h(θ) π(dθ); for a non-negative function g, write L(g) = ∫_Θ g(θ) L(θ) dθ and U(g) = ∫_Θ g(θ) U(θ) dθ. Further, let h = h⁺ − h⁻ be the usual decomposition of h into its positive and negative parts, i.e., h⁺(u) = max{h(u), 0} and h⁻(u) = max{−h(u), 0}. Then we have the following theorem (see DeRobertis and Hartigan (1981)).

Theorem 3.9. For functions h₁ and h₂ that are integrable with respect to all π ∈ Γ_DR, with h₂ positive a.s., inf_{π∈Γ_DR} π(h₁)/π(h₂) is the unique solution λ of

L((h₁ − λh₂)⁺) − U((h₁ − λh₂)⁻) = 0,   (3.14)

and sup_{π∈Γ_DR} π(h₁)/π(h₂) is the unique solution λ of

U((h₁ − λh₂)⁺) − L((h₁ − λh₂)⁻) = 0.   (3.15)

Proof. Let λ₀ = inf_{π∈Γ_DR} π(h₁)/π(h₂), c₁ = inf_{π∈Γ_DR} π(h₂), and c₂ = sup_{π∈Γ_DR} π(h₂). Then 0 < c₁ ≤ c₂ < ∞ and |λ₀| < ∞. Because L((h₁ − λh₂)⁺) − U((h₁ − λh₂)⁻) = inf_{π∈Γ_DR} π(h₁ − λh₂) for any λ (a minimizing prior places the largest admissible density where h₁ − λh₂ is negative and the smallest where it is positive), note that λ₀ ≥ λ if and only if L((h₁ − λh₂)⁺) − U((h₁ − λh₂)⁻) ≥ 0. The left-hand side is continuous and strictly decreasing in λ, so λ₀ is the unique root of (3.14). A similar argument works for the supremum. □
Example 3.10. Suppose X ~ N(θ, σ²), with σ² known. Consider the class Γ_DR with L being the Lebesgue measure and U = kL, k > 1. Because the posterior mean is

E^π(θ|x) = ( ∫ θ f(x|θ) dπ(θ) ) / ( ∫ f(x|θ) dπ(θ) ) = π(θ f(x|θ)) / π(f(x|θ)),

in the notation of Theorem 3.9 we have that inf_{π∈Γ_DR} E^π(θ|x) is the unique solution λ of

k ∫_{−∞}^{λ} (θ − λ) f(x|θ) dθ + ∫_{λ}^{∞} (θ − λ) f(x|θ) dθ = 0,   (3.16)

and similarly, sup_{π∈Γ_DR} E^π(θ|x) is the unique solution λ of

∫_{−∞}^{λ} (θ − λ) f(x|θ) dθ + k ∫_{λ}^{∞} (θ − λ) f(x|θ) dθ = 0.   (3.17)

Noting that f(x|θ) = (1/σ) φ((θ − x)/σ) and writing γ = (λ − x)/σ, equation (3.17) reduces to

(k − 1) { φ(γ) + γ Φ(γ) } = kγ.   (3.18)

Substituting λ = x − σγ into (3.16) yields, after simplification, exactly the same equation,

(k − 1) { φ(γ) + γ Φ(γ) } = kγ,   (3.19)

implying that once λ₂ = x + σγ is obtained from (3.18), the solution of (3.16) is simply λ₁ = x − σγ. Table 3.2 tabulates γ = γ(k) for various values of k.

Table 3.2. Values of γ(k) for Some Values of k

   k   | 1    1.25    1.5    2      3      4      5      10
  γ(k) | 0    0.089   0.162  0.276  0.436  0.549  0.636  0.901
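Equation (3.18), as reconstructed here, determines γ(k) uniquely: the function (k − 1){φ(γ) + γΦ(γ)} − kγ is strictly decreasing in γ (its derivative is (k − 1)Φ(γ) − k < 0), so simple bisection applies and reproduces the values in Table 3.2:

```python
from math import erf, exp, pi, sqrt

def Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def phi(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def gamma_of_k(k, lo=0.0, hi=5.0, iters=100):
    """Solve (k - 1) * (phi(g) + g * Phi(g)) = k * g for g > 0 by bisection.

    f(0) = (k - 1) * phi(0) > 0 and f is strictly decreasing, so the
    bracket [lo, hi] contains the unique root for the tabulated k.
    """
    f = lambda g: (k - 1.0) * (phi(g) + g * Phi(g)) - k * g
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

table = {k: gamma_of_k(k) for k in (1.25, 1.5, 2, 3, 4, 5, 10)}
```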
What one can easily see from this table is that if, for example, the prior density ratio between any two parameter points is sure to be between 0.5 and 2 (k = 2), then the posterior mean is sure to be within 0.276 standard deviations of x; and if instead the ratio is certain to be between 0.1 and 10 (k = 10), the range is certain to be no more than one standard deviation on either side.

3.8.2 Belief Functions

An entirely different approach to global Bayesian robustness is available through belief functions and plausibility functions. This originated with the introduction of upper and lower probabilities by Dempster (1967, 1968) and has since evolved in various directions, as can be seen from Shafer (1976, 1979), Wasserman (1990), Wasserman and Kadane (1990), and Walley (1991). The terminology of infinitely alternating Choquet capacities is also used in the literature. Imprecise probability is a generic term used in this context, which includes fuzzy logic as well as upper and lower previsions.

Recall that robust Bayesian inference uses a class of plausible prior probability measures. It turns out that associated with a belief function is a convex set of probability measures, of which the belief function is a lower bound and the plausibility function an upper bound. Thus a belief function and a plausibility function can naturally be used to construct a class of prior probability distributions. Some specific details are given below, skipping technicalities and some generality.

Suppose the parameter space Θ is a Euclidean space and D is a convex, compact subset of a Euclidean space. Let μ be a probability measure on D and T a map taking points of D to nonempty closed subsets of Θ. Then for each A ⊂ Θ, define

A_* = {d ∈ D : T(d) ⊂ A}, and A* = {d ∈ D : T(d) ∩ A ≠ ∅}.

Define Bel and Pl on Θ by

Bel(A) = μ(A_*) and Pl(A) = μ(A*).   (3.20)

Then Bel is called a belief function and Pl a plausibility function with source (D, μ, T). Note that 0 ≤ Bel(A) ≤ Pl(A) ≤ 1, Bel(A) = 1 − Pl(A^c) for any A, Bel(Θ) = Pl(Θ) = 1, and Bel(∅) = Pl(∅) = 0.

The above definition may be given the following meaning. If evidence comes from a random draw from D, then Bel(A) may be interpreted as the probability that this evidence implies A is true, whereas Pl(A) can be thought of as the probability that this evidence is consistent with A being true. It can be checked that Bel is a probability measure iff Bel(A) = Pl(A) for all A or, equivalently, iff T(d) is almost surely a singleton set.
Example 3.11. Suppose it is known that the true value of θ lies in a fixed set Θ₀ ⊂ Θ. Set T(d) = Θ₀ for all d ∈ D. Then Bel(A) = 1 if Θ₀ ⊂ A, and Bel(A) = 0 otherwise.

Example 3.12. Suppose P is a probability measure on Θ. Then P is also a belief function with source (Θ, P, T), where T(θ) = {θ}.

A probability measure P is said to be compatible with Bel and Pl if, for each A, Bel(A) ≤ P(A) ≤ Pl(A). Let C be the set of all probability measures compatible with Bel and Pl. Then C ≠ ∅, and for each A,

Bel(A) = inf_{P∈C} P(A) and Pl(A) = sup_{P∈C} P(A).

This indicates that we can use Bel and Pl to construct prior envelopes. In particular, if Bel and Pl arise from available partial prior information, then the set of compatible probability measures, C, is exactly the class of prior distributions that a robust Bayesian analysis requires (compare with (3.4)).

Let h : Θ → ℝ be any bounded, measurable function. Define its upper and lower expectations by

E*(h) = sup_{P∈C} E_P(h) and E_*(h) = inf_{P∈C} E_P(h),   (3.21)

where E_P(h) = ∫_Θ h(θ) P(dθ). If we let

h*(d) = sup_{θ∈T(d)} h(θ) and h_*(d) = inf_{θ∈T(d)} h(θ),

then it can be shown that

E*(h) = ∫_D h*(u) μ(du) and E_*(h) = ∫_D h_*(u) μ(du).   (3.22)
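The constructions in (3.20)-(3.22) can be verified on a small discrete source; everything below (the space Θ, the measure μ, and the multivalued map T) is a hypothetical toy example:

```python
from itertools import chain, combinations

Theta = frozenset({"a", "b", "c"})
# hypothetical source (D, mu, T): points d of D mapped to subsets of Theta
mu = {1: 0.5, 2: 0.3, 3: 0.2}
T = {1: frozenset({"a"}), 2: frozenset({"a", "b"}), 3: Theta}

def Bel(A):   # mu({d : T(d) contained in A})
    return sum(p for d, p in mu.items() if T[d] <= A)

def Pl(A):    # mu({d : T(d) meets A})
    return sum(p for d, p in mu.items() if T[d] & A)

def subsets(S):
    return [frozenset(c) for c in
            chain.from_iterable(combinations(sorted(S), r) for r in range(len(S) + 1))]

# upper and lower expectations of a bounded h via (3.22)
h = {"a": 1.0, "b": 2.0, "c": 5.0}
E_upper = sum(p * max(h[t] for t in T[d]) for d, p in mu.items())
E_lower = sum(p * min(h[t] for t in T[d]) for d, p in mu.items())
```

Here Bel and Pl bracket every compatible probability measure, and the upper and lower expectations follow from (3.22) without any optimization over C.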
Details on these may be found in Wasserman (1990). Based on these ideas, some new techniques for measuring Bayesian robustness can be derived when the prior envelopes arise from belief functions. Suppose Bel is a belief function on Θ with source (D, μ, T) and C is the class of all prior probability measures compatible with Bel. Let L(θ) = f(x|θ) be the likelihood function of θ given the data x, and let L_A(θ) = L(θ)I_A(θ), where I_A is the indicator function of A ⊂ Θ. Then we have the following result and its application from Wasserman (1990).

Theorem 3.13. If L(θ) is bounded and A ⊂ Θ, then

sup_{π∈C} Π(A|x) = E*(L_A) / { E*(L_A) + E_*(L_{A^c}) },   (3.23)

inf_{π∈C} Π(A|x) = E_*(L_A) / { E_*(L_A) + E*(L_{A^c}) }.   (3.24)
Example 3.14. Consider the class of ε-contamination priors:

C = {π : π = (1 − ε)π₀ + εq, q ∈ Q},
where Q is the class of all probability measures on Θ. This neighborhood class C corresponds to the belief function with source (D, μ, T), where D = Θ ∪ {d₀},

μ = (1 − ε)π̄₀ + εδ, and T(d) = {d} if d ∈ Θ; T(d₀) = Θ.

Here δ is a point mass at d₀, and π̄₀ is the probability measure on D giving zero probability to d₀ and identical to π₀ on D − {d₀}. Then, from Theorem 3.13 above,

sup_{π∈C} Π(A|x) = ( (1 − ε) ∫_A L(θ) π₀(dθ) + ε sup_{θ∈A} L(θ) ) / ( (1 − ε) ∫_Θ L(θ) π₀(dθ) + ε sup_{θ∈A} L(θ) ),   (3.25)

inf_{π∈C} Π(A|x) = ( (1 − ε) ∫_A L(θ) π₀(dθ) ) / ( (1 − ε) ∫_Θ L(θ) π₀(dθ) + ε sup_{θ∈A^c} L(θ) ).   (3.26)
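For a normal likelihood with a normal π₀, every ingredient of (3.25) and (3.26) is available in closed form. The sketch below uses hypothetical numbers (x = 1.5, ε = 0.1, π₀ = N(0, 1), A = (−∞, 0]):

```python
from math import erf, exp, pi, sqrt

def Phi(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def phi(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

# hypothetical setup: X | theta ~ N(theta, 1), pi0 = N(0, 1), A = (-inf, 0]
x, eps = 1.5, 0.10

m0 = phi(x / sqrt(2.0)) / sqrt(2.0)   # marginal density of x under pi0: N(0, 2)
int_A = m0 * Phi(-x / sqrt(2.0))      # integral of L(theta) d pi0 over A
sup_L_A = phi(x)                      # sup of L over theta <= 0, at theta = 0 (x > 0)
sup_L_Ac = phi(0.0)                   # sup of L over theta > 0, at theta = x

sup_post = ((1 - eps) * int_A + eps * sup_L_A) / ((1 - eps) * m0 + eps * sup_L_A)
inf_post = ((1 - eps) * int_A) / ((1 - eps) * m0 + eps * sup_L_Ac)
base_post = Phi(-x / sqrt(2.0))       # posterior probability of A under pi0 alone
```

The contamination widens the interval of posterior probabilities around the single-prior value Φ(−x/√2), by an amount governed by ε.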
It may be noted that this gives a different proof of the same result of Berger and Berliner (1986).

3.8.3 Interactive Robust Bayesian Analysis

Following Berger (1994), an interactive scheme for robust Bayesian analysis can be suggested according to the diagram in Figure 3.1. The point to note is that if lack of robustness is evident, then the class Γ of priors obtained from the initial prior inputs has to be shrunk using further prior elicitation. Details of such an approach for shrinking a large quantile class of priors are described in Liseo et al. (1996).
Fig. 3.1. Interactive robust Bayesian scheme: initial prior inputs → inference → sensitivity analysis; if not robust, further prior inputs are elicited and the cycle repeats; if robust, stop.
3.8.4 Other Global Measures

As seen earlier, the size of the range of a posterior quantity needs to be interpreted within the given context. However, some efforts have also been made to derive generic measures. Ruggeri and Sivaganesan (2000) suggest a scaled version of the range for this purpose. Suppose π₀ is a baseline prior, and let the sensitivity of the posterior mean of a target quantity h(θ) to deviations from π₀ be of interest. Let Γ be a class of plausible priors π on Θ. Write ρ^π(x) = E^π(h(θ)|x) and ρ^0(x) = E^{π₀}(h(θ)|x), and let V^π(x) denote the posterior variance of h(θ) under the prior π. Then the relative sensitivity, denoted R_π, is defined as

R_π(x) = ( ρ^π(x) − ρ^0(x) )² / V^π(x).   (3.27)

The motivation for considering R_π is that the posterior variance V^π is a measure of accuracy in the estimation of h(θ), and hence if the squared distance of ρ^π(x) from ρ^0(x) is not too large relative to it, robustness can be expected. The following example, which is essentially from Ruggeri and Sivaganesan (2000), illustrates this idea.

Example 3.15. Let X have the N(θ, 1) distribution, and under π₀, let θ be N(0, 2). Consider the class Γ of all N(0, τ²) priors with 1 ≤ τ² ≤ 10, and consider sensitivity of posterior inferences about h(θ) = θ when x > 0 is observed. Because the posterior distribution of θ given x (under the prior N(0, τ²)) is normal with mean τ²x/(τ² + 1) and variance τ²/(τ² + 1), note that

ρ^π(x) − ρ^0(x) = ( τ²/(τ² + 1) − 2/3 ) x and R_π(x) = (τ² − 2)² x² / ( 9τ²(τ² + 1) ).

It can then be easily checked that the range of ρ^π(x) − ρ^0(x) is 8x/33 and sup_{π∈Γ} R_π(x) = 6.4x²/99. Thus robustness can be expected when the observation x lies in the range 0 < x < 4, but certainly not when x = 10.

3.8.5 Local Measures of Sensitivity

As can be noted from the previous section, unless the class Γ of possible priors is a 'nice' parametric class, or a class whose set of extreme points is easy to work with, the computational complexity of global measures of robustness is high. Furthermore, this 'global' approach can become quite unfeasible for very complicated models. If, for example, X ~ P_θ and θ is p-dimensional, p > 1, then the range of the posterior mean of θ_i may well depend on prior inputs on θ_j for j ≠ i also. If such is the case, global measures of robustness will involve computing ranges of posterior quantities of general functions g(θ) over classes of joint prior distributions of θ. The alternative, which has attracted a lot of attention in recent years, is to study the effects of small perturbations to the prior. This is
called local sensitivity. In this approach also, one may study either the sensitivity of the entire posterior distribution or that of some specified posterior quantity. Let us first consider the former, as in Gustafson and Wasserman (1995). Some additional notation is needed in this section. Let π be a prior probability measure and let π^x denote its corresponding posterior probability measure given the data x, i.e., π^x(dθ) = f(x|θ)π(dθ)/m_π(x), where m_π(x) = ∫_Θ f(x|θ) π(dθ) is the marginal density of the data. Let P be the set of all probability measures on the probability space (Θ, B). A distance function d : P × P → ℝ is needed to quantify changes in prior and posterior measures. Let ν_ε be a perturbation of π in the direction of a measure ν. Then the local sensitivity of π in the direction of ν can be defined (see Gustafson and Wasserman (1995)) by

s(π, ν; x) = lim_{ε↓0} d(π^x, ν_ε^x) / d(π, ν_ε).   (3.28)

Two different types of perturbations ν_ε have been considered. The linear perturbation is defined as ν_ε = (1 − ε)π + εν, and the geometric perturbation by dν_ε ∝ (dν)^ε (dπ)^{1−ε}. (See Gelfand and Dey (1991) for details.) The local sensitivity s(π, ν; x) is simply the rate at which the perturbed posterior ν_ε^x tends to the 'initial' posterior π^x, relative to the change in the prior. As a measure of overall sensitivity over a class Γ of priors one may take s(π, Γ; x) = sup_{ν∈Γ} s(π, ν; x). There are many possible choices for d, the distance measure.

(i) d_TV(π, ν) = sup_{A∈B} |π(A) − ν(A)|, the total variation distance. In this case s(π, ν; x) for linear perturbations turns out to be the norm of a Fréchet derivative. To see this, one starts with the Gateaux differential of the posterior. Let δ = π − ν, ‖δ‖ = d_TV(π, ν), and define T : P → P by T(π) = π^x. The Gateaux differential of T at π in the direction δ is then

T′_π(δ) = lim_{ε↓0} ( T(π − εδ) − T(π) ) / ε = lim_{ε↓0} ( ν_ε^x − π^x ) / ε, because ν_ε^x = (1 − λ)π^x + λν^x,   (3.29)

where λ = λ(ε) = ε m_ν(x) / { (1 − ε)m_π(x) + ε m_ν(x) }. Also, simply note that d_TV(π^x, ν_ε^x) = λ(ε) d_TV(π^x, ν^x). Further, if the likelihood function f(x|θ) is bounded (in θ), then T′_π is a linear map on signed measures such that T(π + δ) = T(π) + T′_π(δ) + o(‖δ‖) as ‖δ‖ → 0, uniformly over all signed measures δ with mass 0 (see Diaconis and Freedman (1986)). Note then that
s(π, Γ; x) = sup_{ν∈Γ} ‖T′_π(π − ν)‖ / ‖π − ν‖,

the norm of the Fréchet derivative restricted to directions in Γ.
(ii) d_φ(π, ν) = ∫ φ(π(θ)/ν(θ)) dν(θ), where φ is a smooth convex function with bounded first and second derivatives near 1 and such that φ(1) = 0. This is the φ-divergence measure of distance (see Csiszar (1978), Goel (1983), and Goel (1986)). Several well-known divergence measures are special cases of the φ-divergence for different convex functions φ; Table 3.3 lists some such φ functions and the corresponding divergence measures. (See Rao (1982) for applications of many of these measures in statistics.)

Consider first the ε-contamination class of priors (or linear perturbations), and note that

s(π, ν; x) = lim_{ε↓0} d_φ(π^x, ν_ε^x) / d_φ(π, ν_ε).

Because d_φ(P, Q) = ∫ φ(dP/dQ) dQ, both d_φ(π, ν_ε) and d_φ(π^x, ν_ε^x) converge to 0 as ε → 0. In fact, we shall see that (d/dε) d_φ(π, ν_ε) and (d/dε) d_φ(π^x, ν_ε^x) also converge to 0 as ε → 0, so that on applying l'Hôpital's rule twice we obtain

s(π, ν; x) = lim_{ε↓0} ( (d²/dε²) d_φ(π^x, ν_ε^x) ) / ( (d²/dε²) d_φ(π, ν_ε) ).   (3.30)

The following theorem then follows from Theorem 3.1 of Dey and Birmiwal (1994).

Table 3.3. φ Functions and the Corresponding Divergence Measures

  φ(x)                                   | Divergence measure
  x log(x)                               | Kullback-Leibler directed divergence
  (x − 1) log(x)                         | J-divergence
  (√x − 1)²                              | Hellinger distance or Kolmogorov's measure of distance
  1 − x^α, 0 < α < 1                     | Generalized Bhattacharya measure
  (x − 1)²                               | Chi-squared divergence or Kagan's measure of distance
  (x^{λ+1} − x)/(λ(λ + 1)), λ ≠ 0, −1    | Power-weighted divergence
Theorem 3.16. Suppose that ∫ ν²(θ)/π(θ) dθ < ∞. Then

s(π, ν; x) = V_{π^x}( ν(θ)/π(θ) ) / V_π( ν(θ)/π(θ) ),   (3.31)

where V_q(h(θ)) = ∫ h²(θ) dq(θ) − ( ∫ h(θ) dq(θ) )².

Proof. In view of (3.30) above, it is enough to establish that

(d²/dε²) d_φ(π^x, ν_ε^x) |_{ε=0} = φ″(1) V_{π^x}( ν(θ)/π(θ) ),   (3.32)

together with the analogous identity (d²/dε²) d_φ(π, ν_ε)|_{ε=0} = φ″(1) V_π(ν(θ)/π(θ)) for the prior.

Recall from (3.29) that ν_ε^x = h(ε)π^x + (1 − h(ε))ν^x, where h(ε) = (1 − ε)m_π(x)/m_{ν_ε}(x) = (1 − ε)m_π(x)/{(1 − ε)m_π(x) + ε m_ν(x)}. Let γ = γ^x(θ) = ν^x(θ)/π^x(θ) and γ_ε = ν_ε^x(θ)/π^x(θ) = h(ε) + (1 − h(ε))γ(θ), so that γ_0 ≡ 1. Differentiating h, we get h′(0) = −m_ν(x)/m_π(x), and hence

(d/dε) γ_ε |_{ε=0} = h′(0)(1 − γ(θ)) = (m_ν(x)/m_π(x)) (γ(θ) − 1) = ν(θ)/π(θ) − m_ν(x)/m_π(x),   (3.33)

where the last equality uses ν^x(θ)/π^x(θ) = (ν(θ)/π(θ)) (m_π(x)/m_ν(x)). Now write

d_φ(π^x, ν_ε^x) = ∫ ψ(γ_ε(θ)) π^x(θ) dθ, where ψ(t) = t φ(1/t),

and note that ψ(1) = φ(1) = 0, ψ′(1) = −φ′(1), and ψ″(1) = φ″(1). Because φ is a smooth function with bounded first and second derivatives near 1, the dominated convergence theorem (DCT) allows differentiation under the integral sign, giving

(d²/dε²) d_φ(π^x, ν_ε^x) |_{ε=0} = ∫ { ψ″(1) ( (d/dε) γ_ε|_{ε=0} )² + ψ′(1) (d²/dε²) γ_ε|_{ε=0} } π^x(θ) dθ.   (3.34)

Since ∫ γ_ε(θ) π^x(θ) dθ = ∫ ν_ε^x(θ) dθ = 1 for every ε, the second term vanishes, and by (3.33),

(d²/dε²) d_φ(π^x, ν_ε^x) |_{ε=0} = φ″(1) ∫ ( ν(θ)/π(θ) − m_ν(x)/m_π(x) )² π^x(θ) dθ = φ″(1) V_{π^x}( ν(θ)/π(θ) ),   (3.35)

because ∫ (ν(θ)/π(θ)) π^x(θ) dθ = m_ν(x)/m_π(x). The same computation with γ_ε replaced by ν_ε(θ)/π(θ) = 1 + ε(ν(θ)/π(θ) − 1) gives (d²/dε²) d_φ(π, ν_ε)|_{ε=0} = φ″(1) V_π(ν(θ)/π(θ)), which concludes the proof. □
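Theorem 3.16 can be checked numerically. With the χ² divergence (φ(t) = (t − 1)², so φ″(1) = 2), the ratio of divergences between perturbed and baseline posteriors and priors should approach the variance ratio in (3.31) as ε → 0. The normal choices below are hypothetical:

```python
from math import exp, pi, sqrt

def dnorm(t, m, s):
    return exp(-0.5 * ((t - m) / s) ** 2) / (s * sqrt(2.0 * pi))

# hypothetical setup: prior pi = N(0,1), direction nu = N(0.3,1),
# likelihood X | theta ~ N(theta, 1) with x = 1
x, eps, dt = 1.0, 0.01, 0.001
grid = [-10.0 + dt * i for i in range(int(20.0 / dt) + 1)]

pi_d = [dnorm(t, 0.0, 1.0) for t in grid]
nu_d = [dnorm(t, 0.3, 1.0) for t in grid]
lik  = [dnorm(x, t, 1.0) for t in grid]

def normalize(w):
    z = sum(w) * dt
    return [v / z for v in w]

post_pi = normalize([p * l for p, l in zip(pi_d, lik)])

def chi2(p, q):  # d_phi(p, q) with phi(t) = (t - 1)^2
    return sum((a / b - 1.0) ** 2 * b for a, b in zip(p, q)) * dt

def variance(h, w):  # variance of h under the density w on the grid
    m1 = sum(hi * wi for hi, wi in zip(h, w)) * dt
    m2 = sum(hi * hi * wi for hi, wi in zip(h, w)) * dt
    return m2 - m1 * m1

ratio_fn = [n / p for n, p in zip(nu_d, pi_d)]                 # nu / pi
s_theory = variance(ratio_fn, post_pi) / variance(ratio_fn, pi_d)

nu_eps = [(1 - eps) * p + eps * n for p, n in zip(pi_d, nu_d)]  # linear perturbation
post_eps = normalize([q * l for q, l in zip(nu_eps, lik)])
s_finite = chi2(post_pi, post_eps) / chi2(pi_d, nu_eps)
```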
We consider next geometric perturbations. The following theorem then follows from Theorem 3.2 of Dey and Birmiwal (1994).

Theorem 3.17. Suppose that (i) ∫ (log(ν(θ)/π(θ)))² ν(θ) dθ < ∞, and (ii) ∫ (log(ν(θ)/π(θ)))² π(θ) dθ < ∞. Then

s(π, ν; x) = V_{π^x}( log(ν(θ)/π(θ)) ) / V_π( log(ν(θ)/π(θ)) ).   (3.36)

Proof. As before in Theorem 3.16, it is enough to establish that

(d²/dε²) d_φ(π^x, ν_ε^x) |_{ε=0} = φ″(1) V_{π^x}( log(ν(θ)/π(θ)) ),   (3.37)

along with the analogous identity for the prior; these follow by the same differentiation argument applied to the geometric path dν_ε ∝ (dν)^ε (dπ)^{1−ε}. □
Applications of these results are similar to those of a related, simpler approach shown below. This other approach to local sensitivity analysis is simply to look at the curvature of the φ-divergence, as discussed in Dey and Birmiwal (1994) and Delampady and Dey (1994); it also turns out to be much easier. Consider the class Γ_ε of ε-contamination priors, Γ_ε = {π : π = (1 − ε)π₀ + εq, q ∈ Q}. Then the curvature C(q), defined by

C(q) = (d²/dε²) ∫_Θ φ( ν_ε(θ|x)/π₀(θ|x) ) π₀(θ|x) dθ |_{ε=0},

under general regularity conditions has the form C(q) = φ″(1) V_{π₀(·|x)}( q(θ)/π₀(θ) ), as seen previously. Similarly, if we consider the geometric class Γ_G = {π : π = c(ε) π₀^{1−ε} q^ε, q ∈ Q}, then we have C(q) = φ″(1) V_{π₀(·|x)}( log(q(θ)/π₀(θ)) ). Variation of these quantities over many parametric and nonparametric classes can be computed easily. The following example is from Dey and Birmiwal (1994).

Example 3.18. Consider X|θ ~ N_p(θ, I) and the class Γ_G, where under π₀, θ ~ N_p(μ₀, Σ₀) and Q = {q : θ|q ~ N_p(μ₀, kΣ₀), k₁ ≤ k ≤ k₂}, with k₁ < 1 < k₂. Then the posterior distribution of θ given x under π₀ is

N_p( Σ₀(I + Σ₀)⁻¹x + (I + Σ₀)⁻¹μ₀, Σ₀(I + Σ₀)⁻¹ ),

and hence

C(q) = φ″(1) ( (k − 1)/(2k) )² { 2 trace((I + Σ₀)⁻²) + 4 (x − μ₀)′ Σ₀ (I + Σ₀)⁻³ (x − μ₀) }.
It can then be shown that C(q) attains its minimum at k = 1 and its maximum at k₁ or k₂. The extent of robustness will of course depend on the data x, smaller values of C(q) indicating robustness.

Let Q_US be the class of unimodal spherically symmetric densities q (with mode 0, so that max_θ q(θ) = q(0)), and consider the contamination class Γ = {π : π = (1 − ε)π₀ + εq, q ∈ Q_US}.
(See Sivaganesan (1989) for details on this class.) Then, under certain reasonable conditions (see Delampady and Dey (1994)),

sup_{π∈Γ} C(q) = φ″(1) sup_{q∈Q_US} V_{π₀(·|x)}( q(θ)/π₀(θ) )
              = φ″(1) sup_{r>0} { (1/(V²(r) m_{π₀}(x))) ∫_{S(r)} ( f(x|θ)/π₀(θ) ) dθ − ( (1/(V(r) m_{π₀}(x))) ∫_{S(r)} f(x|θ) dθ )² },   (3.38)
where S(r) is a sphere of radius r centered at 0 and V(r) denotes its volume. The following example illustrates the use of this result.

Example 3.19. Let X|θ ~ N(θ, 1), and under π₀, θ ~ N(0, τ²), τ² ≥ 1. Then m_{π₀} is the density of N(0, τ² + 1). Upper bounds for C(q) (denoted C*) calculated using (3.38) are listed in Table 3.4 for selected values of τ and x. The extremely large values of C* corresponding to τ = 1.1 and x = 3, 4 indicate that these data are not compatible with π₀. However, the same data are compatible with π₀ if τ has a larger value, say 2.0. Some kind of calibration, however, is needed to establish precisely what magnitudes of curvature can be considered extreme.

Table 3.4. Bounds on Curvature for Different Values of τ and x

   τ   |x|      C*
  1.1   2   909.3
  1.1   3   2.08225 × 10⁶
  1.1   4   1.06395 × 10¹¹
  1.5   2   1.0918
  1.5   3   13.7237
  1.5   4   454.3244
  2.0   3   1.1186
  2.0   4   7.0946
91
Before we conclude this discussion, it should be mentioned t h a t there is a large amount of literature on g a m m a minimax estimation, which is a frequentist approach to Bayesian robustness. T h e idea here is t o look for t h e minimax estimator, b u t t h e class of priors considered (for minimaxity) being t h e one identified for Bayesian robustness consideration. Let us take a very brief look. Recall t h a t for any decision rule J, its frequentist risk function is given by R{0,S) = EL{0,S{X)), where L is the loss function and t h e expectation is with respect t o the distribution of X | ^ . If TT is any prior distribution on 0^ t h e Bayes risk of S with respect to TT is r{7T,S) = E^R{6^5). T h e decision rule (^TT, which minimizes t h e Bayes risk r(7r, (5), is t h e Bayes rule with respect to TT. Under t h e minimax principle, the optimal decision rule 5^ (minimax rule) is t h a t which minimizes t h e m a x i m u m of t h e frequentist risk R{0^8). Equivalently, 5^ minimizes t h e m a x i m u m of t h e Bayes risk r(7r,^) over t h e class of all priors TT. Under t h e g a m m a minimax principle, if TT is constrained to lie in a class F^ t h e optimal rule 5^ (gamma-minimax rule) minimizes supper ^(^,^)Even though there are m a n y attractive results in this topic, we will not be discussing t h e m . Extensive discussion can be found in Berger (1984, 1985a), and further material in Ickstadt (1992) and Vidakovic (2000).
3.9 Inherently Robust Procedures It is n a t u r a l t o look for priors and t h e resulting Bayesian procedures t h a t are inherently robust. Adopting this approach will eliminate t h e need for checking robustness at t h e end by building robustness into t h e analysis at the beginning itself. Further, practitioners can d e m a n d "default" Bayesian procedures with built-in robustness t h a t do not require specific sensitivity analyses requiring sophisticated tools. Accumulated evidence indicates t h a t priors with flatter tails t h a n those of t h e likelihood t e n d to be more robust t h a n easier choices such as conjugate priors. Literature here includes Dawid (1973), Box and Tiao (1973), Berger (1984, 1985a), O'Hagan (1988, 1990), Angers and Berger (1991), Fan and Berger (1992), and Geweke (1999). T h e following example from Berger (1994) illustrates some of these ideas. Example 3.20. Let X i , . . . , X^ be a r a n d o m sample from a measurement error model, so t h a t X^ = ^ + e^,i = l , . . . , n where e^ are the measurement errors. e^'s can then be reasonably assumed to be i.i.d. having a symmetric unimodal distribution with m e a n 0 and unknown variance cr^. T h e location p a r a m e t e r 9 is of inferential interest with t h e prior information t h a t it is symmetric a b o u t 0 and has quartiles of ± 1 , whereas cr^ is a nuisance p a r a m e t e r with little prior information. T h e simple "standard" analysis would assume t h a t Xi\0^a'^ are i.i.d. N{0^a'^) and 7r(^, cr^) oc -\7ri{0) where under TTI, the prior distribution of
92
θ is N(0, 2.19). (This may be contrasted with the Jeffreys analysis discussed in Section 2.7.2.) This conjugate prior analysis suffers from nonrobustness, as mentioned previously. Instead, assume that X_i|θ, σ² are i.i.d. t₄(θ, σ²), and likewise assume that under π₁ the prior distribution of θ is Cauchy(0, 1). This analysis achieves a certain robustness lacking in the previous approach: any outliers in the data will be adequately handled by the Student's t model, and if the prior and the data are in conflict, the prior information will be mostly ignored.

There are certain computational issues to be addressed here. The "standard" analysis is very easy, whereas the robust approach is computationally intensive. However, the MCMC techniques that will be discussed later in the context of hierarchical Bayesian analysis can handle these problems. O'Hagan (1990) and Angers (2000) discuss some of these issues formally using concepts that they call credence and p-credence, which compare the tail behavior of the posterior distribution with that of heavy-tailed distributions such as the Student's t and the exponential power density. Further discussion of robust priors and robust procedures is deferred to Chapters 4 and 5, where we consider default and reference priors that are improper.
3.10 Loss Robustness

Given the same decision problem, it is possible that different decision makers assess the consequences of their actions differently and hence have different loss functions. In such a situation, it may be necessary to evaluate the sensitivity of Bayesian procedures to the choice of loss.

Example 3.21. Suppose X is Poisson(θ) and θ has an exponential prior distribution with mean 1, and suppose x = 0 is observed. Then the posterior distribution of θ is exponential with mean 1/2. Therefore, the Bayes estimate of θ under squared error loss is the posterior mean, 1/2, whereas the Bayes estimate under absolute error loss is the posterior median, 0.3465. These are clearly different, and the difference may have significant impact depending on the use to which the estimate is put.

It is possible to provide a Bayesian approach to the study of loss robustness exactly as was done for the prior distribution. In particular, if a class of loss functions is available, the range of posterior expected losses can be computed and examined, as was done in Dey et al. (1998) and Dey and Micheas (2000). There are also other approaches, such as computing the set of non-dominated alternatives, which is outlined in Martin et al. (1998).
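The two estimates in Example 3.21 are immediate to reproduce; with x = 0 observed, the posterior is exponential with rate 2, whose mean and median differ:

```python
from math import log

# Posterior in Example 3.21: theta | x = 0 is exponential with rate 2
# (prior rate 1 plus the likelihood contribution e^{-theta} from Poisson at x = 0).
rate = 2.0
post_mean = 1.0 / rate            # Bayes estimate under squared error loss
post_median = log(2.0) / rate     # Bayes estimate under absolute error loss
print(post_mean, post_median)     # 0.5 and about 0.3466
```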
3.11 Model Robustness The model for the observables is the most important component of statistical inference, and hence imprecisions in the specification of the model that can lead to inaccurate inferences must be viewed with great concern. There has been a lot of work in classical statistics in this regard, but most of that only addresses the problem of influence of outliers with respect to a specified target model. In principle, Bayesian approach to model robustness need not be any different from that for prior robustness or loss robustness. However, the problem gets complicated because the mapping of likelihood function to posterior density is not ratio-linear, and hence different techniques need to be employed to assess the sensitivity. If only a finite set of models need to be considered, the problem is a simple one and one simply needs to check the inferences obtained under the different models for the given data. It needs to be kept in mind that, even in this case, different models may be based on different parameters with different interpretations, and hence the specification of prior distributions may be a complicated problem. The following example which illustrates some of the possibilities is similar to Example 1 of Shyamalkumar (2000). (See Pericchi and Perez (1994) and Berger et al. (2000) also.) Example 3.22. Suppose the quantity of inferential interest is ^, the median of the model. Model uncertainty is represented by considering the set of two models, M = {N{9,1), Cauchy(i9,0.675)} , where 0.675 above is the scale parameter of the Cauchy distribution. In other words, X is either N{0^ 1) or Cauchy(^,0.675). Since 6 is the median of the model in either case, it is not diflacult to specify its prior distribution. Suppose the prior TT lies in the class F of N{0, r^), 1 < r^ < 10. The range of posterior means are as shown in Table 3.5. As can be seen, model robustness is also dependent on the observed x, just like prior or loss robustness. 
In many situations, this robustness will be absent, and there is no solution other than providing further input on model refinements. Model robustness does have a long history even though the material is not very extensive. Box and Tiao (1962) considered this problem in a simple setup. Lavine (1991) and Fernandez et al. (2001) have used a nonparametric class of models, and Bayarri and Berger (1998b) have studied robustness in selection models. These can be considered global robustness approaches, as compared with the local robustness approach adopted by Cuevas and Sanz (1988), Sivaganesan (1993), and Dey et al. (1996), who study extrema of functional derivatives of posterior quantities. This is similar to the local robustness approach for prior distributions. Some of the frequentist approaches, such as Huber (1964, 1981), are also somewhat relevant.
3 Utility, Prior, and Bayesian Robustness

Table 3.5. Range of Posterior Means for Different Models

                   x = 2                    x = 4                    x = 6
Likelihood   inf E(θ|x)  sup E(θ|x)   inf E(θ|x)  sup E(θ|x)   inf E(θ|x)  sup E(θ|x)
Normal          1.000       1.818        2.000       3.636        3.000       5.455
Cauchy          0.621       1.689        0.914       3.228        0.362       4.433
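The entries in Table 3.5 can be checked numerically. The following sketch is our own illustration, not the book's computation (the τ² grid, the truncation of the integrals to [−50, 50], and the helper name `posterior_mean` are arbitrary choices): for the normal likelihood the conjugate formula E(θ|x) = τ²x/(1 + τ²) is available in closed form, while for the Cauchy likelihood the posterior mean is computed by numerical integration.

```python
import numpy as np
from scipy import integrate, stats

def posterior_mean(x, tau2, likelihood):
    """E(theta | x) under a N(0, tau2) prior and the given likelihood."""
    if likelihood == "normal":
        # conjugate normal-normal case: E(theta | x) = tau2 * x / (1 + tau2)
        return tau2 * x / (1.0 + tau2)
    # Cauchy(theta, 0.675) likelihood: compute the posterior mean numerically
    prior = lambda t: stats.norm.pdf(t, scale=np.sqrt(tau2))
    lik = lambda t: stats.cauchy.pdf(x, loc=t, scale=0.675)
    num = integrate.quad(lambda t: t * lik(t) * prior(t), -50, 50)[0]
    den = integrate.quad(lambda t: lik(t) * prior(t), -50, 50)[0]
    return num / den

for x in (2, 4, 6):
    for model in ("normal", "cauchy"):
        means = [posterior_mean(x, tau2, model) for tau2 in np.linspace(1, 10, 91)]
        print(f"x = {x}, {model}: inf = {min(means):.3f}, sup = {max(means):.3f}")
```

For the normal likelihood the extremes occur at the endpoints τ² = 1 and τ² = 10, which reproduces the Normal row of the table.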
3.12 Exercises

1. (St. Petersburg paradox). Suppose you are invited to play the following game. A fair coin is tossed repeatedly until it comes up heads. The reward will be 2ⁿ (in some unit of currency) if it takes n tosses until a head first appears. How much would you be willing to pay to play this game? Show that the expected monetary return is infinite, but few would be willing to pay very much to play the game.
2. Consider a lottery where it costs $1 to buy a ticket. If you win the lottery you get $1000. If the probability of winning the lottery is 0.0001, decide what you should do under each of the following utility functions, u(x), x being the monetary gain: (a) u(x) = x; (b) u(x) = log_e(3 + x); (c) u(x) = exp(1 + x/100).
3. A mango grower owns three orchards. Orchard I yields 50% of his total produce, II provides 30%, and III provides the rest. Even though they are all of a single variety, 2% of the mangoes from I, and 1% each from II and III, are excessively sour tasting. (a) What is the probability that a mango randomly selected from the total produce is excessively sour? (b) What is the probability that a randomly selected mango that is found to be excessively sour came from orchard II? (c) Consider a box of 100 mangoes, all of which came from a single orchard, but we don't know which one. A mango is selected randomly from this box and is found to be sour. What is the probability that a second mango randomly selected from the remaining 99 is also sour?
4. Show that the Student's t density can be expressed as a scale mixture of normal densities.
5. Refer to Example 3.1. Suppose that the prior for θ has median 1 and upper quartile 2. Consider the priors, (i) θ ~ exponential, (ii) log(θ) ~ normal, and (iii) log(θ) ~ Cauchy. (a) Determine the hyperparameters of the three priors. (b) Plot the posterior mean E^π(θ|x) for the three priors when x lies in the range 0 ≤ x ≤ 50.
6. Let X₁, X₂, ..., X_n be a random sample from Poisson(θ), where estimation of θ is of interest.
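The point of Exercise 1 can be seen numerically: truncating the game at m tosses gives expected payoff Σ_{n=1}^m 2ⁿ · 2⁻ⁿ = m, which grows without bound as m → ∞, while any realized sample mean stays finite. A minimal sketch (our own illustration; the function name is hypothetical):

```python
import random

def petersburg_payoff(rng):
    """Play one game: toss until heads, pay 2**n if the first head is on toss n."""
    n = 1
    while rng.random() < 0.5:  # tails with probability 1/2
        n += 1
    return 2 ** n

# truncated expectation: each toss level contributes 2**n * 2**(-n) = 1,
# so allowing at most m tosses gives expected payoff exactly m
for m in (10, 20, 30):
    print(m, sum(2 ** n * 0.5 ** n for n in range(1, m + 1)))

# any finite sample mean is finite, but it drifts upward as the sample grows
rng = random.Random(1)
print(sum(petersburg_payoff(rng) for _ in range(10000)) / 10000)
```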
(a) Derive the range of posterior means when the prior lies in the class of gamma distributions with prior mean μ₀.
(b) Compute the range of posterior means if the prior density is known only to be a continuous non-increasing function.
7. Let X₁, X₂, ..., X_n be i.i.d. N(θ, σ²), σ² known. Consider the following class of conjugate priors for θ: Γ = {N(0, τ²), τ² > 0}. (a) Find the range of posterior means. (b) Find the range of posterior variances. (c) Suppose x̄ > 0. Plot the range of 95% HPD credible intervals. (d) Suppose σ² = 10 and n = 10. Further, suppose that an x̄ of large magnitude is observed. If, now, a N(0, 1) prior is assumed (in which case the prior mean is far from the sample mean but the prior variance and sample variance are equal), show that the posterior mean and also the credible interval will show substantial shrinkage. Comment on this phenomenon of the prior not allowing the data to have more influence when the data and prior are in conflict. What would happen if instead a Cauchy(0, 1) prior were to be used?
8. Let X|θ ~ N(θ, 1) and let Γ_SU denote the class of unimodal priors which are symmetric about 0. (a) Plot {inf_{π∈Γ_SU} m(x), sup_{π∈Γ_SU} m(x)} for 0 ≤ x ≤ 10. (b) Plot {inf_{π∈Γ_SU} E^π(θ|x), sup_{π∈Γ_SU} E^π(θ|x)} for 0 ≤ x ≤ 10.
9. Let X₁, X₂, ..., X_n be i.i.d. with density f(x|θ) = exp(−(x − θ)), x > θ, where −∞ < θ < ∞. Consider the class of unimodal prior distributions on θ which are symmetric about 0. Compute the range of posterior means and that of the posterior probability that θ > 0, for n = 5 and x = (0.1828, 0.0288, 0.2355, 1.6038, 0.4584).
10. Suppose X₁, X₂, ..., X_n are i.i.d. N(θ, σ²), where θ needs to be estimated, but σ², which is also unknown, is a nuisance parameter. Let x̄ denote the sample mean and s²_{n−1} = Σ_{i=1}^n (X_i − x̄)²/(n − 1) the sample variance. (a) Show that under the prior π(θ, σ²) ∝ (σ²)^{−1}, the posterior distribution of θ is given by
√n(θ − x̄)/s_{n−1} | x ~ t_{n−1}.
(b) Using (a), justify the standard confidence interval

x̄ ± t_{n−1}(α/2) s_{n−1}/√n

as an HPD Bayesian credible interval of coefficient 100(1 − α)%, where t_{n−1}(α/2) is the t_{n−1} quantile of order (1 − α/2). (c) If instead, θ|σ² ~ N(μ, cσ²) and π(σ²) ∝ (σ²)^{−1}, for specified μ and c, what is the HPD Bayesian credible interval of coefficient 100(1 − α)%? (d) In (c) above, suppose c = 5 and μ is not specified, but is known to lie in the interval 0 ≤ μ ≤ 3; n = 9, x̄ = 0, and s_{n−1} = 1. Investigate the
robustness of the credible interval given in (b) by computing the range of its posterior probability. (e) Consider independent priors: θ ~ N(μ, τ²), π(σ²) ∝ (σ²)^{−1}, where 0 ≤ μ ≤ 3 and 5 ≤ τ² ≤ 10. Conduct a robustness study as in (d) now.
11. Let
g_λ(x) = (x^λ − 1)/λ if λ > 0;  g_λ(x) = lim_{λ′→0} (x^{λ′} − 1)/λ′ = log(x) if λ = 0,

and consider the following family of probability densities introduced by Albert et al. (1991):

π(θ|μ, φ, c, λ) = k(c, λ) √φ exp{−(1/2) g_λ(c + φ(θ − μ)²)},   (3.39)
where k(c, λ) is the normalizing constant, −∞ < μ < ∞, φ > 0, c ≥ 1, λ ≥ 0.
(a) Show that π is unimodal and symmetric about μ.
(b) Show that the family of densities defined by (3.39) contains many location-scale families.
(c) Show that normal densities are included in this family.
(d) Show that Student's t is a special case of this density when λ = 0.
(e) Show that (3.39) behaves like the double exponential when λ = 1/2.
(f) For 0 < λ < 1, show that the density in (3.39) is a scale mixture of normal densities.
12. Suppose X|θ ~ N(θ, σ²), with θ being the parameter of interest. Explain how the family of prior densities given by (3.39) can be used to study the robustness of the posterior inferences in this case. In particular, explain what values of λ are expected to provide robustness over a large range of values of x = X.
13. Refer to the definition of belief function, Equation (3.20). Show that Bel is a probability measure iff Bel(·) = Pl(·).
14. Show that any probability measure is also a belief function.
15. Refer to Example 3.14. Prove (3.25) and (3.26).
16. Refer to Example 3.14 again. Let X|θ ~ N(θ, 1) and let π₀ denote N(0, τ²) with τ² = 2. Take ε = 0.2 and suppose x = 3.5 is observed. (a) Construct the 95% HPD credible interval for θ under π₀. (b) Compute (3.25) and (3.26) for the interval in (a) now, and check whether robustness is present when the ε-contamination class of priors is considered.
17. (Dey and Birmiwal (1994)) Let X = (X₁, ..., X_k) have a multinomial distribution with probability mass function

P(X₁ = x₁, ..., X_k = x_k | p) = (n!/(x₁! ⋯ x_k!)) ∏_{i=1}^k p_i^{x_i},

with n = Σ_{i=1}^k x_i, 0 ≤ p_i ≤ 1, Σ_{i=1}^k p_i = 1. Suppose under π₀, p has the Dirichlet distribution D(α) with density π₀(p) = (Γ(α₀)/∏_{i=1}^k Γ(α_i)) ∏_{i=1}^k p_i^{α_i − 1}, with α₀ = Σ_{i=1}^k α_i, where α_i > 0. Now consider the ε-contamination class of priors with Q = {D(sα), s > 1}. Derive the extreme values of the curvature C(q).
4 Large Sample Methods
In order to make Bayesian inference about the parameter θ, given a model f(x|θ), one needs to choose an appropriate prior distribution for θ. Given the data x, the prior distribution is used to find the posterior distribution and various posterior summary measures, depending on the problem. Thus exact or approximate computation of the posterior is a major problem for a Bayesian. Under certain regularity conditions, the posterior can be approximated by a normal distribution with the maximum likelihood estimate (MLE) as the mean and the inverse of the observed Fisher information matrix as the dispersion matrix, if the sample size is large. If more accuracy is needed, one may use the Kass-Kadane-Tierney or Edgeworth type refinements. Alternatively, one may sample from the approximate posterior and resort to importance sampling. Posterior normality has an important philosophical implication, which we discuss below. How the posterior inference is influenced by a particular prior depends on the relative magnitude of the amount of information in the data, which for i.i.d. observations may be measured by the sample size n or nI(θ) or the observed Fisher information Î_n (defined in Section 4.1.2), and the amount of information in the prior, which is discussed in Chapter 5. As the sample size grows, the influence of the prior distribution diminishes. Thus for large samples, a precise mathematical specification of the prior distribution is not necessary. In most cases of low-dimensional parameter space, the situation is like this. A Bayesian would refer to it as washing away of the prior by the data. There are several mathematical results embodying this phenomenon, of which posterior normality is the most well known. This chapter deals with posterior normality and some of its refinements. We begin with a discussion of the limiting behavior of the posterior distribution in Section 4.1. A sketch of the proof of asymptotic normality of the posterior is given in this section.
A more accurate posterior approximation based on Laplace's asymptotic method and its refinements by Tierney, Kass, and Kadane are the subjects of Section 4.3. A refinement of posterior normality is discussed
in Section 4.2, where an asymptotic expansion of the posterior distribution with a leading normal term is outlined. Throughout this chapter, we consider only the case with a finite-dimensional parameter. Also, θ is assumed to be a "continuous" parameter with a prior density function. We apply these results to the determination of sample size in Section 4.2.1.
4.1 Limit of Posterior Distribution

In this section, we discuss the limiting behavior of posterior distributions as the sample size n → ∞. The limiting results can be used as approximations if n is sufficiently large. They may also be used as a form of frequentist validation of Bayesian analysis. We begin with a discussion of posterior consistency in Section 4.1.1. Asymptotic normality of the posterior distribution is the subject of Section 4.1.2.

4.1.1 Consistency of Posterior Distribution

Suppose a data sequence is generated as i.i.d. random variables with density f(x|θ₀). Would a Bayesian analyzing these data with his prior π(θ) be able to learn about θ₀? Our prior knowledge about θ is updated into the posterior as we learn more from the data. Ideally, the updated knowledge about θ, represented by its posterior distribution, should become more and more concentrated near θ₀ as the sample size increases. This asymptotic property is known as consistency of the posterior distribution at θ₀. Let X₁, ..., X_n be the observations at the nth stage, abbreviated as X_n, having a density f(x_n|θ), θ ∈ Θ ⊂ ℝ^p. Let π(θ) be a prior density, π(θ|X_n) the posterior density as defined in (2.1), and Π(·|X_n) the corresponding posterior distribution.

Definition. The sequence of posterior distributions Π(·|X_n) is said to be consistent at some θ₀ ∈ Θ, if for every neighborhood U of θ₀, Π(U|X_n) → 1 as n → ∞ with probability one with respect to the distribution under θ₀.

The idea goes back to Laplace, who had shown the following. If X₁, ..., X_n are i.i.d. Bernoulli with P_θ(X_i = 1) = θ and π(θ) is a prior density that is continuous and positive on (0, 1), then the posterior is consistent at all θ₀ in (0, 1). von Mises (1957) calls this the second fundamental law of large numbers, the first being Bernoulli's weak law of large numbers. The need for posterior consistency has been stressed by Freedman (1963, 1965) and Diaconis and Freedman (1986).
From the definition of convergence in distribution, it follows that consistency of Π(·|X_n) at θ₀ is equivalent to the fact that Π(·|X_n) converges to the distribution degenerate at θ₀ with probability one under θ₀. Consistency of the posterior distribution holds in the general case with a finite-dimensional parameter under mild conditions. For general results see, for example, Ghosh and Ramamoorthi (2003). For a real parameter θ, consistency
at θ₀ can be proved by showing E(θ|X_n) → θ₀ and Var(θ|X_n) → 0 with probability one under θ₀. This follows from an application of Chebyshev's inequality.

Example 4.1. Let X₁, X₂, ..., X_n be i.i.d. Bernoulli observations with P_θ(X_i = 1) = θ. Consider a Beta(α, β) prior density for θ. The posterior density of θ given X₁, X₂, ..., X_n is then a Beta(Σ_{i=1}^n X_i + α, n − Σ_{i=1}^n X_i + β) density with

E(θ|X₁, ..., X_n) = (Σ_{i=1}^n X_i + α)/(n + α + β),

Var(θ|X₁, ..., X_n) = (Σ_{i=1}^n X_i + α)(n − Σ_{i=1}^n X_i + β)/{(α + β + n)²(α + β + n + 1)}.

As (1/n) Σ_{i=1}^n X_i → θ₀ with P_{θ₀}-probability 1 by the law of large numbers, it follows that E(θ|X₁, ..., X_n) → θ₀ and Var(θ|X₁, ..., X_n) → 0 with probability one under θ₀. Therefore, in view of the result mentioned in the previous paragraph, the posterior distribution of θ is consistent.

An important result related to consistency is the robustness of the posterior inference with respect to the choice of prior. Let X₁, ..., X_n be i.i.d. observations. Let π₁ and π₂ be two prior densities which are positive and continuous at θ₀, an interior point of Θ, such that the corresponding posterior distributions Π₁(·|X_n) and Π₂(·|X_n) are both consistent at θ₀. Then with probability one under θ₀

∫_Θ |π₁(θ|X_n) − π₂(θ|X_n)| dθ → 0,

or equivalently,

sup_A |Π₁(A|X_n) − Π₂(A|X_n)| → 0.
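Both the consistency in Example 4.1 and this merging of posteriors for two different priors can be illustrated on simulated Bernoulli data. The sketch below is our own illustration (the choices θ₀ = 0.3, the sample sizes, and the deliberately misplaced Beta(8, 2) prior are arbitrary); it uses the posterior moment formulas of Example 4.1.

```python
import random

def beta_posterior_moments(xs, a, b):
    """Posterior mean and variance under a Beta(a, b) prior with Bernoulli data."""
    n, r = len(xs), sum(xs)
    a_post, b_post = a + r, b + n - r          # posterior is Beta(a + r, b + n - r)
    mean = a_post / (a_post + b_post)
    var = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
    return mean, var

rng = random.Random(42)
theta0 = 0.3
xs = [1 if rng.random() < theta0 else 0 for _ in range(100000)]

for n in (100, 10000, 100000):
    m_unif, v_unif = beta_posterior_moments(xs[:n], 1, 1)   # uniform prior
    m_off, v_off = beta_posterior_moments(xs[:n], 8, 2)     # prior concentrated near 0.8
    print(n, round(m_unif, 4), round(m_off, 4), v_unif)
# both posterior means approach theta0 and the posterior variances shrink to 0,
# so the influence of the misplaced prior washes away
```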
Thus, two different choices of the prior distribution lead to approximately the same posterior distribution. A proof of this result is available in Ghosh et al. (1994) and Ghosh and Ramamoorthi (2003).

4.1.2 Asymptotic Normality of Posterior Distribution

Large sample Bayesian methods are primarily based on normal approximation to the posterior distribution of θ. As the sample size n increases, the posterior distribution approaches normality under certain regularity conditions and hence can be well approximated by an appropriate normal distribution if n is sufficiently large. When n is large, the posterior distribution becomes highly concentrated in a small neighborhood of the posterior mode. Suppose that the notations are as in Section 4.1.1, and θ̃_n denotes the posterior mode. Under suitable regularity conditions, a Taylor expansion of log π(θ|X_n) at θ̃_n gives
log π(θ|X_n) = log π(θ̃_n|X_n) + (θ − θ̃_n)′ [∂ log π(θ|X_n)/∂θ]_{θ=θ̃_n} − (1/2)(θ − θ̃_n)′ Ĩ_n (θ − θ̃_n) + ⋯
            ≈ log π(θ̃_n|X_n) − (1/2)(θ − θ̃_n)′ Ĩ_n (θ − θ̃_n),   (4.1)

where Ĩ_n is the p × p matrix

Ĩ_n = (−(∂²/∂θ_i ∂θ_j) log π(θ|X_n))|_{θ=θ̃_n},

and may be called the generalized observed Fisher information matrix. The term involving the first derivative is zero as the derivative is zero at the mode θ̃_n. Also, under suitable conditions the terms involving third and higher order derivatives can be shown to be asymptotically negligible as θ is essentially close to θ̃_n. Because the first term in (4.1) is free of θ, π(θ|X_n), as a function of θ, is approximately represented as a density proportional to

exp[−(1/2)(θ − θ̃_n)′ Ĩ_n (θ − θ̃_n)],

which is a N_p(θ̃_n, Ĩ_n^{−1}) density (with p being the dimension of θ). As the posterior distribution becomes highly concentrated in a small neighborhood of the posterior mode θ̃_n, where the prior density π(θ) is nearly constant, the posterior density π(θ|X_n) is essentially the same as the likelihood f(X_n|θ). Therefore, in the above heuristics, we can replace θ̃_n by the maximum likelihood estimate (MLE) θ̂_n and Ĩ_n by the observed Fisher information matrix

Î_n = (−(∂²/∂θ_i ∂θ_j) log f(X_n|θ))|_{θ=θ̂_n},

so that the posterior distribution of θ is approximately N_p(θ̂_n, Î_n^{−1}). The dispersion matrix of the approximating normal distribution may also be taken to be I^{−1}(θ̂_n), the inverse of the expected Fisher information matrix evaluated at θ̂_n, where I(θ) is the matrix defined as

I(θ) = (E_θ[−(∂²/∂θ_i ∂θ_j) log f(X_n|θ)]).
Thus we have the following result.

Result. Suppose that X₁, X₂, ..., X_n are i.i.d. observations, abbreviated as X_n, having a density f(x_n|θ), θ ∈ Θ ⊂ ℝ^p. Let π(θ) be a prior density and π(θ|X_n) the posterior density as defined in (2.1). Let θ̃_n be the posterior mode, θ̂_n the MLE, and Ĩ_n, Î_n, and I(θ) the different forms of the Fisher information matrix defined above. Then under suitable regularity conditions,
for large n, the posterior distribution of θ can be approximated by any one of the normal distributions N_p(θ̃_n, Ĩ_n^{−1}), N_p(θ̂_n, Î_n^{−1}), or N_p(θ̂_n, I^{−1}(θ̂_n)).

In particular, under suitable regularity conditions, the posterior distribution of Î_n^{1/2}(θ − θ̂_n), given X_n, converges to N_p(0, I) with probability one under the true model for the data, where I denotes the identity matrix of order p. This is comparable with the result from classical statistical theory that the repeated sampling distribution of Î_n^{1/2}(θ̂_n − θ) given θ also converges to N_p(0, I). For a comment on the accuracy of the different normal approximations stated in the above result and an example, see Berger (1985a, Sec. 4.7.8).

We formally state a theorem below giving a set of regularity conditions under which asymptotic normality of the posterior distribution holds. Posterior normality, in some form, was first observed by Laplace in 1774 and later by Bernstein (1917) and von Mises (1931). More recent contributors in this area include Le Cam (1953, 1958, 1986), Bickel and Yahav (1969), Walker (1969), Chao (1970), Borwanker et al. (1971), and Chen (1985). Ghosal (1997, 1999, 2000) considered cases where the number of parameters increases. A general approach that also works for nonregular problems is presented in Ghosh et al. (1994) and Ghosal et al. (1995). We present below a version of a theorem that appears in Ghosh and Ramamoorthi (2003). For simplicity, we consider the case with a real parameter θ and i.i.d. observations X₁, ..., X_n.

Let X₁, X₂, ..., X_n be i.i.d. observations with a common distribution P_θ possessing a density f(x|θ), where θ ∈ Θ, an open subset of ℝ. We fix θ₀ ∈ Θ, which may be regarded as the "true value" of the parameter, as the probability statements are all made under θ₀. Let ℓ(θ, x) = log f(x|θ), L_n(θ) = Σ_{i=1}^n ℓ(θ, X_i), the log-likelihood, and for a function h, let h^{(i)} denote its ith derivative. We assume the following regularity conditions on the density
f(x|θ).

(A1) The set {x : f(x|θ) > 0} is the same for all θ ∈ Θ.
(A2) ℓ(θ, x) is thrice differentiable with respect to θ in a neighborhood (θ₀ − δ, θ₀ + δ) of θ₀. The expectations E_{θ₀} ℓ^{(1)}(θ₀, X₁) and E_{θ₀} ℓ^{(2)}(θ₀, X₁) are both finite and

sup_{θ∈(θ₀−δ, θ₀+δ)} |ℓ^{(3)}(θ, x)| ≤ M(x) and E_{θ₀} M(X₁) < ∞.   (4.2)

(A3) Interchange of the order of integration with respect to P_{θ₀} and differentiation at θ₀ is justified, so that

E_{θ₀} ℓ^{(1)}(θ₀, X₁) = 0,  E_{θ₀} ℓ^{(2)}(θ₀, X₁) = −E_{θ₀}[ℓ^{(1)}(θ₀, X₁)]².

Also, the Fisher information number per unit observation I(θ₀) = E_{θ₀}[ℓ^{(1)}(θ₀, X₁)]² is positive.
(A4) For any δ > 0, with P_{θ₀}-probability one

sup_{|θ−θ₀|>δ} (1/n)(L_n(θ) − L_n(θ₀)) < −ε

for some ε > 0 and for all sufficiently large n.

Remark: Suppose there exists a strongly consistent sequence of estimators θ_n of θ. This means that for all θ₀ ∈ Θ, θ_n → θ₀ with P_{θ₀}-probability one. Then by the arguments given in Ghosh (1983), a strongly consistent solution θ̂_n of the likelihood equation L_n^{(1)}(θ) = 0 exists, i.e., there exists a sequence of statistics θ̂_n such that with P_{θ₀}-probability one, θ̂_n satisfies the likelihood equation for sufficiently large n and θ̂_n → θ₀.

Theorem 4.2. Suppose assumptions (A1)-(A4) hold and θ̂_n is a strongly consistent solution of the likelihood equation. Then for any prior density π(θ) which is continuous and positive at θ₀,

lim_{n→∞} ∫ |π*_n(t|X₁, ..., X_n) − √(I(θ₀)/(2π)) e^{−(1/2) t² I(θ₀)}| dt = 0   (4.3)

with P_{θ₀}-probability one, where π*_n(t|X₁, ..., X_n) is the posterior density of t = √n(θ − θ̂_n) given X₁, ..., X_n. Also under the same assumptions, (4.3) holds with I(θ₀) replaced by Î_n = −(1/n) L_n^{(2)}(θ̂_n).
A sketch of the proof. We only present a sketch of the proof; interested readers may obtain a detailed complete proof from this sketch. The proof consists essentially of two steps. It is first shown that the tails of the posterior distribution are negligible. Then in the remaining part, the log-likelihood function is expanded by Taylor's theorem up to terms involving the third derivative. The linear term in the expansion vanishes, the quadratic term is proportional to the logarithm of a normal density, and the remainder term is negligible under assumption (4.2) on the third derivative.

Because π_n(θ|X₁, ..., X_n) ∝ ∏_{i=1}^n f(X_i|θ) π(θ), the posterior density of t = √n(θ − θ̂_n) can be written as

π*_n(t|X₁, ..., X_n) = C_n^{−1} π(θ̂_n + t/√n) exp[L_n(θ̂_n + t/√n) − L_n(θ̂_n)]   (4.4)

where

C_n = ∫_ℝ π(θ̂_n + t/√n) exp[L_n(θ̂_n + t/√n) − L_n(θ̂_n)] dt.

Most of the statements made below hold with P_{θ₀}-probability one, but we will omit the phrase "with P_{θ₀}-probability one".
Let

g_n(t) = π(θ̂_n + t/√n) exp[L_n(θ̂_n + t/√n) − L_n(θ̂_n)] − π(θ₀) e^{−(1/2) t² I(θ₀)}.

We first note that in order to prove (4.3), it is enough to show

∫_ℝ |g_n(t)| dt → 0.   (4.5)

If (4.5) holds, C_n → π(θ₀) √(2π/I(θ₀)), and therefore the integral in (4.3), which is dominated by

C_n^{−1} ∫_ℝ |g_n(t)| dt + |C_n^{−1} π(θ₀) − √(I(θ₀)/(2π))| ∫_ℝ e^{−(1/2) t² I(θ₀)} dt,

also goes to zero. To show (4.5), we break ℝ into two regions A₁ = {t : |t| > δ₀√n} and A₂ = {t : |t| ≤ δ₀√n} for some suitably chosen small positive number δ₀ and show, for i = 1, 2,

∫_{A_i} |g_n(t)| dt → 0.   (4.6)
To show (4.6) for i = 1, we note that

∫_{A₁} |g_n(t)| dt ≤ ∫_{A₁} π(θ̂_n + t/√n) exp[L_n(θ̂_n + t/√n) − L_n(θ̂_n)] dt + ∫_{A₁} π(θ₀) e^{−(1/2) t² I(θ₀)} dt.

It is easy to see that the second integral goes to zero. For the first integral, we note that by assumption (A4), for t ∈ A₁,

L_n(θ̂_n + t/√n) − L_n(θ̂_n) < −nε

for all sufficiently large n. It follows that (4.6) holds for i = 1.

To show (4.6) for i = 2, we use the dominated convergence theorem. Expanding in a Taylor series and noting that L_n^{(1)}(θ̂_n) = 0, we have for large n,

L_n(θ̂_n + t/√n) − L_n(θ̂_n) = −(1/2) t² Î_n + R_n(t)   (4.7)

where R_n(t) = (1/6)(t/√n)³ L_n^{(3)}(θ′_n) and θ′_n lies between θ̂_n and θ̂_n + t/√n. By assumption (A2), for each t, R_n(t) → 0 and Î_n → I(θ₀), and therefore g_n(t) → 0. For suitably chosen δ₀, for any t ∈ A₂,

|R_n(t)| ≤ (1/6) δ₀ t² (1/n) Σ_{i=1}^n M(X_i) ≤ (1/4) t² Î_n

for sufficiently large n, so that from (4.7),

exp[L_n(θ̂_n + t/√n) − L_n(θ̂_n)] ≤ e^{−(1/4) t² Î_n} ≤ e^{−(1/8) t² I(θ₀)}.
Therefore, for suitably chosen small δ₀, |g_n(t)| is dominated by an integrable function on the set A₂. Thus (4.6) holds for i = 2. This completes the proof of (4.3). The second part of the theorem follows as Î_n → I(θ₀). □

Remark. We assume in the proof above that π(θ) is a proper probability density. However, Theorem 4.2 holds even if π is improper, if there is an n₀ such that the posterior distribution of θ given (x₁, x₂, ..., x_{n₀}) is proper for a.e. (x₁, x₂, ..., x_{n₀}).

The following theorem states that in the regular case with a large sample, a Bayes estimate is approximately the same as the maximum likelihood estimate θ̂_n. If we consider the squared error loss, the Bayes estimate for θ is given by the posterior mean

θ*_n = ∫_Θ θ π_n(θ|X₁, ..., X_n) dθ.
Theorem 4.3. In addition to the assumptions of Theorem 4.2, assume that the prior density π(θ) has a finite expectation. Then √n(θ*_n − θ̂_n) → 0 with P_{θ₀}-probability one.

Proof. Proceeding as in the proof of Theorem 4.2 and using the assumption of finite expectation for π, (4.3) can be strengthened to

lim_{n→∞} ∫ |t| |π*_n(t|X₁, ..., X_n) − √(I(θ₀)/(2π)) e^{−(1/2) t² I(θ₀)}| dt = 0

with P_{θ₀}-probability one. This implies

∫_ℝ t π*_n(t|X₁, ..., X_n) dt → 0.

Now θ*_n = E(θ|X₁, ..., X_n) = E(θ̂_n + t/√n | X₁, ..., X_n) and therefore,

√n(θ*_n − θ̂_n) = ∫_ℝ t π*_n(t|X₁, ..., X_n) dt → 0. □

Theorems 4.2 and 4.3 and their variants can be used to make inferences about θ for large samples. We have seen in Chapter 2 how our inference can be based on the posterior distribution. If the sample size is sufficiently large, for a wide variety of priors we can replace the posterior distribution by the approximating normal distribution having mean θ̂_n and dispersion (n Î_n)^{−1} or (n I(θ̂_n))^{−1}, which do not depend on the prior. Theorem 4.3 tells us that in the problem of estimating a real parameter with squared error loss, the Bayes estimate is approximately the same as the MLE θ̂_n. Indeed, Theorem 4.3 can be extended to show that this is also true for a wide variety of loss functions. Also, the moments and quantiles of the posterior distribution can be approximated by the corresponding measures of the approximating normal distribution. We consider an example at the end of Section 4.3 to illustrate the use of asymptotic posterior normality in the problems of interval estimation and testing.
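As a numerical illustration of posterior normality (our own sketch, not an example from the text): in the Bernoulli model with a Beta(a, b) prior, the exact posterior is Beta(a + r, b + n − r), the MLE is θ̂_n = r/n, and the per-observation observed Fisher information at the MLE is Î_n = 1/(θ̂_n(1 − θ̂_n)). The L₁ distance between the exact posterior density and the approximating N(θ̂_n, (n Î_n)^{−1}) density, as in (4.3), shrinks as n grows.

```python
import numpy as np
from scipy import stats

def normal_approx_error(r, n, a=1.0, b=1.0):
    """L1 distance between the exact Beta posterior and its normal approximation."""
    mle = r / n
    info = 1.0 / (mle * (1.0 - mle))       # per-observation Fisher information at the MLE
    exact = stats.beta(a + r, b + n - r)
    approx = stats.norm(loc=mle, scale=np.sqrt(1.0 / (n * info)))
    grid = np.linspace(0.001, 0.999, 4000)
    gap = np.abs(exact.pdf(grid) - approx.pdf(grid))
    return float(np.sum(gap) * (grid[1] - grid[0]))   # Riemann sum of |difference|

for n in (20, 200, 2000):
    r = int(0.3 * n)                       # keep the observed proportion fixed at 0.3
    print(n, round(normal_approx_error(r, n), 4))
# the error decreases roughly like n**(-1/2)
```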
4.2 Asymptotic Expansion of Posterior Distribution

Consider the setup of Theorem 4.2. Let

F_n(u) = Π({√n Î_n^{1/2} (θ − θ̂_n) ≤ u} | X₁, ..., X_n)

be the posterior distribution function of √n Î_n^{1/2}(θ − θ̂_n). Then under certain regularity assumptions, F_n(u) is approximately equal to Φ(u), where Φ is the standard normal distribution function. Theorem 4.2 states that under assumptions (A1)-(A4) on the density f(x|θ), for any prior density π(θ) which is continuous and positive at θ₀,

lim_{n→∞} sup_u |F_n(u) − Φ(u)| = 0 a.s. P_{θ₀}.   (4.8)
Recall that this is proved essentially in two steps. It is first shown that the tails of the posterior distribution are negligible. Then in the remaining part, the log-likelihood function is expanded by Taylor's theorem up to terms involving the third derivative. The linear term in the expansion vanishes, the quadratic term is proportional to the logarithm of a normal density, and the remainder term is negligible under assumption (4.2) on the third derivative.

Suppose now that ℓ(θ, x) = log f(x|θ) is (k + 3) times continuously differentiable and π(θ) is (k + 1) times continuously differentiable at θ₀ with π(θ₀) > 0. Then the subsequent higher order terms in the Taylor expansion provide a refinement of the posterior normality result stated in Theorem 4.2 or in (4.8) above. Under conditions similar to (4.2) for the derivatives of ℓ(θ, x) of order 3, 4, ..., k + 3, and some more conditions on f(x|θ), Johnson (1970) proved the following rigorous and precise version of a refinement due to Lindley (1961):

sup_u |F_n(u) − Φ(u) − φ(u) Σ_{j=1}^k ψ_j(u; X₁, ..., X_n) n^{−j/2}| ≤ M_k n^{−(k+1)/2}   (4.9)

eventually with P_{θ₀}-probability one for some M_k > 0, depending on k, where φ(u) is the standard normal density and each ψ_j(u; X₁, ..., X_n) is a polynomial in u having coefficients bounded in X₁, ..., X_n. Under the same assumptions one can obtain a similar result involving the L₁ distance between the posterior density and an approximation. The case k = 0 corresponds to that considered in Section 4.1.2, as (4.9) becomes

sup_u |F_n(u) − Φ(u)| ≤ M₀ n^{−1/2}.   (4.10)
Another (uniform) version of the above result, as stated in Ghosh et al. (1982), is as follows. Let Θ₁ be a bounded open interval whose closure Θ̄₁ is properly contained in Θ and the prior π be positive on Θ̄₁. Then for r > 0, (4.9) holds with P_{θ₀}-probability 1 − O(n^{−r}), uniformly in θ₀ ∈ Θ₁, under certain regularity conditions (depending on r) which are stronger than those of Johnson (1970).
For a formal argument showing how the terms in the asymptotic expansion given in (4.9) are calculated, see Johnson (1970) and Ghosh et al. (1982). For example, if we want to obtain an approximation of the posterior distribution up to an error of order o(n^{−1}), we take k = 2 and proceed as follows. This is taken from Ghosh (1994). Let t = √n(θ − θ̂_n) and a_i = L_n^{(i)}(θ̂_n)/n, i ≥ 1, so that a₂ = −Î_n. The posterior density of t is given by (4.4), and by Taylor expansion

L_n(θ̂_n + t/√n) − L_n(θ̂_n) = (1/2) a₂ t² + n^{−1/2} (1/6) a₃ t³ + n^{−1} (1/24) a₄ t⁴ + ⋯

and

π(θ̂_n + t/√n) = π(θ̂_n) + n^{−1/2} t π′(θ̂_n) + n^{−1} (1/2) t² π″(θ̂_n) + ⋯.

Therefore,

π(θ̂_n + t/√n) exp[L_n(θ̂_n + t/√n) − L_n(θ̂_n)]
= π(θ̂_n) exp[a₂ t²/2] [1 + n^{−1/2} α₁(t; X₁, ..., X_n) + n^{−1} α₂(t; X₁, ..., X_n)] + o(n^{−1}),

where

α₁(t; X₁, ..., X_n) = (1/6) t³ a₃ + t π′(θ̂_n)/π(θ̂_n),

α₂(t; X₁, ..., X_n) = (1/24) t⁴ a₄ + (1/72) t⁶ a₃² + (1/6) t⁴ a₃ π′(θ̂_n)/π(θ̂_n) + (1/2) t² π″(θ̂_n)/π(θ̂_n).
The normalizer C_n also has a similar expansion, which can be obtained by integrating the above. The posterior density of t is then expressed as

π*_n(t|X₁, ..., X_n) = (2π)^{−1/2} Î_n^{1/2} e^{−t² Î_n/2} [1 + Σ_{j=1}^2 n^{−j/2} γ_j(t; X₁, ..., X_n)] + o(n^{−1}),

where

γ₁(t; X₁, ..., X_n) = (1/6) t³ a₃ + t π′(θ̂_n)/π(θ̂_n)

and

γ₂(t; X₁, ..., X_n) = α₂(t; X₁, ..., X_n) − [a₄/(8a₂²) − 15a₃²/(72a₂³) + a₃ π′(θ̂_n)/(2a₂² π(θ̂_n)) − π″(θ̂_n)/(2a₂ π(θ̂_n))],

the bracketed constant being the expectation of α₂ under the N(0, −1/a₂) density.
Transforming to s = Î_n^{1/2} t, we get the expansion for the posterior density of √n Î_n^{1/2}(θ − θ̂_n), and integrating it from −∞ to u, we get the terms in (4.9). The above expansion for the posterior density also gives an expansion for the posterior mean:

E(θ|X₁, ..., X_n) = θ̂_n + n^{−1} [(1/2) a₃ Î_n^{−2} + Î_n^{−1} π′(θ̂_n)/π(θ̂_n)] + o(n^{−1}).
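This expansion of the posterior mean can be checked in a case where the exact posterior mean is available in closed form. In the Bernoulli model with a uniform prior, the exact posterior mean is (r + 1)/(n + 2), the π′/π term vanishes, and (our own computation, not from the text) the a₃ term reduces the n^{−1} correction to (1 − 2θ̂_n)/n:

```python
def exact_posterior_mean(r, n):
    """Uniform prior, r successes in n Bernoulli trials: posterior is Beta(r+1, n-r+1)."""
    return (r + 1) / (n + 2)

def expansion_posterior_mean(r, n):
    """theta_hat + correction/n; for this model the correction is 1 - 2*theta_hat."""
    theta_hat = r / n
    return theta_hat + (1.0 - 2.0 * theta_hat) / n

for n in (10, 100, 1000):
    r = int(0.3 * n)
    gap = abs(exact_posterior_mean(r, n) - expansion_posterior_mean(r, n))
    print(n, gap)
# the gap shrinks like n**(-2), consistent with the o(1/n) order of the expansion
```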
Similar expansions can also be obtained for other moments and quantiles. For more details and discussion see Johnson (1970), Ghosh et al. (1982), and Ghosh (1994). Ghosh et al. (1982) and Ghosh (1994) also obtain expansions of the Bayes estimate and the Bayes risk. These expansions are rather delicate in the sense that the terms in the expansion can tend to infinity; see, e.g., the discussion in Ghosh et al. (1982). The expansions agree with those obtained by Tierney and Kadane (1986) (see Section 4.3) up to o(n^{−2}). Although the Tierney-Kadane approximation is more convenient for numerical calculations, the expansions obtained in Ghosh et al. (1982) and Ghosh (1994) are more suitable for theoretical applications.

A Bayesian would want to prove an expansion like (4.9) under the marginal distribution of X₁, ..., X_n derived from the joint distribution of the X's and θ. There are certain technical difficulties in proving this from (4.9). Such a result will hold if the prior π(θ) is supported on a bounded interval and behaves smoothly at the boundary points in the sense that π(θ) and (d^i/dθ^i) π(θ), i = 1, 2, ..., k, are zero at the boundary points. One might examine whether the posterior seems reasonable in such cases; a rather technical proof is given in Ghosh et al. (1982). See also in this context Bickel and Ghosh (1990). For the uniform prior on a bounded interval, there can be no asymptotic expansion of the integrated Bayes risk (with squared error loss) of the form a₀ + a₁/n + a₂/n² + o(n⁻²) (Ghosh et al. (1982)).

4.2.1 Determination of Sample Size in Testing

In this subsection, we consider certain testing problems and find asymptotic approximations to the corresponding (minimum) Bayes risks. These approximations can be used to determine sample sizes required to achieve given bounds for Bayes risks.
We first consider the case with a real parameter θ ∈ Θ, an open interval in ℝ, and the problem of testing

H₀ : θ ≤ θ₀ versus H₁ : θ > θ₀

for some specified value θ₀. Let X₁, ..., X_n be i.i.d. observations with a common density f(x|θ) involving the parameter θ. Let π(θ) be a prior density over Θ and π(θ|x) the corresponding posterior. Set

R₀(x) = P(θ > θ₀|x) = ∫_{θ>θ₀} π(θ|x) dθ,  R₁(x) = 1 − R₀(x).
As mentioned in Section 2.7.2, the Bayes rule for the usual 0-1 loss (see Section 2.5) is to choose $H_0$ if $R_0(X) < R_1(X)$, or equivalently $R_1(X) > 1/2$, and to choose $H_1$ otherwise. The (minimum) Bayes risk is then given by

$$r(\pi) = \int_{\theta > \theta_0} P_\theta[R_1(X) > 1/2]\,\pi(\theta)\,d\theta + \int_{\theta \le \theta_0} P_\theta[R_1(X) < 1/2]\,\pi(\theta)\,d\theta. \quad (4.11)$$

By Theorem 2.7, an alternative expression for the Bayes risk is given by

$$r(\pi) = E\left[\min\{R_0(X), R_1(X)\}\right] \quad (4.12)$$
where the expectation is with respect to the marginal distribution of $X$. Suppose $|\theta - \theta_0| > \delta$ where $\delta$ is chosen suitably. For each such $\theta$, $\hat{\theta}_n$ is close to $\theta$ with large probability, and hence $|\hat{\theta}_n - \theta_0| > \delta$. Intuitively, for such $\hat{\theta}_n$ it will be relatively easy to choose the correct hypothesis. This suggests most of the contribution to the right-hand side of (4.11) comes from $\theta$ close to $\theta_0$, i.e., from $|\theta - \theta_0| < \delta$. A formal argument that we skip shows

$$r(\pi) = \int_{\theta_0 < \theta < \theta_0 + \delta_n} P_\theta[R_1(X) > 1/2]\,\pi(\theta)\,d\theta + \int_{\theta_0 - \delta_n < \theta \le \theta_0} P_\theta[R_1(X) < 1/2]\,\pi(\theta)\,d\theta + o(n^{-1/2}), \quad (4.13)$$

if $\delta_n = c\sqrt{\log n}/\sqrt{n}$ with $c$ sufficiently large. You are invited to verify this for the $N(\theta, 1)$ model in Problem 7. An approximation to the first integral of (4.13) can be obtained as follows. By the result on normal approximation to the posterior stated in the paragraph following (4.10),
$$R_1(x) = P\left[\sqrt{n}\,\hat{I}_n^{1/2}(\theta - \hat{\theta}_n) \le \sqrt{n}\,\hat{I}_n^{1/2}(\theta_0 - \hat{\theta}_n) \,\middle|\, x\right]$$

can be approximated by $\Phi[\sqrt{n}\,\hat{I}_n^{1/2}(\theta_0 - \hat{\theta}_n)]$. Hence

$$P_\theta[R_1(X) > 1/2] \approx P_\theta\left[\Phi\left(\sqrt{n}\,\hat{I}_n^{1/2}(\theta_0 - \hat{\theta}_n)\right) > 1/2\right].$$

Indeed, using appropriate uniform versions of the results on asymptotic expansions of the posterior distribution (as stated above) and of the sampling distribution of $\sqrt{n}\,\hat{I}_n^{1/2}(\hat{\theta}_n - \theta)$ given $\theta$ (see, e.g., Ghosh (1994)), one obtains

$$P_\theta[R_1(X) > 1/2] = \Phi\left[-\sqrt{n}\,I^{1/2}(\theta)(\theta - \theta_0)\right] + o(n^{-1/2})$$

uniformly in $\theta$ belonging to bounded intervals contained in $\Theta$. Thus

$$\int_{\theta_0 < \theta < \theta_0 + \delta_n} P_\theta[R_1(X) > 1/2]\,\pi(\theta)\,d\theta = \int_{\theta_0 < \theta < \theta_0 + \delta_n} \Phi\left[-\sqrt{n}\,I^{1/2}(\theta)(\theta - \theta_0)\right]\pi(\theta)\,d\theta + o(n^{-1/2}).$$
4.2 Asymptotic Expansion of Posterior Distribution
With a similar approximation for the second integral of (4.13), we have

$$r(\pi) = \int_{\theta_0 < \theta < \theta_0 + \delta_n} \Phi\left[-\sqrt{n}\,I^{1/2}(\theta)(\theta - \theta_0)\right]\pi(\theta)\,d\theta + \int_{\theta_0 - \delta_n < \theta \le \theta_0} \Phi\left[\sqrt{n}\,I^{1/2}(\theta)(\theta - \theta_0)\right]\pi(\theta)\,d\theta + o(n^{-1/2})$$

$$= \frac{1}{\sqrt{n}}\int_0^{c\sqrt{\log n}} \Phi\left[-t\,I^{1/2}(\theta_0 + t/\sqrt{n})\right]\pi(\theta_0 + t/\sqrt{n})\,dt + \frac{1}{\sqrt{n}}\int_{-c\sqrt{\log n}}^0 \Phi\left[t\,I^{1/2}(\theta_0 + t/\sqrt{n})\right]\pi(\theta_0 + t/\sqrt{n})\,dt + o(n^{-1/2}),$$

substituting $t = \sqrt{n}(\theta - \theta_0)$. If we assume $\pi(\theta)$ and $I(\theta)$ have bounded derivatives in some neighborhoods of $\theta_0$, the above reduces to

$$r(\pi) = \frac{2\pi(\theta_0)}{\sqrt{n}}\int_0^\infty \Phi\left[-t\,I^{1/2}(\theta_0)\right]dt + o(n^{-1/2}) = \frac{2C\,\pi(\theta_0)}{\sqrt{n\,I(\theta_0)}} + o(n^{-1/2}), \quad (4.14)$$

where $C = \int_0^\infty [1 - \Phi(u)]\,du \approx 0.3989423$. From (4.14) it follows that if one wants to have Bayes risk at most equal to some specified $r_0$, then the required sample size $n_0$ with which one can achieve this (approximately) is given by

$$n_0 = \frac{4C^2\,(\pi(\theta_0))^2}{r_0^2\,I(\theta_0)}. \quad (4.15)$$
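Formula (4.15) is easy to evaluate numerically. The following sketch (an illustration of mine, not part of the text; the function name is invented) computes the constant $C$, which equals $1/\sqrt{2\pi}$, and the sample size required for a Bernoulli parameter with a uniform prior and $\theta_0 = 1/2$, so that $\pi(\theta_0) = 1$ and $I(\theta_0) = 4$. These values anticipate Example 4.4 below.

```python
import math

# C = integral_0^infinity [1 - Phi(u)] du, which equals 1/sqrt(2*pi)
C = 1.0 / math.sqrt(2.0 * math.pi)

def required_sample_size(prior_at_theta0, fisher_info_at_theta0, r0):
    """Approximate sample size n0 from (4.15) so that the Bayes risk
    of the one-sided test is at most r0."""
    return 4.0 * C**2 * prior_at_theta0**2 / (r0**2 * fisher_info_at_theta0)

# Bernoulli with uniform prior and theta0 = 1/2:
# pi(theta0) = 1 and I(theta0) = 1/(theta0*(1 - theta0)) = 4.
n0 = required_sample_size(1.0, 4.0, r0=0.04)
print(round(C, 7))       # 0.3989423
print(math.ceil(n0))     # 100
```

With $r_0 = 0.04$ the formula gives $n_0 \approx 99.5$, i.e., a required sample size of 100, agreeing with the discussion of Example 4.4.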
In the same way we can handle, say, a two-parameter problem with parameter $\theta = (\theta_1, \theta_2)$. Suppose $\theta_1$ and $\theta_2$ are comparable and the quantity of interest is $\eta = \theta_1 - \theta_2$. The problem is to test

$$H_0: \eta \le \eta_0 \quad \text{versus} \quad H_1: \eta > \eta_0$$

for some specified $\eta_0$. Let $\pi(\theta)$ be the joint prior density of $(\theta_1, \theta_2)$ and $\rho(\eta)$ be the marginal prior density of $\eta$. Let $\hat{I}_n$ be the observed Fisher information matrix as defined in the first part of Subsection 4.1.2. Then a normal approximation to the posterior distribution of $\theta$ is $N(\hat{\theta}_n, \hat{I}_n^{-1})$, vide Subsection 4.1.2. This implies that a normal approximation to the posterior of $\eta$ is given by $N(\hat{\theta}_{1n} - \hat{\theta}_{2n}, v_n)$ with

$$v_n = \hat{I}_n^{11} + \hat{I}_n^{22} - 2\hat{I}_n^{12},$$

where $\hat{I}_n^{ij}$ denotes the $(i,j)$th element of $\hat{I}_n^{-1}$. Note that $(n v_n)^{-1/2} \to b(\theta) = [I^{11}(\theta) + I^{22}(\theta) - 2I^{12}(\theta)]^{-1/2}$, where $I^{ij}(\theta)$ denotes the $(i,j)$th element of $I^{-1}(\theta)$, $I(\theta)$ being the expected Fisher information matrix per unit observation.
Let $\pi^*(\beta)$ and $\pi^*(\eta|\beta)$ be respectively the marginal prior density of $\beta = \theta_1 + \theta_2$ and the conditional prior density of $\eta$ given $\beta$, and let $a(\eta, \beta)$ be $b(\theta)$ expressed in terms of $\eta$ and $\beta$. Then by arguments similar to those used above, an approximation to the Bayes risk for this problem is

$$r(\pi) = \frac{2C}{\sqrt{n}} \int \frac{\pi^*(\eta_0 \mid \beta)}{a(\eta_0, \beta)}\,\pi^*(\beta)\,d\beta,$$

where $C$ is as in (4.14).

It would be a matter of taste whether one would use simulation or asymptotic approximation. In any case, each method can confirm the accuracy of the other. The advantage of asymptotics is that we get an overview quickly. In specific cases, simulation may be a more efficient alternative, and asymptotics can be used to confirm the calculation.

Example 4.4. Let the observations $X_1, \ldots, X_n$ be i.i.d. $B(1, \theta)$, $0 < \theta < 1$, and suppose we want to test $H_0: \theta \le 1/2$ versus $H_1: \theta > 1/2$. If we consider the uniform prior $\pi(\theta) = 1$ on $(0, 1)$, we have

$$R_0(x) = \frac{\Gamma(n+2)}{\Gamma(T+1)\,\Gamma(n-T+1)} \int_{1/2}^1 \theta^T (1 - \theta)^{n-T}\,d\theta,$$

which is a function of $T = \sum_{i=1}^n X_i$, and the marginal distribution of $T$ is uniform over $\{0, 1, \ldots, n\}$. Then from (4.12) the Bayes risk is given by

$$r(\pi) = \frac{1}{n+1} \sum_{t=0}^n \min\{R_0(t), 1 - R_0(t)\}.$$

Here $I(\theta) = [\theta(1-\theta)]^{-1}$ and the approximation suggested in (4.14) is

$$r^*(\pi) = \frac{1}{\sqrt{n}} \int_0^\infty [1 - \Phi(u)]\,du.$$

Table 4.1 gives the exact values of the Bayes risk $r(\pi)$ and its approximation $r^*(\pi)$ for different values of $n$. If one wants to have Bayes risk at most equal to $r_0 = 0.04$, the approximate formula (4.15) gives a required sample size of at least 100, while the exact expression for $r(\pi)$ yields $n \ge 99$.

The above calculations are relevant in the planning stage, when there are no data. If we have a sample of size $n$ and want to control the posterior Bayes risk by drawing $m$ additional observations, we can follow a similar procedure, replacing the prior by the posterior from the first stage of data. Ideally, the first-stage sample would be a pilot sample of relatively small size, and the bulk of the data would come from the second stage. In this case, we may even allow an improper noninformative prior for one-sided alternatives.
Table 4.1. The exact values of the Bayes risk $r(\pi)$ and its approximation $r^*(\pi)$ for Example 4.4.

  n        10     20     30     40     50     60     70     80     90    100    150    200    250
  r(π)   .1230  .0881  .0722  .0627  .0561  .0513  .0475  .0445  .0419  .0398  .0325  .0282  .0252
  r*(π)  .1262  .0892  .0728  .0631  .0564  .0515  .0477  .0446  .0421  .0399  .0326  .0282  .0252
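The entries of Table 4.1 can be reproduced with a few lines of code. For the uniform prior, the posterior given $T = t$ is Beta$(t+1, n-t+1)$, and for integer parameters the Beta tail probability reduces to a binomial sum, so only the standard library is needed. The following is a sketch of mine for checking the table, not part of the original text.

```python
import math

def exact_bayes_risk(n):
    """r(pi) from (4.12) for i.i.d. B(1, theta) data with a uniform prior
    and H0: theta <= 1/2.  Uses the identity
    P(Beta(t+1, n-t+1) > 1/2) = P(Binomial(n+1, 1/2) <= t)."""
    total = 0.0
    for t in range(n + 1):
        r0 = sum(math.comb(n + 1, k) for k in range(t + 1)) / 2.0**(n + 1)
        total += min(r0, 1.0 - r0)
    return total / (n + 1)

def approx_bayes_risk(n):
    """r*(pi) = (1/sqrt(n)) * integral_0^inf [1 - Phi(u)] du = 1/sqrt(2*pi*n)."""
    return 1.0 / math.sqrt(2.0 * math.pi * n)

for n in (10, 50, 100, 250):
    print(n, round(exact_bayes_risk(n), 4), round(approx_bayes_risk(n), 4))
# agrees with Table 4.1, e.g. n = 10 gives .1230 (exact) and .1262 (approx)
```

The agreement improves with $n$, as the $o(n^{-1/2})$ error term in (4.14) suggests.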
4.3 Laplace Approximation

Bayesian analysis requires evaluation of integrals of the form

$$\int g(\theta)\, f(x|\theta)\,\pi(\theta)\,d\theta,$$

where $f(x|\theta)$ is the likelihood function, $\pi(\theta)$ is the prior density, and $g(\theta)$ is some function of $\theta$. For example, with $g(\theta) = 1$ we have the integrated likelihood required for calculation of the Bayes factor in testing or model selection. Various other characteristics of the posterior and predictive distribution may also be expressed in terms of such integrals. Laplace's method (see Laplace (1774)) is a technique for approximating integrals when the integrand has a sharp maximum.

4.3.1 Laplace's Method

Let us consider an integral of the form

$$I = \int_{-\infty}^{\infty} q(\theta)\, e^{nh(\theta)}\,d\theta,$$

where $q$ and $h$ are smooth functions of $\theta$ with $h$ having a unique maximum at $\hat{\theta}$. In applications, $nh(\theta)$ may be the log-likelihood function or the logarithm of the unnormalized posterior density $f(x|\theta)\pi(\theta)$, and $\hat{\theta}$ may be the MLE or the posterior mode. The idea is that if $h$ has a unique sharp maximum at $\hat{\theta}$, then most of the contribution to the integral $I$ comes from the integral over a small neighborhood $(\hat{\theta} - \delta, \hat{\theta} + \delta)$ of $\hat{\theta}$. We study the behavior of $I$ as $n \to \infty$. As $n \to \infty$, we have

$$I \sim I_1 = \int_{\hat{\theta} - \delta}^{\hat{\theta} + \delta} q(\theta)\, e^{nh(\theta)}\,d\theta.$$

Here $I \sim I_1$ means $I/I_1 \to 1$. Laplace's method involves Taylor series expansion of $q$ and $h$ about $\hat{\theta}$, which gives

$$I_1 = \int_{\hat{\theta} - \delta}^{\hat{\theta} + \delta} \left[ q(\hat{\theta}) + (\theta - \hat{\theta})\,q'(\hat{\theta}) + \frac{1}{2}(\theta - \hat{\theta})^2 q''(\hat{\theta}) + \text{smaller terms} \right] \exp\left[ nh(\hat{\theta}) + n h'(\hat{\theta})(\theta - \hat{\theta}) + \frac{n}{2} h''(\hat{\theta})(\theta - \hat{\theta})^2 + \text{smaller terms} \right] d\theta$$
$$= e^{nh(\hat{\theta})}\, q(\hat{\theta}) \int_{\hat{\theta} - \delta}^{\hat{\theta} + \delta} \left[ 1 + (\theta - \hat{\theta})\frac{q'(\hat{\theta})}{q(\hat{\theta})} + \frac{1}{2}(\theta - \hat{\theta})^2 \frac{q''(\hat{\theta})}{q(\hat{\theta})} \right] \exp\left[ \frac{n}{2} h''(\hat{\theta})(\theta - \hat{\theta})^2 \right] d\theta,$$

using $h'(\hat{\theta}) = 0$. Assuming that $c = -h''(\hat{\theta})$ is positive and using the change of variable $t = \sqrt{nc}\,(\theta - \hat{\theta})$, we have

$$I \sim \frac{e^{nh(\hat{\theta})}}{\sqrt{nc}} \int_{-\delta\sqrt{nc}}^{\delta\sqrt{nc}} \left[ q(\hat{\theta}) + \frac{t}{\sqrt{nc}}\, q'(\hat{\theta}) + \frac{t^2}{2nc}\, q''(\hat{\theta}) \right] e^{-t^2/2}\,dt \sim e^{nh(\hat{\theta})} \sqrt{\frac{2\pi}{nc}}\; q(\hat{\theta}) \left[ 1 + \frac{q''(\hat{\theta})}{2nc\, q(\hat{\theta})} \right],$$

the term linear in $t$ integrating to zero.
In general, for the case of a $p$-dimensional parameter $\theta$,

$$I = e^{nh(\hat{\theta})}\, (2\pi)^{p/2}\, n^{-p/2}\, \det\left(\Delta_h(\hat{\theta})\right)^{-1/2} q(\hat{\theta})\,\{1 + O(n^{-1})\} \quad (4.16)$$

where $\Delta_h(\hat{\theta})$ denotes the Hessian of $-h$, i.e.,

$$\Delta_h(\hat{\theta}) = \left( -\frac{\partial^2}{\partial \theta_i\, \partial \theta_j}\, h(\hat{\theta}) \right)_{p \times p}.$$
Example 4.5. (Stirling's approximation to $n!$) Note that $n!$ can be written as a gamma integral

$$n! = \Gamma(n+1) = \int_0^\infty e^{-x} x^n\,dx = \int_0^\infty e^{n(\log x - x/n)}\,dx.$$

One can use the Laplace method described above to approximate $n!$ as (Problem 9)

$$n! \approx \sqrt{2\pi n}\; n^n e^{-n}.$$
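The quality of this Laplace approximation is easy to check numerically. The sketch below (mine, not part of the text) evaluates the approximation and compares it with the exact factorial.

```python
import math

def stirling(n):
    """Laplace approximation to n! from Example 4.5.
    With h(x) = log x - x/n, the maximum is at x = n where
    -h''(n) = 1/n**2, so e^{n h(n)} * sqrt(2*pi/(n * (1/n**2)))
    gives n^n * e^{-n} * sqrt(2*pi*n)."""
    return math.sqrt(2.0 * math.pi * n) * n**n * math.exp(-n)

for n in (5, 10, 20):
    print(n, stirling(n) / math.factorial(n))
# the ratio tends to 1 as n grows (it behaves like 1 - 1/(12n))
```

The relative error already falls below one percent for $n = 10$, illustrating the $1 + O(n^{-1})$ factor in (4.16).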
The Bayesian Information Criterion (BIC)

Consider a model with likelihood $f(x|\theta)$ and prior $\pi(\theta)$. Equation (4.16), with $q = \pi$ and $nh(\theta)$ equal to the log-likelihood, yields an approximation to the integrated likelihood that can be used to find an approximation to the Bayes factor defined in (2.11). Schwarz (1978) proposed a criterion, known as the BIC, based on (4.16), ignoring the terms that stay bounded as the sample size $n \to \infty$. The criterion, given by

$$BIC = \log f(x|\hat{\theta}) - (p/2)\log n,$$

serves as an approximation to the logarithm of the integrated likelihood of the model and is free from the choice of prior.
Connection Between Laplace Approximation and Posterior Normality

Posterior normality discussed in Section 4.1.2 and the Laplace approximation are closely connected. The proof of posterior normality is essentially an application of the Laplace approximation with a rigorous handling of the error term. We illustrate this below by re-deriving posterior normality by an application of the Laplace approximation.

Let $X_1, X_2, \ldots, X_n$ be i.i.d. observations with a density $f(x|\theta)$ and $\hat{\theta}$ be the MLE of $\theta$. We will find an approximation to the posterior distribution of $t = \sqrt{n}(\theta - \hat{\theta})$ using Laplace's method. Let $\pi(\theta)$ be the prior density and $\Pi(\cdot|x)$ denote the posterior distribution. Then for $a > 0$,

$$\Pi(-a < t < a \mid x) = \Pi\left(\hat{\theta} - a/\sqrt{n} < \theta < \hat{\theta} + a/\sqrt{n} \,\middle|\, x\right) = J_n / I_n,$$

where

$$J_n = \int_{\hat{\theta} - a/\sqrt{n}}^{\hat{\theta} + a/\sqrt{n}} e^{nh(\theta)}\,\pi(\theta)\,d\theta, \qquad I_n = \int_{\Theta} e^{nh(\theta)}\,\pi(\theta)\,d\theta,$$

with $nh(\theta)$ the log-likelihood $\sum_{i=1}^n \log f(x_i|\theta)$. As obtained above,

$$I_n \sim e^{nh(\hat{\theta})} \sqrt{\frac{2\pi}{nc}}\; \pi(\hat{\theta}),$$

with $c = -h''(\hat{\theta})$, which is the observed Fisher information per unit observation. Using Laplace's method for $J_n$ we have

$$J_n \sim e^{nh(\hat{\theta})} \int_{\hat{\theta} - a/\sqrt{n}}^{\hat{\theta} + a/\sqrt{n}} \left[ \pi(\hat{\theta}) + (\theta - \hat{\theta})\,\pi'(\hat{\theta}) + \text{smaller terms} \right] \exp\left[ -nc(\theta - \hat{\theta})^2/2 \right] d\theta \sim e^{nh(\hat{\theta})}\,\pi(\hat{\theta})\, \frac{1}{\sqrt{n}} \int_{-a}^{a} e^{-ct^2/2}\,dt.$$

Thus, for $a > 0$,

$$\Pi(-a < t < a \mid x) \sim \sqrt{\frac{c}{2\pi}} \int_{-a}^{a} e^{-ct^2/2}\,dt = P(-a < Z < a),$$

where $Z \sim N(0, c^{-1})$, which is the asserted asymptotic normality of the posterior distribution of $t = \sqrt{n}(\theta - \hat{\theta})$.
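This convergence can be observed numerically. In the sketch below (an illustration of mine, not from the text), we take Bernoulli data with $r$ successes in $n$ trials and a uniform prior, integrate the unnormalized posterior $\theta^r(1-\theta)^{n-r}$ by Simpson's rule, and compare $\Pi(-a < t < a \mid x)$ with $P(-a < Z < a)$ for $Z \sim N(0, 1/c)$, where $c = 1/(\hat{\theta}(1-\hat{\theta}))$ is the observed information per observation.

```python
import math

def simpson(f, lo, hi, m=2000):
    # composite Simpson rule with m (even) subintervals
    h = (hi - lo) / m
    s = f(lo) + f(hi)
    for i in range(1, m):
        s += f(lo + i * h) * (4 if i % 2 else 2)
    return s * h / 3.0

n, r = 100, 60
theta_hat = r / n
c = 1.0 / (theta_hat * (1.0 - theta_hat))       # observed info per observation

post = lambda th: th**r * (1.0 - th)**(n - r)    # unnormalized posterior
norm_const = simpson(post, 0.0, 1.0)

a = 1.0
lo = theta_hat - a / math.sqrt(n)
hi = theta_hat + a / math.sqrt(n)
exact = simpson(post, lo, hi) / norm_const       # Pi(-a < t < a | x)

# normal limit: P(-a < Z < a) for Z ~ N(0, 1/c)
phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
approx = phi(a * math.sqrt(c)) - phi(-a * math.sqrt(c))

print(round(exact, 3), round(approx, 3))         # the two values nearly agree
```

For $n = 100$ the two probabilities differ only in the third decimal place, consistent with the $O(n^{-1/2})$ error of the normal approximation.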
4.3.2 Tierney-Kadane-Kass Refinements

Suppose

$$E^\pi(g(\theta) \mid x) = \frac{\int g(\theta)\, f(x|\theta)\,\pi(\theta)\,d\theta}{\int f(x|\theta)\,\pi(\theta)\,d\theta} \quad (4.17)$$
is the Bayesian quantity of interest, where $g$, $f$, and $\pi$ are smooth functions of $\theta$. If we express (4.17) as a ratio of integrals of the form appearing in (4.16), with $h(\theta) = \frac{1}{n}\log\{f(x|\theta)\,\pi(\theta)\}$, and apply the Laplace approximation (4.16) to both the numerator and denominator (with $q$ equal to $g$ and 1, respectively), we obtain the first-order approximation

$$E^\pi(g(\theta) \mid x) = g(\tilde{\theta})\,\{1 + O(n^{-1})\}$$

(here $\tilde{\theta}$ denotes the posterior mode). This has been derived by Tierney and Kadane (1986), Kass et al. (1988), and Tierney et al. (1989). Suppose now that $g$ in (4.17) is positive, and let

$$nh(\theta) = \log f(x|\theta) + \log \pi(\theta), \qquad nh^*(\theta) = nh(\theta) + \log g(\theta) = nh(\theta) + G(\theta),$$

say. Now apply (4.16) to both the numerator and denominator of (4.17) with $q$ equal to 1. Then, letting $\theta^*$ denote the mode of $h^*$, $\tilde{\Sigma} = \Delta_h^{-1}(\tilde{\theta})$, and $\tilde{\Sigma}^* = \Delta_{h^*}^{-1}(\theta^*)$, Tierney and Kadane (1986) obtain the surprisingly accurate approximation

$$E^\pi(g(\theta) \mid x) = \frac{|\tilde{\Sigma}^*|^{1/2}\, \exp\left(nh^*(\theta^*)\right)}{|\tilde{\Sigma}|^{1/2}\, \exp\left(nh(\tilde{\theta})\right)}\,\{1 + O(n^{-2})\}. \quad (4.18)$$
It is shown below how the approximation (4.18) is obtained in Tierney and Kadane (1986) for the case of a real parameter. Let $\sigma^2 = -1/h''(\tilde{\theta})$, $\sigma^{*2} = -1/h^{*\prime\prime}(\theta^*)$. Also let $h_k = h_k(\tilde{\theta})$ and $h_k^* = h_k^*(\theta^*)$, where $\psi_k(\theta) = (d/d\theta)^k\,\psi(\theta)$ for any function $\psi(\theta)$. Note that under the usual regularity conditions $\sigma$, $\sigma^*$, $h_k$, $h_k^*$ are all of order $O(1)$. Consider first the denominator of (4.17), which can be written as

$$\int e^{nh(\theta)}\,d\theta = \int \exp\left[ nh(\tilde{\theta}) - \frac{n}{2\sigma^2}(\theta - \tilde{\theta})^2 + R_n(\theta) \right] d\theta = e^{nh(\tilde{\theta})}\,\sqrt{2\pi}\,\sigma\, n^{-1/2} \int \exp\left(R_n(\theta)\right)\phi(\theta; \tilde{\theta}, \sigma^2/n)\,d\theta,$$

where $\phi(\theta; \tilde{\theta}, \sigma^2/n)$ is the $N(\tilde{\theta}, \sigma^2/n)$ density for $\theta$ and

$$R_n(\theta) = nh(\theta) - nh(\tilde{\theta}) + \frac{n}{2\sigma^2}(\theta - \tilde{\theta})^2 = \frac{1}{6}(\theta - \tilde{\theta})^3\, n h_3 + \frac{(\theta - \tilde{\theta})^4}{4!}\, n h_4 + \cdots.$$

Using the expansion of $e^x$ at zero and the expressions for the moments of a normal distribution, we can obtain an approximation of order $O(n^{-r})$ for any $r \ge 1$.
Retaining terms up to the 6th derivative $h_6$ in the expansion of $R_n$, Tierney and Kadane (1986) obtain

$$\int e^{nh(\theta)}\,d\theta = e^{nh(\tilde{\theta})}\,\sqrt{2\pi}\,\sigma\, n^{-1/2} \left( 1 + \frac{a}{n} + \frac{b}{n^2} + O(n^{-3}) \right), \quad (4.19)$$

where

$$a = \frac{1}{8}\,\sigma^4 h_4 + \frac{5}{24}\,\sigma^6 h_3^2$$

and $b$ is a similar expression involving $\sigma$ and the derivatives $h_3, \ldots, h_6$. We have an exactly similar approximation for the numerator of (4.17) with $\sigma$ and $h_k$ replaced by $\sigma^*$ and $h_k^*$. We then have

$$E^\pi(g(\theta) \mid x) = \frac{\sigma^*}{\sigma}\, \exp\{n(h^*(\theta^*) - h(\tilde{\theta}))\} \left( 1 + \frac{a^* - a}{n} + \frac{b^* - b - a(a^* - a)}{n^2} + O(n^{-3}) \right).$$

Now note that, using $h'(\tilde{\theta}) = 0$,

$$0 = h^{*\prime}(\theta^*) = h'(\theta^*) + (1/n)\,G'(\theta^*) \approx h'(\tilde{\theta}) + (\theta^* - \tilde{\theta})\,h''(\tilde{\theta}) + (1/n)\,G'(\tilde{\theta}) + (1/n)(\theta^* - \tilde{\theta})\,G''(\tilde{\theta}) = (\theta^* - \tilde{\theta})\left( h''(\tilde{\theta}) + (1/n)\,G''(\tilde{\theta}) \right) + (1/n)\,G'(\tilde{\theta}),$$

which implies $\theta^* - \tilde{\theta} = O(n^{-1})$. This, together with the fact that $h_k^*(\theta) = h_k(\theta) + (1/n)\,G_k(\theta)$, implies $a^* - a$ and $b^* - b$ are both of order $O(n^{-1})$. It then follows that

$$E^\pi(g(\theta) \mid x) = \frac{\sigma^*}{\sigma}\, \exp\{n(h^*(\theta^*) - h(\tilde{\theta}))\}\,\left( 1 + O(n^{-2}) \right).$$
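As a numerical illustration (mine, not from the text), the sketch below applies the one-dimensional form of (4.18) to the posterior mean of a Bernoulli parameter. With $r$ successes in $n$ trials and a uniform prior, the unnormalized log posterior is $\ell(\theta) = r\log\theta + (n-r)\log(1-\theta)$ (so $nh = \ell$ and $nh^* = \ell + \log\theta$), both modes and second derivatives are available in closed form, and the exact posterior mean $(r+1)/(n+2)$ is known for comparison. The function name is invented.

```python
import math

def tierney_kadane_mean(n, r):
    """Tierney-Kadane approximation (4.18) to E(theta | x) for Bernoulli
    data with r successes in n trials and a uniform prior."""
    mode = r / n                       # mode of l
    mode_star = (r + 1) / (n + 1)      # mode of l* = l + log(theta)

    def ell(th):
        return r * math.log(th) + (n - r) * math.log(1 - th)

    def ell_star(th):
        return (r + 1) * math.log(th) + (n - r) * math.log(1 - th)

    d2 = r / mode**2 + (n - r) / (1 - mode)**2                  # -l''(mode)
    d2_star = (r + 1) / mode_star**2 + (n - r) / (1 - mode_star)**2

    return math.sqrt(d2 / d2_star) * math.exp(ell_star(mode_star) - ell(mode))

n, r = 20, 12
print(tierney_kadane_mean(n, r))   # close to the exact mean (r+1)/(n+2)
print((r + 1) / (n + 2))
```

Note that the normalizing constants cancel in the ratio, so the unnormalized log posterior suffices; in a multiparameter problem the second derivatives would be replaced by determinants of Hessians, as in (4.18).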
Example 4.6. We consider the data in Table 2.1 presented in Example 2.3. This is a set of data on food poisoning, and we focus on the main suspect, namely, potato salad. Separately for Crabmeat and No Crabmeat, we wish to test the null hypothesis that there is no association between potato salad and illness. Let $p_1$ be the probability of being ill given that potato salad is taken and $p_2$ be the same given no potato salad. If $X_1$ denotes the number of people falling ill out of a total of $n_1$ people taking potato salad and $X_2$ denotes the same out of a total of $n_2$ people taking no potato salad, then $X_1$ and $X_2$ may be modeled as independent binomial variables with $X_i$ following $B(n_i, p_i)$,
$i = 1, 2$. The test for no association between potato salad and illness is then equivalent to testing $H_0: p_1 = p_2$.

We first carry out the test through credible intervals for $p_1 - p_2$ as described in Section 2.7.4. In order to obtain an exact Bayes test we have to choose prior densities for $p_1$ and $p_2$. We have seen in Example 2.2 that the choice of a Beta prior for a binomial proportion simplifies the calculation of the posterior. If we consider a Beta$(\alpha_i, \beta_i)$ prior for $p_i$, $i = 1, 2$, the posterior density of $\theta = p_1 - p_2$ can be obtained as

$$\pi(\theta \mid x_1, x_2) \propto \int_0^{1-\theta} (\theta + p_2)^{\alpha_1 + x_1 - 1}\,(1 - \theta - p_2)^{\beta_1 + n_1 - x_1 - 1}\, p_2^{\alpha_2 + x_2 - 1}\,(1 - p_2)^{\beta_2 + n_2 - x_2 - 1}\,dp_2,$$

which can only be numerically calculated for a given $\theta$. Because the sample size here is sufficiently large, we will, however, find an approximation to the posterior distribution using asymptotic posterior normality. This does not involve specification of the prior distributions. One can easily calculate the Fisher information matrix $I_n$ and show that the approximate posterior distribution of $\theta = p_1 - p_2$ is $N(a, b^2)$, where

$$a = \hat{p}_1 - \hat{p}_2, \qquad b^2 = \hat{p}_1(1 - \hat{p}_1)/n_1 + \hat{p}_2(1 - \hat{p}_2)/n_2, \qquad \hat{p}_i = X_i/n_i.$$

A $100(1-\alpha)\%$ HPD credible interval for $\theta$ is then

$$a - b\,z_{\alpha/2} \le \theta \le a + b\,z_{\alpha/2},$$

where $z_{\alpha/2}$ is the $100(1 - \alpha/2)\%$ quantile of $N(0, 1)$. For the case with crabmeat, $X_1 = 120$, $n_1 = 200$, $X_2 = 4$, $n_2 = 35$. The 99% HPD credible interval turns out to be (0.337, 0.635). For the case with no crabmeat, $X_1 = 22$, $n_1 = 46$, $X_2 = 0$, $n_2 = 23$, and the 99% HPD credible interval is (0.307, 0.650). In both cases, the hypothesized value of $\theta$ (namely 0) falls well outside the credible intervals, implying strong evidence against the null hypothesis of no association.

We can calculate the significance level $P$ by finding the $100(1 - P)\%$ credible interval that has the null value 0 on its boundary. More directly, this will be the usual P-value corresponding to the observed $\chi^2$ with one d.f. We consider only the case with crabmeat; the other case can be handled similarly. The logarithm of the ratio of the maximized likelihoods under $H_0$ and $H_1$ is obtained as $\log \Lambda = -15.4891$. Therefore

$$\text{P-value} = P(\chi^2_1 > 30.9782) \approx 0.$$

We now look at the same problem through the Bayes factor (BF). In order to compute the BF, we may use the Beta prior as mentioned above. However, because there is no consensus prior for this problem, we use the Schwarz BIC (Section 4.3.1) to approximate the BF. For the case with crabmeat, the BF arising from BIC is given by $BF_{01} = 2.8754 \times 10^{-6}$. This implies that with equal prior probabilities for $H_0$ and $H_1$, the posterior probability of $H_0$ is
$[1 + 1/BF_{01}]^{-1} \approx 2.8754 \times 10^{-6}$. This is very small but not as small as the P-value. In any case, both approaches indicate strong evidence in favor of potato salad being the cause of the food poisoning.
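The numbers quoted for the crabmeat case can be reproduced as follows. This is a sketch of mine using only the likelihood-ratio and BIC formulas above; since $H_1$ has one extra parameter, the Schwarz approximation to $BF_{01}$ is $\exp(\log\Lambda + \frac{1}{2}\log n)$ with $n = n_1 + n_2$.

```python
import math

x1, n1 = 120, 200    # ill / total among those who took potato salad
x2, n2 = 4, 35       # ill / total among those who did not

def binom_loglik(x, n, p):
    # log-likelihood of x successes in n trials; binomial coefficients
    # are omitted because they cancel in the ratio
    return x * math.log(p) + (n - x) * math.log(1 - p)

# maximized log-likelihoods under H1 (separate p's) and H0 (common p)
l1 = binom_loglik(x1, n1, x1 / n1) + binom_loglik(x2, n2, x2 / n2)
p_pooled = (x1 + x2) / (n1 + n2)
l0 = binom_loglik(x1, n1, p_pooled) + binom_loglik(x2, n2, p_pooled)

log_lambda = l0 - l1
print(round(log_lambda, 4))         # -15.4891, as in the text
print(round(-2 * log_lambda, 4))    # 30.9782, the chi-square statistic

# Schwarz (BIC) approximation to the Bayes factor BF_01
bf01 = math.exp(log_lambda + 0.5 * math.log(n1 + n2))
print(bf01)                         # about 2.875e-06
```

The posterior probability of $H_0$ under equal prior odds, $[1 + 1/BF_{01}]^{-1}$, is essentially equal to $BF_{01}$ here because the Bayes factor is so small.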
4.4 Exercises

1. With a Poisson likelihood and a Gamma prior for the Poisson parameter $\theta$, show that the posterior is consistent at any $\theta_0 > 0$.
2. Let $X_1, \ldots, X_n$ be i.i.d. observations with a common density $f(x|\theta)$, $\theta \in \Theta = \{\theta_1, \theta_2, \ldots, \theta_k\}$. Consider a prior $(\pi_1, \pi_2, \ldots, \pi_k)$, with $\pi_i > 0$ for all $i$ and $\sum \pi_i = 1$. Suppose the distributions corresponding to $f(x|\theta_i)$, $i = 1, \ldots, k$, are all distinct. Show that the posterior is consistent at each $\theta_i$. (Hint: Express the posterior in terms of $Z_r = (1/n)\sum_{j=1}^n \log\left(f(X_j|\theta_r)/f(X_j|\theta_i)\right)$, $r = 1, \ldots, k$.)
3. Show that asymptotic posterior normality (as stated in Theorem 4.2) implies posterior consistency at $\theta_0$.
4. Verify Condition (A4) (see Theorem 4.2) for the $N(\theta, 1)$ example.
5. Obtain the Laplace approximation to the integrated likelihood from (4.5).
6. Consider the $N(\mu, 1)$ likelihood. Generate data of size 30 from $N(0, 1)$. Consider the following priors for $\mu$: (i) $N(0, 2)$, (ii) $N(1, 2)$, (iii) $U(-3, 3)$. For each of these priors find $P(-0.5 < \mu < 0.5 \mid x)$ and $P(-0.2 < \mu < 0.6 \mid x)$ using (a) exact calculation, (b) normal approximation. Do the same with data generated from $N(1, 1)$.
7. Let $X_1, \ldots, X_n$ be i.i.d. $N(\theta, 1)$ and the prior distribution of $\theta$ be $N(0, \tau^2)$. Consider the problem of testing $H_0: \theta \le 0$ versus $H_1: \theta > 0$.
(a) Show that the Bayes risk $r(\pi)$ given by (4.11) reduces to
$$r(\pi) = 2\int_{\theta > 0} \Phi(-\sqrt{n}\,\theta)\,\pi(\theta)\,d\theta,$$
where $\pi(\cdot)$ denotes the $N(0, \tau^2)$ density for $\theta$.
(b) Verify (4.13) in this case.
8. Find numerically the exact posterior density of $\theta = p_1 - p_2$ in Example 4.6 with independent uniform priors for $p_1$ and $p_2$. Compare this with the normal approximation to the posterior.
9. Using the idea of Laplace's method for approximating integrals, derive the approximation $n! \approx \sqrt{2\pi n}\; n^n e^{-n}$ (see Example 4.5).
Choice of Priors for Low-dimensional Parameters
Given data, a Bayesian will need a likelihood function $p(x|\theta)$ and a prior $p(\theta)$. For many standard problems, the likelihood is known either from past experience or by convention. To drive the Bayesian engine, one would still need an appropriate prior. In this chapter, we consider only low-dimensional parameters. Admittedly, low dimension is not easy to define, but we expect the dimension $d$ to be much smaller than the sample size $n$ to qualify as low. In most of the examples in this chapter, $d = 1$ or 2 and is rarely bigger than 5.

Ideally, one wants to choose a prior or a class of priors reflecting one's prior knowledge and belief about the unknown parameters or about different hypotheses. This is a subjective choice. If one has a class of priors, it would be necessary to study the robustness of various aspects of the resulting Bayesian inference. Choice of subjective priors, usually called elicitation of priors, is still rather difficult. For some systematic methods of elicitation, see Kadane et al. (1980) and Garthwaite and Dickey (1988, 1992). A recent review is Garthwaite et al. (2005). Empirical studies have shown that experience and maturity help a person in quantifying uncertainty about an event in the form of a probability. However, assigning a fully specified probability distribution to an unknown parameter is difficult even when the parameter has a physical meaning like the length or breadth of some article. In such cases, it may be realistic to expect elicitation of the prior mean and variance or some other prior quantities, but not a full specification of the distribution. Hopefully, the situation will improve with practice, but it is hard to believe that a fully specified prior distribution will be available in all but very simple situations.

It is much more common to choose and use what are called objective priors. When very little prior information is available, objective priors are also called noninformative priors.
The older terminology of noninformative priors is no longer in favor among objective Bayesians because a complete lack of information is hard to define. However, it is indeed possible to construct objective priors with low information in the sense of Bernardo's information measure or
non-Euclidean geometry. These priors are not unique but, as indicated for the Bernoulli example (Example 2.2) in Chapter 2, even for a small sample size the posteriors arising from them are very close to each other. All these priors are constructed through well-defined algorithms. If some prior information is available, in some cases one can modify these algorithms. The objective priors are typically improper but have proper posteriors. They are suitable for estimation problems and also for testing problems where both null and alternative hypotheses have the same dimension. The objective priors need to be suitably modified for sharp null hypotheses; this is the subject of Chapter 6. Most of this chapter is about different principles and methods of construction of objective priors (Section 5.1) and common criticisms and answers (Section 5.2). Subjective priors appear very naturally when the decision maker judges his data to be exchangeable. We deal with this in Section 5.3. An example of elicitation of a different kind is given in Section 5.4.
5.1 Different Methods of Construction of Objective Priors

Because this section is rather long, we provide an overview here. How can we construct objective priors under general regularity conditions? We may do one of the following things.

1. Define a uniform distribution that takes into account the geometry of the parameter space.
2. Minimize a suitable measure of information in the prior.
3. Choose a prior with some form of frequentist ideas, because a prior with little information should lead to inference that is similar to frequentist inference.

To fully define these methods, we have to specify the geometry in (1), the measure of information in (2), and the frequentist ideas that are to be used in (3). This will be done in Subsections 5.1.2, 5.1.3, and 5.1.4. In Subsection 5.1.1, we discuss why the usual uniform prior $\pi(\theta) = c$ has come in for a lot of criticism. Indeed, these criticisms help one understand the motivation behind (1) and (2). It is a striking fact that both (1) and (2) lead to the Jeffreys prior, namely,

$$\pi(\theta) = \left[\det\left(I_{ij}(\theta)\right)\right]^{1/2},$$

where $(I_{ij}(\theta))$ is the Fisher information matrix. In the one-dimensional case, (3) also leads to the Jeffreys prior.

We have noted in Chapter 1 that many common statistical models possess additional structure. Some are exponential families of distributions, some are location-scale families, or more generally families invariant under a group of
transformations. Normals belong to both classes. For each of these special classes, there is a different choice of objective priors, discussed in Subsections 5.1.5 and 5.1.6. The objective priors for exponential families come from the class of conjugate priors. In the case of location-scale families with scale parameter $\sigma$, the common objective prior is the so-called right invariant Haar measure

$$\pi_1(\mu, \sigma) = \frac{1}{\sigma},$$

and the Jeffreys prior turns out to be the left invariant Haar measure

$$\pi_2(\mu, \sigma) = \frac{1}{\sigma^2}$$

(see Subsection 5.1.7 for definitions). Jeffreys had noted this and expressed his preference for the former. As we discuss later, there are several strong reasons for preferring $\pi_1$ to $\pi_2$.

To avoid some of the problems with the Jeffreys prior, Bernardo (1979) and Berger and Bernardo (1989) suggested an important modification of the Jeffreys prior that we take up in Subsection 5.1.10. These priors are called reference priors. In the location-scale case, the reference prior is the right invariant Haar measure. They are considerably more difficult to find than the Jeffreys prior, but explicit formulas are now available for many examples, vide Berger et al. (2006). A comprehensive overview and catalogue of objective priors, up to date as of 1995, is available in Kass and Wasserman (1996). A brief introduction is Ghosh and Mukerjee (1992).

5.1.1 Uniform Distribution and Its Criticisms

The first objective prior ever to be used is the uniform distribution over a bounded interval. A common argument, based on "ignorance," seems to have been that if we know nothing about $\theta$, why should we attach more density to one point than another? The argument given by Bayes, who was the first to use the uniform as an objective prior, is a variation on this. It is indicated in Problem 1. A second argument is that the uniform maximizes the Shannon entropy.
The uniform was also used a lot by Laplace, who seems to have arrived at a Bayesian point of view independently of Bayes, but his argument seems to have been based on a subjective judgment that in his problems the uniform was appropriate. The principle of ignorance has been criticized by Keynes, Fisher, and many others. Essentially, the criticism is based on an invariance argument. Let $\eta = \psi(\theta)$ be a one-to-one function of $\theta$. If we know nothing about $\theta$, then we know nothing about $\eta$ also. So the principle of ignorance applied to $\eta$ will imply that our prior for $\eta$ is uniform (on $\psi(\Theta)$), just as it had led to a uniform prior for $\theta$. But this leads to a contradiction. To see this, suppose $\psi$ is differentiable and $p(\eta) = c$ on $\psi(\Theta)$. Then the prior $p^*(\theta)$ for $\theta$ is
$$p^*(\theta) = p(\eta)\,|\psi'(\theta)| = c\,|\psi'(\theta)|,$$

which is not a constant in general. This argument also leads to an invariance principle. Suppose we have an algorithm that produces noninformative priors for both $\theta$ and $\eta$; then these priors $p^*(\theta)$ and $p(\eta)$ should be connected by the equation

$$p^*(\theta) = p(\eta)\,|\psi'(\theta)|, \quad (5.1)$$
i.e., a noninformative prior should be invariant under one-to-one differentiable transformations.

The second argument in favor of the uniform, based on Shannon entropy, is also flawed. Shannon (1948) derives a measure of entropy in the finite discrete case from certain natural axioms. His entropy is

$$H(p) = -\sum_i p_i \log p_i,$$

which is maximized by the discrete uniform, i.e., at $p = (\frac{1}{k}, \ldots, \frac{1}{k})$. Entropy is a measure of the amount of uncertainty about the outcome of the experiment. A prior that maximizes this will maximize uncertainty, so it is a noninformative prior. Because such a prior should minimize information, we take the negative of entropy as information. This usage differs from Shannon's identification of information and entropy.

Shannon's entropy is a natural measure in the discrete case, and the discrete uniform appears to be the right noninformative prior. The continuous case is an entirely different matter. Shannon himself pointed out that for a density $p$,

$$H(p) = -\int \left(\log p(x)\right) p(x)\,dx$$

is unsatisfactory: clearly it is not derived from axioms, it is not invariant under one-to-one transformations, and, as pointed out by Bernardo, it depends on the measure $\mu(x)\,dx$ with respect to which the density $p(x)$ is taken. Note also that the measure is not non-negative. Just take $\mu(x) = 1$ and take $p(x)$ uniform on $[0, c]$. Then $H(p) > 0$ if and only if $c > 1$, which seems quite arbitrary. Finally, if the density is taken with respect to $\mu(x)\,dx$, then it is easy to verify that the density is $p/\mu$ and

$$H(p) = -\int \left(\log \frac{p(x)}{\mu(x)}\right) p(x)\,dx$$

is maximized at $p = \mu$, i.e., the entropy is maximum at the arbitrary $\mu$. For all these reasons, we do not think $H(p)$ is the right entropy to maximize. A different entropy, also due to Shannon, is explored in Subsection 5.1.3. However, $H(p)$ serves a useful purpose when we have partial information. For details, see Subsection 5.1.12.
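The invariance requirement (5.1) is easy to test numerically. The sketch below (an illustration of mine, not from the text) takes the Bernoulli model, where $I(\theta) = 1/(\theta(1-\theta))$, transforms to the logit $\eta = \log(\theta/(1-\theta))$, and checks that the Jeffreys prior satisfies (5.1); a constant prior in $\eta$ clearly cannot, since $c\,|\psi'(\theta)|$ is not constant in $\theta$.

```python
import math

def jeffreys_theta(th):
    # Jeffreys prior for a Bernoulli theta: I(theta)^(1/2)
    return 1.0 / math.sqrt(th * (1.0 - th))

def jeffreys_eta(eta):
    # Jeffreys prior computed directly in the logit parameterization:
    # the information there is theta*(1 - theta), so take its square root
    th = 1.0 / (1.0 + math.exp(-eta))
    return math.sqrt(th * (1.0 - th))

for th in (0.1, 0.3, 0.5, 0.8):
    eta = math.log(th / (1.0 - th))
    dpsi = 1.0 / (th * (1.0 - th))      # |psi'(theta)| for psi = logit
    lhs = jeffreys_theta(th)            # p*(theta)
    rhs = jeffreys_eta(eta) * dpsi      # p(eta) * |psi'(theta)|, as in (5.1)
    print(th, abs(lhs - rhs) < 1e-9)    # True: (5.1) holds for Jeffreys
```

The check passes identically because $\sqrt{\theta(1-\theta)} \cdot \frac{1}{\theta(1-\theta)} = [\theta(1-\theta)]^{-1/2}$, which is exactly the Jeffreys prior in $\theta$.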
5.1.2 Jeffreys Prior as a Uniform Distribution

This section is based on Section 8.2.1 of Ghosh and Ramamoorthi (2003). We show in this section that if we construct a uniform distribution taking into account the topology, it automatically satisfies the invariance requirement (5.1). Moreover, this uniform distribution is the Jeffreys prior. Problem 2 shows one can construct many other priors that satisfy the invariance requirement. Of course, they are not the uniform distribution in the sense of this section. Being an invariant uniform distribution is more important than just being invariant.

Suppose $\Theta = \mathbb{R}^d$ and $I(\theta) = (I_{ij}(\theta))$ is the $d \times d$ Fisher information matrix. We assume $I(\theta)$ is positive definite for all $\theta$. Rao (1987) had proposed the Riemannian metric $\rho$ related to $I(\theta)$ by

$$\rho(\theta, \theta + d\theta) = \left\{ \sum_{i,j} I_{ij}(\theta)\,d\theta_i\,d\theta_j \right\}^{1/2} (1 + o(1)).$$

It is known, vide Cencov (1982), that this is the unique Riemannian metric that transforms suitably under one-to-one differentiable transformations on $\Theta$. Notice that in general $\Theta$ does not inherit the usual Euclidean metric that goes with the (improper) uniform distribution over $\mathbb{R}^d$.

Fix a $\theta_0$ and let $\psi(\theta)$ be a smooth one-to-one transformation such that the information matrix

$$I_{ij}^\psi = E_\psi\left[ \frac{\partial \log p}{\partial \psi_i}\, \frac{\partial \log p}{\partial \psi_j} \right]$$

is the identity matrix at $\psi_0 = \psi(\theta_0)$. This implies the local geometry in the $\psi$-space around $\psi_0$ is Euclidean, and hence $d\psi$ is a suitable uniform distribution there. If we now lift this back to the $\theta$-space by using the Jacobian of the transformation and the simple fact that, with $I^\psi$ the identity at $\psi_0$,

$$\sum_{k,m} \frac{\partial \psi_k}{\partial \theta_i}\, I_{km}^\psi\, \frac{\partial \psi_m}{\partial \theta_j} = I_{ij}(\theta_0),$$

we get the Jeffreys prior in the $\theta$-space,

$$\left| \det\left( \frac{\partial \psi}{\partial \theta} \right) \right| d\theta = \left\{\det\left[I_{ij}(\theta)\right]\right\}^{1/2} d\theta.$$

A similar method is given in Hartigan (1983, pp. 48, 49). Ghosal et al. (1997) present an alternative construction where one takes a compact subset of the parameter space and approximates this by a finite set of points in the so-called Hellinger metric

$$d(P_\theta, P_{\theta'}) = \left\{ \int \left( \sqrt{p_\theta} - \sqrt{p_{\theta'}} \right)^2 dx \right\}^{1/2},$$
where $p_\theta$ and $p_{\theta'}$ are the densities of $P_\theta$ and $P_{\theta'}$. One then puts a discrete uniform distribution on the approximating finite set of points and lets the degree of approximation tend to zero. The corresponding discrete uniforms then converge weakly to the Jeffreys distribution. The Jeffreys prior was introduced in Jeffreys (1946).

5.1.3 Jeffreys Prior as a Minimizer of Information

As in Subsection 5.1.1, let the Shannon entropy associated with a random variable or vector $Z$ be denoted by

$$H(Z) = H(p) = -E_p\left(\log p(Z)\right),$$

where $p$ is the density (probability function) of $Z$. Let $X = (X_1, X_2, \ldots, X_n)$ have density or probability function $p(x|\theta)$, where $\theta$ has prior density $p(\theta)$. We assume $X_1, X_2, \ldots, X_n$ are i.i.d. and that conditions for asymptotic normality of the posterior $p(\theta|x)$ hold. We have argued earlier that $H(p)$ is not a good measure of entropy, and $-H(p)$ not a good measure of information, if $p$ is a density. Using an idea of Lindley (1956) in the context of design of experiments, Bernardo (1979) suggested that a Kullback-Leibler divergence between prior and posterior, namely,

$$J = E\left[ \int p(\theta|X)\,\log \frac{p(\theta|X)}{p(\theta)}\,d\theta \right],$$

the expectation being over the marginal distribution of $X$, is a better measure of entropy and $-J$ a better measure of information in the prior. To get a feeling for this, notice that if the prior is nearly degenerate at, say, some $\theta_0$, so will be the posterior. This would imply $J$ is nearly zero. On the other hand, if $p(\theta)$ is rather diffuse, $p(\theta|x)$ will differ a lot from $p(\theta)$, at least for moderate or large $n$, because $p(\theta|x)$ would be quite peaked. In fact, $p(\theta|x)$ would be approximately normal with mean $\hat{\theta}$ and variance of the order $O(\frac{1}{n})$. The substantial difference between prior and posterior would be reflected by a large value of $J$. To sum up, $J$ is small when $p$ is nearly degenerate and large when $p$ is diffuse, i.e., $J$ captures how diffuse the prior is. It therefore makes sense to maximize $J$ with respect to the prior.

Bernardo suggested one should not work with the sample size $n$ of the given data and maximize $J$ for this $n$. For one thing, this would be technically forbidding in most cases and, more importantly, the functional $J$ is expected to be a nice function of the prior only asymptotically. We show below how asymptotic maximization is to be done. Berger et al. (1989) have justified to some extent the need to maximize asymptotically. They show that if one maximizes for fixed $n$, maximization may lead to a discrete prior with finitely many jumps, a far cry from a diffuse prior. We also note in passing that the
measure J is a functional depending on the prior but in the given context of a particular experiment with i.i.d. observations having density p{x\0). This is a major difference from the Shannon entropy and suggests information in a prior is a relative concept, relative to a particular experiment. We now return to the question of asymptotic maximization. Fix an increasing sequence of compact cZ-dimensional rectangles Ki whose union is IZ^. For a fixed Ki, we consider only priors pi supported on Ki, and let n -^ oc. We assume posterior normality holds in the Kullback-Leibler sense, i.e.,
where p is the approximating cZ-dimensional normal distribution N{6,1~^{6)/n). For sufl[icient conditions see Clarke and Barron (1990) and Ibragimov and Has'minskii (1981). In view of (5.3), it is enough to consider
Using appropriate results on normal approximation to the posterior distribution, it can be shown that
\[
J(p_i, X) = \left\{-\frac{d}{2}\log(2\pi) - \frac{d}{2} + \frac{d}{2}\log n\right\} + \frac{1}{2}\int \log(\det I(\theta))\, p_i(\theta)\, d\theta - \int (\log p_i(\theta))\, p_i(\theta)\, d\theta + o_p(1).  (5.4)
\]
Here we have used the well-known fact about the exponent of a multivariate normal that
\[
\int -\frac{1}{2}(\theta' - \hat\theta)'\big(I^{-1}(\hat\theta)/n\big)^{-1}(\theta' - \hat\theta)\, \hat p(\theta'|x)\, d\theta' = -\frac{d}{2}.
\]
Hence by (5.3) and (5.4), we may write
\[
J(p_i) = \left\{-\frac{d}{2}\log(2\pi) - \frac{d}{2} + \frac{d}{2}\log n\right\} + \int \log\frac{\sqrt{\det I(\theta)}}{p_i(\theta)}\, p_i(\theta)\, d\theta + o_p(1).
\]
Thus, apart from a constant and a negligible $o_p(1)$ term, $J$ is the functional
\[
\int \log\frac{c_i[\det I(\theta)]^{1/2}}{p_i(\theta)}\, p_i(\theta)\, d\theta - \log c_i,
\]
where $c_i$ is a normalizing constant such that $c_i[\det I(\theta)]^{1/2}$ is a probability density on $K_i$. The functional is maximized by taking
128
5 Choice of Priors for Low-dimensional Parameters
\[
p_i(\theta) = c_i[\det I(\theta)]^{1/2} \ \text{on } K_i, \qquad = 0 \ \text{elsewhere}.  (5.5)
\]
Thus for every $K_i$, the (normalized) Jeffreys prior $p_i(\theta)$ maximizes Bernardo's entropy. In a very weak sense, the $p_i(\theta)$ of (5.5) converge to $p(\theta) = [\det I(\theta)]^{1/2}$, namely, for any two fixed Borel sets $B_1$ and $B_2$ contained in $K_{i_0}$ for some $i_0$,
\[
\frac{\int_{B_1} p_i(\theta)\, d\theta}{\int_{B_2} p_i(\theta)\, d\theta} \to \frac{\int_{B_1} p(\theta)\, d\theta}{\int_{B_2} p(\theta)\, d\theta} \quad \text{as } i \to \infty.  (5.6)
\]
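As a small numerical illustration of the quantity being maximized (our sketch, not from the original text; the grid playing the role of a compact $K_i$ is an ad hoc choice), one can compute the exact mutual information $E[\log(p(\theta|X)/p(\theta))]$ for a binomial experiment under two discretized priors, the uniform prior and the Jeffreys prior $\propto \{\theta(1-\theta)\}^{-1/2}$:

```python
import math

def binom_pmf(n, k, t):
    return math.comb(n, k) * t**k * (1 - t)**(n - k)

def mutual_info(weights, thetas, n):
    # I(theta; X) = sum_theta w(theta) sum_k p(k|theta) log[p(k|theta)/m(k)],
    # where m(k) is the marginal (prior predictive) pmf of the success count
    m = [sum(w * binom_pmf(n, k, t) for w, t in zip(weights, thetas))
         for k in range(n + 1)]
    total = 0.0
    for w, t in zip(weights, thetas):
        for k in range(n + 1):
            pk = binom_pmf(n, k, t)
            if pk > 0.0:
                total += w * pk * math.log(pk / m[k])
    return total

# a grid inside (0, 1), playing the role of a compact K_i
thetas = [0.005 + 0.01 * j for j in range(100)]
uniform = [1.0 / len(thetas)] * len(thetas)
raw = [1.0 / math.sqrt(t * (1 - t)) for t in thetas]   # Jeffreys weights
s = sum(raw)
jeffreys = [r / s for r in raw]

mi_unif = mutual_info(uniform, thetas, 100)
mi_jeff = mutual_info(jeffreys, thetas, 100)
mi_jeff_small = mutual_info(jeffreys, thetas, 10)
```

In this run the Jeffreys prior yields a larger mutual information than the uniform prior at $n = 100$, and for either prior the information grows with $n$, in line with the asymptotic decomposition (5.4).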
The convergence based on (5.6) is very weak. Berger and Bernardo (1992, Equation 2.2.5) suggest convergence based on a metric that compares the posteriors of the proper priors over compact sets $B_i$ with that of the limiting improper prior (whether Jeffreys or reference or other). Examples show lack of convergence in this sense may lead to severe inadmissibility and other problems with inference based on the limiting improper prior. However, checking this kind of convergence is technically difficult in general and is not attempted in this book.

We end this section with a discussion of the measure of entropy or information. In the literature, it is often associated with Shannon's missing information. Shannon (1948) introduced this measure in the context of a noisy channel. Any channel has a source that produces (say, per second) messages $X$ with p.m.f. $p_X(x)$ and entropy
\[
H(X) = -\sum_x p_X(x)\log p_X(x).
\]
A channel will have an output $Y$ (per second) with entropy
\[
H(Y) = -\sum_y p_Y(y)\log p_Y(y).
\]
If the channel is noiseless, then $H(Y) = H(X)$. If the channel is noisy, $Y$ given $X$ is still random. Let $p(x,y)$ denote their joint p.m.f. The joint entropy is
\[
H(X,Y) = -\sum_{x,y} p(x,y)\log p(x,y).
\]
Following Shannon, let $p_x(y) = P\{Y = y | X = x\}$ and consider the conditional entropy of $Y$ given $X$, namely,
\[
H_X(Y) = -\sum_{x,y} p(x,y)\log p_x(y).
\]
Clearly, $H(X,Y) = H(X) + H_X(Y)$ and similarly $H(X,Y) = H(Y) + H_Y(X)$. $H_Y(X)$ is called the equivocation or average ambiguity about the input $X$ given only the output $Y$. It is the uncertainty about the input $X$ that remains given the
output $Y$. By Theorem 10 of Shannon (1948), it is the amount of additional information that must be supplied per unit time at the receiving end to correct the received message. Thus $H_Y(X)$ is the missing information. So the amount of information produced in the channel (per unit time) is $H(X) - H_Y(X)$,
which may be shown to be non-negative by Shannon's basic result
\[
H(X) + H(Y) \ge H(X,Y) = H(Y) + H_Y(X).
\]
In statistical problems, we take $X$ to be $\theta$ and $Y$ to be the observation vector $X$. Then $H(\theta) - H_X(\theta)$ is the same measure as before, namely
\[
J = E\left[\int p(\theta|X)\log\frac{p(\theta|X)}{p(\theta)}\, d\theta\right].
\]
The maximum of $H(X) - H_Y(X)$ with respect to the source, i.e., with respect to $p(x)$, is what Shannon calls the capacity of the channel. Over compact rectangles, the Jeffreys prior is this maximizing distribution for the statistical channel. It is worth pointing out that the Jeffreys prior is a special case of the reference priors of Bernardo (1979). Another point of interest is that as $n \to \infty$, most of Bernardo's information is contained in the constant term of the asymptotic decomposition. This would suggest that for moderately large $n$, the choice of prior is not important. The measure of information used by Bernardo was introduced earlier in Bayesian design of experiments by Lindley (1956). There $p(\theta)$ is fixed but the observable $X$ is not fixed, and the object is to choose a design, i.e., $X$, to maximize the information. Maximization is for the given sample size $n$, not asymptotic as in Bernardo (1979).

5.1.4 Jeffreys Prior as a Probability Matching Prior

One would expect an objective prior with low information to provide inference similar to that based on the uniform prior for $\theta$ in $N(\theta, 1)$. In the case of $N(\theta, 1)$ with a uniform prior for $\theta$, the posterior distribution of the pivotal quantity $\theta - X$, given $X$, is identical with the frequentist distribution of $\theta - X$, given $\theta$. In the general case we will not get exactly the same distribution, but only agreement up to $O_p(n^{-1})$. A precise definition of a probability matching prior for a single parameter is given below. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $p(x|\theta)$, $\theta \in \Theta \subset \mathbb{R}$. Assume regularity conditions needed for expansion of the posterior with the normal $N(\hat\theta, (nI(\hat\theta))^{-1})$
as the leading term. For $0 < \alpha < 1$, choose $\theta_\alpha(X)$ depending on the prior $p(\theta)$ such that
\[
P\{\theta \le \theta_\alpha(X) | X\} = 1 - \alpha + O_p(n^{-1}).  (5.7)
\]
It can be verified that $\theta_\alpha(X) = \hat\theta + O_p(1/\sqrt{n})$. We say $p(\theta)$ is probability matching (to first order) if
\[
P_\theta\{\theta \le \theta_\alpha(X)\} = 1 - \alpha + O(n^{-1})  (5.8)
\]
(uniformly on compact sets of $\theta$). In the normal case with $p(\theta) = $ constant, $\theta_\alpha(X) = \bar X + z_\alpha/\sqrt{n}$, where $P(Z > z_\alpha) = \alpha$, $Z \sim N(0,1)$. We have matched posterior probability and frequentist probability up to $O_p(n^{-1})$. Why one chooses this particular order may be explained as follows. For any prior $p(\theta)$ satisfying some regularity conditions, the two probabilities mentioned above agree up to $O(n^{-1/2})$, so $O(n^{-1/2})$ is too weak. On the other hand, if we strengthen $O(n^{-1})$ to, say, $O(n^{-3/2})$, in general no prior would be probability matching. So $O(n^{-1})$ is just right.

It is instructive to view probability matching in a slightly different but equivalent way. Instead of working with $\theta_\alpha(X)$ one may choose to work with the approximate quantile $\hat\theta + z_\alpha/\sqrt{n}$ and require
\[
P\{\theta \le \hat\theta + z_\alpha/\sqrt{n}\, |\, X\} = P_\theta\{\theta \le \hat\theta + z_\alpha/\sqrt{n}\} + O_p(n^{-1})  (5.9)
\]
under $\theta$ (uniformly on compact sets of $\theta$). Each of these two probabilities has an expansion starting with $(1 - \alpha)$ and having terms decreasing in powers of $n^{-1/2}$. So for probability matching, we must have the same next term in the expansion. In principle one would have to expand the probabilities and set the two second terms equal, leading to
\[
\frac{d}{d\theta}\left\{\frac{p(\theta)}{\sqrt{I(\theta)}}\right\} = 0.  (5.10)
\]
The left-hand side comes from the frequentist probability, the right-hand side from the posterior probability (taking into account the limits of random quantities under $\theta$). There are many common terms in both probabilities that cancel and hence do not need to be calculated. A convenient way of deriving this is through what is called a Bayesian route to frequentist calculations or a shrinkage argument. For details see Ghosh (1994, Chapter 9), Ghosh and Mukerjee (1992), or Datta and Mukerjee (2004). If one tried to match probabilities up to $O(n^{-3/2})$, one would have to match the next terms in the expansion also. This would lead to two differential equations in the prior, and in general they will not have a common solution. Clearly, the unique solution to (5.10) is the Jeffreys prior
\[
p(\theta) \propto \sqrt{I(\theta)}.
\]
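The matching property is easy to check by simulation. As a sketch (our example, not from the text), take $X_1, \ldots, X_n$ i.i.d. exponential with rate $\theta$, so that $I(\theta) = 1/\theta^2$, the Jeffreys prior is $p(\theta) \propto 1/\theta$, and the posterior is Gamma$(n, \text{rate } S)$ with $S = \sum x_i$; here matching is in fact exact because $S\theta$ is a pivot:

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta0, alpha = 20, 2.0, 0.05
n_datasets, n_draws = 2000, 2000

# S = sum of n Exp(rate theta0) observations ~ Gamma(shape n, scale 1/theta0)
S = rng.gamma(shape=n, scale=1.0 / theta0, size=n_datasets)

# posterior under p(theta) ∝ 1/theta is Gamma(shape n, rate S)
post = rng.gamma(shape=n, size=(n_datasets, n_draws)) / S[:, None]

# theta0 lies below the 95% upper posterior quantile iff the posterior
# CDF at theta0 is at most 0.95; average this event over datasets
cdf_at_true = (post <= theta0).mean(axis=1)
coverage = (cdf_at_true <= 1 - alpha).mean()
```

The estimated frequentist coverage of the one-sided $95\%$ credible limit comes out close to $0.95$, as (5.8) requires.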
Equation (5.10) may not hold if $X$ has a discrete distribution. The case where $X$ has a lattice distribution causes the biggest problem in carrying through the previous theory. But the Jeffreys prior may be approximately probability matching in some sense; see Ghosh (1994), Rousseau (2000), and Brown et al. (2001, 2002).

If $d > 1$, in general there is no multivariate probability matching prior (even for the continuous case), vide Ghosh and Mukerjee (1993), Datta (1996). It is proved in Datta (1996) that the Jeffreys prior continues to play an important role. We consider the special case $d = 2$ by way of illustration. For more details, see Datta and Mukerjee (2004). Let $\theta = (\theta_1, \theta_2)$ and suppose we want to match the posterior probability of $\theta_1$ and a corresponding frequentist probability through the following equations:
\[
P\{\theta_1 \le \theta_{1,\alpha}(X) | X\} = 1 - \alpha + O_p(n^{-1}),  (5.11)
\]
\[
P\{\theta_1 \le \theta_{1,\alpha}(X) | \theta_1, \theta_2\} = 1 - \alpha + O(n^{-1}).  (5.12)
\]
Here $\theta_{1,\alpha}(X)$ is the (approximate) $100(1-\alpha)\%$ posterior quantile of $\theta_1$. If $\theta_2$ is orthogonal to $\theta_1$, in the sense that the off-diagonal element $I_{12}(\theta)$ of the information matrix is zero, then the probability matching prior is
\[
p(\theta) = \sqrt{I_{11}(\theta)}\, g(\theta_2),
\]
where $g(\theta_2)$ is an arbitrary function of $\theta_2$.

For a general multiparameter model with a one-dimensional parameter of interest $\theta_1$ and nuisance parameters $\theta_2, \ldots, \theta_d$, the probability matching equation is given by
\[
\sum_{j=1}^{d} \frac{\partial}{\partial\theta_j}\left\{p(\theta)\, I^{j1}(\theta)\,(I^{11}(\theta))^{-1/2}\right\} = 0,  (5.13)
\]
where $I^{-1}(\theta) = ((I^{jk}))$. This is obtained by equating the coefficient of $n^{-1/2}$ in the expansion of the left-hand side of (5.12) to zero; details are given, for example, in Datta and Mukerjee (2004).

Example 5.1. Consider the location-scale model
\[
p(x|\mu, \sigma) = \frac{1}{\sigma} f\!\left(\frac{x - \mu}{\sigma}\right),
\]
$-\infty < \mu < \infty$, $\sigma > 0$, where $f(\cdot)$ is a probability density. Let $\theta_1 = \mu$ and $\theta_2 = \sigma$, i.e., $\mu$ is the parameter of interest. It is easy to verify that $I^{j1} \propto \sigma^2$ for $j = 1, 2$ and hence in view of (5.13) the prior
\[
p(\mu, \sigma) \propto \sigma^{-1}
\]
is probability matching. Similarly, one can also verify that the same prior is probability matching when $\sigma$ is the parameter of interest.

Example 5.2. We now consider a bivariate normal model with means $\mu_1, \mu_2$, variances $\sigma_1^2, \sigma_2^2$, and correlation coefficient $\rho$, all the parameters being unknown. Suppose the parameter of interest is the regression coefficient $\rho\sigma_2/\sigma_1$. We reparameterize as
\[
\theta_1 = \rho\sigma_2/\sigma_1, \quad \theta_2 = \sigma_2^2(1 - \rho^2), \quad \theta_3 = \sigma_1^2, \quad \theta_4 = \mu_1, \quad \theta_5 = \mu_2,
\]
which is an orthogonal parameterization in the sense that $I_{1j}(\theta) = 0$ for $2 \le j \le 5$. Then $I^{1j}(\theta) = 0$ for $2 \le j \le 5$, $I^{11}(\theta) = (I_{11}(\theta))^{-1} = \theta_2/\theta_3$, and the probability matching equation (5.13) reduces to
\[
\frac{\partial}{\partial\theta_1}\left\{p(\theta)\,(\theta_2/\theta_3)^{1/2}\right\} = 0,
\]
i.e., $p(\theta)$ is free of $\theta_1$. Hence the probability matching prior is given by
\[
p(\theta) = (\theta_3/\theta_2)^{1/2}\, g(\theta_2, \ldots, \theta_5),
\]
where $g(\theta_2, \ldots, \theta_5)$ is an arbitrary smooth function of $(\theta_2, \ldots, \theta_5)$. One can also verify that a prior of the form
\[
p^*(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho) = \{\sigma_1^s \sigma_2^s (1 - \rho^2)^t\}^{-1}
\]
with reference to the original parameterization is probability matching if and only if $t = s/2 + 1$ (vide Datta and Mukerjee, 2004, pp. 28, 29).

5.1.5 Conjugate Priors and Mixtures

Let $X_1, \ldots, X_n$ be i.i.d. with a one-parameter exponential family density
\[
p(x|\theta) = \exp\{A(\theta) + \theta\psi(x) + h(x)\}.
\]
We recall from Chapter 1 that $T = \sum_1^n \psi(X_i)$ is a minimal sufficient statistic and $E_\theta(\psi(X_1)) = -A'(\theta)$. The likelihood is
\[
\exp\{nA(\theta) + \theta T\}\, \exp\Big\{\sum_1^n h(x_i)\Big\}.
\]
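The identity $E_\theta(\psi(X_1)) = -A'(\theta)$ and the sufficiency of $T$ are easy to check numerically. A small sketch (ours) for the Bernoulli family written in natural-parameter form, where $\theta = \log\{p/(1-p)\}$, $A(\theta) = -\log(1 + e^\theta)$, $\psi(x) = x$, and $h(x) = 0$:

```python
import math

def A(theta):
    # log-normalizer of the Bernoulli family in natural-parameter form
    return -math.log1p(math.exp(theta))

def loglik(xs, theta):
    # h(x) = 0 here, so the log likelihood is n*A(theta) + theta*T
    return len(xs) * A(theta) + theta * sum(xs)

theta = 0.7
p = math.exp(theta) / (1.0 + math.exp(theta))
mean_psi = p                                     # E_theta psi(X) for Bernoulli
eps = 1e-6
neg_A_prime = -(A(theta + eps) - A(theta - eps)) / (2 * eps)
```

A central difference confirms $E_\theta\psi(X_1) = -A'(\theta)$, and any two samples with the same $T = \sum x_i$ give identical likelihoods.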
To construct a conjugate prior, i.e., a prior leading to posteriors of the same form, start with a so-called noninformative, possibly improper density $\mu(\theta)$.
We will choose $\mu$ to be the uniform or the Jeffreys prior density. Then define a prior density
\[
p(\theta) = c\, e^{mA(\theta) + \theta s}\, \mu(\theta),  (5.14)
\]
where $c$ is the normalizing constant,
\[
c^{-1} = \int_\Theta e^{mA(\theta) + \theta s}\, \mu(\theta)\, d\theta,
\]
if the integral is finite, and arbitrary if $p(\theta)$ is an improper prior. The constants $m$ and $s$ are hyperparameters of the prior. They have to be chosen so that the posterior is proper, i.e.,
\[
\int_\Theta e^{(m+n)A(\theta) + \theta(s+T)}\, \mu(\theta)\, d\theta < \infty.
\]
In this case, the posterior is
\[
p(\theta|x) = c'\, e^{(m+n)A(\theta) + \theta(s+T)}\, \mu(\theta),  (5.15)
\]
i.e., the posterior is of the same form as the prior; only the hyperparameters are different. In other words, the family of priors $p(\theta)$ (vide (5.14)) is closed with respect to the formation of posteriors. The form of the posterior allows us to interpret the hyperparameters in the prior. Assume initially the prior was $\mu$. Take $m$ to be a positive integer and think of a hypothetical sample of size $m$, with hypothetical data $x_1^*, \ldots, x_m^*$ such that $\sum_1^m \psi(x_i^*) = s$. The prior is then the same as a posterior with $\mu$ as prior, based on this hypothetical sample of size $m$. This suggests $m$ is a precision parameter. We expect that the larger the $m$, the stronger is our faith in such quantities as the prior mean. The hyperparameter $s/m$ has a simple interpretation as a prior guess about $E_\theta(\psi) = -A'(\theta)$, which is usually an important parametric function. To prove the statement about $s/m$, we need to assume $\mu(\theta) = $ constant, i.e., $\mu$ is the uniform distribution. We also assume all the integrals appearing below are finite. Let $\Theta = (a, b)$, where $a$ may be $-\infty$ and $b$ may be $\infty$. Integrating by parts,
\[
\int_a^b (-A'(\theta))\, c\, e^{mA(\theta) + \theta s}\, d\theta = -\frac{c}{m}\Big[e^{mA(\theta) + \theta s}\Big]_a^b + \frac{s}{m}\int_a^b c\, e^{mA(\theta) + \theta s}\, d\theta = \frac{s}{m},  (5.16)
\]
provided $e^{mA(\theta) + \theta s} = 0$ at $\theta = a, b$, which is often true if $\Theta$ is the natural parameter space. Diaconis and Ylvisaker (1979) have shown that (5.16) characterizes the prior.
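Property (5.16) and the corresponding posterior calculation below can be verified by direct numerical integration. A sketch (our illustration) for the Bernoulli family in natural-parameter form, with $\mu = $ uniform and hyperparameters $m = 10$, $s = 3$ (the prior is proper for $0 < s < m$):

```python
import math

m, s = 10.0, 3.0        # hyperparameters; prior guess s/m = 0.3
n, T = 20, 14           # hypothetical data: T successes in n trials

grid = [-30.0 + 0.001 * i for i in range(60001)]   # theta grid

def conj_density(theta, mm, ss):
    # unnormalized conjugate prior exp{mm*A(theta) + theta*ss}, mu = uniform,
    # with A(theta) = -log(1 + e^theta) for the Bernoulli family
    return math.exp(-mm * math.log1p(math.exp(theta)) + theta * ss)

def mean_of_p(mm, ss):
    # mean of -A'(theta) = e^theta/(1 + e^theta) under the conjugate density
    num = den = 0.0
    for th in grid:
        w = conj_density(th, mm, ss)
        num += w * math.exp(th) / (1.0 + math.exp(th))
        den += w
    return num / den

prior_guess = mean_of_p(m, s)          # should be s/m             (5.16)
post_mean = mean_of_p(m + n, s + T)    # should be (s+T)/(m+n)     (5.17)
```

The prior mean of $-A'(\theta)$ comes out as $s/m = 0.3$ and the posterior mean as $(s+T)/(m+n) = 17/30$, the weighted average of the prior guess and the MLE $T/n$.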
A similar calculation with the posterior, vide Problem 3, shows the posterior mean of $-A'(\theta)$ is
\[
E(-A'(\theta)|X) = \frac{m}{m+n}\cdot\frac{s}{m} + \frac{n}{m+n}\cdot\frac{T}{n},  (5.17)
\]
i.e., the posterior mean is a weighted mean of the prior guess $s/m$ and the MLE $-A'(\hat\theta) = T/n$, and the weights are proportional to the precision parameter $m$ and the sample size $n$. If $\mu$ is the Jeffreys distribution, the right-hand side of (5.16), i.e., $s/m$, may be interpreted as
\[
\frac{E\big(-A'(\theta)/\sqrt{I(\theta)}\big)}{E\big(1/\sqrt{I(\theta)}\big)},
\]
i.e., $s/m$ is a ratio of two prior guesses, a less compelling interpretation than for $\mu = $ uniform. Somewhat trivially, $\mu$ itself, whether uniform or Jeffreys, is a conjugate prior corresponding to $m = 0$, $s = 0$. Also, in special cases like the binomial and normal, the Jeffreys prior is a conjugate prior with $\mu = $ uniform. We do not know of any general relation connecting the Jeffreys prior and the conjugate priors with $\mu = $ uniform.

Conjugate priors, especially with $\mu = $ uniform, were very popular because the posterior is easy to calculate, the hyperparameters are easy to interpret and hence elicit, and the Bayes estimate for $E_\theta(\psi(X))$ has a nice interpretation. All these facts generalize to the multiparameter exponential family of distributions
\[
p(x|\theta) = \exp\Big\{A(\theta) + \sum_1^d \theta_i\psi_i(x) + h(x)\Big\}.
\]
The conjugate prior now takes the form
\[
p(\theta) = \exp\Big\{mA(\theta) + \sum_1^d \theta_i s_i\Big\}\,\mu(\theta),
\]
where $\mu = $ uniform or Jeffreys, $m$ is the precision parameter, and $s_i/m$ may be interpreted as the prior guess for $E_\theta(\psi_i(X))$ if $\mu = $ uniform. Once again the hyperparameters are easy to elicit. Also, the Bayes estimate for $E_\theta(\psi_i(X))$ is a weighted mean of the prior guess and the MLE. It has been known for some time that all these attractive properties can also be a problem. First of all, note that having a single precision parameter even in the multiparameter case limits the flexibility of conjugate priors; one cannot represent complex prior belief. The representation of the Bayes estimate as a weighted mean can be an embarrassment if there is serious conflict between the prior guess and the MLE. For example, what should one do if the prior guess is 10 and the MLE is 100, or vice versa? In such cases, one should usually give greater weight to the data unless the prior belief is based on reliable
expert opinion, in which case greater weight would be given to the prior guess. In any case, a simple weighted mean seems ridiculous. A related fact is that a conjugate prior usually has a sharp tail, whereas prior knowledge about the tail of a prior is rarely strong. A cure for these problems is to take a mixture of conjugate priors by putting a prior on the hyperparameters. The class of mixtures is quite rich, and given any prior one can in principle construct a mixture of conjugate priors that approximates it. A general result of this sort is proved in Dalal and Hall (1980). A simple heuristic argument is given below. Given any prior, one can approximate it by a discrete probability distribution $(p_1, \ldots, p_k)$ over a finite set of points, say $(\eta_1, \ldots, \eta_k)$, where $\eta_j = E_{\theta_j}(\psi_1(X), \ldots, \psi_d(X))$. This may be considered as a mixture over $k$ degenerate distributions, of which the $j$th puts all the probability on $\eta_j$. By choosing $m$ sufficiently large and taking the prior guess equal to $\eta_j$, one can approximate the $k$ degenerate distributions by $k$ conjugate priors. Finally, mix them by assigning weight $p_j$ to the $j$th conjugate prior. Of course the simplest applications would be to multimodal priors. The posterior for a mixture of conjugate priors can often be calculated numerically by the MCMC (Markov chain Monte Carlo) method. See Chapter 7 for examples. As an example of a mixture we consider the Cauchy prior used in Jeffreys' test for a normal mean $\mu$ with unknown variance $\sigma^2$, described in Section 2.7.2. The conjugate prior for $\mu$ given $\sigma^2$ is normal, and the Cauchy prior used here is a scale mixture of normals $N(0, \tau^{-1})$, where $\tau$ is a mixing Gamma variable. This mixture has a heavier tail than the normal, and use of such a prior means the inference is influenced more by the data than the prior. It is expected that, in general, mixtures of conjugate priors will have this property, but we have not seen any investigation in the literature.
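The mechanics are simplest in the fully conjugate case, where the posterior stays in the mixture family and only the component weights and hyperparameters change. A sketch (our illustration, with an ad hoc two-component Beta mixture prior on a binomial proportion):

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

n, x = 20, 15
components = [(0.5, 2.0, 8.0), (0.5, 8.0, 2.0)]   # (weight, a, b)

log_post_w, post_params = [], []
for w, a, b in components:
    # log marginal likelihood of the data under this component (the
    # binomial coefficient is common to all components and cancels)
    log_post_w.append(math.log(w) + log_beta(a + x, b + n - x) - log_beta(a, b))
    post_params.append((a + x, b + n - x))

# normalize the updated weights via log-sum-exp
mx = max(log_post_w)
norm = sum(math.exp(lw - mx) for lw in log_post_w)
post_w = [math.exp(lw - mx) / norm for lw in log_post_w]

mix_post_mean = sum(w * a / (a + b) for w, (a, b) in zip(post_w, post_params))
```

With 15 successes in 20 trials, nearly all posterior weight moves to the component centered near 0.8; the posterior is again a mixture of Betas, here Beta(17, 13) and Beta(23, 7).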
5.1.6 Invariant Objective Priors for Location-Scale Families

Let $\theta = (\mu, \sigma)$, $-\infty < \mu < \infty$, $\sigma > 0$, and
\[
p(x|\theta) = \frac{1}{\sigma} f\!\left(\frac{x - \mu}{\sigma}\right),  (5.18)
\]
where $f(z)$ is a probability density on $\mathbb{R}$. Let $I_{\mu,\sigma}$ be the $2\times 2$ Fisher information matrix. Then an easy calculation shows
\[
I_{\mu,\sigma} = \frac{1}{\sigma^2}\, C,
\]
where $C$ is a constant matrix depending only on $f$, which implies the Jeffreys prior is proportional to $1/\sigma^2$. We show in Section 5.1.7 that this prior corresponds with the left invariant Haar measure and
\[
p_2(\mu, \sigma) = \frac{1}{\sigma}
\]
corresponds to the right invariant Haar measure. See Dawid (Encyclopedia of Statistics) for other relevant definitions of invariance and their implications. We discuss some desirable properties of $p_2$ in Subsection 5.1.8.

5.1.7 Left and Right Invariant Priors

We now derive objective priors for location-scale families making use of invariance. Consider the linear transformations $g_{a,b}x = a + bx$, $-\infty < a < \infty$, $b > 0$. Then $g_{c,d}\cdot g_{a,b}x = c + d(a + bx) = c + ad + dbx$. We may express this symbolically as $g_{c,d}\cdot g_{a,b} = g_{e,f}$, where $e = c + ad$, $f = db$ specify the multiplication rule. Let $G = \{g_{a,b}:\ -\infty < a < \infty,\ b > 0\}$. Then $G$ is a group. It is convenient to represent $g_{a,b}$ by the vector $(a, b)$ and rewrite the multiplication rule as
\[
(c, d)\cdot(a, b) = (e, f).  (5.19)
\]
Then we may identify $\mathbb{R}\times\mathbb{R}^+$ with $G$ and use both notations freely. We give $\mathbb{R}\times\mathbb{R}^+$ its usual topological or geometric structure. The general theory of (locally compact) groups (see, e.g., Halmos (1950, 1974) or Nachbin (1965)) shows there are two measures $\mu_1$ and $\mu_2$ on $G$ such that $\mu_1$ is left invariant, i.e., for all $g \in G$ and $A \subset G$,
\[
\mu_1(gA) = \mu_1(A),
\]
and $\mu_2$ is right invariant, i.e., for all $g$ and $A$,
\[
\mu_2(Ag) = \mu_2(A),
\]
where $gA = \{gg' : g' \in A\}$ and $Ag = \{g'g : g' \in A\}$. The measures $\mu_1$ and $\mu_2$ are said to be the left invariant and right invariant Haar measures. They are unique up to multiplicative constants. We now proceed to determine them explicitly. Suppose we assume $\mu_2$ has a density $f_2$, i.e., denoting points in $\mathbb{R}\times\mathbb{R}^+$ by $(a_1, a_2)$,
\[
\mu_2(A) = \int_A f_2(a_1, a_2)\, da_1\, da_2,  (5.20)
\]
and assume $f_2$ is a continuous function. With $g = (b_1, b_2)$ and
\[
(c_1, c_2) = (a_1, a_2)\cdot(b_1, b_2), \quad c_1 = a_1 + a_2 b_1, \quad c_2 = a_2 b_2,  (5.21)
\]
one may evaluate $\mu_2(Ag)$ in several ways, e.g.,
\[
\mu_2(Ag) = \int_{Ag} h(c_1, c_2)\, dc_1\, dc_2,  (5.22)
\]
where
\[
h(c_1, c_2) = f_2(a_1, a_2)\, J^{-1},  (5.23)
\]
with
\[
J = \det\frac{\partial(c_1, c_2)}{\partial(a_1, a_2)} = \det\begin{pmatrix} 1 & b_1 \\ 0 & b_2 \end{pmatrix} = b_2.
\]
Also, by definition of $f_2$,
\[
\mu_2(Ag) = \int_{Ag} f_2(c_1, c_2)\, dc_1\, dc_2.  (5.24)
\]
Because (5.22) and (5.24) hold for all $A$ and $f_2$ is continuous, we must have
\[
f_2(c_1, c_2) = h(c_1, c_2),  (5.25)
\]
i.e.,
\[
f_2(c_1, c_2) = f_2(a_1, a_2)\,\frac{1}{b_2}
\]
for all $(a_1, a_2), (b_1, b_2) \in \mathbb{R}\times\mathbb{R}^+$. Set $a_1 = 0$, $a_2 = 1$. Then $f_2(b_1, b_2) = f_2(0, 1)/b_2$, i.e.,
\[
f_2(b_1, b_2) = \text{constant}\cdot\frac{1}{b_2}.  (5.26)
\]
It is easy to verify that $\mu_2$ defined by (5.20) is right invariant if $f_2$ is as in (5.26). One has merely to verify (5.25) and then (5.22). Proceeding in the same way, one can show that the left invariant Haar measure has density
\[
f_1(b_1, b_2) = \text{constant}\cdot\frac{1}{b_2^2}.
\]
We have now to lift these measures to the $(\mu, \sigma)$-space. To do this, we first define an isomorphic group of transformations on the parameter space. Each transformation $g_{a,b}x = a + bx$ on the sample space induces a transformation $\bar g_{a,b}$ defined by
\[
\bar g_{a,b}(\mu, \sigma) = (a + b\mu,\, b\sigma),
\]
i.e., the right-hand side gives the location and scale of the distribution of $g_{a,b}X$, where $X$ has density (5.18). The correspondence $g_{a,b} \to \bar g_{a,b}$ is an isomorphism, i.e., $(\bar g_{a,b})^{-1} = \overline{(g_{a,b})^{-1}}$, and if $g_{a,b}\cdot g_{c,d} = g_{e,f}$ then $\bar g_{a,b}\cdot \bar g_{c,d} = \bar g_{e,f}$.
In view of this, we may write $\bar g_{b_1,b_2}$ simply as $(b_1, b_2)$ and define the group multiplication by (5.19) or (5.21). Consequently, left and right invariant measures for $\bar g$ are the same as before, and
\[
d\mu_1(b_1, b_2) = \text{constant}\cdot\frac{1}{b_2^2}\, db_1\, db_2, \qquad d\mu_2(b_1, b_2) = \text{constant}\cdot\frac{1}{b_2}\, db_1\, db_2.
\]
We now lift these measures on to the $(\mu, \sigma)$-space by setting up a canonical transformation
\[
(\mu, \sigma) \to \bar g_{\mu,\sigma}(0, 1)
\]
that converts a single fixed point in the parameter space into an arbitrary point $(\mu, \sigma)$. Because $(0, 1)$ is fixed, we can think of the above relation as setting up a one-to-one correspondence between $(\mu, \sigma)$ and $\bar g_{\mu,\sigma} = (\mu, \sigma)$. Because this is essentially an identity transformation from $(\mu, \sigma)$ into a group of transformations, given any $\mu^*$ on the space of $\bar g$'s we define $\nu$ on $\Theta = \mathbb{R}\times\mathbb{R}^+$ as
\[
\nu(A) = \mu^*\{(\mu, \sigma):\ \bar g_{\mu,\sigma} \in A\} = \mu^*(A).
\]
Thus
\[
d\nu_1(\mu, \sigma) = d\mu_1(\mu, \sigma) = \frac{1}{\sigma^2}\, d\mu\, d\sigma \quad\text{and}\quad d\nu_2(\mu, \sigma) = d\mu_2(\mu, \sigma) = \frac{1}{\sigma}\, d\mu\, d\sigma
\]
are the left and right invariant priors for $(\mu, \sigma)$.

5.1.8 Properties of the Right Invariant Prior for Location-Scale Families

The right invariant prior density
\[
p_r(\mu, \sigma) = \frac{1}{\sigma}
\]
has many attractive properties. We list some of them below. These properties do not hold in general for the left invariant prior
\[
p_l(\mu, \sigma) = \frac{1}{\sigma^2}.
\]
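The invariance claims of Subsection 5.1.7 can be checked directly by Monte Carlo integration. In the following sketch (ours; the test set $A = [0,1]\times[1,2]$ and group element $g = (3, 2)$ are arbitrary choices), the translated sets are handled through the change of variables with Jacobian $b_2$ for right translation and $b_2^2$ for left translation:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
a1 = rng.uniform(0.0, 1.0, N)
a2 = rng.uniform(1.0, 2.0, N)      # uniform points of A = [0,1] x [1,2]
area = 1.0                         # Lebesgue area of A
b1, b2 = 3.0, 2.0                  # group element g

mu2_A = area * np.mean(1.0 / a2)        # right Haar measure of A
mu1_A = area * np.mean(1.0 / a2**2)     # left Haar measure of A

# right translate Ag: (c1, c2) = (a1 + a2*b1, a2*b2), Jacobian b2
c2 = a2 * b2
mu2_Ag = area * b2 * np.mean(1.0 / c2)
mu1_Ag = area * b2 * np.mean(1.0 / c2**2)

# left translate gA: (d1, d2) = (b1 + b2*a1, b2*a2), Jacobian b2^2
d2 = b2 * a2
mu1_gA = area * b2**2 * np.mean(1.0 / d2**2)
```

One finds $\mu_2(Ag) = \mu_2(A)$ (here both equal $\log 2$) and $\mu_1(gA) = \mu_1(A)$, while $\mu_1(Ag) = \mu_1(A)/2 \ne \mu_1(A)$: the left invariant measure is not right invariant.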
Heath and Sudderth (1978, 1989) show that inference based on the posterior corresponding with $p_r$ is coherent in a sense defined by them. Similar properties have been shown by Eaton and Sudderth (1998, 2004). Dawid et al. (1973) show that the posterior corresponding to $p_r$ is free from the marginalization paradox. In general, the posterior based on the right invariant prior is free from the marginalization paradox if the group is amenable.
Dawid (Encyclopedia of Statistics) also provides counter-examples in case the group is not amenable. Amenability is a technical condition, also called the Hunt-Stein condition, that is needed to prove theorems relating invariance to minimaxity in classical statistics, vide Bondar and Milnes (1981). Datta and Ghosh (1996) show $p_r$ is probability matching in a certain strong sense. A famous classical theorem due to Hunt and Stein, vide Lehmann (1986) or Kiefer (1957), implies that under certain invariance assumptions (that include amenability of the underlying group), the Bayes solution is minimax as well as best among equivariant rules; see also Berger (1985a).

We consider a couple of applications. Suppose we have two location-scale families $\sigma^{-1}f_i((x - \mu)/\sigma)$, $i = 0, 1$. For example, $f_0$ may be standard normal, i.e., $N(0, 1)$, and $f_1$ may be standard Cauchy, i.e.,
\[
f_1(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}.
\]
The observations $X_1, X_2, \ldots, X_n$ are i.i.d. with density belonging to one of these two families. One has to decide which is true. Consider the Bayes rule that accepts $f_1$ if
\[
BF = \frac{\displaystyle\int\!\!\int \prod_{i=1}^n \sigma^{-1} f_1\!\left(\frac{x_i - \mu}{\sigma}\right)\frac{1}{\sigma}\, d\mu\, d\sigma}{\displaystyle\int\!\!\int \prod_{i=1}^n \sigma^{-1} f_0\!\left(\frac{x_i - \mu}{\sigma}\right)\frac{1}{\sigma}\, d\mu\, d\sigma} > c.
\]
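The two marginal likelihoods have no closed form, but for small $n$ they can be approximated on a grid over $(\mu, \sigma)$. A sketch (ours; the grid limits, grid sizes, and seed are ad hoc choices), with data simulated from $f_0 = N(0, 1)$:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=30)                    # data actually from f0

mu = np.linspace(-10.0, 10.0, 201)
sigma = np.exp(np.linspace(np.log(0.05), np.log(20.0), 200))
dmu = mu[1] - mu[0]
dsigma = np.gradient(sigma)                # local spacings of the sigma grid

def log_marginal(log_f):
    # log of  iint prod_i sigma^{-1} f((x_i - mu)/sigma) * sigma^{-1} dmu dsigma
    z = (x[None, None, :] - mu[:, None, None]) / sigma[None, :, None]
    ll = log_f(z).sum(axis=-1) - (len(x) + 1) * np.log(sigma)[None, :]
    top = ll.max()                         # log-sum-exp for numerical stability
    return top + np.log((np.exp(ll - top) * dsigma[None, :]).sum() * dmu)

log_f0 = lambda z: -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)   # normal
log_f1 = lambda z: -np.log(np.pi) - np.log1p(z**2)           # Cauchy

bf = np.exp(log_marginal(log_f1) - log_marginal(log_f0))
```

For this normal sample the Bayes factor comes out well below 1, correctly favoring $f_0$; with heavy-tailed Cauchy data it would typically exceed 1.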
If $c$ is chosen such that the Type 1 and Type 2 error probabilities are the same, then this is a minimax test, i.e., it minimizes the maximum error probability among all tests, where the maximum is over $i = 0, 1$ and $(\mu, \sigma) \in \mathbb{R}\times\mathbb{R}^+$. Suppose we consider the estimation problem of a location parameter with squared error loss. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim f_0(x - \theta)$. Here $p_r = p_l = $ constant. The corresponding Bayes estimate is
\[
E(\theta|X) = \frac{\int_{\mathbb{R}} \theta \prod_{j=1}^n f_0(x_j - \theta)\, d\theta}{\int_{\mathbb{R}} \prod_{j=1}^n f_0(x_j - \theta)\, d\theta},
\]
which is both minimax and best among equivariant estimators, i.e., it minimizes $R(\theta, T(X)) = E_\theta(T(X) - \theta)^2$ among all $T$ satisfying
\[
T(x_1 + a, \ldots, x_n + a) = T(x_1, \ldots, x_n) + a, \quad a \in \mathbb{R}.
\]
A similar result for scale families is explored in Problem 4.

5.1.9 General Group Families

There are interesting statistical problems that are left invariant by groups of transformations other than the location-scale transformations discussed in the preceding subsections. It is of interest then to consider invariant Haar measures for such general groups also. An example follows.
Example 5.3. Suppose $X \sim N_p(\theta, I)$. It is desired to test
\[
H_0: \theta = 0 \quad\text{versus}\quad H_1: \theta \ne 0.
\]
This testing problem is invariant under the group $G_O$ of all orthogonal transformations; i.e., if $H$ is an orthogonal matrix of order $p$, then $g_H X = HX \sim N_p(H\theta, I)$, so that $\bar g_H\theta = H\theta$. Also, $\bar g_H 0 = 0$. Further discussion of this example as well as treatment of invariant tests is taken up in Chapter 6. Discussion of statistical applications involving general groups can be found in sources such as Eaton (1983, 1989), Farrell (1985), and Muirhead (1982).

5.1.10 Reference Priors

In statistical problems that are left invariant by a group of nice transformations, the Jeffreys prior turns out to be the left invariant prior, vide Datta and Ghosh (1996). But for reasons outlined in Subsection 5.1.8, one would prefer the right invariant prior. In all the examples that we have seen, an interesting modification of the Jeffreys prior, introduced in Bernardo (1979) and further refined in Berger and Bernardo (1989, 1992a, and 1992b), leads to the right invariant prior. These priors are called reference priors after Bernardo (1979). A reference prior is simply an objective prior constructed in a particular way, but the term reference prior could, in principle, be applied to any objective prior, because any objective prior is taken as some sort of objective or conventional standard, i.e., a reference point with which one may compare subjective priors to calibrate them. As pointed out by Bernardo (1979), suitably chosen reference priors can be appropriate in high-dimensional problems also. We discuss this in a later chapter. Our presentation in this section is based on Ghosh (1994, Chapter 9) and Ghosh and Ramamoorthi (2003, Chapter 1).

If we consider all the parameters to be of equal importance, we maximize the entropy of Subsection 5.1.3. This leads to the Jeffreys prior. To avoid this, one assumes the parameters have different importance. We consider first the case of $d = 2$ parameters, namely $(\theta_1, \theta_2)$, where we have an ordering of the parameters in order of importance.
Thus $\theta_1$ is supposed to be more important than $\theta_2$. For example, suppose we are considering a random sample from $N(\mu, \sigma^2)$. If our primary interest is in $\mu$, we would take $\theta_1 = \mu$, $\theta_2 = \sigma^2$. If our primary interest is in $\sigma^2$, then $\theta_1 = \sigma^2$, $\theta_2 = \mu$. If our primary interest is in $\mu/\sigma$, we take $\theta_1 = \mu/\sigma$ and $\theta_2 = \mu$ or $\sigma$ or any other function such that $(\theta_1, \theta_2)$ is a one-to-one sufficiently smooth function of $(\mu, \sigma)$. For fixed $\theta_1$, the conditional density $p(\theta_2|\theta_1)$ is one dimensional. Bernardo (1979) recommends setting this equal to the conditional Jeffreys prior
\[
p(\theta_2|\theta_1) \propto \sqrt{I_{22}(\theta_1, \theta_2)}.
\]
Having fixed this, the marginal $p(\theta_1)$ is chosen by maximizing
\[
E\left[\log\frac{p(\theta_1|X)}{p(\theta_1)}\right]
\]
in an asymptotic sense, as explained below. Fix an increasing sequence of closed and bounded, i.e., compact, rectangles $K_{1i}\times K_{2i}$ whose union is $\Theta$. Let $p_i(\theta_2|\theta_1)$ be the conditional Jeffreys prior, restricted to $K_{2i}$, and $p_i(\theta_1)$ a prior supported on $K_{1i}$. As mentioned before,
\[
p_i(\theta_2|\theta_1) = c_i(\theta_1)\sqrt{I_{22}(\theta)},
\]
where $c_i(\theta_1)$ is a normalizing constant such that
\[
\int_{K_{2i}} p_i(\theta_2|\theta_1)\, d\theta_2 = 1.
\]
Then $p_i(\theta_1, \theta_2) = p_i(\theta_1)\, p_i(\theta_2|\theta_1)$ on $K_{1i}\times K_{2i}$ and we consider
\[
J(p_i(\theta_1), X) = E\left[\log\frac{p_i(\theta_1|X)}{p_i(\theta_1)}\right] = E\left[\log\frac{p_i(\theta|X)}{p_i(\theta)}\right] - E\left[\log\frac{p_i(\theta_2|\theta_1, X)}{p_i(\theta_2|\theta_1)}\right]
\]
\[
= J(p_i(\theta_1, \theta_2), X) - \int_{K_{1i}} p_i(\theta_1)\, J(p_i(\theta_2|\theta_1), X)\, d\theta_1,
\]
where for fixed $\theta_1$, $J(p_i(\theta_2|\theta_1), X)$ is the Lindley-Bernardo functional
\[
J = E\left[\log\frac{p_i(\theta_2|\theta_1, X)}{p_i(\theta_2|\theta_1)}\right],
\]
with $p_i(\theta_2|\theta_1)$ being regarded as a conditional prior for $\theta_2$ for fixed $\theta_1$. Applying the asymptotic normal approximation to the first term as well as the second term on the right-hand side, as in Subsection 5.1.3,
\[
J(p_i(\theta_1), X) = K_n + \frac{1}{2}\int_{K_{1i}\times K_{2i}} p_i(\theta)\log\{\det I(\theta)\}\, d\theta - \int_{K_{1i}\times K_{2i}} p_i(\theta)\log p_i(\theta)\, d\theta
\]
\[
\qquad - \int_{K_{1i}} p_i(\theta_1)\int_{K_{2i}} p_i(\theta_2|\theta_1)\Big[\log\sqrt{I_{22}(\theta)} - \log p_i(\theta_2|\theta_1)\Big]\, d\theta_2\, d\theta_1
\]
\[
= K_n + \int_{K_{1i}} p_i(\theta_1)\int_{K_{2i}} p_i(\theta_2|\theta_1)\log\left\{\frac{\det I(\theta)}{I_{22}(\theta)}\right\}^{1/2} d\theta_2\, d\theta_1 - \int_{K_{1i}} p_i(\theta_1)\log p_i(\theta_1)\, d\theta_1,  (5.27)
\]
where $K_n$ is a constant depending on $n$. Let $\psi_1(\theta_1)$ be the geometric mean of $(I^{11}(\theta))^{-1/2}$ with respect to $p_i(\theta_2|\theta_1)$; note that for a $2\times 2$ information matrix $\{\det I(\theta)/I_{22}(\theta)\}^{1/2} = (I^{11}(\theta))^{-1/2}$. Then (5.27) can be written as
\[
J(p_i(\theta_1), X) = K_n + \int_{K_{1i}} p_i(\theta_1)\log\frac{\psi_1(\theta_1)}{p_i(\theta_1)}\, d\theta_1,
\]
which is maximized if
\[
p_i(\theta_1) = c_i'\,\psi_1(\theta_1) \ \text{on } K_{1i}, \qquad = 0 \ \text{outside}.
\]
The product
\[
p_i(\theta) = c_i'\,\psi_1(\theta_1)\, c_i(\theta_1)\,[I_{22}(\theta)]^{1/2}
\]
is the reference prior on $K_{1i}\times K_{2i}$. If we can write this as
\[
p_i(\theta) = d_i\, h(\theta_1, \theta_2) \ \text{on } K_{1i}\times K_{2i}, \qquad = 0 \ \text{elsewhere},
\]
then the reference prior on $\Theta$ may be taken as proportional to $h(\theta_1, \theta_2)$. Clearly, the reference prior depends on the choice of $(\theta_1, \theta_2)$ and the compact sets $K_{1i}, K_{2i}$. The normalization on $K_{1i}\times K_{2i}$ first appeared in Berger and Bernardo (1989). If an improper objective prior is used for fixed $n$, one might run into paradoxes of the kind exhibited by Fraser et al. (1985); see in this connection Berger and Bernardo (1992a). Recently there has been some change in the definition of reference priors, but we understand the final results are similar (Berger et al. (2006)).

The above procedure is based on Ghosh and Mukerjee (1992) and Ghosh (1994). Algebraically it is more convenient to work with $[I(\theta)]^{-1}$ as in Berger and Bernardo (1992b). We illustrate their method for $d = 3$; the general case is handled in the same way. First note the following two facts.

A. Suppose $\theta_1, \theta_2, \ldots, \theta_j$ follow a multivariate normal distribution with dispersion matrix $\Sigma$. Then the conditional distribution of $\theta_j$ given $\theta_1, \theta_2, \ldots, \theta_{j-1}$ is normal with variance equal to the reciprocal of the $(j, j)$-th element of $\Sigma^{-1}$.

B. Following the notation of Berger and Bernardo (1992b), let $S = [I(\theta)]^{-1}$, where $I(\theta)$ is the $d\times d$ Fisher information matrix, and let $S_j$ be the $j\times j$ principal submatrix of $S$. Let $H_j = S_j^{-1}$ and $h_j$ be the $(j, j)$-th element of $H_j$. Then by A, the asymptotic variance of $\theta_j$ given $\theta_1, \theta_2, \ldots, \theta_{j-1}$ is $(h_j)^{-1}/n$. To get some feeling for $h_j$, note that for arbitrary $d$ and $j = d$, $h_d = I_{dd}$, and for arbitrary
$d$ and $j = 1$, $h_1 = 1/I^{11}$.
We now provide a new asymptotic formula for the Lindley-Bernardo information measure, namely,
\[
E\left[\log\frac{p(\theta_j|\theta_1, \ldots, \theta_{j-1}, X)}{p_i(\theta_j|\theta_1, \ldots, \theta_{j-1})}\right] = E\big(\log N(\hat\theta_j(\theta_1, \ldots, \theta_{j-1}),\, h_j^{-1}(\hat\theta)/n)\big) - E\big(\log p_i(\theta_j|\theta_1, \ldots, \theta_{j-1})\big)
\]
(where $\hat\theta_j(\theta_1, \ldots, \theta_{j-1})$ is the MLE for $\theta_j$ given $\theta_1, \ldots, \theta_{j-1}$, and $p_i$ is a prior supported on a compact rectangle $K_{1i}\times K_{2i}\times\cdots\times K_{di}$)
\[
= K_n' + E\left[\int \Big\{\log h_j^{1/2}(\theta) - \log p_i(\theta_j|\theta_1, \ldots, \theta_{j-1})\Big\}\, p(\theta_j|\theta_1, \ldots, \theta_{j-1}, X)\, d\theta_j\right] + o_p(1),  (5.28)
\]
where $K_n'$ is a constant depending on $n$ and
\[
\psi_j(\theta_1, \ldots, \theta_j) = \exp\left\{\int \tfrac{1}{2}\log h_j(\theta)\; p(\theta_{j+1}, \ldots, \theta_d|\theta_1, \ldots, \theta_j)\, d\theta_{j+1}\cdots d\theta_d\right\}
\]
is the geometric mean of $h_j^{1/2}(\theta)$ with respect to $p(\theta_{j+1}, \ldots, \theta_d|\theta_1, \ldots, \theta_j)$. The proof of (5.28) parallels the proof of (5.4). It follows that asymptotically (5.28) is maximized if we set
\[
p(\theta_j|\theta_1, \ldots, \theta_{j-1}) = c_i(\theta_1, \ldots, \theta_{j-1})\,\psi_j(\theta_1, \ldots, \theta_j) \ \text{on } K_{ji}, \qquad = 0 \ \text{elsewhere}.  (5.29)
\]
If the dimension exceeds 2, say $d = 3$, we merely start with a compact rectangle $K_{1i}\times K_{2i}\times K_{3i}$ and set $p_i(\theta_3|\theta_1, \theta_2) = c_i(\theta_1, \theta_2)\sqrt{I_{33}(\theta)}$ on $K_{3i}$. Then we first determine $p_i(\theta_2|\theta_1)$ and then $p_i(\theta_1)$ by applying (5.29) twice. Thus,
\[
p_i(\theta_2|\theta_1) = c_i'(\theta_1)\,\psi_2(\theta_1, \theta_2) \ \text{on } K_{2i},
\]
\[
p_i(\theta_1) = c_i''\,\psi_1(\theta_1) \ \text{on } K_{1i}.
\]
One can also verify that the formulas obtained for $d = 2$ can be rederived in this way. A couple of examples follow.

Example 5.4. Let $X_1, X_2, \ldots, X_n$ be i.i.d. normal with mean $\theta_2$ and variance $\theta_1$, with $\theta_1$ being the parameter of interest. Here
\[
I(\theta) = \mathrm{Diag}\left\{\frac{1}{2\theta_1^2},\ \frac{1}{\theta_1}\right\},
\]
so that $h_2 = I_{22} = 1/\theta_1$ and $h_1 = 1/I^{11} = 1/(2\theta_1^2)$. Thus $p_i(\theta_2|\theta_1) = c_i$ on $K_{2i}$, where $c_i^{-1} = $ volume of $K_{2i}$, and therefore
\[
\psi_1(\theta_1) \propto \frac{1}{\theta_1}.
\]
We then have $p_i(\theta_1, \theta_2) = d_i(1/\theta_1)$ for some constant $d_i$, and the reference prior is taken as
\[
p(\theta_1, \theta_2) \propto \frac{1}{\theta_1}.
\]
This is the right invariant Haar measure, unlike the Jeffreys prior, which is left invariant (see Subsection 5.1.7).

Example 5.5. (Berger and Bernardo, 1992b) Let $(X_1, X_2, X_3)$ follow a multinomial distribution with parameters $(n; \theta_1, \theta_2, \theta_3)$, i.e., $(X_1, X_2, X_3)$ has density
\[
p(x_1, x_2, x_3|\theta_1, \theta_2, \theta_3) = \frac{n!}{x_1!\, x_2!\, x_3!\,(n - x_1 - x_2 - x_3)!}\,\theta_1^{x_1}\theta_2^{x_2}\theta_3^{x_3}(1 - \theta_1 - \theta_2 - \theta_3)^{n - x_1 - x_2 - x_3},
\]
\[
x_i \ge 0,\ i = 1, 2, 3,\ \sum_1^3 x_i \le n; \qquad \theta_i > 0,\ i = 1, 2, 3,\ \sum_1^3 \theta_i < 1.
\]
Here the information matrix is I{0) =nDi8.g{6^\6^\6^'}
+ n ( l -6^-62-
^3)~'l3,
where Diag {ai,02,03} denotes the diagonal matrix with diagonal elements ^15 ^2) tt3 and I3 denotes the 3 x 3 matrix with all elements equal to one. Hence S{9) = ^Dia.g {61,62,03} -
^eO'
and for j = 1,2,3
SJ{e) =
^Dmg{e^,...,eJ}-^ey^e[^^ n
n
With0f,]=(^l,...,%)', Hj{e) = n Diag {^f \ . . . , 6'^} + n{l-6i
Oj)-^l J
and hj{e) = n6-\l
-6i
6j-i){l
-6i
6j)-K
Note that hj{0) depends only on 61,62,. - - ,6j so that
Here we need not restrict to compact rectangles as all the distributions involved have finite mass. As suggested above for the general case, the reference prior can now be obtained as P{03\0i,e2) = n-'O^'^^l p{62\6i) = n-'6-'/\l p{6i) = ^-'6-'/\l
-6,-62-6,- e,r"\
63)-'/^ 62)-'/\
0<63
0 < 01 < 1,
5.1 Different Methods of Construction of Objective Priors i.e., p{0) = n-^e;'^\l - ^i)-1/2^-1/2(1 -01Oi>0, i = 1,2,3, E ? ^ i < l -
^ 2 ) - ' / ' ^ 3 " ' / ' ( l -0,-02-
145
Os)-'/\
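The stepwise structure of this reference prior makes it easy to simulate from: θ_1 is Beta(1/2, 1/2), and, given the earlier coordinates, each remaining coordinate is a Beta(1/2, 1/2) variable scaled by the remaining mass. A minimal numerical sketch (our own illustration, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_reference_prior(size):
    """Draw (theta_1, theta_2, theta_3) from the stepwise reference prior:
    theta_1 ~ Beta(1/2, 1/2); given theta_1, ..., theta_{j-1}, theta_j is a
    Beta(1/2, 1/2) variable scaled by the remaining mass 1 - theta_1 - ... - theta_{j-1}."""
    u = rng.beta(0.5, 0.5, size=(size, 3))
    t1 = u[:, 0]
    t2 = (1 - t1) * u[:, 1]
    t3 = (1 - t1 - t2) * u[:, 2]
    return t1, t2, t3

t1, t2, t3 = sample_reference_prior(200_000)
print(t1.mean(), t2.mean())   # approx 0.5 and 0.25 (Beta(1/2,1/2) mean times remaining mass)
```

The draws stay in the simplex by construction, which reflects the finite total mass of each conditional noted above.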
As remarked by Berger and Bernardo (1992b), inferences about θ_1 based on the above prior depend only on x_1 and not on the frequencies of other cells. This is not the case with standard noninformative priors such as the Jeffreys prior. See in this context Berger and Bernardo (1992b, Section 3.4).

5.1.11 Reference Priors Without Entropy Maximization

Construction of reference priors involves two interesting ideas. The first is the new measure of information in a prior obtained by comparing it with the posterior. The second idea is the step-by-step algorithm based on arranging the parameters in ascending order of importance. The first throws light on why an objective prior would depend on the model for the likelihood. But it is the step-by-step algorithm that seems to help more in removing the problems associated with the Jeffreys prior. We explore below what kind of priors would emerge if we follow only part of the Berger-Bernardo algorithm (of Berger and Bernardo (1992a)).

We illustrate with two (one-dimensional) parameters θ_1 and θ_2, of which θ_1 is supposed to be more important. This would be interpreted as meaning that the marginal prior for θ_1 is more important than the marginal prior for θ_2. Then the prior is to be written as p(θ_1) p(θ_2 | θ_1), with

p(θ_2 | θ_1) ∝ √I_22(θ),

and p(θ_1) is to be determined suitably. Suppose we determine p(θ_1) from the probability matching conditions

P{ θ_1 ≤ θ_{1,α}(X) | X } = 1 − α + O_p(n^{−1}),   (5.30)
∫ P{ θ_1 ≤ θ_{1,α}(X) | θ_1, θ_2 } p(θ_2 | θ_1) dθ_2 = 1 − α + O(n^{−1}).   (5.31)

Here (5.30) defines the Bayesian quantile θ_{1,α} of θ_1, which depends on the data X, and (5.31) requires that the posterior probability on the left-hand side of (5.30) match the frequentist probability averaged out with respect to p(θ_2 | θ_1). Under the assumption that θ_1 and θ_2 are orthogonal, i.e., I_12(θ) = I_21(θ) = 0, one gets (Ghosh (1994))

p_1(θ_1) = constant × ( ∫_{K_{2i}} I_11^{−1/2}(θ) p(θ_2 | θ_1) dθ_2 )^{−1},   (5.32)

where K_{1i} × K_{2i} is a sequence of increasing bounded rectangles whose union is Θ_1 × Θ_2. This equation shows the marginal of θ_1 is a (normalized) harmonic mean of √I_11, which equals the harmonic mean of 1/√I^{11}.
What about a choice of p_1(θ_1) equal to the geometric or arithmetic mean? The Berger-Bernardo reference prior is the geometric mean. The marginal prior is the arithmetic mean if we follow the approach of taking weak limits of suitable discrete uniforms, vide Ghosal et al. (1997). In many interesting cases involving invariance, for example, in the case of location-scale families, all three approaches lead to the right invariant Haar measure as the joint prior.

5.1.12 Objective Priors with Partial Information

Suppose we have chosen our favorite so-called noninformative prior, say p_0. How can we utilize available prior information on a few moments of θ? Let p be an arbitrary prior satisfying the following constraints based on the available information:
∫_Θ g_j(θ) p(θ) dθ = A_j,   j = 1, 2, ..., k.   (5.33)
If g_j(θ) = θ_1^{j_1} ... θ_p^{j_p}, we have the usual moments of θ. We fix increasing compact rectangles K_i with union equal to Θ and, among priors p_i supported on K_i and satisfying (5.33), minimize the Kullback-Leibler number

K(p_i, p_0) = ∫_{K_i} p_i(θ) log( p_i(θ)/p_0(θ) ) dθ.

The minimizing prior is

p_i*(θ) = constant × exp{ Σ_j λ_j g_j(θ) } p_0(θ),

where the λ_j's are hyperparameters to be chosen so as to satisfy (5.33). This can be proved by noting that for all priors p_i satisfying (5.33),

K(p_i, p_0) = ∫_{K_i} p_i(θ) log( p_i*(θ)/p_0(θ) ) dθ + K(p_i, p_i*) = constant + Σ_j λ_j A_j + K(p_i, p_i*),

which is minimized at p_i = p_i*.

If instead of moments we know the values of some quantiles for (a one-dimensional) θ, or more generally the prior probabilities α_j of some disjoint subsets B_j of Θ, then it may be assumed that ∪B_j = Θ, and one would use the prior given by

p(θ) = Σ_j α_j p_0(θ) I(θ ∈ B_j) / P_0(B_j),

where P_0(B_j) = ∫_{B_j} p_0(θ) dθ.
Sun and Berger (1998) have shown how reference priors can be constructed when partial information is available.
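The moment-constrained construction above is straightforward to carry out numerically: tilt the base prior by exp(λ g(θ)) and solve for the hyperparameter λ so that the constraint holds. A small sketch (the uniform base prior, single mean constraint, and grid are our own illustrative choices):

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 10_001)   # grid on Theta = [0, 1]
dx = theta[1] - theta[0]

def tilted_mean(lam):
    """Mean of p*(theta) proportional to exp(lam * theta) * p_0(theta), p_0 uniform."""
    w = np.exp(lam * theta)
    w = w / (w.sum() * dx)               # normalize the tilted density on the grid
    return (theta * w).sum() * dx

# Solve the constraint E_{p*}[theta] = 0.7 for lam by bisection (mean is increasing in lam).
target, lo, hi = 0.7, 0.0, 50.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if tilted_mean(mid) < target:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)
print(lam, tilted_mean(lam))             # tilted prior now satisfies the moment constraint
```

With several constraints one would solve the analogous system for (λ_1, ..., λ_k) jointly.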
5.2 Discussion of Objective Priors

This section is based on Ghosh and Samanta (2002b). We begin by listing some of the common criticisms of objective priors. We refer to them below as "noninformative priors".

1. Noninformative priors do not exist. How can one define them?
2. Objective Bayesian analysis is ad hoc and hence no better than the ad hoc paradigms subjective Bayesian analysis tries to replace.
3. One should try to use prior information rather than waste time trying to find noninformative priors.
4. There are too many noninformative priors for a problem. Which one is to be used?
5. Noninformative priors are typically improper. Improper priors do not make sense as quantification of belief. For example, consider the uniform distribution on the real line. Let L be any large positive number. Then P{−L < θ < L}/P{θ ∉ (−L, L)} = 0 for all L, but for a sufficiently large L, depending on the problem, we would be pretty sure that −L < θ < L.
The posteriors obtained from the common objective priors will usually be very similar even with a modest amount of data. Where this is not the case, one would have to undertake a robustness analysis restricted to the class of chosen objective priors. This seems eminently doable. Even though objective priors are usually improper, we only work with them when the posterior is proper. Once again we urge the reader to go over the general comments. We would only add that many improper objective priors lead to the same posteriors as coherent, proper, finitely additive priors. This is somewhat technical, but the interested reader can consult Heath and Sudderth (1978, 1989).

Point 6 is well taken care of by the Jeffreys prior. Also, in a given problem not all one-to-one transformations are allowed. For example, if the coordinates of θ are in a decreasing order of importance, then we need only consider η = (η_1, ..., η_d) such that η_j is a one-to-one continuously differentiable function of θ_j. There are invariance theorems for reference and probability matching priors in such cases; see Datta and Ghosh (1996).

We have discussed Point 7 earlier in the context of the entropy of Bernardo and Lindley. This measure depends on the experiment through the model for the likelihood. Generally, information in a prior cannot be defined except in the context of an experiment. Hence it is natural that a low-information prior will not be the same for all experiments. Because a model is a mathematical description of an experiment, a low-information prior will depend on the model.

We now turn to the last point. Coherence in the sense of Heath and Sudderth (1978) is defined in the context of a model. Hence the fact that an objective prior depends on a model will not automatically lead to incoherence. However, care will be needed. As we have noted earlier, a right Haar prior for location-scale families ensures coherent inference, but in general a left Haar prior will not. The impact on the likelihood principle is more tricky.
The likelihood principle in its strict sense is violated because the prior, and hence the posterior, depends on the experiment through the form of the likelihood function. However, for a fixed experiment, a decision based on the posterior and the corresponding posterior risk depend only on the likelihood function. We pursue this a bit more below. Inference based on objective priors does violate the stopping rule principle, which is closely related to the likelihood principle. In particular, in Example 1.2 of Carlin and Louis (1996), originally suggested by Lindley and Phillips (1976), one would get different answers according to a binomial or a negative binomial model. This example is discussed in Chapter 6. To sum up, we do seem to have good answers to most of the criticisms but have to live with some violations of the likelihood and stopping rule principles.
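The binomial versus negative binomial violation can be made concrete. With s successes and f = n − s failures the likelihood is proportional to θ^s (1 − θ)^f under both stopping rules, but the Jeffreys priors differ: Beta(1/2, 1/2) for the binomial, and proportional to θ^{−1/2}(1 − θ)^{−1} for the negative binomial model that samples until the f-th failure. The posteriors therefore differ. A sketch with illustrative numbers s = 9, n = 12:

```python
from fractions import Fraction

s, n = 9, 12            # 9 successes in 12 trials (illustrative numbers)
f = n - s               # failures

# Binomial model: Jeffreys prior Beta(1/2, 1/2) -> posterior Beta(s + 1/2, f + 1/2)
mean_binom = Fraction(2 * s + 1, 2 * n + 2)

# Negative binomial model (sample until f failures): Jeffreys prior
# proportional to theta^(-1/2) (1 - theta)^(-1) -> posterior Beta(s + 1/2, f)
mean_negbin = Fraction(2 * s + 1, 2 * s + 1 + 2 * f)

print(float(mean_binom), float(mean_negbin))   # same likelihood, different posterior means
```

The two posterior means (19/26 versus 19/25 here) differ even though the observed likelihoods are proportional, which is exactly the stopping rule violation described above.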
5.3 Exchangeability

A sequence of real valued random variables {X_i} is exchangeable if for all n, all distinct suffixes {i_1, i_2, ..., i_n}, and all B_1, B_2, ..., B_n ⊂ ℝ,

P{ X_{i_1} ∈ B_1, X_{i_2} ∈ B_2, ..., X_{i_n} ∈ B_n } = P{ X_1 ∈ B_1, X_2 ∈ B_2, ..., X_n ∈ B_n }.

In many cases, a statistician will be ready to assume exchangeability as a matter of subjective judgment. Consider now the special case where each X_i assumes only the values 0 and 1. A famous theorem of de Finetti then shows that the subjective judgment of exchangeability leads to both a model for the likelihood and a prior.

Theorem 5.6. (de Finetti) If the X_i's are exchangeable and assume only values 0 and 1, then there exists a distribution Π on (0, 1) such that the joint distribution of X_1, ..., X_n can be represented as

P(X_1 = x_1, ..., X_n = x_n) = ∫_0^1 ∏_{i=1}^n θ^{x_i} (1 − θ)^{1 − x_i} dΠ(θ).

This means the X_i's can be thought of as i.i.d. B(1, θ) variables, given θ, where θ has the distribution Π. For a proof of this theorem and other results of this kind, see, for example, Bernardo and Smith (1994, Chapter 4). The prior distribution Π can be determined in principle from the joint distribution of all the X_i's, but one would not know the joint distribution of all the X_i's. If one wants to actually elicit Π, one could ask oneself what is one's subjective predictive probability P{X_{n+1} = 1 | X_1, X_2, ..., X_n}. Suppose the subjective predictive probability is (α + Σ_{i=1}^n X_i)/(α + β + n), where α > 0, β > 0. Then the prior for θ is the Beta distribution with hyperparameters α and β. Nonparametric elicitations of this kind are considered in Fortini et al. (2000).
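The representation and the Beta predictive rule can be checked by simulation: draw θ from a Beta(α, β), generate a Bernoulli string given θ, and compare the empirical predictive probability with (α + Σ x_i)/(α + β + n). The hyperparameter values and conditioning pattern below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 3.0                          # illustrative Beta hyperparameters
m = 500_000
theta = rng.beta(a, b, size=m)           # theta ~ Beta(a, b)
# Four exchangeable 0-1 draws, i.i.d. B(1, theta) given theta:
x = (rng.random((m, 4)) < theta[:, None]).astype(int)

# Condition on the first three observations being (1, 0, 1); n = 3, sum x_i = 2.
match = (x[:, 0] == 1) & (x[:, 1] == 0) & (x[:, 2] == 1)
empirical = x[match, 3].mean()
formula = (a + 2) / (a + b + 3)          # (alpha + sum x_i) / (alpha + beta + n)
print(empirical, formula)                # both close to 0.5
```

The agreement illustrates how eliciting predictive probabilities of this form pins down the Beta hyperparameters.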
5.4 Elicitation of Hyperparameters for Prior

Although a full prior is not easy to elicit, one may be able to elicit hyperparameters in an assumed model for a prior. We discuss this problem somewhat informally in the context of two examples, a univariate normal and a bivariate normal likelihood.

Suppose X_1, X_2, ..., X_n are i.i.d. N(μ, σ²) and we assume a normal prior for μ given σ² and an inverse Gamma prior for σ². How do we choose the hyperparameters? We think of a scenario where a statistician is helping a subject matter expert to articulate his judgment. Let p(μ|σ²) be normal with mean η and variance c²σ², where c is a constant. The hyperparameter η is a prior guess for the mean of X̄. The statistician has to make it clear that what is involved is not a guess about μ itself but a guess about
the mean of μ. So the expert has to think of the mean μ itself as uncertain and subject to change. Assuming that the expert can come up with a number for η, one may try to elicit a value for c in two different ways. If the expert can assign a range of variation for μ (given σ) and is pretty sure η will be in this range, one may equate this range to η ± 3cσ; η would be at the center of the range, and the distance of the upper or lower limit from the center would be 3cσ. To check consistency, one would like to elicit the range for several values of σ and see if one gets nearly the same c.

In the second method for eliciting c, one notes that c² determines the amount of shrinking of X̄ towards the prior guess η in the posterior mean

E(μ | x, σ²) = [ c²/(c² + 1/n) ] x̄ + [ (1/n)/(c² + 1/n) ] η

(vide Example 2.1 of Section 2.2). Thus if c² = 1/n, x̄ and η have equal weights. If c² = 5/n, the weight of η diminishes to one fifth of the weight for x̄. In most problems one would not have stronger prior belief. We now discuss elicitation of hyperparameters for the inverse Gamma prior for σ², given by
p(σ²) = e^{−1/(βσ²)} / ( Γ(α) β^α (σ²)^{α+1} ).

The prior guess for σ² is [β(α − 1)]^{−1}. This is likely to be more difficult to elicit than the prior guess η about μ. The shape parameter α can be elicited by deciding how much to shrink the Bayes estimate of σ² towards the prior guess [β(α − 1)]^{−1}. Note that the Bayes estimate has the representation

E(σ² | X) = [ (α − 1)/(α − 1 + n/2) ] · 1/(β(α − 1)) + [ ((n − 1)/2)/(α − 1 + n/2) ] s² + n(X̄ − η)² / { (2α + n − 2)(1 + nc²) },

where (n − 1)s² = Σ_i (X_i − X̄)². In order to avoid dependence on X̄, one may want to do the elicitation based on

E(σ² | s²) = [ (α − 1)/(α − 1 + (n − 1)/2) ] · 1/(β(α − 1)) + [ ((n − 1)/2)/(α − 1 + (n − 1)/2) ] s².
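This elicitation can be sketched numerically (the sample size, prior guess, and desired shrinkage weight below are our own illustrative choices): fix the weight w that E(σ² | s²) puts on the prior guess, solve (α − 1)/(α − 1 + (n − 1)/2) = w for α, and then obtain β from the prior guess [β(α − 1)]^{−1}.

```python
n = 20
prior_guess = 4.0     # elicited prior guess for sigma^2, i.e. 1/(beta*(alpha - 1))
w = 1.0 / 6.0         # desired weight on the prior guess in E(sigma^2 | s^2)

# (alpha - 1)/(alpha - 1 + (n - 1)/2) = w  =>  alpha - 1 = (w/(1 - w)) * (n - 1)/2
alpha = 1.0 + w / (1.0 - w) * (n - 1) / 2.0
beta = 1.0 / (prior_guess * (alpha - 1.0))

def post_mean_sigma2(s2):
    """E(sigma^2 | s^2): weighted average of the prior guess and s^2."""
    denom = alpha - 1.0 + (n - 1.0) / 2.0
    return ((alpha - 1.0) / denom) * prior_guess + (((n - 1.0) / 2.0) / denom) * s2

print(alpha, beta, post_mean_sigma2(3.0))   # weight 1/6 on the guess: (1/6)*4 + (5/6)*3
```

With n = 20 and w = 1/6 this gives α = 2.9 and β ≈ 0.132, and the posterior mean is the advertised weighted average.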
The elicitation of the prior for μ and σ, especially the means and variances of the priors, may be based on examining related similar past data.

We turn now to i.i.d. bivariate normal data (X_i, Y_i), i = 1, 2, ..., n. There are now five parameters, (μ_X, σ_X²), (μ_Y, σ_Y²), and the correlation coefficient ρ. Also,

E(Y | X = x) = β_0 + β_1 x,  Var(Y | X = x) = σ²,

where σ² = σ_Y²(1 − ρ²), β_1 = ρσ_Y/σ_X, β_0 = μ_Y − (ρσ_Y/σ_X)μ_X. One could reparameterize in various ways. We adopt a parameterization that is appropriate when prediction of Y given X is a primary concern. We consider (μ_X, σ_X²) as parameters for the marginal distribution of X, which may
be handled as in the univariate case. We then consider the three parameters (σ², β_0, β_1) of the conditional distribution of Y given X = x. The joint density may be written as a product of the marginal density of X, namely N(μ_X, σ_X²), and the conditional density of Y given X = x, namely N(β_0 + β_1 x, σ²). The full likelihood is

∏_{i=1}^n (2πσ_X²)^{−1/2} exp{ −(x_i − μ_X)²/(2σ_X²) } × (2πσ²)^{−1/2} exp{ −(y_i − β_0 − β_1 x_i)²/(2σ²) }.

It is convenient to rewrite β_0 + β_1 x_i as γ_0 + γ_1(x_i − x̄), with γ_1 = β_1, γ_0 = β_0 + γ_1 x̄. Suppose we think of the x_i's as fixed. We concentrate on the second factor to find its conjugate prior. The conditional likelihood given the x_i's is

(2πσ²)^{−n/2} exp{ −(1/(2σ²)) [ Σ_i (y_i − γ̂_0 − γ̂_1(x_i − x̄))² + n(γ_0 − γ̂_0)² + s_xx(γ_1 − γ̂_1)² ] },

where γ̂_0 = ȳ, γ̂_1 = Σ_i (y_i − ȳ)(x_i − x̄)/s_xx, and s_xx = Σ_i (x_i − x̄)². Clearly, the conjugate prior for σ² is an inverse Gamma, and the conjugate priors for γ_0 and γ_1 are independent normals whose parameters may be elicited along the same lines as those for the univariate normal, except that more care is needed. The statistician could fix several values of x and invite the expert to guess the corresponding values of y. A straight line through the scatter plot will yield a prior guess on the linear relation between x and y. The slope of this line may be taken as the prior mean for β_1 and the intercept as the prior mean for β_0. These would provide prior means for γ_0, γ_1 (for the given values of x_i in the present data). The prior variances can be determined by fixing how much shrinkage towards a prior mean is desired. Suppose that the prior distribution of σ² has the density
p(σ²) = e^{−1/(bσ²)} / ( Γ(a) b^a (σ²)^{a+1} ).

Given σ², the prior distributions for γ_0 and γ_1 are taken to be independent normals N(μ_0, c_0²σ²) and N(μ_1, c_1²σ²), respectively. The marginal posterior distributions of γ_0 and γ_1 with these priors are Student's t with posterior means given by

E(γ_0 | x, y) = [ n/(n + c_0^{−2}) ] γ̂_0 + [ c_0^{−2}/(n + c_0^{−2}) ] μ_0,   (5.34)
E(γ_1 | x, y) = [ s_xx/(s_xx + c_1^{−2}) ] γ̂_1 + [ c_1^{−2}/(s_xx + c_1^{−2}) ] μ_1.   (5.35)

As indicated above, these expressions may be used to elicit the values of c_0 and c_1. For elicitation of the shape parameter a of the prior distribution of σ², we may use a similar representation of E(σ² | x, y). Note that the statistics S² = Σ_i (y_i − γ̂_0 − γ̂_1(x_i − x̄))², γ̂_0, and γ̂_1 are jointly sufficient for (σ², γ_0, γ_1).
Table 5.1. Data on Water Flow (in 100 Cubic Feet per Second) at Two Points (Libby and Newgate) in January During 1931-43 in Kootenai River (Ezekiel and Fox, 1959)

Year         1931  32    33    34    35    36    37    38    39    40    41    42    43
Newgate (y)  19.7  18.0  26.1  44.9  26.1  19.9  15.7  27.6  24.9  23.4  23.1  31.3  23.8
Libby (x)    27.1  20.9  33.4  77.6  37.0  21.6  17.6  35.1  32.6  26.0  27.6  38.7  27.8
As in the univariate normal case considered above, we do the elicitation based on

E(σ² | x, y) = [ (a − 1)/((n/2) + a − 2) ] · 1/(b(a − 1)) + [ ((n − 2)/2)/((n/2) + a − 2) ] σ̂²,   (5.36)

where σ̂² = S²/(n − 2) is the classical estimate of σ². We illustrate with an example.

Example 5.7. Consider the bivariate data of Table 5.1 (Ezekiel and Fox, 1959). This relates to water flow in the Kootenai River at two points, Newgate (British Columbia, Canada) and Libby (Montana, USA). A dam was being planned on the river at Newgate, B.C., where it crossed the Canadian border. The question was how the flow at Newgate could be estimated from that at Libby.

Consider the above setup for this set of bivariate data. Calculations yield x̄ = 32.5385, s_xx = 2693.1510, γ̂_0 = 24.9615, γ̂_1 = 0.4748, and σ̂² = 3.186, so that the classical (least squares) regression line is

y = 24.9615 + 0.4748(x − x̄), i.e., y = 9.5122 + 0.4748x.

Suppose now that we have similar past data D for a number of years, say, the previous decade, for which the fitted regression line is given by y = 10.3254 + 0.4809x, with an estimate of error variance (σ²) of 3.9363. We don't present this set of "historical data" here, but a scatter plot is shown in Figure 5.1. As suggested above, we may take the prior means for β_0, β_1, and σ² to be 10.3254, 0.4809, and 3.9363, respectively. We, however, take these to be 10.0, 0.5, and 4.0, as these are considered only as prior guesses. This gives μ_0 = 10 + 0.5 x̄ = 26.2693 and μ_1 = 0.5. Given that the historical data set D was obtained in the immediate past (before 1931), we have considerable faith in our prior guess, and as indicated above, we set the weights for the prior means μ_0, μ_1, and 1/(b(a − 1)) of γ_0, γ_1, and σ² in (5.34), (5.35), and (5.36) equal to 1/6, so that the ratio of the weights for the prior estimate and the classical estimate is 1 : 5 in each case. Thus we set

c_0^{−2}/(n + c_0^{−2}) = c_1^{−2}/(s_xx + c_1^{−2}) = (a − 1)/((n/2) + a − 2) = 1/6,
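The numbers in this example can be reproduced directly from Table 5.1 and formulas (5.34)-(5.36); a sketch:

```python
import numpy as np

y = np.array([19.7, 18.0, 26.1, 44.9, 26.1, 19.9, 15.7,
              27.6, 24.9, 23.4, 23.1, 31.3, 23.8])   # Newgate flow
x = np.array([27.1, 20.9, 33.4, 77.6, 37.0, 21.6, 17.6,
              35.1, 32.6, 26.0, 27.6, 38.7, 27.8])   # Libby flow
n = len(x)
xbar = x.mean()
sxx = ((x - xbar) ** 2).sum()
g0_hat = y.mean()                                    # gamma_0 hat = ybar
g1_hat = ((y - y.mean()) * (x - xbar)).sum() / sxx   # gamma_1 hat
S2 = ((y - g0_hat - g1_hat * (x - xbar)) ** 2).sum()
sig2_hat = S2 / (n - 2)                              # classical estimate of sigma^2

# Elicited prior means and shrinkage: weight 1/6 on each prior mean,
# which corresponds to c_0^{-2} = n/5, c_1^{-2} = sxx/5 in (5.34)-(5.35).
mu0, mu1, sig2_guess = 10.0 + 0.5 * xbar, 0.5, 4.0
w = 1.0 / 6.0
g0_bayes = (1 - w) * g0_hat + w * mu0
g1_bayes = (1 - w) * g1_hat + w * mu1
sig2_bayes = (1 - w) * sig2_hat + w * sig2_guess
print(g0_bayes, g1_bayes, sig2_bayes)   # approx 25.18, 0.479, 3.32
```

The output matches the Bayes estimates 25.1795, 0.4790, and (up to rounding) 3.3219 reported in the text.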
All of the above would be inapplicable if x and y have the same footing and the object is estimation of the parameters in the model rather than prediction of unobserved y's for given x's. In this case, one would write the bivariate normal likelihood

∏_{i=1}^n [ 2πσ_Xσ_Y√(1 − ρ²) ]^{−1} exp{ −(1/(2(1 − ρ²))) [ (x_i − μ_X)²/σ_X² + (y_i − μ_Y)²/σ_Y² − 2ρ(x_i − μ_X)(y_i − μ_Y)/(σ_Xσ_Y) ] }.

The conjugate prior is a product of a bivariate normal and an inverse-Wishart distribution.

Notice that we have discussed elicitation of hyperparameters for a conjugate prior for several parameters. Instead of substituting these elicited values in the conjugate prior, we could treat the hyperparameters as having
Fig. 5.1. Scatter plots and regression lines for the Kootenai River data.

a prior distribution over a set of values around the elicited numbers. The prior distribution for the hyperparameters could be a uniform distribution on the set of values around the elicited numbers. This would be a hierarchical prior. An alternative would be to use several conjugate priors with different hyperparameter values from the set around the elicited numbers and check for robustness.

Elicitation of hyperparameters of a conjugate prior for a linear model is treated in Kadane et al. (1980), Garthwaite and Dickey (1988, 1992), etc. A recent review is Garthwaite et al. (2005). Garthwaite and Dickey (1988) observe that the prior variance-covariance matrix of the regression coefficients, especially the off-diagonal elements of the matrix, is the most difficult to elicit. We have assumed the off-diagonal elements are zero, a common simplifying assumption, and determined the diagonal elements by eliciting how much
shrinkage towards the prior is sought in the Bayes estimates of the means of the regression coefficients. Garthwaite and Dickey (1988) indicate an indirect way of eliciting the variance-covariance matrix.
5.5 A New Objective Bayes Methodology Using Correlation

As we have already seen, there are many approaches for deriving objective, reference priors and also for conducting default Bayesian analysis. One such approach that relies on some new developments is discussed here. Using the Pearson correlation coefficient in a rather different way, DasGupta et al. (2000) and Delampady et al. (2001) show that some of its properties can lead to substantial developments in Bayesian statistics.

Suppose X is distributed with density f(x|θ) and θ has a prior π. Let the joint probability distribution of X and θ under the prior π be denoted by P. We can then consider the Pearson correlation coefficient ρ_P between two functions g_1(X, θ) and g_2(X, θ) under this probability distribution P. An objective prior in the spirit of a reference prior can then be derived by maximizing the correlation between two post-data summaries about the parameter θ, namely the posterior density and the likelihood function. Given a class Γ of priors, Delampady et al. (2001) show that the prior π that maximizes ρ_P{f(x|θ), π(θ|x)} is the one with the least Fisher information I(π) = E_π{(d/dθ) log π(θ)}² in the class Γ. Actually, Delampady et al. (2001) note that it is very difficult to work with the exact correlation coefficient ρ_P{f(x|θ), π(θ|x)}, and hence they maximize an appropriate large sample approximation, assuming that the likelihood function and the prior density are sufficiently smooth. The following example is from Delampady et al. (2001).

Example 5.8. Consider a location parameter θ with |θ| ≤ 1. Assume that f and π are sufficiently smooth. Then what is desired is the prior density which achieves the minimum Fisher information in the class of priors compactly supported on [−1, 1]. Bickel (1981) and Huber (1974) show that this prior is

π(θ) = cos²(πθ/2), if |θ| < 1; = 0, otherwise.

Thus, the Bickel prior is the default prior under the correlation criterion.
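The minimizing property can be checked numerically: the Fisher information I(π) = ∫ (π′(θ))²/π(θ) dθ of the Bickel prior on [−1, 1] equals π² ≈ 9.87, while other smooth densities vanishing at ±1 give larger values (the quartic competitor below is our own choice):

```python
import numpy as np

t = np.linspace(-1 + 1e-6, 1 - 1e-6, 400_001)
dt = t[1] - t[0]

def fisher_info(pdf, dpdf):
    """Numerical Fisher information of a prior density on [-1, 1]."""
    return np.sum(dpdf(t) ** 2 / pdf(t)) * dt

# Bickel prior cos^2(pi t / 2); its derivative is -(pi/2) sin(pi t).
i_bickel = fisher_info(lambda u: np.cos(np.pi * u / 2) ** 2,
                       lambda u: -(np.pi / 2) * np.sin(np.pi * u))

# A competitor density vanishing at +-1: (15/16)(1 - t^2)^2.
i_quartic = fisher_info(lambda u: (15 / 16) * (1 - u ** 2) ** 2,
                        lambda u: -(15 / 4) * u * (1 - u ** 2))

print(i_bickel, i_quartic)   # approx pi^2 = 9.87 versus 10.0
```

For the quartic density the integrand simplifies to 15θ², giving exactly 10, strictly above the Bickel value π².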
The variational result that obtains this prior as the one achieving the minimum Fisher information was rediscovered by Ghosh and Bhattacharya in 1983 (see Ghosh (1994); a proof of this interesting result is also given there).

In addition to the above problem, it is also possible to identify a robust estimate of θ using this approach. Suppose π is a reference prior, and Γ is a class of plausible priors; π may or may not belong to Γ. Let δ_ν be the Bayes estimate of θ with respect to a prior ν ∈ Γ. To choose an optimal Bayes
estimate, maximize (over ν ∈ Γ) the correlation coefficient ρ_P(θ, δ_ν) between θ and δ_ν. Note that ρ_P is calculated under P, and in this joint distribution θ follows the reference prior π. Again, by maximizing an appropriate large sample approximation to ρ_P(θ, δ_ν) (assuming that the likelihood function and the prior density are sufficiently smooth), Delampady et al. (2001) obtain the following theorem.

Theorem 5.9. The estimate δ_ν(X) maximizing ρ_P(θ, δ_ν) is Bayes with respect to the prior density

ν(θ) = c π(θ) exp{ −(1/(2τ²))(θ − μ)² },   (5.37)

where μ and τ² are arbitrary and c is a normalizing constant.

The interesting aspect of this reference prior ν is evident from the fact that it is simply a product of the initial reference prior π and a Gaussian factor. This may be interpreted as ν being the posterior density obtained when one begins with the prior π and revises it with a Gaussian observation, which serves to pull in its tails. Consider the following example, again from Delampady et al. (2001).

Example 5.10. Consider the reference prior density π(θ) ∝ (1 + θ²/3)^{−2}, the density of the Student's t_3 prior, a flat-tailed prior. Suppose that the family Γ contains only symmetric priors, so that ν(θ) is of the form

ν(θ) = c(1 + θ²/3)^{−2} exp{ −θ²/(2τ²) }.

Let X ~ Cauchy(θ, σ), with known σ and having density

f(x|θ) = (1/(πσ)) [ 1 + (x − θ)²/σ² ]^{−1}.

Some selected values of the two estimates are reported in Table 5.2 for σ = 0.2.

Table 5.2. Values of δ_π(x) and δ_ν(x) for Various x

x        0   .5    1     1.5    2      3      5      8      10     15
δ_π(x)   0   .349  .687  1.002  1.284  1.735  2.197  2.216  2.065  1.645
δ_ν(x)   0   .348  .683  .993   1.267  1.685  1.976  1.60   1.15   .481

For small and moderate values of x, δ_π and δ_ν behave similarly, whereas for large values, δ_ν results in much more shrinkage than δ_π. This is only expected because the penultimate ν has normal tails, whereas the reference π has flat tails.
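The qualitative behavior in Table 5.2 is easy to confirm by quadrature: compute the posterior mean of θ under the t_3 prior and under the tail-trimmed prior ν. (The value τ = 2 below is an arbitrary illustrative choice, so the δ_ν numbers will not match the table exactly, but the extra shrinkage at large x is evident.)

```python
import numpy as np

sigma, tau = 0.2, 2.0                     # tau is an illustrative choice
t = np.linspace(-60, 60, 600_001)
dt = t[1] - t[0]

def post_mean(x, prior):
    """Posterior mean of theta by quadrature, up to common normalizing constants."""
    lik = 1.0 / (1.0 + ((x - t) / sigma) ** 2)   # Cauchy(theta, sigma) kernel
    w = lik * prior
    return np.sum(t * w) / np.sum(w)

pi_t3 = (1 + t ** 2 / 3) ** (-2)                 # Student's t_3 kernel
nu = pi_t3 * np.exp(-t ** 2 / (2 * tau ** 2))    # t_3 kernel times Gaussian factor

for x in (1.0, 10.0):
    print(x, post_mean(x, pi_t3), post_mean(x, nu))
# delta_nu shrinks far more than delta_pi for large x
```

At x = 1 the two estimates nearly coincide, while at x = 10 the Gaussian factor in ν pulls the estimate sharply back toward zero.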
5.6 Exercises

1. Let X ~ B(n, p). Choose a prior on p such that the marginal distribution of X is uniform on {0, 1, ..., n}.
2. (Schervish (1995, p. 121)). Let h be a function of (x, θ) that is differentiable in θ. Define a prior

p*(θ) = [ Var_θ( (∂/∂θ) h(x, θ) ) ]^{1/2}.

(a) Show that p*(θ) satisfies the invariance condition (5.1).
(b) Choose a suitable h such that p*(θ) is the Jeffreys prior.
3. Prove (5.16) and generalize it to the multiparameter case.
4. (Lehmann and Casella (1998)) For a scale family, show that there is an equivariant estimate of σ² that minimizes E(T − σ²)²/σ⁴. Display the estimate as the ratio of two integrals and interpret it as a Bayes estimate.
5. Consider a multinomial distribution.
(a) Show that the Dirichlet distribution is a conjugate prior.
(b) Identify the precision parameter for the Dirichlet prior distribution.
(c) Let the precision parameter go to zero and identify the limiting prior and posterior. Suggest why the limiting posterior, but not the limiting prior, is used in objective Bayesian analysis.
6. Find the Jeffreys prior for the multinomial model.
7. Find the Jeffreys prior for the multivariate normal model with unknown mean vector and dispersion matrix.
8. (a) Examine why the Jeffreys prior may not be appropriate if the parameter is not identifiable over the full parameter space.
(b) Show with an example that the Jeffreys prior may not have a proper posterior. (Hint. Try the following mixture: X = 0 with probability 1/2 and is N(μ, 1) with probability 1/2.)
(c) Suggest a heuristic reason as to why the posterior is often proper if we use a Jeffreys prior.
9. Bernardo has recently proposed the use of min{K(f_0, f_1), K(f_1, f_0)}, instead of K(f_0, f_1), as the criterion to maximize at each stage of the reference prior algorithm. Examine the consequence of this change for the examples of reference priors discussed in this chapter.
10. Given (μ, σ²), let X_1, ..., X_n be i.i.d. N(μ, σ²) and consider the prior π(μ, σ²) ∝ 1/σ². Verify that

P{ X̄ − t_{α/2,n−1} s/√n < μ < X̄ + t_{α/2,n−1} s/√n | X_1, ..., X_n }
= P{ X̄ − t_{α/2,n−1} s/√n < μ < X̄ + t_{α/2,n−1} s/√n | μ, σ² }
= 1 − α,

where X̄ is the sample mean, s² = Σ_i (X_i − X̄)²/(n − 1), and t_{α/2,n−1} is the upper α/2 point of t_{n−1}, 0 < α < 1.
11. Given 0 < θ < 1, let X_1, ..., X_n be i.i.d. B(1, θ). Consider the Jeffreys prior for θ. Find by simulation the frequentist coverage of θ by the two-tailed 95% credible interval for several values of θ, e.g., θ = 1/4, 1/3, 1/2, 2/3, 3/4. Do the same for the usual frequentist interval θ̂ ± z_{0.025} √(θ̂(1 − θ̂)/n), where θ̂ = Σ_i X_i/n.
12. Derive (5.32) from an appropriate probability matching equation.
6 Hypothesis Testing and Model Selection
For Bayesians, model selection and model criticism are extremely important inference problems. Sometimes these tend to become much more complicated than estimation problems. In this chapter, some of these issues will be discussed in detail. However, all models and hypotheses considered here are low-dimensional, because high-dimensional models need a different approach. The Bayesian solutions will be compared and contrasted with the corresponding procedures of classical statistics whenever appropriate. Some of the discussion in this chapter is technical and will not be used in the rest of the book. Those sections that are very technical (or otherwise can be omitted at first reading) are indicated appropriately. These include Sections 6.3.4, 6.4, 6.5, and 6.7. In Sections 6.2 and 6.3, we compare frequentist and Bayesian approaches to hypothesis testing. We do the same in an asymptotic framework in Section 6.4. Recently developed methodologies such as the Bayesian P-value and some non-subjective Bayes factors are discussed in Sections 6.5 and 6.7.
6.1 Preliminaries

First, let us recall some notation from Chapter 2 and also let us introduce some specific notation for the discussion that follows. Suppose X, having density f(x|θ), is observed, with θ being an unknown element of the parameter space Θ. Suppose that we are interested in comparing two models M_0 and M_1, which are given by

M_0 : X has density f(x|θ), where θ ∈ Θ_0;
M_1 : X has density f(x|θ), where θ ∈ Θ_1.   (6.1)

For i = 0, 1, let g_i(θ) be the prior density of θ, conditional on M_i being the true model. Then, to compare models M_0 and M_1 on the basis of a random sample x = (x_1, ..., x_n), one would use the Bayes factor
B_01(x) = m_0(x)/m_1(x),   (6.2)

where

m_i(x) = ∫_{Θ_i} f(x|θ) g_i(θ) dθ,   i = 0, 1.   (6.3)
We also use the notation BF_01 for the Bayes factor. Recall from Chapter 2 that the Bayes factor is the ratio of the posterior odds of the hypotheses to the corresponding prior odds. Therefore, if the prior probabilities of the hypotheses, π_0 = P(M_0) = P(Θ_0) and π_1 = P(M_1) = P(Θ_1) = 1 − π_0, are specified, then as in (2.17),

P(M_0 | x) = { 1 + [(1 − π_0)/π_0] B_01^{−1}(x) }^{−1}.   (6.4)
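As a small worked instance of (6.2)-(6.4) (our own toy numbers): testing M_0 : p = 1/2 against M_1 : p ~ Uniform(0, 1) for binomial data with 7 successes in 10 trials.

```python
from math import comb, gamma

n, s = 10, 7
pi0 = 0.5                                    # prior probability of M_0

m0 = comb(n, s) * 0.5 ** n                   # marginal (6.3) under the point null p = 1/2
# Under M_1: integral of C(n,s) p^s (1-p)^(n-s) dp = C(n,s) * B(s+1, n-s+1) = 1/(n+1)
m1 = comb(n, s) * gamma(s + 1) * gamma(n - s + 1) / gamma(n + 2)
b01 = m0 / m1                                # Bayes factor (6.2)
post0 = 1.0 / (1.0 + (1 - pi0) / pi0 / b01)  # posterior probability (6.4)
print(b01, post0)                            # approx 1.29 and 0.56
```

With equal prior odds the data mildly favor the point null here, even though 7/10 is above 1/2; this is the averaging over p under M_1 at work.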
Thus, if conditional prior densities ^o and g\ can be specified, one should simply use the Bayes factor PQI foi* model selection. If, further TTQ is also specified, the posterior odds ratio of MQ to Mi can also be utilized. However, these computations may not always be easy to perform, even when the required prior ingredients are fully specified. A possible solution is the use of BIC as an approximation to a Bayes factor. We study this in Subsection 6.1.1. The situation can get much worse when the task of specifying these prior inputs itself becomes a difficult problem as in the following problem. Example 6.1. Consider the problem that is usually called nonparametric regression. Independent responses yi are observed along with covariates x^, i = 1 , . . . , n. The model of interest is Vi = 9(xi) -h ei, i = 1 , . . . , n,
(6.5)
where the ε_i are i.i.d. N(0, σ²) errors with unknown error variance σ². The function g is called the regression function. In linear regression, g is a priori assumed to be linear in a set of finitely many regression coefficients. In general, g can be assumed to be fully unknown also. Now, if model selection involves choosing g from two different fully nonparametric classes of regression functions, this becomes a very difficult problem. Computation of the Bayes factor or posterior odds ratio is then a formidable task. Various simplifications, including reducing g to be semi-parametric, have been studied. In such cases, some of these problems can be handled. Consider a different model checking problem now, that of testing for normality. This is a very common problem encountered in frequentist inference, because much of the inferential methodology is based on the normality assumption. Simple or multiple linear regression, ANOVA, and many other techniques routinely use this assumption. In its simplest form, the problem can
be stated as checking whether a given random sample X₁, X₂, ..., Xₙ arose from a population having the normal distribution. In the setup given above in (6.1), we may write it as

M₀ : X is N(μ, σ²) with arbitrary μ and σ² > 0;
M₁ : X does not have the normal distribution.
(6.6)
However, this looks quite different from (6.1) above, because M₁ does not constitute a parametric alternative. Hence it is not clear how to use Bayes factors or posterior odds ratios here for model checking. The difficulty with this model checking problem is clear: one is only interested in M₀ and not in M₁. This problem is addressed in Section 6.3 of Gelman et al. (1995). See also Section 9.9. We use the posterior predictive distribution of replicated future data to assess whether the predictions show systematic differences. In practice, replicated data will not be available, so cross-validation of some form has to be used, as discussed in Section 9.9. Gelman et al. (1995) have not used cross-validation and their P-values have come in for some criticism (see Section 6.5). The object of model checking is not to decide whether the model is true or false but to check whether the model provides a plausible approximation to the data. It is clear that we have to use posterior predictive values and Bayesian P-values of some sort, but consensus on details does not seem to have emerged yet. It remains an important problem.

6.1.1 BIC Revisited

Under appropriate regularity conditions on f, g₀, and g₁, the Bayes factor given in (6.2) can be approximated using the Laplace approximation or the saddle point approximation. Let us change notation and express (6.3) as follows:
m_i(x) = ∫ f(x|θ_i) g_i(θ_i) dθ_i,   i = 0, 1,
(6.7)
where θ_i is the p_i-dimensional vector of parameters under M_i, with p_i assumed to be independent of n (the dimension of the observation vector x). Let θ̃_i be the posterior mode of θ_i, i = 0, 1. Assume θ̃_i is an interior point of Θ_i. Then, expanding the logarithm of the integrand in (6.7) around θ̃_i using a second-order Taylor series approximation, we obtain

log( f(x|θ_i) g_i(θ_i) ) ≈ log( f(x|θ̃_i) g_i(θ̃_i) ) − ½ (θ_i − θ̃_i)′ H_{θ̃_i} (θ_i − θ̃_i),

where H_{θ̃_i} is the corresponding negative Hessian of log( f(x|θ_i) g_i(θ_i) ) evaluated at θ̃_i. Applying this approximation to (6.7) yields
m_i(x) ≈ f(x|θ̃_i) g_i(θ̃_i) (2π)^{p_i/2} |H_{θ̃_i}|^{−1/2}.
(6.8)
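The quality of the Laplace approximation (6.8) is easiest to examine in a conjugate case where the exact marginal likelihood is available in closed form. The sketch below (plain Python; the N(θ, 1) likelihood with an N(0, 1) prior is our own choice of illustration, not an example from the text) compares the two. Here the log integrand is exactly quadratic in θ, so (6.8) is in fact exact, which the comparison confirms.

```python
import math

def log_marginal_exact(x):
    """log m(x) for X_i ~ N(theta, 1) i.i.d. with prior theta ~ N(0, 1).
    Completing the square in theta gives this closed form."""
    n, s1, s2 = len(x), sum(x), sum(v * v for v in x)
    return (-(n / 2) * math.log(2 * math.pi) - 0.5 * math.log(n + 1)
            - 0.5 * (s2 - s1 * s1 / (n + 1)))

def log_marginal_laplace(x):
    """log of (6.8): f(x|mode) * g(mode) * (2*pi)^(p/2) * |H|^(-1/2), p = 1.
    Posterior mode is n*xbar/(n + 1); the negative Hessian of the
    log integrand is the constant n + 1."""
    n, s1 = len(x), sum(x)
    mode = s1 / (n + 1)
    log_f = -(n / 2) * math.log(2 * math.pi) - 0.5 * sum((v - mode) ** 2 for v in x)
    log_g = -0.5 * math.log(2 * math.pi) - 0.5 * mode ** 2
    return log_f + log_g + 0.5 * math.log(2 * math.pi) - 0.5 * math.log(n + 1)

data = [0.8, -0.3, 1.2, 0.5, -0.1]
print(log_marginal_exact(data), log_marginal_laplace(data))
```

For non-Gaussian likelihoods the two numbers would differ by an O(1/n) relative error rather than agree exactly.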
2 log B₀₁ is a commonly used evidential measure to compare the support provided by the data x for M₀ relative to M₁. Under the above approximation we have

2 log B₀₁ ≈ 2 log( f(x|θ̃₀)/f(x|θ̃₁) ) + 2 log( g₀(θ̃₀)/g₁(θ̃₁) ) + (p₀ − p₁) log(2π) + log( |H_{θ̃₁}| / |H_{θ̃₀}| ).

A variation of this approximation is also commonly used, where instead of the posterior mode θ̃_i, the maximum likelihood estimate θ̂_i is employed. Then, instead of (6.8), one obtains

m_i(x) ≈ f(x|θ̂_i) g_i(θ̂_i) (2π)^{p_i/2} |H_{θ̂_i}|^{−1/2}.
(6.9)
Here H_{θ̂_i} is the observed Fisher information matrix evaluated at the maximum likelihood estimator. If the observations are i.i.d., we have that H_{θ̂_i} = n H^{(1)}_{θ̂_i}, where H^{(1)}_{θ̂_i} is the observed Fisher information matrix obtained from a single observation. In this case,

m_i(x) ≈ f(x|θ̂_i) g_i(θ̂_i) (2π)^{p_i/2} n^{−p_i/2} |H^{(1)}_{θ̂_i}|^{−1/2},

and hence
2 log B₀₁ ≈ 2 log( f(x|θ̂₀)/f(x|θ̂₁) ) − (p₀ − p₁) log n + 2 log( g₀(θ̂₀)/g₁(θ̂₁) ) + (p₀ − p₁) log(2π) + log( |H^{(1)}_{θ̂₁}| / |H^{(1)}_{θ̂₀}| ).   (6.10)

An approximation to (6.10) correct to O(1) is
2 log B₀₁ ≈ 2 log( f(x|θ̂₀)/f(x|θ̂₁) ) − (p₀ − p₁) log n.   (6.11)

This is the approximate Bayes factor based on the Bayesian information criterion (BIC) due to Schwarz (1978). The term (p₀ − p₁) log n can be considered a penalty for using a more complex model.
A related criterion is
2 log( f(x|θ̂₀)/f(x|θ̂₁) ) − 2 (p₀ − p₁),
(6.12)
which is based on the Akaike information criterion (AIC), namely,

AIC = 2 log f(x|θ̂) − 2p

for a model f(x|θ). The penalty for using a complex model is not as drastic as that in BIC. A Bayesian interpretation of AIC for high-dimensional prediction problems is presented in Chapter 9. Problem 16 of Chapter 9 invites you to explore if AIC is suitable for low-dimensional testing problems.
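For a concrete, if artificial, illustration of (6.11) and (6.12), consider testing H₀ : θ = 0 against a free mean in an N(θ, 1) model; the data below are made up for the sketch. Under these assumptions p₀ = 0, p₁ = 1, and the likelihood-ratio term 2 log( f(x|θ̂₀)/f(x|θ̂₁) ) reduces to −n x̄².

```python
import math

def loglik_normal(x, theta):
    """Log-likelihood of i.i.d. N(theta, 1) data."""
    n = len(x)
    return -(n / 2) * math.log(2 * math.pi) - 0.5 * sum((v - theta) ** 2 for v in x)

def bic_criterion(x):
    """(6.11) with p0 = 0, p1 = 1: 2*log LR - (p0 - p1)*log(n) = 2*log LR + log(n)."""
    n, xbar = len(x), sum(x) / len(x)
    return 2 * (loglik_normal(x, 0.0) - loglik_normal(x, xbar)) + math.log(n)

def aic_criterion(x):
    """(6.12): same likelihood-ratio term, with penalty -2*(p0 - p1) = +2."""
    xbar = sum(x) / len(x)
    return 2 * (loglik_normal(x, 0.0) - loglik_normal(x, xbar)) + 2.0

data = [0.5, -0.2, 0.3, 0.1, -0.4, 0.2, 0.6, -0.1, 0.0, 0.3]
# With n = 10, log n is about 2.30 > 2, so the BIC version rewards
# the simpler model slightly more than the AIC version does.
print(bic_criterion(data), aic_criterion(data))
```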
6.2 P-value and Posterior Probability of H₀ as Measures of Evidence Against the Null

One particular tool from classical statistics that is very widely used in applied sciences for model checking or hypothesis testing is the P-value. It also happens to be one of the concepts that is highly misunderstood and misused. The basic idea behind R.A. Fisher's (see Fisher (1973)) original (1925) definition of P-value given below did have a great deal of appeal: It is the probability under a (simple) null hypothesis of obtaining a value of a test statistic that is at least as extreme as that observed in the sample data. Suppose that it is desired to test

H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀,
(6.13)
and that a classical significance test is available and is based on a test statistic T(X), large values of which are deemed to provide evidence against the null hypothesis. If data X = x is observed, with corresponding t = T(x), the P-value then is

α = P_{θ₀}( T(X) ≥ T(x) ).

Example 6.2. Suppose we observe X₁, ..., Xₙ i.i.d. from N(θ, σ²), where σ² is known. Then X̄ is sufficient for θ and it has the N(θ, σ²/n) distribution. Noting that T = T(X) = |√n (X̄ − θ₀)/σ| is a natural test statistic to test (6.13), one obtains the usual P-value as α = 2[1 − Φ(t)], where t = |√n (x̄ − θ₀)/σ| and Φ is the standard normal cumulative distribution function. Fisher meant the P-value to be used informally as a measure of the degree of surprise in the data relative to H₀. This use of the P-value as a post-experimental or conditional measure of statistical evidence seems to have some intuitive
justification. From a Bayesian point of view, various objections have been raised by Edwards et al. (1963), Berger and Sellke (1987), and Berger and Delampady (1987) against the use of P-values as measures of evidence against H₀. A recent review is Ghosh et al. (2005). To a Bayesian, the posterior probability of H₀ summarizes the evidence against H₀. In many of the common cases of testing, the P-value is smaller than the posterior probability by an order of magnitude. The reason for this is that the P-value ignores the likelihood of the data under the alternative and takes into account not only the observed deviation of the data from the null hypothesis as measured by the test statistic but also more significant deviations. In view of these facts, one may wish to see if P-values can be calibrated in terms of bounds for posterior probabilities over natural classes of priors. It appears that calibration takes the form of a search for an alternative measure of evidence based on the posterior that may be acceptable to a non-Bayesian. In this connection, note that there is an interesting discussion of the admissibility of the P-value as a measure of evidence in Hwang et al. (1992).
6.3 Bounds on Bayes Factors and Posterior Probabilities

6.3.1 Introduction

We begin with an example where P-values and the posterior probabilities are very different.

Example 6.3. We observe X̄ ~ N(θ, σ²/n), with known σ². Upon using T = |√n (X̄ − θ₀)/σ| as the test statistic to test (6.13), recall that the P-value comes out to be α = 2[1 − Φ(t)], where t = |√n (x̄ − θ₀)/σ| and Φ is the standard normal cumulative distribution function. On the set {θ ≠ θ₀}, let θ have the density (g₁) of N(μ, τ²). Then, we have

B₀₁ = √(1 + ρ⁻²) exp{ −½ [ (t − ρη)²/(1 + ρ²) − η² ] },

where ρ = σ/(√n τ) and η = (θ₀ − μ)/τ. Now, if we choose μ = θ₀, τ = σ and π₀ = 1/2, we get

B₀₁ = √(1 + n) exp{ −½ t²/(1 + 1/n) }.

For various values of t and n, the different measures of evidence, α = P-value, B = Bayes factor, and P = P(H₀|x), are displayed in Table 6.1 as shown in Berger and Delampady (1987). It may be noted that the posterior probability of H₀ varies between 4 and 50 times the corresponding P-value, which is an indication of how different these two measures of evidence can be.
Table 6.1. Normal Example: Measures of Evidence

                n = 1      n = 5      n = 10     n = 20     n = 50     n = 100
   t      α     B    P     B    P     B    P     B    P     B    P     B     P
 1.645  .10    .72  .42   .79  .44   .89  .47  1.27  .56  1.86  .65  2.57  .72
 1.960  .05    .54  .35   .49  .33   .59  .37   .72  .42  1.08  .52  1.50  .60
 2.576  .01    .27  .21   .15  .13   .16  .14   .19  .16   .28  .22   .37  .27
 3.291  .001   .10  .09   .03  .03   .02  .02   .03  .03   .03  .03   .05  .05
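The entries of Table 6.1 can be reproduced, up to rounding, from the closed form in Example 6.3. A sketch in plain Python with π₀ = 1/2:

```python
import math

def bayes_factor(t, n):
    """B01 of Example 6.3 with mu = theta0, tau = sigma:
    B01 = sqrt(1 + n) * exp(-t^2 / (2 * (1 + 1/n)))."""
    return math.sqrt(1.0 + n) * math.exp(-t * t / (2.0 * (1.0 + 1.0 / n)))

def posterior_prob(t, n, pi0=0.5):
    """P(H0|x) from (6.4); for pi0 = 1/2 this is B/(1 + B)."""
    b = bayes_factor(t, n)
    return 1.0 / (1.0 + ((1.0 - pi0) / pi0) / b)

for n in (1, 10, 100):
    print(n, round(bayes_factor(1.96, n), 2), round(posterior_prob(1.96, n), 2))
```

For t = 1.96 (α = .05) and n = 100 this gives B ≈ 1.50 and P ≈ .60, matching the last column of the table: data that are "significant at the 5% level" actually favor H₀.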
6.3.2 Choice of Classes of Priors

Clearly, there are irreconcilable differences between the classical P-value and the corresponding Bayesian measures of evidence in the above example. However, one may argue that the differences are perhaps due to the choice of π₀ or g₁ that cannot claim to be really 'objective.' The choice of π₀ = 1/2 may not be crucial because the Bayes factor, B, which does not need this, seems to be providing the same conclusion, but the choice of g₁ does have a substantial effect. To counter this argument, let us consider lower bounds on B and P over wide classes of prior densities. What is surprising is that even these lower bounds, which are based on priors 'least favorable' to H₀, are typically an order of magnitude larger than the corresponding P-values for precise null hypotheses. The other motivation for looking at bounds over classes of priors is that they correspond with robust Bayesian answers that are more compelling when an objective choice of a single prior does not exist. Thus, in the case of precise null hypotheses, if G is the class of all plausible conditional prior densities g₁ under H₁, we are then led to the consideration of the following bounds:

B(G, x) = inf_{g∈G} B₀₁ = f(x|θ₀) / sup_{g∈G} m_g(x),   (6.14)

where m_g(x) = ∫_{θ≠θ₀} f(x|θ) g(θ) dθ, and

P(H₀|G, x) = inf_{g∈G} P(H₀|x) = [ 1 + ((1 − π₀)/π₀) · (1/B(G, x)) ]⁻¹.   (6.15)
This brings us back to the question of the choice of the class G, as in Chapter 3, where the robust Bayesian approach has been discussed. As explained there, robustness considerations force us to consider classes that are neither too large nor too small. Choosing the class G_A = {all densities} certainly is very extreme because it allows densities that are severely biased towards H₁. Quite often, the class G_NC = {all natural conjugate densities with mean θ₀} is an interesting class to consider. However, this turns out to be inadequate for robustness considerations. The following class,

G_US = {all densities symmetric about θ₀ and non-increasing in |θ − θ₀|},   (6.16)
which strikes a balance between these two extremes, seems to be a good choice. Because we are comparing various measures of evidence, it is informative to examine the lower bounds for each of these classes. In particular, we can gather the magnitudes of the differences between these measures across the classes. To simplify proofs of some of the results given below, we restate a result indicated in Section 3.8.1.

Lemma 6.4. Suppose C_T = {ν_t : t ∈ T}, T ⊂ ℝᵏ, is a set of prior probability measures, and let C be the convex hull of C_T. Then

sup_{π∈C} ∫_Θ f(x|θ) dπ(θ) = sup_{t∈T} ∫_Θ f(x|θ) dν_t(θ).   (6.17)
Proof. Because C ⊇ C_T, LHS ≥ RHS in (6.17). For the other direction, any π ∈ C can be written as ∫_T ν_t(·) dμ(t) for some probability measure μ on T, so using Fubini's theorem,

∫_Θ f(x|θ) dπ(θ) = ∫_Θ f(x|θ) ∫_T dν_t(θ) dμ(t)
               = ∫_T ( ∫_Θ f(x|θ) dν_t(θ) ) dμ(t)
               ≤ sup_{t∈T} ∫_Θ f(x|θ) dν_t(θ).

Therefore, sup_{π∈C} ∫_Θ f(x|θ) dπ(θ) ≤ sup_{t∈T} ∫_Θ f(x|θ) dν_t(θ), yielding the other inequality also. □

The following results are from Berger and Sellke (1987) and Edwards et al. (1963).

Theorem 6.5. Let θ̂(x) be the maximum likelihood estimate of θ for the observed value of x. Then

B(G_A, x) = f(x|θ₀) / f(x|θ̂(x)),   (6.18)

P(H₀|G_A, x) = [ 1 + ((1 − π₀)/π₀) · f(x|θ̂(x))/f(x|θ₀) ]⁻¹.   (6.19)
In view of Lemma 6.4, the proof of this result is quite elementary, once it is noted that the extreme points of G_A are point masses.

Theorem 6.6. Let U_S be the class of all uniform distributions symmetric about θ₀. Then

B(G_US, x) = B(U_S, x),   (6.20)

P(H₀|G_US, x) = P(H₀|U_S, x).   (6.21)
Proof. Simply note that any unimodal symmetric distribution is a mixture of symmetric uniforms, and apply Lemma 6.4 again. □

Because B(U_S, x) = f(x|θ₀)/sup_{g∈U_S} m_g(x), computation of

sup_{g∈U_S} m_g(x) = sup_{g∈U_S} ∫ f(x|θ) g(θ) dθ

is required to employ Theorem 6.6. Also, it may be noted that as far as robustness is considered, using the class G_US of all symmetric unimodal priors is the same as using the class U_S of all symmetric uniform priors. It is perhaps reasonable to assume that many of these uniform priors are somewhat biased against H₀, and hence we should consider unimodal symmetric prior distributions that are smoother. One possibility is scale mixtures of normal distributions having mean θ₀. This class is substantially larger than just the class of normals centered at θ₀; it includes the Cauchy, all Student's t, and so on. To obtain the lower bounds, however, it is enough to consider G_Nor = {all normal distributions with mean θ₀}, in view of Lemma 6.4.

Example 6.7. Let us continue with Example 6.3. We have the following results from Berger and Sellke (1987) and Edwards et al. (1963).

(i) B(G_A, x) = exp(−t²/2), because the MLE of θ is x̄; hence
(ii) P(H₀|G_A, x) = [ 1 + ((1 − π₀)/π₀) exp(t²/2) ]⁻¹.
(iii) If t < 1, B(G_US, x) = 1, and P(H₀|G_US, x) = π₀. This is because in this case, the unimodal symmetric distribution that maximizes m_g(x) is the degenerate distribution that puts all its mass at θ₀.
(iv) If t > 1, the g ∈ G_US that maximizes m_g(x) is non-degenerate, and from Theorem 6.6 and Example 3.4,

B(G_US, x) = φ(t) / sup_{k>0} (1/(2k)) { Φ(k − t) − Φ(−(k + t)) },

where φ is the standard normal density.
(v) If t < 1, B(G_Nor, x) = 1, and P(H₀|G_Nor, x) = π₀. If t > 1,

B(G_Nor, x) = t exp( −(t² − 1)/2 ).
For various values of t, the different measures of evidence, α = P-value, B = lower bound on the Bayes factor, and P = lower bound on P(H₀|x), are displayed in Table 6.2. π₀ has been chosen to be 0.5. What we note is that the differences between P-values and the corresponding Bayesian measures of evidence remain irreconcilable even when the lower bounds on such measures are considered. In other words, even the least possible Bayes factor and posterior probability of H₀ are substantially larger than the corresponding P-value. This is so even for the choice G_A, which is rather astonishing (see Edwards et al. (1963)).
Table 6.2. Normal Example: Lower Bounds on Measures of Evidence

                  G_A              G_US             G_Nor
   t      α     B      P         B     P         B     P
 1.645  .10   .258   .205      .639  .390      .701  .412
 1.960  .05   .146   .128      .408  .290      .473  .321
 2.576  .01   .036   .035      .122  .109      .153  .133
 3.291  .001  .0044  .0044     .018  .018      .024  .0235
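The three lower bounds in Example 6.7 are simple enough to evaluate directly. The sketch below (plain Python, π₀ = 1/2, with the supremum over k in (iv) taken by grid search rather than analytically) reproduces the rows of Table 6.2.

```python
import math

def Phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lb_GA(t):
    """(i): lower bound on B over all densities."""
    return math.exp(-t * t / 2.0)

def lb_GUS(t):
    """(iii)-(iv): lower bound on B over symmetric unimodal densities."""
    if t <= 1.0:
        return 1.0
    phi_t = math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
    sup = max((Phi(k - t) - Phi(-(k + t))) / (2.0 * k)
              for k in (i / 1000.0 for i in range(1, 20001)))
    return phi_t / sup

def lb_GNor(t):
    """(v): lower bound on B over normal priors centered at theta0."""
    return 1.0 if t <= 1.0 else t * math.exp(-(t * t - 1.0) / 2.0)

def lb_posterior(b, pi0=0.5):
    """Matching lower bound on P(H0|x) via (6.15)."""
    return 1.0 / (1.0 + ((1.0 - pi0) / pi0) / b)

for t in (1.645, 1.960, 2.576, 3.291):
    print(t, round(lb_GA(t), 3), round(lb_GUS(t), 3), round(lb_GNor(t), 3))
```

The grid search over k ∈ (0, 20] is crude but adequate at three decimal places for the t values in the table.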
6.3.3 Multiparameter Problems

It is not the case that the discrepancies between P-values and lower bounds on the Bayes factor or posterior probability of H₀ are present only for tests of precise null hypotheses in single parameter problems. This phenomenon is much more prevalent. We shall present below some simple multiparameter problems where similar discrepancies have been discovered. The following result on testing a p-variate normal mean vector is from Delampady (1986).

Example 6.8. Suppose X ~ N_p(θ, I), where X = (X₁, X₂, ..., X_p)′ and θ = (θ₁, θ₂, ..., θ_p)′. It is desired to test

H₀ : θ = θ⁰ versus H₁ : θ ≠ θ⁰,

where θ⁰ = (θ₁⁰, θ₂⁰, ..., θ_p⁰)′ is a specified vector. The classical test statistic is

T(X) = ||X − θ⁰||²,

which has a χ²_p distribution under H₀. Thus the P-value of the data x is

α = P( χ²_p > T(x) ).

Consider the class G_USP of unimodal spherically symmetric (about θ⁰) prior distributions for θ, the natural generalization of G_US. This will consist of densities g(θ) of the form g(θ) = h((θ − θ⁰)′(θ − θ⁰)), where h is non-increasing. Noting that any unimodal spherically symmetric distribution is a mixture of uniforms on symmetric spheres, and applying Lemma 6.4, we obtain

sup_{g∈G_USP} m_g(x) = sup_{k>0} (1/V(k)) ∫_{||θ−θ⁰||<k} f(x|θ) dθ,

where V(k) is the volume of the sphere of radius k. Because f(x|θ) is the N_p(θ, I) density,

B(G_USP, x) = exp(−½||x − θ⁰||²) / sup_{k>0} (1/V(k)) ∫_{||θ−θ⁰||<k} exp(−½||x − θ||²) dθ.

Using this result, numerical values were computed for different dimensions p and different P-values α. In Table 6.3 we present these values, where
Table 6.3. Multivariate Normal Example: Lower Bounds on Measures of Evidence

        α = .001      α = .01       α = .05       α = .10
  p     B     P       B     P       B     P       B     P
  1    .018  .018    .122  .109    .409  .290    .639  .390
  2    .014  .014    .098  .089    .348  .258    .570  .363
  3    .012  .012    .090  .083    .326  .246    .540  .351
  4    .011  .011    .085  .078    .314  .239    .523  .344
  5    .010  .010    .082  .076    .307  .235    .513  .339
 10    .009  .009    .078  .072    .293  .226    .491  .329
 15    .009  .009    .075  .070    .288  .223    .483  .326
 20    .009  .009    .074  .069    .284  .221    .478  .324
 30    .009  .009    .074  .069    .281  .219    .473  .321
 40    .009  .009    .073  .068    .279  .218    .471  .320
  ∞    .009  .009    .073  .068    .279  .218    .468  .319
B denotes B(G_USP, x) and P denotes P(H₀|G_USP, x) for π₀ = 0.5. As can be readily seen, the lower bounds remain substantially larger than the corresponding P-values in all dimensions.

Note that spherical symmetry is not the only generalization of symmetry from one dimension to higher dimensions. Very different answers can be obtained if, for example, elliptical symmetry is used instead. Suppose we consider densities of the form g(θ) = |Q|^{1/2} h((θ − θ⁰)′Q(θ − θ⁰)), where Q is an arbitrary positive definite matrix and h is non-increasing. Then the following result, which is informally stated in Delampady and Berger (1990), obtains. For the sake of simplicity, let us take θ⁰ = 0.

Theorem 6.9. Let f(x|θ) be a multivariate, multiparameter density. Consider the class of elliptically symmetric unimodal prior densities

G_UES = { g : g(θ) = |Q|^{1/2} h(θ′Qθ), h non-increasing, Q positive definite }.   (6.22)

Then

sup_{g∈G_UES} m_g(x) = sup_{Q>0} { sup_{k>0} (1/V(k)) ∫_{||u||<k} f(x|Q^{−1/2}u) du }.   (6.23)
Proof. Note that

sup_{g∈G_UES} m_g(x) = sup_{g∈G_UES} ∫ f(x|θ) g(θ) dθ
                   = sup_{h,Q} ∫ f(x|θ) h(θ′Qθ) |Q|^{1/2} dθ
                   = sup_{Q>0} { sup_h ∫ f(x|Q^{−1/2}u) h(u′u) du }   (6.24)
                   = sup_{Q>0} { sup_{k>0} (1/V(k)) ∫_{||u||<k} f(x|Q^{−1/2}u) du },

where the substitution u = Q^{1/2}θ gives (6.24), and the last step follows as in Theorem 6.6 because any unimodal spherically symmetric density h(u′u) is a mixture of uniforms on symmetric spheres. □
The following results from Delampady and Berger (1990) once again point out the discrepancies between P-values and Bayesian measures of evidence.

Example 6.10. Let n = (n₁, n₂, ..., n_k) be a sample of fixed size N = Σᵢ₌₁ᵏ nᵢ from a k-cell multinomial distribution with unknown cell probabilities p = (p₁, p₂, ..., p_k) and density (mass) function
f(n|p) = ( N! / ∏ᵢ₌₁ᵏ nᵢ! ) ∏ᵢ₌₁ᵏ pᵢ^{nᵢ}.
Consider testing H₀ : p = p⁰ versus H₁ : p ≠ p⁰, where p⁰ = (p₁⁰, p₂⁰, ..., p_k⁰) is a specified interior point of the k-dimensional simplex. Instead of focusing on the exact multinomial setup, the most popular approach is to use the chi-squared approximation. Here the test statistic of interest is

χ² = Σᵢ₌₁ᵏ (nᵢ − N pᵢ⁰)² / (N pᵢ⁰),
which has the asymptotic distribution (as N → ∞) of χ²_{k−1} under H₀. To compare P-values so obtained with the corresponding robust Bayesian measures of evidence, the following are two natural classes of prior distributions to consider.

(i) The conjugate class G_C of Dirichlet priors with density

g(p) ∝ ∏ᵢ₌₁ᵏ pᵢ^{aᵢ−1},

where the aᵢ > 0 satisfy

(a₁, a₂, ..., a_k)′ / Σᵢ₌₁ᵏ aᵢ = E(p) = p⁰.
(ii) Consider the following transform of (p₁, p₂, ..., p_{k−1})′:

u = √N ( (p₁ − p₁⁰)/√p₁⁰, (p₂ − p₂⁰)/√p₂⁰, ..., (p_{k−1} − p_{k−1}⁰)/√p_{k−1}⁰ )′.
The justification (see Delampady and Berger (1990)) for using such a transform is that its range is ℝ^{k−1}, unlike that of p, and its likelihood function is more symmetric and closer to a multivariate normal. Now let

G*_USP = {unimodal g*(u) that are spherically symmetric about 0},
and consider the class of prior densities g obtained by transforming back to the original parameter:

G_TUS = { g(p) = g*(u(p)) |∂u/∂p| : g* ∈ G*_USP },

where |∂u/∂p| is the Jacobian of the transformation.
Delampady and Berger (1990) show that as N → ∞, the lower bounds on Bayes factors over G_C and G_TUS converge to those corresponding with the multivariate normal testing problem (chi-squared test) in Example 6.8, thus proving that the irreconcilability of P-values and Bayesian measures of evidence is present in goodness of fit problems as well. Additional discussion of the multinomial testing problem with mixtures of conjugate priors can be found in Good (1965, 1967, 1975). Edwards et al. (1963) discuss the possibility of finding lower bounds on Bayes factors over the conjugate class of priors for the binomial problem. Extensive discussion of the binomial problem and further references can be found in Berger and Delampady (1987).

6.3.4 Invariant Tests¹

A natural generalization of the symmetry assumption (on the prior distribution) is invariance under a group of transformations. Such a generalization and many examples can be found in Delampady (1989a). A couple of those examples will be discussed below to show the flavor of the results. The general results that utilize sophisticated mathematical arguments will be skipped, and instead interested readers are referred to the source indicated above. For a good discussion on invariance of statistical decision rules, see Berger (1985a).

Recall that the random observable X takes values in a space 𝒳 and has density (mass) function f(x|θ). The unknown parameter is θ ∈ Θ ⊂ ℝⁿ, for some positive integer n. It is desired to test H₀ : θ ∈ Θ₀ versus H₁ : θ ∈ Θ₁. We assume the following in addition.

(i) There is a group G (of transformations) acting on 𝒳 that induces a group Ḡ (of transformations acting) on Θ. These two groups are isomorphic (see Section 5.1.7), and elements of G will be denoted by g, those of Ḡ by ḡ.
(ii) f(gx|ḡθ) = f(x|θ) k(g) for a suitable continuous map k (from G to (0, ∞)).
(iii) ḡΘ₀ = Θ₀, ḡΘ₁ = Θ₁, ḡΘ = Θ.

In this context, the following concept of a maximal invariant is needed.

Definition. When a group G of transformations acts on a space 𝒳, a function T(x) on 𝒳 is said to be invariant (with respect to G) if T(g(x)) = T(x) for all x ∈ 𝒳 and g ∈ G. A function T(x) is maximal invariant (with respect to G) if it is invariant and further T(x₁) = T(x₂) implies x₁ = g(x₂) for some g ∈ G.

¹ Section 6.3.4 may be omitted at first reading.
This means that G divides 𝒳 into orbits where invariant functions are constant. A maximal invariant assigns different values to different orbits. Now from (i), we have that the actions of G and Ḡ induce maximal invariants t(X) on 𝒳 and η(θ) on Θ, respectively.

Remark 6.11. The family of densities f(x|θ) is said to be invariant under G if (ii) is satisfied. The testing problem H₀ : θ ∈ Θ₀ versus H₁ : θ ∈ Θ₁ is said to be invariant under G if in addition (iii) is also satisfied.

Example 6.12. Consider Example 6.8 again and suppose X ~ N_p(θ, I).
It is desired to test

H₀ : θ = 0 versus H₁ : θ ≠ 0.
figH^lgnO)
= {2^)-^^^
eM-\{H^-He)\H^-HO))
= (27r)-^/2exp(-l(x-^y(x-^)) = /(x|^), so that (ii) is satisfied. Also, gnO = 0, and (iii) too is satisfied. Example 6.13. Let Xi,X2, • • • ,X^ be a random sample from N {O^cr'^) with both 0 and a unknown. The problem is to test the hypothesis HQ : 6 = 0 against Hi : 0 ^ 0. A sufficient statistic for (^, cr) is x = (X, 5), X = J27 Xi/n and S = [JZii^i
- X ) V n ] ^ / ^ Then
/(x|^, a) = K a - " 5 " - 2 e x p ( - n [(X - 0)^ + S^] 7(2^^)), where K is a constant. Also, X = {{x, s) :x en,s>0},
and O = {((9, cr) : 6> G 7^, a > 0}.
The problem is invariant under the group G = {gc = c : c > 0}, where the action of gc is given by gdx) = c(x,s) = (cx^cs). Note that f{gcx\6^a) =
c-^fix\d,a). A number of technical conditions in addition to the assumptions (i)-(iii) yield a very useful representation for the density of the maximal invariant statistic t(X). Note that this density, q{t{x.)\r]{6))^ depends on the parameter 6 only through the maximal invariant in the parameter space, ri{6).
174
6 Hypothesis Testing and Model Selection
The technique involved in the derivation of these results uses an averaging over a relevant group. The general method of this kind of averaging is due to Stein (1956), but because there are a number of mathematical problems to overcome, various different approaches were discovered as can be seen in Wijsman (1967, 1985, 1986), Andersson (1982), Andersson et al. (1983), and Farrell (1985). For further details, see Eaton (1989), Kariya and Sinha (1989), and Wijsman (1990). The specific conditions and proofs of these results can be found in the above references. In particular, the groups considered here are amenable groups as presented in detail in Bondar and Milnes (1981). See also Delampady (1989a). The orthogonal group, and the group of location-scale transformations are amenable. The multiplicative group of non-singular pxp linear transformations is not. Let us return to the issue of comparison of P-values and lower bounds on Bayes factors and posterior probabilities (of hypotheses) in this setup. We note that it is necessary to reduce the problem by using invariance for any meaningful comparison because the classical test statistic and hence the computation of P-value are already based on this. Therefore, the natural class of priors to be used for this comparison is the class Gj of G-invariant priors; i.e., those priors TT that satisfy (iv) 7T{A) = 7r{Ag).
Theorem 6.14. IfQ is a group of transformations satisfying certain regularity conditions (see Delampady (1989a)), inf B{Gi.x)=''^^'^' sup
g(t(x)|r/i) , , ,. , , q{t{^)\r}2)
(6.25)
where ©/Q denotes the space of maximal invariants on the parameter space. Corollary 6.15. If ©Q/Q — {0}; then under the same conditions as in Theorem 6.14, {sup^^0/^g(^(x)|r/)} Example 6.16. (Example 6.12, continued.) Consider the class of all priors that are invariant under orthogonal transformations, and note that this class is simply the class of all spherically symmetric distributions. Now, application of Corollary 6.15 yields,
q{t{^m where q{t\r]) is the density of a noncentral x^ random variable with p degrees of freedom and non-centrality parameter 77, and r) is the maximum likelihood estimate of 77 from data t(x). For selected values of p the lower bounds, B and P (for TTo = 0.5) are tabulated against their P-values in Table 6.4.
6.3 Bounds on Bayes Factors and Posterior Probabilities
175
Table 6.4. Invariant Test for Normal Means a = 0.01 a = B P B .0749 .0697 .2913 .0745 .0693 .2903 .0737 .0686 .2842 .0734 .0684 .2821
0.05 P .2256 .2250 .2213 .2200
Notice t h a t t h e lower bounds on t h e posterior probabilities of t h e null hypothesis are anywhere from 4 to 7 times as large as t h e corresponding P-values, indicating t h a t there is a vast discrepancy between P-values a n d posterior probabilities. This is t h e same phenomenon as w h a t was seen in Table 6.3. W h a t is, however, interesting is t h a t the class of priors considered here is larger and contains the one considered there, b u t t h e m a g n i t u d e of t h e discrepancy is about the same. Example 6.17. (Example 6.13, continued.) In t h e normal example with unknown variance, we have t h e maximal invariants t ( x ) = x/s and r]{0^ a) = 0/a. If we define, da Gj = {TT : c?7r(^, a) = hi{ri)drj — , /ii is any density for ry}, a we obtain. g(i(x)|0) B{Gi,x) =
?Wx)|^)'
where q{t\r]) is the density of a noncentral S t u d e n t ' s t r a n d o m variable with n— 1 degrees of freedom, and non-centrality p a r a m e t e r r/, and fj is t h e m a x i m u m likelihood estimate of 77. T h e fact t h a t all t h e necessary conditions (which are needed to apply t h e relevant results) are satisfied is shown in Andersson (1982) and Wijsman (1967). For selected values of t h e lower b o u n d s are t a b u l a t e d along with t h e P-values in Table 6.5. For small values of n, t h e lower b o u n d s in Table 6.5 are comparable with the corresponding P-values, whereas as n gets large t h e differences between these lower bounds and t h e P-values get larger. See also in this connection Section 6.4. There is substantial literature on Bayesian testing of a point null. A m o n g these are Jeffreys (1957, 1961), Good (1950, 1958, 1965, 1967, 1983, 1985, 1986), Lindley (1957, 1961, 1965, 1977), Raiffa and SchlaiflPer (1961), Edwards et al. (1963), Hildreth (1963), Smith (1965), Zellner (1971, 1984), Dickey (1971, 1973, 1974, 1980), Lempers (1971), R u b i n (1971), Leamer (1978), Smith and Spiegelhalter (1980), Zellner a n d Slow (1980), and D i a m o n d a n d Forrester (1983). Related work can also be found in P r a t t (1965), DeGroot (1973), Dempster (1973), Dickey (1977), Bernardo (1980), Hill (1982), Shafer (1982), and Berger (1986).
176
6 Hypothesis Testing and Model Selection Table 6.5. Test for Normal Mean, Variance Unknown a = 0.01 a = 0.05 a = 0.10 B P B P B P 2 .0117 .0116.0506 .0482 .0939 .0858 8 .0137 .0135.0941 .0860 .2114 .1745 12 .0212 .0208.1245 .1107.2309 .1876 16 .0290 .0282.1301 .1151 .2375 .1919 32 .0327 .0317.1380 .1213 .2478 .1986
n
Invariance and Minimaxity Our focus has been on deriving bounds on Bayes factors for invariant testing problems. There is, however, a large literature on other aspects of invariant tests. For example, if the group under consideration satisfies the technical condition of amenability and hence the Hunt-Stein theorem is valid, then the minimax invariant test is minimax among all tests. We do not discuss these results here. For details on this and other related results we would like to refer the interested readers to Berger (1985a), Kiefer (1957, 1966), and Lehmann (1986). 6.3.5 Interval Null Hypotheses and One-sided Tests Closely related to a sharp null hypothesis HQ : 9 = 9Q is an interval null hypothesis i^o • | ^ ~ ^ o | ^ ^- Dickey (1976) and Berger and Delampady (1987) show that the conflict between P-values and posterior probabilities remains if e is sufficiently small. The precise order of magnitude of small e depends on the sample size n. One may also ask similar questions of possible conflict between P-values and posterior probabilities for one-sided null, say, HQ : 6 < 6Q versus Hi : 6 > OQ. In the case of ^ = mean of a normal, and the usual uniform prior, direct calculation shows the P-value equals posterior probability. On the other hand, Casella and Berger (1987) show in general the two are not the same and the P-value may be smaller or greater depending on the family of densities in the model. Incidentally, the ambiguity of an improper prior discussed in Section 6.7 does not apply to one-sided nulls. In this case the Bayes factor remains invariant if the improper prior is multiplied by an arbitray constant.
6.4 Role of t h e Choice of an A s y m p t o t i c Framework^ This section is based on Ghosh et al. (2005). Suppose X i , . . . , X ^ are i.i.d. A/'(^,(j^), cr^ known, and consider the problem of testing HQ : 6 = 9o versus ^ Section 6.4 may be omitted at first reading.
H₁ : θ ≠ θ₀. If instead of taking a lower bound as in the previous sections we take a fixed prior density g₁(θ) under H₁ but let n go to ∞, then the conflict between P-values and posterior probabilities is further enhanced. Historically, this phenomenon was noted earlier than the conflict with the lower bound, vide Jeffreys (1961) and Lindley (1957).

Let g₁ be a uniform prior density over some interval (θ₀ − a, θ₀ + a) containing θ₀. The posterior probability of H₀ given X = (X₁, ..., Xₙ) is

P(H₀|X) = π₀ exp[−n(X̄ − θ₀)²/(2σ²)] / K,

where π₀ is the specified prior probability of H₀ and

K = π₀ exp[−n(X̄ − θ₀)²/(2σ²)] + (1 − π₀) (1/(2a)) ∫_{θ₀−a}^{θ₀+a} exp[−n(X̄ − θ)²/(2σ²)] dθ.

Suppose X̄ is such that X̄ = θ₀ + z_α σ/√n, where z_α is the 100(1 − α)% quantile of N(0, 1). Then X̄ is significant at level α. Also, for sufficiently large n, X̄ is well within (θ₀ − a, θ₀ + a) because X̄ − θ₀ tends to zero as n increases. This leads to

∫_{θ₀−a}^{θ₀+a} exp[−n(X̄ − θ)²/(2σ²)] dθ ≈ σ √(2π/n),

and hence

P(H₀|X̄) ≈ π₀ exp(−z_α²/2) / [π₀ exp(−z_α²/2) + ((1 − π₀)/(2a)) σ √(2π/n)].
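The behavior of this posterior probability as n grows is easy to check numerically; a sketch with illustrative choices π₀ = 1/2, a uniform g₁ on (−1, 1), σ = 1, θ₀ = 0, α = 0.05, and X̄ held exactly at the significance boundary (the function name is ours):

```python
import math
from scipy.stats import norm

def post_prob_H0(n, alpha=0.05, pi0=0.5, a=1.0):
    # theta0 = 0, sigma = 1; X-bar fixed at the two-sided alpha boundary
    xbar = norm.ppf(1 - alpha / 2) / math.sqrt(n)
    f0 = norm.pdf(xbar, loc=0.0, scale=1 / math.sqrt(n))
    # marginal of X-bar under the uniform g1 on (-a, a), computed in closed
    # form: (1/2a) * integral over theta in (-a, a) of the N(theta, 1/n)
    # density evaluated at xbar
    m1 = (norm.cdf((a - xbar) * math.sqrt(n))
          - norm.cdf((-a - xbar) * math.sqrt(n))) / (2 * a)
    return pi0 * f0 / (pi0 * f0 + (1 - pi0) * m1)

probs = {n: post_prob_H0(n) for n in (10, 100, 10_000, 1_000_000)}
print({n: round(p, 3) for n, p in probs.items()})
# P(H0 | x-bar) climbs toward 1 with n, while the P-value stays at alpha
```

Even though every sample in the sequence is "significant at the 5% level," the posterior probability of H₀ rises toward 1, which is the content of the paradox discussed next.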
Thus P(H₀|X̄) → 1 as n → ∞, whereas the P-value is equal to α for all n. This is known as the Jeffreys-Lindley paradox. It may be noted that the same phenomenon would arise with any sufficiently flat prior in place of the uniform. Indeed, P-values cannot be compared across sample sizes or across experiments; see Lindley (1957), Ghosh et al. (2005). Even a frequentist tends to agree that the conventional values of the significance level α, like α = 0.05 or 0.01, are too large for large sample sizes. This point is discussed further below.

The Jeffreys-Lindley paradox shows that for inference about θ, P-values and Bayes factors may provide contradictory evidence and hence can lead to opposite decisions. Once again, as mentioned in Section 6.3, the evidence against H₀ contained in P-values seems unrealistically high. We argue in this section that part of this conflict arises from the fact that different types of asymptotics are being used for the Bayes factors and the P-values.

We begin with a quick review of the two relevant asymptotic frameworks in classical statistics for testing a sharp null hypothesis. The standard asymptotics of classical statistics is based on what are called Pitman alternatives, namely, θₙ = θ₀ + d/√n, at a distance of O(1/√n) from the null. The Pitman alternatives are also called contiguous in the very general asymptotic theory developed by Le Cam (vide Roussas (1972), Le Cam and
Yang (2000), Hajek and Sidak (1967)). The log-likelihood ratio of a contiguous alternative with respect to the null is stochastically bounded as n → ∞. On the other hand, for a fixed alternative, the log-likelihood ratio tends to −∞ (under the null) or ∞ (under the fixed alternative). If the probability of Type 1 error is 0 < α < 1, then the behavior of the likelihood ratio has the following implication. The probability of Type 2 error will converge to some 0 < β < 1 under a contiguous alternative θₙ, and to zero if θ is a fixed alternative. This means the fixed alternatives are relatively easy to detect. So in this framework it is assumed that the alternatives of importance are the contiguous alternatives. Let us call this theory Pitman type asymptotics.

There are several other frameworks in classical statistics, of which Bahadur's (Bahadur, 1971; Serfling, 1980, pp. 332-341) has been studied most. We focus on Bahadur's approach. In Bahadur's theory, the alternatives of importance are fixed and do not depend on n. Given a test statistic, Bahadur evaluates its performance at a fixed alternative by the limit (in probability or a.s.) of (1/n) log(P-value) when the alternative is true.

Which of these two asymptotics is appropriate in a given situation should depend on which alternatives are important: fixed alternatives, or Pitman alternatives θ₀ + d/√n that approach the null hypothesis at a certain rate. This in turn depends on how the sample size n is chosen. If n is chosen to ensure a Type 2 error bounded away from 0 and 1 (like α), then Pitman alternatives seem appropriate. If n is chosen to be quite large, depending on available resources but not on alternatives, then Bahadur's approach would be reasonable.

6.4.1 Comparison of Decisions via P-values and Bayes Factors in Bahadur's Asymptotics

In this subsection, we essentially follow Bahadur's approach for both P-values and Bayes factors. A Pitman type asymptotics is used for both in the next subsection.
We first show that if the P-value is sufficiently small, as small as it typically is in Bahadur's theory, B₀₁ will tend to zero, calling for rejection of H₀; i.e., the evidence in the P-value points in the same direction as that in the Bayes factor or posterior probability, removing the sense of paradox in the result of Jeffreys and Lindley. One could, therefore, argue that the P-values or the significance level α assumed in the Jeffreys-Lindley example are not small enough. The asymptotic framework chosen is not appropriate when contiguous alternatives are not singled out as alternatives of importance.

We now verify the claim about the limit of B₀₁. Without loss of generality, take θ₀ = 0, σ² = 1. First note that

log B₀₁ = −(n/2) X̄² + (1/2) log n + Rₙ,   (6.26)

where

Rₙ = −log g₁(X̄) − (1/2) log(2π) + o(1),
provided the prior g₁(θ) is a continuous function of θ and is positive at all θ. If we omit Rₙ from the right-hand side of (6.26), we have Schwarz's (1978) approximation to the Bayes factor via BIC (Section 4.3).

The logarithm of the P-value p corresponding to an observed X̄ is

log p = log 2[1 − Φ(√n |X̄|)] = −(n/2) X̄² (1 + o(1))

by the standard approximation to a normal tail (vide Feller (1973, p. 175) or Bahadur (1971, p. 1)). Thus (1/n) log p → −θ²/2 and, by (6.26), log B₀₁ → −∞. This result is true as long as |X̄| > c (log n/n)^(1/2), c > √2. Such deviations are called moderate deviations, vide Rubin and Sethuraman (1965). Of course, even for such P-values, p is of the order B₀₁/√n, so that P-values are smaller by an order of magnitude. The conflict in measuring evidence remains, but the decisions are the same. Ghosh et al. (2005) also pursue the comparison of the three measures of evidence based on the likelihood ratio, the P-value based on the likelihood ratio test, and the Bayes factor B₀₁ under general regularity conditions.

6.4.2 Pitman Alternative and Rescaled Priors

We consider once again the problem of testing H₀ : θ = 0 versus H₁ : θ ≠ 0 on the basis of a random sample from N(θ, 1). Suppose that the Pitman alternatives are the most important ones and the prior g₁(θ) under H₁ puts most of the mass on Pitman alternatives. One such prior is N(0, δ/n). Then
B₀₁ = √(δ + 1) exp[−(n X̄²/2) · δ/(δ + 1)].

If the P-value is close to zero, √n |X̄| is large and therefore B₀₁ is also close to zero, i.e., for these priors there is no paradox. The two measures are of the same order, but the result of Berger and Sellke (1987) for symmetric unimodal priors still implies that the P-value is smaller than the Bayes factor.
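The closed form for B₀₁ under the rescaled N(0, δ/n) prior can be cross-checked against its definition as a ratio of marginal densities of X̄; a sketch (the values of n, x̄, and δ are illustrative):

```python
import math
from scipy.stats import norm

def B01_rescaled(n, xbar, delta):
    # Closed form under the N(0, delta/n) prior on theta (sigma = 1)
    return math.sqrt(delta + 1) * math.exp(-0.5 * n * xbar ** 2 * delta / (delta + 1))

# Definition: B01 = f(xbar | theta = 0) / m1(xbar), where under H1 the
# marginal distribution of X-bar is N(0, (delta + 1)/n)
n, xbar, delta = 50, 0.4, 2.0
num = norm.pdf(xbar, 0.0, math.sqrt(1.0 / n))
den = norm.pdf(xbar, 0.0, math.sqrt((delta + 1.0) / n))
print(B01_rescaled(n, xbar, delta), num / den)  # the two coincide
```

Since δ is held fixed while the prior variance δ/n shrinks at the 1/√n rate of the Pitman alternatives, B₀₁ and the P-value shrink together, which is why no paradox arises for these priors.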
6.5 Bayesian P-value*

Even though valid Bayesian quantities such as the Bayes factor and the posterior probability of hypotheses are in principle the correct tools to measure the evidence for or against hypotheses, they are quite often, and especially in many practical situations, very difficult to compute. This is because either the alternatives are only very vaguely specified, vide (6.6), or very complicated. Also, in some cases one may not wish to compare two or more models but to check how a model fits the data. Bayesian P-values have been proposed to deal with such problems.

* Section 6.5 may be omitted at first reading.
Let M₀ be a target model, and suppose departure from this model is of interest. If, under this model, X has density f(x|η), η ∈ E, then for a Bayesian with prior π on η, the prior predictive distribution m_π(x) = ∫_E f(x|η) π(η) dη is the actual predictive distribution of X. Therefore, if a model departure statistic T(X) is available, then one can define the prior predictive P-value (or tail area under the predictive distribution) as

p = P^{m_π}(T(X) ≥ T(x_obs) | M₀),
where x_obs is the observed value of X (see Box (1980)). Although it is true that this is a valid Bayesian quantity for model checking, and it is useful in situations such as the ones described in Exercise 13 or Exercise 14, it does face the criticism that it may be influenced to a large extent by the prior π, as can be seen in the following example.

Example 6.18. Let X₁, X₂, ..., Xₙ be a random sample from N(θ, σ²) with both θ and σ² unknown. It is of interest to detect discrepancy in the mean of the model, with the target model being M₀ : θ = 0. Note that T = √n X̄ (actually its absolute value) is the natural model departure statistic for checking this.

(a) Case 1. It is felt a priori that σ² is known, or equivalently, we choose the prior on σ² that puts all its mass at some known constant σ₀². Then under M₀ there are no unknown parameters, and hence the prior predictive P-value is simply 2(1 − Φ(√n |x̄_obs|/σ₀)), where x̄_obs is the observed value of X̄. This can highly overestimate the evidence against M₀ if σ₀² happens to underestimate the actual model variance.

(b) Case 2. Consider the usual non-informative prior on σ²: π(σ²) ∝ 1/σ². Then

m_π(x) = ∫₀^∞ f(x|σ²) π(σ²) dσ²,

which is an improper density, thus completely disallowing computation of the prior predictive P-value.

(c) Case 3. Consider an inverse Gamma prior IG(ν, β) with the following density for σ²:

π(σ²|ν, β) = (β^ν/Γ(ν)) (σ²)^{−(ν+1)} exp(−β/σ²) for σ² > 0,

where ν and β are specified positive constants. Because T|σ² ~ N(0, σ²), under this prior the predictive density of T is then

m_π(t) = ∫₀^∞ f_T(t|σ²) π(σ²|ν, β) dσ²
Table 6.6. Normal Example: Prior Predictive P-values

ν   .5    .5    .5    1     1     1     2     2     2     5      5     5
β   .5    1     2     .5    1     2     .5    1     2     .5     1     2
p   .300  .398  .506  .109  .189  .300  .017  .050  .122  .0001  .001  .011
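As derived below, under the IG(ν, β) prior the statistic T/√(β/ν) has a t_{2ν} prior predictive distribution, so the entries of Table 6.6 can be reproduced in a few lines; a sketch (the function name is ours, and √n x̄_obs = 1.96 as in the text):

```python
from scipy.stats import t

def prior_pred_pvalue(nu, beta, root_n_xbar=1.96):
    # T / sqrt(beta/nu) ~ t_{2 nu} under the prior predictive distribution,
    # so the two-sided tail area uses the t survival function
    return 2 * t.sf(abs(root_n_xbar) / (beta / nu) ** 0.5, df=2 * nu)

for nu in (0.5, 1, 2, 5):
    row = [round(prior_pred_pvalue(nu, beta), 3) for beta in (0.5, 1, 2)]
    print(nu, row)   # reproduces the columns of Table 6.6
```

Note how strongly the tail area moves with ν and β, which is exactly the sensitivity to the prior that the example is meant to exhibit.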
m_π(t) ∝ ∫₀^∞ exp(−(1/σ²)(β + t²/2)) (σ²)^{−(ν+1+1/2)} dσ² ∝ (2β + t²)^{−(2ν+1)/2}.

If 2ν is an integer, then under this predictive distribution

T/√(β/ν) ~ t_{2ν}.

Thus we obtain
p = P^{m_π}(|X̄| ≥ |x̄_obs| | M₀)
  = P^{m_π}(|T|/√(β/ν) ≥ √n |x̄_obs|/√(β/ν) | M₀)
  = 2(1 − F_{2ν}(√n |x̄_obs|/√(β/ν))),

where F_{2ν} is the c.d.f. of t_{2ν}. For √n x̄_obs = 1.96 and various values of ν and β, the corresponding values of the prior predictive P-value are displayed in Table 6.6. Further, note that p → 1 as β → ∞ for any fixed ν > 0. Thus it is clear that the prior predictive P-value, in this example, depends crucially on the values of ν and β.

What can be readily seen in this example is that if the prior π used is a poor choice, even an excellent model can come under suspicion upon employing the prior predictive P-value. Further, as indicated above, non-informative priors that are improper (thus making m_π improper too) will not allow computing such a tail area, a further undesirable feature. To rectify these problems, Guttman (1967), Rubin (1984), Meng (1994), and Gelman et al. (1996) suggest modifications, replacing π in m_π by π(η|x_obs):
m*(x|x_obs) = ∫ f(x|η) π(η|x_obs) dη, and

p* = P^{m*(·|x_obs)}(T(X) ≥ T(x_obs)).

This is called the posterior predictive P-value. This removes some of the difficulties cited above. However, this version of Bayesian P-value has come
under severe criticism also. Bayarri and Berger (1998a) note that these modified quantities are not really Bayesian. To see this, they observe that there is "double use" of data in the above modifications: first to convert a (possibly improper) π(η) into a proper π(η|x_obs), and then again in computing the tail area of T(X) corresponding to T(x_obs). Furthermore, for large sample sizes, the posterior distribution of η will essentially concentrate at η̂, the MLE of η, so that m*(x|x_obs) will essentially equal f(x|η̂), a non-Bayesian object. In other words, the criticism is that, for large sample sizes, the posterior predictive P-value will not achieve anything more than rediscovering the classical approach. Let us consider Example 6.18 again.

Example 6.19. (Example 6.18, continued.) Let us consider the non-informative prior π(σ²) ∝ 1/σ² again. Then, as before, because T|σ² ~ N(0, σ²), and

π(σ²|x_obs) ∝ exp(−(1/(2σ²)) Σ_{i=1}^n x_i²) (σ²)^{−(n/2+1)}
           = exp(−(n/(2σ²))(x̄²_obs + s²_obs)) (σ²)^{−(n/2+1)},

where s²_obs = (1/n) Σ_{i=1}^n (x_i − x̄_obs)² is the observed value of s²,

the posterior predictive distribution of T is
m*(t|x_obs) ∝ ∫₀^∞ (σ²)^{−1/2} exp(−t²/(2σ²)) (σ²)^{−(n/2+1)} exp(−(n/(2σ²))(x̄²_obs + s²_obs)) dσ²
           ∝ ∫₀^∞ exp(−(1/(2σ²)){n(x̄²_obs + s²_obs) + t²}) (σ²)^{−((n+1)/2+1)} dσ²
           ∝ [1 + t²/(n(x̄²_obs + s²_obs))]^{−(n+1)/2}.
Therefore, we see that, under the posterior predictive distribution,

T/√(x̄²_obs + s²_obs) ~ t_n.

Thus we obtain the posterior predictive P-value to be

p* = P^{m*(·|x_obs)}(|X̄| ≥ |x̄_obs| | M₀)
   = P^{m*(·|x_obs)}(|T|/√(x̄²_obs + s²_obs) ≥ √n |x̄_obs|/√(x̄²_obs + s²_obs) | M₀)
   = 2(1 − F_n(√n |x̄_obs|/√(x̄²_obs + s²_obs))),

where F_n is the distribution function of t_n. This definition of a Bayesian P-value does not seem satisfactory. Let |x̄_obs| → ∞. Note that then p* → 2(1 − F_n(√n)). Actually, p* decreases to this lower bound as |x̄_obs| → ∞.
Table 6.7. Values of pₙ = 2(1 − Fₙ(√n))

n    1     2     3     4     5     6     7     8     9     10
pₙ   .500  .293  .182  .116  .076  .050  .033  .022  .015  .010
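Table 6.7 can be reproduced directly from the lower-bound expression; a minimal sketch:

```python
from math import sqrt
from scipy.stats import t

# Lower bound of the posterior predictive P-value as |xbar_obs| -> infinity:
# p_n = 2 * (1 - F_n(sqrt(n))), with F_n the c.d.f. of the t_n distribution
p_lower = {n: 2 * t.sf(sqrt(n), df=n) for n in range(1, 11)}
print({n: round(p, 3) for n, p in p_lower.items()})
# matches Table 6.7: .500, .293, .182, ..., .010 for n = 1, ..., 10
```

That the bound depends on n alone, and not on the data, is precisely the defect being criticized: an arbitrarily extreme observation cannot push the P-value below this floor.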
Values of this lower bound for different n are shown in Table 6.7. Note that these values have no serious relationship with the observations and hence cannot really be used for model checking. Bayarri and Berger (1998a) attribute this behavior to the 'double' use of the data, namely, the use of x in computing both the posterior distribution and the tail area probability of the posterior predictive distribution.

In an effort to combine the desirable features of the prior predictive P-value and the posterior predictive P-value and to eliminate the undesirable features, Bayarri and Berger (see Bayarri and Berger (1998a)) introduce the conditional predictive P-value. This quantity is based on the prior predictive distribution m_π but is more heavily influenced by the model than the prior. Further, noninformative priors can be used, and there is no double use of the data. The steps are as follows. An appropriate statistic U(X), not involving the model departure statistic T(X), is identified, the conditional predictive distribution m(t|u) is derived, and the conditional predictive P-value is defined as

p_c = P^{m(·|u_obs)}(T(X) ≥ T(x_obs)),

where u_obs = U(x_obs). The following example is from Bayarri and Berger (1998a).

Example 6.20. (Example 6.18, continued.) T = √n X̄ is the model departure statistic for checking discrepancy of the mean in the normal model. Let U(X) = s² = (1/n) Σ_{i=1}^n (X_i − X̄)². Note that nU|σ² ~ σ² χ²_{n−1}. Consider π(σ²) ∝ 1/σ² again. Then

π(σ²|U = s²_obs) ∝ (σ²)^{−((n−1)/2+1)} exp(−n s²_obs/(2σ²))

is the density of an inverse Gamma distribution, and hence the conditional predictive density of T given s²_obs is

m(t|s²_obs) = ∫₀^∞ f_T(t|σ²) π(σ²|s²_obs) dσ²
            ∝ ∫₀^∞ exp(−(1/(2σ²)){n s²_obs + t²}) (σ²)^{−(n/2+1)} dσ²
            ∝ [1 + t²/(n s²_obs)]^{−n/2}.

Thus, under the conditional predictive distribution,
T √((n−1)/n) / s_obs ~ t_{n−1},

and hence we obtain the conditional predictive P-value to be

p_c = P^{m(·|s²_obs)}(|X̄| ≥ |x̄_obs| | M₀)
    = 2(1 − F_{n−1}(√(n−1) |x̄_obs|/s_obs)).
We have thus found a Bayesian interpretation of the classical P-value from the usual t-test. It is worth noting that s²_obs was used to produce the posterior distribution that eliminates σ², and that x̄_obs was then used to compute the tail area probability. It is also to be noted that in this example it was easy to find a U(X) that eliminates σ² upon conditioning, and that the conditional predictive distribution is a standard one. In general, however, even though this procedure seems satisfactory from a Bayesian point of view, there are problems related to identifying a suitable U(X) and also to computing tail areas from the (quite often intractable) m(t|u_obs).

Another possibility is the partial posterior predictive P-value (see Bayarri and Berger (1998a) again), defined as follows:

p* = P^{m*(·)}(T(X) ≥ T(x_obs)),

where the predictive density m* is obtained using a partial posterior density π* that does not use the information contained in t_obs = T(x_obs) and is given by

m*(t) = ∫ f_T(t|η) π*(η) dη,

with the partial posterior π* defined as

π*(η) ∝ f_{X|T}(x_obs|t_obs, η) π(η) = f(x_obs|η) π(η) / f_T(t_obs|η).
Consider Example 6.18 again with π(σ²) ∝ 1/σ². Note that because the X_i are i.i.d. N(0, σ²) under M₀,

f_X(x_obs|σ²) ∝ (σ²)^{−n/2} exp(−(n/(2σ²))(x̄²_obs + s²_obs)) and
f_T(√n x̄_obs|σ²) ∝ (σ²)^{−1/2} exp(−(n/(2σ²)) x̄²_obs),

so that

f_{X|T}(x_obs|t_obs, σ²) ∝ (σ²)^{−(n−1)/2} exp(−(n/(2σ²)) s²_obs).

Therefore,

π*(σ²) ∝ (σ²)^{−((n−1)/2+1)} exp(−(n/(2σ²)) s²_obs),

and this is the same as π(σ²|s²_obs), which was used to obtain the conditional predictive density earlier. Thus, in this example, the partial posterior predictive P-value is the same as the conditional predictive P-value. Because this alternative version p* does not require the choice of the statistic U, it appears this method may be used for any suitable goodness-of-fit test statistic T. However, we have not seen such work.
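The equality of the conditional (equivalently, partial posterior) predictive P-value with the classical t-test P-value, derived in this example, can be confirmed numerically; a sketch on simulated data (the data-generating parameters are illustrative):

```python
import numpy as np
from scipy.stats import t, ttest_1samp

rng = np.random.default_rng(0)
x = rng.normal(0.3, 1.2, size=20)
n, xbar = len(x), x.mean()
s = np.sqrt(np.mean((x - xbar) ** 2))      # s^2 with divisor n, as in the text

# Conditional predictive P-value: 2 * (1 - F_{n-1}(sqrt(n-1) |xbar| / s))
p_cond = 2 * t.sf(np.sqrt(n - 1) * abs(xbar) / s, df=n - 1)

# Classical two-sided one-sample t-test of H0: mean = 0
p_classical = ttest_1samp(x, 0.0).pvalue

print(p_cond, p_classical)   # identical up to floating-point error
```

The agreement is exact because √(n−1) x̄/s equals the usual t-statistic √n x̄/S with S² the divisor-(n−1) sample variance.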
6.6 Robust Bayesian Outlier Detection

Because a Bayes factor is a weighted likelihood ratio, it can also be used for checking whether an observation should be considered an outlier with respect to a certain target model relative to an alternative model. One such approach is as follows. Recall the model selection set-up as given in (6.1). X, having density f(x|θ), is observed, and it is of interest to compare two models M₀ and M₁ given by

M₀ : X has density f(x|θ), where θ ∈ Θ₀;
M₁ : X has density f(x|θ), where θ ∈ Θ₁.

For i = 0, 1, g_i(θ) is the prior density of θ, conditional on M_i being the true model. To compare M₀ and M₁ on the basis of a random sample x = (x₁, ..., xₙ), the Bayes factor is given by

B(x) = m₀(x)/m₁(x),

where m_i(x) = ∫_{Θ_i} f(x|θ) g_i(θ) dθ for i = 0, 1. To measure the effect of observation x_d on the Bayes factor, one could use the quantity

k_d = log(B(x)/B(x_{−d})),

where B(x_{−d}) is the Bayes factor excluding observation x_d. If k_d < 0, then when observation x_d is deleted there is an increase of evidence for M₀. Consequently, observation x_d itself favors model M₁. The extent to which x_d favors M₁ determines whether it can be considered an outlier under model M₀. Similarly, a positive value of k_d implies that x_d favors M₀. Pettit and Young (1990) discuss how k_d can be used effectively to detect outliers when the prior is non-informative. The same analysis can be done with informative priors also. This assumes that the conditional prior densities g₀ and g₁ can be fully
specified. However, we take the robust Bayesian point of view that only certain broad features of these densities, such as symmetry and unimodality, can be specified, and hence we can only state that g₀ or g₁ belongs to a certain class of densities as determined by the features specified. Because k_d, derived from the Bayes factor, is the Bayesian quantity of inferential interest here, upper and lower bounds on k_d over classes of prior densities are required.

We shall illustrate this approach with a precise null hypothesis. Then we have the problem of comparing M₀ : θ = θ₀ versus M₁ : θ ≠ θ₀ using a random sample from a population with density f(x|θ). Under M₁, suppose θ has the prior density g, g ∈ Γ. The Bayes factors with all the observations and without the dth observation, respectively, are

B_g(x) = f(x|θ₀) / ∫_{θ≠θ₀} f(x|θ) g(θ) dθ,
B_g(x_{−d}) = f(x_{−d}|θ₀) / ∫_{θ≠θ₀} f(x_{−d}|θ) g(θ) dθ.

Because f(x|θ) = f(x_d|θ) f(x_{−d}|θ), we get

k_{d,g} = log f(x_d|θ₀) − log[∫_{θ≠θ₀} f(x|θ) g(θ) dθ / ∫_{θ≠θ₀} f(x_{−d}|θ) g(θ) dθ].   (6.28)
Now note that to find the extreme values of k_{d,g}, it is enough to find the extreme values of

h_{d,g} = ∫_{θ≠θ₀} f(x|θ) g(θ) dθ / ∫_{θ≠θ₀} f(x_{−d}|θ) g(θ) dθ   (6.29)

over the set Γ. Further, this optimization problem can be rewritten as follows:

sup_{g∈Γ} h_{d,g} = sup_{g*∈Γ*} ∫_{θ≠θ₀} f(x_d|θ) g*(θ) dθ,   (6.30)
inf_{g∈Γ} h_{d,g} = inf_{g*∈Γ*} ∫_{θ≠θ₀} f(x_d|θ) g*(θ) dθ,   (6.31)
where

Γ* = {g* : g*(θ) = f(x_{−d}|θ) g(θ) / ∫_{u≠θ₀} f(x_{−d}|u) g(u) du, g ∈ Γ}.

Equations (6.30) and (6.31) indicate how optimization of a ratio of integrals can be transformed into optimization of the integrals themselves. Consider the case where Γ is the class A of all prior densities on the set {θ : θ ≠ θ₀}. Then we have the following result.

Theorem 6.21. If f(x_{−d}|θ) > 0 for each θ ≠ θ₀, then

sup_{g∈A} h_{d,g} = sup_{θ≠θ₀} f(x_d|θ),   (6.32)
inf_{g∈A} h_{d,g} = inf_{θ≠θ₀} f(x_d|θ).   (6.33)
Proof. From (6.30) and (6.31) above,

sup_{g∈A} h_{d,g} = sup_{g*∈A*} ∫_{θ≠θ₀} f(x_d|θ) g*(θ) dθ,

where A* is the class Γ* above with Γ = A. Now note that the extreme points of A* are point masses. The proof for the infimum is similar.

The corresponding extreme values of k_d are

sup_{g∈A} k_{d,g} = log f(x_d|θ₀) − log inf_{θ≠θ₀} f(x_d|θ),   (6.34)
inf_{g∈A} k_{d,g} = log f(x_d|θ₀) − log sup_{θ≠θ₀} f(x_d|θ).   (6.35)
Example 6.22. Suppose we have a sample of size n from N(θ, σ²) with known σ². Then, from (6.34) and (6.35),

sup_{g∈A} k_{d,g} = (1/(2σ²)) sup_{θ≠θ₀} [(x_d − θ)² − (x_d − θ₀)²] = ∞, and
inf_{g∈A} k_{d,g} = (1/(2σ²)) inf_{θ≠θ₀} [(x_d − θ)² − (x_d − θ₀)²] = −(x_d − θ₀)²/(2σ²).

It can be readily seen from the above bounds on k_d that no observation, however large in magnitude, will be considered an outlier here. This is because A is too large a class of prior densities.
Instead, consider the class G of all N(θ₀, τ²) priors with τ² ≥ τ₀² > 0. Note that τ² close to 0 will make M₁ indistinguishable from M₀, and hence it is necessary to consider τ² bounded away from 0. Then, for g ∈ G,

h_{d,g} = ∫_{θ≠θ₀} f(x_d|θ) f(x_{−d}|θ) g(θ) dθ / ∫_{θ≠θ₀} f(x_{−d}|θ) g(θ) dθ = ∫ f(x_d|θ) g*(θ) dθ,

where g* is the density of N(m, δ²) with

m = m(x̄_{−d}, τ²) = [(n−1) τ² x̄_{−d} + σ² θ₀] / [(n−1) τ² + σ²],
δ² = δ²(τ²) = τ² σ² / [(n−1) τ² + σ²].

Note, therefore, that h_{d,g} = h_{d,g}(x_d) is just the density of N(m, σ² + δ²) evaluated at x_d. Thus,

h_{d,g}(x_d) = (2π(σ² + δ²))^{−1/2} exp(−(x_d − m)²/(2(σ² + δ²))).

For each x_d, one just needs to examine graphically the extremes of the expression above as a function of τ² to determine whether that particular observation should be considered an outlier. Delampady (1999) discusses these results and also results for some larger nonparametric classes of prior densities.
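A small numerical illustration of this check; a sketch (the data, σ² = 1, θ₀ = 0, τ₀² = 0.5, and the grid of τ² values are all illustrative, and the function names are ours):

```python
import numpy as np
from scipy.stats import norm

def h_dg(x, d, tau2, sigma2=1.0, theta0=0.0):
    # h_{d,g}(x_d): the N(m, sigma^2 + delta^2) density at x_d, for g = N(theta0, tau^2)
    xd, rest = x[d], np.delete(x, d)
    k = len(rest)                      # n - 1
    m = (k * tau2 * rest.mean() + sigma2 * theta0) / (k * tau2 + sigma2)
    delta2 = tau2 * sigma2 / (k * tau2 + sigma2)
    return norm.pdf(xd, loc=m, scale=np.sqrt(sigma2 + delta2))

def k_d(x, d, tau2):
    # k_{d,g} = log f(x_d | theta0) - log h_{d,g}(x_d), with f the N(0, 1) density
    return norm.logpdf(x[d]) - np.log(h_dg(x, d, tau2))

x = np.array([0.1, -0.4, 0.3, 0.2, -0.1, 6.0])   # last observation is suspect
grid = np.linspace(0.5, 20.0, 100)                # tau^2 >= tau0^2 = 0.5
for d in (0, 5):
    ks = [k_d(x, d, t2) for t2 in grid]
    print(d, min(ks), max(ks))
# k_d stays negative over the whole grid for the suspect point (it favors M1,
# i.e., looks like an outlier under M0) and positive for a typical point
```

Scanning k_d over the τ² grid plays the role of the graphical examination suggested in the text.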
6.7 Nonsubjective Bayes Factors*

Consider two models M₀ and M₁ for data X, with density f_i(x|θ_i) under model M_i, θ_i being an unknown parameter of dimension p_i, i = 0, 1. Given prior specifications g_i(θ_i) for parameter θ_i, the Bayes factor of M₁ to M₀ is obtained as

B₁₀ = m₁(x)/m₀(x) = ∫ f₁(x|θ₁) g₁(θ₁) dθ₁ / ∫ f₀(x|θ₀) g₀(θ₀) dθ₀.   (6.36)

Here m_i(x) is the marginal density of X under M_i, i = 0, 1. When subjective specification of prior distributions is not possible, which is frequently the case, one would look for an automatic method that uses standard noninformative priors.

* Section 6.7 may be omitted at first reading.
There are, however, difficulties with (6.36) for noninformative priors, which are typically improper. If the g_i are improper, they are defined only up to arbitrary multiplicative constants c_i; c_i g_i has as much validity as g_i. This implies that (c₁/c₀)B₁₀ has as much validity as B₁₀. Thus the Bayes factor is determined only up to an arbitrary multiplicative constant. This indeterminacy, noted by Jeffreys (1961), has been the main motivation for new objective methods. We shall confine ourselves mainly to the nested case, where f₀ and f₁ are of the same functional form and f₀(x|θ₀) is the same as f₁(x|θ₁) with some of the coordinates of θ₁ specified. However, the methods described below can also be used for non-nested models. It may be mentioned here that use of a diffuse (flat) proper prior does not provide a good solution to the problem. Also, truncation of noninformative priors leads to a large penalty for the more complex model. An example follows.

Example 6.23. (Testing normal mean with known variance.) Suppose we observe X = (X₁, ..., Xₙ). Under M₀, the X_i are i.i.d. N(0, 1), and under M₁, the X_i are i.i.d. N(θ, 1), where θ ∈ R is the unknown mean. With the uniform noninformative prior g₁(θ) = c for θ under M₁, the Bayes factor of M₁ to M₀ is given by

B₁₀^U = √(2π) c n^{−1/2} exp[nX̄²/2].

If one uses a uniform prior over −K < θ < K, then c = 1/(2K), and B₁₀^U can be made arbitrarily small by choosing K large. Similarly, for a proper N(0, τ²) prior under M₁,

B₁₀^{Nor} = (nτ² + 1)^{−1/2} exp[(nX̄²/2) · nτ²/(nτ² + 1)],

which is approximately (nτ²)^{−1/2} exp[nX̄²/2] for large values of nτ², and hence can be made arbitrarily small by choosing a large value of τ². Also, B₁₀^{Nor} is highly non-robust with respect to the choice of τ², and this non-robustness plays the same role as the indeterminacy. The expressions for B₁₀^U and B₁₀^{Nor} clearly indicate the similar behavior of these Bayes factors and the similar roles of √(2π)c and (τ² + 1/n)^{−1/2}.

A solution to the above problem with improper priors is to use part of the data as a training sample. The data are divided into two parts, X = (X₁, X₂). The first part X₁ is used as a training sample to obtain proper posterior distributions for the parameters (given X₁) starting from the noninformative priors.
These proper posteriors are then used as priors to compute the Bayes factor with the remainder of the data (X₂). This conditional Bayes factor, conditioned on X₁, can be expressed as

B₁₀(X₁) = ∫ f₁(X₂|θ₁) g₁(θ₁|X₁) dθ₁ / ∫ f₀(X₂|θ₀) g₀(θ₀|X₁) dθ₀
        = [m₁(X)/m₀(X)] · [∫ f₀(X₁|θ₀) g₀(θ₀) dθ₀ / ∫ f₁(X₁|θ₁) g₁(θ₁) dθ₁]
        = B₁₀ · m₀(X₁)/m₁(X₁),   (6.37)

where m_i(X₁) is the marginal density of X₁ under M_i, i = 0, 1. Note that if the priors c_i g_i, i = 0, 1, are used to compute B₁₀(X₁), the arbitrary constant multiplier c₁/c₀ of B₁₀ is cancelled by the factor c₀/c₁ of m₀(X₁)/m₁(X₁), so that the indeterminacy of the Bayes factor is removed in (6.37).

A part of the data, X₁, may be used as a training sample as described above if the corresponding posteriors g_i(θ_i|X₁), i = 0, 1, are proper or, equivalently, the marginal densities m_i(X₁) of X₁ under M_i, i = 0, 1, are finite. One would naturally use a minimal amount of data as such a training sample, leaving most of the data for model comparison. As in Berger and Pericchi (1996a), a training sample X₁ may be called proper if 0 < m_i(X₁) < ∞, i = 0, 1, and minimal if it is proper and no subset of it is proper.

Example 6.24. (Testing normal mean with known variance.) Consider the setup of Example 6.23 and the uniform noninformative prior g₁(θ) = 1 for θ under M₁. The minimal training samples are subsamples of size 1, with m₀(X_i) = (1/√(2π)) e^{−X_i²/2} and m₁(X_i) = 1.

Example 6.25. (Testing normal mean with variance unknown.) Let X = (X₁, ..., Xₙ).

M₀ : X₁, ..., Xₙ are i.i.d. N(0, σ₀²),
M₁ : X₁, ..., Xₙ are i.i.d. N(μ, σ₁²).

Consider the noninformative priors g₀(σ₀) = 1/σ₀ under M₀ and g₁(μ, σ₁) = 1/σ₁ under M₁. Here m_i(X_i) = ∞ for a single observation X_i; a minimal training sample consists of two distinct observations X_i, X_j, and for such a training sample (X_i, X_j),

m₀(X_i, X_j) = 1/(2π(X_i² + X_j²))  and  m₁(X_i, X_j) = 1/(2|X_i − X_j|).   (6.38)
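The two closed forms in (6.38) can be verified by direct numerical integration of the likelihood against the noninformative priors; a sketch for one illustrative pair of observations:

```python
import math
from scipy import integrate

xi, xj = 1.3, -0.6

def f_pair(mu, sigma):
    # joint N(mu, sigma^2) density of the pair (xi, xj)
    return (2 * math.pi * sigma ** 2) ** -1 * math.exp(
        -((xi - mu) ** 2 + (xj - mu) ** 2) / (2 * sigma ** 2))

# m0: mu fixed at 0, noninformative prior 1/sigma_0
m0, _ = integrate.quad(lambda s: f_pair(0.0, s) / s, 0, math.inf)
# m1: integrate over mu as well, with prior 1/sigma_1
m1, _ = integrate.dblquad(lambda mu, s: f_pair(mu, s) / s,
                          0, math.inf, -math.inf, math.inf)

print(m0, 1 / (2 * math.pi * (xi ** 2 + xj ** 2)))   # agree
print(m1, 1 / (2 * abs(xi - xj)))                    # agree
```

Both marginals are finite for any pair with X_i ≠ X_j, confirming that two distinct observations form a proper (and minimal) training sample here.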
6.7.1 The Intrinsic Bayes Factor

As described above, a solution to the problem with improper priors is obtained using a conditional Bayes factor B₁₀(X₁), conditioned on a training sample X₁. However, this conditional Bayes factor depends on the choice of
the training sample X₁. Let X(l), l = 1, 2, ..., L, be the list of all possible minimal training samples. Berger and Pericchi (1996a) suggest considering all these minimal training samples and averaging the corresponding L conditional Bayes factors B₁₀(X(l)) to obtain what is called the intrinsic Bayes factor (IBF). For example, taking an arithmetic average leads to the arithmetic intrinsic Bayes factor (AIBF)

AIBF₁₀ = B₁₀ · (1/L) Σ_{l=1}^L m₀(X(l))/m₁(X(l)),   (6.39)

and the geometric average gives the geometric intrinsic Bayes factor (GIBF)

GIBF₁₀ = B₁₀ · [∏_{l=1}^L m₀(X(l))/m₁(X(l))]^{1/L},   (6.40)
the sum and product in (6.39) and (6.40) being taken over the L possible training samples X(l), l = 1, ..., L. Berger and Pericchi (1996a) also suggest using trimmed averages or the median (complete trimming) of the conditional Bayes factors when taking an average of all of them does not seem reasonable (e.g., when the conditional Bayes factors vary greatly). The AIBF and GIBF have good properties but are affected by outliers. If the sample size is very small, using a part of the sample as a training sample may be impractical, and Berger and Pericchi (1996a) recommend using expected intrinsic Bayes factors, which replace the averages in (6.39) and (6.40) by their expectations evaluated at the MLE under the more complex model M₁. For more details, see Berger and Pericchi (1996a). Situations in which the IBF reduces simply to the Bayes factor B₁₀ with respect to the noninformative priors are given in Berger et al. (1998). The AIBF is justified by the possibility of its correspondence with actual Bayes factors with respect to "reasonable" priors, at least asymptotically. Berger and Pericchi (1996a, 2001) have argued that these priors, known as "intrinsic" priors, may be considered natural "default" priors for the testing problems. Intrinsic priors are discussed in Subsection 6.7.3.

6.7.2 The Fractional Bayes Factor

O'Hagan (1994, 1995) proposes a solution that uses a fractional part of the full likelihood in place of using parts of the sample as training samples and averaging over them. The resulting "partial" Bayes factor, called the fractional Bayes factor (FBF), is given by
FBF₁₀ = m₁(X, b)/m₀(X, b),

where b is a fraction and

m_i(X, b) = ∫ f_i(X|θ_i) g_i(θ_i) dθ_i / ∫ [f_i(X|θ_i)]^b g_i(θ_i) dθ_i,   i = 0, 1.

Note that FBF₁₀ can also be written as

FBF₁₀ = B₁₀ · m₀^b(X)/m₁^b(X),

where

m_i^b(X) = ∫ [f_i(X|θ_i)]^b g_i(θ_i) dθ_i,   i = 0, 1.   (6.41)
To make the FBF comparable with the IBF, one may take b = m/n, where m is the size of a minimal training sample as defined above and n is the total sample size. O'Hagan also recommends other choices of b, e.g., b = √n/n or b = log n/n. We now illustrate through a number of examples.

Example 6.26. (Testing normal mean with known variance.) Consider the setup of Example 6.23. The Bayes factor with the noninformative prior g₁(θ) = 1 was obtained as

B₁₀ = √(2π) n^{−1/2} exp[nX̄²/2] = √(2π) n^{−1/2} Λ₁₀,

where Λ₁₀ is the likelihood ratio statistic. The Bayes factor conditioned on X_i is

B₁₀(X_i) = B₁₀ m₀(X_i)/m₁(X_i) = B₁₀ (1/√(2π)) exp(−X_i²/2).

Thus

AIBF₁₀ = n^{−1} Σ_{i=1}^n B₁₀(X_i) = n^{−3/2} exp(nX̄²/2) Σ_{i=1}^n exp(−X_i²/2),

GIBF₁₀ = n^{−1/2} exp[nX̄²/2 − (1/(2n)) Σ_{i=1}^n X_i²].

The median IBF (MIBF) is obtained as the median of the set of values B₁₀(X_i), i = 1, 2, ..., n. The FBF with a fraction b is FBF₁₀ = √b exp[(1 − b) nX̄²/2], which for the minimal choice b = 1/n becomes n^{−1/2} exp[(n − 1)X̄²/2].
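These quantities are easy to compute on data; a sketch with simulated observations (the data-generating parameters are illustrative; note that GIBF ≤ AIBF always, by the arithmetic-geometric mean inequality):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.5, 1.0, size=30)
n, xbar = len(x), x.mean()

B10 = np.sqrt(2 * np.pi / n) * np.exp(n * xbar ** 2 / 2)
B10_i = B10 * np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)   # conditional on X_i

AIBF = B10_i.mean()                     # arithmetic intrinsic Bayes factor
GIBF = np.exp(np.log(B10_i).mean())     # geometric intrinsic Bayes factor
MIBF = np.median(B10_i)                 # median intrinsic Bayes factor
print(AIBF, GIBF, MIBF)
```

The geometric average can also be checked against the closed form for GIBF₁₀ given in the example, since taking logs turns the product of conditional Bayes factors into the average of −X_i²/2 terms.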
Example 6.27. (Testing normal mean with variance unknown.) Consider the setup of Example 6.25. For the standard noninformative priors considered in this example, we have

B₁₀ = (√π Γ((n−1)/2))/(√n Γ(n/2)) · (Σ_{i=1}^n X_i²)^{n/2} / (Σ_{i=1}^n (X_i − X̄)²)^{(n−1)/2}.

By (6.38), the ratio m₀/m₁ for a minimal training sample (X_i, X_j) is |X_i − X_j|/(π(X_i² + X_j²)), so that, for example,

AIBF₁₀ = B₁₀ · (1/L) Σ_{i<j} |X_i − X_j|/(π(X_i² + X_j²)),

with L = n(n−1)/2, and

FBF₁₀ = (Γ((n−1)/2))/(√π Γ(n/2)) · [Σ_{i=1}^n X_i² / Σ_{i=1}^n (X_i − X̄)²]^{(n−2)/2},

with b = 2/n.

Example 6.28. (Normal linear model.) This example is from Berger and Pericchi (1996b, 2001). Berger and Pericchi determined the IBF for linear models in both the nested and the non-nested case. We consider here only the nested case. Suppose for the data Y (n × 1) we consider the linear models

M_i : Y = X_i β_i + ε_i,   ε_i ~ N_n(0, σ_i² I_n),   i = 0, 1,
where β_i = (β_{i1}, β_{i2}, ..., β_{ip_i})′ and σ_i² are unknown parameters, and X_i is an n × p_i known design matrix of rank p_i < n. Consider priors of the form

g_i(β_i, σ_i) ∝ σ_i^{−(1+q_i)}.

Here q_i = 0 gives the reference prior of Berger and Bernardo (1992a), and q_i = p_i corresponds with the Jeffreys prior. For the nested case, when M₀ is nested in M₁, Berger and Pericchi (1996b) consider a modified Jeffreys prior for which q₀ = 0 and q₁ = p₁ − p₀. The integrated likelihoods m_i(Y) with these priors can be obtained as

m_i(Y) = C (2π)^{p_i/2} |X_i′X_i|^{−1/2} R_i^{−(n−p₀)/2},

where C is a constant not depending on i, and R_i is the residual sum of squares under M_i, i = 0, 1. The Bayes factor B₁₀ with the modified Jeffreys prior is then given by

B₁₀ = (2π)^{(p₁−p₀)/2} (|X₀′X₀|/|X₁′X₁|)^{1/2} (R₀/R₁)^{(n−p₀)/2}.
Also, one can see that a minimal training sample Y(l) in this case is a sample of size m = p₁ + 1 such that, for the corresponding design matrices X_i(l) (under M_i), the matrices X_i′(l)X_i(l), i = 0, 1, are nonsingular. The ratio m₀(Y(l))/m₁(Y(l)) can be obtained from the expression for B₁₀ by inverting it and replacing n, X₀, X₁, R₀, and R₁ by m, X₀(l), X₁(l), R₀(l), and R₁(l), respectively, where R_i(l) is the residual sum of squares corresponding to Y(l) under M_i, i = 0, 1. Thus the conditional Bayes factor B₁₀(Y(l)), conditioned on Y(l), is given by
BwiYil)) X[XX^'
\RJ
\X',{l)Xo{l)\'^' VRoW
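These formulas are easy to put in code. The following is a minimal sketch (not from the text) of B₁₀ and the conditional Bayes factors B₁₀(Y(l)) for a nested pair of linear models, with hypothetical simulated data and models M₀: intercept only, M₁: intercept plus one covariate; the arithmetic mean and median over minimal training samples give the AIBF and MIBF described next.

```python
import itertools
import numpy as np

def b10(Y, X0, X1, n_obs, p0):
    # B10 = (|X0'X0| / |X1'X1|)^(1/2) * (R0/R1)^((n_obs - p0)/2)
    def rss(X):
        beta = np.linalg.lstsq(X, Y, rcond=None)[0]
        resid = Y - X @ beta
        return float(resid @ resid)
    R0, R1 = rss(X0), rss(X1)
    det_ratio = np.linalg.det(X0.T @ X0) / np.linalg.det(X1.T @ X1)
    return np.sqrt(det_ratio) * (R0 / R1) ** ((n_obs - p0) / 2.0)

rng = np.random.default_rng(0)
n = 12
x = rng.normal(size=n)
Y = 1.0 + 0.5 * x + rng.normal(size=n)      # hypothetical data
X0 = np.ones((n, 1))                        # M0: intercept only (p0 = 1)
X1 = np.column_stack([np.ones(n), x])       # M1: intercept + slope (p1 = 2)
p0, p1 = 1, 2
m = p1 + 1                                  # minimal training sample size

B10_full = b10(Y, X0, X1, n, p0)
cond = []
for idx in itertools.combinations(range(n), m):
    idx = list(idx)
    X0l, X1l = X0[idx], X1[idx]
    if np.linalg.matrix_rank(X1l) < p1:     # Xi(l)'Xi(l) must be nonsingular
        continue
    # B10(Y(l)) = B10 * m0(Y(l))/m1(Y(l)) = B10 / b10 evaluated on Y(l)
    cond.append(B10_full / b10(Y[idx], X0l, X1l, m, p0))

aibf = float(np.mean(cond))     # arithmetic IBF
mibf = float(np.median(cond))   # median IBF
```

With continuous covariates essentially every size-m subsample is a valid training sample, so the loop enumerates all combinations; for larger n one would subsample them.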
194
6 Hypothesis Testing and Model Selection
One may now find an average of these conditional Bayes factors to obtain an IBF. For example, the arithmetic mean of the B₁₀(Y(l))'s over all possible minimal training samples Y(l) gives the AIBF, and their median gives the median IBF. In the case of the fractional Bayes factor, one obtains (see, for example, Berger and Pericchi, 2001, page 152), with mᵢᵇ(X) as defined in (6.41) and b = m/n,

FBF₁₀ = b^{(p₁−p₀)/2} ( R₀/R₁ )^{(n−m)/2}.

See also O'Hagan (1995) in this context. For more examples, see Berger and Pericchi (1996a, 1996b, 2001) and O'Hagan (1995). Several other methods have been proposed as solutions to the problem with noninformative priors. Smith and Spiegelhalter (1980) and Spiegelhalter and Smith (1982) propose the imaginary minimal sample device; see also Ghosh and Samanta (2002b) for a generalization. Berger and Pericchi (2001) present a comparison of four methods, including the IBF and FBF, with illustrations through a number of examples. Ghosh and Samanta (2002b) discuss a unified derivation of some of the methods, which shows that in some qualitative and conceptual sense these methods are close to each other.

6.7.3 Intrinsic Priors

Given a default Bayes factor such as the IBF or FBF, a natural question is whether it corresponds, at least approximately, with an actual Bayes factor based on some priors. If such priors exist, they are called intrinsic priors. A default Bayes factor such as the IBF can then be calculated as an actual Bayes factor using the intrinsic prior, and one need not consider all possible training samples and average over them. A "reasonable" intrinsic prior that corresponds to a naturally developed good default Bayes factor may be considered to be a natural default prior for the given testing or model selection problem. Conversely, a particular default Bayesian method may be evaluated on the basis of the corresponding intrinsic prior, depending on how "reasonable" that intrinsic prior is. Berger and Pericchi (1996a) describe how one can obtain intrinsic priors using an asymptotic argument. We begin with an example.

Example 6.29. (Example 6.26, continued.) Suppose that for some proper prior π(θ) under model M₁,

BF₁₀^π ≈ AIBF₁₀,    (6.43)
where BF₁₀^π denotes the Bayes factor based on the prior π(θ) under M₁. Using the Laplace approximation (Section 4.3) to the integrated likelihood under M₁, we have

m₁^π(X) ≈ f(X|θ̂) π(θ̂) (2π/n)^{1/2} |I|^{−1/2},

where θ̂ is the MLE of θ under M₁ and I is the observed Fisher information number. Thus, using the expression for the AIBF in this example, and noting that I = 1, (6.43) can be expressed as

π(θ̂) ≈ (1/√(2π)) (1/n) Σᵢ₌₁ⁿ exp(−Xᵢ²/2).

As the RHS converges to (1/√(2π)) E_θ(e^{−X₁²/2}) = (1/√(2π)) (1/√2) e^{−θ²/4} with probability one under any θ, this suggests the intrinsic prior

π(θ) = (1/√(4π)) exp(−θ²/4),

which is a N(0, 2) density. One can easily verify that BF₁₀^π / AIBF₁₀ → 1 with probability one under any θ, i.e., the AIBF is approximately the same as the Bayes factor with a N(0, 2) prior for θ under M₁. If one considers the FBF, one can show directly that the FBF with fraction b is exactly equal to the Bayes factor with a N(0, (b⁻¹ − 1)/n) prior.

Let us now consider the general case. Let B₁₀ be the Bayes factor of M₁ to M₀ with noninformative priors gᵢ(θᵢ) for θᵢ under Mᵢ, i = 0, 1. We illustrate below with the AIBF; the treatment for the other IBFs and the FBF is similar. Recall that AIBF₁₀ = B₁₀ B̄₀₁, where

B̄₀₁ = (1/L) Σ_{l=1}^{L} B₀₁(X(l)).
Suppose that for some priors πᵢ under Mᵢ, i = 0, 1, AIBF₁₀ is approximately equal to the Bayes factor based on π₀ and π₁, denoted BF₁₀(π₀, π₁). Using the Laplace approximation (Section 4.3) on both the numerator and denominator of B₁₀ (see (6.36)), AIBF₁₀ can be shown to be approximately equal to

[ f₁(X|θ̂₁) g₁(θ̂₁) (2π/n)^{p₁/2} |I₁|^{−1/2} / ( f₀(X|θ̂₀) g₀(θ̂₀) (2π/n)^{p₀/2} |I₀|^{−1/2} ) ] B̄₀₁,    (6.44)

where n denotes the sample size, pᵢ is the dimension of θᵢ, θ̂ᵢ is the MLE of θᵢ, and Iᵢ is the observed Fisher information matrix under Mᵢ, i = 0, 1. The same approximation applied to BF₁₀(π₀, π₁) yields the approximation

f₁(X|θ̂₁) π₁(θ̂₁) (2π/n)^{p₁/2} |I₁|^{−1/2} / ( f₀(X|θ̂₀) π₀(θ̂₀) (2π/n)^{p₀/2} |I₀|^{−1/2} )    (6.45)

to BF₁₀(π₀, π₁). We assume that conditions for the Laplace approximation hold for the given models. To find the intrinsic priors, we equate (6.44) with (6.45), and this yields

π₁(θ̂₁) / π₀(θ̂₀) = ( g₁(θ̂₁) / g₀(θ̂₀) ) B̄₀₁.    (6.46)
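As a quick aside, the almost-sure limit that produced the N(0, 2) intrinsic prior in Example 6.29 — E_θ(e^{−X₁²/2}) = (1/√2) e^{−θ²/4} for X₁ ~ N(θ, 1) — can be verified numerically (a sketch, not from the text):

```python
import numpy as np

# Verify E_theta[exp(-X^2/2)] = exp(-theta^2/4) / sqrt(2) for X ~ N(theta, 1)
# by brute-force numerical integration on a wide grid.
def lhs(theta, half_width=40.0, n_points=400_001):
    x, dx = np.linspace(theta - half_width, theta + half_width,
                        n_points, retstep=True)
    density = np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2.0 * np.pi)  # N(theta,1)
    return float(np.sum(np.exp(-0.5 * x ** 2) * density) * dx)

for theta in (0.0, 1.0, -2.5):
    rhs = np.exp(-theta ** 2 / 4.0) / np.sqrt(2.0)
    assert abs(lhs(theta) - rhs) < 1e-6
```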
Berger and Pericchi (1996a) obtain the intrinsic prior determining equations by taking limits on both sides of (6.46) under M₀ and M₁. Assume that, as n → ∞: under M₁, θ̂₁ → θ₁, θ̂₀ → a(θ₁), and B̄₀₁ → B₁*(θ₁); under M₀, θ̂₀ → θ₀, θ̂₁ → ψ(θ₀), and B̄₀₁ → B₀*(θ₀). The equations obtained by Berger and Pericchi (1996a) are

π₁(θ₁) / π₀(a(θ₁)) = ( g₁(θ₁) / g₀(a(θ₁)) ) B₁*(θ₁),
π₁(ψ(θ₀)) / π₀(θ₀) = ( g₁(ψ(θ₀)) / g₀(θ₀) ) B₀*(θ₀).    (6.47)

When M₀ is nested in M₁, Berger and Pericchi suggested the solution

π₀(θ₀) = g₀(θ₀),  π₁(θ₁) = g₁(θ₁) B₁*(θ₁).    (6.48)

However, this may not be the unique solution to (6.47). See also Dmochowski (1994) in this context.

Example 6.30. (Example 6.27, continued.) A solution to the intrinsic prior determining equations suggested by Berger and Pericchi (see (6.48)) is

π₀(σ₀) = 1/σ₀,  π₁(μ, σ₁) = (1/σ₁) B₁*(μ, σ₁),    (6.49)

where B₁*(μ, σ₁) = E^{M₁}_{(μ,σ₁)} B₀₁(X₁, X₂) and

B₀₁(X₁, X₂) = |X₁ − X₂| / ( π (X₁² + X₂²) ).

Note that B₀₁(X₁, X₂) can be expressed as

B₀₁(X₁, X₂) = √(2Z₁) / ( π σ₁ (Z₁ + Z₂) ),

where Z₁ = (X₁ − X₂)²/(2σ₁²) ~ χ₁², and Z₂ = (X₁ + X₂)²/(2σ₁²) ~ noncentral χ² with d.f. = 1 and noncentrality parameter λ = 2μ²/σ₁². Also, Z₁ and Z₂ are independent. Using the representation of a noncentral χ² density as an (infinite) weighted sum of central χ² densities, we have
E[ √(2Z₁)/(Z₁ + Z₂) ] = Σ_{j=0}^{∞} ( e^{−λ/2} (λ/2)^j / j! ) E[ √(2Z₁)/(Z₁ + Wⱼ) ],    (6.50)

where Wⱼ ~ χ²_{1+2j} and is independent of Z₁. We then have

B₁*(μ, σ₁) = (1/(πσ₁)) Σ_{j=0}^{∞} ( e^{−λ/2} (λ/2)^j / j! ) E[ √(2Z₁)/(Z₁ + Wⱼ) ],  λ = 2μ²/σ₁²,

and the intrinsic priors are given by

π₀(σ₀) = 1/σ₀,  π₁(μ, σ₁) = (1/σ₁) π₁(μ|σ₁),

with π₁(μ|σ₁) = B₁*(μ, σ₁), a proper density in μ obtainable from the series above; it depends on μ only through λ = 2μ²/σ₁² and involves the factor exp(−μ²/σ₁²).
It is to be noted that ∫ π₁(μ|σ₁) dμ = 1.

Example 6.31. (Testing normal mean with variance unknown.) This is from Berger and Pericchi (1996a). Consider the setup of Example 6.25 with the same prior g₀ under M₀, but in place of the standard noninformative prior g₁(μ, σ₁) = 1/σ₁ use the Jeffreys prior g₁^J(μ, σ₁) = 1/σ₁². In this case, a minimal training sample consists of two distinct observations Xᵢ, Xⱼ, for which

m₀(Xᵢ, Xⱼ) = 1 / ( 2π (Xᵢ² + Xⱼ²) )  and  m₁(Xᵢ, Xⱼ) = 1 / ( √π (Xᵢ − Xⱼ)² ).

Proceeding as in the previous example, noting that

m₀(X₁, X₂) / m₁(X₁, X₂) = Z₁ / ( √π (Z₁ + Z₂) ),
where Z₁ and Z₂ are as above, and using (6.50), the intrinsic priors are obtained as

π₀(σ₀) = 1/σ₀,  π₁(μ, σ₁) = (1/σ₁) π₁(μ|σ₁),  with

π₁(μ|σ₁) = ( 1 − exp(−μ²/σ₁²) ) / ( 2√π μ²/σ₁ ).
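A quick numerical check (a sketch, not from the text) that this π₁(μ|σ₁) is indeed a proper density:

```python
import numpy as np

# Integrate pi_1(mu|sigma) = (1 - exp(-mu^2/sigma^2)) / (2*sqrt(pi)*mu^2/sigma)
# over mu on a wide grid; the total mass should be close to 1.
def total_mass(sigma, half_width=1000.0, n_points=2_000_001):
    mu, dmu = np.linspace(-half_width, half_width, n_points, retstep=True)
    mu = mu[mu != 0.0]   # removable singularity at mu = 0
    dens = (1.0 - np.exp(-(mu / sigma) ** 2)) * sigma \
           / (2.0 * np.sqrt(np.pi) * mu ** 2)
    return float(dens.sum() * dmu)

for sigma in (0.5, 1.0, 3.0):
    assert abs(total_mass(sigma) - 1.0) < 1e-2
```

The loose tolerance reflects truncation of the Cauchy-like tails, which decay only as μ^{−2} — consistent with the remark below that this prior is close to a Cauchy(0, σ₁).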
Here π₁(μ|σ₁) is a proper prior, very close to the Cauchy(0, σ₁) prior for μ, which was suggested by Jeffreys (1961) as a default proper prior for μ (given σ₁); see Subsection 2.7.2.

Example 6.32. Consider a negative binomial experiment: Bernoulli trials, each having probability θ of success, are independently performed until a total of n successes is accumulated. On the basis of the outcome of this experiment we want to test the null hypothesis H₀: θ = 1/2 against the alternative H₁: θ ≠ 1/2. We consider this problem as choosing between the two models M₀: θ = 1/2 and M₁: θ ∈ (0, 1).
The data may be looked upon as n observations X₁, …, Xₙ, where X₁ denotes the number of failures before the first success and, for i = 2, …, n, Xᵢ denotes the number of failures between the (i−1)th success and the ith success. The random variables X₁, …, Xₙ are i.i.d. with a common geometric distribution with probability mass function

P(Xᵢ = x) = θˣ(1 − θ),  x = 0, 1, 2, …

The likelihood function is

f(X₁, …, Xₙ | θ) = θ^{ΣXᵢ} (1 − θ)ⁿ.

We consider the Jeffreys prior

π^J(θ) ∝ θ^{−1/2}(1 − θ)^{−1},

which is improper. The Bayes factor with this prior is

B₁₀ = 2^{ΣXᵢ+n} ∫₀¹ θ^{ΣXᵢ−1/2} (1 − θ)^{n−1} dθ.

Minimal training samples are of size 1, and the AIBF is given by

AIBF₁₀ = B₁₀ · (1/n) Σᵢ₌₁ⁿ (2Xᵢ + 1)/2^{Xᵢ+2}.

Let

B₁*(θ) = Σ_{x=0}^{∞} [ (2x + 1)/2^{x+2} ] θˣ(1 − θ).

Then the intrinsic prior is

π(θ) = θ^{−1/2}(1 − θ)^{−1} B₁*(θ).

Simplification yields

π(θ) = ( θ^{−1/2} + θ^{1/2}/2 ) (2 − θ)^{−2}.
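The simplification can be verified numerically (a sketch, not from the text): the series defining B₁*(θ) should match the closed form (1 − θ)(1 + θ/2)/(2 − θ)², so that π(θ) = θ^{−1/2}(1 − θ)^{−1} B₁*(θ) reduces to (θ^{−1/2} + θ^{1/2}/2)(2 − θ)^{−2}.

```python
# Check B1*(theta) = sum_{x>=0} (2x+1) 2^{-(x+2)} theta^x (1-theta)
#                  = (1-theta)(1+theta/2)/(2-theta)^2,
# and hence the closed form of the intrinsic prior pi(theta).
def b1_star(theta, terms=200):
    return sum((2 * x + 1) * 2.0 ** (-(x + 2)) * theta ** x * (1 - theta)
               for x in range(terms))

for theta in (0.1, 0.5, 0.9):
    closed = (1 - theta) * (1 + theta / 2) / (2 - theta) ** 2
    assert abs(b1_star(theta) - closed) < 1e-12
    pi_series = theta ** -0.5 * (1 - theta) ** -1 * b1_star(theta)
    pi_closed = (theta ** -0.5 + theta ** 0.5 / 2) * (2 - theta) ** -2
    assert abs(pi_series - pi_closed) < 1e-12
```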
We now consider a simple example from Lindley and Phillips (1976), also presented in Carlin and Louis (1996, Chapter 1). In 12 independent tosses of a coin, one observes 9 heads and 3 tails, the last toss yielding a tail. It is shown that one gets different results according as a binomial or a negative binomial likelihood is assumed. Let us consider the problem of testing the null hypothesis H₀: θ = 1/2 against the alternative H₁: θ ≠ 1/2, where θ denotes the probability of head in a trial. If a binomial model is assumed, the random observable X is the number of heads observed in the fixed number of 12 tosses. One rejects H₀ for large values of the statistic |X − 6|, and the corresponding P-value is 0.150. On the other hand, if a negative binomial model is assumed, the random observable X is the number of heads before the third tail appears. Note that the expected value of X under H₀ is 3. Suppose one rejects H₀ for large values of |X − 3|. Then the corresponding P-value is 0.0325. Thus, with the usual 5% Type 1 error level, the two model assumptions lead to different decisions. Let us now use a Bayes test for this problem. For the binomial model, the Jeffreys prior is proportional to θ^{−1/2}(1 − θ)^{−1/2}, which can be normalized to get a proper prior. For the negative binomial model, the data can be treated as three i.i.d. geometrically distributed random variables, as described above. The Bayes factor under the binomial model (with Jeffreys prior) and the Bayes factor under the negative binomial model (with the intrinsic prior) are respectively 1.079 and 1.424. They are different, as were the P-values of classical statistics, but unlike the P-values, the Bayes factors are quite close.
6.8 Exercises

1. Assume a sharp null and continuity of the null distribution of the test statistic.
   (a) Calculate E_{H₀}(P-value) and E_{H₀}(P-value | P-value < α), where 0 < α < 1 is the Type 1 error probability.
   (b) In view of your answer to (a), do you think 2(P-value) is a better measure of evidence against H₀ than the P-value?
2. Suppose X ~ N(θ, 1) and consider the two hypothesis testing problems: H₀: θ = −1 versus H₁: θ = 1, and H₀*: θ = 1 versus H₁*: θ = −1. Find the Bayes factor of H₀ relative to H₁ and that of H₀* relative to H₁* if (a) x = 0 is observed, and (b) x = 1 is observed. Compute the classical P-value in both cases.
3. Refer to Example 6.3. Take τ = 2σ, but keep the other parameter values unchanged. Compute B₀₁ for the same values of t and n as used in Table 6.1.
4. Suppose X ~ N(θ, 1) and consider testing H₀: θ = 0 versus H₁: θ ≠ 0. For three different values of x, x = 0, 1, 2, compute the upper and lower bounds on Bayes factors when the prior on θ under the alternative hypothesis lies in
   (a) Γ_A = {all prior distributions on R},
   (b) Γ_N = {N(0, τ²), τ² > 0},
   (c) Γ_S = {all symmetric (about 0) prior distributions on R},
   (d) Γ_SU = {all unimodal priors on R, symmetric about 0}.
   Compute the classical P-value for each x value. What is the implication of Γ_N ⊂ Γ_SU ⊂ Γ_S ⊂ Γ_A?
5. Let X ~ B(m, θ), and let it be of interest to test H₀: θ = 1/2 versus H₁: θ ≠ 1/2. If m = 10 and the observed data is x = 8, compute the upper and lower bounds on Bayes factors when the prior on θ under the alternative hypothesis lies in
   (a) Γ_A = {all prior distributions on (0, 1)},
   (b) Γ_B = {Beta(α, α), α > 0},
   (c) Γ_S = {all symmetric (about 1/2) priors on (0, 1)},
   (d) Γ_SU = {all unimodal priors on (0, 1), symmetric about 1/2}.
   Compute the classical P-value also.
6. Refer to Example 6.7.
   (a) Show that B(G_A, x) = exp(−t²/2) and P(H₀|G_A, x) = [1 + ((1 − π₀)/π₀) exp(t²/2)]^{−1}.
   (b) Show that, if t ≤ 1, B(G_US, x) = 1 and P(H₀|G_US, x) = π₀.
   (c) Show that, if t ≤ 1, B(G_Nor, x) = 1 and P(H₀|G_Nor, x) = π₀; if t > 1, B(G_Nor, x) = t exp(−(t² − 1)/2).
7. Suppose X|θ has the t_p(3, θ, I_p) distribution with density

   f(x|θ) ∝ ( 1 + (1/3)(x − θ)′(x − θ) )^{−(3+p)/2},

   and it is of interest to test H₀: θ = 0 versus H₁: θ ≠ 0. Show that this testing problem is invariant under the group of all orthogonal transformations.
8. Refer to Example 6.13. Show that the testing problem mentioned there is invariant under the group of scale transformations.
9. In Example 6.16, find the maximal invariants in the sample space and the parameter space.
10. In Example 6.17, find the maximal invariants in the sample space and the parameter space.
11. Let X|θ ~ N(θ, 1) and consider testing H₀: |θ − θ₀| < 0.1 versus H₁: |θ − θ₀| > 0.1. Suppose x = θ₀ + 1.97 is observed.
    (a) Compute the P-value.
    (b) Compute B₀₁ and P(H₀|x) under the two priors N(θ₀, τ²), with τ² = (0.148)², and U(θ₀ − 1, θ₀ + 1).
12. Let X|p ~ Binomial(10, p). Consider the two models M₀: p = 1/2 versus M₁: p ≠ 1/2.
Under M₁, consider the following three priors for p: (i) U(0, 1), (ii) Beta(10, 10), and (iii) Beta(100, 10). If the observations x = 0, 3, 5, 7, and 10 are available, compute k_d given in Equation (6.27) for each observation and for each of the priors, and check which of the observations may be considered outliers under M₀.
13. (Box (1980)) Let X₁, X₂, …, Xₙ be a random sample from N(θ, σ²) with both θ and σ² unknown. It is of interest to detect a discrepancy in the variance of the model, the target model being M₀: σ² = σ₀², with θ ~ N(μ, τ²), where μ and τ² are specified.
    (a) Show that the predictive distribution of (X₁, X₂, …, Xₙ) under M₀ is multivariate normal with covariance matrix σ₀²Iₙ + τ²11′ and E(Xᵢ) = μ, for i = 1, 2, …, n.
    (b) Show that, under this predictive distribution,

    T(X) = (1/σ₀²) Σᵢ₌₁ⁿ (Xᵢ − X̄)² + n(X̄ − μ)²/(σ₀² + nτ²) ~ χ²ₙ.

    (c) Derive and justify the prior predictive P-value based on the model departure statistic T(X). Apply this to the data x = (8, 5, 4, 7), with σ₀² = 1, μ = 0, τ² = 2.
    (d) What is the classical P-value for testing H₀: σ² = σ₀² in this problem?
14. (Box (1980)) Suppose that under the target model, for i = 1, 2, …, n,

    yᵢ | β₀, θ, σ² = β₀ + xᵢ′θ + εᵢ,  εᵢ ~ N(0, σ²) i.i.d.,
    β₀ | σ² ~ N(μ₀, cσ²),  θ | σ² ~ N(θ₀, σ²Γ^{−1}),  σ² ~ inverse Gamma(α, γ),

    where c, μ₀, θ₀, Γ, α, and γ are specified. Assume the standard linear regression model notation y = β₀1 + Xθ + ε, and suppose that X′1 = 0. Further assume that, given σ², β₀, θ, and ε are conditionally independent. Also, let β̂₀ and θ̂ be the least squares estimates of β₀ and θ, respectively, and RSS = (y − β̂₀1 − Xθ̂)′(y − β̂₀1 − Xθ̂).
    (a) Show that under the target model, conditionally on σ², the predictive density of y is proportional to

    (σ²)^{−n/2} exp( −(1/(2σ²)) { n(β̂₀ − μ₀)²/(1 + nc) + RSS + (θ̂ − θ₀)′((X′X)^{−1} + Γ^{−1})^{−1}(θ̂ − θ₀) } ).
202
6 Hypothesis Testing and Model Selection r
--
. - . . - -
^ -(n+a-l)/2
{27 + Rss -{-{6- Ooynx'x)-^ + r-^)-\e -Oo)} (d) Derive the prior predictive distribution of
Tiy) =
(d - doYiix'x)-^ + r-')-\d - do) 27 + RSS
(e) Using an appropriately scaled T(y) as the model departure statistic derive the prior predictive P-value. 15. Consider the same linear regression set-up as in Exercise 14, but let the target model now be ~ inverse Gamma(a,7). Assuming 7 to be close to 0, use
T{y) =
d'x'xd RSS
as the model departure statistic to derive the prior predictive P-value. Compare it with the classical P-value for testing HQ : 6 = 0. 16. Consider the same problem as in Exercise 15, but let the target model be Mo:0 = 0,/?o|cr^ ~ A/'(/xo,ca2),7r(cr2) oc
1 ..
Using T(y) = 0 X'XO as the model departure statistic and RSS as the conditioning statistic, derive the conditional predictive P-value. Compute the partial predictive P-value using the same model departure statistic. Compare these with the classical P-value for testing HQ : 0 = 0. 17. Let X i , X 2 , • • •, Xn be i.i.d. with density f{x\X,e)
= Aexp(-A(x - e)),x > 6>,
where A > 0 and — 00 < ^ < 00 are both unknown. Let the target model be Mo :6> = 0,7r(A)oc ^. A
Suppose the smallest order statistic, T = X(i) is considered a suitable model departure statistic for this problem. (a) Show that T|A ~ exponential(nA) under MQ. (b) Show that A|xo6s ~ Gamma(n, nXo^g) under MQ. (c) Show that
-CI''** = wrlb^^ (d) Compute the posterior predictive P-value. (e) Show that as tobs —> 00, the posterior predictive P-value does not necessarily approach 0. (Note that tobs < Xobs —> 00 also.)
18. (Contingency table) Casella and Berger (1990) present the following two-way table, which is the outcome of a famous medical experiment conducted by Joseph Lister. Lister performed 75 amputations with and without using carbolic acid.

                     Carbolic Acid Used?
    Patient Lived?   Yes   No
    Yes              34    19
    No                6    16

    Test for association of patient mortality with the use of carbolic acid on the basis of the above data using (a) BIC and (b) the classical likelihood ratio test. Discuss the different probabilistic interpretations underlying the two tests.
19. On the basis of the data on food poisoning presented in Table 2.1, you have to test whether potato salad was the cause. (Do this separately for Crab-meat and No Crab-meat.)
    (a) Formulate this as a problem of testing a sharp null against the alternative that the null is false.
    (b) Test the sharp null using BIC.
    (c) Test the same null using the classical likelihood ratio test.
    (d) Discuss whether the notions of classical Type 1 and Type 2 error probabilities make sense here.
20. Using the BIC, analyze the data of Problem 19 to explore whether crab-meat also contributed to food poisoning.
21. (Goodness of fit test) Feller (1973) presents the following data on the bombing of London during World War II. The entire area of South London is divided into 576 small regions of equal area, and the number (n_k) of regions with exactly k bomb hits is recorded.

    k      0    1    2   3   4   5 and above
    n_k    229  211  93  35  7   1

    Test the null hypothesis that the bombing was at random, rather than the general belief that special targets were being bombed. (Hint: Under H₀ use the Poisson model; under the alternative use the full multinomial model with 5 parameters, and use BIC.)
22. (Hald's regression data) We present below a small set of data on heat evolved during the hardening of Portland cement and four variables that may be related to it (Woods et al. (1932), pp. 635-649). The sample size (n) is 13. The regressor variables (in percent of the weight) are x₁ = calcium aluminate (3CaO.Al₂O₃), x₂ = tricalcium silicate (3CaO.SiO₂), x₃ = tetracalcium alumino ferrite (4CaO.Al₂O₃.Fe₂O₃), and x₄ = dicalcium
silicate (2CaO.SiO₂). The response variable is y = total calories given off during hardening per gram of cement after 180 days. The data are given in Table 6.8. Usually such a data set is analyzed using a normal linear regression model of the form

    yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + ⋯ + β_p x_{pi} + εᵢ,  i = 1, …, n,

where p is the number of regressor variables in the model, β₀, β₁, …, β_p are unknown parameters, and the εᵢ are independent errors having a N(0, σ²) distribution. There are a number of possible models depending on which regressor variables are kept in the model. Analyze the data and choose one from this set of possible models using (a) BIC, (b) the AIBF of the full model relative to all possible models.

Table 6.8. Cement Hardening Data

    x1   x2   x3   x4      y
     7   26    6   60    78.6
     1   29   15   52    74.3
    11   56    8   20   104.3
    11   31    8   47    87.6
     7   52    6   33    95.9
    11   55    9   22   109.2
     3   71   17    6   102.7
     1   31   22   44    72.5
     2   54   18   22    93.1
    21   47    4   26   115.9
     1   40   23   34    83.8
    11   66    9   12   113.3
    10   68    8   12   109.4
Bayesian Computations
Bayesian analysis requires computation of expectations and quantiles of probability distributions that arise as posterior distributions. Modes of the densities of such distributions are also sometimes used. The standard Bayes estimate is the posterior mean, which is also the Bayes rule under squared error loss. Its accuracy is assessed using the posterior variance, which is again an expected value. The posterior median is sometimes utilized, and to provide Bayesian credible regions, quantiles of posterior distributions are needed. If conjugate priors are not used, as is mostly the case these days, posterior distributions will not be standard distributions, and hence the required Bayesian quantities (i.e., posterior quantities of inferential interest) cannot be computed in closed form. Thus special techniques are needed for Bayesian computations.

Example 7.1. Suppose X is N(θ, σ²) with known σ², and a Cauchy(μ, τ) prior on θ is considered appropriate from robustness considerations (see Chapter 3, Example 3.20). Then

π(θ|x) ∝ exp( −(θ − x)²/(2σ²) ) ( τ² + (θ − μ)² )^{−1},

and hence the posterior mean and variance are

E^π(θ|x) = ∫ θ exp( −(θ − x)²/(2σ²) ) ( τ² + (θ − μ)² )^{−1} dθ / ∫ exp( −(θ − x)²/(2σ²) ) ( τ² + (θ − μ)² )^{−1} dθ,  and

V^π(θ|x) = ∫ θ² exp( −(θ − x)²/(2σ²) ) ( τ² + (θ − μ)² )^{−1} dθ / ∫ exp( −(θ − x)²/(2σ²) ) ( τ² + (θ − μ)² )^{−1} dθ − ( E^π(θ|x) )².

Note that the above integrals cannot be computed in closed form, but various numerical integration techniques such as IMSL routines or Gaussian quadrature can be used to obtain very good approximations of them. On the other hand, the following example provides a more difficult problem.
Example 7.2. Suppose X₁, X₂, …, X_k are independent Poisson counts with Xᵢ ~ Poisson(θᵢ). The θᵢ are a priori considered related, and a joint multivariate normal prior distribution on their logarithms is assumed. Specifically, let νᵢ = log(θᵢ) be the ith element of ν, and suppose

ν ~ N_k( μ1, τ²{ (1 − ρ)I_k + ρ11′ } ),

where 1 is the k-vector with all elements equal to 1, and μ, τ², and ρ are known constants. Then, because

f(x|ν) = exp( −Σᵢ₌₁ᵏ { e^{νᵢ} − νᵢxᵢ } ) / ∏ᵢ₌₁ᵏ xᵢ!

and

π(ν) ∝ exp( −(1/(2τ²)) (ν − μ1)′ ( (1 − ρ)I_k + ρ11′ )^{−1} (ν − μ1) ),

we have that

π(ν|x) ∝ exp{ −Σᵢ₌₁ᵏ{ e^{νᵢ} − νᵢxᵢ } − (1/(2τ²)) (ν − μ1)′ ( (1 − ρ)I_k + ρ11′ )^{−1} (ν − μ1) }.

Therefore, if the posterior mean of θⱼ is of interest, we need to compute

E^π(θⱼ|x) = E^π( exp(νⱼ) | x ) = ∫_{R^k} exp(νⱼ) g(ν|x) dν / ∫_{R^k} g(ν|x) dν,

where

g(ν|x) = exp{ −Σᵢ₌₁ᵏ{ e^{νᵢ} − νᵢxᵢ } − (1/(2τ²)) (ν − μ1)′ ( (1 − ρ)I_k + ρ11′ )^{−1} (ν − μ1) }.

This is a ratio of two k-dimensional integrals, and as k grows the integrals become less and less easy to work with. Numerical integration techniques fail to be efficient in this case. This problem, known as the curse of dimensionality, is due to the fact that the size of the part of the space that is not relevant for the computation of the integral grows very fast with the dimension. Consequently, the error in approximation associated with numerical integration increases as a power of the dimension k, making the technique inefficient. In fact, numerical integration techniques are presently not preferred except for single and two-dimensional integrals. The recent popularity of the Bayesian approach to statistical applications is mainly due to advances in statistical computing. These include the E-M algorithm discussed in Section 7.2 and the Markov chain Monte Carlo (MCMC) sampling techniques discussed in Section 7.4. As we see later, Bayesian analysis of real-life problems invariably involves difficult computations, while MCMC techniques such as Gibbs sampling (Section 7.4.4) and the Metropolis-Hastings algorithm (M-H) (Section 7.4.3) have rendered some of these very difficult computational tasks quite feasible.
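For small k, the ratio of integrals above can still be approximated by brute force, for instance by sampling ν from its prior and weighting by the likelihood. A minimal sketch (all constants hypothetical):

```python
import numpy as np

# Ratio-of-integrals estimate of E(theta_j | x) = E(e^{nu_j} | x):
# draw nu from the prior and weight by the (stabilized) likelihood.
rng = np.random.default_rng(1)
k, mu, tau2, rho, j = 3, 1.0, 0.5, 0.3, 0
x = np.array([2.0, 4.0, 3.0])               # hypothetical Poisson counts
Sigma = tau2 * ((1 - rho) * np.eye(k) + rho * np.ones((k, k)))

nu = rng.multivariate_normal(mu * np.ones(k), Sigma, size=200_000)
loglik = -np.exp(nu).sum(axis=1) + (nu * x).sum(axis=1)
w = np.exp(loglik - loglik.max())           # likelihood weights
post_mean = float(np.sum(np.exp(nu[:, j]) * w) / np.sum(w))
```

This works only because k = 3 here; the weights degenerate rapidly as k grows, which is exactly the curse of dimensionality referred to above.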
7.1 Analytic Approximation

This is exactly what we saw in Section 4.3.2, where we derived analytic large sample approximations for certain integrals using the Laplace approximation. Specifically, suppose

E^π( g(θ) | x ) = ∫ g(θ) f(x|θ) π(θ) dθ / ∫ f(x|θ) π(θ) dθ    (7.1)

is the Bayesian quantity of interest, where g, f, and π are smooth functions of θ. First, consider any integral of the form

I = ∫ q(θ) exp( −nh(θ) ) dθ,

where h is a smooth function with −h having its unique maximum at θ̂. Then, as indicated in Section 4.3.1 for the univariate case, the Laplace method involves expanding q and h about θ̂ in a Taylor series. Let h′ and q′ denote the vectors of partial derivatives of h and q, respectively, and Δ_h and Δ_q denote the Hessians of h and q. Then, writing

h(θ) = h(θ̂) + (θ − θ̂)′h′(θ̂) + (1/2)(θ − θ̂)′Δ_h(θ̂)(θ − θ̂) + ⋯
     = h(θ̂) + (1/2)(θ − θ̂)′Δ_h(θ̂)(θ − θ̂) + ⋯  (since h′(θ̂) = 0),  and

q(θ) = q(θ̂) + (θ − θ̂)′q′(θ̂) + ⋯,

we obtain

I = e^{−nh(θ̂)} (2π)^{k/2} n^{−k/2} |Δ_h(θ̂)|^{−1/2} { q(θ̂) + O(n^{−1}) },    (7.2)

which is exactly (4.16). Upon applying this to both the numerator and denominator of (7.1) separately (with q equal to g and 1), a first-order approximation

E^π( g(θ) | x ) = g(θ̂) { 1 + O(n^{−1}) }

easily emerges. It also indicates that a second-order approximation may be available if further terms in the Taylor series expansion are retained. Suppose that g in (7.1) is positive, and let −nh(θ) = log f(x|θ) + log π(θ) and −nh*(θ) = −nh(θ) + log g(θ). Now apply (7.2) to both the numerator and
denominator of (7.1) with q equal to 1. Then, letting θ̂* denote the maximum of −h*, and setting Σ = Δ_h^{−1}(θ̂) and Σ* = Δ_{h*}^{−1}(θ̂*), as mentioned in Section 4.3.2, Tierney and Kadane (1986) obtain the fantastic approximation

E^π( g(θ) | x ) = [ |Σ*|^{1/2} exp( −nh*(θ̂*) ) / ( |Σ|^{1/2} exp( −nh(θ̂) ) ) ] { 1 + O(n^{−2}) },    (7.3)

which they call fully exponential. This technique can be used in Example 7.2. Note that to derive the approximation in (7.3), it is enough to have the probability distribution of g(θ) concentrate away from the origin on the positive side. Therefore, often when g is non-positive, (7.3) can be applied after adding a large positive constant to g, and this constant is then subtracted after obtaining the approximation. Some other analytic approximations are also available. Angers and Delampady (1997) use an exponential approximation for a probability distribution that concentrates near the origin. We will not be emphasizing any of these techniques here, including the many numerical integration methods mentioned previously, because the availability of powerful and all-purpose simulation methods has rendered them less attractive.
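As a concrete illustration (a sketch, not from the text), the fully exponential approximation (7.3) can be tried on a case where the exact answer is known: the posterior mean of θ under a Beta(α, β) posterior kernel, whose exact value is α/(α + β).

```python
import math

# Tierney-Kadane (fully exponential) approximation of E(theta | x) when the
# log posterior kernel is ell(t) = (alpha-1)log t + (beta-1)log(1-t).
# Here -n h = ell and -n h* = ell + log g, with g(theta) = theta.
alpha, beta = 11.0, 21.0

def ell(t):                 # -n h(theta)
    return (alpha - 1) * math.log(t) + (beta - 1) * math.log(1 - t)

def ell_star(t):            # -n h*(theta)
    return ell(t) + math.log(t)

def second_deriv(f, t, h=1e-5):
    return (f(t + h) - 2 * f(t) + f(t - h)) / h ** 2

t_hat = (alpha - 1) / (alpha + beta - 2)    # maximizer of ell
t_star = alpha / (alpha + beta - 1)         # maximizer of ell_star
sigma = (-second_deriv(ell, t_hat)) ** -0.5
sigma_star = (-second_deriv(ell_star, t_star)) ** -0.5
tk = (sigma_star / sigma) * math.exp(ell_star(t_star) - ell(t_hat))

exact = alpha / (alpha + beta)              # exact posterior mean
```

For these values the approximation agrees with the exact mean to several decimal places, reflecting the O(n^{−2}) accuracy of (7.3).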
7.2 The E-M Algorithm

We shall use a slightly different notation here. Suppose Y|θ has density f(y|θ), and suppose the prior on θ is π(θ), resulting in the posterior density π(θ|y). When π(θ|y) is computationally difficult to handle, as is usually the case, there are some 'data augmentation' methods that can help. The idea is to augment the observed data y with missing or latent data z to obtain the 'complete' data x = (y, z), so that the augmented posterior density π(θ|x) = π(θ|y, z) is computationally easy to handle. The E-M algorithm (see Dempster et al. (1977), Tanner (1991), or McLachlan and Krishnan (1997)) is the simplest among such data augmentation methods. In our context, the E-M algorithm is meant for computing the posterior mode. However, if data augmentation yields a computationally simple posterior distribution, there are more powerful computational tools available that can provide a lot more information on the posterior distribution, as will be seen later in this chapter.

The basic steps in the iterations of the E-M algorithm are the following. Let p(z|y, θ̂) be the predictive density of Z given y and an estimate θ̂ of θ. Find z^{(i)} = E(Z | y, θ̂^{(i)}), where θ̂^{(i)} is the estimate of θ used at the ith step of the iteration. Note the similarity with estimating missing values. Use z^{(i)} to augment y and maximize π(θ | y, z^{(i)}) to obtain θ̂^{(i+1)}. Then find z^{(i+1)} using θ̂^{(i+1)}, and continue this iteration. This combination of expectation followed by maximization in each iteration gives the E-M algorithm its name.
Implementation of the E-M Algorithm

Note that because π(θ|y) = π(θ, z|y)/p(z|y, θ), we have that

log π(θ|y) = log π(θ, z|y) − log p(z|y, θ).

Taking expectation with respect to Z | θ̂^{(i)}, y on both sides, we get

log π(θ|y) = ∫ log π(θ, z|y) p(z|y, θ̂^{(i)}) dz − ∫ log p(z|y, θ) p(z|y, θ̂^{(i)}) dz
           = Q(θ, θ̂^{(i)}) − H(θ, θ̂^{(i)})    (7.4)

(where Q and H are according to the notation of Dempster et al. (1977)). Then, the general E-M algorithm involves the following two steps in the ith iteration:

E-Step: Calculate Q(θ, θ̂^{(i)}).
M-Step: Maximize Q(θ, θ̂^{(i)}) with respect to θ and obtain θ̂^{(i+1)} such that

Q(θ̂^{(i+1)}, θ̂^{(i)}) = max_θ Q(θ, θ̂^{(i)}).

Note that, from the E-M algorithm, we have Q(θ̂^{(i+1)}, θ̂^{(i)}) ≥ Q(θ̂^{(i)}, θ̂^{(i)}). Further, for any θ,

H(θ, θ̂^{(i)}) − H(θ̂^{(i)}, θ̂^{(i)})
  = ∫ log p(z|y, θ) p(z|y, θ̂^{(i)}) dz − ∫ log p(z|y, θ̂^{(i)}) p(z|y, θ̂^{(i)}) dz
  = ∫ log { p(z|y, θ) / p(z|y, θ̂^{(i)}) } p(z|y, θ̂^{(i)}) dz
  = −∫ log { p(z|y, θ̂^{(i)}) / p(z|y, θ) } p(z|y, θ̂^{(i)}) dz
  ≤ 0,

because, for any two densities p₁ and p₂, ∫ log(p₁(x)/p₂(x)) p₁(x) dx is the Kullback-Leibler distance between p₁ and p₂, which is at least 0. Therefore, combining this with Q(θ̂^{(i+1)}, θ̂^{(i)}) ≥ Q(θ̂^{(i)}, θ̂^{(i)}) in (7.4),

π(θ̂^{(i+1)}|y) ≥ π(θ̂^{(i)}|y),

so each iteration of the E-M algorithm can only increase the posterior density.
Table 7.1. Genetic Linkage Data

    Cell Count   Probability
    y1 = 125     1/2 + θ/4
    y2 = 18      (1 − θ)/4
    y3 = 20      (1 − θ)/4
    y4 = 34      θ/4
Example 7.3. (Genetic linkage model.) Consider the data from Rao (1973) on a certain recombination rate in genetics (see Sorensen and Gianola (2002) for details). Here 197 counts are classified into 4 categories as shown in Table 7.1, along with the corresponding theoretical cell probabilities. The multinomial mass function in this example is given by

f(y|θ) ∝ (2 + θ)^{y₁} (1 − θ)^{y₂+y₃} θ^{y₄},

so that under the uniform(0, 1) prior on θ, the observed posterior density is given by

π(θ|y) ∝ (2 + θ)^{y₁} (1 − θ)^{y₂+y₃} θ^{y₄}.

This is not a standard density, due to the presence of 2 + θ. If we split the first cell into two cells with probabilities 1/2 and θ/4, respectively, the complete data will be given by x = (x₁, x₂, x₃, x₄, x₅), where x₁ + x₂ = y₁, x₃ = y₂, x₄ = y₃, and x₅ = y₄. The augmented posterior density will then be given by

π(θ|x) ∝ θ^{x₂+x₅} (1 − θ)^{x₃+x₄},

which corresponds with a Beta density. The E-step of E-M consists of obtaining

Q(θ, θ^{(i)}) = E[ (X₂ + X₅) log θ + (X₃ + X₄) log(1 − θ) | y, θ^{(i)} ]
             = { E[X₂ | y, θ^{(i)}] + y₄ } log θ + (y₂ + y₃) log(1 − θ).    (7.5)

The M-step involves finding θ^{(i+1)} to maximize (7.5). We can do this by solving (∂/∂θ) Q(θ, θ^{(i)}) = 0, so that

θ^{(i+1)} = ( E[X₂ | y, θ^{(i)}] + y₄ ) / ( E[X₂ | y, θ^{(i)}] + y₄ + y₂ + y₃ ).    (7.6)

Now note that

E[X₂ | y, θ^{(i)}] = E[X₂ | X₁ + X₂ = y₁, θ^{(i)}],

and that X₂ | X₁ + X₂, θ^{(i)} ~ binomial( X₁ + X₂, (θ^{(i)}/4)/(1/2 + θ^{(i)}/4) ). Therefore,

E[X₂ | X₁ + X₂ = y₁, θ^{(i)}] = y₁ θ^{(i)} / (2 + θ^{(i)}),

and hence

θ^{(i+1)} = ( y₁θ^{(i)}/(2 + θ^{(i)}) + y₄ ) / ( y₁θ^{(i)}/(2 + θ^{(i)}) + y₄ + y₂ + y₃ ).    (7.7)
211
Table 7.2. E-M Iterations for Genetic Linkage Data Example Iteration i 0^''^ .60825 1 2 .62432 3 .62648 4 .62678 5 .62682 .62682 6
In our example, (7.7) converges to ^ = .62682 in 5 iterations starting from 9^^^ = .5 as shown in Table 7.2.
7.3 Monte Carlo Sampling Consider an expectation that is not available in closed form. An alternative to numerical integration or analytic approximation to compute this is statistical sampling. This probabilistic technique is a familiar tool in statistical inference. To estimate a population mean or a population proportion, a natural approach is to gather a large sample from this population and to consider the corresponding sample mean or the sample proportion. The law of large numbers guarantees that the estimates so obtained will be good provided the sample is large enough. Specifically, let / be a probability density function (or a mass function) and suppose the quantity of interest is a finite expectation of the form
Efh{X.):= I h{:>c)f{x)dy:
(7.8)
(or the corresponding sum in the discrete case). If i.i.d. observations X i , X 2 , . . . can be generated from the density / , then ^
m
/im = - ^ M X i )
(7.9)
converges in probability (or even almost surely) to Efh(X.). This justifies using hm as an approximation for Efh(X.) for large m. To provide a measure of accuracy or the extent of error in the approximation, we can again use a statistical technique and compute the standard error. If Var//i(X) is finite, then yanfihm) = Var//i(X)/m. Further, Var//i(X) = Efh?{X) {Efh{X.)f can be estimated by
212
7 Bayesian Computations ^
m
s2
and hence the standard error of hm can be estimated by \2\l/2
-7=Sm = —(y^(h{Xi)
- hm) )
If one wishes, confidence intervals for Efh{X.) can also be provided using the central limit theorem. Because m {hm - Efh{X))
^ ^^^^ ^^ m—>-oo
in distribution, (hm — ^a/2^m/v^5 ^m + Za/2Sm/V^) can be used as an approximate 100(1 — a)% confidence interval for £'//i(X), with Zoc/2 denoting the 100(1 — a/2)% quantile of standard normal. The above discussion suggests that if we want to approximate the posterior mean, we could try to generate i.i.d. observations from the posterior distribution and consider the mean of this sample. This is rarely useful because most often the posterior distribution will be a non-standard distribution which may not easily allow sampling from it. Note that there are other possibilities as seen below. Example 7.4- (Example 7.1 continued.) Recall that
$$E^\pi(\theta \mid x) = \frac{\displaystyle\int_{-\infty}^{\infty} \frac{\theta}{\tau^2+(\theta-\mu)^2}\,\varphi\!\Big(\frac{x-\theta}{\sigma}\Big)\,d\theta}{\displaystyle\int_{-\infty}^{\infty} \frac{1}{\tau^2+(\theta-\mu)^2}\,\varphi\!\Big(\frac{x-\theta}{\sigma}\Big)\,d\theta},$$
where φ denotes the density of the standard normal. Thus E^π(θ|x) is the ratio of the expectation of h(θ) = θ/{τ² + (θ−μ)²} to that of h(θ) = 1/{τ² + (θ−μ)²}, both expectations being with respect to the N(x, σ²) distribution. Therefore, we simply sample θ₁, θ₂, …, θ_m from N(x, σ²) and use
$$\hat E^\pi(\theta \mid x) = \frac{\sum_{i=1}^{m} \theta_i/\{\tau^2+(\theta_i-\mu)^2\}}{\sum_{i=1}^{m} 1/\{\tau^2+(\theta_i-\mu)^2\}}$$
as our Monte Carlo estimate of E^π(θ|x). Note that (7.8) and (7.9) are applied separately to the numerator and the denominator, but using the same sample of θ's. It would be unwise, however, to assume that the problem has been completely solved. The sample of θ's generated from N(x, σ²) will tend to concentrate around x,
whereas, to satisfactorily account for the contribution of the Cauchy prior to the posterior mean, a significant portion of the θ's should come from the tails of the posterior distribution. It may therefore appear better to express the posterior mean in the form
$$E^\pi(\theta \mid x) = \frac{\displaystyle\int_{-\infty}^{\infty} \theta \exp\!\Big(-\frac{(x-\theta)^2}{2\sigma^2}\Big)\,\pi(\theta)\,d\theta}{\displaystyle\int_{-\infty}^{\infty} \exp\!\Big(-\frac{(x-\theta)^2}{2\sigma^2}\Big)\,\pi(\theta)\,d\theta},$$
then sample θ₁, …, θ_m from Cauchy(μ, τ) and use the approximation
$$\hat E^\pi(\theta \mid x) = \frac{\sum_{i=1}^{m} \theta_i \exp\!\big(-\frac{(x-\theta_i)^2}{2\sigma^2}\big)}{\sum_{i=1}^{m} \exp\!\big(-\frac{(x-\theta_i)^2}{2\sigma^2}\big)}.$$
However, this is also not totally satisfactory, because the tails of the posterior distribution are not as heavy as those of the Cauchy prior, and hence there will be excess sampling from the tails relative to the center. The implication is that convergence of the approximation is slower, and hence the error in approximation is larger (for a fixed m). Ideally, therefore, sampling should be from the posterior distribution itself for a satisfactory approximation. With this in mind, a variation of the above theme has been developed, called Monte Carlo importance sampling. Consider (7.8) again. Suppose that it is difficult or expensive to sample directly from f, but there exists a probability density u that is very close to f and from which it is easy to sample. Then we can rewrite (7.8) as $E_f h(X)$
$$= \int_{\mathcal X} h(x)\,f(x)\,dx = \int_{\mathcal X} \{h(x)\,w(x)\}\,u(x)\,dx,$$
where w(x) = f(x)/u(x). Now apply (7.9) with f replaced by u and h replaced by hw. In other words, generate i.i.d. observations X₁, X₂, … from the density u and compute
$$\overline{hw}_m = \frac{1}{m}\sum_{i=1}^{m} h(X_i)\,w(X_i).$$
The sampling density u is called the importance function. We illustrate importance sampling with the following example.
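Before the example, here is a minimal sketch of the idea for the normal-likelihood, Cauchy-prior posterior mean of Example 7.4, taking u = N(x, σ²) as the importance function; the self-normalized weights then reduce to the Cauchy prior kernel (all numerical settings below are illustrative):

```python
import random

def post_mean_is(x, sigma, mu, tau, m=100000, seed=1):
    """Self-normalized importance sampling estimate of E(theta | x).
    Draw theta from u = N(x, sigma^2); the weight f/u is then
    proportional to the Cauchy(mu, tau) prior density at theta."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(m):
        th = rng.gauss(x, sigma)
        w = 1.0 / (tau ** 2 + (th - mu) ** 2)  # Cauchy kernel; constants cancel
        num += th * w
        den += w
    return num / den
```

For x = μ the posterior is symmetric about μ and the estimate sits near μ; for x away from μ the estimate is shrunk from x toward μ.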
Example 7.5. Suppose X₁, X₂, …, X_n are i.i.d. N(θ, σ²), where both θ and σ² are unknown. Independent priors are assumed for θ and σ², where θ has a double exponential distribution with density exp(−|θ|)/2 and σ² has the prior density (1+σ²)⁻². Neither of these is a standard prior, but both are robust choices of proper priors all the same. If the posterior mean of θ is of interest, then it is necessary to compute
$$\int_{-\infty}^{\infty}\int_{0}^{\infty} \theta\,\pi(\theta,\sigma^2 \mid \mathbf x)\,d\sigma^2\,d\theta.$$
Because π(θ, σ²|x) is not a standard density, let us look for a standard density close to it. Letting x̄ denote the mean of the sample x₁, x₂, …, x_n and s_n² = Σ_{i=1}^n (x_i − x̄)²/n, note that
$$\pi(\theta,\sigma^2 \mid \mathbf x) \propto (\sigma^2)^{-n/2}\exp\Big(-\frac{n}{2\sigma^2}\big\{(\theta-\bar x)^2 + s_n^2\big\}\Big)\exp(-|\theta|)\,(1+\sigma^2)^{-2}$$
$$\propto \Big\{(\sigma^2)^{-(n/2+2)}\exp\Big(-\frac{n}{2\sigma^2}\big\{(\theta-\bar x)^2+s_n^2\big\}\Big)\Big\}\times\big\{[s_n^2+(\theta-\bar x)^2]^{-(n/2+1)}\big\}\exp(-|\theta|)\Big(\frac{\sigma^2}{1+\sigma^2}\Big)^{2},$$
where u₁(σ²|θ) is the density of the inverse Gamma distribution with shape parameter n/2 + 1 and scale parameter (n/2){(θ − x̄)² + s_n²}, and u₂ is the Student's t density with n + 1 degrees of freedom, location x̄, and scale a multiple of s_n. It may be noted that the tails of exp(−|θ|)(σ²/(1+σ²))² do not have much influence in the presence of u₁(σ²|θ)u₂(θ). Therefore, u(θ, σ²) = u₁(σ²|θ)u₂(θ) may be chosen as a suitable importance function. This involves sampling θ first from the density u₂(θ) and, given this θ, sampling σ² from u₁(σ²|θ). This is repeated to generate further values of (θ, σ²). Finally, after generating m such pairs (θᵢ, σᵢ²), the required posterior mean of θ is approximated by
$$\hat E(\theta \mid \mathbf x) = \frac{\sum_{i=1}^{m} \theta_i\,w(\theta_i,\sigma_i^2)}{\sum_{i=1}^{m} w(\theta_i,\sigma_i^2)},$$
where
$$w(\theta,\sigma^2) = f(\mathbf x \mid \theta,\sigma^2)\,\pi(\theta,\sigma^2)/u(\theta,\sigma^2).$$
In some high-dimensional problems, a combination of numerical integration, Laplace approximation, and Monte Carlo sampling seems to give appealing results. Delampady et al. (1993) use a Laplace-type approximation to obtain a suitable importance function in a high-dimensional problem. One area that we have not touched upon is how to generate random deviates from a given probability distribution. Clearly, this is a very important subject, being the basis of any Monte Carlo sampling technique. Instead of providing a sketchy discussion of this vast area, we refer the reader to
the excellent book by Robert and Casella (1999). We would, however, like to mention one recent and very important development in this area. This is the discovery of a very efficient algorithm to generate a sequence of uniform random deviates with a very large period of 2^19937 − 1. This algorithm, known as the Mersenne twister (MT), has many other desirable features as well, details of which may be found in Matsumoto and Nishimura (1998). The property of having a very large period is especially important because Monte Carlo simulation methods, especially MCMC, require very long sequences of random deviates for proper implementation.
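In practice this generator is already at hand: CPython's standard random module is itself an MT19937 implementation, so reproducible uniform streams with this period are directly available (the helper below is illustrative):

```python
import random

def uniform_stream(seed, n):
    """n uniforms from an independent MT19937 (Mersenne twister) instance."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]
```

Identically seeded instances reproduce the same stream, which is what makes long MCMC runs repeatable.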
7.4 Markov Chain Monte Carlo Methods

7.4.1 Introduction

A severe drawback of standard Monte Carlo sampling or Monte Carlo importance sampling is that complete determination of the functional form of the posterior density is needed for their implementation. Situations where posterior distributions are incompletely specified or are specified indirectly cannot be handled. One such instance is where the joint posterior distribution of the vector of parameters is specified in terms of several conditional and marginal distributions, but not directly. This actually covers a very large range of Bayesian analysis, because a lot of Bayesian modeling is hierarchical, so that the joint posterior is difficult to calculate but the conditional posteriors given parameters at different levels of the hierarchy are easier to write down (and hence sample from). For instance, consider the normal-Cauchy problem of Example 7.1. As shown later in Section 7.4.6, this problem can be given a hierarchical structure wherein we have the normal model, the conjugate normal prior in the first stage with a hyperparameter for its variance, and this hyperparameter again has a conjugate prior. Similarly, consider Example 7.2, where we have independent observations Xᵢ ~ Poisson(θᵢ). Now suppose the prior on the θᵢ's is a conjugate mixture. We again see (Problem 14) that a hierarchical prior structure can lead to analytically tractable conditional posteriors. It turns out that it is indeed possible in such cases to adopt an iterative Monte Carlo sampling scheme which, at the point of convergence, will guarantee a random draw from the target joint posterior distribution. These iterative Monte Carlo procedures typically generate a random sequence with the Markov property, such that this Markov chain is ergodic with the limiting distribution being the target posterior distribution. There is actually a whole class of such iterative procedures, collectively called Markov chain Monte Carlo (MCMC) procedures.
Different procedures from this class are suitable for different situations. As mentioned above, convergence of a random sequence with the Markov property is being utilized in this procedure, and hence some basic understanding of Markov chains is required. This material is presented below. This
discussion as well as the following sections are mainly based on Athreya et al. (2003).

7.4.2 Markov Chains in MCMC

A sequence of random variables {X_n}_{n≥0} is a Markov chain if for any n, given the current value X_n, the past {X_j : j ≤ n−1} and the future {X_j : j ≥ n+1} are independent. In other words,
$$P(A \cap B \mid X_n) = P(A \mid X_n)\,P(B \mid X_n), \qquad (7.10)$$
where A and B are events defined respectively in terms of the past and the future. Among Markov chains there is a subclass with wide applicability: Markov chains with time-homogeneous or stationary transition probabilities, meaning that the probability distribution of X_{n+1} given X_n = x and the past X_j, j ≤ n−1, depends only on x. In the countable state space case, writing p_{ij} = P(X_{n+1} = j | X_n = i) for the transition probabilities, one has
$$P(X_0=i_0, X_1=i_1, \ldots, X_n=i_n) = P(X_n=i_n \mid X_0=i_0,\ldots,X_{n-1}=i_{n-1})\,P(X_0=i_0,\ldots,X_{n-1}=i_{n-1})$$
$$= p_{i_{n-1}i_n}\,P(X_0=i_0,\ldots,X_{n-1}=i_{n-1}) = P(X_0=i_0)\,p_{i_0i_1}\,p_{i_1i_2}\cdots p_{i_{n-1}i_n}.$$
A probability distribution π is called stationary or invariant for a transition probability P, or for the associated Markov chain {X_n}, if when the probability distribution of X₀ is π, the same is true for X_n for all n ≥ 1. Thus, in the countable state space case, a probability distribution π = {π_i : i ∈ S} is stationary for a transition probability matrix P if for each j in S,
$$P(X_1 = j) = \sum_{i\in S} P(X_1=j \mid X_0=i)\,P(X_0=i) = \sum_{i\in S}\pi_i\,p_{ij} = P(X_0=j) = \pi_j. \qquad (7.11)$$
In vector notation, this says that π = (π₁, π₂, …) is a left eigenvector of the matrix P with eigenvalue 1:
$$\pi = \pi P. \qquad (7.12)$$
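For a small chain, (7.11)-(7.12) can be checked numerically; the sketch below finds π by repeated left-multiplication (the two-state matrix is an arbitrary illustration):

```python
def stationary(P, iters=1000):
    """Approximate pi satisfying pi = pi P by power iteration from uniform."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
    return pi

P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary(P)  # solves pi = pi P; here pi = (5/6, 1/6)
```

A direct check that the result is invariant under P confirms (7.12) for this chain.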
Similarly, if S is a continuum, a probability distribution π with density p(x) is stationary for the transition kernel P(·,·) if
$$\pi(A) = \int_A p(x)\,dx = \int_S P(x,A)\,p(x)\,dx$$
for all A ⊂ S. A Markov chain {X_n} with a countable state space S and transition probability matrix P = ((p_{ij})) is said to be irreducible if for any two states i and j the probability of the Markov chain visiting j starting from i is positive, i.e., for some n ≥ 1, $p^{(n)}_{ij} = P(X_n = j \mid X_0 = i) > 0$. A similar notion of irreducibility, known as Harris or Doeblin irreducibility, exists for the general state space case also. For details on this somewhat advanced notion, as well as other results that we state here without proof, see Robert and Casella (1999) or Meyn and Tweedie (1993). In addition, Tierney (1994) and Athreya et al. (1996) may be used as more advanced references on irreducibility and MCMC. In particular, the last reference uses the fact that there is a stationary distribution of the Markov chain, namely the joint posterior, and thus provides better and very explicit conditions for the MCMC to converge.

Theorem 7.6. (Law of large numbers for Markov chains) Let {X_n}_{n≥0} be a Markov chain with a countable state space S and a transition probability matrix P. Further, suppose it is irreducible and has a stationary probability distribution π = {π_i : i ∈ S} as defined in (7.11). Then, for any bounded function h : S → R and for any initial distribution of X₀,
$$\frac{1}{n}\sum_{i=0}^{n-1} h(X_i) \to \sum_{j\in S} h(j)\,\pi_j \qquad (7.13)$$
in probability as n → ∞. A similar law of large numbers (LLN) holds when the state space S is not countable; the limit value in (7.13) is then the integral of h with respect to the stationary distribution π. A sufficient condition for the validity of this LLN is that the Markov chain {X_n} be Harris irreducible and have a stationary distribution π. To see how this is useful to us, consider the following. Given a probability distribution π on a set S and a function h on S, suppose it is desired to compute the "integral of h with respect to π", which reduces to Σ_j h(j)π_j in the countable case. Look for an irreducible Markov chain {X_n} with state space S and stationary distribution π. Then, starting from some initial value
X₀, run the Markov chain {X_j} for a period of time, say 0, 1, 2, …, n−1, and consider as an estimate
$$\hat\mu_n = \frac{1}{n}\sum_{i=0}^{n-1} h(X_i). \qquad (7.14)$$
By the LLN (7.13), this estimate μ̂_n will be close to Σ_j h(j)π_j for large n. This technique is called Markov chain Monte Carlo (MCMC). For example, if one is interested in π(A) = Σ_{j∈A} π_j for some A ⊂ S, then by the LLN (7.13),
$$\frac{1}{n}\sum_{j=0}^{n-1} I_A(X_j) \to \pi(A)$$
in probability as n → ∞, where I_A(X_j) = 1 if X_j ∈ A and 0 otherwise. An irreducible Markov chain {X_n} with a countable state space S is called aperiodic if for some i ∈ S the greatest common divisor g.c.d.{n : p^{(n)}_{ii} > 0} = 1. Then, in addition to the LLN (7.13), the following result on the convergence of P(X_n = j) holds:
$$\sum_{j} \big|P(X_n = j) - \pi_j\big| \to 0 \qquad (7.15)$$
as n → ∞, for any initial distribution of X₀. In other words, for large n the probability distribution of X_n will be close to π. There exists a result similar to (7.15) for the general state space case, asserting that under suitable conditions the probability distribution of X_n will be close to π as n → ∞. This suggests that instead of doing one run of length n, one could do N independent runs each of length m, so that n = Nm, and then from the i-th run use only the m-th observation, say X_{m,i}, and consider the estimate
$$\hat\mu_{N,m} = \frac{1}{N}\sum_{i=1}^{N} h(X_{m,i}). \qquad (7.16)$$
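A quick numerical check of the estimate (7.14): simulate the two-state chain below (transition probabilities arbitrary) and compare the long-run fraction of time spent in state 0 with its stationary probability π₀ = 5/6:

```python
import random

def mcmc_freq(P, state, n, seed=3):
    """Time average (7.14) of h = indicator of state 0 along one run."""
    rng = random.Random(seed)
    idx = range(len(P))
    visits = 0
    for _ in range(n):
        visits += (state == 0)
        state = rng.choices(idx, weights=P[state])[0]  # one transition
    return visits / n

P = [[0.9, 0.1],
     [0.5, 0.5]]
frac = mcmc_freq(P, 0, 200000)  # close to 5/6 by the Markov chain LLN
```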
Other variations exist as well. Some of the special Markov chains used in MCMC are discussed in the next two sections.

7.4.3 Metropolis-Hastings Algorithm

In this section, we discuss a very general MCMC method with wide applications. It will soon become clear why this important discovery has led to very considerable progress in simulation-based inference, particularly in Bayesian analysis. The idea is not to simulate from the given target density directly (which may be computationally very difficult), but to simulate an easy Markov chain that has this target density as the density of its stationary distribution. We begin with a somewhat abstract setting but will very soon get to practical implementation.
Let S be a finite or countable set. Let π be a probability distribution on S. We shall call π the target distribution. (There is room for slight confusion here because in our applications the target distribution will always be the posterior distribution; note that π here does not denote the prior distribution but is just standard notation for a generic target.) Let Q = ((q_{ij})) be a transition probability matrix such that for each i it is computationally easy to generate a sample from the distribution {q_{ij} : j ∈ S}. Generate a Markov chain {X_n} as follows. If X_n = i, first sample from the distribution {q_{ij} : j ∈ S} and denote that observation Y_n. Then choose X_{n+1} from the two values X_n and Y_n according to
$$P(X_{n+1} = Y_n \mid X_n, Y_n) = \rho(X_n, Y_n), \qquad P(X_{n+1} = X_n \mid X_n, Y_n) = 1 - \rho(X_n, Y_n), \qquad (7.17)$$
where the "acceptance probability" ρ(·,·) is given by
$$\rho(i,j) = \min\Big\{\frac{\pi_j\,q_{ji}}{\pi_i\,q_{ij}},\ 1\Big\} \qquad (7.18)$$
for all (i, j) such that π_i q_{ij} > 0. Note that {X_n} is a Markov chain with transition probability matrix P = ((p_{ij})) given by
$$p_{ij} = \begin{cases} q_{ij}\,\rho(i,j), & j \ne i,\\[2pt] 1 - \sum_{k\ne i} q_{ik}\,\rho(i,k), & j = i. \end{cases} \qquad (7.19)$$
Q is called the "proposal transition probability" and ρ the "acceptance probability". A significant feature of this transition mechanism P is that P and π satisfy
$$\pi_i\,p_{ij} = \pi_j\,p_{ji} \quad \text{for all } i, j. \qquad (7.20)$$
This implies that for any j,
$$\sum_i \pi_i\,p_{ij} = \sum_i \pi_j\,p_{ji} = \pi_j;$$
that is, π is a stationary probability distribution for P. Now assume that S is irreducible with respect to Q and that π_i > 0 for all i in S. It can then be shown that P is irreducible, and because it has a stationary distribution π, the LLN (7.13) is available. This algorithm is thus very flexible and useful. The choice of Q is subject only to the condition that S is irreducible with respect to Q. Clearly, it is no loss of generality to assume that π_i > 0 for all i in S. A sufficient condition for the aperiodicity of P is that p_{ii} > 0 for some i, or equivalently,
$$\sum_{j \ne i} q_{ij}\,\rho(i,j) < 1 \quad \text{for some } i.$$
A sufficient condition for this is that there exists a pair (i, j) such that π_i q_{ij} > 0 and π_j q_{ji} < π_i q_{ij}.
Recall that if P is aperiodic, then both the LLN (7.13) and (7.15) hold. If S is not finite or countable but is a continuum and the target distribution π(·) has a density p(·), then one proceeds as follows. Let Q be a transition function such that for each x, Q(x,·) has a density q(x, y). Then proceed as in the discrete case, but set the "acceptance probability" ρ(x, y) to be
$$\rho(x,y) = \min\Big\{\frac{p(y)\,q(y,x)}{p(x)\,q(x,y)},\ 1\Big\}$$
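This continuous-state recipe fits in a few lines of code. The sketch below runs a random-walk chain (symmetric proposal, so the q-ratio cancels) for the unnormalized target p(x) ∝ exp(−x²/2); the proposal scale and run length are arbitrary choices:

```python
import math
import random

def metropolis(logp, x0, n, scale=2.0, seed=11):
    """Random-walk Metropolis-Hastings with N(x, scale^2) proposals.
    Only the log of the unnormalized target density is required."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n):
        y = rng.gauss(x, scale)
        # acceptance probability min{p(y)/p(x), 1}; q is symmetric
        if rng.random() < math.exp(min(0.0, logp(y) - logp(x))):
            x = y
        chain.append(x)
    return chain

chain = metropolis(lambda t: -0.5 * t * t, 0.0, 50000)  # target: N(0, 1)
```

Note that only ratios p(y)/p(x) enter, so the normalizing constant is never needed.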
for all (x, y) such that p(x)q(x, y) > 0. A particularly useful feature of the above algorithm is that it is enough to know p(·) up to a multiplicative constant, because in the definition of the "acceptance probability" ρ(·,·) only the ratios p(y)/p(x) need to be calculated. (In the discrete case, it is enough to know {π_i} up to a multiplicative constant, because the "acceptance probability" ρ(·,·) needs only the ratios π_j/π_i.) This assures us that in Bayesian applications it is not necessary to have the normalizing constant of the posterior density available for computation of the posterior quantities of interest.

7.4.4 Gibbs Sampling

As was pointed out in Chapter 2, most of the new problems that Bayesians are asked to solve are high-dimensional. Applications to areas such as microarrays and image processing are some examples. Bayesian analysis of such problems invariably involves target (posterior) distributions that are high-dimensional multivariate distributions. In image processing, for example, one typically has an N × N square grid of pixels with N = 256, and each pixel has k ≥ 2 possible values. Thus each configuration has (256)² components, and the state space S has k^{(256)²} configurations. To simulate a random configuration from a target distribution over such a large S is not an easy task. The Gibbs sampler is a technique especially suitable for generating an irreducible aperiodic Markov chain that has as its stationary distribution a target distribution in a high-dimensional space having some special structure. The most interesting aspect of this technique is that to run this Markov chain, it suffices to generate observations from univariate distributions. The Gibbs sampler in the context of a bivariate probability distribution can be described as follows. Let π be a target probability distribution of a bivariate random vector (X, Y). For each x, let P(x,·) be the conditional probability distribution of Y given X = x.
Similarly, let Q(y,·) be the conditional probability distribution of X given Y = y. Note that for each x, P(x,·) is a univariate distribution, and for each y, Q(y,·) is also a univariate distribution. Now generate a bivariate Markov chain Z_n = (X_n, Y_n) as follows. Start with some X₀ = x₀. Generate an observation Y₀ from the distribution P(x₀,·). Then generate an observation X₁ from Q(Y₀,·). Next generate an
observation Y₁ from P(X₁,·), and so on. At stage n, if Z_n = (X_n, Y_n) is known, then generate X_{n+1} from Q(Y_n,·) and Y_{n+1} from P(X_{n+1},·). If π is a discrete distribution concentrated on {(x_i, y_j) : 1 ≤ i ≤ K, 1 ≤ j ≤ L} and if π_{ij} = π(x_i, y_j), then P(x_i, y_j) = π_{ij}/π_{i·} and Q(y_j, x_i) = π_{ij}/π_{·j}, where π_{i·} = Σ_j π_{ij} and π_{·j} = Σ_i π_{ij}. Thus the transition probability matrix R = ((r_{(i,j),(k,l)})) for the {Z_n} chain is given by
$$r_{(i,j),(k,l)} = Q(y_j, x_k)\,P(x_k, y_l).$$
It can be verified that this chain is irreducible, aperiodic, and has π as its stationary distribution. Thus the LLN (7.13) and (7.15) hold in this case, so for large n, Z_n can be viewed as a sample from a distribution that is close to π, and one can approximate Σ_{i,j} h(i,j)π_{ij} by Σ_{i=1}^n h(X_i, Y_i)/n. As an illustration, consider sampling from
$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N_2\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}\right). \qquad (7.21)$$
Note that the conditional distribution of X given Y = y and that of Y given X = x are
$$X \mid Y=y \sim N(\rho y,\ 1-\rho^2) \quad\text{and}\quad Y \mid X=x \sim N(\rho x,\ 1-\rho^2). \qquad (7.22)$$
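The conditionals (7.22) translate directly into the alternating two-step sampler described next; a minimal sketch (ρ and the run length are illustrative):

```python
import random

def gibbs_bvn(rho, n, x0=0.0, seed=5):
    """Gibbs sampler for the bivariate normal target above: alternately
    draw Y | X = x ~ N(rho*x, 1-rho^2) and X | Y = y ~ N(rho*y, 1-rho^2)."""
    rng = random.Random(seed)
    s = (1.0 - rho * rho) ** 0.5
    x, draws = x0, []
    for _ in range(n):
        y = rng.gauss(rho * x, s)  # step 1
        x = rng.gauss(rho * y, s)  # step 2
        draws.append((x, y))
    return draws

draws = gibbs_bvn(0.6, 20000)
```

After a short transient, the draws behave like samples from the bivariate normal: means near 0 and covariance near ρ.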
Using this property, Gibbs sampling proceeds as described below to generate (X_n, Y_n), n = 0, 1, 2, …, by starting from an arbitrary value x₀ for X₀ and repeating the following steps for i = 0, 1, …, n.
1. Given x_i for X, draw a random deviate from N(ρx_i, 1−ρ²) and denote it by Y_i.
2. Given y_i for Y, draw a random deviate from N(ρy_i, 1−ρ²) and denote it by X_{i+1}.
The theory of Gibbs sampling tells us that if n is large, then (x_n, y_n) is a random draw from a distribution close to $N_2\big(\big(\begin{smallmatrix}0\\0\end{smallmatrix}\big), \big(\begin{smallmatrix}1&\rho\\\rho&1\end{smallmatrix}\big)\big)$. To see why the Gibbs sampler works here, recall that a sufficient condition for the LLN (7.13) and the limit result (7.15) is that an appropriate irreducibility condition holds and a stationary distribution exists. From steps 1 and 2 above, using (7.22), one has
$$Y_i = \rho X_i + \sqrt{1-\rho^2}\,\eta_i \quad\text{and}\quad X_{i+1} = \rho Y_i + \sqrt{1-\rho^2}\,\xi_i,$$
where η_i and ξ_i are independent standard normal random variables, independent of X_i. Thus the sequence {X_i} satisfies the stochastic difference equation
$$X_{i+1} = \rho^2 X_i + U_{i+1}, \quad\text{where}\quad U_{i+1} = \rho\sqrt{1-\rho^2}\,\eta_i + \sqrt{1-\rho^2}\,\xi_i.$$
Because η_i and ξ_i are independent N(0,1) random variables, U_{i+1} is also normally distributed, with mean 0 and variance ρ²(1−ρ²) + (1−ρ²) = 1−ρ⁴. Also, {U_i}_{i≥1} being i.i.d. makes {X_i}_{i≥0} a Markov chain. It turns out that the irreducibility condition holds here. Turning to stationarity, note that if X₀ is a N(0,1) random variable, then X₁ = ρ²X₀ + U₁ is also a N(0,1) random variable, because the variance of X₁ is ρ⁴ + 1 − ρ⁴ = 1 and the mean of X₁ is 0. This makes the standard N(0,1) distribution a stationary distribution for {X_i}.
The multivariate extension of the above bivariate case is very straightforward. Suppose π is a probability distribution of a k-dimensional random vector (X₁, X₂, …, X_k). If u = (u₁, u₂, …, u_k) is any k-vector, let u₋ᵢ = (u₁, u₂, …, u_{i−1}, u_{i+1}, …, u_k) be the (k−1)-dimensional vector resulting from dropping the i-th component u_i. Let π_i(·|x₋ᵢ) denote the univariate conditional distribution of X_i given X₋ᵢ = (X₁, …, X_{i−1}, X_{i+1}, …, X_k) = x₋ᵢ. Now, starting with some initial value X₀ = (x₀₁, x₀₂, …, x₀ₖ), generate X₁ = (X₁₁, X₁₂, …, X₁ₖ) sequentially by generating X₁₁ according to the univariate distribution π₁(·|x₀,₋₁), then generating X₁₂ according to π₂(·|(X₁₁, x₀₃, x₀₄, …, x₀ₖ)), and so on. The most important feature to recognize here is that all the univariate conditional distributions X_i|X₋ᵢ = x₋ᵢ, known as full conditionals, should easily allow sampling from them. This turns out to be the case in most hierarchical Bayes problems. Thus, the Gibbs sampler is particularly well adapted for Bayesian computations with hierarchical priors. This was the motivation for some vigorous initial development of Gibbs sampling, as can be seen in Gelfand and Smith (1990). The Gibbs sampler can be justified without showing that it is a special case of the Metropolis-Hastings algorithm.
Even if it is considered a special case, it still has special features that deserve recognition. One such feature is that the full conditionals have sufficient information to uniquely determine a multivariate joint distribution. This is the famous Hammersley-Clifford theorem. The following condition, introduced by Besag (1974), is needed to state this result.

Definition 7.7. Let p(y₁, …, y_k) be the joint density of a random vector Y = (Y₁, …, Y_k) and let p^{(i)}(y_i) denote the marginal density of Y_i, i = 1, …, k. If p^{(i)}(y_i) > 0 for every i = 1, …, k implies that p(y₁, …, y_k) > 0, then the joint density p is said to satisfy the positivity condition.

Let us use the notation p_i(y_i | y₁, …, y_{i−1}, y_{i+1}, …, y_k) for the conditional density of Y_i | Y₋ᵢ = y₋ᵢ.
Theorem 7.8. (Hammersley-Clifford) Under the positivity condition, the joint density p satisfies
$$p(y_1,\ldots,y_k) = \prod_{j=1}^{k}\frac{p_j(y_j \mid y_1,\ldots,y_{j-1},\,y'_{j+1},\ldots,y'_k)}{p_j(y'_j \mid y_1,\ldots,y_{j-1},\,y'_{j+1},\ldots,y'_k)}\;p(y'_1,\ldots,y'_k)$$
for every y and y′ in the support of p.

Proof. For y and y′ in the support of p,
$$p(y_1,\ldots,y_k) = p_k(y_k \mid y_1,\ldots,y_{k-1})\,p(y_1,\ldots,y_{k-1}) = \frac{p_k(y_k \mid y_1,\ldots,y_{k-1})}{p_k(y'_k \mid y_1,\ldots,y_{k-1})}\,p(y_1,\ldots,y_{k-1},y'_k)$$
$$= \frac{p_k(y_k \mid y_1,\ldots,y_{k-1})}{p_k(y'_k \mid y_1,\ldots,y_{k-1})}\cdot\frac{p_{k-1}(y_{k-1} \mid y_1,\ldots,y_{k-2},y'_k)}{p_{k-1}(y'_{k-1} \mid y_1,\ldots,y_{k-2},y'_k)}\,p(y_1,\ldots,y_{k-2},y'_{k-1},y'_k)$$
$$= \cdots = \prod_{j=1}^{k}\frac{p_j(y_j \mid y_1,\ldots,y_{j-1},y'_{j+1},\ldots,y'_k)}{p_j(y'_j \mid y_1,\ldots,y_{j-1},y'_{j+1},\ldots,y'_k)}\;p(y'_1,\ldots,y'_k). \quad\Box$$
It can also be shown that under the positivity condition, the Gibbs sampler generates an irreducible Markov chain, thus providing the necessary convergence properties without recourse to the M-H algorithm. Additional conditions are, however, required to extend the above theorem to the non-positive case, details of which may be found in Robert and Casella (1999).

7.4.5 Rao-Blackwellization
The variance reduction idea of the famous Rao-Blackwell theorem in the presence of auxiliary information can be used to provide improved estimators when MCMC procedures are adopted. Let us first recall this theorem.

Theorem 7.9. (Rao-Blackwell theorem) Let δ(X₁, X₂, …, X_n) be an estimator of θ with finite variance. Suppose that T is sufficient for θ, and let δ*(T), defined by δ*(t) = E(δ(X₁, X₂, …, X_n) | T = t), be the conditional expectation of δ(X₁, X₂, …, X_n) given T = t. Then
$$E\big(\delta^*(T) - \theta\big)^2 \le E\big(\delta(X_1, X_2, \ldots, X_n) - \theta\big)^2.$$
The inequality is strict unless δ = δ*, or equivalently, unless δ is already a function of T.

Proof. By the property of iterated conditional expectation,
$$E(\delta^*(T)) = E\big[E(\delta(X_1,X_2,\ldots,X_n) \mid T)\big] = E(\delta(X_1,X_2,\ldots,X_n)).$$
Therefore, to compare the mean squared errors (MSE) of the two estimators, we need to compare their variances only. Now,
$$\mathrm{Var}\big(\delta(X_1, X_2, \ldots, X_n)\big) = \mathrm{Var}\big[E(\delta \mid T)\big] + E\big[\mathrm{Var}(\delta \mid T)\big] = \mathrm{Var}(\delta^*) + E\big[\mathrm{Var}(\delta \mid T)\big] \ge \mathrm{Var}(\delta^*),$$
unless Var(δ|T) = 0, which is the case only if δ itself is a function of T. □

The Rao-Blackwell theorem involves two key steps: variance reduction by conditioning, and conditioning by a sufficient statistic. The first step is based on the analysis of variance formula: for any two random variables S and T, because Var(S) = Var(E(S|T)) + E(Var(S|T)), one can reduce the variance of a random variable S by taking conditional expectation given some auxiliary information T. This can be exploited in MCMC. Let (X_j, Y_j), j = 1, 2, …, N be the data generated by a single run of the Gibbs sampler with a target distribution of a bivariate random vector (X, Y). Let h(X) be a function of the X component of (X, Y) and let its mean value be μ. Suppose the goal is to estimate μ. A first estimate is the sample mean of the h(X_j), j = 1, 2, …, N. From MCMC theory, it can be shown that as N → ∞, this estimate converges to μ in probability. The computation of the variance of this estimator is not easy, owing to the (Markovian) dependence of the sequence {X_j, j = 1, 2, …, N}. Now suppose we make n independent runs of the Gibbs sampler and generate (X_{ij}, Y_{ij}), j = 1, 2, …, N; i = 1, 2, …, n, and suppose that N is sufficiently large that (X_{iN}, Y_{iN}) can be regarded as a sample from the limiting target distribution of the Gibbs sampling scheme. Thus (X_{iN}, Y_{iN}), i = 1, 2, …, n are i.i.d. and hence form a random sample from the target distribution. Then one can offer a second estimate of μ: the sample mean of the h(X_{iN}), i = 1, 2, …, n. This estimator ignores a good part of the MCMC data, but has the advantage that the variables h(X_{iN}), i = 1, 2, …, n are independent, so that the variance of their mean is of order n⁻¹. Now, applying the variance reduction idea of the Rao-Blackwell theorem with the auxiliary information Y_{iN}, i = 1, 2, …, n, one can improve this estimator as follows. Let k(y) = E(h(X) | Y = y).
Then for each i, k(Y_{iN}) has a smaller variance than h(X_{iN}), and hence the third estimator,
$$\frac{1}{n}\sum_{i=1}^{n} k(Y_{iN}),$$
has a smaller variance than the second one. A crucial fact to keep in mind here is that the exact functional form of k(·) must be available for implementing this improvement.
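For the bivariate normal Gibbs sampler of Section 7.4.4 with h(x) = x, the conditional mean is available exactly, k(y) = E(X | Y = y) = ρy, so the second and third estimators can be compared directly; a sketch (run lengths are illustrative):

```python
import random

def rb_estimates(rho, n_runs=2000, burn=25, seed=9):
    """Second estimator (mean of h(X_iN) = X_iN) and Rao-Blackwellized
    third estimator (mean of k(Y_iN) = rho * Y_iN) over independent runs."""
    rng = random.Random(seed)
    s = (1.0 - rho * rho) ** 0.5
    raw = rb = 0.0
    for _ in range(n_runs):
        x = 0.0
        for _ in range(burn):
            y = rng.gauss(rho * x, s)
            x = rng.gauss(rho * y, s)
        raw += x          # h(X_iN)
        rb += rho * y     # k(Y_iN) = E(h(X) | Y = Y_iN)
    return raw / n_runs, rb / n_runs

raw, rb = rb_estimates(0.5)
```

Both estimate μ = E(X) = 0, but Var(k(Y)) = ρ² < 1 = Var(h(X)) at stationarity, so the third estimator has the smaller variance, as the theorem promises.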
7.4.6 Examples

Example 7.10. (Example 7.1 continued.) Recall that X|θ ~ N(θ, σ²) with known σ², and θ ~ Cauchy(μ, τ). The task is to simulate θ from the posterior distribution, but we have already noted that sampling directly from the posterior distribution is difficult. What facilitates Gibbs sampling here is the result that the Student's t density, of which the Cauchy is a special case, is a scale mixture of normal densities, with the scale parameter having a Gamma distribution (see Section 2.7.2, Jeffreys test). Specifically,
$$\pi(\theta) \propto \{\tau^2 + (\theta-\mu)^2\}^{-1},$$
so that π(θ) may be considered the marginal prior density from the joint prior density of (θ, λ), where
$$\theta \mid \lambda \sim N(\mu, \tau^2/\lambda) \quad\text{and}\quad \lambda \sim \text{Gamma}(1/2, 1/2).$$
It can be noted that this leads to an implicit hierarchical prior structure with λ as the hyperparameter. Consequently, π(θ|x) may be treated as the marginal density from π(θ, λ|x). Now note that the full conditionals of π(θ, λ|x) are standard distributions from which sampling is easy. In particular,
$$\theta \mid \lambda, x \sim N\left(\frac{x/\sigma^2 + \lambda\mu/\tau^2}{1/\sigma^2 + \lambda/\tau^2},\ \Big(\frac{1}{\sigma^2} + \frac{\lambda}{\tau^2}\Big)^{-1}\right), \qquad (7.23)$$
$$\lambda \mid \theta, x \ \sim\ \lambda \mid \theta \ \sim\ \text{Exponential}\left(\frac{\tau^2 + (\theta-\mu)^2}{2\tau^2}\right). \qquad (7.24)$$
Thus, the Gibbs sampler will use (7.23) and (7.24) to generate (θ, λ) from π(θ, λ|x).

Example 7.11. Consider the following example due to Casella and George, given in Arnold (1993). Suppose we are studying the distribution of the number of defectives X in the daily production of a product. Consider the model (X | Y, θ) ~ binomial(Y, θ), where Y, a day's production, is a random variable with a Poisson distribution with known mean λ, and θ is the probability that any product is defective. The difficulty, however, is that Y is not observable, and inference has to be made on the basis of X only. The prior distribution is such that (θ | Y = y) ~ Beta(α, γ), with known α and γ, independent of y. Bayesian analysis here is not a particularly difficult problem, because the posterior distribution of θ|X = x can be obtained as follows. First note that X|θ ~ Poisson(λθ). Next, θ ~ Beta(α, γ). Therefore,
$$\pi(\theta \mid X = x) \propto \exp(-\lambda\theta)\,\theta^{x+\alpha-1}(1-\theta)^{\gamma-1}, \quad 0 < \theta < 1. \qquad (7.25)$$
The only difficulty is that this is not a standard distribution, and hence posterior quantities cannot be obtained in closed form. Numerical integration is quite simple to perform with this density. However, Gibbs sampling provides an excellent alternative. Instead of focusing on θ|X directly, view it as a marginal component of (Y, θ | X). It can be immediately checked that the full conditionals of this are given by
$$Y \mid X=x, \theta \ \sim\ x + \text{Poisson}(\lambda(1-\theta)) \quad\text{and}\quad \theta \mid X=x, Y=y \ \sim\ \text{Beta}(\alpha+x,\ \gamma+y-x),$$
both of which are standard distributions.

Example 7.12. (Example 7.11 continued.) It is actually possible here to sample from the posterior distribution using what is known as the accept-reject Monte Carlo method. This widely applicable method operates as follows. Let g(x)/K be the target density, where K is the possibly unknown normalizing constant of the unnormalized density g. Suppose h(x) is a density that can be simulated by a known method and is close to g, and suppose there exists a known constant c > 0 such that g(x) ≤ c h(x) for all x. Then, to simulate from the target density, the following two steps suffice. (See Robert and Casella (1999) for details.)
Step 1. Generate Y ~ h and U ~ U(0,1);
Step 2. Accept X = Y if U ≤ g(Y)/{c h(Y)}; return to Step 1 otherwise.
The optimal choice for c is sup_x{g(x)/h(x)}, but even this choice may result in an undesirably large number of rejections. In our example, from (7.25),
$$g(\theta) = \exp(-\lambda\theta)\,\theta^{x+\alpha-1}(1-\theta)^{\gamma-1}\,I\{0<\theta<1\},$$
so that h(θ) may be chosen to be the density of Beta(x+α, γ). Then, with the above-mentioned choice for c, if θ ~ Beta(x+α, γ) is generated in Step 1, its "acceptance probability" in Step 2 is simply exp(−λθ). Even though the accept-reject method can be employed here, we would like to use this example to illustrate the Metropolis-Hastings algorithm instead. The required Markov chain is generated by taking the transition density q(z, y) = q(y|z) = h(y), independently of z. Then the acceptance probability is
$$\rho(z, y) = \min\{\exp(-\lambda(y-z)),\ 1\}.$$
Thus the steps involved in this "independent" M-H algorithm are as follows. Start at t = 0 with a value x₀ in the support of the target distribution; in this case, 0 < x₀ < 1. Given x_t, generate the next value in the chain as given below.
(a) Draw Y_t from Beta(x + α, γ).
(b) Let
$$X_{t+1} = \begin{cases} Y_t & \text{with probability } \rho_t,\\ x_t & \text{otherwise,}\end{cases}$$
7.4 Markov Chain Monte Carlo Methods
227
where pt = minjexp {—\{Yt — xt)), 1}. (c) Set t = t-\-l and go to step (a). Run this chain until t = n, a suitably chosen large integer. Details on its convergence as well as why independent M-H is more efficient than acceptreject Monte Carlo can be found in Robert and Casella (1999). In our example, for X == 1, a = 1, 7 = 49 and A = 100, we simulated such a Markov chain. The resulting frequency histogram is shown in Figure 7.1, with the true posterior density super-imposed on it. Example 7.13. In this example, we discuss the hierarchical Bayesian analysis of the usual one-way ANOVA. Consider the model 1
eij^N{0,at)J
1 , . . . , 71^ 5 ^ =
i^\
i,
(7.26)
where the e_ij are independently distributed. Let the first stage prior on the θ_i and σ_i² be such that they are i.i.d. with

    θ_i ~ N(μ_π, σ_π²)  and  σ_i² ~ inverse Gamma(a1, b1),  i = 1, ..., k.

Fig. 7.1. M-H frequency histogram and true posterior density.

The second stage prior on μ_π and σ_π² is

    μ_π ~ N(μ0, σ0²)  and  σ_π² ~ inverse Gamma(a2, b2).
Here a1, a2, b1, b2, μ0, and σ0² are all specified constants. Let us concentrate on computing the posterior mean E(θ | y). Sufficiency reduces this to considering only

    Ȳ_i = (1/n_i) Σ_{j=1}^{n_i} y_ij,  i = 1, ..., k,  and  s_i² = Σ_{j=1}^{n_i} (y_ij − Ȳ_i)²,  i = 1, ..., k.

From normal theory,

    Ȳ_i | θ, σ² ~ N(θ_i, σ_i²/n_i),  i = 1, ..., k,

which are independent and are also independent of

    s_i² | θ, σ² ~ σ_i² χ²_{n_i − 1},  i = 1, ..., k,

which again are independent. To utilize the Gibbs sampler, we need the full conditionals of π(θ, σ², μ_π, σ_π² | y). Because of the special structure in this problem, it is sufficient, and in fact advantageous, to consider the blocked conditionals
(i) π(θ | σ², μ_π, σ_π², y),
(ii) π(σ² | θ, μ_π, σ_π², y),
(iii) π(μ_π | σ², θ, σ_π², y), and
(iv) π(σ_π² | μ_π, θ, σ², y),
rather than the set of all univariate full conditionals. First note that, given (σ², μ_π, σ_π²), the components θ_i are conditionally independent with

    π(θ_i | σ², μ_π, σ_π², y) ∝ exp{−n_i(Ȳ_i − θ_i)²/(2σ_i²)} exp{−(θ_i − μ_π)²/(2σ_π²)},

and hence

    θ | μ_π, σ², σ_π², y ~ N_k(ξ, Δ),  where Δ is diagonal with
    Δ_ii = σ_i² σ_π² / (σ_i² + n_i σ_π²)  and  ξ_i = (n_i Ȳ_i σ_π² + μ_π σ_i²) / (σ_i² + n_i σ_π²),   (7.27)
which determines (i). Next we note that, given θ, from (7.26), S_{θ,i}² = Σ_{j=1}^{n_i} (y_ij − θ_i)² is sufficient for σ_i², and these are independently distributed with S_{θ,i}² | θ ~ σ_i² χ²_{n_i}. Combining this with the i.i.d. inverse Gamma(a1, b1) prior on the σ_i², we have

    σ_i² | θ, μ_π, σ_π², y ~ inverse Gamma(a1 + n_i/2, b1 + S_{θ,i}²/2),   (7.28)
and they are independent for i = 1, ..., k, which specifies (ii). Turning to the full conditional of μ_π, we note from the hierarchical structure that the conditional distribution of μ_π | σ_π², θ, σ², y is the same as the conditional distribution of μ_π | σ_π², θ. To determine this distribution, note that

    θ_i | μ_π, σ_π² ~ N(μ_π, σ_π²),  i = 1, ..., k,  i.i.d.,

and μ_π ~ N(μ0, σ0²). Therefore, treating θ as a random sample from N(μ_π, σ_π²), so that θ̄ = Σ_{i=1}^k θ_i / k is sufficient for μ_π, we have the joint distribution

    θ̄ | μ_π, σ_π² ~ N(μ_π, σ_π²/k),  and  μ_π ~ N(μ0, σ0²).

Thus we obtain

    μ_π | σ_π², θ, σ², y ~ N( (k θ̄ σ0² + μ0 σ_π²) / (k σ0² + σ_π²), σ0² σ_π² / (k σ0² + σ_π²) ),   (7.29)

which provides (iii). Just as in the previous case, the conditional distribution of σ_π² | μ_π, θ, σ², y turns out to be the same as the conditional distribution of σ_π² | μ_π, θ. To obtain this, note again that the θ_i | μ_π, σ_π² are i.i.d. N(μ_π, σ_π²), so that this time Σ_{i=1}^k (θ_i − μ_π)² is sufficient for σ_π². Further, Σ_{i=1}^k (θ_i − μ_π)² / σ_π² | σ_π² ~ χ²_k and σ_π² ~ inverse Gamma(a2, b2), so that

    σ_π² | μ_π, θ, σ², y ~ inverse Gamma(a2 + k/2, b2 + (1/2) Σ_{i=1}^k (θ_i − μ_π)²).   (7.30)
This gives us (iv), thus completing the specification of all the required full conditionals. It may be noted that the Gibbs sampler in this problem requires simulations from only the standard normal and the inverse Gamma distributions.

Reversible Jump MCMC

There are situations, especially in model selection problems, where the MCMC procedure should be capable of moving between parameter spaces of different dimensions. The standard M-H algorithm described earlier is incapable
of such movements, whereas the reversible jump algorithm of Green (1995) is an extension of the standard M-H algorithm that allows exactly this possibility. The basic idea behind this technique, as applied to model selection, is as follows. Given two models M1 and M2 with parameter sets Θ1 and Θ2, which are possibly of different dimensions, fill the difference in the dimensions by supplementing the parameter sets of these models. In other words, find auxiliary variables u12 and u21 such that (θ1, u12) and (θ2, u21) can be mapped to each other with a bijection. Now use the standard M-H algorithm to move between the two models; for moves of the M-H chain within a model, the auxiliary variables are not needed. We sketch this procedure below; for further details refer to Robert and Casella (1999), Green (1995), Sorensen and Gianola (2002), Waagepetersen and Sorensen (2001), and Brooks et al. (2003).

Consider models M1, M2, ..., where model Mi has a continuous parameter space Θi. The parameter space for the model selection problem as a whole may be taken to be

    {(Mi, θi) : θi ∈ Θi, i = 1, 2, ...}.

Let f(x | Mi, θi) be the model density under model Mi, and let the prior density be

    π(θ) = Σ_i πi π(θi | Mi) I(θ = θi ∈ Θi),

where πi is the prior probability of model Mi and π(θi | Mi) is the prior density conditional on Mi being true. Then the posterior probability of any B ⊂ ∪_j Θj is

    π(B | x) = Σ_i ∫_{B ∩ Θi} π(θi | Mi, x) dθi,

where π(θi | Mi, x) ∝ πi π(θi | Mi) f(x | Mi, θi) is the posterior density restricted to Mi. To compute the Bayes factor of Mk relative to Ml, we will need

    (P(Mk | x) / P(Ml | x)) (πl / πk),

where

    P(Mi | x) = πi ∫_{Θi} π(θi | Mi) f(x | Mi, θi) dθi / Σ_j πj ∫_{Θj} π(θj | Mj) f(x | Mj, θj) dθj

is the posterior probability of Mi. Therefore, for the target density π(θ | x), we need a version of the M-H algorithm that will facilitate the above-shown computations. Suppose θi is a vector of length ni. It suffices to focus on moves between θi in model Mi and θj in model Mj with ni < nj. The scheme provided by Green (1995) is as follows. If the current state of the chain is (Mi, θi), a new value (Mj, θj) is proposed for the chain from a proposal
(transition) distribution Q(θi, dθj), which is then accepted with a certain acceptance probability. To move from model Mi to Mj, generate a random vector V of length nj − ni from a proposal density

    ψij(v) = Π_l ψ(v_l).

Identify an appropriate bijection map

    hij : Θi × R^(nj − ni) → Θj,

and propose the move from θi to θj using θj = hij(θi, V). The acceptance probability is then

    ρ((Mi, θi), (Mj, θj)) = min{1, aij(θi, θj)},  where

    aij(θi, θj) = [π(θj | Mj, x) pji(θj)] / [π(θi | Mi, x) pij(θi) ψij(v)] × |∂hij(θi, v)/∂(θi, v)|,

with pij(θi) denoting the (user-specified) probability that a proposed jump to model Mj is attempted at any step starting from θi ∈ Θi. Note that the reverse move is accepted with probability min{1, aji(θj, θi)}, where aji(θj, θi) = {aij(θi, θj)}^(−1).

Example 7.14. For illustration purposes, consider the simple problem of comparing two normal means, as in Sorensen and Gianola (2002). The two models to be compared are

    yi | M1, ν, σ² ~ N(ν, σ²),  i = 1, 2, ..., m1 + m2,  i.i.d.,

and

    yi | M2, ν1, ν2, σ² ~ N(ν1, σ²) for i = 1, 2, ..., m1;  N(ν2, σ²) for i = m1 + 1, ..., m1 + m2.

To implement the reversible jump M-H algorithm we need the map h12 taking (ν, σ², v) to (ν1, ν2, σ²). A reasonable choice for this is the linear map
    (ν1, ν2, σ²)′ = ( 1  0   1
                      1  0  −1
                      0  1   0 ) (ν, σ², v)′,

so that ν1 = ν + v, ν2 = ν − v, and σ² is carried over unchanged.
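The jump scheme of Example 7.14 can be sketched in code. The sketch below makes several simplifying assumptions that are not part of the example itself: the error variance is treated as known, the means receive N(0, τ²) priors, the auxiliary variable v is drawn from ψ = N(0, s²), and the two models have equal prior probability (Problem 12 asks for these choices to be made properly). The Jacobian of h12 is |det ∂(ν1, ν2)/∂(ν, v)| = 2.

```python
import math
import random

# Hypothetical reversible jump sketch for comparing one mean vs. two means.
# SIGMA2 (known error variance), TAU2 (prior variance on the means), and S2
# (proposal variance for v) are illustrative assumptions.

SIGMA2, TAU2, S2 = 1.0, 10.0, 1.0

def norm_logpdf(x, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def log_post1(nu, y1, y2):          # log(prior x likelihood) under M1
    return norm_logpdf(nu, 0.0, TAU2) + sum(
        norm_logpdf(y, nu, SIGMA2) for y in y1 + y2)

def log_post2(nu1, nu2, y1, y2):    # log(prior x likelihood) under M2
    return (norm_logpdf(nu1, 0.0, TAU2) + norm_logpdf(nu2, 0.0, TAU2)
            + sum(norm_logpdf(y, nu1, SIGMA2) for y in y1)
            + sum(norm_logpdf(y, nu2, SIGMA2) for y in y2))

def rjmcmc(y1, y2, n_iter, seed=0):
    rng = random.Random(seed)
    model, nu, nu1, nu2 = 1, 0.0, 0.0, 0.0
    visits = {1: 0, 2: 0}
    for _ in range(n_iter):
        if rng.random() < 0.5:      # within-model random-walk M-H move
            if model == 1:
                prop = nu + rng.gauss(0.0, 0.3)
                la = log_post1(prop, y1, y2) - log_post1(nu, y1, y2)
                if rng.random() < math.exp(min(0.0, la)):
                    nu = prop
            else:
                p1, p2 = nu1 + rng.gauss(0.0, 0.3), nu2 + rng.gauss(0.0, 0.3)
                la = log_post2(p1, p2, y1, y2) - log_post2(nu1, nu2, y1, y2)
                if rng.random() < math.exp(min(0.0, la)):
                    nu1, nu2 = p1, p2
        elif model == 1:            # jump 1 -> 2 via h12(nu, v) = (nu+v, nu-v)
            v = rng.gauss(0.0, math.sqrt(S2))
            # log a12 = log target ratio - log psi(v) + log |Jacobian| (= log 2)
            la = (log_post2(nu + v, nu - v, y1, y2) - log_post1(nu, y1, y2)
                  - norm_logpdf(v, 0.0, S2) + math.log(2.0))
            if rng.random() < math.exp(min(0.0, la)):
                model, nu1, nu2 = 2, nu + v, nu - v
        else:                       # jump 2 -> 1 via the inverse map
            nu_new, v = (nu1 + nu2) / 2.0, (nu1 - nu2) / 2.0
            la = (log_post1(nu_new, y1, y2) - log_post2(nu1, nu2, y1, y2)
                  + norm_logpdf(v, 0.0, S2) - math.log(2.0))
            if rng.random() < math.exp(min(0.0, la)):
                model, nu = 1, nu_new
        visits[model] += 1
    return visits
```

With equal prior model probabilities and p12 = p21 (which therefore cancel), the fraction of iterations spent in each model estimates its posterior probability.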
7.4.7 Convergence Issues

As we have already seen, Monte Carlo sampling based approaches to inference make use of limit theorems such as the law of large numbers and the central limit theorem to justify their validity. When we add a further dimension to this sampling and adopt MCMC schemes, stronger limit theorems are needed. Ergodic theorems for Markov chains, such as those given in equations (7.13) and (7.15), are these useful results. It may appear at first that this procedure necessarily depends on waiting until the Markov chain converges to the target invariant distribution, and then sampling from this distribution. In other words, one needs to start a large number of chains beginning with different starting points, and pick the draws after letting these chains run sufficiently long. This is certainly an option, but the law of large numbers for dependent chains, (7.13), says also that this is unnecessary, and one could just use a single long chain. It may, however, be a good idea to use many different chains to ensure that convergence indeed takes place. For details, see Robert and Casella (1999).

There is one important situation, however, where MCMC sampling can lead to absurd inferences. This is where one resorts to MCMC sampling without realizing that the target posterior distribution is not a probability distribution, but an improper one. The following example is similar to the normal problem (see Exercise 13) with lack of identifiability of parameters shown in Carlin and Louis (1996).

Example 7.15. (Example 7.11 continued.) Recall that, in this problem, X | Y, θ ~ binomial(Y, θ), where Y ~ Poisson(λ). Earlier, we worked with a known mean λ, but let us now see if it is possible to handle this problem with unknown λ. Because Y is unobservable and only X is observable, there exists an 'identifiability' problem here, as can be seen by noting that X | θ, λ ~ Poisson(λθ). We already have the Beta(α, γ) prior on θ. Suppose 0 < α < 1. Consider an independent prior on λ according to which π(λ) ∝ I(λ > 0). Then,

    π(λ, θ | x) ∝ exp(−λθ) λ^x θ^(x+α−1) (1 − θ)^(γ−1),  0 < θ < 1,  λ > 0.   (7.31)
This joint density is improper because

    ∫₀¹ ∫₀^∞ exp(−λθ) λ^x θ^(x+α−1) (1 − θ)^(γ−1) dλ dθ
      = ∫₀¹ {Γ(x + 1)/θ^(x+1)} θ^(x+α−1) (1 − θ)^(γ−1) dθ
      = Γ(x + 1) ∫₀¹ θ^(α−2) (1 − θ)^(γ−1) dθ = ∞,

since α < 1 makes the exponent α − 2 less than −1. In fact, the marginal distributions are also improper. However, the joint density has full conditional distributions that are proper:

    λ | θ, x ~ Gamma(x + 1, θ)  and  π(θ | λ, x) ∝ exp(−λθ) θ^(x+α−1) (1 − θ)^(γ−1).
Thus, for example, the Gibbs sampler can be successfully employed with these proper full conditionals. To generate θ from π(θ | λ, x), one may use the independent M-H algorithm described in Example 7.12. Any inference on the marginal posterior distributions derived from this sample, however, will be totally erroneous, whereas inferences can indeed be made on λθ. In fact, the non-convergence of the chain encountered in the above example is far from uncommon. Often, when we have a hierarchical prior, the prior at the final stage of the hierarchy is an improper objective prior. Then it is not easy to check whether the joint posterior is proper, and none of the theorems on convergence of the chains may apply, yet the chain may seem to converge. In such cases, inference based on MCMC may be misleading in the sense of what was seen in the example above.
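The Gibbs sampler of Example 7.15 can be sketched as follows. The λ update uses its Gamma full conditional, while θ is updated with one independent M-H step built from the Beta proposal of Example 7.12; the particular values of x, α, and γ in the test are illustrative. Only functions of λθ should be read off the output.

```python
import math
import random

# Gibbs sampler for the improper joint posterior (7.31):
#   lambda | theta, x ~ Gamma(x + 1, rate = theta),
#   theta  | lambda, x updated by an independent M-H step with a
#   Beta(x + alpha, gamma) proposal; the acceptance probability is
#   min{1, exp(-lambda * (proposal - current))}, as in Example 7.12.
# Draws of lambda * theta are usable (X | theta, lambda ~ Poisson(lambda*theta));
# draws of lambda or theta alone are meaningless here.

def gibbs_lambda_theta(x, alpha, gamma, n_iter, seed=0):
    rng = random.Random(seed)
    theta, lam = 0.5, 1.0
    lam_theta = []
    for _ in range(n_iter):
        lam = rng.gammavariate(x + 1, 1.0 / theta)   # gammavariate takes scale = 1/rate
        prop = rng.betavariate(x + alpha, gamma)
        if rng.random() < math.exp(min(0.0, -lam * (prop - theta))):
            theta = prop
        lam_theta.append(lam * theta)
    return lam_theta
```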
7.5 Exercises

1. (Flury and Zoppe (2000)) A total of m + n lightbulbs are tested in two independent experiments. In the first experiment, involving n lightbulbs, the exact lifetimes y1, ..., yn of all the bulbs are recorded. In the second, involving m lightbulbs, the only information available is whether these lightbulbs were still burning at some fixed time t > 0. This is known as right-censoring. Assume that the distribution of lifetime is exponential with mean 1/θ, and use π(θ) ∝ θ^(−1). Find the posterior mode using the E-M algorithm.

2. (Flury and Zoppe (2000)) In Problem 1, use uniform(0, θ) instead of exponential for the lifetime distribution, and π(θ) = I_(0,∞)(θ). Show that the E-M algorithm fails here if used to find the posterior mode.

3. (Inverse c.d.f. method) Show that, if the c.d.f. F(x) of a random variable X is continuous and strictly increasing, then U = F(X) ~ U[0, 1], and if V ~ U[0, 1], then Y = F^(−1)(V) has c.d.f. F. Using this, show that if U ~ U[0, 1], then −ln U / β is an exponential random variable with mean β^(−1).

4. (Box-Muller transformation method) Let U1 and U2 be a pair of independent Uniform(0, 1) random variables. Consider first a transformation to W = R² = −2 log U1, V = 2π U2, and then let

    X = R cos V,  Y = R sin V.

Show that X and Y are independent standard normal random variables.

5. Prove that the accept-reject Monte Carlo method given in Example 7.12 indeed generates samples from the target density. Further show that each draw is accepted with probability K/c, so that the expected number of draws required from the 'proposal density' per accepted observation is c/K.

6. Using the methods indicated in Exercises 3, 4, and 5 above, or combinations thereof, show that the standard continuous probability distributions can be simulated.
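The two simulation constructions above can be sketched as follows (an illustration of the mechanics, not a solution to the exercises); 1 − U is used in place of U only to avoid log(0), since a uniform generator may return exactly 0.

```python
import math
import random

rng = random.Random(42)  # arbitrary seed for reproducibility

def exponential(beta):
    """Inverse c.d.f. method (Exercise 3): -ln(U)/beta has mean 1/beta."""
    return -math.log(1.0 - rng.random()) / beta

def box_muller():
    """Exercise 4: one pair of independent standard normal deviates."""
    u1, u2 = 1.0 - rng.random(), rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    v = 2.0 * math.pi * u2
    return r * math.cos(v), r * math.sin(v)
```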
7. Consider a discrete probability distribution that puts mass pi on point xi, i = 0, 1, .... Let U ~ U(0, 1), and define a new random variable Y as follows:

    Y = x_k  if  Σ_{i=0}^{k−1} pi < U ≤ Σ_{i=0}^{k} pi,  k = 0, 1, ...,

with the empty sum taken as 0. What is the probability distribution of Y?

8. Show that the random sequence generated by the independent M-H algorithm is a Markov chain.

9. (Robert and Casella (1999)) Show that the Gamma distribution with a non-integer shape parameter can be simulated using the accept-reject method or the independent M-H algorithm.

10. (Gibbs sampling for the multinomial) Consider the ABO blood group problem from Rao (1973). The observed counts in the four blood groups, O, A, B, and AB, are as given in Table 7.3. Assume that the inheritance of these blood groups is controlled by three alleles, A, B, and O, of which O is recessive to A and B. There are then six genotypes OO, AO, AA, BO, BB, and AB, but only four phenotypes. If r, p, and q are the gene frequencies of O, A, and B, respectively (with p + q + r = 1), then the probabilities of the four phenotypes under Hardy-Weinberg equilibrium are also as shown in Table 7.3. Thus we have here a 4-cell multinomial probability vector that is a function of three parameters p, q, r with p + q + r = 1. One may wish to formulate a Dirichlet prior for (p, q, r), but it will not be conjugate to the 4-cell multinomial likelihood function in terms of p, q, r from the data, and this makes it difficult to work out the posterior distribution of (p, q, r). Although no data are missing in the real sense of the term, it is profitable to split each of the nA and nB cells into two: nA into nAA, nAO with corresponding probabilities p², 2pr, and nB into nBB, nBO with corresponding probabilities q², 2qr, and to consider the resulting 6-cell multinomial problem as a complete problem with nAA, nBB as 'missing' data.

Table 7.3. ABO Blood Group Data

    Cell   Count       Probability
    O      nO  = 176   r²
    A      nA  = 182   p² + 2pr
    B      nB  = 60    q² + 2qr
    AB     nAB = 17    2pq

Let N = nO + nA + nB + nAB, and denote the observed data by n = (nO, nA, nB, nAB). Consider estimation of p, q, r using a Dirichlet prior with parameters (α, β, γ) and the 'incomplete' observed data n. The likelihood, up to a multiplicative constant, is

    L(p, q, r) = r^(2nO) (p² + 2pr)^(nA) (q² + 2qr)^(nB) (pq)^(nAB).
The posterior density of (p, q, r) given n is proportional to

    r^(2nO) (p² + 2pr)^(nA) (q² + 2qr)^(nB) (pq)^(nAB) p^(α−1) q^(β−1) r^(γ−1).

Let nA = nAA + nAO, nB = nBB + nBO, and write nOO for nO. Verify that if we had the 'complete' data n* = (nOO, nAA, nAO, nBB, nBO, nAB), then the likelihood would be, up to a multiplicative constant,

    (p²)^(nAA) (q²)^(nBB) (r²)^(nOO) (2pq)^(nAB) (2qr)^(nBO) (2pr)^(nAO) ∝ p^(2n*A) q^(2n*B) r^(2n*O),

where n*A = nAA + (nAB + nAO)/2, n*B = nBB + (nAB + nBO)/2, and n*O = nOO + (nAO + nBO)/2. Show that the posterior distribution of (p, q, r) given n* is Dirichlet with parameters (2n*A + α, 2n*B + β, 2n*O + γ) when the prior is Dirichlet with parameters (α, β, γ). Show that the conditional distribution of (nAA, nBB) given n and (p, q, r) is that of two independent binomials:

    nAA | n, p, q, r ~ binomial(nA, p²/(p² + 2pr)),
    nBB | n, p, q, r ~ binomial(nB, q²/(q² + 2qr)),  and
    (p, q, r) | n, nAA, nBB ~ Dirichlet(2n*A + α, 2n*B + β, 2n*O + γ).

Show that the Rao-Blackwellized estimate of (p, q, r) from a Gibbs sample of size m is

    (1/m) Σ_{i=1}^m (α + 2n*A^(i), β + 2n*B^(i), γ + 2n*O^(i)) / (α + β + γ + 2N),
where the superscript (i) denotes the ith draw.

11. (M-H for the Weibull model: Robert (2001)) The following twelve observations are from a simulated reliability study: 0.56, 2.26, 1.90, 0.94, 1.40, 1.39, 1.00, 1.45, 2.32, 2.08, 0.89, 1.68. A Weibull model with density of the form

    f(x | α, η) ∝ α η x^(α−1) e^(−η x^α),  0 < x < ∞,

with parameters (α, η), is considered appropriate. Consider the prior distribution

    π(α, η) ∝ e^(−α) η^(β−1) e^(−ξη),

where β and ξ are specified hyperparameters. The posterior distribution of (α, η) given the data (x1, x2, ..., xn) then has density

    π(α, η | x1, x2, ..., xn) ∝ (αη)^n (Π_{i=1}^n xi)^(α−1) exp{−η Σ_{i=1}^n xi^α} π(α, η).

To get a sample from the posterior density, one may use the M-H algorithm with proposal density

    q(α′, η′ | α, η) = (1/(αη)) exp{−(α′/α + η′/η)},

which is the product of two independent exponential densities with means α and η. Compute the acceptance probability ρ((α′, η′), (α^(t), η^(t))) at the tth step of the M-H chain, and explain how the chain is to be generated.

12. Complete the construction of the reversible jump M-H algorithm in Example 7.14. In particular, choose an appropriate prior distribution and proposal distribution, and compute the acceptance probabilities.

13. (Carlin and Louis (1996)) Suppose y1, y2, ..., yn is an i.i.d. sample with

    yi | θ1, θ2 ~ N(θ1 + θ2, σ²),
where σ² is assumed to be known, and independent improper uniform prior distributions are assumed for θ1 and θ2.
(a) Show that the posterior density of (θ1, θ2) given y is

    π(θ1, θ2 | y) ∝ exp(−n(θ1 + θ2 − ȳ)² / (2σ²)) I((θ1, θ2) ∈ R²),

which is improper, integrating to ∞ over R².
(b) Show that the marginal posterior distributions are also improper.
(c) Show that the full conditional distributions of this posterior distribution are proper.
(d) Explain why a sample generated using the Gibbs sampler based on these proper full conditionals will be totally useless for any inference on the marginal posterior distributions, whereas inferences can indeed be made on θ1 + θ2.

14. Suppose X1, X2, ..., Xk are independent Poisson counts, with Xi having mean θi. The θi are a priori considered related, but exchangeable, and the prior

    π(θ1, ..., θk) ∝ (1 + Σ_{i=1}^k θi)^(−(k+1))

is assumed.
(a) Show that the prior is a conjugate mixture.
(b) Show how the Gibbs sampler can be employed for inference.
15. Suppose X1, X2, ..., Xn are i.i.d. random variables with Xi | λ1, λ2 ~ exponential with mean 1/(λ1 λ2), and independent scale-invariant non-informative priors on λ1 and λ2 are used, i.e., π(λ1, λ2) ∝ (λ1 λ2)^(−1) I(λ1 > 0, λ2 > 0).
(a) Show that the marginals of the posterior π(λ1, λ2 | x) are improper, but the full conditionals are standard distributions.
(b) What posterior inferences are possible based on a sample generated from the Gibbs sampler using these full conditionals?
8 Some Common Problems in Inference

We have already discussed some basic inference problems in the previous chapters, including problems involving the normal mean and the binomial proportion. Some other commonly encountered problems are discussed in what follows.
8.1 Comparing Two Normal Means

Investigating the difference between two mean values or two proportions is a frequently encountered problem. Examples include agricultural experiments where two different varieties of seeds or fertilizers are employed, and clinical trials involving two different treatments. Comparison of two binomial proportions was considered in Example 4.6 and Problem 8 in Chapter 4. Comparison of two normal means is discussed below.

Suppose the model for the available data is as follows. Y11, ..., Y1n1 is a random sample of size n1 from a normal population N(θ1, σ1²), whereas Y21, ..., Y2n2 is an independent random sample of size n2 from another normal population N(θ2, σ2²). All four parameters θ1, θ2, σ1², and σ2² are unknown, but the quantity of inferential interest is η = θ1 − θ2. It is convenient to consider the case σ1² = σ2² = σ² separately. In this case, (Ȳ1, Ȳ2, s²) is jointly sufficient for (θ1, θ2, σ²), where s² = (Σ_{i=1}^{n1}(Y1i − Ȳ1)² + Σ_{i=1}^{n2}(Y2i − Ȳ2)²) / (n1 + n2 − 2). Further, given (θ1, θ2, σ²),

    Ȳ1 ~ N(θ1, σ²/n1),  Ȳ2 ~ N(θ2, σ²/n2),  and  (n1 + n2 − 2)s² ~ σ² χ²_{n1+n2−2},

and they are independently distributed. Upon utilizing the objective prior π(θ1, θ2, σ²) ∝ σ^(−2), one obtains

    π(θ1, θ2, σ² | data) = π(θ1 | σ², ȳ1) π(θ2 | σ², ȳ2) π(σ² | s²),

and hence
    π(η, σ² | data) = π(η | σ², ȳ1, ȳ2) π(σ² | s²).   (8.1)

Now, note that

    η | σ², ȳ1, ȳ2 ~ N(ȳ1 − ȳ2, σ²(1/n1 + 1/n2)).

Consequently, integrating out σ² from (8.1) yields

    π(η | data) ∝ [1 + (η − (ȳ1 − ȳ2))² / {(n1 + n2 − 2) s² (1/n1 + 1/n2)}]^(−(n1+n2−1)/2),   (8.2)

or, equivalently,

    (η − (ȳ1 − ȳ2)) / (s √(1/n1 + 1/n2)) | data ~ t_{n1+n2−2}.
In many situations, the assumption that σ1² = σ2² is not tenable. For example, in a clinical trial the populations corresponding to two different treatments may have very different spread. This problem of comparing means when we have unequal and unknown variances is known as the Behrens-Fisher problem, and a frequentist approach to it has already been discussed in Problem 17, Chapter 2. We discuss the Bayesian approach now. We have that (Ȳ1, s1²) is sufficient for (θ1, σ1²) and (Ȳ2, s2²) is sufficient for (θ2, σ2²), where s_i² = Σ_{j=1}^{n_i}(Y_ij − Ȳ_i)² / (n_i − 1), i = 1, 2. Also, given (θ1, θ2, σ1², σ2²),

    Ȳ_i ~ N(θ_i, σ_i²/n_i)  and  (n_i − 1)s_i² ~ σ_i² χ²_{n_i−1},  i = 1, 2,

and further, they are all independently distributed. Now employ the objective prior π(θ1, θ2, σ1², σ2²) ∝ σ1^(−2) σ2^(−2), and proceed exactly as in the previous case. It then follows that under the posterior distribution also, θ1 and θ2 are independent, and that

    (θ1 − ȳ1) / (s1/√n1) | data ~ t_{n1−1}  and  (θ2 − ȳ2) / (s2/√n2) | data ~ t_{n2−1}.   (8.3)
It may be immediately noted that the posterior distribution of η = θ1 − θ2, however, is not a standard distribution. Posterior computations are still quite easy to perform because Monte Carlo sampling is totally straightforward: simply generate independent deviates θ1 and θ2 repeatedly from (8.3) and use the corresponding η = θ1 − θ2 values to investigate its posterior distribution. Problem 4 is expected to apply these results. Extension to the k-mean problem, or one-way ANOVA, is straightforward. A hierarchical Bayes approach to this problem and its implementation using MCMC have already been discussed in Example 7.13 in Chapter 7.
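The Monte Carlo scheme just described can be sketched as follows. Each θ_i is drawn from its posterior t distribution in (8.3); a t deviate with d degrees of freedom is formed as N(0,1)/√(χ²_d / d), and the summary statistics used in the test are hypothetical.

```python
import math
import random

# Posterior draws of eta = theta1 - theta2 under the Behrens-Fisher setup:
# (theta_i - ybar_i)/(s_i/sqrt(n_i)) | data ~ t_{n_i - 1}, independently.

def student_t(rng, df):
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))  # chi-square_df
    return z / math.sqrt(chi2 / df)

def eta_draws(ybar1, s1, n1, ybar2, s2, n2, m, seed=0):
    rng = random.Random(seed)
    draws = []
    for _ in range(m):
        th1 = ybar1 + (s1 / math.sqrt(n1)) * student_t(rng, n1 - 1)
        th2 = ybar2 + (s2 / math.sqrt(n2)) * student_t(rng, n2 - 1)
        draws.append(th1 - th2)
    return draws
```

Any posterior summary of η (mean, quantiles, a histogram) can then be computed from the returned draws.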
8.2 Linear Regression

We encountered normal linear regression in Section 5.4, where we discussed prior elicitation issues in the context of the problem of inference on a response variable Y conditional on some predictor variable X. Normal linear models in general, and regression models in particular, are very widely used. We have already seen an illustration of this in Example 7.13 in Section 7.4.6. We intend to cover some of the important inference problems related to regression in this section.

Extending the simple linear regression model, where E(Y | β0, β1, X = x) = β0 + β1 x, to the multiple linear regression case, E(Y | β, X = x) = β′x, yields the linear model

    y = Xβ + ε,   (8.4)

where y is the n-vector of observations, X the n × p matrix having the appropriate readings from the predictors, β the p-vector of unknown regression coefficients, and ε the n-vector of random errors with mean 0 and constant variance σ². The parameter vector then is (β, σ²), and most often the statistical inference problem involves estimation of β as well as testing hypotheses involving the same parameter vector. For convenience, we assume that X has full column rank p < n. We also assume that the first column of X is the vector of 1's, so that the first element of β, namely β1, is the intercept. If we assume that the random errors are independent normals, we obtain the likelihood function for (β, σ²) as

    l(β, σ² | y) = (2πσ²)^(−n/2) exp{−(1/(2σ²)) (y − Xβ)′(y − Xβ)}
                 = (2πσ²)^(−n/2) exp{−(1/(2σ²)) [(y − ŷ)′(y − ŷ) + (β − β̂)′X′X(β − β̂)]},   (8.5)
where β̂ = (X′X)^(−1) X′y and ŷ = Xβ̂. It then follows that β̂ is sufficient for β if σ² is known, and that (β̂, (y − ŷ)′(y − ŷ)) is jointly sufficient for (β, σ²). Further,

    β̂ | β, σ² ~ N_p(β, σ²(X′X)^(−1))  and  (y − ŷ)′(y − ŷ) | σ² ~ σ² χ²_{n−p}.

We take the prior

    π(β, σ²) ∝ 1/σ².   (8.6)

This leads to the posterior

    π(β, σ² | y) = π(β | σ², y) π(σ² | (y − ŷ)′(y − ŷ)).   (8.7)
It can be seen that

    β | σ², y ~ N_p(β̂, σ²(X′X)^(−1)),

and that the posterior distribution of σ² is proportional to an inverse χ²_{n−p}. Integrating out σ² from this joint posterior density yields the multivariate t marginal posterior density for β, i.e.,

    π(β | y) = [Γ(n/2) |X′X|^(1/2) s^(−p)] / [(Γ(1/2))^p Γ((n − p)/2) (√(n − p))^p]
               × [1 + (β − β̂)′X′X(β − β̂) / ((n − p)s²)]^(−n/2),   (8.8)
where s² = (y − ŷ)′(y − ŷ) / (n − p). From this, it can be deduced that the posterior mean of β is β̂ if n > p + 2, and the 100(1 − α)% HPD credible region for β is given by the ellipsoid

    {β : (β − β̂)′X′X(β − β̂) ≤ p s² F_{p,n−p}(α)},   (8.9)
where F_{p,n−p}(α) is the (1 − α) quantile of the F_{p,n−p} distribution. Further, if one is interested in a particular βj, the fact that the marginal posterior distribution of βj is given by

    (βj − β̂j) / (s √d_jj) | data ~ t_{n−p},   (8.10)
where d_jj is the jth diagonal entry of (X′X)^(−1), can be used. Conjugate priors for the normal regression model are of interest, especially if hierarchical prior modeling is desired. This discussion, however, will be deferred to the following chapters, where hierarchical Bayesian analysis is discussed.

Example 8.1. Table 8.1 shows the maximum January temperatures (in degrees Fahrenheit), from 1931 to 1960, for 62 cities in the U.S., along with their latitude (degrees), longitude (degrees), and altitude (feet). (See Mosteller and Tukey, 1977.) It is of interest to relate the information supplied by the geographical coordinates to the maximum January temperatures. The following summary measures are obtained:

    X′X = ( 62.0       2365.0      5674.0      56012.0
            2365.0     92955.0     217285.0    2244586.0
            5674.0     217285.0    538752.0    5685654.0
            56012.0    2244586.0   5685654.0   1.7720873 × 10^8 ),
Table 8.1. Maximum January Temperatures for U.S. Cities, with Latitude, Longitude, and Altitude

[62 rows: city; latitude (degrees); longitude (degrees); altitude (feet); maximum January temperature (°F). The tabulated values are not reproduced here.]
    (X′X)^(−1) = 10^(−5) × ( 94883.1914   −1342.5011   −485.0209    2.5756
                             −1342.5011    37.8582     −0.8276     −0.0286
                             −485.0209    −0.8276       5.8951     −0.0254
                              2.5756      −0.0286      −0.0254      0.0009 ),
[Table 8.1, continued.]
    X′y = (2739.0, 99168.0, 252007.0, 2158463.0)′,

    β̂ = (X′X)^(−1) X′y = (100.8260, −1.9315, 0.2033, −0.0017)′,  and  s = 6.05185.

On the basis of the analysis explained above, β̂ may be taken as the estimate of β, and the HPD credible region for it can be derived using (8.9). Suppose instead one is interested in the impact of latitude on maximum January temperatures. Then the 95% HPD region for the corresponding regression coefficient β2 can be obtained using (8.10). This yields the t-interval β̂2 ± s √d22 t58(.975), or (−2.1623, −1.7007), indicating an expected general drop in maximum temperatures as one moves away from the Equator. If the joint impact of latitude and altitude is also of interest, then one would look at the HPD credible set for (β2, β4). This is given by

    {(β2, β4) : (β2 + 1.9315, β4 + 0.0017) C^(−1) (β2 + 1.9315, β4 + 0.0017)′ ≤ 2 s² F_{2,58}(α)},
where C is the appropriate 2 × 2 block from (X′X)^(−1),

    C = 10^(−4) × (  3.7858            −2.8636 × 10^(−3)
                    −2.8636 × 10^(−3)   9.2635 × 10^(−5) ).

Fig. 8.1. Plot of 95% and 99% HPD credible regions for (β2, β4).
Plotted in Figure 8.1 are the 95% and 99% HPD credible regions for (β2, β4). The impact of altitude on maximum temperatures seems to be very limited for the case under consideration. The literature on the Bayesian approach to linear regression is very large; some of the material relevant to the discussion given above may be found in Box and Tiao (1973), Leamer (1978), and Gelman et al. (1995).
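As an illustration of the formulas above, the following sketch computes β̂, s², and the diagonal of (X′X)^(−1) for simple linear regression (p = 2, intercept plus one covariate), from which the marginal t-intervals of (8.10) follow directly; the data in the test are hypothetical, not the Table 8.1 values.

```python
# Posterior summaries for the normal linear model under pi(beta, sigma^2)
# proportional to 1/sigma^2, specialized to p = 2 so the normal equations
# can be solved in closed form:
#   beta_hat = (X'X)^{-1} X'y,  s^2 = (y - y_hat)'(y - y_hat)/(n - p),
#   beta_j | data ~ beta_hat_j + s*sqrt(d_jj) * t_{n-p}   (cf. (8.10)).

def posterior_summaries(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx                  # det(X'X) for X = [1, x]
    b1 = (n * sxy - sx * sy) / det           # slope
    b0 = (sy - b1 * sx) / n                  # intercept
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)
    d00, d11 = sxx / det, n / det            # diagonal of (X'X)^{-1}
    return (b0, b1), s2, (d00, d11)
```

A 95% HPD interval for the slope, for instance, is b1 ± t_{n−2}(.975) √(s2 · d11), with the t quantile taken from tables or a statistics library.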
8.3 Logit Model, Probit Model, and Logistic Regression

We consider a problem here that is related to linear regression but actually belongs to the broad class of generalized linear models. This model is useful for problems involving toxicity tests and bioassay experiments. In such experiments, various dose levels of drugs are usually administered to batches of animals. Most often the responses are dichotomous, because what is observed is whether the subject is dead or whether a tumor has appeared. This leads to a setup that can be easily understood in the context of the following example.

Example 8.2. Suppose that k independent random variables y1, y2, ..., yk are observed, where yi has the B(ni, pi) probability distribution, 1 ≤ i ≤ k. Here yi may be the number of laboratory animals cured of an ailment in an experiment involving ni such animals. It is certainly possible to make inferences on each pi separately based on the observed yi (as discussed previously). This, however, is not really useful if we want to predict the results of a similar experiment in the future. Suppose that the pi are related to a covariate or an explanatory variable, such as the dosage level in a clinical experiment. Then the natural approach is regression, as described in the previous section, because this allows us to explore and present the relationship between design (explanatory) variables and response variables, and (if needed) predictions of response at desired levels of the explanatory variables. Let ti be the value of the covariate that corresponds to pi, i = 1, 2, ..., k. Because the pi are probabilities, linking them to the corresponding ti through a linear map, as was done earlier, does not seem appropriate now. Instead, the link can be made through a link function H such that pi = H(β0 + β1 ti). Here H is a known cumulative distribution function (c.d.f.) and β0 and β1 are two unknown parameters. (If H is an invertible function, this is precisely H^(−1)(pi) = β0 + β1 ti.) If the standard normal c.d.f. is used for H, the model is called the probit model, and if the logistic form H(z) = e^(−z)/(1 + e^(−z)) is used, it is called the logit model. The likelihood function for the unknown parameters β0 and β1 is then given by
    L(β0, β1) = Π_{i=1}^k C(ni, yi) H(β0 + β1 ti)^(yi) (1 − H(β0 + β1 ti))^(ni − yi).

Suppose π(β0, β1) is the prior density on (β0, β1), so that the posterior density is

    π(β0, β1 | data) = π(β0, β1) Π_{i=1}^k H(β0 + β1 ti)^(yi) (1 − H(β0 + β1 ti))^(ni − yi)
                       / ∫∫ π(a, b) Π_{i=1}^k H(a + b ti)^(yi) (1 − H(a + b ti))^(ni − yi) da db.
dadb
It may be noted that the sample size rii and the covariates ti (dose level) are treated here as fixed, or equivalently the model conditional on those variables is analyzed (as in the linear regression problem). More generally, instead of a single covariate t, if we have a set of s covariates represented by x and the corresponding vector of coefl[icients /3, the posterior density of /3 will be k
^(/3|data) oc 7r(/3) ]][i/(/3'x,)^^ (1 - i/(/3'xO)"^"^^ •
(8.11)
i=l
8.3.1 The Logit Model

If we use the logit model, whereby p_i = exp{-β'x_i}/(1 + exp{-β'x_i}), and hence -log(p_i/(1 - p_i)) = β'x_i, we obtain the likelihood function
8.3 Logit Model, Probit Model, and Logistic Regression
247
L(β | data) = ∏_{i=1}^k (n_i choose y_i) (e^{-β'x_i}/(1 + e^{-β'x_i}))^{y_i} (1/(1 + e^{-β'x_i}))^{n_i - y_i}
            = exp{-β' Σ_{i=1}^k y_i x_i} ∏_{i=1}^k (n_i choose y_i) (1 + exp(-β'x_i))^{-n_i},   (8.12)
which is largely intractable (but see Problem 7). Usually a flat prior such as π(β) ∝ 1 is employed, but because β can be considered to be a vector of regression coefficients under reparameterization, an approximate conjugate normal prior can also be used. In such a case, a hierarchical prior structure is also meaningful. To motivate the approximate conjugate hierarchical prior structure, consider the following large sample approximation. For simplicity, let there be only one covariate t. Assume that the n_i are large enough for a satisfactory normal approximation of the binomial model. If we let p̂_i = y_i/n_i, then (approximately) these p̂_i are independent N(p_i, p_i(1 - p_i)/n_i) random variates. Now let θ_i = -log(p_i/(1 - p_i)) and θ̂_i = -log(p̂_i/(1 - p̂_i)). It can be shown that, approximately, (θ̂_i - θ_i)√(n_i p_i(1 - p_i)) are independent N(0, 1) random variates. Then (again approximately), the likelihood function for (β₀, β₁) is

L(β₀, β₁ | data) = exp{-(1/2) Σ_{i=1}^k w_i (θ̂_i - (β₀ + β₁t_i))²},   (8.13)
where w_i = n_i p̂_i(1 - p̂_i) are known weights. This suggests a bivariate normal prior on (β₀, β₁) as the first level in the hierarchical structure. Now the problem is very similar to that discussed in Section 5.4. Further, there is also another important use for the approximate likelihood in (8.13). Its product with the conjugate normal prior discussed above can be used as a natural proposal density for the M-H algorithm in the computations required for inferential purposes (see Problem 9). If instead a flat prior on β is to be employed, then (8.13) itself (up to a constant) can be used as the proposal density.

Example 8.3. (Example 8.2 continued). The data given in Table 8.2 are from Finney (1971) (originally from Martin (1942)), where results of spraying rotenone of different concentrations on chrysanthemum aphids in batches of about fifty are presented. The concentration is in milligrams per liter of the solution and the dosage x is measured on the logarithmic scale. The median lethal dose LD50, the dose at which 50% of the subjects will expire, is one of the quantities of inferential interest. The plot of p̂ = y/n against x shown in Figure 8.2 is S-shaped, so a linear fit for p̂ based on x is unsatisfactory. Instead, as suggested by Figure 8.3, logistic regression is more appropriate here. Suppose that a flat prior on (β₀, β₁) is to be used. Then the implementation of the M-H algorithm as explained above using the bivariate normal proposal density is straightforward. A scatter plot of 1000 values of (β₀, β₁) so obtained is shown in Figure 8.4.
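For readers who want to reproduce Example 8.3, the sampler can be sketched in a few lines of Python with NumPy. This is an illustration, not the book's code: it uses the Table 8.2 data and a flat prior, but, for simplicity, a plain random-walk proposal with ad hoc step sizes instead of the normal-approximation proposal built from (8.13).

```python
import numpy as np

# Rotenone data of Table 8.2; x is log10(concentration)
x = np.array([0.4150, 0.5797, 0.7076, 0.8865, 1.0086])
n = np.array([50, 48, 46, 49, 50])
y = np.array([6, 16, 24, 42, 44])

def log_post(b0, b1):
    # Flat prior; the book's convention p_i = e^{-(b0+b1 x_i)} / (1 + e^{-(b0+b1 x_i)})
    eta = b0 + b1 * x
    return np.sum(-y * eta - n * np.logaddexp(0.0, -eta))

rng = np.random.default_rng(1)
beta = np.array([4.8, -7.0])      # start near the MLE reported in Table 8.3
step = np.array([0.5, 0.7])       # random-walk proposal scales (tuning choices)
draws = []
for _ in range(5000):
    prop = beta + step * rng.standard_normal(2)
    if np.log(rng.uniform()) < log_post(*prop) - log_post(*beta):
        beta = prop
    draws.append(beta.copy())
draws = np.array(draws[1000:])    # discard burn-in

b0, b1 = draws.mean(axis=0)
ld50 = -b0 / b1                   # LD50 = -beta0/beta1 on the log-dose scale
```

With these data the posterior means land near the values of Table 8.3, and the LD50 estimate near 0.68.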
248
8 Some Common Problems in Inference
Fig. 8.2. Plot of proportion of deaths against dosage level.
Fig. 8.3. Plot of the logistic response function e^{-(β₀+β₁x)}/(1 + e^{-(β₀+β₁x)}).
Table 8.2. Toxicity of Rotenone
Concentration (c_i)   Dose (x_i = log c_i)   Batch Size (n_i)   Deaths (y_i)
 2.6                  0.4150                 50                  6
 3.8                  0.5797                 48                 16
 5.1                  0.7076                 46                 24
 7.7                  0.8865                 49                 42
10.2                  1.0086                 50                 44
Fig. 8.4. Scatter plot of 1000 (β₀, β₁) values sampled using the M-H algorithm.
A comparison of the estimates of β₀ and β₁ obtained using MLE and posterior means is shown in Table 8.3. Histograms of the M-H samples of β₀ and β₁ are shown in Figure 8.5 and Figure 8.6, respectively. They seem to be skewed, and hence the posterior estimates seem more appropriate. Let us consider the estimation of LD50 next. Note that for the logit model LD50 is that dosage level t₀ for which E(y_i/n_i | t_i = t₀) = exp(-(β₀ + β₁t₀))/(1 + exp(-(β₀ + β₁t₀))) = 0.5. This means that LD50 = -β₀/β₁. We
Table 8.3. Estimates of β₀ and β₁ from Different Methods
Method                         β₀       s.e.(β₀)   β₁       s.e.(β₁)
MLE from logistic regression   4.826               -7.065
Posterior estimates            4.9727   0.6312     -7.266   0.8859
Fig. 8.5. Histogram of 1000 β₀ values sampled using the M-H algorithm.
Fig. 8.6. Histogram of 1000 β₁ values sampled using the M-H algorithm.
can easily estimate this using our M-H sample, and we obtain an estimate of 0.6843 (on the logarithmic scale) with a standard error of 0.022.

8.3.2 The Probit Model

As mentioned previously, if the standard normal c.d.f. Φ is used for the link function H above, we obtain the probit model. Then, assuming that π(β) is the prior density on β, the posterior density is obtained as
π(β | data) ∝ π(β) ∏_{i=1}^k Φ(β'x_i)^{y_i} (1 - Φ(β'x_i))^{n_i - y_i}.   (8.14)
Analytically, this is even less tractable than the posterior density for the logit model. However, the following computational scheme developed by Albert and Chib (1993) based on the Gibbs sampler provides a convenient strategy. First consider the simpler case involving Bernoulli y_i's, i.e., n_i = 1. Then,

π(β | data) ∝ π(β) ∏_{i=1}^k Φ(β'x_i)^{y_i} (1 - Φ(β'x_i))^{1 - y_i}.
The computational scheme then proceeds by introducing k independent latent variables Z₁, Z₂, …, Z_k, where Z_i ~ N(β'x_i, 1). If we let Y_i = I(Z_i > 0), then Y₁, …, Y_k are independent Bernoulli with p_i = P(Y_i = 1) = Φ(β'x_i). Now note that the posterior density of β and Z = (Z₁, …, Z_k) given y = (y₁, …, y_k) is

π(β, Z | y) ∝ π(β) ∏_{i=1}^k {I(Z_i > 0)I(y_i = 1) + I(Z_i ≤ 0)I(y_i = 0)} φ(Z_i - β'x_i),   (8.15)

where φ is the standard normal p.d.f. Even though (8.15) is not a joint density that allows sampling from it directly, the Gibbs sampler can handle it, since π(β|Z, y) and π(Z|β, y) both allow sampling. It is clear that
π(β | Z, y) ∝ π(β) ∏_{i=1}^k φ(Z_i - β'x_i),   (8.16)
whereas

Z_i | β, y ~ N(β'x_i, 1) truncated at the left by 0 if y_i = 1, and truncated at the right by 0 if y_i = 0.   (8.17)
Sampling Z from (8.17) is straightforward. On the other hand, (8.16) is simply the posterior density for the regression parameters in the normal linear model with error variance 1. Therefore, if a flat noninformative prior on /3 is used, then
Table 8.4. Lethality of Chloracetic Acid
Dose     Fatalities    Dose     Fatalities
.0794    1             .1778    4
.1000    2             .1995    6
.1259    1             .2239    4
.1413    0             .2512    5
.1500    1             .2818    5
.1558    2             .3162    8
β | Z, y ~ N_p(β̂_Z, (X'X)⁻¹), where X = (x₁, …, x_k)' and β̂_Z = (X'X)⁻¹X'Z. If a proper normal prior is used, a different normal posterior will emerge. In either case, it is easy to sample β from this posterior conditional on Z. Extension of this scheme to binomial counts Y₁, Y₂, …, Y_k is straightforward. We let Y_i = Σ_{j=1}^{n_i} Y_{ij} where Y_{ij} = I(Z_{ij} > 0), with Z_{ij} ~ N(β'x_i, 1) independent, j = 1, 2, …, n_i, i = 1, 2, …, k. We then proceed exactly as above, but at each Gibbs step we will need to generate Σ_{i=1}^k n_i many (truncated) normals Z_{ij}. This procedure is employed in the following example.

Example 8.4. (Example 8.2 continued). Consider the data given in Table 8.4, taken from Woodward et al. (1941), where several data sets on toxicity of certain acids were reported. This particular data set examines the relationship between exposure to chloracetic acid and the death of mice. At each of the dose levels (measured in grams per kilogram of body weight), ten mice were exposed. The median lethal dose LD50 is again one of the quantities of inferential interest. The Gibbs sampler as explained above is employed to generate a sample from the posterior distribution of (β₀, β₁) given the data. A scatter plot of 1000 points so generated is shown in Figure 8.7. From this sample we have the estimate of (-1.4426, 5.9224) for (β₀, β₁), along with standard errors of 0.4451 and 2.0997, respectively. To estimate the LD50, note that for the probit model LD50 is the dosage level t₀ for which E(y_i/n_i | t_i = t₀) = Φ(β₀ + β₁t₀) = 0.5. As before, this implies that LD50 = -β₀/β₁. Using the sample provided by the Gibbs sampler we estimate this to be 0.248.
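A minimal NumPy version of this Gibbs sampler for Example 8.4 can be sketched as follows. This is an illustration under the flat prior; the dose-fatality pairs are those of Table 8.4 as reconstructed above, and the truncated normals of (8.17) are drawn by simple rejection, which is adequate here because the means β'x_i stay moderate.

```python
import numpy as np

def rtruncnorm(mean, positive, rng):
    # Z ~ N(mean, 1) truncated to (0, inf) where positive is True, to (-inf, 0] elsewhere
    z = mean + rng.standard_normal(mean.shape)
    bad = (z > 0) != positive
    while bad.any():
        z[bad] = mean[bad] + rng.standard_normal(int(bad.sum()))
        bad = (z > 0) != positive
    return z

dose = np.array([.0794, .1000, .1259, .1413, .1500, .1558,
                 .1778, .1995, .2239, .2512, .2818, .3162])
deaths = np.array([1, 2, 1, 0, 1, 2, 4, 6, 4, 5, 5, 8])
m = 10                                     # mice per dose level

# one Bernoulli indicator (and one latent Z_ij) per mouse
X = np.column_stack([np.ones(dose.size * m), np.repeat(dose, m)])
ybin = np.concatenate([[1] * int(d) + [0] * (m - int(d)) for d in deaths]).astype(bool)

XtX_inv = np.linalg.inv(X.T @ X)
L = np.linalg.cholesky(XtX_inv)
rng = np.random.default_rng(2)
beta, keep = np.zeros(2), []
for t in range(1500):
    z = rtruncnorm(X @ beta, ybin, rng)            # draw Z from (8.17)
    bhat = XtX_inv @ (X.T @ z)
    beta = bhat + L @ rng.standard_normal(2)       # draw beta from (8.16), flat prior
    if t >= 500:
        keep.append(beta.copy())
keep = np.array(keep)
b0, b1 = keep.mean(axis=0)
ld50 = -b0 / b1
```

The posterior means come out close to the (-1.44, 5.92) reported in the text, with LD50 near 0.25.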
8.4 Exercises

1. Show how a random deviate from the Student's t is to be generated.
2. Construct the 95% HPD credible set for θ₁ - θ₂ for the two-sample problem in Section 8.1 assuming σ₁² = σ₂².
Fig. 8.7. Scatter plot of 1000 (β₀, β₁) values sampled using the Gibbs sampler.
3. Show that Student's t can be expressed as a scale mixture of normals. Using this fact, explain how the 95% HPD credible set for θ₁ - θ₂ can be constructed for the case given in (8.3).
4. Consider the data in Table 8.5 from a clinical trial conducted by Mr. S. Sahu, a medical student at Bangalore Medical College, Bangalore, India (personal communication). The objective of the study was to compare two treatments, surgical and non-surgical medical, for short-term management of benign prostatic hyperplasia (enlargement of prostate). The random observable of interest is the 'improvement score' recorded for each of the patients by the physician, which we assume to be normally distributed. There were 15 patients in the non-surgical group and 14 in the surgical group.
Table 8.5. Improvement Scores
Surgical:      15  9 12 16 14 15 18 13 12 11 15  9 16  9
Non-surgical:   6  8  7  4  4  6  8  3  7  8  9  6  3  6  4
Apply the results from Problems 2 and 3 above to make inferences about the difference in mean improvement in this problem.
5. Consider the linear regression model (8.4).
(a) Show that (β̂, (y - ŷ)'(y - ŷ)) is jointly sufficient for (β, σ²).
(b) Show that β̂|σ² ~ N_p(β, σ²(X'X)⁻¹), that (y - ŷ)'(y - ŷ)|σ² ~ σ²χ²_{n-p}, and that they are independently distributed.
(c) Under the default prior (8.6), show that β|y has the multivariate t distribution having density (8.8).
6. Construct the 95% HPD credible set for (β₂, β₃) in Example 8.1.
7. Consider the model given in (8.12).
(a) What is a sufficient statistic for β?
(b) Show that the likelihood equations for deriving the MLE of β must satisfy

Σ_{i=1}^k [n_i exp{-β'x_i}/(1 + exp{-β'x_i})] x_{ij} = Σ_{i=1}^k y_i x_{ij},   j = 1, 2, …, s.
8. Justify the approximate likelihood given in (8.13).
9. Consider a multivariate normal prior on β for Problem 7.
(a) Explain how the M-H algorithm can be used for computing the posterior quantities.
(b) Compare the above scheme with an importance sampling strategy where the importance function is proportional to the product of the normal prior and the approximate normal likelihood given in (8.13).
10. Apply the results from Problem 9 to Example 8.3 with some appropriate choice for the hyperparameters.
11. Justify (8.16) and (8.17), and explain how the Gibbs sampler handles (8.15).
12. Analyze the problem in Example 8.4 with an additional data point having 9 fatalities at the dosage level of 0.3400.
13. Analyze the problem in Example 8.4 using logistic regression. Compare the results with those obtained in Section 8.3.2 using the probit model.
9 High-dimensional Problems
Rather than begin by defining what is meant by high-dimensional, we begin with a couple of examples.

Example 9.1. (Stein's example) Let N(μ_{p×1}, Σ_{p×p} = σ²I_{p×p}) be a p-variate normal population and let X_i = (X_{i1}, …, X_{ip}), i = 1, …, n, be n i.i.d. p-variate samples. Because Σ = σ²I, we may alternatively think of the data as p independent samples of size n from p univariate normal populations N(μ_j, σ²), j = 1, …, p. The parameters of interest are the μ_j's. For convenience, we initially assume σ² is known. Usually, the number of parameters, p, is large and the sample size n is small compared with p. These have been called problems with large p, small n. Note that n in Stein's example is the sample size, if we think of the data as a p-variate sample of size n. However, we could also think of the data as univariate samples of size n from each of p univariate populations. Then the total sample size would be np. The second interpretation leads to a class of similar examples. Note that the observations are not exchangeable except in subgroups; in this sense one may call them partially exchangeable.

Example 9.2. Let f(x|μ_j), j = 1, …, p, denote the densities for p populations, and let X_{ij}, i = 1, …, n, j = 1, …, p, denote p samples of size n from these p populations. As in Example 9.1, f(x|μ_j) may contain additional common parameters. The object is to make inference about the μ_j's. In several path-breaking papers, Stein (1955), James and Stein (1960), Stein (1981), Robbins (1955, 1964), and Efron and Morris (1971, 1972, 1973a, 1975) have shown that classical objective Bayes or classical frequentist methods, e.g., maximum likelihood estimates, will usually be inappropriate here. See also Kiefer and Wolfowitz (1956) for applications to examples like those of Neyman and Scott (1948). These approaches are discussed in Sections 9.1 through 9.4, with stress on the parametric empirical Bayes (PEB) approach of Efron and Morris, as extended in Morris (1983).
It turns out that exchangeability of μ₁, …, μ_p plays a fundamental role in all these approaches. Under this assumption, there is a simple and natural Bayesian solution of the problem based on a hierarchical prior and MCMC. Much of the popularity of Bayesian methods is due to the fact that many new examples of this kind could be treated in a unified way. Because of the fundamental role of exchangeability of the μ_j's and the simplicity, at least in principle, of the Bayesian approach, we begin with these two topics in Section 9.1. This leads in a natural way to the PEB approach in Sections 9.2 and 9.3 and the frequentist approach in Section 9.4. All the above sections deal with point or interval estimation. In Section 9.6 we deal with testing and multiple testing in high-dimensional problems, with an application to microarrays. High-dimensional inference is closely related to model selection in high-dimensional problems. A brief overview is presented in Sections 9.7 and 9.8. We discuss several general issues in Sections 9.5 and 9.9.
9.1 Exchangeability, Hierarchical Priors, Approximation to Posterior for Large p, and MCMC

We follow the notations of Example 9.1 and Example 9.2, i.e., we consider p similar but not identical populations with densities f(x|μ₁), …, f(x|μ_p). There is a sample of size n, X_{1j}, …, X_{nj}, from the jth population. These p populations may correspond with p adjacent small areas with unknown per capita income μ₁, …, μ_p, as in small area estimation, Ghosh and Meeden (1997, Chapters 4, 5). They could also correspond with p clinical trials in a particular hospital, where μ_j, j = 1, …, p, are the mean effects of the drug being tested. In all these examples, the different studied populations are related to each other. In Morris (1983), which we closely follow in Section 9.2, the p populations correspond to p baseball players and the μ_j's are average scores. Other such studies are listed in Morris and Christiansen (1996). In order to assign a prior distribution for the μ_j's, we model them as exchangeable rather than i.i.d. or just independent. An exchangeable, dependent structure is consistent with the assumption that the studies are similar in a broad sense, so they share many common elements. On the other hand, independence may be unnecessarily restrictive and somewhat unintuitive in the sense that one would expect to have a separate analysis for each sample if the μ_j's were independent and hence unrelated. However, to justify exchangeability one would need a particular kind of dependence. For example, Morris (1983) points out that the baseball players in his study were all hitters; his analysis would have been hard to justify if he had considered both hitters and pitchers. Using a standard way of generating exchangeable random variables, we introduce a vector of hyperparameters η and assume the μ_j's are i.i.d. π(μ|η) given η. Typically, if f(x|μ) belongs to an exponential family, it is convenient to
take π(μ|η) to be a conjugate prior. It can be shown that for even moderately large p (in the baseball example of Morris (1983), p = 18) there is a lot of information in the data on η. Hence the choice of a prior for η does not have much influence on inference about the μ_j's. It is customary to choose a uniform or one of the other objective priors (vide Chapter 5) for η. We illustrate these ideas by exploring in detail Example 9.1.

Example 9.3. (Example 9.1, continued) Let f(x|μ_j) be the density of N(μ_j, σ²), where we initially assume σ² is known. We relax this assumption in Subsection 9.1.1. The prior for μ_j is taken to be N(η₁, η₂), where η₁ is the prior guess about the μ_j's and η₂ is a measure of the prior uncertainty about this guess, vide Example 2.1, Chapter 2. The prior for (η₁, η₂) is π(η₁, η₂), which we specify a bit later. Because (X̄_j = Σ_{i=1}^n X_{ij}/n, j = 1, …, p) form a sufficient statistic and the X̄_j's are independent, the Bayes estimate for μ_j under squared error loss is

E(μ_j|X) = E(μ_j|X̄) = ∫ E(μ_j|X̄, η) π(η|X̄) dη,
where X = (X_{ij}, i = 1, …, n, j = 1, …, p), X̄ = (X̄₁, …, X̄_p), and (vide Example 2.1)

E(μ_j|X̄, η) = E(μ_j|X̄_j, η) = (η₂ X̄_j + (σ²/n) η₁)/(η₂ + σ²/n) = (1 - B) X̄_j + B η₁,   (9.1)
with B = (σ²/n)/(η₂ + σ²/n); this conditional mean depends on the data only through X̄_j. The Bayes estimate of μ_j, on the other hand, depends on X̄_j as above and also on (X̄₁, …, X̄_p), because the posterior distribution of η depends on all the X̄_j's. Thus the Bayes estimate learns from the full sufficient statistic, justifying simultaneous estimation of all the μ_j's. This learning process is sometimes referred to as borrowing strength. If the modeling of the μ_j's is realistic, we would expect the Bayes estimates to perform better than the X̄_j's. This is what is strikingly new in the case of large p, small n, and it follows from the exchangeability of the μ_j's.
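For fixed η, (9.1) is just a precision-weighted compromise between the sample mean and the prior guess. A tiny numerical illustration (all numbers hypothetical):

```python
import numpy as np

sigma2, n = 4.0, 5        # within-population variance and per-population sample size
eta1, eta2 = 0.0, 2.0     # prior mean and prior variance of the mu_j's
xbar = np.array([-2.1, 0.3, 1.8, 4.0])   # sample means of p = 4 populations

B = (sigma2 / n) / (eta2 + sigma2 / n)   # shrinkage weight in (9.1)
post_mean = (1 - B) * xbar + B * eta1    # E(mu_j | xbar_j, eta)
post_var = eta2 * (sigma2 / n) / (eta2 + sigma2 / n)
```

Here B ≈ 0.29, so each estimate moves roughly 29% of the way from X̄_j toward η₁, and the conditional variance η₂(σ²/n)/(η₂ + σ²/n) is the same for every j.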
The posterior density π(η|X) is also easy to calculate in principle. For known σ², one can get it explicitly. Integrating out the μ_j's and holding η fixed, we get that the X̄_j's are independent and

X̄_j | η ~ N(η₁, η₂ + σ²/n).   (9.2)
Let the prior density of (η₁, η₂) be constant on ℝ × ℝ⁺. (See Problem 1 and Gelman et al. (1995) to find out why some other choices like π(η₁, η₂) = 1/η₂ are not suitable here.) Given (9.2) and π(η₁, η₂) as above,
π(η|X) ∝ {2π(η₂ + σ²/n)}^{-p/2} exp{-(1/(2(η₂ + σ²/n))) Σ_{j=1}^p (X̄_j - η₁)²} π(η),   (9.3)

where X̄ = (1/p) Σ_{j=1}^p X̄_j and S = Σ_{j=1}^p (X̄_j - X̄)², so that Σ_j (X̄_j - η₁)² = S + p(X̄ - η₁)². In a similar way,

π(μ_j|X) = ∫∫ π(μ_j|X̄_j, η) π(η|X) dη,   (9.4)
where the (μ_j|X̄_j, η) are independent normal with

mean as in (9.1) and variance η₂(σ²/n)/(η₂ + σ²/n),   (9.5)
and

π(η|X) = π(η₁|X, η₂) π(η₂|X)   (9.6)
is the product of a normal and an inverse Gamma density (as given in (9.3)). Construction of a credible interval for μ_j is, in principle, simple. Consider π(μ_j|X) and fix 0 < α < 1. Calculate the posterior quantiles μ_j(X), μ̄_j(X) of orders 100α/2 and 100(1 - α/2) for the given data. Then
P(μ_j(X) < μ_j < μ̄_j(X) | X) = 1 - α.
In general, to calculate μ_j and μ̄_j one would have to resort to MCMC, which is explained in Subsection 9.1.1. For large p, good approximations are available. To do this we anticipate to some extent Section 9.2. Because p is large, we can invoke the theorem on posterior normality (Chapter 4). Thus the posterior for η is nearly normal with mean η̂ and variances of order O(1/p), η̂ being the MLE of η based on the "likelihood" ∏_{j=1}^p f(X̄_j|η) given by (9.2).
Hence, π(η|X) is approximately (in the sense of convergence in distribution) degenerate at η̂. This implies
π(μ_j|X) = ∫ π(μ_j|X̄_j, η) π(η|X) dη = π(μ_j|X̄_j, η̂) (approximately).   (9.7)

This in turn implies that the Bayes estimate of μ_j is

E(μ_j|X) = E(μ_j|X̄_j, η̂) (approximately),   (9.8)
which, by (9.1), is a shrinkage estimate that shrinks X̄_j towards η̂₁, and which depends on all the sample means. The approximation (9.8) is correct up to O(1/p). A similar argument provides an approximation to the posterior s.d., but the accuracy is only O(1/√p). Simulations indicate the approximation for the Bayes estimate is quite good, but that for the posterior s.d. is much less accurate. It is known, vide Morris (1983), that the approximation is also inadequate for credible intervals. As a final application of this approximation, we indicate that it is possible to check whether the prior π(μ_j|η) is consistent with the data. More precisely, we check the consistency of f(x̄_j|η) with the data, but a check for f(x̄_j|η) is indirectly a check for π(μ_j|η). In the light of the data, η = η̂ is the most likely value of the hyperparameter η. Under η̂, the X̄_j's are i.i.d. normal with mean and variance given by (9.2) with η replaced by η̂. We can now check the fit of this model to the empirical distribution via the Q-Q plot. For each 0 < p < 1, we plot the 100pth quantiles of the theoretical and empirical distributions as (x(p), y(p)). If the fit is good, the resulting curve {(x(p), y(p)), 0 < p < 1} should scatter around an equiangular line passing through the origin.

9.1.1 MCMC and E-M Algorithm

We begin with p exponential densities of the same form, namely,

exp{n c(θ_j) + Σ_{i=1}^k t_{ji}(x_j) θ_{ji}} h(x_j),   j = 1, …, p.   (9.9)
The conjugate prior density for the jth model is proportional to

exp{η₀ c(θ_j) + Σ_{i=1}^k η_i θ_{ji}},   j = 1, …, p.   (9.10)
Note that the hyperparameters (η₀, η₁, …, η_k) are the same for all j. Finally, consider a prior π(η) for the hyperparameters. Let X = (X₁, …, X_p) and θ = (θ₁, …, θ_p). The conditional density of θ given X, η is

π(θ|X, η) ∝ ∏_{j=1}^p exp{(η₀ + n) c(θ_j) + Σ_{i=1}^k (t_{ji}(x_j) + η_i) θ_{ji}},   (9.11)
which shows that conditionally the θ_j's remain independent. Also,

π(η|X, θ) ∝ exp{p d(η) + (η₀ + n) Σ_{j=1}^p c(θ_j) + Σ_{j=1}^p Σ_{i=1}^k (η_i + t_{ji}(x_j)) θ_{ji}} π(η),   (9.12)

where exp{d(η)} is the normalizing constant of the expression in (9.10). By (9.12), the Bayes formula, and cancellation of some common terms in the numerator and denominator of the Bayes formula,

π(η|X, θ) ∝ exp{p d(η) + η₀ Σ_{j=1}^p c(θ_j) + Σ_{j=1}^p Σ_{i=1}^k η_i θ_{ji}} π(η).
If d(η) has an explicit form, as is often the case, one can apply Gibbs sampling to draw samples from the joint posterior of θ and η using the conditionals π(θ|X, η) and π(η|X, θ). Otherwise one can apply Metropolis-Hastings. In general, the approximations based on η̂ are still valid, but computation of η̂ is non-trivial. It turns out that the E-M algorithm can be applied, vide Gelman et al. (1995, Chapter 9). We illustrate the algorithms for MCMC and E-M in the case of N(μ_j, σ²), j = 1, …, p, with (μ₁, …, μ_p) and σ² unknown. MCMC is quite straightforward here. Recall Example 7.13 from Chapter 7, where the hierarchical Bayesian analysis of the usual one-way ANOVA was discussed. With minimal modification, the same algorithm fits here to allow Gibbs sampling. Application of the E-M algorithm is also easy, as discussed in Gelman et al. (1995). We assume as before that the μ_j are i.i.d. N(η₁, η₂), and further take π(η₁, σ², η₂) = 1/σ². Then, recall from Section 7.2 that we need to apply the E and M steps to

log π(μ, η₁, σ², η₂|X) = -(np/2 + 1) log σ² - (p/2) log η₂ - (1/(2η₂)) Σ_{j=1}^p (μ_j - η₁)² - (1/(2σ²)) Σ_{j=1}^p Σ_{i=1}^n (X_{ij} - μ_j)² + constants.   (9.13)
The E-step requires the conditional distribution of (μ, σ²) given X and the current value (η₁⁽ᵗ⁾, η₂⁽ᵗ⁾) of (η₁, η₂). This is just the normal, inverse Gamma model. In the M-step we need to maximize E⁽ᵗ⁾(log π(μ, η₁, σ², η₂|X)) as a function of (η₁, η₂), which is straightforward.
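Since every full conditional here is standard, the Gibbs sampler for this model takes only a few lines. The sketch below simulates data and draws from the joint posterior, assuming (as above) a flat prior on (η₁, η₂) and the prior 1/σ² on σ²; the η₂ conditional is then a proper inverse Gamma because p > 2.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 20, 5
true_mu = rng.normal(1.0, 1.5, size=p)
X = true_mu[:, None] + rng.normal(0.0, 2.0, size=(p, n))   # sigma = 2
xbar = X.mean(axis=1)

mu, eta1, eta2, sig2 = xbar.copy(), xbar.mean(), xbar.var(), X.var()
keep = []
for t in range(3000):
    # mu_j | rest: precision-weighted normal, as in (9.1)
    prec = n / sig2 + 1.0 / eta2
    mu = rng.normal((n * xbar / sig2 + eta1 / eta2) / prec, np.sqrt(1.0 / prec))
    # eta1 | rest: N(mean of mu, eta2 / p)
    eta1 = rng.normal(mu.mean(), np.sqrt(eta2 / p))
    # eta2 | rest: inverse Gamma(p/2 - 1, sum (mu_j - eta1)^2 / 2)
    ss = np.sum((mu - eta1) ** 2)
    eta2 = 1.0 / rng.gamma(p / 2 - 1, 2.0 / ss)
    # sigma^2 | rest: inverse Gamma(np/2, sum (x_ij - mu_j)^2 / 2)
    sse = np.sum((X - mu[:, None]) ** 2)
    sig2 = 1.0 / rng.gamma(n * p / 2, 2.0 / sse)
    if t >= 1000:
        keep.append(mu.copy())
post_mu = np.array(keep).mean(axis=0)
```

The posterior means post_mu are visibly pulled from the raw X̄_j toward their common center, the borrowing of strength described in Section 9.1.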
9.2 Parametric Empirical Bayes

To explain the basic ideas, we consider once more the special case of N(μ_j, σ²), σ² known. Explicit formulas are available in this special case for comparison
with the estimates of Stein. Another interesting special case is discussed in Carlin and Louis (1996, Chapter 3). The PEB approach was introduced by Efron and Morris in a series of papers including Efron and Morris (1971, 1972, 1973a, 1973b, 1975, 1976). In this section we generally follow Morris (1983). Efron and Morris tried to take an intermediate position between a fully Bayes and a fully frequentist approach by treating the likelihood as given by f(x̄_j|η), obtained by integrating out the μ_j's as in (9.2). The η's are treated as unknown parameters, as in frequentist analysis, whereas the μ_j's are treated as random variables, as in Bayesian analysis. This leads to a reduction of a high-dimensional frequentist problem about the μ_j's to a low-dimensional semifrequentist problem about η, about which there is a lot of information in the data. The fully Bayesian and the PEB approach differ in that no prior is assigned to η, and η is estimated by MLE or by a suitable unbiased estimate. So one may, if one wishes, think of the PEB approach as an approximation to the hierarchical Bayes approach of Section 9.1. A disadvantage of PEB is that accounting for the uncertainty about η is more difficult than in hierarchical Bayes, a point that will be discussed again in Subsection 9.2.1. An advantage is that one gets an explicit estimate for μ_j, namely, (9.1) with η replaced by an estimate of η. Note that under the likelihood (9.2), the complete sufficient statistic is the pair
X̄ = (1/p) Σ_{j=1}^p X̄_j,   S = Σ_{j=1}^p (X̄_j - X̄)².   (9.14)

Also, X̄ and

B̂ = (p - 3)(σ²/n)/S   (9.15)

are unbiased estimates of η₁ and

B = (σ²/n)/(η₂ + σ²/n).   (9.16)

Then the best unbiased predictor of the RHS of (9.1) is

μ̂_j = (1 - B̂) X̄_j + B̂ X̄,   (9.17)
which is the famous James-Stein-Lindley estimate of μ_j. It shrinks X̄_j towards the overall mean X̄. The amount of shrinkage is determined by B̂, which is close to 1 if S/(p - 3) is close to σ²/n, and close to zero if S/(p - 3) is much larger than σ²/n. Note that if S/(p - 3) is small, as in the first case, then the X̄_j's are close to X̄, indicating that the μ_j's are close to each other. This justifies a fairly strong shrinkage towards the grand mean. On the other hand, a large S/(p - 3) indicates heterogeneity among the μ_j's, suggesting relatively large weight for X̄_j. Because
E(S/(p - 1)) = σ²/n + η₂,   (9.18)
an unbiased estimate of η₂ is η̂₂ = S/(p - 1) - σ²/n. Because η₂ ≥ 0, a more plausible estimate is η̂₂⁺ = max(0, η̂₂), the positive part of the unbiased estimate. This suggests changing the estimate of B to

B̂⁺ = (σ²/n)/(η̂₂⁺ + σ²/n),   (9.19)
which is the James-Stein-Lindley positive part estimate. If we take η₁ = 0, i.e., the μ_j's are i.i.d. N(0, η₂), then the two estimates are of the form

μ̃_j = (1 - B̂) X̄_j and μ̃_j⁺ = (1 - B̂⁺) X̄_j.   (9.20)

These are the James-Stein and James-Stein positive part estimates. They shrink the estimate towards an arbitrary point zero and so do not seem attractive in the exchangeable case. But they have turned out to be quite useful in estimating coefficients in an orthogonal expansion of an unknown function with white noise as error, vide Cai et al. (2000). We study frequentist properties of these two estimates in Section 9.4.

9.2.1 PEB and HB Interval Estimates

Morris defines a random confidence interval (μ_j(X), μ̄_j(X)) for μ_j to have PEB confidence coefficient (1 - α) if

P_η{μ_j < μ_j < μ̄_j} ≥ 1 - α.   (9.21)

Let s_j² = [1 - ((p - 1)/p) B̂] σ²/n + (2/(p - 3)) B̂² (X̄_j - X̄)². Morris has conjectured that

μ̂_j ± z_{α/2} s_j   (9.22)

is a PEB confidence interval with confidence coefficient 1 - α. It is shown in Basu et al. (2003) that the conjecture is not true, but the violations are so rare and so small in magnitude that it hardly matters. Basu et al. (2003) suggest an adjustment that would make (9.22) true up to O(p⁻¹). It is also shown there that asymptotically the adjusted interval is equivalent to a PEB interval proposed by Carlin and Louis (1996, Chapter 3). A trouble with Morris's interval is that it is somewhat ad hoc. We are not told how exactly it is derived. It seems he puts a noninformative prior on (η₁, η₂) and adjusts somewhat the HB credible interval to get a conservative frequentist coverage probability. There is a natural alternative that does not require additional adjustment and ensures (9.21) with the inequality replaced by an equality up to O(p⁻¹). To do this, one has to choose a prior for η that is probability matching in the sense of
P_η{μ_j < μ_j < μ̄_j} = 1 - α + O(p⁻¹),   (9.23)
where

P(μ_j > μ̄_j|X) = α/2,   P(μ_j < μ_j|X) = α/2,   (9.24)

and the probabilities in (9.24) are the posterior probabilities under the prior for η. This leads to probability matching differential equations; a solution is given in Datta, Ghosh, and Mukerjee (2000).
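Computing the interval (9.22) is immediate once B̂ and s_j are in hand. In this sketch the interval is centered at the PEB estimate μ̂_j of (9.17), B̂ is capped at 1 (a practical guard, not in the text), and the data are simulated:

```python
import numpy as np

def morris_interval(xbar, v, z=1.96):
    # v = sigma^2 / n; returns the PEB interval (9.22) for each mu_j
    p = len(xbar)
    grand = xbar.mean()
    S = np.sum((xbar - grand) ** 2)
    B = min(1.0, (p - 3) * v / S)                  # Bhat of (9.15), capped at 1
    muhat = (1 - B) * xbar + B * grand             # (9.17)
    s2 = (1 - (p - 1) / p * B) * v + (2.0 / (p - 3)) * B ** 2 * (xbar - grand) ** 2
    s = np.sqrt(s2)
    return muhat - z * s, muhat + z * s

rng = np.random.default_rng(4)
p = 18                                             # as in Morris's baseball data
mu = rng.normal(0.27, 0.05, size=p)
xbar = mu + rng.normal(0.0, 0.06, size=p)          # sigma^2/n = 0.06^2
lo, hi = morris_interval(xbar, 0.06 ** 2)
coverage = np.mean((lo < mu) & (mu < hi))
```

Repeating the simulation many times would show the empirical coverage hovering near (and, per Morris's conjecture, mostly above) the nominal 95%.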
9.3 Linear Models for High-dimensional Parameters

We can extend the HB and PEB approach to a more general setup using covariates and linear models. The parameters are no longer exchangeable but are affected by a common set of low-dimensional hyperparameters assuming the role of η. The model in Sections 9.1 and 9.2 is a special case of the linear model below. Following Morris (1983), we change our notations slightly:

Y_j|θ_j ~ N(θ_j, V),   j = 1, …, p, independent,   (9.26)

and, given β and A,

θ_{p×1} = Z_{p×r} β_{r×1} + ε_{p×1},   (9.27)

where the ε_j's are i.i.d. N(0, A). Here p is at least moderately large and r is relatively small. Morris allows the variance of ε_j to depend on j, which is often a more realistic assumption. Keeping the same variance A for all j simplifies the algebra considerably. In the HB analysis we need to put a further prior on β. A conjugate prior for β given A is

β ~ N(γ₁, γ₂(Z'Z)⁻¹).   (9.28)

Finally, A is given an inverse Gamma or a uniform or the standard prior 1/A for scale parameters. Assuming V is known, our specification of priors is complete. To do MCMC we partition the parameters and (random) hyperparameters into three sets (θ, β, A). The conditionals are as follows.
(1) Given β, A (and Y), θ is multivariate normal.
(2) Given θ, A (and Y), β is multivariate normal.
(3) Given θ, β (and Y), A has an inverse Gamma distribution.
You are asked to find the parameters of these conditionals in Problem 6.
In the PEB approach, one first notes

θ_i | Y_i, β, A ~ N(θ_i*, V(1 - B)),   (9.29)

where

θ_i* = (1 - B) Y_i + B Z_i'β,   (9.30)

with B = V/(V + A). Here Z_i is the ith column vector of Z. This is the shrinkage estimate corresponding with (9.1) of Section 9.1. In the PEB approach one has to estimate β and B, either by maximizing the likelihood of the independent Y_i's with

Y_i | β, A ~ N(Z_i'β, V + A),   (9.31)

or by finding a suitable unbiased estimate as in (9.18). Let β̂ = (Z'Z)⁻¹Z'Y. The statistics β̂ and S = (Y - Zβ̂)'(Y - Zβ̂) are independent, complete sufficient statistics for (β, A). Hence the best unbiased estimates for β and B are β̂ and B̂ = (p - r - 2)V/S (vide Problem 10). Substituting in the shrinkage estimate θ_i* for θ_i, one gets

θ̂_i = (1 - B̂) Y_i + B̂ Z_i'β̂.

This is the analogue of the James-Stein-Lindley estimate under the regression model. In Problem 8, you are asked to show that the PEB risk of θ̂_i, namely E_{β,A}(θ̂_i - θ_i)², is smaller than the PEB risk of Y_i, namely E_{β,A}(Y_i - θ_i)². The relative strength of the PEB estimate comes through the use of β̂, which is based on the full data set Y. In Section 8.3, linear regression is discussed as a common statistical problem where an objective Bayesian analysis is done. You may wish to explore how similar ideas are used in this section to model a high-dimensional problem.
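The whole PEB recipe of this section is a few lines of linear algebra. The simulation sketch below (hypothetical V, A, and design matrix) compares the average loss of θ̂_i with that of Y_i over repeated data sets:

```python
import numpy as np

rng = np.random.default_rng(5)
p, r, V, A = 40, 2, 1.0, 0.5
Z = np.column_stack([np.ones(p), rng.uniform(-1.0, 1.0, p)])
beta_true = np.array([1.0, 2.0])

def one_rep():
    theta = Z @ beta_true + rng.normal(0.0, np.sqrt(A), p)   # (9.27)
    Y = theta + rng.normal(0.0, np.sqrt(V), p)               # (9.26)
    beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
    S = np.sum((Y - Z @ beta_hat) ** 2)
    B = min(1.0, (p - r - 2) * V / S)   # Bhat; the cap at 1 is a guard, not in the text
    theta_hat = (1 - B) * Y + B * (Z @ beta_hat)
    return np.mean((theta_hat - theta) ** 2), np.mean((Y - theta) ** 2)

risks = np.array([one_rep() for _ in range(200)])
peb_risk, mle_risk = risks.mean(axis=0)
```

With B = V/(V + A) = 2/3 here, the PEB risk per coordinate should come out near V(1 - B) ≈ 0.33, against roughly 1.0 for Y_i itself.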
9.4 Stein's Frequentist Approach to a High-dimensional Problem

Once again we study Example 9.1. Let the X̄_j's be independent, X̄_j ~ N(μ_j, σ²/n). Classical criteria like maximum likelihood, minimaxity, or minimizing variance
among unbiased estimates, all lead to (X̄₁, …, X̄_p) as the estimate of (μ₁, …, μ_p). Let p ≥ 3. Stein, in a series of papers, Stein (1956), James and Stein (1960), Stein (1981), showed that if we define a loss function

L(X̄, μ) = Σ_{j=1}^p (X̄_j - μ_j)²,   (9.32)

and generally

L(T, μ) = Σ_{j=1}^p (T_j(X) - μ_j)²,   (9.33)
for a general estimate T, it is possible to choose a T that is better than X in the sense E^{L{T, /x)) < E^{L{X,/x)) for all ^JL. (9.34) Stein (1956) provides a heuristic motivation that suggests X is too large in a certain sense explained below. To see this compare the expectation of the squared norm of X with the squared norm of /x. E^{\\Xf)
= l l ^ f +p
(9.35)
The larger the p, the bigger the deviation between the LHS and RHS. So one would expect, at least for sufficiently large p, that one can get a better estimate by shrinking each coordinate of X suitably towards zero. We present below two of the most well-known shrinkage estimates, namely, the James–Stein and the positive part James–Stein estimates. Both have already appeared in Section 9.2 as PEB estimates. It seems to us that the PEB approach provides the most insight about Stein's estimates, even though the PEB interpretation came much later. Morris points out that there is no exchangeability or prior in Stein's approach, but summing the individual losses produces a similar effect. Moreover, pooling the individual losses would be a natural thing to do only when the different μ_j's are related in some way. If they are totally unrelated, Stein's result would be merely a curious fact with no practical significance, not a profound new data analytic tool. It is in the case of exchangeable high-dimensional problems that it provides substantial improvement. We present two approaches to proving that the James–Stein estimate is superior to the classical estimate. One is based on Stein (1981) with details as in Ibragimov and Has'minskii (1981). The other is an interesting variation on this due to Schervish (1995).

Stein's Identity. Let X ~ N(μ, σ²) and let φ(x) be a real valued function differentiable on ℝ with ∫ₐˣ φ′(u)du = φ(x) − φ(a). Then

E{(X − μ)φ(X)} = σ² E{φ′(X)}.
Proof. Integrating by parts or changing the order of integration,

E{φ′(X)} = σ⁻² E{φ(X)(X − μ)}. (9.36)
For more details see the proof in Stein (1981). As a corollary we have the following result.

Corollary. Let (X₁, X₂, ..., X_p) be a random vector ~ N_p(μ, σ²I). Let φ = (φ₁, φ₂, ..., φ_p): ℝ^p → ℝ^p be differentiable with E|∂φ_j/∂x_j| < ∞, and assume that

φ_j(x₁, ..., x_{j−1}, x, x_{j+1}, ...) exp{−(x − μ_j)²/(2σ²)} → 0 as |x| → ∞.

Then, for each j,

σ² E{∂φ_j/∂X_j} = E{(X_j − μ_j)φ_j(X)}. (9.37)
We now return to Example 9.1. The classical estimate for μ is X = (X₁, X₂, ..., X_p). Consider an alternative estimate of the form

δ = X + n⁻¹σ² g(X), (9.38)

where g(x) = (g₁, g₂, ..., g_p): ℝ^p → ℝ^p with g satisfying the conditions in the corollary. Then, by the corollary,

E_μ||X − μ||² − E_μ||X + n⁻¹σ²g(X) − μ||²
  = −2n⁻¹σ² E_μ{(X − μ)′g(X)} − n⁻²σ⁴ E_μ||g(X)||²
  = −2n⁻²σ⁴ E_μ{Σ_j ∂g_j/∂X_j} − n⁻²σ⁴ E_μ||g(X)||². (9.39)
Now suppose g(x) = grad(log φ(x)), where φ is a twice continuously differentiable function from ℝ^p into ℝ. Then

Σ_j ∂g_j/∂X_j = Δφ(X)/φ(X) − ||g(X)||², (9.40)

where Δ = Σ_j ∂²/∂x_j². Hence

E_μ||X − μ||² − E_μ||δ − μ||² = n⁻²σ⁴ E_μ||g||² − 2n⁻²σ⁴ E_μ{Δφ(X)/φ(X)}, (9.41)
which is positive if φ(x) is a positive non-constant superharmonic function, i.e.,

Δφ ≤ 0. (9.42)

Thus δ is superior to X if (9.42) holds. It is known that such functions exist if and only if p ≥ 3. To show the superiority of the James–Stein positive part estimate for p ≥ 3, take

φ(x) = ||x||^{−(p−2)} if ||x||² > p − 2,
     = (p − 2)^{−(p−2)/2} exp{½((p − 2) − ||x||²)} otherwise. (9.43)

Then grad(log φ) is easily verified to give the James–Stein positive part estimate. To show the superiority of the James–Stein estimate, take

φ(x) = ||x||^{−(p−2)}. (9.44)
We observed earlier that shrinking towards zero is natural if one modeled the μ_j's as exchangeable with common mean equal to zero. We expect substantial improvement if μ = 0. Calculation shows

E_μ||δ − μ||² = (2/p) E_μ||X − μ||² (9.45)

if μ = 0, σ² = 1, n = 1. It appears that borrowing strength in the frequentist formulation is possible because Stein's loss adds up the losses of the component decision problems. Such addition would make sense only when the different problems are connected in a natural way, in which case the exchangeability assumption and the PEB or hierarchical models are also likely to hold. It is natural to ask how good the James–Stein estimates are in the frequentist sense. They are certainly minimax since they dominate minimax estimates. Are they admissible? Are they Bayes (not just PEB)? For the James–Stein positive part estimate the answer to both questions is no; see Berger (1985a, pp. 542, 543). On the other hand, Strawderman (1971) constructs a proper Bayes minimax estimate for p ≥ 5. Berger (1985a, pp. 364, 365) discusses the question of which among the various minimax estimates to choose. Note that the PEB approach leads in a natural way to the James–Stein positive part estimate, suggesting that it can't be substantially improved even though it is not Bayes. See in this connection Robert (1994, p. 66). There is a huge literature on Stein estimates as well as questions of admissibility in multidimensional problems. Berger (1985a) and Robert (1994) provide excellent reviews of the literature. There are intriguing connections between admissibility and recurrence of suitably constructed Markov processes; see Brown (1971), Srinivasan (1981), and Eaton (1992, 1997, 2004).
if /x = 0,cr^ = l , n = 1. It appears that borrowing strength in the frequentist formulation is possible because Stein's loss adds up the losses of the component decision problems. Such addition would make sense only when the different problems are connected in a natural way, in which case the exchangeability assumption and the FEB or hierarchical models are also likely to hold. It is natural to ask how good are the James-Stein estimates in the frequentist sense. They are certainly minimax since they dominate minimax estimates. Are they admissible? Are they Bayes (not just FEB)? For the James-Stein positive part estimate the answer to both questions is no, see Berger (1985a, pp. 542, 543). On the other hand, Strawderman (1971) constructs a proper Bayes minimax estimate for p> 5. Berger (1985a, pp. 364, 365) discusses the question of which among the various minimax estimates to choose. Note that the FEB approach leads in a natural way to James-Stein positive part estimate, suggesting that it can't be substantially improved even though it is not Bayes. See in this connection Robert (1994, p. 66). There is a huge literature on Stein estimates as well as questions of admissibility in multidimensional problems. Berger (1985a) and Robert (1994) provide excellent reviews of the literature. There are intriguing connections between admissibility and recurrence of suitably constructed Markov processes, see Brown (1971), Srinivasan (1981), and Eaton (1992, 1997, 2004).
When extreme μ's may occur, the Stein estimates do not offer much improvement. Stein (1981) and Berger and Dey (1983) suggest how this problem can be solved by suitably truncating the sample means. For Stein type results for general ridge regression estimates, see Strawderman (1978), where several other references are given. Of course, instead of zero we could shrink towards an arbitrary μ₀. Then a substantial improvement will occur near μ₀. Exactly similar results hold for the James–Stein–Lindley estimate and its positive part estimate if p ≥ 4. For the James–Stein estimate, Schervish (1995, pp. 163-165) uses Stein's identity as well as (9.40) but then shows directly (with σ² = 1, n = 1) that

E_μ||δ − μ||² = p − (p − 2)² E[1/χ²_p(||μ||²)],

where χ²_p(||μ||²) denotes a noncentral χ² variable with p degrees of freedom and noncentrality parameter ||μ||², and δ is the James–Stein estimate. This shows how the risk can be evaluated by simulating a noncentral χ²-distribution.
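A sketch of that evaluation (at λ = ||μ||² = 0 the formula gives exactly p − (p − 2) = 2, matching (9.45); the function name is mine):

```python
import numpy as np

def js_risk(p, lam, n_draws=1_000_000, seed=1):
    """Risk of the James-Stein estimate via
    E||delta - mu||^2 = p - (p-2)^2 E[1 / chi2_p(lam)], lam = ||mu||^2."""
    rng = np.random.default_rng(seed)
    if lam > 0:
        chi2 = rng.noncentral_chisquare(p, lam, size=n_draws)
    else:
        chi2 = rng.chisquare(p, size=n_draws)
    return p - (p - 2) ** 2 * np.mean(1.0 / chi2)

print(js_risk(10, 0.0))    # close to 2
print(js_risk(10, 25.0))   # between 2 and p; approaches p as lam grows
```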
9.5 Comparison of High-dimensional and Low-dimensional Problems

In the low-dimensional case, where n is large or moderate and p small, the prior is washed away by the data; the likelihood influences the posterior more than the prior. This is not so when p is much larger than n, the so-called high-dimensional case. The prior is important, so elicitation, if possible, is important. Checking the prior against data is possible and should be explored. We discuss this below. In the high-dimensional cases examined in Sections 9.2 and 9.3, some aspects of the prior, namely π(μ_j|η), can be checked against the empirical distribution. We have discussed this earlier mathematically, but one can approach this from a more intuitive point of view. Because we have many μ_j's as a sample from π(μ_j|η) and the X_j's provide approximate estimates of the μ_j's, the empirical distribution of the X_j's should provide a check on the appropriateness of π(μ_j|η).
Thus there is a curious dichotomy. In the low-dimensional case, the data provide a lot of information about the parameters but not much information about their distribution, i.e., the prior. The opposite is true in high-dimensional problems. The data don't tell us much about the parameters, but there is information about the prior.
This general fact suggests that the smoothed empirical distribution of estimates could be used to generate a tentative prior if the likelihood is not exponential and so conjugate priors cannot be used. Adding a location-scale hyperparameter η could provide a family of priors as a starting point of objective high-dimensional Bayesian analysis. Bernardo (1979) has shown that, at least for Example 9.1, a sensible Bayesian analysis can be based on a reference prior with a suitable reparameterization. It does seem very likely that this example is not an exception, but a general theory of the right reparameterization needs to be developed.
9.6 High-dimensional Multiple Testing (PEB)

Multiple tests have become very popular because of applications in many areas including microarrays, where one searches for genes that have been expressed. We provide a minimal amount of modeling that covers a variety of such applications arising in bioinformatics, statistical genetics, biology, etc. Microarrays are discussed in Appendix D. Whereas PEB or HB high-dimensional estimation has been around for some time, PEB or HB high-dimensional multiple testing is of fairly recent origin, e.g., Efron et al. (2001a), Newton et al. (2003), etc. We have p samples, each of size n, from p normal populations. In the simplest case we assume the populations are homoscedastic. Let σ² be the common unknown variance, and the means μ₁, ..., μ_p. For μ_j, consider the hypotheses H₀j: μ_j = 0, H₁j: μ_j ~ N(η₁, η₂), j = 1, ..., p. The data are X_ij, i = 1, ..., n, j = 1, ..., p. In the gene expression problem, X_ij, i = 1, ..., n, are n i.i.d. observations on the expression of the jth gene. The value of |X̄_j| may be taken as a measure of observed intensity of expression. If one accepts H₀j, it amounts to saying the jth gene is not expressed in this experiment. On the other hand, accepting H₁j is to assert that the jth gene has been expressed. Roughly speaking, a gene is said to be expressed when the gene has some function in the cell or cells being studied, which could be a malignant tumor. For more details, see the appendix. In addition to H₀j and H₁j, the model involves π₀ = probability that H₀j is true and π₁ = 1 − π₀ = probability that H₁j is true. If

I_j = 1 if H₁j is true, and I_j = 0 if H₀j is true,

then we assume I₁, ..., I_p are i.i.d. ~ B(1, π₁). The interpretation of π₁ has a subjective and a frequentist aspect. It represents our uncertainty about expression of each particular gene as well as the approximate proportion of expressed genes among the p genes.
If σ², π₁, η₁, η₂ are all known, X̄_j is sufficient for μ_j and a Bayes test is available for each j. Calculate the posterior probability of H₁j:
π₁j = π₁f₁(X̄_j) / {π₁f₁(X̄_j) + π₀f₀(X̄_j)},

which is a function of X̄_j only. Here f₀ and f₁ are the densities of X̄_j under H₀j and H₁j. If π₁j > ½, accept H₁j, and if π₁j < ½, accept H₀j.
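With all hyperparameters known, this gene-by-gene Bayes test is a few lines of code; a sketch, where f₀ is the N(0, σ²/n) density of X̄_j and f₁ is its N(η₁, η₂ + σ²/n) marginal under H₁j (with μ_j integrated out):

```python
import numpy as np

def posterior_prob_H1(xbar, pi1, eta1, eta2, sigma2, n):
    """pi_1j = pi1 f1(xbar_j) / (pi1 f1(xbar_j) + (1 - pi1) f0(xbar_j))."""
    def norm_pdf(x, m, v):
        return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
    f0 = norm_pdf(xbar, 0.0, sigma2 / n)             # density under H0j
    f1 = norm_pdf(xbar, eta1, eta2 + sigma2 / n)     # marginal under H1j
    return pi1 * f1 / (pi1 * f1 + (1 - pi1) * f0)

# accept H1j whenever the posterior probability exceeds 1/2
pi1j = posterior_prob_H1(np.array([0.05, 2.5]), pi1=0.1, eta1=0.0, eta2=4.0,
                         sigma2=1.0, n=10)
print(pi1j)   # small for the mean near 0, near 1 for the extreme mean
```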
This test is based only on the data for the jth gene. In practice, we do not know π₁, η₁, η₂. In PEB testing, we have to estimate all three. In HB testing, we have to put a prior on (π₁, η₁, η₂). To us a natural prior would be a uniform for π₁ on some range (0, δ), δ being an upper bound for π₁, a uniform prior for η₁ on ℝ, and a uniform or some other objective prior for η₂. In the PEB approach, we have to estimate π₁, η₁, η₂. If σ² is also unknown, we have to put a prior on σ² also or estimate it from data, e.g., by the usual pooled estimate s² = Σ_j Σ_i (X_ij − X̄_j)²/{p(n − 1)}. For fixed π₁, we can estimate η₁ and η₂ by the method of moments using the equations

X̄ = (1/p) Σ_j X̄_j = π₁η₁, (9.46)

(1/p) Σ_j (X̄_j − X̄)² = σ²/n + π₁η₂ + π₁(1 − π₁)η₁², (9.47)

from which it follows that

η̂₁ = X̄/π₁,  η̂₂ = {(1/p) Σ_j (X̄_j − X̄)² − σ²/n − π₁(1 − π₁)η̂₁²}/π₁. (9.48)

Alternatively, if it is felt that η₁ = 0, then the estimate for η₂ is given by

η̂₂ = {(1/p) Σ_j (X̄_j − X̄)² − σ²/n}/π₁. (9.49)

Now we may maximize the joint likelihood of the X̄_j's with respect to π₁. Using these estimates, we can carry out the Bayes test for each j, provided we know π₁ or put a prior on π₁. We do not know of good PEB estimates of π₁.
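The moment equations (9.46)-(9.47) translate directly into code; a sketch for fixed π₁ (the function name is mine), checked on synthetic data:

```python
import numpy as np

def moment_estimates(xbar, pi1, sigma2, n):
    """Estimate (eta1, eta2) for fixed pi1 from the p sample means, using
    Xbar = pi1 * eta1 and
    (1/p) sum (xbar_j - Xbar)^2 = sigma2/n + pi1 eta2 + pi1 (1 - pi1) eta1^2."""
    grand = xbar.mean()
    s2 = np.mean((xbar - grand) ** 2)
    eta1 = grand / pi1
    eta2 = (s2 - sigma2 / n - pi1 * (1 - pi1) * eta1 ** 2) / pi1
    return eta1, eta2

# synthetic check: a fraction pi1 of the p genes has mu_j ~ N(eta1, eta2)
rng = np.random.default_rng(3)
p, n, sigma2, pi1, eta1, eta2 = 20000, 25, 1.0, 0.2, 1.0, 2.0
mu = np.where(rng.random(p) < pi1, rng.normal(eta1, np.sqrt(eta2), p), 0.0)
xbar = rng.normal(mu, np.sqrt(sigma2 / n))
e1, e2 = moment_estimates(xbar, pi1, sigma2, n)
print(e1, e2)   # close to (1.0, 2.0) for large p
```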
Scott and Berger (2005) provide a very illuminating fully Bayesian analysis for microarrays.
9.6.1 Nonparametric Empirical Bayes Multiple Testing

Nonparametric empirical Bayes (NPEB) solutions were introduced by Robbins (1951, 1955, 1964). It is a Bayes solution based on a nonparametric estimate of the prior. Robbins applied these ideas in an ingenious way in several problems. It was regarded as a breakthrough, but the method never became popular because the nonparametric methods did not perform well even in moderately large samples and were somewhat unstable. Recently Efron et al. (2001a, b) have made a successful application to a microarray with p equal to several thousands. The data are massive enough for NPEB to be stable and perform well. After some reductions the testing problem takes the following form. For j = 1, 2, ..., p, we have random variables Z_j, with Z_j ~ f₀(z) under H₀j and Z_j ~ f₁(z) under H₁j, where f₀ is completely specified but f₁(z) ≠ f₀(z) is completely unknown. This is what makes the problem nonparametric. Finally, as in the case of parametric empirical Bayes, the indicator of H₁j is I_j = 1 with probability π₁ and = 0 with probability π₀ = 1 − π₁. If π₁ and f₁ were known, we could use the Bayes test of H₀j based on the posterior probability of H₁j:

P(H₁j|z_j) = π₁f₁(z_j) / {π₁f₁(z_j) + (1 − π₁)f₀(z_j)}.
Let f(z) = π₁f₁(z) + (1 − π₁)f₀(z). We know f₀(z). Also we can estimate f(z) using any standard method (kernel, spline, nonparametric Bayes, vide Ghosh and Ramamoorthi (2003)) from the empirical distribution of the Z_j's. But since π₁ and f₁ are both unknown, there is an identifiability problem and hence estimation of π₁, f₁ is difficult. The two papers, Efron et al. (2001a, b), provide several methods for bounding π₁. One bound follows from

π₀ ≤ min_z [f(z)/f₀(z)],  so that  π₁ ≥ 1 − min_z [f(z)/f₀(z)].

So the posterior probability of H₁j is

1 − π₀ f₀(z_j)/f(z_j),

which is estimated by 1 − min_z {f̂(z)/f₀(z)} f₀(z_j)/f̂(z_j), where f̂ is an estimate of f as mentioned above. The minimization will usually be made over observed values of z. Another bound is given by

π₀ ≤ ∫_A f(z)dz / ∫_A f₀(z)dz.
Now minimize the RHS over different choices of A. Intuition suggests a good choice would be an interval centered at the mode of f₀(z), which will usually be at zero. A fully Bayesian nonparametric approach is yet to be worked out. Other related papers are Efron (2003, 2004). For an interesting discussion of microarrays and the application of nonparametric empirical Bayes methodology, see Young and Smith (2005).

9.6.2 False Discovery Rate (FDR)

The false discovery rate (FDR) was introduced by Benjamini and Hochberg (1995). Controlling it has become an important frequentist concept and method in multiple testing, especially in high-dimensional problems. We provide a brief review, because it has interesting similarities with NPEB, as noted, e.g., in Efron et al. (2001a, b). We consider the multiple testing scenario introduced earlier in this section. Consider a fixed test. The (random) FDR for the test is defined as

(U/V) I_{V>0},

where U = total number of false discoveries, i.e., the number of true H₀j's that are rejected by the test, and V = total number of discoveries, i.e., the number of H₀j's that are rejected by the test. The (expected) FDR is

FDR = E_μ((U/V) I_{V>0}).

To fix ideas, suppose all H₀j's are true, i.e., all μ_j's are zero. Then U = V, so (U/V) I_{V>0} = I_{V>0} and

FDR = P_{μ=0}(at least one H₀j is rejected) = Type 1 error probability under the full null.

This is usually called the family wise error rate (FWER). The Benjamini–Hochberg (BH) algorithm (see Benjamini and Hochberg (1995)) for controlling FDR is to define

j₀ = max{j : P_(j) ≤ (j/p)α},

where P_j = the P-value corresponding with the test for the jth null and P_(j) = the jth order statistic among the P-values, with P_(1) the smallest, etc. The algorithm rejects all H₀j for which P_j ≤ P_(j₀). Benjamini and Hochberg (1995) showed this ensures

E_μ((U/V) I_{V>0}) ≤ (p₀/p)α ≤ α for all μ,

where p₀ is the number of true H₀j's. It is a remarkable result because it is valid for all μ. This exact result has been generalized by Sarkar (2003).
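The BH algorithm itself is only a few lines; a sketch on a deliberately small example (the helper name is mine):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject H_0j for all P_j <= P_(j0), j0 = max{j : P_(j) <= (j/p) alpha}.
    Returns a boolean rejection mask aligned with pvals."""
    p = len(pvals)
    sorted_p = np.sort(pvals)
    below = sorted_p <= (np.arange(1, p + 1) / p) * alpha
    if not below.any():
        return np.zeros(p, dtype=bool)
    j0 = np.max(np.nonzero(below)[0])          # index of P_(j0) (0-based)
    return pvals <= sorted_p[j0]

pv = np.array([0.001, 0.008, 0.039, 0.041, 0.20, 0.74])
print(benjamini_hochberg(pv, alpha=0.05))      # rejects only the two smallest here
```

Note that a p-value may fall below its own threshold's neighbor yet still be rejected, because everything up to P_(j₀) is rejected; this "step-up" character is what distinguishes BH from a simple per-test cutoff.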
Benjamini and Liu (1999) have provided another algorithm. See also Benjamini and Yekutieli (2001). Genovese and Wasserman (2001) provide a test based on an asymptotic evaluation of j₀ and a less conservative rejection rule. An asymptotic evaluation is also available in Genovese and Wasserman (2002). See also Storey (2002) and Donoho and Jin (2004). Scott and Berger (2005) discuss FDR from a Bayesian point of view. Controlling FDR leads to better performance under alternatives than controlling FWER. Many successful practical applications of FDR control are known. On the other hand, from a decision theoretic point of view it seems more reasonable to control the sum of false discoveries and false negatives rather than FDR and the proportion of false negatives.
9.7 Testing of a High-dimensional Null as a Model Selection Problem^

Selection from among nested models is one way of handling testing problems, as we have seen in Chapter 6. Parsimony is taken care of to some extent by the prior on the additional parameters of the more complex model. As in estimation or multiple testing, consider samples of size r from p normal populations N(μ_i, σ²). For simplicity σ² is assumed known. Usually σ² will be unknown. Because S² = Σ_i Σ_j (X_ij − X̄_i)²/{p(r − 1)} is an unbiased estimate of σ² with lots of degrees of freedom, it does not matter much whether we put one of the usual objective priors on σ² or pretend that σ² is known to be S². We wish to test H₀: μ_i = 0 ∀i versus H₁: at least one μ_i ≠ 0. This is sometimes called Stone's problem; see Berger et al. (2003), Stone (1979). We may treat this as a model selection problem with M₀ = H₀: μ_i = 0 ∀i and M₁ = H₀ ∪ H₁, i.e., M₁: μ ∈ ℝ^p. In this formulation, M₀ ⊂ M₁, whereas H₀ and H₁ are disjoint. On grounds of parsimony, H₀ is favored if both M₀ and M₁ are equally plausible. To test a null or select a model, we have to define a prior π(μ) under M₁ and calculate the Bayes factor
B₀₁ = ∏_{i=1}^p f₀(X̄_i) / ∫_{ℝ^p} ∏_{i=1}^p f₁(X̄_i|μ_i) π(μ) dμ.
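The denominator of B₀₁ is an integral over ℝ^p; since it factorizes coordinatewise here, a simple Monte Carlo sketch is possible. For illustration an exchangeable N(0, τ²) prior on the μᵢ's is used in place of the Cauchy priors discussed below (τ² is an assumed hyperparameter), so that the answer can be checked against the closed-form N(0, τ² + σ²/r) marginal:

```python
import numpy as np

def log_B01(xbar, sigma2, r, tau2, n_draws=200_000, seed=0):
    """log B01 = sum_i [ log f0(xbar_i) - log int f1(xbar_i | mu) pi(mu) dmu ],
    where f0, f1 are the N(., sigma2/r) densities of the group means and
    pi = N(0, tau2); the integral is estimated coordinatewise by Monte Carlo."""
    rng = np.random.default_rng(seed)
    v = sigma2 / r
    mu = rng.normal(0.0, np.sqrt(tau2), size=n_draws)   # draws from the prior
    log_b = 0.0
    for x in xbar:
        f0 = np.exp(-x ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
        m1 = np.mean(np.exp(-(x - mu) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v))
        log_b += np.log(f0) - np.log(m1)
    return log_b

print(log_B01(np.array([0.3, -1.2, 0.8]), sigma2=1.0, r=5, tau2=2.0))
```

Under the normal prior the marginal of X̄ᵢ under M₁ is available exactly as N(0, τ² + σ²/r), which is what makes this toy version checkable; the Cauchy priors below require exactly this kind of numerical integration.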
There is no well-developed theory of objective priors, especially for testing problems. However, as in estimation, it appears natural to treat the μ_i's as exchangeable rather than independent. A popular prior in this context is the Zellner and Siow (1980) multivariate Cauchy prior
^ Section 9.7 may be omitted at first reading.
π(μ) = ∫₀^∞ (2πtσ²)^{−p/2} exp{−||μ||²/(2tσ²)} (2π)^{−1/2} t^{−3/2} exp{−1/(2t)} dt, (9.51)

a scale mixture of normals over the inverse Gamma(1/2, 1/2) mixing density, which integrates to a multivariate Cauchy density. Another plausible prior is the smooth Cauchy prior given by

π(μ) = ∫₀¹ (t/(2πσ²))^{p/2} exp{−t||μ||²/(2σ²)} dt/{π√(t(1 − t))},
where the mixing integral can be written in closed form through M(·, ·, ·), the hypergeometric ₁F₁ function of Abramowitz and Stegun (1970). It is tempting to use the difference (between the two models) of BIC as an approximation to the logarithm of the Bayes factor (BF), even though it was developed by Schwarz for low-dimensional problems. Stone was the first to point out that the use of BIC is problematic in high-dimensional problems. Berger et al. (2003) have developed a generalization of BIC called GBIC, which provides a good approximation to the integrated likelihood for priors like the above Cauchy priors, which are obtained by integrating over the scale parameter of N(μ_i, σ²). In Stone's problem one has the normal linear model setup

X_ij = μ_i + ε_ij; i = 1, ..., p; j = 1, ..., r; n = pr. (9.52)
It is assumed that as n → ∞, p → ∞ and r is fixed. Under these assumptions, Berger et al. (2003) provide a Laplace approximation and a GBIC. The GBIC also approximates the BIC for low-dimensional problems. The formula for ΔGBIC (the difference of GBIC for the comparison of M₁ and M₀) is, up to a smaller-order correction term,

ΔGBIC = (r/2) X̄′X̄ − (p/2) log(rc_p) − p/2, (9.53)

where c_p = (1/p) Σ_{i=1}^p X̄_i². Table 9.1, taken from Berger et al. (2003), provides some idea of the accuracy of BIC, GBIC, and the Laplace approximation. One has p = 50 and r = 2 for these calculations, and the multivariate Cauchy prior was used. Substantial new results appear in Liang et al. (2005). They propose a mixture of Zellner's (Zellner (1986)) popular g-prior. In Zellner's form, the prior looks like μ|M₁ ~ N(0, g(Z′Z)⁻¹), where Z is the design matrix (in our problem composed only of 0's and 1's). This g is usually elicited through an empirical Bayes method. The above authors consider a family of mixtures of g-priors (of which the Zellner–Siow Cauchy prior is a special case) and use those for model selection. They propose Laplace approximations to the
Table 9.1. Comparison of the performance of GBIC and the Laplace approximation with BIC.

 c_p    True Log Bayes Factor   ΔBIC       ΔGBIC     ΔLaplace Approx
 0.1    -8.5348                 -110.129   -1.956    -8.5776
 0.5    -3.8251                 -90.129    -1.956    -3.9083
 1.0     6.0388                 -65.129     5.715     5.9236
 1.5    20.8203                 -40.129    20.579    20.7564
 2.0    38.4814                 -15.129    38.387    38.4408
10.0   397.369                  384.871   398.151   397.369
marginal likelihood under these general priors and show that the models thus selected are generally correct asymptotically if the complex model is true. Under the null model, this type of consistency still holds under the Zellner–Siow prior. Further generalizations to non-normal problems appear in Berger (2005) and Chakrabarti and Ghosh (2005a). Both papers provide generalizations of BIC when the observations come from an exponential family of distributions in high-dimensional problems. In Table 9.2, using simulation results reported in Chakrabarti and Ghosh (2005a), the performance of GBIC and the Laplace approximation (log m̂₂) are compared with BIC in approximating the integrated likelihood under the more complex model (denoted m₂) when the more complex model is actually true and observations come from Bernoulli, exponential, and Poisson distributions. In this case one has p groups of observations, each group having a (potentially) different parameter value, and each group has r observations. Under the simpler model, these different groups are assumed to have the same (specified) parameter value, while for the more complex model the parameter vector is assumed to belong to ℝ^p. See the paper for details on the priors used. In principle, the same methods apply to any two nested models M₀: μ_i = 0, 1 ≤ i ≤ p₁, p₁ < p.
Table 9.2. Comparison of GBIC and the Laplace approximation with BIC (from Chakrabarti and Ghosh (2005a)).

 p    r     log m₂       log m̂₂      BIC          GBIC
 50   10    -327.45      -327.684    -349.577     -327.863
 50   200   -4018.026    -4018.072   -4052.757    -4018.587
 50   10    -662.526     -661.979    -640.320     -660.384
 50   200   -22186.199   -22186.100  -22178.759   -22186.117
 50   10    -671.504     -670.775    -683.383     -671.374
 50   200   -15704.585   -15704.618  -15713.139   -15705.010
9.8 High-dimensional Estimation and Prediction Based on Model Selection or Model Averaging^

Given a set of data from an experiment or observational study done on a given population, a statistician is asked the following three questions quite frequently. First, which among a given set of possible statistical models seems to be the correct model describing the underlying mechanism producing the data? Second, what will be the predicted value of a future observation, if the experimental conditions are kept at predetermined levels? Third, what is the estimate of a single parameter or a vector (may be infinite dimensional) of parameters? We will focus in this section on some Bayesian approaches to answer the last two types of questions. But before going into the details, we will explain briefly in the next paragraph how one would pose the above three questions from a decision theoretic point of view and what is the basic difference in the Bayesian approaches in tackling such questions. Bayesian approaches to such questions are basically dictated by the goal of obtaining decision theoretic optimality, and hence the solutions are also heavily dependent upon the type of loss functions being used. The loss function, on the other hand, is mostly determined by the goal of the statistician or practitioner. The goal of the statistician in the first problem above is to select the correct model (which is assumed to be one in the list of models considered). The loss function often used in this problem is the 0-1 loss function. In the Bayesian approach to model selection, the statistician would put prior probabilities on the set of candidate models, and a simple argument shows that for this loss, the optimum Bayesian model would be the posterior mode, i.e., the model that has the maximum posterior probability.
As explained in the earlier section, BIC and GBIC can be used to select a model using the Bayesian paradigm with 0-1 loss if the sample size is large, in appropriate situations, as they approximate the integrated likelihood and hence can be used to find the model with highest posterior probability. On the other hand, if one is interested in answering the second or third question above (i.e., if one is interested in prediction or estimation of a parameter), the problem can be approached in two different ways. First, one might be interested in finding a particular model that does the best job of prediction (in some appropriate sense). Secondly, one might only want a predicted value, not a particular model for repeated future use in prediction. In either case, the most popular loss function is the squared prediction error loss, i.e., the square of the difference between the predicted/estimated value and the value being predicted/estimated. The best predictor/estimator turns out to be the Bayesian model averaging estimate (to be explained later) and the best predictive model is the one which minimizes the expected posterior predictive loss. We now consider the problem of optimal prediction from a Bayesian approach. We use the ideas, notations, and results of Barbieri and Berger (2004)

^ Section 9.8 may be omitted at first reading.
for this part. Consider the canonical model

y = Xβ + ε, (9.54)

where y is an n × 1 vector of observations, X is the n × k full rank design matrix, β is the unknown k × 1 vector of regression coefficients, and ε is the n × 1 vector of random errors, which are i.i.d. N(0, σ²), σ² being known or unknown. Our goal is to predict a future observation y*, given by

y* = x*′β + ε, (9.55)
where x* = (x₁*, ..., x_k*)′ is the value of the covariate vector for which the prediction is to be made. We consider the loss in predicting y* by ỹ* as

L(ỹ*, y*) = (ỹ* − y*)², (9.56)

i.e., the squared error prediction loss. Assume that we have submodels

M_l: y = X_l β_l + ε, (9.57)
where l = (l₁, ..., l_k) with l_i = 1 or 0 according as the ith covariate is in the model M_l or not, X_l is a matrix containing the columns of X corresponding with the nonzero coordinates of l, and β_l is the corresponding vector of regression coefficients. Let k_l denote the number of covariates included in the model; then X_l is of dimension (n × k_l) and β_l is a (k_l × 1) vector. We put prior probabilities P(M_l) on each model M_l included in the model space such that Σ_l P(M_l) = 1, and given model M_l, a prior π_l(β_l, σ) is assumed on the parameters (β_l, σ) included in model M_l. Using standard posterior calculations, one obtains the quantities (a) p_l = P(M_l|y), the posterior probability of model M_l, and (b) π_l(β_l, σ|y), the posterior distribution of the unknown parameters in M_l. With this setup in mind, we shall now discuss two optimal prediction strategies, as described below. First note that the best predictor of y* for a given value of x* comes out as ŷ* = E(y*|y), where the expectation is taken with respect to the posterior/predictive distribution of y* given y. This follows by noting that

E[(y* − ỹ*)²] = E_y E[(y* − ỹ*)²|y], (9.58)

where the expectation inside is taken with respect to the posterior distribution of y* given y. But note that

ŷ* = E(y*|y) = Σ_l p_l E(y*|y, M_l) = x*′ Σ_l p_l H_l β̂_l, (9.59)
where H_l is a (k × k_l) matrix such that x*′H_l is the subvector of x* corresponding to the nonzero coordinates of l, and β̂_l is the posterior mean of β_l with respect to π_l(β_l, σ|y). Noting that, if we knew that M_l were the true model, the optimal predictor of y* for x fixed at x* would be given by
ỹ*_l = x*′H_l β̂_l, (9.60)

we have

ŷ* = E(y*|y) = x*′ Σ_l p_l H_l β̂_l = Σ_l p_l ỹ*_l. (9.61)
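In this variable-selection notation, (9.59)-(9.61) amount to a weighted sum; a sketch in which the posterior model probabilities p_l and posterior means β̂_l are taken as given inputs (computing them is the hard part in practice; the function name is mine):

```python
import numpy as np

def bma_predict(x_star, models):
    """y*_hat = sum_l p_l x*' H_l beta_hat_l, where each model is a triple
    (posterior prob p_l, indices of included variables, posterior mean beta_hat_l).
    Selecting x_star at the included indices plays the role of x*' H_l."""
    y_hat = 0.0
    for p_l, idx, beta_l in models:
        y_hat += p_l * (x_star[list(idx)] @ beta_l)
    return y_hat

# toy usage: three candidate models on k = 3 covariates
models = [
    (0.5, (0,),   np.array([1.0])),         # only x1
    (0.3, (0, 1), np.array([0.8, 0.5])),    # x1 and x2
    (0.2, (0, 2), np.array([0.9, -0.4])),   # x1 and x3
]
x_star = np.array([2.0, 1.0, -1.0])
print(bma_predict(x_star, models))
```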
ŷ* is called the Bayesian model averaging estimate, in that it is a weighted average of the optimal Bayesian predictors under each individual model, the weights being the posterior probabilities of the models. Many authors have argued for the use of the model averaging estimate as an appropriate predictive estimate. They justify this by saying that using model selection to choose the best model, and then making inference based on the assumption that the selected model is true, does not take into account the fact that there is uncertainty about the model itself. As a result, one might underestimate the uncertainty about the quantity of interest. See, for example, Madigan and Raftery (1994), Raftery, Madigan, and Hoeting (1997), Hoeting, Madigan, Raftery, and Volinsky (1999), and Clyde (1999), just to name a few, for detailed discussion on this point of view. However, if the number of models in the model space is very large (e.g., in case all subsets of parameters are allowed in the model space, as will happen in high or even moderately high dimensions), the task of computing the Bayesian model averaging estimate exactly might be virtually impossible. Moreover, it is not prudent to keep in the model average those models that have small posterior probability, indicating relative incompatibility with observed data. There are some proposals to get around this difficulty, as discussed in the literature cited above. Two of them are based on the 'Occam's window' method of Madigan and Raftery (1994) and the Markov chain Monte Carlo approach of Madigan and York (1995). In the first approach, the averaging is done over a small set of appropriately selected models, which are parsimonious and supported by the data. In the second approach, one constructs a Markov chain with state space the same as the model space and equilibrium distribution {P(M_l|y)}, where M_l varies over the model space.
Upon simulation from this chain, the Bayesian model averaging estimator is approximated by taking the average value of the posterior expectations under each model visited in the chain. But it must be remarked that Bayesian model averaging (BMA) has its limitations in high-dimensional problems. Each approach addresses both issues, but it is unclear how well. Although BMA is the optimal predictive estimation procedure, often a single model is desired for prediction. For example, choice of a single model will require observing only the covariates included in the model. Also, as noted earlier, in high dimensions BMA has its problems. We will assume now that the future predictions will be made for covariates x* such that Q = E(x*x*′) exists and is positive definite. A frequent choice of Q is Q = X′X, i.e., the future covariates will be like the ones observed in the past. In general, the best
single model will depend on x*, but we present here some general characterizations which give the optimal predictive model without this dependence. In general, the optimal predictive model is not the one with the highest posterior probability. However, there are interesting exceptions. If there are only two models, it is easy to show the posterior mode with shrinkage estimate is optimal for prediction (Berger (1997) and Mukhopadhyay (2000)). This also holds sometimes in the context of variable selection for linear models with orthogonal design matrix, as in Clyde and Parmigiani (1996). As Berger (1997) notes, it is easy to see that if one is considering only two models, say M₁ and M₂, with prior probabilities ½ each, and proper priors are assigned to the unknown parameters under each model, the best predictive model turns out to be M₁ or M₂ according as the Bayes factor of M₁ to M₂ is greater than one or not, and hence the best predictive model is the one with the highest posterior probability. The characterizations we will describe here are in terms of what is called the 'median probability model.' If it exists, the median probability model M_{l*} is defined to be the model consisting of only those variables whose posterior inclusion probabilities are at least ½. The posterior inclusion probability for variable i is
p_i = Σ_{l : l_i = 1} P(M_l | y).        (9.62)
So l* is defined coordinatewise as l*_i = 1 if p_i ≥ 1/2 and l*_i = 0 otherwise. It is possible that the median probability model does not exist, in that the variables included according to the definition of l* do not correspond to any model under consideration. But in the variable selection problem, if we are allowed to include or exclude any variable in the possible models, i.e., all possible values of l are allowed, then the median probability model obviously exists. Another important class of models for which the median probability model always exists (this fact follows directly from the definition below) is the class of models with 'graphical model structure.'

Definition 9.4. Suppose that for each variable index i, there is a corresponding index set I(i) of other variables. A subclass of linear models is said to have 'graphical model structure' if it consists of all models satisfying the condition: for each i, if variable X_i is in the model, then the variables X_j with j ∈ I(i) are in the model.
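The definition above can be made concrete with a small sketch. The helper below (names are illustrative, not from the text) computes the posterior inclusion probabilities (9.62) from a list of models with their posterior probabilities, and forms the median probability model by thresholding at 1/2:

```python
import numpy as np

def median_probability_model(models, post_probs):
    """Given binary inclusion vectors `models` (one row per model) and their
    posterior probabilities, return the posterior inclusion probabilities
    p_i of (9.62) and the median probability model l* (l*_i = 1 iff
    p_i >= 1/2).  Illustrative helper, not from the text."""
    models = np.asarray(models, dtype=float)
    w = np.asarray(post_probs, dtype=float)
    p = w @ models                 # p_i = sum of P(M_l | y) over models with l_i = 1
    return p, (p >= 0.5).astype(int)

# All four subsets of two variables.
models = [[0, 0], [1, 0], [0, 1], [1, 1]]
post = [0.1, 0.5, 0.1, 0.3]
p, l_star = median_probability_model(models, post)
print(p, l_star)   # p = [0.8, 0.4]; the median model includes variable 1 only
```

Here the all-subsets class guarantees that l* corresponds to an actual model, illustrating the existence remark above.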
The class of models with 'graphical model structure' includes the class of models with all possible subsets of variables, as well as sequences of nested models M_{l(j)}, j = 0, 1, ..., k, where l(j) = (1, ..., 1, 0, ..., 0) with j ones and k − j zeros. For the all-subsets scenario, I(i) is the null set, while in the nested case I(i) = {j : 1 ≤ j < i} for i ≥ 2 and I(i) is the null set for i = 1. The latter models are natural in many examples including polynomial regression, where j refers to the degree of the polynomial used. Another example of nested models is provided by nonparametric regression (vide Chapter 10,
9 High-dimensional Problems
Sections 10.2, 10.3). There the unknown function is approximated by partial sums of its Fourier expansion, with all coefficients after stage j assumed to be zero. Note that in this situation the median probability model has a simple description: one calculates the cumulative sum of posterior model probabilities beginning from the smallest model, and the median probability model is the first model for which this sum equals or exceeds 1/2. Mathematically, the median probability model is M_{l(j*)}, where

Σ_{j=0}^{j*−1} P(M_{l(j)} | y) < 1/2   and   Σ_{j=0}^{j*} P(M_{l(j)} | y) ≥ 1/2.        (9.63)
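The cumulative-sum description in (9.63) translates directly into code; a minimal sketch (function name illustrative):

```python
import numpy as np

def nested_median_model(post_probs):
    """For nested models M_l(0), M_l(1), ... with posterior probabilities
    `post_probs` listed from the smallest model upward, return the index j*
    of the median probability model: the first j at which the cumulative
    posterior probability reaches 1/2, as in (9.63)."""
    csum = np.cumsum(post_probs)
    return int(np.argmax(csum >= 0.5))

print(nested_median_model([0.1, 0.2, 0.3, 0.4]))  # cumulative sum first reaches 1/2 at j = 2
```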
We now present some results on the optimality of the median probability model for prediction. The best predictive model is found as follows. Once a model is selected, the best Bayesian predictor assuming that model is true is obtained. In the next stage, one finds the model for which the expected prediction loss (this expectation does not assume any particular model is true, but is an overall expectation) using this Bayesian predictor is minimized. The minimizer is the best predictive model. There are some situations where the median probability model and the highest posterior probability model are the same. Obviously, if there is one model with posterior probability greater than 1/2, this is trivially true. Barbieri and Berger (2004) observe that when the highest posterior probability model has substantially larger probability than the other models, it will typically also be the median probability model. We describe another such situation later in the corollary to Theorem 9.8. We first state and prove two simple lemmas.

Lemma 9.5. (Barbieri and Berger, 2004) Assume Q exists and is positive definite. The optimal model for predicting y* under squared error loss is the unique model minimizing

R(M_l) = (H_l β̄_l − β̃)' Q (H_l β̄_l − β̃),        (9.64)
where β̃ is defined in (9.61).

Proof. As noted earlier, ŷ*_l is the optimal Bayesian predictor assuming M_l is the true model. The optimal predictive model is found by minimizing, with respect to l ranging over the space of models under consideration, the quantity E(y* − ŷ*_l)². Minimizing this is equivalent to minimizing, for each y, the quantity E[(y* − ŷ*_l)² | y]. It is easy to see that for a fixed x*,

E[(y* − ŷ*_l)² | y] = C + (ŷ* − ŷ*_l)²,        (9.65)

where the symbols have been defined earlier and C is a quantity independent of l. The expectation above is taken with respect to the predictive distribution of y* given y and x*. So the optimal predictive model is found by finding
the minimizer of the expression obtained by taking a further expectation over x* of the second quantity on the right-hand side of (9.65). By plugging in the values of ŷ* and ŷ*_l, we immediately get

(ŷ* − ŷ*_l)² = (H_l β̄_l − β̃)' x* x*' (H_l β̄_l − β̃).        (9.66)
The lemma follows on taking this expectation, which replaces x*x*' by Q. The uniqueness follows from the fact that Q is positive definite. □

Lemma 9.6. (Barbieri and Berger, 2004) If Q is diagonal with diagonal elements q_i > 0, and the posterior means β̄_l satisfy β̄_l = H_l' β̄ (where β̄ is the posterior mean under the full model as in (9.54)), then

R(M_l) = Σ_{i=1}^k q_i β̄_i² (l_i − p_i)².        (9.67)
Proof. From the fact that β̄_l = H_l' β̄ it follows that

β̃ = Σ_l P(M_l | y) H_l β̄_l = Σ_l P(M_l | y) H_l H_l' β̄ = D(p) β̄,        (9.68)

where D(p) is the diagonal matrix with diagonal elements p_i; this follows by noting that H_l(i, j) = 1 if l_i = 1 and j = Σ_{r=1}^i l_r, and H_l(i, j) = 0 otherwise. Similarly,

R(M_l) = (H_l H_l' β̄ − D(p) β̄)' Q (H_l H_l' β̄ − D(p) β̄) = β̄' (D(l) − D(p)) Q (D(l) − D(p)) β̄,        (9.69)

from which the result follows. □
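The algebra of (9.67)-(9.69) can be checked numerically. The sketch below (toy numbers, assumed setup: diagonal Q and submodel posterior means restricted from the full model, exactly the hypotheses of Lemma 9.6) verifies the quadratic-form identity for every subset model and confirms that the coordinatewise minimizer is the median probability model, anticipating Theorem 9.8:

```python
import itertools
import numpy as np

k = 3
beta = np.array([1.0, -2.0, 0.5])          # full-model posterior mean (toy values)
q = np.array([2.0, 1.0, 3.0])              # diagonal of Q, q_i > 0
models = list(itertools.product([0, 1], repeat=k))
rng = np.random.default_rng(0)
w = rng.dirichlet(np.ones(len(models)))    # arbitrary posterior model probabilities
p = sum(wi * np.array(m) for wi, m in zip(w, models))  # inclusion probabilities (9.62)

beta_tilde = p * beta                      # (9.68): model-averaged mean D(p) beta

def R(l):
    d = np.array(l) * beta - beta_tilde    # H_l beta_l - beta_tilde, coordinatewise
    return d @ (q * d)

for l in models:                           # (9.67): R(M_l) = sum q_i beta_i^2 (l_i - p_i)^2
    assert np.isclose(R(l), np.sum(q * beta**2 * (np.array(l) - p) ** 2))

median = tuple((p >= 0.5).astype(int))
best = min(models, key=R)
print(median == best)                      # the median model minimizes R over all subsets
```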
Remark 9.7. The condition β̄_l = H_l' β̄ simply means that the posterior mean of β_l is found by taking the relevant coordinates of the posterior mean in the full model as in (9.54). As Barbieri and Berger (2004) comment, this will happen in two important cases. Assume X'X is diagonal. In the first case, if one uses the reference prior π_l(β_l, σ) = 1/σ, or a constant prior if σ is known, the least squares estimates coincide with the posterior means, and the diagonality of X'X implies that the above condition holds. Secondly, suppose in the full model π(β | σ) = N_k(μ, σ²Δ), where Δ is a known diagonal matrix, and for the submodels one takes the natural corresponding prior N_{k(l)}(H_l' μ, σ² H_l' Δ H_l). Then it is easy to see that for any prior on σ², or if σ² is known, the above condition holds. We now state the first theorem.

Theorem 9.8. (Barbieri and Berger, 2004) If Q is diagonal with q_i > 0 and β̄_l = H_l' β̄, and the models have graphical model structure, then the median probability model is the best predictive model.
Proof. Because q_i > 0 and β̄_i² ≥ 0 for each i, and p_i (defined in (9.62)) does not depend on l, to minimize R(M_l) among all possible models it suffices to minimize (l_i − p_i)² for each individual i; that is achieved by choosing l_i = 1 if p_i ≥ 1/2 and l_i = 0 if p_i < 1/2, whence the l so defined is the median probability model. The graphical model structure ensures that this model is among the class of models under consideration. □

Remark 9.9. The above theorem obviously holds if we consider all submodels, this class having graphical model structure, provided the conditions of the theorem hold. By the same token, the result holds when the models under consideration are nested.

Corollary 9.10. (Barbieri and Berger, 2004) If the conditions of the above theorem hold, all submodels of the full model are allowed, σ² is known, X'X is diagonal, the β_i's have N(μ_i, λ_i σ²) prior distributions, and
P(M_l) = Π_{i=1}^k (p_i⁰)^{l_i} (1 − p_i⁰)^{1−l_i},        (9.70)
where p_i⁰ is the prior probability that variable X_i is in the model, then the optimal predictive model is the model with the highest posterior probability, which is also the median probability model.

Proof. Let β̂_i be the least squares estimate of β_i under the full model. Because X'X is diagonal, the β̂_i's are independent and the likelihood under M_l factors as
L(M_l) ∝ Π_{i=1}^k (λ_i¹)^{l_i} (λ_i⁰)^{1−l_i},

where λ_i¹ depends only on β̂_i and β_i, λ_i⁰ depends only on β̂_i, and the constants of proportionality here and below depend on Y and the β̂_i's. Also, the conditional prior distribution of the β_i's given M_l has the factorization

π(β | M_l) = Π_{i=1}^k [N(μ_i, λ_i σ²)]^{l_i} [δ{0}]^{1−l_i},

where δ{0} denotes the degenerate distribution with all mass at zero. It follows from (9.70) and the above two factorizations that the posterior probability of M_l has a factorization

P(M_l | Y) ∝ Π_{i=1}^k ( p_i⁰ ∫ λ_i¹ N(μ_i, λ_i σ²) dβ_i )^{l_i} ( (1 − p_i⁰) λ_i⁰ )^{1−l_i},

which in turn implies that the marginal posterior probability of including or not including the ith variable is proportional to the two terms, respectively, in the ith factor. This completes the proof, vide Problem 21. (The integral can be evaluated as in Chapter 2.) □
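The key consequence of the factorized posterior, that the posterior mode and the median probability model coincide (also Problem 21), is easy to confirm numerically. A small sketch, with illustrative inclusion probabilities:

```python
import itertools
import numpy as np

# When P(M_l | Y) factors as prod_i p_i^{l_i} (1 - p_i)^{1 - l_i}
# (independent inclusions), the posterior mode and the median
# probability model agree: each factor is maximized by l_i = 1
# exactly when p_i >= 1/2.
p = np.array([0.9, 0.3, 0.6, 0.45])       # marginal inclusion probabilities
models = list(itertools.product([0, 1], repeat=len(p)))

def post(l):
    l = np.array(l)
    return float(np.prod(p**l * (1 - p) ** (1 - l)))

mode = max(models, key=post)
median = tuple((p >= 0.5).astype(int))
print(mode == median)   # both include exactly the variables with p_i >= 1/2
```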
We have noted before that if the conditions in Theorem 9.8 are satisfied and the models are nested, then the best predictive model is the median probability model. Interestingly, even if Q is not necessarily diagonal, the best predictive model turns out to be the median probability model under some mild assumptions in the nested model scenario. Consider

Assumption 1: Q = γX'X for some γ > 0, i.e., the prediction will be made at covariates that are similar to the ones already observed in the past.

Assumption 2: β̄_l = b β̂_l, where b > 0, i.e., the posterior means are proportional to the least squares estimates.

Remark 9.11. Barbieri and Berger (2004) list two situations in which the second assumption is satisfied. First, if one uses the reference prior π_l(β_l, σ) = 1/σ, the posterior means are the least squares estimates. It is also satisfied, with b = c/(1 + c), if one uses g-type normal priors of Zellner (1986), where π_l(β_l | σ) ~ N_{k(l)}(0, cσ²(X_l'X_l)⁻¹) and the prior on σ is arbitrary.

Theorem 9.12. For a sequence of nested models for which the above two assumptions hold, the best predictive model is the median probability model.

Proof. See Barbieri and Berger (2004). □
Barbieri and Berger (2004, Section 5) present a geometric formulation for identification of the optimal predictive model. They also establish conditions under which the median probability model and the maximum posterior probability model coincide, and show that it is typically not enough to know only the posterior probabilities of each model to determine the optimal predictive model.

Till now we have concentrated on some Bayesian approaches to the prediction problem. It turns out that model selection based on the classical Akaike information criterion (AIC) also plays an important role in Bayesian prediction and estimation for linear models and function estimation. Optimality results for AIC in classical statistics are due to Shibata (1981, 1983), Li (1987), and Shao (1997). The first Bayesian result about AIC is taken from Mukhopadhyay (2000). Here one has observations {y_ij : i = 1, ..., p, j = 1, ..., r, n = pr} given by

y_ij = μ_i + ε_ij,        (9.71)

where the ε_ij are i.i.d. N(0, σ²) with σ² known. The models are M1: μ_i = 0 for all i, and M2: η² = lim_{p→∞} (1/p) Σ_{i=1}^p μ_i² > 0. Under M2, we assume a N(0, τ²I_p) prior on μ, where τ² is to be estimated from data using an empirical Bayes method. It is further assumed that p → ∞ as n → ∞. The goal is to predict a future set of observations {z_ij} independent of {y_ij} using the usual prediction error loss, with the 'constraint' that once a model is selected, least squares estimates have to be used to make the predictions. Theorem 9.13 shows that the constrained empirical Bayes rule is equivalent to AIC asymptotically. A weaker result is given as Problem 17.
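The AIC side of this comparison in model (9.71) reduces to a simple threshold rule: with σ² known, AIC prefers M2 exactly when r Σ_i Ȳ_i²/σ² > 2p. A hedged numerical sketch (simulation setup and seed are illustrative, not from the text):

```python
import numpy as np

# AIC comparison of M1 (all means zero) vs M2 (unrestricted means, fitted by
# least squares) in model (9.71), sigma^2 known.  AIC = RSS/sigma^2 + 2*dim.
rng = np.random.default_rng(3)
p, r, sigma = 50, 5, 1.0
mu = rng.normal(scale=1.5, size=p)         # M2 is true in this simulation
y = mu[:, None] + sigma * rng.normal(size=(p, r))
ybar = y.mean(axis=1)

aic1 = (y**2).sum() / sigma**2             # M1 fits no parameters
aic2 = ((y - ybar[:, None]) ** 2).sum() / sigma**2 + 2 * p
picks_m2 = aic2 < aic1
# Algebraically, aic1 - aic2 = r * sum(ybar^2)/sigma^2 - 2p, so the two
# criteria below must agree.
threshold_rule = r * (ybar**2).sum() / sigma**2 > 2 * p
print(picks_m2, threshold_rule)
```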
Theorem 9.13. (Mukhopadhyay, 2000) Suppose M2 is true; then asymptotically the constrained empirical Bayes rule and AIC select the same model. Under M1, AIC and the constrained empirical Bayes rule choose M1 with probability tending to 1. Also, under M1, the constrained empirical Bayes rule chooses M1 whenever AIC does so.

The result is extended to general nested problems in Mukhopadhyay and Ghosh (2004a). It is, however, also shown in the above reference that if one uses Bayes estimates instead of least squares estimates, then the unconstrained Bayes rule does better than AIC asymptotically. The performance of AIC in the PEB setup of George and Foster (2000) is studied in Mukhopadhyay and Ghosh (2004a). As one would expect from this, AIC also performs well in nonparametric regression, which can be formulated as an infinite-dimensional linear problem. It is shown in Chakrabarti and Ghosh (2005b) that AIC attains the optimal rate of convergence in an asymptotically equivalent problem and is also adaptive in the sense that it makes no assumption about the degree of smoothness. Because this result is somewhat technical, we only present some numerical results for the problem of nonparametric regression. In the nonparametric regression problem

Y_i = f(i/n) + ε_i,  i = 1, ..., n,        (9.72)

one has to estimate the unknown smooth function f. In Table 9.3, we consider n = 100 and f(x) = (sin(2πx))², (cos(πx))², 7 + cos(2πx), and e^{sin(2πx)}, the loss function L(f, f̂) = ∫₀¹ (f̂(x) − f(x))² dx, and report the average loss of the modified James-Stein estimator of Cai et al. (2000), AIC, and the kernel method with the Epanechnikov kernel in 50 simulations. To use the first two methods, we express f in its (partial sum) Fourier expansion with respect to the usual sine-cosine Fourier basis of [0, 1] and then estimate the Fourier coefficients by the regression coefficients. Some simple but basic insight about AIC may be obtained from Problems 15-17.
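A minimal illustrative sketch (not the authors' code; σ is taken as known and one test function is used) of how AIC chooses the Fourier partial-sum order in this regression problem:

```python
import numpy as np

# AIC selection of the Fourier partial-sum order in model (9.72),
# with sigma^2 known: AIC_m = RSS_m / sigma^2 + 2 * (number of coefficients).
rng = np.random.default_rng(1)
n, sigma = 100, 0.3
x = np.arange(1, n + 1) / n
f = 7 + np.cos(2 * np.pi * x)              # one of the smooth test functions
y = f + sigma * rng.normal(size=n)

def design(m):
    """Constant plus the first m sine-cosine pairs on [0, 1]."""
    cols = [np.ones_like(x)]
    for j in range(1, m + 1):
        cols += [np.sin(2 * np.pi * j * x), np.cos(2 * np.pi * j * x)]
    return np.column_stack(cols)

def aic(m):
    X = design(m)
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return resid @ resid / sigma**2 + 2 * X.shape[1]

m_hat = min(range(0, 15), key=aic)
print(m_hat)    # a low order, since one frequency already captures f
```

Because f here is exactly a constant plus one cosine term, a single sine-cosine pair drives the residual sum of squares down to the noise level, and larger orders only pay the 2-per-parameter penalty.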
It is also worth remembering that AIC was expected by Akaike to perform well in high-dimensional estimation or prediction problems where the true model is too complex to be in the model space.
9.9 Discussion

Bayesian model selection is passing through a stage of rapid growth, especially in the context of bioinformatics and variable selection. The two previous sections provide an overview of some of the literature; see also the review by Ghosh and Samanta (2001). For a very clear and systematic approach to different aspects of model selection, see Bernardo and Smith (1994). Model selection based on AIC is used in many real-life problems by Burnham and Anderson (2002). However, its use for testing problems with 0-1
Table 9.3. Comparison of Simulation Performance of Various Estimation Methods in Nonparametric Regression

Function        Modified James-Stein    AIC      Kernel Method
(sin(2πx))²     0.2165                  0.0793   0.0691
(cos(πx))²      0.2235                  0.078    0.091
7 + cos(2πx)    0.2576                  0.0529   0.5380
e^{sin(2πx)}    0.2618                  0.0850   0.082
loss is questionable, vide Problem 16. A very promising new model selection criterion due to Spiegelhalter et al. (2002) may also be interpreted as a generalization of AIC; see, e.g., Chakrabarti and Ghosh (2005a). In the latter paper, GBIC is also interpreted from the information-theoretic point of view of Rissanen (1987). We believe the Bayesian approach provides a unified approach to model selection and helps us see classical rules like BIC and AIC as still important but by no means the last word in any sense. We end this section with two final comments.

One important application of model selection is to examine model fit. Gelfand and Ghosh (1998) (see also Gelfand and Dey (1994)) use leave-k-out cross-validation to compare each collection of k data points with their predictive distribution based on the remaining observations. Based on the predictive distributions, one may calculate predicted values and some measure of deviation from the k observations that are left out. An average of the deviation over all sets of k left-out observations provides some idea of goodness of fit. Gelfand and Ghosh (1998) use these for model selection. Presumably, the average distance for a model can be used for model checking also. An interesting work of this kind is Bhattacharya (2005).

Another important problem is computation of the Bayes factor. Gelfand and Dey (1994) and Chib (1995) show how one can use MCMC calculations by relating the marginal likelihood of the data to the posterior via P(y) = L(θ | y)P(θ)/P(θ | y). Other relevant papers are Carlin and Chib (1995), Chib and Greenberg (1998), and Basu and Chib (2003). There are interesting suggestions also in Gelman et al. (1995).
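The identity P(y) = L(θ|y)P(θ)/P(θ|y) holds at every θ, which is what makes it usable with MCMC output. A toy conjugate-normal check (example entirely illustrative, not from the text), where all three densities are available in closed form:

```python
import numpy as np

def npdf(x, mean=0.0, sd=1.0):
    """Normal density, elementwise."""
    return np.exp(-(x - mean) ** 2 / (2 * sd**2)) / (sd * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
n = 5
y = rng.normal(size=n)
# Model: y_i | theta ~ N(theta, 1), prior theta ~ N(0, 1).  The posterior is
# N(sum(y)/(n+1), 1/(n+1)) and the marginal of y is N_n(0, I + 11').
post_mean, post_sd = y.sum() / (n + 1), (n + 1) ** -0.5

def marginal_via_identity(theta):
    like = npdf(y, theta).prod()
    return like * npdf(theta) / npdf(theta, post_mean, post_sd)

# Direct marginal: det(I + 11') = n + 1 and (I + 11')^{-1} = I - 11'/(n+1).
quad = y @ y - y.sum() ** 2 / (n + 1)
direct = (2 * np.pi) ** (-n / 2) * (n + 1) ** -0.5 * np.exp(-quad / 2)
print(np.isclose(marginal_via_identity(0.3), direct))  # True at any theta
```

In Chib's method, the likelihood and prior in the numerator are exact, and only the posterior ordinate in the denominator needs to be estimated from the MCMC output.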
9.10 Exercises

1. Show that π(η₂ | X) is an improper density if we take π(η₁, η₂) = 1/η₂ in Example 9.3.
2. Justify (9.2) and (9.3).
3. Complete the details to implement Gibbs sampling and the E-M algorithm in Example 9.3 when μ and σ² are unknown. Take π(η₁, σ², η₂) = 1/σ².
4. Let the X_i's be independent with density f(x | θ_i), i = 1, 2, ..., p, θ_i ∈ ℝ. Consider the problem of estimating θ = (θ₁, ..., θ_p)' with loss function

L(θ, a) = Σ_{i=1}^p (θ_i − a_i)²,

i.e., the total loss is the sum of the losses in estimating θ_i by a_i. An estimator for θ is the vector (T₁(X), T₂(X), ..., T_p(X)). We call this a compound decision problem with p components.
(a) Suppose sup_θ f(x | θ) = f(x | T(x)), i.e., T(x) is the MLE (of θ_j in f(x | θ_j)). Show that (T(X₁), T(X₂), ..., T(X_p)) is the MLE of θ.
(b) Suppose T(X) (not necessarily the T(X) of (a)) satisfies the sufficient condition for a minimax estimate given at the end of Section 1.5. Is (T(X₁), T(X₂), ..., T(X_p)) minimax for θ in the compound decision problem?
(c) Suppose T(X) is the Bayes estimate with respect to squared error loss for estimating θ of f(x | θ). Is (T(X₁), ..., T(X_p)) a Bayes estimate for θ?
(d) Suppose T = (T₁(X₁), ..., T_p(X_p)) and T_j(X_j) is admissible in the jth component decision problem. Is T admissible?
5. Verify the claim of the best unbiased predictor (9.17).
6. Given the hierarchical prior of Section 9.3 for Morris's regression setup, calculate the posterior and the Bayes estimate as explicitly as possible. Find the full conditionals of the posterior distribution in order to implement MCMC.
7. Prove the claims of superiority made in Section 9.4 for the James-Stein-Lindley estimate and the James-Stein positive part estimate using Stein's identity.
8. Under the setup of Section 9.3, show that the PEB risk of θ̂_i is smaller than the PEB risk of Y_i.
9. Refer to Sections 9.3 and 9.4. Compare the PEB risk of θ̂_i and Stein's frequentist risk of θ̂ and show that the two risks are of the same form, but one has E(B̂) and the other E_θ(B̂). (Hint: See equations (1.17) and (1.18) of Morris (1983).)
10. Consider the setup of Section 9.3. Show that B̂ is the best unbiased estimate of B.
11. (Disease mapping) (See Section 10.1 for more details on the setup.) Suppose that the area to be mapped is divided into N regions. Let O_i and E_i be respectively the observed and expected number of cases of a disease in the ith region, i = 1, 2, ..., N. The unknown parameters of interest are the θ_i, the relative risks in the N regions. The traditional model for O_i is the Poisson model, which states that given (θ₁, ..., θ_N), the O_i's are independent and O_i | θ_i ~ Poisson(E_i θ_i). Let θ₁, θ₂, ..., θ_N be i.i.d. ~ Gamma(a, b). Find the PEB estimates of θ₁, θ₂, ..., θ_N. In Section 10.1, we will consider hierarchical Bayes analysis for this problem.
12. Let the Y_i be independent N(θ_i, V), i = 1, 2, ..., p. Stein's heuristics (Section 9.4) shows that ||Y||² is too large in a frequentist sense. Verify by a similar argument that if the θ_i are i.i.d. uniform on ℝ, then ||Y||² is too small in an improper Bayesian sense; i.e., there is extreme divergence between frequentist probability and naive objective Bayes probability in a high-dimensional case.
13. (Berger (1985a, p. 542)) Consider a multiparameter exponential family f(x | θ) = c(θ) exp(θ'T(x)) h(x), where x and θ are vectors of the same dimension. Assuming Stein's loss, show that (under suitable conditions) the Bayes estimate can be written as gradient(log m(x)) − gradient(log h(x)), where m(x) is the marginal density of x obtained by integrating out θ.
14. Simulate data according to the model in Example 9.3, Section 9.1.
(a) Examine how well the model can be checked from the data X_ij, i = 1, 2, ..., n, j = 1, 2, ..., p.
(b) Suppose one uses the empirical distribution of the X̄_j's as a surrogate prior for the μ_j's. Compare critically the Bayes estimate of μ for this prior with the PEB estimate.
15. (Stone's problem) Let Y_ij = α + μ_i + ε_ij, ε_ij ~ N(0, σ²), i = 1, 2, ..., p, j = 1, 2, ..., r, n = pr, with σ² assumed known or estimated by S² = Σ_{i=1}^p Σ_{j=1}^r (Y_ij − Ȳ_i)²/(n − p). The two models are

M1: μ_i = 0 for all i   and   M2: μ ∈ ℝ^p.

Suppose n → ∞, p log n/n → ∞, and Σ_{i=1}^p (μ_i − μ̄)²/(p − 1) → τ² > 0.
(a) Show that even though M2 is true, BIC will select M1 with probability tending to 1. Also show that AIC will choose the right model M2 with probability tending to one.
(b) As a Bayesian, how important do you think is this notion of consistency?
(c) Explore the relation between AIC and selection of a model based on estimation of the residual sum of squares by leave-one-out cross-validation.
16. Consider an extremely simple testing problem: X ~ N(μ, 1). You have to test H₀: μ = 0 versus H₁: μ ≠ 0. Is AIC appropriate for this? Compare AIC, BIC, and the usual likelihood ratio test, keeping in mind the conflict between P-values and the posterior probability of the sharp null hypothesis.
17. Consider two nested models and an empirical Bayes model selection rule with the evaluation based on the more complex model. Though you know the more complex model is true, you may be better off predicting with the simpler model. Let Y_ij = μ_i + ε_ij, ε_ij i.i.d. N(0, σ²), i = 1, 2, ..., p, j = 1, 2, ..., r, with known σ². The models are

M1: μ = 0   and   M2: μ ∈ ℝ^p, μ ~ N_p(0, τ²I_p), τ² > 0.
(a) Assume that in the PEB evaluation under M2 you estimate τ² by the moment estimate

τ̂² = (1/p) Σ_{i=1}^p Ȳ_i² − σ²/r.

Show that with PEB evaluation of risk under M2 and M1, Ȳ is preferred if and only if AIC selects M2.
(b) Why is it desirable to have large p in this problem?
(c) How will you try to justify, in an intuitive way, the occasional choice of the simple but false model?
(d) Use (a) to motivate how the penalty coefficient 2 arises in AIC.
(This problem is based on a result in Mukhopadhyay (2001).)
18. Burnham and Anderson (2002) generated data to mimic a real-life experiment of Stromberg et al. (1998). Select a suitable model from among the 9 models considered by Ghosh and Samanta (2001). The main issue is computation of the integrated likelihood under each model. You can try the Laplace approximation, the method based on MCMC suggested at the end of Section 9.9, and importance sampling. All methods are difficult, but they give very close answers in this problem. The data and the models can be obtained from the Web page http://www.isical.ac.in/~tapas/book
19. Let X_i ~ N(μ, 1), i = 1, ..., n, and μ ~ N(η₁, η₂). Find the PEB estimates of η₁ and η₂ and examine the implications for the inadequacy of the PEB approach in low-dimensional problems.
20. Consider NPEB multiple testing (Section 9.6.1) with known π₁ and an estimate f̂ of (1 − π₁)f₀ + π₁f₁. Suppose for each i you reject H₀i: μ_i = 0 if f₀(x_i) < f̂(x_i)α, where 0 < α < 1. Examine whether this test provides any control of the (frequentist) FDR. Define a Bayesian FDR and examine whether, for small π₁, this is also controlled by the test. Suggest a test that would make the Bayesian FDR approximately equal to α. (The idea of controlling a Bayesian FDR is due to Storey (2003). The simple rules in this problem are due to Bogdan, Ghosh, and Tokdar (personal communication).)
21. For all-subsets variable selection models, show that the posterior median model and the posterior mode model are the same if

P(M_l | X) = Π_{i=1}^k p_i^{l_i} (1 − p_i)^{1−l_i},

where l_i = 1 if the ith variable is included in M_l and l_i = 0 otherwise.
10 Some Applications
The popularity of Bayesian methods in recent times is mainly due to their successful applications to complex high-dimensional real-life problems in diverse areas such as epidemiology, microarrays, pattern recognition, signal processing, and survival analysis. This chapter presents a few such applications together with the required methodology. We describe the method without going into the details of the critical issues involved, for which references are given. This is followed by an application involving real or simulated data. We begin with a hierarchical Bayesian modeling of spatial data in Section 10.1. This is in the context of disease mapping, an area of epidemiological interest. The next two sections, 10.2 and 10.3, present nonparametric estimation of regression function using wavelets and Dirichlet multinomial allocation. They may also be treated as applications involving Bayesian data smoothing. For several recent advances in Bayesian nonparametrics, see Dey et al. (1998) and Ghosh and Ramamoorthi (2003).
10.1 Disease Mapping

Our first application is from the area of epidemiology and involves hierarchical Bayesian spatial modeling. Disease mapping provides a geographical distribution of a disease, displaying some index such as the relative risk of the disease in each subregion of the area to be mapped. Suppose that the area to be mapped is divided into N regions. Let O_i and E_i be respectively the observed and expected number of cases of a disease in the ith region, i = 1, 2, ..., N. The unknown parameters of interest are the θ_i, the relative risk in the ith region, i = 1, 2, ..., N. Here E_i is a simple-minded expectation assuming all regions have the same disease rate (at least after adjustment for age), vide Banerjee et al. (2004, p. 158). The relative risk θ_i is the regional effect in a multiplicative model for the expected number of cases: E(O_i) = E_i θ_i. If θ_i = 1, we have E(O_i) = E_i. The objective is to make inference about the θ_i's across regions. Among other things, this helps epidemiologists and public health professionals
to identify regions or clusters of regions having high relative risks, and hence needing attention, and also to identify covariates causing high relative risk. The traditional model for O_i is the Poisson model, which states that given (θ₁, ..., θ_N), the O_i's are independent and

O_i | θ_i ~ Poisson(E_i θ_i).        (10.1)
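Under the Poisson model, the classical estimate of θ_i is O_i/E_i, and the Gamma prior discussed below smooths it by conjugacy: θ_i | O_i ~ Gamma(a + O_i, b + E_i), so the posterior mean is a compromise between the prior mean a/b and O_i/E_i. A small sketch (a and b are illustrative, not estimated from the data):

```python
import numpy as np

# Observed and expected counts for the first four Scotland counties (Table 10.1).
O = np.array([9.0, 39.0, 11.0, 9.0])
E = np.array([1.4, 8.7, 3.0, 2.5])
a, b = 2.0, 2.0                 # illustrative hyperparameters; prior mean a/b = 1

smr = O / E                     # classical MLE: the standardized mortality ratio
post_mean = (a + O) / (b + E)   # Gamma-Poisson posterior mean, shrunk towards a/b
print(np.round(smr, 2), np.round(post_mean, 2))
```

Each posterior mean lies between the prior mean and the raw SMR, with less smoothing for counties with larger E_i, which is the behavior one wants from the PEB estimates of Exercise 11.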
Under this model the E_i's are assumed fixed. The classical maximum likelihood estimate of θ_i is θ̂_i = O_i/E_i, known as the standardized mortality ratio (SMR) for region i, with Var(θ̂_i) = θ_i/E_i, which may be estimated as O_i/E_i². However, it was noted in Chapter 9 that the classical estimates may not be appropriate here for simultaneous estimation of the parameters θ₁, θ₂, ..., θ_N. As mentioned in Chapter 9, because of the assumption of exchangeability of θ₁, ..., θ_N, there is a natural Bayesian solution to the problem. Bayesian modeling involves specification of a prior distribution for (θ₁, ..., θ_N). Clayton and Kaldor (1987) followed the empirical Bayes approach using a model that assumes

θ₁, θ₂, ..., θ_N i.i.d. ~ Gamma(a, b)        (10.2)

and estimating the hyperparameters a and b from the marginal density of {O_i} given a, b (see Section 9.2). Here we present a full Bayesian approach adopting a prior model that allows for spatial correlation among the θ_i's. A natural extension of (10.2) could be a multivariate Gamma distribution for (θ₁, ..., θ_N). We, however, assume a multivariate normal distribution for the log-relative risks log θ_i, i = 1, ..., N. The model may also be extended to allow for explanatory covariates x_i that may affect the relative risk. Thus we consider the following hierarchical Bayesian model:

the O_i | θ_i are independent ~ Poisson(E_i θ_i), where log θ_i = x_i'β + φ_i, i = 1, ..., N.        (10.3)

The usual prior for φ = (φ₁, ..., φ_N) is given by the conditionally autoregressive (CAR) model (Besag, 1974), which is briefly described below. For details see, e.g., Besag (1974) and Banerjee et al. (2004, pp. 79-83, 163, 164). Suppose the full conditionals are specified as

φ_i | φ_j, j ≠ i ~ N(Σ_j a_ij φ_j, σ_i²),  i = 1, 2, ..., N.        (10.4)

These lead to a joint distribution having density proportional to

exp{−(1/2) φ' D⁻¹(I − A) φ},        (10.5)

where D = Diag(σ₁², ..., σ_N²) and A = (a_ij)_{N×N}. We look for a model that allows for spatial correlation, and so consider a model where correlation depends
on geographical proximity. A proximity matrix W = (w_ij) is an N × N matrix in which w_ij spatially connects regions i and j in some manner. We consider here binary choices: we set w_ii = 0 for all i, and for i ≠ j, w_ij = 1 if i is a neighbor of j, i.e., i and j share some common boundary, and w_ij = 0 otherwise. The w_ij's in each row may also be standardized as w̃_ij = w_ij/w_i0, where w_i0 = Σ_j w_ij is the number of neighbors of region i. Returning to our model (10.5), we now set a_ij = α w_ij/w_i0 and σ_i² = λ/w_i0. Then (10.5) becomes

exp{−(1/(2λ)) φ'(D_w − αW) φ},

where D_w = Diag(w_10, w_20, ..., w_N0). This also ensures that D⁻¹(I − A) = (1/λ)(D_w − αW) is symmetric. Thus the prior for φ is multivariate normal,

φ ~ N(0, Γ) with Γ = λ(D_w − αW)⁻¹.        (10.6)
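The construction in (10.6) is easy to verify on a toy map (the adjacency below is assumed for illustration, not the Scotland map): build the precision matrix (1/λ)(D_w − αW) from a 0-1 adjacency matrix and check that for 0 < α < 1 it is symmetric and positive definite, so the CAR prior is proper.

```python
import numpy as np

# Toy adjacency for 4 regions: w_ij = 1 iff regions i and j share a boundary.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
alpha, lam = 0.9, 2.0
Dw = np.diag(W.sum(axis=1))          # w_i0 = number of neighbors of region i
precision = (Dw - alpha * W) / lam   # inverse of Gamma in (10.6)

eigs = np.linalg.eigvalsh(precision)
print(np.allclose(precision, precision.T), eigs.min() > 0)  # symmetric, PD
```

Setting alpha = 1 would make D_w − W singular (its rows sum to zero), which is the improper CAR model mentioned below.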
We take 0 < α < 1, which ensures propriety of the prior and positive spatial correlation; only values of α close to 1 give enough spatial similarity. For α = 1 we have the standard improper CAR model. One may use the improper CAR prior because it is known that the posterior will typically emerge as proper. For this and other related issues, see Banerjee et al. (2004). Having specified priors for all the unknown parameters, including the spatial variance parameter λ and the propriety parameter α (0 < α < 1), one can now do Bayesian analysis using MCMC techniques. We illustrate through an example.

Example 10.1. Table 10.1 presents data from Clayton and Kaldor (1987) on observed (O_i) and expected (E_i) cases of lip cancer during the period 1975-1980 for N = 56 counties of Scotland. Also available are the x_i, values of a covariate: the percentage of the population engaged in agriculture, fishing, and forestry (AFF), for the 56 counties. The log-relative risk is modeled as

log θ_i = β₀ + β₁ x_i + φ_i,  i = 1, ..., N,        (10.7)

where the prior for (φ₁, ..., φ_N) is as specified in (10.6). We use vague priors for β₀ and β₁ and a prior having high concentration near 1 for the parameter α. The data may be analyzed using WinBUGS. A WinBUGS code for this example is put in the Web page of Samanta. A part of the results, the Bayes estimates θ̂_i of the relative risks for the 56 counties, is presented in Table 10.1. The θ_i's are smoothed by pooling the neighboring values in an automatic adaptive way, as suggested in Chapter 9. The estimates of β₀ and β₁ are β̂₀ = −0.2923 and β̂₁ = 0.3748, with estimates of posterior s.d. equal to 0.3426 and 0.1325, respectively.
Table 10.1. Lip Cancer Incidence in Scotland by County: Observed Numbers (O_i), Expected Numbers (E_i), Values of the Covariate AFF (x_i), and Bayes Estimates of the Relative Risk (θ̂_i)

County  O_i   E_i  x_i   θ̂_i  | County  O_i   E_i  x_i   θ̂_i
   1      9   1.4   16  4.705 |    29    16  14.4   10  1.222
   2     39   8.7   16  4.347 |    30    11  10.2   10  0.895
   3     11   3.0   10  3.287 |    31     5   4.8    7  0.860
   4      9   2.5   24  2.981 |    32     3   2.9   24  1.476
   5     15   4.3   10  3.145 |    33     7   7.0   10  0.966
   6      8   2.4   24  3.775 |    34     8   8.5    7  0.770
   7     26   8.1   10  2.917 |    35    11  12.3    7  0.852
   8      7   2.3    7  2.793 |    36     9  10.1    0  0.762
   9      6   2.0    7  2.143 |    37    11  12.7   10  0.886
  10     20   6.6   16  2.902 |    38     8   9.4    1  0.601
  11     13   4.4    7  2.779 |    39     6   7.2   16  1.008
  12      5   1.8   16  3.265 |    40     4   5.3    0  0.569
  13      3   1.1   10  2.563 |    41    10  18.8    1  0.532
  14      8   3.3   24  2.049 |    42     8  15.8   16  0.747
  15     17   7.8    7  1.809 |    43     2   4.3   16  0.928
  16      9   4.6   16  2.070 |    44     6  14.6    0  0.467
  17      2   1.1   10  1.997 |    45    19  50.7    1  0.431
  18      7   4.2    7  1.178 |    46     3   8.2    7  0.587
  19      9   5.5    7  1.912 |    47     2   5.6    1  0.470
  20      7   4.4   10  1.395 |    48     3   9.3    1  0.433
  21     16  10.5    7  1.377 |    49    28  88.7    0  0.357
  22     31  22.7   16  1.442 |    50     6  19.6    1  0.507
  23     11   8.8   10  1.185 |    51     1   3.4    1  0.481
  24      7   5.6    7  0.837 |    52     1   3.6    0  0.447
  25     19  15.5    1  1.188 |    53     1   5.7    1  0.399
  26     15  12.5    1  1.007 |    54     1   7.0    1  0.406
  27      7   6.0    7  0.946 |    55     0   4.2   16  0.865
  28     10   9.0    7  1.047 |    56     0   1.8   10  0.773
10.2 Bayesian Nonparametric Regression Using Wavelets

Let us recall the nonparametric regression problem that was stated in Example 6.1. In this problem, it is of interest to fit a general regression function to a set of observations. It is assumed that the observations arise from a real-valued regression function defined on an interval of the real line. Specifically, we have

y_i = g(x_i) + ε_i,  i = 1, ..., n,  with x_i ∈ T,        (10.8)

where the ε_i are i.i.d. N(0, σ²) errors with unknown error variance σ², and g is a function defined on some interval T ⊂ ℝ.
It can be immediately noted that a Bayesian solution to this problem involves specifying a prior distribution on a large class of regression functions. In general, this is a rather difficult task. A simple approach that has been successful is to decompose the regression function $g$ into a linear combination of a set of basis functions and to specify a prior distribution on the regression coefficients. In our discussion here, we use the (orthonormal) wavelet basis. We provide a very brief non-technical overview of wavelets, including multiresolution analysis (MRA), here; for a complete and thorough discussion refer to Ogden (1997), Daubechies (1992), Hernandez and Weiss (1996), Müller and Vidakovic (1999), and Vidakovic (1999).

10.2.1 A Brief Overview of Wavelets

Consider the function

    \psi(x) = \begin{cases} 1, & 0 \le x < 1/2; \\ -1, & 1/2 \le x < 1; \\ 0, & \text{otherwise}, \end{cases}    (10.9)

which is known as the Haar wavelet, the simplest of the wavelets. Note that its dyadic dilations along with integer translations, namely,

    \psi_{j,k}(x) = 2^{j/2}\,\psi(2^j x - k), \qquad j, k \in \mathbb{Z},    (10.10)

provide a complete orthonormal system for $L^2(\mathbb{R})$. This says that any $f \in L^2(\mathbb{R})$ can be approximated arbitrarily well using step functions that are simply linear combinations of the wavelets $\psi_{j,k}(x)$. What is more interesting and important is how a finer approximation for $f$ can be written as an orthogonal sum of a coarser approximation and a detail function. In other words, for $j \in \mathbb{Z}$, let

    V_j = \left\{ f \in L^2(\mathbb{R}) : f \text{ is piecewise constant on the intervals } [k2^{-j}, (k+1)2^{-j}),\ k \in \mathbb{Z} \right\}.    (10.11)
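Orthonormality of the Haar system (10.10) is easy to spot-check numerically. The sketch below (pure Python, midpoint-rule integration over a truncated interval) approximates a few of the inner products $\langle \psi_{j,k}, \psi_{j',k'} \rangle$:

```python
# Numerical check that the dyadic dilations/translations of the Haar
# wavelet, psi_{j,k}(x) = 2^{j/2} psi(2^j x - k), are orthonormal.

def psi(x):
    """Haar mother wavelet (10.9)."""
    if 0.0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0

def psi_jk(j, k, x):
    return 2.0 ** (j / 2) * psi(2.0 ** j * x - k)

def inner(j1, k1, j2, k2, lo=-4.0, hi=4.0, n=2 ** 14):
    """Midpoint-rule approximation to <psi_{j1,k1}, psi_{j2,k2}>."""
    h = (hi - lo) / n
    return sum(psi_jk(j1, k1, lo + (i + 0.5) * h)
               * psi_jk(j2, k2, lo + (i + 0.5) * h) for i in range(n)) * h

print(inner(0, 0, 0, 0))   # approximately 1 (unit norm)
print(inner(0, 0, 1, 0))   # approximately 0 (different scales)
print(inner(0, 0, 0, 1))   # 0 (disjoint supports)
```

Because the Haar wavelet is piecewise constant with dyadic breakpoints, the midpoint rule here is essentially exact.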
Now suppose $P^j f$ is the projection of $f \in L^2(\mathbb{R})$ onto $V_j$. Then note that

    P^j f = P^{j-1} f + \sum_{k \in \mathbb{Z}} \langle f, \psi_{j-1,k} \rangle\, \psi_{j-1,k},    (10.12)

with $g^{j-1} = \sum_{k \in \mathbb{Z}} \langle f, \psi_{j-1,k} \rangle \psi_{j-1,k}$ being the detail function as shown, so that

    V_j = V_{j-1} \oplus W_{j-1},    (10.13)

where $W_j = \text{span}\{\psi_{j,k}, k \in \mathbb{Z}\}$. Also, corresponding with the 'mother' wavelet $\psi$ (the Haar wavelet in this case), there is a father wavelet or scaling function
$\phi = 1_{[0,1]}$ such that $V_j = \text{span}\{\phi_{j,k}, k \in \mathbb{Z}\}$, where $\phi_{j,k}$ is the dilation and translation of $\phi$ similar to the definition (10.10), i.e.,

    \phi_{j,k}(x) = 2^{j/2}\,\phi(2^j x - k), \qquad j, k \in \mathbb{Z}.    (10.14)
In fact, the sequence of subspaces $\{V_j\}$ has the following properties:
1. $\cdots \subset V_{-2} \subset V_{-1} \subset V_0 \subset V_1 \subset V_2 \subset \cdots$.
2. $\bigcap_{j \in \mathbb{Z}} V_j = \{0\}$, and $\overline{\bigcup_{j \in \mathbb{Z}} V_j} = L^2(\mathbb{R})$.
3. $f \in V_j$ iff $f(2\,\cdot) \in V_{j+1}$.
4. $f \in V_0$ implies $f(\cdot - k) \in V_0$ for all $k \in \mathbb{Z}$.
5. There exists $\phi \in V_0$ such that $\text{span}\{\phi_{0,k} = \phi(\cdot - k), k \in \mathbb{Z}\} = V_0$.

Given this $\phi$, the corresponding $\psi$ can be easily derived (see Ogden (1997) or Vidakovic (1999)). What is interesting and useful to us is that there exist scaling functions $\phi$ with desirable features other than the Haar function. Especially important are the Daubechies wavelets, which are compactly supported, each having a different degree of smoothness.

Definition. Closed subspaces $\{V_j\}_{j \in \mathbb{Z}}$ satisfying properties 1-5 are said to form a multiresolution analysis (MRA) of $L^2(\mathbb{R})$. If $V_j = \text{span}\{\phi_{j,k}, k \in \mathbb{Z}\}$ form an MRA of $L^2(\mathbb{R})$, then the corresponding $\phi$ is also said to generate this MRA.

In statistical inference, we deal with finite data sets, so wavelets with compact support are desirable. Further, the regression functions (or density functions) that we need to estimate are expected to have a certain degree of smoothness, so the wavelets used should have some smoothness also. The Haar wavelet does have compact support but is not very smooth. In the application discussed later, we use wavelets from the family of compactly supported smooth wavelets introduced by Daubechies (1992). These, however, cannot be expressed in closed form. A sketch of their construction is as follows. Because, from property 5 of the MRA, $\phi \in V_0 \subset V_1$, we have

    \phi(x) = \sum_{k \in \mathbb{Z}} h_k\, \phi_{1,k}(x),    (10.15)
where the 'filter' coefficients $h_k$ are given by

    h_k = \langle \phi, \phi_{1,k} \rangle = \sqrt{2} \int \phi(x)\,\phi(2x - k)\, dx.    (10.16)
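For the Haar scaling function $\phi = 1_{[0,1]}$, (10.16) can be evaluated directly: $\phi(x)\phi(2x - k)$ is nonzero only for $k = 0, 1$, giving $h_0 = h_1 = 1/\sqrt{2}$. A quick numerical confirmation:

```python
import math

# Haar filter coefficients h_k = sqrt(2) * integral phi(x) phi(2x - k) dx,
# where phi is the indicator of [0, 1).

def phi(x):
    return 1.0 if 0.0 <= x < 1.0 else 0.0

def h(k, n=100000):
    """Midpoint-rule approximation to (10.16) over [0, 1]."""
    step = 1.0 / n
    s = sum(phi((i + 0.5) * step) * phi(2 * (i + 0.5) * step - k)
            for i in range(n))
    return math.sqrt(2) * s * step

for k in (-1, 0, 1, 2):
    print(k, round(h(k), 6))   # only k = 0, 1 give 1/sqrt(2); others 0
```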
For compactly supported wavelets $\phi$, only finitely many $h_k$'s will be non-zero. Define the $2\pi$-periodic trigonometric polynomial

    m_0(\omega) = \frac{1}{\sqrt{2}} \sum_{k \in \mathbb{Z}} h_k e^{-ik\omega}    (10.17)

associated with $\{h_k\}$. The Fourier transforms of $\phi$ and $\psi$ can be shown to be of the form
    \hat{\phi}(\omega) = \frac{1}{\sqrt{2\pi}} \prod_{j=1}^{\infty} m_0(2^{-j}\omega),    (10.18)

    \hat{\psi}(\omega) = -e^{-i\omega/2}\, \overline{m_0(\omega/2 + \pi)}\, \hat{\phi}(\omega/2).    (10.19)
Depending on the number of non-zero elements in the filter $\{h_k\}$, wavelets of different degrees of smoothness emerge.

It is natural to wonder what is special about MRA. Smoothing techniques such as linear regression, splines, and Fourier series all try to represent a signal in terms of component functions. Wavelet-based MRA, on the other hand, studies the detail signals, i.e., the differences between the approximations made at adjacent resolution levels. This way, local changes can be picked up much more easily than with other smoothing techniques.

With this short introduction to wavelets, we return to the nonparametric regression problem in (10.8). Much of the following discussion closely follows Angers and Delampady (2001). We begin with a compactly supported wavelet function $\psi \in C^s$, the set of real-valued functions with continuous derivatives up to order $s$. We note that then $g$ has the wavelet decomposition

    g(x) = \sum_{|k| \le K_0} \alpha_k \phi_k(x) + \sum_{j \ge 0} \sum_{|k| \le K_j} \beta_{jk}\, \psi_{j,k}(x),    (10.20)

with $\phi_k(x) = \phi(x - k)$ and $\psi_{j,k}(x) = 2^{j/2}\psi(2^j x - k)$, where $K_j$ is such that $\phi_k(x)$ and $\psi_{j,k}(x)$ vanish on $T$ whenever $|k| > K_j$, and $\phi$ is the scaling function ('father wavelet') corresponding with the 'mother wavelet' $\psi$. Such $K_j$'s exist (and are finite) because the wavelet function that we have chosen has compact support. For any specified resolution level $J$, we have

    g(x) = \sum_{|k| \le K_0} \alpha_k \phi_k(x) + \sum_{j=0}^{J} \sum_{|k| \le K_j} \beta_{jk}\, \psi_{j,k}(x) + \sum_{j=J+1}^{\infty} \sum_{|k| \le K_j} \beta_{jk}\, \psi_{j,k}(x)
         = g_J(x) + R_J(x),    (10.21)

where

    g_J(x) = \sum_{|k| \le K_0} \alpha_k \phi_k(x) + \sum_{j=0}^{J} \sum_{|k| \le K_j} \beta_{jk}\, \psi_{j,k}(x), \quad \text{and} \quad
    R_J(x) = \sum_{j=J+1}^{\infty} \sum_{|k| \le K_j} \beta_{jk}\, \psi_{j,k}(x).    (10.22)
In the representation (10.22), we note that the $\phi$ functions appearing in the first part detect the global features of $g$, and subsequently the $\psi$ functions in the second part check for local details.

To proceed further, many standard wavelet-based procedures apply the 'discrete wavelet transform' to the data and work with the resulting wavelet coefficients (see Vidakovic (1999), Müller and Vidakovic (1999)). We, however, use the familiar hierarchical Bayesian approach to specify the prior model for $g$ in (10.8). At the resolution level $J$, (10.8) can be expressed as

    y_i = g_J(x_i) + \eta_i + \varepsilon_i,    (10.23)

where $\eta_i = R_J(x_i)$. Because the amount of information available in the likelihood function to estimate the infinitely many parameters $\beta_{jk}$, $j > J$, $|k| \le K_j$, is limited, these coefficients are absorbed into the error terms $\eta_i$. The remaining coefficients are collected into $\gamma = (\alpha', \beta')'$, which is given the prior $\gamma \mid \tau^2 \sim N(0, \tau^2\Gamma)$, where

    \Gamma = \begin{pmatrix} I_{2K_0+1} & 0 \\ 0 & \Lambda \end{pmatrix},    (10.24)

with $M_\beta = \sum_{j=0}^{J}(2K_j + 1)$ the dimension of $\beta$ and the diagonal matrix $\Lambda$ being the variance-covariance matrix of $\beta$. Also, $\eta = (\eta_1, \ldots, \eta_n)' \mid \tau^2 \sim N_n(0, \tau^2 Q_n)$, where, to keep the covariance structure of the $\eta_i$ simple, we choose

    (Q_n)_{ij} = 2^{-2J} \exp(-c|x_i - x_j|),
for some moderate value of $c$. Further, let $X = (\Phi : \Psi)$, with the $i$th row of $\Phi$ being $\{\phi_k(x_i)\}_{|k| \le K_0}$ and the $i$th row of $\Psi$ being $\{\psi_{j,k}(x_i)\}_{0 \le j \le J, |k| \le K_j}$. Then the model (10.23) can be written as

    Y = X\gamma + u,    (10.25)

where $u = \eta + \varepsilon \sim N_n(0, U)$ with $U = \sigma^2 I_n + \tau^2 Q_n$. This follows from the fact that

    Y \mid \gamma, \eta, \sigma^2, \tau^2 \sim N_n(X\gamma + \eta, \sigma^2 I_n).    (10.26)
From (10.25), using standard hierarchical Bayes techniques (cf. Lindley and Smith (1972)) and matrix identities (cf. Searle (1982)), it follows that

    Y \mid \sigma^2, \tau^2 \sim N_n\left(0,\ \sigma^2 I_n + \tau^2(X\Gamma X' + Q_n)\right),    (10.27)
    \gamma \mid Y, \sigma^2, \tau^2 \sim N(AY, B),    (10.28)

where

    A = \tau^2 \Gamma X' \left(\sigma^2 I_n + \tau^2(X\Gamma X' + Q_n)\right)^{-1},
    B = \tau^2 \Gamma - \tau^4 \Gamma X' \left(\sigma^2 I_n + \tau^2(X\Gamma X' + Q_n)\right)^{-1} X\Gamma.

To proceed to the second-stage calculations, some algebraic simplifications are needed (see Angers and Delampady (1992)). Spectral decomposition yields $X\Gamma X' + Q_n = HDH'$, where $D = \text{diag}(d_1, d_2, \ldots, d_n)$ is the matrix of eigenvalues and $H$ is the orthogonal matrix of eigenvectors. Thus,

    \sigma^2 I_n + \tau^2(X\Gamma X' + Q_n) = H(\sigma^2 I_n + \tau^2 D)H' = \tau^2 H(vI_n + D)H',    (10.29)

where $v = \sigma^2/\tau^2$. Using this spectral decomposition, the marginal density of $Y$ given $\tau^2$ and $v$ can be written as

    m(Y \mid \tau^2, v) = \frac{1}{(2\pi\tau^2)^{n/2}\, \det(vI_n + D)^{1/2}} \exp\left(-\frac{1}{2\tau^2}\, Y'H(vI_n + D)^{-1}H'Y\right)
        = \frac{1}{(2\pi\tau^2)^{n/2} \prod_{i=1}^{n}(v + d_i)^{1/2}} \exp\left(-\frac{1}{2\tau^2} \sum_{i=1}^{n} \frac{t_i^2}{v + d_i}\right),    (10.30)

where $t = (t_1, \ldots, t_n)' = H'Y$.
To derive the wavelet smoother, all that we need to do now is to eliminate the hyper- and nuisance parameters from the first-stage posterior distribution by integrating these variables out with respect to the second-stage prior on them. This is what we will do now. Alternatively, one could employ an empirical Bayes approach: estimate $\sigma^2$ and $\tau^2$ from equation (10.27) and replace them by their estimates in equation (10.28) to approximate $\gamma$. However, this will underestimate the variance of the wavelet estimator $\hat{Y} = X\hat\gamma$.

Suppose, then, that $\pi_2(\tau^2, v)$ is the second-stage prior. It is well known in the context of hierarchical Bayesian analysis (see Chapter 9, especially equation (9.7), and Berger (1985a)) that the sensitivity of the final Bayes estimator to the second and higher stage hyperpriors is somewhat limited. Therefore, for computational ease, we choose $\pi_2(\tau^2, v) = \pi_{22}(v)(\tau^2)^{-a}$ for some suitable choice of $a > 0$; $\pi_{22}$ is the prior specified for $v$. Once $a$ and $\pi_{22}$ are specified, using equation (10.28) along with (10.29) and taking the expectation with respect to $\tau^2$, we have that

    E(\gamma \mid Y) = \hat\gamma = \Gamma X' H\, E\left[(vI_n + D)^{-1} \mid Y\right] t,    (10.31)

where the expectation is taken with respect to $\pi_{22}(v \mid Y)$. Again using equations (10.28) and (10.29), the posterior covariance matrix of $\gamma$ can be written as

    \text{Var}(\gamma \mid Y) = \frac{1}{n + 2a}\, E\left[\left(\sum_{i=1}^{n} \frac{t_i^2}{v + d_i}\right)\left\{\Gamma - \Gamma X'H(vI_n + D)^{-1}H'X\Gamma\right\} \,\Big|\, Y\right] + E\left[\tilde\gamma(v)\tilde\gamma(v)' \mid Y\right] - \hat\gamma\hat\gamma',    (10.32)

where $\tilde\gamma(v) = \Gamma X'H(vI_n + D)^{-1}t$. To compute these expectations, one can use several techniques. Because they involve only one-dimensional integrals, standard numerical integration methods will work quite well. Several versions of the standard Monte Carlo approach can also be employed quite satisfactorily and efficiently. An example illustrating the methodology follows.

Example 10.2. This is based on data provided by Prof. Abraham Verghese (F.R.E.S.) of the Indian Institute of Horticultural Research, Bangalore, India (personal communication), which have already been analyzed in Angers and Delampady (2001). The variable of interest $y$ that we have chosen from the data set is the weekly average humidity level. The observations were made from June 1, 1995, to December 13, 1998. (For some reason, the observations were not recorded on the same day of the week every time.) We have chosen time (the day of recording the observation) as the covariate $x$. (Any other available covariate can be used also, because wavelet-based smoothing with respect to any arbitrary covariate (measured in some general way) can be handled with our methodology.) For illustration purposes, we have chosen the model with $J = 6$; the hyperparameter $a$ is 0.5, and the prior $\pi_{22}$ corresponds with an $F$ distribution with degrees of freedom 24 and 4. We have used compactly supported Daubechies wavelets for this analysis. As explained earlier, these cannot be expressed in closed form, but computations with these wavelets are possible using any of several statistical and mathematical software packages. In Figure 10.1, we have plotted $\hat g_J$ (solid line) along with its error bands (dotted lines), $\pm 2\sqrt{\text{Var}(y \mid Y)}$, where

    \text{Var}(y \mid Y) = \text{Var}(g_J(x) + \eta + \varepsilon \mid Y).

Fig. 10.1. Wavelet smoother and its error bands for the Humidity data.
More details on this example as well as other studies can be found in Angers and Delampady (2001).
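The one-dimensional expectations in (10.31)-(10.32) are with respect to the posterior of $v = \sigma^2/\tau^2$, so simple quadrature suffices. A hedged sketch of such a computation: the eigenvalues $d_i$ and coordinates $t_i$ below are illustrative, and the unnormalized posterior weight $w(v)$ is obtained from (10.30) after integrating $\tau^2$ out against the $(\tau^2)^{-a}$ prior, taking a flat $\pi_{22}$ for simplicity (not the $F$ prior of Example 10.2):

```python
import math

# Sketch: compute E[g(v) | Y] by one-dimensional numerical integration,
# as suggested in the text.  Illustrative inputs below.

d = [5.2, 3.1, 1.4, 0.8]      # eigenvalues of X Gamma X' + Q_n (illustrative)
t = [2.0, -1.5, 0.7, 0.3]     # t = H'Y (illustrative)
n, a = len(d), 0.5

def w(v):
    """Unnormalized posterior density of v given Y, with tau^2 integrated
    out against the (tau^2)^(-a) prior and a flat prior on v."""
    if v <= 0:
        return 0.0
    quad = sum(ti * ti / (v + di) for ti, di in zip(t, d))
    return math.exp(-0.5 * sum(math.log(v + di) for di in d)
                    - 0.5 * (n + 2 * a - 2) * math.log(quad))

def post_mean(g, lo=1e-6, hi=50.0, m=20000):
    """Trapezoidal approximation to E[g(v) | Y] on a truncated range."""
    h = (hi - lo) / m
    num = den = 0.0
    for i in range(m + 1):
        v = lo + i * h
        wv = (0.5 if i in (0, m) else 1.0) * w(v)
        num += wv * g(v)
        den += wv
    return num / den

# Posterior expectation of the shrinkage factor 1/(v + d_1):
print(post_mean(lambda v: 1.0 / (v + d[0])))
```

In the actual smoother, such expectations are computed for each $1/(v + d_i)$ and plugged into (10.31).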
10.3 Estimation of Regression Function Using Dirichlet Multinomial Allocation

In Section 10.2, wavelets are used to represent the nonparametric regression function in (10.8), and a prior is put on the wavelet coefficients. Here we present an alternative approach, based on the observation that the unknown regression function is locally linear; hence one may use a high-dimensional
parametric family for modeling locally linear regression. Suppose we have a regression problem with a response variable $Y$ and a regressor variable $X$. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be independent paired observations on $(X, Y)$. Consider first the usual normal linear regression model where, given values of the regressor variables $x_i$, the $Y_i$'s are independently normally distributed with common variance $\sigma_Y^2$ and mean $E(Y_i|x_i) = \beta_1 + \beta_2 x_i$, a linear function of $x_i$. Let $Z_i = (X_i, Y_i)$ be independent, $Z_i$ having the density

    f(z|\phi_i) = f(x, y|\phi_i) = f_X(x|\mu_i, \sigma_i^2)\, f_Y(y|x, \beta_{1i}, \beta_{2i}, \sigma_Y),

where $f_X(x|\mu_i, \sigma_i^2)$ and $f_Y(y|x, \beta_{1i}, \beta_{2i}, \sigma_Y)$ denote respectively the $N(\mu_i, \sigma_i^2)$ density for $X_i$ and the $N(\beta_{1i} + \beta_{2i}x, \sigma_Y^2)$ density for $Y_i$ given $x$, with $\phi_i = (\mu_i, \sigma_i^2, \beta_{1i}, \beta_{2i})$, $i = 1, \ldots, n$. For simplicity we assume $\sigma_Y$ is known, say, equal to 1. For the remaining parameters $\phi_i$, $i = 1, \ldots, n$, we have the Dirichlet multinomial allocation (DMA) prior, defined in the next paragraph.

(1) Let $k \sim p(k)$, a distribution on $\{1, 2, \ldots, n\}$.
(2) Given $k$, the $\phi_i$, $i = 1, \ldots, n$, have at most $k$ distinct values $\theta_1, \ldots, \theta_k$, where the $\theta_j$'s are i.i.d. $\sim G_0$ and $G_0$ is a distribution on the space of $(\mu, \sigma^2, \beta_1, \beta_2)$ (our choice of $G_0$ is mentioned below).
(3) Given $k$, the vector of weights $(w_1, \ldots, w_k) \sim$ Dirichlet$(\delta_1, \ldots, \delta_k)$.
(4) The allocation variables $a_1, \ldots, a_n$ are independent with $P(a_i = j) = w_j$, $j = 1, \ldots, k$.
(5) Finally, $\phi_i = \theta_{a_i}$, $i = 1, \ldots, n$.

For simplicity, we illustrate with a known $k$ (which will be taken appropriately large). We refer to Richardson and Green (1997) for the treatment of the case with unknown $k$; see also the discussion of this paper by Gruet and Robert, and Green and Richardson (2001). Under this prior, $\phi_i = (\mu_i, \sigma_i^2, \beta_{1i}, \beta_{2i})$, $i = 1, \ldots, n$, are exchangeable. This allows borrowing of strength, as in Chapter 9, from clusters of $(x_i, y_i)$'s with similar values. To see how this works, one has to calculate the Bayes estimate through MCMC. We take $G_0$ to be the product of a normal distribution for $\mu$, an inverse Gamma distribution for $\sigma^2$, and normal distributions for $\beta_1$ and $\beta_2$. The full conditionals needed for sampling from the posterior using the Gibbs sampler can be easily obtained; see Robert and Casella (1999) in this context. For example, the conditional posterior distribution of $a_1, \ldots, a_n$ given the other parameters is as follows:

    a_i = j \text{ with probability } w_j f(Z_i|\theta_j) \Big/ \sum_{r=1}^{k} w_r f(Z_i|\theta_r),

$j = 1, \ldots, k$, $i = 1, \ldots, n$, and $a_1, \ldots, a_n$ are independent.
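The allocation step can be coded directly: each $a_i$ is drawn from the discrete distribution with probabilities proportional to $w_j f(Z_i|\theta_j)$. A minimal sketch for one Gibbs update under the model of this section with $\sigma_Y = 1$; the particular component values $\theta_j$, weights, and data points are illustrative:

```python
import math, random

# One Gibbs step for the allocation variables a_1,...,a_n in the DMA
# model: P(a_i = j | ...) proportional to w_j * f(Z_i | theta_j),
# where f(z | theta) = N(x; mu, sigma^2) * N(y; b1 + b2*x, 1).

def norm_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def f(z, theta):
    x, y = z
    mu, sig2, b1, b2 = theta
    return norm_pdf(x, mu, sig2) * norm_pdf(y, b1 + b2 * x, 1.0)

def sample_allocations(zs, thetas, weights, rng=random):
    allocs = []
    for z in zs:
        probs = [wj * f(z, th) for wj, th in zip(weights, thetas)]
        total = sum(probs)
        u, c, j = rng.random() * total, 0.0, 0
        for j, p in enumerate(probs):   # inverse-cdf draw from {probs}
            c += p
            if u <= c:
                break
        allocs.append(j)
    return allocs

# Illustrative components: two local lines, y = x near x = -1 and
# y = -x near x = +1.
thetas = [(-1.0, 0.5, 0.0, 1.0), (1.0, 0.5, 0.0, -1.0)]
weights = [0.5, 0.5]
zs = [(-1.0, -1.0), (1.0, -1.0)]
print(sample_allocations(zs, thetas, weights))
```

Each data point is assigned, with overwhelming probability, to the locally linear component that fits it, which is exactly the clustering that drives the borrowing of strength.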
Due to conjugacy, the other full conditional distributions can be easily obtained. You are invited to calculate the conditional posteriors in Problem 4. Note that given $k$, $\theta_1, \ldots, \theta_k$, and $w_1, \ldots, w_k$, we have a mixture with $k$ components. Each mixture component models a locally linear regression. Because the $\theta_i$ and $w_i$ are random, we have a rich family of locally linear regression models from which the posterior chooses different members, assigning to each member model a weight equal to its posterior probability density. The weight is a measure of how close this member model is to the data. The Bayes estimate of the regression function is a weighted average of the (conditional) expectations of the locally linear regressions. We illustrate the use of this method with a set of data simulated from the model $Y = \sin(2X) + \varepsilon$, so that $E(Y|x) = \sin(2x)$. We generate 100 pairs of observations $(X_i, Y_i)$ with normal errors $\varepsilon_i$. A scatter plot of the data points and a plot of the estimated regression at each $X_i$ (using the Bayes estimates of $\beta_{1i}, \beta_{2i}$) together with the graph of $\sin(2x)$
Fig. 10.2. Scatter plot, estimated regression, and true regression function.
are presented in Figure 10.2. In our calculation, we have chosen the hyperparameters of the priors suitably to have priors with small information. Seo (2004) discusses the choice of hyperpriors and hyperparameters in examples of this kind. Following Müller et al. (1996), Seo (2004) also uses a Dirichlet process prior instead of the DMA. The Dirichlet process prior is beyond the scope of our book; see Ghosh and Ramamoorthi (2003, Chapter 3) for details. It is worth noting that the method developed works equally well whether $X$ is non-stochastic (as in Section 10.2) or has a known distribution. The trick is to ignore these facts and pretend that $X$ is also random, as above. See Müller et al. (1996) for further discussion of this point.
10.4 Exercises

1. Verify that the Haar wavelets generate an MRA of $L^2(\mathbb{R})$.
2. Indicate how Bayes factors can be used to obtain the optimal resolution level $J$ in (10.21).
3. Derive an appropriate wavelet smoother for the data given in Table 5.1 and compare the results with those obtained using linear regression in Section 5.4.
4. For the problem in Section 10.3, explain how MCMC can be implemented, deriving explicitly all the full conditionals needed.
5. Choose any of the high-dimensional problems in Chapters 9 or 10 and suggest how hyperparameters may be chosen there. Discuss whether your findings will apply to all the higher levels of hierarchy.
A Common Statistical Densities
For quick reference, listed below are some common statistical densities that are used in examples and exercise problems in the book. Only brief descriptions are supplied, including the name of the density, the notation (abbreviation) used in the book, the density itself, the range of the variable argument, the parameter values, and some useful moments.
A.1 Continuous Models

1. Univariate normal ($N(\mu, \sigma^2)$):

    f(x|\mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left(-(x - \mu)^2/(2\sigma^2)\right),

$-\infty < x < \infty$, $-\infty < \mu < \infty$, $\sigma^2 > 0$. Mean = $\mu$, variance = $\sigma^2$.
Special case: $N(0, 1)$ is known as standard normal.

2. Multivariate normal ($N_p(\mu, \Sigma)$):

    f(x|\mu, \Sigma) = (2\pi)^{-p/2}|\Sigma|^{-1/2} \exp\left(-(x - \mu)'\Sigma^{-1}(x - \mu)/2\right),

$x \in \mathbb{R}^p$, $\mu \in \mathbb{R}^p$, $\Sigma_{p \times p}$ positive definite. Mean vector = $\mu$, covariance or dispersion matrix = $\Sigma$.

3. Exponential ($Exp(\lambda)$): $f(x|\lambda) = \lambda \exp(-\lambda x)$, $x > 0$, $\lambda > 0$. Mean = $1/\lambda$, variance = $1/\lambda^2$.

4. Double exponential or Laplace ($DExp(\mu, \sigma)$):

    f(x|\mu, \sigma) = \frac{1}{2\sigma} \exp\left(-\frac{|x - \mu|}{\sigma}\right),

$-\infty < x < \infty$, $-\infty < \mu < \infty$, $\sigma > 0$. Mean = $\mu$, variance = $2\sigma^2$.
,
304
A Common Statistical Densities
5. Gamma (Gamma(a, A)): f(x\a, A) = -=77-rx"~^ exp(-Ax), X > 0, a > 0, A > 0.
r{a)
Mean = a/A, variance = a/A^. Special cases: (i) Exp{X) is Gamma{l,X). (ii) Chi-square with n degrees of freedom (Xn) is 6. Uniform (t/(a, 6)): f{x\a,b)
=
/(a,6)(x),-oo
Gamma{n/2,1/2)-
0—a
Mean = (a + 6)/2, variance = {h — a)^/12. 7. Beta (5eta(a,/3)): /(^|a,/3) = 7 ^ 7 ^ ^ " " H l - x ) ^ - i / ( o , i ) W , a > 0,/3 > 0. Mean = a / ( a -f /?), variance = ay9/{(a -h /3)^(a + /? + 1)}. Special case: C/(0,1) is Beta{l, 1). 8. Cauchy {Cauchy{fi,a'^)): -oo < X < oc,
— o c < / i < o o , ( j ^ > 0 . Mean and variance do not exist. 9. t distribution (t(a,/i,(7^)):
•'^ ' ' ^ '
^
crv/S^r(a/2) V
cecr2
y
—oo < X < cxD, a > 0, —oo < /i < oo, cr^ > 0. Mean = /i if a > 1, variance = aa'^/{a — 2) if a > 2. Special cases: (i) Cauchydi^a'^) is t(l,/i,cr2). (ii) t(/c,0,1) = tk is known as Student's t with A: degrees of freedom. 10. Multivariate t {tp{a, /JL, U)):
«^l--)-(S|7y|il-r-(..^(x-.)'r-'(.-.,) X G 7^^, Q; > 0, /x G 7^^, Z'pxp positive definite. Mean vector = /x if a > 1, covariance or dispersion matrix = aZ:/(a - 2) if a > 2.
-(a+p)/2
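The moment formulas above are easy to spot-check numerically; e.g., the Beta mean and variance can be confirmed by midpoint-rule integration of the density (pure Python):

```python
import math

# Spot-check of the Beta(alpha, beta) mean and variance formulas by
# midpoint-rule integration of the density on (0, 1).

def beta_pdf(x, a, b):
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * x ** (a - 1) * (1 - x) ** (b - 1)

def beta_moment(a, b, power, n=100000):
    """Midpoint-rule approximation to E[X^power] for X ~ Beta(a, b)."""
    h = 1.0 / n
    return sum(((i + 0.5) * h) ** power * beta_pdf((i + 0.5) * h, a, b)
               for i in range(n)) * h

a, b = 2.0, 3.0
mean = beta_moment(a, b, 1)
var = beta_moment(a, b, 2) - mean ** 2
print(mean, a / (a + b))                          # both approximately 0.4
print(var, a * b / ((a + b) ** 2 * (a + b + 1)))  # both approximately 0.04
```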
11. F distribution with degrees of freedom $\alpha$ and $\beta$ ($F(\alpha, \beta)$):

    f(x|\alpha, \beta) = \frac{\Gamma((\alpha + \beta)/2)}{\Gamma(\alpha/2)\Gamma(\beta/2)} \left(\frac{\alpha}{\beta}\right)^{\alpha/2} x^{\alpha/2 - 1} \left(1 + \frac{\alpha}{\beta}x\right)^{-(\alpha+\beta)/2}, \quad x > 0, \alpha > 0, \beta > 0.

Mean = $\beta/(\beta - 2)$ if $\beta > 2$, variance = $2\beta^2(\alpha + \beta - 2)/\{\alpha(\beta - 4)(\beta - 2)^2\}$ if $\beta > 4$.
Special cases: (i) If $X \sim t(\alpha, \mu, \sigma^2)$, then $(X - \mu)^2/\sigma^2 \sim F(1, \alpha)$. (ii) If $X \sim t_p(\alpha, \mu, \Sigma)$, then $\frac{1}{p}(X - \mu)'\Sigma^{-1}(X - \mu) \sim F(p, \alpha)$.

12. Inverse Gamma ($inverse\ Gamma(\alpha, \lambda)$):

    f(x|\alpha, \lambda) = \frac{\lambda^\alpha}{\Gamma(\alpha)} x^{-(\alpha+1)} \exp(-\lambda/x), \quad x > 0, \alpha > 0, \lambda > 0.

Mean = $\lambda/(\alpha - 1)$ if $\alpha > 1$, variance = $\lambda^2/\{(\alpha - 1)^2(\alpha - 2)\}$ if $\alpha > 2$. If $X \sim inverse\ Gamma(\alpha, \lambda)$, then $1/X \sim Gamma(\alpha, \lambda)$.

13. Dirichlet (finite dimensional) ($D(\boldsymbol\alpha)$):

    f(x|\boldsymbol\alpha) = \frac{\Gamma\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \prod_{i=1}^{k} x_i^{\alpha_i - 1},

$x = (x_1, \ldots, x_k)'$ with $0 < x_i < 1$ for $1 \le i \le k$, $\sum_{i=1}^{k} x_i = 1$, and $\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_k)'$ with $\alpha_i > 0$ for $1 \le i \le k$. Writing $\alpha_0 = \sum_{i=1}^{k}\alpha_i$, mean vector = $\boldsymbol\alpha/\alpha_0$, covariance or dispersion matrix = $C_{k \times k}$ where

    C_{ij} = \begin{cases} \alpha_i(\alpha_0 - \alpha_i)/\{\alpha_0^2(\alpha_0 + 1)\}, & \text{if } i = j; \\ -\alpha_i\alpha_j/\{\alpha_0^2(\alpha_0 + 1)\}, & \text{if } i \ne j. \end{cases}

14. Wishart ($W_p(n, \Sigma)$):

    f(\Lambda|n, \Sigma) = \frac{1}{2^{np/2}\,\Gamma_p(n/2)}\, |\Sigma|^{-n/2} \exp\left(-\text{trace}\{\Sigma^{-1}\Lambda\}/2\right) |\Lambda|^{(n-p-1)/2},

$\Lambda_{p \times p}$ positive definite, $\Sigma_{p \times p}$ positive definite, $n > p$, $p$ a positive integer, where

    \Gamma_p(\alpha) = \int_{A \text{ positive definite}} \exp\left(-\text{trace}\{A\}\right) |A|^{\alpha - (p+1)/2}\, dA,

for $\alpha > (p - 1)/2$. Mean = $n\Sigma$. For other moments, see Muirhead (1982).
Special case: $\chi^2_n$ is $W_1(n, 1)$. If $W^{-1} \sim W_p(n, \Sigma)$, then $W$ is said to follow an inverse Wishart distribution.
15. Logistic ($Logistic(\mu, \sigma)$):

    f(x|\mu, \sigma) = \frac{1}{\sigma}\, \frac{\exp\left(-\frac{x - \mu}{\sigma}\right)}{\left(1 + \exp\left(-\frac{x - \mu}{\sigma}\right)\right)^2},

$-\infty < x < \infty$, $-\infty < \mu < \infty$, $\sigma > 0$. Mean = $\mu$, variance = $\pi^2\sigma^2/3$.
A.2 Discrete Models

1. Binomial ($B(n, p)$):

    f(x|n, p) = \binom{n}{x} p^x (1 - p)^{n-x}, \quad x = 0, 1, \ldots, n, \ 0 < p < 1, \ n \ge 1 \text{ integer}.

Mean = $np$, variance = $np(1 - p)$. Special case: $Bernoulli(p)$ is $B(1, p)$.

2. Poisson ($\mathcal{P}(\lambda)$):

    f(x|\lambda) = \frac{\exp(-\lambda)\lambda^x}{x!}, \quad x = 0, 1, \ldots, \ \lambda > 0.

Mean = $\lambda$, variance = $\lambda$.

3. Geometric ($Geometric(p)$):

    f(x|p) = (1 - p)^x p, \quad x = 0, 1, \ldots, \ 0 < p < 1.

Mean = $(1 - p)/p$, variance = $(1 - p)/p^2$.

4. Negative binomial ($Negative\ binomial(k, p)$):

    f(x|k, p) = \binom{x + k - 1}{x} (1 - p)^x p^k, \quad x = 0, 1, \ldots, \ 0 < p < 1, \ k \ge 1 \text{ integer}.

Mean = $k(1 - p)/p$, variance = $k(1 - p)/p^2$. Special case: $Geometric(p)$ is $Negative\ binomial(1, p)$.

5. Multinomial ($Multinomial(n, \mathbf{p})$):

    f(x|n, \mathbf{p}) = \frac{n!}{\prod_{i=1}^{k} x_i!} \prod_{i=1}^{k} p_i^{x_i},

$x = (x_1, \ldots, x_k)'$ with $x_i$ an integer between 0 and $n$ for $1 \le i \le k$, $\sum_{i=1}^{k} x_i = n$, and $\mathbf{p} = (p_1, \ldots, p_k)'$ with $0 < p_i < 1$ for $1 \le i \le k$, $\sum_{i=1}^{k} p_i = 1$. Mean vector = $n\mathbf{p}$, covariance or dispersion matrix = $C_{k \times k}$ where

    C_{ij} = \begin{cases} np_i(1 - p_i), & \text{if } i = j; \\ -np_ip_j, & \text{if } i \ne j. \end{cases}
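The discrete moment formulas can likewise be checked by direct summation of the probability mass function; e.g., for the negative binomial and its geometric special case:

```python
import math

# Direct-summation check of the Negative binomial(k, p) mean and
# variance formulas (truncating the infinite sum once terms vanish).

def nb_pmf(x, k, p):
    return math.comb(x + k - 1, x) * (1 - p) ** x * p ** k

def nb_moments(k, p, upper=2000):
    mean = sum(x * nb_pmf(x, k, p) for x in range(upper))
    second = sum(x * x * nb_pmf(x, k, p) for x in range(upper))
    return mean, second - mean ** 2

k, p = 3, 0.4
mean, var = nb_moments(k, p)
print(mean, k * (1 - p) / p)          # both 4.5
print(var, k * (1 - p) / p ** 2)      # both 11.25

# Special case: Geometric(p) is Negative binomial(1, p)
g_mean, g_var = nb_moments(1, p)
print(g_mean, (1 - p) / p, g_var, (1 - p) / p ** 2)
```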
B Birnbaum's Theorem on Likelihood Principle

The object of this appendix is to rewrite the usual proof of Birnbaum's theorem (e.g., as given in Basu (1988)) using only mathematical statements and carefully defining all symbols and the domain of discourse.

Let $\theta \in \Theta$ be the parameter of interest. A statistical experiment $\mathcal{E}$ is performed to generate a sample $x$. An experiment $\mathcal{E}$ is given by the triplet $(\mathcal{X}, \mathcal{A}, P)$, where $\mathcal{X}$ is the sample space, $\mathcal{A}$ is the class of all subsets of $\mathcal{X}$, and $P = \{p(\cdot|\theta), \theta \in \Theta\}$ is a family of probability functions on $(\mathcal{X}, \mathcal{A})$, indexed by the parameter space $\Theta$. Below we consider experiments with a fixed parameter space $\Theta$.

A (finite) mixture of experiments $\mathcal{E}_1, \ldots, \mathcal{E}_k$ with mixture probabilities $\pi_1, \ldots, \pi_k$ (non-negative numbers free of $\theta$, summing to unity), which may be written as $\sum_{i=1}^{k} \pi_i \mathcal{E}_i$, is defined as a two-stage experiment where one first selects $\mathcal{E}_i$ with probability $\pi_i$ and then observes $x_i \in \mathcal{X}_i$ by performing the experiment $\mathcal{E}_i$. Consider now a class of experiments closed under the formation of (finite) mixtures.

Let $\mathcal{E} = (\mathcal{X}, \mathcal{A}, p)$ and $\mathcal{E}' = (\mathcal{X}', \mathcal{A}', p')$ be two experiments and $x \in \mathcal{X}$, $x' \in \mathcal{X}'$. By equivalence of the two points $(\mathcal{E}, x)$ and $(\mathcal{E}', x')$, we mean one makes the same inference on $\theta$ whether one performs $\mathcal{E}$ and observes $x$, or performs $\mathcal{E}'$ and observes $x'$, and we denote this as $(\mathcal{E}, x) \sim (\mathcal{E}', x')$. We now consider the following principles.

The likelihood principle (LP): We say that the equivalence relation $\sim$ obeys the likelihood principle if $(\mathcal{E}, x) \sim (\mathcal{E}', x')$ whenever

    p(x|\theta) = c\, p'(x'|\theta) \quad \text{for all } \theta \in \Theta    (B.1)

for some constant $c > 0$.
The weak conditionality principle (WCP): An equivalence relation $\sim$ satisfies WCP if for a mixture of experiments $\mathcal{E} = \sum_{i=1}^{k} \pi_i \mathcal{E}_i$, $(\mathcal{E}, (i, x_i)) \sim (\mathcal{E}_i, x_i)$ for any $i \in \{1, \ldots, k\}$ and $x_i \in \mathcal{X}_i$.

The sufficiency principle (SP): An equivalence relation $\sim$ satisfies SP if $(\mathcal{E}, x) \sim (\mathcal{E}, x')$ whenever $S(x) = S(x')$ for some sufficient statistic $S$ for $\theta$ (or equivalently, $S(x) = S(x')$ for a minimal sufficient statistic $S$).

It is shown in Basu and Ghosh (1967) (see also Basu (1969)) that for discrete models a minimal sufficient statistic exists and is given by the likelihood partition, i.e., the partition induced by the equivalence relation (B.1) for two points $x$, $x'$ from the same experiment. The difference between the likelihood principle and the sufficiency principle is that in the former $x$, $x'$ may belong to possibly different experiments, while in the sufficiency principle they belong to the same experiment.

The weak sufficiency principle (WSP): An equivalence relation $\sim$ satisfies WSP if $(\mathcal{E}, x) \sim (\mathcal{E}, x')$ whenever $p(x|\theta) = p(x'|\theta)$ for all $\theta$.

It follows that SP implies WSP, which can be seen by noting that

    S(x) = \left( p(x|\theta) \Big/ \sum_{\theta' \in \Theta} p(x|\theta') \right)_{\theta \in \Theta}

is a (minimal) sufficient statistic. We assume without loss of generality that $\sum_{\theta \in \Theta} p(x|\theta) > 0$ for all $x \in \mathcal{X}$.

We now state and prove Birnbaum's theorem on the likelihood principle (Birnbaum (1962)).

Theorem B.1. WCP and WSP together imply LP, i.e., if an equivalence relation satisfies WCP and WSP, then it also satisfies LP.

Proof. Suppose an equivalence relation $\sim$ satisfies WCP and WSP. Consider two experiments $\mathcal{E}_1 = (\mathcal{X}_1, \mathcal{A}_1, p_1)$ and $\mathcal{E}_2 = (\mathcal{X}_2, \mathcal{A}_2, p_2)$ with the same $\Theta$ and samples $x_i \in \mathcal{X}_i$, $i = 1, 2$, such that

    p_1(x_1|\theta) = c\, p_2(x_2|\theta) \quad \text{for all } \theta \in \Theta    (B.2)

for some $c > 0$. We are to show that $(\mathcal{E}_1, x_1) \sim (\mathcal{E}_2, x_2)$. Consider the mixture experiment $\mathcal{E}$ of $\mathcal{E}_1$ and $\mathcal{E}_2$ with mixture probabilities $1/(1+c)$ and $c/(1+c)$, respectively, i.e.,

    \mathcal{E} = \frac{1}{1+c}\,\mathcal{E}_1 + \frac{c}{1+c}\,\mathcal{E}_2.

The points $(1, x_1)$ and $(2, x_2)$ in the sample space of $\mathcal{E}$ have probabilities $p_1(x_1|\theta)/(1+c)$ and $p_2(x_2|\theta)c/(1+c)$, respectively, which are the same by (B.2). WSP then implies that

    (\mathcal{E}, (1, x_1)) \sim (\mathcal{E}, (2, x_2)).    (B.3)

Also, by WCP,

    (\mathcal{E}, (1, x_1)) \sim (\mathcal{E}_1, x_1) \quad \text{and} \quad (\mathcal{E}, (2, x_2)) \sim (\mathcal{E}_2, x_2).    (B.4)

From (B.3) and (B.4), we have $(\mathcal{E}_1, x_1) \sim (\mathcal{E}_2, x_2)$. □
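The key step of the proof — that the two points $(1, x_1)$ and $(2, x_2)$ of the mixture carry identical probability functions — can be made concrete with a toy pair of discrete experiments. The probability tables below are hypothetical, chosen so that $p_1(x_1|\theta) = 2\,p_2(x_2|\theta)$ (i.e., $c = 2$):

```python
# Toy illustration of the mixture step in Birnbaum's proof.
# Two discrete experiments over Theta = {0, 1} with proportional
# likelihoods at the observed points x1 and x2.

p1 = {"x1": {0: 0.4, 1: 0.8}, "y1": {0: 0.6, 1: 0.2}}   # experiment E1
p2 = {"x2": {0: 0.2, 1: 0.4}, "y2": {0: 0.8, 1: 0.6}}   # experiment E2
c = 2.0   # p1(x1|theta) = c * p2(x2|theta) for theta = 0, 1

# Mixture experiment E: choose E1 w.p. 1/(1+c), E2 w.p. c/(1+c).
def mix_prob(component, sample, theta):
    if component == 1:
        return p1[sample][theta] / (1 + c)
    return p2[sample][theta] * c / (1 + c)

# The points (1, x1) and (2, x2) of E have the same probability for
# every theta, so WSP makes them equivalent; WCP then carries the
# equivalence back to (E1, x1) and (E2, x2).
for theta in (0, 1):
    print(theta, mix_prob(1, "x1", theta), mix_prob(2, "x2", theta))
```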
c Coherence
Coherence was originally introduced by de Finetti to show any quantification of uncertainty that does not satisfy the axioms of a (finitely additive) probability distribution would lead to sure loss in a suitably chosen gample. This is formally stated in Theorem C.l below. This section is based on Schervish (1995, pp. 654, 655) except that we use finite additivity instead of countable additivity. Definition 1. For a bounded random variable X, the fair price or prevision P{X) is a number p such that a gambler is willing to accept all gambles of the form c{X — p) for all c in some sufficiently small symmetric interval around 0. Here c{X — p) represents the gain to the gambler. That the values of c are sufficiently small ensures all losses are within the means of the gambler to pay, at least for bounded X. Definition 2. Let {X(y^ a G A} be a collection of bounded random variables. Suppose that for each X^, P{Xa) is the prevision of a gambler who is willing to accept all gambles of the form c{Xa — P{Xa)) for —da ^ c < d^. These previsions are defined to be coherent if there do not exist a finite set AQ C A and {CQ, : —d < c^ < d^a ^ ^o}? d < m.ii[i{dcc^a G A Q } , such that YlaeA ^cxi^o^ ~ P{Xa.)) < 0 for all values of the random variables. It is assumed that a gambler willing to accept each of a finite number of gambles Ca{Xa — P{Xc))^a G AQ is also willing to take the gamble X^CKGA ^{^a — P{Xa)), c Sufficiently small, for finite sets AQ. If each Xc takes only a finite number of distinct values (as in Theorem C.l below), then XlaGA ^ai^ct — P{Xa)) < 0 for all values of X^^s if and only if XlaGA ^ai^o: — P{Xa)) < ~^ for all valucs of XQ,'S, for some e > 0. The second condition is what de Finetti requires. If the previsions of a gambler are not coherent (incoherent), then he can be forced to lose money always in a suitably chosen gamble. Theorem C.l. Let (S^A) be a "measurable^^ space. 
Suppose that for each A E A, the prevision is P{IA), where IA denotes the indicator of A. Then the
312
C Coherence
previsions are coherent if and only if the set function ^, defined as /x(A) = P(IA), is a finitely additive probability on A. Proof Suppose /i is a finitely additive probability on (5, .4). Let { A i , . . . , Am} be any finite collection of elements in A and suppose the gambler is ready to accept gambles of the form Ci{lAi — P{lAi))- Then m
Z =
Y.''i{lA,-P{lA,)) i=l
has /i-expectation equal to 0, and therefore it is not possible that Z is always less than 0. This implies incoherence cannot happen. Conversely, assume coherence. We show that /x is a finitely additive probability by showing any violation of the probability axiom leads to a non-zero non-random gamble that can be made negative. (i) /i(0) = 0 : Because I{(f)) = 0, —c/ji{(j)) = c{I{(j)) — ii{(j))) > 0 for some positive and negative values of c, implying /x(0) = 0 . Similarly fi{S) = 1. (ii) /i(A) > 0 VA G X : If fi{A) < 0, then for any c < 0, C{IA - ^A)) < —cfx{A) < 0. This means there is incoherence. (iii) /i is finitely additive : Let Ai,- - - ,Am be disjoint sets in A and UlZ^Ai = A, Let m
m
Z = J2
i=l
If fi{A) < Yl^if^i^i)^ then Z is always negative for any c > 0, whereas /2{A) > Y^iLi l^i^i) implies Z is always negative for any c < 0. Thus /J^{A) ^ S l ^ i A^(^^) leads to incoherence. D
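The 'sure loss' content of Theorem C.1 is easy to exhibit concretely: if the posted previsions violate additivity, an opponent can choose stakes so that the combined gamble is negative in every state of the world. A small sketch with illustrative (incoherent) previsions $P(I_A) = 0.7$ and $P(I_{A^c}) = 0.5$:

```python
# Dutch book against incoherent previsions: the previsions below
# violate additivity (P(I_A) + P(I_Ac) = 1.2 > 1 = mu(S)), so the
# gamble c * [(I_A - 0.7) + (I_Ac - 0.5)] is a sure loss for c > 0.

prev_A, prev_Ac = 0.7, 0.5      # posted (incoherent) previsions
c = 1.0                          # stake chosen by the opponent

def gain(indicator_A):
    """Gambler's gain when A occurs (indicator_A = 1) or not (0)."""
    I_A, I_Ac = indicator_A, 1 - indicator_A
    return c * ((I_A - prev_A) + (I_Ac - prev_Ac))

print(gain(1), gain(0))   # about -0.2 in both cases: a sure loss
```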
D Microarray
Proteins are essential for sustaining life of a living organism. Every cell in an individual has the same information for production of a large number of proteins. This information is encoded in the DNA. The information is transcribed and translated by the cell machinery to produce proteins. Different proteins are produced by different segments of the DNA that are called genes. Although every cell has the same information for production of the same set of proteins, all cells do not produce all proteins. Within an individual, cells are organized into groups that are specialized to perform specific tasks. Such groups of specialized cells are called tissues. A number of tissues makes up an organ, such as pancreas. Two tissues may produce completely disjoint sets of proteins; or, may produce the same protein in different quantities. The molecule that transfers information from the genes for the production of proteins is called the messenger RNA (mRNA). Genes that produce a lot of mRNA are said to be upregulated and genes that produce little or no mRNA are said to be downregulated. For example, in certain cells of the pancreas, the gene that produces insulin will be upregulated (that is, large amounts of insulin mRNA will be produced), whereas it will be downregulated in the liver (because insulin is produced only by certain cells of the pancreas and by no other organ in the human body). In certain disease states, such as diabetes, there will be alteration in the amount of insulin mRNA. A microarray is a tool for measuring the amount of mRNA that is circulating in a cell. Microarrays simultaneously measure the amount of circulating mRNA corresponding with thousands of different genes. Among various applications, such data are helpful in understanding the nature and extent of involvement of different genes in various diseases, such as cancer, diabetes, etc. 
A generic microarray consists of multiple spots of DNA and is used to determine the quantities of mRNA in a collection of cells. The DNA in each spot is from a gene of interest and serves as a probe for the mRNA encoded by that gene. In general, one can think of a microarray as a grid (or a matrix) of several thousand DNA spots in a very small area (glass or polymer surface).
Each spot has a unique DNA sequence, different from the DNA sequences of the other spots around it. mRNA from a clump of cells (that is, all the mRNAs produced by the different genes that are expressed in these cells) is extracted and experimentally converted (using a biochemical molecule called reverse transcriptase) to their complementary DNA strands (cDNA). A molecular tag that glows is attached to each piece of cDNA. This mixture is then "poured" over a microarray. Each DNA spot in the microarray will hybridize (that is, attach itself) only to its complementary DNA strand. The amount of fluorescence (usually measured using a laser beam) at a particular spot on the microarray gives an indication as to how much mRNA of a particular type was present in the original sample. There are many sources of variability in the observations from a microarray experiment. Aside from the intrinsic biological variability across individuals or across tissues within the same individual, among the more important sources of variability are (a) the method of mRNA extraction from the cells; (b) the nature of the fluorescent tags used; (c) the temperature and time under which the experiment (that is, hybridization) was performed; (d) the sensitivity of the laser detector in relation to the chemistry of the fluorescent tags used; and (e) the sensitivity and robustness of the image analysis system that is used to identify and quantify the fluorescence at each spot in the microarray. All of these experimental factors come in the way of comparing results across microarray experiments. Even within one experiment, the brightness of two spots can vary even when the same number of complementary DNA strands have hybridized to the spots, necessitating normalization of each image using statistical methods. The amount of mRNA is quantified by a fluorescent signal. Some spots on a microarray, after the chemical reaction, show high levels of fluorescence and some show low or no fluorescence.
Genes whose spots show a high level of fluorescence are likely to be expressed, whereas genes whose spots show a low level of fluorescence are likely to be under-expressed or not expressed. Even genes that are not expressed may show low levels of fluorescence, which is treated as noise. The experimenter's software package identifies the background noise and calculates its mean, which is subtracted from all the measurements of fluorescence. This gives the final observation X_i that we model, with mu = 0 indicating no expression, mu > 0 indicating expression, and mu < 0 indicating negative expression, i.e., under-expression. If the genes turn out in further studies to regulate growth of tumors, the expressed genes might help growth while the under-expressed genes could inhibit it.
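The background-subtraction step that produces X_i can be sketched numerically; all readings below are hypothetical:

```python
import numpy as np

# Hypothetical raw fluorescence readings: four gene spots, plus
# blank control spots used to estimate the background noise level.
gene_fluorescence = np.array([520.0, 35.0, 410.0, 12.0])
background_spots = np.array([30.0, 25.0, 35.0])

# The observation X_i modelled in the text: raw fluorescence
# minus the mean background fluorescence.
x = gene_fluorescence - background_spots.mean()
# Large positive X_i suggest expression (mu > 0), values near zero
# are consistent with noise (mu = 0), and negative values point
# toward under-expression (mu < 0).
```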
E Bayes Sufficiency
If T is a sufficient statistic then, at least in the discrete and continuous cases with p.d.f.s, an application of the factorization theorem implies that the posterior distribution of theta given X is the same as the posterior given T. Thus, in a Bayesian sense also, all the information about theta contained in X is carried by T. In many cases, e.g., for the multivariate normal, the calculation of the posterior can be simplified by an application of this fact. More importantly, these considerations suggest an alternative definition of sufficiency appropriate in Bayesian analysis.

Definition. A statistic T is sufficient in a Bayesian sense if, for all priors pi(theta), the posterior pi(theta|X) = pi(theta|T(X)).

Classical sufficiency always implies sufficiency in the Bayesian sense. It can be shown that if the family of probability measures in the model is dominated, i.e., the probability measures possess densities with respect to a sigma-finite measure, then the factorization theorem holds, vide Lehmann (1986). In this case, it can be shown that the converse is also true, i.e., T is sufficient in the classical sense if it is sufficient in the Bayesian sense. A famous counterexample due to Blackwell and Ramamoorthi (1982) shows this is not true in the undominated case, even under nice set-theoretic conditions.
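The beta-binomial setup of Section 2.2 illustrates Bayes sufficiency directly: for Bernoulli data, the posterior of p computed from the full likelihood depends on the sample only through T = sum of the X_i. A small numerical check (the Beta(2,3) prior and the grid are illustrative choices):

```python
import numpy as np

def posterior_on_grid(data, prior, grid):
    """Posterior density of p given Bernoulli data, computed from the
    full likelihood prod_i p^{x_i} (1-p)^{1-x_i}, normalized on a grid."""
    lik = np.ones_like(grid)
    for xi in data:
        lik *= grid**xi * (1 - grid)**(1 - xi)
    post = lik * prior(grid)
    return post / post.sum()

grid = np.linspace(0.001, 0.999, 999)
prior = lambda p: p**(2 - 1) * (1 - p)**(3 - 1)   # Beta(2, 3) kernel

# Two samples sharing the sufficient statistic T = sum(x) = 3, n = 6:
post1 = posterior_on_grid([1, 1, 1, 0, 0, 0], prior, grid)
post2 = posterior_on_grid([0, 1, 0, 1, 0, 1], prior, grid)
assert np.allclose(post1, post2)   # posterior depends on data only via T
```

Both posteriors are the Beta(2 + 3, 3 + 3) density of (2.5), even though the two samples differ as sequences.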
References
Abramowitz, M. and Stegun, I. (1970). Handbook of Mathematical Functions. Applied Mathematics Series 55, National Bureau of Standards.
Agresti, A. and Caffo, B. (2000). Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. Amer. Statist. 54, 280-288.
Akaike, H. (1983). Information measure and model selection. Bull. Int. Statist. Inst. 50, 277-290.
Albert, J.H. (1990). A Bayesian test for a two-way contingency table using independence priors. Canad. J. Statist. 14, 1583-1590.
Albert, J.H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. J. Amer. Statist. Assoc. 88, 669-679.
Albert, J.H., Delampady, M. and Polasek, W. (1991). A class of distributions for robustness studies. J. Statist. Plann. Inference 28, 291-304.
Andersson, S. (1982). Distributions of maximal invariants using quotient measures. Ann. Statist. 10, 955-961.
Andersson, S., Brons, H. and Jensen, S. (1983). Distribution of eigenvalues in multivariate statistical analysis. Ann. Statist. 11, 392-415.
Angers, J-F. (2000). P-credence and outliers. Metron 58, 81-108.
Angers, J-F. and Berger, J.O. (1991). Robust hierarchical Bayes estimation of exchangeable means. Canad. J. Statist. 19, 39-56.
Angers, J-F. and Delampady, M. (1992). Hierarchical Bayesian estimation and curve fitting. Canad. J. Statist. 20, 35-49.
Angers, J-F. and Delampady, M. (1997). Hierarchical Bayesian curve fitting and model choice for spatial data. Sankhyā (Ser. B) 59, 28-43.
Angers, J-F. and Delampady, M. (2001). Bayesian nonparametric regression using wavelets. Sankhyā (Ser. A) 63, 287-308.
Arnold, S.F. (1993). Gibbs sampling. In: Rao, C.R. (ed) Handbook of Statistics 9, 599-625. Elsevier Science.
Athreya, K.B., Doss, H. and Sethuraman, J. (1996). On the convergence of the Markov chain simulation method. Ann. Statist. 24, 69-100.
Athreya, K.B., Delampady, M. and Krishnan, T. (2003). Markov chain Monte Carlo methods. Resonance 8, Part I, No. 4, 17-26; Part II, No. 7, 63-75; Part III, No. 10, 8-19; Part IV, No. 12, 18-32.
Bahadur, R.R. (1971). Some Limit Theorems in Statistics. CBMS Regional Conference Series in Applied Mathematics, 4. SIAM, Philadelphia, PA.
Banerjee, S., Carlin, B.P. and Gelfand, A.E. (2004). Hierarchical Modeling and Analysis for Spatial Data. Chapman & Hall, London.
Barbieri, M.M. and Berger, J.O. (2004). Optimal predictive model selection. Ann. Statist. 32, 870-897.
Basu, D. (1969). Role of the sufficiency and likelihood principles in sample survey theory. Sankhyā (Ser. A) 31, 441-454.
Basu, D. (1988). Statistical Information and Likelihood: A Collection of Critical Essays by Dr. D. Basu (Ghosh, J.K. ed), Lecture Notes in Statistics. Springer-Verlag, New York.
Basu, D. and Ghosh, J.K. (1967). Sufficient statistics in sampling from a finite universe. Bull. Int. Statist. Inst. 42, 850-858.
Basu, S. (1994). Variations of posterior expectations for symmetric unimodal priors in a distribution band. Sankhyā (Ser. A) 56, 320-334.
Basu, S. (2000). Bayesian robustness and Bayesian nonparametrics. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 223-240. Springer-Verlag, New York.
Basu, S. and Chib, S. (2003). Marginal likelihood and Bayes factors from Dirichlet process mixture models. J. Amer. Statist. Assoc. 98, 224-235.
Basu, R., Ghosh, J.K. and Mukerjee, R. (2003). Empirical Bayes prediction intervals in a normal regression model: higher order asymptotics. Statist. Probab. Lett. 63, 197-203.
Bayarri, M.J. and Berger, J.O. (1998a). Quantifying surprise in the data and model verification. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 6, 53-82. Oxford Univ. Press, Oxford.
Bayarri, M.J. and Berger, J.O. (1998b). Robust Bayesian analysis of selection models. Ann. Statist. 26, 645-659.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. (Ser. B) 57, 289-300.
Benjamini, Y. and Liu, W. (1999). A step-down multiple hypotheses testing procedure that controls the false discovery rate under independence. Multiple comparisons (Tel Aviv, 1996). J. Statist. Plann. Inference 82, 163-170.
Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29, 1165-1188.
Berger, J. (1982). Bayesian robustness and the Stein effect. J. Amer. Statist. Assoc. 77, 358-368.
Berger, J.O. (1984). The robust Bayesian viewpoint (with discussion). In: Robustness of Bayesian Analyses. Studies in Bayesian Econometrics, 4, 63-144. North-Holland Publishing Co., Amsterdam.
Berger, J.O. (1985a). Statistical Decision Theory and Bayesian Analysis, 2nd Ed. Springer-Verlag, New York.
Berger, J.O. (1985b). The frequentist viewpoint and conditioning. In: Le Cam, L. and Olshen, R.A. (eds) Proc. Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, I, 15-44. Wadsworth, Inc., Monterey, California.
Berger, J. (1986). Are P-values reasonable measures of accuracy? In: Francis, I.S. et al. (eds) Pacific Statistical Congress. North-Holland, Amsterdam.
Berger, J.O. (1990). Robust Bayesian analysis: sensitivity to the prior. J. Statist. Plann. Inference 25, 303-328.
Berger, J.O. (1994). An overview of robust Bayesian analysis (with discussion). Test 3, 5-124.
Berger, J.O. (1997). Bayes factors. In: Kotz, S. et al. (eds) Encyclopedia of Statistical Sciences (Update) 3, 20-29. Wiley, New York.
Berger, J.O. (2005). Generalization of BIC. Unpublished manuscript.
Berger, J. and Berliner, M.L. (1986). Robust Bayes and empirical Bayes analysis with ε-contaminated priors. Ann. Statist. 14, 461-486.
Berger, J.O. and Bernardo, J.M. (1989). Estimating a product of means: Bayesian analysis with reference priors. J. Amer. Statist. Assoc. 84, 200-207.
Berger, J.O. and Bernardo, J.M. (1992a). On the development of the reference priors. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 4, 35-60. Oxford Univ. Press, Oxford.
Berger, J. and Bernardo, J.M. (1992b). Ordered group reference priors with application to the multinomial problem. Biometrika 79, 25-37.
Berger, J., Bernardo, J.M. and Mendoza, M. (1989). On priors that maximize expected information. In: Klein, J. and Lee, J.G. (eds) Recent Developments in Statistics and Their Applications, 1-20. Freedom Academy Publishing, Seoul.
Berger, J., Bernardo, J.M. and Sun, D. (2006). A monograph on Reference Analysis (under preparation).
Berger, J.O., Betro, B., Moreno, E., Pericchi, L.R., Ruggeri, F., Salinetti, G. and Wasserman, L. (eds) Bayesian Robustness. IMS, Hayward.
Berger, J.O. and Delampady, M. (1987). Testing precise hypotheses (with discussion). Statist. Sci. 2, 317-352.
Berger, J. and Dey, D.K. (1983). Combining coordinates in simultaneous estimation of normal means. J. Statist. Plann. Inference 8, 143-160.
Berger, J.O., Ghosh, J.K. and Mukhopadhyay, N. (2003). Approximations and consistency of Bayes factors as model dimension grows. J. Statist. Plann. Inference 112, 241-258.
Berger, J.O., Liseo, B. and Wolpert, R.L. (1999). Integrated likelihood methods for eliminating nuisance parameters. Statist. Sci. 14, 1-28.
Berger, J.O. and Moreno, E. (1994). Bayesian robustness in bidimensional models: prior independence (with discussion). J. Statist. Plann. Inference 40, 161-176.
Berger, J.O. and Pericchi, L.R. (1996a). The intrinsic Bayes factor for model selection and prediction. J. Amer. Statist. Assoc. 91, 109-122.
Berger, J.O. and Pericchi, L.R. (1996b). The intrinsic Bayes factor for linear models (with discussion). In: Bernardo, J.M. et al. (eds) Bayesian Statistics 5, 25-44. Oxford Univ. Press, London.
Berger, J.O., Pericchi, L.R. and Varshavsky, J.A. (1998). Bayes factors and marginal distributions in invariant situations. Sankhyā (Ser. A) 60, 307-321.
Berger, J.O. and Robert, C.P. (1990). Subjective hierarchical Bayes estimation of a multivariate normal mean: on the frequentist interface. Ann. Statist. 18, 617-651.
Berger, J.O., Rios Insua, D. and Ruggeri, F. (2000). Bayesian robustness. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 1-32. Springer-Verlag, New York.
Berger, J.O. and Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of p-values and evidence. J. Amer. Statist. Assoc. 82, 112-122.
Berger, J.O. and Wolpert, R. (1988). The Likelihood Principle, 2nd Ed. IMS Lecture Notes - Monograph Ser. 9. Hayward, California.
Bernardo, J.M. (1979). Reference posterior distribution for Bayesian inference (with discussion). J. Roy. Statist. Soc. (Ser. B) 41, 113-147.
Bernardo, J.M. (1980). A Bayesian analysis of classical hypothesis testing. In: Bernardo, J.M. et al. (eds) Bayesian Statistics, 605-618. University Press, Valencia.
Bernardo, J.M. and Smith, A.F.M. (1994). Bayesian Theory. Wiley, Chichester, England.
Bernstein, S. (1917). Theory of Probability (in Russian).
Berti, P., Regazzini, E. and Rigo, P. (1991). Coherent statistical inference and Bayes theorem. Ann. Statist. 19, 366-381.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Roy. Statist. Soc. (Ser. B) 36, 192-326.
Besag, J. (1986). On the statistical analysis of dirty pictures. J. Roy. Statist. Soc. (Ser. B) 48, 259-279.
Betro, B. and Guglielmi, A. (2000). Methods for global prior robustness under generalized moment conditions. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 273-293. Springer-Verlag, New York.
Bhattacharya, S. (2005). Model assessment using inverse reference distribution approach. Tech. Report.
Bickel, P.J. (1981). Minimax estimation of the mean of a normal distribution when the parameter space is restricted. Ann. Statist. 9, 1301-1309.
Bickel, P.J. and Doksum, K.A. (2001). Mathematical Statistics: Basic Ideas and Selected Topics. Prentice Hall, Upper Saddle River, N.J.
Bickel, P.J. and Ghosh, J.K. (1990). A decomposition for the likelihood ratio statistic and the Bartlett correction - a Bayesian argument. Ann. Statist. 18, 1070-1090.
Bickel, P.J. and Yahav, J. (1969). Some contributions to the asymptotic theory of Bayes solutions. Z. Wahrsch. Verw. Gebiete 11, 257-275.
Birnbaum, A. (1962). On the foundations of statistical inference (with discussion). J. Amer. Statist. Assoc. 57, 269-326.
Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice. The MIT Press, Cambridge.
Blackwell, D. and Ramamoorthi, R.V. (1982). A Bayes but not classically sufficient statistic. Ann. Statist. 10, 1025-1026.
Bondar, J.V. and Milnes, P. (1981). Amenability: A survey for statistical applications of Hunt-Stein and related conditions on groups. Z. Wahrsch. Verw. Gebiete 57, 103-128.
Borwanker, J.D., Kallianpur, G. and Prakasa Rao, B.L.S. (1971). The Bernstein-von Mises theorem for stochastic processes. Ann. Math. Statist. 42, 1241-1253.
Bose, S. (1994a). Bayesian robustness with more than one class of contaminations (with discussion). J. Statist. Plann. Inference 40, 177-187.
Bose, S. (1994b). Bayesian robustness with mixture classes of priors. Ann. Statist. 22, 652-667.
Box, G.E.P. (1980). Sampling and Bayes inference in scientific modeling and robustness. J. Roy. Statist. Soc. (Ser. A) 143, 383-430.
Box, G.E.P. and Tiao, G. (1962). A further look at robustness via Bayes theorem. Biometrika 62, 169-173.
Box, G.E.P. and Tiao, G. (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading.
Brooks, S.P., Giudici, P. and Roberts, G.O. (2003). Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions. J. Roy. Statist. Soc. (Ser. B) 65, 3-55.
Brown, L.D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary-value problems. Ann. Math. Statist. 42, 855-903.
Brown, L.D. (1986). Foundations of Exponential Families. IMS, Hayward.
Brown, L.D., Cai, T.T. and DasGupta, A. (2001). Interval estimation for a binomial proportion. Statist. Sci. 16, 101-133.
Brown, L.D., Cai, T.T. and DasGupta, A. (2002). Confidence intervals for a binomial proportion and asymptotic expansions. Ann. Statist. 30, 160-201.
Brown, L.D., Cai, T.T. and DasGupta, A. (2003). Interval estimation in exponential families. Statist. Sinica 13, 19-49.
Burnham, K.P. and Anderson, D.R. (2002). Model Selection and Multimodel Inference: A Practical Information Theoretic Approach. Springer, New York.
Cai, T.T., Low, M. and Zhao, L. (2000). Sharp adaptive estimation by a blockwise method. Tech. Report, Dept. of Statistics, Univ. of Pennsylvania.
Carlin, B.P. and Chib, S. (1995). Bayesian model choice via Markov chain Monte Carlo methods. J. Roy. Statist. Soc. (Ser. B) 57, 473-484.
Carlin, B.P. and Perez, M.E. (2000). Robust Bayesian analysis in medical and epidemiological settings. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 351-372. Springer-Verlag, New York.
Carlin, B.P. and Louis, T.A. (1996). Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall, London.
Casella, G. and Berger, R. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem (with discussion). J. Amer. Statist. Assoc. 82, 106-111.
Casella, G. and Berger, R. (1990). Statistical Inference. Wadsworth, Belmont, California.
Cencov, N.N. (1982). Statistical Decision Rules and Optimal Inference. AMS, Providence, R.I. Translation from Russian edited by Lev L. Leifman.
Chakrabarti, A. (2004). Model selection for high dimensional problems with application to function estimation. Ph.D. thesis, Purdue Univ.
Chakrabarti, A. and Ghosh, J.K. (2005a). A generalization of BIC for the general exponential family. (In press)
Chakrabarti, A. and Ghosh, J.K. (2005b). Optimality of AIC in inference about Brownian Motion. (In press)
Chao, M.T. (1970). The asymptotic behaviour of Bayes estimators. Ann. Math. Statist. 41, 601-609.
Chatterjee, S.K. and Chattopadhyay, G. (1994). On the nonexistence of certain optimal confidence sets for the rectangular problem. Statist. Probab. Lett. 21, 263-269.
Chen, C.F. (1985). On asymptotic normality of limiting density functions with Bayesian implications. J. Roy. Statist. Soc. (Ser. B) 47, 540-546.
Chib, S. (1995). Marginal likelihood from the Gibbs output. J. Amer. Statist. Assoc. 90, 1313-1321.
Chib, S. and Greenberg, E. (1998). Analysis of multivariate probit models. Biometrika 85, 347-361.
Clarke, B. and Barron, A. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory 36, 453-471.
Clayton, D.G. and Kaldor, J.M. (1987). Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics 43, 671-681.
Clyde, M. (1999). Bayesian model averaging and model search strategies. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 6, 157-185. Oxford Univ. Press, Oxford.
Clyde, M. and Parmigiani, G. (1996). Orthogonalizations and prior distributions for orthogonalized model mixing. In: Lee, J.C. et al. (eds) Modelling and Prediction, 206-227. Springer, New York.
Congdon, P. (2001). Bayesian Statistical Modelling. Wiley, Chichester, England.
Congdon, P. (2003). Applied Bayesian Modelling. Wiley, Chichester, England.
Cox, D.R. (1958). Some problems connected with statistical inference. Ann. Math. Statist. 29, 357-372.
Cox, D.R. and Reid, N. (1987). Orthogonal parameters and approximate conditional inference (with discussion). J. Roy. Statist. Soc. (Ser. B) 49, 1-18.
Csiszar, I. (1978). Information measures: a critical survey. In: Trans. 7th Prague Conf. on Information Theory, Statistical Decision Functions and the Eighth European Meeting of Statisticians (Tech. Univ. Prague, Prague, 1974) B, 73-86. Academia, Prague.
Cuevas, A. and Sanz, P. (1988). On differentiability properties of Bayes operators. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 3, 569-577. Oxford Univ. Press, Oxford.
Dalal, S.R. and Hall, G.J. (1980). On approximating parametric Bayes models by nonparametric Bayes models. Ann. Statist. 8, 664-672.
DasGupta, A. and Delampady, M. (1990). Bayesian hypothesis testing with symmetric and unimodal priors. Tech. Report 90-43, Dept. Statistics, Purdue Univ.
DasGupta, A., Casella, G., Delampady, M., Genest, C., Rubin, H. and Strawderman, W.E. (2000). Correlation in a formal Bayes framework. Can. J. Statist. 28, 675-687.
Datta, G.S. (1996). On priors providing frequentist validity for Bayesian inference for multiple parametric functions. Biometrika 83, 287-298.
Datta, G.S. and Ghosh, M. (1996). On the invariance of noninformative priors. Ann. Statist. 24, 141-159.
Datta, G.S., Ghosh, M. and Mukerjee, R. (2000). Some new results on probability matching priors. Calcutta Statist. Assoc. Bull. 50, 179-192.
Datta, G.S. and Mukerjee, R. (2004). Probability Matching Priors: Higher Order Asymptotics. Lecture Notes in Statistics. Springer, New York.
Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM, Philadelphia.
Dawid, A.P. (1973). Posterior expectations for large observations. Biometrika 60, 664-667.
Dawid, A.P., Stone, M. and Zidek, J.V. (1973). Marginalization paradoxes in Bayesian and structural inference (with discussion). J. Roy. Statist. Soc. (Ser. B) 35, 189-233.
de Finetti, B. (1972). Probability, Induction, and Statistics. Wiley, New York.
de Finetti, B. (1974, 1975). Theory of Probability, Vols. 1, 2. Wiley, New York.
DeGroot, M.H. (1970). Optimal Statistical Decisions. McGraw-Hill, New York.
DeGroot, M.H. (1973). Doing what comes naturally: Interpreting a tail area as a posterior probability or as a likelihood ratio. J. Amer. Statist. Assoc. 68, 966-969.
Delampady, M. (1986). Testing a precise hypothesis: Interpreting P-values from a robust Bayesian viewpoint. Ph.D. thesis, Purdue Univ.
Delampady, M. (1989a). Lower bounds on Bayes factors for invariant testing situations. J. Multivariate Anal. 28, 227-246.
Delampady, M. (1989b). Lower bounds on Bayes factors for interval null hypotheses. J. Amer. Statist. Assoc. 84, 120-124.
Delampady, M. (1992). Bayesian robustness for elliptical distributions. Rebrape: Brazilian J. Probab. Statist. 6, 97-119.
Delampady, M. (1999). Robust Bayesian outlier detection. Brazilian J. Probab. Statist. 13, 149-179.
Delampady, M. and Berger, J.O. (1990). Lower bounds on Bayes factors for multinomial distributions, with application to chi-squared tests of fit. Ann. Statist. 18, 1295-1316.
Delampady, M. and Dey, D.K. (1994). Bayesian robustness for multiparameter problems. J. Statist. Plann. Inference 40, 375-382.
Delampady, M., Yee, I. and Zidek, J.V. (1993). Hierarchical Bayesian analysis of a discrete time series of Poisson counts. Statist. Comput. 3, 7-15.
Delampady, M., DasGupta, A., Casella, G., Rubin, H. and Strawderman, W.E. (2001). A new approach to default priors and robust Bayes methodology. Can. J. Statist. 29, 437-450.
Dempster, A.P. (1967). Upper and lower probabilities induced from a multivalued mapping. Ann. Math. Statist. 38, 325-339.
Dempster, A.P. (1968). A generalization of Bayesian inference (with discussion). J. Roy. Statist. Soc. (Ser. B) 30, 205-247.
Dempster, A.P. (1973). The direct use of likelihood for significance testing. In: Godambe, V.P. and Sprott, D.A. (eds) Proceedings of the Conference on Foundational Questions in Statistical Inference. Holt, Rinehart, and Winston, Toronto.
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. (Ser. B) 39, 1-38.
DeRobertis, L. and Hartigan, J.A. (1981). Bayesian inference using intervals of measures. Ann. Statist. 9, 235-244.
Dey, D.K. and Birmiwal, L. (1994). Robust Bayesian analysis using divergence measures. Statist. Probab. Lett. 20, 287-294.
Dey, D.K. and Berger, J.O. (1983). On truncation of shrinkage estimators in simultaneous estimation of normal means. J. Amer. Statist. Assoc. 78, 865-869.
Dey, D.K., Ghosh, S.K. and Lou, K. (1996). On local sensitivity measures in Bayesian analysis. In: Berger, J.O. et al. (eds) Bayesian Robustness, IMS Lecture Notes, 29, 21-39.
Dey, D.K., Lou, K. and Bose, S. (1998). A Bayesian approach to loss robustness. Statist. Decisions 16, 65-87.
Dey, D.K. and Micheas, A.C. (2000). Ranges of posterior expected losses and ε-robust actions. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 145-159. Springer-Verlag, New York.
Dey, D.K. and Peng, F. (1996). Bayesian analysis of outlier problems using divergence measures. Canad. J. Statist. 23, 194-213.
Dharmadhikari, S. and Joag-Dev, K. (1988). Unimodality, Convexity, and Applications. Academic Press, San Diego.
Diaconis, P. and Freedman, D. (1986). On the consistency of Bayes estimates. Ann. Statist. 14, 1-26.
Diaconis, P. and Ylvisaker, D. (1979). Conjugate priors for exponential families. Ann. Statist. 7, 269-281.
Diamond, G.A. and Forrester, J.S. (1983). Clinical trials and statistical verdicts: Probable grounds for appeal. Ann. Intern. Med. 98, 385-394.
Dickey, J.M. (1971). The weighted likelihood ratio, linear hypotheses on normal location parameters. Ann. Math. Statist. 42, 204-223.
Dickey, J.M. (1973). Scientific reporting. J. Roy. Statist. Soc. (Ser. B) 35, 285-305.
Dickey, J.M. (1974). Bayesian alternatives to the F-test and least squares estimate in the linear model. In: Fienberg, S.E. and Zellner, A. (eds) Studies in Bayesian Econometrics and Statistics, 515-554. North-Holland, Amsterdam.
Dickey, J.M. (1976). Approximate posterior distributions. J. Amer. Statist. Assoc. 71, 680-689.
Dickey, J.M. (1977). Is the tail area useful as an approximate Bayes factor? J. Amer. Statist. Assoc. 72, 138-142.
Dickey, J.M. (1980). Approximate coherence for regression model inference - with a new analysis of Fisher's Broadback Wheatfield example. In: Zellner, A. (ed) Bayesian Analysis in Econometrics and Statistics: Essays in Honour of Harold Jeffreys, 333-354. North-Holland, Amsterdam.
Dmochowski, J. (1994). Intrinsic priors via Kullback-Leibler geometry. Tech. Report 94-15, Dept. Statistics, Purdue Univ.
Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist. 32, 962-994.
Eaton, M.L. (1983). Multivariate Statistics - A Vector Space Approach. Wiley, New York.
Eaton, M.L. (1989). Group Invariance Applications in Statistics. Regional Conference Series in Probability and Statistics, 1. IMS, Hayward, California.
Eaton, M.L. (1992). A statistical diptych: admissible inferences - recurrence of symmetric Markov chains. Ann. Statist. 20, 1147-1179.
Eaton, M.L. (1997). Admissibility in quadratically regular problems and recurrence of symmetric Markov chains: Why the connection? J. Statist. Plann. Inference 64, 231-247.
Eaton, M.L. (2004). Evaluating improper priors and the recurrence of symmetric Markov chains: an overview. A Festschrift for Herman Rubin, 5-20. IMS Lecture Notes Monogr. Ser. 45. IMS, Beachwood, OH.
Eaton, M.L. and Sudderth, W.D. (1998). A new predictive distribution for normal multivariate linear models. Sankhyā (Ser. A) 60, 363-382.
Eaton, M.L. and Sudderth, W.D. (2004). Properties of right Haar predictive inference. Sankhyā 66, 487-512.
Edwards, W., Lindman, H. and Savage, L.J. (1963). Bayesian statistical inference for psychological research. Psychol. Rev. 70, 193-242.
Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia.
Efron, B. (2003). Robbins, empirical Bayes and microarrays. Dedicated to the memory of Herbert E. Robbins. Ann. Statist. 31, 366-378.
Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 99, 96-104.
Efron, B. and Morris, C. (1971). Limiting the risk of Bayes and empirical Bayes estimators - Part I: The Bayes case. J. Amer. Statist. Assoc. 66, 807-815.
Efron, B. and Morris, C. (1972). Limiting the risk of Bayes and empirical Bayes estimators - Part II: The empirical Bayes case. J. Amer. Statist. Assoc. 67, 130-139.
Efron, B. and Morris, C. (1973). Stein's estimation rule and its competitors - an empirical Bayes approach. J. Amer. Statist. Assoc. 68, 117-130.
Efron, B. and Morris, C. (1973). Combining possibly related estimation problems (with discussion). J. Roy. Statist. Soc. (Ser. B) 35, 379-421.
Efron, B. and Morris, C. (1975). Data analysis using Stein's estimator and its generalizations. J. Amer. Statist. Assoc. 70, 311-319.
Efron, B. and Morris, C. (1976). Multivariate empirical Bayes and estimation of covariance matrices. Ann. Statist. 4, 22-32.
Efron, B., Tibshirani, R., Storey, J.D. and Tusher, V. (2001a). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96, 1151-1160.
Efron, B., Storey, J. and Tibshirani, R. (2001b). Microarrays, empirical Bayes methods, and false discovery rates. Tech. Report, Stanford Univ.
Ezekiel, M. and Fox, F.A. (1959). Methods of Correlation and Regression Analysis. Wiley, New York.
Fan, T-H. and Berger, J.O. (1992). Behavior of the posterior distribution and inferences for a normal mean with t prior convolutions. Statist. Decisions 10, 99-120.
Fang, K.T., Kotz, S. and Ng, K.W. (1990). Symmetric Multivariate and Related Distributions. Chapman & Hall, London.
Farrell, R.H. (1985). Multivariate Calculation - Use of the Continuous Groups. Springer-Verlag, New York.
Feller, W. (1973). Introduction to Probability Theory and Its Applications, Vol. 1, 3rd Ed. Wiley, New York.
Ferguson, T.S. (1967). Mathematical Statistics: A Decision-Theoretic Approach. Academic Press, New York.
Fernandez, C., Osiewalski, J. and Steel, M.F.J. (2001). Robust Bayesian inference on scale parameters. J. Multivariate Anal. 77, 54-72.
Finney, D.J. (1971). Probit Analysis, 3rd Ed. Cambridge Univ. Press, Cambridge, U.K.
Fishburn, P.C. (1981). Subjective expected utility: a review of normative theory. Theory and Decision 13, 139-199.
Fisher, R.A. (1973). Statistical Methods for Research Workers, 14th Ed. Hafner, New York. Reprinted by Oxford Univ. Press, Oxford, 1990.
Flury, B. and Zoppe, A. (2000). Exercises in EM. Amer. Statist. 54, 207-209.
Fortini, S. and Ruggeri, F. (1994). Concentration functions and Bayesian robustness (with discussion). J. Statist. Plann. Inference 40, 205-220.
Fortini, S. and Ruggeri, F. (2000). On the use of the concentration function in Bayesian robustness. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 109-126. Springer-Verlag, New York.
Fortini, S., Ladelli, L. and Regazzini, E. (2000). Exchangeability, predictive distributions and parametric models. Sankhyā (Ser. A) 62, 86-109.
Fraser, D.A.S., Monette, G. and Ng, K.W. (1995). Marginalization, likelihood and structured models. In: Krishnaiah, P. (ed) Multivariate Analysis, 6, 209-217. North-Holland, Amsterdam.
Freedman, D.A. (1963). On the asymptotic behavior of Bayes estimates in the discrete case. Ann. Math. Statist. 34, 1386-1403.
Freedman, D.A. (1965). On the asymptotic behavior of Bayes estimates in the discrete case. II. Ann. Math. Statist. 36, 454-456.
Freedman, D.A. and Purves, R.A. (1969). Bayes methods for bookies. Ann. Math. Statist. 40, 1177-1186.
French, S. (1986). Decision Theory. Ellis Horwood Limited, Chichester, England.
French, S. and Rios Insua, D. (2000). Statistical Decision Theory. Oxford Univ. Press, New York.
Gardner, M. (1997). Relativity Simply Explained. Dover, Mineola, New York.
Garthwaite, P.H. and Dickey, J.M. (1988). Quantifying expert opinion in linear regression problems. J. Roy. Statist. Soc. (Ser. B) 50, 462-474.
Garthwaite, P.H. and Dickey, J.M. (1992). Elicitation of prior distributions for variable-selection problems in regression. Ann. Statist. 20, 1697-1719.
Garthwaite, P.H., Kadane, J.B. and O'Hagan, A. (2005). Statistical methods for eliciting probability distributions. J. Amer. Statist. Assoc. 100, 680-701.
Gelfand, A.E. and Dey, D.K. (1991). On Bayesian robustness of contaminated classes of priors. Statist. Decisions 9, 63-80.
Gelfand, A.E. and Dey, D.K. (1994). Bayesian model choice: Asymptotics and exact calculations. J. Roy. Statist. Soc. (Ser. B) 56, 501-514.
Gelfand, A.E. and Ghosh, S.K. (1998). Model choice: a minimum posterior predictive loss approach. Biometrika 85, 1-11.
Gelfand, A.E. and Smith, A.F.M. (1990). Sampling based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85, 398-409.
Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. (1995). Bayesian Data Analysis. Chapman & Hall, London.
Gelman, A., Meng, X. and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statist. Sinica 6, 733-807.
Genovese, C. and Wasserman, L. (2001). Operating characteristics and extensions of the FDR procedure. Tech. Report, Carnegie Mellon Univ.
Genovese, C. and Wasserman, L. (2002). Operating characteristics and extensions of the false discovery rate procedure. J. Roy. Statist. Soc. (Ser. B) 64, 499-517.
George, E.I. and Foster, D.P. (2000). Calibration and empirical Bayes variable selection. Biometrika 87, 731-747.
Geweke, J. (1999). Simulation methods for model criticism and robustness analysis. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 6, 275-299. Oxford Univ. Press, New York.
Ghosal, S. (1997). Normal approximation to the posterior distribution for generalized linear models with many covariates. Math. Methods Statist. 6, 332-348.
Ghosal, S. (1999). Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli 5, 315-331.
Ghosal, S. (2000). Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity. J. Multivariate Anal. 74, 49-68.
Ghosal, S., Ghosh, J.K. and Ramamoorthi, R.V. (1997). Noninformative priors via sieves and packing numbers. In: Panchapakesan, S. and Balakrishnan, N. (eds) Advances in Statistical Decision Theory and Applications, 119-132. Birkhauser, Boston. Ghosal, S., Ghosh, J.K. and Samanta, T. (1995). On convergence of posterior distributions. Ann. Statist. 23, 2145-2152. Ghosh, J.K. (1983). Review of ^^Approximation Theorems of Mathematical Statistics'^ by R.J. Serfling. J. Amer. Statist. Assoc. 78. Ghosh, J.K. (1994). Higher Order Asymptotics. NSF-CBMS Regional Conference Series in Probability and Statistics. IMS, Hay ward. Ghosh, J.K. (1997). Discussion of "Noninformative priors do not exist: a dialogue with J.M. Bernardo". Jour. Statist. Plann. Inference 65, 159-189. Ghosh, J.K. (2002). Review of ^^Statistical Inference in Science" by D.A. Sprott. Sankhyd. (Ser. B) 64, 234-235. Ghosh, J.K., Bhanja, J., Purkayastha, S., Samanta, T. and Sengupta, S. (2002). A statistical approach to geological mapping. Mathematical Geology 34, 505-528. Ghosh, J.K. and Mukerjee, R. (1992). Non-informative priors (with discussion). In: Bernardo, J. M. et al. (eds). Bayesian Statistics 4, 195-210. Oxford Univ. Press, London. Ghosh, J.K. and Mukerjee, R. (1993). On priors that match posterior and frequentist distribution functions. Can. J. Statist. 2 1 , 89-96. Ghosh, J.K. and Ramamoorthi, R.V. (2003). Bayesian Nonparametrics. Springer, New York. Ghosh, J.K. and Samanta, T. (2001). Model selection - an overview. Current Science 80, 1135-1144. Ghosh, J.K. and Samanta, T. (2002a). Nonsubjective Bayes testing - an overview. J. Statist. Plann. Inference 103, 205-223. Ghosh, J.K. and Samanta, T. (2002b). Towards a nonsubjective Bayesian paradigm. In: Misra, J.C. (ed) Uncertainty and Optimality, 1-69. World Scientific, Singapore. Ghosh, J.K., Ghosal, S. and Samanta, T. (1994). Stability and convergence of posterior in non-regular problems. In: Gupta, S.S. and Berger, J.O. (eds). 
Statistical Decision Theory and Related Topics, V, 183-199.
Ghosh, J.K., Purkayastha, S. and Samanta, T. (2005). Role of P-values and other measures of evidence in Bayesian analysis. In: Dey, D.K. and Rao, C.R. (eds) Handbook of Statistics 25, Bayesian Thinking: Modeling and Computation, 151-170.
Ghosh, J.K., Sinha, B.K. and Joshi, S.N. (1982). Expansion for posterior probability and integrated Bayes risk. In: Gupta, S.S. and Berger, J.O. (eds) Statistical Decision Theory and Related Topics, III, 1, 403-456.
Ghosh, M. and Meeden, G. (1997). Bayesian Methods for Finite Population Sampling. Chapman & Hall, London.
Goel, P. (1983). Information measures and Bayesian hierarchical models. J. Amer. Statist. Assoc. 78, 408-410.
Goel, P. (1986). Comparison of experiments and information in censored data. In: Gupta, S.S. and Berger, J.O. (eds) Statistical Decision Theory and Related Topics, IV, 2, 335-349.
Good, I.J. (1950). Probability and the Weighing of Evidence. Charles Griffin, London.
Good, I.J. (1958). Significance tests in parallel and in series. J. Amer. Statist. Assoc. 53, 799-813.
Good, I.J. (1965). The Estimation of Probabilities: An Essay on Modern Bayesian Methods. M.I.T. Press, Cambridge, Massachusetts.
Good, I.J. (1967). A Bayesian significance test for the multinomial distribution. J. Roy. Statist. Soc. (Ser. B) 29, 399-431.
Good, I.J. (1975). The Bayes factor against equiprobability of a multinomial population assuming a symmetric Dirichlet prior. Ann. Statist. 3, 246-250.
Good, I.J. (1983). Good Thinking: The Foundations of Probability and its Applications. Univ. Minnesota Press, Minneapolis.
Good, I.J. (1985). Weight of evidence: A brief survey. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 2, 249-270. North-Holland, Amsterdam.
Good, I.J. (1986). A flexible Bayesian model for comparing two treatments. J. Statist. Comput. Simulation 26, 301-305.
Good, I.J. and Crook, J.F. (1974). The Bayes/non-Bayes compromise and the multinomial distribution. J. Amer. Statist. Assoc. 69, 711-720.
Green, P.J. (1995). Reversible jump MCMC computation and Bayesian model determination. Biometrika 82, 711-732.
Green, P.J. and Richardson, S. (2001). Modelling heterogeneity with and without the Dirichlet process. Scand. J. Statist. 28, 355-375.
Gustafson, P. (2000). Local robustness in Bayesian analysis. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 71-88. Springer-Verlag, New York.
Gustafson, P. and Wasserman, L. (1995). Local sensitivity diagnostics for Bayesian inference. Ann. Statist. 23, 2153-2167.
Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. J. Roy. Statist. Soc. (Ser. B) 29, 104-109.
Hajek, J. and Sidak, Z.V. (1967). Theory of Rank Tests. Academic Press, New York.
Halmos, P.R. (1950). Measure Theory. van Nostrand, New York.
Halmos, P.R. (1974). Measure Theory, 2nd Ed. Springer-Verlag, New York.
Hartigan, J.A. (1983). Bayes Theory. Springer-Verlag, New York.
Heath, D. and Sudderth, W. (1978). On finitely additive priors, coherence, and extended admissibility. Ann. Statist. 6, 335-345.
Heath, D. and Sudderth, W. (1989). Coherent inference from improper priors and from finitely additive priors. Ann. Statist. 17, 907-919.
Hernandez, E. and Weiss, G. (1996). A First Course on Wavelets. CRC Press Inc., Boca Raton.
Hewitt, E. and Savage, L.J. (1955). Symmetric measures on Cartesian products. Trans. Amer. Math. Soc. 80, 470-501.
Hildreth, C. (1963). Bayesian statisticians and remote clients. Econometrica 31, 422-438.
Hill, B. (1982). Comment on "Lindley's paradox," by G. Shafer. J. Amer. Statist. Assoc. 77, 344-347.
Hoeting, J.A., Madigan, D., Raftery, A.E. and Volinsky, C.T. (1999). Bayesian model averaging: a tutorial (with discussion). Statist. Sci. 14, 382-417.
Huber, P.J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35, 73-101.
Huber, P.J. (1974). Fisher information and spline interpolation. Ann. Statist. 2, 1029-1034.
Huber, P.J. (1981). Robust Statistics. John Wiley, New York.
Hwang, J.T., Casella, G., Robert, C., Wells, M.T. and Farrell, R.H. (1992). Estimation of accuracy in testing. Ann. Statist. 20, 490-509.
Ibragimov, I.A. and Has'minskii, R.Z. (1981). Statistical Estimation - Asymptotic Theory. Springer-Verlag, New York.
Ickstadt, K. (1992). Gamma-minimax estimators with respect to unimodal priors. In: Gritzmann, P. et al. (eds) Operations Research '91. Physica-Verlag, Heidelberg.
James, W. and Stein, C. (1960). Estimation with quadratic loss. Proc. Fourth Berkeley Symp. Math. Statist. Probab. 1, 361-380. Univ. California Press, Berkeley.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proc. Roy. Soc. London (Ser. A) 186, 453-461.
Jeffreys, H. (1957). Scientific Inference. Cambridge Univ. Press, Cambridge.
Jeffreys, H. (1961). Theory of Probability, 3rd Ed. Oxford Univ. Press, New York.
Johnson, R.A. (1970). Asymptotic expansions associated with posterior distribution. Ann. Math. Statist. 42, 1899-1906.
Kadane, J.B. (ed) (1984). Robustness of Bayesian Analyses. Studies in Bayesian Econometrics, 4. North-Holland Publishing Co., Amsterdam.
Kadane, J.B., Dickey, J.M., Winkler, R.L., Smith, W.S. and Peters, S.C. (1980). Interactive elicitation of opinion for a normal linear model. J. Amer. Statist. Assoc. 75, 845-854.
Kadane, J.B., Salinetti, G. and Srinivasan, C. (2000). Stability of Bayes decisions and applications. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 187-196. Springer-Verlag, New York.
Kadane, J.B., Schervish, M.J. and Seidenfeld, T. (1999). Rethinking the Foundations of Statistics. Cambridge Studies in Probability, Induction, and Decision Theory. Cambridge Univ. Press, Cambridge.
Kagan, A.M., Linnik, Y.V., Rao, C.R. (1973). Characterization Problems in Mathematical Statistics. Wiley, New York.
Kahneman, D., Slovic, P. and Tversky, A. (1982). Judgement Under Uncertainty: Heuristics and Biases. Cambridge Univ. Press, New York.
Kariya, T. and Sinha, B.K. (1989). Robustness of Statistical Tests. Statistical Modeling and Decision Science. Academic Press, Boston, MA.
Kass, R. and Raftery, A. (1995). Bayes factors. J. Amer. Statist. Assoc. 90, 773-795.
Kass, R. and Wasserman, L. (1996). The selection of prior distributions by formal rules (review paper). J. Amer. Statist. Assoc. 91, 1343-1370.
Kass, R.E., Tierney, L. and Kadane, J.B. (1988). Asymptotics in Bayesian computations. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 3, 261-278. Oxford Univ. Press, Oxford.
Kiefer, J. (1957). Invariance, minimax sequential estimation, and continuous time processes. Ann. Math. Statist. 28, 573-601.
Kiefer, J. (1966). Multivariate optimality results. In: Krishnaiah, P.R. (ed) Multivariate Analysis. Academic Press, New York.
Kiefer, J. (1977). Conditional confidence statements and confidence estimators (with discussion). J. Amer. Statist. Assoc. 72, 789-827.
Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist. 27, 887-906.
Laplace, P.S. (1986). Memoir on the probability of the causes of events (English translation of the 1774 French original by S.M. Stigler). Statist. Sci. 1, 364-378.
Lavine, M. (1991). Sensitivity in Bayesian statistics: the prior and the likelihood. J. Amer. Statist. Assoc. 86, 396-399.
Lavine, M., Pacifico, M.P., Salinetti, G. and Tardella, L. (2000). Linearization techniques in Bayesian robustness. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 261-272. Springer-Verlag, New York.
Leamer, E.E. (1978). Specification Searches. Wiley, New York.
Leamer, E.E. (1982). Sets of posterior means with bounded variance prior. Econometrica 50, 725-736.
Le Cam, L. (1953). On Some Asymptotic Properties of Maximum Likelihood Estimates and Related Bayes Estimates. Univ. California Publications in Statistics, 1, 277-330.
Le Cam, L. (1958). Les propriétés asymptotiques des solutions de Bayes. Publ. Inst. Statist. Univ. Paris 7, 17-35.
Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, New York.
Le Cam, L. and Yang, G.L. (2000). Asymptotics in Statistics: Some Basic Concepts, 2nd Ed. Springer-Verlag, New York.
Lee, P.M. (1989). Bayesian Statistics: An Introduction. Oxford Univ. Press, New York.
Lehmann, E.L. (1986). Testing Statistical Hypotheses, 2nd Ed. Wiley, New York.
Lehmann, E.L. and Casella, G. (1998). Theory of Point Estimation, 2nd Ed. Springer-Verlag, New York.
Lempers, F.B. (1971). Posterior Probabilities of Alternative Models. Univ. of Rotterdam Press, Rotterdam.
Leonard, T. and Hsu, J.S.J. (1999). Bayesian Methods. Cambridge Univ. Press, Cambridge.
Li, K.C. (1987). Asymptotic optimality for Cp, CL, cross validation and generalized cross validation: discrete index set. Ann. Statist. 15, 958-975.
Liang, F., Paulo, R., Molina, G., Clyde, M.A. and Berger, J.O. (2005). Mixtures of g-priors for Bayesian variable selection. Unpublished manuscript.
Lindley, D.V. (1956). On a measure of the information provided by an experiment. Ann. Math. Statist. 27, 986-1005.
Lindley, D.V. (1957). A statistical paradox. Biometrika 44, 187-192.
Lindley, D.V. (1961). The use of prior probability distributions in statistical inference and decisions. Proc. Fourth Berkeley Symp. Math. Statist. Probab. 1, 453-468. Univ. California Press, Berkeley.
Lindley, D.V. (1965). An Introduction to Probability and Statistics from a Bayesian Viewpoint, 1, 2. Cambridge Univ. Press, Cambridge.
Lindley, D.V. (1977). A problem in forensic science. Biometrika 64, 207-213.
Lindley, D.V. and Phillips, L.D. (1976). Inference for a Bernoulli process (a Bayesian view). Amer. Statist. 30, 112-119.
Lindley, D.V. and Smith, A.F.M. (1972). Bayes estimates for the linear model. J. Roy. Statist. Soc. (Ser. B) 34, 1-41.
Liseo, B. (2000). Robustness issues in Bayesian model selection. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 197-222. Springer-Verlag, New York.
Liseo, B., Petrella, L. and Salinetti, G. (1996). Robust Bayesian analysis: an interactive approach. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 5, 661-666. Oxford Univ. Press, London.
Liu, R.C. and Brown, L.D. (1992). Nonexistence of informative unbiased estimators in singular problems. Ann. Statist. 21, 1-13.
Lu, K. and Berger, J.O. (1989a). Estimated confidence procedures for multivariate normal means. J. Statist. Plann. Inference 23, 1-19.
Lu, K. and Berger, J.O. (1989b). Estimation of normal means: frequentist estimators of loss. Ann. Statist. 17, 890-907.
Madigan, D. and Raftery, A.E. (1994). Model selection and accounting for model uncertainty in graphical models using Occam's window. J. Amer. Statist. Assoc. 89, 1535-1546.
Madigan, D. and York, J. (1995). Bayesian graphical models for discrete data. Int. Statist. Rev. 63, 215-232.
Martin, J.T. (1942). The problem of the evaluation of rotenone-containing plants. VI. The toxicity of l-elliptone and of poisons applied jointly, with further observations on the rotenone equivalent method of assessing the toxicity of derris root. Ann. Appl. Biol. 29, 69-81.
Martin, J. and Arias, J.P. (2000). Computing efficient sets in Bayesian decision problems. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 161-186. Springer-Verlag, New York.
Martin, J., Rios Insua, D. and Ruggeri, F. (1998). Issues in Bayesian loss robustness. Sankhyā (Ser. A) 60, 405-417.
Matsumoto, M. and Nishimura, T. (1998). Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. on Modeling and Computer Simulation 8, 3-30.
McLachlan, G.J. and Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley, New York.
Meng, X.L. (1994). Posterior predictive p-values. Ann. Statist. 22, 1142-1160.
Meyn, S.P. and Tweedie, R.L. (1993). Markov Chains and Stochastic Stability. Springer-Verlag, New York.
Moreno, E. (2000). Global Bayesian robustness for some classes of prior distributions. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 45-70. Springer-Verlag, New York.
Moreno, E. and Cano, J.A. (1995). Classes of bidimensional priors specified on a collection of sets: Bayesian robustness. J. Statist. Plann. Inference 46, 325-334.
Moreno, E. and Pericchi, L.R. (1993). Bayesian robustness for hierarchical ε-contamination models. J. Statist. Plann. Inference 37, 159-167.
Morris, C.N. (1983). Parametric empirical Bayes inference: Theory and applications (with discussion). J. Amer. Statist. Assoc. 78, 47-65.
Morris, C.N. and Christiansen, C.L. (1996). Hierarchical models for ranking and for identifying extremes, with application. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 5, 277-296. Oxford Univ. Press, Oxford.
Mosteller, F. and Tukey, J.W. (1977). Data Analysis and Regression. Addison-Wesley, Reading.
Muirhead, R.J. (1982). Aspects of Multivariate Statistical Theory. Wiley, New York.
Mukhopadhyay, N. (2000). Bayesian model selection for high dimensional models with prediction error loss and 0-1 loss. Ph.D. thesis, Purdue Univ.
Mukhopadhyay, N. and Ghosh, J.K. (2004a). Parametric empirical Bayes model selection - some theory, methods and simulation. In: Bhattacharya, R.N. et al. (eds) Probability, Statistics and Their Applications: Papers in Honor of Rabi Bhattacharya. IMS Lecture Notes-Monograph Ser. 41, 229-245.
Mukhopadhyay, N. and Ghosh, J.K. (2004b). Bayes rule for prediction and AIC, an asymptotic evaluation. Unpublished manuscript.
Müller, P. and Vidakovic, B. (eds) (1999). Bayesian Inference in Wavelet-Based Models. Lecture Notes in Statistics, 141. Springer, New York.
Nachbin, L. (1965). The Haar Integral. van Nostrand, New York.
Neyman, J. and Scott, E.L. (1948). Consistent estimates based on partially consistent observations. Econometrica 16, 1-32.
Newton, M.A., Yang, H., Gorman, P.A., Tomlinson, I. and Roylance, R.R. (2003). A statistical approach to modeling genomic aberrations in cancer cells (with discussion). In: Bernardo, J.M. et al. (eds) Bayesian Statistics 7, 293-305. Oxford Univ. Press, New York.
Ogden, R.T. (1997). Essential Wavelets for Statistical Applications and Data Analysis. Birkhauser, Boston.
O'Hagan, A. (1988). Modelling with heavy tails. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 3, 569-577. Oxford Univ. Press, Oxford.
O'Hagan, A. (1990). Outliers and credence for location parameter inference. J. Amer. Statist. Assoc. 85, 172-176.
O'Hagan, A. (1994). Bayesian Inference. Kendall's Advanced Theory of Statistics, Vol. 2B. Halsted Press, New York.
O'Hagan, A. (1995). Fractional Bayes factors for model comparisons. J. Roy. Statist. Soc. (Ser. B) 57, 99-138.
Pearson, K. (1892). The Grammar of Science. Walter Scott, London. Latest Edition: (2004). Dover Publications.
Pericchi, L.R. and Perez, M.E. (1994). Posterior robustness with more than one sampling model (with discussion). J. Statist. Plann. Inference 40, 279-294.
Perone Pacifico, M., Salinetti, G. and Tardella, L. (1998). A note on the geometry of Bayesian global and local robustness. J. Statist. Plann. Inference 69, 51-64.
Pettit, L.I. and Young, K.D.S. (1990). Measuring the effect of observations on Bayes factors. Biometrika 77, 455-466.
Pitman, E.J.G. (1979). Some Basic Theory for Statistical Inference. Chapman & Hall, London; A Halsted Press Book, Wiley, New York.
Polasek, W. (1985). Sensitivity analysis for general and hierarchical linear regression models. In: Goel, P.K. and Zellner, A. (eds) Bayesian Inference and Decision Techniques with Applications. North-Holland, Amsterdam.
Pratt, J.W. (1961). Review of "Testing Statistical Hypotheses" by E.L. Lehmann. J. Amer. Statist. Assoc. 56, 163-166.
Pratt, J.W. (1965). Bayesian interpretation of standard inference statements (with discussion). J. Roy. Statist. Soc. (Ser. B) 27, 169-203.
Raftery, A.E., Madigan, D. and Hoeting, J.A. (1997). Bayesian model averaging for linear regression models. J. Amer. Statist. Assoc. 92, 179-191.
Raiffa, H. and Schlaifer, R. (1961). Applied Statistical Decision Theory. Division of Research, School of Business Administration, Harvard Univ.
Ramsey, F.P. (1926). Truth and probability. Reprinted in: Kyburg, H.E. and Smokler, H.E. (eds) Studies in Subjective Probability. Wiley, New York.
Rao, C.R. (1973). Linear Statistical Inference and Its Applications, 2nd Ed. Wiley, New York.
Rao, C.R. (1982). Diversity: its measurement, decomposition, apportionment and analysis. Sankhyā (Ser. A) 44, 1-22.
Richardson, S. and Green, P.J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. Roy. Statist. Soc. (Ser. B) 59, 731-792.
Rissanen, J. (1987). Stochastic complexity. J. Roy. Statist. Soc. (Ser. B) 49, 223-239.
Rios Insua, D. and Criado, R. (2000). Topics on the foundations of robust Bayesian analysis. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 33-44. Springer-Verlag, New York.
Rios Insua, D. and Martin, J. (1994). Robustness issues under imprecise beliefs and preferences. J. Statist. Plann. Inference 40, 383-389.
Rios Insua, D. and Ruggeri, F. (eds) (2000). Robust Bayesian Analysis. Lecture Notes in Statistics, 152. Springer-Verlag, New York.
Robbins, H. (1951). Asymptotically subminimax solutions of compound statistical decision problems. Proc. Second Berkeley Symp. Math. Statist. Probab. 131-148. Univ. California Press, Berkeley.
Robbins, H. (1955). An empirical Bayes approach to statistics. Proc. Third Berkeley Symp. Math. Statist. Probab. 1, 157-164. Univ. California Press, Berkeley.
Robbins, H. (1964). The empirical Bayes approach to statistical decision problems. Ann. Math. Statist. 35, 1-20.
Robert, C.P. (1994). The Bayesian Choice. Springer-Verlag, New York.
Robert, C.P. (2001). The Bayesian Choice, 2nd Ed. Springer-Verlag, New York.
Robert, C.P. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer-Verlag, New York.
Robinson, G.K. (1976). Conditional properties of Student's t and of the Behrens-Fisher solution to the two means problem. Ann. Statist. 4, 963-971.
Robinson, G.K. (1979). Conditional properties of statistical procedures. Ann. Statist. 7, 742-755.
Robinson, G.K. (1979). Conditional properties of statistical procedures for location and scale parameters. Ann. Statist. 7, 756-771.
Roussas, G.G. (1972). Contiguity of Probability Measures: Some Applications in Statistics. Cambridge Tracts in Mathematics and Mathematical Physics, 63. Cambridge Univ. Press, London-New York.
Rousseau, J. (2000). Coverage properties of one-sided intervals in the discrete case and application to matching priors. Ann. Inst. Statist. Math. 52, 28-42.
Rubin, D.B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann. Statist. 12, 1151-1172.
Rubin, H. (1971). A decision-theoretic approach to the problem of testing a null hypothesis. In: Gupta, S.S. and Yackel, J. (eds) Statistical Decision Theory and Related Topics. Academic Press, New York.
Rubin, H. and Sethuraman, J. (1965). Probabilities of moderate deviations. Sankhyā (Ser. A) 27, 325-346.
Ruggeri, F. and Sivaganesan, S. (2000). On a global sensitivity measure for Bayesian inference. Sankhyā (Ser. A) 62, 110-127.
Ruggeri, F. and Wasserman, L. (1995). Density based classes of priors: infinitesimal properties and approximations. J. Statist. Plann. Inference 46, 311-324.
Sarkar, S. (2003). FDR-controlling stepwise procedures and their false negative rates. Tech. Report, Temple Univ.
Satterthwaite, F.E. (1946). An approximate distribution of estimates of variance components. Biometrics Bull. (now called Biometrics) 2, 110-114.
Savage, L.J. (1954). The Foundations of Statistics. Wiley, New York.
Savage, L.J. (1972). The Foundations of Statistics, 2nd revised Ed. Dover, New York.
Schervish, M.J. (1995). Theory of Statistics. Springer-Verlag, New York.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-464.
Scott, J.G. and Berger, J.O. (2005). An exploration of aspects of Bayesian multiple testing. To appear in J. Statist. Plann. Inference.
Searle, S.R. (1982). Matrix Algebra Useful for Statistics. Wiley, New York.
Seidenfeld, T., Schervish, M.J. and Kadane, J.B. (1995). A representation of partially ordered preferences. Ann. Statist. 23, 2168-2217.
Seo, J. (2004). Some classical and Bayesian nonparametric regression methods in a longitudinal marginal model. Ph.D. thesis, Purdue Univ.
Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.
Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton Univ. Press, NJ.
Shafer, G. (1979). Allocations of Probability. Ann. Probab. 7, 827-839.
Shafer, G. (1982). Lindley's paradox (with discussion). J. Amer. Statist. Assoc. 77, 325-351.
Shafer, G. (1982). Belief functions and parametric models (with discussion). J. Roy. Statist. Soc. (Ser. B) 44, 322-352.
Shafer, G. (1987). Probability judgment in artificial intelligence and expert systems (with discussion). Statist. Sci. 2, 3-44.
Shannon, C.E. (1948). A mathematical theory of communication. Bell System Tech. J. 27, 379-423 and 623-656. Reprinted in The Mathematical Theory of Communication (Shannon, C.E. and Weaver, W., 1949). Univ. Illinois Press, Urbana, IL.
Shao, J. (1997). An asymptotic theory for linear model selection. Statist. Sinica 7, 221-264.
Shibata, R. (1981). An optimal selection of regression variables. Biometrika 68, 45-54.
Shibata, R. (1983). Asymptotic mean efficiency of a selection of regression variables. Ann. Inst. Statist. Math. 35, 415-423.
Shyamalkumar, N.D. (2000). Likelihood robustness. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 109-126. Springer-Verlag, New York.
Sivaganesan, S. (1989). Sensitivity of posterior mean to unimodality preserving contaminations. Statist. and Decisions 7, 77-93.
Sivaganesan, S. (1993). Robust Bayesian diagnostics. J. Statist. Plann. Inference 35, 171-188.
Sivaganesan, S. (2000). Global and local robustness approaches: uses and limitations. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 89-108. Springer-Verlag, New York.
Sivaganesan, S. and Berger, J.O. (1989). Ranges of posterior means for priors with unimodal contaminations. Ann. Statist. 17, 868-889.
Smith, A.F.M. and Roberts, G.O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods (with discussion). J. Roy. Statist. Soc. (Ser. B) 55, 3-24.
Smith, A.F.M. and Spiegelhalter, D.J. (1980). Bayes factors and choice criteria for linear models. J. Roy. Statist. Soc. (Ser. B) 42, 213-220.
Smith, C.A.B. (1965). Personal probability and statistical analysis. J. Roy. Statist. Soc. (Ser. A) 128, 469-499.
Sorensen, D. and Gianola, D. (2002). Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. Springer-Verlag, New York.
Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002). Bayesian measures of model complexity and fit. J. Roy. Statist. Soc. (Ser. B) 64, 583-639.
Spiegelhalter, D.J. and Smith, A.F.M. (1982). Bayes factors for linear and log-linear models with vague prior information. J. Roy. Statist. Soc. (Ser. B) 44, 377-387.
Sprott, D.A. (2000). Statistical Inference in Science. Springer-Verlag, New York.
Srinivasan, C. (1981). Admissible generalized Bayes estimators and exterior boundary value problems. Sankhyā (Ser. A) 43, 1-25.
Stein, C. (1955). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proc. Third Berkeley Symp. Math. Statist. Probab. 1, 197-206. Univ. California Press, Berkeley.
Stein, C. (1956). Some problems in multivariate analysis. Part I. Tech. Report, No. 6, Dept. Statistics, Stanford Univ.
Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9, 1135-1151.
Stigler, S.M. (1977). Do robust estimators work with real data? (with discussion). Ann. Statist. 5, 1055-1098.
Stone, M. (1979). Comments on model selection criteria of Akaike and Schwarz. J. Roy. Statist. Soc. (Ser. B) 41, 276-278.
Storey, J.D. (2002). A direct approach to false discovery rates. J. Roy. Statist. Soc. (Ser. B) 64, 479-498.
Storey, J.D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Statist. 31, 2013-2035.
Strawderman, W.E. (1971). Proper Bayes minimax estimators of the multivariate normal mean. Ann. Math. Statist. 42, 385-388.
Strawderman, W.E. (1978). Minimax adaptive generalized ridge regression estimators. J. Amer. Statist. Assoc. 73, 623-627.
Stromborg, K.L., Grue, C.E., Nichols, J.D., Hepp, G.R., Hines, J.E. and Bourne, H.C. (1988). Postfledging survival of European starlings exposed as nestlings to an organophosphorus insecticide. Ecology 69, 590-601.
Sun, D. and Berger, J.O. (1998). Reference priors with partial information. Biometrika 85, 55-71.
Tanner, M.A. (1991). Tools for Statistical Inference. Springer-Verlag, New York.
Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion). Ann. Statist. 22, 1701-1762.
Tierney, L. and Kadane, J.B. (1986). Accurate approximations for posterior moments. J. Amer. Statist. Assoc. 81, 82-86.
Tierney, L., Kass, R.E. and Kadane, J.B. (1989). Fully exponential Laplace approximations to expectations and variances of nonpositive functions. J. Amer. Statist. Assoc. 84, 710-716.
Tversky, A. and Kahneman, D. (1981). The framing of decisions and the psychology of choice. Science 211, 453-458.
Vidakovic, B. (1999). Statistical Modeling by Wavelets. Wiley, New York.
Vidakovic, B. (2000). Γ-minimax: A paradigm for conservative robust Bayesians. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 241-259. Springer-Verlag, New York.
von Mises, R. (1931). Wahrscheinlichkeitsrechnung. Springer, Berlin.
von Mises, R. (1957). Probability, Statistics and Truth, 2nd revised English Ed. (prepared by Hilda Geiringer). The Macmillan Company, New York.
Waagepetersen, R. and Sorensen, D. (2001). A tutorial on reversible jump MCMC with a view towards applications in QTL-mapping. Int. Statist. Rev. 69, 49-61.
Wald, A. (1950). Statistical Decision Functions. Wiley, New York and Chapman & Hall, London.
Walker, A.M. (1969). On the asymptotic behaviour of posterior distributions. J. Roy. Statist. Soc. (Ser. B) 31, 80-88.
Walley, P. (1991). Statistical Reasoning with Imprecise Probabilities. Chapman & Hall, London.
Wasserman, L. (1990). Prior envelopes based on belief functions. Ann. Statist. 18, 454-464.
Wasserman, L. (1992). Recent methodological advances in robust Bayesian inference. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 4, 35-60. Oxford Univ. Press, Oxford.
Wasserman, L. and Kadane, J. (1990). Bayes' theorem for Choquet capacities. Ann. Statist. 18, 1328-1339.
Wasserman, L., Lavine, M. and Wolpert, R.L. (1993). Linearization of Bayesian robustness problems. J. Statist. Plann. Inference 37, 307-316.
Weiss, R. (1996). An approach to Bayesian sensitivity analysis. J. Roy. Statist. Soc. (Ser. B) 58, 739-750.
Welch, B.L. (1939). On confidence limits and sufficiency with particular reference to parameters of location. Ann. Math. Statist. 10, 58-69.
Welch, B.L. (1949). Further notes on Mrs. Aspin's tables. Biometrika 36, 243-246.
Welch, B.L. and Peers, H.W. (1963). On formulae for confidence points based on integrals of weighted likelihoods. J. Roy. Statist. Soc. (Ser. B) 25, 318-329.
Wijsman, R.A. (1967). Cross-sections of orbits and their application to densities of maximal invariants. Proc. Fifth Berkeley Symp. Math. Statist. Probab. 1, 389-400. Univ. California Press, Berkeley.
Wijsman, R.A. (1985). Proper action in steps, with application to density ratios of maximal invariants. Ann. Statist. 13, 395-402.
Wijsman, R.A. (1986). Global cross sections as a tool for factorization of measures and distribution of maximal invariants. Sankhyā (Ser. A) 48, 1-42.
Wijsman, R.A. (1990). Invariant Measures on Groups and Their Use in Statistics. IMS Lecture Notes-Monograph Ser. 14. IMS, Hayward, CA.
Woods, H., Steinour, H.H. and Starke, H.R. (1932). Effect of composition of Portland cement on heat evolved during hardening. Industrial and Engineering Chemistry 24, 1207-1214.
Woodward, C., Lange, S.W., Nelson, K.W. and Calvert, H.O. (1941). The acute oral toxicity of acetic, chloracetic, dichloracetic and trichloracetic acids. J. Industrial Hygiene and Toxicology 23, 78-81.
Young, G.A. and Smith, R.L. (2005). Essentials of Statistical Inference. Cambridge Univ. Press, Cambridge, U.K.
Zellner, A. (1971). An Introduction to Bayesian Inference in Econometrics. Wiley, New York.
Zellner, A. (1984). Posterior odds ratios for regression hypotheses: General considerations and some specific results. In: Zellner, A. (ed) Basic Issues in Econometrics, 275-305. Univ. of Chicago Press, Chicago.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In: Goel, P.K. and Zellner, A. (eds) Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, 233-243. North-Holland, Amsterdam.
Zellner, A. and Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In: Bernardo, J.M. et al. (eds) Bayesian Statistics, 585-603. University Press, Valencia.
Author Index
Abramowitz, M., 274
Agresti, A., 34
Akaike, H., 284
Albert, J.H., 96, 251
Anderson, D.R., 284, 288
Andersson, S., 174, 175
Angers, J-F., 91, 92, 208, 295, 297-299
Arnold, S.F., 225
Athreya, K.B., 216, 217
Bahadur, R.R., 178, 179
Banerjee, S., 289-291
Barbieri, M.M., 276, 280-283
Barron, A., 127
Basu, D., 8, 10, 25, 37, 307, 308
Basu, R., 262
Basu, S., 285
Bayarri, M.J., 93, 182-184
Bayes, T., 32, 123
Benjamini, Y., 272, 273
Berger, J.O., 15, 30, 37, 38, 40, 53, 60, 61, 72-77, 83, 91, 93, 123, 126, 139, 140, 142, 144-146, 164, 166, 167, 169, 171, 172, 175, 176, 179, 182-184, 190, 191, 193, 194, 196, 197, 267, 268, 270, 273-275, 277, 279-283, 287, 298
Berger, R., 1, 176, 203
Berliner, M.L., 83
Bernardo, J.M., 30, 33, 123, 126, 128, 129, 140, 142, 144, 145, 149, 157, 175, 193, 269, 284
Bernoulli, J., 100
Bernstein, S., 103
Berti, P., 71
Besag, J., 222, 290
Best, N.G., 285
Betro, B., 72
Bhanja, J., 57
Bhattacharya, R.N., 155
Bhattacharya, S., 285
Bickel, P.J., 1, 103, 109, 155
Birmiwal, L., 86, 89, 96
Birnbaum, A., 38, 57, 307, 308
Bishop, Y.M.M., 34
Blackwell, D., 315
Bondar, J.V., 139, 174
Borwanker, J.D., 103
Bose, S., 92
Bourne, H.C., 288
Box, G., 30
Box, G.E.P., 91, 93, 180, 201, 245
Brons, H., 174
Brooks, S.P., 230
Brown, L.D., 4, 14, 34, 131, 267
Burnham, K.P., 284, 288
Caffo, B., 34
Cai, T.T., 34, 131, 262, 284
Calvert, H.O., 252
Carlin, B.P., 30, 148, 198, 232, 236, 261, 262, 285, 289-291
Carlin, J.B., 30, 161, 245, 257, 260, 285
Carnap, R., 30
Casella, G., 1, 9, 155-157, 164, 176, 203, 215, 217, 223, 225-227, 230, 232, 234, 300
Cencov, N.N., 125
Chakrabarti, A., 275, 284, 285
Chao, M.T., 103
Chatterjee, S.K., 37
Chattopadhyay, G., 37
Chen, C.F., 103
Chib, S., 251, 285
Christiansen, C.L., 256
Clarke, B., 127
Clayton, D.G., 290, 291
Clyde, M.A., 274, 278, 279
Congdon, P., 30
Cox, D.R., 37, 51, 52
Csiszar, I., 86
Cuevas, A., 93
Dalal, S.R., 135
DasGupta, A., 34, 131, 155, 156, 170
Datta, G.S., 130-132, 139, 140, 148, 263
Daubechies, I., 293, 294
Dawid, A.P., 91, 136, 138
de Finetti, B., 54, 66, 149, 311
DeGroot, M.H., 30, 66, 175
Delampady, M., 89, 90, 96, 155, 156, 164, 168-172, 174, 176, 188, 208, 214, 216, 295, 297-299
Dempster, A.P., 57, 81, 175, 208, 209
DeRobertis, L., 75, 79
Dey, D.K., 85, 86, 89, 90, 92, 93, 96, 268, 285, 289
Dharmadhikari, S., 170
Diaconis, P., 85, 100, 133
Diamond, G.A., 175
Dickey, J.M., 121, 154, 155, 175, 176
Dmochowski, J., 196
Doksum, K.A., 1
Donoho, D., 273
Doss, H., 217
Ferguson, T.S., 66, 67, 69, 70
Fernandez, C., 93
Fienberg, S.E., 34
Finney, D.J., 247
Fishburn, P.C., 73
Fisher, R.A., 21, 37, 38, 51, 59, 123, 163
Flury, B., 233
Forrester, J.S., 175
Fortini, S., 149
Foster, D.P., 284
Fox, F.A., 152
Fraser, D.A.S., 142
Freedman, D.A., 70, 71, 100
Freedman, J., 85
French, S., 55, 58, 67, 68, 70
Eaton, M.L., 138, 140, 174, 267 Eddington, A.S., 48 Edwards, W., 164, 166, 167, 172, 175 Efron, B., 23, 255, 261, 269, 271, 272 ErkanH, A., 302 Ezekiel, M., 152
Gardner, M., 48 Garthwaite, P.H., 121, 154, 155 Gelfand, A.E., 37, 85, 222, 285, 289-291 Gelman, A., 30, 161, 181, 245, 257, 260, 285 Genest, C , 155 Genovese, C , 273 George, E.L, 225, 284 Geweke, J., 91 Ghosal, S., 101, 103, 125, 146 Ghosh, J.K., 37, 57, 100, 101, 103, 104, 108, 109, 123, 125, 130, 131, 140, 142, 145-147, 155, 164, 176, 177, 179, 194, 262, 271, 273-275, 284, 285, 288, 289, 302, 308 Ghosh, M., 36, 139, 140, 148, 256, 263 Ghosh, S.K., 93, 285 Gianola, D., 210, 230, 231 Giudici, P., 230 Goel, P., 86 Good, I.J., 77, 172, 175 Gorman, P.A., 269 Green, P.J., 230, 300 Greenberg, E., 285 Grue, C.E., 288 Gruet, M., 300 Gustafson, P., 85 Guttman, I., 181
Fan, T-H., 91 Fang, K.T., 170 Farrell, R.H., 140, 164, 174 Feller, W., 179, 203
Hajek, J., 178 Hall, G.J., 135 Halmos, P.R., 136 Hartigan, J.A., 75, 79, 125
Has'minskii, R.Z., 127, 265 Heath, D., 41, 70, 71, 138, 148 Hepp, G.R., 288 Hernandez, E., 293 Hewitt, E., 54 Hildreth, C., 175 Hill, B., 175 Hines, J.E., 288 Hochberg, Y., 272 Hoeting, J.A., 36, 278 Holland, P.W., 34 Hsu, J.S.J., 30 Huber, P.J., 93, 155 Hwang, J.T., 164 Ibragimov, I.A., 127, 265
James, W., 255, 265 Jeffreys, H., 30, 31, 41, 43, 46, 47, 60, 126, 175, 177, 189, 197 Jensen, S., 174 Jin, J., 273 Joag-Dev, K., 170 Johnson, R.A., 107-109 Joshi, S.N., 108, 109 Kadane, J.B., 57, 66, 72, 73, 81, 99, 109, 116, 117, 121, 154, 208 Kahneman, D., 72 Kaldor, J.M., 290, 291 Kallianpur, G., 103 Kariya, T., 174 Kass, R.E., 99, 116, 123 Keynes, L., 123 Kiefer, J., 37, 139, 176, 255 Kotz, S., 170 Krishnan, T., 208, 216 Ladelli, L., 149 Laird, N.M., 208, 209 Lange, S.W., 252 Laplace, P.S., 32, 99, 100, 103, 113, 115, 123 Lavine, M., 93 Le Cam, L., 103, 178 Leamer, E.E., 72, 75, 175, 245 Lehmann, E.L., 1, 9, 19, 37, 139, 157, 176
Lempers, F.B., 175 Leonard, T., 30 Li, K.C., 283 Liang, F., 274 Lindley, D.V., 107, 126, 129, 148, 175, 177, 189, 198, 297 Lindman, H., 164, 166, 167, 172, 175 Liseo, B., 53, 61, 83 Liu, R.C., 14 Liu, W., 273 Lou, K., 92, 93 Louis, T.A., 30, 148, 198, 232, 236, 261, 262 Low, M., 262, 284 Lu, K., 15 Madigan, D., 36, 278 Martin, J., 92 Martin, J.T., 247 Matsumoto, M., 215 McLachlan, G.J., 208 Meeden, G., 36, 256 Mendoza, M., 126 Meng, X.L., 181 Meyn, S.P., 217 Micheas, A.C., 92 Milnes, P., 139, 174 Molina, G., 274 Monette, G., 142 Moreno, E., 72 Morris, C., 36, 255-257, 259, 261-263, 265, 286 Mosteller, F., 242 Muirhead, R.J., 140 Mukerjee, R., 123, 130-132, 142, 262, 263 Mukhopadhyay, N., 273, 274, 279, 283, 284, 288 Müller, P., 289, 293, 296, 302 Nachbin, L., 136 Nelson, K.W., 252 Newton, M.A., 269 Neyman, J., 21, 52, 255 Ng, K.W., 142, 170 Nichols, J.D., 288 Nishimura, T., 215 O'Hagan, A., 30, 91, 92, 121, 154, 191, 194
Ogden, R.T., 293, 294 Osiewalski, J., 93 Parmigiani, G., 279 Paulo, R., 274 Pearson, K., 32, 155 Perez, M.E., 93 Pericchi, L.R., 72, 93, 190, 191, 193, 194, 196, 197 Peters, S.C., 121, 154 Pettit, L.I., 185 Petrella, L., 83 Phillips, L.D., 148, 198 Pitman, E.J.G., 14 Polasek, W., 75, 96 Pratt, J.W., 37, 38, 175 Purkayastha, S., 57, 164, 176, 177, 179 Purves, R.A., 70, 71 Raftery, A.E., 36, 278 Raiffa, H., 55, 175 Ramamoorthi, R.V., 100, 101, 103, 125, 140, 146, 271, 289, 302, 315 Ramsey, F.P., 65 Rao, B.L.S.P., 103 Rao, C.R., 8, 86, 125, 210, 234 Regazzini, E., 71, 149 Reid, N., 52 Richardson, S., 300 Rigo, P., 71 Rios Insua, D., 55, 67, 68, 70, 72, 92, 93 Rissanen, J., 285 Robbins, H., 255, 271 Robert, C.P., 15, 30, 164, 215, 217, 223, 226, 227, 230, 232, 234, 235, 267, 300 Roberts, G.O., 230 Robinson, G.K., 62 Roussas, G.G., 177 Rousseau, J., 131 Roylance, R.R., 269 Rubin, D.B., 30, 161, 181, 208, 209, 245, 257, 260, 285 Rubin, H., 155, 156, 175 Ruggeri, F., 72, 84, 92, 93 Sahu, S., 253 Salinetti, G., 72, 83
Samanta, T., 57, 101, 103, 147, 164, 176, 177, 179, 194, 284, 288 Sanz, P., 93 Sarkar, S., 272 Satterthwaite, F.E., 62 Savage, L.J., 30, 54, 66, 164, 166, 167, 172, 175 Schervish, M.J., 54, 66, 73, 157, 265, 268, 311 Schlaifer, R., 55, 175 Schwarz, G., 114, 118, 162, 179 Scott, E.L., 52, 255 Scott, J.G., 270, 273 Searle, S.R., 297 Seidenfeld, T., 66, 73 Sellke, T., 164, 166, 167, 179 Sengupta, S., 57 Seo, J., 302 Serfling, R.J., 178 Sethuraman, J., 217 Shafer, G., 57, 81, 175 Shannon, C.E., 124, 128, 129 Shao, J., 283 Shibata, R., 283 Shyamalkumar, N.D., 93 Sidak, Z.V., 178 Sinha, B.K., 108, 109, 174 Sinha, D., 289 Siow, A., 175, 273 Sivaganesan, S., 77, 84, 90, 93 Slovic, P., 72 Smith, A.F.M., 30, 37, 149, 175, 194, 222, 284, 297 Smith, C.A.B., 175 Smith, R.L., 272 Sorensen, D., 210, 230, 231 Spiegelhalter, D.J., 175, 194, 285 Sprott, D.A., 57 Srinivasan, C., 267 Starke, H.R., 203 Steel, M.F.J., 93 Stegun, I., 274 Stein, C., 174, 255, 261, 264-268, 286, 287 Steinour, H.H., 203 Stern, H.S., 30, 161, 181, 245, 257, 260, 285 Stone, M., 138, 273
Storey, J.D., 269, 271-273, 288 Strawderman, W.E., 155, 156, 267, 268 Stromborg, K.L., 288 Sudderth, W., 41, 70, 71, 138, 148 Sun, D., 123, 142, 146 Tanner, M.A., 208 Tiao, G., 30, 91, 93, 245 Tibshirani, R., 269, 271, 272 Tierney, L., 99, 109, 116, 117, 208, 217 Tomlinson, I., 269 Tukey, J.W., 242 Tusher, V., 269, 271, 272 Tversky, A., 72 Tweedie, R.L., 217
van der Linde, A., 285 Varshavsky, J.A., 191 Verghese, A., 298 Vidakovic, B., 91, 293, 294, 296 Volinsky, C.T., 36, 278 von Mises, R., 58, 100, 103
Waagepetersen, R., 230 Wald, A., 21, 23, 36 Walker, A.M., 103 Walley, P., 81 Wasserman, L., 57, 72, 81, 82, 85, 123, 273 Weiss, G., 293 Welch, B.L., 37, 38, 51, 59, 60, 62 Wells, M.T., 164 West, M., 302 Wijsman, R.A., 174, 175 Winkler, R.L., 121, 154 Wolfowitz, J., 255 Wolpert, R., 37, 53, 61 Woods, H., 203 Woodward, G., 252
Yahav, J., 103 Yang, G.L., 178 Yang, H., 269 Yee, I., 214 Yekutieli, D., 273 Ylvisaker, D., 133 York, J., 278 Young, G.A., 272 Young, K.D.S., 185
Zellner, A., 175, 273, 274, 283 Zhao, L., 262, 284 Zidek, J.V., 138, 214 Zoppe, A., 233
Subject Index
accept-reject method, 226, 233 action space, 22, 38 AIBF, 191, 192, 194, 195 AIC, 163 Akaike information criterion, 163, 283, 284 algorithm E-M, 23, 206, 208, 210, 259, 260 M-H, 206, 223, 229, 247 independent, 226, 232 reversible jump, 231, 236 Mersenne twister, 215 Metropolis-Hastings, 206, 218, 222, 260 alternative contiguous, 177, 178 parametric, 161 Pitman, 177, 179 amenable group, 139, 174, 176 analysis of variance, 224 ancillary statistic, 9, 10, 37 ANOVA, 227, 240, 260 approximation error in, 206, 211 Laplace, 99, 113, 115, 161, 207, 214, 274 large sample, 207 normal, 49, 247 saddle point, 161 Schwarz, 179 Stirling's, 114 Tierney-Kadane, 109 asymptotic expansion, 108, 110
asymptotic framework, 177, 178 asymptotic normality of posterior, 126 of posterior distribution, 103 average arithmetic, 191 geometric, 191 axiomatic approach, 67 Bahadur's approach, 178 basis functions, 293 Basu's theorem, 10 Bayes estimate, 106 expansion of, 109 Bayes factor, 43, 113, 118, 159-179, 185-200, 273, 279, 285, 302 conditional, 190 default, 194 expected intrinsic, 191 intrinsic, 191 Bayes formula, 30, 32 Bayes risk bounds, 109 expansion of, 109 Bayes rule, 23, 36, 39 for 0-1 loss, 42 Bayesian analysis default, 155 exploratory, 52 objective, 30, 36, 55, 121, 147 subjective, 55, 147 Bayesian approach subjective, 36 Bayesian computations, 222
Bayesian decision theory, 41 Bayesian inference, 41 Bayesian information criterion (BIC), 114, 162 Bayesian model averaging estimate, 278 Bayesian model averaging, BMA, 278 Bayesian paradigm, 57 Behrens-Fisher problem, 62, 240 belief functions, 57 best unbiased estimate, see estimate bias, 11 BIC, 160-163, 179, 274, 276 in high-dimensional problems, 274 Schwarz, 118 Bickel prior, 155 Birnbaum's theorem, 307 Bootstrap, 23 Box-Muller method, 233 calibration of P-value, 164 central limit theorem, 6, 212, 231 Chebyshev's inequality, 101 chi-squared test, 170 class of priors, 66, 73-97, 121, 155, 165, 186 conjugate, 74, 171 density ratio, 75 elliptically symmetric unimodal, 169, 170 ε-contamination, 75, 86 extreme points, 166 group invariant, 174 mixture of conjugate, 172 mixture of uniforms, 168 natural conjugate, 165 nonparametric, 89 normal, 74, 167, 188 parametric, 89 scale mixtures of normal, 167 spherically symmetric, 174 symmetric, 74 symmetric star-unimodal, 170 symmetric uniform, 166, 167 unimodal spherically symmetric, 90, 168, 170, 171 unimodal symmetric, 74, 167 classical statistics, 36, 159 coherence, 66, 70, 138, 147, 148, 311 complete class, 36
complete orthonormal system, 293 completeness, 10 compound decision problem, 286 conditional inference, 38 conditional prior density, 160 conditionality principle, 38 conditionally autoregressive, 290 conditioning, 38, 224 confidence interval, 20, 34 PEB, 262 confidence set, 21 conjugate prior, see prior, 242 consistency, 7 of posterior distribution, 100 convergence, 209, 213, 215, 218, 223, 231 in probability, 7, 211 correlation coefficient, 155 countable additivity, 311 countable state space, 216-218 coverage probability, 34, 262 credibility, 49 credible interval, 48, 258 HB, 262 HPD, 42, 49 predictive, 50 credible region HPD, 244 credible set, 48 cross validation, 36 curse of dimensionality, 206 data analysis, 57 data augmentation, 208 data smoothing, 289 Daubechies wavelets, 294 de Finetti's theorem, 54, 149 decision function, 22, 39, 68 decision problem, 21, 65, 67 decision rule, 22, 36 admissible, 36 Bayes, 39 minimax, 23 decision theory, 276 classical, 36 delta method, 15 density, 303 posterior, 210
predictive, 208 dichotomous, 245 Dirichlet multinomial allocation, 289, 299 prior, 300 discrete wavelet transform, 296 disease mapping, 289 distribution Bernoulli, 2, 306 Beta, 32, 304 binomial, 2, 306 Cauchy, 5, 304 chi-square, 304 conditional predictive, 183 Dirichlet, 305 double exponential, 303 exponential, 2, 303 F, 305 Gamma, 304 geometric, 3, 306 inverse Gamma, 305 Laplace, 303 limit, 215 logistic, 306 mixture of normal, 2 multinomial, 3, 306 multivariate t, 304 multivariate normal, 303 negative binomial, 3, 306 non-central Student's t, 175 noncentral chi-square, 174 normal, 1, 303 Poisson, 2, 306 posterior predictive, 182 predictive, 50 prior predictive, 180 Student's t, 304 t, 304 uniform, 5, 304 Wishart, 153, 305 divergence measure, 86 chi-squared, 86 directed, 86 generalized Bhattacharya, 86 Hellinger, 86 J-divergence, 86 Kagan's, 86 Kolmogorov's, 86 Kullback-Leibler, 86, 126, 146
power-weighted, 86 Doeblin irreducibility, 217 double use of data, 182, 183 elicitation nonparametric, 149 of hyperparameters, 150 of prior, 149 elliptical symmetry, 169 empirical Bayes, 274, 283, 290, 298 constrained, 283, 284 parametric, 36, 54, 255, 260 ergodic, 215 estimate √n-consistent, 8 approximately best unbiased, 15 best unbiased, 13 inconsistent, 52 method of moments, 270 Rao-Blackwellized, 235 shrinkage, 34 unbiased, 11 uniformly minimum variance unbiased, 13 estimation nonparametric, 289 simultaneous, 290 exchangeability, 29, 54, 122, 149, 256, 257, 265, 290 partial, 255 exploratory Bayesian analysis, see Bayesian analysis exponential family, 4, 6, 7, 10, 14, 17, 132 factorization theorem, 9, 315 false discovery, 53 false discovery rate, 272 father wavelet, 293 FBF, 191, 192 finite additivity, 311 Fisher information, 12, 47, 99, 102, 125 expected, 102 minimum, 155 observed, 99, 162 fractional Bayes factor, FBF, 191 frequency property, 50 frequentist, 170 conditional, 51
frequentist validation, 36, 58 of Bayesian analysis, 100 full conditionals, 222, 232, 290 fully exponential, 208 gamma minimax, 91 GBIC, 274, 276 gene expression, 53, 269, 314 generic target distribution, 219 geometric perturbation, 89 Gibbs sampler, 220-226, 232, 251 Gibbs sampling, 206, 220-222, 260 GIBF, 191, 192 global robustness, 76, 93 graphical model structure, 279, 281 group of transformations, 137, 172 Haar measure, 139 left invariant, 123, 136, 144 right invariant, 123, 136, 144, 146 Haar wavelet, 293, 302 Hammersley-Clifford theorem, 222 Harris irreducibility, 217 hierarchical Bayes, 22, 54, 222, 227, 240, 242, 260, 289, 290, 296, 298 hierarchical modeling, 215 hierarchical prior, see prior high-dimensional estimation, 276 PEB, HB, 269 high-dimensional multiple testing, 269 PEB, HB, 269 high-dimensional prediction, 276 high-dimensional problem, 15, 35, 140, 159, 214, 289 highest posterior density, 42 Hunt-Stein condition, 139 Hunt-Stein theorem, 139, 176 hypothesis testing, 11, 16-20, 41, 159, 163 IBF, 191, 192, 194 identifiability, 271 lack of, 53, 232 importance function, 214 importance sampling, 213 inequality Cramer-Rao, 12 information, 12 interval estimation, 22, 41
invariance, 51, 123, 136, 148 invariance principle, 124 invariant prior, 172 invariant test, 140, 172, 173, 176 inverse c.d.f. method, 233 James-Stein estimate, 262, 265, 267, 284 positive part, 265, 267 James-Stein-Lindley estimate, 261, 264, 268 positive part, 262 Kullback-Leibler distance, 209 latent variable, 251 law of large numbers, 6, 211, 231 for Markov chains, 217 second fundamental, 100 weak, 100 least squares, 58 likelihood equation, 8, 104 likelihood function, 7, 8, 29, 121 likelihood principle, 38, 147, 148, 307, 308 likelihood ratio, 7, 178, 179 weighted, 185 likelihood ratio statistic, 20 Lindley-Bernardo functional, 141 Lindley-Bernardo information, 142 linear model, 241, 263 generalized, 245 linear perturbation, 85, 86 linear regression, 160, 241 link function, 246 local robustness, 85, 93 location parameter, 40 location-scale family, 5-7, 136 log-concave, 8 logistic regression, 245, 247 logit model, 245, 246, 251 long run relative frequency, 29 loss, 29 0-1, 22, 276 absolute error, 41 posterior expected, 92 squared error, 22 Stein's, 267 loss function, 22, 38, 65-73, 92, 276 loss robustness, 92-93
low-dimensional problem, 121, 159 lower bound Cramer-Rao, 12 on Bayes factor, 167, 172 over classes of priors, 165, 166 machine learning, 57 marginalization paradox, 138 Markov chain, 215-234 aperiodic, 218-221 ergodic theorem, 231 irreducible, 217, 218, 220, 221, 223 stationary distribution of, 217 Markov Chain Monte Carlo, see MCMC Markov property, 215 maximal invariant, 172, 173 maximum likelihood estimate, see MLE MCMC, 37, 206, 215, 218, 223, 224, 240, 256, 259, 260, 263, 278, 291 convergence of, 217 independent runs, 218 reversible jump, 229 mean arithmetic, 146 geometric, 146 mean squared error, MSE, 39 measure of accuracy, 37 measure of information, 122 Bernardo's, 121, 129 in prior, 122 measures of evidence, 43, 163, 164, 166, 179 Bayesian, 165, 167 median probability model, 279-281, 283 metric Euclidean, 125 Hellinger, 125 Riemannian, 125 MIBF, 191, 192 microarray, 53, 269, 271, 272, 313, 314 minimal sufficiency, 308 minimax, 139 minimax estimate, 267 minimax test, 139 minimaxity, 176 MLE, 7, 8, 20, 102, 106, 162, 255, 290 model checking, 160, 161, 180 model criticism, 159 model departure statistic, 180, 183
model robustness, 23, 93 model selection, 36, 159-160, 185, 194, 229, 256, 273, 276, 283 moderate deviations, 179 monotone likelihood ratio, 25 Monte Carlo, 298 Monte Carlo importance sampling, 215 Monte Carlo sampling, 211, 214, 215, 240 mother wavelet, 293 MRA, 293-295, 302 multi-resolution analysis, see MRA multiple testing, 256, 273 high-dimensional, 272 NPEB, 271 multivariate symmetry, 170 multivariate unimodality, 170 nested model, 189, 273, 279 Neyman-Pearson lemma, 17 Neyman-Pearson theory, 17 Neyman-Scott problem, 52 nonparametric Bayes, 289 nonparametric estimate of prior, 271 nonparametric regression, 160, 279, 284, 292, 295 normal approximation to posterior distribution, 101 normal linear model, 251 NPEB, 272 nuisance parameter, 51 null hypothesis interval, 176 precise, 165, 186 sharp, 35, 41, 177 numerical integration, 205, 214, 226, 298 objective Bayesian analysis, see Bayesian analysis high-dimensional, 269 Occam's window, 278 odds ratio posterior, 31, 35, 160 prior, 160 one-sided test, 176 orthogonal parameters, 47 outlier, 6, 185-188
P-value, 26, 163-202 Bayesian, 159, 161, 181 conditional predictive, 183, 184 partial posterior predictive, 184 partial predictive, 185 posterior predictive, 181, 182 prior predictive, 180 paradox, 36 Jeffreys-Lindley, 177, 178 parameter space, 38 parametric alternative, 170 pay-off, 67 expected, 71 φ-divergence, 86 curvature of, 89 pivotal quantity, 21 point estimation, 41 positivity condition, 222, 223 posterior density, 31 posterior dispersion matrix, 42 posterior distribution, 31 conditional, 215 improper, 232 marginal, 215 proper, 33, 122, 291 quantiles of, 205 tails of, 213 target, 215 posterior mean, 31, 41 posterior median, 41 posterior mode, 41 posterior normality, 99, 103, 115, 258 in Kullback-Leibler sense, 127 posterior odds ratio, see odds ratio posterior quantiles, 41 posterior standard deviation, 31 posterior variance, 31 power of test, 16 prediction loss, 50 prediction rule, 54 Bayes, 62 predictive ability, 36 preference, 65 ordering, 65, 73 Bayesian, 73 coherent, 73 partial, 73 total, 73 relation, 65, 66
prevision, 311 prior, 30 compactly supported, 155 conjugate, 132, 134, 135, 215, 259 mixture of, 75, 135, 215 conventional, 29 default, 191 Dirichlet, 62 elicitation of, 121 finitely additive, 41, 71, 148 hierarchical, 53, 215, 222, 233, 242, 256 conjugate, 247 improper, 29, 40, 122, 147, 233 intrinsic, 191, 194-196 Jeffreys, 33, 34, 49, 56, 122, 125, 128-130, 134, 140, 144, 148 least favorable, 165 left invariant, 138, 140 modified Jeffreys, 193 multivariate Cauchy, 273 noninformative, 29, 121, 124, 147 objective, 29, 34, 36, 40, 49, 55, 121, 136, 140, 148, 155 probability matching, 36, 49, 56, 129, 131, 132, 148, 262 reference, 33, 34, 56, 123, 129, 140, 142, 148, 155 right invariant, 138, 140 smooth Cauchy, 274 subjective, 36, 121 uniform, 33, 34, 41, 122 Zellner's g-prior, 274 Zellner-Siow, 275 prior belief, 34, 66 quantification of, 23, 29 prior density, 31 prior distribution, 30 prior knowledge, 36 partial, 36 probability acceptance, 219, 220 objective, 29 subjective, 29 upper and lower, 57 probit model, 245, 246, 251 profile likelihood, 51, 52 proposal density, 247
random number generation, 214 randomization, 20 Rao-Blackwell theorem, 13, 223, 224 ratio of integrals, 206 rational behavior, 65, 66 rationality axioms, 55, 65, 66 regression locally linear, 300 regression function, 160, 292 regression model, 241 regular family, 6 regularity conditions Cramer-Rao type, 6, 8 relative risk, 289 resolution level, 295, 296, 302 right invariant, 136 risk frequentist, 91 integrated, 66 posterior, 39, 65 preposterior, 39 unbiased estimation of, 15 risk function, 22, 36 risk set, 20 robust Bayes, 57, 165, 185 robustness, 6, 36, 65, 71-96, 121, 155, 205, 296 Bayesian frequentist approach, 91 measures of, 74 of posterior inference, 101 scale parameter, 40 scaling function, 293, 295 sensitivity, 65, 66, 72 analysis, 72 local, 85, 89 measures of, 76 overall, 85 relative, 84 Shannon entropy, 123, 124, 126 shrinkage, 261 shrinkage estimate, 259, 264 simulation, 23, 218 single run, 224 smoothing techniques, 295 spatial correlation, 290 spatial modeling, 289 spectral decomposition, 297
spherical symmetry, 169 state space, 216 stationary, 216, 217 stationary distribution, 217, 220, 221 stationary transition kernel, 217 statistical computing, 206 statistical decision theory, 38 Bayesian, 39 classical, 38 statistical learning, 57 Stein's example, 255 Stein's identity, 265, 268 Stone's problem, 273, 274 stopping rule paradox, 38 stopping rule principle, 148 strongly consistent solution, 104 subjective probability, 66 elicitation of, 55 sufficiency Bayes, 315 sufficiency principle, 38, 308 sufficient statistic, 9, 10, 224, 315 complete, 10, 13, 261 minimal, 9, 10, 38, 132 tail area, 182, 183 target distribution, 220, 224 target model, 180 test conditional, 51 likelihood ratio, 179 minimax, 20 non-randomized, 20 randomized, 20 unbiased, 19 uniformly most powerful, 17 test for association, 117, 203 test of goodness of fit, 170, 203 test statistic, 164 testing for normality, 160 time homogeneous, 216 total variation, 85 training sample, 189 minimal, 190 transition function, 216, 220 transition kernel, 216 transition probability invariant, 216
matrix, 216, 217 proposal, 219 stationary, 216 type 1 error, 16 type 2 error, 16 type 2 maximum likelihood, 77 utility, 29, 65-73 variable selection, 279
variance reduction, 224 wavelet, 289, 292-299 compactly supported, 294 wavelet basis, 293 wavelet decomposition, 295 wavelet smoother, 298 weak conditionality principle, 308 weak sufficiency principle, 308 WinBUGS, 291
Springer Texts in Statistics
(continued from page ii)
Lehmann and Romano: Testing Statistical Hypotheses, Third Edition
Lehmann and Casella: Theory of Point Estimation, Second Edition
Lindman: Analysis of Variance in Experimental Design
Lindsey: Applying Generalized Linear Models
Madansky: Prescriptions for Working Statisticians
McPherson: Applying and Interpreting Statistics: A Comprehensive Guide, Second Edition
Mueller: Basic Principles of Structural Equation Modeling: An Introduction to LISREL and EQS
Nguyen and Rogers: Fundamentals of Mathematical Statistics: Volume I: Probability for Statistics
Nguyen and Rogers: Fundamentals of Mathematical Statistics: Volume II: Statistical Inference
Noether: Introduction to Statistics: The Nonparametric Way
Nolan and Speed: Stat Labs: Mathematical Statistics Through Applications
Peters: Counting for Something: Statistical Principles and Personalities
Pfeiffer: Probability for Applications
Pitman: Probability
Rawlings, Pantula and Dickey: Applied Regression Analysis
Robert: The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation, Second Edition
Robert and Casella: Monte Carlo Statistical Methods
Rose and Smith: Mathematical Statistics with Mathematica
Ruppert: Statistics and Finance: An Introduction
Santner and Duffy: The Statistical Analysis of Discrete Data
Saville and Wood: Statistical Methods: The Geometric Approach
Sen and Srivastava: Regression Analysis: Theory, Methods, and Applications
Shao: Mathematical Statistics, Second Edition
Shorack: Probability for Statisticians
Shumway and Stoffer: Time Series Analysis and Its Applications: With R Examples, Second Edition
Simonoff: Analyzing Categorical Data
Terrell: Mathematical Statistics: A Unified Introduction
Timm: Applied Multivariate Analysis
Toutenburg: Statistical Analysis of Designed Experiments, Second Edition
Wasserman: All of Nonparametric Statistics
Wasserman: All of Statistics: A Concise Course in Statistical Inference
Weiss: Modeling Longitudinal Data
Whittle: Probability via Expectation, Fourth Edition
Zacks: Introduction to Reliability Analysis: Probability Models and Statistical Methods