Bayesian Logical Data Analysis for the Physical Sciences: A Comparative Approach with Mathematica™ Support
Increasingly, researchers in many branches of science are coming into contact with Bayesian statistics or Bayesian probability theory. By encompassing both inductive and deductive logic, Bayesian analysis can improve model parameter estimates by many orders of magnitude. It provides a simple and unified approach to all data analysis problems, allowing the experimenter to assign probabilities to competing hypotheses of interest, on the basis of the current state of knowledge. This book provides a clear exposition of the underlying concepts with large numbers of worked examples and problem sets. The book also discusses numerical techniques for implementing the Bayesian calculations, including an introduction to Markov chain Monte Carlo integration and linear and nonlinear least-squares analysis seen from a Bayesian perspective. In addition, background material is provided in appendices, and supporting Mathematica notebooks are available from www.cambridge.org/052184150X, providing an easy learning route for upper-undergraduate and graduate students, or any serious researcher in the physical sciences or engineering.
PHIL GREGORY is Professor Emeritus at the Department of Physics and Astronomy at the University of British Columbia.
P. C. Gregory
Department of Physics and Astronomy, University of British Columbia
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo
Cambridge University Press
The Edinburgh Building, Cambridge, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521841504
© Cambridge University Press 2005
This book is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published in print format 2005
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this book, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
Disclaimer of warranty
We make no warranties, express or implied, that the programs contained in this volume are free of error, or are consistent with any particular standard of merchantability, or that they will meet your requirements for any particular application. They should not be relied on for solving a problem whose incorrect solution could result in injury to a person or loss of property. If you do use the programs in such a manner, it is at your own risk. The authors and publisher disclaim all liability for direct or consequential damages resulting from your use of the programs.
Contents
Preface
Software support
Acknowledgements
1 Role of probability theory in science
1.1 Scientific inference
1.2 Inference requires a probability theory
1.2.1 The two rules for manipulating probabilities
1.3 Usual form of Bayes’ theorem
1.3.1 Discrete hypothesis space
1.3.2 Continuous hypothesis space
1.3.3 Bayes’ theorem – model of the learning process
1.3.4 Example of the use of Bayes’ theorem
1.4 Probability and frequency
1.4.1 Example: incorporating frequency information
1.5 Marginalization
1.6 The two basic problems in statistical inference
1.7 Advantages of the Bayesian approach
1.8 Problems
2 Probability theory as extended logic
2.1 Overview
2.2 Fundamentals of logic
2.2.1 Logical propositions
2.2.2 Compound propositions
2.2.3 Truth tables and Boolean algebra
2.2.4 Deductive inference
2.2.5 Inductive or plausible inference
2.3 Brief history
2.4 An adequate set of operations
2.4.1 Examination of a logic function
2.5 Operations for plausible inference
2.5.1 The desiderata of Bayesian probability theory
2.5.2 Development of the product rule
2.5.3 Development of sum rule
2.5.4 Qualitative properties of product and sum rules
2.6 Uniqueness of the product and sum rules
2.7 Summary
2.8 Problems
3 The how-to of Bayesian inference
3.1 Overview
3.2 Basics
3.3 Parameter estimation
3.4 Nuisance parameters
3.5 Model comparison and Occam’s razor
3.6 Sample spectral line problem
3.6.1 Background information
3.7 Odds ratio
3.7.1 Choice of prior p(T|M1, I)
3.7.2 Calculation of p(D|M1, T, I)
3.7.3 Calculation of p(D|M2, I)
3.7.4 Odds, uniform prior
3.7.5 Odds, Jeffreys prior
3.8 Parameter estimation problem
3.8.1 Sensitivity of odds to Tmax
3.9 Lessons
3.10 Ignorance priors
3.11 Systematic errors
3.11.1 Systematic error example
3.12 Problems
4 Assigning probabilities
4.1 Introduction
4.2 Binomial distribution
4.2.1 Bernoulli’s law of large numbers
4.2.2 The gambler’s coin problem
4.2.3 Bayesian analysis of an opinion poll
4.3 Multinomial distribution
4.4 Can you really answer that question?
4.5 Logical versus causal connections
4.6 Exchangeable distributions
4.7 Poisson distribution
4.7.1 Bayesian and frequentist comparison
4.8 Constructing likelihood functions
4.8.1 Deterministic model
4.8.2 Probabilistic model
4.9 Summary
4.10 Problems
5 Frequentist statistical inference
5.1 Overview
5.2 The concept of a random variable
5.3 Sampling theory
5.4 Probability distributions
5.5 Descriptive properties of distributions
5.5.1 Relative line shape measures for distributions
5.5.2 Standard random variable
5.5.3 Other measures of central tendency and dispersion
5.5.4 Median baseline subtraction
5.6 Moment generating functions
5.7 Some discrete probability distributions
5.7.1 Binomial distribution
5.7.2 The Poisson distribution
5.7.3 Negative binomial distribution
5.8 Continuous probability distributions
5.8.1 Normal distribution
5.8.2 Uniform distribution
5.8.3 Gamma distribution
5.8.4 Beta distribution
5.8.5 Negative exponential distribution
5.9 Central Limit Theorem
5.10 Bayesian demonstration of the Central Limit Theorem
5.11 Distribution of the sample mean
5.11.1 Signal averaging example
5.12 Transformation of a random variable
5.13 Random and pseudo-random numbers
5.13.1 Pseudo-random number generators
5.13.2 Tests for randomness
5.14 Summary
5.15 Problems
6 What is a statistic?
6.1 Introduction
6.2 The χ² distribution
6.3 Sample variance S²
6.4 The Student’s t distribution
6.5 F distribution (F-test)
6.6 Confidence intervals
6.6.1 Variance σ² known
6.6.2 Confidence intervals for μ, unknown variance
6.6.3 Confidence intervals: difference of two means
6.6.4 Confidence intervals for σ²
6.6.5 Confidence intervals: ratio of two variances
6.7 Summary
6.8 Problems
7 Frequentist hypothesis testing
7.1 Overview
7.2 Basic idea
7.2.1 Hypothesis testing with the χ² statistic
7.2.2 Hypothesis test on the difference of two means
7.2.3 One-sided and two-sided hypothesis tests
7.3 Are two distributions the same?
7.3.1 Pearson χ² goodness-of-fit test
7.3.2 Comparison of two-binned data sets
7.4 Problem with frequentist hypothesis testing
7.4.1 Bayesian resolution to optional stopping problem
7.5 Problems
8 Maximum entropy probabilities
8.1 Overview
8.2 The maximum entropy principle
8.3 Shannon’s theorem
8.4 Alternative justification of MaxEnt
8.5 Generalizing MaxEnt
8.5.1 Incorporating a prior
8.5.2 Continuous probability distributions
8.6 How to apply the MaxEnt principle
8.6.1 Lagrange multipliers of variational calculus
8.7 MaxEnt distributions
8.7.1 General properties
8.7.2 Uniform distribution
8.7.3 Exponential distribution
8.7.4 Normal and truncated Gaussian distributions
8.7.5 Multivariate Gaussian distribution
8.8 MaxEnt image reconstruction
8.8.1 The kangaroo justification
8.8.2 MaxEnt for uncertain constraints
8.9 Pixon multiresolution image reconstruction
8.10 Problems
9 Bayesian inference with Gaussian errors
9.1 Overview
9.2 Bayesian estimate of a mean
9.2.1 Mean: known noise σ
9.2.2 Mean: known noise, unequal σ
9.2.3 Mean: unknown noise σ
9.2.4 Bayesian estimate of σ
9.3 Is the signal variable?
9.4 Comparison of two independent samples
9.4.1 Do the samples differ?
9.4.2 How do the samples differ?
9.4.3 Results
9.4.4 The difference in means
9.4.5 Ratio of the standard deviations
9.4.6 Effect of the prior ranges
9.5 Summary
9.6 Problems
10 Linear model fitting (Gaussian errors)
10.1 Overview
10.2 Parameter estimation
10.2.1 Most probable amplitudes
10.2.2 More powerful matrix formulation
10.3 Regression analysis
10.4 The posterior is a Gaussian
10.4.1 Joint credible regions
10.5 Model parameter errors
10.5.1 Marginalization and the covariance matrix
10.5.2 Correlation coefficient
10.5.3 More on model parameter errors
10.6 Correlated data errors
10.7 Model comparison with Gaussian posteriors
10.8 Frequentist testing and errors
10.8.1 Other model comparison methods
10.9 Summary
10.10 Problems
11 Nonlinear model fitting
11.1 Introduction
11.2 Asymptotic normal approximation
11.3 Laplacian approximations
11.3.1 Bayes factor
11.3.2 Marginal parameter posteriors
11.4 Finding the most probable parameters
11.4.1 Simulated annealing
11.4.2 Genetic algorithm
11.5 Iterative linearization
11.5.1 Levenberg–Marquardt method
11.5.2 Marquardt’s recipe
11.6 Mathematica example
11.6.1 Model comparison
11.6.2 Marginal and projected distributions
11.7 Errors in both coordinates
11.8 Summary
11.9 Problems
12 Markov chain Monte Carlo
12.1 Overview
12.2 Metropolis–Hastings algorithm
12.3 Why does Metropolis–Hastings work?
12.4 Simulated tempering
12.5 Parallel tempering
12.6 Example
12.7 Model comparison
12.8 Towards an automated MCMC
12.9 Extrasolar planet example
12.9.1 Model probabilities
12.9.2 Results
12.10 MCMC robust summary statistic
12.11 Summary
12.12 Problems
13 Bayesian revolution in spectral analysis
13.1 Overview
13.2 New insights on the periodogram
13.2.1 How to compute p(f|D, I)
13.3 Strong prior signal model
13.4 No specific prior signal model
13.4.1 X-ray astronomy example
13.4.2 Radio astronomy example
13.5 Generalized Lomb–Scargle periodogram
13.5.1 Relationship to Lomb–Scargle periodogram
13.5.2 Example
13.6 Non-uniform sampling
13.7 Problems
14 Bayesian inference with Poisson sampling
14.1 Overview
14.2 Infer a Poisson rate
14.2.1 Summary of posterior
14.3 Signal + known background
14.4 Analysis of ON/OFF measurements
14.4.1 Estimating the source rate
14.4.2 Source detection question
14.5 Time-varying Poisson rate
14.6 Problems
Appendix A Singular value decomposition
Appendix B Discrete Fourier Transforms
B.1 Overview
B.2 Orthogonal and orthonormal functions
B.3 Fourier series and integral transform
B.3.1 Fourier series
B.3.2 Fourier transform
B.4 Convolution and correlation
B.4.1 Convolution theorem
B.4.2 Correlation theorem
B.4.3 Importance of convolution in science
B.5 Waveform sampling
B.6 Nyquist sampling theorem
B.6.1 Astronomy example
B.7 Discrete Fourier Transform
B.7.1 Graphical development
B.7.2 Mathematical development of the DFT
B.7.3 Inverse DFT
B.8 Applying the DFT
B.8.1 DFT as an approximate Fourier transform
B.8.2 Inverse discrete Fourier transform
B.9 The Fast Fourier Transform
B.10 Discrete convolution and correlation
B.10.1 Deconvolving a noisy signal
B.10.2 Deconvolution with an optimal Wiener filter
B.10.3 Treatment of end effects by zero padding
B.11 Accurate amplitudes by zero padding
B.12 Power-spectrum estimation
B.12.1 Parseval’s theorem and power spectral density
B.12.2 Periodogram power-spectrum estimation
B.12.3 Correlation spectrum estimation
B.13 Discrete power spectral density estimation
B.13.1 Discrete form of Parseval’s theorem
B.13.2 One-sided discrete power spectral density
B.13.3 Variance of periodogram estimate
B.13.4 Yule’s stochastic spectrum estimation model
B.13.5 Reduction of periodogram variance
B.14 Problems
Appendix C Difference in two samples
C.1 Outline
C.2 Probabilities of the four hypotheses
C.2.1 Evaluation of p(C, S|D1, D2, I)
C.2.2 Evaluation of p(C, S|D1, D2, I)
C.2.3 Evaluation of p(C, S|D1, D2, I)
C.2.4 Evaluation of p(C, S|D1, D2, I)
C.3 The difference in the means
C.3.1 The two-sample problem
C.3.2 The Behrens–Fisher problem
C.4 The ratio of the standard deviations
C.4.1 Estimating the ratio, given the means are the same
C.4.2 Estimating the ratio, given the means are different
Appendix D Poisson ON/OFF details
D.1 Derivation of p(s|Non, I)
D.1.1 Evaluation of Num
D.1.2 Evaluation of Den
D.2 Derivation of the Bayes factor B_{s+b,b}
Appendix E Multivariate Gaussian from maximum entropy
References
Index
Preface
The goal of science is to unlock nature’s secrets. This involves the identification and understanding of nature’s observable structures or patterns. Our understanding comes through the development of theoretical models which are capable of explaining the existing observations as well as making testable predictions. The focus of this book is on what happens at the interface between the predictions of scientific models and the data from the latest experiments. The data are always limited in accuracy and incomplete (we always want more), so we are unable to employ deductive reasoning to prove or disprove the theory. How do we proceed to extend our theoretical framework of understanding in the face of this? Fortunately, a variety of sophisticated mathematical and computational approaches have been developed to help us through this interface; these go under the general heading of statistical inference. Statistical inference provides a means for assessing the plausibility of one or more competing models, and estimating the model parameters and their uncertainties. These topics are commonly referred to as “data analysis” in the jargon of most physicists.
We are currently in the throes of a major paradigm shift in our understanding of statistical inference based on a powerful theory of extended logic. For historical reasons, it is referred to as Bayesian Inference or Bayesian Probability Theory. To get a taste of how significant this development is, consider the following: probabilities are commonly quantified by a real number between 0 and 1. The end-points, corresponding to absolutely false and absolutely true, are simply the extreme limits of this infinity of real numbers. Deductive logic, which is based on axiomatic knowledge, corresponds to these two extremes of 0 and 1. Ask any mathematician or physicist how important deductive logic is to their discipline!
Now try to imagine what you might achieve with a theory of extended logic that encompassed the whole range from 0 to 1. This is exactly what is needed in science and real life where we never know anything is absolutely true or false. Of course, the field of probability has been around for years, but what is new is the appreciation that the rules of probability are not merely rules for manipulating random variables. They are now recognized as uniquely valid principles of logic, for conducting inference about any proposition or hypothesis of interest. Ordinary deductive logic is just a special case in the idealized limit of complete information. The reader should be warned that most books on Bayesian statistics do not make the connection between probability theory and logic. This connection, which is captured in the book by physicist E. T. Jaynes, Probability Theory – The Logic of Science,¹ is particularly appealing because of the unifying principles it provides for scientific reasoning.
What are the important consequences of this development? We are only beginning to see the tip of the iceberg. Already we have seen that for data with a high signal-to-noise ratio, a Bayesian analysis can frequently yield many orders of magnitude improvement in model parameter estimation, through the incorporation of relevant prior information about the signal model. For several dramatic demonstrations of this point, have a look at the first four sections of Chapter 13. It also provides a more powerful way of assessing competing theories at the forefront of science by quantifying Occam’s razor, and sheds a new light on systematic errors (e.g., Section 3.11). For some problems, a Bayesian analysis may simply lead to a familiar statistic. Even in this situation it often provides a powerful new insight concerning the interpretation of the statistic. But most importantly, Bayesian analysis provides an elegantly simple and rational approach for answering any scientific question for a given state of information.
This textbook is based on a measurement theory course which is aimed at providing first year graduate students in the physical sciences with the tools to help them design, simulate and analyze experimental data. The material is presented at a mathematical level that should make it accessible to physical science undergraduates in their final two years. Each chapter begins with an overview and most end with a summary. The book contains a large number of problems, worked examples and 132 illustrations.
The Bayesian paradigm is becoming very visible at international meetings of physicists and astronomers (e.g., Statistical Challenges in Modern Astronomy III, edited by E. D. Feigelson and G. J. Babu, 2002). However, the majority of scientists are still not at home with the topic and much of the current scientific literature still employs the conventional “frequentist” statistical paradigm. This book is an attempt to help new students to make the transition while at the same time exposing them in Chapters 5, 6, and 7 to some of the essential ideas of the frequentist statistical paradigm that will allow them to comprehend much of the current and earlier literature and interface with their research supervisors. This also provides an opportunity to compare and contrast the two different approaches to statistical inference. No previous background in statistics is required; in fact, Chapter 6 is entitled “What is a statistic?” For the reader seeking an abridged version of Bayesian inference, Chapter 3 provides a stand-alone introduction on the “How-to of Bayesian inference.”
¹ Early versions of this much celebrated work by Jaynes have been in circulation since at least 1988. The book was finally submitted for publication in 2002, four years after his death, through the efforts of his former student G. L. Bretthorst. The book is published by Cambridge University Press (Jaynes, 2003, edited by G. L. Bretthorst).
The book begins with a look at the role of statistical inference in the scientific method and the fundamental ideas behind Bayesian Probability Theory (BPT). We next consider how to encode a given state of information into the form of a probability distribution, for use as a prior or likelihood function in Bayes’ theorem. We demonstrate why the Gaussian distribution arises in nature so frequently from a study of the Central Limit Theorem and gain powerful new insight into the role of the Gaussian distribution in data analysis from the Maximum Entropy Principle. We also learn how a quantified Occam’s razor is automatically incorporated into any Bayesian model comparison and come to understand it at a very fundamental level. Starting from Bayes’ theorem, we learn how to obtain unique and optimal solutions to any well-posed inference problem. With this as a foundation, many common analysis techniques such as linear and nonlinear model fitting are developed and their limitations appreciated. The Bayesian solution to a problem is often very simple in principle; however, the calculations require integrals over the model parameter space, which can be very time consuming if there are a large number of parameters. Fortunately, the last decade has seen remarkable developments in practical algorithms for performing Bayesian calculations. Chapter 12 provides an introduction to the very powerful Markov chain Monte Carlo (MCMC) algorithms, and demonstrates an application of a new automated MCMC algorithm to the detection of extrasolar planets. Although the primary emphasis is on the role of probability theory in inference, there is also focus on an understanding of how to simulate the measurement process. This includes learning how to generate pseudo-random numbers with an arbitrary distribution (in Chapter 5).
Any linear measurement process can be modeled as a convolution of nature’s signal with the measurement point-spread-function, a process most easily dealt with using the convolution theorem of Fourier analysis. Because of the importance of this material, I have included Appendix B on the Discrete Fourier Transform (DFT), the Fast Fourier Transform (FFT), convolution and Wiener filtering. We consider the limitations of the DFT and learn about the need to zero pad in convolution to avoid aliasing. From the Nyquist Sampling Theorem we learn how to minimally sample the signal without losing information and what prefiltering of the signal is required to prevent aliasing. In Chapter 13, we apply probability theory to spectral analysis problems and gain a new insight into the role of the DFT, and explore a Bayesian revolution in spectral analysis. We also learn that with non-uniform data sampling, the effective bandwidth (the largest spectral window free of aliases) can be made much wider than for uniform sampling. The final chapter is devoted to Bayesian inference when our prior information leads us to model the probability of the data with a Poisson distribution.
Software support
The material in this book is designed to empower the reader in his or her search to unlock nature’s secrets. To do this efficiently, one needs both an understanding of the principles of extended logic, and an efficient computing environment for visualizing and mathematically manipulating the data. All of the course assignments involve the use of a computer. An increasing number of my students are exploiting the power of integrated platforms for programming, symbolic mathematical computations, and visualizing tools. Since the majority of my students opted to use Mathematica for their assignments, I adopted Mathematica as a default computing environment for the course. There are a number of examples in this book employing Mathematica commands, although the book has been designed to be complete without reference to these Mathematica examples. In addition, I have developed a Mathematica tutorial to support this book, specifically intended to help students and professional scientists with no previous experience with Mathematica to efficiently exploit it for data analysis problems. This tutorial also contains many worked examples and is available for download from http://www.cambridge.org/052184150X.
In any scientific endeavor, a great deal of effort is expended in graphically displaying the results for presentation and publication. To simplify this aspect of the problem, the Mathematica tutorial provides a large range of easy-to-use templates for publication-quality plotting.
It used to be the case that interpretative languages were not as useful as compiled languages such as C and Fortran for numerically intensive computations. The last few years have seen dramatic improvements in the speed of Mathematica. Wolfram Research now claims² that for most of Mathematica’s numerical analysis functionality (e.g., data analysis, matrix operations, numerical differential equation solvers, and graphics) Mathematica 5 operates on a par³ with Fortran or MATLAB code. In the author’s experience, the time required to develop and test programs with Mathematica is approximately 20 times shorter than the time required to write and debug the same program in Fortran or C, so the efficiency gain is truly remarkable.
² http://www.wolfram.com/products/mathematica/; newin5/performance/numericallinear.html.
³ Look up Mathematica gigaNumerics on the Web.
Acknowledgements
Most of the Bayesian material presented in this book I have learned from the works of Ed Jaynes, Larry Bretthorst, Tom Loredo, Steve Gull, John Skilling, Myron Tribus, Devinder Sivia, Jim Berger, and many others from the international community devoted to the study of Bayesian inference. On a personal note, I encountered Bayesian inference one day in 1989 when I found a monograph lying on the floor of the men’s washroom entitled Bayesian Spectrum Analysis and Parameter Estimation, by Larry Bretthorst. I was so enthralled with the book that I didn’t even try to find out whose it was for several weeks. Larry’s book led me to the work of his Ph.D. supervisor, Edwin T. Jaynes. I became hooked on this simple, elegant and powerful approach to scientific inference. For me, it was a breath of fresh air providing a logical framework for tackling any statistical inference question in an optimal way, in contrast to the recipe or cookbook approach of conventional statistical analysis. I would also like to acknowledge the proof reading and suggestions made by many students who were exposed to early versions of this manuscript, in particular, Iva Cheung for her very careful proof reading of the final draft. Finally, I am really grateful to my partner, Jackie, and our children, Rene, Neil, Erin, Melanie, Ted, and Laura, for their encouragement over the many years it took to complete this book.
1 Role of probability theory in science
1.1 Scientific inference
This book is primarily concerned with the philosophy and practice of inferring the laws of nature from experimental data and prior information. The role of inference in the larger framework of the scientific method is illustrated in Figure 1.1. In this simple model, the scientific method is depicted as a loop which is entered through initial observations of nature, followed by the construction of testable hypotheses or theories as to the working of nature, which give rise to the prediction of other properties to be tested by further experimentation or observation. The new data lead to the refinement of our current theories, and/or development of new theories, and the process continues.
The role of deductive inference¹ in this process, especially with regard to deriving the testable predictions of a theory, has long been recognized. Of course, any theory makes certain assumptions about nature which are assumed to be true and these assumptions form the axioms of the deductive inference process. The terms deductive inference and deductive reasoning are considered equivalent in this book. For example, Einstein’s Special Theory of Relativity rests on two important assumptions; namely, that the vacuum speed of light is a constant in all inertial reference frames and that the laws of nature have the same form in all inertial frames.
Unfortunately, experimental tests of theoretical predictions do not provide simple yes or no answers. Our state of knowledge is always incomplete: there are always more experiments that could be done, and the measurements are limited in their accuracy. Statistical inference is the process of inferring the truth of our theories of nature on the basis of incomplete information. In science we often make progress by starting with simple models. Usually nature is more complicated and we learn in what direction to modify our theories from the differences between the model predictions and the measurements. It is much like peeling off layers of an onion. At any stage in this iterative process, the still hidden layers give rise to differences from the model predictions which guide the next step.
¹ Reasoning from one proposition to another using the strong syllogisms of logic (see Section 2.2.4).
Figure 1.1 The scientific method. [Diagram labels: Testable Hypothesis (theory); Deductive Inference; Predictions; Observations; Data; Statistical (plausible) Inference; Hypothesis testing; Parameter estimation.]
1.2 Inference requires a probability theory
In science, the available information is always incomplete so our knowledge of nature is necessarily probabilistic. Two different approaches based on different definitions of probability will be considered. In conventional statistics, the probability of an event is identified with the long-run relative frequency of occurrence of the event. This is commonly referred to as the “frequentist” view. In this approach, probabilities are restricted to a discussion of random variables, quantities that can meaningfully vary throughout a series of repeated experiments. Two examples are:
1. A measured quantity which contains random errors.
2. Time intervals between successive radioactive decays.
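The frequentist definition can be made concrete with a short simulation of the second example. The following Python sketch is not taken from the book, which uses Mathematica; it assumes that the intervals between decays follow an exponential law, and the rate of 2 per unit time, the threshold of 0.5, the seed, and the function name relative_frequency are all hypothetical choices made for illustration. It estimates the probability that an interval exceeds the threshold as a relative frequency over many repeats and prints it next to the exact value.

```python
import math
import random


def relative_frequency(n_repeats, rate=2.0, threshold=0.5, seed=0):
    """Fraction of simulated decay intervals that are longer than `threshold`.

    Assumption (illustrative only): intervals are exponentially distributed
    with the given rate, i.e. with mean interval 1 / rate.
    """
    rng = random.Random(seed)  # seeded so that repeated runs agree
    hits = sum(rng.expovariate(rate) > threshold for _ in range(n_repeats))
    return hits / n_repeats


# Exact probability that an exponential interval exceeds the threshold.
exact = math.exp(-2.0 * 0.5)

# The relative frequency settles towards the exact value as repeats increase.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(n), exact)
```

In this picture the probability is a property of the long run of repeats; a single interval carries no probability of its own other than through the ensemble it belongs to.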
The role of random variables in frequentist statistics is detailed in Section 5.2.
In recent years, a new perception of probability has arisen in recognition that the mathematical rules of probability are not merely rules for calculating frequencies of random variables. They are now recognized as uniquely valid principles of logic for conducting inference about any proposition or hypothesis of interest. This more powerful viewpoint, “Probability Theory as Logic,” or Bayesian probability theory, is playing an increasingly important role in physics and astronomy. The Bayesian approach allows us to directly compute the probability of any particular theory or particular value of a model parameter, issues that the conventional statistical approach can attack only indirectly through the use of a random variable statistic. In this book, I adopt the approach which exposes probability theory as an extended theory of logic following the lead of E. T. Jaynes in his book,² Probability Theory – The Logic of Science (Jaynes, 2003). The two approaches employ different definitions of probability which must be carefully understood to avoid confusion.

² The book was finally submitted for publication four years after his death, through the efforts of his former student G. Larry Bretthorst.

Table 1.1 Frequentist and Bayesian approaches to probability.

Approach: Frequentist statistical inference
Probability definition: p(A) = long-run relative frequency with which A occurs in identical repeats of an experiment. “A” restricted to propositions about random variables.

Approach: Bayesian inference
Probability definition: p(A|B) = a real number measure of the plausibility of a proposition/hypothesis A, given (conditional on) the truth of the information represented by proposition B. “A” can be any logical proposition, not restricted to propositions about random variables.

The two different approaches to statistical inference are outlined in Table 1.1 together with their underlying definition of probability. In this book, we will be primarily concerned with the Bayesian approach. However, since much of the current scientific culture is based on “frequentist” statistical inference, some background in this approach is useful. The frequentist definition contains the term “identical repeats.” Of course the repeated experiments can never be identical in all respects. The Bayesian definition of probability involves the rather vague sounding term “plausibility,” which must be given a precise meaning (see Chapter 2) for the theory to provide quantitative results. In Bayesian inference, a probability distribution is an encoding of our uncertainty about some model parameter or set of competing theories, based on our current state of information. The approach taken to achieve an operational definition of probability, together with consistent rules for manipulating probabilities, is discussed in the next section and details are given in Chapter 2. In this book, we will adopt the plausibility definition³ of probability given in Table 1.1 and follow the approach pioneered by E. T. Jaynes that provides for a unified picture of both deductive and inductive logic. In addition, Jaynes brought
3
Even within the Bayesian statistical literature, other definitions of probability exist. An alternative definition commonly employed is the following: ‘‘probability is a measure of the degree of belief that any well-defined proposition (an event) will turn out to be true.’’ The events are still random variables, but the term is generalized so it can refer to the distribution of results from repeated measurements, or, to possible values of a physical parameter, depending on the circumstances. The concept of a coherent bet (e.g., D’Agostini, 1999) is often used to define the value of probability in an operational way. In practice, the final conditional posteriors are the same as those obtained from the extended logic approach adopted in this book.
4
Role of probability theory in science
great clarity to the debate on objectivity and subjectivity with the statement, ‘‘the only thing objectivity requires of a scientific approach is that experimenters with the same state of knowledge reach the same conclusion.’’ More on this later.
1.2.1 The two rules for manipulating probabilities

It is now routine to build or program a computer to execute deductive logic. The goal of Bayesian probability theory as employed in this book is to provide an extension of logic to handle situations where we have incomplete information so we may arrive at the relative probabilities of competing hypotheses for a given state of information. Cox and Jaynes showed that the desired extension can be arrived at uniquely from three “desiderata” which will be introduced in Section 2.5.1. They are called “desiderata” rather than axioms because they do not assert that anything is “true,” but only state desirable goals of a theory of plausible inference. The operations for manipulating probabilities that follow from the desiderata are the sum and product rules. Together with the Bayesian definition of probability, they provide the desired extension to logic to handle the common situation of incomplete information. We will simply state these rules here and leave their derivation together with a precise operational definition of probability to the next chapter.

Sum Rule:

$$p(A|B) + p(\bar{A}|B) = 1 \tag{1.1}$$

Product Rule:

$$p(A,B|C) = p(A|C)\,p(B|A,C) = p(B|C)\,p(A|B,C), \tag{1.2}$$

where the symbol $A$ stands for a proposition which asserts that something is true, and $\bar{A}$ stands for the proposition that $A$ is false. The symbol $B$ is a proposition asserting that something else is true, and similarly, $C$ stands for another proposition. Two symbols separated by a comma represent a compound proposition which asserts that both propositions are true. Thus $A,B$ indicates that both propositions $A$ and $B$ are true and $p(A,B|C)$ is commonly referred to as the joint probability. Any proposition to the right of the vertical bar $|$ is assumed to be true. Thus when we write $p(A|B)$, we mean the probability of the truth of proposition $A$, given (conditional on) the truth of the information represented by proposition $B$.

Examples of propositions:

$A \equiv$ “The newly discovered radio astronomy object is a galaxy.”
$B \equiv$ “The measured redshift of the object is $0.150 \pm 0.005$.”
$A \equiv$ “Theory X is correct.”
$\bar{A} \equiv$ “Theory X is not correct.”
$A \equiv$ “The frequency of the signal is between $f$ and $f + df$.”

We will have much more to say about propositions in the next chapter.
Bayes’ theorem follows directly from the product rule (a rearrangement of the two right sides of the equation):

$$p(A|B,C) = \frac{p(A|C)\,p(B|A,C)}{p(B|C)}. \tag{1.3}$$

Another version of the sum rule can be derived (see Equation (2.23)) from the product and sum rules above:

Extended Sum Rule:

$$p(A+B|C) = p(A|C) + p(B|C) - p(A,B|C), \tag{1.4}$$

where $A + B \equiv$ proposition $A$ is true or $B$ is true or both are true. If propositions $A$ and $B$ are mutually exclusive – only one can be true – then Equation (1.4) becomes

$$p(A+B|C) = p(A|C) + p(B|C). \tag{1.5}$$
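The rules in Equations (1.1) to (1.4) can be checked numerically on a small example. The sketch below assumes a made-up joint probability table for two binary propositions; the four numbers and the helper names are illustrative only and do not come from the text. Because the conditionals are computed from the joint table, this is a consistency check rather than a derivation.

```python
# Hypothetical joint probabilities p(A, B | C) for two binary propositions.
# The four values are illustrative only and must sum to 1.
joint = {
    (True, True): 0.30,
    (True, False): 0.20,
    (False, True): 0.10,
    (False, False): 0.40,
}


def p_a(a):
    """p(A = a | C): sum the joint table over both values of B."""
    return sum(p for (ai, _), p in joint.items() if ai == a)


def p_b(b):
    """p(B = b | C): sum the joint table over both values of A."""
    return sum(p for (_, bi), p in joint.items() if bi == b)


def p_b_given_a(b, a):
    """p(B = b | A = a, C)."""
    return joint[(a, b)] / p_a(a)


def p_a_given_b(a, b):
    """p(A = a | B = b, C)."""
    return joint[(a, b)] / p_b(b)


# Sum rule, Eq. (1.1): p(A|B) + p(not A|B) = 1
sum_rule = p_a_given_b(True, True) + p_a_given_b(False, True)

# Product rule, Eq. (1.2): both factorizations give p(A, B | C)
product_1 = p_a(True) * p_b_given_a(True, True)
product_2 = p_b(True) * p_a_given_b(True, True)

# Bayes' theorem, Eq. (1.3)
bayes = p_a(True) * p_b_given_a(True, True) / p_b(True)

# Extended sum rule, Eq. (1.4)
p_a_or_b = p_a(True) + p_b(True) - joint[(True, True)]

print(sum_rule, product_1, product_2, bayes, p_a_or_b)
```

In this toy table the extended sum rule can also be cross-checked directly, since "A or B" is false only when both propositions are false.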
1.3 Usual form of Bayes’ theorem

$$p(H_i|D,I) = \frac{p(H_i|I)\,p(D|H_i,I)}{p(D|I)}, \tag{1.6}$$

where
$H_i \equiv$ proposition asserting the truth of a hypothesis of interest
$I \equiv$ proposition representing our prior information
$D \equiv$ proposition representing data
$p(D|H_i,I)$ = probability of obtaining data $D$, if $H_i$ and $I$ are true (also called the likelihood function $\mathcal{L}(H_i)$)
$p(H_i|I)$ = prior probability of hypothesis
$p(H_i|D,I)$ = posterior probability of $H_i$
$p(D|I) = \sum_i p(H_i|I)\,p(D|H_i,I)$ (normalization factor which ensures $\sum_i p(H_i|D,I) = 1$).
1.3.1 Discrete hypothesis space

In Bayesian inference, we are interested in assigning probabilities to a set of competing hypotheses perhaps concerning some aspect of nature that we are studying. This set of competing hypotheses is called the hypothesis space. For example, a problem of current interest to astronomers is whether the expansion of the universe is accelerating or decelerating. In this case, we would be dealing with a discrete hypothesis space⁴ consisting of $H_1$ ($\equiv$ accelerating) and $H_2$ ($\equiv$ decelerating). For a discrete hypothesis space, $p(H_i|D,I)$ is called a probability distribution. Our posterior probabilities for $H_1$ and $H_2$ satisfy the condition that

$$\sum_{i=1}^{2} p(H_i|D,I) = 1. \tag{1.7}$$
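For a discrete hypothesis space, Bayes’ theorem in the form of Equation (1.6) reduces to a few lines of arithmetic. The following is a minimal sketch, not the book's own code: the function name `posterior` and the numerical priors and likelihoods are assumptions chosen purely for illustration and say nothing about the actual cosmological question.

```python
def posterior(priors, likelihoods):
    """Return p(H_i | D, I) for every hypothesis, following Eq. (1.6).

    priors[i]      plays the role of p(H_i | I)
    likelihoods[i] plays the role of p(D | H_i, I)
    """
    if len(priors) != len(likelihoods):
        raise ValueError("need one likelihood per prior")
    weighted = [pr * lk for pr, lk in zip(priors, likelihoods)]
    evidence = sum(weighted)  # p(D | I), the normalization factor
    if evidence <= 0.0:
        raise ValueError("the data have zero probability under every hypothesis")
    return [w / evidence for w in weighted]


# Illustrative numbers only (hypothetical): two hypotheses with equal priors
# and made-up likelihood values.
priors = [0.5, 0.5]
likelihoods = [0.02, 0.01]
post = posterior(priors, likelihoods)

print(post)       # posterior probabilities for H_1 and H_2
print(sum(post))  # normalization condition of Eq. (1.7)
```

Dividing by the summed numerator is what enforces the normalization condition, so the individual likelihoods only need to be known up to a common constant factor.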
1.3.2 Continuous hypothesis space

In another type of problem we might be dealing with a hypothesis space that is continuous. This can be considered as the limiting case of an arbitrarily large number of discrete propositions.⁵ For example, we have strong evidence from the measured velocities and distances of galaxies that we live in an expanding universe. Astronomers are continually seeking to refine the value of Hubble’s constant, $H_0$, which relates the recession velocity of a galaxy to its distance. Estimating $H_0$ is called a parameter estimation problem and in this case, our hypothesis space of interest is continuous. In this case, the proposition $H_0$ asserts that the true value of Hubble’s constant is in the interval $h$ to $h + dh$. The truth of the proposition can be represented by $p(H_0|D,I)\,dH$, where $p(H_0|D,I)$ is a probability density function (PDF). The probability density function is defined by

$$p(H_0|D,I) = \lim_{\delta h \to 0} \frac{p(h \le H_0 < h + \delta h\,|\,D,I)}{\delta h}. \tag{1.8}$$
Box 1.1 Note about notation

The term “PDF” is also a common abbreviation for probability distribution function, which can pertain to discrete or continuous sets of probabilities. This term is particularly useful when dealing with a mixture of discrete and continuous parameters. We will use the same symbol, $p(\ldots)$, for probabilities and PDFs; the nature of the argument will identify which use is intended. To arrive at a final numerical answer for the probability or PDF of interest, we eventually need to convert the terms in Bayes’ theorem into algebraic expressions, but these expressions can become very complicated in appearance. It is useful to delay this step until the last possible moment.
⁴ Of course, nothing guarantees that future information will not indicate that the correct hypothesis is outside the current working hypothesis space. With this new information, we might be interested in an expanded hypothesis space.

⁵ In Jaynes (2003), there is a clear warning that difficulties can arise if we are not careful in carrying out this limiting procedure explicitly. This is often the underlying cause of so-called paradoxes of probability theory.
Let $W$ be a proposition asserting that the numerical value of $H_0$ lies in the range $a$ to $b$. Then

$$p(W|D,I) = \int_a^b p(H_0|D,I)\,dH_0. \tag{1.9}$$

In the continuum limit, the normalization condition of Equation (1.7) becomes

$$\int_{\Delta H} p(H|D,I)\,dH = 1, \tag{1.10}$$

where $\Delta H$ designates the range of integration corresponding to the hypothesis space of interest. We can also talk about a joint probability distribution, $p(X,Y|D,I)$, in which both $X$ and $Y$ are continuous, or one is continuous and the other is discrete. If both are continuous, then $p(X,Y|D,I)$ is interpreted to mean

$$p(X,Y|D,I) = \lim_{\delta x,\,\delta y \to 0} \frac{p(x \le X < x + \delta x,\; y \le Y < y + \delta y\,|\,D,I)}{\delta x\,\delta y}. \tag{1.11}$$

In a well-posed problem, the prior information defines our hypothesis space, the means for computing $p(H_i|I)$, and the likelihood function given some data $D$.
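Equations (1.9) and (1.10) can be approximated numerically by tabulating a PDF on a grid. The sketch below is only an illustration under stated assumptions: the Gaussian shape, its centre of 70 and width of 5, the grid limits, and the function names are all hypothetical and are not measurements or results from the text.

```python
import math


def trapezoid(ys, xs):
    """Trapezoidal-rule integral of tabulated values ys over grid points xs."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))


def unnormalized_pdf(h):
    """Hypothetical posterior shape for H0 (illustrative numbers only)."""
    return math.exp(-0.5 * ((h - 70.0) / 5.0) ** 2)


# Grid covering the assumed hypothesis space for H0.
lo, hi, n = 30.0, 110.0, 4001
grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
raw = [unnormalized_pdf(h) for h in grid]

# Enforce the normalization condition of Eq. (1.10) on the grid.
norm = trapezoid(raw, grid)
pdf = [r / norm for r in raw]


def prob_in_range(a, b):
    """Eq. (1.9): integrate the PDF over the grid points with a <= H0 <= b."""
    points = [(h, p) for h, p in zip(grid, pdf) if a <= h <= b]
    xs = [h for h, _ in points]
    ys = [p for _, p in points]
    return trapezoid(ys, xs)


total = trapezoid(pdf, grid)
p_w = prob_in_range(65.0, 75.0)  # proposition W: H0 lies between 65 and 75

print(total, p_w)
```

The density itself is not a probability; only its integral over a range, such as the proposition $W$ here, is.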
1.3.3 Bayes’ theorem – model of the learning process

Bayes’ theorem provides a model for inductive inference or the learning process. In the parameter estimation problem of the previous section, $H_0$ is a continuous hypothesis space. Hubble’s constant has some definite value, but because of our limited state of knowledge, we cannot be too precise about what that value is. In all Bayesian inference problems, we proceed in the same way. We start by encoding our prior state of knowledge into a prior probability distribution, $p(H_0|I)$ (in this case a density distribution). We will see a very simple example of how to do this in Section 1.4.1, and many more examples in subsequent chapters. If our prior information is very vague then $p(H_0|I)$ will be very broad, spanning a wide range of possible values of the parameter.

It is important to realize that a Bayesian PDF is a measure of our state of knowledge (i.e., ignorance) of the value of the parameter. The actual value of the parameter is not distributed over this range; it has some definite value. This can sometimes be a serious point of confusion, because, in frequentist statistics, the argument of a probability is a random variable, a quantity that can meaningfully take on different values, and these values correspond to possible outcomes of experiments.

We then acquire some new data, $D_1$. Bayes’ theorem provides a means for combining what the data have to say about the parameter, through the likelihood function, with our prior, to arrive at a posterior probability density, $p(H_0|D_1,I)$, for the parameter:

$$p(H_0|D_1,I) \propto p(H_0|I)\,p(D_1|H_0,I). \tag{1.12}$$
[Figure 1.2: two panels, (a) and (b). Upper graphs: prior $p(H_0|M_1,I)$ and likelihood $p(D|H_0,M_1,I)$; lower graphs: posterior $p(H_0|D,M_1,I)$. Horizontal axis: parameter $H_0$.]

Figure 1.2 Bayes’ theorem provides a model of the inductive learning process. The posterior PDF (lower graphs) is proportional to the product of the prior PDF and the likelihood function (upper graphs). This figure illustrates two extreme cases: (a) the prior much broader than likelihood, and (b) likelihood much broader than prior.
Two extreme cases are shown in Figure 1.2. In the first, panel (a), the prior is much broader than the likelihood. In this case, the posterior PDF is determined entirely by the new data. In the second extreme, panel (b), the new data are much less selective than our prior information and hence the posterior is essentially the prior.

Now suppose we acquire more data represented by proposition $D_2$. We can again apply Bayes’ theorem to compute a posterior that reflects our new state of knowledge about the parameter. This time our new prior information, $I'$, includes the first data set, i.e., $I' = D_1, I$, so the new prior, $p(H_0|I')$, is the posterior derived from $D_1, I$. The new posterior is given by

$$p(H_0|D_2,I') \propto p(H_0|I')\,p(D_2|H_0,I'). \tag{1.13}$$
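The learning process of Equations (1.12) and (1.13) can be imitated on a grid of parameter values. This is a rough sketch under explicit assumptions: a flat prior, two hypothetical Gaussian likelihoods (measured values 72 and 68 with uncertainties 8 and 4, all invented for illustration), and data sets that are independent given $H_0$, so that $p(D_2|H_0,D_1,I) = p(D_2|H_0,I)$.

```python
import math


def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_likelihood(measured, sigma, grid):
    """Hypothetical likelihood p(D | H0, I) for a single Gaussian measurement."""
    return [math.exp(-0.5 * ((measured - h) / sigma) ** 2) for h in grid]


def update(prior, likelihood):
    """One application of Bayes' theorem on the grid: posterior ~ prior * likelihood."""
    return normalize([p * l for p, l in zip(prior, likelihood)])


grid = [40.0 + 0.1 * i for i in range(601)]   # trial values of H0 from 40 to 100
prior = normalize([1.0] * len(grid))          # very vague (flat) prior, p(H0 | I)

like_1 = gaussian_likelihood(72.0, 8.0, grid)  # made-up data set D1
like_2 = gaussian_likelihood(68.0, 4.0, grid)  # made-up data set D2

post_1 = update(prior, like_1)    # Eq. (1.12)
post_2 = update(post_1, like_2)   # Eq. (1.13): the old posterior is the new prior

# Analysing both data sets in a single step gives the same state of knowledge.
both = update(prior, [a * b for a, b in zip(like_1, like_2)])

max_diff = max(abs(a - b) for a, b in zip(post_2, both))
peak = grid[post_2.index(max(post_2))]
print(max_diff, peak)
```

The agreement between `post_2` and `both` reflects the fact that, under the independence assumption, the order in which the data are incorporated does not matter.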
1.3.4 Example of the use of Bayes’ theorem

Here we analyze a simple model comparison problem using Bayes’ theorem. We start by stating our prior information, $I$, and the new data, $D$. $I$ stands for:

a) Model $M_1$ predicts a star’s distance, $d_1 = 100$ light years (ly).
b) Model $M_2$ predicts a star’s distance, $d_2 = 200$ ly.
c) The uncertainty, $e$, in distance measurements is described by a Gaussian distribution of the form

$$p(e|I) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{e^2}{2\sigma^2}\right),$$

where $\sigma = 40$ ly.
d) There is no current basis for preferring $M_1$ over $M_2$ so we set $p(M_1|I) = p(M_2|I) = 0.5$.
$D \equiv$ “The measured distance $d = 120$ ly.”

The prior information tells us that the hypothesis space of interest consists of models (hypotheses) $M_1$ and $M_2$. We proceed by writing down Bayes’ theorem for each hypothesis, e.g.,

$$p(M_1|D,I) = \frac{p(M_1|I)\,p(D|M_1,I)}{p(D|I)}, \tag{1.14}$$

$$p(M_2|D,I) = \frac{p(M_2|I)\,p(D|M_2,I)}{p(D|I)}. \tag{1.15}$$

Since we are interested in comparing the two models, we will compute the odds ratio, equal to the ratio of the posterior probabilities of the two models. We will abbreviate the odds ratio of model $M_1$ to model $M_2$ by the symbol $O_{12}$.

$$O_{12} = \frac{p(M_1|D,I)}{p(M_2|D,I)} = \frac{p(M_1|I)}{p(M_2|I)}\,\frac{p(D|M_1,I)}{p(D|M_2,I)} = \frac{p(D|M_1,I)}{p(D|M_2,I)}. \tag{1.16}$$
The two prior probabilities cancel because they are equal and so does $p(D|I)$ since it is common to both models. To evaluate the likelihood $p(D|M_1,I)$, we note that in this case, we are assuming $M_1$ is true. That being the case, the only reason the measured $d$ can differ from the prediction $d_1$ is because of measurement uncertainties, $e$. We can thus write $d = d_1 + e$ or $e = d - d_1$. Since $d_1$ is determined by the model, it is certain, and so the probability,⁶ $p(D|M_1,I)$, of obtaining the measured distance is equal to the probability of the error. Thus we can write

$$p(D|M_1,I) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(d-d_1)^2}{2\sigma^2}\right) = \frac{1}{\sqrt{2\pi}\,40} \exp\left(-\frac{(120-100)^2}{2(40)^2}\right) = 0.00880. \tag{1.17}$$

Similarly we can write for model $M_2$

$$p(D|M_2,I) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(d-d_2)^2}{2\sigma^2}\right) = \frac{1}{\sqrt{2\pi}\,40} \exp\left(-\frac{(120-200)^2}{2(40)^2}\right) = 0.00135. \tag{1.18}$$

⁶ See Section 4.8 for a more detailed treatment of this point.

[Figure 1.3: probability density versus distance (ly), horizontal axis from 0 to 350, with the positions of $d_1$, $d_{\rm measured}$ and $d_2$ marked.]

Figure 1.3 Graphical depiction of the evaluation of the likelihood functions, $p(D|M_1,I)$ and $p(D|M_2,I)$.
The evaluation of Equations (1.17) and (1.18) is depicted graphically in Figure 1.3. The relative likelihood of the two models is proportional to the heights of the two Gaussian probability distributions at the location of the measured distance. Substituting into Equation (1.16), we obtain an odds ratio of 6.52 in favor of model M1 .
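The arithmetic of Equations (1.16) to (1.18) is easy to reproduce. The sketch below uses only the numbers stated in the example; the function and variable names are my own, and the final conversion from odds to a posterior probability assumes, as the example does, that $M_1$ and $M_2$ are the only two hypotheses in the hypothesis space.

```python
import math


def gaussian_pdf(e, sigma):
    """Gaussian error distribution p(e | I) with standard deviation sigma."""
    return math.exp(-e ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)


d = 120.0      # measured distance (ly)
d1 = 100.0     # distance predicted by model M1 (ly)
d2 = 200.0     # distance predicted by model M2 (ly)
sigma = 40.0   # measurement uncertainty (ly)

like_m1 = gaussian_pdf(d - d1, sigma)   # Eq. (1.17)
like_m2 = gaussian_pdf(d - d2, sigma)   # Eq. (1.18)

prior_m1 = prior_m2 = 0.5
odds_12 = (prior_m1 * like_m1) / (prior_m2 * like_m2)   # Eq. (1.16)

# With only two hypotheses, the odds ratio fixes the posterior probability of M1.
post_m1 = odds_12 / (1.0 + odds_12)

print(f"p(D|M1,I) = {like_m1:.5f}")
print(f"p(D|M2,I) = {like_m2:.5f}")
print(f"O12 = {odds_12:.2f}, p(M1|D,I) = {post_m1:.3f}")
```

Changing `sigma` in this sketch shows how strongly the odds ratio depends on the assumed measurement uncertainty.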
1.4 Probability and frequency

In Bayesian terminology, a probability is a representation of our state of knowledge of the real world. A frequency is a factual property of the real world that we measure or estimate.⁷ One of the great strengths of Bayesian inference is the ability to incorporate relevant prior information in the analysis. As a consequence, some critics have discounted the approach on the grounds that the conclusions are subjective and there has been considerable confusion on that subject. We certainly expect that when scientists from different laboratories come together at an international meeting, their state of knowledge about any particular topic will differ, and as such, they may have arrived at different conclusions. It is important to recognize that the only thing objectivity requires of a scientific approach is that experimenters with the same state of knowledge reach the same conclusion. Achieving consensus amongst different experimenters is greatly aided by the requirement to specify how relevant prior information has been encoded in the analysis. In Bayesian inference, we can readily incorporate frequency information using Bayes’ theorem and by treating it as data. In general, probabilities change when we change our state of knowledge; frequencies do not.

⁷ For example, consider a sample of 400 people attending a conference. Each person sampled has many characteristics or attributes including sex and eye color. Suppose 56 are found to be female. Based on this sample, the frequency of occurrence of the attribute female is $56/400 = 14\%$.
1.4.1 Example: incorporating frequency information

A 1996 newspaper article reported that doctors in Toronto were concerned about a company selling an unapproved mail-order HIV saliva test. According to laboratory tests, the false positive rate for this test was 2.3% and the false negative rate was 1.4% (i.e., 98.6% reliable based on testing of people who actually have the disease). In this example, suppose a new deadly disease is discovered for which there is no known cause but a saliva test is available with the above specifications. We will refer to this disease by the abbreviation UD, for unknown disease. You have no reason to suspect you have UD but decide to take the test anyway and test positive. What is the probability that you really have the disease? Here is a Bayesian analysis of this situation. For the purpose of this analysis, we will assume that the incidence of the disease in a random sample of the region is 1:10 000.

Let
$H \equiv$ “You have UD.”
$\bar{H} \equiv$ “You do not have UD.”
$D_1 \equiv$ “You test positive for UD.”
$I_1 \equiv$ “No known cause for the UD, $p(D_1|H,I_1) = 0.986$, $p(D_1|\bar{H},I_1) = 0.023$, incidence of UD in the population is $1:10^4$.”

The starting point for any Bayesian analysis is to write down Bayes’ theorem.

$$p(H|D_1,I_1) = \frac{p(H|I_1)\,p(D_1|H,I_1)}{p(D_1|I_1)}. \tag{1.19}$$

Since $p(D_1|I_1)$ is a normalization factor, which ensures $\sum_i p(H_i|D_1,I_1) = 1$, we can write

$$p(D_1|I_1) = p(H|I_1)\,p(D_1|H,I_1) + p(\bar{H}|I_1)\,p(D_1|\bar{H},I_1). \tag{1.20}$$
In words, this latter equation stands for

$$\begin{aligned}
\text{prob. of a + test} &= (\text{prob. you have UD}) \times (\text{prob. of a + test when you have UD}) \\
&\quad + (\text{prob. you don’t have UD}) \times (\text{prob. of a + test when you don’t have UD}) \\
&= (\text{reliability of test}) \times (\text{incidence of UD in population}) \\
&\quad + (\text{false positive rate}) \times (1 - \text{incidence of UD}),
\end{aligned}$$

$$p(H|D_1,I_1) = \frac{10^{-4} \times 0.986}{10^{-4} \times 0.986 + 0.9999 \times 0.023} = 0.0042. \tag{1.21}$$
Thus, the probability you have the disease is 0.4% (not 98.6%).

Question: How would the conclusion change if the false positive rate of the test were reduced to 0.5%?

Suppose you now have a doctor examine you and obtain new independent data $D_2$, perhaps from a blood test. With $I_2$ = new state of knowledge = $D_1, I_1$, we have

$$p(H|D_2,I_2) = \frac{p(H|I_2)\,p(D_2|H,I_2)}{p(D_2|I_2)},$$

where $p(H|I_2) = p(H|D_1,I_1)$.
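The calculation in Equations (1.19) to (1.21) can be written as a short function. This is a minimal sketch using the numbers given in the example; the function name and argument names are assumptions introduced here for illustration.

```python
def posterior_given_positive(prior, p_pos_given_h, p_pos_given_not_h):
    """p(H | positive test), from Eqs. (1.19) and (1.20).

    prior              p(H | I): probability of having the disease before the test
    p_pos_given_h      probability of a positive test if you have the disease
    p_pos_given_not_h  false positive rate
    """
    evidence = prior * p_pos_given_h + (1.0 - prior) * p_pos_given_not_h
    return prior * p_pos_given_h / evidence


incidence = 1.0e-4           # 1 in 10 000
reliability = 0.986          # 1 minus the false negative rate of 1.4%
false_positive_rate = 0.023

p_h = posterior_given_positive(incidence, reliability, false_positive_rate)
print(f"p(H|D1,I1) = {p_h:.5f}")
```

Evaluated this way the posterior is roughly 0.00427, which the text quotes as 0.0042, or about 0.4%. Because the posterior from one test serves as the prior for the next, the same function could be called again with `p_h` as its first argument once the specifications of a second, independent test are known.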
1.5 Marginalization

In this section, we briefly introduce marginalization, but we will learn about important subtleties to this operation in later chapters. Consider the following parameter estimation problem. We have acquired some data, $D$, which our prior information, $I$, indicates will contain a periodic signal. Our signal model has two continuous parameters – an angular frequency, $\omega$, and an amplitude, $A$. We want to focus on the implications of the data for $\omega$, independent of the signal’s amplitude, $A$. We can write the joint probability⁸ of $\omega$ and $A$ given data $D$ and prior information $I$ as $p(\omega,A|D,I)$. In this case $\omega, A$ is a compound proposition asserting that the two propositions are true. How do we obtain an expression for the probability of the proposition $\omega$? We eliminate the uninteresting parameter $A$ by marginalization. How do we do this? For simplicity, we will start by assuming that the parameter $A$ is discrete. In this case, $A$ can only take on the values $A_1$ or $A_2$ or $A_3$, etc. Since we are assuming the model to be true, the proposition represented by $A_1 + A_2 + A_3 + \cdots$, where the $+$ stands for the Boolean “or”, must be true for some value of $A_i$ and hence,

$$p(A_1 + A_2 + A_3 + \cdots|I) = 1. \tag{1.22}$$

⁸ Since a parameter of a model is not a random variable, the frequentist approach is denied the concept of the probability of a parameter.
Now $\omega, [A_1 + A_2 + A_3 + \cdots]$ is a compound proposition which asserts that both $\omega$ and $[A_1 + A_2 + A_3 + \cdots]$ are true. The probability that this compound proposition is true is represented by $p(\omega, [A_1 + A_2 + A_3 + \cdots]|D,I)$. We use the product rule to expand the probability of this compound proposition.

$$\begin{aligned}
p(\omega, [A_1 + A_2 + A_3 + \cdots]|D,I) &= p([A_1 + A_2 + A_3 + \cdots]|D,I) \\
&\quad \times p(\omega|[A_1 + A_2 + A_3 + \cdots], D, I) \\
&= 1 \times p(\omega|D,I).
\end{aligned} \tag{1.23}$$

The second line of the above equation has the quantity $[A_1 + A_2 + A_3 + \cdots], D, I$ to the right of the vertical bar which should be read as assuming the truth of $[A_1 + A_2 + A_3 + \cdots], D, I$. Now $[A_1 + A_2 + A_3 + \cdots], D, I$ is a compound proposition asserting that all three propositions are true. Since proposition $[A_1 + A_2 + A_3 + \cdots]$ is given as true by our prior information, $I$, knowledge of its truth is already contained in proposition $I$. Thus, we can simplify the expression by replacing $p(\omega|[A_1 + A_2 + A_3 + \cdots], D, I)$ by $p(\omega|D,I)$. Rearranging Equation (1.23), we get

$$p(\omega|D,I) = p(\omega, [A_1 + A_2 + A_3 + \cdots]|D,I). \tag{1.24}$$
The left hand side of the equation is the probability we are seeking, but we are not finished with the right hand side. Now we do a simple expansion of the right hand side of Equation (1.24) by multiplying out the two propositions $\omega$ and $[A_1 + A_2 + A_3 + \cdots]$ using a Boolean algebra relation which is discussed in more detail in Chapter 2.

$$p(\omega, [A_1 + A_2 + A_3 + \cdots]|D,I) = p(\{\omega, A_1\} + \{\omega, A_2\} + \{\omega, A_3\} + \cdots|D,I). \tag{1.25}$$

The term $\{\omega, A_1\} + \{\omega, A_2\} + \{\omega, A_3\} + \cdots$ is a proposition which asserts that $\omega, A_1$ is true, or $\omega, A_2$ is true, or $\omega, A_3$ is true, etc. We have surrounded each of the $\omega, A_i$ terms by curly brackets to help with the interpretation, but normally they are not required because the logical conjunction operation designated by a comma between two propositions takes precedence over the logical “or” operation designated by the $+$ sign.
The extended sum rule, given by Equation (1.5), says that the probability of the sum of two mutually exclusive (only one can be true) propositions is the sum of their individual probabilities. Since the compound propositions $\omega, A_i$ for different $i$ are mutually exclusive, we can rewrite Equation (1.25) as

$$p(\omega, [A_1 + A_2 + A_3 + \cdots]|D,I) = p(\omega, A_1|D,I) + p(\omega, A_2|D,I) + p(\omega, A_3|D,I) + \cdots \tag{1.26}$$

Substitution of Equation (1.26) into Equation (1.24) yields:

$$p(\omega|D,I) = \sum_i p(\omega, A_i|D,I). \tag{1.27}$$
Extending this idea to the case where $A$ is a continuously variable parameter instead of a discrete parameter, we can write

$$p(\omega|D,I) = \int dA\; p(\omega, A|D,I). \tag{1.28}$$

The quantity, $p(\omega|D,I)$, is the marginal posterior distribution for $\omega$, which, for a continuous parameter like $\omega$, is a probability density function. It summarizes what $D, I$ (our knowledge state) says about the parameter(s) of interest. The probability that $\omega$ will lie in any specific range from $\omega_1$ to $\omega_2$ is given by $\int_{\omega_1}^{\omega_2} p(\omega|D,I)\,d\omega$.

Another useful form of the marginalization operation can be obtained by expanding Equation (1.28) using Bayes’ theorem:

$$p(\omega, A|D,I) = \frac{p(\omega, A|I)\,p(D|\omega, A, I)}{p(D|I)}. \tag{1.29}$$

Now expand $p(\omega, A|I)$ on the right hand side of Equation (1.29) using the product rule:

$$p(\omega, A|I) = p(\omega|I)\,p(A|\omega, I). \tag{1.30}$$

Now we will assume the priors for $\omega$ and $A$ are independent so we can write $p(A|\omega, I) = p(A|I)$. What this is saying is that any prior information we have about the parameter $\omega$ tells us nothing about the parameter $A$. This assumption is frequently valid and it usually simplifies the calculations. Equation (1.29) can now be rewritten as

$$p(\omega, A|D,I) = \frac{p(\omega|I)\,p(A|I)\,p(D|\omega, A, I)}{p(D|I)}. \tag{1.31}$$

Finally, substitution of Equation (1.31) into Equation (1.28) yields:

$$p(\omega|D,I) \propto p(\omega|I) \int dA\; p(A|I)\,p(D|\omega, A, I). \tag{1.32}$$
This gives the marginal posterior distribution $p(\omega|D,I)$, in terms of the weighted average of the likelihood function, $p(D|\omega, A, I)$, weighted by $p(A|I)$, the prior probability density function for $A$. This is another form of the operation of marginalizing out the $A$ parameter. The integral in Equation (1.32) can sometimes be evaluated analytically which can greatly reduce the computational aspects of the problem especially when many parameters are involved. A dramatic example of this is given in Gregory and Loredo (1992) which demonstrates how to marginalize analytically over a very large number of parameters in a model describing a waveform of unknown shape.
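When the integral in Equation (1.28) cannot be done analytically, it can be approximated by summing a tabulated joint distribution over the unwanted parameter. The sketch below assumes a purely hypothetical joint posterior shape in which the amplitude is correlated with the frequency; the functional form, grid limits, and names are illustrative choices and are not taken from the text.

```python
import math


def joint_unnormalized(omega, amp):
    """Hypothetical joint posterior shape p(omega, A | D, I), up to a constant."""
    amp_center = 2.0 + 3.0 * (omega - 1.0)   # amplitude correlated with omega
    return math.exp(-0.5 * ((omega - 1.0) / 0.1) ** 2
                    - 0.5 * ((amp - amp_center) / 0.5) ** 2)


n = 201
omegas = [0.5 + 1.0 * i / (n - 1) for i in range(n)]   # omega from 0.5 to 1.5
amps = [4.0 * j / (n - 1) for j in range(n)]           # A from 0 to 4
d_omega = omegas[1] - omegas[0]
d_amp = amps[1] - amps[0]

# Tabulate the joint density: one row per omega value, one column per A value.
table = [[joint_unnormalized(w, a) for a in amps] for w in omegas]
norm = sum(sum(row) for row in table) * d_omega * d_amp

# Eq. (1.28) as a Riemann sum: integrate out A separately for each omega.
marginal_omega = [sum(row) * d_amp / norm for row in table]

total = sum(marginal_omega) * d_omega
peak_omega = omegas[marginal_omega.index(max(marginal_omega))]
print(total, peak_omega)
```

The cost of this brute-force approach grows quickly with the number of parameters being integrated out, which is one reason analytic marginalization is valuable when it is available.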
1.6 The two basic problems in statistical inference

1. Model selection: Which of two or more competing models is most probable given our present state of knowledge? The competing models may have different numbers of parameters. For example, suppose we have some experimental data consisting of a signal plus some additive noise and we want to distinguish between two different models for the signal present. Model $M_1$ predicts that the signal is a constant equal to zero, i.e., has no unknown (free) parameters. Model $M_2$ predicts that the signal consists of a single sine wave of known frequency $f$. Let us further suppose that the amplitude, $A$, of the sine wave is a free parameter within some specified prior range. In this problem, $M_1$ has no free parameters and $M_2$ has one free parameter, $A$. In model selection, we are interested in the most probable model, independent of the model parameters (i.e., marginalize out all parameters). This is illustrated in the equation below for model $M_2$.

$$p(M_2|D,I) = \int_{\Delta A} dA\; p(M_2, A|D,I), \tag{1.33}$$

where $\Delta A$ designates the appropriate range of integration of $A$ as specified by our prior information, $I$. We can rearrange Equation (1.33) into another useful form by application of Bayes’ theorem and the product rule, following the example given in the previous section (Equations (1.28) to (1.32)). The result is

$$p(M_2|D,I) = \frac{p(M_2|I) \int_{\Delta A} dA\; p(A|M_2,I)\,p(D|M_2, A, I)}{p(D|I)}. \tag{1.34}$$
In model selection, the hypothesis space of interest is discrete (although its parameters may be continuous) and $M_2$ stands for the second member of this discrete space.

2. Parameter estimation: Assuming the truth of a model, find the probability density function for each of its parameters. Suppose the model $M$ has two free parameters $f$ and $A$. In this case, we want to solve for $p(f|D,M,I)$ and $p(A|D,M,I)$. The quantity $p(f|D,M,I)$ is called the marginal posterior distribution for $f$, which, for a continuous parameter like $f$, is a probability density function as defined by Equation (1.8).

In Chapter 3, we will work through a detailed example of both model selection and parameter estimation.
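The model selection calculation of Equation (1.34) can be sketched for the zero-signal versus sine-wave example described above. Everything numerical here is an assumption made for illustration: the data are simulated with a fixed random seed, the noise is taken to be independent Gaussian with known $\sigma = 1$, the frequency is $f = 0.1$, the prior for $A$ is uniform between 0 and 5, and the two models are given equal prior probability.

```python
import math
import random

# Simulated data (hypothetical): sine wave of known frequency plus Gaussian noise.
random.seed(0)
sigma, freq, true_amp = 1.0, 0.1, 1.5
times = list(range(50))
data = [true_amp * math.sin(2.0 * math.pi * freq * t) + random.gauss(0.0, sigma)
        for t in times]


def log_likelihood(amp):
    """ln p(D | A, I) for independent Gaussian noise with known sigma."""
    resid2 = sum((d - amp * math.sin(2.0 * math.pi * freq * t)) ** 2
                 for d, t in zip(data, times))
    return (-0.5 * resid2 / sigma ** 2
            - len(data) * math.log(math.sqrt(2.0 * math.pi) * sigma))


# M1: the signal is zero, so there is no parameter to marginalize.
log_like_m1 = log_likelihood(0.0)

# M2: uniform prior p(A | M2, I) = 1 / a_max on 0 <= A <= a_max (an assumption).
a_max, n = 5.0, 1001
amps = [a_max * i / (n - 1) for i in range(n)]
log_likes = [log_likelihood(a) for a in amps]
ref = max(log_likes)                        # factor out the peak to avoid underflow
scaled = [math.exp(ll - ref) for ll in log_likes]
step = amps[1] - amps[0]
integral = step * (sum(scaled) - 0.5 * (scaled[0] + scaled[-1]))  # trapezoid rule
log_marginal_m2 = ref + math.log(integral / a_max)

# Equal model priors, so the odds ratio is the ratio of marginal likelihoods.
log_odds_21 = log_marginal_m2 - log_like_m1
post_m2 = 1.0 / (1.0 + math.exp(-log_odds_21))
print(f"ln O21 = {log_odds_21:.1f}, p(M2|D,I) = {post_m2:.6f}")
```

Working with logarithms and subtracting the peak value before exponentiating keeps the very small likelihood values from underflowing. Widening the prior range `a_max` lowers the marginal likelihood of $M_2$, which is one way to see how the prior range of a free parameter enters model selection.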
1.7 Advantages of the Bayesian approach

1. Provides an elegantly simple and rational approach for answering, in an optimal way, any scientific question for a given state of information. This contrasts with the recipe or cookbook approach of conventional statistical analysis. The procedure is well-defined:
   (a) Clearly state your question and prior information.
   (b) Apply the sum and product rules. The starting point is always Bayes’ theorem.
   For some problems, a Bayesian analysis may simply lead to a familiar statistic. Even in this situation it often provides a powerful new insight concerning the interpretation of the statistic. One example of this is shown in Figure 1.4 and discussed in detail in Chapter 13.
2. Calculates probability of hypothesis directly: $p(H_i|D,I)$.
3. Incorporates relevant prior (e.g., known signal model) information through Bayes’ theorem. This is one of the great strengths of Bayesian analysis. For data with a high signal-to-noise ratio, a Bayesian analysis can frequently yield many orders of magnitude improvement in model parameter estimation, through the incorporation of relevant prior information about the signal model. This is illustrated in Figure 1.5 and discussed in more detail in Chapter 13.
4. Provides a way of eliminating nuisance parameters through marginalization. For some problems, the marginalization can be performed analytically, permitting certain calculations to become computationally tractable (see Section 13.4).
5. Provides a more powerful way of assessing competing theories at the forefront of science by automatically quantifying Occam’s razor. Occam’s razor is a principle attributed to the medieval philosopher William of Occam (or Ockham). The principle states that one should not make more assumptions than the minimum needed. It underlies all scientific modeling and theory building. It cautions us to choose from a set of otherwise equivalent models of a given phenomenon the simplest one. In any given model, Occam’s razor helps us to “shave off” those variables that are not really needed to explain the phenomenon. It was previously thought to be only a qualitative principle. This topic is introduced in Section 3.5. The Bayesian quantitative Occam’s razor can also save a lot of time that might otherwise be spent chasing noise artifacts that masquerade as possible detections of real phenomena. One example of this is discussed in Section 12.9 on extrasolar planets.
6. Provides a way for incorporating the effects of systematic errors arising from both the measurement operation and theoretical model predictions. Figure 1.6 illustrates the effect of a systematic error in the scale of the cosmic ruler (Hubble’s constant) used to determine the distance to galaxies. This topic is introduced in Section 3.11.
These advantages will be discussed in detail beginning in Chapter 3. We close with a reminder that in Bayesian inference probabilities are a measure of our state of knowledge about nature, not a measure of nature itself.
[Figure 1.4: three panels. Upper: “Simulated Time Series [sin 2πft + noise (σ = 1)]”, signal strength versus time. Middle: “Fourier Power Spectral Density”, power density versus frequency. Lower: “Bayesian Probability Density”, probability density versus frequency.]
Figure 1.4 The upper panel shows a simulated time series consisting of a single sinusoidal signal with added independent Gaussian noise. A common conventional analysis (middle panel) involves plotting the power spectrum, based on a Discrete Fourier Transform (DFT) statistic of the data. The Bayesian analysis (lower panel) involves a nonlinear processing of the same DFT statistic, which suppresses spurious peaks and the width of the spectral peak reflects the accuracy of the frequency estimate.
1.8 Problems

1. For the example given in Section 1.3.4, compute $p(D|M_1,I)$ and $p(D|M_2,I)$, for $\sigma = 25$ ly.
2. For the example given in Section 1.4.1, compute the probability that the person has the disease, if the false positive rate for the test is 0.5%, and everything else is the same.
Figure 1.5 Comparison of conventional analysis (middle panel) and Bayesian analysis (lower panel) of the two-channel nuclear magnetic resonance free induction decay time series (upper two panels). By incorporating prior information about the signal model, the Bayesian analysis was able to determine the frequencies and exponential decay rates to an accuracy many orders of magnitude greater than for a conventional analysis. (Figure credit G. L. Bretthorst, reproduced by permission from the American Institute of Physics.)
[Figure 1.6: probability density versus distance (Mpc), with two curves labelled case 1 and case 2.]

Figure 1.6 The probability density function for the distance to a galaxy assuming: 1) a fixed value for Hubble’s constant ($H_0$), and 2) incorporating a Gaussian prior uncertainty for $H_0$ of 14%.
3. In Section 1.4.1, based on the saliva test result and the prior information, the probability that the person had the unknown disease (UD) was found to be 0.42%. Subsequently, the same person received an independent blood test for UD and again tested positive. If the false negative rate for this test is 1.4% and the false positive rate is 0.5%, what is the new probability that the person has UD on the basis of both tests? 4. Joint and marginal probability distributions (Refer to the example on this topic in the Mathematica tutorial.)
(a) Suppose we are interested in estimating the parameters X and Y of a certain model M, where both parameters are continuous as opposed to discrete. Make a contour plot of the following posterior joint probability density function given by:
$$p(X, Y \mid D, M, I) = A_1 \exp\left(-\frac{(x - x_1)^2 + (y - y_1)^2}{2\sigma_1^2}\right) + A_2 \exp\left(-\frac{(x - x_2)^2 + (y - y_2)^2}{2\sigma_2^2}\right),$$
where $A_1 = 4.82033$, $A_2 = 4.43181$, $x_1 = 0.5$, $y_1 = 0.5$, $x_2 = 0.65$, $y_2 = 0.75$, $\sigma_1 = 0.2$, $\sigma_2 = 0.04$, where $0 \le x \le 1$ and $0 \le y \le 1$. Your contour plot should cover the interval $x = 0 \to 1$, $y = 0 \to 1$. In Mathematica, this can be accomplished with ContourPlot.
(b) Now make a 3-dimensional plot of $p(X, Y \mid D, M, I)$. In Mathematica, this can be accomplished with Plot3D.
(c) Now compute the marginal probability distributions $p(X \mid D, M, I)$ and $p(Y \mid D, M, I)$. The prior information is $I \equiv$ “X and Y are only non-zero in the interval $0 \to 1$, and uniform within that interval.” Check that the integral of $p(X \mid D, M, I)$ in the interval $0 \to 1$ is equal to 1.
(d) In your 3-dimensional plot of part (b), probability is represented by a height along the z-axis. Now imagine a light source located a great distance away along the y-axis illuminating the 3-dimensional probability density function. The shadow cast by $p(X, Y \mid D, M, I)$ on the plane defined by $y = 0$, we will call the projected probability density function of X. Compute and compare the projected probability density function of X with the marginal distribution on the same plot. To accomplish this effectively, both density functions should be normalized to have an integral = 1 in the interval $x = 0 \to 1$. Note: the location of the peak of the marginal does not correspond to the location of the projection peak, although they would if the joint probability density function were a single multi-dimensional Gaussian.
(e) Plot the normalized marginal and projected probability density functions for Y on one graph.
2 Probability theory as extended logic
2.1 Overview The goal of this chapter is to provide an extension of logic to handle situations where we have incomplete information so we may arrive at the relative probabilities of competing propositions (theories, hypotheses, or models) for a given state of information. We start by reviewing the algebra of logical propositions and explore the structure (syllogisms) of deductive and plausible inference. We then set off on a course to come up with a quantitative theory of plausible inference (probability theory as extended logic) based on the three desirable goals called desiderata. This amounts to finding an adequate set of mathematical operations for plausible inference that satisfies the desiderata. The two operations required turn out to be the product rule and sum rule of probability theory. The process of arriving at these operations uncovers a precise operational definition of plausibility, which is determined by the data. The material presented in this chapter is an abridged version of the treatment given by E. T. Jaynes in his book, Probability Theory – The Logic of Science (Jaynes, 2003), with permission from Cambridge University Press.
2.2 Fundamentals of logic
2.2.1 Logical propositions
In general, we will represent propositions by capital letters {A, B, C, etc.}. A proposition asserts that something is true, e.g., $A \equiv$ “The age of the specimen is $10^6$ years.” The denial of a proposition is indicated by a bar: $\bar{A} \equiv$ “A is false.” We will only be concerned with two-valued logic; thus, any proposition has a truth value of either True or False (truth value: 1 or 0).
2.2.2 Compound propositions
$A, B \equiv$ asserts both A and B are true (logical product or conjunction)
$A, \bar{A} \equiv$ impossible statement, truth value = F or zero
$A + B \equiv$ asserts A is true or B is true or both are true (logical sum or disjunction)
$A, \bar{B} + B, \bar{A} \equiv$ asserts either A is true or B is true but both are not true (exclusive form of logical sum)
2.2.3 Truth tables and Boolean algebra
Consider the two compound propositions $A = \overline{B, C}$ and $D = \bar{B} + \bar{C}$. Are the propositions A and D equal? Two propositions are equal if they have the same truth value. We can verify that A = D by constructing a truth table which lays out the truth values for A and D for all the possible combinations of the truth values of the propositions B and C on which they are based (Table 2.1). Since A and D have the same truth value for all possible truth values of propositions B and C, we can write A = D (which means they are logically equivalent). We have thus established the relationship
$$\overline{B, C} = \bar{B} + \bar{C} \quad \text{and} \quad \overline{B, C} \neq \bar{B}, \bar{C}. \tag{2.1}$$
In addition, the last two columns of the table establish the relationship
$$\bar{B}, \bar{C} = \overline{B + C}. \tag{2.2}$$
Boole (1854) pointed out that the propositional statements in symbolic logic obey the rules of algebra provided one interprets them as having values of 1 or 0 (Boolean algebra). There are no operations equivalent to subtraction or division. The only operations required are multiplications (‘and’) and additions (‘or’).

Table 2.1

| $B$ | $C$ | $B, C$ | $A = \overline{B, C}$ | $D = \bar{B} + \bar{C}$ | $B + C$ | $\overline{B + C}$ | $\bar{B}, \bar{C}$ |
|---|---|---|---|---|---|---|---|
| T | T | T | F | F | T | F | F |
| T | F | F | T | T | T | F | F |
| F | T | F | T | T | T | F | F |
| F | F | F | T | T | F | T | T |
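The truth-table argument behind Equations (2.1) and (2.2) can also be checked mechanically. The following is a minimal Python sketch, not part of the original text; the helper name `equivalent` is hypothetical, and Python's `and`, `or`, `not` stand in for the logical product, logical sum, and negation bar.

```python
from itertools import product

def equivalent(f, g, n):
    """True if f and g have the same truth value for every assignment of n propositions."""
    return all(f(*v) == g(*v) for v in product([True, False], repeat=n))

# Equation (2.1): not (B and C)  ==  (not B) or (not C)
eq_2_1 = equivalent(lambda b, c: not (b and c),
                    lambda b, c: (not b) or (not c), 2)

# Equation (2.2): (not B) and (not C)  ==  not (B or C)
eq_2_2 = equivalent(lambda b, c: (not b) and (not c),
                    lambda b, c: not (b or c), 2)

# The inequality in (2.1): not (B and C) differs from (not B) and (not C)
differs = not equivalent(lambda b, c: not (b and c),
                         lambda b, c: (not b) and (not c), 2)

print(eq_2_1, eq_2_2, differs)
```

Enumerating every assignment is exactly what the truth table does by hand, so the same helper can be reused for identities in three or more propositions by changing `n`.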
Box 2.1 Worked exercise: construct a truth table to show $A, (B + C) = A, B + A, C$.

| $A$ | $B$ | $C$ | $B + C$ | $A, (B + C)$ | $A, B$ | $A, C$ | $A, B + A, C$ |
|---|---|---|---|---|---|---|---|
| T | T | T | T | T | T | T | T |
| T | F | F | F | F | F | F | F |
| T | T | F | T | T | T | F | T |
| T | F | T | T | T | F | T | T |
| F | T | T | T | F | F | F | F |
| F | F | F | F | F | F | F | F |
| F | T | F | T | F | F | F | F |
| F | F | T | T | F | F | F | F |

Since $A, (B + C)$ and $A, B + A, C$ have the same truth value for all possible truth values of propositions A, B and C, we can write $A, (B + C) = A, B + A, C$. (This is a distributivity identity.)
One surprising result of Boolean algebra manipulations is that a given statement may take several different forms which don’t resemble one another. For example, show that $D \equiv A + B, C = (A + B), (A + C)$. In the proof below, we make use of the relationships $\overline{X, Y} = \bar{X} + \bar{Y}$ and $\bar{X}, \bar{Y} = \overline{X + Y}$, from Equations (2.1) and (2.2).
$$\begin{aligned}
\bar{D} &= \overline{A + B, C} = \bar{A}, \overline{B, C} \\
\bar{D} &= \bar{A}, (\bar{B} + \bar{C}) \\
\bar{D} &= \bar{A}, \bar{B} + \bar{A}, \bar{C} = \overline{(A + B)} + \overline{(A + C)} \\
\bar{D} &= \overline{(A + B), (A + C)} \\
D &= (A + B), (A + C)
\end{aligned}$$
or
$$A + B, C = (A + B), (A + C).$$
This can also be verified by constructing a truth table.
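Basic Boolean Identities

| Identity | Logical product form | Logical sum form |
|---|---|---|
| Idempotence | $A, A = A$ | $A + A = A$ |
| Commutativity | $A, B = B, A$ | $A + B = B + A$ |
| Associativity | $A, (B, C) = (A, B), C = A, B, C$ | $A + (B + C) = (A + B) + C = A + B + C$ |
| Distributivity | $A, (B + C) = A, B + A, C$ | $A + (B, C) = (A + B), (A + C)$ |
| Duality | If $C = A, B$, then $\bar{C} = \bar{A} + \bar{B}$ | If $D = A + B$, then $\bar{D} = \bar{A}, \bar{B}$ |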
By the application of these identities, one can prove any number of further relations, some highly non-trivial. For example, we shall presently have use for the rather elementary “theorem”: if $\bar{B} = A, D$, then
$$A, \bar{B} = A, A, D = A, D = \bar{B}, \tag{2.3}$$
so $A, \bar{B} = \bar{B}$. Also, we can show that
$$B, \bar{A} = \bar{A}. \tag{2.4}$$
Proof of the latter follows from
$$B = \overline{A, D} = \bar{A} + \bar{D}, \qquad B, \bar{A} = \bar{A}, \bar{A} + \bar{A}, \bar{D} = \bar{A} + \bar{A}, \bar{D} = \bar{A}. \tag{2.5}$$
Clearly, Equation (2.5) is true if $\bar{A}$ is true and false if $\bar{A}$ is false, regardless of the truth of $\bar{D}$.
2.2.4 Deductive inference
Deductive inference is the process of reasoning from one proposition to another. It was recognized by Aristotle (fourth century BC) that deductive inference can be analyzed into repeated applications of the strong syllogisms:
1. If A is true, then B is true (major premise)
   A is true (minor premise)
   Therefore B is true (conclusion)
2. If A is true, then B is true
   B is false
   Therefore A is false
In Boolean algebra, these strong syllogisms can be written as:
$$A = A, B. \tag{2.6}$$
This equation says that the truth value of proposition $A, B$ is equal to the truth value of proposition A. It does not assert that either A or B is true. Clearly, if B is false, then the right hand side of the equation equals 0, and so A must be false. On the other hand, if B is known to be true, then according to Equation (2.6), proposition A can be true or false. It is also written as the implication operation $A \Rightarrow B$.
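The content of Equation (2.6) can be made concrete by listing the truth assignments it allows. The sketch below is an illustration only, with hypothetical variable names: it keeps the (A, B) pairs for which the statement A = A,B holds and then reads off the two strong syllogisms.

```python
from itertools import product

# Keep only the (A, B) states consistent with the major premise A = A,B.
allowed = [(a, b) for a, b in product([True, False], repeat=2)
           if a == (a and b)]

# Strong syllogism 1: whenever A is true, B is true.
syllogism_1 = all(b for a, b in allowed if a)

# Strong syllogism 2: whenever B is false, A is false.
syllogism_2 = all(not a for a, b in allowed if not b)

# If B is known to be true, A may still be either true or false.
a_values_when_b_true = {a for a, b in allowed if b}

print(allowed, syllogism_1, syllogism_2, a_values_when_b_true)
```

Only the state (A true, B false) is excluded, which is why knowing B is true leaves A undetermined.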
2.2.5 Inductive or plausible inference
In almost all situations confronting us, we do not have the information required to do deductive inference. We have to fall back on weaker syllogisms:
   If A is true, then B is true
   B is true
   Therefore A becomes more plausible
Example
$A \equiv$ “It will start to rain by 10 AM at the latest.”
$B \equiv$ “The sky becomes cloudy before 10 AM.”
Observing clouds at 9:45 AM does not give us logical certainty that rain will follow; nevertheless, our common sense, obeying the weak syllogism, may induce us to change our plans and behave as if we believed that it will rain, if the clouds are sufficiently dark. This example also shows that the major premise, “If A then B,” expresses B only as a logical consequence of A and not necessarily as a causal consequence (i.e., the rain is not the cause of the clouds). Another weak syllogism:
   If A is true, then B is true
   A is false
   Therefore B becomes less plausible
2.3 Brief history The early work on probability theory by James Bernoulli (1713), Rev. Thomas Bayes (1763), and Pierre Simon Laplace (1774), viewed probability as an extension of logic to the case where, because of incomplete information, Aristotelian deductive reasoning is unavailable. Unfortunately, Laplace failed to give convincing arguments to show why
the Bayesian definition of probability uniquely required the sum and product rules for manipulating probabilities. The frequentist definition of probability was introduced to satisfy this point, but in the process, eliminated the interpretation of probability as extended logic. This caused a split in the subject into the Bayesian and frequentist camps. The frequentist approach dominated statistical inference throughout most of the twentieth century, but the Bayesian viewpoint was kept alive notably by Sir Harold Jeffreys (1891–1989). In the 1940s and 1950s, G. Polya, R. T. Cox and E. T. Jaynes provided the missing rationale for Bayesian probability theory. In his book Mathematics and Plausible Reasoning, George Polya dissected our ‘‘common sense’’ into a set of elementary desiderata and showed that mathematicians had been using them all along to guide the early stages of discovery, which necessarily precede the finding of a rigorous proof. When one added (see Section 2.5.1) the consistency desiderata of Cox (1946) and Jaynes, the result was a proof that, if degrees of plausibility are represented by real numbers, then there is a unique set of rules for conducting inference according to Polya’s desiderata which provides for an operationally defined scale of plausibility. The final result was just the standard product and sum rules of probability theory, given axiomatically by Bernoulli and Laplace! The important new feature is that these rules are now seen as uniquely valid principles of logic in general, making no reference to ‘‘random variables’’, so their range of application is vastly greater than that supposed in the conventional probability theory that was developed in the early twentieth century. With this came a revival of the notion of probability theory as extended logic. The work of Cox and Jaynes was little appreciated at first. Widespread application of Bayesian methodology did not occur until the 1980s. 
By this time computers had become sufficiently powerful to demonstrate that the methodology could outperform standard techniques in many areas of science. We are now in the midst of a ‘‘Bayesian Revolution’’ in statistical inference. In spite of this, many scientists are still unaware of the significance of the revolution and the frequentist approach currently dominates statistical inference. New graduate students often find themselves caught between the two cultures. This book represents an attempt to provide a bridge.
2.4 An adequate set of operations
So far, we have discussed the following logical operations:
$A, B$ logical product (conjunction)
$A + B$ logical sum (disjunction)
$A \Rightarrow B$ implication
$\bar{A}$ negation
By combining these operations repeatedly in every possible way, we can generate any number of new propositions, such as:
$$C \equiv (A + \bar{B}), (\bar{A} + A, \bar{B}) + \bar{A}, B, (A + B). \tag{2.7}$$
We now consider the following questions: 1. How large is the class of new propositions? 2. Is it infinite or finite? 3. Can every proposition defined from A and B be represented in terms of the above operations, or are new operations required? 4. Are the four operations already over-complete?
Note: two propositions are not different from the standpoint of logic if they have the same truth value. C, in the above equation, is logically the same statement as the implication $C = (B \Rightarrow \bar{A})$. Recall that the implication $B \Rightarrow \bar{A}$ can also be written as $B = \bar{A}, B$. This does not assert that either $\bar{A}$ or B is true; it only means that $A, B$ is false, or equivalently that $(\bar{A} + \bar{B})$ is true.
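Box 2.2 Worked exercise: expand the right hand side (RHS) of proposition C given by Equation (2.7), and show that it can be reduced to $(\bar{A} + \bar{B})$.
$$\text{RHS} = A, \bar{A} + \bar{A}, \bar{B} + A, A, \bar{B} + A, \bar{B}, \bar{B} + \bar{A}, A, B + \bar{A}, B, B$$
Drop all terms that are clearly impossible (false), e.g., $A, \bar{A}$. Adding any number of impossible propositions to a proposition in a logical sum does not alter the truth value of the proposition. It is like adding a zero to a function; it doesn’t alter the value of the function.
$$\begin{aligned}
&= \bar{A}, \bar{B} + A, \bar{B} + \bar{A}, B \\
&= \bar{A}, (\bar{B} + B) + A, \bar{B} \\
&= \bar{A} + A, \bar{B} \\
&= \overline{\overline{\bar{A} + A, \bar{B}}} \\
&= \overline{A, \overline{A, \bar{B}}} \\
&= \overline{A, (\bar{A} + B)} \\
&= \overline{A, B} \\
&= \bar{A} + \bar{B}.
\end{aligned}$$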
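The reduction in Box 2.2 can be cross-checked by brute force. The sketch below is an illustration only; it assumes Equation (2.7) reads C ≡ (A + not-B),(not-A + A,not-B) + not-A,B,(A + B), with the comma as AND, the plus sign as OR, and the bar as NOT. The function name `c_rhs` is hypothetical.

```python
from itertools import product

def c_rhs(a, b):
    """Right hand side of Equation (2.7) under the reading stated above."""
    first = (a or not b) and ((not a) or (a and not b))
    second = (not a) and b and (a or b)
    return first or second

states = list(product([True, False], repeat=2))

# Reduced form from Box 2.2: (not A) + (not B)
matches_reduced_form = all(c_rhs(a, b) == ((not a) or (not b)) for a, b in states)

# The implication B => not A, written as the proposition B = (not A),B
matches_implication = all(c_rhs(a, b) == (b == ((not a) and b)) for a, b in states)

print(matches_reduced_form, matches_implication)
```

The only state in which C is false is A and B both true, which is the single case the implication rules out.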
2.4.1 Examination of a logic function
Any logic function $C = f(A, B)$ has only two possible values, and likewise for the independent variables A and B. A logic function with n variables is defined on a discrete space consisting of only $m = 2^n$ points. For example, in the case of $C = f(A, B)$, $m = 4$ points; namely those at which A and B take on the values {TT, TF, FT, FF}. The number of independent logic functions $= 2^m = 16$. Table 2.2 lists these 16 logical functions.
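Table 2.2 Logic functions of the two propositions A and B.

| $A, B$ | TT | TF | FT | FF | Equivalent form |
|---|---|---|---|---|---|
| $f_1(A, B)$ | T | F | F | F | $= A, B$ |
| $f_2(A, B)$ | F | T | F | F | $= A, \bar{B}$ |
| $f_3(A, B)$ | F | F | T | F | $= \bar{A}, B$ |
| $f_4(A, B)$ | F | F | F | T | $= \bar{A}, \bar{B}$ |
| $f_5(A, B)$ | T | T | T | T | |
| $f_6(A, B)$ | T | T | T | F | |
| $f_7(A, B)$ | T | T | F | T | |
| $f_8(A, B)$ | T | F | T | T | |
| $f_9(A, B)$ | F | T | T | T | |
| $f_{10}(A, B)$ | T | T | F | F | |
| $f_{11}(A, B)$ | T | F | T | F | |
| $f_{12}(A, B)$ | F | T | T | F | |
| $f_{13}(A, B)$ | F | T | F | T | |
| $f_{14}(A, B)$ | F | F | T | T | |
| $f_{15}(A, B)$ | T | F | F | T | |
| $f_{16}(A, B)$ | F | F | F | F | $= A, \bar{A}$ |

We can show that $f_5 \to f_{16}$ are logical sums of $f_1 \to f_4$.
Example 1:
$$\begin{aligned}
f_1 + f_3 + f_4 &= A, B + \bar{A}, B + \bar{A}, \bar{B} \\
&= B + \bar{A}, \bar{B} = (B + \bar{A}), (B + \bar{B}) \quad \text{(last step is a distributivity identity)} \\
&= B + \bar{A} = f_8.
\end{aligned} \tag{2.8}$$
Example 2:
$$f_2 + f_4 = A, \bar{B} + \bar{A}, \bar{B} = (A + \bar{A}), \bar{B} = \bar{B} = f_{13}. \tag{2.9}$$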
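The two examples above generalize: every truth column in Table 2.2 arises as a logical sum of some subset of the four basic conjunctions. A minimal Python sketch of that enumeration follows; the names `basic`, `logical_sum`, and `truth_column` are hypothetical, and the empty sum is taken to be the always-false function.

```python
from itertools import product

# States in the column order of Table 2.2: TT, TF, FT, FF
STATES = [(True, True), (True, False), (False, True), (False, False)]

# The four basic conjunctions f1..f4, each true at exactly one state.
basic = [
    lambda a, b: a and b,              # f1 = A,B
    lambda a, b: a and not b,          # f2 = A,(not B)
    lambda a, b: (not a) and b,        # f3 = (not A),B
    lambda a, b: (not a) and (not b),  # f4 = (not A),(not B)
]

def logical_sum(indices):
    """Logical sum (OR) of the selected basic conjunctions."""
    return lambda a, b: any(basic[i](a, b) for i in indices)

def truth_column(f):
    """Truth values of f at TT, TF, FT, FF."""
    return tuple(bool(f(a, b)) for a, b in STATES)

# Each subset of {f1, f2, f3, f4} yields one truth column.
columns = {}
for mask in product([False, True], repeat=4):
    idx = [i for i, used in enumerate(mask) if used]
    columns[truth_column(logical_sum(idx))] = idx

n_functions = len(columns)                 # number of distinct logic functions
f8 = truth_column(logical_sum([0, 2, 3]))  # f1 + f3 + f4, Example 1
f13 = truth_column(logical_sum([1, 3]))    # f2 + f4, Example 2
print(n_functions, f8, f13)
```

Because each basic conjunction is true at exactly one state, choosing a subset is the same as choosing where the function is true, which is why the count comes out to 2 to the power 4.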
This method (called “reduction to disjunctive normal form” in logic textbooks) will work for any n. Thus, one can verify that the three operations

| conjunction | disjunction | negation |
|---|---|---|
| logical product | logical sum | negation |
| AND | OR | NOT |

suffice to generate all logic functions, i.e., form an adequate set. But the logical sum $A + B$ is the same as denying that they are both false: $A + B = \overline{\bar{A}, \bar{B}}$. Therefore AND and NOT are already an adequate set. Is there a still smaller set? Answer: Yes. NAND, defined as the negation of AND, which is represented by $A \uparrow B$:
$$\begin{aligned}
A \uparrow B &\equiv \overline{A, B} = \bar{A} + \bar{B} \\
\bar{A} &= A \uparrow A \\
A, B &= (A \uparrow B) \uparrow (A \uparrow B) \\
A + B &= (A \uparrow A) \uparrow (B \uparrow B).
\end{aligned}$$
Every logic function can be constructed from NAND alone. The NOR operator is defined by
$$A \downarrow B \equiv \overline{A + B} = \bar{A}, \bar{B}$$
and is also powerful enough to generate all logic functions:
$$\begin{aligned}
\bar{A} &= A \downarrow A \\
A + B &= (A \downarrow B) \downarrow (A \downarrow B) \\
A, B &= (A \downarrow A) \downarrow (B \downarrow B).
\end{aligned}$$
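The NAND and NOR constructions listed above are easy to confirm by exhaustion. The following Python sketch is illustrative only (the function names `nand` and `nor` are hypothetical); it checks each identity over all four truth assignments.

```python
from itertools import product

def nand(a, b):
    """A NAND B: the negation of (A and B)."""
    return not (a and b)

def nor(a, b):
    """A NOR B: the negation of (A or B)."""
    return not (a or b)

checks = []
for a, b in product([True, False], repeat=2):
    # NOT, AND, OR built from NAND alone
    checks.append(nand(a, a) == (not a))
    checks.append(nand(nand(a, b), nand(a, b)) == (a and b))
    checks.append(nand(nand(a, a), nand(b, b)) == (a or b))
    # NOT, OR, AND built from NOR alone
    checks.append(nor(a, a) == (not a))
    checks.append(nor(nor(a, b), nor(a, b)) == (a or b))
    checks.append(nor(nor(a, a), nor(b, b)) == (a and b))

all_hold = all(checks)
print(all_hold)
```

Since AND and NOT already form an adequate set, recovering both from a single operator is enough to show that operator is adequate on its own.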
2.5 Operations for plausible inference
We now turn to the extension of logic for a common situation where we lack the axiomatic information necessary for deductive logic. The goal, according to Jaynes, is to arrive at a useful mathematical theory of plausible inference which will enable us to build a robot (write a computer program) to quantify the plausibility of any hypothesis in our hypothesis space of interest based on incomplete information. For example, given $10^7$ observations, determine (in the light of these data and whatever prior information is at hand) the relative plausibilities of many different hypotheses about the causes at work. We expect that any mathematical model we succeed in constructing will be replaced by more complete ones in the future as part of the much grander goal of developing a theory of common sense reasoning. Experience in physics has shown that as knowledge advances, we are able to invent better models, which reproduce more features of the real world, with more accuracy. We are also accustomed to finding that these advances lead to consequences of great practical value, like a computer program to carry out useful plausible inference following clearly defined principles (rules or operations) expressing an idealized common sense. The rules of plausible inference are deduced from a set of three desiderata (see Section 2.5.1) rather than axioms, because they do not assert anything is true, but only state what appear to be desirable goals. We would definitely want to revise the
operation of our robot or computer program if they violated one of these elementary desiderata. Whether these goals are attainable without contradiction and whether they determine any unique extension of logic are a matter of mathematical analysis. We also need to compare the inference of a robot built in this way to our own reasoning, to decide whether we are prepared to trust the robot to help us with our inference problems.
2.5.1 The desiderata of Bayesian probability theory
I. Degrees of plausibility are represented by real numbers.
II. The measure of plausibility must exhibit qualitative agreement with rationality. This means that as new information supporting the truth of a proposition is supplied, the number which represents the plausibility will increase continuously and monotonically. Also, to maintain rationality, the deductive limit must be obtained where appropriate.
III. Consistency
   (a) Structural consistency: If a conclusion can be reasoned out in more than one way, every possible way must lead to the same result.
   (b) Propriety: The theory must take account of all information, provided it is relevant to the question.
   (c) Jaynes consistency: Equivalent states of knowledge must be represented by equivalent plausibility assignments. For example, if $A, B \mid C = B \mid C$, then the plausibility of $A, B \mid C$ must equal the plausibility of $B \mid C$.
2.5.2 Development of the product rule
In Section 2.4 we established that the logical product and negation (AND, NOT) are an adequate set of operations to generate any proposition derivable from $\{A_1, \ldots, A_N\}$. For Bayesian inference, our goal is to find operations (rules) to determine the plausibility of logical conjunction and negation that satisfy the above desiderata. Start with the plausibility of $A, B$: Let $(A, B \mid C) \equiv$ plausibility of $A, B$ supposing the truth of C. Remember, we are going to represent plausibility by real numbers (desideratum I). Now $(A, B \mid C)$ must be a function of some combination of $(A \mid C)$, $(B \mid C)$, $(B \mid A, C)$, $(A \mid B, C)$. There are 11 possibilities:
$$\begin{aligned}
(A, B \mid C) &= F_1[(A \mid C), (A \mid B, C)] \\
(A, B \mid C) &= F_2[(A \mid C), (B \mid C)] \\
(A, B \mid C) &= F_3[(A \mid C), (B \mid A, C)] \\
(A, B \mid C) &= F_4[(A \mid B, C), (B \mid C)] \\
(A, B \mid C) &= F_5[(A \mid B, C), (B \mid A, C)] \\
(A, B \mid C) &= F_6[(B \mid C), (B \mid A, C)] \\
(A, B \mid C) &= F_7[(A \mid C), (A \mid B, C), (B \mid C)] \\
(A, B \mid C) &= F_8[(A \mid C), (A \mid B, C), (B \mid A, C)] \\
(A, B \mid C) &= F_9[(A \mid C), (B \mid C), (B \mid A, C)] \\
(A, B \mid C) &= F_{10}[(A \mid B, C), (B \mid C), (B \mid A, C)] \\
(A, B \mid C) &= F_{11}[(A \mid C), (A \mid B, C), (B \mid C), (B \mid A, C)]
\end{aligned}$$
Box 2.3 Note on the use of the “=” sign
1. In Boolean algebra, the equals sign is used to denote equal truth value. By definition, $A = B$ asserts that A is true if and only if B is true.
2. When talking about plausibility, which is represented by a real number, $(A, B \mid C) = (\cdot)(\cdot) \ldots$ means equal numerically.
3. $\equiv$ means equal by definition.
Now let us examine these 11 different functions more closely. Since the order in which the symbols A and B appear has no meaning (i.e., $A, B = B, A$) it follows that
$$\begin{aligned}
F_1[(A \mid C), (A \mid B, C)] &= F_6[(B \mid C), (B \mid A, C)] \\
F_3[(A \mid C), (B \mid A, C)] &= F_4[(A \mid B, C), (B \mid C)] \\
F_7[(A \mid C), (A \mid B, C), (B \mid C)] &= F_9[(A \mid C), (B \mid C), (B \mid A, C)] \\
F_8[(A \mid C), (A \mid B, C), (B \mid A, C)] &= F_{10}[(A \mid B, C), (B \mid C), (B \mid A, C)]
\end{aligned}$$
This reduces the number of equations dramatically from 11 to 7. The seven functions remaining are $F_1, F_2, F_3, F_5, F_7, F_8, F_{11}$. If any function leads to an absurdity in even one example, it must be ruled out, even if for other examples it would be satisfactory. Consider
$$(A, B \mid C) = F_2[(A \mid C), (B \mid C)].$$
Suppose
$A \equiv$ next person will have blue left eye.
$B \equiv$ next person will have brown right eye.
$C \equiv$ prior information concerning our expectation that the left and right eye colors of any individual will be very similar.
Now $(A \mid C)$ could be very plausible as could $(B \mid C)$, but $(A, B \mid C)$ is extremely implausible. We rule out functions of this form because they have no way of taking such influence into account. Our robot could not reason the way humans do, even qualitatively, with that functional form.
Similarly, we can rule out $F_1$ for the extreme case where the conditional (given) information represented by proposition C is that “A and B are independent.” In this extreme case, $(A \mid B, C) = (A \mid C)$. Therefore,
$$(A, B \mid C) = F_1[(A \mid C), (A \mid B, C)] = F_1[(A \mid C), (A \mid C)], \tag{2.10}$$
which is clearly absurd because $F_1$ claims that the plausibility of $A, B \mid C$ depends only on the plausibility of $A \mid C$. Other extreme conditions are $A = B$, $A = C$, $C = \bar{A}$, etc. Carrying out this type of analysis, Tribus (1969) shows that all but one of the remaining possibilities can exhibit qualitative violations with common sense in some extreme case. There is only one survivor which can be written in two equivalent ways:
$$(A, B \mid C) = F[(B \mid C), (A \mid B, C)] = F[(A \mid C), (B \mid A, C)]. \tag{2.11}$$
In addition, desideratum II, qualitative agreement with common sense, requires that $F[(A \mid C), (B \mid A, C)]$ must be a continuous monotonic function of $(A \mid C)$ and $(B \mid A, C)$. The continuity assumption requires that if $(A \mid C)$ changes only infinitesimally, it can induce only an infinitesimal change in $(A, B \mid C)$ or $(\bar{A} \mid C)$.
Now use desideratum III: “Consistency.” Suppose we want $(A, B, C \mid D)$.
1. Consider $B, C$ to be a single proposition at first; then we can apply Equation (2.11):
$$(A, B, C \mid D) = F[(B, C \mid D), (A \mid B, C, D)] = F\{F[(C \mid D), (B \mid C, D)], (A \mid B, C, D)\}. \tag{2.12}$$
2. Consider $A, B$ to be a single proposition at first:
$$(A, B, C \mid D) = F[(C \mid D), (A, B \mid C, D)] = F\{(C \mid D), F[(B \mid C, D), (A \mid B, C, D)]\}. \tag{2.13}$$
For consistency, 1 and 2 must be equal. Let $x \equiv (A \mid B, C, D)$, $y \equiv (B \mid C, D)$, $z \equiv (C \mid D)$, then:
$$F\{x, F[y, z]\} = F\{F[x, y], z\}. \tag{2.14}$$
This equation has a long history in mathematics and is called “the Associativity Equation.” Aczél (1966) derives the general solution (Equation (2.15) below) without assuming differentiability; unfortunately, the proof fills 11 pages of his book. R. T. Cox (1961) provided a shorter proof, but assumed differentiability.
The solution is
$$w\{F[x, y]\} = w\{x\}\, w\{y\}, \tag{2.15}$$
where $w\{x\}$ is any positive continuous monotonic function. In the case of just two propositions, A, B given the truth of C, the solution to the associativity equation becomes
$$w\{(A, B \mid C)\} = w\{(A \mid B, C)\}\, w\{(B \mid C)\} = w\{(B \mid A, C)\}\, w\{(A \mid C)\}. \tag{2.16}$$
For simplicity, drop the {} brackets, but it should be remembered that the argument of w is a plausibility.
$$w(A, B \mid C) = w(A \mid B, C)\, w(B \mid C) = w(B \mid A, C)\, w(A \mid C). \tag{2.17}$$
Henceforth this will be called the product rule. Recall that at this moment, $w(\cdot)$ is any positive, continuous, monotonic function.
Desideratum II: Qualitative correspondence with common sense imposes further restrictions on $w\{x\}$.
Suppose A is certain given C. Then $A, B \mid C = B \mid C$ (i.e., same truth value). By our primitive axiom that propositions with the same truth value must have the same plausibility,
$$(A, B \mid C) = (B \mid C), \qquad (A \mid B, C) = (A \mid C). \tag{2.18}$$
Therefore, Equation (2.17), the solution to the associativity equation, becomes
$$w(B \mid C) = w(A \mid C)\, w(B \mid C). \tag{2.19}$$
This is only true when $A \mid C$ is certain. Thus we have arrived at a new constraint on $w(\cdot)$; it must equal 1 when the argument is certain. For the next constraint, suppose that A is impossible given C. This implies
$$A, B \mid C = A \mid C, \qquad A \mid B, C = A \mid C.$$
Then
$$w(A, B \mid C) = w(A \mid B, C)\, w(B \mid C) \tag{2.20}$$
becomes
$$w(A \mid C) = w(A \mid C)\, w(B \mid C). \tag{2.21}$$
This must be true for any $(B \mid C)$. There are only two choices: either $w(A \mid C) = 0$ or $+\infty$.
1. $w(x)$ is a positive, increasing function $(0 \to 1)$.
2. $w(x)$ is a positive, decreasing function $(\infty \to 1)$.
They do not differ in content. Suppose $w_1(x)$ represents impossibility by $+\infty$. We can define $w_2(x) = 1/w_1(x)$ which represents impossibility by 0. Therefore, there is no loss of generality if we adopt:
$$0 \le w(x) \le 1.$$
Summary: Using our desiderata, we have arrived at our present form of the product rule:
$$w(A, B \mid C) = w(A \mid C)\, w(B \mid A, C) = w(B \mid C)\, w(A \mid B, C).$$
At this point we are still not referring to $w(x)$ as the probability of x. $w(x)$ is any continuous, monotonic function satisfying:
$$0 \le w(x) \le 1,$$
where $w(x) = 0$ when the argument x is impossible and 1 when x is certain.
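As a quick consistency check (a sketch, not a substitute for the proofs of Aczél or Cox), one can confirm that any F built from Equation (2.15) satisfies the associativity equation (2.14). The sketch assumes w is strictly monotonic, so that it has an inverse, written here as $w^{-1}$, and that products of values of w stay within its range; the symbol $w^{-1}$ is introduced only for this illustration.

$$\begin{aligned}
F[x, y] &\equiv w^{-1}\{w(x)\, w(y)\} \\
w\{F[x, F[y, z]]\} &= w(x)\, w\{F[y, z]\} = w(x)\, w(y)\, w(z) \\
w\{F[F[x, y], z]\} &= w\{F[x, y]\}\, w(z) = w(x)\, w(y)\, w(z)
\end{aligned}$$

Both sides have the same image under w, and w is invertible, so $F\{x, F[y, z]\} = F\{F[x, y], z\}$. The harder direction, that every continuous monotonic solution has this form, is what the cited proofs establish.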
2.5.3 Development of sum rule
We have succeeded in deriving an operation for determining the plausibility of the logical product (conjunction). We now turn to the problem of finding an operation to determine the plausibility of negation. Since the logical sum $A + \bar{A}$ is always true, it follows that the plausibility that A is false must depend on the plausibility that A is true. Thus, there must exist some functional relation
$$w(\bar{A} \mid B) = S(w(A \mid B)). \tag{2.22}$$
Again, using our desiderata and functional analysis, one can show (Jaynes, 2003) that the monotonic function $w(A \mid B)$ obeys
$$w^m(A \mid B) + w^m(\bar{A} \mid B) = 1$$
for positive m. This is known as the sum rule. The product rule can equally well be written as
$$w^m(A, B \mid C) = w^m(A \mid C)\, w^m(B \mid A, C) = w^m(B \mid C)\, w^m(A \mid B, C).$$
But then we see that the value of m is actually irrelevant; for whatever value is chosen, we can define a new function $p(x) \equiv w^m(x)$ and our rules take the form
$$p(A, B \mid C) = p(A \mid C)\, p(B \mid A, C) = p(B \mid C)\, p(A \mid B, C)$$
$$p(A \mid B) + p(\bar{A} \mid B) = 1$$
This entails no loss of generality, for the only requirement we imposed on the function $w(x)$ is that $w(x)$ is a continuous, monotonic, increasing function ranging from $w = 0$ for impossibility to $w = 1$ for certainty. But if $w(x)$ satisfies this, so does $w^m(x)$, $0 < m < \infty$. Reminder: We are still not referring to $p(x)$ as a probability. We showed earlier that conjunction, $A, B$, and negation, $\bar{A}$, are an adequate set of operations, from which all logic functions can be constructed. Therefore, it ought to be possible, by repeated applications of the product and sum rules, to arrive at the plausibility of any proposition. To show this, we derive a formula for the logical sum $A + B$.
$$\begin{aligned}
p(A + B \mid C) &= 1 - p(\overline{A + B} \mid C) = 1 - p(\bar{A}, \bar{B} \mid C) \\
&= 1 - p(\bar{A} \mid C)\, p(\bar{B} \mid \bar{A}, C) \\
&= 1 - p(\bar{A} \mid C)\, [1 - p(B \mid \bar{A}, C)] \\
&= 1 - p(\bar{A} \mid C) + p(\bar{A} \mid C)\, p(B \mid \bar{A}, C) \\
&= p(A \mid C) + p(\bar{A}, B \mid C) \\
&= p(A \mid C) + p(B \mid C)\, p(\bar{A} \mid B, C) \\
&= p(A \mid C) + p(B \mid C)\, [1 - p(A \mid B, C)] \\
&= p(A \mid C) + p(B \mid C) - p(B \mid C)\, p(A \mid B, C)
\end{aligned} \tag{2.23}$$
$$\Rightarrow p(A + B \mid C) = p(A \mid C) + p(B \mid C) - p(A, B \mid C).$$
This is a very useful relationship and is called the extended sum rule. Starting with our three desiderata, we arrived at a set of rules for plausible inference:
product rule: $p(A, B \mid C) = p(A \mid C)\, p(B \mid A, C) = p(B \mid C)\, p(A \mid B, C)$;
sum rule: $p(A \mid B) + p(\bar{A} \mid B) = 1$.
We have in the two rules formulae for the plausibility of the conjunction, $A, B$, and negation, $\bar{A}$, which are an adequate set of operations to generate any proposition derivable from the set $\{A_1, \ldots, A_N\}$. Using the product and sum rules, we also derived the extended sum rule
$$p(A + B \mid C) = p(A \mid C) + p(B \mid C) - p(A, B \mid C). \tag{2.24}$$
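A small numerical check can make the derivation of the extended sum rule tangible. The Python sketch below uses a hypothetical assignment of values to the four joint truth states of A and B (the numbers are invented for illustration and carry no meaning beyond summing to 1), then evaluates both the first line of the derivation and the final form of Equation (2.24) with exact fractions.

```python
from fractions import Fraction as Fr

# Hypothetical values for the four joint states (A, B) given C; they sum to 1.
joint = {
    (True, True): Fr(1, 10),
    (True, False): Fr(3, 10),
    (False, True): Fr(2, 10),
    (False, False): Fr(4, 10),
}

def p(event):
    """Total value assigned to the states where event(a, b) is true."""
    return sum((v for (a, b), v in joint.items() if event(a, b)), Fr(0))

p_A = p(lambda a, b: a)
p_B = p(lambda a, b: b)
p_AB = p(lambda a, b: a and b)
p_A_or_B = p(lambda a, b: a or b)

# First line of the derivation: 1 - p(notA|C) p(notB|notA,C)
p_notA = 1 - p_A                                   # sum rule
p_notB_given_notA = p(lambda a, b: (not a) and (not b)) / p_notA  # product rule
via_derivation = 1 - p_notA * p_notB_given_notA

# Final form, Equation (2.24)
via_extended_sum_rule = p_A + p_B - p_AB

print(p_A_or_B, via_derivation, via_extended_sum_rule)
```

All three quantities agree, as the chain of equalities requires; changing the invented numbers (keeping them non-negative and summing to 1, with p(A|C) less than 1) leaves the agreement intact.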
For mutually exclusive propositions $p(A, B \mid C) = 0$, so Equation (2.24) becomes
$$p(A + B \mid C) = p(A \mid C) + p(B \mid C). \tag{2.25}$$
We will refer to Equation (2.25) as the generalized sum rule.
2.5.4 Qualitative properties of product and sum rules
Check to see if the product and sum rules predict the strong (deductive logic) and weak (inductive logic) syllogisms.

Strong syllogisms:

| | (a) | (b) |
|---|---|---|
| major premise (prior information) | $A \Rightarrow B$ | $A \Rightarrow B$ |
| minor premise (data) | A true | B false |
| conclusion | B true | A false |

Example:
* Let $A \equiv$ “Corn was harvested in Eastern Canada in AD 1000.”
* Let $B \equiv$ “Corn seed was available in Eastern Canada in AD 1000.”
* Let $I \equiv$ “Corn seed is required to grow corn, so if corn was harvested, the seed must have been available.” This is our prior information or major premise.

In both cases, we start by writing down the product rule.

Syllogism (a):
$$p(A, B \mid I) = p(A \mid I)\, p(B \mid A, I) \;\rightarrow\; p(B \mid A, I) = \frac{p(A, B \mid I)}{p(A \mid I)}$$
Prior info. $I \equiv$ “$A, B = A$” $\rightarrow p(A, B \mid I) = p(A \mid I) \rightarrow p(B \mid A, I) = 1$, i.e., B is true if A is true.
Data: $A = 1$ (true) $\rightarrow p(B \mid A, I) = 1$. Certainty.

Syllogism (b):
$$p(A, \bar{B} \mid I) = p(\bar{B} \mid I)\, p(A \mid \bar{B}, I) \;\rightarrow\; p(A \mid \bar{B}, I) = \frac{p(A, \bar{B} \mid I)}{p(\bar{B} \mid I)}$$
Prior info. $I \equiv$ “$A, B = A$” $\rightarrow p(A, \bar{B} \mid I) = 0$, since B could not be false if A is true according to the information I.
Data: $\bar{B} = 1$ (B false) $\rightarrow p(A \mid \bar{B}, I) = 0$. Impossibility.

Weak syllogisms:

| | (a) | (b) |
|---|---|---|
| prior information | $A \Rightarrow B$ | $A \Rightarrow B$ |
| data | B true | A false |
| conclusion | A more plausible | B less plausible |

Start by writing down the product rule in the form of Bayes’ theorem.

Weak syllogism (a):
$$p(A \mid B, I) = \frac{p(A \mid I)\, p(B \mid A, I)}{p(B \mid I)}$$
Prior info. $I \equiv$ “$A, B = A$” $\rightarrow p(B \mid A, I) = 1$, and $p(B \mid I) \le 1$ since I says nothing about the truth of B.
Substituting into Bayes’ theorem $\rightarrow p(A \mid B, I) \ge p(A \mid I)$. A more plausible.

Weak syllogism (b):
$$p(B \mid \bar{A}, I) = \frac{p(B \mid I)\, p(\bar{A} \mid B, I)}{p(\bar{A} \mid I)}$$
Syllogism (a) gives $p(A \mid B, I) \ge p(A \mid I)$ based on the same prior information.
$$\rightarrow 1 - p(\bar{A} \mid B, I) \ge 1 - p(\bar{A} \mid I) \;\rightarrow\; p(\bar{A} \mid B, I) \le p(\bar{A} \mid I) \quad \text{or} \quad \frac{p(\bar{A} \mid B, I)}{p(\bar{A} \mid I)} \le 1$$
Substituting into Bayes’ theorem $\rightarrow p(B \mid \bar{A}, I) \le p(B \mid I)$. B less plausible.
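The four syllogisms can be illustrated with numbers. The Python sketch below is an assumption-laden example rather than anything from the text: it invents a joint assignment over the truth states of A and B in which the state "A true, B false" has value zero (encoding the prior information that A implies B), and then computes the conditional quantities with the product rule.

```python
from fractions import Fraction as Fr

# Hypothetical joint values over (A, B) given I; "A true, B false" is excluded.
joint = {
    (True, True): Fr(2, 10),
    (True, False): Fr(0),
    (False, True): Fr(3, 10),
    (False, False): Fr(5, 10),
}

def p(event, given=lambda a, b: True):
    """p(event | given, I) from the joint values, via the product rule."""
    num = sum((v for (a, b), v in joint.items() if event(a, b) and given(a, b)), Fr(0))
    den = sum((v for (a, b), v in joint.items() if given(a, b)), Fr(0))
    return num / den

p_A = p(lambda a, b: a)
p_B = p(lambda a, b: b)

# Strong syllogisms
p_B_given_A = p(lambda a, b: b, lambda a, b: a)            # certainty
p_A_given_notB = p(lambda a, b: a, lambda a, b: not b)     # impossibility

# Weak syllogisms
p_A_given_B = p(lambda a, b: a, lambda a, b: b)            # A more plausible
p_B_given_notA = p(lambda a, b: b, lambda a, b: not a)     # B less plausible

print(p_B_given_A, p_A_given_notB, p_A, p_A_given_B, p_B, p_B_given_notA)
```

With these invented numbers, learning B raises the value for A from 1/5 to 2/5, and learning that A is false lowers the value for B from 1/2 to 3/8, in line with the two weak syllogisms.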
2.6 Uniqueness of the product and sum rules
Corresponding to every different choice of continuous monotonic function $p(x)$ there seems to be a different set of rules. Nothing given so far tells us what numerical value of plausibility should be assigned at the beginning of the problem. To answer both issues, consider the following: suppose we have N mutually exclusive and exhaustive propositions $\{A_1, \ldots, A_N\}$.
* Mutually exclusive $\equiv$ only one can be true, i.e., $p(A_i, A_j \mid B) = 0$ for $i \neq j$:
$$p(A_1 + \cdots + A_N \mid B) = \sum_{i=1}^{N} p(A_i \mid B)$$
* Exhaustive $\equiv$ the true proposition is contained in the set:
$$\sum_{i=1}^{N} p(A_i \mid B) = 1 \ \text{(certain)}$$
This information is not enough to determine the individual $p(A_i \mid B)$ since there is no end to the variety of complicated information that might be contained in B. Development of new methods for translating prior information to numerical values of $p(A_i \mid B)$ is an ongoing research problem. We will discuss several valuable approaches to this problem in later chapters. Suppose our information B is indifferent regarding the $p(A_i \mid B)$’s. Then the only possibility that reflects this state of knowledge is:
$$p(A_i \mid B) = \frac{1}{N}, \tag{2.26}$$
where N is the number of mutually exclusive propositions. This is called the Principle of Indifference. In this one particular case, which can be generalized, we see that information B leads to a definite numerical value for $p(A_i \mid B)$, but not the numerical value of the plausibility $(A_i \mid B)$. Instead of saying $p(A_i \mid B)$ is an arbitrary, monotonic function of the plausibility $(A_i \mid B)$, it is much more useful to turn this around and say: “the plausibility $(A_i \mid B)$ is an arbitrary, monotonic function of $p(\cdot)$, defined in $0 \le p(\cdot) \le 1$.” It is $p(\cdot)$ that is rigidly fixed by the data, not $(A_i \mid B)$. The p’s define a particular scale on which degrees of plausibility can be measured. Out of the possible monotonic functions we pick this particular one, not because it is “correct,” but because it is more convenient. p is the quantity that obeys the product and sum rules and the numerical value of p is determined by the available information. From now on we will refer to them as probabilities. Jaynes (2003) writes, “This situation is analogous to that in thermodynamics, where out of all possible empirical temperature scales T, which are monotonic functions of each other, we finally decide to use the Kelvin scale T; not because it is more ‘correct’ than others, but because it is more convenient; i.e., the laws of thermodynamics take their simplest form [$dU = T\,dS - P\,dV$, $dG = -S\,dT + V\,dP$] in terms of this particular scale. Because of this, numerical values of Kelvin temperatures are directly measurable in experiments.” With this operational definition of probability, we can readily derive another intuitively pleasing result. In this problem, our prior information is: $I \equiv$ “An urn is filled with 10 balls of identical size, weight and texture, labeled 1, 2, ..., 10. Three of the balls (numbers 3, 4, 7) are red and the others are green. We are to shake the urn and draw one ball blindfolded.” Define the proposition: $E_i \equiv$ “the ith ball is drawn”, $1 \le i \le 10$. Since the prior information is indifferent to these ten possibilities, Equation (2.26) applies.
$$p(E_i \mid I) = \frac{1}{10}, \quad 1 \le i \le 10. \tag{2.27}$$
The proposition $R \equiv$ “that we draw a red ball” is equivalent to “we draw ball 3, 4, or 7.” This can be written as the logical sum statement:
$$R = E_3 + E_4 + E_7. \tag{2.28}$$
It follows from the extended sum rule that
$$p(R \mid I) = p(E_3 + E_4 + E_7 \mid I) = \frac{3}{10}, \tag{2.29}$$
in accordance with our intuition. More generally, if there are $N$ such balls, and the proposition $R$ is defined to be true for any specified subset of $M$ of them, and false on the rest, then we have
$$p(R|I) = \frac{M}{N}. \tag{2.30}$$
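The counting argument behind Equations (2.27) to (2.30) can be checked mechanically. The following is a minimal Python sketch, not part of the original text; the function name `indifference_probability` is introduced here purely for illustration. It assigns $1/N$ to each exclusive proposition, as in Equation (2.26), and then applies the extended sum rule to the chosen subset.

```python
from fractions import Fraction

def indifference_probability(n_total, subset):
    """p(R|I) when R is true for `subset` of n_total equally plausible outcomes.

    Hypothetical helper for illustration: outcomes are labeled 1..n_total.
    """
    subset = set(subset)
    if not subset <= set(range(1, n_total + 1)):
        raise ValueError("subset must contain labels between 1 and n_total")
    p_each = Fraction(1, n_total)  # Equation (2.26): p(E_i|I) = 1/N
    # Extended sum rule for mutually exclusive propositions: add the members.
    return sum((p_each for _ in subset), Fraction(0))

# Urn example: 10 balls, balls 3, 4 and 7 are red.
p_red = indifference_probability(10, {3, 4, 7})
print(p_red)  # 3/10
```

Exact rational arithmetic is used so that the result can be compared directly with $M/N$.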
2.7 Summary
Rather remarkably, the three desiderata of Section 2.5.1 have enabled us to arrive at a theory of extended logic together with a particular scale for measuring the plausibility of any hypothesis conditional on given information. We have shown that the rules for manipulating plausibility are the product and sum rules. The particular scale of plausibility we have adopted, now called probability, is determined by the data in a way that agrees with our intuition. We also showed that in the limit of complete information (certainty), the theory gives the same conclusions as the strong syllogisms of deductive inference. The main constructive requirement which determined the product and sum rules was the desideratum of structural consistency, “If a conclusion can be reasoned out in more than one way, every possible way must lead to the same result.” This does not mean that our rules have been proved consistent,¹ only that any other rules which represent degrees of plausibility by real numbers, but which differ in content from the product and sum rules, will lead to a violation of one of our desiderata.

Apart from the justification for probability as extended logic, the value of this approach to solving inference problems is being demonstrated on a regular basis in a wide variety of areas, leading both to new scientific discoveries and a new level of understanding. Modern computing power permits a simple comparison of the power of different approaches in the analysis of well-understood simulated data sets. Some examples of the power of Bayesian inference will be brought out in later chapters.
2.8 Problems
1. Construct a truth table to show $\overline{A, B} = \overline{A} + \overline{B}$.
2. Construct a truth table to show $A + (B, C) = (A + B), (A + C)$.
¹ According to Gödel’s theorem, no mathematical system can provide a proof of its own consistency.
3. With reference to Table 2.2, construct a truth table to show that $f_8(A, B) = A + B$.
4. Based on the available evidence, the probability that Jones is guilty is equal to 0.7, the probability that Susan is guilty is equal to 0.6, and the probability that both are guilty is equal to 0.5. Compute the probability that Jones is guilty and/or Susan is guilty.
5. The probability that Mr. Smith will make a donation is equal to 0.5, if his brother Harry has made a donation. The probability that Harry will make a donation is equal to 0.02. What is the probability that both men will make a donation?
3 The how-to of Bayesian inference
3.1 Overview
The first part of this chapter is devoted to a brief description of the methods and terminology employed in Bayesian inference and can be read as a stand-alone introduction on how to do Bayesian analysis.¹ Following a review of the basics in Section 3.2, we consider the two main inference problems: parameter estimation and model selection. This includes how to specify credible regions for parameters and how to eliminate nuisance parameters through marginalization. We also learn that Bayesian model comparison has a built-in “Occam’s razor,” which automatically penalizes complicated models, assigning them large probabilities only if the complexity of the data justifies the additional complication of the model. We also learn how this penalty arises through marginalization and depends both on the number of parameters and the prior ranges of these parameters. We illustrate these features with a detailed analysis of a toy spectral line problem and in the process introduce the Jeffreys prior and learn how different choices of priors affect our conclusions. We also have a look at a general argument for selecting priors for location and scale parameters in the early phases of an investigation when our state of ignorance is very high. The final section illustrates how Bayesian analysis provides valuable new insights on systematic errors and how to deal with them. I recommend that Sections 3.2 to 3.5 of this chapter be read twice; once quickly, and again after seeing these ideas applied in the detailed example treated in Sections 3.6 to 3.11.
¹ The treatment of this topic is a revised version of Section 2 of a paper by Gregory and Loredo (1992), which is reproduced here with the permission of the Astrophysical Journal.

3.2 Basics
In Bayesian inference, the viability of each member of a set of rival hypotheses, $\{H_i\}$, is assessed in the light of some observed data, $D$, by calculating the probability of each hypothesis, given the data and any prior information, $I$, we may have regarding the
hypotheses and data. Following a notation introduced by Jeffreys (1961), we write such a probability as $p(H_i|D, I)$, explicitly denoting the prior information by the proposition, $I$, to the right of the bar. At the very least, the prior information must specify the class of alternative hypotheses being considered (hypothesis space of interest), and the relationship between the hypotheses and the data (the statistical model). The basic rules for manipulating Bayesian probabilities are the sum rule,
$$p(H_i|I) + p(\overline{H}_i|I) = 1, \tag{3.1}$$
and the product rule,
$$p(H_i, D|I) = p(H_i|I)\,p(D|H_i, I) = p(D|I)\,p(H_i|D, I). \tag{3.2}$$
The various symbols appearing as arguments should be understood as propositions; for example, $D$ might be the proposition, “$N$ photons were counted in a time $T$.” The symbol $\overline{H}_i$ signifies the negation of $H_i$ (a proposition that is true if one of the alternatives to $H_i$ is true), and $(H_i, D)$ signifies the logical conjunction of $H_i$ and $D$ (a proposition that is true only if $H_i$ and $D$ are both true). The rules hold for any propositions, not just those indicated above. Throughout this work, we will be concerned with exclusive hypotheses, so that if one particular hypothesis is true, all others are false. For such hypotheses, we saw in Section 2.5.3 that the sum and product rules imply the generalized sum rule,
$$p(H_i + H_j|I) = p(H_i|I) + p(H_j|I). \tag{3.3}$$
To say that the hypothesis space of interest consists of $n$ mutually exclusive hypotheses means that for the purpose of the present analysis, we are assuming that one of them is true and the objective is to assign a probability to each hypothesis in this space, based on $D, I$. We will use normalized prior probability distributions, unless otherwise stated, such that
$$\sum_i p(H_i|I) = 1. \tag{3.4}$$
Here a “+” within a probability symbol stands for logical disjunction, so that $H_i + H_j$ is a proposition that is true if either $H_i$ or $H_j$ is true. One of the most important calculating rules in Bayesian inference is Bayes’ theorem, found by equating the two right hand sides of Equation (3.2) and solving for $p(H_i|D, I)$:
$$p(H_i|D, I) = \frac{p(H_i|I)\,p(D|H_i, I)}{p(D|I)}. \tag{3.5}$$
Bayes’ theorem describes a type of learning: how the probability for each member of a class of hypotheses should be modified on obtaining new information, $D$. The probabilities for the hypotheses in the absence of $D$ are called their prior probabilities, $p(H_i|I)$, and those including the information $D$ are called their posterior probabilities, $p(H_i|D, I)$. The quantity $p(D|H_i, I)$ is called the sampling probability for $D$, or the likelihood of $H_i$, and the quantity $p(D|I)$ is called the prior predictive probability for $D$, or the global likelihood for the entire class of hypotheses.

All of the rules we have written down so far show how to manipulate known probabilities to find the values of other probabilities. But to be useful in applications, we additionally need rules that assign numerical values or functions to the initial direct probabilities that will be manipulated. For example, to use Bayes’ theorem, we need to know the values of the three probabilities on the right side of Equation (3.5). These three probabilities are not independent. The quantity $p(D|I)$ must satisfy the requirement that the sum of the posterior probabilities over the hypothesis space of interest is equal to 1.
$$\sum_i p(H_i|D, I) = \frac{\sum_i p(H_i|I)\,p(D|H_i, I)}{p(D|I)} = 1. \tag{3.6}$$
Therefore,
$$p(D|I) = \sum_i p(H_i|I)\,p(D|H_i, I). \tag{3.7}$$
That is, the denominator of Bayes’ theorem, which does not depend on $H_i$, must be equal to the sum of the numerator over $H_i$. It thus plays the role of a normalization constant.
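For a discrete hypothesis space, Equations (3.5) and (3.7) amount to a multiply-and-normalize step. The short Python sketch below is offered only as an illustration; the function name and the numerical priors and likelihoods are assumptions, not values from the text.

```python
def posterior_probabilities(priors, likelihoods):
    """Return (posteriors, global_likelihood) for exclusive hypotheses H_i.

    priors[i]      = p(H_i|I)
    likelihoods[i] = p(D|H_i, I)
    """
    if len(priors) != len(likelihoods):
        raise ValueError("need one likelihood per prior")
    joint = [p * l for p, l in zip(priors, likelihoods)]  # numerators of Eq. (3.5)
    global_likelihood = sum(joint)                         # p(D|I), Eq. (3.7)
    return [j / global_likelihood for j in joint], global_likelihood

# Hypothetical numbers: two hypotheses with equal prior probability.
post, evidence = posterior_probabilities([0.5, 0.5], [0.2, 0.6])
print(post, evidence)
```

Because the same denominator divides every numerator, the posteriors sum to 1 regardless of the values supplied, which is the content of Equation (3.6).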
3.3 Parameter estimation
We frequently deal with problems in which a particular model is assumed to be true and the hypothesis space of interest concerns the values of the model parameters. For example, in a straight line model, the two parameters are the intercept and slope. We can look at this problem as a hypothesis space that is labeled, not by discrete numbers, but by the possible values of two continuous parameters. In such cases, the quantity of interest (see also Section 1.3.2) is a probability density function or PDF. More generally, “PDF” is an abbreviation for a probability distribution function which can apply to both discrete and continuous parameters. For example, given some prior information, $M$, specifying a parameterized model with one parameter, $\theta$, $p(\theta|M)$ is the prior density for $\theta$, which means that $p(\theta|M)\,d\theta$ is the prior probability that the true value of the parameter is in the interval $[\theta, \theta + d\theta]$. We use the same symbol, $p(\ldots)$, for probabilities and PDFs; the nature of the argument will identify which use is intended.
Bayes’ theorem, and all the other rules just discussed, hold for PDFs, with all sums replaced by integrals. For example, the global likelihood for model $M$ can be calculated with the continuous counterpart of Equation (3.7),
$$p(D|M) = \int d\theta\; p(\theta|M)\,p(D|\theta, M) = \mathcal{L}(M). \tag{3.8}$$
In words, the global likelihood of a model is equal to the weighted average likelihood for its parameters. We will utilize the global likelihood of a model in Section 3.5 where we deal with model comparison and Occam’s razor. If there is more than one parameter, multiple integrals are used. If the prior density and the likelihood are assigned directly, the global likelihood is an uninteresting normalization constant. The posterior PDF for the parameters is simply proportional to the product of the prior and the likelihood.

The use of Bayes’ theorem to determine what one can learn about the values of parameters from data is called parameter estimation, though strictly speaking, Bayesian inference does not provide estimates for parameters. Rather, the Bayesian solution to the parameter estimation problem is the full posterior PDF, $p(\theta|D, M)$, and not just a single point in parameter space. Of course, it is useful to summarize this distribution for textual, graphical, or tabular display in terms of a “best-fit” value and “error bars.” Possible summaries of the best-fit values are the posterior mode (most probable value of $\theta$) or the posterior mean,
$$\langle\theta\rangle = \int d\theta\; \theta\, p(\theta|D, M). \tag{3.9}$$
If the mode and mean are very different, the posterior PDF is too asymmetric to be adequately summarized by a single estimate. An allowed range for a parameter with probability content $C$ (e.g., $C = 0.95$ or 95%) is provided by a credible region, or highest posterior density region, $R$, defined by
$$\int_R d\theta\; p(\theta|D, M) = C, \tag{3.10}$$
with the posterior density inside $R$ everywhere greater than that outside it. We sometimes speak picturesquely of the region of parameter space that is assigned a large density as the “posterior bubble.” In practice, the probability (density function) $p(\theta|D, M)$ is represented by a finite list of values, $p_i$, representing the probability in discrete intervals of $\theta$. A simple way to compute the credible region is to sort these probability values in descending order. Then starting with the largest value, add successively smaller $p_i$ values until adding the next value would exceed the desired value of $C$. At each step keep track of the corresponding $\theta_i$ value. The credible region is the range of $\theta$ that just includes all the $\theta_i$ values corresponding to the $p_i$ values added. The boundaries of the credible region are obtained by sorting these $\theta_i$ values and taking the smallest and largest values.
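The sorting recipe just described translates almost line for line into code. The following Python sketch is one possible reading of it, under the assumption that the posterior is supplied as a list of values on a grid of $\theta$; the function name `credible_region` and the Gaussian test posterior are illustrative choices, not taken from the text.

```python
import math

def credible_region(theta, p, content=0.95):
    """Return (lower, upper, enclosed) for the region described in the text.

    theta : grid values theta_i
    p     : posterior values p_i on that grid (normalized here to sum to 1)
    """
    total = sum(p)
    probs = [v / total for v in p]
    # Sort the indices so that the largest p_i come first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, enclosed = [], 0.0
    for i in order:
        # Stop when adding the next value would exceed the desired content.
        if kept and enclosed + probs[i] > content:
            break
        kept.append(i)
        enclosed += probs[i]
    kept_theta = sorted(theta[i] for i in kept)
    return kept_theta[0], kept_theta[-1], enclosed

# Illustrative posterior: a unit Gaussian sampled on a fine grid.
step = 0.005
grid = [-5.0 + step * k for k in range(2001)]
post = [math.exp(-0.5 * t * t) for t in grid]
lower, upper, enclosed = credible_region(grid, post, 0.95)
print(lower, upper, enclosed)
```

For a unit Gaussian the boundaries should fall close to $\pm 1.96$. Note that taking the smallest and largest $\theta_i$ gives a single interval, so for a posterior with well-separated peaks the reported range would also span the low-probability gap between them.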
3.4 Nuisance parameters
Frequently, a parameterized model will have more than one parameter, but we will want to focus attention on a subset of the parameters. For example, we may want to focus on the implications of the data for the frequency of a periodic signal, independent of the signal’s amplitude, shape, or phase. Or we may want to focus on the implications of spectral data for the parameters of some line feature, independent of the shape of the background continuum. In such problems, the uninteresting parameters are known as nuisance parameters. As always, the full Bayesian inference is the full joint posterior PDF for all of the parameters; but its implications for the parameters of interest can be simply summarized by integrating out the nuisance parameters. Explicitly, if model $M$ has two parameters, $\theta$ and $\phi$, and we are interested only in $\theta$, then it is a simple consequence of the sum and product rules (see Section 1.5) that,
$$p(\theta|D, M) = \int d\phi\; p(\theta, \phi|D, M). \tag{3.11}$$
For historical reasons, the procedure of integrating out nuisance parameters is called marginalization, and $p(\theta|D, M)$ is called the marginal posterior PDF for $\theta$. Equation (3.8) for the global likelihood is a special case of marginalization in which all of the model parameters are marginalized out of the joint prior distribution, $p(D, \theta|M)$. The use of marginalization to eliminate nuisance parameters is one of the most important technical advantages of Bayesian inference over standard frequentist statistics. Indeed, the name “nuisance parameters” originated in frequentist statistics because there is no general frequentist method for dealing with such parameters; they are indeed a “nuisance” in frequentist statistics. Marginalization plays a very important role in this work. We will see a detailed example of marginalization in action in Section 3.6.
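Equation (3.11) can be tried out numerically on a case where the answer is known. The Python sketch below is an illustration under stated assumptions: the joint posterior is taken to be a correlated bivariate Gaussian with unit variances (correlation `RHO`, an arbitrary choice), for which the marginal of $\theta$ is a unit Gaussian. The integral over the nuisance parameter $\phi$ is done with a simple midpoint rule.

```python
import math

RHO = 0.8  # assumed correlation between theta and phi (illustrative only)

def joint_posterior(theta, phi):
    """Assumed joint posterior p(theta, phi|D, M): bivariate Gaussian."""
    norm = 1.0 / (2.0 * math.pi * math.sqrt(1.0 - RHO ** 2))
    q = (theta ** 2 - 2.0 * RHO * theta * phi + phi ** 2) / (1.0 - RHO ** 2)
    return norm * math.exp(-0.5 * q)

def marginal_theta(theta, phi_lo=-8.0, phi_hi=8.0, n=1600):
    """Midpoint-rule evaluation of Eq. (3.11): integrate the joint over phi."""
    h = (phi_hi - phi_lo) / n
    return h * sum(joint_posterior(theta, phi_lo + (k + 0.5) * h) for k in range(n))

def unit_gaussian(theta):
    """Known marginal for this particular joint posterior."""
    return math.exp(-0.5 * theta ** 2) / math.sqrt(2.0 * math.pi)

for theta in (0.0, 1.0, 2.0):
    print(theta, marginal_theta(theta), unit_gaussian(theta))
```

The marginal is unaffected by the value of `RHO`, even though the correlation changes the shape of the joint posterior considerably.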
3.5 Model comparison and Occam’s razor
Often, more than one parameterized model will be available to explain a phenomenon, and we will wish to compare them. The models may differ in form or in number of parameters. Use of Bayes’ theorem to compare competing models by calculating the probability of each model as a whole is called model comparison. Bayesian model comparison has a built-in “Occam’s razor:” Bayes’ theorem automatically penalizes complicated models, assigning them large probabilities only if the complexity of the
data justifies the additional complication of the model. See Jeffreys and Berger (1992) for a historical account of the connection between Occam’s (Ockham’s) razor and Bayesian analysis. Model comparison calculations require the explicit specification of two or more specific alternative models, $M_i$. We take as our prior information the proposition that one of the models under consideration is true. Symbolically, we might write this as $I = M_1 + M_2 + \cdots + M_N$, where the “+” symbol here stands for disjunction (“or”). Given this information, we can calculate the probability for each model with Bayes’ theorem:
$$p(M_i|D, I) = \frac{p(M_i|I)\,p(D|M_i, I)}{p(D|I)}. \tag{3.12}$$
We recognize $p(D|M_i, I)$ as the global likelihood for model $M_i$, which we can calculate according to Equation (3.8). The term in the denominator is again a normalization constant, obtained by summing the products of the priors and the global likelihoods of all models being considered. Model comparison is thus completely analogous to parameter estimation: just as the posterior PDF for a parameter is proportional to its prior times its likelihood, so the posterior probability for a model as a whole is proportional to its prior probability times its global likelihood. It is often useful to consider the ratios of the probabilities of two models, rather than the probabilities directly. The ratio,
$$O_{ij} = p(M_i|D, I)/p(M_j|D, I), \tag{3.13}$$
is called the odds ratio in favor of model $M_i$ over model $M_j$. From Equation (3.12),
$$O_{ij} = \frac{p(M_i|I)}{p(M_j|I)}\,\frac{p(D|M_i, I)}{p(D|M_j, I)} \equiv \frac{p(M_i|I)}{p(M_j|I)}\,B_{ij}, \tag{3.14}$$
where the first factor is the prior odds ratio, and the second factor is called the Bayes factor. Note: the normalization constant in Equation (3.12) drops out of the odds ratio; this can make the odds ratio somewhat easier to work with. The odds ratio is also conceptually useful when one particular model is of special interest. For example, suppose we want to compare a constant rate model with a class of periodic alternatives, and will thus calculate the odds in favor of each alternative over the constant model. If we have calculated the odds ratios, $O_{i1}$, in favor of each model over model $M_1$, we can find the probabilities for each model in terms of these odds ratios as follows:
$$\sum_{i=1}^{N_{mod}} p(M_i|D, I) = 1, \tag{3.15}$$
where $N_{mod}$ is the total number of models considered. Dividing through by $p(M_1|D, I)$, we have
$$\frac{1}{p(M_1|D, I)} = \sum_{i=1}^{N_{mod}} O_{i1}. \tag{3.16}$$
Comparing Equation (3.16) to the expression for $O_{i1}$, given by
$$O_{i1} = p(M_i|D, I)/p(M_1|D, I), \tag{3.17}$$
we have the result that
$$p(M_i|D, I) = \frac{O_{i1}}{\sum_{i=1}^{N_{mod}} O_{i1}}, \tag{3.18}$$
where of course $O_{11} = 1$. If there are only two models, the probability of $M_2$ is given by
$$p(M_2|D, I) = \frac{O_{21}}{1 + O_{21}} = \frac{1}{1 + O_{21}^{-1}}. \tag{3.19}$$
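Equations (3.18) and (3.19) are simple enough to verify with a few lines of code. The Python sketch below is illustrative only; the function name and the odds values are assumptions made for the example.

```python
def probabilities_from_odds(odds_vs_m1):
    """Convert odds ratios O_i1 into model probabilities, Eq. (3.18).

    odds_vs_m1[0] must be O_11 = 1; the remaining entries are O_21, O_31, ...
    """
    if abs(odds_vs_m1[0] - 1.0) > 1e-12:
        raise ValueError("the first entry must be O_11 = 1")
    total = sum(odds_vs_m1)
    return [o / total for o in odds_vs_m1]

# Hypothetical odds for three models relative to M1.
probs = probabilities_from_odds([1.0, 3.0, 0.5])
print(probs)

# Two-model special case, Eq. (3.19): p(M2|D,I) = O21 / (1 + O21).
o21 = 3.0
p_m2 = probabilities_from_odds([1.0, o21])[1]
print(p_m2, o21 / (1.0 + o21), 1.0 / (1.0 + 1.0 / o21))
```

All three expressions printed on the last line should agree, since they are algebraically the same quantity.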
In this work, we will assume that we have no information leading to a prior preference for one model over another, so the prior odds ratio will be unity, and the odds ratio will equal the Bayes factor, the ratio of global likelihoods. A crucial consequence of the marginalization procedure used to calculate global likelihoods is that the Bayes factor automatically favors simpler models unless the data justify the complexity of more complicated alternatives. This is illustrated by the following simple example.

Imagine comparing two models: $M_1$ with a single parameter, $\theta$, and $M_0$ with $\theta$ fixed at some default value $\theta_0$ (so $M_0$ has no free parameters). To calculate the Bayes factor $B_{10}$ in favor of model $M_1$, we will need to perform the integral in Equation (3.8) to compute $p(D|M_1, I)$, the global likelihood of $M_1$. To develop our intuition about the Occam penalty, we will carry out a back-of-the-envelope calculation for the Bayes factor. Often the data provide us with more information about parameters than we had without the data, so that the likelihood function, $\mathcal{L}(\theta) = p(D|\theta, M_1, I)$, will be much more “peaked” than the prior, $p(\theta|M_1, I)$. In Figure 3.1 we show a Gaussian-looking likelihood centered at $\hat{\theta}$, the maximum likelihood value of $\theta$, together with a flat prior for $\theta$. Let $\Delta\theta$ be the characteristic width of the prior. For a flat prior, we have that
$$\int_{\Delta\theta} d\theta\; p(\theta|M_1, I) = p(\theta|M_1, I)\,\Delta\theta = 1. \tag{3.20}$$
Therefore, $p(\theta|M_1, I) = 1/\Delta\theta$.
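Figure 3.1 The characteristic width $\delta\theta$ of the likelihood peak and $\Delta\theta$ of the prior. [Plot against the parameter $\theta$: the likelihood $p(D|\theta, M_1, I) = \mathcal{L}(\theta)$, which peaks at $\hat{\theta}$ with height $\mathcal{L}(\hat{\theta}) = p(D|\hat{\theta}, M_1, I)$ and has width $\delta\theta$, and the flat prior $p(\theta|M_1, I) = 1/\Delta\theta$ of width $\Delta\theta$.]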
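The likelihood has a characteristic width² which we represent by $\delta\theta$. The characteristic width is defined by
$$\int d\theta\; p(D|\theta, M_1, I) = p(D|\hat{\theta}, M_1, I) \times \delta\theta. \tag{3.21}$$
Then we can approximate the global likelihood (Equation (3.8)) for $M_1$ in the following way:
$$\begin{aligned} p(D|M_1, I) &= \int d\theta\; p(\theta|M_1, I)\,p(D|\theta, M_1, I) = \mathcal{L}(M_1) \\ &= \frac{1}{\Delta\theta}\int d\theta\; p(D|\theta, M_1, I) \approx p(D|\hat{\theta}, M_1, I)\,\frac{\delta\theta}{\Delta\theta}, \end{aligned} \tag{3.22}$$
or alternatively, $\mathcal{L}(M_1) \approx \mathcal{L}(\hat{\theta})\,\delta\theta/\Delta\theta$.

Since model $M_0$ has no free parameters, no integral need be calculated to find its global likelihood, which is simply equal to the likelihood for model $M_1$ for $\theta = \theta_0$,
$$p(D|M_0, I) = p(D|\theta_0, M_1, I) = \mathcal{L}(\theta_0). \tag{3.23}$$
Thus the Bayes factor in favor of the more complicated model is
$$B_{10} \approx \frac{p(D|\hat{\theta}, M_1, I)}{p(D|\theta_0, M_1, I)}\,\frac{\delta\theta}{\Delta\theta} = \frac{\mathcal{L}(\hat{\theta})}{\mathcal{L}(\theta_0)}\,\frac{\delta\theta}{\Delta\theta}. \tag{3.24}$$

² If the likelihood function is really a Gaussian and the prior is flat, it is simple to show that $\delta\theta = \sigma_\theta\sqrt{2\pi}$, where $\sigma_\theta$ is the standard deviation of the posterior PDF for $\theta$.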
The likelihood ratio in the first factor can never favor the simpler model because $M_1$ contains it as a special case. However, since the posterior width, $\delta\theta$, is narrower than the prior width, $\Delta\theta$, the second factor penalizes the complicated model for any “wasted” parameter space that gets ruled out by the data. The Bayes factor will thus favor the more complicated model only if the likelihood ratio is large enough to overcome this penalty. Equation (3.22) has the form of the best-fit likelihood times the factor that penalizes $M_1$. In the above illustrative calculation we assumed a simple Gaussian likelihood function for convenience. In general, the actual likelihood function can be very complicated with several peaks. However, one can always write the global likelihood of a model with parameter $\theta$ as the maximum value of its likelihood times some factor, $\Omega_\theta$:
$$p(D|M, I) \equiv \mathcal{L}_{max}\,\Omega_\theta. \tag{3.25}$$
The second factor, $\Omega_\theta$, is called the Occam factor associated with the parameters, $\theta$. It is so named because it corrects the likelihood ratio usually considered in statistical tests in a manner that quantifies the qualitative notion behind “Occam’s razor:” simpler explanations are to be preferred unless there is sufficient evidence in favor of more complicated explanations. Bayes’ theorem both quantifies such evidence and determines how much additional evidence is “sufficient” through the calculation of global likelihoods.

Suppose $M_1$ has two parameters $\theta$ and $\phi$, then following Equation (3.22), we can write
$$\begin{aligned} p(D|M_1, I) &= \iint d\theta\, d\phi\; p(\theta|M_1, I)\,p(\phi|M_1, I)\,p(D|\theta, \phi, M_1, I) \\ &\approx p(D|\hat{\theta}, \hat{\phi}, M_1, I)\,\frac{\delta\theta}{\Delta\theta}\,\frac{\delta\phi}{\Delta\phi} = \mathcal{L}_{max}\,\Omega_\theta\,\Omega_\phi. \end{aligned} \tag{3.26}$$
The above equation assumes independent flat priors for the two parameters. It is clear from Equation (3.26) that the total Occam penalty, $\Omega_{total} = \Omega_\theta\,\Omega_\phi$, can become very large. For example, if $\delta\theta/\Delta\theta = \delta\phi/\Delta\phi = 0.01$ then $\Omega_{total} = 10^{-4}$. Thus for the Bayes factor in Equation (3.24) to favor $M_1$, the ratio of the maximum likelihoods,
$$\frac{p(D|\hat{\theta}, \hat{\phi}, M_1, I)}{p(D|M_0, I)} = \frac{\mathcal{L}_{max}(M_1)}{\mathcal{L}_{max}(M_0)}$$
must be $\ge 10^4$. Unless the data argue very strongly for the greater complexity of $M_1$ through the likelihood ratio, the Occam factor will ensure we favor the simpler model. We will explore the Occam factor further in a worked example in Section 3.6. In the above calculations, we have specifically made a point of identifying the Occam factors and how they arise. In many instances we are not interested in the value of the Occam factor, but only in the final posterior probabilities of the competing models.
Because the Occam factor arises automatically in the marginalization process, its effect will be present in any model selection calculation.
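The back-of-the-envelope result in Equations (3.22) and (3.24) can be compared with a direct numerical integration. The Python sketch below is illustrative and rests on assumptions that are not in the text: a Gaussian likelihood with peak value 1, width `sigma = 0.1` centered at `theta_hat = 5`, a flat prior on [0, 10], and a default value `theta_0 = 4.8` for the simpler model. For a Gaussian likelihood the characteristic width is $\sigma\sqrt{2\pi}$.

```python
import math

def gaussian_likelihood(theta, theta_hat, sigma, l_max=1.0):
    """Assumed likelihood L(theta): a Gaussian of peak height l_max."""
    return l_max * math.exp(-0.5 * ((theta - theta_hat) / sigma) ** 2)

def global_likelihood(theta_hat, sigma, lo, hi, n=20000):
    """Midpoint-rule version of Eq. (3.8) with a flat prior 1/(hi - lo)."""
    width = hi - lo
    h = width / n
    total = sum(gaussian_likelihood(lo + (k + 0.5) * h, theta_hat, sigma)
                for k in range(n))
    return total * h / width

# Illustrative values (assumptions, not from the text).
theta_hat, sigma, lo, hi = 5.0, 0.1, 0.0, 10.0
theta_0 = 4.8  # value fixed in the simpler model M0, two sigma from the peak

exact = global_likelihood(theta_hat, sigma, lo, hi)
occam = sigma * math.sqrt(2.0 * math.pi) / (hi - lo)  # delta_theta / Delta_theta
approx = 1.0 * occam                                   # L_max times Occam factor

likelihood_ratio = 1.0 / gaussian_likelihood(theta_0, theta_hat, sigma)
bayes_factor = likelihood_ratio * occam                # Eq. (3.24)
print(exact, approx, likelihood_ratio, bayes_factor)
```

With these numbers the best-fit likelihood of $M_1$ exceeds that of $M_0$ by a factor of about 7.4, yet the Occam factor of about 0.025 brings the Bayes factor below 1, so the simpler model is still preferred.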
3.6 Sample spectral line problem
In this section, we will illustrate many of the above points in a detailed Bayesian analysis of a toy spectral line problem. In a real problem, as opposed to the hypothetical one discussed below, there could be all sorts of complicated prior information. Although Bayesian analysis can readily handle these complexities, our aim here is to bring out the main features of the Bayesian approach as simply as possible. Be warned: even though it is a relatively simple problem, our detailed solution, together with commentary and a summary of the lessons learned, will occupy quite a few pages.
3.6.1 Background information
In this problem, we suppose that two competing grand unification theories have been proposed. Each one is championed by a Nobel prize winner in physics. We want to compute the relative probability of the truth of each theory based on our prior (background) information and some new data. Both theories make definite predictions in energy ranges beyond the reach of the present generation of particle accelerators. In addition, theory 1 uniquely predicts the existence of a new short-lived baryon which is expected to form a short-lived atom and give rise to a spectral line at an accurately calculable radio wavelength. Unfortunately, it is not feasible to detect the line in the laboratory. The only possibility of obtaining a sufficient column density of the short-lived atom is in interstellar space. Prior estimates of the line strength expected from the Orion nebula according to theory 1 range from 0.1 to 100 mK. Theory 1 also predicts the line will have a Gaussian line shape of the form
$$T \exp\left\{-\frac{(\nu_i - \nu_o)^2}{2\sigma_L^2}\right\} \quad (\text{abbreviated by } T f_i), \tag{3.27}$$
where the signal strength is measured in temperature units of mK and $T$ is the amplitude of the line. The frequency, $\nu_i$, is in units of channel number and $\nu_o = 37$. The width of the line profile is characterized by $\sigma_L$, and $\sigma_L = 2$ channel numbers. The predicted line shape is shown in Figure 3.2.

Figure 3.2 Predicted spectral shape according to theory 1. [Plot: signal strength (mK) versus channel number.]

Data: To test this prediction, a new spectrometer was mounted on the James Clerk Maxwell telescope on Mauna Kea and the spectrum shown in Figure 3.3 was obtained. The spectrometer has 64 frequency channels with neighboring channels separated by $0.5\,\sigma_L$. All channels have Gaussian noise characterized by $\sigma = 1$ mK. The noise in separate channels is independent. The data are given in Table 3.1. Let $D$ be a proposition representing the data from the spectrometer.
$$D \equiv D_1, D_2, \ldots, D_N; \quad N = 64 \tag{3.28}$$
where $D_1$ is a proposition that asserts that “the data value recorded in the first channel was $d_1$.”
Figure 3.3 Measured spectrum. [Plot titled “Spectral Line Data”: signal strength (mK) versus channel number.]
Table 3.1 Spectral line data consisting of 64 frequency channels (#) obtained with a radio astronomy spectrometer. The output voltage from each channel has been calibrated in units of effective black body temperature expressed in mK. The existence of negative values arises from receiver channel noise which gives rise to both positive and negative fluctuations.

 #    mK       #    mK       #    mK       #    mK
 1   1.420    17   0.937    33   0.248    49   0.001
 2   0.468    18   1.331    34   1.169    50   0.360
 3   0.762    19   1.772    35   0.915    51   0.497
 4   1.312    20   0.530    36   1.113    52   0.072
 5   2.029    21   0.330    37   1.463    53   1.094
 6   0.086    22   1.205    38   2.732    54   1.425
 7   1.249    23   1.613    39   0.571    55   0.283
 8   0.368    24   0.300    40   0.865    56   1.526
 9   0.657    25   0.046    41   0.849    57   1.174
10   1.294    26   0.026    42   0.171    58   0.558
11   0.235    27   0.519    43   1.031    59   1.282
12   0.192    28   0.924    44   1.105    60   0.384
13   0.269    29   0.230    45   0.344    61   0.120
14   0.827    30   0.877    46   0.087    62   0.187
15   0.685    31   0.650    47   0.351    63   0.646
16   0.702    32   1.004    48   1.248    64   0.399
Question: Which theory is more probable? Based on our current state of information, which includes just the above prior information and the measured spectrum, what do we conclude about the relative probabilities of the two competing theories and what is the posterior PDF for the line strength?

Hypothesis space:
$M_1 \equiv$ “Theory 1 correct, line exists”
$M_2 \equiv$ “Theory 2 correct, no line predicted”
3.7 Odds ratio
To answer the above question, we compute the odds ratio (abbreviated simply by the odds) of model $M_1$ to model $M_2$.
$$O_{12} = \frac{p(M_1|D, I)}{p(M_2|D, I)}. \tag{3.29}$$
From Equation (3.14) we can write
$$O_{12} = \frac{p(M_1|I)}{p(M_2|I)}\,\frac{p(D|M_1, I)}{p(D|M_2, I)} \equiv \frac{p(M_1|I)}{p(M_2|I)}\,B_{12}, \tag{3.30}$$
where $p(M_1|I)/p(M_2|I)$ is the prior odds, and $p(D|M_1, I)/p(D|M_2, I)$ is the global likelihood ratio, which is also called the Bayes factor. Based on the prior information given in the statement of the problem, we assign the prior odds $= 1$, so our final odds is given by
$$O_{12} = \frac{p(D|M_1, I)}{p(D|M_2, I)} \quad (\text{the Bayes factor}). \tag{3.31}$$
To obtain $p(D|M_1, I)$, the global likelihood of $M_1$, we need to marginalize over its unknown parameter, $T$. From Equation (3.8), we can write
$$p(D|M_1, I) = \int dT\; p(T|M_1, I)\,p(D|M_1, T, I). \tag{3.32}$$
In the following section we will consider what form of prior to use for $p(T|M_1, I)$. In Section 3.7.2 we will show how to evaluate the likelihood, $p(D|M_1, T, I)$.
3.7.1 Choice of prior $p(T|M_1, I)$
We need to evaluate the global likelihood of model $M_1$ for use in the Bayes factor. One of the items we need in this calculation is $p(T|M_1, I)$, the prior for $T$. Choosing a prior is an important part of any Bayesian calculation and we will have a lot to say about this topic in Section 3.10 and other chapters, e.g., Chapter 8, and Sections 9.2.3, 13.3 and 13.4. For this example, we will investigate two common choices: the uniform prior and the Jeffreys prior.³

Uniform prior
Suppose we chose a uniform prior for $p(T|M_1, I)$ in the range $T_{min} \le T \le T_{max}$
$$p(T|M_1, I) = \frac{1}{\Delta T}, \tag{3.33}$$
where $\Delta T = T_{max} - T_{min}$. There is a problem with this prior if the range of $T$ is large. In the current example $T_{max} = 100$ and $T_{min} = 0.1$. To illustrate the problem, we compare the probability that $T$ lies in the upper decade of the prior range (10 to 100 mK) to the lowest decade (0.1 to 1 mK). This is given by
$$\frac{\int_{10}^{100} p(T|M_1, I)\,dT}{\int_{0.1}^{1} p(T|M_1, I)\,dT} = 100. \tag{3.34}$$
We see that in this case, a uniform prior implies that the line strength is 100 times more probable to be in the top decade of the predicted range than the bottom, i.e., it is much more probable that $T$ is strong than weak. Usually, expressing great uncertainty in some quantity corresponds more closely to a statement of scale invariance or equal probability per decade. In this situation, we recommend using a Jeffreys prior, which is scale invariant.

³ If the lower limit on $T$ extended all the way to zero, we would not be able to use a Jeffreys prior because of the infinity at $T = 0$. A modified version of the form, $p(T|M_1, I) = 1/\{(T + a)\ln[(a + T_{max})/a]\}$, where $a$ is a constant, eliminates this singularity. This modified Jeffreys behaves like a uniform prior for $T < a$ and a Jeffreys for $T > a$.

Jeffreys prior
The form of the prior which represents equal probability per decade (scale invariance) is given by $p(T|M_1, I) = k/T$, where $k$ = constant.
$$\int_{0.1}^{1} p(T|M_1, I)\,dT = k\int_{0.1}^{1} \frac{dT}{T} = k \ln 10 = \int_{10}^{100} p(T|M_1, I)\,dT. \tag{3.35}$$
We can evaluate $k$ from the requirement that
$$\int_{T_{min}}^{T_{max}} p(T|M_1, I)\,dT = 1 = k \ln\frac{T_{max}}{T_{min}} \tag{3.36}$$
$$\frac{1}{k} = \ln\frac{T_{max}}{T_{min}}. \tag{3.37}$$
Thus, the form of the Jeffreys prior is given by
$$p(T|M_1, I) = \frac{1}{T \ln(T_{max}/T_{min})}. \tag{3.38}$$
A convenient way of summarizing the above comparison between the uniform and Jeffreys prior is to plot the probability of each distribution per logarithmic interval or $p(\ln T|M_1, I)$. This can be obtained from the condition that the probability in the interval $T$ to $T + dT$ must equal the probability in the transformed interval $\ln T$ to $\ln T + d\ln T$.
$$\begin{aligned} p(T|M_1, I)\,dT &= p(\ln T|M_1, I)\,d\ln T \\ p(T|M_1, I) &= p(\ln T|M_1, I)\,\frac{d\ln T}{dT} = \frac{p(\ln T|M_1, I)}{T} \\ p(\ln T|M_1, I) &= T \times p(T|M_1, I). \end{aligned} \tag{3.39}$$
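The contrast between the two priors in Equations (3.34), (3.35) and (3.39) is easy to confirm numerically. The Python sketch below uses the analytic integrals of the two densities over an interval; the function names are introduced here for illustration only, while $T_{min} = 0.1$ and $T_{max} = 100$ are the values from the problem statement.

```python
import math

T_MIN, T_MAX = 0.1, 100.0  # prior range for the line strength (mK)

def uniform_pdf(T):
    """Uniform prior density, Eq. (3.33)."""
    return 1.0 / (T_MAX - T_MIN)

def jeffreys_pdf(T):
    """Jeffreys prior density, Eq. (3.38)."""
    return 1.0 / (T * math.log(T_MAX / T_MIN))

def uniform_prob(a, b):
    """Prior probability that a <= T <= b under the uniform prior."""
    return (b - a) / (T_MAX - T_MIN)

def jeffreys_prob(a, b):
    """Prior probability that a <= T <= b under the Jeffreys prior."""
    return math.log(b / a) / math.log(T_MAX / T_MIN)

# Top decade versus bottom decade, as in Eq. (3.34).
ratio_uniform = uniform_prob(10.0, 100.0) / uniform_prob(0.1, 1.0)
ratio_jeffreys = jeffreys_prob(10.0, 100.0) / jeffreys_prob(0.1, 1.0)
print(ratio_uniform, ratio_jeffreys)

# Probability per logarithmic interval, Eq. (3.39): T * p(T|M1, I).
for T in (0.1, 1.0, 10.0, 100.0):
    print(T, T * uniform_pdf(T), T * jeffreys_pdf(T))
```

The uniform prior puts 100 times more probability in the top decade than in the bottom one, whereas the Jeffreys prior gives each of the three decades a probability of 1/3 and a constant probability per logarithmic interval.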
Figure 3.4 The left panel shows the probability density function (PDF), $p(T|M_1, I)$, for the uniform and Jeffreys priors. The right panel shows the probability per logarithmic interval (PPLI), $T \times p(T|M_1, I)$. [Both panels are plotted against $T$, with axis ticks at 0.1, 1, 10 and 100.]

Figure 3.4 compares plots of the probability density function (PDF), $p(T|M_1, I)$ (left panel), and the probability per logarithmic interval (PPLI), $T \times p(T|M_1, I)$ (right panel), for the uniform and Jeffreys priors.
3.7.2 Calculation of $p(D|M_1, T, I)$
Let $d_i$ represent the measured data value for the $i$th channel of the spectrometer. According to model $M_1$,
$$d_i = T f_i + e_i, \tag{3.40}$$
where ei is an error term. Our prior information indicates that this error is caused by receiver noise which has a Gaussian distribution with a standard deviation of . Also, from Equation (3.27), we have ( ) ði o Þ2 fi ¼ exp : (3:41) 22L Assuming M1 is true, then if it were not for the error ei , di would equal Tfi . Let Ei ‘‘a proposition asserting that the ith error value is in the range ei to ei þ dei .’’ In this case, we can show (see Section 4.8) that pðDi jM1 ; T; IÞ ¼ pðEi jM1 ; T; IÞ. If all the Ei are independent4 then pðDjM1 ; T; IÞ ¼ pðD1 ; D2 ; . . . ; DN jM1 ; T; IÞ ¼ pðE1 ; E2 ; . . . ; EN jM1 ; T; IÞ ¼ pðE1 jM1 ; T; IÞpðE2 jM1 ; T; IÞ . . . pðEN jM1 ; T; IÞ ¼
N Y
pðEi jM1 ; T; IÞ
i¼1
4
We deal with the effect of correlated errors in Section 10.2.2.
(3:42)
[Figure 3.5: Gaussian error curve, probability density versus signal strength (mK), marking the predicted $i$th value $T f_i$, the $i$th data value $d_i$, and the error $e_i$; $p(E_i|M_1,T,I)$ is proportional to the line height at $d_i$.]
Figure 3.5 Probability of getting a data value $d_i$ a distance $e_i$ away from the predicted value is proportional to the height of the Gaussian error curve at that location.
where $\prod_{i=1}^{N}$ stands for the product of $N$ of these terms. From the prior information, we can write
$$p(E_i|M_1,T,I) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{e_i^2}{2\sigma^2}\right\} = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(d_i - T f_i)^2}{2\sigma^2}\right\}. \tag{3.43}$$
It is apparent that $p(E_i|M_1,T,I)$ is a probability density function since $e_i$, the value of the error for channel $i$, is a continuous variable. The factor $(\sigma\sqrt{2\pi})^{-1}$ in the above equation ensures that the integral over $e_i$ from $-\infty$ to $+\infty$ is equal to 1. In Figure 3.5, $p(E_i|M_1,T,I)$ is shown proportional to the height of the Gaussian error curve at the position of the actual data value $d_i$. Combining Equations (3.42) and (3.43), we obtain the probability of the entire data set
$$p(D|M_1,T,I) = \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(d_i - T f_i)^2}{2\sigma^2}\right\} = (2\pi)^{-N/2}\,\sigma^{-N} \exp\left\{-\frac{\sum_i (d_i - T f_i)^2}{2\sigma^2}\right\}. \tag{3.44}$$
In Section 3.7.4, we will need the maximum value of the likelihood given by Equation (3.44). Since we now know all the quantities in Equation (3.44) except $T$, we can readily compute the likelihood as a function of $T$ in the prior range $0.1 \le T \le 100$. The likelihood has a maximum $= 8.520 \times 10^{-37}$ (called the maximum likelihood) at $T = 1.561$ mK.
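Equation (3.44) is straightforward to evaluate on a grid of $T$ values. The sketch below is in Python and uses a simulated spectrum, because the measured data are not reproduced in this section; the line centre (channel 37), the simulated line strength (1.5 mK), the random seed and the grid step are all assumptions made for illustration, so the numbers it prints will not match the 8.520 × 10⁻³⁷ and 1.561 mK quoted above.

```python
import math
import random

# Hypothetical stand-in spectrum (the measured data are not reproduced here).
N = 64            # number of channels
NU_0 = 37         # assumed line-centre channel
SIGMA_L = 2.0     # line width in channels
SIGMA = 1.0       # noise standard deviation in mK
T_TRUE = 1.5      # assumed line strength used to simulate the data

rng = random.Random(1)
f = [math.exp(-((i - NU_0) ** 2) / (2.0 * SIGMA_L ** 2)) for i in range(1, N + 1)]
d = [T_TRUE * fi + rng.gauss(0.0, SIGMA) for fi in f]


def log_likelihood(t):
    """ln p(D|M1,T,I) from Equation (3.44)."""
    chi2 = sum((di - t * fi) ** 2 for di, fi in zip(d, f)) / SIGMA ** 2
    return -0.5 * N * math.log(2.0 * math.pi) - N * math.log(SIGMA) - 0.5 * chi2


# Evaluate on a grid spanning the prior range 0.1 <= T <= 100 mK.
STEP = 0.01
t_grid = [0.1 + STEP * k for k in range(9991)]
log_l = [log_likelihood(t) for t in t_grid]
k_max = max(range(len(t_grid)), key=log_l.__getitem__)
t_best = t_grid[k_max]
l_max = math.exp(log_l[k_max])

# Closed-form maximum of the quadratic in T, for comparison with the grid search.
t_hat = sum(di * fi for di, fi in zip(d, f)) / sum(fi * fi for fi in f)
print(f"grid maximum at T = {t_best:.2f} mK, analytic T_hat = {t_hat:.3f} mK")
print(f"maximum likelihood on the grid = {l_max:.3e}")
```

Because the exponent in Equation (3.44) is quadratic in $T$, the grid maximum should agree with the closed-form value $\sum_i d_i f_i / \sum_i f_i^2$ to within one grid step whenever that value lies inside the prior range.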
What we want is $p(D|M_1,I)$, the global likelihood of $M_1$, for use in Equation (3.31). We now evaluate $p(D|M_1,I)$, given by Equation (3.32), for the two different priors discussed in Section 3.7.1, where we argued that the Jeffreys prior matches much more closely the prior information given in this particular problem. Nevertheless, it is interesting to explore what effect the choice of a uniform prior would have on our conclusions. For this reason, we will do the calculations for both priors.

Uniform prior case:
$$p(D|M_1,I) = \frac{(2\pi)^{-N/2}\,\sigma^{-N}}{\Delta T} \exp\left\{-\frac{\sum_i d_i^2}{2\sigma^2}\right\} \int_{T_{\min}}^{T_{\max}} dT\, \exp\left\{\frac{T\sum_i d_i f_i}{\sigma^2}\right\} \exp\left\{-\frac{T^2\sum_i f_i^2}{2\sigma^2}\right\} = 1.131 \times 10^{-38}, \tag{3.45}$$
where $\Delta T = T_{\max} - T_{\min}$. According to Equation (3.25), we can always write the global likelihood of a model as the maximum value of its likelihood times an Occam factor, $\Omega_T$, which arises in this case from marginalizing $T$.
$$\begin{aligned}
p(D|M_1,I) &= L_{\max}(M_1)\,\Omega_T \\
&= \text{maximum value of } [p(D|M_1,T,I)] \times \text{Occam factor} \\
&= 8.520 \times 10^{-37}\,\Omega_T.
\end{aligned} \tag{3.46}$$
Comparison of the results of Equations (3.45) and (3.46) leads directly to a value for the Occam factor, associated with our prior uncertainty in the $T$ parameter, of $\Omega_T = 0.0133$.
Jeffreys prior case:
$$p(D|M_1,I) = \frac{(2\pi)^{-N/2}\,\sigma^{-N}}{\ln(T_{\max}/T_{\min})} \exp\left\{-\frac{\sum_i d_i^2}{2\sigma^2}\right\} \int_{T_{\min}}^{T_{\max}} dT\, \frac{1}{T} \exp\left\{\frac{T\sum_i d_i f_i}{\sigma^2}\right\} \exp\left\{-\frac{T^2\sum_i f_i^2}{2\sigma^2}\right\} = 1.239 \times 10^{-37}. \tag{3.47}$$
In this case the Occam factor associated with our prior uncertainty in the $T$ parameter, based on a Jeffreys prior, is 0.145. Note: the Occam factor based on the Jeffreys prior is a factor of 10 less of a penalty than for the uniform prior for the same parameter.
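The two global likelihoods and their Occam factors can be approximated with a simple trapezoidal integration. This is a sketch in Python on simulated data, under the same assumptions as before (line centre in channel 37, simulated strength 1.5 mK, fixed random seed, all illustrative); it is meant to show the structure of Equations (3.45) to (3.47), not to reproduce the quoted values.

```python
import math
import random

N, NU_0, SIGMA_L, SIGMA = 64, 37, 2.0, 1.0   # NU_0 is an assumed line centre
T_MIN, T_MAX = 0.1, 100.0

rng = random.Random(1)
f = [math.exp(-((i - NU_0) ** 2) / (2.0 * SIGMA_L ** 2)) for i in range(1, N + 1)]
d = [1.5 * fi + rng.gauss(0.0, SIGMA) for fi in f]   # hypothetical data, T = 1.5 mK

# Sufficient statistics: sum (d_i - T f_i)^2 = S_dd - 2 T S_df + T^2 S_ff.
S_dd = sum(di * di for di in d)
S_df = sum(di * fi for di, fi in zip(d, f))
S_ff = sum(fi * fi for fi in f)


def likelihood(t):
    """p(D|M1,T,I), Equation (3.44)."""
    chi2 = (S_dd - 2.0 * t * S_df + t * t * S_ff) / SIGMA ** 2
    return (2.0 * math.pi) ** (-N / 2) * SIGMA ** (-N) * math.exp(-0.5 * chi2)


def trapezoid(y, x):
    """Trapezoidal-rule integral of samples y over grid x."""
    return sum(0.5 * (y[k] + y[k + 1]) * (x[k + 1] - x[k]) for k in range(len(x) - 1))


t_grid = [T_MIN + 0.005 * k for k in range(19981)]   # 0.1 ... 100 mK
like = [likelihood(t) for t in t_grid]
l_max = max(like)

# Global likelihoods: prior-weighted averages of the likelihood (Equations 3.45, 3.47).
global_uniform = trapezoid([l / (T_MAX - T_MIN) for l in like], t_grid)
global_jeffreys = trapezoid(
    [l / (t * math.log(T_MAX / T_MIN)) for l, t in zip(like, t_grid)], t_grid)

# Occam factor = global likelihood / maximum likelihood (Equation 3.46).
occam_uniform = global_uniform / l_max
occam_jeffreys = global_jeffreys / l_max
print(f"Occam factor, uniform prior : {occam_uniform:.4f}")
print(f"Occam factor, Jeffreys prior: {occam_jeffreys:.4f}")
```

Working with the three sums $\sum d_i^2$, $\sum d_i f_i$ and $\sum f_i^2$ avoids re-summing over all channels at every grid point.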
3.7.3 Calculation of p(D|M2, I)

Model $M_2$ assumes the spectrum is consistent with noise and has no free parameters so in analogy to Equation (3.40), we can write
$$d_i = 0 + e_i, \tag{3.48}$$
where $e_i$ = Gaussian noise with a standard deviation of $\sigma$. Assuming $M_2$ is true, then if it were not for the noise $e_i$, $d_i$ would equal 0.
$$p(D|M_2,I) = (2\pi)^{-N/2}\,\sigma^{-N} \exp\left\{-\frac{\sum_i d_i^2}{2\sigma^2}\right\} = 1.133 \times 10^{-38}. \tag{3.49}$$
Since this model has no free parameters, there is no Occam factor, so the global likelihood is also the maximum likelihood, $L_{\max}(M_2)$, for $M_2$.
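3.7.4 Odds, uniform prior

Substitution of Equations (3.45) and (3.49) into Equation (3.31) leads to an odds ratio for the uniform prior case given by
$$\text{odds} = \frac{1}{\Delta T} \int_{T_{\min}}^{T_{\max}} dT\, \exp\left\{\frac{T\sum_i d_i f_i}{\sigma^2}\right\} \exp\left\{-\frac{T^2\sum_i f_i^2}{2\sigma^2}\right\}. \tag{3.50}$$
For $T_{\min} = 0.1$ mK and $T_{\max} = 100$ mK, the odds $= 0.9986$ and
$$p(M_1|D,I) = \frac{1}{1 + \dfrac{1}{\text{odds}}} = 0.4996. \tag{3.51}$$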
Although the ratio of the maximum likelihoods for the two models favors model $M_1$ by a factor of $L_{\max}(M_1)/L_{\max}(M_2) = 8.520 \times 10^{-37}/1.133 \times 10^{-38} \approx 75$, the ratio of the global likelihoods marginally favors $M_2$ because of the Occam factor, which penalizes $M_1$ for its extra complexity.
3.7.5 Odds, Jeffreys prior

Substitution of Equations (3.47) and (3.49) into Equation (3.31) leads to an odds ratio for the Jeffreys prior case, given by
$$\text{odds} = \frac{1}{\ln(T_{\max}/T_{\min})} \int_{T_{\min}}^{T_{\max}} dT\, \frac{1}{T} \exp\left\{\frac{T\sum_i d_i f_i}{\sigma^2}\right\} \exp\left\{-\frac{T^2\sum_i f_i^2}{2\sigma^2}\right\}. \tag{3.52}$$
For $T_{\min} = 0.1$ mK and $T_{\max} = 100$ mK, the odds $= 10.94$, and $p(M_1|D,I) = 0.916$.
As noted earlier in this chapter, we consider the Jeffreys prior to be much more consistent with the large uncertainty in signal strength which was part of the prior information of the problem. On this basis, we conclude that for our current state of information, $p(M_1|D,I) = 0.916$ and $p(M_2|D,I) = 0.084$.
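The conversion from odds to model probabilities in Equation (3.51) is a one-line calculation. As a small Python sketch (the function name is an illustrative choice), applied to the two odds values quoted in the text:

```python
def prob_m1(odds):
    """Posterior probability of M1 from the odds in favour of M1, Equation (3.51)."""
    return 1.0 / (1.0 + 1.0 / odds)


for label, odds in [("uniform", 0.9986), ("Jeffreys", 10.94)]:
    p1 = prob_m1(odds)
    print(f"{label:8s} odds={odds:7.4f}  p(M1|D,I)={p1:.4f}  p(M2|D,I)={1.0 - p1:.4f}")
```

This relies on $M_1$ and $M_2$ being the only two models under consideration, so that their posterior probabilities sum to 1.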
3.8 Parameter estimation problem

Now that we have solved the model selection problem leading to a significant preference for $M_1$, which argues for the existence of the short-lived baryon, we would like to compute $p(T|D,M_1,I)$, the posterior PDF for the signal strength. Again we will compute the result for both choices of prior for comparison, but consider the Jeffreys result to be more reasonable for the current problem. Again, start with Bayes' theorem:
$$p(T|D,M_1,I) = \frac{p(T|M_1,I)\,p(D|M_1,T,I)}{p(D|M_1,I)} \propto p(T|M_1,I)\,p(D|M_1,T,I). \tag{3.53}$$
We have already evaluated $p(D|M_1,T,I)$ in Equation (3.44). All that remains is to plug in our two different choices for the prior $p(T|M_1,I)$.

Uniform prior case:
$$p(T|D,M_1,I) \propto \exp\left\{\frac{T\sum_i d_i f_i}{\sigma^2}\right\} \exp\left\{-\frac{T^2\sum_i f_i^2}{2\sigma^2}\right\}. \tag{3.54}$$

Jeffreys prior case:
$$p(T|D,M_1,I) \propto \frac{1}{T} \exp\left\{\frac{T\sum_i d_i f_i}{\sigma^2}\right\} \exp\left\{-\frac{T^2\sum_i f_i^2}{2\sigma^2}\right\}. \tag{3.55}$$
Figure 3.6 shows the posterior PDF for the signal strength for both the uniform and Jeffreys priors. As we saw earlier, the uniform prior favors stronger signals. In our original spectrum, the line strength was comparable to the noise level. How do the results change as we increase the line strength? Figure 3.7 shows a simulated spectrum for a line strength equal to five times the noise together with the estimated posterior PDF for the line strength. The increase in line strength has a dramatic effect on the odds which rise to $1.6 \times 10^{12}$ for the uniform prior and $5.3 \times 10^{12}$ for the Jeffreys prior.
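The two posteriors of Equations (3.54) and (3.55) can be tabulated and normalized on a grid. The Python sketch below again uses simulated data (assumed line centre in channel 37, simulated strength 1.5 mK, fixed seed) and restricts the grid to 0.1 to 10 mK on the assumption that the posterior is negligible beyond that; all of these are illustrative choices.

```python
import math
import random

N, NU_0, SIGMA_L, SIGMA = 64, 37, 2.0, 1.0   # NU_0 is an assumed line centre
rng = random.Random(1)
f = [math.exp(-((i - NU_0) ** 2) / (2.0 * SIGMA_L ** 2)) for i in range(1, N + 1)]
d = [1.5 * fi + rng.gauss(0.0, SIGMA) for fi in f]   # hypothetical data

S_df = sum(di * fi for di, fi in zip(d, f))
S_ff = sum(fi * fi for fi in f)

STEP = 0.005
t_grid = [0.1 + STEP * k for k in range(1981)]       # 0.1 ... 10 mK


def kernel(t):
    """T-dependent part of the likelihood shared by Equations (3.54) and (3.55)."""
    return math.exp(t * S_df / SIGMA ** 2 - t * t * S_ff / (2.0 * SIGMA ** 2))


def normalize(values):
    """Scale samples so that their Riemann sum over the grid is 1."""
    total = sum(values) * STEP
    return [v / total for v in values]


post_uniform = normalize([kernel(t) for t in t_grid])        # Equation (3.54)
post_jeffreys = normalize([kernel(t) / t for t in t_grid])   # Equation (3.55)

mode_uniform = t_grid[post_uniform.index(max(post_uniform))]
mode_jeffreys = t_grid[post_jeffreys.index(max(post_jeffreys))]
print(f"most probable T: uniform prior {mode_uniform:.3f} mK, "
      f"Jeffreys prior {mode_jeffreys:.3f} mK")
```

The extra factor $1/T$ in the Jeffreys case can only move the peak toward smaller $T$, which is the sense in which the uniform prior favors stronger signals.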
[Figure 3.6: posterior $p(T|D,M_1,I)$ versus line strength $T$ (0 to 5), with curves for the uniform and Jeffreys priors.]
Figure 3.6 Posterior PDF for the line strength, $T$, for uniform and Jeffreys priors.

3.8.1 Sensitivity of odds to Tmax

Figure 3.8 is a plot of the dependence of the odds on the assumed value of $T_{\max}$ for both uniform and Jeffreys priors. We see that under the uniform prior, the odds are much more strongly dependent on the prior range of $T$ than for the Jeffreys case. In both cases, the Occam's razor penalizing $M_1$ compared to $M_2$ for its greater complexity increases as the prior range for $T$ increases. Model complexity depends not only on the number of free parameters but also on their prior ranges. In this problem, we assumed that both the center frequency and line width were accurately predicted by $M_1$; the only uncertain quantity was the line strength. Suppose the center frequency and/or line width were uncertain as well. In this case, to compute the odds ratio, we would have to marginalize over the prior ranges for these parameters as well, giving rise to additional Occam's factors and a subsequent lowering of the odds. This agrees with our intuition: the more uncertain our prior information about the expected properties of the line, the less significance we attach to any bump in the spectrum.
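The dependence of the odds on $T_{\max}$ follows directly from Equations (3.50) and (3.52). The Python sketch below evaluates both integrals on a grid that is uniform in $\ln T$, again for simulated data (assumed line centre in channel 37, simulated strength 1.5 mK, fixed seed; the function name and the number of grid points are illustrative choices).

```python
import math
import random

N, NU_0, SIGMA_L, SIGMA = 64, 37, 2.0, 1.0   # NU_0 is an assumed line centre
T_MIN = 0.1
rng = random.Random(1)
f = [math.exp(-((i - NU_0) ** 2) / (2.0 * SIGMA_L ** 2)) for i in range(1, N + 1)]
d = [1.5 * fi + rng.gauss(0.0, SIGMA) for fi in f]   # hypothetical data

S_df = sum(di * fi for di, fi in zip(d, f))
S_ff = sum(fi * fi for fi in f)


def odds_ratio(t_max, prior, n=20000):
    """Odds from Equation (3.50) ('uniform') or (3.52) ('jeffreys')."""
    u_lo, u_hi = math.log(T_MIN), math.log(t_max)
    du = (u_hi - u_lo) / n
    total = 0.0
    for k in range(n + 1):
        t = math.exp(u_lo + k * du)
        ratio = math.exp(t * S_df / SIGMA ** 2 - t * t * S_ff / (2.0 * SIGMA ** 2))
        weight = 0.5 if k in (0, n) else 1.0   # trapezoidal end-point weights
        # dT = T du: the uniform integrand gains a factor T, the Jeffreys 1/T cancels.
        total += weight * ratio * (t if prior == "uniform" else 1.0) * du
    norm = (t_max - T_MIN) if prior == "uniform" else math.log(t_max / T_MIN)
    return total / norm


for t_max in (1e1, 1e2, 1e3, 1e4):
    print(f"Tmax={t_max:8.0f} mK  odds uniform={odds_ratio(t_max, 'uniform'):10.4f}  "
          f"odds Jeffreys={odds_ratio(t_max, 'jeffreys'):8.4f}")
```

Once $T_{\max}$ is well above the region where the likelihood is appreciable, the integral stops changing and only the normalization matters: the uniform odds fall as $1/(T_{\max} - T_{\min})$, the Jeffreys odds only as $1/\ln(T_{\max}/T_{\min})$.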
[Figure 3.7: left panel, "Spectral Line Data", signal strength (mK) versus channel number; right panel, posterior $p(T|D,M_1,I)$ versus line strength $T$ for the uniform and Jeffreys priors.]
Figure 3.7 The left panel shows a spectrum with a stronger spectral line. The right panel shows the computed posterior PDF for the line strength.
[Figure 3.8: odds versus $\log_{10} T_{\max}$ (1 to 4), with curves for the uniform and Jeffreys priors.]
Figure 3.8 The odds ratio versus upper limit on the predicted line strength ($T_{\max}$) for the uniform and Jeffreys priors.
3.9 Lessons

1. In the model selection problem, we are interested in the global probabilities of the two models independent of the most probable model parameters. This was achieved using Bayes' theorem and marginalizing over model $M_1$'s parameter $T$, the signal strength (model $M_2$ had no parameters pertaining to the spectral line data). An Occam's razor automatically arises each time a model parameter is marginalized, penalizing the model for prior parameter space that gets ruled out by the data. The larger the prior range that is excluded by the likelihood function, $p(D|M_1,T,I)$, the greater the Occam penalty as can be seen from Figure 3.8. Recall that the global likelihood for a model is the weighted average likelihood for its parameter(s). The weighting function is the prior for the parameter. Thus, the Occam penalty can be very different for two different choices of prior (uniform and Jeffreys). The results are always conditional on the truth of the prior which must be specified in the analysis, and there is a need to seriously examine the consequences of the choice of prior.

2. When the prior range for a parameter spans many orders of magnitude, a uniform prior implies that it is much more probable that the true value of the parameter is in the upper decade. Often, a large prior parameter range can be taken to mean we are ignorant of the scale, i.e., small values of the parameter are as likely as large values. For these situations, a useful choice is a Jeffreys prior, which corresponds to equal probability per decade (scale invariance). Note: when the range of a prior is a small fraction of the central value, then the conclusion will be the same whether a uniform or Jeffreys prior is used.

In the spectrum problem just analyzed, we started out with very crude prior information on the line strength predicted by $M_1$. Now that we have incorporated the new experimental information $D$, we have arrived at a posterior probability for the line strength, $p(T|D,M_1,I)$. Were we to obtain more data, $D_2$, we would set our new prior $p(T|M_1,I_2)$ equal to our current posterior $p(T|D,M_1,I)$, i.e., $I_2 = D, I$. The question of whether to use a Jeffreys or uniform prior would no longer be relevant.
[Figure 3.9: probability density versus channel number.]
Figure 3.9 Marginal posterior PDF for the line frequency, where the line frequency is expressed as a spectrometer channel number.
3. If the location and line width were also uncertain, we would have to marginalize over these parameters as well, giving rise to other Occam factors which would decrease the odds still further. For example, if the prior range for the expected channel number of the spectral line were increased from less than 1 to 44 channels, the odds would decrease from 11 to 1, assuming a uniform prior for the line location. We can also compute the marginal posterior PDF for the line frequency for this case which is shown in Figure 3.9. This permits us to update our knowledge of the line frequency given the data and assuming the theory is correct. For further insights on this matter, see the discussion on systematic errors in Section 3.11.

4. Once we established that model $M_1$ was more probable, we were able to apply Bayes' theorem again, to compute the posterior PDF for the line strength. Note: no Occam factors arise in parameter estimation. Parameter estimation can be viewed as model selection where the competing models all have exactly the same complexity so the Occam penalties are identical and cancel out in the analysis. It can happen that the $p(T|D,M_1,I)$ can be very small for values of $T$ close to zero. One might be tempted to rule out $M_2$ because it predicts $T = 0$, thus bypassing the model selection problem. This is not wise, however, because the model selection analysis includes Occam factors that could rule out $M_1$ compared to the simpler $M_2$. As we noted, these Occam factors do not appear in the parameter estimation analysis.

5. In this toy problem, the spectral line data assume that any background continuum radiation or instrumental DC level has been subtracted off, which can only be done to a certain accuracy. It would be better to parameterize this DC level and marginalize over this parameter so that the effect of our uncertainty in this quantity (see Section 3.11) will be included in our final odds ratio and spectral line parameter estimates. A still more complicated version of this problem is if $M_1$ simply predicts a certain prior range for the optical depth of the line but leaves unanswered whether the line will be seen in emission or absorption against the background continuum. In this problem, a Bayesian solution is still possible but will involve a more complicated model of the spectral line data.
3.10 Ignorance priors

In the analysis of the spectral line problem of Section 3.7.1, we considered two different forms of prior (uniform and Jeffreys) for the unknown line temperature parameter. We learned that there was a strong reason for picking the Jeffreys prior in this problem. What motivated a consideration of these particular priors in the first place? In this section we will attempt to answer this question.

As we study any particular phenomenon, our state of knowledge changes. When we are well into the study, our prior for the analysis of new data will be well defined by our previous posterior. But in the earliest phase, our state of "ignorance" will be high. It is therefore useful to have arguments to aid us in selecting an appropriate form of prior to use in such situations. Of course, if we are completely ignorant we cannot even state the problem of interest, and in that case we have no use for a prior. Let us suppose our state of knowledge is sufficient to pose the problem but not much more. For example, we might be interested in the location of the highest point on the equator of Pluto. Are there any general arguments to help us select a suitable prior?

In Section 2.6 we saw how to use the Principle of Indifference to arrive at a probability distribution for a discrete set of hypotheses. In the discussion that follows, we will consider a general argument that suggests the form of priors to use for two types of continuous parameters. We will make a distinction between location parameters and scale parameters. For example, consider the location of an event in space. To describe this, we must locate the event with respect to some origin and specify the size (scale) of our units of space (e.g., ft, m, light years). The location of an event can be either a positive or negative quantity depending on our choice of origin but the scale (size of our space units) is always a positive quantity. We will first consider a prior for a location parameter.
Suppose we are interested in evaluating $p(X|I)$, where $X \equiv$ "a proposition asserting that the location of the tallest tree along the shore of Lake Superior is between $x$ and $x + dx$." In this statement of the problem, $x$ is measured with respect to a particular survey stake. We will represent the probability density by the function $f(x)$. What if we consider a different statement of the problem in which the only change is that the origin of our distance measurement has been shifted by an amount $c$ and we are interested in $p(X'|I)$ where $x' = x + c$? If a shift of location (origin) can make the problem appear in any way different, then it must be that we had some kind of prior knowledge about location. In the limit of complete ignorance, the choice of prior would be invariant to a shift in location. Although we are not completely ignorant it still might be useful, in the earliest phase of an investigation, to adopt a prior which is invariant to a shift in location. What form of prior does this imply? If we define our state of ignorance to mean that the above two statements of the problem are equivalent, then the desideratum of consistency demands that
$$p(X|I)\,dX = p(X'|I)\,dX' = p(X'|I)\,d(X + c) = p(X'|I)\,dX. \tag{3.56}$$
From this it follows that
$$f(x) = f(x') = f(x + c). \tag{3.57}$$
The solution of this equation is $f(x) =$ constant, so
$$p(X|I) = \text{constant}. \tag{3.58}$$
In the Lake Superior problem, it is apparent that we have knowledge of the upper ($x_{\max}$) and lower ($x_{\min}$) bounds of $x$, so the constant $= 1/(x_{\max} - x_{\min})$. If we are ignorant of these limits then we refer to $p(X|I)$ as an improper prior, meaning that it is not normalized. An improper prior is useable in parameter estimation problems but is not suitable for model selection problems, because the Occam factors depend on knowing the prior range for each model parameter.

Now consider a problem where we are interested in the mean lifetime of a newly discovered aquatic creature found in the ocean below the ice crust on the moon Europa. We call the lifetime a scale parameter because it can only have positive values, unlike a location parameter which can assume both positive and negative values. Let $T \equiv$ "the mean lifetime is between $\tau$ and $\tau + d\tau$." What form of prior probability density, $p(T|I)$, should we use in this case? We will represent the probability density by the function $g(\tau)$. What if we consider a different statement of the problem in which the only change is that the time is measured in units differing by a factor $\beta$? Now we are interested in $p(T'|I)$ where $\tau' = \beta\tau$. If we define our state of ignorance to mean that the two statements of the problems are equivalent, then the desideratum of consistency demands that
$$p(T|I)\,dT = p(T'|I)\,dT' = p(T'|I)\,d(\beta T) = \beta\,p(T'|I)\,dT. \tag{3.59}$$
From this it follows that
$$g(\tau) = \beta\,g(\tau') = \beta\,g(\beta\tau). \tag{3.60}$$
The solution of this equation is $g(\tau) = \text{constant}/\tau$, so
$$p(T|I) = \frac{\text{constant}}{\tau}. \tag{3.61}$$
This form of prior is called the Jeffreys prior after Sir Harold Jeffreys who first suggested it. If we have knowledge of the upper ($\tau_{\max}$) and lower ($\tau_{\min}$) bounds of $\tau$ then we can evaluate the normalization constant. The result is
$$p(T|I) = \frac{1}{\tau \ln(\tau_{\max}/\tau_{\min})}. \tag{3.62}$$
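The step from the scale-invariance condition to the $1/\tau$ form, and from there to the normalized prior, can be spelled out as follows. This is a short derivation sketch that assumes $g$ is differentiable; the symbol $C$ for the constant is introduced here only for convenience.

```latex
% Differentiate g(\tau) = \beta\, g(\beta\tau) with respect to \beta and set \beta = 1.
% The left-hand side does not depend on \beta, so
\begin{align*}
0 &= g(\tau) + \tau\, g'(\tau)
  \;\Longrightarrow\; \frac{g'(\tau)}{g(\tau)} = -\frac{1}{\tau}
  \;\Longrightarrow\; g(\tau) = \frac{C}{\tau}, \\[4pt]
% Check that this solves the original condition for every \beta:
\beta\, g(\beta\tau) &= \beta\,\frac{C}{\beta\tau} = \frac{C}{\tau} = g(\tau), \\[4pt]
% Normalize over \tau_{\min} \le \tau \le \tau_{\max}:
\int_{\tau_{\min}}^{\tau_{\max}} \frac{C}{\tau}\, d\tau
  &= C \ln\!\left(\frac{\tau_{\max}}{\tau_{\min}}\right) = 1
  \;\Longrightarrow\; C = \frac{1}{\ln(\tau_{\max}/\tau_{\min})}.
\end{align*}
```

The same normalization applied to the line strength, with $\tau$ replaced by $T$, gives the Jeffreys prior used for the spectral line problem.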
Returning to the spectral line problem, we now see another reason for preferring the choice of the Jeffreys prior for the temperature parameter, because it is a scale parameter. In Section 9.2.3, we will discover yet another powerful argument for selecting the Jeffreys prior for a scale parameter.
3.11 Systematic errors

In scientific inference, we encounter at least two general types of uncertainties which are broadly classified as random and systematic. Random uncertainties can be reduced by acquiring and averaging more data. This is the basis behind signal averaging which is discussed in Section 5.11.1. Of course, what appears random for one state of information might later be discovered to have a predictable pattern as our state of information changes. Some typical examples of systematic errors include errors of calibration of meters and rulers,⁵ and stickiness and wear in the moving parts of meters. For example, over time an old wooden meter stick may shrink by as much as a few mm. Some potential systematic errors can be detected by careful analysis of the experiment before performing it and can then be eliminated either by applying suitable corrections or through careful experimental design. The remaining systematic errors can be very subtle, and are detected with certainty only when the same quantity is measured by two or more completely different experimental methods. The systematic errors are then revealed by discrepancies between the measurements made by the different methods.

Bayesian inference provides a powerful way of looking at and dealing with some of these subtle systematic errors. We almost always have some prior information about the accuracy of our "ruler." Clearly, if we had no information about its accuracy (in contrast to its repeatability), we would have no logical grounds to use it at all except as a means for ordering events. In this case, we would be expecting no more from our ruler and we would have no concern about a systematic error. What this implies is that we require at least some limited prior information about our ruler's scale to be concerned about a systematic error. As we have seen, a unique feature of the Bayesian approach is the ability to incorporate prior information and see how it affects our conclusions.
In the case of the ruler accuracy, the approach taken is to introduce the scale of the ruler into the calculation as a parameter, i.e., we parameterize the systematic error. We can then treat this as a nuisance parameter and marginalize (integrate over) this parameter to obtain our final inference about the quantity of interest. If the uncertainty in the accuracy of our scale is very large, this will be reflected quantitatively in a larger uncertainty in our final inference.

⁵ One important ruler in astronomy is the Hubble relation relating redshift or velocity to distance.

In a complex measurement, many different types of systematic errors can occur, which in principle, can be parameterized and marginalized. For example, consider the following modification to the spectral line problem of Section 3.6. Even if we know the predicted frequency of the spectral line accurately, the observed frequency depends on the velocity of the source with respect to the observer through the Doppler effect. The observed frequency of the line, $f_o$, is related to the emitted frequency, $f_e$, by
$$f_o = f_e\left(1 + \frac{v}{c}\right) \quad \text{for } \frac{v}{c} \ll 1, \tag{3.63}$$
where $v$ is the line of sight component of the velocity of the line emitting region and $c$ equals the velocity of light. In our search for a spectral line, we may be examining a small portion of the Orion nebula and only know the distribution of velocities for the integrated emission from the whole nebula, which may be dominated by turbulent and rotational motion of its parts. The unknown factor $v$ introduces a systematic error in our frequency scale. In this case, we might choose to parameterize the systematic error in $v$ by a Gaussian with a mean and $\sigma$ equal to that of the Orion nebula as a whole.

From the Bayesian viewpoint, we can even consider uncertain scales that arise in a theoretical model as introducing a systematic error on the same footing, for the purposes of inference, as those associated with a measurement. In the above example, we may know the velocity of the source accurately but the theory may be imprecise with regard to its frequency scale. Of course, the exact form by which we parameterize a systematic error is constrained by our available information, and just as our theories of nature are updated as our state of knowledge changes, so in general will our understanding of these systematic errors. It is often the case that we can obtain useful information about a systematic error from the interaction between measurements and theory in Bayesian inference. In particular, we can compute the marginal posterior for the parameter characterizing our systematic error as was done in Figure 3.9.
This and other points raised in this section are brought out by the problems at the end of this chapter. The effect of marginalizing over any parameter, whether or not it is associated with a systematic error, is to introduce an Occam factor which penalizes the model for any prior parameter space that gets ruled out by the data through the likelihood function. The larger the prior range that is excluded by the likelihood function, the greater the Occam penalty. It is thus possible to rule out a valid model by employing an artificially large prior for some systematic error or model parameter. Fortunately, Bayesian inference requires one to specify one’s choice of prior so its effect on the conclusions can readily be assessed.
3.11.1 Systematic error example

In 1929, Edwin Hubble found a simple linear relationship between the distance of a galaxy, $x$, and its recessional velocity, $v$, of the form $v = H_0 x$, where $H_0$ is known as Hubble's constant. Hubble's constant provides the scale of our ruler for astronomical distance determination. An error in $H_0$ leads to a systematic error in distance determination. A modern value is $H_0 = 70 \pm 10$ km s$^{-1}$ Mpc$^{-1}$. Note: astronomical distances are commonly measured in Mpc (a million parsecs). Suppose a particular galaxy has a measured recessional velocity $v_m = (100 \pm 5) \times 10^3$ km s$^{-1}$. Determine the posterior PDF for the distance to the galaxy assuming:

1) A fixed value of $H_0 = 70$ km s$^{-1}$ Mpc$^{-1}$.

2) We allow for uncertainty in the value of Hubble's constant. We assume a Gaussian probability density function for $H_0$, of the form
$$p(H_0|I) = k \exp\left\{-\frac{(H_0 - 70)^2}{2 \times 10^2}\right\}, \tag{3.64}$$
where $k$ is a normalization constant.

3) We assume a uniform probability density function for $H_0$, given by
$$p(H_0|I) = \begin{cases} 1/(90 - 50), & \text{for } 50 \le H_0 \le 90 \\ 0, & \text{elsewhere.} \end{cases} \tag{3.65}$$

4) We assume a Jeffreys probability density function for $H_0$, given by
$$p(H_0|I) = \begin{cases} [H_0 \ln(90/50)]^{-1}, & \text{for } 50 \le H_0 \le 90 \\ 0, & \text{elsewhere.} \end{cases} \tag{3.66}$$
As usual, we can write
$$v_m = v_{\text{true}} + e, \tag{3.67}$$
where $v_{\text{true}}$ is the true recessional velocity and $e$ represents the noise component of the measured velocity, $v_m$. Assume that the probability density function for $e$ can be described by a Gaussian with mean 0 and $\sigma = 5 \times 10^3$ km s$^{-1}$, the uncertainty quoted for $v_m$. To keep the problem simple, we also assume the error in $v$ is uncorrelated with the uncertainty in $H_0$. Through the application of Bayes' theorem, as outlined in earlier sections of this chapter, we can readily evaluate the posterior PDF, $p(x|D,I)$, for the distance to the galaxy. The results for the four cases are given below and plotted in Figure 3.10.

Case 1:
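$$\begin{aligned}
p(x|D,I) \propto p(x|I)\,p(D|x,I) &= p(x|I)\,\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{e^2}{2\sigma^2}\right\} \\
&= p(x|I)\,\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(v_m - v_{\text{true}})^2}{2\sigma^2}\right\} \\
&= p(x|I)\,\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(v_m - H_0 x)^2}{2\sigma^2}\right\}.
\end{aligned} \tag{3.68}$$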
[Figure 3.10: $p(x|D,I)$ versus distance (Mpc), with curves for cases 1 to 4.]
Figure 3.10 Posterior PDF for the galaxy distance, $x$: 1) assuming a fixed value of Hubble's constant ($H_0$), 2) incorporating a Gaussian prior uncertainty for $H_0$, 3) incorporating a uniform prior uncertainty for $H_0$, and 4) incorporating a Jeffreys prior uncertainty for $H_0$.
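Case 2: In this case, $I$ incorporates a Gaussian prior uncertainty in the value of $H_0$.
$$\begin{aligned}
p(x|D,I) &= \int_{-\infty}^{\infty} dH_0\, p(x, H_0|D,I) \\
&\propto p(x|I) \int_{-\infty}^{\infty} dH_0\, p(H_0|x,I)\,p(D|x,H_0,I) \\
&= p(x|I) \int_{-\infty}^{\infty} dH_0\, p(H_0|I)\,p(D|x,H_0,I) \\
&= p(x|I) \int_{-\infty}^{\infty} dH_0\, k \exp\left\{-\frac{(H_0 - 70)^2}{2 \times 10^2}\right\} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(v_m - H_0 x)^2}{2\sigma^2}\right\}.
\end{aligned} \tag{3.69}$$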
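Case 3: In this case, $I$ incorporates a uniform prior uncertainty in the value of $H_0$.
$$\begin{aligned}
p(x|D,I) &\propto p(x|I) \int_{50}^{90} dH_0\, p(H_0|I)\,p(D|x,H_0,I) \\
&= p(x|I) \int_{50}^{90} dH_0\, \frac{1}{(90 - 50)}\,\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(v_m - H_0 x)^2}{2\sigma^2}\right\}.
\end{aligned} \tag{3.70}$$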
Case 4: In this case, $I$ incorporates a Jeffreys prior uncertainty in the value of $H_0$.
$$\begin{aligned}
p(x|D,I) &\propto p(x|I) \int_{50}^{90} dH_0\, p(H_0|I)\,p(D|x,H_0,I) \\
&= p(x|I) \int_{50}^{90} dH_0\, \frac{1}{H_0 \ln(90/50)}\,\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(v_m - H_0 x)^2}{2\sigma^2}\right\}.
\end{aligned} \tag{3.71}$$
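The four cases can be evaluated numerically with nested grid sums. The Python sketch below makes several assumptions that are not stated in the text: the uniform prior on $x$ is taken over 800 to 2400 Mpc, the Gaussian case is integrated over $H_0$ from 30 to 110 (four standard deviations either side of 70), and $\sigma$ is taken as $5 \times 10^3$ km s⁻¹ from the quoted $v_m = (100 \pm 5) \times 10^3$ km s⁻¹. The helper names are illustrative.

```python
import math

V_M = 100.0e3      # measured recessional velocity, km/s
SIGMA_V = 5.0e3    # velocity uncertainty, km/s (assumed from v_m = (100 +/- 5) x 10^3)


def gauss_like(x, h0):
    """p(D|x,H0,I): Gaussian in (v_m - H0 x), as in Equation (3.68)."""
    return (math.exp(-((V_M - h0 * x) ** 2) / (2.0 * SIGMA_V ** 2))
            / (math.sqrt(2.0 * math.pi) * SIGMA_V))


def h0_grid(lo, hi, step=0.2):
    """Equally spaced H0 values from lo to hi inclusive."""
    n = int(round((hi - lo) / step))
    return [lo + k * step for k in range(n + 1)]


def marginal(x, prior, grid):
    """Trapezoidal integral over H0 of prior(H0) * p(D|x,H0,I)."""
    y = [prior(h) * gauss_like(x, h) for h in grid]
    return sum(0.5 * (y[k] + y[k + 1]) * (grid[k + 1] - grid[k])
               for k in range(len(grid) - 1))


# Priors of Equations (3.64)-(3.66); normalization constants cancel in the mean.
priors = {
    "case 2 (Gaussian)": (lambda h: math.exp(-((h - 70.0) ** 2) / (2.0 * 10.0 ** 2)),
                          h0_grid(30.0, 110.0)),
    "case 3 (uniform)": (lambda h: 1.0 / (90.0 - 50.0), h0_grid(50.0, 90.0)),
    "case 4 (Jeffreys)": (lambda h: 1.0 / (h * math.log(90.0 / 50.0)),
                          h0_grid(50.0, 90.0)),
}

x_grid = [800.0 + 5.0 * k for k in range(321)]   # distances in Mpc, uniform prior on x


def posterior_mean(values):
    """Mean of x for an unnormalized posterior sampled on x_grid."""
    return sum(x * v for x, v in zip(x_grid, values)) / sum(values)


means = {"case 1 (fixed H0)": posterior_mean([gauss_like(x, 70.0) for x in x_grid])}
for name, (prior, grid) in priors.items():
    means[name] = posterior_mean([marginal(x, prior, grid) for x in x_grid])

for name, m in means.items():
    print(f"{name:20s} mean distance = {m:7.1f} Mpc")
```

The posterior means depend on the integration limits chosen for $x$ and $H_0$, so small differences from published values are to be expected, particularly for the Gaussian case whose tails are truncated here.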
Equations (3.68), (3.69), (3.70), and (3.71) have been evaluated assuming a uniform prior for $p(x|I)$, and are plotted in Figure 3.10. Incorporating the uncertainty in the scale of our astronomical ruler can lead to two effects. Firstly, the posterior PDF for the galaxy distance is broader. Secondly, the mean of the PDF is clearly shifted to a larger value. The means of the PDFs for the four cases are 1429, 1486, 1512, and 1556 Mpc, respectively.

It may surprise you that $p(x|D,I)$ becomes asymmetric when we allow for the uncertainty in $H_0$. One way to appreciate this is to approximate the integral by a weighted summation over a discrete set of choices for $H_0$. For each choice of $H_0$, $p(x|D,I)$ is a symmetric Gaussian offset by a distance $\delta x$ given by
$$\delta x = \frac{v_m}{H_0 + \delta H_0} - \frac{v_m}{H_0} = \frac{v_m}{H_0}\left\{-\frac{\delta H_0}{H_0 + \delta H_0}\right\}. \tag{3.72}$$
For $\delta H_0 = +20$ km s$^{-1}$ Mpc$^{-1}$, the bracketed term in Equation (3.72) is equal to $-0.22$. For $\delta H_0 = -20$ km s$^{-1}$ Mpc$^{-1}$, this term is equal to $+0.4$. Thus, the set of discrete Gaussians is more spread out on one side than the other, which accounts for the asymmetry.

3.12 Problems

1. Redo the calculation of the odds for the spectral line problem of Section 3.6 for the case where there is a systematic uncertainty in the line center of ±5 channels.

2. The prior information is the same as that given for the spectral line problem in Section 3.6 of the text. The measured spectrum is given in Table 3.2. The spectrum consists of 64 frequency channels. Theory predicts the spectral line has a Gaussian shape with a line width $\sigma_L = 2$ frequency channels. The noise in each channel is known to be Gaussian with a $\sigma = 1.0$ mK and the spectrometer output is in units of mK.

(a) Plot a graph of the raw data.

(b) Compute the posterior probability of $M_1 \equiv$ "theory 1 is correct, the spectral line exists," for the two cases: (1) Jeffreys prior for the signal strength, and (2) uniform prior. For this part of the problem, assume that the theory predicts that the spectral line is in channel 24. The prior range for the signal strength is 0.1 to 100 mK. In Mathematica you can use the command NIntegrate to do the numerical integration required in marginalizing over the line strength.
Table 3.2 Spectral line data consisting of 64 frequency channels obtained with a radio astronomy spectrometer. The output voltage from each channel has been calibrated in units of effective black body temperature expressed in mK. The existence of negative values arises from receiver channel noise which gives rise to both positive and negative fluctuations.

| ch. # | mK | ch. # | mK | ch. # | mK | ch. # | mK |
|---|---|---|---|---|---|---|---|
| 1 | 0.25 | 17 | 0.42 | 33 | 0.44 | 49 | 1.56 |
| 2 | 0.19 | 18 | 1.43 | 34 | 0.05 | 50 | 0.64 |
| 3 | 0.25 | 19 | 1.33 | 35 | 0.59 | 51 | 0.48 |
| 4 | 0.56 | 20 | 0.06 | 36 | 0.94 | 52 | 1.79 |
| 5 | 0.41 | 21 | 0.82 | 37 | 0.10 | 53 | 0.07 |
| 6 | 0.94 | 22 | 0.42 | 38 | 0.57 | 54 | 1.30 |
| 7 | 0.84 | 23 | 3.76 | 39 | 0.40 | 55 | 0.29 |
| 8 | 0.30 | 24 | 1.10 | 40 | 0.97 | 56 | 0.23 |
| 9 | 2.06 | 25 | 1.31 | 41 | 2.20 | 57 | 0.50 |
| 10 | 1.39 | 26 | 1.86 | 42 | 0.15 | 58 | 0.93 |
| 11 | 0.07 | 27 | 0.32 | 43 | 0.37 | 59 | 1.28 |
| 12 | 1.80 | 28 | 1.14 | 44 | 0.67 | 60 | 1.98 |
| 13 | 1.02 | 29 | 1.24 | 45 | 0.05 | 61 | 1.85 |
| 14 | 0.46 | 30 | 0.29 | 46 | 0.20 | 62 | 0.89 |
| 15 | 0.29 | 31 | 0.02 | 47 | 0.65 | 63 | 0.65 |
| 16 | 0.36 | 32 | 1.52 | 48 | 1.24 | 64 | 0.28 |
(c) Explain your reasons for preferring one or the other of the two priors.

(d) On the assumption that the model predicting the spectral line is correct, compute and plot the posterior probability (density function) for the line strength for both priors.

(e) Summarize the posterior probability for the line strength by quoting the most probable value and the $(+)$ and $(-)$ error bars that span the 95% credible region (see the last part of Section 3.3 for a definition of credible region). The credible region can be evaluated by computing the probability for a discrete grid of closely spaced line temperature values. Sort these (probability, temperature) pairs in descending order of probability and then sum the probabilities starting from the highest until they equal 95%. As each term is added, keep track of the upper and lower temperature bounds of the terms included in the sum. The Mathematica command Sort[yourdata, OrderedQ[{#2, #1}] &] will sort the list "yourdata" in descending order according to the first item in each row of the data list.

(f) Repeat the calculations in (b) and (d), only this time, assume that the prior prediction on the location of the spectral line frequency is uncertain; it is predicted to occur somewhere between channels 1 and 50. Assume a uniform prior for the unknown line center.⁶ This will involve computing a two-dimensional likelihood distribution in the variables line frequency and line strength for a discrete set of values of these parameters, and then using a summation operation to approximate integration⁷ (you will probably find NIntegrate too slow in two dimensions), for marginalizing over both parameters to obtain the global likelihood for computing the odds. For this purpose, you can use a line frequency interval of 1 channel and a signal strength interval of 0.1 mK for 100 intervals. Although this only spans the prior range 0.1 to 10 mK, the PDF will be so low beyond 10 mK that it will not contribute significantly to the integral.

(g) Calculate and plot the marginal posterior probabilities for the line frequency.

(h) What additional Occam factor is associated with marginalizing over the prior line frequency range?

3. Plot $p(x|D,I)$ for case 4 (Jeffreys prior) in Section 3.11.1, assuming
$$p(H_0|I) = \begin{cases} \dfrac{1}{H_0 \ln(80/60)}, & \text{for } 60 \le H_0 \le 80 \\ 0, & \text{elsewhere.} \end{cases} \tag{3.73}$$
Box 3.1 Equation (3.69) can be evaluated using Mathematica. The evaluation will be faster if you compute a Table of values for p(x, H0|D, I) at equally spaced intervals in x, and use NIntegrate to integrate over the given range for H0.
p(x|D, I) ∝ Table[{x, NIntegrate[H0 (1/(Sqrt[2 Pi] σ)) Exp[-(vm - x H0)^2/(2 σ^2)], {H0, 60, 80}]}, {x, 800, 2200, 50}]
[6] Note: when the frequency range of the prior is a small fraction of the center frequency, the conclusion will be the same whether a uniform or Jeffreys prior is assumed for the unknown frequency.
[7] A convenient way to sum elements in a list is to use the Mathematica command Plus@@list.
4 Assigning probabilities
4.1 Introduction
When we adopt the approach of probability theory as extended logic, the solution to any inference problem begins with Bayes' theorem:
$$p(H_i|D,I) = \frac{p(H_i|I)\,p(D|H_i,I)}{p(D|I)}. \tag{4.1}$$
In a well-posed problem, the prior information, I, defines the hypothesis space and provides the information necessary to compute the terms in Bayes' theorem. In this chapter we will be concerned with how to encode our prior information, I, into a probability distribution to use for $p(D|H_i,I)$. Different states of knowledge correspond to different probability distributions. These probability distributions are frequently called sampling distributions, a carry-over from conventional statistics literature. Recall that in inference problems, $p(D|H_i,I)$ gives the probability of obtaining the data, D, that we actually got, under the assumption that $H_i$ is true. Thus, $p(D|H_i,I)$ yields how likely it is that $H_i$ is true,[1] and hence it is referred to as the likelihood and frequently written as $L(H_i)$. For example, we might have two competing hypotheses $H_1$ and $H_2$ that each predicts different values of some temperature, say 1 K and 4.5 K, respectively. If the measured value is 1.2 ± 0.4 K then it is clear that $H_1$ is more likely to be true. In precisely this type of situation we can use $p(D|H_i,I)$ to compute quantitatively the relative likelihood of $H_1$ and $H_2$. We saw how to do that in one case (Section 3.6) where the likelihood was the product of N independent Gaussian distributions.
[1] Conversely, if we know that $H_i$ is true, then we can directly calculate the probability of observing any particular data value. We will use $p(D|H_i,I)$ in this way to generate simulated data sets in Section 5.13.

4.2 Binomial distribution
In this section, we will see how a particular state of knowledge (prior information I) leads us to the choice of likelihood, $p(D|H_i,I)$, which is the well-known binomial distribution (derivation due to M. Tribus, 1969). In this case, our prior information is as follows:
I ≡ "Proposition E represents an event that is repeated many times and has two possible outcomes represented by propositions, $Q$ and $\bar{Q}$, e.g., tossing a coin. The probability of outcome Q is constant from event to event, i.e., the probability of getting an outcome Q in any individual event is independent of the outcome for any other event."
In the Boolean algebra of propositions we can write E as
$$E = Q + \bar{Q}, \tag{4.2}$$
where $Q + \bar{Q}$ is the logical sum. Then the possible outcomes of n events can be written as
$$E_1, E_2, \ldots, E_n = (Q_1 + \bar{Q}_1), (Q_2 + \bar{Q}_2), \ldots, (Q_n + \bar{Q}_n), \tag{4.3}$$
where $Q_i$ ≡ "outcome Q occurred for the ith event." If the multiplication on the right is carried out, the result will be a logical sum of $2^n$ terms, each a product of n logical statements, thereby enumerating all possible outcomes of the n events. For n = 3 we find:
$$E_1, E_2, E_3 = Q_1,Q_2,Q_3 + \bar{Q}_1,Q_2,Q_3 + Q_1,\bar{Q}_2,Q_3 + Q_1,Q_2,\bar{Q}_3 + \bar{Q}_1,\bar{Q}_2,Q_3 + \bar{Q}_1,Q_2,\bar{Q}_3 + Q_1,\bar{Q}_2,\bar{Q}_3 + \bar{Q}_1,\bar{Q}_2,\bar{Q}_3. \tag{4.4}$$
The probability of the particular sequence $Q_1, Q_2, \bar{Q}_3$ can be obtained from repeated applications of the product rule.
$$p(Q_1,Q_2,\bar{Q}_3|I) = p(Q_1|I)\,p(Q_2,\bar{Q}_3|Q_1,I) = p(Q_1|I)\,p(Q_2|Q_1,I)\,p(\bar{Q}_3|Q_1,Q_2,I). \tag{4.5}$$
Information I leads us to assign the same probability for outcome Q for each event independent of what happened earlier or later, so Equation (4.5) becomes
$$p(Q_1,Q_2,\bar{Q}_3|I) = p(Q_1|I)\,p(Q_2|I)\,p(\bar{Q}_3|I) = p(Q|I)\,p(Q|I)\,p(\bar{Q}|I) = p(Q|I)^2\,p(\bar{Q}|I). \tag{4.6}$$
Thus, the probability of a particular outcome depends only on the number of $Q$'s and $\bar{Q}$'s in it and not on the order in which they occur. Returning to Equation (4.4), we note that: one outcome, the first, contains three Q's, three outcomes contain two Q's, three outcomes contain only one Q, and one outcome contains no Q's.
More generally, we are going to be interested in the number of ways of getting an outcome with r Q's in n events or trials. In each event, it is possible to obtain a Q, so the question becomes: in how many ways can we select r Q's from n events where their order is irrelevant? The answer is given by ${}^{n}C_r$,
$${}^{n}C_r = \frac{n!}{r!\,(n-r)!} = \binom{n}{r}. \tag{4.7}$$
For example, $\binom{3}{2} = \frac{3!}{2!\,1!} = 3$, corresponding to the three sequences $Q,Q,\bar{Q}$; $Q,\bar{Q},Q$; and $\bar{Q},Q,Q$.
Thus, the probability of getting r Q's in n events is the probability of any one sequence with r $Q$'s and $(n-r)$ $\bar{Q}$'s, multiplied by ${}^{n}C_r$, the multiplicity of ways of obtaining r Q's in n events or trials. Therefore, we conclude that in n trials, the probability of seeing the outcome of r $Q$'s and $(n-r)$ $\bar{Q}$'s is
$$p(r|n,I) = \frac{n!}{r!\,(n-r)!}\,p(Q|I)^r\,p(\bar{Q}|I)^{n-r}. \tag{4.8}$$
This distribution is called the binomial distribution. Note the similarity to the binomial expansion
$$(x+y)^n = \sum_{r=0}^{n} \frac{n!}{r!\,(n-r)!}\,x^r y^{n-r}. \tag{4.9}$$
Referring back to Equation (4.4), in the algebra of propositions, we can interpret $E^n$ to mean E carried out n times and write it in a form analogous to Equation (4.9): $E^n = (Q + \bar{Q})^n$.

Example: I ≡ "You pick up one of two coins which appear identical. One, coin A, is known to be a fair coin, while coin B is a weighted coin with p(head) = 0.2." From this information and from experimental information you will acquire from tossing the coin, compute the probability that you picked up coin A. D ≡ "3 heads turn up in 5 tosses." What is the probability you picked coin A?
Let
$$\text{odds} = \frac{p(A|D,I)}{p(B|D,I)} = \frac{p(A|I)\,p(D|A,I)}{p(B|I)\,p(D|B,I)} = \frac{\tfrac{1}{2}\,p(D|A,I)}{\tfrac{1}{2}\,p(D|B,I)}. \tag{4.10}$$
To evaluate the likelihoods $p(D|A,I)$ and $p(D|B,I)$, we use the binomial distribution, given by
$$p(r|n,I) = \frac{n!}{r!\,(n-r)!}\,p(\text{head}|A,I)^r\,p(\text{tail}|A,I)^{n-r},$$
where $p(r|n,I)$ is the probability of obtaining r heads in n tosses and $p(\text{head}|A,I)$ is the probability of obtaining a head in any single toss assuming A is true. Now
$$p(D|A,I) = \binom{n}{r}\,p(\text{head}|A,I)^r\,p(\text{tail}|A,I)^{n-r} = \binom{5}{3}(0.5)^3(0.5)^2$$
and
$$p(D|B,I) = \binom{5}{3}(0.2)^3(0.8)^2,$$
so that
$$\text{odds} = 6.1 = \frac{p(A|D,I)}{1 - p(A|D,I)},$$
and so $p(A|D,I) = 0.86$. Thus, the probability you picked up coin A = 0.86, based on our current state of knowledge.
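The two-coin calculation can be checked numerically. The following is a minimal sketch in Python (a stand-in for the book's Mathematica, chosen here only for illustration); the function and variable names are hypothetical and not part of the original text.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of r successes in n independent trials, Equation (4.8)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# Data D: 3 heads in 5 tosses; the two coins are equally probable a priori.
n_tosses, n_heads = 5, 3
like_A = binomial_pmf(n_heads, n_tosses, 0.5)   # fair coin A
like_B = binomial_pmf(n_heads, n_tosses, 0.2)   # weighted coin B
prior_A = prior_B = 0.5

odds = (prior_A * like_A) / (prior_B * like_B)  # Equation (4.10)
prob_A = odds / (1.0 + odds)                    # invert odds = p / (1 - p)

print(f"odds = {odds:.2f}, p(A|D,I) = {prob_A:.2f}")
```

Because the priors are equal they cancel, so the odds reduce to the likelihood ratio; changing `prior_A` and `prior_B` shows how unequal prior information would shift the answer.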
4.2.1 Bernoulli's law of large numbers
The binomial distribution allows us to compute $p(r|n,I)$, where r is, for example, the number of heads occurring in n tosses of a coin. According to Bernoulli's law of large numbers, the long-run frequency of occurrence tends to the probability of the event occurring in any single trial, i.e.,
$$\lim_{n \to \infty} \frac{r}{n} = p(\text{head}|I). \tag{4.11}$$
We can easily demonstrate this using the binomial distribution. If the probability of a head in any single toss is $p(\text{head}|I) = 0.4$, Figure 4.1 shows a plot of $p(r/n|n,I)$ versus the fraction $r/n$ for a variety of different choices of n ranging from 20 to 1000.
Box 4.1 Mathematica evaluation of binomial distribution:
Needs["Statistics`DiscreteDistributions`"]
The line above loads a package containing a wide range of discrete distributions of importance to statistics, and the following line computes the probability of r heads in n trials where the probability of a head in any one trial is p.
PDF[BinomialDistribution[n, p], r] → answer = 0.205 (n = 10, p = 0.5, r = 4)
Notice as n increases, the PDF for the frequency becomes progressively more sharply peaked, converging on a value of 0.4, the probability of a head in any single toss. Although Bernoulli was able to derive this result, his unfulfilled quest lay in the inverse process: what could one say about the probability of obtaining a head, in a single toss, given a finite number of observed outcomes? This turns out to be a straightforward problem for Bayesian inference as we see in the next section.
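As a rough numerical companion to this narrowing, the sketch below (Python, offered as an illustrative stand-in for the Mathematica box; all names are hypothetical) evaluates the binomial probability through logarithms so that n = 1000 does not underflow, and then sums the probability that r/n lies within 0.05 of 0.4 for each n.

```python
from math import exp, lgamma, log

def binomial_pmf(r, n, p):
    """Binomial probability of r successes in n trials, computed via logs."""
    log_coeff = lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)
    return exp(log_coeff + r * log(p) + (n - r) * log(1 - p))

def mass_near(n, p, half_width):
    """Total probability that the frequency r/n lies within half_width of p."""
    return sum(
        binomial_pmf(r, n, p)
        for r in range(n + 1)
        if abs(r / n - p) <= half_width + 1e-12   # small slack for rounding
    )

# Same case as the box: n = 10, p = 0.5, r = 4.
check_value = binomial_pmf(4, 10, 0.5)

# Fraction of the probability within 0.05 of p(head|I) = 0.4 as n grows.
masses = {n: mass_near(n, 0.4, 0.05) for n in (20, 100, 1000)}
for n, mass in masses.items():
    print(f"n = {n:4d}: P(|r/n - 0.4| <= 0.05) = {mass:.3f}")
```

The concentration of probability near 0.4 grows with n, which is the content of Equation (4.11) expressed for finite n.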
Figure 4.1 A numerical illustration of Bernoulli's law of large numbers. The PDF for the frequency of heads, r/n, in n tosses of a coin is shown for three different choices of n (n = 20, 100 and 1000). As n increases, the distribution narrows about the probability of a head in any single toss = 0.4. [Plot: probability density versus r/n.]

4.2.2 The gambler's coin problem
Let I ≡ "You have acquired a coin from a gambling table. You want to determine whether it is a biased coin from the results of tossing the coin many times. You specify the bias of the coin by a proposition H, representing the probability of a head occurring in any single toss. A priori, you assume that H can have any value in the range 0 → 1 with equal probability. You want to see how $p(H|D,I)$ evolves as a function of the number of tosses."
Let D ≡ "You toss the coin 50 times and record the following results: (a) 2 heads in the first 3 tosses, (b) 7 heads in the first 10 tosses, and (c) 33 heads in 50 tosses."
From the prior information, we determine that our hypothesis space H is continuous in the range 0 → 1. As usual, our starting point is Bayes' theorem:
$$p(H|D,I) = \frac{p(H|I)\,p(D|H,I)}{p(D|I)}. \tag{4.12}$$
Since we are assuming a uniform prior for $p(H|I)$, the action will all be in the likelihood term $p(D|H,I)$, which, in this case, is given by the binomial distribution:
$$p(r|n,I) = \frac{n!}{r!\,(n-r)!}\,H^r (1-H)^{n-r}. \tag{4.13}$$
Note: the symbol H is being employed in two different ways. In Equation (4.13), it is acting as an ordinary algebraic variable standing for possible numerical values in the range 0 to 1. When it appears as an argument of a probability or PDF, e.g., $p(H|D,I)$, it acts as a proposition (obeying the rules of Boolean algebra) and asserts that the true value lies in the numerical range H to H + dH.
Figure 4.2 shows the results from Equation (4.13) as a function of H in the range 0 → 1. From the figure, we can clearly see how the evolution of our state of knowledge of the coin translates into a progressively more sharply peaked posterior PDF. From this simple example, we can see how Bayes' theorem solves the inverse problem: find $p(H|D,I)$ given a finite number of observed outcomes represented by D.
Figure 4.2 The posterior PDF for the bias of a coin determined from: (a) 3 tosses, (b) 10 tosses, and (c) 50 tosses. [Plot titled "Weighted Coin": $p(H|D,I)$ versus H (probability of a head in one toss) for n = 3, 10 and 50.]
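The sharpening of the posterior with more tosses can be reproduced on a grid. This is a minimal sketch in Python (an illustrative stand-in for Mathematica), assuming the flat prior stated in I; the grid size and all names are hypothetical choices made for this example.

```python
def posterior_on_grid(r, n, num_points=1001):
    """Normalized p(H|D,I) on a uniform grid for a flat prior, r heads in n tosses."""
    grid = [i / (num_points - 1) for i in range(num_points)]
    # The binomial coefficient in Equation (4.13) is independent of H,
    # so it cancels when the posterior is normalized.
    unnorm = [h**r * (1 - h) ** (n - r) for h in grid]
    dh = grid[1] - grid[0]
    area = sum(unnorm) * dh            # Riemann sum; both end points are zero here
    return grid, [u / area for u in unnorm]

datasets = [(2, 3), (7, 10), (33, 50)]  # (heads, tosses) from D
summary = []
for r, n in datasets:
    grid, pdf = posterior_on_grid(r, n)
    peak = max(pdf)
    mode = grid[pdf.index(peak)]
    summary.append((n, mode, peak))
    print(f"n = {n:2d}: most probable H = {mode:.3f}, peak density = {peak:.2f}")
```

With a flat prior the most probable value is simply r/n, and the peak density grows as the data accumulate.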
4.2.3 Bayesian analysis of an opinion poll
Let I ≡ "A number of political parties are seeking election in British Columbia. The questions to be addressed are: (a) what is the fraction of decided voters that support the Liberals, and (b) what is the probability that the Liberals will achieve a majority of at least 51% in the upcoming election, assuming the poll will be representative of the population at the time of the election?"
Let D ≡ "In a poll of 800 decided voters, 18% supported the New Democratic Party versus 55% for the Liberals, 19% for Reform BC and 8% for other parties."
Let the proposition H ≡ "The fraction of the voters that will support the Liberals is between H and H + dH."
In this problem our hypothesis space of interest is continuous in the range 0 to 1, so $p(H|D,I)$ is a probability density function. Based only on the prior information as stated, we adopt a flat prior $p(H|I) = 1$. Let r = the number of respondents in the poll that support the Liberals. As far as this problem is concerned, there are only two outcomes of interest; a voter either will or will not vote for the Liberals. We can therefore use the binomial distribution to evaluate the likelihood function $p(D|H,I)$. Given a particular value of H, the binomial distribution gives the probability of obtaining D = r successes in n samples, where in this case, a success means support for the Liberals.
$$p(D|H,I) = \frac{n!}{r!\,(n-r)!}\,H^r (1-H)^{n-r}. \tag{4.14}$$
In this problem n = 800, and r = 440. From Bayes' theorem we can write
$$p(H|D,I) = \frac{p(H|I)\,p(D|H,I)}{p(D|I)} = \frac{p(D|H,I)}{p(D|I)} = \frac{p(D|H,I)}{\int_0^1 dH\, p(D|H,I)}. \tag{4.15}$$
Figure 4.3 The posterior PDF for H, the fraction of voters in the province supporting the Liberals based on polls of size n = 100, 200, 800 decided voters. [Plot: probability density versus fraction of voters supporting the Liberals.]
Figure 4.3 shows a graph of the posterior probability of H for a variety of poll sizes including n = 800. The 95% credible region[2] for H is $55^{+3.4}_{-3.5}\%$. A frequentist interpretation of the same poll would express the uncertainty in the fraction of decided voters supporting the Liberals in the following way: "The poll of 800 people claims an accuracy of ±3.5%, 19 times out of 20." We will see why when we deal with frequentist confidence intervals in Section 6.6.
The second question, concerning the probability that the Liberals will achieve a majority of at least 51% of the vote, is addressed as a model selection problem. The two models are:
1. Model $M_1$ ≡ "the Liberals will achieve a majority." The parameter of the model is H, which is assumed to have a uniform prior in the range $0.51 \le H \le 1.0$.
2. Model $M_2$ ≡ "the Liberals will not achieve a majority." The parameter of the model is H, which is assumed to have a uniform prior in the range $0 \le H < 0.51$.

From Equation (3.14) we can write
$$\text{odds} = O_{12} = \frac{p(M_1|I)}{p(M_2|I)}\,B_{12}, \tag{4.16}$$
where
$$B_{12} = \frac{p(D|M_1,I)}{p(D|M_2,I)} = \frac{\int_{H=0.51}^{1} dH\, p(H|M_1,I)\,p(D|M_1,H,I)}{\int_{H=0}^{0.51} dH\, p(H|M_2,I)\,p(D|M_2,H,I)} = \frac{\int_{H=0.51}^{1} dH\,(1/0.49)\,p(D|M_1,H,I)}{\int_{H=0}^{0.51} dH\,(1/0.51)\,p(D|M_2,H,I)} = 87.68. \tag{4.17}$$
Based on I, we have no prior reason to prefer $M_1$ over $M_2$, so $O_{12} = B_{12}$. The probability that the Liberal party will win a majority is then given by (see Equation (3.18))
$$p(M_1|D,I) = \frac{1}{1 + 1/O_{12}} = 0.989. \tag{4.18}$$

[2] Note: a Bayesian credible region is not the same as a frequentist confidence interval. For a uniform prior for H the 95% confidence interval has essentially the same value as the 95% credible region, but the interpretation is very different. The recipe for computing a credible region was given at the end of Section 3.3.
Again, we emphasize that our conclusions are conditional on the assumed prior information, which includes the assumption that the poll will be representative of the population at the time of the election. Now that we have set up the equations to answer the questions posed above, it is a simple exercise to recompute the answers assuming different prior information, e.g., suppose the prior lower bound on H were 0.4 instead of 0.
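To make such recomputation concrete, here is a minimal grid-based sketch in Python (an illustrative stand-in for Mathematica). It assumes the flat prior and the poll numbers quoted above; the grid resolution and all names are hypothetical. The credible region follows the sort-and-accumulate recipe referred to in Section 3.3.

```python
from math import exp, log

n_polled, r_liberal = 800, 440

def log_likelihood(h):
    """Log of H^r (1-H)^(n-r); the binomial coefficient cancels in all ratios."""
    return r_liberal * log(h) + (n_polled - r_liberal) * log(1.0 - h)

# Midpoint grid on (0, 1), which avoids log(0) at the end points.
num_cells = 20000
dh = 1.0 / num_cells
grid = [(i + 0.5) * dh for i in range(num_cells)]
log_peak = log_likelihood(r_liberal / n_polled)
like = [exp(log_likelihood(h) - log_peak) for h in grid]  # rescaled against underflow
total = sum(like)
cell_prob = [v / total for v in like]   # posterior probability per cell (flat prior)

# 95% credible region: add cells in descending order of probability.
order = sorted(range(num_cells), key=lambda i: cell_prob[i], reverse=True)
accumulated, lower, upper = 0.0, 1.0, 0.0
for i in order:
    accumulated += cell_prob[i]
    lower, upper = min(lower, grid[i]), max(upper, grid[i])
    if accumulated >= 0.95:
        break

# Bayes factor, Equation (4.17), with uniform priors 1/0.49 (M1) and 1/0.51 (M2).
mass_above = sum(p for h, p in zip(grid, cell_prob) if h >= 0.51)
mass_below = 1.0 - mass_above
bayes_factor = (mass_above / 0.49) / (mass_below / 0.51)
prob_majority = 1.0 / (1.0 + 1.0 / bayes_factor)   # Equation (4.18), O12 = B12

print(f"95% credible region: {lower:.3f} to {upper:.3f}")
print(f"B12 = {bayes_factor:.1f}, p(M1|D,I) = {prob_majority:.3f}")
```

To explore a different prior, such as a lower bound of 0.4 on H, one would restrict the sums to the allowed cells and replace 0.51 in the M2 prior width accordingly.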
4.3 Multinomial distribution
When we throw a six-sided die there are six possible outcomes. This motivates the following question: Is there a generalization of the binomial distribution for the case where we have more than two possible outcomes? Again we can use probability theory as extended logic to derive the appropriate distribution starting from a statement of our prior information.
I ≡ "Proposition E represents an event that is repeated many times and has m possible outcomes represented by propositions, $O_1, O_2, \ldots, O_m$. The outcomes of individual events are logically independent, i.e., the probability of getting an outcome $O_i$ in event j is independent of what outcome occurred in any other event."
$E = O_1 + O_2 + O_3 + \cdots + O_m$, then for the event E repeated n times:
$$E^n = (O_1 + O_2 + \cdots + O_m)^n.$$
The probability of any particular $E^n$ having
$O_1$ occurring $n_1$ times,
$O_2$ occurring $n_2$ times,
...
$O_m$ occurring $n_m$ times
is $p(E^n|I) = p(O_1|I)^{n_1}\,p(O_2|I)^{n_2} \cdots p(O_m|I)^{n_m}$.
Next we need to find the number of sequences having the same number of $O_1, O_2, \ldots, O_m$ (multiplicity) independent of the order. We can readily guess at the form of multiplicity by rewriting Equation (4.7), setting the denominator $r!\,(n-r)! = n_1!\,n_2!$:
$$\text{multiplicity for the two-outcome case} = \frac{n!}{r!\,(n-r)!} = \frac{n!}{n_1!\,n_2!}, \tag{4.19}$$
where $n_1$ stands for the number of $Q$'s and $n_2$ for the number of $\bar{Q}$'s. Now in the current problem, we have m possible outcomes for each event, so
$$\text{multiplicity for the m-outcome case} = \frac{n!}{n_1!\,n_2! \cdots n_m!}, \tag{4.20}$$
where $n = \sum_{i=1}^{m} n_i$. Therefore, the probability of seeing the outcome defined by $n_1, n_2, \ldots, n_m$, where $n_i$ ≡ "Outcome $O_i$ occurred $n_i$ times", is
$$p(n_1, n_2, \ldots, n_m|E^n, I) = \frac{n!}{n_1!\,n_2! \cdots n_m!} \prod_{i=1}^{m} p(O_i|I)^{n_i}. \tag{4.21}$$
This is called the multinomial distribution. Compare this with the multinomial expansion:
$$(x_1 + x_2 + \cdots + x_m)^n = \sum \frac{n!}{n_1!\,n_2! \cdots n_m!}\,x_1^{n_1} x_2^{n_2} \cdots x_m^{n_m}, \tag{4.22}$$
where the sum is taken over all possible values of $n_i$, subject to the constraint that $\sum_{i=1}^{m} n_i = n$.
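A short sketch of Equation (4.21) in Python (an illustrative stand-in for Mathematica; the function name and the die example are assumptions made for this illustration) shows the multiplicity and the product of outcome probabilities working together, and confirms that the two-outcome case reduces to the binomial distribution.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability that outcome i occurs counts[i] times, Equation (4.21)."""
    n = sum(counts)
    multiplicity = factorial(n)
    for n_i in counts:
        multiplicity //= factorial(n_i)   # exact: each partial quotient is an integer
    return multiplicity * prod(p**n_i for p, n_i in zip(probs, counts))

# A fair six-sided die thrown six times: every face appears exactly once.
p_each_face_once = multinomial_pmf([1] * 6, [1 / 6] * 6)

# Two outcomes: 3 heads and 2 tails with a fair coin, the binomial case.
p_three_heads = multinomial_pmf([3, 2], [0.5, 0.5])

print(f"each face once: {p_each_face_once:.5f}")
print(f"3 heads in 5 tosses: {p_three_heads:.4f}")
```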
4.4 Can you really answer that question?
Let I ≡ "A tin contains N buttons, identical in all respects except that M are black and the remainder are white."
What is the probability that you will pick a black button on the first draw assuming you are blindfolded? The answer is clearly M/N. What is the probability that you will pick a black button on the second draw if you know that a black button was picked on the first and not put back in the tin (sampling without replacement)?
Let $B_i$ ≡ "A black button was picked on the ith draw."
Let $W_i$ ≡ "A white button was picked on the ith draw."
Then
$$p(B_2|B_1,I) = \frac{M-1}{N-1},$$
because for the second draw there is one less black button and one less button in total.
Now, what is the probability of picking a black button on the second draw, $p(B_2|I)$, when we are not told what color was picked on the first draw? In this case the answer might appear to be indeterminate, but as we shall show, questions of this kind can be answered using probability theory as extended logic. We know that either $B_1$ or $W_1$ is true, which can be expressed as the Boolean equation $B_1 + W_1 = 1$. Thus we can write:
$$B_2 = (B_1 + W_1), B_2 = B_1, B_2 + W_1, B_2.$$
But according to Jaynes consistency (see Section 2.5.1), equivalent states of knowledge must be represented by equivalent plausibility assignments. Therefore
$$p(B_2|I) = p(B_1,B_2|I) + p(W_1,B_2|I) = p(B_1|I)\,p(B_2|B_1,I) + p(W_1|I)\,p(B_2|W_1,I) = \frac{M}{N}\,\frac{M-1}{N-1} + \frac{N-M}{N}\,\frac{M}{N-1} = \frac{M}{N}. \tag{4.23}$$
In like fashion, we can show
$$p(B_3|I) = \frac{M}{N}.$$
The probability of black at any draw, if we do not know the result of any other draw, is always the same. The method used to obtain this result is very useful.
1. Resolve the quantity whose probability is wanted into mutually exclusive sub-propositions:[3]
   $B_3 = (B_1 + W_1), (B_2 + W_2), B_3 = B_1,B_2,B_3 + B_1,W_2,B_3 + W_1,B_2,B_3 + W_1,W_2,B_3.$
2. Apply the sum rule.
3. Apply the product rule.
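The three steps above can be sketched with exact rational arithmetic. This Python example (an illustrative stand-in for Mathematica; the function name and the tin of 10 buttons with 4 black are hypothetical choices) enumerates every colour sequence for the earlier draws, applies the product rule along each sequence, and adds the results with the sum rule.

```python
from fractions import Fraction
from itertools import product

def prob_black_on_draw(k, n_black, n_total):
    """p(B_k|I) with earlier draws unknown, summed over all earlier colour sequences."""
    total = Fraction(0)
    for earlier in product("BW", repeat=k - 1):       # step 1: exclusive sub-propositions
        black, buttons = n_black, n_total
        p_seq = Fraction(1)
        for colour in earlier + ("B",):               # step 3: product rule, draw by draw
            remaining = black if colour == "B" else buttons - black
            p_seq *= Fraction(remaining, buttons)
            if colour == "B":
                black -= 1
            buttons -= 1
        total += p_seq                                # step 2: sum rule
    return total

# Hypothetical tin: N = 10 buttons, M = 4 black.
results = {k: prob_black_on_draw(k, 4, 10) for k in (1, 2, 3)}
for k, p in results.items():
    print(f"p(B{k}|I) = {p}")
```

Each result equals M/N, in agreement with Equation (4.23) and its extension to the third draw.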
If the sub-propositions are well chosen (i.e., they have a simple meaning in the context of the problem), their probabilities are often calculable.

[3] In his book, Rational Descriptions, Decisions and Designs, M. Tribus refers to this technique as extending the conversation. In many problems, there are many pieces of information which do not seem to fit together in any simple mathematical formulation. The technique of extending the conversation provides a formal method for introducing this information into the calculation of the desired probability.

While we are on the topic of sampling without replacement, let's introduce the hypergeometric distribution (see Jaynes, 2003). This gives the probability of drawing r black buttons (blindfolded) in n tries from a tin containing N buttons, identical in all respects except that M are black and the remainder are white.
$$p(r|N,M,n) = \frac{\dbinom{M}{r}\dbinom{N-M}{n-r}}{\dbinom{N}{n}}, \tag{4.24}$$
where
$$\binom{M}{r} = \frac{M!}{r!\,(M-r)!}, \text{ etc.} \tag{4.25}$$
Box 4.2 Mathematica evaluation of hypergeometric distribution:
Needs["Statistics`DiscreteDistributions`"]
PDF[HypergeometricDistribution[n, nsucc, ntot], r]
gives the probability of r successes in n trials corresponding to sampling without replacement from a population of size ntot with nsucc potential successes.
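For readers without the Mathematica package, Equation (4.24) is easy to evaluate directly. The following Python sketch is an illustrative stand-in; the function name, argument order and the example tins are assumptions made here, not the book's notation.

```python
from math import comb

def hypergeometric_pmf(r, n_total, n_black, n_draws):
    """p(r|N,M,n): r black buttons in n draws without replacement, Equation (4.24)."""
    return comb(n_black, r) * comb(n_total - n_black, n_draws - r) / comb(n_total, n_draws)

# Hypothetical tin: N = 10 buttons, M = 4 black, n = 3 draws.
small_tin = [hypergeometric_pmf(r, 10, 4, 3) for r in range(4)]
print([round(p, 4) for p in small_tin])

# With a much larger tin and the same black fraction, the result approaches
# the binomial probability of 1 success in 3 trials with p = 0.4.
large_tin = hypergeometric_pmf(1, 10_000, 4_000, 3)
binomial_limit = comb(3, 1) * 0.4 * 0.6**2
print(f"large tin: {large_tin:.4f}, binomial: {binomial_limit:.4f}")
```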
4.5 Logical versus causal connections
We now need to clear up an important distinction between a logical connection between two propositions and a causal connection. In the previous problem with M black buttons and N − M white buttons, it is clear that $p(B_j|B_{j-1},I) < p(B_j|I)$ since we know there is one less black button in the tin when we take our next pick. Clearly, what was drawn on earlier draws can affect what will happen in later draws. We can say there is some kind of partial causal influence of $B_{j-1}$ on $B_j$. Now suppose we ask the question: what is the probability $p(B_{j-1}|B_j,I)$? Clearly in this case what we get on a later draw can have no effect on what occurs on an earlier draw, so it may be surprising to learn that $p(B_{j-1}|B_j,I) = p(B_j|B_{j-1},I)$. Consider the following simple proof (Jaynes, 2003). From the product rule we write
$$p(B_{j-1},B_j|I) = p(B_{j-1}|B_j,I)\,p(B_j|I) = p(B_j|B_{j-1},I)\,p(B_{j-1}|I).$$
But we have just seen that $p(B_j|I) = p(B_{j-1}|I) = M/N$ for all j, so
$$p(B_{j-1}|B_j,I) = p(B_j|B_{j-1},I), \tag{4.26}$$
or more generally,
$$p(B_k|B_j,I) = p(B_j|B_k,I), \quad \text{for all } j, k. \tag{4.27}$$
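Equation (4.27) can be checked by brute force for a small tin. The sketch below is in Python (an illustrative stand-in for Mathematica); the function name, the zero-based draw indices and the tin of 3 black and 2 white buttons are hypothetical. It counts equally likely orderings of the buttons rather than using the product rule.

```python
from fractions import Fraction
from itertools import permutations

def conditional_black(target, given, n_black, n_white):
    """p(B_target | B_given, I) by enumerating every equally likely draw order."""
    buttons = "B" * n_black + "W" * n_white
    both = given_count = 0
    for order in permutations(buttons):   # positions are 0-based draw numbers
        if order[given] == "B":
            given_count += 1
            both += order[target] == "B"
    return Fraction(both, given_count)

# Hypothetical tin: M = 3 black and N - M = 2 white buttons.
later_given_earlier = conditional_black(target=1, given=0, n_black=3, n_white=2)
earlier_given_later = conditional_black(target=0, given=1, n_black=3, n_white=2)
print(later_given_earlier, earlier_given_later)
```

Both values equal (M − 1)/(N − 1), even though only the first corresponds to a physical influence of one draw on the next.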
How can information about a later draw affect the probability of an earlier draw? Recall that in Bayesian analysis, probabilities are an encoding of our state of knowledge about some question. Performing the later draw does not physically affect the number $M_j$ of black buttons in the tin at the jth draw. However, information about the result of a later draw has the same effect on our state of knowledge about what could have been taken on the jth draw, as does information about an earlier draw. Bayesian probability theory is concerned with all logical connections between propositions independent of whether there are causal connections.

Example 1:
I ≡ "A shooting has occurred and the police arrest a suspect on the same day."
A ≡ "Suspect is guilty of shooting."
B ≡ "A gun is found seven days after the shooting with suspect's fingerprints on it."
Clearly, B is not a partial cause of A but still we conclude that $p(A|B,I) > p(A|I)$.

Example 2:
I ≡ "A virulent virus invades Montreal. Anyone infected loses their hair a month before dying."
A ≡ "The mayor of Montreal lost his hair in September."
B ≡ "The mayor of Montreal died in October."
Again, in this case, $p(A|B,I) > p(A|I)$.

Although a logical connection does not imply a causal connection, a causal connection does imply a logical connection, so we can certainly use probability theory to address possible causal connections.
4.6 Exchangeable distributions
In the previous section, we learned that information about the result of a later draw has the same effect on our state of knowledge about what could have been taken on the jth draw, as does information about an earlier one. Every draw has the same relevance to every other draw regardless of their time order. For example, $p(B_j|B_{j-1},B_{j-2},I) = p(B_j|B_{j+1},B_{j+2},I)$, where again $B_j$ is the proposition asserting a black button on the jth draw. The only thing that is significant about the knowledge of outcomes of other draws is the number of black or white buttons in these draws, not their time order. Probability distributions of this kind are called exchangeable distributions. It is clear that the hypergeometric distribution is exchangeable since for $p(r|N,M,n)$ we are not required to specify the exact sequence of the r black button outcomes. The hypergeometric distribution takes into account the changing contents of the tin. The result of any draw changes the probability of a black on any other draw. If the number, N, of buttons in the tin is much larger than the number of draws n, then this probability changes very little. In the limit as $N \to \infty$, the hypergeometric distribution simplifies to the binomial distribution, another exchangeable distribution.
The multinomial distribution, discussed in Section 4.3, can be viewed as a generalization of the binomial distribution to the case where we have m possible outcomes, not just two. From its form given in Equation (4.21), which we repeat here,
$$p(n_1, n_2, \ldots, n_m|E^n, I) = \frac{n!}{n_1!\,n_2! \cdots n_m!} \prod_{i=1}^{m} p(O_i|I)^{n_i},$$
we can see that this is another exchangeable distribution because the probability depends only on the numbers of different outcomes $(n_1, n_2, \ldots, n_m)$ observed and not on their order.
Worked example: A spacecraft carrying two female and three male astronauts makes a trip to Mars. The plan calls for three of the astronauts to board a detachable capsule to land on the planet, while the other two remain behind in orbit. Which three will board the capsule is decided by a lottery, consisting of picking names from a box. The first person selected is to be the captain of the capsule. The second and third names selected become capsule support crew. What is the probability that the captain is female if we know that at least one of the support crew members is female?
Let $F_i$ stand for the proposition that the ith name selected is female, and $M_i$ if the person is male.
Let $F_{\text{later}}$ ≡ "We learn that at least one of the crew members is female."
$$F_{\text{later}} = F_2 + F_3.$$
This information reduces the number of females available for the first draw by at least one. To solve the problem we will make use of Bayes' theorem and abbreviate $F_{\text{later}}$ by $F_L$.
$$p(F_1|F_L,I) = \frac{p(F_1|I)\,p(F_L|F_1,I)}{p(F_L|I)}. \tag{4.28}$$
To evaluate two of the terms on the right, it will be convenient to work with denials of $F_L$. From the sum rule, $p(F_L|F_1,I) = 1 - p(\bar{F}_L|F_1,I)$. Since $F_L = F_2 + F_3$, we have that $\bar{F}_L = \bar{F}_2, \bar{F}_3 = M_2, M_3$, according to the duality identity of Boolean algebra (Section 2.2.3). In words, the denial of at least one female in draws 2 and 3 is a male on both draws. Therefore,
$$p(F_L|F_1,I) = 1 - p(M_2,M_3|F_1,I) = 1 - p(M_2|F_1,I)\,p(M_3|M_2,F_1,I) = 1 - \frac{3}{4} \cdot \frac{2}{3} = \frac{1}{2}. \tag{4.29}$$
Similarly, we can write $p(F_L|I) = 1 - p(M_2,M_3|I)$. By exchangeability, $p(M_2,M_3|I)$ is the same as the probability of a male on the first two draws given only the conditional information I, i.e., not $F_1, I$. Therefore,
$$p(F_L|I) = 1 - p(M_2,M_3|I) = 1 - p(M_1,M_2|I) = 1 - p(M_1|I)\,p(M_2|M_1,I) = 1 - \frac{3}{5} \cdot \frac{2}{4} = \frac{7}{10}. \tag{4.30}$$
Substituting Equations (4.29) and (4.30) into Equation (4.28), we obtain
$$p(F_1|F_L,I) = \frac{\frac{2}{5} \cdot \frac{1}{2}}{\frac{7}{10}} = \frac{2}{7}. \tag{4.31}$$
The property of exchangeability has allowed us to evaluate the desired probability in a circumstance where we were given less precise information, namely, a female will be picked at least once on the second and third draws. Note: the result for $p(F_1|F_L,I)$ is different from these two cases:
$$p(F_1|I) = \frac{2}{5} \quad \text{and} \quad p(F_1|F_2,I) = \frac{2-1}{5-1} = \frac{1}{4}.$$
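The three probabilities in this worked example can be confirmed by direct counting. The sketch below uses Python (an illustrative stand-in for Mathematica); the helper names are hypothetical. It lists all ordered selections of three names from the crew of five, each equally likely under I.

```python
from fractions import Fraction
from itertools import permutations

crew = ["F", "F", "M", "M", "M"]
draws = list(permutations(crew, 3))   # 60 equally likely ordered selections

def prob(event, given=lambda d: True):
    """Probability of `event` given `given`, by counting selections."""
    allowed = [d for d in draws if given(d)]
    return Fraction(sum(1 for d in allowed if event(d)), len(allowed))

def captain_female(d):
    return d[0] == "F"

def support_has_female(d):
    return "F" in d[1:]               # F_L = F2 + F3

def second_female(d):
    return d[1] == "F"

p_prior = prob(captain_female)                             # p(F1|I)
p_given_later = prob(captain_female, support_has_female)   # p(F1|F_L,I)
p_given_second = prob(captain_female, second_female)       # p(F1|F2,I)
print(p_prior, p_given_later, p_given_second)
```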
4.7 Poisson distribution
In this section[4] we will see how a particular state of prior information, I, leads us to choose the well-known Poisson distribution for the likelihood. Later, in Section 5.7.2, we will derive the Poisson distribution as a limiting "low count rate" approximation to the binomial distribution.
Prior information: I ≡ "There is a positive real number r such that, given r, the probability that an event, or count, will occur in the time interval $(t, t+dt)$ is $= r\,dt$. Furthermore, knowledge of r makes any information about the occurrence or non-occurrence of the event in any other time interval (that does not include $(t, t+dt)$) irrelevant to this probability."
Let $q(t)$ = probability of no count in time interval $(0,t)$.
Let E ≡ "no count in $(0, t+dt)$". E is the conjunction of two propositions A and B given by
E = ["no count in $(0,t)$"], ["no count in $(t, t+dt)$"] = A, B.
From the product rule, $p(E|I) = p(A,B|I) = p(A|I)\,p(B|A,I)$. It follows that
$$p(E|I) = q(t+dt) = q(t)\,(1 - r\,dt) \quad \text{or} \quad \frac{dq}{dt} = -r\,q(t).$$

[4] Section 4.7 is based on a paper by E. T. Jaynes (1989).
The solution for the evident initial condition $q(0) = 1$ is $q(t) = \exp(-rt)$. Now consider the probability of the proposition:
C ≡ "In the interval $(0,t)$, there are exactly n counts which happen at times $(t_1, t_2, \ldots, t_n)$ with infinitesimal tolerances $(dt_1, \ldots, dt_n)$, where $(0 < t_1 < t_2 \ldots < t_n < t)$."
This is the conjunction of 2n + 1 propositions
C = ["no count in $(0, t_1)$"], ("count in $dt_1$"), ["no count in $(t_1, t_2)$"], ("count in $dt_2$"), ..., ["no count in $(t_{n-1}, t_n)$"], ("count in $dt_n$"), ["no count in $(t_n, t)$"].
By the product rule and the independence of different time intervals,
$$p(C|r,I) = \exp(-r t_1)\,(r\,dt_1)\,\exp(-r(t_2 - t_1))\,(r\,dt_2) \cdots \exp(-r(t_n - t_{n-1}))\,(r\,dt_n)\,\exp(-r(t - t_n)) = \exp(-rt)\,r^n\,dt_1 \cdots dt_n. \tag{4.32}$$
The probability (given r) that in the interval $(0,t)$ there are exactly n counts, whatever the times, is given by
$$\begin{aligned}
p(n|r,t,I) &= \exp(-rt)\,r^n \int_0^t dt_n \cdots \int_0^{t_4} dt_3 \int_0^{t_3} dt_2 \int_0^{t_2} dt_1 \\
&= \exp(-rt)\,r^n \int_0^t dt_n \cdots \int_0^{t_4} dt_3 \int_0^{t_3} dt_2\, \frac{t_2}{1!} \\
&= \exp(-rt)\,r^n \int_0^t dt_n \cdots \int_0^{t_4} dt_3\, \frac{t_3^2}{2!} \\
&= \exp(-rt)\,\frac{(rt)^n}{n!} \qquad \text{Poisson distribution.}
\end{aligned} \tag{4.33}$$
We will return to the Poisson distribution again in Chapter 5. Some sample Poisson distributions are shown in Figure 5.6, and its relationship to the binomial and Gaussian distributions is discussed in Section 5.7.2, together with some typical examples. Chapter 14 is devoted to Bayesian inference with Poisson sampling.
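Two steps of the derivation above lend themselves to a quick numerical check. The Python sketch below (an illustrative stand-in for Mathematica; the rate, interval and step count are arbitrary assumptions) steps the relation q(t + dt) = q(t)(1 − r dt) forward from q(0) = 1 and compares the result with exp(−rt), then confirms that Equation (4.33) is normalized with mean rt.

```python
from math import exp, factorial

def poisson_pmf(n, rate, t):
    """p(n|r,t,I) = exp(-rt) (rt)^n / n!, Equation (4.33)."""
    return exp(-rate * t) * (rate * t) ** n / factorial(n)

rate, t = 2.0, 3.0   # hypothetical count rate (per second) and interval (seconds)

# Step q(t + dt) = q(t)(1 - r dt) forward from the initial condition q(0) = 1.
steps = 100_000
dt = t / steps
q = 1.0
for _ in range(steps):
    q *= 1.0 - rate * dt

total = sum(poisson_pmf(n, rate, t) for n in range(100))
mean = sum(n * poisson_pmf(n, rate, t) for n in range(100))

print(f"stepped q(t) = {q:.6f}, exp(-rt) = {exp(-rate * t):.6f}")
print(f"sum over n = {total:.6f}, mean count = {mean:.4f}")
```

The stepped value approaches exp(−rt) as dt shrinks, which is the n = 0 case of the Poisson distribution.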
4.7.1 Bayesian and frequentist comparison
Let's use the Poisson distribution to clarify a fundamental difference between the Bayesian and frequentist approaches to inference. Consider how the probability of $n_1$ counts in a time interval $(0, t_1)$ changes if we learn that $n_2$ counts occurred in the interval $(0, t_2)$ where $t_2 > t_1$. According to I, the occurrence or non-occurrence of counts in any other time intervals that do not include the interval $(0, t_1)$ is irrelevant to the probability of interest. Since $(0, t_2)$ contains $(0, t_1)$ it is contributing information which we can incorporate through Bayes' theorem, which we write now.
$$p(n_1|n_2,r,t_1,t_2,I) = \frac{p(n_1|r,t_1,t_2,I)\,p(n_2|n_1,r,t_1,t_2,I)}{p(n_2|r,t_1,t_2,I)} = \frac{p(n_1|r,t_1,I)\,p(n_1,(n_2-n_1)|n_1,r,t_1,t_2,I)}{p(n_2|r,t_2,I)}. \tag{4.34}$$
Using the product rule, we can expand the second term in the numerator of Equation (4.34):
$$\begin{aligned}
p(n_1,(n_2-n_1)|n_1,r,t_1,t_2,I) &= p(n_1|n_1,r,t_1,t_2,I)\,p(n_2-n_1|n_1,r,t_1,t_2,I) \\
&= 1 \times p(n_2-n_1|n_1,r,t_1,t_2,I) \\
&= \exp[-r(t_2-t_1)]\,\frac{[r(t_2-t_1)]^{n_2-n_1}}{(n_2-n_1)!}.
\end{aligned} \tag{4.35}$$
The other terms in Equation (4.34) can readily be evaluated by reference to Equation (4.33). Substituting into Equation (4.34) and simplifying, we obtain
$$\begin{aligned}
p(n_1|n_2,r,t_1,t_2,I) &= \frac{n_2!}{n_1!\,(n_2-n_1)!}\,\frac{\exp[-rt_1]\,\exp[-r(t_2-t_1)]}{\exp[-rt_2]}\,\frac{[rt_1]^{n_1}\,[r(t_2-t_1)]^{n_2-n_1}}{[rt_2]^{n_2}} \\
&= \binom{n_2}{n_1}\left(\frac{t_1}{t_2}\right)^{n_1}\left(1 - \frac{t_1}{t_2}\right)^{n_2-n_1}, \qquad t_1 < t_2,\; n_1 < n_2.
\end{aligned} \tag{4.36}$$
The result is rather surprising because the new posterior does not even depend on r. The point is that r does not determine n1 ; it only gives probabilities for different values of n1 . If we know the actual value over the interval that includes ð0; t1 Þ, then this takes precedence over anything we could infer from r.
88
Assigning probabilities
In frequentist random variable probability theory, one might think that $r$ is the sole relevant quantity, and thus arrive at a different conclusion, namely,

$$p(n_1|n_2, r, t_1, t_2, I) = p(n_1|r, t_1, I) = \frac{(r t_1)^{n_1}}{n_1!} \exp(-r t_1). \tag{4.37}$$

What if we used the measured $n_2$ counts in the time interval $t_2$ to compute a new estimate of the rate, $r' = n_2/t_2$, and then used Equation (4.37) to compute $p(n_1|r', t_1, I)$? Would we get the same result as predicted by Equation (4.36)? The two distributions are compared in Figure 4.4 for $n_2 = 10$ counts, $t_2 = 10$ s and $t_1 = 8$ s. The two curves are clearly very different. In addition, the probability distribution given by Equation (4.37) predicts a tail extending well beyond 10 counts, which makes no physical sense given that we know only 10 counts occurred in the longer interval $t_2$, which contains $t_1$. From the frequentist point of view, replacing $r$ by $r'$ would make little sense regarding long-run performance if the original $r$ were estimated on the basis of the counts in a much longer time span than $t_2$. However, for any non-zero value of $r$, Equation (4.37) predicts a non-zero probability that $n_1$ exceeds the actual measured value $n_2$ in the larger interval, which is clearly impossible.

Figure 4.4 A comparison of the predictions for $p(n_1|n_2, r, t_1, t_2, I)$ based on Equations (4.36) and (4.37), where we set $r' = n_2/t_2$. The assumed values are $t_1 = 8$ s, $t_2 = 10$ s and $n_2 = 10$ counts. [Plot of probability versus $n_1$. Curve labels: $\frac{n_2!}{n_1!\,(n_2-n_1)!}\left(\frac{t_1}{t_2}\right)^{n_1}\left(1-\frac{t_1}{t_2}\right)^{n_2-n_1}$ and $\exp[-r' t_1]\,(r' t_1)^{n_1}/n_1!$ with $r' = n_2/t_2$.]

In frequentist theory, a probability represents the fraction of the time that something will happen in a very large number of identical repeats of an experiment, i.e., the long-run relative frequency. As we will learn in Section 6.6 and Chapter 7, frequentist theory says nothing directly about the probability of any estimate derived from a single data set. The significance of any frequentist result can only be interpreted with reference to a population of hypothetical data sets. From this point of view, the frequentist procedure represented by Equation (4.37) is not intended to be optimum in the individual case that we are considering here. In contrast, Bayesian probability theory does apply to the individual case, where the goal is to reason as best we can on the basis of our current state of information. In a Bayesian analysis, only the data that were actually measured, combined with relevant prior information, are considered; hypothetical data sets play no role.
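The comparison shown in Figure 4.4 can be reproduced numerically. The following is a minimal Python sketch rather than the book's Mathematica notebook; it assumes that Equation (4.36) is the binomial form that appears in the figure legend, and the function names are hypothetical.

```python
import math

def binomial_prediction(n1, n2, t1, t2):
    """Equation (4.36), assumed here to be the binomial form in the Figure 4.4 legend."""
    f = t1 / t2  # fraction of the long interval covered by the short one
    return math.comb(n2, n1) * f**n1 * (1.0 - f) ** (n2 - n1)

def poisson_prediction(n1, rate, t1):
    """Equation (4.37): Poisson probability of n1 counts in time t1 at the given rate."""
    mean_counts = rate * t1
    return math.exp(-mean_counts) * mean_counts**n1 / math.factorial(n1)

n2, t2, t1 = 10, 10.0, 8.0   # values used in Figure 4.4
r_prime = n2 / t2            # rate re-estimated from the measured counts

binomial = [binomial_prediction(k, n2, t1, t2) for k in range(n2 + 1)]
poisson = [poisson_prediction(k, r_prime, t1) for k in range(n2 + 1)]

# Probability that Equation (4.37) assigns to the impossible outcome n1 > n2.
poisson_excess = 1.0 - sum(poisson)

for k in range(n2 + 1):
    print(f"n1={k:2d}  Eq.(4.36)={binomial[k]:.4f}  Eq.(4.37)={poisson[k]:.4f}")
print(f"Eq.(4.37) probability of n1 > n2: {poisson_excess:.4f}")
```

With these values the binomial probabilities sum to one over $n_1 = 0, \ldots, n_2$, whereas the Poisson form leaves roughly 18% of its probability on values of $n_1$ larger than $n_2$.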
4.8 Constructing likelihood functions

In this section, we elaborate on the process of arriving at the likelihood function, $p(D|M, \theta, I)$, for use in a Bayesian analysis, where

$p(D|M, \theta, I)$ = probability of obtaining data $D$, if model $M$ and background (prior) information $I$ are true (also called the likelihood function $\mathcal{L}(M)$).

The parameters of model $M$ are collectively designated by the symbol $\theta$. We can write $D = Y_1, Y_2, \ldots, Y_N = \{Y_i\}$, where

* $Y_i \equiv$ "A proposition asserting that the $i$th data value is in the infinitesimal range $y_i$ to $y_i + dy_i$."
* $Z_i \equiv$ "A proposition asserting that the model $M$ prediction for the $i$th data value is in the range $z_i$ to $z_i + dz_i$."
* $E_i \equiv$ "A proposition asserting that the $i$th error value is in the range $e_i$ to $e_i + de_i$."

As usual, we can write

$$y_i = z_i + e_i. \tag{4.38}$$

In the simplest case (see Section 4.8.1) the predicted value, $z_i$, is given by a deterministic model, $m(x_i|\theta)$, which is a function of some independent variable(s) $x_i$, like position or time. More generally, the value of $z_i$ itself may be uncertain because of statistical uncertainties in $m(x_i|\theta)$, and/or uncertainties in the value of the independent variable(s) $x_i$. We will represent the probability distribution for proposition $Z_i$ by the function

$$p(Z_i|M, \theta, I) = f_Z(z_i). \tag{4.39}$$

We can also represent the probability distribution for proposition $E_i$ by another function given by

$$p(E_i|M, \theta, I) = f_E(e_i). \tag{4.40}$$
Our next step is to compute $p(Y_i|M, \theta, I)$. Now $Y_i$ depends on propositions $Z_i$ and $E_i$. To evaluate $p(Y_i|M, \theta, I)$, we first extend the conversation (Tribus, 1969) to include these propositions by writing down the joint probability distribution $p(Y_i, Z_i, E_i|M, \theta, I)$. We can then solve for $p(Y_i|M, \theta, I)$ by using the marginalizing operation as follows:

$$\begin{aligned} p(Y_i|M, \theta, I) &= \iint dZ_i\, dE_i\; p(Y_i, Z_i, E_i|M, \theta, I) \\ &= \iint dZ_i\, dE_i\; p(Z_i|M, \theta, I)\, p(E_i|M, \theta, I)\, p(Y_i|Z_i, E_i, M, \theta, I), \end{aligned} \tag{4.41}$$

where we assume $Z_i$ and $E_i$ are independent. Since $y_i = z_i + e_i$,

$$p(Y_i|Z_i, E_i, M, \theta, I) = \delta(y_i - z_i - e_i) \tag{4.42}$$

$$\rightarrow\; p(Y_i|M, \theta, I) = \int dz_i\, f_Z(z_i) \int de_i\, f_E(e_i)\, \delta(y_i - z_i - e_i). \tag{4.43}$$

The presence of the delta function in the second integral serves to pick out the value of the integrand at $e_i = y_i - z_i$, so we have:

$$p(Y_i|M, \theta, I) = \int dz_i\, f_Z(z_i)\, f_E(y_i - z_i). \tag{4.44}$$

The right hand side of the equation is the convolution integral.^5 We now evaluate our equation for $p(Y_i|M, \theta, I)$ for two useful general cases.
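Equation (4.44) can be checked numerically for any pair of densities. The sketch below is a minimal Python illustration under stated assumptions: both $f_Z$ and $f_E$ are taken to be Gaussian, the numerical values (model value 2.0, widths 0.5 and 1.2, data value 3.1) are arbitrary, and the function names are hypothetical. It relies on the standard result, which the text arrives at in Equation (4.51), that the convolution of two Gaussians is a Gaussian whose variance is the sum of the two variances.

```python
import math

def gaussian(u, mean, sigma):
    """Gaussian probability density evaluated at u."""
    return math.exp(-0.5 * ((u - mean) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def convolve_at(y, f_z, f_e, z_lo, z_hi, n_steps=4000):
    """Trapezoid-rule estimate of Eq. (4.44): the integral of f_Z(z) f_E(y - z) dz."""
    h = (z_hi - z_lo) / n_steps
    total = 0.5 * (f_z(z_lo) * f_e(y - z_lo) + f_z(z_hi) * f_e(y - z_hi))
    for k in range(1, n_steps):
        z = z_lo + k * h
        total += f_z(z) * f_e(y - z)
    return total * h

model_value = 2.0   # illustrative stand-in for m(x_i | theta)
sigma_m = 0.5       # assumed width of f_Z
sigma_e = 1.2       # assumed width of f_E

def f_z(z):
    return gaussian(z, model_value, sigma_m)

def f_e(e):
    return gaussian(e, 0.0, sigma_e)

y = 3.1
numeric = convolve_at(y, f_z, f_e, model_value - 10 * sigma_m, model_value + 10 * sigma_m)
closed_form = gaussian(y, model_value, math.sqrt(sigma_m**2 + sigma_e**2))
print(numeric, closed_form)
```

The two printed numbers should agree to many digits; replacing `f_z` or `f_e` with a non-Gaussian density leaves `convolve_at` usable, although the closed-form comparison then no longer applies.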
4.8.1 Deterministic model

In this case, we assume that for any specific choice of the model parameters there is no uncertainty in the predicted value, $z_i$. We will refer to models of this kind as deterministic models. Given the model and the values of any of its parameters, then $f_Z(z_i) = \delta(z_i - m(x_i|\theta))$. In this case, Equation (4.44) becomes

$$p(Y_i|M, \theta, I) = f_E(y_i - m(x_i|\theta)) = p(E_i|M, \theta, I). \tag{4.45}$$

Thus, the probability of the $i$th data value is simply equal to the probability of the $i$th error term. If the errors are all independent,^6 then

$$p(D|M, \theta, I) = p(Y_1, Y_2, \ldots, Y_N|M, \theta, I) = p(E_1, E_2, \ldots, E_N|M, \theta, I) = \prod_{i=1}^{N} p(E_i|M, \theta, I), \tag{4.46}$$

where $\prod_{i=1}^{N}$ stands for the product of $N$ of these terms. We have already encountered Equation (4.46) in the simple spectral line problem of Section 3.6 (see Equation (4.42)).

^5 For more details on the convolution integral and how to evaluate it using the Fast Fourier Transform, see Sections B.4 and B.10.
^6 We deal with the effect of correlated errors in Section 10.2.2.
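For a deterministic model, Equation (4.46) reduces the likelihood to a product over the error terms. The following Python sketch evaluates its logarithm under the additional assumption that each error is Gaussian with known standard deviation $\sigma_i$; the straight-line model, the data values and all function names are hypothetical and serve only as an illustration.

```python
import math

def gaussian_log_likelihood(theta, xs, ys, sigmas, model):
    """Log of Eq. (4.46), assuming each error e_i is Gaussian with standard deviation sigma_i."""
    total = 0.0
    for x, y, sigma in zip(xs, ys, sigmas):
        residual = y - model(x, theta)   # e_i = y_i - m(x_i | theta), as in Eq. (4.45)
        total += -0.5 * (residual / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
    return total

def line(x, theta):
    """Hypothetical deterministic model: a straight line, theta = (intercept, slope)."""
    intercept, slope = theta
    return intercept + slope * x

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]        # made-up data for illustration only
sigmas = [0.5, 0.5, 0.5, 0.5]

good = gaussian_log_likelihood((1.0, 2.0), xs, ys, sigmas, line)
poor = gaussian_log_likelihood((0.0, 1.0), xs, ys, sigmas, line)
print(good, poor)
```

Working with the logarithm turns the product in Equation (4.46) into a sum, which avoids numerical underflow when $N$ is large; parameter values that track the data (`good`) give a larger log-likelihood than ones that do not (`poor`).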
4.8.2 Probabilistic model

In the second case, our information about the model is uncertain. Here, we will distinguish between three different situations.

1. The model prediction, $z_i$, includes a statistical noise component $\epsilon_i$:

$$z_i = m(x_i|\theta) + \epsilon_i. \tag{4.47}$$

Equation (4.38) can be rewritten as

$$y_i = z_i + e_i = m(x_i|\theta) + \epsilon_i + e_i. \tag{4.48}$$

In this case, the data, $y_i$, can differ from the model, $m(x_i|\theta)$, because of a component $e_i$ due to measurement errors, and a component $\epsilon_i$ due to a statistical uncertainty in our model. The two error terms are assumed to be uncorrelated. For example, suppose our data consist of a radar return signal from an unidentified aircraft. We could compare the signal to samples of measured radar return signals from a set of known aircraft for different orientations to arrive at the most probable identification. In this case, these sample measurements of known aircraft, which include a noise component, constitute our model, $m(x_i|\theta)$. Suppose that the probability distribution of $\epsilon_i$ is described by a Gaussian with standard deviation $\sigma_{mi}$. Then

$$p(Z_i|M, \theta, I) = \frac{1}{\sqrt{2\pi}\,\sigma_{mi}} \exp\left\{-\frac{(z_i - m(x_i|\theta))^2}{2\sigma_{mi}^2}\right\} = \frac{1}{\sqrt{2\pi}\,\sigma_{mi}} \exp\left\{-\frac{\epsilon_i^2}{2\sigma_{mi}^2}\right\} = f_Z(z_i). \tag{4.49}$$

Suppose also that the error term, $e_i$, in Equation (4.38), has a Gaussian probability distribution with a standard deviation, $\sigma_i$, of the form

$$p(E_i|M, \theta, I) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left\{-\frac{e_i^2}{2\sigma_i^2}\right\} = f_E(y_i - z_i). \tag{4.50}$$

Then according to Equation (4.44), $p(Y_i|M, \theta, I)$ is the convolution of the two Gaussian probability distributions. It is easy to show^7 that the result is another Gaussian given by

$$p(Y_i|M, \theta, I) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma_i^2 + \sigma_{mi}^2}} \exp\left\{-\frac{(y_i - m(x_i|\theta))^2}{2(\sigma_i^2 + \sigma_{mi}^2)}\right\}. \tag{4.51}$$

If the $Y_i$ terms are all independent, then

$$p(D|M, \theta, I) = p(Y_1, Y_2, \ldots, Y_N|M, \theta, I) = (2\pi)^{-N/2} \left\{\prod_{i=1}^{N} (\sigma_i^2 + \sigma_{mi}^2)^{-1/2}\right\} \exp\left\{-\sum_{i=1}^{N} \frac{(y_i - m(x_i|\theta))^2}{2(\sigma_i^2 + \sigma_{mi}^2)}\right\}. \tag{4.52}$$

2. In the second situation, our information about the model prediction, $z_i$, is only uncertain because of uncertainty in the value of the independent variable $x_i$. For example, we might be interested in fitting a straight line to some data with errors in both coordinates. Let $x_{i0}$ be the nominal value of the independent variable and $x_i$ the true value. Then $\delta x_i = x_i - x_{i0}$ is the uncertainty in $x_i$. Now suppose the probability distribution of $x_i$ is a Gaussian given by

$$p(X_i|I) = \frac{1}{\sqrt{2\pi}\,\sigma_{xi}} \exp\left\{-\frac{(x_i - x_{i0})^2}{2\sigma_{xi}^2}\right\} = f_X(x_i), \tag{4.53}$$

where the scale of $\delta x_i$ is set by $\sigma_{xi}$. Our goal here is to compute an expression for $p(Z_i|M, \theta, I) = f_Z(z_i)$, for use in Equation (4.44). In Section 5.12, we will show how to compute the probability distribution of a function of $x_i$ if we know the probability distribution of $x_i$. In our case, this function is $z_i = m(x_i|\theta)$. The function $m(x_i|\theta)$ must be a monotonic and differentiable function over the range of $x_i$ of interest. Then there exists an inverse function $x_i = m^{-1}(z_i|\theta)$ which is monotonic and differentiable. Thus, for every interval $dx_i$ there is a corresponding interval $dz_i$. The result is

$$f_Z(z_i) = f_X(x_i) \left|\frac{dx_i}{dz_i}\right| = f_X(m^{-1}(z_i|\theta)) \left|\frac{dx_i}{dz_i}\right|, \tag{4.54}$$

which is valid provided the derivative does not change significantly over a scale of order $2\sigma_{xi}$. Let's evaluate Equation (4.54) for the straight-line model, $z_i = m(x_i|A, B) = A + B x_i$. In that case,

$$x_i = m^{-1}(z_i|A, B) = \frac{1}{B} z_i - \frac{A}{B}, \tag{4.55}$$

so, writing $z_{i0} = A + B x_{i0}$ for the prediction at the nominal value,

$$x_i - x_{i0} = \frac{1}{B}(z_i - z_{i0}). \tag{4.56}$$

Also, it is apparent that

$$\left|\frac{dx_i}{dz_i}\right| = \frac{1}{|B|}. \tag{4.57}$$

Combining Equations (4.53), (4.54), (4.56) and (4.57), we obtain

$$f_Z(z_i) = \frac{1}{\sqrt{2\pi}\,|B|\,\sigma_{xi}} \exp\left\{-\frac{(z_i - z_{i0})^2}{2B^2\sigma_{xi}^2}\right\}. \tag{4.58}$$

We now have everything we need to evaluate Equation (4.44). Again, $p(Y_i|M, \theta, I)$ is the convolution of the two Gaussian probability distributions. The result is

$$p(Y_i|M, A, B, I) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma_i^2 + B^2\sigma_{xi}^2}} \exp\left\{-\frac{(y_i - m(x_{i0}|A, B))^2}{2(\sigma_i^2 + B^2\sigma_{xi}^2)}\right\}. \tag{4.59}$$

If the $Y_i$ terms are all independent, then

$$p(D|M, A, B, I) = (2\pi)^{-N/2} \left\{\prod_{i=1}^{N} (\sigma_i^2 + B^2\sigma_{xi}^2)^{-1/2}\right\} \exp\left\{-\sum_{i=1}^{N} \frac{(y_i - m(x_{i0}|A, B))^2}{2(\sigma_i^2 + B^2\sigma_{xi}^2)}\right\}. \tag{4.60}$$

The reader is directed to Section 11.7 for a worked problem of this kind.

3. The model prediction, $z_i$, is uncertain because of statistical uncertainties in both the model and the value of the independent variable(s), $x_i$. In this case, if we again assume Gaussian distributions for the uncertain quantities, Equation (4.60) becomes

$$p(D|M, A, B, I) = (2\pi)^{-N/2} \left\{\prod_{i=1}^{N} (\sigma_i^2 + \sigma_{mi}^2 + B^2\sigma_{xi}^2)^{-1/2}\right\} \exp\left\{-\sum_{i=1}^{N} \frac{(y_i - m(x_{i0}|A, B))^2}{2(\sigma_i^2 + \sigma_{mi}^2 + B^2\sigma_{xi}^2)}\right\}. \tag{4.61}$$

^7 Simply evaluate Equation (4.44) using Mathematica with limits on the integral of $\pm\infty$ after substituting for $f_Z(z_i)$ and $f_E(y_i - z_i)$ using Equations (4.49) and (4.50), respectively.
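The straight-line likelihoods of Equations (4.60) and (4.61) translate directly into code. The following Python sketch evaluates the logarithm of Equation (4.61); it assumes, for simplicity, a single model-noise width `sigma_m` shared by all points (the text allows a separate $\sigma_{mi}$ per point), and the function name and data values are hypothetical.

```python
import math

def log_likelihood_line(a, b, x_nominal, ys, sigma_y, sigma_x, sigma_m=0.0):
    """Log of Eq. (4.61) for z = a + b*x; sigma_m = 0 recovers Eq. (4.60)."""
    total = 0.0
    for x0, y, s_y, s_x in zip(x_nominal, ys, sigma_y, sigma_x):
        # Effective variance of y_i: measurement, model noise, and x error scaled by the slope.
        variance = s_y**2 + sigma_m**2 + (b * s_x) ** 2
        residual = y - (a + b * x0)
        total += -0.5 * math.log(2.0 * math.pi * variance) - 0.5 * residual**2 / variance
    return total

# Made-up data for illustration only.
x_nominal = [-1.0, 0.0, 1.0, 2.0]
ys = [-1.2, 0.9, 3.1, 4.8]
sigma_y = [0.5] * 4
sigma_x = [0.3] * 4

without_x_error = log_likelihood_line(1.0, 2.0, x_nominal, ys, sigma_y, [0.0] * 4)
with_x_error = log_likelihood_line(1.0, 2.0, x_nominal, ys, sigma_y, sigma_x)
print(without_x_error, with_x_error)
```

Because the slope appears inside the effective variance $\sigma_i^2 + B^2\sigma_{xi}^2$, the likelihood is no longer a simple least-squares function of $B$; that dependence is what distinguishes this case from the deterministic model of Section 4.8.1.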
4.9 Summary

In any Bayesian analysis, the prior information defines the hypothesis space of interest, prior probability distributions and the means for computing $p(D|H_i, I)$, the likelihood function. In this chapter, we have given examples of how to encode prior information into a probability distribution (commonly referred to as the sampling distribution) for use in computing the likelihood term. We saw how the well-known binomial, multinomial, hypergeometric and Poisson distributions correspond to different prior information. In the process, we learned that Bayesian inference is concerned with logical connections between propositions which may or may not correspond to causal physical influences. We introduced the notion of exchangeable distributions and learned how to compute probabilities for situations where the prior information, at first sight, appears very imprecise. In Section 4.7.1, we gained important insight into the fundamental difference between Bayesian and frequentist approaches to inference. Finally, in Section 4.8, we learned how to construct likelihood functions for both deterministic and probabilistic models.
4.10 Problems

1. A bottle contains 50 black balls and 30 red balls. The bottle is first shaken to mix up the balls. What is the probability that blindfolded, you will pick two red balls in three tries?

2. Let $I \equiv$ "A tin is purchased from a company that makes an equal number of two types. Both contain 90 buttons which are identical except that 2/3 of the buttons in one tin are black (the rest are white) and 2/3 of the buttons in the other tin are white (the rest are black). You can't distinguish the tins from their outside."
Let $D \equiv$ "In a sample of ten buttons drawn from the tin, seven are black."
Let $B \equiv$ "We are drawing from the black tin."
Let $W \equiv$ "We are drawing from the white tin."
Compute the odds $= \dfrac{p(B|D, I)}{p(W|D, I)}$, assuming $p(B|I) = p(W|I)$.

3. A tin contains 17 black buttons and 6 white buttons. The tin is first shaken to mix up the buttons. What is the probability that blindfolded, you will pick a white button on the third pick if you don't know what was picked on the first two picks?

4. A bottle contains three green balls and three red balls. The bottle is first shaken to mix up the balls. What is the probability that blindfolded, you will pick a red ball on the third pick, if you learn that at least one red ball was picked on the first two picks?

5. A spacecraft carrying two female and three male astronauts makes a trip to Mars. The plan calls for a two-person detachable capsule to land at site A on the planet and a second one-person capsule to land at site B. The other two astronauts remain in orbit. Which three will board the two capsules is decided by a lottery, consisting of picking names from a box. What is the probability that a female occupies the one-person capsule if we know that at least one member of the other capsule is female, but we are not told the order in which the astronauts were picked?

6. In a particular water sample, ten bacteria are found, of which three are of type A. Let $Q \equiv$ "the probability that any particular bacterium is of type A is between $q$ and $q + dq$." Plot the posterior $p(Q|D, I)$. What prior probability distribution did you assume and why?

7. In a particular water sample, ten bacteria are found, of which three are of type A. What is the probability of obtaining six type A bacteria, in a second independent water sample containing 12 bacteria in total?

8. In a radio astronomy survey, 41 quasars were detected in a total sample of 90 sources. Let $F \equiv$ "the probability that any particular source is a quasar is between $f$ and $f + df$." Plot the posterior $p(F|D, I)$ assuming a uniform prior for $F$.

9. In problem 7, what is the probability of obtaining at least three type A bacteria?

10. A certain solution contains three types of bacteria: A, B, and C. Given $p(A|I) = 0.2$, $p(B|I) = 0.3$, and $p(C|I) = 0.5$, what is the probability of obtaining a sample of ten bacteria with three type A, three type B and four type C?

11. A total of five $\gamma$-ray photons were detected from a particular star in one hour. What is the probability that three photons will be detected in the next hour of observing?

12. On average, five $\gamma$-ray photons are detected from a particular star each hour. What is the probability that three photons were detected in the first hour of a two-hour observation that recorded eight photons in total?

13. In the opinion poll problem of Section 4.2.3, re-plot Figure 4.3 for $n = 55$.

14. In the opinion poll problem of Section 4.2.3, compute the probability that the Liberals will achieve a majority of at least 51%, for $n = 55$ and everything else the same.

15. We want to fit a straight line model of the form $y_i = a + b x_i$ to the list of $x, y$ pairs given below. The data have Gaussian distributed errors in both the $x$ and $y$ coordinates with $\sigma_x = 1$ and $\sigma_y = 2$. Assume uniform priors for $a$ and $b$, with boundaries that enclose the range of parameter space where there is a significant contribution from the likelihood function. This means that we can treat the prior as a constant, and write $p(a, b|D, M, I) \propto p(D|M, a, b, I)$.

{{-5, 1.22}, {-4, 3.28}, {-3, 2.52}, {-2, 3.74}, {-1, 3.01}, {0, 1.80}, {1, 2.49}, {2, 5.48}, {3, 0.42}, {4, 4.80}, {5, 4.22}}

(a) Plot the data with error bars in both coordinates.
(b) Show a contour plot of the joint posterior PDF, $p(a, b|D, I)$.
(c) For what choice of $a, b$ is $p(a, b|D, I)$ a maximum? You can use the Mathematica command FindMaximum[p(a, b|D, I), {a, 0.0}, {b, 0.5}].
(d) Show the best fit line and data with error bars on the same plot.
(e) Compute and plot the marginal distributions $p(a|D, I)$ and $p(b|D, I)$. One way to do this is to compute a table of the joint posterior values for a grid of $a, b$ values and approximate the integrals required for marginalization by a summation over the rows or columns. Make sure to normalize your marginal distributions so that $\int p(a|D, I)\, da = 1 \approx \sum_i p(a_i|D, I)\, \Delta a$.
5 Frequentist statistical inference
5.1 Overview

We now begin three chapters which are primarily aimed at a discussion of the main concepts of frequentist statistical inference. This is currently the prevailing approach to much of scientific inference, so a student should understand the main ideas to appreciate current literature and understand the strengths and limitations of this approach. In this chapter, we introduce the concept of a random variable and discuss some general properties of probability distributions before focusing on a selection of important sampling distributions and their relationships. We also introduce the very important Central Limit Theorem in Section 5.9 and examine this from a Bayesian viewpoint in Section 5.10. The chapter concludes with the topic of how to generate pseudo-random numbers of any desired distribution, which plays an important role in Monte Carlo simulations.

In Chapter 6, we address the question of what is a statistic and give some common important examples. We also consider the meaning of a frequentist confidence interval for expressing the uncertainty in parameter values. The reader should be aware that the study of different statistics is a very big field which we only touch on in this book. Some other topics normally covered in a statistics course, like the fitting of models to data, are treated from a Bayesian viewpoint in later chapters. Finally, Chapter 7 concludes our brief summary of frequentist statistical inference with the important topic of frequentist hypothesis testing and discusses an important limitation known as the optional stopping problem.
5.2 The concept of a random variable

Recall from Section 1.1 that conventional "frequentist" statistical inference and Bayesian inference employ fundamentally different definitions of probability. In frequentist statistics, when we write the probability $p(A)$, the argument of the probability is called a random variable. It is a quantity that can be considered to take on various values throughout an ensemble or a series of repeated experiments. For example:

1. A measured quantity which contains random errors.
2. Time intervals between successive radioactive decays.

Before proceeding, we need an operational definition of a random variable. From this, we discover that the random variable is not the particular number recorded in one measurement, but rather, it is an abstraction of the measurement operation or observation that gives rise to that number.

Definition: A random variable, $X$, transforms the possible outcomes of an experiment (measurement operation) to real numbers.

Example: Suppose we are interested in measuring a pollutant's concentration level for each of $n$ time intervals. The observations (procedure for producing a real number) $X_1, X_2, \ldots, X_n$ form a sample of the pollutant's concentration. Before the instrument actually records the concentration level during the $i$th trial, the observation, $X_i$, is a random variable. The recorded value, $x_i$, is not a random variable, but the actual measured value of the observation, $X_i$.

Question: Why do we need to have $n$ random variables $X_i$? Why not one random variable $X$ for which $x_1, x_2, \ldots, x_n$ are the realizations of the random variable during the $n$ observations?

Answer: Because we often want to determine the joint probability of getting $x_1$ on trial 1, $x_2$ on trial 2, etc. If we think of each observation as a random variable, then we can distinguish between situations corresponding to:

1. Sampling with replacement so that no observation is affected by any other (i.e., independent $X_1, X_2, \ldots, X_n$). In this case, all observations are random variables with identical probability distributions.
2. Sampling without replacement. In this case, the observations are not independent and hence are characterized by different probability distributions. Think of an urn filled with black and white balls. When we don't replace the drawn balls, the probability of, say, a black on each draw is different, because it depends on what has already been drawn.
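The difference between the two sampling schemes can be made concrete with a small calculation. The sketch below is a minimal Python illustration; the urn contents (three black and two white balls) and the function name are hypothetical, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

def prob_black_next(black, white, drawn_black, drawn_white, replace):
    """Probability that the next ball is black, given the balls drawn so far."""
    if replace:
        # Drawn balls are returned, so earlier draws carry no information.
        return Fraction(black, black + white)
    remaining_black = black - drawn_black
    remaining_total = black + white - drawn_black - drawn_white
    return Fraction(remaining_black, remaining_total)

black, white = 3, 2  # hypothetical urn contents

first_draw = prob_black_next(black, white, 0, 0, replace=False)
after_black = prob_black_next(black, white, 1, 0, replace=False)
after_white = prob_black_next(black, white, 0, 1, replace=False)
with_replacement = prob_black_next(black, white, 1, 0, replace=True)

print(first_draw, after_black, after_white, with_replacement)
```

With replacement the probability stays at 3/5 on every draw; without replacement it becomes 1/2 after a black ball has been removed and 3/4 after a white one, which is why each observation needs its own random variable.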
5.3 Sampling theory

The most important aspect of frequentist statistics is the process of drawing conclusions based on sample data drawn from the population (which is the collection of all possible samples). The concept of the population assumes that in principle, an infinite number of measurements (under identical conditions) are possible. The use of the term random variable conveys the idea of an intrinsic uncertainty in the measurement characterized by an underlying population.

Question: What does the term "random" really mean?

Answer: When we randomize a collection of balls in a bottle by shaking it, this is equivalent to saying that the details of this operation are not understood or too complicated to handle. It is sometimes necessary to assume that certain complicated details, while undeniably relevant, might nevertheless have little numerical effect on the answers to certain questions, such as the probability of drawing $r$ black balls from a bottle in $n$ trials when $n$ is sufficiently small.

According to E. T. Jaynes (2003), the belief that "randomness" is some kind of property existing in nature is a form of Mind Projection Fallacy which says, in effect, "I don't know the detailed causes – therefore Nature is indeterminate." For example, later in this chapter we discuss how to write computer programs which generate seemingly "random" numbers, yet all these programs are completely deterministic. If you did not have a copy of the program, there is almost no chance that you could discover it merely by examining more output from the program. Then the Mind Projection Fallacy might lead to the claim that no rule exists. At scales where quantum mechanics becomes important, the prevailing view is that nature is indeterminate. In spite of the great successes of the theory of quantum mechanics, physicists readily admit that they currently lack a satisfactory understanding of the subject. The Bayesian viewpoint is that the limitation in scientific inference results from incomplete information.

In both Bayesian and frequentist statistical inference, certain sampling distributions (e.g., binomial, Poisson, Gaussian) play a central role. To the frequentist, the sampling distribution is a model of the probability distribution of the underlying population from which the sample was taken. From this point of view, it makes sense to interpret probabilities as long-run relative frequencies. In a Bayesian analysis, the sampling distribution is a mathematical description of the uncertainty in predicting the data for any particular model because of incomplete information. It enables us to compute the likelihood $p(D|H, I)$. In Bayesian analysis, any sampling distribution corresponds to a particular state of knowledge. But as soon as we start accumulating data, our state of knowledge changes. The new information necessarily modifies our probabilities in a way that can be incomprehensible to one who tries to interpret probabilities as physical causations or long-run relative frequencies.
5.4 Probability distributions

Now that we have a better understanding of what a random variable is, let's restate the frequentist definition of probability more precisely. It is commonly referred to as the relative frequency definition.

Relative frequency definition of probability: If an experiment is repeated $n$ times under identical conditions and $n_x$ outcomes yield a value of the random variable $X = x$, the limit of $n_x/n$, as $n$ becomes very large,^1 is defined as $p(x)$, the probability that $X = x$.

Experimental outcomes can be either discrete or continuous. Associated with each random variable is a probability distribution. A probability distribution may be quantitatively and conveniently described by two functions $p(x)$ and $F(x)$, which are given below for the discrete and continuous cases.

1. Discrete random variables

Probability distribution function (also called the probability mass function): $p(x_i)$ gives the probability of obtaining the particular value of the random variable $X = x_i$.
(a) $p(x) = p\{X = x\}$
(b) $p(x) \ge 0$ for all $x$
(c) $\sum_x p(x) = 1$

Cumulative probability function: this gives the probability that the random variable will have a value $\le x$.
(a) $F(x) = p\{X \le x\} = \sum_{x_i = 0}^{x_i = x} p(x_i)$
(b) $0 \le F(x) \le 1$
(c) $F(x_j) > F(x_i)$ if $x_j > x_i$
(d) $p\{X > x\} = 1 - F(x)$

Figure 5.1 shows the discrete probability distribution (binomial) describing the number of heads in ten throws of a fair coin. The right panel shows the corresponding cumulative distribution function.

Figure 5.1 The left panel shows the discrete probabilities for the number of heads in ten throws of a fair coin. The right panel shows the corresponding cumulative distribution function. [Plots of $p(x)$ and $F(x)$ versus $x$ (number of heads).]

2. Continuous random variables^2

Probability density function: $f(x)$
(a) $p\{a \le X \le b\} = \int_a^b f(x)\, dx$
(b) $f(x) \ge 0 \quad (-\infty < x < \infty)$
(c) $\int_{-\infty}^{+\infty} f(x)\, dx = 1$

Cumulative probability density function:
(a) $F(x) = p\{X \le x\} = \int_{-\infty}^{x} f(x')\, dx'$
(b) $F(-\infty) = 0$; $F(+\infty) = 1$
(c) $p\{a < X < b\} = F(b) - F(a)$
(d) $\dfrac{dF(x)}{dx} = f(x)$

Figure 5.2 shows an example of a continuous probability density function (left panel) and the corresponding cumulative probability density function (right panel).

Figure 5.2 The left panel shows a continuous probability density function and the right panel shows the corresponding cumulative probability density function. [Plots of $f(x)$ and $F(x)$ versus $x$.]

^1 See Bernoulli's law of large numbers discussed in Section 4.2.1.
^2 Continuous density function defined by $f(X = x) = \lim_{\delta x \to 0} \left[p(x < X < x + \delta x)/\delta x\right]$.

5.5 Descriptive properties of distributions

The expectation value for a function, $g(X)$, of a random variable, $X$, is the weighted average of the function over all possible values of $x$. We will designate the expectation value of $g(X)$ by $\langle g(X) \rangle$, which is given by

$$\langle g(X) \rangle = \begin{cases} \sum_{\text{all } x} g(x)\, p(x) & \text{(discrete)}, \\ \int_{-\infty}^{+\infty} g(x)\, f(x)\, dx & \text{(continuous)}. \end{cases} \tag{5.1}$$

The result, if it exists, is a fixed number (not a function) and a property of the probability distribution of $X$. The expectation defined above is referred to as the first moment of the distribution $g(X)$. The shape of a probability distribution can be rigorously described by the value of its moments. The $r$th moment of the random variable $X$ about the origin ($x = 0$) is defined by

$$\mu'_r = \langle X^r \rangle = \begin{cases} \sum_x x^r\, p(x) & \text{(discrete)}, \\ \int_{-\infty}^{+\infty} x^r f(x)\, dx & \text{(continuous)}. \end{cases} \tag{5.2}$$

Mean $= \mu'_1 = \langle X \rangle = \mu$ = first moment about the origin. This is the usual measure of the location of a probability distribution. The $r$th central moment (origin = mean) of $X$ is defined by

$$\mu_r = \langle (X - \mu)^r \rangle = \begin{cases} \sum_x (x - \mu)^r p(x) & \text{(discrete)}, \\ \int_{-\infty}^{+\infty} (x - \mu)^r f(x)\, dx & \text{(continuous)}. \end{cases} \tag{5.3}$$
The distinction between $\mu_r$ and $\mu'_r$ is simply that in the calculation of $\mu_r$ the origin is shifted to the mean value of $x$.

First central moment: $\langle (X - \mu) \rangle = \langle X \rangle - \mu = 0$.

Second central moment: $\mathrm{Var}(X) = \sigma_x^2 = \langle (X - \mu)^2 \rangle$, where $\sigma_x^2$ = usual measure of dispersion of a probability distribution.

$$\begin{aligned} \langle (X - \mu)^2 \rangle &= \langle (X^2 - 2\mu X + \mu^2) \rangle \\ &= \langle X^2 \rangle - 2\mu \langle X \rangle + \mu^2 \\ &= \langle X^2 \rangle - 2\mu^2 + \mu^2 = \langle X^2 \rangle - \mu^2 \\ &= \langle X^2 \rangle - \langle X \rangle^2. \end{aligned} \tag{5.4}$$

Therefore, $\sigma^2 = \langle X^2 \rangle - \langle X \rangle^2$.

The standard deviation, $\sigma$, equal to the square root of the variance, is a useful measure of the width of a probability distribution. It is frequently desirable to compute an estimate of $\sigma^2$ as the data are being acquired. Equation (5.4) tells us how to accomplish this, by subtracting the square of the average of the data from the average of the data values squared. Later, in Section 6.3, we will introduce a more accurate estimate of $\sigma^2$ called the sample variance.

Box 5.1
Question: What is the variance of the random variable $Y = aX + b$?
Solution:
$$\begin{aligned} \mathrm{Var}(Y) &= \langle (Y - \mu_y)^2 \rangle = \langle \{(aX + b) - (a\mu + b)\}^2 \rangle \\ &= \langle \{aX - a\mu\}^2 \rangle = \langle a^2 X^2 - 2a^2 \mu X + a^2 \mu^2 \rangle \\ &= a^2 (\langle X^2 \rangle - \langle X \rangle^2) = a^2\, \mathrm{Var}(X). \end{aligned}$$

Third central moment: $\mu_3 = \langle (X - \mu)^3 \rangle$. This is a measure of the asymmetry or skewness of the distribution. For a symmetric distribution, $\mu_3 = 0$ and $\mu_{2n+1} = 0$ for any integer value of $n$.

Fourth central moment: $\mu_4 = \langle (X - \mu)^4 \rangle$. $\mu_4$ is called kurtosis (another shape factor). It is a measure of how flat-topped a distribution is near its peak. See Figure 5.3 and discussion in the next section for an example.
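The moment definitions in Equations (5.1) to (5.4) and the result of Box 5.1 can be verified on the coin-throw distribution of Figure 5.1. The following is a minimal Python sketch; the helper name `expectation` and the constants `a` and `b` are arbitrary illustrative choices.

```python
import math

n_throws = 10
xs = list(range(n_throws + 1))
# Probability of x heads in ten throws of a fair coin (the distribution of Figure 5.1).
pmf = [math.comb(n_throws, x) / 2**n_throws for x in xs]

def expectation(g):
    """Discrete form of Eq. (5.1): sum of g(x) p(x) over all x."""
    return sum(g(x) * p for x, p in zip(xs, pmf))

mean = expectation(lambda x: x)                        # first moment about the origin
second_moment = expectation(lambda x: x**2)            # Eq. (5.2) with r = 2
variance = expectation(lambda x: (x - mean) ** 2)      # Eq. (5.3) with r = 2
third_central = expectation(lambda x: (x - mean) ** 3) # Eq. (5.3) with r = 3

# Eq. (5.4): the variance equals <X^2> - <X>^2.
variance_shortcut = second_moment - mean**2

# Box 5.1: Var(aX + b) = a^2 Var(X), checked for arbitrary constants.
a, b = 3.0, -7.0
mean_y = expectation(lambda x: a * x + b)
variance_y = expectation(lambda x: (a * x + b - mean_y) ** 2)

print(mean, variance, variance_shortcut, third_central, variance_y)
```

For this symmetric distribution the mean is 5, the variance is 2.5 by either route, the third central moment vanishes, and the variance of $aX + b$ comes out as $a^2 = 9$ times the variance of $X$.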
5.5.1 Relative line shape measures for distributions

The shape of a distribution cannot be entirely judged by the values of $\mu_3$ and $\mu_4$ because they depend on the units of the random variable. It is better to use measures relative to the distribution's dispersion.

Coefficient of skewness: $\alpha_3 = \dfrac{\mu_3}{(\mu_2)^{3/2}}$.

Coefficient of kurtosis: $\alpha_4 = \dfrac{\mu_4}{(\mu_2)^2}$.

Figure 5.3 Single peak distributions with different coefficients of skewness and kurtosis. [Panels: $\alpha_3 > 0 \equiv$ positively skewed; $\alpha_3 < 0 \equiv$ negatively skewed; $\alpha_3 = 0 \equiv$ symmetric; $\alpha_4 > 3$ leptokurtic $\equiv$ highly peaked; $\alpha_4 < 3$ platykurtic $\equiv$ flat-topped.]

Figure 5.3 illustrates a single peaked distribution for different $\alpha_3$ and $\alpha_4$ coefficients. Note: $\alpha_4 = 3$ for any Gaussian distribution, so distributions with $\alpha_4 > 3$ are more sharply peaked than a Gaussian, while those with $\alpha_4 < 3$ are more flat-topped.
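The unit independence of $\alpha_3$ and $\alpha_4$ is easy to check numerically. The sketch below is a minimal Python illustration; the choice of a binomial distribution with ten trials and success probability 0.2, the rescaling factor of 1000, and the function names are all assumptions made for the example.

```python
import math

def central_moment(xs, pmf, r):
    """r-th central moment of a discrete distribution, as in Eq. (5.3)."""
    mean = sum(x * p for x, p in zip(xs, pmf))
    return sum((x - mean) ** r * p for x, p in zip(xs, pmf))

def shape_coefficients(xs, pmf):
    """Return (alpha_3, alpha_4), the coefficients of skewness and kurtosis."""
    mu2 = central_moment(xs, pmf, 2)
    alpha3 = central_moment(xs, pmf, 3) / mu2**1.5
    alpha4 = central_moment(xs, pmf, 4) / mu2**2
    return alpha3, alpha4

# Illustrative distribution: number of successes in 10 trials, success probability 0.2.
n, q = 10, 0.2
xs = list(range(n + 1))
pmf = [math.comb(n, x) * q**x * (1 - q) ** (n - x) for x in xs]

alpha3, alpha4 = shape_coefficients(xs, pmf)

# A change of units rescales mu_3 and mu_4 but leaves alpha_3 and alpha_4 unchanged.
scaled_xs = [1000.0 * x for x in xs]
alpha3_scaled, alpha4_scaled = shape_coefficients(scaled_xs, pmf)

print(alpha3, alpha4, alpha3_scaled, alpha4_scaled)
```

This distribution is positively skewed ($\alpha_3 \approx 0.47$) and slightly more peaked than a Gaussian ($\alpha_4 \approx 3.03$), and both coefficients are the same whether the variable is expressed in the original or the rescaled units.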
5.5.2 Standard random variable

A random variable $X$ can always be converted to a standard random variable $Z$ using the following definition:

$$Z = \frac{X - \mu}{\sigma_x}. \tag{5.5}$$

$Z$ has a mean $\langle Z \rangle = 0$, and variance $\langle Z^2 \rangle = \sigma_z^2 = 1$.

For any particular value $x$ of $X$, the quantity $z = (x - \mu)/\sigma_x$ indicates the deviation of $x$ from the expected value of $X$ in terms of standard deviation units. At several points in this chapter we will find it convenient to make use of the standard random variable.
5.5.3 Other measures of central tendency and dispersion

Median: The median is a measure of the central tendency in the sense that half the area of the probability distribution lies to the left of the median and half to the right. For any continuous random variable, the median is defined by

$$p(X \le \text{median}) = p(X \ge \text{median}) = 1/2. \tag{5.6}$$

If a distribution has a strong central peak, so that most of its area is under a single peak, then the median is an estimator of the central peak. It is a more robust estimator than the mean: the median fails as an estimator only if the area in the tail region of the probability distribution is large, while the mean fails if the first moment of the tail is large. It is easy to construct examples where the first moment of the tail is large even though the area is negligible.

Mode: Defined to be a value, $x_m$ of $X$, that maximizes the probability function (if $X$ is discrete) or probability density (if $X$ is continuous). Note: this is only meaningful if there is a single peak. If $X$ is continuous, the mode is the solution to

$$\frac{df(x)}{dx} = 0, \quad \text{for} \quad \frac{d^2 f(x)}{dx^2} < 0. \tag{5.7}$$

An example of the mode, median and mean for a particular PDF is shown in Figure 5.4.

Figure 5.4 The mode, median and mean are three different measures of this probability density function. [Plot of probability density versus $x$, with the mode, median and mean marked.]
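The robustness of the median relative to the mean can be seen in a small numerical example. The following Python sketch uses a made-up sample of ten values clustered near 1.0 and then adds a single far-out tail value; the numbers are purely illustrative.

```python
import statistics

# Made-up sample: ten values near 1.0, then the same sample with one tail value added.
core = [0.90, 0.95, 0.97, 0.99, 1.00, 1.01, 1.02, 1.03, 1.05, 1.08]
with_tail = core + [50.0]

mean_core = statistics.mean(core)
median_core = statistics.median(core)
mean_tail = statistics.mean(with_tail)
median_tail = statistics.median(with_tail)

print(f"mean:   {mean_core:.3f} -> {mean_tail:.3f}")
print(f"median: {median_core:.3f} -> {median_tail:.3f}")
```

The single tail value contributes little to the number of samples on either side of the central peak, so the median barely moves, but it contributes a great deal to the first moment, so the mean is pulled far from the peak.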
5.5.4 Median baseline subtraction

Suppose you want to remove the baseline variations in some data without suppressing the signal. Many automated signal detection schemes only work well if these baseline variations are removed first. The upper panel of Figure 5.5 depicts the output from a detector system with a signal profile represented by narrow Gaussian-like features sitting on top of a slowly varying baseline with noise. How do we handle this problem?

Solution: Use running median subtraction. One way to remove the slowly varying baseline is to subtract a running median. The signal at sample location $i$ is replaced by the original signal at $i$ minus the median of all values within $\pm (N - 1)/2$ samples of $i$. $N$ is chosen so that it is large compared to the signal profile width and small compared to the scale of baseline changes.
Figure 5.5 (a) A signal profile sitting on top of a slowly varying baseline. (b) The same data with the baseline variations removed by a running median subtraction. (c) The same data with the baseline variations removed by a running mean subtraction; notice the negative bowl in the vicinity of the source profile. [Each panel plots signal strength versus sample number.]
Question: Why is median subtraction more robust than mean subtraction? Answer: When the N samples include some of the signal points, both the mean value and median will be elevated so that when the running subtraction occurs the signal will sit in a negative bowl as is illustrated in Figure 5.5(c). With mean subtraction, the size of the bowl will be proportional to the signal strength. With median subtraction, the size of the bowl is smaller and essentially independent of the signal strength for signals greater than noise. To understand why, consider a running median subtraction with N ¼ 21 and a signal profile, which for simplicity is assumed to have a width of only 1 sample. First, imagine a histogram of the 21 sample values when no signal is present, i.e., just a Gaussian noise histogram with some median, m0 . Now suppose a signal of strength S is added to sample 11, shifting it in the direction of increasing signal strength. Let T11 be the value of sample 11 before the signal was added. There are two cases of interest. (a) If T11 > m0 then T11 þ S > m0 and the addition of the signal produces no change in the median value, i.e., the number of sample values on either side of m0 is unchanged. (b) If T11 < m0 , then the addition of S can cause the sample to move to the other side of m0 thus increasing the median by a small amount to m1 . The size of S required to produce this small shift is S the RMS noise. Once sample 11 has been shifted to the other side, no further increase in the value of S will change the median. Figure 5.5(b) shows the result of a 21-point running median subtraction. The baseline curvature has been nicely removed and there is no noticeable negative bowl in the vicinity of the source. In the case of a running mean subtraction, the change in the mean of our 21 samples is directly proportional to the signal strength S, which gives rise to the very noticeable negative bowl that can be seen in Figure 5.5(c). 
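Mean deviation (alternative measure of dispersion):

$$\langle |X - \mu| \rangle = \begin{cases} \sum_{\text{all } x} |x - \mu|\, p(x) & \text{(discrete)}, \\ \int_{-\infty}^{+\infty} |x - \mu|\, f(x)\, dx & \text{(continuous)}. \end{cases} \tag{5.8}$$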
For long-tailed distributions, the effect on the mean deviation of the values in the tail is less than the effect on the standard deviation.
5.6 Moment generating functions

In Section 5.5 we looked at various useful moments of a random variable. It would be convenient if we could describe all moments of a random variable in one function. This function is called the moment generating function. We will use it directly to compute moments for a variety of distributions. We will also employ the moment generating function in the derivation of the Central Limit Theorem, in Section 5.9, and in the proof of several theorems in Chapter 6. The moment generating function, $m_x(t)$, of the random variable $X$ is defined by

$$m_x(t) = \langle e^{tX} \rangle = \begin{cases} \sum_x e^{tx}\, p(x) & \text{(discrete)}, \\ \int_{-\infty}^{+\infty} e^{tx} f(x)\, dx & \text{(continuous)}, \end{cases} \tag{5.9}$$
where $t$ is a dummy variable. The moment generating function exists if there is a positive constant $\alpha$ such that $m_x(t)$ is finite for $|t| \le \alpha$. The moments themselves are the coefficients in a Taylor series expansion of the moment generating function (see Equation (5.12) below), which converges for $|t| \le \alpha$. It can be shown that if a moment generating function exists, then it completely determines the probability distribution of $X$, i.e., if two random variables have the same moment generating function, they have the same probability distribution.

The $r$th moment about the origin (see Equation (5.2)) is obtained by taking the $r$th derivative of $m_x(t)$ with respect to $t$ and then evaluating the derivative at $t = 0$, as shown in Equation (5.10).

$$\left.\frac{d^r m_x(t)}{dt^r}\right|_{t=0} = \left.\frac{d^r}{dt^r}\langle e^{tX}\rangle\right|_{t=0} = \left\langle \frac{d^r e^{tX}}{dt^r} \right\rangle_{t=0} = \langle X^r e^{tX}\rangle_{t=0} = \langle X^r \rangle = \mu'_r. \tag{5.10}$$

For moments about the mean (central moments), we can use the central moment generating function,

$$m_x(t) = \langle \exp\{t(X - \mu)\} \rangle. \tag{5.11}$$

Now we use a Taylor series expansion of the exponential,

$$\langle \exp[t(X - \mu)] \rangle = \left\langle 1 + t(X - \mu) + \frac{t^2 (X - \mu)^2}{2!} + \frac{t^3 (X - \mu)^3}{3!} + \cdots \right\rangle. \tag{5.12}$$
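From the expansion, one can see clearly that each successive moment is obtained by taking the next higher derivative with respect to $t$, each time evaluating the derivative at $t = 0$.

Example: Let $X$ be a random variable with probability density function

$$f(x) = \begin{cases} \frac{1}{\tau} \exp(-x/\tau), & \text{for } x > 0,\ \tau > 0 \\ 0, & \text{elsewhere.} \end{cases} \tag{5.13}$$

Determine the moment generating function and variance:

$$m_x(t) = \int_0^\infty \exp(tx)\, \frac{1}{\tau} \exp(-x/\tau)\, dx = \frac{1}{\tau} \int_0^\infty \exp[-(1 - \tau t)x/\tau]\, dx = \left. -\frac{\exp[-(1 - \tau t)x/\tau]}{(1 - \tau t)} \right|_0^\infty = (1 - \tau t)^{-1} \quad (\text{for } t < 1/\tau) \tag{5.14}$$

$$\left.\frac{d m_x(t)}{dt}\right|_{t=0} = \tau (1 - \tau t)^{-2}\big|_{t=0} = \tau = \langle X \rangle \tag{5.15}$$

$$\left.\frac{d^2 m_x(t)}{dt^2}\right|_{t=0} = 2\tau^2 (1 - \tau t)^{-3}\big|_{t=0} = 2\tau^2 = \langle X^2 \rangle. \tag{5.16}$$

From Equation (5.4), the variance, $\sigma^2$, is given by

$$\sigma^2 = \langle X^2 \rangle - \langle X \rangle^2 = 2\tau^2 - \tau^2 = \tau^2. \tag{5.17}$$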
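As a rough numerical cross-check of this example, the derivatives of the moment generating function at $t = 0$ can be approximated by finite differences. The sketch below is in Python rather than the book's Mathematica; the function names, the scale value 2.0 and the step size `h` are illustrative assumptions.

```python
def mgf_exponential(t, tau):
    """Moment generating function (1 - tau*t)^-1 of Equation (5.14), valid for t < 1/tau."""
    return 1.0 / (1.0 - tau * t)

def moments_from_mgf(mgf, h=1e-3):
    """Estimate <X> and <X^2> as the first and second derivatives of mgf at t = 0."""
    first = (mgf(h) - mgf(-h)) / (2.0 * h)                 # central difference
    second = (mgf(h) - 2.0 * mgf(0.0) + mgf(-h)) / h**2    # second central difference
    return first, second

tau = 2.0  # illustrative value of the scale parameter
m1, m2 = moments_from_mgf(lambda t: mgf_exponential(t, tau))
variance = m2 - m1**2  # should be close to tau**2
```

With this choice the estimates come out close to 2, 8 and 4 for the mean, the second moment and the variance, in line with Equations (5.15) to (5.17).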
5.7 Some discrete probability distributions

5.7.1 Binomial distribution³

The binomial distribution is one of the most useful discrete probability distributions and arises in any repetitive experiment whose result is either the occurrence or nonoccurrence of an event (only two possible outcomes, like tossing a coin). A large number of experimental measurements contain random errors which can be represented by a limiting form of the binomial distribution called the normal or Gaussian distribution (Section 5.8.1). Let $X$ be a random variable representing the number of successes (occurrences) out of $n$ independent trials such that the probability of success for any one trial is $p$.⁴ Then $X$ is said to have a binomial distribution with probability mass function

$$p(x) = p(x|n, p) = \frac{n!}{(n - x)!\, x!}\, p^x (1 - p)^{n - x}, \quad \text{for } x = 0, 1, \ldots, n;\ 0 \le p \le 1, \tag{5.18}$$

which has two parameters $n$ and $p$.

³ A Bayesian derivation of the binomial distribution is presented in Section 4.2.

⁴ Note: any time the symbol $p$ appears without an argument, it will be taken to be a number representing the probability of a success. $p(x)$ is a probability distribution, either discrete or continuous.
Cumulative distribution function:

$$F(x) = \sum_{i=0}^{x} p(i) = \sum_{i=0}^{x} \binom{n}{i} p^i (1 - p)^{n - i}, \tag{5.19}$$

where $\binom{n}{i} = \frac{n!}{(n - i)!\, i!}$ is short-hand notation for the number of combinations of $n$ items taken $i$ at a time.
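Equation (5.19) translates directly into a few lines of code. The following Python sketch (function names are illustrative, not from the book) evaluates the cumulative distribution and the complementary probability $1 - F(4)$ for $n = 10$, $p = 0.5$, i.e., the probability of more than four successes.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Equation (5.18): probability of x successes in n trials."""
    return comb(n, x) * p**x * (1.0 - p)**(n - x)

def binomial_cdf(x, n, p):
    """Equation (5.19): probability of at most x successes in n trials."""
    return sum(binomial_pmf(i, n, p) for i in range(x + 1))

# Probability of more than 4 successes in 10 trials with p = 0.5.
prob_more_than_4 = 1.0 - binomial_cdf(4, 10, 0.5)
```

The result is 638/1024, which rounds to the value 0.623 quoted in Box 5.2.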
Box 5.2 Mathematica cumulative binomial distribution:

Needs["Statistics`DiscreteDistributions`"]

The probability of more than $x$ successes in $n$ binomial trials is given by

(1 - CDF[BinomialDistribution[n, p], x]) → answer = 0.623 (n = 10, p = 0.5, x = 4)

Moment generating function of a binomial distribution: We can apply Equation (5.9) to compute the moment generating function of the binomial distribution.

$$m_x(t) = \langle e^{tx} \rangle = \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x (1 - p)^{n - x} = \sum_{x=0}^{n} \frac{n!}{(n - x)!\, x!} (e^t p)^x (1 - p)^{n - x}$$

$$= (1 - p)^n + n (1 - p)^{n-1} (e^t p) + \cdots + \frac{n!}{(n - k)!\, k!} (1 - p)^{n - k} (e^t p)^k + \cdots + (e^t p)^n$$

$$= \text{binomial expansion of } [(1 - p) + e^t p]^n.$$

Therefore, $m_x(t) = [1 - p + e^t p]^n$. From the first derivative, we compute the mean, which is given by

$$\text{mean} = \mu'_1 = \left.\frac{d m_x(t)}{dt}\right|_{t=0} = n [1 - p + e^t p]^{n-1} e^t p \big|_{t=0} = np.$$

The second derivative yields the second moment:

$$\mu'_2 = \left.\frac{d^2 m_x(t)}{dt^2}\right|_{t=0} = n(n - 1)[1 - p + e^t p]^{n-2} (e^t p)^2 + n [1 - p + e^t p]^{n-1} e^t p \big|_{t=0} = n(n - 1) p^2 + np. \tag{5.20}$$

But $\mu'_2 = \langle X^2 \rangle$, and therefore the variance $\sigma^2$ is given by

$$\sigma^2 = \langle (X - \mu)^2 \rangle = \langle X^2 \rangle - \langle X \rangle^2 = \langle X^2 \rangle - \mu^2 = n(n - 1) p^2 + np - (np)^2 = np(1 - p) \quad \text{(variance of binomial distribution)}. \tag{5.21}$$
Box 5.3 Mathematica binomial mean and variance

The same results could be obtained in Mathematica with the commands:

Mean[BinomialDistribution[n, p]]
Variance[BinomialDistribution[n, p]]

5.7.2 The Poisson distribution

The Poisson distribution was derived by the French mathematician Poisson in 1837, and the first application was to the description of the number of deaths by horse kicking in the Prussian army. The Poisson distribution resembles the binomial distribution if the probability of occurrence of a particular event is very small. Let $X$ be a random variable representing the number of independent random events that occur at a constant average rate in time or space. Then $X$ is said to have a Poisson distribution with probability function

$$p(x|\mu) = \begin{cases} \dfrac{e^{-\mu} \mu^x}{x!}, & \text{for } x = 0, 1, 2, \ldots \text{ and } \mu > 0 \\ 0, & \text{elsewhere.} \end{cases} \tag{5.22}$$

The parameter of the Poisson distribution is $\mu$, the average number of occurrences of the random event in some time or space interval. $p(x|\mu)$ is the probability of $x$ occurrences of the event in a specified interval.⁵

⁵ In Section 4.7, we derived the Poisson distribution by using probability theory as logic, directly from a statement of a particular state of prior information. In that treatment, the Poisson distribution was written as

$$p(n|r, t, I) = \frac{e^{-rt} (rt)^n}{n!}. \tag{5.23}$$

From a comparison of Equations (5.23) and (5.22), it is clear that the symbol $\mu$, the average number of occurrences in a specified interval, is equal to $rt$, where $r$ is the rate of occurrence and $t$ is a specified time interval. Also, in the current chapter, the symbol $x$ will be used in place of $n$.

The Poisson distribution is a limiting case of the binomial distribution in the limit of large $n$ and small $p$. The following calculation illustrates the steps in deriving the Poisson distribution as a limiting case of the binomial distribution.

Binomial distribution:

$$p(x|n, p) = \frac{n!}{(n - x)!\, x!}\, p^x (1 - p)^{n - x}, \tag{5.24}$$
where $p$ is the probability of a single occurrence in a sample $n$ in some time interval. Multiply the numerator and denominator of Equation (5.24) by $n^x$ and substitute the following expansion:

$$n!/(n - x)! = n(n - 1)(n - 2) \cdots (n - x + 1). \tag{5.25}$$

With these changes, Equation (5.24) becomes

$$p(x|n, p) = \frac{n(n - 1)(n - 2) \cdots (n - x + 1)}{n^x}\, \frac{(np)^x (1 - p)^{n - x}}{x!} = \frac{n(n - 1) \cdots (n - x + 1)}{n^x}\, \frac{\mu^x (1 - p)^{n - x}}{x!}$$

$$= \frac{1 \left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right) \cdots \left(1 - \frac{x - 1}{n}\right)}{(1 - p)^x}\, \frac{\mu^x (1 - p)^n}{x!}, \tag{5.26}$$

where $\mu$ has replaced the product $np$. Now $(1 - p)^n = [(1 - p)^{1/p}]^{np} = [(1 - p)^{1/p}]^{\mu}$, and by definition $\lim_{z \to 0} (1 + z)^{1/z} = e$.

Let $z = -p$; then $\lim_{p \to 0} [(1 - p)^{-1/p}]^{-\mu} = e^{-\mu}$.

Moreover,

$$\lim_{n \to \infty} (1 - 1/n)(1 - 2/n) \cdots (1 - (x - 1)/n) = 1 \tag{5.27}$$

and

$$\lim_{p \to 0} (1 - p)^x = 1. \tag{5.28}$$

Therefore,

$$\lim_{n \to \infty,\ p \to 0} p(x|n, p) = \frac{e^{-\mu} \mu^x}{x!}, \quad x = 0, 1, 2, \ldots \tag{5.29}$$
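The limit can also be seen numerically. The Python sketch below holds the product $np$ fixed at an illustrative value of 2 and compares the binomial and Poisson probabilities for a small and a large $n$; the function names and the chosen values are assumptions made for the illustration.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    return comb(n, x) * p**x * (1.0 - p)**(n - x)

def poisson_pmf(x, mu):
    return mu**x * exp(-mu) / factorial(x)

def max_difference(n, mu, x_max=20):
    """Largest |binomial - Poisson| over x = 0..x_max with p = mu / n."""
    p = mu / n
    return max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, mu))
               for x in range(x_max + 1))

diff_small_n = max_difference(20, 2.0)      # n = 20, p = 0.1
diff_large_n = max_difference(20000, 2.0)   # n = 20000, p = 0.0001
```

The discrepancy shrinks by roughly the same factor by which $n$ grows, as expected from the limiting argument.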
Thus, the Poisson distribution is a limiting case of the binomial distribution in the limit of large $n$ and small $p$. To make use of the binomial distribution we need to know both $n$ and $p$. In some instances the only information we have is their product, i.e., the mean number of occurrences, $\mu$. For example, traffic accidents are rare events and the number of accidents per unit of time is well described by the Poisson distribution. The number of traffic accidents that occur each day is usually recorded by the police department, but not the number of cars that are not involved in an accident.

Mean of a Poisson distribution:

$$\mu = \langle X \rangle = \sum_{x=0}^{\infty} x\, \frac{\mu^x e^{-\mu}}{x!} = \mu e^{-\mu} \sum_{x=1}^{\infty} \frac{\mu^{x-1}}{(x - 1)!}. \tag{5.30}$$

Let $y = x - 1$:

$$= \mu e^{-\mu} \sum_{y=0}^{\infty} \frac{\mu^y}{y!} = \mu e^{-\mu} e^{\mu} = \mu. \tag{5.31}$$

The mean of a Poisson distribution $= \mu$. (For a binomial distribution $\mu = np$.)

Cumulative distribution:

$$F(x|\mu) = \sum_{x_i = 0}^{x} \frac{\mu^{x_i} e^{-\mu}}{x_i!}. \tag{5.32}$$

Poisson variance:

$$\sigma^2(X) = \langle X^2 \rangle - \langle X \rangle^2 \tag{5.33}$$

$$\langle X^2 \rangle = \langle X(X - 1) + X \rangle = \langle X(X - 1) \rangle + \langle X \rangle. \tag{5.34}$$

Then

$$\langle X(X - 1) \rangle = \sum_{x=0}^{\infty} \frac{x(x - 1) \mu^x e^{-\mu}}{x!} = e^{-\mu} \mu^2 \sum_{x=2}^{\infty} \frac{\mu^{x-2}}{(x - 2)!} = \mu^2 e^{-\mu} e^{+\mu} = \mu^2. \tag{5.35}$$

Then $\langle X^2 \rangle = \mu^2 + \mu$ and $\sigma^2(X) = \mu^2 + \mu - \langle X \rangle^2 = \mu$, so that $\sigma(X) = \sqrt{\mu}$.

Note: for a binomial distribution, $\sigma^2 = np(1 - p) \to np = \mu$ as $p \to 0$.

Figure 5.6 illustrates how the shape of the Poisson distribution varies with $\mu$. As $\mu$ increases, the shape of the Poisson distribution asymptotically approaches a Gaussian distribution. The dashed curve in the $\mu = 40$ panel is a Gaussian distribution with a mean $= \mu$ and a standard deviation $\sigma = \sqrt{\mu}$.

Examples of situations described by a Poisson distribution:

* Number of telephone calls on a line in a given interval.
* Number of shoppers entering a store in a given interval.
* Number of failures of a product in a given interval.
* Number of photons detected from a distant quasar in a given time interval.
* Number of meteorites to fall per unit area of land.
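The results for the Poisson mean and variance can be confirmed by summing the series directly. This Python sketch is illustrative only: the function name and the truncation point `x_max` are assumptions, and the value 40 is chosen to match the last panel of Figure 5.6.

```python
from math import exp

def poisson_moments(mu, x_max=200):
    """Mean and variance of a Poisson distribution by direct summation over x = 0..x_max."""
    pmf = exp(-mu)  # p(0 | mu)
    mean = 0.0
    second = 0.0
    for x in range(x_max + 1):
        mean += x * pmf
        second += x * x * pmf
        pmf *= mu / (x + 1)  # recurrence p(x + 1) = p(x) * mu / (x + 1)
    return mean, second - mean**2

mean_40, var_40 = poisson_moments(40.0)
```

Both returned values are numerically equal to 40, so the standard deviation is $\sqrt{40} \approx 6.3$, the width of the Gaussian drawn for comparison in the figure.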
Figure 5.6 As the Poisson mean increases, the shape of the Poisson distribution becomes more symmetric. (Three panels, labeled λ = 1, λ = 2 and λ = 40 in the figure, each showing p(x) versus x.)
5.7.3 Negative binomial distribution

Imagine a binomial scenario involving a sequence of independent trials where the probability of success of each trial is $p$. Instead of fixing the number of trials, $n$, suppose we continue the trials until exactly $k$ successes have occurred.⁶ Here, the random variable is $n$, the number of trials necessary for exactly $k$ successes. If the independent trials continue until the $k$th success, then the last trial must have been a success. Prior to the last trial, there must have been $k - 1$ successes in $n - 1$ trials. The number of distinct ways $k - 1$ successes can be observed in $n - 1$ trials is $\binom{n - 1}{k - 1}$. Therefore, the probability of $k$ successes in the $n$ trials with the last being a success is

$$p(n|k, p) = \binom{n - 1}{k - 1} p^{k-1} (1 - p)^{n - k}\, p = \binom{n - 1}{k - 1} p^k (1 - p)^{n - k}. \tag{5.36}$$

Equation (5.36) is called the negative binomial distribution. Let the number of trials required to achieve $k$ successes $= X + k$. Then the random variable $X$ is the number of failures before $k$ successes, which is given by

$$p(x|k, p) = \begin{cases} \dbinom{k + x - 1}{k - 1} p^k (1 - p)^x, & x = 0, 1, 2, \ldots;\ k = 1, 2, \ldots;\ 0 \le p \le 1 \\ 0, & \text{elsewhere.} \end{cases}$$

For the special case of one success, $k = 1$, the above distribution is known as the geometric distribution,

$$p(x|p) = p(1 - p)^x. \tag{5.37}$$

The geometric random variable represents the number of failures before the first success.

⁶ For example, an astronomer could plan to continue taking spectra of candidate stars until exactly 50 white dwarfs have been detected.
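A short Python sketch can confirm that the two forms of the negative binomial distribution agree and that the failure-count form is normalized. The function names and the parameter values ($k = 3$, $p = 0.4$) are illustrative choices, not taken from the book.

```python
from math import comb

def trials_pmf(n, k, p):
    """Equation (5.36): probability that the k-th success occurs on trial n."""
    return comb(n - 1, k - 1) * p**k * (1.0 - p)**(n - k)

def failures_pmf(x, k, p):
    """Probability of x failures before the k-th success."""
    return comb(k + x - 1, k - 1) * p**k * (1.0 - p)**x

# Normalization check (the tail beyond x = 400 is negligible for p = 0.4).
total = sum(failures_pmf(x, 3, 0.4) for x in range(400))

# Geometric special case k = 1: p (1 - p)^x with p = 0.25 and x = 2.
geometric = failures_pmf(2, 1, 0.25)
```

Here `trials_pmf(5, 3, 0.4)` and `failures_pmf(2, 3, 0.4)` describe the same event (two failures before the third success), so they return the same probability.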
5.8 Continuous probability distributions

5.8.1 Normal distribution

The normal (Gaussian) distribution is the most important and widely used probability distribution. One of the reasons why the normal distribution is so useful is the Central Limit Theorem. This theorem will be discussed in detail later, but briefly, it says the following: suppose you have a radioactive source and you measure the average number of decays in one hundred 10-second intervals. (We know that the individual counts obey a Poisson distribution.) If you repeated the experiment many times and hence determined a large number of averages then, according to the Central Limit Theorem, the averages will be normally distributed. The distribution of the sample means (from populations with a finite mean and variance) approaches a normal distribution as the number of terms in the mean approaches infinity. The normal distribution can also be shown to be the limit of a binomial distribution as $n \to \infty$ and $np \gg 1$.

Corollary: Whenever a random variable can be assumed to be the result of a large number of small effects, the distribution is approximately normal.

Gaussian probability density function:

$$f_X(x) = f(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\} \tag{5.38}$$

for $-\infty < x < \infty$; $-\infty < \mu < \infty$; $0 < \sigma^2 < \infty$.

Box 5.4 Mathematica evaluation of a Gaussian or normal distribution:

Needs["Statistics`ContinuousDistributions`"]

The line above loads a package containing a wide range of continuous distributions of importance to statistics, and the following line computes the probability density function at $x$ for a normal distribution with mean $\mu$ and standard deviation $\sigma$.

PDF[NormalDistribution[μ, σ], x] → answer = 0.45662 (μ = 2.0, σ = 0.4, x = 1.5)

The mean and standard deviation of the distribution are given by

Mean[NormalDistribution[μ, σ]] → answer = μ
StandardDeviation[NormalDistribution[μ, σ]] → answer = σ
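For readers without Mathematica, Equation (5.38) can be evaluated in a couple of lines. This is a minimal Python sketch; the function name `normal_pdf` is an illustrative choice.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Equation (5.38): Gaussian probability density with mean mu and standard deviation sigma."""
    return exp(-(x - mu)**2 / (2.0 * sigma**2)) / (sqrt(2.0 * pi) * sigma)

# Same inputs as the Mathematica example in Box 5.4.
value = normal_pdf(1.5, 2.0, 0.4)
```

The result agrees with the value 0.45662 quoted in Box 5.4 to the digits shown there.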
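Central moment generating function:

$$m_X(t) = \langle \exp\{t(X - \mu)\} \rangle = \frac{1}{\sqrt{2\pi}\, \sigma} \int_{-\infty}^{+\infty} \exp\{t(x - \mu)\} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\} dx$$

$$= \frac{1}{\sqrt{2\pi}\, \sigma} \int_{-\infty}^{+\infty} \exp\left[ -\frac{1}{2\sigma^2} \left\{ (x - \mu)^2 - 2\sigma^2 t (x - \mu) \right\} \right] dx.$$

Adding and subtracting $\sigma^4 t^2$ in the term in the curly braces:

$$\{(x - \mu)^2 - 2\sigma^2 (x - \mu) t + \sigma^4 t^2 - \sigma^4 t^2\} = \{[x - \mu - \sigma^2 t]^2 - \sigma^4 t^2\}$$

$$\to m_X(t) = \exp\left( \frac{\sigma^2 t^2}{2} \right) \frac{1}{\sqrt{2\pi}\, \sigma} \int_{-\infty}^{+\infty} \exp\left\{ -\frac{(x - \mu - \sigma^2 t)^2}{2\sigma^2} \right\} dx = \exp\left( \frac{\sigma^2 t^2}{2} \right) = 1 + \frac{\sigma^2 t^2}{2} + \frac{\sigma^4 t^4}{4 \cdot 2!} + \frac{\sigma^6 t^6}{8 \cdot 3!} + \cdots$$

$$\mathrm{Var}(X) = \mu_2 = \left.\frac{d^2 m_X(t)}{dt^2}\right|_{t=0} = \sigma^2$$

$$\mu_3 = 0 \quad \text{and} \quad \mu_4 = \frac{4!\, \sigma^4}{4 \cdot 2!} = 3\sigma^4$$

$$\alpha_4 = \text{coefficient of kurtosis} = \frac{\mu_4}{(\mu_2)^2} = 3.$$

Note: for a Poisson distribution, $\alpha_4 = 3 + \frac{1}{\mu} \to 3$ as $\mu \to \infty$. Also, for a binomial distribution, $\alpha_4 = 3 + \frac{[1 - 6p(1 - p)]}{np(1 - p)} \to 3$ as $n \to \infty$.

Convention: If the random variable $X$ is known to follow a normal distribution with mean $\mu$ and variance $\sigma^2$, then it is common to abbreviate this by $X \sim N(\mu, \sigma^2)$. For convenience, the following transformation to the standard random variable is often made:

$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1).$$

In terms of $Z$ the normal distribution becomes

$$f(z) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z^2}{2} \right).$$

Cumulative distribution function:

$$F(x) \equiv F(x|\mu, \sigma) = p(X \le x) = \frac{1}{\sqrt{2\pi}\, \sigma} \int_{-\infty}^{x} \exp\left\{ -\frac{(t - \mu)^2}{2\sigma^2} \right\} dt.$$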
This integral cannot be integrated in closed form. $F(x|\mu, \sigma)$ can be tabulated as a function of $\mu$ and $\sigma$, which requires a separate table for each pair of values. Since there are an infinite number of values for $\mu$ and $\sigma$, this task is not practical. Instead, it is common to calculate the cumulative distribution function of the standard random variable $Z$. Then:

$$p(X \le x) = p\left( Z \le \frac{x - \mu}{\sigma} \right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z = (x - \mu)/\sigma} \exp\left( -\frac{z'^2}{2} \right) dz' = F(z).$$

Usually, $F(z)$ is expressed in terms of the error function, $\mathrm{erf}(z)$:

$$\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z \exp(-u^2)\, du, \qquad \mathrm{erf}(-z) = -\mathrm{erf}(z). \tag{5.39}$$

Then $F(z) = \frac{1}{2} + \frac{1}{2}\, \mathrm{erf}(z/\sqrt{2})$. The error function is in many computer libraries.

Box 5.5 In Mathematica, it can be evaluated with the command Erf[z]. However, it is simpler to compute the cumulative probability with the Mathematica command: CDF[NormalDistribution[μ, σ], x].

For any normally distributed random variable:

$p(\mu - \sigma \le X \le \mu + \sigma) = 0.683$
$p(\mu - 2\sigma \le X \le \mu + 2\sigma) = 0.954$
$p(\mu - 3\sigma \le X \le \mu + 3\sigma) = 0.997$
$p(\mu - 4\sigma \le X \le \mu + 4\sigma) = 0.999\,937$
$p(\mu - 5\sigma \le X \le \mu + 5\sigma) = 0.999\,999\,43.$
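The relation between the cumulative normal distribution and the error function is easy to exercise with the `erf` function in Python's standard `math` module. The sketch below recomputes the probabilities of lying within 1 to 5 standard deviations of the mean; the helper names are illustrative.

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """F(z) = 1/2 + 1/2 erf(z / sqrt(2)) with z = (x - mu) / sigma."""
    return 0.5 + 0.5 * erf((x - mu) / (sigma * sqrt(2.0)))

def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return normal_cdf(k) - normal_cdf(-k)

coverage = {k: prob_within(k) for k in range(1, 6)}
```

Because the result depends only on the standardized variable, the same numbers hold for any $\mu$ and $\sigma$.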
Figure 5.7 shows graphs of the normal distribution (left) and the cumulative normal distribution (right) for three different values of $\sigma$.

Figure 5.7 Graphs of the normal distribution (left, f(x) versus x) and the cumulative normal distribution (right, F(x) versus x) for σ = 0.35, σ = 0.7 and σ = 1.0.
5.8.2 Uniform distribution

Examples of a uniform distribution include round-off errors and quantization of noise in linear analog-to-digital conversion. A random variable is said to be uniformly distributed over the interval $(a, b)$ if

$$f(x|a, b) = \begin{cases} 1/(b - a), & \text{for } a \le x \le b \\ 0, & \text{elsewhere} \end{cases} \tag{5.40}$$

mean $= (a + b)/2$; variance $= (b - a)^2/12$; $\alpha_3 = 0$; $\alpha_4 = 9/5$; no mode; median $=$ mean.

The special case of $a = 0$ and $b = 1$ plays a key role in the computer simulation of values of a random variable with a specified distribution, which will be discussed in Section 5.13.

Cumulative distribution function:

$$F(x|a, b) = \begin{cases} 0, & \text{for } x < a \\ (x - a)/(b - a), & \text{for } a \le x \le b \\ 1, & \text{for } x > b. \end{cases} \tag{5.41}$$
5.8.3 Gamma distribution

The gamma distribution is used extensively in several diverse areas. For example, it is used to represent the random time until the occurrence of some event which occurs only if exactly $\alpha$ independent sub-events occur, where the sub-events occur at an average rate $\lambda = 1/\beta$ per unit of time.

$$f(x) = f(x|\alpha, \beta) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\, \beta^{\alpha}}\, x^{\alpha - 1} \exp(-x/\beta), & \text{for } x > 0;\ \alpha, \beta > 0 \\ 0, & \text{elsewhere.} \end{cases} \tag{5.42}$$

mean $= \alpha\beta$; variance $= \alpha\beta^2$; $\alpha_3 = 2/\sqrt{\alpha}$; $\alpha_4 = 3\left(1 + \frac{2}{\alpha}\right)$.

Note: $\Gamma(n)$, the gamma function, $= \int_0^\infty u^{n-1} \exp(-u)\, du$ for $n > 0$. Some properties of the gamma function are:

1. $\Gamma(n + 1) = n!$ (for $n$ an integer)
2. $\Gamma(n + 1) = n\Gamma(n)$
3. $\Gamma(1/2) = \sqrt{\pi}$

Cumulative distribution function: The cumulative distribution function can be expressed in closed form if the shape parameter $\alpha$ is a positive integer.

$$F(x|\alpha, \beta) = 1 - \left[ 1 + \frac{x}{\beta} + \frac{1}{2!} \left( \frac{x}{\beta} \right)^2 + \cdots + \frac{1}{(\alpha - 1)!} \left( \frac{x}{\beta} \right)^{\alpha - 1} \right] \exp(-x/\beta). \tag{5.43}$$

Example: Suppose a metal specimen will break after exactly two stress cycles. If stress occurs independently and at an average rate of 2 per 100 hours, determine the probability that the length of time until failure is within one standard deviation of the average time.

Solution: Let $X$ be a random variable representing the length of time until the second stress cycle. $X$ is gamma-distributed with $\alpha = 2$ and $\beta = 50$.

$\mu = \text{mean} = \alpha\beta = 2 \times 50 = 100$

standard deviation $\sigma = \sqrt{\alpha\beta^2} = \sqrt{2 \times 50^2} = 70.71$

$p(\mu - \sigma < X < \mu + \sigma) = p(29.29 < X < 170.71) = F(170.71|2, 50) - F(29.29|2, 50).$

Equation (5.43) for the cumulative distribution function reduces to:

$$F(x|\alpha, \beta) = 1 - (1 + x/50) \exp(-x/50), \quad x > 0$$

$$\to p(\mu - \sigma < X < \mu + \sigma) = 0.7375.$$

When $\alpha$ is an integer, the gamma distribution is known as the Erlang probability model, after the Danish scientist who used it to study telephone traffic problems.
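The worked example can be reproduced with a direct implementation of the closed-form cumulative distribution of Equation (5.43). This Python sketch uses illustrative function and variable names, and it evaluates the limits $\mu \pm \sigma$ without rounding them to two decimals.

```python
from math import exp, factorial, sqrt

def erlang_cdf(x, alpha, beta):
    """Equation (5.43): gamma CDF for a positive integer shape parameter alpha."""
    if x <= 0:
        return 0.0
    r = x / beta
    partial = sum(r**j / factorial(j) for j in range(alpha))
    return 1.0 - partial * exp(-r)

alpha, beta = 2, 50.0
mu = alpha * beta               # mean = 100 hours
sigma = sqrt(alpha) * beta      # standard deviation = 70.71 hours
prob = erlang_cdf(mu + sigma, alpha, beta) - erlang_cdf(mu - sigma, alpha, beta)
```

With unrounded limits the probability comes out at about 0.7375; rounding the limits before evaluating $F$ changes the fourth decimal place.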
5.8.4 Beta distribution

While we are on the subject of sampling distributions, here is one that plays a useful role in Bayesian inference. The family of beta distributions allows for a wide variety of shapes.

$$f(x) \equiv f(x|\alpha, \beta) = \begin{cases} \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}, & 0 < x < 1;\ \alpha, \beta > 0 \\ 0, & \text{elsewhere.} \end{cases} \tag{5.44}$$

mean $= \alpha/(\alpha + \beta)$; variance $= \dfrac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$.

Note: the $\beta$ appearing in the beta distribution has no connection with the $\beta$ used in the previously mentioned gamma distribution.

Some examples of the beta distribution are illustrated in Figure 5.8. Any smooth unimodal distribution in the interval $x = 0$ to 1 is likely to be reasonably well approximated by a beta distribution, so it is often possible to approximate a Bayesian prior distribution in this way. If the likelihood function is a binomial distribution, then the Bayesian posterior will have a simple analytic form; namely, another beta distribution. More generally, when both the prior and posterior belong to the same distribution family (in this case the beta distribution), then the prior and likelihood are called conjugate distributions. This can greatly simplify any calculations that involve the posterior distribution. The beta distribution is often referred to as a conjugate prior for the binomial likelihood.⁷ Other well-known examples of conjugate priors are the Gaussian (when dealing with a Gaussian likelihood) and the gamma distribution (when dealing with a Poisson likelihood). In each case, the posterior and prior are members of the same family of distributions.

Figure 5.8 Graphs of the beta density function for various values of α, β. (Nine panels of f(x) versus x, for (α, β) = (1.0, 1.0), (1.0, 2.0), (1.5, 1.5), (3.0, 1.5), (3.0, 3.0), (1.5, 0.5), (0.5, 0.5), (0.5, 1.5) and (1.5, 3.0).)
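The conjugacy of the beta prior and the binomial likelihood can be illustrated numerically. The Python sketch below multiplies a beta prior by a binomial likelihood, normalizes the product by a simple midpoint rule, and compares the result with the beta density whose parameters are updated as in footnote 7. All names, the prior parameters (1.5, 3.0) and the data (7 successes in 10 trials) are illustrative assumptions.

```python
from math import gamma

def beta_pdf(x, a, b):
    """Equation (5.44): beta density on 0 < x < 1."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1.0 - x)**(b - 1)

def posterior_parameters(a, b, successes, trials):
    """Conjugate update: Be(a, b) prior and binomial data give Be(a + y, b + n - y)."""
    return a + successes, b + trials - successes

def normalized_product(x, a, b, y, n, steps=20000):
    """Prior times binomial likelihood at x, normalized numerically (midpoint rule)."""
    def unnormalized(u):
        return beta_pdf(u, a, b) * u**y * (1.0 - u)**(n - y)
    h = 1.0 / steps
    norm = sum(unnormalized((i + 0.5) * h) for i in range(steps)) * h
    return unnormalized(x) / norm

a_post, b_post = posterior_parameters(1.5, 3.0, 7, 10)
numeric = normalized_product(0.6, 1.5, 3.0, 7, 10)
analytic = beta_pdf(0.6, a_post, b_post)
```

The numerically normalized posterior and the analytic beta density agree closely, which is the practical benefit of a conjugate prior: no integration is needed to normalize the posterior.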
⁷ For example, if the prior is a beta distribution, $\mathrm{Be}(\alpha, \beta)$, then we can write

$$p(x|I) \propto x^{\alpha - 1} (1 - x)^{\beta - 1} \quad (0 \le x \le 1). \tag{5.45}$$

Suppose the likelihood is a binomial, with $p(y|n, x) =$ the probability of obtaining $y$ successes in $n$ trials where the probability of success in any trial is $x$. Then we can write the likelihood as

$$p(D|x, I) \propto x^y (1 - x)^{n - y}. \tag{5.46}$$

The posterior is proportional to the product of Equations (5.45) and (5.46), and given by

$$p(x|D, I) \propto x^{\alpha + y - 1} (1 - x)^{\beta + n - y - 1}, \tag{5.47}$$

which is another beta distribution, $\mathrm{Be}(\alpha + y, \beta + n - y)$.

5.8.5 Negative exponential distribution

The negative exponential distribution is a special case of the gamma distribution for $\alpha = 1$.

$$f(x) \equiv f(x|\tau) = \begin{cases} \frac{1}{\tau} \exp(-x/\tau), & \text{for } x > 0,\ \tau > 0 \\ 0, & \text{elsewhere.} \end{cases} \tag{5.48}$$

The random variable $X$ is the waiting time until the occurrence of the first Poisson event. That is, the negative exponential distribution can model the length of time between successive Poisson events. It has been used extensively as a time-to-failure model in reliability problems and in waiting-line problems. $\tau$ is the mean time between Poisson events. The cumulative negative exponential distribution function is given by

$$F(x|\tau) = 1 - \exp(-x/\tau).$$
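5.9 Central Limit Theorem

Let $X_1, X_2, X_3, \ldots, X_n$ be $n$ independent and identically distributed (IID) random variables with unspecified probability distributions, and having a finite mean, $\mu$, and variance, $\sigma^2$. The sample average, $\bar{X} = (X_1 + X_2 + X_3 + \cdots + X_n)/n$, has a distribution with mean $\mu$ and variance $\sigma^2/n$ that tends to a normal (Gaussian) distribution as $n \to \infty$. In other words, the standard random variable

$$\frac{(\bar{X} - \mu)}{\sigma/\sqrt{n}} \to \text{standard normal distribution.}$$

Proof: As a proof, we will show that the moment generating function of $(\bar{X} - \mu)\sqrt{n}/\sigma$ tends to that of a standard normal as $n \to \infty$.

Let $z_i = (X_i - \mu)/\sigma$, $i = 1, \ldots, n$. Then, by definition,

$$\langle z_i \rangle = 0, \qquad \langle z_i^2 \rangle = 1.$$

Let $Y = \dfrac{(\bar{X} - \mu)}{\sigma/\sqrt{n}}$. Now

$$\sum_{i=1}^{n} z_i = \frac{1}{\sigma} \sum_{i=1}^{n} (X_i - \mu) = \frac{n}{\sigma} (\bar{X} - \mu) = \sqrt{n}\, Y \quad \text{or} \quad Y = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} z_i.$$

Then the moment generating function $m_Y(t)$ is given by

$$m_Y(t) = \langle \exp(tY) \rangle = \left\langle \exp\left( t \sum_{i=1}^{n} \frac{z_i}{\sqrt{n}} \right) \right\rangle = \left\langle \exp\left( t \frac{z_i}{\sqrt{n}} \right) \right\rangle^n \quad \text{since the } z_i\text{'s are IID.}$$

Now

$$\exp\left( t \frac{z_i}{\sqrt{n}} \right) = 1 + \frac{t z_i}{\sqrt{n}} + \frac{t^2 z_i^2}{2!\, n} + \frac{t^3 z_i^3}{3!\, n^{3/2}} + \cdots$$

and since $\langle z_i \rangle = 0$, $\langle z_i^2 \rangle = 1$ for all $i$,

$$\left\langle \exp\left( \frac{t z_i}{\sqrt{n}} \right) \right\rangle = 1 + \frac{t^2}{2n} + \frac{t^3 \langle z_i^3 \rangle}{3!\, n^{3/2}} + \cdots$$

$$m_Y(t) = \left[ 1 + \frac{t^2}{2n} + \frac{t^3 \langle z_i^3 \rangle}{3!\, n^{3/2}} + \cdots \right]^n = \left[ 1 + \frac{1}{n} \left( \frac{t^2}{2} + \frac{t^3 \langle z_i^3 \rangle}{3!\, \sqrt{n}} + \cdots \right) \right]^n = \left[ 1 + \frac{u}{n} \right]^n,$$

where

$$u = \frac{t^2}{2} + \frac{t^3 \langle z_i^3 \rangle}{3!\, \sqrt{n}} + \cdots, \qquad \lim_{n \to \infty} \left[ 1 + \frac{u}{n} \right]^n = e^u.$$

In the limit as $n \to \infty$, all terms in $u \to 0$ except the first, $t^2/2$, since all other terms have a power of $n$ in the denominator.

$$\lim_{n \to \infty} m_Y(t) = \lim_{n \to \infty} e^u = \exp\left( \frac{t^2}{2} \right) = \text{moment generating function of a standard normal.}$$

This completes the proof.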
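A small simulation gives a feel for the theorem. The Python sketch below averages exponentially distributed variates (a deliberately skewed parent distribution with $\mu = \sigma = 1$), forms the standard random variable of the proof, and checks that it behaves like a standard normal. The sample size, number of trials, seed and function name are all illustrative assumptions.

```python
import random
from math import sqrt

def standardized_means(n, trials, seed=1):
    """Draw `trials` sample means of n exponential variates (mean 1, variance 1)
    and return them as standard random variables (mean - mu) / (sigma / sqrt(n))."""
    rng = random.Random(seed)
    out = []
    for _ in range(trials):
        avg = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append((avg - 1.0) * sqrt(n))
    return out

y = standardized_means(n=50, trials=4000)
mean_y = sum(y) / len(y)
var_y = sum((v - mean_y)**2 for v in y) / len(y)
frac_1sigma = sum(abs(v) <= 1.0 for v in y) / len(y)  # expect roughly 0.68
```

For a standard normal the three quantities would be 0, 1 and 0.683; the simulated values scatter around these within the expected sampling error.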
5.10 Bayesian demonstration of the Central Limit Theorem

The proof of the CLT given in Section 5.9 does little to develop the reader's intuition on how it works and what its limitations are. To help on both counts, we give the following demonstration of the CLT, which is adapted from the work of M. Tribus (1969). In data analysis, it is common practice to compute the average of a repeated measurement and perform subsequent analysis using the average value. The probability density function (PDF) of the average is simply related to the PDF of the sum, which is evaluated below using probability theory as logic. In this demonstration, we will be concerned with measurements of the length of a widget⁸ which is composed of many identical components that have all been manufactured on the same assembly line. Because of variations in the manufacturing process, the widgets do not all end up with exactly the same length. The components are analogs of the data points and the widget is the analog of the sum of a set of data points.

⁸ A widget is some unspecified gadget or device.

$I \equiv$ "a widget is composed of two components. Length of widget = sum of component lengths."

$Y \equiv$ "Length of widget lies between $y$ and $y + dy$."

Note: $Y$ is a logical proposition which appears in the probability function and $y$ is an ordinary algebraic variable.

$X_1 \equiv$ "Length of component 1 lies between $x_1$ and $x_1 + dx_1$."

$X_2 \equiv$ "Length of component 2 lies between $x_2$ and $x_2 + dx_2$."

We are given that

$$p(X_1|I) = f_1(x_1), \qquad p(X_2|I) = f_2(x_2).$$

Problem: Find $p(Y|I)$.

Now $Y$ depends on propositions $X_1$ and $X_2$. To evaluate $p(Y|I)$, we first extend the conversation (Tribus, 1969) to include these propositions by writing down the joint probability distribution $p(Y, X_1, X_2|I)$. We can then solve for $p(Y|I)$ by using the marginalizing operation as follows:

$$p(Y|I) = \int\!\!\int dX_1\, dX_2\, p(Y, X_1, X_2|I) = \int\!\!\int dX_1\, dX_2\, p(X_1|I)\, p(X_2|I)\, p(Y|X_1, X_2, I),$$

where we assume $X_1$ and $X_2$ are independent. Since $y = x_1 + x_2$,

$$p(Y|X_1, X_2, I) = \delta(y - x_1 - x_2)$$

$$\to p(Y|I) = \int dx_1\, f_1(x_1) \int dx_2\, f_2(x_2)\, \delta(y - x_1 - x_2).$$

The presence of the delta function in the second integral serves to pick out the value of the integrand at $x_2 = y - x_1$, so we have

$$p(Y|I) = \int dx_1\, f_1(x_1)\, f_2(y - x_1). \tag{5.49}$$

The right hand side of this equation is the convolution integral.⁹ The convolution operation is demonstrated in Figure 5.9 for the case where both $f_1(x)$ and $f_2(x)$ are uniform PDFs of the same width. The result is a triangular distribution.

⁹ For more details on the convolution integral and how to evaluate it using the Fast Fourier Transform, see Sections B.4 and B.10.
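A discrete version of the convolution integral in Equation (5.49) is enough to see the triangular result emerge. The Python sketch below samples two identical uniform PDFs on a grid and convolves them directly (a plain double loop, not the Fast Fourier Transform approach mentioned in footnote 9); the grid spacing and names are illustrative assumptions.

```python
def convolve(f1, f2, dx):
    """Discrete approximation to Equation (5.49) on a uniform grid of spacing dx."""
    out = [0.0] * (len(f1) + len(f2) - 1)
    for i, a in enumerate(f1):
        for j, b in enumerate(f2):
            out[i + j] += a * b * dx
    return out

dx = 0.01
uniform = [1.0] * 100                     # uniform PDF on [0, 1), sampled at 100 points
triangle = convolve(uniform, uniform, dx)  # PDF of the sum of two components

area = sum(triangle) * dx   # total probability, should be 1
peak = max(triangle)        # height of the triangle at its apex
```

The output rises linearly to a peak at the sum of the two component means and falls linearly afterwards, and its area remains 1, as required of a PDF.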
Figure 5.9 The convolution operation. (Panels, top to bottom: f2(x), f2(−x), f2(y − x) and f1(x) versus x, followed by the resulting convolution versus y.)
What if the widget is composed of three components? Let $Z \equiv$ "Length of widget is between $z$ and $z + dz$."

$$p(Z|I) = \int\!\!\int\!\!\int dX_1\, dX_2\, dX_3\, p(Z, X_1, X_2, X_3|I) = \int\!\!\int dY\, dX_3\, p(Z, X_3, Y|I) = \int dY\, p(Y|I) \int dX_3\, p(X_3|I)\, p(Z|X_3, Y, I),$$

where $p(Z|X_3, Y, I) = \delta(z - y - x_3)$ and $p(Y|I)$ is the solution to the two-component case.

$$p(Z|I) = \int dy\, p(Y|I)\, f_3(z - y) = \int dy\, f(y)\, f_3(z - y).$$

Another convolution! As shown below, the probability density function (PDF) of the average is simply related to the PDF of the sum (which we have just evaluated). Let $x_A = (x_1 + x_2 + x_3)/3 = z/3$, or $z = 3x_A$.

$$f(x_A)\, dx_A = f(z)\, dz = 3 f(z)\, dx_A \to p(X_A|I) = 3\, p(Z|I).$$

Figure 5.10 compares the PDF of the average for the case of $n = 1$, 2, 4 and 8 components. According to the Central Limit Theorem, $p(X_A|I)$ tends to a Gaussian distribution as the number of data being averaged becomes larger. After averaging only four components, the PDF has already taken on the appearance of a Gaussian. If instead of a uniform distribution, our starting PDF had two peaks (bimodal), then a larger number of components would have been required before the PDF of the average was a reasonable approximation to a Gaussian.

On the basis of this analysis, we come to the following generalization of the Central Limit Theorem: "Any quantity that stems from a large number of sub-processes is expected to have a Gaussian distribution."

The Central Limit Theorem (CLT) is both remarkable and of great practical value in data analysis. In frequentist statistics, we are often uncertain of the form of the sampling distribution the data are drawn from. The equivalent problem in a Bayesian analysis is the choice of likelihood function to use. By working with the averages of data points (frequently as few as five points), we can appeal to the CLT and make use of a Gaussian distribution for the sampling distribution or likelihood function. The CLT also provides a deep understanding for why measurement uncertainties frequently have a Gaussian distribution. This is because the measured quantity is often the result of a large number of effects, i.e., is some kind of averaged resultant of these effects (random variables in frequentist language). Since the distribution of the average of a collection of random variables tends to a Gaussian, this is often what we observe.

Figure 5.10 The left panel shows a comparison of the probability density function of the average for the case of $n = 1$, 2, 4 and 8 components. The right panel compares the $n = 8$ case to a Gaussian (dashed curve) with the same mean and variance. The two curves are so similar it is difficult to separate them.

Two exceptions to the Central Limit Theorem exist. They are:

1. One of the $p(X_i|I)$ is much broader than all of the others. It is apparent from the above demonstration that convolving a very wide uniform distribution with a narrow uniform distribution will give a result that is essentially the same as the original wide uniform distribution.

2. The variances of one or more of the individual $p(X_i|I)$ distributions are infinite. A Cauchy or Lorentzian distribution is an example of such a distribution:

$$p(x|\alpha, \beta, I) = \frac{\beta}{\pi [\beta^2 + (x - \alpha)^2]}.$$

Its very wide wings lead to an infinite second moment, i.e., the variance of $X$ is infinite and the sample mean is not a useful quantity. One example of this is the natural line shape of a spectral line.
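5.11 Distribution of the sample mean

It is apparent from the previous section that the PDF of the sample mean (average) rapidly approaches a Gaussian in shape and the width of this Gaussian becomes narrower as the number of samples in the average increases. In this section, we want to quantify this latter effect using the frequentist approach. Let a random sample $X_1, X_2, \ldots, X_n$ consist of $n$ IID random variables such that

$$\langle X_i \rangle = \mu \quad \text{and} \quad \mathrm{Var}(X_i) = \sigma^2.$$

Then

$$\langle \bar{X} \rangle = \left\langle \frac{1}{n} \sum X_i \right\rangle = \frac{1}{n} \sum \langle X_i \rangle = \frac{1}{n} \sum \mu \quad \to \quad \langle \bar{X} \rangle = \mu,$$

and

$$\mathrm{Var}(\bar{X}) = \left\langle (\bar{X} - \mu)^2 \right\rangle = \left\langle \left( \frac{1}{n} \sum X_i - \frac{1}{n} \sum \mu \right)^2 \right\rangle = \frac{1}{n^2} \left\langle \left[ \sum (X_i - \mu) \right]^2 \right\rangle = \frac{1}{n^2} \sum \left\langle (X_i - \mu)^2 \right\rangle = \frac{1}{n^2} \sum \sigma^2 = \frac{1}{n^2}\, n\sigma^2$$

$$\to \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}. \tag{5.50}$$

The cross terms $\langle (X_i - \mu)(X_j - \mu) \rangle$ with $i \ne j$ vanish because the $X_i$ are independent, which is why only the sum of the individual variances survives.

The following is true for any distribution with a finite variance:

$$\langle \bar{X} \rangle = \mu, \qquad \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}.$$

Conclusion: The distribution of a sample average $\bar{X}$ sharpens around $\mu$ as the sample size $n$ increases.¹⁰ Signal averaging is based on this principle.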
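The result for the variance of the sample mean can be verified exactly for a simple discrete case. The Python sketch below builds the distribution of the sum of $n$ fair dice by repeated convolution, using exact rational arithmetic, and computes the mean and variance of the average. The fair die and the function names are illustrative choices; for a single die $\mu = 7/2$ and $\sigma^2 = 35/12$.

```python
from fractions import Fraction

def pmf_of_sum(pmf, n):
    """PMF of the sum of n IID variables, built by repeated convolution."""
    total = {0: Fraction(1)}
    for _ in range(n):
        new = {}
        for s, ps in total.items():
            for x, px in pmf.items():
                new[s + x] = new.get(s + x, Fraction(0)) + ps * px
        total = new
    return total

def mean_and_variance_of_average(pmf, n):
    """Exact mean and variance of the average of n IID draws from pmf."""
    sums = pmf_of_sum(pmf, n)
    mean = sum(Fraction(s, n) * p for s, p in sums.items())
    var = sum((Fraction(s, n) - mean)**2 * p for s, p in sums.items())
    return mean, var

die = {face: Fraction(1, 6) for face in range(1, 7)}
results = {n: mean_and_variance_of_average(die, n) for n in (1, 4, 16)}
```

In every case the mean stays at 7/2 while the variance is exactly $(35/12)/n$.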
5.11.1 Signal averaging example

Every second a spectrometer output consisting of 64 voltage levels, corresponding to 64 frequencies, is sampled by the computer. These voltages are added to a memory buffer containing the results of all previous one-second spectrometer readings. The accumulated spectra are shown in Figure 5.11 at different stages. Although no signal is evident above the noise level in the first spectrum, the signal is clearly evident (near channel 27) after eight spectra have been summed.

Let $S_i \propto n\bar{X}_i =$ the signal in the $i$th channel of $n$ summed spectra, and $N_i \propto n\sqrt{\mathrm{Var}(\bar{X}_i)} =$ the noise in the $i$th channel. Then

$$\frac{S_i}{N_i} = \frac{n\bar{X}_i}{n\sigma/\sqrt{n}} = \sqrt{n}\, \frac{\bar{X}_i}{\sigma}.$$

In radio astronomy, we are often trying to detect signals which are $\approx 10^{-3}$ of the noise. For $S/N \approx 5$, $n \approx 2.5 \times 10^7$. However, in these cases, the time required for one independent sample is often much less than one microsecond, and is determined by the radio astronomy receiver bandwidth and detector time constant.
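The arithmetic behind the radio astronomy estimate follows from inverting the square-root law. This Python sketch uses illustrative function names and expresses the signal in units of the single-sample noise.

```python
from math import sqrt

def signal_to_noise(n, signal, sigma):
    """S/N after averaging n independent samples: sqrt(n) * signal / sigma."""
    return sqrt(n) * signal / sigma

def samples_required(target_snr, signal, sigma):
    """Number of independent samples needed to reach target_snr."""
    return (target_snr * sigma / signal)**2

# Signal equal to 1e-3 of the noise, target S/N of 5.
n_needed = samples_required(5.0, 1e-3, 1.0)
```

The required number of samples scales as the square of the target signal-to-noise ratio, so doubling the target quadruples the observing time.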
5.12 Transformation of a random variable In Section 5.13, we will want to generate random variables having a variety of probability distributions for use in simulating experimental data. To do that, we must first learn about the probability distribution of a transformed random variable.
Footnote 10: What happens to the average of samples drawn from a distribution which has an infinite variance? In this case, the error bar for the sample mean does not decrease with increasing n. Even though the sample mean is not a good estimator of the distribution mean $\mu$, we can still employ Bayes' theorem to compute the posterior PDF of $\mu$ from the available samples. The PDF continues to sharpen about $\mu$ as the number of samples increases. For a good numerical demonstration of this point, see the lighthouse problem discussed by Sivia (1996) and Gull (1988a).
Figure 5.11 Signal averaging. Every second a spectrometer output consisting of 64 voltage levels, corresponding to 64 frequencies, is added to a computer memory buffer. The summed spectra are shown at different stages. Although no signal is evident above the noise level in the first spectrum, the signal is clearly evident (near channel 27) after eight spectra have been summed. (Diagram labels: Spectrometer, Sample, Computer memory buffer. Panels: 1 spectrum, 2 spectra, 4 spectra, 8 spectra, each plotted against channel number.)
Problem: How do we obtain the probability density function, $f_Y(y)$, of the transformed random variable Y, where $y = g(x)$, from knowledge of the probability density function, $f_X(x)$, of the original random variable X?

The function $y = g(x)$ must be a monotonic (increasing or decreasing), differentiable function of x. Then there exists an inverse function $x = g^{-1}(y)$ which is also monotonic and differentiable, and for every interval dx there is a corresponding interval dy. Then the probability that $y \le Y \le y + dy$ must equal the probability that $x \le X \le x + dx$, or

$$| f_Y(y)\, dy | = | f_X(x)\, dx | . \tag{5.51}$$

Since probabilities are always positive, we can write

$$f_Y(y) = f_X(x) \left| \frac{dx}{dy} \right| . \tag{5.52}$$
Example: Find $f_Z(z)$, where Z is the standard random variable defined by $Z = (X - \mu)/\sigma$, where $\mu$ = mean and $\sigma$ = standard deviation.

$$z = g(x) = \frac{x - \mu}{\sigma} , \tag{5.53}$$

$$x = g^{-1}(z) = \sigma z + \mu \quad \text{and} \quad \frac{dx}{dz} = \sigma . \tag{5.54}$$

Then from Equation (5.52), we can write $f_Z(z) = f_X(x)\, \sigma$.

Suppose $f_X(x) = \dfrac{1}{\sqrt{2\pi}\, \sigma} \exp\left[ -\dfrac{(x - \mu)^2}{2 \sigma^2} \right]$ (normal distribution). Then from Equations (5.52) and (5.53), we obtain

$$f_Z(z) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z^2}{2} \right) . \tag{5.55}$$
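The change of variables in this example can be verified numerically. The short Python sketch below is an illustration only (the function names and the test values of $\mu$ and $\sigma$ are assumptions): it evaluates $f_X(x)\,|dx/dz|$ at $x = \sigma z + \mu$ and compares it with the standard normal density of Equation (5.55).

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def transformed_pdf(z, mu, sigma):
    """f_Z(z) = f_X(x) |dx/dz| with x = sigma*z + mu and dx/dz = sigma (Eq. 5.52)."""
    x = sigma * z + mu
    return normal_pdf(x, mu, sigma) * abs(sigma)

def standard_normal_pdf(z):
    """Right-hand side of Eq. (5.55)."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

if __name__ == "__main__":
    for z in (-2.0, -0.5, 0.0, 1.0, 3.0):
        # The two columns should agree to rounding error.
        print(z, transformed_pdf(z, mu=2.0, sigma=0.7), standard_normal_pdf(z))
```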
5.13 Random and pseudo-random numbers

Computer simulations have become an extremely useful tool for testing data analysis algorithms and analyzing complex systems, which are often composed of many interdependent components. Some examples of their use are given below.

* To simulate experimental data in the design of a complex detector system in many branches of science.
* To test the effectiveness or completeness of some complex analysis program.
* To compute the uncertainties in the parameter estimates derived from nonlinear model fitting.
* To calculate the solution to a statistical mechanics problem which is not amenable to analytical solution.
* To make unpredictable data for use in cryptography, to deal with a variety of authentication and confidentiality problems.
What is usually done is to assume an appropriate probability distribution for each distinct component and to generate a sequence of random or pseudo-random values for each. There are many procedures, known by the generic name of Monte Carlo (see footnote 11), that follow these lines and use the commodity called random numbers, which have to be manufactured somehow. Typically, the sequences of random numbers are generated by numerical algorithms that can be repeated exactly; such sequences are not truly random. However, they exhibit enough random properties to be sufficient for most applications. We consider below possible ways of generating random values from some discrete and continuous probability distributions.

Footnote 11: In Chapter 12, we discuss the important topic of Markov chain Monte Carlo (MCMC) methods, which are dramatically increasing our ability to evaluate the integrals required in a Bayesian analysis of very complicated problems.
The uniform distribution on the interval (0, 1) plays a key role in the generation of random values. From it, we can generate random numbers for any other distribution using the following theorem:

Theorem: For any continuous random variable X, the cumulative distribution function $F(x|\theta)$ with parameter $\theta$ may be represented by a random variable U, which is uniformly distributed on the unit interval.

Proof: By definition, $F(x|\theta) = \int_{-\infty}^{x} f(t|\theta)\, dt$. For each value of x, there is a corresponding value of $F(x|\theta)$ which is necessarily in the interval (0, 1). Also, $F(x|\theta)$ is a random variable by virtue of the randomness of X. For each value u of the random variable U, the function $u = F(x|\theta)$ defines a one-to-one correspondence between U and X having an inverse relationship $x = F^{-1}(u)$. Recall that it was shown earlier how to obtain the PDF $f_Y(y)$ of the transformed random variable $Y = g(X)$ from the knowledge of the PDF of X. The result was

$$f_Y(y) = f_X(x|\theta) \left| \frac{dx}{dy} \right| = f_X[g^{-1}(y)|\theta] \left| \frac{dg^{-1}(y)}{dy} \right| . \tag{5.56}$$

In the present case, this means

$$f_U(u) = f_X[F^{-1}(u)|\theta] \left| \frac{dx}{du} \right| . \tag{5.57}$$

Since $u = F(x|\theta)$,

$$\frac{du}{dx} = \frac{dF(x|\theta)}{dx} = f_X(x|\theta) \;\rightarrow\; \frac{dx}{du} = \{ f_X(x|\theta) \}^{-1} . \tag{5.58}$$

But $x = F^{-1}(u)$. Substituting for x in Equation (5.58), we obtain

$$\frac{dx}{du} = \{ f_X[F^{-1}(u)|\theta] \}^{-1} . \tag{5.59}$$

Substituting Equation (5.59) into Equation (5.57) yields

$$f_U(u) = \frac{f_X[F^{-1}(u)|\theta]}{f_X[F^{-1}(u)|\theta]} \;\rightarrow\; f_U(u) = 1 , \quad 0 \le u \le 1 . \tag{5.60}$$
The essence of the theorem is that in many instances, we are able to determine the value of x corresponding to a value of u such that $F(x|\theta) = u$. For this reason, practically all computer systems have a built-in capability of generating random values for a uniform distribution on the unit interval. Therefore, to generate random variables for any continuous distribution, we need only generate a random number, u, from a uniform distribution and then solve

$$\int_{-\infty}^{x} f_X(t|\theta)\, dt = u \quad \text{for } x ,$$

where $f_X(x|\theta)$ is the PDF of the distribution of interest.

The procedure is illustrated in Figure 5.12. Suppose we want to generate random numbers with a PDF represented by panel (a). Construct the cumulative distribution function, $F(x)$, as shown in panel (b). Generate a sequence of random numbers which have a uniform distribution in the interval 0 to 1. Locate each of these on the y-axis of panel (c) and draw a line parallel to the x-axis to intersect $F(x)$. Drop a perpendicular to the horizontal axis and read off the x value. The distribution of random x values derived in this way will have the desired probability distribution. Panel (c) illustrates this construction for 20 random numbers. A histogram of 500 of these random x values is shown in panel (d).

Figure 5.12 This figure illustrates the construction of random numbers which have a distribution corresponding to that shown in panel (a). It makes use of the cumulative distribution function (CDF) shown in panel (b), and a sequence of random numbers that have a uniform distribution in the interval 0 to 1. The construction is illustrated in panel (c) for 20 random numbers. Panel (d) shows a histogram of 500 random numbers generated by this process. See the text for details. (Axes: panel (a) PDF versus x; panels (b) and (c) CDF versus x; panel (d) Number versus x.)
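The graphical construction of Figure 5.12 can be mimicked numerically when the CDF has no convenient closed-form inverse. The Python sketch below is one possible illustration, not the method used to produce the figure: it tabulates the CDF on a grid with the trapezoid rule and inverts it by linear interpolation. The function names, the grid size and the example density $f(x) = 2x$ on (0, 1) are assumptions made for the demonstration.

```python
import bisect
import random

def build_cdf(pdf, lo, hi, n_grid=1000):
    """Tabulate the CDF of `pdf` on [lo, hi] using the trapezoid rule, normalized to end at 1."""
    xs = [lo + (hi - lo) * k / n_grid for k in range(n_grid + 1)]
    cdf = [0.0]
    for k in range(1, n_grid + 1):
        area = 0.5 * (pdf(xs[k - 1]) + pdf(xs[k])) * (xs[k] - xs[k - 1])
        cdf.append(cdf[-1] + area)
    total = cdf[-1]
    return xs, [c / total for c in cdf]

def draw(xs, cdf, rng):
    """Pick u uniform on [0, 1), find where the tabulated CDF crosses u, and read off x."""
    u = rng.random()
    k = bisect.bisect_left(cdf, u)          # first grid index with cdf[k] >= u
    k = min(max(k, 1), len(xs) - 1)
    c0, c1 = cdf[k - 1], cdf[k]
    if c1 == c0:                            # flat segment: no interpolation possible
        return xs[k]
    frac = (u - c0) / (c1 - c0)
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

if __name__ == "__main__":
    rng = random.Random(3)
    xs, cdf = build_cdf(lambda x: 2.0 * x, 0.0, 1.0)   # example density, mean 2/3
    sample = [draw(xs, cdf, rng) for _ in range(10000)]
    print(sum(sample) / len(sample))                   # should be close to 2/3
```

A histogram of `sample` plays the role of panel (d) in the figure.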
Examples:

1. Uniform distribution on the interval (a, b):

$$f(x|a, b) = \frac{1}{(b - a)} , \quad (a \le x \le b) .$$

First, generate a random value of u in the interval (0, 1), equate it to the cumulative distribution function, integrate and solve:

$$(b - a)^{-1} \int_{a}^{x} dt = u$$

$$x = u (b - a) + a , \quad (a \le x \le b) .$$

2. Negative exponential distribution:

$$f(x|\tau) = \begin{cases} \dfrac{1}{\tau} \exp(-x/\tau) , & \text{for } x > 0,\ \tau > 0 \\ 0 , & \text{elsewhere.} \end{cases} \tag{5.61}$$

$$\frac{1}{\tau} \int_{0}^{x} \exp\left( -\frac{t}{\tau} \right) dt = u$$

$$\frac{1}{\tau} (-\tau) \left[ \exp\left( -\frac{t}{\tau} \right) \right]_{0}^{x} = u$$

$$1 - \exp\left( -\frac{x}{\tau} \right) = u$$

$$x = \tau \ln\left( \frac{1}{1 - u} \right) = -\tau \ln(1 - u) = -\tau \ln(u) . \tag{5.62}$$

The last step holds in the sense that if u is uniformly distributed on (0, 1) then so is 1 - u, so either form generates the same distribution.

3. Poisson distribution: recall the probability of exactly x occurrences in a time T is given by

$$p(x|T) = \frac{(rT)^{x} \exp(-rT)}{x!} , \quad x = 0, 1, 2, \ldots$$

where r = average rate of occurrences and $\mu = rT$ is the average number of occurrences in time T. Since the time difference between independent Poisson occurrences has a negative exponential distribution, one can generate a random Poisson value by generating successive negative exponential random values using $t = -\tau \ln(u)$. The process continues until the sum of x + 1 values of t exceeds the prescribed length T. The Poisson random value, therefore, is x. Recall $\tau$ = mean time between Poisson events = 1/r.
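The three recipes above translate almost directly into code. The following Python sketch is a hedged illustration rather than a production generator; the function names are hypothetical, and Python's `random.Random` stands in for the uniform source.

```python
import math
import random

def uniform_ab(a, b, rng):
    """Example 1: x = u (b - a) + a."""
    return rng.random() * (b - a) + a

def negative_exponential(tau, rng):
    """Example 2: x = -tau ln(u). Using 1 - random() keeps u in (0, 1], avoiding log(0)."""
    u = 1.0 - rng.random()
    return -tau * math.log(u)

def poisson(r, T, rng):
    """Example 3: count events with exponential waiting times (mean 1/r) that fit in time T."""
    tau = 1.0 / r
    elapsed = 0.0
    x = 0
    while True:
        elapsed += negative_exponential(tau, rng)
        if elapsed > T:        # the sum of x + 1 waiting times has exceeded T
            return x
        x += 1

if __name__ == "__main__":
    rng = random.Random(5)
    counts = [poisson(2.0, 1.5, rng) for _ in range(5000)]
    print(sum(counts) / len(counts))   # should be close to r*T = 3
```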
5.13.1 Pseudo-random number generators

Considerable effort has been focused on finding methods for generating uniform distributions of numbers in the range [0, 1] (Knuth, 1981; Press et al., 1992). These numbers can then be transformed into other ranges and other types of distributions (e.g., Poisson, normal) as we have seen. The procedure below illustrates one approach to pseudo-random number generation called linear congruential generators (LCG), which generate a sequence of integers $I_1, I_2, I_3, \ldots$, each between 0 and m - 1, where m is a large number, and from them a sequence of fractions between 0 and (m - 1)/m, by the following operations:

Step 1: $I_{i+1} = a I_i + c$, where a and c are integers. This generates an upward-going sequence from a seed $I_0$.

Step 2: Modulus[I, m] = I - IntegerPart[I/m] × m, e.g., Modulus[5, 3] = 5 - IntegerPart[5/3] × 3 = 2. This reduces the above sequence to a random one with values in the range 0 to m - 1 (actually a distribution of round-off errors). Also written as $I_{i+1} = a I_i + c \ (\mathrm{Mod}\ m)$.

Step 3: $U = m^{-1}$ Modulus[I, m]. This gives the desired sequence U between 0 and (m - 1)/m.

Notice the smallest difference between terms is 1/m, which means the numbers an LCG produces comprise a set of m equally spaced rational fractions in the range $0 \le x \le (m - 1)/m$.

Problems:
1. The sequence repeats itself with some period which is $\le m$.
2. For certain choices of parameters, some generators skip many of the possible numbers and give an incomplete set. A series that generates all the m distinct integers ($0 \le n \le m - 1$) during each period is called a full period.
3. Contains subtle serial correlations. See Numerical Recipes (Press et al., 1992) for more details.

Established rules for choosing parameters that give a long and full period are given by Knuth (1981) and by Park and Miller (1988). One way to reduce all of the above problems is to use a compound LCG or shuffling generator which works as follows:
1. Use two LCGs.
2. Use first LCG to generate N lists of random numbers.
3. Use second LCG to calculate a number l between 1 and N, then select top number from lth list (return this number back to the bottom of that list).
4. Period of compound LCG ≈ product of periods of individual LCGs.
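Steps 1 to 3 can be sketched in a few lines of Python. This is an illustration only: the default parameters a = 16807, c = 0, m = 2^31 - 1 are the widely quoted "minimal standard" multiplicative choice and are an assumption here, since the text above does not specify values; the function name is hypothetical. With c = 0 the seed must be nonzero.

```python
def lcg(seed, a=16807, c=0, m=2**31 - 1):
    """Yield pseudo-random fractions U in [0, (m-1)/m] from a linear congruential generator."""
    i = seed
    while True:
        i = (a * i + c) % m    # Steps 1 and 2: I_{i+1} = a I_i + c (Mod m)
        yield i / m            # Step 3: scale the integer into the unit interval

if __name__ == "__main__":
    gen = lcg(seed=1)
    first_five = [next(gen) for _ in range(5)]
    print(first_five)
```

Because the update is deterministic, restarting the generator with the same seed reproduces the same sequence exactly, which is the repeatability property described earlier in this section.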
Box 5.6 Mathematica pseudo-random numbers:

We can use Mathematica to generate pseudo-random numbers with a wide range of probability distributions. The following command will yield a list of 10 000 uniform random numbers in the interval 0 → 1:

    Table[Random[ ], {10000}]

Random uses the Wolfram rule 30 cellular automaton generator for integers (Wolfram, 2002).

To obtain a table of pseudo-random numbers with a Gaussian distribution use the following commands. The first line loads a package containing a wide range of continuous distributions of interest for statistics.

    Needs["Statistics`ContinuousDistributions`"]
    Table[Random[NormalDistribution[μ, σ]], {10000}]

Mathematica uses the time of day as a seed for random number generation. To ensure you always get the same sequence of pseudo-random numbers, you need to provide a specific seed (e.g., 99) with the command:

    SeedRandom[99]
5.13.2 Tests for randomness

Most computers have lurking in their library routines a random number generator, typically with the name RAN. X = RAN(ISEED) is a typical calling sequence. ISEED is some arbitrary initialization value. Any random number generator needs testing before use, as the example discussed below illustrates. Four common approaches to testing are:

* Random walk.
* Compare the actual distribution of the pseudo-random numbers to a uniform distribution using a statistical test. Two commonly used frequentist tests are the Kolmogorov–Smirnov test and the $\chi^2$ goodness-of-fit test. The latter is discussed in Section 7.3.1.
* Examine the Fourier spectrum.
* Test for correlations between neighboring random numbers.
Examples of the latter two tests are given in this section. Panel (a) of Figure 5.13 shows the power spectral density of 262 144 pseudo-random numbers generated using Mathematica. The frequency axis is the number of cycles in the 262 144 steps. There do not appear to be any significant peaks indicative of a periodicity. Note: the uniformly distributed pseudo-random numbers in the interval 0 → 1 were transformed to be uniform in the interval -0.5 → 0.5 by subtracting 0.5 from each, so there is no DC (zero frequency) component in the spectrum. Panel (b) shows a histogram of the real part of the Fourier amplitudes, which has a Gaussian shape. From the Central Limit Theorem, we expect the histogram to be a Gaussian, since each amplitude corresponds to a weighted sum (weighted by a sine wave) of a very large number of random values.

Panels (c) and (d) demonstrate how sensitive the power spectral density (PSD) and Fourier amplitude histogram are to a repeating sequence of random numbers. Panel (c) shows the PSD for a sequence of 262 144 random numbers consisting of 5.24 cycles of 50 000 random numbers generated with Mathematica. Again the frequency axis is the number of cycles in the 262 144 steps. This time, one can clearly see peaks in the PSD at multiples of 5.24, and the histogram has become much narrower. Note: when the sequence is an exact multiple of the repeat period, the vast majority of Fourier amplitudes are zero and the histogram takes on the appearance of a sharp spike or delta function, sitting on a broad plateau. A program to carry out the above calculations can be found in the section of the Mathematica tutorial entitled "Fourier test of random numbers."

Figure 5.13 Panel (a) shows the power spectral density of 262 144 pseudo-random numbers generated using Mathematica. The frequency axis is the number of cycles in the 262 144 steps. Panel (b) shows a histogram of the real part of the Fourier amplitudes. For comparison, panels (c) and (d) demonstrate how sensitive the PSD and Fourier amplitude histogram are to a repeating sequence of random numbers (5.24 cycles of 50 000 random numbers). (Axes: panels (a) and (c) PSD versus Frequency; panels (b) and (d) PDF versus Fourier amplitude (real).)
Since each pseudo-random number is derived from the previous value, it is important to test whether successive values are independent or exhibit correlations. We can look for evidence of correlations between adjacent random numbers by grouping them in pairs and plotting the value of one against the other. Such a plot is shown in the left panel of Figure 5.14, for 3000 pairs of random numbers. If adjacent pairs were completely correlated, we would expect the points to lie on a straight line. This is clearly not the case, as the points appear to be randomly scattered over the figure. The right panel of Figure 5.14 shows a similar correlation test involving neighboring points, taken three at a time, and plotted in three dimensions. If the sequence of numbers is perfectly random, then we expect the points to have an approximately uniform distribution, as they appear to do.

Figure 5.14 The left panel shows a correlation test between successive pairs of random numbers, obtained from the Mathematica pseudo-random number generator. The coordinates of each point are given by the pair of random numbers. The right panel shows a three-dimensional correlation test involving successive random numbers taken three at a time. The right panel was plotted with the Mathematica command ScatterPlot3D.

It is possible to extend and quantify these correlation tests. The most common frequentist tool for quantifying correlation is called the autocorrelation function (ACF). Here is how it works for our problem: let $\{x_i\}$ be a list of uniformly distributed pseudo-random numbers in the interval 0 to 1. Now subtract the mean value of the list, $\bar{x}$, to obtain a new list in the interval -0.5 to 0.5. Make a copy of the list $\{x_i - \bar{x}\}$ and place it below the first list. Then shift the copy to the left by j terms so the ith term in the original list is above the (i + j)th term in the copy. This shift is referred to as a lag. Next, multiply each term in the original list by the term in the shifted list immediately below, and compute the average of these products (for all terms that overlap in the two lists), which we designate $\rho(j)$. We can repeat this process and compute $\rho(j)$ for a wide range of lags, ranging from j = 0 to some large value. If the numbers in the list are truly random then any term in the original list will be completely independent of any term in the shifted copy. So each multiplication is equally likely to be a positive or negative quantity. The average of a large number of random positive and negative quantities tends to zero. Of course for j = 0 (no shift), the two terms are identical so the products are all positive quantities and there is no cancellation. Thus, for a list of completely random numbers, a plot of the ACF, $\rho(j)$, will look like a spike at j = 0 and be close to zero for all $j \ge 1$. If the terms are not completely independent, then we expect the plot of $\rho(j)$ to decay gradually towards zero over a range of j values.

The formula for $\rho(j)$ given below differs slightly from the operation just described, in that instead of computing the average, we sum the product terms for each j and then normalize by dividing by

$$\sqrt{\sum_{\text{overlap}} (x_i - \bar{x})^2} \; \sqrt{\sum_{\text{overlap}} (x_{i+j} - \bar{x})^2} .$$

With this normalization, the maximum value of the ACF is 1.0 and it allows the ACF to handle a wider variety of correlation problems than the particular one we are interested in here:

$$\rho(j) = \frac{\sum_{\text{overlap}} \left[ (x_i - \bar{x})(x_{i+j} - \bar{x}) \right]}{\sqrt{\sum_{\text{overlap}} (x_i - \bar{x})^2} \; \sqrt{\sum_{\text{overlap}} (x_{i+j} - \bar{x})^2}} , \tag{5.63}$$
where the summation is carried out over the subset of samples that overlap. Figure 5.15 shows a plot of the ACF for a sequence of 10 000 uniformly distributed pseudo-random numbers generated by the Mathematica Random command. The larger plot spans a range of 1000 lags, while the blow-up in the corner shows the first ten lags. Clearly, for lags $\ge 1$, the ACF is essentially zero, indicating no detectable correlation. The noise-like fluctuations for $j \ge 1$ arise because of incomplete cancellation of the product terms for a finite list of random numbers.

Figure 5.15 The autocorrelation function (ACF) for a sequence of 10 000 uniformly distributed pseudo-random numbers generated by the Mathematica Random command. The larger plot spans a range of 1000 lags, while the blow-up in the corner shows the first ten lags. Clearly, for lags $\ge 1$, the ACF is essentially zero, indicating no detectable correlation. (Axes: ACF versus Lag.)

For quite a few years, RAN3 in Numerical Recipes (not an LCG but based on a subtractive method) was considered a reliable portable random number generator, but even this has been called into question (see Barber et al., 1985; Vattulainen et al., 1994; Maddox, 1994; and Fernandez and Rivero, 1996). However, what is random enough for one application may not be random enough for another. In the near future, random number generators based upon a physical process, like Johnson noise from a resistor or a reverse-biased zener diode, will be incorporated into every computer. Intel already supplies such devices on some chipsets for PC-type computers. One can anticipate that users of these hardware-derived random numbers will again be concerned with just how random these numbers are.
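The autocorrelation function of Equation (5.63) can be sketched in Python as follows. This is an illustration, not the Mathematica code used for Figure 5.15; the function name is hypothetical, and Python's built-in generator is used as a stand-in source of uniform numbers.

```python
import math
import random

def acf(x, j):
    """Normalized autocorrelation at lag j, summing only over overlapping terms (Eq. 5.63)."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    a = d[: n - j]      # (x_i - mean) for the terms that overlap
    b = d[j:]           # (x_{i+j} - mean) for the same terms
    numerator = sum(p * q for p, q in zip(a, b))
    denominator = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return numerator / denominator

if __name__ == "__main__":
    rng = random.Random(7)
    x = [rng.random() for _ in range(10000)]
    for lag in range(0, 6):
        # Lag 0 gives 1; larger lags should scatter about zero.
        print(lag, acf(x, lag))
```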
5.14 Summary

The most important aspect of frequentist statistics is the process of drawing conclusions based on sample data drawn from the population (which is the collection of all possible samples). The use of the term random variable conveys the idea of an intrinsic uncertainty in the measurement characterized by an underlying population. A random variable is not the particular number recorded in one measurement, but rather, it is an abstraction of the measurement operation or observation that gives rise to that number, e.g., X may represent the random variable and x the realization of the random variable in one measurement.

To the frequentist, the sampling distribution is a model of the probability distribution of the underlying population from which the sample was taken. In a Bayesian analysis, the sampling distribution is a mathematical description of the uncertainty in predicting the data for any particular model because of incomplete information. It enables us to compute the likelihood $p(D|H, I)$.

We considered various descriptive properties of probability distributions: moments, moment generating functions (useful in proofs of important theorems) and measures of the central tendency of a distribution (mode, median and mean). This was followed by a discussion of some important discrete and continuous probability distributions. The most important probability distribution is the Gaussian or normal distribution. This is because a measured quantity is often the result of a large number of effects, i.e., is some kind of average of these effects. According to the Central Limit Theorem, the distribution of the average of a collection of random variables tends to a Gaussian. We also learned that in most circumstances, the distribution of the sample average sharpens around the population mean $\mu$ as the sample size increases, which is the basis of signal averaging that plays such an important role in experimental work.

Finally, we learned how to generate pseudo-random samples from an arbitrary probability distribution, a topic that is of great importance in experimental simulations and Monte Carlo techniques.
5.15 Problems

1. Write a small program to reproduce Figure 5.5 starting from the raw data "badbase.dat" (supplied with the Mathematica tutorial) and using 21 points for the running median and mean. For the first and last ten points of the data, just subtract the average of the ten from each point.

2. When sampling from a normal distribution with mean $\mu = 2$ and $\sigma = 1$, compute $P(\mu - 2.7\sigma \le X \le \mu + 2.7\sigma)$.

3. As one test of your pseudo-random number generator, generate a sample of 50 000 random numbers in the interval 0 to 1 with a uniform distribution. Compare the mean and variance of the sequence to that expected for a uniform distribution. Does the sample mean agree with the expected mean to within one standard error of the sample mean?

4. Generate 10 000 pseudo-random numbers with a beta distribution with $\alpha = 2$ and $\beta = 4$. See the Mathematica example in Section 5.13.1 of the book and use BetaDistribution instead of NormalDistribution. Plot a histogram of your random numbers, and on the same plot, overlay a beta distribution for comparison. Compute the mean and median of your simulated data set. See the BinCounts, Mean, and Median commands in Mathematica.

5. Let $X_1, X_2, X_3, \ldots, X_n$ be n independent and identically distributed (IID) random variables with a beta PDF given by

$$f(x) \equiv f(x|\alpha, \beta) = \begin{cases} \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1} , & \text{for } 0 < x < 1;\ \alpha, \beta > 0 \\ 0 , & \text{elsewhere,} \end{cases} \tag{5.64}$$

where $\alpha = 2$ and $\beta = 4$. What is the probability density function (PDF) of $\bar{X}$, the average of n measurements? As an alternative to averaging large numbers of samples (simplest approach) you could make use of the convolution theorem and the Fast Fourier Transform (FFT) (remember to zero pad). Note: if you are not familiar with using the FFT and zero padding, you will find this approach much more challenging. Note: the Discrete Fourier Transform, the FFT and zero padding are discussed in Appendix B.
   a) By way of an answer, plot the PDFs for n = 1, 3, 5, 8 and display all four distributions on the same plot. Be careful to normalize each distribution for unit area.
   b) Compute the mean and variance of the four PDFs (do not simply quote an expected theoretical value) for each value of n.
   c) Compare your result for the n = 5 case to a Gaussian with the same mean and variance drawn on the same graph. Repeat for n = 8. What conclusions do you draw?
6 What is a statistic?
6.1 Introduction

In this chapter, we address the question "What is a statistic?" In particular, we look at what role statistics play in scientific inference and give some common useful examples. We will examine their role in the two basic inference problems: hypothesis testing (the frequentist equivalent of model selection) and parameter estimation, with emphasis on the latter. Hypothesis testing will be dealt with in Chapter 7.

Recall that an important aspect of frequentist statistical inference is the process of drawing conclusions based on sample data drawn from the population (which is the collection of all possible samples). The concept of the population assumes that in principle, an infinite number of measurements (under identical conditions) are possible. Suppose $X_1, X_2, \ldots, X_n$ are n independent and identically distributed (IID) random variables that constitute a random sample from the population for which $x_1, x_2, \ldots, x_n$ is one realization. The population is assumed to have an intrinsic probability distribution (or density function) which, if known, would allow us to predict the likelihood of the sample $x_1, x_2, \ldots, x_n$. For example, suppose the random variable we are measuring is the time interval between successive decays of a radioactive sample. In this case, the population probability density function is a negative exponential (see Section 5.8.5), given by $f(x|\tau) = [\exp(-x/\tau)]/\tau$. The likelihood is given by

$$L(x_1, x_2, \ldots, x_n|\tau) = \prod_{i=1}^{n} f(x_i|\tau) . \tag{6.1}$$

This particular population probability density function is characterized by a single parameter $\tau$. Another population probability distribution that arises in many problems is the normal (Gaussian) distribution, which has two parameters, $\mu$ and $\sigma^2$. In most problems, the parameters of the underlying population probability distribution are not known. Without knowledge of their values, it is impossible to compute the desired probabilities. However, a population parameter can be estimated from a statistic, which is determined from the information contained in a random sample. It is for this reason that the notion of a statistic and its sampling distribution is so important in statistical inference.
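To make Equation (6.1) concrete, the likelihood of a small sample of decay intervals can be evaluated for a few trial values of $\tau$. The Python sketch below is illustrative only: the function names are hypothetical and the five intervals are made-up numbers, not data from the text.

```python
import math

def exponential_pdf(x, tau):
    """Negative exponential density f(x|tau) = exp(-x/tau) / tau for x > 0."""
    return math.exp(-x / tau) / tau

def likelihood(sample, tau):
    """Equation (6.1): product of the densities of the individual measurements."""
    result = 1.0
    for x in sample:
        result *= exponential_pdf(x, tau)
    return result

if __name__ == "__main__":
    intervals = [0.8, 2.1, 0.3, 1.5, 1.1]   # hypothetical decay intervals (arbitrary units)
    for tau in (0.5, 1.0, 1.5, 3.0):
        print(tau, likelihood(intervals, tau))
```

Scanning over $\tau$ in this way shows which parameter values make the observed sample more or less probable, which is the starting point for estimating the unknown population parameter from the sample.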
Definition: A statistic is any function of the observed random variables in a sample such that the function does not contain any unknown quantities. One important statistic is the sample mean $\bar{X}$ given by

$$\bar{X} = (X_1 + X_2 + \cdots + X_n)/n = \frac{1}{n} \sum_{i=1}^{n} X_i . \tag{6.2}$$

Note: we are using a capital X, which implies we are talking about a random variable. All statistics are random variables and, to be useful, we need to be able to specify their sampling distribution.

For example, we might be interested in the mean redshift (see footnote 1) of a population of cosmic gamma-ray burst (GRB) sources. This would provide information about the distances of these objects and their mean energy. GRBs are the most powerful type of explosion known in the universe. The parameter of interest is the mean redshift, which we designate $\mu$. A parameter of a population is always regarded as a fixed and usually unknown constant. Let Z be a random variable representing GRB redshifts. Suppose the redshifts, $\{z_1, z_2, \ldots, z_7\}$, of a sample of seven GRB sources are obtained after a great deal of effort. What can we conclude about the population mean redshift $\mu$ from our sample, i.e., how accurately can we determine $\mu$ from our sample? This can be a fairly difficult question to answer using the individual measurements, $z_i$, because we don't know the form of the sampling distribution for GRB source redshifts. Happily, in this case, we can proceed with our objective by exploiting the Central Limit Theorem (CLT), which predicts the sampling distribution of the sample mean statistic. The way to think about this is as follows: consider a thought experiment in which we are able to obtain redshifts for a very large number of samples (hypothetical reference set) of GRB redshifts. Each sample consists of seven redshift measurements. The means of all these samples will have a distribution. According to the CLT, the distribution of sample means tends to a Gaussian as the number n of observations tends to infinity. In practice, a Gaussian sampling distribution is often employed when $n \ge 5$.

Of course, we don't have the results from this hypothetical reference set, only the results from our one sample, but at least we know that the shape of the sampling distribution characterizing our sample mean statistic is approximately a Gaussian. This allows us to make a definite statement about the uncertainty in the population mean redshift which we derive from our one sample of seven redshift measurements. Just how we do this is discussed in detail in Section 6.6.2. In the course of answering that question, we will encounter the sample variance statistic, $S^2$, and develop the notion of a sampling distribution of a statistic.

Footnote 1: Redshift is a measure of the wavelength shift produced by the Doppler effect. In 1929, Edwin Hubble showed that we live in an expanding universe in which the velocity of recession of a galaxy is proportional to its distance. A recession velocity shifts the observed wavelength of a spectral line to longer wavelengths, i.e., to the red end of the optical spectrum.
6.2 The $\chi^2$ distribution

The sampling distribution of any particular statistic is the probability distribution of that statistic that would be determined from an infinite number of independent samples, each of size n, from an underlying population. We start with a treatment of the $\chi^2$ sampling distribution (see footnote 2). We will prove in Section 6.3 that the $\chi^2$ distribution describes the distribution of the variances of samples taken from a normal distribution. The $\chi^2$ distribution is a special case of the gamma distribution:

$$f(x|\alpha, \beta) = \frac{1}{\Gamma(\alpha) \beta^{\alpha}}\, x^{\alpha - 1} \exp\left( -\frac{x}{\beta} \right) \tag{6.3}$$

with $\beta = 2$ and $\alpha = \nu/2$, where $\nu$ is called the degree of freedom. The $\chi^2$ distribution has the following properties:

$$f(x|\nu) = \frac{1}{\Gamma\left( \frac{\nu}{2} \right) 2^{\nu/2}}\, x^{\frac{\nu}{2} - 1} \exp\left( -\frac{x}{2} \right) \tag{6.4}$$

$$\langle x \rangle = \nu ; \qquad \mathrm{Var}[x] = 2\nu . \tag{6.5}$$

The coefficients of skewness ($\alpha_3$) and kurtosis ($\alpha_4$) are given by

$$\alpha_3 = \frac{4}{\sqrt{2\nu}} ; \qquad \alpha_4 = 3 \left( 1 + \frac{4}{\nu} \right) . \tag{6.6}$$

Finally, the moment generating function of $\chi^2_\nu$ with $\nu$ degrees of freedom is given by

$$m_{\chi^2_\nu}(t) = (1 - 2t)^{-\nu/2} . \tag{6.7}$$
We now prove two useful theorems pertaining to the 2 distribution. Theorem 1: Let fXi g ¼ X1 ; X2 ; . . . ; Xn be an IID sample from a normal distribution Nð; Þ. P P Let Y ¼ ni¼ 1 ðXi Þ2 =2 ¼ ni¼ 1 Z2i , where Zi are standard random variables. Then Y has a chi-squared 2n distribution with n degrees of freedom. Proof: Let mY ðtÞ ¼ the moment generating function (recall Section 5.6) of Y. From Equation (5.9), we can write 2
mY ðtÞ ¼ hetY i ¼ heti Zi i 2
2
2
¼ hetZ1 etZ2 etZn i:
2
(6:8)
The 2 statistic plays an important role in fitting models to data using the least-squares method, which is discussed in great detail in Chapters 10 and 11.
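Since the random variables $Z_i$ are IID,

$$m_Y(t) = \langle e^{tZ_1^2} \rangle \times \langle e^{tZ_2^2} \rangle \times \cdots \times \langle e^{tZ_n^2} \rangle = m_{Z_1^2}(t) \times m_{Z_2^2}(t) \times \cdots \times m_{Z_n^2}(t). \tag{6.9}$$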
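The moment generating function for each $Z_i^2$ is given by

$$m_{Z^2}(t) = \int_{-\infty}^{+\infty} f(z)\exp(tZ^2)\,dZ = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp(tZ^2)\exp\left(-\frac{Z^2}{2}\right) dZ = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp\left(-\frac{Z^2}{2}(1-2t)\right) dZ, \tag{6.10}$$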
where we have made use of the fact that $f(z)$ is the standard normal distribution, i.e., a Gaussian. Multiplying and dividing Equation (6.10) by $(1-2t)^{-1/2}$ (which requires $t < 1/2$) we get

$$m_{Z^2}(t) = (1-2t)^{-\frac{1}{2}} \underbrace{\int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}\,(1-2t)^{-\frac{1}{2}}} \exp\left[-\frac{Z^2}{2(1-2t)^{-1}}\right] dZ}_{\text{Integral of normal distribution}\;=\;1} \tag{6.11}$$

$$\Rightarrow\; m_{Z^2}(t) = (1-2t)^{-\frac{1}{2}}.$$

Therefore,

$$m_Y(t) = (1-2t)^{-\frac{n}{2}}. \tag{6.12}$$
Comparison of Equations (6.12) and (6.7) shows that $Y$ has a $\chi^2$ distribution, with $n$ degrees of freedom, which we designate by $\chi^2_n$. Figure 6.1 illustrates the $\chi^2$ distribution for three different choices of the number of degrees of freedom.

Example: In Section 5.9, we showed that for any IID sampling distribution with a finite variance, $(\bar{X}-\mu)\sqrt{n}/\sigma$ tends to $N(0,1)$ as $n \to \infty$, and therefore $[(\bar{X}-\mu)\sqrt{n}/\sigma]^2$ is approximately $\chi^2_1$ with one degree of freedom (see footnote 3).

Footnote 3: When sampling from a normal distribution, the distribution of $(\bar{X}-\mu)\sqrt{n}/\sigma$ is always $N(0,1)$ regardless of the value of $n$.
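Theorem 1 can be checked by simulation. The following Python sketch (standard library only; the function names, the seed, and the choice of 8 degrees of freedom are illustrative assumptions, not part of the original text) draws many sums of squared standard normal deviates and compares their mean and variance with the values $\nu$ and $2\nu$ of Equation (6.5).

```python
import random
import statistics

def sum_of_squared_normals(n, rng):
    """Draw n standard normal deviates and return the sum of their squares."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

def simulate_chi2_moments(n, trials=20000, seed=1):
    """Return (mean, variance) of Y = sum_i Z_i^2 estimated from many trials."""
    rng = random.Random(seed)
    ys = [sum_of_squared_normals(n, rng) for _ in range(trials)]
    return statistics.mean(ys), statistics.variance(ys)

# Illustrative choice: n = 8, so theory predicts mean 8 and variance 16.
mean_y, var_y = simulate_chi2_moments(n=8)
print(f"mean = {mean_y:.2f} (theory 8), variance = {var_y:.2f} (theory 16)")
```

The agreement is only statistical: with a finite number of trials the estimates scatter around the theoretical values.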
[Figure 6.1: The $\chi^2$ distribution for three different choices of the number of degrees of freedom ($\nu = 1$, $\nu = 3$, $\nu = 8$). Horizontal axis: $\chi^2$ value; vertical axis: probability density.]
Theorem 2: If $X_1$ and $X_2$ are two independent $\chi^2$-distributed random variables with $\nu_1$ and $\nu_2$ degrees of freedom, then $Y = X_1 + X_2$ is also $\chi^2$-distributed with $\nu_1 + \nu_2$ degrees of freedom.

Proof: Since $X_1$ and $X_2$ are independent, the moment generating function of $Y$ is given by

$$m_Y(t) = m_{X_1}(t) \times m_{X_2}(t) \tag{6.13}$$

$$= (1-2t)^{-\frac{\nu_1}{2}} (1-2t)^{-\frac{\nu_2}{2}} = (1-2t)^{-\frac{(\nu_1+\nu_2)}{2}}, \tag{6.14}$$

which equals the moment generating function of a $\chi^2$ random variable with $\nu_1 + \nu_2$ degrees of freedom.
6.3 Sample variance $S^2$

We often want to estimate the variance ($\sigma^2$) of a population from an IID sample taken from a normal distribution. We usually don’t know the mean ($\mu$) of the population so we use the sample mean ($\bar{X}$) as an estimate. To estimate $\sigma^2$ we use another random variable called the sample variance ($S^2$), defined as follows:

$$S^2 = \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{n-1}. \tag{6.15}$$
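Just why we define the sample variance random variable in this way will soon be made clear. Of course, for any particular sample of $n$ data values, the sample random variable would take on a particular value designated by lower case $s^2$. Here is a useful theorem that enables us to estimate $\sigma$ from $S$:

Theorem 3: The sampling distribution of $(n-1)S^2/\sigma^2$ is $\chi^2$ with $(n-1)$ degrees of freedom.

Proof:

$$\begin{aligned}
(n-1)S^2 &= \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum \left[(X_i-\mu) - (\bar{X}-\mu)\right]^2 \\
&= \sum \left[(X_i-\mu)^2 - 2(X_i-\mu)(\bar{X}-\mu) + (\bar{X}-\mu)^2\right] \\
&= \sum (X_i-\mu)^2 - 2(\bar{X}-\mu)\sum (X_i-\mu) + \sum (\bar{X}-\mu)^2 \\
&= \sum (X_i-\mu)^2 - 2(\bar{X}-\mu)\,n(\bar{X}-\mu) + n(\bar{X}-\mu)^2 \\
&= \sum \left[(X_i-\mu)^2\right] - n(\bar{X}-\mu)^2.
\end{aligned} \tag{6.16}$$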
Therefore,

$$\frac{(n-1)S^2}{\sigma^2} + \underbrace{\frac{(\bar{X}-\mu)^2}{\sigma^2/n}}_{\chi^2_1} = \underbrace{\sum_{i=1}^{n} \frac{(X_i-\mu)^2}{\sigma^2}}_{\chi^2_n}. \tag{6.17}$$
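From Theorem 2, $(n-1)S^2/\sigma^2$ is $\chi^2_{n-1}$ with $(n-1)$ degrees of freedom. (This step also relies on the independence of $\bar{X}$ and $S^2$, which holds when sampling from a normal distribution.) The expectation value of a quantity that has a $\chi^2$ distribution with $(n-1)$ degrees of freedom is equal to the number of degrees of freedom (see Equation (6.5)). Therefore,

$$\left\langle \frac{(n-1)S^2}{\sigma^2} \right\rangle = n-1. \tag{6.18}$$

But,

$$\left\langle \frac{(n-1)S^2}{\sigma^2} \right\rangle = \langle S^2 \rangle \frac{(n-1)}{\sigma^2} = n-1. \tag{6.19}$$

Therefore,

$$\langle S^2 \rangle = \sigma^2. \tag{6.20}$$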
This provides justification for our definition of $S^2$: its expectation value is the population variance. Note: this does not mean that $S^2$ will equal $\sigma^2$ for any particular sample.
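Note 1: We have just established Equation (6.20) when sampling from a normal distribution. We now show that Equation (6.20) is valid for IID sampling from any arbitrary distribution with finite variance. From Equation (6.16), we can write

$$\langle S^2 \rangle = \frac{\sum_i \langle (X_i-\mu)^2 \rangle - n \langle (\bar{X}-\mu)^2 \rangle}{n-1}. \tag{6.21}$$

But $\langle (X_i-\mu)^2 \rangle = \mathrm{Var}(X_i) = \sigma^2$ by definition, and $\langle (\bar{X}-\mu)^2 \rangle = \mathrm{Var}(\bar{X}) = \sigma^2/n$ from Equation (5.50). It follows that

$$\langle S^2 \rangle = \frac{n\sigma^2 - \sigma^2}{n-1} = \sigma^2. \tag{6.22}$$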
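The result of Equation (6.22) can be illustrated numerically for a non-normal population. The Python sketch below (standard library only; the exponential population with unit variance, the sample size, and all names are assumptions chosen for illustration) averages the $(n-1)$-divisor estimate of Equation (6.15) over many samples and compares it with the $n$-divisor estimate, whose expectation is $(n-1)\sigma^2/n$.

```python
import random

def average_variance_estimates(n=5, trials=40000, seed=2):
    """Average two variance estimates over many IID exponential samples.

    The exponential distribution with rate 1 has variance 1, so the
    (n - 1)-divisor estimate should average to about 1 and the
    n-divisor estimate to about (n - 1) / n.
    """
    rng = random.Random(seed)
    unbiased_sum = 0.0
    biased_sum = 0.0
    for _ in range(trials):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        x_bar = sum(sample) / n
        ss = sum((x - x_bar) ** 2 for x in sample)  # sum of squared deviations
        unbiased_sum += ss / (n - 1)                # Equation (6.15)
        biased_sum += ss / n                        # divisor n instead of n - 1
    return unbiased_sum / trials, biased_sum / trials

unbiased, biased = average_variance_estimates()
print(f"<S^2> with divisor n-1: {unbiased:.3f}; with divisor n: {biased:.3f}")
```

With $n = 5$ the second average should sit near 0.8 rather than 1, which is the bias that the $(n-1)$ divisor removes.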
Thus, Equation (6.22) is valid for IID sampling from an arbitrary distribution with finite variance. In the language of frequentist statistics, we say that $S^2$, as defined in Equation (6.15), is an unbiased estimator of $\sigma^2$.

Standard error of the sample mean: We often want to quote a typical error for the mean of a population based on our sample. According to Equation (5.50), $\mathrm{Var}(\bar{X}) = \sigma^2/n$ for any distribution with finite variance. Since we do not normally know $\sigma^2$, the variance of the population, we use the sample variance as an estimate. The standard error of the sample mean is defined as

$$\frac{S}{\sqrt{n}}. \tag{6.23}$$

In Section 6.6.2 we will use a Student’s t distribution to be more precise about specifying the uncertainty in our estimate of the population mean from the sample mean.

Note 2: In a situation where we know the population $\mu$ but not $\sigma^2$, define $S^2$:

$$S^2 = \sum_{i=1}^{n} \frac{(X_i - \mu)^2}{n}. \tag{6.24}$$
It is easily shown that with this definition, $nS^2/\sigma^2$ is $\chi^2_n$ with $n$ degrees of freedom. We lose one degree of freedom when we estimate $\mu$ from $\bar{X}$.

Example: A random sample of size $n = 16$ (IID sample) is drawn from a population with a normal distribution of unknown mean ($\mu$) and variance ($\sigma^2$). We compute the sample variance, $S^2$, and want to determine

$$p(\sigma^2 < 0.49\,S^2). \tag{6.25}$$

Solution: Equation (6.25) is equivalent to

$$p\left(\frac{S^2}{\sigma^2} > 2.041\right). \tag{6.26}$$
We know that the random variable $X = (n-1)S^2/\sigma^2$ has a $\chi^2$ distribution with $(n-1)$ degrees of freedom. In this case, $(n-1) = 15 = \nu$ degrees of freedom. Therefore,

$$p\left(\frac{S^2}{\sigma^2} > 2.041\right) = p\left((n-1)\frac{S^2}{\sigma^2} > 30.61\right). \tag{6.27}$$

Let $\alpha = p((n-1)S^2/\sigma^2 > 30.61)$. Then $1-\alpha = p((n-1)S^2/\sigma^2 \le 30.61)$, or more generally, $1-\alpha = p(X \le x_{1-\alpha})$ where $x_{1-\alpha}$ is the particular value of the random variable $X$ for which the cumulative distribution $p(X \le x_{1-\alpha}) = 1-\alpha$. $x_{1-\alpha}$ is called the $(1-\alpha)$ quantile value of the distribution, and $p(X \le x_{1-\alpha}|\nu)$ is given by

$$p(X \le x_{1-\alpha}|\nu) = \frac{1}{\Gamma(\nu/2)\,2^{\nu/2}} \int_0^{x_{1-\alpha}} t^{\frac{\nu}{2}-1} \exp\left(-\frac{t}{2}\right) dt = 1-\alpha. \tag{6.28}$$

For $\nu = 15$ degrees of freedom, 30.61 corresponds to $\alpha = 0.01$ or $x_{0.990}$. Thus, the probability that $\sigma^2 < 0.49\,S^2$ is 1%. Figure 6.2 shows the $\chi^2$ distribution for $\nu = 15$ degrees of freedom and the $1-\alpha = 0.99$ quantile value.
[Figure 6.2: The $\chi^2$ distribution for $\nu = 15$ degrees of freedom. The vertical line marks the $1-\alpha = 0.99$ quantile value, $\chi^2_{0.99}$. The area to the left of this line corresponds to $1-\alpha$. Horizontal axis: $\chi^2$ value; vertical axis: probability density.]
We can evaluate Equation (6.28) with the following Mathematica command:

Box 6.1 Mathematica $\chi^2$ significance

    Needs["Statistics`ContinuousDistributions`"]

The line above loads a package containing a wide range of continuous distributions of importance to statistics, and the following line computes $\alpha$, the area in the tail of the $\chi^2$ distribution to the right of $\chi^2 = 30.61$, for $\nu = 15$ degrees of freedom.

    α = GammaRegularized[ν/2, χ²/2] = GammaRegularized[15/2, 30.61/2] = 0.01

In statistical hypothesis testing (to be discussed in the next chapter), $\alpha$ is referred to as the significance or the one-sided P-value of a statistical test.
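For readers without Mathematica, the same tail area can be approximated by integrating the density of Equation (6.4) directly. The following Python sketch uses only the standard library; the helper names and the use of composite Simpson's rule are illustrative assumptions, and the density is treated as zero at the origin, which is only appropriate for $\nu > 2$.

```python
import math

def chi2_pdf(x, nu):
    """Chi-squared density of Equation (6.4); treated as 0 at x <= 0 (nu > 2)."""
    if x <= 0.0:
        return 0.0
    log_norm = math.lgamma(nu / 2.0) + (nu / 2.0) * math.log(2.0)
    return math.exp((nu / 2.0 - 1.0) * math.log(x) - x / 2.0 - log_norm)

def simpson(f, a, b, n=4000):
    """Composite Simpson's rule on [a, b] with an even number n of intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def chi2_tail(x, nu):
    """alpha = p(X > x), i.e. one minus the integral in Equation (6.28)."""
    return 1.0 - simpson(lambda t: chi2_pdf(t, nu), 0.0, x)

alpha = chi2_tail(30.61, 15)
print(f"alpha = {alpha:.4f}")
```

The printed value should be close to the 0.01 quoted in Box 6.1.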
6.4 The Student’s t distribution

Recall, when sampling from a normal distribution with known standard deviation, $\sigma$, the distribution of the standard random variable $Z = (\bar{X}-\mu)\sqrt{n}/\sigma$ is $N(0,1)$. In practice, $\sigma$ is usually not known. The logical thing to do is to replace $\sigma$ by the sample standard deviation $S$. The usual inference desired is that there is a specified probability that $\bar{X}$ lies within $\pm S$ of the true mean $\mu$.

Unfortunately, the distribution of $(\bar{X}-\mu)\sqrt{n}/S$ is not $N(0,1)$. However, it is possible to determine the exact sampling distribution of $(\bar{X}-\mu)\sqrt{n}/S$ when sampling from $N(\mu,\sigma)$ with both $\mu$ and $\sigma^2$ unknown. To this end, we examine the Student’s t distribution (see footnote 4). The following useful theorem pertaining to the Student’s t distribution is given without proof.

Theorem 4: Let $Z$ be a standard normal random variable and let $X$ be a $\chi^2$ random variable with $\nu$ degrees of freedom. If $Z$ and $X$ are independent, then the random variable

$$T = \frac{Z}{\sqrt{X/\nu}} \tag{6.29}$$

has a Student’s t distribution with $\nu$ degrees of freedom and a probability density given by

$$f(t|\nu) = \frac{\Gamma[(\nu+1)/2]}{\sqrt{\pi\nu}\;\Gamma(\nu/2)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{(\nu+1)}{2}}, \quad (-\infty < t < +\infty),\; \nu > 0. \tag{6.30}$$

Footnote 4: The t distribution is named for its discoverer, William Gosset, who wrote a number of statistical papers under the pseudonym "Student." He began working as a brewer for the Guinness brewery in Dublin in 1899. He developed the t distribution in the course of analyzing the variability of various materials used in the brewing process.
The Student’s t distribution has the following properties:

$$\langle T \rangle = 0 \quad \text{and} \quad \mathrm{Var}(T) = \frac{\nu}{(\nu-2)}, \quad \nu > 2. \tag{6.31}$$
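When sampling $N(\mu,\sigma)$ we know that $(\bar{X}-\mu)\sqrt{n}/\sigma$ is $N(0,1)$. We also know that $(n-1)S^2/\sigma^2$ is $\chi^2$ with $(n-1)$ degrees of freedom. Since $\bar{X}$ and $S^2$ are independent for normal samples, as Theorem 4 requires, we can identify $Z$ with $(\bar{X}-\mu)\sqrt{n}/\sigma$ and $X$ with $(n-1)S^2/\sigma^2$. Therefore,

$$T = \frac{\dfrac{(\bar{X}-\mu)}{(\sigma/\sqrt{n})}}{\sqrt{\dfrac{(n-1)S^2/\sigma^2}{n-1}}} = \frac{(\bar{X}-\mu)}{(\sigma/\sqrt{n})} \cdot \frac{\sigma}{S} = \frac{(\bar{X}-\mu)}{(S/\sqrt{n})}. \tag{6.32}$$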
Therefore, $(\bar{X}-\mu)\sqrt{n}/S$ is a random variable with a Student’s t distribution with $n-1$ degrees of freedom. Figure 6.3 shows a comparison of a Student’s t distribution for three degrees of freedom, and a standard normal. The broader wings of the Student’s t distribution are clearly evident. The $(1-\alpha)$ quantile value for $\nu$ degrees of freedom, $t_{1-\alpha,\nu}$, is given by

$$p(T \le t_{1-\alpha,\nu}) = \frac{\Gamma[(\nu+1)/2]}{\sqrt{\pi\nu}\;\Gamma(\nu/2)} \int_{-\infty}^{t_{1-\alpha,\nu}} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{(\nu+1)}{2}} dt = 1-\alpha. \tag{6.33}$$

[Figure 6.3: Comparison of a standard normal distribution and a Student’s t distribution for 3 degrees of freedom. Horizontal axis: Student’s t value; vertical axis: probability density.]

Example: Suppose a cigarette manufacturer claims that one of their brands has an average nicotine content of 0.6 mg per cigarette. An independent testing organization measures the nicotine content of 16 such cigarettes and has determined the sample average and the sample standard deviation to be 0.75 and 0.197 mg, respectively. If we assume the amount of nicotine is a normal random variable, how likely is the sample result given the manufacturer’s claim?

$T = (\bar{X}-\mu)\sqrt{n}/S$ has a Student’s t distribution. Here $\bar{x} = 0.75$ mg, $s = 0.197$ mg, and $n = 16$, so the number of degrees of freedom is $\nu = 15$. The manufacturer’s claim of $\mu = 0.6$ mg corresponds to

$$t = \frac{(0.75 - 0.6)}{0.197/\sqrt{16}} = 3.045. \tag{6.34}$$
The Student’s t distribution is a continuous distribution, and thus we cannot calculate the probability of any specific t value since there is no area under a point. The question of how likely the t value is, given the manufacturer’s claim, is usually interpreted as what is the probability by chance that $T \ge 3.045$. The area of the distribution beyond the sample t value gives us a measure of how far out in the tail of the distribution the sample value resides.

Box 6.2 Mathematica solution:

We can solve the above problem with the following commands:

    Needs["Statistics`ContinuousDistributions`"]

The following line computes the area in the tail of the T distribution beyond $T = 3.045$.

    (1 - CDF[StudentTDistribution[ν], 3.045])  → answer = 0.004  (ν = 15)

where CDF[StudentTDistribution[ν], 3.045] stands for the cumulative distribution function of the T distribution from $T = -\infty \to 3.045$.

Therefore, $p(T > 3.045) = \alpha = 0.004$ or 0.4%, i.e., the manufacturer’s claim is very improbable. The way to think of this is to imagine we could repeatedly obtain samples of 16 cigarettes and compute the value of t for each sample. The fraction of these t values that we would expect to fall in the tail area beyond $t > 3.045$ is only 0.4%. If the manufacturer’s claim were reasonable, we would expect that the t value of our actual sample would not fall so far out in the tail of the distribution. If you are still puzzled by this reasoning, we will have a lot more to say about it in Chapter 7. We will revisit this example in Section 7.2.3.

Note: although $(\bar{x}-\mu)/s = 0.15/0.197 < 1$, $s$ is not a meaningful uncertainty for $\bar{x}$, only for the individual $x_i$. The usual measure of the uncertainty in $\bar{x}$ is $s/\sqrt{n} = 0.049$. The quantity $s/\sqrt{n}$ is called the standard error of the sample mean.
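The tail area in the nicotine example can also be approximated outside Mathematica. The Python sketch below (standard library only; the helper names and the Simpson's-rule integration are illustrative assumptions, not the book's method) integrates the density of Equation (6.30) from 0 to the observed t value and uses the symmetry of the distribution about zero.

```python
import math

def student_t_pdf(t, nu):
    """Student's t density of Equation (6.30)."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(math.pi * nu))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(t * t / nu))

def simpson(f, a, b, n=4000):
    """Composite Simpson's rule on [a, b] with an even number n of intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def student_t_tail(t, nu):
    """p(T > t) for t >= 0: one half minus the area between 0 and t."""
    return 0.5 - simpson(lambda u: student_t_pdf(u, nu), 0.0, t)

# Values from the example: sample mean, claimed mean, sample std. dev., n.
x_bar, mu, s, n = 0.75, 0.6, 0.197, 16
t_obs = (x_bar - mu) / (s / math.sqrt(n))
tail = student_t_tail(t_obs, n - 1)
print(f"t = {t_obs:.3f}, tail area = {tail:.4f}")
```

The tail area should come out near the 0.004 quoted in Box 6.2.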
6.5 F distribution (F-test)

The F distribution is used to find out if two data sets have significantly different variances. For example, we might be interested in the effect of a new catalyst in the brewing of beer so we compare some measurable property of a sample brewed with the catalyst to a sample from the control batch made without the catalyst. What effect has the catalyst had on the variance of this property? Here, we develop the appropriate random variable for use in making inferences about the variances of two independent normal distributions based on a random sample from each. Recall that inferences about $\sigma^2$, when sampling from a normal distribution, are based on the random variable $(n-1)S^2/\sigma^2$, which has a $\chi^2_{n-1}$ distribution.

Theorem 5: Let $X$ and $Y$ be two independent $\chi^2$ random variables with $\nu_1$ and $\nu_2$ degrees of freedom. Then the random variable

$$F = \frac{X/\nu_1}{Y/\nu_2} \tag{6.35}$$

has an F distribution with a probability density function

$$p(f|\nu_1,\nu_2) = \begin{cases} \dfrac{\Gamma[(\nu_1+\nu_2)/2]}{\Gamma(\nu_1/2)\,\Gamma(\nu_2/2)} \left(\dfrac{\nu_1}{\nu_2}\right)^{\frac{\nu_1}{2}} \dfrac{f^{\frac{1}{2}(\nu_1-2)}}{(1+f\nu_1/\nu_2)^{\frac{1}{2}(\nu_1+\nu_2)}}, & (f > 0) \\[2ex] 0, & \text{elsewhere.} \end{cases} \tag{6.36}$$
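An F distribution has the following properties:

$$\langle F \rangle = \frac{\nu_2}{\nu_2 - 2}, \quad (\nu_2 > 2). \tag{6.37}$$

(Surprisingly, $\langle F \rangle$ depends only on $\nu_2$ and not on $\nu_1$.)

$$\mathrm{Var}(F) = \frac{\nu_2^2\,(2\nu_2 + 2\nu_1 - 4)}{\nu_1 (\nu_2 - 2)^2 (\nu_2 - 4)}, \quad (\nu_2 > 4) \tag{6.38}$$

$$\mathrm{Mode} = \frac{\nu_2 (\nu_1 - 2)}{\nu_1 (\nu_2 + 2)}. \tag{6.39}$$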
Let $X = (n_1-1)S_1^2/\sigma_1^2$ and $Y = (n_2-1)S_2^2/\sigma_2^2$. Then,

$$F_{12} = \frac{X/\nu_1}{Y/\nu_2} = \frac{X/(n_1-1)}{Y/(n_2-1)} = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}. \tag{6.40}$$
Box 6.3 Mathematica example:

The sample variance is $s_1^2 = 16.65$ for $n_1 = 6$ IID samples from a normal distribution with a population variance $\sigma_1^2$, and $s_2^2 = 5.0$ for $n_2 = 11$ IID samples from a second independent normal distribution with a population variance $\sigma_2^2$. If we assume that $\sigma_1^2 = \sigma_2^2$, then from Equation (6.40), we obtain $f = 3.33$ for $\nu_1 = n_1 - 1 = 5$ and $\nu_2 = n_2 - 1 = 10$ degrees of freedom. What is the probability of getting an $f$ value $\ge 3.33$ by chance if $\sigma_1^2 = \sigma_2^2$?

    Needs["Statistics`ContinuousDistributions`"]

The following line computes the area in the tail of the F distribution beyond $f = 3.33$.

    (1 - CDF[FRatioDistribution[ν1, ν2], 3.33])  → answer = 0.05

where CDF[FRatioDistribution[ν1, ν2], 3.33] stands for the cumulative distribution function of the F distribution from $f = 0 \to 3.33$. Another way to compute this tail area is with

    FRatioPValue[fratio, ν1, ν2]

The F distribution for this example is shown in Figure 6.4.

What if we had labeled our two measurements of $s$ the other way around so $\nu_1 = 10$, $\nu_2 = 5$ and $s_1^2/s_2^2 = 1/3.33$? The equivalent question is: what is the probability that $f \le 1/3.33$, which we can evaluate by

    CDF[FRatioDistribution[ν1, ν2], 1/3.33]  → answer = 0.05

Not surprisingly, we obtain the same probability.
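Both tail areas in Box 6.3 can be approximated by integrating the density of Equation (6.36). The Python sketch below uses only the standard library; the helper names and the Simpson's-rule integration are illustrative assumptions, and the density is taken to be zero at $f = 0$, which is only appropriate for $\nu_1 > 2$.

```python
import math

def f_pdf(f, nu1, nu2):
    """F density of Equation (6.36); treated as 0 at f <= 0 (nu1 > 2)."""
    if f <= 0.0:
        return 0.0
    log_norm = (math.lgamma((nu1 + nu2) / 2) - math.lgamma(nu1 / 2)
                - math.lgamma(nu2 / 2) + (nu1 / 2) * math.log(nu1 / nu2))
    return math.exp(log_norm + (nu1 / 2 - 1) * math.log(f)
                    - (nu1 + nu2) / 2 * math.log1p(f * nu1 / nu2))

def simpson(f, a, b, n=4000):
    """Composite Simpson's rule on [a, b] with an even number n of intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def f_cdf(x, nu1, nu2):
    """p(F <= x) by numerical integration of the density from 0 to x."""
    return simpson(lambda u: f_pdf(u, nu1, nu2), 0.0, x)

upper_tail = 1.0 - f_cdf(3.33, 5, 10)   # p(F >= 3.33) with nu1 = 5, nu2 = 10
swapped = f_cdf(1.0 / 3.33, 10, 5)      # p(F <= 1/3.33) with the labels swapped
print(f"upper tail = {upper_tail:.4f}, swapped lower tail = {swapped:.4f}")
```

The two numbers should agree with each other and with the 0.05 quoted in the box, since relabeling the samples simply inverts the ratio.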
[Figure 6.4: The F distribution for $\nu_1 = 5$, $\nu_2 = 10$ degrees of freedom (the predicted distribution). The measured value of 3.33, indicated by the line, corresponds to $f_{0.95,5,10}$, the 0.95 quantile value. Horizontal axis: f value; vertical axis: probability density.]
6.6 Confidence intervals

In this section, we consider how to specify the uncertainty of our estimate of any particular parameter of the population, based on the results of our sample. We start by considering the uncertainty in the population mean when it is known that we are sampling from a population with a normal distribution. There are two cases of interest. In the first, we will assume that we know the variance $\sigma^2$ of the underlying population we are sampling from. More commonly, we don’t know the variance and must estimate it from the sample. This is the second case.
6.6.1 Variance $\sigma^2$ known

Let $\{X_i\}$ be an IID $N(\mu,\sigma^2)$ random sample of $n = 10$ measurements from a population with unknown $\mu$ but known $\sigma = 1$. Let $\bar{X}$ be the sample mean random variable, which will have a sample mean standard deviation $\sigma_m = \sigma/\sqrt{n} = 1/\sqrt{10} = 0.32$, to two decimal places. The probability that $\bar{X}$ will be within one $\sigma_m = 0.32$ of $\mu$ is approximately 0.68 (from Section 5.8.1). We can write this as

$$p(\mu - 0.32 < \bar{X} < \mu + 0.32) = 0.68. \tag{6.41}$$

Since we are interested in making inferences about $\mu$ from our sample, we rearrange Equation (6.41) as follows:

$$\begin{aligned}
p(\mu - 0.32 < \bar{X} < \mu + 0.32) &= p(-0.32 < \bar{X} - \mu < 0.32) \\
&= p(0.32 > \mu - \bar{X} > -0.32) \\
&= p(\bar{X} + 0.32 > \mu > \bar{X} - 0.32) = 0.68,
\end{aligned}$$

or,

$$p(\bar{X} - 0.32 < \mu < \bar{X} + 0.32) = 0.68. \tag{6.42}$$

Suppose the measured sample mean is $\bar{x} = 5.40$. Can we simply substitute this value into Equation (6.42), which would yield

$$p(5.08 < \mu < 5.72) = 0.68? \tag{6.43}$$
We need to be careful how we interpret Equations (6.42) and (6.43). Equation (6.42) says that if we repeatedly draw samples of the same size from this population, and each time compute specific values for the random interval $(\bar{X} - 0.32, \bar{X} + 0.32)$, then we would expect 68% of them to contain the unknown mean $\mu$. In frequentist theory, a probability represents the percentage of time that something will happen. It says nothing directly about the probability that any one realization of a random interval will contain $\mu$. The specific interval (5.08, 5.72) is but one realization of the random interval $(\bar{X} - 0.32, \bar{X} + 0.32)$ based on the data of a
single sample. Since the probability of 0.68 is with reference to the random interval $(\bar{X} - 0.32, \bar{X} + 0.32)$, it would be incorrect to say that the probability of $\mu$ being contained in the interval (5.08, 5.72) is 0.68. However, the 0.68 probability of the random interval does suggest that our confidence in the interval (5.08, 5.72) for containing the unknown mean $\mu$ is high and we refer to it as a confidence interval. It is only in this sense that we are willing to assign a degree of confidence in the statement $5.08 < \mu < 5.72$.

Meaning of a confidence interval: When we write $p(5.08 < \mu < 5.72)$, we are not making a probability statement in a classical sense but rather are expressing a degree of confidence. In general, we write $p(5.08 < \mu < 5.72) = 1-\alpha$ where $1-\alpha$ is called the confidence coefficient. It is important to remember that the "68% confidence" refers to the probability of the test, not to the parameter.

If you listen closely to the results of a political poll, you will hear something like the following: "In a recent poll, 55% of a sample of 800 voters indicated they would vote for the Liberals. These results are reliable within $\pm 3.5$%, 19 times out of 20." What this means is that if you repeated the poll using the same methodology, then 95% (19 out of 20) of the time you would get the same result within 3.5%. In this case, the 95% confidence interval is 51.5 to 58.5%. A Bayesian analysis of the same polling data was given in Section 4.2.3.

Figure 6.5 shows 68% confidence intervals for the means of 20 samples of a random normal distribution with $\mu = 5.0$ and $\sigma = 1.0$. Each sample consists of ten measurements. Notice that 13 out of 20 intervals contain the true mean of 5. The number expected for 68% confidence intervals is 13.6.
[Figure 6.5: 68% confidence intervals for the means of 20 samples of a random normal distribution with $\mu = 5.0$ and $\sigma = 1.0$. Each sample consists of ten measurements. Horizontal axis: confidence intervals, spanning roughly 4.25 to 5.75.]
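The frequentist meaning of the interval can be made concrete with a simulation in the spirit of Figure 6.5, but with many more repeats. The Python sketch below (standard library only; the function name, seed, and number of trials are illustrative assumptions) repeatedly draws samples of ten measurements from $N(5.0, 1.0)$, forms the interval $\bar{x} \pm \sigma/\sqrt{n}$, and counts how often it contains the true mean.

```python
import random

def coverage(mu=5.0, sigma=1.0, n=10, trials=20000, seed=0):
    """Fraction of intervals x_bar +/- sigma/sqrt(n) that contain mu."""
    rng = random.Random(seed)
    half_width = sigma / n ** 0.5          # one sigma_m, about 0.32 for n = 10
    hits = 0
    for _ in range(trials):
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if x_bar - half_width < mu < x_bar + half_width:
            hits += 1
    return hits / trials

frac = coverage()
print(f"fraction of intervals containing the true mean: {frac:.3f}")
```

The fraction should settle near 0.68. Any single interval either contains $\mu$ or it does not; the 0.68 describes the long-run behavior of the procedure.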
A general procedure for finding confidence intervals: If we wish to find a general procedure for finding confidence intervals, we must first return to Equation (6.42).

$$p(-0.32 < \bar{X} - \mu < 0.32) = 0.68. \tag{6.44}$$

More generally, we can write

$$p(L_1 < \bar{X} - \mu < L_2) = 1-\alpha, \tag{6.45}$$

where $L_1$ and $L_2$ stand for the lower and upper limits of our confidence interval. We need to develop expressions for $L_1$ and $L_2$. The limits are obtained from our sampling distribution, which in this particular case is the sampling distribution for the sample mean random variable. Recall that $\bar{X}$ is $N(\mu, \sigma^2/n)$, so the distribution of the standard random variable $Z = (\bar{X}-\mu)\sqrt{n}/\sigma$ is $N(0,1)$. Figure 6.6 shows the distribution of $Z$. In terms of $Z$, we can write Equation (6.45) as

$$p\left(\frac{L_1}{\sigma/\sqrt{n}} < Z < \frac{L_2}{\sigma/\sqrt{n}}\right) = 1-\alpha. \tag{6.46}$$

The desired sampling distribution is

$$f(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right). \tag{6.47}$$
[Figure 6.6: The expected distribution for the standard random variable $Z = (\bar{X}-\mu)\sqrt{n}/\sigma$, which is $N(0,1)$. The lower ($L_1$) and upper ($L_2$) boundaries of the $1-\alpha$ confidence interval are indicated by the vertical lines. The location of $L_1$ is set by the requirement that the shaded area to the left of $L_1$ is equal to $\alpha/2$. Similarly, the shaded area to the right of $L_2$ is equal to $\alpha/2$. Horizontal axis: $Z$; vertical axis: probability density.]
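The limits $L_1$ and $L_2$ are evaluated from the following two equations:

$$\int_{-\infty}^{\frac{L_1}{\sigma/\sqrt{n}}} f(z)\,dz = \frac{\alpha}{2}, \tag{6.48}$$

and

$$\int_{\frac{L_2}{\sigma/\sqrt{n}}}^{+\infty} f(z)\,dz = \frac{\alpha}{2}. \tag{6.49}$$

Let $Z_{\frac{\alpha}{2}} = L_1\sqrt{n}/\sigma$ and $Z_{1-\frac{\alpha}{2}} = L_2\sqrt{n}/\sigma$. Then

$$p(Z_{\frac{\alpha}{2}} < Z < Z_{1-\frac{\alpha}{2}}) = 1-\alpha. \tag{6.50}$$

It follows that

$$L_1 = Z_{\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}} \quad \text{and} \quad L_2 = Z_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}.$$

But for a standard normal, $Z_{\frac{\alpha}{2}} = -Z_{1-\frac{\alpha}{2}}$; therefore,

$$L_1 = -L_2 = -Z_{1-\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}. \tag{6.51}$$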
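We can now generalize Equations (6.41) and (6.42) and write

$$p\left(\mu - \frac{\sigma}{\sqrt{n}} Z_{1-\frac{\alpha}{2}} < \bar{X} < \mu + \frac{\sigma}{\sqrt{n}} Z_{1-\frac{\alpha}{2}}\right) = 1-\alpha, \tag{6.52}$$

and

$$p\left(\bar{X} - \frac{\sigma}{\sqrt{n}} Z_{1-\frac{\alpha}{2}} < \mu < \bar{X} + \frac{\sigma}{\sqrt{n}} Z_{1-\frac{\alpha}{2}}\right) = 1-\alpha. \tag{6.53}$$

Therefore, the $100(1-\alpha)$% confidence interval for $\mu$ is

$$\bar{x} \pm \frac{\sigma}{\sqrt{n}} Z_{1-\frac{\alpha}{2}}. \tag{6.54}$$

Clearly, the larger the sample size, the smaller the width of the interval.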
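Equation (6.54) translates directly into a few lines of code. The Python sketch below uses `statistics.NormalDist` from the standard library to obtain the quantile $Z_{1-\alpha/2}$; the function name `mean_ci_known_sigma` is a hypothetical helper introduced here for illustration.

```python
import math
from statistics import NormalDist

def mean_ci_known_sigma(x_bar, sigma, n, confidence=0.68):
    """Confidence interval of Equation (6.54) for a known population sigma."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # the quantile Z_{1 - alpha/2}
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Numbers from the running example: x_bar = 5.40, sigma = 1, n = 10.
lo, hi = mean_ci_known_sigma(5.40, 1.0, 10)
print(f"68% confidence interval: ({lo:.3f}, {hi:.3f})")
```

The result differs slightly in the third significant figure from the interval (5.08, 5.72) used earlier, because that interval was built from exactly one $\sigma_m$ rounded to 0.32, whereas an exact 0.68 confidence level corresponds to a quantile slightly below 1.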
Box 6.4 Mathematica example:

We can compute the 68% confidence interval for the mean of a population, with known variance $\sigma^2 = 0.1$, from a list of data values with the following commands:

    Needs["Statistics`ConfidenceIntervals`"]

The line above loads the confidence intervals package and the line below computes the confidence interval for a normal distribution.

    MeanCI[data, KnownVariance -> 0.1, ConfidenceLevel -> 0.68]

where data is a list of the sample data values. If the variance is unknown, leave out the KnownVariance -> 0.1 option and then the confidence interval will be based on a Student’s t distribution.
6.6.2 Confidence intervals for $\mu$, unknown variance

Again, we know that the distribution of $\bar{X}$ is $N(\mu, \sigma^2/n)$, but since we do not know $\sigma$ we are unable to use this distribution to compute the desired confidence interval. Fortunately, we can obtain the confidence interval using the Student’s t statistic, which makes use of the sample variance which we can compute. Recall that the random variable $T = (\bar{X}-\mu)\sqrt{n}/S$ has a Student’s t distribution with $(n-1)$ degrees of freedom. Figure 6.7 shows a Student’s t distribution for $\nu = n-1 = 9$ degrees of freedom.

[Figure 6.7: The Student’s t distribution for the $T = (\bar{X}-\mu)\sqrt{n}/S$ statistic, for $\nu = n-1 = 9$ degrees of freedom. The lower ($L_1$) and upper ($L_2$) boundaries of the $1-\alpha$ confidence interval are indicated by the vertical lines. The location of $L_1$ is set by the requirement that the shaded area to the left of $L_1$ is equal to $\alpha/2$. Similarly, the shaded area to the right of $L_2$ is equal to $\alpha/2$. Horizontal axis: $T$; vertical axis: probability density.]

For a Student’s t distribution, the $t_{1-\frac{\alpha}{2},\,n-1}$ quantile value is defined by the equation

$$p(-t_{1-\frac{\alpha}{2},\,n-1} < T < t_{1-\frac{\alpha}{2},\,n-1}) = 1-\alpha, \tag{6.55}$$

which we can rewrite as

$$p\left(-\frac{s}{\sqrt{n}}\, t_{1-\frac{\alpha}{2},\,n-1} < \bar{X} - \mu < \frac{s}{\sqrt{n}}\, t_{1-\frac{\alpha}{2},\,n-1}\right) = 1-\alpha. \tag{6.56}$$

We can obtain values for $L_1$ and $L_2$ by comparing Equation (6.56) to Equation (6.45). This yields $L_1 = -L_2 = -(s/\sqrt{n})\, t_{1-\frac{\alpha}{2},\,n-1}$. The final $1-\alpha$ confidence interval is

$$\bar{x} \pm \frac{s}{\sqrt{n}}\, t_{1-\frac{\alpha}{2},\,n-1}. \tag{6.57}$$

Box 6.5 Mathematica example:
In the introduction to this chapter, we posed a problem concerning the mean redshift of a population of cosmic gamma-ray bursts (GRB), based on a sample of seven measured GRB redshifts (the number known at the time of writing). The redshifts are: 1.61, 0.0083, 1.619, 0.835, 3.420, 1.096, 0.966. We now want to determine the 68% confidence interval for the mean redshift for the population of GRB sources. We neglect the uncertainties in the individual measured redshifts as they are much smaller than the spread of the seven values.

Although we do not know the probability density function for the population, we know from the CLT that the distribution of the sample mean random variable ($\bar{X}$) will be approximately normal for $n = 7$. Furthermore, $\langle \bar{X} \rangle = \mu$, the mean of the population, and $\mathrm{Var}(\bar{X}) = \sigma^2/n$, where $\sigma^2$ is the variance of the population. Since we do not know $\sigma^2$, we use the measured sample variance,

$$s^2 = \sum_{i=1}^{7} \frac{(x_i - \bar{x})^2}{n-1}.$$

We can thus evaluate the Student’s t value for our sample and use this to arrive at our 68% confidence interval for the population mean redshift (recall Equation (6.56)).

    data = {1.61, 0.0083, 1.619, 0.835, 3.420, 1.096, 0.966}
    MeanCI[data, ConfidenceLevel -> 0.68]  = {0.93, 1.80}

In this case, we have left out the KnownVariance option to the confidence interval command, MeanCI, because the population variance is unknown. MeanCI now returns a confidence interval based on the Student’s t distribution.
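The interval of Equation (6.57) for the redshift data can be reproduced without Mathematica. The Python sketch below relies only on the standard library, so the Student's t quantile is found by bisection on a numerically integrated CDF; the helper names and this particular numerical approach are illustrative assumptions, not the method used by MeanCI.

```python
import math
import statistics

def student_t_pdf(t, nu):
    """Student's t density of Equation (6.30)."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(math.pi * nu))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(t * t / nu))

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with an even number n of intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def student_t_cdf(t, nu):
    """p(T <= t) for t >= 0, using the symmetry of the density about zero."""
    return 0.5 + simpson(lambda u: student_t_pdf(u, nu), 0.0, t)

def student_t_quantile(p, nu):
    """Quantile for 0.5 < p < 1, found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while student_t_cdf(hi, nu) < p:      # expand until the root is bracketed
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if student_t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mean_ci(data, confidence=0.68):
    """Confidence interval of Equation (6.57) for an unknown variance."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)            # uses the n - 1 divisor
    t_q = student_t_quantile(1.0 - (1.0 - confidence) / 2.0, n - 1)
    half_width = t_q * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

redshifts = [1.61, 0.0083, 1.619, 0.835, 3.420, 1.096, 0.966]
lo, hi = mean_ci(redshifts)
print(f"68% confidence interval: ({lo:.2f}, {hi:.2f})")
```

The output can be compared with the interval {0.93, 1.80} quoted in Box 6.5.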
6.6.3 Confidence intervals: difference of two means

One of the most fundamental problems that occurs in experimental science is that of analyzing two independent measurements of the same physical quantity, one "control" and one "trial," taken under slightly different experimental conditions, e.g., drug testing. Here, we are interested in computing confidence intervals for the difference in the means of the control population and the trial population, when sampling from two independent normal distributions.

1) If $\mu_x$ and $\mu_y$ are unknown and $\sigma_x$ and $\sigma_y$ are known, then the random variable

$$Z = \frac{\bar{X} - \bar{Y} - (\mu_x - \mu_y)}{\sqrt{\sigma_x^2/n_x + \sigma_y^2/n_y}} \tag{6.58}$$

has a normal distribution $N(0,1)$. The $100(1-\alpha)$% confidence interval for $(\mu_x - \mu_y)$ is

$$\bar{x} - \bar{y} \pm Z_{1-\frac{\alpha}{2}} \sqrt{\frac{\sigma_x^2}{n_x} + \frac{\sigma_y^2}{n_y}}, \tag{6.59}$$

where the quantile $Z_{1-\frac{\alpha}{2}}$ is such that

$$p(Z \le Z_{1-\frac{\alpha}{2}}) = 1 - \frac{\alpha}{2}. \tag{6.60}$$
At this point you may be asking yourself what value of $\alpha$ to use in presenting your results. There are really two types of questions we might be interested in. First, do the data indicate that the means are significantly different? This type of question is addressed in Chapter 7, which deals with hypothesis testing. The other type of question, which is being addressed here, concerns estimating the difference of the two means. For this question, it is common practice to use $\alpha = 0.32$, corresponding to a 68% confidence interval. We will look at this issue again in more detail in Section 7.2.1.

2) If $\sigma_x$ and $\sigma_y$ are unknown but assumed equal, the random variable

$$T = \frac{\bar{X} - \bar{Y} - (\mu_x - \mu_y)}{S_D \sqrt{1/n_x + 1/n_y}}$$

has a Student’s t distribution with $(n_x + n_y - 2)$ degrees of freedom, where

$$S_D^2 = \frac{[(n_x - 1)S_x^2 + (n_y - 1)S_y^2]}{(n_x + n_y - 2)}. \tag{6.61}$$

The $100(1-\alpha)$% confidence interval is

$$\bar{x} - \bar{y} \pm s_p\, t_{1-\frac{\alpha}{2},\,(n_x+n_y-2)}, \tag{6.62}$$

where $s_p = s_D\sqrt{1/n_x + 1/n_y}$ is the measured value of the denominator of $T$.
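3) If $\sigma_x$ and $\sigma_y$ are unknown and assumed to be unequal, the random variable

$$T = \frac{\bar{X} - \bar{Y} - (\mu_x - \mu_y)}{S_D}$$

is distributed approximately as Student’s t with $\nu$ degrees of freedom (Press et al., 1992), where

$$S_D^2 = \frac{S_x^2}{n_x} + \frac{S_y^2}{n_y}, \tag{6.63}$$

and,

$$\nu = \frac{\left[\dfrac{S_x^2}{n_x} + \dfrac{S_y^2}{n_y}\right]^2}{\dfrac{[S_x^2/n_x]^2}{n_x - 1} + \dfrac{[S_y^2/n_y]^2}{n_y - 1}}. \tag{6.64}$$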
See Section 7.2.2 for a worked example of the use of the Student’s t test for these conditions.
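Equations (6.61), (6.63) and (6.64) are easy to mistype, so a small numerical sketch may help. The Python functions below (hypothetical helper names, standard library only) compute the pooled $S_D$ for the equal-variance case and the $S_D$ and effective degrees of freedom $\nu$ for the unequal-variance case from the two sample variances and sample sizes.

```python
import math

def pooled_sd(sx2, nx, sy2, ny):
    """S_D of Equation (6.61), for variances assumed equal."""
    return math.sqrt(((nx - 1) * sx2 + (ny - 1) * sy2) / (nx + ny - 2))

def unequal_variance_sd_and_dof(sx2, nx, sy2, ny):
    """S_D of Equation (6.63) and degrees of freedom nu of Equation (6.64)."""
    vx, vy = sx2 / nx, sy2 / ny
    sd = math.sqrt(vx + vy)
    nu = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return sd, nu

# Illustrative inputs (not from the text): equal variances and sample sizes.
sd, nu = unequal_variance_sd_and_dof(4.0, 10, 4.0, 10)
print(f"S_D = {sd:.4f}, nu = {nu:.1f}, pooled S_D = {pooled_sd(4.0, 10, 4.0, 10):.1f}")
```

For equal sample variances and equal sample sizes, Equation (6.64) reduces to $\nu = n_x + n_y - 2$, the same number of degrees of freedom as in the equal-variance case; unequal variances or sizes give a smaller $\nu$.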
6.6.4 Confidence intervals for $\sigma^2$

Here, we are interested in computing confidence intervals for $\sigma^2$ when sampling from a normal distribution with unknown mean. Recall that $(n-1)S^2/\sigma^2$ is $\chi^2_{n-1}$. Then it follows that the $100(1-\alpha)$% interval for $\sigma^2$ is

$$\left[\frac{(n-1)s^2}{\chi^2_{1-\frac{\alpha}{2},\,n-1}},\; \frac{(n-1)s^2}{\chi^2_{\frac{\alpha}{2},\,n-1}}\right]. \tag{6.65}$$

Box 6.6 Mathematica example:

We can compute the 68% confidence interval for the variance of a population with unknown mean, from a list of data values designated by data, with the following command:

    VarianceCI[data, ConfidenceLevel -> 0.68]
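Equation (6.65) needs two $\chi^2$ quantiles. As a rough standard-library alternative to VarianceCI, the Python sketch below finds them by bisection on a numerically integrated CDF. The helper names, the numerical approach, and the example inputs ($s^2 = 2.5$, $n = 10$) are illustrative assumptions; the density handling at $x = 0$ assumes $\nu \ge 2$.

```python
import math

def chi2_pdf(x, nu):
    """Chi-squared density of Equation (6.4); assumes nu >= 2."""
    if x < 0.0:
        return 0.0
    if x == 0.0:
        return 0.5 if nu == 2 else 0.0    # limiting value at the origin
    log_norm = math.lgamma(nu / 2.0) + (nu / 2.0) * math.log(2.0)
    return math.exp((nu / 2.0 - 1.0) * math.log(x) - x / 2.0 - log_norm)

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with an even number n of intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def chi2_cdf(x, nu):
    """p(X <= x), the integral of Equation (6.28)."""
    return simpson(lambda t: chi2_pdf(t, nu), 0.0, x)

def chi2_quantile(p, nu):
    """Quantile for 0 < p < 1, found by bisection on the CDF."""
    lo, hi = 0.0, nu + 10.0
    while chi2_cdf(hi, nu) < p:           # expand until the root is bracketed
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def variance_ci(s2, n, confidence=0.68):
    """Confidence interval of Equation (6.65) for the population variance."""
    alpha = 1.0 - confidence
    nu = n - 1
    return (nu * s2 / chi2_quantile(1.0 - alpha / 2.0, nu),
            nu * s2 / chi2_quantile(alpha / 2.0, nu))

lo, hi = variance_ci(s2=2.5, n=10)        # hypothetical sample variance and size
print(f"68% confidence interval for the variance: ({lo:.2f}, {hi:.2f})")
```

Note that the interval is not symmetric about $s^2$, reflecting the skewness of the $\chi^2$ distribution.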
6.6.5 Confidence intervals: ratio of two variances

In this section, we want to determine confidence intervals for the ratio of two variances when sampling from two independent normal distributions. In this case, the $100(1-\alpha)$% confidence interval for $\sigma_y^2/\sigma_x^2$ is

$$\left[\frac{s_y^2}{s_x^2}\, \frac{1}{f_{1-\frac{\alpha}{2},\,n_y-1,\,n_x-1}},\; \frac{s_y^2}{s_x^2}\, f_{1-\frac{\alpha}{2},\,n_x-1,\,n_y-1}\right]. \tag{6.66}$$

Box 6.7 Mathematica example:

We can compute the 68% confidence interval for the ratio of the population variance of data1 to the population variance of data2 with the following command:

    VarianceRatioCI[data1, data2, ConfidenceLevel -> 0.68]
6.7 Summary

In this chapter, we introduced the $\bar{X}$, $\chi^2$, $S^2$, Student’s t and F statistics and showed how they are useful in making inferences about the mean ($\mu$) and variance ($\sigma^2$) of an underlying population from a random sample. In the frequentist camp, the usefulness of any particular statistic stems from our ability to predict its distribution for a very large number of hypothetical repeats of the experiment. In this chapter we have assumed that each of the data sets, either real or hypothetical, is an IID sample from a normal distribution, in some cases appealing to the Central Limit Theorem to satisfy this requirement. For IID normal samples, we found that:

1. The statistic

$$S^2 = \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{n-1}$$

is an unbiased estimator of the population variance, $\sigma^2$.

2. The sampling distribution of

$$(n-1)\frac{S^2}{\sigma^2} = \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{\sigma^2}$$

is $\chi^2$ with $(n-1)$ degrees of freedom. This statistic has a wide range of uses that we will learn in subsequent chapters (e.g., Method of Least Squares) and is clearly useful in specifying confidence intervals for estimates of $\sigma^2$ based on the sample variance $S^2$.

3. One familiar form of the Student’s t statistic is

$$T = \frac{(\bar{X} - \mu)}{(S/\sqrt{n})}.$$

The statistic is particularly useful for computing confidence intervals for the population mean, $\mu$, based on the sample $\bar{X}$ and $S^2$. In Chapter 9, we will see the Student’s t distribution reappear in a Bayesian context characterizing the posterior PDF for a particular state of information.
4. The F statistic is given by

$$F_{12} = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}.$$

We saw how this can be used to compare the ratio of the variances of two populations from a measurement of the sample variance of each population.
Once the distribution of the statistic is specified, it is possible to interpret the significance of the particular value of the statistic corresponding to our one actual measured sample. For example, we were able to test a cigarette manufacturer’s claim about the mean nicotine content by comparing the value of the Student’s t statistic for a measured sample to the distribution for a hypothetical reference set. In case you didn’t fully comprehend this line of reasoning, we will go into it in more detail in the next chapter, which deals with frequentist hypothesis testing.

Throughout this chapter, the expression "number of degrees of freedom" has cropped up in connection with each choice of statistic. Its precise meaning is given in the definitions of the $\chi^2$, Student’s t, and F distributions. Roughly, what it translates to in practice is the number of data points (or data bins if the data are binned) in the sample used to compute the statistic, minus the number of additional parameters (like $S^2$ in the Student’s t statistic) that have to be estimated from the same sample.

A major part of inferring the values of the $\mu$ and $\sigma^2$ of the population concerns their estimated uncertainties. In the frequentist case, this amounts to estimating a confidence interval, e.g., a 68% confidence interval. Keep in mind that a frequentist confidence interval says nothing directly about the probability that a single confidence interval, derived from your one actual measured sample, will contain the true population value. Then what does the 68% confidence mean? It means that if you repeated the measurement a large number of times, each time computing a 68% confidence interval from the new sample, then 68% of these intervals will contain the true value.

6.8 Problems

1. Suppose you are given the IID normal data sample {0.753, 3.795, 4.827, 2.025}. Compute the sample variance and standard deviation. What is the standard error of the sample mean?

2. What is the 95% confidence interval for the mean of the IID normal sample {0.753, 3.795, 4.827, 2.025}?

3. Compute the area in the tail of a $\chi^2$ distribution to the right of $\chi^2 = 30.61$, for $\nu = 10$ degrees of freedom.

4. The sample variance is $s_1^2 = 16.65$ for $n_1 = 6$ IID samples from a normal distribution with a population variance $\sigma_1^2$. Also, $s_2^2 = 5.0$ for $n_2 = 11$ IID samples from a second independent normal distribution with a population variance $\sigma_2^2$. If we assume that $\sigma_1^2 = 2\sigma_2^2$, then from Equation (6.40), we obtain $f = 1.665$ for $\nu_1 = n_1 - 1 = 5$ and $\nu_2 = n_2 - 1 = 10$ degrees of freedom. What is the probability of getting an $f$ value $\ge 3.33$ by chance if $\sigma_1^2 = 2\sigma_2^2$?
7 Frequentist hypothesis testing
7.1 Overview

One of the main objectives in science is that of inferring the truth of one or more hypotheses about how some aspect of nature works. Because we are always in a state of incomplete information, we can never prove any hypothesis (theory) is true. In Bayesian inference, we can compute the probabilities of two or more competing hypotheses directly for our given state of knowledge. In this chapter, we will explore the frequentist approach to hypothesis testing, which is considerably less direct. It involves considering each hypothesis individually and deciding whether to (a) reject the hypothesis, or (b) fail to reject the hypothesis, on the basis of the computed value of a suitable choice of statistic. This is a very big subject and we will give only a limited selection of examples in an attempt to convey the main ideas. The decision on whether to reject a hypothesis is commonly based on a quantity called a P-value. At the end of the chapter we discuss a serious problem with frequentist hypothesis testing, called the “optional stopping problem.”
7.2 Basic idea

In hypothesis testing we are interested in making inferences about the truth of some hypothesis. Two examples of hypotheses which we analyze below are:

* The radio emission from a particular galaxy is constant.
* The mean concentration of a particular toxin in river sediment is the same at two locations.
Recall that in frequentist statistics, the argument of a probability is restricted to a random variable. Since a hypothesis is either true or false, it cannot be considered a random variable and therefore we must indirectly infer the truth of the hypothesis (in contrast to the direct Bayesian approach). In the river toxin example, we proceed by assuming that the mean concentration is the same for both locations and call this the null hypothesis. We then choose a statistic, such as the sample mean, that can be computed from our one actual data set. The value of the statistic can also be computed in principle for a very large number of
hypothetical repeated measurements of the river sediment under identical conditions. Our choice of statistic must be one whose distribution is predictable for this reference set of hypothetical repeats, assuming the truth of our null hypothesis. We then compare the actual value of the statistic, computed from our one actual data set, to the predicted reference distribution. If it falls in a very unlikely spot (i.e., way out in the tail of the predicted distribution) we choose to reject the null hypothesis at some confidence level on the basis of the measured data set. If the statistic falls in a reasonable part of the distribution, this does not mean that we accept the hypothesis; only that we fail to reject it. At best, we can substantiate a particular hypothesis by failing to reject it and rejecting every other competing hypothesis that has been proposed. It is an argument by contradiction designed to show that the null hypothesis will lead to an absurd conclusion and should therefore be rejected on the basis of the measured data set. It is not even logically correct to say we have disproved the hypothesis, because, for any one data set it is still possible by chance that the statistic will fall in a very unlikely spot far out in the tail of the predicted distribution. Instead, we choose to reject the hypothesis because we consider it more fruitful to consider others.
7.2.1 Hypothesis testing with the $\chi^2$ statistic

Figure 7.1 shows radio flux density measurements of a radio galaxy over a span of 6100 days made at irregular intervals of time. The observations were obtained as part of a project to study the variability of galaxies at radio wavelengths. The individual radio flux density measurements are given in Table 7.1.
Figure 7.1 Radio astronomy measurements of a galaxy over time. (Axes: radio flux density versus time in days.)
Table 7.1 Radio astronomy flux density measurements for a galaxy.

Day Number    Flux Density (mJy)
     0.0            14.2
   718.0             5.0
  1097.0             3.3
  1457.1            15.5
  2524.1             4.2
  3607.7             9.2
  3630.1             8.2
  4033.1             3.2
  4161.3             5.6
  5355.9             9.9
  5469.1             7.4
  6012.4             6.9
  6038.3            10.0
  6063.2             5.8
  6089.3            11.4
Below we outline the steps involved in the current hypothesis test:

1. Choose as our null hypothesis that the galaxy has an unknown but constant flux density. If we can demonstrate that this hypothesis is absurd at say the 95% confidence level, then this provides indirect evidence that the radio emission is variable. Previous experience with the measurement apparatus indicates that the measurement errors are independently normal with $\sigma = 2.7$.

2. Select a suitable statistic that (a) can be computed from the measurements, and (b) has a predictable distribution. More precisely, (b) means that we can predict the distribution of values of the statistic that we would expect to obtain from an infinite number of repeats of the above set of radio measurements under identical conditions. We will refer to these as our hypothetical reference set. More specifically, we are predicting a probability distribution for this reference set. To refute the null hypothesis, we will need to show that the scatter of the individual measurements about the mean is larger than would be expected from measurement errors alone. A useful measure of the scatter in the measurements is the sample variance. We know from Section 6.3 that the random variable $(n - 1)S^2/\sigma^2$ (usually called the $\chi^2$ statistic) has a $\chi^2$ distribution with $(n - 1)$ degrees of freedom when the measurement errors are known to be independently normal. From Equation (6.4), it is clear that the distribution depends only on the number of degrees of freedom.

3. Evaluate the $\chi^2$ statistic from the measured data. Let’s start with the expression for the $\chi^2$ statistic for our data set:
$$\chi^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{\sigma^2}, \tag{7.1}$$

where $x_i$ represents the ith flux density value, $\bar{x} = 7.98$ mJy is the average of our sample values, and $\sigma = 2.7$, as given above. The number of degrees of freedom $\nu = n - 1 = 15 - 1 = 14$, where n is the number of flux density measurements.¹ Equation (7.1) becomes

$$\chi^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{\sigma^2} = \sum_{i=1}^{n} \frac{(x_i - 7.98)^2}{2.7^2} = 26.76. \tag{7.3}$$

Figure 7.2 The $\chi^2$ distribution predicted on the basis of our null hypothesis with 14 degrees of freedom. The value computed from the measurements, $\chi^2 = 26.76$, is indicated by the vertical bar. (Axes: probability density versus $\chi^2$.)
4. Plot the computed value of $\chi^2 = 26.76$ on the $\chi^2$ distribution predicted for 14 degrees of freedom. This is shown in Figure 7.2. The $\chi^2$ computed from our one actual data set is shown by the vertical line. The question of how unlikely this value of $\chi^2$ is, is usually interpreted in terms of the area in the tail of the $\chi^2$ distribution to the right of this line, which is called the P-value or significance. We can evaluate this from

$$\text{P-value} = 1 - F(\chi^2) = 1 - \int_0^{\chi^2} \frac{1}{\Gamma(\nu/2)\, 2^{\nu/2}}\, x^{\frac{\nu}{2} - 1} \exp\left(-\frac{x}{2}\right) dx, \tag{7.4}$$

where $F(\chi^2)$ is the cumulative $\chi^2$ distribution. Alternatively, we can evaluate the P-value with the following Mathematica command.
¹ The null hypothesis did not specify the assumed value for $\mu$, the constant flux density, so we estimated it from the mean of the data. Whenever we estimate a model parameter from the data, we lose one degree of freedom. If the null hypothesis had specified $\mu$, then we would have used the following expression for the $\chi^2$ statistic:

$$\chi^2 = \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^2}, \tag{7.2}$$

which has a $\chi^2$ distribution with n degrees of freedom.
Box 7.1 Mathematica $\chi^2$ P-value

    Needs["Statistics`HypothesisTests`"]

The line above loads a package containing a wide range of hypothesis tests, and the following line computes the P-value for $\chi^2 = 26.76$ and 14 degrees of freedom:

    ChiSquarePValue[26.76, 14] → 0.02

Note: the ChiSquarePValue has a maximum value of 0.5 and will measure the area in the lower tail of the distribution if $\chi^2_{\rm measured}$ falls in the lower half of the distribution. In the current problem, we want to be sure we measure the area to the right of $\chi^2_{\rm measured}$, so use the command:

    GammaRegularized[(N - M)/2, χ²/2] = GammaRegularized[14/2, 26.76/2]

where N is the number of data points and M is the number of parameters estimated from the data.

Note: in some problems, it is relevant to use a two-sided test (see Section 7.2.3) using the $\chi^2$ statistic, e.g., testing that the population variance is equal to a particular value.

5. Finally, compute our confidence in rejecting the null hypothesis, which is equal to the area of the $\chi^2$ distribution to the left of $\chi^2 = 26.76$. This area is equal to (1 − P-value) = 0.98 or 98%.
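Steps 4 and 5 can also be cross-checked outside Mathematica. The following is a minimal Python sketch, using only the standard library, that evaluates the upper-tail area of Equation (7.4). The function name `chi2_upper_tail` is an assumption made for this illustration and is not part of any package mentioned in the text. It is restricted to integer degrees of freedom and builds the tail area from the standard recurrence $Q(a + 1, x) = Q(a, x) + x^a e^{-x}/\Gamma(a + 1)$ for the regularized upper incomplete gamma function, which is the quantity GammaRegularized returns.

```python
import math

def chi2_upper_tail(chi2, nu):
    """Area of the chi-square density to the right of chi2 (integer nu >= 1).

    Illustrative helper (name is hypothetical): evaluates Q(nu/2, chi2/2).
    """
    if chi2 <= 0.0:
        return 1.0
    h = chi2 / 2.0
    if nu % 2 == 0:
        a, q = 1.0, math.exp(-h)              # starting value Q(1, h), i.e. nu = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))   # starting value Q(1/2, h), i.e. nu = 1
    # Step a -> a + 1 until a reaches nu/2.
    while 2.0 * a < nu:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q

p_value = chi2_upper_tail(26.76, 14)   # chi-square value and nu from Equation (7.3)
confidence = 1.0 - p_value
print(f"P-value = {p_value:.4f}, confidence = {confidence:.3f}")
```

Under these assumptions the result should agree with the rounded values of 0.02 and 98% quoted in Box 7.1 and step 5.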
While the above recipe is easy to compute, it undoubtedly contains many perplexing features. Foremost among them is the strangely convoluted definition of the key determinant of falsification, the P-value, also known as the significance $\alpha$. What precisely does the P-value mean? It means that if the flux density of this galaxy is really constant, and we repeatedly obtained sets of 15 measurements under the same conditions, then only 2% of the $\chi^2$ values derived from these sets would be expected to be greater than our one actual measured value of 26.76. At this point, you may be asking yourself why we should care about a probability involving results never actually obtained, or how we choose a P-value to reject the null hypothesis. In some areas of science, a P-value threshold of 0.05 (confidence of 95%) is used; in other areas, the accepted threshold for rejection is 0.01,² i.e., it depends on the scientific culture you are working in.

² Note: because experimental errors are frequently underestimated, and hence $\chi^2$ values overestimated, it is not uncommon to require a P-value < 0.001 before rejecting a hypothesis.

Unfortunately, P-values are often incorrectly viewed as the probability that the hypothesis is true. There is no objective means for deciding the latter without specifying an alternative hypothesis, $H_1$, to the null hypothesis. The point is that any particular P-value might arise even if the alternative hypothesis is true.³ The concept of an alternative hypothesis is introduced in Section 7.2.3. In Section 9.3, we will consider a Bayesian analysis of the galaxy variability problem.

There is another useful way of expressing a statistical conclusion like that of the above hypothesis test. Instead of the P-value, we can measure how far out in the tail the statistic falls in units of $\sigma = \sqrt{\text{variance}}$ of the reference distribution. For example, the variance of a $\chi^2$ distribution $= 2\nu = 28$ and the expectation value of $\chi^2 = \langle \chi^2 \rangle = \nu = 14$. Therefore:

$$\frac{\chi^2_{\rm obs} - \langle \chi^2 \rangle}{\sigma} = \frac{\chi^2_{\rm obs} - \langle \chi^2 \rangle}{\sqrt{2\nu}} = \frac{26.76 - 14}{\sqrt{28}} \approx 2.4.$$

In many branches of science, a minimum of a $3\sigma$ effect is required for a claim to be taken with any degree of seriousness. More often, referees of scientific journals will require a $5\sigma$ result to recommend publication. It depends somewhat on how difficult or expensive it is to get more data.

Now suppose we were studying a sample of 50 galaxies for evidence of variable radio emission. If all 50 galaxies were actually constant, then for a confidence threshold of 98%, we would expect to detect only one false variable in the sample by chance. If we found that ten galaxies had $\chi^2$ values that exceeded the 98% quantile value, then we would expect that nine of them were not constant. If, on the other hand, we were studying a sample of $10^4$ galaxies, we would expect to detect approximately 200 false variables.

It is easy to see how to extend the use of the $\chi^2$ test described above to other more complex situations. Suppose we had reason to believe that the radio flux density was decreasing linearly with time at a known rate, m, with respect to some reference time, $t_0$. In that case,

$$\chi^2 = \sum_{i=1}^{n} \frac{(x_i - [m(t_i - t_0)])^2}{\sigma^2}, \tag{7.5}$$

where $t_i$ is the time of the ith sample. We will have much more use for the $\chi^2$ statistic in later chapters dealing with linear and nonlinear model fitting (see Section 10.8).
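The frequentist reading of the P-value in this section (the fraction of hypothetical repeats whose $\chi^2$ exceeds the measured 26.76) can be illustrated by brute force. The sketch below is not from the text: it simulates the reference set with Python's standard `random` module, under the null hypothesis of a constant flux and $\sigma = 2.7$. The function names, the seed, the number of repeats, and the assumed true flux of 8.0 mJy are all arbitrary choices made for illustration; the statistic does not depend on the true flux because the mean is re-estimated from each simulated sample. The same block also evaluates the "number of $\sigma$" measure and the expected counts of false variables discussed above.

```python
import math
import random

SIGMA = 2.7        # measurement error quoted in the text
N_POINTS = 15      # measurements in each hypothetical data set
CHI2_OBS = 26.76   # value quoted in Equation (7.3)

def chi2_about_sample_mean(values, sigma):
    """Equation (7.1): scatter about the sample mean in units of sigma."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / sigma ** 2

def simulated_p_value(n_repeats=20000, true_flux=8.0, seed=1):
    """Fraction of simulated constant-flux data sets with chi2 > CHI2_OBS."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_repeats):
        sample = [rng.gauss(true_flux, SIGMA) for _ in range(N_POINTS)]
        if chi2_about_sample_mean(sample, SIGMA) > CHI2_OBS:
            exceed += 1
    return exceed / n_repeats

p_sim = simulated_p_value()

# Distance into the tail in units of the reference distribution's sigma.
nu = N_POINTS - 1
n_sigma = (CHI2_OBS - nu) / math.sqrt(2 * nu)

# Expected number of false "variables" at a 98% confidence threshold.
expected_false_50 = 50 * 0.02
expected_false_1e4 = 10**4 * 0.02

print(f"simulated P-value ~ {p_sim:.3f}, n_sigma = {n_sigma:.2f}")
print(f"expected false variables: {expected_false_50:.0f} of 50, "
      f"{expected_false_1e4:.0f} of 10^4")
```

The simulated fraction fluctuates with the seed and the number of repeats, so it should only be expected to land near the analytic 2%, not to reproduce it exactly.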
³ The difficulty in interpreting P-values has been highlighted in many papers (e.g., Berger and Sellke, 1987; Delampady and Berger, 1990; Sellke et al., 2001). The focus of these works is that P-values are commonly considered to imply considerably greater evidence against the null hypothesis $H_0$ than is actually warranted.

7.2.2 Hypothesis test on the difference of two means

Table 7.2 gives measurements of a certain toxic substance in the river sediment at two locations in units of parts per million (ppm). In this example, we want to test the hypothesis that the mean concentration of this toxin is the same at the two locations. How do we proceed?

Table 7.2 River sediment toxin concentration measurements at two locations.

Location 1 (ppm)    Location 2 (ppm)
     13.2                 8.9
     13.8                 9.1
      8.7                 8.3
      9.0                 6.0
      8.6                 7.7
      9.9                 9.9
     14.2                 9.9
      9.7                 8.9
     10.7
      8.3
      8.5
      9.2

Sample 1 consists of 12 measurements taken from location 1, and sample 2 consists of 8 measurements from location 2. For each location, we can compute the sample mean. From the frequentist viewpoint, we can compare sample 1 to an infinite set of hypothetical data sets that could have been realized from location 1. For each of the hypothetical data sets, we could compute the mean of the 12 values. According to the Central Limit Theorem, we expect that the means for the hypothetical data sets will have an approximately normal distribution. Let $\bar{X}_1$ and $\bar{X}_2$ be random variables representing means for locations 1 and 2, respectively. It is convenient to work with the standard normal distributions given by

$$Z_1 = \frac{\bar{X}_1 - \mu_1}{\sigma_1/\sqrt{n_1}}, \tag{7.6}$$

and

$$Z_2 = \frac{\bar{X}_2 - \mu_2}{\sigma_2/\sqrt{n_2}}, \tag{7.7}$$

where $\mu_1$ and $\mu_2$ are the population means and $\sigma_1$ and $\sigma_2$ the population standard deviations. Similarly, we expect

$$Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \tag{7.8}$$

to be approximately normal as well (see Section 6.6.3). In the present problem, the null hypothesis represented by $H_0$ corresponds to

$$H_0 \equiv \mu_1 - \mu_2 = 0. \tag{7.9}$$
Since we do not know the population standard deviation for Z, we need to estimate it from our measured sample. Then according to Section 6.6.3, the random variable T given by

$$T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S_p} \tag{7.10}$$

has a Student’s t distribution. All that remains is to specify the value of $S_p$ and the number of degrees of freedom, $\nu$, for the Student’s t distribution. In the present problem, we cannot assume $\sigma_1 = \sigma_2$ so we use Equations (6.63) and (6.64):

$$S_p^2 = \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}, \tag{7.11}$$

and

$$\nu = \frac{\left[\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right]^2}{\dfrac{[S_1^2/n_1]^2}{n_1 - 1} + \dfrac{[S_2^2/n_2]^2}{n_2 - 1}}. \tag{7.12}$$
Note: Equation (6.30) for the Student’s t probability density is valid even if $\nu$ is not an integer. Of course, we could test the hypothesis that the standard deviations are the same but we leave that for a problem at the end of the chapter.

Inserting the data we find the t statistic = 2.23 and $\nu = 17.83$ degrees of freedom. We are now ready to test the null hypothesis by comparing the measured value of the t statistic to the distribution of t values expected for a reference set of hypothetical data sets for 17.83 degrees of freedom. Figure 7.3 shows the reference t distribution and the measured value of 2.23. We can compute the area to the right of t = 2.23, which is called the one-sided P-value. We can evaluate this with the following Mathematica command:

    Needs["Statistics`HypothesisTests`"]
    StudentTPValue[2.23, 17.83, OneSided -> True] → 0.0193

Figure 7.3 Reference Student’s t distribution with 17.83 degrees of freedom. The measured t statistic is indicated by a line and the shaded areas correspond to upper and lower tail areas in a two-sided hypothesis test. (Axes: probability density versus Student’s t statistic; the line marks Student’s t value = 2.23.)

This P-value is the fraction of hypothetical repeats of the experiment that are expected to have t values $\geq 2.23$ if the null hypothesis is true. In this problem, we would expect an equal number of hypothetical repeats to fall in the same area in the opposite tail region by chance. What we are really interested in is the fraction of hypothetical repeats that would be extreme enough to fall in either of these tail regions (shaded regions of Figure 7.3), or what is called the two-sided P-value. We can evaluate this with the following Mathematica command:

    Needs["Statistics`HypothesisTests`"]
    StudentTPValue[2.23, 17.83, TwoSided -> True] → 0.0386

Mathematica provides an easier way of computing this P-value (i.e., computes the t value and degrees of freedom for you) with

    MeanDifferenceTest[data1, data2, diff, TwoSided -> True, FullReport -> True]

where data1 and data2 are the two data lists. If the variances of the two data sets are known to be the same then use

    MeanDifferenceTest[data1, data2, diff, EqualVariances -> True, TwoSided -> True, FullReport -> True]

Our confidence in rejecting the null hypothesis is equal to the area outside of these extreme tail regions, which equals 1 − (two-sided P-value) = 0.961 or 96.1%. If we use a typical threshold for rejection of say 95%, then in this case, we just reject the null hypothesis.
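The numbers in this section can be reproduced without the Mathematica package. The following is a minimal Python sketch, standard library only, that applies Equations (7.10) to (7.12) to the Table 7.2 data under $H_0$ and then integrates the Student’s t density numerically to obtain the tail areas. The function names are hypothetical, and the use of Simpson’s rule with 2000 steps is an implementation choice made here because the standard library has no incomplete beta function; it is not the method used by StudentTPValue.

```python
import math

loc1 = [13.2, 13.8, 8.7, 9.0, 8.6, 9.9, 14.2, 9.7, 10.7, 8.3, 8.5, 9.2]
loc2 = [8.9, 9.1, 8.3, 6.0, 7.7, 9.9, 9.9, 8.9]

def mean_and_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def unequal_variance_t(a, b):
    """t statistic under H0 (mu1 = mu2) and degrees of freedom, Eqs (7.10)-(7.12)."""
    (m1, v1), (m2, v2) = mean_and_var(a), mean_and_var(b)
    q1, q2 = v1 / len(a), v2 / len(b)
    t = (m1 - m2) / math.sqrt(q1 + q2)
    nu = (q1 + q2) ** 2 / (q1 ** 2 / (len(a) - 1) + q2 ** 2 / (len(b) - 1))
    return t, nu

def student_t_pdf(x, nu):
    """Student's t density; valid for non-integer nu."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(x * x / nu))

def one_sided_p(t, nu, steps=2000):
    """Tail area beyond |t|: 0.5 minus Simpson's-rule integral over [0, |t|]."""
    b = abs(t)
    h = b / steps
    total = student_t_pdf(0.0, nu) + student_t_pdf(b, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * student_t_pdf(i * h, nu)
    return 0.5 - total * h / 3

t_stat, nu = unequal_variance_t(loc1, loc2)
p_one = one_sided_p(t_stat, nu)
p_two = 2 * p_one   # symmetric density, so both tails have equal area
print(f"t = {t_stat:.2f}, nu = {nu:.2f}, one-sided P = {p_one:.4f}, "
      f"two-sided P = {p_two:.4f}, confidence = {1 - p_two:.3f}")
```

Because the sketch uses the unrounded t statistic and degrees of freedom rather than 2.23 and 17.83, its P-values can differ from the quoted 0.0193 and 0.0386 in the last digit.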
7.2.3 One-sided and two-sided hypothesis tests

In the galaxy example, where we used the $\chi^2$ statistic, we computed the confidence in rejecting the null hypothesis using a one-sided tail region. Why didn’t we use a two-sided tail region as in the river toxin problem? Here, we introduce the concept of the alternative hypothesis, i.e., alternative to the null hypothesis. In the galaxy problem, the alternative hypothesis is that the radio emission is variable. If the alternative hypothesis were true, then examination of Equation (7.1) indicates that for a given value of the measurement error, $\sigma$, we expect the value of $\chi^2$ to be greater⁴ when the source is variable than when it is constant. In this case, we would expect to measure $\chi^2$ values in the upper tail but not the lower, which is why we used a one-sided test. In the river toxin problem, the alternative hypothesis is that the mean toxin levels at the two locations are different. In this case, if the alternative were true, we would expect t values in either tail region, which is why we used the two-sided P-value test.

The rules of the game are that the null hypothesis is regarded as true unless sufficient evidence to the contrary is presented. If this seems a strange way of proceeding, it might prove useful to consider the following courtroom analogy. In the courtroom, the null hypothesis stands for “the accused is presumed innocent until proven otherwise.” Table 7.3 illustrates the possible types of errors that can arise in a hypothesis test.

Table 7.3 Type I and type II errors in hypothesis testing.

Possible decisions    Possible consequences     Errors
Reject H0             when in fact H0 true      Type I error (conviction)
                      when in fact H0 false
Fail to reject H0     when in fact H0 true
                      when in fact H0 false     Type II error (acquittal)

In hypothesis testing, a type I error is considered more serious (i.e., the possibility of convicting an innocent party is considered worse than the possibility of acquitting a guilty party). A type I error is only possible if we reject $H_0$, the null hypothesis. It is not possible to minimize both the type I and type II errors. The normal procedure is to select the maximum size of the type I error we can tolerate and construct a test procedure that minimizes the type II error.⁵ This means choosing a threshold value for the statistic which if exceeded will lead us to reject $H_0$. For example, suppose we are dealing with a one-sided upper tail region test and we are willing to accept a maximum type I error of 5%. This means a threshold value of the test statistic anywhere in the upper 5% tail area satisfies our type I error requirement. The size of the type II error is a minimum at the lower boundary of this region, i.e., the larger the value of the test statistic, the more likely it is we will acquit a possibly guilty party.

Suppose we had used a two-tail test in the radio galaxy problem of Section 7.2.1 rather than the upper tail test. Recall that the alternative hypothesis is only expected to give rise to larger values of $\chi^2$ than those expected on the basis of the null hypothesis. In a two-tail test we would divide the rejection area equally between the lower and upper tails, i.e., for a 98% confidence threshold, that would mean the upper and lower tail areas would each have an area of 1%. The $\chi^2$ value required for rejecting the null hypothesis in the upper tail region would be larger in a two-tail test than for a one-tail test. Thus, for a given confidence level, the two-tail test would increase the chance of a type II error, because we would have squandered rejection area in the lower tail region, a region of $\chi^2$ that would not be accessed under the assumption of our alternative hypothesis.

In the river toxin example of Section 7.2.2, the alternative hypothesis can give rise to values of the Student’s t statistic in either tail region. In this case, we will want to reject $H_0$ if the t value falls far enough out in either tail region, so we divide the area corresponding to the maximum acceptable type I error equally between the two tail regions. To minimize the type II error, choose threshold values for the test statistic that are at the inner boundaries of these two tails.

In practice, the role of the alternative hypothesis is mainly to help decide whether to use an upper tail region, a lower tail region or both tails in our statistical test. The choice depends on what is physically meaningful and minimizes the size of the type II error. In Section 6.4, we used the Student’s t statistic in an analysis of a cigarette manufacturer’s claim regarding nicotine content. Since we would reject the claim if the t value fell sufficiently far out in either tail region, we should use a two-sided test in this case.

Typically, in frequentist hypothesis testing involving the use of P-values, a specific value for the type II error is not normally computed. Instead it is used as an argument to decide where in the tail region to locate the decision value of the test statistic, as outlined above.

⁴ The only way for the variability to reduce $\chi^2$ is if the fluctuations in the galaxies’ output canceled measurement errors.

⁵ The test that minimizes the type II error is often referred to as having maximum “power” in rejecting the null hypothesis when it really is false.
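The claim that a two-tail test pushes the upper rejection threshold outward can be checked numerically for the galaxy problem (14 degrees of freedom, 98% confidence). The sketch below is an illustration under stated assumptions, not the author’s procedure: `chi2_upper_tail` and `chi2_threshold` are hypothetical helper names, the tail area is computed for integer degrees of freedom only, and the threshold is found by simple bisection on the interval 0 to 200, which is assumed to bracket the answer.

```python
import math

def chi2_upper_tail(chi2, nu):
    """Upper-tail area of the chi-square distribution (integer nu >= 1)."""
    if chi2 <= 0.0:
        return 1.0
    h = chi2 / 2.0
    if nu % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    while 2.0 * a < nu:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q

def chi2_threshold(tail_area, nu, lo=0.0, hi=200.0):
    """Bisection for the chi-square value whose upper-tail area is tail_area."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_upper_tail(mid, nu) > tail_area:
            lo = mid   # still too much area to the right: move outward
        else:
            hi = mid
    return 0.5 * (lo + hi)

NU = 14
one_tail = chi2_threshold(0.02, NU)         # all 2% of rejection area in the upper tail
two_tail_upper = chi2_threshold(0.01, NU)   # 1% in the upper tail
two_tail_lower = chi2_threshold(0.99, NU)   # 1% in the lower tail
print(f"one-tail threshold:  chi2 > {one_tail:.2f}")
print(f"two-tail thresholds: chi2 > {two_tail_upper:.2f} or chi2 < {two_tail_lower:.2f}")
```

With these settings the measured value of 26.76 sits just below the one-tail threshold for a 98% confidence level, consistent with the P-value of about 0.02 found in Section 7.2.1, and further below the larger upper threshold of the two-tail test.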
7.3 Are two distributions the same?

We have previously considered tests to compare the means and variances of two samples. Now we generalize and ask the simple question: “Can we reject the null hypothesis that the two samples are drawn from the same population?” Rejecting the null hypothesis in effect implies that the two data sets are from different distributions. Failing to reject the null hypothesis only shows that the data sets can be consistent with a single distribution. Deciding whether two distributions are different is a problem that occurs in many research areas.

Example 1: Are stars uniformly distributed in the sky? That is, is the distribution of stars as a function of latitude the same as the distribution of the sky area with latitude? In this case, the data set (location of stars) and comparison distribution (sky area) are continuous.
Example 2: Are the educational patterns in Vancouver and Toronto the same? Is the distribution of people as a function of “last grade attended” the same? Here, both data sets are discrete or binned.

Example 3: Is the distribution of grades in a particular physics course normal? Here, the grades are discrete or binned and the distribution we compare to is continuous. In this latter case, we might be comparing with a normal distribution with a given $\mu$ and $\sigma^2$ or alternatively we might not know $\mu$ and $\sigma^2$ and be interested only in whether the shape is normal.

One can always turn continuous data into binned data by grouping the events into specified ranges of the continuous variable(s). Binning involves a loss of information, however, and there is considerable arbitrariness as to how the bins should be chosen. The accepted test for differences between binned distributions is the Pearson $\chi^2$ test. For continuous data as a function of a simple variable, the most generally accepted test is the Kolmogorov–Smirnov test.
7.3.1 Pearson $\chi^2$ goodness-of-fit test

Let a random sample of size n from the distribution of a random variable X be divided into k mutually exclusive and exhaustive classes (or bins) and let $N_i$ $(i = 1, \ldots, k)$ be the number of observations in each class (or bin). We want to test the simple null hypothesis

$$H_0 \equiv p(x) = p_0(x),$$

where the claimed probability model $p_0(x)$ is completely specified with regard to all parameters. Since $p_0(x)$ is completely specified, we can determine the probability $p_i$ of obtaining an observation in the ith class under $H_0$, where by necessity $\sum_{i=1}^{k} p_i = 1$. Let $n_i = np_i$ = expected number in each class according to the null hypothesis, $H_0$. Usually, $H_0$ does not predict n, and this is obtained by setting

$$n = \sum_{i=1}^{k} N_i.$$

This has the effect of reducing the number of degrees of freedom in the $\chi^2$ test by one.⁶ Note: $N_i$ is an integer while the $n_i$’s may not be.

⁶ Note on the number of degrees of freedom: If $H_0$ does predict the $n_i$’s and there is no a priori constraint on any of the $N_i$’s, then $\nu$ = number of bins, k. More commonly, the $n_i$’s are normalized after the fact so that their sum equals the sum of the $N_i$’s, the total number of events measured. In this case, $\nu = k - 1$. If the model that gives the $n_i$’s had additional free parameters that were adjusted after the fact to agree with the data, then each of these additional “fitted” parameters reduces $\nu$ by one. The number of these additional fitted parameters (not including the normalization of the $n_i$’s) is commonly called the “number of constraints,” so the number of degrees of freedom is $\nu = k - 1$ when there are “zero constraints.”
Question: What is the form of $p_0(x)$?

Answer: Since there are k mutually exclusive categories with probabilities $p_1, p_2, \ldots, p_k$, then under the null hypothesis, the probability of the grouped sample is the same as the probability of a multinomial distribution discussed in Section 4.3.

In what follows, we will deduce the appropriate test statistic for $H_0$, which is known as the Pearson chi-square goodness-of-fit test. Start with the simple case where k = 2; thus, $p_0(x)$ is a binomial distribution.

$$p(x|n, p) = \frac{n!}{(n - x)!\, x!}\, p^x (1 - p)^{n - x}, \qquad x = 0, 1, \ldots, n. \tag{7.13}$$

In this case, $x = n_1$, $p = p_1$, $n - x = n_2$ and $(1 - p) = p_2$. Recall that $\sigma = \sqrt{np(1 - p)}$ for the binomial distribution. Consider the standardized random variable

$$Y = \frac{N_1 - np_1}{\sqrt{np_1(1 - p_1)}}. \tag{7.14}$$

For $np_1 \gg 1$, the distribution of Y is approximately the standard normal. Recall that the sum of the squares of n independent standard normal variables,

$$\sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^2},$$

is $\chi^2$-distributed with n degrees of freedom. Thus we expect the statistic

$$\frac{(N_1 - np_1)^2}{np_1(1 - p_1)} \tag{7.15}$$

to be approximately $\chi^2$-distributed with one degree of freedom. Note:

$$\frac{1}{np_1} + \frac{1}{np_2} = \frac{n(p_1 + p_2)}{n^2 p_1 p_2} = \frac{1}{np_1(1 - p_1)},$$

so that

$$\frac{(N_1 - np_1)^2}{np_1(1 - p_1)} = \frac{(N_1 - np_1)^2}{np_1} + \frac{[(n - N_2) - n(1 - p_2)]^2}{np_2} = \frac{(N_1 - np_1)^2}{np_1} + \frac{(N_2 - np_2)^2}{np_2} = \sum_{i=1}^{2} \frac{(N_i - np_i)^2}{np_i}. \tag{7.16}$$

Following this reasoning, it can be shown that for $k \geq 2$ the statistic,

$$\sum_{i=1}^{k} \frac{(N_i - np_i)^2}{np_i} \tag{7.17}$$
is approximately $\chi^2$-distributed with $\nu = k - 1$ degrees of freedom. $N_i$ is the observed frequency of the ith class and $np_i$ is the corresponding expected frequency under the null hypothesis. Any term in Equation (7.17) with $N_i = np_i = 0$ should be omitted from the sum. A term with $p_i = 0$ and $N_i \neq 0$ gives an infinite $\chi^2$, as it should, since in this case, the $N_i$’s cannot possibly be drawn from the $p_i$’s.

Strictly speaking, the $\chi^2$ P-value is the probability that the sum of squares of $\nu$ standard normal random variables will be greater than $\chi^2$. In general, the terms in the sum of Equation (7.17) will not be individually normal. However, if either the number of bins is large ($\gg 1$), or the number of events in each bin is large ($\gg 1$), then the $\chi^2$ P-value is a good approximation for computing the significance of $\chi^2$ values given by Equation (7.17). Its use to estimate the significance of the Pearson $\chi^2$ goodness-of-fit test is standard.

Example 1: In this first example, we apply the Pearson $\chi^2$ goodness-of-fit test to a k = 2 bin problem and compare with the exact result expected using a test based on the binomial distribution. Suppose a certain theory predicts that the fraction of radio sources that are expected to be quasars at an observing wavelength of 20 cm is 70%. A sample of 90 radio sources is selected and each source is optically identified. Only 54 turn out to be quasars. At what confidence level can we reject the above theory? Our null hypothesis is that the theory is true.

            Predicted    Observed
Quasars        63           54
Other          27           36

Number of degrees of freedom = number of bins − 1 = 1.

$$\chi^2_1 = \sum_{i=1}^{2} \frac{(N_i - np_i)^2}{np_i} = \frac{(54 - 63)^2}{63} + \frac{(36 - 27)^2}{27} = 4.29.$$

Our alternative hypothesis in this case is that the theory is not true. Based on the alternative hypothesis, we would choose an upper tail test for the Pearson $\chi^2$ statistic. The observed $\chi^2$ of 4.29 corresponds to a P-value of 3.8%. What is the corresponding P-value predicted by the binomial distribution? Recall that the binomial distribution (Equation (7.13)) predicts the probability of x successes in n trials when the probability of a success in any one trial is p. In this case, a success means the source is a quasar. According to the theory, p = 0.7. First we calculate the area in the tail region extending from x = 0 to x = 54 successes, which equals 2.7%. The true P-value is double this, or 5.4%, because we need to use a two-tailed test since we would reject the null hypothesis if the observed number fell far enough out in either tail of the binomial distribution. Examination of Equation (7.15) demonstrates that both of these binomial tails contribute to a single $\chi^2$ tail. Thus, using the $\chi^2$ test, we would reject the null hypothesis at the 95% confidence level but just fail to reject the
hypothesis using the more accurate test based on the binomial distribution. Comparison of the results, P-values = 3.8% ($\chi^2$) and = 5.4% (binomial), indicates the approximate level of agreement to be expected from the Pearson $\chi^2$ test in this simple two-bin test for $n \sim 100$.

Now repeat the above test; only this time, we will use a sample of 2000 radio sources of which 1360 prove to be quasars.

$$\chi^2_1 = \sum_{i=1}^{2} \frac{(N_i - np_i)^2}{np_i} = \frac{(1360 - 1400)^2}{1400} + \frac{(640 - 600)^2}{600} = 3.81.$$

In this case, the observed $\chi^2$ of 3.81 corresponds to a P-value (significance $\alpha$) of 5.1%. The more accurate P-value based on the binomial distribution is equal to 5.4%. As expected, as we increase the sample size, the agreement between the two tests becomes much closer. In this case, both tests just fail to reject the null hypothesis at the 95% level.

Example 2: Now we consider an example involving a large number of bins or classes. Table 7.4 compares the total number of goals scored per game in four seasons of World Cup soccer matches (years 1990, 1994, 1998, and 2002), with the expected number if the number of goals is Poisson distributed. Only the goals scored in the 90 minutes regulation time are considered. This leaves out goals scored in extra time or in penalty shoot-outs. Based on the information provided, is there reason to believe at the 95% confidence level that the number of goals is not a Poisson random variable? The Poisson distribution is given by

$$p(n) = \frac{\lambda^n e^{-\lambda}}{n!},$$

where $\lambda$ = average number of goals per game (a parameter that must be estimated from the data). Each parameter estimated from the data decreases the number of degrees of freedom by one. From the data of Table 7.4, we compute $\lambda = 2.4785$. The probability of exactly zero goals under the null hypothesis of a Poisson distribution is

$$p(0) = \frac{(2.4785)^0\, e^{-2.4785}}{0!} = 0.0839.$$
For n ¼ 232 games, the expected number of games with zero goals is 232 0:0839 ¼ 19:46. Even though the Poisson distribution makes non-zero predictions for seven or more goals, the expected number rapidly falls below the resolution of our data set. There is no requirement that the bins be of equal size so our last bin is for 7 goals. For k ¼ 8 classes, the number of degrees of freedom ¼ 6. We lose one degree of P freedom from the normalizing n ¼ ki¼1 Ni , and another from estimating from the data. The value of 2 ¼ 2:66, which is derived from the data in Table 7.4,
7.4 Problem with frequentist hypothesis testing
177
Table 7.4 World Cup goal statistics. Number of goals
0 1 2 3 4 5 6 7 Totals
Actual number of games
Expected number of games
19 49 60 47 32 18 3 4
19.46 48.23 59.76 49.37 30.59 15.16 6.26 3.15
232
232
½Ni npi ðÞ2 npi ðÞ 0.0108 0.0124 0.0009 0.1142 0.0647 0.5302 1.7009 0.2267 2.6607
corresponds to a P-value (significance ) of 0.85. This corresponds to a confidence in rejecting the null hypothesis of 15%, which is much less than the 95% usually required. Thus, we fail to reject the null hypothesis that the number of goals scored is Poisson distributed. For more statistical analysis of the World Cup soccer data, see Chu (2003).
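The World Cup test can be retraced with a short Python sketch. It takes the observed counts from Table 7.4 and the mean of 2.4785 goals per game quoted in the text (the binned table alone does not fix this value, because the last bin pools games with seven or more goals), builds the Poisson expectations with the last bin collecting the whole upper tail, and evaluates the upper-tail $\chi^2$ area. The closed-form tail used here is valid only for an even number of degrees of freedom, which is an assumption that happens to hold for this example.

```python
import math

observed = [19, 49, 60, 47, 32, 18, 3, 4]   # games with 0..6 goals; last bin is >= 7
lam = 2.4785                                 # mean goals per game quoted in the text
n_games = sum(observed)

# Poisson probabilities for 0..6 goals; the last bin collects the remaining tail.
probs = [lam ** k * math.exp(-lam) / math.factorial(k) for k in range(7)]
probs.append(1.0 - sum(probs))
expected = [n_games * p for p in probs]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 2   # one lost to normalization, one to estimating lambda

def chi2_upper_tail_even_dof(x, dof):
    """Upper-tail chi-squared area; this closed form holds for even dof only."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(dof // 2))

p_value = chi2_upper_tail_even_dof(chi2, dof)
print(f"chi2 = {chi2:.2f} with {dof} degrees of freedom, P-value = {p_value:.2f}")
```

Because the expected counts are not rounded to two decimals here, the statistic can differ from the tabulated total in the third decimal place.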
7.3.2 Comparison of two binned data sets

In this case,

$$\chi^2 = \sum_i \frac{(R_i - S_i)^2}{R_i + S_i}, \tag{7.18}$$

where $R_i$ and $S_i$ are the number of events in bin $i$ for the first and second data set, respectively. It is instructive to compare Equation (7.18) with Equation (7.17). The term in the denominator of Equation (7.17), the predicted number of counts in the $i$th bin, is a measure of the expected variance in the counts. The variance of the difference of two random variables is the sum of their variances, which explains why the denominator in Equation (7.18) is $R_i + S_i$.

If the data were collected in such a way that $\sum R_i = \sum S_i$, then $\nu = k - 1$. If this requirement were absent, the number of degrees of freedom would be $k$.
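A minimal Python sketch of Equation (7.18) follows. The counts are hypothetical numbers invented purely for illustration, and skipping bins that are empty in both data sets is an assumption of this sketch rather than something the text prescribes.

```python
def chi2_two_samples(r_counts, s_counts):
    """Chi-squared statistic of Equation (7.18) for two binned data sets."""
    if len(r_counts) != len(s_counts):
        raise ValueError("both data sets must use the same bins")
    total = 0.0
    for r, s in zip(r_counts, s_counts):
        if r + s > 0:          # assumption: bins empty in both sets are skipped
            total += (r - s) ** 2 / (r + s)
    return total

# Hypothetical counts, invented for illustration; both sets sum to 100.
r = [12, 25, 38, 25]
s = [10, 30, 35, 25]
chi2 = chi2_two_samples(r, s)
# Equal totals remove one degree of freedom, as stated in the text.
dof = len(r) - 1 if sum(r) == sum(s) else len(r)
print(f"chi2 = {chi2:.4f}, degrees of freedom = {dof}")
```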
7.4 Problem with frequentist hypothesis testing

We now consider a serious problem with frequentist hypothesis testing referred to as the optional stopping problem (e.g., Loredo, 1990; Berger and Berry, 1988). The optional stopping problem is best illustrated by an example. Consider the following astronomical fable motivated by a tutorial given by Tom Loredo at a Maximum Entropy and Bayesian Methods meeting:
An Astronomical Fable

Theorist: I predict the fraction of nearby stars that are like the sun (G spectral class) is $f = 0.1$.

Observer: I count five G stars out of $N = 102$ total stars observed. This gives me a P-value = 4.3%. Your theory is rejected at the 95% level.

Theorist: Let me check that: I can use the binomial distribution to compute the probability of observing five or fewer G stars out of a total of 102 stars observed for a predicted probability $f = 0.1$.

$$\text{P-value} = 2 \times \sum_{n=0}^{5} p(n \mid N, f), \tag{7.19}$$

where

$$p(n \mid N, f) = \frac{N!}{n!\,(N - n)!}\, f^{n} (1 - f)^{N - n}. \tag{7.20}$$

The factor of 2 in Equation (7.19) is because a two-tailed test is required here. My hypothesis could be rejected if either too few or too many G stars were counted. I get a P-value = 10% so my theory is still alive. You have failed to reject my theory at the 95% level.

Observer: Never trust a theorist with your data! I planned my observations by deciding beforehand that I would observe until I saw $n_G = 5$ G stars, and then stop. The random quantity your theory predicts is thus $N$, not $n_G$. The correct reference distribution is the negative binomial. Thus,

$$\text{P-value} = 2 \times \sum_{N=102}^{\infty} p(N \mid n_G, f), \tag{7.21}$$

where

$$p(N \mid n_G, f) = \binom{N - 1}{n_G - 1} f^{n_G} (1 - f)^{N - n_G}. \tag{7.22}$$

I get a P-value = 4.3% as I claimed.

Theorist: What if bad weather ended your observations before you saw five G stars?

Observer: I'd either throw out the data, or include the probability of bad weather.

Theorist: But then you should include it in the analysis now, because the weather could have been bad.

MORAL: Never trust a frequentist with your data!
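The two P-values in the fable can be evaluated directly from Equations (7.19) to (7.22). The Python sketch below does so with the standard library only; the function names are illustrative, and the infinite sum in Equation (7.21) is computed as one minus the finite sum over the stopping points before $N = 102$. The results should land close to the roughly 10% and 4.3% quoted in the dialogue, although the last digit may differ slightly depending on rounding.

```python
import math

def binomial_pmf(n, N, f):
    """Equation (7.20): probability of n successes in N trials."""
    return math.comb(N, n) * f ** n * (1.0 - f) ** (N - n)

def negative_binomial_pmf(N, n_g, f):
    """Equation (7.22): probability that the n_g-th success occurs on trial N."""
    return math.comb(N - 1, n_g - 1) * f ** n_g * (1.0 - f) ** (N - n_g)

f, n_g, N_obs = 0.1, 5, 102

# Theorist: N is fixed, so the number of G stars is the random quantity (Eq. 7.19).
p_binomial = 2.0 * sum(binomial_pmf(n, N_obs, f) for n in range(n_g + 1))

# Observer: n_g is fixed, so N is the random quantity (Eq. 7.21).
# The infinite sum equals 1 minus the finite sum over N = n_g .. N_obs - 1.
p_negative_binomial = 2.0 * (1.0 - sum(negative_binomial_pmf(N, n_g, f)
                                       for N in range(n_g, N_obs)))

print(f"binomial P-value          = {p_binomial:.3f}")
print(f"negative binomial P-value = {p_negative_binomial:.3f}")
```

The same data therefore fall on opposite sides of the 5% threshold depending only on which quantity is treated as random.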
The problem with the frequentist approach is that we need to specify a reference set of hypothetical samples that could have been observed, but were not, in order to compute the P-value of our observed sample. Thus, the decision on whether to reject the null hypothesis based on the P-value depends on the thoughts of the investigator about data that might have been observed but were not. Clearly, the theorist and observer had different thoughts about what was the appropriate reference set and thus arrived at quite different conclusions.

To avoid this problem, experiments must therefore be carefully planned beforehand (e.g., the stopping rule specified before the experiment commences) to be amenable to frequentist analysis, and if the plan is altered during execution for any reason (for example, if the experimenter runs out of funds), the data are worthless and cannot be analyzed.

The fact that P-value hypothesis testing depends on considerations like the intentions of the investigator and unobserved data indicates a potentially serious flaw in the logic behind the use of P-values. Surely if our plan for an experiment has to be altered (e.g., astronomical observations cut short due to bad weather), we should still be able to analyze the resulting data provided we are fully aware of the physical details of the experiment. Clearly, our state of information has changed.

Fortunately, in Bayesian inference the stopping rule plays no role in the analysis. There is no ambiguity over which quantity is to be considered a "random variable," because the notion of a random variable and consequent need for a reference set of hypothetical data is absent from the theory. All that is required is a specification of the state of knowledge that allows us to compute the likelihood function.
7.4.1 Bayesian resolution to optional stopping problem

In the Bayesian approach, where the probability assignments describe the state of knowledge defined in the problem, such paradoxes disappear. Here, we are interested in the posterior probability of $f$, the fraction of all nearby stars that are G stars.

$$p(f \mid D, I) = \frac{p(f \mid I)\, p(D \mid f, I)}{p(D \mid I)}. \tag{7.23}$$

The Bayesian calculation focuses on the functional dependence of the likelihood on the hypotheses corresponding to different choices of $f$. Both the binomial and negative binomial distributions depend on $f$ in the same way, so Bayesian calculations by the theorist and the observer lead to the same conclusion, as we now demonstrate.

1. Binomial case:

$$p(D \mid f, I) = p(n_G \mid N, f) = \frac{N!}{(N - n_G)!\,(n_G)!}\, f^{n_G} (1 - f)^{N - n_G}$$

$$p(D \mid I) = \int df\, p(f \mid I)\, p(D \mid f, I)$$

$$p(f \mid D, I) = \frac{p(f \mid I)\, f^{n_G} (1 - f)^{N - n_G}}{\int df\, p(f \mid I)\, f^{n_G} (1 - f)^{N - n_G}}, \tag{7.24}$$

where the factorial terms cancel out because they appear in both the numerator and denominator.

2. Negative binomial case:

$$p(D \mid f, I) = p(N \mid n_G, f) = \binom{N - 1}{n_G - 1} f^{n_G} (1 - f)^{N - n_G}$$

$$p(f \mid D, I) = \frac{p(f \mid I)\, f^{n_G} (1 - f)^{N - n_G}}{\int df\, p(f \mid I)\, f^{n_G} (1 - f)^{N - n_G}}. \tag{7.25}$$

Again the factorial terms cancel out because they appear in both the numerator and denominator. Equations (7.24) and (7.25) are identical so the conclusions are the same; theorist and observer agree. Figure 7.4 shows the Bayesian posterior PDF for the fraction of G stars assuming a uniform prior for $p(f \mid I)$.

[Figure 7.4 Bayesian posterior PDF for the fraction of G stars. Vertical axis: probability density; horizontal axis: f (fraction of G stars).]

The frequentist calculations, on the other hand, focus on the dependence of the sampling distribution on the data $N$ and $n_G$. Since the binomial and negative binomial distributions depend on $N$ and $n_G$ in different ways, one would be led to different conclusions depending on the distribution chosen. Variations of weather and equipment can affect $N$ and $n_G$ but not $f$, and thus only the Bayesian conclusion is consistently the same.

In the frequentist hypothesis test, we were attempting to reject the null hypothesis that the theorist's prediction ($f = 0.1$) is correct. Recall that in a Bayesian analysis, we cannot compute the probability of a single hypothesis in isolation but only in comparison to one or more alternative hypotheses. The posterior PDF shown in Figure 7.4 allows us to compare the probability density at $f = 0.1$ to the probability density at any other value of $f$. Assuming a uniform prior, the PDF is a maximum close to $f \approx 0.05$.
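The cancellation of the combinatorial factors can be seen numerically. The following Python sketch evaluates the posterior for $f$ on a grid, once with the binomial likelihood and once with the negative binomial likelihood, assuming a uniform prior and the fable's data ($n_G = 5$, $N = 102$). The grid spacing and the simple Riemann-sum normalization are choices made for this sketch.

```python
import math

n_g, N = 5, 102
grid = [i / 2000 for i in range(1, 2000)]      # interior points of 0 < f < 1

def normalized_posterior(log_likelihood):
    """Posterior on the grid for a uniform prior, normalized by a Riemann sum."""
    weights = [math.exp(log_likelihood(f)) for f in grid]
    norm = sum(weights) * (grid[1] - grid[0])
    return [w / norm for w in weights]

def log_binomial(f):
    return (math.log(math.comb(N, n_g))
            + n_g * math.log(f) + (N - n_g) * math.log(1.0 - f))

def log_negative_binomial(f):
    return (math.log(math.comb(N - 1, n_g - 1))
            + n_g * math.log(f) + (N - n_g) * math.log(1.0 - f))

post_binomial = normalized_posterior(log_binomial)
post_negative_binomial = normalized_posterior(log_negative_binomial)

# The two posteriors differ only by floating-point rounding.
max_difference = max(abs(a - b) for a, b in zip(post_binomial, post_negative_binomial))
f_mode = grid[post_binomial.index(max(post_binomial))]
print(f"largest difference = {max_difference:.2e}, posterior mode near f = {f_mode:.3f}")
```

The stopping rule changes only the $f$-independent prefactor of the likelihood, and that prefactor is removed by the normalization.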
7.5 Problems

1. In Section 7.2.2, we tested the hypothesis that the river sediment toxin concentrations at the two locations are the same. Using the same data, test whether the variances of the data are the same for the two locations. Should you use a one-sided or a two-sided hypothesis test in this case? Some choices of Mathematica commands to use to answer this question are given in the following box:

    Needs["Statistics`HypothesisTests`"]
    VarianceRatioTest[data1, data2, ratio, FullReport -> True]

   or

    VarianceRatioTest[data1, data2, ratio, TwoSided -> True, FullReport -> True]

   Note: OneSided -> True is the default. Both are based on the FRatioPValue[fratio, numdef, dendef] calculation of Section 6.5.

2. Table 7.5 gives measurements of a certain river sediment toxic substance at two locations in units of parts per million (ppm). The sampling is assumed to be from two independent normal populations.

   a) Determine the 95% confidence intervals for the means and variances of the two data sets.

   b) At what confidence level (express as a %) can you reject the hypothesis that the two samples are from populations with the same variance? Explain why you chose to use a one-sided or two-sided hypothesis test.

   c) At what confidence level can you reject the hypothesis that the two samples are from populations with the same mean? Assume the population variances are unknown but equal and use a two-sided hypothesis test.

   d) At what confidence level can you reject the hypothesis that the two samples are from populations with the same mean? Assume the population variances are unknown and unequal, and use a two-sided hypothesis test.

   Tips: The following Mathematica commands may prove useful: StudentTPValue, FRatioPValue, VarianceRatioTest, MeanDifferenceTest, MeanCI, VarianceCI.

Table 7.5 Measurements of the concentration of a river sediment toxin in ppm at two locations.

| Location | Measurements (ppm) |
|---|---|
| Location 1 | 17.1, 11.1, 12.6, 12.1, 5.9, 7.7, 10.5, 15.3, 10.5, 10.5 |
| Location 2 | 7.0, 12.0, 6.8, 9.3, 8.9, 9.4, 9.6, 7.6 |

3. In Example 1 of Section 7.3.1, suppose 41 quasars were detected in a total sample of 90 radio sources. With what confidence could you reject the hypothesis that 70% of radio sources are quasars?

4. Generate a list of 50 000 random numbers with a uniform distribution in the interval 0 to 1. Divide this interval into 500 bins of equal size and count the number of random numbers in each bin (see Mathematica command BinCounts). Use the Pearson $\chi^2$ goodness-of-fit test to see if you can reject the hypothesis that the counts have a uniform distribution at a 95% confidence level.

5. A distribution of background radiation measurements in a radioactively contaminated site is given in Table 7.6 based on a sample of 100 measurements. Use the Pearson $\chi^2$ goodness-of-fit test to see if you can reject the hypothesis that the counts have a Poisson distribution at a 95% confidence level. Include a plot of the data.

Table 7.6 The distribution of a sample of 100 radiation measurements.

| Count obtained | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of occurrences | 1 | 6 | 18 | 17 | 23 | 10 | 15 | 4 | 4 | 1 | 0 | 0 | 1 |
8 Maximum entropy probabilities

8.1 Overview

This chapter can be thought of as an extension of the material covered in Chapter 4, which was concerned with how to encode a given state of knowledge into a probability distribution suitable for use in Bayes' theorem. However, sometimes the information is of a form that does not simply enable us to evaluate a unique probability distribution $p(Y \mid I)$. For example, suppose our prior information expresses the following constraint:

$I \equiv$ "the mean value of $\cos y = 0.6$."

This information alone does not determine a unique $p(Y \mid I)$, but we can use $I$ to test whether any proposed probability distribution is acceptable. For this reason, we call this type of constraint information testable information. In contrast, consider the following prior information:

$I_1 \equiv$ "the mean value of $\cos y$ is probably $> 0.6$."

This latter information, although clearly relevant to inference about $Y$, is too vague to be testable because of the qualifier "probably."

Jaynes (1957) demonstrated how to combine testable information with Claude Shannon's entropy measure of the uncertainty of a probability distribution to arrive at a unique probability distribution. This principle has become known as the maximum entropy principle or simply MaxEnt.

We will first investigate how to measure the uncertainty of a probability distribution and then find how it is related to the entropy of the distribution. We will then examine three simple constraint problems and derive their corresponding probability distributions. In the course of this examination, we gain further insight into the special properties of a Gaussian distribution. We also explore the application of MaxEnt to situations where the constraints are uncertain and consider an application to image restoration/reconstruction.¹ The last section deals with a promising Bayesian image reconstruction/compression technique called the Pixon™ method.
¹ Image restoration, the recovery of images from image-like data, usually means removing the effects of point-spread-function blurring and noise. Image reconstruction means the construction of images from more complexly encoded data (e.g., magnetic resonance imaging data or the Fourier data measured in radio astronomy aperture synthesis). In the remainder of the chapter, we will use the term image reconstruction to refer to both.

8.2 The maximum entropy principle

The major use of Bayes' theorem is to update the probability of a hypothesis when new data become available. However, for certain types of constraint information, it is not always obvious how to use it directly in Bayes' theorem. This is because the information does not easily enable us to evaluate a prior probability distribution or evaluate the likelihood function. As an example, consider the following problem involving a six-sided die: each side has a unique number of dots on it, ranging in number from one to six. Suppose the die is thrown a very large number of times and on each throw, the number of dots appearing on the top face is recorded. The book containing the results of the individual throws is then unfortunately lost. The only information remaining is the average number of dots on the repeated throws. Using only this prior information, how can we arrive at a unique assignment for the probability that the top face will have $n$ dots on any one throw (i.e., we want to obtain a prior probability for each side of the die)?

Principle: Out of all the possible probability distributions which agree with the given constraint information, select the one that is maximally non-committal with regard to missing information.

Question: How do we accomplish the goal of being maximally non-committal about missing information?

Answer: The greater the missing information, the more uncertain the estimate. Therefore, make estimates that maximize the uncertainty in the probability distribution, while still being maximally constrained by the given information.

What is uncertainty and how do we measure it? Jaynes argued that the best measure of uncertainty to maximize is the entropy of the probability distribution, an idea which was first introduced by Claude Shannon in his pioneering work on information theory. We start by developing our intuition about uncertainty:

Example 1: Consider an experiment with only two possible outcomes. For which of the three probability distributions listed below is the outcome most uncertain?

(1) $p_1 = p_2 = \frac{1}{2}$ (the outcome here is most uncertain)
(2) $p_1 = \frac{1}{4}$; $p_2 = \frac{3}{4}$
(3) $p_1 = \frac{1}{100}$; $p_2 = \frac{99}{100}$

Example 2: Consider an experiment with different numbers of outcomes.

(1) $p_1 = p_2 = \frac{1}{2}$
(2) $p_1 = p_2 = p_3 = p_4 = \frac{1}{4}$
(3) $p_1 = p_2 = \cdots = p_8 = \frac{1}{8}$ (most uncertain)

That is, if there are $n$ equally probable outcomes, the uncertainty increases with $n$.
8.3 Shannon's theorem

In 1948, Claude Shannon published a landmark paper on information theory in which he developed a measure of the uncertainty of a probability distribution which he labeled "entropy." He demonstrated that the expression for entropy has a meaning quite independent of thermodynamics. Shannon showed that the uncertainty, $S(p_1, p_2, \ldots, p_n)$, of a discrete probability distribution $p_i$ is given by the entropy of the distribution, which is

$$S(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \ln(p_i) = \text{entropy}. \tag{8.1}$$

The theorem is based on the following assumptions:

(1) Some real numbered measure of the uncertainty of the probability distribution $(p_1, p_2, \ldots, p_n)$ exists, which we designate by $S(p_1, p_2, \ldots, p_n)$.

(2) $S$ is a continuous function of the $p_i$. Otherwise an arbitrarily small change in the probability distribution could lead to the same big change in the amount of uncertainty as a big change in the probability distribution.

(3) $S(p_1, p_2, \ldots, p_n)$ should correspond to common sense in that when there are many possibilities, we are more uncertain than when there are few. This condition implies that in the case where the $p_i$ are all equal (i.e., $p_i = 1/n$),

$$S\left(\frac{1}{n}, \ldots, \frac{1}{n}\right) = f(n)$$

shall be a monotonic increasing function of $n$.

(4) $S(p_1, p_2, \ldots, p_n)$ is a consistent measure. If there is more than one way of working out its value, we must get the same answer for every possible way.
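To connect Equation (8.1) with the intuition built in Section 8.2, the short Python sketch below evaluates the entropy (in nats) of the two-outcome distributions of Example 1 and of the uniform distributions of Example 2. Treating terms with $p_i = 0$ as contributing zero is a convention adopted for this sketch.

```python
import math

def entropy(probabilities):
    """Shannon entropy of Equation (8.1), in nats; terms with p = 0 contribute 0."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)

# Example 1: two outcomes with increasingly lopsided probabilities.
two_outcome = {
    "(1/2, 1/2)": entropy([0.5, 0.5]),
    "(1/4, 3/4)": entropy([0.25, 0.75]),
    "(1/100, 99/100)": entropy([0.01, 0.99]),
}

# Example 2: n equally probable outcomes, for which the entropy equals ln(n).
uniform = {n: entropy([1.0 / n] * n) for n in (2, 4, 8)}

for label, s in two_outcome.items():
    print(f"S{label} = {s:.4f}")
for n, s in uniform.items():
    print(f"n = {n}: S = {s:.4f}, ln(n) = {math.log(n):.4f}")
```

The entropy ranks the distributions in the same order as the intuitive judgments, and for equal probabilities it grows monotonically with $n$, as assumption (3) requires.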
8.4 Alternative justification of MaxEnt

Here, we consider how we might go about assigning a probability distribution for the sides of a weighted die given only constraint information about the die. Let $p_i$ = the probability of the $i$th side occurring in any toss, where $i$ = number of dots on that side. We now impose the constraint that

$$\text{mean number of dots} = \sum_{i=1}^{6} i\, p_i = 4. \tag{8.2}$$

Note: the mean value for a fair die is 3.5. Our job is to come up with a unique set of $p_i$ values consistent with this constraint.

As a start, let's consider what we can infer about the probabilities of the six sides from prior information consisting of the mean number of dots from ten throws of the die. Suppose $I \equiv$ "in ten tosses of a die, the mean number of dots was four." We will solve this problem and then consider what happens as the number of tosses becomes very large. For a finite number of tosses, there are a finite number of possible outcomes. Let $h_1 \rightarrow h_n$ be the set of hypotheses representing these different outcomes. Some example outcomes are given in Table 8.1.

Table 8.1 Some hypotheses about the possible outcomes of tossing a die ten times.

| # of dots | $h_1$ | $h_2$ | $h_3$ | $h_4$ | ... | $h_n$ |
|---|---|---|---|---|---|---|
| 1 | 0/10 | 1/10 | 1/10 | 1/10 | ... | 2/10 |
| 2 | 0/10 | 1/10 | 2/10 | 1/10 | ... | 1/10 |
| 3 | 0/10 | 1/10 | 2/10 | 1/10 | ... | 3/10 |
| 4 | 10/10 | 1/10 | 2/10 | 2/10 | ... | 2/10 |
| 5 | 0/10 | 6/10 | 2/10 | 4/10 | ... | 1/10 |
| 6 | 0/10 | 0/10 | 1/10 | 1/10 | ... | 1/10 |
| mean | 4.0 | 4.0 | 3.5 | 4.0 | ... | 3.2 |

Which hypothesis is the most probable? In the die problem just discussed, we can use our information to reject all hypotheses which predict a mean $\neq 4$. This still leaves us a large number of possible hypotheses. Our intuition tells us that in the absence of any additional information, certain hypotheses are more likely than others (e.g., $h_1$ is less likely than $h_2$ or $h_4$). Let's try and refine our intuition. If we knew the individual $p_i$'s, we could calculate the probability of each $h_i$. It is given by the multinomial distribution

$$p(n_1, n_2, \ldots, n_6 \mid N, p_1, p_2, \ldots, p_6) = \underbrace{\frac{N!}{n_1!\, n_2! \cdots n_6!}}_{W}\; \underbrace{p_1^{n_1} p_2^{n_2} \cdots p_6^{n_6}}_{P}, \tag{8.3}$$

where $N = \sum_i n_i$ = the total number of throws of the die. $W$ is the number of different ways that $h_i$ can occur in $N$ trials, usually called the multiplicity. $P$ is the probability of any particular realization of hypothesis $h_i$. Without knowing the $p_i$'s we have no way of computing the $P$ term.² In what follows, we will ignore this term and investigate the consequences of the $W$ term alone. Let's evaluate the multiplicity, $W$, for $h_1$ and $h_4$.

$$W_{h_1} = \frac{10!}{0!\,0!\,0!\,10!\,0!\,0!} = 1, \qquad W_{h_4} = \frac{10!}{1!\,1!\,1!\,2!\,4!\,1!} = 75\,600.$$
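The two multiplicities can be checked with exact integer arithmetic. This Python sketch implements the $W$ factor of Equation (8.3); the function name is an illustrative choice.

```python
import math

def multiplicity(counts):
    """Number of distinct sequences of throws giving the listed counts per side."""
    w = math.factorial(sum(counts))
    for n in counts:
        w //= math.factorial(n)   # each partial quotient is an exact integer
    return w

h1 = [0, 0, 0, 10, 0, 0]   # all ten throws show four dots
h4 = [1, 1, 1, 2, 4, 1]    # column h4 of Table 8.1
w_h1, w_h4 = multiplicity(h1), multiplicity(h4)
print(w_h1, w_h4)
```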
It is obvious that $h_1$ can only occur in one way, but to our surprise we find that $h_4$ can occur in 75 600 different ways. In the absence of additional information, if we were to carry out a large number of repeats of ten tosses, we would expect $h_4$ to occur 75 600 times more often than $h_1$. Thus, amongst the hypotheses that satisfy our constraint, the one with the largest multiplicity is the one we would consider most probable. Call its multiplicity $W_{\max}$. From $W_{\max}$ we can derive the frequency of occurrence of any particular side of the die and use this as an estimate of the probability of that side.

A problem arises because for a small number of throws like ten, the frequency determined for one or more sides might be zero. To set the probability of these sides to zero would be unwarranted since what it really means is that $p_i < 1/10$. The general concept of using the multiplicity to select from the $h_i$ satisfying the constraint is good, but we need to refine it further. Suppose we were to use the average of the ten throws as the best estimate of the average of a much larger number of throws $N \gg 10$. For a much larger $N$, there would be a correspondingly larger number of hypotheses about probability distributions which could satisfy the average constraint. In this case, the smallest increment in $p_i$ will be $1/N$ instead of 1/10. The one with the largest multiplicity ($W_{\max}$) will be a smoother version of what we got earlier with only ten throws, and if $N$ is large enough, it will be very unlikely for any of the $p_i$ to be exactly zero, unless of course the average was either one or six dots.

We like the smoother version of the probability distribution that comes about from using a larger value of $N$; however, this gives rise to a new difficulty. We will see in Equation (8.7) that as $N$ increases, $W_{\max}$ increases at such a rate that there are (essentially) infinitely more ways $W_{\max}$ can be realized than other not-too-different probability distributions. Since we started with an average constraint pertaining to only ten throws, this degree of discrimination against acceptable competing $h_i$ is unwarranted.

² Clearly, if the mean number of dots is significantly different from 3.5, the value for a fair die, then the constraint information is telling us that some sides are more probable than others. A challenge for the future is to see how this information can be used to constrain the $P$ term.

Happily, if we use Stirling's approximation for $\ln N!$, we can factor $\ln W$ into two terms as we now show. Stirling's approximation for large $N$ is

$$\ln N! \approx N \ln N - N. \tag{8.4}$$

Writing $n_i = N p_i$, the multiplicity becomes

$$\begin{aligned}
\ln W &= N \ln N - N - \sum_{i=1}^{6} N p_i \ln(N p_i) + \sum_{i=1}^{6} N p_i \\
&= N \ln N - N - N \sum_{i=1}^{6} p_i \left[\ln p_i + \ln N\right] + N \\
&= -N \sum_{i=1}^{6} p_i \ln p_i
\end{aligned}$$

$$\ln W = -N \sum_{i=1}^{6} p_i \ln p_i = N \times \text{entropy} = NS, \tag{8.5}$$

where

$$S = -\sum_{i=1}^{6} p_i \ln p_i. \tag{8.6}$$

Equation (8.5) factors $\ln W$ into two terms: the number of throws, $N$, and the entropy term, which depends only on the desired $p_i$'s. Maximizing entropy achieves the desired smooth probability distribution. Since the multiplicity $W = \exp(NS)$, it follows that

$$\frac{W_{\max}}{W} = \exp[N(S_{\max} - S)] = \exp(N \Delta S), \tag{8.7}$$

where $W_{\max}$ is the multiplicity of the probability distribution with maximum entropy, $S_{\max}$, and $W$ is the multiplicity of a distribution with entropy $S$. The actual relative probability of two different probability distributions is proportional to the ratio of their multiplicities, or $\propto \exp(N \Delta\text{entropy})$. Clearly, the degree of discrimination depends strongly on $N$. For $N$ large, Equation (8.7) tells us that there are (essentially) infinitely more ways the outcome corresponding to maximum entropy (MaxEnt) can be realized than any outcome having a lower entropy. Jaynes (1982) showed that the quantity $2N\Delta S$ has a $\chi^2$ distribution with $M - k - 1$ degrees of freedom, where $M$ is the number of possible outcomes and $k$ is the number of constraints. In our problem of the die, $M = 6$ and $k = 1$. This allows us to compute explicitly the range of $S$ about $S_{\max}$ corresponding to any confidence level.
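How good is the factorization $\ln W = NS$ for finite $N$? The Python sketch below compares the exact $\ln W$, computed from log-gamma functions, with $NS$ for the proportions of hypothesis $h_4$ at three values of $N$. The chosen values of $N$ are arbitrary illustrations; the point is only that the ratio approaches one as $N$ grows, which is where the leading-order Stirling approximation of Equation (8.4) becomes accurate.

```python
import math

def log_multiplicity(counts):
    """Exact ln W from log-gamma functions, with no Stirling approximation."""
    return math.lgamma(sum(counts) + 1) - sum(math.lgamma(n + 1) for n in counts)

def entropy(probabilities):
    """Entropy of Equation (8.6), in nats."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)

pattern = [1, 1, 1, 2, 4, 1]                  # the proportions of hypothesis h4
p = [k / sum(pattern) for k in pattern]
S = entropy(p)

ratios = {}
for N in (10, 1000, 100000):
    counts = [k * N // 10 for k in pattern]   # n_i = N p_i, exact for these N
    ratios[N] = log_multiplicity(counts) / (N * S)
    print(f"N = {N:>6}: ln W / (N S) = {ratios[N]:.4f}")
```

For ten throws the approximation is poor, which is consistent with the text's caution that the factorization is a large-$N$ result.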
8.5 Generalizing MaxEnt

8.5.1 Incorporating a prior

In Equation (8.3), we argued that without knowing the $p_i$'s we have no way of computing term $P$. Using the principle of indifference, we assigned the same value for $P$ for each of the acceptable hypotheses and concluded that the relative probability of acceptable hypotheses is proportional to the multiplicity term. In the present generalization, we allow for the possibility of prior information about the $\{p_i\}$. For example, suppose that the index $i$ enumerates the individual pixels in an image of a very faint galaxy taken with the Hubble Space Telescope. Our constraint information in this case is the set of measured image pixel values. However, because of noise, these constraints are uncertain. In Section 8.8.2 we will learn how to make use of MaxEnt with uncertain constraints. In general, to find the MaxEnt image requires an iterative procedure which starts from an assumed prior image which is often taken to be flat, i.e., all $p_i$ equal. However, if we already have another lower resolution image of the same galaxy taken with a ground-based telescope, then this would be a better prior image to start from. In this way, we can have a prior estimate of the $p_i$ values in Equation (8.3).

For the moment we will return to the case where our constraints are certain and we will let $\{m_i\}$ be our prior estimate of $\{p_i\}$. For example, maybe we know that two sides of the die have two dots and that the other four sides have 3, 4, 5, and 6 dots, respectively. Substituting into Equation (8.3), and generalizing the discussion to a discrete probability distribution where $i$ varies from 1 to $M$ (instead of $i = 1$ to 6 for the die), we obtain

$$p(n_1, n_2, \ldots, n_M \mid N, p_1, p_2, \ldots, p_M) = \frac{N!}{n_1!\, n_2! \cdots n_M!}\, m_1^{n_1} m_2^{n_2} \cdots m_M^{n_M}. \tag{8.8}$$

Taking the natural logarithm of both sides yields

$$\begin{aligned}
\ln[p(n_1, n_2, \ldots, n_M \mid N, p_1, p_2, \ldots, p_M)] &= \sum_{i=1}^{M} n_i \ln[m_i] + \ln[N!] - \sum_{i=1}^{M} \ln[n_i!] \\
&= \sum_{i=1}^{M} n_i \ln[m_i] - N \sum_{i=1}^{M} p_i \ln[p_i],
\end{aligned} \tag{8.9}$$

where we have used Stirling's approximation (Equation (8.4)) in the last line. Substituting for $n_i = N p_i$, we obtain

$$\begin{aligned}
\frac{1}{N} \ln[p(n_1, n_2, \ldots, n_M \mid N, p_1, p_2, \ldots, p_M)] &= \sum_{i=1}^{M} p_i \ln[m_i] - \sum_{i=1}^{M} p_i \ln[p_i] \\
&= -\sum_{i=1}^{M} p_i \ln[p_i / m_i] = S.
\end{aligned} \tag{8.10}$$

This generalized entropy is known by various names including the Shannon–Jaynes entropy and the Kullback entropy. It is also sometimes written with the opposite sign (so it has to be minimized) and referred to as the cross-entropy.
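A small Python sketch of the generalized entropy of Equation (8.10) is given below. It uses one reading of the die example in the text, in which the possible outcomes are 2, 3, 4, 5, and 6 dots with prior estimates 2/6, 1/6, 1/6, 1/6, 1/6; that mapping from the verbal description to numbers is an assumption of this sketch.

```python
import math

def shannon_jaynes_entropy(p, m):
    """Equation (8.10): S = -sum_i p_i ln(p_i / m_i), with 0 ln 0 taken as 0."""
    return -sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0.0)

# Assumed prior estimate: two sides show two dots, so the outcomes
# 2, 3, 4, 5, 6 dots have prior probabilities 2/6, 1/6, 1/6, 1/6, 1/6.
m = [2 / 6, 1 / 6, 1 / 6, 1 / 6, 1 / 6]

s_at_prior = shannon_jaynes_entropy(m, m)           # p equal to the prior
s_uniform = shannon_jaynes_entropy([0.2] * 5, m)    # p uniform over the outcomes
print(f"S(p = m) = {s_at_prior:.4f}, S(p uniform) = {s_uniform:.4f}")
```

For normalized $p$ and $m$, this entropy is never positive and reaches its largest value, zero, when $p_i = m_i$, so with no constraint beyond normalization the prior estimate itself is selected.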
8.5.2 Continuous probability distributions

The correct measure of uncertainty in the continuous case (Jaynes, 1968; Shore and Johnson, 1980) is:

$$S_c = -\int p(y) \ln\left[\frac{p(y)}{m(y)}\right] dy. \tag{8.11}$$

The quantity $m(y)$, called the Lebesgue measure (Sivia, 1996), ensures that the entropy expression is invariant under a change of variables, $y \rightarrow y' = f(y)$, because both $p(y)$ and $m(y)$ transform in the same way. Essentially, the measure takes into account how the (uniform) bin-widths in $y$-space translate to a corresponding set of (variable) bin-widths in the alternative $y'$-space. If $m(y)$ is a constant, this equation reduces to

$$\begin{aligned}
S_c &= -\int p(y) \ln p(y)\, dy + \ln m(y) \int p(y)\, dy \\
&= -\int p(y) \ln p(y)\, dy + \text{constant}.
\end{aligned} \tag{8.12}$$

To find the maximum entropy solution, we are interested in derivatives of Equation (8.12), and for this, a constant prior has no effect.
8.6 How to apply the MaxEnt principle

In this section, we will demonstrate how to use the MaxEnt principle to encode some testable information into a probability distribution. We will need to use the Lagrange multipliers of variational calculus in which MaxEnt plays the role of a variational principle,³ so we first briefly review that topic.

8.6.1 Lagrange multipliers of variational calculus

Suppose there are $M$ distinct possibilities $\{y_i\}$ to be considered, where $i = 1$ to $M$. We want to compute $p(y_i \mid I)$ (abbreviated by $p_i$) subject to a testable constraint. If $S$ represents the entropy of $p(y_i \mid I)$, then the condition for maximum entropy is given by

$$dS = \frac{\partial S}{\partial p_1} dp_1 + \cdots + \frac{\partial S}{\partial p_M} dp_M = 0.$$

Without employing a constraint, the $dp_i$'s are independent and the only solution is if all the coefficients are individually equal to 0. Suppose we are given the constraint $\sum p_i^2 = R$, where $R$ is a constant. Rewrite the constraint⁴ as $C = \sum p_i^2 - R = 0$. With this constraint, any permissible $dp_i$'s must satisfy

$$dC = 0 = \frac{\partial C}{\partial p_1} dp_1 + \cdots + \frac{\partial C}{\partial p_M} dp_M = 2 p_1\, dp_1 + \cdots + 2 p_M\, dp_M. \tag{8.13}$$

We can combine $dS$ and $dC$ in the form

$$dS - \lambda\, dC = 0, \tag{8.14}$$

where $\lambda$ is an undetermined multiplier.

$$\left(\frac{\partial S}{\partial p_1} - 2\lambda p_1\right) dp_1 + \cdots + \left(\frac{\partial S}{\partial p_M} - 2\lambda p_M\right) dp_M = 0.$$

Now if $\lambda$ is chosen so $(\partial S / \partial p_1 - 2\lambda p_1) = 0$, then the equation reduces to

$$\left(\frac{\partial S}{\partial p_2} - 2\lambda p_2\right) dp_2 + \cdots + \left(\frac{\partial S}{\partial p_M} - 2\lambda p_M\right) dp_M = 0. \tag{8.15}$$

But the remaining $M - 1$ variables $dp_i$ can be considered independent so their coefficients must also equal zero to satisfy Equation (8.15). This yields a set of $M$ equations which can be solved for the $\{p_i\}$. It can be shown that this procedure does lead to a global maximum in $S$ (e.g., Tribus, 1969).

³ One desirable feature of a variational principle is that it does not introduce correlations between the $p_i$ values unless information about these correlations is contained in the constraints. In Section 8.8.1, we show that the MaxEnt variational principle satisfies this condition.
8.7 MaxEnt distributions Before deriving MaxEnt probability distributions for some common forms of testable information, we first examine some general properties of MaxEnt distributions.
8.7.1 General properties Suppose we are given the following constraints: XM p ¼1 i¼1 i XM f ðy Þp ¼ h f1 i ¼ f1 i¼1 1 i i .. . XM
f ðy Þp i¼1 r i i
¼ h fr i ¼ fr
where M is the number of discrete probabilities. For example, suppose we have the constraint information hcosðyÞi ¼ f1 . In this case, f1 ðyi Þ ¼ cosðyi Þ. Equation (8.14) can be written with the help of Equation (8.10) as 4
In principle, R might be a f ðp1 ; . . . ; pM Þ and thus lead to a term in the differential.
193
8.7 MaxEnt distributions
" d
M X
pi ln pi þ
i¼1
M X
pi ln mi
i¼1 M X
r
! pi 1
1
i¼1
!#
fr ðyi Þpi fr
M X
M X
! f1 ðyi Þpi f1
i¼1
(8:16)
¼ 0:
i¼1
P PM Assuming mi is a constant, then M i ¼ 1 pi ln mi ¼ ln mi i ¼ 1 pi ¼ ln mi and d ln mi ¼ 0. In this case, the above equation simplifies to M X i¼1
@ ln pi ln pi pi 1 f1 ðyi Þ r fr ðyi Þ dpi @pi
M X ¼ ½ ln pi ð1 þ Þ 1 f1 ðyi Þ r fr ðyi Þdpi ¼ 0:
(8:17)
i¼1
For each i, we can solve for pi . pi ¼ exp½ð1 þ Þ exp½1 f1 ðyi Þ r fr ðyi Þ " # r X j fj ðyi Þ ; ¼ exp½0 exp
(8:18)
j¼1
where 0 ¼ 1 þ . Using the first constraint, we obtain " # M M r X X X pi ¼ exp½0 exp j fj ðyi Þ ¼ 1; i¼1
i¼1
(8:19)
j¼1
which can be rewritten as exp½þ0 ¼
M X i¼1
" exp
r X
# j fj ðyi Þ :
(8:20)
j¼1
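Now differentiate Equation (8.20) with respect to $\lambda_k$, and multiply through by $\exp[-\lambda_0]$ to obtain

$$-\frac{\partial \lambda_0}{\partial \lambda_k} = \exp[-\lambda_0] \sum_{i=1}^{M} \exp\left[-\sum_{j=1}^{r} \lambda_j f_j(y_i)\right] f_k(y_i) = \sum_{i=1}^{M} p_i f_k(y_i) = \langle f_k \rangle. \tag{8.21}$$

This leads to the following useful result, which we make use of in Section 8.7.4:

$$-\frac{\partial \lambda_0}{\partial \lambda_k} = \langle f_k \rangle = \bar{f}_k. \tag{8.22}$$

From Equation (5.4), we can write the variance of $f_k$ as

$$\mathrm{Var}(f_k) = \langle f_k^2 \rangle - (\langle f_k \rangle)^2. \tag{8.23}$$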
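We obtain $\langle f_k^2 \rangle$ by differentiating Equation (8.21) once more with respect to $\lambda_k$, which gives $\langle f_k^2 \rangle = \partial^2 \lambda_0/\partial \lambda_k^2 + (\partial \lambda_0/\partial \lambda_k)^2$. Substituting that into Equation (8.23), together with Equation (8.22), yields

$$\mathrm{Var}(f_k) = \left[\frac{\partial^2 \lambda_0}{\partial \lambda_k^2} + \left(\frac{\partial \lambda_0}{\partial \lambda_k}\right)^2\right] - \left(\frac{\partial \lambda_0}{\partial \lambda_k}\right)^2 = \frac{\partial^2 \lambda_0}{\partial \lambda_k^2}. \tag{8.24}$$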
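The derivative relations above can be checked numerically. The following Python sketch (Python rather than the book's Mathematica; the six outcomes, the single constraint function $f_1(y_i) = y_i$, and the value of $\lambda_1$ are illustrative assumptions) compares finite-difference derivatives of $\lambda_0$ with the mean and variance computed directly from the MaxEnt probabilities of Equation (8.18).

```python
import math

# Hypothetical setup: six outcomes and one constraint function f_1(y_i) = y_i.
f = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

def lambda0(lam1):
    # Eq. (8.20): exp(lambda_0) = sum_i exp(-lambda_1 f_1(y_i))
    return math.log(sum(math.exp(-lam1 * fi) for fi in f))

def moments(lam1):
    # MaxEnt probabilities from Eq. (8.18), then <f_1> and Var(f_1) directly
    w = [math.exp(-lam1 * fi) for fi in f]
    z = sum(w)
    p = [wi / z for wi in w]
    mean = sum(pi * fi for pi, fi in zip(p, f))
    var = sum(pi * fi ** 2 for pi, fi in zip(p, f)) - mean ** 2
    return mean, var

lam1, h = 0.5, 1e-4  # illustrative multiplier and finite-difference step
d1 = (lambda0(lam1 + h) - lambda0(lam1 - h)) / (2 * h)                    # d(lambda_0)/d(lambda_1)
d2 = (lambda0(lam1 + h) - 2 * lambda0(lam1) + lambda0(lam1 - h)) / h ** 2  # second derivative
mean, var = moments(lam1)

print("-d(lambda0)/d(lambda1) =", -d1, "  <f_1> =", mean)
print("d2(lambda0)/d(lambda1)^2 =", d2, "  Var(f_1) =", var)
```

Under these assumptions, the first derivative reproduces $\langle f_1 \rangle$ with a sign change and the second derivative reproduces the variance, up to finite-difference error.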
8.7.2 Uniform distribution

Suppose the only known constraint is the minimal constraint possible for a probability distribution, $\sum_{i=1}^{M} p_i = 1$. Following Equation (8.14), we can write

$$d\left[\underbrace{-\sum_{i=1}^{M} p_i \ln[p_i/m_i]}_{\text{entropy}} - \underbrace{\lambda\left(\sum_{i=1}^{M} p_i - 1\right)}_{\text{constraint}}\right] = 0 \tag{8.25}$$

$$d\left[-\sum_{i=1}^{M} p_i \ln p_i + \sum_{i=1}^{M} p_i \ln m_i - \lambda\left(\sum_{i=1}^{M} p_i - 1\right)\right] = 0$$

$$\sum_{i=1}^{M}\left(-\ln p_i - p_i\frac{\partial \ln p_i}{\partial p_i} + \ln m_i - \lambda\frac{\partial p_i}{\partial p_i}\right) dp_i = 0$$

$$\sum_{i=1}^{M}\left(-\ln[p_i/m_i] - 1 - \lambda\right) dp_i = 0.$$

The addition of the Lagrange undetermined multiplier makes $(-\ln[p_i/m_i] - 1 - \lambda = 0)$ for one $p_i$ and the remaining $(M-1)$ of the $dp_i$'s independent. So for all $p_i$, we require

$$-\ln[p_i/m_i] - 1 - \lambda = 0, \tag{8.26}$$

or,

$$p_i = m_i\, e^{-(1+\lambda)}. \tag{8.27}$$

Since $\sum p_i = 1$,

$$\sum_{i=1}^{M} m_i\, e^{-(1+\lambda)} = 1 = e^{-(1+\lambda)} \sum_{i=1}^{M} m_i. \tag{8.28}$$

Since $\sum_{i=1}^{M} m_i = 1$, then $\lambda = -1$ and thus

$$p_i = m_i. \tag{8.29}$$
Suppose our prior information leads us to assume $m_i = \text{a constant} = 1/M$. Then $p_i$ describes a uniform distribution. In the continuum limit, we would write Equation (8.29) as

$$p(y|I) = m(y). \tag{8.30}$$

Thus, for $m(y) = $ a constant and the minimal constraint, $\int p(y)\,dy = 1$, the uniform distribution has maximum entropy.
8.7.3 Exponential distribution

In this case, we assume an additional constraint that the average value of $y_i$ is known and equal to $\mu$, so we have two constraints.

(1) $\sum_{i=1}^{M} p_i = 1$ (constraint 1)
(2) $\sum_{i=1}^{M} y_i p_i = \mu$ (constraint 2: known mean)

For example, in the die problem of Section 8.4, we could be told the average number of dots on a very large number of throws of the die but not be given the results of the individual throws. In this case, we use two Lagrange multipliers, $\lambda$ and $\lambda_1$. Following Equation (8.14), we can write

$$d\left[\underbrace{-\sum_{i=1}^{M} p_i \ln[p_i/m_i]}_{\text{entropy}} - \underbrace{\lambda\left(\sum_{i=1}^{M} p_i - 1\right)}_{\text{constraint 1}} - \underbrace{\lambda_1\left(\sum_{i=1}^{M} y_i p_i - \mu\right)}_{\text{constraint 2}}\right] = 0 \tag{8.31}$$

$$\sum_{i=1}^{M}\left(-\ln[p_i/m_i] - p_i\frac{\partial \ln p_i}{\partial p_i} - \lambda\frac{\partial p_i}{\partial p_i} - \lambda_1 y_i\frac{\partial p_i}{\partial p_i}\right) dp_i = 0$$

$$\sum_{i=1}^{M}\left(-\ln[p_i/m_i] - 1 - \lambda - \lambda_1 y_i\right) dp_i = 0. \tag{8.32}$$

Again, the addition of the Lagrange undetermined multipliers makes $(-\ln[p_i/m_i] - 1 - \lambda - \lambda_1 y_i = 0)$ for one $p_i$ and the remaining $(M-1)$ of the $dp_i$'s independent. So for all $p_i$, we require

$$-\ln[p_i/m_i] - 1 - \lambda - \lambda_1 y_i = 0, \tag{8.33}$$

or,

$$p_i = m_i\, e^{-(1+\lambda)}\, e^{-\lambda_1 y_i}. \tag{8.34}$$

We can now apply our two constraints to determine the Lagrange multipliers.

$$\sum_{i=1}^{M} p_i = 1 = e^{-(1+\lambda)} \sum_{i=1}^{M} m_i\, e^{-\lambda_1 y_i}. \tag{8.35}$$
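Therefore

$$e^{-(1+\lambda)} = \frac{1}{\sum_{i=1}^{M} m_i\, e^{-\lambda_1 y_i}}. \tag{8.36}$$

From the second constraint, we have

$$\sum_{i=1}^{M} y_i p_i = \mu = \frac{\sum_{i=1}^{M} y_i m_i\, e^{-\lambda_1 y_i}}{\sum_{i=1}^{M} m_i\, e^{-\lambda_1 y_i}}, \tag{8.37}$$

or

$$\sum_{i=1}^{M} y_i m_i\, e^{-\lambda_1 y_i} - \mu \sum_{i=1}^{M} m_i\, e^{-\lambda_1 y_i} = 0. \tag{8.38}$$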
For any particular value of $\mu$, the above equation can be solved numerically for $\lambda_1$. If we set $m_i = 1/6$ and $y_i = i$, then Equation (8.38) can be used to solve for the probability of the six sides of the die problem discussed in Section 8.4. We illustrate this with the following Mathematica commands assuming $\mu = 2.2$; the result is shown in Figure 8.1.

Box 8.1 Mathematica commands: MaxEnt die problem

First, we define the function q[μ] for an arbitrary value of μ:

    q[μ_] := Sum[i Exp[-i λ1], {i, 1, 6}] - μ Sum[Exp[-i λ1], {i, 1, 6}];

The next line solves for λ1 with μ = 2.2:

    sol = Solve[q[2.2] == 0, λ1]

For each μ, λ1 has multiple complex solutions and one real solution λ1real. Pick out the real solution using Cases.

    λ1real = Cases[λ1 /. sol, _Real][[1]]

Next, evaluate the expression for the probability of the ith side, probi:

    probi = Exp[-i λ1real]/Sum[Exp[-j λ1real], {j, 1, 6}]

Finally, create a table of the probabilities of the 6 sides.

    prob = Table[{i, probi}, {i, 6}]

    {{1, 0.421273}, {2, 0.251917}, {3, 0.150644}, {4, 0.0900838}, {5, 0.0538692}, {6, 0.0322133}}
[Figure 8.1: plot titled "Mean number of dots = 2.2"; horizontal axis: Number of dots (1 to 6); vertical axis: Probability.]

Figure 8.1 The figure shows the probability of the six sides of a die given the constraint that the average number of dots $= 2.2$.
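For readers without Mathematica, the same calculation can be sketched in Python. This is not the book's notebook: it swaps `Solve` for a simple bisection on Equation (8.38), and the function names `q` and `solve_lambda1` and the bracketing interval are assumptions made for illustration. With $m_i = 1/6$ the prior cancels, and the sign of `q` equals the sign of (mean $- \mu$), so the real root is unique.

```python
import math

def q(lam1, mu):
    # Left-hand side of Eq. (8.38) with y_i = i and the constant m_i = 1/6 cancelled
    return sum((i - mu) * math.exp(-i * lam1) for i in range(1, 7))

def solve_lambda1(mu, lo=-10.0, hi=10.0, tol=1e-12):
    # Bisection: q > 0 at lo (implied mean too large), q < 0 at hi (mean too small)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if q(mid, mu) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

mu = 2.2
lam1 = solve_lambda1(mu)
weights = [math.exp(-i * lam1) for i in range(1, 7)]
prob = [w / sum(weights) for w in weights]   # Eq. (8.34), normalised via Eq. (8.36)

for face, p in enumerate(prob, start=1):
    print(face, round(p, 6))
```

The probabilities fall off geometrically with the number of dots, as expected from the exponential form of Equation (8.34).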
We can simply generalize Equation (8.34) for $p_i$ to the continuous case $p(y|I)$:

$$p(y|I) = m(y)\, e^{-(1+\lambda)}\, e^{-\lambda_1 y}. \tag{8.39}$$

If we assume $m(y)$ is a constant, then we have

$$p(y|I) \propto e^{-\lambda_1 y}. \tag{8.40}$$

The normalization and $\lambda_1$ can easily be evaluated if the limits of integration extend from 0 to $\infty$. The result is

$$p(y|\mu) = \frac{1}{\mu}\, e^{-y/\mu} \quad \text{for } y \ge 0. \tag{8.41}$$

8.7.4 Normal and truncated Gaussian distributions

In this section, we assume $m(y)$, the prior for $p(y)$, has the following form:

$$m(y) = \begin{cases} 1/(y_H - y_L), & \text{if } y_L \le y \le y_H \\ 0, & \text{if } y_L > y \text{ or } y > y_H. \end{cases} \tag{8.42}$$
In this case, we assume an additional constraint that the variance of $y$ is equal to $\sigma^2$. The two constraints in this case are:

(1) $\int_{y_L}^{y_H} p(y)\,dy = 1$
(2) $\int_{y_L}^{y_H} (y - \mu)^2\, p(y)\,dy = \sigma^2$

Because $m(y)$ is a constant, we solve for $p(y)$ which maximizes

$$-\int p(y) \ln p(y)\,dy$$

subject to the constraints 1 and 2. This optimization is best done as the limiting case of a discrete problem; explicitly, we need to find the solution to

$$d\left\{-\sum_{i=1}^{M} p_i \ln p_i - \lambda\left[\sum_{i=1}^{M} p_i - 1\right] - \lambda_1\left[\sum_{i=1}^{M} (y_i - \mu)^2 p_i - \sigma^2\right]\right\} = 0, \tag{8.43}$$

where $M$ is the number of discrete probabilities. This leads to

$$\sum_{i=1}^{M}\left[-\ln p_i - 1 - \lambda - \lambda_1 (y_i - \mu)^2\right] dp_i = 0. \tag{8.44}$$
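For each value of $i$, we require

$$-\ln p_i - 1 - \lambda - \lambda_1 (y_i - \mu)^2 = 0, \tag{8.45}$$

or,

$$p_i = e^{-(1+\lambda)}\, e^{-\lambda_1 (y_i - \mu)^2} = e^{-\lambda_0}\, e^{-\lambda_1 (y_i - \mu)^2}, \tag{8.46}$$

where $\lambda_0 = 1 + \lambda$. This generalizes to the continuum assignment

$$p(y) = e^{-\lambda_0}\, e^{-\lambda_1 (y - \mu)^2}. \tag{8.47}$$

We can solve for $\lambda_1$ and $\lambda_0$ from our two constraints. From the first constraint,

$$\int_{y_L}^{y_H} p(y)\,dy = 1 = e^{-\lambda_0} \int_{y_L}^{y_H} e^{-\lambda_1 (y - \mu)^2}\,dy. \tag{8.48}$$

Compare this equation to the equation for the error function⁵ $\mathrm{erf}(z)$ (see Equation (5.39)) given by

$$\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z \exp(-u^2)\,du. \tag{8.49}$$

The solution for $\lambda_0$ in Equation (8.48), in terms of the error function, is

$$\lambda_0 = \ln\left[\frac{\sqrt{\pi}}{2\sqrt{\lambda_1}}\right] + \ln\left[\mathrm{erf}\left\{\sqrt{\lambda_1}\,(y_H - \mu)\right\} - \mathrm{erf}\left\{\sqrt{\lambda_1}\,(y_L - \mu)\right\}\right]. \tag{8.50}$$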
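As a sanity check on Equation (8.50), the sketch below (Python, with illustrative values of $\lambda_1$, $\mu$, and the limits chosen here rather than taken from the text) evaluates $\lambda_0$ from the error-function expression and compares it with a direct midpoint-rule integration of Equation (8.48). It also shows that for wide limits $\lambda_0$ approaches $\ln\sqrt{\pi/\lambda_1}$.

```python
import math

def lambda0_erf(lam1, mu, y_lo, y_hi):
    # Eq. (8.50): lambda_0 expressed through the error function
    s = math.sqrt(lam1)
    return (math.log(math.sqrt(math.pi) / (2.0 * s))
            + math.log(math.erf(s * (y_hi - mu)) - math.erf(s * (y_lo - mu))))

def lambda0_numeric(lam1, mu, y_lo, y_hi, n=20000):
    # Eq. (8.48): exp(lambda_0) = integral of exp(-lam1 (y - mu)^2), midpoint rule
    h = (y_hi - y_lo) / n
    total = sum(math.exp(-lam1 * (y_lo + (k + 0.5) * h - mu) ** 2) for k in range(n))
    return math.log(total * h)

lam1, mu = 0.5, 20.0             # illustrative: lam1 = 1/(2 sigma^2) with sigma = 1
narrow = (19.0, 22.0)            # limits that noticeably truncate the distribution
wide = (10.0, 30.0)              # limits far out in both tails

exact_narrow = lambda0_erf(lam1, mu, *narrow)
check_narrow = lambda0_numeric(lam1, mu, *narrow)
exact_wide = lambda0_erf(lam1, mu, *wide)
limit_wide = 0.5 * math.log(math.pi / lam1)

print(exact_narrow, check_narrow)
print(exact_wide, limit_wide)
```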
We will consider two cases which depend on the limits of integration $(y_L, y_H)$.

Case I (Normal Gaussian)

Suppose the limits of integration satisfy the condition⁶

$$\sqrt{\lambda_1}\,(y_H - \mu) \gg 1 \quad \text{and} \quad \sqrt{\lambda_1}\,(y_L - \mu) \ll -1. \tag{8.51}$$

In this case,

$$\mathrm{erf}\left\{\sqrt{\lambda_1}\,(y_H - \mu)\right\} \approx 1 \quad \text{and} \quad \mathrm{erf}\left\{\sqrt{\lambda_1}\,(y_L - \mu)\right\} \approx -1,$$

and Equation (8.50) simplifies to

$$\lambda_0 \approx \ln\left[\frac{\sqrt{\pi}}{2\sqrt{\lambda_1}}\right] + \ln[2] = \ln\sqrt{\frac{\pi}{\lambda_1}}. \tag{8.52}$$

We now make use of Equation (8.22) to obtain an equation for $\lambda_1$:

$$-\frac{\partial \lambda_0}{\partial \lambda_1} = -\frac{\partial \ln\left[\sqrt{\pi/\lambda_1}\right]}{\partial \lambda_1} = \frac{1}{2\lambda_1} = \sigma^2. \tag{8.53}$$

5 The error function has the following properties: $\mathrm{erf}(\infty) = 1$; $\mathrm{erf}(-\infty) = -1$ and $\mathrm{erf}(-z) = -\mathrm{erf}(z)$.
6 Note: $\mathrm{erf}(1) = 0.843$, $\mathrm{erf}(\sqrt{2}) = 0.955$, $\mathrm{erf}(2) = 0.995$, and $\mathrm{erf}(3) = 0.999978$.
Combining Equations (8.53) and (8.52), we obtain

$$\lambda_1 = \frac{1}{2\sigma^2}, \qquad e^{-\lambda_0} = \frac{1}{\sqrt{2\pi}\,\sigma}.$$

The result,

$$p(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(y - \mu)^2/2\sigma^2}, \tag{8.54}$$

is a Gaussian. Thus, for a given $\sigma^2$ and a uniform prior that satisfies Equation (8.51), a Gaussian distribution has the greatest uncertainty (maximum entropy). Now that we have evaluated $\lambda_0$ and $\lambda_1$, we can rewrite Equation (8.51) in the more useful form

$$\frac{(y_H - \mu)}{\sqrt{2}\,\sigma} \gg 1 \quad \text{and} \quad \frac{(\mu - y_L)}{\sqrt{2}\,\sigma} \gg 1. \tag{8.55}$$
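The claim that the Gaussian has the greatest entropy for a given variance can be illustrated by comparing it with two other familiar densities of the same variance. The sketch below uses the standard closed-form differential entropies of the Gaussian, uniform, and Laplace distributions; the choice of comparison distributions and of $\sigma = 1$ is an assumption made for illustration, not part of the text.

```python
import math

def entropy_gaussian(sigma):
    # Differential entropy of a Gaussian: (1/2) ln(2 pi e sigma^2)
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def entropy_uniform(sigma):
    # A uniform density of width w has variance w^2 / 12 and entropy ln(w)
    return math.log(sigma * math.sqrt(12.0))

def entropy_laplace(sigma):
    # A Laplace density with scale b has variance 2 b^2 and entropy 1 + ln(2 b)
    b = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2.0 * b)

sigma = 1.0  # illustrative common standard deviation
entropies = {
    "gaussian": entropy_gaussian(sigma),
    "laplace": entropy_laplace(sigma),
    "uniform": entropy_uniform(sigma),
}
for name, s in entropies.items():
    print(f"{name:9s} {s:.4f}")
```

Changing `sigma` shifts all three entropies by the same amount, $\ln\sigma$, so the ordering does not depend on the value chosen.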
We frequently deal with problems where the $q$th data value, $y_q$, is described by an equation of the form

$$y_q = y_{pq} + e_q \quad \Rightarrow \quad y_q - y_{pq} = e_q,$$

where $y_{pq}$ is the model prediction for the $q$th data value and $e_q$ is an error term. In Section 4.8.1, we showed that for a deterministic model, $M_j$, the probability of the data, $p(Y_q|M_j, I)$, is equal to $p(E_q|M_j, I)$, the probability of the errors. If we interpret the $\mu$ in Equation (8.54) as $y_{pq}$, the model prediction, then this equation becomes the MaxEnt sampling distribution for the $q$th error term in the likelihood function. This is a very important result. It says that unless we have some additional prior information which justifies the use of some other sampling distribution, we should use a Gaussian sampling distribution. It makes the fewest assumptions about the information you don't have and will lead to the most conservative estimates (i.e., greater uncertainty than you would get from choosing a more appropriate distribution based on more information).

In a situation where we do not know the appropriate sampling distribution, we will also, in general, not know the actual variance ($\sigma^2$) of the distribution. In that case, we can treat the $\sigma$ of the Gaussian sampling distribution as an unknown nuisance parameter with a range specified by our prior information. The one restriction to this argument is that the prior upper bound on $\sigma$ must satisfy Equation (8.55). If possible data values, represented by the variable $y$, are unrestricted, then this condition simply requires that the upper bound on $\sigma$ be finite. In some experiments, the range of possible data values is limited, e.g., positive values only. In that case, the MaxEnt distribution may become a truncated Gaussian, as discussed in Case II below. We now consider a simple example that exploits a MaxEnt Gaussian sampling distribution with unknown $\sigma$. We will make considerable use of this approach in later chapters starting with Section 9.2.3.
Example: Suppose we want to estimate the location of the start of a stratum in a core sample taken by a Martian rover. The rover transmits a low resolution scan, which allows the experimenter to refine the region of interest for analysis by a higher resolution instrument aboard the rover. Unfortunately, the rover made a rough landing and ceases operation after only two high resolution measurements have been completed. In this example, we simulate a sample of two measurements made with the high resolution instrument for a stratum starting position of 20 units along the core sample. For the simulation, we assume a bimodal distribution of measurement errors as shown in panel (a) of Figure 8.2. We further suppose that the distribution of measurement errors is unknown to the scientist, named Jean, who will perform the analysis. Jean needs to choose a sampling distribution for use in evaluating the likelihood function in a Bayesian analysis of the high resolution core sample. In the absence of additional information, she picks a Gaussian sampling distribution, because from the above argument, the Gaussian will lead to the most conservative estimates (i.e., greater uncertainty than you would get from choosing a more appropriate distribution based on more information). Based on the low resolution core sample measurements, she assumes a uniform prior for the mean location extending from 15 to 25 units. She assumes a Jeffreys prior for $\sigma$ and estimates a conservative upper limit to $\sigma$ of 4 units. She estimates the lower limit, $\sigma = 0.4$ units, by setting it equal to the digital read out accuracy. To see how well the parameterized Gaussian sampling distribution performs, we simulated five independent samples, each consisting of two measurements. Panels (b), (c), (d), (e), and (f) of Figure 8.2 show a comparison of the posterior PDFs for the stratum start location computed using: (1) the true sampling distribution (solid curve), and (2) a Gaussian with an unknown $\sigma$ (dotted curve).
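A minimal numerical sketch of the Gaussian branch of this analysis is given below in Python. It uses the priors stated in the example (uniform in the location from 15 to 25 units, Jeffreys in $\sigma$ from 0.4 to 4 units), but the two data values are made up for illustration, and the grid sizes and the name `marginal_likelihood` are assumptions. The true bimodal error distribution is not reproduced here.

```python
import math

data = [19.4, 20.9]            # hypothetical pair of measurements (not from the text)
mu_lo, mu_hi = 15.0, 25.0      # uniform prior range for the stratum location
sig_lo, sig_hi = 0.4, 4.0      # Jeffreys prior range for sigma

def marginal_likelihood(mu, n_sigma=400):
    # Marginalise the Gaussian likelihood over sigma. The Jeffreys prior is
    # uniform in ln(sigma), so a midpoint rule on a ln(sigma) grid is used.
    n = len(data)
    q = sum((d - mu) ** 2 for d in data)
    step = math.log(sig_hi / sig_lo) / n_sigma
    total = 0.0
    for k in range(n_sigma):
        sigma = sig_lo * math.exp((k + 0.5) * step)
        total += sigma ** (-n) * math.exp(-q / (2.0 * sigma ** 2)) * step
    return total

n_mu = 1000
d_mu = (mu_hi - mu_lo) / n_mu
mu_grid = [mu_lo + (k + 0.5) * d_mu for k in range(n_mu)]
unnorm = [marginal_likelihood(m) for m in mu_grid]
norm = sum(unnorm) * d_mu
posterior = [u / norm for u in unnorm]           # posterior density on the grid
mu_peak = mu_grid[posterior.index(max(posterior))]

print("posterior peak near", round(mu_peak, 2))
```

Constant factors such as $(2\pi)^{-N/2}$ and the prior normalizations cancel when the posterior is normalized on the grid, which is why they are omitted.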
The actual measurements are indicated by the arrows at the top of each panel. It is quite often the case that we don't know the true likelihood function. In some cases, we have a sufficient number of repeated measurements (say five or more) that we can appeal to the CLT (see Section 5.11) and work with the average value, whose distribution will be closely Gaussian with a $\sigma$ given by Equation (5.50). However, in this example, we had only two measurements. Instead, we appealed to the MaxEnt principle and used a Gaussian likelihood function, marginalizing over the unknown variance. From Figure 8.2, it is apparent that the Gaussian likelihood function performed quite well compared to the true bimodal likelihood function. The conservative nature of the Gaussian assumption is apparent from the much broader tails. Further details of this type of analysis are given in Section 9.2.3.

[Figure 8.2: six panels, (a) to (f), each plotting PDF against x.]

Figure 8.2 Panel (a) shows the bimodal distribution of instrumental measurement errors. Panels (b), (c), (d), (e), and (f) show a comparison of the posterior PDFs for the stratum start location derived from five simulated data samples. Each sample consists of the two data points indicated by the two arrows at the top of each panel. The solid curve shows the result obtained using the true sampling distribution. The dotted curve shows the result using a Gaussian with unknown $\sigma$.

Case II (Truncated Gaussian)

When the condition specified by Equation (8.55) is not satisfied, it is still possible to compute a MaxEnt sampling distribution that we refer to as a truncated Gaussian, but there is no simple analytic solution for the Lagrange multipliers. The $\lambda$'s need to be solved for numerically.⁷

7 See Tribus (1969) for a more detailed discussion of this case.
8.7.5 Multivariate Gaussian distribution

The MaxEnt procedure is easily extended to multiple variables by defining the entropy as a multi-dimensional integral:

$$S = -\int p(Y) \ln[p(Y)/m(Y)]\,dY, \tag{8.56}$$

where $p(Y) = p(y_1, y_2, \ldots, y_N|I) = p(\{y_i\}|I)$ and $\int dY = \int d^N y$. Suppose the testable information only consists of knowledge of their individual variances,

$$\langle (y_i - \mu_i)^2 \rangle = \int (y_i - \mu_i)^2\, p(y_1, y_2, \ldots, y_N)\,d^N y = \sigma_{ii} = \sigma_i^2 \quad (i = 1 \text{ to } N), \tag{8.57}$$

and covariances,

$$\langle (y_i - \mu_i)(y_j - \mu_j) \rangle = \int (y_i - \mu_i)(y_j - \mu_j)\, p(y_1, y_2, \ldots, y_N)\,d^N y = \sigma_{ij}. \tag{8.58}$$
In Appendix E, we show that provided the prior limits on the range of each variable satisfy the condition given in Equation (8.55), maximizing Equation (8.56), with uniform measure, yields the general form of a correlated multivariate Gaussian distribution:

$$p(Y|\{\mu_i, \sigma_{ij}\}) = \frac{1}{(2\pi)^{N/2}\sqrt{\det E}} \exp\left[-\frac{1}{2}\sum_{ij} (y_i - \mu_i)\,[E^{-1}]_{ij}\,(y_j - \mu_j)\right], \tag{8.59}$$

where

$$\sum_{ij} = \sum_{i=1}^{N}\sum_{j=1}^{N},$$

and

$$E = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} & \cdots & \sigma_{1N} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} & \cdots & \sigma_{2N} \\ \vdots & \vdots & \vdots & & \vdots \\ \sigma_{N1} & \sigma_{N2} & \sigma_{N3} & \cdots & \sigma_{NN} \end{pmatrix}. \tag{8.60}$$

In most applications that we will encounter, the $y_i$ variable will represent possible values of a datum and be labeled $d_i$. Equation (8.59) can then be rewritten as a likelihood:

$$p(D|M, I) = \frac{1}{(2\pi)^{N/2}\sqrt{\det E}} \exp\left[-\frac{1}{2}\sum_{ij} (d_i - f_i)\,[E^{-1}]_{ij}\,(d_j - f_j)\right], \tag{8.61}$$

where $f_i$ is the model prediction for the $i$th datum.
If the variables are all independent, i.e., the covariance terms are all zero, then Equation (8.61) reduces to

$$p(D|M, I) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left\{-\frac{(d_i - f_i)^2}{2\sigma_i^2}\right\} = \left(\prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma_i}\right) \exp\left\{-\sum_{i=1}^{N} \frac{(d_i - f_i)^2}{2\sigma_i^2}\right\}. \tag{8.62}$$

In Chapter 10, we discuss the concepts of covariance and correlation in more detail and make use of a multivariate Gaussian in least-squares analysis.
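The reduction from the correlated form to the product form can be checked for the smallest non-trivial case, $N = 2$, where the inverse and determinant of $E$ are available in closed form. The Python sketch below uses made-up data, model predictions, and variances; the function names are hypothetical.

```python
import math

def likelihood_2d(d, f, s11, s22, s12):
    # Eq. (8.61) for N = 2 with covariance matrix E = [[s11, s12], [s12, s22]]
    det = s11 * s22 - s12 * s12
    r1, r2 = d[0] - f[0], d[1] - f[1]
    # Quadratic form r^T E^{-1} r using the closed-form 2x2 inverse
    quad = (s22 * r1 * r1 - 2.0 * s12 * r1 * r2 + s11 * r2 * r2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def likelihood_independent(d, f, sigmas):
    # Eq. (8.62): product of one-dimensional Gaussians
    out = 1.0
    for di, fi, si in zip(d, f, sigmas):
        out *= math.exp(-(di - fi) ** 2 / (2.0 * si ** 2)) / (math.sqrt(2.0 * math.pi) * si)
    return out

d = [1.2, 0.7]             # hypothetical data
f = [1.0, 1.0]             # hypothetical model predictions
sigmas = [0.5, 2.0]        # hypothetical standard deviations

full = likelihood_2d(d, f, sigmas[0] ** 2, sigmas[1] ** 2, 0.0)   # zero covariance
prod = likelihood_independent(d, f, sigmas)
correlated = likelihood_2d(d, f, sigmas[0] ** 2, sigmas[1] ** 2, 0.6)

print(full, prod, correlated)
```

With zero covariance the two expressions agree to rounding error; a non-zero $\sigma_{12}$ changes the likelihood for the same residuals.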
8.8 MaxEnt image reconstruction

It is convenient to think of a probability distribution as a special case of a PAD (positive, additive distribution). Another example of a PAD is the intensity or power, $f(x, y)$, of incoherent light as a function of position $(x, y)$, in an optical image. This is positive and additive because the integral $\iint f(x, y)\,dx\,dy$ represents the signal energy recorded by the image. By contrast, the amplitude of incoherent light, though positive, is not additive. A probability distribution is a PAD which is normalized so

$$\int_{-\infty}^{+\infty} p(Y)\,dY = 1.$$

Question: What form of entropy expression should we maximize in image reconstruction to best represent the PAD $f(x, y)$ values?

Answer: Derived by Skilling (1989).

$$S(f, m) = \int dx \int dy \left[f(x, y) - m(x, y) - f(x, y) \ln\frac{f(x, y)}{m(x, y)}\right],$$

where $m(x, y)$ is the prior estimate of $f(x, y)$. If $f(x, y)$ and $m(x, y)$ are normalized, this reduces to the simpler form:

$$-\iint f(x, y) \ln\frac{f(x, y)}{m(x, y)}\,dx\,dy.$$
8.8.1 The kangaroo justification

The following simple argument (Gull and Skilling, 1984) gives additional insight into the use of entropy for a PAD. Imagine that we are given the following information:

a) One third of kangaroos have blue eyes.
b) One third of kangaroos are left-handed.

How can we estimate the proportion of kangaroos that are both blue-eyed (BE) and left-handed (LH) using only the above information? The joint proportions of LH and BE can be represented by a $2 \times 2$ probability table which is shown in Table 8.2(a). The probabilities $p_1$, $p_2$, $p_3$ and $p_4$ must satisfy the given constraints:

a) $p_1 + p_2 = 1/3$ (1/3 of kangaroos have blue eyes)
b) $p_1 + p_3 = 1/3$ (1/3 of kangaroos are left-handed)
c) $p_1 + p_2 + p_3 + p_4 = 1$.

Feasible solutions have one remaining degree of freedom which we parameterize by the variable $z$. Table 8.2(b) shows the parameterized joint probability table. The parameter $z$ is constrained by the above three constraints to $0 \le z \le 1/3$. Below we consider three feasible solutions:

1. The first corresponds to the independent case

$$p(BE, LH) = p(BE)\,p(LH) = \frac{1}{3} \times \frac{1}{3} = \frac{1}{9}, \qquad z = \frac{1}{9},$$

which leads to the contingency table shown in Figure 8.3(a).

2. Case of maximum positive correlation.

$$p(BE, LH) = p(BE)\,p(LH|BE) = \frac{1}{3}(1) = \frac{1}{3}, \qquad z = \frac{1}{3}.$$

3. Case of maximum negative correlation.

$$p(BE, LH) = p(BE)\,p(LH|BE) = \frac{1}{3}(0) = 0, \qquad z = 0.$$

Table 8.2 Panel (a) is the joint probability table for the kangaroo problem. In panel (b), the table is parameterized in terms of the remaining one degree of freedom represented by the variable $z$.

(a)
                        Left-handed
                        True        False
Blue eyes   True        p1          p2
            False       p3          p4

(b)
                        Left-handed
                        True        False
Blue eyes   True        z           1/3 − z
            False       1/3 − z     1/3 + z

with 0 ≤ z ≤ 1/3.
(a) Independent
                        Left-handed
                        True        False
Blue eyes   True        1/9         2/9
            False       2/9         4/9

(b) Positive correlation
                        Left-handed
                        True        False
Blue eyes   True        1/3         0
            False       0           2/3

(c) Negative correlation
                        Left-handed
                        True        False
Blue eyes   True        0           1/3
            False       1/3         1/3

Figure 8.3 The three panels give the joint probabilities for (a) the independent case, (b) maximum positive correlation, and (c) maximum negative correlation.
Suppose we must choose one answer: which is the best? The answer we select cannot be thought of as being any more likely than any other choice, because there may be some degree of genetic correlation between eye color and handedness. However, it is nonsensical to select either positive or negative correlations without having any relevant prior information. Therefore, based on the available information, the independent choice $p_1 = p(BE, LH) = 1/9$ is preferred.

Question: Is there some function of the $p_i$ which, when maximized subject to the known constraints, yields the same preferred solution? If so, then it would be a good candidate for a general variational principle which could be used in situations that were too complicated for our common sense. Skilling (1988) showed that the only functions with the desired property, $p(BE, LH) = 1/9$, are those related monotonically to the entropy:

$$S = -\sum_{i=1}^{4} p_i \ln p_i = -z \ln z - 2\left(\frac{1}{3} - z\right) \ln\left(\frac{1}{3} - z\right) - \left(\frac{1}{3} + z\right) \ln\left(\frac{1}{3} + z\right).$$

Three proposed alternatives are listed in Table 8.3. Only one of the four ($-\sum p_i \ln p_i$) gives the preferred uncorrelated result.
Table 8.3 Solutions to the kangaroo problem obtained by maximizing four different functions, subject to the constraints.

Variation function        Optimal z           Implied correlation
$-\sum p_i \ln p_i$       1/9 = 0.1111        uncorrelated
$-\sum p_i^2$             1/12 = 0.0833       negative
$\sum \ln p_i$            0.1301              positive
$\sum p_i^{1/2}$          0.1218              positive
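The optimal values of $z$ can be reproduced with a short one-dimensional search. The Python sketch below maximizes each of the four candidate functions over the parameterized table of Table 8.2(b) using a golden-section search; the search routine, its tolerances, and the dictionary labels are choices made here for illustration. Each candidate is concave in $z$ on $(0, 1/3)$, so the search finds its single maximum. Setting the derivative of $\sum \ln p_i$ to zero gives $4z^2 + z/3 - 1/9 = 0$, i.e. $z = (\sqrt{17} - 1)/24 \approx 0.1301$.

```python
import math

def table(z):
    # Joint probabilities of Table 8.2(b): p1, p2, p3, p4
    return [z, 1.0 / 3.0 - z, 1.0 / 3.0 - z, 1.0 / 3.0 + z]

candidates = {
    "-sum p ln p": lambda p: -sum(x * math.log(x) for x in p),
    "-sum p^2": lambda p: -sum(x * x for x in p),
    "sum ln p": lambda p: sum(math.log(x) for x in p),
    "sum sqrt p": lambda p: sum(math.sqrt(x) for x in p),
}

def argmax_z(func, lo=1e-9, hi=1.0 / 3.0 - 1e-9, iters=200):
    # Golden-section search for the maximum of a concave function of z
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c = b - g * (b - a)
        d = a + g * (b - a)
        if func(table(c)) < func(table(d)):
            a = c      # maximum lies to the right of c
        else:
            b = d      # maximum lies to the left of d
    return 0.5 * (a + b)

best = {name: argmax_z(fn) for name, fn in candidates.items()}
for name, z in best.items():
    kind = "uncorrelated" if abs(z - 1.0 / 9.0) < 1e-4 else ("positive" if z > 1.0 / 9.0 else "negative")
    print(f"{name:12s} z = {z:.4f}  ({kind})")
```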
But what have kangaroos got to do with image restoration/reconstruction? Consider the following restatement of the problem.

a) One third of the flux comes from the top half of the image.
b) One third of the flux comes from the left half of the image.

What proportion of the flux comes from the top left quarter? All the advertised functionals except ($-\sum p_i \ln p_i$) imply either a positive or negative correlation in the distribution of the flux in the four quadrants based on the given information. Thus, these functionals fail to be consistent with our prior information on even the simplest non-trivial image problem. Inconsistencies are not expected to disappear just because practical data are more complicated.
8.8.2 MaxEnt for uncertain constraints

Example: In image reconstruction, we want the most probable image when the data are incomplete and noisy. In this example,

$B \equiv$ "proposition representing prior information"
$I_i \equiv$ "proposition representing a particular image."

Apply Bayes' theorem:

$$p(I_i|D, B) \propto p(I_i|B)\,p(D|I_i, B). \tag{8.63}$$

Suppose the image consists of $M$ pixels $(j = 1 \to M)$. Let

$d_j$ = measured value for pixel $j$
$I_{ij}$ = predicted value for pixel $j$ based on image hypothesis $I_i$
$e_j = d_j - I_{ij}$ = error due to noise, which is assumed to be IID Gaussian.

In this situation, the measured $d_j$ values are the constraints, which are uncertain because of noise. Thus,

$$p(d_j|I_{ij}, B) = p(e_j|I_{ij}, B) \propto \exp\left[-\frac{e_j^2}{2\sigma_j^2}\right] = \exp\left[-\frac{(d_j - I_{ij})^2}{2\sigma_j^2}\right] \tag{8.64}$$

and

$$p(D|I_i, B) \propto \prod_{j=1}^{M} \exp\left[-\frac{(d_j - I_{ij})^2}{2\sigma_j^2}\right] = \exp\left[-\frac{1}{2}\sum_{j=1}^{M}\left(\frac{d_j - I_{ij}}{\sigma_j}\right)^2\right] = \exp\left[-\frac{\chi^2}{2}\right]. \tag{8.65}$$
Determination of $p(I_i|B)$: Suppose we made trial images, $I_i$, by taking $N$ quanta and randomly throwing them into the $M$ image pixels. Then $p(I_i|B)$ is given by a multinomial distribution,

$$p(I_i|B) = \frac{N!}{n_1! \ldots n_M!}\,\frac{1}{M^N} = \frac{W}{M^N}, \tag{8.66}$$

where $W$ is the multiplicity. Recall for large $N$,

$$\ln W \to -N\sum_j p_j \ln p_j = N \times \text{entropy} = NS,$$

where as $N \to \infty$, $p_j \to n_j/N = $ constant. Therefore,

$$p(I_i|B) = \frac{1}{M^N} \exp(NS). \tag{8.67}$$

In general, we don't know the number of discrete quanta in the image, so we write

$$p(I_i|B) = \exp(\alpha S). \tag{8.68}$$
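The large-$N$ limit $\ln W \to NS$ invoked above can be seen numerically. The Python sketch below evaluates $\ln W$ exactly through the log-gamma function for a fixed set of pixel proportions and compares it with $NS$ as the number of quanta grows; the three proportions and the values of $N$ are illustrative assumptions.

```python
import math

def log_multiplicity(counts):
    # ln W = ln N! - sum_j ln n_j!, evaluated with the log-gamma function
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)

def entropy(counts):
    # S = -sum_j p_j ln p_j with p_j = n_j / N
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

fractions = [0.5, 0.3, 0.2]        # hypothetical proportions p_j for M = 3 pixels
ratios = []
for n_quanta in (100, 10_000, 1_000_000):
    counts = [round(f * n_quanta) for f in fractions]
    ratio = log_multiplicity(counts) / (sum(counts) * entropy(counts))
    ratios.append(ratio)
    print(n_quanta, ratio)
```

The ratio $\ln W/(NS)$ approaches 1 from below, with the shortfall coming from the sub-leading terms in Stirling's approximation.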
Substituting Equations (8.68) and (8.65) into Equation (8.63), we obtain

$$p(I_i|D, B) \propto \exp\left[\alpha S - \frac{\chi^2}{2}\right]. \tag{8.69}$$

We want to maximize $p(I_i|D, B)$ or $(\alpha S - \chi^2/2)$. In "classic" MaxEnt, the parameter $\alpha$ is set so the misfit statistic $\chi^2$ is equal to the number of data points $N$. This in effect overestimates $\chi^2$, since some effective number of parameters are being "fitted" in doing the image reconstruction. The full Bayesian approach treats $\alpha$ as a parameter of the hypothesis space which can be estimated by marginalizing over the image hypothesis space. Improved images can also be obtained by introducing prior information about the correlations between image pixels, enforcing smoothness. More details on MaxEnt image reconstruction can be found in Buck (1991), Gull and Skilling (1984), Skilling (1989), Gull (1989a), and Sivia (1996).

Two examples that illustrate some of the capabilities of MaxEnt image reconstruction are shown in Figures 8.4 and 8.5. Figure 8.4 illustrates how the maximum entropy method is capable of increasing the contrast of an image, and can also increase its sharpness if the measurements are sufficiently accurate. Figure 8.5 illustrates how the maximum entropy method automatically allows for missing data (Skilling and Gull, 1985). These and other examples, along with information on commercial software products, are available from Maximum Entropy Data Consultants, Ltd. (http://www.maxent.co.uk/).

[Figure 8.4: image panels (a) to (e).]

Figure 8.4 The original high-resolution low-noise image is shown in panel (a). Panel (b) shows the blurred original with high added noise. The MaxEnt reconstruction of the blurred noisy image is shown in (c). This demonstrates how the maximum entropy method suppresses noise, yielding a higher contrast image. Panel (d) shows the blurred original with low added noise. The MaxEnt reconstructed image, shown in (e), demonstrates how well maximum entropy de-blurs if the data are accurate enough. (Courtesy S. F. Gull, Maximum Entropy Data Consultants.)
[Figure 8.5: image panels (a) to (f).]

Figure 8.5 This figure demonstrates how the maximum entropy method automatically allows for missing data. Panel (a) shows the original image when 50% of the pixels, selected at random, have been removed. Panel (b) shows the corresponding MaxEnt reconstructed image. Panel (c) shows the original image when 95% of the pixels have been removed. Panel (d) shows the corresponding MaxEnt reconstructed image. Panel (e) shows the original image when 99% of the pixels have been removed. Panel (f) shows the corresponding MaxEnt reconstructed image. (Courtesy S. F. Gull, Maximum Entropy Data Consultants.)

8.9 Pixon multiresolution image reconstruction

Piña and Puetter (1993) and Puetter (1995) describe another very promising Bayesian approach to image reconstruction, which they refer to as the Pixon™ method. Instead of representing the image with pixels of a constant size, they introduce an image model where the size of the pixel varies locally according to the structure in the image. Their generalized pixels are called pixons. A map of the pixon sizes is called an image model. The Pixon method seeks to find the best joint image and image model that is consistent with the data based on a $\chi^2$ goodness-of-fit criterion, and that can represent the structure in the image by the smallest number of pixons. For example, suppose we have a 1024 by 1024 image of the sky containing a galaxy which occupies the inner 100 by 100 pixels. In principle, we need many numbers to encode the significant structure in the galaxy region, but only one number to encode information in the featureless remainder of the image. Because the Pixon method constructs a model that represents the significant structure by the smallest number of parameters (pixons), it has the smallest Occam penalty.

Figure 8.6 shows image reconstructions of a mock data set. The original image is shown on the far left along with a surface plot (center row). The original image is convolved (blurred) with the point-spread-function (PSF) shown at the bottom of the first column. Then noise (see bottom of second column) is added to the smoothed (PSF-convolved) data to produce the input (surface plot in middle panel) to the image reconstruction algorithm. To the right are a Pixon method reconstruction and a maximum entropy reconstruction. The algorithms used are the MEMSYS 5 algorithms, a powerful set of commercial maximum entropy (ME) algorithms available from Maximum Entropy Data Consultants, Ltd. The ME reconstructions were performed by Nick Weir, a recognized ME and MEMSYS expert. The reconstructions were supplemented by Nick Weir's multi-correlation channel approach. The Pixon method reconstructions use the Fractal-Pixon Basis (FPB) approach (Piña and Puetter, 1993; Puetter, 1995). The "Fractal" nomenclature has since been dropped, so the term FPB simply refers to the "standard" Pixon method. It can be seen that the FPB reconstruction has no signal correlated residuals and is effectively artifact (false-source) free, whereas these problems are obvious in the MEMSYS reconstruction. The absence of signal correlated residuals and artifacts can be understood from the underlying theory of the Pixon method (Puetter, 1995).

[Figure 8.6: panels (a) to (l) in four columns labeled True, Input, FPB, and MEMSYS; the bottom row shows the PSF (i), the noise (j), and the residuals of the two reconstructions (k), (l).]

Figure 8.6 Reconstruction of a mock data set. The original image is shown on the far left (a) along with a surface plot (e). This image is convolved (blurred) with the point-spread-function (PSF) (i) shown at the bottom of the first column. Then noise (j) is added to the smoothed (PSF-convolved) data to produce the input (b), (f) to the image reconstruction algorithm. To the right are a Pixon method reconstruction (FPB) and a maximum entropy reconstruction (MEMSYS). (Courtesy Pixon LLC.)

Figure 8.7 shows the Pixon method applied to X-ray mammography, taken from the Pixon™ homepage located at http://www.pixon.com, or alternatively, http://casswww.ucsd.edu/personal/puetter/pixonpage.html. The raw X-ray image appears to the left. In this example, a breast phantom is used (material with X-ray absorption properties similar to the human breast). A small fiber (400 micrometer diameter) is present in the phantom. The signature of the fiber is rather faint in the direct X-ray image. The Pixon method reconstruction is seen to the right. Here, the signature of the fiber is obvious. Such image enhancement is of clear benefit to the discovery of weak X-ray signatures. As can be seen, the X-ray signature of the fiber is very close to the noise level. This is evidenced by the break-up of the continuous fiber into pieces in the Pixon image. The Pixon method recognized that in certain locations, the X-ray signal present is not statistically significant. In these locations, the fiber vanished in the reconstructed image.

Figure 8.7 An example of the Pixon method applied to X-ray mammography. The raw X-ray image appears to the left. In this example, a breast phantom is used (material with X-ray absorption properties similar to the human breast). In this case, a small fiber (400 micrometer diameter) is present in the phantom. The Pixon method reconstruction is seen to the right. (Courtesy Pixon LLC.)

8.10 Problems

1. Use the maximum entropy method to compute and plot the probability of each side of a six-sided loaded die given that exhaustive tests have determined that the expectation value of the number of dots on the uppermost face $= 4.6$.

2. Use the maximum entropy method to compute and plot the probability of each side of a six-sided loaded die given that exhaustive tests have determined that the expectation value of the number of dots on the uppermost face $= \mu$, for $\mu = 1.1$ to 5.9 in steps of 0.1. Plot the probability of each side versus $\mu$. Plot the probability of all six sides on one plot versus $\mu$.

3. Evaluate a unique probability distribution for $p(Y|I)$ (the question posed in Section 8.1) using the MaxEnt principle together with the constraint: "the mean value of $\cos y = 0.6$." Our prior information also tells us that $m(Y|I)$, the prior estimate of $p(Y|I)$, is a constant in the range 0 to $2\pi$. In working out the solution, you will encounter the modified Bessel functions of the first kind, designated by BesselI[n, z] in Mathematica. You may also find the command FindRoot[] useful.
9 Bayesian inference with Gaussian errors
9.1 Overview In the next three chapters, we will be primarily concerned with estimating model parameters when our state of knowledge leads us to assign a Gaussian sampling distribution when calculating the likelihood function. In this chapter, we start with a simple problem of computing the posterior probability of the mean of a data set. Initially, we assume the variance of the sampling distribution is known and then consider the case where the variance is unknown. We next look at the question of how to determine whether the signal present in the data is constant or variable. In the final section, we consider a Bayesian treatment of a fundamental problem that occurs in experimental science – that of analyzing two independent measurements of the same physical quantity, one ‘‘control’’ and one ‘‘trial.’’
9.2 Bayesian estimate of a mean Here we suppose that we have collected a set of N data values fd1 ; . . . ; dN g and we are assuming the following model is true: di ¼ þ e i ; where ei represents the noise component of the ith data value. For this one data set, and any prior information, we want to obtain the Bayesian estimate of . We will investigate three interesting cases. In all three cases, our prior information about ei leads us to adopt an independent Gaussian sampling distribution.1 In Section 9.2.1, we analyze the case where the noise is the same for all ei . In Section 9.2.2, we treat the more general situation where the i are unequal. Section 9.2.3 considers the case where the i are assumed equal but the value is unknown.
1
Note: if we had prior evidence of dependence, i.e., correlation, it is a simple computational detail to take this into account as shown in Section 10.2.2.
9.2.1 Mean: known noise $\sigma$
In this situation, we will assume that the variance $\sigma^2$ of the noise is already known. We might, for example, know this from earlier measurements with the same apparatus in similar conditions. We also assume the prior information gives us lower and upper limits on $\mu$ but no preference for $\mu$ in that range. The problem is to solve for $p(\mu|D,I)$. The first step is to write down Bayes' theorem:
$$p(\mu|D,I) = \frac{p(\mu|I)\,p(D|\mu,I)}{p(D|I)}, \tag{9.1}$$
where the likelihood $p(D|\mu,I)$ is sometimes written as $L(\mu)$. Our assumed prior for $\mu$ is given by
$$p(\mu|I) = \begin{cases} K\ (\text{constant}), & \mu_L \le \mu \le \mu_H \\ 0, & \text{otherwise.} \end{cases}$$
Evaluate K from
$$\int_{\mu_L}^{\mu_H} p(\mu|I)\,d\mu = \int_{\mu_L}^{\mu_H} K\,d\mu = 1.$$
Therefore,
$$K = \frac{1}{\mu_H - \mu_L} = \frac{1}{R_\mu},$$
where $R_\mu \equiv$ range of $\mu$. This gives the normalized prior,
$$p(\mu|I) = \frac{1}{R_\mu}. \tag{9.2}$$
The likelihood is given by
$$p(D|\mu,I) = \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(d_i-\mu)^2}{2\sigma^2}\right\} = \sigma^{-N}(2\pi)^{-N/2} \exp\left\{-\frac{\sum_{i=1}^{N}(d_i-\mu)^2}{2\sigma^2}\right\} = \sigma^{-N}(2\pi)^{-N/2} \exp\left\{-\frac{Q}{2\sigma^2}\right\}, \tag{9.3}$$
where we have abbreviated $\sum_{i=1}^{N}(d_i-\mu)^2$ by Q.
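Expanding Q, we obtain
$$\begin{aligned}
Q = \sum_{i=1}^{N}(d_i-\mu)^2 &= \sum d_i^2 + \sum \mu^2 - 2\mu\sum d_i \\
&= \sum d_i^2 + N\mu^2 - 2N\mu\bar{d} \qquad \left\{\bar{d} = \frac{1}{N}\sum d_i\right\} \\
&= N(\mu^2 - 2\mu\bar{d} + \bar{d}^2) + \sum d_i^2 - N\bar{d}^2 \\
&= N(\mu-\bar{d})^2 + \sum d_i^2 - N\bar{d}^2 \\
&= N(\mu-\bar{d})^2 + \sum d_i^2 - 2N\bar{d}^2 + N\bar{d}^2 \\
&= N(\mu-\bar{d})^2 + \sum d_i^2 - 2\bar{d}\sum d_i + \sum \bar{d}^2 \\
&= N(\mu-\bar{d})^2 + \sum (d_i-\bar{d})^2 \\
&= N(\mu-\bar{d})^2 + Nr^2,
\end{aligned} \tag{9.4}$$
where $r^2 = \frac{1}{N}\sum (d_i-\bar{d})^2$ is the mean square deviation from $\bar{d}$. Now substitute Equation (9.4) into Equation (9.3):
$$p(D|\mu,I) = \sigma^{-N}(2\pi)^{-N/2} \exp\left\{-\frac{Nr^2}{2\sigma^2}\right\} \exp\left\{-\frac{N(\mu-\bar{d})^2}{2\sigma^2}\right\}. \tag{9.5}$$
We can express $p(D|I)$ as
$$p(D|I) = \int_{\mu_L}^{\mu_H} d\mu\, p(\mu|I)\,p(D|\mu,I). \tag{9.6}$$
Substitution of Equations (9.2), (9.5) and (9.6) into Equation (9.1) yields the desired posterior:
$$p(\mu|D,I) = \frac{\frac{1}{R_\mu}\sigma^{-N}(2\pi)^{-N/2} \exp\left\{-\frac{Nr^2}{2\sigma^2}\right\} \exp\left\{-\frac{N(\mu-\bar{d})^2}{2\sigma^2}\right\}}{\frac{1}{R_\mu}\sigma^{-N}(2\pi)^{-N/2} \exp\left\{-\frac{Nr^2}{2\sigma^2}\right\} \int_{\mu_L}^{\mu_H} d\mu \exp\left\{-\frac{N(\mu-\bar{d})^2}{2\sigma^2}\right\}}. \tag{9.7}$$
Equation (9.7) simplifies to
$$p(\mu|D,I) = \frac{\exp\left\{-\frac{(\mu-\bar{d})^2}{2\sigma^2/N}\right\}}{\int_{\mu_L}^{\mu_H} \exp\left\{-\frac{(\mu-\bar{d})^2}{2\sigma^2/N}\right\} d\mu} = \frac{\text{NUM}}{\text{DEN}}. \tag{9.8}$$
Therefore,
$$p(\mu|D,I) = \frac{1}{\text{DEN}} \exp\left\{-\frac{(\mu-\bar{d})^2}{2\sigma^2/N}\right\}. \tag{9.9}$$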
[Figure 9.1 The posterior probability density function $p(\mu|D,I)$, plotted against $\mu$; the marked width of the curve is $\sigma/\sqrt{N}$.]
Since the denominator (DEN) evaluates to a constant (see Equation (9.11)), the posterior, within the range $\mu_L$ to $\mu_H$, is simply a Gaussian with variance equal to $\sigma^2/N$. Thus, the uncertainty in the mean is inversely proportional to the square root of the sample size, which is the basis of signal averaging as discussed in Section 5.11.1. Figure 9.1 shows the resulting posterior probability density function $p(\mu|D,I)$ in the limit of $\mu_H = +\infty$ and $\mu_L = -\infty$. It is interesting to compare this Bayesian result to the frequentist confidence intervals for the mean when sampling from a normal distribution as discussed in Section 6.6. In the frequentist approach, we were not able to make any probability statement in connection with a single confidence interval derived from one data set $\{d_i\}$. For example, the interpretation of the 68% confidence interval was: if we repeatedly draw samples of the same size from a population, and each time compute specific values for the 68% confidence interval, $\bar{d} \pm \sigma/\sqrt{N}$, then we expect 68% of these confidence intervals to contain the unknown mean $\mu$. In the frequentist case, the problem was to find the mean of a hypothetical population of possible measurements for which our sample was but one realization. In the Bayesian case, we are making a probability statement about the value of a model parameter. From our posterior probability density function, we can always compute the probability that the model parameter $\mu$ lies within $\pm\sigma/\sqrt{N}$ of the sample mean $\bar{d}$. It turns out that when the prior bounds for $\mu$ are so wide that they are far outside the range indicated by the data, the value of this probability is 68%. In this particular instance, the boundaries of the Bayesian 68% credible region are the same as the frequentist 68% confidence interval.
However, if we decrease the range of the prior bounds for $\mu$, the probability contained within the frequentist 68% confidence boundary increases and reaches 100% when the prior boundaries coincide with $\bar{d} \pm \sigma/\sqrt{N}$. The differences in conclusions drawn between a Bayesian and a frequentist analysis of the same data are a consequence of the different definitions of probability used in the two approaches. Recall that in the frequentist case, the argument of a probability must be a random variable. Because a parameter is not a random variable, the frequentist approach does not permit the probability density of a parameter to be calculated directly or allow for the inclusion of a prior probability for the parameter.
The interpretation of any frequentist statistic, such as the sample mean, is always in relation to a hypothetical population of possible samples that could have been obtained under similar circumstances.
Detail: Calculation of DEN in Equation (9.8)
To evaluate DEN, compare with the error function $\mathrm{erf}(x)$ (the Mathematica command is Erf[x]):
$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-u^2)\,du; \qquad \text{note: } \mathrm{erf}(-x) = -\mathrm{erf}(x). \tag{9.10}$$
Let
$$\frac{N(\mu-\bar{d})^2}{2\sigma^2} = u^2 \;\Rightarrow\; u = (\mu-\bar{d})\left(\frac{N}{2\sigma^2}\right)^{1/2};$$
therefore,
$$du = \left(\frac{N}{2\sigma^2}\right)^{1/2} d\mu \qquad \text{or} \qquad d\mu = \left(\frac{2\sigma^2}{N}\right)^{1/2} du,$$
and
$$\text{DEN} = \left(\frac{2\sigma^2}{N}\right)^{1/2} \frac{\sqrt{\pi}}{2}\, \frac{2}{\sqrt{\pi}} \int_{u_L}^{u_H} \exp(-u^2)\,du.$$
We can rewrite the integral limits in DEN as follows:
$$\int_{u_L}^{u_H} = \int_0^{u_H} - \int_0^{u_L};$$
therefore,
$$\text{DEN} = \left(\frac{2\sigma^2}{N}\right)^{1/2} \frac{\sqrt{\pi}}{2}\, \underbrace{[\mathrm{erf}(u_H) - \mathrm{erf}(u_L)]}_{(\approx 2 \text{ if } u_H \gg 1 \text{ and } u_L \ll -1)}, \tag{9.11}$$
where
$$u_H = \left(\frac{N}{2\sigma^2}\right)^{1/2}(\mu_H - \bar{d}), \qquad u_L = \left(\frac{N}{2\sigma^2}\right)^{1/2}(\mu_L - \bar{d}).$$
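The role of the prior bounds in Equations (9.9) and (9.11) can be checked numerically. The following is a minimal Python sketch, not taken from the book's Mathematica notebooks; the function name and the numerical values are illustrative assumptions. It computes the posterior probability that $\mu$ lies within $\pm\sigma/\sqrt{N}$ of $\bar{d}$ for a flat prior on a given range.

```python
import math

def prob_within_one_sigma(d_bar, sigma, n, mu_lo, mu_hi):
    """Posterior probability that mu lies within sigma/sqrt(n) of d_bar.

    Assumes a flat prior on [mu_lo, mu_hi] and a known noise sigma,
    i.e. the truncated Gaussian posterior of Equation (9.9).
    Hypothetical helper, written for illustration only.
    """
    scale = math.sqrt(n / (2.0 * sigma ** 2))  # maps (mu - d_bar) to u

    def erf_span(a, b):
        # Proportional to the integral of the Gaussian from a to b,
        # as in the bracket of Equation (9.11).
        return math.erf(scale * (b - d_bar)) - math.erf(scale * (a - d_bar))

    half_width = sigma / math.sqrt(n)
    lo = max(mu_lo, d_bar - half_width)
    hi = min(mu_hi, d_bar + half_width)
    if hi <= lo:
        return 0.0
    return erf_span(lo, hi) / erf_span(mu_lo, mu_hi)

# Illustrative (made-up) numbers: d_bar = 10, sigma = 2, N = 16.
wide = prob_within_one_sigma(10.0, 2.0, 16, -100.0, 100.0)
tight = prob_within_one_sigma(10.0, 2.0, 16, 9.5, 10.5)
print(wide, tight)
```

With very wide prior bounds the result is close to 0.68, and it rises to 1 when the bounds coincide with $\bar{d} \pm \sigma/\sqrt{N}$, in line with the discussion above.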
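9.2.2 Mean: known noise, unequal $\sigma$
In this situation, we again assume that the noise variance is known but that it can differ for each data point. The likelihood is given by
$$\begin{aligned}
p(D|\mu,I) &= \prod_{i=1}^{N} \frac{1}{\sigma_i\sqrt{2\pi}} \exp\left\{-\frac{(d_i-\mu)^2}{2\sigma_i^2}\right\} \\
&= \left[\prod_{i=1}^{N}\sigma_i^{-1}\right](2\pi)^{-N/2} \exp\left\{-\sum_{i=1}^{N}\frac{(d_i-\mu)^2}{2\sigma_i^2}\right\} \\
&= \left[\prod_{i=1}^{N}\sigma_i^{-1}\right](2\pi)^{-N/2} \exp\left\{-\sum_{i=1}^{N}\frac{w_i(d_i-\mu)^2}{2}\right\} \\
&= \left[\prod_{i=1}^{N}\sigma_i^{-1}\right](2\pi)^{-N/2} \exp\left\{-\frac{Q}{2}\right\},
\end{aligned} \tag{9.12}$$
where $w_i = 1/\sigma_i^2$ is called the weight of data value $d_i$. In this case, Q is given by
$$\begin{aligned}
Q = \sum_{i=1}^{N} w_i(d_i-\mu)^2 &= \sum w_i d_i^2 + \mu^2\sum w_i - 2\mu\sum w_i d_i \\
&= \sum w_i \left\{\mu^2 - 2\mu\frac{\sum w_i d_i}{\sum w_i} + \frac{\sum w_i d_i^2}{\sum w_i}\right\} \\
&= \sum w_i \left\{\left(\mu^2 - 2\mu\frac{\sum w_i d_i}{\sum w_i} + \frac{(\sum w_i d_i)^2}{(\sum w_i)^2}\right) - \frac{(\sum w_i d_i)^2}{(\sum w_i)^2} + \frac{\sum w_i d_i^2}{\sum w_i}\right\} \\
&= \sum w_i \left\{\left(\mu - \frac{\sum w_i d_i}{\sum w_i}\right)^2 - \frac{(\sum w_i d_i)^2}{(\sum w_i)^2} + \frac{\sum w_i d_i^2}{\sum w_i}\right\} \\
&= \sum w_i \left(\mu - \frac{\sum w_i d_i}{\sum w_i}\right)^2 - \frac{(\sum w_i d_i)^2}{\sum w_i} + \sum w_i d_i^2.
\end{aligned} \tag{9.13}$$
Only the first term, which contains the unknown mean $\mu$, will appear in our final equation for the posterior probability of the mean, as we see below. Although the second and third terms do not appear in the final result, they can be shown to equal the weighted mean square residual ($r_w^2$) times the sum of the weights ($\sum w_i$). The weighted mean square residual, $r_w^2$, is given by
$$r_w^2 = \frac{1}{\sum w_i}\left\{\sum w_i d_i^2 - \frac{(\sum w_i d_i)^2}{\sum w_i}\right\}, \tag{9.14}$$
where $\sum w_i d_i / (\sum w_i) = \bar{d}_w$ is the weighted mean of the data values. Thus
$$r_w^2 = \frac{1}{\sum w_i}\left\{\sum w_i (d_i - \bar{d}_w)^2\right\}, \tag{9.15}$$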
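and
$$Q = \frac{(\mu-\bar{d}_w)^2}{1/\sum w_i} + r_w^2 \sum w_i. \tag{9.16}$$
Now substitute Equation (9.16) into (9.12):
$$\begin{aligned}
p(D|\mu,I) &= \left[\prod_{i=1}^{N}\sigma_i^{-1}\right](2\pi)^{-N/2} \exp\left(-\frac{r_w^2\sum w_i}{2}\right) \exp\left(-\frac{(\mu-\bar{d}_w)^2}{2/\sum w_i}\right) \\
&= \left[\prod_{i=1}^{N}\sigma_i^{-1}\right](2\pi)^{-N/2} \exp\left(-\frac{r_w^2\sum w_i}{2}\right) \exp\left(-\frac{(\mu-\bar{d}_w)^2}{2\sigma_w^2}\right),
\end{aligned} \tag{9.17}$$
where $\sigma_w^2 = 1/(\sum w_i)$. Substitution of Equations (9.2), (9.17) and (9.6) into Equation (9.1) yields the desired posterior:
$$p(\mu|D,I) = \frac{\frac{1}{R_\mu}\left[\prod_i \sigma_i^{-1}\right](2\pi)^{-N/2} \exp\left(-\frac{r_w^2\sum w_i}{2}\right) \exp\left(-\frac{(\mu-\bar{d}_w)^2}{2\sigma_w^2}\right)}{\frac{1}{R_\mu}\left[\prod_i \sigma_i^{-1}\right](2\pi)^{-N/2} \exp\left(-\frac{r_w^2\sum w_i}{2}\right) \int_{\mu_L}^{\mu_H} d\mu \exp\left(-\frac{(\mu-\bar{d}_w)^2}{2\sigma_w^2}\right)}. \tag{9.18}$$
Therefore,
$$p(\mu|D,I) = \frac{\exp\left\{-\frac{(\mu-\bar{d}_w)^2}{2\sigma_w^2}\right\}}{\int_{\mu_L}^{\mu_H} d\mu \exp\left\{-\frac{(\mu-\bar{d}_w)^2}{2\sigma_w^2}\right\}}. \tag{9.19}$$
Since the denominator evaluates to a constant, the posterior, within the range $\mu_L$ to $\mu_H$, is simply a Gaussian with variance $\sigma_w^2 = 1/(\sum w_i)$. The most probable value of $\mu$ is the weighted mean $\bar{d}_w = \sum w_i d_i / (\sum w_i)$.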
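As a hedged illustration of this result, the short Python sketch below computes the weighted mean $\bar{d}_w$ and the posterior standard deviation $\sigma_w$ from a list of data values and their known $\sigma_i$. The function name and the sample numbers are assumptions introduced here for illustration; the sketch also assumes the prior bounds are wide enough not to truncate the Gaussian.

```python
import math

def weighted_mean_posterior(data, sigmas):
    """Return (d_w, sigma_w) for the Gaussian posterior of Equation (9.19).

    data   : measured values d_i
    sigmas : known noise standard deviations sigma_i (same length as data)
    Hypothetical helper; prior bounds on mu are assumed to be very wide.
    """
    weights = [1.0 / s ** 2 for s in sigmas]        # w_i = 1 / sigma_i^2
    w_sum = sum(weights)
    d_w = sum(w * d for w, d in zip(weights, data)) / w_sum
    sigma_w = math.sqrt(1.0 / w_sum)                # sigma_w^2 = 1 / sum(w_i)
    return d_w, sigma_w

# Made-up example: the more precise measurement dominates the estimate.
print(weighted_mean_posterior([10.0, 20.0], [1.0, 2.0]))
# With equal sigmas the result reduces to the ordinary mean and sigma / sqrt(N).
print(weighted_mean_posterior([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0]))
```

When all the $\sigma_i$ are equal, the sketch reproduces the result of Section 9.2.1, which is a quick consistency check on the weights.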
9.2.3 Mean: unknown noise $\sigma$
In this section, we assume that the variance, $\sigma^2$, of the noise is unknown but is assumed to be the same for each measurement, $d_i$ (see footnote 2). As in the previous case, we proceed by writing down the assumed model:
$$d_i = \mu + e_i.$$
In general, $e_i$ consists of the random measurement errors plus any real signal in the data that cannot be explained by the model. For example, suppose that, unknown to us, the data contained a periodic signal superposed on the mean. In this connection, the periodic signal would act like an additional unknown noise term. It is often the case that nature is more complex than our current model. In the absence of a detailed knowledge of the effective noise distribution, we could appeal to the Central Limit Theorem and argue that if the effective noise stems from a large number of sub-processes then it is expected to have a Gaussian distribution. Alternatively, the MaxEnt principle tells us that a Gaussian distribution would be the most conservative choice (i.e., maximally non-committal about the information we don't have). For a justification of this argument, see Section 8.7.4. The only requirement is that the noise variance be finite. In what follows, we will assume the effective noise has a Gaussian distribution with unknown $\sigma$.
Footnote 2: See Section 12.9 on extrasolar planets for a discussion of the case where the variance is not constant.
Now we have two unknowns in our model, $\mu$ and $\sigma$. The joint posterior probability $p(\mu,\sigma|D,I)$ is given by Bayes' theorem:
$$p(\mu,\sigma|D,I) = \frac{p(\mu,\sigma|I)\,p(D|\mu,\sigma,I)}{p(D|I)}. \tag{9.20}$$
We are interested in $p(\mu|D,I)$ regardless of what the true value of $\sigma$ is. In this problem, $\sigma$ is a nuisance parameter so we marginalize over $\sigma$:
$$p(\mu|D,I) = \int p(\mu,\sigma|D,I)\,d\sigma. \tag{9.21}$$
From the product rule: $p(\mu,\sigma|I) = p(\mu|I)\,p(\sigma|\mu,I)$. Assuming the prior for $\sigma$ is independent of the prior for $\mu$, then
$$p(\mu,\sigma|I) = p(\mu|I)\,p(\sigma|I). \tag{9.22}$$
Combining Equations (9.20), (9.21), and (9.22),
$$p(\mu|D,I) = \frac{p(\mu|I)\int p(\sigma|I)\,p(D|\mu,\sigma,I)\,d\sigma}{p(D|I)}, \tag{9.23}$$
where
$$p(D|I) = \int p(\mu|I) \int p(\sigma|I)\,p(D|\mu,\sigma,I)\,d\sigma\,d\mu. \tag{9.24}$$
As before we assume a flat prior for $\mu$ in the range $\mu_L$ to $\mu_H$. Therefore
$$p(\mu|I) = \frac{1}{R_\mu}, \qquad (R_\mu = \mu_H - \mu_L). \tag{9.25}$$
$\sigma$ is a scale parameter, so it can only take on positive values $0 \to \infty$. Realistic limits do not go all the way to zero and infinity. For example, we always know that $\sigma$ cannot be less than a value determined by the digitizing accuracy with which we record the data; nor so great that the noise power would melt the apparatus. Let $\sigma_L$ and $\sigma_H$ be our prior limits on $\sigma$. We will assume a Jeffreys prior for the scale parameter $\sigma$:
$$p(\sigma|I) = \begin{cases} K/\sigma, & \sigma_L \le \sigma \le \sigma_H \\ 0, & \text{otherwise.} \end{cases}$$
The constant K is determined from the condition
$$\int_{\sigma_L}^{\sigma_H} p(\sigma|I)\,d\sigma = 1 \;\Rightarrow\; K = \frac{1}{\ln(\sigma_H/\sigma_L)}, \qquad p(\sigma|I) = \frac{1}{\sigma \ln(\sigma_H/\sigma_L)}. \tag{9.26}$$
Question: Why did we choose $\sigma$ instead of the variance $v = \sigma^2$ as our second parameter, or does it matter? To the extent that both $v$ and $\sigma$ are "equally natural" parameterizations of the width of a Gaussian, it is desirable that investigators using either parameter reach the same conclusions.
Answer: A feature of the Jeffreys prior is that it is invariant to such a reparameterization, as we now demonstrate. We start with the requirement that
$$p(v|I)\,dv = p(\sigma|I)\,d\sigma. \tag{9.27}$$
The Jeffreys prior for $\sigma$ can be written as
$$p(\sigma|I)\,d\sigma = \frac{K}{\sigma}\,d\sigma, \tag{9.28}$$
where K is a constant that depends on the prior upper and lower bounds on $\sigma$. Since $\sigma = v^{1/2}$, $d\sigma = (1/2)v^{-1/2}\,dv$. Upon substitution into Equation (9.28), we obtain
$$p(v|I)\,dv = \frac{K}{2v}\,dv = \frac{K'}{v}\,dv. \tag{9.29}$$
Thus, choosing a Jeffreys prior for $\sigma$ is equivalent to assuming a Jeffreys prior for $v$. It is easy to show that this would not be the case for a uniform prior. Another example of "equally natural" parameters is the choice of whether to use the frequency or period of an unknown periodic signal. Again, it is easy to show that the choice of a Jeffreys prior for frequency is equivalent to assuming a Jeffreys prior for the period.
Calculation of the likelihood function:
$$L(\mu,\sigma) = p(D|\mu,\sigma,I) = \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(d_i-\mu)^2/2\sigma^2} = \sigma^{-N}(2\pi)^{-N/2}\, e^{-Q/2\sigma^2}, \tag{9.30}$$
where Q depends on $\mu$ and is given by Equation (9.4):
$$Q = N(\mu-\bar{d})^2 + Nr^2. \tag{9.31}$$
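The reparameterization argument for the Jeffreys prior given earlier in this section can be checked with a few lines of arithmetic. The sketch below is an illustration only; the function names and the numerical bounds are assumptions, not values from the text. It compares the prior probability assigned to an interval of $\sigma$ with the probability assigned to the corresponding interval of $v = \sigma^2$, for both a Jeffreys prior and a uniform prior.

```python
import math

def jeffreys_mass(a, b, lo, hi):
    """Prior probability of a <= x <= b under a Jeffreys prior on [lo, hi]."""
    return math.log(b / a) / math.log(hi / lo)

def uniform_mass(a, b, lo, hi):
    """Prior probability of a <= x <= b under a uniform prior on [lo, hi]."""
    return (b - a) / (hi - lo)

# Illustrative bounds (assumed): sigma in [1, 100], interval of interest [2, 5].
s_lo, s_hi, a, b = 1.0, 100.0, 2.0, 5.0

# The same statement expressed in terms of the variance v = sigma^2.
jeff_sigma = jeffreys_mass(a, b, s_lo, s_hi)
jeff_var = jeffreys_mass(a ** 2, b ** 2, s_lo ** 2, s_hi ** 2)
unif_sigma = uniform_mass(a, b, s_lo, s_hi)
unif_var = uniform_mass(a ** 2, b ** 2, s_lo ** 2, s_hi ** 2)

print(jeff_sigma, jeff_var)   # equal: the Jeffreys prior is invariant
print(unif_sigma, unif_var)   # different: the uniform prior is not
```

The two Jeffreys values agree, while the two uniform values differ by more than a factor of ten for these bounds, which is the point made by Equation (9.29).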
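Substituting Equations (9.25), (9.26), and (9.30) into Equation (9.23), we obtain
$$p(\mu|D,I) = \frac{\frac{(2\pi)^{-N/2}}{R_\mu \ln(\sigma_H/\sigma_L)} \int_{\sigma_L}^{\sigma_H} \sigma^{-(N+1)}\, e^{-Q/2\sigma^2}\,d\sigma}{\frac{(2\pi)^{-N/2}}{R_\mu \ln(\sigma_H/\sigma_L)} \int_{\mu_L}^{\mu_H}\int_{\sigma_L}^{\sigma_H} \sigma^{-(N+1)}\, e^{-Q/2\sigma^2}\,d\sigma\,d\mu} = \frac{\int_{\sigma_L}^{\sigma_H} \sigma^{-(N+1)}\, e^{-Q/2\sigma^2}\,d\sigma}{\int_{\mu_L}^{\mu_H}\int_{\sigma_L}^{\sigma_H} \sigma^{-(N+1)}\, e^{-Q/2\sigma^2}\,d\sigma\,d\mu}. \tag{9.32}$$
Now we change variables. Let $\tau = Q/2\sigma^2$; therefore,
$$\sigma = \sqrt{\frac{Q}{2\tau}} \qquad \text{and} \qquad d\sigma = -\frac{1}{2}\,\tau^{-3/2}\sqrt{\frac{Q}{2}}\,d\tau,$$
$$\sigma^{-(N+1)} = \left(\frac{Q}{2\tau}\right)^{-\frac{N+1}{2}} = 2^{\frac{N+1}{2}}\,\tau^{\frac{N+1}{2}}\,Q^{-\frac{N+1}{2}},$$
$$\sigma^{-(N+1)}\,d\sigma = -2^{\frac{N}{2}-1}\,Q^{-\frac{N}{2}}\,\tau^{\frac{N}{2}-1}\,d\tau, \tag{9.33}$$
and therefore,
$$p(\mu|D,I) = \frac{Q^{-N/2}\int_{\tau_L}^{\tau_H} \tau^{\frac{N}{2}-1} e^{-\tau}\,d\tau}{\int_{\mu_L}^{\mu_H} d\mu\, Q^{-N/2} \int_{\tau_L}^{\tau_H} \tau^{\frac{N}{2}-1} e^{-\tau}\,d\tau} \approx \frac{Q^{-N/2}}{\int_{\mu_L}^{\mu_H} d\mu\, Q^{-N/2}}, \tag{9.34}$$
where $\tau_L = Q/(2\sigma_H^2)$ and $\tau_H = Q/(2\sigma_L^2)$. The integral with respect to $\tau$ in Equation (9.34) can be evaluated in terms of the incomplete gamma function (see Equation (C.15) of Appendix C). The integral clearly depends on $\mu$, but provided $\sigma_L \ll r$ and $\sigma_H \gg r$, where $r$ = the RMS residual of the most probable model fit, the integral is effectively constant. As an example, for $N = 10$, $\sigma_L = 0.5r$ and $\sigma_H = 5r$, the integral deviates by 1% for values of $|\mu - \bar{d}| \approx 2.3r$. However, at $|\mu - \bar{d}| = 2.3r$, the term $Q^{-N/2}$ in Equation (9.34) has reached a value of $10^{-4}$ of its value at $|\mu - \bar{d}| = 0$. For larger values of $|\mu - \bar{d}|$, the integral decreases monotonically. At $|\mu - \bar{d}| = 3r$ the integral is only down by 5%, but now $Q^{-N/2}$ is down by a factor of $10^5$. Use Equation (9.31) to substitute for Q:
$$p(\mu|D,I) \approx \frac{[Nr^2 + N(\mu-\bar{d})^2]^{-N/2}}{\int_{\mu_L}^{\mu_H} d\mu\, [Nr^2 + N(\mu-\bar{d})^2]^{-N/2}},$$
$$p(\mu|D,I) \approx \frac{\left[1 + \frac{(\mu-\bar{d})^2}{r^2}\right]^{-N/2}}{\int_{\mu_L}^{\mu_H} d\mu \left[1 + \frac{(\mu-\bar{d})^2}{r^2}\right]^{-N/2}}, \tag{9.35}$$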
where the quantity $Nr^2 = \sum (d_i - \bar{d})^2$, which is independent of $\mu$, has been factored out of the numerator and denominator and canceled. Now compare
$$\left[1 + \frac{(\mu-\bar{d})^2}{r^2}\right]^{-N/2} \tag{9.36}$$
with the Student's t distribution, which was discussed in Section 6.4:
$$f(t|\nu) = \frac{\Gamma[(\nu+1)/2]}{\sqrt{\pi\nu}\,\Gamma(\nu/2)} \left[1 + \frac{t^2}{\nu}\right]^{-\frac{(\nu+1)}{2}}. \tag{9.37}$$
If we set
$$\frac{t^2}{\nu} = \frac{(\mu-\bar{d})^2}{r^2}, \tag{9.38}$$
and the number of degrees of freedom $\nu = N - 1$, then Equation (9.36) has the same form as the Student's t distribution (see footnote 3). From this comparison, it is clear that the posterior probability for $\mu$ when $\sigma$ is unknown is a Student's t distribution. If $\mu_L = -\infty$ and $\mu_H = +\infty$, then
$$p(\mu|D,I) \approx \frac{\Gamma(\frac{N}{2})}{\sqrt{\pi}\,\Gamma(\frac{N-1}{2})}\,\frac{1}{r} \left[1 + \frac{(\mu-\bar{d})^2}{r^2}\right]^{-N/2}. \tag{9.39}$$
If the limits on $\mu$ do not extend to $\pm\infty$, then the constant outside the square brackets will be different but computable from a Student's t distribution and the known prior limits on $\mu$. In practice, if $\mu_L$ and $\mu_H$ are well outside some measure of the range of $\mu$ argued for by the likelihood function, then the result is the same as setting the prior limits of $\mu$ to $\pm\infty$.
We can easily generalize the results of this section to more complicated models than one that predicts the mean. Suppose the data were described by the following model:
$$d_i = m_i(\theta) + e_i,$$
where $\theta$ represents a set of model parameters with a prior $p(\theta|I)$. Then from Equation (9.34), we can write
$$p(\theta|D,I) \approx \frac{p(\theta|I)\,Q^{-N/2}}{\int d\theta\, p(\theta|I)\,Q^{-N/2}}, \tag{9.40}$$
where Q is given by
$$Q = \sum_{i=1}^{N} (d_i - m_i(\theta))^2. \tag{9.41}$$
Footnote 3: In Equation (9.36), $r^2 = \frac{1}{N}\sum (d_i - \bar{d})^2 = \frac{N-1}{N} S^2$, where $S^2$ is the frequentist sample variance as defined in Equation (6.15).
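The normalization constant in Equation (9.39) can be verified numerically. The following Python sketch is an illustration under stated assumptions: the data values are made up, the helper names are hypothetical, and the prior limits on $\mu$ are taken to be far outside the range supported by the data. It evaluates the Student's t posterior on a grid and integrates it with the trapezoidal rule.

```python
import math

def summarize(data):
    """Return N, the sample mean d_bar, and the RMS deviation r of Equation (9.4)."""
    n = len(data)
    d_bar = sum(data) / n
    r = math.sqrt(sum((d - d_bar) ** 2 for d in data) / n)
    return n, d_bar, r

def t_posterior(mu, n, d_bar, r):
    """Posterior density of Equation (9.39); assumes very wide prior limits on mu."""
    log_norm = (math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0)
                - 0.5 * math.log(math.pi) - math.log(r))
    return math.exp(log_norm) * (1.0 + (mu - d_bar) ** 2 / r ** 2) ** (-n / 2.0)

def trapezoid(f, a, b, steps):
    """Composite trapezoidal rule for the integral of f from a to b."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for k in range(1, steps):
        total += f(a + k * h)
    return total * h

data = [9.8, 10.4, 10.1, 9.6, 10.3, 9.9]   # made-up illustrative values
n, d_bar, r = summarize(data)
area = trapezoid(lambda mu: t_posterior(mu, n, d_bar, r),
                 d_bar - 50.0 * r, d_bar + 50.0 * r, 20000)
print(n, d_bar, r, area)   # area should be very close to 1
```

Working with `math.lgamma` rather than `math.gamma` is a design choice that keeps the normalization stable for large N, where the gamma functions themselves would overflow.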
[Figure 9.2 Plot of the radio source measurements and $1\sigma$ measurement errors. Axes: radio flux density versus time (days).]
Example: Often we encounter situations in which our model plus known instrumental errors fail to adequately describe the full range of variability in the data. We illustrate this with an example from radio astronomy. In this case, we are interested in inferring the mean flux density of a celestial radio source from repeated measurements with a radio telescope with well-known noise properties. Figure 9.2 shows 56 measurements of the radio flux density of a galaxy. The individual measurement errors are known to have a Gaussian distribution with $\sigma_1 = 30$ units of radio flux density. It is obvious from the scatter in the measurements compared to the error bars that there is some additional source of uncertainty or the signal strength is variable. For example, additional fluctuations might arise from propagation effects in the interstellar medium between the source and observer. In the absence of prior information about the distribution of the additional scatter, both the Central Limit Theorem and the MaxEnt principle (Section 8.7.4) lead us to adopt a Gaussian distribution because it is the most conservative choice. Let $\sigma_2$ = the standard deviation of this Gaussian (see footnote 4). The resulting likelihood function is the convolution of the Gaussian model of the additional scatter and the Gaussian measurement error distribution (see Section 4.8.2). The result is another Gaussian with a variance $\sigma^2 = \sigma_1^2 + \sigma_2^2$.
Footnote 4: Since $\sigma_2$ is an unknown nuisance parameter, we will need to marginalize over some prior range. For values of $\sigma_2$ close to the upper bounds of this range, the lower tail of the Gaussian distribution may extend into negative values of source strength. If the scatter arises from variations in the source strength then this situation is non-physical. Thus, it would be more exact to adopt a truncated Gaussian but the mathematics is greatly complicated. The current analysis must therefore be viewed as approximate.
[Figure 9.3 Comparison of the computed results for the posterior PDF for the mean radio flux density assuming $\sigma$ known (solid curve), and marginalizing over an unknown $\sigma$ (dashed curve). Axes: probability density versus mean flux density $\mu$.]
We have computed the posterior probability of the mean flux density $p(\mu|D,I)$ in two ways. First, assuming the known measurement errors and Equation (9.8), the result is shown as the solid curve in Figure 9.3. Next we assumed $\sigma$ was unknown and plotted the result after marginalizing over values of $\sigma$ in the range 30 to 400 units, using Equation (9.39). This results in the much broader dashed curve shown in Figure 9.3. In the latter analysis, where we marginalize over $\sigma$, we are in effect estimating $\sigma$ from the data, and any variability which is not described by the model is assumed to be noise (the following section provides a justification of this statement). This approach leads to a broader posterior probability distribution which reflects the larger effective noise when using a model that assumes the source flux density is constant. Note: if the effective noise had been equal to the measurement error ($\sigma = 30$) then the result for $p(\mu|D,I)$ would have been the same as if we had used a fixed noise of 30. A justification of this claim is given in the following section.
9.2.4 Bayesian estimate of $\sigma$
In the previous section, we computed the Bayesian estimate of the mean of a data set when the $\sigma$ of the Gaussian sampling distribution is unknown. It is also of interest to see what the data have to say about $\sigma$. This can be answered by computing $p(\sigma|D,I)$, the posterior marginal for $\sigma$. Following Equation (9.23), we write
$$p(\sigma|D,I) = \frac{p(\sigma|I)\int p(\mu|I)\,p(D|\mu,\sigma,I)\,d\mu}{p(D|I)}. \tag{9.42}$$
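Substituting Equations (9.30) and (9.31) into Equation (9.42), we obtain
$$\begin{aligned}
p(\sigma|D,I) &= \frac{\frac{(2\pi)^{-N/2}}{R_\mu \ln(\sigma_H/\sigma_L)}\, \sigma^{-(N+1)}\, e^{-\frac{Nr^2}{2\sigma^2}} \int_{\mu_L}^{\mu_H} e^{-\frac{N(\mu-\bar{d})^2}{2\sigma^2}}\,d\mu}{\frac{(2\pi)^{-N/2}}{R_\mu \ln(\sigma_H/\sigma_L)} \int_{\sigma_L}^{\sigma_H} \sigma^{-(N+1)}\, e^{-\frac{Nr^2}{2\sigma^2}} \int_{\mu_L}^{\mu_H} e^{-\frac{N(\mu-\bar{d})^2}{2\sigma^2}}\,d\mu\,d\sigma} \\
&= \frac{\sigma^{-(N+1)}\, e^{-\frac{Nr^2}{2\sigma^2}}\, \sigma\frac{\sqrt{2\pi}}{\sqrt{N}}}{\int_{\sigma_L}^{\sigma_H} d\sigma\, \sigma^{-(N+1)}\, e^{-\frac{Nr^2}{2\sigma^2}}\, \sigma\frac{\sqrt{2\pi}}{\sqrt{N}}} \\
&= C\,\sigma^{-N}\, e^{-\frac{Nr^2}{2\sigma^2}}.
\end{aligned} \tag{9.43}$$
In the above equation, we have made use of the fact that the integral of a normalized Gaussian is equal to 1, i.e.,
$$\frac{1}{\sigma\sqrt{2\pi}/\sqrt{N}} \int_{\mu_L}^{\mu_H} e^{-\frac{(\mu-\bar{d})^2}{2\sigma^2/N}}\,d\mu = 1; \tag{9.44}$$
therefore,
$$\int_{\mu_L}^{\mu_H} e^{-\frac{N(\mu-\bar{d})^2}{2\sigma^2}}\,d\mu = \frac{\sigma\sqrt{2\pi}}{\sqrt{N}}. \tag{9.45}$$
The most probable value (mode) of Equation (9.43) is the solution of
$$\frac{\partial p}{\partial \sigma} = \left[-N\hat{\sigma}^{-N-1} + Nr^2\,\hat{\sigma}^{-N-3}\right] C\, e^{-Nr^2/2\hat{\sigma}^2} = 0. \tag{9.46}$$
The solution is
$$\hat{\sigma} = r. \tag{9.47}$$
Thus, $p(\sigma|D,I)$ has a maximum at $\sigma = r$, the RMS deviation from $\bar{d}$. Since $p(\sigma|D,I)$ is not a simple Gaussian, it is of interest to compare the mode to $\langle\sigma\rangle$ and $\langle\sigma^2\rangle$, the expectation values of $\sigma$ and $\sigma^2$, respectively. They are given by
$$\langle\sigma\rangle = \int_0^\infty \sigma\, p(\sigma|D,I)\,d\sigma, \tag{9.48}$$
$$\langle\sigma^2\rangle = \int_0^\infty \sigma^2\, p(\sigma|D,I)\,d\sigma. \tag{9.49}$$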
These equations can be evaluated using an inverse gamma integral and a change of variables. The results are
$$\langle\sigma\rangle = \frac{\sqrt{N}\,r}{\sqrt{2}}\,\frac{\Gamma[(N-2)/2]}{\Gamma[(N-1)/2]}, \tag{9.50}$$
and
$$\langle\sigma^2\rangle = \frac{Nr^2}{N-1} = \frac{1}{N-1}\sum_{i=1}^{N}(d_i-\bar{d})^2. \tag{9.51}$$
These three summaries $(\hat{\sigma}, \langle\sigma\rangle, \sqrt{\langle\sigma^2\rangle})$ of $p(\sigma|D,I)$ are all different; the distribution is not symmetric like a Gaussian. Figure 9.4 illustrates the three summaries assuming a value of $r = 2$. For $N = 3$, $\langle\sigma\rangle$ can differ from $\hat{\sigma}$ by as much as a factor 2, but this difference drops to 15% by $N = 10$. As N increases, the summaries asymptotically approach $r$, the RMS residual. Of the three, $\sqrt{\langle\sigma^2\rangle}$ is the most representative, lying between the other two. The reader should recognize the expression for $\langle\sigma^2\rangle$ is identical to the equation for the frequentist sample variance, $S^2 = \sum (d_i-\bar{d})^2/(N-1)$, that is used to estimate the population variance for an IID sample taken from a normal distribution (see Section 6.3). Of course in Bayesian analysis, the concept of a population of hypothetical samples plays no role. The main message of this section is that in problems where $\sigma$ is unknown, the effect of marginalizing over $\sigma$ is roughly equivalent to setting $\sigma$ = RMS residual of the most probable model. Thus, anything in the data that can't be explained by the model is treated as noise, leading to the most conservative estimates of model parameters. It is a very safe thing to do.
[Figure 9.4 A comparison of the three summaries ($\hat{\sigma}$, $\langle\sigma\rangle$, $\sqrt{\langle\sigma^2\rangle}$) for the marginal probability density function $p(\sigma|D,I)$ assuming an RMS residual $r = 2$. Axes: summary value versus N.]
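The statement about how far $\langle\sigma\rangle$ lies from the mode $\hat{\sigma} = r$ can be reproduced directly from Equation (9.50). The Python sketch below is illustrative only; the function name is an assumption. It evaluates the ratio $\langle\sigma\rangle/\hat{\sigma}$, which depends only on N.

```python
import math

def mean_sigma_over_mode(n):
    """Ratio <sigma> / sigma_hat from Equations (9.47) and (9.50).

    Requires n >= 3 so that both gamma functions have positive arguments.
    Hypothetical helper, written for illustration only.
    """
    if n < 3:
        raise ValueError("n must be at least 3")
    log_ratio = math.lgamma((n - 2) / 2.0) - math.lgamma((n - 1) / 2.0)
    return math.sqrt(n / 2.0) * math.exp(log_ratio)

for n in (3, 10, 56):
    print(n, mean_sigma_over_mode(n))
```

For N = 3 the ratio is a little above 2, and for N = 10 it is about 1.15, in agreement with the factor of 2 and the 15% difference quoted above. The ratio approaches 1 as N grows.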
[Figure 9.5 Panel (a) shows the marginal probability density function $p(\sigma|D,I)$ for the radio galaxy data of Figure 9.2 (probability density versus standard deviation $\sigma$). Panel (b) compares the effective Gaussian sampling distribution employed in the analysis with a normalized histogram of the actual data values (probability density versus flux density).]
Returning to the radio source example (Figure 9.2) of the previous section, we now use Equation (9.43) to estimate the posterior marginal $p(\sigma|D,I)$ for these data, which is shown in panel (a) of Figure 9.5. The three summaries in this case are $\hat{\sigma} = 73.8$, $\langle\sigma\rangle = 75.6$, $\sqrt{\langle\sigma^2\rangle} = 74.5$. Recall that in the absence of prior information on the sampling distribution for the radio source measurements, we adopted a Gaussian with unknown $\sigma$. Panel (b) compares the effective Gaussian sampling distribution (based on $\sqrt{\langle\sigma^2\rangle} = 74.5$) employed in the analysis with a normalized histogram of the actual data values. The Gaussian is centered at $\mu = 182$, the posterior maximum. So far in this chapter, we have been concerned with fitting a simple linear model with one parameter, the mean. In Chapter 10, we will be concerned with linear models with M parameters. We will also have occasion to marginalize over an unknown noise $\sigma$. Again, we can compute the posterior marginal $p(\sigma|D,I)$, after marginalizing over the M model parameters. Assuming a Jeffreys prior for $\sigma$ with prior boundaries well outside the region of the posterior peak, it can be shown that the value of $\langle\sigma^2\rangle$ is given by
$$\langle\sigma^2\rangle = \frac{1}{(N-M)}\sum_{i=1}^{N}(d_i-\bar{d})^2. \tag{9.52}$$
9.3 Is the signal variable?
In Section 7.2.1, we used a frequentist hypothesis test to decide whether we could reject, at the 95% confidence level, the null hypothesis that the radio signal from a galaxy is constant. If we can reject this hypothesis, then it provides indirect evidence that the signal is variable. A Bayesian analysis of the same data allows one to directly compare the probabilities of two hypotheses: $H_c \equiv$ "the signal is constant," and $H_v \equiv$ "the signal is variable." To compute $p(H_v|D,I)$, it is first necessary to specify a model for the signal variability. Some examples of different categories of variability models are given below.
1. The signal varies according to some specific non-periodic function of time, $f(t|\theta)$, where $\theta$ stands for a set of model parameters, e.g., slope and intercept in a linear model. The model might make specific predictions concerning the parameters or they may be unknown nuisance parameters. Of course, each nuisance parameter will introduce an Occam penalty in the calculation of the Bayes factor. Model fitting is discussed in Chapters 10, 11, 12.
2. The signal varies according to some specific periodic function of time. Examples of this are discussed in Section 12.9 and Chapter 13.
3. The signal varies according to some unknown periodic function of time (Gregory and Loredo, 1992; Loredo, 1992; Gregory, 1999). In this case, it is possible to proceed if we assume a model, or family of models, that is capable of describing an arbitrary shape of periodic variability with the minimum number of parameters. An example of this will be discussed in Section 13.4.
4. The signal varies according to some unknown non-periodic function of time. Again, it is possible to proceed if we assume a model, or family of models, that is capable of describing an arbitrary shape of variability with the minimum number of parameters. An example of this is given by Gregory and Loredo (1993).
5. The model only provides information about the statistical properties of the signal variability, i.e., specifies a probability distribution of the signal fluctuations. When combined with a model of the measurement errors (see Section 4.8.2), it can be used as a sampling distribution to compute the likelihood of the data set.
6. Finally, we may only have certain constraints on a model of the signal variability. We can always exploit the MaxEnt principle to arrive at a form of the signal variability distribution that reflects our current state of knowledge. Again, when combined with a model of the measurement errors (see Section 4.8.2), it can be used as a sampling distribution to compute the likelihood of the data set.
9.4 Comparison of two independent samples
The decisions on whether a particular drug is effective, or some human activity is proving harmful to the environment, are important topics to which Bayesian analysis can make a significant contribution. The issue typically boils down to comparing two independent samples referred to as the trial sample and the control sample. In this section, we will demonstrate a Bayesian approach to comparing two samples based on the treatment given by Bretthorst (1993), which is an extension of earlier work by Dayal (1972), and Dayal and Dickey (1976). His derivation is a generalization of the Behrens–Fisher and two-sample problems (see footnote 5), using the traditional F and Student's t distributions.
Footnote 5: In the frequentist statistical literature, estimating the difference in means assuming the same but unknown standard deviation is referred to as the two-sample problem. Estimating the difference in means assuming different unknown standard deviations is known as the Behrens–Fisher problem.
To start, we need to specify the prior information, I, which includes a statement of the problem, the hypothesis space of interest, and the sampling distribution to be used in calculating the likelihood function. To illustrate the methodology, we will re-visit a problem that was considered using frequentist statistical tools in Section 7.2.2, and in problem 2 at the end of Chapter 7. The problem is to compare the concentrations of a particular toxin in river sediment samples taken from two locations and tabulated in Table 7.2. The location 1 sample was taken upriver (control sample) from a processing plant and the location 2 sample taken downstream (trial sample) from the plant. In Section 7.2.2 we considered whether we could reject the null hypothesis that the mean toxin concentrations are the same at the two locations assuming the standard deviations of the two samples were different. Using the frequentist approach, we were just able to reject the null hypothesis at the 95% confidence level. The current analysis assumes the two samples can differ in only two ways, the mean toxin concentrations and/or the sample standard deviations. Let $d_{1,i}$ represent the ith measurement in the first sample consisting of $N_1$ measurements in total. The symbol $D_1$ will represent the set of measurements $\{d_{1,i}\}$ that constitute sample 1. We will model $d_{1,i}$ by the equation
$$d_{1,i} = c_1 + e_{1,i}, \tag{9.53}$$
where, as usual, $e_{1,i}$ represents an unknown error component in the measurement. We assume that our knowledge of the source of the errors leads us to assume a Gaussian distribution for $e_{1,i}$, with a standard deviation of $\sigma_1$. To be more precise, $\sigma_1$ is a continuous hypothesis asserting that the noise standard deviation in $D_1$ is between $\sigma_1$ and $\sigma_1 + d\sigma_1$. In some cases, we assume a Gaussian distribution because, in the absence of knowledge of the true sampling distribution, employing a Gaussian distribution is the most conservative choice for the reasons given in Section 8.7.4. We also assume that individual measurements that constitute the sample are independent, i.e., the $e_{1,i}$ are independent. In the absence of the error component, the model predicts $d_{1,i} = c_1$. Although $c_1$ will be referred to as the mean of $D_1$, $c_1$ is, more precisely, a continuous hypothesis asserting that the constant signal component in $D_1$ is between $c_1$ and $c_1 + dc_1$. We can write a similar equation for the ith measurement of sample 2, which consists of $N_2$ measurements:
$$d_{2,i} = c_2 + e_{2,i}. \tag{9.54}$$
Again, we will let $\sigma_2$ represent the standard deviation of the Gaussian error term. The hypothesis space of interest for our Bayesian analysis is given in Table 9.1. We will be concerned with answering the following hierarchy of questions:
1. Do the samples differ, i.e., do the mean concentrations and/or the standard deviations differ?
2. If so, how do they differ; in the mean, standard deviation or both?
3. If the means differ, what is their difference $\delta$?
4. If the standard deviations differ, what is their ratio $r$?

Table 9.1 The hypotheses addressed. The symbol in the right hand column is used as an abbreviation for the hypothesis.
Hypothesis                                        In words                                   Symbol
$c_1 = c_2$                                       Same means                                 $C$
$c_1 \ne c_2$                                     Means differ                               $\bar{C}$
$\sigma_1 = \sigma_2$                             Same standard deviations                   $S$
$\sigma_1 \ne \sigma_2$                           Standard deviations differ                 $\bar{S}$
$c_1 = c_2$ and $\sigma_1 = \sigma_2$             Same means and standard deviations         $C, S$
$c_1 = c_2$ and $\sigma_1 \ne \sigma_2$           Same means, standard deviations differ     $C, \bar{S}$
$c_1 \ne c_2$ and $\sigma_1 = \sigma_2$           Means differ, same standard deviations     $\bar{C}, S$
$c_1 \ne c_2$ and $\sigma_1 \ne \sigma_2$         Means and standard deviations differ       $\bar{C}, \bar{S}$
$c_1 \ne c_2$ and/or $\sigma_1 \ne \sigma_2$      Means and/or standard deviations differ    $\bar{C} + \bar{S}$
$c_1 - c_2 = \delta$                              Difference in means $= \delta$             $\delta$
$\sigma_1/\sigma_2 = r$                           Ratio of standard deviations $= r$         $r$

To answer the above questions, we need to compute the probabilities of the hypotheses listed in Table 9.1. The discrete hypotheses are represented by the capitalized symbols and the continuous hypotheses by the lower case symbols, $\delta$ and $r$. For example, the symbol $C$ stands for the hypothesis that the means are the same, and $\bar{C}$ stands for the hypothesis that they differ.
9.4.1 Do the samples differ? The answer to question (1) can be obtained by computing pðC þ SjD1 ; D2 ; IÞ, the probability that the means and/or the standard deviations are different given the sample data (D1 and D2 ) and the prior information I. Apart from the priors for the parameters c1 ; c2 ; 1 ; 2 , we have already specified the prior information I above. To compute the probability that the means and/or the standard deviations differ, we note that from Equation (2.1), we can write pðC þ SjD1 ; D2 ; IÞ ¼ pðC; SjD1 ; D2 ; IÞ ¼ 1 pðC; SjD1 ; D2 ; IÞ:
(9:55)
Equation (9.55) demonstrates that it is sufficient to compute the probability that the means and the standard deviations are the same. From that, one can compute the probability that the means and/or the standard deviations differ. The hypothesis $C, S$ assumes the means and the standard deviations are the same, so only two parameters (a constant $c_1$ and a standard deviation $\sigma_1$) have to be removed by marginalization:

$$p(C, S|D_1, D_2, I) = \int dc_1\, d\sigma_1\; p(C, S, c_1, \sigma_1|D_1, D_2, I), \tag{9.56}$$

where $c_2 = c_1$ and $\sigma_2 = \sigma_1$. The right-hand side of this equation may be factored using Bayes' theorem to obtain

$$p(C, S|D_1, D_2, I) = K \int dc_1\, d\sigma_1\; p(C, S, c_1, \sigma_1|I)\, p(D_1, D_2|C, S, c_1, \sigma_1, I), \tag{9.57}$$

where

$$K = \frac{1}{p(D_1, D_2|I)}. \tag{9.58}$$
We need to evaluate the probabilities of four basic alternative hypotheses. They are $(C, S)$, $(C, \overline{S})$, $(\overline{C}, S)$ and $(\overline{C}, \overline{S})$. Equation (9.57) gives the posterior for $(C, S)$. We could similarly write out the posterior for the other three. For hypothesis $(C, \overline{S})$, which assumes $\sigma_1 \neq \sigma_2$, the result is

$$p(C, \overline{S}|D_1, D_2, I) = K \int dc_1\, d\sigma_1\, d\sigma_2\; p(C, \overline{S}, c_1, \sigma_1, \sigma_2|I)\, p(D_1, D_2|C, \overline{S}, c_1, \sigma_1, \sigma_2, I). \tag{9.59}$$

Each of the four posteriors has a different numerator on the right-hand side but a common denominator, $p(D_1, D_2|I)$. Recall that the denominator in Bayes' theorem, $p(D_1, D_2|I)$, ensures that the posterior is normalized over this hypothesis space. In terms of these basic hypotheses, $p(D_1, D_2|I)$ is the sum of the four numerators and is given by

$$\begin{aligned}
p(D_1, D_2|I) = {} & \int dc_1\, d\sigma_1\; p(C, S, c_1, \sigma_1|I)\, p(D_1, D_2|C, S, c_1, \sigma_1, I) \\
& + \int dc_1\, d\sigma_1\, d\sigma_2\; p(C, \overline{S}, c_1, \sigma_1, \sigma_2|I)\, p(D_1, D_2|C, \overline{S}, c_1, \sigma_1, \sigma_2, I) \\
& + \int dc_1\, dc_2\, d\sigma_1\; p(\overline{C}, S, c_1, c_2, \sigma_1|I)\, p(D_1, D_2|\overline{C}, S, c_1, c_2, \sigma_1, I) \\
& + \int dc_1\, dc_2\, d\sigma_1\, d\sigma_2\; p(\overline{C}, \overline{S}, c_1, c_2, \sigma_1, \sigma_2|I)\, p(D_1, D_2|\overline{C}, \overline{S}, c_1, c_2, \sigma_1, \sigma_2, I).
\end{aligned} \tag{9.60}$$
Assuming logical independence of the parameters and the data, Equation (9.57) may be further simplified to obtain

$$p(C, S|D_1, D_2, I) = K \int dc_1\, d\sigma_1\; p(C, S|I)\, p(c_1|I)\, p(\sigma_1|I)\, p(D_1|C, S, c_1, \sigma_1, I)\, p(D_2|C, S, c_1, \sigma_1, I), \tag{9.61}$$

where $p(C, S|I)$ is the prior probability that the means and the standard deviations are the same, $p(c_1|I)$ is the prior probability for the mean, $p(\sigma_1|I)$ is the prior probability for the standard deviation, and $p(D_1|C, S, c_1, \sigma_1, I)$ and $p(D_2|C, S, c_1, \sigma_1, I)$ are the likelihoods of the two data sets.

Assignment of priors

In this calculation we will adopt bounded uniform priors for the location parameters, $c_1$ and $c_2$, and Jeffreys priors for the scale parameters, $\sigma_1$ and $\sigma_2$. Thus, for the mean, $c_1$, we write

$$p(c_1|I) = \begin{cases} 1/R_c, & \text{if } L \le c_1 \le H \\ 0, & \text{otherwise} \end{cases} \tag{9.62}$$

where $R_c \equiv H - L$, and $H$ and $L$ are the limits on the constant $c_1$ and are assumed known. The same prior will be used for the $c_2$ constant. The prior for the standard deviation, $\sigma_1$, of the noise component in $D_1$, is given by

$$p(\sigma_1|I) = \begin{cases} \dfrac{1}{\sigma_1 \log(R_\sigma)}, & \text{if } \sigma_L \le \sigma_1 \le \sigma_H \\ 0, & \text{otherwise} \end{cases} \tag{9.63}$$

where $R_\sigma$ is the ratio $\sigma_H/\sigma_L$, and $\sigma_H$ and $\sigma_L$ are the limits on the standard deviation $\sigma_1$ and are also assumed known. Again, the same prior will be assumed for $\sigma_2$.

We now come to the difficult issue of choosing prior ranges for the mean toxin concentrations (the means $c_1$ and $c_2$ in Equations (9.53) and (9.54)), and the standard deviations ($\sigma_1$ and $\sigma_2$). Recall from Section 3.5 that in a Bayesian model selection problem, marginalizing over parameters introduces Occam penalties, one for each parameter. Here, the models all contain the same types of parameters, constants and standard deviations, but they contain differing numbers of these parameters. Consequently, the prior ranges are important and will affect model selection conclusions. We also saw in Section 3.8.1 that for a uniform prior, the results are quite sensitive to the prior boundaries.

In general, any scientific enquiry is motivated from a particular prior state of knowledge on which we base our selection of prior boundaries. In the current instance, the motivation is to illustrate some methodology, so we will investigate the dependence of the results on four different choices of prior boundaries as given in Table 9.2. Finally, we need to assign a prior probability for each of the four fundamental hypotheses: $(C, S)$, $(C, \overline{S})$, $(\overline{C}, S)$, and $(\overline{C}, \overline{S})$. Since the given information, $I$, indicates no preference, we assign a probability of 1/4 to each.
Table 9.2 Different choices for lower and upper bounds on the priors for the mean and standard deviation of the river sediment toxin concentrations.

Case    Mean lower (ppm)    Mean upper (ppm)    $\sigma_L$ lower (ppm)    $\sigma_H$ upper (ppm)
1       2                   18                  0.4                       10
2       7                   12                  0.4                       10
3       2                   18                  1                         4
4       7                   12                  1                         4
There is a danger that the reader will get lost in the forest of calculations required to evaluate the probabilities of the four fundamental hypotheses, so we have moved them to Appendix C. If you are planning on applying Bayesian analysis to a non-trivial problem in your own research field, it often helps to see worked examples of other non-trivial problems. Consider Appendix C as such a worked example. After we have evaluated the four basic hypotheses, Equation (9.61) can be used to determine if the data sets are the same. Equation (9.55) can then be used to determine the probability that the means and/or the standard deviations differ, which answers the first question of interest, "Do the samples differ?"
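As a rough numerical cross-check on the structure of Equations (9.57) to (9.63), the four marginalizations can be carried out by brute force on a grid. The sketch below is not the analytic route of Appendix C. It assumes that each data set can be replaced by its summary statistics from Table 9.3 (number of points, average, and a standard deviation computed with the $n - 1$ divisor), that the case 4 prior boundaries of Table 9.2 apply, and that the four hypotheses have equal prior probability. The function names and the grid size are illustrative choices, and the output is not guaranteed to reproduce Table 9.3 digit for digit.

```python
import numpy as np

def trapezoid_weights(grid):
    """Trapezoid-rule quadrature weights for a uniformly spaced grid."""
    weights = np.full(grid.size, grid[1] - grid[0])
    weights[0] *= 0.5
    weights[-1] *= 0.5
    return weights

def likelihood(sample, c, sigma):
    """Gaussian likelihood of a sample summarized as (n, mean, sd).

    Assumes sd was computed with the n - 1 divisor, so that
    (n - 1) * sd**2 is the sum of squared deviations about the sample mean.
    """
    n, mean, sd = sample
    q = (n - 1) * sd**2 + n * (mean - c) ** 2  # sum of squared residuals about c
    return (2.0 * np.pi) ** (-n / 2) * sigma ** (-n) * np.exp(-q / (2.0 * sigma**2))

def hypothesis_posteriors(sample1, sample2, c_range, sigma_range, n_grid=401):
    """Posterior probabilities of the four compound hypotheses (equal priors)."""
    c = np.linspace(*c_range, n_grid)
    s = np.linspace(*sigma_range, n_grid)
    # Quadrature weights multiplied by the prior densities:
    # uniform prior on the mean, Jeffreys prior on the standard deviation.
    wc = trapezoid_weights(c) / (c_range[1] - c_range[0])
    ws = trapezoid_weights(s) / (s * np.log(sigma_range[1] / sigma_range[0]))
    like1 = likelihood(sample1, c[:, None], s[None, :])  # shape (n_c, n_sigma)
    like2 = likelihood(sample2, c[:, None], s[None, :])
    evidence = {
        "C,S": wc @ (like1 * like2) @ ws,                    # shared mean, shared sd
        "Cbar,S": ((wc @ like1) * (wc @ like2)) @ ws,        # separate means, shared sd
        "C,Sbar": wc @ ((like1 @ ws) * (like2 @ ws)),        # shared mean, separate sds
        "Cbar,Sbar": (wc @ like1 @ ws) * (wc @ like2 @ ws),  # everything separate
    }
    total = sum(evidence.values())  # plays the role of p(D1, D2 | I) up to the common 1/4
    return {name: value / total for name, value in evidence.items()}

# Summary statistics (number, average, standard deviation) from Table 9.3.
river_b1 = (12, 10.3167, 2.1771)
river_b2 = (8, 8.5875, 1.2800)
posteriors = hypothesis_posteriors(river_b1, river_b2, (7.0, 12.0), (1.0, 4.0))
for name, probability in posteriors.items():
    print(f"p({name} | D1, D2, I) = {probability:.4f}")
```

Because every hypothesis carries the same prior probability of 1/4, that factor cancels in the normalization, which is why only the marginal likelihoods appear in the code.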
9.4.2 How do the samples differ?

We now address the second question: assuming that the two samples differ, how do they differ? There are only three possibilities: the means differ, the standard deviations differ, or both differ. To determine if the means differ, one computes $p(\overline{C}|D_1, D_2, I)$. Similarly, to determine if the standard deviations differ, one computes $p(\overline{S}|D_1, D_2, I)$. Using the sum rule, these probabilities may be written

$$p(\overline{C}|D_1, D_2, I) = p(\overline{C}, S|D_1, D_2, I) + p(\overline{C}, \overline{S}|D_1, D_2, I) \tag{9.64}$$

and

$$p(\overline{S}|D_1, D_2, I) = p(C, \overline{S}|D_1, D_2, I) + p(\overline{C}, \overline{S}|D_1, D_2, I) \tag{9.65}$$

where $p(\overline{C}|D_1, D_2, I)$ is computed independent of whether or not the standard deviations are the same, while $p(\overline{S}|D_1, D_2, I)$ is independent of whether or not the means are the same.
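The sum rule in Equations (9.64) and (9.65) amounts to adding up the compound hypotheses that share the property of interest. A minimal sketch, using the case 4 probabilities reported in Table 9.3 as input (the dictionary layout and the helper name `marginal` are illustrative, not part of the Mathematica tutorial):

```python
# Posterior probabilities of the four compound hypotheses, copied from
# Table 9.3 (case 4 prior boundaries). Keys are (means_same, sds_same).
posterior = {
    (True, True): 0.0975,    # C, S
    (False, True): 0.2892,   # C-bar, S
    (True, False): 0.1443,   # C, S-bar
    (False, False): 0.4690,  # C-bar, S-bar
}

def marginal(posterior, index, value):
    """Sum rule: add every compound hypothesis whose component `index` equals `value`."""
    return sum(p for key, p in posterior.items() if key[index] == value)

p_means_differ = marginal(posterior, 0, False)  # Equation (9.64)
p_sds_differ = marginal(posterior, 1, False)    # Equation (9.65)
odds_means = p_means_differ / (1.0 - p_means_differ)
odds_sds = p_sds_differ / (1.0 - p_sds_differ)

print(f"p(means differ) = {p_means_differ:.4f}, odds = {odds_means:.2f}")
print(f"p(sds differ)   = {p_sds_differ:.4f}, odds = {odds_sds:.2f}")
```

The results agree with the corresponding entries of Table 9.3 to within the rounding of the four input probabilities.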
9.4.3 Results

We now have expressions for computing the probability for the first nine hypotheses appearing in Table 9.1. These calculations have been implemented in a special section in the Mathematica tutorial entitled "Bayesian analysis of two independent samples." This analysis program produces three different types of output: (1) the probability for the four fundamental compound hypotheses; (2) the probability that the means are different, the probability that the variances are different, and the probability that one or both are different; and finally (3) the probability for the difference in means and the ratio of the standard deviations. Table 9.3 illustrates the output for the prior boundaries corresponding to case 4 in Table 9.2, i.e., $7.0 \le c_1, c_2 \le 12$ ppm and, for the standard deviations, $1 \le \sigma_1, \sigma_2 \le 4$. The last line gives an odds ratio of 9.25 in favor of the means and/or standard deviations being different.

Table 9.3 Output from Mathematica program: "Bayesian analysis of two independent samples," for the river sediment toxin measurements.

Data Summary
Data Set     No.    Standard Deviation    Average
river B.1    12     2.1771                10.3167
river B.2    8      1.2800                8.5875
Combined     20     2.0256                9.6251

Prior mean lower bound                  7.0
Prior mean upper bound                  12.0
Prior standard deviation lower bound    1.0
Prior standard deviation upper bound    4.0

Number of steps for plotting $p(\delta|D_1, D_2, I)$    200
Number of steps for plotting $p(r|D_1, D_2, I)$         300

Hypothesis                                                                      Probability
$C, S$                          same means, same standard deviations            0.0975
$\overline{C}, S$               different means, same standard deviation        0.2892
$C, \overline{S}$               same mean, different standard deviations        0.1443
$\overline{C}, \overline{S}$    different means, different standard deviations  0.4690
$C$                             means are the same                              0.2419
$\overline{C}$                  means are different                             0.7581
The odds ratio in favor of different means                                      odds = 3.13
$S$                             standard deviations are the same                0.3867
$\overline{S}$                  standard deviations are different               0.6133
The odds ratio in favor of different standard deviations                        odds = 1.59
$C, S$                          same means, same standard deviations            0.0975
$\overline{C} + \overline{S}$   one or both are different                       0.9025
The odds ratio in favor of a difference                                         odds = 9.25

Recall that the posterior probability is proportional to the product of the prior probability and the likelihood. Following Bretthorst's analysis, we assumed equal
prior probabilities for the four compound hypotheses: $p(C, S|I)$, $p(C, \overline{S}|I)$, $p(\overline{C}, S|I)$, and $p(\overline{C}, \overline{S}|I)$. With this assumption, the prior odds favoring different means and/or different standard deviations $(\overline{C} + \overline{S})$ is 3.0 to 1. The data, acting through the likelihood term, are responsible for increasing this from 3 to 9.25 in this case. If instead we had taken as our prior that $p(C, S|I) = p(\overline{C} + \overline{S}|I)$, then the posterior odds ratio would be reduced to 3.08. It is important to remember that all Bayesian probabilities are conditional probabilities, conditional on the truth of the data and prior information. It is thus important in any Bayesian analysis to specify the prior used in the analysis.

Table 9.4 illustrates the dependence of the probabilities of the different hypotheses on different choices of prior boundaries. It is clear from the table that as we increase our prior uncertainty, the hypotheses with more parameters to marginalize over suffer larger Occam penalties and hence their probability is reduced compared to the simpler hypothesis of no change. This might be a good time to review the material on the Occam factor in Section 3.5. In all cases, the odds ratio, $\mathrm{Odds}_{\rm diff}$, favoring different means and/or standard deviations, exceeds 1. In this analysis, we have purposely considered four choices of prior boundaries to see what effect the different boundaries have on the final results. It often requires some careful thought to translate the available background information into an appropriate choice of prior parameter boundaries. Otherwise one might make these boundaries artificially large and, as a consequence, the probability of a possibly correct complex model will decrease in relation to simpler models.

How do the present results compare to our earlier frequentist test of the null hypothesis that the mean toxin concentrations are the same (see Section 7.2.2)? On the basis of that analysis, we obtained a P-value $= 0.04$ and thus rejected the null hypothesis at the 96% confidence level. The frequentist P-value is often incorrectly viewed as the probability that the null hypothesis is true. The Bayesian conclusion regarding the question of whether the means are the same is given by $p(C) = 1 - p(\overline{C})$. Although this depends on our prior uncertainty in the means and standard deviations of two samples, the minimum value for $p(C)$ according to Table 9.4 is $1 - 0.76 = 0.24$. The difficulty in interpreting P-values and confidence levels has been highlighted in many papers (e.g., Berger and Sellke, 1987; Delampady and Berger, 1990; Sellke et al., 2001). The focus of these works is that P-values are commonly considered to imply considerably greater evidence against the null hypothesis than is actually warranted.

Table 9.4 Dependence of the probabilities of the hypotheses of interest on the different prior boundaries given in Table 9.2. See Table 9.1 for the meaning of the different hypotheses.

#    $p(C, S)$    $p(\overline{C}, S)$    $p(C, \overline{S})$    $p(\overline{C}, \overline{S})$    $p(\overline{C})$    $p(\overline{S})$    $p(\overline{C} + \overline{S})$    $\mathrm{Odds}_{\rm diff}$
1    0.293        0.276                   0.209                   0.222                              0.498                0.431                0.707                               2.41
2    0.141        0.420                   0.101                   0.338                              0.757                0.439                0.858                               6.06
3    0.202        0.190                   0.299                   0.308                              0.499                0.607                0.798                               3.95
4    0.098        0.289                   0.144                   0.469                              0.758                0.613                0.902                               9.25

Now that one knows that the means and/or standard deviations are not the same, or at the very least are probably not the same, one would like to know what is different between the control and the trial. Are the means different? Are the standard deviations different? Examination of Table 9.4 indicates that the probability the means differ is 0.758, for the choice of prior boundaries corresponding to case 4. The probability that the standard deviations differ is 0.613. Using the calculations presented so far, one can determine if something is different, and then determine what is different. But after determining what is different, one's interest in the problem changes again. The next step is to estimate the magnitude of the changes.
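The prior-odds bookkeeping discussed in this section can be checked with a few lines of arithmetic. The sketch below starts from the case 4 posterior probabilities in Table 9.3, divides out the prior odds of 3 implied by the four equal priors, and then reapplies even prior odds. It assumes that, in the alternative prior, the three "different" hypotheses still share their total prior probability equally, so that the Bayes factor is unchanged. The variable names are illustrative.

```python
# Case 4 posterior probabilities from Table 9.3.
p_same = 0.0975   # p(C, S | D1, D2, I)
p_diff = 0.9025   # p(C-bar + S-bar | D1, D2, I)

# Equal priors of 1/4 on the four compound hypotheses give prior odds
# of (3 * 1/4) / (1/4) = 3 in favor of some difference.
prior_odds_equal_quarters = 3.0
posterior_odds_equal_quarters = p_diff / p_same

# Posterior odds = prior odds * Bayes factor, so the data contribute:
bayes_factor = posterior_odds_equal_quarters / prior_odds_equal_quarters

# Alternative prior: p(C, S | I) = p(C-bar + S-bar | I), i.e. prior odds of 1.
posterior_odds_even = 1.0 * bayes_factor
p_diff_even = posterior_odds_even / (1.0 + posterior_odds_even)

print(f"posterior odds, four equal priors: {posterior_odds_equal_quarters:.2f}")
print(f"Bayes factor from the data:        {bayes_factor:.2f}")
print(f"posterior odds, even prior odds:   {posterior_odds_even:.2f}")
print(f"p(difference), even prior odds:    {p_diff_even:.3f}")
```

Small differences in the last digit relative to the quoted odds of 9.25 and 3.08 are expected, because the inputs here are already rounded to four decimal places.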
9.4.4 The difference in means

To estimate the difference in means, one must first introduce this difference into the problem. Defining $\delta$ and $\beta$ to be the difference and sum, respectively, of the constants $c_1$ and $c_2$, one has

$$\delta = c_1 - c_2, \qquad \beta = c_1 + c_2. \tag{9.66}$$

The two constants, $c_1$ and $c_2$, are then given by

$$c_1 = \frac{\beta + \delta}{2}, \qquad c_2 = \frac{\beta - \delta}{2}. \tag{9.67}$$

The probability for the difference, $\delta$, is then given by

$$\begin{aligned}
p(\delta|D_1, D_2, I) &= p(\delta, S|D_1, D_2, I) + p(\delta, \overline{S}|D_1, D_2, I) \\
&= p(S|D_1, D_2, I)\, p(\delta|S, D_1, D_2, I) + p(\overline{S}|D_1, D_2, I)\, p(\delta|\overline{S}, D_1, D_2, I).
\end{aligned} \tag{9.68}$$

This is a weighted average of the probability for the difference in means given that the standard deviations are the same (the two-sample problem) and the probability for the difference in means given that the standard deviations are different (the Behrens–Fisher problem). The weights are just the probabilities that the standard deviations are the same or different. Two of these four probabilities, $p(S|D_1, D_2, I)$ and $p(\overline{S}|D_1, D_2, I) = 1 - p(S|D_1, D_2, I)$, have already been computed in Equation (9.65). The other two probabilities, $p(\delta|S, D_1, D_2, I)$ and $p(\delta|\overline{S}, D_1, D_2, I)$, are derived in Appendix C.3.

[Figure 9.6: probability density plotted against $\delta$, the mean toxin level difference (ppm). Curves: average, $p(\delta|S, D_1, D_2, I)$, and $p(\delta|\overline{S}, D_1, D_2, I)$. Annotation along the top border: peak = 1.75; mean = 1.73; 95% lower = 0; 95% upper = 3.5.]

Figure 9.6 Three probability density functions are shown: the probability for the difference in means given that the standard deviations are the same (dotted line); the probability for the difference in means given that the standard deviations are different (dashed line); and the probability for the difference in means independent of whether or not the standard deviations are the same (solid line).

Figure 9.6 shows the probability density function for the difference in means for the prior boundaries corresponding to case 4 of Table 9.2. Three curves are shown: the probability for the difference in means given that the standard deviations are the same (dotted line); the probability for the difference in means given that the standard deviations are different (dashed line); and the probability for the difference in means independent of whether or not the standard deviations are the same (solid line). All three curves are very similar, but small differences are noticeable, especially near the peak. The parameters listed along the top border of the figure are the peak and mean of the distribution, and the lower and upper boundaries of the 95% credible region. These apply to the weighted average distribution (solid curve).
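The weighted average in Equation (9.68) is a two-component mixture, and its summary statistics can be read off a grid. In the sketch below the two conditional densities are placeholder Gaussians with hypothetical centers and widths. They stand in for the Appendix C.3 results, which are not reproduced here, so the printed numbers are not those of Figure 9.6. Only the weights are taken from Table 9.3. The credible region computed is an equal-tailed interval, which is one possible convention and is an assumption of this sketch.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Normalized Gaussian density, used only as a stand-in shape."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

delta = np.linspace(-4.0, 8.0, 2401)  # grid of mean differences (ppm)
step = delta[1] - delta[0]

# Placeholder conditional densities (hypothetical, NOT the Appendix C.3 results).
p_delta_given_same_sd = normal_pdf(delta, 1.73, 0.85)
p_delta_given_diff_sd = normal_pdf(delta, 1.73, 0.80)

# Weights p(S | D1, D2, I) and p(S-bar | D1, D2, I) from Table 9.3 (case 4).
weight_same_sd, weight_diff_sd = 0.3867, 0.6133
p_delta = weight_same_sd * p_delta_given_same_sd + weight_diff_sd * p_delta_given_diff_sd

# Summaries of the weighted-average density on the grid.
area = p_delta.sum() * step
mean = (delta * p_delta).sum() * step / area
peak = delta[np.argmax(p_delta)]
cdf = np.cumsum(p_delta) * step / area
lower = delta[np.searchsorted(cdf, 0.025)]  # equal-tailed 95% region
upper = delta[np.searchsorted(cdf, 0.975)]

print(f"area = {area:.4f}, peak = {peak:.2f}, mean = {mean:.2f}")
print(f"95% region: [{lower:.2f}, {upper:.2f}]")
```

Since the weights sum to one and each conditional density is normalized, the mixture is automatically normalized, which the `area` check confirms.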
9.4.5 Ratio of the standard deviations

To estimate the ratio of the standard deviations, this ratio must first be introduced into the problem. Defining $r$ and $\sigma$ to be

$$r = \frac{\sigma_1}{\sigma_2}, \qquad \sigma = \sigma_2 \tag{9.69}$$

and substituting these into the model, Equations (9.53) and (9.54), one obtains

$$d_{1i} = c_1 + \text{noise of standard deviation } r\sigma \tag{9.70}$$

and

$$d_{2i} = c_2 + \text{noise of standard deviation } \sigma. \tag{9.71}$$

The probability for the ratio of the standard deviations, $p(r|D_1, D_2, I)$, is then given by

$$\begin{aligned}
p(r|D_1, D_2, I) &= p(r, C|D_1, D_2, I) + p(r, \overline{C}|D_1, D_2, I) \\
&= p(C|D_1, D_2, I)\, p(r|C, D_1, D_2, I) + p(\overline{C}|D_1, D_2, I)\, p(r|\overline{C}, D_1, D_2, I).
\end{aligned} \tag{9.72}$$

This is a weighted average of the probability for the ratio of the standard deviations given that the means are the same and the probability for the ratio of the standard deviations given that the means are different. The weights are just the probabilities that the means are the same or not. Two of the four probabilities, $p(C|D_1, D_2, I)$ and $p(\overline{C}|D_1, D_2, I) = 1 - p(C|D_1, D_2, I)$, have already been computed in Equation (9.64). The other two probabilities, $p(r|C, D_1, D_2, I)$ and $p(r|\overline{C}, D_1, D_2, I)$, are derived in Appendix C.4.

In case 4 of Table 9.4, there is significant evidence in favor of the means being different. Thus, we might expect that the probability for the ratio of the standard deviations, assuming the same means, will differ from the probability for the ratio of the standard deviations assuming that the means are different. These two distributions, as well as the weighted average, are shown in Figure 9.7.

[Figure 9.7: probability density plotted against $r$, the standard deviation ratio. Curves: average, $p(r|C, D_1, D_2, I)$, and $p(r|\overline{C}, D_1, D_2, I)$. Annotation along the top border: peak = 1.45; mean = 1.61; 95% lower = 0.59; 95% upper = 2.7.]

Figure 9.7 Probability density for the ratio of the standard deviations. Three probability density functions are shown: the probability for the ratio of standard deviations given that the means are the same (dotted line); the probability for the ratio of standard deviations given that the means are different (dashed line); the probability for the ratio of standard deviations independent of whether or not the means are the same (solid line).

The probability for the ratio of the standard deviations, assuming that the means are the same, is shown as the dotted line. This model does not fit the data as well (forcing a common mean leaves larger residuals in each data set than fitting separate means). Consequently, the uncertainty in this probability distribution is larger compared to the other models and the distribution is more spread out. The probability for the ratio of standard deviations assuming different means is shown as the dashed line. This model fits the data better, and results in a more strongly peaked probability distribution. But probability theory tells one to take a weighted average of these two distributions, the solid line. The weights are just the probabilities that the means are the same or different. Here those probabilities are 0.242 and 0.758, respectively. As expected, the weighted average agrees more closely with $p(r|\overline{C}, D_1, D_2, I)$. The parameters listed along the top border of the figure apply to the weighted average distribution (solid curve).
9.4.6 Effect of the prior ranges

We have already discussed the effect of different prior ranges on the model selection conclusions (see Table 9.4). Here we look at their effect on the two parameter estimation problems. Figure 9.8 shows the weighted average estimate of the difference in the means of the two data sets as given by Equation (9.68), for the four choices of prior boundaries. The results for all four cases are essentially identical. Provided the prior boundaries lie outside the parameter range selected by the likelihood function, we don't expect much of an effect. This is because in a parameter estimation problem we are comparing a continuum of hypotheses all with the same number of parameters, so the Occam factors cancel. However, since we are plotting the weighted average of the two-sample calculation and the Behrens–Fisher calculation, in principle, the prior ranges can affect the weights differently and lead to a small effect. This is particularly noticeable in the case of the estimation of the standard deviation ratio shown in Figure 9.9, which is a weighted average (Equation (9.72)) of the result assuming no difference in the means and the result which assumes the means are different.

[Figure 9.8: probability density plotted against $\delta$, the mean toxin level difference (ppm), with one curve for each of cases 1 to 4.]

Figure 9.8 Posterior probability of the differences in the mean river sediment toxin concentration for the four different choices of prior boundaries given in Table 9.2. The effects of different choices of prior boundaries are barely discernible near the peak.

[Figure 9.9: probability density plotted against $r$, the standard deviation ratio, with one curve for each of cases 1 to 4.]

Figure 9.9 Posterior probability for the ratio of standard deviations of the river sediment toxin concentration for the four different choices of prior boundaries given in Table 9.2.
9.5 Summary

This whole chapter has been concerned with Bayesian analysis of problems when our prior information (state of knowledge) leads us to assign a Gaussian sampling distribution when calculating the likelihood function. In some cases, we do this because, in the absence of knowledge of the true sampling distribution, employing a Gaussian distribution is the most conservative choice for the reasons given in Section 8.7.4. We examined a simple model parameter estimation problem, namely, estimating a mean. We started with data for which the $\sigma$ of the Gaussian sampling distribution was a known constant. This was extended to the case where $\sigma$ is known but is not the same for all data. Often, nature is more complex than the assumed model and gives rise to residuals which are larger than the instrumental measurement errors. We dealt with this by treating $\sigma$ as a nuisance parameter which we marginalize over, leading to a Student's t PDF. This has the desirable effect of treating anything in the data that can't be explained by the model as noise, leading to the most conservative estimates of model parameters.

The final section of this chapter dealt with a Bayesian analysis of two independent samples of some physical quantity taken under slightly different conditions, where we wanted to know if there had been a change. The numerical example deals with toxin concentrations in river sediment which are taken at two different locations. One location might be upriver from a power plant and the other just downstream. What other examples can you think of where this type of analysis might be useful? The Bayesian analysis allows the experimenter to investigate the problem in ways never before possible. The details of this non-trivial problem are presented in Appendix C, and Mathematica software to solve the problem is included in the accompanying Mathematica tutorial.
9.6 Problems

1. $V_i$ is a set of ten voltage measurements with known but unequal independent Gaussian errors $\sigma_i$.

$V_i = \{4.36, 4.00, 4.87, 5.64, 6.14, 5.92, 3.93, 6.58, 3.78, 5.84\}$
$\sigma_i = \{2.24, 1.94, 1.39, 2.55, 1.69, 1.38, 1.00, 1.60, 1.00, 1.00\}$

a) Compute the weighted mean value of the voltages.
b) Compute and plot the Bayesian posterior probability density for the mean voltage assuming a uniform prior for the mean in the range 3 to 7.
c) Find the 68.3% credible region for the mean and compare the upper and lower boundaries to $\bar{\mu} \pm \sigma_{\rm mean}$, where

$$\bar{\mu} = \frac{\sum_i^N w_i d_i}{\sum_i^N w_i}, \tag{9.73}$$

and

$$\sigma^2_{\rm mean} = \frac{1}{\sum_i^N w_i}, \qquad w_i = 1/\sigma_i^2. \tag{9.74}$$

d) Compute and plot the Bayesian posterior probability density for the mean voltage assuming a uniform prior for the mean in the range 4.6 to 5.4. Be sure your probability density is normalized to an area of 1 in this prior range. Plot the posterior over the mean range 3 to 7. (Hint: For plotting you may find it useful to examine the item "Define a function which has a different meaning in different ranges of x" in the section "Functions and Plotting" of the Mathematica tutorial.)
e) Find the new 68% credible region for the mean based on (d) and compare with that found in (c).

2. In Section 9.2.3, we derived the Bayesian estimate of the mean assuming the noise $\sigma$ is unknown. The desired quantity, $p(\mu|D, I)$, was obtained from the joint posterior $p(\mu, \sigma|D, I)$ by marginalizing over the nuisance parameter $\sigma$, leading to Equation (9.34). Would we have arrived at the same conclusion if we had started from the joint posterior $p(\mu, \sigma^2|D, I)$ and marginalized over the variance, $\sigma^2$? To answer this question, re-derive Equation (9.34) for this case.

3. Table 7.5 gives measurements of the concentration of a toxic substance at two locations in units of parts per million (ppm). The sampling is assumed to be from two independent normal populations. Assume a uniform prior for the unknown mean concentrations, and a Jeffreys prior for the unknown $\sigma$s. Use the material discussed in Section 9.4 to evaluate the items (a) to (g) listed below, for two different choices of prior ranges for the means and standard deviations. These two choices are:

1) mean (1, 18), $\sigma$ (0.1, 12)
2) mean (7, 13), $\sigma$ (1.0, 4.0)

Note: the prior ranges for the mean and standard deviation are assumed to be the same at both locations.

a) The probabilities of the four models:
   i. two data sets have same mean and same standard deviation,
   ii. have different means and same standard deviation,
   iii. have same mean and different standard deviations,
   iv. have different means and different standard deviations.
b) The odds ratio in favor of different means.
c) The odds ratio in favor of different standard deviations.
d) The odds ratio in favor of different means and/or different standard deviations.
e) Plot a graph of the probability of the difference in means assuming the standard deviations are (i) the same, (ii) different, and (iii) regardless of whether or not the standard deviations are the same. Plot the result for all three on the same graph.
f) Plot a graph of the probability of the ratio of standard deviations assuming the means are (i) the same, (ii) different, and (iii) regardless of whether or not the means are the same. Plot all three on the same graph.
g) Explain the changes in the probabilities of the four models that occur as a result of a change in the prior ranges of the parameters, in terms of the Occam's penalty.

These calculations have been implemented in a special section in the Mathematica tutorial accompanying this book. The section is entitled "Bayesian analysis of two independent samples."
10 Linear model fitting (Gaussian errors)
10.1 Overview

An important part of the life of any physical scientist is comparing theoretical models to data. We now begin three chapters devoted to the nuts and bolts of model fitting. In this chapter, we focus on linear models.¹ By a linear model, we mean a model that is linear with respect to the model parameters, not (necessarily) with respect to the indicator variables labeling the data. We will encounter the method of linear least-squares, which is so familiar to most undergraduate science students, but we will see it as a special case in a more general Bayesian treatment.

Examples of linear models:

1. $f_i = A_0 + A_1 x_i + A_2 x_i^2 + \cdots$, where $A_0, A_1, \ldots$ are the linear model parameters, and $x_i$ is the independent (indicator) variable.
2. $f_{i,j} = A_0 + A_1 x_i + A_2 y_j + A_3 x_i y_j$, where $A_0, A_1, \ldots$ are the linear model parameters, and $x_i, y_j$ are a pair of independent variables.
3. $f_i = A_1 \cos \omega t_i + A_2 \sin \omega t_i$, where $A_1, A_2$ are the linear model parameters, and $\omega$ is a known constant.
4. $T_i = T f_i$, where $T$ is the linear parameter, and $f_i$ is a Gaussian line shape of the form $f_i = \exp\left\{ -\frac{(\nu_i - \nu_o)^2}{2\sigma_L^2} \right\}$, and $\nu_o$ and $\sigma_L$ are known constants. Such a model was considered previously in Section 3.6.

It is important to distinguish between linear and nonlinear models. In the fourth example, $T_i$ is called a linear model because $T_i$ is linearly dependent on $T$. On the other hand, if the center frequency $\nu_o$ and/or line width $\sigma_L$ were unknown, then $T_i$ would be a nonlinear model. Nonlinear parameter estimation will be considered in the following chapter.

¹ About 10 years ago, Tom Loredo circulated some very useful notes on this topic. Those notes formed the starting point for my own treatment, which is presented in this chapter.

In Section 10.2, we first derive the posterior distribution for the amplitudes (i.e., parameters) of a linear model for a signal contaminated with Gaussian noise. A remarkable feature of linear models is that the joint posterior probability distribution $p(A_1, \ldots, A_M|D, I)$ of the parameters is a multivariate (multi-dimensional) Gaussian if we assume a flat prior.² That means that there is a single peak in the joint posterior. We derive the most probable amplitudes (which are the same as those found in linear least-squares) and their errors. The errors are given by an entity called the covariance matrix of the parameters, which we will introduce in Section 10.5.1. We also revisit the use of the $\chi^2$ statistic to assess whether we can reject the model in a frequentist hypothesis test. In Section 10.3, we briefly consider the relationship between least-squares model fitting and regression analysis.

In most of this chapter, we assume that the data errors are independent and identically distributed (IID). In Section 10.2.2, we show how to generalize the results to allow for data errors with standard deviations that are not equal and also not independent. We will also show how to find the boundaries of the full joint credible regions using the $\chi^2$ distribution. We then consider how to calculate the marginal probability distribution for individual model parameters, or for a subset of amplitudes of particular interest. A useful property of Gaussian joint posterior distributions allows us to calculate any marginal distribution by maximizing with respect to the uninteresting amplitudes, instead of integrating them in a marginalization operation, which is in general more difficult. Finally, we derive some results for Bayesian model comparison with Gaussian posteriors and consider some other schemes to decide on the optimum model complexity.
10.2 Parameter estimation

Our task is to infer the parameters of some model function, $f$, that we sample in the presence of noise. We assume that we have $N$ data values, $d_i$, that are related to $N$ values of the function $f_i$, according to

$$d_i = f_i + e_i, \tag{10.1}$$

where $e_i$ represents an unknown "error" component in the measurement of $f_i$. We assume that our knowledge (or lack thereof!) of the source of the errors is described by a Gaussian distribution for the $e_i$.³ For now, we assume the distribution for each $e_i$ to be independent of the values of the other errors, and that all of the error distributions have a common standard deviation, $\sigma$. We will later generalize the results to remove the restriction of equal and independent data errors.

² Though only a linear model leads to an exactly Gaussian posterior, nonlinear models may be approximately Gaussian close to the peak of their posterior probability.

³ If the only knowledge we have about the noise is that it has a finite variance, then the MaxEnt principle of Chapter 8 tells us to assume a Gaussian. This is because it makes the fewest assumptions about the information we don't have and will lead to the most conservative estimates, i.e., greater uncertainty than we would get from choosing a more appropriate distribution based on more information.

By a linear model, we mean that $f_i$ can be written as a linear superposition of $M$ functions, $g_{i\alpha}$, where $g_{i\alpha}$ is the value of the $\alpha$th known function for the $i$th datum. The $M$ functions are each completely specified (they have no parameters); it is their relative amplitudes that are unknown and to be inferred. Denoting the coefficients of the known functions by $A_\alpha$, we thus have

$$f_i(A) = \sum_{\alpha=1}^{M} A_\alpha\, g_{i\alpha}. \tag{10.2}$$

Our task is to infer $\{A_\alpha\}$, which we will sometimes denote collectively with an unadorned $A$, as we have here. For example, if

$$f_i = A_1 + A_2 x_i + A_3 x_i^2 + A_4 x_i^3 + \cdots = \sum_{\alpha=1}^{M} A_\alpha\, g_{i\alpha}, \tag{10.3}$$

then $g_{i\alpha} = \{1, x_i, x_i^2, \ldots\}$. Note: to avoid confusing the various indices that will arise in this calculation, we are using Roman indices to label data values and Greek indices to label model basis functions. Thus, Roman indices can take on values from 1 to $N$, and Greek indices can take on values from 1 to $M$. When limits in a sum are unspecified, the sum should be taken over the full range appropriate to its index.

Our goal is to compute the joint posterior probability distribution of the parameters, $p(A_1, \ldots, A_M|D, I)$. According to Bayes' theorem, this will require us to specify priors for the parameters and to evaluate the likelihood function. The likelihood function is the joint probability for all the data values, which we denote collectively by $D$, given all of the parameters specifying the model function, $f_i$. This is just the probability that the difference between the data and the specified function values is made up by the noise. With identical, independent Gaussians for the errors, the likelihood function is simply the product of $N$ Gaussians, one for each of the $e_i = d_i - f_i$.

Figure 10.1 illustrates graphically the basis for the calculation of the likelihood function $p(D|\{A_\alpha\}, I)$ for a model $f_i$ of the form $f_i = A_1 + A_2 x_i + A_3 x_i^2$. The smooth curve is the model prediction for a specific choice of the parameters, namely $A_1 = 0.5$, $A_2 = 0.8$, $A_3 = -0.06$. The predicted values of $f_i$ for each choice of the independent variable $x_i$ are marked by a dashed line. The actual measured value of $d_i$ (represented by a cross) is located at the same value of $x_i$ but above or below $f_i$ as a result of the uncertainty $e_i$. Since the distribution of $e_i$ values is assumed to be Gaussian, at the location of each $f_i$ value, we have constructed a Gaussian probability density function (which we call a tent) for $e_i$ along the line of fixed $x_i$, with probability plotted in the z-coordinate.
[Figure 10.1: three-dimensional plot with axes labeled x-axis, y-axis, and probability density.]

Figure 10.1 This figure illustrates graphically the basis for the calculation of the likelihood function $p(D|\{A_\alpha\},I)$ for a model $f_i$ of the form $f_i = A_1 + A_2 x_i + A_3 x_i^2$. The smooth curve is the model prediction for a specific choice of the parameters. The predicted values of $f_i$ for each choice of the independent variable $x_i$ are marked by a dashed line. The actual measured value of $d_i$ (represented by a cross) is located at the same value of $x_i$ but above or below $f_i$ as a result of the uncertainty $e_i$. At the location of each $f_i$ value we have constructed a Gaussian probability density function (tent) for $e_i$ along the line of fixed $x_i$, with probability plotted in the z-coordinate. For the assumed choice of model parameters, the probability of any $d_i$ is proportional to the height of the Gaussian curve directly above the data point, which is shown by the solid line.
We have assumed that the width of each Gaussian curve, determined by $\sigma_i$, is the same for all $f_i$, but in principle they can all be different. For the assumed choice of model parameters, the probability of any $d_i$ is proportional to the height of the Gaussian curve directly above the data point, which is shown by the solid line. The probability of the data set $D$ is proportional to the product of these Gaussian heights. As we vary the choice of model parameters, the locations of the $f_i$ points and Gaussian tents move up and down while the measured data points stay fixed. For some choice of model parameters, the likelihood will be a maximum. It should be clear that the particular choice of model parameters illustrated in the figure is far from optimal, since the data values are systematically above the model. A better choice of parameters would have the data values distributed about the model. Of course, if our prior information indicated that the probability density function for $e_i$ had a different, non-Gaussian shape, then we would only need to change the shape of the probability tents.
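The "product of Gaussian heights" picture can be sketched numerically. The following is a minimal illustration, not the book's own code: the function name `log_likelihood` and the four data points are hypothetical, and the logarithm of the likelihood is used to avoid underflow when many small heights are multiplied.

```python
import math

def log_likelihood(x, d, amplitudes, sigma):
    """ln p(D|{A},I) for f_i = A1 + A2*x_i + A3*x_i**2 + ... with IID Gaussian errors."""
    n = len(d)
    q = 0.0  # quadratic form Q = sum_i (d_i - f_i)^2
    for xi, di in zip(x, d):
        fi = sum(a * xi**k for k, a in enumerate(amplitudes))
        q += (di - fi) ** 2
    return -n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi) - q / (2 * sigma**2)

# Hypothetical data lying exactly on f = 1 + 2x (illustration only).
x = [0.0, 1.0, 2.0, 3.0]
d = [1.0, 3.0, 5.0, 7.0]
good = log_likelihood(x, d, [1.0, 2.0, 0.0], 1.0)  # model passes through the data
bad = log_likelihood(x, d, [0.0, 2.0, 0.0], 1.0)   # model sits one unit below the data
```

Shifting the model away from the data lowers every Gaussian height, so `bad` is smaller than `good` by $Q/2\sigma^2$.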
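The product of these $N$ IID Gaussians is given by

$$p(D|\{A_\alpha\},I) = \frac{1}{\sigma^N (2\pi)^{N/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(d_i - f_i)^2\right] = \frac{1}{\sigma^N (2\pi)^{N/2}}\, e^{-Q/2\sigma^2}. \tag{10.4}$$

In Equation (10.4), the quadratic form $Q(\{A_\alpha\})$ is

$$\begin{aligned} Q(\{A_\alpha\}) &= \sum_{i=1}^{N}(d_i - f_i)^2 = \sum_{i=1}^{N}\left(d_i - \sum_{\alpha=1}^{M} A_\alpha g_{i\alpha}\right)^2 \\ &= \sum_{i=1}^{N} d_i^2 + \sum_{i=1}^{N}\sum_{\alpha\beta} A_\alpha A_\beta\, g_{i\alpha} g_{i\beta} - 2\sum_{i=1}^{N} d_i \sum_{\alpha=1}^{M} A_\alpha g_{i\alpha}. \end{aligned} \tag{10.5}$$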
To get to our destination, we will take advantage of the quadratic nature of Gaussians to simplify our notation. The new notation will not only make things look simpler, it will actually simplify the calculations themselves, and their interpretation. The new notation will eliminate Roman (data) indices by denoting such quantities as vectors. Thus, the $N$ data are written as $\vec{d}$, the $N$ values of the total model are written as $\vec{f}$, the $N$ error values are written as $\vec{e}$, and the $M$ model functions are written as $\vec{g}_\alpha$. In terms of these $N$-dimensional vectors, the data equation is

$$\vec{d} = \vec{f} + \vec{e}, \tag{10.6}$$

with

$$\vec{f} = \sum_{\alpha} A_\alpha \vec{g}_\alpha. \tag{10.7}$$

Note: $\vec{f}$ is the sum of $M$ vectors, where usually $M < N$. Thus, the model spans an $M$-dimensional subspace of the $N$-dimensional data space. Actually, if one or more of the $\vec{g}_\alpha$'s can be written as a linear superposition of the others, the model spans a subspace of dimension less than $M$. Hereafter, we will refer to the $\vec{g}_\alpha$'s as the basis functions for the model.
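The quadratic form is $Q = (\vec{d} - \vec{f})^2 = \vec{e}\cdot\vec{e} = e^2$, the squared magnitude of the error vector extending from $\vec{f}$ to $\vec{d}$. In terms of the basis functions, the quadratic form can be written

$$\begin{aligned} Q(\{A_\alpha\}) &= (\vec{d} - \vec{f})^2 = d^2 + f^2 - 2\,\vec{d}\cdot\vec{f} \\ &= d^2 + \sum_{\alpha\beta} A_\alpha A_\beta\, \vec{g}_\alpha\cdot\vec{g}_\beta - 2\sum_{\alpha} A_\alpha\, \vec{d}\cdot\vec{g}_\alpha \\ &= d^2 + \sum_{\alpha} A_\alpha^2\, \vec{g}_\alpha\cdot\vec{g}_\alpha + 2\sum_{\alpha<\beta} A_\alpha A_\beta\, \vec{g}_\alpha\cdot\vec{g}_\beta - 2\sum_{\alpha} A_\alpha\, \vec{d}\cdot\vec{g}_\alpha, \end{aligned} \tag{10.8}$$

where we follow the usual notation, $\vec{a}\cdot\vec{b} = \sum_i a_i b_i$, and $a^2 = \vec{a}\cdot\vec{a}$. In the last line each unordered pair $\alpha \neq \beta$ is counted once, which accounts for the factor of 2. It follows that

$$\sum_{\alpha\beta} A_\alpha A_\beta\, \vec{g}_\alpha\cdot\vec{g}_\beta = \sum_{i}\sum_{\alpha\beta} A_\alpha A_\beta\, g_{i\alpha} g_{i\beta}, \tag{10.9}$$

where

$$\sum_{\alpha\beta} = \sum_{\alpha=1}^{M}\sum_{\beta=1}^{M}. \tag{10.10}$$

The quantities $\vec{g}_\alpha\cdot\vec{g}_\beta$ and $\vec{d}\cdot\vec{g}_\alpha$ are easily computable from the data values $d_i$ and the corresponding values of the basis functions $g_{i\alpha}$.

To estimate the amplitudes, we need to assign a prior density to them. We will simply use a uniform prior that is constant over some range $\Delta A_\alpha$ for each parameter, so that

$$p(\{A_\alpha\}|I) = \frac{1}{\prod_\alpha \Delta A_\alpha}. \tag{10.11}$$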
Then as long as we are inside the prior range, the posterior density for the amplitudes is just proportional to the likelihood function

$$p(\{A_\alpha\}|D,I) = C\, e^{-Q(\{A_\alpha\})/2\sigma^2}. \tag{10.12}$$

Outside the prior range, the posterior vanishes. In this equation, $C$ is a normalization constant[4]

$$C = \frac{p(\{A_\alpha\}|I)}{p(D|I)}, \tag{10.13}$$

where the global likelihood in the denominator is

$$\begin{aligned} p(D|I) &= \int_{\Delta A} d^M\!A_\alpha\; p(\{A_\alpha\}|I)\, p(D|\{A_\alpha\},I) \\ &= \frac{1}{\prod_\alpha \Delta A_\alpha}\, \frac{1}{\sigma^N (2\pi)^{N/2}} \int_{\Delta A} d^M\!A_\alpha\; e^{-Q/2\sigma^2}. \end{aligned} \tag{10.14}$$

[4] This is only true for the choice of a uniform prior for $p(\{A_\alpha\}|I)$, appropriate for amplitudes which are location parameters. For a scale parameter like a temperature, we should use a Jeffreys prior, and then $C$ is no longer a constant. However, for many parameter estimation problems, the likelihood function selects out a very narrow range of the parameter space over which a Jeffreys prior is effectively constant. Thus, the exact choice of prior used is only critical if there are few data or we are dealing with a model selection problem. In the latter case, the choice of prior can have a big effect, as we saw in Section 3.8.1.
We can calculate the value of $C$ if needed (indeed, we will do so below when we discuss model comparison), but since it is independent of the $A_\alpha$, we don't need to know it to address many parameter estimation questions. The full joint posterior distribution given by Equation (10.12) is the Bayesian answer to the question, "What do the data tell us about the $A_\alpha$ parameters?" It is usually useful to have simple summaries of the posterior, especially if it is of high dimension, in which case it is often difficult to depict it (either mentally or graphically). We will devote the next few subsections to finding point estimates for the amplitudes (most probable and mean values) and credible regions. We'll also discuss how to summarize the implications of the data for a subset of the amplitudes by marginalizing out the uninteresting amplitudes. In fact, since the mean amplitudes require integrating the posterior over the $A_\alpha$, we'll have to learn how to do such marginalization integrals before we can calculate the mean amplitudes.
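10.2.1 Most probable amplitudes

The most probable values for the amplitudes are the values that maximize the posterior (Equation (10.12)), which (because of our uniform prior) are the values that minimize $Q$ and lead to the "normal equations" of the method of least-squares. Denoting the most probable values by $\hat{A}_\alpha$, we can find them by solving the following set of $M$ equations (one for each value of $\alpha$):

$$\left.\frac{\partial Q}{\partial A_\alpha}\right|_{A=\hat{A}} = 2\sum_{\beta} \hat{A}_\beta\, \vec{g}_\beta\cdot\vec{g}_\alpha - 2\,\vec{g}_\alpha\cdot\vec{d} = 0, \tag{10.15}$$

or,

$$\sum_{\beta} \hat{A}_\beta\, \vec{g}_\beta\cdot\vec{g}_\alpha = \vec{g}_\alpha\cdot\vec{d}. \tag{10.16}$$

For the $M = 2$ case, Equation (10.16) corresponds to the two equations

$$\hat{A}_1\, \vec{g}_1\cdot\vec{g}_1 + \hat{A}_2\, \vec{g}_2\cdot\vec{g}_1 = \vec{g}_1\cdot\vec{d} \tag{10.17}$$

and

$$\hat{A}_1\, \vec{g}_1\cdot\vec{g}_2 + \hat{A}_2\, \vec{g}_2\cdot\vec{g}_2 = \vec{g}_2\cdot\vec{d}. \tag{10.18}$$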
Define the most probable model vector, $\hat{\vec{f}}$, by

$$\hat{\vec{f}} = \sum_{\alpha} \hat{A}_\alpha\, \vec{g}_\alpha. \tag{10.19}$$

Using $\hat{\vec{f}}$, Equation (10.16) can be written as

$$\hat{\vec{f}}\cdot\vec{g}_\alpha = \vec{g}_\alpha\cdot\vec{d}. \tag{10.20}$$

This doesn't help us solve Equation (10.16), but it gives us a bit of insight: the most probable total model function is the one whose projection on each basis function equals the data's projection on each basis function. Crudely, the most probable model vector explains as much of the data as can be spanned by the $M$-dimensional model basis. Equations (10.17) and (10.18) can be written in the following matrix form:

$$\begin{pmatrix} \vec{g}_1\cdot\vec{g}_1 & \vec{g}_1\cdot\vec{g}_2 \\ \vec{g}_2\cdot\vec{g}_1 & \vec{g}_2\cdot\vec{g}_2 \end{pmatrix} \begin{pmatrix} \hat{A}_1 \\ \hat{A}_2 \end{pmatrix} = \begin{pmatrix} \vec{g}_1\cdot\vec{d} \\ \vec{g}_2\cdot\vec{d} \end{pmatrix}. \tag{10.21}$$
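Problem: Evaluate Equation (10.21) for a straight line model.

Solution: When fitting a straight line, $f = A_1 + A_2 x$, the two basis functions are $g_{i1} = 1$ and $g_{i2} = x_i$. In this case, the matrix elements are given by:

$$\vec{g}_1\cdot\vec{g}_1 = \sum_{i}^{N} g_{i1}^2 = \sum_{i}^{N} 1 = N, \tag{10.22}$$

$$\vec{g}_1\cdot\vec{g}_2 = \vec{g}_2\cdot\vec{g}_1 = \sum_{i}^{N} g_{i1} g_{i2} = \sum_{i}^{N} x_i, \tag{10.23}$$

$$\vec{g}_2\cdot\vec{g}_2 = \sum_{i}^{N} g_{i2} g_{i2} = \sum_{i}^{N} x_i^2, \tag{10.24}$$

$$\vec{g}_1\cdot\vec{d} = \sum_{i}^{N} g_{i1} d_i = \sum_{i}^{N} d_i, \tag{10.25}$$

$$\vec{g}_2\cdot\vec{d} = \sum_{i}^{N} g_{i2} d_i = \sum_{i}^{N} d_i x_i, \tag{10.26}$$

and Equation (10.21) becomes

$$\begin{pmatrix} N & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} \begin{pmatrix} \hat{A}_1 \\ \hat{A}_2 \end{pmatrix} = \begin{pmatrix} \sum_i d_i \\ \sum_i d_i x_i \end{pmatrix}. \tag{10.27}$$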
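The normal equations, Equation (10.16), can be assembled and solved for any number of basis functions. The sketch below is one possible illustration in plain Python, not the book's Mathematica notebook: the helper names `dot`, `solve` and `most_probable_amplitudes` are assumptions, the data are hypothetical values placed exactly on $1 + 2x + 3x^2$, and Gaussian elimination stands in for whatever linear solver one prefers.

```python
def dot(a, b):
    """Ordinary dot product a.b = sum_i a_i b_i."""
    return sum(ai * bi for ai, bi in zip(a, b))

def solve(matrix, vector):
    """Solve matrix @ x = vector by Gaussian elimination with partial pivoting."""
    m = len(vector)
    aug = [list(row) + [v] for row, v in zip(matrix, vector)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * m
    for r in reversed(range(m)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, m))
        x[r] = (aug[r][m] - tail) / aug[r][r]
    return x

def most_probable_amplitudes(basis, d):
    """Solve sum_beta A_beta (g_beta . g_alpha) = g_alpha . d for the amplitudes."""
    lhs = [[dot(ga, gb) for gb in basis] for ga in basis]
    rhs = [dot(ga, d) for ga in basis]
    return solve(lhs, rhs)

# Hypothetical noise-free data on f = 1 + 2x + 3x^2 (illustration only).
x_values = [0.0, 1.0, 2.0, 3.0, 4.0]
d = [1 + 2 * x + 3 * x * x for x in x_values]
basis = [[1.0] * len(x_values), x_values, [x * x for x in x_values]]
A_hat = most_probable_amplitudes(basis, d)
```

Because the data contain no noise, the recovered amplitudes should match the generating values 1, 2 and 3 to rounding error.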
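It will prove useful to express Equation (10.16) in a more compact matrix form. Let $G$ be an $N \times M$ matrix where the $\alpha$th column contains $N$ values of the $\alpha$th basis function evaluated at each of the $N$ data locations. As an example, consider the $M = 2$ case again. The two basis functions for the $i$th data value are $g_{i1}$ and $g_{i2}$, and $G$ is given by

$$G \equiv \begin{pmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \\ \vdots & \vdots \\ g_{N1} & g_{N2} \end{pmatrix}. \tag{10.28}$$

Now take the transpose of $G$, which is given by

$$G^T \equiv \begin{pmatrix} g_{11} & g_{21} & \cdots & g_{N1} \\ g_{12} & g_{22} & \cdots & g_{N2} \end{pmatrix}. \tag{10.29}$$

Define the matrix $\psi = G^T G$, which for $M = 2$ is given by

$$\psi \equiv \begin{pmatrix} \sum_i g_{i1}^2 & \sum_i g_{i1} g_{i2} \\ \sum_i g_{i2} g_{i1} & \sum_i g_{i2}^2 \end{pmatrix} = \begin{pmatrix} \vec{g}_1^{\,2} & \vec{g}_1\cdot\vec{g}_2 \\ \vec{g}_2\cdot\vec{g}_1 & \vec{g}_2^{\,2} \end{pmatrix} = \begin{pmatrix} \psi_{11} & \psi_{12} \\ \psi_{21} & \psi_{22} \end{pmatrix}. \tag{10.30}$$

Thus, $\psi$ is a symmetric matrix because $\vec{g}_1\cdot\vec{g}_2 = \vec{g}_2\cdot\vec{g}_1$. More generally,

$$\psi_{\alpha\beta} = \vec{g}_\alpha\cdot\vec{g}_\beta = \psi_{\beta\alpha}. \tag{10.31}$$

Finally, if $D$ is a column matrix of data values $d_i$, then for $M = 2$

$$G^T D = \begin{pmatrix} \sum_i g_{i1} d_i \\ \sum_i g_{i2} d_i \end{pmatrix} = \begin{pmatrix} \vec{g}_1\cdot\vec{d} \\ \vec{g}_2\cdot\vec{d} \end{pmatrix}. \tag{10.32}$$

Equation (10.21) can now be written as

$$G^T G \hat{A} = G^T D. \tag{10.33}$$

The solution to this matrix equation is given by

$$\hat{A} = (G^T G)^{-1} G^T D = \psi^{-1} G^T D, \tag{10.34}$$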
or in component form becomes

$$\hat{A}_\alpha = \sum_{\beta} [\psi^{-1}]_{\alpha\beta}\; \vec{g}_\beta\cdot\vec{d}. \tag{10.35}$$

In the method of least-squares, the set of equations represented by Equation (10.33) are referred to as the normal equations. Again, for the $M = 2$ case, we can write Equation (10.33) in long form:

$$\begin{pmatrix} \psi_{11} & \psi_{12} \\ \psi_{21} & \psi_{22} \end{pmatrix} \begin{pmatrix} \hat{A}_1 \\ \hat{A}_2 \end{pmatrix} = \begin{pmatrix} \vec{g}_1\cdot\vec{d} \\ \vec{g}_2\cdot\vec{d} \end{pmatrix}, \tag{10.36}$$

where $\psi_{12} = \psi_{21}$. The solution (Equation (10.34)) is given by

$$\begin{pmatrix} \hat{A}_1 \\ \hat{A}_2 \end{pmatrix} = \frac{1}{\Delta} \begin{pmatrix} \psi_{22} & -\psi_{12} \\ -\psi_{21} & \psi_{11} \end{pmatrix} \begin{pmatrix} \vec{g}_1\cdot\vec{d} \\ \vec{g}_2\cdot\vec{d} \end{pmatrix}, \tag{10.37}$$

where $\psi_{12} = \psi_{21}$ and the denominator, $\Delta$, is the determinant of the $\psi$ matrix, given by

$$\Delta = (\psi_{11}\psi_{22} - \psi_{12}^2). \tag{10.38}$$

$$\hat{A}_1 = \frac{\psi_{22}\,(\vec{g}_1\cdot\vec{d}) - \psi_{12}\,(\vec{g}_2\cdot\vec{d})}{\psi_{11}\psi_{22} - \psi_{12}^2} \tag{10.39}$$

$$\hat{A}_2 = \frac{-\psi_{12}\,(\vec{g}_1\cdot\vec{d}) + \psi_{11}\,(\vec{g}_2\cdot\vec{d})}{\psi_{11}\psi_{22} - \psi_{12}^2}. \tag{10.40}$$

Note: the matrix $\psi$ must be non-singular (the basis vectors must be linearly independent) for a solution to exist. We henceforth assume that any redundant basis models have been eliminated, so that $\psi$ is non-singular.[5]

Problem: Evaluate Equation (10.34) for the straight line model.

Solution: Comparison of Equations (10.27) and (10.36) allows an evaluation of all the terms needed for Equations (10.39) and (10.40).

$$\hat{A}_1 = \frac{\sum_i x_i^2 \sum_i d_i - \sum_i x_i \sum_i x_i d_i}{N \sum_i x_i^2 - \left(\sum_i x_i\right)^2}$$

$$\hat{A}_2 = \frac{-\sum_i x_i \sum_i d_i + N \sum_i x_i d_i}{N \sum_i x_i^2 - \left(\sum_i x_i\right)^2}.$$

[5] Sometimes the data do not clearly distinguish between two or more of the basis functions provided, and $\psi$ gets sufficiently close to being singular that the answer becomes extremely sensitive to round-off errors. The solution in this case is to use singular value decomposition, which is discussed in Appendix A.
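The closed-form straight-line solution translates directly into a few lines of code. This is a minimal sketch of Equations (10.39) and (10.40) for the basis $g_{i1} = 1$, $g_{i2} = x_i$; the function name `fit_line` and the four data points are hypothetical and chosen so the answer is obvious by inspection.

```python
def fit_line(x, d):
    """Most probable intercept A1 and slope A2 for f = A1 + A2*x with IID errors."""
    n = len(x)
    sx = sum(x)                                  # psi_12 = sum of x_i
    sxx = sum(xi * xi for xi in x)               # psi_22 = sum of x_i^2
    sd = sum(d)                                  # g_1 . d
    sxd = sum(xi * di for xi, di in zip(x, d))   # g_2 . d
    det = n * sxx - sx * sx                      # determinant of psi
    a1 = (sxx * sd - sx * sxd) / det
    a2 = (-sx * sd + n * sxd) / det
    return a1, a2

# Hypothetical data lying exactly on d = 1 + 2x (illustration only).
a1_hat, a2_hat = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

If all the $x_i$ are equal, `det` is zero and the division fails, which is the singular case described in the note above.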
10.2.2 More powerful matrix formulation

Everything we have done so far has assumed that the error associated with each datum is independent of the errors for the others, and that the Gaussian describing our knowledge of the magnitude of the error has the same variance for each datum. In general, however, the errors can have different variances, and could be correlated. In that case, we need to replace the likelihood function $p(D|\{A_\alpha\},I)$, given by Equation (10.4), by the multivariate Gaussian (Equation (8.61)) which we derived using the MaxEnt principle in Section 8.7.5 and Appendix E. The new likelihood is

$$\begin{aligned} p(D|\{A_\alpha\},I) &= \frac{1}{(2\pi)^{N/2}\sqrt{\det E}} \exp\left[-\frac{1}{2}\sum_{ij}(d_i - f_i)\,[E^{-1}]_{ij}\,(d_j - f_j)\right] \\ &= \frac{1}{(2\pi)^{N/2}\sqrt{\det E}}\; e^{-\chi^2/2}, \end{aligned} \tag{10.41}$$

where $E$ is called the data covariance matrix[6]

$$E = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} & \cdots & \sigma_{1N} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} & \cdots & \sigma_{2N} \\ \vdots & \vdots & \vdots & & \vdots \\ \sigma_{N1} & \sigma_{N2} & \sigma_{N3} & \cdots & \sigma_{NN} \end{pmatrix}, \tag{10.42}$$

and

$$\sum_{ij} = \sum_{i=1}^{N}\sum_{j=1}^{N}.$$

How does this affect the "normal equations" of the method of least-squares? We simply need to replace $Q/\sigma^2$ appearing in the likelihood, Equation (10.4), by

$$\chi^2 = \sum_{i,j}(d_i - f_i)\,[E^{-1}]_{ij}\,(d_j - f_j). \tag{10.43}$$

If the errors are independent, $E$ is diagonal, with entries equal to $\sigma_i^2$, and Equation (10.43) takes the familiar form

$$\chi^2 = \sum_{i}\frac{(d_i - f_i)^2}{\sigma_i^2} = \sum_{i}\frac{e_i^2}{\sigma_i^2}. \tag{10.44}$$

The inverse data covariance matrix, $E^{-1}$, plays the role of a metric[7] in the full $N$-dimensional vector space of the data. Thus, if we set $\sigma = 1$, and understand $\vec{a}\cdot\vec{b}$ to stand for $\vec{a}\,[E^{-1}]\,\vec{b}$ everywhere a dot product occurs in the above analysis, then we already have the desired generalization! Thus, we have that

$$\vec{d}\cdot\vec{d} = \sum_{ij} d_i\,[E^{-1}]_{ij}\,d_j \tag{10.45}$$

and

$$\vec{g}_\alpha\cdot\vec{g}_\beta = \sum_{ij} g_{i\alpha}\,[E^{-1}]_{ij}\,g_{j\beta}. \tag{10.46}$$

The new equivalents of Equations (10.33) and (10.34) are

$$G^T E^{-1} G \hat{A} = G^T E^{-1} D \tag{10.47}$$

$$\hat{A} = (G^T E^{-1} G)^{-1} G^T E^{-1} D = \Psi^{-1} G^T E^{-1} D. \tag{10.48}$$

To bring out the changes more clearly, we repeat our earlier Equation (10.34) for the model amplitudes.

$$\hat{A} = (G^T G)^{-1} G^T D = \psi^{-1} G^T D. \tag{10.49}$$

[6] If we designate each element in the covariance matrix by $\sigma_{ij}$, then the diagonal elements are given by $\sigma_{ii} = \sigma_i^2 = \langle e_i^2 \rangle$, where $\langle e_i^2 \rangle$ is the expectation value of $e_i^2$. The off-diagonal elements are $\sigma_{ij} = \langle e_i e_j \rangle$ $(i \neq j)$.

[7] The metric of a vector space is useful for answering questions having to do with the geometry of the vector space, such as the distance between two points. In our work, we use it to compute the dot product in the vector space.
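Notice that whenever we employ $E$, we need to replace $\psi$ by its capitalized form $\Psi$. Recall the matrix $\psi$ did not incorporate information about the data errors while $\Psi$ does. In the case that the data errors all have the same $\sigma$ and are independent, then $\Psi = \psi/\sigma^2$. The following problem employs the $\Psi$ matrix for the case of independent errors. We consider a problem with correlated errors in Section 10.6 after we have introduced the correlation coefficient.

Problem: Fit a straight line model to the data given in Table 10.1, where $d_i$ is the average of $n_i$ data values measured at $x_i$. The probability of the individual $d_i$ measurements is IID normal with $\sigma^2 = 8.1$, regardless of the $x_i$ value. Recall from the Central Limit Theorem, $p(d_i|I)$ will tend to a Gaussian with variance $= \sigma^2/n_i$ as $n_i$ increases even if $p(d_i|I)$ is very non-Gaussian.

Table 10.1 Data table.

  $x_i$    $d_i$    $n_i$
  10       0.5      14
  20       4.67     3
  30       6.25     25
  40       10.0     2
  50       13.5     3
  60       13.7     22
  70       17.5     5
  80       23.0     2

Solution: The data covariance matrix $E$ can be written as

$$E = \sigma^2 \begin{pmatrix} 1/n_1 & 0 & 0 & \cdots & 0 \\ 0 & 1/n_2 & 0 & \cdots & 0 \\ 0 & 0 & 1/n_3 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1/n_N \end{pmatrix}. \tag{10.50}$$

The inverse data covariance matrix $E^{-1}$ is

$$E^{-1} = \frac{1}{\sigma^2} \begin{pmatrix} n_1 & 0 & 0 & \cdots & 0 \\ 0 & n_2 & 0 & \cdots & 0 \\ 0 & 0 & n_3 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & n_N \end{pmatrix} = \frac{1}{\sigma^2} \begin{pmatrix} 14 & 0 & 0 & \cdots & 0 \\ 0 & 3 & 0 & \cdots & 0 \\ 0 & 0 & 25 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 2 \end{pmatrix}. \tag{10.51}$$

$$\begin{aligned} \Psi = G^T E^{-1} G &= \frac{1}{\sigma^2} \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ 10 & 20 & 30 & \cdots & 80 \end{pmatrix} \begin{pmatrix} 14 & 0 & 0 & \cdots & 0 \\ 0 & 3 & 0 & \cdots & 0 \\ 0 & 0 & 25 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 2 \end{pmatrix} \begin{pmatrix} 1 & 10 \\ 1 & 20 \\ 1 & 30 \\ \vdots & \vdots \\ 1 & 80 \end{pmatrix} \\ &= \frac{1}{\sigma^2} \begin{pmatrix} 76 & 3010 \\ 3010 & 152\,300 \end{pmatrix}. \end{aligned} \tag{10.52}$$

$$\Psi^{-1} = \sigma^2 \begin{pmatrix} 0.061 & -0.0012 \\ -0.0012 & 0.000030 \end{pmatrix}. \tag{10.53}$$

Let

$$R \equiv G^T E^{-1} D = \frac{1}{\sigma^2} \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ 10 & 20 & 30 & \cdots & 80 \end{pmatrix} \begin{pmatrix} 14 & 0 & 0 & \cdots & 0 \\ 0 & 3 & 0 & \cdots & 0 \\ 0 & 0 & 25 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 2 \end{pmatrix} \begin{pmatrix} 0.5 \\ 4.67 \\ 6.25 \\ \vdots \\ 23.0 \end{pmatrix}. \tag{10.54}$$

Then,

$$\hat{A} = \Psi^{-1} R. \tag{10.55}$$

$$\begin{pmatrix} \hat{A}_1 \\ \hat{A}_2 \end{pmatrix} = \begin{pmatrix} -2.054 \\ 0.275 \end{pmatrix}. \tag{10.56}$$

Figure 10.2 shows the straight line model fit, $y = 0.275x - 2.054$, with the data plus error bars overlaid. Mathematica provides a variety of simple ways to enter the G, E, D matrices, and we can then evaluate the matrix of amplitudes, A, with the commands below. Because E and D are protected built-in symbols in Mathematica, the matrices are stored here under the names Emat and Dmat:

    Ψ = Transpose[G].Inverse[Emat].G
    A = Inverse[Ψ].Transpose[G].Inverse[Emat].Dmat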
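The worked problem can be checked numerically. The sketch below is a plain-Python illustration of Equations (10.52) to (10.56) for a diagonal $E$, using the values of Table 10.1 and $\sigma^2 = 8.1$; the variable names are assumptions, and the 2 × 2 system is solved with the explicit inverse rather than a general matrix routine.

```python
# Data from Table 10.1: x_i, averaged datum d_i, and number of samples n_i.
x = [10, 20, 30, 40, 50, 60, 70, 80]
d = [0.5, 4.67, 6.25, 10.0, 13.5, 13.7, 17.5, 23.0]
n = [14, 3, 25, 2, 3, 22, 5, 2]
sigma2 = 8.1

# Diagonal entries of the inverse data covariance matrix, n_i / sigma^2.
w = [ni / sigma2 for ni in n]

# Elements of Psi = G^T E^-1 G for the basis g_i1 = 1, g_i2 = x_i.
psi11 = sum(w)
psi12 = sum(wi * xi for wi, xi in zip(w, x))
psi22 = sum(wi * xi * xi for wi, xi in zip(w, x))

# Elements of R = G^T E^-1 D.
r1 = sum(wi * di for wi, di in zip(w, d))
r2 = sum(wi * xi * di for wi, xi, di in zip(w, x, d))

# A_hat = Psi^-1 R, written out with the explicit 2x2 inverse.
det = psi11 * psi22 - psi12 ** 2
a1_hat = (psi22 * r1 - psi12 * r2) / det    # intercept
a2_hat = (-psi12 * r1 + psi11 * r2) / det   # slope
```

The points with large $n_i$ carry the largest weights, which is why the fit is dominated by the data with the smallest error bars.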
10.3 Regression analysis

Least-squares fitting is often called regression analysis for historical reasons. For example, Mathematica provides the package called Statistics`LinearRegression` for doing a linear least-squares fit. Regression analysis applies when we have, for example, two quantities like height and weight, or income and education, and we want to predict one from information about the other. This is possible if there is a correlation between the two quantities, even if we lack a model to account for the correlation. Typically in these problems, there is little experimental error in the measurements compared to the intrinsic spread in their values. Thus, we can talk about the regression line for income on education. The regression line is often called the least-squares line because it makes the sum of the squares of the residuals as small as possible.

[Figure 10.2: plot of the data with error bars and the fitted straight line; y-axis versus x-axis.]

Figure 10.2 Straight line model fit with data plus errors overlaid. Clearly, the best fitting line is mostly determined by the data points with smallest error bars.

In contrast, model fitting usually assumes that there is an underlying exact relationship between some measured quantities, and we are interested in the best choice of model parameters or in comparing competing models. In some areas, especially in the life sciences, the phenomena under study are sufficiently complex that good models are hard to come by and regression analysis is the name of the game. The results from a regression analysis may make no physical sense but may point the way to a physical model.

Consider the following simple example, which assumes the experimenter is ignorant of an elementary geometrical relationship, that the area of a rectangle is equal to the product of the width and height. The experimenter fabricates many differently shaped rectangles and determines the area of each rectangle by counting the number of very small squares that fit into the rectangle. Our experimenter then examines whether there is a correlation between rectangle area and perimeter. The resulting regression line would look like a line with considerable scatter, because the perimeter is not a good physical model for the area. In contrast, a plot of area versus the product of the width times the height would be an almost perfect straight line, limited only by the accuracy of the width and height measurements.
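10.4 The posterior is a Gaussian

We have succeeded in finding the most probable model parameters. Now we want to determine the shape of their joint probability distribution with an eye to specifying credible regions for each parameter. We will continue to work with the simple case where all the data errors are assumed to be IID so that $p(\{A_\alpha\}|D,I)$ is given by Equation (10.12),

$$p(\{A_\alpha\}|D,I) = C\, e^{-Q(\{A_\alpha\})/2\sigma^2}. \tag{10.57}$$

Then maximizing $p(\{A_\alpha\}|D,I)$ corresponds to minimizing $Q$. Since we've already taken one derivative of $Q$ (Equation (10.15)), let's see what happens when we take another. Define $\delta A_\alpha = A_\alpha - \hat{A}_\alpha$. Call the value of $Q$ at the mode $Q_{\min}$. Recall that the mode is the value that maximizes the probability density. Consider a Taylor series expansion of $Q$ about $Q_{\min}$:

$$Q = Q_{\min} + \sum_{\alpha} \left.\frac{\partial Q}{\partial A_\alpha}\right|_{\min} \delta A_\alpha + \frac{1}{2}\sum_{\alpha\beta} \left.\frac{\partial^2 Q}{\partial A_\alpha\, \partial A_\beta}\right|_{\min} \delta A_\alpha\, \delta A_\beta. \tag{10.58}$$

The first derivative is zero at the minimum, and from Equation (10.8) it is clear there are no higher derivatives than the second. We are now in the position to write $Q$ in a form that explicitly reveals the posterior distribution to be a multivariate Gaussian.

$$Q = Q_{\min} + \Delta Q(\delta A), \tag{10.59}$$

and

$$\Delta Q(\delta A) = \frac{1}{2}\sum_{\alpha\beta} \left.\frac{\partial^2 Q}{\partial A_\alpha\, \partial A_\beta}\right|_{\min} \delta A_\alpha\, \delta A_\beta. \tag{10.60}$$

Taking another derivative of Equation (10.15) and substituting from Equation (10.31), we get the equation[8]

$$\left.\frac{\partial^2 Q}{\partial A_\alpha\, \partial A_\beta}\right|_{\min} = 2\,\vec{g}_\alpha\cdot\vec{g}_\beta = 2\psi_{\alpha\beta} = 2\psi_{\beta\alpha}. \tag{10.62}$$

Substituting this into Equation (10.60), we get the equation

$$\Delta Q(\delta A) = \sum_{\alpha\beta} \delta A_\alpha\, \psi_{\alpha\beta}\, \delta A_\beta, \tag{10.63}$$

where $\psi$ is a symmetric matrix. Note: the differential $d(\delta A_\alpha) = dA_\alpha$, so densities for the $\delta A_\alpha$ are directly equal to densities for the $A_\alpha$. Thus,

$$p(\{A_\alpha\}|D,I) = C' \exp\left(-\frac{\Delta Q}{2\sigma^2}\right), \tag{10.64}$$

where

$$C' = C \exp\left(-\frac{Q_{\min}}{2\sigma^2}\right) \tag{10.65}$$

is an adjusted normalization constant. If we let $\delta A$ be a column matrix of $\delta A_\alpha$ values, then Equation (10.64) can be written as

$$p(\{A_\alpha\}|D,I) = C' \exp\left(-\frac{\delta A^T \psi\, \delta A}{2\sigma^2}\right). \tag{10.66}$$

[8] For $M = 2$, $\psi$ is given by

$$\psi \equiv \begin{pmatrix} \vec{g}_1^{\,2} & \vec{g}_1\cdot\vec{g}_2 \\ \vec{g}_2\cdot\vec{g}_1 & \vec{g}_2^{\,2} \end{pmatrix} = \frac{1}{2} \begin{pmatrix} \dfrac{\partial^2 Q}{\partial A_1^2} & \dfrac{\partial^2 Q}{\partial A_1\, \partial A_2} \\[2ex] \dfrac{\partial^2 Q}{\partial A_2\, \partial A_1} & \dfrac{\partial^2 Q}{\partial A_2^2} \end{pmatrix}. \tag{10.61}$$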
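Because $Q$ is exactly quadratic in the amplitudes, the expansion $Q = Q_{\min} + \sum_{\alpha\beta} \delta A_\alpha \psi_{\alpha\beta} \delta A_\beta$ holds for any displacement, not only small ones. The following sketch checks this numerically for a straight line; the data values and the displacement are hypothetical, and the function names are illustrative only.

```python
# Hypothetical straight-line data (illustration only).
x = [0.0, 1.0, 2.0, 3.0]
d = [1.1, 2.9, 5.2, 6.8]

def Q(a1, a2):
    """Quadratic form Q = sum_i (d_i - A1 - A2*x_i)^2."""
    return sum((di - a1 - a2 * xi) ** 2 for xi, di in zip(x, d))

# psi = G^T G for the basis g_i1 = 1, g_i2 = x_i.
n = len(x)
sx = sum(x)
sxx = sum(xi * xi for xi in x)
psi = [[n, sx], [sx, sxx]]

# Most probable amplitudes from the 2x2 normal equations.
sd = sum(d)
sxd = sum(xi * di for xi, di in zip(x, d))
det = n * sxx - sx * sx
a1_hat = (sxx * sd - sx * sxd) / det
a2_hat = (-sx * sd + n * sxd) / det

def delta_Q(da1, da2):
    """Delta Q = sum over alpha, beta of dA_alpha psi_alpha_beta dA_beta."""
    return psi[0][0] * da1 ** 2 + 2 * psi[0][1] * da1 * da2 + psi[1][1] * da2 ** 2

da1, da2 = 0.3, -0.2  # an arbitrary, not especially small, displacement
lhs = Q(a1_hat + da1, a2_hat + da2)
rhs = Q(a1_hat, a2_hat) + delta_Q(da1, da2)
```

The two sides agree to rounding error, which is the statement that the posterior is exactly Gaussian in the amplitudes.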
Since $\psi$ is symmetric there is a change of variable, $\delta A = OX$, that transforms[9] the quadratic form $\delta A^T \psi\, \delta A$ into the quadratic form $X^T \Lambda X$ (Principal Axis Theorem). $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues of the $\psi$ matrix, and the columns of $O$ are the orthonormal eigenvectors of $\psi$. For the $M = 2$ case, we have

$$\Delta Q = \begin{pmatrix} X_1 & X_2 \end{pmatrix} \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \lambda_1 X_1^2 + \lambda_2 X_2^2, \tag{10.67}$$

where $\lambda_1$ and $\lambda_2$ are the eigenvalues of $\psi$. They are all positive since $\psi$ is positive definite.[10] Thus, $\Delta Q = k$ (a constant) defines the ellipse (see Figure 10.3),

$$\frac{X_1^2}{k/\lambda_1} + \frac{X_2^2}{k/\lambda_2} = 1, \tag{10.68}$$

with major and minor axes given by $\sqrt{k/\lambda_1}$ and $\sqrt{k/\lambda_2}$, respectively.

[Figure 10.3: ellipse $\Delta Q = k$ in the $(A_1, A_2)$ plane, with eigenvector directions $\vec{e}_1$, $\vec{e}_2$ and axes labeled $\sqrt{k/\lambda_1}$ and $\sqrt{k/\lambda_2}$.]

Figure 10.3 The contour in $A_1$–$A_2$ parameter space for $\Delta Q =$ a constant $k$. It is an ellipse, centered at $(\hat{A}_1, \hat{A}_2)$, whose major and minor axes are determined by the eigenvalues $\lambda_\alpha$ and eigenvectors $\vec{e}_\alpha$ of $\psi$. Note: $X_1$ and $X_2$ in Equation (10.67) are measured in the directions of the eigenvectors $\vec{e}_1$ and $\vec{e}_2$, respectively.

[9] In two dimensions the transformation $O$ corresponds to a planar rotation followed by a reflection of the $X_2$ axis. Since $\psi$ is symmetric this is equivalent to a rotation of the axes.

[10] This would not be the case if the basis vectors were linearly dependent. In that case, $\psi$ would be singular and $\psi^{-1}$ would not exist.

From Equation (10.64), we see that $\Delta Q = k$ corresponds to a contour of constant posterior probability in the space of our two parameters. Clearly, Figure 10.3 provides information about the joint credible region for the model parameters $A_1$ and $A_2$. We still need to compute what size of ellipse corresponds to, say, a 95% credible region. In the next section, we discuss how to find various summaries of a Gaussian posterior.

Problem: In Section 10.2.2, we fitted a straight line model to the data in Table 10.1. Find the eigenvalues of the corresponding $\Psi$ matrix given by Equation (10.52), which we repeat here.

$$\Psi = \frac{1}{\sigma^2} \begin{pmatrix} 76 & 3010 \\ 3010 & 152\,300 \end{pmatrix}, \tag{10.69}$$

where $\sigma^2 = 8.1$. In that problem, the errors were not all the same, so we employed the inverse data covariance matrix $E^{-1}$. For that situation, Equation (10.66) becomes

$$p(\{A_\alpha\}|D,I) = C' \exp\left(-\frac{\sum_{\alpha\beta} \delta A_\alpha\, [\Psi]_{\alpha\beta}\, \delta A_\beta}{2}\right). \tag{10.70}$$

Since there are only two parameters $(A_1, A_2)$, we can rewrite Equation (10.70) as

$$p(A_1, A_2|D,I) = C' \exp\left\{-\frac{\sum_{\alpha=1}^{2}\sum_{\beta=1}^{2} \delta A_\alpha\, [\Psi]_{\alpha\beta}\, \delta A_\beta}{2}\right\}. \tag{10.71}$$

Solution: We can readily determine the eigenvalues of $\Psi$ with the following Mathematica command:

    Eigenvalues[Ψ]
    {18809.8, 2.03766}

So the two eigenvalues are $\lambda_1 = 18\,809.8$ and $\lambda_2 = 2.03766$.
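For a 2 × 2 symmetric matrix the eigenvalues have a closed form, so the result above can be reproduced without a linear algebra package. This is a minimal sketch; the function name `eigenvalues_sym2` is an assumption, and the matrix entries are those of Equation (10.69) with $\sigma^2 = 8.1$.

```python
import math

def eigenvalues_sym2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], larger one first."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean + radius, mean - radius

sigma2 = 8.1
lam1, lam2 = eigenvalues_sym2(76 / sigma2, 3010 / sigma2, 152300 / sigma2)

# Axes of the ellipse Delta chi^2 = k are sqrt(k / lambda); here k = 1.
axis1 = math.sqrt(1.0 / lam1)
axis2 = math.sqrt(1.0 / lam2)
```

The ratio of the two eigenvalues is close to $10^4$, so the contour is a very elongated ellipse, about 100 times longer along one principal axis than the other.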
10.4.1 Joint credible regions

The boundary of a credible region is a locus of points of constant probability density which surrounds a region containing a specified probability in the joint probability distribution. Figure 10.3 illustrated that in the two-parameter case, this locus is an ellipse defined by $\Delta Q = k$, where $k$ is a constant. In this section, we will find out for what value of $k$ the ellipse contains, say, 68.3% of the posterior joint probability. The results will be presented in a more general way so we can specify the credible region corresponding to $\Delta Q = k$ for an arbitrary number of model parameters, not just the $M = 2$ case.

We slightly simplify the notation for this subsection to connect with more familiar results, by writing the posterior density for the $A_\alpha$ as

$$p(\{A_\alpha\}|D,I) = C\, e^{-\chi^2/2}, \tag{10.72}$$

where $\chi^2(\{A_\alpha\}) = Q/\sigma^2$. Let $\chi^2_{\min} = Q_{\min}/\sigma^2$, and $\Delta\chi^2 = \Delta Q/\sigma^2$. Then from Equation (10.64), we can write

$$p(\{A_\alpha\}|D,I) = C'\, e^{-\Delta\chi^2/2}. \tag{10.73}$$

By definition, the boundary of a joint credible region for all the amplitudes is defined by $\chi^2(\{A_\alpha\}) = \chi^2_{\min} + \Delta\chi^2_{\rm crit}$, where $\Delta\chi^2_{\rm crit}$ is a constant chosen such that the region contains some specified probability, $P$. Our task is to find $\Delta\chi^2_{\rm crit}$ such that

$$P = \int_{\Delta\chi^2 < \Delta\chi^2_{\rm crit}} d^M\!A_\alpha\; p(\{A_\alpha\}|D,I). \tag{10.74}$$

The result, which is given without proof, is

$$P = 1 - \frac{\Gamma(M/2,\, \Delta\chi^2_{\rm crit}/2)}{\Gamma(M/2)}. \tag{10.75}$$

This is the probability within the joint credible region for all the amplitudes corresponding to $\Delta\chi^2 < \Delta\chi^2_{\rm crit}$. The quantity $\Gamma(M/2,\, \Delta\chi^2_{\rm crit}/2)$ is one form of the incomplete gamma function,[11] which is given by

$$\Gamma(\nu/2, x) = \int_x^\infty e^{-t}\, t^{\frac{\nu}{2}-1}\, dt, \tag{10.76}$$

where $\nu = M$ is the number of degrees of freedom. With this unnormalized definition, the division by $\Gamma(M/2)$ in Equation (10.75) supplies the normalization. Recall from Section 6.2 that the $\chi^2$ distribution is a special case of a gamma distribution. In Mathematica, $\Gamma(\nu/2,\, \Delta\chi^2_{\rm crit}/2)$ is given by the command

    Gamma[ν/2, Δχ2crit/2]

Table 10.2 gives values for $\Delta\chi^2_{\rm crit}$ as a function of $P$ and $\nu$ obtained from Equation (10.75). For example, the credible region containing probability $P = 68.3\%$, for $M = 2$ free model parameters, is bounded by a surface of constant $\chi^2 = \chi^2_{\min} + \Delta\chi^2_{\rm crit} = \chi^2_{\min} + 2.3$. Note: in Table 10.2, the degrees of freedom $\nu = M$, the number of free model parameters.

[11] See Numerical Recipes by Press et al. (1992).
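Equation (10.75) can be evaluated with the Python standard library alone. The sketch below is an illustration under stated assumptions, not the book's method: it starts from the known cases $\Gamma(\tfrac12, x)/\Gamma(\tfrac12) = \mathrm{erfc}(\sqrt{x})$ and $\Gamma(1, x)/\Gamma(1) = e^{-x}$, and steps upward with the standard recurrence $\Gamma(a+1, x) = a\,\Gamma(a, x) + x^a e^{-x}$. The function name `credible_probability` is hypothetical.

```python
import math

def credible_probability(m, dchi2):
    """P = 1 - Gamma(m/2, dchi2/2) / Gamma(m/2) for an integer number m >= 1 of parameters."""
    x = 0.5 * dchi2
    if m % 2 == 1:
        a = 0.5
        q = math.erfc(math.sqrt(x))   # Gamma(1/2, x) / Gamma(1/2)
    else:
        a = 1.0
        q = math.exp(-x)              # Gamma(1, x) / Gamma(1)
    # Raise a in unit steps: Q(a+1, x) = Q(a, x) + x^a e^-x / Gamma(a+1).
    while a < 0.5 * m:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return 1.0 - q

p_two_params = credible_probability(2, 2.30)   # compare with 68.3%
p_one_param = credible_probability(1, 1.00)    # compare with 68.3%
p_two_954 = credible_probability(2, 6.17)      # compare with 95.4%
p_three = credible_probability(3, 3.53)        # compare with 68.3%
```

For $M = 2$ the expression reduces to $P = 1 - e^{-\Delta\chi^2_{\rm crit}/2}$, which gives a quick hand check of the two-parameter entries.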
Table 10.2 This table allows us to find the value of $\Delta\chi^2_{\rm crit}$ in Equation (10.74) that defines the boundary of the joint posterior probability in $\nu$ model parameters that contains a specified probability $P$. Thus, the $P = 68.3\%$ joint probability boundary in two parameters corresponds to $\Delta\chi^2_{\rm crit} = 2.3$, where $\Delta\chi^2_{\rm crit}$ is measured from $\chi^2_{\min}$, the value corresponding to the peak of the joint posterior probability.

                      Degrees of freedom $\nu$
  P          1       2       3       4       5       6
  68.3%      1.00    2.30    3.53    4.72    5.89    7.04
  90%        2.71    4.61    6.25    7.78    9.24    10.6
  95.4%      4.00    6.17    8.02    9.70    11.3    12.8
  99%        6.63    9.21    11.3    13.3    15.1    16.8
  99.73%     9.00    11.8    14.2    16.3    18.2    20.1
  99.99%     15.1    18.4    21.1    23.5    25.7    27.8
Question: Figure 10.3 shows an ellipse of constant probability for a two-parameter model defined by $\Delta Q = k$ (a constant). For what value of $k$ does the ellipse contain a probability of 95.4%?

Answer: First we note that $\Delta Q = \sigma^2\, \Delta\chi^2$. From Table 10.2, we obtain $\Delta\chi^2 = 6.17$ for $\nu = 2$ degrees of freedom (for a two-parameter model) and a probability of 95.4%. The desired value of $k = 6.17\,\sigma^2$.

Question: Suppose we fit a model with six linear parameters. We are really only interested in two of these parameters, so we remove the other four by marginalization. The result is the posterior probability distribution for the two interesting parameters. Now suppose we want to plot the 95.4% credible region (ellipse) for these two parameters. How many degrees of freedom should be used when consulting Table 10.2?

Answer: 2!

Problem: In Section 10.2.2, we fitted a straight line model to the data in Table 10.1. Compute and plot the 95.4% joint credible region for the slope and intercept.

Solution: From Table 10.2, we obtain $\Delta\chi^2 = 6.17$ for $\nu = 2$ degrees of freedom (for a two-parameter model) and a probability of 95.4%. Now $\Delta\chi^2$ is related to the $\Psi$ matrix by

$$\Delta\chi^2 = \sum_{\alpha=1}^{2}\sum_{\beta=1}^{2} \delta A_\alpha\, [\Psi]_{\alpha\beta}\, \delta A_\beta, \tag{10.77}$$

where

$$\Psi = \frac{1}{\sigma^2} \begin{pmatrix} 76 & 3010 \\ 3010 & 152\,300 \end{pmatrix} = \begin{pmatrix} 9.383 & 371.6 \\ 371.6 & 18\,802 \end{pmatrix}, \tag{10.78}$$

since $\sigma^2 = 8.1$. Combining Equations (10.77) and (10.78), we obtain

$$\Delta\chi^2 = \begin{pmatrix} \delta A_1 & \delta A_2 \end{pmatrix} \begin{pmatrix} 9.383 & 371.6 \\ 371.6 & 18\,802 \end{pmatrix} \begin{pmatrix} \delta A_1 \\ \delta A_2 \end{pmatrix} = 6.17. \tag{10.79}$$
It is convenient to change from rectangular coordinates $(\delta A_1, \delta A_2)$ to polar coordinates $(r, \theta)$. Equation (10.79) becomes

$$\begin{pmatrix} r\cos\theta & r\sin\theta \end{pmatrix} \begin{pmatrix} 9.383 & 371.6 \\ 371.6 & 18\,802 \end{pmatrix} \begin{pmatrix} r\cos\theta \\ r\sin\theta \end{pmatrix} = 6.17. \tag{10.80}$$

Next, solve this equation for $r$ for a set of values of $\theta$, and by so doing map out the joint credible region. In Mathematica this can easily be accomplished by:

    polarA[r_, θ_] := {{r * Cos[θ]}, {r * Sin[θ]}};
    tpolarA[r_, θ_] := Transpose[polarA[r, θ]];
    locus = Table[
      {NSolve[Flatten[tpolarA[r, θ].Ψ.polarA[r, θ]][[1]] == 6.17, r][[2]][[1, 2]], θ},
      {θ, 0, 2 Pi, Δθ}];

Finally, transform the $r, \theta$ values back to $\delta A_1$, $\delta A_2$ and convert them to $A_1$, $A_2$, where $A_1 = \delta A_1 + \hat{A}_1$ and $A_2 = \delta A_2 + \hat{A}_2$. Note: $\hat{A}_1$ and $\hat{A}_2$ are the most probable values of the intercept and slope, respectively. Figure 10.4(a) shows a plot of the resulting 95.4% joint credible region (dashed curve). The solid curve shows the 68.3% joint credible region derived in the same way.

[Figure 10.4: two panels plotting slope versus intercept. Panel (a): contours $\Delta\chi^2 = 2.3$ and $\Delta\chi^2 = 6.17$. Panel (b): contours $\Delta\chi^2 = 1$ and $\Delta\chi^2 = 4$ with tangent lines A, A′, B, B′.]

Figure 10.4 Panel (a) is a plot of the 95.4% (dashed) and 68.3% (solid) joint credible regions for the slope and intercept of the best-fit straight line to the data of Table 10.1. Panel (b) shows ellipses corresponding to $\Delta\chi^2 = 1.0$ and $\Delta\chi^2 = 4.0$. The two lines labeled A and A′, which are tangent to the inner ellipse, define the 68.3% credible region for the marginal PDF, $p(A_2|D,I)$. The two lines labeled B and B′, which are tangent to the outer ellipse, define the 95.4% credible region for $p(A_2|D,I)$.

What if we are interested in summarizing what the data and our prior information have to say about the slope, i.e., determining the marginal PDF, $p(A_2|D,I)$? In this case, we need to marginalize over all possible values of $A_1$. We will look at this question more fully in Section 10.5, but we can use the material of this section to obtain the 68.3% credible region for $A_2$ as follows. By good fortune it turns out that for Gaussian posteriors, in any number of dimensions, the marginal PDF is also equal to the projected distribution (projected PDF).

What do we mean by the projected PDF for $A_2$? Imagine a light source located a great distance away along the $A_1$ axis, illuminating the 3-dimensional probability density function, thought of as an opaque mountain sitting on the $A_1, A_2$ plane. The height of the mountain at any particular $A_1, A_2$ is equal to $p(A_1, A_2|D,I)$. The shadow cast by this mountain on the plane defined by $A_1 = 0$ is called the projected probability density function of $A_2$. To plot out the projected PDF, we can do the following. Each value of $A_2$ in our final plot corresponds to a line parallel to the $A_1$ axis. Vary $A_1$ along this line and find the maximum value of $p(A_1, A_2|D,I)$ along the line. If we raise the line to this height it will be tangent to the surface of the probability mountain for this $A_2$. This is the value of the projected PDF for that particular value of $A_2$.

Now consider the two lines shown in Figure 10.4(b), labeled A and A′, which define the borders of the 68.3% credible region for $A_2$. You might naively expect these two lines to be tangent to the ellipse containing 68.3% of the joint probability, $p(A_1, A_2|D,I)$, as illustrated in Figure 10.4(a) for $\Delta\chi^2 = 2.3$. The correct answer is that they are tangent to the ellipse shown in Figure 10.4(b), corresponding to $\Delta\chi^2 = 1.0$, as indicated in Table 10.2 for one degree of freedom. The two lines, B and B′, which define the 95.4% credible region, are tangent to the ellipse defined by $\Delta\chi^2 = 4.0$. In a like fashion we could locate the 68.3% and 95.4% credible region boundaries for $A_1$.
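The polar-coordinate construction and the tangent-line rule can both be sketched in a few lines. In polar form, Equation (10.80) gives $r(\theta) = \sqrt{\Delta\chi^2 / (\hat{u}^T \Psi\, \hat{u})}$ with $\hat{u} = (\cos\theta, \sin\theta)$, so no root finder is needed. The half-width of the ellipse along one parameter, where the tangent lines touch, is $\sqrt{\Delta\chi^2\, [\Psi^{-1}]_{\alpha\alpha}}$, a standard property of ellipses that is assumed here rather than taken from the text. The function names are hypothetical.

```python
import math

sigma2 = 8.1
PSI = [[76 / sigma2, 3010 / sigma2], [3010 / sigma2, 152300 / sigma2]]

def radius(theta, dchi2):
    """Distance from (A1_hat, A2_hat) to the contour Delta chi^2 = dchi2 in direction theta."""
    c, s = math.cos(theta), math.sin(theta)
    quad = PSI[0][0] * c * c + 2 * PSI[0][1] * c * s + PSI[1][1] * s * s
    return math.sqrt(dchi2 / quad)

def boundary_point(theta, dchi2):
    """(dA1, dA2) on the contour; add (A1_hat, A2_hat) to obtain (A1, A2)."""
    r = radius(theta, dchi2)
    return r * math.cos(theta), r * math.sin(theta)

def half_width(index, dchi2):
    """Half-extent of the contour along parameter index (0 = intercept, 1 = slope)."""
    det = PSI[0][0] * PSI[1][1] - PSI[0][1] ** 2
    inverse_diagonal = [PSI[1][1] / det, PSI[0][0] / det]
    return math.sqrt(dchi2 * inverse_diagonal[index])

da1, da2 = boundary_point(2.0, 6.17)    # one point on the 95.4% joint contour
slope_68 = half_width(1, 1.0)           # lines A, A' lie at A2_hat +/- slope_68
intercept_68 = half_width(0, 1.0)
```

Because the ellipse is extremely elongated, a plot built from equally spaced $\theta$ values needs a fine angular step near the long axis to look smooth.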
10.5 Model parameter errors In Sections 10.2.1 and 10.2.2, we found the most probable values of linear model parameters. To complete the discussion, we need to specify the uncertainties of these parameters and introduce the parameter covariance matrix.
10.5.1 Marginalization and the covariance matrix Now suppose that we are only interested in a subset of the model amplitudes (for example, one amplitude may describe an uninteresting mean background level, or, we may be interested in the probability density function of only one of the parameters). We can summarize the implications of the data for the interesting amplitudes by calculating the marginal distribution for those amplitudes, integrating the uninteresting nuisance parameters out of the full joint posterior. In this subsection we start by showing how to integrate out a single amplitude; the procedure can be repeated to remove more parameters. We then consider the special case of a model with only two
265
10.5 Model parameter errors
parameters (M ¼ 2) and see how this leads to an understanding of the parameter errors. Suppose that the amplitude we want to marginalize out is A1 . Returning to the Q notation and IID Gaussian errors, the marginal distribution for the remaining amplitudes is then pðA2 ; . . . ; AMjD; IÞ ¼
Z
dA1 pðfA gjD; IÞ A1
Z
¼ C0
(10:81) 2
dA1 eQ=2 ; A1
where QðAÞ ¼
X
A
(10:82)
A :
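To perform the required integral, we first pull out the $\delta A_1$-dependent terms in $\Delta Q$, writing
$$\Delta Q = (\delta A_1)^2\, \psi_{11} + 2\, \delta A_1 \sum_{\alpha=2}^{M} \psi_{1\alpha}\, \delta A_\alpha + \sum_{\alpha,\beta=2}^{M} \delta A_\alpha\, \psi_{\alpha\beta}\, \delta A_\beta, \qquad (10.83)$$
where $\delta A_1$ appears only in the first two terms. Now we complete the square for $\delta A_1$ by adding and subtracting a term as follows:
$$\Delta Q = (\delta A_1)^2\, \psi_{11} + 2\, \delta A_1 \sum_{\alpha=2}^{M} \psi_{1\alpha}\, \delta A_\alpha + \frac{1}{\psi_{11}} \left( \sum_{\alpha=2}^{M} \psi_{1\alpha}\, \delta A_\alpha \right)^2 + \sum_{\alpha,\beta=2}^{M} \delta A_\alpha\, \psi_{\alpha\beta}\, \delta A_\beta - \frac{1}{\psi_{11}} \left( \sum_{\alpha=2}^{M} \psi_{1\alpha}\, \delta A_\alpha \right)^2$$
$$= \psi_{11} \left[ \delta A_1 + \frac{1}{\psi_{11}} \sum_{\alpha=2}^{M} \psi_{1\alpha}\, \delta A_\alpha \right]^2 + Q_r. \qquad (10.84)$$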
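By construction, $\delta A_1$ appears only in the squared term, and the terms depending on the remaining $\delta A$'s make up the reduced quadratic form,
$$Q_r = -\frac{1}{\psi_{11}} \left( \sum_{\alpha=2}^{M} \psi_{1\alpha}\, \delta A_\alpha \right)^2 + \sum_{\alpha,\beta=2}^{M} \delta A_\alpha\, \psi_{\alpha\beta}\, \delta A_\beta. \qquad (10.85)$$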
Equation (10.81) can now be written
$$p(A_2, \ldots, A_M|D, I) = C' e^{-Q_r/2\sigma^2} \int_{\Delta A_1} d(\delta A_1)\, \exp\left[ -\frac{\psi_{11}}{2\sigma^2} \left( \delta A_1 + \frac{1}{\psi_{11}} \sum_{\alpha=2}^{M} \psi_{1\alpha}\, \delta A_\alpha \right)^2 \right]. \qquad (10.86)$$
The integrand is a Gaussian in $\delta A_1$, with variance $\sigma^2/\psi_{11}$. If the range of integration, $\Delta A_1$, were infinite, the integral would merely be a constant (the normalization constant for the Gaussian, $\sigma\sqrt{2\pi/\psi_{11}}$). With a finite range, it can be written in terms of error functions with arguments that depend on $A_2$ through $A_M$. But as long as the prior range is large compared to $\sigma/\sqrt{\psi_{11}}$, this integral will be very nearly constant with respect to the remaining amplitudes. Thus, to a good approximation, the marginal distribution is
$$p(A_2, \ldots, A_M|D, I) = C'' e^{-Q_r/2\sigma^2}, \qquad (10.87)$$
where $C'' = C' \sigma \sqrt{2\pi/\psi_{11}}$ is a new normalization constant. In the limit where $\Delta A_1 \to \infty$, this result is exact. Again, it is useful to consider the special case of only two parameters, $M = 2$. In this case, after marginalizing out $A_1$ we are left with $p(A_2|D, I)$. Let's evaluate this now. We can rewrite Equation (10.85),
$$Q_r = -\frac{1}{\psi_{11}} (\psi_{12}\, \delta A_2)^2 + \psi_{22}\, \delta A_2^2 = \delta A_2^2\, \frac{\psi_{11}\psi_{22} - \psi_{12}^2}{\psi_{11}}. \qquad (10.88)$$
Thus, we can write
$$p(A_2|D, I) = C' \sigma \sqrt{2\pi/\psi_{11}}\; e^{-\delta A_2^2/2\sigma_2^2}, \qquad (10.89)$$
which is a Gaussian with variance $\sigma_2^2$ given by
$$\sigma_2^2 = \frac{\sigma^2 \psi_{11}}{\psi_{11}\psi_{22} - \psi_{12}^2}. \qquad (10.90)$$
Similarly, if we had marginalized out $A_2$ instead, we would have obtained a Gaussian PDF for $p(A_1|D, I)$ with $\sigma_1^2$ given by
$$\sigma_1^2 = \frac{\sigma^2 \psi_{22}}{\psi_{11}\psi_{22} - \psi_{12}^2}. \qquad (10.91)$$
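As a rough numerical check of the marginalization argument, the following Python sketch sums the two-parameter Gaussian posterior over a grid in the nuisance offset and compares the resulting variance of $A_2$ with Equation (10.90). The values chosen for the matrix elements and the noise variance are hypothetical, and the function names are illustrative only; the grid half-width stands in for a prior range much wider than the posterior.

```python
import math

def marginal_variance_analytic(psi11, psi12, psi22, sigma2):
    """Variance of A2 after marginalizing A1, as in Equation (10.90)."""
    return sigma2 * psi11 / (psi11 * psi22 - psi12 ** 2)

def marginal_variance_numeric(psi11, psi12, psi22, sigma2,
                              half_width=12.0, n=241):
    """Variance of A2 from a brute-force grid sum of exp(-dQ / (2 sigma^2)).

    The grid covers [-half_width, +half_width] in both offsets (dA1, dA2)
    measured from the most probable values.
    """
    step = 2.0 * half_width / (n - 1)
    grid = [-half_width + k * step for k in range(n)]
    norm = 0.0
    second_moment = 0.0
    for da2 in grid:
        # Marginalize: add up the joint density over the nuisance offset dA1.
        marginal = 0.0
        for da1 in grid:
            dq = psi11 * da1 * da1 + 2.0 * psi12 * da1 * da2 + psi22 * da2 * da2
            marginal += math.exp(-dq / (2.0 * sigma2))
        norm += marginal
        second_moment += da2 * da2 * marginal
    # The mean offset is zero by symmetry, so this ratio is the variance.
    return second_moment / norm

# Hypothetical matrix elements and noise variance, chosen only for illustration.
PSI11, PSI12, PSI22, SIGMA2 = 2.0, 1.0, 3.0, 4.0
analytic = marginal_variance_analytic(PSI11, PSI12, PSI22, SIGMA2)
numeric = marginal_variance_numeric(PSI11, PSI12, PSI22, SIGMA2)
print(analytic, numeric)
```

The two numbers should agree closely as long as the grid is much wider than the posterior, which mirrors the condition on the prior range stated above.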
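Notice that the variances for $A_1$ and $A_2$ can be written in terms of the elements of $\boldsymbol{\psi}^{-1}$:
$$\boldsymbol{\psi}^{-1} = \frac{1}{\psi_{11}\psi_{22} - \psi_{12}^2} \begin{pmatrix} \psi_{22} & -\psi_{12} \\ -\psi_{12} & \psi_{11} \end{pmatrix}. \qquad (10.92)$$
Comparing with Equations (10.91) and (10.90), we can write
$$\sigma_1^2 = \sigma^2 [\boldsymbol{\psi}^{-1}]_{11}, \qquad (10.93)$$
$$\sigma_2^2 = \sigma^2 [\boldsymbol{\psi}^{-1}]_{22}. \qquad (10.94)$$
Thus, we have shown for the $M = 2$ case, that the matrix $\boldsymbol{\psi}^{-1}$ that we needed to solve for $\hat{A}$ (see Equation (10.34)), when multiplied by the data variances ($\sigma^2$), also contains information about the errors of the parameters. In Section 10.5.3, we will generalize this result for the case of a linear model with an arbitrary number of parameters $M$, and show that
$$\sigma_\alpha^2 = \sigma^2 [\boldsymbol{\psi}^{-1}]_{\alpha\alpha}. \qquad (10.95)$$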
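A minimal Python sketch of this result for a straight line model is given below. It assumes basis functions $g_1 = 1$ and $g_2 = x$, builds the matrix as $G^T G$, inverts it with the closed form of Equation (10.92), and scales by $\sigma^2$. The x values and $\sigma$ are made up for illustration and the helper names are not from the text.

```python
def psi_straight_line(x):
    """psi = G^T G for the basis functions g1 = 1 and g2 = x."""
    n = float(len(x))
    sx = float(sum(x))
    sxx = float(sum(v * v for v in x))
    return [[n, sx], [sx, sxx]]

def invert_2x2(m):
    """Closed-form inverse of a 2 x 2 matrix, as in Equation (10.92)."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def parameter_covariance(x, sigma):
    """Scale the inverse of psi by the data variance sigma^2."""
    inv = invert_2x2(psi_straight_line(x))
    return [[sigma ** 2 * element for element in row] for row in inv]

# Hypothetical sample positions and noise level, for illustration only.
V = parameter_covariance([0.0, 1.0, 2.0, 3.0], sigma=2.0)
print(V)  # diagonal entries are the variances of intercept and slope
```

The diagonal entries of `V` play the role of $\sigma_1^2$ and $\sigma_2^2$ in Equations (10.93) and (10.94).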
The matrix $V = \sigma^2 \boldsymbol{\psi}^{-1}$ is given the name parameter variance-covariance matrix or simply the parameter covariance matrix.$^{12}$ We shall shortly define what we mean by covariance. If we wish to summarize our posterior state of knowledge about the parameters with a few numbers, then we can write
$$A_1 = \hat{A}_1 \pm \sigma_1, \qquad (10.96)$$
$$A_2 = \hat{A}_2 \pm \sigma_2. \qquad (10.97)$$
Formally, the variance of $A_1$ is defined as the expectation value of the square of the deviations from the mean $\mu_1$:
$$\mathrm{Var}(A_1) = \langle (A_1 - \mu_1)^2 \rangle = \int_{\Delta A_1} dA_1\, (A_1 - \mu_1)^2\, p(A_1|D, I). \qquad (10.98)$$
The idea of variance can be broadened to consider the simultaneous deviations of both $A_\alpha$ and $A_\beta$. The covariance is given by
$$\sigma_{\alpha\beta} = \int_{\Delta A_\alpha} \int_{\Delta A_\beta} dA_\alpha\, dA_\beta\, (A_\alpha - \mu_\alpha)(A_\beta - \mu_\beta)\, p(A_\alpha, A_\beta|D, I) \qquad (10.99)$$
and is a measure of the correlation between the inferred parameters. If, for example, there is a high probability that overestimates of $A_\alpha$ are associated with overestimates of $A_\beta$, and underestimates of $A_\alpha$ associated with underestimates of $A_\beta$, then the covariance will be positive. Negative covariance (anti-correlation) implies that overestimates of $A_\alpha$ will be associated with underestimates of $A_\beta$. When the estimate of one parameter has little or no influence on the inferred value of the other, then the magnitude of the covariance will be negligible in comparison to the variance terms, $|\sigma_{\alpha\beta}| \ll \sqrt{\sigma_\alpha^2 \sigma_\beta^2}$. By now you may have guessed that the covariance of the inferred parameters is given by $\sigma^2$ times the off-diagonal elements of the $\boldsymbol{\psi}^{-1}$ matrix:
$$\sigma_{\alpha\beta} = \sigma^2 [\boldsymbol{\psi}^{-1}]_{\alpha\beta}. \qquad (10.100)$$
12 If we are employing the covariance matrix, $E$, for our knowledge of the measurement errors, then simply replace $\boldsymbol{\psi}^{-1}$ by $Y^{-1}$ and drop all the factors of $\sigma^2$ in Equations (10.93), (10.94), (10.101), and (10.100).
Thus, for the $M = 2$ case, we have
$$\sigma_{12} = \sigma^2 [\boldsymbol{\psi}^{-1}]_{12} = -\frac{\sigma^2 \psi_{12}}{\psi_{11}\psi_{22} - \psi_{12}^2}. \qquad (10.101)$$
Referring to Figure 10.3, we see that the major axis of the elliptical credible region is inclined to the $A_1$ axis with a positive slope. This indicates a positive correlation between the parameters $A_1$ and $A_2$. A value of $\sigma_{12} = 0$ would correspond to a major axis which is aligned with the $A_1$ axis if $\sigma_1 > \sigma_2$.
10.5.2 Correlation coefficient
It is useful to summarize the correlation between estimates of any two parameters by a coefficient in the range from $-1$ to $+1$, where $-1$ indicates complete negative correlation, $+1$ indicates complete positive correlation, and 0 indicates no correlation. The correlation coefficient is defined by
$$\rho_{\alpha\beta} = \frac{\sigma_{\alpha\beta}}{\sqrt{\sigma_{\alpha\alpha}\sigma_{\beta\beta}}} = \frac{\sigma_{\alpha\beta}}{\sqrt{\sigma_\alpha^2 \sigma_\beta^2}}. \qquad (10.102)$$
In the extreme case of $\rho = \pm 1$, the elliptical contours will be infinitely wide in one direction (with only information in the prior preventing this catastrophe) and oriented at an angle $\pm\tan^{-1}\left(\sqrt{\sigma_{22}/\sigma_{11}}\right)$. In this case, the parameter error bars $\sigma_1$ and $\sigma_2$ will be huge, saying that our individual estimates of $A_1$ and $A_2$ are completely unreliable, but we can still infer a linear combination of the parameters quite well. For $\rho$ large and positive, the probability contours will all bunch up close to the line $A_2 = \mathrm{intercept} + mA_1$, where $m = \sqrt{\sigma_{22}/\sigma_{11}}$. We can rewrite this as $A_2 - mA_1 = \mathrm{intercept}$. Varying the intercept corresponds to motion perpendicular to this line. The concentration of probability contours implies that the probability density of the intercept is quite narrow. Since the intercept is equal to $A_2 - mA_1$, this indicates
that the data contain a lot of information about the difference $A_2 - mA_1$. If $\rho$ is large and negative, then we can infer the sum $A_2 + mA_1$.
Problem: In Section 10.2.2, we fitted a straight line model to the data in Table 10.1. We are now in a position to evaluate the errors for the marginal posterior density functions for the intercept, $A_1$, and slope, $A_2$, from the diagonal elements of $V = Y^{-1} = (G^T E^{-1} G)^{-1}$, which is given by
$$Y^{-1} = \sigma^2 \begin{pmatrix} 0.061 & -0.0012 \\ -0.0012 & 0.000030 \end{pmatrix}, \qquad (10.103)$$
where $\sigma^2 = 8.1$.
Solution: Let $\sigma_1$ and $\sigma_2$ be the $1\sigma$ errors of $A_1$ and $A_2$. $Y^{-1}$ is the variance-covariance matrix of the parameter errors. In this case, it includes the data covariance matrix $E^{-1}$, so we need to use Equations (10.93) and (10.94) without the $\sigma^2$ term in front.
$$\sigma_1 = \sqrt{[Y^{-1}]_{11}} = \sqrt{0.061\,\sigma^2} = 0.70, \qquad (10.104)$$
$$\sigma_2 = \sqrt{[Y^{-1}]_{22}} = \sqrt{0.000030\,\sigma^2} = 0.016, \qquad (10.105)$$
and
$$\begin{pmatrix} \hat{A}_1 \\ \hat{A}_2 \end{pmatrix} = \begin{pmatrix} -2.05 \pm 0.70 \\ 0.275 \pm 0.016 \end{pmatrix}. \qquad (10.106)$$
The correlation coefficient is
$$\rho_{12} = \frac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}} = \frac{[Y^{-1}]_{12}}{\sqrt{[Y^{-1}]_{11}[Y^{-1}]_{22}}} = \frac{-0.0012}{\sqrt{0.061 \times 0.00003}} = -0.885. \qquad (10.107)$$
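The arithmetic of this solution can be reproduced with a few lines of Python. The sketch below takes the rounded matrix entries of Equation (10.103) and $\sigma^2 = 8.1$ from the text; the negative sign on the off-diagonal entries is an assumption (it is what a straight line fit with positive x values would give), and the function name is illustrative.

```python
import math

SIGMA2 = 8.1  # value of sigma^2 quoted with Equation (10.103)

# Rounded entries of Equation (10.103); the negative off-diagonal sign is assumed.
Y_INV = [[0.061 * SIGMA2, -0.0012 * SIGMA2],
         [-0.0012 * SIGMA2, 0.000030 * SIGMA2]]

def errors_and_correlation(cov):
    """Return (sigma_1, sigma_2, rho_12) from a 2 x 2 covariance matrix."""
    sigma_1 = math.sqrt(cov[0][0])
    sigma_2 = math.sqrt(cov[1][1])
    rho_12 = cov[0][1] / (sigma_1 * sigma_2)
    return sigma_1, sigma_2, rho_12

sigma_1, sigma_2, rho_12 = errors_and_correlation(Y_INV)
print(round(sigma_1, 2), round(sigma_2, 3), round(rho_12, 3))
```

Because the matrix entries are rounded, the correlation coefficient computed this way comes out near $-0.887$ rather than exactly the quoted magnitude of 0.885; the common factor $\sigma^2$ cancels in the ratio.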
When there are only two parameters, it is more informative to give a contour plot of the joint posterior probability density function, which illustrates the correlation in the parameter error estimates. In Section 10.4.1, we showed how to compute the joint credible region for the slope and intercept and plotted two examples in Figure 10.4(a). This figure is repeated in the left panel of Figure 10.5. The two contours shown enclose 95.4% and 68.3% of the probability. In the previous analysis, the intercept and slope are referenced to the origin of our data. It turns out that the size of the correlation coefficient depends on the origin we choose for the x-coordinate, as we demonstrate in Figure 10.6. In fact, if we shift the origin by just the right amount, call it xw , we can eliminate the correlation altogether.
[Figure 10.5: two contour plots of slope versus intercept, panels (a) and (b), each with contours labeled $\Delta\chi^2 = 2.3$ and $\Delta\chi^2 = 6.17$.]
Figure 10.5 Panel (a) shows a contour plot of the joint posterior PDF $p(A_1, A_2|D, I)$. The dashed and solid contours enclose 95.4% and 68.3% of the probability, respectively. Panel (b) is the same but using the weighted average x-coordinate as the origin of the fit.
[Figure 10.6: plot of the data and fitted lines, y-axis versus x-axis.]
Figure 10.6 Two straight line fits to some data with an origin well outside the x range of the data. Clearly, any variation in the slope parameter will have a strongly correlated effect on the intercept and vice versa. If the x origin had been chosen closer to the middle of the data, then variations in slope would have a much smaller effect on the intercept.
From Equation (10.107) it is clear that $\rho_{12} = 0$ if $[Y^{-1}]_{12} = 0$, and it is easy to show that the off-diagonal elements of $Y^{-1}$ are zero if the off-diagonal elements of $Y$ are zero. Basis functions modified so that this holds (here, by the shift of origin) are referred to as orthogonal basis functions. From Equations (10.47) and (10.46) we can write
$$[Y]_{12} = \sum_{ij} g_{i1} [E^{-1}]_{ij}\, g_{j2}. \qquad (10.108)$$
For the straight line model $g_{i1} = \{1, 1, \ldots, 1\}$ and $g_{j2} = \{x_1, x_2, \ldots, x_N\}$. Shifting the origin to $x_w$ changes $g_{j2}$ to $g'_{j2} = \{x_1 - x_w, x_2 - x_w, \ldots, x_N - x_w\}$. We can solve for $x_w$ by setting
$$[Y']_{12} = \sum_{ij} g_{i1} [E^{-1}]_{ij}\, g'_{j2} = 0. \qquad (10.109)$$
From Equation (10.52), we can rewrite Equation (10.109) as
$$[Y']_{12} = (1, 1, 1, \ldots, 1)\, \frac{1}{\sigma^2} \begin{pmatrix} 14 & 0 & 0 & \cdots & 0 \\ 0 & 3 & 0 & \cdots & 0 \\ 0 & 0 & 25 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 2 \end{pmatrix} \begin{pmatrix} x_1 - x_w \\ x_2 - x_w \\ x_3 - x_w \\ \vdots \\ x_N - x_w \end{pmatrix}$$
$$= \frac{1}{\sigma^2} (14, 3, 25, \ldots, 2) \begin{pmatrix} x_1 - x_w \\ x_2 - x_w \\ x_3 - x_w \\ \vdots \\ x_N - x_w \end{pmatrix} = (w_1, w_2, w_3, \ldots, w_N) \begin{pmatrix} x_1 - x_w \\ x_2 - x_w \\ x_3 - x_w \\ \vdots \\ x_N - x_w \end{pmatrix}$$
$$= w_1(x_1 - x_w) + w_2(x_2 - x_w) + \cdots + w_N(x_N - x_w) = 0, \qquad (10.110)$$
where the data weights, $w_i$, are given by the diagonal elements of the inverse data covariance matrix $E^{-1}$. The solution of Equation (10.110) is given by
$$x_w = \frac{\sum_i w_i x_i}{\sum_i w_i}. \qquad (10.111)$$
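The effect of the shifted origin can be checked numerically. The following Python sketch, for the case of a diagonal inverse data covariance matrix, computes $x_w$ from Equation (10.111) and evaluates the off-diagonal element before and after the shift. The x positions and weights are invented for illustration, and the function names are not from the text.

```python
def weighted_origin(x, w):
    """Weighted average x-coordinate, Equation (10.111)."""
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

def off_diagonal_element(x, w, origin=0.0):
    """[Y]_12 for a straight line fit with diagonal weights w_i.

    With g1 = 1 and g2 = x - origin this is the sum of w_i * (x_i - origin).
    """
    return sum(wi * (xi - origin) for wi, xi in zip(w, x))

# Hypothetical positions and weights (diagonal elements of E^-1).
x = [10.0, 20.0, 30.0, 40.0]
w = [14.0, 3.0, 25.0, 2.0]

x_w = weighted_origin(x, w)
before = off_diagonal_element(x, w)               # origin at x = 0
after = off_diagonal_element(x, w, origin=x_w)    # origin shifted to x_w
print(x_w, before, after)
```

With the origin at $x_w$ the off-diagonal element vanishes to rounding error, which is the condition for uncorrelated slope and intercept.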
Panel (b) of Figure 10.5 shows the joint credible region for the parameters obtained using the weighted average, xw , as the origin. The major and minor axes of the ellipse are now parallel to the parameter axes. The sensitivity of the analysis to the choice of origin arises from the model predictions’ dependence on the origin.
Suppose that instead of a straight line model, we had wanted to fit a higher order polynomial to the data. The appropriate function to fit, to ensure the coefficients are uncorrelated, can be shown to be of the form
$$y(x_i) = A_1 + A_2(x_i - x_w) + A_3(x_i - \beta_1)(x_i - \beta_2) + A_4(x_i - \gamma_1)(x_i - \gamma_2)(x_i - \gamma_3) + \cdots. \qquad (10.112)$$
In the case of a polynomial with just the first 3 terms, we can compute $\beta_1$ and $\beta_2$ from the two equations
$$[Y']_{13} = \sum_{ij} g_{i1} [E^{-1}]_{ij}\, g'_{j3} = 0, \qquad (10.113)$$
$$[Y']_{23} = \sum_{jk} g'_{j2} [E^{-1}]_{jk}\, g'_{k3} = 0, \qquad (10.114)$$
where $g'_{j2} = \{x_1 - x_w, x_2 - x_w, \ldots, x_N - x_w\}$ and
$$g'_{k3} = \{(x_1 - \beta_1)(x_1 - \beta_2), (x_2 - \beta_1)(x_2 - \beta_2), \ldots, (x_N - \beta_1)(x_N - \beta_2)\}. \qquad (10.115)$$
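One way to solve these two conditions, sketched below in Python for diagonal weights, is to write the quadratic basis function as $x^2 - s\,x + q$ with $s$ the sum and $q$ the product of its two constants (called `beta1` and `beta2` here, an assumed labeling). The two orthogonality conditions are then linear in $s$ and $q$. The sample positions and weights are hypothetical.

```python
import math

def orthogonal_quadratic_constants(x, w):
    """Return (x_w, beta1, beta2) for diagonal weights w_i.

    The quadratic basis (x - beta1)(x - beta2) = x^2 - s*x + q is made
    orthogonal (in the weighted sense) to the basis functions 1 and x - x_w.
    """
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    x_w = swx / sw
    # Condition with (x - x_w): the q term drops out because sum w (x - x_w) = 0.
    num = sum(wi * (xi - x_w) * xi * xi for wi, xi in zip(w, x))
    den = sum(wi * (xi - x_w) * xi for wi, xi in zip(w, x))
    s = num / den
    # Condition with the constant basis function fixes q.
    q = (s * swx - swxx) / sw
    disc = s * s - 4.0 * q
    if disc < 0:
        raise ValueError("no real constants found for these data")
    root = math.sqrt(disc)
    return x_w, (s - root) / 2.0, (s + root) / 2.0

# Hypothetical equally weighted sample positions.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ws = [1.0] * 5
x_w, beta1, beta2 = orthogonal_quadratic_constants(xs, ws)

# Check both orthogonality sums directly.
g3 = [(xi - beta1) * (xi - beta2) for xi in xs]
y13 = sum(wi * gi for wi, gi in zip(ws, g3))
y23 = sum(wi * (xi - x_w) * gi for wi, xi, gi in zip(ws, xs, g3))
print(x_w, beta1, beta2, y13, y23)
```

For these equally spaced points the constants come out symmetric about $x_w$, and both sums vanish to rounding error.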
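10.5.3 More on model parameter errors
In Sections 10.2.1 and 10.2.2, we found the most probable values of linear model parameters. To complete the discussion, we need to specify the uncertainties of these parameters. We made a start on this in Section 10.5.1 for the special case of a linear model with only two parameters ($M = 2$). We also introduced the concept of the covariance of two parameters and the correlation coefficient. In this section, we will generalize these results for an arbitrary value of $M$. In Section 10.4, we showed that the posterior probability distribution of the parameters $p(\{A_\alpha\}|D, I)$ is a multivariate Gaussian given by
$$p(\{A_\alpha\}|D, I) \propto \exp\left[ -\frac{1}{2\sigma^2} \sum_{\alpha\beta} \delta A_\alpha\, \psi_{\alpha\beta}\, \delta A_\beta \right]. \qquad (10.116)$$
If we use the more powerful matrix formulation which includes the data covariance matrix $E$ (see Section 10.2.2), then we need to replace
$$\frac{1}{2\sigma^2} \sum_{\alpha\beta} \delta A_\alpha\, \psi_{\alpha\beta}\, \delta A_\beta \quad \text{by} \quad \frac{1}{2} \sum_{\alpha\beta} \delta A_\alpha\, [Y]_{\alpha\beta}\, \delta A_\beta.$$
Equation (10.116) becomes
$$p(\{A_\alpha\}|D, I) \propto \exp\left[ -\frac{1}{2} \sum_{\alpha\beta} \delta A_\alpha\, [Y]_{\alpha\beta}\, \delta A_\beta \right] = \exp\left[ -\frac{1}{2} \sum_{\alpha\beta} (A_\alpha - \hat{A}_\alpha) [Y]_{\alpha\beta} (A_\beta - \hat{A}_\beta) \right]. \qquad (10.117)$$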
Now compare Equation (10.117) with Equation (10.41) for $p(D|\{A_\alpha\}, I)$, which is repeated here:
$$p(D|\{A_\alpha\}, I) \propto \exp\left[ -\frac{1}{2} \sum_{ij} (d_i - f_i) [E^{-1}]_{ij} (d_j - f_j) \right]. \qquad (10.118)$$
Both have the same form. In Equation (10.118), $E$ is the data covariance matrix. By analogy the inverse of $Y$, $Y^{-1}$, is the model parameter covariance matrix. Thus, everything we need to know about the uncertainties with which the various parameters have been determined, and their correlations, is given by $Y^{-1} = (G^T E^{-1} G)^{-1}$, a matrix which we previously computed (Section 10.2.2) in the determination of the most probable values of the parameters. The variance terms are given by the diagonal elements of $Y^{-1}$ and the covariance terms by the off-diagonal elements. We see that the parameter errors depend on the data errors through $E^{-1}$ but in a complicated way which depends on our choice of model basis functions. In Section 10.2.2, we also saw that $E^{-1}$ plays the role of a metric in the full $N$-dimensional vector space of the data. In a similar fashion, $Y$ plays the role of a metric in the $M$-dimensional subspace spanned by the model functions.
10.6 Correlated data errors
In this section, we compute the mean of a data set for which the off-diagonal elements of the data covariance matrix, $E$, are not all zero, i.e., the noise components are correlated. These correlations can be introduced by the experimental apparatus prior to the digitization of the data, or by subsequent software operations. Panel (a) of Figure 10.7 shows 100 simulated data samples of a mean, $\mu = 0.5$, with added IID Gaussian noise ($\sigma = 1$). Panel (b) shows the same data after a smoothing operation that replaces each original sample ($d_i$) by a weighted average ($z_i$) of the original sample and its nearest neighbors according to Equation (10.119).
$$z_i = \begin{cases} 0.75\, d_i + 0.25\, d_{i+1} & \text{for } i = 1 \\ 0.25\, d_{i-1} + 0.5\, d_i + 0.25\, d_{i+1} & \text{for } 1 < i < 100 \\ 0.25\, d_{i-1} + 0.75\, d_i & \text{for } i = 100. \end{cases} \qquad (10.119)$$
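A short Python sketch of this smoothing operation is given below, together with the correlation coefficients that the interior weights (0.25, 0.5, 0.25) would imply for ideal IID noise. The function names are illustrative. Under that idealization the lag-1 and lag-2 coefficients are 2/3 and 1/6, which can be compared with the sample values the text later reads off the autocorrelation function.

```python
def smooth(d):
    """Apply the running weighted average of Equation (10.119).

    Works for any list of at least two samples; the two end points use the
    one-sided weights given in the equation.
    """
    n = len(d)
    if n < 2:
        raise ValueError("need at least two samples")
    z = []
    for i in range(n):
        if i == 0:
            z.append(0.75 * d[0] + 0.25 * d[1])
        elif i == n - 1:
            z.append(0.25 * d[n - 2] + 0.75 * d[n - 1])
        else:
            z.append(0.25 * d[i - 1] + 0.5 * d[i] + 0.25 * d[i + 1])
    return z

def kernel_autocorrelation(kernel, lag):
    """Correlation coefficient at a given lag for IID noise passed through kernel."""
    c0 = sum(k * k for k in kernel)
    c_lag = sum(kernel[i] * kernel[i + lag] for i in range(len(kernel) - lag))
    return c_lag / c0

interior_kernel = [0.25, 0.5, 0.25]
rho_1 = kernel_autocorrelation(interior_kernel, 1)   # 2/3
rho_2 = kernel_autocorrelation(interior_kernel, 2)   # 1/6
print(smooth([0.0, 4.0, 0.0]), rho_1, rho_2)
```

A constant input is left unchanged by `smooth`, which reflects the remark that a very broad signal component is hardly affected by the smoothing.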
If the characteristic width of the signal component in the data is very broad13 (in this example the signal is a DC offset), then the smoothing will have little effect on the signal component. However, it will introduce correlations into the independent noise components. These correlations need to be incorporated in the analysis. The dominant correlation in the smoothed data is with the nearest neighbor on either side. There is
13
If the smoothing has a significant effect on the signal component then this can be accounted for in the signal model by a convolution operation, as discussed in Appendix B.4.3.
[Figure 10.7: four panels. Panels (a) and (b): signal strength versus sample number. Panel (c): ACF versus lag for the raw data and the smoothed data. Panel (d): probability density versus mean $\mu$ for cases 1, 2 and 3.]
Figure 10.7 Panel (a) shows 100 independent samples of a mean value $\mu$. In panel (b), the data have been smoothed using a running average that introduces correlations. Panel (c) compares the autocorrelation functions (ACF) for the raw data and smoothed data. Panel (d) compares the posterior density for $\mu$ for three cases. Case 1 is based on an analysis of the independent samples. The smoothed data results correspond to case 2 (assuming no correlations) and case 3 (including correlations).
also a weaker correlation with the next nearest neighbor.$^{14}$ A common tool for calculating correlations is the autocorrelation function (ACF), which was introduced in Section 5.13.2. To compute the ACF of the noise$^{15}$ we need a sample of data (without any signal present) which has been smoothed in the same way. Panel (c) of Figure 10.7 compares the autocorrelation functions for the raw data and smoothed data. For the smoothed data, the ACF yields a correlation coefficient $\rho_1 = 0.68$ for nearest neighbors (lag of 1), and $\rho_2 = 0.16$ for next nearest neighbors (lag of 2). Panel (d) shows the Bayesian marginal posterior of the mean, $p(\mu|D, I)$, computed for three cases. In case 1, $p(\mu|D, I)$ was computed from the independent samples of panel (a) following the treatment of Section 9.2.1. In case 2, $p(\mu|D, I)$ was computed
14 According to equation (5.63), we first subtract the mean of the noise data which can introduce a correlation between all the noise terms. If $N$ is large this correlation is very weak and has been neglected in the current analysis.
15 We can write $\sigma_{12} = \langle e_1 e_2 \rangle = \langle e_2 e_3 \rangle = \langle e_i e_{i+1} \rangle = \sigma^2 \rho(j = 1)$, where $\rho(j = 1)$ is the value of the ACF for lag $j = 1$.
from the smoothed data and assuming no correlation. This second case results in a narrower posterior which is unwarranted because our state of information is unchanged from case 1; we have simply transformed the original data via a smoothing operation. In case 3, we incorporated information about the correlations introduced by the smoothing. This yielded essentially the same result as we obtained in case 1, as we would expect. In all three cases, we assumed the noise $\sigma$ was an unknown nuisance parameter. We assumed a Jeffreys prior for $\sigma$ and a uniform prior for $\mu$. For case 3, the likelihood was computed from Equation (10.41), which for the current problem can be written as
$$p(D|\mu, \sigma, I) = \frac{1}{(2\pi)^{N/2} \sqrt{\det E}} \exp\left[ -\frac{1}{2} \sum_{ij} (d_i - f_i) [E^{-1}]_{ij} (d_j - f_j) \right] = \frac{1}{(2\pi)^{N/2} \sqrt{\det E}} \exp\left[ -\frac{1}{2} Y^T E^{-1} Y \right], \qquad (10.120)$$
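where $Y^T = \{(d_1 - f_1), (d_2 - f_2), \ldots, (d_N - f_N)\}$ is a vector of the differences between the measured and predicted data values. From the results of the ACF, the data covariance matrix, $E$, is given by
$$E = \sigma^2 \begin{pmatrix} 1 & \rho_1 & \rho_2 & 0 & \cdots & 0 & 0 \\ \rho_1 & 1 & \rho_1 & \rho_2 & \cdots & 0 & 0 \\ \rho_2 & \rho_1 & 1 & \rho_1 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & & \vdots \\ 0 & 0 & 0 & \cdots & \rho_2 & \rho_1 & 1 \end{pmatrix} = \sigma^2 \begin{pmatrix} 1 & 0.68 & 0.16 & 0 & \cdots & 0 & 0 \\ 0.68 & 1 & 0.68 & 0.16 & \cdots & 0 & 0 \\ 0.16 & 0.68 & 1 & 0.68 & \cdots & 0 & 0 \\ \vdots & & & \ddots & & & \vdots \\ 0 & 0 & 0 & \cdots & 0.16 & 0.68 & 1 \end{pmatrix}. \qquad (10.121)$$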
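A banded matrix of this form is easy to build programmatically. The Python sketch below is one possible construction, assuming the correlation depends only on the lag between samples and vanishes beyond the listed lags; the function name and arguments are illustrative.

```python
def banded_covariance(n, sigma, rhos):
    """Build an n x n covariance matrix with E[i][j] = sigma^2 * rho(|i - j|).

    rhos[0] is the lag-1 coefficient, rhos[1] the lag-2 coefficient, and so on.
    Lags beyond len(rhos) are taken to be uncorrelated.
    """
    variance = sigma ** 2
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            lag = abs(i - j)
            if lag == 0:
                matrix[i][j] = variance
            elif lag <= len(rhos):
                matrix[i][j] = variance * rhos[lag - 1]
    return matrix

# Coefficients quoted from the ACF of the smoothed data; sigma = 1 for illustration.
E = banded_covariance(5, 1.0, [0.68, 0.16])
for row in E:
    print(row)
```

For the 100-sample problem one would call the same function with `n=100`; inverting the result for use in Equation (10.120) would need a linear algebra routine that is not shown here.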
10.7 Model comparison with Gaussian posteriors
In this section, we are interested in comparing the probabilities of two linear models with different numbers of amplitude parameters. From our treatment of model comparison in Section 3.5, it is clear that the key quantity in model comparison is the evaluation of the global likelihood of a model. Calculation of the global likelihood requires integrating away all of the model parameters from the product of the prior
and the likelihood. The integral required to calculate the global likelihood was given earlier as Equation (10.14), which we repeat here:
$$p(D|M_i, I) = \frac{1}{\prod_\alpha \Delta A_\alpha} \frac{1}{(2\pi)^{N/2} \sigma^N} \int_{\Delta A} d^M A\; e^{-Q/2\sigma^2} = \frac{1}{\prod_\alpha \Delta A_\alpha} \frac{1}{(2\pi)^{N/2} \sigma^N}\, e^{-Q_{\min}/2\sigma^2} \int_{\Delta A} d^M A\; e^{-\Delta Q/2\sigma^2}, \qquad (10.122)$$
where we have used Equation (10.59) to expand $Q$. We could do the remaining integral by repeating the process of the preceding Section 10.5.1 for each amplitude: complete the square and integrate, one amplitude at a time. This gets to be very tedious if there are a large number of parameters. A mathematically more elegant approach involves transforming to an orthonormal set of model basis functions. The result is given by
$$p(D|M_i, I) = \left[ \frac{(2\pi)^{M/2} \sqrt{\det V}}{\prod_\alpha \Delta A_\alpha} \right] \frac{1}{(2\pi)^{N/2} \sigma^N}\, e^{-\chi^2_{\min}/2} = \Omega_M L_{\max}, \qquad (10.123)$$
where $V$ is the parameter covariance matrix. The quantity $L_{\max}$ is the likelihood for the model at the mode, which is given by
$$L_{\max} = p(D|\hat{A}, M_i) = \frac{1}{(2\pi)^{N/2} \sigma^N}\, e^{-\chi^2_{\min}/2} \qquad (10.124)$$
and the Occam factor for the model is
$$\Omega_M = \frac{(2\pi)^{M/2} \sqrt{\det V}}{\prod_\alpha \Delta A_\alpha}. \qquad (10.125)$$
Assigning competing models equal prior probabilities, the posterior probability for a model will be proportional to Equation (10.123). The odds ratio in favor of one model over a competitor is simply given by the ratio of Equation (10.123) for the two models. Suppose model 1 has $M_1$ parameters, denoted $A_\alpha$, and has a minimum $\chi^2$ equal to $\chi^2_{1,\min}$. Suppose model 2 has $M_2$ parameters, denoted $A'_\alpha$, and has a minimum $\chi^2$ equal to $\chi^2_{2,\min}$. Then the odds ratio in favor of model 1 over model 2 is
$$O_{12} = \frac{p(M_1|I)}{p(M_2|I)} \frac{p(D|M_1, I)}{p(D|M_2, I)} = 1 \times \frac{p(D|M_1, I)}{p(D|M_2, I)} = e^{\Delta\chi^2_{\min}/2}\, (2\pi)^{(M_1 - M_2)/2} \sqrt{\frac{\det V_1}{\det V_2}}\; \frac{\prod_{\alpha=1}^{M_2} \Delta A'_\alpha}{\prod_{\alpha=1}^{M_1} \Delta A_\alpha}, \qquad (10.126)$$
where $V_1$ and $V_2$ are the covariance matrices for the estimated parameters and $\Delta\chi^2_{\min} = \chi^2_{2,\min} - \chi^2_{1,\min}$. If the two models have some parameters in common, then the ratio of the prior ranges $\Delta A_\alpha$ and $\Delta A'_\alpha$ for these parameters will cancel.
Problem: In Section 3.6, we considered two competing theories. One theory ($M_1$) predicted the existence of a spectral line with known Gaussian shape and location, and a line amplitude in the range 0.1 to 100 units. The other theory ($M_2$) predicted no spectral line (i.e., an amplitude $= 0$). We now re-analyze the 64 channel spectral line data given in Table 3.1 using the linear least-squares method discussed in this chapter. We will assume a uniform prior for the line amplitude predictions of $M_1$.$^{16}$
Solution: First we calculate the most probable amplitude using Equation (10.34),
$$\hat{A} = (G^T E^{-1} G)^{-1} G^T E^{-1} D = Y^{-1} G^T E^{-1} D. \qquad (10.127)$$
For $M_1$, the model prediction, $f_i$, is given by
$$f_i = A_1 g_i = A_1 \exp\left( -\frac{(\nu_i - \nu_o)^2}{2\sigma_L^2} \right), \qquad (10.128)$$
and so there is only one basis function, $g_i$. Thus, $G^T$ is given by
$$G^T = (g_1, g_2, \ldots, g_{64}). \qquad (10.129)$$
Also, for this problem, the inverse data covariance matrix is a $64 \times 64$ matrix given by
$$E^{-1} = \frac{1}{\sigma^2} \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}, \qquad (10.130)$$
since the errors are independent and identically distributed with $\sigma^2 = 1$. The $D$ matrix is a column matrix containing the 64 channel spectrometer measurements given in Table 3.1 of Section 3.6. Now that we have specified all the matrices, we can evaluate $\hat{A}$ from Equation (10.127) using Mathematica. Since $M_1$ has only one parameter, the parameter covariance matrix $V = Y^{-1} = (G^T E^{-1} G)^{-1}$ is a single number equal to the variance of $\hat{A}$. The final answer for $\hat{A}$ is
$$\hat{A} = 1.54 \pm 0.53. \qquad (10.131)$$
Now we want to compute the odds ratio, O12 , in favor of M1 over the competing model M2 . Since the two models were assigned equal prior probability, the odds is
16
The diligent reader might object at this point that in Section 3.6, we gave a strong argument for using a Jeffreys prior for the line amplitude. Linear least-squares analysis is widely used in data analysis and we wanted to highlight its strengths and weaknesses which are discussed in the conclusions given at the end of the problem.
given by
$$O_{12} = \frac{p(M_1|I)}{p(M_2|I)} \frac{p(D|M_1, I)}{p(D|M_2, I)} = 1 \times \frac{p(D|M_1, I)}{p(D|M_2, I)}. \qquad (10.132)$$
Model $M_2$ has no undetermined parameter and predicts the spectrum equals zero apart from noise. The quantity $p(D|M_2, I)$ is given by
$$p(D|M_2, I) = \frac{1}{(2\pi)^{N/2} \sigma^N}\, e^{-\chi^2_{2,\min}/2}, \qquad (10.133)$$
where
$$\chi^2_{2,\min} = \sum_{i}^{N=64} \frac{d_i^2}{\sigma^2} = 57.13. \qquad (10.134)$$
For model $M_1$ we can use Equation (10.123) with $M = 1$ parameter, yielding
$$p(D|M_1, I) = \left[ \frac{(2\pi)^{1/2} \sqrt{\det V}}{\Delta A_1} \right] \frac{1}{(2\pi)^{N/2} \sigma^N}\, e^{-\chi^2_{1,\min}/2} = \Omega_1 L_{\max}, \qquad (10.135)$$
where
$$\chi^2_{1,\min} = \sum_{i}^{N=64} \frac{(d_i - \hat{A} g_i)^2}{\sigma^2} = 48.49. \qquad (10.136)$$
Equation (10.135) contains an Occam penalty, $\Omega_1$, which penalizes $M_1$ for prior parameter space that gets ruled out by the data through the likelihood function. The penalty arises automatically from marginalizing over the prior range $\Delta A$. In this case,
$$\Omega_1 = \left[ \frac{(2\pi)^{1/2} \sqrt{\det V}}{\Delta A_1} \right] = 0.0133. \qquad (10.137)$$
Substituting Equations (10.133) and (10.135) into Equation (10.132), we get
$$O_{12} = \Omega_1\, e^{(\chi^2_{2,\min} - \chi^2_{1,\min})/2} = 1.0. \qquad (10.138)$$
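The numbers in this solution can be reproduced with a short Python sketch. It takes the quoted values from the text ($\chi^2_{2,\min} = 57.13$, $\chi^2_{1,\min} = 48.49$, a parameter error of 0.53, and a prior range of 0.1 to 100) and evaluates the Occam factor and odds ratio for the one-parameter case, where $\det V$ is simply the variance of the amplitude. The function names are illustrative.

```python
import math

def occam_factor(det_v, prior_ranges):
    """Occam factor of Equation (10.125) for M = len(prior_ranges) parameters."""
    m = len(prior_ranges)
    return (2.0 * math.pi) ** (m / 2.0) * math.sqrt(det_v) / math.prod(prior_ranges)

def odds_one_parameter_vs_none(chi2_none, chi2_one, sigma_a, prior_range):
    """Odds of a one-parameter model over a zero-parameter model, equal priors."""
    omega = occam_factor(sigma_a ** 2, [prior_range])
    return omega * math.exp((chi2_none - chi2_one) / 2.0)

# Values quoted in the worked problem.
omega_1 = occam_factor(0.53 ** 2, [100.0 - 0.1])
odds = odds_one_parameter_vs_none(57.13, 48.49, sigma_a=0.53,
                                  prior_range=100.0 - 0.1)
print(round(omega_1, 4), round(odds, 2))
```

The likelihood ratio alone, $e^{4.32}$, is roughly 75 in favor of the line, and the Occam factor of about 0.013 brings the odds back to approximately 1.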
Conclusions: a) Not surprisingly, the results obtained here for $\hat{A}$ and $O_{12}$ are the same as we got from the brute force analysis used in Section 3.6 for the uniform prior assumption for $A$. In the current problem, we were dealing with only one parameter (for $M_1$). Some problems involve
a very large number of linear parameters and in these cases, the linear least-squares approach is very efficient because no integrals need to be performed. b) In principle, linear least-squares analysis is only applicable for linear parameters with uniform priors. In Section 3.10, we learned that there are good reasons for distinguishing between location parameters, i.e., both positive and negative values are allowed, and scale parameters, which are always positive. We also learned that there are strong reasons for preferring a Jeffreys prior over a uniform prior when dealing with a scale parameter. Of course, in some problems, we are fortunate to have much more selective prior information about parameter values, e.g., based on the results of previous experiments. The reader may well ask why we chose the spectral line example where the amplitude is a scale parameter. Linear least-squares analysis is widely used in data analysis and we wanted to highlight its strengths and weaknesses. For many parameter estimation problems, the choice of prior is not too important because the posterior probability density is usually dominated by the likelihood function which is generally rather strongly peaked except when there are very little data. c) In model selection problems, the choice of prior is much more critical. In Section 3.6, we addressed this question in considerable detail. We showed that using a more appropriate Jeffreys prior led to an odds ratio favoring M1 which was a factor of 11 larger than for the uniform prior assumption. The main message here is to only use the material on model comparison in Section 10.7 when dealing with parameters for which the choice of a uniform prior is appropriate.
10.8 Frequentist testing and errors
The results in this chapter have been developed from a Bayesian perspective. For comparison purposes, we now introduce a section on model testing and parameter errors from a frequentist perspective. My apologies to those of you who have your Bayesian hat on at this point and can't face the transition again. You can always skip over this section now and return to it if you want to answer question 3(e) in the problems at the end of this chapter. In Section 7.2.1, we discussed the use of the $\chi^2$ statistic in hypothesis testing. Once we have determined the best set of model parameters, we can use the $\chi^2$ statistic to test if the model is acceptable by attempting to reject the model at some confidence level. From Equation (10.41) we see that $\chi^2$ for the fit$^{17}$ is given by
$$\chi^2 = \sum_{ij} (d_i - f_i) [E^{-1}]_{ij} (d_j - f_j). \qquad (10.139)$$
If the errors are independent, this reduces to the more familiar form:
$$\chi^2 = \sum_{i} \frac{(d_i - f_i)^2}{\sigma_i^2}. \qquad (10.140)$$
17 In the frequentist context, if the data errors are IID normal, then, treated as a random variable, this quantity will have a $\chi^2$ distribution with the number of degrees of freedom equal to $N - M$. If the inverse data covariance matrix $E^{-1}$ has non-zero off-diagonal elements, indicating correlations, then the number of degrees of freedom will be less than $N - M$.
If the model contains $M$ parameters and there are $N$ data points, then our confidence in rejecting the model is given by the Mathematica command:
1 - GammaRegularized[(N - M)/2, χ²/2]
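For readers without Mathematica, the same quantity can be sketched in Python using only the standard library. The regularized upper incomplete gamma function is not built into the `math` module, so the version below is a hand-written approximation (a power series for small arguments and a continued fraction otherwise); the function names `gamma_q`, `chi2_significance` and `confidence_in_rejecting` are illustrative, not part of any library.

```python
import math

def gamma_q(a, x, eps=1e-12, max_iter=10000):
    """Regularized upper incomplete gamma function Q(a, x), for a > 0, x >= 0."""
    if a <= 0 or x < 0:
        raise ValueError("require a > 0 and x >= 0")
    if x == 0:
        return 1.0
    log_prefactor = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Power series for the lower function P(a, x); return Q = 1 - P.
        term = 1.0 / a
        total = term
        ap = a
        for _ in range(max_iter):
            ap += 1.0
            term *= x / ap
            total += term
            if abs(term) < abs(total) * eps:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction for Q(a, x), evaluated with the modified Lentz method.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return math.exp(log_prefactor) * h

def chi2_significance(chi2, dof):
    """Area of the chi-squared distribution to the right of the measured value."""
    return gamma_q(dof / 2.0, chi2 / 2.0)

def confidence_in_rejecting(chi2, n_data, n_params):
    """Counterpart of 1 - GammaRegularized[(N - M)/2, chi2/2]."""
    return 1.0 - chi2_significance(chi2, n_data - n_params)

print(chi2_significance(2.0, 2), confidence_in_rejecting(10.0, 4, 2))
```

For two degrees of freedom the tail area reduces to $e^{-\chi^2/2}$, which gives a simple sanity check on the implementation.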
Some words of caution are in order on the use of the above for rejecting a hypothesis. First, GammaRegularized[(N - M)/2, χ²/2] measures the significance of the test, which equals the area of the $\chi^2$ distribution to the right of our measured value. Again, if the model is correct and the data errors are known to be IID normal, this area represents the fraction of hypothetical repeats of the experiment that are expected to fall in this tail area by chance. If this area is very small, then there are at least three possibilities: (1) we have underestimated the size of the data errors, (2) the model does a poor job of explaining the systematic structure in the data, or (3) the model is correct; the result is just a very unlikely statistical fluctuation. Because experimental errors are frequently underestimated, it is not uncommon to require a significance < 0.001 before rejecting a hypothesis. Note: if the significance is very large, e.g., 0.5, this is an indication that the data errors may have been overestimated. Note: Mathematica provides a command called ChiSquarePValue[χ², N - M] which has a maximum value of 0.5. This command can sometimes lead to confusion because it measures the area in either tail region. Thus, if the measured value of $\chi^2$ is less than the number of degrees of freedom (the expectation value for the hypothetical reference distribution), we would not want to refer to our confidence in rejecting the model as 1-ChiSquarePValue. The above model test assumes we know the errors accurately. What if the scatter in the data from the best fitting model is considerably larger than the data errors used in the analysis but we are convinced that the model is correct? The other option is that we have underestimated the errors. Perhaps the model correctly describes some aspect of the data but in addition something else is going on.
Frequently this is how we discover the presence of some new phenomenon: by looking for systematic effects in the residuals after subtracting off the best-fit model from the data. In Section 10.5.1, we saw that information about the model parameter errors is contained in the covariance matrix, one form of which is given by
$$V = \sigma^2 \boldsymbol{\psi}^{-1} = \sigma^2 (G^T G)^{-1}. \qquad (10.141)$$
If we have underestimated the data errors ($\sigma$) then this will lead to our underestimating the parameter errors. We saw in Section 9.2.3 that in a Bayesian analysis, we can marginalize over any unknown data error and ensure that our parameter uncertainties properly reflect the size of the residuals between the best fitting model and data (see Figure 9.3).
A useful frequentist method for obtaining more robust parameter errors is based on assuming the model is correct and then adjusting all the assumed measurement errors by a factor $k$. The new value of $\chi^2$ is then given by
$$\chi^2 = \sum_{i=1}^{N} \frac{(d_i - f_i)^2}{k^2 \sigma_i^2} = \frac{\chi^2_{\mathrm{meas}}}{k^2}, \qquad (10.142)$$
where $\chi^2_{\mathrm{meas}}$ is the value of $\chi^2$ computed using the initial $\sigma_i$ error estimates. The factor $k$ is computed in the following way. When the model is valid, and the data errors are known, the expected value of $\chi^2$ for the best choice of model parameters, $\chi^2_{\mathrm{expect}}$, is equal to the number of degrees of freedom $= N - M$, where $N$ = the number of data points, and $M$ = the number of fit parameters. The procedure then is to adjust the value of $k$ so that $\chi^2$ in Equation (10.142) is equal to $N - M$.
$$\chi^2 = \frac{\chi^2_{\mathrm{meas}}}{k^2} = \chi^2_{\mathrm{expect}} = N - M. \qquad (10.143)$$
The solution is
$$k = \sqrt{\frac{\chi^2_{\mathrm{meas}}}{N - M}}. \qquad (10.144)$$
Including a factor $k$ is a good thing to do since often nature is more complicated than either the model or known measurement errors can account for. Increasing all the measurement errors from $\sigma$ to $k\sigma$ corresponds to increasing the terms in the parameter covariance matrix (see Equation (10.141)) by $k^2$ or the parameter standard errors by $k$. The equivalent operation in Bayesian analysis would be to introduce $k$ as a nuisance parameter and marginalize over our prior uncertainty in this parameter. The method just described for obtaining more accurate error estimates assumed that the model was true. We cannot turn around and use these errors in a $\chi^2$ hypothesis test to see if we can reject the model. In contrast, in a Bayesian analysis, we can allow for uncertainty in the value of $k$ and also carry out a model selection operation after marginalizing over the unknown $k$. Recall that whenever we marginalize over a model parameter, this introduces an Occam penalty which penalizes the model for our prior uncertainty in the parameter in a quantitative fashion. If all the models that we are comparing in the model selection operation depend on this parameter in the same way, then these Occam penalties cancel out. If they depend on the parameter in different ways (e.g., models $p(C, S|D, I)$ and $p(C, \bar{S}|D, I)$ in Section 9.4), then the Occam penalties do not cancel.
10.8.1 Other model comparison methods

It is often useful to plot the variance of the residuals versus the number of model fitting parameters, $M$. For example, in the case of a polynomial model we can vary the
number of parameters. Of course, for $M = 0$, the residual variance is just the data variance. We can characterize a model by how quickly the curve of residual variance versus $M$ drops. Any variance curve which drops below another indicates a model which is better, in the sense that it achieves a better quality fit to the data with a given number of model functions. What one would expect to find is a very rapid drop as the systematic signal is taken up by the model, followed by a slow drop as additional model functions expand the noise. The total number of useful model parameters is determined by the break in this curve. In constructing the residual variance curves, we need to be aware that if we were to rearrange the order in which the best model terms are incorporated, we can always produce a curve that is above that of the same model but with a different order. We want to order the model parameters to produce the lowest residual variance curve before selecting the break point.

The F statistic can also be used to decide which basis functions are significant. Suppose our null hypothesis is that a model with $M$ unknown parameters is the correct model. If the model's prediction for the $i$th data point is designated $f_i$, then

$$\chi^2_\nu = \sum_{i=1}^{N} \frac{(d_i - f_i)^2}{\sigma^2} \tag{10.145}$$
has a $\chi^2$ distribution with $\nu = N - M$ degrees of freedom. Consider the effect on $\chi^2$ when an extra term is added to the model fitting function so the number of degrees of freedom is decreased by 1. If the simpler model is true then the effect of the extra term is to remove some of the noise variation. The expected decrease in $\chi^2$ is the same as if we had not added the extra term but simply reduced $N$ by one. Thus

$$\Delta\chi^2 = \chi^2_\nu - \chi^2_{\nu-1} \tag{10.146}$$
has a $\chi^2$ distribution with 1 degree of freedom. According to equation (6.35),

$$F = \frac{\Delta\chi^2}{\chi^2_{\nu-1}/(\nu - 1)} \tag{10.147}$$
follows an $F$ distribution$^{18}$ with 1 and $\nu - 1$ degrees of freedom. If the simpler model is correct you expect to get an $f$ ratio near 1.0. If the ratio is much greater than 1.0, there are two possibilities:

1. The more complicated model is correct.
2. The simpler model is correct, but random scatter led the more complicated model to fit better. The P-value tells you how rare this coincidence would be.

$^{18}$ Since $\sigma^2$ is a common factor in the calculation of both $\Delta\chi^2$ and $\chi^2_{\nu-1}$, we can rewrite equation (10.147) as

$$F = \frac{\sum_{i=1}^{N} (d_i - f_i)^2 - \sum_{i=1}^{N} (d_i - f_{+i})^2}{\sum_{i=1}^{N} (d_i - f_{+i})^2 / (\nu - 1)} \tag{10.148}$$

where $f_{+i}$ = the predicted value for the model with the extra term. Thus it is not necessary to know $\sigma^2$ to carry out this F-test.
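As a minimal sketch of Equation (10.148), the following Python compares a constant model with a straight-line model. The data values, the helper name `f_statistic`, and the choice of these two models are assumptions made for illustration; only the $F$ ratio is computed, not its P-value.

```python
def f_statistic(data, f_simple, f_extra, dof_extra):
    """F of Equation (10.148); dof_extra is nu - 1, the degrees of freedom of the larger model."""
    rss_simple = sum((d - f) ** 2 for d, f in zip(data, f_simple))
    rss_extra = sum((d - f) ** 2 for d, f in zip(data, f_extra))
    return (rss_simple - rss_extra) / (rss_extra / dof_extra)

# Hypothetical data with an obvious linear trend.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
d = [0.1, 0.9, 2.2, 2.8, 4.1, 4.9]
n = len(x)

# Simpler model (M = 1): a constant equal to the sample mean.
mean_d = sum(d) / n
f_const = [mean_d] * n

# Model with one extra term (M = 2): least-squares straight line.
mean_x = sum(x) / n
s_xy = sum((xi - mean_x) * (di - mean_d) for xi, di in zip(x, d))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
f_line = [mean_d + slope * (xi - mean_x) for xi in x]

F = f_statistic(d, f_const, f_line, dof_extra=n - 2)
print(F)
```

Note that $\sigma$ never appears, as the footnote points out. Converting the ratio to a P-value requires the survival function of the $F$ distribution with 1 and $\nu - 1$ degrees of freedom, which a statistics library can supply (for example `scipy.stats.f.sf`, if SciPy is available).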
If the P-value is small enough (e.g., below 5%), reject the simpler model. Otherwise, conclude that there is no compelling evidence to reject the simpler model.

As an example, we use the F-test to compare models $M_1$ (line exists) and $M_2$ (no line exists) in the spectral line problem of Section 3.6. The best fit for $M_1$ (63 degrees of freedom) yielded $\chi^2_{\nu=63} = 48.49$. For $M_2$, with 64 degrees of freedom, $\chi^2_{\nu=64} = 57.13$. Substituting these values into equation (10.147) yields $f = 10.5$. This corresponds to a P-value = 0.2%. On the basis of this F-test, we can reject the simpler model $M_2$ at a 99.8% confidence level.

10.9 Summary

Here we briefly summarize the main results of this chapter:

1. We saw how the Bayesian treatment leads to the familiar method of least-squares when we are interested in the question of the most probable set of model parameters (see Equation (10.34)), assuming an IID normal distribution for our knowledge of the measurement errors and a flat prior for each parameter.
2. We then relaxed the IID requirement for our knowledge of the measurement errors by introducing $\mathbf{E}$, the covariance matrix for the errors. Equation (10.48) gives the revised solution for the most probable set of parameters. Weighted linear least-squares can be seen as a special case of this equation.
3. A full description of our knowledge of the model parameters is given by the joint posterior distribution for the parameters. For a linear model, and a flat prior for each parameter, this distribution is particularly simple, namely a multivariate Gaussian. Equation (10.75) or Table 10.2 defines the boundary, $\Delta\chi^2_{\rm crit}$, of a (joint) credible region for one or more of the parameters that contains a specified probability. Also, it turns out that for Gaussian posteriors, in any number of dimensions, the marginal PDF is also equal to the projected distribution (projected PDF).
4. A useful summary of the parameter errors is given by the model parameter covariance matrix, $\mathbf{V} = \sigma^2 \boldsymbol{\psi}^{-1}$. If we are employing the covariance matrix, $\mathbf{E}$, for our knowledge of the measurement errors, then simply replace $\boldsymbol{\psi}^{-1}$ by $\mathbf{Y}^{-1} = (\mathbf{G}^T \mathbf{E}^{-1} \mathbf{G})^{-1}$ and drop all the factors of $\sigma^2$ in Equations (10.93), (10.94), (10.100), and (10.101).
5. The parameter covariance matrix also provides information about the correlation between the estimates of any pair of model parameters, which is conveniently expressed by the correlation coefficient $\rho_{\alpha\beta}$ (see Equation (10.102)), ranging between $\pm 1$. If $\rho_{\alpha\beta}$ is close to $\pm 1$ then it will not be possible to estimate reliably the two parameters separately, but we can still infer a linear combination of the parameters quite well. This comes about because the model basis functions are not orthogonal. At the end of Section 10.5.2, we show how to construct an orthogonal polynomial model.
6. The key quantity in Bayesian model comparison is the global likelihood of a model. Calculation of the global likelihood requires integrating away all of the model parameters. The final result regarding Bayesian model selection is usually expressed as an odds ratio, which is given by Equation (10.126). It is important to remember that this equation assumes uniform parameter priors and prior boundaries well removed from the peak of the posterior. Where these assumptions do not hold, the necessary marginalizations must in general be carried out numerically and the resulting odds ratio can be very different. See Section 3.6 for a detailed example of this latter point.
7. Common frequentist methods for model testing and estimating robust parameter errors are discussed in Section 10.8.
10.10 Problems

1. Fit a straight line model to the data given in Table 10.3, where $d_i$ is the average of $n_i$ data values measured at $x_i$. The probability of the individual $d_i$ measurements is normal with $\sigma = 4.0$, regardless of the $x_i$ value.
   a) Give the slope and intercept of the best-fit line together with their errors.
   b) Plot the best-fit straight line together with the data values and their errors.
   c) Give the parameter covariance matrix.
   d) Repeat (a) and (c) but this time use the average x-coordinate as the origin. Comment on the differences between the covariance matrices in (c) and (d).

2. Compute and plot the ellipses that define the 68.3% and 95.4% joint credible regions for the slope and intercept, for the data given in Table 10.3. The shape of this ellipse depends on the x-coordinate origin used in the fit (see Figure 10.5). Use the average x-coordinate as the origin. See the section of the Mathematica tutorial entitled "Joint Credible Region Contouring."

3. Table 10.4 gives measurements of ozone partial pressure, $y$, in millibars in each of 15 atmospheric layers where each layer, $x$, is approximately 2 km in height. The layers have been scaled for convenience from $-7$ to $+7$. Use the least-squares method to fit the data with
   (i) a quadratic model: $y(x) = A_1 + A_2 x + A_3 x^2$
   (ii) a cubic model: $y(x) = A_1 + A_2 x + A_3 x^2 + A_4 x^3$
Table 10.3 Data table

  x_i      d_i     n_i
   10     0.387     14
   20     5.045      3
   30     7.299     25
   40     6.870      2
   50    16.659      3
   60    13.951     22
   70    16.781      5
   80    20.323      2
Table 10.4 Measurements of ozone partial pressure, $y$, in millibars in each of 15 atmospheric layers where each layer, $x$, is approximately 2 km in height. The layers have been scaled for convenience from $-7$ to $+7$.

Layer  Pressure    Layer  Pressure    Layer  Pressure    Layer  Pressure
  -7     53.8        -5     73.2        -2     97.4         3     93.6
  -7     53.3        -5     75.6        -2     98.3         3     86.2
  -7     54.8        -5     76.2        -1    102.8         3     87.9
  -7     54.6        -5     72.7        -1     96.9         3     89.5
  -7     53.7        -4     79.4        -1     98.2         4     74.8
  -7     55.2        -4     81.1         0     98.9         4     82.3
  -7     55.7        -4     85.2         0     96.1         4     76.9
  -7     54.1        -4     83           0     99.6         4     81.2
  -6     63.8        -4     84.1         0     91.4         5     73.6
  -6     64.2        -4     82.8         1    101.1         5     65.4
  -6     66.9        -3     90.3         1     94.6         5     67.1
  -6     67.2        -3     84.2         1     95.9         6     60.2
  -6     65.4        -3     88.3         2     92.3         6     54.9
  -6     67.3        -3     86           2     96.6         6     50.8
  -5     71.8        -2     93.2         2     98.5         7     44.7
                                                            7     38.5
Please include the following items as part of your solution:

   a) In this problem, you don't know that the raw data errors are normally distributed, or even if the variance is the same from one layer to the next. Explain how you can take advantage of the Central Limit Theorem (CLT) in this problem.
      Note: real data are seldom as nice as we would like. For some ozone layers there are fewer than five data values (the approximate number recommended for applying the CLT), so you may want to combine data for some of the layers where this is a problem. Of course, combining layers results in lower structural resolution.
      Note: you must provide a table of the ozone values and computed errors you actually used in your model fitting. Explain how you computed the errors.
   b) Determine the parameters for each model and the variance-covariance matrix. Quote an error for each parameter and explain what your errors mean.
   c) Compare the models with the data by plotting the model fits on the same graph as your data. Include error bars (as you have determined them to be) on the data points used for fitting.
   d) Compute the Bayesian odds ratio of the cubic model to the quadratic model. For the purpose of this calculation, assume the prior information warrants a flat prior probability for each parameter with ranges $\Delta A$ given by: $\Delta A_1 = 100$, $\Delta A_2 = \Delta A_3 = 10$, and $\Delta A_4 = 1$. Explain in words what you conclude from this.
   e) Calculate the frequentist $\chi^2$ goodness-of-fit statistic and the P-value (significance) for each model. The confidence in rejecting the model = 1 $-$ P-value. Explain what you conclude from these goodness-of-fit results.

4. Repeat the analysis of the ozone data as described in the previous problem, but this time adopt the following different strategy: instead of rebinning the data to take advantage of the CLT, use the original binning as given in Table 10.4. According to the MaxEnt principle (see Section 8.7.4), unless we have prior information that justifies the use of some other sampling distribution, then use a Gaussian sampling distribution. It makes the fewest assumptions about the information we don't have and will lead to the most conservative estimates. Use Equation (9.51) to estimate $\sigma$ of each layer. In contrast to the approach proposed in the previous problem, we do not have to sacrifice the resolution of the original data through rebinning.
11 Nonlinear model fitting
11.1 Introduction

In the last chapter, we learned that the posterior distribution for the parameters in a linear model with Gaussian errors and flat priors is itself a multivariate Gaussian. The topology for this distribution in the multi-dimensional parameter space is very simple. In contrast, even for flat priors, the topology of the posterior for a nonlinear model can be very complex with many hills and valleys.

Examples of nonlinear models:

1. $f_i = A_1 \cos \omega t_i + A_2 \sin \omega t_i$,
   where $A_1$, $A_2$ are linear parameters, and $\omega$ is a nonlinear parameter.
2. $f_i = A_1 + A_2 \exp\left\{-\dfrac{(x_i - C_1)^2}{2\sigma_1^2}\right\} + A_3 \exp\left\{-\dfrac{(x_i - C_2)^2}{2\sigma_2^2}\right\}$,
   where $A_1, A_2, A_3$ are linear parameters, and $C_1, C_2, \sigma_1^2, \sigma_2^2$ are nonlinear parameters.

In this chapter, we will let $\theta$ represent the set of all parameters, both linear and nonlinear, and $\hat{\theta}$ the most probable set of the parameters. Again, the problem is to find the most probable set of parameters together with an estimate of their errors. (Of course, if the posterior has several maxima of comparable magnitude then it doesn't make sense to talk about a single best set of parameters.)

The Bayesian solution to the problem is very simple in principle but can be very difficult in practice. The calculations require integrals over the parameter space which can be difficult to evaluate. The brute force approach is as follows: for a one-parameter model, the most robust way is to plot the posterior or $\chi^2$. This entails division of the parameter range into a finite number of grid points. As long as there are enough grid points to cover the prior range (a few hundred is usually adequate), this will usually work. It doesn't matter whether the posterior PDF is asymmetric, multi-modal or differentiable.
This approach can easily be extended to two parameters. It is also easy to compute marginal distributions. One need only add up the probabilities in the $\theta_1$ or $\theta_2$ direction, as appropriate. After the two-parameter case, however, this approach rapidly becomes impractical. In fact, the number of calculations is $(100)^M$, where $M$ is the number of parameters. For example:

- 2 parameters might take 100 milliseconds to compute;
- 5 parameters might take one day to compute;
- 11 parameters might take the age of the universe to compute.

Fortunately, the last fifteen years have seen remarkable developments in practical algorithms for performing Bayesian calculations (Loredo, 1999). They can be grouped into three families: asymptotic approximations; methods for moderate dimensional models; and methods for high dimensional models. In this chapter, we will mainly be concerned with solutions that assume the posterior distribution for the parameters can be approximated by a multivariate Gaussian. We will first illustrate this in a simulation and then focus on methods for efficiently finding the most probable set of parameters and their covariance matrix. In the following chapter, we will give an introduction to Markov chain Monte Carlo algorithms which facilitate full Bayesian calculations for nonlinear models involving very large numbers of parameters.
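The brute-force grid approach described earlier in this section can be sketched as follows for two parameters. The log posterior used here is a hypothetical, deliberately non-Gaussian function chosen only for illustration, and the grid limits stand in for the prior range.

```python
import math

def grid_posterior(log_post, grid1, grid2):
    """Evaluate exp(log_post) on a 2-D grid and normalize so the cells sum to 1."""
    logp = [[log_post(t1, t2) for t2 in grid2] for t1 in grid1]
    peak = max(max(row) for row in logp)          # subtract the peak to avoid underflow
    p = [[math.exp(v - peak) for v in row] for row in logp]
    total = sum(sum(row) for row in p)
    return [[v / total for v in row] for row in p]

def marginals(p):
    """Add up the gridded joint probability along each parameter direction."""
    marg1 = [sum(row) for row in p]               # sum over theta_2
    marg2 = [sum(col) for col in zip(*p)]         # sum over theta_1
    return marg1, marg2

# Hypothetical unnormalized log posterior: a curved, correlated ridge.
def log_post(t1, t2):
    return -0.5 * ((t1 - t2 ** 2) / 0.3) ** 2 - 0.5 * (t2 / 0.8) ** 2

n = 100                                           # 100 grid points per parameter
grid1 = [-1.0 + 4.0 * i / (n - 1) for i in range(n)]
grid2 = [-3.0 + 6.0 * i / (n - 1) for i in range(n)]
p = grid_posterior(log_post, grid1, grid2)
marg1, marg2 = marginals(p)
mode2 = grid2[marg2.index(max(marg2))]
print(len(grid1) * len(grid2), "posterior evaluations; marginal mode of theta_2 near", mode2)
```

With 100 points per parameter the cost is $100^M$ evaluations, which is why the same loop structure becomes impractical beyond a few parameters.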
11.2 Asymptotic normal approximation

Expressed in frequentist language, asymptotic theory tells us that the maximum likelihood estimator becomes more unbiased, more normally distributed and of smaller variance as the sample size becomes larger (see Lindley, 1965). In other words, as the sample size increases, the nonlinear problem asymptotically approaches a linear problem. From a Bayesian perspective, the posterior distribution for the parameters asymptotically approaches a multivariate normal (Gaussian) distribution. We will illustrate this with a simulation.

We simulated data sets with different numbers of data points by randomly sampling a nonlinear model, represented by $f(x|\alpha)$, which has one nonlinear parameter $\alpha$. We also added independent Gaussian noise to each data point with a mean of zero and $\sigma = 2$. The data values are described by the equation

$$y_i = f(x_i|\alpha = 2/3) + e_i. \tag{11.1}$$
Figure 11.1 illustrates a set of $N = 12$ simulated data points together with a plot of the known model prediction. Of course, the data points differ from this model because of the added noise. We then carry out a Bayesian analysis of the simulated data, assuming we know the mathematical form of the model but not the value of the model parameter, $\alpha$. Our goal is to infer the posterior PDF for $\alpha$ assuming a flat prior. The steps involved in calculating the posterior should now be fairly familiar to the reader (e.g., Section 10.2). The resulting PDFs are graphed in Figure 11.2 for four data sets of different size, $N$. It is apparent from this simulation that for small data sets, the posterior exhibits multiple peaks, but as $N$ increases, the posterior approaches a Gaussian shape with a decreasing variance. The conclusion is not affected by the choice of prior; in large samples, the data totally dominate the priors and the result converges on a value of $\alpha = 2/3$, the value used to simulate the data. For a nonlinear model with $M$ parameters, the joint posterior for the parameters asymptotically approaches an $M$-dimensional multivariate Gaussian as the number of data points becomes much greater than the number of unknown parameters.

Figure 11.1 A simulated set of 12 data points ($y$ versus $x$) for a nonlinear model with the one parameter $\alpha = 2/3$ (solid line) plus added Gaussian noise.
Figure 11.2 The Bayesian posterior density function for the nonlinear model parameter $\alpha$ for four simulated data sets of different size ranging from $N = 5$ to $N = 80$ (curves shown for $N$ = 5, 10, 40 and 80; probability density versus $\alpha$ parameter value). The $N = 5$ case has the broadest distribution and exhibits four maxima.
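A simulation in the spirit of Figures 11.1 and 11.2 can be sketched as below. The functional form of the model is not given in the text, so the `model` function here is a hypothetical stand-in with one nonlinear parameter; the noise level, true parameter value and sample sizes follow the text, while the random seed and the evenly spaced $x$ values are arbitrary choices.

```python
import math
import random

def model(x, alpha):
    # Hypothetical nonlinear model; the form used in the text's simulation is not stated.
    return 3.0 + 3.0 * math.sin(6.0 * math.pi * alpha * x)

def posterior_on_grid(xs, ys, sigma, alphas):
    """Flat-prior posterior for alpha on a uniform grid, normalized to unit area."""
    chi2 = [sum(((y - model(x, a)) / sigma) ** 2 for x, y in zip(xs, ys)) for a in alphas]
    best = min(chi2)
    p = [math.exp(-0.5 * (c - best)) for c in chi2]
    step = alphas[1] - alphas[0]
    norm = sum(p) * step
    return [v / norm for v in p]

def summarize(alphas, pdf):
    """Return the posterior mode and standard deviation computed on the grid."""
    step = alphas[1] - alphas[0]
    mean = sum(a * v for a, v in zip(alphas, pdf)) * step
    var = sum((a - mean) ** 2 * v for a, v in zip(alphas, pdf)) * step
    mode = alphas[pdf.index(max(pdf))]
    return mode, math.sqrt(var)

random.seed(1)
alpha_true, sigma = 2.0 / 3.0, 2.0
alphas = [i / 400 for i in range(1, 400)]
results = {}
for n in (5, 10, 40, 80):
    xs = [(i + 0.5) / n for i in range(n)]
    ys = [model(x, alpha_true) + random.gauss(0.0, sigma) for x in xs]
    results[n] = summarize(alphas, posterior_on_grid(xs, ys, sigma, alphas))
    print(n, results[n])
```

For this stand-in model the posterior width should shrink as $N$ grows and the mode should settle near the true value, which is the qualitative behaviour the text describes.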
In what follows, we will assume that in the vicinity of the mode of the joint posterior, the product of the prior and likelihood can be approximated by a multivariate Gaussian. We want to develop a convenient mathematical formulation to describe an approximate multivariate Gaussian. We start with one form of the posterior for a true multivariate Gaussian we developed for linear models in Section 10.4 which we repeat here (see Equation (10.66)), only this time we let $\mathbf{A}$ stand for the set of linear model parameters that we previously wrote as $\{A_\alpha\}$, with $\delta\mathbf{A} = \mathbf{A} - \hat{\mathbf{A}}$:

$$p(\mathbf{A}|D, M, I) = C' \exp\left\{-\frac{\delta\mathbf{A}^T \boldsymbol{\psi}\, \delta\mathbf{A}}{2\sigma^2}\right\}. \tag{11.2}$$
This equation describes the joint posterior for a set of linear model parameters assuming flat priors for the parameters. When we use the more powerful matrix formulation which includes the data covariance matrix $\mathbf{E}$ (see Section 10.5.3), then we replace $\boldsymbol{\psi}$ by $\mathbf{Y}$ and rewrite Equation (11.2) as

$$p(\mathbf{A}|D, M, I) = C' e^{-\frac{1}{2}(\delta\mathbf{A}^T \mathbf{Y}\, \delta\mathbf{A})}. \tag{11.3}$$
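The term $C'$ is the value of the posterior at the mode, which can be written as the product of the prior times the maximum value of the likelihood. The exponential term in Equation (11.3) describes the variation of the likelihood about the mode which has the form of a multivariate Gaussian. Thus, we can rewrite Equation (11.2) as

$$p(\mathbf{A}|D, M, I) \propto p(\mathbf{A}|M, I)\, p(D|\mathbf{A}, M, I) = p(\mathbf{A}|M, I)\, L(\mathbf{A}) = p(\hat{\mathbf{A}}|M, I)\, L(\hat{\mathbf{A}}) \exp\left[-\frac{1}{2} \sum_{\alpha\beta} (A_\alpha - \hat{A}_\alpha)[\mathbf{Y}]_{\alpha\beta}(A_\beta - \hat{A}_\beta)\right]. \tag{11.4}$$

Now take the natural logarithm of both sides.

$$\ln[p(\mathbf{A}|M, I)\, L(\mathbf{A})] = \ln[p(\hat{\mathbf{A}}|M, I)\, L(\hat{\mathbf{A}})] - \frac{1}{2} \sum_{\alpha\beta} (A_\alpha - \hat{A}_\alpha)[\mathbf{Y}]_{\alpha\beta}(A_\beta - \hat{A}_\beta). \tag{11.5}$$

We can show that $\mathbf{Y}$ is a matrix of second derivatives of $\ln[p(\mathbf{A}|M, I)\, L(\mathbf{A})]$ at $\mathbf{A} = \hat{\mathbf{A}}$:

$$[\mathbf{Y}]_{\alpha\beta} = -\frac{\partial^2}{\partial A_\alpha \partial A_\beta} \ln[p(\mathbf{A}|M, I)\, L(\mathbf{A})] \quad (\text{at } \mathbf{A} = \hat{\mathbf{A}}). \tag{11.6}$$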
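For the nonlinear model case, we will represent the set of model parameters by $\theta$ and write an equation analogous to (11.4):

$$p(\theta|D, M, I) \approx p(\hat{\theta}|M, I)\, L(\hat{\theta}) \exp\left[-\frac{1}{2} \sum_{\alpha\beta} (\theta_\alpha - \hat{\theta}_\alpha)[\mathbf{I}]_{\alpha\beta}(\theta_\beta - \hat{\theta}_\beta)\right], \tag{11.7}$$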
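where $\mathbf{I}$ is called the Fisher information matrix and is the nonlinear problem analog of $\mathbf{Y}$ in the linear case. The approximate sign in the above equation is there because the posterior is only approximately a multivariate Gaussian at the mode. We can rewrite Equation (11.7) as

$$\ln[p(\theta|M, I)\, L(\theta)] \approx \ln[p(\hat{\theta}|M, I)\, L(\hat{\theta})] - \frac{1}{2} \sum_{\alpha\beta} (\theta_\alpha - \hat{\theta}_\alpha)[\mathbf{I}]_{\alpha\beta}(\theta_\beta - \hat{\theta}_\beta). \tag{11.8}$$

$\mathbf{I}$ is a matrix of second derivatives of $\ln[p(\theta|M, I)\, L(\theta)]$ at $\theta = \hat{\theta}$:

$$[\mathbf{I}]_{\alpha\beta} = -\frac{\partial^2}{\partial \theta_\alpha \partial \theta_\beta} \ln[p(\theta|M, I)\, L(\theta)] \quad (\text{at } \theta = \hat{\theta}). \tag{11.9}$$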
Recall that $\mathbf{Y}^{-1}$ is the covariance matrix of the parameters in the linear problem. $\mathbf{Y}^{-1}$ provides a measure of how wide or spread out the Gaussian is. If the posterior in the nonlinear problem is not Gaussian, but is unimodal (single peak), then $\mathbf{I}^{-1}$ does not give the variances and covariances of the posterior distribution. However, it may give a good estimate of them, and is probably easier to calculate than the integrals required to get the variances and covariances. A difficulty arising in these computations is that it has not been possible to present guidelines for how large the sample size must be for asymptotic properties to be closely approximated.

In Section 11.4, we will assume the approximation is good enough, and focus on useful schemes for finding the most probable parameters, $\hat{\theta}$. But first we will investigate another useful type of approximation that allows us to obtain a better estimate of the desired Bayesian quantities without having to perform complicated integrals. These kinds of approximation originated with Laplace, so they are called Laplacian approximations.
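Equation (11.9) can be evaluated numerically when analytic second derivatives are inconvenient. The sketch below uses central differences on a hypothetical two-parameter exponential-decay model with a flat prior. The model, the data values, and the use of noise-free data (so that the mode is known exactly) are all assumptions made to keep the example short.

```python
import math

def neg_log_post(theta, xs, ds, sigma):
    """-ln[prior x likelihood] up to a constant, for a flat prior: chi^2 / 2."""
    amp, tau = theta
    return 0.5 * sum(((d - amp * math.exp(-x / tau)) / sigma) ** 2 for x, d in zip(xs, ds))

def information_matrix(func, theta_hat, h=1e-4):
    """Central-difference estimate of the second derivatives of func at theta_hat."""
    m = len(theta_hat)
    info = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for b in range(m):
            def shifted(sa, sb):
                t = list(theta_hat)
                t[a] += sa * h
                t[b] += sb * h
                return func(t)
            info[a][b] = (shifted(1, 1) - shifted(1, -1)
                          - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
    return info

def invert_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

xs = [0.5 * i for i in range(10)]
amp_true, tau_true, sigma = 10.0, 2.0, 0.5
ds = [amp_true * math.exp(-x / tau_true) for x in xs]   # noise-free, so the mode is the true value
theta_hat = [amp_true, tau_true]

info = information_matrix(lambda t: neg_log_post(t, xs, ds, sigma), theta_hat)
cov = invert_2x2(info)                                   # estimate of the parameter covariance matrix
errors = [math.sqrt(cov[0][0]), math.sqrt(cov[1][1])]
print(info, errors)
```

Because the function differentiated is minus the log of prior times likelihood, the resulting matrix of second derivatives is $\mathbf{I}$ itself, and its inverse plays the role that $\mathbf{Y}^{-1}$ plays in the linear problem.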
11.3 Laplacian approximations

11.3.1 Bayes factor

Suppose we want to compute the Bayes factor for model comparison (Section 3.5). In this case, we need to compute the global likelihood, $p(D|M, I)$, by integrating over all the model parameters (also required for the normalization constant in parameter estimation). We can evaluate this from Equation (11.7).

$$p(D|M, I) = \int d\theta\, p(\theta|M, I)\, L(\theta) \approx p(\hat{\theta}|M, I)\, L(\hat{\theta}) \int d\theta \exp\left[-\frac{1}{2}\left(\delta\boldsymbol{\theta}^T \mathbf{I}\, \delta\boldsymbol{\theta}\right)\right], \tag{11.10}$$
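where $[\delta\boldsymbol{\theta}]_\alpha = (\theta_\alpha - \hat{\theta}_\alpha)$. We can use the principal axis theorem to make a change of variables according to $\delta\boldsymbol{\theta} = \mathbf{O}\mathbf{X}$, that transforms $\delta\boldsymbol{\theta}^T \mathbf{I}\, \delta\boldsymbol{\theta}$ to $\mathbf{X}^T \boldsymbol{\Lambda} \mathbf{X}$, where $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues of the $\mathbf{I}$ matrix. The columns of $\mathbf{O}$ are the orthonormal eigenvectors of $\mathbf{I}$. Let $\lambda_1, \lambda_2, \ldots, \lambda_M$ be the eigenvalues of $\mathbf{I}$. Then we can write

$$\mathcal{I} = \int d\theta \exp\left[-\frac{1}{2}\left(\delta\boldsymbol{\theta}^T \mathbf{I}\, \delta\boldsymbol{\theta}\right)\right] = J \int d\mathbf{X} \exp\left[-\frac{1}{2}\left(\mathbf{X}^T \boldsymbol{\Lambda} \mathbf{X}\right)\right], \tag{11.11}$$

where $J = |\det \mathbf{O}|$ is the Jacobian of the transformation, $\int d\theta = J \int d\mathbf{X}$. Since the columns of $\mathbf{O}$ are orthonormal, $J = 1$. For the $M = 2$ case,

$$\begin{aligned}
\mathcal{I} &= \int dX_1 \exp\left[-\frac{\lambda_1 X_1^2}{2}\right] \int dX_2 \exp\left[-\frac{\lambda_2 X_2^2}{2}\right] \\
&= (\sqrt{2\pi})^2 \frac{1}{\sqrt{\lambda_1}} \frac{1}{\sqrt{\lambda_2}} \int dX_1 \frac{1}{\sqrt{2\pi}\,(1/\sqrt{\lambda_1})} \exp\left[-\frac{X_1^2}{2/\lambda_1}\right] \int dX_2 \frac{1}{\sqrt{2\pi}\,(1/\sqrt{\lambda_2})} \exp\left[-\frac{X_2^2}{2/\lambda_2}\right] \\
&= (\sqrt{2\pi})^2 \frac{1}{\sqrt{\lambda_1}} \frac{1}{\sqrt{\lambda_2}}.
\end{aligned} \tag{11.12}$$

For the general case of arbitrary $M$, we have

$$\mathcal{I} = (2\pi)^{M/2} \frac{1}{\sqrt{\prod_{\alpha} \lambda_\alpha}}. \tag{11.13}$$

We can express our result for $\mathcal{I}$ in terms of $\det \mathbf{I}$ by writing $\mathbf{I} = \mathbf{O}\boldsymbol{\Lambda}\mathbf{O}^T$. Then

$$\det \mathbf{I} = \det \mathbf{O}^T \det \boldsymbol{\Lambda} \det \mathbf{O} = \prod_{\alpha=1}^{M} \lambda_\alpha. \tag{11.14}$$

Substituting Equation (11.14) into Equation (11.13), we obtain

$$\mathcal{I} = (2\pi)^{M/2} (\det \mathbf{I})^{-1/2}. \tag{11.15}$$

Even if the multivariate Gaussian approximation is not exact, but the posterior distribution has a single dominant peak located away from the prior boundary of the parameter space, then the use of Equation (11.15) provides a useful Laplacian approximation. Thus, the global likelihood can be written as

$$p(D|M, I) \approx p(\hat{\theta}|M, I)\, L(\hat{\theta})\, (2\pi)^{M/2} (\det \mathbf{I})^{-1/2}. \tag{11.16}$$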
In the case of a perfect Gaussian approximation and a uniform parameter prior, Equation (11.16) reduces to Equation (10.123). In Section 11.4, we will discuss how to locate the best set of parameters, $\hat{\theta}$.
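To see how Equation (11.16) behaves for a posterior that is not Gaussian, the following sketch applies it to a one-parameter case where the exact integral is known. The integrand $g(\theta) = \theta^a e^{-b\theta}$ is a hypothetical stand-in for the product of prior and likelihood, chosen because its integral over $\theta > 0$ is $\Gamma(a+1)/b^{a+1}$.

```python
import math

def laplace_log_evidence(log_g_hat, info_det, m):
    """ln of Equation (11.16): ln g(theta_hat) + (M/2) ln(2 pi) - (1/2) ln det I."""
    return log_g_hat + 0.5 * m * math.log(2.0 * math.pi) - 0.5 * math.log(info_det)

# Hypothetical prior x likelihood: g(theta) = theta^a exp(-b theta), theta > 0.
a, b = 20.0, 4.0
theta_hat = a / b                       # mode: d ln g / d theta = a/theta - b = 0
info = a / theta_hat ** 2               # I = -d^2 ln g / d theta^2 at the mode
log_g_hat = a * math.log(theta_hat) - b * theta_hat

log_laplace = laplace_log_evidence(log_g_hat, info, m=1)
log_exact = math.lgamma(a + 1.0) - (a + 1.0) * math.log(b)   # exact integral of g
ratio = math.exp(log_laplace - log_exact)
print(ratio)
```

For this particular integrand the Laplacian approximation is Stirling's approximation to $\Gamma(a+1)$, so with $a = 20$ the ratio of approximate to exact global likelihood is within roughly half a percent of 1, even though $g(\theta)$ is visibly skewed.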
11.3.2 Marginal parameter posteriors

We can also use the Laplacian approximation in Equation (11.15) to do the integral needed to eliminate nuisance parameters. Suppose we want to obtain the marginal probability distribution for one of the parameters which we will label $\theta$.$^{1}$ We need to remove the remaining parameters which we label collectively as $\phi$. Instead of integrating over the $\phi$ parameters, we construct a "profile" function for $\theta$, found by maximizing$^{2}$ the prior $\times$ the likelihood over $\phi$ for each choice of $\theta$: $f(\theta) = \max_\phi\, p(\theta, \phi|M, I)\, L(\theta, \phi)$. The profile function is a projection of the posterior onto the $\theta$ axis. Finding a maximum is generally much faster than computing the integrals. An efficient method of finding the maximum, starting from a good guess, is discussed in Section 11.5. We can construct an approximate marginal distribution for $\theta$ by multiplying $f(\theta)$ by a factor that accounts for the volume of $\phi$ space:

$$p(\theta|D, M, I) \propto f(\theta)\, [\det \mathbf{I}(\theta)]^{-1/2}, \tag{11.17}$$
where $\mathbf{I}(\theta)$ is the information matrix of the nuisance parameters, with $\theta$ held fixed.

To illustrate how different the marginal and projected distributions can be, consider a hypothetical joint probability distribution for the parameters $\theta$ and $\phi$ as shown in panel (a) of Figure 11.3. The projected and marginal distributions for $\theta$ are shown in panel (b). Although the peak probability occurs near $\theta = 0.65$, more probability resides in the broad plateau to the left of the peak and this is indicated by the marginal distribution. The value of the marginal, $p(\theta|D, I)$, for any particular choice of $\theta$, is proportional to the integral over $\phi$, i.e., the area under a slice of the joint probability distribution for $\theta$ fixed. Clearly, this area can be approximated by the peak height of the slice times a characteristic width of the probability distribution in the slice. In this two-parameter problem, the projected distribution is converted to an approximation of the true marginal by multiplying by the factor $[\det \mathbf{I}(\theta)]^{-1/2}$ in Equation (11.17), which gives the scale of the width of the distribution in the $\phi$ direction for the particular value of $\theta$. Recall from Equation (11.14) that $\det \mathbf{I}(\theta)$ is equal to the product of the eigenvalues of $\mathbf{I}(\theta)$. At this point, it might be useful to refer back to Figure 10.3, which shows how the eigenvalues of the corresponding $\boldsymbol{\psi}$ matrix in the linear model problem give information on the scale of the width of the posterior.

Figure 11.3 Comparison of the projected and marginal probability distribution for the $\theta$ parameter. Panel (a): joint probability ($\phi$ axis versus $\theta$ axis). Panel (b): marginal and projected probability density versus $\theta$.

$^{1}$ More generally, $\theta$ can represent a subset of one or more parameters of interest, with the remainder considered as nuisance parameters.

$^{2}$ This process is easy to visualize if there are only two parameters $A_1$ and $A_2$. The joint probability distribution, $p(A_1, A_2|D, I)$, is a three-dimensional space with $A_1$ and $A_2$ as the $x, y$-axes and probability as the $z$-axis. Each choice of $A_2$ (i.e., $A_2$ = constant) corresponds to a vertical slice through the probability mountain. We then vary $A_1$ until we find the maximum value of the probability in that slice. Repeat this process for all possible choices of parameter $A_2$ and record the probability $p(\tilde{A}_1, A_2|D, I)$. The resulting PDF is a function of $A_2$ and can be seen to be the projection of the joint probability mountain onto the $A_2$ axis.

We explore the Laplacian marginal distribution further in the following example: consider a nonlinear model of the form $f(x|\alpha, \beta) = x^{\alpha-1}(1 - x)^{\beta-1}$ for $0 < x < 1$ and $\alpha, \beta > 1$. We constructed a simulated data set for 12 values of the independent variable, $x$, using this nonlinear model with $\alpha = 6$, $\beta = 3$, and added independent Gaussian noise with a mean of zero and $\sigma = 0.005$. The data values are described by the equation

$$y_i = f(x_i|\alpha = 6, \beta = 3) + e_i. \tag{11.18}$$
The results of this simulation are shown in the four panels of Figure 11.4. Panel (a) shows the simulated data (diamonds) and model (solid curve). Panel (b) shows a contour plot of the Bayesian joint posterior probability of $\alpha$ and $\beta$, which differs significantly from a multivariate Gaussian. Panel (c) compares the projected or profile function (dots) and Bayesian marginal probability density for $\beta$ (dashed). Panel (d) is the same as (c) but with the Laplacian approximation of the marginal overlaid, illustrating the close agreement with the true marginal density. You will have to look very closely to see any difference. The difference between the derived most probable value of $\beta = 3.2$ and the true value of $\beta = 3$ is simply a consequence of the added noise.

The Laplacian marginal distribution can perform remarkably well even for modest amounts of data, despite the fact that one might expect the underlying Gaussian approximation to be good only to order $1/\sqrt{N}$, the usual rate of asymptotic convergence to a Gaussian. The Laplacian approximations are good to order $1/N$ or higher. For more details on this point, see Tierney and Kadane (1986).
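The distinction between the projected (profile) distribution, the Laplacian marginal of Equation (11.17), and the true marginal can be checked on a toy joint distribution. The function `log_joint` below is a hypothetical example, not the model of Figure 11.4: it is Gaussian in the nuisance parameter $\phi$ with a width that grows with $|\theta|$, so the Laplacian marginal should agree with a brute-force sum over $\phi$.

```python
import math

def log_joint(theta, phi):
    # Hypothetical unnormalized ln[prior x likelihood]; the width in phi grows with |theta|.
    width = 0.2 + theta ** 2
    return -0.5 * theta ** 2 - 0.5 * (phi / width) ** 2

def normalize(values, step):
    area = sum(values) * step
    return [v / area for v in values]

d_theta, d_phi = 0.05, 0.05
thetas = [-3.0 + d_theta * i for i in range(121)]
phis = [-60.0 + d_phi * i for i in range(2401)]
h = 1e-3

profile, laplace, brute = [], [], []
for theta in thetas:
    # Maximize over the nuisance parameter (here by a simple grid search).
    phi_hat = max(phis, key=lambda phi: log_joint(theta, phi))
    peak = log_joint(theta, phi_hat)
    # 1x1 information matrix: minus the second derivative of ln(joint) with respect to phi.
    info = -(log_joint(theta, phi_hat + h) - 2.0 * peak + log_joint(theta, phi_hat - h)) / h ** 2
    profile.append(math.exp(peak))
    laplace.append(math.exp(peak) / math.sqrt(info))     # Equation (11.17)
    brute.append(sum(math.exp(log_joint(theta, phi)) for phi in phis) * d_phi)

profile = normalize(profile, d_theta)
laplace = normalize(laplace, d_theta)
brute = normalize(brute, d_theta)

mode_profile = thetas[profile.index(max(profile))]
mode_marginal = thetas[brute.index(max(brute))]
print(mode_profile, mode_marginal, max(abs(a - b) for a, b in zip(laplace, brute)))
```

In this toy case the profile peaks at $\theta = 0$ while the marginal peaks well away from it, because the narrow slice at $\theta = 0$ holds little probability. The agreement between the Laplacian and brute-force marginals is essentially exact here only because each slice is exactly Gaussian in $\phi$; for a real nonlinear model it is an approximation.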
Figure 11.4 The figure provides a demonstration of the Laplacian approximation to the marginal posterior for a model parameter. Panel (a) shows the simulated data (diamonds) and model (solid curve), which has two nonlinear parameters $\alpha$ and $\beta$. (b) shows a contour plot of the Bayesian joint posterior probability of $\alpha$ and $\beta$, which differs significantly from a multivariate Gaussian. (c) compares the projected or profile function (dots) and Bayesian marginal probability density for $\beta$ (dashed). (d) is the same as (c) but with the Laplacian approximation of the marginal overlaid, illustrating the close agreement with the true marginal density.

11.4 Finding the most probable parameters

In this section, we will assume flat priors for the model parameters and focus on methods for finding the peak of the likelihood. Again, we assume the data are given by

$$d_i = f_i + e_i,$$

where $f_i$ represents the model function, and assume our knowledge of the noise $e_i$ leads us to assume Gaussian errors. Then the likelihood is given by

$$L(\theta) = p(D|\theta, M, I) = C \exp\left[-\frac{1}{2} \sum_{i,j=1}^{N} (d_i - f_i)[\mathbf{E}^{-1}]_{ij}(d_j - f_j)\right], \tag{11.19}$$

where $\mathbf{E}$ = covariance matrix of measurement errors. If the errors are independent, $\mathbf{E}$ is diagonal with entries equal to $\sigma_i^2$. In this case

$$p(D|\theta, M, I) = (2\pi)^{-N/2} \left\{\prod_{i=1}^{N} \sigma_i^{-1}\right\} \exp\left[-\frac{1}{2} \sum_{i=1}^{N} \frac{(d_i - f_i)^2}{\sigma_i^2}\right] = C \exp\left[-\frac{\chi^2(\theta)}{2}\right]. \tag{11.20}$$
In general, $\chi^2(\theta)$ may have many local minima but only one global minimum. For a nonlinear model, there is no general solution to the global minimization problem. Some of the approaches to finding the global minimum are as follows:

1. Random search techniques
   a) Monte Carlo exploration of parameter space
   b) Simulated annealing
   c) Genetic algorithm
2. Home in on minimum from initial guess
   a) Levenberg–Marquardt (iterative linearization)
   b) Downhill simplex
3. Combination of above.

MINUIT is a very powerful Fortran-based function minimization and error analysis tool developed at CERN. It is designed to find the minimum value of a multiparameter function and analyze the shape of the function around the minimum. The principal application is for statistical analysis, working on $\chi^2$ or log-likelihood functions, to compute the best-fit parameter values and uncertainties, including correlations between the parameters. MINUIT contains code for carrying out a combination of the above items 1(a), 1(b), 2(a) and 2(b). For more information on MINUIT see: http://wwwinfo.cern.ch/asdoc/minuit.
11.4.1 Simulated annealing

The idea of using a temperature parameter in optimization problems started to become popular with the introduction of the simulated annealing (SA) method by Kirkpatrick et al. (1983). It is based on a thermodynamic analogy to growing a crystal starting with the material in a liquid state called a melt. When a melt is slowly cooled, the atoms will achieve the lowest energy crystal state (i.e., global minimum), whereas if it is rapidly cooled, it will reach a higher energy amorphous state. Kirkpatrick et al. (1983) proposed a computer imitation of thermal annealing for use in optimization problems. In one version of simulated annealing, we construct a modified posterior probability distribution $p_T(\theta|D, I)$ which is given by

$$p_T(\theta|D, I) = \exp\left\{\frac{\ln[p(\theta|D, I)]}{T}\right\}, \tag{11.21}$$
which contains a temperature parameter $T$. For $T = 1$, $p_T(\theta|D, I)$ is equal to the true posterior distribution for $\theta$. For higher temperatures, $p_T(\theta|D, I)$ is a flatter version of $p(\theta|D, I)$. The basic scheme involves an exploration of the parameter space by a series of random changes in the current estimate, $\theta_c$, of the solution:

$$\theta_{\rm next} = \theta_c + \Delta\theta, \tag{11.22}$$
where $\Delta\theta$ is chosen by a random number generator. The proposed update is always considered advantageous if it yields a higher $p_T(\theta|D, I)$, but bad moves are sometimes accepted. This occasional allowance of retrograde steps provides a mechanism for escaping entrapment in local maxima. The process starts off with $T$ large so the acceptance rate for unrewarding changes is high. The value of $T$ is gradually decreased towards $T = 1$ as the number of iterations gets larger and the acceptance rate of unrewarding changes drops. This general scheme, of always accepting an uphill step while sometimes accepting a downhill step, has become known as the Metropolis algorithm (Metropolis et al., 1953). At each value of $T$ the Metropolis algorithm is used to explore the parameter space. The Metropolis algorithm and the related Metropolis–Hastings algorithms are described in more detail in Section 12.2.

Assuming a flat prior for $\theta$, it is frequently the case that $p(\theta|D, I) \propto \exp\{-\chi^2/2\}$. Simulated annealing works well for a $\chi^2$ topology like that shown in Figure 11.5, where there is an underlying trend towards a global minimum.
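A minimal sketch of this scheme for a one-parameter problem is given below, using $p_T \propto \exp\{-\chi^2/(2T)\}$ and the Metropolis acceptance rule. The $\chi^2$ function, the geometric cooling schedule, the Gaussian proposal width, the starting point and the step count are all assumptions chosen for illustration, not values taken from the text.

```python
import math
import random

def chi2(a):
    # Hypothetical chi^2 with many local minima and a global minimum at a = 2.
    return 2.0 * (a - 2.0) ** 2 + 10.0 * (1.0 - math.cos(4.0 * (a - 2.0)))

def simulated_annealing(chi2, start, n_steps=20000, t_start=20.0, t_end=1.0, step=0.5, seed=0):
    """Metropolis exploration of p_T = exp(-chi^2 / (2T)) while T is lowered towards t_end."""
    rng = random.Random(seed)
    current, chi2_current = start, chi2(start)
    best, chi2_best = current, chi2_current
    for i in range(n_steps):
        # Geometric cooling schedule (an assumed choice) from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (i / (n_steps - 1))
        proposal = current + rng.gauss(0.0, step)            # Equation (11.22)
        chi2_proposal = chi2(proposal)
        # ln of the ratio p_T(proposal) / p_T(current) for a flat prior.
        log_ratio = -(chi2_proposal - chi2_current) / (2.0 * temperature)
        # Always accept an uphill move; accept a downhill move with probability equal to the ratio.
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            current, chi2_current = proposal, chi2_proposal
            if chi2_current < chi2_best:
                best, chi2_best = current, chi2_current
    return best, chi2_best

best_a, best_chi2 = simulated_annealing(chi2, start=8.0)
print(best_a, best_chi2)
```

The routine keeps track of the best point visited, because at $T = 1$ the chain samples the posterior rather than sitting at its peak. The result would normally be handed to a local method, such as item 2(a) in the list of Section 11.4, to refine the minimum.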
11.4.2 Genetic algorithm

Genetic algorithms are a class of search techniques inspired by the biological process of evolution by means of natural selection (Holland, 1992). They can be used to construct numerical optimization techniques that perform robustly in parameter search spaces with complex topology. Consider the following generic modeling task: a model that depends on a set of adjustable parameters is used to fit a given dataset; the task consists of finding the single parameter set that minimizes the difference between the model's predictions and the data. The genetic algorithm consists of the following steps.

1. Start by generating a set ("population") of trial solutions, usually by choosing random values for all model parameters.
2. Evaluate the goodness-of-fit ("fitness") of each member of the current population (through a $\chi^2$ measure with the data, for example).
3. Select pairs of solutions ("parents") from the current population, with the probability of a given solution being selected made proportional to that solution's fitness. Breed the two selected solutions to produce two new solutions ("offspring").
4. Repeat step (3) until the number of offspring produced equals the number of individuals in the current population.
5. Use the new population of offspring to replace the old population.
6. Repeat steps (2) through (5) until some termination criterion is satisfied (e.g., the best solution of the current population reaches a goodness-of-fit exceeding some preset value).

[Figure 11.5 Sample topology of $\chi^2$ for a nonlinear model with one parameter labeled $a$, showing global and local minima. The plot shows $\chi^2(a)$ versus $a$.]
Superficially, this may look like some peculiar variation of a Monte Carlo theme. There are two crucial differences. First, the probability of a given solution being selected to participate in a breeding event is made proportional to that solution's fitness (step 3); better trial solutions breed more often, the computational equivalent of natural selection. Second, the production of new trial solutions from existing ones occurs through breeding. This involves encoding the parameters defining each solution as a string-like structure ("chromosome"), and performing genetically inspired operations of crossover and mutation on the pair of chromosomes encoding the two parents, the end result of these operations being two new chromosomes defining the two offspring. Applying the reverse process of decoding those strings into solution parameters completes the breeding process and yields two new offspring solutions that incorporate information from both parents.

If you want to try out the genetic algorithm and watch a demonstration, check out the following web site: http://www.hao.ucar.edu/public/research/si/pikaia/pikaia.html#sec2. PIKAIA (pronounced "pee-kah-yah") is a general purpose function optimization Fortran-77 subroutine based on a genetic algorithm. PIKAIA was written by Paul Charbonneau and Barry Knapp (Charbonneau, 1995; Charbonneau and Knapp, 1995), both at the High Altitude Observatory, a scientific division of the National Center for Atmospheric Research in Boulder, Colorado. The above web site lists other useful references.
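The breeding cycle described above can be illustrated with a minimal Python sketch. It is not PIKAIA: the decimal-digit chromosome, the crossover and mutation rates, the fitness definition `1/(1 + chi2)`, and the test surface `chi2` are all assumptions made for this example.

```python
import math
import random

N_DIGITS = 6   # decimal digits per encoded parameter (assumed)

def chi2(a):
    """Hypothetical chi-squared surface with a global minimum at a = 2."""
    return (a - 2.0) ** 2 + 2.0 * (1.0 - math.cos(5.0 * (a - 2.0)))

def fitness(a):
    return 1.0 / (1.0 + chi2(a))          # positive, larger is better

def encode(a, lo, hi):
    """Map a in [lo, hi] to a chromosome: a string of N_DIGITS decimal digits."""
    n = min(int((a - lo) / (hi - lo) * 10 ** N_DIGITS), 10 ** N_DIGITS - 1)
    return str(n).zfill(N_DIGITS)

def decode(chrom, lo, hi):
    return lo + (hi - lo) * int(chrom) / 10 ** N_DIGITS

def breed(c1, c2, rng, p_cross=0.85, p_mut=0.02):
    if rng.random() < p_cross:            # one-point crossover
        cut = rng.randrange(1, N_DIGITS)
        c1, c2 = c1[:cut] + c2[cut:], c2[:cut] + c1[cut:]

    def mutate(c):                        # replace random digits
        return "".join(str(rng.randrange(10)) if rng.random() < p_mut else d
                       for d in c)
    return mutate(c1), mutate(c2)

def genetic_search(lo=-10.0, hi=10.0, pop_size=50, generations=100, seed=2):
    rng = random.Random(seed)
    pop = [rng.uniform(lo, hi) for _ in range(pop_size)]      # initial population
    best = max(pop, key=fitness)
    for _ in range(generations):
        weights = [fitness(a) for a in pop]                   # evaluate fitness
        offspring = []
        while len(offspring) < pop_size:                      # select and breed
            p1, p2 = rng.choices(pop, weights=weights, k=2)
            c1, c2 = breed(encode(p1, lo, hi), encode(p2, lo, hi), rng)
            offspring += [decode(c1, lo, hi), decode(c2, lo, hi)]
        pop = offspring[:pop_size]                            # replace population
        best = max(pop + [best], key=fitness)                 # record best so far
    return best

best = genetic_search()
```

Selection uses fitness-proportional ("roulette wheel") sampling via `random.choices`; recording the best solution seen so far is a bookkeeping choice of this sketch rather than part of the algorithm as listed.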
11.5 Iterative linearization

In this section, we will develop the equations needed for understanding the Levenberg–Marquardt method which is discussed in Section 11.5.1. This is a widely used and efficient scheme for homing in on the best set of nonlinear model parameters, $\hat{\theta}$, starting from an initial guess of their values. Start with a Taylor series expansion of $\chi^2$ about some point in parameter space represented by $\theta_c$ (standing for $\theta_{\rm current}$) and keep only the first three terms:

$$\chi^2(\theta) \approx \chi^2(\theta_c) + \sum_k \frac{\partial \chi^2(\theta_c)}{\partial \theta_k}\,\delta\theta_k + \frac{1}{2}\sum_{kl} \frac{\partial^2 \chi^2(\theta_c)}{\partial\theta_k\,\partial\theta_l}\,\delta\theta_k\,\delta\theta_l, \tag{11.23}$$

where

$$\delta\theta = \theta - \theta_c. \tag{11.24}$$
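For a linear model, $\chi^2$ is quadratic so there are no higher derivatives. Let

$$\kappa_{kl} = \frac{1}{2}\,\frac{\partial^2 \chi^2(\theta_c)}{\partial\theta_k\,\partial\theta_l}$$

be called the curvature matrix. On the topic of nomenclature, in nonlinear analysis literature, the Hessian ($\mathbf{H}$) matrix is frequently mentioned and is related to our curvature matrix by $\mathbf{H} = 2\boldsymbol{\kappa}$. In matrix form, Equation (11.23) becomes

$$\chi^2(\theta) \approx \chi^2(\theta_c) + \nabla\chi^2(\theta_c)\cdot\delta\boldsymbol{\theta} + \delta\boldsymbol{\theta}^{T}\boldsymbol{\kappa}\,\delta\boldsymbol{\theta}. \tag{11.25}$$

Take the gradient of both sides of Equation (11.25):

$$\nabla\chi^2(\theta) \approx \nabla\chi^2(\theta_c) + \boldsymbol{\kappa}\,\delta\boldsymbol{\theta}. \tag{11.26}$$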
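The left hand side is the gradient at location $\delta\boldsymbol{\theta}$ away from $\theta_c$. Now consider the special case where $\delta\boldsymbol{\theta}$ takes us from $\theta_c$ to $\hat{\theta}$, the best set of parameter values. At $\delta\boldsymbol{\theta} = \hat{\theta} - \theta_c$, $\chi^2 = \chi^2_{\min}$. In this case,

$$\nabla\chi^2(\hat{\theta}) = \nabla\chi^2_{\min} = 0 \tag{11.27}$$

$$\boldsymbol{\kappa}\,\delta\boldsymbol{\theta} = -\nabla\chi^2(\theta_c) \tag{11.28}$$

or

$$\hat{\theta} = \theta_c - \boldsymbol{\kappa}^{-1}\,\nabla\chi^2(\theta_c), \tag{11.29}$$

where $\boldsymbol{\kappa}^{-1}$ = inverse of the curvature matrix. For a linear model, $\chi^2$ is exactly a quadratic and thus $\boldsymbol{\kappa}$ is constant, independent of $\theta_c$. For a nonlinear model, we expect that sufficiently close to $\chi^2_{\min}$, $\chi^2$ will be approximately quadratic so we should be able to ignore higher order terms in the Taylor expansion. Equation (11.29) should provide a reasonable approximation if $\theta_c$ is close to $\hat{\theta}$. This suggests an iterative algorithm:

1. Start with a good guess $\theta_1$ of $\hat{\theta}$.
2. Evaluate gradient $\nabla\chi^2(\theta_1)$ and curvature matrix $\boldsymbol{\kappa}(\theta_1)$.
3. Calculate improved estimate using Equation (11.29).
4. Repeat process until gradient = 0.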
When $\nabla\chi^2(\theta_c) = 0$, $\boldsymbol{\kappa}$ is the information matrix and $\boldsymbol{\kappa}^{-1}$ is its inverse. Thus, the covariances of the parameters are to a good approximation given by

$$\sigma_{kl} = [\boldsymbol{\kappa}^{-1}]_{kl}. \tag{11.30}$$

If Equation (11.29) provides a poor approximation to the shape of the model function at $\theta_c$, then all we can do is to step down the gradient:

$$\theta_{\rm next} = \theta_c - {\rm constant} \times \nabla\chi^2(\theta_c), \tag{11.31}$$
where the constant is small enough not to exhaust the downhill direction (more on the constant later). Note: if you are planning on writing your own program for iterative linearization, see the useful tips on computing the gradient and curvature (Hessian) matrices given in Press (1992).
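The iterative algorithm of this section can be sketched in Python for a two-parameter model. The sketch below is an illustration under stated assumptions: the exponential model, the noise-free synthetic data, and the finite-difference derivatives are hypothetical choices, and the update is written in terms of the full Hessian $\mathbf{H} = 2\boldsymbol{\kappa}$, for which the Newton step is $\delta\boldsymbol{\theta} = -\mathbf{H}^{-1}\nabla\chi^2$. The parameter errors at the end use $\mathbf{V} = \boldsymbol{\kappa}^{-1} = 2\mathbf{H}^{-1}$ as in Equation (11.30).

```python
import math

# Hypothetical noise-free data from y = A exp(-B x) with A = 3.0, B = 1.2.
X = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
SIGMA = 0.1
D = [3.0 * math.exp(-1.2 * x) for x in X]

def chi2(theta):
    a, b = theta
    return sum(((d - a * math.exp(-b * x)) / SIGMA) ** 2 for x, d in zip(X, D))

def shifted(theta, k, dk, l=None, dl=0.0):
    """Copy of theta with component k (and optionally l) displaced."""
    t = list(theta)
    t[k] += dk
    if l is not None:
        t[l] += dl
    return t

def gradient(theta, h=1e-5):
    # Central finite differences for d(chi2)/d(theta_k).
    return [(chi2(shifted(theta, k, h)) - chi2(shifted(theta, k, -h))) / (2 * h)
            for k in range(2)]

def hessian(theta, h=1e-4):
    # Central finite differences for the second derivatives, H = 2 * kappa.
    return [[(chi2(shifted(theta, k, h, l, h)) - chi2(shifted(theta, k, h, l, -h))
              - chi2(shifted(theta, k, -h, l, h)) + chi2(shifted(theta, k, -h, l, -h)))
             / (4 * h * h) for l in range(2)] for k in range(2)]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def iterate(theta, max_iter=20, tol=1e-8):
    for _ in range(max_iter):
        g = gradient(theta)
        h_inv = inverse_2x2(hessian(theta))
        step = [-(h_inv[k][0] * g[0] + h_inv[k][1] * g[1]) for k in range(2)]
        theta = [theta[0] + step[0], theta[1] + step[1]]
        if max(abs(s) for s in step) < tol:      # gradient effectively zero
            break
    h_inv = inverse_2x2(hessian(theta))
    errors = [math.sqrt(2.0 * h_inv[k][k]) for k in range(2)]   # sqrt of diag(kappa^-1)
    return theta, errors

theta_hat, theta_err = iterate([2.8, 1.1])   # a good initial guess is assumed
```

The starting point is deliberately close to the minimum; from a poor guess this pure Newton iteration can fail, which is the motivation for the Levenberg–Marquardt method.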
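11.5.1 Levenberg–Marquardt method

We can rewrite Equation (11.28) as a set of $M$ simultaneous equations for $k = 1, \ldots, M$:

$$\sum_{l=1}^{M} \kappa_{kl}\,\delta\theta_l = \beta_k, \tag{11.32}$$

where $\beta_k = -\partial\chi^2(\theta_c)/\partial\theta_k$; and for $M = 2$,

$$\kappa_{11}\,\delta\theta_1 + \kappa_{12}\,\delta\theta_2 = \beta_1,$$
$$\kappa_{21}\,\delta\theta_1 + \kappa_{22}\,\delta\theta_2 = \beta_2.$$

We can also rewrite Equation (11.31) as

$$\delta\theta_l = {\rm constant} \times \beta_l. \tag{11.33}$$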
Equations (11.32) and (11.33) are central to the discussion of the Levenberg–Marquardt method which follows. Far from $\chi^2_{\min}$, use Equation (11.33), which corresponds to stepping down the direction of steepest descent on a scale set by the constant. Close to $\chi^2_{\min}$, use Equation (11.32), which allows us to jump directly to the minimum.

What sets the scale of the constant in Equation (11.33)? Note: $\beta_l = -\partial\chi^2/\partial\theta_l$ has dimensions of $1/\theta_l$, where $\theta_l$ may itself have dimensions (e.g., m). Each component $\beta_l$ may have different dimensions. The constant of proportionality between $\beta_l$ and $\delta\theta_l$ must therefore have dimensions of $\theta_l^2$. Looking at $\boldsymbol{\kappa}$, there is only one obvious quantity with the above dimension and that is $1/\kappa_{ll}$, the reciprocal of the diagonal element. But the scale might be too big, so divide it by an adjustable non-dimensional fudge factor $\lambda$:

$$\delta\theta_l = \frac{1}{\lambda\,\kappa_{ll}}\,\beta_l \tag{11.34}$$

or

$$\lambda\,\kappa_{ll}\,\delta\theta_l = \beta_l. \tag{11.35}$$
The next step is to combine Equations (11.32) and (11.35) by defining a new curvature matrix $\boldsymbol{\kappa}'$:

$$\kappa'_{kk} = \kappa_{kk}(1 + \lambda) \tag{11.36}$$

and

$$\kappa'_{kl} = \kappa_{kl} \qquad (k \neq l). \tag{11.37}$$

The new equation is

$$\sum_{l=1}^{M} \kappa'_{kl}\,\delta\theta_l = \beta_k. \tag{11.38}$$
If $\lambda$ is large, $\boldsymbol{\kappa}'$ is forced into being dominated by the diagonal elements and Equation (11.38) becomes Equation (11.35), the scaled form of Equation (11.33). If $\lambda \to 0$, Equation (11.38) $\to$ Equation (11.32). The basis of the method is that when $\theta_c$ is far from $\hat{\theta}$, then Equation (11.33), representing steepest descent, is best. When $\theta_c$ is close to $\hat{\theta}$, then Equation (11.32) is best. The Levenberg–Marquardt method employs Equation (11.38), which can switch between these two desirable states (Equations (11.32) and (11.33)) by varying $\lambda$. Recall that Equation (11.32) can jump to $\chi^2_{\min}$ in one step if the approximation is valid.
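11.5.2 Marquardt's recipe

1. Compute $\chi^2(\theta_1)$ for guess $\theta_1$ of $\hat{\theta}$.
2. Pick a small value of $\lambda \approx 0.001$.
3. Solve Equation (11.38) for $\delta\theta$ and evaluate $\chi^2(\theta_1 + \delta\theta)$.
4. If $\chi^2(\theta_1 + \delta\theta) \geq \chi^2(\theta_1)$, increase $\lambda$ by a factor of 10 and go to (3).
5. If $\chi^2(\theta_1 + \delta\theta) < \chi^2(\theta_1)$, decrease $\lambda$ by a factor of 10 and update the trial solution, $\theta_1 \to \theta_1 + \delta\theta$.
6. Repeat steps (3) to (5) until the solution converges.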
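The recipe above can be sketched in Python as follows. This is a hedged illustration rather than the algorithm as implemented in any particular package: the exponential model and synthetic data are hypothetical, the Hessian is approximated from first derivatives of the model only (a common simplification, assumed here), and the equations are solved in terms of $\mathbf{H} = 2\boldsymbol{\kappa}$ with right-hand side $-\nabla\chi^2$, which gives the exact Newton step as $\lambda \to 0$.

```python
import math

# Hypothetical noise-free data from y = A exp(-B x) with A = 3.0, B = 1.2.
X = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
SIGMA = 0.1
D = [3.0 * math.exp(-1.2 * x) for x in X]

def chi2(theta):
    a, b = theta
    return sum(((d - a * math.exp(-b * x)) / SIGMA) ** 2 for x, d in zip(X, D))

def gradient_and_hessian(theta):
    """Gradient of chi-squared and its first-derivative (Gauss-Newton) Hessian."""
    a, b = theta
    g = [0.0, 0.0]
    hess = [[0.0, 0.0], [0.0, 0.0]]
    for x, d in zip(X, D):
        e = math.exp(-b * x)
        resid = d - a * e
        jac = (e, -a * x * e)                 # dm/dA, dm/dB
        for k in range(2):
            g[k] -= 2.0 * resid * jac[k] / SIGMA ** 2
            for l in range(2):
                hess[k][l] += 2.0 * jac[k] * jac[l] / SIGMA ** 2
    return g, hess

def solve_2x2(m, rhs):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(m[1][1] * rhs[0] - m[0][1] * rhs[1]) / det,
            (m[0][0] * rhs[1] - m[1][0] * rhs[0]) / det]

def marquardt(theta, lam=0.001, tol=1e-3, max_iter=100):
    current = chi2(theta)                                     # steps 1 and 2
    for _ in range(max_iter):
        g, hess = gradient_and_hessian(theta)
        damped = [[hess[0][0] * (1.0 + lam), hess[0][1]],     # Eqs (11.36)-(11.37)
                  [hess[1][0], hess[1][1] * (1.0 + lam)]]
        step = solve_2x2(damped, [-g[0], -g[1]])              # step 3
        trial = [theta[0] + step[0], theta[1] + step[1]]
        new = chi2(trial)
        if new >= current:                                    # step 4
            lam *= 10.0
            continue
        lam /= 10.0                                           # step 5
        decrease = current - new
        theta, current = trial, new
        if decrease < tol:                                    # step 6: change << 1
            break
    return theta, current

theta_fit, chi2_fit = marquardt([2.0, 0.8])
```

The stopping rule used here, a decrease in $\chi^2$ much smaller than 1, is one possible convergence test for step 6.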
Since $\boldsymbol{\kappa}$ plays the role of a metric on the $M$-dimensional subspace spanned by the model functions, the Levenberg–Marquardt method is referred to as a variable metric approach. The matrix $\boldsymbol{\kappa}$ is the same as the $\boldsymbol{\psi}$ matrix in the linear model case.

All that is necessary is a condition for stopping the iteration. Iterating to convergence or machine accuracy is generally wasteful and unnecessary since the minimum at best is only a statistical estimate of the parameters $\theta$. Recall from our earlier discussion of joint credible regions in Section 10.4.1 that changes in $\chi^2$ by an amount $\ll 1$ are never statistically meaningful. For $M = 2$ parameters, the probability ellipse defined by $\Delta\chi^2 = 2.3$ away from $\chi^2_{\min}$ encompasses 68.3% of the joint PDF. For $M = 1$ the corresponding $\Delta\chi^2 = 1$. These considerations suggest that, in practice, one should stop iterating on the first or second iteration in which $\chi^2$ decreases by an amount $\ll 1$.

Once the minimum is found, set $\lambda = 0$ and compute the variance-covariance matrix $\mathbf{V} = \boldsymbol{\kappa}^{-1}$ to obtain the estimated standard errors of the fitted parameters. Mathematica uses the Levenberg–Marquardt method in NonlinearRegress analysis. Subroutines are also available in Press (1992). If the posterior has several maxima of comparable magnitude, then it does not make sense to talk about a single best $\hat{\theta}$.
11.6 Mathematica example

In this example, we illustrate the solution of a simple nonlinear model fitting problem using Mathematica's NonlinearRegress, which implements the Levenberg–Marquardt method. The data consist of one or possibly two spectral lines sitting on an unknown constant background. The measurement errors are assumed to be IID normal with a $\sigma = 0.3$. Model 1 assumes the spectrum contains a single spectral line while model 2 assumes two spectral lines. The raw data and measurement errors are shown in panel (a) of Figure 11.6, together with the best fitting model 1 shown by the solid curve. The parameter values for the best fitting model 1 were obtained with the NonlinearRegress command as illustrated in Figure 11.7. The arguments to the command are as follows:

1. data is a list of (x, y) pairs of data values where the x value is a frequency and the y value a signal strength.
2. model[f] is the mathematical form of the model for the spectrum signal strength as a function of frequency, f. This is given by

    model[f] := a0 + a1 line[f, f1]

   where

    line[f, f1] := (Sin[2π(f − f1)/Δf] / (2π(f − f1)/Δf))^2

   and Δf = 1.5. Note: line[f, f1] becomes indeterminate for (f − f1) = 0. To avoid the likelihood of this condition occurring in NonlinearRegress, set the initial estimate of f1 to a non-integer number.
3. f is the independent variable frequency.
4. The fourth item is a list of the unknown model parameters and initial estimates. Since NonlinearRegress uses the Levenberg–Marquardt method, it is important that the initial estimates land you somewhere in the neighborhood of the global minimum of $\chi^2$, where

$$\chi^2 = \sum_{i=1}^{N} \frac{(d_i - {\rm model}[f_i])^2}{\sigma_i^2} = \sum_{i=1}^{N} {\rm wt}_i\,(d_i - {\rm model}[f_i])^2.$$

5. wt is an optional list of weights to be assigned to the data points, where ${\rm wt}_i = 1/\sigma_i^2$.
6. RegressionReport is a list of options for the output of NonlinearRegress.
7. ShowProgress -> True shows the value of $\chi^2$ achieved after each iteration of the Levenberg–Marquardt method and the parameter values at that step.

[Figure 11.6 Two nonlinear models fitted to simulated spectral line data. Panel (a) shows the raw data and the best fit (solid curve) for model 1 which assumes a single line. Panel (b) illustrates the model 1 fit residuals. Panel (c) shows the best-fit model 2 compared to the data. Panel (d) shows the model 2 residuals. In all panels the signal is plotted against the frequency axis.]

    result = NonlinearRegress[data, model[f], {f},
      {{a0, 1.2}, {a1, 4}, {f1, 2.6}}, Weights -> wt,
      RegressionReport -> {BestFitParameters, ParameterCITable,
        AsymptoticCovarianceMatrix, FitResiduals, BestFit},
      ShowProgress -> True]

    Iteration:1 ChiSquared:128.05290737029276` Parameters:{1.2, 4., 2.6}
    Iteration:2 ChiSquared:67.07120991521835` Parameters:{1.04667, 3.1995, 2.58491}
    Iteration:3 ChiSquared:66.77321228639481` Parameters:{1.04602, 3.20506, 2.57661}
    Iteration:4 ChiSquared:66.74836021101748` Parameters:{1.04588, 3.20626, 2.5742}
    Iteration:5 ChiSquared:66.74632730538346` Parameters:{1.04587, 3.20636, 2.57351}
    Iteration:6 ChiSquared:66.7461616482541` Parameters:{1.04587, 3.20636, 2.57332}

    {BestFitParameters → {a0 → 1.04587, a1 → 3.20637, f1 → 2.57326},

     ParameterCITable →
           Estimate   Asymptotic SE   CI
       a0  1.04587    0.0545299       {0.936233, 1.15551}
       a1  3.20637    0.190783        {2.82277, 3.58996}
       f1  2.57326    0.0204051       {2.53224, 2.61429},

     AsymptoticCovarianceMatrix →
        0.00297351        -0.00434888        -1.20476 × 10^-6
       -0.00434888         0.036398           1.68676 × 10^-6
       -1.20476 × 10^-6    1.68676 × 10^-6    0.000416367,

     FitResiduals → {-0.209178, 0.0139153, -0.351702, 0.108264, -0.081711,
       0.232977, -0.0502895, 0.0859089, -0.0988917, -0.181999, 0.0500372,
       -0.33713, -0.329429, -0.218716, -0.151456, 0.0154221, -0.264555,
       0.496109, 0.256117, 0.616037, -0.194287, -0.0907203, -0.448441,
       0.12935, -0.0518219, 1.02491, 0.559758, 0.660035, 0.980723, 0.291933,
       -0.133344, -0.389732, -0.0151427, -0.186632, -0.166834, -0.388142,
       0.297471, -0.477721, -0.287358, -0.331853, -0.401507, -0.263538,
       0.20304, 0.102339, 0.166333, 0.178261, -0.337149, -0.339727,
       -0.258125, 0.100323, 0.062864},

     BestFit → 1.04587 + 0.182741 Sin[4.18879 (-2.57326 + f)]^2 / (-2.57326 + f)^2}

Figure 11.7 Example of the use of the Mathematica command NonlinearRegress to fit model 1 to the spectral data.
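As a cross-check on the model definition, the line shape and the weighted $\chi^2$ can be written as a short Python sketch. The function names `line`, `model1`, and `chi_squared` are hypothetical and are not part of Mathematica; the squared form of the line shape is the one implied by the BestFit expression in Figure 11.7, and the explicit handling of f = f1 (where the expression is indeterminate) is a choice made here.

```python
import math

DELTA_F = 1.5   # line width parameter, Delta f = 1.5

def line(f, f1):
    """Line shape (sin x / x)^2 with x = 2 pi (f - f1) / Delta f."""
    x = 2.0 * math.pi * (f - f1) / DELTA_F
    if x == 0.0:
        return 1.0          # limiting value at f = f1
    return (math.sin(x) / x) ** 2

def model1(f, a0, a1, f1):
    """Single spectral line on a constant background."""
    return a0 + a1 * line(f, f1)

def chi_squared(data, sigma, params):
    """Weighted sum of squared residuals with wt_i = 1 / sigma^2."""
    wt = 1.0 / sigma ** 2
    return sum(wt * (d - model1(f, *params)) ** 2 for f, d in data)

# Best-fit values reported in Figure 11.7.
a0, a1, f1 = 1.04587, 3.20637, 2.57326
coefficient = a1 / (2.0 * math.pi / DELTA_F) ** 2
```

With these definitions, `coefficient` reproduces the factor 0.182741 that multiplies the Sin[...]^2 term in the BestFit output, since $a1\,(\Delta f/2\pi)^2 = 3.20637/4.18879^2$.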
The full NonlinearRegress command together with its arguments is shown first in Figure 11.7. The output that follows indicates that the minimum $\chi^2$ achieved for model 1 was 66.7 (see footnote 3). Below that is a list of the various RegressionReport items. The second item lists the parameter values, the asymptotic standard error for each parameter and the frequentist confidence interval (95% by default) for each parameter. The asymptotic error for each parameter is equal to the square root of the corresponding diagonal element in the AsymptoticCovarianceMatrix. The use of these errors is based on the assumption that in the vicinity of the mode, the joint posterior probability density function for the parameters is well approximated by a multivariate Gaussian (see Section 11.2). The AsymptoticCovarianceMatrix $= \mathbf{I}^{-1}$, the inverse of the observed information matrix. Note: the AsymptoticCovarianceMatrix elements, as given by NonlinearRegress, have been scaled by a factor $k^2 = 1.39$, where $k$ is given by Equation (10.144). This leads to more robust parameter errors but we must correct for this later on when we compute the Bayesian odds ratio for comparing models 1 and 2. The values of $\chi^2$ quoted in the output of Figure 11.7 have not been modified by the $k$ factor and thus $\chi^2_{\min} = 66.7$ is the minimum value calculated on the basis of the input measurement error $\sigma = 0.3$.

Panel (b) of Figure 11.6 shows the residuals after subtracting model 1 from the data. There is clear evidence for another spectral line at about 3.6 on the frequency axis. On the basis of these residuals, a second model was constructed, consisting of two spectral lines sitting on a constant background. Model 2 has the mathematical form:

    model2[f] := a0 + a1 line[f, f1] + a2 line[f, f2]

Panel (c) shows the best fitting model 2. The residuals shown in panel (d) appear to be consistent with the measurement errors and show no evidence for any further systematic signal component.
The output from Mathematica’s NonlinearRegress command for model 2 is shown in Figure 11.8.
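The frequentist confidence in rejecting model 1 quoted in footnote 3 can be checked with a short Python sketch. The helper name `chi2_upper_tail` is hypothetical; it uses the closed-form series for the upper tail of the $\chi^2$ distribution, which is valid only for an even number of degrees of freedom (48 here), and it plays the role of Mathematica's GammaRegularized in that footnote.

```python
import math

def chi2_upper_tail(chi2, dof):
    """P-value: probability of exceeding chi2 for an even number of degrees
    of freedom, using Q(n, x) = exp(-x) * sum_{j<n} x^j / j! with n = dof/2."""
    if dof % 2 != 0:
        raise ValueError("this closed form requires an even number of degrees of freedom")
    half = 0.5 * chi2
    term = 1.0
    total = 1.0
    for j in range(1, dof // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

n_data, n_params = 51, 3
chi2_min = 66.746                      # final ChiSquared value in Figure 11.7
p_value = chi2_upper_tail(chi2_min, n_data - n_params)
confidence = 1.0 - p_value             # confidence in rejecting model 1
```

The resulting confidence lies between 0.95 and 0.975, consistent with the value of 0.96 given in footnote 3.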
11.6.1 Model comparison

Here, we compute the Bayesian odds ratio given by

$$O_{21} = \frac{p(M_2|D,I)}{p(M_1|D,I)} = \frac{p(M_2|I)}{p(M_1|I)}\,\frac{p(D|M_2,I)}{p(D|M_1,I)} = \frac{p(M_2|I)}{p(M_1|I)} \times \text{Bayes factor}. \tag{11.39}$$

Footnote 3: Here, we evaluate the frequentist theory confidence in rejecting model 1. Model 1 has three fit parameters so the number of degrees of freedom $= N - M = 51$ data points $- 3 = 48$; thus the confidence is $1 - {\rm GammaRegularized}\left[\frac{N-M}{2}, \frac{\chi^2}{2}\right] = 0.96$.
    result = NonlinearRegress[data, model2[f], {f},
      {{a0, 1.2}, {a1, 3}, {f1, 2.6}, {a2, 1}, {f2, 3.5}}, Weights -> wt,
      RegressionReport -> {BestFitParameters, ParameterCITable, BestFit},
      ShowProgress -> True]

    Iteration:1 ChiSquared:113.44567855780089` Parameters:{1.2, 3., 2.6, 1., 3.5}
    Iteration:2 ChiSquared:44.87058091255775` Parameters:{0.996707, 3.16903, 2.55205, 0.447086, 3.22451}
    Iteration:3 ChiSquared:39.478968989091` Parameters:{0.958289, 3.08133, 2.51609, 0.866957, 2.98474}
    Iteration:4 ChiSquared:31.202371941916066` Parameters:{0.934233, 2.99456, 2.48976, 1.14837, 3.12047}
    Iteration:5 ChiSquared:30.26136004999436` Parameters:{0.933864, 2.94356, 2.48483, 1.20244, 3.06137}
    Iteration:6 ChiSquared:30.236264668012137` Parameters:{0.932706, 2.93206, 2.48383, 1.22363, 3.06401}

    {BestFitParameters → {a0 → 0.932717, a1 → 2.93172, f1 → 2.48379, a2 → 1.22387, f2 → 3.0638},

     ParameterCITable →
           Estimate   Asymptotic SE   CI
       a0  0.932717   0.0406876       {0.850817, 1.01462}
       a1  2.93172    0.182306        {2.56476, 3.29869}
       f1  2.48379    0.0253238       {2.43281, 2.53476}
       a2  1.22387    0.182139        {0.857242, 1.5905}
       f2  3.0638     0.0606851       {2.94164, 3.18595},

     BestFit → 0.932717 + 0.0697522 Sin[4.18879 (-3.0638 + f)]^2 / (-3.0638 + f)^2
                        + 0.167088 Sin[4.18879 (-2.48379 + f)]^2 / (-2.48379 + f)^2}

Figure 11.8 The output from Mathematica's NonlinearRegress command for model 2.
We will use the Laplace approximation for the Bayes factor described in Section 11.3.1, which expresses the global likelihood, given by Equation (11.16), in terms of the determinant of the information matrix, $\mathbf{I}$:

$$p(D|M_i,I) \approx p(\hat{\theta}|M_i,I)\,\mathcal{L}(\hat{\theta})\,(2\pi)^{M/2}(\det \mathbf{I})^{-1/2} = \frac{1}{\prod_\alpha \Delta\theta_\alpha}\,\frac{1}{(2\pi)^{N/2}\sigma^N}\,e^{-\chi^2_{\min}/2}\,(2\pi)^{M/2}(\det \mathbf{I})^{-1/2}. \tag{11.40}$$

Let $\mathbf{V}$ stand for the parameter asymptotic covariance matrix in the nonlinear problem, so

$$(\det \mathbf{I})^{-1/2} = \sqrt{\det \mathbf{V}}. \tag{11.41}$$

Equation (11.40) assumes uniform priors for the model parameters, where $\Delta\theta_\alpha$ is the prior range for parameter $\theta_\alpha$. In the current problem, we assume the prior ranges for the parameters are known to within a factor of three of the initial estimates used in NonlinearRegress, i.e., $\Delta\theta_\alpha = 3\theta_\alpha - \theta_\alpha/3 = 2.667\,\theta_\alpha$.

Let $\mathbf{V}^*$ be the asymptotic covariance matrix elements returned by Mathematica's NonlinearRegress command. Recall that Mathematica scales the asymptotic covariance matrix elements by a factor $k^2$, to allow for more robust parameter errors, where $k$ is given by Equation (10.144). We need to remove this factor for use in Equation (11.40), by multiplying the asymptotic covariance matrix provided by Mathematica by $1/k^2$ before computing its determinant, i.e.,

$$\mathbf{V} = \frac{1}{k^2}\,\mathbf{V}^*. \tag{11.42}$$

We can extract $\mathbf{V}^*$ from result, the name given to the result of the NonlinearRegress command. For model 1 (see Figure 11.7) the covariance matrix was the third item in the RegressionReport. Thus, $\mathbf{V}^*$ = result[[3, 2]][[1]] is the desired matrix (see footnote 4). The value of $O_{21}$ derived from Equations (11.39), (11.40), (11.41), and (11.42), assuming equal prior probabilities for the two models, is $1.4 \times 10^5$.
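Equations (11.40) to (11.42) can be evaluated with a short Python sketch, shown here for model 1 using the numbers in Figure 11.7. The function names are hypothetical, the determinant routine is a plain Gaussian elimination written for this example, and the value $k^2 = 1.39$ and the prior ranges (2.667 times the initial estimates) are taken from the text. The odds ratio itself is not reproduced because the covariance matrix for model 2 is not listed in Figure 11.8.

```python
import math

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if a[pivot][i] == 0.0:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            det = -det                      # a row swap flips the sign
        det *= a[i][i]
        for r in range(i + 1, n):
            factor = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return det

def log_global_likelihood(chi2_min, n_data, sigma, v_star, k_squared, prior_ranges):
    """ln p(D|M,I) from Equation (11.40), with V = V*/k^2 (Equation 11.42)."""
    m = len(prior_ranges)
    v = [[x / k_squared for x in row] for row in v_star]
    log_prior = -sum(math.log(r) for r in prior_ranges)
    log_like = (-0.5 * n_data * math.log(2.0 * math.pi)
                - n_data * math.log(sigma) - 0.5 * chi2_min)
    log_occam = 0.5 * m * math.log(2.0 * math.pi) + 0.5 * math.log(determinant(v))
    return log_prior + log_like + log_occam

# Model 1 values from Figure 11.7.
V_STAR_1 = [[0.00297351, -0.00434888, -1.20476e-6],
            [-0.00434888, 0.036398, 1.68676e-6],
            [-1.20476e-6, 1.68676e-6, 0.000416367]]
RANGES_1 = [2.667 * g for g in (1.2, 4.0, 2.6)]   # a0, a1, f1 initial estimates
log_evidence_1 = log_global_likelihood(66.746, 51, 0.3, V_STAR_1, 1.39, RANGES_1)
```

Given the corresponding quantities for model 2, the Bayes factor would follow as the exponential of the difference of the two log global likelihoods. Working with logarithms avoids underflow in the factor $e^{-\chi^2_{\min}/2}$.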
11.6.2 Marginal and projected distributions

Finally, we will compute the Laplacian approximation to the Bayesian marginal probability density function $p(\theta|D,M,I)$ and compare it to the frequentist projected probability, which we refer to as the profile function, $f(\theta)$, according to Equation (11.17). We illustrate this calculation for the a2 parameter. The Laplacian marginal is the profile function $f(a2)$ times the factor

$$[\det \mathbf{I}(a2)]^{-1/2} = \sqrt{\det \mathbf{V}(a2)} = \sqrt{\det \frac{1}{k^2}\mathbf{V}^*(a2)}. \tag{11.43}$$

The quantity $\mathbf{V}^*(a2)$ is the asymptotic covariance matrix evaluated by NonlinearRegress, obtained by fixing a2 and minimizing $\chi^2$ with all the other parameters free to vary. This can be done using a simple Do loop to repeatedly run NonlinearRegress for different values of a2. For an example of this, see the nonlinear fitting section of the Mathematica tutorial. The profile function is given by

$$f(a2) \propto \exp\left\{-\frac{\chi^2_{\min}(a2)}{2}\right\}. \tag{11.44}$$

Let $\hat{k}$ be the value of $k$ in Equation (11.43) for the fit corresponding to the most probable set of parameters. If $\hat{k} > 1$, this indicates that the data errors may have been underestimated. An approximate way (see footnote 5) to take account of this, when computing the marginal parameter PDF, is to modify Equations (11.43) and (11.44) as follows:

$$[\det \mathbf{I}(a2)]^{-1/2} = \sqrt{\det \mathbf{V}(a2)} = \sqrt{\det \frac{\hat{k}^2}{k^2}\mathbf{V}^*(a2)} \tag{11.45}$$

$$f(a2) \propto \exp\left\{-\frac{\chi^2_{\min}(a2)}{2\,\hat{k}^2}\right\}. \tag{11.46}$$

Footnote 4: The quantity result[[3,2]] is $\mathbf{V}^*$ expressed in Mathematica's MatrixForm. To compute the determinant of this matrix we need to extract the argument of MatrixForm, which is given by result[[3,2]][[1]].

Footnote 5: A fully Bayesian way of handling this would be to treat $k$ as a parameter and marginalize over a prior range for $k$.
The resulting projected and marginal PDF for a2 are shown in Figure 11.9 and are clearly quite different. It is a common frequentist practice to improve on the asymptotic standard errors of a parameter by finding the two values of the parameter for which the projected $\chi^2 = \chi^2_{\min} + 1$, in analogy to the linear model case (see Table 10.2). For example, see the command MINOS in the MINUIT software (James, 1998). As we have discussed earlier, the Bayesian marginal distribution should be strongly preferred over the projected, and the Laplacian approximation provides a quick way of estimating the marginal. Finally, we can readily compute the Bayesian 95% credible region for a2 from the marginal distribution and compare with the 95% confidence interval returned by NonlinearRegress. They are:

Bayesian 95% credible region = (0.74, 1.68)
frequentist 95% confidence interval = (0.86, 1.60).
11.7 Errors in both coordinates

In Section 4.8.2, we derived the likelihood function applicable to the general problem of fitting an arbitrary model when there are independent errors in both coordinates. For the special case of a straight line model (see also Gull, 1989b) the likelihood function, $p(D|M,A,B,I)$, is given by Equation (4.60), which we repeat here after replacing $y_i$ by $d_i$:

$$p(D|M,A,B,I) = (2\pi)^{-N/2}\left\{\prod_{i=1}^{N}\left(\sigma_i^2 + B^2\sigma_{xi}^2\right)^{-1/2}\right\} \exp\left\{-\sum_{i=1}^{N}\frac{(d_i - m(x_{i0}|A,B))^2}{2\left(\sigma_i^2 + B^2\sigma_{xi}^2\right)}\right\}. \tag{11.47}$$

[Figure 11.9 Comparison of the projected and Laplacian marginal PDF for the a2 parameter. The plot shows PDF versus a2 for the projected and marginal curves.]
Here, $A$ and $B$ are model parameters representing the intercept and slope. It is apparent that when there are errors in both coordinates, the problem has become nonlinear in the parameters.

Problem: In Section 10.2.2, we fitted a straight line model to the data given in Table 10.1 using the method of least-squares. This time assume that the $x_i$ coordinates are uncertain with an uncertainty described by a Gaussian PDF with a $\sigma_{xi} = 3$. Using the likelihood given in Equation (11.47), compute the marginal PDF for the intercept ($A$) and the marginal for the slope ($B$), and compare the results to the case where $\sigma_{xi} = 0$. Assume uniform priors with boundaries well outside the region with significant likelihood.

Solution: Since we are assuming flat priors, the joint posterior $p(A,B|D,M,I)$ is directly proportional to the likelihood. The marginal PDF for the intercept is given by

$$p(A|D,M,I) = \int dB\; p(A,B|D,M,I) \propto p(A|M,I)\int dB\; p(B|M,I)\,p(D|M,A,B,I). \tag{11.48}$$

We can write a similar equation for the marginal PDF for the slope. The upper two panels of Figure 11.10 show plots of the two marginals for two cases. The solid curves correspond to $\sigma_{xi} = 0$ (no uncertainty in $x_i$ values), and the dashed curves to $\sigma_{xi} = 3$. The uncertainty in the $x_i$ values results in broader and shifted marginals.

[Figure 11.10 The top two panels show the marginal PDFs for the intercept and slope. The solid curves show the result when there is no error in the x-coordinate. The dashed curves are the result when there are errors in both coordinates. The lower panel shows the corresponding best-fit straight lines in the (x, y) plane.]

The lower panel of Figure 11.10 shows the most probable straight line fits for the two cases. The likelihood function given by Equation (11.47) contains $\sigma_{xi}$ in two terms. In both terms, it is multiplied by the slope parameter $B$. The effect of the first term is to favor smaller values of $B$. The effect of the second term is to decrease the relative weight given to measurements with a smaller $\sigma_i$, i.e., this causes the points to be given more equal weight. In this particular case, the best fitting line has a slope and intercept which are slightly larger when $\sigma_{xi} = 3.0$.
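The marginalization in Equation (11.48) can be sketched numerically in Python by evaluating Equation (11.47) on a grid. This is an illustration only: the five data points are hypothetical (they are not the data of Table 10.1), a single $\sigma$ and a single $\sigma_x$ are assumed for all points, and the grid limits are chosen by hand to lie well outside the region of significant likelihood.

```python
import math

def log_likelihood(a, b, x, d, sigma, sigma_x):
    """ln of Equation (11.47) for the straight line m(x) = a + b x."""
    total = -0.5 * len(x) * math.log(2.0 * math.pi)
    var = sigma ** 2 + (b * sigma_x) ** 2          # effective variance
    for xi, di in zip(x, d):
        total += -0.5 * math.log(var) - (di - (a + b * xi)) ** 2 / (2.0 * var)
    return total

def grid(lo, hi, n):
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

def marginals(x, d, sigma, sigma_x, a_grid, b_grid):
    """Marginal PDFs for intercept and slope assuming flat priors."""
    logp = [[log_likelihood(a, b, x, d, sigma, sigma_x) for b in b_grid]
            for a in a_grid]
    peak = max(max(row) for row in logp)
    post = [[math.exp(v - peak) for v in row] for row in logp]   # avoid underflow
    da = a_grid[1] - a_grid[0]
    db = b_grid[1] - b_grid[0]
    p_a = [sum(row) * db for row in post]                        # integrate over B
    p_b = [sum(post[i][j] for i in range(len(a_grid))) * da
           for j in range(len(b_grid))]                          # integrate over A
    norm = sum(p_a) * da
    return [p / norm for p in p_a], [p / norm for p in p_b]

# Hypothetical data (not Table 10.1): roughly y = 10 + 0.3 x with sigma = 1.
X_DATA = [10.0, 20.0, 30.0, 40.0, 50.0]
D_DATA = [13.1, 15.9, 19.2, 21.8, 25.1]
A_GRID = grid(0.0, 20.0, 201)
B_GRID = grid(0.0, 0.6, 121)
pa0, pb0 = marginals(X_DATA, D_DATA, 1.0, 0.0, A_GRID, B_GRID)   # sigma_x = 0
pa3, pb3 = marginals(X_DATA, D_DATA, 1.0, 3.0, A_GRID, B_GRID)   # sigma_x = 3
```

Comparing `pb0` with `pb3` for these illustrative data shows the same qualitative behavior as the upper panels of Figure 11.10: the slope marginal is broader when the x values are uncertain.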
11.8 Summary

The problem of finding the best set of parameters for a nonlinear model can be very challenging, because the posterior distribution for the parameters can be complex with many hills and valleys. As the sample size increases, the posterior asymptotically approaches a multivariate normal distribution (Section 11.2). Unfortunately, there are no clear guidelines for how large the sample size must be. The goal is to find the global maximum in the posterior, or equivalently, the minimum in $\chi^2$. A variety of methods are discussed, including random search techniques like simulated annealing and the genetic algorithm. The other main approach is to home in on the minimum in $\chi^2$ from a good initial guess using an iterative linearization technique like Levenberg–Marquardt (Sections 11.5 and 11.5.1), the method used in Mathematica's NonlinearRegress command.

Once the minimum is located, the parameter errors can be approximately estimated from $\mathbf{I}^{-1}$, the inverse of the information matrix (Section 11.2). $\mathbf{I}^{-1}$ is analogous to the parameter covariance matrix in linear model fitting. Improved error estimates can be obtained from the Laplacian approximation to the marginal posterior distribution for any particular parameter (see Section 11.3.2). For model comparison problems, Section 11.3.1 describes a useful Laplacian approximation for the global likelihood, $p(D|M,I)$, that is needed in calculating the Bayes factor. Section 11.6 and the section entitled "Nonlinear Least-Squares Fitting" in the accompanying Mathematica tutorial provide useful templates for the analysis of typical nonlinear model fitting problems.

The data from some experiments have errors in both coordinates, which can turn a linear model fitting problem into a nonlinear problem. This issue was discussed earlier in Section 4.8.2, and a particular example of fitting a straight line model was treated in Section 11.7.
11.9 Problems

Nonlinear Model Fitting (See "Nonlinear Least-Squares Fitting" in the Mathematica tutorial.)

Table 11.1 gives a frequency spectrum consisting of 100 pairs of frequency and voltage (x, y). From measurements when the signal was absent, the noise is known to be IID normal with a standard deviation $\sigma = 0.3$ voltage units. The spectrum is thought to consist of two or more narrow lines which are broadened by the instrumental response of the detector, which is well described by a Gaussian with a $\sigma_L = 1.0$ frequency unit. In addition, there is an unknown constant offset. Use a model for the signal consisting of a sum of Gaussians plus a constant offset of the form

$$y(x_i) = A_0 + A_1 \exp\left(-\frac{(x_i - C_1)^2}{2\sigma_L^2}\right) + A_2 \exp\left(-\frac{(x_i - C_2)^2}{2\sigma_L^2}\right) + \cdots$$

In this problem, refer to the model with two lines as model 2, that with three lines as model 3, etc. The objective of this assignment is to determine the most probable model and the best estimates of the model parameters and their errors. Find the most likely number of lines by fitting progressively more Gaussians, examining the residuals after each trial.

Table 11.1 The table contains a frequency spectrum consisting of 100 pairs of frequency and voltage.

    f (Hz)   V       f (Hz)   V       f (Hz)   V       f (Hz)   V
    1.00     1.391   5.25     5.537    9.50    3.113   13.75    2.038
    1.17     1.000   5.42     6.091    9.67    3.293   13.92    2.585
    1.34     0.552   5.59     6.163    9.84    3.139   14.09    2.492
    1.51     1.249   5.76     5.365   10.01    2.840   14.26    2.193
    1.68     0.534   5.93     5.916   10.18    3.119   14.43    1.866
    1.85     1.386   6.10     5.530   10.35    3.311   14.60    1.571
    2.02     0.971   6.27     4.552   10.52    4.347   14.77    1.779
    2.19     0.901   6.44     3.833   10.69    4.819   14.94    1.542
    2.36     0.851   6.61     3.756   10.86    4.378   15.11    1.562
    2.53     1.334   6.78     3.055   11.03    4.544   15.28    1.666
    2.70     0.549   6.95     3.009   11.20    4.562   15.45    0.904
    2.87     1.373   7.12     2.855   11.37    5.662   15.62    1.074
    3.04     0.997   7.29     2.357   11.54    4.479   15.79    1.530
    3.21     1.231   7.46     2.732   11.71    5.373   15.96    0.747
    3.38     1.586   7.63     1.836   11.88    4.883   16.13    0.945
    3.55     2.244   7.80     1.918   12.05    4.678   16.30    1.301
    3.72     1.914   7.97     1.534   12.22    5.100   16.47    1.323
    3.89     2.467   8.14     2.238   12.39    3.868   16.64    0.919
    4.06     2.609   8.31     2.623   12.56    4.132   16.81    1.320
    4.23     3.036   8.48     2.275   12.73    3.702   16.98    0.915
    4.40     3.581   8.65     2.408   12.90    3.267   17.15    0.814
    4.57     4.073   8.82     2.701   13.07    3.323   17.32    0.983
    4.74     5.010   8.99     2.659   13.24    3.413   17.49    1.158
    4.91     4.989   9.16     3.224   13.41    2.762   17.66    0.917
    5.08     4.940   9.33     2.237   13.58    2.418   17.83    1.355

The following items are required as part of your solution:

1. Plot the raw data together with error bars. NonlinearRegress in Mathematica uses the Levenberg–Marquardt method, which requires good initial guesses of the parameter values. For each model, provide a table of your initial guess of each parameter value.
2. For each choice of model, give a table of the best-fit parameters and their errors as derived from the asymptotic covariance matrix. Also list the covariance matrix. Note: If you are using Mathematica's NonlinearRegress, remember that it computes an asymptotic covariance matrix that is scaled by a factor $k^2$. This is an attempt to obtain more robust parameter errors based on assuming the model is correct and then adjusting all the assumed measurement errors by a factor $k$ (explained in Section 11.6; see also Equation (10.142) and discussion). For each choice of model, compute the factor $k$.
3. Plot each model on top of the data with error bars.
4. For each model, plot the residuals and decide whether there is evidence for another line to be fitted. Estimate the parameters of the line from the residuals and then generate a new model to fit to the data that includes the new line together with the earlier lines. Note: the residuals may suggest the presence of multiple lines. It is best to add only the strongest one to your next model. Some of the minor features in the residuals will disappear as the earlier model lines re-adjust their best locations in response to the addition of the one new line.
5. For each model, calculate the $\chi^2$ goodness-of-fit and the frequentist P-value (significance), which represents the fraction of hypothetical repeats of the experiment that are expected to fall in the tail area by chance if the model is correct. The confidence in rejecting the model = 1 − P-value. Explain what you conclude from these goodness-of-fit results.
6. For each model, compute the Laplacian estimate of the global likelihood for use in the model selection problem. Compute the odds ratio for model $(i)$ compared to model $(i-1)$. Assume a uniform prior for each model amplitude parameter, with a range of a factor of 3 of your initial guess, $A_g$, for the parameter, i.e., $\Delta A = 3A_g - A_g/3 = 2.667A_g$. Assume a uniform prior for each model line center frequency parameter within the range 1 to 17 frequency units.
7. For your best model, compute and plot (on the same graph) the projected probability and the Laplacian approximation to the marginal probability for $A_3$, the amplitude of the third strongest line. Again, see the Mathematica tutorial for an example.
12 Markov chain Monte Carlo
12.1 Overview

In the last chapter, we discussed a variety of approaches to estimate the most probable set of parameters for nonlinear models. The primary rationale for these approaches is that they circumvent the need to carry out the multi-dimensional integrals required in a full Bayesian computation of the desired marginal posteriors. This chapter provides an introduction to a very efficient mathematical tool to estimate the desired posterior distributions for high-dimensional models that has been receiving a lot of attention recently. The method is known as Markov chain Monte Carlo (MCMC). MCMC was first introduced in the early 1950s by statistical physicists (N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller) as a method for the simulation of simple fluids. Monte Carlo methods are now widely employed in all areas of science and economics to simulate complex systems and to evaluate integrals in many dimensions. Among all Monte Carlo methods, MCMC provides an enormous scope for dealing with very complicated systems. In this chapter we will focus on its use in evaluating the multi-dimensional integrals required in a Bayesian analysis of models with many parameters.

The chapter starts with an introduction to Monte Carlo integration and examines how a Markov chain, implemented by the Metropolis–Hastings algorithm, can be employed to concentrate samples in regions with significant probability. Next, tempering improvements are investigated that prevent the MCMC from getting stuck in the region of a local peak in the probability distribution. One such method called parallel tempering is used to re-analyze the spectral line problem of Section 3.6. We also demonstrate how to use the results of parallel tempering MCMC for model comparison. Although MCMC methods are relatively simple to implement, in practice, a great deal of time is expended in optimizing some of the MCMC parameters. Section 12.8 describes one attempt at automating the selection of these parameters. The capabilities of this automated MCMC algorithm are demonstrated in a re-analysis of an astronomical data set used to discover an extrasolar planet.
12.2 Metropolis–Hastings algorithm

Suppose we can write down the joint posterior density,$^1$ $p(X|D,I)$, of a set of model parameters represented by $X$. We now want to calculate the expectation value of some function $f(X)$ of the parameters. The expectation value is obtained by integrating the function weighted by $p(X|D,I)$:

$$\langle f(X)\rangle = \int f(X)\,p(X|D,I)\,dX = \int g(X)\,dX. \tag{12.1}$$

For example, if there is only one parameter and we want to compute its mean value, then $f(X) = X$. Also, we frequently want to compute the marginal probability of a subset $X_A$ of the parameters and need to integrate over the remaining parameters designated $X_B$. Unfortunately, in many cases, we are unable to perform the integrals required in a reasonable length of time. In this section, we develop an efficient method to approximate the desired integrals, starting with a discussion of Monte Carlo integration.

Given a value of $X$, the discussion below assumes we can compute the value of $g(X)$. In straight Monte Carlo integration, the procedure is to pick $n$ points, uniformly randomly distributed in a multi-dimensional volume ($V$) of our parameter space $X$. The volume must be large enough to contain all regions where $g(X)$ contributes significantly to the integral. Then the basic theorem of Monte Carlo integration estimates the integral of $g(X)$ over the volume $V$ by

$$\langle f(X)\rangle = \int_V g(X)\,dX \approx V\,\langle g(X)\rangle \pm V\sqrt{\frac{\langle g^2(X)\rangle - \langle g(X)\rangle^2}{n}}, \tag{12.2}$$

where

$$\langle g(X)\rangle = \frac{1}{n}\sum_{i=1}^{n} g(X_i); \qquad \langle g^2(X)\rangle = \frac{1}{n}\sum_{i=1}^{n} g^2(X_i). \tag{12.3}$$
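As a rough illustration of Equations (12.2) and (12.3), the following Python sketch estimates an integral and its error term by uniform sampling in a box. The integrand `g`, the box limits, and the function name `mc_integrate` are our own assumptions for illustration; they are not taken from the text or its Mathematica notebooks.

```python
import math
import random

def mc_integrate(g, bounds, n, seed=0):
    """Straight Monte Carlo integration over a box, Eqs. (12.2)-(12.3).

    bounds: list of (low, high) pairs, one per dimension.
    Returns (estimate, rough_error)."""
    rng = random.Random(seed)
    volume = 1.0
    for lo, hi in bounds:
        volume *= hi - lo
    s1 = s2 = 0.0
    for _ in range(n):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        gx = g(x)
        s1 += gx
        s2 += gx * gx
    mean_g = s1 / n      # <g(X)>
    mean_g2 = s2 / n     # <g^2(X)>
    estimate = volume * mean_g
    error = volume * math.sqrt(max(mean_g2 - mean_g ** 2, 0.0) / n)
    return estimate, error

# Hypothetical integrand: an unnormalized 2D Gaussian whose integral
# over the whole plane is 2*pi; the box [-6, 6]^2 contains essentially all of it.
def g(x):
    return math.exp(-0.5 * (x[0] ** 2 + x[1] ** 2))

estimate, error = mc_integrate(g, [(-6.0, 6.0), (-6.0, 6.0)], n=200_000)
print(estimate, "+/-", error, "(exact value is about", 2 * math.pi, ")")
```

Most of the uniformly drawn points in this box fall where `g` is negligible, which is the inefficiency that motivates the Markov chain approach.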
There is no guarantee that the error is distributed as a Gaussian, so the error term is only a rough indicator of the probable error. When the random samples $X_i$ are independent, the law of large numbers ensures that the approximation can be made as accurate as desired by increasing $n$. Note: $n$ is the number of random samples of $g(X)$, not the size of the fixed data sample.

The problem with Monte Carlo integration is that too much time is wasted sampling regions where $p(X|D,I)$ is very small. Suppose in a one-parameter problem the fraction of the time spent sampling regions of high probability is $10^{-1}$. Then in an $M$-parameter problem, this fraction could easily fall to $10^{-M}$. A variation of the simple Monte Carlo described above, which involves reweighting the integrand and adjusting the sample rules (known as "importance sampling"), helps considerably but it is difficult to design the reweighting for large numbers of parameters.

$^1$ In the literature dealing with MCMC, it is common practice to write $p(X)$ instead of $p(X|D,I)$.
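The fall-off of the useful sampling fraction with the number of parameters can be checked with a small experiment. The sketch below is only an illustration under an assumed geometry: the region of high probability is modeled as a central box occupying one tenth of the prior range in each of the $M$ parameters, and the function name is hypothetical.

```python
import random

def high_prob_fraction(m, n_samples=200_000, width=0.1, seed=0):
    """Fraction of uniform samples in the unit hypercube [0,1]^m that fall
    inside a central box of side `width` (our stand-in for the region of
    high posterior probability)."""
    rng = random.Random(seed)
    lo, hi = 0.5 - width / 2, 0.5 + width / 2
    hits = 0
    for _ in range(n_samples):
        if all(lo <= rng.random() <= hi for _ in range(m)):
            hits += 1
    return hits / n_samples

fractions = {m: high_prob_fraction(m) for m in (1, 2, 3)}
for m, frac in fractions.items():
    print(m, frac)   # compare with 10**(-m)
```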
In general, drawing samples independently from $p(X|D,I)$ is not currently computationally feasible for problems where there are large numbers of parameters. However, the samples need not necessarily be independent. They can be generated by any process that generates samples from the target distribution, $p(X|D,I)$, in the correct proportions. All MCMC algorithms generate the desired samples by constructing a kind of random walk in the model parameter space such that the probability for being in a region of this space is proportional to the posterior density for that region. The random walk is accomplished using a Markov chain, whereby the new sample, $X_{t+1}$, depends on the previous sample $X_t$ according to an entity called the transition probability or transition kernel, $p(X_{t+1}|X_t)$. The transition kernel is assumed to be time independent. The remarkable property of $p(X_{t+1}|X_t)$ is that after an initial burn-in period (which is discarded) it generates samples of $X$ with a probability density equal to the desired posterior $p(X|D,I)$.

How does it work? There are two steps. In the first step, we pick a proposed value for $X_{t+1}$ which we call $Y$, from a proposal distribution, $q(Y|X_t)$, which is easy to evaluate. As we show below, $q(Y|X_t)$ can have almost any form. To help in developing your intuition, it is perhaps convenient to contemplate a multivariate normal (Gaussian) distribution for $q(Y|X_t)$, with a mean equal to the current sample $X_t$. With such a proposal distribution, the probability density decreases with distance away from the current sample. The second step is to decide on whether to accept the candidate $Y$ for $X_{t+1}$ on the basis of the value of a ratio $r$ given by

$$r = \frac{p(Y|D,I)}{p(X_t|D,I)}\,\frac{q(X_t|Y)}{q(Y|X_t)}, \tag{12.4}$$
where $r$ is called the Metropolis ratio. If the proposal distribution is symmetric, then the second factor is $= 1$. If $r \ge 1$, then we set $X_{t+1} = Y$. If $r < 1$, then we accept it with a probability $= r$. This is done by sampling a random variable $U$ from Uniform(0, 1), a uniform distribution in the interval 0 to 1. If $U \le r$ we set $X_{t+1} = Y$, otherwise we set $X_{t+1} = X_t$. This second step can be summarized by a term called the acceptance probability $\alpha(X_t, Y)$ given by

$$\alpha(X_t, Y) = \min(1, r) = \min\left(1, \frac{p(Y|D,I)}{p(X_t|D,I)}\,\frac{q(X_t|Y)}{q(Y|X_t)}\right). \tag{12.5}$$

The MCMC method as initially proposed by Metropolis et al. in 1953 considered only symmetric proposal distributions, having the form $q(Y|X_t) = q(X_t|Y)$. Hastings (1970) generalized the algorithm to include asymmetric proposal distributions and the generalization is commonly referred to as the Metropolis–Hastings algorithm. There are now many different versions of the algorithm.

The Metropolis–Hastings algorithm is extremely simple:

1. Initialize $X_0$; set $t = 0$.
2. Repeat {
     Obtain a new sample $Y$ from $q(Y|X_t)$
     Sample a Uniform(0,1) random variable $U$
     If $U \le r$ set $X_{t+1} = Y$, otherwise set $X_{t+1} = X_t$
     Increment $t$ }
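The two-step loop above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the figures in this chapter: the function and argument names are our own, the work is done with logarithms to avoid underflow, and the usage example (a standard normal target with a Gaussian random-walk proposal) is an assumed test case.

```python
import math
import random

def metropolis_hastings(log_target, propose, log_q, x0, n_steps, rng):
    """Generic Metropolis-Hastings sampler.

    log_target(x): log of the (possibly unnormalized) posterior p(x|D,I)
    propose(x, rng): draws a candidate Y from q(Y|x)
    log_q(a, b): log q(a|b), the proposal density of a given b
    Returns (chain, acceptance_rate)."""
    x = x0
    chain = [x0]
    n_accept = 0
    for _ in range(n_steps):
        y = propose(x, rng)
        # log of the Metropolis ratio r of Equation (12.4)
        log_r = (log_target(y) - log_target(x)) + (log_q(x, y) - log_q(y, x))
        # accept if U <= r, written as log(U) <= log(r); U lies in (0, 1]
        u = 1.0 - rng.random()
        if math.log(u) <= log_r:
            x = y
            n_accept += 1
        chain.append(x)
    return chain, n_accept / n_steps

# Assumed test case: standard normal target, symmetric Gaussian proposal.
sigma = 1.0
rng = random.Random(1)
chain, acc_rate = metropolis_hastings(
    log_target=lambda x: -0.5 * x * x,
    propose=lambda x, rng: rng.gauss(x, sigma),
    log_q=lambda a, b: -0.5 * ((a - b) / sigma) ** 2,
    x0=5.0, n_steps=20000, rng=rng)

samples = chain[1000:]                      # discard an assumed burn-in
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var, acc_rate)
```

Because the proposal here is symmetric, the two `log_q` terms cancel and the sampler reduces to the original Metropolis form; they are kept so that an asymmetric proposal can be substituted.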
Example 1: Suppose the posterior is a Poisson distribution, $p(X|D,I) = \mu^X e^{-\mu}/X!$. For our proposal distribution $q(Y|X_t)$, we will use a simple random walk such that:

1. Given $X_t$, pick a random number $U_1 \sim U(0,1)$.
2. If $U_1 > 0.5$, propose $Y = X_t + 1$; otherwise $Y = X_t - 1$.
3. Compute the Metropolis ratio $r = p(Y|D,I)/p(X_t|D,I) = \mu^{Y - X_t}\,X_t!/Y!$.
4. Acceptance/rejection: $U_2 \sim U(0,1)$. Accept $X_{t+1} = Y$ if $U_2 \le r$; otherwise set $X_{t+1} = X_t$.
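The four steps of Example 1 translate almost line for line into code. The Python sketch below is our own illustration: the function name is hypothetical, the rejection of negative proposals is an assumption (the text does not say how $Y = -1$ is handled; the Poisson posterior assigns it zero probability), and the chain is run longer than the 1000 samples plotted in the text so that the sample mean is stable.

```python
import math
import random

def poisson_mcmc(mu, x0, n_steps, seed=0):
    """Random-walk Metropolis sampler for a Poisson posterior with mean mu."""
    rng = random.Random(seed)
    x = x0
    chain = [x]
    for _ in range(n_steps - 1):
        # steps 1-2: propose a move of +1 or -1 with equal probability
        y = x + 1 if rng.random() > 0.5 else x - 1
        if y < 0:
            # assumption: negative counts have zero posterior probability
            chain.append(x)
            continue
        # step 3: r = mu^(Y - X_t) X_t! / Y!, evaluated through log-gamma
        r = math.exp((y - x) * math.log(mu)
                     + math.lgamma(x + 1) - math.lgamma(y + 1))
        # step 4: accept if U2 <= r
        if rng.random() <= r:
            x = y
        chain.append(x)
    return chain

chain = poisson_mcmc(mu=3.0, x0=10, n_steps=50_000)
kept = chain[100:]                      # drop the first 100 samples as burn-in
sample_mean = sum(kept) / len(kept)
print(sample_mean)                      # should be close to mu = 3
```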
Figure 12.1 illustrates the results for the above simple MCMC simulation using a value of $\mu = 3$ and starting from an initial $X_0 = 10$ which is far out in the tail of the posterior. Panel (a) shows a sequence of 1000 samples from the MCMC. It is clear that the samples quickly move from our starting point far out in the tail to the vicinity of the posterior mean. Panel (b) compares a histogram of the last 900 samples from the MCMC with the true Poisson posterior which is indicated by the solid line. The agreement is very good. We treated the first 100 samples as an estimate of the burn-in period and did not use them.

Figure 12.1 The results from a simple one-dimensional Markov chain Monte Carlo simulation for a Poisson posterior for $X$. Panel (a) shows a sequence of 1000 samples from the MCMC. Panel (b) shows a comparison of the last 900 MCMC samples with the true posterior indicated by the solid curve.

Example 2: Now consider a MCMC simulation of samples from a joint posterior $p(X_1, X_2|D,I)$ in two parameters $X_1$ and $X_2$, which has a double peak structure. Note: if we want to refer to the $t$th time sample of the $i$th parameter from a Markov chain, we will do so with the designation $X_{t,i}$. We define the posterior in Mathematica with the following commands.

    Needs["Statistics`MultinormalDistribution`"]
    dist1 = MultinormalDistribution[{0, 0}, {{1, 0}, {0, 1}}]

The first argument {0, 0} indicates the multinormal distribution is centered at 0,0. The second argument {{1, 0}, {0, 1}} gives the covariance of the distribution.

    dist2 = MultinormalDistribution[{4, 0}, {{2, 0.8}, {0.8, 2}}]
    Posterior = 0.5 (PDF[dist1, {X1, X2}] + PDF[dist2, {X1, X2}])

The factor of 0.5 ensures the posterior is normalized to an area of one. In this example, we used a proposal density function $q(Y_1, Y_2|X_1, X_2)$ which is a two-dimensional Gaussian (normal) distribution.

    MultinormalDistribution[{X1, X2}, {{σ1^2, 0}, {0, σ2^2}}]

The results for 8000 samples of the posterior generated with this MCMC are shown in Figure 12.2. Note that the first 50 samples were treated as the burn-in period and are not included in this plot. Panel (a) shows a sequence of 7950 samples from the MCMC with $\sigma_1 = \sigma_2 = 1$. The two model parameters represented by $X_1$ and $X_2$ could be very different physical quantities each characterized by a different scale. In that case, $\sigma_1$ and $\sigma_2$ could be very different. Panel (b) shows the same points with contours of the posterior overlaid. The distribution of sample points matches the contours of the true posterior very well. Panel (c) shows a comparison of the true marginal posterior (solid curve) for $X_1$ and the MCMC marginal (dots). The MCMC marginal is simply a normalized histogram of the $X_1$ sample values.
Panel (d) shows a comparison of the true marginal posterior (solid curve) for $X_2$ and the MCMC marginal (dots). In both cases, the agreement is very good.

Figure 12.2 The results from a two-dimensional Markov chain Monte Carlo simulation of a double peaked posterior. Panel (a) shows a sequence of 7950 samples from the MCMC. Panel (b) shows the same points with contours of the posterior overlaid. Panel (c) shows a comparison of the marginal posterior (solid curve) for $X_1$ and the MCMC marginal (dots). Panel (d) shows a comparison of the marginal posterior (solid curve) for $X_2$ and the MCMC marginal (dots).

We also investigated the evolution of the MCMC samples for proposal distributions with different values of $\sigma$. Panel (a) in Figure 12.3 shows the case for a $\sigma$ 1/10 the scale of the smallest features in the true posterior. The starting point for each simulation was at $X_1 = 4.5$, $X_2 = 4.5$. In this case, the burn-in period is considerably longer and it appears that a larger number of samples would be needed to do justice to the posterior which is indicated by the contours. Panel (b) illustrates the case for $\sigma = 1$, the value used for Figure 12.2. Panel (c) uses a $\sigma$ 10 times the scale of the smallest features in the posterior. From the density of the points it appears that we have used a much smaller number of MCMC samples. In fact we used the same number of samples. Recall that in MCMC we carry out a test to decide whether to accept the new proposal (see discussion following Equation (12.4)). If we fail to accept the proposal, then we set $X_{t+1} = X_t$. Thus, many of the points in panel (c) are repeats of the same sample as the proposed sample was rejected on many occasions.

Figure 12.3 A comparison of the samples from three Markov chain Monte Carlo runs using Gaussian proposal distributions with differing values of the standard deviation: (a) $\sigma = 0.1$, (b) $\sigma = 1$, (c) $\sigma = 10$. The starting point for each run was at $X_1 = 4.5$ and $X_2 = 4.5$.

It is commonly agreed that finding an ideal proposal distribution is an art. If we restrict the conversation to Gaussian proposal distributions then the question becomes what is the optimum choice of $\sigma$? As mentioned earlier, the samples from a MCMC are not independent, but exhibit correlations. In Figure 12.4, we illustrate the correlations of samples corresponding to the three choices of $\sigma$ used in Figure 12.3 by plotting the autocorrelation functions (ACFs) for $X_2$. The ACF, $\rho(h)$, which was introduced in Section 5.13.2, is given by

$$\rho(h) = \frac{\sum_{\rm overlap}\left[(X_t - \bar{X})(X_{t+h} - \bar{X})\right]}{\sqrt{\sum_{\rm overlap}(X_t - \bar{X})^2}\,\sqrt{\sum_{\rm overlap}(X_{t+h} - \bar{X})^2}}, \tag{12.6}$$

where $X_{t+h}$ is a shifted version of $X_t$ and the summation is carried out over the subset of samples that overlap. The shift $h$ is referred to as the lag. It is often observed that $\rho(h)$ is roughly exponential in shape so we can model the ACF

$$\rho(h) \approx \exp\left\{-\frac{h}{\tau_{\rm exp}}\right\}. \tag{12.7}$$
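Equations (12.6) and (12.7) can be explored numerically with a short Python sketch. Everything here is an assumption made for illustration: the target is a one-dimensional standard normal rather than the double-peaked posterior of Example 2, the three proposal widths are chosen to suit that target, and $\tau_{\rm exp}$ is estimated crudely as the first lag at which $\rho(h)$ drops to $1/e$.

```python
import math
import random

def acf(x, h):
    """Autocorrelation at lag h following Equation (12.6); the sums run over
    the overlap of the series with its copy shifted by h."""
    mean = sum(x) / len(x)
    a = [v - mean for v in x[:len(x) - h]]   # X_t - Xbar
    b = [v - mean for v in x[h:]]            # X_{t+h} - Xbar
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return num / den

def random_walk_metropolis(sigma, n_steps, seed):
    """1D Metropolis chain for a standard normal target (illustrative only)."""
    rng = random.Random(seed)
    x, chain = 0.0, []
    for _ in range(n_steps):
        y = rng.gauss(x, sigma)
        log_r = 0.5 * (x * x - y * y)        # symmetric proposal
        if math.log(1.0 - rng.random()) <= log_r:
            x = y
        chain.append(x)
    return chain

def tau_exp_estimate(chain, max_lag=150):
    """Crude estimate of tau_exp: first lag with rho(h) <= 1/e, capped at max_lag."""
    for h in range(1, max_lag + 1):
        if acf(chain, h) <= math.exp(-1.0):
            return h
    return max_lag

taus = {s: tau_exp_estimate(random_walk_metropolis(s, 10_000, seed=2))
        for s in (0.1, 2.4, 50.0)}
print(taus)   # the intermediate proposal width should give the smallest value
```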
The autocorrelation time constant, $\tau_{\rm exp}$, reflects the convergence speed of the MCMC sampler and is approximately equal to the interval between independent samples. In general, the smaller the value of $\tau_{\rm exp}$ the better, i.e., the more efficient, the MCMC sampler is. Examination of Figure 12.4 indicates that of the three choices of $\sigma$ chosen above, $\sigma = 1.0$ leads to the smallest values of $\tau_{\rm exp}$ for $X_2$. Of course, in this example just considered, we have set $\sigma_{X_1} = \sigma_{X_2}$. In general, they will not be equal.

Figure 12.4 A comparison of the autocorrelation functions (ACF of $X_2$ plotted against lag) for three Markov chain Monte Carlo runs using Gaussian proposal distributions with differing values of the standard deviation: $\sigma = 0.1$, $\sigma = 1$, $\sigma = 10$.

Related to the optimum choice of $\sigma$ is the average rate at which proposed state changes are accepted, called the acceptance rate. Based on empirical studies, Roberts, Gelman, and Gilks (1997) recommend calibrating the acceptance rate to about 25% for a high-dimensional model and to about 50% for models of one or two dimensions. The acceptance rates corresponding to our three choices of $\sigma$ in Figure 12.3 are 95%, 63%, and 5%, respectively.

A number of issues arise from a consideration of these two simple examples. How do we decide: (a) the length of the burn-in period, (b) when to stop the Markov chain, and (c) what is a suitable proposal distribution? For a discussion of these points, the reader is referred to a collection of review and application papers (Gilks, Richardson, and Spiegelhalter 1996). For an unpublished 1996 roundtable discussion of informal advice for novice practitioners, moderated by R. E. Kass, see www.amstat.org/publications/tas/kass.pdf. The treatment of MCMC given in this text is intended only as an introduction to this topic.

Loredo (1999) gives an interesting perspective on the relationship between the development of MCMC in statistics and certain computational physics techniques. Define a function $\Lambda(X) = \ln[p(X|I)\,p(D|X,I)]$. Then the posterior distribution can be written as $p(X|D,I) = e^{\Lambda(X)}/Z$, where $Z = \int dX\, e^{\Lambda(X)}$. Evaluation of the posterior resembles two classes of problems familiar to physicists: evaluating Boltzmann factors and partition functions in statistical mechanics, and evaluating Feynman path weights and path integrals in Euclidean quantum field theory.
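Returning to the acceptance rates quoted above for Example 2, the dependence on $\sigma$ can be checked with a small Python sketch. This is our own re-implementation of the double-peaked posterior defined by the Mathematica commands in Example 2, not the code used for Figure 12.3; the function names, random seed, and number of steps are assumptions, so the rates it prints should show the same trend as the quoted values without necessarily matching them.

```python
import math
import random

def posterior(x1, x2):
    """Equal mixture of the two bivariate normals of Example 2."""
    # dist1: mean (0, 0), identity covariance
    p1 = math.exp(-0.5 * (x1 * x1 + x2 * x2)) / (2.0 * math.pi)
    # dist2: mean (4, 0), covariance [[2, 0.8], [0.8, 2]]
    a, b = 2.0, 0.8
    det = a * a - b * b
    d1, d2 = x1 - 4.0, x2
    quad = (a * d1 * d1 - 2.0 * b * d1 * d2 + a * d2 * d2) / det
    p2 = math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))
    return 0.5 * (p1 + p2)

def acceptance_rate(sigma, n_steps=20_000, start=(4.5, 4.5), seed=3):
    """Fraction of accepted proposals for a Gaussian proposal of width sigma."""
    rng = random.Random(seed)
    x = start
    px = posterior(*x)
    accepted = 0
    for _ in range(n_steps):
        y = (rng.gauss(x[0], sigma), rng.gauss(x[1], sigma))
        py = posterior(*y)
        # symmetric proposal: r = py / px; accept if U <= r
        if py > 0.0 and rng.random() * px <= py:
            x, px = y, py
            accepted += 1
    return accepted / n_steps

rates = {s: acceptance_rate(s) for s in (0.1, 1.0, 10.0)}
print(rates)
```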
For a discussion of some useful modern extensions of the Metropolis algorithm that are particularly accessible to physical scientists, see Liu (2001) and the first section of Toussaint (1989). A readable tutorial for statistics students is available in Chib and Greenberg (1995).
12.3 Why does Metropolis–Hastings work?

Remarkably, for a wide range of proposal distributions $q(Y|X)$, the Metropolis–Hastings algorithm generates samples of $X$ with a probability density which converges on the desired target posterior $p(X|D,I)$, called the stationary distribution of the Markov chain. For the distribution of $X_t$ to converge to a stationary distribution, the Markov chain must have three properties (Roberts, 1996). First, it must be irreducible. That is, from all starting points, the Markov chain must be able (eventually) to jump to all states in the target distribution with positive probability. Second, it must be aperiodic. This stops the chain from oscillating between different states in a regular periodic movement. Finally, the chain must be positive recurrent. This can be expressed in terms of the existence of a stationary distribution $p(X)$, say, such that if an initial value $X_0$ is sampled from $p(X)$, then all subsequent iterates will also be distributed according to $p(X)$.
To see that the target distribution is the stationary distribution of the Markov chain generated by the Metropolis–Hastings algorithm, consider the following: suppose we start with a sample $X_t$ from the target distribution. The probability of drawing $X_t$ from the posterior is $p(X_t|D,I)$. The probability that we will draw and accept a sample $X_{t+1}$ is given by the transition kernel, $p(X_{t+1}|X_t) = q(X_{t+1}|X_t)\,\alpha(X_t, X_{t+1})$, where $\alpha(X_t, X_{t+1})$ is given by Equation (12.5). The joint probability of $X_t$ and $X_{t+1}$ is then given by

$$\begin{aligned}
\text{Joint probability}(X_t, X_{t+1}) &= p(X_t|D,I)\,p(X_{t+1}|X_t) \\
&= p(X_t|D,I)\,q(X_{t+1}|X_t)\,\alpha(X_t, X_{t+1}) \\
&= p(X_t|D,I)\,q(X_{t+1}|X_t)\,\min\left(1, \frac{p(X_{t+1}|D,I)\,q(X_t|X_{t+1})}{p(X_t|D,I)\,q(X_{t+1}|X_t)}\right) \\
&= \min\big(p(X_t|D,I)\,q(X_{t+1}|X_t),\; p(X_{t+1}|D,I)\,q(X_t|X_{t+1})\big) \\
&= p(X_{t+1}|D,I)\,q(X_t|X_{t+1})\,\alpha(X_{t+1}, X_t) \\
&= p(X_{t+1}|D,I)\,p(X_t|X_{t+1}).
\end{aligned} \tag{12.8}$$

Thus, we have shown

$$p(X_t|D,I)\,p(X_{t+1}|X_t) = p(X_{t+1}|D,I)\,p(X_t|X_{t+1}), \tag{12.9}$$

which is called the detailed balance equation. In statistical mechanics, detailed balance occurs for systems in thermodynamic equilibrium.$^2$ In the present case, the condition of detailed balance means that the Markov chain generated by the Metropolis–Hastings algorithm converges to a stationary distribution. Recall from Equation (12.8) that $p(X_t|D,I)\,p(X_{t+1}|X_t)$ is the joint probability of $X_t$ and $X_{t+1}$. We will now integrate this joint probability with respect to $X_t$, making use of Equation (12.9), and demonstrate that the result is simply the marginal probability distribution of $X_{t+1}$.

$$\begin{aligned}
\int p(X_t|D,I)\,p(X_{t+1}|X_t)\,dX_t &= \int p(X_{t+1}|D,I)\,p(X_t|X_{t+1})\,dX_t \\
&= p(X_{t+1}|D,I)\int p(X_t|X_{t+1})\,dX_t \\
&= p(X_{t+1}|D,I).
\end{aligned} \tag{12.10}$$

Thus, we have shown that once a sample from the stationary target distribution has been obtained, all subsequent samples will be from that distribution.
$^2$ It may help to consider the following analogy: suppose we have a collection of hydrogen atoms. The number of atoms making a transition from excited state $t$ to state $t+1$ in 1 s is given by $N\,p(t)\,p(t+1|t)$, where $N$ equals the total number of atoms, $p(t)$ is the probability of an atom being in state $t$, and $p(t+1|t)$ is the probability that an atom in state $t$ will make a transition to state $t+1$ in 1 s. Similarly the number making transitions from $t+1$ to $t$ in 1 s is given by $N\,p(t+1)\,p(t|t+1)$. In thermodynamic equilibrium, the rate of transition from $t$ to $t+1$ is equal to the rate from $t+1$ to $t$, so $p(t)\,p(t+1|t) = p(t+1)\,p(t|t+1)$.
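Detailed balance and stationarity, Equations (12.9) and (12.10), can be verified numerically for a chain with a finite number of states. The Python sketch below builds the full transition matrix for the random-walk sampler of Example 1; truncating the Poisson posterior to the states 0 to 11 is our own simplification so that the matrix is finite, and unlike the kernel written in the text, the diagonal entries here also carry the probability of a rejected proposal.

```python
import math

mu, n_states = 3.0, 12           # assumed truncation: X = 0, 1, ..., 11
w = [mu ** k / math.factorial(k) for k in range(n_states)]
total = sum(w)
p = [v / total for v in w]       # target distribution on the truncated range

# Transition matrix: propose i-1 or i+1 with probability 1/2 each,
# accept with probability min(1, p[j]/p[i]).
P = [[0.0] * n_states for _ in range(n_states)]
for i in range(n_states):
    for j in (i - 1, i + 1):
        if 0 <= j < n_states:
            P[i][j] = 0.5 * min(1.0, p[j] / p[i])    # q * alpha
    P[i][i] = 1.0 - sum(P[i])    # rejected or out-of-range proposals stay put

# Detailed balance, Equation (12.9): p_i P_ij = p_j P_ji for every pair.
max_db_violation = max(abs(p[i] * P[i][j] - p[j] * P[j][i])
                       for i in range(n_states) for j in range(n_states))

# Stationarity, the discrete analogue of Equation (12.10): sum_i p_i P_ij = p_j.
p_next = [sum(p[i] * P[i][j] for i in range(n_states)) for j in range(n_states)]
max_stationarity_error = max(abs(a - b) for a, b in zip(p, p_next))

print(max_db_violation, max_stationarity_error)   # both at round-off level
```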
12.4 Simulated tempering

The simple Metropolis–Hastings algorithm outlined in Section 12.2 can run into difficulties if the target probability distribution is multi-modal. The MCMC can become stuck in a local mode and fail to fully explore other modes which contain significant probability. This problem is very similar to the one encountered in finding a global minimum in a nonlinear model fitting problem. One solution to that problem was to use simulated annealing (see Section 11.4.1) by introducing a temperature parameter $T$. The analogous process applied to drawing samples from a target probability distribution (e.g., Geyer and Thompson, 1995) is often referred to as simulated tempering (ST). In annealing, the temperature parameter is gradually decreased. In ST, we create a discrete set of progressively flatter versions of the target distribution using a temperature parameter. For $T = 1$, the distribution is the desired target distribution which is referred to as the cold sampler. For $T \gg 1$, the distribution is much flatter. The basic idea is that by repeatedly heating up the distribution (making it flatter), the new sampler can escape from local modes and increase its chance of reaching all regions of the target distribution that contain significant probability. Typical inference is based on samples drawn from the cold sampler, with the remaining observations discarded. Actually, in Section 12.7 we will see how to use the samples from the hotter distributions to evaluate Bayes factors in model selection problems.

Again, let $p(X|D,I)$ be the target posterior distribution we want to sample. Applying Bayes' theorem, we can write this as

$$p(X|D,I) = C\,p(X|I)\,p(D|X,I),$$

where $C = 1/p(D|I)$ is the usual normalization constant which is not important at this stage and will be dropped. We can construct other flatter distributions as follows:

$$p(X|D,\beta,I) = p(X|I)\,p(D|X,I)^{\beta} = p(X|I)\exp\left(\beta\ln[p(D|X,I)]\right), \qquad \text{for } 0 < \beta < 1. \tag{12.11}$$

Rather than use a temperature which varies from 1 to infinity, we prefer to use its reciprocal which we label $\beta$ and refer to as the tempering parameter. Thus $\beta$ varies from 1 to zero. We will use a discrete set of $\beta$ values labeled $\{1, \beta_2, \ldots, \beta_m\}$, where $\beta = 1$ corresponds to the cold sampler (target distribution) and $\beta_m$ corresponds to our hottest sampler which is generally much flatter. This particular formulation is also convenient for our later discussion on determining the Bayes factor in model selection problems. Rather than describe ST in detail, we will describe a more efficient related algorithm called parallel tempering in the next section.
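The flattening effect of Equation (12.11) is easy to see numerically. In the Python sketch below, the flat prior, the bimodal likelihood with peaks at $\pm 4$, and all function names are hypothetical choices made for illustration. The quantity printed is the difference in log posterior between a peak and the valley separating the two peaks, which scales in proportion to $\beta$.

```python
import math

def log_prior(x):
    """Hypothetical flat prior on [-10, 10]."""
    return -math.log(20.0) if -10.0 <= x <= 10.0 else -math.inf

def log_like(x):
    """Hypothetical bimodal log-likelihood (log-sum-exp form for stability)."""
    a = -0.5 * ((x + 4.0) / 0.5) ** 2
    b = -0.5 * ((x - 4.0) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_tempered_posterior(log_prior, log_like, beta):
    """ln p(X|D,beta,I) up to a constant: ln p(X|I) + beta * ln p(D|X,I)."""
    def f(x):
        return log_prior(x) + beta * log_like(x)
    return f

betas = [1.0, 0.5, 0.1, 0.01]        # an assumed tempering ladder
barriers = []
for beta in betas:
    f = log_tempered_posterior(log_prior, log_like, beta)
    barriers.append(f(4.0) - f(0.0)) # peak-to-valley drop in log posterior
print(list(zip(betas, barriers)))
```

For the cold sampler the valley is suppressed by a factor of roughly $e^{-31}$ relative to the peaks, while at $\beta = 0.01$ the suppression is only about $e^{-0.3}$, so a random walk can cross freely.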
12.5 Parallel tempering

Parallel tempering (PT) is an attractive alternative to simulated tempering (Liu, 2001). Again, multiple copies of the simulation are run in parallel, each at a different temperature (i.e., a different $\beta = 1/T$). One of the simulations, corresponding to $\beta = 1/T = 1$, is the desired target probability distribution. The other simulations correspond to a ladder of higher temperature distributions indexed by $i$. Let $n_\beta$ equal the number of parallel MCMC simulations. At intervals, a pair of adjacent simulations on this ladder are chosen at random and a proposal made to swap their parameter states. Suppose simulations $\beta_i$ and $\beta_{i+1}$ are chosen. At time $t$, simulation $\beta_i$ is in state $X_{t,i}$ and simulation $\beta_{i+1}$ is in state $X_{t,i+1}$. If the swap is accepted by the test given below then these states are interchanged. In the example discussed in Section 12.6, we specify that on average, a swap is proposed after every $n_s$ iterations ($n_s = 30$ was used) of the parallel simulations in the ladder. This is done by choosing a random number, $U_1 \sim$ Uniform[0,1], at each time iteration and proposing a swap only if $U_1 \le 1/n_s$. If a swap is to be proposed, we use a second random number to pick one of the ladder simulations $i$ in the range $i = 1$ to $(n_\beta - 1)$, and propose swapping the parameter states of $i$ and $i+1$. A Monte Carlo acceptance rule determines the probability for the proposed swap to occur. Accept the swap with probability

$$r = \min\left\{1, \frac{p(X_{t,i+1}|D,\beta_i,I)\,p(X_{t,i}|D,\beta_{i+1},I)}{p(X_{t,i}|D,\beta_i,I)\,p(X_{t,i+1}|D,\beta_{i+1},I)}\right\}, \tag{12.12}$$

where $p(X|D,\beta,I)$ is given by Equation (12.11). We accept the swap if $U_2 \sim$ Uniform[0,1] $\le r$. This swap allows for an exchange of information across the population of parallel simulations. In the higher temperature simulations, radically different configurations can arise, whereas in lower temperature states, a configuration is given the chance to refine itself. By making exchanges, we can capture and improve the higher probability configurations generated by the population by putting them into lower temperature simulations. Some experimentation is needed to refine suitable choices of $\beta_i$ values. Adjacent simulations need to have some overlap to achieve a sufficient acceptance probability for an exchange operation.
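The swap rule of Equation (12.12) can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the code used in Section 12.6: the one-dimensional bimodal likelihood, the flat prior, the $\beta$ ladder, the choice $n_s = 10$, and the function names are all hypothetical. With a flat prior the prior terms cancel, so the logarithm of the ratio in Equation (12.12) reduces to $(\beta_i - \beta_{i+1})\{\ln p(D|X_{t,i+1},I) - \ln p(D|X_{t,i},I)\}$.

```python
import math
import random

def log_like(x):
    """Hypothetical bimodal log-likelihood with narrow peaks at -4 and +4."""
    a = -0.5 * ((x + 4.0) / 0.5) ** 2
    b = -0.5 * ((x - 4.0) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def in_prior(x):
    return -10.0 <= x <= 10.0     # flat prior; its constant cancels in all ratios

def parallel_tempering(betas, n_steps, n_s=10, sigma=1.0, seed=4):
    """Return the samples of the first chain in `betas` (beta = 1 is the target)."""
    rng = random.Random(seed)
    states = [4.0] * len(betas)   # every chain starts in the right-hand peak
    cold = []
    for _ in range(n_steps):
        # one Metropolis update per chain, each at its own beta
        for k, beta in enumerate(betas):
            y = rng.gauss(states[k], sigma)
            if in_prior(y):
                log_r = beta * (log_like(y) - log_like(states[k]))
                if math.log(1.0 - rng.random()) <= log_r:
                    states[k] = y
        # propose a swap of adjacent chains on average once every n_s iterations
        if len(betas) > 1 and rng.random() <= 1.0 / n_s:
            i = rng.randrange(len(betas) - 1)
            log_r = (betas[i] - betas[i + 1]) * (
                log_like(states[i + 1]) - log_like(states[i]))
            if math.log(1.0 - rng.random()) <= log_r:
                states[i], states[i + 1] = states[i + 1], states[i]
        cold.append(states[0])
    return cold

cold = parallel_tempering([1.0, 0.3, 0.1, 0.03, 0.01], n_steps=60_000)
frac_left_pt = sum(x < 0 for x in cold) / len(cold)

plain = parallel_tempering([1.0], n_steps=20_000)   # no ladder: plain Metropolis
frac_left_plain = sum(x < 0 for x in plain) / len(plain)
print(frac_left_pt, frac_left_plain)
```

In this toy problem the single cold chain stays in the peak where it started, whereas the tempered ladder lets the $\beta = 1$ chain visit both peaks.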
12.6 Example

Although MCMC really comes into its own when the number of model parameters is very large, we will apply it to the toy spectral line problem we analyzed in Section 3.6, because we can compare with our earlier results. The objective of that problem was to test two competing models, represented by $M_1$ and $M_2$, on the basis of some spectral line data. Only $M_1$ predicts the existence of a particular spectral line. In the simplest version of the problem, the line frequency and shape is exactly predicted by $M_1$; the only quantity which is uncertain is the line strength $T$ expressed in temperature units. The odds ratio in favor of $M_1$ was found to be 11:1 assuming a Jeffreys prior for the line strength. We also computed the most probable line strength. In Section 3.9, we investigated how our conclusions would be altered if the line frequency were uncertain, i.e., it could occur anywhere between channels 1 to 44. In that case, the odds ratio favoring $M_1$ dropped from 11:1 to 1:1, assuming a uniform prior for the line center frequency. Below, we apply both the Metropolis–Hastings and parallel tempering versions of MCMC to the problem of estimating the marginal posteriors of the line strength and center frequency to compare with our previous results. In Section 12.7, we will employ parallel tempering to compute the Bayes factor needed for model comparison.

Metropolis–Hastings results

In this section, we will draw samples from $p(X|D,M_1,I)$, where $X$ is a vector representing the two parameters of model $M_1$, namely the line strength $T$ and the line center frequency $\nu$ expressed as channel number. We use a Jeffreys prior for $T$ in the range $T_{\min} = 0.1$ mK to $T_{\max} = 100$ mK. We assume a uniform prior for $\nu$ in the range channel 1 to 44. The steps in the calculation are as follows:

1. Initialize $X_0$; set $t = 0$. In this example we set $X_0 = \{T_0 = 5, \nu_0 = 30\}$.
2. Repeat {
   a) Obtain a new sample $Y$ from $q(Y|X_t)$, where $Y = \{T', \nu'\}$; we set
        q(T'|T_t) = Random[NormalDistribution[T_t, σ_T = 1.0]]
      and
        q(ν'|ν_t) = Random[NormalDistribution[ν_t, σ_f = 1.0]]
   b) Compute the Metropolis ratio
      $$r = \frac{p(Y|D,M_1,I)}{p(X_t|D,M_1,I)} = \frac{p(T',\nu'|M_1,I)\,p(D|M_1,T',\nu',I)}{p(T_t,\nu_t|M_1,I)\,p(D|M_1,T_t,\nu_t,I)},$$
      where $p(D|M_1,T,\nu,I)$ is given by Equations (3.44) and (3.41). The priors $p(T,\nu|M_1,I) = p(T|M_1,I)\,p(\nu|M_1,I)$ are given by Equations (3.38) and (3.33). Note: if $T'$ or $\nu'$ lie outside the prior boundaries set $r = 0$.
   c) Acceptance/rejection: $U \sim U(0,1)$
   d) Accept $X_{t+1} = Y$ if $U \le r$, otherwise set $X_{t+1} = X_t$
   e) Increment $t$ }

Figure 12.5 shows results for $10^5$ iterations of a Metropolis–Hastings Markov chain Monte Carlo. Panel (a) shows every 50th value of parameter $\nu$, expressed as a channel number, and panel (c) the same for parameter $T$. It is clear that the $\nu$ values move quickly to a region centered on channel 37 with occasional jumps to a region centered on channel 24 and only one jump to small channel numbers. The $T$ parameter can be seen to fluctuate between 0.1 and 3.5 mK. Panels (b) and (d) show a blow-up of the first 500 iterations. It is apparent from these panels that the burn-in period is very short, < 50 iterations for a starting state of $T = 5$ and $\nu = 30$.

Figure 12.5 Results for $10^5$ iterations of a Metropolis–Hastings Markov chain Monte Carlo. Panel (a) shows every 50th value of parameter $\nu$ and panel (c) the same for parameter $T$. Panels (b) and (d) show a blow-up of the first 500 iterations.

Figure 12.6 shows distributions of the two parameters. In panel (a), the joint distribution of $T$ and $\nu$ is apparent from the scatter plot of every 20th iteration obtained after dropping the burn-in period consisting of the first 50 iterations. To obtain the marginal posterior density for the $\nu$ parameter, we simply plot a histogram of all the $\nu$ values (post burn-in) normalized by dividing by the sum of the values multiplied by the width of each bin. This is shown plotted in panel (b) together with our earlier marginal distribution (solid curve) computed by numerical integration. It is clear that $10^5$ iterations of Metropolis–Hastings does a good job of defining the dominant peak of the probability distribution for $\nu$ but does a poor job of capturing two other widely separated islands containing significant probability. On the other hand, it is clear from panel (c) that it has done a great job of defining the distribution of $T$.

Figure 12.6 Results for the spectral line problem using a Metropolis–Hastings Markov chain Monte Carlo analysis. Panel (a) is a scatter plot of the result for every 20th iteration in the two model parameters, channel number $\nu$ and line strength $T$. Panel (b) shows the marginal probability density for channel number (points) compared to our earlier numerical integration result indicated by the solid curve. Panel (c) shows the marginal probability density for line strength $T$ (points) compared to our earlier numerical integration result indicated by the solid curve.

Parallel tempering results

We also analyzed the spectral line data with a parallel tempering (PT) version of MCMC described in Section 12.5. We used five values for the tempering parameter, $\beta$, uniformly spaced between 0.01 and 1.0, and ran all five chains in parallel. At intervals (on average every 50 iterations) a pair of adjacent simulations on this ladder are chosen at random and a proposal made to swap their parameter states. We used the same starting state of $T = 5$, $\nu = 30$ and executed $10^5$ iterations. The final results for the $\beta = 1$ simulation, corresponding to the target distribution, are shown in Figures 12.7 and 12.8. The acceptance rate for this simulation was 37%. Comparing panel (a) of Figures 12.7 and 12.5, we see that the PT version visits the two low-lying regions of $\nu$ probability much more frequently than the Metropolis–Hastings version. Comparing the marginal densities of Figures 12.8 and 12.6 we see that the PT marginal density for $\nu$ is in better agreement with the expected results indicated by the solid curves. For both versions, the marginal densities for $T$ are in excellent agreement with the expected result. In more complicated problems, we often cannot conveniently compute the marginal densities by another method. In this case, it is useful to compare the results from a number of PT simulations with different starting parameter states.
Figure 12.7 Results for $10^5$ iterations of a parallel tempering Markov chain Monte Carlo. Panel (a) shows every 50th value of parameter $\nu$ (channel number) and panel (c) the same for parameter $T$, both plotted against iteration. Panels (b) and (d) show a blow-up of the first 500 iterations.

12.7 Model comparison

So far we have demonstrated how to use MCMC to compute the marginal posteriors for model parameters. In this section, we will show how to use the results of parallel tempering to compute the Bayes factor used in model comparison (Skilling, 1998; Goggans and Chi, 2004). In the toy spectral line problem of Section 3.6, we were interested in computing the odds ratio of two models $M_1$ and $M_2$, which from Equation (3.30) is equal to the prior odds times the Bayes factor given by

$$B_{12} = \frac{p(D|M_1,I)}{p(D|M_2,I)}, \tag{12.13}$$

where $p(D|M_1,I)$ and $p(D|M_2,I)$ are the global likelihoods for the two models. In the version of this problem analyzed in Section 12.6, $M_1$ has two parameters, $\nu$ and $T$. For independent priors,

$$p(D|M_1,I) = \int d\nu\, p(\nu|M_1,I) \int dT\, p(T|M_1,I)\, p(D|M_1,\nu,T,I). \tag{12.14}$$

In what follows, we will generalize the model parameter set to an arbitrary number of parameters which we represent by the vector $X$. To evaluate $p(D|M_1,I)$ using parallel tempering MCMC, we first define a partition function

$$Z(\beta) = \int dX\, p(X|M_1,I)\, p(D|M_1,X,I)^{\beta} = \int dX\, \exp\{\ln[p(X|M_1,I)] + \beta \ln[p(D|M_1,X,I)]\}, \tag{12.15}$$

where $\beta$ is the tempering parameter introduced in Section 12.4. Now take the derivative of $\ln[Z(\beta)]$:

$$\frac{d}{d\beta}\ln[Z(\beta)] = \frac{1}{Z(\beta)}\frac{d}{d\beta}Z(\beta). \tag{12.16}$$
Figure 12.8 Results for the spectral line problem using a Markov chain Monte Carlo analysis with parallel tempering. Panel (a) is a scatter plot of the result for every 20th iteration in the two model parameters, channel number and line strength $T$. Panel (b) shows the marginal probability density for channel number (points) compared to our earlier numerical integration result indicated by the solid line. Panel (c) shows the marginal probability density for line strength $T$ in mK (points) compared to our earlier numerical integration result indicated by the solid line.
$$\frac{d}{d\beta}Z(\beta) = \int dX\, \ln[p(D|M_1,X,I)]\, \exp\{\ln[p(X|M_1,I)] + \beta \ln[p(D|M_1,X,I)]\} = \int dX\, \ln[p(D|M_1,X,I)]\, p(X|M_1,I)\, p(D|M_1,X,I)^{\beta}. \tag{12.17}$$

Substituting Equation (12.17) into Equation (12.16), we obtain

$$\frac{d}{d\beta}\ln[Z(\beta)] = \frac{\int dX\, \ln[p(D|M_1,X,I)]\, p(X|M_1,I)\, p(D|M_1,X,I)^{\beta}}{\int dX\, p(X|M_1,I)\, p(D|M_1,X,I)^{\beta}} = \langle \ln[p(D|M_1,X,I)] \rangle_{\beta}, \tag{12.18}$$

where $\langle \ln[p(D|M_1,X,I)] \rangle_{\beta}$ is the expectation value of $\ln[p(D|M_1,X,I)]$. This quantity is easily evaluated from the MCMC results, which consist of sets of $X_t$ samples, one set for each value of the tempering parameter $\beta$. Let $\{X_{t,\beta}\}$ represent the samples for tempering parameter $\beta$. Then

$$\langle \ln[p(D|M_1,X,I)] \rangle_{\beta} = \frac{1}{n}\sum_{t} \ln[p(D|M_1,X_{t,\beta},I)], \tag{12.19}$$
where $n$ is the number of samples in each set after the burn-in period. From Equation (12.18) we can write

$$\int_0^1 d\ln[Z(\beta)] = \ln[Z(1)] - \ln[Z(0)] = \int_0^1 d\beta\, \langle \ln[p(D|M_1,X,I)] \rangle_{\beta}. \tag{12.20}$$

Now from Equation (12.15)

$$Z(1) = \int dX\, p(X|M_1,I)\, p(D|M_1,X,I) = p(D|M_1,I), \tag{12.21}$$

and

$$Z(0) = \int dX\, p(X|M_1,I). \tag{12.22}$$

From Equations (12.20) and (12.21) we can write

$$\ln[p(D|M_1,I)] = \ln[Z(0)] + \int_0^1 d\beta\, \langle \ln[p(D|M_1,X,I)] \rangle_{\beta}. \tag{12.23}$$

For a normalized prior, $Z(0) = 1$ and Equation (12.23) becomes

$$\ln[p(D|M_1,I)] = \int_0^1 d\beta\, \langle \ln[p(D|M_1,X,I)] \rangle_{\beta}. \tag{12.24}$$
Armed with Equation (12.24) we are now in a position to evaluate the Bayes factor given by Equation (12.13), which is at the heart of model comparison. Returning to the spectral line problem,

$$\langle \ln[p(D|M_1,\nu,T,I)] \rangle_{\beta} = \frac{1}{n}\sum_{t} \ln[p(D|M_1,\nu_{t,\beta},T_{t,\beta},I)]. \tag{12.25}$$
We evaluated Equation (12.25) for the five values of $\beta = 0.01$, 0.2575, 0.505, 0.7525, 1.0 used in the PT MCMC analysis of Section 12.6. The results were $-97.51$, $-87.1937$, $-86.4973$, $-85.9128$, $-85.1565$, respectively. We then evaluated the integral in Equation (12.24) by generating an interpolating function and integrating it over the interval 0 to 1. This yielded $\ln[p(D|M_1,I)] = -87.4462$. A more sophisticated interpolation of the results yielded $\ln[p(D|M_1,I)] = -87.3369$. Model $M_2$ had no free parameters and $p(D|M_2,I) = 1.133 \times 10^{-38}$ from Equation (3.49). The resulting Bayes factors for the two interpolations were $B_{12} = 0.93$ and 1.04, respectively. This should be compared to $B_{12} = 1.06$ obtained from our earlier solution to this problem.
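The integral in Equation (12.24) can be approximated directly from the five tabulated averages. The sketch below uses a plain trapezoidal rule rather than the interpolating functions used in the text, and it assumes the value at $\beta = 0.01$ can be held constant down to $\beta = 0$. Both choices are simplifications, so the result is not expected to reproduce the quoted values exactly. The function name `log_global_likelihood` is hypothetical.

```python
import numpy as np

# Tempering levels and the corresponding <ln p(D|M1,X,I)>_beta from the text.
beta = np.array([0.01, 0.2575, 0.505, 0.7525, 1.0])
mean_loglike = np.array([-97.51, -87.1937, -86.4973, -85.9128, -85.1565])

def log_global_likelihood(beta, mean_loglike):
    """Trapezoidal estimate of Eq. (12.24) over beta in [0, 1].

    Assumption: the integrand between beta = 0 and the smallest tabulated
    beta is taken to be constant at the hottest tabulated value.
    """
    b = np.concatenate(([0.0], beta))
    y = np.concatenate(([mean_loglike[0]], mean_loglike))
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(b)))

ln_p_m1 = log_global_likelihood(beta, mean_loglike)
ln_p_m2 = np.log(1.133e-38)            # p(D|M2,I) quoted from Equation (3.49)
bayes_factor = np.exp(ln_p_m1 - ln_p_m2)
print(ln_p_m1, bayes_factor)
```

Because the integrand changes rapidly near $\beta = 0$, the estimate is sensitive to how that region is handled. This is one reason a finer or non-uniform ladder at small $\beta$, or a smoother interpolation, may be preferable in practice.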
12.8 Towards an automated MCMC

As the number of model parameters increases, so does the time required to choose a suitable $\sigma$ value for each of the parameter proposal distributions. Suitable means that MCMC solutions, starting from different locations in the prior parameter space, yield equilibrium distributions of model parameter values that are not significantly different, in an acceptable number of iterations. Generally this involves running a series of chains, each time varying $\sigma$ for one or more of the parameter proposal distributions, until the chain appears to converge on an equilibrium distribution with a proposal acceptance rate, $\lambda$, that is reasonable for the number of parameters involved, e.g., approximately 25% for a large number of parameters (Roberts, Gelman, and Gilks, 1997). This is especially time consuming if each parameter corresponds to a different physical quantity, so that the $\sigma$ values can be very different. In this section, we describe one attempt at automating this process, which we apply to the detection of an extrasolar planet using some real astronomical data.

Suppose we are dealing with $M$ parameters that are represented collectively by $\{X_\alpha\}$. Let $\sigma_\alpha$ represent the characteristic width of a symmetric proposal distribution for $X_\alpha$. We will assume Gaussian proposal distributions, but the general approach should also be applicable to other forms of proposal distributions. To automate the MCMC, we need to incorporate a control system that makes use of some form of error signal to steer the selection of the $\{\sigma_\alpha\}$. For a manually controlled MCMC, a useful approach is to start with a large value of $\sigma_\alpha$, approximately one tenth of the prior uncertainty of that parameter. In a PT MCMC, this will normally be sufficient to provide access to all areas with significant probability within the prior range, but may result in a very small acceptance rate for the $\beta = 1$ member of the PT MCMC chain. By running a number of smaller iteration chains, each time perturbing one or more of the $\{\sigma_\alpha\}$, it soon becomes clear which parameters are restraining the acceptance rate from a more desirable level. Larger $\{\sigma_\alpha\}$ values yield larger jumps in parameter proposal values. The general approach of refining the $\{\sigma_\alpha\}$ towards smaller values is analogous to an annealing operation. The refinement is terminated when the target proposal acceptance rate is reached.
In the automated version of this process described below, the error signal used for the control system is the difference between the current acceptance rate and a target acceptance rate. The control system steers the proposal $\sigma$'s to desirable values during the burn-in stage of a single parallel tempering MCMC run. Although inclusion of the control system may result in a somewhat longer burn-in period, there is a huge saving in time because it eliminates many trial runs to manually establish a suitable set of $\{\sigma_\alpha\}$. In addition, the control system error monitor provides another indication of the length of the burn-in period. In practice, it is important to repeat the operation for a few different choices of initial parameter values, to ensure that the MCMC results converge.

The automatic parallel tempering MCMC (APT MCMC) algorithm contains major and minor cycles. During the major cycles the current set of $\{\sigma_\alpha\}$ are used for $n_1$ iterations. The acceptance rate achieved during this major cycle is compared to the target acceptance rate. If the difference (the control system error signal) is greater than a chosen threshold, $\mathrm{tol}_1$, then a set of minor cycles, one cycle of $n_2$ iterations for each $\sigma_\alpha$, are employed to explore the sensitivity of the acceptance rate to each $\sigma_\alpha$. The $\{\sigma_\alpha\}$ are updated and another major cycle is run. If $\mathrm{tol}_1$ is set to 0, then the minor cycles are always performed after each major cycle. At this point, the reader might find it useful to examine the evolution of the error signal, and the $\{\sigma_\alpha\}$, for the examples shown in Figures 12.12 and 12.13. One can clearly see the expected Poisson fluctuations in the error signal after the $\{\sigma_\alpha\}$ stabilize. For these examples we set $\mathrm{tol}_1 = 1.5\sqrt{\lambda n_1}$ to reduce the number of minor cycles. Normally the control system is turned off after the error signal is less than some threshold, $\mathrm{tol}_2$. Typically $\mathrm{tol}_2 = \sqrt{\lambda n_1}$.

Full details of the control system are not included here as it is considered experimental and in a process of evolution. The latest version is included in the Mathematica tutorial in the section entitled "Automatic parallel tempering MCMC," along with useful default values for the algorithm parameters and input data format. Figure 12.9 provides a summary of the inputs and outputs for the APT MCMC algorithm. In the following section we demonstrate the behavior of the algorithm with a set of astronomical data used to detect an extrasolar planet.
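The threshold test that decides whether minor cycles are run can be sketched as follows. Since the full control system is not given in the text, this is only an illustration under stated assumptions: the error signal is taken to be the number of accepted proposals in a major cycle minus the target count $\lambda n_1$ (which matches the Poisson-scale thresholds quoted above), and the function names are hypothetical.

```python
import math

def control_error(n_accepted, n1, target_rate):
    """Assumed error signal: accepted proposals in a major cycle minus target."""
    return n_accepted - target_rate * n1

def needs_minor_cycles(error, n1, target_rate, factor=1.5):
    """True if |error| exceeds tol1 = factor * sqrt(target_rate * n1)."""
    tol1 = factor * math.sqrt(target_rate * n1)
    return abs(error) > tol1

# Example with n1 = 1000 and target acceptance rate 0.25, the control
# system values quoted later in Section 12.9.2.
n1, target = 1000, 0.25
print(needs_minor_cycles(control_error(0, n1, target), n1, target))    # far from target
print(needs_minor_cycles(control_error(260, n1, target), n1, target))  # within tolerance
```

With these values $\mathrm{tol}_1 \approx 23.7$ accepted proposals, so a major cycle with no accepted proposals gives an error of $-250$ and triggers minor cycles, while one with 260 accepted proposals does not.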
Figure 12.9 An overview schematic of the inputs and outputs for the automated parallel tempering MCMC. Inputs: the target posterior $p(\{X_\alpha\}|D,M,I)$; the data; $n$ = number of iterations; $\{X_\alpha\}_{\rm init}$ = start parameters; $\{\sigma_\alpha\}_{\rm init}$ = start proposal $\sigma$'s; the tempering levels; and the control system parameters $n_1$ = major cycle iterations, $n_2$ = minor cycle iterations, $\lambda$ = acceptance ratio, $\gamma$ = damping constant. Outputs: control system diagnostics; $\{X_\alpha\}$ iterations; summary statistics; best-fit model and residuals; $\{X_\alpha\}$ marginals; $\{X_\alpha\}$ 68.3% credible regions; and $p(D|M,I)$, the global likelihood for model comparison.

12.9 Extrasolar planet example

In this section, we will apply the automated parallel tempering MCMC described in Section 12.8 to some real astronomical data, which were used to discover (Tinney et al., 2003) an extrasolar planet orbiting a star with a catalog number HD 2039. Although light from the planet is too faint to be detected, the gravitational tug of the planet on the star is sufficient to produce a measurable Doppler shift in the velocity of absorption lines in the star's spectrum. By fitting a Keplerian orbit to the measured radial velocity data, $v_i$, it is possible to obtain information about the orbit and a lower limit on the mass of the unseen planet. The predicted model radial velocity, $f_i$, for a particular orbit is given below, and involves six unknowns. The geometry of a stellar orbit with respect to the observer is shown in Figure 12.10. The points labeled F, P, and S are the location of the focus of the elliptical orbit, periastron, and the star's position at time $t_i$, respectively.

$$f_i = V + K[\cos\{\theta(t_i + t_0) + \omega\} + e\cos\omega], \tag{12.26}$$

where

$V$ = the systematic velocity of the system.
$K$ = velocity amplitude $= 2\pi P^{-1}(1 - e^2)^{-1/2}\, a \sin i$.
$P$ = the orbital period.
$a$ = the semi-major axis of the orbit.
$e$ = the eccentricity of the elliptical orbit.
$i$ = the inclination of the orbit as defined in Figure 12.10.
$\omega$ = the longitude of periastron, angle LFA in Figure 12.10.
$\chi$ = the fraction of an orbit prior to the start of data-taking that periastron occurred at. Thus, $t_0 = \chi P$ = the number of days prior to $t_i = 0$ that the star was at periastron, for an orbital period of $P$ days. At $t_i = 0$, the star is at an angle AFB from periastron.
$\theta(t_i + t_0)$ = the angle (AFS) of the star in its orbit relative to periastron at time $t_i$.

The dependence of $\theta$ on $t_i$, which follows from the conservation of angular momentum, is given by the solution of

$$\frac{d\theta}{dt} - \frac{2\pi[1 + e\cos\theta(t_i + t_0)]^2}{P(1 - e^2)^{3/2}} = 0. \tag{12.27}$$
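Equations (12.26) and (12.27) can be evaluated numerically as sketched below. The sketch assumes $\theta = 0$ at periastron, i.e., at $t = -\chi P$, and integrates Equation (12.27) forward with a fixed-step fourth-order Runge–Kutta scheme. The step density `steps_per_day` and the function names are illustrative choices, not taken from the text.

```python
import math

def true_anomaly(t, period, ecc, chi, steps_per_day=20):
    """Integrate Eq. (12.27) with RK4, starting from theta = 0 at periastron.

    Periastron is assumed to occur at t = -chi * period, so the integration
    runs over an elapsed time of t + chi * period days.
    """
    def rate(theta):
        return (2.0 * math.pi * (1.0 + ecc * math.cos(theta)) ** 2
                / (period * (1.0 - ecc ** 2) ** 1.5))

    elapsed = t + chi * period
    n = max(1, int(math.ceil(abs(elapsed) * steps_per_day)))
    h = elapsed / n
    theta = 0.0
    for _ in range(n):
        k1 = rate(theta)
        k2 = rate(theta + 0.5 * h * k1)
        k3 = rate(theta + 0.5 * h * k2)
        k4 = rate(theta + h * k3)
        theta += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return theta

def radial_velocity(t, V, K, period, ecc, omega, chi):
    """Model radial velocity of Eq. (12.26); omega is in radians."""
    theta = true_anomaly(t, period, ecc, chi)
    return V + K * (math.cos(theta + omega) + ecc * math.cos(omega))
```

For a circular orbit ($e = 0$) the angle grows linearly in time, and for any eccentricity $\theta$ advances by $2\pi$ over one period, which gives two simple checks on the integration.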
Figure 12.10 The geometry of a stellar orbit with respect to the observer. The sky plane is perpendicular to the dashed line connecting the star and the observer. (The figure labels the orbital plane, the sky plane, the inclination, the direction to the observer, and the points A, B, C, F, L, P, and S.)
To fit Equation (12.26) to the data, we need to specify the six model parameters, $P, K, V, e, \omega, \chi$. The measured radial velocities and their errors are shown in Figure 12.11. As we have discussed before, it is a good idea not to assume that the quoted measurement errors are the only error component in the data.
Figure 12.11 HD 2039 radial velocity measurements plotted from the data given in Tinney et al. (2003). Axes: velocity (m s$^{-1}$) versus Julian day number ($-2{,}451{,}118.0578$).
We can represent the measured velocities by the equation

$$v_i = f_i + e_i, \tag{12.28}$$

where $e_i$ is the component of $v_i$ which arises from measurement errors plus any real signal in the data that cannot be explained by the model prediction $f_i$. For example, suppose that the star actually has two planets, and the model assumes only one is present. In regard to the single planet model, the velocity variations induced by the second planet act like an additional unknown noise term. In the absence of detailed knowledge of the effective noise distribution, other than that it has a finite variance, the maximum entropy principle tells us that a Gaussian distribution would be the most conservative choice (i.e., maximally non-committal about the information we don't have). We will assume the noise variance is finite and adopt a Gaussian distribution for $e_i$ with a variance $\sigma_i^2$.

In a Bayesian analysis where the variance of $e_i$ is unknown, but assumed to be the same for all data points, we can treat $\sigma$ as an unknown nuisance parameter. Marginalizing over $\sigma$ has the desirable effect of treating anything in the data that can't be explained by the model as noise, and this leads to the most conservative estimates of model parameters. In the current problem, the quoted measurement errors are not all the same. We let $s_i$ = the experimenter's estimate of $\sigma_i$, prior to fitting the model and examining the model residuals. The $\sigma_i$ values are not known, but the $s_i$ values are our best initial estimates. They also contain information on the relative weight we want to associate with each point. Since we do not know the absolute values of the $\sigma_i$, we introduce a parameter called the noise scale parameter, $b$, to allow for this.$^3$ It could also be called a noise weight parameter. Several different definitions of $b$ are possible including $\sigma_i^2 = b s_i^2$ and $\sigma_i = b s_i$. The definition we use here is given by

$$\frac{1}{\sigma_i^2} = \frac{b}{s_i^2}. \tag{12.29}$$

Again, marginalizing over $b$ has the desirable effect of treating anything in the data that can't be explained by the model as noise, leading to the most conservative estimates of orbital parameters. Since $b$ is a scale parameter, we assume a Jeffreys prior (see Section 3.10).
$^3$ Note added in proof: A better choice for parameterizing any additional unknown noise term is to rewrite Equation (12.28) as $v_i = f_i + e_i + e_0$, where $e_i$ is the noise component arising from known but unequal measurement errors, and $e_0$ is the additional unknown noise term. From the arguments given above, we can characterize the combination $e_i + e_0$ by a Gaussian distribution with variance $= \sigma_i^2 + \sigma_0^2$. With this form of parameterization we would marginalize over $\sigma_0$ instead of $b$. The one advantage of using $b$ is that it can allow for the possibility that the measurement errors have been overestimated.
$$p(b|I) = \frac{1}{b \ln(b_{\max}/b_{\min})}, \tag{12.30}$$

with $b_{\max} = 2$ and $b_{\min} = 0.1$. We also compute $p(b|D, \mathrm{Model}, I)$. If the most probable estimate of $b \approx 1$, then the one-planet model is doing a good job accounting for everything that is not noise based on the $s_i$ estimates. If $b < 1$, then either the model is not accounting for significant real features in the data or the initial noise estimates, $s_i$, were low.
12.9.1 Model probabilities

In this section, we set up the equations needed to (a) specify the joint posterior probability of the model parameters (parameter estimation problem) for use in the MCMC analysis, and (b) decide if a planet has been detected (model selection problem). To decide if a planet has been detected, we will compare the probability of $M_1 \equiv$ "the star's radial velocity variations are caused by one planet" to the probability of $M_0 \equiv$ "the radial velocity variations are consistent with noise." From Bayes' theorem, we can write

$$p(M_1|D,I) = \frac{p(M_1|I)\, p(D|M_1,I)}{p(D|I)} = C\, p(M_1|I)\, p(D|M_1,I), \tag{12.31}$$

where

$$p(D|M_1,I) = \int dP \int dK \int dV \int de \int d\chi \int d\omega \int db\; p(P,K,V,e,\chi,\omega,b|M_1,I)\; p(D|M_1,P,K,V,e,\chi,\omega,b,I). \tag{12.32}$$

The joint prior for the model parameters, assuming independence, is given by

$$p(P,K,V,e,\chi,\omega,b|M_1,I) = \frac{1}{P \ln(P_{\max}/P_{\min})}\, \frac{1}{K \ln(K_{\max}/K_{\min})}\, \frac{1}{(V_{\max} - V_{\min})}\, \frac{1}{(e_{\max} - e_{\min})}\, \frac{1}{2\pi}\, \frac{1}{b \ln(b_{\max}/b_{\min})}. \tag{12.33}$$

Note: we have assumed a uniform prior for $\chi$ in the range 0 to 1, so $p(\chi|M_1,I) = 1$. The likelihood is

$$p(D|M_1,P,K,V,e,\chi,\omega,b,I) = A\, b^{N/2} \exp\left[-\frac{b}{2}\sum_{i=1}^{N} \frac{(v_i - f_i)^2}{s_i^2}\right], \tag{12.34}$$

where

$$A = (2\pi)^{-N/2}\left[\prod_{i=1}^{N} s_i^{-1}\right]. \tag{12.35}$$
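For use inside an MCMC sampler, the prior of Equation (12.33) and the likelihood of Equation (12.34) are usually coded in logarithmic form to avoid underflow. The following is a minimal sketch with hypothetical function names. It assumes the caller has already checked that every parameter lies inside its prior range, and that `f` holds the model velocities $f_i$ computed elsewhere.

```python
import math

def log_likelihood(v, f, s, b):
    """ln of Eq. (12.34): Gaussian errors with 1/sigma_i**2 = b / s_i**2."""
    n = len(v)
    log_a = -0.5 * n * math.log(2.0 * math.pi) - sum(math.log(si) for si in s)
    chi2 = sum(((vi - fi) / si) ** 2 for vi, fi, si in zip(v, f, s))
    return log_a + 0.5 * n * math.log(b) - 0.5 * b * chi2

def log_prior(P, K, b, P_range, K_range, V_range, e_range, b_range):
    """ln of Eq. (12.33); each *_range is a (min, max) tuple.

    Jeffreys priors for P, K and b; uniform priors for V, e, chi (range 0 to 1)
    and omega (range 2*pi). Parameters are assumed to be inside their ranges.
    """
    def log_jeffreys(x, lo, hi):
        return -math.log(x * math.log(hi / lo))

    def log_uniform(lo, hi):
        return -math.log(hi - lo)

    return (log_jeffreys(P, *P_range) + log_jeffreys(K, *K_range)
            + log_uniform(*V_range) + log_uniform(*e_range)
            - math.log(2.0 * math.pi) + log_jeffreys(b, *b_range))
```

The unnormalized log posterior for the sampler is then `log_prior(...) + log_likelihood(...)`, and for a tempered chain the likelihood term is multiplied by $\beta$.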
For the purposes of estimating the model parameters, we will assume a prior uncertainty in $b$ in the range $b_{\min} = 0.1$ to $b_{\max} = 2$. When it comes to comparing the probability of $M_1$ to $M_0$, or to a model which assumes there are two planets present, we will set $b = 1$ and perform the model comparison based on the errors quoted in Tinney et al. (2003). The probability of $M_0$ is given by

$$p(M_0|D,I) = C\, p(M_0|I)\, p(D|M_0,I), \tag{12.36}$$

where

$$p(D|M_0,I) = \int db \int dV\; p(D,V,b|M_0,I) = \int db \int dV\; p(V,b|M_0,I)\, p(D|M_0,V,b,I), \tag{12.37}$$

$$p(V,b|M_0,I) = \frac{1}{(V_{\max} - V_{\min})}\, \frac{1}{b \ln(b_{\max}/b_{\min})}, \tag{12.38}$$

and

$$p(D|M_0,V,b,I) = (2\pi)^{-N/2}\left[\prod_{i=1}^{N} s_i^{-1}\right] b^{N/2} \exp\left[-\frac{b}{2}\sum_{i=1}^{N} \frac{(v_i - V)^2}{s_i^2}\right]. \tag{12.39}$$
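The integral over $V$ in Equation (12.37) can be performed analytically, yielding

$$p(D|M_0,I) = \frac{A}{(V_{\max} - V_{\min}) \ln(b_{\max}/b_{\min})} \sqrt{\frac{\pi}{2}}\, W^{-1/2} \int db\; b^{\frac{N-3}{2}} \exp\left[-\frac{bW}{2}\left(\overline{v^2}_w - (\overline{v}_w)^2\right)\right] [\mathrm{erf}(u_{\max}) - \mathrm{erf}(u_{\min})], \tag{12.40}$$

where $A$ is given by Equation (12.35), the leading factor carries the prior normalization of Equation (12.38), and $\overline{v}_w$ and $\overline{v^2}_w$ are weighted means:

$$\overline{v}_w = \frac{1}{W}\sum_{i=1}^{N} w_i v_i, \tag{12.41}$$

$$\overline{v^2}_w = \frac{1}{W}\sum_{i=1}^{N} w_i v_i^2, \tag{12.42}$$

$$w_i = 1/s_i^2, \tag{12.43}$$

$$W = \sum_{i=1}^{N} w_i, \tag{12.44}$$

$$u_{\max} = \left(\frac{bW}{2}\right)^{1/2} (V_{\max} - \overline{v}_w), \tag{12.45}$$

$$u_{\min} = \left(\frac{bW}{2}\right)^{1/2} (V_{\min} - \overline{v}_w). \tag{12.46}$$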
In conclusion, Equations (12.31) and (12.34) are required for the parameter estimation part of the problem, and Equations (12.32) and (12.40) answer the model selection part of the problem. Equation (12.32) is evaluated from the results of the parallel tempering chains according to the method discussed in Section 12.7.

12.9.2 Results

The APT MCMC algorithm described in Section 12.8 was used to re-analyze the measurements of Tinney et al. (2003). Figures 12.12 and 12.13 show the diagnostic information output by the MCMC control system for two runs of the APT MCMC algorithm that use different starting values for the parameters and different starting values for the proposal $\sigma$'s. The top left panel shows the evolution of the control system error for 100 000 iterations. Even for the best set of $\{\sigma_\alpha\}$, the control system error will exhibit statistical fluctuations of order $\sqrt{\lambda n_1}$, which will result in fluctuations of $\{\sigma_\alpha\}$ throughout the run. Recall, $\lambda$ = the target acceptance fraction and $n_1$ = the number of iterations in major cycles (see Section 12.8). These fluctuations are of no consequence since the equilibrium distribution of parameter values is insensitive to small fluctuations in $\{\sigma_\alpha\}$. To reduce the time spent in perturbing $\{\sigma_\alpha\}$ values, we set a threshold on the control system error of $1.5\sqrt{\lambda n_1}$. When the error is less than this value no minor cycles are executed. Normally, the control system is disabled the first time the error is $< 1.5\sqrt{\lambda n_1}$. This was not done in the two examples shown in order to illustrate the behavior of the control system and evolution of the $\{\sigma_\alpha\}$. For the two runs, the error drops to a level consistent with the minimum threshold set for initiating a change in $\{\sigma_\alpha\}$ in 8000 and 9600 iterations, respectively. The other six panels exhibit the evolution of the $\{\sigma_\alpha\}$ to relatively stable values.

Figure 12.12 The upper left panel shows the evolution of the APT MCMC control system error versus iteration number for the first run. The other six panels exhibit the evolution of the Gaussian parameter proposal distribution $\sigma$'s ($\sigma_P$, $\sigma_K$, $\sigma_V$, $\sigma_e$, $\sigma_\chi$, $\sigma_\omega$) versus iteration number.

Figure 12.13 The upper left panel shows the evolution of the APT MCMC control system error versus iteration number for the second run. The other six panels exhibit the evolution of the Gaussian parameter proposal distribution $\sigma$'s ($\sigma_P$, $\sigma_K$, $\sigma_V$, $\sigma_e$, $\sigma_\chi$, $\sigma_\omega$) versus iteration number.

Table 12.1 compares the starting and final values for two APT MCMC runs with a set of $\{\sigma_\alpha\}$ values arrived at manually. The starting parameter values for the two APT MCMC runs are shown in Table 12.2. Control system parameters were: $sc_{\max} = 0.1$, $n_1 = 1000$, $n_2 = 100$, $\lambda = 0.25$, and a damping factor, $\gamma = 1.6$. $sc_{\max}$ specifies the maximum scaling of $\{\sigma_\alpha\}$ to be used in a minor cycle. Tempering values used were $\beta = \{0.01, 0.2575, 0.505, 0.7525, 1\}$. The $\beta$ values are chosen to give $\approx 50\%$ swap acceptance between adjacent levels. Figure 12.14 shows the iterations of the six model parameters, $P, K, V, e, \chi, \omega$, for the 100 000 iterations of APT MCMC 1. Only every 100th value is plotted. The plot for $K$ shows clear evidence that parallel tempering is doing its job, enabling regions of significant probability to be explored apart from the biggest peak region. A conservative burn-in period of 8000 samples was arrived at from an examination of the control system error, shown in the upper left panel of Figure 12.12, and the parameter iterations of Figure 12.14.

Table 12.1 Comparison of the starting and final values of proposal distribution $\sigma$'s for two automatic parallel tempering MCMC runs, to manually derived values.

| Proposal $\sigma$ | APT MCMC 1 Start | APT MCMC 1 Final | APT MCMC 2 Start | APT MCMC 2 Final | Manual Final |
|---|---|---|---|---|---|
| $P$ (days) | 70 | 6.2 | 50 | 7.8 | 10 |
| $K$ (m s$^{-1}$) | 70 | 5.7 | 60 | 6.0 | 5 |
| $V$ (m s$^{-1}$) | 20 | 1.5 | 10 | 1.2 | 2 |
| $e$ | 0.15 | 0.012 | 0.1 | 0.012 | 0.005 |
| $\chi$ | 0.15 | 0.013 | 0.1 | 0.009 | 0.007 |
| $\omega$ | 0.3 | 0.023 | 0.2 | 0.019 | 0.05 |

Table 12.2 Starting parameter values for the two automatic parallel tempering MCMC runs.

| Trial | $P$ | $K$ | $V$ | $e$ | $\chi$ | $\omega$ | $b$ |
|---|---|---|---|---|---|---|---|
| 1 | 950 | 80 | 2 | 0.4 | 0.0 | 0.0 | 1.0 |
| 2 | 1300 | 250 | 5 | 0.2 | 0.0 | 0.0 | 1.0 |

The level of agreement between two different MCMC runs can be judged from a comparison of the marginal distributions of the parameters. Figures 12.15 and 12.16 show the posterior marginals for the six model parameters, $P, K, V, e, \chi, \omega$, and the noise scale parameter $b$ for APT MCMC 1 and APT MCMC 2, respectively. The final model parameter values are given in Table 12.3, along with values of $a \sin i$, $M \sin i$, and the Julian date of periastron passage, that were derived from the parameter values.

$$a \sin i\ (10^6\ \mathrm{km}) = 1.38 \times 10^{-5}\, K P \sqrt{1 - e^2}, \tag{12.47}$$

where $K$ is in units of m s$^{-1}$ and $P$ is in days.

$$M \sin i = 4.91 \times 10^{-3}\, (M_*)^{2/3}\, K P^{1/3} \sqrt{1 - e^2}, \tag{12.48}$$

where $M$ is the mass of the planet measured in Jupiter masses, and $M_*$ is the mass of the star in units of solar masses.

One important issue concerns what summary statistic to use to represent the best estimate of the parameter values. We explore the question of a suitable robust summary statistic further in Section 12.10. In Table 12.3, the final quoted parameter values correspond to the MAP values. The median values are shown in brackets below. The error bars correspond to the boundaries of the 68.3% credible region of the marginal distribution. The MAP parameter values for APT MCMC 1 were used to construct the model plotted in panel (a) of Figure 12.17. The residuals are shown in panel (b). Figure 12.18 shows the posterior probability distributions for $a \sin i$, $M \sin i$, and the Julian date of periastron passage, that are derived from the MCMC samples of the orbital parameters.

The Bayes factors, $p(D|M_1,I)/p(D|M_0,I)$, determined from the two APT MCMC runs were $1.4 \times 10^{14}$ and $1.6 \times 10^{14}$. Clearly, both trials overwhelmingly favor $M_1$ over $M_0$. The upper panel of Figure 12.19 shows a comparison of the marginal and projected probability density functions for the velocity amplitude, $K$, derived from the APT MCMC parameter samples. To understand the difference, it is useful to examine the strong correlation that is evident between $K$ and orbital eccentricity in the lower panel.
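Equations (12.47) and (12.48) are simple to evaluate for any set of orbital parameters, or for every MCMC sample to build up the derived distributions. In the sketch below the coefficient $1.38 \times 10^{-5}$ is taken to give $a \sin i$ in units of $10^6$ km, the units used in Table 12.3. The stellar mass is not quoted in this section, so `m_star = 1.0` solar masses is purely an illustrative assumption, and the function names are hypothetical.

```python
import math

def a_sin_i_million_km(K, P, ecc):
    """Eq. (12.47): a sin i in units of 10^6 km, for K in m/s and P in days."""
    return 1.38e-5 * K * P * math.sqrt(1.0 - ecc ** 2)

def m_sin_i_jupiter(K, P, ecc, m_star):
    """Eq. (12.48): M sin i in Jupiter masses; m_star is in solar masses."""
    return (4.91e-3 * m_star ** (2.0 / 3.0) * K * P ** (1.0 / 3.0)
            * math.sqrt(1.0 - ecc ** 2))

# MAP values of K, P and e for APT MCMC 1 from Table 12.3.
K_map, P_map, e_map = 106.0, 1188.0, 0.63
print(a_sin_i_million_km(K_map, P_map, e_map))
print(m_sin_i_jupiter(K_map, P_map, e_map, m_star=1.0))  # assumed stellar mass
```

With these MAP values the first function gives about 1.35, in agreement with the $a \sin i$ entry for APT MCMC 1 in Table 12.3. The $M \sin i$ value depends on the assumed stellar mass.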
Figure 12.14 The figure shows every 100th APT MCMC iteration of the six model parameters, $P, K, V, e, \chi, \omega$. Panels: period (days), $K$ (m s$^{-1}$), eccentricity, $V$ (m s$^{-1}$), $\chi$, and $\omega$, each plotted against iteration.
Not only is the density of samples much higher at low $K$ values, but the characteristic width of the $K$ sample distribution is also much broader, giving rise to an enhancement in the marginal beyond that seen in the projected.

Finally, even though the 68.3% credible region contains $b = 1$, we decided to analyze the best-fit residuals, shown in the lower panel of Figure 12.17, to see what probability theory had to say about the evidence for another planet.$^4$ The APT MCMC program was re-run on the residuals to look for evidence of a second planet in the period range 2 to 500 days, $K = 1$ to 40 m s$^{-1}$, $V = -10$ to 10 m s$^{-1}$, $e = 0$ to 0.95, $\chi = 0$ to 1, and $\omega = -\pi$ to $\pi$. The most probable orbital solution had a period of $11.90 \pm 0.02$ days, $K = 18^{+9}_{-15}$ m s$^{-1}$, $V = 2.7^{+2.4}_{-1.6}$ m s$^{-1}$, eccentricity $= 0.626^{+0.16}_{-0.18}$, $\omega = 156^{+2}_{-4}$ deg, periastron passage $= 1121 \pm 1$ days (JD $-$ 2,450,000), and an $M \sin i = 0.14^{+0.07}_{-0.04}$. Figure 12.20 shows this orbital solution overlaid on the residuals for two cycles of phase. Note: the second cycle is just a repeat of the first. The computed Bayes factor $p(D|M_2,I)/p(D|M_1,I) = 0.7$. Assuming a priori that $p(M_2|I) = p(M_1|I)$, this result indicates that it is more probable that the orbital solution for the residuals arises from fitting a noise feature than from the existence of a second planet. Thus, there is insufficient evidence at this time to claim the presence of a second planet.

$^4$ Note: a better approach would be to fit a two-planet model to the original radial velocity data.

Figure 12.15 The marginal probabilities for the six model parameters, $P, K, V, e, \chi, \omega$, and the noise scale parameter $b$ for the run APT MCMC 1.
12.10 MCMC robust summary statistic In the previous section, the best estimate of each model parameter is based on the maximum a posteriori (MAP) value. It has been argued, e.g., Fox and Nicholls (2001),
343
0.015 0.0125 0.01 0.0075 0.005 0.0025
Probability density
Probability density
12.10 MCMC robust summary statistic
1200 1250 Period (days)
1300
100
0.06 0.05 0.04 0.03 0.02 0.01 –10
0
10
20
30
0.25
500
3 2 1 0.5
Probability density
Probability density Probability density
χ
400
4
40
15 12.5 10 7.5 5 2.5 0.2
300
5
V (ms–1)
0.15
200
K (ms–1) Probability density
Probability density
1150
0.014 0.012 0.01 0.008 0.006 0.004 0.002
0.3
0.6 0.7 0.8 Eccentricity
0.9
4 3 2 1 –0.7
–0.6
–0.5
–0.4
–0.3
Longitude of periastron ω
–0.2
2 1.5 1 0.5 0.4
0.6
0.8
1
1.2
1.4
1.6
Noise scale parameter b
Figure 12.16 The marginal probabilities for the six model parameters, P; K; V; e; ; !, and the noise scale parameter b for the run APT MCMC 2.
that MAP values are sometimes unrepresentative of the bulk of the posterior probability. Fox and Nicholls were considering the reconstruction of degraded binary images. The current problem is very different but the issue remains the same: what choice of summary statistic to use? Two desirable properties are: a) that it be representative of the marginal probability distribution, and b) the set of summary parameter values provides a good fit to the data. Here, we consider three other possible choices of summary statistic. They are the mean, the median, and the marginal posterior mode (MPM), all of which satisfy point (a). In repeated APT MCMC runs, it was found that
344
Markov chain Monte Carlo
Table 12.3 Comparison of the results from two parallel tempering MCMC Bayesian runs with the analysis of Tinney et al. (2003). The values quoted for the two APT MCMC runs are MAP (maximum a posterior) values. The error bars correspond to the boundaries of the 68.3% credible region of the marginal distribution. The median values are given in brackets on the line below. Note: the periastron time and error quoted by Tinney et al. is identical with their P value and is assumed to be a typographical error. Parameter
Tinney et al. (2003)
APT MCMC 1
APT MCMC 2
1183 150
1188þ28 35 (1188)
1177þ36 21 (1188)
Velocity amplitude K ðm s1 Þ
130 20
106þ46 29 (115)
116þ56 39 (125)
Eccentricity e
0:67 0:1
0:63þ0:12 0:06 (0.67)
0:65þ0:15 0:06 (0.68)
Longitude of periastron ! (deg)
333 15
333þ6 5 (334)
332þ8 3 (334)
a sin i (units of 106 km)
1:56 0:3
1:35þ0:4 0:3 (1.42)
1:4þ0:4 0:4 (1.66)
Periastron time ðJD 2;450;000Þ
1183 150
864þ18 58 (845)
856þ52 28 (844)
0:7þ8 6 (0.8)
1:4þ7 7 (2.1)
4:9 1:0
4:2þ1:2 1:0 (4.5)
4:5þ1:3 1:3 (4.7)
15
13.8 (14.1)
14.0 (14.0)
Orbital period P (days)
Systematic velocity V ðm s1 Þ M sin i ðMJ Þ
RMS about fit
the MPM solution provided a relatively poor fit to the data, while the mean was somewhat better, and in all cases, the median provided a good fit, almost as good as the MAP fits. One example of the fits is shown in Figure 12.21. The residuals were as follows: (a) 14.0 m s$^{-1}$ (MAP), (b) 16.1 (mean), (c) 18.7 (MPM), and (d) 14.0 (median).

In the previous example the Bayes factor favored the one-planet model, $M_1$, compared to the no-planet model, $M_0$, by a factor of approximately $10^{14}$. It is also interesting to compare the four different summary statistics in the case where the Bayes factor is close to 1, as we found for the toy spectral line problem in Section 12.7, i.e., neither model is preferred. Figure 12.22 shows a comparison of the fits obtained using (a) the MAP, (b) the mean, (c) the MPM, and (d) the median. Both the MAP and median summary statistic placed the model line at the actual location of the simulated spectral line (channel 37). The MAP achieved a slightly lower RMS residual (RMS = 0.87) compared to the median (RMS = 0.89). The mean statistic performed rather poorly and the MPM not much better. The conclusion, based on the current studies, is that the median statistic provides a robust alternative to the common MAP statistic for summarizing the posterior distribution. Unfortunately, the median was not one of the statistics considered in Fox and Nicholls (2001).

Figure 12.17 Panel (a) shows the raw data with error bars plotted together with the model radial velocity curve using the MAP (maximum a posteriori) summary statistic. Panel (b) shows the radial velocity residuals. [Axes: velocity (m s$^{-1}$) in panel (a) and velocity residual (m s$^{-1}$) in panel (b), each versus Julian day number (−2,451,118.0578).]
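The four summary statistics compared above can be computed directly from stored MCMC samples. The following Python sketch (not the Mathematica code supplied with the book) shows one way to do it. The function name `summarize_chain`, its arguments, and the use of a fixed-width histogram to locate the marginal posterior mode are assumptions made for illustration only.

```python
import math
from collections import Counter

def summarize_chain(samples, log_post, bin_widths):
    """Return MAP, mean, MPM and median estimates from MCMC output.

    samples    : list of parameter tuples, one per retained iteration
    log_post   : log posterior density of each sample (same length)
    bin_widths : histogram bin width per parameter (hypothetical choice),
                 used only to locate the marginal posterior mode (MPM)
    """
    n = len(samples)
    n_par = len(samples[0])
    columns = [sorted(s[i] for s in samples) for i in range(n_par)]

    # MAP: the single sample with the largest joint posterior density.
    map_est = tuple(samples[max(range(n), key=lambda k: log_post[k])])

    # Mean of each marginal distribution.
    mean_est = tuple(sum(col) / n for col in columns)

    # Median of each marginal (columns are already sorted).
    median_est = tuple(
        col[n // 2] if n % 2 else 0.5 * (col[n // 2 - 1] + col[n // 2])
        for col in columns
    )

    # MPM: centre of the most populated histogram bin of each marginal.
    mpm_est = []
    for col, width in zip(columns, bin_widths):
        counts = Counter(math.floor(x / width) for x in col)
        best_bin = max(counts, key=counts.get)
        mpm_est.append((best_bin + 0.5) * width)

    return {"MAP": map_est, "mean": mean_est,
            "MPM": tuple(mpm_est), "median": median_est}
```

Note that the MAP is a single point of the joint distribution, whereas the mean, MPM and median are assembled parameter by parameter from the marginals, which is one reason they can give fits of different quality when parameters are strongly correlated.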
Figure 12.18 The figure shows the distribution of three useful astronomical quantities, a sin i, M sin i and epoch of periastron passage, that are derived from the MCMC samples of the orbital parameters. [Panels: PDF versus a sin i ($10^6$ km); PDF versus M sin i ($M_J$); PDF versus periastron passage (JD − 2,450,000).]
Figure 12.19 The upper panel shows a comparison of the marginal and projected probability density functions for the velocity amplitude, K. The lower panel illustrates the strong correlation between K and orbital eccentricity. [Axes: probability density versus K (m s$^{-1}$) in the upper panel; eccentricity e versus K (m s$^{-1}$) in the lower panel.]

Figure 12.20 The figure shows the most probable orbital solution to the data residuals (for two cycles of phase), after removing the best fitting model of the first planet. [Axes: velocity (m s$^{-1}$) versus phase.]

Figure 12.21 The four panels illustrate typical fits obtained in the extrasolar planet problem using different choices of summary statistic to represent the MCMC parameter distributions. They correspond to: (a) the MAP (maximum a posteriori), (b) the mean, (c) the MPM (marginal posterior mode), and (d) the median. [Axes: velocity (m s$^{-1}$) versus Julian day number (−2,451,118.0578) in each panel.]

Figure 12.22 The four panels illustrate the fits obtained in the toy spectral line problem using different choices of summary statistic to represent the MCMC parameter distributions. They correspond to: (a) the MAP (maximum a posteriori), (b) the mean, (c) the MPM (marginal posterior mode), and (d) the median. [Each panel is titled Spectral Line Data; axes: signal strength (mK) versus channel.]

12.11 Summary

This chapter provides a brief introduction to the powerful role MCMC methods can play in a full Bayesian analysis of a complex inference problem involving models with large numbers of parameters. We have only demonstrated their use for models with a small to moderate number of parameters, where they can easily be compared with results from other methods. These comparisons will provide a useful introduction and calibration of these methods for readers wishing to handle more complex problems. For the examples considered, the median statistic proved to be a robust alternative to the common MAP statistic for summarizing the MCMC posterior distribution.

The most ambitious topic treated in this chapter dealt with an experimental new algorithm for automatically annealing the values for the parameter proposal distributions in a parallel tempering Markov chain Monte Carlo (APT MCMC) calculation. This was applied to the analysis of a set of astronomical data used in the detection of an extrasolar planet. Existing analyses are based on the use of nonlinear least-squares methods which typically require a good initial guess of the parameter values (see Section 11.5). Frequently, the first indication of a periodic signal comes from a periodogram analysis of the data. As we show in the next chapter, a Bayesian analysis based on prior information of the shape of the periodic signal can frequently do a better job of detection than the ordinary Fourier power spectrum, otherwise known as the Schuster periodogram. In the extrasolar planet Kepler problem, the mathematical form of the signal is well known and is built into the Bayesian analysis. The APT MCMC algorithm implemented in Section 12.9 is thus effective for both detecting and characterizing the orbits of extrasolar planets. Another advantage is that a good initial guess of the orbital parameter values is not required, which allows for the earliest possible detection of a new planet. Moreover, the built-in Occam’s razor in the Bayesian analysis can save a great deal of time in deciding whether a detection is believable.

Finally, it is important to remember that the MCMC techniques described in this chapter are basically tools to allow us to evaluate the integrals needed for a full Bayesian analysis of some problem of interest. The APT MCMC algorithm discussed in the context of the extrasolar planet problem can readily be modified to tackle other very different problems.
12.12 Problems

1. In Section 12.6, we used both the Metropolis–Hastings and parallel tempering (PT) versions of MCMC to re-analyze the toy spectral line problem of Section 3.6. A program to perform the PT calculations is given in the Markov chain Monte Carlo section of the Mathematica tutorial. Use this program to analyze the spectrum given in Table 12.4, for n = 10 000 to 50 000 iterations, depending on the speed of your computer. As part of your solution, recompute Figures 12.7, 12.8, and the Bayes factor used to compare the two competing models. Explain how you arrived at your choice for the number of burn-in samples. The prior information is the same as that assumed in Section 12.6. Theory predicts the spectral line has a Gaussian shape with a line width $\sigma_L = 2$ frequency channels. The noise in each channel is known to be Gaussian with $\sigma = 1.0$ mK and the spectrometer output is in units of mK.

2. Repeat the analysis of problem 1 with the following changes. In addition to the unknown line strength and center frequency, the line width is also uncertain. Assume a uniform prior for the line width, with lower and upper bounds of 0.5 and 4 frequency channels, respectively. You will need to modify the parallel tempering MCMC program to allow for the addition of the line width parameter. Experiment with your choice of $\sigma$ for the line width in the Gaussian proposal distribution, to obtain a reasonable value for the acceptance rate somewhere in the range 0.25 to 0.5. Your solution should include a plot of the marginal probability distribution for each of the three parameters and a calculation of the Bayes factor for comparing the two models. Justify your choice for the number of burn-in samples.

3. Carry out the analysis described in problem 2 by modifying the experimental APT MCMC software provided in the Mathematica tutorial, and discussed in Section 12.8.

4. In Section 11.6, we illustrated the solution of a simple nonlinear model fitting problem using Mathematica’s NonlinearRegress, which implements the Levenberg–Marquardt method. In this problem we want to analyze the same spectral line data (Table 12.5) using the experimental APT MCMC software given in the Mathematica tutorial and discussed in Section 12.8. It will yield a fully Bayesian solution to the problem without the need to assume the asymptotic normal approximation, or to assume the Laplacian approximations for computing the Bayes factor and marginals. In general, MCMC solutions come into their own for higher dimensional problems but it is desirable to gain experience working with simpler problems. Modify the APT MCMC software to analyze these data for the two models described in Section 11.6. In Mathematica, model 1 has the form: model[a0_, a1_, f1_] := a0 + a1 line[f1], where

$$\mathrm{line}[f_1] := \frac{\sin[2\pi(f - f_1)/\Delta f]}{2\pi(f - f_1)/\Delta f}$$

and $\Delta f = 1.5$. Model 2 has the form: model[a0_, a1_, a2_, f1_, f2_] := a0 + a1 line[f1] + a2 line[f2], where f2 is assumed to be the higher frequency line. Adopt uniform priors for all parameters and assume a lower bound of 0 and an upper bound of 10 for a0, a1 and a2. For the two spectral line model, we need to carefully consider the prior boundaries for f1 and f2 to prevent the occurrence of two degenerate peaks in the joint posterior. Adopt a range for f2 = 1.0 to 5.0. Since by definition, f1 is the lower frequency line, at any iteration the current value of f1 must be less than the current value of f2. Thus

$$p(f_1|f_2, M_2, I) = \frac{1}{f_2 - 1.0}.$$

Table 12.4 Spectral line data consisting of 64 frequency channels obtained with a radio astronomy spectrometer. The output voltage from each channel has been calibrated in units of effective black body temperature expressed in mK. The existence of negative values arises from receiver channel noise which gives rise to both positive and negative fluctuations.

ch. # | mK | ch. # | mK | ch. # | mK | ch. # | mK
1 | 0.82 | 17 | 0.90 | 33 | 0.03 | 49 | 0.72
2 | 2.07 | 18 | 0.33 | 34 | 1.47 | 50 | 0.38
3 | 0.38 | 19 | 0.80 | 35 | 1.70 | 51 | 0.02
4 | 0.99 | 20 | 1.42 | 36 | 1.89 | 52 | 1.26
5 | 0.12 | 21 | 0.28 | 37 | 4.55 | 53 | 1.35
6 | 1.35 | 22 | 0.42 | 38 | 3.59 | 54 | 0.04
7 | 0.20 | 23 | 0.12 | 39 | 2.02 | 55 | 1.45
8 | 0.36 | 24 | 0.14 | 40 | 0.21 | 56 | 1.48
9 | 0.78 | 25 | 0.63 | 41 | 0.05 | 57 | 1.16
10 | 1.01 | 26 | 1.77 | 42 | 0.54 | 58 | 0.40
11 | 0.44 | 27 | 0.67 | 43 | 0.09 | 59 | 0.01
12 | 0.34 | 28 | 0.55 | 44 | 0.61 | 60 | 0.29
13 | 1.58 | 29 | 1.98 | 45 | 2.49 | 61 | 1.35
14 | 0.08 | 30 | 0.08 | 46 | 0.07 | 62 | 0.21
15 | 0.38 | 31 | 1.16 | 47 | 1.45 | 63 | 1.67
16 | 0.71 | 32 | 0.48 | 48 | 0.56 | 64 | 0.70

Table 12.5 Spectral line data consisting of 51 pairs of frequency and signal strength (mK) measurements.

f | mK | f | mK | f | mK | f | mK
0.00 | 0.86 | 1.56 | 0.97 | 3.12 | 1.95 | 4.68 | 1.39
0.12 | 1.08 | 1.68 | 0.97 | 3.24 | 1.75 | 4.80 | 0.64
0.24 | 0.70 | 1.80 | 1.06 | 3.36 | 2.03 | 4.92 | 0.79
0.36 | 1.16 | 1.92 | 0.85 | 3.48 | 1.42 | 5.04 | 1.27
0.48 | 0.98 | 2.04 | 1.94 | 3.60 | 1.06 | 5.16 | 1.17
0.60 | 1.32 | 2.16 | 2.34 | 3.72 | 0.79 | 5.28 | 1.23
0.72 | 1.05 | 2.28 | 3.55 | 3.84 | 1.11 | 5.40 | 1.23
0.84 | 1.17 | 2.40 | 3.53 | 3.96 | 0.88 | 5.52 | 0.71
0.96 | 0.96 | 2.52 | 4.11 | 4.08 | 0.88 | 5.64 | 0.71
1.08 | 0.86 | 2.64 | 3.72 | 4.20 | 0.68 | 5.76 | 0.80
1.20 | 1.12 | 2.76 | 3.52 | 4.32 | 1.39 | 5.88 | 1.16
1.32 | 0.79 | 2.88 | 2.78 | 4.44 | 0.62 | 6.00 | 1.12
1.44 | 0.86 | 3.00 | 3.03 | 4.56 | 0.80 | |
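As a starting point for problem 4, the two models and the ordered-frequency prior can be sketched in Python (the book's own software is in Mathematica; this is only an illustrative translation). The function names, the default bounds `lo` and `hi`, and the choice to return a log prior are assumptions. The joint prior is written as $p(f_2)\,p(f_1|f_2)$ with $f_2$ uniform on 1.0 to 5.0 and $f_1$ uniform between 1.0 and $f_2$, as stated in the problem.

```python
import math

DELTA_F = 1.5  # Delta f from the problem statement

def line(f, f1, delta_f=DELTA_F):
    """sin(x)/x line shape with x = 2*pi*(f - f1)/delta_f."""
    x = 2.0 * math.pi * (f - f1) / delta_f
    return 1.0 if x == 0.0 else math.sin(x) / x

def model1(f, a0, a1, f1):
    """Constant background plus one spectral line."""
    return a0 + a1 * line(f, f1)

def model2(f, a0, a1, a2, f1, f2):
    """Constant background plus two spectral lines (f1 < f2)."""
    return a0 + a1 * line(f, f1) + a2 * line(f, f2)

def log_prior_f1_f2(f1, f2, lo=1.0, hi=5.0):
    """log[p(f2|M2,I) p(f1|f2,M2,I)] for uniform priors with f1 < f2.

    p(f2) = 1/(hi - lo) and p(f1|f2) = 1/(f2 - lo); outside the
    allowed region the prior is zero (log prior = -infinity).
    """
    if not (lo <= f1 < f2 <= hi):
        return -math.inf
    return -math.log(hi - lo) - math.log(f2 - lo)
```

In an MCMC step, a proposed pair with `log_prior_f1_f2(f1, f2) == -math.inf` would simply be rejected, which keeps the sampler out of the degenerate region where the two lines swap roles.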
13 Bayesian revolution in spectral analysis¹
13.1 Overview

Science is all about identifying and understanding organized structures or patterns in nature. In this regard, periodic patterns have proven especially important. Nowhere is this more evident than in the field of astronomy. Periodic phenomena allow us to determine fundamental properties like mass and distance, enable us to probe the interior of stars through the new techniques of stellar seismology, detect new planets, and discover exotic states of matter like neutron stars and black holes. Clearly, any fundamental advance in our ability to detect periodic phenomena will have profound consequences in our ability to unlock nature’s secrets. The purpose of this chapter is to describe advances that have come about through the application of Bayesian probability theory,² and provide illustrations of its power through several examples in physics and astronomy. We also examine how non-uniform sampling can greatly reduce some signal aliasing problems.
¹ The term ‘‘spectral analysis’’ has been used in the past to denote a wider class of problems than will be considered in this chapter. For a brief introduction to stochastic spectrum estimation, see Appendix B.13.4.
² The first three sections of this chapter are a revised version of an earlier paper by the author (Gregory, 2001), which is reproduced here with the permission of the American Institute of Physics.

13.2 New insights on the periodogram

Arthur Schuster introduced the periodogram in 1905, as a means for detecting a periodicity and estimating its frequency. If the data are evenly spaced, the periodogram is determined by the Discrete Fourier Transform (DFT), thus justifying the use of the DFT for such detection and measurement problems. In 1965, Cooley and Tukey introduced the Fast Discrete Fourier Transform (FFT), a very efficient method of implementing the DFT that removes certain redundancies in the computation and greatly speeds up the calculation of the DFT. A detailed treatment of the DFT and FFT is given in Appendix B.

The Schuster periodogram was introduced largely for intuitive reasons, but in 1987, Jaynes provided a formal justification by applying the principles of Bayesian inference as follows: suppose we are analyzing data consisting of samples of a continuous function contaminated with additive independent Gaussian noise with a variance of $\sigma^2$. Jaynes showed that, presuming the possible periodic signal is sinusoidal (but with unknown amplitude, frequency, and phase), the Schuster periodogram exhausts all the information in the data relevant to assessing the possibility that a signal is present, and to estimating the frequency and amplitude of such a signal. The periodogram is essentially the squared magnitude of the FFT and can be defined as

$$\text{periodogram} = C(f_n) = \frac{1}{N}\left|\sum_{k=0}^{N-1} d_k\, e^{i 2\pi n \Delta f\, k \Delta T}\right|^2 = \frac{1}{N}\,|\mathrm{FFT}|^2. \tag{13.1}$$

In an FFT, the frequency interval is $\Delta f = 1/T$, where $T = N\,\Delta T$ is the duration of the data set and $\Delta T$ is the sampling interval. The quantity $C(f_n)$ is indeed fundamental to spectral analysis but not because it is itself a satisfactory spectrum estimator. Jaynes showed that the probability for the frequency of a periodic sinusoidal signal is given approximately by³

$$p(f_n|D,I) \propto \exp\left[\frac{C(f_n)}{\sigma^2}\right]. \tag{13.2}$$

Thus, the proper algorithm to convert $C(f_n)$ to $p(f_n|D,I)$ involves first dividing $C(f_n)$ by the noise variance and then exponentiating. This naturally suppresses spurious ripples at the base of the periodogram, usually accomplished with linear smoothing; but does it by attenuation rather than smearing, and therefore does not sacrifice any precision. The Bayesian nonlinear processing of $C(f_n)$ also yields, when the data give evidence for them, arbitrarily sharp spectral peaks. Since the peak in $p(f_n|D,I)$ can be much sharper than the peak in $C(f_n)$, it is necessary to zero pad the FFT to obtain a sufficient density of points in $C(f_n)$ for use in Equation (13.2) to accurately define a peak in $p(f_n|D,I)$.

Figure 13.1 provides a demonstration of these properties for a simulated data set consisting of a single sine wave plus additive Gaussian noise given by Equation (13.3).

$$y = A\cos 2\pi f t + \text{Gaussian noise}\ (\text{mean} = 0,\ \sigma = 1), \tag{13.3}$$

where A = 1, f = 0.1 Hz. The upper panel shows 64 simulated data points computed from Equation (13.3), with one-$\sigma$ error bars. The middle panel is the Fourier power spectral density or periodogram, computed for this data according to Equation (13.1).⁴ The sinusoidal signal is clearly indicated by the prominent peak.

³ Bretthorst (2000) derives the exact result for non-uniformly sampled data which involves an analogous nonlinear transformation of the Lomb–Scargle periodogram (Lomb, 1976; Scargle, 1982, 1989). Bretthorst (2001) also shows how to generalize the Lomb–Scargle periodogram for the case of a non-stationary sinusoid. This is discussed further in Section 13.5.
⁴ The 64 points were zero padded to provide a total of 512 points for the FFT. See Appendix B.11 for more details on zero padding.
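The conversion from periodogram to posterior described by Equations (13.1) and (13.2) can be sketched in a few lines of Python (an illustrative translation, not the book's Mathematica notebook). The function names and the sampling-interval argument `dt` are assumptions. The periodogram is evaluated by a direct sum on a dense frequency grid, which gives the same values as an FFT of the zero-padded series at the corresponding frequencies, and the maximum of $C$ is subtracted before exponentiating purely to avoid numerical overflow.

```python
import cmath
import math
import random

def periodogram(data, freq, dt=1.0):
    """C(f) = |sum_k d_k exp(i 2 pi f k dt)|^2 / N, as in Equation (13.1)."""
    total = sum(d * cmath.exp(2j * math.pi * freq * k * dt)
                for k, d in enumerate(data))
    return abs(total) ** 2 / len(data)

def frequency_posterior(data, freqs, sigma, dt=1.0):
    """Return (C, p) on an evenly spaced frequency grid.

    p is the density proportional to exp(C / sigma^2), Equation (13.2),
    normalized so that sum(p) * grid_step = 1.
    """
    c = [periodogram(data, f, dt) for f in freqs]
    c_max = max(c)
    # Subtract the maximum before exponentiating to avoid overflow.
    weights = [math.exp((ci - c_max) / sigma ** 2) for ci in c]
    step = freqs[1] - freqs[0]
    norm = sum(weights) * step
    return c, [w / norm for w in weights]

if __name__ == "__main__":
    # Simulated data in the spirit of Equation (13.3): A = 1, f = 0.1 Hz, sigma = 1.
    random.seed(0)
    data = [math.cos(2 * math.pi * 0.1 * k) + random.gauss(0.0, 1.0)
            for k in range(64)]
    freqs = [n / 512 for n in range(257)]  # same grid as zero padding to 512
    c, p = frequency_posterior(data, freqs, sigma=1.0)
    print("most probable frequency (Hz):", freqs[p.index(max(p))])
```

Because the exponential is monotonic, the posterior peaks at the same frequency as the periodogram; what changes is the width of the peak and the suppression of the low-level ripples.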
Figure 13.1 Comparison of conventional (middle panel) and Bayesian analysis (lower panel) of a simulated time series (upper panel). [Panels: Simulated Time Series [sin 2πft + noise (σ = 1)], signal strength versus time axis; Fourier Power Spectral Density, power density versus frequency; Bayesian Probability Density, probability density versus frequency.]
The signal-to-noise ratio (S/N), defined as the ratio of the RMS signal amplitude to the noise $\sigma$, was 0.7 in the above simulation. If we repeated the simulation with a larger S/N ratio, the main peak would increase in relation to the noise peaks and we would start to notice side lobes emerging associated with the finite duration of the data set (rectangular window function, see Appendix B.7.1). However, a well-known property of the periodogram is that the width of any spectral peak depends only on the duration of the data set and not on the signal-to-noise level. Various methods have been used to determine the accuracy to which the peak frequency can be determined, but, as we see below, the Bayesian posterior probability for the signal frequency provides this information directly.
The lower panel of Figure 13.1 shows the Bayesian probability density for the period of the signal, derived from Equation (13.2). As the figure demonstrates, the spurious noise features are suppressed and the width of the spectral peak is much narrower than the peak in the periodogram. In a Bayesian analysis, the width of the spectral peak, which reflects the accuracy of the frequency estimate, is determined by the duration of the data, the S/N, and the number of data points. More precisely, the standard deviation of the spectral peak, $\delta f$, for a S/N > 1, is given by

$$\delta f \approx 1.6\left(\frac{S}{N}\, T\, \sqrt{N}\right)^{-1}\ \text{Hz}, \tag{13.4}$$

where T = the data duration in s, and N = the number of data points in T. To improve the accuracy of the estimate, the two most important factors are how long we sample (the T dependence) and the signal-to-noise ratio.

Equation (13.2) assumes that the noise variance is a known quantity. In some situations, the noise is not well understood, i.e., our state of knowledge is less certain. Even if the measurement apparatus noise is well understood, the data may contain a greater complexity of phenomena than the current signal model incorporates. In such cases, Equation (13.2) is no longer relevant, but again, Bayesian inference can readily handle this situation by treating the noise variance as a nuisance parameter with a prior distribution reflecting our uncertainty in this parameter. We saw how to do that when estimating the mean of a data set in Section 9.2.3. The resulting posterior can be expressed in the form of a Student’s t distribution. The corresponding result for estimating the frequency of a single sinusoidal signal (Bretthorst, 1988) is given approximately⁵ by

$$p(f_n|D,I) \propto \left[1 - \frac{2C(f_n)}{N\,\overline{d^2}}\right]^{\frac{2-N}{2}}, \tag{13.5}$$

where N is the number of data values and $\overline{d^2} = (1/N)\sum_i d_i^2$ is the mean square average of the data values. The analysis assumes any DC component in the data has been removed.

If $\sigma$ is not well known, then it is much safer to use Equation (13.5) than Equation (13.2) because Equation (13.5) will treat anything that cannot be fitted by the model as noise. This leads to more conservative estimates. A corollary of Jaynes’ analysis is that for any other problem (e.g., non-sinusoidal light curve, non-Gaussian noise, or non-uniform sampling) use of the FFT is not optimal; more information can be extracted from the data if we use more sophisticated statistics. Jaynes made this point himself, and it has been amply demonstrated in the work of Bretthorst (1988), who has applied similar methods to signal detection and estimation problems with non-sinusoidal models with Gaussian noise probabilities.

⁵ Note: Equations (13.2) and (13.5) do not require the data to be uniformly sampled provided that: (1) the number of data values N is large, (2) there is no constant (DC) component in the data, and (3) there is no evidence of a low frequency.
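Equations (13.4) and (13.5) are simple enough to evaluate directly. The Python sketch below is illustrative only: the function names are assumptions, the log of the unnormalized posterior is returned for numerical convenience, and an explicit error is raised when the bracket in Equation (13.5) is not positive, since the approximation cannot be evaluated there (for example, for essentially noise-free data).

```python
import math

def student_t_log_posterior(c_values, data):
    """Unnormalized log p(f_n|D,I) from Equation (13.5).

    c_values : periodogram values C(f_n) computed from `data`
    data     : the data values d_i, with any DC component removed
    """
    n = len(data)
    mean_square = sum(d * d for d in data) / n
    log_post = []
    for c in c_values:
        bracket = 1.0 - 2.0 * c / (n * mean_square)
        if bracket <= 0.0:
            raise ValueError("1 - 2C/(N d2bar) must be positive")
        log_post.append(0.5 * (2 - n) * math.log(bracket))
    return log_post

def frequency_sigma(snr, duration, n_points):
    """Standard deviation of the spectral peak, Equation (13.4), in Hz.

    snr is the RMS signal-to-noise ratio (the formula is quoted for S/N > 1),
    duration is T in seconds and n_points is the number of data points in T.
    """
    return 1.6 / (snr * duration * math.sqrt(n_points))
```

For instance, `frequency_sigma(7.0, 64.0, 64)` evaluates Equation (13.4) for 64 points spanning 64 s at S/N = 7, giving a value much smaller than the periodogram resolution of roughly 1/T.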
In the following sections, we will consider two general classes of spectral problems: (a) those for which we have strong prior information of the signal model, and (b) those for which we have no specific prior information about the signal.
13.2.1 How to compute p(f|D,I)

In Section 13.2, we saw that the periodogram, $C(f_n)$, follows naturally from Bayesian probability theory⁶ when our prior information indicates there is a single sine wave present in the data and we want to compute $p(f_n|D,I)$. Equation (13.2) gives the relationship between $p(f_n|D,I)$ and $C(f_n)$ if the noise is known, and Equation (13.5) applies when the noise is unknown. The value of the periodogram at a set of discrete frequencies, indexed by n, is given by

$$C(f_n) = \frac{1}{N}\left|\sum_{j=1}^{N} d_j\, e^{i2\pi f_n t_j}\right|^2 = \frac{1}{N}\,|\mathrm{FFT}|^2$$

or

$$C(n) = \frac{1}{N}\left|\sum_{j=1}^{N} d_j\, e^{i2\pi n j/N}\right|^2 = \frac{|H_n|^2}{N}, \tag{13.6}$$

where $H_n$ is the FFT or DFT transform defined by Equations (B.49) and (B.55) in Appendix B. We illustrate the calculations in more detail by comparing the Bayesian $p(n|D,I)$ to the one-sided PSD (given in Appendix B.13.2) for two simulated time series shown in Figure 13.2. The time series consist of 64 samples at one-second intervals of

$$d_j = A\cos 2\pi f t_j + \text{Gaussian noise}\ (\sigma = 1). \tag{13.7}$$

The simulated data for two different choices of signal amplitude (A = 0.8 and A = 10), corresponding to low and high signal-to-noise ratios, are shown in the two top panels of Figure 13.2. In the computation of the FFT, we take N time samples at intervals of T seconds and compute N transform points $H_n$. The value n = 0 corresponds to the FFT at zero frequency, and n = N/2 to the value at the Nyquist frequency = 1/(2T). Values of n from N/2 + 1 to N − 1 correspond to values of the FFT for negative frequencies. In Appendix B.13.1 we show that $|H_n|^2\,T/N$ is the two-sided PSD (two-sided periodogram) with units of power Hz$^{-1}$.

⁶ In general, for a different signal model or noise model, Bayesian inference will lead to an equation involving a different function or statistic of the data for computing the probability of the signal frequency.
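The ordering of FFT output points described in this section (zero frequency at n = 0, the Nyquist frequency at n = N/2, negative frequencies above that) can be made concrete with a small helper. This is a Python sketch with an assumed function name; it takes N to be even, as in the 64-point examples, and uses `dt` for the sampling interval called T in the text.

```python
def fft_index_to_frequency(n, n_points, dt):
    """Frequency in Hz of FFT output index n.

    n_points is the (even) number of samples and dt the sampling interval
    in seconds. Indices 0..N/2 map to 0..Nyquist; indices N/2+1..N-1 map
    to negative frequencies.
    """
    if not 0 <= n < n_points:
        raise IndexError("index outside 0..N-1")
    if n <= n_points // 2:
        return n / (n_points * dt)
    return (n - n_points) / (n_points * dt)
```

With 64 samples at one-second intervals, index 32 corresponds to the Nyquist frequency of 0.5 Hz and index 63 to −1/64 Hz.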
Figure 13.2 Comparison of conventional (middle panels) and Bayesian analysis (bottom panels) of two simulated time series (top panels). [Left column: Simulation [0.8 sin 2πft + noise (σ = 1)]; right column: Simulation [10 sin 2πft + noise (σ = 1)]. Top panels: signal strength versus time axis. Middle panels: Fourier Power Spectral Density, power density versus frequency. Bottom panels: Bayesian Probability, probability density versus frequency.]
In the computation of the Bayesian $p(n|D,I)$, $C(n)$ is just the positive frequency part of $|H_n|^2/N$:

$$C(n) = \frac{|H_n|^2}{N} \quad \text{for } n = 0, 1, \ldots, \frac{N}{2}. \tag{13.8}$$

Both $C(n)$ and $\sigma^2$ have units of power and thus their ratio is dimensionless. In general, $p(n|D,I)$ will be very narrow when $C(n)/\sigma^2 > 1$ because of the exponentiation occurring in Equations (13.2) or (13.5). Thus, to accurately define $p(n|D,I)$ we need to zero pad the FFT to obtain a sufficient density of $H_n$ points to accurately define the $p(n|D,I)$ peak. Zero padding is used to obtain higher frequency resolution in the transform and is discussed in detail in Appendix B.11. In the zero padding case, Equation (13.8) becomes

$$C(n) = \frac{|H_n|^2}{N_{\rm orig}} \quad \text{for } n = 0, 1, \ldots, \frac{N_{\rm zp}}{2}, \tag{13.9}$$

where $N_{\rm orig}$ is the number of original time series samples and $N_{\rm zp}$ is the total number of points including the added zeros. For analysis of the time series in Figure 13.2, we zero padded to produce $N_{\rm zp} = 512$ points.

Box 13.1 Note: Mathematica uses a slightly different definition of $H_n$ to that given in Equation (13.6), which we designate by $[H_n]_{\rm Math}$:

$$[H_n]_{\rm Math} = \frac{1}{\sqrt{N}}\sum_{j=1}^{N} d_j\, e^{i2\pi n j/N}.$$

The modified version of Equation (13.9) is

$$C(n) = \frac{N_{\rm zp}}{N_{\rm orig}}\,\bigl|[H_n]_{\rm Math}\bigr|^2 \quad \text{for } n = 0, 1, \ldots, \frac{N_{\rm zp}}{2},$$

where $[H_n]_{\rm Math}$ = Fourier[data], and data is a list of $d_j$ values.

The bottom two panels of Figure 13.2 show the Bayesian posterior $p(n|D,I)$ computed from Equation (13.2) for the two simulations. The middle panels show the one-sided PSD for comparison. For the weak signal-to-noise simulation shown on the left, the PSD display exhibits many spurious noise peaks. The Bayesian $p(n|D,I)$ shows a single strong narrow peak while the spurious noise features have been strongly suppressed. Keep in mind that both quantities were computed from the same FFT of the time series. The comparison serves to show how much of an improvement can be obtained by a Bayesian estimation of the period over the intuitive PSD spectrum estimator even for a RMS signal-to-noise ratio of 0.6.

The three panels on the right hand side of Figure 13.2 illustrate the corresponding situation for a RMS signal-to-noise ratio of 7. In this case, we can clearly see the side lobes adjacent to the main peak which arise from using a rectangular data windowing function. In a conventional analysis, these side lobes are reduced by employing a data windowing function which reduces the relative importance of data at either end and results in a broadening of the spectral peak. The Bayesian analysis suppresses both the side lobes and spurious ripples by attenuation, and results in a very much narrower spectral peak. Also, because we have computed $p(n|D,I)$ directly, we can readily compute the accuracy of our spectral peak frequency estimate.
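The bookkeeping in Equation (13.9) and Box 13.1 is easy to get wrong, so a small numerical check may help. The Python sketch below uses a plain (slow) DFT rather than a library FFT so that it is self-contained; the function names are assumptions. It computes $C(n)$ from an unnormalized transform of the zero-padded data, and again from a transform scaled by $1/\sqrt{N_{\rm zp}}$, the normalization Box 13.1 attributes to Mathematica's Fourier. The sum index here starts at 0 rather than 1, which changes only the phase of $H_n$ and not $|H_n|^2$.

```python
import cmath
import math

def dft_positive(data, n_total):
    """Unnormalized DFT of `data` zero padded to n_total points,
    returned for n = 0 .. n_total/2 only."""
    padded = list(data) + [0.0] * (n_total - len(data))
    return [
        sum(d * cmath.exp(2j * math.pi * n * j / n_total)
            for j, d in enumerate(padded))
        for n in range(n_total // 2 + 1)
    ]

def c_zero_padded(data, n_zp):
    """C(n) = |H_n|^2 / N_orig, Equation (13.9)."""
    n_orig = len(data)
    return [abs(h) ** 2 / n_orig for h in dft_positive(data, n_zp)]

def c_from_scaled_transform(data, n_zp):
    """Same quantity starting from a transform scaled by 1/sqrt(N_zp),
    i.e. C(n) = N_zp |[H_n]_Math|^2 / N_orig as in Box 13.1."""
    n_orig = len(data)
    h_math = [h / math.sqrt(n_zp) for h in dft_positive(data, n_zp)]
    return [n_zp * abs(h) ** 2 / n_orig for h in h_math]
```

The key point is that the divisor is always the number of original samples, not the padded length: the added zeros contribute nothing to the sum, so they should not dilute the normalization.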
13.3 Strong prior signal model

Larry Bretthorst (1988, 1990a, b, c, 1991) extended Jaynes’ work to more complex signal models with additive Gaussian noise and revolutionized the analysis of Nuclear Magnetic Resonance (NMR) signals. In NMR free-induction decay, the signal consists of a sum of exponentially decaying sinusoids of different frequency and decay rate. The top two panels of Figure 13.3 illustrate the quadrature channel measurements in an NMR free-induction decay experiment. In this example, the S/N is very high. The middle panel illustrates the conventional absorption spectrum based on an FFT of the data, which shows three obvious spectral peaks with an indication of further structure in the peaks. The bottom panel illustrates Bretthorst’s Bayesian analysis of this NMR data, which clearly isolates six separate peaks. The resolution is so good that the six peaks appear as delta functions in this figure. A similar improvement was obtained in the estimation of the decay rates. The Bayesian analysis provides much more reliable and informative results when prior knowledge of the shape of the signal and noise statistics are incorporated.

The question of how many frequencies are present, and what are the marginal PDFs for the frequencies and decay rates, can readily be addressed in the Bayesian framework using a Markov chain Monte Carlo computation. We saw how to do this in Sections 12.5 to 12.9. Frequently, the physics of the problem provides additional information about the relationships between pairs of frequencies, which can be incorporated as useful prior information. Varian Corporation now offers an expert analysis package with their new NMR machines based on Bretthorst’s Bayesian algorithm. The manual for this software is available online at http://bayesiananalysis.wustl.edu/.

Figure 13.3 Comparison of conventional analysis (middle panel) and Bayesian analysis (bottom panel) of the two-channel NMR time series (top two panels). (Figure credit G. L. Bretthorst, reproduced by permission from the American Institute of Physics.)
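To make the phrase "a sum of exponentially decaying sinusoids" concrete, here is a minimal Python sketch of such a signal model for two quadrature channels. It is an illustration under stated assumptions, not Bretthorst's algorithm: the function name, the parameterization of each component as (amplitude, frequency, decay rate, phase), and the sign convention of the second channel are all hypothetical choices.

```python
import math

def free_induction_decay(t, components):
    """Noise-free two-channel model at time t.

    components : list of (amplitude, frequency_hz, decay_rate, phase) tuples
    Returns (channel_1, channel_2), the cosine and sine quadratures of a
    sum of exponentially decaying sinusoids.
    """
    channel_1 = 0.0
    channel_2 = 0.0
    for amplitude, freq, decay, phase in components:
        envelope = amplitude * math.exp(-decay * t)
        angle = 2.0 * math.pi * freq * t + phase
        channel_1 += envelope * math.cos(angle)
        channel_2 += envelope * math.sin(angle)
    return channel_1, channel_2
```

In a Bayesian treatment, the frequencies and decay rates in `components` would be the parameters of interest, and the number of components would itself be decided by model comparison.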
13.4 No specific prior signal model In this case, we are addressing the detection and measurement of a periodic signal in a time series when we have no specific prior knowledge of the existence of such a signal or of its characteristics, including its shape. For example, an extraterrestrial civilization might be transmitting a repeating pattern of information either intentionally or unintentionally. What scheme could we use to optimally detect such a signal after we have made our best guess at a suitable wavelength of observation? Bayesian inference provides a well-defined procedure for solving any inference problem including questions of this kind. However, to proceed with the calculation, it is necessary to assume a model or family of models which is capable of approximating a periodic signal of arbitrary shape. A very useful Bayesian solution to the problem of detecting a signal of unknown shape was worked out by the author in collaboration with Tom Loredo (Gregory and Loredo, 1992, 1993, 1996), for the case of event arrival time data. The Gregory–Loredo (GL) algorithm was initially motivated by the problem of detecting periodic signals (pulsars) in X-ray astronomy data. In this case, the time series consisted of individual photon arrival times where the appropriate sampling distribution is the Poisson distribution. To address the periodic signal detection problem, we compute the ratio of the probabilities (odds) of two models MPer and M1 . Model MPer is a family of periodic models capable of describing a background plus a periodic signal of arbitrary shape. Each member of the family is a histogram with m bins, with m ranging from 2 to some upper limit, typically 12. Three examples are shown in Figure 13.4. The prior probability that MPer is true is divided equally among the members of this family. Model M1 assumes the data are consistent with a
13.4 No specific prior signal model
2–bin periodic model
6–bin periodic model
12–bin periodic model
Constant model
361
Figure 13.4 Three of the four panels show members of MPer , a family of histogram (piecewise constant) periodic signal models, with m ¼ 2; 6, and 12 bins, respectively. The constant rate model, M1 , is a special case of MPer , with m ¼ 1 bin, and is illustrated in the bottom right panel.
constant event rate. M1 is a special case of MPer , with m ¼ 1 bin. Model M1 is illustrated in the bottom right panel of Figure 13.4. The Bayesian calculation automatically incorporates a quantified Occam’s penalty, penalizing models with a larger number of bins for their greater complexity.7 The calculation thus balances model simplicity with goodness-of-fit, allowing us to determine both whether there is evidence for a periodic signal, and the optimum number of bins for describing the structure in the data. The parameter space for the m-bin periodic model consists of the unknown period, an unknown phase (position of the first bin relative to the start of the data), and m histogram amplitudes describing the signal shape. A remarkable feature of this particular signal model is that the search in the m shape parameters can be carried out analytically, permitting the method to be computationally tractable. Further research is underway to investigate computationally tractable ways of incorporating additional desirable features into the signal model, such as variable bin widths to allow for a reduction in the number of bins needed to describe certain types of signal. The solution in the Poisson case yields a result that is intuitively very satisfying. The probability for the family of periodic models can be shown to be approximately inversely proportional to the entropy (Gregory and Loredo, 1992) of any significant organized periodic structure found in the search parameter space. What structure is significant is determined through built-in quantified Occam’s penalties in the calculation. Of course, structure with a high degree of organization corresponds to a state of low entropy. In the absence of knowledge about the shape of the signal, the method identifies the most organized significant periodic structure in the model parameter space.
7 The Occam penalty becomes so large for $m \gtrsim 12$ that the data are generally not good enough to make it worthwhile to include periodic models with larger values of m.
Bayesian revolution in spectral analysis
Some of the capabilities of the GL method are illustrated in the following two examples, one taken from X-ray astronomy and the other from radio astronomy.
13.4.1 X-ray astronomy example
In 1984, Seward et al. discovered a 50 ms X-ray pulsar at the center of a previously known radio supernova remnant, SNR 0540-693, located in the Large Magellanic Cloud. The initial detection of X-ray pulsations was from an FFT periodogram analysis of the data obtained from the Einstein Observatory. The true pulsar signal turned out to be the second highest peak in the initial FFT. Confidence in the reality of the signal was established from FFT runs on other data sets. The pulsar was re-observed with the ROSAT Observatory by Seward and colleagues, but this time an FFT search failed to detect the pulsar.

In Gregory and Loredo (1996), we used the GL method on a sample ROSAT data set of 3305 photons provided by F. Seward. The data spanned an interval of 116 341 s and contained many gaps. In the first instance, we incorporated the prior information on the period, period derivative and their uncertainties, obtained from the earlier detection with the Einstein Observatory data. The Gregory–Loredo method provides a calculation of the global odds ratio, defined as the ratio of the probability for the family of periodic models to the probability of a constant rate model, regardless of the exact shape, period and phase of the signal. The resulting odds ratio of $2.6 \times 10^{11}$ indicates near certainty in the presence of a periodic signal.

It is interesting to consider whether we would still claim the detection of a periodic signal if we did not have the prior information derived from the earlier detection. Thus, in the second instance, we assume a prior period search range extending from the rotational breakup period of a neutron star (approximately 1.5 ms) to half the duration of the data. This gives an odds ratio of $4.5 \times 10^{5}$. This is greatly reduced due to the much larger Occam penalty associated with not knowing the period.
But this still provides overwhelming evidence for the presence of a periodic signal, despite the fact that it was undetected by FFT techniques.

In their paper, Seward et al. (1984) used another method commonly employed in X-ray astronomy, called period folding (also known as epoch folding), to obtain the pulsar light curve and a best period. Period folding involves dividing the trial period into m bins (typically five) and binning the data modulo the trial period for a given trial phase. The $\chi^2$ statistic is used to decide, at some significance level, whether a constant model can be rejected, and thus indirectly to infer the presence of a periodic signal. In Seward et al. (1984), the period uncertainty was estimated from the half-width of the $\chi^2$ peak, which is sometimes used as a naive estimate of the accuracy of the frequency estimate. Figure 13.5 shows a close-up of the largest frequency peak, comparing the GL marginal probability density for $f$ to the period folding $\langle\chi^2\rangle$ statistic. The width of the GL marginal probability density for $f$ is more than an order of magnitude smaller.
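The period folding procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `epoch_folding_chi2`, the synthetic event list, the 2 s modulation period and the comparison period of 2.37 s are all hypothetical and are not taken from the Seward et al. analysis.

```python
import numpy as np

def epoch_folding_chi2(event_times, period, m=5, phase0=0.0):
    """Chi-square of an m-bin folded light curve against a constant rate."""
    # Phase of each event in [0, 1), shifted by the trial phase.
    phases = (np.asarray(event_times) / period + phase0) % 1.0
    counts, _ = np.histogram(phases, bins=m, range=(0.0, 1.0))
    expected = counts.sum() / m          # constant-rate expectation per bin
    return float(np.sum((counts - expected) ** 2 / expected))

# Synthetic events (assumed values): arrival rate modulated at a 2 s period,
# generated by thinning a uniform set of candidate arrival times.
rng = np.random.default_rng(0)
true_period = 2.0
candidates = rng.uniform(0.0, 100.0, size=4000)
accept_prob = 0.5 * (1.0 + 0.8 * np.cos(2 * np.pi * candidates / true_period))
events = candidates[rng.uniform(size=candidates.size) < accept_prob]

chi2_true = epoch_folding_chi2(events, true_period)
chi2_off = epoch_folding_chi2(events, 2.37)   # unrelated trial period
print(chi2_true, chi2_off)
```

Folding at the true period should give a much larger $\chi^2$ than folding at an unrelated period, which is the sense in which a constant model is "rejected". Note that the statistic says nothing directly about the probability of a periodic model, which is the quantity the GL method computes.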
13.4 No specific prior signal model
[Figure 13.5: plot of the period folding statistic (labelled "CHI-square") and the Bayesian p(f), on a vertical scale from 0 to 80, versus (frequency − 19852800) μHz over the range 70 to 90.]
Figure 13.5 Close-up of the largest frequency peak comparing the Gregory–Loredo probability density for $f$ to the period folding $\langle\chi^2\rangle$ statistic (diamonds). The $\langle\chi^2\rangle$ statistic versus trial frequency results from an epoch folding analysis using $m = 5$ bins (Gregory and Loredo, 1996).
13.4.2 Radio astronomy example
In 1999, the author generalized the GL algorithm to the Gaussian noise case. Application of the method to a radio astronomy data set has resulted in the discovery of a new periodic phenomenon (Gregory, 1999, 2002; Gregory et al., 1999; Gregory and Neish, 2002) in the X-ray and radio emitting binary, LS I +61° 303. LS I +61° 303 is a remarkable tenth magnitude binary star (Gregory and Taylor, 1978; Hutchings and Crampton, 1981) that exhibits periodic radio outbursts every 26.5 days (Taylor and Gregory, 1982), which is the binary orbital period. The radio, infrared, optical, X-ray and γ-ray data indicate that the binary consists of a rapidly rotating massive young star, called a Be star, together with a neutron star in an eccentric orbit. The Be star exhibits a dense equatorial wind, and the periodic radio outbursts are thought to arise from variations in wind accretion by the neutron star in its eccentric orbit. Some of the energy associated with the accretion process is liberated in the form of outbursts of radio emission.

One puzzling feature of the outbursts has been the variability of the orbital phase of the outburst maxima, which can range over 180 degrees of phase. In addition, the strength of the outburst peaks was known to vary on time scales of approximately 4 years (Gregory et al., 1989; Paredes et al., 1990). Armed with over twenty years of data, we (Gregory, 1999; Gregory et al., 1999) applied Bayesian inference to assess a variety of hypotheses to explain the outburst timing residuals and peak flux density variations. The results for both the outburst peak flux density and timing residuals demonstrated a clear 1667-day periodic modulation in both quantities. The periodic modulation model was found to be $3 \times 10^{3}$ times more probable than the sum of the probabilities of three competing non-periodic models.

Figure 13.6 shows the data and results from the timing residual analysis. Panel (a) shows the radio outburst peak timing residuals (footnote 8). The abscissa is the time interval in days from the peak of the first outburst in 1977. Very sparsely sampled measurements were obtained from the initial discovery in 1977 until 1992. However, beginning in January 1994, Ray et al. (1997) performed detailed monitoring (several times a day) with the National Radio Astronomy Observatory Green Bank Interferometer. With such sparsely sampled data, the eye is unable to pick out any obvious periodicity. Panel (c) shows the Bayesian marginal probability density for the modulation period. The single well-defined peak provides clear evidence for a periodicity of approximately 1667 days. Subsequent monitoring of the binary star system has confirmed and refined the orbital and modulation periods. Panel (b) shows a comparison of the predicted outburst timing residuals with the data versus time. The solid curves show the estimated mean light curve, ±1 standard deviation. The new data, indicated by a shaded box, nicely confirm the periodic modulation model. This discovery has contributed significantly to our understanding of Be star winds.

The Mathematica tutorial includes a Markov chain Monte Carlo version of the GL algorithm, for the Gaussian noise case, in the section entitled "MCMC version of the Gregory–Loredo algorithm."

[Figure 13.6: three panels. (a) Residuals (days), from −10 to 10, versus Time (days), from 0 to 10000. (b) Residuals (days), from −10 to 10, versus Time (days), from 0 to 10000. (c) Probability density, from 0 to 0.04, versus Modulation period (days), from 1000 to 2500.]
Figure 13.6 Panel (a) shows the outburst timing residuals. A comparison of the predicted outburst timing residuals with the data versus time is shown in panel (b). The solid curves show the estimated mean light curve, ±1 standard deviation. The new data are indicated by a filled box symbol. Panel (c) shows the probability for the modulation period of LS I +61° 303.

8 The timing residuals depend on the assumed orbital period, which is not accurately known independent of the radio data. The GL algorithm was modified to compute the joint probability distribution of the orbital and modulation periods. Only the marginal distribution for the modulation period is shown.
13.5 Generalized Lomb–Scargle periodogram
In Section 13.2, we introduced Jaynes' insights on the periodogram from probability theory, and in Section 13.2.1 we discussed how to compute $p(f|D,I)$ in more detail. Bretthorst (2000, 2001) generalized Jaynes' insights to a broader range of single-frequency estimation problems and sampling conditions, and removed the need for the approximations made in the derivation of Equations (13.2) and (13.5). In the course of this development, Bretthorst established a connection between the Bayesian results and an existing frequentist statistic known as the Lomb–Scargle periodogram (Lomb, 1976; Scargle, 1982, 1989), which is a widely used replacement for the Schuster periodogram in the case of non-uniform sampling. We will summarize Bretthorst's Bayesian results in this section. In particular, his analysis allows for the following complications:
1. Either real or quadrature data sampling. Quadrature data involve measurements of the real and imaginary components of a complex signal. The top two panels of Figure 13.3 show an example of quadrature signals occurring in NMR. Let $d_R(t_i)$ denote the real data at time $t_i$ and $d_I(t'_j)$ denote the imaginary data at time $t'_j$. There are $N_R$ real samples and $N_I$ imaginary samples, for a total of $N = N_R + N_I$ samples.
2. Uniform or non-uniform sampling and, for quadrature data, non-simultaneous sampling. The analysis does not require the $t_i$ and $t'_j$ to be simultaneous, and successive samples can be unequally spaced in time.
3. A non-stationary single sinusoid model of the form
$$d_R(t_i) = A\cos(2\pi f t_i - \theta)\,Z(t_i) + B\sin(2\pi f t_i - \theta)\,Z(t_i) + n_R(t_i), \tag{13.10}$$
where $A$ and $B$ are the cosine and sine amplitudes, and $n_R(t_i)$ denotes the noise at $t_i$. The function $Z(t_i)$ describes an arbitrary modulation of the amplitude, e.g., exponential decay as exhibited in NMR signals. If $Z(t)$ is a function of any parameters, those parameters are assumed known, e.g., the exponential decay rate. $Z(t)$ is sometimes called a weighting function or apodizing function. The corresponding signal model for the imaginary channel is given by
$$d_I(t'_j) = -A\sin(2\pi f t'_j - \theta)\,Z(t'_j) + B\cos(2\pi f t'_j - \theta)\,Z(t'_j) + n_I(t'_j). \tag{13.11}$$
The angle $\theta$ is defined in such a way as to make the cosine and sine functions orthogonal on the discretely sampled times. This corresponds to the condition
$$0 = \sum_{i=1}^{N_R} \cos(2\pi f t_i - \theta)\,\sin(2\pi f t_i - \theta)\,Z(t_i)^2 - \sum_{j=1}^{N_I} \sin(2\pi f t'_j - \theta)\,\cos(2\pi f t'_j - \theta)\,Z(t'_j)^2. \tag{13.12}$$
The solution of Equation (13.12) is given by
$$\theta = \frac{1}{2}\tan^{-1}\left[\frac{\sum_{i=1}^{N_R}\sin(4\pi f t_i)\,Z(t_i)^2 - \sum_{j=1}^{N_I}\sin(4\pi f t'_j)\,Z(t'_j)^2}{\sum_{i=1}^{N_R}\cos(4\pi f t_i)\,Z(t_i)^2 - \sum_{j=1}^{N_I}\cos(4\pi f t'_j)\,Z(t'_j)^2}\right]. \tag{13.13}$$
Note: if the data are simultaneously sampled, $t_i = t'_j$, then the orthogonality condition is automatically satisfied, so $\theta = 0$.
4. The noise terms $n_R(t_i)$ and $n_I(t'_j)$ are assumed to be IID Gaussian with an unknown $\sigma$. Thus, $\sigma$ is a nuisance parameter, which is assumed to have a Jeffreys prior. By marginalizing over $\sigma$, any variability in the data that is not described by the model is assumed to be noise.

The final Bayesian expression for $p(f|D,I)$, after marginalizing over the amplitudes $A$ and $B$ (assuming independent uniform priors), is given by
$$p(f|D,I) \propto \frac{1}{\sqrt{C(f)\,S(f)}}\left[N\,\overline{d^2} - \overline{h^2}\right]^{\frac{2-N}{2}}, \tag{13.14}$$
where the mean-square data value, $\overline{d^2}$, is defined as
$$\overline{d^2} = \frac{1}{N}\left[\sum_{i=1}^{N_R} d_R(t_i)^2 + \sum_{j=1}^{N_I} d_I(t'_j)^2\right]. \tag{13.15}$$
The term $\overline{h^2}$ is given by
$$\overline{h^2} = \frac{R(f)^2}{C(f)} + \frac{I(f)^2}{S(f)}, \tag{13.16}$$
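where
$$R(f) \equiv \sum_{i=1}^{N_R} d_R(t_i)\cos(2\pi f t_i - \theta)\,Z(t_i) - \sum_{j=1}^{N_I} d_I(t'_j)\sin(2\pi f t'_j - \theta)\,Z(t'_j), \tag{13.17}$$
$$I(f) \equiv \sum_{i=1}^{N_R} d_R(t_i)\sin(2\pi f t_i - \theta)\,Z(t_i) + \sum_{j=1}^{N_I} d_I(t'_j)\cos(2\pi f t'_j - \theta)\,Z(t'_j), \tag{13.18}$$
$$C(f) \equiv \sum_{i=1}^{N_R} \cos^2(2\pi f t_i - \theta)\,Z(t_i)^2 + \sum_{j=1}^{N_I} \sin^2(2\pi f t'_j - \theta)\,Z(t'_j)^2 \tag{13.19}$$
and
$$S(f) \equiv \sum_{i=1}^{N_R} \sin^2(2\pi f t_i - \theta)\,Z(t_i)^2 + \sum_{j=1}^{N_I} \cos^2(2\pi f t'_j - \theta)\,Z(t'_j)^2. \tag{13.20}$$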
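To make the recipe of Equations (13.13) to (13.20) concrete, here is a minimal Python sketch restricted to the simplest case: real data only ($N_I = 0$) and a stationary sinusoid ($Z(t) = 1$). The function name `bayes_lomb_scargle`, the frequency grid and the simulated data (a 1.23 Hz cosine of amplitude 2 with unit-variance noise at 32 random sample times) are assumptions made for illustration; this is not Bretthorst's or the book's implementation.

```python
import numpy as np

def bayes_lomb_scargle(t, d, freqs):
    """Lomb-Scargle statistic h2 and log p(f|D,I) for real data with Z(t) = 1."""
    t = np.asarray(t, dtype=float)
    d = np.asarray(d, dtype=float)
    n = d.size
    d2bar = np.mean(d ** 2)                      # Eq. (13.15), real data only
    h2 = np.empty(len(freqs))
    log_p = np.empty(len(freqs))
    for k, f in enumerate(freqs):
        w = 2.0 * np.pi * f * t
        # Eq. (13.13) with no imaginary channel: makes cos and sin orthogonal.
        theta = 0.5 * np.arctan2(np.sum(np.sin(2 * w)), np.sum(np.cos(2 * w)))
        c, s = np.cos(w - theta), np.sin(w - theta)
        R, I = np.sum(d * c), np.sum(d * s)      # Eqs. (13.17), (13.18)
        C, S = np.sum(c ** 2), np.sum(s ** 2)    # Eqs. (13.19), (13.20)
        h2[k] = R ** 2 / C + I ** 2 / S          # Eq. (13.16)
        # Logarithm of Eq. (13.14), up to an additive constant.
        log_p[k] = -0.5 * np.log(C * S) + 0.5 * (2 - n) * np.log(n * d2bar - h2[k])
    return h2, log_p

# Simulated, non-uniformly sampled data (assumed values).
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 32.0, size=32))
d = 2.0 * np.cos(2 * np.pi * 1.23 * t) + rng.normal(0.0, 1.0, size=t.size)

freqs = np.linspace(0.01, 2.0, 4000)             # avoids f = 0, where S(f) = 0
h2, log_p = bayes_lomb_scargle(t, d, freqs)
p = np.exp(log_p - log_p.max())                  # subtract the max before exponentiating
p /= p.sum() * (freqs[1] - freqs[0])             # normalize on the grid
f_peak = freqs[np.argmax(p)]
print(f"posterior peak at f = {f_peak:.3f} Hz")
```

Working with the logarithm of Equation (13.14) is a deliberate choice: the exponent $(2-N)/2$ makes the raw expression span many orders of magnitude across the frequency grid.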
13.5.1 Relationship to Lomb–Scargle periodogram
If the sinusoidal signal is known to be stationary ($Z(t_i)$ is a constant) and the data are entirely real, then Equations (13.17) to (13.20) greatly simplify. In this case, the quantity $\overline{h^2}$ given by Equation (13.16) corresponds to the Lomb–Scargle periodogram; however, we now see this statistic in a new light. The Bayesian expression for $p(f|D,I)$ (Equation (13.14)) involves a nonlinear processing of the Lomb–Scargle periodogram, analogous to the nonlinear processing of the Schuster periodogram in Equation (13.5). In fact, Bretthorst showed that for uniformly sampled quadrature data and a stationary sinusoid, the Lomb–Scargle periodogram reduces to a Schuster periodogram, the power spectrum of the data. For real data, Equations (13.2) and (13.5) are only approximately true. As we will demonstrate in Figure 13.7, Equation (13.5) can provide an excellent approximation to $p(f|D,I)$ for uniformly sampled real data and a stationary sinusoid, and the Schuster periodogram is much faster to compute than the Lomb–Scargle periodogram. Equations (13.14) to (13.20) provide the exact answer for $p(f|D,I)$ for a much wider range of problems and involve a generalized version of the Lomb–Scargle statistic.

13.5.2 Example
In this example, we compare the Schuster periodogram to the Lomb–Scargle periodogram for the time series simulation, involving a stationary sinusoid model and uniformly sampled real data, that we used in Section 13.2.1. In the first case, which is illustrated in Figure 13.7, the data are uniformly sampled. The top panel shows the time series and the two middle panels show the Fourier power spectral density (Schuster periodogram) and Lomb–Scargle periodogram of this time series. The corresponding Bayesian $p(f|D,I)$ probability densities are shown in the bottom two panels. Clearly, for this example, the Schuster periodogram provides an excellent approximation to the Lomb–Scargle periodogram, and is much faster to compute. Recall that the width of the spectral peak in the Bayesian $p(f|D,I)$ depends on the signal-to-noise ratio (SNR). Even for a moderate SNR, the spectral peak can become very narrow, requiring a large number of evaluations of the Lomb–Scargle statistic at very closely spaced frequencies.

[Figure 13.7: five panels. Top: "Simulation [0.8 sin 2πft + noise (σ = 1)]", Signal strength versus Time axis. Middle: "Fourier Power Spectral Density" and "Lomb–Scargle Periodogram", Power density versus Frequency (0 to 0.5). Bottom: "Bayesian Probability" and "Bayesian Lomb–Scargle Probability", Probability density versus Frequency (0 to 0.5).]
Figure 13.7 The middle two panels compare the Fourier power spectral density (Schuster periodogram) and Lomb–Scargle periodogram for the uniformly sampled time series simulation shown in the top panel. The bottom two panels compare the Bayesian counterparts for the same time series.
The second example, which is illustrated in Figure 13.8, makes use of the same time series, but has 14 samples removed, creating gaps in the otherwise uniform sampling. These gaps can clearly be seen in the top right panel. To compute the Schuster periodogram, some assumption must be made regarding the data in the gaps, to achieve the uniform sampling required for the calculation of the FFT. In the top left panel, the missing data have been filled in with values equal to the time series average. In the calculation of the Lomb–Scargle periodogram, only the actual data are used. The two bottom panels again illustrate the corresponding Bayesian $p(f|D,I)$ probability densities. In this case, it is clear that the Bayesian generalization of the Lomb–Scargle periodogram does a better job. In the latter example, the data are uniformly sampled apart from the gaps. In the next section we will explore the issue of non-uniform sampling in greater detail.
[Figure 13.8: six panels. Top: two panels titled "Simulation [0.8 sin 2πft + noise (σ = 1)]", Signal strength versus Time axis. Middle: "Fourier Power Spectral Density" and "Lomb–Scargle Periodogram", Power density versus Frequency (0 to 0.5). Bottom: "Bayesian Probability" and "Bayesian Lomb–Scargle Probability", Probability density versus Frequency (0 to 0.5).]
Figure 13.8 The middle two panels compare the Fourier power spectral density (Schuster periodogram) and Lomb–Scargle periodogram for a time series with significant data gaps, as shown in the top right panel. The bottom two panels compare the Bayesian counterparts for the same time series.
13.6 Non-uniform sampling
In many cases, the data available are not uniformly sampled in the coordinate of interest, e.g., time. In some cases this introduces complications, but on the flip side, there is a distinct advantage. Non-uniform sampling can eliminate the common problem of aliasing (Bretthorst 1988, 2000a). In this section, we explore this effect with a demonstration. We start with a uniform time series of 32 points containing a sinusoidal signal plus additive independent Gaussian noise. The data are described by the following equation:
$$d_k = 2\cos(2\pi f k T) + \text{noise}\ (\sigma = 1), \qquad \text{with } f = 1.23\ \text{Hz}, \tag{13.21}$$
where $T$ is the sample interval and $k$ is an index running over the 32 points. In this demonstration, $T = 1$ s. At 1.23 Hz, the signal frequency is well above the Nyquist frequency $= 1/(2T) = 0.5$ Hz. In Figure 13.9 we demonstrate how the aliasing arises. The top panel shows the Fourier Transform (FT) of the sampling. It is convenient to show both positive and negative frequencies, which arise in the mathematics of the FT. The middle panel shows the FT of the signal together with the Nyquist frequency. The bottom panel shows the resulting convolution. There are three aliased signals at $f = 0.23$, $f = 0.77$, and $f = 1.77$, only one of which, at $f = 0.23$, is below the Nyquist frequency. For a deeper understanding of this figure, the reader is referred to Appendix B.5 and B.6.

We start by computing the Fourier transform and Bayesian posterior probability density for the signal frequency of the initial uniformly sampled data. We then replace some of the samples by samples taken at times that are not an integer multiple of $T$, and explore how the spectrum is altered. Since we will be considering non-uniform samples, we make use of the Lomb–Scargle periodogram, discussed in Section 13.5, to compute the power spectrum. We also display the Bayesian posterior probability density for the signal frequency using Bretthorst's Bayesian generalization of the Lomb–Scargle algorithm, also discussed in Section 13.5. Figure 13.10 shows the evolution of both quantities as the number of non-uniform samples is increased. In the top two panels (a), the original signal frequency at 1.23 Hz is clearly seen together with three aliased signals. In the second row (b), one uniformly sampled data point has been replaced by one non-uniform sample. The Lomb–Scargle periodogram shows only a slight change but, remarkably, the Bayesian probability density has clearly distinguished the real signal at 1.23 Hz.
As more and more uniformly sampled points are replaced, the amplitudes of the aliases in the Lomb–Scargle periodogram decrease. Notice that for the non-uniform sampling used in this demonstration, no alias occurs up to frequencies 4 times the effective Nyquist frequency. Of course, it must be true that the aliasing phenomenon returns at sufficiently high frequencies. If the sampling times $t_k$, although non-uniform, are all integer multiples of some small interval $\delta t$, then signals at frequencies $> 1/(2\,\delta t)$ will still be aliased.
[Figure 13.9: three panels versus frequency f, each spanning negative to positive frequencies. Top: FT of the sampling, with components spaced 1/T apart. Middle (after "CONVOLVE"): FT of the signal, with components at −1.23 and 1.23, and the Nyquist frequency marked at −fN and fN. Bottom: the result, with components marked at −1.77, −1.23, −0.77, −0.23, 0.23, 0.77, 1.23 and 1.77.]
Figure 13.9 How aliasing arises. Uniform sampling at an interval $T$ in the time domain corresponds to convolution in the frequency domain. The top panel shows the Fourier Transform (FT) of the sampling. The middle panel shows the FT of the signal together with the Nyquist frequency. The bottom panel shows the resulting convolution. There are 3 aliased signals at $f = 0.23$, $f = 0.77$, and $f = 1.77$, only one of which, at $f = 0.23$, is below the Nyquist frequency.
Consideration of Figure 13.11 shows why aliasing does not occur for the non-uniform sampling. In this example, we first generated four data points (filled boxes) by sampling a 1.23 Hz sinusoidal signal, with no noise, at one-second intervals. The figure shows four sinusoids, corresponding to the 1.23 Hz signal and the 3 aliases at 0.23, 0.77 and 1.77 Hz, which all pass through these uniformly sampled data points. Next, we replaced the first uniform sample by one which is non-uniformly sampled in time (star). In this case, only the 1.23 Hz sinusoid passes through all four points.

There is clearly an advantage to employing non-uniform sampling, which needs to be considered as part of the experimental design. As Figure 13.10 clearly demonstrates, even the addition of a small number of non-uniform samples (only one required in this 32-point time series) to an otherwise uniformly sampled data set is sufficient to strongly suppress aliasing in the Bayesian posterior probability density for signal frequency.
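The four-point argument can be checked numerically. The sketch below assumes a pure cosine, $\cos(2\pi f t)$, for all four frequencies and an arbitrary replacement time of 0.4 s for the first sample; both choices are illustrative assumptions and are not read off the figure.

```python
import numpy as np

freqs = [1.23, 0.23, 0.77, 1.77]          # true signal and its three aliases (Hz)
t_uniform = np.arange(4.0)                # four samples at 1 s intervals

def sampled(f, t):
    """Noise-free cosine of frequency f evaluated at the sample times t."""
    return np.cos(2.0 * np.pi * f * t)

# With uniform sampling every frequency produces the same four data values.
uniform_vals = np.array([sampled(f, t_uniform) for f in freqs])
all_agree_uniform = bool(np.allclose(uniform_vals, uniform_vals[0]))

# Replace the first sample time by a non-uniform one (hypothetical value).
t_mixed = t_uniform.copy()
t_mixed[0] = 0.4
mixed_vals = np.array([sampled(f, t_mixed) for f in freqs])
aliases_still_agree = [bool(np.allclose(row, mixed_vals[0])) for row in mixed_vals[1:]]

print(all_agree_uniform, aliases_still_agree)
```

Once a single sample time is moved off the uniform grid, none of the alias frequencies reproduces the data generated by the 1.23 Hz signal, so the data can now discriminate between them.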
[Figure 13.10: four rows of paired panels, each versus Frequency from 0 to 2. Left column: "Lomb–Scargle Periodogram", Power density. Right column: "Bayesian Lomb–Scargle", Probability density. Rows: (a) Uniform sampling; (b) 1 non-uniform sample out of 32; (c) 5 non-uniform samples out of 32; (d) 16 non-uniform samples out of 32.]
Figure 13.10 Evolution of the Lomb–Scargle periodogram (left) and Bretthorst's Bayesian generalization of the Lomb–Scargle periodogram (right), with increasing number of non-uniform samples in the time series. Notice how sensitive the Bayesian result is to a change of only one sample from a uniform interval to a non-uniform interval.
[Figure 13.11: Signal amplitude, from −2 to 2, versus Time, from 0 to 4.]
Figure 13.11 An illustration of how four different frequencies can all pass through the same set of four uniformly sampled data points (boxes), but only one passes through all the points when one sample is replaced by a non-uniform sample (star).
13.7 Problems
Table 13.1 is a simulated time series consisting of a single sinusoidal signal with additive IID Gaussian noise. In this problem, you will compare the usual one-sided power spectral density discussed in Appendix B to the Bayesian posterior probability density for the frequency of the model sinusoid.

Table 13.1 The table contains 64 samples of a simulated time series consisting of a single sinusoidal signal with additive IID Gaussian noise.

 #    mK       #    mK       #    mK       #    mK
 1    0.474    17   0.865    33   0.225    49   0.369
 2    0.281    18   0.206    34   1.017    50   0.695
 3    1.227    19   0.926    35   0.817    51   1.291
 4    1.523    20   2.294    36   2.064    52   0.978
 5    0.831    21   0.786    37   0.103    53   0.592
 6    0.978    22   0.522    38   1.878    54   0.986
 7    0.169    23   1.04     39   0.625    55   1.005
 8    0.04     24   0.181    40   1.418    56   1.268
 9    0.76     25   1.47     41   0.464    57   0.571
 10   0.847    26   1.837    42   1.182    58   1.128
 11   0.106    27   0.523    43   1.319    59   0.64
 12   1.814    28   0.605    44   1.354    60   0.144
 13   1.16     29   1.595    45   1.784    61   1.468
 14   0.249    30   0.413    46   0.989    62   0.71
 15   1.054    31   1.275    47   1.52     63   1.486
 16   0.359    32   1.644    48   1.239    64   0.129
Part 1: Fast Fourier Transform and PSD
a) Use an FFT to determine the one-sided power spectral density (PSD), as defined by Equation (B.102) in Appendix B. Plot both the raw data and your spectrum and determine the period of the strongest peak.
b) To obtain a more accurate determination of the peak in the PSD, add zeros to the end of the input data so that the total data set (data + appended zeros) is 1024 points and recompute the PSD. Note: although the number of spectral points will increase, the $1/N$ normalization term in Equation (B.102) still refers to the original number of data points. Plot the new PSD. Do you expect the width of the peak to be affected by zero padding?

Part 2: Bayesian posterior probability of signal frequency
The Bayesian posterior probability of the signal frequency, assuming a model of a single harmonic signal plus independent Gaussian noise, is given by Equation (13.2). If the noise is not well understood, then it is safer to use the Student's t form of Equation (13.5), which treats anything that cannot be fitted by the model as noise and leads to more conservative parameter estimates. Since we are evaluating $p(n|D,I)$ at $n$ discrete frequencies, we rewrite Equation (13.5) as
$$p(n|D,I) = \frac{\left[1 - \dfrac{2C(n)}{N_{\rm orig}\,\overline{d^2}}\right]^{\frac{2-N_{\rm orig}}{2}}}{\sum_{n=0}^{N_{\rm zp}/2}\left[1 - \dfrac{2C(n)}{N_{\rm orig}\,\overline{d^2}}\right]^{\frac{2-N_{\rm orig}}{2}}}, \tag{13.22}$$
where $\overline{d^2} = \frac{1}{N_{\rm orig}}\sum_i d_i^2$.
In Equation (13.22), the frequency associated with any particular value of $n$ is given by $f_n = n/(NT)$, where $T$ is the sample interval in time and $N$ is the total number of samples. The value $n = 0$ corresponds to zero frequency. The quantity $C(n)$ is the positive frequency part of the two-sided periodogram (two-sided PSD) given by Equation (13.6), which we rewrite as
$$C(n) = \frac{|H_n|^2}{N}, \qquad \text{for } n = 0, 1, \ldots, \frac{N}{2}. \tag{13.23}$$
In general, $p(n|D,I)$ will be very narrow when $C(n)/\sigma^2 > 1$ because of the exponentiation occurring in Equation (13.2). Thus, to accurately define $p(n|D,I)$, we need to zero pad the FFT to obtain a sufficient density of $H_n$ points to accurately define the $p(n|D,I)$ peak. Zero padding is discussed in detail in Appendix B. In the zero padding case, Equation (13.23) becomes
$$C(n) = \frac{|H_n|^2}{N_{\rm orig}}, \qquad \text{for } n = 0, 1, \ldots, \frac{N_{\rm zp}}{2}, \tag{13.24}$$
where $N_{\rm orig}$ is the number of original time series samples and $N_{\rm zp}$ is the total number of points including the added zeros.
(a) Compute and plot $p(n|D,\sigma,I)$ from a zero padded FFT of the time series given above. Assume the variance of the data set = 1.
(b) Measure the width of the peak at half height of $p(n|D,\sigma,I)$ and compare it to the width of the PSD peak at half height.
(c) Plot the natural logarithm of the Bayesian $p(n|D,I)$ and compare its shape to the PSD.

Note: Mathematica uses a slightly different definition of $H_n$ to that given in Equation (B.51), which we designate by $[H_n]_{\rm Math}$. The modified version of Equation (13.24) is given by:
$$C(n) = \frac{N_{\rm zp}}{N_{\rm orig}}\,\left|[H_n]_{\rm Math}\right|^2, \qquad \text{for } n = 0, 1, \ldots, \frac{N_{\rm zp}}{2}. \tag{13.25}$$
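As a rough guide to the mechanics of Equations (13.22) and (13.24), the following Python sketch evaluates the Student's t form on a zero padded FFT grid. It uses a simulated series rather than Table 13.1, and the amplitude, frequency and random seed are assumptions made for illustration. It relies on NumPy's unnormalized forward FFT, which is assumed here to match the $H_n$ convention of Equation (B.51); the Mathematica convention instead needs the factor in Equation (13.25).

```python
import numpy as np

# Simulated series (assumed values): 64 samples, T = 1, sinusoid at f = 0.1.
rng = np.random.default_rng(1)
n_orig, n_zp, T = 64, 1024, 1.0
f_true = 0.1
k = np.arange(n_orig)
d = 2.0 * np.sin(2 * np.pi * f_true * k * T) + rng.normal(0.0, 1.0, n_orig)

# rfft with n=n_zp appends zeros and returns H_n for n = 0 .. n_zp/2.
H = np.fft.rfft(d, n=n_zp)
C = np.abs(H) ** 2 / n_orig                     # Eq. (13.24)
d2bar = np.mean(d ** 2)

# Student's t form, Eq. (13.22), evaluated through logarithms.
arg = 1.0 - 2.0 * C / (n_orig * d2bar)          # positive for this data set
log_p = 0.5 * (2 - n_orig) * np.log(arg)
p = np.exp(log_p - log_p.max())
p /= p.sum()                                    # denominator of Eq. (13.22)

freqs = np.arange(n_zp // 2 + 1) / (n_zp * T)   # f_n on the zero padded grid
f_peak = freqs[np.argmax(p)]
print(f"p(n|D,I) peaks at f = {f_peak:.4f}")
```

Zero padding only interpolates the spectrum on a finer frequency grid; it adds no new information, which is why the normalization still uses the original number of samples.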
14 Bayesian inference with Poisson sampling
14.1 Overview
In many experiments, the basic data consist of a set of discrete events distributed in space, time, energy, angle or some other coordinate. They include macroscopic events like a traffic accident or the location of a star. They also include microscopic events such as the detection of individual particles or photons in time or position. In experiments of this kind, our prior information often leads us to model the probability of the data (likelihood function) with a Poisson distribution. See Section 4.7 for a derivation of the Poisson distribution, and Section 5.7.2 for the relationship between the binomial and Poisson distributions. For temporally distributed events, the Poisson distribution is given by
$$p(n|r,I) = \frac{(rT)^n e^{-rT}}{n!}. \tag{14.1}$$
It relates the probability that $n$ discrete events will occur in some time interval $T$ to a positive real-valued Poisson process event rate $r$. When $n$ and $rT$ are large, the Poisson distribution can be accurately approximated by a Gaussian distribution. Here, we will be concerned with situations where the Gaussian approximation is not good enough and we must work directly with the Poisson distribution. In this chapter, we employ Bayes' theorem to solve the following inverse problem: compute the posterior PDF for $r$ given the data $D$ and prior information $I$. We divide this into three common problems:
1. How to infer a Poisson rate $r$.
2. How to infer a signal in a known background.
3. Analysis of ON/OFF data, where ON is the signal + background and OFF is just the background. The background is only known imprecisely from the OFF measurement.
The treatment is similar to that given by Loredo (1992), but also includes a treatment of the source detection question in the ON/OFF measurement problem. In the above three problems, the Poisson rate is assumed to be constant in the ON or OFF source position. In Section 14.5, we consider a simple radioactive decay problem in which $r$ varies significantly over the duration of the data.
14.2 Infer a Poisson rate
The simplest problem is to infer the rate $r$ from a single measurement of $n$ events. From Bayes' theorem:
$$p(r|n,I) = \frac{p(r|I)\,p(n|r,I)}{p(n|I)}. \tag{14.2}$$
The prior information $I$ must specify both $p(r|I)$ and the likelihood function $p(n|r,I)$. In this case, the latter is just the Poisson distribution.
$$I \equiv \begin{cases} \text{likelihood}: & p(n|r,I) = (rT)^n e^{-rT}/n! \\ \text{prior}: & p(r|I) = \ ? \end{cases}$$
Our first guess at $p(r|I)$ is a Jeffreys prior, since $r$ is a scale parameter. However, the scale invariance argument is not valid if $r$ may vanish. Instead, we adopt a uniform prior for $r$ based on the following argument: intuition suggests ignorance of $r$ corresponds to not having any prior preference for seeing any particular number of counts, $n$. In situations where it is desirable to use the Poisson distribution, the prior range for $n$ is frequently small, so it is reasonable to use a uniform prior:
$$p(n|I) = \text{constant}.$$
But $p(n|I)$ is also the denominator in Equation (14.2), so we can write
$$p(n|I) = \int_0^\infty dr\; p(r|I)\,p(n|r,I) = \frac{1}{n!\,T}\int_0^\infty d(rT)\; p(r|I)\,(rT)^n e^{-rT}. \tag{14.3}$$
For $p(n|I)$ to be constant, it is necessary that
$$\int_0^\infty d(rT)\; p(r|I)\,(rT)^n e^{-rT} \propto n!$$
but
$$\int_0^\infty dx\; x^n e^{-x} = \Gamma(n+1) = n! \tag{14.4}$$
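which implies that $p(r|I) = $ constant. Use
$$p(r|I) = \frac{1}{r_{\max}}, \qquad 0 \le r \le r_{\max}. \tag{14.5}$$
Then
$$p(n|I) = \frac{1}{T r_{\max}}\,\frac{1}{n!}\int_0^{r_{\max}T} d(rT)\,(rT)^n e^{-rT} = \frac{1}{T r_{\max}}\,\frac{\gamma(n+1, r_{\max}T)}{n!}, \tag{14.6}$$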
where $\gamma(n,x) = \int_0^x dy\; y^{n-1} e^{-y}$ is one form of the incomplete gamma function (footnote 1). Now substitute Equations (14.1), (14.5), and (14.6) into Equation (14.2) to obtain the posterior $p(r|n,I)$:
$$p(r|n,I) = \frac{T(rT)^n e^{-rT}}{n!}\;\frac{n!}{\gamma(n+1, r_{\max}T)}, \qquad 0 \le r \le r_{\max}. \tag{14.7}$$
For $r_{\max}T \gg n$, $\gamma(n+1, r_{\max}T) \simeq \Gamma(n+1) = n!$ and Equation (14.7) simplifies to
$$p(r|n,I) = \frac{T(rT)^n e^{-rT}}{n!}, \qquad r \ge 0. \tag{14.8}$$
14.2.1 Summary of posterior
$$p(r|n,I) = \frac{T(rT)^n e^{-rT}}{n!}, \qquad r \ge 0$$
mode: $r = n/T$;
mean: $\langle r \rangle = (n+1)/T$;
sigma: $\sigma_r = \sqrt{n+1}/T$.

Figure 14.1 illustrates the shape of the posterior $p(r|n,I)$, divided by the time interval $T$, for four different choices of $n$ ranging from $n = 0$ to 100. In each case, the count interval $T = 1$ s. As $n$ increases, $p(r|n,I)$ becomes more symmetrical and gradually approaches a Gaussian (shown by the dotted line) with the same mode and standard deviation. For $n = 10$, the Gaussian approximation is still a poor fit. By $n = 100$, the Gaussian approximation provides a good fit near the mode but still departs noticeably in the wings.

The 95% credible region can be found by solving for the two values $r_{\rm high}$ and $r_{\rm low}$ which satisfy the two conditions:
$$p(r_{\rm high}|n,I) = p(r_{\rm low}|n,I),$$
and
$$\int_{r_{\rm low}}^{r_{\rm high}} p(r|n,I)\,dr = 0.95.$$
For $n = 1$ the credible region is
$$\frac{0.042}{T} \le r \le \frac{4.78}{T}, \qquad p = 0.95. \tag{14.9}$$

1 See Press et al. (1992).
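The summary quantities and the credible region can be checked numerically. The sketch below tabulates Equation (14.8) on a fine grid and builds the highest-density region by accumulating grid points in order of decreasing density. The function names, the grid spacing and the upper grid limit are illustrative assumptions, and the grid method is only one of several ways to solve the two conditions above.

```python
import numpy as np
from math import lgamma

def poisson_rate_posterior(r, n, T=1.0):
    """Eq. (14.8), evaluated through logarithms to avoid overflow for large n."""
    x = r * T
    return np.exp(np.log(T) + n * np.log(x) - x - lgamma(n + 1))

def hpd_region(r, p, level=0.95):
    """Highest-density region of a unimodal PDF tabulated on a uniform grid."""
    dr = r[1] - r[0]
    order = np.argsort(p)[::-1]                 # grid points, densest first
    mass = np.cumsum(p[order]) * dr
    kept = order[: np.searchsorted(mass, level) + 1]
    return r[kept].min(), r[kept].max()

n, T = 1, 1.0
dr = 1e-4
r = (np.arange(300_000) + 0.5) * dr             # midpoints on (0, 30); avoids r = 0
p = poisson_rate_posterior(r, n, T)

mode = r[np.argmax(p)]
mean = np.sum(r * p) * dr
sigma = np.sqrt(np.sum((r - mean) ** 2 * p) * dr)
r_low, r_high = hpd_region(r, p)
print(mode, mean, sigma, r_low, r_high)
```

For $n = 1$ the mode, mean and standard deviation should agree with $n/T$, $(n+1)/T$ and $\sqrt{n+1}/T$, and the interval should land close to the one quoted in Equation (14.9).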
[Figure 14.1: four panels plotting $p(r|n,I)/T$ versus Rate r, for n = 0, n = 1, n = 10 and n = 100.]
Figure 14.1 The posterior PDF, $p(r|n,I)$, divided by the time interval $T$, plotted for four different values of $n$. For comparison, a Gaussian with the same mode and standard deviation is shown by the dotted curve for the $n = 10$ and $n = 100$ cases.
14.3 Signal + known background

In this case, the measured rate consists of two components, one due to a signal of interest, $s$, and the other a known background rate, $b$.

$$r = s + b, \qquad \begin{cases} s = \text{signal rate} \\ b = \text{known background rate.} \end{cases}$$

Since we are assuming the background rate is known, $p(s|n,b,I) = p(r|n,b,I)$. We can now use Equation (14.8) of the previous section for $p(r|n,b,I)$, and replace $r$ by $s + b$. The result is

$$p(s|n,b,I) = C\,\frac{T[(s+b)T]^n e^{-(s+b)T}}{n!} \tag{14.10}$$

$$C^{-1} = \frac{e^{-bT}}{n!} \int_0^\infty d(sT)\,(s+b)^n T^n e^{-sT}. \tag{14.11}$$
The constant $C$ ensures that the area under the probability density function $= 1$. Using a binomial expansion of $(s+b)^n$ (see Equation (D.7) in Appendix D), we can arrive at the following simple expression for $C^{-1}$:

$$C^{-1} = \sum_{i=0}^{n} \frac{(bT)^i e^{-bT}}{i!}. \tag{14.12}$$
Equation (14.10) was proposed by Helene (1983, 1984) as a Bayesian solution for analyzing multichannel spectra in nuclear physics.
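A short Python sketch of Equations (14.10) and (14.12) follows. It is an illustration under our own assumptions: the function names, the midpoint integrator, and the values n = 2, b = 4, T = 1 (fewer counts than the expected background) are not from the text. The numerical integral is there only to confirm that the constant from Equation (14.12) normalizes the posterior over s ≥ 0.

```python
import math

def normalization(n, b, T):
    """C from Equation (14.12): 1/C = sum_{i=0}^{n} (bT)^i exp(-bT) / i!."""
    inv_c = sum((b * T)**i * math.exp(-b * T) / math.factorial(i)
                for i in range(n + 1))
    return 1.0 / inv_c

def signal_posterior(s, n, b, T):
    """Equation (14.10); the posterior is zero for negative s."""
    if s < 0.0:
        return 0.0
    x = (s + b) * T
    return normalization(n, b, T) * T * x**n * math.exp(-x) / math.factorial(n)

def integrate(f, a, b, steps=60000):
    """Midpoint-rule integral of f over [a, b] (an illustrative helper)."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))

# Illustrative values only: n = 2 counts observed, known background b = 4 per second.
n, b, T = 2, 4.0, 1.0
area = integrate(lambda s: signal_posterior(s, n, b, T), 0.0, 60.0)
print("C =", normalization(n, b, T))
print("area under p(s|n,b,I):", area)
```

Even though n is smaller than bT in this example, the posterior stays confined to s ≥ 0 and peaks at s = 0.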
14.4 Analysis of ON/OFF measurements

In this section, we want to infer the source rate, $s$, when the background rate, $b$, is imprecisely measured. This is called an ON/OFF measurement.

OFF → detector pointed off source to measure $b$
ON → detector pointed on source to measure $s + b$.

The usual approach is to assume

OFF → $\hat{b} \pm \sigma_b$
ON → $\hat{r} \pm \sigma_r$,

where $\hat{b} = N_{\rm off}/T$ and $\sigma_b = \sqrt{N_{\rm off}}/T$ and $\hat{r} = N_{\rm on}/T$ and $\sigma_r = \sqrt{N_{\rm on}}/T$. Then $\hat{s} = \hat{r} - \hat{b}$ and the variance $\sigma_s^2 = \sigma_r^2 + \sigma_b^2$. This procedure works well for the Poisson case provided both $s$ and $b$ are large enough that the Poisson is well approximated by a Gaussian. But when either or both of the rates are small, the procedure fails. This can lead to negative estimates of $s$ and/or error bars extending into non-physical negative values. This is a big problem in $\gamma$-ray and ultra-high energy astrophysics, where data are very sparse.

First consider the OFF measurement:

$$p(b|N_{\rm off}, I_b) = \frac{T_{\rm off}(bT_{\rm off})^{N_{\rm off}} e^{-bT_{\rm off}}}{N_{\rm off}!}. \tag{14.13}$$

For the ON measurement, we can write the joint probability of the source and background rate:

$$p(s,b|N_{\rm on},I) = \frac{p(s,b|I)\,p(N_{\rm on}|s,b,I)}{p(N_{\rm on}|I)} = \frac{p(s|b,I)\,p(b|I)\,p(N_{\rm on}|s,b,I)}{p(N_{\rm on}|I)}. \tag{14.14}$$
Note: the prior information, $I$, includes information about the background OFF measurement in addition to the model $M_{s+b}$, which asserts that the Poisson rate in the ON measurement is equal to $s + b$. We can express this symbolically by $I = N_{\rm off}, I_b, M_{s+b}$. In the parameter estimation part of the problem, we will estimate the value of the source rate $s$ in the model $M_{s+b}$. Following this, we will evaluate a model selection problem to compare the probability of model $M_{s+b}$, which assumes a source is present, to the simpler model $M_b$, which asserts that the Poisson rate in the ON source measurement is equal to $b$, i.e., no source is present.
14.4.1 Estimating the source rate

The likelihood for the ON measurement is the Poisson distribution for a source with strength $s + b$:

$$p(N_{\rm on}|s,b,I) = \frac{[(s+b)T_{\rm on}]^{N_{\rm on}} e^{-(s+b)T_{\rm on}}}{N_{\rm on}!}. \tag{14.15}$$
We will again assume a constant prior for $s$, so we write $p(s|b,I) = 1/s_{\max}$. The prior for $b$ is simply the posterior from the background measurement, given by Equation (14.13). Combining Equations (14.13), (14.14) and (14.15), we can compute the joint posterior for $s$ and $b$. To find the posterior for $s$ alone, independent of the background, we just marginalize with respect to $b$.

$$p(s|N_{\rm on},I) = \int_0^{b_{\max}} db\; p(s,b|N_{\rm on},I). \tag{14.16}$$
The exact integral can be calculated after expanding the binomial, $(s+b)^{N_{\rm on}}$, and making use of the incomplete gamma function to evaluate the integrals, as we did in Section 14.2. The details of this calculation are given in Appendix D. The result is

$$p(s|N_{\rm on},I) = \sum_{i=0}^{N_{\rm on}} C_i\, \frac{T_{\rm on}(sT_{\rm on})^i e^{-sT_{\rm on}}}{i!}, \tag{14.17}$$

where

$$C_i \equiv \frac{\left(1 + \dfrac{T_{\rm off}}{T_{\rm on}}\right)^i \dfrac{(N_{\rm on}+N_{\rm off}-i)!}{(N_{\rm on}-i)!}}{\sum_{j=0}^{N_{\rm on}} \left(1 + \dfrac{T_{\rm off}}{T_{\rm on}}\right)^j \dfrac{(N_{\rm on}+N_{\rm off}-j)!}{(N_{\rm on}-j)!}}.$$

Note:

$$\sum_{i=0}^{N_{\rm on}} C_i = 1. \tag{14.18}$$
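Equation (14.17) and the weights $C_i$ translate directly into a few lines of code. The Python sketch below is illustrative only: the function names and the midpoint integrator are our own, and the example counts (N_on = 8, N_off = 3, T_on = T_off = 1) are chosen simply as a convenient small case.

```python
import math

def weights(n_on, n_off, t_on, t_off):
    """The C_i of Equation (14.17), for i = 0 .. n_on."""
    ratio = 1.0 + t_off / t_on
    terms = [ratio**i * (math.factorial(n_on + n_off - i) // math.factorial(n_on - i))
             for i in range(n_on + 1)]
    total = sum(terms)
    return [t / total for t in terms]

def source_posterior(s, n_on, n_off, t_on, t_off):
    """Equation (14.17); zero for non-physical negative s."""
    if s < 0.0:
        return 0.0
    x = s * t_on
    c = weights(n_on, n_off, t_on, t_off)
    return sum(ci * t_on * x**i * math.exp(-x) / math.factorial(i)
               for i, ci in enumerate(c))

def integrate(f, a, b, steps=4000):
    """Midpoint-rule integral of f over [a, b] (an illustrative helper)."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))

c = weights(8, 3, 1.0, 1.0)
area = integrate(lambda s: source_posterior(s, 8, 3, 1.0, 1.0), 0.0, 60.0)
print("sum of C_i:", sum(c))
print("area under p(s|N_on, I):", area)
```

Because each term in Equation (14.17) is itself a normalized density in s, the posterior integrates to one whenever the weights sum to one, which is what Equation (14.18) guarantees.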
[Figure 14.2: curves of p(s|N_on, I) versus signal rate (s^-1) for four cases: N_on = 8, N_off = 3, T_on = 1, T_off = 1; N_on = 8, N_off = 36, T_on = 1, T_off = 12; N_on = 24, N_off = 9, T_on = 3, T_off = 3; N_on = 2, N_off = 3, T_on = 1, T_off = 1.]

Figure 14.2 The posterior probability density of the source rate, $p(s|N_{\rm on},I)$, plotted for four different combinations of $N_{\rm on}$, $N_{\rm off}$, $T_{\rm on}$, and $T_{\rm off}$.
Figure 14.2 shows plots of the posterior probability density of the source rate $p(s|N_{\rm on},I)$ for four different combinations of $N_{\rm on}$, $N_{\rm off}$, $T_{\rm on}$, and $T_{\rm off}$. Notice that even in the case that $N_{\rm on} < N_{\rm off}$, the value of $p(s|N_{\rm on},I)$ is always zero for non-physical negative values of $s$. It is also clear that increasing the ON measurement time and/or the OFF measurement time sharpens the definition of $p(s|N_{\rm on},I)$.

We can gain a better understanding of the meaning of the complicated $C_i$ term in Equation (14.17) by evaluating $p(s|N_{\rm on},I)$ for a restatement of the problem. The background information $I$ includes $N_{\rm off}$, the number of background events measured in the off-source measurement, as well as $T_{\rm on}$ and $T_{\rm off}$. For the state of information corresponding to $N_{\rm on}, I$, we can use Bayes' theorem to compute $p(i|N_{\rm on},I)$, the probability that $i$ of the on-source events are due to the source and $N_{\rm on} - i$ are due to the background. Clearly, $i$ is an integer that can take on values from 0 to $N_{\rm on}$. We can then obtain the posterior probability for $s$, from the joint probability $p(s,i|N_{\rm on},I)$, by marginalizing over $i$ as follows:

$$p(s|N_{\rm on},I) = \sum_{i=0}^{N_{\rm on}} p(s,i|N_{\rm on},I) = \sum_{i=0}^{N_{\rm on}} p(i|N_{\rm on},I)\,p(s|i,N_{\rm on},I) = \sum_{i=0}^{N_{\rm on}} p(i|N_{\rm on},I)\,\frac{T_{\rm on}(sT_{\rm on})^i e^{-sT_{\rm on}}}{i!}, \tag{14.19}$$
where we have used Equation (14.8) to evaluate $p(s|i,N_{\rm on},I)$. Comparing Equation (14.17) with (14.19), we can write $C_i = p(i|N_{\rm on},I)$, the probability that $i$ of the ON measurement events are due to the source. We are now able to interpret Equation (14.17) in the following useful way: Bayes' theorem estimates $s$ by taking a weighted average of the posteriors one would obtain attributing $i = 0, 1, 2, \ldots, N_{\rm on}$ events to the source. The weight $C_i$ is equal to the probability of attributing $i$ of the on-source events to the source, or equivalently, attributing $N_{\rm on} - i$ events to the background, assuming $M_{s+b}$ is true.

Now suppose our question changes from estimating the source rate, $s$, to the question of how confidently we can claim to have detected the source in the ON measurement. We might be tempted to reason as follows: for the source to have been detected in the ON measurement, at least one of the $N_{\rm on}$ photons must have been from the source. The probability of attributing at least one of the $N_{\rm on}$ photons to the source is just the sum of the $C_i$ terms for $i = 1$ to $N_{\rm on}$, which is given by

$$p(i \ge 1|N_{\rm on},I) = \frac{\sum_{i=1}^{N_{\rm on}} \left(1 + \dfrac{T_{\rm off}}{T_{\rm on}}\right)^i \dfrac{(N_{\rm on}+N_{\rm off}-i)!}{(N_{\rm on}-i)!}}{\sum_{j=0}^{N_{\rm on}} \left(1 + \dfrac{T_{\rm off}}{T_{\rm on}}\right)^j \dfrac{(N_{\rm on}+N_{\rm off}-j)!}{(N_{\rm on}-j)!}}. \tag{14.20}$$

In Figure 14.3, we have plotted $p(i \ge 1|N_{\rm on},I)$ versus $N_{\rm on}$ for two different background OFF measurements. In the first case, $N_{\rm off} = 3$ counts in $T_{\rm off} = 1$ s. In the second case, $N_{\rm off} = 36$ counts in $T_{\rm off} = 12$ s. In many experiments, the cost and/or effort required to obtain ON measurements (e.g., particle accelerator beam ON) is much greater than for OFF measurements. As our knowledge of the background rate improves, we see that $p(i \ge 1|N_{\rm on},I)$ decreases for $N_{\rm on} \le 3$, the expected number of background photons, and increases above this. For $N_{\rm on} = 8$ and $N_{\rm off} = 3$ counts in $T_{\rm off} = 1$ s, the probability is 95.6%. This rises to 98.9% for $N_{\rm off} = 36$ counts in $T_{\rm off} = 12$ s.

What is wrong with setting the probability of source detection equal to $p(i \ge 1|N_{\rm on},I)$? The answer is that the $C_i$ probabilities are based on assuming that the model $M_{s+b}$ is true. They do not account for the extra complexity of $M_{s+b}$ when compared to an intrinsically simpler model, $M_b$, which asserts that the Poisson rate in the ON source measurement is equal to $b$, i.e., no source is present. To answer the source detection question, we need to compare the probabilities of these two models, which we do next. This approach automatically introduces an Occam penalty which penalizes $M_{s+b}$ for its greater complexity.

[Figure 14.3: p(i ≥ 1|N_on, I) versus N_on (counts), with one curve for N_off = 3, T_on = 1, T_off = 1 and one for N_off = 36, T_on = 1, T_off = 12.]

Figure 14.3 The probability of attributing at least one of the $N_{\rm on}$ photons to the source, $p(i \ge 1|N_{\rm on},I)$, assuming model $M_{s+b}$ is true, versus the number of counts in the ON measurement, for two different durations of the background OFF measurement.
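Equation (14.20) is simply one minus the weight $C_0$, so the two probabilities quoted above are easy to check. The Python sketch below does this; the function name is our own and the code is meant only as an illustration of the formula.

```python
import math

def prob_at_least_one(n_on, n_off, t_on, t_off):
    """Equation (14.20): probability, assuming M_{s+b}, that at least one
    of the n_on ON-source events is attributed to the source."""
    ratio = 1.0 + t_off / t_on
    terms = [ratio**i * (math.factorial(n_on + n_off - i) // math.factorial(n_on - i))
             for i in range(n_on + 1)]
    return sum(terms[1:]) / sum(terms)

# The two background measurements discussed in the text, with N_on = 8.
p_short = prob_at_least_one(8, 3, 1.0, 1.0)
p_long = prob_at_least_one(8, 36, 1.0, 12.0)
print("N_off = 3,  T_off = 1 s : %.3f" % p_short)
print("N_off = 36, T_off = 12 s: %.3f" % p_long)
```

These values depend only on the counts and the ratio of the two exposure times; the prior range $s_{\max}$ does not enter, which is one way to see that Equation (14.20) carries no Occam penalty.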
14.4.2 Source detection question

In the parameter estimation problem above, we assumed the truth of model $M_{s+b}$ and estimated the value of $s$. Here, we will address the source detection (model selection) problem: "Have we detected a source in the ON measurement?" To answer this, we will compute the odds ratio, $O_{\{s+b,b\}}$, of two models $M_{s+b}$ and $M_b$, which have the following meaning:

$M_b \equiv$ "the ON measurement is solely due to the Poisson background rate, $b$." The prior probability for $b$ is derived from the OFF measurement.

$M_{s+b} \equiv$ "the Poisson rate in the ON source measurement is equal to $s + b$." The prior for $s$ is a constant in the range 0 to $s_{\max}$. Again, the prior probability for $b$ is derived from the OFF measurement.

In our earlier work, our background information was represented by $I = N_{\rm off}, I_b, M_{s+b}$. In the current model selection problem, we will use the abbreviation

$$I_{\rm off} = N_{\rm off}, I_b. \tag{14.21}$$

According to Section 3.14, we can write the odds as

$$O_{\{s+b,b\}} = \frac{p(M_{s+b}|N_{\rm on},I_{\rm off})}{p(M_b|N_{\rm on},I_{\rm off})} = \frac{p(M_{s+b}|I_{\rm off})}{p(M_b|I_{\rm off})}\,\frac{p(N_{\rm on}|M_{s+b},I_{\rm off})}{p(N_{\rm on}|M_b,I_{\rm off})} = \text{prior odds} \times B_{\{s+b,b\}}, \tag{14.22}$$

where $B_{\{s+b,b\}}$ is the Bayes factor, the ratio of the global likelihoods for the two models. The calculation of the global likelihood for $M_{s+b}$ introduces an Occam factor that penalizes this model for its greater complexity when compared to $M_b$. The Occam factor depends directly on the prior uncertainty in the additional parameter $s$ (see Section 3.5). The details behind the calculation of the Bayes factor are again given in Appendix D.2. The result is

$$B_{\{s+b,b\}} \approx \frac{N_{\rm on}!}{s_{\max} T_{\rm on} (N_{\rm on}+N_{\rm off})!} \sum_{i=0}^{N_{\rm on}} \frac{(N_{\rm on}+N_{\rm off}-i)!}{(N_{\rm on}-i)!} \left(1 + \frac{T_{\rm off}}{T_{\rm on}}\right)^i. \tag{14.23}$$
[Figure 14.4: p(M_{s+b}|N_on, I_off) versus N_on (counts), with one curve for N_off = 3, T_on = 1, T_off = 1 and one for N_off = 36, T_on = 1, T_off = 12.]

Figure 14.4 The probability that model $M_{s+b}$ is true, $p(M_{s+b}|N_{\rm on},I_{\rm off})$, versus the number of counts in the ON measurement, for two different durations of the background OFF measurement.
In what follows, we will assume a prior odds ratio of 1, so $O_{\{s+b,b\}} = B_{\{s+b,b\}}$. Since $p(M_{s+b}|N_{\rm on},I_{\rm off}) + p(M_b|N_{\rm on},I_{\rm off}) = 1$, we can express $p(M_{s+b}|N_{\rm on},I_{\rm off})$ in terms of the odds ratio:

$$p(M_{s+b}|N_{\rm on},I_{\rm off}) = \frac{1}{1 + 1/O_{\{s+b,b\}}}. \tag{14.24}$$
In Figure 14.4, we have plotted $p(M_{s+b}|N_{\rm on},I_{\rm off})$ versus $N_{\rm on}$ for two different background OFF measurements, assuming $s_{\max} = 30$. In the first case, $N_{\rm off} = 3$ counts in $T_{\rm off} = 1$ s. In the second case, $N_{\rm off} = 36$ counts in $T_{\rm off} = 12$ s. For a given value of $N_{\rm on}$, the probability that a source is detected decreases for $N_{\rm on} \le 3$, the expected number of background photons, and increases above this. For $N_{\rm on} = 8$ and $N_{\rm off} = 3$ counts in $T_{\rm off} = 1$ s, the probability is 61.0%. This rises to 74.6% for $N_{\rm off} = 36$ counts in $T_{\rm off} = 12$ s. These are significantly lower than the corresponding probabilities for $p(i \ge 1|N_{\rm on},I)$ as shown in Figure 14.3.

Sensitivity to prior information

We close by reminding the reader that the conclusions of any Bayesian analysis are always conditional on the truth of our prior information. It is clear that in the source detection (model selection) problem, the Bayes factor (Equation (14.23)) is very sensitive to the choice of the prior upper boundary,² $s_{\max}$. Halving the value of $s_{\max}$ causes the Bayes factor to increase by a factor of two. It is useful to consider the uncertainty in $s_{\max}$ as introducing a systematic error into our conclusion. As discussed in Section 3.6, we can readily allow for the effect of systematic error in a Bayesian analysis. The solution is to treat $s_{\max}$ as an additional parameter in the problem, choose a prior for this parameter and marginalize over the parameter. This will introduce an additional Occam penalty reducing the odds ratio in a way that quantitatively reflects our uncertainties in $s_{\max}$.

Depending on the importance of the result, it may be useful to examine the dependence of the conclusion on the choice of prior for $s$, by considering an alternative but reasonable form of prior. One alternative choice worth considering in this case is a modified Jeffreys of the form $p(s|I) = 1/\{(s + a)\ln[(a + s_{\max})/a]\}$, where $a$ is a constant. This modified Jeffreys looks like a uniform prior for $s < a$ and a Jeffreys for $s > a$.

² We met this issue before in Section 3.8.1, for a completely different spectral line problem.
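The dependence of the Bayes factor on $s_{\max}$ can be made concrete with a few lines of code. The Python sketch below evaluates Equations (14.23) and (14.24) for the longer background measurement ($N_{\rm off} = 36$ counts in $T_{\rm off} = 12$ s) with $N_{\rm on} = 8$ and prior odds of 1. The function names are our own, and the float arithmetic is adequate only for modest counts.

```python
import math

def bayes_factor(n_on, n_off, t_on, t_off, s_max):
    """Equation (14.23)."""
    ratio = 1.0 + t_off / t_on
    total = sum(ratio**i * (math.factorial(n_on + n_off - i) // math.factorial(n_on - i))
                for i in range(n_on + 1))
    return math.factorial(n_on) * total / (s_max * t_on * math.factorial(n_on + n_off))

def prob_source(n_on, n_off, t_on, t_off, s_max, prior_odds=1.0):
    """Equation (14.24), with odds = prior odds * Bayes factor."""
    odds = prior_odds * bayes_factor(n_on, n_off, t_on, t_off, s_max)
    return 1.0 / (1.0 + 1.0 / odds)

b30 = bayes_factor(8, 36, 1.0, 12.0, 30.0)
b15 = bayes_factor(8, 36, 1.0, 12.0, 15.0)
p30 = prob_source(8, 36, 1.0, 12.0, 30.0)
p15 = prob_source(8, 36, 1.0, 12.0, 15.0)
print("s_max = 30: B = %.3f, p(M_s+b) = %.3f" % (b30, p30))
print("s_max = 15: B = %.3f, p(M_s+b) = %.3f" % (b15, p15))
```

Because $s_{\max}$ appears only as an overall factor $1/s_{\max}$ in Equation (14.23), halving it doubles the Bayes factor exactly, while the resulting change in $p(M_{s+b}|N_{\rm on},I_{\rm off})$ is smaller and depends on how large the odds already are.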
14.5 Time-varying Poisson rate

So far, we have assumed the Poisson rate, $r$, is a constant. We now analyze a simple problem in which $r$ is a function of time.

$I \equiv$ "We want to estimate the half-life, $\tau$, of a radioactive sample. The sample count rate is given by

$$r(t|r_0,\tau) = r_0\, 2^{-t/\tau}, \tag{14.25}$$

where $r_0$ is the count rate at $t = 0$. Assume a uniform prior for $r_0$, and an independent Jeffreys prior for $\tau$."

The data, $D$, are a simulated list of times,³ $\{t_i\}$, for $N = 60$ measured Geiger counter clicks, for a radioactive sample with a half-life of 30 s. To make use of the full resolution of the data, we will work with the individual event times.

D = {1.44, 1.64, 2.55, 2.88, 2.9, 3.27, 4.39, 5.01, 5.08, 5.11, 5.33, 5.4, 5.45, 5.58, 5.79, 6.17, 7.84, 7.86, 8.8, 8.9, 11.71, 11.73, 11.78, 14.88, 14.96, 15.61, 18.95, 19.42, 20.11, 20.28, 21.46, 21.52, 23.62, 24.21, 24.38, 24.39, 25.76, 27.92, 28.92, 29.28, 29.74, 30.04, 31.34, 32.08, 34.62, 35.04, 35.38, 36.43, 36.94, 38.97, 40.66, 41.62, 42.69, 43.02, 43.36, 45.11, 47.38, 49.65, 50.52, 51.22}

From Bayes' theorem we can write

$$p(r_0,\tau|D,I) \propto p(r_0|I)\,p(\tau|I)\,p(D|r_0,\tau,I). \tag{14.26}$$

The upper and lower boundaries of the prior ranges for $r_0$ and $\tau$ are assumed to lie well outside the region of interest defined by the likelihood function. Thus, for our current parameter estimation problem, we can write

$$p(\tau|D,I) \propto \frac{1}{\tau}\int_{r_0} dr_0\; p(D|r_0,\tau,I). \tag{14.27}$$

³ See Problem 6 for hints on how to simulate your own data set.
The likelihood can be calculated as follows: divide the observation period, $T$, into small time intervals, $\Delta t$, each containing either one event (counter click) or no event. We assume that $\Delta t$ is sufficiently small that the average rate in the interval $\Delta t$ is approximately equal to the rate at any time within the interval. From the Poisson distribution, $p_0(t)$, the probability of no event in $\Delta t$, is given by

$$p_0(t) = e^{-r(t)\Delta t}, \tag{14.28}$$

and the probability of one event is given by

$$p_1(t) = r(t)\Delta t\, e^{-r(t)\Delta t}. \tag{14.29}$$

If $N$ and $M$ are the number of time intervals in which one event and no events are detected, respectively, then the likelihood function is given by

$$\begin{aligned} p(D|r(t),I) &= \prod_{i=1}^{N} p_1(t_i) \prod_{j=1}^{M} p_0(t_j) \\ &= \Delta t^N \left[\prod_{i=1}^{N} r(t_i)\right] \exp\left[-\sum_{j=1}^{N+M} r(t_j)\Delta t\right] \\ &= \Delta t^N \left[\prod_{i=1}^{N} r(t_i)\right] \exp\left[-\int_T dt\, r(t)\right]. \end{aligned} \tag{14.30}$$

Note: we have replaced the sum of $r(t)\Delta t$ over all the observed intervals by the integral of the rate over the intervals, with the range of integration $T = (N + M)\Delta t$. The $\Delta t^N$ factor in the likelihood cancels with the same factor which appears in the denominator of Bayes' theorem, so the result does not depend on the size of $\Delta t$, and is well-behaved even in the limit where $\Delta t$ becomes infinitesimal.

Now use Equation (14.25) to substitute for $r(t)$ in Equation (14.30):

$$\begin{aligned} p(D|r_0,\tau,I) &= \Delta t^N \left[\prod_{i=1}^{N} r_0\, 2^{-t_i/\tau}\right] \exp\left[-\int_T dt\, r_0\, 2^{-t/\tau}\right] \\ &= \Delta t^N\, r_0^N\, 2^{-\frac{1}{\tau}\sum_{i=1}^{N} t_i} \exp\left[-\frac{r_0 \tau}{\ln 2}\left(1 - 2^{-T/\tau}\right)\right], \end{aligned} \tag{14.31}$$

where $T$ is the duration of the time series data. The marginal posterior probability density for the half-life can be obtained by substituting Equation (14.31) into Equation (14.27), and then evaluated numerically.⁴ The result is shown in Figure 14.5.

⁴ Since there are only two parameters, the joint probability distribution can be quickly evaluated at a finite number of two-dimensional grid points. Each marginal distribution can be obtained by summing the results for grid points along the other parameter.
[Figure 14.5: probability versus half-life (s), with the half-life axis running from 20 to 140 s.]

Figure 14.5 The marginal posterior for the half-life of a radioactive sample.
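The calculation behind a plot like Figure 14.5 can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used in the text. First, instead of summing over a two-dimensional grid, the integral over $r_0$ in Equation (14.27) is done analytically, using $\int_0^\infty r_0^N e^{-r_0 K}\,dr_0 = N!/K^{N+1}$ with $K = \tau(1 - 2^{-T/\tau})/\ln 2$. Second, the duration $T$ is taken to be the time of the last click, 51.22 s, which the text does not specify. Third, the grid range of 5 to 300 s for $\tau$ is an assumed prior range.

```python
import math

TIMES = [1.44, 1.64, 2.55, 2.88, 2.9, 3.27, 4.39, 5.01, 5.08, 5.11,
         5.33, 5.4, 5.45, 5.58, 5.79, 6.17, 7.84, 7.86, 8.8, 8.9,
         11.71, 11.73, 11.78, 14.88, 14.96, 15.61, 18.95, 19.42, 20.11, 20.28,
         21.46, 21.52, 23.62, 24.21, 24.38, 24.39, 25.76, 27.92, 28.92, 29.28,
         29.74, 30.04, 31.34, 32.08, 34.62, 35.04, 35.38, 36.43, 36.94, 38.97,
         40.66, 41.62, 42.69, 43.02, 43.36, 45.11, 47.38, 49.65, 50.52, 51.22]

def log_marginal(tau, times, T):
    """log p(tau|D,I) up to an additive constant: Jeffreys prior 1/tau times
    Equation (14.31) integrated analytically over r0 from 0 to infinity."""
    n = len(times)
    K = tau * (1.0 - 2.0**(-T / tau)) / math.log(2.0)
    return -math.log(tau) - math.log(2.0) * sum(times) / tau - (n + 1) * math.log(K)

def half_life_posterior(times, T, tau_min=5.0, tau_max=300.0, steps=2000):
    """Normalized marginal posterior on a uniform grid of half-lives."""
    d = (tau_max - tau_min) / steps
    grid = [tau_min + (k + 0.5) * d for k in range(steps)]
    logs = [log_marginal(t, times, T) for t in grid]
    top = max(logs)                      # subtract the maximum to avoid underflow
    w = [math.exp(v - top) for v in logs]
    norm = sum(w) * d
    return grid, [v / norm for v in w], d

T_OBS = TIMES[-1]   # assumption: observation ends at the last recorded click
grid, pdf, d_tau = half_life_posterior(TIMES, T_OBS)
mode = grid[pdf.index(max(pdf))]
print("posterior mode of the half-life: %.1f s" % mode)
```

With a Jeffreys prior the posterior falls off only as $1/\tau$ at large half-lives, so the normalization, and any credible region derived from it, depends on the assumed upper limit of the grid.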
What if the rate were to vary with time in some unknown fashion, perhaps periodically? For an interesting treatment of this class of problem, see Gregory and Loredo (1992, 1993, and 1996) and Loredo (1992).
14.6 Problems

1. The results of an ON/OFF measurement are $N_{\rm on} = 5$ counts, $N_{\rm off} = 90$ counts, $T_{\rm on} = 1$ s, $T_{\rm off} = 100$ s. Plot $p(s)$, the posterior probability of the source rate.
2. Using the data given in Problem 1, compute the probability of attributing $i$ of the on-source events to the source. Plot a graph of this probability for the range $i = 0$ to 5.
3. Repeat Problem 2, only this time, compute the probability of attributing $j$ of the on-source events to the background. Plot a graph of this probability for the range $j = 0$ to 5.
4. For the radioactive counter times given in Section 14.5, compute and plot the marginal posterior PDF for the initial count rate $r_0$.
5. Compute and plot the marginal posterior PDF for the radioactive sample half-life, based on the first 30 counter times given in Section 14.5.
6. Simulate your own radioactive decay time series ($N = 100$ count times) for an initial decay rate of one count per second and a half-life of 40 s. Divide the decay into 20 bins. For each bin, use Equation (5.62) to generate a list of Poisson time intervals for a Poisson rate corresponding to the time of the middle of the bin. Convert the time intervals to a sequence of count times and add the start time of the corresponding bin. You can use Mathematica's Select[list, #1 < bin boundary time &] command to select out the times from any particular list that are less than the end time of the corresponding bin. Use ListPlot[] to plot a graph of your time series.
Appendix A Singular value decomposition
Frequently, the solution of a linear least-squares problem using the normal equations of Section 10.2.2,

$$\hat{A} = (G^T E^{-1} G)^{-1} G^T E^{-1} D = Y^{-1} G^T E^{-1} D, \tag{A.1}$$

fails because a zero pivot occurs in the matrix calculation because C is singular. If the matrix is sufficiently close to singular, the answer becomes extremely sensitive to round-off errors, in which case you typically get fitted $A_\alpha$'s with very large amplitudes that are delicately balanced to almost precisely cancel out. Here is an example of a nearly singular matrix:

$$\begin{pmatrix} 1.0 & 1.0 \\ 1.0 & 1.0001 \end{pmatrix}. \tag{A.2}$$
This arises when the data do not clearly distinguish between two or more of the basis functions provided. The solution is to use singular value decomposition (SVD). When some combination of basis functions is irrelevant to the fit, SVD will drive the amplitudes of these basis functions down to small values rather than pushing them up to delicately canceling infinities.

How does SVD work? First, we need to restate the least-squares problem slightly differently. In least-squares, we want to minimize

$$\chi^2 = \sum_i \frac{(d_i - f_i)^2}{\sigma_i^2} = \sum_i \left(\frac{d_i}{\sigma_i} - \frac{f_i}{\sigma_i}\right)^2, \tag{A.3}$$

where

$$f_i(A) = \sum_{\alpha=1}^{M} A_\alpha\, g_{\alpha i}. \tag{A.4}$$

In matrix form, we can write

$$\chi^2 = |X \cdot A - b|^2, \tag{A.5}$$

where $X$ is the design matrix given by

$$X = \begin{pmatrix} \dfrac{g_1(x_1)}{\sigma_1} & \cdots & \dfrac{g_M(x_1)}{\sigma_1} \\ \vdots & & \vdots \\ \dfrac{g_1(x_N)}{\sigma_N} & \cdots & \dfrac{g_M(x_N)}{\sigma_N} \end{pmatrix}, \tag{A.6}$$

and

$$b = \begin{pmatrix} d_1/\sigma_1 \\ \vdots \\ d_N/\sigma_N \end{pmatrix}. \tag{A.7}$$
The problem is to find $A$ which minimizes Equation (A.5). Any rectangular matrix can be written in reduced SVD form as follows¹ (see any good linear algebra text for a proof):

$$\underset{(N \times M)}{X} = \underset{(N \times M)}{U^T}\;\underset{(M \times M)}{D}\;\underset{(M \times M)}{V}, \tag{A.9}$$

where the columns of $U$ are orthonormal and are the eigenvectors of $X X^T$. The columns of $V$ are orthonormal and are the eigenvectors of $X^T X$, and $X^T X = G^T E^{-1} G$. The elements of the diagonal matrix, $D$, are called the singular values of $X$. These singular values, $\omega_1, \omega_2, \ldots, \omega_M$, are the square roots of the non-zero eigenvalues of both $X^T X$ and $X X^T$.

$$D = \begin{pmatrix} \omega_1 & 0 & \cdots & 0 \\ 0 & \omega_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \omega_M \end{pmatrix} \tag{A.10}$$

The number of singular values is equal to the rank of $X$. The three matrices $U$, $D$, $V$ can be obtained in Mathematica with the command

{U, D, V} = SingularValues[X]

¹ In some texts, Equation (A.9) is written in the form

$$X = U\,D\,V^T. \tag{A.8}$$
In a least-squares problem, the design matrix, $X$, does not have an inverse because it is not a square matrix, but we can use SVD to construct a pseudo-inverse, $X^+$, which provides the best solution in a least-squares sense to Equation (A.5) in terms of the basis functions that the data can distinguish between. We will designate that solution $A^+$, which is given by

$$A^+ = X^+ b. \tag{A.11}$$

The pseudo-inverse, in the Mathematica convention, is given by

$$X^+ = V^T D^{-1} U = V^T\, \mathrm{diag}\!\left(\frac{1}{\omega_\alpha}\right) U. \tag{A.12}$$

Before using Equation (A.11), it is desirable to investigate the singular values of the design matrix $X$. If any singular values $\omega_\alpha$ are close to zero, set $1/\omega_\alpha = 0$ for that $\alpha$. This corresponds to throwing away one or more basis functions that the data cannot decide on. The condition number of a matrix is defined by

$$\text{condition number} = \frac{\text{maximum eigenvalue}}{\text{minimum eigenvalue}} = \left(\frac{\text{maximum singular value}}{\text{minimum singular value}}\right)^2. \tag{A.13}$$

The matrix becomes ill-conditioned if the reciprocal of its condition number approaches the floating point accuracy. The PseudoInverse command in Mathematica allows one to specify a Tolerance option for throwing away basis functions whose singular values are < Tolerance multiplied by the maximum singular value. Equation (A.11) becomes

A+ = PseudoInverse[X, Tolerance -> t].b

Note: use of the Tolerance option can lead to strange results when fitting polynomials. This is because the range (scale of changes) of the different basis functions can be very different, e.g., the $x^3$ term will have a much larger range than, say, the $x$ term. The different scales of the basis functions make a simple comparison of singular values difficult. In this case, it is better to rescale $x$ so it lies in the range 0 to 1 or $-1$ to $+1$ before computing the singular values.
Appendix B Discrete Fourier Transforms
B.1 Overview

The operations of convolution, correlation and Fourier analysis play an important role in data analysis and experiment simulations. These operations on digitally sampled data are efficiently carried out with the Discrete Fourier Transform (DFT), and in particular with a fast version of the DFT called the Fast Fourier Transform (FFT). In this section, we introduce the DFT and FFT and explore their relationship to the analytic Fourier transform and Fourier series. We investigate Fourier deconvolution of a noisy signal with an optimal Wiener filter. We also learn how to minimally sample data without losing any information (Nyquist theorem) and about the aliasing that occurs when a waveform is sampled at an insufficient rate. Since the DFT is an approximation to the analytic Fourier transform, we learn how to zero pad a time series to obtain accurate Fourier amplitudes, and to remove spurious end effects in discrete convolution. Finally, we explore two commonly used approaches to spectral analysis and how to reduce the variance of spectral density estimators.
B.2 Orthogonal and orthonormal functions

Before we consider the problem of representing a function in terms of a sum of orthogonal basis functions, we review the more familiar problem of representing a vector in terms of a set of orthogonal unit vectors. The vector $\mathbf{F}$ can be represented as the vector sum

$$\mathbf{F} = F_x \hat{i}_x + F_y \hat{i}_y + F_z \hat{i}_z, \tag{B.1}$$

where the $\hat{i}$'s are unit vectors along 3 mutually perpendicular axes. Because the unit vectors satisfy the relation $\hat{i}_x \cdot \hat{i}_y = \hat{i}_y \cdot \hat{i}_z = \hat{i}_z \cdot \hat{i}_x = 0$, they are said to be an orthogonal set. In addition,

$$\hat{i}_x \cdot \hat{i}_x = \hat{i}_y \cdot \hat{i}_y = \hat{i}_z \cdot \hat{i}_z = 1, \tag{B.2}$$

so they are called an orthonormal set. They are not the only orthonormal set which can be used to represent $\mathbf{F}$. Any orthonormal coordinate system can be used. For example, in spherical coordinates,

$$\mathbf{F} = F_r \hat{i}_r + F_\theta \hat{i}_\theta + F_\phi \hat{i}_\phi. \tag{B.3}$$

In summary, we can represent the vector $\mathbf{F}$ by

$$\mathbf{F} = \sum_{n=1}^{N} F_n \hat{i}_n, \tag{B.4}$$

where the orthonormal set of basis vectors satisfies the condition

$$\hat{i}_m \cdot \hat{i}_n = \delta_{m,n} = \begin{cases} 1, & m = n \\ 0, & m \ne n. \end{cases} \tag{B.5}$$

To find the scalar component along $\hat{i}_m$, take the scalar product of $\hat{i}_m$ with $\mathbf{F}$.

$$F_m = \mathbf{F} \cdot \hat{i}_m. \tag{B.6}$$
In an analogous fashion, we would like to represent a function $y(t)$ in terms of an orthonormal set of basis functions,

$$y(t) = \sum_{n=1}^{N} Y_n \phi_n(t). \tag{B.7}$$

We need to define the equivalent of the scalar product for use with functions, which is called the inner product for two functions. It is easy to show that

$$\frac{1}{\pi}\int_{-\pi}^{+\pi} \sin mt\, \sin nt\; dt = \delta_{m,n}. \tag{B.8}$$

If this relationship is to be satisfied, then the inner product between two functions should be defined as $\int_{-\pi}^{\pi} x(t)\, y(t)\, dt$. Thus, if

$$y(t) = \sum_{n=1}^{N} Y_n \phi_n(t) = \sum_{n=1}^{N} Y_n \frac{\sin nt}{\sqrt{\pi}}, \tag{B.9}$$

then the inner product of $y(t)$ and $\phi_m(t)$ is

$$\int y(t)\,\phi_m(t)\,dt = \sum_{n=1}^{N} Y_n \int_{-\pi}^{+\pi} \frac{\sin nt}{\sqrt{\pi}}\,\frac{\sin mt}{\sqrt{\pi}}\,dt = \sum_{n=1}^{N} Y_n\, \delta_{m,n} = Y_m. \tag{B.10}$$
The next question is whether any function can be represented by an orthonormal series like Equation (B.9). Since all terms on the right side of Equation (B.9) are periodic in $t$, with period $2\pi$, their sum will also be periodic. If the original function $y(t)$ is periodic as well, over the same period, then this series representation will be valid for all values of $t$. Otherwise, the series will only represent the function $y(t)$ in the range $-\pi < t < \pi$.
What about the question of completeness? Returning to the vector analogy: in general, a set of unit vectors is not complete if it is possible to find a vector belonging to the space which is orthogonal to every vector in the set, i.e., 3 basis vectors are required for 3-dimensional space. How many dimensions does our function space have? It is clear from Equation (B.9) that for values $n > N$, it is possible to find a function $\sin nt$ which is orthogonal to all members. Furthermore, even if $N$ is infinite, $\cos nt$ is orthogonal to every member of the set for any value of $n$. A complete set must contain at least an infinite number of functions of the form $\sin nt$ and $\cos nt$. We will say more about completeness later when we discuss the Nyquist sampling theorem.

Many well-known sets of functions exhibit relationships similar to Equation (B.9). Such function sets, not sinusoidal and usually not periodic, can be used to form orthogonal series. In general, the inner product can be defined in the interval $a < t < b$ as follows:

$$\int_a^b x(t)\, y^*(t)\, \omega(t)\, dt, \tag{B.11}$$

where $y^*(t)$ is the complex conjugate of $y(t)$ and $\omega(t)$ is a weighting function. A set of functions $\phi_n(t)$ is an orthogonal set over the range $a < t < b$ if

$$\phi_n \cdot \phi_m = \int_a^b \phi_n(t)\, \phi_m^*(t)\, \omega(t)\, dt = k_n\, \delta_{m,n}. \tag{B.12}$$

The set is orthonormal if $k_n = 1$ for all $n$. Examples of useful orthogonal functions are:

1. $1, \cos x, \sin x, \cos 2x, \sin 2x, \ldots$ used in a Fourier series
2. Legendre polynomials
3. Spherical harmonics
4. Bessel functions
5. Chebyshev or Tschebyscheff polynomials
6. Laguerre polynomials
7. Hermite polynomials
B.3 Fourier series and integral transform

In the case of the Fourier series, the limits $-\pi$ to $+\pi$ correspond to the period of the function. The limits can be made arbitrary by setting $t = 2\pi t'/T$. Then

$$\frac{1}{\pi}\int_{-\pi}^{+\pi} \sin nt\, \sin mt\; dt = \frac{2}{T}\int_{-T/2}^{+T/2} \sin(2\pi n f_0 t')\, \sin(2\pi m f_0 t')\; dt', \tag{B.13}$$

where $f_0 = 1/T$.
B.3.1 Fourier series

The Fourier series representation of $y(t)$ is given by

$$y(t) = \sum_{n=0}^{\infty} \left[a_n \cos 2\pi n f_0 t + b_n \sin 2\pi n f_0 t\right]. \tag{B.14}$$

To find the coefficients $a_n$ and $b_n$ of $y(t)$, compute the inner product of $y(t)$ with the cosine and sine basis functions. This is analogous to finding the component of a vector in Equation (B.6).

$$a_n = \frac{2}{T}\int_{-T/2}^{T/2} y(t) \cos 2\pi n f_0 t\; dt, \tag{B.15}$$

and,

$$b_n = \frac{2}{T}\int_{-T/2}^{T/2} y(t) \sin 2\pi n f_0 t\; dt, \tag{B.16}$$
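for $n = 0, 1, \ldots$

Exponential notation

We will rewrite Equation (B.14) using the common exponential notation, where

$$\cos 2\pi n f_0 t = \frac{1}{2}\left(e^{i2\pi n f_0 t} + e^{-i2\pi n f_0 t}\right), \qquad \sin 2\pi n f_0 t = \frac{1}{2i}\left(e^{i2\pi n f_0 t} - e^{-i2\pi n f_0 t}\right).$$

Equation (B.14) becomes

$$y(t) = a_0 + \frac{1}{2}\left[\sum_{n=1}^{\infty} (a_n - i b_n)\, e^{i2\pi n f_0 t} + \sum_{n=1}^{\infty} (a_n + i b_n)\, e^{-i2\pi n f_0 t}\right]. \tag{B.17}$$

To simplify the expression, negative values of $n$ are introduced. Thus, we can rewrite Equation (B.17) as

$$\begin{aligned} y(t) &= a_0 + \frac{1}{2}\left[\sum_{n=1}^{\infty} (a_n - i b_n)\, e^{-i2\pi(-n) f_0 t} + \sum_{n=1}^{\infty} (a_n + i b_n)\, e^{-i2\pi n f_0 t}\right] \\ &= a_0 + \frac{1}{2}\left[\sum_{n=-1}^{-\infty} (a_{|n|} - i b_{|n|})\, e^{-i2\pi n f_0 t} + \sum_{n=1}^{\infty} (a_n + i b_n)\, e^{-i2\pi n f_0 t}\right] \\ &= \sum_{n=-\infty}^{\infty} Y_n\, e^{-i2\pi n f_0 t}, \end{aligned} \tag{B.18}$$

where

$$Y_n = \begin{cases} \frac{1}{2}(a_n + i b_n), & n > 0 \\ a_0, & n = 0 \\ \frac{1}{2}(a_{|n|} - i b_{|n|}), & n < 0. \end{cases} \tag{B.19}$$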
In Equation (B.18), we expanded $y(t)$ in terms of the $e^{-i2\pi n f_0 t}$ basis set. Alternatively we could have expanded it in terms of the $e^{+i2\pi n f_0 t}$ basis set, in which case we would write

$$y(t) = \sum_{n=-\infty}^{\infty} Y'_n\, e^{i2\pi n f_0 t}, \tag{B.20}$$

where

$$Y'_n = \begin{cases} \frac{1}{2}(a_n - i b_n), & n > 0 \\ a_0, & n = 0 \\ \frac{1}{2}(a_{|n|} + i b_{|n|}), & n < 0. \end{cases} \tag{B.21}$$
Both conventions exist in the literature, but we will use the convention specified by Equations (B.18) and (B.19) to be consistent with the default definitions for the Discrete Fourier Transform (discussed in Section B.7) used in Mathematica.
B.3.2 Fourier transform

In the Fourier series, the Fourier frequency components are separated by $f_0 = 1/T$. In the limit as $T \to \infty$, the Fourier components, $Y_n$, become a continuous function $Y(f)$, where $Y(f)$ is called the Fourier transform of $y(t)$.

$$y(t) = \int_0^{\infty} \left[g(f) \cos 2\pi f t + k(f) \sin 2\pi f t\right] df. \tag{B.22}$$

If we define $Y(f)$ by

$$Y(f) = \begin{cases} \frac{1}{2}\{g(f) + i k(f)\}, & f > 0 \\ g(0), & f = 0 \\ \frac{1}{2}\{g(|f|) - i k(|f|)\}, & f < 0, \end{cases} \tag{B.23}$$

then Equation (B.22) can be rewritten as

$$y(t) = \int_{-\infty}^{+\infty} Y(f)\, e^{-i2\pi f t}\, df \quad \text{where} \quad Y(f) = \int_{-\infty}^{+\infty} y(t)\, e^{i2\pi f t}\, dt. \tag{B.24}$$

Designate Fourier transform pairs by

$$y(t) \Longleftrightarrow Y(f). \tag{B.25}$$

Units: If $t$ is measured in seconds, then $f$ is in units of cycles s$^{-1}$ = hertz. If $t$ is measured in minutes, then $f$ is in cycles per minute. Some common Fourier transform pairs are illustrated in Figure B.1.
[Figure B.1: four transform pairs, time domain on the left and frequency domain on the right:
h(t) = A for |t| < T0, A/2 for |t| = T0, 0 for |t| > T0 ⟺ H(f) = 2 A T0 sin(2π T0 f)/(2π T0 f);
h(t) = 2 A f0 sin(2π f0 t)/(2π f0 t) ⟺ H(f) = A for |f| < f0, A/2 for |f| = f0, 0 for |f| > f0;
h(t) = K ⟺ H(f) = K δ(f);
h(t) = Σ δ(t − nT), summed over n from −∞ to ∞ ⟺ H(f) = (1/T) Σ δ(f − n/T), summed over n from −∞ to ∞.]

Figure B.1 Some common Fourier transform pairs.
Note: we normally associate the analysis of periodic functions such as a square wave with Fourier series rather than Fourier transforms. We can show that a Fourier transform reduces to a Fourier series whenever the function being transformed is periodic.

Example: Consider the FT of a pulse time waveform

$$h(t) = \begin{cases} A, & |t| < T_0 \\ A/2, & |t| = T_0 \\ 0, & |t| > T_0. \end{cases} \tag{B.26}$$

The value of the function at a discontinuity must be defined to be the mid-value if the inverse Fourier transform is to hold (Brigham, 1988).

$$H(f) = \int_{-T_0}^{T_0} A\, e^{i2\pi f t}\, dt = A\int_{-T_0}^{T_0} \cos 2\pi f t\; dt + iA\int_{-T_0}^{T_0} \sin 2\pi f t\; dt. \tag{B.27}$$

The final integral $= 0$ since the integrand is odd:

$$\Rightarrow H(f) = \frac{A}{2\pi f}\, \sin 2\pi f t\, \Big|_{-T_0}^{T_0} = 2AT_0\, \frac{\sin 2\pi T_0 f}{2\pi T_0 f}. \tag{B.28}$$
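Equation (B.28) can be compared with a direct numerical evaluation of the integral in Equation (B.27). The Python sketch below does this with a midpoint sum; the values A = 1.5 and T0 = 1, the test frequencies, and the function names are our own assumptions.

```python
import cmath
import math

def pulse_transform_numeric(f, A, T0, steps=20000):
    """Midpoint-rule evaluation of Equation (B.27): the integral of
    A exp(i 2 pi f t) over -T0 < t < T0."""
    h = 2.0 * T0 / steps
    total = 0.0 + 0.0j
    for k in range(steps):
        t = -T0 + (k + 0.5) * h
        total += A * cmath.exp(2j * math.pi * f * t)
    return h * total

def pulse_transform_analytic(f, A, T0):
    """Equation (B.28): 2 A T0 sin(2 pi T0 f) / (2 pi T0 f)."""
    x = 2.0 * math.pi * T0 * f
    if x == 0.0:
        return 2.0 * A * T0      # limiting value at f = 0
    return 2.0 * A * T0 * math.sin(x) / x

A, T0 = 1.5, 1.0                 # illustrative values only
for f in (0.0, 0.3, 0.5, 1.7):
    numeric = pulse_transform_numeric(f, A, T0)
    analytic = pulse_transform_analytic(f, A, T0)
    print(f, numeric.real, numeric.imag, analytic)
```

The imaginary part of the numerical result vanishes to rounding error, reflecting the odd sine integrand, and the first zero of $H(f)$ falls at $f = 1/(2T_0)$.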
Table B.1 gives the correspondence between important symmetry properties in time and frequency domains.

Table B.1 Correspondence of symmetry properties in the two domains.

If h(t) is . . .                  then H(f) is . . .
real                              real part even, imaginary part odd
imaginary                         real part odd, imaginary part even
real even and imaginary odd       real
real odd and imaginary even       imaginary
real and even                     real and even
real and odd                      imaginary and odd
imaginary and even                imaginary and even
imaginary and odd                 real and odd
complex and even                  complex and even
complex and odd                   complex and odd
B.4 Convolution and correlation

We previously considered some fundamental properties of the FT. However, there exists a class of FT relationships whose importance outranks those previously considered. These properties are the convolution and correlation theorems. The importance of the convolution operation in science is discussed in Section B.4.3.

Convolution integral
$$y(t) = \int_{-\infty}^{+\infty} s(\tau)\, h(t - \tau)\, d\tau = s(t) * h(t) \tag{B.29}$$
or alternatively,
$$y(t) = \int_{-\infty}^{+\infty} h(\tau)\, s(t - \tau)\, d\tau, \tag{B.30}$$
where the symbol $*$ in Equation (B.29) stands for convolution. The convolution procedure is illustrated graphically in Figure B.2.
Figure B.2 Graphical illustration of convolution. [Figure not reproduced. Successive panels show $h(\tau)$ and $s(\tau)$; FOLD, $h(-\tau)$; SHIFT, $h(t - \tau)$; MULTIPLY; INTEGRATE, giving $y(t)$.]
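The fold, shift, multiply, and integrate steps can be sketched numerically on a uniform grid. This is a minimal illustration under stated assumptions: both functions are sampled at $\tau = j\,\Delta\tau$ for $j \ge 0$ and taken as zero elsewhere, and the name `convolve_at` and the box-function test data are hypothetical.

```python
def convolve_at(s, h, t_index, dtau):
    """Riemann-sum estimate of y(t) = integral s(tau) h(t - tau) dtau at t = t_index * dtau.

    s and h are lists of samples at tau = j * dtau (j = 0, 1, ...), zero elsewhere.
    """
    total = 0.0
    for j, s_val in enumerate(s):
        i = t_index - j              # fold and shift: index of h(t - tau)
        if 0 <= i < len(h):
            total += s_val * h[i]    # multiply
    return total * dtau              # integrate

# Two unit-height boxes of width 1: their convolution is a triangle peaking near t = 1.
dtau = 0.01
box = [1.0] * 100
y_rising = convolve_at(box, box, 49, dtau)
y_peak = convolve_at(box, box, 99, dtau)
y_falling = convolve_at(box, box, 149, dtau)
print(y_rising, y_peak, y_falling)
```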
B.4.1 Convolution theorem

The convolution theorem is one of the most powerful tools in modern scientific analysis. According to the convolution theorem, the FT of the convolution of two functions is equal to the product of the FTs of the two functions:
$$h(t) * s(t) \Longleftrightarrow H(f)\, S(f). \tag{B.31}$$
Proof:
$$\int_{-\infty}^{+\infty} y(t)\, e^{i 2\pi f t}\, dt = \int_{-\infty}^{+\infty} \left[ \int_{-\infty}^{+\infty} s(\tau)\, h(t - \tau)\, d\tau \right] e^{i 2\pi f t}\, dt. \tag{B.32}$$
Now interchange the order of integration:
$$Y(f) = \int_{-\infty}^{+\infty} s(\tau) \left[ \int_{-\infty}^{+\infty} h(t - \tau)\, e^{i 2\pi f t}\, dt \right] d\tau. \tag{B.33}$$
Let $r = t - \tau$. Then,
$$\int_{-\infty}^{+\infty} h(t - \tau)\, e^{i 2\pi f t}\, dt = \int_{-\infty}^{+\infty} h(r)\, e^{i 2\pi f (r + \tau)}\, dr \tag{B.34}$$
$$= e^{i 2\pi f \tau}\, H(f). \tag{B.35}$$
Therefore,
$$Y(f) = H(f) \int_{-\infty}^{+\infty} s(\tau)\, e^{i 2\pi f \tau}\, d\tau = H(f)\, S(f). \tag{B.36}$$
This relationship allows one to carry out a convolution in the time domain, mathematically or visually, by simple multiplication in the frequency domain. Among other things, it provides a convenient tool for developing additional FT pairs. Figure B.3 illustrates the theorem applied to the convolution of a rectangular pulse of width $2T_0$ with a bed of nails (an array of uniformly spaced delta functions).

We can equivalently go from convolution in the frequency domain to multiplication in the time domain:
$$h(t)\, s(t) \Longleftrightarrow H(f) * S(f). \tag{B.37}$$
B.4.2 Correlation theorem

The correlation of two functions $s(t)$ and $h(t)$ is defined by
$$\mathrm{Corr}(s, h) = z(t) = \int_{-\infty}^{+\infty} s(\tau)\, h(\tau + t)\, d\tau. \tag{B.38}$$
It is useful to compare Equation (B.38) with Equation (B.29) for convolution. Convolution involves a folding of $h(\tau)$ before shifting, while correlation does not. According to the correlation theorem,
$$z(t) \Longleftrightarrow H(f)\, S^{*}(f) = Z(f). \tag{B.39}$$
Thus, $\mathrm{Corr}(s, h) \Longleftrightarrow H(f)\, S^{*}(f)$ are an FT pair. Compare with the convolution: $s(t) * h(t) \Longleftrightarrow S(f)\, H(f)$.
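A discrete analogue of the correlation theorem can be checked directly. The sketch below assumes real, periodic (circular) sequences and a DFT with the $e^{+i 2\pi n k/N}$ kernel, matching the sign convention of Equation (B.24). The helper names and the test sequences are illustrative only.

```python
import cmath
import math

def dft(x):
    """Direct DFT with the exp(+i 2 pi n k / N) kernel."""
    N = len(x)
    return [sum(x[k] * cmath.exp(2j * math.pi * n * k / N) for k in range(N))
            for n in range(N)]

def circular_correlation(s, h):
    """z(t) = sum over tau of s(tau) h(tau + t), indices taken modulo N."""
    N = len(s)
    return [sum(s[tau] * h[(tau + t) % N] for tau in range(N)) for t in range(N)]

s = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 0.0, 3.0]
h = [0.0, 1.0, 4.0, 2.0, 0.0, -2.0, 1.0, 0.5]

Z = dft(circular_correlation(s, h))
S, H = dft(s), dft(h)
# For real s, the theorem predicts Z(n) = H(n) * conj(S(n)).
max_diff = max(abs(Z[n] - H[n] * S[n].conjugate()) for n in range(len(s)))
print(max_diff)
```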
Figure B.3 Example of the convolution theorem. [Figure not reproduced. Left column, convolve in the time domain: (a) rectangular pulse $h(t)$ of amplitude $A$; (c) bed of nails $s(t)$ with spacing $T$; (e) $h(t) * s(t)$. Right column, multiply in the frequency domain: (b) $H(f)$ with peak $2AT_0$; (d) $S(f)$, nails of area $1/T$ with spacing $1/T$; (f) $H(f)\,S(f)$ with peak $2AT_0/T$.]
Note: if $s(t)$ is a real and even function, $S(f)$ is real and $S^{*}(f) = S(f)$. Thus, in this case, $\mathrm{Corr}(s, h) = s(t) * h(t)$, i.e., the correlation equals the convolution.
B.4.3 Importance of convolution in science

The goal of science is to infer how nature works based on measurements or observations.

Nature ⇒ Measurement Apparatus ⇒ Observation.

Unfortunately, all measurement apparatus introduces distortions which need to be understood. Often, the most exciting questions of the day require pushing the measurement equipment to its very limits where the distortions are most extreme. Of course, some of these distortions can be approximately calculated from theory, like the diffraction effects of a telescope or microscope, but others need to be measured. Are there any general principles that help us to understand these distortions that we can apply to any measurement process? The answer is yes for any linear measurement process where the output is linearly related to the input signal, even if the apparatus is a very complex piece of equipment consisting of many separate parts, e.g., a radio telescope consisting of one or more parabolic antennas and a room full of sophisticated electronics.

Any linear measurement process corresponds mathematically to a convolution of the measurement apparatus point spread function with the signal from nature. The point spread function is the response of the apparatus to an input signal that is unresolved in the measurement dimension, e.g., a short pulse in the time dimension. From an understanding of the equipment, it is often possible to partially correct for these distortions to better approximate the original signal.

Radio astronomy example

The simplest radio telescope consists of a parabolic collecting antenna which focuses, amplifies, and detects the radiation arriving within a narrow cone of solid angle (two angular dimensions). Because of diffraction, the sensitivity within the cone may have several peaks, often referred to as the main beam and side lobes. The angular size of the main beam in radians is approximately wavelength/telescope diameter. For example, a 100 m diameter telescope operating at a wavelength of 3 cm has about the same resolving power as the human eye at optical wavelengths. The detailed shape of the main beam and side lobes may be very difficult to calculate, especially when the telescope is operated at very short wavelengths where irregularities in the telescope surface and gravitational distortions are most important. Any image of the intensity distribution of the sky (incident radiation), made with this instrument, will be blurred by this diffraction pattern or point spread function. Fortunately, the point spread function can be measured provided it is stable. This can be achieved by observing a strong "point" source of very small angular extent, much smaller than the main beam of the telescope.
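The beam-size estimate quoted for the 100 m telescope is a one-line calculation. The sketch below simply evaluates wavelength/diameter and converts to arcminutes; the function name is hypothetical, and the comparison figure of roughly one arcminute for the unaided eye is a commonly quoted value rather than something stated in the text.

```python
import math

def beam_width_arcmin(wavelength_m, diameter_m):
    """Approximate main-beam width, wavelength / diameter, converted from radians to arcminutes."""
    return math.degrees(wavelength_m / diameter_m) * 60.0

width = beam_width_arcmin(0.03, 100.0)   # 3 cm wavelength, 100 m dish
print(width)                             # about 1 arcminute
```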
The use of an unresolved "point" source to measure the telescope point spread function, and the blurring effect the point spread function has on a model extended source, are illustrated in Figure B.4 for one angular coordinate, $\theta$. The dashed curve in the upper left panel shows the response of the telescope as a function of $\theta$ that we wish to measure. In this example, it consists of a main beam and a strong secondary side lobe. The solid curve represents the radio intensity distribution of an unresolved point source. The source is fixed in position but the telescope response, defined by the location of the center of the main beam and represented by $\theta_0$, can be steered. By scanning the telescope (varying $\theta_0$) across the point source, we can map out the telescope point spread function as shown in the lower left panel. One can see that the response of the telescope to the point source (point spread function) in $\theta_0$ is the mirror image of the telescope response in $\theta$. Thus, to simulate an observation of an extended source, the model galaxy, we need to obtain the telescope response in $\theta$ by folding the measured point spread function in $\theta_0$ about the main beam axis. Then for each pointing position of the telescope, we multiply the telescope response in $\theta$ by the galaxy intensity distribution and integrate. The lower right panel shows the results of such a convolution with our model galaxy intensity distribution.

The convolution theorem provides what is often a simpler way of computing the measured galaxy intensity distribution. Just Fourier transform the telescope sensitivity in $\theta$ and the galaxy intensity distribution, multiply the two transforms, and then take the inverse Fourier transform. The inverse of convolution, called deconvolution, is demonstrated in Section B.10.1.

Figure B.4 A simulation of the response of a radio telescope to an unresolved "point" source and an extended source in one angular sky coordinate, $\theta$. The dashed curve in the upper panels is the telescope sensitivity as a function of $\theta$. The solid curve in the upper left represents the intensity distribution of a point source, and in the right panel, a model galaxy intensity distribution. The lower panels are the measured telescope output versus $\theta_0$, the telescope pointing position. [Figure not reproduced. Panels: (a) Telescope Beam + Point Source; (b) Point Source Response; (c) Galaxy and Telescope Beam; (d) Galaxy Convolved with Beam. Vertical axes: intensity.]
B.5 Waveform sampling

In many practical situations, we only obtain samples of some continuous function. How could we go about sampling the continuous voltage function, $v(t)$, to obtain a sample at $t = \tau$? One way would be to convert $v(t)$ to a frequency $f(t) = k\,v(t)$ and count the number of cycles ($N$) in some short time interval $\tau$ to $\tau + \Delta T$:
$$N = \int_{\tau}^{\tau + \Delta T} f(t)\, dt = \int_{\tau}^{\tau + \Delta T} k\, v(t)\, dt. \tag{B.40}$$
If $s(\tau)$, the time averaged value of $v(t)$ around $t = \tau$, is the desired sample, then
$$s(\tau) = \frac{N}{k\,\Delta T} = \frac{\int_{\tau}^{\tau + \Delta T} k\, v(t)\, dt}{\int_{\tau}^{\tau + \Delta T} k\, dt}. \tag{B.41}$$
We can generalize this to
$$s(\tau) = \frac{\int_{-\infty}^{+\infty} k(t - \tau)\, v(t)\, dt}{\int_{-\infty}^{+\infty} k(t - \tau)\, dt}, \tag{B.42}$$
where $k(t)$ is some suitable weighting function. One choice of $k(t)$ is a square pulse of width $T$ and height $1/T$. In this case, $\int_{-\infty}^{+\infty} k(t - \tau)\, dt = 1$. The ideal choice, which we can only approach in practice, is $k(t) \to \delta(t)$ (the impulse or Dirac delta function):
$$\delta(t - \tau) = 0 \ \text{for}\ t \neq \tau \quad \text{and} \quad \int_{-\infty}^{+\infty} \delta(t - \tau)\, dt = 1. \tag{B.43}$$
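To see how the weighted average of Equation (B.42) approaches the ideal sample as the weighting function narrows, the sketch below uses a square pulse of height 1/width. Centring the pulse on $\tau$, the midpoint-rule integration, and the sinusoidal test signal are assumptions made for illustration.

```python
import math

def sample(v, tau, width, steps=2000):
    """s(tau) from Equation (B.42) with k(t) a square pulse of the given width centred on tau."""
    dt = width / steps
    num = 0.0
    den = 0.0
    for j in range(steps):
        t = tau - width / 2.0 + (j + 0.5) * dt
        weight = 1.0 / width        # k(t - tau): height 1/width inside the pulse
        num += weight * v(t) * dt   # numerator: integral of k(t - tau) v(t) dt
        den += weight * dt          # denominator: integral of k(t - tau) dt
    return num / den

def v(t):
    """Test signal: a 1 Hz sinusoid."""
    return math.sin(2.0 * math.pi * t)

for width in (0.5, 0.1, 0.01):
    print(width, sample(v, 0.25, width), v(0.25))
```

For this test signal the averaged sample tends to $v(0.25) = 1$ as the pulse narrows, while a pulse half a period wide noticeably underestimates it.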
In most texts, sampling at uniform intervals separated by $T$ is represented by multiplying the waveform by a set of impulse functions with separation $T$ (often referred to as a bed of nails). Note: the FT of a bed of nails is another bed of nails such that
$$\Delta(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT) \Longleftrightarrow \Delta(f) = \frac{1}{T} \sum_{n=-\infty}^{\infty} \delta\!\left(f - \frac{n}{T}\right), \tag{B.44}$$
where the area integral of one nail in the frequency domain is $1/T$. We can use the convolution theorem to illustrate (see Figure B.5) how to determine the FT of a sampled waveform. The FT of the sampled waveform is then a periodic function where one period is equal to, within the constant $(1/T)$, the FT of the continuous function $h(t)$. Notice that in this situation, we have not lost any information about the original continuous $h(t)$. By picking out one period of the transform, we can reconstruct identically the continuous waveform by the inverse FT.
Figure B.5 The Fourier transform of a sampled waveform illustrated using multiplication in the time domain and convolution in the frequency domain. [Figure not reproduced. Left column, multiply in the time domain: (a) $h(t)$; (c) bed of nails $s(t)$ with spacing $T$; (e) $h(t)\,s(t)$. Right column, convolve in the frequency domain: (b) $H(f)$, band-limited to $|f| < f_c$; (d) $S(f)$ with spacing $1/T$; (f) $H(f) * S(f)$.]

B.6 Nyquist sampling theorem

Consider what would happen in Figure B.5 if the sampling interval were made larger. In the frequency domain, the separation $(= 1/T)$ between impulse functions of $S(f)$ would decrease. Because of this decreased spacing of the frequency impulses, their convolution with the frequency function $H(f)$ results in overlapping waveforms as illustrated in panel (f) of Figure B.6.
In this case, we can no longer recover an undistorted single period which is identical with the FT of the continuous function $h(t)$. This distortion of a sampled waveform is known as aliasing. It arises because the original waveform was not sampled at a sufficiently high rate. For a given sampling interval, $T$, the Nyquist frequency is defined as $1/(2T)$. If the waveform that is being sampled contains frequency components above the Nyquist frequency, they will give rise to aliasing. Examination of Figure B.6(b) and (d) indicates that convolution overlap will occur until the separation of the impulses of $S(f)$ is increased to $1/T = 2f_c$, where $f_c$ is the highest frequency. Therefore, the sampling interval $T$ must be $\le 1/(2f_c)$.

The Nyquist sampling theorem states that if the Fourier transform of a function $h(t)$ is zero for all $|f| \ge f_c$, then the continuous function $h(t)$ can be uniquely determined from a knowledge of its sampled values at intervals of $T \le 1/(2f_c)$. If $H(f) = 0$ for $|f| \ge f_c$ then we say that $H(f)$ is band-limited. In practice, it is a good idea to use a smaller sample interval $T \le 1/(4f_c)$. Conversely, if $h(t)$ is time-limited, that is $h(t) = 0$ for $|t| \ge T_c$, then $h(t)$ can be uniquely reconstructed from samples of $H(f)$ at intervals $\Delta f = 1/(2T_c)$.
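Aliasing is easy to demonstrate with a pure sinusoid. In the sketch below, a 3 Hz cosine is sampled at $T = 0.25$ s (Nyquist frequency 2 Hz); its samples coincide with those of a 1 Hz cosine, so the two cannot be told apart after sampling. The particular frequencies and the number of samples are illustrative choices.

```python
import math

T = 0.25                       # sample interval in seconds
nyquist = 1.0 / (2.0 * T)      # Nyquist frequency, 2 Hz

f_high = 3.0                   # above the Nyquist frequency
f_alias = 1.0 / T - f_high     # 1 Hz: the frequency it masquerades as

samples_high = [math.cos(2.0 * math.pi * f_high * k * T) for k in range(16)]
samples_alias = [math.cos(2.0 * math.pi * f_alias * k * T) for k in range(16)]

max_diff = max(abs(a - b) for a, b in zip(samples_high, samples_alias))
print(nyquist, f_alias, max_diff)
```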
Figure B.6 When the waveform is sampled at an insufficient rate, overlapping (referred to as aliasing) occurs in the transform domain. [Figure not reproduced. Panels as in Figure B.5: (a) $h(t)$; (b) $H(f)$; (c) $s(t)$ with spacing $T$; (d) $S(f)$ with spacing $1/T$; (e) $h(t)\,s(t)$; (f) $H(f) * S(f)$, showing overlapping periods.]
B.6.1 Astronomy example

In this example, taken from radio astronomy, we are interested in determining the intensity distribution, $b(\theta, \phi)$, of a galaxy with the Very Large Array (VLA). The position of any point in the sky is specified by the two spherical coordinates $\theta, \phi$. The VLA is an aperture synthesis radio telescope consisting of twenty-seven 25 m diameter dish antennas which can be moved along railway tracks to achieve a variety of relative spacings up to a maximum of 21 km. By this means, it can make images with an angular resolution equivalent to that of a telescope with a 21 km diameter aperture (0.08 arcseconds at $\lambda = 1$ cm).

The signal from each antenna is cross-correlated separately with the signals from all other antennas while all antennas track the same source. It can be shown that each cross-correlation is directly proportional to a two-dimensional Fourier component of $b(\theta, \phi)$. If there are $N$ antennas, there are $N(N-1)/2$ correlation pairs. The VLA records $27(27-1)/2 = 351$ Fourier components simultaneously. The FT of $b(\theta, \phi)$ is equal to $B(u, v)$. The quantities $u$ and $v$ are called spatial frequencies and have units $= 1/\theta$ where $\theta$ is in radians (dimensionless). Let $u = x/\lambda$ and $v = y/\lambda$ be the components of the projected separation of any pair of antennas on a plane perpendicular to the line of sight to the distant radio source in units of the observing wavelength.

In practice, one wants to measure the minimum number of Fourier components necessary to reconstruct $b(\theta, \phi)$, i.e., move the dish antennas along the railway tracks as little as possible. This is where the sampling theorem comes in handy. If the galaxy is known to have a finite angular width $\Delta\theta = \Delta\phi$, then from the sampling theorem, this means we only need to sample in $u$ and $v$ at intervals of
$$\Delta u = \Delta v \le \frac{1}{\Delta\theta}. \tag{B.45}$$
Note: in this problem, $\Delta\theta$ is the equivalent of $2f_c$ in Figure B.6 in the time-frequency problem. Thus, if $\Delta\theta \approx 10$ arcseconds $= 4.8 \times 10^{-5}$ radians,
$$\Delta u = \frac{\Delta x}{\lambda} = 20\,833 = \Delta v = \frac{\Delta y}{\lambda}. \tag{B.46}$$
If the wavelength of observation is $\lambda = 6$ cm, then $\Delta x = 1.25$ km, which means that the increment in antenna spacing required for complete reconstruction of $b(\theta, \phi)$ is 1.25 km. Since the antennas are 25 m in diameter, they could in principle be spaced at intervals of 25 m, and at that increment in spacing, we could obtain all the Fourier components (coefficients) necessary to reconstruct (synthesize) the image of a source of angular extent 8.3 arcmin. Because each antenna will shadow its neighbor at very close spacings, the limiting angular size that can be completely reconstructed is smaller than this.
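The numbers in this example can be reproduced with a few lines of arithmetic. The sketch below uses the unrounded arcsecond-to-radian conversion, so its spacing comes out slightly below the 1.25 km obtained in the text from the rounded value $4.8 \times 10^{-5}$ radians. Variable names are illustrative.

```python
import math

ARCSEC = math.pi / (180.0 * 3600.0)      # one arcsecond in radians

n_antennas = 27
baselines = n_antennas * (n_antennas - 1) // 2    # number of correlation pairs

delta_theta = 10.0 * ARCSEC              # source width, about 4.85e-5 rad
delta_u = 1.0 / delta_theta              # sampling interval in wavelengths, Eq. (B.45)
wavelength = 0.06                        # metres
delta_x = wavelength * delta_u           # antenna spacing increment in metres

# Largest source recoverable if the spacing increment equals the 25 m dish diameter.
max_extent_arcmin = math.degrees(wavelength / 25.0) * 60.0

print(baselines, delta_u, delta_x, max_extent_arcmin)
```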
B.7 Discrete Fourier Transform

B.7.1 Graphical development

The approach here is to develop the Discrete FT (abbreviated DFT) from a graphical derivation based on waveform sampling and the convolution theorem, following the treatment given by Brigham (1988). The steps in this derivation are illustrated in Figure B.7. Panel (a) of Figure B.7 shows the continuous FT pair $h(t)$ and $H(f)$. To obtain discrete samples, we multiply $h(t)$ by the bed of nails shown in the time domain in panel (b), which corresponds to convolving in the $f$ domain, with the result shown in (c). Note: the effect of sampling is to create a periodic version of $H(f)$. To represent the fact that we only want a finite number of samples, we multiply in the time domain by the rectangular window function of width $T_0$ of panel (d), which corresponds to convolving with its frequency transform. This has side lobes, which produce an undesirable ripple in our transform as seen in panel (e). One way to reduce the ripple is to use a tapered window function instead of a rectangle.

Finally, to manipulate a finite number of samples in the frequency domain, we multiply in the frequency domain by a bed of nails at a frequency interval $\Delta f = 1/T_0$ as shown in panels (f) and (g). After taking the DFT of our $N$ samples of $h(t)$, we obtain an $N$-sample approximation of $H(f)$ as shown in panel (g).

Note 1: sampling in the frequency domain results in a periodic version of $h(t)$ in the time domain. It is very important to be aware of this periodic property when executing convolution operations using the DFT.

Note 2: the location of the rectangular window function $w(t)$ is very important. Its width $T_0$ equals $NT$, where $T$ is the sample interval. If $w(t)$ had been located so that a sample value coincided with each end-point, the rectangular function would contain $N + 1$ sample values, and the convolution of $h(t)\,s(t)\,w(t)$ with the impulses spaced at intervals of $T_0$, as shown in panels (f) and (g), would result in time domain aliasing.

Figure B.7 Graphical development of the Discrete Fourier Transform. See discussion in Section B.7.1. [Figure not reproduced. Each panel shows a time function and its transform: (a) $h(t)$ and $H(f)$; (b) sampling nails $s_1(t)$ with spacing $T$ and $S_1(f)$ with spacing $1/T$; (c) $h(t)\,s_1(t)$ and $H(f) * S_1(f)$; (d) window $w(t)$ of width $T_0$ and $W(f)$; (e) $h(t)\,s_1(t)\,w(t)$ and $H(f) * S_1(f) * W(f)$; (f) $s_2(t)$ with spacing $T_0$ and $S_2(f)$ with spacing $1/T_0$; (g) the periodic, $N$-sample pair $\tilde{h}(t)$ and $\tilde{H}(f)$.]
B.7.2 Mathematical development of the DFT

We now estimate the FT of a function from a finite number of samples. Suppose we have $N$ equally spaced samples $h_k \equiv h(kT)$ at an interval of $T$ seconds, where $k = 0, 1, 2, \ldots, N-1$. From the discussion of the Nyquist sampling theorem, for a sample interval $T$, we can only obtain useful frequency information for $|f| < f_c$. We seek estimates at the discrete values
$$f_n \equiv \frac{n}{NT}, \qquad n = -\frac{N}{2}, \ldots, \frac{N}{2}, \tag{B.47}$$
where the upper and lower limits are $\pm f_c$. Counting $n = 0$, this range corresponds to $N + 1$ values of frequency, but only $N$ values will be unique.
$$\begin{aligned} H(f_n) &= \int_{-\infty}^{+\infty} h(t)\, e^{i 2\pi f_n t}\, dt \\ &\approx \sum_{k=0}^{N-1} h(kT)\, e^{i 2\pi f_n kT}\, T \\ &= T \sum_{k=0}^{N-1} h_k\, e^{i 2\pi n \Delta f\, kT} \\ &= T \sum_{k=0}^{N-1} h_k\, e^{i 2\pi n k/N} \\ &= T\, H(n \Delta f) = T\, H\!\left(\frac{n}{NT}\right) = T H_n, \end{aligned} \tag{B.48}$$
where $T$ is the sample interval and $h_k$, for $k = 0, \ldots, N-1$, are the sample values of the truncated $h(t)$ waveform. $H_n$ is defined as the DFT of $h_k$ and given by
$$H_n \equiv \sum_{k=0}^{N-1} h_k\, e^{i 2\pi n k/N}. \tag{B.49}$$
Defined in this way, $H_n$ does not depend on the sample interval $T$. The relationship between the DFT of a set of samples of a continuous function $h(t)$ at interval $T$, and the continuous FT of $h(t)$, can be written as
$$H(f_n) = T H_n. \tag{B.50}$$
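Equation (B.49) translates directly into code. The sketch below is a deliberately simple $O(N^2)$ implementation using the $e^{+i 2\pi n k/N}$ kernel of the text; the function names are hypothetical, and `continuous_estimate` simply applies the scale factor of Equation (B.50).

```python
import cmath
import math

def dft(h):
    """H_n = sum_k h_k exp(+i 2 pi n k / N), Equation (B.49)."""
    N = len(h)
    return [sum(h[k] * cmath.exp(2j * math.pi * n * k / N) for k in range(N))
            for n in range(N)]

def continuous_estimate(h, T):
    """Estimate of the continuous transform at f_n = n / (N T), Equation (B.50)."""
    return [T * Hn for Hn in dft(h)]

impulse = dft([1.0, 0.0, 0.0, 0.0])     # a single spike transforms to a flat spectrum
constant = dft([1.0, 1.0, 1.0, 1.0])    # a constant transforms to a spike at n = 0
print(impulse)
print(constant)
```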
We can show that since Equation (B.49) for $H_n$ is periodic, there are only $N$ distinct complex values computable. To show this, let $n = r + N$, where $r$ is an arbitrary integer from $0$ to $N-1$.
$$\begin{aligned} H_n = H\!\left(\frac{n}{NT}\right) &= \sum_{k=0}^{N-1} h_k\, e^{i 2\pi k (r+N)/N} \\ &= \sum_{k=0}^{N-1} h_k\, e^{i 2\pi k r/N}\, e^{i 2\pi k} \\ &= \sum_{k=0}^{N-1} h_k\, e^{i 2\pi k r/N} \\ &= H\!\left(\frac{r}{NT}\right) = H_r, \end{aligned} \tag{B.51}$$
since $e^{i 2\pi k} = \cos(2\pi k) + i \sin(2\pi k) = 1$ for $k$ an integer.

Until now, we have assumed that the index $n$ varies from $-N/2$ to $N/2$. Since $H_n$ is periodic, it follows that $H_{-n} = H_{N-n}$, so $H_{-N/2} = H_{N/2}$ and thus we only need $N$ values of $n$. It is customary to let $n$ vary from $0$ to $N-1$. Then $n = 0$ corresponds to the DFT at zero frequency and $n = N/2$ to the value at $f_c$. Values of $n$ between $N/2 + 1$ and $N - 1$ correspond to values of the DFT for negative frequencies from $-(N/2 - 1), -(N/2 - 2), \ldots, -1$. Thus, to display the DFT in the same way as an analytic transform is displayed ($-f$ on the left and $+f$ on the right), it is necessary to reorganize the DFT frequency values.
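The index bookkeeping described above can be sketched as follows. The helper names are hypothetical, $N$ is assumed even, and index $N/2$ is assigned to $+f_c$ as in the text.

```python
def dft_frequencies(N, T):
    """Physical frequency associated with each DFT index n = 0..N-1 (N even)."""
    return [(n if n <= N // 2 else n - N) / (N * T) for n in range(N)]

def reorder_for_display(values):
    """Put negative-frequency entries first, then zero, then positive frequencies up to f_c."""
    half = len(values) // 2
    return values[half + 1:] + values[:half + 1]

freqs = dft_frequencies(32, 0.25)
print(freqs[16], freqs[17], freqs[31])
print(reorder_for_display(freqs))
```

The same `reorder_for_display` call can be applied to the list of DFT values so that they line up with the reordered frequencies.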
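B.7.3 Inverse DFT

Again, our starting point is the integral FT:
$$\begin{aligned} h(kT) = h_k &= \int_{-\infty}^{+\infty} H(f)\, e^{-i 2\pi f kT}\, df \\ h(kT) &\approx \sum_{n=0}^{N-1} H(f_n)\, e^{-i 2\pi f_n kT}\, \Delta f. \end{aligned} \tag{B.52}$$
Now substitute $H(f_n) = T H_n$ and $\Delta f = 1/(NT)$:
$$\begin{aligned} h_k &= \sum_{n=0}^{N-1} T H_n\, e^{-i 2\pi f_n kT}\, \Delta f \\ h_k &= \frac{1}{N} \sum_{n=0}^{N-1} H_n\, e^{-i 2\pi f_n kT}. \end{aligned} \tag{B.53}$$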
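A round trip through the forward and inverse sums is a quick consistency check on Equations (B.49) and (B.53). In the sketch below, $f_n kT$ has been replaced by the equivalent $nk/N$; the function names and test data are illustrative.

```python
import cmath
import math

def dft(h):
    """Forward DFT, Equation (B.49): kernel exp(+i 2 pi n k / N)."""
    N = len(h)
    return [sum(h[k] * cmath.exp(2j * math.pi * n * k / N) for k in range(N))
            for n in range(N)]

def idft(H):
    """Inverse DFT, Equation (B.53): kernel exp(-i 2 pi n k / N) with a 1/N factor."""
    N = len(H)
    return [sum(H[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N)) / N
            for k in range(N)]

h = [0.0, 1.0, 0.5, -2.0, 3.0, 0.25, 0.0, 1.5]
recovered = idft(dft(h))
max_error = max(abs(a - b) for a, b in zip(h, recovered))
print(max_error)
```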
Note: the definition of the DFT pair given in Mathematica is the more symmetrical form
$$H_n = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} h_k\, e^{i 2\pi n k/N} \tag{B.54}$$
$$h_k = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} H_n\, e^{-i 2\pi f_n kT}. \tag{B.55}$$
Box B.1

The Fourier transform of a list of real or complex numbers, represented by u_i, is given by the Mathematica command

Fourier[{u1, u2, ..., un}]

The inverse Fourier transform is given by

InverseFourier[{u1, u2, ..., un}]

Mathematica can find Fourier transforms for data in any number of dimensions. In n dimensions, the data is specified by a list nested n levels deep. For two dimensions, often used in image processing, the command is

Fourier[{{u11, u12, ..., u1n}, {u21, u22, ..., u2n}, ...}]

An example of the use of the FFT in the convolution and deconvolution of an image is given in the accompanying Mathematica tutorial.
B.8 Applying the DFT

We have already developed the relationship between the discrete and continuous Fourier transforms. Here, we explore the mechanics of applying the DFT to the computation of Fourier transforms and Fourier series. The primary concern is one of correctly interpreting these results.
B.8.1 DFT as an approximate Fourier transform

To illustrate the application of the DFT to the computation of Fourier transforms, consider Figure B.8. Figure B.8(a) shows the real function $f(t)$, given by
$$f(t) = \begin{cases} 0, & t < 0 \\ e^{-t}, & t \ge 0. \end{cases} \tag{B.56}$$
We wish to compute by means of the DFT an approximation to the Fourier transform of this function. The first step in applying the discrete transform is to choose the number of samples $N$ and the sample interval $T$. For $T = 0.25$, we show the samples of $f(t)$ within the dashed rectangular window function in Figure B.8(b). Note: the start of the window function occurs $T/2$ ahead of the first sample so that there are only $N$ samples within the window, as discussed in Section B.7.1. Also, the value of the function at a discontinuity must be defined to be the mid-value if the inverse Fourier transform is to hold. Since the DFT assumes the function is periodic, we set this value equal to the average of the function value at both ends to avoid the discontinuity at $t = 0$. We next compute the Fourier transform using the DFT approximation
$$H(f_n) \approx T H_n = T\, H\!\left(\frac{n}{NT}\right), \tag{B.57}$$
where
$$H\!\left(\frac{n}{NT}\right) = \sum_{k=0}^{N-1} e^{-kT}\, e^{i 2\pi n k/N}, \qquad n = 0, 1, \ldots, N-1. \tag{B.58}$$

Figure B.8 A 32-point DFT of the function $f(t)$. The function itself is plotted in panel (a). Panel (b) illustrates the location of the 32 time samples within the rectangular window function (dashed box). Panel (c) compares the real part of the DFT to the continuous Fourier transform shown by the solid curve. The imaginary part of the DFT is compared to the continuous case (solid curve) in panel (d). Panel (e) illustrates the improved agreement obtained by halving the sample interval in time and doubling the number of samples. [Figure not reproduced. Panels (b) to (d) use $T = 0.25$, $N = 32$; panel (e) uses $T = 0.125$, $N = 64$. Horizontal axes: sample number $k$ in (b), frequency sample $n$ in (c) to (e), with the positive-frequency, Nyquist, and negative-frequency ranges marked.]
Note: the scale factor $T$ in Equation (B.57) is required to produce equivalence between the continuous and discrete transforms. These results are shown in panels (c) and (d) of Figure B.8. In Figure B.8(c), we show the real part of the Fourier transform as computed by Equation (B.58). The index $n = 0$ corresponds to zero frequency or the DC term, which is proportional to the data average. Note: the real part of the discrete transform is symmetrical about $n = N/2$, the Nyquist frequency sample. The real part of a Fourier transform of a real function is even and the imaginary part of the transform is odd. In Figure B.8(c), the results for the real part for $n > N/2$ are simply negative frequency results. For $T = 0.25$ s, the physical frequency associated with frequency sample $n = N/2$ is $1/(2T) = 2$ Hz. Sample $(N/2) + 1 = 17$ corresponds to a negative frequency $= -(N/2 - 1)/(NT) = -1.875$ Hz, and sample $n = 31$ corresponds to the frequency $= -1/(NT) = -0.125$ Hz.

The conventional method of displaying results of the discrete Fourier transform is to graph the results of Equation (B.58) as a function of the parameter $n$. As long as we remember that those results for $n > N/2$ actually relate to negative frequency results, then we should encounter no interpretation problems.

In panel (d) of Figure B.8, we illustrate the imaginary part of the Fourier transform and the discrete transform. As shown, the discrete transform approximates the continuous transform rather poorly at the higher frequencies. To reduce this error, it is necessary to decrease the sample interval $T$ and increase $N$. Panel (e) shows the improved agreement obtained by halving $T$ and doubling $N$ to 64 samples. We note that the imaginary function is odd with respect to $n = N/2$. Again, those results for $n > N/2$ are to be interpreted as negative frequency results.
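The calculation behind Figure B.8 can be sketched as follows, using the values $N = 32$, $T = 0.25$ from the text. The closed form used for comparison, $1/(1 - i 2\pi f)$, is the continuous transform of $e^{-t}$ ($t \ge 0$) under the $e^{+i 2\pi f t}$ convention; it is supplied here as a known result rather than quoted from the text, and the function names are illustrative.

```python
import cmath
import math

N, T = 32, 0.25

# Samples of f(t) = exp(-t) at t = kT.
samples = [math.exp(-k * T) for k in range(N)]
# At the discontinuity, use the average of the function values at the two window ends.
samples[0] = 0.5 * (1.0 + math.exp(-N * T))

def dft(h):
    """Direct DFT with the exp(+i 2 pi n k / N) kernel."""
    n_pts = len(h)
    return [sum(h[k] * cmath.exp(2j * math.pi * n * k / n_pts) for k in range(n_pts))
            for n in range(n_pts)]

approx = [T * Hn for Hn in dft(samples)]       # Equation (B.57)

def exact(f):
    """Continuous transform of exp(-t), t >= 0, with the exp(+i 2 pi f t) kernel."""
    return 1.0 / (1.0 - 2j * math.pi * f)

for n in (0, 1, 2, 4, 8):
    f_n = n / (N * T)
    print(n, f_n, approx[n], exact(f_n))
```

Printing higher values of $n$ shows the imaginary part drifting away from the continuous result as the Nyquist sample is approached, which is the behaviour described for panel (d).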
In summary, applying the discrete Fourier transform to the computation of the Fourier transform only requires that we exercise care in the choice of T and N and interpret the results correctly. For a worked example of the DFT using Mathematica, see the section entitled ‘‘Exercise on DFT, Zero Padding and Nyquist Sampling,’’ in the accompanying Mathematica tutorial.
B.8.2 Inverse discrete Fourier transform

Assume that we are given the continuous real and imaginary frequency functions considered in the previous discussion and that we wish to determine the corresponding time function by means of the inverse discrete Fourier transform
$$h(kT) = \Delta f \sum_{n=0}^{N-1} \left[ R(n \Delta f) + i\, I(n \Delta f) \right] e^{-i 2\pi n k/N} \qquad \text{for } k = 0, 1, \ldots, N-1, \tag{B.59}$$
where $\Delta f$ is the sample interval in frequency. Assume $N = 32$ and $\Delta f = 1/8$. Since we know that $R(f)$, the real part of the complex frequency function, must be an even function, we fold $R(f)$ about the frequency $f = 2.0$ Hz, which corresponds to the sample point $n = N/2$. As shown in Figure B.9(a), we simply sample the frequency function up to the point $n = N/2$ and then fold these values about $n = N/2$ to obtain the remaining samples. In Figure B.9(b), we illustrate the method of determining the $N$ samples of the imaginary part of the frequency function. Because the imaginary frequency function is odd, we must not only fold about the sample value $N/2$ but also flip the results. To preserve symmetry, we set the sample at $n = N/2$ to zero.

Computation of Equation (B.59) with the sampled function illustrated in Figures B.9(a) and (b) yields the inverse discrete Fourier transform. The result is a complex function whose imaginary part is approximately zero and whose real part is as shown in panel (c). We note that at $k = 0$, the result is approximately equal to the correct mid-value and reasonable agreement is obtained for all but the results for $k$ large. Improvement can be obtained by reducing $\Delta f$ and increasing $N$. The key to using the discrete inverse Fourier transform for obtaining an approximation to continuous results is to specify the sampled frequency functions correctly.
Figures B.9(a) and (b) illustrate this correct method. One should observe the scale factor $\Delta f$ which was required to give a correct approximation to continuous inverse Fourier transform results.

Figure B.9 Panels (a) and (b) illustrate the sampling of the real and imaginary parts of the continuous Fourier transform in readiness for computing the inverse DFT which is shown in panel (c). [Figure not reproduced. Panels (a) and (b) plot the real and imaginary DFT samples against frequency sample $n$, folded about $n = N/2$; panel (c) plots amplitude against sample number $k$. All panels use $T = 0.25$, $N = 32$.]
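The fold and fold-and-flip construction can be sketched in code. The example assumes that the frequency functions "considered in the previous discussion" are the real and imaginary parts of $1/(1 - i 2\pi f)$, the continuous transform of $e^{-t}$ ($t \ge 0$) under the $e^{+i 2\pi f t}$ convention; that closed form and all names below are assumptions made for illustration.

```python
import cmath
import math

N, df = 32, 1.0 / 8.0
T = 1.0 / (N * df)              # implied time sample interval, 0.25

def R(f):
    """Real part of 1 / (1 - i 2 pi f): an even function of f."""
    return 1.0 / (1.0 + (2.0 * math.pi * f) ** 2)

def I(f):
    """Imaginary part of 1 / (1 - i 2 pi f): an odd function of f."""
    return 2.0 * math.pi * f / (1.0 + (2.0 * math.pi * f) ** 2)

G = []
for n in range(N):
    if n < N // 2:
        G.append(complex(R(n * df), I(n * df)))     # sample directly
    elif n == N // 2:
        G.append(complex(R(n * df), 0.0))           # imaginary part set to zero at n = N/2
    else:
        m = N - n
        G.append(complex(R(m * df), -I(m * df)))    # fold (real), fold and flip (imaginary)

# Equation (B.59)
h = [df * sum(G[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N))
     for k in range(N)]

for k in (0, 1, 4, 8):
    print(k, h[k].real, math.exp(-k * T))
```

The value printed for $k = 0$ should be compared with the mid-value 0.5 rather than with $e^{0} = 1$.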
B.9 The Fast Fourier Transform

The FFT (Cooley and Tukey, 1965) is a very efficient method of implementing the DFT that removes certain redundancies in the computation and greatly speeds up the calculation of the DFT. Consider the DFT
$$A(n) = \sum_{k=0}^{N-1} x(k)\, e^{i 2\pi n k/N} \qquad (n = 0, 1, \ldots, N-1), \tag{B.60}$$
where we have replaced $kT$ by $k$ and $n/NT$ by $n$ for convenience of notation. Let $w = e^{i 2\pi/N}$. For $N = 4$, Equation (B.60) can be written in matrix form:
$$\begin{bmatrix} A(0) \\ A(1) \\ A(2) \\ A(3) \end{bmatrix} = \begin{bmatrix} w^0 & w^0 & w^0 & w^0 \\ w^0 & w^1 & w^2 & w^3 \\ w^0 & w^2 & w^4 & w^6 \\ w^0 & w^3 & w^6 & w^9 \end{bmatrix} \begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \end{bmatrix} \tag{B.61}$$
or
$$A(n) = w^{nk}\, x(k). \tag{B.62}$$
It is clear from the matrix representation that since $w$ and possibly $x(k)$ are complex, $N^2$ complex multiplications and $N(N-1)$ complex additions are necessary to perform the required matrix computation. The FFT owes its success to the fact that the algorithm reduces the number of multiplications from $N^2$ to $N \log_2 N$. For example, if $N = 1024 = 2^{10}$,

$N^2 = 2^{20}$ operations in DFT
$N \log_2 N = 2^{10} \times 10$ operations in FFT.

This amounts to a factor of 100 reduction in computer time, and round-off errors are also reduced.

How does it work? The FFT takes an $N$-point transform and splits it into two $N/2$-point transforms. This is already a saving, since $2(N/2)^2 < N^2$. The $N/2$-point transforms are not computed directly, but are each split into two $N/4$-point transforms. It takes $\log_2 N$ of these splittings, so that generating the $N$-point transform takes a total of approximately $N \log_2 N$ operations rather than $N^2$. The mathematics involves a splitting of the data set $x(k)$ into even and odd labeled points, $y(k)$ and $z(k)$.
Let
$$y(k) = x(2k), \qquad z(k) = x(2k + 1) \qquad \text{for } k = 0, 1, \ldots, N/2 - 1. \tag{B.63}$$
Equation (B.60) can be rewritten as:
$$\begin{aligned} A(n) &= \sum_{k=0}^{N/2-1} \left\{ y(k)\, e^{i 4\pi n k/N} + z(k)\, e^{i 2\pi n (2k+1)/N} \right\} \\ &= \sum_{k=0}^{N/2-1} y(k)\, e^{i 4\pi n k/N} + e^{i 2\pi n/N} \sum_{k=0}^{N/2-1} z(k)\, e^{i 4\pi n k/N}. \end{aligned} \tag{B.64}$$
This will still generate the whole set $A(n)$ if $n$ is allowed to vary over the full range ($0 \le n \le N-1$). First, let $n$ vary over ($0 \le n \le N/2 - 1$). Then
$$A(n) = B(n) + C(n)\, w^n \qquad (\text{valid for } 0 \le n \le N/2 - 1), \tag{B.65}$$
where
$$B(n) = \sum_{k=0}^{N/2-1} y(k)\, w^{2nk} \quad \text{and} \quad C(n) = \sum_{k=0}^{N/2-1} z(k)\, w^{2nk} \qquad \text{for } n = 0, \ldots, \frac{N}{2} - 1. \tag{B.66}$$
But since $B(n)$ and $C(n)$ are periodic in the half-interval, generating $A(n)$ for the second half may be done without further computing using the same $B(n)$ and $C(n)$:
$$\begin{aligned} A\!\left(n + \frac{N}{2}\right) &= B(n) + C(n)\, w^n\, w^{N/2} \\ &= B(n) - C(n)\, w^n \qquad (0 \le n \le N/2 - 1), \end{aligned} \tag{B.67}$$
where
$$w^{N/2} = e^{i\pi} = \cos \pi = -1. \tag{B.68}$$
The work of computing an $N$-point transform $A(n)$ has been reduced to computing two $N/2$-point transforms $B(n)$ and $C(n)$ and appropriate multiplicative phase factors $w^n$. Each of these sub-sequences $y(k)$ and $z(k)$ can be further subdivided, with each step involving a further reduction in operations. These reductions can be carried out as long as the length of the original sequence is a power of 2. Consider $N = 8 = 2^3$. In 3 divisions we go from $1 \times 8$, to $2 \times 4$, to $4 \times 2$, to $8 \times 1$. Note: the DFT of one term is simply the term itself, i.e.,
$$A(0) = \sum_{k=0}^{0} x(k)\, e^{i 2\pi k n/N} = x(0), \qquad \text{for } n = 0. \tag{B.69}$$
In the above we have assumed $N$ is a power of 2 (2 is called the radix, $r$). One can use other values for the radix, e.g., $N = 4^2$.
$$\text{Speed enhancement:} \ \simeq \frac{N \log_r N}{N^2}.$$
In addition to the splitting into sub-sequences, the FFT also makes use of periodicities in the exponential term $w^{nk}$ to eliminate redundant operations.¹
$$w^{nk} = w^{nk \bmod N} \tag{B.70}$$
e.g., if $N = 4$, $n = 2$ and $k = 3$, then
$$\begin{aligned} w^{nk} = w^{6} &= \exp\!\left[\frac{i 2\pi}{4}(6)\right] = \exp(i 3\pi) \\ &= \exp(i\pi) = \exp\!\left[\frac{i 2\pi}{4}(2)\right] = w^{2} \end{aligned} \tag{B.71}$$
$$w^{2} = -w^{0}.$$
Note: the FFT is not an approximation but a method of computing which reduces the work by recognizing symmetries and by not repeating redundant operations.
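Equations (B.63) to (B.69) map onto a short recursive routine. The sketch below is a plain radix-2 implementation written for clarity rather than speed, using the $e^{+i 2\pi n k/N}$ kernel of Equation (B.60); it is checked against a direct evaluation of the DFT sum. The function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Direct evaluation of Equation (B.60): N^2 complex multiplications."""
    N = len(x)
    return [sum(x[k] * cmath.exp(2j * math.pi * n * k / N) for k in range(N))
            for n in range(N)]

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return [complex(x[0])]              # Equation (B.69): one term is its own DFT
    if N % 2:
        raise ValueError("length must be a power of 2")
    B = fft(x[0::2])                        # transform of y(k) = x(2k)
    C = fft(x[1::2])                        # transform of z(k) = x(2k + 1)
    A = [0j] * N
    for n in range(N // 2):
        term = cmath.exp(2j * math.pi * n / N) * C[n]   # w^n C(n)
        A[n] = B[n] + term                  # Equation (B.65)
        A[n + N // 2] = B[n] - term         # Equation (B.67), using w^(N/2) = -1
    return A

x = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 3.0, 1.5]
max_diff = max(abs(a - b) for a, b in zip(fft(x), dft(x)))
print(max_diff)

# Periodicity of the phase factor, Equation (B.70), for N = 4, n = 2, k = 3.
w = cmath.exp(2j * math.pi / 4)
print(abs(w ** 6 - w ** (6 % 4)))
```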
B.10 Discrete convolution and correlation

One of the most common uses of the FFT is for computing convolutions and correlations of two time functions. Discrete convolution can be written as
$$s(kT) = \sum_{i=0}^{N-1} h(iT)\,r[(k-i)T] = h(kT) * r(kT). \tag{B.72}$$
According to the Discrete Convolution Theorem,
$$\sum_{i=0}^{N-1} h(iT)\,r[(k-i)T] \Longleftrightarrow H\left(\frac{n}{NT}\right) R\left(\frac{n}{NT}\right). \tag{B.73}$$
Note: the discrete convolution theorem assumes that $h(kT)$ and $r(kT)$ are periodic since the DFT is only defined for periodic functions of time. Usually, one is interested in convolving non-periodic functions. This can be accomplished with the DFT by the use of zero padding which is discussed in the next section. Reiterating, discrete convolution is only a special case of continuous convolution; discrete convolution assumes both functions repeat outside the sampling window.
¹ For example, $I \bmod m = I - \mathrm{Int}\left(\dfrac{I}{m}\right) m$, so that $6 \bmod 4 = 6 - \mathrm{Int}\left(\dfrac{6}{4}\right) \times 4 = 2$.
To efficiently compute the discrete convolution:
1. Use an FFT algorithm to compute $R(n)$ and $H(n)$.
2. Multiply the two transforms together, remembering that the transforms consist of complex numbers.
3. Then use the FFT to inverse transform the product.
4. The answer is the desired convolution $h(k) * r(k)$.
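The four steps above can be sketched as follows. To keep the example self-contained, a direct $O(N^2)$ transform stands in for the FFT, the forward transform is assumed to use $e^{-i2\pi nk/N}$, and the sequences `h` and `r` are hypothetical; the result is checked against the periodic sum of Equation (B.72).

```python
import cmath

def dft(x, sign=-1):
    """Direct DFT (sign=-1) or unnormalised inverse (sign=+1); stands in for an FFT."""
    N = len(x)
    return [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * n * k / N) for k in range(N))
            for n in range(N)]

def convolve_via_transform(h, r):
    N = len(h)
    H, R = dft(h), dft(r)                          # step 1: transform both sequences
    S = [a * b for a, b in zip(H, R)]              # step 2: complex product
    return [v.real / N for v in dft(S, sign=+1)]   # step 3: inverse transform

def convolve_direct(h, r):
    """Periodic (wrap-around) form of Equation (B.72)."""
    N = len(h)
    return [sum(h[i] * r[(k - i) % N] for i in range(N)) for k in range(N)]

# Hypothetical sequences of equal length
h = [1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0]
r = [0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
s_transform = convolve_via_transform(h, r)
s_direct = convolve_direct(h, r)
```

The index `(k - i) % N` in the direct sum makes the periodicity assumption explicit; it is the source of the wrap-around effects treated later by zero padding.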
If both time functions are real (generally so), both of their transforms can be taken simultaneously. For details see Press et al. (1992).

What about deconvolution? One is usually more interested in the signal $h(kT)$ before it is smeared by the instrumental response. Deconvolution is the process of undoing the smearing of the data due to the effect of a known response function. Deconvolution in the frequency domain consists of dividing the transform of the convolution by $R(n)$, e.g.,
$$H(n) = \frac{S(n)}{R(n)}, \tag{B.74}$$
and then transforming back to obtain $h(k)$. This procedure can go wrong mathematically if $R(n)$ is zero for some value of $n$, so that we can't divide by it. This indicates that the original convolution has truly lost all information at that one frequency so that reconstruction of that component is not possible. Apart from this mathematical problem, the process is generally very sensitive to noise in the input data and to the accuracy to which $r(k)$ is known. This is the subject of the next section.
B.10.1 Deconvolving a noisy signal

We already know how to deconvolve the effects of the response function, $r(k)$ (short for $r(kT)$), of the measurement device, in the absence of noise. We transform the measured output, $s(k)$, and the response, $r(k)$, to the frequency domain yielding $S(n)$ (short for $S(n\Delta f)$) and $R(n)$. The transform, $H(n)$, of the desired signal, $h(k)$, is given by Equation (B.74). Even without additive noise, this can fail because for some $n$, $R(n)$ may equal 0. The solution in this case is
$$H(n) = \frac{S(n)}{R(n) + \epsilon}, \tag{B.75}$$
where $\epsilon$ is very small compared to the maximum value of $R(n)$ and $\epsilon/R(n) >$ the machine precision.

Panel (a) in Figure B.10 shows our earlier result (see Figure B.4) of convolving the image of a galaxy with the response (beam pattern) of a radio telescope. Panel (b) shows the deconvolved version assuming perfect knowledge of $R(n)$ and a value of $\epsilon = 10^{-8}$. In practice, we will only be able to determine $R(n)$ to a certain accuracy, which will limit the accuracy of the deconvolution. In panel (c), we have added independent Gaussian noise to (a), and panel (d) shows the best reconstruction obtained by varying the size of $\epsilon$, which occurs for $\epsilon \approx 0.15$. We now investigate the use of an optimal Wiener filter to improve upon the reconstruction when noise is present.

Figure B.10 Panel (a) shows the earlier result (see Figure B.4) of convolving a model galaxy with the point spread function of a radio telescope. Panel (b) shows the deconvolved galaxy image. Panel (c) is the same as (a) but with added Gaussian noise. Panel (d) is the best deconvolved image without any filtering of noise. Panel (e) shows the result of deconvolution using an optimal Wiener filter. Each panel plots intensity against telescope position $\theta_0$.
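A minimal sketch of the noise-free deconvolution of Equation (B.75) follows. The signal `h`, the response `r`, and the value of `eps` are hypothetical, chosen so that $R(n)$ is never zero; a direct transform stands in for the FFT, and the forward transform is assumed to use $e^{-i2\pi nk/N}$.

```python
import cmath

def dft(x, sign=-1):
    """Direct DFT (sign=-1) or unnormalised inverse (sign=+1); stands in for an FFT."""
    N = len(x)
    return [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * n * k / N) for k in range(N))
            for n in range(N)]

N = 16
h = [0.0] * N
h[3], h[4], h[5] = 1.0, 2.0, 1.0           # hypothetical "true" signal
r = [0.6, 0.3, 0.1] + [0.0] * (N - 3)      # hypothetical response with |R(n)| >= 0.2

H, R = dft(h), dft(r)
S = [a * b for a, b in zip(H, R)]          # transform of the noise-free output s(k)

eps = 1e-8                                 # small compared with max |R(n)| = 1
H_est = [s / (rr + eps) for s, rr in zip(S, R)]     # Equation (B.75)
h_est = [v.real / N for v in dft(H_est, sign=+1)]   # back to the time domain
max_err = max(abs(a - b) for a, b in zip(h_est, h))
```

With noise-free data the recovered `h_est` agrees with `h` to roughly the size of `eps`; once noise is added, a much larger `eps` or a filter is needed, as discussed in the text.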
B.10.2 Deconvolution with an optimal Wiener filter

If additive noise is present, the output from the measurement system is now $c(k)$, where
$$c(k) = s(k) + n(k). \tag{B.76}$$
The task is to find the optimum filter $\phi(k)$ or $\Phi(n)$ which, when applied to $C(n)$, the transform of the measured signal $c(k)$, and then divided by $R(n)$, produces an output $\tilde{H}(n)$ that is closest to $H(n)$ in a least-squares sense. This translates to the equation
$$\sum_{n} \left| \frac{[S(n) + N(n)]\,\Phi(n)}{R(n)} - \frac{S(n)}{R(n)} \right|^{2} = \sum_{n} |R(n)|^{-2} \left\{ |S(n)|^{2}\,|1 - \Phi(n)|^{2} + |N(n)|^{2}\,|\Phi(n)|^{2} \right\} = \text{minimum}. \tag{B.77}$$
If the signal $S(n)$ and noise $N(n)$ are uncorrelated, their cross product when summed over $n$ can be ignored, which gives the second form. Equation (B.77) will be a minimum if and only if the sum is minimized with respect to $\Phi(n)$ at every value of $n$. Differentiating with respect to $\Phi$ and setting the result equal to zero gives
$$\Phi(n) = \frac{|S(n)|^{2}}{|S(n)|^{2} + |N(n)|^{2}}. \tag{B.78}$$
The solution contains $S(n)$ and $N(n)$ but not $C(n)$, the transform of the measured quantity $c(k)$. We happen to know $S(n)$ and $N(n)$ because we are working with a simulation. In general, we only know $C(n)$, so we estimate $S(n)$ and $N(n)$ in the following way. Figure B.11 shows the log of $|C(n)|^{2}$, $|S(n)|^{2}$ and $|N(n)|^{2}$ in panels (a), (b), (c), respectively. For small $n$, $|C(n)|^{2}$ has the same shape as $|S(n)|^{2}$, while at large $n$, it looks like the noise spectrum. If we only had $|C(n)|^{2}$, we could estimate $|S(n)|^{2}$ by extrapolating the spectrum at high values of $n$ to zero. Similarly, we can estimate $|N(n)|^{2}$ by extrapolating back into the signal region. Panel (d) shows the resulting optimal filter given by Equation (B.78). Where $|S(n)|^{2} \gg |N(n)|^{2}$, $\Phi(n) \approx 1$, and when the noise spectrum dominates, $\Phi(n) \approx 0$. Panel (e) of Figure B.10 shows the reconstruction $\tilde{H}(n)$ obtained using Equation (B.79):
$$\tilde{H}(n) = \frac{C(n)\,\Phi(n)}{R(n)}. \tag{B.79}$$
We investigate other approaches to image reconstruction in Chapter 8.
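The filter of Equation (B.78) and the reconstruction of Equation (B.79) can be tried on a small simulation. The sketch below is not the galaxy example of Figure B.10: the Gaussian signal, Gaussian beam, noise level, and random seed are all hypothetical, a direct transform stands in for the FFT, and, as in the text, the true $S(n)$ and $N(n)$ are used because this is a simulation.

```python
import cmath
import math
import random

def dft(x, sign=-1):
    """Direct DFT (sign=-1) or unnormalised inverse (sign=+1); stands in for an FFT."""
    N = len(x)
    return [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * n * k / N) for k in range(N))
            for n in range(N)]

N = 64
h = [math.exp(-0.5 * ((k - 20) / 3.0) ** 2) for k in range(N)]        # model signal
r = [math.exp(-0.5 * (min(k, N - k) / 2.0) ** 2) for k in range(N)]   # beam centred on k = 0
total = sum(r)
r = [v / total for v in r]

random.seed(1)
noise = [random.gauss(0.0, 0.01) for _ in range(N)]                   # additive noise n(k)

H, R, Nn = dft(h), dft(r), dft(noise)
S = [a * b for a, b in zip(H, R)]
C = [s + n for s, n in zip(S, Nn)]          # transform of c(k) = s(k) + n(k), by linearity

# Optimal filter, Equation (B.78)
Phi = [abs(s) ** 2 / (abs(s) ** 2 + abs(n) ** 2) for s, n in zip(S, Nn)]

def rms_error(H_est):
    """RMS difference between the reconstructed and true signal in the time domain."""
    h_est = [v.real / N for v in dft(H_est, sign=+1)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h_est, h)) / N)

err_naive = rms_error([c / rr for c, rr in zip(C, R)])                    # Equation (B.74)
err_wiener = rms_error([c * p / rr for c, p, rr in zip(C, Phi, R)])       # Equation (B.79)
```

Because $R(n)$ becomes very small at high frequencies for a smooth beam, the unfiltered division amplifies the noise there, while $\Phi(n)$ suppresses exactly those channels.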
Figure B.11 The figure shows the log of $|C(n)|^{2}$, $|S(n)|^{2}$ and $|N(n)|^{2}$ in panels (a), (b), (c), respectively, plotted against frequency channel. Panel (d) shows the optimal Wiener filter $\Phi(n)$.
B.10.3 Treatment of end effects by zero padding

Since the discrete convolution theorem assumes that the response function is periodic, it falsely pollutes some of the initial channels with data from the far end because of the wrapped-around response arising from the assumed periodic nature of the response function, $h(\tau)$. Although the convolution is carried out by multiplying the Fast Fourier Transforms of $h(\tau)$ and $x(\tau)$ and then inverse transforming back to the time domain, the polluting effect of the wrap-around is best illustrated by analyzing the situation completely in the time domain.

Figure B.12 illustrates the convolution of $x(\tau)$, shown in panel (b), by an exponential decaying response function shown in panel (a). First, the response function is folded about $\tau = 0$, causing it to disappear from the left of panel (c). Since the DFT assumes that $h(\tau)$ is periodic, a wrap-around copy of $h(-\tau)$ appears at the right of the panel. To compute the convolution at $t$, we shift the folded response to the right by $t$, multiply $h(t - \tau)$ and $x(\tau)$ and integrate. Panel (e) shows the resulting polluted convolution.

To avoid polluting the initial samples, we add a buffer zone of zeros at the far end of the data stream. The width of this zero padding is equal to the maximum wrap-around of the response function (see Figure B.13). Note: if we increase $N$ for the data stream, we must also add zeros to the response to make up the same number of samples. The wrap-around response shown in panels (c) and (d) of Figure B.13 is multiplied by zeros and so does not pollute the convolution, as shown in panel (e).
Figure B.12 Wrap-around effects in FFT convolution. Panels: (a) response $h(\tau)$; (b) data $x(\tau)$; (c) folded response $h(-\tau)$ with its wrap-around copy; (d) shifted response $h(t - \tau)$; (e) resulting convolution $y(t)$.
A Mathematica example of zero padding in the convolution of a two-dimensional image is given in the accompanying Mathematica tutorial.
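The effect of the buffer zone of zeros can also be checked in one dimension with a short sketch. The data `x` and the decaying response `h` below are hypothetical, a direct transform stands in for the FFT, and the padding width is taken to be one sample less than the length of the response, which is the maximum wrap-around in this discrete setting.

```python
import cmath

def dft(x, sign=-1):
    """Direct DFT (sign=-1) or unnormalised inverse (sign=+1); stands in for an FFT."""
    N = len(x)
    return [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * n * k / N) for k in range(N))
            for n in range(N)]

def periodic_convolve(h, x):
    """Convolution via the discrete convolution theorem (both inputs the same length)."""
    N = len(x)
    Y = [a * b for a, b in zip(dft(h), dft(x))]
    return [v.real / N for v in dft(Y, sign=+1)]

def linear_convolve(h, x):
    """Ordinary non-periodic convolution, used here as the reference."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y

x = [0.0] * 12 + [1.0, 2.0, 3.0, 4.0]       # hypothetical data with signal at the far end
h = [1.0, 0.5, 0.25, 0.125, 0.0625]         # hypothetical decaying response, M = 5 samples
N, M = len(x), len(h)
reference = linear_convolve(h, x)

# No buffer zone: the response is only extended with zeros to N samples.
unpadded = periodic_convolve(h + [0.0] * (N - M), x)

# Buffer of M - 1 zeros on the data; the response is padded to the same total length.
L = N + M - 1
padded = periodic_convolve(h + [0.0] * (L - M), x + [0.0] * (M - 1))

pollution = max(abs(unpadded[k] - reference[k]) for k in range(M - 1))
residual = max(abs(padded[k] - reference[k]) for k in range(L))
```

Here `pollution` measures how far the first few channels of the unpadded result depart from the true convolution, and `residual` shows that the padded result agrees with it everywhere.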
Figure B.13 Removal of wrap-around effects in FFT convolution by zero padding. Panels: (a) response $h(\tau)$ with zero pad; (b) data $x(\tau)$ with zero pad; (c) folded response $h(-\tau)$ with its wrap-around copy; (d) shifted response $h(t - \tau)$; (e) resulting convolution $y(t)$.

B.11 Accurate amplitudes by zero padding

The FFT of $h(k)$ produces a spectrum $H(n)$ in which any intrinsically narrow spectral feature is broadened by convolution with the Fourier transform of the window function. For a rectangular window, this usually results in only two samples to define a spectral peak in $H(n)$, and the true amplitude of the peak is usually underestimated unless by chance one of these samples lies at the center of the peak. This situation is illustrated in Figure B.14. Panel (a) shows 12 samples of a sine wave taken within a rectangular window function indicated by the dashed lines. The actual samples are represented by the vertical solid lines. In panel (b), the DFT (vertical lines) is compared to the analytic Fourier transform of the windowed continuous sine wave. In this particular example, only two DFT components are visible; the others fall on the zeros of the analytic transform.

Doubling the length of the window function to $2T_0$ causes the samples to be more closely spaced in the frequency domain, $\Delta f = 1/(2T_0)$, but at the same time, the transform of the window function becomes narrower by the same factor. The net effect is to not increase the number of samples within the spectral peak. Demonstrate this for yourself by executing the section entitled "Exercise on the DFT, Zero Padding and Nyquist Sampling," in the accompanying Mathematica tutorial.

Consider what happens when we append $3N$ zeros to $h(k)$, which has been windowed by a rectangular function $N$ samples long. This situation is illustrated in panels (c) and (d) of Figure B.14. There are now four times as many frequency components to define the spectrum. Even in this situation, noticeable differences between the DFT and analytic transform start to appear at larger values of $f$. In the top panel, we see 12 samples of the data and, to the right, the magnitude of the discrete transform. In the bottom panel, we have four times the number of points to be transformed by adding 36 zeros to the 12 original data points. Now the spectral peak remains the same in size but we have four times as many points defining the peak.
Figure B.14 How to obtain more accurate DFT amplitudes by zero padding. The left-hand panels (a) and (c) show the data samples $y$ against $t$, without and with zero padding; the right-hand panels (b) and (d) show the magnitude of the Fourier transform. The frequency axis in the two right-hand panels is in units of $1/T$, the sampling frequency. On this scale the Nyquist frequency $= 0.5$.
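The amplitude argument can be illustrated numerically. In this sketch (not the book's example) the sine wave has unit amplitude and a hypothetical frequency chosen to fall midway between two channels of the unpadded 12-point DFT, which is the least favourable case; the peak magnitude is converted to an amplitude estimate by dividing by $N/2$.

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitudes |H_n| of a direct DFT; stands in for an FFT."""
    N = len(x)
    return [abs(sum(x[k] * cmath.exp(-2j * cmath.pi * n * k / N) for k in range(N)))
            for n in range(N)]

N = 12
f0 = 2.5 / N    # cycles per sample, midway between channels 2 and 3 of the 12-point DFT
samples = [math.sin(2 * math.pi * f0 * k) for k in range(N)]   # unit-amplitude sine

# A sinusoid of amplitude A sampled N times gives a peak magnitude close to A * N / 2,
# so dividing the largest |H_n| by N / 2 gives an amplitude estimate.
amp_unpadded = max(dft_magnitudes(samples)) / (N / 2)
amp_padded = max(dft_magnitudes(samples + [0.0] * (3 * N))) / (N / 2)   # append 3N zeros
```

With the zeros appended, one of the finer frequency samples lands on the center of the peak and `amp_padded` recovers the true amplitude, while `amp_unpadded` falls well short of it.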
B.12 Power-spectrum estimation

The measurement of power spectra is a difficult and often misunderstood topic. Because the FFT yields frequency and amplitude information, many investigators proceed to estimate the power spectrum from the magnitude of the FFT. If the waveform is periodic or deterministic, then the correct interpretation of the FFT result is likely. However, when the waveforms are random processes, it is necessary to develop a statistical approach to amplitude estimation. We will consider two approaches to the subject in this section, and Chapter 13 provides a powerful Bayesian viewpoint of spectral analysis. We start by introducing Parseval's theorem.
B.12.1 Parseval's theorem and power spectral density

Parseval's theorem states that the energy in a waveform $h(t)$ computed in the time domain must equal the energy as computed in the frequency domain:
$$\text{Energy} = \int_{-\infty}^{\infty} h^{2}(t)\,dt = \int_{-\infty}^{\infty} |H(f)|^{2}\,df. \tag{B.80}$$
From this equation, it is clear that $|H(f)|^{2}$ is an energy spectral density. Frequently, one wants to know "how much energy" is contained in the frequency interval between $f$ and $f + df$. In such circumstances, one does not usually distinguish between $+f$ and $-f$, but rather regards $f$ as varying from 0 to $+\infty$. In such cases, we define the one-sided energy spectral density (ESD) of the function $h(t)$ as
$$E_h(f) \equiv |H(f)|^{2} + |H(-f)|^{2}, \qquad 0 \le f < \infty. \tag{B.81}$$
When $h(t)$ is real, the two terms are equal, so
$$E_h(f) = 2|H(f)|^{2}. \tag{B.82}$$
If $h(t)$ goes endlessly from $-\infty < t < \infty$, then its ESD will, in general, be infinite. Of interest then is the one-sided ESD per unit time, or power spectral density (PSD). This is computed from a long but finite stretch of $h(t)$. The PSD is computed for a function equal to $h(t)$ within the finite stretch and zero elsewhere, with the result divided by the length of the stretch used. Parseval's theorem in this case states that the integral of the one-sided PSD over positive frequency is equal to the mean-square amplitude of the signal $h(t)$.

Proof of Parseval's theorem using the convolution theorem: the FT of $h(t)\,h(t)$ is $H(f) * H(f)$. That is,
$$\int_{-\infty}^{\infty} h^{2}(t)\,e^{-i2\pi\sigma t}\,dt = \int_{-\infty}^{\infty} H(f)\,H(\sigma - f)\,df.$$
Setting $\sigma = 0$ yields
$$\int_{-\infty}^{\infty} h^{2}(t)\,dt = \int_{-\infty}^{\infty} H(f)\,H(-f)\,df = \int_{-\infty}^{\infty} |H(f)|^{2}\,df. \qquad \text{QED} \tag{B.83}$$
The last equality follows since
$$H(f) = R(f) + iI(f) \quad \text{and thus} \quad H(-f) = R(-f) + iI(-f). \tag{B.84}$$
For $h(t)$ real, $R(f)$ is even and $I(f)$ is odd, and
$$H(-f) = R(f) - iI(f) = H^{*}(f). \tag{B.85}$$
B.12.2 Periodogram power-spectrum estimation

A common approach used to estimate the spectrum of $h(t)$ is by means of the periodogram, also referred to as the Schuster periodogram after Schuster, who first introduced the method in 1898. Let
$$\hat{P}_p(f) = \frac{1}{L} \left| \int_{-L/2}^{L/2} h(t)\,e^{-i2\pi ft}\,dt \right|^{2}, \tag{B.86}$$
where the subscript $p$ denotes periodogram estimate and $L$ is the length of the data set. An FFT is normally used to compute this. We define the power spectral density, $P(f)$, as follows:
$$P(f) = \lim_{L \to \infty} \frac{1}{L} \left| \int_{-L/2}^{L/2} h(t)\,e^{-i2\pi ft}\,dt \right|^{2}. \tag{B.87}$$
We now develop another power-spectrum estimator which is in common use.
B.12.3 Correlation spectrum estimation

Let $h(t)$ be a random function of time (it could be the sum of a deterministic function and noise). In contrast to a pure deterministic function, future values of a random function cannot be predicted exactly. However, it is possible that the value of the random function at time $t$ influences the value at a later time $t + \tau$. One way to express this statistical characteristic is by means of the autocorrelation function, which for this purpose is given by
$$\rho(\tau) = \lim_{L \to \infty} \frac{1}{L} \int_{-L/2}^{+L/2} h(t)\,h(t + \tau)\,dt, \tag{B.88}$$
where $h(t)$ extends from $-L/2$ to $+L/2$. In the limit of $L = \infty$, the power spectral density function $P(f)$ and the autocorrelation function $\rho(\tau)$ are a Fourier transform pair:
$$\rho(\tau) = \int_{-\infty}^{+\infty} P(f)\,e^{i2\pi f\tau}\,df \Longleftrightarrow P(f) = \int_{-\infty}^{+\infty} \rho(\tau)\,e^{-i2\pi f\tau}\,d\tau. \tag{B.89}$$
Proof: From Equation (B.87) and the correlation theorem, we have
$$\begin{aligned}
P(f) &\propto \left| \int_{-\infty}^{\infty} h(t)\,e^{-i2\pi ft}\,dt \right|^{2} \\
&= \mathrm{FT}[h(t)]\;\mathrm{FT}^{*}[h(t)] \\
&= \mathrm{FT}\left[ \int_{-\infty}^{\infty} h(t)\,h(t + \tau)\,dt \right] \\
&= \mathrm{FT}[\rho(\tau)].
\end{aligned} \tag{B.90}$$
In the literature, $P(f)$ is called by many terms including: power spectrum, spectral density function, and power spectral density function (PSD). If the autocorrelation function is known, then the power spectrum is determined directly from its Fourier transform.
Since $h(t)$ is known only over a finite interval, we estimate $\rho(\tau)$ based on this finite duration of data. The estimator generally used is
$$\hat{\rho}(\tau) = \frac{1}{L - |\tau|} \int_{0}^{L - |\tau|} h(t)\,h[t + |\tau|]\,dt, \qquad |\tau| < L, \tag{B.91}$$
where $h(t)$ is known only over length $L$. Notice that $P(f)$ cannot be calculated since $\hat{\rho}(\tau)$ is undefined for $|\tau| > L$. However, consider the quantity $w(\tau)\,\hat{\rho}(\tau)$, where $w(\tau)$ is a window function which is non-zero for $|\tau| \le L$ and zero elsewhere. The modified autocorrelation function $w(\tau)\,\hat{\rho}(\tau)$ exists for all $\tau$ and hence its FT exists:
$$\hat{P}_c(f) = \int_{-\infty}^{\infty} w(\tau)\,\hat{\rho}(\tau)\,e^{-i2\pi f\tau}\,d\tau, \tag{B.92}$$
where $w(\tau) = 1$ for $|\tau| < L$ and is zero elsewhere. $\hat{P}_c(f)$ is called the correlation or lagged-product estimator of the PSD. This approach to spectral analysis is commonly referred to as the Blackman–Tukey procedure (Blackman and Tukey, 1958). An instrument for estimating the PSD in this way is called an autocorrelation spectrometer. Such instruments are very common in the field of radio astronomy and especially in aperture synthesis telescopes, where it is already necessary to cross-correlate the signals from different pairs of telescopes. It is quite convenient to add additional multipliers to calculate the correlation as a function of delay and so obtain the spectral information as well.

Although the periodogram and correlation spectrum estimation procedures appear quite different, they are equivalent under certain conditions. It can be shown (Jenkins and Watts, 1968) that
$$\hat{P}_p(f) = \int_{-L}^{+L} \left( 1 - \frac{|\tau|}{L} \right) \hat{\rho}(\tau)\,e^{-i2\pi f\tau}\,d\tau. \tag{B.93}$$
The inverse FT yields
$$\hat{\rho}_p(\tau) = \left( 1 - \frac{|\tau|}{L} \right) \hat{\rho}(\tau), \qquad |\tau| < L. \tag{B.94}$$
Hence, if we modify the lagged-product spectrum estimation technique by simply using a triangular (Bartlett) window function in Equation (B.92), then the two procedures are equivalent. In spectrum estimation problems, one strives to achieve an estimator whose mean value (the average of multiple estimates) is the parameter being estimated. It can be shown (Jenkins and Watts, 1968) that the mean value of both the correlation and periodogram estimation procedures is the true spectrum $P(f)$ convolved with the frequency-domain window function:
$$E\left[\hat{P}_c(f)\right] = E\left[\hat{P}_p(f)\right] = W(f) * P(f). \tag{B.95}$$
Hence, the mean (expectation) value equals the true spectrum only if the frequency-domain window function is an impulse function (i.e., the data record length is infinite in duration). If the mean of the estimate is not equal to the true value, then we say that the estimate is biased.
B.13 Discrete power spectral density estimation

We will develop the discrete form of the PSD from the discrete form of Parseval's theorem. We start with a continuous waveform $h(t)$ and its transform $H(f)$, which are related by
$$h(t) = \int_{-\infty}^{+\infty} H(f)\,e^{i2\pi ft}\,df \quad \text{where} \quad H(f) = \int_{-\infty}^{+\infty} h(t)\,e^{-i2\pi ft}\,dt. \tag{B.96}$$
We refer to $H(f)$ as two-sided because, from the mathematics, it has non-zero values at both $(+)$ and $(-)$ frequencies. According to Parseval's theorem,
$$\text{Energy} = \int_{-\infty}^{\infty} h^{2}(t)\,dt = \int_{-\infty}^{\infty} |H(f)|^{2}\,df. \tag{B.97}$$
Thus, $|H(f)|^{2}$ = two-sided energy spectral density.
B.13.1 Discrete form of Parseval's theorem

Suppose our function $h(t)$ is sampled at $N$ uniformly spaced points to produce the values $h_k$ for $k = 0$ to $N - 1$, spanning a length of time $L = NT$ with $T$ = the sample interval. Then
$$\text{Energy} = \sum_{k=0}^{N-1} h_k^{2}\,T = \sum_{n=0}^{N-1} |H(f_n)|^{2}\,\Delta f = \sum_{n=0}^{N-1} |T H_n|^{2}\,\Delta f. \tag{B.98}$$
Note: the substitution $T H_n = H(f_n)$ comes from Equation (B.50). Thus, $|T H_n|^{2}$ = two-sided discrete energy spectral density. We note in passing that the usual discrete form of Parseval's theorem is obtained from Equation (B.98) by rewriting $\Delta f = 1/(NT)$ and then simplifying to give
$$\sum_{k=0}^{N-1} h_k^{2} = \frac{1}{N} \sum_{n=0}^{N-1} |H_n|^{2}. \tag{B.99}$$
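Both forms of the discrete Parseval theorem are easy to check numerically. In the sketch below the samples `h` and the sample interval `T` are hypothetical, and `H_n` is taken to be the unnormalised sum $\sum_k h_k e^{-i2\pi nk/N}$, which is an assumption about the convention of Equation (B.50).

```python
import cmath

def dft(x):
    """Unnormalised DFT, H_n = sum_k h_k exp(-2*pi*i*n*k/N); stands in for an FFT."""
    N = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * n * k / N) for k in range(N))
            for n in range(N)]

T = 0.25                                                # hypothetical sample interval
h = [0.3, -1.2, 0.8, 2.0, -0.5, 0.0, 1.1, -0.7]         # hypothetical samples h_k
N = len(h)
Hn = dft(h)
df = 1.0 / (N * T)                                      # frequency spacing

# Equation (B.98): energy computed in the time and frequency domains
energy_time = sum(v * v for v in h) * T
energy_freq = sum(abs(T * Hv) ** 2 for Hv in Hn) * df

# Equation (B.99): the usual discrete form
sum_squares = sum(v * v for v in h)
parseval_rhs = sum(abs(Hv) ** 2 for Hv in Hn) / N
```

The two energies agree to rounding error, as do the two sides of Equation (B.99).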
We will find Equation (B.98) a more useful version of the discrete form of Parseval's theorem because it makes clear that $|T H_n|^{2}$ is a discrete energy spectral density.
$$\begin{aligned}
\text{Average waveform power} &= \frac{\text{waveform energy}}{\text{waveform duration}} = \frac{1}{NT} \sum_{k=0}^{N-1} h_k^{2}\,T \\
&= \sum_{n=0}^{N-1} \frac{|H_n|^{2}\,T}{N}\,\Delta f
\end{aligned} \tag{B.100}$$
$$= \frac{1}{N} \sum_{k=0}^{N-1} h_k^{2} = \text{mean squared amplitude}. \tag{B.101}$$
We can identify the two-sided discrete PSD with $|H_n|^{2}\,T/N$ from the RHS of Equation (B.100), which has units of power per cycle.
B.13.2 One-sided discrete power spectral density

Let $P(f_n)$ = the one-sided power spectral density:
$$\begin{aligned}
P(f_0) &= \frac{T}{N}\,|H_0|^{2} \\
P(f_n) &= \frac{T}{N} \left[ |H_n|^{2} + |H_{N-n}|^{2} \right], \qquad n = 1, 2, \ldots, (N/2 - 1) \\
P(f_{N/2}) &= \frac{T}{N}\,|H_{N/2}|^{2},
\end{aligned} \tag{B.102}$$
where $f_{N/2}$ corresponds to the Nyquist frequency and $P(f_n)$ is only defined for zero and positive frequencies. From Equation (B.102), it is clear that $P(f_n)$ is normalized so that $\sum_{n=0}^{N/2} P(f_n)\,\Delta f$ = the mean squared amplitude. Note: our expression for the one-sided discrete PSD, which has units of power per unit of bandwidth, differs from the one given in Press et al. (1992). In particular, the definition used there, $P_{NR}(f_n)$, is related to our $P(f_n)$ by $P_{NR}(f_n) = P(f_n)\,\Delta f = P(f_n)/(NT)$.
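Equation (B.102) and its normalization can be sketched directly. The function name `one_sided_psd`, the samples, and the sample interval are hypothetical; $N$ is assumed even, and `H_n` is again taken to be the unnormalised sum $\sum_k h_k e^{-i2\pi nk/N}$.

```python
import cmath

def dft(x):
    """Unnormalised DFT, H_n = sum_k h_k exp(-2*pi*i*n*k/N); stands in for an FFT."""
    N = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * n * k / N) for k in range(N))
            for n in range(N)]

def one_sided_psd(h, T):
    """One-sided discrete PSD of Equation (B.102) at f_0 ... f_(N/2); N must be even."""
    N = len(h)
    Hn = dft(h)
    P = [T / N * abs(Hn[0]) ** 2]                                   # zero frequency
    P += [T / N * (abs(Hn[n]) ** 2 + abs(Hn[N - n]) ** 2)           # fold -f onto +f
          for n in range(1, N // 2)]
    P.append(T / N * abs(Hn[N // 2]) ** 2)                          # Nyquist frequency
    return P

T = 0.5                                                 # hypothetical sample interval
h = [0.3, -1.2, 0.8, 2.0, -0.5, 0.0, 1.1, -0.7]         # hypothetical samples
N = len(h)
P = one_sided_psd(h, T)
df = 1.0 / (N * T)
integrated_power = sum(P) * df
mean_square = sum(v * v for v in h) / N
```

The sum of `P` times the frequency spacing reproduces the mean squared amplitude, and multiplying each `P[n]` by `df` gives the quantity the text calls $P_{NR}(f_n)$.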
B.13.3 Variance of periodogram estimate

What is the variance of $P(f_n)$ as $N \to \infty$? In other words, as we take more sampled points from the original function (either sampling a longer stretch of data, or else resampling the same stretch of data with a faster sampling rate), how much more accurate do the estimates $P(f_n)$ become?

The unpleasant answer is that periodogram estimates do not become more accurate at all! It can be shown that in the case of white Gaussian noise² the standard deviation at frequency $f_n$ is equal to the expectation value of the spectrum at $f_n$ (Marple, 1987). How can this be? Where did this information go as we added more points? It all went into producing estimates at a greater number of discrete frequencies $f_n$. If we sample a longer run of data using the same sampling rate, then the Nyquist critical frequency $f_c$ is unchanged, but we now have finer frequency resolution (more $f_n$'s). If we sample the same length with a finer sampling interval, then our frequency resolution is unchanged, but the Nyquist range extends to higher frequencies. In neither case do the additional samples reduce the variance of any one particular frequency's estimated PSD. Figure B.15 shows examples for increasing $N$.

Figure B.15 Power spectral density (PSD) for white (IID) Gaussian noise for different record lengths: (a) 16 points, (b) 32 points, (c) 64 points, (d) 128 points. The frequency axis is in units of $1/T$, the sampling frequency. On this scale the Nyquist frequency $= 0.5$.

² The term white noise means that the spectral density of the noise is constant from zero frequency through the frequencies of interest, i.e., up to the Nyquist frequency. It is really another way of saying the noise is independent. An independent ensemble of noise values has an autocorrelation function (Equation (B.88)) which is a delta function. According to Equation (B.90), the power spectral density is just the FT of the autocorrelation function, which in this case would be a constant.

As you will see below, there are ways to reduce the variance of the estimate. However, this behavior caused many researchers to consider periodograms of noisy data to be erratic, and this resulted in a certain amount of disenchantment with periodograms for several decades. However, even Schuster was aware of the solution. This disenchantment led G. Yule to introduce a notable alternative analysis method in 1927. Yule's idea was to model a time series by linear regression analysis of the data. This led to the parametric methods, which assume a time-series model and solve for the parameters of the random process. These include autoregressive (AR), moving average (MA) and autoregressive-moving average (ARMA) process models (see Priestley, 1981; Marple, 1987 for more details). In contrast, the correlation and periodogram spectral estimations are referred to as non-parametric statistics of a random process.
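The statement that the standard deviation of a periodogram estimate equals its expectation value, whatever the record length, can be checked with a small simulation. This is only a sketch: the record lengths, number of trials, and random seed are arbitrary, and the estimate is evaluated at a single interior frequency channel ($n = N/4$) to keep the computation short.

```python
import cmath
import math
import random

def power_at_channel(x, n):
    """|H_n|^2 / N for one frequency channel (proportional to the PSD estimate)."""
    N = len(x)
    Hn = sum(x[k] * cmath.exp(-2j * cmath.pi * n * k / N) for k in range(N))
    return abs(Hn) ** 2 / N

def relative_scatter(N, trials=2000):
    """Standard deviation divided by mean of the estimate over many noise records."""
    values = []
    for _ in range(trials):
        x = [random.gauss(0.0, 1.0) for _ in range(N)]   # white Gaussian noise record
        values.append(power_at_channel(x, N // 4))
    mean = sum(values) / trials
    var = sum((v - mean) ** 2 for v in values) / (trials - 1)
    return math.sqrt(var) / mean

random.seed(0)
scatter = {N: relative_scatter(N) for N in (16, 64, 128)}
```

For each record length the ratio in `scatter` stays close to one, rather than shrinking as more points are added.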
B.13.4 Yule's stochastic spectrum estimation model

The Schuster periodogram is appropriate to a model of a sinusoid with additive noise. Suppose the situation were more akin to a pendulum which was being hit by boys throwing peas randomly from both sides. The result is simple harmonic motion powered by a random driving force. The motion is now affected, not by superposed noise, but by a random driving force. As a result, the graph will be of an entirely different kind to a graph in the case of a sinusoid with superposed errors. The pendulum graph will remain surprisingly smooth, but the amplitude and phase will vary continuously as governed by the inhomogeneous difference equation:
$$x(n) + a_1\,x(n-1) + a_2\,x(n-2) = \epsilon(n), \tag{B.103}$$
where $\epsilon(n)$ is the white noise input. Given an empirical time series, $x(n)$, Yule used the method of regression analysis to find these coefficients. Because he regressed $x(n)$ on its own past instead of some other variable, he called it autoregression. The least-squares normal equations involve the empirical autocorrelation coefficients of the time series, and today these equations are called the Yule–Walker equations. A good example of such a time series is electronic shot noise passing through some band pass filter which rings every shot. A detailed discussion of these methods goes beyond the scope of this book and the interested reader is referred to the works of Priestley (1981) and Marple (1987).
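A minimal sketch of Yule's approach for the second-order model of Equation (B.103) is given below. It is not taken from the references: the coefficient values, series length, and seed are hypothetical, and the Yule–Walker equations are written out for this two-coefficient case only, using the empirical autocovariances at lags 0, 1 and 2.

```python
import random

def simulate_ar2(a1, a2, n, burn_in=500):
    """Generate x(n) = -a1 x(n-1) - a2 x(n-2) + e(n) with unit-variance white noise e(n)."""
    x = [0.0, 0.0]
    for _ in range(n + burn_in):
        x.append(-a1 * x[-1] - a2 * x[-2] + random.gauss(0.0, 1.0))
    return x[-n:]                       # drop the start-up transient

def autocovariance(x, lag):
    """Empirical autocovariance of the series at the given lag."""
    n = len(x)
    mean = sum(x) / n
    return sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / n

def yule_walker_ar2(x):
    """Solve the 2x2 Yule-Walker equations for the coefficients a1, a2 of (B.103)."""
    r0, r1, r2 = (autocovariance(x, lag) for lag in range(3))
    # With phi = -a:  r1 = phi1 r0 + phi2 r1  and  r2 = phi1 r1 + phi2 r0
    det = r0 * r0 - r1 * r1
    phi1 = (r1 * r0 - r1 * r2) / det
    phi2 = (r0 * r2 - r1 * r1) / det
    return -phi1, -phi2

random.seed(3)
true_a1, true_a2 = -1.5, 0.8            # hypothetical coefficients of a stable process
series = simulate_ar2(true_a1, true_a2, 20000)
a1_est, a2_est = yule_walker_ar2(series)
```

The estimates come from regressing the series on its own past, which is the sense in which the method is an autoregression.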
B.13.5 Reduction of periodogram variance

There are two simple techniques for reducing the variance of a periodogram that are very nearly identical mathematically, though different in implementation. The first is to compute a periodogram estimate with finer discrete frequency spacing than you really need, and then to sum the periodogram estimates at $K$ consecutive discrete frequencies to get one "smoother" estimate at the mid frequency of those $K$.³ The variance of that summed estimate will be smaller than the estimate itself by a factor of exactly $1/K$, i.e., the standard deviation will be smaller than 100 percent by a factor $1/\sqrt{K}$. Thus, to estimate the power spectrum at $M + 1$ discrete frequencies between 0 and $f_c$ inclusive, you begin by taking the FFT of $2MK$ points (which number had better be an integer power of two!). You then take the modulus squared of the resulting coefficients, add positive and negative frequency pairs and divide by $(2MK)^{2}$. Finally, you "bin" the results into summed (not averaged) groups of $K$. The reason that you sum rather than average $K$ consecutive points is so that your final PSD estimate will preserve the normalization property that the sum of its $M + 1$ values equals the mean square value of the function.

A second technique for estimating the PSD at $M + 1$ discrete frequencies in the range of 0 to $f_c$ is to partition the original sampled data into $K$ segments each of $2M$ consecutive sampled points. Each segment is separately FFT'd to produce a periodogram estimate. Finally, the $K$ periodogram estimates are averaged at each frequency. It is this final averaging that reduces the variance of the estimate by a factor of $K$ (standard deviation by $\sqrt{K}$). The principal advantage of the second technique, however, is that only $2M$ data points are manipulated at a single time, not $2KM$ as in the first technique. This means that the second technique is the natural choice for processing long runs of data, as from a magnetic tape or other data record.
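The second technique can be sketched with a simulation on white Gaussian noise. The segment length, the number of segments $K$, the number of trials, and the seed are arbitrary choices, and only one interior frequency channel is evaluated to keep the example short.

```python
import cmath
import math
import random

def power_at_channel(x, n):
    """|H_n|^2 / N for one frequency channel (proportional to the PSD estimate)."""
    N = len(x)
    Hn = sum(x[k] * cmath.exp(-2j * cmath.pi * n * k / N) for k in range(N))
    return abs(Hn) ** 2 / N

def relative_scatter(values):
    """Standard deviation divided by mean."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return math.sqrt(var) / mean

random.seed(2)
K, seg_len, trials = 16, 16, 500        # K segments of 2M = 16 points each
single, averaged = [], []
for _ in range(trials):
    data = [random.gauss(0.0, 1.0) for _ in range(K * seg_len)]
    # One periodogram estimate per segment, at the same frequency channel
    estimates = [power_at_channel(data[j * seg_len:(j + 1) * seg_len], seg_len // 4)
                 for j in range(K)]
    single.extend(estimates)
    averaged.append(sum(estimates) / K)  # average the K estimates

scatter_single = relative_scatter(single)
scatter_averaged = relative_scatter(averaged)
```

The single-segment estimates scatter by about 100 percent of their mean, while the averaged estimates scatter by roughly $1/\sqrt{K}$ of that, here about one quarter.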
³ Of course, if your goal is to detect a very narrow band signal, then smoothing may actually reduce the signal-to-noise ratio for detecting such a signal.

B.14 Problems

1. Exercise on the DFT, zero padding and Nyquist sampling
a) In the accompanying Mathematica tutorial, you will find a section entitled "Exercise on the DFT, Zero Padding and Nyquist Sampling." Execute the notebook and make sure you understand each step. Do not include a copy of this part in your submission.
b) Repeat the exercise items, but this time with $f_n$ = Cos[2π f1 t] + Sin[2π f2 t]. Let f1 = 1 Hz, f2 = 0.7 Hz, T = 0.25 s and the data window length L = 3 s.
c) Explain why one of the two frequencies only appeared in the real part of the analytic FT and the other only appeared in the imaginary part.
d) What was the effect of zero padding on the DFT?
e) Comment on the degree of agreement between the FT and the DFT in the Mathematica tutorial.
f) Repeat item (b), only this time increase the window size to L = 8 s. What effect did this have on the spectrum? Explain why this occurred (see Figure B.7).
g) What is the value of the Nyquist frequency for the sampling interval used?
h) Do the two signals appear at their correct frequencies in the FT and DFT? Explain why there are low level bumps in the spectrum at other frequencies.
i) Recompute the DFT with a sample interval, T = 0.65 s, and a data window length L = 65 s. Do the two signals appear at their correct frequencies? If not, explain why.

2. Exercise on Fourier image convolution and deconvolution
a) In the accompanying Mathematica tutorial, you will find a section entitled "Exercise on Fourier Image Convolution and Deconvolution." Execute the notebook and make sure you understand each step.
b) Repeat (a) using a point spread function which is the sum of the following two multinormal distributions:
Multinormal[{0, 0}, {{1, 0}, {0, 1}}]
Multinormal[{4, 0}, {{2, 0.8}, {0.8, 2}}]
Appendix C Difference in two samples
C.1 Outline

In Section 9.4, we explored a Bayesian treatment of the analysis of two independent measurements of the same physical quantity, the control and the trial, taken under slightly different experimental conditions. In the next four subsections, we give the details behind the calculations of the probabilities of the four fundamental hypotheses $(C, S)$, $(C, \bar{S})$, $(\bar{C}, S)$ and $(\bar{C}, \bar{S})$ which arose in Section 9.4. After determining what is different, the next problem is to estimate the magnitude of the changes. Section 9.4.4 introduced the calculation for the probability of the difference in the two means, $p(\delta|D_1, D_2, I)$. The details of this calculation are given in Section C.3. Finally, Section 9.4.5 introduced the calculation for the probability for the ratio of the standard deviations, $p(r|D_1, D_2, I)$. The details of this calculation are given in Section C.4.
C.2 Probabilities of the four hypotheses

C.2.1 Evaluation of $p(C, S|D_1, D_2, I)$

The only quantities that remain to be assigned are the two likelihood functions, $p(D_1|C, S, c_1, \sigma_1, I)$ and $p(D_2|C, S, c_1, \sigma_1, I)$. The prior probability for the noise will be taken to be Gaussian. We have $D_1 \equiv \{d_{11}, d_{12}, d_{13}, \ldots, d_{1N_1}\}$, where $d_{1i} = c_1 + e_{1i}$; therefore,
$$\begin{aligned}
p(D_1|C, S, c_1, \sigma_1, I) &= p(e_{11}, e_{12}, \ldots, e_{1N_1}|c_1, \sigma_1, I) \\
&= \prod_{i=1}^{N_1} \left[ \frac{1}{\sqrt{2\pi\sigma_1^{2}}} \exp\left( -\frac{e_{1i}^{2}}{2\sigma_1^{2}} \right) \right] \\
&= (2\pi\sigma_1^{2})^{-N_1/2} \exp\left\{ -\sum_{i=1}^{N_1} \frac{(d_{1i} - c_1)^{2}}{2\sigma_1^{2}} \right\},
\end{aligned} \tag{C.1}$$
$$p(D_2|C, S, c_1, \sigma_1, I) = (2\pi\sigma_1^{2})^{-N_2/2} \exp\left\{ -\sum_{i=1}^{N_2} \frac{(d_{2i} - c_1)^{2}}{2\sigma_1^{2}} \right\}. \tag{C.2}$$
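Let
$$\psi_1 = \sum_{i=1}^{N_1} (d_{1i} - c_1)^{2} = \sum d_{1i}^{2} + N_1 c_1^{2} - 2c_1 \sum d_{1i}$$
and
$$\psi_2 = \sum_{i=1}^{N_2} d_{2i}^{2} + N_2 c_1^{2} - 2c_1 \sum d_{2i}$$
and
$$\begin{aligned}
\psi_1 + \psi_2 &= \sum_{i=1}^{N=N_1+N_2} d_i^{2} + N c_1^{2} - 2c_1 \sum_{i=1}^{N=N_1+N_2} d_i \\
&= N\overline{d^{2}} + N c_1^{2} - 2c_1 N\bar{d} \\
&= N c_1^{2} - 2c_1 N\bar{d} + N(\bar{d})^{2} - N(\bar{d})^{2} + N\overline{d^{2}} \\
&= N(c_1 - \bar{d})^{2} + N\left(\overline{d^{2}} - (\bar{d})^{2}\right),
\end{aligned} \tag{C.3}$$
where $\bar{d}$ and $\overline{d^{2}}$ are the mean and mean square of the pooled data, $N = N_1 + N_2$. Therefore,
$$p(D_1|C, S, c_1, \sigma_1, I) \times p(D_2|C, S, c_1, \sigma_1, I) = (2\pi\sigma_1^{2})^{-N/2} \exp\left\{ -\frac{N\left(\overline{d^{2}} - (\bar{d})^{2}\right)}{2\sigma_1^{2}} \right\} \exp\left\{ -\frac{N(c_1 - \bar{d})^{2}}{2\sigma_1^{2}} \right\}. \tag{C.4}$$
Now combine the likelihoods with the priors to obtain the posterior probability $p(C, S|D_1, D_2, I)$:
$$p(C, S|D_1, D_2, I) = K \int_{\sigma_L}^{\sigma_H} d\sigma_1\, \frac{(2\pi\sigma_1^{2})^{-N/2}}{4 R_c\, \sigma_1 \ln(R_\sigma)} \exp\left\{ -\frac{N\left(\overline{d^{2}} - (\bar{d})^{2}\right)}{2\sigma_1^{2}} \right\} \int_{c_L}^{c_H} dc_1 \exp\left\{ -\frac{N(c_1 - \bar{d})^{2}}{2\sigma_1^{2}} \right\}. \tag{C.5}$$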
If the limits on the $c_1$ integral extend from minus infinity to plus infinity, and the limits on the $\sigma_1$ integral extend from zero to infinity, then both integrals can be evaluated in closed form. However, with finite limits, either of the two indicated integrals may be evaluated, but the other must be evaluated numerically. The integral over amplitude will be evaluated in terms of erf(x), the error function

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-u^2}\,du, \tag{C.6}$$

by setting

$$u^2 = \frac{N(c_1-\bar{d})^2}{2\sigma_1^2}. \tag{C.7}$$

Then

$$\int_{c_L}^{c_H} dc_1\,\exp\left\{-\frac{N(c_1-\bar{d})^2}{2\sigma_1^2}\right\} = \sqrt{\frac{2}{N}}\,\sigma_1\int_{X_L}^{X_H} e^{-u^2}\,du = \sqrt{\frac{2}{N}}\,\sigma_1\,\frac{\sqrt{\pi}}{2}\,\frac{2}{\sqrt{\pi}}\int_{X_L}^{X_H} e^{-u^2}\,du = \sqrt{\frac{\pi}{2N}}\,\sigma_1\left[\mathrm{erf}(X_H)-\mathrm{erf}(X_L)\right]. \tag{C.8}$$

Evaluating the integral over the amplitude, we obtain:

$$p(C,S|D_1,D_2,I) = \frac{K(2\pi)^{-N/2}\sqrt{\pi/2N}}{4R_c\log(R_\sigma)}\int_{\sigma_L}^{\sigma_H} d\sigma_1\,\sigma_1^{-N}\exp\left\{-\frac{z}{2\sigma_1^2}\right\}\left[\mathrm{erf}(X_H)-\mathrm{erf}(X_L)\right], \tag{C.9}$$

where

$$X_H = \sqrt{\frac{N}{2\sigma_1^2}}\,(c_{1H}-\bar{d}), \qquad X_L = \sqrt{\frac{N}{2\sigma_1^2}}\,(c_{1L}-\bar{d}), \qquad z = N\left[\overline{d^2}-(\bar{d})^2\right], \tag{C.10}$$

and $c_{1H}$ and $c_{1L}$ are the upper and lower limits of the $c_1$ integral.
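The two algebraic steps used above, the pooled-data identity of Equation (C.3) and the error-function form of the $c_1$ integral in Equation (C.8), are easy to check numerically. The following is only a minimal sketch in Python (the book's own notebooks use Mathematica); the function names and the midpoint-rule integrator are illustrative assumptions, not part of the original text.

```python
import math

def pooled_identity(d1, d2, c1):
    """Return both sides of Eq. (C.3) for data lists d1, d2 and constant c1."""
    pooled = list(d1) + list(d2)
    n = len(pooled)
    mean = sum(pooled) / n                       # pooled mean
    mean_sq = sum(x * x for x in pooled) / n     # pooled mean square
    lhs = sum((x - c1) ** 2 for x in pooled)
    rhs = n * (c1 - mean) ** 2 + n * (mean_sq - mean ** 2)
    return lhs, rhs

def c1_integral_erf(n, mean, sigma1, c_low, c_high):
    """Closed form of the c1 integral, Eq. (C.8)."""
    scale = math.sqrt(n / (2.0 * sigma1 ** 2))
    x_high = scale * (c_high - mean)
    x_low = scale * (c_low - mean)
    return math.sqrt(math.pi / (2.0 * n)) * sigma1 * (math.erf(x_high) - math.erf(x_low))

def c1_integral_numeric(n, mean, sigma1, c_low, c_high, steps=20000):
    """Midpoint-rule estimate of the same integral (illustrative helper)."""
    h = (c_high - c_low) / steps
    total = 0.0
    for k in range(steps):
        c1 = c_low + (k + 0.5) * h
        total += math.exp(-n * (c1 - mean) ** 2 / (2.0 * sigma1 ** 2))
    return total * h
```

For any data set the two values returned by `pooled_identity` should agree to rounding error, and the two integral routines should agree to within the accuracy of the midpoint rule.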
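C.2.2 Evaluation of $p(C,\bar{S}|D_1,D_2,I)$

Notice that $p(C,\bar{S}|D_1,D_2,I)$ assumes the constants are the same in both data sets, but the standard deviations are different. Thus, $p(C,\bar{S}|D_1,D_2,I)$ is a marginal probability density, where the constant and the two standard deviations were removed as nuisance parameters:

$$\begin{aligned} p(C,\bar{S}|D_1,D_2,I) &= \int dc_1\,d\sigma_1\,d\sigma_2\; p(C,\bar{S},c_1,\sigma_1,\sigma_2|D_1,D_2,I) \\ &= K\int dc_1\,d\sigma_1\,d\sigma_2\; p(C,\bar{S},c_1,\sigma_1,\sigma_2|I)\; p(D_1,D_2|C,\bar{S},c_1,\sigma_1,\sigma_2,I) \\ &= K\int dc_1\,d\sigma_1\,d\sigma_2\; p(C,\bar{S}|I)\,p(c_1|I)\,p(\sigma_1|I)\,p(\sigma_2|I)\; p(D_1|C,\bar{S},c_1,\sigma_1,I)\,p(D_2|C,\bar{S},c_1,\sigma_2,I). \end{aligned} \tag{C.11}$$

By analogy with Equations (C.1) to (C.3), we can evaluate

$$p(D_1|C,\bar{S},c_1,\sigma_1,\sigma_2,I)\,p(D_2|C,\bar{S},c_1,\sigma_1,\sigma_2,I) = (2\pi)^{-N/2}\,\sigma_1^{-N_1}\exp\left\{-\frac{U_1}{\sigma_1^2}\right\}\sigma_2^{-N_2}\exp\left\{-\frac{U_2}{\sigma_2^2}\right\}, \tag{C.12}$$

where

$$U_1 = \frac{N_1}{2}\left(\overline{d_1^2} - 2c_1\bar{d}_1 + c_1^2\right), \qquad U_2 = \frac{N_2}{2}\left(\overline{d_2^2} - 2c_1\bar{d}_2 + c_1^2\right), \tag{C.13}$$

and $\bar{d}_1$, $\overline{d_1^2}$, $\bar{d}_2$, $\overline{d_2^2}$ are the means and mean squares of $D_1$ and $D_2$ respectively. Substituting Equation (C.12) into (C.11) and adding the priors, we have

$$p(C,\bar{S}|D_1,D_2,I) = \frac{K(2\pi)^{-N/2}}{4R_c[\log(R_\sigma)]^2}\int_{L}^{H} dc_1\int_{\sigma_L}^{\sigma_H} d\sigma_1\,\sigma_1^{-(N_1+1)}\exp\left\{-\frac{U_1}{\sigma_1^2}\right\}\int_{\sigma_L}^{\sigma_H} d\sigma_2\,\sigma_2^{-(N_2+1)}\exp\left\{-\frac{U_2}{\sigma_2^2}\right\}. \tag{C.14}$$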
The integrals over $\sigma_1$ and $\sigma_2$ will be evaluated in terms of $Q(r,x)$, one form of the incomplete gamma function of index $r$ and argument $x$:

$$Q(r,x) = \frac{1}{\Gamma(r)}\int_x^{\infty} e^{-t}\,t^{r-1}\,dt. \tag{C.15}$$

If we let

$$t = \frac{U_1}{\sigma_1^2}, \qquad r = \frac{N_1}{2}, \tag{C.16}$$
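then we can show that

$$\begin{aligned} \int_{\sigma_L}^{\sigma_H} d\sigma_1\,\sigma_1^{-(N_1+1)}\exp\left\{-\frac{U_1}{\sigma_1^2}\right\} &= \frac{1}{2}\,U_1^{-N_1/2}\,\Gamma(N_1/2)\left[\frac{1}{\Gamma(N_1/2)}\int_{U_1/\sigma_H^2}^{\infty} e^{-t}\,t^{\frac{N_1}{2}-1}\,dt - \frac{1}{\Gamma(N_1/2)}\int_{U_1/\sigma_L^2}^{\infty} e^{-t}\,t^{\frac{N_1}{2}-1}\,dt\right] \\ &= \frac{1}{2}\,U_1^{-N_1/2}\,\Gamma(N_1/2)\left[Q\!\left(\frac{N_1}{2},\frac{U_1}{\sigma_H^2}\right) - Q\!\left(\frac{N_1}{2},\frac{U_1}{\sigma_L^2}\right)\right]. \end{aligned} \tag{C.17}$$

Evaluating the integral over $\sigma_1$ and $\sigma_2$, one obtains

$$p(C,\bar{S}|D_1,D_2,I) = \frac{K(2\pi)^{-N/2}\,\Gamma(N_1/2)\,\Gamma(N_2/2)}{16R_c[\log(R_\sigma)]^2}\int_{L}^{H} dc_1\,U_1^{-N_1/2}\,U_2^{-N_2/2}\left[Q\!\left(\frac{N_1}{2},\frac{U_1}{\sigma_H^2}\right) - Q\!\left(\frac{N_1}{2},\frac{U_1}{\sigma_L^2}\right)\right]\left[Q\!\left(\frac{N_2}{2},\frac{U_2}{\sigma_H^2}\right) - Q\!\left(\frac{N_2}{2},\frac{U_2}{\sigma_L^2}\right)\right]. \tag{C.18}$$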
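The change of variable behind Equation (C.17) can be checked numerically. The sketch below is an illustration under a simplifying assumption: it restricts $N_1$ to even values so that $r = N_1/2$ is an integer, in which case $Q(r,x) = e^{-x}\sum_{k=0}^{r-1} x^k/k!$ and no special-function library is needed. The function names are hypothetical.

```python
import math

def upper_gamma_q(r, x):
    """Q(r, x) of Eq. (C.15), valid here only for positive integer r."""
    term, total = 1.0, 1.0
    for k in range(1, r):
        term *= x / k          # builds x**k / k!
        total += term
    return math.exp(-x) * total

def sigma_integral_closed(n1, u1, sig_low, sig_high):
    """Right-hand side of Eq. (C.17); n1 is assumed even."""
    r = n1 // 2
    gamma_r = math.factorial(r - 1)   # Gamma(r) for integer r
    q_diff = upper_gamma_q(r, u1 / sig_high ** 2) - upper_gamma_q(r, u1 / sig_low ** 2)
    return 0.5 * u1 ** (-n1 / 2.0) * gamma_r * q_diff

def sigma_integral_numeric(n1, u1, sig_low, sig_high, steps=100000):
    """Midpoint-rule estimate of the left-hand side of Eq. (C.17)."""
    h = (sig_high - sig_low) / steps
    total = 0.0
    for k in range(steps):
        s = sig_low + (k + 0.5) * h
        total += s ** (-(n1 + 1)) * math.exp(-u1 / s ** 2)
    return total * h
```

For odd $N_1$ a general incomplete gamma routine would be needed in place of `upper_gamma_q`; the remaining $c_1$ integral in Equation (C.18) would still have to be done numerically.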
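C.2.3 Evaluation of $p(\bar{C},S|D_1,D_2,I)$

$$\begin{aligned} p(\bar{C},S|D_1,D_2,I) &= \int dc_1\,dc_2\,d\sigma_1\; p(\bar{C},S,c_1,c_2,\sigma_1|D_1,D_2,I) \\ &= K\int dc_1\,dc_2\,d\sigma_1\; p(\bar{C},S,c_1,c_2,\sigma_1|I)\; p(D_1,D_2|\bar{C},S,c_1,c_2,\sigma_1,I) \\ &= K\int dc_1\,dc_2\,d\sigma_1\; p(\bar{C},S|I)\,p(c_1|I)\,p(c_2|I)\,p(\sigma_1|I)\; p(D_1|\bar{C},S,c_1,\sigma_1,I)\,p(D_2|\bar{C},S,c_2,\sigma_1,I). \end{aligned} \tag{C.19}$$

Evaluating the integrals over $c_1$ and $c_2$, one obtains

$$p(\bar{C},S|D_1,D_2,I) = \frac{K(2\pi)^{-N/2}\,\pi}{8R_c^2\log(R_\sigma)\sqrt{N_1N_2}}\int_{\sigma_L}^{\sigma_H} d\sigma_1\,\sigma_1^{-N+1}\exp\left\{-\frac{z_1+z_2}{2\sigma_1^2}\right\}\left[\mathrm{erf}(X_{1H})-\mathrm{erf}(X_{1L})\right]\left[\mathrm{erf}(X_{2H})-\mathrm{erf}(X_{2L})\right], \tag{C.20}$$

where

$$z_1 = N_1\left[\overline{d_1^2}-(\bar{d}_1)^2\right], \qquad z_2 = N_2\left[\overline{d_2^2}-(\bar{d}_2)^2\right], \tag{C.21}$$

$$X_{1H} = \sqrt{\frac{N_1}{2\sigma_1^2}}\,(H-\bar{d}_1), \qquad X_{1L} = \sqrt{\frac{N_1}{2\sigma_1^2}}\,(L-\bar{d}_1), \tag{C.22}$$

$$X_{2H} = \sqrt{\frac{N_2}{2\sigma_1^2}}\,(H-\bar{d}_2), \qquad X_{2L} = \sqrt{\frac{N_2}{2\sigma_1^2}}\,(L-\bar{d}_2). \tag{C.23}$$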
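C.2.4 Evaluation of $p(\bar{C},\bar{S}|D_1,D_2,I)$

$$\begin{aligned} p(\bar{C},\bar{S}|D_1,D_2,I) &= \int dc_1\,dc_2\,d\sigma_1\,d\sigma_2\; p(\bar{C},\bar{S},c_1,c_2,\sigma_1,\sigma_2|D_1,D_2,I) \\ &= K\int dc_1\,dc_2\,d\sigma_1\,d\sigma_2\; p(\bar{C},\bar{S},c_1,c_2,\sigma_1,\sigma_2|I)\; p(D_1,D_2|\bar{C},\bar{S},c_1,c_2,\sigma_1,\sigma_2,I) \\ &= K\int dc_1\,dc_2\,d\sigma_1\,d\sigma_2\; p(\bar{C},\bar{S}|I)\,p(c_1|I)\,p(c_2|I)\,p(\sigma_1|I)\,p(\sigma_2|I)\; p(D_1|\bar{C},\bar{S},c_1,\sigma_1,I)\,p(D_2|\bar{C},\bar{S},c_2,\sigma_2,I). \end{aligned} \tag{C.24}$$

Evaluating the integrals over $c_1$ and $c_2$, one obtains

$$p(\bar{C},\bar{S}|D_1,D_2,I) = \frac{K(2\pi)^{-N/2}\,\pi}{8R_c^2[\log(R_\sigma)]^2\sqrt{N_1N_2}}\int_{\sigma_L}^{\sigma_H} d\sigma_1\,\sigma_1^{-N_1}\exp\left\{-\frac{z_1}{2\sigma_1^2}\right\}\left[\mathrm{erf}(X_{1H})-\mathrm{erf}(X_{1L})\right]\int_{\sigma_L}^{\sigma_H} d\sigma_2\,\sigma_2^{-N_2}\exp\left\{-\frac{z_2}{2\sigma_2^2}\right\}\left[\mathrm{erf}(X_{2H})-\mathrm{erf}(X_{2L})\right], \tag{C.25}$$

where

$$X_{1H} = \sqrt{\frac{N_1}{2\sigma_1^2}}\,(c_{1H}-\bar{d}_1), \qquad X_{1L} = \sqrt{\frac{N_1}{2\sigma_1^2}}\,(c_{1L}-\bar{d}_1), \tag{C.26}$$

$$X_{2H} = \sqrt{\frac{N_2}{2\sigma_2^2}}\,(c_{1H}-\bar{d}_2), \qquad X_{2L} = \sqrt{\frac{N_2}{2\sigma_2^2}}\,(c_{1L}-\bar{d}_2), \tag{C.27}$$

$$z_1 = N_1\left[\overline{d_1^2}-(\bar{d}_1)^2\right], \qquad z_2 = N_2\left[\overline{d_2^2}-(\bar{d}_2)^2\right]. \tag{C.28}$$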
C.3 The difference in the means

Section 9.4.4 introduced the calculation for the probability of the difference in the two means, $p(\delta|D_1,D_2,I)$, which was expressed in Equation (9.68) as a weighted sum of $p(\delta|S,D_1,D_2,I)$ and $p(\delta|\bar{S},D_1,D_2,I)$: the probability for the difference in means given that the standard deviations are the same (the two-sample problem) and the probability for the difference in means given that the standard deviations are different (the Behrens–Fisher problem). The details of the calculation of these two probabilities are given below.
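C.3.1 The two-sample problem

$p(\delta|S,D_1,D_2,I)$ is essentially the two-sample problem. This probability is a marginal probability where the standard deviation and $\beta$ have been removed as nuisance parameters:

$$\begin{aligned} p(\delta|S,D_1,D_2,I) &= \int d\beta\,d\sigma_1\; p(\delta,\beta,\sigma_1|S,D_1,D_2,I) \\ &\propto \int d\beta\,d\sigma_1\; p(\delta,\beta,\sigma_1|S,I)\; p(D_1,D_2|S,\delta,\beta,\sigma_1,I) \\ &= \int d\beta\,d\sigma_1\; p(\delta|I)\,p(\beta|I)\,p(\sigma_1|I)\; p(D_1|S,\delta,\beta,\sigma_1,I)\,p(D_2|S,\delta,\beta,\sigma_1,I), \end{aligned} \tag{C.29}$$

where $p(\delta|I)$ and $p(\beta|I)$ are assigned bounded uniform priors:

$$p(\delta|I) = \begin{cases} \dfrac{1}{2R_c}, & \text{if } L-H \le \delta \le H-L \\ 0, & \text{otherwise} \end{cases} \tag{C.30}$$

and

$$p(\beta|I) = \begin{cases} \dfrac{1}{2R_c}, & \text{if } 2L \le \beta \le 2H \\ 0, & \text{otherwise.} \end{cases} \tag{C.31}$$

We can evaluate $p(D_1|S,\delta,\beta,\sigma_1,I)$ by comparison with Equation (C.1), after substituting for $c_1$ according to Equation (9.67):

$$p(D_1|S,\delta,\beta,\sigma_1,I) = (2\pi\sigma_1^2)^{-N_1/2}\exp\left\{-\frac{Q_1}{\sigma_1^2}\right\}, \tag{C.32}$$

where $Q_1$ is given by:

$$Q_1 = \frac{1}{2}\sum_{i=1}^{N_1}\left[d_{1i}-\frac{(\beta+\delta)}{2}\right]^2 = \frac{N_1}{2}\left[\overline{d_1^2} - (\beta+\delta)\bar{d}_1 + \frac{\beta^2}{4} + \frac{\beta\delta}{2} + \frac{\delta^2}{4}\right]. \tag{C.33}$$

Similarly, $p(D_2|S,\delta,\beta,\sigma_1,I)$ is given by

$$p(D_2|S,\delta,\beta,\sigma_1,I) = (2\pi\sigma_1^2)^{-N_2/2}\exp\left\{-\frac{Q_2}{\sigma_1^2}\right\}, \tag{C.34}$$

where

$$Q_2 = \frac{1}{2}\sum_{i=1}^{N_2}\left[d_{2i}-\frac{(\beta-\delta)}{2}\right]^2 = \frac{N_2}{2}\left[\overline{d_2^2} - (\beta-\delta)\bar{d}_2 + \frac{\beta^2}{4} - \frac{\beta\delta}{2} + \frac{\delta^2}{4}\right]. \tag{C.35}$$

The product of Equations (C.32) and (C.34) can be simplified to

$$p(D_1|S,\delta,\beta,\sigma_1,I)\,p(D_2|S,\delta,\beta,\sigma_1,I) = (2\pi)^{-N/2}\,\sigma_1^{-N}\exp\left\{-\frac{V}{\sigma_1^2}\right\}, \tag{C.36}$$

where

$$V = \frac{N}{2}\left[\overline{d^2} - 2b\delta - \beta\bar{d} + \frac{\beta^2}{4} + \frac{\delta^2}{4} + \frac{\Delta\beta\delta}{2}\right], \tag{C.37}$$

$$\Delta = \frac{N_1-N_2}{N}, \tag{C.38}$$

and

$$b = \frac{N_1\bar{d}_1 - N_2\bar{d}_2}{2N}.$$

After substituting Equations (C.30), (C.31) and (C.36) into Equation (C.29), the integral over $\sigma_1$ is evaluated in terms of incomplete gamma functions:

$$p(\delta|S,D_1,D_2,I) \propto \frac{\Gamma(N/2)}{8R_c^2\log(R_\sigma)}\int_{2L}^{2H} d\beta\,V^{-N/2}\left[Q\!\left(\frac{N}{2},\frac{V}{\sigma_H^2}\right) - Q\!\left(\frac{N}{2},\frac{V}{\sigma_L^2}\right)\right]. \tag{C.39}$$

The final integral over $\beta$ is computed numerically.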
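The simplification leading to Equation (C.36) can be verified by computing the exponent two ways. The sketch below is illustrative only and rests on labeled assumptions: it takes $c_1 = (\beta+\delta)/2$ and $c_2 = (\beta-\delta)/2$, reads $Q_1$ and $Q_2$ as one half of the corresponding sums of squares, and takes the coefficient of Equation (C.38) to be $(N_1 - N_2)/N$ with $b = (N_1\bar{d}_1 - N_2\bar{d}_2)/(2N)$. Function names are hypothetical.

```python
def q_sum_direct(d1, d2, delta, beta):
    """Q1 + Q2 computed directly from the two data sets."""
    c1 = (beta + delta) / 2.0   # assumed substitution for the first mean
    c2 = (beta - delta) / 2.0   # assumed substitution for the second mean
    q1 = 0.5 * sum((x - c1) ** 2 for x in d1)
    q2 = 0.5 * sum((x - c2) ** 2 for x in d2)
    return q1 + q2

def v_pooled(d1, d2, delta, beta):
    """V of Eq. (C.37), written in terms of pooled-data statistics."""
    n1, n2 = len(d1), len(d2)
    n = n1 + n2
    mean1 = sum(d1) / n1
    mean2 = sum(d2) / n2
    pooled = list(d1) + list(d2)
    mean = sum(pooled) / n
    mean_sq = sum(x * x for x in pooled) / n
    b = (n1 * mean1 - n2 * mean2) / (2.0 * n)
    big_delta = (n1 - n2) / n   # assumed form of the coefficient in Eq. (C.38)
    return 0.5 * n * (mean_sq - 2.0 * b * delta - beta * mean
                      + beta ** 2 / 4.0 + delta ** 2 / 4.0
                      + big_delta * beta * delta / 2.0)
```

If the two functions agree for arbitrary data and arbitrary $(\delta, \beta)$, the pooled form can be used inside the numerical $\beta$ integral without looping over the raw data at every grid point.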
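C.3.2 The Behrens–Fisher problem

The Behrens–Fisher problem is essentially given by $p(\delta|\bar{S},D_1,D_2,I)$, the probability for the difference in means given that the standard deviations are not the same. This probability is a marginal probability where both the standard deviations and the sum of the means, $\beta$, have been removed as nuisance parameters:

$$\begin{aligned} p(\delta|\bar{S},D_1,D_2,I) &= \int d\beta\,d\sigma_1\,d\sigma_2\; p(\delta,\beta,\sigma_1,\sigma_2|\bar{S},D_1,D_2,I) \\ &\propto \int d\beta\,d\sigma_1\,d\sigma_2\; p(\delta,\beta,\sigma_1,\sigma_2|\bar{S},I)\; p(D_1,D_2|\bar{S},\delta,\beta,\sigma_1,\sigma_2,I) \\ &= \int d\beta\,d\sigma_1\,d\sigma_2\; p(\delta|I)\,p(\beta|I)\,p(\sigma_1|I)\,p(\sigma_2|I)\; p(D_1|\bar{S},\delta,\beta,\sigma_1,I)\,p(D_2|\bar{S},\delta,\beta,\sigma_2,I), \end{aligned} \tag{C.40}$$

where all of the terms appearing in this probability density function have been previously assigned. To evaluate the integrals over $\sigma_1$ and $\sigma_2$, one substitutes Equations (9.63), (C.30) and (C.31), and a Gaussian noise prior is used in the two likelihoods. Evaluating the integrals, one obtains

$$p(\delta|\bar{S},D_1,D_2,I) \propto \frac{\Gamma(N_1/2)\,\Gamma(N_2/2)}{16R_c^2[\log(R_\sigma)]^2}\int_{2L}^{2H} d\beta\,W_1^{-N_1/2}\,W_2^{-N_2/2}\left[Q\!\left(\frac{N_1}{2},\frac{W_1}{\sigma_H^2}\right) - Q\!\left(\frac{N_1}{2},\frac{W_1}{\sigma_L^2}\right)\right]\left[Q\!\left(\frac{N_2}{2},\frac{W_2}{\sigma_H^2}\right) - Q\!\left(\frac{N_2}{2},\frac{W_2}{\sigma_L^2}\right)\right], \tag{C.41}$$

where

$$W_1 = \frac{N_1}{2}\left[\overline{d_1^2} - \bar{d}_1(\beta+\delta) + \frac{(\beta+\delta)^2}{4}\right], \tag{C.42}$$

and

$$W_2 = \frac{N_2}{2}\left[\overline{d_2^2} - \bar{d}_2(\beta-\delta) + \frac{(\beta-\delta)^2}{4}\right]. \tag{C.43}$$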
With the completion of this calculation, the probability for the difference in means, Equation (9.68), is now complete. We now turn our attention to calculation of the probability for the ratio of the standard deviations.
C.4 The ratio of the standard deviations

Section 9.4.5 introduced the calculation for the probability for the ratio of the standard deviations, $p(r|D_1,D_2,I)$, independent of whether the means are the same or different. This is a weighted average of the probability for the ratio of the standard deviations given the means are the same, $p(r|C,D_1,D_2,I)$, and the probability for the ratio of the standard deviations given that the means are different, $p(r|\bar{C},D_1,D_2,I)$. These two probabilities are given below.
C.4.1 Estimating the ratio, given the means are the same

The first term to be addressed is $p(r|C,D_1,D_2,I)$. This probability is a marginal probability where both $\sigma$ and $c_1$ have been removed as nuisance parameters:

$$\begin{aligned} p(r|C,D_1,D_2,I) &= \int dc_1\,d\sigma\; p(r,c_1,\sigma|C,D_1,D_2,I) \\ &\propto \int dc_1\,d\sigma\; p(r,c_1,\sigma|C,I)\; p(D_1,D_2|C,r,c_1,\sigma,I) \\ &= \int dc_1\,d\sigma\; p(r|I)\,p(c_1|I)\,p(\sigma|I)\; p(D_1|C,r,c_1,\sigma,I)\,p(D_2|C,r,c_1,\sigma,I), \end{aligned} \tag{C.44}$$

where the prior probability for the ratio of the standard deviations is taken to be a bounded Jeffreys prior:

$$p(r|I) = \begin{cases} 1/[2r\log(R_\sigma)], & \text{if } \sigma_L/\sigma_H \le r \le \sigma_H/\sigma_L \\ 0, & \text{otherwise.} \end{cases} \tag{C.45}$$

To evaluate the integral over $c_1$, one substitutes Equations (9.63) and (C.45), and a Gaussian noise prior probability is used to assign the two likelihoods. Evaluating the integral, one obtains

$$p(r|C,D_1,D_2,I) = \frac{(2\pi)^{-N/2}\sqrt{\pi/8w}\;r^{-N_1-1}}{R_c[\log(R_\sigma)]^2}\int_{\sigma_L}^{\sigma_H} d\sigma\,\sigma^{-N}\exp\left\{-\frac{z}{2\sigma^2}\right\}\left[\mathrm{erf}(X_H)-\mathrm{erf}(X_L)\right], \tag{C.46}$$

where

$$X_H = \sqrt{\frac{w}{2\sigma^2}}\,[c_{1H} - v/w], \qquad X_L = \sqrt{\frac{w}{2\sigma^2}}\,[c_{1L} - v/w], \tag{C.47}$$

$$u = \frac{N_1\overline{d_1^2}}{r^2} + N_2\overline{d_2^2}, \qquad v = \frac{N_1\bar{d}_1}{r^2} + N_2\bar{d}_2, \tag{C.48}$$

$$w = \frac{N_1}{r^2} + N_2, \qquad z = u - \frac{v^2}{w}. \tag{C.49}$$
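The quantities $u$, $v$, $w$ and $z$ arise from completing the square in $c_1$. A short numerical check is sketched below; it assumes the parameterization $\sigma_1 = r\sigma$, $\sigma_2 = \sigma$ (inferred here from the form of the error-function arguments, not stated in this appendix), so that the exponent of the joint likelihood is $-[u - 2vc_1 + wc_1^2]/(2\sigma^2)$. Function names are hypothetical.

```python
def ratio_terms(d1, d2, r):
    """u, v, w, z of Eqs. (C.48)-(C.49) for data lists d1, d2 and ratio r."""
    n1, n2 = len(d1), len(d2)
    mean1 = sum(d1) / n1
    mean2 = sum(d2) / n2
    mean_sq1 = sum(x * x for x in d1) / n1
    mean_sq2 = sum(x * x for x in d2) / n2
    u = n1 * mean_sq1 / r ** 2 + n2 * mean_sq2
    v = n1 * mean1 / r ** 2 + n2 * mean2
    w = n1 / r ** 2 + n2
    z = u - v ** 2 / w
    return u, v, w, z

def exponent_direct(d1, d2, r, c1):
    """Sum of squared residuals scaled by sigma**2, assuming sigma1 = r*sigma, sigma2 = sigma."""
    return (sum((x - c1) ** 2 for x in d1) / r ** 2
            + sum((x - c1) ** 2 for x in d2))
```

The identity to confirm is `exponent_direct(d1, d2, r, c1) == w * (c1 - v / w) ** 2 + z`, which is what turns the $c_1$ integral into a difference of error functions centered on $v/w$.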
C.4.2 Estimating the ratio, given the means are different

The second term that must be computed is $p(r|\bar{C},D_1,D_2,I)$, the probability for the ratio of standard deviations given that the means are not the same. This is a marginal probability where $\sigma$, $c_1$, and $c_2$ have been removed as nuisance parameters:

$$\begin{aligned} p(r|\bar{C},D_1,D_2,I) &= \int dc_1\,dc_2\,d\sigma\; p(r,c_1,c_2,\sigma|\bar{C},D_1,D_2,I) \\ &\propto \int dc_1\,dc_2\,d\sigma\; p(r,c_1,c_2,\sigma|\bar{C},I)\; p(D_1,D_2|\bar{C},r,c_1,c_2,\sigma,I) \\ &= \int dc_1\,dc_2\,d\sigma\; p(r|I)\,p(c_1|I)\,p(c_2|I)\,p(\sigma|I)\; p(D_1|r,\bar{C},c_1,\sigma,I)\,p(D_2|r,\bar{C},c_2,\sigma,I), \end{aligned} \tag{C.50}$$

where all of the terms appearing in this probability density function have been previously assigned. To evaluate the integral over $c_1$ and $c_2$, one substitutes Equations (9.62), (9.63) and (C.45), and a Gaussian noise prior is used in assigning the two likelihoods. Evaluating the indicated integrals, one obtains

$$p(r|\bar{C},D_1,D_2,I) \propto \frac{(2\pi)^{-N/2}\,\pi}{4R_c^2[\log(R_\sigma)]^2\sqrt{N_1N_2}}\int_{\sigma_L}^{\sigma_H} d\sigma\,r^{-N_1}\sigma^{-N+1}\exp\left\{-\frac{z_1}{2r^2\sigma^2} - \frac{z_2}{2\sigma^2}\right\}\left[\mathrm{erf}(X_{1H})-\mathrm{erf}(X_{1L})\right]\left[\mathrm{erf}(X_{2H})-\mathrm{erf}(X_{2L})\right], \tag{C.51}$$

where

$$X_{1H} = \sqrt{\frac{N_1}{2r^2\sigma^2}}\,[c_{1H}-\bar{d}_1], \qquad X_{1L} = \sqrt{\frac{N_1}{2r^2\sigma^2}}\,[c_{1L}-\bar{d}_1], \tag{C.52}$$

$$X_{2H} = \sqrt{\frac{N_2}{2\sigma^2}}\,[c_{2H}-\bar{d}_2], \qquad X_{2L} = \sqrt{\frac{N_2}{2\sigma^2}}\,[c_{2L}-\bar{d}_2], \tag{C.53}$$

$$z_1 = N_1\left[\overline{d_1^2}-(\bar{d}_1)^2\right], \qquad z_2 = N_2\left[\overline{d_2^2}-(\bar{d}_2)^2\right]. \tag{C.54}$$
Appendix D Poisson ON/OFF details
D.1 Derivation of $p(s|N_{\rm on},I)$

In Section 14.4, we explored a Bayesian analysis of ON/OFF measurements, where ON is signal + background and OFF is just the background. The background is only known imprecisely from the OFF measurement. In this appendix, we derive Equation (14.17) for $p(s|N_{\rm on},I)$, the posterior probability of the signal event rate. Our starting point is Equation (14.16), which we repeat here together with some of the other relevant equations:

$$p(s|N_{\rm on},I) = \int_0^{b_{\max}} db\; p(s,b|N_{\rm on},I) \tag{D.1}$$

$$p(s,b|N_{\rm on},I) = \frac{p(s,b|I)\,p(N_{\rm on}|s,b,I)}{p(N_{\rm on}|I)} = \frac{p(s|b,I)\,p(b|I)\,p(N_{\rm on}|s,b,I)}{p(N_{\rm on}|I)} \tag{D.2}$$

$$p(b|N_{\rm off},I_b) = p(b|I) = \frac{T_{\rm off}\,(bT_{\rm off})^{N_{\rm off}}\,e^{-bT_{\rm off}}}{N_{\rm off}!} \tag{D.3}$$

$$p(s|b,I) = 1/s_{\max}. \tag{D.4}$$

The denominator of $p(s,b|N_{\rm on},I)$ in Equation (D.2) is given by

$$p(N_{\rm on}|I) = \int_{s=0}^{s_{\max}} ds\int_{b=0}^{b_{\max}} db\; p(N_{\rm on},s,b|I) = \int_{s=0}^{s_{\max}} ds\int_{b=0}^{b_{\max}} db\; p(s|b,I)\,p(b|I)\,p(N_{\rm on}|s,b,I). \tag{D.5}$$
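Substituting Equations (D.5), (D.4), (D.3) and (D.2) into Equation (D.1), we obtain

$$\begin{aligned} p(s|N_{\rm on},I) &= \frac{\displaystyle\int_{b=0}^{b_{\max}} db\,\frac{1}{s_{\max}}\,\frac{T_{\rm off}^{(1+N_{\rm off})}\,b^{N_{\rm off}}\,e^{-bT_{\rm off}}}{N_{\rm off}!}\,\frac{(s+b)^{N_{\rm on}}\,T_{\rm on}^{N_{\rm on}}\,e^{-(s+b)T_{\rm on}}}{N_{\rm on}!}}{\displaystyle\int_{s=0}^{s_{\max}} ds\int_{b=0}^{b_{\max}} db\,\frac{1}{s_{\max}}\,\frac{T_{\rm off}^{(1+N_{\rm off})}\,b^{N_{\rm off}}\,e^{-bT_{\rm off}}}{N_{\rm off}!}\,\frac{(s+b)^{N_{\rm on}}\,T_{\rm on}^{N_{\rm on}}\,e^{-(s+b)T_{\rm on}}}{N_{\rm on}!}} \\ &= \frac{\displaystyle\int_{b=0}^{b_{\max}} db\; b^{N_{\rm off}}\,e^{-bT_{\rm off}}\,(s+b)^{N_{\rm on}}\,e^{-(s+b)T_{\rm on}}}{\displaystyle\int_{s=0}^{s_{\max}} ds\int_{b=0}^{b_{\max}} db\; b^{N_{\rm off}}\,e^{-bT_{\rm off}}\,(s+b)^{N_{\rm on}}\,e^{-(s+b)T_{\rm on}}} \\ &= \frac{\mathrm{Num}}{\mathrm{Den}}. \end{aligned} \tag{D.6}$$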
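D.1.1 Evaluation of Num

We start with a binomial expansion of $(s+b)^{N_{\rm on}}$:

$$(s+b)^{N_{\rm on}} = \sum_{i=0}^{N_{\rm on}}\frac{N_{\rm on}!}{i!\,(N_{\rm on}-i)!}\,s^i\,b^{(N_{\rm on}-i)}. \tag{D.7}$$

The numerator of Equation (D.6) becomes

$$\begin{aligned} \mathrm{Num} &= \int_{b=0}^{b_{\max}} db\; b^{N_{\rm off}}\,e^{-b(T_{\rm on}+T_{\rm off})}\sum_{i=0}^{N_{\rm on}}\frac{N_{\rm on}!}{i!\,(N_{\rm on}-i)!}\,s^i\,b^{(N_{\rm on}-i)}\,e^{-sT_{\rm on}} \\ &= \sum_{i=0}^{N_{\rm on}}\frac{N_{\rm on}!}{i!\,(N_{\rm on}-i)!}\,s^i\,e^{-sT_{\rm on}}\int_{b=0}^{b_{\max}} db\; b^{(N_{\rm on}+N_{\rm off}-i)}\,e^{-b[T_{\rm on}+T_{\rm off}]} \\ &= \sum_{i=0}^{N_{\rm on}}\frac{N_{\rm on}!}{i!\,(N_{\rm on}-i)!}\,s^i\,e^{-sT_{\rm on}}\times\mathrm{Integral}. \end{aligned} \tag{D.8}$$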
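We now want to evaluate the integral in Equation (D.8), which we first rewrite in the form of an incomplete gamma function:

$$\mathrm{Integral} = (T_{\rm on}+T_{\rm off})^{-(N_{\rm on}+N_{\rm off}-i+1)}\int_{b=0}^{b_{\max}[T_{\rm on}+T_{\rm off}]} d(b[T_{\rm on}+T_{\rm off}])\;(b[T_{\rm on}+T_{\rm off}])^{(N_{\rm on}+N_{\rm off}-i)}\,e^{-b[T_{\rm on}+T_{\rm off}]}. \tag{D.9}$$

Compare this to one form of the incomplete gamma function:

$$\gamma(n+1,x) = \int_0^x dy\; y^n\,e^{-y}. \tag{D.10}$$

Thus, Equation (D.9) can be rewritten as

$$\mathrm{Integral} = (T_{\rm on}+T_{\rm off})^{-(N_{\rm on}+N_{\rm off}-i+1)}\;\gamma\big([N_{\rm on}+N_{\rm off}-i+1],\; b_{\max}[T_{\rm on}+T_{\rm off}]\big). \tag{D.11}$$

Provided $b_{\max}[T_{\rm on}+T_{\rm off}] \gg [N_{\rm on}+N_{\rm off}-i]$, we have that

$$\gamma\big([N_{\rm on}+N_{\rm off}-i+1],\; b_{\max}[T_{\rm on}+T_{\rm off}]\big) \approx \Gamma\big([N_{\rm on}+N_{\rm off}-i+1]\big) = (N_{\rm on}+N_{\rm off}-i)!. \tag{D.12}$$

Substituting Equation (D.12) into Equation (D.11), we obtain

$$\mathrm{Integral} \approx (T_{\rm on}+T_{\rm off})^{-(N_{\rm on}+N_{\rm off}-i+1)}\,(N_{\rm on}+N_{\rm off}-i)!. \tag{D.13}$$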
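Now substitute Equation (D.13) into Equation (D.8) to obtain

$$\begin{aligned} \mathrm{Num} &\approx \frac{N_{\rm on}!}{(T_{\rm on}+T_{\rm off})^{(N_{\rm on}+N_{\rm off}+1)}}\sum_{i=0}^{N_{\rm on}}\frac{(N_{\rm on}+N_{\rm off}-i)!}{i!\,(N_{\rm on}-i)!}\,s^i\,e^{-sT_{\rm on}}\,(T_{\rm on}+T_{\rm off})^i \\ &= \frac{N_{\rm on}!}{T_{\rm on}\,(T_{\rm on}+T_{\rm off})^{(N_{\rm on}+N_{\rm off}+1)}}\sum_{i=0}^{N_{\rm on}}\frac{(N_{\rm on}+N_{\rm off}-i)!}{i!\,(N_{\rm on}-i)!}\left(1+\frac{T_{\rm off}}{T_{\rm on}}\right)^i T_{\rm on}\,(sT_{\rm on})^i\,e^{-sT_{\rm on}}. \end{aligned} \tag{D.14}$$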
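The approximation in Equation (D.12) replaces the incomplete gamma function by a factorial, which is only safe when the upper limit is well above the index. The following minimal sketch quantifies that condition for integer $n$, using the standard closed form $\gamma(n+1,x)/n! = 1 - e^{-x}\sum_{k=0}^{n} x^k/k!$; the function name is an illustrative assumption.

```python
import math

def lower_gamma_ratio(n, x):
    """Return gamma(n+1, x) / n! for integer n >= 0 (1.0 means the approximation is exact)."""
    term, total = 1.0, 1.0
    for k in range(1, n + 1):
        term *= x / k          # builds x**k / k!
        total += term
    return 1.0 - math.exp(-x) * total

# Example: with n = N_on + N_off - i = 5, the ratio is far from 1 at x = 5
# but indistinguishable from 1 at x = 50.
ratio_marginal = lower_gamma_ratio(5, 5.0)
ratio_safe = lower_gamma_ratio(5, 50.0)
```

In practice one would evaluate this ratio at $x = b_{\max}[T_{\rm on}+T_{\rm off}]$ for the largest index, $n = N_{\rm on}+N_{\rm off}$, before relying on Equation (D.13).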
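D.1.2 Evaluation of Den

The equation for the denominator (Den) in Equation (D.6) is the same as Equation (D.14) for the numerator (Num) except for the additional integral over $s$:

$$\begin{aligned} \mathrm{Den} &= \int_{s=0}^{s_{\max}} ds\,\frac{N_{\rm on}!}{T_{\rm on}\,(T_{\rm on}+T_{\rm off})^{(N_{\rm on}+N_{\rm off}+1)}}\sum_{i=0}^{N_{\rm on}}\frac{(N_{\rm on}+N_{\rm off}-i)!}{i!\,(N_{\rm on}-i)!}\left(1+\frac{T_{\rm off}}{T_{\rm on}}\right)^i T_{\rm on}\,(sT_{\rm on})^i\,e^{-sT_{\rm on}} \\ &= \frac{N_{\rm on}!}{T_{\rm on}\,(T_{\rm on}+T_{\rm off})^{(N_{\rm on}+N_{\rm off}+1)}}\sum_{i=0}^{N_{\rm on}}\frac{(N_{\rm on}+N_{\rm off}-i)!}{i!\,(N_{\rm on}-i)!}\left(1+\frac{T_{\rm off}}{T_{\rm on}}\right)^i\int_{s=0}^{s_{\max}} d(sT_{\rm on})\,(sT_{\rm on})^i\,e^{-sT_{\rm on}}. \end{aligned} \tag{D.15}$$

The integral can be recognized as the incomplete gamma function $\gamma(i+1, s_{\max}T_{\rm on})$. Provided $s_{\max}T_{\rm on} \gg i+1$, we can write $\gamma(i+1, s_{\max}T_{\rm on}) \approx i!$, and Equation (D.15) simplifies to

$$\mathrm{Den} \approx \frac{N_{\rm on}!}{T_{\rm on}\,(T_{\rm on}+T_{\rm off})^{(N_{\rm on}+N_{\rm off}+1)}}\sum_{i=0}^{N_{\rm on}}\frac{(N_{\rm on}+N_{\rm off}-i)!}{(N_{\rm on}-i)!}\left(1+\frac{T_{\rm off}}{T_{\rm on}}\right)^i. \tag{D.16}$$

Substitution of Equations (D.14) and (D.16) into Equation (D.6) yields

$$p(s|N_{\rm on},I) = \sum_{i=0}^{N_{\rm on}} C_i\,\frac{T_{\rm on}\,(sT_{\rm on})^i\,e^{-sT_{\rm on}}}{i!}, \tag{D.17}$$

where

$$C_i \equiv \frac{\left(1+\dfrac{T_{\rm off}}{T_{\rm on}}\right)^i\dfrac{(N_{\rm on}+N_{\rm off}-i)!}{(N_{\rm on}-i)!}}{\displaystyle\sum_{j=0}^{N_{\rm on}}\left(1+\frac{T_{\rm off}}{T_{\rm on}}\right)^j\frac{(N_{\rm on}+N_{\rm off}-j)!}{(N_{\rm on}-j)!}}. \tag{D.18}$$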
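Equations (D.17) and (D.18) are straightforward to evaluate on a computer. The sketch below is one possible Python implementation, not the book's Mathematica code; the function names are hypothetical, and the coefficients are computed through `math.lgamma` purely as an assumed safeguard against factorial overflow for large counts.

```python
import math

def on_off_coefficients(n_on, n_off, t_on, t_off):
    """Weights C_i of Eq. (D.18), for i = 0 .. n_on; they sum to 1."""
    ratio = 1.0 + t_off / t_on
    logs = [i * math.log(ratio)
            + math.lgamma(n_on + n_off - i + 1)   # log (N_on + N_off - i)!
            - math.lgamma(n_on - i + 1)           # log (N_on - i)!
            for i in range(n_on + 1)]
    top = max(logs)                               # subtract the maximum before exponentiating
    weights = [math.exp(v - top) for v in logs]
    norm = sum(weights)
    return [w / norm for w in weights]

def signal_posterior(s, n_on, n_off, t_on, t_off):
    """Posterior density p(s | N_on, I) of Eq. (D.17) for a signal rate s >= 0."""
    coeffs = on_off_coefficients(n_on, n_off, t_on, t_off)
    total = 0.0
    for i, c in enumerate(coeffs):
        # Each term is C_i times a gamma density of shape i + 1 and rate T_on.
        total += c * t_on * (s * t_on) ** i * math.exp(-s * t_on) / math.factorial(i)
    return total
```

Because each term in the sum is a normalized density in $s$ and the $C_i$ sum to one, the posterior integrates to one whenever $s_{\max}T_{\rm on}$ is large enough for the approximation above to hold.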
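D.2 Derivation of the Bayes factor $B_{\{s+b,b\}}$

Here, we will derive the Bayes factor, $B_{\{s+b,b\}}$, given in Equation (14.23), for the two models $M_{s+b}$ and $M_b$, which have the following meaning:

$M_b \equiv$ "the ON measurement is solely due to the Poisson background rate, $b$, where the prior probability for $b$ is derived from the OFF measurement."

$M_{s+b} \equiv$ "the ON measurement is due to a source with unknown Poisson rate, $s$, plus a Poisson background rate $b$. Again, the prior probability for $b$ is derived from the OFF measurement."

$$\begin{aligned} B_{\{s+b,b\}} &= \frac{p(N_{\rm on}|M_{s+b},I_{\rm off})}{p(N_{\rm on}|M_b,I_{\rm off})} \\ &= \frac{\displaystyle\int_0^{s_{\max}} ds\int_0^{b_{\max}} db\; p(N_{\rm on},s,b|M_{s+b},I_{\rm off})}{\displaystyle\int_0^{b_{\max}} db\; p(N_{\rm on},b|M_b,I_{\rm off})} \\ &= \frac{\displaystyle\int_0^{s_{\max}} ds\int_0^{b_{\max}} db\; p(s|b,M_{s+b},I_{\rm off})\,p(b|M_{s+b},I_{\rm off})\,p(N_{\rm on}|s,b,M_{s+b},I_{\rm off})}{\displaystyle\int_0^{b_{\max}} db\; p(b|M_b,I_{\rm off})\,p(N_{\rm on}|b,M_b,I_{\rm off})} \\ &= \frac{\displaystyle\int_{s=0}^{s_{\max}} ds\int_{b=0}^{b_{\max}} db\,\frac{1}{s_{\max}}\,\frac{T_{\rm off}^{(1+N_{\rm off})}\,b^{N_{\rm off}}\,e^{-bT_{\rm off}}}{N_{\rm off}!}\,\frac{(s+b)^{N_{\rm on}}\,T_{\rm on}^{N_{\rm on}}\,e^{-(s+b)T_{\rm on}}}{N_{\rm on}!}}{\displaystyle\int_{b=0}^{b_{\max}} db\,\frac{T_{\rm off}^{(1+N_{\rm off})}\,b^{N_{\rm off}}\,e^{-bT_{\rm off}}}{N_{\rm off}!}\,\frac{b^{N_{\rm on}}\,T_{\rm on}^{N_{\rm on}}\,e^{-bT_{\rm on}}}{N_{\rm on}!}} \\ &= \frac{\displaystyle\frac{1}{s_{\max}}\int_{s=0}^{s_{\max}} ds\int_{b=0}^{b_{\max}} db\; b^{N_{\rm off}}\,e^{-bT_{\rm off}}\,(s+b)^{N_{\rm on}}\,e^{-(s+b)T_{\rm on}}}{\displaystyle\int_{b=0}^{b_{\max}} db\; b^{(N_{\rm on}+N_{\rm off})}\,e^{-b(T_{\rm on}+T_{\rm off})}} \\ &= \frac{\mathrm{Num1}}{\mathrm{Den1}}, \end{aligned} \tag{D.19}$$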
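where $I_{\rm off} \equiv N_{\rm off}, I_b$, as defined in Equation (14.21). Comparing Equation (D.19) to Equation (D.6), we see that $\mathrm{Num1} = (1/s_{\max})\,\mathrm{Den}$, which we have already evaluated in Equation (D.16). All that remains is to evaluate Den1, which we do here:

$$\begin{aligned} \mathrm{Den1} &= \int_{b=0}^{b_{\max}} db\; b^{(N_{\rm on}+N_{\rm off})}\,e^{-b(T_{\rm on}+T_{\rm off})} \\ &= (T_{\rm on}+T_{\rm off})^{-(N_{\rm on}+N_{\rm off}+1)}\int_{b=0}^{b_{\max}[T_{\rm on}+T_{\rm off}]} d(b[T_{\rm on}+T_{\rm off}])\;(b[T_{\rm on}+T_{\rm off}])^{(N_{\rm on}+N_{\rm off})}\,e^{-b(T_{\rm on}+T_{\rm off})}. \end{aligned} \tag{D.20}$$

The integral in the above equation is the incomplete gamma function $\gamma([N_{\rm on}+N_{\rm off}+1],\, b_{\max}[T_{\rm on}+T_{\rm off}])$, which can be approximated as

$$\gamma\big([N_{\rm on}+N_{\rm off}+1],\; b_{\max}[T_{\rm on}+T_{\rm off}]\big) \approx [N_{\rm on}+N_{\rm off}]!, \tag{D.21}$$

provided $b_{\max}[T_{\rm on}+T_{\rm off}] \gg [N_{\rm on}+N_{\rm off}]$. Equation (D.20) can be rewritten as

$$\mathrm{Den1} \approx (T_{\rm on}+T_{\rm off})^{-(N_{\rm on}+N_{\rm off}+1)}\,[N_{\rm on}+N_{\rm off}]!. \tag{D.22}$$

Substituting Num1 and Den1 into Equation (D.19), and canceling quantities in common, yields

$$B_{\{s+b,b\}} \approx \frac{N_{\rm on}!}{s_{\max}T_{\rm on}\,(N_{\rm on}+N_{\rm off})!}\sum_{i=0}^{N_{\rm on}}\frac{(N_{\rm on}+N_{\rm off}-i)!}{(N_{\rm on}-i)!}\left(1+\frac{T_{\rm off}}{T_{\rm on}}\right)^i. \tag{D.23}$$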
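Equation (D.23) can be evaluated directly. The following is a minimal Python sketch under the same approximations as the derivation (large $b_{\max}$ and $s_{\max}$); the function name is hypothetical, and working in logarithms via `math.lgamma` is an implementation choice made here to keep the factorials from overflowing.

```python
import math

def bayes_factor_on_off(n_on, n_off, t_on, t_off, s_max):
    """Approximate Bayes factor B_{s+b,b} of Eq. (D.23)."""
    ratio = 1.0 + t_off / t_on
    # log of N_on! / (s_max * T_on * (N_on + N_off)!)
    log_prefactor = (math.lgamma(n_on + 1)
                     - math.lgamma(n_on + n_off + 1)
                     - math.log(s_max * t_on))
    # log of each term (N_on + N_off - i)! / (N_on - i)! * ratio**i
    logs = [math.lgamma(n_on + n_off - i + 1)
            - math.lgamma(n_on - i + 1)
            + i * math.log(ratio)
            for i in range(n_on + 1)]
    top = max(logs)
    return math.exp(log_prefactor + top) * sum(math.exp(v - top) for v in logs)
```

Two hand checks follow from the formula itself: with $N_{\rm on} = 0$ the sum has a single term and the result reduces to $1/(s_{\max}T_{\rm on})$; with $N_{\rm on}=2$, $N_{\rm off}=1$, $T_{\rm on}=T_{\rm off}=1$ and $s_{\max}=4$ the sum is $3 + 4 + 4 = 11$ and the result is $11/12$.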
Appendix E Multivariate Gaussian from maximum entropy
In this appendix, we will derive the multivariate Gaussian distribution of Equation (8.59) from the MaxEnt principle, given constraint information on the variances and covariances of the multiple variables. We will start with the simpler case of only two variables, $y_1$ and $y_2$, and then generalize the result to an arbitrary number of variables. We assume that the priors for $y_1$ and $y_2$ have the following form:

$$m(y_i) = \begin{cases} \dfrac{1}{y_{iH}-y_{iL}}, & \text{if } y_{iL} \le y_i \le y_{iH} \\ 0, & \text{if } y_{iL} > y_i \text{ or } y_i > y_{iH}. \end{cases} \tag{E.1}$$

The constraints in this case are:

1. $\int_{y_{1L}}^{y_{1H}}\int_{y_{2L}}^{y_{2H}} p(y_1,y_2)\,dy_1\,dy_2 = 1$
2. $\int_{y_{1L}}^{y_{1H}}\int_{y_{2L}}^{y_{2H}} (y_1-\mu_1)^2\,p(y_1,y_2)\,dy_1\,dy_2 = \sigma_{11} = \sigma_1^2$
3. $\int_{y_{1L}}^{y_{1H}}\int_{y_{2L}}^{y_{2H}} (y_2-\mu_2)^2\,p(y_1,y_2)\,dy_1\,dy_2 = \sigma_{22} = \sigma_2^2$
4. $\int_{y_{1L}}^{y_{1H}}\int_{y_{2L}}^{y_{2H}} (y_1-\mu_1)(y_2-\mu_2)\,p(y_1,y_2)\,dy_1\,dy_2 = \sigma_{12} = \sigma_{21}$

Because $m(y_i)$ is a constant, we solve for $p(y_1,y_2)$ which maximizes

$$S = -\int p(y_1,y_2)\,\ln[p(y_1,y_2)]\,d^N y, \tag{E.2}$$
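where $N = 2$ in this case. The problem then is to find the $p(\{y_i\})$ that maximizes $S$ subject to the constraints 1 to 4. This optimization is best done as the limiting case of a discrete problem. Let $\mathrm{y}_i$ and $\mathrm{y}_j$ (Roman typeface) represent the discrete versions of $y_1$ and $y_2$, respectively. Explicitly, we need to find the solution to

$$d\left\{-\sum_{ij} p_{ij}\ln p_{ij} - \lambda\left[\sum_{ij} p_{ij} - 1\right] - \frac{\lambda_1}{2}A_1 - \frac{\lambda_2}{2}A_2 - \frac{\lambda_3}{2}A_3\right\} = 0, \tag{E.3}$$

where

$$A_1 = \left\{\sum_{ij}(\mathrm{y}_i-\mu_i)^2\,p_{ij} - \sigma_{ii}\right\}, \qquad A_2 = \left\{\sum_{ij}(\mathrm{y}_j-\mu_j)^2\,p_{ij} - \sigma_{jj}\right\}, \qquad A_3 = \left\{\sum_{ij}(\mathrm{y}_i-\mu_i)(\mathrm{y}_j-\mu_j)\,p_{ij} - \sigma_{ij}\right\},$$

and

$$\sum_{ij} = \sum_{i=1}^{N}\sum_{j=1}^{N}.$$

This leads to

$$\sum_{ij}\left[-\ln p_{ij} - 1 - \lambda - \frac{1}{2}\left\{\lambda_1(\mathrm{y}_i-\mu_i)^2 + \lambda_2(\mathrm{y}_j-\mu_j)^2 + \lambda_3(\mathrm{y}_i-\mu_i)(\mathrm{y}_j-\mu_j)\right\}\right]dp_{ij} = 0. \tag{E.4}$$

For each $ij$, we require

$$-\ln p_{ij} - 1 - \lambda - \frac{1}{2}\left\{\lambda_1(\mathrm{y}_i-\mu_i)^2 + \lambda_2(\mathrm{y}_j-\mu_j)^2 + \lambda_3(\mathrm{y}_i-\mu_i)(\mathrm{y}_j-\mu_j)\right\} = 0, \tag{E.5}$$

or,

$$p_{ij} = e^{-\lambda_0}\exp\left[-\frac{1}{2}\left\{\lambda_1(\mathrm{y}_i-\mu_i)^2 + \lambda_2(\mathrm{y}_j-\mu_j)^2 + \lambda_3(\mathrm{y}_i-\mu_i)(\mathrm{y}_j-\mu_j)\right\}\right], \tag{E.6}$$

where $\lambda_0 = 1 + \lambda$. This generalizes to the continuum assignment

$$p(y_1,y_2) = \exp\{-\lambda_0\}\exp\left[-\frac{1}{2}\left\{\lambda_1(y_1-\mu_1)^2 + \lambda_2(y_2-\mu_2)^2 + \lambda_3(y_1-\mu_1)(y_2-\mu_2)\right\}\right]. \tag{E.7}$$

To simplify the notation, we will use the abbreviation $\delta y_1 = (y_1-\mu_1)$ and $\delta y_2 = (y_2-\mu_2)$. Then Equation (E.7) becomes

$$p(y_1,y_2) = \exp\{-\lambda_0\}\exp\left[-\frac{1}{2}\left\{\lambda_1\,\delta y_1^2 + \lambda_2\,\delta y_2^2 + \lambda_3\,\delta y_1\,\delta y_2\right\}\right] = \exp\{-\lambda_0\}\exp\left[-\frac{Q}{2}\right], \tag{E.8}$$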
where

$$\begin{aligned} Q &= \lambda_1\,\delta y_1^2 + \lambda_2\,\delta y_2^2 + \lambda_3\,\delta y_1\,\delta y_2 \\ &= \lambda_1\left[\delta y_1^2 + 2\,\frac{\lambda_3}{2\lambda_1}\,\delta y_1\,\delta y_2 + \frac{\lambda_3^2}{4\lambda_1^2}\,\delta y_2^2\right] + \lambda_2\,\delta y_2^2 - \frac{\lambda_3^2}{4\lambda_1}\,\delta y_2^2 \\ &= \lambda_1\left(\delta y_1 + \frac{\lambda_3}{2\lambda_1}\,\delta y_2\right)^2 + \left(\lambda_2 - \frac{\lambda_3^2}{4\lambda_1}\right)\delta y_2^2. \end{aligned} \tag{E.9}$$

In Equation (E.9), we have carried out an operation called completing the squares, which will help us in our next step, evaluating $\lambda_0$ from constraint number 1:

$$\begin{aligned} \int_{y_{1L}}^{y_{1H}}\int_{y_{2L}}^{y_{2H}} p(y_1,y_2)\,dy_1\,dy_2 &= e^{-\lambda_0}\int_{y_{1L}}^{y_{1H}}\int_{y_{2L}}^{y_{2H}}\exp\left[-\frac{Q}{2}\right]dy_1\,dy_2 \\ &= e^{-\lambda_0}\int_{y_{2L}}^{y_{2H}} dy_2\,\exp\left[-\frac{1}{2}\left(\lambda_2 - \frac{\lambda_3^2}{4\lambda_1}\right)\delta y_2^2\right]\int_{y_{1L}}^{y_{1H}} dy_1\,\exp\left[-\frac{\lambda_1}{2}\left(\delta y_1 + \frac{\lambda_3}{2\lambda_1}\,\delta y_2\right)^2\right] = 1. \end{aligned} \tag{E.10}$$

The second integrand in Equation (E.10) is a Gaussian in $y_1$, with variance $1/\lambda_1$. If the range of integration were infinite, the integral would merely be a constant (the normalization constant for the Gaussian, $\sqrt{2\pi/\lambda_1}$). With a finite range, it can be written in terms of error functions with arguments that depend on $y_2$. But, as we showed in Section 8.7.4, as long as the limits $y_{1H}$ and $y_{1L}$ lie well outside the region where there is a significant contribution to the integral, then the limits can effectively be replaced by $+\infty$ and $-\infty$, which is what we assume here. The first integrand in Equation (E.10) is another Gaussian in $y_2$. We will also assume that its range of integration is effectively infinite, so the integral evaluates to the normalization constant, $\sqrt{2\pi/(\lambda_2 - \lambda_3^2/4\lambda_1)}$. Equation (E.10) thus simplifies to

$$e^{-\lambda_0}\,\frac{2\pi}{\sqrt{\lambda_1\lambda_2 - \dfrac{\lambda_3^2}{4}}} = 1. \tag{E.11}$$

The solution is

$$e^{-\lambda_0} = \frac{1}{2\pi}\sqrt{\lambda_1\lambda_2 - \frac{\lambda_3^2}{4}}. \tag{E.12}$$
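We now make use of Equation (8.22) to evaluate the remaining Lagrange multipliers, $\lambda_1$, $\lambda_2$ and $\lambda_3$:

$$-\frac{\partial\lambda_0}{\partial\lambda_1} = \frac{1}{2}\langle(y_1-\mu_1)^2\rangle = \frac{\sigma_{11}}{2}, \tag{E.13}$$

$$-\frac{\partial\lambda_0}{\partial\lambda_2} = \frac{1}{2}\langle(y_2-\mu_2)^2\rangle = \frac{\sigma_{22}}{2}, \tag{E.14}$$

$$-\frac{\partial\lambda_0}{\partial\lambda_3} = \frac{1}{2}\langle(y_1-\mu_1)(y_2-\mu_2)\rangle = \frac{\sigma_{12}}{2}. \tag{E.15}$$

Note: the extra factor of 2 appearing in the denominator on the right hand side of Equations (E.13), (E.14) and (E.15), when compared to Equation (8.22), arises from the factor of 1/2 introduced in front of $\lambda_1$, $\lambda_2$ and $\lambda_3$ in Equation (E.4), which defines the meaning of these Lagrange multipliers. The solutions to Equations (E.13), (E.14), and (E.15) are as follows:

$$\lambda_1 = \frac{\sigma_{22}}{\sigma_{11}\sigma_{22}-\sigma_{12}^2}, \tag{E.16}$$

$$\lambda_2 = \frac{\sigma_{11}}{\sigma_{11}\sigma_{22}-\sigma_{12}^2}, \tag{E.17}$$

$$\lambda_3 = \frac{-2\sigma_{12}}{\sigma_{11}\sigma_{22}-\sigma_{12}^2}. \tag{E.18}$$

Equation (E.12) for the term $e^{-\lambda_0}$ can now be expressed in terms of $\sigma_{11}$, $\sigma_{22}$ and $\sigma_{12}$ as follows:

$$e^{-\lambda_0} = \frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}-\sigma_{12}^2}}. \tag{E.19}$$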
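The solutions (E.16) to (E.18) can be checked against the derivative conditions (E.13) to (E.15) by finite differences. The sketch below is illustrative only; the function names and the step size are assumptions made here, and $\lambda_0$ is taken from Equation (E.12).

```python
import math

def lagrange_multipliers(s11, s22, s12):
    """lambda_1, lambda_2, lambda_3 of Eqs. (E.16)-(E.18)."""
    det = s11 * s22 - s12 ** 2
    return s22 / det, s11 / det, -2.0 * s12 / det

def lambda0(l1, l2, l3):
    """lambda_0 obtained by taking the logarithm of Eq. (E.12)."""
    return -math.log(math.sqrt(l1 * l2 - l3 ** 2 / 4.0) / (2.0 * math.pi))

def minus_partial(func, args, index, step=1e-6):
    """Central-difference estimate of -d(func)/d(args[index])."""
    up = list(args)
    down = list(args)
    up[index] += step
    down[index] -= step
    return -(func(*up) - func(*down)) / (2.0 * step)
```

For example, with $\sigma_{11} = 2$, $\sigma_{22} = 1$ and $\sigma_{12} = 0.6$, the three estimates returned by `minus_partial` should be close to $\sigma_{11}/2$, $\sigma_{22}/2$ and $\sigma_{12}/2$, and $e^{-\lambda_0}$ should match Equation (E.19).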
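At this point, it is convenient to rewrite $Q$, which first appeared in Equation (E.8), in the following matrix form:

$$Q = \begin{pmatrix}\delta y_1 & \delta y_2\end{pmatrix}\begin{pmatrix}\lambda_1 & \lambda_3/2 \\ \lambda_3/2 & \lambda_2\end{pmatrix}\begin{pmatrix}\delta y_1 \\ \delta y_2\end{pmatrix} = \mathbf{Y}^T\mathbf{E}^{-1}\mathbf{Y}. \tag{E.20}$$

The $\mathbf{E}^{-1}$ matrix, which stands for the inverse of the $\mathbf{E}$ matrix, can be expressed in terms of $\sigma_{11}$, $\sigma_{22}$ and $\sigma_{12}$ as follows:

$$\mathbf{E}^{-1} = \frac{1}{\sigma_{11}\sigma_{22}-\sigma_{12}^2}\begin{pmatrix}\sigma_{22} & -\sigma_{12} \\ -\sigma_{12} & \sigma_{11}\end{pmatrix}. \tag{E.21}$$

Although $\mathbf{E}^{-1}$ is rather messy, the $\mathbf{E}$ matrix itself is a very simple and useful matrix:

$$\mathbf{E} = \begin{pmatrix}\sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22}\end{pmatrix}. \tag{E.22}$$

Now substitute Equations (E.20) and (E.19) into Equation (E.8) to obtain a final equation for $p(y_1,y_2)$:

$$p(y_1,y_2) = \frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}-\sigma_{12}^2}}\exp\left[-\frac{1}{2}\mathbf{Y}^T\mathbf{E}^{-1}\mathbf{Y}\right] = \frac{1}{(2\pi)^{N/2}\sqrt{\det\mathbf{E}}}\exp\left[-\frac{1}{2}\mathbf{Y}^T\mathbf{E}^{-1}\mathbf{Y}\right], \tag{E.23}$$

where $N = 2$ for two variables. Equation (E.23) is also valid for an arbitrary number of variables,¹ which we write as

$$p(\{y_i\}|\{\mu_i,\sigma_{ij}\}) = \frac{1}{(2\pi)^{N/2}\sqrt{\det\mathbf{E}}}\exp\left[-\frac{1}{2}\mathbf{Y}^T\mathbf{E}^{-1}\mathbf{Y}\right] = \frac{1}{(2\pi)^{N/2}\sqrt{\det\mathbf{E}}}\exp\left[-\frac{1}{2}\sum_{ij}(y_i-\mu_i)\,[\mathbf{E}^{-1}]_{ij}\,(y_j-\mu_j)\right], \tag{E.24}$$

where

$$\mathbf{E} = \begin{pmatrix}\sigma_{11} & \sigma_{12} & \sigma_{13} & \cdots & \sigma_{1N} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} & \cdots & \sigma_{2N} \\ \vdots & \vdots & \vdots & & \vdots \\ \sigma_{N1} & \sigma_{N2} & \sigma_{N3} & \cdots & \sigma_{NN}\end{pmatrix}. \tag{E.25}$$

The $\mathbf{E}$ matrix is called the data covariance matrix when each $y$ variable describes possible values of a datum, $d_i$.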
¹ In Equation (E.24), $\{y_i\}$ refers to a set of continuous variables.
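As a consistency check on the two-variable result of Equation (E.23), the density can be evaluated on a grid to confirm that it is normalized and that it reproduces the covariance supplied as a constraint. This is a minimal sketch with hypothetical function names; the grid half-width and step count are arbitrary choices that assume the integration range lies well outside the bulk of the distribution.

```python
import math

def bivariate_gaussian(y1, y2, mu1, mu2, s11, s22, s12):
    """Two-variable density of Eq. (E.23), using the explicit inverse of Eq. (E.21)."""
    det = s11 * s22 - s12 ** 2
    dy1 = y1 - mu1
    dy2 = y2 - mu2
    quad = (s22 * dy1 ** 2 - 2.0 * s12 * dy1 * dy2 + s11 * dy2 ** 2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def grid_moments(mu1, mu2, s11, s22, s12, half_width=10.0, steps=200):
    """Midpoint-rule estimates of the total probability and of the covariance."""
    h = 2.0 * half_width / steps
    total = 0.0
    cov = 0.0
    for i in range(steps):
        y1 = mu1 - half_width + (i + 0.5) * h
        for j in range(steps):
            y2 = mu2 - half_width + (j + 0.5) * h
            p = bivariate_gaussian(y1, y2, mu1, mu2, s11, s22, s12)
            total += p
            cov += (y1 - mu1) * (y2 - mu2) * p
    return total * h * h, cov * h * h
```

With $\sigma_{11} = 2$, $\sigma_{22} = 1$ and $\sigma_{12} = 0.6$, `grid_moments` should return values close to 1 and 0.6, which are constraints 1 and 4 of the derivation.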
References
Acze´l, J. (1966). Lectures on Functional Equations and their Applications. New York: Academic Press. See also Aczel, J. (1987), A Short Course on Functional Equations, Dordrecht–Holland: D. Reidel. Barber, M. N., Pearson, R. B., Toussaint, D., and Richardson, J. L. (1985). Finite-size scaling in the three-dimensional Ising model. Physics Review B, 32, 1720–1730. Bayes, T. (1763). An essay toward solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society, pp. 370–418. Berger, J. O. and Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist, 76, 159–165. Berger, J. O. and Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p-values and evidence. Journal of the American Statistical Association, 82, 112–122. Bernoulli, J. (1713). Ars conjectandi, Basel: Thurnisiorum. Reprinted in Die Werke von Jakob Bernoulli, Vol. 3, Basel: Birkhaeuser, (1975), pp. 107–286. Blackman, R. B. and Tukey, J. W. (1958). The Measurement of Power Spectra. New York: Dover Publications, Inc. Boole, G. (1854). An Investigation of the laws of Thought. London: Macmillan; reprinted by Dover Publications, New York (1958). Bretthorst, G. L. (1988). Bayesian Spectrum Analysis and Parameter Estimation. New York: Springer-Verlag. Bretthorst, G. L. (1990a). Bayesian analysis. I. Parameter estimation using quadrature NMR models. Journal of Magnetic Resonance, 88, 533–551. Bretthorst, G. L. (1990b). Bayesian analysis. II. Signal detection and model selection. Journal of Magnetic Resonance, 88, 552–570. Bretthorst, G. L. (1990c). Bayesian analysis. III. Applications to NMR signal detection, model selection, and parameter estimation. Journal of Magnetic Resonance, 88, 571–595. Bretthorst, G. L. (1991). Bayesian analysis. IV. Noise and computing time considerations. Journal of Magnetic Resonance, 93, 369–394. Bretthorst, G. L. (1993). On the difference in means. 
In Physics & Probability Essays in honor of Edwin T. Jaynes, W. T. Grandy and P. W. Milonni (eds.). England: Cambridge University Press. Bretthorst, G. L. (2000a). Nonuniform sampling: Bandwidth and aliasing. In Maximum Entropy and Bayesian Methods in Science and Engineering, J. Rychert, G. Erickson, and C. R. Smith (eds.). USA: American Institute of Physics, pp. 1–28. 455
456
References
Bretthorst, G. L. (2001). Generalizing the Lomb–Scargle periodogram. In Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Paris Ali Mohammad-Djafari (ed.). New York: American Institute of Physics Proceedings, 568, 241–245. Brigham, E. O. (1988). The Fast Fourier Transform and Its Applications, New Jersey: Prentice Hall. Buck, B. and MaCaulay V. A. (eds.) (1991). Maximum Entropy in Action. Oxford Science Publication, Oxford: Clarendon Press. Charbonneau, P. (1995). Genetic algorithms in astronomy and astrophysics. Astrophysical Journal (Supplements), 101, 309–334. Charbonneau, P. and Knapp, B. (1995). A User’s guide to PIKAIA 1.0, NCAR Technical Note 418+IA. Boulder: National Center for Atmospheric Research. Chib, S. and Greenberg, E. (1995). Understanding the Metropolis algorithm. American Statistician, 49, 327–335. Chu, S. (2003). Using soccer goals to motivate the Poisson process. INFORMS ransactions on Education, 3 (2) http://ite.pubs.informs.org/Vol3No2/Chu/index.php. Cooley, J. W. and Tukey, J. W. (1965). An algorithm for the machine calculation of complex fourier series. Mathematics of Computing, 19, 297–301. Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal of Physics, 17, 1–13. Cox, R. T. (1961). The Algebra of Probable Inference, Baltimore, MD: Johns Hopkins University Press. D’Agostini, G. (1999). Bayesian Reasoning in High-Energy Physics: Principles and Applications. CERN Yellow Reports. Dayal, Hari H. (1972). Bayesian statistical inference in Behrens–Fisher Problems, Ph.D. dissertation, State University of New York at Buffalo, September 1972. Dayal., Hari H. and James M. Dickey, (1976), Bayes factors for Behrens–Fisher problems. The Indian Journal of Statistics, 38, 315–328. Delampady, M. and Berger, J. O. (1990). Lower bounds on Bayes factors for multinomial distributions, with applications to chi-squared tests of fit. Annals of Statistics, 18, 1295–1316. Feigelson E. D. and Babu, G. J. 
(eds.) (2002). Statistical Challenges in Modern Astronomy III. New York: Springer-Verlag. Fernandez, J. F. and Rivero, J. (1996). Fast algorithms for random numbers with exponential and normal distributions. Computers in Physics, 10, 83–88. Fox, C. and Nicholls, G. K. (2001). Exact MAP states and expectations from perfect sampling: Greig, Porteous and Seheult revisited. In Bayesian Inference and Maximum Entropy Methods in Science and Engineeering, Paris. Ali MohammadDjafari (ed). New York: American Institute of Physics Proceedings, 568, 252–263. Geyer, C. and Thompson, E. (1995). Annealing Markov chain Monte Carlo with applications to ancestral inference. Journal of the American Statistical Association, 90, 909–920. Gilks, W. R., Richardson, S. and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in Practice. London: Chapman and Hall. Goggans, P. M. and Chi, Y. (2004). Using thermodynamic integration to calculate the posterior probability in Bayesian model selection problems. In Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Proceedings, 23rd International Workshop, G. Erickson and Y. X. Zhai (eds.). USA: American Institute of Physics, pp. 59–66.
References
457
Gregory, P. C. (1999). Bayesian periodic signal detection: Analysis of 20 years of radio flux measurements of LS I+618 303. Astrophysical Journal, 520, 361–375. Gregory, P. C. (2001). A Bayesian revolution in spectral analysis. In Bayesian Inference and Maximum Entropy Methods in Science and Engineeering, Paris. Ali Mohammad-Djafari, (ed.) New York: American Institute of Physics Proceedings, 568, 557–568. Gregory, P. C.(2002). Bayesian analysis of radio observations of the Be X-ray binary LS I+618 303. Astrophysical Journal, 575, 427–434. Gregory, P. C. and Loredo, T. J. (1992). A new method for the detection of a periodic signal of unknown shape and period. Astrophysical Journal, 398, 146–168. Gregory, P. C. and Loredo, T. J. (1993). A Bayesian method for the detection of unknown periodic and non-periodic signals in binned time series. In Maximum Entropy and Bayesian Methods, Paris. Ali Mohammad-Djafari and G. Demoment, (eds.). Dordrecht: Kluwer Academic Press, pp. 225–232. Gregory, P. C. and Loredo, T. J. (1996). Bayesian periodic signal detection: Analysis of ROSAT observations of PSR 0540-693. Astrophysical Journal, 473, 1059–1066. Gregory, P. C. and Neish, C. (2002). Density and velocity structure of the Be star equatorial disk in the binary, LS I+618 303, a probable microquasar. Astrophysical Journal, 580, 1133–1148. Gregory, P. C. and Taylor, A. R. (1978). New highly variable radio source, possible counterpart of gamma-ray source CG 135+1. Nature, 272, 704–706. Gregory, P. C., Xu, H. J., Backhouse, C. J. and Reid, A. (1989). Four-year modulation of periodic radio outbursts from LS I+618 303. Astrophysical Journal, 339, 1054–1058. Gregory, P. C., Peracaula, M. and Taylor, A. R. (1999). Bayesian periodic signal detection: Discovery of periodic phase modulation in LS I+618 303 radio outbursts. Astrophysical Journal, 520, 376–390. Gull, S. F. (1988). Bayesian inductive inference and maximum entropy. 
In Maximum Entropy & Bayesian Methods in Science and Engineering, G. J. Erickson and C. R. Smith (eds.). Dordrecht: Kluwer Academic Press, pp. 53–74. Gull, S. F. (1989a). Developments in maximum entropy data analysis. in Maximum Entropy & Bayesian Methods, J. Skilling (ed.), Dordrecht: Kluwer Academic Press. pp. 53–71. Gull, S. F. (1989b). Bayesian data analysis – straight line fitting. In Maximum Entropy & Bayesian Methods, J. Skilling (ed.). Dordrecht: Kluwer Academic Press, pp. 511–518. Gull, S. F., and Skilling, J. (1984). Maximum entropy method in image processing. IEEE Proceedings, 131, Part F, (6) 646–659. Hastings, W. K. (1970). Monte Carlo Sampling methods using Markov chains and their applications. Biometrika, 57, 97–109. Helene, O. (1983). Upper limit of peak area. Nuclear Instruments and Methods, 212, 319–322. Helene, O. (1984). Errors in experiments with small numbers of events. Nuclear Instruments and Methods, 228, 120–128. Holland, J. (1992). Genetic algorithms. Scientific American, July, 66–72. Hutchings, J. B. and Crampton, D. (1981). Spectroscopy of the unique degenerate binary star LS I+618 303. PASP, 93, 486–489. James, F. (1998). MINUIT, Function Minimization and Error Analysis Reference Manual, Version 94.1. Computing and Network Division, CERN Geneva, Switzerland.
458
References
Jaynes, E. T. (1957). How does the brain do plausible reasoning? Stanford University Microwave Laboratory Report 421. Reprinted in Maximum Entropy and Bayesian Methods in Science and Engineering, G. J. Erickson and C. R. Smith (eds.) (1988). Dordrecht: Kluwer Academic Press.
Jaynes, E. T. (1968). Prior probabilities. IEEE Transactions on System Science & Cybernetics, 4(3), 227–241.
Jaynes, E. T. (1976). Confidence intervals vs Bayesian intervals. In Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science, 2, pp. 175–257, W. L. Harper and C. A. Hooker (eds.). Dordrecht: D. Reidel.
Jaynes, E. T. (1982). On the Rationale of Maximum Entropy Methods. Proceedings of the IEEE, 70(9), 939–952.
Jaynes, E. T. (1983). Papers on Probability, Statistics and Statistical Physics, a reprint collection. Dordrecht: D. Reidel. Second edition, Dordrecht: Kluwer Academic Press, (1989).
Jaynes, E. T. (1987). Bayesian spectrum and chirp analysis. In Maximum Entropy and Bayesian Spectral Analysis and Estimation Problems, C. R. Smith and G. L. Erickson (eds.). Dordrecht: D. Reidel, pp. 1–37.
Jaynes, E. T. (1990). Probability theory as logic. In Maximum-Entropy and Bayesian Methods, P. F. Fougère (ed.). Dordrecht: Kluwer, pp. 1–16.
Jaynes, E. T. (2003). Probability Theory – The Logic of Science, G. L. Bretthorst (ed.). Cambridge: Cambridge University Press.
Jeffreys, H. (1931). Scientific Inference. Cambridge: Cambridge University Press. Later editions, 1937, 1957, 1973.
Jeffreys, H. (1932). On the theory of errors and least squares. Proceedings of the Royal Society, 138, 48–55.
Jeffreys, H. (1939). Theory of Probability. Oxford: Clarendon Press. Later editions, 1948, 1961, 1967, 1988.
Jeffreys, W. H. and Berger, J. O. (1992). Ockham’s razor and Bayesian analysis. American Scientist, 80, 64–72.
Jenkins, G. M. and Watts, D. G. (1968). Spectral Analysis and its Applications. San Francisco: Holden Day.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimisation by simulated annealing. Science, 220, 671–680.
Knuth, D. (1981). Seminumerical algorithms, 2nd edn, vol. 2 of The Art of Computer Programming. Reading, MA: Addison-Wesley.
Laplace, P. S. (1774). Mémoire sur la probabilité des causes par les événements. Mémoires de l’Académie royale des sciences, 6, 621–656. Reprinted in Laplace (1878–1912), vol. 8, pp. 27–65, Paris: Gauthier–Villars, English translation by S. M. Stigler (1986).
Lindley, D. V. (1965). Introduction to Probability and Statistics, (Part 1 – Probability and Part 2 – Inference). Cambridge: Cambridge University Press.
Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. New York: Springer-Verlag.
Lomb, N. R. (1976). Least squares frequency analysis of unevenly spaced data. Astrophysics and Space Science, 39, 447–462.
Loredo, T. J. (1990). From Laplace to Supernova SN 1987A: Bayesian inference in astrophysics. Maximum Entropy and Bayesian Methods, Dartmouth, P. Fougère (ed.). Dordrecht: Kluwer Academic Press, pp. 81–142.
Loredo, T. J. (1992). The promise of Bayesian inference for astrophysics. In Statistical Challenges in Modern Astronomy, E. D. Feigelson and G. J. Babu (eds.). New York: Springer-Verlag, pp. 275–297.
Loredo, T. J. (1999). Computational technology for Bayesian inference. In ASP Conference Series, Vol. 172, Astronomical Data Analysis Software and Systems VIII, D. M. Mehringer, R. L. Plante, and D. A. Roberts (eds.). San Francisco: Astronomical Society of the Pacific, pp. 297–306.
Maddox, J. (1994). The poor quality of random numbers. Nature, 372, 403.
Marple, S. L. (1987). Digital Spectral Analysis (Appendix 4a). Englewood Cliffs, NJ: Prentice Hall.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092.
Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7, 308–313.
Paredes, J. M., Estalella, R. and Rius, A. (1990). Observation at 3.6 cm wavelength of the radio light curve of LS I+61°303. Astronomy and Astrophysics, 232, 377–380.
Park, S. K. and Miller, K. W. (1988). Random number generators: good ones are hard to find. Communications of the Association for Computing Machinery, 31 (10), 1192–1201.
Piña, R. K. and Puetter, R. C. (1993). Bayesian image reconstruction: the Pixon and optimal image modeling. Publications of the Astronomical Society of the Pacific, 105, 630–637.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992). Numerical Recipes (second edition). Cambridge: Cambridge University Press.
Priestley, M. B. (1981). Spectral Analysis and Time Series. London: Academic Press.
Puetter, R. C. (1995). Pixon-based multiresolution image reconstruction and the quantification of picture information content. International Journal of Image Systems & Technology, 6, 314–331.
Ray, P. S., Foster, R. S., Waltman, E. B. et al. (1997). Long term monitoring of LS I+61°303 at 2.25 and 8.3 GHz. Astrophysical Journal, 491, 381–387.
Roberts, G. O. (1996). Markov chain concepts related to sampling algorithms. In Markov Chain Monte Carlo in Practice, W. R. Gilks, S. Richardson, and D. J. Spiegelhalter (eds.). London: Chapman and Hall, pp. 45–57.
Roberts, G. O., Gelman, A. and Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of Applied Probability, 7, 110–120.
Scargle, J. D. (1982). Studies in astronomical time series analysis II. Statistical aspects of spectral analysis of unevenly sampled data. Astrophysical Journal, 263, 835–853.
Scargle, J. D. (1989). Studies in astronomical time series analysis III. Autocorrelation and cross-correlation functions of unevenly sampled data. Astrophysical Journal, 343, 874–887.
Schuster, A. (1905). The periodogram and its optical analogy. Proceedings of the Royal Society of London, 77, 136–140.
Sellke, T., Bayarri, M. J. and Berger, J. O. (2001). Calibration of P-values for testing precise null hypotheses. The American Statistician, 55, 62–71.
Seward, F. D., Harnden, F. R., and Helfand, D. J. (1984). Discovery of a 50 millisecond pulsar in the Large Magellanic Cloud. Astrophysical Journal Letters, 287, L19–22.
Shannon, C. E. (1948). Bell Systems Tech. J., 27, 379, 623; these papers were reprinted in C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, Urbana: University of Illinois Press, (1949).
Shore, J. and Johnson, R. (1980). Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory, 26, 26–37.
Sivia, D. S. (1996). Data Analysis: A Bayesian Tutorial. Oxford: Clarendon Press.
Skilling, J. (1988). The axioms of maximum entropy. In Maximum Entropy & Bayesian Methods in Science and Engineering, Vol. 1, G. J. Erickson and C. R. Smith (eds.). Dordrecht: Kluwer Academic Press, p. 173.
Skilling, J. (1989). Classical maximum entropy. In Maximum Entropy & Bayesian Methods, J. Skilling (ed.). Dordrecht: Kluwer Academic Press, pp. 45–52.
Skilling, J. (1998). Probabilistic data analysis: an introductory guide. Journal of Microscopy, 190, 297–302.
Skilling, J. and Gull, S. F. (1985). Algorithms and applications. In Maximum Entropy & Bayesian Methods in Inverse Problems, C. R. Smith and W. T. Grandy, Jr. (eds.), pp. 83–132.
Stigler, S. M. (1986). Laplace’s 1774 memoir on inverse probability. Translation of Laplace’s 1774 Memoir on “Probability of Causes.” Statistical Science, 1, 359–378.
Taylor, A. R. and Gregory, P. C. (1982). Periodic radio emission from LS I+61°303. Astrophysical Journal, 255, 210–216.
Tierney, L. and Kadane, J. B. (1986). Accurate approximations for posterior moments and densities. J. American Statistical Association, 81, 82–86.
Tinney, C. G., Butler, R. P., Marcy, G. W., Jones, H. R. A., Penny, A. J., McCarthy, C., Carter, B. D., and Bond, J. (2003). Four new planets orbiting metal-enriched stars. Astrophysical Journal, 587, 423–428.
Toussaint, D. (1989). Introduction to algorithms for Monte Carlo simulations and their application to QCD. Computer Physics Communications, 56, 69–92.
Tribus, M. (1969). Rational Descriptions, Decisions and Designs. Oxford: Pergamon Press.
Vattulainen, I., Ala-Nissila, T., and Kankaala, K. (1994). Physical tests for random numbers in simulations. Physical Review Letters, 73, 2513–2516.
Wolfram, S. (1999). The Mathematica Book (fourth edition). Cambridge: Cambridge University Press.
Wolfram, S. (2002). A New Kind of Science. Champaign, IL: Wolfram Media, Inc.
Index
Terms followed by [ ] are Mathematica commands.
χ² cumulative distribution, 165 χ² distribution, 141 moment generating function, 142 properties, 141 χ² statistic, 163, 164, 279
absorption lines, 331 acceptance probability, 314 acceptance rate, 319, 331 ACF, 134, 318 adequate set of operations, 26 aliasing, 392, 405 alternative hypothesis, 170 amplitudes, most probable, 249 AND, 28 anti-correlation, see negative covariance apodizing function, 366 APT MCMC automated MCMC, 331, 341 AR, 431 ARMA, 431 asymptotic covariance matrix, 305 asymptotic normal approximation, 288 autocorrelation, 318 autocorrelation function, 134, 274, 318 autocorrelation spectrometer, 427 automated MCMC, 330 autoregressive, see AR autoregressive-moving average, see ARMA basis function, 245, 247 Bayes factor, 46, 53, 326 Bayes’ theorem, 42, 45, 61, 72, 76, 77, 84, 125, 184, 206, 213, 219, 231, 245, 321, 377, 382 model of inductive inference, 7 usual form, 5
Bayesian, 87 advantages, 16 frequentist comparison, 87 Bayesian analysis detailed example, 50 Bayesian inference basics, 41 how-to, 41 Poisson distribution, 376 Bayesian mean estimate, 212 unknown noise, 218 Bayesian noise estimate, 224 Bayesian probability theory, 2 Bayesian revolution, 26 Bayesian versus frequentist, 87 Be star, 363 bed of nails, 400, 404, 407 Behrens–Fisher problem, 228, 441 Bernoulli’s law of large numbers, 75 illustration, 76 beta distribution, 117 bimodal, 123 binomial distribution, 74 Bayesian, 72 binomial expansion, 74 BinomialDistribution[ ], 75 Blackman–Tukey procedure, 427 Boolean algebra, 22 Boolean identities, 24 associativity, 24 commutativity, 24 distributivity, 24 duality, 24 idempotence, 24 Bretthorst, G. L., 228 burn-in period, 314, 316, 319, 323, 329, 331
Cauchy, 124 causal consequence, 25 CDF[ ], 149, 151 CDF[BinomialDistribution[n, p], x], 108 CDF[NormalDistribution[μ, σ], x], 113, 115 Central Limit Theorem, 105, 113, 119, 219 Bayesian demonstration, 120 exceptions, 124 central tendency, 103 ChiSquarePValue[ ], 166 coefficient of kurtosis, 102 skewness, 102 column density, 50 column matrix, 258 common sense, 25, 29 completeness, 394 compound LCG, 131 compound propositions, 22, 42 condition number, 391 confidence coefficient, 153 confidence interval, 78, 153, 215 for μ, variance known, 152 variance unknown, 156 for σ², 159 for difference of means, 158 ratio of two variances, 159 conjugate distribution, 118 conjugate prior distributions, 118 constraint minimal, 194 continuous distributions, 113 continuous hypothesis space, 6 continuous random variable, 99 ContourPlot[ ], 19 control, 158, 434 control system, 331 convention, 114 convolution, 392, 398 importance in science, 401 radio astronomy example, 402 using an FFT, 417 convolution integral, 90, 121, 398 convolution theorem, 399 discrete, 407, 417 core sample, 200 correlation, 318, 392 using an FFT, 417 correlation coefficient, 268, 274 correlation spectrum estimation, 426 correlation theorem, 400 courtroom analogy, 171 covariance matrix, 280, 291
data errors, 253 inverse data, 253 parameters, 264 Cox, R. T., 4, 26 credible region, 44, 78, 215, 260, 378 cross-entropy, 190 cumulative density function, 149 cumulative distribution function, 99, 100, 108 gamma, 116 Gaussian, 114 normal, 114 curvature matrix, 299, 300 data covariance matrix, 253, 454 inverse, 253 deconvolution, 418 deductive inference, 1, 24 degree of confidence, 153 degree of freedom, 141 desiderata, 4, 26, 29, 30 consistency, 30 Jaynes, 30 propriety, 30 structural, 30 rationality, 30 design matrix, 390 detailed balance, 320 DFT, 392, 407 approximation, 411 approximation of inverse, 413 discontinuity treatment, 412 graphical development, 407 interesting results, 411 inverse, 410 mathematical development, 409 diagonal matrix, 259 difference in means and/or variances hypotheses, 434 difference in two samples, 434 differentiable, 126 Dirac delta function, 404 Discrete Fourier Transform, see DFT discrete random variables, 99 disjunctive normal form, 28 dispersion, 101, 105 dispersion measure, 103 distribution function cumulative, see cumulative distribution function distributions, 98 χ², 141 beta, 117 binomial, 72, 74, 107 continuous, 113 descriptive properties of, 100
discrete probability, 107 Erlang, 117 exchangeable, 83 F, 150 F statistic, 282 gamma, 116 Gaussian, 113 geometric, 112 hypergeometric, 83 multinomial, 79, 80, 174 negative binomial, 112 negative exponential, 118 normal, 113 Poisson, 85, 109, 376 Student’s t, 147, 222 uniform, 116 Doppler shift, 331 downhill simplex method, 296 eigenvalues, 259 Eigenvalues[ ], 260, 263 eigenvectors, 259, 390 encoding, 72 end effects, 421 energy spectral density, 425 entropy generalization incorporating prior, 190 epoch folding, 362, 363 equally natural parameterization, 220 erf[z], see error function Erlang distribution, 117 error function, 115, 198, 436 errors in both coordinates, 92, 307 ESD, see energy spectral density exchangeable distributions, 83 exclusive hypothesis, 42 exercises spectral line problem, 69 expectation value, 100, 144 experimental design non-uniform sampling, 371 exponential notation, 395 extended sum rule, 5, 35 extrasolar planets, 331 F distribution, 150 mode, 150 properties, 150 variance, 150 F-test, 150 Fast Fourier Transform, 392, 415 accurate amplitudes, 422 how it works, 415 zero padding, 422 FFT, see Fast Fourier Transform
first moment, see expectation value Fisher information matrix, 291 Flatten[ ], 263 Fourier analysis, 392 Fourier series, 394 Fourier spectrum, 132 Fourier transform, 396 Fourier[ ], 358, 411 FRatioPValue[ ], 182 frequency, 10, 11 frequentist, 2, 78, 96, 162 full period, 131 gambler’s coin problem, 75 gamma function, 116 Gamma[ν/2, χ²_crit/2], 261 GammaRegularized[ ], 147, 166, 280 Gaussian, 48 line shape, 50 noise, 55, 91 Gaussian distributions, 123 Gaussian moment generating function, 114 Gaussian posterior, 257 Geiger counter, 386 generalized sum rule, 36, 42 genetic algorithm, 296, 297 geometric distribution, 112 GL method, 360 global likelihood, 44, 46, 275, 326 global minimization, 296 Gregory–Loredo method, see GL method half-life, 386 Hessian matrix, 299 historical perspective recent, 25 HIV, 11 Hubble’s constant, 16, 66 hypergeometric distribution, 81 HypergeometricDistribution[ ], 82 hypothesis, 4, 21 exclusive, 42 of interest, 5 hypothesis space, 5, 52 continuous, 6 discrete, 5, 6 hypothesis testing, 162 χ² statistic, 163 difference of two means, 167 one- and two-sided, 170 sameness, 172 ignorance priors, 63 IID, 119 ill-conditioned, 391
image model, 208 image reconstruction, 203 pixon, 208 implication, 26 implication operation, 25 impulse function, 404 incomplete gamma function, 261, 437, 446 incomplete information, 1, 4, 29, 206 incomplete set, 131 independent errors, 90 independent random events, 55, 109 inductive inference, 25 inference Bayesian, 5 deductive, 1, 24 inductive, 25 plausible, 25 statistical, 3 inner product, 393 inverse DFT, 410 Inverse[ ], 256 InverseFourier[ ], 411 iterative linearization, 296, 298 Jaynes, E. T., 2, 26 Jeffreys prior, 54, 220 modified, 386 joint posterior, 219 joint prior distribution, 45 joint probability, 7, 19 Keplerian orbit, 331 Kolmogorov–Smirnov test, 173 Kullback entropy, 190 kurtosis, 101 lag, 318 Lagrange multipliers, 191 Laplacian approximations, 291 Bayes factor, 291 marginal parameter posteriors, 293 LCG, 131 least-squares model fitting, 244 least-squares problem, 389 Lebesgue measure, 191 leptokurtic distribution, 102 Levenberg–Marquardt method, 296, 298, 300 lighthouse problem, 125 likelihood characteristic width of, 47 global, 326 likelihood function, 5, 89 likelihood ratio, 49, 53
line strength posterior, 60 linear congruential generators, 131 linear least-squares, 243 linear models, 243 linearly independent, 252 logic function, 27 logical disjunction, 42 logical product, 26 logical proposition, 21 logical sum, 26 exclusive form, 22 logical versus causal connections, 82 logically equivalent, 22 Lomb–Scargle periodogram, 367 long-run frequency, 75 long-run relative frequency, 2 Lorentzian, 124 MA, 431 MAP maximum a posteriori, 343 marginal distribution, 263 marginal PDF, 263 marginal posterior, 45 marginal probability, 19 marginalization, 12, 16, 45 marginalization integral, 249 Markov chain, 314 Markov chain Monte Carlo, 312 Martian rover, 200 matrix formulation, 251, 253 MaxEnt, 184 alternative derivation, 187 classic, 207 computing p_i values, 191 continuous probability distribution, 191 exponential distribution, 195 Gaussian distribution, 197 generalization, 190 global maximum, 192 image reconstruction, 203 kangaroo justification, 203 multivariate Gaussian, 450 noisy constraints, 206 uniform distribution, 194 maximally non-committal, 185 maximum entropy principle, see MaxEnt maximum likelihood, 56 MCMC, 312 acceptance rate, 319, 330 annealing parameter, 325, 327 aperiodic, 319
automated, 330 Bayes factor, 326 burn-in period, 331 control system, 330, 331 convergence speed, 318 detailed balance, 320 model comparison, 326 parallel simulations, 322 parallel tempering, 321 partition function, 326 robust summary statistic, 342 sample correlations, 318 simulated tempering, 321 stationary distribution, 319 temperature parameter, 321 when to stop, 319 mean, 44, 100, 343 Poisson, 110 mean deviation, 105 mean square deviation, 214 Mean[BinomialDistribution[n, p]], 109 MeanCI[ ], 182 MeanCI[data, KnownVariance], 156 MeanDifferenceTest[ ], 170, 182 median, 103, 343 baseline subtraction, 104 running subtraction, 104 metric, 253 Metropolis algorithm, 297, 314 Metropolis ratio, 314 Metropolis–Hastings why it works, 319 Metropolis–Hastings algorithm, 313 mind projection fallacy, 98 mode, 44 model, 21 deterministic, 90 probabilistic, 91 model comparison, 45, 275, 326 other methods, 281 model fitting, 257 model function, 245 model parameter errors, 264 model selection, 15, 335 model testing frequentist, 279 models high dimensional, 288 moderate dimensional, 288 modified Jeffreys prior, 53, 386 moment about the mean, 106 about the origin, 106, 269 first central, 101
fourth central, 101 second central, 101 third central, 101 moment generating function, 105 χ², 141 binomial distribution, 99, 108 central, 106 Monte Carlo, 127, 296 Monte Carlo integration, 313 most probable model vector, 250 moving average, see MA MPM marginal posterior mode, 343 multinomial distribution, 79, 174, 187 multinomial expansion, 80 Multinormal[ ], 433 MultinormalDistribution[ ], 316 multiplicity, 74, 80, 188 multivariate Gaussian, 202, 244 NAND, 29 negation, 26 negative binomial distribution, 112 negative covariance, 268 negative exponential distribution, 118, 119 negatively skewed distribution, 102 NIntegrate[ ], 71 noise Bayesian estimate, 224 noise scale parameter, 334 non-uniform sampling, 370 nonlinear models examples, 243, 287 NonlinearRegress[ ], 302 NOR, 29 normal distribution, see distributions, normal, 103 normal equations, 252 normal Gaussian, 198 NormalDistribution[ ], 132 NOT, 28 notation, 6 NSolve[ ], 263 nuisance parameters, 16, 45, 264 null hypothesis, 162, 164 Nyquist frequency, 370, 405, 413, 429, 430, 432 Nyquist sampling theorem, 392, 404 astronomy example, 406 objectivity, 11 Occam factor, 49, 239, 276 Occam’s razor, 16, 45, 60 odds, see odds ratio
odds ratio, 9, 46, 52, 277 Jeffreys prior, 58 prior, 46, 53 sensitivity, 59 uniform prior, 58 versus prior boundary, 61 ON/OFF measurements, 380 operations implication, 26 logical disjunction, 42 logical product, 26 logical sum, 26 negation, 26 optional stopping problem, 177 Bayesian resolution, 179 OR, 28 orthogonal basis functions, 270 orthogonal functions, 392 orthonormal, 390 orthonormal functions, 276, 392 orthonormal set, 392 P-value, 147, 165, 166 one-sided, 147, 169 PAD, see positive, additive distribution parallel tempering, 312, 321, 323, 325, 326, 330 parameter location, 63 scale, 63, 219 parameter covariance matrix, 267, 273, 276, 283 parameter estimation, 12, 15, 43, 59, 244 Parseval’s theorem, 424, 428 discrete form, 428 partition function, 326 PDF, see probability distribution function Pearson χ² goodness-of-fit test, 173, 175 comparison of two binned data sets, 177 period folding, 362 periodogram, 352 reduction of variance, 431 variance, 429 periodogram power spectrum estimation, 425 Pixon method, 208 planet, 331 platykurtic distribution, 102 plausibility, 3 scale of, 26 plausible inference, 4 Plot3D[ ], 19 point spread function, 402 Poisson cumulative distribution, 111 mean, 110
time-varying rate, 386 variance, 111 Poisson distribution, 85, 109, 376 Bayesian, 85 examples, 111 infer rate, 377 limiting form of a binomial, 109 ON/OFF measurements, 380 signal and known background, 379 Poisson ON/OFF details, 445 poll, 153 polling, 77 population, 97 positive definite, 259 positive, additive distribution, 203 positively skewed distribution, 102 posterior bubble, 44 density, 44 mean, 44 mode, 44 posterior PDF, 44 posterior probability, 5 power spectral density, 425 discrete, 428 one-sided, 429 variance, 429 power spectrum, 426 power-spectrum estimation, 424 power spectral density function, 426 Principal Axis Theorem, 259 Principle of Indifference, 38 prior, 232 choice of, 53 exponential, 195 ignorance, 63 Jeffreys, 54, 57, 220 location parameter, 63 modified Jeffreys, 53 odds ratio, 46 scale parameter, 63 uniform, 53, 57, 377 prior information, 65, 72 prior probability, 5 probability, 10 definition of, 3 distribution, 6 of data value, 56 per logarithmic interval, 54 posterior, 59 relative, 52 relative frequency definition, 98
rules for manipulating, 4 weighted average, 236 probability and frequency, 10 probability density, 43 probability density function, 6, 99 Gaussian, 113 probability distribution, 72 probability distribution function, 6, 43 definition, 99 probability mass function, 99 probability theory, 21 as logic, 2 requirements of, 2 role of, 1 product rule, 4, 35, 42 development of, 30 qualitative properties, 36 uniqueness, 37 profile function, 293 projected distribution, 264 projected probability density function, 20, 264 proposal distribution, 314 proposition, 21 examples of, 4 logical, 21 PSD, see power spectral density correlation estimator, 427 lagged-product estimator, 427 pseudo-random, 127 pseudo-random number generation, 131 PseudoInverse[ ], 391 quantile value, 146, 148 Student’s t, 157 radial velocity, 333 radioactivity, 386 radix, 416 RAN3, 136 random number generation, 127 random variable, 2, 96, 113 continuous, 99 generation, 129 random walk, 132 Random[ ], 132 randomness tests for, 132 ratio of standard deviations, 442 rectangular matrix, 390 regression analysis, 244, 256 relative frequency definition, 98 relative line shape measures, 101 ripple, 407 RMS deviation, 225
sample mean distribution, 124 sample variance, 143, 164 sampling non-uniform, 370 sampling distribution, 72, 98 sampling probability, 43 sampling theory, 97 scale invariance, 54 scale parameter, 219 Schuster periodogram, 425 scientific inference, 1 scientific method model of, 1 SeedRandom[99], 132 serial correlations, 131, 191, 205, 207 Shannon, C., 185, 186 Shannon–Jaynes entropy, 190 Shannon’s theorem, 186 shuffling generator, see compound LCG, 131 signal averaging, 65, 125 signal variability, 227 signal-to-noise ratio, 354 significance, 147, 165 simulated annealing, 296 simulated tempering, 321 singular basis vectors, 252 singular value decomposition, 252, 389 singular values, 390 SingularValues[ ], 390 skewness, 101 Sort[ ], 70 spatial frequency, 406 spectral analysis, 352 spectral line problem, 50, 322, 329 spectral density function, 426 standard deviation, 55, 101 ratio, 237 standard error of the sample mean, 145, 149 standard random variable, 102, 115, 119, 127, 154 state of knowledge, 72 stationary distribution, 319 statistic, 140 statistical inference, 3 statistics conventional, 2 frequentist, 2, 96 stellar orbit, 331 Stirling’s approximation, 189 stochastic spectrum estimation, 431 straight line fitting, 92, 307 Student’s t distribution, 147, 222 StudentTPValue[ ], 169, 170, 182
sum rule, 4, 35, 42 development of, 34 generalized, 42 qualitative properties, 36 uniqueness, 37 SVD, see singular value decomposition syllogisms strong, 24, 28, 36 weak, 25, 28, 36 symmetric matrix, 251 systematic errors, 16, 65, 66, 385 systematic uncertainty examples of, 65 parameterization of, 65, 386 tapered window function, 407 target distribution, 314 Taylor series expansion, 257, 298 tempering, 312, 321, 323, 326, 328 testable information, 184 time series, 356 time-to-failure, 119 transformation of random variable, 125 transition kernel, 314 probability, 314 Transpose[ ], 256 trial, 73, 158, 434 truncated Gaussian, 201 truth tables, 22, 27, 132 truth value, 21, 27 two-sample problem, 228, 440
two-sided test, 166 two-valued logic, 21 uncertainty random, 65 systematic, 65 undetermined multiplier, 192 uniform distribution, 116 uniform prior, 53, 279 unimodal, 291 variable metric approach, 301 variance, 101, 124, 141, 194 Bayesian estimate, 224 variance curve, 282 Variance[BinomialDistribution[n, p]], 109 VarianceCI[ ], 182 VarianceRatioCI[ ], 160 VarianceRatioTest[ ], 182 vector notation, 247 waveform sampling, 403 weighted average, 100 weighting function, 366 Wiener filter, 420 well-posed problem, 7 white noise, 430 Yule, G., 431 Yule–Walker equations, 431 zero pad, 357, 392, 421 zero padding, 421