Φ(z_{1−α} + n^{−1/2} m₁ z²_{1−α}) + Φ(z_α + n^{−1/2} m₁ z²_α), which, after expanding in Taylor series and dropping n^{−1} terms, and then noting that z²_α = z²_{1−α} and φ(z_α) = φ(z_{1−α}), ...
These results are suggestive of the behaviour that we observe in specific examples, that bootstrap methods in general are superior to normal approximation, but that only the adjusted percentile and studentized bootstrap methods correctly adjust for the effects of bias, nonconstant variance, and skewness. It would take an analysis including n^{−1} terms to distinguish between the preferred methods, and to see the effect of transformation prior to use of the studentized bootstrap method.
5.4.2 The ABC method
It is fairly clear that, to the order n^{−1/2} considered above, there are many equivalent confidence limit methods. One of these, the ABC method, is of particular interest. The method rests on the approximation (5.35), which by using (5.40) and (5.41) can be re-expressed as

θ̂_α = t + v^{1/2}{z_α + a + c − v^{−1/2}b + (2a + c)z²_α};   (5.42)
here v has been approximated by v_L in the definition of m₁, and we have used z_{1−α} = −z_α. The constants a, b and c in (5.42) are defined by (5.39), in which the expectations will be estimated. Special forms of the ABC method correspond to special-case estimates of these expectations. In all cases we take v to be v_L.

Parametric case
If the estimate t is a smooth function of sample moments, as is the case for an exponential family, then the constants in (5.39) are easy to estimate. With a temporary change of notation, suppose that t = t(s), where s = n^{−1} Σ_j s(y_j) has p components, and define μ = E(S), so that θ = t(μ). Then

l_t(Y_j) = ṫ(μ)^T{s(Y_j) − μ},   q_t(Y_j, Y_k) = {s(Y_j) − μ}^T ẗ(μ){s(Y_k) − μ},   (5.43)

where ṫ(s) = ∂t(s)/∂s and ẗ(s) = ∂²t(s)/∂s∂s^T.
Estimates for a, b and c can therefore be calculated using estimates for the first three moments of s(Y). For the particular case where the distribution of S has the exponential family PDF

f_S(s; η) = exp{η^T s − ξ(η)},

the calculations can be simplified. First, define Σ(η) = var(S; η) = ξ̈(η), where ξ̈(η) = ∂²ξ(η)/∂η∂η^T. Then

v_L = ṫ(s)^T Σ̂ ṫ(s),

with Σ̂ = Σ(η̂) the variance matrix evaluated at the fitted parameter value.
Substitution from (5.43) in (5.39), and estimation of the expectations, gives estimated constants which can be expressed simply as

a = (1/(6 v_L^{3/2})) [d³ξ(η̂ + εṫ(s))/dε³]_{ε=0},   b = ½ tr{ẗ(s)Σ̂},

where tr(A) is the trace of the square matrix A, and
c = (1/(2 v_L^{1/2})) [d²t(s + εk)/dε²]_{ε=0},   (5.44)

where k = Σ̂ ṫ(s)/v_L^{1/2}. The confidence limit (5.42) can also be approximated by an evaluation of the statistic t, analogous to the BC_a confidence limit (5.20). This follows by equating (5.42) with the right-hand side of the approximation t(s + v^{1/2}ε) ≐ t(s) + v^{1/2} ε^T ṫ(s), with appropriate choice of ε. The result is
θ̂_α = t{s + ẑ_α k/(1 − aẑ_α)²},   (5.45)

where ẑ_α = w + z_α = a + c − bv_L^{−1/2} + z_α. In this form the ABC confidence limit is an explicit approximation to the BC_a confidence limit. If the several derivatives in (5.44) are calculated by numerical differencing, then only 4p + 4 evaluations of t are necessary, plus one for every confidence limit calculated in the final step (5.45). Algorithms also exist for exact numerical calculation of derivatives.
Nonparametric case: single sample
If the estimate t is again a smooth function of sample moments, t = t(s), then (5.43) still applies, and substitution of empirical moments leads to

a = Σ_j l_j³ / {6 (Σ_j l_j²)^{3/2}},   b = (1/(2n²)) Σ_j {s(y_j) − s}^T ẗ(s){s(y_j) − s},   v_L^{1/2} = (Σ_j l_j²)^{1/2}/n,   (5.46)

where l_j = ṫ(s)^T{s(y_j) − s}.
An alternative, more general formulation is possible in which s is replaced by the multinomial proportions n^{−1}(f₁,...,fₙ) attaching to the data values. Correspondingly μ is replaced by the probability vector p, and with distributions F restricted to the data values, we re-express t(F) as t(p); cf. Section 4.4. Now F̂ is equivalent to p̂ = (1/n,...,1/n) and t = t(p̂). In this notation the empirical influence values and second derivatives are defined by

l_j = [d t{(1 − ε)p̂ + ε1_j}/dε]_{ε=0}   (5.47)

and

q_jj = [d² t{(1 − ε)p̂ + ε1_j}/dε²]_{ε=0},   (5.48)
where 1_j is the vector with 1 in the jth position and 0 elsewhere. Let us set ṫ_j(p) = ∂t(p)/∂p_j and ẗ_jk(p) = ∂²t(p)/∂p_j∂p_k; see Section 2.7.2 and Problem 2.16. Then alternative forms for the vector l and the full matrix q are

l = (I − n^{−1}J)ṫ(p̂),   q = (I − n^{−1}J)ẗ(p̂)(I − n^{−1}J),
where J = 11^T and 1 is a vector of ones. For each derivative the first form is convenient for approximation by numerical differencing, while the second form is often easier for theoretical calculation. Estimates for a and b can be calculated directly as empirical versions of their definitions in (5.39), while for c it is simplest to use the analogue of the representation in (5.44). The resulting estimates are

a = Σ_j l_j³ / {6 (Σ_j l_j²)^{3/2}},   b = (1/(2n²)) Σ_j q_jj,
c = (1/(2 v_L^{1/2})) [d² t(p̂ + εk)/dε²]_{ε=0} = l^T(I − n^{−1}J) ẗ(p̂) (I − n^{−1}J) l / (2n⁴ v_L^{3/2}),   (5.49)
where k = n^{−2}v_L^{−1/2} l, and ṫ and ẗ are evaluated at p̂. The approximation (5.45) can also be used here, but now in the form

θ̂_α = t{p̂ + ẑ_α k/(1 − aẑ_α)²}.   (5.50)
If the several derivatives are calculated by numerical differencing, then the number of evaluations of t(p) needed is only 2n + 2, plus one for each confidence limit and the original value t. Note that the probability vector argument in (5.50) is not constrained to be proper, or even positive, so that it is possible for ABC confidence limits to be undefined.

Example 5.9 (Air-conditioning data, continued) The adjusted percentile method was applied to the air-conditioning data in Example 5.6 under the gamma model and in Example 5.8 under the nonparametric model. Here we examine how well the ABC method approximates the adjusted percentile confidence limits. For the mean parameter, calculations are simple under all models. For example, in the gamma case the exponential family is two-dimensional with s = (ȳ, log y‾)^T,

η₁ = −nκ/μ,
η₂ = nκ,   ξ(η) = −η₂ log(−η₁/n) + n log Γ(η₂/n),
and t(s) = s₁. The last implies that ṫ = (1, 0)^T and ẗ = 0. It then follows straightforwardly that the constant a is given by (1/3)(nκ̂)^{−1/2} as in Example 5.6, that b = c = 0, and that k = v_L^{1/2}(1, 0)^T. Similar calculations apply for the nonparametric model, except that a is given by the corresponding value in Example 5.8. So under both models

θ̂_{1−α} = 108.083 + v_L^{1/2}(a + z_{1−α}) / {1 − a(a + z_{1−α})}².

Numerical comparisons between the adjusted percentile confidence limits and
Table 5.5 Adjusted percentile (BCa) and ABC confidence intervals for mean failure time μ for the air-conditioning data. R = 999 simulated samples for BCa methods.

                               Nominal confidence 1 − 2α
                               0.99          0.95          0.90
Gamma model           BCa      51.5, 241.6   63.0, 226.0   67.2, 208.0
                      ABC      52.5, 316.6   61.4, 240.5   66.9, 210.5
Nonparametric model   BCa      44.6, 268.8   55.3, 243.5   61.5, 202.1
                      ABC      46.6, 287.0   57.2, 226.7   63.6, 201.5
ABC limits are shown in Table 5.5. The ABC method appears to give reasonable approximations, except for the 99% interval under the gamma model. ■
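The whole nonparametric recipe (5.47)-(5.50) can be checked by numerical differencing. The sketch below is ours (function and variable names are illustrative) and, applied to the air-conditioning data, should reproduce the nonparametric ABC limits of Table 5.5 to within differencing error.

```python
import numpy as np
from scipy.stats import norm

def abc_nonparametric(t_of_p, n, alpha, eps=1e-4):
    """ABC limit (5.50) for a statistic expressed as t(p), p a probability vector."""
    phat = np.full(n, 1.0 / n)
    t0 = t_of_p(phat)
    l = np.empty(n)                      # empirical influence values, (5.47)
    q = np.empty(n)                      # second derivatives q_jj, (5.48)
    for j in range(n):
        unit = np.zeros(n); unit[j] = 1.0
        tp = t_of_p((1 - eps) * phat + eps * unit)
        tm = t_of_p((1 + eps) * phat - eps * unit)
        l[j] = (tp - tm) / (2 * eps)
        q[j] = (tp - 2 * t0 + tm) / eps ** 2
    vL = np.sum(l ** 2) / n ** 2
    a = np.sum(l ** 3) / (6 * np.sum(l ** 2) ** 1.5)
    b = np.sum(q) / (2 * n ** 2)
    k = l / (n ** 2 * np.sqrt(vL))
    # c via the derivative representation in (5.49), differencing along k
    c = (t_of_p(phat + eps * k) - 2 * t0 + t_of_p(phat - eps * k)) / (2 * np.sqrt(vL) * eps ** 2)
    z = a + c - b / np.sqrt(vL) + norm.ppf(alpha)
    return t_of_p(phat + k * z / (1 - a * z) ** 2)

y = np.array([3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487], dtype=float)
t_mean = lambda p: np.sum(p * y) / np.sum(p)     # mean, valid for improper p
print(abc_nonparametric(t_mean, len(y), 0.025),  # ~57.2, cf. Table 5.5
      abc_nonparametric(t_mean, len(y), 0.975))  # ~226.7
```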
Nonparametric case: several samples
The estimated constants (5.49) for the single-sample case can be applied to several samples by using a single artificial probability vector π of length n = Σ n_i, as follows. The estimator will originally be defined by a function t(p₁,...,p_k), where p_i = (p_{i1},...,p_{in_i}) is the vector of probabilities on the ith sample values y_{i1},...,y_{in_i}. The artificial representation of the estimator in terms of the single probability vector

π = (π₁₁,...,π_{1n₁}, π₂₁,...,π_{2n₂}, ...)

of length n is u(π) = t(p₁,...,p_k), where p_i has elements

p_{ij} = π_{ij} / Σ_{j=1}^{n_i} π_{ij}.   (5.51)
The set of EDFs is equivalent to π̂ = (1/n,...,1/n) and the observed value of the estimate is t = u(π̂). This artificial representation leads to expressions such as (5.29), in which the definition of l_{ij} is obtained by applying (5.47) to u(p). (Note that the real influence values l_{ij} and second derivatives q_{ij,ij} derived from t(p₁,...,p_k) should not be used.) That this method produces correct results is quite easy to verify using the several-sample extension of the quadratic approximation (5.38); see Section 3.2.1 and Problem 3.7.

Example 5.10 (Air-conditioning data failure ratio) The data of Example 1.1 form one of several samples corresponding to different aircraft. The previous sample (n₁ = 12) and a second sample (n₂ = 24) are given in Table 5.6. Suppose that we want to estimate the ratio of failure rates for the two aircraft, and give confidence intervals for this ratio. To set notation, let the mean failure times be μ₁ and μ₂ for the first and second aircraft, with θ = μ₂/μ₁ the parameter of interest. The corresponding
Table 5.6 Failure times for air-conditioning equipment in two aircraft (Proschan, 1963).

First aircraft    3   5   7  18  43  85  91  98 100 130 230 487
Second aircraft   3   5   5  13  14  15  22  22  23  30  36  39
                 44  46  50  72  79  88  97 102 139 188 197 210
sample means are ȳ₁ = 108.083 and ȳ₂ = 64.125, so the estimate for θ is t = ȳ₂/ȳ₁ = 0.593. The empirical influence values are (Problem 3.5)

l_{1j} = −t(y_{1j} − ȳ₁)/ȳ₁,   l_{2j} = (y_{2j} − ȳ₂)/ȳ₁.

We use (5.29) to calculate v_L = 0.05614 and a = −0.0576. In R = 999 nonparametric simulations there are 473 values of t* below t, so by (5.22) w = −0.0954. With these values we can calculate the BCa confidence limit (5.21). For example, for α = 0.025 and 0.975 the values of α̃ are 0.0076 and 0.944 respectively, so that the limits of the 95% interval are t*_{(76)} = 0.227 and t*_{(944)} = 1.306; the first value is interpolated using (5.8). The studentized bootstrap method gives 95% confidence interval [0.131, 1.255] using the original scale. The distribution of t* values is highly skew here, and the logarithmic scale is strongly indicated by diagnostic plots. Figure 5.2 shows the normal Q-Q plot of the t* values, the variance-parameter plots for original and logarithmic scales, and the normal Q-Q plot of z* values after logarithmic transformation. Application of the studentized bootstrap method on the logarithmic scale leads to 95% confidence interval [0.183, 1.318] for θ, much closer to the BCa limits. For the ABC method, the original definition of the estimator is t = t(p₁, p₂) = Σ y_{2j}p_{2j} / Σ y_{1j}p_{1j}. The artificial definition in terms of a single probability vector π is
u(π) = (Σ_{j=1}^{n₂} y_{2j}π_{2j} / Σ_{j=1}^{n₂} π_{2j}) / (Σ_{j=1}^{n₁} y_{1j}π_{1j} / Σ_{j=1}^{n₁} π_{1j}).

Application of (5.47) shows that the artificial empirical influence values are
Figure 5.2 Diagnostic plots for air-conditioning data confidence intervals, based on R = 999 nonparametric simulations. Top left panel: normal Q-Q plot of t*, dotted line is N(t, v_L) approximation. Top right: variance-parameter plot, v*_L versus t*. Bottom left: variance-parameter plot after logarithmic transformation. Bottom right: normal Q-Q plot of z* after logarithmic transformation.
l̃_{1j} = −n t(y_{1j} − ȳ₁)/(n₁ȳ₁)   and   l̃_{2j} = n(y_{2j} − ȳ₂)/(n₂ȳ₁).
This leads to formulae in agreement with (5.29), which gives the values of a and v_L already calculated. It remains to calculate b and c. For b, application of (5.48) gives

q̃_{1j,1j} = 2t{n²(y_{1j} − ȳ₁)²/(n₁²ȳ₁²) + n n₂(y_{1j} − ȳ₁)/(n₁²ȳ₁)}
and

q̃_{2j,2j} = −2n n₁(y_{2j} − ȳ₂)/(n₂²ȳ₁),
so by (5.49) we have b = t n₁^{−2}ȳ₁^{−2} Σ_j (y_{1j} − ȳ₁)², whose value is b = 0.0720. (The bootstrap estimates of b and v are respectively 0.104 and 0.1125.) Finally, for c we apply the second form in (5.49) to u(π), that is c = ½n^{−4}v_L^{−3/2} l̃^T ü(π̂) l̃, and calculate c = 0.3032. The implied value of w is −0.0583, quite different from the bootstrap value −0.0954. The ABC formula (5.50) is now applied to u(π̂) with k = n^{−2}v_L^{−1/2} l̃. The resulting 95% confidence interval is [0.250, 1.283], which is fairly close to the BCa interval. It seems possible that the approximation theory does not work well here, which would explain the larger-than-usual differences between BCa, ABC and studentized bootstrap confidence limits; see Section 5.7. One practical point is that the theoretical calculation of derivatives is quite time-consuming, compared to application of numerical differencing in (5.47)-(5.49). ■
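The artificial representation is easy to verify numerically. Below is our sketch (not code from the text): it encodes u(π) as in (5.51) for this two-sample ratio, differences it as in (5.47), and compares with the influence values just derived; v_L and a then match the values above.

```python
import numpy as np

y1 = np.array([3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487], dtype=float)
y2 = np.array([3, 5, 5, 13, 14, 15, 22, 22, 23, 30, 36, 39,
               44, 46, 50, 72, 79, 88, 97, 102, 139, 188, 197, 210], dtype=float)
n1, n2 = len(y1), len(y2)
n = n1 + n2

def u(pi):
    """u(pi) = t(p1, p2) with p_i proportional to the block of pi, as in (5.51)."""
    p1, p2 = pi[:n1], pi[n1:]
    return (np.sum(y2 * p2) / np.sum(p2)) / (np.sum(y1 * p1) / np.sum(p1))

pihat = np.full(n, 1.0 / n)
eps = 1e-6
l = np.empty(n)
for j in range(n):
    unit = np.zeros(n); unit[j] = 1.0
    l[j] = (u((1 - eps) * pihat + eps * unit) - u(pihat)) / eps

t = u(pihat)
l_theory = np.concatenate([-n * t * (y1 - y1.mean()) / (n1 * y1.mean()),
                           n * (y2 - y2.mean()) / (n2 * y1.mean())])
print(np.allclose(l, l_theory, atol=1e-3))        # True
vL = np.sum(l ** 2) / n ** 2                      # ~0.05614
a = np.sum(l ** 3) / (6 * np.sum(l ** 2) ** 1.5)  # ~-0.0576
print(round(vL, 5), round(a, 4))
```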
5.5 Inversion of Significance Tests
There is a duality between significance tests for parameters and confidence sets for those parameters, in the sense that, for a prescribed level, a confidence region includes parameter values which are not rejected by an appropriate significance test. This can provide another option for calculating confidence limits. Suppose that θ is an unknown scalar parameter, and that the model includes no other unknown parameters. If R_α(θ₀) is a size α critical region for testing the null hypothesis H₀ : θ = θ₀, which means that

Pr{(Y₁,...,Yₙ) ∈ R_α(θ₀) | θ₀} = α,

then the set

C_{1−α}(Y₁,...,Yₙ) = {θ : (Y₁,...,Yₙ) ∉ R_α(θ)}

is a 1 − α confidence region for θ. The shape of the region will be determined by the form of the test, including the alternative hypothesis for which the test is designed. In particular, an interval would usually be obtained if the alternative is two-sided, H_A : θ ≠ θ₀; an upper limit if H_A : θ < θ₀; and a lower limit if H_A : θ > θ₀.
For definiteness, suppose that we want to calculate a lower 1 − α confidence limit, which we denote by θ̂_α. The associated test of H₀ : θ = θ₀ versus H_A : θ > θ₀ will be based on a test statistic t(θ₀) for which large values are evidence in favour of H_A: for example, t(θ₀) might be an estimate of θ minus θ₀. We will have an algorithm for approximating the P-value, which we can write as

p(θ₀) = Pr{T(θ₀) ≥ t(θ₀) | F₀},

where F₀ is the null hypothesis distribution with parameter value θ₀. The 1 − α confidence set is all values of θ such that p(θ) > α, so the lower confidence limit θ̂_α is the smallest solution of p(θ) = α. A simple way to solve this is to evaluate p(θ) over a grid of, say, 20 values, and to interpolate via a simple curve fit. The grid can sometimes be determined from the normal approximation confidence limits (5.4). For the curve fit, a simple general method is to fit a logistic function to p(θ) using either a simple polynomial in θ or a spline. Once the curve is fitted, solutions to p(θ) = α can be computed: usually there will be one solution, which is θ̂_α. For an upper 1 − α confidence limit θ̂_{1−α}, note that this is identical to a lower α confidence limit, so the same procedure as above with the same t(θ₀) can be used, except that we solve p(θ) = 1 − α. The combination of lower and upper 1 − α confidence limits defines an equi-tailed 1 − 2α confidence interval. The following example illustrates this procedure.

Example 5.11 (Hazard ratio) For the AML data in Example 3.9, also analysed in Example 4.4, assume that the ratio of hazard functions h₂(z)/h₁(z) for the two groups is a constant θ. As before, let r_{ij} be the number in group i who were at risk just prior to the jth failure time z_j, and let y_j be 0 or 1 according as the failure at z_j is in group 1 or 2. Then a suitable statistic for testing H₀ : θ = θ₀ is
t(θ₀) = Σ_j {y_j − θ₀ r_{2j}/(r_{1j} + θ₀ r_{2j})};

this is the score test statistic in the Cox proportional hazards model. Large values of t(θ₀) are evidence that θ > θ₀. There are several possible resampling schemes that could be used here, including those described in Section 3.5 but modified to fix the constant hazard ratio θ₀. Here we use the simpler conditional model of Example 4.4, which holds fixed the survival and censoring times. Then for any fixed θ₀ the simulated values y*₁,...,y*ₙ are generated by

y*_j ~ Bernoulli(π_j),   π_j = θ₀ r*_{2j}/(r*_{1j} + θ₀ r*_{2j}),
Figure 5.3 Bootstrap P-values p(θ₀) for testing constant hazard ratio θ₀, with R = 199 at each point. Solid curve is spline fit on logistic scale. Dotted lines interpolate solutions to p(θ₀) = 0.05, 0.95, which are endpoints of 90% confidence interval.
where the numbers at risk just prior to z_j are given by

r*_{1j} = max{0, r₁₁ − Σ_{k=1}^{j−1}(1 − y*_k) − c_{1j}},   r*_{2j} = max{0, r₂₁ − Σ_{k=1}^{j−1} y*_k − c_{2j}},
with c_{ij} the number of censoring times in group i before z_j. For the AML data we simulated R = 199 samples in this way, and calculated the corresponding values t*(θ₀) for a grid of 21 values of θ₀ in the range 0.5 ≤ θ₀ ≤ 10. For each θ₀ we computed the one-sided P-value

p(θ₀) = #{t*(θ₀) > t(θ₀)} / 200,
then on the logit scale we fitted a spline curve (in log θ), and interpolated the solutions to p(θ₀) = α, 1 − α to determine the endpoints of the (1 − 2α) confidence interval for θ. Figure 5.3 illustrates this procedure for α = 0.05, which gives the 90% confidence interval [1.07, 6.16]; the 95% interval is [0.86, 7.71] and the point estimate is 2.52. Thus there is mild evidence that θ > 1. A more efficient approach would be to use R = 99 for the initial grid to determine rough values of the confidence limits, near which further simulation with R = 999 would provide accurate interpolation of the confidence limits. Yet more efficient algorithms are possible. ■

In a more systematic development of the method, we must allow for a nuisance parameter λ, say, which also governs the data distribution but is not constrained by H₀. Then both R_α(θ) and C_{1−α}(Y₁,...,Yₙ) must depend upon λ to make the inversion method work exactly. Under the bootstrap approach λ is replaced by an estimate.
Suppose, for example, that we want a lower 1 − α confidence limit, which is obtained via the critical region for testing H₀ : θ = θ₀ versus the alternative hypothesis H_A : θ > θ₀. Define ψ = (θ, λ). If the test statistic is T(θ₀), then the size α critical region has the form

R_α(θ₀) = {(y₁,...,yₙ) : Pr{T(θ₀) ≥ t(θ₀) | ψ = (θ₀, λ)} ≤ α},

and the exact lower confidence limit is the value u_α = u_α(y, λ) such that

Pr{T(u_α) ≥ t(u_α) | ψ = (u_α, λ)} = α.

We replace λ by an estimate s, say, to obtain the lower 1 − α bootstrap confidence limit û_α = u_α(y, s). The solution is found by solving for u the equation

Pr*{T*(u) ≥ t(u) | ψ = (u, s)} = α,

where T*(u) follows the distribution under ψ = (u, s). This requires application of an interpolation method such as the one illustrated in the previous example. The simplest test statistic is the point estimate T of θ, and then T(θ₀) ≡ T. The method will tend to be more accurate if the test statistic is the studentized estimate. That is, if var(T) = σ²(θ, λ), then we take Z = (T − θ₀)/σ(θ₀, s); for further details see Problem 5.11. The same remark would apply to score statistics, such as that in the previous example, where studentization would involve the observed or expected Fisher information. Note that for the particular alternative hypothesis used to derive an upper limit, it would be standard practice to define the P-value as Pr{T(θ₀) ≤ t(θ₀) | F₀}, for example if T(θ₀) were an estimator for θ or its studentized form. Equivalently one can retain the general definition and solve p(θ₀) = 1 − α for an upper limit. In principle these methods can be applied to both parametric and semiparametric problems, but not to completely nonparametric problems.
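In outline, the procedure is: simulate the null distribution of the test statistic over a grid of θ₀, smooth the P-values on the logistic scale, and solve for the crossing points. The sketch below is ours, on a toy parametric problem (testing an exponential mean), with illustrative names throughout.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from scipy.optimize import brentq

rng = np.random.default_rng(1)

def p_value(theta0, y, R=199):
    """Bootstrap P-value for H0: theta = theta0 against theta > theta0,
    with statistic t(theta0) = ybar - theta0, simulating under the null."""
    t_obs = y.mean() - theta0
    t_star = np.array([rng.exponential(theta0, y.size).mean() - theta0
                       for _ in range(R)])
    return (1 + np.sum(t_star >= t_obs)) / (R + 1)

def confidence_limit(y, grid, level):
    """Solve p(theta) = level by a spline fit on the logit scale."""
    p = np.clip([p_value(th, y) for th in grid], 0.005, 0.995)
    fit = UnivariateSpline(grid, np.log(p / (1 - p)), k=3, s=len(grid))
    target = np.log(level / (1 - level))
    return brentq(lambda th: fit(th) - target, grid[0], grid[-1])

y = rng.exponential(100, 12)
grid = np.linspace(0.4 * y.mean(), 3.0 * y.mean(), 20)
print(confidence_limit(y, grid, 0.05),   # lower 95% limit: p(theta) = alpha
      confidence_limit(y, grid, 0.95))   # upper 95% limit: p(theta) = 1 - alpha
```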
5.6 Double Bootstrap Methods
Whether the basic or percentile bootstrap method is used to calculate confidence intervals, there is a possibly non-negligible difference between the nominal 1 − α coverage and the actual probability coverage of the interval in repeated sampling, even if R is very large. The difference represents a bias in the method, and as indicated in Section 3.9 the bootstrap can be used to estimate and correct for such a bias. That is, by bootstrapping a bootstrap confidence interval method it can be made more accurate. This is analogous to the bootstrap adjustment for bootstrap P-values described in Section 4.5. One straightforward application of this idea is to the normal-approximation confidence interval (5.4), which produces the studentized bootstrap interval;
see Problem 5.12. A more ambitious application is bootstrap adjustment of the basic bootstrap confidence limit, which we develop here. First we recall the full notation for the quantities involved in the basic bootstrap confidence interval method. The "ideal" upper 1 − α confidence limit is t(F̂) − a_α(F), where

Pr{T − θ ≤ a_α(F) | F} = Pr{t(F̂) − t(F) ≤ a_α(F) | F} = α.

What is calculated, ignoring simulation error, is the confidence limit t(F̂) − a_α(F̂). The bias in the method arises from the fact that a_α(F̂) ≠ a_α(F) in general, so that

Pr{t(F) ≤ t(F̂) − a_α(F̂) | F} ≠ 1 − α.   (5.52)
We could try to eliminate the bias by adding a correction to a_α(F̂), but a more successful approach is to adjust the subscript α. That is, we replace a_α(F̂) by a_{q(α)}(F̂) and estimate what the adjusted value q(α) should be. This is in the same spirit as the BCa method. Ideally we want q(α) to satisfy

Pr{t(F) ≤ t(F̂) − a_{q(α)}(F̂) | F} = 1 − α.   (5.53)
The solution q(α) will depend upon F, i.e. q(α) = q(α, F). Because F is unknown, we estimate q(α) by q̂(α) = q(α, F̂). This means that we obtain q̂(α) by solving the bootstrap version of (5.53), namely

Pr*{t(F̂) ≤ t(F̂*) − a_{q̂(α)}(F̂*) | F̂} = 1 − α.   (5.54)
This looks intimidating, but from the definition of a_α(F̂) we see that (5.54) can be rewritten as

Pr*{Pr**(T** ≤ 2T* − t | F̂*) ≥ q̂(α) | F̂} = 1 − α.   (5.55)
The same method of adjustment can be applied to any bootstrap confidence limit method, including the percentile method (Problem 5.13) and the studentized bootstrap method (Problem 5.14). To verify that the nested bootstrap reduces the order of coverage error made by the original bootstrap confidence limit, we can apply the general discussion of Section 3.9.1. In general we find that coverage 1 − α + O(n^{−a}) is corrected to 1 − α + O(n^{−a−1/2}) for one-sided confidence limits, whether a = ½ or 1. However, for equi-tailed confidence intervals coverage 1 − 2α + O(n^{−1}) is corrected to 1 − 2α + O(n^{−2}); see Problem 5.15. Before discussing how to solve equation (5.55) using simulated samples, we look at a simple illustrative example where the solution can be found theoretically.

Example 5.12 (Exponential mean) Consider the parametric problem of exponential data with unknown mean μ. The data estimate for μ is t = ȳ, F̂ is
the fitted exponential CDF with mean ȳ, and F̂* is the fitted exponential CDF with mean ȳ*, the mean of a parametric bootstrap sample y*₁,...,y*ₙ drawn from F̂. A result that we use repeatedly is that if X₁,...,Xₙ are independent exponential with mean μ, then 2nX̄/μ has the χ²_{2n} distribution. The basic bootstrap upper 1 − α confidence limit for μ is 2ȳ − ȳc_{2n,α}/(2n), where Pr(χ²_{2n} ≤ c_{2n,α}) = α. To evaluate the left-hand side of (5.55), for the inner probability we have

Pr**(Ȳ** ≤ 2ȳ* − ȳ | F̂*) = Pr{χ²_{2n} ≤ 2n(2 − ȳ/ȳ*)},

which exceeds q if and only if 2n(2 − ȳ/ȳ*) ≥ c_{2n,q}. Therefore the outer probability on the left-hand side of (5.55) is
I
= Pr { & > 2 _
^ / ( 2„ , } .
(5-56)
with q = q̂(α). Setting the probability on the right-hand side of (5.56) equal to 1 − α, we deduce that

2n / {2 − c_{2n,q̂(α)}/(2n)} = c_{2n,α}.
Using q̂(α) in place of α in the basic bootstrap confidence limit gives the adjusted upper 1 − α confidence limit 2nȳ/c_{2n,α}, which has exact coverage 1 − α. So in this case the double bootstrap adjustment is perfect. Figure 5.4 shows the actual coverages of nominal 1 − α bootstrap upper confidence limits when n = 10. There are quite large discrepancies for both basic and percentile methods, which are completely removed using the double bootstrap adjustment; see Problem 5.13. ■
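The identity can be confirmed numerically; the following check is ours, using scipy's χ² routines, and verifies both q̂(α) and the exact coverage of the adjusted limit.

```python
import numpy as np
from scipy.stats import chi2

n, alpha = 10, 0.05
c_alpha = chi2.ppf(alpha, df=2 * n)

# From (5.56): 2n / (2 - c_{2n,q}/(2n)) = c_{2n,alpha} determines q_hat(alpha)
c_q = 2 * n * (2 - 2 * n / c_alpha)
q_hat = chi2.cdf(c_q, df=2 * n)
print(q_hat)          # much smaller than alpha, reflecting Figure 5.4

# The adjusted limit 2*ybar - ybar*c_q/(2n) equals 2n*ybar/c_alpha; check coverage
rng = np.random.default_rng(0)
mu = 100.0
hits = [mu <= 2 * n * rng.exponential(mu, n).mean() / c_alpha
        for _ in range(100_000)]
print(np.mean(hits))  # close to 1 - alpha = 0.95
```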
In general, and especially for nonparametric problems, the calculations in (5.55) cannot be done exactly, and simulation or approximation methods must be used. A basic simulation algorithm is as follows. Suppose that we draw R samples from F̂, and denote the model fitted to the rth sample by F̂*_r (the EDF for one-sample nonparametric problems). Define

u_r = Pr(T** ≤ 2t*_r − t | F̂*_r).

This will be approximated by drawing M samples from F̂*_r, calculating the estimator values t**_{rm} for m = 1,...,M, and computing the estimate

u_{M,r} = M^{−1} Σ_{m=1}^{M} I{t**_{rm} ≤ 2t*_r − t},

where I{A} is the zero-one indicator function of the event A.
Figure 5.4 Actual coverages of percentile (dotted line) and basic bootstrap (dashed line) upper confidence limits for exponential mean when n = 10. Solid line is attained by nested bootstrap confidence limits.
Then the Monte Carlo version of (5.55) is

R^{−1} Σ_{r=1}^{R} I{u_{M,r} ≥ q̂(α)} = 1 − α,

which is to say that q̂(α) is the α quantile of the u_{M,r}. The simplest way to obtain q̂(α) is to order the values u_{M,r} into u_{M,(1)} ≤ ··· ≤ u_{M,(R)} and then set q̂(α) = u_{M,((R+1)α)}. What this amounts to is that the (R+1)αth ordered value is read off from a Q-Q plot of the u_{M,r} against quantiles of the U(0,1) distribution, and that ordered value is then used to give the required quantile of the t* − t. We illustrate this in the next example. The total number of samples involved in this calculation is RM. Since we always think of simulating as many as 1000 samples to approximate probabilities, here this would suggest as many as 10⁶ samples overall. The calculations of Section 4.5 would suggest something a bit smaller, say M = 249 to be safe, but this is still rather impractical. However, there are ways of greatly reducing the overall number of simulations, two of which are described in Chapter 9.
Example 5.13 (Kernel density estimate) Bootstrap confidence intervals for the value of a density raise some awkward issues, which we now discuss, before outlining the use of the nested bootstrap in this context. The standard kernel estimate of the PDF f(y) given a random sample y₁,...,yₙ is

f̂(y; h) = (nh)^{−1} Σ_{j=1}^{n} w{(y − y_j)/h},
where w(·) is a symmetric density with mean zero and unit variance, and h is the bandwidth. One source of difficulty is that if we consider the estimator to be t(F̂), as we usually do, then t(F) = h^{−1} ∫ w{h^{−1}(y − x)}f(x) dx is being estimated, not f(y). The mean and variance of f̂(y; h) are approximately

f(y) + ½h²f″(y),   (nh)^{−1}f(y) ∫ w²(u) du,   (5.57)
for small h and large n. In general one assumes that as n→∞ so h→0 in such a way that nh→∞, and this makes both bias and variance tend to zero as n increases. The density estimate then has the form t_n(F̂), such that t_n(F) → t(F) = f(y). Because the variance in (5.57) is approximately proportional to the mean, it makes sense to work with the square root of the estimate. That is, we take T = {f̂(y; h)}^{1/2} as estimator of θ = {f(y)}^{1/2}. By the delta method of Section 2.7.1 we have from (5.57) that the approximate mean and variance of T are

{f(y)}^{1/2} + ¼{f(y)}^{−1/2}{h²f″(y) − ½(nh)^{−1}K},   ¼(nh)^{−1}K,   (5.58)
where K = ∫ w²(u) du. There remains the problem of choosing h. For point estimation of f(y) it is usually suggested, on the grounds of minimizing mean squared error, that one take h ∝ n^{−1/5}. This makes both bias and standard error of order n^{−2/5}. But there is no reason to do the same for setting confidence intervals, and in fact h ∝ n^{−1/5} turns out to be a poor choice, particularly for standard bootstrap methods, as we now show. Suppose that we resample y*₁,...,y*ₙ from the EDF F̂. Then the bootstrap version of the density estimate, that is

f̂*(y; h) = (nh)^{−1} Σ_{j=1}^{n} w{(y − y*_j)/h},
has mean exactly equal to f̂(y; h); the approximate variance is the same as in (5.57) except that f̂(y; h) replaces f(y). It follows that T* = {f̂*(y; h)}^{1/2} has approximate mean and variance

{f̂(y; h)}^{1/2} − (1/8){f̂(y; h)}^{−1/2}(nh)^{−1}K,   ¼(nh)^{−1}K.
Now consider the studentized estimates

Z = [{f̂(y; h)}^{1/2} − {f(y)}^{1/2}] / {½(nh)^{−1/2}K^{1/2}},   Z* = [{f̂*(y; h)}^{1/2} − {f̂(y; h)}^{1/2}] / {½(nh)^{−1/2}K^{1/2}}.   (5.59)

From (5.58) and (5.59) we see that if h ∝ n^{−1/5}, then as n increases

Z ≐ ε + ½{f(y)}^{−1/2}K^{−1/2}f″(y),   Z* ≐ ε*,
Figure 5.5 Studentized quantities for density estimation. The left panels show values of Z when h = n^{−1/5} for 500 standard normal samples of sizes n and 500 bootstrap values for one sample at each n. The right panels show the corresponding values when h = n^{−1/3}.
where both ε and ε* are N(0,1). This means that quantiles of Z cannot be well approximated by quantiles of Z*, no matter how large n is. The same thing happens for the untransformed density estimate. There are several ways in which we can try to overcome this problem. One of the simplest is to change h to be of order n^{−1/3}, when calculations similar to those above show that Z ≐ ε and Z* ≐ ε*. Figure 5.5 illustrates the effect. Here we estimate the density at y = 0 for samples from the N(0,1) distribution, with w(·) the standard normal density. The first two panels show box plots of 500 values of z and z* when h = n^{−1/5}, which is near-optimal for estimation in this case, for several values of n; the values of z* are obtained by resampling from one dataset. The last two panels correspond to h = n^{−1/3}. The figure confirms the key points of the theory sketched above: that Z is biased away from zero when h = n^{−1/5}, but not when h = n^{−1/3}; and that the distributions of Z and Z* are quite stable and similar when h = n^{−1/3}. Under resampling from F̂, the studentized bootstrap applied to {f̂(y; h)}^{1/2} should be consistent if h ∝ n^{−1/3}. From a practical point of view this means considerable undersmoothing in the density estimate, relative to standard practice for estimation. A bias in Z of order n^{−1/3} or worse will remain, and this suggests a possibly useful role for the double bootstrap.
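These calculations are easy to reproduce in simulation; our sketch below estimates the mean and standard deviation of Z at y = 0 for N(0,1) data, showing the persistent bias when h = n^{−1/5} and its near-disappearance when h = n^{−1/3}.

```python
import numpy as np

K = 1 / (2 * np.sqrt(np.pi))        # integral of w^2 for the normal kernel
f0 = 1 / np.sqrt(2 * np.pi)         # true N(0,1) density at y = 0

def z_stat(y, h):
    """Studentized root-density estimate at 0, in the spirit of (5.59)."""
    fhat = np.mean(np.exp(-0.5 * (y / h) ** 2)) / (h * np.sqrt(2 * np.pi))
    se = 0.5 * np.sqrt(K / (len(y) * h))
    return (np.sqrt(fhat) - np.sqrt(f0)) / se

rng = np.random.default_rng(2)
n = 500
for rate in (-1 / 5, -1 / 3):
    h = n ** rate
    z = [z_stat(rng.standard_normal(n), h) for _ in range(500)]
    print(rate, np.mean(z), np.std(z))   # bias visible only for h = n^(-1/5)
```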
For a numerical example of nested bootstrapping in this context we revisit Example 4.18, where we discussed the use of a kernel density estimate in estimating species abundance. The estimated PDF is

f̂(y; h) = (nh)^{−1} Σ_{j=1}^{n} [φ{(y − y_j)/h} + φ{(y + y_j)/h}],

where φ(·) is the standard normal density, and the value of interest is f̂(0; h), which is used to estimate f(0). In light of the previous discussion, we base
5.6 ■Double Bootstrap M ethods Figure 5.6 Adjusted bootstrap procedure for variance-stabilized density estimate f = {/(0;0.5)}1/2 for the tuna data. The left panel shows the EDF of 1000 values of I* —t. The right panel shows a plot of the ordered u'Mr against quantiles r/(R + 1) of the 1/(0,1) distribution. The dashed line shows how the quantiles of the u are used to obtain improved confidence limits, by using the right panel to read off the estimated coverage q{a) corresponding to the required nominal coverage a, and then using the left panel to read off the q(a) quantile of t* —t.
229
O
■0) O
LU
fo E
LU
t*-t
Nominal coverage
confidence intervals on the variance-stabilized estimate t = {f̂(0; h)}^{1/2}. We also use a value of h considerably smaller than the value (roughly 1.5) used to estimate f in Example 4.18. The right panel of Figure 5.6 shows the quantiles of the u_{M,r} obtained when the double bootstrap bias adjustment is applied with R = 1000 and M = 250, for the estimate with bandwidth h = 0.5. If T* − t were an exact pivot, the distribution of the u would lie along the dotted line, and nominal and estimated coverage would be equal. The distribution is close to uniform, confirming our decision to use a variance-stabilized statistic. The dashed line shows how the distribution of the u* is used to remove the bias in coverage levels. For an upper confidence limit with nominal level 1 − α = 0.9, so that α = 0.1, the estimated level is q̂(0.1) = 0.088. The 0.088 quantile of the values of t*_r − t is t*_{(88)} − t = −0.091, while the 0.10 quantile is t*_{(100)} − t = −0.085. The corresponding upper 10% confidence limits for {f(0)}^{1/2} are t − (t*_{(88)} − t) = 0.356 − (−0.091) = 0.447 and t − (t*_{(100)} − t) = 0.356 − (−0.085) = 0.441. For this value of α the adjustment has only a small effect. Table 5.7 compares the 95% limits for f(0) for different methods, using bandwidth h = 0.5, for which f̂(0; 0.5) = 0.127. The longer upper tail for the double bootstrap interval is a result of adjusting the nominal α = 0.025 to q̂(0.025) = 0.004; at the upper tail we obtain q̂(0.975) = 0.980. The lower tail of the interval agrees well with the other second-order correct methods. For larger values of h the density estimates are higher and the confidence intervals narrower.
Table 5.7 Upper and lower endpoints of 95% confidence limits for f(0) for the tuna data, with bandwidth h = 0.5; † indicates use of square-root transformation.

        Basic   Basic†  Student  Student†  Percentile  BCa     Double
Upper   0.204   0.240   0.273    0.266     0.218       0.240   0.301
Lower   0.036   0.060   0.055    0.058     0.048       0.058   0.058
In Example 9.14 we describe how saddlepoint methods can greatly reduce the time taken to perform the double bootstrap in this problem. It might be possible to avoid the difficulties caused by the bias of the kernel estimate by using a clever resampling scheme, but it would be more complicated than the direct approach described above. ■
5.7 Empirical Comparison of Bootstrap Methods
The several bootstrap confidence limit methods can be compared theoretically on the basis of first- and second-order accuracy, as in Section 5.4, but this really gives only suggestions as to which methods we would expect to be good. The theory needs to be bolstered by numerical comparisons. One rather extreme comparison was described in Example 5.7. In this section we consider one moderately complicated application, estimation of a ratio of means, and assess through simulation the performances of the main bootstrap confidence limit methods. The conclusions appear to agree qualitatively with the results of other simulation studies involving applications of similar complexity: references to some of these are given in the bibliographic notes at the end of the chapter. The application here is similar to that in Example 5.10, and concerns the ratio of means for data from two different gamma distributions. The first sample of size n₁ is drawn from a gamma distribution with mean μ₁ = 100 and index 0.7, while the second independent sample of size n₂ is drawn from the gamma distribution with mean μ₂ = 50 and index 1. The parameter θ = μ₁/μ₂, whose value is 2, is estimated by the ratio of sample means t = ȳ₁/ȳ₂. For particular choices of sample sizes we simulated 10000 datasets and to each applied several of the nonparametric bootstrap confidence limit methods discussed earlier, always with R = 999. We did not include the double bootstrap method. As a control we added the exact parametric method when the gamma indexes are known: this turns out not to be a strong control, but it does provide a check on simulation validity. The results quoted here are for two cases, n₁ = n₂ = 10 and n₁ = n₂ = 25. In each case we assess the left- and right-tail error rates of confidence intervals, and their lengths. Table 5.8 shows the empirical error rates for both cases, as percentages, for nominal rates between 1% and 10%; simulation standard errors are rates
Table 5.8 Empirical error rates (%) for nonparametric bootstrap confidence limits in ratio estimation: rates for sample sizes n₁ = n₂ = 10 are given above those for sample sizes n₁ = n₂ = 25. R = 999 for all bootstrap methods. 10000 datasets generated from gamma distributions.

                          Nominal error rate
                          Lower limit                 Upper limit
Method                     1    2.5    5     10       10     5    2.5    1
Exact                     1.0   2.8   5.5   10.5      9.8   4.8   2.6   1.0
                          1.0   2.3   4.8    9.9     10.2   4.9   2.5   1.1
Normal approximation      0.1   0.5   1.7    6.3     20.6  15.7  12.5   9.6
                          0.1   0.5   2.1    6.4     16.3  11.5   8.2   5.5
Basic                     0.0   0.0   0.2    1.8     24.4  21.0  18.6  16.4
                          0.0   0.1   0.4    3.0     19.2  15.0  12.5  10.3
Basic, log scale          2.6   4.9   8.1   12.9     13.1   7.5   4.8   2.5
                          1.6   3.2   6.0   11.4     11.5   6.3   3.3   1.7
Studentized               0.6   2.1   4.6    9.9     11.9   6.7   4.0   2.0
                          0.8   2.3   4.6    9.9     10.9   5.9   3.0   1.4
Studentized, log scale    1.1   2.8   5.6   10.7     11.6   6.3   3.5   1.7
                          1.1   2.5   5.0   10.1     10.8   5.7   2.9   1.3
Bootstrap percentile      1.8   3.6   6.5   11.6     14.6   8.9   5.9   3.3
                          1.2   2.6   5.1   10.1     12.6   7.1   4.2   2.1
BCa                       1.9   4.0   6.9   12.3     14.0   8.3   5.3   3.0
                          1.4   3.0   5.6   10.9     11.8   6.8   3.8   1.9
ABC                       1.9   4.2   7.4   12.7     14.6   8.7   5.5   3.1
                          1.3   3.0   5.7   11.0     12.1   6.8   3.7   1.9
divided by 100. The normal approximation method uses the delta method variance approximation. The results suggest that the studentized method gives the best results, provided the log scale is used. Otherwise, the studentized method and the percentile, BCa and ABC methods are comparable but only really satisfactory at the larger sample sizes. Figure 5.7 shows box plots of the lengths of 1000 confidence intervals for both sample sizes. The most pronounced feature for n₁ = n₂ = 10 is the long (sometimes very long) lengths for the two studentized methods, which helps to account for their good error rates. This feature is far less prominent at the larger sample sizes. It is noticeable that the normal, percentile, BCa and ABC intervals are short compared to the exact ones, and that taking logs improves the basic intervals. Similar comments apply when n₁ = n₂ = 25, but with less force.
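Comparisons of this kind are straightforward, if slow, to reproduce. A stripped-down sketch (ours) estimating the tail error rates of the basic method alone, with far fewer datasets than the 10000 used for Table 5.8:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n1, n2, R, nsim = 2.0, 10, 10, 999, 1000
miss_lo = miss_hi = 0
for _ in range(nsim):
    y1 = rng.gamma(0.7, 100 / 0.7, n1)        # mean 100, index 0.7
    y2 = rng.gamma(1.0, 50.0, n2)             # mean 50, index 1
    t = y1.mean() / y2.mean()
    t_star = np.sort([rng.choice(y1, n1).mean() / rng.choice(y2, n2).mean()
                      for _ in range(R)])
    lo, hi = 2 * t - t_star[974], 2 * t - t_star[24]   # basic 95% interval
    miss_lo += theta < lo
    miss_hi += theta > hi
print(miss_lo / nsim, miss_hi / nsim)  # compare with the Basic rows of Table 5.8
```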
5.8 Multiparameter Methods
When we want a confidence region for a vector parameter, the question of shape arises. Typically a rectangular region formed from intervals for each component parameter will not have high enough coverage probability, although a Bonferroni argument can be used to give a conservative confidence coefficient,
Figure 5.7 Box plots of confidence interval lengths for the first 1000 simulated samples in the numerical experiment with gamma data; upper panel n₁ = n₂ = 10, lower panel n₁ = n₂ = 25.
as follows. Suppose that θ has d components, and that the confidence region C_α is rectangular, with interval C_{α_i} = (θ̂_{L,i}, θ̂_{U,i}) for the ith component θ_i. Then

Pr(θ ∉ C_α) = Pr(∪_i {θ_i ∉ C_{α_i}}) ≤ Σ_i Pr(θ_i ∉ C_{α_i}) = Σ_i α_i,

say. If we take α_i = α/d then the region C_α has coverage at least equal to 1 − α. For certain applications this could be useful, in part because of its simplicity. But there are two potential disadvantages. First, the region could be very conservative: the true coverage could be considerably more than the nominal 1 − α. Secondly, the rectangular shape could be quite at odds with plausible likelihood contours. This is especially true if the estimates for parameter components are quite highly correlated, when also the Bonferroni method is more conservative. One simple possibility for a joint bootstrap confidence region when T is approximately normal is to base it on the quadratic form

Q = (T − θ)^T V^{−1}(T − θ),   (5.60)
where V is the estimated variance matrix of T. Note that Q is the multivariate extension of the square of the studentized statistic of Section 5.2. If Q had exact p quantiles a_p, say, then a 1 − α confidence set for θ would be

{θ : (T − θ)^T V^{−1}(T − θ) ≤ a_{1−α}}.   (5.61)
The elliptical shape of this set is correct if the distribution of T has elliptical contours, as the multivariate normal distribution does. So if T is approximately multivariate normal, then the shape will be approximately correct. Moreover, Q will be approximately distributed as a χ²_d variable. But as in the scalar case such distributional approximations will often be unreliable, so it makes sense to approximate the distribution of Q, and in particular the required quantile a_{1−α}, by resampling. The method then becomes completely analogous to the studentized bootstrap method for scalar parameters. The bootstrap analogue of Q will be

Q* = (T* − t)^T V*^{−1}(T* − t),

which will be calculated for each of R simulated samples. If we denote the ordered bootstrap values by q*_{(1)} ≤ ··· ≤ q*_{(R)}, then the 1 − α bootstrap confidence region is the set

{θ : (t − θ)^T v^{−1}(t − θ) ≤ q*_{((R+1)(1−α))}}.   (5.62)
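Operationally the region needs only the ordered quadratic forms; a minimal sketch (ours, names illustrative) for testing whether a given θ lies in the 1 − α region:

```python
import numpy as np

def q_form(t, theta, v):
    d = np.asarray(t) - np.asarray(theta)
    return d @ np.linalg.solve(v, d)

def in_region(theta, t, v, t_boot, v_boot, alpha):
    """Membership test for the studentized bootstrap region (5.62).
    t_boot, v_boot: bootstrap estimates t*_r and their variance matrices v*_r."""
    R = len(t_boot)
    q_star = np.sort([q_form(ts, t, vs) for ts, vs in zip(t_boot, v_boot)])
    return q_form(t, theta, v) <= q_star[int((R + 1) * (1 - alpha)) - 1]
```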
As in the scalar case, a common and useful choice for v is the delta method variance estimate v_L. The same method can be applied on any scales which are monotone transformations of the original parameter scales. For example, if h(θ) has ith component h_i(θ_i), say, and if ḋ is the diagonal matrix with elements ∂h_i/∂θ_i evaluated at θ = t, then we can apply (5.62) with the revised definition

Q = {h(t) − h(θ)}^T (ḋ^T v ḋ)^{−1} {h(t) − h(θ)}.

If the corresponding ordered bootstrap values are again denoted by q*, then the bootstrap confidence region will be

{θ : {h(t) − h(θ)}^T (ḋ^T v ḋ)^{−1} {h(t) − h(θ)} ≤ q*_{((R+1)(1−α))}}.   (5.63)
A particular choice for h(·) would often be based on diagnostic plots of components of t* and v*, the objectives being to attain approximate normality and approximately stable variance for each component. This method will be subject to the same potential defects as the studentized bootstrap method of Section 5.2. There is no vector analogue of the adjusted percentile methods, but the nested bootstrap method can be applied.

Example 5.14 (Air-conditioning data) For the air-conditioning data of Example 1.1, consider setting a confidence region for the two parameters θ = (μ, κ) in a gamma model. The log likelihood function is

ℓ(μ, κ) = n{κ log(κ/μ) − log Γ(κ) + (κ − 1) log y‾ − κȳ/μ},

where ȳ and log y‾ are the averages of the data and the log data,
from which we calculate the maximum likelihood estimators T = (μ̂, κ̂). The
numerical values are μ̂ = 108.083 and κ̂ = 0.7065. A straightforward calculation shows that the delta method variance approximation, equal to the inverse of the expected information matrix as in Section 5.2, is

v_L = n^{−1} diag[κ̂^{−1}μ̂², {d² log Γ(κ̂)/dκ̂² − κ̂^{−1}}^{−1}].   (5.64)
The standard likelihood ratio 1 − α confidence region is the set of values of (μ, κ) for which

2{ℓ(μ̂, κ̂) − ℓ(μ, κ)} ≤ c_{2,1−α},
where c2,i_« is the 1 — a quantile o f the x l distribution. The top left panel o f Figure 5.8 shows the 0.50, 0.95 an d 0.99 confidence regions obtained in this way. T he top right panel is the same, except th a t C2,i_a is replaced by a b o o tstrap estim ate obtained from R = 999 sam ples sim ulated from the fitted gam m a m odel. This second region is som ew hat larger than, b u t o f course has the same shape as, the first. From the b o o tstrap sim ulation we have estim ators t" = (£*,£*) from each sample, from which we calculate the corresponding variance approxim ations using (5.64), an d hence the quad ratic form s q * = ( f — f)r i>2-1 (f* — t). We then apply (5.62) to obtain the studentized b o o tstrap confidence regions shown in the bottom left panel o f Figure 5.8. This is clearly nothing like the likelihoodbased confidence regions above, p artly because it fails com pletely to take account o f the m ild skewness in the distribution o f fi and the heavy skewness in the distrib u tio n o f k. These features are clear in the histogram plots o f Figure 5.9. L ogarithm ic transfo rm atio n o f b o th fi an d k improves m atters considerably: the b otto m right panel o f Figure 5.8 com es from applying the studentized boo tstrap m ethod after d ual logarithm ic transform ation. Nevertheless, the solution is n o t com pletely satisfactory, in th a t the region is too wide on the k axis and slightly narrow on the fi axis. This could be predicted to som e extent by plotting v'L versus f*, which shows th a t the log transform ation o f k is not quite strong enough. Perhaps m ore im p o rtan t is th a t there is a substantial bias in k: the b o o tstrap bias estim ate is 0.18. One lesson from this exam ple is th a t where a likelihood is available and usable, it should be used — w ith param etric sim ulation to check on, and if necessary replace, stan d ard approxim ations for quantiles o f the log likelihood ratio statistic. ■ Example 5.15 (Laterite data) T he d a ta in Table 5.9 are axial d a ta consisting o f 50 pole positions, in degrees o f latitude an d longitude, from a palaeom agnetic study o f N ew C aledonian laterites. The d a ta take values only in the lower unit half-sphere, because an axis is determ ined by a single pole.
Figure 5.8 Bootstrap confidence regions for the parameters μ, κ of a gamma model for the air-conditioning data, with levels 0.50, 0.95 and 0.99. Top left: likelihood ratio region with χ²₂ quantiles; top right: likelihood ratio region with bootstrap quantiles; bottom left: studentized bootstrap on original scales; bottom right: studentized bootstrap on logarithmic scales. R = 999 bootstrap samples from fitted gamma model with μ̂ = 108.083 and κ̂ = 0.7065. + denotes MLE.
Let Y denote a unit vector on the lower half-sphere with cartesian coordinates (cos X cos Z, cos X sin Z, sin X)^T, where X and Z are degrees of latitude and longitude. The population quantity of interest is the mean polar axis, a(θ, φ) = (cos θ cos φ, cos θ sin φ, sin θ)^T, defined as the axis given by the eigenvector corresponding to the largest eigenvalue of E(YY^T). The sample value of this is given by the corresponding eigenvector of the matrix n^{−1} Σ y_j y_j^T, where y_j is the vector of cartesian coordinates of the jth pole position. The sample mean polar axis has latitude θ̂ = −76.3 and longitude φ̂ = 83.8. Figure 5.10 shows the original data in an equal-area projection onto a plane tangential to the South Pole, at θ = −90°; the hollow circle represents the sample mean polar axis.
Figure 5.9 Histograms of μ̂* and κ̂* from R = 999 bootstrap samples from gamma model with μ̂ = 108.083 and κ̂ = 0.7065, fitted to air-conditioning data.
Table 5.9 Latitude (°) and longitude (°) of pole positions determined from the paleomagnetic study of New Caledonian laterites (Fisher et al., 1987, p. 278).

 Lat    Long    Lat    Long    Lat    Long    Lat    Long
-26.4   324.0  -52.1    83.2  -80.5   108.4  -74.3    90.2
-32.2   163.7  -77.3   182.1  -77.7   266.0  -81.0   170.9
-73.1    51.9  -68.8   110.4   -6.9    19.1  -12.7   199.4
-80.2   140.5  -68.4   142.2  -59.4   281.7  -75.4   118.6
-71.1   267.2  -29.2   246.3   -5.6   107.4  -85.9    63.7
-58.7    32.0  -78.5   222.6  -62.6   105.3  -84.8    74.9
-40.8    28.1  -65.4   247.7  -74.7   120.2   -7.4    93.8
-14.9   266.3  -49.0    65.6  -65.3   286.6  -29.8    72.8
-66.1   144.3  -67.0   282.6  -71.6   106.4  -85.2   113.2
 -1.8   256.2  -56.7    56.2  -23.3    96.5  -53.1    51.5
-38.3   146.8  -72.7   103.1  -60.2    33.2  -63.4   154.8
-17.2    89.9  -81.6   295.6  -40.4    41.0
-56.2    35.6  -75.1    70.7  -53.6    59.1
In order to set a confidence region for the mean polar axis, or equivalently (θ, φ), let

b(θ, φ) = (sin θ cos φ, sin θ sin φ, −cos θ)^T,   c(θ, φ) = (−sin φ, cos φ, 0)^T

denote the unit vectors orthogonal to a(θ, φ). The sample values of these vectors are â, b̂ and ĉ, and the sample eigenvalues are λ̂₁ ≤ λ̂₂ ≤ λ̂₃. Let Â denote the 2 × 3 matrix (b̂, ĉ)^T and B̂ the 2 × 2 matrix with (j, k)th element

{(λ̂₃ − λ̂_j)(λ̂₃ − λ̂_k)}^{−1} n^{−1} Σ_i (ê_j^T y_i)(ê_k^T y_i)(â^T y_i)²,

where ê₁ = b̂ and ê₂ = ĉ.
Figure 5.10 Equal-area projection of the laterite data onto the plane tangential to the South Pole (+). The sample mean polar axis is the hollow circle, and the square region is for comparison with Figures 5.11 and 10.3.
Then the analogue of (5.60) is

Q = n a(θ, φ)^T Â^T B̂^{−1} Â a(θ, φ),   (5.65)

which is approximately distributed as a χ²₂ variable in large samples. In the bootstrap analogue of Q, a is replaced by â, and Â and B̂ are replaced by the corresponding quantities calculated from the bootstrap sample. Figure 5.11 shows results from setting confidence regions for the mean polar axis based on Q. The panels show the 0.5, 0.95 and 0.99 contours, using χ²₂ quantiles and those based on R = 999 nonparametric bootstrap replicates q*. The contours are elliptical in this projection. For this sample size it would not be misleading to use the asymptotic 0.5 and 0.95 quantiles, though the 0.99 quantiles differ by more. However, simulations with a random subset of size n = 20 gave dramatically different quantiles, and it seems to be essential to use the bootstrap quantiles for smaller sample sizes. A different approach is to set T = (θ̂, φ̂)^T, and then to base a confidence region for (θ, φ) on (5.60), with V taken to be the nonparametric delta method estimate of the covariance matrix. This approach does not take into account the geometry of spherical data and works very poorly in this example, partly because the estimate t is close to the South Pole, which limits the range of φ̂. ■
Figure 5.11 The 0.5, 0.95, and 0.99 confidence regions for the mean polar axis of the laterite data based on (5.65), using χ²₂ quantiles (left) and bootstrap quantiles (right). The boundary of each panel is the square region in Figure 5.10; also shown are the South Pole (+) and the sample mean polar axis (○).
5.9 Conditional Confidence Regions
In parametric inference the probability calculations for confidence regions should in principle be made conditional on the ancillary statistics for the model, when these exist, the basic reason being to ensure that the inference accounts for the actual information content in the observed data. In parametric models what is ancillary is often specific to the mathematical form of F, and there is no nonparametric analogue. However, there are situations where there is a model-free ancillary indicator of the experiment, as with the design of a regression experiment (Chapter 6). In fact there is such an indicator in one of our earlier examples, and we now use this to illustrate some of the points which arise with conditional bootstrap confidence intervals.

Example 5.16 (City population data) For the ratio estimation problem of Example 1.2, the statistic d = ū would often be regarded as ancillary. The reason rests in part on the notion of a model for linear regression of x on u with variation proportional to u. The left panel of Figure 5.12 shows the scatter plot of t* versus d* for the R = 999 nonparametric bootstrap samples used earlier. The observed value of d is 103.1. The middle and right panels of the figure show trends in the conditional mean and variance, E*(T* | d*) and var*(T* | d*), these being approximated by crude local averaging in the scatter plot on the left. The calculation of confidence limits for the ratio θ = E(X)/E(U) is to be made conditional on d* = d, the observed mean of u. Suppose, for example, that we want to apply the basic bootstrap method. Then we need to approximate the conditional quantiles a_p(d) of T − θ given D = d for p = α and 1 − α, and
Figure 5.12 City population data, n = 49. Scatter plot of bootstrap ratio estimates t* versus d*, and conditional means and variances of t* given d*. R = 999 nonparametric samples.

Table 5.10 City population data, n = 49. Comparison of unconditional and conditional cumulative probabilities for bootstrap ratio T*. R = 9999 nonparametric samples, R_d = 499 used for conditional probabilities.

Unconditional   0.010  0.025  0.050  0.100  0.900  0.950  0.975  0.990
Conditional     0.006  0.020  0.044  0.078  0.940  0.974  0.988  1.000
use these in (5.3). The bootstrap estimate of a_p(d) is the value â_p(d) defined by Pr*{T* − t ≤ â_p(d) | D* = d} = p, and the simplest way to use our simulated samples to approximate this is to use only those samples for which d* is "near" d. For example, we could take the R_d = 99 samples whose d* values are closest to d and approximate â_p(d) by the 100pth ordered value of t* in those samples. Certainly stratification of the simulation results by intervals of d* values shows quite strong conditional effects, as evidenced in Figure 5.12. The difficulty is that R_d = 99 samples is not enough to obtain good estimates of conditional quantiles, and certainly not to distinguish between unconditional quantiles and the conditional quantiles given d* = d, which is near the mean. Only with an increase of R to 9999, and using strata of R_d = 499 samples, does a clear picture emerge. Figure 5.13 shows plots of conditional quantile estimates from this larger simulation. How different are the conditional and unconditional distributions? Table 5.10 shows bootstrap estimates of the cumulative conditional probabilities Pr(T ≤ a_p | D = d), where a_p is the unconditional p quantile, for several values of p. Each estimate is the proportion of times in R_d = 499 samples that t* is less than or equal to the unconditional quantile estimate t*_{(10000p)}. The comparison suggests that conditioning does not have a large effect in this case. A more efficient use of bootstrap samples, which takes advantage of the smoothness of quantiles as a function of d, is to estimate quantiles for interval strata of R_d samples and then for each level p to fit a smooth curve. For example, if the kth such stratum gives quantile estimates â_{p,k} and average
Figure 5.13 City population data, n = 49. Conditional 0.025 and 0.975 quantiles of bootstrap ratio t* from R = 9999 samples, with strata of size R_d = 499. The horizontal dotted lines are unconditional quantiles, and the vertical dotted line is at d* = d.
Figure 5.14 City population data, n = 49. Smooth spline fits to 0.025 and 0.975 conditional quantiles of bootstrap ratio t* from R = 9999 samples, using overlapping strata of size R_d = 199.
value d̄_k for d*, then we can fit a smoothing spline to the points (d̄_k, â_{p,k}) for each p and interpolate the required value â_p(d) at the observed d. Figure 5.14 illustrates this for R = 9999 and non-overlapping strata of size R_d = 199, with p = 0.025 and 0.975. Note that interpolation is only needed at the centre of the curve. Use of non-overlapping intervals seems to give the best results. ■
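A sketch of this stratify-and-smooth device (ours; names illustrative): order the pairs (d*, t*), estimate the p quantile within each stratum, then fit a smoothing spline in the stratum means and evaluate at the observed d.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def conditional_quantile(d_star, t_star, d_obs, p, n_strata=20):
    """Smoothed estimate of the conditional p quantile of t* given d* = d_obs."""
    order = np.argsort(d_star)
    strata = np.array_split(order, n_strata)          # non-overlapping strata
    d_bar = [np.mean(d_star[s]) for s in strata]
    q_hat = [np.quantile(t_star[s], p) for s in strata]
    fit = UnivariateSpline(d_bar, q_hat, k=3, s=float(n_strata))
    return float(fit(d_obs))
```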
Figure 5.15 Annual discharge of the River Nile at Aswan, 1871-1970 (Cobb, 1978).
Just as with unconditional analysis, so with conditional analysis there is a choice of bootstrap confidence interval methods. From our earlier discussion the studentized bootstrap and adjusted percentile methods are likely to work best for statistics that are approximately normal, as in the previous example. The adjusted percentile method requires constants a, v_L and w, all of which must now be conditional; see Problem 5.17. The studentized bootstrap method can be applied as before with Z = (T − θ)/V^{1/2}, except that now conditional quantiles will be needed. Some simplification may occur if it is possible to standardize with a conditional standard error. The next example illustrates another way of overcoming the paucity of bootstrap samples which satisfy the conditioning constraint.

Example 5.17 (Nile data) The data plotted in Figure 5.15 are annual discharges y of the River Nile at Aswan from 1871 to 1970. Interest lies in the year 1870 + θ in which the mean discharge drops from μ₁ = 1100 to μ₂ = 870; these mean values are estimated, but it is reasonable to ignore this fact and we shall do so. The least squares estimate of the integer θ maximizes
e S(0) = ^ { > 7 “ 3 ^ i + w ) } j= i S tan d ard norm al-theory likelihood analysis suggests th a t differences in S(6) for 0 n ear 0 are ancillary statistics. We shall reduce these differences to two p articu lar statistics which m easure skewness and curvature o f S( ) near 0,
242
5 ■Confidence Intervals
b'
c*
1.64 2.44 4.62 4.87 5.12 5.49 6.06 6.94
.. ..
-0.62
-0.37
-0.17
0
0.17
0.37
0.62
0.87
59 62
52 88
53 81
71 83
68 79
62 82
50 68
53 81
92 91 92 97 94 93
84 91 96 96 100 100
93 91 100 89 100 100
93 95 95 98 100 100
95 89 86 96 97 100
97 92 97 95 96 100
87 92 100 97 95 100
93 95 97 96 95 100
2.45
_ 50 76 76 81 85 86 100
nam ely B = S(d + 5) - S(6 - 5 ) ,
C = S(0 + 5) - 2S(0) + S(0 - 5);
for num erical convenience we rescale B and C by 0.0032. It is expected th at B and C respectively influence the bias an d variablity o f 0. We are interested in the conditional confidence th a t should be attached to the set 0 + 1, th at is Pr(|0 — 0| < 1 | b,c). The d a ta analysis gives 0 = 28 (year 1898), b = 0.75 and c = 5.5. W ith no assum ption on the shape o f the distribution o f Y , except th a t it is constant, the obvious b o o tstrap sam pling scheme is as follows. First calculate the residuals ej = Xj — f i u j = 1 ,...,2 8 and e; = x j — fi2, j = 2 9 ,..., 100. T hen sim ulate d a ta series by x ' = m + e ’, j = 1 ,...,2 8 and x* = n 2 + s ) , j = 29.......100, w here e’ is random ly sam pled from eioo- Each such sam ple series then gives 0*,fr* an d c*. F rom R = 10 000 b o o tstra p sam ples we find th a t the pro p o rtio n o f samples A A w ith 16 — 9\ < 1 is 0.862, which is the unconditional b o o tstrap confidence. But when these sam ples are p artitio n ed according to b* and c”, strong effects show up. Table 5.11 shows p a rt o f the table o f proportions for outcom e 10* — 01 < 1 for a 16 x 15 partitio n , 201 o f these p artitions being non-em pty and m ost o f them having at least 50 b o o tstrap samples. The proportions are consistently higher th an 0.95 for ( b' ,c ') n ear (b,c), which strongly suggests th a t the conditional confidence Pr(|0 — 0| < 1 | b = 0.75, c = 5.5) exceeds 0.95. T he conditional probability Pr(|0 — 0| < 1 | b,c) will be sm ooth in b and c, so it m akes sense to assum e th a t the estim ate p(b’,c*) = Pr*(|0* — 0| < 1 | 6*,c’)
Table 5.11 Nile data. Part of the table of proportions (%) of bootstrap samples for which 10" —§ | ^ 1, for interval values of b' and c*. R = 10000 samples.
5.10 • Prediction
243
is sm ooth in b ' , c ' . We fitted a logistic regression to the proportions in the 201 non-em pty cells o f the com plete version o f Table 5.11, the result being logit p(b* , c ) = —0.51 — 0.20b’2 + 0.68c*. The residual deviance is 223 on 198 degrees o f freedom , which indicates an adequate fit for this simple model. The conditional bo o tstrap confidence is the fitted value o f p a t b' = b, c* = c, which is 0.972 w ith standard erro r 0.009. So the conditional confidence attached to 6 = 28 + 1 is m uch higher th an the unconditional value. The value o f the stan d ard error for the fitted value corresponds to a binom ial stan d ard error for a sam ple o f size 3500, or 35% o f the whole b o o tstrap sim u lation, which indicates high efficiency for this m ethod o f estim ating conditional probability. ■
5.10 Prediction Closely related to confidence regions for param eters are confidence regions for future outcom es o f the response Y , m ore usually called prediction regions. A pplications are typically in m ore com plicated contexts involving regression m odels (C hapters 6 and 7) and time series m odels (C hapter 8), so here we give only a b rief discussion o f the m ain ideas. In the sim plest situation we are concerned with prediction o f one future response Yn+l given observations y \ , . . . , y n from a distribution F. The ideal upp er y prediction lim it is the y quantile o f F, which we denote by ay(F). The sim plest ap p ro ach to calculating a prediction limit is the plug-in approach, th a t is substituting the estim ate F for F to give ay = ay(F). But this is clearly biased in the optim istic direction, because it does n o t allow for the uncertainty in F. R esam pling is used to correct for, or remove, this bias. Parametric case Suppose first th a t we have a fully param etric model, F = Fg, say. T hen the prediction lim it ay(F) can be expressed m ore directly as ay(9). T he true coverage o f this limit over repetitions o f b o th d a ta and predictand will n o t generally be y, b u t rath er P r{7 n+i < ay(6) \ 6} = h(y),
(5.66)
say, where h(-) is unknow n except th a t it m ust be increasing. (The coverage also depends on 6 in general, b u t we suppress this from the no tatio n for simplicity.) T he idea is to estim ate h(-) by resam pling. So, for d a ta Y J , . . . , Y * and predictand Yn*+1 all sam pled from F = Fg, we estim ate (5.66) by Mv) = Pr*{y„*+1 < a y(d')},
(5.67)
244
5 • Confidence Intervals
where as usual O' is the estim ator calculated for d a ta Y Y ‘. In practice it would usually be necessary to use R sim ulated repetitions o f the sam pling and approxim ate (5.67) by (5.68) Once h(y) has been calculated, the adjusted y prediction limit is taken to be at<7) = ag(y)(h where Hgi v) } = 7Example 5.18 (Normal prediction limit) Suppose th a t Y^,..., Y„+i are inde pendently sam pled from the N(/i, cr2) distribution, where fi and a are unknow n, and th a t we wish to predict Yn+\ having observed yi, . .. , y „- The plug-in m ethod gives the basic y prediction limit aY = y„ + s„
where Z„_i has the S tudent-f distribution w ith n — 1 degrees o f freedom . This leads directly to the S tudent-f prediction limit
where /c„_i,y is the y quantile o f the S tudent-t distribution with n — 1 degrees o f freedom. In this p articu lar case, then, h( ) does not need to be estim ated. But if we had n o t recognized the occurrence o f the Student-f distribution, then the first probability in (5.69) w ould have been estim ated by applying (5.68) w ith samples generated from the N ( y n, s2) distribution. Such an estim ate (corresponding to infinite R) is plotted in Figure 5.16 for sam ple size n = 10. The plot has logit scales to em phasize the discrepancy betw een h(y) and y. G iven values o f the estim ate h{y), a sm ooth curve can be obtained by quadratic regression o f their logits on logits o f y; this is illustrated in the figure, where the solid line is the regression fit. T he required value g(y) can be read off from the curve. ■ The preceding exam ple suggests a m ore direct m ethod for special cases involving m eans, which m akes use o f a poin t prediction y n+\ and the distribu tion o f prediction error Yn+l — y„+1: resam pling can be used to estim ate this distribution directly. This m ethod will be applied to linear regression m odels in Section 6.3.3.
245
5.10 ■Prediction
Figure 5.16 Adjustment function /i(y) for prediction with sample size n = 10 from N(n,cr2), with quadratic logistic fit (solid), and line giving /i(y) = y (dots).
Logit of gamma
Nonparametric case N ow consider the n o nparam etric context, where F is the E D F o f a single sample. The calculations outlined for the param etric case apply here also. First, if r / n < y < (r + 1)/n then the plug-in prediction limit is ay(F) = y(r)\ equivalently, ay(F) = y([ny\), where [■] m eans integer part. Straightforw ard calculation shows th at Pr(Y„+1 < yw ) = r / ( n + l ) , w hich m eans th a t (5.66) becom es h(y) = [ny]/(n+1). Therefore [n g (y )]/(n + l) = y, so th at the adjusted prediction limit is y ( [ ( n+ i ) v ] ) : this is exact if (n + l ) y is an integer. It seems intuitively clear th a t the efficiency o f this nonparam etric prediction lim it relative to a param etric prediction limit would be considerably lower th an would be the case for confidence limits on a param eter. F or example, a com parison betw een the norm al-theory and nonparam etric m ethods for sam ples from a norm al distribution shows the efficiency to be ab o u t j for a = 0.05. F or sem iparam etric problem s sim ilar calculations apply. One general ap proach which m akes sense in certain applications, as m entioned earlier, bases prediction lim its on poin t predictions, and uses resam pling to estim ate the distribution o f prediction error. For further details see Sections 6.3.3 and 7.2.4.
246
J • Confidence Intervals
5.11 Bibliographic Notes S tan d ard m ethods for obtaining confidence intervals are described in C hap ters 7 an d 9 o f Cox an d H inkley (1974), while m ore recent developm ents in likelihood-based m ethods are outlined by B arndorff-N ielsen and Cox (1994). C orresponding m ethods based on resam ple likelihoods are described in C hap ter 10. B ootstrap confidence intervals were introduced in the original b o otstrap paper by E fron (1979); bias adjustm ent and studentizing were discussed by E fron (1981b). The adjusted percentile m ethod was developed by E fron (1987), w ho gives detailed discussion o f the bias and skewness adjustm ent factors b and a. In p a rt this developm ent responded to issues raised by Schenker (1985). T he A B C m ethod an d its theoretical justification were laid out by DiCiccio an d Efron (1992). H all (1988a, 1992a) contain rigorous developm ents o f the second-order com parisons betw een com peting m ethods, including the studentized b o o tstrap m ethods, an d give references to earlier w ork dating back to Singh (1981). D iCiccio an d E fron (1996) give an excellent review o f the B C a and A B C m ethods, together w ith their asym ptotic properties and com parisons to likelihood-based m ethods. A n earlier review, w ith discussion, was given by D iCiccio an d R om ano (1988). O ther em pirical com parisons o f the accuracy o f b o o tstrap confidence interval m ethods are described in Section 4.4.4 o f Shao and Tu (1995), while Lee and Y oung (1995) m ake com parisons w ith iterated bo o tstrap m ethods. Their conclusions and those o f Canty, D avison and H inkley (1996) broadly agree w ith those reached here. T ibshirani (1988) discussed em pirical choice o f a variance-stabilizing tra n s form ation for use w ith the studentized b o o tstrap m ethod. Choice o f sim ulation size R is investigated in detail by H all (1986). See also the related references for C h ap ter 4 concerning choice o f R to m aintain high test power. T he significance test m ethod has been studied by K abaila (1993a) and discussed in detail by C arp en ter (1996). B uckland and G arthw aite (1990) and G a rth w aite and B uckland (1992) describe an efficient algorithm to find confidence lim its in this context. The p articu lar application discussed in E xam ple 5.11 is a m odified version o f Jennison (1992). O ne intriguing application, to phylogenetic trees, is described by Efron, H allo ran and H olm es (1996). The double b o o tstrap m ethod o f adjustm ent in Section 5.6 is sim ilar to th at developed by Beran (1987) and H inkley an d Shi (1989); see also Loh (1987). The m ethod is som etim es called b o o tstrap calibration. H all and M artin (1988) give a detailed analysis o f the reduction in coverage error. Lee and Y oung (1995) provide an efficient algorithm for approxim ating the m ethod w ithout sim ulation w hen the p aram eter is a sm ooth function o f means. B ooth and H all
247
5.12 • Problems
(1994) discuss the num bers o f sam ples required when the nested b o o tstrap is used to calibrate a confidence interval. C onditional m ethods have received little attention in the literature. E xam ple 5.17 is tak en from H inkley an d Schechtm an (1987). B ooth, H all and W ood (1992) describe kernel m ethods for estim ating the conditional distribution o f a b o o tstrap statistic. Confidence regions for vector param eters are alm ost untouched in the lit erature. T here are no general analogues o f adjusted percentile m ethods. H all (1987) discusses likelihood-based shapes for confidence regions. Geisser (1993) surveys several approaches to calculating prediction intervals, including resam pling m ethods such as cross-validation. References to confidence interval and prediction interval m ethods for regres sion m odels are given in the notes for C hapters 6 and 7; see also C hapter 8 for tim e series.
5.12 Problems 1
Suppose that we have a random sample from a distribution F whose mean is unknown but whose variance is known and equal to a 1. D iscuss possi ble nonparametric resampling methods for obtaining confidence intervals for ^, including the following: (i) use z = J n ( y — n ) / a and resample from the E D F ; (ii) use z = J n ( y — fi)/s and resample from the E D F ; (iii) as in (ii) but replace the E D F o f the data by the E D F o f values y + a(yi — y ) / s; (iv) as in (ii) but replace the E D F by a distribution on the data values whose mean and variance are y and a 2.
2
Suppose that 9 is the correlation coefficient for a bivariate distribution. If this distribution is bivariate normal, show that the M LE 9 is approximately N ( 9 , ( 1 — 92)2/n). Use the delta m ethod to show that the transformed correlation parameter f for which fj is approximately N ( 0 , n ') is ( = | lo g {(l + 9)/ ( 1 — 0)}.
s2 is the usual sample variance of .........y„.
Compare the use o f normal approximations for 9 and f with use o f a parametric bootstrap analysis to obtain confidence intervals for 9: see Practical 5.1. (Section 5.2) 3
Independent measurements y i , . . . , y n come from a distribution with range [0,0], Suppose that we resample by taking samples o f size m from the data, and base confidence intervals on Q = m{t — T' )/t, where T ’ = m a x { . . . , Ym ’ }. Show that this works provided that m /n —>0 as n—»oo, and use simulation to check its performance when n = 100 and Y has the (7(0,0) distribution. (Sections 2.6.1, 5.2)
4
The gamma model (1.1) with mean /i and index k can be applied to the data o f Example 1.1. For this model, show that the profile log likelihood for pt is ^prof(M) = nk„ lo g (kft/fi) + (k„ - 1) Y 2 lo 8 JO ~ ^ Y I Vi/t1 ~ n lo g r ^ where k h is the solution to the estimating equation
n log(K/n) + n +
log yj - ^
y j / f i - m p ( K ) = 0,
’
248
5 • Confidence Intervals with tp(fc) the derivative o f logr(K ). Describe an algorithm for simulating the distribution o f the log likelihood ratio statistic W( p ) = 2{
5
Consider simulation to estimate the distribution o f Z = (T — 6 ) / V 2, using R independent replicates with ordered values z[ < ■■■ < z ’R, where z" = (t’ —t ) / v ' 1/2 is based on nonparametric bootstrapping o f a sample y i , . . . , y „ . Let a = (r+ 1) / ( R + 1), so that a one-sided confidence interval for 6 with nominal coverage a is I r = [ t - v l/2z'r+l,co). (a) Show that Pr'(0 € I r | F) = Pr*(z < Z r'+1) = £ 5=0
( R ) f ( l - p f ~ s, W
where p = p(F) = P r'(Z ‘ < z \ F). Let P be the random variable corresponding to p(F), with C D F G( ). Hence show that the unconditional probability is Pr(0 6 Ir) = J 2 ( * ) f o “S( 1 - u)R~s dG(u). N ote that Pr(P < a) = Pr{0 6 [T — 7 1/2Z a',oo)}, where Z a* is the a quantile o f the distribution o f Z ', conditional on Y i , . . . , Y n. (b) Suppose that it is reasonable to approximate the distribution o f P by the beta distribution with density wa l (1 — u)b~l / B(a,b), 0 < u < 1; note that a, b—>\ as n—► o o . For som e representative values o f R, a, a and b, compare the coverage error o f I , with that o f the interval [T — V 1/2Z ’,oo). (Section 5.2.3; Hall, 1986) 6
Capability or precision indices are used to indicate whether a process satisfies a specification o f form (L, U ), where L and U are the lower and upper specification limits. If the process is “in control”, observations y i , . . . , y „ on it are taken to have mean p and standard deviation a. Two basic capability indices are then 9 = (U — L)/a and t] = 2 m in {([/ —p)/a,(p —L)/a], with precision regarded as low if 9 < 6, medium if 6 < 6 < 8, and high if 6 > 8, and similarly for r\, which is intended to be sensitive to the possibility that p ^ j ( L + U ) . Estimates o f 9 and r\ are obtained by replacing p and a with sample estimates, such as (i) the usual estimates p = y = n~l Y , y j and a = {(« — I)-1 Y ( y j ~ y)2}1/2; (ii) p — y and a — rk/dk, where rk = b~' Y l rKi and r/y is the range max yj — min yj o f the ith block o f k observations, namely yk(i-i)+i, ■■■, yki, where n = kb. Here du is a scaling factor chosen so that rk estimates a. (a) When estimates (i) are used, and the are independent N ( p , a 2) variables, show that an exact (1 — 2a) confidence interval for 8 has endpoints
s{ ^
f
•
where c„(a) is the a quantile o f the x, distribution. (b) With the set-up in (a), suppose that parametric simulation from the fitted normal distribution is used to generate replicate values 9 ’1, . . . , 8 ‘R o f 6. Show that for R = o o , the true coverage o f the percentile confidence interval with nominal
249
5.12 ■Problems coverage (1 — 2a) is
Pr i
(n -l)2 _ 2
„ (n -l)2
< t n_ x <
i. ^n—1,1—a
Cn—l,a
where C has the x l - 1 distribution. Give also the coverages o f the basic bootstrap confidence intervals based on 9 and log 6. Calculate these coverages for n = 25, 50, 75 and a = 0.05, 0.025, and 0.005. Which o f these intervals is preferable? (c) See Practical 5.4, in which we take d5 = 2.236. (Section 5.3.1) 7
Suppose that we have a parametric model with parameter vector tp, and that 9 = h(xp) is the parameter o f interest. The adjusted percentile ( B C a) method is found by applying the scalar parameter method to the least-favourable family, for which the log likelihood
8
For the ratio o f independent means in Example 5.10, show that the matrix o f second derivatives ii{n) has elements
n2t 1 2 ( y u - y i X y i j ~ y \ ) njyi I yi
uu,ij — ~ ^ r \ --------- =--------------- h (yu — y 0 + (y\j — y\)
uu.2j =
n2
r j {(yi,- - yi )(yn - h)},
n\n2y {
and ft
2
_
“2i,2j = — j~i ( y 2 ‘ ~ fo) + (yy ~ ^)}n2y i Use these results to check the value o f the constant c used in the A B C method in that example. For the data o f Example 1.2 we are interested in the ratio o f means 9 = E ( X ) / E ( U) . Define /j. = (E((7), E ( X ) ) T and write 9 = t(n), which is estimated by t = t(s) with s = («, 5c)t . Show that h 2/ h \ \ (' - I*2//^l /in )r ’ ” V lVm
. ___
-i = ( l -Hi /Hl
From Problem 2.16 we have lj = e j / u with
\-i/fii
— 1/Mi
o
= x; —tUj. Derive expressions for the
constants a, b and c in the nonparametric A B C method, and note that b = cv1/ 2-
5 ■Confidence Intervals
250 Hence show that the A B C confidence limit is given by ~ = x + d„ Y X j e j / ( n 2v l / 2u) u + d x Y l u j e j / ( n 2v l / 2u) ’
where da = (a + z « )/{ l - a(a + za)}2. Apply this result to the full dataset with n = 49, for which u = 103.14, x = 127.80, t = 1.239, vL = 0.0119, and a = 0.0205. (Section 5.4.2) 10
Suppose that the parameter 9 is estimated by solving the m onotone estimating equation S Y(0) = 0, with unique solution T. If the random variable c ( Y , 9 ) has (approximately or exactly) the known, continuous distribution function G, and if U ~ G, then define t v to be the solution to c(y, t v ) = U for a fixed observation vector y. Show that for suitable A, t — tu = —A ~ lc ( Y , 9 ) has roughly the same distribution as —A ~ l U = —A ' ' c ( y , t u ) = T — 9, and deduce that the distributions o f t — t v and T — 9 are roughly the same. The distribution o f t — tu can be approximated by simulation, and this provides a way to approximate the distribution o f T — 6. Comment critically on this resampling confidence limit method. (Parzen, Wei and Ying, 1994)
11
Consider deriving an upper confidence limit for 9 by test inversion. If T is an estimator for 8, and S is an estimator for nuisance parameter X, and if var(T | 9, X) = a2(9,X), then define Z = (T — 90)/a(90,S). Show that an exact upper 1 —a confidence limit is U\ = Ui_a(t,s, X) which satisfies
The bootstrap confidence limit is « i_ „ = ui_a(r, s, s). Show that if S is a consistent estimator for X then the method is consistent in the sense that Pr(0 < tii_a) = 1 — a + o(l). Further show that under certain conditions the coverage differs from 1 — a by 0 (n _1). (Section 5.5; Kabaila, 1993a; Carpenter, 1996) 12
The normal approximation method for an upper 1 — a confidence limit gives = 9 + z i_ at)1/2. Show that bootstrap adjustment o f the nominal level 1 — a in z i - a leads to the studentized bootstrap method. (Section 5.6; Beran, 1987)
13
The bootstrap m ethod o f adjustment can be applied to the percentile method. Show that the analogue o f (5.55) is
s is consistent for k if s = A+ op(i) as n->oo.
Pr'{Pr**(T” < t | F") < 1 - q(a) | F} = 1 - a. The adjusted 1 — a upper confidence limit is then the 1 — q(a) quantile o f T*. In the parametric bootstrap analysis for a single exponential mean, show that the percentile method gives upper 1 — a limit > ’C 2 „ , i - a / ( 2 n ) . Verify that the bootstrap adjustment o f this limit gives the exact upper 1 — a limit 2 n y / c 2n,tt(Section 5.6; Beran, 1987; Hinkley and Shi, 1989) 14
Show how to make a bootstrap adjustment o f the studentized bootstrap confidence limit method for a scalar parameter. (Section 5.6)
cv is the a quantile of the
distribution.
251
5.13 ■Practicals 15
For an equi-tailed (1 — 2a) confidence interval, the ideal endpoints are t + p with values o f P solving (3.31) with h(F, F ; P ) = I {t(F) - t(F) < 0} - a,
h(F, F; P) = I {t ( F) - t(F) < p } - (1 - a).
Suppose that the bootstrap solutions are denoted by [i? and P t- a., and that in the language o f Section 3.9.1 the adjustments b(F, y) are /Ja+?1 and /?i_a+w. Show how to estimate yi and y2, and verify that these adjustments modify coverage 1 — 2a + 0 (n _1) to 1 — 2a + 0(n~2). (Sections 3.9.1, 5.6; Hall and Martin, 1988) 16
Suppose that D is an approximate ancillary statistic and that we want to estimate the conditional probability G(u | d) = Pr(T — 9 < u \ D = d) using R simulated values (t’,d"r). One sm ooth estimate is the kernel estimate
G(„ I d ) , £ f= i W{h-'(d;-d)} where w( ) is a density symmetric about zero and h is an adjustable bandwidth. Investigate the bias and variance o f this estimate in the case where ( T , D ) is ap proximately bivariate normal and w( ) =
Suppose that ( T , D) are approximately bivariate normal, with D an ancillary statistic upon whose observed value d we wish to condition when calculating confidence intervals. If the adjusted percentile method is to be used, then we need conditional evaluations o f the constants a, vL and w. One approach to this is based on selecting the subset o f the R bootstrap samples for which d' = d. Then w can be calculated in the usual way, but restricted to this subset. For a and vL we need empirical influence values, and these can be approximated by the regression method o f Section 2.7.4, but using only the selected subset o f samples. Investigate whether or not this approach makes sense. (Section 5.9)
18
Suppose that y \ , . . . , y„ are sampled from an unknown distribution, which is known to be symmetric about its median. Then to calculate a 1 — a upper prediction limit for a further observation Y„+1 , the plug-in approach would use the 1 — a quantile o f the symmetrized E D F (Example 3.4). D evelop a resampling algorithm for obtaining a bias-corrected prediction limit. (Section 5.10)
19
For estimating the mean /i o f a population with unknown variance, we want to find a (1 — 2a) confidence interval with specified length i. Given data y {,...,y„, consider the following approach. Create bootstrap samples o f sizes N = n , n + 1,... and calculate confidence intervals (e.g. by the studentized bootstrap method) for each N. Then choose as total sample size that N for which the interval length is if or less. An additional N — n data values are then obtained, and a bootstrap confidence interval applied. Discuss this approach, and investigate it numerically for the case where the data are sampled from a N(n,cr2) distribution.
5.13 Practicals 1
Suppose that we wish to calculate a 90% confidence interval for the correlation 9 between the two counts in the colum ns o f cd4; see Practical 2.3. To obtain
252
5 ■Confidence Intervals
confidence intervals for 9 under nonparametric resampling, using the empirical influence values to calculate vl ■
cd4.boot <- boot(cd4, corr.fun, stype="w", R=999) bo o t .ci(cd4.boot,conf=0.9) To obtain intervals on the variance-stabilized scale, i.e. based on t = ilog{(l+ 0)/(l-0)} :
fisher <- function(r) 0.5*log((l+r)/(l-r)) fisher.dot <- function(r) l/(l-r~2) fisher.inv <- function(z) (exp(2*z)-l)/(exp(2*z)+l) bo o t .c i (cd 4 .boot,h=f isher,hdot=f isher.d o t ,hinv=f isher.inv,conf =0.9) How well do the intervals compare? Is the normal approximation reliable here? To compare intervals under parametric simulation from a fitted bivariate normal distribution:
cd4.rg <- function(data, mle) { d <- matrix(rnorm(2*nrow(data)), nrow(data), 2) d[,2] <- mle [5] *d[, 1]+sqrt (1-mle [5] "2)*d[,2] d[,l] <- m l e [1]+mle[3]*d[, 1] d[,2] <- mle [2]+mle [4] * d [,2] d > n <- nrow(cd4) cd4.mle <- c (apply(cd4,2,mean),sqrt(apply(cd4,2,var)*(n-l)/n), corr(cd4)) cd4.para <- boot(cd4, corr.fun, R=999, sim="parametric", ran.gen = cd4.rg, mle=cd4.mle) bo o t .ci(cd4.para,type=c("norm","basic","stud","perc"),conf=0.9) b o o t .ci(cd4.para,h=fisher,hdot=fisher.dot,hinv=fisher.inv, type=c("norm","basic","stud","perc"),conf=0.9) To obtain the corresponding interval using the nonparametric ABC method:
abc.ci(cd4, corr, conf=0.9) D o the differences among the various intervals reflect what you would expect? (Sections 5.2, 5.3, 5.4.2; D iC iccio and Efron, 1996).
Suppose that we wish to calculate a 90% confidence interval for the largest eigenvalue 9 o f the covariance matrix o f the two counts in the colum ns o f cd4; see Practicals 2.3 and 5.1. To obtain confidence intervals for 9 under nonparametric resampling, using the empirical influence values to calculate vL :
eigen.fun <- function(d, w = rep(l, nrow(d))/nrow(d)) { w <- w/sum(w) n <- nrow(d) m <- crossprod(w, d) m2 <- sweep(d,2,m) v <- crossprod(diag(sqrt(w)) ■ /.*■ /, m2) eig <- eigen(v,symmetric=T) stat <- eig$values[l] e <- eig$vectors[,l] i <- rep(l:n,round(n*w)) ds <- sweep(d[i,],2,m)
5.13 ■Practicals
253
L <- (ds/C*7,e)~2 - stat c(stat, sum(L~2)/n~2) } cd4.boot <- boot(cd4,eigen.fun,R=999,stype="w") boot.ci(cd4.boot, conf=0.90) abc.ci(cd4, eigen.fun, conf=0.9) Discuss the differences among the various intervals. (Sections 5.2, 5.3, 5.4.2; D iC iccio and Efron, 1996) 3
Dataframe am is contains data made available by G. Amis o f Cambridgeshire County Council on the speeds in miles per hour o f cars at pairs o f sites on roads in Cambridgeshire. Speeds were measured at each site before and then again after the erection o f a warning sign at one site o f each pair. The quantity o f interest is the mean relative change in the 0.85 quantile, o f the speeds for each pair, i.e. the mean o f the quantities (rjai —r]bl) — (rjao—Vbo)', here »/m and r\ai are the 0.85 quantiles o f the speed distribution at the site where the sign was placed, before and after its erection. This quantity is chosen because the warning is particularly intended to slow faster drivers. A bout 100 speeds are available for each com bination o f 14 pairs o f sites and three periods, one before and two after the warnings were erected, but some o f the pairs overlap. We work with a slightly smaller dataset, for which the rjs are:
amisl <- amis[(amis$pair!=4)&(amis$pair!=6)&(amis$period!=3),] tapply(amisl$speed, list(amisl$period,amisl$warning,amisl$pair), quantile, 0.85) To attempt to set confidence intervals for 6, by stratified resampling from the speeds at each com bination o f site and period:
amis.fun <- function(data, i) { d <- data[i, ] d <- tapply(d$speed,list(d$period,d$warning,d$pair).quantile,0.85) m ean((d[2,1, ] - d[l,l, ]) - (d[2,2, ] - d[l,2, ])) > str <- 4*(amisl$pair-l)+2*(amisl$warning-l)+amisl$period amisl.boot <- boot(amisl,amis.fun,R=99,strata=str) amisl,boot$t0 qqnonn(amisl.boot$t) abline(mean(amisl,boot$t),sqrt(var(amisl,boot$t)),lty=2) boot.ci(amisl.boot,type=c("basic","perc","norm"),conf=0.9) (There are 4800 cases in a m isl so this is demanding on memory: it may be necessary to increase the o b j e c t . s i z e .) D o the resampled averages look normal? Can you account for the differences am ong the intervals? How big is the average effect o f the warnings? (Section 5.2) 4
Dataframe c a p a b i l i t y gives “data” from Bissell (1990) comprising 75 successive observations with specification limits U = 5.79 and L — 5.49; see Problem 5.6. To check that the process is “in control” and that the data are close to independent normal random variables:
par(mfrow=c(2,2)) tsplot(capabilityly,ylim=c(5,6)) abline(h=5.79,lty=2); abline(h=5.49,lty=2) qqnorm(capability$y) acf(capabilitySy)
254
5 ■Confidence Intervals
acf(capability$y,type="partial") To find nonparametric confidence limits for rj using the estimates given by (ii) in Problem 5.6:
capability.fun <- function(data, i, U=5.79, L=5.49, dk=2.236) { y <- data$y[i] m <- mean(y) r5 <- apply(matrix(y,15,5), 1, function(y) diff(range(y))) s <- mean(r5)/dk 2*min((U-m)/s, (m-L)/s) > capability.boot <- boot(capability, capability.fun, R=999) b o o t .ci(capability.boot,type=c("norm","basic","perc")) D o the values o f t* look normal? Why is there such a difference between the percentile and basic bootstrap limits? W hich do you think are more reliable here? (Sections 5.2, 5.3)
Following on from Practical 2.3, w e use a double bootstrap with M = 249 to adjust the studentized bootstrap interval for a correlation coefficient applied to the cd4 data. nested.corr <- function(data, w, tO, M) { n <- nrow(data) i <- rep(l:n,round(n*w)) t <- corr.fun(data, w ) z <- (t[l]-t0)/sqrt(t[2]) nested.boot <- boot(data[i,], corr.fun, R=M, stype="w") z.nested <- (nested.boot$t[,1]—t [1])/sqrt(nested.boot$t[,2]) c(z, sum(z.nested
cd4.boot <- boot(cd4.nested.corr,R=99,stype="w",tO=corr(cd4),M=249) junk <- boot(cd4,nested.corr,R=100,stype="w",tO=corr(cd4),M=249) cd4.boot$t <- rbind(cd4.boot$t,junk$t) cd4.boot$R <- cd4.boot$R+junk$R but with the last three lines repeated eight further times. cd4.nested contains a nested simulation we did earlier. T o compare the actual and nominal coverage levels: par(pty="s") qqplot((1:c d4.nested$R)/ (l+cd4.nested$R),cd4.nested$t[,2], xlab="nominal coverage",ylab="estimated coverage",pch=".") lines(c(0,l),c(0,l)) How close to nominal is the estimated coverage? To read off the original and corrected 95% confidence intervals:
q <- c(0.975,0.025) q.adj <- quantile(cd4.nested$t[,2],q) tO <- corr.fun(cd4) z <- sort(cd4.nested$t[,1])
5.13 ■Practicals
255
t O [1]-sqrt(tO[2])*z[floor((l+cd4.nested$R)*q)] t O [1]-sqrt(tO[2])*z[floor((l+cd4.nested$R)*q.adj)]
Does the correction have much effect? Compare this interval with the correspond ing ABC interval. (Section 5.6)
6 Linear Regression
6.1 Introduction O ne o f the m ost im p o rta n t and frequent types o f statistical analysis is re gression analysis, in which we study the effects o f explanatory variables or covariates on a response variable. In this chap ter we are concerned with Unear regression, in which the m ean o f the ran d o m response Y observed at value x = ( x i,. . . , x p)T o f the explanatory variable vector is E ( y | x) = n(x) = x Tp. The m odel is com pleted by specifying the natu re o f random variation, which for independent responses am o u n ts to specifying the form o f the variance v a r(7 | x). F or a full p aram etric analysis we would also have to specify the distribution o f Y , be it norm al, Poisson o r w hatever. W ithout this, the m odel is sem iparam etric. F or linear regression w ith norm al ran d o m errors having co n stan t variance, the least squares theory o f regression estim ation and inference provides clean, exact m ethods for analysis. But for generalizations to non-norm al errors and non-con stan t variance, exact m ethods rarely exist, and we are faced with approxim ate m ethods based o n linear approxim ations to estim ators and central lim it theorem s. So, ju s t as in the sim pler context o f C hapters 2-5, resam pling m ethods have the poten tial to provide m ore accurate analysis. We begin o u r discussion in Section 6.2 w ith simple least squares linear re gression, where in ideal conditions resam pling essentially reproduces the exact theoretical analysis, b u t also offers the p o tential to deal with non-ideal cir cum stances such as non-co n stan t variance. Section 6.3 covers the extension to m ultiple explanatory variables. The related topics o f aggregate prediction erro r an d o f variable selection based on predictive ability are discussed in Section 6.4. R obust m ethods o f regression are exam ined briefly in Section 6.5.
256
257
6.2 ■Least Squares Linear Regression
Figure 6.1 Average body weight (kg) and brain weight (g) for 62 species of mammals, plotted on original scales and logarithmic scales (Weisberg, 1985, p. 144).
Body weight
Body weight
T he furth er topics o f generalized linear models, survival analysis, other n o n linear regression, classification error, and nonparam etric regression m odels are deferred to C h ap ter 7.
6.2 Least Squares Linear Regression 6.2.1 Regression fit and residuals T he left panel o f Figure 6.1 shows the scatter plot o f response “brain w eight” versus explanatory variable “body w eight” for n = 62 m am m als. As the right panel o f the figure shows, the d ata are well described by a simple linear regression after the two variables are transform ed logarithm ically, so th at y = log(brain weight),
x = log(body weight).
The simple linear regression m odel is Yj =
+ Pi xj + ej,
j= l,...,n,
(6.1)
w here the EjS are uncorrelated w ith zero m eans and equal variances a 2. This constancy o f variance, or hom oscedasticity, seems roughly right for the example data. We refer to the d a ta (x j , y j ) as the y'th case. In general the values Xj m ight be controlled (by design), random ly sampled, o r m erely observed as in the example. But we analyse the d a ta as if the x,s were fixed, because the am o u n t o f inform ation ab o u t ft = (/fo, l h ) T depends u p o n their observed values. The sim plest analysis o f d a ta under (6.1) is by the ordinary least squares
6 • Linear Regression
258
m ethod, on which we concentrate here. The least squares estim ates for (i are ,
h = y - Pi*,
(6 .2 )
where x = n 1 Y XJ an d = ^ = i ( x; — x )2- T he conventional estim ate o f the error variance er2 is the residual m ean square
where
ei = yj - A>
(6.3)
A/ = Po + Plxj
(6.4)
are raw residuals with
the fitted values, or estim ated m ean values, for the response at the observed x values. The basic properties o f the p aram eter estim ates Po, Pi, which are easily obtained u n d er m odel (6.1), are (6.5) and E(j?i) =
Pu
(6.6)
var(j?i) =
The estim ates are norm ally distributed and optim al if the errors e;- are norm ally distributed, they are often approxim ately norm al for other erro r distributions, b u t they are n o t robust to gross non-norm ality o f errors or to outlying response values. The raw residuals e} are im p o rtan t for various aspects o f m odel checking, and potentially for resam pling m ethods since they estim ate the random errors Ej, so it is useful to sum m arize their properties also. U nder (6.1), n (6.7) k= 1
where
with djk equal to 1 if j = k an d zero otherwise. T he quantities hjj are know n as leverages, an d for convenience we denote them by hj. It follows from (6.7) th a t E(e; ) = 0,
var(e; ) = tx2( l
-hj).
68
( . )
259
6.2 • Least Squares Linear Regression
O ne consequence o f this last result is th a t the estim ator S 2 th a t corresponds to s2 has expected value a 2, because £)(1 — hj) = n — 2. N ote th a t w ith the intercept o in the m odel, YI ej = 0 autom atically. T he raw residuals e} can be modified in various ways to m akes them suitable for diagnostic m ethods, b u t the m ost useful m odification for our purposes is to change them to have co n stan t variance, th a t is • Standardized residuals are called studentized residuals by some authors.
1
(6.9)
(i - h j W
We shall refer to these as modified residuals, to distinguish them from standard ized residuals which are in addition divided by the sam ple standard deviation. A norm al Q -Q p lo t o f the r;- will reveal obvious outliers, or clear non-norm ality o f the ran d o m errors, alth o u g h the latter m ay be obscured som ew hat because o f the averaging pro p erty o f (6.7). A sim pler m odification o f residuals is to use 1 — h = 1 — 2n-1 instead o f individual leverages 1 — hj, where h is the average leverage; this will have a very sim ilar effect only if the leverages hj are fairly hom ogeneous. This simpler m odification implies m ultiplication o f all raw residuals by (1 — 2n~1)~]/'2: the average will equal zero autom atically because ^ ej = 0. I f (6.1) holds w ith hom oscedastic random errors e; and if those random errors are norm ally distributed, or if the dataset is large, then stan d ard distri butio n al results will be adequate for draw ing inferences w ith the least squares estim ates. But if the errors are very non-norm al o r heteroscedastic, m eaning th a t their variances are unequal, then those stan d ard results m ay n o t be reliable an d a resam pling m ethod m ay offer genuine im provem ent. In Sections 6.2.3 an d 6.2.4 we describe two quite different resam pling m ethods, the second o f w hich is robust to failure o f the m odel assum ptions. I f strong non-norm ality o r heteroscedasticity (which can be difficult to distinguish) ap p ear to be present, then robust regression estim ates m ay be considered in place o f least squares estim ates. These will be discussed in Section 6.5.
6.2.2 Alternative models T he linear regression m odel (6.1) can arise in two ways, and for our purposes it can be useful to distinguish them. First formulation T he first possibility is th a t the pairs are random ly sam pled from a bivariate distrib u tio n F for (X, 7 ). T hen linear regression refers to linearity o f the conditional m ean o f Y given X = x, th a t is E(Y
IX
= x) =
fly
+ y{x — H x ) ,
y =
0 x y / 0 x2 ,
(6-10)
260
6 ■Linear Regression
w ith n x = E(X ), fly = E(Y ), a 2 = \a.r(X) and axy = cov(X, Y). This condi tional m ean corresponds to the m ean in (6.1), w ith
Po = H y - y f i x,
Pi=y-
(6.11)
T he param eters ft = (Po,Pi)T are here seen to be statistical functions o f the kind m et in earlier chapters, in this case based on the first and second m om ents o f F. The ran d o m errors t.j in (6.1) will be hom oscedastic with respect to x if F is bivariate norm al, for exam ple, b u t n o t in general. The least squares estim ators (6.2) correspond to the use o f sam ple m om ents in (6.10). F or future reference we n ote (Problem 6.1) th a t the influence function for the least squares estim ators t = (/?o, Pt )T is the vector
L^
<612>
F> = C - S ?
T he em pirical influence values as defined in Section 2.7.2 are therefore (1 -n x (x j-x )/S S x \ '< = { n(Xj — x ) / S S x ) “■
(6' 13)
T he nonparam etric delta m ethod variance approxim ation (2.36) applied to [1] gives vl
Y, { x j — x)2e2j = — -S S 2 1■
(6-14)
This m akes no assum ption o f hom oscedasticity. In practice we m odify the variance approxim ation to account for leverage, replacing ej by r, as defined in (6.9). Second formulation The second possibility is th a t a t any value o f x, responses Yx can be sam pled from a distribution Fx(y) whose m ean an d variance are n(x) and
E (xj - x)n(xj) SS X
In principle several responses could be obtained at each xj. Simple linear regression w ith hom oscedastic errors, w ith which we are initially concerned, corresponds to cr(x) = a and Fx(y) = G { y - r t x ) } .
(6.15)
So G is the distribution o f ran d o m error, w ith m ean zero and variance a 2. A ny p articu lar application is characterized by the design x i ,...,x „ and the corresponding d istributions Fx, the m eans o f which are defined by linear regression.
6.2 • Least Squares Linear Regression
261
The influence function for the least squares estim ator is again given by (6.12), b u t w ith fix and a \ respectively replaced by x and n~' J2(x j ~ *)2Em pirical influence values are still given by (6.13). The analogue o f linear approxim ations (2.35) an d (3.1) is $ = fi + n~x Lt { ( xj , y j) ; F} , w ith vari ance n_ 2 ^ " =1 v ar [Lt{( xj, Yj) ;F}]. If the assum ed hom oscedasticity o f errors is used to evaluate this, w ith the constant variance a 2 estim ated by n~l ep then the delta m ethod variance approxim ation for /?i, for example, is 'Z i. nSSx ’ strictly speaking this is a sem iparam etric approxim ation. This differs by a factor o f (n — 2) / n from the stan d ard estim ate, which is given by (6.6) with residual m ean square s2 in place o f a 2. The stan d ard analysis for linear regression as outlined in Section 6.2.1 is the sam e for b o th situations, provided the random errors ej have equal variances, as w ould usually be jud g ed from plots o f the residuals.
6.2.3 Resam pling errors To extend the resam pling algorithm s o f C hapters 2-3 to regression, we have first to identify the underlying m odel F. Now if (6.1) is literally correct with hom oscedastic errors, then those errors are effectively sam pled from a single distribution. I f the x; s are treated as fixed, then the second form ulation o f Section 6.2.2 applies, G being the com m on error distribution. The m odel F is the series o f distributions Fx for x = x i,...,x „ , defined by (6.15). The resam pling m odel is the corresponding series o f estim ated distributions Fx in which each /i(xy) is replaced by the regression fit p.(xj) and G is estim ated from all residuals. F or p aram etric resam pling we would estim ate G according to the assum ed form o f error distribution, for exam ple the N ( 0 , s 2) distribution if norm ality were ju d g ed appropriate. (O f course resam pling is n o t necessary for the norm al linear m odel, because exact theoretical results are available.) For nonparam etric resam pling, on which we concentrate in this chapter, we need a generalization o f the E D F used in C h ap ter 2. I f the random errors Ej were known, then their E D F w ould be appropriate. As it is we have the raw residuals ej which estim ate the e; , and their E D F will usually be consistent for G. But for practical use it is better to use the residuals r,- defined in (6.9), because their variances agree w ith those o f the e; . N oting th a t G is assum ed to have m ean zero in the m odel, we then estim ate G by the E D F o f rj — f, where r is the average o f the rj. These centred residuals have m ean zero, and we refer to their E D F as G. The full resam pling m odel is taken to have the same “design” as the data, th a t is x* = X j ; it then specifies the conditional distribution o f YJ given x*
262
6 ■Linear Regression
through the estim ated version o f (6.1), which is Y j = p . j + ep
j =
(6.16)
w ith p.j = + [Six’ an d ej random ly sam pled from G. So the algorithm to generate sim ulated datasets an d corresponding param eter estim ates is as follows. Algorithm 6.1 (Model-based resampling in linear regression) For r = 1 1 F or j = 1, . . . , n , (a) set x j = Xj\ (b) random ly sam ple ej from r
i
. . , r „ — r; then
(c) set yj = P o + j?ix j + ej. 2 Fit least squares regression to ( x j ,y j ) ,. ..,(x * ,y * ), giving estim ates Po,r’ P \ j ’ Sr2•
The resam pling m eans an d variances o f Pq an d p \ will agree very closely w ith sta n d a rd least squares theory. To see this, consider for exam ple the slope estim ate, whose b o o tstrap sam ple value can be w ritten a.
a , E (* )-* > 2 “ f t +
SS,
'
Because E*(e*) = n r 1 Y ( rj — r) = 0, it follows th a t E*(j?j) = Pi. Also, because var*(e*) = n_1 £ " =1(r; ~ Ff for a11 J, . y^(x; — x)2var*(£;) , v ar (Pi) = -----------^ -------- J- = n ^ ( r , - - r f / S S x. The latter will be approxim ately equal to the usual estim ate s2/ S S x, because n_1 Y;(rj ~ r ) 2 = (n ~ 2)~'
e] = s2- 1° fact if the individual hj are replaced by
their average h, then the m eans an d variances o f Pq and p \ are given exactly by (6.5) an d (6.6) w ith the estim ates Pq, P i an d s2 substituted for param eter values. T he advantage o f resam pling is im proved quantile estim ation when norm al-theory distributions o f the estim ators Pq, P i , S 2 are n o t accurate. Example 6.1 (M am m als) F or the d a ta plotted in the right panel o f Figure 6.1, the simple linear regression m odel seems appropriate. S tan d ard analysis sug gests th a t errors are approxim ately norm al, although there is a small suspicion o f heteroscedasticity: see Figure 6.2. T he p aram eter estim ates are Po = 2.135 and Pi = 0.752. From R = 499 b o o tstra p sim ulations according to the algorithm above, the
263
6.2 ■Least Squares Linear Regression
Figure 6.2 Normal Q-Q plot of modified residuals r;- and their plot against leverage values hj for linear regression fit to log-transformed mammal data.
co 3 TD
tO 3 ■o
■o
■0o> "D O
Quantiles of Standard Normal
Leverage h
estim ated sta n d a rd errors o f intercept and slope are respectively 0.0958 and 0.0273, com pared to the theoretical values 0.0960 and 0.0285. The em pirical distributions o f b o o tstra p estim ates are alm ost perfectly norm al, as they are for the studentized estim ates. T he estim ated 0.05 and 0.95 quantiles for the studentized slope estim ate
sE{fay w here SE(fS\) is the stan d ard error for obtained from (6.6), are z*25) = —1.640 an d z'475) = 1.5 89, com pared to the stan d ard norm al quantiles +1.645. So, as expected for a m oderately large “clean” dataset, the resam pling results agree closely w ith those obtained from stan d ard m ethods. ■ Zero intercept In som e applications the intercept f o will n o t be included in (6.1). This affects the estim ation o f Pi and a 2 in obvious ways, b u t the resam pling algorithm will also differ. First, the leverage values are different, nam ely
so the m odified residual will be different. Secondly, because now e; 0, it is essential to m ean-correct the residuals before using them to sim ulate random errors. Repeated design points I f there are rep eat observations a t som e or all values o f x, this offers an enhanced o p p o rtu n ity to detect heteroscedasticity: see Section 6.2.6. W ith
264
6 • Linear Regression
m any such repeats it is in principle possible to estim ate the C D F s Fx separately (Section 6.2.2), b u t there is rarely enough d a ta for this to be useful in practice. T he m ain advantage o f repeats is the o p portunity it affords to test the adequacy o f the linear regression form ulation, by splitting the residual sum o f squares into a “pure e rro r” com ponent an d a “goodness-of-fit” com ponent. To the extent th a t the com parison o f these com ponents through the usual F ratio is quite sensitive to non-norm ality and heteroscedasticity, resam pling m ethods m ay be useful in interpreting th a t F ratio (Practical 6.3).
6.2.4 Resam pling cases A com pletely different approach w ould be to im agine the d a ta as a sam ple from som e bivariate distribution F o f (X , Y). This will sometimes, b u t not often, mimic w hat actually happened. In this approach, as outlined in Section 6.2.2, the regression coefficients are viewed as statistical functions o f F, and defined by (6.10). M odel (6.1) still applies, b u t w ith no assum ption on the random errors e7 other th an independence. W hen (6.10) is evaluated a t F we obtain the least squares estim ates (6.2). W ith F now the bivariate distribution o f (X, Y ), it is appropriate to take F to be the E D F o f the d a ta pairs, an d resam pling will be from this ED F, ju st as in C h ap ter 2. T he resam pling sim ulation therefore involves sam pling pairs w ith replacem ent from { x \ , y \ ) , . . . , (x„,y„). This is equivalent to taking (x,*,y*) = (x i , y i ), where I is uniform ly distributed on {1 ,2 ,...,n } . Sim ulated values Pq, fi\ o f the coefficient estim ates are com puted from (xj,_y*),...,(x*,y*) using the least squares algorithm which was applied to obtain the original estim ates feo, fi\. So the resam pling algorithm is as follows. Algorithm 6.2 (Resampling cases in regression) F or r = 1 sam ple i\ , r a n d o m l y w ith replacem ent from {1,2 2 for j = 1 ,..., n, set x j = x,-, y j = y ;•; then 3 fit least squares regression to ( x \ , y \ ) , ... ,(x*n,y*n), giving estim ates K r K ’ sr2• There are two im p o rtan t differences betw een this second b o o tstrap m ethod and the previous one using a p aram etric m odel an d sim ulated errors. First, w ith the second m ethod we m ake no assum ption ab o u t variance hom ogeneity — indeed we do n o t even assum e th a t the conditional m ean o f Y given X = x is linear. This offers the advantage o f potential robustness to heteroscedasticity, and the disadvantage o f inefficiency if the constant-variance m odel is correct. Secondly, the sim ulated sam ples have different designs, because the values
The model E(Y | X = x) = a + /?i(x —x), which some writers use in place of (6.1), is not useful here because a = /fo 4- fi\x is a function not only of F but also of the data, through x.
265
6.2 ■Least Squares Linear Regression Mammals data. Comparison of bootstrap biases and standard errors of intercept and slope with theoretical results, standard and robust. Resampling cases with Table 6.1
R = 999.
f>i
T heoretical
R esam pling cases
R o b u st theoretical
bias sta n d a rd e rro r
0 0.096
0.0006 0.091
— 0.088
bias sta n d a rd e rro r
0 0.0285
0.0002 0.0223
0.0223
_
x j ,...,x * are random ly sam pled. The design fixes the inform ation content o f a sample, and in principle o u r inference should be specific to the inform ation in o u r data. The variation in x j , . . . , x ’ will cause some variation in inform ation, b u t fortunately this is often u n im p o rtan t in m oderately large datasets; see, however, Exam ples 6.4 and 6.6. N ote th a t in general the resam pling distribution o f a coefficient estim ate will not have m ean equal to the d a ta estim ate, contrary to the unbiasedness property th a t the estim ate in fact possesses. However, the difference is usually negligible. Example 6.2 (M ammals) F or the d ata o f Exam ple 6.1, a b o o tstra p sim ulation was run by resam pling cases with R = 999. Table 6.1 shows the bias and stan d ard error results for b o th intercept and slope. The estim ated biases are very small. T he striking feature o f the results is th at the stan d ard erro r for the slope is considerably sm aller than in the previous b o o tstrap sim ulation, which agreed w ith stan d ard theory. The last colum n o f the table gives robust versions o f the stan d ard errors, which are calculated by estim ating the variance o f Ej to be rj. For exam ple, the robust estim ate o f the variance o f (it is
This corresponds to the delta m ethod variance approxim ation (6.14), except th a t rj is used in preference to e; . As we m ight have expected from previous discussion, the b o o tstrap gives an approxim ation to the robust stan d ard error. A A Figure 6.3 shows norm al Q -Q plots o f the b o o tstra p estim ates Pq and fi'. F or the slope p aram eter the right panel shows lines corresponding to norm al d istributions w ith the usual and the robust stan d ard errors. T he distribution o f Pi is close to norm al, with variance m uch closer to the robust form (6.17) th an to the usual form (6.6). ■ One disadvantage o f the robust stan d ard error is its inefficiency relative to the usual stan d ard erro r when the latter is correct. A fairly straightforw ard calculation (Problem 6.6) gives the efficiency, which is approxim ately 40% for the slope p aram eter in the previous example. T hus the effective degrees o f freedom for the robust stan d ard error is approxim ately 0.40 times 62, or 25.
6 • Linear Regression
266
Quantiles of standard normal
Quantiles of standard normal
The sam e loss o f efficiency would apply approxim ately to b o o tstrap results for resam pling cases.
6.2.5 Significance tests for slope Suppose th a t we w ant to test w hether or n o t the covariate x has an effect on the response y, assum ing linear regression is appropriate. In term s o f m odel param eters, the null hypothesis is Ho : fi\ = 0. If we use the least squares estim ate as the basis for such a test, then this is equivalent to testing the Pearson correlation coefficient. This connection im m ediately suggests one nonparam etric test, the p erm u tatio n test o f Exam ple 4.9. However, this is not always valid, so we need also to consider o th er possible b o o tstrap tests. Permutation test The p erm u tatio n test o f co rrelation applies to the null hypothesis o f inde pendence betw een X and Y when these are b o th random . Equivalently it applies when the null hypothesis implies th a t the conditional distribution o f Y given X = x does n o t depend upon x. In the context o f linear regression this m eans n o t only zero slope, b u t also constant erro r variance. The justification then rests sim ply on the exchangeability o f the response values under the null hypothesis. If we use AT(.) to denote the ordered values o f X \ , . . . , X n, and so forth, then the exact level o f significance for one-sided alternative H a '■Pi > 0 and test statistic T is p
=
Pr ( T > t | X (.) = x (.), y(.) = )>(.), H 0)
-
Pr [T > 1 1X = x, Y = p e rm j^ .)} ],
Figure 63 Normal plots for bootstrapped estimates of intercept (left) and slope (right) for linear regression fit to logarithms of mammal data, with R = 999 samples obtained by resampling cases. The dotted lines give approximate normal distributions based on the usual formulae (6.5) and (6.6), while the dashed line shows the normal distribution for the slope using the robust variance estimate (6.17).
6.2 ■L east Squares Linear Regression
267
where perm { } denotes a perm utation. Because all perm utations are equally likely, we have # o f perm utations such th a t T > t
P = --------------------n!i-------------------’ as in (4.20). In the present context we can take T = fii, for which p is the same as if we used the sam ple Pearson correlation coefficient, b u t the same m ethod applies for any ap p ro p riate slope estim ator. In practice the test is perform ed by generating sam ples ( x j ,y j ) ,. ..,(x * ,y * ) such th a t x* = x j and (_ y j,...,y ’ ) is a ran d o m p erm u tatio n o f ( y i , . . . , y n), and fitting the least squares slope estim ate jSj. If this is done R times, then the one-sided P-value for alternative H A : fi i > 0 is P
# { fr> M + i R + 1
It is easy to show th a t studentizing the slope estim ate would n o t affect this test; see Problem 6.4. The test is exact in the sense th at the P-value has a uniform distrib u tio n under Ho, as explained in Section 4.1; note th at this uniform distribution holds conditional on the x values, which is the relevant property here. First bootstrap test A b o o tstrap test whose result will usually differ negligibly from th a t o f the p erm u tatio n test is obtained by taking the null m odel as the pair o f m arginal E D F s o f x an d y , so th a t the x*s are random ly sam pled with replacem ent from the X j S , and independently the y * s are random ly sam pled from the y j s. A gain is the slope fitted to the sim ulated data, and the form ula for p is the same. As w ith the p erm u tatio n test, the null hypothesis being tested is stronger than ju st zero slope. The p erm u tatio n m ethod and its b o o tstrap look-alike apply equally well to any slope estim ate, n o t ju st the least squares estimate. Second bootstrap test The next b o o tstrap test is based explicitly on the linear m odel structure with hom oscedastic errors, and applies the general approach o f Section 4.4. The null m odel is the null m ean fit and the E D F o f residuals from th a t fit. We calculate the P-value for the slope estim ate under sam pling from this fitted model. T h a t is, d a ta are sim ulated by
x) =
xp
yj =
£;0 + 8}o>
w here pjo = y an d the £*0 are sam pled with replacem ent from the null m odel residuals e^o = yj ~ y , j = 1 , The least squares slope /Jj is calculated from the sim ulated data. A fter R repetitions o f the sim ulation, the P-value is calculated as before.
268
6 ■Linear Regression
This second b o o tstrap test differs from the first b o o tstrap test only in th at the values o f explanatory variables x are fixed at the d a ta values for every case. N ote th a t if residuals were sam pled w ithout replacem ent, this test would duplicate the exact p erm u tatio n test, which suggests th at this boo tstrap test will be nearly exact. The test could be m odified by standardizing the residuals before sam pling from them , which here w ould m ean adjusting for the constant null m odel leverage n-1 . This w ould affect the P-value slightly for the test as described, b u t not if the test statistic were changed to the studentized slope estimate. It therefore seems wise to studentize regression test statistics in general, if m odel-based sim ulation is used; see the discussion o f b o o tstrap pivot tests below. Testing non-zero slope values All o f the preceding tests can be easily modified to test a non-zero value o f Pi. If the null value is /?i,o, say, then we apply the test to m odified responses yj — PiflXj, as in Exam ple 6.3 below. Bootstrap pivot tests F u rther b o o tstrap tests can be based on the studentized b o o tstrap approach outlined in Section 4.4.1. F or simplicity suppose th at we can assum e ho m o scedastic errors. T hen Z = ([S\ — Pi)/S\ is a pivot, where Si is the usual standard error for As a pivot, Z has a distribution not depending upon param eter values, an d this can be verified under the linear m odel (6.1). The null hypothesis is Ho : Pi = 0, and as before we consider the one-sided alternative H a : Pi > 0. T hen the P-value is p = Pr
-
P i = 0, P o, c r
-
Pi,Po,
because Z is a pivot. T he probability on the right is approxim ated by the b o o tstrap probability
where Z* = (j?,* — Pi ) / S ' is com puted from a sam ple sim ulated according to A lgorithm 6.1, which uses the fit from the full m odel as in (6.16). So, applying the b o o tstrap as described in Section 6.2.3, we calculate the b o o tstrap P-value from the results o f R sim ulated sam ples as # P
{z* > Zo} R + 1 ’
(6.19)
where zq = Pi/si. The relation o f this m ethod to confidence limits is th a t if the lower 1 — a
6.2 • Least Squares Linear Regression
•
•
CM
o
•
•
• X * * • • •
o
o
* A » i* **•*. «• i « • • ••
•
o CO Ip CM O CM
.
•
—
•
o • *
00
CO
d -
0.2
-
0.1
x
0.0
•
•
so
Figure 6.4 Linear regression model fitted to m onthly excess returns over riskless rate y for one company versus excess m arket returns x. The left panel shows the data and fitted line. The right panel plots the absolute values o f the standardized residuals against x (Simonoff and Tsai, 1994).
269
-
0.2
-
•
•
. • • 1 .
* •• w• - ,* • / t 0.1
•
0.0
x
confidence lim it for fa is above zero, then p < oc. Sim ilar interpretations apply with upper confidence limits and confidence intervals. T he sam e m ethod can be used with case resampling. If this were done as a precaution against erro r heteroscedasticity, then it would be appropriate to replace si w ith the robust stan d ard erro r defined as the square root o f (6.17). If we wish to test a non-zero value fa$ for the slope, then in (6.18) we simply replace f a / s \ by zo = (fa — fa,o)/si, or equivalently com pare the lower confidence lim it to fayW ith all o f these tests there are simple m odifications if a different alternative hypothesis is appropriate. For example, if the alternative is H A : fa < 0, then the inequalities “ > ” used in defining p are replaced by and the two-sided P-value is twice the sm aller o f the two one-sided P-values. O n balance there seems little to choose am ong the various tests described. The perm u tatio n test an d its b o o tstrap look-alike are equally suited to statis tics other th an least squares estim ates. T he b o o tstrap pivot test with case resam pling is the only one designed to test slope w ithout assum ing constant erro r variance u nder the null hypothesis. But one would usually expect sim ilar results from all the tests. The extensions to m ultiple linear regression are discussed in Section 6.3.2. Example 6.3 (Returns data) The d a ta plotted in Figure 6.4 are n = 60 consecutive cases o f m onthly excess returns y for a particular com pany and excess m ark et returns x, where excess is relative to riskless rate. We shall ignore the possibility o f serial correlation. A linear relationship appears to fit the data, and the hypothesis o f interest is Ho : fa = 1 with alternative HA : fa > 1, the la tte r corresponding to the com pany outperform ing the m arket.
270
6 ■Linear Regression
Figure 6.5 Returns data: histogram of R = 999 bootstrap values of studentized slope Q
a.
z* = (fil - M/Kob’
CM
o
obtained by resampling cases. Unshaded area corresponds to values in excess of data value 20 = (ft - 1)/sr0b = 0.669.
-2
Figure 6.4 and plots o f regression diagnostics suggest th a t erro r variation increases w ith x and is non-norm al. It is therefore appropriate to apply the boo tstrap pivot test w ith case resam pling, using the robust standard error from (6.17), which we denote here by s rob, to studentize the slope estimate. Figure 6.5 shows a histogram o f R = 999 values o f z". The unshaded p art corresponds to z ' greater th a n the d a ta value zo = (Pi - 1) / srob = (1.133 - 1)/0.198 = 0.669, which happens 233 times. Therefore the b o o tstrap P-value is 0.234. In fact the use o f the robust stan d ard erro r m akes little difference here: using the ordinary stan d ard erro r gives P-value 0.252. C om parison o f the ordinary t-statistic to the stan d ard norm al table gives P-value 0.28. ■
6.2.6 Non-constant variance: weighted error resampling In some applications the tic rando m errors. If the sim ulation by resam pling ordinary, i.e. unw eighted,
linear m odel (6.1) will apply, b u t with heteroscedasheteroscedasticity can be m odelled, then boo tstrap errors is still possible. We assum e to begin with th at least squares estim ates are fitted, as before.
Known variance function Suppose th a t in (6.1) the ran d o m erro r ej a t x = Xj has variance uj, where either c ? = k V ( x j ) or a j = K V ( f i j ) , with V ( ) a know n function. It is possible to estim ate k , b u t we do n o t need to d o this. We only require the modified residuals r
_
J
y j-h {V (X j)(l-h j)y/2
or
y j-h { F( ^. )
(
1/ 2’
271
6.2 ■L east Squares Linear Regression
w hich will be approxim ately hom oscedastic. T he E D F o f these m odified resid uals, after subtracting their m ean, will estim ate the distribution function G o f the scaled, hom oscedastic ran d o m errors dj in the m odel Yj = p 0 + fa Xj + V } % ,
(6.20)
w here Vj = V ( x j ) or V( f i j ) . A lgorithm 6.1 for resam pling errors is now modified as follows. Algorithm 6.3 (Resampling errors with unequal variances) F o r r = 1 ,..., R, 1 F or j = 1 ,..., n, (a) set x* = Xj\ (b ) random ly sam ple <5* from r\ — r , . . . , r n — r; then (c) set y'j = fio + fa Xj + Vj1/2Sj, where Vj is V( xj ) or V(frj) as appropriate. 2 F it linear regression by ordinary least squares to d a ta (xj, y [ ) , (x*, >’*), giving estim ates f a r, s*2.
Weighted least squares O f course in this situation ordinary least squares is inferior to weighted least squares, in which ideally the j'th case is given weight Wj = V ~ l . If Vj = V ( x }) then weighted least squares can be done in one pass through the data, whereas if Vj — V(fij) we first estim ate fij by ordinary least squares fitted values p°j, say, an d then do a weighted least squares fit w ith the em pirical weights Wj = l/V(p.°j). In the la tte r case the stan d ard theory assum es th at the weights are fixed, which is adequate for first-order approxim ations to distributional properties. T he practical effect o f using em pirical weights can be incorporated into the resam pling, an d so potentially m ore accurate distributional properties can be obtain ed ; cf. Exam ple 3.2. F or w eighted least squares, the estim ates o f intercept and slope are a _ T , wA x j - x » ) y j P1 — 22 Wj{xj - x w)2
a _ 5 P0 —
PlXw,
where x w = Y wj x j / Y ^ wj anc^ % ~ S wj y j / S wj- Fitted values and raw residuals are defined as for o rdinary least squares, b u t leverage values and m odified residuals differ. T he leverage values are now Wj(Xj - x w)2 hj — ^ ------h E wi ’ E wi(* i-X w )2’
272
6 ■Linear Regression
and the m odified residuals (standardized to equal variance) are
K
},
var(/?i)
Y , W j ( X j - X w)2 ’
where k = s2 = (n — 2)_l J2 w j ( y j — f aj ) 2 is the weighted residual m ean square. The algorithm for resam pling errors is the sam e as for ordinary least squares, sum m arized in A lgorithm 6.3, b u t w ith the full weighted least squares procedure im plem ented in the final step. The situation where erro r variance depends on the m ean is a special case o f the generalized linear m odel, which is discussed m ore fully in Section 7.2. Wild bootstrap W hat if the variance function F(-) is unspecified? In some circum stances there m ay be enough d a ta to m odel it from the p a ttern o f residual variation, for exam ple using a plot o f m odified residuals r; (or their absolute values o r squares) versus fitted values fij. This ap proach can w ork if there is a clear m onotone relationship o f variance w ith x or fi, or if there are clearly identifiable strata o f constant variance (cf. Figure 7.14). But w here the heteroscedasticity is unpattern ed , either resam pling o f cases should be done with least squares estim ates, o r som ething akin to local estim ation o f variance will be required. The m ost local ap proach possible is the wild bootstrap, which estim ates variances from individual residuals. This uses the m odel-based resam pling A lgorithm 6.1, b u t w ith the j t h resam pled erro r s* taken from the tw o-point distribution (6 .21 ) where n = (5 + *J5)/10 an d = yj — fij is the raw residual. The first three m om ents o f e ' are zero, ej an d ej (Problem 6.8). This algorithm generates at m ost 2" different values o f param eter estim ates, an d typically gives results th a t are underdispersed relative to m odel-based resam pling or resam pling cases. N ote th a t if m odified residuals rj were used in place o f raw residuals ej, then the variance o f fi* u nder the wild b o o tstrap w ould equal the robust variance estim ate (6.17). Example 6.4 (Returns data) As m entioned in Exam ple 6.3, the d ata in Fig ure 6.4 show an increase in error variance w ith m arket return, x. Table 6.3 com pares the b o o tstrap variances o f the p aram eter estim ates from ordinary least squares for case resam pling an d the wild b o o tstrap, with R = 999. The estim ated variance o f fii from resam pling cases is larger th a n for the wild
273
6.3 ■M ultiple Linear Regression Table 6.2 Bootstrap variances (xlO-3 ) of ordinary least squares estimates for returns data, with R = 999.
All cases
C ases Cases, subset W ild, ej W ild, rj R o b u st theoretical
h
h
0.32 0.28 0.31 0.33 0.34
44.3 38.4 37.9 37.0 39.4
W ith o u t case 22
0.42 0.39 0.37 0.41 0.40
73.2 59.1 62.5 67.2 67.2
b ootstrap , an d for the full d a ta it m akes little difference when the modified residuals are used. Case 22 has high leverage, and its exclusion increases the variances o f both estim ates. T he wild b o o tstrap is again less variable th an bootstrapping cases, with the wild b o o tstrap o f modified residuals interm ediate betw een them. We m entioned earlier th a t the design will vary when resam pling cases. The left panel o f Figure 6.6 shows the sim ulated slope estim ates plotted against the sum s o f squares X X — x ”)2> f ° r 200 b o o tstrap samples. The plotting ch aracter distinguishes the num ber o f tim es case 22 occurs in the resam ples: we retu rn to this below. The variability o f /}j decreases sharply as the sum o f squares increases. N ow usually we would treat the sum o f squares as fixed in the analysis, and this suggests th at we should calculate the variance o f P\ from those b o o tstra p sam ples for which X ( x} — x*)2 is close to the original value XXx; ~ x)2, show n by the d otted vertical line. If we take the subset between the dashed lines, the estim ated variance is closer to th at for the wild bootstrap, as show n the values in Table 6.2 and by the Q-Q plot in the right panel o f Figure 6.6. This is also true when case 22 is excluded. The m ain reason for the large variability o f XXxy — x ’)2 is th a t case 22 has high leverage, as its position at the b o tto m left o f Figure 6.4 shows. Figure 6.6 shows th a t it has a substantial effect on the precision o f the slope estim ate: the m ost variable estim ates are those where case 22 does not occur, and the least variable those w here it occurs two or m ore times. ■
6.3 Multiple Linear Regression T he extension o f the simple linear regression m odel (6.1) to several explanatory variables is
( 6.22)
274
6 • Linear Regression
Figure 6.6 Comparison of wild bootstrap and bootstrapping cases for monthly returns data. The left panel shows 200 estimates of slope plotted against sum of squares —x’ )2 for case resampling. Resamples where case 22 occurred zero or one times are labelled accordingly. The right panel shows a Q-Q plot of the values of for the wild bootstrap and the subset of the cases lying within the dashed lines in the left panel.
; V%: 0 (*1n ol*
:d fe co
i i
ii 0 i! 0.001
r i p i . .. v Ti * 1 ill i ii ii i
0.003
0.005
Sum of squares
Cases
where for m odels w ith an intercept Xjo = 1. In the m ore convenient vector form the m odel is Yj = Xj (i + £ j with x j = ( x jo , Xj i, .. ., Xj P). The com bined m atrix representation for all re sponses Y t = ( Y i , . . . , Y„) is
y
=
xp + s
(6.23)
with X T = ( xi , . . . , x „ ) an d eT = ( e i , . . . , e „ ) . A s before, the responses Y j are supposed independent. This general linear m odel will encom pass polynom ial and interaction models, by judicious definition o f x in term s o f prim itive variables; for exam ple, we m ight have Xji = u j i an d x,-2 = or Xj$ = uj\Uj 2 , and so forth. W hen the Xjk are dum m y variables representing levels o f factors, we om it Xjo if the intercept is a red u n d an t param eter. In m any respects the b o o tstrap analysis for m ultiple regression is an obvious extension o f the analysis for simple linear regression in Section 6.2. We again concentrate on least squares m odel fitting. P articular issues which arise a re : (i) testing for the effect o f a subset o f the explanatory variables, (ii) assessm ent o f predictive accuracy o f a fitted m odel, (iii) the effect o f p large relative to n, and (iv) selection o f the “b est” m odel by suitable deletion o f explanatory variables. In this section we focus on the first two o f these, briefly discuss the third, and address variable selection m ethods in Section 6.4. We begin by outlining the extensions o f Sections 6.2.1-6.2.4.
275
6.3 ■M ultiple Linear Regression
6.3.1 Bootstrapping the least squares fit The ordinary least squares estim ates o f P for m odel (6.23) based on observed response vector y are P = (X TX r lX Ty , and corresponding fitted values are fr = H y where H = X ( X TX ) ~ {X T is the “h a t” m atrix, whose diagonal elem ents hjj — again denoted by hj for simplicity — are the leverage values. The raw residuals are e = (I — H)y. U nder hom oscedasticity the standard form ula for the estim ated variance o f P is v ar (p) = s2(X TX ) ~ \
(6.24)
with s2 equal to the residual m ean square (n — p — l ) ~ 1e Te. The em pirical influence values for ordinary least squares estim ates are lj = n ( X T X ) ~ l Xjej,
(6.25)
which give rise to the robust estim ate o f var(/?),
vl
= (Xt X )-1
(X TX ) ~ l ■
(6.26)
see Problem 6.1. These generalize equations (6.13) and (6.14). The variance approxim ation is im proved by using the modified residuals
7
(1 - M 1/2
in place o f the e; , and then v i generalizes (6.17). B ootstrap algorithm s generalize those in Sections 6.2.3-6.2.4. T h at is, modelbased resam pling generates d a ta according to Y] = x J P + E p where the s' are random ly sam pled from the modified residuals n , . . . , rn, or their centred co u n terp arts — r. Case resam pling operates by random ly resam pling cases from the data. Pros and cons o f the two m ethods are the sam e as before, provided p is small relative to n and the design is far from being singular. T he situation where p is large requires special attention. Large p Difficulty can arise w ith b o th m odel-based resam pling and case resam pling if p is very large relative to n. The following theoretical exam ple illustrates an extrem e version o f the problem .
6 • Linear Regression
276
Example 6.5 (One-way model) C onsider the regression m odel th at corre sponds to m independent sam ples each o f size two. If the regression param eters P i , . . . , pm are the m eans o f the p o pulations sampled, then we om it the intercept term from the m odel, an d the design m atrix has p = m colum ns and n = 2m rows with dum m y explanatory variables x 2,-i,( = x 2iyi = 1, = 0 otherwise, i = I , . . . , p . T h a t is, 0
/I 1 0 0
X =
0 \0
0\ 0 0 0
0 0 0
0
0
0
1
0
0
0
1/
For this m odel
Pi = 3 (yn + y n - i ),
i=
l,...,p,
and
ej = ( ~ i y ^(yn ~ yn-i),
hj=\,
j = 2i - l , 2i,
i=l,...,p.
The E D F o f the residuals, m odified o r not, could be very unlike the true error distribution: for example, the E D F will always be symmetric. I f the ran d o m errors are hom oscedastic then the m odel-based b o otstrap will give consistent estim ates o f bias and stan d ard error for all regression coefficients. However, the b o o tstrap distributions m ust be symmetric, and so m ay be no b etter th an norm al approxim ations if true random errors are skewed. T here appears to be no rem edy for this. T he problem is n o t so serious for contrasts am ong the P,. F or example, if 0 = P\ — P2 then it is easy to see th at 9 has a sym m etric distribution, as does O'. The kurtosis is, however, A A different for 9 an d 6’ ; see Problem 6.10. Case resam pling will not w ork because in those sam ples where b o th y 2i+i and y2i+2 are absent /?, is inestim able: the resam ple design is singular. The chance o f this is 0.48 for m = 5 increasing to 0.96 for m = 20. This can be fixed by om itting all b o o tstrap sam ples where + f 2i = 0 for any i. T he resulting boo tstrap variance for P’ consistently overestim ates by a factor o f ab o u t 1.3. F u rth er details are given in Problem 6.9. ■ The im plication for m ore general designs is th a t difficulties will arise with com binations cTp where c is in the subspace spanned by those eigenvectors o f X TX corresponding to sm all eigenvalues. First, m odel-based resam pling will give adequate results for stan d ard erro r calculations, but b o o tstrap distribu tions m ay n o t im prove on norm al approxim ations in calculating confidence limits for the /?,-s, o r for prediction. Secondly, unconstrained case resam pling
277
6.3 ■M ultiple Linear Regression Table 6 3 Cement data (Woods, Steinour and Starke, 1932). The response y is the heat (calories per gram of cement) evolved while samples of cement set. The explanatory variables are percentages by weight of four constituents, tricaicium aluminate x\, tricalcium silicate X2, tetracalcium alumino ferrite *3 and dicalcium silicate X4.
1 2 3 4 5 6 7 8 9 10 11 12 13
xi
*2
X)
*4
y
7 1 11 11 7 11 3 1 2 21 1 11 10
26 29 56 31 52 55 71 31 54 47 40 66 68
6 15 8 8 6 9 17 22 18 4 23 9 8
60 52 20 47 33 22 6 44 22 26 34 12 12
78.5 74.3 104.3 87.6 95.9 109.2 102.7 72.5 93.1 115.9 83.8 113.3 109.4
m ay induce near-collinearity in the design m atrix X ' , or equivalently near singularity in X ' TX *, an d hence produce grossly inflated b o o tstrap estim ates o f some stan d ard errors. O ne solution would be to reject sim ulated samples where the sm allest eigenvalue o f X ’TX * is lower th an a threshold ju st below the sm allest eigenvalue ( \ o f X TX . A n alternative solution, m ore in line with the general thinking th a t analysis should be conditioned on X , is to use only those sim ulated sam ples corresponding to the middle h alf o f the values o f t \ . This probably represents the best strategy for getting good confidence limits which are also robust to erro r heteroscedasticity. The difficulty m ay be avoided by an ap p ro p riate use o f principal com ponent regression. Example 6.6 (Cement data) The d a ta in Table 6.3 are classic in the regression literature as an exam ple o f near-collinearity. The four covariates are percent ages o f constituents which sum to nearly 100: the sm allest eigenvalue o f X TX is = 0.0012, corresponding to eigenvector (—1,0.01,0.01,0.01,0.01). T heoretical an d b o o tstrap stan dard errors for coefficients are given in Table 6.4. For error resam pling the results agree closely w ith theory, as expected. The b o o tstrap distributions o f /?* are very norm al-looking: the h at m atrix H is such th a t modified residuals r; w ould look norm al even for very skewed errors Ej. Case resam pling gives m uch higher standard errors for coefficients, and the b o o tstrap distributions are visibly skewed w ith several outliers. Figure 6.7 shows scatter plots o f tw o b o o tstrap coefficients versus smallest eigenvalue o f X T' X ' ; plots for the oth er two coefficients are very similar. The variability o f /?,* increases substantially for small values o f /}, whose reciprocal ranges from j to 100 tim es the reciprocal o f £\. Taking only those b o o tstrap samples which give the m iddle 500 values o f / j (which are betw een 0.0005 and 0.0012)
278
6 • Linear Regression Table 6.4 Standard fio
/?!
P2
h
P4
err0rS of linear
____________________________________________________________________ N o rm al-th eo ry E rro r resam pling, R = 999 C ase resam pling, all R = 999 C ase resam pling, m iddle 500 C ase resam pling, largest 800
70.1 66.3 108.5 68.4 67.3
0.74 0.70 1.13 0.76 0.77
0.72 0.69 1.12 0.71 0.69
0.75 0.72 1.18 0.78 0.78
regression coefficients for cement data. Theoretical and error resampling assume homoscedasticity. Resampling results use R = 999 samples, but
0.71 0.67 1.11 0.69 0.68
--------------------------------------------------------------------------------------------------------
only on those samples with the middle 500 and the largest 800 values of
rv
Figure 6.7 Bootstrap regression coefficients and fit, versus smallest eigenvalue ( x l 0~5) o f X ' TX ' for R = 999 resamples of cases from the cement data. The vertical line is the smallest eigenvalue of X TX, and the horizontal lines show the original coefficients ± two standard errors.
•V1
U (0 ° © -O
• . V :?-
1
5 10
50
500
1
Smallest eigenvalue
5 10
50
500
Smallest eigenvalue
gives m ore reasonable stan d ard errors, as seen in the penultim ate row o f Table 6.4. T he last row, corresponding to d ropping the smallest 200 values o f f \ , gives very sim ilar results. ■
Weighted least squares The general discussion extends in a fairly obvious way to weighted least squares estim ation, ju st as in Section 6.2.6 for the case p = 1. Suppose th a t var(e) = k W ~ 1 where W is the diagonal m atrix o f know n case weights w; . T hen the w eighted least squares estim ates are p = (X T W X ) ~ lX T Wy,
(6.27)
the fitted values are p. = Xfl, and the residual vector is e = (I — H)y, where now the h a t m atrix H is defined by
H
=
X ( X T WX)~lX T W,
(6.28)
Note that H is not symmetric in general. Some authors prefer to work with the symmetric matrix X' ( X' TX ' ) - ' X 'T, where X' = W l' 1X.
279
6.3 ■M ultiple Linear Regression
w hose diagonal elem ents are the leverage values hj. The residual vector e has variance var(e) = k (I — H ) W ~ [, whose y'th diagonal elem ent is /c(l — h j ) w j 1. So the m odified residual is now rj =
_ J 2J -- ------ • Wj (1 — hj)1/2
(6.29)
M odel-based resam pling is defined by y;
= x j p + w j ll2£j,
where e* is random ly sam pled from the centred residuals r t — r , . . . , r n — r. It is not necessary to estim ate k to apply this algorithm , b u t if an estim ate were required it w ould be k = (n — p — 1)~1y T W ( I — H)y. A n im p o rtan t m odification o f case resam pling is th at each case m ust now include its w eight w in addition to the response y and explanatory variables x.
6.3.2 Significance tests Significance tests for the single covariate in simple linear regression were described in Section 6.2.5. A m ong those tests, which should all behave similarly, are the exact p erm u tatio n test and a related b o o tstrap test. H ere we look at the m ore usual practical problem , testing for the effect o f one or a subset o f several covariates. The tests are based on least squares estimates. Suppose th a t the linear regression m odel is partitioned as Y = X (3 + £ = X q oc + X \ y + e,
where y is a vector an d we wish to test Ho : y = 0. Initially we assume hom oscedastic errors. It would ap p ear th a t the sufficiency argum ent which m otivates the single-variable p erm utation test, and m akes it exact, no longer applies. But there is a n atu ral extension o f th at p erm utation test, and its m o tivation is clear from the developm ent o f boo tstrap tests. The basic idea is to su b tract out the linear effect o f X q from both y and X \ , and then to apply the test described in Section 6.2.5 for simple linear regression. The first step is to fit the null model, th a t is £o = Xo&o,
fo = (X0r X0)_1X 0Ty.
We shall also need the residuals from this fit, which are eo = (/ — Ho)y with Ho = X q( X q Xo)~lX q . The test statistic T will be based on the least squares estim ate y for y in the full m odel, which can be expressed as y — (Xi-oXio) 1X[.0eo w ith X i o = (I — H q) X i. The extension o f the earlier p erm utation test is
6 • Linear Regression
280
equivalent to applying the p erm u tatio n test to “ responses” eo and explanatory variables XioIn the perm utation-type test and its b o o tstrap analogue, we sim ulate d a ta from the null m odel, assum ing hom oscedasticity; th a t is y
= Ao + £o,
where the com ponents o f the sim ulated error vector e0 are sam pled w ithout (perm utation) or w ith (bo o tstrap ) replacem ent from the n residuals in eo- N ote th at this m akes use o f the assum ed hom oscedasticity o f errors. Each case keeps its original covariate values, which is to say th a t X ’ = X . W ith the sim ulated d a ta we regress y ’ on X to calculate y' and hence the sim ulated test statistic t \ as described below. W hen this is repeated R times, the b o o tstrap P-value is # { t; > t} + l R + l T he p erm u tatio n version o f the test is not exact w hen nuisance covariates X j are present, b u t em pirical evidence suggests th a t it is close to exact. Scalar y W hat should t be? F or testing a single com ponent, so th a t y is a scalar, suppose th a t the alternative hypothesis is one-sided, say H A : y > 0. T hen we could A 1/2 take t to be y itself, o r possibly a studentized form such as zo = y / v 0 , where Do is an ap p ro p riate estim ate o f the variance o f y. If we com pute the standard error using the null m odel residual sum o f squares, then v0 = ( n - q r ' e l e o i X l o X i o r 1, where q is the ran k o f X q. T he sam e form ula is applied to every sim ulated sam ple to get i>q an d hence z* = y*/vq1/2. W hen there are no nuisance covariates Xo, Vq = vq in the p erm u tatio n test, and studentizing has no effect: the sam e is true if the non-null stan d ard error is used. Em pirical evidence suggests th a t this is approxim ately true w hen Xo is present; see the exam ple below. Studentizing is necessary if m odified residuals are used, w ith stan d ard izatio n based on the null m odel hat m atrix. A n alternative b o o tstrap test can be developed in term s o f a pivot, as described for single-variable regression in Section 6.2.5. H ere the idea is to treat Z = (y — y ) / V l/2 as a pivot, w ith V l/1 an ap propriate stan d ard error. B ootstrap sim ulation u nder the full fitted m odel then produces the R replicates o f z ’ which we use to calculate the P-value. To elaborate, we first fit the full m odel p = X f i by least squares and calculate the residuals e = y — p. Still assum ing hom oscedasticity, the stan d ard erro r for y is calculated using the residual m ean square — a simple form ula is v = ( n - p - 1) l e Te ( X l 0Xi . 0)
6.3 ■M ultiple Linear Regression
281
N ext, d atasets are sim ulated using the m odel /
= X p + e*,
X ' = X,
where the n errors in e* are sam pled independently w ith replacem ent from the residuals e o r m odified versions o f these. The full regression o f y ‘ on X is then fitted, from which we obtain y * and its estim ated variance v", these being used to calculate z* = (y* — y ) / v ' ll2. F rom R repeats o f this sim ulation we then have the one-sided P-value #
P
{ z r* >
Z q }
+
1
R + 1
where zo = y /u 1/2. A lthough here we use p to denote a P-value as well as the num b er o f covariates, no confusion should arise. This test procedure is the same as calculating a (1 —a) lower confidence limit for y by the studentized b o o tstrap m ethod, and inferring p < a if the lower lim it is above zero. The corresponding two-sided P-value is less th an 2a if the equi-tailed (1 — 2a) studentized b o o tstrap confidence interval does n o t include zero. O ne can guard against the effects o f heteroscedastic errors by using case resam pling to d o the sim ulation, and by using a robust standard error for y as described in Section 6.2.5. Also the same basic procedure can be applied to estim ates o th e r th a n least squares. Example 6.7 (Rock data) The d a ta in Table 6.5 are m easurem ents on four cross-sections o f each o f 12 oil-bearing rocks, taken from two sites. The aim is to predict perm eability from the other three m easurem ents, which result from a com plex im age-analysis procedure. In all regression m odels we use logarithm o f perm eability as response y. The question we focus on here is w hether the coefficient o f shape is significant in a m ultiple linear regression on all three variables. The problem is n o n stan d ard in th at there are four replicates o f the ex p lanatory variables for each response value. If we fit a linear regression to all 48 cases treating them as independent, strong correlation am ong the four residuals for each core sam ple is evident: see Figure 6.8, in which the residuals have unit variance. U nder a plausible m odel which accounts for this, which we discuss in E xam ple 6.9, the ap p ro p riate linear regression for testing purposes uses core averages o f the explanatory variables. T hus if we represent the d a ta as responses yj and replicate vectors o f the explanatory variables Xjk, k = 1,2,3,4, then the m odel for o u r analysis is yj = x J . P + Ej, where the Ej are independent. A sum m ary o f the least squares regression
6 ■Linear Regression
282
Table 6.5 Rock data
case
a rea
p e rim e te r
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 9364 8624 10651 8868 9417 8874 10962 10743 11878 9867 7838 11876 12212 8233 6360 4193 7416 5246 6509 4895 6775 7894 5980 5318 7392 7894 3469 1468 3524 5267 5048 1016 5605 8793 3475 1651 5514 9718
2792 3893 3931 3869 3949 4010 4346 4345 3682 3099 4480 3986 4037 3518 3999 3629 4609 4788 4864 4479 3429 4353 4698 3518 1977 1379 1916 1585 1851 1240 1728 1461 1427 991 1351 1461 1377 476 1189 1645 942 309 1146 2280 1174 598 1456 1486
sh a p e 0.09 0.15 0.18 0.12 0.12 0.17 0.19 0.16 0.20 0.16 0.15 0.15 0.23 0.23 0.17 0.15 0.20 0.26 0.20 0.14 0.11 0.29 0.24 0.16 0.28 0.18 0.19 0.13 0.23 0.34 0.31 0.28 0.20 0.33 0.15 0.28 0.18 0.44 0.16 0.25 0.33 0.23 0.46 0.42 0.20 0.26 0.18 0.20
p e rm e a b ility 6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119.0 119.0 119.0 119.0 82.4 82.4 82.4 82.4 58.6 58.6 58.6 58.6 142.0 142.0 142.0 142.0 740.0 740.0 740.0 740.0 890.0 890.0 890.0 890.0 950.0 950.0 950.0 950.0 100.0 100.0 100.0 100.0 1300.0 1300.0 1300.0 1300.0 580.0 580.0 580.0 580.0
(Katz, 1995; Venables and Ripley, 1994, p. 251). These are measurements on four cross-sections of 12 core samples, with permeability (milli-Darcies), area (of pore space, in pixels out of 256 x 256), perimeter (pixels), and shape (perimeter/area)1^2.
6.3 • M ultiple Linear Regression
Figure 6.8 Rock data: standardized residuals from linear regression of all 48 cases, showing strong intra-core correlations.
283
co 3
T3
■O 0N ? (0 •O c 03
CO
4
6
8
10
12
Core number
Table 6.6 Least squares results for multiple linear regression of rock data, all covariates included and core means used as response variable.
V ariable intercept a r e a ( x lO - 3 ) p e r i ( x lO - 3 ) sh ap e
Coefficient
SE
f-value
3.465 0.864 -1 .9 9 0 3.518
1.391 0.211 0.400 4.838
2.49 4.09 - 4 .9 8 0.73
is shown in Table 6.6. T here is evidence o f m ild non-norm ality, b u t not heteroscedasticity o f errors. Figure 6.9 shows results from b o th the null m odel resam pling m ethod and the full m odel pivot resam pling m ethod, in b o th cases using resam pling o f errors. The observed value o f z is z0 = 0.73, for which the one-sided P-value is 0.234 und er the first m ethod, an d 0.239 under the second m ethod. Thus sh ap e should n o t be included in the linear regression, assum ing th at its effect would be linear. N ote th a t R = 99 sim ulations would have been sufficient here. ■
Vector y F or testing several com ponents sim ultaneously, we take the test statistic to be the quad ratic form T = F i X l o X v 0)y,
6 *Linear Regression
284
Figure 6.9 Resampling distributions of standardized test statistic for variable shape. Left: resampling 2 under null model, R = 999. Right: resampling pivot under full model, R = 999.
-6
-4
-2
0
2
4
6
8
-6
-4
z*
-2
0
2
4
6
8
z0*
or equivalently the difference in residual sum s o f squares for the null and full m odel least squares fits. This can be standardized to n —q RSSo — R S S q X RSSo where RSSo and R S S denote residual sum s o f squares under the null m odel and full m odel respectively. We can apply the pivot m ethod with full m odel sim ulation here also, using Z = (y — y)T ( X l 0Xi.o)(y — y ) / S 2 w ith S 2 the residual m ean square. The test statistic value is zo = y T(X[.0Xi .0) y /s 2, for w hich the P-value is given by # {z* >
Zp}
+
1
R + 1 This would be equivalent to rejecting Ho at level a if the 1 — a confidence set for y does n o t include the point y = 0. A gain, case resam pling would provide protection against heteroscedasticity: z would then require a robust standard error.
6.3.3 Prediction A fitted linear regression is often used for prediction o f a new individual response Y+ when the explanatory variable vector is equal to x +. T hen we shall w ant to supplem ent o u r predicted value by a prediction interval. Confidence limits for the m ean response can be found using the same resam pling as is used to get confidence limits for individual coefficients, b u t limits for the response Y+ itself — usually called prediction lim its — require additional resam pling to sim ulate the variation o f 7+ ab o u t x \ j i .
285
6.3 ■M ultiple Linear Regression
T he q uantity to be predicted is Y+ = x'+ji + £ +, say, and the point predictor is Y+ = The ran d o m erro r £+ is assum ed to be independent o f the random errors £ i,...,£ „ in the observed responses, and for simplicity we assum e th at they all com e from the sam e d istribution: in p articular the errors have equal variances. To assess the accuracy o f the point predictor, we can estim ate the distribution o f the prediction error S = Y+ - Y + = x tJ -
( x l P + £+)
by the distribution o f <5* = x+/?* — (x+/? + e+),
(6.30)
w here £+ is sam pled from G and /T is a sim ulated vector o f estim ates from the m odel-based resam pling algorithm . This assum es hom oscedasticity o f random error. U nconditional properties o f the prediction erro r correspond to averaging over the distributions o f b o th £+ and the estim ates /?, which we do in the sim ulation by repeating (6.30) for each set o f values o f /T. H aving obtained the m odified residuals from the d a ta fit, the algorithm to generate R sets each w ith M predictions is as follows. Algorithm 6.4 (Prediction in linear regression) F or r = 1 ,..., R, 1 sim ulate responses y* according to (6.16); 2 obtain least squares estim ates pr = ( X TX ) ~ 1X Ty *; then 3 for m = 1 ,..., M , (a) sam ple £ ^ m from r \ — f , . . . , r „ — r, and (b ) com pute prediction error S ’m = x+i?* — (x£/? + £+m).
It is acceptable to use M = 1 here: the key point is th a t R M be large enough to estim ate the required properties o f <5*. N ote th at if predictions at several values o f x + are required, then only the third step o f the algorithm needs to be repeated for each x+. T he m ean squared prediction error is estim ated by the sim ulation m ean squared erro r (R M )-1 E rm(<5*m — <S*)2. M ore useful would be a (1 — 2a) pre diction interval for Y+, for which we need the a and (1 — a) quantiles ax and say, o f prediction erro r S. T hen the prediction interval would have limits y+ - fli-a,
$+ - a*-
T he exact, b u t unknow n, quantiles are estim ated by em pirical quantiles o f
6 ■Linear Regression
286
the pooled <5*s, w hose ordered values we denote by < 5( < • • • < boo tstrap prediction lim its are y+ — ^((RM+l)(l-ct))’
y+ — ^((RM+lJa)’
The
(6.31)
where y+ = *+/?. This is analogous to the basic b o o tstrap m ethod for confi dence intervals (Section 5.2). A som ew hat b etter ap p ro ach w hich mimics the stan d ard norm al-theory analysis is to w ork w ith studentized prediction error
where S is the square root o f residual m ean square for the linear regression. The corresponding sim ulated values are z*m = <5*m/s*, with s ' calculated in step 2 o f A lgorithm 6.4. T he a and (1 —a) quantiles o f Z are estim ated by z*(RM+1)0,) and respectively, where z'{V) < ■■■ < z ’RM) are the ordered values o f all R M z* s. T hen the studentized b o o tstrap prediction interval for 7+ is y+ ~ SZ((RM+l)(l-ct))’
£+ ~ SZ((RM+1)«)-
(6.32)
E xam ple 6.8 (N uclear power stations) Table 6.7 contains d a ta on the cost o f 32 light w ater reactors. T he cost (in dollars x l0 ~ 6 adjusted to a 1976 base) is the response o f interest, an d the o th er quantities in the table are explanatory variables; they are described in detail in the d a ta source. We take lo g (c o s t) as the w orking response y, and fit a linear m odel with covariates PT, CT, NE, d a te , lo g (c a p a c ity ) and log(N). T he dum m y variable PT indicates six plants for w hich there were p artial turnkey guarantees, and it is possible th a t some subsidies m ay be hidden in their costs. Suppose th a t we wish to obtain 95% prediction intervals for the cost o f a station like case 32 above, except th a t its value for d a te is 73.00. T he predicted value o f lo g (c o s t) from the regression is x+fi = 6.72, and the m ean squared erro r from the regression is s = 0.159. W ith a = 0.025 and a sim ulation with R = 999 an d M = 1, ( R M + l)a = 25 an d ( R M + 1)(1 — a) = 975. The values o f 3(25) an d <5*975) are -0.539 and 0.551, so the 95% lim its (6.31) are 6.18 and 7.27, which are slightly w ider th a n the norm al-theory limits o f 6.25 and 7.19. F or the lim its (6.32) we get z(*25) = —3.680 and z(*975) = 3.5 12, so the lim its for lo g (c o st) are 6.13 and 7.28. T he corresponding prediction interval for c o s t is [exp(6.13), exp(7.28)] = [459.4,1451], The usual caveats apply a b o u t extrapolating a trend outside the range o f the data, an d we should use these intervals w ith great caution. ■ The next exam ple involves an u nusual d a ta structure, where there is hierar chical variatio n in the covariates.
It is unnecessary to standardize also by the square root of 1 + x l ( X TX)- ' x+, which would make the variance of Z close to 1. unless bootstrap results for different x+ are pooled.
6.3 ■M ultiple Linear Regression Table 6.7 Data on light water reactors constructed in the USA (Cox and Snell, 1981, p. 81).
1 2 3 4 5
6 7
8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29 30 31 32
287
cost
d a te
Tl
t2
c a p a c ity
PR
NE
CT
BW
N
PT
460.05 452.99 443.22 652.32 642.23 345.39 272.37 317.21 457.12 690.19 350.63 402.59 412.18 495.58 394.36 423.32 712.27 289.66 881.24 490.88 567.79 665.99 621.45 608.80 473.64 697.14 207.51 288.48 284.88 280.36 217.38 270.71
68.58 67.33 67.33
14
46 73 85 67 78 51 50 59 55 71 64 47 62 52 65 67 60 76 67 59 70 57 59 58 44 57 63 48 63 71 72 80
687 1065 1065 1065 1065 514 822 457 822 792 560 790 530 1050 850 778 845 530 1090 1050 913 828 786 821 538 1130 745 821
0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 1 1 1
1 0 0 1 1 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 1 0 1 1 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1
14
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1
68.00 68.00 67.92 68.17 68.42 68.42 68.33 68.58 68.75 68.42 68.92 68.92 68.42 69.50 68.42 69.17 68.92 68.75 70.92 69.67 70.08 70.42 71.08 67.25 67.17 67.83 67.83 67.25 67.83
10 10 11 11 13
12 14 15
12 12 13 15 17 13
11 18 15 15 16
11 22 16 19 19
20 13 9
12 12 13 7
886 886 745
886
1 1 12 12 3 5
1 5
2 3
6 2 7 16 3 17
2 1 8 15
20 18 3 19
21 8 7
11 11 8 11
Example 6.9 (Rock data) F or the d a ta discussed in Exam ple 6.7, one objective is to see how well one can predict perm eability from a single replicate o f the three im age-based m easurem ents, as opposed to the four replicates obtained in the study. The previous analysis suggested th a t variable sh ap e did not contribute usefully to a linear regression relationship for the logarithm o f perm eability, an d this is confirm ed by cross-validation analysis o f prediction errors (Section 6.4.1). So here we concentrate on predicting perm eability from the linear regression o f y = lo g ( p e r m e a b ility ) on a r e a and p e r i . In Exam ple 6.7 we com m ented on the strong intra-core correlation am ong the explanatory variables, and th a t m ust be taken into account here if we are to correctly analyse prediction o f core perm eability from single m easurem ents o f a r e a and p e r i . O ne way to do this is to think o f the four replicate values o f u = ( a r e a , p e r i ) T as unbiased estim ates o f an underlying core variable £, on which y has a linear regression. T hen the d a ta are m odelled by
yj = <x + £j y + fij,
ujk =
+ sjk,
(6.33)
6 ■Linear Regression
288
Table 6.8 Rock data: fits o f linear regression models with K replicate values o f explanatory variables a r e a and p e r i . Norm al-theory analysis is via model
V ariable
M eth o d In tercep t
a r e a ( x lO - 4 )
p e r i ( x l O 4)
K = 1
D irect regression on x ^ s N o rm al-th eo ry fit
5.746 5.694
5.144 5.300
-16.16 -16.39
K = 4
R egression on Xj. s N o rm al-th eo ry fit
4.295 4.295
9.257 9.257
-21.78 -21.78
(6.33).
for j = 1 ,...,1 2 and k = where rjj and < 5 are uncorrelated errors with zero means, and for o u r d a ta K = 4. U nder norm ality assum ptions on the errors and the the linear regression o f yj on Uj\,...,UjK depends only on the core average u; = K ~ l Y a =i ujkThe regression coefficients depend strongly on K . F or prediction from a single m easurem ent u+ we need the m odel w ith K = 1, and for resam pling analysis we shall need the m odel w ith K = 4. These tw o versions o f the observation regression m odel we w rite as yj = x j p w + ef> = a(K) + u J y (K} + e f \
(6.34)
for K = 1 and 4; the param eters a and y in (6.33) correspond to a (x) and when K = oo. Fortunately it turns out th a t b o th observation m odels can be fit ted easily: for K = 4 we regress the yjs on the core averages Uj; and for K = 1 we fit linear regression w ith all 48 individual cases as tabled, ignoring the intra-core correlation am ong the e;*s, i.e. pretending th at y; occurs four times independently. Table 6.8 shows the coefficients for both fits, and com pares them to corresponding estim ates based on exact norm al-theory analysis. Suppose, then, th a t we w ant to predict the new response y + given a single set o f m easurem ents u+. If we define x \ = (1,m+), then the point prediction Y+ is x l P \ where /?(1) are the coefficients in the fit o f m odel (6.34) with K = 1, shown in the first row o f Table 6.8. T he E D F o f the 48 modified residuals from this fit estim ates the m arginal distribution o f the e*1* in (6.34), and hence o f the error e+ in Y+ = x l ^ + s +. O ur concern is w ith the prediction error 5 = Y+ - Y + = x l $ W -
- £+,
(6.35)
whose distrib u tio n is to be estim ated by resampling. The question is how to do the resam pling, given the presence o f intra-core correlation. A resam pled dataset m ust consist o f 12 subsets each with 4 repli cates u*k an d a single response yj, from which we shall fit /?(1)*. The prediction
6.3 ■M ultiple Linear Regression
289
erro r (6.35) will then be sim ulated by
<5* = K - y'+ = *I(/*(1)*- j3(1)) - £+, where e*+ is sam pled from the E D F o f the 48 modified residuals as m entioned above. It rem ains to decide how to sim ulate the d a ta from which we calculate Pw '. Usually w ith erro r resam pling we would fix the covariate values, so here we fix the 12 values o f Uj, which are surrogates for the £jS in m odel (6.33). T hen we sim ulate responses from the fitted regression on these averages, and sim u late the replicated m easured covariates using an appropriate hierarchical-data algorithm . Specifically we take Ujk = Uj + djk, where djk = ujk — Uj and J is random ly sam pled from { 1 ,2 ,..., 12}. O ur ju s tification for this, in term s o f retaining intra-core correlation, is given by the discussion in Section 3.8. It is potentially im p o rtan t to build the variation o f u into the analysis. Since u* = Uj, the resam pled responses are defined by >■=*j r +
t f .
where the £*4)* are random ly sam pled from the 12 m ean-adjusted, modified residuals r ^ — rw from the regression o f the y; s on the iijS. The estim ates are now obtained by fitting the regression to the 48 sim ulated cases ( u ^ y j ) , k = 1 , ...,4 and j = 1 ,..., 12. Figure 6.10 shows typical norm al plots for prediction error y + — y+ , these for x + = (1,4000,1000) and x + = (1,10000,4000) which are near the edge o f the observed space, from R = 999 resam ples and M = 1. The skewness o f prediction erro r is quite noticeable. The resam pling stan d ard deviations for pre diction errors are 0.91 an d 0.93, som ew hat larger th an the theoretical standard deviations 0.88 and 0.87 obtained by treating the 48 cases as independent. To calculate 95% intervals we set a = 0.025, so th at ( R M + l)a = 25 and ( R M + 1)(1 — a) = 975. The sim ulation values <5(*25) and <5('975) are —1.63 and 1.93 at x+ = (1,4000,1000), and -1 .5 7 and 2.19 at x + = (1,10000,4000). The corresponding p o in t predictions are 6.19 and 4.42, so 95% prediction intervals are (4.26,7.82) at x+ = (1,4000,1000) and (2.23,5.99) at x+ = (1,10000,4000). These intervals differ m arkedly from those based on norm al theory treating all 48 cases as independent, those being (4.44,7.94) and (2.68,6.17). M uch o f the difference is due to the skewness o f the resam pling distribution o f prediction error. ■
6 • Linear Regression
290
Figure 6.10 Rock data: normal plots of resampled prediction errors for x+ =(1,4000,1000) (left panel) and = (1,10000,4000) (right panel), based on R = 999 and M = 1. Dotted lines correspond to theoretical means and standard deviations.
Quantiles of standard normal
Quantiles of standard normal
6.4 Aggregate Prediction Error and Variable Selection In Section 6.3.3 o u r discussion o f prediction focused on individual cases, and particularly on intervals o f uncertainty aro u n d point predictions. F or some applications, however, we are interested in an aggregate m easure o f prediction erro r — such as average squared error o r m isclassification erro r — which sum m arizes accuracy o f prediction across a range o f values o f the covariates, using a given regression m odel. Such a m easure m ay be o f interest in its own right, o r as the basis for com paring alternative regression models. In the first p art o f this section we outline the m ain resam pling m ethods for estim ating aggregate prediction error, an d in the second p a rt we discuss the closely related problem o f variable selection for linear regression models.
6.4.1 Aggregate prediction error The least squares fit o f the linear regression m odel (6.22) provides the least squares prediction rule y+ = x+fi for predicting w hat a single response y+ would be at value x+ o f the vector o f covariates. W h at we w ant to know is how accurate this prediction rule will be for predicting d a ta sim ilar to those already observed. Suppose first th a t we m easure accuracy o f prediction by squared error (y+ — y+)2, an d th a t o u r interest is in predictions for covariate values th a t exactly duplicate the d a ta values x \ , . . . , x n. T hen the aggregate prediction error is D = n - x Y j E (Y + j - x ] h \ j= i
6.4 ■Aggregate Prediction Error and Variable Selection
X is the n x q matrix with rows x j , . . . , x j , where q = p + 1 if there are p covariate terms and an intercept in the model.
291
in which ft is fixed and the expectation is over y+J = x]p + e+j. We cannot calculate D exactly, because the m odel param eters are unknow n, so we m ust settle for an estim ate — which in reality is an estim ate o f A = E(D), the average over all possible sam ples o f size n. O ur objective is to estim ate D or A as accurately as possible. As stated the problem is quite simple, at least under the ideal conditions where the linear m odel is correct and the error variance is constant, for then D
=
n - l Y ™ r ( Y +j) + n - l Y , ( X j P - x J [ l ) 2
=
a 2 + n - l ( p - l } ) TX TX 0 - p ) ,
(6.36)
w hose expectation is A = <j 2(1 + ^ - 1),
(6.37)
where q = p + 1 is the nu m b er o f regression coefficients. Since the residual m ean square s2 is an unbiased estim ate for a 2, we have the natural estim ate A = s2(l + qn~l ).
(6.38)
However, this estim ate is very specialized, in two ways. First, it assumes th at the linear m odel is correct and th a t erro r variance is constant, b o th unlikely to be exactly true in practice. Secondly, the estim ate applies only to least squares prediction and the squared erro r m easure o f accuracy, w hereas in practice we need to be able to deal w ith other m easures o f accuracy and other prediction rules — such as robust linear regression (Section 6.5) and linear classification, where y is binary (Section 7.2). T here are no simple analogues o f (6.38) to cover these situations, b u t resam pling m ethods can be applied to all o f them. In order th a t o u r discussion apply as broadly as possible, we shall use general n o tatio n in which prediction erro r is m easured by c(y+, y +), typically an increasing function o f |y+ — y+|, and the prediction rule is y + = /i(x+, F), where the E D F F represents the observed data. Usually n(x +>F) is an estim ate o f the m ean response at x +, a function o f x+/? with /? an estim ate o f /?, and the form o f this prediction rule is closely tied to the form o f c(y+,y+). We suppose th a t the d a ta F are sam pled from distribution F, from which the cases to be predicted are also sampled. This implies th at we are considering x + values sim ilar to d a ta values x i ,...,x „ . Prediction accuracy is m easured by the aggregate prediction error D = D(F, F) = E + [c{ Y+, tx(X+, F)} | F],
(6.39)
where E + em phasizes th a t we are averaging only over the distribution o f (AT+, 7+), w ith d a ta fixed. Because F is unknow n, D can n o t be calculated, and so we look for accurate m ethods o f estim ating it, or ra th er its expectation A = A (F ) = E { D ( F , F ) } ,
(6.40)
6 ■Linear Regression
292
the average prediction accuracy over all possible d atasets o f size n sam pled from F. The m ost direct ap proach to estim ation o f A is to apply the boo tstrap substitution principle, th a t is substituting the E D F F for F in (6.40). However, there are o th er widely used resam pling m ethods which also m erit consideration, in p art because they are easy to use, an d in fact the best approach involves a com bination o f m ethods. Apparent error The sim plest way to estim ate D or A is to take the average prediction error w hen the prediction rule is applied to the sam e d a ta th at was used to fit it. This gives the apparent error, som etim es called the resubstitution error, n
K PP = D( F, F) = n ~x ' Y ^ c { y j ,ii{xj,F)}. 7=1
(6.41)
This is n o t the sam e as the b o o tstrap estim ate A(F), which we discuss later. It is intuitively clear th a t A app will tend to underestim ate A, because the latter refers to prediction o f new responses. The underestim ation can be easily A | checked for least squares prediction w ith squared error, when A app = n~ R S S , the average squared residual. If the m odel is correct with hom oscedastic random errors, then A app has expectation a 2(l —qn~ l ), w hereas from (6.37) we know th a t A = <x2(l + qn~l ). The difference betw een the true erro r and ap p aren t erro r is the excess error, D( F, F) — D(F,F), whose m ean is the expected excess error, e(F) = E {D(F, F) - D(F, F)} = A(F) - E{D(F, F)},
(6.42)
where the expectation is taken over possible datasets F. F or squared error and least squares prediction the results in the previous p arag rap h show th at e(F) = 2qri~l o 2. The q uantity e(F) is akin to a bias and can be estim ated by resam pling, so the a p p aren t error can be m odified to a reasonable estim ate, as we see below. Cross-validation T he ap p aren t error is dow nw ardly biased because it averages errors o f predic tions for cases at zero distance from the d a ta used to fit the prediction rule. C ross-validation estim ates o f aggregate erro r avoid this bias by separating the d a ta used to form the prediction rule and the d a ta used to assess the rule. The general paradigm is to split the d ataset into a training set {(x j , y j ) : j £ S,} and a separate assessment set {(X j , y j ) : j e Sa}, represented by Ft and Fa, say. The linear regression predictor is fitted to Ft, used to predict responses yj for
293
6.4 ■Aggregate Prediction Error and Variable Selection
j € Sa, and then A is estim ated by D{Fa, Ft) = n ~ ' Y
£)}>
(6-43)
j€Sa
w ith na the size o f Sa. T here are several variations on this estim ate, depending on the size o f the training set, the m anner o f splitting the dataset, and the num ber o f such splits. The version o f cross-validation th at seems to come closest to actual use o f o u r predictor is leave-one-out cross-validation. H ere training sets o f size n —1 are taken, and all such sets are used, so we m easure how well the prediction rule does when the value o f each response is predicted from the rest o f the data. If F^j represents the n — 1 observations {(xk,yk),k ^ j}, and if /u(Xy,F_; ) denotes the value predicted for yj by the rule based on F _; , then the cross-validation estimate o f prediction error is n
Ac v = n~l
c{yj>
F-j)}, (6.44)
i= i which is the average erro r when each observation is predicted from the rest o f the sample. In general (6.44) requires n fits o f the model, b u t for least squares linear regression only one fit is required if we use the case-deletion result (Problem 6.2) ~
,
T A
Vi — x j B
P - P- j = ( X TX ) ~ ' x j ^ _ £
,
where as usual hj is the leverage for the 7th case. F or squared erro r in particular we then have ="
E
d
- ^
•
' 6-45>
From the natu re o f Ac v one would guess th a t this estim ate has only a small bias, and this is so: assum ing an expansion o f the form A(F) = oq + a\ n~l + a2n~2 + ■■■, one can verify from (6.44) th a t E(A c^) = «o + a i(n — I )-1 + • • ■, which differs from A by term s o f order n~2 — unlike the expectation o f the ap p aren t error which differs by term s o f order n_ l . K -fold cross-validation In general there is no reason th at training sets should be o f size n — 1. For certain m ethods o f estim ation the num ber n o f fits required for Ac v could itself be a difficulty — although not for least squares, as we have seen in (6.45). T here is also the possibility th at the small p erturbations in fitted m odel w hen single observations are left out m akes Ac v too variable, if fitted values H(x,F) do n o t depend sm oothly on F o r if c(y+ ,y+ ) is n o t continuous. These
294
6 ■Linear Regression
potential problem s can be avoided to a large extent by leaving out groups o f observations, rath er th an single observations. T here is m ore th an one way to d o this. One obvious im plem entation o f group cross-validation is to repeat (6.43) for a series o f R different splits into training and assessm ent sets, keeping the size o f the assessm ent set fixed at na = m, say. T hen in a fairly obvious n o tation the estim ate o f aggregate prediction error would be R
Acv = R ~{
X ! c{yJ’ jesv
r= 1
^v)}-
(6-46^
In principle there are (") possible splits, possibly an extrem ely large num ber, b u t it should be adequate to take R in the range 100 to 1000. It would be in the spirit o f resam pling to m ake the splits at random . However, consideration should be given to balancing the splits in some way — for example, it would seem desirable th a t each case should occur w ith equal frequency over the R assessm ent sets; see Section 9.2. D epending on the value o f nt = n — m and the num ber p o f explanatory variables, one m ight also need some form o f balance to ensure th a t the m odel can always be fitted. There is an efficient version o f group cross-validation th at does involve ju st one prediction o f each response. We begin by splitting the d a ta into K disjoint sets o f nearly equal size, w ith the corresponding sets o f case subscripts denoted by C i , . . . , C k , say. These K sets define R = K different splits into training and assessm ent sets, w ith S^k = Q the kt h assessm ent set and the rem ainder o f the d a ta Stf = |J,y* Ci the /cth training set. F or each such split weapply (6.43), and then average these estim ates. The result is the K-fold cross-validation estimate o f prediction error n
Acvjc = n~l y c{yj, n(xj, F - k{J))}, j=i
(6.47)
where F-k{j) represents the d a ta from which the group containing the j i h case has been deleted. N ote th a t ACvjc is equal to the leave-one-out estim ate (6.44) when K = n. C alculation o f (6.47) requires ju st K m odel fits. Practical experience suggests th a t a good strategy is to take K = m in{n1!1, 10}, on the grounds th a t taking K > 10 m ay be too com putationally intensive when the prediction rule is com plicated, while taking groups o f size at least n1/2 should p ertu rb the d a ta sufficiently to give small variance o f the estimate. The use o f groups will have the desired effect o f reducing variance, b u t at the cost o f increasing bias. F or exam ple, it can be seen from the expansion used earlier for A th a t the bias o f A Cvjc is a\{n(K — l )}-1 + ■• •, which could be substantial if K is small, unless n is very large. F ortunately the bias o f A qv ,k can be reduced by a simple adjustm ent. In a harm less abuse o f notation, let
6.4 ■Aggregate Prediction Error and Variable Selection
if n / K
=m
is an
integer, then ail groups are o f size m and Pk = l / K .
295
F-k denote the d a ta w ith the /cth group om itted, for k = 1 and let pk denote the p ro p o rtio n o f the d ata falling in the /cth group. T he adjusted cross-validation estimate o f aggregate prediction erro r is 00
0
r
&acvjk. = Ack,k + D( F, F) — ^2,PkD{F,F-k)-
(6.48)
k= 1
T his has sm aller bias th a n Acvjc and is alm ost as simple to calculate, because it requires n o additional fits o f the model. F or a com parison betw een ACvjc an d A acvjc in a simple situation, see Problem 6.12. T he following algorithm sum m arizes the calculation o f AAcvji w hen the split into groups is m ade a t random . Algorithm 6.5 (K -fold adjusted cross-validation) 1 Fit the regression m odel to all cases, calculate predictions m odel, an d average the values o f c(yj,yj) to get D. 2 C hoose group sizes m i,. . . , such th a t mi H----- + m* = n. 3 For k = 1
from th at
(a) choose Ck by sam pling times w ithout replacem ent from { 1 ,2 ,...,« } m inus elem ents chosen for previous C,s; (b) (c) (d) (e)
fit the regression m odel to all d a ta except cases j £ Ck', calculate new predictions yj = n(xj, F-k) for j e Ck ; calculate predictions %j = fi(xj,F-k) for all j ; then average the n values c{yj,%j) to give D(F,F-k).
4 A verage the n values o f c(yj,yj) using yj from step 3(c) to give Ac vj i5 C alculate Aacvji as in (6.48) with pk = mk/n.
Bootstrap estimates A direct ap plication o f the b o o tstrap principle to A(F) gives the estim ate A = A(F) = E*{D(F,F*)}, w here F* denotes a sim ulated sam ple ( x j,y j) ,. . . , (x*, >’”) taken from the d a ta by case resam pling. U sually sim ulation is required to approxim ate this estim ate, as follows. F or r = 1 we random ly resam ple cases from the d ata to obtain the sam ple (x*j,y*j) , . . . , (x*n,y'„), which we represent by F*, and to this sam ple we fit the prediction rule and calculate its predictions n ( x j , F ' ) o f the d a ta responses yj for j = 1 The aggregate prediction erro r estim ate is then calculated as R R - 1
n Y 2 c { y j,f i{ x j,F ') } .
r= l
j=l
(6.49)
6 ■Linear Regression
296
Intuitively this b o o tstra p estim ate is less satisfactory th an cross-validation, because the sim ulated d ataset F* used to calculate the prediction rule is p art o f the d a ta F used for assessm ent o f prediction error. In this sense the estim ate is a hybrid o f the a p p aren t erro r estim ate and a cross-validation estim ate, a point to which we retu rn shortly. As we have noted in previous chapters, care is often needed in choosing w hat to bootstrap. H ere, an ap p ro ach w hich w orks b etter is to use the boo tstrap to estim ate the expected excess erro r e(F) defined in (6.42), w hich is the bias o f the a p p aren t erro r A app, an d to add this estim ate to A app. In theory the b o o tstrap estim ate o f e(F) is e(F) = E ' { D ( F , F ' ) - D ( F ‘, F *)}, and its approxim ation from the sim ulations described in the previous p a ra graph defines the bootstrap estimate o f expected excess error
‘E
eB = R
n 1E c{yj>^ xpK .)} - n 1E cWp MKpF")} i=i
r= 1
(6.50)
j=i
T h at is, for the rth b o o tstra p sam ple we construct the prediction rule n(x, F'), then calculate the average difference betw een the prediction errors when this rule is applied first to the original d a ta an d secondly to the b o o tstrap sam ple itself, an d finally average across b o o tstra p samples. We refer to the resulting estim ate o f aggregate prediction error, Ab = $b + A app, as the bootstrap estimate o f prediction error, given by n
n~l E 7=1
R
E r= 1
R
F'r )} - R - 1 E D (F'r, K ) + D(F, F).
(6.51)
r= l
N ote th a t the first term o f (6.51), which is also the simple b o o tstra p estim ate (6.49), is expressed as the average o f the contributions jR-1 ^ f = i c{yy-, F ’ )} th at each original observation m akes to the estim ate o f aggregate prediction error. These contributions are o f interest in their own right, m ost im portantly in assessing how the perform ance o f the prediction rule changes with values o f the explanatory variables. This is illustrated in Exam ple 6.10 below. Hybrid bootstrap estimates It is useful to observe th a t the naive estim ate (6.49), which is also the first term o f (6.51), can be broken into two qualitatively different parts,
6.4 ■Aggregate Prediction Error and Variable Selection
297
and
w here R - j is the n u m b er o f the R b o o tstrap sam ples F ' in which (xj ,yj ) does n o t appear. In (6.52) yj is always predicted using d ata from which (X j , y j) is excluded, which is analogous to cross-validation, w hereas (6.53) is sim ilar to an a p p aren t erro r calculation because yj is always predicted using d a ta th at contain (xj,yj). N ow R - j / R is approxim ately equal to the constant e~l = 0.368, so (6.52) is approxim ately p ro p o rtio n al to A scr = n - 1E j=1
Y
(6'54)
J r:j out
som etim es called the leave-one-out bootstrap estimate o f prediction error. The n o ta tio n refers to the fact th a t Abcv can be viewed as a b o o tstrap sm oothing o f the cross-validation estim ate Acv- To see this, consider replacing the term c {y j , n ( x j , F - j )} in (6.44) by the expectation E l j[c{yj,n(Xj,F*)}], where E lrefers to the expectation over b o o tstrap sam ples F * o f size n draw n from F-j. T he estim ate (6.54) is a sim ulation approxim ation o f this expectation, because o f the result n o ted in Section 3.10.1 th a t the R - j b o o tstrap sam ples in which case j does n o t ap p ear are equivalent to random sam ples draw n from F-j. T he sm oothing in (6.54) m ay effect a considerable reduction in variance, com pared to Ac v , especially if c(y+, y +) is n o t continuous. B ut there will also be a tendency tow ard positive bias. This is because the typical b o o tstrap sample from which predictions are m ade in (6.54) includes only ab o u t (1 — e~l )n = 0.632n distinct d a ta values, an d the bias o f cross-validation estim ates increases as the size o f the train in g set decreases. W hat we have so far is th a t the b o o tstrap estim ate o f aggregate prediction erro r essentially involves a w eighted com bination o f Abcv and an apparent erro r estim ate. Such a com bin atio n should have good variance properties, b u t m ay suffer from bias. However, if we change the weights in the com bination it m ay be possible to reduce or rem ove this bias. This suggests th at we consider the hybrid estim ate A w = w A b cv + (1 - w)Aapp,
(6.55)
an d then select w to m ake the bias as small as possible, ideally E(AW) = A + 0 ( n ~ 2). N o t unexpectedly it is difficult to calculate E(AW) in general, b u t for quadratic erro r and least squares prediction it is relatively easy. We already know th at the a p p aren t erro r estim ate has expectation a 2( 1 — qn~l ), and th a t the true
298
6 • Linear Regression
A p p a re n t
Table 6.9 Estimates of aggregate prediction error (xlO -2) for data on nuclear power plants. Results for adjusted cross-validation are shown in parentheses.
K -fo ld (adjusted ) cross-validation
e rro r
B o o tstrap
0.632
32
16
10
6
2.0
3.2
3.5
3.6
3.7 (3.7)
3.8 (3.7)
4.4 (4.2)
aggregate erro r is A = er2( l + qn 1). It rem ains only to calculate E(ABCk), where here A B CV =
n~l Y 2 E -j(y j -
x ] P - j ) 2>
j =i A
w ith p ’_ j the least squares estim ate o f /? from a b o o tstra p sam ple w ith the j t h case excluded. A ra th e r lengthy calculation (Problem 6.13) shows th at E(A jjck) = c 2( l + 2 qn~l ) + 0 ( n ~ 2), from which it follows th a t E{wABCk + (1 - w)A app} = er2( l + 3w qn~l ) + 0 ( n ~ 2), which agrees w ith A to term s o f o rd er n~l if w = 2/3. It seems im possible to find an optim al choice o f w for general m easures o f prediction erro r an d general prediction rules, b u t detailed calculations do suggest th a t w = 1 — e-1 = 0.632 is a good choice. H euristically this value for w is equivalent to an ad justm ent for the below -average distance betw een cases an d b o o tstra p sam ples w ithout them , com pared to w hat we expect in the real prediction problem . T h a t the value 0.632 is close to the value 2 /3 derived above is reassuring. T he hybrid estim ate (6.55) w ith w = 0.632 is know n as the 0.632 estimator o f prediction error an d is denoted here by A0.632- T here is substantial em pirical evidence favouring this estim ate, so long as the num ber o f covariates p is n o t close to n. Example 6.10 (Nuclear power stations) C onsider predicting the cost o f a new pow er station based on the d a ta o f Exam ple 6.8. We base o u r prediction on the linear regression m odel described there, so we have n(x j , F ) = x j f i , where A
•'
18 is the least squares estim ate for a m odel w ith six covariates. The estim ated
erro r variance is s2 = 0.6337/25 = 0.0253 w ith 25 degrees o f freedom . The dow nw ardly biased a p p aren t erro r estim ate is A app = 0.6337/32 = 0.020, whereas the idealized estim ate (6.38) is 0.025 x (1 + ~ ) = 0.031. In this situation the prediction e rro r for a p articu lar station seems m ost useful, b u t before we tu rn to individual stations, we discuss the overall estim ates, which are given in Table 6.9. Those estim ates show the p a tte rn we would anticipate from the general
299
6.4 ■Aggregate Prediction Error and Variable Selection
Figure 6.11 Components of prediction error for nuclear power data based on 200 bootstrap simulations. The top panel shows the values of yj — n{xj,F*). The lower left panel shows the average error for each case, plotted against the residuals. The lower right panel shows the ratio of the model-based to the bootstrap prediction standard errors.
Case
Raw residual
Case
discussion. T he ap p aren t e rro r is considerably sm aller th an other estimates. The b o o tstrap estim ate, w ith R = 200, is larger th an the ap p aren t error, b u t sm aller th a n the cross-validation estim ates, and the 0.632 estim ate agrees well w ith the ordin ary cross-validation estim ate (6.44), for which K — n = 32. A d justm ent slightly decreases the cross-validation estim ates. N ote th a t the idealized estim ate appears to be quite accurate here, presum ably because the m odel fits well an d errors are n o t far from hom oscedastic — except for the last six cases. N ow consider the individual predictions. Prediction erro r arises from two com ponents: the variability o f the predictor and th a t o f the associated erro r s+. Figure 6.11 gives som e insight into these. Its top panel shows the values
300
6 ■Linear Regression
o f yj — n(xj,F*) for r = 1 ,...,J ? , p lo tted against case num ber j. The variability o f the average error corresponds to the variation o f individual observations a b o u t their predicted values, while the variance w ithin each group reflects param eter estim ation uncertainty. A striking feature is the small prediction erro r for the last six pow er plants, whose variances and m eans are both small. The lower left panel shows the average values o f y j — fi(xj,F*) over the 200 sim ulations, plotted against the raw residuals. They agree closely, as we should expect w ith a well-fitting m odel. T he lower right panel shows the ratio o f the m odel-based prediction stan d ard erro r to the b o o tstrap prediction standard error. It confirm s th a t the m odel-based calculation described in Exam ple 6.8 overestim ates the predictive stan d ard erro r for the last six plants, which have the partial turnkey guarantee. T he estim ated b o o tstra p prediction erro r for these plan ts is 0.003, while it is 0.032 for the rest. T he last six cases fall into three groups determ ined by the values o f the explanatory variables: in effect they are replicated. It m ight be preferable to p lo t y j — fi(xj, F ' ) only for those b o o tstrap samples which exclude the j t h case, and then m ean prediction error would b etter be com pared to jackknifed residuals yj — x j /L ; . F or these d a ta the plots are very sim ilar to those we have shown. ■ Example 6.11 (Times on delivery suite) F or a m ore system atic com parison o f prediction error estim ates in linear regression, we use d ata provided by E. Burns on the times tak en by 1187 w om en to give b irth a t the Jo h n Radcliffe H ospital in O xford. A n ap p ro p riate linear m odel has response the log time spent on delivery suite an d dum m y explanatory variables indicating the type o f labour, the use o f electronic fetal m onitoring, the use o f an intravenous drip, the reported length o f la b o u r before arriving a t the hospital and w hether or n o t the lab o u r is the w om an’s first; seven p aram eters are estim ated in all. We took 200 sam ples o f size n = 50 at ran d o m from the full data. F or each o f these sam ples we fitted the m odel described above, and then calculated cross-validation estim ates o f prediction error Acv#. w ith K = 50, 10, 5 and 2 groups, the corresponding adjusted cross-validation estim ates A a c v j c , the b o o tstrap estim ate AB, and the hybrid estim ate Ao.632- We took R = 200 for the b o o tstrap calculations. The results o f this experim ent are sum m arized in term s o f estim ates o f the expected excess erro r in Table 6.10. T he average a p p aren t error and excess erro r were 15.7 x 10-2 and 5.2 x 10-2 , the latter taken to be e(F) as defined in (6.42). T he table shows averages and stan d ard deviations o f the differences betw een estim ates A an d A app. T he cross-validation estim ate w ith K = 50, the boo tstrap an d the 0.632 estim ate have sim ilar properties, while other choices o f K give estim ates th a t are m ore variable; the half-sam ple estim ate A C v ,2 is worst. R esults for cross-validation w ith 10 and 5 groups are alm ost
301
6.4 ■Aggregate Prediction Error and Variable Selection Table 6.10 Summary results for estimates of prediction error for 200 samples of size n = 50 from a set of data on the times 1187 women spent on delivery suite at the John Radcliffe Hospital, Oxford. The table shows the average, standard deviation, and conditional mean squared error (x 10~2) for the 200 estimates of excess error. The ‘target’ average excess error is 5.2 x lO"2.
X -fo ld (adjusted) cross-validation
M ean SD M SE
B o o tstrap
0.632
50
10
5
2
4.6 1.3 0.23
5.3 1.6 0.24
5.3 1.6 0.24
6.0 (5.7) 2.3 (2.2) 0.28 (0.26)
6.2 (5.5) 2.6 (2.3) 0.30 (0.27)
9.2 (5.7) 5.4 (3.3) 0.71 (0.33)
the same. A djustm ent significantly im proves cross-validation when group size is n o t small. T he b o o tstrap estim ate is least variable, b u t is dow nw ardly biased. The final row o f the table gives the conditional m ean squared error, defined as (200)-1 {Aj — Dj ( F, F) }2 for each erro r estim ate A. This m easures the success o f A in estim ating the true aggregate prediction error D(F, F) for each o f the 200 samples. A gain the ordinary cross-validation, bootstrap, and 0.632 estim ates perform best. In this exam ple there is little to choose betw een K -fold cross-validation with 10 an d 5 groups, which b o th perform worse th an the ordinary cross-validation, bootstrap , an d 0.632 estim ators o f prediction error. K -fold cross-validation should be used w ith adjustm ent if ordinary cross-validation or the sim ulationbased estim ates are not feasible. ■
6.4.2 Variable selection In m any applications o f m ultiple linear regression, one purpose o f the analysis is to decide which covariate term s to include in the final model. T he supposition is th a t the full m odel y = x T fi + s with p covariates in (6.22) is correct, b u t th at it m ay include some red u n d an t terms. O ur aim is to elim inate those red u n d an t term s, and so obtain the true m odel, which will form the basis for further inference. This is som ew hat simplistic from a practical viewpoint, because it assum es th a t one subset o f the proposed linear m odel is “ tru e” : it m ay be m ore sensible to assum e th a t a few subsets m ay be equally good approxim ations to a com plicated true relationship betw een m ean response and covariates. G iven th a t there are p covariate term s in the m odel (6.22), there are 2P candidates for true m odel because we can include or exclude each covariate. In practice the num b er o f candidates will be reduced if prior inform ation necessitates inclusion o f p articu lar covariates or com binations o f them. There are several approaches to variable selection, including various stepwise m ethods. But the approach we focus on here is the direct one o f m inim izing aggregate prediction error, when each candidate m odel is used to predict independent, future responses at the d a ta covariate values. F or simplicity we assum e th a t m odels are fitted by least squares, and th a t aggregate prediction
302
6 ■Linear Regression
erro r is average squared error. It w ould be a sim ple m atter to use other prediction rules an d o th er m easures o f prediction accuracy. First we define som e n otation. We denote an arb itrary candidate m odel by M , which is one o f the 2P possible linear models. W henever M is used as a subscript, it refers to elem ents o f th a t model. T hus the n x pm design m atrix X M contains those pM colum ns o f the full design m atrix X th a t correspond to covariates included in M ; the y'th row o f X m is x h p the least squares estim ates for regression coefficients in M are P m , and H M is the h at m atrix X m ( X I i X m )~1X11 th a t defines fitted values = H My under m odel M . The total num b er o f regression coefficients in M is qM = pM + 1, assum ing th a t an intercept term is always included. Now consider prediction o f single responses y+ a t each o f the original design points x i,...,x „ . The average squared prediction erro r using m odel M is n n ~l J 2 ( y +j ~ x T m M > 7=1
and its expectation u n d er m odel (6.22), conditional on the data, is the aggregate prediction error n
D ( M ) = a 2 + n~x ^ ( ^ - - x ^ j Pm )2, i= i where p.T = (AMj■ is the vector o f m ean responses for the true m ultiple regression m odel. T aking expectation over the d a ta distribution we obtain A (M ) = E{D(M)} = (1 + n~lqM) a 2 + fxT(I — H M)n,
(6.56)
where /ir (/ — H M)p is zero only if m odel M is correct. The quantities D (M) and A(M) generalize D and A defined in (6.36) an d (6.37). In principle the best m odel w ould be the one th a t m inimizes D{M), but since the m odel p aram eters are unknow n we m ust settle for m inim izing a good estim ate o f D ( M) o r A(M). Several resam pling m ethods for estim ating A were discussed in the previous subsection, so the n atu ral approach would be to choose a good m ethod an d apply it to all possible models. However, accurate estim ation o f A(M ) is n o t itself im p o rtan t: w hat is im p o rtan t is to accurately estim ate the signs o f differences am ong the A(M), so th a t we can identify which o f the A(M )s is smallest. O f the m ethods considered earlier, the a p p aren t e rro r estim ate A app( M) = h^ R S S m was poor. Its use here is im m ediately ruled out w hen we observe th a t it always decreases w hen covariates are added to a m odel, so m inim ization always leads to the full model.
6.4 ■Aggregate Prediction Error and Variable Selection
303
Cross-validation O ne good estim ate, when used w ith squared error, is the leave-one-out crossvalidation estim ate. In the present no tatio n this is
=
(6.57)
w here y ^ j is the fitted value for m odel M based on all the d a ta and h ^ j is the leverage for case j in m odel M . The bias o f Ac v ( M ) is small, b u t th at is not enough to m ake it a good basis for selecting M . To see why, note first th a t an expansion gives mAc k (M ) =
et
(I
- H M)e +
2pM + fiT(I - H M)fi.
(6.58)
T hen if m odel M is true, an d M ' is a larger model, it follows th a t for large n Pr{Ac v ( M ) < ACv( M') } = P r(Z2 < 2d), where d = p w ~ P m - This probability is substantially below 1 unless d is large. It is therefore quite likely th a t selecting M to m inimize Ac v ( M ) will lead to overfitting, even for large n. So although the term p T(I — H M)n in (6.58) guarantees th at, for large n, incorrect m odels will n o t be selected, m inim ization o f A c v ( M ) does n o t provide consistent selection o f the true model. One explanation for this is th a t to estim ate A(M) w ith sufficient accuracy we need b o th large am o u n ts o f d ata to fit m odel M and a large num ber o f independent predictions. This can be accom plished using the m ore general cross-validation m easure (6.43), u nder conditions given below. In principle we need to average (6.43) over all possible splits, b u t for practical purposes we follow (6.46). T h a t is, using R different splits into training and assessm ent sets o f sizes nt = n — m and na = m, we generalize (6.57) to R
ACv(M) = jR_1 Y l m~ l X r= 1 jesv
~ yMj(St,r)}2,
where p M j ( S t,r) = x h ^ M ^ t , ) an d ^ M(^t,r) are the least squares estim ates for coefficients in M fitted to the rth training set whose subscripts are in Sv . N ote th a t the sam e R splits into training and assessm ent sets are used for all models. It can be show n that, provided m is chosen so th a t n — m —> o o and m /n —>1 as n -» o o , m inim ization o f Ac v ( M ) will give consistent selection o f the true m odel as n—► o o an d R —>o o .
304
6 ■Linear Regression
Bootstrap methods C orresponding results can be obtained for b o o tstrap resam pling m ethods. The b o o tstrap estim ate o f aggregate prediction erro r (6.51) becomes
Ab ( M ) = n~l R S S m + R ~ l £
n~l
j
- RSS'M,
j
,
(6.59)
where the second term on the right-hand side is an estim ate o f the expected excess erro r defined in (6.42). The resam pling scheme can be either case resam pling o r error resam pling, w ith x m Mj r = x Mj for the latter. It turns o u t th a t m inim ization o f A B( M) behaves m uch like m inim ization o f the leave-one-out cross-validation estim ate, an d does n o t lead to a consistent choice o f true m odel as n—*o o . However, there is a m odification o f A B(M), analogous to th a t m ade for the cross-validation procedure, which does produce a consistent m odel selection procedure. T he m odification is to m ake sim ulated datasets be o f size n — m rath er th an n, such th a t m / n —>l and n — m—> o o as n—>co. Also, we replace the estim ate (6.59) by the sim pler b o o tstrap estim ate R
Ab (M ) = R - 1 r= l
n
n- 1 Y ^ ( y j ~ x l j K r ) 2> j= 1
(6.60)
which is a generalization o f (6.49). (The previous doubts ab o u t this simple estim ate are less relevant for small n — m.) I f case resam pling is used, then n — m cases are random ly selected from the full set o f n. If m odel-based resam pling is used, the m odel being M w ith assum ed hom oscedasticity o f errors, then is a ran d o m selection o f n — m rows from X m and the n — m errors £* are random ly sam pled from the n m ean-corrected m odified residuals i"Mj ~ for m odel M. Bearing in m ind the general advice th a t the nu m ber o f sim ulated datasets should be at least R = 100 for estim ating second m om ents, we should use at least th a t m any here. T he sam e R b o o tstra p resam ples are used for each m odel M , as w ith the cross-validation procedure. One m ajo r practical difficulty th a t is shared by the consistent cross-validation and b o o tstrap procedures is th a t fitting all candidate m odels to small subsets o f d a ta is n o t always possible. W h at em pirical evidence there is concerning good choices for m / n suggests th a t this ratio should be ab o u t | . If so, then in m any applications some o f the R subsets will have singular designs X'M for big models, unless subsets are balanced by ap p ro p riate stratification on covariates in the resam pling procedure. Example 6.12 (Nuclear power stations) In Exam ples 6.8 and 6.10 o u r analyses focused on a linear regression m odel th a t includes six o f the p = 10 covariates available. T hree o f these covariates — d a te , lo g ( c a p ) and NE — are highly
305
6.4 ■Aggregate Prediction Error and Variable Selection
Figure 6.12 Aggregate prediction error estimates for sequence of models fitted to nuclear power stations data; see text. Leave-one-out cross-validation (solid line), bootstrap with R = 100 resamples of size 32 (dashed line) and 16 (dotted line).
0
2
4
6
8
10
Number of covariates
sign ifica n t, a ll o th ers h a v in g P -v a lu e s o f 0.1 or m ore. H ere w e co n sid e r the sele c tio n o f v a ria b les to in c lu d e in th e m o d el. T h e to ta l n u m b er o f p o ssib le m o d els, 2 10 = 1024, is p ro h ib itiv e ly larg e, a n d for th e p u r p o se s o f illu stra tio n w e co n sid e r o n ly the p a rticu la r seq u en ce o f m o d e ls in w h ich v a ria b les en ter in th e ord er d a t e , l o g ( c a p ) , NE, CT, l o g ( N ) , PT, T l, T2, PR, BW: th e first three are th e h ig h ly sig n ifica n t variab les.
Figure 6.12 plots the leave-one-out cross-validation estim ates and the b o o t strap estim ates (6.60) w ith R = 100 o f aggregate prediction error for the m odels w ith 0 , 1 ,..., 10 covariates. The two estim ates are very close, and b o th are m inim ized w hen six covariates are included (the six used in Exam ples 6.8 an d 6.10). Selection o f five or six covariates, ra th er th a n fewer, is quite clearcut. These results b ear o u t the rough rule-of-thum b th a t variables are selected by cross-validation if they are significant at roughly the 0.1 level. As the previous discussion would suggest, use o f corresponding crossvalidation and b o o tstra p estim ates from training sets o f size 20 or less is precluded because for training sets o f such sizes the m odels with m ore th an five covariates are frequently unidentifiable. T h at is, the unbalanced nature o f the covariates, coupled w ith the binary nature o f some o f them , frequently leads to singular resam ple designs. Figure 6.12 includes b o o tstrap estim ates for m odels w ith u p to five covariates and training set o f size 16: these results were obtained by om itting m any singular resamples. These ra th er fragm entary results confirm th a t the m odel should include at least five covariates. A useful lesson from this is th a t there is a practical obstacle to w hat in theory is a preferred variable selection procedure. O ne w ay to try to overcome
306
6 ■Linear Regression cv, resample 10
cv, resample 20
cv, resample 30
leave-one-out cv
boot, resample 10
boot, resample 20
boot, resample 30
boot, resample 50
this difficulty is to stratify on the b inary covariates, b u t this is difficult to im plem ent an d does n o t w ork well here. ■ Example 6.13 (Simulation exercise) In order to assess the variable selection procedures w ithout the com plication o f singular resam ple designs, we consider a sm all sim ulation exercise in which procedures are applied to ten datasets sim ulated from a given m odel. T here are p = 5 independent covariates, whose values are sam pled from the uniform distrib u tio n on [0, 1], and responses y are generated by adding N ( 0,1) variates to the m eans p. = x Tp. The cases we exam ine have sam ple size n = 50, an d yS3 = jS4 = = 0, so the true m odel includes an intercept and two covariate terms. To simplify calculations only six m odels are fitted, by successively adding x i , . . . , x 5 to an initial m odel with con stan t intercept. All resam pling calculations are done with R = 100 samples. T he num b er o f d atasets is adm ittedly small, b u t sufficient to m ake rough com parisons o f perform ance. The m ain results concern m odels w ith P\ = P2 = 2, which m eans th a t the two non-zero coefficients are ab o u t four stan d ard errors aw ay from zero. Each panel o f Figure 6.13 shows, for the ten datasets, one variable selection criterion plotted against the n u m b er o f covariates included in the model. Evidently the clearest indications o f the tru e m odel occur w hen training set size is 10 or 20. L arger training sets give flat profiles for the criterion, and m ore frequent selection o f overfitted models. These indications m atch the evidence from m ore extensive sim ulations, which suggest th a t if training set size n —m is a b o u t n /3 then the probability o f correct m odel selection is 0.9 or higher, com pared to 0.7 o r less for leave-one-out crossvalidation. F u rther results were obtained w ith P\ = 2 an d P2 = 0.5, the latter equal to one stan d ard erro r aw ay from zero. In this situation underfitting — failure to
Figure 6.13 Cross-validation and bootstrap estimates of aggregate prediction error for sequence of six models fitted to ten datasets of size n = 50 with p = 5 covariates. The true model includes only two covariates.
6.5 ■Robust Regression
307
include x 2 in the selected m odel — occurred quite frequently even w hen using training sets o f size 20. This deg radation o f variable selection procedures when coefficients are sm aller th a n tw o stan d ard errors is reputed to be typical.
■
The theory used to justify the consistent cross-validation and boo tstrap procedures m ay depend heavily on the assum ptions th at the dim ension o f the true m odel is small com pared to the num ber o f cases, and th a t the non-zero regression coefficients are all large relative to their stan d ard errors. It is possible th a t leave-one-out cross-validation m ay w ork well in certain situations where m odel dim ension is com parable to num ber o f cases. This w ould be im p o rtan t, in light o f the very clear difficulties o f using small training sets w ith typical applications, such as Exam ple 6.12. Evidently fu rther work, b o th theoretical an d em pirical, is necessary to find broadly applicable variable selection m ethods.
6.5 Robust Regression T he use o f least squares regression estim ates is preferred w hen errors are n ear-norm al in distrib u tio n an d hom oscedastic. However, the estim ates are very sensitive to outliers, th a t is cases which deviate strongly from the general relationship. Also, if errors have a long-tailed distribution (possibly due to heteroscedasticity), then least squares estim ation is n o t an efficient m ethod. A ny regression analysis should therefore include ap p ro p riate inspection o f diagnostics based on residuals to detect outliers, and to determ ine if a norm al assum ption for errors is reasonable. If the occurrence o f outliers does not cause a change in the regression model, then they will likely be om itted from the fitting o f th a t m odel. D epending on the general pattern o f residuals for rem aining cases, we m ay feel confident in fitting by least squares, or we m ay choose to use a m ore robust m ethod to be safe. Essentially the resam pling m ethods th a t we have discussed previously in this chapter can be adapted quite easily for use w ith m any robust regression m ethods. In this section we briefly review som e o f the m ain points. Perhaps the m ost im p o rtan t p o in t is th a t gross outliers should be rem oved before final regression analysis, including resam pling, is undertaken. There are tw o reasons for this. The first is th a t m ethods o f fitting th a t are resistant to outliers are usually n o t very efficient, and m ay behave badly u n der resampling. T he second reason is th a t outliers can be disruptive to resam pling analysis o f m ethods such as least squares th a t are n o t resistant to outliers. F o r m odel-based resam pling, the erro r distribution will be contam inated and in the resam pling the outliers can then occur at any x values. F or case resam pling, outlying cases will occur w ith variable frequency and m ake the b o o tstrap estim ates o f coefficients too variable; see Exam ple 6.4. The effects can be diagnosed from
308
6 ■Linear Regression
D ose (rads)
117.5
235.0
470.0
705.0
940.0
1410
S urvival %
44.000 55.000
16.000 13.000
4.000 1.960 6.120
0.500 0.320
0.110 0.015 0.019
0.700 0.006
Table 6.11 Survival data (Efron, 1988).
the jackk n ife-after-b o o tstrap plots o f Section 3.10.1 o r sim ilarly inform ative diagnostic plots, b u t such plots can fail to show the occurrence o f m ultiple outliers. For datasets w ith possibly m ultiple outliers, diagnosis is aided by initial use o f a fitted m ethod th a t is highly resistant to the effects o f outliers. One preferred resistant m ethod is least trim m ed squares, which minimizes m
5 > 0 )(/*)j=i
(6.61)
the sum o f the m sm allest squares o f deviations e; (/}) = yj — x j p. Usually m is taken to be [\n] + 1. R esiduals from the least trim m ed squares fit should clearly identify outliers. The fit itself is n o t very efficient, and should best be th o ught o f as an initial step in a m ore efficient analysis. (It should be noted th a t in som e im plem entations o f least trim m ed squares, local m inim a o f (6.61) m ay be found far aw ay from the global m inim um .) Example 6.14 (Survival proportions) T he d a ta in Table 6.11 and the left panel o f Figure 6.14 are survival percentages for rats a t a succession o f doses o f radiation, w ith two o r three replicates at each dose. T he theoretical relationship betw een survival rate an d dose is exponential, so linear regression applies to x = dose,
y = log(survival percentage).
T he right panel o f Figure 6.14 plots these variables. There is a clear outlier, case 13, at x = 1410. T he least squares estim ate o f slope is —59 x 10-4 using all the data, changing to —78 x 10-4 w ith stan d ard erro r 5.4 x 10-4 when case 13 is om itted. T he least trim m ed squares estim ate o f slope is —69 x 10-4 . F rom the scatter p lo t it app ears th a t heteroscedasticity m ay be present, so we resam ple cases. The effect o f the outlier on the resam ple least squares estim ates is illustrated in Figure 6.15, which plots R = 200 b o o tstrap least squares slopes PI against the corresponding values o f ]T (x ” — x*)2, differentiated by the frequency w ith which case 13 appears in the resam ple. There are three distinct groups o f b o o tstrap p ed slopes, w ith the lowest corresponding to resam ples in which case 13 does n o t occur and the highest to sam ples where it occurs twice or more. A jack k n ife-after-b o o tstrap plot w ould clearly reveal the effect o f case 13. T he resam pling stan d ard erro r o f p \ is 15.3 x 10-4 , b u t only 7.6 x 10-4 for
Here [•] denotes integer part.
6.5 • Robust Regression
Figure 6.14 Scatter plots of survival data.
309
•
S
o
t
•
0s 15 > o £ D (0 O ) CM O '
o
i co D o
CO
C\J
• • •
CM
• • • ••
200
•
• • t
• 600
• 1000
• 1400
• 200
600
1000
1400
Dose
Dose
Figure 6.15 Bootstrap estimates of slope and design sum-of-squares J2(x } - x
)2 ( x \ 0 5 ),
differentiated by frequency of case 13 (appears zero, one or more times), for case resampling with R = 200 from survival data.
Sum of squares
sam ples w ithout case 13. T he corresponding resam pling standard errors o f the least trim m ed squares slope are 20.5 x 10-4 and 18.0 x 10~4, showing b o th the resistance an d inefficiency o f the least trim m ed squares m ethod. ■
Exam ple 6.15 (Salinity d a ta ) The d a ta in Table 6.12 are n = 28 observations on the salinity o f w ater in Pam lico Sound, N o rth C arolina. The response in the second colum n is the bi-weekly average o f salinity. The next three colum ns contain values o f the covariates, respectively a lagged value o f salinity, a trend
310
6 ■Linear Regression
Salinity sal
L agged salinity la g
T ren d in d icato r tre n d
R iver discharge d is
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
7.6 7.7 4.3 5.9 5.0 6.5 8.3 8.2 13.2 12.6 10.4 10.8 13.1 12.3 10.4 10.5 7.7 9.5 12.0 12.6 13.6 14.1 13.5 11.5
8.2 7.6 4.6 4.3 5.9 5.0 6.5 8.3 10.1 13.2 12.6 10.4 10.8 13.1 13.3 10.4 10.5 7.7
23.01 22.87 26.42 24.87 29.90 24.20 23.22 22.86 22.27 23.83 25.14 22.43 21.79 22.38 23.93 33.44 24.86 22.69 21.79 22.04 21.03 21.01 25.87 26.29
25 26 27 28
12.0 13.0 14.1 15.1
4 5 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 0 1 4 5 0 1 2 3 4 5
10.0 12.0 12.1 13.6 15.0 13.5 11.5 12.0 13.0 14.1
Table 6.12 Salinity data (Ruppert and Carroll, 1980).
22.93 21.31 20.77 21.39
indicator, an d the river discharge. We consider a linear regression m odel with these three covariates. The initial least squares analysis gives coefficients 0.78, —0.03 and —0.30, with intercept 9.70. The usual stan d ard error for the trend coefficient is 0.16, so this coefficient would be ju d g ed n o t nearly significant. However, this fit is suspect, as can be seen n o t from the Q -Q plot o f m odified residuals b u t from the plot o f cross-validation residuals versus leverages, where case 16 stands out as an outlier — due apparen tly to its unusual value o f d is . T he outlier is m uch m ore easily detected using the least trim m ed squares fit, w hich has the quite different coefficient values 0.61, —0.15 and —0.86 w ith intercept 24.72: the residual o f case 16 from this fit has standardized value 6.9. Figure 6.16 shows norm al Q -Q plots o f standardized residuals from least squares (left panel) and least trim m ed squares fits (right panel); for the la tte r the scale factor is taken to be the m edian absolute residual divided by 0.6745, the value appropriate for estim ating the stan d ard deviation o f norm al errors.
Application of standard algorithms for least trimmed squares with default settings can give very different, incorrect solutions.
311
6.5 ■Robust Regression
Figure 6.16 Salinity data: standardized residuals from least squares (left) and least trimmed squares (right) fits using all cases.
co 3 ■D '(/) T© 3 N CO x>
c
co
55
Quantiles of standard normal
Quantiles of standard normal
T here is some question as to w hether the outlier is really ab errant, o r simply reflects the need for a quad ratic term in d i s . ■ Robust methods We suppose now th a t outliers have been isolated by diagnostic plots and set aside from fu rth er analysis. The problem now is w hether o r n o t th a t analysis should use least squares estim ation: if there is evidence o f a long-tailed error distribution, then we should dow nw eight large deviations yj — x j fi by using a robust m ethod. Two m ain options for this are now described. O ne ap p ro ach is to m inim ize n o t sums o f squared deviations b u t sums o f absolute values o f deviations, Y , Iy j ~ x J J®l> so liv in g less weight to those cases w ith the largest errors. This is the L i m ethod, which generalizes — and has efficiency com parable to — the sam ple m edian estim ate o f a population mean. T here is n o simple expression for approxim ate variance o f L\ estim ators. M ore efficient is M -estim ation, which is analogous to m axim um likelihood estim ation. H ere the coefficient estim ates /? for a m ultiple linear regression solve the estim ating equation 0,
(6.62)
where tp(z) is a b o unded replacem ent for z, and s is either the solution to a sim ultaneous estim ating equation, o r is fixed in advance. We choose the latter, tak in g s to be the m edian absolute deviation (divided by 0.6745) from the least trim m ed squares regression fit. T he solution to (6.62) is obtained by iterative weighted least squares, for which least trim m ed squares estim ates are good startin g values.
6 • Linear Regression
312
W ith a careful choice o f ip(-), M -estim ates should have sm aller standard errors th a n least squares estim ates for long-tailed d istributions o f random errors e, yet have com parable stan d ard errors should those errors be hom o scedastic norm al. O ne stan d ard choice is tp(z) = z m in (l,c /|z |), H u b er’s winsorizing function, for which the coefficient estim ates have approxim ate effi ciency 95% relative to least squares estim ates for hom oscedastic norm al errors when c = 1.345. F or large sam ple sizes M -estim ates ft are approxim ately norm al in distribu tion, with approxim ate variance v ar(£) = o'2 * {'p2{e/
(6.63)
under hom oscedasticity. A m ore robust, em pirical variance estim ate is provided by the nonp aram etric delta m ethod. First, the em pirical influence values are, analogous to (6.25), lj = k n ( X T X ) ~ 1Xj\p
^,
where k = sn-1 ]T "=1 w(ej / s) and e; = yj — x j f i is the raw residual; see Problem 6.7. T he variance approxim ation is then vL = n~2
h lJ = k 2( X TX ) - lX TD X ( X TX ) - \
(6.64)
7=1
where D = diag {y>2( e i/s ),. .. ,xp2( e„/ s) }; this generalizes (6.17). Resampling As with least squares estim ation, so w ith robust estim ates we have two simple choices for resam pling: case resam pling, o r m odel-based resam pling. D epend ing on which robust m ethod is used, the resam pling algorithm m ay need to be modified from the simple form th a t it takes for least squares estim ation. T he Lj estim ates will behave like the sam ple m edian under either resam pling scheme, so th a t the distrib u tio n o f can be very discrete, and close to th at o f P ~ P only for very large samples. Use o f the sm ooth b o o tstrap (Section 3.4) will im prove accuracy. N o simple studentization is possible for L\ estimates. F or M -estim ates case resam pling should be satisfactory except for small datasets, especially those w ith unreplicated design points. The advantage o f case resam pling is simplicity. F or m odel-based resam pling, som e m odifications are required to the algorithm used to resam ple least squares estim ation in Section 6.3. First, the leverage correction o f raw residuals is given by ej 1
( l - d h j ) ' / 2’
j _2 J2(e)f sMej A) Y W j/s)
E W2(ej/s) (E v ff j/s )} 2'
Sim ulated errors are random ly sam pled from the uncentred ru . . . , r n. M ean
tp(u) is the derivative d\p(u)/du.
6.5 ■Robust Regression
313
correction to the rj is replaced by a slightly m ore com plicated correction in the estim ation equation itself. T he resam ple version o f (6.62) is
T he scale estim ate s' is obtained by the same m ethod as s, b u t from the resam ple data. S tudentization o f j?* —ft is possible, using the resam ple analogue o f the delta m ethod variance (6.64) o r m ore simply ju st using s'. Exam ple 6.16 (Salinity d ata) In our previous look a t the salinity d a ta in E xam ple 6.15, we identified case 16 as a clear outlier. We now set th a t case aside an d re-analyse the linear regression w ith all three covariates. O ne objective is to determ ine w hether o r n o t the trend variable should be included in the m odel: the initial, incorrect least squares analysis suggested not. A norm al Q -Q plot o f the m odified residuals from the new least squares fit suggests som ew hat long tails for the erro r disribution, so th a t robust m ethods m ay be w orthw hile. We fit the m odel by four m e th o d s: least squares, H u b er Mestim ate (w ith c = 1.345), L i and least trim m ed squares. Coefficient estim ates are fairly sim ilar und er all m ethods, except for t r e n d whose coefficients are -0 .1 7 , -0 .2 2 , - 0 .1 8 an d -0 .0 8 . F o r fu rth er analysis we apply case resam pling w ith R = 99. Figure 6.17 illustrates the results for estim ates o f the coefficient o f tr e n d . The d o tted lines on the top two panels correspond to the theoretical norm al approxim ations: evidently the stan d ard variance approxim ation — based on (6.63) — for the H u b er estim ate is too low. N ote also the relatively large resam pling variance for the least trim m ed squares estim ate, p a rt o f which m ay be due to unconverged estim ates: tw o resam pling outliers have been trim m ed from this plot. To assess the significance o f t r e n d we apply the studentized pivot m ethod o f Section 6.3.2 w ith b o th least squares and M -estim ates, studentizing by the theoretical stan d ard erro r in each case. The corresponding values o f z are —1.25 and —1.80, w ith respectively 23 and 12 sm aller values o f z* o u t o f 99. So there appears to be little evidence o f the need to include tr e n d . I f we checked diagnostic plots for any o f the four regression fits, a question m ight be raised ab o u t w hether or n o t case 5 should be included in the analysis. A n alternative view o f this is provided by jackknife-after-bootstrap plots (Section 3.10.1) o f the four fits: such plots correspond to case-deletion resam pling. A s an illustration, Figure 6.18 shows the jackknife-after-bootstrap plo t for the coefficient o f t r e n d in the M -estim ation fit. This shows clearly th a t case 5 has an appreciable effect on the resam pling distribution, and th at its om ission w ould give tighter confidence limits on the coefficient. It also raises
6 • Linear Regression
314
Figure 6.17 Salinity data: Normal Q-Q plots of resampled estimates of trend coefficient, based on case resampling (R = 99 for data excluding case 16. Clockwise from top left: least squares, Huber M-estimation, least trimmed squares, L\. Dotted lines correspond to theoretical normal approximations.
Quantiles of standard normal
Quantiles of standard normal
Quantiles of standard normal
Quantiles of standard normal
q u e stio n s a b o u t tw o o th er ca ses. C lea rly so m e fu rth er e x p lo r a tio n is n eed ed b efo re firm c o n c lu s io n s c a n b e reach ed .
■
T h e p r ev io u s ex a m p le illu stra tes th e p o in t th a t it is o fte n w o rth w h ile to in co rp o ra te ro b u st m e th o d s in to a reg ressio n a n a ly sis, b o th to h elp iso la te o u tliers an d to a ssess th e relia b ility o f c o n c lu s io n s b a sed o n th e le a st sq u ares fit to su p p o se d ly “c le a n ” d ata. In so m e areas o f a p p lic a tio n s, fo r ex a m p le th o se in v o lv in g r e la tio n sh ip s b etw e e n fin a n cia l series, lo n g -ta ile d d istrib u tio n s m a y b e q u ite c o m m o n , an d th e n ro b u st m e th o d s w ill b e e sp e c ia lly im p o rta n t. T o th e e x ten t th at th eo retica l n o r m a l a p p r o x im a tio n s are in a ccu ra te fo r m a n y ro b u st estim a tes, resa m p lin g m e th o d s are a n a tu ra l c o m p a n io n to ro b u st an a ly sis.
315
6.6 ■Bibliographic Notes
Figure 6.18 Jackknifeafter-bootstrap plot for the coefficient of tre n d in the M-estimation fit to the salinity data, omitting case 16.
* o O
CO o
■ xP O '" to 05
CNJ
© o o
8 3
CM o
LO
CO
■
9 8 2 1 14
O*
22 11 17 i
2412 21 1« 13 ISO VS 16 2 * 5
15 S 19 3 27
LO
3
-
2
-
1
0
1
2
Standardized jackknife value
6.6 Bibliographic Notes There are several com prehensive accounts o f linear regression analysis, in cluding the books by D ra p e r and Sm ith (1981), Seber (1977), and W eisberg (1985). D iagnostic m ethods are described by A tkinson (1985) and by C ook an d W eisberg (1982). A good general reference on robust regression is the book by Rousseeuw an d Leroy (1987). M any linear regression m ethods and their properties are sum m arized, with illustrations using S-Plus, in Venables an d Ripley (1994). T he use o f b o o tstra p m ethods in regression was initiated by E fron (1979). Im p o rta n t early w ork on the theory o f resam pling for linear regression was by F reedm an (1981) an d Bickel and Freedm an (1983). See also E fron (1988). F reedm an (1984) and F reedm an and Peters (1984a,b) assessed the m ethods in practical applications. W u (1986) gives a quite com prehensive theoretical treatm ent, including com parisons betw een various resam pling and jackknife m ethods; for fu rth er developm ents see Shao (1988) and Liu and Singh (1992b). H all (1989b) shows th a t b o o tstrap m ethods can provide unusually accurate confidence intervals in regression problems. T heoretical properties o f b o o tstrap significance tests, including the use o f b o th studentized pivots an d F statistics, were established by M am m en (1993). R ecent interest in resam pling tests for econom etric m odels is reviewed by Jeong an d M ad d ala (1993). Use o f the b o o tstrap for calculating prediction intervals was discussed by Stine (1985). T he asym ptotic theory for the m ost elem entary case was given by Bai and O lshen (1988). F or further theoretical developm ent see B eran (1992).
6 • Linear Regression
316
Olshen et al. (1989) described an interesting application to a com plicated prediction problem . The wild b o o tstra p is based on an idea suggested by W u (1986), and has been explored in detail by H ardle (1989, 1990) an d M am m en (1992). The effectiveness o f the wild b o o tstrap , p articularly for studentized coefficients, was dem o n strated by M am m en (1993). C ross-validation m ethods for the assessm ent o f prediction erro r have a long history, b u t m odern developm ents originated w ith Stone (1974) and Geisser (1975). W h at we refer to as K -fo ld cross-validation was proposed by Breim an et al. (1984), and further studied by B urm an (1989). Im p o rta n t theoretical results were developed by Bunke and D roge (1984), Li (1987), and Shao (1993). The theoretical fo undation o f cross-validation and b o o tstrap estim ates o f prediction error, w ith p articu lar em phasis on classification problem s, was developed in C h ap ter 7 o f E fron (1982) and by E fron (1983), the latter introducing the 0.632 estim ate. F u rth er developm ents, w ith applications and em pirical studies, were given by E fron (1986) and E fron and Tibshirani (1997). T he discussion o f hybrid estim ates in Section 6.4 is based on H all (1995). In a simple case D avison an d H all (1992) a ttem p t to explain the properties o f the b o o tstrap an d cross-validation erro r estim ates. T here is a large literature on variable selection in regression, m uch o f which overlaps w ith the cross-validation literature. C ross-validation is related to the the Cp m ethod o f linear m odel selection, proposed by M allow s (1973), and to the A IC m ethod o f A kaike (1973), as was show n by Stone (1977). F or a sum m ary discussion o f various m ethods o f m odel selection see C h apter 2 o f Ripley (1996), for exam ple. T he consistent b o o tstra p m ethods outlined in Section 6.4 were developed by Shao (1996). A sym ptotic properties o f resam pled M -estim ates were derived by Shorack (1982) w ho described the adjustm ent necessary for unbiasedness o f the re sam pled coefficients. M am m en (1989) provided additional asym ptotic sup port. A spects o f residuals from robust regression were discussed by C ook, H aw kins an d W eisberg (1992) and M cK ean, S heather and H ettsm ansperger (1993), the la tte r show ing how to standardize raw residuals in M -estim ation. De Angelis, H all and Y oung (1993) gave a detailed theoretical analysis o f m odel-based resam pling in L i estim ation, which confirm ed th a t a sm ooth b o o tstrap is advisable; fu rth er num erical results were provided by Stangenhaus (1987).
6.7 Problems 1
Show that for a multivariate distribution with mean vector pi and variance matrix Q, the influence functions for the sample mean and variance are respectively L(z) = z - f i ,
k(z) = (z - n)(z - n)T - si.
6.7 • Problems
317
Hence show that for the linear regression model derived as the conditional expec tation E (y | X = x) o f a multivariate C D F F, the empirical influence function values for linear regression parameters are h (xj , yj ) = n ( X TX ) ~ i x j eJ, where X is the matrix o f explanatory variables. (Sections 2.7.2, 6.2.2) For hom ogeneous data as in Chapter 2, the empirical influence values for an estimator can be approximated using case-deletion values. Use the matrix identity t
(* * -
(X TX ) - l x iXJ ( X TX )->
^
+
l - xJlXTXT'x,
to show that in the linear regression model with least squares fitting,
P - P - J
= (X ‘ X)-
'y j-x jP ' l-h j
Compare this to the corresponding empirical influence value in Problem 6.1, and obtain the jackknife estimates o f the bias and variance o f fa (Sections 2.7.3, 6.2.2, 6.4) 3
For the linear regression m odel y, = xjji + ej, with no intercept, show that the least squares estimate o f /? is ft = Y x jy j/ Y x j. Define residuals by ej
=
y j — xjfa
If the resampling model is y j = Xjfi + e", with e’ randomly sampled from the e;s, show that the resample estimate /T has mean and variance respectively e and x are the averages of the ej and xj.
TSei ~
«+ Z * j’
nExj
■
Thus in particular the resampling mean is incorrect. Examine the improvements made by leverage adjustment and mean correction o f the residuals. (Section 6.2.3) The usual estimated variance o f the least squares slope estimate fa in simple linear regression can be written
_ n y j - y ) 2- M U x j - x ) 2 (n ~
2) £ ( * ; -
x )2
If the x ’s and y ‘s are random permutations o f xs and ys, show that
.
U y j - y ) 2 - P ’2n x j - x ) 2 (n - 2) J2(xj ~ x)2
Hence show that in the permutation test for zero slope, the R values o f f}[ are in the same order as those o f f a / v ' 1/2, and that f a > fa is equivalent to f a /u*1/2 > f a / v lf2. This confirms that the P-value o f the permutation test is unaffected by studentizing. (Section 6.2.5)
6 • Linear Regression
318
For least squares regression, model-based resampling gives a bootstrap estimator fi' which satisfies n 7=1
where the sj are randomly sampled modified residuals. An alternative proposal is to bypass the resampling model for data and to define directly n p = $ + { x Tx r i Y t »i’ j=i where the u’s are randomly sampled from the vectors uj = xj ( y j - xJ h
j = 1......... n.
Show that under this proposal fi" has mean fi and variance equal to therobust variance estimate (6.26). Examine, theoretically or through numerical examples, to what extent the skewness of fi’ matches the skewness of fi. (Section 6.3.1; Hu and Zidek, 1995) For the linear regression model y = X p + e, the improved version of the robust estimate of variance for the least squares estimates fi is Vrob = (X TX ) - lX Tdizg(r2i, . . . , r 2n) X ( XTX ) - \ where rj is the j th modified residual. If the errors have equal variances, then the usual variance estimate v = s2^ 7* ) - 1 would be appropriate and vroi, could be quite inefficient. To quantify this, examine the case where the random errors e; are independent N(0, a2). Show first that
E(rj) = „=, Hence show that the efficiency of the ith diagonal element of vrob relative to the ith diagonal element of v, as measured by the ratio of their variances, is bl (n-p)g{Qgt where bu is the ith diagonal element of (Z TX )_1, gJ = (d^...... dfn) with D = TX)~lX T, and Q has elements (1 —h j k ) 2/ { ( 1 —/i; )(l —hk ) } . Calculate this relative efficiency for a numerical example. (Sections 6.2.4, 6.2.6, 6.3.1; Hinkley and Wang, 1991) (X
The statistical function /?(F) for M-estimation is defined by the estimating equation
J xv{
y - x Tm ' a(F)
dF(x,y) = 0,
where a(F) is typically a robust scale parameter. Assume that the model contains an intercept, so that the covariate vector x includes the dummy variable 1. Use the
hjk is the (J,k)th element of hat matrix H and hjj = hj.
6.1 ■Problems
319
technique o f Problem 2.12 to show that the influence function for fl(F) is V?(u) is d ip(u)/du.
M
^ ) = { / x x Tyj(e)dF(x, y) |
oxy>(e),
where e — (y — x Tf i ) / o ; it is assumed that sy)(e) has mean zero. If the distribution o f the covariate vector is taken to be the E D F o f x i , . . . , x „ , show that
Lp(x,y) = m k ~ 1( X TX)~1x\p(e), where X is the usual covariate matrix and k = E{ip(e)}. U se the empirical version o f this to verify the variance approximation
y-rX ) i T , V 2(ej/s) Vl = ns.2 / (X
{ £ v(ej/s)}2’ where e; = yj — x j f t and s is the estimated scale parameter. (Section 6.5) Given raw residuals e i , . . . , e n, define independent random variables ej by (6.21). Show that the first three mom ents o f ej are 0, ej, and ej. (a) Let be raw residuals from the fit o f a linear m odel y = X f t + e , and define bootstrap data by y ' = x f t + e ’ , where the elements o f s’ are generated according to the wild bootstrap. Show that the bootstrap least squares estimates ft" take at m ost 2" values, and that
E’(ft') = ft,
var'($*) = vwild = (X TX r lX TW X ( X TX ) ~ \
where W = d ia g ( e f,...,e 2). (b) Show that when all the errors have equal variances and the design is balanced, so that hj = p / n , vwiu is negatively biased as an estimate o f var(/3). (c) Show that for the simple linear regression m odel (6.1) the expected value o f var'($*) is
/r2 m2
n 2(n — 1 — m^/m\),
where mr = n~l J2(x j — x ) r. Hence show that if the x j are uniformly spaced and the errors have equal variances, the wild bootstrap variance estimate is too small by a factor o f about 1 — 14/(5n). (d) Show that if the e,- are replaced by r;, the difficulties in (b) and (c) do not arise. (Sections 6.2.4, 6.2.6, 6.3.2) Suppose that responses y i , . . . , y „ with n = 2m correspond to m independent samples o f size two, where the ith sample comes from a population with mean n t and these means are o f primary interest; the m population variances may differ. Use appropriate dummy variables x t to express the responses in the linear m odel y = X f t + e, where /?, = n t. With parameters estimated by least squares, consider estimating the standard error o f ft, by case resampling. (a) Show that the probability o f getting a simulated sample in which all the parameters are estimable is
6 ■Linear Regression
320
(b) Consider constrained case resampling in which each o f the m samples must be represented at least once. Show that the probability that there are r resample cases from the ith sample is i
^ \ // 2m \ (/ 11 \\
11 \\ 2m—r in—1 / <m / m — 1<
r /
(4
P
for r = l , . . . , m + 1. Hence calculate the resampling mean o f [ij and give an expression for its variance. (Section 6.3; Feller, 1968, p. 102) 10
For the one-way m odel o f Problem 6.9 with two observations per group, suppose that 9 = fa ~ Pi- N ote that the least squares estimator o f 9 satisfies
0
=
9
+ j (fi 3 + £4 — Si — 62).
Suppose that we use model-based resampling with the assumption o f error hom oscedasticity. Show that the resample estimate can be expressed as
1=1 where the e ' are randomly sampled from the 2m modified residuals ± ^ ( « 2 i — S 2 1-1), i = 1, . .. , m. U se this representation to calculate the first four resampling moments o f 8‘ — 9. Compare the results with the first four mom ents o f 9 — 6, and comment. (Section 6.3) 11
Suppose that a 2~r fraction o f a 28 factorial experiment is run, where 1 < r < 4. Under what circumstances would a bootstrap analysis based on case resampling be reliable? (Section 6.3)
12
The several cross-validation estimates o f prediction error can be calculated explic itly in the simple problem o f least squares prediction for hom ogeneous data with no covariates. Suppose that data y u - - - , y n and future responses y + are all sampled from a population with mean n and variance a 2, and consider the prediction rule H(F) = y with accuracy measured by quadratic error. (a) Verify that the overall prediction error is A = cr2( l + n_1), that the expectation o f the apparent error estimate is
yk 1 (=1
k=1
and hence show that E ( A c ^ ) = ff2{ l + n - ‘ + n ~ \ K — I)-1 }. Thus the bias o f A Cvj< is a 2n ‘(X — 1)
321
6.8 ■Practicals
(c) Extend the calculations in (b) to show that the adjusted estimate can be written
A acvjc = & c v x
K —K ~ l ( K — I)-2 ^ ( p * —y ) 2, k=1
and use this to show that E(AACvjc) — A. (Section 6.4; Burman, 1989) 13
The leave-one-out bootstrap estimate o f aggregate prediction error for linear prediction and squared error is equal to Abcv =
E '_j(yj - x f f t l j ) 2, j=i
where /T j is the least squares estimate o f ji from a bootstrap sample with the )th case excluded and EV denotes expectation over such samples. To calculate the mean o f ABcv, use the substitution
yj - x j p’_j = yj - x j P-j + x j (Plj - p_j), and then show that E( Y j - X j p _ j ) 2
=
^ { l + q l n - l ) - 1},
E [E'_j { X J ( P l j ~ P - j ) ( t j ~ P - j ) TX j } \
=
° 2q(n ~ 1)“ ‘ + 0 ( n ~ 2),
E H Y j-X jp^X jE ljC plj-p-j)}
=
0 ( n ~ 2).
These results combine to show that E(ABCf ) = ff2( 1 + 2qn~]) 0 ( n 2), which leads to the choice w = | for the estimate Aw = w A BCv + (1 — w)Aapp. (Section 6.4; Hall, 1995)
6.8 Practicals 1
D ataset catsM contains a set o f data on the heart weights and body weights o f 97 male cats. We investigate the dependence o f heart weight (g) on body weight (kg). To see the data, fit a straight-line regression and do diagnostic plots:
catsM p lo t(c a tsM $ B w t, catsM$Hwt, x lim = c (0,4),y lim = c (0 , 24)) c a t s . l m < - glm (H w t~Bw t,data=catsM ) su m m ary(cats. lm)
cats.diag <- glm.diag.plots(cats.lm,ret=T) The summary suggests that the line passes through the origin, but we cannot rely on normal-theory results here, because the residuals seem skewed, and their variance possibly increases with the mean. Let us assess the stability o f the fitted regression. For case resampling:
cats.fit <- function(data) coef(glm(data$Hwt~data$Bwt)) cats.case <- function(data, i) cats.fit(data[i,]) cats.bootl <- boot(catsM, cats.case, R=499) cats.bootl
322
6 ■Linear Regression
plot(cats.boot1,j ack=T) plot(cats.boot1,index=2,j ack=T) to see a summary and plots for the bootstrapped intercepts and slopes,. How normal do they seem? Is the model-based standard error from the original fit accurate? To what extent do the results depend on any single observation? We can calculate the estimated standard error by the nonparametric delta m ethod by
cats.L <- empinf(cats.bootl,type="reg") sqrt(var.linear(cats.L)) Compare it with the quoted standard error from the regression output, and from the empirical variance o f the intercepts. Are the three standard errors in the order you would expect? For model-based resampling:
cats.res <- cats.diag$res*cats.diag$sd cats.res <- cats.res - mean(cats.res) cats.df <- data.frame(catsM,res=cats.res,fit=fitted(cats.lm)) cats.model <- function(data, i) { d <- data d$Hwt <- d$fit + d$res[i] cats.fit(d) } cats.boot2 <- boot(cats.df, cats.model, R=499) cats.boot2 plot(cats.boot2) Compare the properties o f these bootstrapped coefficients with those from case resampling. How would you use a resampling m ethod to test the hypothesis that the line passes through the origin? (Section 6.2; Fisher, 1947) 2
The data o f Example 6.14 are in dataframe s u r v iv a l. For a jackknife-afterbootstrap plot for the regression slope f a :
survival.fun <- function(data, i) { d <- data[i,] d.reg <- glm(log(d$surv)”d$dose) c(coefficients(d.reg))} survival.boot <- boot(survival, survival.fun, R=999) j a c k.after.boot(survival.boot, index=2) Compare this with Figure 6.15. W hat is happening? 3
p o is o n s contains the survival times o f animals in a 3 x 4 factorial experiment. Each com bination o f three poisons and four treatments is used for four animals, the allocation to the animals being com pletely randomized. The data are standard in the literature as an example where transformation can be applied. Here we apply resampling to the data on the original scale, and use it to test whether an interaction between the two factors is needed. To calculate the test statistic, the standard F statistic, and to see its significance using the usual F test:
poison.fun <- function(data) { assignC'data. junk",data,frame=l) data.anova <- anova(glm(time~poison*treat,data=data.junk)) dev <- as.numeric(imlist(data.anova[2]))
6.8 ■Practicals
323
df <- as.numeric(unlist(data.anova[1])) res.dev <- as.numeric(unlist(data.anova[4])) res.df <- as.numeric(unlist(data.anova[3])) (dev [4] /df [4] ) / (res.dev [4] /r e s .df [4] ) > poison.fun(poisons) anova(glm(time~poison*treat,data=poisons),test="F") To apply resampling analysis, using as the null m odel that with main effects:
poison.lm <- glm(time~poison+treat,data=poisons) poison.diag <- glm.diag(poison.lm) poison.mle <- list(fit=fitted(poison.lm), res=residuals(poison.lm)/sqrt(1-poison.diagSh)) poison.gen <- function(data,mle) { i <- sample(48,replace=T) data$time <- mle$fit + mle$res[i] data > poison.boot <- boot(poisons, poison.fun, R=199, sim="parametric", r a n .gen=poison.g e n , mle=poison.mle) sum(poison.boot$t>poison.boot$tO) A t what level does this give significance? Is this in line with the theoretical value? One assumption o f the above analysis is hom ogeneity o f variances, but the data cast some doubt on this. To test the hypothesis without this assumption:
poison.genl <- function(data,mle) { i <- matrix(l:48,4,12,byrow=T) i <- apply(i,1.sample,replace=T,size=4) data$time <- mle$fit + mle$res[i] data > poison.boot <- boot(poisons, poison.fun, R=199, sim="parametric", r a n .gen=poison.genl, mle=poison.mle) sum (poison.boot$t>poison.boot$tO) W hat do you conclude now? (Section 6.3; Box and Cox, 1964) For an example o f prediction, we consider using the nuclear power station data to predict the cost o f new stations like cases 27-32, except that their value for d a te is 73. We choose to make the prediction using the m odel with all covariates. To fit that model, and to make the ‘new’ station:
nuclear.glm <- glm(log(cost)~date+log(tl)+log(t2)+log(cap)+pr+ne +ct+bw+log(cum.n)+pt,data=nuclear) nuclear.diag <- glm.diag(nuclear.glm) nuke <- data.frame(nuclear,fit=fitted(nuclear.glm), res=nuclear.diag$res*nuclear.diagSsd) nuke.p <- n u k e [32,] nuke.p$date <- 73 nuke.p$fit <- predict(nuclear.glm,nuke.p) The bootstrap function and the call to b o o t are:
nuke.pred <- function(data,i,i.p,d.p) { d <- data d$cost <- exp(d$fit+d$res[i]) d.glm <- glm(log(cost)~date+log(tl)+log(t2)+log(cap)+pr+ne
324
6 ■Linear Regression
+ct+bw+log(cum.n)+pt,data=d) predict(d.glm,d.p)-(d.p$fit+d$res[i.p]) } nuclear.boot.pred <- boot (nuke, nuke.pred,R=199,m=l,d.p=nuke.p) Finally the 95% prediction intervals are obtained by
a s .vector(exp(nuke.p$f it-quantile(nuclear.boo t .pred$t, c(0.975,0.025)))) How do these compare to those in Example 6.8? M odify the above analysis to use a studentized pivot. What effect has this change on your interval? (Section 6.3.3; Cox and Snell, 1981, pp. 81-90) 5
Consider predicting the log brain weight o f a mammal from its log body weight, using squared error cost. The data are in dataframe mammals. For an initial model, apparent error and ordinary cross-validation estimates o f aggregate prediction error:
cost <- function(y, mu=0) mean((y-mu)“2) mammals.glm <- glm(log(brain)"log(body) ,data=maminals) muhat <- fitted(mammals.glm) app.err <- cost(mammals.glm$y, muhat) mammals.diag <- glm.diag(mammals.glm) cv.err <- mean((mammals.glm$y-muhat)“2/(1-mammals,diag$h)“ 2) For 6-fold unadjusted and adjusted estimates o f aggregate prediction error: c v . e r r . 6 < - cv.glm (m am m als, mammals.glm, c o s t , K=6) Experiment with other values o f K . For bootstrap and 0.632 estimates, and plot o f error components:
mammals.pred.fun <- function(data, i, formula) { d <- data[i,] d.glm <- glm(formula,data=d) D.F.hatF <- cost(log(data$brain), predict(d.glm,data)) D.hatF.hatF <- cost(log(d$brain), fitted(d.glm)) c(log(data$brain)-predict(d.glm,data), D.F.hatF - D.hatF.hatF)} mam.boot <- boot(mammals, mammals.pred.fun, R=200, formula=formula(mammals.glm)) n <- nrow(mammals) err.boot <- app.err + mean(mam.boot$t[,n+l]) err.632 <- 0 mam.boot$f <- boot.array(mam.boot) for (i in l:n) err.632 <- err.632 + cost(mam.boot$t[mam.boot$f[,i]==0,i])/n err.632 <- 0.368*app.err + 0.632*err.632 ord <- order(mammals.diag$res) mam.pred <- mam.boot$t[,ord] mam.fac <- factor(rep(l:n,rep(200,n)) ,labels=ord) plot(mam.fac, mam.pred,ylab="Prediction errors", xlab="Case ordered by residual") What are cases 34, 35, and 32? (Section 6.4.1)
325
6.8 ■Practicals 6
The data o f Examples 6.15 and 6.16 are regression m odel with all three covariates, the influence o f case 16 on estimating this. trimmed squares estimates, and then look
in dataframe s a l i n i t y . For the linear consider the effect o f discharge d is and Resample the least squares, Li and least at the jackknife-after-bootstrap p lo ts:
salinity.r o b .fun <- function(data,i) { data.i <- data[i,] Is.fit <- lm(sal~lag+trend+dis, data=data.i) 11.fit <- llfit(data.i[,-l] ,data.i[,l]) Its.fit <- ltsreg(data.i[,-l] ,data.i[,l]) c(ls.fit$coef,ll.fit$coef,I t s .fit$coef) } salinity.boot <- boot(salinity,salinity.rob.fun,R=1000) j ack.after.boot(salinity.boot,index=4) jack.after.boot(salinity.boot,index=8) j a ck.after.boot(salinity.boot,index=12) W hat conclusions do you draw from these plots about (a) the shapes o f the distributions o f the estimates, (b) comparisons between the estimation methods, and (c) the effects o f case 16? One possible explanation for case 16 being an outlier with respect to the multiple linear regression model used previously is that a quadratic effect in d is c h a r g e should be added to the model. We can test for this using the pivot m ethod with least squares estimates and case resampling:
salinity.quad.fun <- function(data, i) { data.i <- data[i,] Is.fit <- lm(sal~lag+trend+poly(dis,2), data=data.i) Is.sum <- summary(ls.fit) ls.std <- sqrt(diag(Is.sum$cov))*ls.sum$sigma c(ls.fit$coef, ls.std) > salinity.boot <- boot(salinity, salinity.quad.fun, R=99) quad.z <- salinity.boot$t0[5]/salinity.boot$tO[10] quad.z. stair <- (salinity ,boot$t [,5]-salinity .boot$t0[5] )/ salinity.boot$t[,10] (1+sum(quad.z
qqnorm(salinity.boot$t[,5],ylab="discharge quadratic coefficient") qqnorm(quad.z.star, ylab="discharge quadratic z statistic") Is it reasonable to use least squares estimates here? See whether or not the same conclusion would be reached using other methods o f estimation. (Section 6.5; Ruppert and Carroll, 1980; Atkinson, 1985, p. 48)
7 Further Topics in Regression
7.1 Introduction In C h ap ter 6 we showed how the basic b o o tstra p m ethods o f earlier chapters extend to linear regression. The b ro ad aim o f this ch ap ter is to extend the discussion further, to various form s o f nonlinear regression m odels — espe cially generalized linear m odels an d survival m odels — and to nonparam etric regression, where the form o f the m ean response is n o t fully specified. A particu lar feature o f linear regression is the possibility o f error-based resam pling, w hen responses are expressible as m eans plus hom oscedastic errors. T his is p articularly useful w hen o u r objective is prediction. F or generalized linear m odels, especially for discrete data, responses can n o t be described in term s o f additive errors. Section 7.2 describes ways o f generalizing error-based resam pling for such m odels. The corresponding developm ent for survival d a ta is given in Section 7.3. Section 7.4 looks briefly at nonlinear regression with additive error, m ainly to illustrate the useful co n trib u tio n th a t resam pling m ethods can m ake to analysis o f such models. T here is often a need to estim ate the poten tial accuracy o f predictions based on regression models, and Section 6.4 contained a general discussion o f resam pling m ethods for this. In Section 7.5 we focus on one type o f application, the estim ation o f misclassification rates w hen a binary response y corresponds to a classification. N o t all relationships betw een a response y an d covariates x can be readily m odelled in term s o f a p aram etric m ean function o f know n form. A t least for exploratory purposes it is useful to have flexible nonparam etric curvefitting m ethods, an d there is now a wide variety o f these. In Section 7.6 we exam ine briefly how resam pling can be used in conjunction w ith som e o f these n onparam etric regression m ethods.
326
327
7.2 • Generalized Linear Models
7.2 Generalized Linear Models 7.2.1 Introduction T he generalized linear m odel extends the linear regression m odel o f Section 6.3 in two ways. First, the distrib u tio n o f the response Y has the property th a t the variance is an explicit function o f the m ean n, v a r(Y ) = w here V(-) is the know n variance function and 4> is the dispersion param eter, w hich m ay be unknow n. T his includes the im p o rtan t cases o f binom ial, Poisson, an d gam m a d istributions in add ition to the norm al distribution. Secondly, the linear m ean structure is generalized to gO*) =
t],
n = x Tp,
w here g(-) is a specified m o notone link function which “links” the m ean to the linear predictor rj. As before, x is a {p + 1) x 1 vector o f explanatory variables associated w ith Y. The possible com binations o f different variance functions an d link functions include such things as logistic and probit regression, an d loglinear m odels for contingency tables, w ithout m aking ad-hoc transform ations o f responses. T he first extension was touched on briefly in Section 6.2.6 in connection w ith w eighted least squares, which plays a key role in fitting generalized linear m odels. T he second extension, to linear m odels for transform ed m eans, represents a very special type o f nonlinear model. W hen independent responses y } are obtained with explanatory variables Xj, the full m odel is usually taken to be E (Yj) = Hj,
g(nj) = x j p ,
\ a i ( Y j ) = KCjV(fij),
(7.1)
w here k m ay be unknow n and the c j are know n weights. F or example, for binom ial d a ta w ith probability n(xj) and d en om inator m7, we take c; = l/ m ; ; see Exam ple 7.3. T he co n stant k equals one for binom ial, Poisson an d exponential data. N otice th a t (7.1) strictly only specifies first and second m om ents o f the responses, an d in th a t sense is a sem iparam etric model. So, for exam ple, we can m odel overdispersed count d a ta by using the Poisson variance function V(fi) = n b u t allow ing k to be a free overdispersion param eter which is to be estim ated. O ne im p o rta n t p o in t a b o u t generalized linear m odels is the non-unique definitions o f residuals, an d consequent non-uniqueness o f nonparam etric re sam pling algorithm s. A fter illustrating these ideas w ith an exam ple we briefly review the m ain aspects o f generalized linear models. We then go on to discuss resam pling m ethods.
7 ■Further Topics in Regression
328
G ro u p 1 C ase
X
y
Case
X
y
1 2 3 4 5
3.36 2.88 3.63 3.41 3.78
65 156 100 134 16
18 19 20 21 22
3.64 3.48 3.60 3.18 3.95
56 65 17 7 16
6 7 8 9 10 11 12
4.02 4.00 4.23 3.73 3.85 3.97
108 121 4 39 143 56 26 22 1 1 5 65
23 24 25 26 27 28 29 30 31 32 33
3.72 4.00 4.28 4.43 4.45 4.49 4.41
22 3 4 2 3 8 4
4.32 4.90 5.00 5.00
3 30 4 43
13 14 15 16 17
4.51 4.54 5.00 5.00 4.72 5.00
Table 7.1 Survival times y (weeks) for two groups of acute leukaemia patients, together with x = log10 white blood cell count (Feigl and Zelen, 1965).
G ro u p 2
Exam ple 7.1 (Leukaem ia d a ta ) Table 7.1 contains d a ta on the survival times in weeks o f tw o groups o f acute leukaem ia victims, as a function o f their w hite blood cell counts. A simple m odel is th a t w ithin each group survival tim e Y is exponential w ith m ean /i = exp(/?o + Pix), where x = log10(white blood cell count). T hus the link function is logarithm ic. T he intercept is different for each group, b u t the slope is assum ed com m on, so the full m odel for the- yth response in group i is E (Y y) = Hij,
lo g (^ y ) = p Qi + pi Xj j ,
v a r(Y y ) = K(/Zy) = /X2,
T he fitted m eans p. an d the d a ta are show n in the left panel o f Figure 7.1. The m ean survival tim es for group 2 are shorter th a n those for group 1 at the same white blood cell count. U nder this m odel the ratios Y / n are exponentially distributed with unit m ean, an d hence the Q -Q p lo t o f y y //iy against exponential quantiles in the right panel o f Figure 7.1 w ould ideally be a straight line. System atic curvature m ight indicate th a t we should use a gam m a density w ith index v, y v_1vv / vv\ f i y l ^ v) = J ? T w e x p \ j ) ’
y>0,
^ V>a
In this case v ar(Y ) = /i2/v , so the dispersion p aram eter is taken to be and Cj = 1. In fact the exponential m odel seems to fit adequately.
k
= 1/v ■
329
7.2 ■Generalized Linear Models
Figure 7.1 Summary plots for fits of an exponential model fitted to two groups of survival times for leukaemia patients. The left panel shows the times and fitted means as a function of their white blood cell count (group 1, fitted line solid; group 2, fitted line dots). The right panel shows an exponential Q-Q plot of the y/\i.
Log10 white blood cell count
Quantile of standard exponential
7.2.2 Model fitting and residuals Estimation Suppose th a t in dependent d a ta (x i , y i ) , . . . , ( x n, y n) are available, w ith response m ean and variance described by (7.1). I f the response distributions are assum ed to be given by the corresponding exponential fam ily m odel, then the m axim um likelihood estim ates o f the regression param eters ji solve the (p + 1) x 1 system o f estim ating equations
P ' C jV (fij)
tg iSn j ) = 0’
™
w here g(n) = dr\/dp. is the derivative o f the link function. Because the dis persion p aram eters are tak en to have the form k c j , the estim ate fi does n o t depend on k . N ote th a t although the estim ates are derived as m axim um like lihood estim ates, their values depend only upon the regression relationship as expressed by the assum ed variance function and the link function and choice o f covariates. T he usual m ethod for solving (7.2) is iterative weighted least squares, in which a t each iteration the adjusted responses zj = t]j+ (yj — /ij)g(nj) are regressed on the x; w ith weights wj given by w j l = c j V(fij)g2(fiJ)-
(7.3)
all these quantities are evaluated at the cu rren t values o f the estim ates. The weighted least squares equation (6.27) applies at each iteration, w ith y replaced by the adjusted dependent variable z. The approxim ate variance m atrix for p
7 • Further Topics in Regression
330 is given by the analogue o f (6.24), nam ely
var(j?) = k ( X t W X ) ~ 1,
(7.4)
with the diagonal weight m atrix W = d ia g (w i,...,w „ ) evaluated at the final fitted values p.j. The corresponding ‘h a t’ m atrix is H =
X ( X T W X ) ~ lX T W ,
(7.5)
as in (6.28). T he relationship o f H to fitted values is rj = H z , where z is the vector o f adjusted responses. N ote th a t in general W , and hence H, depends upon the fitted values. T he residual vector e = y —fi has approxim ate variance m atrix (I — H )v ar(Y ), this being exact only for linear regression w ith know n
W. W hen the dispersion p aram eter o f residual m ean square, ft-
k
1
is unknow n, it is estim ated by the analogue
y ' to - * #
n - p - l j j
CjVfrj) ■
(7 6 )
F or a linear m odel w ith V{y) = 1 an d dispersion p aram eter k = a 2, this gives k = s2, the residual m ean square. Let tj(iij) denote the co n trib u tio n th a t the jih observation m akes to the overall log likelihood /(/i), param etrized in term s o f the m eans Hj. T hen the fit o f a generalized linear m odel is m easured by the deviance D = 2 k {t (y) - a m
= 2k £
{tj (yj ) - 1 0 }) ) ,
(7.7)
j
which is the scaled difference betw een the m axim ized log likelihoods for the saturated m odel — which has a p aram eter for each observation — and the fitted model. T he deviance corresponds to the residual sum o f squares in the analysis o f a linear regression model. F or exam ple, there are large reductions in the deviance w hen im p o rtan t explanatory variables are added to a m odel, and com peting m odels m ay be com pared via their deviances. W hen the fitted m odel is correct, the scaled deviance k ~ 1D will som etim es have an approxim ate chi-squared distrib u tio n on n — p — 1 degrees o f freedom , analogous to the rescaled residual sum o f squares in a norm al linear model. Significance tests Individual coefficients /?; can be tested using studentized estim ates, with stan dard errors estim ated using (7.4), w ith k replaced by the estim ate k if necessary. The null distrib u tio n s o f these studentized estim ates will be approxim ately stan d ard norm al, b u t the accuracy o f this ap proxim ation can be open to question. Allowance for estim ation o f k can be m ade by using the t distribution with
Some authors prefer to work with X'(X'TX')-'X 'V2, where X' = W ^ X .
331
7.2 ■Generalized Linear Models
n —p — 1 degrees o f freedom , as is justifiable for norm al-theory linear regression, b u t in general the accuracy is questionable. T he analogue o f analysis o f variance is the analysis o f deviance, wherein differences o f deviances are used to m easure effects. To test w hether or not a p articu lar subset o f covariates has no effect on m ean response, we use as test statistic the scaled difference o f deviances, D for the full m odel w ith p covariates an d Do for the reduced m odel w ith po covariates. If k is know n, then the test statistic is Q = (Do — D) /k. A pproxim ate properties o f log likelihood ratio s im ply th a t the null distribution o f Q is approxim ately chi-squared, with degrees o f freedom equal to p — po, the n u m ber o f covariate term s being tested. I f k is estim ated for the full m odel by fc, as in (7.6), then the test statistic is Q = (Do - D) / k.
(7.8)
In the special case o f linear regression, (p — po)~l Q is the F statistic, and this m otivates the use o f the Fp- pa
7 = !.•••,» •
(7.9)
The standardized Pearson residuals are essentially scaled versions o f the m o d ified residuals defined in (6.29), except th a t the denom inators o f (7.9) may depend on the p aram eter estim ates. In large sam ples one would expect the rpj to have m ean an d variance approxim ately zero and one, as they do for linear regression models. In general the Pearson residuals inherit the skewness o f the responses them selves, which can be considerable, and it m ay be b etter to standardize a transform ed response. O ne way to do this is to define standardized residuals on the linear predictor scale, t w - m {cjkg2( t i j ) V ( j i j ) ( l - h j ) }
(7 10)
F o r discrete d a ta this definition m ust be altered if g (yj) is infinite, as for
332
7 ■Further Topics in Regression
exam ple w hen g(y) = lo g y and y = 0. F o r a non-identity link function one should n o t expect the m ean and variance o f rLj to be approxim ately zero and one, unless k is unusually sm all; see Exam ple 7.2. A n alternative ap p ro ach to defining residuals is based on the fact th a t in a linear m odel the residual sum o f squares equals the sum o f squared residuals. This suggests th a t residuals for generalized linear m odels can be constructed from the contributions th a t individual observations m ake to the deviance. Suppose first th a t k is know n. T hen the scaled deviance can be w ritten as
where dj = d(y; , fij) is the signed square root o f the scaled deviance contribution due to the yth case, the sign being th a t o f y,- — frj. T he deviance residual is dj. D efinition (7.7) implies th a t dj = sign(y, - £ ; )[2{ /,(y y) - <0 (£j)}]1/2W hen / is the norm al log likelihood an d k = o 2 is unknow n, D is scaled by k = s2 rath er th a n k before defining dj. Sim ilarly for the gam m a log likelihood; see Exam ple 7.2. In practice standardized deviance residuals TDi
( l - h j ) V 2’
j
(7.11)
are m ore com m only used th a n the unadjusted dj. F or the linear regression m odel o f Section 6.3, r Dj is p roportional to the m odified residual (6.9). F o r o th er m odels the r Dj can be seriously biased, but once bias-corrected they are typically closer to stan d ard norm al th an are the r Pj or r LJ. One general point to note ab o u t all o f these residuals is th a t they are scaled, implicitly o r explicitly, unlike the m odified residuals o f C h ap ter 6. Quasilikelihood estimation As we have noted before, only the link an d variance functions m ust be specified in order to find estim ates ft and approxim ate stan d ard errors. So although (7.2) and (7.6) arise from a param etric m odel, they are m ore generally applicable — ju st as least squares results are applicable beyond the norm al-theory linear model. W hen n o response distribution is assum ed, the estim ates ft are referred to as quasilikelihood estim ates, and there is an associated theory for such estim ates, although this is n o t o f concern here. T he m ost com m on application is to d a ta w ith a response in the form o f counts or proportions, which are often found to be overdispersed relative to the Poisson or binom ial distributions. One approach to m odelling such d a ta is to use the variance function appropriate to binom ial or Poisson data, but to allow the dispersion param eter k to be a free param eter, estim ated by (7.6). This estim ate is then used in calculating stan d ard errors for ft and residuals, as indicated above.
333
7.2 • Generalized Linear Models
7.2.3 Sam pling plans Param etric sim ulation for a generalized linear m odel involves sim ulating new sets o f d a ta from the fitted param etric model. It has the usual disadvantage o f the p aram etric bootstrap, th a t datasets generated from a poorly fitting m odel m ay n o t have the statistical properties o f the original data. This applies particularly w hen count d a ta are overdispersed relative to a Poisson o r binom ial m odel, unless the overdispersion has been m odelled successfully. N o n p aram etric sim ulation requires generating artificial d a ta w ithout assum ing th a t the original d a ta have som e p articular param etric distribution. A com pletely nonparam etric approach is to resam ple cases, which applies exactly as described in Section 6.2.4. However, it is im p o rtan t to be clear w hat a case is in any p articu lar application, because count and pro p o rtio n d ata are often aggregated from larger datasets o f independent variables. Provided th a t the m odel (7.1) is correct, as w ould be checked by appropriate diagnostic m ethods, it m akes sense to use the fitted m odel and generalize the sem iparam etric approach o f resam pling errors, as described in Section 6.2.3. We focus now on ways to do this. Resampling errors T he simplest approach mimics the linear m odel sam pling scheme b u t allows for the different response variances, ju st as in Section 6.2.6. So we define sim ulated responses by y ’j = fij + {cjkV{p.j)Yl2t),
j = l,...,n,
(7.12)
where £ j,...,e * is a ran d o m sam ple from the m ean-adjusted, standardized Pearson residuals r Pj — r P w ith r Pj defined at (7.9). N ote th a t for count d a ta we are n o t assum ing k = 1. This resam pling scheme duplicates the m ethod o f Section 6.2.6 for linear m odels, where the link function is the identity. Because in general there is no explicit function connecting response yj to ran d o m erro r Sj, as there is for linear regression models, the resam pling scheme (7.12) is n o t the only approach, and som etim es it is n o t suitable. One alternative is to use the sam e idea on the linear predictor scale. T h a t is, we generate b o o tstra p d a ta by setting y ) = g ~ l x Tp + g{fij){cjkV{fij)}1/2£ ^ ,
In these first two resampling schemes the scale factor k~l/2 can be omitted provided it is omitted from both the residual definition and from the definition of
/•
j = l,...,n ,
(7.13)
where g _1(') is the inverse link function and £ j,...,e * is a b o o tstrap sample from the residuals r L U . . . , r Ln defined at (7.10). H ere the residuals should n o t be m ean-adjusted unless g( ) is the identity link, in which case r Lj = r Pj and the two schemes (7.12) an d (7.13) are the same. A th ird ap p ro ach is to use the deviance residuals as surrogate errors. If the deviance residual dj is w ritten as d{yj,p.j), then im agine th a t corresponding ran d o m errors ej are defined by ej = d(yj,fij). The distribution o f these £_,■
334
7 • Further Topics in Regression
is estim ated by the E D F o f the standardized deviance residuals (7.11). This suggests th a t we construct a b o o tstrap sam ple as follows. R andom ly sam ple from r o i , . .. , rD n and let y |,...,y * be the solutions to ej = d(yj,fij),
j = 1 ,..., n.
(7.14)
This also gives the m ethod o f Section 6.2.3 for linear models, except for the m ean adjustm ent o f residuals. N one o f these three m ethods is perfect. O ne obvious draw back is th a t they can all give negative or non-integer values o f y ' when the original d ata are non-negative integer counts. A simple fix for discrete responses is to round the value o f y j from (7.12), (7.13), or (7.14) to the nearest appropriate value. For count d a ta this is a non-negative integer, and if the response is a proportion w ith d en o m in ato r m, it is a nu m b er in the set 0 , 1 /m ,2 /m ,. . . , 1. However, rounding can appreciably increase the p ro p o rtio n o f extrem e values o f y ' for a case w hose fitted value is n ear the end o f its range. A sim ilar difficulty can occur w hen responses are positive w ith V(fi) = Kfi2, as in Exam ple 7.1. T he Pearson residuals are K~l/2(yj — fij)/p.j, all necessarily greater th a n —k ~ 1^2. But the standardized versions rpj are n o t so constrained, so th a t the result yj = fij( 1 + /c1/2e*) from applying (7.12) can be negative. The obvious fix is to tru n cate y j at zero, b u t this m ay distort the distribution o f y ', and so is n o t generally recom m ended. Example 7.2 (Leukaemia data) F or the d a ta introduced in Exam ple 7.1 the p aram etric m odel is gam m a w ith log likelihood contributions tij(Hij) - —K^'OogOxy) + yij/Hij), and the regression is additive on the logarithm ic scale, log(/zi;) = /?0i + /?ixy. The deviance for the fitted m odel is D = 40.32 w ith 30 degrees o f freedom , and equation (7.6) gives k = 1.09. The deviance residuals are calculated w ith k set equal to k , dtj = sign(ziy -
l ) { 2 k ~ l (zij
- 1 - logz,7)}1/2,
where zy = y y /£ y . The corresponding standardized values rDi,; have sam ple m ean an d variance respectively —0.37 an d 1.15. The Pearson residuals are k-
1 /2 ( z ,7 -
1 ).
T he Zjj w ould be approxim ately a sam ple from the stan d ard exponential distribution if in fact k = 1, and the right-hand panel o f Figure 7.1 suggests th a t this is a reasonable assum ption. O ur basic p aram etric m odel for these d a ta sets k = 1 and puts Y = fie, where £ has an exponential distrib u tio n w ith unit m ean. Hence the param etric b o o tstrap involves sim ulating exponential d a ta from the fitted m odel, th a t is setting y * = fie', where em is stan d ard exponential. A slightly m ore cautious
335
7.2 ■Generalized Linear M odels Table 7.2 Lower and upper limits of 95% studentized bootstrap confidence intervals for A i and 0 i for leukaemia data, based on 999 replicates of different simulation schemes.
Poi
E xponential L inear p redictor, r i D eviance, rp Cases
Pi
Lower
Upper
Lower
Upper
5.16 3.61 5.00 0.31
11.12 10.58 11.10 8.78
-1.42 -1.53 -1.46 -1.37
-0.04 0.17 0.02 0.81
ap p ro ach would be to generate gam m a d a ta with m ean p. and index /c_1, b u t we shall n o t d o this here. F o r nonparam etric sim ulation, we consider all three schemes described earlier. First, w ith variance function V(fi) = k /x2, the Pearson residuals are k ~ 1/2(y — p)/p- R esam pling Pearson residuals via (7.12) would be equivalent to setting y * = p.e*, where e* is sam pled at random from the zs (Problem 7.2). However, (7.12) can n o t be used w ith the standardized Pearson residuals rp, because negative values o f y * will occur, possibly as low as —4. T runcation at zero is n o t a sufficient rem edy for this. F or the second resam pling scheme (7.13), the logarithm ic link gives y ’ = j l c x p ( k 1/2e’ ), where e* is random ly sam pled from the rLs which here are given by n — /c-1/2( 1 — h)~l/2 log(z). The sam ple m ean and variance o f rL are —0.61 an d 1.63, in very close agreem ent w ith those for the logarithm o f a standard exponential variate. It is im p o rta n t th a t no m ean correction be m ade to the r^ To im plem ent the b o o tstrap for deviance residuals, the scheme (7.14) can be simplified as follows. We solve the equations d(zj, 1) = rDj for j = to obtain z i ,...,z „ , and then set y* = /t,£* for j = 1 where is a b o o tstrap sam ple from the zs (Problem 7.2). Table 7.2 shows 95% studentized b o o tstrap confidence intervals for /foi (the intercept for G ro u p 1) an d Pi using these schemes w ith R = 999. T he variance estim ates used are from (7.4) rath er th a n the nonparam etric delta m ethod. T he intervals for the three m odel-based schemes are very similar, while those for resam pling cases are ra th e r different, particularly for pi, for which the b o o tstrap distrib u tio n o f the studentized statistic is very non-norm al. Figure 7.2 com pares sim ulated deviances w ith quantiles o f the chi-squared distribution. N aive asym ptotics would suggest th a t the scaled deviance kD has approxim ately a chi-squared distribution on 30 degrees o f freedom , b u t these asym ptotics — w hich apply as k —>0 — are clearly n o t useful here, even w hen d a ta are in fact generated from the exponential distribution. T he fitted deviance o f 40.32 is n o t extrem e, and the variation o f the sim ulated estim ates
7 • Further Topics in Regression
336
Figure 7.2 Leukaemia data. Chi-squared Q-Q plots of simulated deviances for parametric sampling from the fitted exponential model (left) and case resampling (right).
Quantile of chi-squared distribution
Quantile of chi-squared distribution
k ’ is large enough th a t the observed value k = 1.09 could easily occur by chance if the d a ta were indeed exponential. ■ Comparison o f resampling schemes To com pare the perform ances o f the resam pling schemes described above in setting confidence intervals, we conducted a series o f M onte C arlo experim ents, each based on 1000 sets o f d a ta o f size n = 15, w ith linear predictor r\ = Po + Pix. In the first experim ent, the values o f x were generated from a distribution uniform on the interval (0, 1), we to o k po = Pi = 4, and responses were generated from the exponential distribution with m ean exp(^). Each sam ple was then b o o tstrap p ed 199 times using case resam pling and by m odelbased resam pling from the fitted m odel, w ith variance function V(/j) = /i2, by applying (7.13) and (7.14). F or each o f these resam pling schemes, various confidence intervals were obtained for param eters Po, Pi, tpi = PoPi and V 2 = Po/Pi- T he confidence intervals used were: the stan d ard interval based on the large-sam ple norm al distrib u tio n o f the estim ate, using the usual rath er th an a robust stan d ard erro r; the interval based on a norm al approxim ation w ith bias an d variance estim ated from the resam ples; the percentile and B C a intervals; an d the basic b o o tstrap and studentized b o o tstrap intervals, the la tter using n o n p aram etric delta m ethod variance estim ates. The first p a rt o f Table 7.3 shows the em pirical coverages o f nom inal 90% confidence intervals for these com binations o f resam pling scheme, m ethod o f interval construction, and param eter. The second experim ent used the sam e design m atrix, linear predictor, and m odel-fitting an d resam pling schemes as the first, b u t the d a ta were generated from a lognorm al m odel w ith m ean exp(t]) and u n it variance on the log scale.
7.2 • Generalized Linear Models Table 7 3 Empirical coverages (%) for four parameters based on applying various resampling schemes with R = 199 to 1000 samples of size 15 generated from various models. Target coverage is 90%. The first two sets of results are for an exponential model fitted to exponential and lognormal data, and the second two are for a Poisson model fitted to Poisson and negative binomial data. See text for details.
337
Cases
rL o r rp
ro
Po
Pi
V>1
xp2
S tan d ard N o rm al Percentile BCa Basic S tu d en t
85 88 85 84 86 89
86 89 87 86 88 89
89 92 83 82 87 86
85 90 89 86 84 81
85 88 86 86 86 92
86 89 89 88 89 92
89 90 86 83 86 89
86 89 89 88 83 84
85 87 86 86 85 92
86 89 88 88 89 92
90 90 86 83 87 89
86 89 89 88 83 84
S tan d ard N o rm al Percentile BCa Basic S tudent
79 81 80 78 78 84
79 81 84 83 78 85
82 84 73 72 82 82
81 85 85 81 78 74
79 81 80 80 81 90
78 80 82 80 80 88
82 84 77 74 83 84
82 84 83 79 80 79
79 82 80 79 80 90
78 80 81 81 81 88
82 84 76 74 84 84
82 84 82 80 80 79
S ta n d a rd N o rm al Percentile BCa Basic S tudent
90 88 87 86 87 95
90 88 87 86 87 90
91 88 85 82 85 80
90 88 86 86 87 92
89 87 89 88 87 90
90 86 88 87 87 89
92 88 88 85 88 89
90 88 88 87 88 89
89 87 90 88 86 90
91 93 94 94 92 93
92 97 97 96 97 92
91 93 91 91 92 91
S tan d ard N o rm al Percentile BCa Basic S tu d en t
69 87 85 85 86 93
64 84 86 85 84 87
59 86 84 80 83 82
70 90 86 85 85 87
69 88 90 88 88 89
63 84 86 83 84 89
59 84 82 77 83 85
69 89 88 86 87 85
67 87 90 87 87 89
64 89 91 89 89 93
60 92 93 88 91 90
71 94 91 89 91 85
Po
Pi
Vl
tp2
Po
Pi
Vl
V>2
The th ird experim ent used the same design m atrix as the first two, b u t linear predictor rj = Pq + P\x, w ith Po — Pi = 2 and Poisson responses w ith m ean H = exp (rj). T he fourth experim ent used the same m eans as the third, b u t had negative binom ial responses w ith variance function \x + /i2/1 0 . The b o o tstrap schemes for these two experim ents were case resam pling and m odel-based resam pling using (7.12) an d (7.14). Table 7.3 shows th at while all the m ethods tend to undercover, the standard m ethod can be disastrously b ad w hen the random p a rt o f the fitted m odel is incorrect, as in the second an d fourth experim ents. The studentized m ethod generally does b etter th a n the basic m ethod, b u t the B C a m ethod does not im prove on the percentile intervals. T hus here a m ore sophisticated m ethod does n o t necessarily lead to b etter coverage, unlike in Section 5.7, and in p articu lar there seems to be no reason to use the B C a m ethod. Use o f the studentized interval on an o th er scale m ight im prove its perform ance for the ratio \p2 , for which the sim pler m ethods seem best. As far as the resam pling schemes are concerned, there seems to be little to choose betw een the m odel-
7 • Further Topics in Regression
338
based schemes, which im prove slightly on b o o tstrap p in g cases, even when the fitted variance function is incorrect. We now consider an im p o rtan t caveat to these general com m ents. Inhomogeneous residuals F or some types o f d a ta the standardized Pearson residuals m ay be very inhom ogeneous. If y is Poisson w ith m ean fi, for example, the distribution o f (y — f i ) / n l/1 is strongly positively skewed w hen n < increasingly sym m etric as fi increases. T hus w hen a set o f large and sm all counts, it is unwise to treat the rP as possibility for such d a ta is to apply (7.12) b u t w ith fitted the estim ated skewness o f their residuals.
I, b u t it becom es d a ta contains both exchangeable. One values stratified by
Example 7.3 (Sugar cane) Carvao da cana-de-aqucar — coal o f sugar cane — is a disease o f sugar cane th a t is com m on in some areas o f Brazil, and its effects on p roduction o f the crop have led to a search for resistant varieties o f cane. We use d a ta kindly provided by D r C. G. B. D em etrio o f Escola Superior de A gricultura, U niversidade de Sao Paulo, from a random ized block experim ent in which the resistance to the disease o f 45 varieties o f cane was com pared in four blocks o f 45 plots. Fifty stems from a variety were p u t in a solution containing the disease agent, an d then plan ted in a plot. A fter a fixed period, the to tal num b er o f shoots appearing, m, an d the n um ber o f diseased shoots, r, were recorded for each plot. T hus the d a ta form a 4 x 45 layout o f pairs (m, r). T he purpose o f analysis was to identify the m ost resistant varieties, for further investigation. A simple m odel is th a t the nu m b er o f diseased shoots ry for the ith block and / t h variety is a binom ial ran d o m variable w ith d en o m inator my and probability nij. F or the generalized linear m odel form ulation, the responses are taken to be y tj = rij/niij so th a t the m ean response fiij is equal to the probability 7iy th at a sho o t is diseased. Because the variance o f Y is 7t( l — n) /m, the variance function is V(n) = fi(\ — fi) an d the dispersion p aram eters are (fi = 1/m , so th at in the tw o-w ay version o f (7.1), cy = 1/my and k = 1. The probability o f disease for the ith block an d / t h variety is related to the linear predictor tjij = a,+ Pj through the logit link function t] = log { n / ( l — 7i)}. So the full m odel for all d a ta is E(Yij) v ar(Ytj)
- fiij,
fiij = exp(a, + Pj)/ {1 + exp(a,
= m-jl V(fiij),
+ P j) } ,
V(fitj) = /i,7(l - fi,j).
Interest focuses on the varieties w ith sm all values o f Pj, which are likely to be the m ost resistant to the disease. F or an adequate fit, the deviance would roughly be distributed according to a X m d istrib u tio n ; in fact it is 1142.8. This indicates severe overdispersion relative to the model.
7.2 • Generalized Linear Models
Figure 7 3 Model fit for the cane data. The left panel shows the estimated variety effects £i + for block 1: varieties 1 and 3 are least resistant, and 31 is most resistant. The lines show the levels on the logit scale corresponding to n = 0.5, 0.2, 0.05 and 0.01. The right panel shows standardized Pearson residuals rp plotted against etj + pj; the lines are at 0, ±3.
339
o o £
CO
Q.
■&1 ■c <0 >
10
20
30
40
Variety
eta
T he left panel o f Figure 7.3 shows estim ated variety effects for block 1. Varieties 1 an d 3 are least resistant to the disease, while variety 31 is m ost resistant. T he right panel shows the residuals plotted against linear predictors. T he skewness o f the rP drops as rj increases. Param etric sim ulation involves generating binom ial observations from the fitted m odel. This greatly overstates the precision o f conclusions, because this m odel clearly does n o t reflect the variability o f the data. We could instead use the beta-binom ial distribution. Suppose that, conditional on n, a response is binom ial w ith den o m in ato r m an d probability n, b u t instead o f being fixed, n is taken to have a b eta distribution. T he resulting response has unconditional m ean and variance m il,
m ll(l - n ) { l + ( m - 1)0},
(7.15)
where n = E(7t) and <j) > 0 controls the degree o f overdispersion. Param etric sim ulation from this m odel is discussed in Problem 7.5. Two variance functions for overdispersed binom ial d a ta are V\{n) =
340
7 • Further Topics in Regression
Figure 7.4 Resampling results for cane data. The left panel shows (left to right) simulated deviance/degrees of freedom ratios for fitted binomial and beta-binomial models, a nonparametric bootstrap, and a nonparametric bootstrap with residuals stratified by varieties; the dotted line is at the data ratio 8.66 = 1142.8/132. The right panel shows the variety effects in 200 replicates of the stratified nonparametric resampling scheme.
Variety
A
fij. The d o tted line shows the observed ratio. T he binom ial results are clearly quite inappropriate, those for the beta-binom ial an d unstratified sim ulation are better, an d those for the stratified sim ulation are best. To explain this, we retu rn to the right panel o f Figure 7.3. This shows th a t the residuals are n o t hom ogeneous: residuals for observations with sm all values o f rj are m ore positively skewed th a n those for larger values. This reflects the varying skewness o f binom ial data, which m ust be taken into account in the resam pling scheme. The right panel o f Figure 7.4 shows the estim ated variety effects for the 200 sim ulations from the stratified sim ulation. Varieties 1 and 3 are m uch less resistant th a n the others, b u t variety 31 is not m uch m ore resistant th an 11, 18, and 23; o th er varieties are close behind. As m ight be expected, results for the binom ial sim ulation are m uch less variable. T he unstratified resam pling scheme gives large negative estim ated variety effects, due to inappropriately large negative residuals being used. To explain this, consider the right panel o f Figure 7.3. In effect the unstratified scheme allows residuals from the right h alf o f the panel to be sam pled an d placed at its left-hand end, leading to negative sim ulated responses th a t are rounded u p to zero: the varieties for which this happens seem spuriously resistant. Finer stratification o f the residuals seems unnecessary for this application. ■
7.2.4 Prediction In Section 6.3.3 we showed how to use resam pling m ethods to obtain prediction intervals based on a linear regression fit. T he sam e idea can be applied here.
7.2 • Generalized Linear Models
341
Beyond having a suitable resam pling algorithm to produce the appropriate variation in p aram eter estim ates, we m ust also produce suitable response variation. In the linear m odel this is provided by the E D F o f standardized residuals, which estim ates the C D F o f hom oscedastic errors. N ow we need to be able to produce the correct heteroscedasticity. Suppose th a t we w ant to predict the response Y+ at x+, w ith a prediction interval. O ne possible poin t prediction is the regression estim ate K = g -'ix lh although it w ould often be wise to m ake a bias correction. F o r the prediction interval, let us assum e for the m om ent th a t some m onotone function 5 ( Y , n ) is hom oscedastic, w ith pth quantile ap, and th a t the m ean value o f Y+ is known. T hen the 1 — 2a prediction interval should be the values y+>a, y + ,i-a w here y +iP satisfies <5(y,/i+) = ap. If n is estim ated by p. independently o f Y+ and if 3{Y+,fi) has know n quantiles, then the sam e m ethod applies. So the ap p ro p riate b o o tstrap m ethod is to estim ate quantiles o f <5(7+,/t), and then set 5 (y,n+) equal to the estim ated a and 1 — a quantiles. T he function d ( Y, f i ) will correspond to one o f the definitions o f residuals, and the boo tstrap algorithm will use resam pling from the corresponding standardized residuals, whose hom oscedasticity is critical. The full resam pling algorithm , which generalizes A lgorithm 6.4, is as follows. Algorithm 7.1 (Prediction in generalized linear models) F or r = 1 create b o o tstrap sam ple response y j at Xj by solving d(y,p.j) = e,j ,
j= l,...,n ,
where the ej are random ly sam pled from residuals r i , . . . , r „ ; 2 fit estim ates fi* and k", and com pute fitted value p*+ r corresponding to the new observation w ith x = x + ; then 3 for m = 1 ,...,M , (,a ) sam ple S'm from n , . . . , r„, (b ) set y ’+ rm equal to the solution o f the equation S(y,p.+) = d*m, (c) com pute sim ulated prediction ‘errors’ d’+rm = 8{y’+rm,fi'+r). Finally, o rd er the R M values d \ rm to give d'_j_(1) < ■• ■< d‘+{RK1). T hen calculate the prediction limits as the solutions to
3(y+,fl+)
—
^+,((RM+!)«)>
$(y+>fi+)
—
^+,((HM+l)(l-a))-
342
7 • Further Topics in Regression
In principle any o f the resam pling m ethods in Section 7.2.3 could be used. In practice the hom oscedasticity is im portant, and should be checked. Exam ple 7.4 (A ID S diagnoses)
Table 7.4 contains the n um ber o f A ID S
reports in E ngland an d W ales to the end o f 1992. They are cross-classified by diagnosis period an d length o f reporting delay, in three-m onth intervals. A blank in the table corresponds to an unknow n entry, and > indicates where an entry is a lower b o u n d for the actual value. We shall treat these incom plete d a ta as unknow n in o u r analysis below. The problem was to predict the state o f the epidem ic at the tim e from the given data. This depends heavily on the values missing tow ards the foot o f the table. The d a ta su p p o rt the assum ption th a t the reporting delay does n o t depend on the diagnosis period. In this case a simple m odel is th a t the num ber o f reports in row j and colum n k o f the table, yjk, has a Poisson distribution w ith m ean fijk = exp(a; + /4). I f all the cells o f the table are regarded as independent, the total diagnoses in period j have a Poisson distribution w ith m ean J2k Vjk = exP(a;') J2k exP (ft)- H ence the eventual total for an incom plete row can be predicted by adding the observed row total and the fitted values for the unobserved p a rt o f the row. H ow accurate is this prediction? To assess this, we first sim ulate a com plete table o f b o o tstrap data, y*k, using the fitted values fak = exp(a; + /?*) from the original fit. We shall discuss below how to do this; for now simply note th a t this am ounts to steps 1 and 3(b) o f A lgorithm 7.1. We then fit the tw o-w ay layout m odel to the sim ulated data, excluding the cells where the original table was incom plete, thereby obtaining param eter estim ates a ’ and /?£. We then calculate y'+j =
YI k
yjk’
A+J = ex p (a ')
exP(PD> k
unobs
7 = 1 ,...,3 8 ,
unobs
where the sum m ation is over the cells o f row j for which yjk was unobserved; this is step 2. N ote th a t y*+j is equivalent to the results o f steps 3(a) and 3(b) with M = 1. We take 8(y,n) = (y — corresponding to Pearson residuals for the Poisson distribution. This m eans th a t step 3(c) involves setting _ y-+ J - K j +J
a *1/2
V+J We repeat this R times, to obtain values d‘+}(l) < ■■■< d \ j(R) for each j. The final step is to o btain the b o o tstrap u p p er an d lower limits for y +j , by solving the equations y+j
a*+j _ j* .1 /2
*+J
y +j ~
p +j
_ j*
a + , M ( R + 1)«)>TT /2
< )
a + J ,( ( R + l) ( l—a))-
y*+j i_a
343
7.2 ■Generalized Linear M odels Table 7.4 N um bers o f A ID S reports in England and Wales to the end o f 1992 (De Angelis and Gilks, 1994). A ^ sign in the body o f the table indicates a count incomplete a t the end o f 1992, and t indicates a reporting-delay less than one month.
R eporting delay interval (quarters)
Diagnosis period Y ear
Q uarter
ot
1
2
3
4
5
6
7
8
9
10
11
12
13
214
1983
3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
2 2 4 0 6 5 4 11 9 2 5 7 13 12 21 17 36 28 31 26 31 36 32 15 34 38 31 32 49 44 41 56 53 63 71 95 76 67
6 7 4 10 17 22 23 11 22 28 26 49 37 53 44 74 58 74 80 99 95 77 92 92 104 101 124 132 107 153 137 124 175 135 161 178 181 2:66
0 1 0 0 3 1 4 6 6 8 14 17 21 16 29 13 23 23 16 27 35 20 32 14 29 34 47 36 51 41 29 39 35 24 48 39 2:16
1 1 1 1 1 5 5 1 2 8 6 11 9 21 11 13 14 11 9 9 13 26 10 27 31 18 24 10 17 16 33 14 17 23 25 £6
1 1 0 1 1 2 2 1 4 5 9 4 3 2 6 3 7 8 3 8 18 11 12 22 18 9 11 9 15 11 7 12 13 12 2:5
0 0 2 0 0 1 1 5 3 2 2 7 5 7 4 5 4 3 2 11 4 3 19 21 8 15 15 7 8 6 11 7 11 Si
0 0 0 0 0 0 3 0 3 2 5 5 7 0 2 3 1 3 8 3 6 8 12 12 6 6 8 6 9 5 6 10 >2
1 0 0 0 0 2 0 1 4 4 5 7 3 7 2 1 2 6 3 4 4 4 4 5 7 1 6 4 2 7 4 Si
0 0 0 1 0 1 1 1 7 3 5 3 1 0 1 2 1 2 1 6 4 8 3 3 3 2 5 4 1 2 23
0 0 0 1 0 0 2 1 1 0 1 1 3 0 0 2 3 5 4 3 3 7 2 0 8 2 3 5 1
0 0 2 1 0 0 0 1 2 1 2 2 1 0 2 0 0 4 6 5 3 1 0 3 0 2 3 0
0 0 1 0 1 0 0 0 0 1 0 2 0 0 0 0 0 1 2 5 2 0 2 3 2 3 4
0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 1 1 1 0 0 2 0 1 2
0 0 0 0 0 0 0 0 0 0 0 1 0 1 2 3 3 1 2 1 3 2 0 1 2
1 0 0 0 1 0 2 1 0 1 2 4 6 1 8 5 1 3 6 3 3 2 2 1
1984
1985
1986
1987
1988
1989
1990
1991
1992
Total reports to end 1992 12 12 14 15 30 39 47 40 63 65 82 120 109 120 134 141 153 173 174 211 224 205 224 219 £253 2233 2:281 2:245 2260 2285 2271 2263 2306 2258 2310 2318 2273 2133
This procedure takes into account two aspects o f uncertainty th a t are im p o rta n t in prediction, nam ely the inaccuracy o f param eter estim ates, and the ran d o m fluctuations in the unobserved y,*. T he first enters through variation in a* an d from replicate to replicate, and the second enters through the sam pling variability o f the p redictand y'+J over different replicates. The procedure does n o t allow for a th ird com ponent o f predictive error, due to uncertainty ab o u t the form o f the model. T he m odel described above is a generalized linear m odel w ith Poisson errors an d the log link function. It contains 52 param eters. The deviance o f 716.5 on 413 degrees o f freedom is strong evidence th a t the d a ta are overdispersed relative to the Poisson distribution. The estim ate o f k is tc = 1.78, and in
7 ■Further Topics in Regression
344
o oin
Figure I S Results from the fit of a Poisson two-way layout to the AIDS data. The left panel shows predicted diagnoses (solid), together with the actual totals to the end of 1992 (+). The right panel shows standardized Pearson residuals plotted against estimated skewness, p~l/2; the vertical lines are at skewness 0.6 and
oo rT < D O O CO o
CO
c
O) O
1984 1986
1988
1990
1.
1992
Skewness
fact a quasilikelihood m odel in which v ar(Y ) = k/i appears to fit the d a ta ; this corresponds to treatin g the counts in Table 7.4 as independent negative binom ial ran d o m variables. The predicted value exp(a; ) J2k exP(A0 is shown as the solid line in the left panel o f Figure 7.5, together w ith the observed to tal reports to the end o f 1992. The right panel shows the standardized Pearson residuals plotted against the estim ated skewness p r l/2. T he b anding o f residuals at the right is characteristic o f d a ta containing sm all counts, w ith the lower b an d corresponding to zeroes in the original data, the next to ones, an d so forth. The distributions o f the rp change m arkedly, an d it w ould be in ap p ro p riate to tre a t them as a hom ogeneous group. The sam e conclusion holds for the standardized deviance residuals, although they are less skewed for larger fitted values. T he dotted lines in the figure divide the observations into three strata, w ithin each o f which the residuals are m ore hom ogeneous. Finer stratification has little effect on the results described below. One param etric b o o tstrap involves generating Poisson random variables Y ’k with m eans exp(aj + /?*). This fails to account for the overdispersion, which can be m im icked by p aram etric sam pling from a fitted negative binom ial distributio n w ith the sam e m eans an d estim ated overdispersion. N onparam etric resam pling from standardized Pearson residuals will give overdispersion, b u t the right panel o f Figure 7.5 suggests th a t the residuals should be stratified. Figure 7.6 shows the ratio o f deviances to degrees o f freedom for 999 sam ples tak en u nder these four sam pling schemes; the strata used in the low er right panel are show n in Figure 7.5. Param etric sim ulation from the Poisson m odel is plainly in ap p ro p riate because the d a ta so generated
Figure 7.6 Resampling results for AIDS data. The left panels show deviance/degrees of freedom ratios for the four resampling schemes, with the observed ratio given as the vertical dotted line. The right panel shows predicted diagnoses (solid line), with pointwise 95% predictive intervals, based on 999 replicates of Poisson simulation (small dashes), of resampling residuals (dots), and of stratified resampling of residuals (large dashes).
Table 7.5 Bootstrap 95% prediction intervals for numbers of AIDS cases in England and Wales for the fourth quarters of 1990, 1991, and 1992.

                               1990          1991          1992
  Poisson                    (296, 315)    (294, 327)    (356, 537)
  Negative binomial          (294, 318)    (289, 333)    (317, 560)
  Nonparametric              (294, 318)    (289, 333)    (314, 547)
  Stratified nonparametric   (292, 319)    (288, 335)    (310, 571)
Parametric simulation from the Poisson model is plainly inappropriate, because the data so generated are much less dispersed than the original data, for which the ratio is 716.5/413. The negative binomial simulation gives more appropriate results, which seem rather similar to those for nonparametric simulation without stratification. When stratification is used, the results mimic the overdispersion much better.

The pointwise 95% prediction intervals for the numbers of AIDS diagnoses are shown in the right panel of Figure 7.6. The intervals for simulation from the fitted Poisson model are considerably narrower than the intervals from resampling residuals, both of which are similar. The intervals for the last quarters of 1990, 1991, and 1992 are given in Table 7.5. There is little change if intervals are based on the deviance residual formula for the Poisson distribution,

d(y, μ) = ±[2{y log(y/μ) + μ − y}]^{1/2}.

A serious drawback with this analysis is that predictions from the two-way layout model are very sensitive to the last few rows of the table, to the extent that the estimate for the last row is determined entirely by the bottom left
cell. Some sort of temporal smoothing is preferable, and we reconsider these data in Example 7.12. ■
7.3 Survival Data

Section 3.5 describes resampling methods for a single homogeneous sample of data subject to censoring. In this section we turn to problems where survival is affected by explanatory variables. Suppose that the data (Y, D, x) on an individual consist of: a survival time Y; an indicator of censoring, D, that equals one if Y is observed and zero if Y is right-censored; and a covariate vector x. Under random censorship the observed value of Y is supposed to be min(Y⁰, C), where C is a censoring variable with distribution G, and the true failure time Y⁰ is a variable whose distribution F(y; β, x) depends on the covariates x through a vector of parameters, β. More generally we might suppose that Y⁰ and C are conditionally independent given x, and that C has distribution G(c; γ, x). In either case, the value of C is supposed to be uninformative about the parameter β.

Parametric model
In a parametric model F is fully specified once β has been chosen. So if the data consist of measurements (y₁, d₁, x₁), ..., (y_n, d_n, x_n) on independent individuals, we suppose that β is estimated, often by the maximum likelihood estimator β̂. Parametric simulation is performed by generating values Y⁰*_j from the fitted distributions F(y; β̂, x_j) and generating appropriate censoring times C*_j, setting Y*_j = min(Y⁰*_j, C*_j), and letting D*_j indicate the event Y⁰*_j ≤ C*_j. The censoring variables may be generated according to any one of the schemes outlined in Section 3.5, or otherwise if appropriate.

Example 7.5 (PET film data) Table 7.6 contains data from an accelerated life test on PET film in gas insulated transformers; the film is used in electrical insulation. There are failure times y at each of four different voltages x. Three failure times are right-censored at voltage x = 5: according to the data source they were subject to censoring at a pre-determined time, but their values make it more likely that they were censored after a pre-determined number of failures, and we shall assume this in what follows.

The Weibull distribution is often used for such data. In this case plots suggest that both of its parameters depend on the voltage applied, and that there is an unknown threshold voltage x₀ below which failure cannot occur. Our model is that the distribution function for y at voltage x is given by

F(y; β, x) = 1 − exp{−(y/λ)^κ},   y > 0,

where

λ = exp{β₀ − β₁ log(x − 5 + e^{β₄})},   κ = exp(β₂ − β₃ log x).   (7.16)
Table 7.6 Failure times (hours) from an accelerated life test on PET film in SF₆ gas insulated transformers (Hirose, 1993). > indicates right-censoring.
  Voltage (kV)   Failure times (hours)
  5    7131, 8482, 8559, 8762, 9026, 9034, 9104, >9104.25, >9104.25, >9104.25
  7    50.25, 87.75, 87.76, 87.77, 92.90, 92.91, 95.96, 108.30, 108.30, 117.90, 123.90, 124.30, 129.70, 135.60, 135.60
  10   15.17, 19.87, 20.18, 21.50, 21.88, 22.23, 23.02, 23.90, 28.17, 29.70
  15   2.40, 2.42, 3.17, 3.75, 4.65, 4.95, 6.23, 6.68, 7.30
This parametrization is chosen so that the range of each parameter is unbounded; note that x₀ = 5 − e^{β₄}. The upper panels of Figure 7.7 show the fit of this model when the parameters are estimated by maximizing the log likelihood ℓ. The left panel shows Q-Q plots for each of the voltages, and the right panel shows the fitted mean failure time and estimated threshold x̂₀. The fit seems broadly adequate.

We simulate replicate datasets by generating observations from the Weibull model obtained by substituting the MLEs into (7.16). In order to apply our assumed censoring mechanism, we sort the observations simulated with x = 5 to get y*₍₁₎ ≤ ··· ≤ y*₍₁₀₎, say, and then set y*₍₈₎, y*₍₉₎, and y*₍₁₀₎ equal to y*₍₇₎ + 0.25. We give these three observations censoring indicators d* = 0, so that they are treated as censored, treat all the other observations as uncensored, and fit the Weibull model to the resulting data; a code sketch of this step is given below.

For the sake of illustration, suppose that interest focuses on the mean failure time θ when x = 4.9. To facilitate this we reparametrize the model to have
parameters θ and β = (β₁, ..., β₄), where θ = 10⁻³ λ Γ(1 + 1/κ) with x = 4.9; here Γ(v) denotes the Gamma function ∫₀^∞ u^{v−1} e^{−u} du. The lower left panel of Figure 7.7 shows the profile log likelihood for θ,

ℓ_prof(θ) = max_β ℓ(θ, β);

in the figure we renormalize the log likelihood to have maximum zero. Under the standard large-sample likelihood asymptotics outlined in Section 5.2.1, the approximate distribution of the likelihood ratio statistic W(θ) = 2{ℓ_prof(θ̂) − ℓ_prof(θ)} is χ²₁, so a 1 − α confidence set for the true θ is the set of θ such that

ℓ_prof(θ) ≥ ℓ_prof(θ̂) − ½ c_{1,1−α},

where θ̂ is the overall MLE and c_{v,p} is the p quantile of the χ²_v distribution.
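The simulation step described above might look as follows in R. The parameter values are hypothetical stand-ins for the MLEs at x = 5, since the full fit of (7.16) is not reproduced here.

  ## One parametric resample of the ten x = 5 observations, with the three
  ## largest values censored at the seventh failure time plus 0.25 hours;
  ## lambda5 and kappa5 are hypothetical stand-ins for the MLEs at x = 5
  set.seed(2)
  lambda5 <- 9000
  kappa5  <- 15
  y.star <- sort(rweibull(10, shape = kappa5, scale = lambda5))
  y.star[8:10] <- y.star[7] + 0.25      # censoring times
  d.star <- c(rep(1, 7), rep(0, 3))     # 1 = observed, 0 = right-censored
  ## the Weibull model would now be refitted to (y.star, d.star)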
For these data θ̂ = 24.85 and the 95% confidence interval is [19.75, 35.53]; the confidence set contains values of θ for which ℓ_prof(θ) exceeds the dotted line in the bottom left panel of Figure 7.7. The use of the chi-squared quantile to set the confidence interval presupposes that the sample is large enough for the likelihood asymptotics to apply, and this can be checked by the parametric simulation outlined above. The lower right panel of the figure is a Q-Q plot of likelihood ratio statistics w*(θ̂) = 2{ℓ*_prof(θ̂*) − ℓ*_prof(θ̂)} based on 999 sets of data simulated from the fitted model. The distribution of the w*(θ̂) is close to chi-squared, but with
Figure 7.7 PET reliability data analysis. Top left panel: Q-Q plot of log failure times against quantiles of log Weibull distribution, with fitted model given by dotted lines, and censored data by o. Top right panel: fitted mean failure time as a function of voltage x; the dotted line shows the estimated voltage x̂₀ below which failure is impossible. Lower left panel: normalized profile log likelihood for mean failure time θ at x = 4.9; the dotted line shows the 95% confidence interval for θ using the asymptotic chi-squared distribution, and the dashed line shows the 95% confidence interval using bootstrap calibration of the likelihood ratio statistic. Lower right panel: chi-squared Q-Q plot for simulated likelihood ratio statistic, with dotted line showing its large-sample distribution.
Table 7.7 Comparison of estimated biases and standard errors of maximum likelihood estimates for the PET reliability data, using standard first-order likelihood theory, parametric bootstrap simulation, and model-based nonparametric resampling.
  Parameter   MLE      Likelihood         Parametric          Nonparametric
                       Bias     SE        Bias      SE        Bias      SE
  β₀          6.346    0        0.117     0.007     0.117     0.001     0.112
  β₁          1.958    0        0.082     0.007     0.082     0.006     0.080
  β₂          4.383    0        0.850     0.127     0.874     0.109     0.871
  β₃          1.235    0        0.388     0.022     0.393     0.022     0.393
  x₀          4.758    0        0.029     −0.004    0.030     −0.002    0.028
mean 1.12, and their 0.95 quantile is w*₍₉₅₀₎ = 4.09, to be compared with c_{1,0.95} = 3.84. This gives as bootstrap calibrated 95% confidence interval the set of θ such that ℓ_prof(θ) ≥ ℓ_prof(θ̂) − ½ × 4.09, that is [19.62, 36.12], which is slightly wider than the standard interval.
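The calibration step itself is simple once simulated likelihood ratio statistics are to hand. The sketch below illustrates it on a deliberately simpler model, the mean θ of an exponential sample, rather than reproducing the Weibull fit; all values are synthetic.

  ## Bootstrap calibration of the likelihood ratio statistic, using an
  ## exponential mean theta in place of the Weibull model; the data and
  ## model here are purely illustrative, not the PET film fit
  set.seed(3)
  y <- rexp(20, rate = 1/25)                  # toy data with true mean 25
  loglik <- function(theta, y) sum(dexp(y, rate = 1/theta, log = TRUE))
  theta.hat <- mean(y)                        # MLE of the exponential mean

  R <- 999
  w.star <- replicate(R, {
    y.star <- rexp(length(y), rate = 1/theta.hat)   # parametric resample
    2 * (loglik(mean(y.star), y.star) - loglik(theta.hat, y.star))
  })
  c.boot <- quantile(w.star, 0.95)            # replaces c_{1,0.95} = 3.84

  ## calibrated 95% set: theta with loglik(theta) >= loglik(theta.hat) - c/2
  grid <- seq(0.5, 3, length = 500) * theta.hat
  keep <- sapply(grid, loglik, y = y) >= loglik(theta.hat, y) - c.boot/2
  range(grid[keep])                           # calibrated confidence limits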
ℓ̈ is the matrix of second derivatives of ℓ with respect to θ and β.
Table 7.7 compares the bias estimates and standard errors for the model parameters using the parametric bootstrap described above and standard first-order likelihood theory, under which the estimated biases are zero, and the variance estimates are obtained as the diagonal elements of the inverse observed information matrix (−ℓ̈)⁻¹ evaluated at the MLEs. The estimated biases are small but significantly different from zero. The largest differences between the standard theory and the bootstrap results are for β₂ and β₃, for which the biases are of order 2-3%. The threshold parameter x₀ is well determined; the standard 95% confidence interval based on its asymptotic normal distribution is [4.701, 4.815], whereas the normal interval with estimated bias and variance is [4.703, 4.820].

A model-based nonparametric bootstrap may be performed by using residuals e_j = (y_j/λ̂_j)^{κ̂_j}, three of which are censored, then resampling errors ε*_j from their product-limit estimate, and then making uncensored bootstrap observations λ̂_j ε*_j^{1/κ̂_j}. The observations with x = 5 are then modified as outlined above, and the model refitted to the resulting data. The product-limit estimate for the residuals is very close to the survivor function of the standard exponential distribution, so we expect this to give results similar to the parametric simulation, and this is what we see in Table 7.7.

For censoring at a pre-determined time c, the simulation algorithms would work as described above, except that values of y* greater than c would be replaced by c and the corresponding censoring indicators d* set equal to zero. The number of censored observations in each simulated dataset would then be random; see Practical 7.3.

Plots show that the simulated MLEs are close to normally distributed: in this case standard likelihood theory works well enough to give good confidence intervals for the parameters. The benefit of parametric simulation is that the bootstrap estimates give empirical evidence that the standard theory can
be trusted, while providing alternative methods for calculating measures of uncertainty if the standard theory is unreliable. It is typical of first-order likelihood methods that the variability of likelihood quantities is underestimated, although here the effect is small enough to be unimportant. ■

Proportional hazards model
If it can be assumed that the explanatory variables act multiplicatively on the hazard function, an elegant and powerful approach to survival data analysis is possible. Under the usual form of proportional hazards model the hazard function for an individual with covariates x is dΛ(y) = exp(x^T β) dΛ⁰(y), where dΛ⁰(y) is the 'baseline' hazard function that would apply to an individual with a fixed value of x, often x = 0. The corresponding cumulative hazard and survivor functions are

Λ(y) = ∫₀^y exp(x^T β) dΛ⁰(u),   1 − F(y; β, x) = {1 − F⁰(y)}^{exp(x^T β)},
where 1 − F⁰(y) is the baseline survivor function for the hazard dΛ⁰(y). The regression parameters β are usually estimated by maximizing the partial likelihood, which is the product over cases with d_j = 1 of terms

exp(x_j^T β) / Σ_{k=1}^n H(y_k − y_j) exp(x_k^T β),   (7.17)

where H(u) equals zero if u < 0 and equals one otherwise. Since (7.17) is unaltered by recentring the x_j, we shall assume below that Σ_j x_j = 0; the baseline hazard then corresponds to the average covariate value x̄ = 0. In terms of the estimated regression parameters the baseline cumulative hazard function is estimated by the Breslow estimator

Λ̂⁰(y) = Σ_{j: y_j ≤ y} d_j / {Σ_{k=1}^n H(y_k − y_j) exp(x_k^T β̂)},   (7.18)

a non-decreasing function that jumps at y_j by

dΛ̂⁰(y_j) = d_j / Σ_{k=1}^n H(y_k − y_j) exp(x_k^T β̂).

One standard estimator of the baseline survivor function is

1 − F̂⁰(y) = Π_{j: y_j ≤ y} {1 − dΛ̂⁰(y_j)},   (7.19)

which generalizes the product-limit estimate (3.9), although other estimators also exist. Whichever of them is used, the proportional hazards assumption implies that

{1 − F̂⁰(y)}^{exp(x_j^T β̂)}

will be the estimated survivor function for an individual with covariate values x_j.
Under the random censorship model, the survivor function of the censoring distribution G is given by (3.11).

The bootstrap methods for censored data outlined in Section 3.5 extend straightforwardly to this setting. For example, if the censoring distribution is independent of the covariates, we generate a single sample under the conditional sampling plan according to the following algorithm.

Algorithm 7.2 (Conditional resampling for censored survival data)
For j = 1, ..., n,
1 generate Y⁰*_j from the estimated failure time survivor function {1 − F̂⁰(y)}^{exp(x_j^T β̂)};
2 if d_j = 0, set C*_j = y_j, and if d_j = 1, generate C*_j from the conditional censoring distribution given that C_j > y_j, namely {Ĝ(y) − Ĝ(y_j)}/{1 − Ĝ(y_j)}; then
3 set Y*_j = min(Y⁰*_j, C*_j), with D*_j = 1 if Y*_j = Y⁰*_j and zero otherwise.
Under the more general model where the distribution G of C also depends upon the covariates and a proportional hazards assumption is appropriate for G, the estimated censoring survivor function when the covariate is x is

1 − Ĝ(y; γ̂, x) = {1 − Ĝ⁰(y)}^{exp(x^T γ̂)},

where Ĝ⁰(y) is the estimated baseline censoring distribution given by the analogues of (7.18) and (7.19), in which 1 − d_j and γ replace d_j and β. Under model-based resampling, a bootstrap dataset is then obtained by

Algorithm 7.3 (Resampling for censored survival data)
For j = 1, ..., n,
1 generate Y⁰*_j from the estimated failure time survivor function {1 − F̂⁰(y)}^{exp(x_j^T β̂)}, and independently generate C*_j from the estimated censoring survivor function {1 − Ĝ⁰(y)}^{exp(x_j^T γ̂)}; then
2 set Y*_j = min(Y⁰*_j, C*_j), with D*_j = 1 if Y*_j = Y⁰*_j and zero otherwise.
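To make the generation step of these algorithms concrete, here is a small self-contained R sketch of Algorithm 7.3. The baseline survivor estimates are step functions, so each failure or censoring time is drawn by inverting the appropriate step function at a uniform variate; the numerical values below are illustrative, not estimates from any real fit.

  ## Sketch of Algorithm 7.3 with step-function baseline estimates: times
  ## t0 with survivor values s0 (failure) and g0 (censoring); in practice
  ## these come from (7.18)-(7.19) and their censoring analogues
  set.seed(4)
  sample.step <- function(t0, s0, lp) {
    u <- runif(1)
    surv <- s0^exp(lp)                # survivor curve for this individual
    idx <- which(surv <= u)           # times by which survival fell to u
    if (length(idx) == 0) max(t0) else t0[min(idx)]
  }

  t0 <- c(50, 120, 300, 800, 2000)            # illustrative values
  s0 <- c(0.95, 0.80, 0.60, 0.35, 0.10)
  g0 <- c(0.98, 0.90, 0.75, 0.50, 0.20)

  n <- 5
  lp.f <- rnorm(n, 0, 0.5)                    # failure linear predictors
  lp.c <- rep(0, n)                           # censoring linear predictors
  y0.star <- sapply(lp.f, function(b) sample.step(t0, s0, b))
  c.star  <- sapply(lp.c, function(g) sample.step(t0, g0, g))
  y.star  <- pmin(y0.star, c.star)            # observed time, step 2
  d.star  <- as.integer(y0.star <= c.star)    # failure indicator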
The next example illustrates the use of these algorithms.
Example 7.6 (Melanoma data) To illustrate these ideas, we consider data on the survival of patients with malignant melanoma, whose tumours were removed by operation at the Department of Plastic Surgery, University Hospital of Odense, Denmark. Operations took place from 1962 to 1977, and patients were followed to the end of 1977. Each tumour was completely removed, together with about 2.5 cm of the skin around it. The following variables were available for 205 patients: time in days since the operation, possibly censored; status at the end of the study (alive, dead from melanoma, dead from other causes); sex; age; year of operation; tumour thickness in mm; and an indicator of whether or not the tumour was ulcerated. Ulceration and tumour thickness are important prognostic variables: to have a thick or ulcerated tumour substantially increases the chance of death from melanoma, and we shall investigate how they affect survival. We assume that censoring occurs at random.

We fit a proportional hazards model under the assumption that the baseline hazards are different for the ulcerated group of 90 individuals and the non-ulcerated group, but that there is a common effect of tumour thickness. For a flexible assessment of how thickness affects the hazard function, we fit a natural spline with four degrees of freedom; its knots are placed at the empirical 0.25, 0.5 and 0.75 quantiles of the tumour thicknesses. Thus our model is that the survivor functions for the ulcerated and non-ulcerated groups are

1 − F₁(y; β, x) = {1 − F₁⁰(y)}^{exp(x^T β)},   1 − F₂(y; β, x) = {1 − F₂⁰(y)}^{exp(x^T β)},
where x has dimension four and corresponds to the spline, β is common to the groups, but the baseline survivor functions 1 − F₁⁰(y) and 1 − F₂⁰(y) may differ. For illustration we take the fitted censoring distribution to be the product-limit estimate obtained by setting censoring indicators d′ = 1 − d and fitting a model with no covariates, so Ĝ is just the product-limit estimate of the censoring time distribution.

The left panel of Figure 7.8 shows the estimated survivor functions 1 − F̂₁⁰(y) and 1 − F̂₂⁰(y); there is a strong effect of ulceration. The right panel shows how the linear predictor x^T β̂ depends on tumour thickness: from 0-3 mm the effect on the baseline hazard changes from about exp(−1) = 0.37 to about exp(0.6) = 1.8, followed by a slight dip and a gradual upward increase to a risk of about exp(1.2) = 3.3 for a tumour 15 mm thick. Thus the hazard increases by a factor of about 10, but most of the increase takes place from 0-3 mm. However, there are too few individuals with tumours more than 10 mm thick for reliable inferences at the right of the panel.

The top left panel of Figure 7.9 shows the original fitted linear predictor, together with 19 replicates obtained by resampling cases, stratified by ulceration. The lighter solid lines in the panel below are pointwise 95% confidence limits, based on R = 999 replicates of this sampling scheme. In effect these are percentile method confidence limits for the linear predictor at each thickness.
Figure 7.8 Fit of a proportional hazards model for ulcer histology and survival of patients with malignant melanoma (Andersen et al., 1993, pp. 709-714). Left panel: estimated baseline survivor functions for cases with ulcerated (dots) and non-ulcerated (solid) tumours, against time (days). Right panel: fitted linear predictor x^T β̂ for risk as a function of tumour thickness (mm). The lower rug is for non-ulcerated patients, and the upper rug for ulcerated patients.
The sharp increase in risk for small thicknesses is clearly a genuine effect, while beyond 3 mm the confidence interval for the linear predictor is roughly [0, 1], with thickness having little or no effect. Results from model-based resampling using the fitted model and applying Algorithm 7.3, and from conditional resampling using Algorithm 7.2, are also shown; they are very similar to the results from resampling cases. In view of the discussion in Section 3.5, we did not apply the weird bootstrap.

The right panels of Figure 7.9 show how the estimated 0.2 quantile of the survival distribution, y₀.₂ = min{y : F̂₁(y; β̂, x) ≥ 0.2}, depends on tumour thickness. There is an initial sharp decrease from 3000 days to about 750 days as tumour thickness increases from 0-3 mm, but the estimate is roughly constant from then on. The individual estimates are highly variable, but the degree of uncertainty mirrors roughly that in the left panels. Once again results for the three resampling schemes are very similar.

Unlike the previous example, where resampling and standard likelihood methods led to similar conclusions, this example shows the usefulness of resampling when standard approaches would be difficult or impossible to apply. ■
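Case resampling stratified by ulceration, the first of the three schemes above, might be sketched as follows with the survival package. The data frame mel and its column names are assumptions about how the melanoma data are stored, and the code is illustrative rather than a reproduction of the original analysis.

  ## Stratified case resampling for a proportional hazards fit; assumes a
  ## data frame mel with columns time, status (1 = death from melanoma),
  ## thickness, and ulcer coded 0/1 (all names are illustrative)
  library(survival)
  library(splines)
  set.seed(5)

  fit <- coxph(Surv(time, status) ~ ns(thickness, df = 4) + strata(ulcer),
               data = mel)

  R <- 999
  grid <- data.frame(thickness = 1:10, ulcer = 1)
  lp.star <- replicate(R, {
    idx <- c(sample(which(mel$ulcer == 1), replace = TRUE),
             sample(which(mel$ulcer == 0), replace = TRUE))  # within strata
    fit.star <- coxph(Surv(time, status) ~ ns(thickness, df = 4) +
                        strata(ulcer), data = mel[idx, ])
    predict(fit.star, newdata = grid, type = "lp")  # linear predictor
  })
  ## pointwise percentile limits at each grid thickness
  limits <- apply(lp.star, 1, quantile, probs = c(0.025, 0.975))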
7.4 Other Nonlinear Models

A nonlinear regression model with independent additive errors is of the form

y_j = μ(x_j, β) + ε_j,   j = 1, ..., n,   (7.20)
Figure 7.9 Bootstrap results for melanoma data analysis. Top left: fitted linear predictor (heavy solid) and 19 replicates from case resampling (solid); the rug shows observed thicknesses. Top right: estimated 0.2 quantile of survivor distribution as a function of tumour thickness, for an individual with an ulcerated tumour (heavy solid), and 19 replicates for case resampling (solid); the rug shows observed thicknesses. Bottom left: pointwise 95% percentile confidence limits for linear predictor, from case (solid), model-based (dots), and conditional (dashes) resampling. Bottom right: pointwise 95% percentile confidence limits for 0.20 quantile of survivor distribution, from case (solid), model-based (dots), and conditional (dashes) resampling, R = 999.
with μ(x, β) nonlinear in the parameter β, which may be vector or scalar. The linear algebra associated with least squares estimates for linear regression no longer applies exactly. However, least squares theory can be developed by linear approximation, and the least squares estimate β̂ can often be computed accurately by iterative linear fitting. The linear approximation to (7.20), obtained by Taylor series expansion, gives

y_j − μ(x_j, β′) ≈ u_j(β′)^T (β − β′) + ε_j,   j = 1, ..., n,   (7.21)
where u_j(β′) = ∂μ(x_j, β)/∂β evaluated at β = β′.
This defines an iteration that starts at β′ using a linear regression least squares fit, and at the final iteration β′ = β̂. At that stage the left-hand side of (7.21) is simply the residual e_j = y_j − μ(x_j, β̂). Approximate leverage values and other diagnostics are obtained from the linear approximation, that is using the definitions in previous sections but with the u_js evaluated at β′ = β̂ as the values of explanatory variable vectors. This use of the linear approximation can give misleading results, depending upon the "intrinsic curvature" of the regression surface. In particular, the residuals will no longer have zero expectation in general, and standardized residuals r_j will no longer have constant variance under homoscedasticity of true errors.

The usual normal approximation for the distribution of β̂ is also based on the linear approximation. For the approximate variance, (6.24) applies with X replaced by Û = (û₁, ..., û_n)^T evaluated at β̂. So with s² equal to the residual mean square, we have approximately

β̂ − β ∼ N(0, s²(Û^T Û)⁻¹).   (7.22)
The accuracy of this approximation will depend upon two types of curvature effects, called parameter effects and intrinsic effects. The first of these is specific to the parametrization used in expressing μ(x, ·), and can be reduced by careful choice of parametrization. Of course resampling methods will be the more useful the larger are the curvature effects, and the worse the normal approximation.

Resampling methods apply here just as with linear regression, either simulating data from the fitted model with resampled modified residuals or by resampling cases. For the first of these it will generally be necessary to make a mean adjustment to whatever residuals are being used as the error population. It would also be generally advisable to correct the raw residuals for bias due to nonlinearity: we do not show how to do this here.

Example 7.7 (Calcium uptake data) The data plotted in Figure 7.10 show the calcium uptake of cells, y, as a function of time x after being suspended in a solution of radioactive calcium. Also shown is the fitted curve

μ(x, β) = β₀{1 − exp(−β₁x)}.

The least squares estimates are β̂₀ = 4.31 and β̂₁ = 0.209, and the estimate of σ is 0.55 with 25 degrees of freedom. The standard errors for β̂₀ and β̂₁ based on (7.22) are 0.30 and 0.039.
Figure 7.10 Calcium uptake data and fitted curve against time (minutes) (left panel), with raw residuals (right panel) (Rawlings, 1988, p. 403).
Table 7.8 Results from R = 999 replicates of stratified case resampling for nonlinear regression model fitted to calcium data.

        Estimate   Bootstrap bias   Theoretical SE   Bootstrap SE
  β̂₀    4.31       0.028            0.30             0.38
  β̂₁    0.209      0.004            0.039            0.040
The right panel of Figure 7.10 shows that homogeneity of variance is slightly questionable here, so we resample cases by stratified sampling. Estimated biases and standard errors for β̂₀ and β̂₁ based on 999 bootstrap replicates are given in Table 7.8. The main point to notice is the appreciable difference between theoretical and bootstrap standard errors for β̂₀.

Figure 7.11 illustrates the results. Note the non-elliptical pattern of variation and the non-normality: the z-statistics are also quite non-normal. In this case the bootstrap should give better results for confidence intervals than normal approximations, especially for β̂₀. The bottom right panel suggests that the parameter estimates are closer to normal on logarithmic scales. Results for model-based resampling assuming homoscedastic errors are fairly similar, although the standard error for β̂₀ is then 0.32. The effects of nonlinearity are negligible in this case: for example, the maximum absolute bias of residuals is about 0.012.
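A minimal sketch of the stratified case resampling used here follows; the data are a synthetic stand-in for the calcium measurements, and the three strata are illustrative. In practice occasional non-convergence of nls in a resample would need to be caught, for example with try.

  ## Stratified case resampling for the model b0*(1 - exp(-b1*x)); the
  ## data and strata below are synthetic stand-ins for the calcium data
  set.seed(6)
  x <- rep(c(0.5, 1, 2, 4, 6, 9, 12, 15), each = 3)
  y <- 4.3 * (1 - exp(-0.21 * x)) + rnorm(length(x), sd = 0.4)
  cal <- data.frame(x, y, stratum = cut(x, c(0, 3, 9, 16)))

  fit <- nls(y ~ b0 * (1 - exp(-b1 * x)), data = cal,
             start = list(b0 = 4, b1 = 0.2))

  R <- 999
  beta.star <- t(replicate(R, {
    idx <- unlist(tapply(seq_len(nrow(cal)), cal$stratum,
                         sample, replace = TRUE))   # resample within strata
    coef(nls(y ~ b0 * (1 - exp(-b1 * x)), data = cal[idx, ],
             start = as.list(coef(fit))))
  }))
  colMeans(beta.star) - coef(fit)    # bootstrap biases
  apply(beta.star, 2, sd)            # bootstrap standard errors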
Figure 7.11 Parameter estimates for case resampling of calcium data, with R = 999. The upper panels show normal plots of β̂₀* and β̂₁*, while the lower panels show their joint distributions on the original (left) and logarithmic (right) scales.
Now suppose that confidence intervals are required for the proportion of maximum uptake π = 1 − exp(−β₁x) at particular times x. One could approach this by applying the delta method together with the bivariate normal approximation for least squares estimates, but the bootstrap can deal with this using only the simulated parameter estimates. So consider the times x = 1, 5, 15, at which the estimates π̂ = 1 − exp(−β̂₁x) are 0.188, 0.647 and 0.956 respectively. The top panel of Figure 7.12 shows bootstrap distributions of π̂* = 1 − exp(−β̂₁*x): note the strong non-normality at x = 15. The constraint that π must lie in the interval (0, 1) means that it is unwise to construct basic or studentized confidence intervals for π itself. For example, the basic bootstrap 95% interval for π at x = 15 is [0.922, 1.025]. The solution is to do all the calculations on the logit scale, as shown in the lower panel of Figure 7.12, and to untransform the limits obtained at the end. That is, we obtain
intervals [η₁, η₂] for η = log{π/(1 − π)}, and then take

exp(η₁)/{1 + exp(η₁)},   exp(η₂)/{1 + exp(η₂)}

as the corresponding interval for π. The resulting 95% intervals are [0.13, 0.26] at x = 1, [0.48, 0.76] at x = 5, and [0.83, 0.98] at x = 15. The standard linear theory gives slightly different values, e.g. [0.10, 0.27] at x = 1 and [0.83, 1.03] at x = 15. ■
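In code the back-transformation costs only a couple of lines. The sketch below assumes that the bootstrap values of π̂* at a single x are held in pi.star, here filled with synthetic stand-ins, and computes the basic interval on the logit scale before mapping back.

  ## Basic bootstrap interval computed on the logit scale, then mapped back;
  ## pi.star holds R bootstrap values of pi* (synthetic stand-ins here)
  set.seed(7)
  pi.hat <- 0.956
  pi.star <- rbeta(999, 48, 2.2)             # stand-in bootstrap values

  logit <- function(p) log(p / (1 - p))
  eta.hat <- logit(pi.hat)
  q <- quantile(logit(pi.star), c(0.975, 0.025))
  eta.lim <- 2 * eta.hat - q                 # basic bootstrap limits for eta
  plogis(eta.lim)                            # limits for pi, inside (0, 1)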
7.5 Misclassification Error

The discussion of aggregate prediction error in Section 6.4.1 was expressed in a general notation that would apply also to the regression models described in this chapter, with appropriate definitions of prediction rule ŷ₊ = μ̂(x₊; F̂) for a response y₊ at covariate values x₊, and measure of accuracy c(y₊, ŷ₊). The general conclusions of Section 6.4.1 concerning bootstrap and cross-validation estimates of aggregate prediction error should apply here also. In particular the adjusted K-fold cross-validation estimate and the 0.632 bootstrap estimate should be preferred in most situations.
Figure 7.12 Calcium uptake data: bootstrap histograms for estimated proportion of maximum π = 1 − exp(−β̂₁x) at x = 1, 5 and 15, based on R = 999 resamples of cases.
One type of problem that deserves special attention, in part because it differs most from the examples of Section 6.4.1, is the estimation of prediction error for binary responses, supposing these to be modelled by a generalized linear model of the sort discussed in Section 7.2. If the binary response corresponds to a classification indicator, then prediction of response y₊ for an individual with covariate vector x₊ is equivalent to classification of that individual, and incorrect prediction (ŷ₊ ≠ y₊) is a misclassification error.

Suppose, then, that the response y is 0 or 1, and that the prediction rule μ̂(x₊; F̂) is an estimate of Pr(Y₊ = 1 | x₊) for a new case (x₊, y₊). We imagine that this estimated probability is translated into a prediction ŷ₊ of y₊, or equivalently a classification of the individual with covariate x₊. For simplicity we set ŷ₊ = 1 if μ̂(x₊; F̂) ≥ ½ and ŷ₊ = 0 otherwise; this would be modified if incidence rates for the two classes differed. If costs of both types of misclassification error are equal, as we shall assume, then it is enough to set

c(y₊, ŷ₊) = 0 if ŷ₊ = y₊, and 1 otherwise.   (7.23)
The aggregate prediction error D is simply the overall misclassification rate, equal to the proportion of cases where y₊ is wrongly predicted. The special feature of this problem is that the prediction and the measure of error are not continuous functions of the data. According to the discussion in Section 6.4.1 we should then expect bootstrap methods for estimating D or its expected value Δ to be superior to cross-validation estimates, in terms of variability. Also leave-one-out cross-validation is no longer attractive on computational grounds, because we now have to refit the model for each resample.

Example 7.8 (Urine data) For an example of the estimation of misclassification error, we take binary data on the presence of calcium oxalate crystals in 79 samples of urine. Explanatory variables are specific gravity, i.e. the density of urine relative to water; pH; osmolarity (mOsm); conductivity (milliMho); urea concentration (millimoles per litre); and calcium concentration (millimoles per litre). After dropping two incomplete cases, 77 remain.

Consider how well the presence of crystals can be predicted from the explanatory variables. Analysis of deviance for binary logistic regression suggests the model which includes the p = 4 covariates specific gravity, conductivity, log calcium concentration, and log urea concentration, and we base our predictions on this model.

The simplest estimate of the expected aggregate prediction error Δ is the average number of misclassifications, Δ_app = n⁻¹ Σ_j c(y_j, ŷ_j), with c(·, ·) given by (7.23); it would be equivalent to use instead c{y₊, μ̂(x₊; F̂)} equal to 0 if |y₊ − μ̂(x₊; F̂)| < ½ and to 1 otherwise.
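As a concrete illustration of these quantities, here is a self-contained sketch computing the apparent error and a K-fold cross-validation estimate for a logistic model; the data are synthetic stand-ins, not the urine data, and K = 7 is chosen arbitrarily.

  ## Apparent error and K-fold cross-validation misclassification rate
  ## for a binary logistic model; data are synthetic stand-ins
  set.seed(8)
  n <- 77
  dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  dat$y <- rbinom(n, 1, plogis(1.5 * dat$x1 - dat$x2))

  err <- function(y, p) mean((p >= 0.5) != y)     # cost (7.23)

  fit <- glm(y ~ x1 + x2, binomial, data = dat)
  app <- err(dat$y, fitted(fit))                  # apparent error

  K <- 7
  fold <- sample(rep(1:K, length = n))            # random group labels
  e.cv <- numeric(n)
  for (k in 1:K) {
    fit.k <- glm(y ~ x1 + x2, binomial, data = dat[fold != k, ])
    p.k <- predict(fit.k, newdata = dat[fold == k, ], type = "response")
    e.cv[fold == k] <- (p.k >= 0.5) != dat$y[fold == k]
  }
  cv <- mean(e.cv)                                # K-fold CV estimate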
Table 7.9 Estimates of aggregate prediction error (×10⁻²), or misclassification rate, for urine data (Andrews and Herzberg, 1985, pp. 249-251).

  Bootstrap   0.632   K-fold (adjusted) cross-validation
                      K = 77   38            10            7             2
  24.7        22.1    23.4     23.4 (23.7)   20.8 (21.0)   26.0 (25.4)   20.8 (20.8)
Figure 7.13 Components of 0.632 estimate of prediction error, y_j − μ̂(x_j; F̂*), for urine data based on 200 bootstrap simulations, plotted against cases ordered by residual. Values within the dotted lines make no contribution to prediction error. The components from cases 54 and 66 are the rightmost and the fourth from rightmost sets of errors shown; the components from case 27 are leftmost.
In this case Δ_app = 20.8 × 10⁻². Other estimates of aggregate prediction error are given in Table 7.9. For the bootstrap and 0.632 estimates, we used R = 200 bootstrap resamples. The discontinuous nature of prediction error gives more variable results than for the examples with squared error in Section 6.4.1. In particular the results for K-fold cross-validation now depend more critically on which observations fall into the groups. For example, the average and standard deviation of Δ̂_CV,7 for 40 repeats were 23.0 × 10⁻² and 2.0 × 10⁻². However, the broad pattern is similar to that in Table 6.9.

Figure 7.13 shows box plots of the quantities y_j − μ̂(x_j; F̂*) that contribute to the 0.632 estimate of prediction error, plotted against case j ordered by the residual; only three values of j are labelled. There are about 74 contributions at each value of j. Only values outwith the horizontal dotted lines contribute to prediction error. The pattern is broadly what we would expect: observations with residuals close to zero are generally well predicted, and make little contribution to prediction error. More extreme residuals contribute most to prediction error. Note cases 66 and 54, which are always misclassified; their standardized Pearson residuals are 2.13 and 2.54. The figure suggests that case
Table 7.10 Summary results for estimates of prediction error for 200 samples of size n = 50 from data on low birth weights (Hosmer and Lemeshow, 1989, pp. 247-252; Venables and Ripley, 1994, p. 193). The table shows the average, standard deviation, and conditional mean squared error (×10⁻²) for the 200 estimates of excess error. The "target" average excess error is 8.3 × 10⁻².

          Bootstrap   0.632   K-fold (adjusted) cross-validation
                              K = 50   25            10            5             2
  Mean    9.1         8.8     11.5     11.7 (11.5)   12.2 (11.7)   12.4 (11.3)   15.3 (11.1)
  SD      1.2         1.9     4.4      4.5 (4.2)     5.0 (4.6)     4.8 (3.9)     7.1 (4.6)
  MSE     0.38        0.29    0.62     0.64 (0.63)   0.76 (0.73)   0.64 (0.54)   1.14 (0.59)
54 is outlying. At the other end is case 27, whose residual is −1.84; this case was misclassified 42 times out of 65 in our simulation. ■

Example 7.9 (Low birth weights) In order to compare the properties of estimates of misclassification error under repeated sampling, we took data on 189 births at a US hospital to be our population F. The binary response equals zero for babies with birth weight less than 2.5 kg, and equals one otherwise. We took 200 samples of size n = 50 from these data, and to each sample we fitted a binary logistic model with nine regression parameters expressing dependence on maternal characteristics: weight, smoking status, number of previous premature labours, hypertension, uterine irritability, and the number of visits to the physician in the first trimester. For each of the samples we calculated various cross-validation and bootstrap estimates of misclassification rate, using R = 200 bootstrap resamples.

Table 7.10 shows the results of this experiment, expressed in terms of estimates of the excess error, which is the difference between the true misclassification rate D and the apparent error rate Δ_app found by applying the prediction rule to the data. The "target" value of the average excess error over the 200 samples was 8.3 × 10⁻²; the average apparent error was 20.0 × 10⁻². The bootstrap and 0.632 excess error estimates again perform best overall in terms of mean, variability, and conditional mean squared error. Note that the standard deviations for the bootstrap and 0.632 estimates suggest that R = 50 would have given results accurate enough for most purposes. Ordinary cross-validation is significantly better than K-fold cross-validation, unless K = 25. However, the results for K-fold adjusted cross-validation are not significantly different from those for unadjusted cross-validation, even with K = 2. Thus if cross-validation is to be used, adjusted K-fold cross-validation offers considerable computational savings over ordinary cross-validation, and is about equally accurate.

For reasons outlined in Example 3.6, the EDF of the data may be a poor estimate of the original CDF when there are binary responses y_j. One way to overcome this is to switch the response value with small probability, i.e. to replace (x_j*, y_j*) with (x_j*, 1 − y_j*) with probability (say) 0.1. This corresponds to a binomial simulation using probabilities shrunk somewhat towards 0.5
from the observed values of 0 and 1. It should produce results that are smoother than those obtained under case resampling from the original data. Our simulation experiment included this randomized bootstrap, but although typically it improves slightly on bootstrap results, the results here were very similar to those for the ordinary bootstrap. ■

In principle resampling estimates of misclassification rates could be used to select which covariates to include in the prediction rule, along the lines given for linear regression in Section 6.4.2. It seems likely, in the light of the preceding example, that the bootstrap approach would be preferable.
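The randomized bootstrap just described needs only a few lines of code. The sketch below reuses the synthetic data frame dat from the earlier sketch in this section; both the data and the 0.1 switching probability are illustrative.

  ## Randomized bootstrap for binary responses: resample cases, then
  ## switch each resampled response with small probability 0.1; dat is
  ## the synthetic data frame from the earlier sketch in this section
  set.seed(9)
  n <- nrow(dat)
  idx <- sample(n, replace = TRUE)
  y.star <- dat$y[idx]
  flip <- runif(n) < 0.1                     # switch indicators
  y.star[flip] <- 1 - y.star[flip]
  boot.dat <- data.frame(dat[idx, c("x1", "x2")], y = y.star)
  ## the logistic model would now be refitted to boot.dat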
7.6 Nonparametric Regression

So far we have considered regression models in which the mean response is related to covariates x through a function of known form with a small number of unknown parameters. There are, however, occasions when it is useful to assess the effects of covariates x without completely specifying the form of the relationship between mean response μ and x. This is done using nonparametric regression methods, of which there are now a large number. The simplest nonparametric regression relationship for scalar x is

y = μ(x) + ε,

where μ(x) has completely unknown form but would be assumed continuous in many applications, and ε is a random error with zero mean. A typical application is illustrated by the scatter plot in Figure 7.14. Here no simple parametric regression curve seems appropriate, so it makes sense to fit a smooth curve (which we do later in Example 7.10) with as few restrictions as possible.

Often nonparametric regression is used as an exploratory tool, either directly by producing a curve estimate for visual interpretation, or indirectly by providing a comparison with some tentative parametric model fit via a significance test. In some applications the rather different objective of prediction will be of interest. Whatever the application, the complicated nature of nonparametric regression methods makes it unlikely that probability distributions for statistics of interest can be evaluated theoretically, and so resampling methods will play a prominent role.

It is not possible here to describe all of the nonparametric regression methods that are now available, and in any event many of them do not yet have fully developed companion resampling methods. We shall limit ourselves to a brief discussion of some of the main methods, and to applications in generalized additive models, where nonparametric regression is used to extend the generalized linear models of Section 7.2.
Figure 7.14 Motorcycle impact data. Acceleration y (g) at a time x milliseconds after impact (Silverman, 1985).
7.6.1 Nonparametric curves

Several nonparametric curve-fitting algorithms are variants on the idea of local averaging. One such method is kernel smoothing, which estimates the mean response E(Y | x) = μ(x) by

μ̂(x) = Σ_j y_j w{(x − x_j)/b} / Σ_j w{(x − x_j)/b},   (7.24)

with w(·) a symmetric density function and b an adjustable "bandwidth" constant that determines how widely the averaging is done. This estimate is similar in many ways to the kernel density estimate discussed in Example 5.13, and as there the choice of b depends upon a trade-off between bias and variability of the estimate: small b gives small bias and large variance, whereas large b has the opposite effects. Ideally b would vary with x, to reflect large changes in the derivative of μ(x) and heteroscedasticity, both evident in Figure 7.14. Modifications to the estimate (7.24) are needed at the ends of the x range, to avoid the inherent bias when there is little or no data on one side of x.

In many ways more satisfactory are the local regression methods, where a local linear or quadratic curve is fitted using weights w{(x − x_j)/b} as above, and then μ̂(x) is taken to be the fitted value at x. Implementations of this idea include the lowess method, which also incorporates trimming of outliers. Again the choice of b is critical.

A different approach is to define a curve in terms of basis functions, such as powers of x, which define polynomials. The fitted model is then a linear combination of basis functions, with coefficients determined by least squares regression. Which basis to use depends on the application, but polynomials are
generally bad, because fitted values become increasingly variable as x moves toward the ends of its data range; polynomial extrapolation is notoriously poor. One popular choice for basis functions is cubic splines, with which μ(x) is modelled by a series of cubic polynomials joined at "knot" values of x, such that the curve has continuous second derivatives everywhere. The least squares cubic spline fit minimizes the penalized least squares criterion for fitting μ(x),

Σ_j {y_j − μ(x_j)}² + λ ∫ {μ″(x)}² dx;

weighted sums of squares can be used if necessary. In most software implementations the spline fit can be determined either by specifying the degrees of freedom of the fitted curve, or by applying cross-validation (Section 6.4.1).

A spline fit will generally be biased, unless the underlying curve is in fact a cubic. That such bias is nearly always present for nonparametric curve fits can create difficulties. The other general feature that makes interpretation difficult is the occurrence of spurious bumps and bends in the curve estimates, as we shall see in Example 7.10.

Resampling methods
Two types of applications of nonparametric curves are use in checking a parametric curve, and use in setting confidence limits for μ(x) or prediction limits for Y = μ(x) + ε at some values of x. The first type is quite straightforward, because data would be simulated from the fitted parametric model: Example 7.11 illustrates this. Here we look briefly at confidence limits and prediction limits, where the nonparametric curve is the only "model".

The basic difficulty for resampling here is similar to that with density estimation, illustrated in Example 5.13, namely bias. Suppose that we want to calculate a confidence interval for μ(x) at one or more values of x. Case resampling cannot be used with standard recommendations for nonparametric regression, because the resampling bias of μ̂*(x) will be smaller than that of μ̂(x). This could probably be corrected, as with density estimation, by using a larger bandwidth or equivalent tuning constant. But simpler, at least in principle, is to apply the idea of model-based resampling discussed in Chapter 6.

The naive extension of model-based resampling would generate responses y_j* = μ̂(x_j) + ε_j*, where μ̂(x_j) is the fitted value from some nonparametric regression method, and ε_j* is sampled from appropriately modified versions of the residuals y_j − μ̂(x_j). Unfortunately the inherent bias of most nonparametric regression methods distorts both the fitted values and the residuals, and thence biases the resampling scheme. One recommended strategy is to use as simulation model a curve that is oversmoothed relative to the usual estimate. For definiteness, suppose that we are using a kernel method or a local smoothing method with tuning constant b, and that we use cross-validation
to determine the best value of b. Then for the simulation model we use the corresponding curve with, say, 2b as the tuning constant. To try to eliminate bias from the simulation errors ε_j*, we use residuals from an undersmoothed curve, say with tuning constant b/2. As with linear regression, it is appropriate to use modified residuals, where leverage is taken into account as in (6.9). This is possible for most nonparametric regression methods, since they are linear. Detailed asymptotic theory shows that something along these lines is necessary to make resampling work, but there is no clear guidance as to precise relative values for the tuning constants.

Example 7.10 (Motorcycle impact data) The response y here is acceleration measured x milliseconds after impact in an accident simulation experiment. The full data were shown in Figure 7.14, but for computational reasons we eliminate replicates for the present analysis, which leaves n = 94 cases with distinct x values.

The solid line in the top left panel of Figure 7.15 shows a cubic spline fit for the data of Figure 7.14, chosen by cross-validation and having approximately 12 degrees of freedom. The top right panel of the figure gives the plot of modified residuals against x for this fit. Note the heteroscedasticity, which broadly corresponds to the three strata separated by the vertical dotted lines. The estimated variances for these strata are approximately 4, 600 and 140. Reciprocals of these were used as weights for the spline fit in the left panel. Bias in these residuals is evident at times 10-15 ms, where the residuals are first mostly negative and then positive because the curve does not follow the data closely enough.

There is a rough correspondence between kernel smoothing and spline smoothing, and this, together with the previous discussion, suggests that for model-based resampling we use y_j* = μ̃(x_j) + ε_j*, where μ̃ is the spline fit obtained by doubling the cross-validation choice of λ. This fit is the dotted line in the top left panel of Figure 7.15. The random errors ε_j* are sampled from the modified residuals for another spline fit in which λ is half the cross-validation value. The lower right panel of the figure displays these residuals, which show less bias than those for the original fit, though perhaps a smaller bandwidth would be better still. The sampling is stratified, to reflect the very strong heteroscedasticity.

We simulated R = 999 datasets in this way, and to each fitted the spline curve μ̂*(x), with the bandwidth chosen by cross-validation each time. We then calculated 90% confidence intervals at six values of x, using the basic bootstrap method modified to equate the distributions of μ̂*(x) − μ̃(x) and μ̂(x) − μ(x). For example, at x = 20 the estimates μ̂ and μ̃ are respectively −110.8 and −106.2, and the 950th ordered value of μ̂* is −87.2, so that the upper confidence limit is −110.8 − {−87.2 − (−106.2)} = −129.8. The resulting confidence intervals are shown in the bottom left panel of Figure 7.15, together with the original
fit. Note how the confidence limits are centred on the convex side of the fitted curve in order to account for its bias; this is most evident at x = 20. ■
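The oversmooth/undersmooth recipe of this example is easy to experiment with. The sketch below is self-contained R using the base kernel smoother ksmooth on synthetic data, rather than a spline on the motorcycle data: cross-validation picks b, the simulation curve uses 2b, and the resampled residuals come from b/2; stratification and leverage adjustment are omitted for brevity.

  ## Model-based resampling for a kernel smoother: simulate from an
  ## oversmoothed curve (bandwidth 2b) with residuals from an
  ## undersmoothed curve (bandwidth b/2); data here are synthetic
  set.seed(10)
  x <- sort(runif(80, 0, 10))
  y <- sin(x) + rnorm(80, sd = 0.3)

  fit.at <- function(b, xx, yy, x0)
    ksmooth(xx, yy, kernel = "normal", bandwidth = b, x.points = x0)$y

  ## leave-one-out cross-validation to choose the bandwidth b
  cv <- function(b) mean(sapply(seq_along(x), function(i)
    (y[i] - fit.at(b, x[-i], y[-i], x[i]))^2), na.rm = TRUE)
  bs <- seq(0.3, 3, by = 0.1)
  b.cv <- bs[which.min(sapply(bs, cv))]

  mu.over <- fit.at(2 * b.cv, x, y, x)        # simulation model
  e.under <- y - fit.at(b.cv / 2, x, y, x)    # undersmoothed residuals
  e.under <- e.under - mean(e.under)          # mean adjustment

  y.star  <- mu.over + sample(e.under, replace = TRUE)   # one resample
  mu.star <- fit.at(b.cv, x, y.star, x)       # refitted curve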
7.6.2 Generalized additive models

The structural part of a generalized linear model, as outlined in Section 7.2.1, is the linear predictor η = x^T β, which is additive in the components x_i of x. It may not always be the case that we know whether x_i or some transformation of it should be used in the linear predictor. Then it makes sense, at least for exploratory purposes, to include in η a nonparametric curve component s_i(x_i) for each component x_i (except those corresponding to qualitative factors). This still assumes additivity of the effects of the x_is on the linear predictor scale.
Figure 7.15 Bootstrap analysis of motorcycle data, without replicate responses. Top left: data, original cubic spline fit (solid) and oversmoothed fit (dots). Top right: residuals from original fit; note their bias at times 10-15 ms. Bottom right: residuals from undersmoothed fit. The lines in these plots show strata used in the resampling. Bottom left: original fit and 90% basic bootstrap confidence intervals at six values of x; they are not centred on the fitted curve.
The result is the generalized additive model

g{μ(x)} = η(x) = Σ_{i=1}^p s_i(x_i),   (7.25)
where g(·) is a known link function, as before. As for a generalized linear model, the model specification is completed by a variance function, var(Y) = κV(μ).

In practice we might force some terms s_i(x_i) in (7.25) to be linear, depending upon what is known about the application. Each nonparametric term is typically fitted as a linear term plus a nonlinear term, the latter using smoothing splines or a local smoother. This means that the corresponding generalized linear model is a sub-model, so that the effects of nonlinearity can be assessed using differences of residual deviances, suitably scaled, as in (7.8). In standard computer implementations each nonparametric curve s_i(x_i) has (approximately) three degrees of freedom for nonlinearity. Standard distributional approximations for the resulting test statistics are sometimes quite unreliable, so resampling methods are particularly helpful in this context. For tests of this sort the null model for resampling is the generalized linear model, and the approach taken can be summarized by the following algorithm.

Algorithm 7.4 (Comparison of generalized linear and generalized additive models)
For r = 1, ..., R,
1 fix the covariate values at those observed;
2 generate bootstrap responses y₁*, ..., y_n* by resampling from the fitted generalized linear null model;
3 fit the generalized linear model to the bootstrap data and calculate the residual deviance d₀r*;
4 fit the generalized additive model to the bootstrap data, calculate the residual deviance d_r* and dispersion κ_r*; then
5 calculate t_r* = (d₀r* − d_r*)/κ_r*.
Finally, calculate the P-value as [1 + #{t_r* ≥ t}]/(R + 1), where t = (d₀ − d)/κ̂ is the scaled difference of deviances for the original data. •

The following example illustrates the use of nonparametric curve fits in model-checking.

Example 7.11 (Leukaemia data) For the data in Example 7.1, we originally fitted a generalized linear model with gamma variance function and linear predictor group + x with logarithmic link, where group is a factor with two levels. The fitted mean function for that model is shown as two solid curves in Figure 7.16, the upper curve corresponding to Group 1. Here we consider
Figure 7.16 Generalized linear model fits (solid) and generalized additive model fits (dashed) for leukaemia data of Example 7.1, plotted against log₁₀ white blood cell count.
whether or not the effect of x is linear. To do this, we compare the original fit to that of the generalized additive model in which x is replaced by s(x), a smoothing spline with three degrees of freedom. The link and variance functions are unchanged. The fitted mean function for this model is shown as dashed curves in the figure.

Is the smooth curve a significantly better fit? To answer this we use the test statistic Q defined in (7.8), where here D corresponds to the residual deviance for the generalized additive model, κ̂ is the dispersion for that model, and D₀ is the residual deviance for the smaller generalized linear model. For these data D₀ = 40.32 with 30 degrees of freedom, D = 30.75 with 27 degrees of freedom, and κ̂ = 0.725, so that q = (40.32 − 30.75)/0.725 = 13.2. The standard approximation for the null distribution of Q is chi-squared with degrees of freedom equal to the difference in model dimensions, here p − p₀ = 3, so the approximate P-value is 0.004. Alternatively, to allow for estimation of the dispersion, (p − p₀)⁻¹Q is compared to the F distribution with p − p₀ numerator and n − p − 1 denominator degrees of freedom, here 27, and this gives approximate P-value 0.012. It looks as though there is strong evidence against the simpler, loglinear model.

However, the accuracies of the approximations used here are somewhat questionable, so it makes sense to apply the resampling analysis. To calculate a bootstrap P-value corresponding to q = 13.2, we simulate the distribution of Q under the fitted null model, that is the original generalized linear model fit, but with nonparametric resampling. The particular resampling scheme we choose here uses the linear predictor residuals r_L defined in (7.10), one advantage of which is that positive simulated responses are guaranteed. The residuals in this case are

r_{Lj} = {log(y_j) − log(μ̂₀ⱼ)} / {κ̂₀^{1/2} (1 − h₀ⱼ)^{1/2}},
Figure 7.17 Chi-squared Q-Q plot of standardized deviance differences q* for comparing generalized linear and generalized additive model fits to the leukaemia data. The lines show the theoretical χ²₃ approximation (dashes) and the F approximation (dots). Resampling uses Pearson residuals on the linear predictor scale, with R = 999.
where h₀ⱼ, μ̂₀ⱼ and κ̂₀ are the leverage, fitted value and dispersion estimate for the null (generalized linear) model. These residuals appear quite homogeneous, so no stratification is used. Thus step 2 of Algorithm 7.4 consists of sampling ε₁*, ..., ε_n* randomly with replacement from r_{L1}, ..., r_{Ln} (without mean correction), and then generating responses y_j* = μ̂₀ⱼ exp(κ̂₀^{1/2} ε_j*) for j = 1, ..., n.

Applying this algorithm with R = 999 for our data gives the P-value 0.035, larger than the theoretical approximations, but still suggesting that the linear term in x is not sufficient. The bootstrap null distribution of q* deviates markedly from the standard χ²₃ approximation, as the Q-Q plot in Figure 7.17 shows. The F approximation is also inaccurate. A jackknife-after-bootstrap plot reveals that quantiles of q* are moderately sensitive to case 2, but without this case the P-value is virtually unchanged. Very similar results are obtained under parametric resampling with the exponential model, as might be expected from the original data analysis. ■

Our next example illustrates the use of semiparametric regression in prediction.

Example 7.12 (AIDS diagnoses) In Example 7.4 we discussed prediction of AIDS diagnoses based on the data in Table 7.4. A smooth time trend seems preferable to fitting a separate parameter for each diagnosis period, and accordingly we consider a model where the mean number of diagnoses in period j reported with delay k, the mean for the (j, k) cell of the table, equals μ_{jk} = exp{α(j) + β_k}. We take α(j) to be a locally quadratic lowess smooth with bandwidth 0.5.
The delay distribution is so sharply peaked here that although we could take a smooth function in the delay time, it is equally parsimonious to take 15 separate parameters β_k. We use the same variance function as in Example 7.4, which assumes that the observed counts y_{jk} are overdispersed Poisson with means μ_{jk}, and we fit the model as a generalized additive model. The residual deviance is 751.7 on 444.2 degrees of freedom, increased from 716.5 and 413 in the previous fit. The curve shown in the left panel of Figure 7.18 fits well, and is much more plausible as a model for underlying trend than the curve in Figure 7.5. The panel also shows the predicted values from this curve, which of course are heavily affected by the observed diagnoses in Table 7.4.

As mentioned above, in resampling from fitted curves it is important to take residuals from an undersmoothed curve, in order to avoid bias, and to add them to an oversmoothed curve. We take Pearson residuals (y − μ̂)/μ̂^{1/2} from a similar curve with bandwidth 0.3, and add them to a curve with bandwidth 0.7. These fits have deviances 745.3 on 439.2 degrees of freedom and 754.1 on 446.1 degrees of freedom. Both of these curves are shown in Figure 7.18. Leverage adjustment is awkward for generalized additive models, but the large number of degrees of freedom here makes such adjustments unnecessary. We modify resampling scheme (7.12), and repeat the calculations as for Algorithm 7.1 applied to Example 7.4, with R = 999.

Table 7.11 shows the resulting prediction intervals for the last quarters of 1990, 1991, and 1992. The intervals for 1992 are substantially shorter than those in Table 7.5, because of the different model. The generalized additive model is based on an underlying smooth trend in diagnoses, so predictions for the last few rows of the table depend less critically on the values observed
Figure 7.18 Generalized additive model prediction of UK AIDS diagnoses. The left panel shows the fitted curve with bandwidth 0.5 (smooth solid line), the predicted diagnoses from this fit (jagged dashed line), and the fitted curves with bandwidths 0.7 (dots) and 0.3 (dashes), together with the observed totals (+). The right panel shows the predicted quarterly diagnoses for 1989-92 (central solid line), and pointwise 95% prediction limits from the Poisson bootstrap (solid), negative binomial bootstrap (dashes), and nonparametric bootstrap without (dots) and with (dot-dash) stratification.
7.6 • Nonparametric Regression Table 7.11 Bootstrap 95% prediction intervals for numbers of AIDS cases in England and Wales for the fourth quarters of 1990, 1991, and 1992, using generalized additive model.
371 1990
Poisson N egative binom ial N o n p aram etric S tratified n o n p aram etric
1991
1992
295
314
302
336
415
532
293 294 293
317 316 315
298 296 295
339 337 338
407 397 394
547 545 542
in those rows. This contrasts w ith the Poisson tw o-w ay layout m odel, for which the predictions depend com pletely on single rows o f the table and are m uch m ore variable. C om pare the slight forecast drop in Figure 7.6 with the predicted increase in Figure 7.18. The d otted lines in Figure 7.18 show pointw ise 95% prediction bands for the A ID S diagnoses. The prediction intervals for the negative binom ial and n o n p aram etric schemes are similar, although the effect o f stratification is smaller. S tratification has no effect on the deviances. The negative binom ial deviances are typically a b o u t 90 larger th a n those generated under the nonparam etric scheme. The plausibility o f the sm ooth underlying curve and its usefulness for p re diction is o f course central to the approach outlined here.
■
7.6.3 Other m ethods O ften a nonp aram etric regression fit will be com pared to a param etric fit, b u t not all applications are o f this kind. F or exam ple, we m ay w ant to see w hether or n o t a regression curve is m onotone w ithout specifying its form. T he following application is o f this kind. Exam ple 7.13 (Downs syndrom e) Table 7.12 contains a set o f d a ta on inci dence o f D ow ns syndrom e babies for m others in various age ranges. M ean age is approxim ate m ean age o f the m m others whose babies included y babies with D ow ns syndrom e. These d a ta are plotted on the logistic scale in Fig ure 7.19, together w ith a generalized additive spline fit as an exploratory aid in m odelling the incidence rate. W h at we notice ab o u t the curve is th at it decreases w ith age for young m others, co n trary to intu itio n and expert belief. A sim ilar phenom enon occurs for o th er datasets. We w ant to see if this dip is real, as opposed to a statistical artefact. So a null m odel is required under which the rate o f occurrence is increasing w ith age. L inear logistic regression is clearly inappropriate, and m ost oth er stan d ard m odels give non-increasing rates. The approach taken is isotonic regression, in which the rates are fitted nonparam etrically subject to their increasing w ith age. F urther, in order to m ake the null m odel a special
372
7 • Further Topics in Regression
X m
y X m
y X m
y
17.0 13555 16
18.5 13675 15
19.5 18752 16
20.5 22005 22
21.5 23896 16
22.5 24667 12
23.5 24807 17
24.5 23986 22
25.5 22860 15
26.5 21450 14
27.5 19202 27
28.5 17450 14
29.5 15685 9
30.5 13954 12
31.5 11987 12
32.5 10983 18
33.5 9825 13
34.5 8483 11
35.5 7448 23
36.5 6628 13
37.5 5780 17
38.5 4834 15
39.5 3961 30
40.5 2952 31
41.5 2276 33
42.4 1589 20
43.5 1018 16
44.5 596 22
45.5 327 11
47.0 249 7
Table 7.12 Number y of Downs syndrome babies in m births for mothers with age groups centred on x years (Geyer, 1991).
Figure 7.19 Logistic scale plot of Downs syndrome incidence rates. Solid curve is generalized additive spline fit with 3 degrees of freedom
a> o
Mean age x
case o f the general model, the la tte r is taken to be an arb itrary convex curve for the logit o f incidence rate. If the incidence rate at age x, is n(xi) w ith logit{7r(x/)} = rj(xi) = */*, say, for i= then the binom ial log likelihood is
1=1 A convex m odel is one in which Xi+1 -
Xi
Xi -
%i—1
Xj+1 -
X i-1
t i i < - ------ — rn- 1 + 7 ------ — 1i+1. x i+ 1
Xi- 1
.
I = 2 , . .. ,k - 1 .
The general m odel fit will m axim ize the binom ial log likelihood subject to these constraints, giving estim ates fji,...,rjk- T he null m odel satisfies the constraints rji < rji+i for i = l , . . . , k — 1, which are equivalent to the previous convexity
373
7.6 ■Nonparametric Regression
Figure 7.20 Logistic scale plot of incidence rates for Downs syndrome data, with convex fit (solid line) and isotonic fit (dotted line).
Mean age x
constraints plus the single co n straint r\\ < r\2 - The null fit essentially pools adjacent age groups for which the general estim ates fji violate the m onotonicity o f the null m odel. If the null estim ates are denoted by then we take as our test statistic the deviance difference T = 2{(f(»)i,. ..,r\k) ~
-• •»flojc)}-
T he difficulty now is th a t the stan d ard chi-squared approxim ation for de viance differences does n o t apply, essentially because there is n o t a fixed value for the degrees o f freedom . T here is a com plicated large-sam ple approxim ation which m ay well n o t be reliable. So a param etric b o o tstrap is used to calculate the P-value. This requires sim ulation from the binom ial m odel w ith sample sizes m„ covariate values x, and logits fjo,iFigure 7.20 shows the convex and isotone regression fits, which clearly differ for age below 30. T he deviance difference for these fits is t = 5.873. S im ulation o f R = 999 binom ial datasets from the isotone m odel gave 33 values o f t* in excess o f 5.873, so the P-value is 0.034 and we conclude th a t the dip in incidence rate m ay be real. (F urther analysis w ith additional d a ta does n o t su p p o rt this conclusion.) Figure 7.21 is a histogram o f the t* values. It is possible th a t the null distribution o f T is unstable with respect to p ara m eter values, in which case the nested b o o tstrap procedure o f Section 4.5 should be used, possibly in conjunction w ith the recycling m ethod o f Section 9.4.4 to accelerate the com putation. ■
7 • Further Topics in Regression
374
Figure 7.21 Histogram of 999 resampled deviance test statistics for the Downs syndrome data. The unshaded portion corresponds to values exceeding observed test statistic t = 5.873.
0
2
4
6
8
10
t*
7.7 Bibliographic Notes A full treatm en t o f all aspects o f generalized linear m odels is given by M cCullagh and N elder (1989). D obson (1990) is a m ore elem entary discussion, while F irth (1991) gives a useful sh o rter account. D avison and Snell (1991) describe m ethods o f checking such models. Books by C ham bers and H astie (1992) and Venables an d Ripley (1994) cover m ost o f the basic m ethods discussed in this chapter, b u t restricted to im plem entations in S and S-Plus. Published discussions o f b o o tstrap m ethods for generalized linear m odels are usually lim ited to one-step iterations from the m odel fit, w ith resam pling o f Pearson residuals; see, for example, M oulton and Zeger (1991). T here appears to be no system atic study o f the various schemes described in Section 7.2.3. N elder an d Pregibon (1987) briefly discuss a m ore general application. M oulton and Zeger (1989) discuss b o o tstra p analysis o f repeated m easure data, w hile Booth (1996) describes m ethods for use when there is nested variation. Books giving general accounts o f survival d a ta are m entioned in Section 3.12. H jo rt (1985) describes m odel-based resam pling m ethods for p roportional haz ards regression, and studies their theoretical properties such as confidence interval accuracy. B urr an d D oss (1993) outline how the double boo tstrap can be used to provide confidence b an d s for a m edian survival time, and com pare its perform ance w ith sim ulated bands based on asym ptotic results. Lo and Singh (1986) an d H orvath an d Y andell (1987) m ake theoretical con tributions to b o o tstrap p in g survival data. B ootstrap and p erm utation tests for com parison o f survivor functions are discussed by H eller and V enkatram an (1996). Burr (1994) studies em pirically various b o o tstrap confidence interval m eth ods for the p ro p o rtio n al hazards m odel. She finds no overall best com bination,
7.7 • Bibliographic Notes
375
b u t concludes th a t norm al-theory asym ptotic confidence intervals and basic b o o tstrap intervals are generally good for regression param eters fi, while per centile intervals are satisfactory for survival probabilities derived from the product-lim it estim ate. R esults from the conditional boo tstrap are m ore er ratic th an those for resam pling cases o r from m odel-based resam pling, and the latter is generally preferred. A ltm an an d A ndersen (1989), C hen and G eorge (1985) and Sauerbrei and Schum acher (1992) apply case resam pling to variable selection in survival d ata m odels, b u t there seems to be little theoretical justification o f this. The use o f b o o tstrap m ethods in general assessm ent o f m odel uncertainty in regression is discussed by Faraw ay (1992). B ootstrap m ethods for general nonlinear regression m odels are usually studied theoretically via linear approxim ation. See H uet, Jolivet and M essean (1990) for some sim ulation results. T here appears to be no literature on incorporating curvature effects into m odel-based resam pling. T he behaviour o f residuals, leverages an d diagnostics for nonlinear regression m odels are developed by Cook, Tsai an d Wei (1986) and St. L aurent and C ook (1993). The large literature on prediction erro r as related to discrim ination is sur veyed by M cL achlan (1992). References for boo tstrap estim ation o f prediction error are m entioned in Section 6.6. Those dealing particularly w ith misclassification erro r include Efron (1983) and Efron and T ibshirani (1997). G ong (1983) discusses a p articu lar case where the prediction rule is based on a logistic regression m odel obtained by forw ard selection. References to b o o tstrap m ethods for m odel selection are m entioned in Section 6.6. The treatm en t by Shao (1996) covers both generalized linear m odels an d nonlinear models. T here are now num erous accounts o f nonparam etric regression, such as H astie and T ibshirani (1990) on generalized additive models, and G reen and Silverm an (1994) on penalized likelihood m ethods. A useful treatm ent o f local w eighted regression by H astie and L oader (1993) is followed by a discussion o f the relative m erits o f various kernel-type estim ators. Venables and Ripley (1994) discuss im plem entation in S-Plus with exam ples; see also C ham bers and H astie (1992). C onsiderable theoretical w ork has been done on boo tstrap m ethods for setting confidence bands on nonparam etric regression curves, m ostly focusing on kernel estim ators. H ardle and Bowman (1988) and H ardle and M arro n (1991) b o th em phasize the need for. different levels o f sm oothing in the com ponents o f m odel-based resam pling schemes. H all (1992b) gives a detailed theoretical assessm ent o f the properties o f such confidence band m ethods, and em phasizes the benefits o f the studentized bootstrap. There appears to be no corresponding treatm ent for spline sm oothing m ethods, nor for the m any com plex m ethods now used for fitting surfaces to m odel the effects o f m ultiple covariates.
7 ■Further Topics in Regression
376
A sum m ary o f m uch o f the theory for resam pling in nonlinear and no n param etric regression is given in C h ap ter 8 o f Shao and Tu (1995).
7.8 Problems 1
The estimator ft in a generalized linear model may be defined as the solution to the theoretical counterpart of (7.2), namely
/
c V ( t ) d f / e/
F{x' y} = 0'
where n is regarded as a function of ft through the link function g(fi) = r\ = x Tft. Use the result of Problem 2 . 1 2 to show that the empirical influence value for ft based on data (x1,ci,yi),...,(x„,c„,}'„) is lj = n(XT W X ) - 1xj
J ? " * 1' . cjV(Hj)3t]j/8fij
evaluated at the fitted model, where W is the diagonal matrix with elements given by (7 . 3 ) . Hence show that the approximate variance matrix for ft' for case resampling in a generalized linear model is k ( X T W X ) - 1X T W S X { X T W X ) ~ \ where $ = diag(rp,,..., rj,n) with the rpj standardized Pearson residuals (7.9). Show that for the linear model this yields the modified version of the robust variance matrix ( 6 . 2 6 ) . (Section 7 . 2 . 2 ; Moulton and Zeger, 1 9 9 1 ) 2
For the gamma model of Examples 7 . 1 and 7 . 2 , verify that v a r(7 ) = k/i2 and that the log likelihood contribution from a single observation is = - ^ { l o g i ^ + y/fi}. Show that the unstandardized Pearson and deviance residuals are respectively _ / 2 ( — 1) and sign(z—1 ) [ 2 k _ 1 / 2 { z — 1 — log(z)}]1/2, where z = y/p.. If the regression is loglinear, meaning that the log link is used, verify that the unstandardized linear predictor residuals are simply k~i/2 log(z). What are the possible ranges of the standardized residuals rP, rL and rDl Calculate these for the model fitted in Example 7 .2. If the deviance residual is expressed as d(y,p), check that d(y,p) = d(z, 1). Hence show that the resampling scheme based on standardized deviance residuals can be expressed as y ’ = faz’, where zj is defined by d(zj, 1) = s' with «' randomly sampled from rDi, . . . , r Dn. What further simplification can be made? (Sections 7 . 2 . 2 , 7 . 2 . 3 ) k
3
i
z
The figure below shows the fit to data pairs ( x u y \ ),•■■,(x„,y„) of a binary logistic model Pr(7 = 1) = 1 - Pr(Y = 0) =
eXp(/?0 + /?lX) 1 + exp(/?0 + /fix)
7.8 ■Problems
377
x
(a) Under case resampling, show that the maximum likelihood estimate for a bootstrap sample is infinite with probability close to e~2. W hat effect has this on the different types o f bootstrap confidence intervals for fa ? (b) Bias-corrected maximum likelihood estimates are obtained by modifying re sponse values (0,1) to (/iy/2, l+hj), where hj is the jth leverage for the model fit to the original data. D o infinite parameter estimates arise when bootstrapping cases from the modified data? (Section 7.2.3; Firth, 1993; M oulton and Zeger, 1991) 4
Investigate whether resampling schemes given by (7.12), (7.13), and (7.14) yield Algorithm 6.1 for bootstrapping the linear model.
5
Suppose that conditional on P = n, Y has a binom ial distribution with probability n and denominator m, and that P has a beta density
/( n | «,ff) =
r Wi(P)
- nf-',
0 < n < 1,
tx,P>0.
Show that Y has unconditional mean and variance (7.15) and express n and
For generalized linear models the analogue o f the case-deletion result in Problem 6.2 is
Kj = P-(xTwxy'wjk-^xj^^i. (a) Use this to show that when the y'th case is deleted the predicted value for y, is
7 • Further Topics in Regression
378
(b) Use (a) to give an approximation for the leave-one-out cross-validation estimate o f prediction error for a binary logistic regression with cost (7.23). (Sections 6.4.1,7.2.2)
7.9 Practicals 1
Dataframe r e m is s io n contains data from Freeman (1987) concerning a measure o f cancer activity, the LI values, for 27 cancer patients, o f whom 9 went into remission. Remission is indicated by the binary variable r = 1. Consider testing the hypothesis that the LI values do not affect the probability o f remission. First, fit a binary logistic m odel to the data, plot them, and perform a permutation test:
attach(remission) plot(LI+O.03*rnorm(27),r,pch=l,xlab="LI, jittered",xlim=c(0,2.5)) rem.glm <- glm(r"LI.binomial,data=remission) summary(rem.glm) x <- seqC0.4,2.0,0.02) eta <- cbind(rep(l,81) ,x)/C*'/.coeff icients(rem.glm) lines(x,inv.logit(eta),lty=2) rem.perm <- function(data, i) { d <-data d$LI<- d$LI[i] d.glm <- glm(r~LI,binomial,data=d) coefficients(d.glm) > rem.boot <- boot(remission, rem.perm, R=199, sim="permutation") qqnorm(rem.boot$t[,2],ylab="Coefficient of LI",ylim=c(-3,3)) abline(h=rem.boot$tO[2],lty=2) Compare this significance level with that from using a normal approximation for the coefficient o f LI in the fitted model. Construct bootstrap tests o f the hypothesis by extending the methods outlined in Section 6.2.5. (Freeman, 1987; Hall and Wilson, 1991) 2
Dataframe b reslo w contains data from Breslow (1985) on death rates from heart disease among British male doctors. A standard m odel is that the numbers o f deaths y have a Poisson distribution with mean nX, where n is the number o f person-years and X is the death rate. The focus o f interest is how death rate depends on two explanatory variables, a factor representing the age group and an indicator o f sm oking status, x. Two com peting models are X = exp(aage + fix),
X = aage + fix;
these are respectively multiplicative and additive. To fit these models we proceed as follows:
breslow.mult <- glm(y*offset(log(n))+age+smoke,poisson(log), data=breslow) breslow.add <- glm(y~n:age+ns-l,poisson(identity),data=breslow) Here n s is a variable for the effect o f smoking, constructed to allow for the difficulty in applying an offset in fitting the additive model. The deviances o f the fitted models are Dadd = 7.43 and Dmuit = 12.13. Although it appears that the additive model is the better fit, these models are not nested, so a chi-squared approximation cannot be applied to the difference o f deviances. For bootstrap
7.9 • Practicals
379
assessment o f fit based on the difference o f deviances, we simulate in turn from each fitted model. Because fits o f the additive m odel fail if there are no deaths in the lowest age group, and this happens with appreciable probability, we constrain the simulation so that there are deaths at each age.
breslow.fun <- function(data) { mult <- glm(y"offset(log(n))+age+smoke,poisson(log),data=data) add <- glm(y~n:age+ns-1,poisson(identity),data=data) deviance(mult)-deviance(add) } breslow.sim <- function(data, mle) { data$y <- rpois(nrow(data), mle) while(min(data$y)==0) data$y <- rpois(nrow(data), mle) data } add.mle <- fitted(breslow.add) add.boot <- boot(breslow, breslow.fun, R=99, sim="parametric", ran.gen=breslow.sim, mle=add.mle) mult.mle <- fitted(breslow.mult) mult.boot <- boot(breslow, breslow.fun, R=99, sim="parametric", ran.gen=breslow.sim, mle=mult.mle) boxplot(mult.boot$t,add.boot$t,ylab="Deviance difference", names=c("multiplicative","additive")) abline(h=mult.boot$tO,lty=2) W hat does this tell you about the relative fit o f the models? A different strategy would be to use parametric simulation, simulating not from the fitted models, but from the model with separate Poisson distributions for each o f the original data. D iscuss critically this approach. (Section 7.2; Example 4.5; Wahrendorf, Becher and Brown, 1987; Hall and Wilson, 1991) Dataframe h ir o s e contains the PET reliability data o f Table 7.6. Initially we consider estimating the bias and variance o f the M LEs o f the parameters /?o,. . . , / ? 4 and xo discussed in Example 7.5, using parametric simulation from the fitted Weibull model, but assuming that the data were subject to censoring at the fixed time 9104.25. Functions to calculate the minus log likelihood (in parametrization and to find the M LEs are:
hirose.lik <- function(mle, data) { xO <- 5-exp(mle [5]) lambda <- exp(mle[l]+mle[2]*(-log(data$volt-x0))) beta <- exp(mle[3]+mle[4]*(-log(data$volt))) z <- (data$time/lambda)“ beta sum(z - data$cens*log(beta*z/data$time)) } hirose.fun <- function(data, start) { d <- nlminb(start, hirose.lik, data=data) conv <- (d$message=="RELATIVE FUNCTION CONVERGENCE") c(conv, d$objective, d$parameters) } The M L E s for the original data can be obtained by setting hirose.start
380
7 ■Further Topics in Regression
hirose.gen <- function(data, mle) { xO <- 5 - exp (mle [5]) xl <- -log(data$volt-xO) xb <- -log(data$volt) lambda <- exp(mle[1]+mle[2]*xl) beta <- exp(mle[3]+mle[4]*xb) y <- rweibull(nrow(data), shape=beta, scale=lambda) data$cens <- ifelse(y<=9104.25,1,0) data$time <- ifelse(data$cens==l,y,9104.25) data >
and the bootstrap results are obtained by hirose.mle <- hirose.start hirose.boot <- boot(hirose, hirose.fun, R=19, sim="parametric", r a n .gen=hirose.g en, mle=hirose.m l e , start=hirose.start) hirose.boot$t[,7] <- 5-exp(hirose.boot$t[,7]) hirose.boot$tO[7] <- 5-exp(hirose.boot$tO[7]) hirose.boot
Try this with a larger value of R — but don’t hold your breath. For a full likelihood analysis for the parameter 9, the log likelihood must be maximized over /?i,...,/?4 for a given value of 9. A little thought shows that the necessary code is betaO <- function(theta, mle) { x49 <- -log(4.9-(5-exp(mle[4]))) x <- -log(4.9) log(theta*10"3) - m l e [1]*x49-lgamma(l + exp (-mle [2]-mle [3] *x)) } hirose.Iik2 <- function(mle, data, theta) { xO <- 5-exp(mle[4]) lambda <- exp(betaO(theta,mle)+mle[1]*(-log(data$volt-xO))) beta <- exp(mle[2]+mle[3]*(-log(data$volt))) z <- (data$time/lambda)“ beta sum(z - data$cens*log(beta*z/data$time)) } hirose.fun2 <- function(data, start, theta) { d <- nlminb(start, hirose.Iik2, data=data, theta=theta) conv <- (d$message=="RELATIVE FUNCTION CONVERGENCE") c(conv, d$objective, d$parameters) } hirose.f <- function(data, start, theta) c( hirose.fun(data,i.start), hirose.fun2(data,i ,start[-1],theta))
so that h i r o s e . f does likelihood fits when 6 is fixed and when it is not. The quantiles of the simulated likelihood ratio statistic are then obtained by make.theta <- function(mle, x=hirose$volt) { xO <- 5-exp(mle[5]) lambda <- exp(mle[1]-mle[2]*log(x-x0))/10~3 beta <- exp(mle[3]-mle[4]*log(x)) lambda*gamma(l+l/beta) > theta <- m a k e .theta(hirose.mle,4.9) hirose.boot <- boot(hirose, hirose.f, R=19, "parametric", r a n .gen=hirose.g e n , mle=hirose.m l e , start=hirose.start, theta=theta)
7.9 ■Practicals
381
R <- hirose.bootSR i <- c(l:R) [(hirose.boot$t[,l]==l)&(hirose.boot$t[,8]==l)] w <- 2*(hirose.boot$t[i,9]-hirose.boot$t[,2]) qqplot(qchisq(c(l:length(w))/(l+length(w)),1),w) abline(0,1,lty=2) Again, try this with a larger R. Can you see how the code would be modified for nonparametric simulation? (Section 7.3; Hirose, 1993) Dataframe n o d a l contains data on 53 patients with prostate cancer. For each patient there are five explanatory variables, each with two levels. These are aged (< 60, > 6 0 ); s t a g e , a measure o f the seriousness o f the tumour; grade, a measure o f the pathology o f the tumour; a measure o f the seriousness o f an xray; and a c id , the level o f serum acid phosphatase. The higher level o f each o f the last four variables indicates a more severe condition. The response r indicates whether the cancer has spread to the neighbouring lymph nodes. The data were collected to see whether nodal involvement can be predicted from the explanatory variables. Analysis o f deviance for a binary logistic regression model suggests that the response depends only on s ta g e , xray and a c id , and we base our predictions on the m odel with these variables. Our measure o f error is the average number o f misclassifications n 1 c(yj,/ij), where c(y, ft) is given by (7.23). For an initial model, apparent error, and ordinary and X -fold cross-validation estimates o f prediction error:
attach(nodal) cost <- function(r, pi=0) mean(abs(r-pi)>0.5) nodal.glm <- glm(r~stage+xray+acid,binomial,data=nodal) nodal.diag <- glm.diag(nodal.glm) app.err <- cost(r, fitted(nodal.glm)) cv.err <- cv.glm(nodal, nodal.glm, cost, K=53)$delta cv.ll.err <- c v .glm(nodal, nodal.glm, cost, K=ll)$delta For resampling-based estimates and plot for 0.632 errors:
nodal.pred.fun <- function(data, i, model) { d <- data[i,] d.glm <- update(model,data=d) pred <- predict(d.glm,data,type="response") D.F.Fhat <- cost(data$r, pred) D.Fhat.Fhat <- cost(d$r, fitted(d.glm)) c(data$r-pred, D.F.Fhat - D.Fhat.Fhat) } nodal.boot <- boot(nodal, nodal.pred.fun, R=200, model=nodal.glm) nodal.boot$f <- boot.array(nodal.boot) n <- nrow(nodal) err.boot <- mean(nodal.boot$t[,n+l]) + app.err ord <- order(nodal.diag$res) nodal.pred <- nodal.boot$t[,ord] err.632 <- 0 n.632 <- NULL pred.632 <- NULL for (i in l:n) { inds <- nodal,boot$f[,i]==0 err.632 <- err.632 + cost(nodal.pred[inds,i])/n n.632 <- c(n.632, sum(inds)) pred.632 <- c(pred.632, nodal.pred[inds,i]) }
382
7 ■Further Topics in Regression err.632 <- 0.368*app.err + 0.632*err.632 nodal.fac <- factor(rep(l:n,n.632),labels=ord) plot(nodal.fac, pred.632,ylab="Prediction errors", xlab="Case ordered by residual") abline(h=-0.5,lty=2); abline(h=0.5,lty=2) Cases with errors entirely outside the dotted lines are always misclassified, and conversely. Estimate the misclassification error using the m odel with all five explanatory variables. (Section 7.5; Brown, 1980)
5
Dataframe c lo t h records the number o f faults y in lengths x o f cloth. Is it true that E(y) oc x?
plot(cloth$x,cloth$y) cloth.glm <- glm(y~offset(log(x)).poisson,data=cloth) lines(cloth$x,f itted(cloth.glm)) summary(cloth.glm) cloth.diag <- glm.diag(cloth.glm) cloth.gam <- gam(y~s(log(x)) .poisson,data=cloth) lines(cloth$x,fitted(cloth.gam),lty=2) summary(cloth.gam) There is som e overdispersion relative to the Poisson m odel with identity link, and strong evidence that the generalized additive model fit c lo th .g a m improves on the straight-line m odel in which y is Poisson with mean /30 + fa x . We can try parametric simulation from the m odel with the linear fit (the null model) to assess the significance o f the decrease; cf. Algorithm 7.4:
cloth.gen <- function(data, fits) { y <- rpois(n=nrow(data).fits) data.frame(x=data$x,y=y) > cloth.fun <- function(data) { d.glm <- glm(y~offset(log(x)),poisson,data=data) d.gam <- gam(y~s(log(x)) .poisson,data=data) c(deviance(d.glm),deviance(d.gam)) } cloth.boot <- boot(cloth, cloth.fun, sim="parametric", R=99, r a n .gen=cloth.g e n , mle=fitted(cloth.glm)) Are the simulated drops in deviance roughly as they would be if standard asymptotics applied? How significant is the observed drop? In addition to the hypothesis that we want to test — that E(y) depends linearly on x — the parametric bootstrap im poses the constraint that the data are Poisson, which is not intended to be part o f the null hypothesis. We avoid this by a nonparametric bootstrap, as follows:
clothl <- data.frame(cloth,fits=fitted(cloth.glm), pearson=cloth.diag$rp) cloth.funl <- function(data, i) { y <- data$fits+sqrt(data$fits)*data$pearson[i] y <- round(y) y[y<0] <- 0 d.glm <- glm(y~offset(log(data$x)).poisson) d.gam <- gam(y~s(log(data$x)).poisson) c(deviance(d.glm).deviance(d.gam)) } cloth.boot <- boot(clothl, cloth.funl, R”99)
7.9 ■Practicals
383
Here we have used resampled standardized Pearson residuals for the null model, obtained by c lo t h .d ia g $ r p . How significant is the observed drop in deviance under this resampling scheme? (Section 7.6.2; Bissell, 1972; Firth, G losup and Hinkley, 1991) 6
The data n i t r o f e n are taken from a test o f the toxicity o f the herbicide nitrofen on the zooplankton Ceriodaphnia dubia, an important species that forms the basis o f freshwater food chains for the higher invertebrates and for fish and birds. The standard test measures the survival and reproductive output o f 10 juvenile C. dubia in each o f four concentrations o f the herbicide, together with a control in which the herbicide is not present. During the 7-day period o f the test each o f the original individuals produces three broods o f offspring, but for illustration we analyse the total offspring. A previous m odel for the data is that at concentration x the total offspring y for each individual is Poisson distributed with mean exp(/?, + [3[X + (h * 1)- The fit o f this m odel to the data suggests that low doses o f nitrofen augment reproduction, but that higher doses inhibit it. One thing required from analysis is an estimate o f the concentration x 5o o f nitrofen at which the mean brood size is halved, together with a 95% confidence interval for x 50. A second issue is posed by the surprising finding from a previous analysis that brood sizes are slightly larger at low doses o f herbicide than at high or zero doses: is this true? A wide variety o f nonparametric curves could be fitted to the data, though care is needed because there are only five distinct values o f x. The data do not look Poisson, but we use models with Poisson errors and the log link function to ensure that fitted values and predictions are positive. To compare the fits o f the generalized linear m odel described above and a robustified generalized additive model with Poisson errors:
nitro <- rbind(nitrofen,nitrofen,nitrofen,nitrofen,nitrofen) nitro <- rbind(nitro,nitro,nitro,nitro,nitro) nitro$conc <- seq(0,310,length=nrow(nitro)) attach(nitrofen) plot(conc,j itter(total),ylab="total") nitro.glm <- glm(total~conc+conc“2,poisson,data=nitrofen) lines(nitro$conc,predict(nitro.g l m ,nitro,"response"),lty=3) nitro.gam <- gam(total~s(conc,df=3).robust(poisson),data=nitrofen) lines(nitro$conc,predict(nitro.g a m ,nitro,"response")) To compare bootstrap confidence intervals for x 50 based on these models:
nitro.fun <- function(data, i, nitro) { assignC'd" ,data[i,] ,frame=l) d.fit <- gam(total~s(conc,df=3).robust(poisson),data=d) f <- predict(d.fit,nitro,"response") f.gam <- max(nitro$conc[f>0.5 * f [1]]) d.fit <- glm(total~conc+conc“2,poisson,data=d) f <- predict(d.fit,nitro,"response") f.glm <- max(nitro$conc[f>0.5*f [1]]) c(f.gam, f.glm) > nitro.boot <- boot(nitrofen, nitro.fun, R=499, strata=rep(l:5,re p (10,5)), nitro=nitro) boot.ci(nitro.boot,index=l,type=c("norm","basic","perc","bca")) boot.ci(nitro.boot,index=2,type=c("norm","basic","perc","bca"))
384
7 ■Further Topics in Regression Do the values of x'^ look normal? What is the bias estimate for x50 using the two models? To perform a bootstrap test of whether the peak is a genuine effect, we simulate from a model satisfying the null hypothesis of no peak to see if the observed value of a suitable test statistic (, say, is unusual. This involves fitting a model with no peak, and then simulating from it. We read fitted values m0(x) from the robust generalized additive model fit, but with 2.2 df (chosen by eye as the smallest for which the curve is flat through the first two levels of concentration). We then generate bootstrap responses by setting y ’ = m o ( x ) + s', where the e’ are chosen randomly from the modified residuals at that x. We take as test statistic the difference between the highest fitted value and the fitted value at x = 0. nitro.test <- fitted(gam(total~s(conc,df=2.2).robust(poisson), data=nitrofen)) f <- predict(nitro.glm,nitro,"response") nitro.orig <- max(f) - f[l] res <- (nitrofen$total-nitro.test)/sqrt(l-0.1) nitrol <- data.frame(nitrofen,res=res,fit=nitro.test) nitrol.fun <- function(data, i, nitro) { assignC'd" ,data[i,] ,frame=l) d$total <- round(d$fit+d$res[i]) d.fit <- glm(total~conc+conc“2,poisson,data=d) f <- predict(d.fit,nitro,"response") max(f)-f[l] } nitrol.boot <- boot(nitrol, nitrol.fun, R=99, strata=rep(l:5,r ep(10,5)), nitro=nitro) (1+sum(nitrol.boot$t>nitro.orig))/(1+nitrol.boot$R)
Do your conclusions change if other smooth curves are fitted? (Section 7.6.2; Bailer and Oris, 1994)
8 Complex Dependence
8.1 Introduction In previous chapters o u r m odels have involved variables independent at some level, an d we have been able to identify independent com ponents th at can be sim ulated. W here a m odel can be fitted and residuals o f some sort identified, the sam e ideas can be applied in the m ore com plex problem s discussed in this chapter. W here th a t m odel is param etric, param etric sim ulation can in principle be used to obtain resam ples, though M arkov chain M onte C arlo techniques m ay be needed in practice. But in nonparam etric situations the dependence m ay be so com plex, or our knowledge o f it so limited, th a t neither o f these approaches is feasible. O f course some assum ption o f repeatedness w ithin the d a ta is essential, o r it is im possible to proceed. But the repeatability m ay not be at the level o f individual observations, b u t o f groups o f them , and there is typically dependence betw een as well as w ithin groups. This leads to the idea o f constructing b o o tstrap d a ta by taking blocks o f some sort from the original observations. T he area is in rapid developm ent, so we avoid a detailed m athem atical exposition, an d merely sketch key aspects o f the m ain ideas. In Section 8.2 we describe som e o f the resam pling schemes proposed for time series. Section 8.3 outlines some ideas useful in resam pling point processes.
8.2 Time Series 8.2.1 Introduction A time series is a sequence o f observations arising in succession, usually at tim es spaced equally an d taken to be integers. M ost m odels for tim e series assum e th a t the d a ta are stationary, in which case the jo in t distribution o f any subset o f them depends only on their times o f occurrence relative to each other
385
8 ■Complex Dependence
386
and n o t on their absolute position in the series. A w eaker assum ption used in d a ta analysis is th a t the jo in t second m om ents o f observations depend only on their relative positions; such a series is said to be second-order o r weakly stationary. Time domain T here are two basic types o f sum m ary quantities for stationary tim e series. The first, in the tim e dom ain, rests on the jo in t m om ents o f the observations. Let {7,} be a second-order stationary tim e series, w ith zero m ean and autocovari ance function yj. T h at is, E (Yj) = 0 an d co\(Yk, Yk+j) = yj for all k and j ; the variance o f Yj is yo- T hen the autocorrelation function o f the series is pj = y j / y o, for j = 0, + 1, . . which m easures the co rrelation betw een observations at lag j a p a rt; o f course —1 < pj < 1, po = 1, an d ps = p _; . A n uncorrelated series would have pj = 0, and if the d a ta were norm ally d istributed this would imply th a t the observations were independent. For exam ple, the statio n ary m oving average process o f order one, or M A(1) model, has Yj = ej + Pej-i,
; =
1 ,0 ,1 ,...,
(8.1)
where {ej} is a white noise process o f innovations, th a t is, a stream o f inde pendent observations w ith m ean zero and variance a 1. T he autocorrelation function for the (Y)} is p\ = /? /(l + P2) and pj = 0 for |y| > 1; this sharp cut-off in the autocorrelations is characteristic o f a m oving average process. O nly if P = 0 is the series Yj independent. O n the o ther hand the stationary autoregressive process o f o rd er one, o r A R(1) m odel, has Yj = ctYj-i + Ej,
j = . . . , - 1, 0, 1, . . . ,
| « | < 1.
(8.2)
The autoco rrelatio n function for this process is pj = a 1-'1 for j = + 1 , ± 2 and so forth, so large a gives high correlation betw een successive observations. The autocorrelatio n function decreases rapidly for b o th m odels (8.1) and (8.2). A close relative o f the au to co rrelatio n function is the partial autocorrelation function, defined as pj = yj/yo, where yj is the covariance betw een Y& and Yk+j after adjusting for the intervening observations. T he partial autocorrelations for the M A (1) m odel are p ’j
= - ( - m i - /?2){i - ^2(;+1)} - 1,
j
= ± i , +2, —
The A R(1) m odel has p\ = a, and pj = 0 for \j\ > 1; a sh arp cut-off in the partial autocorrelations is characteristic o f autoregressive processes. The sam ple estim ates o f pj and pj are basic sum m aries o f the structure o f a time series. Plots o f them against j are called the correlogram and partial correlogram o f the series. One widely used class o f linear time series m odels is the autoregressivem oving average or A R M A process. T he general ARM A(p,<j) m odel is defined
387
8.2 • Time Series
by 9
P
Yj = '^2<*kYj-k + Ej + '^2PkEj-k, k=l k=1
(8.3)
where {£,} is a w hite noise process. I f all the a& equal zero, { Yj} is the m oving average process M A (q), w hereas if all the f t equal zero, it is AR(p). In order for (8.3) to represent a stationary series, conditions m ust be placed on the coefficients. Packaged routines enable m odels (8.3) to be fitted readily, while series from them are easily sim ulated using a given innovation series ..., £ —
1, £ o , £ j , . . . .
Frequency domain The second ap p ro ach to tim e series is based on the frequency dom ain. The spectrum o f a statio n ary series w ith autocovariances yj is 00
g(co) = y0 + 2 ^ 2 yj cos (coj), i =i
0 < co < n. (8.4)
This sum m arizes the values o f all the autocorrelations o f {Yj}. A w hite noise process has the flat spectrum g(co) = yo, while a sh arp peak in g(to) corresponds to a strong periodic com ponent in the series. F or example, the spectrum for a stationary A R (1) m odel is g(co) = cr2{ 1 — 2acos(co) + a2}-1 . The em pirical F ourier transform plays a key role in d a ta analysis in the frequency dom ain. T he treatm en t is simplified if we relabel the series as yo, a n d suppose th a t n = 2np + 1 is odd. Let f = e2n'^n be the nth com plex ro o t o f unity, so (" = 1. T hen the empirical Fourier transform o f the d a ta is the set o f n com plex-valued quantities n—1 y k = Y 2 £}ky j ’
fc = o ,. . . , n - 1;
7=0
note th a t yo = ny an d th a t the com plex conjugate o f % is y n-k, for k = 1 ,...,« — 1. F o r different k the vectors (1, Ck, . . . , are orthogonal. It is straightforw ard to see th a t 1 "-1
~^2C~}kyk = yj,
7
= 0 , . . —l,
k=0 so this inverse Fourier transform retrieves the data. N ow define the Fourier frequencies cok — 2nk /n, for k = 1, . . . , n p . T he sam ple analogue o f the spectrum at a>k is the periodogram,
Y2yj n- 1
I{(ok) = n ' l & l 2 = n 1
j =0
cos(cokj)
\ +1YI yjsin(mkj)
Y
(n-l
I
I j =0
' 2
8 ■Complex Dependence
388
The orthogonality properties o f the vectors involved in the Fourier transform im ply th a t the overall sum o f squares o f the d a ta m ay be expressed as n- 1
(8.5)
The em pirical Fourier transform an d its inverse can be rapidly calculated by an algorithm know n as the f a s t Fourier transform. If the d a ta arise from a statio n ary process {Yj} with spectrum g(co), where Yj = YlT=-ccai - i Ei’ '"'ith {£/} a norm al w hite noise process, then as n increases and provided the term s |a/| decrease sufficiently fast as l—> ± oo, the real and im aginary parts o f the com plex-valued ran d o m variables y i , . . . , y „ F are asym ptotically independent norm al variables w ith m eans zero and variances ng(o)[)/2,. . . , ng(«„f )/2 ; furtherm ore the % a t different F ourier frequencies are asym ptotically independent. This implies th a t as n—>co for such a process, the periodogram values I{a>k) a t different Fourier frequencies will be independent, and th a t I(cok) will have an exponential distrib u tio n with m ean g(co^). (If n is even I ( n) m ust be added to (8.5); I(n) is approxim ately independent o f the /(ajfc) an d its asym ptotic distribution is g(Tt)xi-) T hus (8.5) decom poses the to tal sum o f squares into asym ptotically independent com ponents, each associated w ith the am o u n t o f variation due to a particular Fourier frequency. W eaker versions o f these results hold w hen the process is n o t linear, o r when the process {e/} is n o t norm al, the key difference being th a t the jo in t lim iting distribution o f the p eriodogram values holds only for a finite n um ber o f fixed frequencies. If the series is w hite noise, und er m ild conditions its periodogram ordinates I{co\) , . . . , I{(o„F) are roughly a ran d o m sam ple from an exponential distribu tion w ith m ean yo. Tests o f independence m ay be based on the cumulative periodogram ordinates, J2j=i
k =
— 1.
Z jU H ujY W hen the d a ta are w hite noise these ordinates have roughly the same jo in t distributio n as the o rd er statistics o f np — 1 uniform ran d o m variables. Exam ple 8.1 (Rio N egro d a ta ) The d a ta for o u r first time series exam ple are m onthly averages o f the daily stages — heights — o f the R io N egro, 18 km upstream a t M anaus, from 1903 to 1992, m ade available to us by Professors H. O ’Reilly S ternberg an d D. R. B rillinger o f the U niversity o f C alifornia at Berkeley. Because o f the tiny slope o f the w ater surface and the lower courses o f its flatland affluents, these d a ta m ay be regarded as a reasonable approxim ation o f the w ater level in the A m azon R iver at the confluence o f the
8.2 • Time Series
389
Figure 8.1 Deseasonalized monthly average stage (metres) of the R io N egro at M anaus, 1903-1992 (Sternberg, 1995).
1900
1920
1940
1960
1980
2000
Time (years)
two rivers. To remove the strong seasonal com ponent, we subtract the average value for each m onth, giving the series o f length n = 1080 shown in Figure 8.1. F or an initial exam ple, we take the first ten years o f observations. The top panels o f Figure 8.2 show the correlogram and partial correlogram for this sh o rter series, w ith horizontal lines showing approxim ate 95% confidence limits for correlations from a w hite noise series. The shape o f the correlogram and the cut-off in the p artial correlogram suggest th a t a low -order autoregressive m odel will fit the data, which are quite highly correlated. T he lower left panel o f the figure shows the periodogram o f the series, which displays the usual high variability associated w ith single periodogram ordinates. The lower right panel shows the cum ulative periodogram , which lies well outside its overall 95% confidence b and an d clearly does n o t correspond to a white noise series. A n A R (2) m odel fitted to the shorter series gives oil = 1.14 and a.2 = —0.31, b o th w ith stan d ard erro r 0.062, and estim ated innovation variance 0.598. The left panel o f Figure 8.3 shows a norm al probability plot o f the standardized residuals from this m odel, an d the right panel shows the cum ulative peri odogram o f the residual series. The residuals seem close to G aussian white noise. ■
8.2.2 M odel-based resampling T here are two approaches to resam pling in the tim e dom ain. The first and sim plest is analogous to m odel-based resam pling in regression. T he idea is to fit a suitable m odel to the data, to construct residuals from the fitted model, an d then to generate new series by incorporating random sam ples from the
8 ' Complex Dependence
390
Figure 8.2 Summary plots for the Rio Negro data, 1903-1912. The top panels show the correlogram and partial correlogram for the series. The bottom panels show the periodogram and cumulative periodogram.
£ to o> o o O
Lag
Lag
omega
omega/pi
residuals into the fitted m odel. T he residuals are typically recentred to have the same m ean as the innovations o f the m odel. A b o u t the sim plest situation is w hen the A R (1) m odel (8.2) is fitted to an observed series y i , . . . , y „ , giving estim ated autoregressive coefficient a an d estim ated innovations
ej
= yj
- &y j - u
j = 2,...,n;
e\ is uno b tain ab le because yo is unknow n. M odel-based resam pling m ight then proceed by equi-probable sam pling w ith replacem ent from centred residuals — e, . . . , en — e to obtain sim ulated innovations e j,. . . , and then setting
8.2 ■ Time Series
Figure 8.3 Plots for residuals from AR(2) model fitted to the Rio Negro data, 1903-1912: normal Q-Q plot of the standardized residuals (left), and cumulative periodogram of the residual series (right).
391
E 2? o> o -o o
V)
co D “O cn
0Q.
3 e3 o
Quantiles of standard normal
omega/pi
yo = ej and y j = a yj_! + e j ,
j = l,...,n ;
(8.6)
o f course we m ust have |a| < 1. In fact the series so generated is n o t stationary, an d it is b etter to start the series in equilibrium , o r to generate a longer series o f innovations an d sta rt (8.6) at j = —k, where the ‘b u rn-in’ period —k , . . . , 0 is chosen large enough to ensure th at the observations y [ , . . . , y * are essentially statio n ary ; the values y'_k, . . . , y ' ) are discarded. T hus m odel-based resam pling for tim e series is based on applying the defining equation(s) o f the series to innovations resam pled from residuals. This procedure is simple to apply, and leads to good theoretical behaviour for estim ates based on such d a ta w hen the m odel is correct. F or example, studentized b o o tstrap confidence intervals for the autoregressive coefficients ak in an A R (p) process enjoy the good asym ptotic properties discussed in Section 5.4.1, provided th a t the m odel fitted is chosen correctly. Just as there, confidence intervals based on transform ed statistics m ay be b etter in practice. Exam ple 8.2 (Wool prices) T he A ustralian W ool C o rp o ratio n m onitors prices weekly w hen wool m arkets are held, and sets a m inim um price ju st before each week’s m arkets open. This reflects the overall price o f wool for th a t week, b u t the prices actually paid can vary considerably relative to the m inim um . The left panel o f Figure 8.4 shows a plot o f log(price p aid /m in im u m price) for those weeks w hen m arkets were held from July 1976 to June 1984. The series does n o t seem stationary, having som e o f the characteristics o f a ran d o m walk, as well as a possible overall trend. I f the log ratio in week j follows a random walk, we have Yj = Yj -\ + Sj,
392
8 ■Complex Dependence
Figure 8.4 Weekly log ratio o f price paid to m inimum price for A ustralian wool from July 1976 to June 1984 (Diggle, 1990, pp. 229-237). Left panel: original data. R ight p a n el: first differences o f data.
0
50 100 150 200 250 300
Time in weeks
Time in weeks
where the ej are w hite noise; a non-zero m ean for the innovations Ej will lead to drift in yj. The right panel o f Figure 8.4 shows the differenced series, ej = y j —y j - i , which appears stationary a p a rt from a change in the innovation variance at a b o u t the 100th week. In o u r analysis we drop the first 100 observations, leaving a differenced series o f length 208. A n alternative to the ran d o m w alk m odel is the A R(1) m odel ( Y j - n ) = <x(Y}- 1 -iJ.) + ej ;
(8.7)
this gives the ran d o m w alk when a = 1. If the innovations have m ean zero and a is close to b u t less th a n one, (8.7) gives stationary data, though subject to the clim bs and falls seen in the left panel o f Figure 8.4. The im plications for forecasting depend on the value o f a, since the variance o f a forecast is only asym ptotically bounded w hen |a| < 1. We test the unit root hypothesis th a t the d ata are a ran d o m walk, or equivalently th a t a = 1, as follows. O ur test is based on the o rdinary least squares estim ate o f a in the regression Yj = }’ + a Yj-1 +Sj for j = 2 , . . . , n using test statistic T = (1 —a) /S, where S is the stan d ard erro r for a calculated using the usual form ula for a straight-line regression m odel. L arge values o f T are evidence against the random walk hypothesis, w ith or w ithout drift. T he observed value o f T is t = 1.19. The distribution o f T is far from the usual stan d ard norm al, however, because o f the regression o f each observation on its predecessor. U nder the hypothesis th a t a = 1 we sim ulate new time series Y J , . . . , Y * by generating a b o o tstrap sam ple e \ , . . . , e* from the differences e i , . . . , e n and then setting YJ = Y\, Y j = YJ + e 2" , an d YJ = Y]'_l + £* for subsequent j. This is (8.6) applied w ith the null hypothesis value a = 1. T he value o f T ' is then obtained from the regression o f YJ on YJ_X for j = 2 The left panel
8.2 • Time Series
393
Figure 8.5 Results for 199 replicates of the random walk test statistic, T*. The left panel is a normal plot of t*. The right panel shows t* plotted against the inverse sum of squares for the regressor, with the dotted line giving the observed value.
Quantiles of standard normal
1/SSy*
o f Figure 8.5 shows the em pirical distribution o f T * in 199 sim ulations. The distribution is close to norm al w ith m ean 1.17 and variance 0.88. T he observed significance level for t is (97 + l ) / ( 199 + 1) = 0.49: there is no evidence against the ran d o m w alk hypothesis. The right panel o f Figure 8.5 shows the values o f f* plotted against the inverse sum o f squares for the regressor y j _ v In a conventional regression, inference is usually conditional on this sum o f squares, which determ ines the precision o f the estim ate. The dotted line shows the observed sum o f squares. If the conditional distribution o f tm is th ought to be appropriate here, the distribution o f values o f t* close to the do tted line shows th a t the conditional significance level is even higher; there is no evidence against the random walk conditionally or unconditionally. ■
M odels are com m only fitted in o rder to predict future values o f a tim e series, b u t as in o th er settings, it can be difficult to allow for the various sources o f u ncertainty th a t affect the predictions. The next exam ple shows how boo tstrap m ethods can give some idea o f the relative contributions from innovations, estim ation error, and m odel error.
Exam ple 8.3 (Sunspot num bers) Figure 8.6 shows the m uch-analysed annual sunspot num bers y [ , - - - , y 2%g from 1700-1988. T he d a ta show a strong cycle w ith a period o f ab o u t 11 years, and som e hint o f non-reversibility, which shows up as a lack o f sym m etry in the peaks. We use values from 1930-1979 to predict the num bers o f sunspots over the next few years, based on fitting
394
8 ■Complex Dependence
Figure 8.6 Annual sunspot numbers, 1700-1988 (Tong, 1990, p. 470).
Time in years
A ctu al Predicted
1980
81
82
83
84
85
86
87
1988
23.0 21.6
21.8 18.9
19.6 14.9
14.4 12.2
11.7 9.1
6.7 7.5
5.6 6.8
9.0 8.8
18.1 13.6
3.2 3.8 3.6 3.8 6.6 3.6
3.3 4.1 3.8 3.9 6.7 3.9
3.4 4.0 3.9 4.0 6.8 4.3
3.4 3.6 3.8 4.1 6.5 4.3
Table 8.1 Predictions and their standard errors for 2{{y'j + 1)1/2 - 1} for sunspot data, 1980-1988, based on data for 1930-1979. The standard errors are nominal, and also those obtained under model-based resampling assuming the simulated series y* are AR(9), not assuming y ‘ is AR(9), and by a conditional scheme, and the block and post-blackened bootstraps with block length / = 10. See Examples 8.3 and 8.5 for details.
S ta n d a rd e rro r N o m in al M odel, A R (9) M odel M odel, c o n d it’l Block, I = 10 P o st-b lack ’d, I = 10
2.0 2.2 2.3 2.5 7.8 2.1
2.9 2.9 3.3 3.6 7.0 3.3
3.2 3.0 3.6 4.1 6.9 3.9
3.2 3.2 3.5 3.9 6.9 4.0
3.2 3.3 3.5 3.8 6.7 3.6
A R (p) m odels p
Y j- n
= ^2<3ik ( Yj - k -
n)
+ £j ,
( 8 .8 )
fc=i to the transform ed observations yj = 2 {(yj + l )1/2 — 1}; this transform ation is chosen to stabilize the variance. T he corresponding m axim ized log likelihoods are denoted ?p. A stan d ard ap p ro ach to m odel selection is to select the m odel th a t m inimizes A IC = —2 i p + 2p, which trades off goodness o f fit (m easured by the m axim ized log likelihood) against m odel com plexity (m easured by p). H ere the resulting “b est” m odel is A R(9), w hose predictions yj for 1980-88 and their n om inal sta n d a rd errors are given a t the top o f Table 8.1. These stan d a rd errors allow for prediction e rro r due to the new innovations, b u t not for param eter estim ation or m odel selection, so how useful are they? To assess this we consider m odel-based sim ulation from (8.8) using centred residuals an d the estim ated coefficients o f the fitted A R (9) m odel to generate series y*1,...,y * 59, corresponding to the p eriod 1930-1988, for r = l , . . . , R . We then fit autoregressive m odels up to o rd er p = 25 to y ’{, . . . , y ' 50, select the m odel giving the sm allest A IC , and use this m odel to produce predictions y'rj for j = 5 1, . .. , 59. The prediction erro r is y ’j — y'r], and the estim ated standard
8.2 ■ Time Series
395
errors o f this are given in Table 8.1, based o n J ? = 999 b o o tstrap series. The orders o f the fitted m odels were O rd er #
1 234 5 3 257 126100
67 89 273 8522 18
10 83
11
12 23
72
so the A R (9) m odel is chosen in only 8% o f cases, and m ost o f the m odels selected are less com plicated. The fifth and sixth rows o f Table 8.1 give the estim ated sta n d a rd errors o f the y ’ — y* using the 83 sim ulated series for which the selected m odel was A R(9) and using all the series, based on the 999 replications. T here is ab o u t a 10-15% increase in stan d ard erro r due to p aram eter estim ation, an d the stan dard errors for the A R (9) m odels are m ostly smaller. Prediction errors should take account o f the values o f yj im m ediately prior to the forecast period, since presum ably these are relevant to the predictions actually m ade. Predictions th a t follow on from the observed d a ta can be obtained by using innovations sam pled a t random except for the period j = n — k + 1 ,... ,n, where we use the residuals actually observed. T aking k = n yields the original series, in which case the only variability in the y'rj is due to the innovations in the forecast period; the stan d ard errors o f the predictions will then be close to the nom inal stan d ard error. However, if k is sm all relative to n, the differences y*j — y'j will largely reflect the variability due to the use o f estim ated param eters, although the y*rj will follow on from y n. The conditional stan d ard errors in Table 8.1, based on k = 9, are a b o u t 10% larger th an the unconditional ones, and substantially larger th an the nom inal stan d ard errors. The distrib u tio n s o f the y'j — y'j app ear close to norm al with zero means, and a sum m ary o f variation in term s o f standard errors seems appropriate. T here will clearly be difficulties w ith norm al-based prediction intervals in 1985 and 1986, w hen the lower lim its o f 95% intervals for y are negative, and it m ight be b etter to give one-sided intervals for these years. It would be better to use a studentized version o f y'j — y'j if an ap p ro p riate stan d ard error were readily available. W hen b o o tstra p series are generated from the A R (9) m odel fitted to the d a ta from 1700-1979, the orders o f the fitted m odels are O rd er #
5 1
9 765
10 88
11 57
12131415 161718 28211111 51 4
19 25
so the A R (9) m odel is chosen in ab o u t 75% o f cases. T here is a tendency for A IC to lead to overfitting: ju st one o f the m odels has order less th a n 9. For this longer series p aram eter estim ation and m odel selection inflate the nom inal stan d ard erro r by at m ost 6 %. The above analysis gives the variability o f predictions based on selecting the m odel th a t m inim izes A IC on the basis th at an A R (9) m odel is correct, and
396
8 ■Complex Dependence
does n o t give a true reflection o f the erro r otherwise. Is an autoregressive or m ore generally a linear m odel ap p ro p riate? A test for linearity o f a time series can be based on the non-additivity statistic T — w2{n — 2m — 2)/(R S S — w2), where RSS is the residual sum o f squares for regression o f (ym+i ,... ,y„) on the (n — m) x (m + 1) m atrix X whose y'th row is ( l , y m+j - i , . . . , y j ) , with residuals qj and fitted values gy. Let q'j denote the residuals from the regression o f gj on X , and let w equal T hen the approxim ate distribution o f T is fi,n —2m—2> w ith large values o f T indicating potential nonlinearity. The observed value o f T w hen m = 20 is 5.46, giving significance level 0.02, in good agreem ent w ith b o o tstrap sim ulations from the fitted A R(9) model. The significance level varies little for values o f m from 6 to 30. There is good evidence th a t the series is nonlinear. We retu rn to these d ata in Exam ple 8.5. ■
The m ajo r draw back w ith m odel-based resam pling is th a t in practice not only the p aram eters o f a m odel, b u t also its structure, m ust be identified from the data. I f the chosen structure is incorrect, the resam pled series will be generated from a w rong m odel, an d hence they will n o t have the same statistical properties as the original data. This suggests th a t som e allowance be m ade for m odel selection, as in Section 3.11, b u t it is unclear how to do this w ithout som e assum ptions ab o u t the dependence structure o f the process, as in the previous example. O f course this difficulty is less critical when the m odel selected is strongly indicated by subject-m atter considerations o r is w ell-supported by extensive data.
8.2.3 Block resampling The second ap proach to resam pling in the tim e dom ain treats as exchangeable n o t innovations, b u t blocks o f consecutive observations. The sim plest version o f this idea divides the d a ta into b non-overlapping blocks o f length /, where we suppose th a t n = bl. We set z\ = ( y i , . . . ,y i ) , z2 = (yi+u■■■,yn), and so forth, giving blocks z \ , . . . , z&. The procedure is to take a b o o tstrap sam ple with equal probabilities b~l from the z; , an d then to paste these end-to-end to form a new series. As a simple example, suppose th a t the original series is y i , . . . , y i2, and th a t we take I = 4 an d b = 3. T hen the blocks are z\ = ( y i , y 2 , y 3,y«), Z2 = iys,y6,yi,y%), an d z3 = {y<),yw,yu,yi 2 )- If the resam pled blocks are z\ = Z2, z \ = zj, an d z\ = zi, the new series o f length 12 is
[y]}
=
z i>z2>z3
=
y5,ye,yi,y%,
yuyi,yi,y*,
y5, yt,, yi, yz-
In general, the resam pled series are m ore like w hite noise th a n the original series, because o f the joins betw een blocks w here successive independently chosen z* meet. The idea th a t underlies this block resampling scheme is th a t if the blocks
8.2 ■ Time Series
397
are long enough, enough o f the original dependence will be preserved in the resam pled series th a t statistics f* calculated from {yj} will have approxim ately the sam e distribution as values t calculated from replicates o f the original series. C learly this approxim ation will be best if the dependence is weak and the blocks are as long as possible, thereby preserving the dependence m ore faithfully. O n the o th er hand, the distinct values o f t* m ust be as num erous as possible to provide a good estim ate o f the distribution o f T, and this points tow ards short blocks. T heoretical work outlined below suggests th a t a com prom ise in which the block length I is o f order ny for some y in the interval (0,1) balances these tw o conflicting needs. In this case b o th the block length / an d the n u m b er o f blocks b = n/ l tend to infinity as n —* oo, though different values o f y are ap p ro p riate for different types o f statistic t. There are several v ariants on this resam pling plan. One is to let the original blocks overlap, in o u r exam ple giving the n — I + 1 = 9 blocks z\ = (>’i , ...,> '4), 22 = Z3 = t o , . . . , ye), and so forth up to z9 = (y9, . . . , y n) . This incurs end effects, as the first and last / — 1 o f the original observations ap p ear in fewer blocks th an the rest. Such effects can be rem oved by w rapping the d a ta around a circle, in o u r exam ple adding the blocks z\o = (yio,y n , y n , y \ ) , Z n . = ( y u , y n , y i , y 2 ), and Z 12 = 0 ' 12,J '1»J'2,J'3)- This ensures th a t each o f the original observations has an equal chance o f appearing in a sim ulated series. E nd correction by w rapping also removes the m inor problem with the no n overlapping scheme th a t the last block is shorter th an the rest if n / l is not an integer.
Post-blackening The m ost im p o rtan t difficulty w ith resam pling schemes based on blocks is th at they generate series th a t are less dependent th an the original data. In some circum stances this can lead to catastrophically bad resam pling approxim ations, as we shall see in Exam ple 8.4. It is clearly inappropriate to take blocks o f length / = 1 w hen resam pling dependent data, for the resam pled series is then w hite noise, b u t the “w hitening” can rem ain substantial for small and m oderate values o f I. This suggests a strategy interm ediate betw een m odelbased and block resam pling. The idea is to “pre-w hiten” the series by fitting a m odel th a t is intended to remove m uch o f the dependence betw een the original observations. A series o f innovations is then generated by block resam pling o f residuals from the fitted m odel, and the innovation series is then “post-blackened” by applying the estim ated m odel to the resam pled innovations. T hus if an A R (1) m odel is used to pre-w hiten the original data, new series are generated by applying (8.6) b u t w ith the innovation series {ej} sam pled n o t independently b u t in blocks taken from the centred residual series e2 - e , . . . , e„ - e.
8 • Complex Dependence
398 B lo ck s o f blocks
A different ap p ro ach to rem oving the w hitening effect o f block resam pling is to resam ple blocks o f blocks. Suppose th a t the focus o f interest is a statistic T which estim ates 6 an d depends only on blocks o f m successive observations. A n exam ple is the lag k autocovariance (n — k) 1 Y ^ J l i y j ~ y)(yj+k ~ y), for which m = k + 1. T hen unless / » m the distribution o f T* — M s typically a po o r approxim ation to th a t o f T — 6, because a substantial p ro p ortion o f the pairs (YJ, Yj+k) in a resam pled series will lie across a jo in betw een blocks, and will therefore be independent. To im plem ent resam pling blocks o f blocks we define a new m -variate process { Yj } for which Y j = ( Y j , Y j +m- 1), rew rite T so th a t it involves averages o f the Yj, an d resam ple blocks o f the new “d a ta ” y \ , .. .,y'„_m+1, each o f the observations o f which is a block o f the original data. F or the lag 1 autocovariance, for exam ple, we set
and w rite t = (n — I )-1 YXVij ~ y'lMy'ij ~ ? 2-)- The key point is th a t t should n o t com pare observations adjacent in each row. W ith n = 12 and / = 4 a b o o tstrap replicate m ight be ys
y6
yi
ys
yi
yi
ys
y4
yi
^6
yi
>'8
y9
yi
w
yi
ys
ys
y?
y9
y io
yio
yn
Since a b o o tstra p version o f t based on this series will only contain products o f (centred) adjacent observations o f the original data, the w hitening due to resam pling blocks will be reduced, though n o t entirely removed. This ap p ro ach leads to a sh o rter series being resam pled, b u t this is unim p o rta n t relative to the gain from avoiding whitening. Stationary bootstrap A further b u t less im p o rtan t difficulty w ith these block schemes is th at the artificial series generated by them are n o t stationary, because the jo in t distri bution o f resam pled observations close to a jo in betw een blocks differs from th a t in the centre o f a block. This can be overcom e by taking blocks o f random length. The stationary bootstrap takes blocks whose lengths L are geom etrically distributed, w ith density Pr(L = j ) = ( l - p y - ' p ,
j = 1 ,2 ,—
This yields resam pled series th a t are statio n ary w ith m ean block length Z = p *. Properties o f this scheme are explored in Problem s 8.1 and 8.2. Exam ple 8.4 (Rio N egro d a ta ) To illustrate these resam pling schemes we consider the shorter series o f river stages, o f length 120, w ith its average subtracted. Figure 8.7 shows the original series, followed by three b o o tstrap
399
8.2 ■ Time Series Figure 8.7 Resamples from the shorter Rio Negro data. The top panel shows the original series, followed by three series generated by model-based sampling from the fitted AR(2) model, then three series generated using the block bootstrap with / = 24 and no end correction, and three series made using the post-blackened method, with the same blocks as the block series and the fitted AR(2) model.
0
20
40
60
80
100
120
0
20
40
60
80
100
120
0
20
40
60
80
100
120
0
20
40
60
80
100
120
series generated by m odel-based sam pling from the A R (2) model. The next three panels show series generated using the block b o o tstrap with length I = 24 and no w rapping. There are some sharp jum ps a t the ends o f contiguous blocks in the resam pled series. T he b o tto m panels show series generated using the sam e blocks applied to the residuals, and then post-blackened using the A R(2) m odel. The ju m p s from using the block b o o tstrap are largely rem oved by post-blackening. F o r a m ore system atic com parison o f the m ethods, we generated 200 b o o t strap replicates under different resam pling plans. F or each plan we calculated the sta n d a rd erro r SE o f the average y * o f the resam pled series, and the average o f the first three au to correlation coefficients. The m ore dependent
400
8 ■Complex Dependence
O riginal values S am pling SE R esam p lin g p lan M odel-based
Blockwise
P o st-blackened
h
0.85
0.002
0.62 0.007
0.010
D etails
SE
Pi
Pi
P\
A R (2)
AR(1)
0.34 0.49
0.83 0.82
0.60 0.67
0.38 0.54
A R (3)
0.44
0.83
0.58
0.39
0.20
0.41 0.67 0.75 0.79
-0.02 0.35 0.47 0.54
-0.01 0.14 0.27
0.85 0.85 0.85 0.85
0.63 0.63 0.64 0.64
0.45 0.45 0.47 0.48 0.03
0.38 0.56 0.40
—
I= 2 1= 5 / = 10
0.26 0.33 0.33
1= 2
0.20
I= 5 I = 10
0.26 0.33 0.33
1 = 20 S tatio n ary
P2
0.017
1 = 20 B locks o f blocks
Pi
1= 2
0.25 0.28 0.31 0.28
0.40
I= 5 / = 10 / = 20
0.74 0.79
0.13 0.37 0.47 0.54
A R (2), I = 2 A R (1), I = 2 A R (3), I = 2
0.39 0.58 0.43
0.83 0.85 0.83
0.59 0.69 0.58
0.66
0.45
0.35
0.20 0.28 0.36
the series, the larger we expect SE an d the autocorrelation coefficients to be. Table 8.2 gives the results. T he top tw o rows show the correlations in the d a ta and approxim ate stan d ard errors for the resam pling results below. The results for m odel-based sim ulation depend on the m odel used, although the overfitted A R (3) m odel gives results sim ilar to the AR(2). The A R(1) m odel adds correlation n o t present in the original data. T he block m ethod is applied w ith no end correction, b u t further sim ulations show th a t it m akes little difference. Block length has a dram atic effect, and in particular, block length / = 2 essentially rem oves co rrelation a t lags larger th an one. Even blocks o f length 20 give resam pled d a ta noticeably less dependent than the original series. T he w hitening is overcom e by resam pling blocks o f blocks. We took blocks o f length m = 4, so th a t the m -variate series h ad length 117. The m ean resam pled autocorrelations are essentially unchanged even w ith / = 2, while SE* does depend on block length.
Table 8.2 Comparison of time-domain resampling plans applied to the average and first three autocorrelation coefficients for the Rio Negro data, 1903-1912.
8.2 ■ Time Series
401
The statio n ary b o o tstrap is used with end correction. The results are similar to those for the block b o o tstrap , except th a t the varying block length preserves slightly m ore o f the original correlation structure; this is noticeable at I = 2. R esults for the post-blackened m ethod with A R (2) and A R (3) m odels are sim ilar to those for the corresponding m odel-based schemes. The results for the post-blackened A R (1) scheme are interm ediate betw een A R (1) and A R(2) m odel-based resam pling, reflecting the fact th a t the A R (1) m odel underfits the data, and hence structure rem ains in the residuals. L onger blocks have little effect for the A R (2) an d A R (3) models, b u t they bring results for the A R(1) m odel m ore into line w ith those for the others. ■ T he previous exam ple suggests th a t post-blackening generates resam pled series w ith co rrelation structure sim ilar to the original data. C orrelation, how ever, is a m easure o f linear dependence. Is nonlinear dependence preserved by resam pling blocks? Exam ple 8.5 (Sunspot num bers) To assess the success o f the block and p o st blackened schemes in preserving nonlinearity, we applied them to the sunspot data, using / = 10. We saw in Exam ple 8.3 th a t although the best autoregressive m odel for the transform ed d a ta is A R(9), the series is nonlinear. This nonlin earity m ust rem ain in the residuals, which are alm ost a linear transform ation o f the series. Figure 8.8 shows probability plots o f the nonlinearity statistic T from Exam ple 8.3, w ith m = 20, for the block and post-blackened bootstraps w ith I = 10. T he results for m odel-based resam pling o f residuals are not shown b u t lie on the diagonal line, so it is clear th a t b o th schemes preserve some o f the nonlinearity in the data, which m ust derive from lags up to 10. C uriously the post-blackened scheme seems to preserve more. Table 8.1 gives the predictive standard errors for the years 1980-1988 when the simple block resam pling scheme w ith I = 10 is applied to the d a ta for 1930— 1979. O nce d a ta for 1930-1988 have been generated, the procedure outlined in Exam ple 8.3 is used to select, fit, and predict from an autoregressive model. Owing to the jo in s betw een blocks, the stan d ard errors are m uch larger than for the o th er schemes, including the post-blackened one with I = 10, which gives results sim ilar to b u t som ew hat m ore variable th an the m odel-based bootstraps. U nadorned block resam pling seems inappropriate for assessing prediction error, as one w ould expect. ■ Choice o f block length Suppose th a t we w ant to use the block b o o tstrap to estim ate some feature k based on a series o f length n. A n exam ple would be the stan d a rd erro r o f the series average, as in the third colum n o f Table 8.2. D ifferent block lengths / result in different b o o tstrap estim ates k(n,l). W hich should we use? A key result is th a t u nder suitable assum ptions and for large n and I the
402
8 ■Complex Dependence
Figure 8.8 Distributions of nonlinearity statistic for block resampling schemes applied to sunspot data. The left panel shows R = 999 replicates of a test statistic for nonlinearity, based on detecting nonlinearity at up to 20 lags for the block bootstrap with / = 10. The right panel shows the corresponding plot for the post-blackened bootstrap using the AR(9) model.
Quantile of F distribution
Quantile of F distribution
m ean squared erro r o f k(n, I) is p ro p o rtio n al to
where Ci an d C2 depend only on k and the dependence structure o f the series. In (8.9) d = 2, c = 1 if k is a bias o r variance, d = 1, c = 2 if k is a one sided significance probability, and d = 2, c = 3 if k is a two-sided significance probability. T he justification for (8.9) when k is a bias o r a variance is discussed after the next example. T he im plication o f (8.9) is th at for large n, the m ean squared erro r o f o f k{n, I) is m inim ized by taking I oc n 1/(c+2), b u t we do not know the co n stan t o f proportionality. However, it can be estim ated as follows. We guess an initial value o f I, an d sim ulate to obtain k(n, I). We then take m < n and k < I an d calculate the values o f kj(m, k) from the n — m + 1 series y > j , . . . , y j +m- 1 for j = 1 , — m + 1. T he estim ated m ean squared erro r for k(m, k) from a series o f length m w ith block size k is then 1 n—m+1 MSE(m,fc) = ^-----------j{£/(wi, k) — k(n, I)}2 . j=
1
By repeating this procedure for different values o f k b u t the same m, we obtain the value k for which MSE(m,/c) is m inimized. We then choose Z = k x (n /m )1/(c+2)
(8.10)
as the optim um block length for a series o f length n, and calculate k(n,l). This procedure elim inates the co n stan t o f proportionality. We can check on the adequacy o f I by repeating the procedure w ith initial value I = I, iterating if necessary.
8.2 • Time Series
403
Figure 8.9 Ten-year running average of Manaus data (left), together with Abelson-Tukey coefficients (right) (Abelson and Tukey, 1963).
c o 'o it=
o o
>» £D C
O
1900
1940
1980
Time (years)
T he m inim um asym ptotic m ean squared error is n d 2/
j
+ {d + 2 /(c + 2)} logm
should be approxim ately independent o f m. This suggests th a t values o f A(m) for different m should be com pared as a check on the asym ptotics. Exam ple 8.6 (Rio N egro d a ta ) There is concern th a t river heights at M anaus m ay be increasing due to deforestation, so we test for trend in the river series, a ten-year running average o f which is shown in the left panel o f Figure 8.9. T here m ay be an u pw ard trend, b u t it is h ard to say w hether the effect is real. To proceed, we suppose th a t the d a ta consist o f a stationary tim e series to which has been added a m onotonic trend. O ur test statistic is T = Y?j=1 ai where the coefficients
are optim al for detecting a m onotonic trend in independent observations. The p lo t o f the a , in the right panel o f Figure 8.9 shows th a t T strongly contrasts the ends o f the series. We can think o f T as alm ost being a difference o f averages for the two ends o f the series, and this falls into the class o f statistics for which th e m ethod o f choosing the block length described above is appropriate. R esam pling blocks o f blocks w ould n o t be ap p ro p riate here. T he value o f T for the full series is 7.908. Is this significantly large? To sim ulate d a ta u nder the null hypothesis o f no trend, we use the stationary
404
8 • Complex Dependence
o
Figure 8.10 Estimated variances of T for Rio Negro data, for stationary (solid) and block (dots) bootstraps. The left panel is for 1903-1912 {R = 999), the right panel is for the whole series (R = 199).
o CO
ir> co
o CO
CM
,y\
\ s/
j—
Nv
j\/\
/\jy/ v '0-/v/
/ V'
ca
40
/\" _
0 o c
Vari
o < D co O c 10
J--' \ / \\]I
V x/’ '
»
/ /
/ o
CM
5
10 Block length
15
20
0
10
20
30
40
50
Block length
bo o tstrap w ith w rapping to generate new series Y \ We initially apply this to the shorter series o f length 120, adjusted to have m ean zero, for which T takes value 0.654. U nder the null hypothesis the m ean o f T = J 2 aj Y j is zero and the distrib u tio n o f T will be close to norm al. We estim ate its variance by taking the em pirical variance o f values T" generated with the stationary bootstrap. T he left panel o f Figure 8.10 shows these variances k(n, /) based on different m ean block lengths I, for b o th stationary and block bootstraps. The stationary b o o tstrap sm ooths the variances for different fixed block lengths, resulting in a fairly stable variance for / > 6 or so. Variances o f T * based on the block b o o tstra p are m ore variable and increase to a higher eventual value. The variances for the full series are larger an d m ore variable. In order to choose the block length /, we took 50 random ly selected subseries o f m consecutive observations from the series w ith n = 120, and for each value o f k = 2 ,. . . , 20 calculated values o f k(m, k) from R = 50 stationary b o otstrap replicates. T he left p a rt o f Table 8.3 shows the values k th a t m inimize the m ean squared erro r for different possible values o f k{n, I). N ote th a t the values o f k do n o t broadly increase w ith m, as the theory w ould predict. F or smaller values o f k(n, I) the values o f k vary considerably, an d even for k(n, I) = 30 the corresponding values o f I as given by (8.10) w ith c = 1 and d = 2 vary from 12 to 20. The left panel o f Figure 8.10 shows th a t for / in this range, the variance k(n, I) takes value roughly 25. F or k(n, I) = 25, Table 8.3 gives I in the range 8-20, so overall we take k(n, I) = 25 based on the stationary bootstrap. The right p art o f Table 8.3 gives the values o f k w hen the block boo tstrap w ith w rapping is used. T he series so generated are n o t exactly stationary, b u t are nearly so. O verall the values are m ore consistent th an for the stationary
405
8.2 ■ Time Series Table 8.3 Estimated values of k for Rio Negro data, 1903-1912, based on stationary bootstrap with mean length k applied to 50 subseries of length m (left figures) and block bootstrap with block length k applied to 50 subseries of length m (right figures).
k(m, /)
S tationary, m
20 15 17.5
20 22.5 25 27.5 30
10 11 11 11 11 11 11
60
70
20
2
2
3 3
2 2
4 4 4 4 4 4 4
30
40
50
6
3 3
18 18 18
11 11 11
Block, m
6 6 12
6 6
3 3 5 7
14 14
9 9
10 10
3 4
8 8 11
30
40
50
60
70
3
18 16 4 5 5 5 5
2
2
2
3
6 6 6
3 3 4 5
3 4 4 5
9 9
6 6
8 8
10 5 5 5 5 5
b o otstrap , w ith broadly increasing values o f k w ithin each row, provided k(n, I) > 20. F or these values o f k(n, I), the values o f k suggest th at I lies in the range 5-8, giving k(n, I) = 25 or slightly less. T hus b o th the stationary and the block b o o tstra p suggest th a t the variance o f T is roughly 25, and since t = 0.654, there is no evidence o f trend in the first ten years o f data. F or the stationary bootstrap, the values o f A ( m ) have smallest variance for k(n, I) = 22.5, when they are 13.29, 13.66, 14.18, 14.01, 13.99 and 13.59 for m = 2 0 ,...,7 0 . F o r the block b o o tstrap the variance is smallest when k(n,l) = 27.5, when the values are 13.86, 14.25, 14.63, 14.69, 14.73 and 14.44. However, the m inim um m ean squared erro r shows no obvious p attern for any value o f k(n, I), and it seems th a t the asym ptotics apply adequately well here. Overall Table 8.3 suggests th a t a range o f values o f m should be used, and th a t results for different m are m ore consistent for the block th a n for the stationary bootstrap. F or given values o f m and k, the variances k j ( m , k ) have approxim ate gam m a distributions, b u t calculation o f their m ean squared error on the variance-stabilizing log scale does little to im prove m atters. For the stationary b o o tstrap applied to the full series, we take I in the range (8,20) x (1080/120)1/3 = (17,42), which gives variances 46-68, w ith average variance roughly 55. T he corresponding range o f I for the block b o o tstrap is 10-17, w hich gives variances k(n,l) in the range 43-53 or so, w ith average value 47. In either case the lowest reasonable variance estim ate is about 45. Since the value o f t for the full series is 7.9, an approxim ate significance level for the hypothesis o f n o tren d based on a norm al approxim ation to T* is 1 — <E>(7.9/451/2) = 0.12. The evidence for trend based on the m onthly d a ta is thus fairly weak. ■ Some block theory In order to gain some theoretical insight into block resam pling and the fun d am ental approxim ation (8.9) which guides the choice o f I, we exam ine the estim ation o f bias and variance for a special class o f statistics.
406
8 ■Complex Dependence
C onsider a stationary tim e series {Yy} w ith m ean n and covariances yj = cov( Yo, Y j ) , an d suppose th a t the p aram eter o f interest is 9 = h(ji). The obvious estim ator o f 9 based on Y i,..., Y„ is T = h(Y), w hose bias and variance are P
=
E { M Y ) - /i( /i) } = ^ " (/i)v a r(Y ), (8.11)
v
=
var { h( Y )} = h'(n)2var( Y ),
by the delta m ethod o f Section 2.7.1. N ote th a t var( Y) = n~2 {ny0 + 2(n - l)? i + ----- 1- 2y„_i} = n-2 ^ , say, and th a t as n—>oo, tt—1
»_14 B) =
yo + 2
co
_
i / n^ i
7=1
^ 2 yj =
£■
j= —co
Therefore p ~ \h"( p)n~l£, an d v ~ for large n. Now suppose th at we estim ate P an d v by simple block resam pling, w ith b non-overlapping blocks o f length I, w ith n = bl, and use S; to denote the average Z_1 ^o-i)/+i o f the j i h block, for j = 1 T hus S = Y, and Y* = b~l Y j = i Sj , where the S j are sam pled independently from S i,. . . , St. T he b o o tstrap estim ates o f the bias and variance o f T are p
=
E*{fc(Y*)- / ! ( ? ) } = h ' ( Y ) E ’ ( Y ' - Y ) + i/i" ( Y ) E * { ( Y * - Y )2} ,
(8.12) v
=
v a r’ {/i(Y*)} = fr'(Y)2v ar’ (Y*).
W hat we w ant to know is how the accuracies o f P an d v vary w ith /. Since the blocks are non-overlapping, b
E ’(Y*) = S,
var*(Y*) = r 2 ^ ( 5 y - S ) 2. j=i
It follows by com paring (8.11) an d (8.12) th a t the m eans o f P and v will be asym ptotically correct provided th a t w hen n is large, E{ b~l ^ ( S , —S)2} ~ n-1 £. This will be so because ^Z(Sj ~ S) 2 =
~ I1)2 ~
~ I1)2 has m ean
bwar(Si) — b\ar(S) = b(l~2c^ — n~2c(0n>) ~ if I—►co and Z/n->0 as n—>oo. To calculate approxim ations for the m ean squared errors o f P an d v requires m ore careful calculations and involves the variance of — S ) 2. This is messy in general, b u t the essential points rem ain under the simplifying assum ptions th a t {Yj) is an m -dependent norm al process. In this case ym+i = y m+2 = • • • = 0, an d the third and higher cum ulants o f the
Y is the average of Yu . . . , Y n.
8.2 ■ Time Series
407
process are zero. Suppose also th a t m < I. T hen the variance o f approxim ately v a r { X ] ( S' - V ) 2} = b v a r { ( S l
~ n ) 2} + 2
~ S )2 is
(b - l)cov {(Si - n)2, (S2 - n)2} .
F or norm al data, var {(Si — n)2}
=
2{var(Si - n)}2 ,
cov{(S i - j u ) 2,(S 2 - / i ) 2}
=
2 {cov(Si - n, S2 - n )}2 ,
SO
var { J 2 ( SJ - S)2} = 2b(l~24 ))2 + 4 6 ( r V 1'))2, w here u n d er suitable conditions on the process, OO
c f = y i + 2 y 2 -\------ 1- l y i - >
^
i= i
jyj
~ ji,
say. A fter a delicate calculation we find th at E {$) — (} ~
x f 'r 't ,
var(/?) ~
{ ^ h " ( n ) } 2 x 2ln~3( 2, (8.13)
E(v) — v ~ - t i ( f i ) 2 xn~lr lT,
var(v) ~
hf(jif x 2/n“ 3f 2,
(8.14)
th u s establishing th a t the m ean squared errors o f fi and v are o f form (8.9). This developm ent can clearly be extended to m ultivariate tim e series, and thence to m ore com plicated param eters o f a single series. F or example, for the first-order co rrelation coefficient o f the univariate series {Xj}, we would apply the argum ent to the trivariate series {Yj} = { ( X j , X 2, X j X j - 1)} w ith m ean an d set G = M^i» A*n, ^ 12) = ~ H2)W hen overlapping blocks are resam pled, the argum ent is sim ilar b u t the details change. If the d a ta are n o t w rapped around a circle, there are n — I + 1 blocks w ith averages Sj = /-1 Y?i=i
an(^
E‘ (? * - ? ) = /(„- *+ !) | /(/ ~ 1)? ~
+ y"“;+l)} '
(8'15)
In this case the leading term o f the expansion for fi is the product o f h'( Y) and the rig h t-h an d side o f (8.15), so the b o o tstrap bias estim ate for Y as an estim ator o f 9 = n is non-zero, which is clearly m isleading since E (T ) = fi. W ith overlapping blocks, the properties o f the b o o tstra p bias estim ator depend on E*(Y *)—Y , and it tu rn s o u t th a t its variance is an order o f m agnitude larger th an for non-overlapping blocks. This difficulty can be rem oved by w rapping Yi....... Y„ aro u n d a circle an d using n blocks, in which case E*(Y*) = Y, or by re-centring the b o o tstrap bias estim ate to ^ = E ’ {/i(Y*)} — ft { E ”(Y ')} . In either case (8.13) and (8.14) apply. One asym ptotic benefit o f using overlapping
8 ■Complex Dependence
408
blocks when the re-centred estim ator is used is th at var(/?) and var(v) are reduced by a factor | , though in practice the reduction m ay not be visible for small n. The corresponding argum ent for tail probabilities involves E dgew orth ex pansions and is considerably m ore intricate th an th a t sketched above. A part from sm oothness conditions on h(-), the key requirem ent for the above argum ent to w ork is th a t x an d ( be finite, and th a t the autocovariances decrease sharply enough for the various term s neglected to be negligible. This is the case if ~ a; for sufficiently large j and some a with |a| < 1, as is the case for stationary finite A R M A processes. However, if for large j we find th at yj ~ j ~ s, where 5 < S < 1, £ an d x are n o t finite and the argum ent will fail. In this case g(
8.2.4 Phase scrambling Recall the basic stochastic properties o f the em pirical Fourier transform o f a series y o , . . . , y n- i o f length n = 2nf + 1 : for large n and under certain conditions on the process generating the data, the transform ed values % for k = 1, . . . , n F are approxim ately independent, and their real and im aginary parts are approxim ately independent norm al variables with m eans zero and variances ng(cok)/2, where cok = 2nk/ n. The approxim ate independence o f y i , . . . , y nF suggests that, provided the conditions on the underlying process are met, the frequency dom ain is a b etter place to look for exchangeable com ponents th an the tim e dom ain. Expression (8.4) shows th at the spectrum sum m arizes the covariance structure o f the process { Y j } , and correspondingly the periodogram values I(tOk) = \%\2/ n sum m arize the second-order structure o f the data, which as far as possible we should preserve w hen resampling. This suggests th a t we generate resam ples by keeping fixed the m oduli |y*|, b u t random izing their phases Uk = arg %, which anyw ay are asym ptotically uniform ly distributed on the interval [0,2n), independent o f the \yk\- This phase scrambling can be done in a variety o f ways, one o f which is the following. Algorithm 8.1 (Phase scram bling) 1 C om pute from the d a ta yo,
-1 the em pirical Fourier transform
n—1 h =
_ y)’ j=
where ( = exp(2
ni/n).
0
k = 0 ,...,n -l,
8.2 ■Time Series
409
2 Set X k = exp(iUk )ek, k = 0 variables uniform on [0, 2 k). 3 Set
— 1, wher e the Uk are independent
ek = 2 ~ ^ 2 { Xk + X cn_k) ,
fc = 0, . . . , n — 1,
where superscript c denotes com plex conjugate and we take X„ = Xo4 A pply the inverse Fourier transform to e*0, . . . , e'n_ ] to obtain n—1 Y j ^ y + n - 1 Y , Z ~ ik~ e 'k’
j = 0,...,n-l.
fc=0
5 C alculate the b o o tstrap statistic T ' from Y0' , . .., Y ’_ ,. •
Step 3 guarantees th a t Yk has com plex conjugate Y*_k, and therefore th a t the bo o tstrap series Y0*, . . . , Yn'_{ is real. A n alternative to step 2 is to resam ple the Uk from the observed phases The b o o tstrap series always has average y , which implies th at phase scram bling should be applied only to statistics th a t are invariant to location changes o f the original series; in fact it is useful only for linear contrasts o f the y j , as we shall see below. It is straightforw ard to see th at -1/2 n-1
n-1
Y j = y + -----Y P l ~ n 1=0
^ Y l cos {2 n k (l ~ k=0
+ U k}’
j = 0 , . . . , n - 1,
(8.16) from which it follows th a t the b o o tstrap d a ta are stationary, w ith covariances equal to the circular covariances o f the original series, and th a t all their odd jo in t cum ulants equal zero (Problem 8.4). This representation also m akes it clear th a t the resam pled series will be essentially linear with norm al margins. The difference betw een phase scram bling and m odel-based resam pling can be deduced from A lgorithm 8.1. U nder phase scram bling, \Yk' \ 2 = \ h \2 {1 + cos (u; + u ;_ k)} ,
(8.17)
which gives e *(| y; i2)
= Iw I2,
v a r * ( |y ;|2) = i |j) ,|4.
U nder m odel-based resam pling the approxim ate distribution o f n~] \ Y^ \ 2 is g(a>k)X*, where g(-) is the spectrum o f the fitted m odel and X ’ has a standard exponential d istrib u tio n ; this gives E * ( |y ;i2) = g t o ) ,
v a r* (|y ;i2) = g W
C learly these resam pling schemes will give different results unless the quantities o f interest depend only on the m eans o f the |y fe' | 2, i.e. are essentially quadratic
410
8 ■Complex Dependence
Figure 8.11 Three time series generated by phase scrambling the shorter Rio Negro data.
in the data. Since the quan tity o f interest m ust also be location-invariant, this restricts the dom ain o f phase scram bling to such tasks as estim ating the variances o f linear contrasts in the data. Example 8.7 (Rio Negro data) We assess em pirical properties o f phase scram bling using the first 120 m o n th s o f the R io N egro d ata, which we saw previously were well-fitted by an A R (2) m odel w ith norm al errors. N ote th a t our statistic o f interest, T = Y l ajYj> has the necessary structure for phase scram bling n o t autom atically to fail. Figure 8.11 shows three phase scram bled datasets, which look sim ilar to the A R(2) series in the second row o f Figure 8.7. T he top panels o f Figure 8.12 show the em pirical Fourier transform for the original d a ta an d for one resam ple. Phase scram bling seems to have shrunk the m oduli o f the series tow ards zero, giving a resam pled series w ith lower overall variability. The low er left panel shows sm oothed periodogram s for the original d a ta and for 9 phase scram bled resam ples, while the right panel shows corresponding results for sim ulation from the fitted A R (2) model. The results are quite different, an d show th a t d a ta generated by phase scram bling are less variable th an those generated from the fitted model. R esam pling w ith 999 series generated from the fitted A R(2) m odel and by phase scram bling, the distribution o f 7” is close to no rm al under b o th schemes b u t it is less variable u nder phase scram bling; the estim ated variances are 27.4 and 20.2. These are sim ilar to the estim ates o f a b o u t 27.5 and 22.5 obtained using the block and statio n ary bootstraps. Before applying phase scram bling to the full series, we m ust check th a t it shows no sign o f nonlinearity or o f long-range dependence, and th at it is plausibly close to a linear series w ith norm al errors. W ith m = 20 the nonlinearity statistic described in Exam ple 8.3 takes value 0.015, and no value for m < 30 is greater th a n 0.84: this gives no evidence th a t the series is nonlinear. M oreover the p eriodogram shows no signs o f a pole as to—>0+, so long-range dependence seems to be absent. A n A R (8) m odel fits the series well, b u t the residuals have heavier tails th an the norm al distribution, w ith kurtosis 1.2. T he variance o f T * u nder phase scram bling is ab o u t 51, which
8.2 • Time Series
Figure 8.12 Phase scrambling for the shorter Rio Negro data. The upper left panel shows an Argand diagram containing the empirical Fourier transform % of the data, with phase scrambled y'k in the upper right panel. The lower panels show smoothed periodograms for the original data (heavy solid), 9 phase scrambled datasets (left) and 9 datasets generated from an AR(2) model (right); the theoretical AR(2) spectrum is the lighter solid line.
411
o
CD
O Tj-
O C\J o o
C£> - 60
-40
- 20
0
20
3
4
40
60
-60
- 40
- 20
0
20
3
4
40
60
CG o> 0) e o
o> o
1
2
omega
1
2
omega
again is sim ilar to the estim ates from the block resam pling schemes. A lthough this estim ate m ay be untrustw orthy, on the face o f things it casts no d o ubt on the earlier conclusion th a t the evidence for trend is weak. ■ The discussion above suggests th a t n o t only should phase scram bling be confined to statistics th a t are linear contrasts, b u t also th a t it should be used only after careful scrutiny o f the d a ta to detect nonlinearity and longrange dependence. W ith n on-norm al d a ta there is the further difficulty th a t the Fourier transform and its inverse are averaging operations, which can produce resam pled d a ta quite unlike the original series; see Problem 8.4 and Practical 8.3. In p articular, w hen phase scram bling is used in a test o f the null
8 ■Complex Dependence
412
hypothesis o f linearity, it im poses on the distribution o f the scram bled d a ta the additional constraints o f stationarity an d a high degree o f symmetry.
8.2.5 Periodogram resampling Like time d om ain resam pling m ethods, phase scram bling generates an entire new dataset. T his is unnecessary for such problem s as setting a confidence in terval for the spectrum at a p articu lar frequency or for assessing the variability o f an estim ate th a t is based on periodogram values. T here are well-established lim iting results for the distributions o f p eriodogram values, which under cer tain conditions are asym ptotically independent exponential random variables, and this suggests th a t we som ehow resam ple p eriodogram values. The obvious ap proach is to note th a t if g f (wk) is a suitable consistent estim ate o f g(a)k) based on d a ta yo,...,y „ _ i, w here n = 2 np + 1, then for k = 1, . . . , « f the residuals e k — I(cok)/g^(o}k) are approxim ately standard exponential variables. This suggests th a t we generate b o o tstrap periodogram values by setting I ’(ojk) = g{(ok)e*k, where g(o)k) is also a consistent estim ate o f g(a>k), an d the e\ are sam pled random ly from the set ( e \ / e , . . . , e nF/e); this ensures th a t E*(e£) = 1. T he choice o f g^co) and g(co) is discussed below. Such a resam pling scheme will only w ork in special circum stances. To see why, we consider estim ation o f 6 = f a(co)g(a>)dco by a statistic th a t can be w ritten in the form
e is the average of
r = -? -£ > /* ,
tr where I k = Ho}k), ak = a(cok), an d (ok is the /cth F ourier frequency. F o r a linear process 00 Y j = T , b* H> i=—oo where {£,} is a stream o f independent and identically distributed random variables w ith standardized fourth cum u lan t K4, the m eans and covariances o f the Ik are approxim ately E (Ik) = g(a>k),
cov(Ik,Ii) =
g ( a > k ) g ( c o ,) ( S k,
+ n~ 1 K4).
F rom this it follows th a t u n d er suitable conditions, E (T )
=
J a(co)g(a>)d(o,
v ar(T )
=
ri- 1
2nJ a2(co)g2(co)dco+K4 | J a(o))g (« ) dcoj
(8.18)
<5hi is the Kronecker delta symbol, which equals one if k = I and zero otherwise.
413
8.2 ■ Time Series
T he b o o tstrap analogue o f T is T* = nn F] Y k ak^k’ an<^ under the resam pling scheme described above this has m ean and variance E*(T*) = Ja(co)g((o)da>,
v a r '( T ') = 2n n~l J a 2 (to)g2 (co)da).
F or var*(T*) to converge to v ar(T ) it is therefore necessary th a t k4 = 0 or th at f a(co)g(to) dco be asym ptotically negligible relative to the first variance term. A process w ith norm al innovations will have K4 = 0, but since this can n o t be ensured in general the structure o f T m ust be exam ined carefully before this resam pling scheme is applied; see Problem 8.6. One situation where it can be applied is kernel density estim ation o f g( ), as we now see. Example 8.8 (Spectral density estim ation) Suppose th a t our goal is inference for the spectral density g(tj) a t some t] in the interval (0, 7r), and let our estim ate o f g(tj) be
r
k=0
where X ( ) is a sym m etric P D F with m ean zero and unit variance and h is a positive sm oothing param eter. Then E(T)
=
'1' 1/ ^
p
) s ( w ) ‘iffl= ^ ) + 5'1V 'W ,
v a r(T )
=
^ { g ( r i ) } 2 J K 2 (u) du +
J K
g(co)d(o^ .
Since we m ust have h—>0 as n —*00 in order to remove the bias o f T , the second term in the variance is asym ptotically negligible relative to the first term , as is necessary for the resam pling scheme outlined above to work w ith a tim e series for which /c4 0. C om parison o f the variance and bias term s implies th at the asym ptotic form o f the relative m ean squared erro r for estim ation o f g(//) is m inim ized by tak in g h oc n~[^5. However, there are two difficulties in using resam pling to m ake inference ab o ut g(^) from T. T he first difficulty is analogous to th at seen in Exam ple 5.13, and appears on com paring T and its b o o tstrap analogue
k=1 We suppose th a t I k is generated using a kernel estim ate g(a>k) with sm oothing param eter h. T he standardized versions o f T and T * are Z = (n h c)1/2 T
g^ \
Z* = (n h c)1 / l T
8 ■Complex Dependence
414 where c = {2n f K 2 (u)du}
These have m eans
E (Z ) = (nhc ) l / 1
E * (Z ') = (n/ic)1/2E gO/)
gU/)
C onsiderations sim ilar to those in Exam ple 5.13 show th at E '( Z ’ ) ~ E (Z ) if h—>0 such th a t h / h ^ O as n—>o o . The second difficulty concerns the variances o f Z and Z*, which will both be approxim ately one if the rescaled residuals ek have the same asym ptotic distribution as the “erro rs” h/g{u>k). F or this to h appen with g f (co) a kernel estim ate, it m ust have sm oothing p aram eter hf oc n-1//4. T h a t is, asym ptotically gt (ftj) m ust be undersm oothed com pared to the estim ate th at m inimizes the asym ptotic relative m ean squared erro r o f T. Thus the application o f the b o o tstrap outlined above involves three kernel density estim ates: the original, g(co), w ith h o c n 1/5; a surrogate g(co) for g(a>) used when generating b o o tstrap spectra, w ith sm oothing param eter h asym ptotically larger th a n h ; and g t (oj), from which residuals are obtained, w ith sm oothing param eter ht o c n-1//4 asym ptotically sm aller th a n h. This raises sub stantial difficulties for practical application, which could be avoided by explicit correction to reduce the bias o f T o r by taking h asym ptotically narrow er th a n n ~ ^ 5, in which case the lim iting m eans o f Z and Z* equal zero. F or a num erical assessm ent o f this procedure, we consider estim ating the spectrum g(a>) = {1 — 2acos(co) + a2}-1 o f an A R (1) process w ith a. = 0.9 at rj = n i l . T he kernel K(-) is the stan d ard norm al PD F. Table 8.4 com pares the m eans and variances o f Z w ith the average m eans and variances o f Z* for 1000 time series o f various lengths, w ith norm al and x 2 innovations. The first set o f results has bandw idths h = an~1/5, hf = an-1/4, and h = an-1/6, with a chosen to m inim ize the asym ptotic relative m ean squared erro r o f g(>/). Even for tim e series o f length 1025, the m eans and variances o f Z and Z ’ can be quite different, w ith the variances m ore sensitive to the distribution o f innovations. F or the second block o f num bers we took a non-optim al b andw idth h = an~{/4, an d hf = h = h. A lthough in this case the true and bo o tstrap m om ents agree better for norm al innovations, the results for chisquared innovations are alm ost as bad as previously, and it would be unwise to rely on the results even for fairly long series. M ean and variance only sum m arize lim ited aspects o f the distributions, and for a m ore detailed com parison we com pare 1000 values o f Z and o f Z ’ for a p articu lar series o f length 257. The left panel o f Figure 8.13 shows th a t the Z* are far from norm ally distributed, while the right panel com pares the sim ulated Z ’ an d Z . A lthough Z ' captures the shape o f the distribution o f Z quite well, there is a clear difference in their m eans and variances, and confidence intervals for g(rj) based on Z ' can be expected to be poor. ■
8.3 ■Point Processes Table 8.4 Com parison o f actual and bootstrap means and variances for a standardized kernel spectral density estimate Z . For the means the upper figure is the average o f Z from 1000 AR(1) time series with a = 0.9 and length n, and the lower figure is the average o f E*(Z*) for those series; for the variances the upper and lower figures are estimates o f v ar(Z ) and E{var’ (Z*)}. The upper 8 lines o f results are for h oc n-1/ 5, h * oc n~ l/4> and h oc n~l/6 ; for the lower 8 lines h= { oc 1/4.
415
In n o v atio n s
N o rm al
M ean V ariance
C hi-squared
M ean V ariance
N o rm al
M ean V ariance
C hi-squared
M ean V ariance
65
129
257
513
1025
00
1.4 2.0 2.5 2.7 1.2 2.1 6.9 2.8
0.9 1.7 1.5 2.0 1.0 1.7 4.9 2.0
0.8 1.3 1.3 1.7 0.8 1.3 3.8 1.6
0.7 1.0 1.1 1.5 0.7 1.0
0.6 0.8 1.1 1.3 0.7 0.8 2.7 1.3
0.5
0.9 0.6 2.3
0.5 0.4 1.3 1.4 0.6 0.4 3.7 1.4
0.5 0.3 1.1 1.4 0.5 0.3 3.1 1.4
0.3 0.3 1.1 1.3 0.4
0.2 0.2 1.0 1.3 0.3 0.2 2.2 1.2
0.0
1.5 1.0 0.7 5.6 1.4
3.1 1.4
0.3 2.5 1.3
1.0 0.5 1.0
1.0 0.0 1.0
Figure 8.13 C om parison o f distributions o f Z and Z* for time series o f length 257. The left panel shows a norm al plot o f 1000 values o f Z . The right panel com pares the distributions o f Z and Z*.
Quantiles of standard normal
Z*
8.3 Point Processes 8.3.1 Basic ideas A p o in t process is a collection o f events in a continuum . Exam ples are tim es o f arrivals at an intensive care unit, positions o f trees in a forest, and epicentres
416
8 ■Complex Dependence
o f earthquakes. M athem atical properties o f such processes are determ ined by the jo in t distribution o f the num bers o f events in subsets o f the continuum . Statistical analysis is based on some n otion o f repeatability, usually provided by assum ptions o f stationarity. Let N { A ) denote the nu m b er o f events in a set A . A point process is stationary if Pr{/V(/li) = m , . . . , N ( A k ) = n k ) is unaffected by applying the same tran slatio n to all the sets A u . . . , A k , for any finite k. U nder second-order stationarity only the first an d jo in t second m om ents o f the N ( A t) rem ain unchanged by translation. F or a stationary process E{N(/1)} = X\A\, where X is the intensity o f the process and \A\ is the length, area, or volum e o f A . Second-order m om ent properties can be defined in various ways, w ith the m ost useful definition depending on the context. The sim plest stationary point process m odel is the hom ogeneous Poisson process, for which the ran d o m variables N(Ai), N i A i ) have independent Pois son distributions w henever A\ and A 2 are disjoint. This com pletely random process is a n atu ral stan d ard w ith which to com pare data, although it is rarely a plausible m odel. M ore realistic m odels o f dependence can lead to estim ation problem s th a t seem analytically insuperable, and M onte C arlo m ethods are often used, particularly for spatial processes. In particular, sim ulation from fitted param etric m odels is often used as a baseline against which to judge data. This often involves graphical tests o f the type outlined in Section 4.2.4. In practice the process is observed only in a finite region. This can give rise to edge effects, which are increasingly severe in higher dimensions. Exam ple 8.9 (Caveolae) T he u p p er left panel o f Figure 8.14 shows the p o sitions o f n = 138 caveolae in a 500 unit square region, originally a 2.65 /*m square o f muscle fibre. T he u pper right panel shows a realization o f a binom ial process, for which n points were placed a t ran d o m in the same region; this is an hom ogeneous Poisson process conditioned to have 138 events. The d ata seem to have fewer alm ost-coincident points th a n the sim ulation, b u t it is hard to be sure. Spatial dependence is often sum m arized by K -functions. Suppose th at the process is orderly and isotropic, i.e. m ultiple coincident events are precluded and jo in t probabilities are invariant und er ro tatio n as well as translation. Then a useful sum m ary o f spatial dependence is Ripley’s K -function, K ( t ) = A-1 E (#{events w ithin distance t o f an arb itrary e v e n t} ),
t > 0.
The m ean- an d variance-stabilized function Z ( t ) = { K ( t ) / n Y /2—t is som etim es used instead. F or an hom ogeneous Poisson process, K ( t ) = n t2. Em pirical versions o f K ( t ) m ust allow for edge effects, as m ade explicit in Exam ple 8.12. The solid line in the low er left panel o f Figure 8.14 is the em pirical version
417
8.3 ■Point Processes
Figure 8.14 Muscle caveolae analysis. Top left: positions of 138 cavoelae in a 500 unit square of muscle fibre (Appleyard et al., 1985). Top right: realization of an homogeneous binomial process with n = 138. Lower left: Z(t) (solid), together with pointwise 95% confidence bands (dashes) and overall 92% confidence bands (dots) based on R = 999 simulated binomial processes. Lower right: corresponding results for R = 999 realizations of a fitted Strauss process.
o o
LO
%•.* • • • •
o o o o
CO
o o
C\J
o o
0
100
200
300
400
500
m o
o
lO
in
---------
o
^
N
\
V v~
r-—
o
in T~
m
40
__r
_ -- --------
K5 O V
20
^
\
o T 7
0
I 'v\ / "
60
Distance
80
100
\
< ■
V
0
20
40
60
80
100
Distance
Z (t) o f Z(t). The dashed lines are pointw ise 95% confidence bands from R = 999 realizations o f the binom ial process, and the dotted lines are overall b ands w ith level ab o u t 92% , obtained by using the m ethod outlined after (4.17) w ith k = 2. Relative to a Poisson process there is a significant deficiency o f pairs o f points lying close together, which confirm s our previous impression. The lower right panel o f the figure shows the corresponding results for sim ulations from the Strauss process, a param etric m odel o f interaction th at can inhibit p attern s in which pairs lie close together. This m odels the local behaviour o f the d a ta b etter th an the stationary Poisson process. ■
8 ■Complex Dependence
418
o
c o
o
o
0
100
200
Time (ms)
W c 0) o -200
0
100
200
o
Time (ms)
-200 -100
0
100
200
Time (ms)
8.3.2 Inhomogeneous Poisson processes The sam pling plans used in the previous exam ple b o th assum e stationarity o f the process underlying the data, an d rely on sim ulation from fitted param etric models. Som etim es independent cases can be identified, in which case it m ay be possible to avoid the assum ption o f stationarity.
Example 8.10 (Neurophysiological point process) The d a ta in Figure 8.15 were recorded by D r S. J. Boniface o f the Clinical N europhysiology U nit at the Radcliffe Infirm ary, O xford, in a study o f how a hum an subject responded to a stimulus. Each row o f the left panel o f the figure shows the times at which the firing o f a m otoneurone was observed, in an interval extending 250 ms either side o f 100 applications o f the stim ulus, w hich is taken to be at time zero. A lthough little can be assum ed a b o u t dependence w ithin each interval, the stim ulus was given far enough a p a rt for firings in different intervals to be treated as independent. Firings occur a t ran d o m a b o u t 100 ms a p art prior to the stim ulus, b u t on ab o u t one-third o f occasions a firing is observed ab o u t 28 ms after it, an d this partially synchronizes the firings im m ediately following. T heoretical results im ply th a t und er m ild conditions the process obtained by superposing all N = 100 intervals will be a Poisson process with timevarying intensity, NX(y). H ere it seems plausible th a t the conditions are m et: for exam ple, 90 o f the 100 intervals con tain four o r fewer events, so the overall intensity is n o t d o m inated by any single interval. The superposed d a ta have n — 389 events whose tim es we denote by yj.
Figure 8.15 Neurophysiological point process. The rows of the left panel show 100 replicates of the interval surrounding the times at which a human subject was given a stimulus; each point represents the time at which the firing of a neuron was observed. The right panels shows a histogram and kernel intensity estimate (xlO -2 ms-1) from superposing the events on the left, which are shown by the rug in the lower right panel.
8.3 ■Point Processes
419
The right panels o f Figure 8.15 show a histogram o f the superposed d ata and a rescaled kernel estim ate o f the intensity X(y) in units o f 10-2 m s-1 , k y , h ) = 100 x (N h )~1 £ w ( ^ y 1 ) , 7=1 where w(-) is a sym m etric density with m ean zero and unit variance; we use the stan d ard norm al density w ith bandw idth h = 7.5 ms. O ver the observation period this estim ate integrates to 100n / N . The estim ated intensity is highly variable an d it is unclear which o f its features are spurious. We can try to construct a confidence region for A(y) at a set o f y values o f interest, but the sam e problem s arise as in Exam ples 5.13 and 8.8. O nce again the key difficulty is bias: l ( y ; h ) estim ates n o t k(y) b u t / w(u)A(y — hu) du. F or large n and small h this m eans th at E {l(y ;/j)} = 2.(y) + \ h 2 X'(y),
var{2(y;/i)} = c(iVft)_1A(>>),
where c = f w 2 (u)du. As in Exam ple 5.13, the delta m ethod (Section 2.7.1) im plies th a t l ( y ; h )l/2 has approxim ately constant variance \ c ( N h ) ~ l . We choose to w ork w ith the standardized quantities 2 (y,h)=
l l' 2 ( y ; h ) - k l/ 2 ( y ) K M )-V 2 c 1/2
y ef.
In principle an overall 1 — 2a confidence band for k(y) over W is determ ined by the quantiles ZLA(h) and z u A(h) th a t satisfy 1 - a = P r{zLiX(h) < Z { y ; h ) , y e 9 } = P t { Z ( y ; h ) < z U:X(h),y £ <&}.
(8.19)
T he lower and u p p er lim its o f the ban d would then be \ l l/ 2 (y;h) - \ ( N h ) ~ V 2 cl/ 2 z UA( h ) \ ,
L
J2
(8.20)
{ l i/2( y , h ) - \ ( N h ) - 1/2cl/2z U h ) } ■ In practice we m ust use resam pling analogues Z * ( y \ h ) o f Z ( y ; h ) to estim ate ZL,a(h) and zu,x(h), and for this to be successful we m ust choose h and the resam pling scheme to ensure th a t Z* and Z have approxim ately the same distributions. In this context there are a nu m b er o f possible resam pling schemes. The sim plest is to take n events a t ran d o m from the observed events. This relies on the independence assum ptions for Poisson processes. A second scheme generates n events from the observed events, where n* has a Poisson distribution with m ean n. A m ore robust scheme is to superpose 100 resam pled intervals, though this does n o t hold fixed the to tal n um ber o f events. These schemes would be
8 ■Complex Dependence
420
inappro p riate if the estim ator o f interest presupposed th at events could not coincide, as did the K -function o f Exam ple 8.9. For all o f these resam pling schemes the b o o tstrap estim ators r ( y ; h ) are unbiased for l(y',h). T he n atu ral resam pling analogue o f Z is { r ( r ; f c ) } '/ 2 - { r ( r ) ) l/2
b u t E*(Z*) = 0 and E (Z ) ^ 0. This situation is analogous to th at o f E xam ple 5.13, an d the conclusion is the sam e: to m ake the first two m om ents o f Z and Z* agree asym ptotically, one m ust choose h oc N ~ y w ith y > j . F urther detailed calculations for the jo in t distributions over % suggest also th at y < The essential idea is th a t h should be sm aller th a n is com m only used for point estim ation o f the intensity. A quite different ap p ro ach is to generate realizations o f an inhom ogeneous Poisson process from a sm ooth estim ate l ( y ; h ) o f the intensity. This can be achieved by using the sm oothed b o o tstrap , as outlined in Section 3.4, and detailed in Problem 8.7. U nder this scheme E* | X*{y; h) j =
J l ( y — hu',h)w(u) du = l ( y ',h)+ j h 2 l "(y;h),
and the resam pling analogue o f Z is
z(y M
------------------------------------ ■
whose m ean and variance closely m atch those o f Z . W hatever resam pling scheme is employed, sim ulated values o f Z* will be used to estim ate the quantiles z i A(h) and z y A{h) in (8.19). If R realizations are generated, then we take ZL,cc{h) and zu, Jh) to be respectively the (R + l)a th ordered values o f m in z*(y:/i),
m a xz*(y;h).
The u p p er panel o f Figure 8.16 shows overall 95% confidence bands for A(y;5), using three o f the sam pling schemes described above. In each case R = 999, an d zl,0.025(5) an d zl',0.025(5) are estim ated by the em pirical 0.025 and 0.975 quantiles o f the R replicates o f m in{z’(j;;5),>' = —250, —2 4 8 ,...,2 5 0 } and m a x { z '(y ;5),y = —2 5 0 ,—2 4 8 ,...,2 5 0 } . R esults from resam pling intervals and events are alm ost indistinguishable, while generating d a ta from a fitted intensity gives slightly sm oother results. In o rd er to avoid problem s at the boundaries, the set is taken to be (—230,230). The experim ental setup implies th a t the intensity should be ab o u t 1 x 10-2 firings per second, the only significant d ep artu re from which is in the range 0-130 ms, where there is strong evidence th a t the stim ulus affects the firing rate.
421
8.3 ■Point Processes Figure 8.16 Confidence bands for the intensity of the neurophysiological point process data. The upper panel shows the estimated intensity x(y;5) ( 10“ 2 ms-1 ) (heavy solid), with overall 95% equi-tailed confidence bands based on resampling intervals (solid), resampling events (dots), and generating events from a fitted intensity (dashes). The outer lines in the lower panel show the 2.5% and 97.5% quantiles of the standardized quantile processes z ’(y;h) for resampling intervals (solid) and generating from a fitted intensity (dashes), while the lines close to zero are the bootstrap bias estimates for k.
-200
-100
0
100
200
Time (ms)
Time (ms)
The lower panel o f the figure shows z0.025(5)’ z0.975(5), and the boo tstrap bias estim ate for /*(>>) for resam pling intervals and for generating d a ta from a fitted intensity function, with h = 7.5 ms. The quantile processes suggest th a t the variance-stabilizing transform ation has w orked well, b u t the double sm oothing effect o f the latter scheme shows in the bias. The behaviour o f the quantile process when y = 50 ms — where there are no firings — suggests th at a variable b andw idth sm oother m ight be better. ■ Essentially the same ideas can be applied when the d ata are a single real ization o f an inhom ogeneous Poisson process (Problem 8.8).
8.3.3 Tests o f association W hen a poin t process has events o f different types, interest often centres on association betw een the different types o f events or between events and associated covariates. T hen p erm u tation o r b o o tstrap tests m ay be appropriate, although the sim ulation scheme will depend on the context. Example 8.11 (Spatial epidemiology) Suppose th a t events o f a point pattern correspond to locations y o f cases o f a rare disease S> th a t is th ought to be related to em issions from an industrial site at the origin, y = 0. A m odel for the incidence o f Q) is th a t it occurs at rate /.(y) per person-year at location y,
422
8 ■Complex Dependence
where the suspicion is th a t X(y) decreases w ith distance from the origin. Since the disease is rare, the n u m b er o f cases a t y will be well approxim ated by a Poisson variable w ith m ean X{y)n(y), where fi(y) is the population density o f susceptible persons a t y. T he null hypothesis is th a t My) = Xo, i.e. th a t y has no effect on the intensity o f cases, o th er th an through /i(y). A crucial difficulty is th at n{y) is unknow n an d will be h ard to estim ate from the d a ta available. One ap p ro ach to testing for constancy o f X ( y ) is to com pare the p o int pattern for 2> to th a t o f an o th er disease 2)'. This disease is chosen to have the same populatio n o f susceptible individuals as 3), b u t its incidence is assum ed to be unrelated to em issions from the site an d to incidence o f S>, and so it arises with co n stan t b u t unknow n rate X ’ p er person-year. If Sfi' is also rare, it will be reasonable to suppose th a t the num b er o f cases o f at y has a Poisson distributio n w ith m ean X 'f i ( y ) . H ence the conditional probability o f a case o f at y given th a t there is a case o f o r 3 ' a t y is n { y ) = X { y ) / { X ' + A(y)}. If the disease locations are indicated by yj, an d dj is zero o r one according as the case a t yj has 3)' or Q>, the likelihood is n ^ { i - « ( y ^ . j If a suitable form for X(y) is assum ed we can o btain the likelihood ratio or perhaps an o th er statistic T to test the hypothesis th at 7i(y) is constant. This is a test o f pro p o rtio n al hazards for Q) and & , b u t unlike in Exam ple 4.4 the alternative is specified, at least weakly. W hen A(y) = Xo an ap proxim ation to the null distribution o f T can be obtained by perm uting the labels on cases at different locations. T h at is, we and 3l' to the yj, recom pute T perform R ran d o m reallocations o f the labels for each such reallocation, an d see w hether the observed value o f t is extrem e relative to the sim ulated values t \ , . . . , t ’R. m Exam ple 8.12 (Bram bles) The upp er left panel o f Figure 8.17 shows the locations o f 103 newly em ergent an d 97 one-year-old bram ble canes in a 4.5 m square plot. It seems plausible th a t these two types o f event are related, but how should this be tested? Events o f b o th types are clustered, so a Poisson null hypothesis is not appropriate, n o r is it reasonable to perm ute the labels attached to events, as in the previous example. Let us denote the locations o f the two types o f event by y i , . . . , y „ and y [, . . ., y 'n-. Suppose th a t a statistic T = t ( y i , . . . , y „ , y [ , . . . , y ' n,) is available th at tests for association betw een the event types. If the extent o f the observation region were infinite, we m ight construct a null distribution for T by applying random translations to events o f one type. T hus we would generate values T ‘ = t(yi + U*, . . ., y„ + U*,y[,...,y'rf), where I/* is a random ly chosen location in the plane. This sam pling scheme has the desirable property o f fixing the
Figure 8.17 Brambles data. Top left: positions of newly emergent (+) and one-year (•) bramble canes in a 4.5 m square plot. Top right: random toroidal shift of the newly emergent canes, with the original edges shown by dotted lines. Bottom left: original dependence function Z12(t) (solid) and 20 replicates (dots) under the null hypothesis of no association between newly emergent and one-year canes. Bottom right: original dependence function and pointwise (dashes) and overall (dots) 95% null confidence sets. The data used here are the upper left quarter of those displayed on p. 113 of Diggle (1983).
This sampling scheme has the desirable property of fixing the relative locations of each type of event, but cannot be applied directly to the data in Figure 8.17, because the resampled patterns will not overlap by the same amount as the original.
We overcome this by random toroidal shifts, where we imagine that the pattern is wrapped on a torus, the random translation is applied, and the translated pattern is then unwrapped. Thus for points in the unit square we would generate U* = (U*_1, U*_2) at random in the unit square, and then map the event at y_j = (y_{1j}, y_{2j}) to

$$y_j^* = \bigl(y_{1j} + U_1^* - [y_{1j} + U_1^*],\; y_{2j} + U_2^* - [y_{2j} + U_2^*]\bigr),$$

where [·] denotes integer part. The upper right panel of Figure 8.17 shows how such a shift uncouples the two types of events.
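As a concrete illustration, here is a minimal S-Plus/R sketch of a random toroidal shift for a pattern rescaled to the unit square; the function name toroidal.shift and the matrix representation of the pattern are our own choices, not the book's.

    # Random toroidal shift of a point pattern in the unit square.
    # pts is an n x 2 matrix of (y1, y2) coordinates in [0,1]^2.
    toroidal.shift <- function(pts) {
      U <- runif(2)                     # random translation U* = (U1, U2)
      shifted <- sweep(pts, 2, U, "+")  # translate every event by U*
      shifted - floor(shifted)          # wrap back into [0,1]^2: x - [x]
    }
    # Uncouple two patterns by shifting only the first one:
    # pts1.star <- toroidal.shift(pts1); t.star <- t.stat(pts1.star, pts2)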
We can construct a test through an extension of the K-function to events of two types, that is, the function

$$\lambda_2^{-1}\,E(\#\{\text{type 2 events within distance } t \text{ of an arbitrary type 1 event}\}),$$

where λ2 is the overall intensity of type 2 events. Suppose that there are n1, n2 events of types 1 and 2 in an observation region A of area |A|, that u_{ij} is the distance from the ith type 1 event to the jth type 2 event, that w_i(u) is the proportion of the circumference of the circle that is centred at the ith type 1 event and has radius u that lies within A, and let I(·) denote the indicator of an event. Then the sample version of this bivariate K-function is

$$\hat K_{12}(t) = (n_1 n_2)^{-1}\,|A| \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} w_i^{-1}(u_{ij})\, I(u_{ij} \le t).$$

Although it is possible to base an overall statistic on K̂12(t), for example taking T = ∫ Z12(t)² dt, where Z12(t) = {K̂12(t)/π}^{1/2} − t, a graphical test is usually more informative. The lower left panel of Figure 8.17 shows results from 20 random toroidal shifts of the data. The original value of Z12(t) seems to show much stronger local association than do the simulations. This is confirmed by the lower right panel, which shows 95% pointwise and overall confidence bands for Z12(t) based on R = 999 shifts. There is clear evidence that the point patterns are not independent: as the original data suggest, new canes emerge close to those from the previous year. ■
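The graphical test is simple to prototype. The sketch below, in R, uses the toroidal.shift function sketched above and omits the edge correction w_i(u) for brevity (that is, it takes w_i(u) = 1), so it is only indicative; pts1 and pts2 are coordinate matrices in the unit square, and all names are our own.

    # Bivariate K-function and null replicates under toroidal shifts.
    K12.hat <- function(pts1, pts2, t.vals) {
      d <- sqrt(outer(pts1[, 1], pts2[, 1], "-")^2 +
                outer(pts1[, 2], pts2[, 2], "-")^2)  # n1 x n2 distances u_ij
      sapply(t.vals, function(r) mean(d <= r))       # |A| = 1 here
    }
    Z12 <- function(pts1, pts2, t.vals)
      sqrt(K12.hat(pts1, pts2, t.vals)/pi) - t.vals

    t.grid <- seq(0.01, 0.25, by = 0.01)
    z.obs <- Z12(pts1, pts2, t.grid)
    z.sim <- replicate(999, Z12(toroidal.shift(pts1), pts2, t.grid))
    band <- apply(z.sim, 1, quantile, probs = c(0.025, 0.975))  # pointwise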
8.3.4 Tiles

Little is known about resampling spatial processes when there is no parametric model. One nonparametric approach that has been investigated starts from a partition of the observation region ℛ into disjoint tiles A_1, ..., A_n of equal size and shape. If we abuse notation by identifying each tile with the pattern it contains, we can write the original value of the statistic as T = t(A_1, ..., A_n). The idea is to create a resampled pattern by taking a random sample of tiles A*_1, ..., A*_n from A_1, ..., A_n, with corresponding bootstrap statistic T* = t(A*_1, ..., A*_n). The hope is that if dependence is relatively short-range, taking large tiles will preserve enough dependence to make the properties of T* close to those of T. If this is to work, the size of the tile must be chosen to trade off preserving dependence, which requires a few large tiles, and getting a good estimate of the distribution of T, which requires many tiles. This idea is analogous to block resampling in time series, and is capable of similar variations.
Figure 8.18 Tile resampling for the caveolae data. The left panel shows the original data, with nine tiles sampled at random using toroidal wrapping. The right panel shows the resampled point pattern.
For example, rather than choosing the A*_j independently from the fixed tiles A_1, ..., A_n, we may resample moving tiles by setting A*_j = U_j + A_j, where U_j is a random vector chosen so that A*_j lies wholly within ℛ; we can avoid bias due to undersampling near the boundaries of ℛ by toroidal wrapping. As in all problems involving spatial data, edge effects are likely to play a critical role.

Example 8.13 (Caveolae) Figure 8.18 illustrates tile resampling for the data of Example 8.9. The left panel shows the original caveolae data, with the dotted lines showing nine square tiles taken using the moving scheme with toroidal wrapping. The right panel shows the resampled pattern obtained when the tiles are laid side-by-side. For example, the centre top and middle right tiles were respectively taken from the top left and bottom right of the original data. Along the tile edges, events seem to lie closer together than in the left panel; this is analogous to the whitening that occurs in blockwise resampling of time series. No analogue of the post-blackened bootstrap springs to mind, however.

For a numerical evaluation of tile resampling, we experimented with estimating the variance θ of the number of events in an observation region ℛ of side 200 units, using data generated from three random processes. In each case we generated 8800 events in a square of side 4000, then estimated θ from 2000 squares of side 200 taken at random. For each of 100 random squares of side 200 we calculated the empirical mean squared error for estimation of θ using bootstraps of size R, for both fixed and moving tiles. Data were generated from a spatial Poisson process (θ = 23.4), from the Strauss process that gave the results in the bottom right panel of Figure 8.14 (θ = 17.5), and from a sequential spatial inhibition process, which places points sequentially at random but not within 15 units of an existing event (θ = 15.6).
Table 8.5 Mean squared errors for estimation of the variance of the number of events in a square of side 200, based on bootstrapping fixed and moving tiles. Data were generated from a Poisson process, a Strauss process with parameters chosen to match the data in Figure 8.14, and from a sequential spatial inhibition process with radius 15. In each case the mean number of events is 22. For n ≤ 64 we took R = 200, for n = 100, 144 we took R = 400, and for n ≥ 196 we took R = m.

               n      4     16     36     64    100    144    196    256
  Poisson  theory  224.2   77.9   47.3   36.3   31.2   28.4   26.7   25.6
           fixed   255.2   66.1   40.2   31.7   27.6   27.6   25.5   27.8
           moving   92.2   39.7   35.8   31.6   33.0   30.8   27.4   27.0
  Strauss  fixed   129.1   49.1   27.9   19.2   16.4   19.3   20.8   21.9
           moving   53.2   26.4   19.0   17.4   15.9   18.9   18.7   17.9
  SSI      fixed   123.8   37.7   14.8   13.5   17.9   25.1   34.6   42.4
           moving   36.5   12.9   11.2   15.6   18.3   21.2   28.6   35.4
Table 8.5 shows the results. For the Poisson process the fixed tile results broadly agree with theoretical calculations (Problem 8.9), and the moving tile results accord with general theory, which predicts that mean squared errors for moving tiles should be lower than for fixed tiles. Here the mean squared error decreases to 22 as n → ∞. The fitted Strauss process inhibits pairs of points closer together than 12 units. The mean squared error is minimized when n = 100, corresponding to tiles of side 20; the average estimated variances from the 100 replicates are then 19.0 and 18.2. The mean squared errors for moving tiles are rather lower, but their pattern is similar. The sequential spatial inhibition results are similar to those for the Strauss process, but with a sharper rise in mean squared error for larger n. In this setting theory predicts that for a process with sufficiently short-range dependence, the optimal n ∝ ... If the caveolae data were generated by a Strauss process, results from Table 8.5 would suggest that we take n = 100 × 500/200 = 162, so there would be 16 tiles along each side of ℛ. With R = 200 and fixed and moving tiles this gives variance estimates of 101.6 and 100.4, both considerably smaller than the variance for Poisson data, which would be 138. ■
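To make the fixed-tile scheme concrete, here is a minimal R sketch of the variance estimate used in this experiment; the function names and the two-column matrix representation of the pattern are our own. It counts events in an n = m² tile partition of a square of side L and bootstraps the tile counts.

    # Fixed-tile bootstrap estimate of theta = var(Y), where Y is the
    # number of events in a square region of side L split into m^2 tiles.
    tile.counts <- function(pts, L, m) {
      ix <- pmin(floor(pts[, 1]/(L/m)), m - 1)   # tile index in each axis
      iy <- pmin(floor(pts[, 2]/(L/m)), m - 1)
      tabulate(1 + ix + m*iy, nbins = m^2)       # events per tile
    }
    theta.tile <- function(pts, L, m, R = 200) {
      y <- tile.counts(pts, L, m)
      y.star <- replicate(R, sum(sample(y, length(y), replace = TRUE)))
      var(y.star)   # Monte Carlo version; analytically this bootstrap
    }               # estimate is sum((y - mean(y))^2), cf. Problem 8.9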
8.4 Bibliographic Notes

There are many books on time series. Brockwell and Davis (1991) is a recent book aimed at a fairly mathematical readership, while Brockwell and Davis (1996) and Diggle (1990) are more suitable for the less theoretically inclined. Tong (1990) discusses nonlinear time series, while Beran (1994) covers long-memory processes. Bloomfield (1976), Brillinger (1981), Priestley (1981), and Percival and Walden (1993) are introductions to spectral analysis of time series.
Model-based resampling for time series was discussed by Freedman (1984), Freedman and Peters (1984a,b), Swanepoel and van Wyk (1986) and Efron and Tibshirani (1986), among others. Li and Maddala (1996) survey much of the related time domain literature, which has a somewhat theoretical emphasis; their account stresses econometric applications. For a more applied account of parametric bootstrapping in time series, see Tsay (1992). Bootstrap prediction in time series is discussed by Kabaila (1993b), while the bootstrapping of state-space models is described by Stoffer and Wall (1991). The use of model-based resampling for order selection in autoregressive processes is discussed by Chen et al. (1993).

Block resampling for time series was introduced by Carlstein (1986). In an important paper, Künsch (1989) discussed overlapping blocks in time series, although in spatial data the proposal of block resampling in Hall (1985) predates both. Liu and Singh (1992a) also discuss the properties of block resampling schemes. Politis and Romano (1994a) introduced the stationary bootstrap, and in a series of papers (Politis and Romano, 1993, 1994b) have discussed theoretical aspects of more general block resampling schemes. See also Bühlmann and Künsch (1995) and Lahiri (1995). The method for block length choice outlined in Section 8.2.3 is due to Hall, Horowitz and Jing (1995); see also Hall and Horowitz (1993). Bootstrap tests for unit roots in autoregressive models are discussed by Ferretti and Romo (1996). Hall and Jing (1996) describe a block resampling approach in which the construction of new series is replaced by Richardson extrapolation.

Bose (1988) showed that model-based resampling for autoregressive processes has good asymptotic higher-order properties for a wide class of statistics. Lahiri (1991) and Götze and Künsch (1996) show that the same is true for block resampling, but Davison and Hall (1993) point out that unfortunately, and unlike when the data are independent, this depends crucially on the variance estimate used.

Forms of phase scrambling have been suggested independently by several authors (Nordgaard, 1990; Theiler et al., 1992), and Braun and Kulperger (1995, 1997) have studied its properties. Hartigan (1990) describes a method for variance estimation in Gaussian series that involves similar ideas but needs no randomization; see Problem 8.5. Frequency domain resampling has been discussed by Franke and Härdle (1992), who make a strong analogy with bootstrap methods for nonparametric regression. It has been further studied by Janas (1993) and Dahlhaus and Janas (1996), on which our account is based.

Our discussion of the Rio Negro data is based on Brillinger (1988, 1989), which should be consulted for statistical details, while Sternberg (1987, 1995) gives accounts of the data and background to the problem.

Models based on point processes have a long history and varied provenance.
Daley and Vere-Jones (1988) and Karr (1991) provide careful accounts of their mathematical basis, while Cox and Isham (1980) give a more concise treatment. Cox and Lewis (1966) is a standard account of statistical methods for series of events, i.e. point processes in the line. Spatial point processes and their statistical analysis are described by Diggle (1983), Ripley (1981, 1988), and Cressie (1991). Spatial epidemiology has recently received attention from various points of view (Muirhead and Darby, 1989; Bithell and Stone, 1989; Diggle, 1993; Lawson, 1993). Example 8.11 is based on Diggle and Rowlingson (1994).

Owing to the impossibility of exact inference, a number of statistical procedures based on randomization or simulation originated in spatial data analysis. Examples include graphical tests, which were used extensively by Ripley (1977), and various approaches to parametric inference based on Markov chain Monte Carlo methods (Ripley, 1988, Chapters 4, 5). However, nonparametric bootstrap methods for spatial data have received little attention. One exception is Hall (1985), a pioneering work on the theory that underlies block resampling in coverage processes, a particular type of spatial data. Further discussion of resampling these processes is given by Hall (1988b) and García-Soidán and Hall (1997). Possolo (1986) discusses subsampling methods for estimating the parameters of a random field. Other applications include Hall and Keenan (1989), who use the bootstrap to set confidence "gloves" for the outlines of hands, and Journel (1994), who uses parametric bootstrapping to account for estimation uncertainty in an application of kriging. Young (1986) describes bootstrap approaches to testing in some geometrical problems.

Cowling, Hall and Phillips (1996) describe the resampling methods for inhomogeneous Poisson processes that form the basis of Example 8.10, as well as outlining the related theory. Ventura, Davison and Boniface (1997) describe a different analysis of the neurophysiological data used in that example. Diggle, Lange and Benes (1991) describe an application of the bootstrap to a point process problem in neuroanatomy.
8.5 Problems

1  Suppose that y_1, ..., y_n is an observed time series, and let z_{l,i} denote the block of length l starting at y_i, where we set y_i = y_{1+((i−1) mod n)} and y_0 = y_n. Also let I_1, I_2, ... be a stream of random numbers uniform on the integers 1, ..., n, and let L_1, L_2, ... be a stream of random numbers having the geometric distribution Pr(L = l) = p(1 − p)^{l−1}, l = 1, 2, .... The algorithm to generate a single stationary bootstrap replicate is

Algorithm 8.2 (Stationary bootstrap)
• Set Y* = z_{L_1, I_1}, and set i = 1.
• While length(Y*) < n, {increment i; replace Y* with (Y*, z_{L_i, I_i})}.
• Set Y* equal to the first n values generated.

(a) Show that the algorithm above is equivalent to Algorithm 8.3:
• Set Y*_1 = y_{I_1}.
• For i = 2, ..., n, let Y*_i = y_{I_i} with probability p, and let Y*_i = y_{j+1} with probability 1 − p, where Y*_{i−1} = y_j.
(b) Define the empirical circular autocovariance

$$c_k = n^{-1}\sum_{j=1}^{n} (y_j - \bar y)\bigl(y_{1+((j+k-1) \bmod n)} - \bar y\bigr), \qquad k = 0, \dots, n.$$

Show that conditional on y_1, ..., y_n,

$$E^*(Y_i^*) = \bar y, \qquad \mathrm{cov}^*(Y_i^*, Y_{i+j}^*) = (1 - p)^j c_j,$$

and deduce that Y* is second-order stationary.

(c) Show that if y_1, ..., y_n are all distinct, Y* is a first-order Markov chain. Under what circumstances is it a kth-order Markov chain?

(Section 8.2.3; Politis and Romano, 1994a)
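Algorithm 8.3 is a few lines of R; the sketch below uses our own function name, and generates one replicate with mean block length 1/p.

    # Minimal sketch of Algorithm 8.3 (stationary bootstrap): start a new
    # block with probability p, otherwise continue cyclically.
    stat.boot <- function(y, p) {
      n <- length(y)
      idx <- numeric(n)
      idx[1] <- sample(n, 1)
      for (i in 2:n)
        idx[i] <- if (runif(1) < p) sample(n, 1)  # new block: Y*_i = y_I
                  else 1 + idx[i - 1] %% n        # continue: y_{j+1}, cyclic
      y[idx]
    }
    # Mean block length 1/p = 20: y.star <- stat.boot(y, p = 0.05)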
2  Let Y_1, ..., Y_n be a stationary time series with covariances γ_j = cov(Y_1, Y_{j+1}). Show that

$$n\,\mathrm{var}(\bar Y) = \gamma_0 + 2\sum_{j=1}^{n-1}\Bigl(1 - \frac{j}{n}\Bigr)\gamma_j,$$

and that this approaches ζ = γ_0 + 2Σ_{j=1}^∞ γ_j if Σ j|γ_j| is finite. Show that under the stationary bootstrap, conditional on the data,

$$n\,\mathrm{var}^*(\bar Y^*) = c_0 + 2\sum_{j=1}^{n-1}\Bigl(1 - \frac{j}{n}\Bigr)(1 - p)^j c_j,$$

where c_0, c_1, ... are the empirical circular autocovariances defined in Problem 8.1. (Section 8.2.3; Politis and Romano, 1994a)
3  (a) Using the setup described on pages 405-408, show that Σ_j (S_j − S̄)² has mean v_{ii} − b^{-1} v_{ij} and variance

$$v_{iijj} + 2 v_{ij} v_{ij} - 2b^{-1}\bigl(v_{iijk} + 2 v_{ij} v_{ik}\bigr) + b^{-2}\bigl(v_{ijkl} + 2 v_{ij} v_{kl}\bigr),$$

where v_{ij} = cov(S_i, S_j), v_{ijk} = cum(S_i, S_j, S_k) and so forth are the joint cumulants of the S_j, and summation is understood over each index.

(b) For an m-dependent normal process, show that provided l > m,

$$v_{i,j} = \begin{cases} l^{-1} c_0^{(l)}, & i = j,\\ l^{-2} c_1^{(l)}, & |i - j| = 1,\\ 0, & \text{otherwise},\end{cases}$$

and show that l^{-1} c_0^{(l)} → ζ and that c_1^{(l)} converges as l → ∞. Hence establish (8.13) and (8.14). (Section 8.2.3; Appendix A; Hall, Horowitz and Jing, 1995)
4  Establish (8.16) and (8.17). Show that under phase scrambling,

$$E^*(Y_j^*) = \bar y, \qquad \mathrm{cov}^*(Y_j^*, Y_{j+m}^*) = n^{-1}\sum_{i=0}^{n-1}(y_i - \bar y)(y_{i+m} - \bar y),$$

where j + m is interpreted mod n, and that all odd joint moments of the Y*_j are zero. This last result implies that the resampled series have a highly symmetric joint distribution. When the original data have an asymmetric marginal distribution, the following procedure has been proposed:

• let x_j = Φ^{-1}{r_j/(n + 1)}, where r_j is the rank of y_j among the original series y_0, ..., y_{n−1};
• apply Algorithm 8.1 to x_0, ..., x_{n−1}, giving X*_0, ..., X*_{n−1};
• then set Y*_j = y_{(r*_j)}, where r*_j is the rank of X*_j among X*_0, ..., X*_{n−1}.

Discuss critically this idea (see also Practical 8.3). (Section 8.2.4; Theiler et al., 1992; Braun and Kulperger, 1995, 1997)
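The rank-based procedure above is easy to prototype. Below is a minimal R sketch with names of our own choosing; phase.scramble is one common construction of a phase-scrambling step (randomizing Fourier phases subject to conjugate symmetry), offered as an assumption rather than the book's exact Algorithm 8.1.

    # One possible phase-scrambling step: randomize phases of the FFT
    # while keeping the output real.
    phase.scramble <- function(x) {
      n <- length(x)
      z <- fft(x)
      ph <- runif(n, 0, 2*pi)
      ph[1] <- 0                             # keep the mean fixed
      k <- 2:ceiling((n + 1)/2)
      ph[n + 2 - k] <- -ph[k]                # conjugate symmetry
      if (n %% 2 == 0) ph[n/2 + 1] <- 0      # Nyquist frequency
      Re(fft(z * exp(1i*ph), inverse = TRUE)/n)
    }
    # Rank-mapped ("amplitude-adjusted") version from the problem above.
    aaft <- function(y) {
      n <- length(y)
      x <- qnorm(rank(y)/(n + 1))   # normal scores x_j
      xs <- phase.scramble(x)       # scramble the phases of x
      sort(y)[rank(xs)]             # Y*_j = y_(rank of X*_j)
    }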
5  (a) Let I_1, ..., I_m be independent exponential random variables with means μ_j, and consider the statistic T = Σ_{j=1}^m a_j I_j, where the a_j are unknown. Show that V = ½ Σ a_j² I_j² is an unbiased estimate of var(T) = Σ a_j² μ_j².

Now let C = (c_0, ..., c_m) be an (m+1) × (m+1) orthogonal matrix with columns c_j, where c_0 is a vector of ones; the ith element of c_j is c_{ji}. That is, for some constant b,

$$c_j^T c_i = 0, \quad i \ne j, \qquad c_j^T c_j = b, \quad j = 1, \dots, m.$$

Show that for a suitable choice of b, V is equal to

$$\frac{1}{2(m+1)} \sum_{i=1}^{m+1} (T^{i*} - T)^2,$$

where for i = 1, ..., m+1, T^{i*} = Σ_j a_j(1 + c_{ji}) I_j.

(b) Now suppose that Y_0, ..., Y_{n−1} is a time series of length n = 2m + 1, with empirical Fourier transform Ỹ_0, ..., Ỹ_{n−1} and periodogram ordinates I_k = |Ỹ_k|²/n, for k = 0, ..., m. For each i = 1, ..., m+1, let the perturbed periodogram ordinates be

$$\tilde Y_0^* = \tilde Y_0, \qquad \tilde Y_k^* = (1 + c_{ki})^{1/2}\,\tilde Y_k, \qquad \tilde Y_{n-k}^* = (1 + c_{ki})^{1/2}\,\tilde Y_{n-k}, \qquad k = 1, \dots, m,$$

from which the ith replacement time series is obtained by the inverse Fourier transform. Let T be the value of a statistic calculated from the original series. Explain how the corresponding resample values, T^{1*}, ..., T^{(m+1)*}, may be used to obtain an approximately unbiased estimate of the variance of T, and say for what types of statistics you think this is likely to work. (Section 8.2.4; Hartigan, 1990)
6  In the context of periodogram resampling, consider a ratio statistic

$$T = \frac{\sum_{k} a(\omega_k) I(\omega_k)}{\sum_{k} I(\omega_k)} = \frac{\int a(\omega) g(\omega)\,d\omega\,\bigl(1 + n^{-1/2} X_a\bigr)}{\int g(\omega)\,d\omega\,\bigl(1 + n^{-1/2} X_1\bigr)},$$

say. Use (8.18) to show that X_a and X_1 have means zero and that

$$\mathrm{var}(X_a) = n I_{aagg} I_{ag}^{-2}(2 + \kappa_4), \qquad \mathrm{cov}(X_1, X_a) = n I_{agg} I_{ag}^{-1} I_g^{-1}(2 + \kappa_4), \qquad \mathrm{var}(X_1) = n I_{gg} I_g^{-2}(2 + \kappa_4),$$

where I_{aagg} = ∫ a²(ω)g²(ω) dω, and so forth. Hence show that to first order the mean and variance of T do not involve κ_4, and deduce that periodogram resampling may be applied to ratio statistics. Use simulation to see how well periodogram resampling performs in estimating the distribution of a suitable version of the sample estimate of the lag j autocorrelation,

$$\rho_j = \frac{\int_{-\pi}^{\pi} e^{-i\omega j}\, g(\omega)\,d\omega}{\int_{-\pi}^{\pi} g(\omega)\,d\omega}.$$

(Section 8.2.5; Janas, 1993; Dahlhaus and Janas, 1996)
7  Let y_1, ..., y_n denote the times of events in an inhomogeneous Poisson process of intensity λ(y), observed for 0 ≤ y ≤ 1, and let

$$\hat\lambda(y; h) = h^{-1}\sum_{j=1}^{n} w\{(y - y_j)/h\}$$

denote a kernel estimate of λ(y), based on a kernel w(·) that is a PDF. Explain why the following two algorithms for generating bootstrap data from the estimated intensity are (almost) equivalent.

Algorithm 8.4 (Inhomogeneous Poisson process 1)
• Let N* have a Poisson distribution with mean Λ̂ = ∫_0^1 λ̂(u; h) du.
• For j = 1, ..., N*, independently take U*_j from the U(0,1) distribution, and then set Y*_j = F̂^{-1}(U*_j), where F̂(y) = Λ̂^{-1} ∫_0^y λ̂(u; h) du.

Algorithm 8.5 (Inhomogeneous Poisson process 2)
• Let N* have a Poisson distribution with mean Λ̂ = ∫_0^1 λ̂(u; h) du.
• For j = 1, ..., N*, independently generate I*_j at random from the integers {1, ..., n} and let ε*_j be a random variable with PDF w(·). Set Y*_j = y_{I*_j} + h ε*_j.

(Section 8.3.2)
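Algorithm 8.5 is particularly easy to implement. Here is a minimal R sketch, under the assumptions that the kernel w is standard normal and that Λ̂ can be approximated by n (which holds for events away from the boundary); the function name is ours.

    # Sketch of Algorithm 8.5: bootstrap an inhomogeneous Poisson process
    # from a kernel intensity estimate with Gaussian kernel w.
    # y holds the observed event times in [0, 1]; h is the bandwidth.
    rpois.kernel <- function(y, h) {
      n <- length(y)
      N <- rpois(1, n)                   # N* ~ Poisson(Lambda.hat), with
                                         # Lambda.hat approximated by n
      I <- sample(n, N, replace = TRUE)  # choose parent events I*_j
      ystar <- y[I] + h*rnorm(N)         # jitter: Y*_j = y_{I_j} + h*eps_j
      ystar[ystar >= 0 & ystar <= 1]     # crude treatment of edge effects
    }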
8  Consider an inhomogeneous Poisson process of intensity λ(y) = Nμ(y), where μ(y) is fixed and smooth, observed for 0 ≤ y ≤ 1. A kernel intensity estimate based on events at y_1, ..., y_n is

$$\hat\lambda(y; h) = h^{-1}\sum_{j=1}^{n} w\{(y - y_j)/h\},$$

where w(·) is the PDF of a symmetric random variable with mean zero and variance one; let K = ∫ w²(u) du.

(a) Show that as N → ∞ and h → 0 in such a way that Nh → ∞,

$$E\{\hat\lambda(y; h)\} \doteq \lambda(y) + \tfrac12 h^2 \lambda''(y), \qquad \mathrm{var}\{\hat\lambda(y; h)\} \doteq K h^{-1} \lambda(y);$$

you may need the facts that the number of events n has a Poisson distribution with mean Λ = ∫_0^1 λ(u) du, and that conditional on there being n observed events, their times are independent random variables with PDF λ(y)/Λ. Hence show that the asymptotic mean squared error of λ̂(y; h) is minimized when h ∝ N^{-1/5}. Use the delta method to show that the approximate mean and variance of λ̂^{1/2}(y; h) are

$$\lambda^{1/2}(y) + \tfrac14 \lambda^{-1/2}(y)\bigl\{h^2 \lambda''(y) - \tfrac12 K h^{-1}\bigr\}, \qquad \tfrac14 K h^{-1}.$$

(b) Now suppose that resamples are formed by taking n observations at random from y_1, ..., y_n. Show that the bootstrapped intensity estimate

$$\hat\lambda^*(y; h) = h^{-1}\sum_{j=1}^{n} w\{(y - Y_j^*)/h\}$$

has mean E*{λ̂*(y; h)} = λ̂(y; h), and that the same is true when there are n′ resampled events, provided that E*(n′) = n. For a third resampling scheme, let n′ have a Poisson distribution with mean n, and generate n′ events independently from the density λ̂(y; h)/∫_0^1 λ̂(u; h) du. Show that under this scheme

$$E^*\{\hat\lambda^*(y; h)\} = \int w(u)\, \hat\lambda(y - hu; h)\, du.$$

(c) By comparing the asymptotic distributions of

$$Z(y; h) = \frac{\hat\lambda^{1/2}(y; h) - \lambda^{1/2}(y)}{(\tfrac14 K h^{-1})^{1/2}}, \qquad Z^*(y; h) = \frac{\{\hat\lambda^*(y; h)\}^{1/2} - \hat\lambda^{1/2}(y; h)}{(\tfrac14 K h^{-1})^{1/2}},$$

find conditions under which the quantiles of Z* can estimate those of Z. (Section 8.3.2; Example 5.13; Cowling, Hall and Phillips, 1996)

9  Consider resampling tiles when the observation region ℛ is a square, the data are generated by a stationary planar Poisson process of intensity λ, and the quantity of interest is θ = var(Y), where Y is the number of events in ℛ. Suppose that ℛ is split into n fixed tiles of equal size and shape, which are then resampled according to the usual bootstrap. Show that the bootstrap estimate of θ is T = Σ(y_j − ȳ)², where y_j is the number of events in the jth tile. Use the fact that

$$\mathrm{var}(T) = (n-1)^2\bigl\{\kappa_4/n + 2\kappa_2^2/(n-1)\bigr\},$$

where κ_r is the rth cumulant of Y_j, to show that the mean squared error of T is

$$\frac{\mu}{n^2}\bigl\{\mu + (n-1)(2\mu + n - 1)\bigr\},$$

where μ = λ|ℛ|. Sketch this when μ > 1, μ = 1, and μ < 1, and explain in qualitative terms its behaviour when μ > 1. Extend the discussion to moving tiles. (Section 8.3)
8.6 Practicals

1  Dataframe lynx contains the Canadian lynx data, to the logarithm of which we fit the autoregressive model that minimizes AIC:

    ts.plot(log(lynx))
    lynx.ar <- ar(log(lynx))
    lynx.ar$order
The best model is AR(11). How well determined is this, and what is the variance of the series average? We bootstrap to see, using lynx.fun (given below), which calculates the order of the fitted autoregressive model, the series average, and saves the series itself. Here are results for fixed-block bootstraps with block length l = 20:

    lynx.fun <- function(tsb) {
      ar.fit <- ar(tsb, order.max=25)
      c(ar.fit$order, mean(tsb), tsb) }
    lynx.1 <- tsboot(log(lynx), lynx.fun, R=99, l=20, sim="fixed")
    tsplot(ts(lynx.1$t[1,3:116], start=c(1821,1)),
           main="Block simulation, l=20")
    boot.array(lynx.1)[1,]
    table(lynx.1$t[,1])
    var(lynx.1$t[,2])
    qqnorm(lynx.1$t[,2])
    abline(mean(lynx.1$t[,2]), sqrt(var(lynx.1$t[,2])), lty=2)
To obtain similar results for the stationary bootstrap with mean block length l = 20:

    .Random.seed <- lynx.1$seed
    lynx.2 <- tsboot(log(lynx), lynx.fun, R=99, l=20, sim="geom")
See if the results look different from those above. Do the simulated series using blocks look like the original? Compare the estimated variances under the two resampling schemes. Try different block lengths, and see how the variances of the series average change.

For model-based resampling we need to store results from the original model:

    lynx.model <- list(order=c(lynx.ar$order,0,0), ar=lynx.ar$ar)
    lynx.res <- lynx.ar$resid[!is.na(lynx.ar$resid)]
    lynx.res <- lynx.res - mean(lynx.res)
    lynx.sim <- function(res, n.sim, ran.args) {
      rg1 <- function(n, res) sample(res, n, replace=T)
      ts.orig <- ran.args$ts
      ts.mod <- ran.args$model
      mean(ts.orig) + ts(arima.sim(model=ts.mod, n=n.sim,
                         rand.gen=rg1, res=as.vector(res))) }
    .Random.seed <- lynx.1$seed
    lynx.3 <- tsboot(lynx.res, lynx.fun, R=99, sim="model",
                     n.sim=114, ran.gen=lynx.sim,
                     ran.args=list(ts=log(lynx), model=lynx.model))
Check the orders of the fitted models for this scheme. For post-blackening we need to define yet another function:

    lynx.black <- function(res, n.sim, ran.args) {
      ts.orig <- ran.args$ts
      ts.mod <- ran.args$model
      mean(ts.orig) + ts(arima.sim(model=ts.mod, n=n.sim, innov=res)) }
    .Random.seed <- lynx.1$seed
    lynx.1b <- tsboot(lynx.res, lynx.fun, R=99, l=20, sim="fixed",
                      n.sim=114, ran.gen=lynx.black,
                      ran.args=list(ts=log(lynx), model=lynx.model))
Compare these results with those above, and try the post-blackened bootstrap with sim="geom". (Sections 8.2.2, 8.2.3)

2  The data in beaver consist of a time series of n = 100 observations on the body temperature y_1, ..., y_n and an indicator x_1, ..., x_n of activity of a female beaver, Castor canadensis. We want to estimate and give an uncertainty measure for the body temperature of the beaver. The simplest model that allows for the clear autocorrelation of the series is

$$y_j = \beta_0 + \beta_1 x_j + \eta_j, \qquad \eta_j = \alpha \eta_{j-1} + \varepsilon_j, \qquad j = 1, \dots, n, \tag{8.21}$$

a linear regression model in which the errors η_j form an AR(1) process, and the ε_j are independent identically distributed errors with mean zero and variance σ². Having fitted this model and estimated the parameters α, β_0, β_1, and σ², we can generate new series by

$$y_j^* = \hat\beta_0 + \hat\beta_1 x_j + \eta_j^*, \qquad \eta_j^* = \hat\alpha \eta_{j-1}^* + \varepsilon_j^*, \qquad j = 1, \dots, n, \tag{8.22}$$

where the error series {η*_j} is formed by taking a white noise series {ε*_j} at random from the set {σ̂(e_2 − ē), ..., σ̂(e_n − ē)} and then applying the second part of (8.22). To fit the original model and to generate a new series:

    fit <- function(data) {
      X <- cbind(rep(1,100), data$activ)
      para <- list(X=X, data=data)
      assign("para", para, frame=1)
      d <- arima.mle(x=para$data$temp, model=list(ar=c(0.8)), xreg=para$X)
      res <- arima.diag(d, plot=F, std.resid=T)$std.resid
      res <- res[!is.na(res)]
      list(paras=c(d$model$ar, d$reg.coef, sqrt(d$sigma2)),
           res=res-mean(res), fit=X %*% d$reg.coef) }
    beaver.args <- fit(beaver)
    white.noise <- function(n.sim, ts) sample(ts, size=n.sim, replace=T)
    beaver.gen <- function(ts, n.sim, ran.args) {
      tsb <- ran.args$res
      fit <- ran.args$fit
      coeff <- ran.args$paras
      ts$temp <- fit + coeff[4]*arima.sim(model=list(ar=coeff[1]),
                     n=n.sim, rand.gen=white.noise, ts=tsb)
      ts }
    new.beaver <- beaver.gen(beaver, 100, beaver.args)

Now we are able to generate data, we can bootstrap and see the results of beaver.boot as follows:

    beaver.fun <- function(ts) fit(ts)$paras
    beaver.boot <- tsboot(beaver, beaver.fun, R=99, sim="model",
                          n.sim=100, ran.gen=beaver.gen,
                          ran.args=beaver.args)
    names(beaver.boot)
    beaver.boot$t0
    beaver.boot$t[1:10,]

showing the original value of beaver.fun and its value for the first 10 replicate
series. Are the estimated mean temperatures for the R = 99 simulations normal? Use boot.ci to obtain normal and basic bootstrap confidence intervals for the resting and active temperatures. In this analysis we have assumed that the linear model with AR(1) errors is appropriate. How would you proceed if it were not? (Section 8.2; Reynolds, 1994)
3  Consider scrambling the phases of the sunspot data. To see the original data, two replicates generated using ordinary phase scrambling, and two phase-scrambled series whose marginal distribution is the same as that of the original data:

    sunspot.fun <- function(ts) ts
    sunspot.1 <- tsboot(sunspot, sunspot.fun, R=2, sim="scramble")
    .Random.seed <- sunspot.1$seed
    sunspot.2 <- tsboot(sunspot, sunspot.fun, R=2, sim="scramble", norm=F)
    split.screen(c(3,2))
    yl <- c(-50,200)
    screen(1); ts.plot(sunspot, ylim=yl); abline(h=0, lty=2)
    screen(3); tsplot(sunspot.1$t[1,], ylim=yl); abline(h=0, lty=2)
    screen(4); tsplot(sunspot.1$t[2,], ylim=yl); abline(h=0, lty=2)
    screen(5); tsplot(sunspot.2$t[1,], ylim=yl); abline(h=0, lty=2)
    screen(6); tsplot(sunspot.2$t[2,], ylim=yl); abline(h=0, lty=2)

What features of the original data are preserved by the two algorithms? (You may find it helpful to experiment with different shapes for the figures.) (Section 8.2.4; Problem 8.4; Theiler et al., 1992)
4  coal contains data on times of explosions in coal mines from 15 March 1851 to 22 March 1962, often modelled as an inhomogeneous Poisson process. For a kernel intensity estimate (accidents per year):

    coal.est <- function(y, h=5)
      length(y)*ksmooth(y, bandwidth=2.7*h, kernel="n",
                        x.points=seq(1851,1963,2))$y
    year <- seq(1851,1963,2)
    plot(year, coal.est(coal$date), type="l", ylab="intensity",
         ylim=c(0,6))
    rug(coal$date)

Try other choices of bandwidth h, noting that the estimate for the period (1851 + 4h, 1962 − 4h) does not have edge effects. Do you think that the drop from about three accidents per year before 1900 to about one thereafter is spurious? What about the peaks at around 1910 and 1940?

For an equi-tailed 90% bootstrap confidence band for the intensity, we take h = 5 and R = 199 (a larger R will give more reliable results):

    coal.fun <- function(data, i, h=5) coal.est(data[i], h)
    coal.boot <- boot(coal$date, coal.fun, R=199)
    A <- 0.5/sqrt(5*2*sqrt(pi))
    Z <- sweep(sqrt(coal.boot$t), 2, sqrt(coal.boot$t0))/A
    Z.max <- sort(apply(Z, 1, max))[190]
    Z.min <- sort(apply(Z, 1, min))[10]
    top <- (sqrt(coal.boot$t0) - A*Z.min)^2
    bot <- (sqrt(coal.boot$t0) - A*Z.max)^2
    lines(year, top, lty=2); lines(year, bot, lty=2)
To see the quantile process:

    Z <- apply(Z, 2, sort)
    Z.05 <- Z[10,]
    Z.95 <- Z[190,]
    plot(year, Z.05, type="l", ylab="Z", ylim=c(-3,3))
    lines(year, Z.95)

Construct symmetric bootstrap confidence bands based on z_α(h) such that

$$\Pr\{|Z(y; h)| \le z_\alpha(h),\; y \in \mathcal{R}\} = \alpha$$

(no more simulation is required). How different are they from the equi-tailed ones? For simulation with a random number of events, use

    coal.gen <- function(data, n) {
      i <- sample(1:n, size=rpois(n=1, lambda=n), replace=T)
      data[i] }
    coal.boot2 <- boot(coal$date, coal.est, R=199, sim="parametric",
                       ran.gen=coal.gen, mle=nrow(coal))

Does this make any difference? (Section 8.3.2; Cowling, Hall and Phillips, 1996; Hand et al., 1994, p. 155)
9 Improved Calculation
9.1 Introduction

A few of the statistical questions in earlier chapters have been amenable to analytical calculation. However, most of our problems have been too complicated for exact solutions, and samples have been too small for theoretical large-sample approximations to be trustworthy. In such cases simulation has provided approximate answers through Monte Carlo estimates of bias, variance, quantiles, probabilities, and so forth. Throughout we have supposed that the simulation size is limited only by our impatience for reliable results.

Simulation of independent bootstrap samples and their use as described in previous chapters is usually easily programmed and implemented. If it takes up to a few hours to calculate enough values of the statistic of interest, T, ordinary simulation of this sort will be an efficient use of a researcher's time. But sometimes T is very costly to compute, or sampling is only a single component in a larger procedure, as in a double bootstrap, or the procedure will be repeated many times with different sets of data. Then it may pay to invest in methods of calculation that reduce the number of simulations needed to obtain a given precision, or equivalently increase the accuracy of an estimate based on a given simulation size. This chapter is devoted to such methods.

No lunch is free. The techniques that give the biggest potential variance reductions are usually the hardest to implement. Others yield less spectacular gains, but are more easily implemented. Thoughtless use of any of them may make matters worse, so it is essential to ensure that use of a variance reduction technique will save the investigator's time, which is much more valuable than computer time.

Most of our bootstrap estimates depend on averages. For example, in testing a null hypothesis (Chapter 4) we want to calculate the significance probability p = Pr*(T* ≥ t | F̂_0), where t is the observed value of test statistic T and
the fitted model F̂_0 is an estimate of F under the null hypothesis. The simple Monte Carlo estimate of p is R^{-1} Σ I{T*_r ≥ t}, where I is the indicator function and the T*_r are based on R independent samples generated from F̂_0. The variance of this estimate is cR^{-1}, where c = p(1 − p). Nothing can generally be done about the factor R^{-1}, but the constant c can be reduced if we use a more sophisticated Monte Carlo technique. Most of this chapter concerns such techniques. Section 9.2 describes methods for balancing the simulation in order to make it more like a full enumeration of all possible samples, and in Section 9.3 we describe methods based on the use of control variates. Section 9.4 describes methods based on importance sampling. In Section 9.5 we discuss one important method of theoretical approximation, the saddlepoint method, which eliminates the need for simulation.
9.2 Balanced Bootstraps

Suppose for simplicity that the data are a homogeneous random sample y_1, ..., y_n with EDF F̂, and that as usual we are concerned with the properties of a statistic T whose observed value is t = t(y_1, ..., y_n). Our focus is T* = t(Y*_1, ..., Y*_n), where the Y*_j are a random sample from F̂. Consider the bias estimate for T, namely B = E*(T* | F̂) − t. If g denotes the joint density of Y*_1, ..., Y*_n, then

$$B = \int \cdots \int t(y_1', \dots, y_n')\, g(y_1', \dots, y_n')\, dy_1' \cdots dy_n' - t.$$

This might be computable analytically if t(·) is simple enough, particularly for some parametric models. In the nonparametric case, if the calculation cannot be done analytically, we set g equal to n^{-n} for all possible samples y′_1, ..., y′_n in the set 𝒮 = {y_1, ..., y_n}^n and write

$$B = n^{-n} \sum t(y_1', \dots, y_n') - t. \tag{9.1}$$
This sum over all possible samples need involve only $\binom{2n-1}{n}$ calculations of t*, since the symmetry of t(·) with respect to the sample can be used, but even so the complete enumeration of values t* that (9.1) requires will usually be impracticable unless n is very small. So it is that, especially in nonparametric problems, we usually approximate the average in (9.1) by the average of R randomly chosen elements of 𝒮, and so approximate B by B_R = R^{-1} Σ T*_r − t. This calculation with a random subset of 𝒮 has a major defect: the values y_1, ..., y_n typically do not occur with equal frequency in that subset. This is illustrated in Table 9.1, which reproduces Table 2.2 but adds (penultimate row) the aggregate frequencies for the data values; the final row is explained later.
Table 9.1 R = 9 resamples for city population data, chosen by ordinary bootstrap sampling from F̂.

  j            1    2    3    4    5    6    7    8    9   10
  u          138   93   61  179   48   37   29   23   30    2
  x          143  104   69  260   75   63   50   48  111   50
  Aggregate    9    8   11   11    5   13    8    8    7   10

The nine rows giving the number of times each observation was sampled in samples 1, ..., 9 are omitted here; the penultimate row above gives their aggregate frequencies, and the final row of the original table gives the corresponding average EDF F̄*, i.e. these aggregate frequencies divided by Rn = 90. The statistic values are t = 1.520 and t*_1 = 1.466, t*_2 = 1.761, t*_3 = 1.951, t*_4 = 1.542, t*_5 = 1.371, t*_6 = 1.686, t*_7 = 1.378, t*_8 = 1.420, t*_9 = 1.660.
In the even simpler case of the sample average t = ȳ we can see clearly that the unequal frequencies completely account for the fact that B_R differs from the correct value B = 0. The corresponding phenomenon for parametric bootstrapping is that the aggregated EDF of the R samples is not as close to the CDF of the fitted parametric model as it is to the same model with different parameter values.

There are two ways to deal with this difficulty. First, we can try to change the simulation to remove the defect; and secondly we can try to adjust the results of the existing simulation.
9.2.1 Balancing the simulation

The idea of balanced resampling is to generate tables of random frequencies, but to force them to be balanced in an appropriate way. A set of R bootstrap samples is said to have first-order balance if each of the original observations appears with equal frequency, i.e. exactly R times overall. First-order balance is easy to achieve. A simple algorithm is as follows:

Algorithm 9.1 (Balanced bootstrap)
• Concatenate R copies of y_1, ..., y_n into a single set 𝒴 of size Rn.
• Permute the elements of 𝒴 at random, giving 𝒴*, say.
• For r = 1, ..., R, take successive sets of n elements of 𝒴* to be the balanced resamples, y*_r, and set t*_r = t(y*_r).
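A direct transcription of Algorithm 9.1 into R might look as follows; the function name balanced.boot is ours, and statistic is any function of a resampled data vector.

    # Minimal sketch of Algorithm 9.1 (first-order balanced bootstrap).
    balanced.boot <- function(y, statistic, R) {
      n <- length(y)
      pool <- sample(rep(y, R))                        # concatenate, permute
      samples <- matrix(pool, nrow = R, byrow = TRUE)  # successive sets of n
      apply(samples, 1, statistic)                     # t*_r, r = 1, ..., R
    }
    # Every observation appears exactly R times across the R rows, so the
    # linear component of the bias estimate mean(t.star) - t is held at zero.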
Table 9.2 First-order balanced bootstrap with R = 9 for city population data.

  j            1    2    3    4    5    6    7    8    9   10
  Aggregate    9    9    9    9    9    9    9    9    9    9

The nine rows giving the number of times each observation was sampled in samples 1, ..., 9 are omitted here; each observation appears exactly 9 times in aggregate. The statistic values are t = 1.520 and t*_1 = 1.632, t*_2 = 1.823, t*_3 = 1.334, t*_4 = 1.317, t*_5 = 1.531, t*_6 = 1.344, t*_7 = 1.730, t*_8 = 1.424, t*_9 = 1.678.
Other algorithms (e.g. Problem 9.2) have been suggested that economize on the time and space needed to generate balanced samples, but the most time-consuming part of a bootstrap simulation is usually the calculation of the values of t*, so the details of the simulation algorithm are rarely critical. Whatever the method used to generate the balanced samples, the result will be that individual observations have equal overall frequencies, just as for complete enumeration; a simple illustration is given below. Indeed, so far as the marginal frequencies of the data values are concerned, a complete enumeration has been performed.

Example 9.1 (City population data) Consider estimating the bias of the ratio estimate t = x̄/ū for the data in the second and third rows of Table 9.1. Table 9.2 shows the results for a balanced bootstrap with R = 9: each data value occurs exactly 9 times overall.

To see how well the balanced bootstrap works, we apply it with the more realistic number R = 49. The bias estimate is B_R = T̄* − t = R^{-1} Σ_r T*_r − t, and its variance over 100 replicates of the ordinary resampling scheme is 7.25 × 10^{-4}. The corresponding figure for the balanced bootstrap is 9.31 × 10^{-5}, so the balanced scheme is about 72.5/9.31 = 7.8 times more efficient for bias estimation. ■

Here and below we say that the efficiency of a bootstrap estimate such as B_R relative to the ordinary bootstrap is the variance ratio

$$\frac{\mathrm{var}^*_{\mathrm{ord}}(B_R)}{\mathrm{var}^*_{\mathrm{bal}}(B_R)},$$

where for this comparison the subscripts denote the sampling scheme under which B_R was calculated.
Table 9.3 Approximate efficiency gains when balancing schemes with R = 49 are applied in estimating biases for estimates of the nonlinear regression model applied to the calcium uptake data, based on 100 repetitions of the bootstrap.

          Cases               Stratified           Residuals
          Balanced  Adjusted  Balanced  Adjusted   Balanced  Adjusted
  β0        8.9       6.9       141       108        1.2       0.6
  β1       13.1       8.9        63        49        1.4       0.6
  σ        11.1       9.1      18.7      18.0       15.3      13.5
So far we have focused on the application to bias estimation, for which the balance typically gives a big improvement. The same is not generally true for estimating higher moments or quantiles. For instance, in the previous example the balanced bootstrap has efficiency less than one for calculation of the variance estimate V_R.

The balanced bootstrap extends quite easily to more complicated sampling situations. If the data consist of several independent samples, as in Section 3.2, balanced simulation can be applied separately to each. Some other extensions are straightforward.

Example 9.2 (Calcium uptake data) To investigate the improvement in bias estimation for the parameters of the nonlinear regression model fitted to the data of Example 7.7, we calculated 100 replicates of the estimated biases based on 49 bootstrap samples. The resulting efficiencies are given in Table 9.3 for different resampling schemes; the results labelled "Adjusted" are discussed in Example 9.3. For stratified resampling the data are stratified by the covariate value, so there are nine strata each with three observations. The efficiency gains under stratified resampling are very large, and those under case resampling are worthwhile. The gains when resampling residuals are not worthwhile, except for σ². ■

First-order balance ensures that each observation occurs precisely R times in the R samples. In a scheme with second-order balance, each pair of observations occurs together precisely the same number of times, and so on for schemes with third- and higher-order balance. There is a close connection to certain experimental designs (Problem 9.7). Detailed investigation suggests, however, that there is usually no practical gain beyond first-order balance. An open question is whether or not there are useful "nearly balanced" designs.
9.2.2 Post-simulation balance

Consider again estimating the bias of T in a nonparametric context, based on an unbalanced array of frequencies such as Table 9.1. The usual bias estimate can be written in expanded notation as

$$B_R = R^{-1}\sum_{r=1}^{R} t(\hat F_r^*) - t(\hat F), \tag{9.2}$$

where as usual F̂*_r denotes the EDF corresponding to the rth row of the array. Let F̄* denote the average of these EDFs, that is, F̄* = R^{-1}(F̂*_1 + ⋯ + F̂*_R). For a frequency table such as Table 9.1, F̄* is the CDF of the distribution corresponding to the aggregate frequencies of data values, as shown in the final row. The resulting adjusted bias estimate is

$$B_{R,\mathrm{adj}} = R^{-1}\sum_{r=1}^{R} t(\hat F_r^*) - t(\bar F^*). \tag{9.3}$$
This is sometimes called the re-centred bias estimate. In addition to the usual bootstrap values t(F̂*_r), its calculation requires only F̄* and t(F̄*). Note that for the adjustment to work, t(·) must be in a functional form, i.e. be defined independently of sample size n. For example, a variance must be calculated with divisor n rather than n − 1.

The corresponding calculation for a parametric bootstrap is similar. In effect the adjustment compares the simulated estimates T*_r to the parameter value θ̄* = t(F̄*) obtained by fitting the model to data with EDF F̄* rather than F̂.

Example 9.3 (Calcium uptake data) Table 9.3 shows the efficiency gains from using B_{R,adj} in the nonparametric resampling experiment described in Example 9.2. The gains are broadly similar to those for balanced resampling, but smaller.

For parametric sampling the quantities F̂*_r in (9.3) represent sets of data generated by parametric simulation from the fitted model, and the average F̄* is the dataset of size Rn obtained by concatenating the simulated samples. Here the simplest parametric simulation is to generate data y*_j = μ̂_j + ε*_j, where the μ̂_j are the fitted values from Example 7.7 and the ε*_j are independent N(0, 0.55²) variables. In 100 replicates of this bootstrap with R = 49, the efficiency gains for estimating the biases of β̂_0, β̂_1, and σ̂ were 24.7, 42.5, and 20.7; the effect of the adjustment is much more marked for the parametric than for the nonparametric bootstraps. ■

The same adjustment does not apply to the variance approximation V_R, higher moments or quantiles. Rather the linear approximation is used as a conventional control variate, as described in Section 9.3.
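In the nonparametric case the adjustment in (9.3) is a one-line change to the usual simulation. Here is a minimal R sketch, under the assumption that the statistic is written in functional form, taking the data and a vector of resampling frequencies; the helper names are ours.

    # Re-centred (adjusted) bias estimate B_{R,adj} of (9.3).
    # stat(y, f) must evaluate t at the distribution putting weights
    # proportional to f on the observations in y (functional form).
    bias.adj <- function(y, stat, R) {
      n <- NROW(y)
      f <- t(rmultinom(R, size = n, prob = rep(1/n, n)))  # R x n frequencies
      t.star <- apply(f, 1, function(fr) stat(y, fr))
      mean(t.star) - stat(y, colMeans(f))   # compare t-bar* with t(F-bar*)
    }
    # For the city population ratio one could take, hypothetically,
    # stat <- function(y, f) weighted.mean(y$x, f)/weighted.mean(y$u, f)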
9.2.3 Some theory

Some theoretical insight into both balanced simulation and post-simulation balancing can be gained by means of the nonparametric delta method (Section 2.7). As before, let F̂* denote the EDF of a bootstrap sample Y*_1, ..., Y*_n. The expansion of T* = t(F̂*) about F̂ is, to second-order terms,

$$t(\hat F^*) = t_Q(\hat F^*) = t(\hat F) + n^{-1}\sum_{j=1}^{n} l_j^* + \tfrac12 n^{-2}\sum_{j=1}^{n}\sum_{k=1}^{n} q_{jk}^*, \tag{9.4}$$

where l*_j = l(Y*_j; F̂) and q*_{jk} = q(Y*_j, Y*_k; F̂) are values of the empirical first- and second-order derivatives of t at F̂; equation (9.4) is the same as (2.41), but with F and F̂ replaced by F̂ and F̂*. We call the right-hand side of (9.4) the quadratic approximation to T*. Omission of the final term leaves the linear approximation

$$t_L(\hat F^*) = t(\hat F) + n^{-1}\sum_{j=1}^{n} l_j^*, \tag{9.5}$$
which is the basis of the variance approximation v_L; equation (9.5) is simply a recasting of (2.44). In terms of the frequencies f*_j with which the y_j appear in the bootstrap sample and the empirical influence values l_j = l(y_j; F̂) and q_{jk} = q(y_j, y_k; F̂), the quadratic approximation (9.4) is

$$t + n^{-1}\sum_{j=1}^{n} f_j^* l_j + \tfrac12 n^{-2}\sum_{j=1}^{n}\sum_{k=1}^{n} f_j^* f_k^* q_{jk}, \tag{9.6}$$
in abbreviated notation. Recall that Σ_j l_j = 0 and Σ_j q_{jk} = Σ_k q_{jk} = 0. We can now compare the resampling schemes through the properties of the frequencies f*_j.

Consider bootstrap simulation to estimate the bias of T. Suppose that there are R simulated samples, and that y_j appears in the rth with frequency f*_{rj}, while T takes value T*_r. Then from (9.2) and (9.6) the bias approximation B_R = R^{-1} Σ T*_r − t can be approximated by

$$R^{-1}\sum_{r=1}^{R}\left(t + n^{-1}\sum_{j=1}^{n} f_{rj}^* l_j + \tfrac12 n^{-2}\sum_{j=1}^{n}\sum_{k=1}^{n} f_{rj}^* f_{rk}^* q_{jk}\right) - t. \tag{9.7}$$
In the ordinary resampling scheme, the rows of frequencies (f*_{r1}, ..., f*_{rn}) are independent samples from the multinomial distribution with denominator n and probability vector (n^{-1}, ..., n^{-1}). This is the case in Table 9.1. In this situation the first and second joint moments of the frequencies are

$$E^*(f_{rj}^*) = 1, \qquad \operatorname{cov}^*(f_{rj}^*, f_{sk}^*) = \delta_{rs}\bigl(\delta_{jk} - n^{-1}\bigr),$$

where δ_{jk} = 1 if j = k and zero otherwise, and so forth; the higher cumulants are given in Problem 2.19. Straightforward calculations show that approximation (9.7) has mean ½n^{-2} Σ_j q_{jj} and variance

$$\frac{1}{Rn^2}\sum_{j=1}^{n} l_j^2 + \frac{1}{4Rn^4}\left\{\left(\sum_{j=1}^{n} q_{jj}\right)^2 + 2\sum_{j=1}^{n}\sum_{k=1}^{n} q_{jk}^2\right\}. \tag{9.8}$$
• (9.8)
*=i
F or the balanced b o o tstrap , the jo in t distrib u tio n o f the R x n table o f frequencies f ' j is hypergeom etric w ith row sum s n and colum n sums R. Because = 0 an d f'r] = R for all j, approxim ation (9.7) becomes
/= 1 k= l
r—l
U nder balanced resam pling one can show (Problem 9.1) th at e
*(/*;•) = i,
(nSJk - 1)(JW„ - 1) ni? - 1
cov*(/;;, / ; , ) =
(9.9)
so the bias approxim ation (9.7) has m ean lM( i ? - l ) _2 , A j=i
m ore painful calculations show th a t its variance is approxim ately 1 4Rr?
-2I T 1
qjj + 2nT 2 R - 2 ( j=1
^ /
\;= 1
+ 2(n - I)/!"1 £ £ q)k j=1 /c =l
(9.10) The m ean is alm ost the sam e u nder b o th schemes, b u t the leading term o f the variance in (9.10) is sm aller th an in (9.8) because the term in (9.7) involving the lj is held equal to zero by the balance constraints Y l r f*j = First-order balance ensures th a t the linear term in the expansion for B r is held equal to its value o f zero for the com plete enum eration. Post-sim ulation balance is closely related to the balanced bootstrap. It is straightforw ard to see th a t the quad ratic nonparam etric delta m ethod approx im ation o f Bg^adj in (9.3) equals (9.11) y = l k= 1 I
r= l
r= l
r= l
Figure 9.1 Efficiency comparisons for estimating biases of normal eigenvalues. The left panel compares the efficiency gains over the ordinary bias estimate due to balancing and post-simulation adjustment. The right panel shows the gains for the balanced estimate, as a function of the correlation between the statistic and its linear approximation; the solid line shows the theoretical relation. See text for details.
Like the balanced bootstrap estimate of bias, there are no linear terms in this expression. Re-centring has forced those terms to equal their population values of zero.

When the statistic T does not possess an expansion like (9.4), balancing may not help. In any case the correlation between the statistic and its linear approximation is important: if the correlation is low because the quadratic component of (9.4) is appreciable, then it may not be useful to reduce variation in the linear component. A rough approximation is that var*(B_R) is reduced by a factor equal to 1 minus the square of the correlation between T* and T*_L (Problem 9.5).

Example 9.4 (Normal eigenvalues) For a numerical comparison of the efficiency gains in bias estimation from balanced resampling and post-simulation adjustment, we performed Monte Carlo experiments as follows. We generated n variates from the multivariate normal density with dimension 5 and identity covariance matrix, and took t to be the five eigenvalues of the sample covariance matrix. For each sample we used a large bootstrap to estimate the linear approximation t*_L for each of the eigenvalues, and then calculated the correlation c between t* and t*_L. We then estimated the gains in efficiency for balanced and adjusted estimates of bias calculated using the bootstrap with R = 39, using variances estimated from 100 independent bootstrap simulations.

Figure 9.1 shows the gains in efficiency for each of the 5 eigenvalues, for 50 sets of data with n = 15 and 50 sets with n = 25; there are 500 points in each panel. The left panel compares the efficiency gains for the balanced and adjusted schemes. Balanced sampling gives better gains than post-sample adjustment, but the difference is smaller at larger gains. The right panel shows
the efficiency gains for the balanced scheme plotted against the correlation c. The solid line is the theoretical curve (1 − c²)^{-1}. Knowledge of c would enable the efficiency gain to be predicted quite accurately, at least for c > 0.8. The potential improvement from balancing is not guaranteed to be worthwhile when c < 0.7. The corresponding plot for the adjusted estimates suggests that c must be at least 0.85 for a useful efficiency gain. ■

This example suggests the following strategy when a good estimate of bias is required: perform a small standard unbalanced bootstrap, and use it to estimate the correlation between the statistic and its linear approximation. If that correlation exceeds about 0.7, it may be worthwhile to perform a balanced simulation, but otherwise it will not. If the correlation exceeds 0.85, post-simulation adjustment will usually be worthwhile, but otherwise it will not.
9.3 Control Methods

The basis of control methods is extra calculation during or after a series of simulations, with the aim of reducing the overall variability of the estimator. This can be applied to nonparametric simulation in several ways. The post-simulation balancing described in the preceding section is a simple control method, in which we store the simulated random samples and make a single post-simulation calculation. Most control methods involve extra calculations at the time of the simulation, and are applicable when there is a simple statistic that is highly correlated with T*. Such a statistic is known as a control variate. The key idea is to write T* in terms of the control variate and the difference between T* and the control variate, and then to calculate the required properties for the control variate analytically, estimating only the differences by simulation.

Bias and variance

In many bootstrap contexts where T is an estimator, a natural choice for the control variate will be the linear approximation T*_L defined in (2.44). The moments of T*_L can be obtained theoretically using moments of the frequencies f*_j. In ordinary random sampling the f*_j are multinomial, so the mean and variance of T*_L are

$$E^*(T_L^*) = t, \qquad \operatorname{var}^*(T_L^*) = n^{-2}\sum_{j=1}^{n} l_j^2 = v_L.$$
In order to use T*_L as a control variate, we write T* = T*_L + D*, so that D* equals the difference T* − T*_L. The mean and variance of T* can then
be written

$$E^*(T^*) = E^*(T_L^*) + E^*(D^*), \qquad \operatorname{var}^*(T^*) = \operatorname{var}^*(T_L^*) + 2\operatorname{cov}^*(T_L^*, D^*) + \operatorname{var}^*(D^*),$$

the leading terms of which are known. Only terms involving D* need to be approximated by simulation. Given simulations T*_1, ..., T*_R with corresponding linear approximations T*_{L,r} and differences D*_r = T*_r − T*_{L,r}, the mean and variance of T* are estimated by

$$t + \bar D^*, \qquad V_{R,\mathrm{con}} = v_L + \frac{2}{R}\sum_{r=1}^{R}\bigl(T_{L,r}^* - \bar T_L^*\bigr)\bigl(D_r^* - \bar D^*\bigr) + \frac{1}{R}\sum_{r=1}^{R}\bigl(D_r^* - \bar D^*\bigr)^2, \tag{9.12}$$

where T̄*_L = R^{-1} Σ_r T*_{L,r} and D̄* = R^{-1} Σ_r D*_r. Use of these and related approximations requires the calculation of the T*_{L,r} as well as the T*_r.

The estimated bias of T* based on (9.12) is B_{R,con} = D̄*. This is closely related to the estimate obtained under balanced simulation and to the re-centred bias estimate B_{R,adj}. Like them, it ensures that the linear component of the bias estimate equals its population value, zero. Detailed calculation shows that all three approaches achieve the same variance reduction for the bias estimate in large samples. However, the variance estimate in (9.12) based on linear approximation is less variable than the estimated variances obtained under the other approaches, because its leading term is not random.

Example 9.5 (City population data) To see how effective control methods are in reducing the variability of a variance estimate, we consider the ratio statistic for the city population data in Table 2.1, with n = 10. For 100 bootstrap simulations with R = 50, we calculated the usual variance estimate v_R = (R − 1)^{-1} Σ(t*_r − t̄*)² and the estimate V_{R,con} from (9.12). The estimated gain in efficiency calculated from the 100 simulations is 1.92, which though worthwhile is not large. The correlation between t* and t*_L is 0.94.

For the larger set of data in Table 1.3, with n = 49, we repeated the experiment with R = 100. Here the gain in efficiency is 7.5, and the correlation is 0.99. Figure 9.2 shows scatter plots of the estimated variances in these experiments.
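The estimates in (9.12) are simple to compute. Here is a minimal R sketch, under the assumption that the empirical influence values l_j are available (for example from the regression method of Chapter 2); the helper names are ours.

    # Control-variate estimates of bias and variance from (9.12).
    # t.star: bootstrap values; f: R x n frequency array; l: empirical
    # influence values; t0: observed statistic.
    control.est <- function(t.star, f, l, t0) {
      n <- length(l)
      tL <- t0 + drop(f %*% l)/n        # linear approximations T*_{L,r}
      D <- t.star - tL                  # differences D*_r = T*_r - T*_{L,r}
      vL <- sum(l^2)/n^2                # exact variance of T*_L
      v.con <- vL + 2*mean((tL - mean(tL))*(D - mean(D))) +
               mean((D - mean(D))^2)    # V_{R,con} of (9.12)
      list(bias = mean(D), var = v.con) # bias is B_{R,con} = D-bar
    }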
9 • Improved Calculation
448
Figure 9.2 Comparison of estimated variances (xlO -2) for city population ratio, using usual and control methods, for n = 10 with R = 50 (left) and for n = 49 with R = 100 (right). The dotted line is the line x = y, and the dashed lines show the “true” variances, estimated from a much larger simulation.
0
/
:
•
..................................... 0
Usual
Usual
1.3. The four left panels o f Figure 9.3 show plots o f the values o f v r >co„ against the values o f v r . N o strong p attern is discernible. To get a m ore system atic idea o f the effectiveness o f control m ethods in this setting, we repeated the experim ent outlined in Exam ple 9.4 and com pared the usual and control estim ates o f the variances o f the five eigenvalues. The results for the five eigenvalues an d n = 15 and 25 are show n in Figure 9.3. G ains in efficiency are n o t g u aranteed unless the correlation betw een the statistic and its linear ap proxim ation is 0.80 o r m ore, and they are n o t large unless the correlation is close to one. T he line y = (1 — x4)-1 sum m arizes the efficiency gain well, th o u g h we have n o t attem p ted to justify this. ■ Quantiles C ontrol m ethods m ay also be applied to quantiles. Suppose th a t we have the sim ulated values t\, ..., t’R o f a statistic, and th a t the corresponding control variates and differences are available. We now sort the differences by the values o f the control variates. F o r exam ple, if o u r control variate is a linear approxim ation, w ith R = 4 an d t 'L 2 < t"L , < t *L 4 < t] 3, we p u t the differences in order d"2, d\, d"4, d\. The procedure now is to replace the p quantile o f the linear approxim ation by a theoretical approxim ation, tp, for p = 1/(jR + 1 ) ,..., R / ( R + 1), thereby replacing t'r) w ith t ’C r = tp + d '(r), where 7t(r) is the ran k o f t'L r. In o u r exam ple we would obtain t ’c j = t0.2 + d'2, t'c 2 = £0 . 4 + d.\, t'c 3 = to. 6 + d\, an d t ’CA = fo.g + d\. We now estim ate the pth quantile o f the distribution o f T by t'c ^ , i.e. the rth quantile o f t“ c v ... ,t*CR. If the control variate is highly correlated w ith T m, the bulk o f the variability in the estim ated quantiles will have been rem oved by using the theoretical approxim ation.
449
9.3 ■Control Methods Figure 9.3 Efficiency comparisons for estimating variances of eigenvalues. The left panels compare the usual and control variance estimates for the data of Example 3.24, for which n = 25, when R = 39. The right panel shows the gains made by the control estimate in 50 samples of sizes 15 and 25 from the normal distribution, as a function of the correlation between the statistic and its linear approximation; the solid line shows the line y = (1 —x4)-1. See text for details.
Third
Fourth
0.0 S
10
15
20
25
0.2
0.4
0.6
0.8
1.0
Correlation
O ne desirable property o f the control quantile estim ates is that, unlike m ost o th er variance reduction m ethods, their accuracy improves with increasing n as well as R. T here are various ways to calculate the quantiles o f the control variate. The preferred ap proach is to calculate the entire distribution o f the control variate by saddlepoint approxim ation (Section 9.5), and to read off the required qu an tiles tp. This is better th a n oth er m ethods, such as C o rn ish 'F ish e r expansion, because it guarantees th a t the quantiles o f the control variate will increase w ith p. Example 9.7 (Returns data) To assess the usefulness o f the control m ethod ju s t described, we consider setting studentized b o o tstrap confidence intervals for the rate o f retu rn in Exam ple 6.3. We use case resam pling to estim ate quantiles o f T* = (/?J —/?i ) / S \ where fli is the estim ate o f the regression slope, an d S 2 is the robust estim ated variance o f fii based on the linear approxim ation to Pi. F or a single b o o tstra p sim ulation we calculated three estim ates o f the qu an tiles o f T * : the usual estim ates, the order statistics < ■■■< t'R); the control estim ates taking the control variate to be the linear approxim ation to T* based on exact em pirical influence values; and the control estim ates obtained using the linear approxim ation w ith em pirical influence values estim ated by regression on the frequency array for the same bootstrap. In each case the quantiles o f the control variate were obtained by saddlepoint approxim ation, as outlined in Exam ple 9.13 below. We used R = 999 and repeated the experi m ent 50 tim es in o rder to estim ate the variance o f the quantile estim ates. We
9 *Improved Calculation
450
Figure 9.4 Efficiency and bias com parisons for estim ating quantiles o f a studentized
CM
o
bootstrap statistic for the returns data, based on a bootstrap of size R = 999. The left panel c
®
shows the variance of the usual quantile estimate divided by the variance o f the control estimate based on an exact linear approxim ation, plotted against the corresponding norm al quantile. The dashed lines show efficiencies of 1, 2, 3, 4 and 5. The right panel shows the estim ated biases for the exact control (solid) and estim ated control (dots)
O
CM
o
o -3
-2
-1
0
2
3
Normal quantile
-3
-2
-1
0
2
3
Normal quantile
estim ated their bias by com paring them w ith quantiles o f T * obtained from 100000 b o o tstrap resamples. Figure 9.4 shows the efficiency gains o f the exact control estim ates relative to the usual estim ates. T he efficiency gain based on the linear approxim ation is n o t shown, b u t it is very similar. T he right panel shows the biases o f the two control estim ates. The efficiency gains are largest for central quantiles, and are o f o rd er 1.5-3 for the quantiles o f m ost interest, at ab o u t 0.025-0.05 an d 0.95-0.975. T here is som e suggestion th a t the control estim ates based on the linear ap proxim ation have the sm aller bias, b u t b o th sets o f biases are negligible a t all b u t the m ost extrem e quantiles. The efficiency gains in this exam ple are broadly in line w ith sim ulations reported in the literatu re; see also Exam ple 9.10 below.
■
9.4 Importance Resampling 9.4.1 Basic estimators Importance sampling M ost o f o u r sim ulation calculations can be th o u g h t o f as approxim ate inte grations, w ith the aim o f approxim ating
for som e function m( ), where y ' is abbreviated n o ta tio n for a sim ulated d a ta set. In expression (9.1), for exam ple, m( y' ) = t(y*), and the distribution G for y* = (y^,..., y„*) puts m ass n~n on each elem ent o f the set f f = { y i,...,y „} ".
quantiles. See text for
details
451
9.4 ■Importance Resampling
W hen it is im possible to evaulate the integral directly, o u r usual approach is to generate R independent sam ples 7,”, ..., YR* from G, and to estim ate fi by R
pG = R ‘ 5 3 " H O r=1 This estim ator has m ean an d variance
an d so is unbiased for fi. In the situation m entioned above, this is a re expression o f o rdinary b o o tstrap sim ulation. We use n o ta tio n such as po and Eg to indicate th a t estim ates are calculated from ran d o m variables sim ulated from G, and th a t m om ent calculations are w ith respect to the distribution G. O ne problem w ith po is th a t some values o f y* m ay contribute m uch m ore to fi th an others. F or example, suppose th a t the aim is to approxim ate the probability P r’(T* < to \ F), for which we would take m(y*) = I{t(y") < to}, where I is the indicator function. If the event t(y*) < t0 is rare, then m ost o f the sim ulations will co ntribute zero to the integral. The aim o f importance sampling is to sam ple m ore frequently from those “im p o rta n t” values o f y * whose contrib u tio n s to the integral are greatest. This is achieved by sam pling from a distribution th a t concentrates probability on these y ' , and then w eighting the values o f m(y*) so as to m im ic the approxim ation we w ould have used if we h ad sam pled from G. Im portance sam pling in the case o f the nonparam etric b o o tstrap am o u n ts to re-w eighting sam ples from the em pirical distribution function F , so in this context it is som etim es know n as importance resampling. T he identity th a t m otivates im portance sam pling is n =
J m( y’ )dG(y*) = J
d H ( y ’ ),
(9.14)
where necessarily the su p p o rt o f H includes the support o f G. Im portance sam pling approxim ates the right-hand side o f (9.14) using independent sam ples y ;,..., yR *from H. T he new ap proxim ation for fi is the raw importance sampling estimate R
Ph ,raw = / r 1 5 > ( y r> ( y ; ) ,
(9.15)
r= l
where w(y’) = dG(y’ ) / d H ( y ' ) is know n as the importance sampling weight. The estim ate fin,raw has m ean fi by virtue o f (9.14), so is unbiased, and has variance
9 ■Improved Calculation
452 O u r aim is now to choose H so th at
J m ( y * ) 2 w ( y ' ) d G ( y ' ) < J m ( y *)2 dG(y*). C learly the best choice is the one for which m(y*)w(y*) = n, because then Ah,raw has zero variance, b u t this is n o t usable because /i is unknow n. In general it is hard to choose H, b u t som etim es the choice is straightforw ard, as we now outline. Tilted distributions A potentially im p o rtan t application is calculation o f tail probabilities such as n = Pr*(T* < to | F), an d the corresponding quantiles o f T*. F or probabilities w (y’ ) is taken to be the indicator function I {t(y') < £o}, and if y \, . . . , y n is a single ran d o m sam ple from the E D F F then dG(y') = n~". A ny adm issible nonparam etric choice for H is a m ultinom ial distribution w ith probability pj on yj, for j = 1 ,..., n. Then dH (f) = J J p f , j
where f j counts how m any com ponents o f Y * equal y ; . We w ould like to choose the probabilities pj to m inimize v ar# (/iH.raw), or at least to m ake this m uch sm aller th a n R_1rc(l — n). T his ap p ears to be im possible in general, b u t if T is close to norm al we can get a good approxim ate solution. Suppose th a t T * has a linear approxim ation T l which is accurate, and th at the N ( t , v ) approxim ation for T[ u nder ordinary resam pling is accurate. T hen the probability n we are trying to approxim ate is roughly $ {(t0 — f)/u 1/2}. If we were using sim ulation to approxim ate such a norm al probability directly, then provided th a t to < t a good (near-optim al) im portance sam pling m ethod would be to generate t*s from the N(to, vi) distribution, where vl is the n onparam etric delta m ethod variance. It tu rn s o u t th a t we can arrange th a t this happen approxim ately for T* by setting pj cc e x p ( M j ) ,
j= l,...,n ,
(9.18)
where the lj are the usual em pirical influence values for t. The result o f Prob lem 9.10 shows th a t u nder this distribution T * is approxim ately N ( t + XnvL, vi ), so the ap p ro p riate choice for X in (9.18) is approxim ately X = (to — t)/{nvL), again provided to < t\ in some cases it is possible to choose X to m ake T* have m ean exactly to- T he choice o f probabilities given by (9.18) is called an exponential tilting o f the original values n ~l . This idea is also used in Sections 4.4, 5.3, an d 10.2.2. Table 9.4 shows approxim ate values o f the efficiency R ~ 1 n ( l —n ) / \ a T , (p.H,raw) o f near-optim al im portance resam pling for various values o f the tail probability 7i. The values were calculated using no rm al approxim ations for the distributions
453
9.4 • Importance Resampling Table 9.4 Approximate efficiencies for estimating tail probability n under importance sampling with optimal tilted EDF when T is approximately normal.
n Efficiency
0.01 37
0.025 17
0.05 9.5
0.2 3.0
0.5 1.0
0.8 0.12
0.95 0.003
0.975 0.0005
0.99 0.00004
o f T* und er G and H ; see Problem 9.8. The entries in the table suggest th at for n < 0.05 we could a tta in the same accuracy as w ith ordinary resam pling w ith R reduced by a factor larger th an ab o u t 10. A lso shown in the table is the result o f applying the exponential tilted im portance resam pling distribution w hen t > to, or n > 0.5: then im portance resam pling will be worse — possibly much worse — th an o rdinary resampling. This last observation is a w arning: straightforw ard im portance sam pling can be bad if m isapplied. We can see how from (9.17). If d H ( y ' ) becom es very small where m( y ’) an d dG(y') are n o t small, then w{y') = d G(y’ ) / d H ( y ' ) will becom e very large and inflate the variance. For the tail probability calculation, if to > t then all sam ples y ' w ith t(y*) < to contribute R ~ lw(y'r ) to pH,raw, and som e o f these contributions are enorm ous: although rare, they w reak havoc On flH,rawA little th o u g h t shows th a t for to > t one should apply im portance sam pling to estim ate 1 — n = Pr*(T* > to) and subtract the result from 1, ra th er th an estim ate n directly. Quantiles To see how quantiles are estim ated, suppose th a t we w ant to estim ate the a quantile o f the distribution o f 7” , and T* is approxim ately N(t, vL) under G = F. T hen we take a tilted distribution for H such th a t T* is approxim ately N ( t + zxV l 2 ,vl). For the situation we have been discussing, the exponential tilted distribution (9.18) will be near-optim al with k = zi / ( n v i/ 2), and in large sam ples this will be superior to G = F for any ct =/= i. So suppose th a t we have used im portance resam pling from this tilted distribution to obtain values fj < ■■■ < tf; w ith corresponding weights vvj,. . . , w ’R. T hen for a < | the raw quantile estim ate is t"M, where - m — V R + 1^
r= l
. M+l wr* < a < - — - V wr\ r R+l ^
(9.19) r
r= 1
while for a > j we define M by R - i - y w ; < l - a < - — r=M
y
R w*;
r= M + 1
see Problem 9.9. W hen there is no im portance sam pling we have w* = 1, and the estim ate equals the usual (”(R+1)a). T he variation in w (y') and its im plications are illustrated in the following
454
9 • Improved Calculation
example. We discuss stabilizing m odifications to raw im portance resam pling in the next subsection. Exam ple 9.8 (Gravity d a ta ) F or an exam ple o f im portance resam pling, we follow Exam ple 4.19 an d consider testing for a difference in m eans for the last two series o f Table 3.1. H ere we use the studentized pivot test, w ith observed test statistic Z° = , , y 2 ~ 7yi ,1 /2 ' (s\/n2 + s\/ni)
(9'2°)
where y t an d sj are the average an d variance o f the sam ple y n , . . . , y i „ n for i = 1,2. T he test com pares zo to the general distribution o f the studentized pivot
z =
?2-?l-(/^2-W ). 1/2 ’ (S f /n 2 + S f / n i )
zo is the value taken by Z u n d er the null hypothesis m = n 2. T he observed value o f zo is 1.84, w ith norm al one-sided significance probability P r(Z > zo) = 0.033. We aim to estim ate P r(Z > zo) by P r*(Z ” > zo | F), where F stands for the E D F s o f the two samples. In this case y* = ( y u , - - - , y i ni, y 2 i>--->y2n2)< an(^ ® is the jo in t density u n d er the two E D F s, so the probability on each sim ulated d ataset is dG{y*) = n p 1 x n^""2. Because zo > 0 an d the P-value is clearly below is ap p rop riate an d the estim ated P-value is
pH,raw = R 1 y ^ J { z'r > ^0}wr*,
raw im portance sam pling
W‘ = ^ ) r dHW Y
The choice o f H is m ade by analogy w ith the single-sam ple case discussed e ar lier. The tw o E D F s are tilted so as to m ake Z* approxim ately N ( zq, v l ), which should be near-optim al. This is done by w orking w ith the linear approxim ation nl Z'L =
f ’j l 'j +
Z + Mi 1
n 2 1 Y l f 2 J lV>
7=1
;=1
where / a nd f'2j are the b o o tstrap sam ple frequencies o f y \j and y 2j, and the em pirical influence values are l
_
yij - h { s \ / n 2 + s f / n i ) 1/2
t
_ 1
yij - yi ( s l / n 2 + s 2l / n i ) U2
We take H to be the p air o f exponential tilted distributions Pi] = P r( Y { = yij) cc exp(/.hJ/ n l ),
p2j = P r(7 2‘ = y 2J) cc exp(A/2y/ n 2), (9.21)
455
9.4 ■Importance Resampling
O O o o X 8 1
Figure 9.5 Importance resampling to test for a location difference between series 7 and 8 of the gravity data. The solid points in the left panel are the weights w* and bootstrap statistics z‘ for R = 99 importance resamples; the hollow points are the pairs (z*,w‘) for 99 ordinary resamples. The right panel compares the survivor function Pr*(Z* > 2*) estimated from 50000 ordinary bootstrap resamples (heavy solid) with estimates of it based on the 99 ordinary bootstrap samples (dashes) and the 99 importance resamples (solid). The vertical dotted lines show z q .
5
o O
LL Q O
W
B
°
2 o r i
• • •
i \ I: 1 i;
o
o -2
■4
0
-2
V L y
0 z*
z*
where X is chosen so th a t Z ’L has m ean z0 : this should m ake Z* approxim ately N( zo ,v i) u n d er H. The explicit equation for X is 1 hj exp(A/u /n i) E "L ie x p (/U ij/n i)
E ”l i h j exp(Xl2}/ n 2) _ +
£ " l i exp(Xl 2J/ n 2)
Z°’
w ith approxim ate solution X = zo since vL = 1. F or our d a ta the exact solution is X = 1.42. Figure 9.5 shows results for R = 99 sim ulations. The solid points in the left panel are the weights
Wr =
= eXP | - E f l j lQg ("1Plj) - E f a l0® fa P v )
p lo tted against the b o o tstra p values z* for the im portance resamples. These values o f z* are shifted to the right relative to the hollow points, which show the values o f z ’ an d w* (all equal to 1) for 99 ordinary resamples. The values o f w* for the im portance re-w eighting vary over several orders o f m agnitude, w ith the largest values w hen z*
H ow well does this single im portance resam pling distribution w ork for estim ating all values o f the survivor function Pr*(Z * > z)? T he heavy solid line in the right panel shows the “tru e” survivor function o f Z* estim ated from 50 000 o rdinary b o o tstra p sim ulations. T he lighter solid line is the im portance
456
9 ■Improved Calculation
resam pling estim ate K- 1 £
wrf{*r* ^ Z)
r= 1
with R = 99, an d the d o tted line is the estim ate based on 99 ordinary boo tstrap sam ples from the null distribution. T he im portance resam pling estim ate follows the “tru e” survivor function accurately close to zq b u t does poorly for negative z*. The usual estim ate does best n ear z* = 0 b u t poorly in the tail region o f interest; the estim ated significance probability is f a = 0. W hile the usual estim ate decreases by R ~ { at each z*, the weighted estim ate decreases by m uch sm aller ju m p s close to z<>; the raw im portance sam pling tail probability estim ate is p.H,raw = 0.015, which is very close to the true value. T he weighted survivor function estim ate has large ju m p s in its left tail, where the estim ate is unreliable. In 50 repetitions o f this experim ent the o rdinary and raw im portance re sam pling tail probability estim ates h ad variances 2.09 x 10-4 and 2.63 x 10-5 . F or a tail probability o f 0.015 this efficiency gain o f ab o u t 8 is sm aller th an would be predicted from Table 9.4, the reason being th a t the distribution o f z* is rath er skewed an d the norm al approxim ation to it is poor. ■ In general there are several ways to obtain tilted distributions. We can use exponential tilting w ith exact em pirical influence values, if these are readily available. O r we can estim ate the influence values by regression using jRo initial ordinary b o o tstra p resam ples, as decribed in Section 2.7.4. A n other way o f using an initial set o f b o o tstrap sam ples is to derive weighted sm ooth distributions as in (3.39): illustrations o f this are given later in Exam ples 9.9 and 9.11.
9.4.2 Improved estimators Ratio and regression estimators One simple m odification o f the raw im portance sam pling estim ate is based on the fact th a t the average w eight R -1 w ( Y ' ) from any particular sim ulation will n o t equal its theoretical value o f E*{w(Y*)} = 1. This suggests th a t the weights w(Yr”) be norm alized, so th a t (9.15) is replaced by the importance resampling ratio estimate tl
_ E f-i h y ; m y ;) Z L
m y
;)
To some extent this controls the effect o f very large fluctuations in the weights. In practice it is b etter to treat the weight as a control variate o r covariate. Since ou r aim in choosing H is to concentrate sam pling where m( ) is largest, the values o f m(Yr’ )w(Yr*) and w(Yr*) should be correlated. If so, and if
457
9.4 ■Importance Resampling
the average weight differs from its expected value o f one un d er sim ulation from H, then the estim ate pH,raw probably differs from its expected value fi. This m otivates the covariance adjustm ent m ade in the importance resampling regression estimate Ph ,reg = Ah,raw ~ b(w - 1),
(9.23)
w here vv* = R ~ x w(Yr*), an d b is the slope o f the linear regression o f the m ( Y ' ) w ( Y * ) on the w (Y r*). The estim ator pH,reg is the predicted value for m { Y ' ) w { Y “) at the poin t w(Y*) = 1. T he adjustm ents m ade to pH,raw in b o th pH,rat and pH,reg m ay induce bias, b u t such biases will be o f o rd er R ~ l and will usually be negligible relative to sim ulation stan d ard errors. C alculations outlined in Problem 9.12 indicate th a t for large R the regression estim ator should outperform the raw and ratio estim ators, b u t the im provem ent depends on the problem , and in practice the raw estim ator o f a tail probability o r quantile is usually the best. Defensive mixtures A second im provem ent aim s to prevent the weight w( y' ) from varying wildly. Suppose th a t H is a m ixture o f distributions, n H\ + (1 —n ) H 2 , where 0 < n < 1. T he distributions Hi and H 2 are chosen so th at the corresponding probabilities are n o t b o th sm all sim ultaneously. T hen the weights d G ( / ) / { j i d H , ( / ) + (1 - 7z)dH 2 (y')} will vary less, because even if d H i ( y m) is very small, d H 2 (y*) will keep the den o m in ato r aw ay from zero and vice versa. This choice o f H is know n as a defensive mixture distribution, and it should do particularly well if m any estim ates, w ith different m( y’ ), are to be calculated. T he m ixture is applied by stratified sam pling, th a t is by generating exactly n R observations from Hi and the rest from H 2, and using pH,reg as usual. T he com ponents o f the m ixture H should be chosen to ensure th a t the relevant range o f values o f t* is well covered, b u t beyond this the detailed choice is n o t critical. F o r exam ple, if we are interested in quantiles o f T* for probabilities betw een a an d 1 — a, then it would be sensible to target Hi at the a quantile and H 2 a t the 1 — a quantile, m ost simply by the exponential tilting m ethod described earlier. As a further precaution we m ight add a th ird com ponent to the m ixture, such as G, to ensure stable perform ance in the m iddle o f the distribution. In general the m ixture could have m any com ponents, b u t careful choice o f two or three will usually be adequate. A lways the application o f the m ixture should be by stratified sam pling, to reduce variation. Exam ple 9.9 (G ravity d a ta ) To illustrate the above ideas, we again consider the hypothesis testing problem o f Exam ple 9.8. T he left panel o f Figure 9.6
458
9 • Improved Calculation
shows 20 replicate estim ates o f the null survivor function o f z*, using ordinary b o o tstrap resam pling w ith R = 299. The right panel shows 20 estim ates o f the survivor function using the regression estim ate fiH,reg after sim ulations w ith a defensive m ixture distribution. This m ixture has three com ponents which are G (the tw o E D F s), an d tw o pairs o f exponential tilted distributions targeted at the 0.025 an d 0.975 quantiles o f Z*. From o u r earlier discussion these distributions are given by (9.21) w ith X = ± 2 / v L \ we shall denote the first pair o f distributions by probabilities p i j an d p 2j , and the second by probabilities q i j and q 2j . The first com ponent G was used for R i = 99 samples, the second com ponent (the ps) for R 2 = 100 an d the th ird com ponent (the qs) for R j = 100: the m ixture prop o rtio n s were therefore nj = R j / ( R \ + R 2 + R 3 ) for j = 1,2,3. T he im portance resam pling weights were
where as before f \ j and f y respectively count how m any tim es y ij and y 2j a p p e ar in the resample. F or convenience we estim ated the C D F o f Z* at the sam ple values z*. The regression estim ate at z* is obtained by setting m( y’ ) = I { z ( y *) < z ( y ’ )} and calculating (9.23); this appears to involve 299 regressions for each C D F estim ate, b u t Problem 9.13 shows how in fact ju st one m atrix calculation is needed. T he im portance resam pling estim ate o f the C D F is ab o u t as variable as the ordin ary estim ate over m ost o f the distribution, b u t m uch less variable well into the tails. For a m ore system atic com parison, we calculated the ratio o f the m ean
Figure 9.6 Importance resampling to test for a location difference between series 7 and 8 of the gravity data. In each panel the heavy solid line is the survivor function Pr’(Z ‘ > z‘) estimated from 50000 ordinary bootstrap resamples and the vertical dotted lines show z q . The left panel shows the estimates for 20 ordinary bootstraps of size 299. The right panel shows 20 importance resampling estimates using 299 samples with a regression estimate following resampling from a defensive mixture distribution with three components. See text for details.
459
9.4 ■Importance Resampling Table 9.5 Efficiency gains (ratios of mean squared errors) for estimating a tail probability, a bias, a variance and two quantiles for the gravity data, using importance resampling estimators together with defensive mixture distributions, compared to ordinary resampling. The mixtures have Ri ordinary bootstrap samples mixed with R 2 samples exponentially tilted to the 0.025 quantile of z*, and with R 3 samples exponentially tilted to the 0.975 quantile of r*. See text for details.
M ixture Ri
r2
E stim ate Ri 299
99
100
100
19
140
140
R aw R a tio R egression R aw R a tio R egression R aw R a tio R egression
E stim an d Pr* (Z* > z0)
E ’ ( Z ')
var*(Z *)
Z0.05
z0.025
11.2 3.5 12.4 3.8 3.4 4.0 3.9 2.3 4.3
0.04 0.06 0.18 0.73 0.79 0.93 0.34 0.43 0.69
0.03 0.05 0.07 1.5 1.5 1.6 1.2 0.82 1.3
0.07 0.06 0.06 1.3 0.93 0.87 0.96 0.48 0.44
0.05 0.04 2.5 1.3 1.2 2.6 1.1 1.3
_
squared erro r from ordinary resam pling to th at w hen using defensive m ixture d istributions to estim ate the tail probability Pr*(Z* > z q ) with zo = 1.77, two quantiles, an d the bias E *(Z ’ ) and the variance var’ (Z*) for sam pling from the two series. T he m ixture distributions have the same three com ponents as before, b u t w ith different values for the num bers o f sam ples R \ , R 2 and Rt, from each. Table 9.5 gives the results for three resam pling m ixtures with a to tal o f R = 299 resam ples in each case. The m ean squared errors were estim ated from 100 replicate b ootstraps, w ith “tru e ” values obtained from a single b o o tstra p o f size 50000. The m ain contribution to the m ean squared erro r is from variance ra th e r th an bias. The first resam pling distribution is n o t a m ixture, b u t simply the exponential tilt to the 0.975 quantile. This gives the best estim ates o f the tail probability, w ith efficiencies for raw an d regression estim ates in line with Exam ple 9.8, b u t it gives very p o o r estim ates o f the other quantities. F or the other two m ixtures the regression estim ates are best for estim ating the m ean and variance, while the raw estim ates are best for the quantiles and n o t really worse for the tail probability. B oth m ixtures are ab o u t the same for tail quantiles, while the first m ixture is b etter for the m om ents. In this case the efficiency gains for tail probabilities and quantiles predicted by Table 9.4 are unrealistic, for two reasons. First, the table com pares 299 o rdinary sim ulations w ith ju st 100 tilted to each tail o f the first m ixture distribution, so we w ould expect the variance for a tail quantity based on the m ixture to be larger by a factor o f ab o u t three; this is ju st w hat we see when the first distrib u tio n is com pared to the second. Secondly, the distribution o f Z* is quite skewed, which considerably reduces the efficiency out as fa r as the 0.95 quantile. We conclude th a t the regression estim ate is best for estim ating central
9 ■Improved Calculation
460
quantities, th a t the raw estim ate is best for quantiles, th a t results for estim ating quantiles are insensitive to the precise m ixture used, and th a t theoretical gains m ay not be realized in practice unless a single tail quantity is to be estim ated. This is in line w ith o th er studies.
9.4.3 Balanced importance resampling Im portance resam pling w orks best for the extrem e quantiles corresponding to small tail probabilities, b u t is less effective in the centre o f a distribution. Balanced resam pling, on the o th er hand, w orks best in the centre o f a distri bution. Balanced im portance resam pling aims to get the best o f b o th worlds by com bining the two, as follows. Suppose th a t we wish to generate R balanced resam ples in which y j has overall probability p, o f occurring. To do this exactly in general is im possible for finite n R , b u t we can do so approxim ately by applying the following simple algorithm ; a m ore efficient algorithm is described in Problem 9.14.
Algorithm 9.2 (Balanced importance resampling) C hoose Ri = n R p i , . . ., C oncatenate to form .
R\
= nRpn, such th a t Ri H----- + R n = nR.
copies o f y\ w ith
R 2
copies o f y 2 w ith ... with
R n
copies o f y n,
Perm ute the n R elem ents o f W at ran d o m to form and read off the R balanced im portance resam ples as sets o f n successive elem ents o f . • A simple way to choose the Rj is to set Rj = 1 + [n(R — l)p ; ], j = 1 wher e [•] denotes integer part, and to set Rj = Rj + \ for the d = n R — (R[ - (- ■+ R'„) values o f j w ith the largest values o f nRpj — R j ; we set R j = Rj for the rest. This ensures th a t all the observations are represented in the b o o tstrap sim ulation. Provided th a t R is large relative to n , individual sam ples will be approx im ately independent an d hence the w eight associated w ith a sam ple having frequencies ( / j , . . . , / ^ ) is approxim ately
this does n o t take account o f the fact th a t sam pling is w ithout replacem ent. Figure 9.7 shows the theoretical large-sam ple efficiencies o f balanced re sampling, im portance resam pling, an d balanced im portance resam pling for estim ating the quantiles o f a norm al statistic. O rdinary balance gives m ax im um efficiency o f 2.76 a t the centre o f the distribution, while im portance
461
9.4 ■Importance Resampling
Figure 9.7 Asymptotic efficiencies of balanced importance resampling (solid), importance resampling (large dashes), and balanced resampling (small dashes) for estimating the quantiles of a normal statistic. The dotted horizontal line is at relative efficiency one.
-
2
-
1
0
1
2
Normal quantile
resam pling w orks well in the lower tail b u t badly in the centre and u p per tail o f the distribution. Balanced im portance resam pling dom inates both. Exam ple 9.10 (Returns d a ta ) In order to assess how well these ideas m ight w ork in practice, we again consider setting studentized b o o tstrap confidence intervals for the slope in the returns example. We perform ed an experim ent like th a t o f Exam ple 9.7, b u t w ith the R = 999 b o o tstrap sam ples generated by balanced resam pling, im portance resam pling, and balanced im portance resampling. Table 9.6 shows the m ean squared error for the ordinary b o o tstrap divided by the m ean squared errors o f the quantile estim ates for these m ethods, using 50 replicate sim ulations from each scheme. This slightly different “efficiency” takes into account any bias from using the im proved m ethods o f sim ulation, though in fact the co n trib u tio n to m ean squared error from bias is small. The “tru e ” quantiles are estim ated from an ordinary b o o tstrap o f size 100000. The first tw o lines o f the table show the efficiency gains due to using the control m ethod w hen the linear approxim ation is used as a control variate, w ith em pirical influence values calculated exactly and estim ated by regression from the sam e b o o tstrap sim ulation. The results differ little. T he next two rows show the gains due to balanced sampling, both w ithout and w ith the control
462
M eth o d
9 • Improved Calculation
D istrib u tio n
Q u an tile (% ) 1
2.5
5
10
50
90
95
97.5
99
C o n tro l (exact) C o n tro l (approx)
1.7 1.4
2.7 2.8
2.8 3.2
4.0 4.1
11.2 11.8
5.5 5.1
2.4 2.2
2.6 2.6
1.4 1.3
B alance w ith co n tro l
1.0 1.4
1.2 1.8
1.5 3.0
1.4 2.8
3.1 4.4
2.9 4.7
1.7 2.5
1.4 2.2
0.6 1.5
7.8 4.6 3.6 4.3 2.6
3.7 2.9 3.7 2.6 2.1
3.6 3.5 2.0 2.5 0.7
1.8 1.1 1.7 1.8 0.3
0.4 0.1 0.5 0.9 0.4
3.5 2.6 2.4 1.6 0.5
2.3 3.1 2.2 1.6 0.6
3.1 4.3 2.6 2.2 1.6
5.5 5.2 3.6 2.3 2.1
5.0 4.2 5.2 4.3 3.2
5.7 3.4 4.2 3.3 2.8
4.1 2.4 3.8 3.4 1.0
1.9 1.8 1.8 2.2 0.4
0.5 0.2 0.9 2.1 0.9
2.6 2.0 3.0 2.7 0.9
2.2 3.6 2.4 3.7 1.4
6.3 4.2 4.0 3.3 2.1
4.5 3.9 4.0 4.3 2.1
Im p o rtan ce
Hi Hi Hi H* Hs
B alanced im p o rtan ce
Hi Hi h3 h4 h 5
m ethod, which gives a w orthw hile im provem ent in perform ance, except in the tail. The next five lines show the gains due to different versions o f im portance resam pling, in each case using a defensive m ixture distribution and the raw quantile estim ate. In practice it is unusual to perform a b o o tstrap sim ulation w ith the aim o f setting a single confidence interval, and the choice o f im portance sam pling distrib u tio n H m ust balance various potentially conflicting requirem ents. O u r choices were designed to reflect this. We first suppose th at the em pirical influence values lj for t are know n an d can be used for exponen tial tilting o f the linear approxim ation t'L to t ‘. T he first defensive m ixture, H\, uses 499 sim ulations from a distribution tilted to the a quantile o f t*L and 500 sim ulations from a distribution tilted to the 1 — a quantile o f fL, for a = 0.05. The second m ixture is like this b u t w ith a = 0.025. The third, fo u rth an d fifth distributions are the sort th a t m ight be used in practice w ith a com plicated statistic. We first perform ed an ordinary b o otstrap o f size Ro, which we used to estim ate first the em pirical influence values lj by regression an d then the tilt values rj for the 0.05 and 0.95 quantiles. We then perform ed a fu rth er b o o tstrap o f size (R — Ro)/2 using each set o f tilted probabilities, giving a to tal o f R sim ulations from three different distributions, one centred an d tw o tilted in opposite directions. We took Ro = 199 and Ro = 499, giving Hj an d i / 4. F or H$ we took Ro = 499, b u t estim ated the tilted distributions by frequency sm oothing (Section 3.9.2) w ith bandw idth
Table 9.6 Efficiencies for estimation of quantiles of studentized slope for returns data, relative to ordinary bootstrap resampling.
463
9.4 ■Importance Resampling
e = 0.5t>1/2 at the 0.05 an d 0.95 quantiles o f t*, where v x/1 is the standard error o f t estim ated from the ordinary bootstrap. Balance generally im proves im portance resam pling, which is n o t sensitive to the m ixture distrib u tio n used. The effect o f estim ating the em pirical influence values is n o t m arked, while frequency sm oothing does n o t perform so well as exponential tilting. Im portance resam pling estim ates o f the central quantiles are poor, even w hen the sim ulation is balanced. Overall, any o f schemes H \H 4 leads to appreciably m ore accurate estim ates o f the quantiles usually o f interest. ■
9.4.4 Bootstrap recycling In Section 3.9 we introduced the idea o f b o o tstrapping the b o otstrap, b o th for m aking bias adjustm ents to b o o tstrap calculations and for studying the v aria tion o f properties o f statistics. F u rth er applications o f the idea were described in C hapters 4 an d 5. In b o th param etric and nonparam etric applications we need to sim ulate sam ples from a series o f distributions, themselves obtained from sim ulations in the nonparam etric case. Recycling m ethods replace m any sets o f sim ulated sam ples by one set o f sam ples and m any sets o f weights, and have the p otential to reduce the com putational effort greatly. This is particularly valuable when the statistic o f interest is expensive to calculate, for exam ple when it involves a difficult optim ization, or w hen each b o o tstrap sam ple is costly to generate, as when using M arkov chain M onte C arlo m ethods (Section 4.2.2). T he basic idea is repeated use o f the im portance sam pling identity (9.14), as follows. Suppose th a t we are trying to calculate = E{m(Y)} for a series o f d istributions G i , . . . , G k ■The naive M onte C arlo approach is to calculate each value Hk = E { m ( Y ) | Gk} independently, sim ulating R sam ples y u - - - , y R from G/c and calculating pk = R ~ l m(yr). But for any distribution H whose su p p o rt includes th a t o f G* we have
E{m(Y) | Gk} =
J m(y)dGk{y) = J
=
E jm(Y)
dGk( Y ) dH(Y)
We can therefore estim ate all K values using one set o f sam ples y \ , . . . , y N sim ulated from H, w ith estim ates N
P k = N 1^ m ( y , )
(9.24)
In some contexts we m ay choose N to be m uch larger th a n the value R we m ight use for a single sim ulation, b u t less th an K R . It is im p o rtan t to choose H carefully, an d to take account o f the fact th a t the estim ates are correlated.
464
9 • Improved Calculation
Both N and the choice o f H depend u p o n the use being m ade o f the estim ates and the form o f m(-). Exam ple 9.11 (City population d a ta ) C onsider again estim ating the bias and variance functions for ratio 8 = t(F ) o f the city population d a ta with n = 10. In Exam ple 3.22 we estim ated b(F) = E (T | F) — t(F) and v(F) = v ar( T | F) for a range o f values o f 0 = t{F) using a first-level b o o tstrap to calculate values o f t* for 999 b o o tstrap sam ples F*, and then doing a secondA A level b o o tstrap to estim ate b(F') an d v( F’) for each o f those samples. H ere the second level o f resam pling is avoided by using im portance re-weighting. A t the sam e time, we retain the sm oothing introduced in Exam ple 3.22. R a th er th a n take each Gk to be one o f the b o o tstrap E D F s F*, we obtain a sm ooth curve by using sm ooth distributions F'f) w ith probabilities pj( 6 ) as defined by (3.39). Recall th a t the p aram eter value o f F e’ is t(F'g) = 0*, say, which will differ slightly from 6 . F o r H we take F , the E D F o f the original data, on the grounds th a t it has the correct su p p o rt and covers the range o f values for y ’ w ell: it is n o t necessarily a good choice. T hen we have weights dGk( f r ) = dFg(y') = A ( PjW V " = dH(y'r ) dH(y'r ) y i n - 1/
.
say, where as usual /*• is the frequency with which y} occurs in the rth bo o tstrap sample. We should em phasize th a t the sam ples y * draw n from H here replace second-level b o o tstrap samples. C onsider the bias estim ate. T he weighted sum R~' ^ ( f ’ — 6")w'(0} is an unbiased estim ate o f the bias E” (T *‘ | F'e ) — 6 *, an d we can plot this estim ate to see how the bias varies as a function o f O' or 6 . However, the weighted sum can behave badly if a few o f the w ' ( 0 ) are very large, and it is b etter to use the ratio an d regression estim ates (9.22) and (9.23). The top left panel o f Figure 9.8 shows raw, ratio, an d regression estim ates o f the bias, based on a single set o f R = 999 sim ulations, w ith the curve obtained from the double b o o tstrap calculation used in Figure 3.7. F o r example, the ratio estim ate o f bias for a p articu lar value o f d is ]T r(r' — 0 ’)w‘(0 ) / 2 2 r w '(0), and this is plotted as a function o f 0*. T he raw an d ratio estim ates are rath er poor, but the regression estim ate agrees fairly well w ith the double boo tstrap curve. The panel also shows the estim ated bias from a defensive m ixture w ith 499 ordinary sam ples m ixed w ith 250 sam ples tilted to each o f the 0.025 and 0.975 quantiles; this is the best estim ate o f those we consider. The panels below show 20 replicates o f these estim ated biases. These confirm the im pression from the panel a b o v e: w ith o rdinary resam pling the regression estim ator is best, but it is b etter to use the m ixture distribution. The to p right panel shows the corresponding estim ates for the standard
465
9.4 ■Importance Resampling
ID o
£ o
^
o
■o
iS c
1) 1
® * CO O
co d
& " ■o 0 go
_1
CD
o
aj CM
O o
^
oo
d
LU / /
1.2 1.3 1.4 1.5 1.6 1.7 1.6
0.5
/
0.4
/
/.
/A
0.2
J' B &
02
l;k n'if vs Mrr--
0.3
0.10 0.08 0.04
0.06
;'Y / * ) / / M
0.5
1.2 1.3 1.4 1.5 1.6 1.7 1.8
1.2 1.3 1.4 1.5 1.6 1.7 1.8
5
6
9
° 1.2 1.3 1.4 1.5 1.6 1.7 1.8
1.2 1.3 1.4 1.5 1.6 1.7 1.6
/
0.4
^
____
0 0
o
o
|
•
>
0.3
0)
00 o o
200
Figure 9.8 Estimated bias and standard error functions for the city population data ratio. In the top panels the heavy solid lines are the double bootstrap values shown in Figure 3.7, and the others are the raw estimate (large dashes), the ratio estimate (small dashes), the regression estimate (dots), and the regression estimate based on a defensive mixture distribution (light solid). The lower panels show 20 replicates of raw, ratio, and regression estimates from ordinary sampling, and the regression estimate from a defensive mixture (clockwise from upper left) for the panels above.
1.2 1.3 1.4 1.5 1.6 1.7 1.8
error o f t. For each o f a range o f values o f 6 , we calculate this by estim ating the bias and m ean squared error h f ;)
= e**(t** - 0* i f ;),
e**{(r* - 0*)2 1f ; }
by each o f the raw, ratio, and regression m ethods, and plotting the resulting estim ate
v^ 2(F'e) = [E**{(r* - e ')2 1f ; } - {e**(t** - e ' \ f 0*)}2]1/2. T he conclusions are the sam e as for the bias estimate. A s we saw in Exam ple 3.22, here T — 6 is n o t a stable quantity because its m ean an d variance depend heavily on 6 . ■
466
9 • Improved Calculation
The results for the raw estim ate suggest th a t recycling can give very variable results, an d it m ust be used w ith care, as the next exam ple vividly illustrates. Exam ple 9.12 (Bias adjustm ent) C onsider the problem o f adjusting the b o o tstrap estim ate o f bias o f T , discussed in Section 3.9. The adjustm ent C in equation (3.30) is (R M )_1 5Zf=1 ]Cm=i(f*m ~ K) ~ which uses M sam ples from each o f the R m odels F* fitted to sam ples from F. The recycling m ethod replaces each average M -1 2 2 m=i([rm — t ”) by a w eighted average o f the form (9.24), so th a t C is estim ated by (9.25) where t’’ is the value o f T for the /th sam ple y ’{ , y ’’ draw n from the distribution H. If we applied recycling only to the first term o f C, which estim ates E ” (T**), then a different — an d as it tu rn s out inferior — estim ate would be obtained for C. The sup p o rt o f H m ust include all R first-level b o o tstrap samples, so as in the previous exam ple a n atu ral choice is H = F , the m odel fitted to (or the E D F of) the original sample. However, this can give highly unstable results, as one m ight predict from the leftm ost panel in the second row o f Figure 9.8. This can be illustrated by considering the case o f the p aram etric m odel Y ~ N(0, 1), with estim ate T = Y . H ere the term s being sum m ed in (9.25) have infinite variance; see Problem 9.15. T he difficulty arises from the choice H = F , and can be avoided by taking H to be a m ixture as described in Section 9.4.2, with at least three com ponents. ■ Instability due to the choice H = F does not occur with all applications o f recycling. Indeed applications to b o o tstra p likelihood (C hapter 10) w ork well with this choice.
9.5 Saddlepoint Approximation 9.5.1 Basic approxim ations Basic ideas Let W\, ..., W„ be a ran d o m sam ple from a continuous distribution F with cum ulant-generating function
Suppose th a t we are interested in the linear co m bination U = Y 2 aj W j , and th at this has exact P D F an d C D F g(w) an d G(u). U nder suitable conditions, the cumulant-generating function o f U, which is is the
467
9.5 • Saddlepoint Approximation
basis o f highly accurate approxim ations to the P D F and C D F o f U, know n as saddlepoint approximations. The saddlepoint approxim ation to the density o f U a t u is gs(«) = { 2 n K " ( I ) } ~ m exp [ k ( | ) - {«} ,
(9.26)
where the saddlepoint
(9.27)
an d is therefore a function o f u. H ere K ' and K " are respectively the first and second derivatives o f K with respect to £. A simple approxim ation to the C D F o f U, P r(l/ < u), is Gf(u) = (w + - l o g ( - ) l , [ w \ w/ J
(9.28)
where $(■) denotes the stan d ard norm al CD F, and w = slgn ( 3 ) [ 2 { a u - K ( 3 ) } ] 1/2,
v = l { K " ( l ) } U\
are b o th functions o f u. A n alternative to (9.28) is the Lugannani-Rice formula cD(w) + 0 ( w ) f - - - ) , \W V J
(9.29)
b u t in practice the difference between them is small. W hen \ = 0 the C D F approxim ation is m ore com plicated and we do n o t give it here. The approx im ations are constructed by num erical solution o f the saddlepoint equation to obtain the value o f I for each value o f u o f interest, from which the approxim ate P D F and C D F are readily calculated. Form ulae such as (9.26) and (9.28) can provide rem arkably accurate approx im ations in m any statistical problems. In fact, g(u) = gs(u) {1 + 0 ( n -1 )} ,
G(u) = Gs(u) {1 + 0 ( n - 3/2)} ,
for values o f u such th a t |w| < c for some positive c; the erro r in the C D F ap proxim ation rises to 0 ( n ~ l ) w hen u is such th a t |w| < cn1^2. A key feature is th a t the error is relative, so th a t the ratio o f the true density o f U to its saddlepoint approxim ation is bounded over the likely range o f u. A consequence is th a t unlike other analytic approxim ations to densities and tail probabilities, (9.26), (9.28) an d (9.29) are very accurate far into the tails o f the density o f U. If there is d o u b t ab o u t the accuracy o f (9.28) and (9.29), Gs may be calculated by num erical in tegration o f gs. The m ore com plex form ulae th a t are used for conditional and m arginal density an d distrib u tio n functions are given in Sections 9.5.2 and 9.5.3.
9 • Improved Calculation
468
Table 9.7 C om parison n
10 15
a (% )
1
2.5
5
95
97.5
99
99.5
99.9
10.9
12.8 12.5
15.4 15.2
18.1 17.8
78.5 78.1
85.1 85.9
96.0 95.3
102.1
10.8
101.9
116.4 115.8
13.6 13.5
14.5 14.4
16.0 15.9
17.4 17.4
37.4 37.4
39.7 39.7
42.3 42.4
44.4 44.3
48.2 48.2
0.1
0.5
Sim ’n S’p o in t
7.8 7.6
Sim ’n S’p o in t
11.8 11.7
Application to resampling In the context o f resam pling, suppose th a t we are interested in the distri b ution o f the average o f a sam ple from y \ , . . . , y n, where is sam pled with probability pj, j = 1, . . . , n. O ften, b u t n o t always, Pj = n-1 . We can w rite the average as U' = n~l J2 f)yj> where as usual ( / j , . . . , ) has a jo in t m ultinom ial distribution w ith den o m in ato r n. T hen U ' has cum ulant-generating function
K( £ ) = nl og j ] T f ? ; exp(£o,)
j
,
(9.30)
where a; = y j / n . The function (9.30) can be used in (9.26) and (9.28) to give n on-random approxim ations to the P D F and C D F o f U ‘. Unlike m ost o f the m ethods described in this book, the erro r in saddlepoint approxim ations arises not from sim ulation variability, b u t from determ inistic num erical error in using gs and Gs rath er th an the exact density and distribution function. In principle, o f course, a n o n p aram etric b o o tstrap statistic is discrete and so the density does not exist, b u t as we saw in Section 2.3.2, U * typically has so m any possible values th a t we can thin k o f it as continuous aw ay from the extrem e tails o f its distribution. C ontinuity corrections can som etim es be applied, b u t they m ake little difference in b o o tstrap applications. W hen it is necessary to approxim ate the entire distribution o f U', we calculate the values o f Gs(u) for m values o f u equally spaced betw een min aj and m ax aj an d use a spline sm oother to in terpolate betw een the corresponding values o f C>_ 1{Gs(m)}. Q uantiles an d cum ulative probabilities for U ' can be read off the fitted curve. Experience suggests th a t m = 50 is usually ample. Exam ple 9.13 (L inear approxim ation) A simple application o f these ideas is to the linear approxim ation t'L for a b o o tstrap statistic t \ as was used in Exam ple 9.7. We write T [ = t + n~l where as usual /* is the frequency o f the yth case in the b o o tstrap sam ple an d lj is the 7 th em pirical influence value. T he cum ulant-generating function o f T[ — t is (9.30) with aj = l j / n and
o f saddlepoint approxim ation to bootstrap a quantiles (x lO -2 ) o f a linear statistic for samples of sizes 10 and 15, with results from R = 49 999 simulations.
469
9.5 ■Saddlepoint Approximation
Figure 9.9 Comparison of the saddlepoint approximation (solid) to the PDF of a linear statistic with results from a bootstrap simulation with R = 49 999, for samples of sizes 10 (left) and 15 (right).
A O
cvi
o Q
Q Q_ °
co
CL CM
in ©
o
Jh
o
0.0
0.5
1.0
1.5
lltlii**,----------
0.1
0.2
0.3
0.4
0.5
0.6
t
Pj = n
so the saddlepoint equation for approxim ation to the P D F and C D F
o f T[ a t t'L is E " = i O exP ( ^ ;/« ) _ . E J - . e x p ({ /;/« ) L
’
whose solution is \. For a num erical exam ple, we take the variance t = n~l J2(yj ~ y )2 f ° r exponential sam ples o f sizes 10 and 15; the em pirical influence values are lj = (yj — y )2 — t. Figure 9.9 com pares the saddlepoint approxim ations to the P D F s o f t'L w ith the histogram from b o o tstrap calculations with R = 49 999. T he saddlepoint approxim ation accurately reflects the skewed lower tail o f the b o o tstrap distribution, w hereas a norm al approxim ation would not do so. However, the saddlepoint approxim ation does n o t pick up the m ultim odality o f the density for n = 10, which arises for the sam e reason as in the right panels o f Figure 2.9: the bulk o f the variability o f T[ is due to a few observations w ith large values o f |/; |, while those for which \lj\ is small merely add noise. The figure suggests th a t w ith so small a sam ple the C D F approxim ation will be m ore useful. This is borne out by Table 9.7, which com pares the sim ulation quantiles an d quantiles obtained by fitting a spline to 50 saddlepoint C D F values. In m ore com plex applications the em pirical influence values lj would usually be estim ated by num erical differentiation or by regression, as outlined in Sections 2.7.2, 2.7.4 and 3.2.1. ■ Example 9.14 (Tuna density estimate)
We return to the double boo tstrap
470
9 • Improved Calculation
used in Exam ple 5.13 to calibrate confidence intervals based on a kernel density estim ate. This involved estim ating the probabilities P r**(T " < 2t‘ - t | F ’),
(9.31)
where
is the variance-stabilized estim ate o f the quan tity o f interest. The double bo o tstrap version o f t can be w ritten as t " = ( 2 2 f j ' aj ) ,/2> where aj = (nh)~l { 4>{—y j / h ) +
^ (2 ^ —0 2- Thus
conditional on F", if 2 1 * — t > 0, we can obtain a saddlepoint approxim ation to (9.31) by applying (9.28) an d (9.30) w ith u = (21* — t )2 and pj = Including program m ing, it took ab o u t ten m inutes to calculate 3000 values o f (9.31) by saddlepoint approxim ation; direct sim ulation with 250 sam ples at the second level took ab o u t four hours on the sam e w orkstation. ■ Estimating functions O ne simple extension o f the basic approxim ations is to statistics determ ined by m onotonic estim ating functions. Suppose th a t the value o f a scalar bo o tstrap statistic T* based on sam pling from y i , . . . , y „ is the solution to the estim ating equation n (9.32) U*(t) = ^ 2, a{ f ,y j )f 'j = 0, where for each y the function a( 6 ;y) is decreasing in d. T hen T* < t if and only if U ’(t) < 0. H ence Pr*(T* < t) m ay be estim ated by Gs(0) applied w ith cum ulant-generating function (9.30) in which aj = a{t;yj). A saddlepoint approxim ation to the density o f T is (9.33) .
A
where K ( ^ ) = d K / d t , an d £ solves the equation K '( £) = 0. The first term on the right in (9.33) corresponds to the Jacobian for transform ation from the density o f U ’ to th a t o f T ' .
471
9.5 ■Saddlepoint Approximation
Example 9.15 (M aize data) Problem 4.7 contains d a ta from a paired com parison experim ent perform ed by D arw in on the grow th o f m aize plants. The d a ta are reduced to 15 differences y \ , . . . , y n betw een the heights (in eighths o f an inch) o f cross-fertilized and self-fertilized plants. W hen two large negative values are excluded, the differences have average J> = 33 and look close to norm al, b u t w hen those values are included the average drops to 20.9. W hen d a ta m ay have been co ntam inated by outliers, robust M -estim ates are useful. If we assum e th a t Y = 8 + as, where the distribution o f e is sym m etric a b o u t zero b u t m ay have long tails, an estim ate o f location 0 can be found by solving the equation ' = 0,
(9.34)
j=i where tp(e) is designed to dow nw eight large values o f s. A com m on choice is the H u b er estim ate determ ined by
y>(e) =
c,
(9.35)
W ith c = oo this gives 1p(s) = s and leads to the norm al-theory estim ate 9 = y, b u t a sm aller choice o f c will give b etter behaviour w hen there are outliers. W ith c = 1.345 and a fixed a t the m edian absolute deviation s o f the data, we obtain 8 = 26.45. H ow variable is this? We can get some idea by looking at replicates o f 9 based on b o o tstrap sam ples y j,...,y * . A b o o tstrap value 9* solves
P
i
^
h
so the saddlepoint approxim ation to the P D F o f b o o tstrap values is obtained starting from (9.32) w ith a(f , yj ) = y>{(yj — t)/s}. The left panel o f Figure 9.10 com pares the saddlepoint approxim ation with the em pirical distribution o f 9*, and w ith the approxim ate P D F o f the b o o tstrap p ed average. The saddlepoint approxim ation to 9’ seems quite accurate, while the P D F o f the average is w ider and shifted to the left. The assum ption o f sym m etry underlies the use o f the estim ator 9, because the p aram eter 9 m ust be the same for all possible choices o f c. The discus sion in Section 3.3 an d Exam ple 3.26 implies th at our resam pling scheme should take this into account by enlarging the resam pling set to y i ,. . . , y „ , 9 — (yi — 9 ) , . . . , 9 — {y„ — 9), for some very robust estim ate o f 9 ; we take 9 to be the m edian. T he cum ulant-generating function required w hen taking sam ples
9 • Improved Calculation
472
CD
o
o o
Q
Q_
C\J
o
o
o d
-20 theta
20
40
60
theta
o f size n from this set is n log
(2n) 1 ^
exP { £ a (f ;?;)} + ex p { < ^ a(t;2 § - y , ) }
j=i
T he right panel o f Figure 9.10 com pares saddlepoint and M onte C arlo a p proxim ations to the P D F o f O' und er this sym m etrized resam pling scheme; the P D F o f the average is shown also. All are sym m etric ab o u t 6 . One difficulty here is th a t we m ight prefer to approxim ate the P D F o f O' w hen s is replaced by its b o o tstrap version s', an d this cannot be done in the current fram ew ork. M ore fundam entally, the distrib u tion o f interest will often be for a q uantity such as a studentized form o f O' derived from 6 ", s', and perhaps other statistics, necessitating the m ore sophisticated approxim ations outlined in Section 9.5.3. ■
9.5.2 C onditional approxim ation T here are num erous ways to extend the discussion above. O ne o f the m ost straightforw ard is to situations w here U is a q x 1 vector which is a linear function o f independent variables W i , . . . , W „ w ith cum ulant-generating func tions K j { £ ) , j = T h a t is, U = A T W , where A is a n x q m atrix with rows a j . The jo in t cum ulant-generating function o f U is
K(0
= log E exp
(ZTA T W) =
n
Figure 9.10 Comparison of the saddlepoint approximation to the PDF of a robust M-estimate applied to the maize data (solid), with results from a bootstrap simulation with R = 50000. The heavy curve is the saddlepoint approximation to the PDF of the average. The left panel shows results from resampling the data, and the right shows results from a symmetrized bootstrap.
473
9.5 ■Saddlepoint Approximation
an d the saddlepoint approxim ation to the density o f U at u is gs(u) = ( 2 n r q/ 2 \ K " ( t r 1/2cxp { * ( £ ) - l Tu ] ,
(9.36)
w here | satisfies the q x 1 system o f equations 8 K(t;)/d£, = u, and K "(£) = d 2K { t ) / d m T is the q x q m atrix o f second derivatives o f K ; | • | denotes determ inant. N ow suppose th a t U is p artitioned into U\ and U2, th at is, U T = ( I / / , U j ), w here U\ an d U 2 have dim ension q\ x 1 and (q — qi) x 1 respectively. N ote th a t U2 = A% W , where A 2 consists o f the last q — qi colum ns o f A. The cum ulant-generating function o f U 2 is simply K(0, £,2), where £ T = (£jr ,<J^) has been p artitioned conform ably w ith U, so the saddlepoint approxim ation to the m arginal density o f U 2 is g,(u2) = ( 2 ^ r (9-« 1,/2|X^2( 0 , |20) r 1/2 exp { k ( 0 , | 20) - U>u2) ,
(9.37)
w here £20 satisfies the (q — qi) x 1 system o f equations d K ( 0 , £ 2 )/dt ; 2 = u2, and K '22 is the (q — q\) x {q — qi) corner o f K " corresponding to U2. D ivision o f (9.36) by (9.37) gives a double saddlepoint approxim ation to the conditional density o f U\ at ui given th a t U2 = u2. W hen U\ is scalar, i.e. q\ = 1, the approxim ate conditional C D F is again (9.28), b u t with w
=
sig n (lt) ^ | x ( 0 , | 2o) - 32T0M2} - { ^ ( ^ ) -
' .
\ |K"2( 0 , 6 o)I J
Example 9.16 (City population data) A simple b o o tstrap application is to obtain the distribution o f the ratio T* in b o o tstrap sam pling from the city population d a ta pairs w ith n = 10. In order to avoid conflicts o f no tatio n we set yj = (Zj, Xj), so th a t T* is the solution to the equation ]T (x; — tZj)f* = 0. F or this we take the W j to be independent Poisson random variables with equal m eans /j, s o K j ( £ ) = n{e{ — 1). We set
=
- ( ”)• " - ( V '
N ow T ' < t if and only if J2 j(xj ~ tzj ) W j < 0, where Wj is the num ber o f times ( z j , X j ) is included in the sample. But the relation betw een the Poisson an d m ultinom ial distributions (Problem 9.19) implies th a t the jo in t conditional distrib u tio n o f ( W \ , . . . , W „ ) given th a t J 2 ^ j = n is the same as th a t o f the m ultinom ial frequency vector (/*, . . . , / * ) in ordinary b o o tstra p sam pling from a sam ple o f size n. T hus the probability th a t J2 j(xj “ tzj)W j < 0 given th at J2 W j = n is ju st the probability th a t T ' < t in ordinary b o o tstrap sam pling from the d a ta pairs.
474
9 ■Improved Calculation a
0.001 0.005 0.01 0.025 0.05 0.1 0.9 0.95 0.975 0.99 0.995 0.999
Unconditional
Conditional
W ithout replacement
S’point
Sim’n
S’point
Sim’n
S’point
Sim’n
1.150 1.191 1.214 1.251 1.286 1.329 1.834 1.967 2.107 2.303 2.461 2.857
1.149 1.192 1.215 1.252 1.286 1.329 1.833 1.967 2.104 2.296 2.445 2.802
1.216 1.236 1.248 1.273 1.301 1.340 1.679 1.732 1.777 1.829 1.865 1.938
1.215 1.237 1.247 1.269 1.291 1.337 1.679 1.736 1.777 1.833 1.863 1.936
1.070 1.092 1.104 1.122 1.139 1.158 1.348 1.392 1.436 1.493 1.537 1.636
1.070 1.092 1.103 1.122 1.138 1.158 1.348 1.392 1.435 1.495 1.540 1.635
In this situation it is o f course m ore direct to use the estim ating function m ethod w ith a(t;yj) = Xj—tZj and the sim pler approxim ations (9.28) and (9.33). T hen the Jaco b ian term in (9.33) is | 22 z; e x p { |(x , —t zj ) } / 22 exp{|(x,- —tZj)}\. A n o th er application is to conditional distributions for T*. Suppose th at the populatio n pairs are related by x; = Zj 6 + z l/ 2 £j, where the e; are a random sam ple from a distrib u tio n w ith m ean zero. T hen conditional on the Zj, the ratio 2 2 xj / 2 2 zj has variance p ro p o rtio n al to (]P Z j)~' ■In some circum stances we m ight w ant to obtain an ap proxim ation to the conditional distribution o f T * given th a t 2 2 Z j = 2 2 zjthis case we can use the approach outlined in the previous p aragraph, b u t w ith tw o conditioning variables: we take the Wj to be independent Poisson variables w ith equal m eans, and set "E (xj-tzj)W j\ [/*=
2 2 zjW j
22 w j
( h
J
o
u = \ 2 2 zj
\
n )
a, =
zJ
V 1
A third application is to approxim ating the distribution o f the ratio when a sam ple o f size m = 10 is taken w ithout replacem ent from the n = 49 d a ta pairs. A gain T ' < t is equivalent to the event 2 2 j( x j ~ t z j ) W j < 0, b u t now W j indicates th a t (z j , X j) is included in the m cities chosen; we w ant to im pose the condition 2 2 ^ 0 = m - We take Wj to be binary variables with equal success probabilities 0 < n < 1, giving Kj(£) = lo g (l — n + ne*), with n any value. We then apply the double saddlepoint approxim ation with
- U ) -
" - ( V '
Table 9.8 com pares the quantiles o f these saddlepoint distributions with
Table 9.8 Comparison of saddlepoint and simulation quantile approximations for the ratio when sampling from the city population data. The statistics are the ratio £ x j / £ zj with n = 10, the ratio conditional on Yl zj = 640 with n = 10, and the ratio in samples of size 10 taken without replacement from the full data. The simulation results are based on 100000 bootstrap samples, with logistic regression used to estimate the simulated conditional probabilities, from which the quantiles were obtained by a spline fit.
9.5 • Saddlepoint Approximation
475
M onte C arlo approxim ations based on 100000 samples. The general agreem ent is excellent in each case. ■ A fu rth er application is to p erm u tatio n distributions. Exam ple 9.17 (C orrelation coefficient) In Exam ple 4.9 we applied a perm u tation test to the sam ple co rrelation t betw een variables x and z based on pairs (x i,z i), ..., (x„,z„). F or this statistic and test, the event T > t is equivalent to EjXjZ(U) - Y l x i zj> where £(•) is a p erm u tatio n o f the integers 1,.. . , n . A n alternative form ulation is as follows. Let Wy, i , j = 1 denote independent binary variables w ith equal success probabilities 0 < n < 1, for any n. T hen consider the distrib ution o f U\ = J 2 i j x izj ^ U conditional on U 2 = ( £ , W i j , . . . , Y , j w nj,E , W|1....... 5Di w i,n-i) r = M2, where u 2 is a vector o f ones o f length 2n — 1. N otice th a t the condition E , = 1 is entailed by the o th er conditions an d so is redundant. E ach value o f Xj and each value o f zj app ears precisely once in the sum U\, w ith equal probabilities, and hence the conditional distrib u tio n o f U\ given U 2 = u 2 is equivalent to the p erm utation distribution o f T. H ere m = n2, q = 2n, and qi = 1. O u r lim ited num erical experience suggests th a t in this exam ple the sad d lepoint ap proxim ation can be inaccurate if the large num ber o f constraints results in a conditional distribution on only a few values. ■
9.5.3 Marginal approximation T he approxim ate distrib u tio n and density functions described so far are useful in contexts such as testing hypotheses, b u t they are hard er to apply to such problem s as setting studentized b o o tstrap confidence intervals. A lthough (9.26) an d (9.28) can be extended to some types o f com plicated statistics, we merely outline the results. Approximate cumulant-generating function T he sim plest ap proach is direct approxim ation to the cum ulant-generating function o f the b o o tstrap statistic o f interest, T ’. The key idea is to replace the cum ulant-generating function K ( ^ ) by the first four term s o f its expansion in powers o f + \ £ 2 k 2 + g£3*C3 + ^<^4k4,
(9.38)
w here K; is the ith cum u lan t o f T*. T he exact cum ulants are usually unavailable, so we replace them w ith the cum ulants o f the cubic approxim ation to T* given by n
n
n
t * = t + n - 1 £ / ; / , • + K 2 E / * / ; * ; + i n~3 fifjfkdjk, i=l i,j=1 i,jjt=1
9 • Improved Calculation
476
where t is the original value o f the statistic, an d lj, qjj and Cy* are the em pirical linear, quad ratic and cubic influence values; see also (9.6). To the order required the approxim ate cum ulants are
+!2 (n 3Y .ijk l‘lm q jk) +4 (n 3Y ,ijk W ^ w ) } where the quantities in parentheses are o f o rder one. We get an approxim ate cum ulant-generating function K c ( 0 by substituting the Kc,i into (9.38), an d then use the stan d ard approxim ations (9.26) and (9.28) w ith K ( £) replaced by Kc(£). D etailed consideration establishes th a t this preserves the usual asym ptotic accuracy o f the saddlepoint approxim ations. From a practical point o f view it m ay be b etter to sacrifice some theoretical accuracy b u t reduce the co m p u tatio n al b urden by dropping from k c ,2 and Kc,4 the term s involving the cy*; w ith this m odification b o th P D F and C D F approxim ations have erro r o f order n~l . In principle this ap proach is fairly simple, b u t in applications there is no guarantee th a t K c (£ ) is close to the true cum ulant-generating function o f T ' except for small It m ay be necessary to m odify K c (£) to avoid m ultiple roots to the saddlepoint equation or if Kc ,4 < 0, for then K c(£ ) cannot be convex. In these circum stances we can m odify K c (£) to ensure th a t the cubic and quartic term s do n o t cause trouble, for exam ple replacing it by K c M ) = f r c . i + { £ 2 * c ,2 + (|<^3Kc,3 + J4 %4 kc,4) exp ( - \ n b 2 £ 2 KC,2 ) , where b is chosen to ensure th a t the second derivative o f Kc,b(£) with respect to £ is positive; Kc,b(£) is then convex. A suitable value is b = m ax [5, in f {a : K ' c a(£) > 0, —oo <
< oo}],
and this can be found by num erical search. Empirical Edgeworth expansion The approxim ate cum ulants can also be used in series expansions for the density and distrib u tio n o f T*. T he E dgew orth expansion for the C D F o f
477
9.5 ■Saddlepoint Approximation
Z q = (T* ~ kc ,i ) / k £
is
P r\ Z ' C < z) =
(9.39)
where Pi(z )
=
-5'fc,3'C c,2/2 ( z 2 - 1).
p 2 {z)
=
- z { ^ K C,4 K c l ( z 2 - 3) + j 2 KC,3 Kc U z4 ~ ^
+ 15)} •
D ifferentiation o f (9.39) gives an approxim ate density for Z'c and hence for T*. However, experience suggests th a t the saddlepoint approxim ations (9.28) and (9.29) are usually preferable if they can be obtained, prim arily because (9.39) results in less accurate tail probability estim ates: its error is absolute ra th e r th an relative. F u rth er draw backs are th at (9.39) need n o t increase with z, and th a t the density approxim ation m ay becom e negative. D erivation o f the influence values th a t contribute to kc,i , . . . , Kc,4 can be tedious. Exam ple 9.18 (Studentized statistic) A statistic T = t (F) studentized using the nonparam etric delta m ethod variance estim ate obtained from its linear influence values L t( y , F ) m ay be w ritten as Z = nx^2 W , where t(F) - t(F)
W = w (F,F) =
1/ 2 ’
{ / L t ( y , F ) 2 d F( y )} w ith F the E D F o f the data. The corresponding b o o tstrap statistic is w ( F \ F), where F* is the E D F corresponding to a boo tstrap sample. F or econom y o f n o tatio n below we write v = v(F) = J L t( y; F) dF(y), L w(yi) = M j ^ F ) , Q A y u y i ) = Q A y u y n F ) , and so forth. To obtain the linear, quad ratic and cubic influence values for w(G, F ) at G = F, we replace G(y) w ith Here H(x) is the Heaviside function, jumping from 0 to 1 at x = 0.
(1 - ei - s 2 - e3)F(y) + £1H ( y - j>i) + e2 H ( y - y 2) + £3H (y - y3), differentiate w ith respect to £1, s2, and £3, and set £1 = £2 = £3 = 0. The em pirical influence values for W at F are then obtained by replacing F with F. In term s o f the influence values for t and v the result o f this calculation is L w(yi) Qviyuyi) Cw{y \, y 2 , y i )
= v~ 1 / 2 L t(yi), = v~ll 2 Qt{yx, y 2) - ^v~ 3/ 2 L t{yi)Lv{y2 )[2], = v ^ l/ 2 Ct(yu y 2 , y 3) - \ v ~ V 2 { 6 f0 'i,j'2 )l‘.,0'3) + Qv (y uy 2 ) Lt(yi)} [3] + 1V~5/ 2 L[ (y 1)LV(y 2 )LV(y3) [3],
9 • Improved Calculation
478
where [fc] after a term indicates th a t it should be sum m ed over the perm utations o f its y^s th a t give the k distinct quantities in the sum. Thus for exam ple L t( yi ) Lv(y 2 )Lv(y 3 )[3 ]
=
L t( y i) Lv(y 2 )Lv(y}) + L t( yi ) Lv( yi ) Lv(y 2 ) + L t{y2 )Lv( yi ) Lv(yi).
The influence values for z involve linear, quadratic, and cubic influence values for t, and linear an d quad ratic influence values for v, the latter given by
J
L t( x )2 dF(x)
+ 2J L t(x)Qt( x , y l )dF(x),
L v(yi)
=
L t{yi )2 —
lQv(yi,y2)
=
L t( y i )Q t ( y uy 2 )l 2 ] - L t{yi)Lt(y2) ~ J { Q t ( x , y i ) + Qt( x , y 2 ) }Lt( x) dF( x)
+J
{Qt(x,y 2 )Qt(x,yi) + L t(x)Ct{ x , y u y 2)} dF(x).
The sim plest exam ple is the average t(F) = f x d F ( x ) = y o f a sam ple o f values y u - - - , y „ from F. T hen L t(j/,-) = y t - y , Qtiyuyj) = Ct(yi,yj,yk) = 0, the expressions above simplify greatly, an d the required influence quantities are
li 9ij Cijk
= Lw(yi;F) = v~x,2{yi - y), = Q U y i , y j i h = - i v ~ i/ 2 ( y i - y ) { ( y j - y ) 2 - v } [ 2 ], = Cw(yi , yj, yk ;F) = 3v~i/2(yi - y)(yj - y)(yk - y)
+\v~5n{yi - y) {(yj - y)2 -
{(yk - y)2 -
[3],
where v = n-1 J2(yi ~ y)2- The influence quantities for Z are obtained from those for W by m ultiplication by n 1/2. A num erical illustration o f the use o f the corresponding approxim ate cum ulant-generating function K c ( £ ) is given in Exam ple 9.19. ■ Integration approach A n other ap proach involves extending the estim ating function approxim ation to the m ultivariate case, an d then approxim ating the m arginal distribution o f the statistic o f interest. To see how, suppose th a t the quantity T o f interest is a scalar, an d th a t T and S = ( S i , . . . , S q- i) r are determ ined by a q x 1 estim ating function n U(t,s) = ^ a ( t , s i , . . . , s 9- l ;Yj ). J=i T hen the b o o tstra p quantities T* an d S ’ are the solutions o f the equations n U'(t, s) = J 2 a j ( t , s ) f j = 0 ,
j=i
(9.40)
9.5 • Saddlepoint Approximation
479
where a; (t,s) = a(t,s;yj) an d the frequencies ( / j , . . . , / * ) have a m ultinom ial distribution w ith d en o m in ato r n and m ean vector n ( p \ , - typically pj = n_1. We assum e th a t there is a unique solution (t*,s*) to (9.40) for each possible set o f /* , an d seek saddlepoint approxim ations to the m arginal P D F and C D F of r . F o r fixed t an d s, the cum ulant-generating function o f U" is
K ( £ ; t , s ) = n log
(9.41)
Y 2 PJex P l ^ a / M ) } ;'=i
an d the jo in t density o f the U * at u is given by (9.36). The Jacobian needed to obtain the jo in t density o f T* and S ' from th at o f U ' is h ard to obtain exactly, b u t can be approxim ated by dcij(t,s) 8 aj(t,s) ' dt
j=i
’
dsT
where ,, 1 ’
.
p j e x p { Z Taj(t,s)} Y l = i P k e x p { £ , Tak{ t , s) } ’
as usual for r x 1 an d c x 1 vectors a and s w ith com ponents at and sj, we w rite 8 a / d s T for the r x c array whose (i,j) elem ent is dat/dsj. T he Jacobian J { t , s \ £,) reduces to the Jacobian term in (9.33) w hen s is not present. Thus the saddlepoint approxim ation to the density o f (T*,S*) at (t,s) is J ( t , s ; l ) { 2 n ) - ^ 2 \ K " { l ; t, s) p 1/2 exp K & ; t, s),
(9.42)
w here £ = £(t,s) is the solution to the q x 1 system o f equations 8 K/d£, = 0. L et us w rite A(t,s) = —K{£( t, s) ;t, s} . We require the m arginal density and distribution functions o f T* a t t. In principle they can be obtained by integration o f (9.42) num erically w ith respect to s, but this is tim e-consum ing when s is a vector. A n alternative approach is analytical approxim ation using Laplace’s m ethod, which replaces the m ost im p o rtan t p a rt o f the integrand — the rightm ost term in (9.42) — by a norm al integral, suitably centred an d scaled. Provided th a t the m atrix d 2 A ( t ,s ) / ds d sT is positive definite, the resulting approxim ate m arginal density o f T * at t is
J(t,S;?)(2 n)-,/2 |X"(|;t,S)|-1/ 2
d 2 A(t, s) dsdsT
-
1 /2
exp
s),
(9.43)
w here \ = \ ( t ) an d s = s(t) are functions o f t th a t solve sim ultaneously the
480
9 •Improved Calculation
q x 1 and (q — 1) x 1 systems o f equations 8K —
; t, s)
al
—
=
nYj l
i
8 K (<^; t, s)
pA
s ) = °>
—
j
s—
, 8 aj
=
nYlpft
=1
s)
>s )
d-s- t =
°-
;= 1
(9.44) These can be solved using packaged routines, w ith starting values given by noting th a t w hen t equals its sam ple value to, say, s equals its sam ple value and £ = 0. The second derivatives o f A needed to calculate (9.43) m ay be expressed as 8 2 A(t,s) _ d 2 K ( £ ; t , s ) f d 2 K ( i ; t , s ) Y i d 2K(£-,t,s) 8 s8 s T
8 s8 £ T
\
8^8^ T
J
8 £,dsT
82 K(t;-,t,s) 8 sdsT
where a t the solutions to (9.44) the m atrices in (9.45) are given by 8 2K { t - , t , s )
n ^2p ' j( t ,s ) aj ( t , s ) a j ( t ,s ) T,
(9.46)
(9.47,
w ith sc and sj the cth and dth com ponents o f s. The m arginal C D F approxim ation for T ' at t is (9.28), with w
=
s i g n ( t - t0){ 2 X (^ ;t,s )} 1/2, dt
|K " (£ ;t,* )l1/2
(9.49) d2A (t, s) 8 s8 s T
1/2
evaluated a t s = s, £ = | ; the only additional q uantity needed here is (9.51) ;= i A pproxim ate quantiles o f T* can be obtained in the way described ju st before Exam ple 9.13. The expressions above look forbidding, b u t their im plem entation is relatively straightforw ard. The key p o in t to note is th a t they depend only on the qu an ti ties aj(t, s), their first derivatives w ith respect to t, an d their first two derivatives w ith respect to s. Once these have been program m ed, they can be input to a generic routine to perform the saddlepoint approxim ations. Difficulties th a t som etim es arise w ith num erical overflow due to large exponents can usually be circum vented by rescaling d a ta to zero m ean and unit variance, which has no
481
9.5 ■Saddlepoint Approximation
effect on location- an d scale-invariant quantities such as studentized statistics. R em em ber, however, o u r initial com m ents in Section 9.1: the investm ent o f tim e an d effort needed to p rogram these approxim ations is unlikely to be w orthw hile unless they are to be used repeatedly. Exam ple 9.19 (M aize d ata) To illustrate these ideas we consider the boo tstrap variance an d studentized average for the m aize data. Both these statistics are location-invariant, so w ithout loss o f generality we replace yj with yj — y and henceforth assum e th a t y = 0. W ith this sim plification the statistics o f interest are
where Y" = n 1 J2 YJ. A little algebra shows th at ii-1 Y , V 2 = V * {1 + Z * 2/(n - 1)} ,
n~l Y , Yj = Z ' V l/2{n - 1)~1/2,
so to apply the integration ap proach we take pj = n 1 and
from which the 2 x 1 m atrices o f derivatives daj(z,v) 8z
daj(z,v) dv
d 2 cij(z,v) 8z 2 ’
8 2 aj(z,v) 8 v1
needed to calculate (9.43)—(9.51) are readily obtained. To find the m arginal distribution o f Z*, we apply (9.43)-(9.51) with t = z and s = v. F or a given value o f z, the three equations in (9.44) are easily solved numerically. The u p p er panels o f Figure 9.11 com pare the saddlepoint distribution an d density approxim ations for Z* w ith a large sim ulation. The analytical quantiles are very close to the sim ulated ones, and although the saddlepoint density seems to have integral greater th an one it captures well the skewness o f the distribution. F or V * we take t = v and s = z, b u t the lower left panel o f Figure 9.11 shows th a t resulting P D F approxim ation fails to capture the bim odality o f the density. This arises because V * is deflated for resam ples in which neither o f the two sm allest observations — which are som ew hat separated from the rest — appear. The contours o f —A(z, v) in the lower right panel reveal a potential problem w ith these m ethods. For z = —3.5, the Laplace approxim ation used to obtain (9.43) am ounts to replacing the integral o f exp{—A(z, t>)} along the dashed vertical line by a norm al approxim ation centred at A and w ith precision given by the second derivative o f A(z, v) at A along the line. But A(—3.5, v) is bim odal for v > 0, an d the Laplace ap proxim ation does n o t account for the second peak at B. As it turns out, this doesn’t m atte r because the peak at B is so m uch
9 *Improved Calculation
482
o W a> c co 3
o
-4
*2
2000
3000
2 z
Quantiles of standard normal
1000
0
-2
v
lower th a n a t A th a t it adds little to the integral, b u t clearly (9.43) would be catastrophically bad if the peaks at A an d B were com parable. This behaviour occurs because there is no guarantee th a t A(z, v) is a convex function o f v and z. If the difficulty is th o u g h t to have arisen, num erical integration o f (9.42) can be used to find the m arginal density o f Z ’, b u t the problem is n o t easily diagnosed except by checking th a t (9.45) is positive definite a t any solution to (9.44) an d by checking th a t different initial values o f c, and s lead to the the same solution for a given value o f t. This m ay increase the com putational burden to an extent th a t direct sim ulation is m ore efficient. Fortunately this difficulty is m uch rarer in larger samples. The quantities needed for the approxim ate cum ulant-generating function
Figure 9.11 Saddlepoint approximations for the bootstrap variance V * and studentized average Z* for the maize data. Top left: approximations to quantiles of Z* by integration saddlepoint (solid) and simulation using 50000 bootstrap samples (every 20th order statistic is shown). Top right: density approximations for Z* by integration saddlepoint (heavy solid), approximate cumulant-generating function (solid), and simulation using 50 000 bootstrap samples. Bottom left: corresponding approximations for V*. Bottom right: contours of —A(z,t>), with local maxima along the dashed line z = —3.5 at A and at B.
9.5 • Saddlepoint Approximation
483
ap proach to obtaining the distribution o f n '/2(n — 1)-1/2Z* were given in E xam ple 9.18. The approxim ate cum ulants for Z* are Kc,i = 0.13, k c ,2 = 1.08, Kc,3 = 0.51 and k c ,4 = 0.50, w ith k c ,2 = 0.89 and k c ,4 = —0.28 when the term s involving the are dropped. W ith or w ithout these term s, the cum ulants are som e way from the values 0.17, 1.34, 1.05, and 1.55 estim ated from 50000 sim ulations. T he upper right panel o f Figure 9.11 shows the P D F approxim ation based on the m odified cum ulant-generating function; in this case Kc fi ( £) is convex. The m odified P D F m atches the centre o f the distribution m ore closely th a n the in tegration PD F, b u t is poor in the u p per tail. F or V ' , we have U = (yi ~ y f ~ t,
qtj = - 2 (yt - y)(yj - y),
ciJk = 0,
so the approxim ate cum ulants are kc,i = 1241, k c ,2 /k c i = kc j / kc i = 0.013 and , = —0.0015; the corresponding sim ulated values are 1243, 0.18, 0.018, 0.0010. N either saddlepoint approxim ation captures the bim odality o f the sim ulations, though the integration m ethod is the b etter o f the two. In this case b = j for the approxim ate cum ulant-generating function m ethod, and the resulting density is clearly too close to norm al.
■
Exam ple 9.20 (Robust M -estim ate) For a second exam ple o f m arginal ap proxim ation, we suppose th a t 8 and a are M -estim ates found from a random sam ple y i , . . . , y n by sim ultaneous solution o f the equations
7=1
v
7
;=1
v
7
T he choice rp(e) = e, ^(e) = e2, y = 1 gives the non-robust estim ates 8 = y and
n~] Y nJ=i v'jej )
a
( y/n)x/ 2
which is pro p o rtio n al to the usual Student-t statistic when ip(s) = e. In order to set studentized b o o tstrap confidence limits for 8 , we need approxim ations to the b o o tstrap quantiles o f Z . These m ay be obtained by applying the m arginal saddlepoint approxim ation outlined above with T = Z , S = (Si, S i ) 7 , Pj = n-1 ,
9 • Improved Calculation
484
Table 9.9 Comparison of results from 50000 simulations with integration saddlepoint approximation to bootstrap a quantiles of a robust studentized statistic for the maize data.
« (% )
Sim ’n S’p o in t
0.1
1
2.5
5
10
90
95
97.5
99
99.9
-3.81 -3 .6 8
-2 .6 8 -2 .6 0
-2.21 -2.11
-1 .8 6 -1 .7 2
-1.49 -1.31
1.25 1.24
1.62 1.62
1.94 1.97
2.35 2.42
3.49 3.57
and
(
xp ( ae j/s i - z d / s 2) \ tp 2 ( ae j/ si - z d / s 2) - y j ,
(9.52)
tp' ( pe j/ s \ — z d / s 2) - s 2 )
where d = (y/ri)l/2. F or the H u b er estim ate, s 2 = n_1 ~ z*d/s\\ < c) takes the discrete range o f values j / n ; here 1(A) is the indicator o f the event A. In the extrem e case w ith c = oo, sj always equals one, b u t even if c is finite it will be unwise to treat sj as continuous unless n is very large. We therefore fix sj = s2, and m odify a,- by dropping its th ird com ponent, so th a t q = 2, and S = Si. W ith this change the quantities needed for the P D F and C D F approxim ations are daj(t,s)
_
/
_
f
dt ddj(t,s)
\ —2 xpxp'd/s2 ,
ds d 2 cij(t,s) 8 s 8 sT
—xp'd/s 2 —aejtp'/s 2
\ —2 oej\py>'/si _
( 2 aej\p'/sl \4aejxpxp,/ s l + 4 a 2 e 2 \p,2 /s'\
where tp an d xp' are evaluated a t a e j/ s\ — z d / s 2. For the m aize d a ta the rob u st fit dow nw eights the largest and tw o smallest observations, giving 6 = 26.68 an d a = 25.20. Table 9.9 com pares saddlepoint and sim ulated quantiles o f Z*. The agreem ent is generally poorer th an in the previous examples. To investigate the properties o f studentized b o o tstrap confidence intervals based on Z , we conducted a small sim ulation experim ent. We generated 1000 sam ples o f size n from the norm al distribution, the t$ distribution, the “slash” distributio n — the distrib u tio n o f an N ( 0 , 1) variate divided by an independent 17(0,1) variate — and the x$ distribution. F or each sam ple confidence intervals were obtained using the saddlepoint m ethod described above. Table 9.10 shows the actual coverages o f nom inal 95% confidence intervals based on the integration saddlepoint approxim ation. F or the sym m etric distributions the results are rem arkably good. T he assum ption o f sym m etric errors is false for the xl distribution, an d its results are poorer. In the sym m etric cases the saddlepoint m ethod failed to converge for a b o u t 2% o f samples, for which
9.6 • Bibliographic Notes Table 9.10 Coverage (%) of nominal 90% and 95% confidence intervals based on the integration saddlepoint approximation to a studentized bootstrap statistic, based on 1000 samples of size n from underlying normal, ts, slash, and xl distributions. Two-sided 90% and 95% coverages are given for all distributions, but for the asymmetric x2 distribution one-sided 5, 95, 2.5 and 97.5 % coverages are given also.
485
N o rm al
n = 20 n = 40
Slash
h
C hi-squared
90
95
90
95
90
95
5
95
90
2.5
97.5
95
91 90
95 94
91 89
96 95
91 89
95 95
14 9
97 94
83 85
9 6
97 95
88 89
sim ulation w ould be needed to obtain confidence intervals; we simply left these sam ples out. C uriously there were no convergence problem s for the xl samples. O ne com plication arises from assum ing th at the error distribution is sym metric, in which case the discussion in Section 3.3 implies th a t our resam pling scheme should be m odified accordingly. We can do so by replacing (9.41) with
X (^ ;z ,S !) = n lo g
i £ ^ e x p { ^ T a/(z ’Sl)} + I ^ P . / exP { f Ta}(z>s i)} j= 1
j~i
where a'j{z,s\) is obtained from (9.52) by replacing ej with —e; . However the odd cum ulants o f are then zero, and a norm al approxim ation to the distribution o f Z will often be adequate. Even w ith o u t this m odification, it seems th at the m ethod described above yields rob u st confidence intervals with coverages very close to the nom inal level. Relative tim ings for sim ulation and saddlepoint approxim ations to the b o o t strap distribution o f Z* depend on how the m ethods are im plem ented. In our im plem entation for this exam ple it takes ab o u t the same time to obtain 1000 values o f Z ' by sim ulation as to calculate 50 saddlepoint approxim ations using the integration m ethod, but this com parison is n o t realistic because the saddlepoint m ethod gives accurate quantile estim ates m uch further into the tails o f the distrib u tio n o f Z*. If ju st two quantile estim ates are needed, as would be the case for a 95% confidence interval, the saddlepoint m ethod is ab o u t ten tim es faster. O ther studies in the literature suggest that, once p ro gram m ed, saddlepoint m ethods are 20-50 tim es faster than sim ulation, and th a t efficiency gains tend to be larger w ith larger samples. However, saddle p o in t approxim ation fails on ab o u t 1-2% o f samples, for which sim ulation is needed. ■
9.6 Bibliographic Notes V ariance reduction m ethods for param etric sim ulation have a long history and a scattered literature. They are discussed in books on M onte C arlo m ethods,
486
9 ■Improved Calculation
such as H am m ersley and H andscom b (1964), Bratley, Fox and Schrage (1987), Ripley (1987), an d N iederreiter (1992). B alanced b o o tstra p sim ulation was introduced by D avison, H inkley and Schechtm an (1986). O gbonm w an (1985) describes a slightly different m ethod for achieving first-order balance. G rah am et al. (1990) discuss second-order balance an d the connections to classical experim ental design. A lgorithm s for balanced sim ulation are described by G leason (1988). T heoretical aspects o f balanced resam pling have been investigated by D o and H all (1992b). Balanced sam pling m ethods are related to num ber-theoretical m ethods for integration (Fang an d W ang, 1994), and to L atin hypercube sam pling (M cK ay, Conover and Beckm an, 1979; Stein, 1987; Owen, 1992b). D iaconis and H olm es (1994) discuss the com plete en um eration o f b o o tstrap sam ples by m ethods based on G ray codes. L inear approxim ations were used as control variates in b o o tstra p sam pling by D avison, H inkley an d Schechtm an (1986). A different approach was taken by E fron (1990), w ho suggested the re-centred bias estim ate and the use o f control variates in quantile estim ation. D o and H all (1992a) discuss the properties o f this m ethod, an d provide com parisons with other approaches. F u rth er discussion o f control m ethods is contained in theses by T herneau (1983) and H esterberg (1988). Im portance resam pling was suggested by Jo hns (1988) and D avison (1988), and was exploited by H inkley an d Shi (1989) in the context o f iterated boo tstrap confidence intervals. G igli (1994) outlines its use in param etric sim ulation for regression an d certain tim e series problem s. H esterberg (1995b) suggests the application o f ratio and regression estim ators and o f defensive m ixture distributions in im portance sam pling, an d describes their properties. T he large-sam ple perform ance o f im portance resam pling has been investigated by D o an d H all (1991). B ooth, H all and W ood (1993) describe algorithm s for balanced im portance resampling. B ootstrap recycling was suggested by D avison, H inkley and W orton (1992) and independently by N ew ton and G eyer (1994), following earlier ideas by J. W. Tukey; see M orgenthaler an d Tukey (1991) for application o f sim ilar ideas to robust statistics. Properties o f recycling in various applications are discussed by V entura (1997). S addlepoint m ethods have a history in statistics stretching back to D aniels (1954), an d they have been studied intensively in recent years. R eid (1988) reviews their use in statistical inference, while Jensen (1995) and Field and R onchetti (1990) give longer accounts; see also B arndorff-N ielsen and Cox (1989). Jensen (1992) gives a direct account o f the distribution function ap proxim ation we use. Saddlepoint approxim ation for p erm u tatio n tests was proposed by D aniels (1955) and further discussed by R obinson (1982). D avi son and H inkley (1988), D aniels and Y oung (1991), and W ang (1993b) in
9.7 - Problems
487
vestigate their use in a n u m b er o f resam pling applications, and others have investigated their use in confidence interval estim ation (DiCiccio, M artin and Y oung 1992a,b, 1994). T he use o f approxim ate cum ulant-generating functions is suggested by E aston and R onchetti (1986), G a tto and R onchetti (1996), an d G a tto (1994), while W ang (1992) shows how the approxim ation m ay be m odified to ensure the saddlepoint equation has a single root. W ang (1990) discusses the accuracy o f such m ethods in the b o o tstrap context. B ooth and Butler (1990) show how relationships am ong exponential family distributions m ay be exploited to give saddlepoint approxim ations for a num ber o f b o o t strap and p erm u tatio n inferences, while W ang (1993a) describes an alternative approach for use in finite popu lation problems. T he m arginal approxim ation in Section 9.5.3 extends an d corrects th a t o f D avison, H inkley and W orton (1995); see also Spady (1991). The discussion in Exam ple 9.18 follows H inkley and Wei (1984). Jing an d R obinson (1994) give a careful discussion o f the accuracy o f conditional an d m arginal saddlepoint approxim ations in b o o t strap applications, while C hen and D o (1994) discuss the efficiency gains from com bining saddlepoint m ethods with im portance resampling. O th er m ethods o f variance reduction applied to b o o tstra p sim ulation include antithetic sam pling (H all, 1989a) — see Problem 9.21 — and R ichardson e x trap o latio n (Bickel an d Y ahav, 1988) — see Problem 9.22. A ppendix II o f H all (1992a) com pares the theoretical properties o f some o f the m ethods described in this chapter.
9.7 Problems 1
Under the balanced bootstrap the descending product factorial m om ents o f the /*• are = Y [ m(p“> U
R ^ / i n R ) ^ s">,
(9.53)
V
where / (a) = / ! / ( / — a)!, and Pu =
^
^
vv:rw = u
Qu —
^ ^ Sw> w :jw=v
with u and v ranging over the distinct values o f row and colum n subscripts on the left-hand side o f (9.53). (a) Check the first- and second-order mom ents for the f ’ j at (9.9), and verify that the values in Problem 2.19 are recovered as R — * c o . (b) Use the results from (a) to obtain the mean o f the bias estimate under balanced resampling. (c) N ow suppose that 7” is a linear statistic, and let V — ( R — I)-1 ^ r(Tr' — T ' ) 2 be the estimated variance o f T based on the bootstrap samples. Show that the mean o f V ’ under m ultinom ial sampling is asymptotically equivalent to the mean under hypergeometric sampling, as R increases. (Section 9.2.1; Appendix A ; Haldane, 1940; D avison, Hinkley and Schechtman, 1986)
9 ■Improved Calculation
488 2
Consider the following algorithm for generation o f R balanced bootstrap samples from y = (y i,...,y „ ):
Algorithm 9.3 (Balanced bootstrap 2) Concatenate y with itself R times to form a list
o f length nR.
For I = n R , . . . , 2 : (a) Generate a random integer U in the range 1 ,..., I. (b) Swap a.Vi and < & iv
Show that this produces output equivalent to Algorithm 9.1. Suggest a balanced bootstrap algorithm that uses storage 2n, rather than the Rn used above. (Section 9.2.1; G leason, 1988; Booth, Hall and W ood, 1993) Show that the re-centred estimate o f bias, B r^ , can be approximated by (9.11), and obtain its mean and variance under ordinary bootstrap sampling. Compare the results with those obtained using the balanced bootstrap. (Section 9.2.2; Appendix A ; Efron, 1990) D ata y \ , . . . , y n are sampled from a N{n,a2) distribution and we estimate a by the M LE t = {n-1 Y ( y j ~ >’)2}'/2- The bias o f T can be estimated theoretically:
D
[ 21/2r ( f )
1
l " I/2r ( ¥ )
J
But suppose that we estimate the bias by parametric resampling; that is, we generate samples y[,...,y„ from the N(y,t2) distribution. Show that the raw and adjusted bootstrap estimates o f B can be expressed as
Br =
Y xr/2 ~ 1
and 1 /2
B RM j = n - ' /2 R - ' |
Y
X r /2 -
R i/2
>
( Y X' + X R ‘ +l
where X l, . . . , X R are independent xl-i and X R+\ is independently ■/1R_ l. Use simulation with these representations to show that the efficiencies o f B Rajj are roughly 8 and 16 for n = 10 and 20, for any R. (Section 9.2.2; Efron, 1990) (a) Show that, for large n, the variance o f the bias estimate under ordinary resampling, (9.8), can be written (nA + 2B + C ) / ( R n 2), while the variance o f the bias estimate under balanced resampling, (9.11), is C / ( R n 2); here A, B, and C are quantities o f order one. Show also that the correlation p between a quadratic statistic T ‘ and its linear approximation T'L can be written as (nA + B ) / { n A ( n A + 2B + C )}1/2, and hence verify that the variance o f the bias estimate is reduced by a factor o f 1 — p 2 when balanced resampling is used. (b) Give p in terms o f sample mom ents when t = n~‘ Y ( y j — y ) 2, and evaluate the resulting expression for samples o f sizes n = 10 and 20 simulated from the normal and exponential distributions. (Section 9.2.3)
9.7 • Problems 6
489
Consider generating bootstrap samples y ' {, . . . , y ^ n, r = 1 ,...,R , from y i , . . . , y n. Write y'j = y ^ j ) , where the elements o f the R x n matrix £ ( r , j ) take values in 1 (a) Show that first-order balance can be arranged by ensuring that each column o f £ contains each o f 1 wi t h equal frequency, and deduce that when R = n the matrix is a randomized block design with treatment labels 1, . . . , n and with colum ns as blocks. Hence explain how to generate such a design when R = kn. (b) Use the representation
n f'rj = ' £ H j - Z ( r , i ) } , i= 1
where S(u) = 1 if u = 0 and equals zero otherwise, to show that the ^-balanced design is balanced in terms o f f r’ j. Is the converse true? (c) Suppose that we have a regression m odel Yj = fixj + ej, where the independent errors e; have mean zero and variance a 2. We estimate fi by T = Y j X j / J 2 x jLet T ' = 52(t Xj +e'j )xj/ Y l x ) denote a resampled version o f T , where e’ is selected randomly from centred residuals e; — e, with e} = y, — t x j and e = n ~ l ^ e;. Show that the average value o f T* equals t if R values o f T ' are generated from a
7
(a) Following on from the previous question, a design is s e c o n d - o r d e r <J-ba l a n c e d if all n2 values o f ( £ ( r , i ) , l ; ( r , j ) ) occur with equal frequency for any pair o f columns i and j. With R = n2, show that this is achieved by setting the first colum n o f c to be ( l , . . . , l , 2 , . . . , 2 , . . . , n , . . . , n ) T, the second colum n to be ( 1 , 2 , . 1 , 2 , . . , , n ) T, and the remaining n — 2 colum ns to be the elements o f n — 2 orthogonal Latin squares with treatment labels 1, . . . , n . Exhibit such a design for n = 3. (b) Think o f the design matrix as having rows as blocks, with treatment labels 1 , . . . , n to be allocated within blocks; take R = kn. Explain why a design is said to be s e c o n d - o r d e r / ’ - b a l a n c e d if R
Y l f ' J = k (2 n ~ ! ). r=l
R
J2frjfrk= k(n-l),
j,k = l,...,n,
j + k.
r=l
Such a design is derived by replacing the treatment labels by 0 , . . . , n — 1, choosing k initial blocks with these replacement labels, adding in turn l , 2 , . . . , n — 1, and reducing the values m od n. With n = 5 and k = 3, construct the design with initial blocks (0 ,1 ,2 ,3 ,4 ), (0 ,0 ,0 ,1 ,3 ), and (0 ,0 ,0 ,1 ,2 ), and verify that it is firstand second-order balanced. Can the initial blocks be chosen at random? (Section 9.2.1; Graham e t al., 1990)
8
Suppose that you wish to estimate the normal tail probability f I { z < a}
9 • Improved Calculation
490 9
Suppose that T \ , . . . , T R is a random sample from P D F h( ) and C D F H ( ) , and let g( ) be a P D F with the same support as h( ) and with p quantile t]p, i.e. rip
'- L
rip
*
w
- / . S 5 " w
Let T(! ) < • • • < T {R) denote the order statistics o f the Tr, set
p
1 \ " g(T(r)) R + i ^ H T {r)y
m—\
R
and let M be the random index determined by SM < p < SM+1 - Show that as R —*-oo, and hence justify the estimate t"M given at (9.19). (Section 9.4.1; Johns, 1988) 10
Suppose that T has a linear approximation T[, and let p be the distribution on y y n with probabilities p; oc exp { l l j / ( n v Ll / 2 ) } , where v L = n ~ 2 Y I l j - Find the mom ent-generating function o f T[ under sampling from p, and hence show that in this case T* is approximately N ( t -I- A v j / 2 , v L ). You may assume that T[ is approximately N ( 0, v L ) when A = 0. (Section 9.4.1; Johns, 1988; Hinkley and Shi, 1989)
11
The linear approximation t L ’ for a single-sample resample statistic is typically accurate for t ’ near t, but may not work well near quantiles o f interest. For an approximation that is accurate near the a quantile o f T \ consider expanding t" = t(p’) about p„ = (pi*,... ,p„a) rather than about ( £ ,. . ., £ ) . (a) Show that if pja oc cx-p(n~lv~[[/2z j j ) , then t(pa) will be close to the a quantile o f T" for large n. (b) Define d
,
lj* = ^ t { ( l - f i) p a + e lj] Show that t’ = ?Ljz = t(pa) + n
f]h aj=i
(c) For the ratio estimates in Example 2.22, compare numerically t’L, t'Lfi9 and the quadratic approximation
tQ = t + n^ J l f ? J + j=l
2l n ~ 2 H j = 1 k= 1
fjfk t*
with t ’. (Sections 2.7.4, 3.10.2, 9.4.1; Hesterberg, 1995a) 12
(a) The importance sampling ratio estimator o f n can be written as
E ^ r)w W J 2 W(yr) where si = R 1/2 $3{m(yr)w(jv) implies that
n + R-Vh,
1 + R~,/2Eo ’
A1} an(i ®o = R 1/2 E W O v) — !}• Show that this
var (fiH, at) = K ^ v a r {m(Y )w(Y) - /*w(Y)} .
\ j is a vector with one in the jth position and zeroes elsewhere.
9.7 ■Problems
491
(b) The variance o f the importance sampling regression estimator is approximately var(/iHreg) = R -'v a r {m (Y )w (Y ) - m v (Y )},
(9.54)
where a = cov{m (Y )w (Y ), w (Y )}/var{w (Y )}. Show that this choice o f a achieves minimum variance am ong estimators for which the variance has form (9.54), and deduce that when R is large the regression estimator will always have variance no larger than the raw and ratio estimators. (c) As an artificial illustration o f (b), suppose that for 0 > 0 and som e non-negative integer k we wish to estimate
/
m(y ) g( y ) dy =
/•°° vke~y - j j - x e e ~ eydy
by simulating from density h(y) = fie~^y, y > 0, fi > 0. Give w(y) and show that E{m (Y )w (Y )} = n for any fi and 6, but that var(£//rat) is only finite when 0 < fi < 2 6 . Calculate var{m (Y)w (Y )}, cov{m (Y )w (Y ), w (Y)}, and var{w(Y)}. Plot the asymptotic efficiencies var(/i;; raw) / var(£// ra, ) and var(/*//ratv) / var(^Wrfg) as functions o f fi for 0 = 2 and fc = 0 ,1 ,2 ,3 . Discuss your findings. (Section 9.4.2; Hesterberg, 1995b) 13
Suppose that an application o f importance resampling to a statistic T" has resulted in estimates tj < ■■■< t'R and associated weights w”, and that the importance re weighting regression estimate o f the C D F o f T" is required. Let A be the R x R matrix w hose (r,s) element is w“/ ( t “ < t‘ ) and B be the R x 2 matrix whose rth row is ( I X ) . Show that the regression estimate o f the C D F at t \ , . . . , t ’R equals (1,1 ) ( BTB ) ~ i B TA. (Section 9.4.2)
14
(a) Let h = ( h \ , . • ■,hn), k = 1, . . . , n R , denote independent identically dis tributed multinomial random variables with denominator 1 and probability vector p = (p\, . . . ,p„). Show that SnK = Yl k =l ^ ^as a multinomial distribution with denominator n R and probability vector p, and that the conditional distribution o f I nR given that S„R = q is multinomial with denominator 1 and mean vector (nR) ~{q , where q = ( R i , . . . , R „ ) is a fixed vector. Show also that Prf/]
i i , . . . ,/ni?
InR I SnR
q)
equals
nfi-l g(inR I S„r = q)
g (inR-j |
= q — i„R-J+] — ■■■— i„Rj ,
i =i
where g( ) is the probability mass function o f its argument. (b) U se (a) to justify the following algorithm:
Algorithm 9.4 (Balanced importance resampling) Initialize by setting values o f R i , . . . , R „ such that Rj = n R Pj and For m = n R , . . . , 1:
= n ^-
(a) Generate u from the 1/(0,1) distribution. (b) Find the j such that £ 1=i Ri < um < Y2i=i Ri fe) Set I m = j and decrease Rj to Rj — 1. Return the sets {I„+l, . . . , I 2n}, •••, { /n(R_i)+1, ...,/„ « } as the indices o f the R bootstrap samples o f size n. •
(Section 9.4.3; Booth, Hall and Wood, 1993)
9 • Improved Calculation
492 15
For the bootstrap recycling estimate o f bias described in Example 9.12, consider the case T = Y with the parametric m odel Y ~ N ( 0 , 1). Show that if H is taken to be the N ( y , a ) distribution, then the simulation variance o f the recycling estimate o f C is approximately
1
i
n
R
/ a2 y + \2 « -l/
~ 1)/2 r « ( « - ! ) 1 I (2a - 3)3/2 R N
°2 11 8 (a - \ f ' 2 N J J ’
provided a > Compare this to the simulation variance when ordinary double bootstrap methods are used. What are the im plications for nonparametric double bootstrap calculations? In vestigate the use o f defensive mixtures for H in this problem. (Section 9.4.4; Ventura, 1997) 16
Consider exponential tilting for a statistic whose linear approximation is
where the ( / ' , , . . . , f ‘„s), s = 1 ,..., S, are independent sets o f m ultinom ial frequen cies. (a) Show that the cumulant-generating function o f T I is s K { 0 = ft + Y
f 1 "s n* lo6 \ ~
s=l
I
exP ( ^ y M ) t= 1
Hence show that choosing £ to give K ' ( ^ ) = t0 is equivalent to exponential tilting o f T [ to have mean to, and verify the tilting calculations in Example 9.8. (b) Explain how to modify (9.26) and (9.28) to give the approximate P D F and C D F o f T[. (c) How can stratification be accom m odated in the conditional approximations o f Section 9.5.2? (Section 9.5) 17
In a matched pair design, two treatments are allocated at random to each o f n pairs o f experimental units, with differences dj and average difference d = n~l J2 djI f there is no real effect, all 2" sequences + d i , . . . , + d n are equally likely, and so are the values D" = n~l J2^j^j> where the Sj take values + 1 with probability The one-sided significance level for testing the null hypothesis o f no effect is Pr*(D* > d). (a) Show that the cumulant-generating function o f D ' is n
K(£) = Y
io g c o sh (Zdj/n),
i=i and find the saddlepoint equation and the quantities needed for saddlepoint approximation to the observed significance level. Explain how this may be fitted into the framework o f a conditional saddlepoint approximation. (b) See Practical 9.5. (Section 9.5.1; Daniels, 1958; D avison and Hinkley, 1988) 18
For the testing problem o f Problem 4.9, use saddlepoint methods to develop an approximation to the exact bootstrap P-value based on the exponential tilted EDF. Apply this to the city population data with n = 10. (Section 9.5.1)
9.7 ■Problems
493
19
(a) If W \ , . . . , W „ are independent Poisson variables with means show that their joint distribution conditional on J2j = m is multinomial with probability vector n = (fi\ ^ fij and denominator w. Hence justify the first saddlepoint approximation in Example 9.16. (b) Suppose that T* is the solution to an estimating equation o f form (9.32), but that f j = 0 or 1 and f j = m < n; T" is a delete-m jackknife value o f the original statistic. Explain how to obtain a saddlepoint approximation to the P D F o f T ’. How can this P D F be used to estimate var*(T‘)? D o you think the estimate will be good when m = n — 1 ? (Section 9.5.2; Booth and Butler, 1990)
20
(a) Show that the bootstrap correlation coefficient t ’ based on data pairs ( x j , Zj), j = 1, . . . , n , may be expressed as the solution to the estimating equation (9.40) with Xj-Si
/
Zj
Oj ( t , s ) =
( Xj
\
- s2 Si)2
53
- s2j2 - s4 ( Xj - Si ) ( Zj - S2) - t{s3s4)1/2 J (Zj
V
where s T = (s1,s 2 ,s 3,s 4), and show that the Jacobian J ( t , s ; £ ) = n5(s 3s4)1/2. Obtain the quantities needed for the marginal saddlepoint approximation (9.43) to the density o f T*. (b) W hat further quantities would be needed for saddlepoint approximation to the marginal density o f the studentized form o f T ‘ ? (Section 9.5.3; D avison, Hinkley and Worton, 1995; DiCiccio, Martin and Young, 1994) 21
Let T[‘ be a statistic calculated from a bootstrap sample in which appears with frequency f j (j = 1, . ..,n ) , and suppose that the linear approximation to T ' is T [ = t + n~‘ Y s f j h ’ where /i < k < ■ ■ ■ < / „ . The statistic r2 * antithetic to T,' is calculated from the bootstrap sample in which y, appears with frequency /* +l .. (a) Show that if T [ and r 2“ are antithetic,
var{i(7Y + r 2*)} = J-n
(n-l j 2 lJ + »~l E bh' n+l - j \
7=1
,
7=1
and that this is roughly x2/ 2 n as n—► 00, where
and t]p is the pth quantile o f the distribution o f L t( Y ;F). (b) Show that if T j is independent o f r,' the corresponding variance is
and deduce that when T is the sample average and F is the exponential distribution the large-sample performance gain o f antithetic resampling is 6 /(1 2 — n 2) = 2.8. (c) W hat happens if F is symmetric? Explain qualitatively why. (Hall, 1989a)
9 - Improved Calculation
494 22
Suppose that resampling from a sample o f size n is used to estimate a quantity z(n) with expansion z(n) = zQ+ n~az\ + n~2az2 -\----- ,
(9-55)
where zo, zi, z2 are unknown but a is known; often a = j . Suppose that we resample from the E D F F, but with sample sizes nQ, m , where 1 < no < n t < n, instead o f the usual n, giving simulation estimates z ' ( n0), z ' ( n t ) o f z(n0), z( n x). (a) Show that z*(n) can be estimated by
z‘(n) =
z ’ (no) +
^
no
^
n,
(z‘(n0) - z > i ) } •
(b) N ow suppose that an estimate o f z ’(n; ) based on Rj simulations has variance approximately b / R j and that the com putational effort required to obtain it is cnjRj, for some constants b and c. Given no and ni, discuss the choice o f R q and R\ to minimize the variance o f z"(n) for a given total com putational effort. (c) Outline how knowledge o f the limit zo in (9.55) can be used to improve z ’(n). How would you proceed if a were unknown? D o you think it wise to extrapolate from just two values no and ? (Bickel and Yahav, 1988)
9.8 1
Practicals For ordinary bootstrap sampling, balanced resampling, and balanced resampling within strata:
y <- rnorm(lO) junk.fun <- function(y, i) var(y[i]) junk <- boot(y, junk.fun, R=9) b o o t .array(junk) apply(junk$t,2,sum) junk <- boot(y, junk.fun, R=9, sim="balanced") b o o t .array(junk) apply(j unk$t,2,sum) junk <- boot(y, junk.fun, R=9, sim="balanced", strata=rep(l:2,c(5,5))) boot.array(j unk) apply(junk$t,2,sum) N ow use balanced resampling in earnest to estimate the bias for the gravity data weighted average:
grav.fun <- function(data, i) { d <- data[i,] m <- tapply(d$g,d$series,mean) v <- tapply(d$g,d$series,var) n <- table(d$series) v <- (n-l)*v/n c(sum(m*n/v)/sum(n/v), sum(n/v)) } grav.bal <- boot(gravity, grav.fun, R=49, strata=gravity$series, sim="balanced") mean (grav.bal$t [, 1] ) -grav.bal$tO [1] For the adjusted estimate o f bias:
Practicals
495
grav.ord <- boot(gravity, grav.fun, R=49, strata=gravity$series) control(grav.ord,bias.adj =T) N ow a more systematic comparison, with 40 replicates each with R = 19:
R <- 19; nreps <- 40; bias <- matrix(,nreps,3) for (i in 1.’ nreps) { grav.ord <- boot(gravity, grav.fun, R=R, strata=gravity$series) grav.bal <- boot(gravity, grav.fun, R=R, strata=gravity$series, sim="balanced") bias[i,] <- c(mean(grav.ord$t[,l])-grav.ord$tO[l] , mean(grav.bal$t[,1])-grav.bal$t0[1], control(grav.ord,bias.adj=T)) } bias apply(bias,2,mean) apply(bias,2,var) split.screen(c(l,2)) screen(l) qqplot(bias [,1],bias [,2J ,xlab="ordinary",ylab="balanced") abline(0,1,lty=2) screen(2) qqplot (bias [ ,2],bias[,3],xlab="balanced",ylab="adjusted") abline(0,1,lty=2) W hat are the efficiency gains due to using balanced simulation and post-simulation adjustment for bias estimation here? N ow a calculation to see the correlation between T ' and its linear approximation:
grav.ord <- boot(gravity, grav.fun, R=999, strata=gravity$series) grav.L <- empinf(grav.ord,type="reg") tL <- linear.approx(grav.ord,grav.L,index=l) close.screen(all=T) plot(tL,grav.ord$t[,1]) cor(tL,grav.ord$t[,1]) Finally, calculations for the estimates o f bias, variance and quantiles using the linear approximation as control variate:
grav.cont <- control(grav.ord,L=grav.L,index=l) grav.contSbias grav.cont$var grav.cont$quantiles To use importance resampling to estimate quantiles o f the contrast o f averages for the tau data o f Practical 3.4, we first set up strata, a weighted version o f the statistic t, a contrast o f averages, and calculate the empirical influence values:
tau.w <- function(data, w) { d <- data$rate*w d <- tapply(d,data$decay,sum)/tapply(w,data$decay,sum) d[l]-sum(d [-1]) } tau.L <- empinf(data=tau, statistic=tau.w, strata=tau$decay) We could use exponential tilting to find distributions tilted to 14 and 18 (the original value o f t is 16.16):
496
9 ■Improved Calculation
e x p . t i l t ( t a u . L , t h e t a = c ( 1 4 , 1 8 ) ,t 0 = 1 6 .1 6 ) Function t i l t . b o o t does this automatically. Here we do 199 bootstraps without tilting, then 100 each tilted to the 0.05 and 0.95 quantiles o f these 199 values o f t". We then display the weights, without and with defensive mixture distributions: t a u . t i l t < - t i l t . b o o t ( t a u , t a u . w, R=c( 1 9 9 ,1 0 0 ,1 0 0 ) ,s t r a ta = ta u $ d e c a y , s t y p e = "w", L = ta u . L, a lp h a = c ( 0 . 0 5 , 0 . 9 5 ) ) s p l i t . s c r e e n (c (1 ,2 ) ) s c r e e n ( l) ; p lo t ( t a u .t ilt $ t ,im p .w e ig h t s ( t a u .t ilt ) ,lo g = " y " ) s c r e e n ( 2 ) ; p l o t ( t a u . t i l t $ t , im p. w e ig h t s ( t a u . t i l t , d e f= F ), lo g = " y ") The corresponding estimated quantiles are i m p .q u a n t i l e ( t a u . t i l t , a l p h a = c ( 0 . 0 5 , 0 . 9 5 ) ) im p. q u a n t i l e ( t a u . t i l t , a lp h a = c ( 0 . 0 5 , 0 . 9 5 ) , def=F ) The same can be done with frequency sm oothing, but then the initial value o f R must be larger: t a u .f r e q < - t i l t . b o o t ( t a u , ta u .w , R=c( 4 9 9 ,2 5 0 ,2 5 0 ) , s t r a ta = ta u $ d e c a y , stype="w ", t i l t = F , a lp h a = c ( 0 . 0 5 , 0 . 9 5 ) ) im p .q u a n t il e ( t a u .f r e q ,a lp h a = c ( 0 . 0 5 , 0 . 9 5 ) ) For balanced importance resampling we simply add sim ="balanced" to the argu ments o f t i l t . b o o t . For a small simulation study to see the potential efficiency gains over ordinary sampling, we compare the performance o f ordinary sampling and importance resampling with and without balance, in estimating the 0.1 and 0.9 quantiles o f the distribution o f t". t a u . t e s t < - NULL f o r ( i r e p in 1 :1 0 ) { ta u .b o o t < - b o o t ( t a u , ta u .w , R=199, stype="w ", str a ta = ta u $ d e c a y ) q .o r d < - s o r t ( t a u . b o o t $ t ) [ c ( 2 0 , 1 8 0 )] t a u . t i l t < - t i l t . b o o t ( t a u , ta u .w , R = c ( 9 9 ,5 0 ,5 0 ) , s t r a ta = ta u $ d e c a y , stype="w ", L = tau .L , a lp h a = c ( 0 . 1 , 0 . 9 ) ) q . t i l t < - i m p . q u a n t i l e ( t a u . t i l t , a lp h a = c ( 0 . 1 , 0 . 9 ) ) $raw t a u .b a l < - t i l t . b o o t ( t a u , ta u .w , R = c ( 9 9 ,5 0 ,5 0 ) , s tr a ta = ta u $ d e c a y , stype="w ", L = tau .L , a lp h a = c ( 0 .1 , 0 . 9 ) , sim ="balanced") q .b a l < - i m p .q u a n t il e ( t a u .b a l , a lp h a = c ( 0 .1 , 0 .9 ))$ r a w t a u . t e s t < - r b i n d ( t a u . t e s t , c ( q .o r d , q . t i l t , q . b a l ) ) > s q r t ( a p p l y ( t a u . t e s t , 2 , v a r )) W hat are the efficiency gains o f the two importance resampling methods? Consider the bias and standard deviation functions for the correlation o f the c la r id g e data (Example 4.9). To estimate them, we perform a double bootstrap and plot the results, as follows. c l a r .f u n < - f u n c t io n ( d a t a , f ) { r <- c o r r (d a ta , f/s u m (f)) n < - n row (d ata) d <- d a ta [r e p ( 1 : n ,f ) ,] us <- ( d [ ,l] - m e a n ( d [ ,l ] ) ) /s q r t ( v a r ( d [ ,l] ) ) x s < - ( d [ ,2 ] - m e a n ( d [ ,2 ] ) ) / s q r t ( v a r ( d [ ,2 ] ) )
9.8 ■Practicals
497
L <- us*xs - r*(us"2+xs~2)/2 v <- sum((L/n)*2) clar.t <- boot(d, corr, R=25, stype="w")$t i <- is.na(clar.t) clar.t <- clar.t[!i] c(r, v, mean(clar.t)-r, var(clar.t), sum(i)) > clar.boot <- boot(claridge, clar.fun, R=999, stype="f") split.screen(c(1,2)) screen(l) plot (clar .boot$t [, 1] ,clar .boot$t [,3] ,pch=" ." , xlab="theta*",ylab="bias") lines(lowess(clar.boot$t[,l] ,clar.boot$t[,3] ,f=l/2) ,lwd=2) screen(2) plot(clar.boot$t[,1],sqrt(clar,boot$t [ , 4 ] ) ,pch=".", xlab="theta*",ylab="SD") 1 <- lowess(clar.boot$t[,l] ,clar.boot$t[,4] ,f=l/2) lines(l$x,sqrt(l$y),lwd=2) To obtain recycled estimates using only the results from a single bootstrap, and to compare them with those from the double bootstrap:
clar.rec <- boot(claridge, corr, R=999, stype="w") IS.ests <- function(theta, boot.out, statistic, A=0.2) { f <- smooth.f(theta,boot.out,width=A) theta.f <- statistic(boot.out$data,f/sum(f)) IS.w <- imp.weights(boot.out,q=f) moms <- imp.moments(boot.out,t=boot.out$t[,1]-theta.f,w=IS.w) c(theta, theta.f, moms$raw, moms$rat, moms$reg) } IS.clar <- matrix(,41,8) theta <- seq(0,0.8,length=41) for (j in 1:41) IS.clar[j,] <- IS.ests (theta [j] ,clar .rec, corr) screen(l,new=F) lines(IS.clar[,2],I S .clar [,7]) lines (IS. clar [, 2] ,IS.clar[,5] ,lty=3) lines(IS.clar [,2] ,IS.clar[,3] ,lty=2) screen(2, new=F) lines(IS.clar[,2],sqrt(IS.clar[,8])) lines(IS.clar[,2],sqrt(IS.clar[,6]),lty=3) lines(IS.clar [,2],sqrt(IS.clar[,4]) ,lty=2) D o you think these results are close enough to those from the double bootstrap? Compare the values o f 9 in I S .c la r [, 1] to the values o f O' = t(Fg) in I S .c la r [, 2]. Dataframe c a p a b i l i t y gives “data” from Bissell (1990) comprising 75 successive observations with specification limits U = 5.79 and L = 5.49; see Problem 5.6 and Practical 5.4. Suppose that we wish to use the range o f blocks o f 5 observations to estimate <x, in which case 0 = k / r 5, where k = ( U — L ) d 5. Then 8 is the root o f the estimating equation ^T;(/c — r5j-0) = 0; this is just a ratio statistic. We estimate the P D F o f 6' by saddlepoint methods as follow s:
psi <- function(tt, r, k=2.236*(5.79-5.49)) k-r*tt psil <-function(tt, r, k=2.236*(5.79-5.49)) r det.psi <- function(tt, r, xi) { p <- exp(xi * psi(tt, r)) length(r) * abs(sum(p * psil(tt,r))/sum(p)) }
498
9 • Improved Calculation r5 <- apply(matrix(capability$y,15,5,byrow=T), 1, function(x) diff(range(x))) m <- 300; top <- 10; bot <- 4 sad <- matrix(, m, 3) th <- seq(bot,top,length=m) for (i in l:m) { sp <- saddle(A=psi(th[i], r 5 ) , u=0) sad[i,] <- c(th[i] , sp$spa[l] *det .psi(th[i] , r5, xi=sp$zeta.hat) , sp$spa[2]) } sad <- sad[! is.na(sad[,2] )&!is.na(sad[,3] ) ,] plot(sad[,l],sad[,2],type="l",xlab="theta hat",ylab="PDF") To obtain the quantiles o f the distribution o f 6', we use the following code; here ca p a b .tO contains 9 and its standard error.
theta.fun <- function(d, w, k = 2 .236*(5.79-5.49)) k*sum(w)/sum(d*w) capab.v <- v a r .linear(empinf(data=r5, statistic=theta.fun)) capab.tO <- c (2.236*(5.79-5.49)/mean(r5),sqrt(capab.v)) Afn <- function(t, data, k=2.236*(5.79-5.49)) k-t*data ufn <- function(t, data, k=2.236*(5.79-5.49)) 0 capab.sp <- saddle.distn(A=Afn, u=ufn, t0=capab.t0, data=r5) capab.sp We can use the same ideas to apply the block bootstrap. N ow we take b = 15 o f the n — I + 1 blocks o f successive observations o f length / = 5. We concatenate them to form a new series, and then take the ranges o f each block o f successive observations. This is equivalent to selecting b ranges from among the n — I + 1 possible ranges, with replacement. The quantiles o f the saddlepoint approximation to the distribution o f 6’ under this scheme are found as follows.
r5 <- NULL for (j in 1:71) r5 <- c(r5, diff(range(capability$y[j:(j+4)]))) Afn <- function(t, data, k=2.236*(5.79-5.49)) cbind(k-t*data,1) ufn <- function(t, data, k=2.236*(5.79-5.49)) c(0,15) capab.spl <- saddle.distn(A=Afn,u=ufn,wdist="p", type="cond",t0=capab.t O ,data=r5) capab.spl$quantiles Compare them with the quantiles above. How do they differ? W hy? 5
To apply the saddlepoint approximation given in Problem 9.17 to the paired com parison data o f Problem 4.7, and obtain a one-sided significance level P r'(D ’ > d):
K <- function(xi) sum(log(cosh(xi*darwin$y)))-xi*sum(darwin$y) K2 <- function(xi) sum(darwin$y~2/cosh(xi*darwin$y)~2) darwin.saddle <- saddle(K.adj=K,K2=K2) darwin.saddle 1-darwin.saddle$spa[2]
10 Semiparametric Likelihood Inference
10.1 Likelihood The likelihood function is central to inference in param etric statistical models. Suppose th a t d a ta y are believed to have com e from a distribution F w, where xp is an unknow n p x 1 vector param eter. T hen the likelihood for rp is the corresponding density evaluated a t y , nam ely L (w )
=
/ v (y ).
regarded as a function o f xp. This m easures the plausibility o f the different values o f ip which m ight have given rise to y , and can be used in various ways. If furth er inform ation ab o u t \p is available in the form o f a p rio r probability density, n(xp), Bayes’ theorem can be used to form a posterior density for ip given the d a ta y , " (V I
■
Inferences regarding xp o r other quantities o f interest m ay then be based on this density, which in principle contains all the inform ation concerning xp. If p rio r inform ation a b o u t xp is n o t available in a probabilistic form, the likelihood itself provides a basis for com parison o f different values o f xp. T he m ost plausible value is th a t which m aximizes the likelihood, nam ely the m a x i m u m l i keli hood e st imate, xp. The relative plausibility o f o ther values is m easured in term s o f the log likelihood t?(xp) = log L(xp) by the l i keli hood ratio statistic W{ y >) = 2 { t ( \ p ) - / ( x p ) } .
A key result is th a t u nder repeated sam pling o f d a ta from a regular model, W(xp) has approxim ately a chi-squared distribution w ith p degrees o f freedom. T his form s the basis for the prim ary m ethod o f calculating confidence regions
499
10 ■Semiparametric Likelihood Inference
500
in param etric models. O ne special feature is th a t the likelihood determ ines the shape o f confidence regions when xp is a vector. Unlike m any o f the confidence interval m ethods described in C h apter 5, likelihood provides a n a tu ra l basis for the com bination o f inform ation from different experim ents. If we have tw o independent sets o f data, y and z, th a t bear on the sam e p aram eter, the overall likelihood is simply L(xp) = f ( y I W)f(z I and tests an d confidence intervals concerning 1p m ay be based on this. This type o f com bination is particularly useful in applications where several in dependent experim ents are linked by com m on param eters; see Practical 10.1. In applications we can often w rite xp = ( 6 ,X), where the com ponents o f 8 are o f prim ary interest, while the so-called nuisance param eters X are o f secondary concern. In such situations inference for 8 is based on the profile likelihood, L p( 6 ) = m ax L ( 8 , X),
(10.1)
which is treated as if it were a likelihood. In some cases, particularly those where X is high dim ensional, the usual properties o f likelihood statistics (consistency o f m axim um likelihood estim ate, approxim ate chi-squared distribution o f log likelihood ratio) d o n o t apply w ithout m aking an adjustm ent to the profile likelihood. The adjusted likelihood is L a(8 )
=
L p ( o m e ,
i
(io.2)
where % is the M L E o f X for fixed 6 and jx(xp) is the observed inform ation m atrix for X, i.e. jx(xp) = —d 2 f(xp)/dXdXT . W ithout a p aram etric m odel the definition o f a param eter is m ore vexed. As in C h ap ter 2, we suppose th a t a p aram eter d is determ ined by a statistical function t(-), so th a t 8 = t(F) is a m ean, m edian, o r o ther quantity determ ined by, b u t n o t by itself determ ining, the unknow n distribution F. N ow the nuisance param eter Xis all aspects o f F o th er th a n t(F), so th a t in general X is infinite dim ensional. N o t surprisingly, there is no unique way to construct a likelihood in this situation, and in this ch ap ter we describe some o f the different possibilities.
10.2 Multinomial-Based Likelihoods 10.2.1 Empirical likelihood Scalar parameter Suppose th a t observations form a ran d o m sam ple from an unknow n distribution F, an d th a t we wish to construct a likelihood for a scalar param eter 8 = t(F), where t( ) is a statistical function. O ne view o f the E D F F is th a t it is the nonparam etric m axim um likelihood estim ate o f F, with corresponding
10.2 ■Multinomial-Based Likelihoods
501
n o n p aram etric m axim um likelihood estim ate t = t(F) for 9 (Problem 10.1). T he E D F is a m ultinom ial distribution w ith d enom inator one and probability vector («_1, . . . , n _1) attached to the yj. We can think o f this distribution as em bedded in a m ore general m ultinom ial distribution with a rb itrary probability vector p = (pi ,. . . ,p„) attached to the d a ta values. If F is restricted to be such a m ultinom ial distribution, then we can w rite t(p) rath er than t(F) for the function which defines 8 . The special m ultinom ial probability vector (n_1, . . . , n _1) corresponding to the E D F is p, and t = t(p) is the nonparam etric m axim um likelihood estim ate o f 6 . This m ultinom ial representation was used earlier in Sections 4.4 an d 5.4.2. R estricting the m odel to be m ultinom ial on the d a ta values with probability vector p, the p aram eter value is 9 = t{p) and the likelihood for p is L(p) = n " = i P^j >with / j equal to the frequency o f value yj in the sample. But, assum ing there are no tied observations, all f j are equal to 1, so th a t U p ) = p x x • • • x pn: this is the analogue o f L(i/;) in the param etric case. We are interested only in 9 = t(p), for which we can use the profile likelihood L e l {Q)=
n sup TT Pj, p-Ap)=ejJi
(10.3)
which is called the empirical likelihood for 9. N otice th a t the value o f 9 which m axim izes L El { 8 ) corresponds to the value o f p m axim izing L(p) with only the constrain t Y l Pj = 1> th a t is p. In other words, the em pirical likelihood is m axim ized by the nonparam etric m axim um likelihood estim ate t. In (10.3) we m axim ize over the p; subject to the constraints im posed by fixing t(p) = 9 an d Y l Pj = 1> which is effectively a m axim ization over n — 2 quantities w hen 9 is scalar. R em arkably, although the num ber o f param eters over which we m axim ize is com parable with the sam ple size, the approxim ate d istributional results from the p aram etric situation carry over. Let do be the true value o f 8 , w ith T the m axim um em pirical likelihood estim ator. T hen und er mild conditions on F and in large samples, the em pirical likelihood ratio statistic W e l (90) = 2 {log L e l ( T ) - log L e l ( 6 o)} has an approxim ate chi-squared distribution w ith d degrees o f freedom. A l though the lim iting distribution o f W e l (8 q) is the same as th at o f W p(8 o) und er a correct p aram etric m odel, such asym ptotic results are typically less useful in the nonparam etric setting. This suggests th at the b o o tstrap be used to calibrate em pirical likelihood, by using quantiles o f b o o tstrap replicates o f A W e l (9q), i.e. quantiles o f W ^ L( 8 ). This idea is outlined below. Exam ple 10.1 (Air-conditioning d ata) We consider the em pirical likelihood for the m ean o f the larger set o f air-conditioning d a ta in Table 5.6; n = 24
10 • Semiparametric Likelihood Inference
502
o
Figure 10.1 Likelihood and log likelihoods for the mean of the air-conditioning data: empirical (dots), exponential (dashes), and gamma profile (solid). Values of 6 whose log likelihood lies above the horizontal dotted line in the right panel are contained in an asymptotic 95% confidence set for the true mean.
o
00 d
o
JO
o
CvJ
o o o
40
60
80
100
120
40
60
80
100
120
theta
theta
and y = 64.125. T he m ean is d = f yd F ( y) , which equals Y j P j y j f ° r the m ultinom ial distribution th a t p u ts m asses pj on the yj. F or a specified value o f 8 , finding (10.3) is equivalent to m axim izing E l ° g P ; w ith respect to subject to the constraints th a t E Pj = 1 ar,d Pjyj = Use o f L agrange m ultipliers gives pj oc {1 + rjoiyj — 0)}_1, where the L agrange m ultiplier rjg is determ ined by 8 and satisfies the equation (10.4)
T hus the log em pirical likelihood, norm alized to have m axim um zero, is n
(10.5)
This is m axim ized at the sam ple average 8 = y, where ye = 0 and Pj = n_1. It is undefined outside (m iny^, m ax y 7-), because no m ultinom ial distribution on the yj can have m ean outside this interval. Figure 10.1 shows L e l (&), which is calculated by successive solution o f (10.4) to yield tjg at values o f 8 small steps apart. T he exponential likelihood and gam m a profile likelihood for the m ean are also shown. As we should expect, the gam m a profile likelihood is always higher th a n the exponential likelihood, which corresponds to the gam m a likelihood b u t w ith shape param eter k = 1. Both param etric likelihoods are w ider th a n the em pirical likelihood. D irect com parison betw een p aram etric an d em pirical likelihoods is misleading, how ever, since they are based on different m odels, an d here and in later figures
10.2 • Multinomial-Based Likelihoods
Figure 10.2 Simulated empirical likelihood ratio statistics (left panel) and gamma profile likelihood ratio statistics (right panel) for exponential samples of size 24. The dotted line corresponds to the theoretical x\ approximation.
y To sw o 2 "D
503
O W sw o 2 ■D
o o
o o
JO
o OJ E E
Q .
aj 0
E
LU
Chi-squared quantiles
Chi-squared quantiles
we give the gam m a likelihood purely as a visual reference. The circum stances in which em pirical an d p aram etric likelihoods are close are discussed in P rob lem 10.3. The endpoints o f an approxim ate 95% confidence interval for 8 are obtained by reading off where £ e l ( 8 ) = 501,0.95, where c^a is the a quantile o f the chisquared distribution w ith d degrees o f freedom. The interval is (43.3,92.3), which com pares well w ith the n onparam etric B C a interval o f (42.4,93.2). The likelihood ratio intervals for the exponential and gam m a m odels are (44.1,98.4) and (44.0,98.6). Figure 10.2 shows the em pirical likelihood and gam m a profile likelihood ratio statistics for 500 exponential sam ples o f size 24. T hough good for the param etric statistic, the chi-squared approxim ation is poor for W EL, whose estim ated 95% quantile is 5.92 com pared to the xj quantile o f 3.84. This suggests strongly th a t the em pirical likelihood-based confidence interval given above is too narrow . However, the sim ulations are only relevant w hen the d a ta are exponential, in which case we would n o t be concerned w ith em pirical likelihood. We can use the b o o tstrap to estim ate quantiles for W e l ( 8 o ), by setting 6 q = y and then calculating W ’ ( 6 q ) for b o o tstrap sam ples from the original data. The resulting Q -Q p lo t is less extrem e th a n the left panel o f Figure 10.2, w ith a 95% quantile estim ate o f 4.08 based on 999 b o o tstrap sam ples; the corresponding em pirical likelihood ratio interval is (42.8,93.3). W ith a sam ple o f size 12, 41 o f the 999 sim ulations gave infinite values o f W e l ( 6 q) because y did not lie w ithin the lim its (m in y ',m a x y * ) o f the b o o tstrap sample. W ith a sam ple o f size 24, this problem did n o t arise. ■
10 ■Semiparametric Likelihood Inference
504 V ector p a ra m eter
In principle, em pirical likelihood is straightforw ard to construct when 6 has dim ension d < n — 1. Suppose th a t 9 = (91, . . . , 8 d)T is determ ined implicitly as the root o f the sim ultaneous equations
/
u( 9; y) dF ( y) = 0,
i= l,...,d,
where u(9;y) is a d x 1 vector w hose ith elem ent is Ui(9;y). T hen the estim ate 9 is the solution to the d estim ating equations n
=
( 10.6 )
;'=i A n extension o f the argum ent in Exam ple 10.1, involving the vector o f L a grange m ultipliers rjg = (t]ou- ■■, f]od)T, shows th a t the log em pirical likelihood is n S e l ( 0) = ~ £ log {1 + n l U j ( 9 ) } , (10.7) i =i where uj(9) = u(9;yj). T he value o f rjg is determ ined by 9 through the. sim ultaneous equations V
- UjiTd )- - = 0 . 1 + tiju j(8)
(10.8)
The sim plest approxim ate confidence region for the true 9 is the set o f values such th at W EL( 0 ) < q j _ „ b u t in sm all sam ples it will again be preferable to replace the Xd quantile by its b o o tstrap estim ate.
10.2.2 Empirical exponential family likelihoods A n o th er data-b ased m ultinom ial likelihood can be based on an em pirical expo nential family construction. Suppose th a t 9 \ , . . . , 9 i are defined as the solutions to the equations (10.6). T hen ra th e r th a n p u ttin g probability n_1{ l+ f /J u ; (0)}_1 on yj, corresponding to (10.7), we can take probabilities proportional to e x p { £ jUj(9)} ■ this is the exponential tilting construction described in Ex am ple 4.16 an d in Sections 5.3 an d 9.4. H ere = (iei, ■■■, ied)T is determ ined by 9 through n
E M;( 0 ) e x p { ^ u ; ( 0 ) } = O . j=i
(10.9)
This is analogous to (10.8), b u t it m ay be solved using a program th at fits regression m odels for Poisson responses (Problem 10.4), which is of ten m ore convenient to deal w ith th an the optim ization problem s posed
10.2 ■Multinomial-Based Likelihoods
505
by em pirical likelihood. T he log likelihood obtained by integrating (10.9) is £e e f ( 6 ) = X !exP { £ j“ /(0)}- This can be close to £el{&), which suggests th a t b o th the corresponding log likelihood ratio statistics share the same rather slow ap p ro ach to their large-sam ple distributions. In ad d itio n to likelihood ratio statistics from em pirical exponential families an d em pirical likelihood, m any other related statistics can be defined. For exam ple, we can regard as the p aram eter in a Poisson regression m odel and construct a quad ratic form
u ,( 0 ) j
(10.10)
based on the score statistic th a t tests the hypothesis = 0. T here is a close parallel betw een Q e e f ( O ) an d the q u adratic form s used to set confidence regions in Section 5.8, b u t the nonlinear relationship betw een 6 and Q e e f ( 6) m eans th a t the contours o f (10.10) need not be elliptical. As discussed there, for exam ple, theory suggests th a t w hen the true value o f 6 is 9o, Q e e f (&o) has a large-sam ple x i distribution. T hus an approxim ate 1 — a confidence region for 8 is the set o f values o f 6 for which Q e e f (&) does n o t exceed q ,i_a. A nd as there, it is generally b etter to use b o o tstrap estim ates o f the quantiles o f Q e e f (0)-
Example 10.2 (Laterite data) We consider again setting a confidence region based on the d a ta in Exam ple 5.15. Recall th at the quantity o f interest is the m ean p o lar axis, a( 8 , (p) = (cos 6 cos 0 , cos 9 sin cj>, sin 0)T, which is the axis given by the eigenvector corresponding to the largest eigen value o f E ( y y T). T he d a ta consist o f positions on the lower half-sphere, or equivalently the sam ple values o f a(9, (j>), which we denote by yj, j = 1 ,..., n. In o rd er to set an em pirical likelihood confidence region for the m ean p o lar axis, or equivalently for the spherical polar coordinates (9, cj)), we let b(9,
c(9, >) = (— sin (p, — cos
denote the unit vectors o rthogonal to a(0,4>). T hen since the eigenvectors o f E (y Y T) m ay be taken to be orthogonal, the p o pulation values o f (0,
c(0, <j))T E ( Y Y T )a(9,
10 - Semiparametric Likelihood Inference
506
Figure 10.3 Contours of W Ei (left) and Qeef (right) for the mean polar axis, in the square region shown in Figure 5.10. The dashed lines show the 95% confidence regions using bootstrap quantiles. The dotted ellipse is the 95% confidence region based on a studentized statistic (Fisher, Lewis and Embleton, 1987, equation 6.9).
In term s o f the previous general discussion, we have d = 2 and
-
<e,4>,y j ) -
[ c{e>(j ))T y j y j a i d (j))
The left panel o f Figure 10.3 shows the em pirical likelihood contours based on (10.7) and (10.8), in the square region show n in Figure 5.10. The correspond ing contours for Q e e f ( S ) are show n on the right. T he dashed lines show the boundaries o f the 95% confidence regions for ( 6 , (f>) using b o o tstrap calibra tio n ; these differ little from those based on the asym ptotic y\ distribution. In each panel the d o tted ellipse is a 95% confidence region based on a studentized form o f the sam ple m ean p o la r axis, for which the contours are ellipses. The elliptical contours are appreciably tighter th a n those for the likelihood-based statistics. Table 10.1 com pares theoretical and b o o tstrap quantiles for several likelihoodbased statistics an d the studentized b o o tstrap statistic, Q , for the full d a ta and for a rando m subset o f size 20. F or the full data, the quantiles for Q e e f and W e l are close to those for the large-sam ple distribution. F o r the subset, Q e e f is close to its nom inal distribution, b u t the o th er statistics seem consid erably m ore variable. Except for Q e e f , it would be m isleading to rely on the asym ptotic results for the subsam ple. ■ T heoretical w ork suggests th a t W e l should have b etter properties th an statistics such as W e e f or Q e e f , b u t since sim ulations do n o t always confirm this, b o o tstrap quantiles should generally be used to set the limits o f confidence regions from m ultinom ial-based likelihoods.
10.3 ■Bootstrap Likelihood Table 10.1 Bootstrap p quantiles of likelihood-based statistics for mean polar axis data.
507
Full d a ta , n = 50
Subset, n = 20
p
xi
Q
W el
W ee f
Qe e f
Q
W EL
W eef
Qeef
0.80 0.90 0.95
3.22 4.61 5.99
3.23 4.77 6.08
3.40 4.81 6.18
3.37 5.05 6.94
3.15 4.69 6.43
3.67 5.39 7.17
3.70 5.66 7.99
3.61 5.36 10.82
3.15 4.45 7.03
10.3 Bootstrap Likelihood Basic idea Suppose for sim plicity th a t o u r d a ta y i , . . . , y n form a hom ogeneous random sam ple for which statistic T takes value t. If the d a ta were governed by a p aram etric m odel u n d er which T had the density then a partial likelihood for 9 based on T would be f T(t',0) regarded as a function o f 9. In the absence o f a p aram etric m odel, we m ay estim ate the density o f T at t, for different values o f 9, by m eans o f a nonparam etric double bootstrap. To be specific, suppose th a t we generate a first-level b o o tstrap sam ple y [ , . . . , y ‘ from w ith corresponding estim ator value t*. This boo tstrap sam ple is now considered as a population whose param eter value is t*; the em pirical distribution o f y j,...,y * is the nonparam etric analogue o f a param etric m odel w ith 9 = t*. We then generate M second-level b o o tstrap sam ples by sam pling from o u r first-level sample, and calculate the corresponding values o f T, nam ely r * \ . . . , f'J. K ernel density estim ation based on these second-level values provides an approxim ate density for T ” , and by analogy with p a ra m etric p artial likelihood we take this density at f** = t to be the value o f a n o n p aram etric p artial likelihood at 9 = f*. If the density estim ate uses kernel w( ) w ith bandw idth h, then this leads to the bootstrap likelihood value at 9 = t’ given by 1 M / <■*• _ A u n = f M ' i n = m Y . - { Jn r ) m=1 v7
<10-1»
O n repeating this procedure for R different first-level boo tstrap samples, we obtain R approxim ate likelihood values L ( t ’ ), r = 1, . . . , R , from which a sm ooth likelihood curve L B(9) can be produced by nonparam etric sm oothing. Computational improvements There are various ways to reduce the large am ount o f com putation needed to obtain a sm ooth curve. One, which was used earlier in Section 3.9.2, is to generate second-level sam ples from sm oothed versions o f the first-level samples. As before, probability distributions on the values y i .......y n are denoted
508
10 ■Semiparametric Likelihood Inference
by vectors p = (p i , . . . , p „ ), and p aram eter values are expressed as t(p); recall th a t p = and t = t(p). The rth first-level b o o tstrap sam ple gives statistic value t'n and the d a ta value yj occurs w ith frequency /*• = np’rj, say. In the bo o tstrap likelihood calculation this b o o tstrap sam ple is considered as a population w ith probability distrib u tio n p* = (p*1;...,p * n) on the d a ta values, and t ' = f(p ') is considered as the 0-value for this population. In order to obtain p opulations which vary sm oothly with 6 , we apply kernel sm oothing to the p*, as in Section 3.9.2. T hus for target param eter value 6 ° we define the vector p*(0°) o f probabilities p ' ( 0 °)
e - 1-') p'rj, r= 1
'
j= l,...,n ,
(10.12)
'
1II where typically w(-) is the stan d ard norm al density an d e = t>L ; as usual vl is the nonparam etric delta m ethod variance estim ate for t. The distribution p*(0°) will have p aram eter value not 0° b u t 9 = t (p '(0 0)). W ith the understanding th a t 9 is defined in this way, we shall for sim plicity w rite p'(9) ra th er th an p*(0°). F or a fixed collection o f R first-level sam ples and bandw idth e > 0, the probability vectors p"(9) change gradually as 9 varies over its range o f interest. Second-level b o o tstra p sam pling now uses vectors p'(0) as sam pling distri butions on the d a ta values, in place o f the p* s. The second-level sam ple values f** are then used in (10.11) to o btain Lg(0). R epeating this calculation for, say, 100 values o f 6 in the range t + 4 v /1 2, followed by sm ooth interpolation, should give a good result. Experience suggests th a t the value e = v 1/ 2 is safe to use in (10.12) if the t* are roughly equally spaced, which can be arran g ed by weighted first-level sam pling, as outlined in Problem 10.6. A way to reduce furth er the am o u n t o f calculation is to use recycling, as described in Section 9.4.4. R a th e r th an generate second-level sam ples from each p"(9) o f interest, one set o f M sam ples can be generated using distribution p on the d ata values, an d the associated values f” , . . . , calculated. Then, following the general re-w eighting m ethod (9.24), the likelihood values are calculated as .> 0 ,3 , m=\
v
/j = 1 v
J
'
where is the frequency o f the j t h case in the with second-level boo tstrap sample. O ne simple choice for p is the E D F p. In special cases it will be possible to replace the second level o f sam pling by use o f the saddlepoint approxim ation m ethod o f Section 9.5. This w ould give an accurate an d sm ooth approxim ation to the density o f T ’“ for sam pling from each p ' ( 8 ). Exam ple 10.3 (Air-conditioning d a ta )
We apply the ideas outlined above to
10.4 ■Likelihood Based on Confidence Sets
Figure 10.4 Bootstrap likelihood for mean of air-conditioning data. Left panel: bootstrap likelihood values obtained by saddlepoint approximation for 200 random samples, with smooth curve fitted to values obtained by smoothing frequencies from 1000 bootstrap samples. Right panel: gamma profile log likelihood (solid) and bootstrap log likelihood (dots).
509
~D O O
JZ
o>
o
theta
theta
the d a ta from Exam ple 10.1. The solid points in the left panel o f Figure 10.4 are b o o tstrap likelihood values for the m ean 9 for 200 resamples, obtained by saddlepoint approxim ation. This replaces the kernel density estim ate (10.11) an d avoids the second level o f resam pling, b u t does n o t remove the variation in estim ated likelihood values for different b o o tstrap sam ples with sim ilar values o f t r*. A locally q u ad ratic nonparam etric sm oother (on the log likelihood scale) could be used to produce a sm ooth likelihood curve from the values o f L(t"), b u t an o th er approach is better, as we now describe. The solid line in the left panel o f Figure 10.4 interpolates values obtained by applying the saddlepoint approxim ation using probabilities (10.12) at a few values o f 9°. H ere the values o f t! are generated at random , and we have taken 112 e = 0.5vl ; the results depend little on the value o f e. T he log b o o tstrap likelihood is very close to log em pirical likelihood, with 95% confidence interval (43.8,92.1). ■ B ootstrap likelihood is based purely on resam pling and sm oothing, which is a p o tential advantage over em pirical likelihood. However, in its simplest form it is m ore com puter-intensive. This precludes b o otstrapping to estim ate quantiles o f b o o tstra p likelihood ratio statistics, which would involve three levels o f nested resam pling.
10.4 Likelihood Based on Confidence Sets In certain circum stances it is possible to view confidence intervals as being approxim ately posterior probability sets, in the Bayesian sense. This encourages the idea o f defining a confidence distribution for 9 from the set o f confidence
510
10 ■Semiparametric Likelihood Inference
limits, and then taking the P D F o f this distribution as a likelihood function. T h a t is, if we define the confidence distribution function C by C( 6 xj = a, then the associated likelihood would be the “density” dC( 6 )/dd. Leaving the philosophical argum ents aside, we look briefly at where this idea leads in the context o f nonparam etric b o o tstrap m ethods.
10.4.1 Likelihood from pivots Suppose th a t Z ( 9 ) = z ( 6 , F ) is a pivot, w ith C D F K ( z ) not depending on the true distribution F , an d th a t z ( 0 ) is a m onotone function o f 6 . T hen the confidence distribution based on confidence limits derived from z leads to the likelihood LH6) = \ m \ k { z ( 6 ) } ,
(10.14)
where k ( z ) = d K ( z ) / d z . Since k will be unknow n in practice, it m ust be estim ated. In fact this definition o f likelihood has a hidden defect. If the identification o f confidence distrib u tio n w ith posterior distribution is accurate, as it is to a good approxim ation in m any cases, then the effect o f some prio r distribution has been ignored in (10.14). But this effect can be rem oved by a simple device. C onsider an im aginary experim ent in which a ran d o m sam ple o f size 2n is obtained, w ith outcom e exactly tw o copies o f the d a ta y th at we have. Then the likelihood w ould be the square o f the likelihood L z (6 I y) we are trying to calculate. T he ratio o f the corresponding posterior densities would be simply L z (6 | y). This argum ent suggests th a t we apply the confidence density (10.14) twice, first w ith d a ta y to give L j(0), say, and second w ith d a ta (y, y) to give L f2n(0). The ratio L l n(6 ) / L l ( 6 ) will then be a likelihood with the unknow n prior effect removed. In an explicit notatio n , this definition can be w ritten t (Q\ — ^2n(®) _ l^2n(0)l&2n {z 2n(d)} L z(p ) = , Ln \Zn{6 )\kn \2n(0)}
(10 15) (10.15)
where the subscripts indicate sam ple size. N ote th a t F and t are the same for both sam ple sizes, b u t quantities such as variance estim ates will depend upon sam ple size. N ote also th a t the im plied p rio r is estim ated by L l 2( 6 ) / L f2n(6). Exam ple 10.4 (Exponential m ean) If d a ta y i , . . . , y n are sam pled from an exponential distrib u tio n w ith m ean 6 , then a suitable choice for z ( 6 , F ) is y / 6 . The gam m a distrib u tio n for Y can be used to check th at the original definition (10.14) gives L i (6) = 9 ~ n~ l exp(—n y / 6 ) , w hereas the true likelihood is 9~n exp (—n y / 6 ) . The true result is obtained exactly using (10.15). The im plied prior is n( 6 ) oc 0-1 , for 6 > 0. ■ In practice the distrib u tio n o f Z m ust be estim ated, in general by boo tstrap
2(0) equals dz{Q)/d6.
10.4 ■Likelihood Based on Confidence Sets
511
sam pling, so the densities k n and k 2„ in (1 0 .1 5 ) m ust be estim ated. To be specific, consider the p articu lar case o f the studentized quantity z(9) = (t—d ) / v 1L/2. A part from a co n stan t m ultiplier, the definition (1 0 .1 5 ) gives
L f (0 ) = k 2n
j k n
,(10 .1 6 )
where v„^ = v i an d v2«,l = \ vl , and we have used the fact th a t t is the estim ate for b o th sam ple sizes. The densities k„ and k 2n are approxim ated using b o o tstrap sam ple values as follows. First R nonparam etric sam ples o f size n are a ^ i j'y draw n from F an d corresponding values o f z* = (t* — t ) / v n[ calculated. T hen R sam ples o f size 2n are draw n from F and values o f Z2» = (*2» ~ 0 /(® io .)1/2 calculated. N ext kernel estim ates for k„ and k 2n, with bandw idths h n and h 2n respectively, are obtained and substituted in (10.16). F or example, (10.17)
In practice these values can be com puted via spline sm oothing from a dense set o f values o f the kernel density estim ates k„{z). There are difficulties w ith this m ethod. First, ju st as with b o o tstrap likeli hood, it is necessary to use a large num ber o f sim ulations R. A second difficulty is th a t o f ascertaining w hether or n o t the chosen Z is a pivot, o r else w hat p rio r tran sfo rm atio n o f T could be used to m ake Z pivotal; see Section 5.2.2. This is especially true if we extend (10.16) to vector 9, which is theoretically possible. N ote th a t if the studentized b o o tstrap is applied to a transform ation o f t rath er th a n t itself, then the factor \z(9)\ in (10.14) can be ignored when applying (10.16).
10.4.2 Implied likelihood In principle any b o o tstra p confidence lim it m ethod can be turned into a likelihood m ethod via the confidence distribution, b u t it m akes sense to restrict atten tio n to the m ore accurate m ethods such as the studentized b o o tstrap used above. Section 5.4 discusses the underlying theory and introduces one other m ethod, the A B C m ethod, which is particularly easy to use as basis for a likelihood because no sim ulation is required. First, a confidence density is obtained via the q u adratic approxim ation (5.42), w ith a, b and c as defined for the nonparam etric A B C m ethod in (5.49). Then, using the argum ent th a t led to (10.15), it is possible to show th a t the induced likelihood function is L Ab c (0) = ex p { -5 U 2(0)},
(10.18)
512
10 ■Semiparametric Likelihood Inference
Figure 10.5 Gamma profile likelihood (solid), implied likelihood L a b c (dashes) and pivot-based likelihood (dots) for air-conditioning dataset of size 12 (left panel) and size 24 (right panel). The pivot-based likelihood uses R = 9999 simulations and bandwidths 1.0.
50
100
150
200
250
300
theta
40
60
80
100
120
theta
where um W
2r(fl) l + 2 ar(d) + { l + 4 a r ( d ) } 1/ 2’
22(0) 1 + {1 - 4cz(0)}V2’
1/I with z(9) = (t — d)/vj[ as before. This is called the implied likelihood. Based on the discussion in Section 5.4, one w ould expect results sim ilar to those from applying (10.16). A furth er m odification is to m ultiply La b c( 8 ) by exp{(cv1/ 2 — b) 6 /vi.}, with b the bias estim ate defined in (5.49). T he effect o f this m odification is to m ake the likelihood even m ore com patible w ith the Bayesian interpretation, som ew hat akin to the adjusted profile likelihood (10.2). Exam ple 10.5 (Air-conditioning d ata) Figure 10.5 shows confidence likeli hoods for the two sets o f air-conditioning d a ta in Table 5.6, sam ples o f size 12 and 24 respectively. The im plied likelihoods L ABc ( 9 ) are sim ilar to the em pirical likelihoods for these data. The pivotal likelihood L z ( 6 ), calculated from R = 9999 sam ples w ith bandw idths equal to 1.0 in (10.17), is clearly quite unstable for the sm aller sam ple size. This also occurred with b o o tstrap likeli hood for these d a ta an d seems to be due to the discreteness o f the sim ulations with so sm all a sample. ■
10.5 Bayesian Bootstraps All the inferences we have described thus far have been frequentist: we have sum m arized uncertainty in term s o f confidence regions for the unknow n p a ram eter 6 o f interest, based on repeated sam pling from a distribution F. A
10.5 ■Bayesian Bootstraps
513
quite different ap proach is possible if prior inform ation is available regarding F. Suppose th a t the only possible values o f Y are know n to be u i , . . . , u N, and th a t these arise with unknow n probabilities p \ , . . . , p N, so that
Pr(y = Uj | p i , . . . , p N ) = pj,
= I-
If o u r d a ta consist o f the ran d o m sam ple y \ , . . . , y „ , and f j counts how m any y, equal Uj, the probability o f the observed d a ta given the values o f the Pj is pro p o rtio n al to flyLi P^‘ ■ If the prior inform ation regarding the p; is sum m arized in the p rior density n(Pi, . . . , p N), the jo in t posterior density o f the Pj given the d a ta is pro p o rtio n al to N
n ip u -.^p ^n //, 7= 1
and this induces a posterior density for 8 . Its calculation is particularly straight forw ard w hen 7i is the D irichlet density, in which case the p rio r and posterior densities are respectively prop o rtional to
ft#
7= 1
7= 1
the posterior density is D irichlet also. Bayesian bootstrap sam ples and the corresponding values o f 8 are generated from the jo in t posterior density for the pj, as follows. Algorithm 10.1 (Bayesian bootstrap) F or r = 1 ,...,/? , 1 L et G \ , . . . , G n be independent gam m a variables with shape param eters aj + f j + 1, and unit scale param eters, and for j = l , . . . , N set P j = Gj/{G\ H------- 1- G^). 2 L et 8 } = t(Fj), where F j = ( P / , . . . , / ^ ) . E stim ate the posterior density for 8 by kernel sm oothing o f d \ , . . . , dfR.
•
In practice w ith continuous d a ta we have f j = l. The simplest version o f the sim ulation puts aj = —1, corresponding to an im proper p rio r distribution w ith su p p o rt on y \ , . . . , y„; the G; are then exponential. Some properties o f this procedure are outlined in Problem 10.10. Example 10.6 (City population data) In the city population d a ta o f E xam ple 2.8, for which n = 10, the param eter 8 = t(F) and the rth sim ulated posterior value dj are
514
10 • Semiparametric Likelihood Inference
Figure 10.6 Bayesian bootstrap applied to city population data, with n = 10. The left panel shows posterior densities for ratio 6 estimated from 999 Bayesian bootstrap simulations, with a = —1, 2, 5, 10; the densities are more peaked as a increases. The right panel shows the corresponding prior densities for 0.
o ‘C
Q_
theta
theta
The left panel o f Figure 10.6 shows kernel density estim ates o f the posterior density o f 9 based on R = 999 sim ulations w ith all the aj equal to a = —1, 2, 5, and 10. The increasingly strong p rio r inform ation results in posterior densities th at are m ore an d m ore sharply peaked. The right panel shows the im plied priors on 6 , obtained using the d a ta doubling device from Section 10.4. The priors seem highly inform ative, even when a = —1. ■ The prim ary use o f the Bayesian b o o tstrap is likely to be for im putation when d a ta are missing, ra th e r th a n in inference for 9 per se. There are theoretical advantages to such weighted bootstraps, in which the probabilities P* vary sm oothly, b u t as yet they have been little used in applications.
10.6 Bibliographic Notes Likelihood inference is the core o f p aram etric statistics. M any elem entary textbooks con tain som e discussion o f large-sam ple likelihood asym ptotics, while adjusted likelihoods an d higher-order approxim ations are described by Barndorff-N ielsen an d Cox (1994). Em pirical likelihood was defined for single sam ples by Owen (1988) and extended to w ider classes o f m odels in a series o f papers (Owen, 1990, 1991). Q in and Lawless (1994) m ake theoretical connections to estim ating equations, while H all and L a Scala (1990) discuss some practical issues in using em pir ical likelihoods. M ore general m odels to which em pirical likelihood has been applied include density estim ation (H all an d Owen, 1993; Chen 1996), lengthbiased d a ta (Qin, 1993), tru n cated d a ta (Li, 1995), and tim e series (M onti,
10.6 ■Bibliographic Notes
515
1997). A pplications to directional d a ta are discussed by Fisher et al. (1996). Owen (1992a) reports sim ulations th at com pare the behaviour o f the em pirical likelihood ratio statistic w ith b o o tstrap m ethods for sam ples o f size up to 20, w ith overall conclusions in line with those o f Section 5.7: the studentized b o o tstrap perform s best, in particular giving m ore accurate confidence in ter vals for the m ean th an the em pirical likelihood ratio statistic, for a variety o f underlying populations. R elated theoretical developm ents are due to DiCiccio, H all and R om ano (1991), D iC iccio and R om an o (1989), and Chen and H all (1993). F rom a theoretical view point it is notew orthy th a t the em pirical likelihood ratio statistic can be B artlett-adjusted, th o u g h C orcoran, D avison and Spady (1996) question the practical relevance o f this. H all (1990) m akes theoretical com parisons betw een em pirical likelihood and likelihood based on studentized pivots. Em pirical likelihood has roots in certain problem s in survival analysis, notably using the product-lim it estim ator to set confidence intervals for a survival probability. R elated m ethods are discussed by M urphy (1995). See also M ykland (1995), w ho introduces the idea o f dual likelihood, which treats the L agrange m ultiplier in (10.7) as a param eter. Except in large samples, it seems likely th a t o u r caveats ab o u t asym ptotic results apply here also. Em pirical exponential families have been discussed in Section 10.10 o f Efron (1982) an d D iCiccio and R om an o (1990), am ong others; see also C orcoran, D avison and Spady (1996), w ho m ake com parisons with em pirical likelihood statistics. Jing an d W ood (1996) show th a t em pirical exponential family like lihood is n o t B artlett adjustable. A univariate version o f the statistic Q e e f in Section 10.2.2 is discussed by Lloyd (1994) in the context o f M -estim ation. B ootstrap likelihood was introduced by D avison, H inkley and W orton (1992), w ho discuss its relationship to em pirical likelihood, while a later paper (D avison, H inkley and W orton, 1995) describes com putational im provem ents. E arly w ork on the use o f confidence distributions to define nonparam etric likelihoods was done by H all (1987), Boos and M on ah an (1986), and O gbonm w an and W ynn (1986). T he use o f confidence distributions in Section 10.4 rests in p a rt on the sim ilarity o f confidence distributions to Bayesian posterior distributions. F o r related theory see W elch and Peers (1963), Stein (1985) and Berger an d B ernardo (1992). E fron (1993) discusses the likelihood derived from A B C confidence limits, shows a strong connection with profile likelihood an d related likelihoods, an d gives several applications; see also C h apter 24 o f E fron and T ibshirani (1993). T he Bayesian b o o tstrap was introduced by R ubin (1981), and subsequently used by R ubin and Schenker (1986) and R ubin (1987) for m ultiple im putation in missing d a ta problem s. B anks (1988) has described some variants o f the Bayesian b o o tstrap , while N ew ton and R aftery (1994) describe a varian t which
516
10 • Semiparametric Likelihood Inference
they nam e the w eighted likelihood b o otstrap. A com prehensive theoretical discussion o f w eighted b o o tstrap s is given in B arbe and Bertail (1995).
10.7 Problems 1
Consider empirical likelihood for a parameter 0 = t(F) defined by an estimating equation f u(t;y)dF(y) = 0, based on a random sample y\,...,y„. (a) Use Lagrange multipliers to maximize Y l°g Pj subject to the conditions Y P j = 1 and Y2 Pju(t;yj) = 0, and hence show that the log empirical likelihood is given by (10.7) with d = 1. Verify that the empirical likelihood is maximized at the sample EDF, when 6 = t(F). (b) Suppose that u(f,y) = y — t and n = 2, with y\ < y 2. Show that rj9 can be written as (9 — y ) / {( 6 — y i)(y2 — 0)}, and sketch it as a function o f 6. (Section 10.2.1)
2
Suppose that x \ , . . . , x n and are independent random samples from dis tributions with means /i and n + 3. Obtain the empirical likelihood ratio statistic for 5. (Section 10.2.1)
3
(a) In (10.5), suppose that 6 = y + n~1/2ere, where a 1 = var(y; ) and e has an asymptotic standard normal distribution. Show that rjg = —n~l/2s / a 2, and deduce that near y, SEl (0) = —§ (y ~ 0)2/ o 2. (b) N ow suppose that a single observation from F has log density f ( 0 ) = log f(y;6) and corresponding Fisher information i(6) = E{—?($)}. Use the fact that the M LE 6 satisfies the equation t(6) = 0 to show that near 6 the parametric log likelihood is roughly (0) = — *i(6)(6 — 0)2 (c) By considering the double exponential density | exp(—\y — 6 1), — oo < y < oo, and an exponential family density with mean 9, a(y) exp{yb(0) — c(8)}, show that it may or may not be true that Sel ( 0) = S(9)(Section 10.2.1; DiC iccio, Hall and Romano, 1989)
4
Let 6 be a scalar parameter defined by an estimating equation f u(6;y)dF(y) = 0. Suppose that we wish to make likelihood inference for 6 based on a random sample using the empirical exponential family
gisuiOiyj) nj(6) = Pr(Y = y j ) = where
j= l,...,n,
is determined by
n 5 > ; (0)u (0;y,) = 0. j =i
(10.19)
(a) Let Z i,...,Z „ be independent Poisson variables with means exp(£uj), where Uj = u(0;yj); we treat 6 as fixed. Write down the likelihood equation for and show that when the observed values o f the Z j all equal zero, it is equivalent to (10.19). Hence outline how software that fits generalized linear models may be used to find (b) Show that the formulation in terms o f Poisson variables suggests that the empir ical exponential family likelihood ratio statistic is the Poisson deviance W tEF(0Q),
517
10.7 ■Problems while the multinomial form gives W EEF(0O), where W W (flo)
=
2 ^ { l - e x p ( ^ ; )},
W ’EEF(e0)
=
2 [«log { n - ' X y ^ } - & £ > , ] .
(c) Plot the log likelihood functions corresponding to W E e f and W EEF for the data in Example 10.1; take Uj = y, — 6. Perform a small simulation study to compare the behaviour o f W EEF and W'EEF when the underlying data are samples o f size 24 from the exponential distribution. (Section 10.2.2)
5
Suppose that a = (s in 0 ,c o s 0 )r is the mean direction o f a distribution on the unit circle, and consider setting a nonparametric confidence set for a based on a random sample o f angles 9 i , . . . , 9„ \ set yj = (sin 0 . , c o s 0j)T. (a) Show that a is determined by the equation Y v f b = 0> where b = (cos 6, — sin 6)T Hence explain how to construct confidence sets based on statistics from empirical likelihood and from empirical exponential families. (b) Extend the argument to data taking values on the unit sphere, with mean direction a = (cos 9 cos
6
Suppose that t has empirical influence values lj, and set Pj(9°) =
f 1
,
(10.20)
E ,= i ^ where f = v l/2(8° — t ) and v = n~2
lj.
(a) Show that t(Ft ) = 90, where Fj denotes the C D F corresponding to (10.20). Hence describe how to space out the values t" in the first-level resampling for a bootstrap likelihood. (b) Rather than use the tilted probabilities (10.12) to construct a bootstrap like lihood by simulation, suppose that we use those in (10.20). For a linear statistic, show that the cumulant-generating function o f T ” in sampling from (10.20) is At + n { K ( ^ + n~iA) — K( ^ ) } , where K ( ^ ) = lo g (£ ] e(lJ). Deduce that the saddlepoint approximation to f r - \ T - ( t I 0°) is proportional to exp {—n X (f)}, where 6° = K '(0 Hence show that for the sample average, the log likelihood at 9° = Y I y j e tyi / 53 eiyj is n { i t - lo g ( 5 3 e ^ )} . (c) Extend (b) to the situation where t is defined as the solution to a m onotonic estimating equation. (Section 10.3; D avison, Hinkley and Worton, 1992) 7
Consider the choice o f h for the raw bootstrap likelihood values (10.11), when w ( ) is the standard normal density. A s is often roughly true, suppose that T* ~ N(t, v), and that conditional on T ‘ — t ’, T ” ~ N(t ' , v). (a) Show that the mean and variance o f the product o f vl/1 with (10.11) are /j and M ~ l {12 — It ), where
h = (
2
«
r
*
v
-
v
}.
where y = hv~l/2 and 3 = v~l/2(t' — t). Hence verify some o f the values in the following table:
10 ■Semiparametric Likelihood Inference
y
0
D ensity x lO -2 Bias x lO -2 M x variance xlO -2
39.9 -0.8 40.4
1
Oo II O
518
8 = 1 0
1
2
0
1
2
24.2 0 28.3
24.2 -0.1 11.2
24.2 -0 .5 5.7
5.4 0.3 7.5
5.4 1.2 3.8
5.4 2.5 2.6
2
39.9 39.9 -2 .9 -5 .7 13.4 5.6
(b) If y is small, show that the variance o f (10.11) is roughly proportional to the square o f its mean, and deduce that the variance is approximately constant on the log scale. (c) Extend the calculations in (a) to (10.13). (Section 10.3; D avison, Hinkley and Worton, 1992) 8
Let y represent data from a parametric m odel f ( y ; 6 ) , and suppose that 6 is estimated by t(y). Assum ing that simulation error may be ignored, under what circumstances would the bootstrap likelihood generated by parametric simulation from / equal the parametric likelihood? Illustrate your answer with the N ( 9 , 1) distribution, taking t to be (i) the sample average, (ii) the sample median. (Section 10.3)
9
Suppose that we wish to construct an implied likelihood for a correlation coefficient 6 based on its sample value T by treating Z = \ lo g {(l + T)/( 1 — T )} as normal with mean g(0) = | lo g {(l + 6)/(1 — 0)} and variance n ~ l . Show that the implied likelihood and implied prior are proportional to e x p [ - § { g ( t ) - g ( 0 ) } 2] ,
( 1 - 0 ) - 2,
|0| < 1.
Is the prior here proper? (Section 10.4) 10
The Dirichlet density with parameters ( f i , . . . , f „ ) is
*>»•
5 > -> .
< -> *
Show that the Pj have joint moments Tjf n
\ __
“ 7 ’
_____l n
n \ _
cow(pi ' Pk) ~
^j(^jks ~ Zk) S2(t + l) ’
where S]k = 1 if j = k and zero otherwise, and s = ------- b c„. (a) Let y i , . . . , y n be a random sample, and consider bootstrapping its average. Show that under the Bayesian bootstrap with aj = a, e ', p ;
,-
Hence show that the posterior mean and variance o f
do- 21) = YI yjPj are y an(l
(2n + a n + 1 )- I m2, where m 2 = n_1 J2(yj ~ y ) 2(b) N ow consider the average F t o f bootstrap samples generated as follows. We generate a distribution F 1 = ( P / , . . . , P j ) o n y t, . . . , y„ under the Bayesian bootstrap,
10.8 ■Practicals
519
and then make Y/, . . . , Y j by independent multinomial sampling from F f. Show that
E>(
=
,a,(f
(2n + an + 1) n
Are the properties o f this as n—►oo and a —»oo what you would expect? How does this compare with samples generated by the usual nonparametric bootstrap? (Section 10.5)
10.8 Practicals 1
We compare the empirical likelihoods and 95% confidence intervals for the mean o f the data in Table 3.1, (a) pooling the eight series:
attach(gravity) grav.EL <- EL.profile(g,tmin=70,tmax=85,n.t=51) plot(grav.EL[,1],exp(grav.EL[,2]) ,type="l",xlab="mu", ylab="empirical likelihood") l i k .CI(grav.E L ,lim=-0.5*qchisq(0.95,1)) and (b) treating the series as arising from separate distributions with the same mean and plotting eight individual likelihoods:
gravs.EL <- EL.profile(g[series==l],n.t=21) plot(gravs.EL[,1],exp(gravs.EL[,2]) ,type="n",xlab="mu", ylab="empirical likelihood",xlim=range(g)) lines(gravs .EL[, 1] ,exp(gravs.EL[,2] ) ,lty=2) for (s in 2:8) { gravs.EL <- EL.profile(g[series==s],n.t=21) lines(gravs.EL[,1],exp(gravs.EL[,2]) ,lty=2) } N ow we combine the individual likelihoods into a single likelihood by multiplying them together; we renormalize so that the product has maximum one.
lims <- matrix(NA,8,2) for (s in 1:8) { x <- g[series==s]; lims[s,] <- range(x) } mu.min <- max(lims[,1]); mu.max <- min(lims[,2]) gravs.EL <- EL.profile(g[series==l], tmin=mu.m i n ,tmax=mu.m a x ,n .t =21) gravs.E L .L <- gravs.E L [,2] gravs.EL.mu <- gravs.EL[,l] for (s in 2:8) gravs.EL.L <- gravs.EL.L + EL.profile(g[series==s], tmin=mu.min,tmax=mu.max,n.t=21)[,2] gravs.EL.L <- gravs.EL.L - max(gravs.EL.L) lines(gravs.EL.mu,exp(gravs.EL.L),lwd=2) lik.CI(cbind(gravs.EL.mu,gravs.EL.L),lim=-0.5*qchisq(0.95,1)) Compare the intervals with those in Example 3.2. D oes the result for (b) suggest a limitation o f multinomial likelihoods in general? Compare the empirical likelihoods with the profile likelihood (10.1) and the ad justed profile likelihood (10.2), obtained when the series are treated as independent normal samples with different variances but the same mean. (Section 10.2.1)
520 2
10 ■Semiparametric Likelihood Inference Dataframe i s l a y contains 18 measurements (in degrees east o f north)o f palaeocurrent azimuths from the Jura Quartzite on the Scottish island o f Islay. We aim to use m ultinom ial-based likelihoods to set 95% confidence intervals for the mean direction a(0) = (s in 0 ,c o s 0 )r o f the distribution underlying the data; the vector b(6) = (c o s 0, — s in 6)T is orthogonal to a. Let yj = (s mQj , co s 6 j )T denote the vectors corresponding to the observed angles Then the mean direction 6 is the angle subtended at the origin by Y , yj/\\ Y X/IIFor the original estimate, plots o f the data, log likelihoods and confidence intervals: a tt a c h (is la y )
th <- ifelse(theta>180,theta-360,theta) a.t <- function(th) c(sin(th*pi/180), cos(th*pi/180)) b.t <- fimction(th) c(cos(th*pi/180), -sin(th*pi/180)) y <- t(apply(matrix(theta, 18,1), 1, a.t)) thetahat <- function(y) { m <- apply(y,2,sum) m <- m/sqrt(m[l] ~2+m[2] “ 2) 180*atan(m[l]/m[2] )/pi } thetahat(y) u.t <- function(y, th) crossprod(b.t(th), t(y)) islay.EL <- EL.profile(y, tmin=-100, tmax=120, n.t=40, u=u.t) plot(islay.EL[,1],islay.EL[,2],type="l",xlab="theta", ylab="log empirical likelihood",ylim=c(-25,0)) points(th,rep(-25,18)); abline(h=-3.84/2,lty=2) lik.CI(islay.EL,lim=-0.5*qchisq(0.95,1)) islay.EEF <- EEF.profile(y, tmin=-100, tmax=120, n.t=40, u=u.t) lines(islay.EEF[,1],islay.EEF[,2],lty=3) l i k .CI(islay.E E F ,lim=-0.5*qchisq(0.95,1)) Discuss the shapes o f the log likelihoods. To obtain 0.95 quantiles o f the bootstrap distributions o f W EL and
W Eef
'■
islay.fun <- function(y, i, angle) { u <- as.vector(u.t(y[i,] , angle)) z <- r e p (0,length(u)) EEF.fit <- glm(z~u-l.poisson) W.EEF <- 2*sum(l-fitted(EEF.fit)) EL.loglik <- function(lambda) - sum(log(l + lambda * u)) EL.score <- function(lambda) - sum(u/(l + lambda * u)) assignC'u" ,u,frame=l) EL.out <- nlmin(EL.loglik,0.001) W.EL <- -2*EL.loglik(EL.out$x) c(thetahat(y[i,]), W.EL, W.EEF, E L .out$converged) } islay.boot <- boot(y,islay.fun,R=999,angle=thetahat(y)) islay.boot$R < - sum(islay.boot$t [ , 4 ] ) islay.boot$t <- islay,boot$t[islay.boot$t[,4]==1,] apply(islay.boot$t[,2:3],2,quantile,0.95) How do the bootstrap-calibrated confidence intervals compare with those based on the xi distribution, and with the basic bootstrap intervals using the 6' ? (Sections 10.2.1, 10.2.2; Hand et al., 1994, p. 198) 3
We compare posterior densities for the mean o f the air-conditioning data using (a) the Bayesian bootstrap with aj = — 1:
521
8 ■Practicals
airl <- data.frame(hours=aircondit$hours,G=l) air.bayes.gen <- function(d, a) { out <- d out$G <- rgamma(nrow(d),shape=a+2) out } air.bayes.fun <- function(d) sum(d$hours*d$G)/sum(d$G) air.bayesian <- boot(airl, air .bayes .fun, R=999, sim="parametric", r a n .gen=air.bayes.g e n ,mle=-1) plot(density(air.bayesian$t,n=100,width=25),type="l", xlab="theta",ylab="density",ylim=c(0,0.02)) and (b) an exponential m odel with mean 6 for the data, with prior according to which 6 ~ l has a gamma distribution with index k and scale :
kappa <- 0; lambda <- 0 kappa.post <- kappa + length(airl$hours) lambda.post <- lambda + sum(airl$hours) theta <- 30:300 lines(theta, lambda.post/theta“2*dgamma(lambda.post/theta,kappa.post), lty-2) Repeat this with different values o f a in the Bayesian bootstrap and parametric case, and discuss your results. (Section 10.5)
k,
X in the
11 Computer Implementation
11.1 Introduction The key requirem ents for com puter im plem entation o f resam pling m ethods are a flexible program m ing langauge w ith a suite o f reliable quasi-random num ber generators, a wide range o f built-in statistical procedures to bootstrap, and a reasonably fast processor. In this chapter we outline how to use one im plem entation, using the curren t (M ay 1997) com m ercial version S p lu s 3.3 o f the statistical language S, although the m ethods could be realized in a n um ber o f o th e r statistical com puting environm ents. The rem ainder o f this section outlines the in stallation o f the library, and gives a quick sum m ary o f features o f S p lu s essential to our purpose. Each subsequent section describes aspects o f the library needed for the m aterial in the corresponding ch ap ter: Section 11.2 corresponds to C h apter 2, Section 11.3 to C h apter 3, an d so forth. These sections take the form o f a tutorial on the use o f the library functions. T he outline given here is n o t intended to replace the help files distributed w ith the library, w hich can be viewed by typing h e l p ( b o o t , l i b r a r y = " b o o t " ) w ithin S p lu s. A t various points below, you will need to consult these files for m ore details on functions. The m ain functions in the library are sum m arized in Table 11.1. The best way to learn to use softw are is to use it, and from Section 11.1.2 onw ards, we assum e th a t you, d ear reader, know the basics o f S, including how to w rite sim ple functions, th a t you are seated com fortably at your favourite com puter w ith S p lu s launched and a graphics w indow open, and th a t you are working through this chapter. We d o n o t show the S p lu s p ro m p t >, n o r the continuatio n p ro m p t +.
522
523
11.1 ■Introduction Table 11.1 Functions in the S plus bootstrap library.
F u nction
Purpose
abc.ci boot b o o t. a rra y b o o t. c i censboot c o n tro l c v . glm em p in f e n v e lo p e e x p .tilt g l m .d ia g g lm .d ia g .p lo t im p.m om ents im p . p r o b im p .q u a n tile im p .w e ig h ts j a c k , c i f t e r .b o o t lin e a r .a p p r o x s a d d le s a d d le .d is tn s im p le x s m o o th .f tilt.b o o t ts b o o t
N o n p a ra m e tric A B C confidence intervals P aram etric a n d n o n p aram etric b o o tstra p sim ulation A rra y o f indices o r frequencies from b o o tstra p sim ulation B o o tstrap confidence intervals B o o tstrap for censored a n d survival d a ta C o n tro l m eth o d s for e stim atio n o f quantiles, bias, variance, etc. C ro ss-v alid atio n pred ictio n e rro r estim ate for generalized linear m odel C alcu late em pirical influence values C alcu late sim ulation envelope E x p o n en tial tilting to calcu late p ro b ab ility d istrib u tio n s G eneralized linear m odel diagnostics P lo t generalized linear m odel diagnostics Im p o rta n c e resam pling m o m en t estim ates Im p o rtan ce resam pling tail p ro b ab ility estim ates Im p o rtan ce resam pling q u an tile estim ates C alculate im p o rtan c e resam pling w eights Ja ck k n ife-after-b o o tstrap p lo t C alculate linear ap p ro x im atio n to a statistic S ad d lep o in t ap p ro x im atio n S ad d lep o in t ap p ro x im atio n for a d istrib u tio n Sim plex m eth o d o f lin ear pro g ram m in g F requency sm oothing A uto m atic im p o rtan c e re-w eighting b o o tstra p sim ulation B o o tstrap for tim e series d a ta
11.1.1 Installation

UNIX

The bootstrap library can be obtained from the home page for this book,

http://dmawww.epfl.ch/davison.mosaic/BMA/

in the form of a compressed shar file bootlib.sh.Z. This file should be uncompressed and moved to an appropriate directory. The file can then be unpacked by

sh bootlib.sh
rm bootlib.sh
You should then follow the instructions in the README file to complete the installation of the library. It is best to set up an Splus library boot containing the library files; you may need to ask your system manager to do this. Once this is done, and once inside Splus in your usual working directory, the functions and data are accessed by typing

library(boot, first=T)
This will avoid cluttering your working directory with library files, and reduce the chance that you accidentally overwrite them.

Windows
The disk at the back of this book contains the library functions and documentation for use with Splus for Windows. For instructions on the installation, see the file README.TXT on the disk. The contents of the disk can also be retrieved in the form of a zip file from the home page for the book given above.
11.1.2 Some key Splus ideas

Quasi-random numbers
To put 20 quasi-random N(0,1) data into y and to see its contents, type

y <- rnorm(20)
y

Here <- is the assignment symbol. To see the contents of any S object, simply type its name, as above. This is often done below, and we do not show the output. In general quasi-random numbers from a distribution are generated by the functions rexp, rgamma, rchisq, rt, ..., with arguments to give parameters where needed. For example,
y <- rgamma(n=10, shape=2)

generates 10 gamma observations with shape parameter 2, and
y <- rgamma(n=10, shape=c(1:10))

generates a vector of ten gamma variables with shape parameters 1, 2, ..., 10.

The function sample is used to sample from a set with or without replacement. For example, to get a random permutation of the numbers 1, ..., 10, a random sample with replacement from them, a random permutation of 11, 22, 33, 44, 55, a sample of size 10 from them, and a sample of size 10 taken with unequal probabilities:
sample(10)
sample(10, replace=T)
set <- c(11,22,33,44,55)
sample(set)
sample(set, size=10, replace=T)
sample(set, size=10, replace=T, prob=c(0.1,0.1,0.1,0.1,0.6))
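All of these draws depend on the state of the quasi-random number generator, which is held in .Random.seed; saving and restoring it, as is done later in the chapter for more precise comparisons, can be sketched as follows (the object name seed.store is ours):

seed.store <- .Random.seed
y1 <- rnorm(20)
.Random.seed <- seed.store
y2 <- rnorm(20)      # y2 reproduces y1 exactly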
Subscripts

The city population data with n = 10 are
city
city$u
city$x

where the second two commands show the individual variables of city. This Splus object is a dataframe — an array of data in which rows correspond to cases, and the named columns to variables. Elements of an object are accessed by subscripts, so
city$x[1]
city$x[c(1:4)]
city$x[c(1,5,10)]
city[c(1,5,10),2]
city$x[-1]
city[c(1:3),]

give various subsets of the elements of city. To make a nonparametric bootstrap sample of the rows of city, you could type:
i <- sample(10, replace=T)
city[i,]

The row labels result from the algorithm used to give unique labels to rows, and can be ignored for our purposes.
11.2 Basic Bootstraps

11.2.1 Nonparametric bootstrap

The main bootstrap function, boot, works on a vector, a matrix, or a dataframe. A simple use of boot to bootstrap the ratio $t = \bar{x}/\bar{u}$ for the city population data of Example 2.8 is
city.fun <- function(data, i)
{ d <- data[i,]
  mean(d$x)/mean(d$u) }
city.boot <- boot(data=city, statistic=city.fun, R=50)

The function city.fun takes as input the dataframe data and the vector of indices i. Its first command sets up the bootstrapped dataframe, and its second makes and returns the bootstrapped ratio. The last command instructs the function boot to bootstrap the data in city R = 50 times, apply the statistic city.fun to each bootstrap dataset and put the results in city.boot.
Bootstrap objects
The result of a call to boot is a bootstrap object. This is implemented as a list of quantities which is given the class "boot" and for which various methods are defined. For example, typing
city.boot

prints the original statistic, its estimated bias and its standard error, while
plot(city.boot)

gives suitable summary plots. To see the names of the elements of the bootstrap object city.boot, type
names(city.boot)

You see various names, of which city.boot$t0, city.boot$t, city.boot$R, city.boot$seed contain the original value of the statistic, the bootstrap values, the value of R, and the value of the Splus random number generation seed when boot was invoked. To see their contents, type their names.
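The bias and standard error printed for city.boot can be recomputed directly from these elements; a minimal sketch:

city.boot$t0                          # original value of the ratio
mean(city.boot$t) - city.boot$t0      # bootstrap estimate of bias
sqrt(var(city.boot$t))                # bootstrap standard error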
Timing

To repeat the simulation, checking how long it takes, type

unix.time(city.boot <- boot(city, city.fun, R=50))

on a UNIX system or

dos.time(city.boot <- boot(city, city.fun, R=50))

on a DOS system. The first number returned is the time the simulation took, and is useful for estimating how long a larger simulation would take.

Although code is generally clearer when dataframes are used, the computation can be speeded up by avoiding them, as here:
mat <- as.matrix(city)
mat.fun <- function(data, i)
{ d <- data[i,]
  mean(d[,2])/mean(d[,1]) }
unix.time(mat.boot <- boot(mat, mat.fun, R=50))

Compare this with the time taken using the dataframe city.

Frequency array

To obtain the R x n array of bootstrap frequencies for city.boot and to display its first 20 lines, type
f <- boot.array(city.boot)
f[1:20,]
The rows of f are the vectors of frequencies for individual bootstrap samples. The array is useful for many post hoc calculations, and is invoked by post-processing functions such as jack.after.boot and imp.weights, which are discussed below. It is calculated from city.boot$seed. The array of indices for the bootstrap samples can be obtained by boot.array(city.boot, indices=T).

Types of statistic

For a nonparametric bootstrap, the function statistic can be of one of three types. We have already seen examples of the first, index type, where the arguments are the dataframe data and the vector of indices i; this is specified by stype="i" (the default). For the second, weighted type, the arguments are data and a vector of weights w. For example,
city.w <- function(data, w=rep(1,nrow(data))/nrow(data))
{ w <- w/sum(w)
  sum(w*data$x)/sum(w*data$u) }
city.boot <- boot(city, city.w, R=20, stype="w")

writes
$$t^* = \sum_j w_j^* x_j \Big/ \sum_j w_j^* u_j,$$
where $w_j^*$ is the weight put on the jth case of the dataframe in the bootstrap sample; the first line of city.w ensures that $\sum_j w_j^* = 1$. Setting w in the initial line of the function gives the default value for w, which is a vector of $n^{-1}$s; this enables the original value of t to be obtained by city.w(city). A more complicated example is given by the library correlation function corr. Not all statistics can be written in this form, but when they can, numerical differentiation can be used to obtain empirical influence values and ABC confidence intervals.

For the third, frequency type, the arguments are data and a vector of frequencies f. For example,
city.f <- function(data, f) mean(f*data$x)/mean(f*data$u)
city.boot <- boot(city, city.f, R=20, stype="f")

uses
$$t^* = n^{-1}\sum_j f_j^* x_j \Big/ n^{-1}\sum_j f_j^* u_j,$$
where $f_j^*$ is the frequency with which the jth row of the dataframe occurs in the bootstrap sample. Not all statistics can be written in this form. It differs from the preceding type in that whereas weights can in principle take any positive
values, frequencies must be integers. Of course in this example it would be easiest to use the function city.fun given earlier.

Function statistic
The contents of statistic can be more-or-less arbitrarily complicated, provided that its output is a scalar or fixed-length vector. For example,

air.fun <- function(data, i)
{ d <- data[i,]
  c(mean(d), var(d)/nrow(data)) }
air.boot <- boot(data=aircondit, statistic=air.fun, R=200)

performs a nonparametric bootstrap for the average of the air-conditioning data, and returns the bootstrapped averages and their estimated variances. We give more complex examples below. Beware of memory and storage problems if you make the output too long.

By default the first element of statistic (and so the first column of boot.out$t) is treated as the main statistic for certain calculations, such as calculation of empirical influence values, the jackknife-after-bootstrap plot, and confidence interval calculations, which are described below. This is changed by use of the index argument, usually a single number giving the column of statistic to which the calculation is to be applied.

Further arguments can be passed to statistic using the ... argument to boot. For example,

city.subset <- function(data, i, n=10)
{ d <- data[i[1:n],]
  mean(d[,2])/mean(d[,1]) }
city.boot <- boot(data=city, statistic=city.subset, R=200, n=5)

gives resampled ratios for bootstrap samples of size 5. Note that the frequency array for city.boot would not be useful in this case. The indices can be obtained by

boot.array(city.boot, indices=T)[,1:5]
11.2.2 Parametric bootstrap

For a parametric bootstrap, the first argument to statistic remains a vector, matrix, or dataframe, but statistic need take no second argument. Instead three further arguments to boot must be supplied. The first, ran.gen, tells boot how to simulate bootstrap data, and is a function that takes two arguments, the original data, and an object containing any other parameters, mle. The output of ran.gen should have the same form and attributes as the original dataset. The second new argument to boot is a value for mle itself. The third
new argument to boot, sim="parametric", tells boot to perform a parametric simulation: by default the simulation is nonparametric and sim="ordinary". Other possible values for sim are described below. For example, for parametric simulation from the exponential model fitted to the air-conditioning data in Table 1.2, we set

aircondit.fun <- function(data) mean(data$hours)
aircondit.sim <- function(data, mle)
{ d <- data
  d$hours <- rexp(n=nrow(data), rate=mle)
  d }
aircondit.mle <- 1/mean(aircondit$hours)
aircondit.para <- boot(data=aircondit, statistic=aircondit.fun,
  R=20, sim="parametric", ran.gen=aircondit.sim,
  mle=aircondit.mle)

Air-conditioning data for a different aircraft are given in aircondit7. Obtain their sample average, and perform a parametric bootstrap of the average using the fitted exponential model. Give the bias and variance estimates for the average. Do the bootstrapped averages look normal for this sample size?

A more complicated example is parametric simulation based on a log bivariate normal distribution fitted to the city population data:

l.city <- log(city)
city.mle <- c(apply(l.city,2,mean), sqrt(apply(l.city,2,var)),
  corr(l.city))
city.sim <- function(data, mle)
{ n <- nrow(data)
  d <- matrix(rnorm(2*n), n, 2)
  d[,2] <- mle[2] + mle[4]*(mle[5]*d[,2]+sqrt(1-mle[5]^2)*d[,1])
  d[,1] <- mle[1] + mle[3]*d[,1]
  data$x <- exp(d[,2])
  data$u <- exp(d[,1])
  data }
city.f <- function(data) mean(data[,2])/mean(data[,1])
city.para <- boot(city, city.f, R=200, sim="parametric",
  ran.gen=city.sim, mle=city.mle)

With this definition of city.f, a nonparametric bootstrap can be performed by

city.boot <- boot(data=city,
  statistic=function(data, i) city.f(data[i,]), R=200)
This is useful when comparing parametric and nonparametric bootstraps for the same problem. Compare them for the city data.
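A minimal sketch of such a comparison (the plotting choices here are ours, not prescribed by the text):

split.screen(c(1,2))
screen(1); qqnorm(city.para$t, main="parametric")
screen(2); qqnorm(city.boot$t, main="nonparametric")
close.screen(all=T)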
11.2.3 Empirical influence values

For a statistic boot.fun in weighted form, the function empinf returns the empirical influence values $l_j$, obtained by numerical differentiation. For the ratio function city.w given above, for example, these and the exact values (Problem 2.9) are
L.diff <- empinf(data=city, statistic=city.w, stype="w")
cbind(L.diff, (city$x-city.w(city)*city$u)/mean(city$u))

Empirical influence values can also be obtained from the output of boot by regression of the values of t* on the frequency array. For example,
city.boot <- boot(city, city.fun, R=999)
L.reg <- empinf(city.boot)
L.reg

uses regression with the 999 samples in city.boot to estimate the $l_j$. Jackknife values can be obtained by
J <- empinf(data=city, statistic=city.fun, stype="i", type="jack")

The argument type controls how the influence values are to be calculated, but this also depends on the quantities input to empinf: for details see the help file.

Variance approximations

var.linear uses empirical influence values to calculate the nonparametric delta method variance approximation for a statistic:
var.linear(L.diff)
var.linear(L.reg)

Linear approximation

linear.approx uses output from a nonparametric bootstrap simulation to calculate the linear approximations to the bootstrapped quantities. The empirical influence values can be supplied, but if not, they are estimated by a call to empinf. For the city population ratio,
city.tL.reg <- linear.approx(city.boot)
city.tL.diff <- linear.approx(city.boot, L=L.diff)
split.screen(c(1,2))
screen(1); plot(city.tL.reg, city.boot$t); abline(0,1,lty=2)
screen(2); plot(city.tL.diff, city.boot$t); abline(0,1,lty=2)
calculates the linear approximation for the two sets of empirical influence values and plots the actual t* against them.
11.3 Further Ideas

11.3.1 Stratified sampling

Stratified sampling is performed by including the argument strata in the call to boot. Suppose that we wish to bootstrap the difference in the trimmed averages for the last two groups of gravity data (Example 3.2):
gravity
grav <- gravity[as.numeric(gravity$series)>=7,]
grav
grav.fun <- function(data, i, trim=0.125)
{ d <- data[i,]
  m <- tapply(d$g, d$series, mean, trim=trim)
  m[7] - m[8] }
grav.boot <- boot(grav, grav.fun, R=200, strata=grav$series)

Check that the expected properties of boot.array(grav.boot) hold. Empirical influence values, linear approximations, and nonparametric delta method variance approximations are calculated by
grav.L <- empinf(grav.boot)
grav.tL <- linear.approx(grav.boot)
var.linear(grav.L, strata=grav$series)

grav.boot$strata contains the strata used in the resampling, which are taken into account automatically if grav.boot is used, but otherwise must be supplied, as in the final line of the code above.
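For the check on boot.array(grav.boot) suggested above, a minimal sketch: under stratified resampling, each row of the frequency array should allocate to each stratum exactly its own number of cases.

f <- boot.array(grav.boot)
table(apply(f[, grav$series==7], 1, sum))  # constant: size of series 7
table(apply(f[, grav$series==8], 1, sum))  # constant: size of series 8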
11.3.2 Smoothing

The neatest way to perform smooth bootstrapping is to use sim="parametric". For example, to estimate the variance of the median of the data in y, using smoothing parameter h = 0.5:
y <- rnorm(99)
h <- 0.5
y.gen <- function(data, mle)
{ n <- length(data)
  i <- sample(n, n, replace=T)
  data[i] + mle*rnorm(n) }
y.boot <- boot(y, median, R=200, sim="parametric",
  ran.gen=y.gen, mle=h)
var(y.boot$t)

This guarantees that y.boot$t0 contains the original median. For shrunk smoothing, see Practical 4.5.
11.3.3 Censored data

censboot is used to bootstrap censored data. Suppose that we wish to assess the variability of the median survival time and the probability of survival beyond 20 weeks for the first group of AML data (Example 3.9).
aml1 <- aml[aml$group==1,]
aml1.fun <- function(data)
{ surv <- survfit(Surv(data$time, data$cens))
  p1 <- min(surv$surv[surv$time<20])
  m1 <- min(surv$time[surv$surv<0.5])
  c(p1, m1) }
aml1.ord <- censboot(data=aml1, statistic=aml1.fun, R=50)
aml1.ord

This involves ordinary bootstrap resampling, and hence could be performed with boot, although aml1.fun would then have to be rewritten to have another argument.
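A minimal sketch of that rewriting (the name aml1.fun2 is ours):

aml1.fun2 <- function(data, i) aml1.fun(data[i,])  # add an index argument
aml1.boot <- boot(aml1, aml1.fun2, R=50)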
For conditional simulation, two additional arguments must be supplied containing the estimated survivor functions for the times to failure and the censoring distribution:

aml1.fail <- survfit(Surv(time, cens), data=aml1)
aml1.cens <- survfit(Surv(time-0.01*cens, 1-cens), data=aml1)
aml1.con <- censboot(data=aml1, statistic=aml1.fun, R=50,
  F.surv=aml1.fail, G.surv=aml1.cens, sim="cond")
11.3.4 Bootstrap diagnostics

Jackknife-after-bootstrap

The function jack.after.boot produces a jackknife-after-bootstrap plot of the first column of boot.out$t based on a nonparametric simulation. For example, for the city data ratio:
city.fun <- function(data, i)
{ d <- data[i,]
  rat <- mean(d$x)/mean(d$u)
  L <- (d$x-rat*d$u)/mean(d$u)
  c(rat, sum(L^2)/nrow(d)^2, L) }
city.boot <- boot(city, city.fun, R=999)
city.L <- city.boot$t0[3:12]
split.screen(c(1,2)); screen(1); split.screen(c(2,1)); screen(4)
attach(city)
plot(u, x, type="n", xlim=c(0,300), ylim=c(0,300))
text(u, x, round(city.L,2))
screen(3)
plot(u, x, type="n", xlim=c(0,300), ylim=c(0,300))
text(u, x, c(1:10)); abline(0, city.boot$t0[1], lty=2)
screen(2)
jack.after.boot(boot.out=city.boot, useJ=F, stinf=F, L=city.L)
close.screen(all=T)

The two left panels show the data with case numbers and empirical influence values as plotting symbols. The jackknife-after-bootstrap plot on the right shows the effect of deleting cases in turn: values of t* are more variable when case 4 is deleted and less variable when cases 9 and 10 are deleted. We see from the empirical influence values that the distribution of t* shifts downwards when cases with positive empirical influence values are deleted, and conversely.

This plot is also produced by setting the argument jack=T when plot is applied to a bootstrap object, as in plot(city.boot, jack=T). Other arguments to jack.after.boot control whether the influence values are standardized (by default they are, stinf=T) and whether the empirical influence values are used (by default jackknife values are used, based on the simulation, so the default values are useJ=T and L=NULL).

Most post-processing functions allow the user to specify either an index for the component of interest, or a vector of length boot.out$R to be treated as the main statistic. Thus a jackknife-after-bootstrap plot using the second component of city.boot$t — the estimated variances for t* — would be obtained by either of

jack.after.boot(city.boot, useJ=F, stinf=F, index=2)
jack.after.boot(city.boot, useJ=F, stinf=F, t=city.boot$t[,2])

Frequency smoothing

smooth.f smooths the frequencies of a nonparametric bootstrap object to give a "typical" distribution with expected value roughly at θ. In order to find the smoothed frequencies for θ = 1.4 for the city ratio, and to obtain the corresponding value of t, we set

city.freq <- smooth.f(theta=1.4, boot.out=city.boot)
city.w(city, city.freq)
The smoothing bandwidth is controlled by the width argument to smooth.f, and equals width × v^{1/2}, where v is the estimated variance of t; by default width=0.5.
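For example, a wider smoothing window can be tried as follows (the choice width=1 is arbitrary):

city.freq2 <- smooth.f(theta=1.4, boot.out=city.boot, width=1)
city.w(city, city.freq2)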
11.4 Tests

11.4.1 Parametric tests

Simple parametric tests can be conducted using parametric simulation. For example, to perform the conditional simulation for the data in fir (Example 4.2):
fir.mle <- c(sum(fir$count), nrow(fir))
fir.gen <- function(data, mle)
{ d <- data
  y <- sample(x=mle[2], size=mle[1], replace=T)
  d$count <- tabulate(y, mle[2])
  d }
fir.fun <- function(data)
  (nrow(data)-1)*var(data$count)/mean(data$count)
fir.boot <- boot(fir, fir.fun, R=999, sim="parametric",
  ran.gen=fir.gen, mle=fir.mle)
qqplot(qchisq(c(1:fir.boot$R)/(fir.boot$R+1), df=49), fir.boot$t)
abline(0,1,lty=2); abline(h=fir.boot$t0)

The last two lines here display the results (almost) as in the right panel of Figure 4.1.
11.4.2 Permutation tests

Approximate permutation tests are performed by setting sim="permutation" when invoking boot. For example, suppose that we wish to perform a permutation test for zero correlation between the two columns of the dataframe ducks:
perm.fun <- function(data, i) cor(data[,1], data[i,2])
ducks.perm <- boot(ducks, perm.fun, R=499, sim="permutation")
(sum(ducks.perm$t>ducks.perm$t0)+1)/(ducks.perm$R+1)
qqnorm(ducks.perm$t, ylim=c(-1,1))
abline(h=ducks.perm$t0, lty=2)

If strata is included in the call to boot, permutation is performed independently within each stratum.
11.4.3 Bootstrap tests

For a bootstrap test of the hypothesis of zero correlation in the ducks data, we make a new dataframe and function:
duck <- c(ducks[,1], ducks[,2])
n <- nrow(ducks)
duck.fun <- function(data, i, n)
{ x <- data[i]
  cor(x[1:n], x[(n+1):(2*n)]) }
.Random.seed <- ducks.perm$seed
ducks.boot <- boot(duck, duck.fun, R=499,
  strata=rep(c(1,2), c(n,n)), n=n)
(sum(ducks.boot$t>ducks.boot$t0)+1)/(ducks.boot$R+1)

This uses the same seed as for the permutation test, for a more precise comparison. Is the significance level similar to that for the permutation test? Why cannot boot be directly applied to ducks to perform a bootstrap test?

Exponential tilting

The test of equality of means for two sets of data in Example 4.16 involves exponential tilting. The null distribution puts probabilities given by (4.25) on the two sets of data, and the tilt parameter $\lambda$ solves the equation
$$\frac{\sum_{ij} z_{ij}\exp(\lambda z_{ij})}{\sum_{ij}\exp(\lambda z_{ij})} = \theta,$$
where $z_{1j} = y_{1j}$, $z_{2j} = -y_{2j}$, and $\theta = 0$. The fitted null distribution is obtained using exp.tilt, as follows:
z <- grav$g
z[grav$series==8] <- -z[grav$series==8]
z.tilt <- exp.tilt(L=z, theta=0, strata=grav$series)
z.tilt

where z.tilt contains the fitted probabilities (which sum to one for each stratum) and the values of $\lambda$ and $\theta$. Other arguments can be input to exp.tilt: see its help file.
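The per-stratum sums can be checked directly; a minimal sketch, assuming the elements of z.tilt$p are in the same order as the rows of grav:

tapply(z.tilt$p, grav$series, sum)   # should equal one for each stratum present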
The significance probability is then obtained by using the weights argument to boot. This argument is a vector containing the probabilities with which to select the rows of data, when bootstrap sampling is to be performed with unequal probabilities. In this case the unequal probabilities are given by the tilted distribution, under which the expected value of the test statistic is zero. The code needed to perform the simulation and get the estimated significance level is:

grav.test <- function(data, i)
{ d <- data[i,]
  diff(tapply(d$g, d$series, mean))[7] }
grav.boot <- boot(data=grav, statistic=grav.test, R=999,
  weights=z.tilt$p, strata=grav$series)
(sum(grav.boot$t>grav.boot$t0)+1)/(grav.boot$R+1)
11.5 Confidence Intervals

The main function for setting bootstrap confidence intervals is boot.ci, which takes as input a bootstrap object. For example, to get a 95% confidence interval for the ratio in the city data, using the city.boot object created in Section 11.3.4:

boot.ci(boot.out=city.boot)

By default the confidence level is 0.95, but other values can be obtained using the conf argument. Here invoking boot.ci shows the normal, basic, studentized bootstrap, percentile, and BCa intervals. Subsets of these intervals are obtained using the type argument. For example, if city.boot$t only contained the ratio and not its estimated variance, it would be impossible to obtain the studentized bootstrap interval, and an appropriate use of boot.ci would be

boot.ci(boot.out=city.boot, type=c("norm","perc","basic","bca"),
  conf=c(0.8,0.9))

By default boot.ci assumes that the first and second columns of boot.out$t contain the statistic itself and its estimated variance; otherwise the index argument can be used, as outlined in the help file.

To calculate intervals for the parameter h(θ), and then back-transform them to the original scale, we use the h, hinv, and hdot arguments. For example, to calculate intervals for the city ratio, using h(·) = log(·), we set

boot.ci(city.boot, h=log, hinv=exp, hdot=function(u) 1/u)

where hinv and hdot are the inverse and first derivative of h(·). Note how transformation improves the basic bootstrap interval.

Nonparametric ABC intervals are calculated using abc.ci. For example

abc.ci(data=city, statistic=city.w)

calculates the 95% ABC interval for the city ratio; statistic must be in weighted form for this. As usual, strata are incorporated using the strata argument.
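Returning to the index argument of boot.ci mentioned above, a minimal sketch of its use (here with the default columns 1 and 2, so equivalent to the default behaviour):

boot.ci(city.boot, type="stud", index=c(1,2))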
11.6 Linear Regression

11.6.1 Basic approaches

Resampling for linear regression models is performed using boot. It is simplest when bootstrapping cases. For example, to compare the biases and variances for parameter estimates from bootstrapping least squares and L1 estimates for the mammals data:
fit.model <- function(data)
{ fit <- glm(log(brain)~log(body), data=data)
  l1 <- l1fit(log(data$body), log(data$brain))
  c(coef(fit), coef(l1)) }
mammals.fun <- function(data, i) fit.model(data[i,])
mammals.boot <- boot(mammals, mammals.fun, R=99)
mammals.boot

For model-based resampling it is simplest to set up an augmented dataframe containing the residuals and fitted values. Although the model is a straightforward linear model, we fit it using glm rather than lm so that we can calculate residuals using the library function glm.diag, which calculates various types of residuals, approximate Cook statistics, and measures of leverage for a glm object. (The diagnostics are exact for a linear model.) A related function is glm.diag.plots, which produces standard diagnostic plots for a generalized linear model fit:
mam.lm <- glm(log(brain)~log(body), data=mammals)
mam.diag <- glm.diag(mam.lm)
glm.diag.plots(mam.lm)
res <- (mam.diag$res-mean(mam.diag$res))*mam.diag$sd
mam <- data.frame(mammals, res=res, fit=fitted(mam.lm))
mam.fun <- function(data, i)
{ d <- data
  d$brain <- exp(d$fit+d$res[i])
  fit.model(d) }
mam.boot <- boot(mam, mam.fun, R=99)
mam.boot

Empirical influence values and the nonparametric delta method standard error for the slope of the linear model could be obtained by putting the slope estimate in weighted form:
mam.w <- function(data, w)
  coef(glm(log(data$brain)~log(data$body), weights=w))[2]
mam.L <- empinf(data=mammals, statistic=mam.w)
sqrt(var.linear(mam.L))
For more complicated regressions, for example with unequal response variances, more information must be added to the new dataframe.

Wild bootstrap

The wild bootstrap can be implemented using sim="parametric", as follows:
mam.mle <- c(nrow(mam), (5+sqrt(5))/10)
mam.wild <- function(data, mle)
{ d <- data
  i <- 2*rbinom(mle[1], size=1, prob=1-mle[2])-1
  d$brain <- exp(d$fit+d$res*(1-i*sqrt(5))/2)
  d }
mam.boot.wild <- boot(mam, fit.model, R=20, sim="parametric",
  ran.gen=mam.wild, mle=mam.mle)
11.6.2 Prediction

Now consider prediction of the log brain weight of new mammals with body weights equal to those for the chimpanzee and baboon. For this we introduce yet another argument to boot — m, which gives the number of $\epsilon_m^*$ to be simulated with each bootstrap sample (see Algorithm 6.4). In this case we want to predict at m = 2 "new" mammals, with covariates contained in d.pred. The statistic function supplied to boot must now take at least one more argument, namely the additional indices for constructing the bootstrap versions of the two "new" mammals. We implement this as follows:
d.pred <- mam[c(46,47),]
pred <- function(data, d.pred)
  predict(glm(log(brain)~log(body), data=data), d.pred)
mam.pred <- function(data, i, i.pred, d.pred)
{ d <- data
  d$brain <- exp(d$fit+d$res[i])
  pred(d, d.pred) - (d.pred$fit + d$res[i.pred]) }
mam.boot.pred <- boot(mam, mam.pred, R=199, m=2, d.pred=d.pred)
orig <- matrix(pred(mam, d.pred), mam.boot.pred$R, 2, byrow=T)
exp(apply(orig+mam.boot.pred$t, 2, quantile, c(0.025,0.5,0.975)))

giving the 0.025, 0.5, and 0.975 prediction limits for the brain sizes of the "new" mammals. The actual brain sizes lie close to or above the upper limits of these intervals: primates tend to have larger brains than other mammals.
11.6.3 Aggregate prediction error and variable selection

Practical 6.5 shows how to obtain the various estimates of aggregate prediction error based on a given model.
For consistent bootstrap variable selection, a subset of size n − m is used to fit each of the possible models. Consider Example 6.13, where a fake set of data is made by
x1 <- runif(50); x2 <- runif(50); x3 <- runif(50)
x4 <- runif(50); x5 <- runif(50)
y <- rnorm(50)+2*x1+2*x2
fake <- data.frame(y, x1, x2, x3, x4, x5)
subset.boot <- function(data, i, size=0) { n <- nrow(data) i.t <- i [1:(n-size)] data.t <- data[i.t, ] resO <- data$y - mean(data.t$y) lm.d <- lm(y ~ xl, data=data.t) resl <- data$y - predict.lm(lm.d, data) lm.d <- update(lm.d, .~.+x2) res2 <- data$y - predict.lm(lm.d, data) lm.d <- update(lm.d, .~.+x3) res3 <- data$y - predict.lm(lm.d, data) lm.d <- update(lm.d, .~.+x4) res4 <- data$y - predict.lm(lm.d, data) lm.d <- update(lm.d, .”.+x5) res5 <- data$y - predict.lm(lm.d, data) meansq <- function(y) mean(y~2) apply(cbind(res0,resl,res2,res3,res4,res5),2,meansq)/n } fake.boot.40 <- boot(fake, subset.boot, R=100, size=40) delta.hat.40 <- apply(fake.boot.40$t,2,mean) plot(c(0:5).delta.hat.40,xlab="Number of covariates", ylab="Delta hat (M)" ,type="l",ylim=c(0,0.1)) F or results w ith a different value o f s i z e , but re-using f a k e . b o o t. 4 0 $ se e d in order to reduce sim ulation variability:
For results with a different value of size, but re-using fake.boot.40$seed in order to reduce simulation variability:

.Random.seed <- fake.boot.40$seed
fake.boot.30 <- boot(fake, subset.boot, R=100, size=30)
delta.hat.30 <- apply(fake.boot.30$t, 2, mean)
lines(c(0:5), delta.hat.30, lty=2)

Try this with various values of size.
Modify the code above to do variable selection using cross-validation, and compare it with the bootstrap results.
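One possible starting point is the library function cv.glm; a minimal sketch (the fitted model and the choice K=10 are ours, and the cost defaults to average squared error):

fake.glm <- glm(y ~ x1 + x2, data=fake)
cv.glm(fake, fake.glm, K=10)$delta   # cross-validation prediction error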
11.7 Further Topics in Regression

11.7.1 Nonlinear and generalized linear models

Nonlinear and generalized linear models are bootstrapped using the ideas in the preceding section. For example, to apply case resampling to the calcium data of Example 7.7:
calcium.fun <- function(data, i)
{ d <- data[i,]
  d.nls <- nls(cal~beta0*(1-exp(-time*beta1)), data=d,
    start=list(beta0=5, beta1=0.2))
  c(coefficients(d.nls), sum(d.nls$residuals^2)/(nrow(d)-2)) }
cal.boot <- boot(calcium, calcium.fun, R=19, strata=calcium$time)
Likewise, to apply model-based simulation to the leukaemia data of Example 7.1, resampling standardized deviance residuals according to (7.14),

leuk.glm <- glm(time~log10(wbc)+ag-1, Gamma(log), data=leuk)
leuk.diag <- glm.diag(leuk.glm)
muhat <- fitted(leuk.glm)
rL <- log(leuk$time/muhat)/sqrt(1-leuk.diag$h)
eps <- 10^(-4)
u <- -log(seq(from=eps, to=1-eps, by=eps))
d <- sign(u-1)*sqrt(2*(u-1-log(u)))/leuk.diag$sd
r.dev <- smooth.spline(d, u)
z <- predict(r.dev, leuk.diag$rd)$y
leuk.mle <- data.frame(muhat, rL, z)
fit.model <- function(data)
{ data.glm <- glm(time~log10(wbc)+ag-1, Gamma(log), data=data)
  c(coefficients(data.glm), deviance(data.glm)) }
leuk.gen <- function(data, mle)
{ i <- sample(nrow(data), replace=T)
  data$time <- mle$muhat*mle$z[i]
  data }
leuk.boot <- boot(leuk, fit.model, R=19, sim="parametric",
  ran.gen=leuk.gen, mle=leuk.mle)

The other procedures for model-based resampling of generalized linear models are applied similarly. Try to modify this code to resample the linear predictor residuals according to (7.13) (they are already calculated above).
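One possible sketch for that modification, assuming that (7.13) amounts to generating responses as y* = muhat exp(rL*) with rL* resampled from the rL values; check the details against (7.13) before relying on this, and the names leuk.mle.L and leuk.gen.L are ours:

leuk.mle.L <- data.frame(muhat, rL)
leuk.gen.L <- function(data, mle)
{ i <- sample(nrow(data), replace=T)  # resample linear predictor residuals
  data$time <- mle$muhat*exp(mle$rL[i])
  data }
leuk.boot.L <- boot(leuk, fit.model, R=19, sim="parametric",
  ran.gen=leuk.gen.L, mle=leuk.mle.L)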
11.7.2 Survival data

Further arguments to censboot are needed to bootstrap survival data. For illustration, we consider the melanoma data of Example 7.6, and fit a model in which survival depends on log tumour thickness. The initial fits are given by
mel.cox <- coxph(Surv(time,status==1)~log(thickness)
  +strata(ulcer), data=melanoma)
mel.surv <- survfit(mel.cox)
mel.cens <- survfit(Surv(time-0.01*(status!=1), status!=1)~1,
  data=melanoma)

The bootstrap function mel.fun given below need only take one argument, a dataframe containing the data themselves. Note how the function uses a smoothing spline to interpolate fitted values for the full range of thickness; this avoids difficulties due to the variability of the covariate when resampling cases. The output of mel.fun is the vector of fitted linear predictors predicted by the spline.
mel.fun <- function(d)
{ attach(d)
  cox <- coxph(Surv(time,status==1)~log(thickness)+strata(ulcer))
  eta <- unique(cox$linear.predictors)
  u <- unique(thickness)
  sp <- smooth.spline(u, eta, df=20)
  th <- seq(from=0.25, to=10, by=0.25)
  eta <- predict(sp, th)$y
  detach("d")
  eta }

The next three commands give the syntax for case resampling, for model-based resampling and for conditional resampling. For either of these last two schemes, the baseline survivor functions for the survival times and censoring times, and the fitted proportional hazards (Cox) model for the survival distribution must be supplied via the F.surv, G.surv, and cox arguments.
attach(melanoma)
mel.boot <- censboot(melanoma, mel.fun, R=99, strata=ulcer)
mel.boot.mod <- censboot(melanoma, mel.fun, R=99,
  F.surv=mel.surv, G.surv=mel.cens, strata=ulcer,
  cox=mel.cox, sim="model")
mel.boot.con <- censboot(melanoma, mel.fun, R=99,
  F.surv=mel.surv, G.surv=mel.cens, strata=ulcer,
  cox=mel.cox, sim="cond")
The bootstrap results are best displayed graphically. Here is the code for the analogue of the left panels of Figure 7.9:

th <- seq(from=0.25, to=10, by=0.25)
split.screen(c(2,1))
screen(1)
plot(th, mel.boot$t0, type="n", xlab="Tumour thickness (mm)",
  xlim=c(0,10), ylim=c(-2,2), ylab="Linear predictor")
lines(th, mel.boot$t0, lwd=3)
rug(jitter(thickness))
for (i in 1:19) lines(th, mel.boot$t[i,], lwd=0.5)
screen(2)
plot(th, mel.boot$t0, type="n", xlab="Tumour thickness (mm)",
  xlim=c(0,10), ylim=c(-2,2), ylab="Linear predictor")
lines(th, mel.boot$t0, lwd=3)
mel.env <- envelope(mel.boot$t, level=0.95)
lines(th, mel.env$point[1,], lty=1)
lines(th, mel.env$point[2,], lty=1)
mel.env <- envelope(mel.boot.mod$t, level=0.95)
lines(th, mel.env$point[1,], lty=2)
lines(th, mel.env$point[2,], lty=2)
mel.env <- envelope(mel.boot.con$t, level=0.95)
lines(th, mel.env$point[1,], lty=3)
lines(th, mel.env$point[2,], lty=3)
detach("melanoma")

Note how tight the confidence envelope is relative to that for the more highly parametrized model used in the example. Try again with larger values of R, if you have the patience.
11.7.3 Nonparametric regression

Nonparametric regression is bootstrapped in the same way as other regressions. Consider for example bootstrapping the smoothing spline fit to the motorcycle data of Example 7.10. The data without repeats are in motor, with components accel, times, strata, and v, the last two of which give the strata for resampling and an estimated variance within each stratum. The three fits are obtained by

attach(motor)
motor.smooth <- smooth.spline(times, accel, w=1/v)
motor.small <- smooth.spline(times, accel, w=1/v,
  spar=motor.smooth$spar/2)
motor.big <- smooth.spline(times, accel, w=1/v,
  spar=motor.smooth$spar*2)
Commands to set up and perform the resampling are as follows:
res <- (motor$accel-motor.small$y)/sqrt(1-motor.small$lev)
motor.mle <- data.frame(bigfit=motor.big$y, res=res)
xpoints <- c(10,20,25,30,35,45)
motor.fun <- function(data, x)
{ y.smooth <- smooth.spline(data$times, data$accel, w=1/data$v)
  predict(y.smooth, x)$y }
motor.gen <- function(data, mle)
{ d <- data
  i <- c(1:nrow(data))
  i1 <- sample(i[data$strata==1], replace=T)
  i2 <- sample(i[data$strata==2], replace=T)
  i3 <- sample(i[data$strata==3], replace=T)
  d$accel <- mle$bigfit + mle$res[c(i1,i2,i3)]
  d }
motor.boot <- boot(motor, motor.fun, R=999, sim="parametric",
  ran.gen=motor.gen, mle=motor.mle, x=xpoints)

Finally, the 90% basic bootstrap confidence limits are obtained by
mu.big <- predict(motor.big, xpoints)$y
mu <- predict(motor.smooth, xpoints)$y
ylims <- apply(motor.boot$t, 2, quantile, c(0.05,0.95))
ytop <- mu - (ylims[1,]-mu.big)
ybot <- mu - (ylims[2,]-mu.big)

What is the effect of using a smaller smoothing parameter when calculating the residuals? Try altering this code to apply the wild bootstrap, and see what effect it has on the results.
11.8 Time Series

Model-based resampling for time series is analogous to regression. We consider the sunspot data of Example 8.3, to which we fit the autoregressive model that minimizes AIC:

sun <- 2*(sqrt(sunspot+1)-1)
ts.plot(sun)
sun.ar <- ar(sun)
sun.ar$order
sun <- 2*(sqrt(sunspot+l)-l) ts.plot(sun) sun.air <- ar(sun) sun.ar$order The best m odel is AR(9). How well determ ined is this, and w hat is the variance o f the series average? We b o o tstrap to see, using
sun.fun <- function(tsb)
{ ar.fit <- ar(tsb, order.max=25)
  c(ar.fit$order, mean(tsb), tsb) }

which calculates the order of the fitted autoregressive model, the series average, and saves the series itself. Our function for bootstrapping time series is tsboot. Here are results for fixed-block bootstraps with block length l = 20:
sun.1 <- tsboot(sun, sun.fun, R=99, l=20, sim="fixed")
tsplot(sun.1$t[1,3:291], main="Block simulation, l=20")
table(sun.1$t[,1])
var(sun.1$t[,2])
qqnorm(sun.1$t[,2])

The statistic for tsboot takes only one argument, the time series. The first plot here shows the results for a single replicate using block simulation: note the occasional big jumps in the resampled series. Note also the large variation in the orders of the fitted autoregressive models. To obtain similar results for the stationary bootstrap with mean block length l = 20:
sun.2 <- tsboot(sun, sun.fun, R=99, l=20, sim="geom")

Are the results similar to having blocks of fixed length?
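A minimal sketch of that comparison, mirroring the summaries used for sun.1:

table(sun.2$t[,1])     # orders of the fitted AR models
var(sun.2$t[,2])       # variance of the series average
qqnorm(sun.2$t[,2])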
For model-based resampling we need to store results from the original model, and to make residuals from that fit:

sun.model <- list(order=c(sun.ar$order,0,0), ar=sun.ar$ar)
sun.res <- sun.ar$resid[!is.na(sun.ar$resid)]
sun.res <- sun.res - mean(sun.res)
sun.sim <- function(res, n.sim, ran.args)
{ rg1 <- function(n, res) sample(res, n, replace=T)
  ts.orig <- ran.args$ts
  ts.mod <- ran.args$model
  mean(ts.orig)+rts(arima.sim(model=ts.mod, n=n.sim,
    rand.gen=rg1, res=as.vector(res))) }
sun.3 <- tsboot(sun.res, sun.fun, R=99, sim="model", n.sim=114,
  ran.gen=sun.sim, ran.args=list(ts=sun, model=sun.model))

Check the orders of the fitted models for this scheme. Are they similar to those obtained using the block schemes above? For "post-blackening" we need to define yet another function:
sun.black <- function(res, n.sim, ran.args)
{ ts.orig <- ran.args$ts
  ts.mod <- ran.args$model
  mean(ts.orig)+rts(arima.sim(model=ts.mod, n=n.sim, innov=res)) }
sun.1b <- tsboot(sun.res, sun.fun, R=99, l=20, sim="fixed",
  ran.gen=sun.black, ran.args=list(ts=sun, model=sun.model),
  n.sim=length(sun))

Compare these results with those above, and try it with sim="geom".
11.9 Improved Simulation

11.9.1 Balanced resampling

The balanced bootstrap is invoked via the sim argument to boot:

city.bal <- boot(city, city.fun, R=20, sim="balanced")

If strata is supplied, balancing takes place separately within each stratum.
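Balance can be verified from the frequency array: under balanced resampling each case appears exactly R times in total across the bootstrap samples. A minimal sketch:

f <- boot.array(city.bal)
apply(f, 2, sum)   # every entry should equal R = 20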
11.9.2 Control methods

control applies the control methods, including post-simulation balance, to the output from an existing bootstrap simulation. For example,

control(city.boot, bias.adj=T)

produces the adjusted bias estimate, while

city.con <- control(city.boot)

gives a list consisting of the regression estimates of the empirical influence values, linear approximations to the bootstrap statistics, the control estimates of bias, variance, and the third cumulant of the t*, control estimates of selected quantiles of the distribution of t*, and a spline object that summarizes the approximate quantiles used to obtain the control quantile estimates. Saddlepoint approximation is used to obtain these approximate quantiles. Typing
. con$L . c o n $ b ia s . con$var . c o n $ q u a n tile s
gives some o f the above-m entioned quantities. A rgum ents to c o n t r o l allow the user to specify the em pirical influence values, the spline object, and other quantities to be used by control, if they are already available; see the help file for details.
11.9.3 Importance resampling

We have already met a use of nonparametric simulation with unequal probabilities in Section 11.4, using the weights argument to boot. The simplest form for weights, used there, is a vector containing the probabilities with which to select the rows of data, when bootstrap sampling is to be performed with unequal probabilities. If we wish to perform importance resampling using several distributions, we can set them up and then perform the sampling as follows:

city.top <- exp.tilt(L=city.L, theta=2, t0=city.w(city))
city.bot <- exp.tilt(L=city.L, theta=1.2, t0=city.w(city))
city.tilt <- boot(city, city.fun, R=c(100,99),
  weights=rbind(city.top$p, city.bot$p))

which performs 100 simulations from the probabilities in city.top$p and 99 from the probabilities in city.bot$p. In the first two lines exp.tilt is used to solve the equation
,
to I
£ , 0 e x p (^ ) V—\ /17 \ J2 j exp(A/j)
’
corresponding to exponential tilting of the linear approximation to t, to be centred at θ = 2 and 1.2. In the call to boot, R is a vector, and weights a matrix with length(R) rows and nrow(data) columns, corresponding to the length(R) distributions from which resampling takes place.

The importance sampling weights, moments, and selected quantiles of the resamples in city.tilt$t[,1] are calculated by

imp.weights(city.tilt)
imp.moments(city.tilt)
imp.quantile(city.tilt)

Each of these returns raw, ratio and regression estimates of the corresponding quantities. Some other uses of importance resampling are exemplified by

imp.prob(city.tilt, t0=1.2, def=F)
z <- (city.tilt$t[,1]-city.tilt$t0[1])/sqrt(city.tilt$t[,2])
imp.quantile(boot.out=city.tilt, t=z)

The call to imp.prob calculates the importance sampling estimate of the probability that t* ≤ 1.2, without using defensive mixture distributions (by default def=T, i.e. defensive mixture distributions are used to obtain the weights and estimates). The last two lines show how importance sampling is used to estimate quantiles of the studentized bootstrap statistic. For more details and further arguments to the functions, see their help files.
Function tilt.boot

The description above relies on exponential tilting to obtain the resampling probabilities, and requires knowing where to tilt to. If this is difficult, tilt.boot can be used to avoid this, by performing an initial bootstrap with equal resampling probabilities, then using frequency smoothing to estimate appropriate tilted probabilities. For example,

city.tilt <- tilt.boot(city, city.fun, R=c(500,250,249))

performs 500 ordinary bootstraps, uses the results to estimate probability distributions tilted to the 0.025 and 0.975 points of the simulations, and then performs 250 bootstraps tilted to the 0.025 quantile, and 249 tilted to the 0.975 quantile, before assigning the result to a bootstrap object. More complex uses of tilt.boot are possible; see its help file.

Importance re-weighting

These functions allow for importance re-weighting as well as importance sampling. For example, suppose that we want to re-weight the simulated values so that they appear to have been simulated from a distribution with expected ratio close to 1.4. We then use the q= option to the importance sampling functions as follows:

q <- smooth.f(theta=1.4, boot.out=city.tilt)
city.w(city, q)
imp.moments(city.tilt, q=q)
imp.quantile(city.tilt, q=q)

where the first line calculates the smoothed distribution, the second obtains the corresponding ratio, and the third and fourth obtain the moment and quantile estimates corresponding to simulation from the distribution q.
11.9.4 Saddlepoint methods

The function used for single saddlepoint approximation is saddle. Its simplest use is to obtain the PDF and CDF approximations for a linear statistic, such as the linear approximation $t + n^{-1}\sum_j f_j^* l_j$ to a general bootstrap statistic t*. The same results are obtained by using the approximation $n^{-1}\sum_j f_j^* l_j$ to t* − t, and this is what saddle does. To obtain the approximations at t* = 2 for the city data, we set

saddle(A=city.L/nrow(city), u=2-city.w(city))

which returns the PDF and CDF approximations, and the value of the saddlepoint. The function saddle.distn returns the saddlepoint estimate of an entire distribution, using the terms $n^{-1}l_j$ in the random sum and an initial idea of the centre and scale for the distribution of T* − t:
city.t0 <- c(0, sqrt(var.linear(city.L)))
city.sad <- saddle.distn(A=city.L/nrow(city), t0=city.t0)
city.sad

The Lugannani-Rice formula can be applied by setting LR=T in the calls to saddle and saddle.distn; by default LR=F. For more sophisticated applications, the arguments A and u to saddle.distn can be replaced by functions. For example, the bootstrapped ratio can be defined through the estimating equation
$$\sum_j f_j^*(x_j - t^* u_j) = 0, \qquad (11.1)$$
where the $f_j^*$ have a joint multinomial distribution with equal probabilities and denominator n = 10, the number of rows of city, as outlined in Example 9.16. Accordingly we set
city.t0 <- c(city.w(city), sqrt(var.linear(city.L)))
Afn <- function(t, data) data$x-t*data$u
ufn <- function(t, data) 0
saddle(A=Afn(2, city), u=0)
city.sad <- saddle.distn(A=Afn, u=ufn, t0=city.t0, data=city)

The penultimate line here gives the exact version of the call to saddle that started this section, while the last line calculates the saddlepoint approximation to the exact distribution of T*. For saddle.distn the quantiles of the distribution of T* are estimated by obtaining the CDF approximation at a number of values of t, and then interpolating the CDF using a spline smoother. The range of values of t used is determined by the contents of t0, whose first value contains the original value of the statistic, and whose second value contains a measure of the spread of the distribution of T*, such as its standard error.

Another use of saddle and saddle.distn is to give them directly the adjusted cumulant generating function $K(\xi) - t\xi$ and the second derivative $K''(\xi)$. For example, the city data above can be tackled as follows:
K.adj <- function(xi)
{ L <- city$x-city.t*city$u
  nrow(city)*log(sum(exp(xi*L))/nrow(city))-city.t*xi }
K2 <- function(xi)
{ L <- city$x-city.t*city$u
  p <- exp(L*xi)
  nrow(city)*(sum(L^2*p)/sum(p) - (sum(L*p)/sum(p))^2) }
city.t <- 2
saddle(K.adj=K.adj, K2=K2)
This is most useful when K(·) is not of the standard form that follows from a multinomial distribution.

Conditional approximations

Conditional saddlepoint approximation is applied by giving Afn and ufn more columns, and setting the wdist and type arguments to saddle appropriately. For example, suppose that we want to find the distribution of T*, defined as the root of (11.1), but resampling 25 rather than 49 cases of bigcity. Then we set
bigcity.L <- (bigcity$x-city.w(bigcity)*bigcity$u)/
  mean(bigcity$u)
bigcity.t0 <- c(city.w(bigcity), sqrt(var.linear(bigcity.L)))
Afn <- function(t, data) cbind(data$x-t*data$u, 1)
ufn <- function(t, data) c(0, 25)
saddle(A=Afn(1.4, bigcity), u=ufn(1.4, bigcity),
  wdist="p", type="cond")
city.sad <- saddle.distn(A=Afn, u=ufn, wdist="p", type="cond",
  data=bigcity, t0=bigcity.t0)

Here the wdist argument gives the distribution of the random variables $W_j$, which is Poisson in this case, and the type argument specifies that a conditional approximation is required. For resampling without replacement, see the help file. A further argument mu allows these variables to have differing means, in which case the conditional saddlepoint will correspond to sampling from multinomial or hypergeometric distributions with unequal probabilities.
11.10 Semiparametric Likelihoods

Basic functions only are provided for semiparametric likelihood inference. To calculate and plot the log profile likelihood for the mean of a gamma model for the larger air-conditioning data (Example 10.1):
gam.L <- function(y, tmin=min(y)+0.1, tmax=max(y)-0.1, n.t)
{ gam.loglik <- function(l.nu, mu, y)
  { nu <- exp(l.nu)
    -sum(log(dgamma(nu*y/mu, nu)*nu/mu)) }
  out <- matrix(NA, n.t+1, 3)
  for (it in 0:n.t)
  { t <- tmin + (tmax-tmin)*it/n.t
    fit <- nlminb(0, gam.loglik, mu=t, y=y)
    out[1+it,] <- c(t, exp(fit$parameters), -fit$objective) }
  out }
air.gam <- gam.L(aircondit7$hours, 40, 120, 100)
air.gam[,3] <- air.gam[,3] - max(air.gam[,3])
plot(air.gam[,1], air.gam[,3], type="l", xlab="theta",
  ylab="Log likelihood", xlim=c(40,120))
abline(h=-0.5*qchisq(0.95,1), lty=2)

Empirical and empirical exponential family likelihoods are obtained using the functions EL.profile and EEF.profile. They are included in the library for demonstration purposes only, and are not intended for serious use, nor are they currently supported as part of the library. These functions give log likelihoods for the mean of their first argument, calculated at n.t values of θ from tmin to tmax. The output of EL.profile is an n.t × 3 matrix whose first column contains the values of θ, whose next column contains the log profile likelihood, and whose final column contains the values of the Lagrange multiplier. The output of EEF.profile is an n.t × 4 matrix whose first column contains the values of θ, whose next two columns are versions of the log profile likelihood (see Example 10.4), and whose final column contains the values of the Lagrange multiplier. For example:
air.EL <- EL.profile(aircondit7$hours, tmin=40, tmax=120,
  n.t=100)
lines(air.EL[,1], air.EL[,2], lty=2)
air.EEF <- EEF.profile(aircondit7$hours, tmin=40, tmax=120,
  n.t=100)
lines(air.EEF[,1], air.EEF[,3], lty=3)

Note how close the two semiparametric log likelihoods are, compared to the parametric one. The practicals at the end of Chapter 10 give more examples of their use (and abuse). More general (and more robust!) code to calculate empirical likelihoods is provided by Professor A. B. Owen at Stanford University; see http://playfair.stanford.edu/reports/owen/el.S.
APPENDIX A Cumulant Calculations
In this book several chapters and some of the problems involve moment calculations, which are often simplified by using cumulants. The cumulant-generating function of a random variable Y is
$$K(t) = \log E(e^{tY}) = \sum_{s=1}^{\infty}\frac{t^s}{s!}\,\kappa_s,$$
where $\kappa_s$ is the sth cumulant, while the moment-generating function of Y is
$$M(t) = E(e^{tY}) = \sum_{s=0}^{\infty}\frac{t^s}{s!}\,\mu_s',$$
where $\mu_s' = E(Y^s)$ is the sth moment. A simple example is a $N(\mu, \sigma^2)$ random variable, for which $K(t) = t\mu + \frac{1}{2}t^2\sigma^2$; note the appealing fact that its cumulants of order higher than two are zero. By equating powers of t in the expansions of K(t) and log M(t) we find that $\kappa_1 = \mu_1'$ and that
$$\kappa_2 = \mu_2' - (\mu_1')^2,$$
$$\kappa_3 = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3,$$
$$\kappa_4 = \mu_4' - 4\mu_3'\mu_1' - 3(\mu_2')^2 + 12\mu_2'(\mu_1')^2 - 6(\mu_1')^4,$$
with inverse formulae
$$\mu_2' = \kappa_2 + (\kappa_1)^2,$$
$$\mu_3' = \kappa_3 + 3\kappa_2\kappa_1 + (\kappa_1)^3, \qquad (A.1)$$
$$\mu_4' = \kappa_4 + 4\kappa_3\kappa_1 + 3(\kappa_2)^2 + 6\kappa_2(\kappa_1)^2 + (\kappa_1)^4.$$
The cumulants $\kappa_1$, $\kappa_2$, $\kappa_3$, and $\kappa_4$ are the mean, variance, skewness and kurtosis of Y. For vector Y it is better to drop the power notation used above and to
552
A ■Cumulant Calculations
ad o p t index n o tatio n an d the sum m ation convention. In this no tatio n Y has com ponents Y ' , . . . , Y" an d we w rite Y ' Y ' and Y 'Y 'Y ' for the square and cube o f Y' . T he jo in t cum ulant-generating function K ( t ) o f Y l , . . . , Y n is the logarithm o f their jo in t m om ent-generating function, lo g E
= hk 1 + jjtitjKlJ + j ]titj tkKt'j'k + j ititJtktiK‘JJ<J + ■■■,
where sum m ation is im plied over repeated indices, so that, for example, t,/c‘ = t\Kl + ----- h t„Kn,
titjK''J = fitiK 1,1 + t \ t 2 K1'2 + ----- h tntnKn'n.
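In matrix terms the first two terms of this expansion are $t^T\kappa_1 + \frac{1}{2}t^TK_2t$, where the vector $\kappa_1$ holds the $\kappa^i$ and the matrix $K_2$ the $\kappa^{i,j}$; a one-line sketch (our own illustration, with hypothetical names) is

K.quad <- function(t, k1, K2)  # t_i kappa^i + (1/2) t_i t_j kappa^{i,j}
   sum(t*k1) + 0.5*sum(t*(K2 %*% t))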
Thus the $n$-dimensional normal distribution with means $\kappa^i$ and covariance matrix $\kappa^{i,j}$ has cumulant-generating function $t_i\kappa^i + \frac{1}{2}t_it_j\kappa^{i,j}$. We sometimes write $\kappa^{i,j} = \mathrm{cum}(Y^i, Y^j)$, $\kappa^{i,j,k} = \mathrm{cum}(Y^i, Y^j, Y^k)$ and so forth for the coefficients of $t_it_j$, $t_it_jt_k$ in $K(t)$. The cumulant arrays $\kappa^{i,j}$, etc. are invariant to index permutation, so for example $\kappa^{1,2,3} = \kappa^{2,3,1}$. A key feature that simplifies calculations with cumulants as opposed to moments is that cumulants involving two or more independent random variables are zero: for independent variables, $\kappa^{i,j} = \kappa^{i,j,k} = \cdots = 0$ unless all the indices are equal. The above notation extends to generalized cumulants such as
$$\mathrm{cum}(Y^iY^jY^k) = E(Y^iY^jY^k) = \kappa^{ijk}, \qquad \mathrm{cum}(Y^i, Y^jY^k) = \kappa^{i,jk}, \qquad \mathrm{cum}(Y^iY^j, Y^k, Y^l) = \kappa^{ij,k,l},$$
which can be obtained from the joint cumulant-generating functions of $Y^iY^jY^k$, of $Y^i$ and $Y^jY^k$, and of $Y^iY^j$, $Y^k$ and $Y^l$. Note that ordinary moments can be regarded as generalized cumulants. Generalized cumulants can be expressed in terms of ordinary cumulants by means of complementary set partitions, the most useful of which are given in Table A.1. For example, we use its second column to see that $\kappa^{ij} = \kappa^{i,j} + \kappa^i\kappa^j$, or
$$E(Y^iY^j) = \mathrm{cum}(Y^iY^j) = \mathrm{cum}(Y^i, Y^j) + \mathrm{cum}(Y^i)\,\mathrm{cum}(Y^j),$$
more familiarly written $\mathrm{cov}(Y^i, Y^j) + E(Y^i)E(Y^j)$. The boldface 12 represents $\kappa^{12}$, while the 12 [1] and 1|2 [1] immediately below it represent $\kappa^{1,2}$ and $\kappa^1\kappa^2$. With this understanding we use the third column to see that $\kappa^{ijk} = \kappa^{i,j,k} + \kappa^{i,j}\kappa^k\,[3] + \kappa^i\kappa^j\kappa^k$, where [3] is shorthand for $\kappa^{i,j}\kappa^k + \kappa^{i,k}\kappa^j + \kappa^{j,k}\kappa^i$; this is the multivariate version of (A.1). Likewise $\kappa^{ij,k} = \kappa^{i,j,k} + \kappa^{i,k}\kappa^j\,[2]$, where the term $\kappa^{i,k}\kappa^j\,[2]$ on the right is understood in the context of the left-hand side to equal $\kappa^{i,k}\kappa^j + \kappa^{j,k}\kappa^i$: each index in the first block of the partition $ij|k$ appears once with the index in the second block. The expression 123|4 [4] in the fourth column of the table represents the partitions 123|4, 124|3, 134|2, 234|1. To illustrate these ideas, we calculate $\mathrm{cov}\{\bar{Y}, (n-1)^{-1}\sum(Y_j - \bar{Y})^2\}$, where
$\bar{Y} = n^{-1}\sum Y_j$ is the average of the independent and identically distributed random variables $Y_1, \ldots, Y_n$. Note first that the covariance does not depend on the mean of the $Y_j$, so we can take $\kappa^i = 0$. We then express $\bar{Y}$ and $(n-1)^{-1}\sum(Y_j - \bar{Y})^2$ in index notation as $a_iY^i$ and $b_{ij}Y^iY^j$, where $a_i = 1/n$ and $b_{ij} = (\delta_{ij} - 1/n)/(n-1)$, with
$$\delta_{ij} = \begin{cases} 1, & i = j,\\ 0, & \text{otherwise}, \end{cases}$$
the Kronecker delta symbol. The covariance is
$$\mathrm{cum}(a_iY^i, b_{jk}Y^jY^k) = a_ib_{jk}\kappa^{i,jk} = a_ib_{jk}\kappa^{i,j,k} = na_1b_{11}\kappa^{1,1,1},$$
the second equality following on use of Table A.1 because $\kappa^i = 0$, and the third equality following because the observations are independent and identically distributed. In power notation $\kappa^{1,1,1}$ is $\kappa_3$, the third cumulant of $Y_1$, so $\mathrm{cov}\{\bar{Y}, (n-1)^{-1}\sum(Y_j - \bar{Y})^2\} = \kappa_3/n$. Similarly
$$\mathrm{var}\{(n-1)^{-1}\sum(Y_j - \bar{Y})^2\} = \mathrm{cum}(b_{ij}Y^iY^j, b_{kl}Y^kY^l) = b_{ij}b_{kl}\kappa^{ij,kl},$$
which Table A.1 shows to be equal to $b_{ij}b_{kl}(\kappa^{i,j,k,l} + \kappa^{i,k}\kappa^{j,l} + \kappa^{i,l}\kappa^{j,k})$. This reduces to
$$nb_{11}b_{11}\kappa^{1,1,1,1} + 2nb_{11}b_{11}\kappa^{1,1}\kappa^{1,1} + 2n(n-1)b_{12}b_{12}\kappa^{1,1}\kappa^{1,1},$$
which in turn is $\kappa_4/n + 2(\kappa_2)^2/(n-1)$ in power notation. To perform this calculation using moments and power notation will convince the reader of the elegance and relative simplicity of cumulants and index notation. McCullagh (1987) makes a cogent more-extended case for these methods. His book includes more-extensive tables of complementary set partitions.
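These two power-notation results are easy to check by simulation; the sketch below (our own addition, not from the book's library) uses unit exponential data, for which $\kappa_2 = 1$, $\kappa_3 = 2$ and $\kappa_4 = 6$.

# Check cov{Ybar, s^2} = kappa_3/n and var(s^2) = kappa_4/n + 2*kappa_2^2/(n-1)
n <- 10; R <- 100000
ybar <- s2 <- numeric(R)
for (r in 1:R) { y <- rexp(n); ybar[r] <- mean(y); s2[r] <- var(y) }
c(mean((ybar - mean(ybar))*(s2 - mean(s2))), 2/n)   # simulated and exact covariance
c(var(s2), 6/n + 2/(n - 1))                         # simulated and exact variance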
Table A.1  Complementary set partitions. Each boldface partition is followed by its complementary partitions; a bracketed number counts the permuted versions that the expression to its left represents.

Column 1:
  1:        1 [1]

Column 2:
  12:       12 [1]    1|2 [1]
  1|2:      12 [1]

Column 3:
  123:      123 [1]   12|3 [3]   1|2|3 [1]
  12|3:     123 [1]   13|2 [2]
  1|2|3:    123 [1]

Column 4:
  1234:     1234 [1]  123|4 [4]  12|34 [3]  12|3|4 [6]  1|2|3|4 [1]
  123|4:    1234 [1]  124|3 [3]  12|34 [3]  14|2|3 [3]
  12|34:    1234 [1]  123|4 [2]  134|2 [2]  13|24 [2]   13|2|4 [4]
  12|3|4:   1234 [1]  134|2 [2]  13|24 [2]
  1|2|3|4:  1234 [1]
Bibliography
Abelson, R. P. and Tukey, J. W. (1963) Efficient utilization of non-numerical information in quantitative analysis: general theory and the case of simple order. Annals of Mathematical Statistics 34, 1347-1369.
Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, eds B. N. Petrov and F. Csáki, pp. 267-281. Budapest: Akadémiai Kiadó. Reprinted in Breakthroughs in Statistics, volume 1, eds S. Kotz and N. L. Johnson, pp. 610-624. New York: Springer.
Akritas, M. G. (1986) Bootstrapping the Kaplan-Meier estimator. Journal of the American Statistical Association 81, 1032-1038.
Altman, D. G. and Andersen, P. K. (1989) Bootstrap investigation of the stability of a Cox regression model. Statistics in Medicine 8, 771-783.
Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1993) Statistical Models Based on Counting Processes. New York: Springer.
Andrews, D. F. and Herzberg, A. M. (1985) Data: A Collection of Problems from Many Fields for the Student and Research Worker. New York: Springer.
Appleyard, S. T., Witkowski, J. A., Ripley, B. D., Shotton, D. M. and Dubowicz, V. (1985) A novel procedure for pattern analysis of features present on freeze fractured plasma membranes. Journal of Cell Science 74, 105-117.
Athreya, K. B. (1987) Bootstrap of the mean in the infinite variance case. Annals of Statistics 15, 724-731.
Atkinson, A. C. (1985) Plots, Transformations, and Regression. Oxford: Clarendon Press.
Bai, C. and Olshen, R. A. (1988) Discussion of "Theoretical comparison of bootstrap confidence intervals", by P. Hall. Annals of Statistics 16, 953-956.
Bailer, A. J. and Oris, J. T. (1994) Assessing toxicity of pollutants in aquatic systems. In Case Studies in Biometry, eds N. Lange, L. Ryan, L. Billard, D. R. Brillinger, L. Conquest and J. Greenhouse, pp. 25-40. New York: Wiley.
Banks, D. L. (1988) Histospline smoothing the Bayesian bootstrap. Biometrika 75, 673-684.
Barbe, P. and Bertail, P. (1995) The Weighted Bootstrap. Volume 98 of Lecture Notes in Statistics. New York: Springer.
Barnard, G. A. (1963) Discussion of "The spectral analysis of point processes", by M. S. Bartlett. Journal of the Royal Statistical Society series B 25, 294.
Barndorff-Nielsen, O. E. and Cox, D. R. (1989) Asymptotic Techniques for Use in Statistics. London: Chapman & Hall.
Barndorff-Nielsen, O. E. and Cox, D. R. (1994) Inference and Asymptotics. London: Chapman & Hall.
Beran, J. (1994) Statistics for Long-Memory Processes. London: Chapman & Hall.
Beran, R. J. (1986) Simulated power functions. Annals of Statistics 14, 151-173.
Beran, R. J. (1987) Prepivoting to reduce level error of confidence sets. Biometrika 74, 457-468.
Beran, R. J. (1988) Prepivoting test statistics: a bootstrap view of asymptotic refinements. Journal of the American Statistical Association 83, 687-697.
Beran, R. J. (1992) Designing bootstrap prediction regions. In Bootstrapping and Related Techniques: Proceedings, Trier, FRG, 1990, eds K.-H. Jöckel, G. Rothe and W. Sendler, volume 376 of Lecture Notes in Economics and Mathematical Systems, pp. 23-30. New York: Springer.
Beran, R. J. (1997) Diagnosing bootstrap success. Annals of the Institute of Statistical Mathematics 49, to appear.
Berger, J. O. and Bernardo, J. M. (1992) On the development of reference priors (with Discussion). In Bayesian Statistics 4, eds J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, pp. 35-60. Oxford: Clarendon Press.
Besag, J. E. and Clifford, P. (1989) Generalized Monte Carlo significance tests. Biometrika 76, 633-642.
Besag, J. E. and Clifford, P. (1991) Sequential Monte Carlo p-values. Biometrika 78, 301-304.
Besag, J. E. and Diggle, P. J. (1977) Simple Monte Carlo tests for spatial pattern. Applied Statistics 26, 327-333.
Bickel, P. J. and Freedman, D. A. (1981) Some asymptotic theory for the bootstrap. Annals of Statistics 9, 1196-1217.
Bickel, P. J. and Freedman, D. A. (1983) Bootstrapping regression models with many parameters. In A Festschrift for Erich L. Lehmann, eds P. J. Bickel, K. A. Doksum and J. L. Hodges, pp. 28-48. Pacific Grove, California: Wadsworth & Brooks/Cole.
Bickel, P. J. and Freedman, D. A. (1984) Asymptotic normality and the bootstrap in stratified sampling. Annals of Statistics 12, 470-482.
Bickel, P. J., Götze, F. and van Zwet, W. R. (1997) Resampling fewer than n observations: gains, losses, and remedies for losses. Statistica Sinica 7, 1-32.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993) Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press.
Bickel, P. J. and Yahav, J. A. (1988) Richardson extrapolation and the bootstrap. Journal of the American Statistical Association 83, 387-393.
Bissell, A. F. (1972) A negative binomial model with varying element sizes. Biometrika 59, 435-441.
Bissell, A. F. (1990) How reliable is your capability index? Applied Statistics 39, 331-340.
Bithell, J. F. and Stone, R. A. (1989) On statistical methods for analysing the geographical distribution of cancer cases near nuclear installations. Journal of Epidemiology and Community Health 43, 79-85.
Bloomfield, P. (1976) Fourier Analysis of Time Series: An Introduction. New York: Wiley.
Boos, D. D. and Monahan, J. F. (1986) Bootstrap methods using prior information. Biometrika 73, 77-83.
Booth, J. G. (1996) Bootstrap methods for generalized linear mixed models with applications to small area estimation. In Statistical Modelling, eds G. U. H. Seeber, B. J. Francis, R. Hatzinger and G. Steckel-Berger, volume 104 of Lecture Notes in Statistics, pp. 43-51. New York: Springer.
Booth, J. G. and Butler, R. W. (1990) Randomization distributions and saddlepoint approximations in generalized linear models. Biometrika 77, 787-796.
Booth, J. G., Butler, R. W. and Hall, P. (1994) Bootstrap methods for finite populations. Journal of the American Statistical Association 89, 1282-1289.
Booth, J. G. and Hall, P. (1994) Monte Carlo approximation and the iterated bootstrap. Biometrika 81, 331-340.
Booth, J. G., Hall, P. and Wood, A. T. A. (1992) Bootstrap estimation of conditional distributions. Annals of Statistics 20, 1594-1610.
Booth, J. G., Hall, P. and Wood, A. T. A. (1993) Balanced importance resampling for the bootstrap. Annals of Statistics 21, 286-298.
Bose, A. (1988) Edgeworth correction by bootstrap in autoregressions. Annals of Statistics 16, 1709-1722.
Box, G. E. P. and Cox, D. R. (1964) An analysis of transformations (with Discussion). Journal of the Royal Statistical Society series B 26, 211-246.
Bratley, P., Fox, B. L. and Schrage, L. E. (1987) A Guide to Simulation. Second edition. New York: Springer.
Braun, W. J. and Kulperger, R. J. (1995) A Fourier method for bootstrapping time series. Preprint, Department of Mathematics and Statistics, University of Winnipeg.
Braun, W. J. and Kulperger, R. J. (1997) Properties of a Fourier bootstrap method for time series. Communications in Statistics — Theory and Methods 26, to appear.
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984) Classification and Regression Trees. Pacific Grove, California: Wadsworth & Brooks/Cole.
Breslow, N. (1985) Cohort analysis in epidemiology. In A Celebration of Statistics, eds A. C. Atkinson and S. E. Fienberg, pp. 109-143. New York: Springer.
Bretagnolle, J. (1983) Lois limites du bootstrap de certaines fonctionelles. Annales de l'Institut Henri Poincaré, Section B 19, 281-296.
Brillinger, D. R. (1981) Time Series: Data Analysis and Theory. Expanded edition. San Francisco: Holden-Day.
Brillinger, D. R. (1988) An elementary trend analysis of Rio Negro levels at Manaus, 1903-1985. Brazilian Journal of Probability and Statistics 2, 63-79.
Brillinger, D. R. (1989) Consistent detection of a monotonic trend superposed on a stationary time series. Biometrika 76, 23-30.
Brockwell, P. J. and Davis, R. A. (1991) Time Series: Theory and Methods. Second edition. New York: Springer.
Brockwell, P. J. and Davis, R. A. (1996) Introduction to Time Series and Forecasting. New York: Springer.
Brown, B. W. (1980) Prediction analysis for binary data. In Biostatistics Casebook, eds R. G. Miller, B. Efron, B. W. Brown and L. E. Moses, pp. 3-18. New York: Wiley.
Buckland, S. T. and Garthwaite, P. H. (1990) Algorithm AS 259: estimating confidence intervals by the Robbins-Monro search process. Applied Statistics 39, 413-424.
Bühlmann, P. and Künsch, H. R. (1995) The blockwise bootstrap for general parameters of a stationary time series. Scandinavian Journal of Statistics 22, 35-54.
Bunke, O. and Droge, B. (1984) Bootstrap and cross-validation estimates of the prediction error for linear regression models. Annals of Statistics 12, 1400-1424.
Burman, P. (1989) A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika 76, 503-514.
Burr, D. (1994) A comparison of certain bootstrap confidence intervals in the Cox model. Journal of the American Statistical Association 89, 1290-1302.
Burr, D. and Doss, H. (1993) Confidence bands for the median survival time as a function of covariates in the Cox model. Journal of the American Statistical Association 88, 1330-1340.
Canty, A. J., Davison, A. C. and Hinkley, D. V. (1996) Reliable confidence intervals. Discussion of "Bootstrap confidence intervals", by T. J. DiCiccio and B. Efron. Statistical Science 11, 214-219.
Carlstein, E. (1986) The use of subseries values for estimating the variance of a general statistic from a stationary sequence. Annals of Statistics 14, 1171-1179.
Carpenter, J. R. (1996) Simulated confidence regions for parameters in epidemiological models. Ph.D. thesis, Department of Statistics, University of Oxford.
Chambers, J. M. and Hastie, T. J. (eds) (1992) Statistical Models in S. Pacific Grove, California: Wadsworth & Brooks/Cole.
Chao, M.-T. and Lo, S.-H. (1994) Maximum likelihood summary and the bootstrap method in structured finite populations. Statistica Sinica 4, 389-406.
Chapman, P. and Hinkley, D. V. (1986) The double bootstrap, pivots and confidence limits. Technical Report 34, Center for Statistical Sciences, University of Texas at Austin.
Chen, C., Davis, R. A., Brockwell, P. J. and Bai, Z. D. (1993) Order determination for autoregressive processes using resampling methods. Statistica Sinica 3, 481-500.
Chen, C.-H. and George, S. L. (1985) The bootstrap and identification of prognostic factors via Cox's proportional hazards regression model. Statistics in Medicine 4, 39-46.
Chen, S. X. (1996) Empirical likelihood confidence intervals for nonparametric density estimation. Biometrika 83, 329-341.
Chen, S. X. and Hall, P. (1993) Smoothed empirical likelihood confidence intervals for quantiles. Annals of Statistics 21, 1166-1181.
Chen, Z. and Do, K.-A. (1994) The bootstrap method with saddlepoint approximations and importance resampling. Statistica Sinica 4, 407-421.
Cobb, G. W. (1978) The problem of the Nile: conditional solution to a changepoint problem. Biometrika 65, 243-252.
Cochran, W. G. (1977) Sampling Techniques. Third edition. New York: Wiley.
Collings, B. J. and Hamilton, M. A. (1988) Estimating the power of the two-sample Wilcoxon test for location shift. Biometrics 44, 847-860.
Cook, R. D., Hawkins, D. M. and Weisberg, S. (1992) Comparison of model misspecification diagnostics using residuals from least mean of squares and least median of squares fits. Journal of the American Statistical Association 87, 419-424.
Cook, R. D., Tsai, C.-L. and Wei, B. C. (1986) Bias in nonlinear regression. Biometrika 73, 615-623.
Cook, R. D. and Weisberg, S. (1982) Residuals and Influence in Regression. London: Chapman & Hall.
Cook, R. D. and Weisberg, S. (1994) Transforming a response variable for linearity. Biometrika 81, 731-737.
Corcoran, S. A., Davison, A. C. and Spady, R. H. (1996) Reliable inference from empirical likelihoods. Preprint, Department of Statistics, University of Oxford.
Cowling, A., Hall, P. and Phillips, M. J. (1996) Bootstrap confidence regions for the intensity of a Poisson process. Journal of the American Statistical Association 91, 1516-1524.
Cox, D. R. and Hinkley, D. V. (1974) Theoretical Statistics. London: Chapman & Hall.
Cox, D. R. and Isham, V. (1980) Point Processes. London: Chapman & Hall.
Cox, D. R. and Lewis, P. A. W. (1966) The Statistical Analysis of Series of Events. London: Chapman & Hall.
Cox, D. R. and Oakes, D. (1984) Analysis of Survival Data. London: Chapman & Hall.
Cox, D. R. and Snell, E. J. (1981) Applied Statistics: Principles and Examples. London: Chapman & Hall.
Cressie, N. A. C. (1982) Playing safe with misweighted means. Journal of the American Statistical Association 77, 754-759.
Cressie, N. A. C. (1991) Statistics for Spatial Data. New York: Wiley.
Dahlhaus, R. and Janas, D. (1996) A frequency domain bootstrap for ratio statistics in time series analysis. Annals of Statistics 24, to appear.
Daley, D. J. and Vere-Jones, D. (1988) An Introduction to the Theory of Point Processes. New York: Springer.
Daniels, H. E. (1954) Saddlepoint approximations in statistics. Annals of Mathematical Statistics 25, 631-650.
Daniels, H. E. (1955) Discussion of "Permutation theory in the derivation of robust criteria and the study of departures from assumption", by G. E. P. Box and S. L. Andersen. Journal of the Royal Statistical Society series B 17, 27-28.
Daniels, H. E. (1958) Discussion of "The regression analysis of binary sequences", by D. R. Cox. Journal of the Royal Statistical Society series B 20, 236-238.
Daniels, H. E. and Young, G. A. (1991) Saddlepoint approximation for the studentized mean, with an application to the bootstrap. Biometrika 78, 169-179.
Davison, A. C. (1988) Discussion of the Royal Statistical Society meeting on the bootstrap. Journal of the Royal Statistical Society series B 50, 356-357.
Davison, A. C. and Hall, P. (1992) On the bias and variability of bootstrap and cross-validation estimates of error rate in discrimination problems. Biometrika 79, 279-284.
Davison, A. C. and Hall, P. (1993) On Studentizing and blocking methods for implementing the bootstrap with dependent data. Australian Journal of Statistics 35, 215-224.
Davison, A. C. and Hinkley, D. V. (1988) Saddlepoint approximations in resampling methods. Biometrika 75, 417-431.
Davison, A. C., Hinkley, D. V. and Schechtman, E. (1986) Efficient bootstrap simulation. Biometrika 73, 555-566.
Davison, A. C., Hinkley, D. V. and Worton, B. J. (1992) Bootstrap likelihoods. Biometrika 79, 113-130.
Davison, A. C., Hinkley, D. V. and Worton, B. J. (1995) Accurate and efficient construction of bootstrap likelihoods. Statistics and Computing 5, 257-264.
Davison, A. C. and Snell, E. J. (1991) Residuals and diagnostics. In Statistical Theory and Modelling: In Honour of Sir David Cox, FRS, eds D. V. Hinkley, N. Reid and E. J. Snell, pp. 83-106. London: Chapman & Hall.
De Angelis, D. and Gilks, W. R. (1994) Estimating acquired immune deficiency syndrome incidence accounting for reporting delay. Journal of the Royal Statistical Society series A 157, 31-40.
De Angelis, D., Hall, P. and Young, G. A. (1993) Analytical and bootstrap approximations to estimator distributions in L1 regression. Journal of the American Statistical Association 88, 1310-1316.
De Angelis, D. and Young, G. A. (1992) Smoothing the bootstrap. International Statistical Review 60, 45-56.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with Discussion). Journal of the Royal Statistical Society series B 39, 1-38.
Diaconis, P. and Holmes, S. (1994) Gray codes for randomization procedures. Statistics and Computing 4, 287-302.
DiCiccio, T. J. and Efron, B. (1992) More accurate confidence intervals in exponential families. Biometrika 79, 231-245.
DiCiccio, T. J. and Efron, B. (1996) Bootstrap confidence intervals (with Discussion). Statistical Science 11, 189-228.
DiCiccio, T. J., Hall, P. and Romano, J. P. (1989) Comparison of parametric and empirical likelihood functions. Biometrika 76, 465-476.
DiCiccio, T. J., Hall, P. and Romano, J. P. (1991) Empirical likelihood is Bartlett-correctable. Annals of Statistics 19, 1053-1061.
DiCiccio, T. J., Martin, M. A. and Young, G. A. (1992a) Analytic approximations for iterated bootstrap confidence intervals. Statistics and Computing 2, 161-171.
DiCiccio, T. J., Martin, M. A. and Young, G. A. (1992b) Fast and accurate approximate double bootstrap confidence intervals. Biometrika 79, 285-295.
DiCiccio, T. J., Martin, M. A. and Young, G. A. (1994) Analytical approximations to bootstrap distribution functions using saddlepoint methods. Statistica Sinica 4, 281-295.
DiCiccio, T. J. and Romano, J. P. (1988) A review of bootstrap confidence intervals (with Discussion). Journal of the Royal Statistical Society series B 50, 338-370. Correction, volume 51, p. 470.
DiCiccio, T. J. and Romano, J. P. (1989) On adjustments based on the signed root of the empirical likelihood ratio statistic. Biometrika 76, 447-456.
DiCiccio, T. J. and Romano, J. P. (1990) Nonparametric confidence limits by resampling methods and least favorable families. International Statistical Review 58, 59-76.
Diggle, P. J. (1983) Statistical Analysis of Spatial Point Patterns. London: Academic Press.
Diggle, P. J. (1990) Time Series: A Biostatistical Introduction. Oxford: Clarendon Press.
Diggle, P. J. (1993) Point process modelling in environmental epidemiology. In Statistics for the Environment, eds V. Barnett and K. F. Turkman, pp. 89-110. Chichester: Wiley.
Diggle, P. J., Lange, N. and Benes, F. M. (1991) Analysis of variance for replicated spatial point patterns in clinical neuroanatomy. Journal of the American Statistical Association 86, 618-625.
Diggle, P. J. and Rowlingson, B. S. (1994) A conditional approach to point process modelling of elevated risk. Journal of the Royal Statistical Society series A 157, 433-440.
Do, K.-A. and Hall, P. (1991) On importance resampling for the bootstrap. Biometrika 78, 161-167.
Do, K.-A. and Hall, P. (1992a) Distribution estimation using concomitants of order statistics, with application to Monte Carlo simulation for the bootstrap. Journal of the Royal Statistical Society series B 54, 595-607.
Do, K.-A. and Hall, P. (1992b) Quasi-random resampling for the bootstrap. Statistics and Computing 2, 13-22.
Dobson, A. J. (1990) An Introduction to Generalized Linear Models. London: Chapman & Hall.
Donegani, M. (1991) An adaptive and powerful randomization test. Biometrika 78, 930-933.
Doss, H. and Gill, R. D. (1992) An elementary approach to weak convergence for quantile processes, with applications to censored survival data. Journal of the American Statistical Association 87, 869-877.
Draper, N. R. and Smith, H. (1981) Applied Regression Analysis. Second edition. New York: Wiley.
Ducharme, G. R., Jhun, M., Romano, J. P. and Truong, K. N. (1985) Bootstrap confidence cones for directional data. Biometrika 72, 637-645.
Easton, G. S. and Ronchetti, E. M. (1986) General saddlepoint approximations with applications to L statistics. Journal of the American Statistical Association 81, 420-430.
Efron, B. (1979) Bootstrap methods: another look at the jackknife. Annals of Statistics 7, 1-26.
Efron, B. (1981a) Nonparametric standard errors and confidence intervals (with Discussion). Canadian Journal of Statistics 9, 139-172.
Efron, B. (1981b) Censored data and the bootstrap. Journal of the American Statistical Association 76, 312-319.
Efron, B. (1982) The Jackknife, the Bootstrap, and Other Resampling Plans. Number 38 in CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia: SIAM.
Efron, B. (1983) Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association 78, 316-331.
Efron, B. (1986) How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association 81, 461-470.
Efron, B. (1987) Better bootstrap confidence intervals (with Discussion). Journal of the American Statistical Association 82, 171-200.
Efron, B. (1988) Computer-intensive methods in statistical regression. SIAM Review 30, 421-449.
Efron, B. (1990) More efficient bootstrap computations. Journal of the American Statistical Association 85, 79-89.
Efron, B. (1992) Jackknife-after-bootstrap standard errors and influence functions (with Discussion). Journal of the Royal Statistical Society series B 54, 83-127.
Efron, B. (1993) Bayes and likelihood calculations from confidence intervals. Biometrika 80, 3-26.
Efron, B. (1994) Missing data, imputation, and the bootstrap (with Discussion). Journal of the American Statistical Association 89, 463-479.
Efron, B., Halloran, M. E. and Holmes, S. (1996) Bootstrap confidence levels for phylogenetic trees. Proceedings of the National Academy of Sciences, USA 93, 13429-13434.
Efron, B. and Stein, C. M. (1981) The jackknife estimate of variance. Annals of Statistics 9, 586-596.
Efron, B. and Tibshirani, R. J. (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy (with Discussion). Statistical Science 1, 54-96.
Efron, B. and Tibshirani, R. J. (1993) An Introduction to the Bootstrap. New York: Chapman & Hall.
Efron, B. and Tibshirani, R. J. (1997) Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association 92, 548-560.
Fang, K. T. and Wang, Y. (1994) Number-Theoretic Methods in Statistics. London: Chapman & Hall.
Faraway, J. J. (1992) On the cost of data analysis. Journal of Computational and Graphical Statistics 1, 213-229.
Feigl, P. and Zelen, M. (1965) Estimation of exponential survival probabilities with concomitant information. Biometrics 21, 826-838.
Feller, W. (1968) An Introduction to Probability Theory and its Applications. Third edition, volume I. New York: Wiley.
Fernholtz, L. T. (1983) von Mises Calculus for Statistical Functionals. Volume 19 of Lecture Notes in Statistics. New York: Springer.
Ferretti, N. and Romo, J. (1996) Unit root bootstrap tests for AR(1) models. Biometrika 83, 849-860.
Field, C. and Ronchetti, E. M. (1990) Small Sample Asymptotics. Volume 13 of Lecture Notes — Monograph Series. Hayward, California: Institute of Mathematical Statistics.
Firth, D. (1991) Generalized linear models. In Statistical Theory and Modelling: In Honour of Sir David Cox, FRS, eds D. V. Hinkley, N. Reid and E. J. Snell, pp. 55-82. London: Chapman & Hall.
Firth, D. (1993) Bias reduction of maximum likelihood estimates. Biometrika 80, 27-38.
Firth, D., Glosup, J. and Hinkley, D. V. (1991) Model checking with nonparametric curves. Biometrika 78, 245-252.
Fisher, N. I., Hall, P., Jing, B.-Y. and Wood, A. T. A. (1996) Improved pivotal methods for constructing confidence regions with directional data. Journal of the American Statistical Association 91, 1062-1070.
Fisher, N. I., Lewis, T. and Embleton, B. J. J. (1987) Statistical Analysis of Spherical Data. Cambridge: Cambridge University Press.
Fisher, R. A. (1935) The Design of Experiments. Edinburgh: Oliver and Boyd.
Fisher, R. A. (1947) The analysis of covariance method for the relation between a part and the whole. Biometrics 3, 65-68.
Fleming, T. R. and Harrington, D. P. (1991) Counting Processes and Survival Analysis. New York: Wiley.
Forster, J. J., McDonald, J. W. and Smith, P. W. F. (1996) Monte Carlo exact conditional tests for log-linear and logistic models. Journal of the Royal Statistical Society series B 58, 445-453.
Franke, J. and Härdle, W. (1992) On bootstrapping kernel spectral estimates. Annals of Statistics 20, 121-145.
Freedman, D. A. (1981) Bootstrapping regression models. Annals of Statistics 9, 1218-1228.
Freedman, D. A. (1984) On bootstrapping two-stage least-squares estimates in stationary linear models. Annals of Statistics 12, 827-842.
Freedman, D. A. and Peters, S. C. (1984a) Bootstrapping a regression equation: some empirical results. Journal of the American Statistical Association 79, 97-106.
Freedman, D. A. and Peters, S. C. (1984b) Bootstrapping an econometric model: some empirical results. Journal of Business & Economic Statistics 2, 150-158.
Freeman, D. H. (1987) Applied Categorical Data Analysis. New York: Marcel Dekker.
Frets, G. P. (1921) Heredity of head form in man. Genetica 3, 193-384.
Garcia-Soidan, P. H. and Hall, P. (1997) On sample reuse methods for spatial data. Biometrics 53, 273-281.
Garthwaite, P. H. and Buckland, S. T. (1992) Generating Monte Carlo confidence intervals by the Robbins-Monro process. Applied Statistics 41, 159-171.
Gatto, R. (1994) Saddlepoint methods and nonparametric approximations for econometric models. Ph.D. thesis, Faculty of Economic and Social Sciences, University of Geneva.
Gatto, R. and Ronchetti, E. M. (1996) General saddlepoint approximations of marginal densities and tail probabilities. Journal of the American Statistical Association 91, 666-673.
Geisser, S. (1975) The predictive sample reuse method with applications. Journal of the American Statistical Association 70, 320-328.
Geisser, S. (1993) Predictive Inference: An Introduction. London: Chapman & Hall.
Geyer, C. J. (1991) Constrained maximum likelihood exemplified by isotonic convex logistic regression. Journal of the American Statistical Association 86, 717-724.
Geyer, C. J. (1995) Likelihood ratio tests and inequality constraints. Technical Report 610, School of Statistics, University of Minnesota.
Gigli, A. (1994) Contributions to importance sampling and resampling. Ph.D. thesis, Department of Mathematics, Imperial College, London.
Gilks, W. R., Richardson, S. and Spiegelhalter, D. J. (eds) (1996) Markov Chain Monte Carlo in Practice. London: Chapman & Hall.
Gleason, J. R. (1988) Algorithms for balanced bootstrap simulations. American Statistician 42, 263-266.
Gong, G. (1983) Cross-validation, the jackknife, and the bootstrap: excess error estimation in forward logistic regression. Journal of the American Statistical Association 78, 108-113.
Götze, F. and Künsch, H. R. (1996) Second order correctness of the blockwise bootstrap for stationary observations. Annals of Statistics 24, 1914-1933.
Graham, R. L., Hinkley, D. V., John, P. W. M. and Shi, S. (1990) Balanced design of bootstrap simulations. Journal of the Royal Statistical Society series B 52, 185-202.
Gray, H. L. and Schucany, W. R. (1972) The Generalized Jackknife Statistic. New York: Marcel Dekker.
Green, P. J. and Silverman, B. W. (1994) Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. London: Chapman & Hall.
Gross, S. (1980) Median estimation in sample surveys. In Proceedings of the Section on Survey Research Methods, pp. 181-184. Alexandria, Virginia: American Statistical Association.
Haldane, J. B. S. (1940) The mean and variance of χ², when used as a test of homogeneity, when expectations are small. Biometrika 31, 346-355.
Hall, P. (1985) Resampling a coverage pattern. Stochastic Processes and their Applications 20, 231-246.
Hall, P. (1986) On the bootstrap and confidence intervals. Annals of Statistics 14, 1431-1452.
Hall, P. (1987) On the bootstrap and likelihood-based confidence regions. Biometrika 74, 481-493.
Hall, P. (1988a) Theoretical comparison of bootstrap confidence intervals (with Discussion). Annals of Statistics 16, 927-985.
Hall, P. (1988b) On confidence intervals for spatial parameters estimated from nonreplicated data. Biometrics 44, 271-277.
Hall, P. (1989a) Antithetic resampling for the bootstrap. Biometrika 76, 713-724.
Hall, P. (1989b) Unusual properties of bootstrap confidence intervals in regression problems. Probability Theory and Related Fields 81, 247-273.
Hall, P. (1990) Pseudo-likelihood theory for empirical likelihood. Annals of Statistics 18, 121-140.
Hall, P. (1992a) The Bootstrap and Edgeworth Expansion. New York: Springer.
Hall, P. (1992b) On bootstrap confidence intervals in nonparametric regression. Annals of Statistics 20, 695-711.
Hall, P. (1995) On the biases of error estimators in prediction problems. Statistics and Probability Letters 24, 257-262.
Hall, P., DiCiccio, T. J. and Romano, J. P. (1989) On smoothing and the bootstrap. Annals of Statistics 17, 692-704.
Hall, P. and Horowitz, J. L. (1993) Corrections and blocking rules for the block bootstrap with dependent data. Technical Report SR 11-93, Centre for Mathematics and its Applications, Australian National University.
Hall, P., Horowitz, J. L. and Jing, B.-Y. (1995) On blocking rules for the bootstrap with dependent data. Biometrika 82, 561-574.
Hall, P. and Jing, B.-Y. (1996) On sample reuse methods for dependent data. Journal of the Royal Statistical Society series B 58, 727-737.
Hall, P. and Keenan, D. M. (1989) Bootstrap methods for constructing confidence regions for hands. Communications in Statistics — Stochastic Models 5, 555-562.
Hall, P. and La Scala, B. (1990) Methodology and algorithms of empirical likelihood. International Statistical Review 58, 109-128.
Hall, P. and Martin, M. A. (1988) On bootstrap resampling and iteration. Biometrika 75, 661-671.
Hall, P. and Owen, A. B. (1993) Empirical likelihood confidence bands in density estimation. Journal of Computational and Graphical Statistics 2, 273-289.
Hall, P. and Titterington, D. M. (1989) The effect of simulation order on level accuracy and power of Monte Carlo tests. Journal of the Royal Statistical Society series B 51, 459-467.
Hall, P. and Wilson, S. R. (1991) Two guidelines for bootstrap hypothesis testing. Biometrics 47, 757-762.
Hamilton, M. A. and Collings, B. J. (1991) Determining the appropriate sample size for nonparametric tests for location shift. Technometrics 33, 327-337.
Hammersley, J. M. and Handscomb, D. C. (1964) Monte Carlo Methods. London: Methuen.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. and Stahel, W. A. (1986) Robust Statistics: The Approach Based on Influence Functions. New York: Wiley.
Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J. and Ostrowski, E. (eds) (1994) A Handbook of Small Data Sets. London: Chapman & Hall.
Härdle, W. (1989) Resampling for inference from curves. In Bulletin of the 47th Session of the International Statistical Institute, Paris, August 1989, volume 3, pp. 53-63.
Härdle, W. (1990) Applied Nonparametric Regression. Cambridge: Cambridge University Press.
Härdle, W. and Bowman, A. W. (1988) Bootstrapping in nonparametric regression: local adaptive smoothing and confidence bands. Journal of the American Statistical Association 83, 102-110.
Härdle, W. and Marron, J. S. (1991) Bootstrap simultaneous error bars for nonparametric regression. Annals of Statistics 19, 778-796.
Hartigan, J. A. (1969) Using subsample values as typical values. Journal of the American Statistical Association 64, 1303-1317.
Hartigan, J. A. (1971) Error analysis by replaced samples. Journal of the Royal Statistical Society series B 33, 98-110.
Hartigan, J. A. (1975) Necessary and sufficient conditions for asymptotic joint normality of a statistic and its subsample values. Annals of Statistics 3, 573-580.
Hartigan, J. A. (1990) Perturbed periodogram estimates of variance. International Statistical Review 58, 1-7.
Hastie, T. J. and Loader, C. (1993) Local regression: automatic kernel carpentry (with Discussion). Statistical Science 8, 120-143.
Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models. London: Chapman & Hall.
Hayes, K. G., Perl, M. L. and Efron, B. (1989) Application of the bootstrap statistical method to the tau-decay-mode problem. Physical Review Series D 39, 274-279.
Heller, G. and Venkatraman, E. S. (1996) Resampling procedures to compare two survival distributions in the presence of right-censored data. Biometrics 52, 1204-1213.
Hesterberg, T. C. (1988) Advances in importance sampling. Ph.D. thesis, Department of Statistics, Stanford University, California.
Hesterberg, T. C. (1995a) Tail-specific linear approximations for efficient bootstrap simulations. Journal of Computational and Graphical Statistics 4, 113-133.
Hesterberg, T. C. (1995b) Weighted average importance sampling and defensive mixture distributions. Technometrics 37, 185-194.
Hinkley, D. V. (1977) Jackknifing in unbalanced situations. Technometrics 19, 285-292.
Hinkley, D. V. and Schechtman, E. (1987) Conditional bootstrap methods in the mean-shift model. Biometrika 74, 85-93.
Hinkley, D. V. and Shi, S. (1989) Importance sampling and the nested bootstrap. Biometrika 76, 435-446.
Hinkley, D. V. and Wang, S. (1991) Efficiency of robust standard errors for regression coefficients. Communications in Statistics — Theory and Methods 20, 1-11.
Hinkley, D. V. and Wei, B. C. (1984) Improvements of jackknife confidence limit methods. Biometrika 71, 331-339.
Hirose, H. (1993) Estimation of threshold stress in accelerated life-testing. IEEE Transactions on Reliability 42, 650-657.
Hjort, N. L. (1985) Bootstrapping Cox's regression model. Technical Report NSF-241, Department of Statistics, Stanford University.
Hjort, N. L. (1992) On inference in parametric survival data models. International Statistical Review 60, 355-387.
Horváth, L. and Yandell, B. S. (1987) Convergence rates for the bootstrapped product-limit process. Annals of Statistics 15, 1155-1173.
Hosmer, D. W. and Lemeshow, S. (1989) Applied Logistic Regression. New York: Wiley.
Hu, F. and Zidek, J. V. (1995) A bootstrap based on the estimating equations of the linear model. Biometrika 82, 263-275.
Huet, S., Jolivet, E. and Messean, A. (1990) Some simulations results about confidence intervals and bootstrap methods in nonlinear regression. Statistics 3, 369-432.
Hyde, J. (1980) Survival analysis with incomplete observations. In Biostatistics Casebook, eds R. G. Miller, B. Efron, B. W. Brown and L. E. Moses, pp. 31-46. New York: Wiley.
Janas, D. (1993) Bootstrap Procedures for Time Series. Aachen: Verlag Shaker.
Jennison, C. (1992) Bootstrap tests and confidence intervals for a hazard ratio when the number of observed failures is small, with applications to group sequential survival studies. In Computer Science and Statistics: Proceedings of the 22nd Symposium on the Interface, eds C. Page and R. LePage, pp. 89-97. New York: Springer.
Jensen, J. L. (1992) The modified signed likelihood statistic and saddlepoint approximations. Biometrika 79, 693-703.
Jensen, J. L. (1995) Saddlepoint Approximations. Oxford: Clarendon Press.
Jeong, J. and Maddala, G. S. (1993) A perspective on application of bootstrap methods in econometrics. In Handbook of Statistics, vol. 11: Econometrics, eds G. S. Maddala, C. R. Rao and H. D. Vinod, pp. 573-610. Amsterdam: North-Holland.
Jing, B.-Y. and Robinson, J. (1994) Saddlepoint approximations for marginal and conditional probabilities of transformed variables. Annals of Statistics 22, 1115-1132.
Jing, B.-Y. and Wood, A. T. A. (1996) Exponential empirical likelihood is not Bartlett correctable. Annals of Statistics 24, 365-369.
Jöckel, K.-H. (1986) Finite sample properties and asymptotic efficiency of Monte Carlo tests. Annals of Statistics 14, 336-347.
Johns, M. V. (1988) Importance sampling for bootstrap confidence intervals. Journal of the American Statistical Association 83, 709-714.
Journel, A. G. (1994) Resampling from stochastic simulations (with Discussion). Environmental and Ecological Statistics 1, 63-91.
Kabaila, P. (1993a) Some properties of profile bootstrap confidence intervals. Australian Journal of Statistics 35, 205-214.
Kabaila, P. (1993b) On bootstrap predictive inference for autoregressive processes. Journal of Time Series Analysis 14, 473-484.
Kalbfleisch, J. D. and Prentice, R. L. (1980) The Statistical Analysis of Failure Time Data. New York: Wiley.
Kaplan, E. L. and Meier, P. (1958) Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53, 457-481.
Karr, A. F. (1991) Point Processes and their Statistical Inference. Second edition. New York: Marcel Dekker.
Katz, R. (1995) Spatial analysis of pore images. Ph.D. thesis, Department of Statistics, University of Oxford.
Kendall, D. G. and Kendall, W. S. (1980) Alignments in two-dimensional random sets of points. Advances in Applied Probability 12, 380-424.
Kim, J.-H. (1990) Conditional bootstrap methods for censored data. Ph.D. thesis, Department of Statistics, Florida State University.
Künsch, H. R. (1989) The jackknife and bootstrap for general stationary observations. Annals of Statistics 17, 1217-1241.
Lahiri, S. N. (1991) Second-order optimality of stationary bootstrap. Statistics and Probability Letters 11, 335-341.
Lahiri, S. N. (1995) On the asymptotic behaviour of the moving block bootstrap for normalized sums of heavy-tail random variables. Annals of Statistics 23, 1331-1349.
Laird, N. M. (1978) Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association 73, 805-811.
Laird, N. M. and Louis, T. A. (1987) Empirical Bayes confidence intervals based on bootstrap samples (with Discussion). Journal of the American Statistical Association 82, 739-757.
Lawson, A. B. (1993) On the analysis of mortality events associated with a prespecified fixed point. Journal of the Royal Statistical Society series A 156, 363-377.
Lee, S. M. S. and Young, G. A. (1995) Asymptotic iterated bootstrap confidence intervals. Annals of Statistics 23, 1301-1330.
Léger, C., Politis, D. N. and Romano, J. P. (1992) Bootstrap technology and applications. Technometrics 34, 378-398.
Léger, C. and Romano, J. P. (1990a) Bootstrap choice of tuning parameters. Annals of the Institute of Statistical Mathematics 42, 709-735.
Léger, C. and Romano, J. P. (1990b) Bootstrap adaptive estimation: the trimmed mean example. Canadian Journal of Statistics 18, 297-314.
Lehmann, E. L. (1986) Testing Statistical Hypotheses. Second edition. New York: Wiley.
Li, G. (1995) Nonparametric likelihood ratio estimation of probabilities for truncated data. Journal of the American Statistical Association 90, 997-1003.
Li, H. and Maddala, G. S. (1996) Bootstrapping time series models (with Discussion). Econometric Reviews 15, 115-195.
Li, K.-C. (1987) Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set. Annals of Statistics 15, 958-975.
Liu, R. Y. and Singh, K. (1992a) Moving blocks jackknife and bootstrap capture weak dependence. In Exploring the Limits of Bootstrap, eds R. LePage and L. Billard, pp. 225-248. New York: Wiley.
Liu, R. Y. and Singh, K. (1992b) Efficiency and robustness in resampling. Annals of Statistics 20, 370-384.
Lloyd, C. J. (1994) Approximate pivots from M-estimators. Statistica Sinica 4, 701-714.
Lo, S.-H. and Singh, K. (1986) The product-limit estimator and the bootstrap: some asymptotic representations. Probability Theory and Related Fields 71, 455-465.
Loh, W.-Y. (1987) Calibrating confidence coefficients. Journal of the American Statistical Association 82, 155-162.
Mallows, C. L. (1973) Some comments on Cp. Technometrics 15, 661-675.
Mammen, E. (1989) Asymptotics with increasing dimension for robust regression with applications to the bootstrap. Annals of Statistics 17, 382-400.
Mammen, E. (1992) When Does Bootstrap Work? Asymptotic Results and Simulations. Volume 77 of Lecture Notes in Statistics. New York: Springer.
Mammen, E. (1993) Bootstrap and wild bootstrap for high dimensional linear models. Annals of Statistics 21, 255-285.
Manly, B. F. J. (1991) Randomization and Monte Carlo Methods in Biology. London: Chapman & Hall.
Marriott, F. H. C. (1979) Barnard's Monte Carlo tests: how many simulations? Applied Statistics 28, 75-77.
McCarthy, P. J. (1969) Pseudo-replication: half samples. Review of the International Statistical Institute 37, 239-264.
McCarthy, P. J. and Snowden, C. B. (1985) The Bootstrap and Finite Population Sampling. Vital and Public Health Statistics (Ser. 2, No. 95), Public Health Service Publication 85-1369. Washington, DC: United States Government Printing Office.
McCullagh, P. (1987) Tensor Methods in Statistics. London: Chapman & Hall.
McCullagh, P. and Nelder, J. A. (1989) Generalized Linear Models. Second edition. London: Chapman & Hall.
McKay, M. D., Beckman, R. J. and Conover, W. J. (1979) A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21, 239-245.
McKean, J. W., Sheather, S. J. and Hettmansperger, T. P. (1993) The use and interpretation of residuals based on robust estimation. Journal of the American Statistical Association 88, 1254-1263.
McLachlan, G. J. (1992) Discriminant Analysis and Statistical Pattern Recognition. New York: Wiley.
Milan, L. and Whittaker, J. (1995) Application of the parametric bootstrap to models that incorporate a singular value decomposition. Applied Statistics 44, 31-49.
Miller, R. G. (1974) The jackknife — a review. Biometrika 61, 1-15.
Miller, R. G. (1981) Survival Analysis. New York: Wiley.
Monti, A. C. (1997) Empirical likelihood confidence regions in time series models. Biometrika 84, 395-405.
Morgenthaler, S. and Tukey, J. W. (eds) (1991) Configural Polysampling: A Route to Practical Robustness. New York: Wiley.
Moulton, L. H. and Zeger, S. L. (1989) Analyzing repeated measures on generalized linear models via the bootstrap. Biometrics 45, 381-394.
Moulton, L. H. and Zeger, S. L. (1991) Bootstrapping generalized linear models. Computational Statistics and Data Analysis 11, 53-63.
Muirhead, C. R. and Darby, S. C. (1989) Royal Statistical Society meeting on cancer near nuclear installations. Journal of the Royal Statistical Society series A 152, 305-384.
Murphy, S. A. (1995) Likelihood-based confidence intervals in survival analysis. Journal of the American Statistical Association 90, 1399-1405.
Mykland, P. A. (1995) Dual likelihood. Annals of Statistics 23, 396-421.
Nelder, J. A. and Pregibon, D. (1987) An extended quasi-likelihood function. Biometrika 74, 221-232.
Newton, M. A. and Geyer, C. J. (1994) Bootstrap recycling: a Monte Carlo alternative to the nested bootstrap. Journal of the American Statistical Association 89, 905-912.
Newton, M. A. and Raftery, A. E. (1994) Approximate Bayesian inference with the weighted likelihood bootstrap (with Discussion). Journal of the Royal Statistical Society series B 56, 3-48.
Niederreiter, H. (1992) Random Number Generation and Quasi-Monte Carlo Methods. Number 63 in CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia: SIAM.
Nordgaard, A. (1990) On the resampling of stochastic processes using a bootstrap approach. Ph.D. thesis, Department of Mathematics, Linköping University, Sweden.
Noreen, E. W. (1989) Computer Intensive Methods for Testing Hypotheses: An Introduction. New York: Wiley.
Ogbonmwan, S.-M. (1985) Accelerated resampling codes with application to likelihood. Ph.D. thesis, Department of Mathematics, Imperial College, London.
Ogbonmwan, S.-M. and Wynn, H. P. (1986) Accelerated resampling codes with low discrepancy. Preprint, Department of Statistics and Actuarial Science, The City University.
Olshen, R. A., Biden, E. N., Wyatt, M. P. and Sutherland, D. H. (1989) Gait analysis and the bootstrap. Annals of Statistics 17, 1419-1440.
Owen, A. B. (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75, 237-249.
Owen, A. B. (1990) Empirical likelihood ratio confidence regions. Annals of Statistics 18, 90-120.
Owen, A. B. (1991) Empirical likelihood for linear models. Annals of Statistics 19, 1725-1747.
Owen, A. B. (1992a) Empirical likelihood and small samples. In Computer Science and Statistics: Proceedings of the 22nd Symposium on the Interface, eds C. Page and R. LePage, pp. 79-88. New York: Springer.
Owen, A. B. (1992b) A central limit theorem for Latin hypercube sampling. Journal of the Royal Statistical Society series B 54, 541-551.
Parzen, M. I., Wei, L. J. and Ying, Z. (1994) A resampling method based on pivotal estimating functions. Biometrika 81, 341-350.
Paulsen, O. and Heggelund, P. (1994) The quantal size at retinogeniculate synapses determined from spontaneous and evoked EPSCs in guinea-pig thalamic slices. Journal of Physiology 480, 505-511.
Percival, D. B. and Walden, A. T. (1993) Spectral Analysis for Physical Applications: Multitaper and Conventional Univariate Techniques. Cambridge: Cambridge University Press.
Pitman, E. J. G. (1937a) Significance tests which may be applied to samples from any populations. Journal of the Royal Statistical Society, Supplement 4, 119-130.
Pitman, E. J. G. (1937b) Significance tests which may be applied to samples from any populations: II. The correlation coefficient test. Journal of the Royal Statistical Society, Supplement 4, 225-232.
Pitman, E. J. G. (1937c) Significance tests which may be applied to samples from any populations: III. The analysis of variance test. Biometrika 29, 322-335.
Plackett, R. L. and Burman, J. P. (1946) The design of optimum multifactorial experiments. Biometrika 33, 305-325.
Politis, D. N. and Romano, J. P. (1993) Nonparametric resampling for homogeneous strong mixing random fields. Journal of Multivariate Analysis 47, 301-328.
Politis, D. N. and Romano, J. P. (1994a) The stationary bootstrap. Journal of the American Statistical Association 89, 1303-1313.
Politis, D. N. and Romano, J. P. (1994b) Large sample confidence regions based on subsamples under minimal assumptions. Annals of Statistics 22, 2031-2050.
Possolo, A. (1986) Subsampling a random field. Technical Report 78, Department of Statistics, University of Washington, Seattle.
Presnell, B. and Booth, J. G. (1994) Resampling methods for sample surveys. Technical Report 470, Department of Statistics, University of Florida, Gainesville.
Priestley, M. B. (1981) Spectral Analysis and Time Series. London: Academic Press.
Proschan, F. (1963) Theoretical explanation of observed decreasing failure rate. Technometrics 5, 375-383.
Qin, J. (1993) Empirical likelihood in biased sample problems. Annals of Statistics 21, 1182-1196.
Qin, J. and Lawless, J. (1994) Empirical likelihood and general estimating equations. Annals of Statistics 22, 300-325.
Quenouille, M. H. (1949) Approximate tests of correlation in time-series. Journal of the Royal Statistical Society series B 11, 68-84.
Rao, J. N. K. and Wu, C. F. J. (1988) Resampling inference with complex survey data. Journal of the American Statistical Association 83, 231-241.
Rawlings, J. O. (1988) Applied Regression Analysis: A Research Tool. Pacific Grove, California: Wadsworth & Brooks/Cole.
Reid, N. (1981) Estimating the median survival time. Biometrika 68, 601-608.
Reid, N. (1988) Saddlepoint methods and statistical inference (with Discussion). Statistical Science 3, 213-238.
Reynolds, P. S. (1994) Time-series analyses of beaver body temperatures. In Case Studies in Biometry, eds N. Lange, L. Ryan, L. Billard, D. R. Brillinger, L. Conquest and J. Greenhouse, pp. 211-228. New York: Wiley.
Ripley, B. D. (1977) Modelling spatial patterns (with Discussion). Journal of the Royal Statistical Society series B 39, 172-212.
Ripley, B. D. (1981) Spatial Statistics. New York: Wiley.
Ripley, B. D. (1987) Stochastic Simulation. New York: Wiley.
Ripley, B. D. (1988) Statistical Inference for Spatial Processes. Cambridge: Cambridge University Press.
Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.
Robinson, J. (1982) Saddlepoint approximations for permutation tests and confidence intervals. Journal of the Royal Statistical Society series B 44, 91-101.
Romano, J. P. (1988) Bootstrapping the mode. Annals of the Institute of Statistical Mathematics 40, 565-586.
Romano, J. P. (1989) Bootstrap and randomization tests of some nonparametric hypotheses. Annals of Statistics 17, 141-159.
Romano, J. P. (1990) On the behaviour of randomization tests without a group invariance assumption. Journal of the American Statistical Association 85, 686-692.
Rousseeuw, P. J. and Leroy, A. M. (1987) Robust Regression and Outlier Detection. New York: Wiley.
Royall, R. M. (1986) Model robust confidence intervals using maximum likelihood estimators. International Statistical Review 54, 221-226.
Rubin, D. B. (1981) The Bayesian bootstrap. Annals of Statistics 9, 130-134.
Rubin, D. B. (1987) Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
Rubin, D. B. and Schenker, N. (1986) Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. Journal of the American Statistical Association 81, 366-374.
Ruppert, D. and Carroll, R. J. (1980) Trimmed least squares estimation in the linear model. Journal of the American Statistical Association 75, 828-838.
Samawi, H. M. (1994) Power estimation for two-sample tests using importance and antithetic resampling. Ph.D. thesis, Department of Statistics and Actuarial Science, University of Iowa, Ames.
Sauerbrei, W. and Schumacher, M. (1992) A bootstrap resampling procedure for model building: application to the Cox regression model. Statistics in Medicine 11, 2093-2109.
Schenker, N. (1985) Qualms about bootstrap confidence intervals. Journal of the American Statistical Association 80, 360-361.
Seber, G. A. F. (1977) Linear Regression Analysis. New York: Wiley.
Shao, J. (1988) On resampling methods for variance and bias estimation in linear models. Annals of Statistics 16, 986-1008.
Shao, J. (1993) Linear model selection by cross-validation. Journal of the American Statistical Association 88, 486-494.
Shao, J. (1996) Bootstrap model selection. Journal of the American Statistical Association 91, 655-665.
Shao, J. and Tu, D. (1995) The Jackknife and Bootstrap. New York: Springer.
Shao, J. and Wu, C. F. J. (1989) A general theory for jackknife variance estimation. Annals of Statistics 17, 1176-1197.
Shorack, G. (1982) Bootstrapping robust regression. Communications in Statistics — Theory and Methods 11, 961-972.
Silverman, B. W. (1981) Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society series B 43, 97-99.
Silverman, B. W. (1985) Some aspects of the spline smoothing approach to non-parametric regression curve fitting (with Discussion). Journal of the Royal Statistical Society series B 47, 1-52.
Silverman, B. W. and Young, G. A. (1987) The bootstrap: to smooth or not to smooth? Biometrika 74, 469-479.
Simonoff, J. S. and Tsai, C.-L. (1994) Use of modified profile likelihood for improved tests of constancy of variance in regression. Applied Statistics 43, 357-370.
Singh, K. (1981) On the asymptotic accuracy of Efron's bootstrap. Annals of Statistics 9, 1187-1195.
Sitter, R. R. (1992) A resampling procedure for complex survey data. Journal of the American Statistical Association 87, 755-765.
Smith, P. W. F., Forster, J. J. and McDonald, J. W. (1996) Monte Carlo exact tests for square contingency tables. Journal of the Royal Statistical Society series A 159, 309-321.
Spady, R. H. (1991) Saddlepoint approximations for regression models. Biometrika 78, 879-889.
St. Laurent, R. T. and Cook, R. D. (1993) Leverage, local influence, and curvature in nonlinear regression. Biometrika 80, 99-106.
Stangenhaus, G. (1987) Bootstrap and inference procedures for L1 regression. In Statistical Data Analysis Based on the L1-Norm and Related Methods, ed. Y. Dodge, pp. 323-332. Amsterdam: North-Holland.
Stein, C. M. (1985) On the coverage probability of confidence sets based on a prior distribution. Volume 16 of Banach Centre Publications. Warsaw: PWN — Polish Scientific Publishers.
Stein, M. (1987) Large sample properties of simulations using Latin hypercube sampling. Technometrics 29, 143-151.
Sternberg, H. O'R. (1987) Aggravation of floods in the Amazon River as a consequence of deforestation? Geografiska Annaler 69A, 201-219.
Sternberg, H. O'R. (1995) Water and wetlands of Brazilian Amazonia: an uncertain future. In The Fragile Tropics of Latin America: Sustainable Management of Changing Environments, eds T. Nishizawa and J. I. Uitto, pp. 113-179. Tokyo: United Nations University Press.
Stine, R. A. (1985) Bootstrap prediction intervals for regression. Journal of the American Statistical Association 80, 1026-1031.
Stoffer, D. S. and Wall, K. D. (1991) Bootstrapping state-space models: Gaussian maximum likelihood estimation and the Kalman filter. Journal of the American Statistical Association 86, 1024-1033.
Stone, M. (1974) Cross-validatory choice and assessment of statistical predictions (with Discussion). Journal of the Royal Statistical Society series B 36, 111-147.
Stone, M. (1977) An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society series B 39, 44-47.
Swanepoel, J. W. H. and van Wyk, J. W. J. (1986) The bootstrap applied to power spectral density function estimation. Biometrika 73, 135-141.
Bibliography
Tanner, M. A. (1996) Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. Third edition. N e w York:
Springer.
567 Wang, S. (1992) General saddlepoint approximations in the bootstrap. Statistics and Probability Letters 13, 61-66. Wang, S. (1993a) Saddlepoint expansions in finite population problems. Biometrika 80, 583-590.
Tanner, M. A. and Wong, W. H. (1987) The calculation of posterior densities by data augmentation (with Discussion). Journal of the American Statistical Association 82, 528-550.
Wang, S. (1993b) Saddlepoint methods for bootstrap confidence bands in nonparametric regression. Australian Journal of Statistics 35, 93-101.
Theiler, J., Galdrikian, B., Longtin, A., Eubank, S. and Farmer, J. D. (1992) Using surrogate data to detect nonlinearity in time series. In Nonlinear Modeling and Forecasting, eds M. Casdagli and S. Eubank, number XII in Santa Fe Institute Studies in the Sciences of Complexity, pp. 163-188. N e w York: Addison-Wesley.
Weisberg, S. (1985) Applied Linear Regression. Second edition. N e w York: Wiley.
Therneau, T. (1983) Variance reduction techniques for the bootstrap. Ph.D. thesis, Department of Statistics, Stanford University, California. Tibshirani, R. J. (1988) Variance stabilization and the bootstrap. Biometrika 75, 433-444. Tong, H. (1990) Non-linear Time Series: A Dynamical System Approach. Oxford: Clarendon Press. Tsay, R. S. (1992) Model checking via parametric bootstraps in time series. Applied Statistics 41, 1-15. Tukey, J. W. (1958) Bias and confidence in not quite large samples (Abstract). Annals of Mathematical Statistics 29, 614. Venables, W. N. and Ripley, B. D. (1994) Modern Applied Statistics with S-Plus. N e w York: Springer. Ventura, V. (1997) Likelihood inference by Monte Carlo methods and efficient nested bootstrapping. D.Phil. thesis, Department of Statistics, University of Oxford. Ventura, V., Davison, A. C. and Boniface, S. J. (1997) Statistical inference for the effect of magnetic brain stimulation on a motoneurone. Applied Statistics 46, to appear. Wahrendorf, J., Becher, H. and Brown, C. C. (1987) Bootstrap comparison of non-nested generalized linear models: applications in survival analysis and epidemiology. Applied Statistics 36, 72-81. Wand, M. P. and Jones, M. C. (1995) Kernel Smoothing. London: Chapman & Hall. Wang, S. (1990) Saddlepoint approximations in resampling analysis. Annals of the Institute of Statistical Mathematics 42, 115-131.
Wang, S. (1995) Optimizing the smoothed bootstrap. Annals of the Institute of Statistical Mathematics 47, 65-80.
Welch, B. L. and Peers, H. W. (1963) O n formulae for confidence points based on integrals of weighted likelihoods. Journal of the Royal Statistical Society series B 25, 318-329. Welch, W. J. (1990) Construction of permutation tests. Journal of the American Statistical Association 85, 693-698. Welch, W. J. and Fahey, T. J. (1994) Correcting for covariates in permutation tests. Technical Report STAT-94-12, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario. Westfall, P. H. and Young, S. S. (1993) Resampling-Based Multiple Testing: Examples and Methods for p-value Adjustment. N e w York: Wiley.
Woods, H., Steinour, H. H. and Starke, H. R. (1932) Effect of composition of Portland cement on heat evolved during hardening. Industrial Engineering and Chemistry 24, 1207-1214. Wu, C. J. F. (1986) Jackknife, bootstrap and other resampling methods in regression analysis (with Discussion). Annals of Statistics 14, 1261-1350. Wu, C. J. F. (1990) O n the asymptotic properties of the jackknife histogram. Annals of Statistics 18, 1438-1452. Wu, C. J. F. (1991) Balanced repeated replications based on mixed orthogonal arrays. Biometrika 78, 181-188. Young, G. A. (1986) Conditioned data-based simulations: Some examples from geometrical statistics. International Statistical Review 54, 1-13. Young, G. A. (1990) Alternative smoothed bootstraps. Journal of the Royal Statistical Society series B 52, 477-484. Young, G. A. and Daniels, H. E. (1990) Bootstrap bias. Biometrika 77, 179-185.
Name Index
Abelson, R. P. 403 Akaike, H. 316 Akritas, M. G. 124 Altman, D. G. 375 Amis, G. 253 Andersen, P. K. 124, 128, 353, 375 Andrews, D. F. 360 Appleyard, S. T. 417 Athreya, K. B. 60 Atkinson, A. C. 183, 315, 325 Bai, C. 315 Bai, Z. D. 427 Bailer, A. J. 384 Banks, D. L. 515 Barbe, P. 60, 516 Barnard, G. A. 183 Barndorff-Nielsen, O. E. 183, 246, 486, 514 Becher, H. 379 Beckman, R. J. 486 Benes, F. M. 428 Beran, J. 426 Beran, R. J. 125, 183, 184, 187, 246, 250, 315 Berger, J. O. 515 Bernardo, J. M. 515 Bertail, P. 60, 516 Besag, J. E. 183, 184, 185 Bickel, P. J. 60, 123, 125, 129, 315, 487, 494 Biden, E. N. 316 Bissell, A. F. 253, 383, 497 Bithell, J. F. 428 Bloomfield, P. 426 Boniface, S. J. 418, 428 Boos, D. D. 515 Booth, J. G. 125, 129, 246, 247, 251, 374, 486, 487, 488, 491, 493
Borgan, Ø. 124, 128, 353 Bose, A. S. 427 Bowman, A. W. 375 Box, G. E. P. 323 Bratley, P. 486 Braun, W. J. 427, 430 Breiman, L. 316 Breslow, N. 378 Bretagnolle, J. 60 Brillinger, D. R. x, 388, 426, 427 Brockwell, P. J. 426, 427 Brown, B. W. 382 Brown, C. C. 379 Buckland, S. T. 246 Bühlmann, P. 427 Bunke, O. 316 Burman, J. P. 60 Burman, P. 316, 321 Burr, D. 124, 133, 374 Burns, E. 300 Butler, R. W. 125, 129, 487, 493 Canty, A. J. x, 135, 246 Carlstein, E. 427 Carpenter, J. R. 246, 250 Carroll, R. J. 310, 325 Chambers, J. M. 374, 375 Chao, M.-T. 125 Chapman, P. 60, 125 Chen, C. 427 Chen, C.-H. 375 Chen, S. X. 169, 514, 515 Chen, Z. 487 Claridge, G. 157 Clifford, P. 183, 184, 185 Cobb, G. W. 241 Cochran, W. G. 7, 125 Collings, B. J. 184
Conover, W. J. 486 Cook, R. D. 125, 315, 316, 375 Corcoran, S. A. 515 Cowling, A. 428, 432, 436 Cox, D. R. 124, 128, 183, 246, 287, 323, 324, 428, 486, 514 Cressie, N. A. C. 72, 428 Dahlhaus, R. 427, 431 Daley, D. J. 428 Daly, F. 68, 182, 436, 520 Daniels, H. E. 59, 486, 492 Darby, S. C. 428 Davis, R. A. 426, 427 Davison, A. C. 66, 135, 246, 316, 374, 427, 428, 486, 487, 492, 493, 515, 517, 518 De Angelis, D. 2, 60, 124, 316, 343 Demétrio, C. G. B. 338 Dempster, A. P. 124 Diaconis, P. 60, 486 DiCiccio, T. J. 68, 124, 246, 252, 253, 487, 493, 515, 516 Diggle, P. J. 183, 392, 423, 426, 428 Do, K.-A. 486, 487 Dobson, A. J. 374 Donegani, M. 184, 187 Doss, H. 124, 374 Draper, N. R. 315 Droge, B. 316 Dubowitz, V. 417 Ducharme, G. R. 126 Easton, G. S. 487 Efron, B. ix, 59, 60, 61, 66, 68, 123, 124, 125, 128, 130, 132, 133, 134, 183, 186, 246, 249, 252, 253, 308, 315, 316, 375, 427, 486, 488, 515 Embleton, B. J. J. 236, 506 Eubank, S. 427, 430, 435 Fahey, T. J. 185
Fang, K.-T. 486 Faraway, J. J. 125, 375 Farmer, J. D. 427, 430, 435 Feigl, P. 328 Feller, W. 320 Fernholz, L. T. 60 Ferretti, N. 427 Field, C. 486 Firth, D. 374, 377, 383 Fisher, N. I. 236, 506, 515, 517 Fisher, R. A. 183, 186, 322 Fleming, T. R. 124 Forster, J. J. 183, 184 Fox, B. L. 486 Franke, J. 427 Freedman, D. A. 60, 125, 129, 315, 427 Freeman, D. H. 378 Frets, G. P. 115 Friedman, J. H. 316 Galdrikian, B. 427, 430, 435 Garcia-Soidan, P. H. 428 Garthwaite, P. H. 246 Gatto, R. x, 487 Geisser, S. 247, 316 George, S. L. 375 Geyer, C. J. 178, 183, 372, 486 Gigli, A. 486 Gilks, W. R. 2, 183, 343 Gill, R. D. 124, 128, 353 Gleason, J. R. 486, 488 Glosup, J. 383 Gong, G. 375 Götze, F. 60, 427 Graham, R. L. 486, 489 Gray, H. L. 59 Green, P. J. 375 Gross, S. 125 Haldane, J. B. S. 487 Hall, P. ix, x, 59, 60, 62, 124, 125, 129, 183, 246, 247, 248, 251, 315, 316, 321, 375, 378, 379, 427, 428, 429, 432, 436, 486, 487, 488, 491, 493, 514, 515, 516, 517 Halloran, M. E. 246 Hamilton, M. A. 184
Hammersley, J. M. 486 Hampel, F. R. 60 Hand, D. J. 68, 182, 436, 520 Handscomb, D. C. 486 Härdle, W. 316, 375, 427 Harrington, D. P. 124 Hartigan, J. A. 59, 60, 427, 430 Hastie, T. J. 374, 375 Hawkins, D. M. 316 Hayes, K. G. 123 Heggelund, P. 189 Heller, G. 374 Herzberg, A. M. 360 Hesterberg, T. C. 60, 66, 486, 490, 491 Hettmansperger, T. P. 316 Hinkley, D. V. 60, 63, 66, 125, 135, 183, 246, 247, 250, 318, 383, 486, 487, 489, 490, 492, 493, 515, 517, 518 Hirose, H. 347, 381 Hjort, N. L. 124, 374 Holmes, S. 60, 246, 486 Horowitz, J. L. 427, 429 Horvath, L. 374 Hosmer, D. W. 361 Hu, F. 318 Huet, S. 375 Hyde, J. 131 Isham, V. 428 Janas, D. 427, 431 Jennison, C. 183, 184, 246 Jensen, J. L. 486 Jeong, J. 315 Jhun, M. 126 Jing, B.-Y. 427, 429, 487, 515, 517 Jöckel, K.-H. 183 John, P. W. M. 486, 489 Johns, M. V. 486, 490 Jolivet, E. 375 Jones, M. C. x, 128 Journel, A. G. 428 Kabaila, P. 246, 250, 427 Kalbfleisch, J. D. 124 Kaplan, E. L. 124 Karr, A. F. 428
Katz, R. 282 Keenan, D. M. 428 Keiding, N. 124, 128, 353 Kendall, D. G. 124 Kendall, W. S. 124 Kim, J.-H. 124 Klaassen, C. A. J. 123 Kulperger, R. J. 427, 430 Künsch, H. R. 427 Lahiri, S. N. 427 Laird, N. M. 124, 125 Lange, N. 428 La Scala, B. 514 Lawless, J. 514 Lawson, A. B. 428 Lee, S. M. S. 246 Leger, C. 125 Lehmann, E. L. 183 Lemeshow, S. 361 Leroy, A. M. 315 Lewis, P. A. W. 428 Lewis, T. 236, 506 Li, G. 514 Li, H. 427 Li, K.-C. 316 Liu, R. Y. 315, 427 Lloyd, C. J. 515 Lo, S.-H. 125, 374 Loader, C. 375 Loh, W.-Y. 183, 246 Longtin, A. 427, 430, 435 Louis, T. A. 125 Lunn, A. D. 68, 182, 436, 520 Maddala, G. S. 315, 427 Mallows, C. L. 316 Mammen, E. 60, 315, 316 Manly, B. F. J. 183 Marriott, F. H. C. 183 Marron, J. S. 375 Martin, M. A. 125, 183, 246, 251, 487, 493 McCarthy, P. J. 59, 60, 125 McConway, K. J. 68, 182, 436, 520 McCullagh, P. 66, 374, 553 McDonald, J. W. 183, 184
McKay, M. D. 486 McKean, J. W. 316 McLachlan, G. J. 375 Meier, P. 124 Messean, A. 375 Milan, L. 125 Miller, R. G. 59, 84 Monahan, J. F. 515 Monti, A. C. 514 Morgenthaler, S. 486 Moulton, L. H. 374, 376, 377 Muirhead, C. R. 428 Murphy, S. A. 515 Mykland, P. A. 515 Nelder, J. A. 374 Newton, M. A. 178, 183, 486, 515 Niederreiter, H. 486 Nordgaard, A. 427 Noreen, E. W. 184 Oakes, D. 124, 128 Ogbonmwan, S.-M. 486, 515 Olshen, R. A. 315, 316 Oris, J. T. 384 Ostrowski, E. 68, 182, 436, 520 Owen, A. B. 486, 514, 515, 550 Parzen, M. I. 250 Paulsen, O. 189 Peers, H. W. 515 Percival, D. B. 426 Perl, M. L. 123 Peters, S. C. 315, 427 Phillips, M. J. 428, 432, 436 Pitman, E. J. G. 183 Plackett, R. L. 60 Politis, D. N. 60, 125, 427, 429 Possolo, A. 428 Pregibon, D. 374 Prentice, R. L. 124 Presnell, B. 125, 129 Priestley, M. B. 426 Proschan, F. 4, 218 Qin, J. 514 Quenouille, M. H. 59 Raftery, A. E. 515 Rao, J. N. K. 125, 130
Rawlings, J. O. 356 Reid, N. 124, 486 Reynolds, P. S. 435 Richardson, S. 183 Ripley, B. D. x, 183, 282, 315, 316, 361, 374, 375, 417, 428, 486 Ritov, Y. 123 Robinson, J. 486, 487 Romano, J. P. 60, 124, 125, 126, 183, 246, 427, 429, 515, 516 Romo, J. 427 Ronchetti, E. M. 60, 486, 487 Rousseeuw, P. J. 60, 315 Rowlingson, B. S. 428 Royall, R. M. 63 Rubin, D. B. 124, 125, 515 Ruppert, D. 310, 325 St. Laurent, R. T. 375 Samawi, H. M. 184 Sauerbrei, W. 375 Schechtman, E. 66, 247, 486, 487 Schenker, N. 246, 515 Schrage, L. E. 486 Schucany, W. R. 59 Schumacher, M. 375 Seber, G. A. F. 315 Shao, J. 60, 125, 246, 315, 316, 375, 376 Sheather, S. J. 316 Shi, S. 183, 246, 250, 486, 489, 490 Shorack, G. 316 Shotton, D. M. 417 Silverman, B. W. 124, 128, 189, 363, 375 Simonoff, J. S. 269 Singh, K. 246, 315, 374, 427 Sitter, R. R. 125, 129 Smith, H. 315 Smith, P. W. F. 183, 184 Snell, E. J. 287, 324, 374 Snowden, C. B. 125 Spady, R. H. 487, 515 Spiegelhalter, D. J. 183 Stahel, W. A. 60 Stangenhaus, G. 316 Starke, H. R. 277 Stein, C. M. 60, 515 Stein, M. 486 Steinour, H. H. 277 Sternberg, H. O'R. x, 388, 389, 427 Stine, R. A. 315 Stoffer, D. S. 427 Stone, C. J. 316 Stone, M. 316 Stone, R. A. 428 Sutherland, D. H. 316 Swanepoel, J. W. H. 427 Tanner, M. A. 124, 125 Theiler, J. 427, 430, 435 Therneau, T. 486 Tibshirani, R. J. ix, 60, 125, 246, 316, 375, 427, 515 Titterington, D. M. 183 Tong, H. 394, 426 Truong, K. N. 126 Tsai, C.-L. 269, 375 Tsay, R. S. 427 Tu, D. 60, 125, 246, 376 Tukey, J. W. 59, 403, 486 van Wyk, J. W. J. 427 van Zwet, W. R. 60 Venables, W. N. 282, 315, 361, 374, 375 Venkatraman, E. S. 374 Ventura, V. x, 428, 486, 492 Vere-Jones, D. 428 Wahrendorf, J. 379 Walden, A. T. 426 Wall, K. D. 427 Wand, M. P. 128 Wang, S. 124, 318, 486, 487 Wang, Y. 486 Wei, B. C. 63, 375, 487 Wei, L. J. 250 Weisberg, S. 125, 257, 315, 316 Welch, B. L. 515 Welch, W. J. 183, 185 Wellner, J. A. 123 Westfall, P. H. 184 Whittaker, J. 125 Wilson, S. R. 378, 379
Witkowski, J. A. 417 Wong, W. H. 124 Wood, A. T. A. 247, 251, 486, 488, 491, 515, 517 Woods, H. 277 Worton, B. J. 486, 487, 493, 515, 517, 518 Wu, C. J. F. 60, 125, 130, 315, 316 Wyatt, M. P. 316 Wynn, H. P. 515 Yahav, J. A. 487, 494 Yandell, B. S. 374 Ying, Z. 250 Young, G. A. x, 59, 60, 124, 128, 246, 316, 428, 486, 487, 493 Young, S. S. 184 Zeger, S. L. 374, 376, 377 Zelen, M. 328 Zidek, J. V. 318
Example index
accelerated life test, 346, 379 adaptive test, 187, 188 AIDS data, 1, 342, 369 air-conditioning data, 4, 15, 17, 19, 25, 27, 30, 33, 36, 197, 199, 203, 205, 207, 209, 216, 217, 233, 501, 508, 513, 520 Amis data, 253 AML data, 83, 86, 146, 160, 187 antithetic bootstrap, 493 association, 421, 422 autoregression, 388, 391, 393, 398, 432, 434 average, 13, 15, 17, 19, 22, 25, 27, 30, 33, 36, 47, 51, 88, 92, 94, 98, 128, 501, 508, 513, 516 axial data, 234, 505 balanced bootstrap, 440, 441, 442, 445, 487, 488, 489, 494 Bayesian bootstrap, 513, 518, 520 beaver data, 434 bias estimation, 106, 440, 464, 466, 488, 492, 495 binomial data, 338, 359, 361 bivariate missing data, 90, 128 block bootstraps, 398, 401, 403, 432 bootstrap likelihood, 508, 517, 518 bootstrap recycling, 464, 466, 492, 496 brambles data, 422 Breslow data, 378 calcium uptake data, 355, 441, 442 capability index, 248, 253, 497 carbon monoxide data, 67 cats data, 321 caveolae data, 416, 425 CD4 data, 68, 134, 190, 251, 252, 254 cement data, 277 changepoint estimation, 241
Channing House data, 131 circular data, 126, 517, 520 city population data, 6, 13, 22, 30, 49, 52, 53, 54, 66, 95, 108, 110, 113, 118, 201, 238, 439, 440, 447, 464, 473, 490, 492, 513 Claridge data, 157, 158, 496 cloth data, 382 coal-mining data, 435 comparison of means, 159, 162, 163, 166, 171, 172, 176, 181, 186, 454, 457, 519 comparison of variable selection methods, 306 convex regression, 371 correlation coefficient, 48, 61, 63, 68, 80, 90, 108, 115, 157, 158, 187, 247, 251, 254, 475, 493, 496, 518 correlogram, 388 Darwin data, 186, 188, 471, 481, 498 difference of means, 71, 75 dogs data, 187 double bootstrap, 176, 177, 224, 226, 254, 464, 466, 469 Downs' syndrome data, 371 ducks data, 134 eigenvalue, 64, 134, 252, 277, 445, 447 empirical exponential family likelihood, 505, 516, 520 empirical likelihood, 501, 516, 519, 520 equal marginal distributions, 78 exponential mean, 15, 17, 19, 30, 61, 176, 224, 250, 510 exponential model, 188, 328, 334, 367 factorial experiment, 320, 322 fir seedlings data, 142 Frets' heads data, 115, 447 gamma model, 5, 25, 36, 148, 207, 233, 247, 376 generalized additive model, 367, 369, 371, 382, 383 generalized linear model, 328, 334, 338, 342, 367, 376, 378, 381, 383 gravity data, 72, 121, 131, 454, 457, 494, 519 handedness data, 157, 496 hazard ratio, 221 head size data, 115 heart disease data, 378 hypergeometric distribution, 487 importance sampling, 454, 457, 461, 464, 466, 489, 490, 491, 495 imputation, 88, 90 independence, 177 influence function, 48, 53 intensity estimate, 418 Islay data, 520 isotonic regression, 371 jackknife, 51, 64, 65, 317 jackknife-after-bootstrap, 115, 130, 134, 313, 325 K-function, 416, 422 kernel density estimate, 226, 413, 469 kernel intensity estimate, 418, 431 laterite data, 234, 505 leukaemia data, 328, 334, 367 likelihood ratio statistic, 62, 148, 247, 346, 501 linear approximation, 118, 468, 490 logistic regression, 141, 146, 338, 359, 361, 371, 376
log-linear model, 342, 369 lognormal model, 66, 148 low birth weights data, 361 lynx data, 432 maize data, 181 mammals data, 257, 262, 265, 324 MCMC, 146, 184, 185 matched pairs, 186, 187, 188, 492 mean, see average mean polar axis, 234, 505 median, see sample median median survival time, 86 melanoma data, 352 misclassification error, 359, 361, 381 missing data, 88, 90, 128 mixed continuous-discrete distributions, 78 model selection, 304, 306, 393, 432 motorcycle impact data, 363, 365 multinomial distribution, 66, 487 multiple regression, 276, 277, 281, 286, 287, 298, 300, 304, 306, 309, 313
neurophysiological point process data, 418 neurotransmission data, 189 Nile data, 241 nitrofen data, 383 nodal involvement data, 381 nonlinear regression, 355, 441, 442 nonlinear time series, 393, 401 nonparametric regression, 365 normal plot, 150, 152, 154 normal prediction limit, 244 normal variance, 208 nuclear power data, 286, 298, 304, 323 one-way model, 276, 319, 320 overdispersion, 142, 338, 342 paired comparison, 471, 481, 498 partial correlation, 115 Paulsen data, 189 periodogram, 388 periodogram resampling, 413, 430 PET film data, 346, 379 phase scrambling, 410, 430, 435 point process data, 416, 418, 421 poisons data, 322 Poisson process, 416, 418, 422, 425, 431, 435 Poisson regression, 342, 369, 378, 382, 383 prediction, 244, 286, 287, 323, 324, 342 prediction error, 298, 300, 320, 321, 359, 361, 369, 381, 393, 401 product-limit estimator, 86, 128 proportional hazards, 146, 160, 221, 352 quantile, 48, 253, 352 quartzite data, 520 ratio, 6, 13, 22, 30, 49, 52, 53, 54, 66, 95, 98, 108, 110, 113, 118, 126, 127, 165, 201, 217, 238, 249, 439, 447, 464, 473, 490, 513 regression, see convex regression, generalized additive model, generalized linear model, logistic regression, log-linear model, multiple regression, nonlinear regression, nonparametric regression, robust regression, straight-line regression regression prediction, 286, 287, 298, 300, 320, 321, 323, 324, 342, 359, 361, 369, 381 reliability data, 346, 379 remission data, 378 returns data, 269, 272, 449, 461 Richardson extrapolation, 494 Rio Negro data, 388, 398, 403, 410 robust M-estimate, 318, 471, 483 robust regression, 308, 309, 313, 318, 324 robust variance, 265, 318, 376 rock data, 281, 287 saddlepoint approximation, 468, 469, 471, 473, 475, 477, 481, 483, 492, 493, 497 salinity data, 309, 313, 324 sample maximum, 39, 56, 247 sample median, 41, 61, 65, 80 sample variance, 61, 62, 64, 104, 481 separate families test, 148 several samples, 72, 126, 131, 133, 519 simulated data, 306 smoothed bootstrap, 80, 127, 168, 169, 418, 431 spatial association, 421, 422 spatial clustering, 416 spatial epidemiology, 421 spectral density estimation, 413 spherical data, 126, 234, 505 spline model, 365 stationary bootstrap, 398, 403, 428, 429 straight-line regression, 257, 262, 265, 269, 272, 308, 317, 321, 322, 449, 461 stratified ratio, 98 Strauss process, 416, 425 studentized statistic, 477, 481, 483 sugar cane data, 338 sunspot data, 393, 401, 435 survival probability, 86, 131 survival proportion data, 308, 322 survival time data, 328, 334, 346, 352, 367 survivor functions, 83 symmetric distribution, 78, 251, 471, 483 tau particle data, 133, 495 test of correlation, 157 test for overdispersion, 142, 184 test for regression coefficient, 269, 281, 313 test of interaction, 322 tile resampling, 425, 432 times on delivery suite data, 300 traffic data, 253 transformation, 33, 108, 118, 169, 226, 322, 355, 418 trend test in time series, 403, 410 trimmed average, 64, 121, 130, 133 tuna data, 169, 228, 469 two-sample problem, see comparison of means two-way model, 177, 184, 338
unimodality, 168, 169, 189 unit root test, 391 urine data, 359 variable selection, 304, 306 variance estimation, 208, 446, 464, 488, 495 Weibull model, 346, 379 weighted average, 72, 126, 131 weird bootstrap, 128 Wilcoxon test, 181 wild bootstrap, 272, 319 wool prices data, 391
Subject index
abc.ci, 536 ABC method, see confidence interval
Abelson-Tukey coefficients, 403 accelerated life test example, 346, 379 adaptive estimation, 120-123, 125, 133 adaptive test, 173-174, 184, 187, 188 aggregate prediction error, 290-301, 316, 320-321, 358-362 AIDS data example, 1, 342, 369 air conditioning data example, 4, 15, 17, 19, 25, 27, 30, 33, 36, 149, 188, 197, 199, 203, 205, 207, 209, 216, 217, 233, 501, 508-512 Akaike's information criterion, 316, 394, 432 algorithms K-fold adjusted cross-validation, 295 balanced bootstrap, 439, 488 balanced importance resampling, 460, 491 Bayesian bootstrap, 513 case resampling in regression, 264 comparison of generalized linear and generalized additive models, 367 conditional bootstrap for censored data, 84 conditional resampling for censored survival data, 351 double bootstrap for bias adjustment, 104 inhomogeneous Poisson process, 431 model-based resampling in linear regression, 262 phase scrambling, 408 prediction in generalized linear models, 341 prediction in linear regression, 285
resampling errors with unequal variances, 271 resampling for censored survival data, 351 stationary bootstrap, 428 superpopulation bootstrap, 94 all-subsamples method, 57 AML data example, 83, 86, 146, 160, 187, 221 analysis of deviance, 330-331, 367-369 ancillary statistics, 43, 238, 241 antithetic bootstrap, 493 apparent error, 292 assessment set, 292 autocorrelation, 386, 431 autoregressive process, 386, 388, 389, 392, 395, 398, 399, 400, 401, 410, 414, 432, 433, 434 simulation, 390-391 autoregressive-moving average process, 386, 408 average, 4, 8, 13, 15, 17, 19, 22, 25, 27, 30, 33, 36, 47, 51, 88, 90, 92, 94, 98, 129, 130, 197, 199, 203, 205, 207, 209, 216, 251, 501, 508, 512, 513, 516, 518, 520 comparison of several, 163 comparison of two, 159, 162, 166, 171, 172, 186, 454, 457, 519 finite population, 94, 98, 129 Bayesian bootstrap, 512-514, 515, 518, 520 BCa method, see confidence interval beaver data example, 434 bias correction, 103-107 bias estimator, 16-18 adjusted, 104, 106-107, 130, 442, 464, 466, 492 balanced resampling, 440 bias of, 103 post-simulation balance, 488 sensitivity analysis for, 114, 117 binary data, 78, 359-362, 376, 377, 378 binomial data, 338 binomial process, 416 bivariate distribution, 78, 90-92, 128 block resampling, 396-408, 427, 428, 432 boot, 525-528, 538, 548 "balanced", 545 m, 538 mle, 528, 538, 540, 543 "parametric", 534 ran.gen, 528, 538, 540, 543 sim, 529, 534 statistic, 527, 528 strata, 531 stype, 527 weights, 527, 536, 546 boot.array, 526 boot.ci, 536 bootstrap adjustment, 103-107, 125, 130, 175-180, 223-230 antithetic, 487, 493 asymptotic accuracy, 39-41, 211-214 balanced, 438-446, 486, 494-499 algorithm, 439, 488 bias estimate, 438-440, 488 conditions for success, 445 efficiency, 445, 460, 461, 495 experimental design, 441, 486, 489 first-order, 439, 486, 487-488
higher-order, 441, 486, 489 theory, 443-445, 487 balanced importance resampling, 460-463, 486, 496 Bayesian, 512-514, 515, 518, 520 block, 396-408, 427, 428, 433 calibration, 246 case resampling, 84 conditional, 84, 124, 132, 351, 374, 474 consistency, 37-39 discreteness, 27, 61 double, 103-113, 122, 125, 130, 175-180, 223-230, 254, 373, 463-466, 469, 486, 497, 507-509 theory for, 105-107, 125 generalized, 56 hierarchical, 100-102, 125, 130, 288 imputation, 89-92, 124-125 jittered, 124 mirror-match, 93, 125, 129 model-based, 349, 433, 434 nested, see double nonparametric, 22 parametric, 15-21, 261, 333, 334, 339, 344, 347, 373, 378, 379, 416, 528, 534 population, 94, 125, 129 post-blackened, 397, 399, 433 post-simulation balance, 441-445, 486, 488, 495 quantile, 18-21, 36, 69, 441, 442, 448-450, 453-456, 457, 463, 468, 490 recycling, 463-466, 486, 492, 496 robustness, 264 shrunk smoothed, 79, 81, 127 simulation size, 17-21, 34-37, 69, 155-156, 178-180, 183, 185, 202, 226, 246, 248 smoothed, 79-81, 124, 127, 168, 169, 310, 418, 431, 531 spectral, 412-415, 427, 430 stationary, 398-408, 427, 428-429, 433
stratified, 89, 90, 306, 340, 344, 365, 371, 457, 494, 531 superpopulation, 94, 125, 129 symmetrized, 78, 122, 169, 471, 485 tile, 424-426, 428, 432 tilted, 166-167, 452-456, 459, 462, 546-547 weighted, 60, 514, 516 weird, 86-87, 124, 128, 132 wild, 272-273, 316, 319, 538 bootstrap diagnostics, 113-120, 125 bias function, 108, 110, 464-465 jackknife-after-bootstrap, 113-118, 532 linearity, 118-120 variance function, 107-111, 464-465 bootstrap frequencies, 22-23, 66, 76, 110-111, 438-445, 464, 526, 527 bootstrap likelihood, 507-509, 515, 517, 518 bootstrap recycling, 463-466, 487, 492, 497, 508 bootstrap test, see significance test Box-Cox transformation, 118 brambles data example, 422 Breslow estimator, 350 calcium uptake data example, 355, 441, 442 capability index example, 248, 253, 497 carbon monoxide data, 67 cats data example, 321 caveolae data example, 416, 425 CD4 data example, 68, 134, 190, 251, 252, 254 cement data example, 277 censboot, 532, 541 censored data, 82-87, 124, 128, 131, 160, 346-353, 514, 532, 541 changepoint model, 241 Channing House data example, 131 choice of estimator, 120-123, 125, 134 choice of predictor, 301-305 choice of test statistic, 173, 180, 184, 187 circular data, 126, 517, 520
city population data example, 6, 13, 22, 30, 49, 52, 53, 54, 66, 95, 108, 110, 113, 118, 201, 238, 249, 440, 447, 464, 473, 490, 492, 513 Claridge data example, 157, 158, 496 cloth data example, 382 coal-mining data example, 435 collinearity, 276-278 complementary set partitions, 552, 554 complete enumeration, 27, 60, 438, 440, 486 conditional inference, 43, 138, 145, 238-243, 247, 251 confidence band, 375, 417, 420, 435 confidence interval ABC, 214-220, 231, 246, 511, 536 BCa, 203-213, 246, 249, 336-337, 383, 536 basic bootstrap, 28-29, 194-195, 199, 213-214, 337, 365, 374, 383, 435 coefficient, 191 comparison of methods, 211-214, 230-233, 246, 336-338 conditional, 238-243, 247, 251 double bootstrap, 223-230, 250, 254, 374, 469 normal approximation, 14, 194, 198, 337, 374, 383, 435 percentile method, 202-203, 213-214, 336-337, 352, 383 profile likelihood, 196, 346 studentized bootstrap, 29-31, 95, 125, 194-196, 199, 212, 227-228, 231, 233, 246, 248, 250, 336-337, 391, 449, 454, 483-485 test inversion, 220-223, 246 confidence limits, 193 confidence region, 192, 231-237, 504-506 consistency, 13 contingency table, 177, 183, 184, 342 control, 545 control methods, 446-450, 486 bias estimation, 446-448, 496 efficiency, 447, 448, 450, 462 importance resampling weight, 456
linear approximation, 446, 486, 495 quantile estimation, 446-450, 461-463, 486, 495-496 saddlepoint approximation, 449 variance estimation, 446-448, 495 Cornish-Fisher expansion, 40, 211, 449 correlation estimate, 48, 61, 63, 68, 69, 80, 90-92, 108, 115-116, 134, 138, 157, 158, 247, 251, 254, 266, 475, 493 correlogram, 386, 389 partial, 386, 389 coverage process, 428 cross-validation, 153, 292-295, 296-301, 303, 305-307, 316, 320, 321, 324, 360-361, 365, 377, 381 K-fold, 294-295, 316, 320, 324, 360-361, 381 cumulant-generating function, 66, 466, 467, 472, 479, 551-553 approximate, 476-478, 482, 492 paired comparison test, 492 cumulants, 551-553 approximate, 476 generalized, 552 cumulative hazard function, 82, 83, 86, 350 Darwin data example, 186, 188, 471, 481, 498 defensive mixture distribution, 457-459, 462, 464, 486, 496 delivery suite data example, 300 delta method, 45-46, 195, 227, 233, 419, 432, see also nonparametric delta method density estimate, see kernel density estimate deviance, 330-331, 332, 335, 367-369, 370, 373, 378, 382 deviance residuals, see regression residuals diagnostics, see bootstrap diagnostics difference of means, see average, comparison of two directional data, 126, 234, 505, 515, 517, 520 dirty data, 44 discreteness effects, 26-27, 61 dispersion parameter, 327, 328, 331, 339, see also overdispersion distribution F, 331, 368 t, 81, 331, 484 Bernoulli, 376, 378, 381, 474, 475 beta, 187, 248, 377 beta-binomial, 338, 377 binomial, 86, 128, 327, 333, 338, 377 bivariate normal, 63, 80, 91, 108, 128 Cauchy, 42, 81 chi-squared, 139, 142, 163, 233, 234, 237, 303, 330, 335, 368, 373, 378, 382, 484, 500, 501, 503, 504, 505, 506 defensive mixture, see defensive mixture distribution Dirichlet, 513, 518 double exponential, 516 empirical, see empirical distribution function, empirical exponential family exponential, 4, 81, 82, 130, 132, 176, 188, 197, 203, 205, 224, 249, 328, 334, 336, 430, 491, 503, 521 exponential family, 504-507, 516 gamma, 5, 131, 149, 207, 216, 230, 233, 247, 328, 332, 334, 376, 503, 512, 513, 521 geometric, 398, 428 hypergeometric, 444, 487 least-favourable, 206, 209 lognormal, 66, 149, 336 multinomial, 66, 111, 129, 443, 446, 452, 468, 473, 491, 492, 493, 501, 502, 517, 519 multivariate normal, 445, 552 negative binomial, 337, 344, 345, 371 normal, 10, 150, 152, 154, 208, 244, 327, 485, 488, 489, 518, 551 Poisson, 327, 332, 333, 337, 342, 344, 345, 370, 378, 382, 383, 416, 419, 431, 473, 474, 493, 516 posterior, see posterior distribution prior, see prior distribution slash, 485 tilted, see exponential tilting Weibull, 346, 379 dogs data example, 187 Downs' syndrome data example, 371 ducks data example, 134 Edgeworth expansion, 39-41, 60, 408, 476-477 EEF.profile, 550 eigenvalue example, 64, 134, 252, 278, 445, 447 eigenvector, 505 EL.profile, 550 empirical Bayes, 125 empirical distribution function, 11-12, 60-61, 128, 501, 508 as model, 108 marginal, 267 missing data, 89-91 residuals, 77, 181, 261 several, 71, 75 smoothed, 79-81, 127, 169, 227, 228 symmetrized, 78, 122, 165, 169, 228, 251 tilted, 166-167, 183, 209-210, 452-456, 459, 504 empirical exponential family likelihood, 504-506, 515, 516, 520 empirical influence values, 46-47, 49, 51-53, 54, 63, 64, 65, 75, 209, 210, 452, 461, 462, 476, 517 generalized linear models, 376 linear regression, 260, 275, 317 numerical approximation of, 47, 51-53, 76 several samples, 75, 127, 210 see also influence values empirical likelihood, 500-504, 509, 512, 514-515, 516, 517, 519, 520 empirical likelihood ratio statistic, 501, 503, 506, 515 envelope test, see graphical test
equal marginal distributions example, 78 error rate, 137, 153, 174, 175 estimating function, 50, 63, 105, 250, 318, 329, 470-471, 478, 483, 504, 505, 514, 516 excess error, 292, 296 exchangeability, 143, 145 expansion Cornish-Fisher, 40, 211, 449 cubic, 475-478 Edgeworth, 39-41, 60, 411, 476-478, 487 linear, 47, 51, 69, 75, 76, 118, 443, 446, 468 notation, 39 quadratic, 50, 66, 76, 443 Taylor series, 45, 46 experimental design relation to resampling, 58, 439, 486 exponential mean example, 15, 17, 19, 30, 61, 176, 250, 510 exponential quantile plot, 5, 188 exponential tilting, 166-167, 183, 209-210, 452-454, 456-458, 461-463, 492, 495, 504, 517, 535, 546, 547 exp.tilt, 535 factorial experiment, 320, 322 finite population sampling, 92-100, 125, 128, 129, 130, 474 fir seedlings data, 142 Fisher information, 193, 206, 349, 516 Fourier frequencies, 387 Fourier transform, 387 empirical, 388, 408, 430 fast, 388 inverse, 387 frequency array, 23, 52, 443 frequency smoothing, 110, 456, 462, 463, 464-465, 496, 508 Frets' heads data example, 115, 447 gamma model, 5, 25, 62, 131, 149, 207, 216, 233, 247, 376 generalized additive model, 366-371, 375, 382, 383
generalized likelihood ratio, 139 generalized linear model, 327-346, 368, 369, 374, 376-377, 378, 381-384, 516 comparison of resampling schemes for, 336-338 graphical test, 150-154, 183, 188, 416, 422 gravity data example, 72, 121, 131, 150, 152, 154, 162, 163, 166, 171, 172, 454, 457, 494, 519 Greenwood's formula, 83, 128 half-sample methods, 57-59, 125 handedness data example, 157, 158, 496 hat matrix, 258, 275, 278, 318, 330 hazard function, 82, 146-147, 221-222, 350 heads data example, see Frets' heads data example heart disease data example, 378 heteroscedasticity, 259-260, 264, 269, 270-271, 307, 318, 319, 323, 341, 363, 365 hierarchical data, 100-102, 125, 130, 251-253, 287-289, 374 Huber M-estimate, see robust M-estimate example hypergeometric distribution, 487 hypothesis test, see significance test implied likelihood, 511-512, 515, 518 imp.moments, 546 importance resampling, 450-466, 486, 491, 497 balanced, 460-463 algorithm, 491 efficiency, 461, 462 efficiency, 452, 458, 461, 462, 486 improved estimators, 456-460 iterated bootstrap confidence intervals, 486 quantile estimation, 453-456, 457, 495 ratio estimator, 456, 459, 464, 486, 490 raw estimator, 459, 464, 486 regression, 486
regression estimator, 457, 459, 464, 486, 491 tail probability estimate, 452, 455 time series, 486 weights, 451, 455, 456-457, 458, 464 importance sampling, 450-452, 489 efficiency, 452, 456, 459, 460, 462, 489 identity, 116, 451, 463 misapplication, 453 quantile estimate, 489, 490 ratio estimator, 490 raw estimator, 451 regression estimator, 491 tail probability estimate, 453 weight, 451 imp.prob, 546 imp.quantile, 546 imputation, 88, 90 imp.weights, 546 incomplete data, 43-44, 88-92 index notation, 551-553 infinitesimal jackknife, see nonparametric delta method influence functions, 46-50, 60, 63-64 chain rule, 48 correlation, 48 covariance, 316, 319 eigenvalue, 64 estimating equation, 50, 63 least squares estimates, 260, 317 M-estimation, 318 mean, 47, 316 moments, 48, 63 multiple samples, 74-76, 126 quantile, 48 ratio of means, 49, 65, 126 regression, 260, 317, 319 studentized statistic, 63 trimmed mean, 64 two-sample t statistic, 454 variance, 64 weighted mean, 126 information distance, 165-166
integration, number-theoretic methods, 486 interaction example, 322 interpolation of quantiles, 195 Islay data example, 520 isotonic regression example, 371 iterative weighted least squares, 329
jackknife, 50-51, 59, 64, 65, 76, 115 delete-m, 56, 60, 493 for least squares estimates, 317 for sample surveys, 125 infinitesimal, see nonparametric delta method multi-sample, 76 jack.after.boot, 532 jackknife-after-bootstrap, 113-118, 125, 133, 134, 135, 308, 313, 322, 325, 369 parametric model, 116-118, 130 Jacobian, 470, 479 K-function, 416, 424 Kaplan-Meier estimate, see product-limit estimate kernel density estimate, 19-20, 79, 124, 127, 168-170, 189, 226-230, 251, 413, 469, 507, 511, 514 kernel intensity estimate, 419-421, 431, 435 kernel smoothing, 110, 363, 364, 375 kriging, 428 Kronecker delta symbol, 412, 443, 553 Lagrange multiplier, 165-166, 502, 504, 515, 516 Laplace's method, 479, 481 laterite data example, 234, 505 Latin hypercube sampling, 486 Latin square design, 489 least squares estimates, 258, 275, 392 penalized, 364 weighted, 271, 278, 329 length-biased data, 514 leukaemia data example, 328, 334, 367 leverage, 258, 271, 275, 278, 330, 370, 377 likelihood, 137 adjusted, 500, 512, 515 based on confidence sets, 509-512 bootstrap, 507-509 combination of, 500, 519 definition, 499 dual, 515 empirical, 500-506 implied, 511-512 multinomial-based, 165-166, 186, 500-509 parametric, 347, 499-500 partial, 350, 507 pivot-based, 510-511, 512, 515 profile, 62, 206, 248, 347, 501, 515, 519 quasi, 332, 344 likelihood ratio statistic, 62, 137, 138, 139, 148, 196, 234, 247, 330, 347, 368, 373, 380, 499-501 signed, 196 linear.approx, 530 linear approximation, see nonparametric delta method linear predictor, 327, 366 residuals, 331 linearity diagnostics, 118-120, 125 link function, 327, 332, 367 location-scale model, 77, 126 logistic regression example, 141, 146, 338, 359, 361, 371, 376, 378, 381, 474 logit, 338, 372 loglinear model, 177, 184, 342, 369 log rank test, 160 lognormal model, 66, 149 long-range dependence, 408, 410, 426 low birth weights data example, 361 lowess, 363, 369 lunch, nonexistence of free, 437 lynx data example, 432 M-estimate, 311-313, 316, 318, 471, 483, 515 maize data example, 181
mammals data example, 257, 262, 265, 324 Markov chain, 144, 429 Markov chain Monte Carlo, 143-147, 183, 184, 385, 428 matched-pair data example, 186, 187, 188, 471, 481, 492, 498 maximum likelihood estimate bias-corrected, 377 generalized linear model, 329 nonparametric, 165-166, 186, 209, 501, 516 mean, see average mean polar axis example, 234, 505 median, see sample median median absolute deviation, 311 median survival time example, 86, 124 melanoma data example, 352 misclassification error, 358-362, 375, 378, 381 misclassification rate, 359 missing data, 88-92, 125, 128 mixed continuous-discrete distribution example, 78 mode estimator, 124 model selection, 301-307, 316, 375, 393, 427, 432 model-based resampling, 389-396, 427, 433 modified sample size, 93 moment-generating function, 551, 552 Monte Carlo test, 140-147, 151-154, 183, 184 motoneurone firing, 418 motorcycle impact data example, 363, 365 moving average process, 386 multiple imputation, 89-91, 125, 128 multiplicative bias, 62 multiplicative model, 77, 126, 328, 335 negative binomial model, 337, 344 Nelson-Aalen estimator, 83, 86, 128 nested bootstrap, see double bootstrap nested data, see hierarchical data
neurophysiological data example, 418, 428 Nile data example, 241 nitrofen data example, 383 nodal involvement data example, 381 nonlinear regression, 353-358 nonlinear time series, 393-396, 401, 410, 426 nonparametric delta method, 46-50, 75 balanced bootstrap, 443-444 cubic approximation, 475-478 linear approximation, 47, 51, 52, 60, 69, 76, 118, 126, 127, 205, 261, 443, 454, 468, 487, 488, 490, 492 control variate, 446 importance resampling, 452 tilted, 490 quadratic approximation, 50, 79, 212, 215, 443, 487, 490 variance approximation, 47, 50, 63, 64, 75, 76, 108, 120, 199, 260, 261, 265, 275, 312, 318, 319, 376, 477, 478, 483 nonparametric maximum likelihood, 165-166, 186, 209, 501 nonparametric regression, 362-373, 375, 382, 383 normal prediction limit, 244 normal quantile plot test, 150 notation, 9-10 nuclear power data example, 286, 298, 304, 323 null distribution, 137 null hypothesis, 136 one-way model example, 208, 276, 319, 320 outliers, 27, 307-308, 363 overdispersion, 327, 332, 338-339, 343-344, 370, 382 test for, 142 paired comparison, see matched-pair data parameter transformation, see transformation of statistic partial autocorrelation, 386
partial correlation example, 115 periodogram, 387-389, 408, 430 resampling, 412-415, 427, 430 permutation test, 141, 146, 156-160, 173, 183, 185-186, 266, 279, 422, 486, 492 PET film data example, 346, 379 phase scrambling, 408-412, 427, 430, 435 pivot, 29, 31, 33, 510-511, see also studentized statistic point process, 415-426, 427-428 poisons data example, 322 Poisson process, 416-422, 425, 428, 431-432, 435 Poisson regression example, 342, 369, 378, 382, 383 posterior distribution, 499, 510, 513, 515, 520 power notation, 551-553 prediction error, 244, 375, 378 K-fold cross-validation estimate, 293-295, 298-301, 316, 320, 324, 358-362, 381 0.632 estimator, 298, 316, 324, 358-362, 381 adjusted cross-validation estimate, 295, 298-301, 316, 324, 358-362 aggregate, 290-301, 320, 321, 324, 358-362 apparent, 292, 298-301, 320, 324, 381 bootstrap estimate, 295-301, 316, 324, 358-362, 381 comparison of estimators, 300-301 cross-validation estimate, 292-293, 298-301, 320, 324, 358-362, 381 generalized linear model, 340-346 leave-one-out bootstrap estimate, 297, 321 time series, 393-396, 401, 427 prediction limits, 243-245, 251, 284-289, 340-346, 369-371 prediction rule, 290, 358, 359 prior distribution, 499, 510, 513 product factorial moments, 487 product-limit estimator, 82-83, 87, 124, 128, 350, 351, 352, 515 profile likelihood, 62, 206, 248, 347, 501, 515, 519 proportional hazards model, 146, 160, 221, 350-353, 374 P-value, 137, 138, 141, 148, 158, 161, 437 adjusted, 175-180, 183, 187 for regression slope, 266, 378 importance sampling, 452, 454, 459 saddlepoint approximation, 475, 487
quadratic approximation, see nonparametric delta method quantile estimator, 18-21, 48, 80, 86, 124, 253, 352 quartzite data example, 520 quasi-likelihood, 332, 344 random effects model, see hierarchical data random walk model, 391 randomization test, 183, 492, 498 randomized block design, 489 ratio in finite population sampling, 98 stratified sampling for, 98 ratio estimate in finite population sampling, 95 ratio example, 6, 13, 22, 30, 49, 52, 53, 54, 62, 66, 98, 108, 110, 113, 118, 126, 127, 165, 178, 186, 201, 217, 238, 249, 439, 447, 464, 473, 490, 513 recycling, see bootstrap recycling regression L1, 124, 311, 312, 316, 325 case deletion, 317, 377 case resampling, 264-266, 269, 275, 277, 279, 312, 333, 355, 364 convex, 372 design, 260, 261, 263, 264, 276, 277, 305 generalized additive, 366-371, 375, 382, 383
generalized linear, 327-346, 374, 376, 377, 378, 381, 382, 383 isotonic, 371 least trimmed squares, 308, 311, 313, 325 linear, 256-325, 434 local, 363, 367, 375 logistic, 141, 146, 338, 371, 376, 378, 381, 474 loglinear, 342, 369, 383 M-estimation, 311-313, 316, 318 many covariates with, 275-277 model-based resampling, 261-264, 267, 270-272, 275, 276, 279, 280, 312, 333-335, 346-351, 364-365 multiple, 273-307 no intercept, 263, 317 nonconstant variance, 270-273 nonlinear, 353-358, 375, 441, 442 nonparametric, 362-373, 375, 427 Poisson, 337, 342, 378, 382, 383, 473, 504, 516 prediction, 284-301, 315, 316, 323, 324, 340-346, 369 repeated design points in, 263 resampling moments, 262 residuals, see residuals resistant, 308 robust, 307-314, 315, 316, 318, 325 significance tests, 266-270, 279-284, 322, 325, 367, 371, 382, 383 straight-line, 257-273, 308, 317, 322, 391, 449, 461, 489 survival data, 346-353 weighted least squares, 271-272, 278-279, 329 regression estimate in finite population sampling, 95 remission data example, 378 repeated measures, see hierarchical data resampling, see bootstrap residuals deviance, 332, 333, 334, 345, 376 in multiple imputations, 89-91 inhomogeneous, 338-340, 344 linear predictor, 331, 333, 376 modified, 77, 259, 270, 272, 275, 279, 312, 318, 331, 355, 365 nonlinear regression, 355, 375 nonstandard, 349 Pearson, 331, 333, 334, 342, 370, 376, 382 raw, 258, 275, 278, 317, 319 standardized, 259, 331, 332, 333, 376 time series, 390, 392 returns data example, 269, 272, 449, 461 Richardson extrapolation, 487, 494 Rio Negro data example, 388, 398, 403, 410, 427 robust M-estimate example, 471, 483 robust regression example, 308, 309, 313, 318, 325 robustness, 3, 14, 264, 318 rock data example, 281, 287 rough statistics, 41-43 saddle, 547 saddle.distn, 547 saddlepoint approximation, 466-485, 486, 487, 492, 493, 498, 508, 509, 517, 547 accuracy, 467, 477, 487 conditional, 472-475, 487, 493 density function, 467, 470 distribution function, 467, 468, 470, 486-487 double, 473-475 equation, 467, 473, 479 estimating function, 470-472 integration approach, 478-485 linear statistic for, 468-469, 517 Lugannani-Rice formula, 467 marginal, 473, 475-485, 487, 493 permutation distribution, 475, 486, 487 quantile estimate, 449, 468, 480, 483 randomization distribution, 492, 498 salinity data example, 309, 311, 324 sample average, see average
sample maximum example, 39, 56, 247 sample median, 41, 61, 65, 80, 121, 181, 518 sample variance, 61, 62, 64, 104, 208, 432, 488 sampling stratified, see stratified sampling without replacement, 92 sampling fraction, 92-93 sandwich variance estimate, 63, 275, 318, 376 second-order accuracy, 39-41, 211-214, 246 semiparametric model, 77-78, 123 sensitivity analysis, 113 separate families example, 148 sequential spatial inhibition process, 425 several samples, 71-76, 123, 126, 127, 130, 131, 133, 163, 208, 210-211, 217-220, 253 shrinkage estimate, 102, 130 significance probability, see P-value significance test adaptive, 173-174, 184, 187, 188 conditional, 138, 173-174 confidence interval, 220-223 critical region, 137 double bootstrap, 175-180, 183, 186, 187 error rate, 137, 175-176 generalized linear regression, 330-331, 367-369, 378, 382 graphical, 150-154, 188, 416, 422, 428 linear regression, 266-270, 279-284, 317, 322, 392 Monte Carlo, 140-147, 151-154 multiple, 174-175, 184 nonparametric bootstrap, 161-175, 267-270 nonparametric regression, 367, 371, 382, 383 parametric bootstrap, 148-149 permutation, 141, 146, 156-160, 173, 183, 185, 188, 266, 317, 378, 475, 486
pivot, 138-139, 268-269, 280, 284, 392, 454, 486 power, 155-156, 180-184 P-value, 137, 138, 141, 148, 158, 161, 175-176 randomization, 183, 185, 186, 492, 498 separate families, 148, 378 sequential, 182 size of test, 137 smooth.f, 533 smooth estimates of F, 79-81 spatial association example, 421, 428 spatial clustering example, 416 spatial data, 124, 416, 421-426, 428 spatial epidemiology, 421, 428 species abundance example, 169, 228 spectral density estimation example, 413 spectral resampling, see periodogram resampling spectrum, 387, 408 spherical data example, 126, 234, 505 spline smoother, 352, 364, 365, 367, 368, 371, 468 standardized residuals, see residuals, standardized stationarity, 385-387, 391, 398, 416 statistical error, 31-34 statistical function, 12-14, 46, 60, 75 Stirling's approximation, 61, 155
straight-line regression, see regression, straight-line stratified resampling, 71, 89, 90, 306, 340, 344, 365, 371, 457, 494 stratified sampling, 97-100 Strauss process, 417, 425 studentized statistic, 29, 53, 119, 139, 171-173, 223, 249, 268, 280-281, 284, 286, 313, 315, 324, 325, 326, 330, 477, 481, 483, 513 subsampling, 55-59 balanced, 125 in model selection, 303-304 spatial, 424-426 sugar cane data example, 338 summation convention, 552 sunspot data example, 393, 401, 435 survival data nonparametric, 82-87, 124, 128, 131, 132, 350-353, 374-375 parametric, 346-350, 379 survival probability, 86, 132, 160, 352, 515 survival proportion data example, 308, 322 survivor function, 82, 160, 350, 351, 352, 455 symmetric distribution example, 78, 169, 228, 251, 470, 471, 485 tau particle data example, 133, 495 test, see significance test tile resampling, 424-426, 427, 428, 432 tilt.boot, 547 tilted distribution, see exponential tilting time series, 385-415, 426-427, 428-431, 514 econometric, 427 nonlinear, 396, 410, 426 toroidal shifts, 423 traffic data, 253
training set, 292 transformation of statistic empirical, 112, 113, 118, 125, 201 for confidence interval, 195, 200, 233 linearizing, 118-120 variance stabilizing, 32, 63, 108, 109, 111-113, 125, 195, 201, 227, 246, 252, 394, 419, 432 trend test in time series example, 403, 410 trimmed average example, 64, 121, 130, 133, 189 tsboot, 544 tuna data example, 169, 228, 469 two-way model example, 338, 342, 369 two-way table, 177, 184 unimodality test, 168, 169, 189 unit root test, 391, 427 urine data example, 359 variable selection, 301-307, 316, 375 variance approximations, see nonparametric delta method variance estimate, see sample variance variance function, 327-330, 332, 336, 337, 338, 339, 344, 367 estimation of, 107-113, 465 variance stabilization, 32, 63, 108, 109, 111-113, 125, 195, 201, 227, 246, 419, 432 variation of properties of T, 107-113 var.linear, 530 weighted average example, 72, 126, 131 weighted least squares, 270-272, 278-279, 329-330, 377 white noise, 386 Wilcoxon test, 181 wool prices data example, 391