Institute of Mathematical Statistics LECTURE NOTES–MONOGRAPH SERIES Volume 54
Complex Datasets and Inverse Problems Tomography, Networks and Beyond
Regina Liu, William Strawderman and Cun-Hui Zhang, Editors
Institute of Mathematical Statistics Beachwood, Ohio, USA
Institute of Mathematical Statistics Lecture Notes–Monograph Series
Series Editor: R. A. Vitale
The production of the Institute of Mathematical Statistics Lecture Notes–Monograph Series is managed by the IMS Office: Jiayang Sun, Treasurer and Elyse Gustafson, Executive Director.
Library of Congress Control Number: 2007924176
International Standard Book Number (13): 978-0-940600-70-6
International Standard Book Number (10): 0-940600-70-6
International Standard Serial Number: 0749-2170
Copyright © 2007 Institute of Mathematical Statistics
All rights reserved
Printed in Lithuania
Contents

Preface
Dedication
Contributors

Deconvolution by simulation
    Colin Mallows
An iterative tomogravity algorithm for the estimation of network traffic
    Jiangang Fang, Yehuda Vardi and Cun-Hui Zhang
Statistical inverse problems in active network tomography
    Earl Lawrence, George Michailidis and Vijayan N. Nair
Network tomography based on 1-D projections
    Aiyou Chen and Jin Cao
Using data network metrics, graphics, and topology to explore network characteristics
    A. Adhikari, L. Denby, J. M. Landwehr and J. Meloche
A flexible Bayesian generalized linear model for dichotomous response data with an application to text categorization
    Susana Eyheramendy and David Madigan
Estimating the proportion of differentially expressed genes in comparative DNA microarray experiments
    Javier Cabrera and Ching-Ray Yu
Functional analysis via extensions of the band depth
    Sara López-Pintado and Rebecka Jornsten
A representative sampling plan for auditing health insurance claims
    Arthur Cohen and Joseph Naus
Confidence distribution (CD) – distribution estimator of a parameter
    Kesar Singh, Minge Xie and William E. Strawderman
Empirical Bayes methods for controlling the false discovery rate with dependent data
    Weihua Tang and Cun-Hui Zhang
A smoothing model for sample disclosure risk estimation
    Yosef Rinott and Natalie Shlomo
A note on the U, V method of estimation
    Arthur Cohen and Harold Sackrowitz
Local polynomial regression on unknown manifolds
    Peter J. Bickel and Bo Li
Shape restricted regression with random Bernstein polynomials
    I-Shou Chang, Li-Chu Chien, Chao A. Hsiung, Chi-Chung Wen and Yuh-Jenn Wu
Non- and semi-parametric analysis of failure time data with missing failure indicators
    Irene Gijbels, Danyu Lin and Zhiliang Ying
Nonparametric estimation of a distribution function under biased sampling and censoring
    Micha Mandel
Estimating a Polya frequency function$_2$
    Jayanta Kumar Pal, Michael Woodroofe and Mary Meyer
A comparison of the accuracy of saddlepoint conditional cumulative distribution function approximations
    Juan Zhang and John E. Kolassa
Multivariate medians and measure-symmetrization
    Richard A. Vitale
Statistical thinking: From Tukey to Vardi and beyond
    Larry Shepp
Preface

This book is a collection of papers dedicated to the memory of Yehuda Vardi. Yehuda was the chair of the Department of Statistics of Rutgers University when he passed away unexpectedly on January 13, 2005. On October 21–22, 2005, some 150 leading scholars from many different fields, including statistics, telecommunications, biomedical engineering, bioinformatics, biostatistics and epidemiology, gathered at Rutgers in a conference in his honor. This conference was on “Complex Datasets and Inverse Problems: Tomography, Networks, and Beyond,” and was organized by the editors. The present collection includes research work presented at the conference, as well as contributions from Yehuda’s colleagues.

The theme of the conference was networks and other important and emerging areas of research involving incomplete data and statistical inverse problems. Networks are abundant around us: communication, computer, traffic, social and energy are just a few examples. As enormous amounts of network data are collected in this information age, the field has attracted a great amount of attention from researchers in statistics and computer engineering as well as telecommunication providers and various government agencies. However, few statistical tools have been developed for analyzing network data as they are typically governed by time-varying and mutually dependent communication protocols sitting on complicated graph-structured network topologies. Many prototypical applications in these and other important technologies can be viewed as statistical inverse problems with complex, massive, high-dimensional and possibly biased/incomplete data.

This unifying theme of inverse problems is particularly appropriate for a conference and volume dedicated to the memory of Yehuda. Indeed he made influential contributions to these fields, especially in medical tomography, biased data, statistical inverse problems, and network tomography.
The conference was supported by the NSF Grant DMS 05-34181, and by the Faculty of Arts and Sciences and the Department of Statistics of Rutgers University. We would like to thank the participants of the conference, the contributors to the volume, and the anonymous reviewers. Thanks are also due to DIMACS for providing conference facilities, and to the members of the staff and the many graduate students from the Department of Statistics for their tireless efforts to ensure the success of the conference. Last but not least, we would like to thank Ms. Pat Wolf for her patience and meticulous attention to all details in handling the papers in this volume.
Regina Liu, William Strawderman and Cun-Hui Zhang December 15, 2006
Dedication

This volume is dedicated to our dear colleague Yehuda Vardi, who passed away in January 2005. Yehuda was born in 1946 in Haifa, Israel. He earned a B.S. in Mathematics from Hebrew University, Jerusalem, an M.S. in Operations Research from the Technion, Israel Institute of Technology and a Ph.D. under Jack Kiefer at Cornell University in 1977. Yehuda served as a Scientist at AT&T’s Bell Laboratories in Murray Hill before joining the Department of Statistics at Rutgers University in 1987. He served as the department chair from 1996 until he passed away.

Yehuda was a dynamic and influential chair. He led the department with great energy and vision. He also provided much service to the statistical community by organizing many research conferences, and serving on the editorial boards of several statistical and engineering journals. He was an elected fellow of the Institute of Mathematical Statistics and International Statistical Institute. His research was supported by numerous grants from the National Science Foundation and other government agencies.

Yehuda was a leading statistician and a true champion for interdisciplinary research. He developed key algorithms which are now widely used for emission tomographic PET and SPECT scanners. In addition to his work on medical imaging, he coined the term “network tomography” in his pioneering paper on the problem of estimating source-destination traffic based on counts in individual links or “road sections” of a network. This problem has since blossomed into a full-fledged field of active research. His work on unbiased estimation based on biased data was a fundamental contribution in the field, and was recently rediscovered as a powerful general tool for the popular Markov chain Monte Carlo method. He has explored many other areas of statistics, including data depth and positive linear inverse problems with applications in signal recovery.
His seminal contributions played a leading role in advancing the scientific fields in question, while enriching statistics with important applications. Yehuda was not just a scientist with remarkable breadth and insight. He was also a wonderful colleague and friend, and a constant source of encouragement and humor. We miss him deeply.
Regina Liu, William Strawderman and Cun-Hui Zhang
Contributors to this volume

Adhikari, A., Avaya Labs
Bickel, P. J., University of California, Berkeley
Cabrera, J., Rutgers University
Cao, J., Bell Laboratories, Alcatel-Lucent Technologies
Chang, I-S., National Health Research Institutes
Chen, A., Bell Laboratories, Alcatel-Lucent Technologies
Chien, L.-C., National Health Research Institutes
Cohen, A., Rutgers University
Denby, L., Avaya Labs
Eyheramendy, S., Oxford University
Fang, J., Rutgers University
Gijbels, I., Katholieke Universiteit Leuven
Hsiung, C. A., National Health Research Institutes
Jornsten, R., Rutgers University
Kolassa, J. E., Rutgers University
Landwehr, J. M., Avaya Labs
Lawrence, E., Los Alamos National Laboratory
Li, B., Tsinghua University
Lin, D., University of North Carolina
López-Pintado, S., Universidad Pablo de Olavide
Madigan, D., Rutgers University
Mallows, C., Avaya Labs
Mandel, M., The Hebrew University of Jerusalem
Meloche, J., Avaya Labs
Meyer, M., University of Georgia
Michailidis, G., University of Michigan
Nair, V. N., University of Michigan
Naus, J., Rutgers University
Pal, J. K., University of Michigan
Rinott, Y., Hebrew University
Sackrowitz, H., Rutgers University
Shepp, L., Rutgers University
Shlomo, N., Southampton University
Singh, K., Rutgers University
Strawderman, W. E., Rutgers University
Tang, W., Rutgers University
Vardi, Y., Rutgers University
Vitale, R. A., University of Connecticut
Wen, C.-C., Tamkang University
Woodroofe, M., University of Michigan
Wu, Y.-J., Chung Yuan Christian University
Xie, M., Rutgers University
Ying, Z., Columbia University
Yu, C.-R., Rutgers University
Zhang, C.-H., Rutgers University
Zhang, J., Rutgers University
IMS Lecture Notes–Monograph Series
Complex Datasets and Inverse Problems: Tomography, Networks and Beyond
Vol. 54 (2007) 1–11
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000021
Deconvolution by simulation

Colin Mallows¹
Avaya Labs

Abstract: Given samples $(x_1, \ldots, x_m)$ and $(z_1, \ldots, z_n)$ which we believe are independent realizations of random variables X and Z respectively, where we further believe that Z = X + Y with Y independent of X, the problem is to estimate the distribution of Y. We present a new method for doing this, involving simulation. Experiments suggest that the method provides useful estimates.
¹ Avaya Labs, Basking Ridge, NJ, USA, e-mail: [email protected]
AMS 2000 subject classifications: 60J10, 62G05, 94C99.
Keywords and phrases: nonparametric estimation, Markov chains.

1. Motivation

The need for an algorithm arose in work on estimating delays in the Internet. We can send a packet from an origin A to a remote site B, and have a packet returned from B to A; the time that this takes is called the “round-trip delay” for the link A-B. These delays are very volatile and are occasionally large. We can also send packets from A to a more remote site C, by way of B, and can arrange for packets to be returned from C via B to A; this gives the round-trip delay for the A-B-C path. However, we cannot directly observe the delay on the B-C link. Observation suggests that delays for successive packets are almost independent of one another; in particular the measured delays for two packets sent 20 ms apart, the first from A to B (and return), the second from A to B to C (and return), are almost independent. We model this situation by assuming there are distributions $F_X$ and $F_Y$ that give the delays on the links A-B and B-C respectively, with the distribution of the A-C delay being the convolution of these two distributions. In practice we are interested in identifying changes in the distributions as rapidly as possible. However, a more basic question is how to estimate the distribution $F_Y$ when we can observe only X and Z.

While our formulation of the deconvolution problem seems natural in our context, we have not seen any study of it in the literature. A Google Scholar search for titles containing “deconvolution” yields about 12,000 references; many of these refer to “blind deconvolution”, which is what a statistician would term “estimation of a transfer function”. If we delete titles containing “blind” there remain about 4730 titles. Most of these are in various applied journals, relating to a large variety of disciplines. A selection of those in statistical and related journals is listed in the References section. In all the papers we have seen, the distribution of X is assumed known.

2. A note on notation

The usual convention is to write all mathematical variables in italics, with random variables in upper-case, and realizations in lower-case. We depart from this by using typewriter font like this for both observations and functions of them. Our algorithms are copied directly from implementations in the S language. Most of the S functions that we use are self-explanatory, but a detailed explanation appears in Appendix 1. Only two things need explanation here. The function c() (concatenate) makes its arguments into a vector. Also, many S functions take a vector argument. It is convenient that subscripts are not used in S; indices are shown by using square brackets. Thus a vector x of length 3 has elements x[1], x[2], x[3]. This notation makes it easy to write complicated expressions as indices.

3. Two naive methods, and a new idea

Recall that the observed samples are x = c(x[1], ..., x[m]) and z = c(z[1], ..., z[n]). If we have m = n, a first suggestion is to sort x and z, forming sortx and sortz, and to form

    yhat ← sortz − sortx

(i.e. yhat[i] = sortz[i] - sortx[i]). If the distributions of X and Z are Normal with variances $\sigma^2$ and $\tau^2$ respectively, so that what we want is an estimate of a Normal distribution with variance $\tau^2 - \sigma^2$, this method produces an estimate of a Normal distribution with the correct mean but with variance $(\tau - \sigma)^2$ (because the sorted vectors are perfectly correlated), which is too small. The method is not consistent as $n \to \infty$.

Another approach, still assuming m = n, is to put both x and z into random orders and to compute the vector of differences z-x. Again, this does not work; this gives an estimate of the distribution of $X + Y - X'$, where $X'$ is an independent copy of X. In the Normal case described above, this method gives an estimate of a Normal distribution with variance $\tau^2 + \sigma^2$ instead of $\tau^2 - \sigma^2$.

The new idea is that a useful estimate could be obtained if we knew the “right” order in which to take the zs before subtracting the xs; and we can estimate an appropriate order by a simulation. Here is a first version of how this would work, assuming m = n. Suppose we have a first estimate of $F_Y$, represented by a vector of values oldy = c(oldy[1], ..., oldy[n]).
We choose a random permutation rperm of (1, ..., n), and put the elements of oldy into this order. We add the xs to give a vector w, where

    w ← x + oldy[rperm]

We record the ranks of the elements of this vector. We put the elements of z into this same order and subtract the xs. Thus

    newy ← sort(z)[rank(w)] − x

We can repeat this operation as many times as we like. We will attempt an explanation of why this might be expected to work below.

An example is shown in Figure 1. Here the sample size is n = 100, and both X and Y are standard Normal. We generated pseudo-random samples z0 = x0 + y0 and x1, placed these in sorted order (sortz0 = sort(z0) and sortx1 = sort(x1)), and started the algorithm by taking y1 = sort(sortz0 - sortx1). Successive versions of y were obtained using the iteration

    newy ← sort(sortz0[rank(sortx1 + oldy[rank(runif(n))])] − sortx1).

Note that rank(runif(n)) is a random permutation of (1, ..., n). We ran the iteration for 100 steps. Figure 1 shows the first nine y vectors, each sorted into increasing order, plotted against standard normal quantiles. Also shown is the straight line that corresponds to a normal distribution with mean mean(z0) - mean(x1) and variance var(z0) - var(x1). Figure 2 shows iterations 81:100.
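The iteration above can be transcribed outside S. The following Python sketch is one possible translation and not the author's implementation: the helper names rank, deconv_step and deconvolve are hypothetical, the samples are assumed to have equal length and no ties, and random.Random.shuffle stands in for oldy[rank(runif(n))].

```python
import random


def rank(v):
    """Ranks 1..n of the elements of v, like S rank() when there are no ties."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for position, i in enumerate(order, start=1):
        r[i] = position
    return r


def deconv_step(sortx, sortz, oldy, rng):
    """One step: newy <- sort(sortz[rank(sortx + oldy[rperm])] - sortx)."""
    shuffled = list(oldy)
    rng.shuffle(shuffled)                        # oldy[rperm]
    w = [a + b for a, b in zip(sortx, shuffled)]
    r = rank(w)
    # S indices start at 1, Python indices at 0, hence r[i] - 1.
    return sorted(sortz[r[i] - 1] - sortx[i] for i in range(len(sortx)))


def deconvolve(x, z, steps, seed=0):
    """Return the list of sorted y vectors, starting from sort(sortz - sortx)."""
    rng = random.Random(seed)
    sortx, sortz = sorted(x), sorted(z)
    y = sorted(zi - xi for zi, xi in zip(sortz, sortx))  # first naive method
    history = [y]
    for _ in range(steps):
        y = deconv_step(sortx, sortz, y, rng)
        history.append(y)
    return history
```

Calling deconvolve(x1, z0, steps=100) on two equal-length lists would mimic the run described in the example. Every iterate is of the form z[perm] − x, so its sum always equals sum(z) − sum(x), which gives a quick sanity check on any transcription.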
Fig 1. QQplots of the first nine iterations for the Normal example.
Figure 3 shows values of a distance index d, which is the sum of absolute vertical deviations between this line and the estimate y. The algorithm appears to be stable, meaning that in repeated applications of the algorithm, the estimates stay close together. The initial transient takes no more than four iterations. The average value of the distance d over iterations 5:100 is 19.69. Also shown (with plotting character “o”) are comparable values of d for random samples from a normal distribution with the same mean and variance as this fitted normal distribution. The average value of these is 14.74. If we average the y vectors over iterations 5:100, we get a vector whose distance from this fitted normal distribution is d = 17.82. The average distance between the iterates and their average is d = 8.95. Thus the average distance between the iterates and their average is smaller than the average distance between random normal samples and the population line. The iteration seems to be giving good estimates of Y.

Why should this be so? Here is an argument to support this expectation. Suppose z = x+y; these vectors are realizations of random variables X, Y, Z. We cannot observe any of x, y, z but can see sortz = sort(z) and an independent realization of X, namely x1. How can we define an estimate of y? Since X and Y are independent (by assumption), if rperm is a random permutation, then zhat = sort(x) + sort(y)[rperm] is a realization of Z, sorted according to sort(x). To retrieve y we simply subtract sort(x) from zhat. If n is large, we expect zhat to be close to z, and sort(x1) to be close to sort(x). Thus we expect that putting z into the same order as zhat will make z approximately equal to zhat; and subtracting sort(x1) from this will approximately retrieve y.

This argument does not explain why the iteration should converge when it is started with y0 remote from the correct value. We do not yet have an explanation of this.
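The distance index d can be made concrete with a small Python sketch. This is an illustration under stated assumptions, not the computation used for Figure 3: the text does not say which plotting positions define the standard normal quantiles, so (i − 0.5)/n is assumed here, and the function names fitted_line and distance_index are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def fitted_line(z0, x1):
    """Mean and standard deviation of the fitted normal distribution:
    mean(z0) - mean(x1) and sqrt(var(z0) - var(x1))."""
    var_diff = variance(z0) - variance(x1)
    if var_diff <= 0:
        # Can happen in small samples; the fitted line is then undefined.
        raise ValueError("var(z0) - var(x1) is not positive")
    return mean(z0) - mean(x1), sqrt(var_diff)


def distance_index(y, center, sd):
    """Sum of absolute vertical deviations between sorted y and the line
    center + sd * q, where q are standard normal quantiles.
    Assumed plotting positions: (i - 0.5) / n."""
    n = len(y)
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return sum(abs(yi - (center + sd * q)) for yi, q in zip(sorted(y), quantiles))
```

With this definition, a vector lying exactly on the line has d = 0, and shifting every element by a constant c gives d = n|c|, which helps in judging the scale of the values quoted above for n = 100.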
Fig 2. QQplots of iterations 81:100 for the Normal example.
4. Questions

Several questions come to mind immediately. Is this algorithm always stable? Is the algorithm consistent, meaning that as $n \to \infty$, the empirical c.d.f. of y converges in probability to $F_Y$? I thank a referee for reminding me that $F_Y$ may not be unique. What happens when it is not?

To approach these questions, we point out that in the algorithm we have described, the possible values of the vector y are all of the form z[perm] − x where perm is a permutation of (1, ..., n). Thus in repeated applications y executes a random walk on the n! possible values of this vector. This random walk will have a stationary distribution, which may not concentrate on a single state (this seems to be the usual case). Some states may be transient. Thus the most we can hope for is that this stationary distribution is close to $F_Y$ in some sense. Clearly we need a proof that as $n \to \infty$ this stationary distribution converges (in some sense) to a distribution that is $F_Y$ whenever this is identifiable. Also it would be very pleasant to understand the distribution of the discrepancy measure d when y is drawn from the stationary distribution. As yet we do not have these results, but empirical evidence strongly suggests that the convergence result holds universally, and that useful estimates are obtained in all cases. However the dispersion among successive realizations of y is an over-optimistic estimate of the precision of the estimate of $F_Y$.

Detailed analysis of the stationary distribution seems out of reach. Even with m = n = 3, 924 different configurations of x and z need to be considered. There are 208 distinct stationary distributions. See Appendix 2.

We suggest that in practice we need to ignore an initial transient, and that the dispersion among successive realizations of y is an over-optimistic estimate of the precision of the estimate of $F_Y$. We need to consider how to handle boundary conditions, for example (as in the motivating example) that all values of Y are positive. The algorithm as stated need not generate vectors y that satisfy such conditions. Also, we question how the algorithm will perform when there are remote outliers in either or both z0 and x1. Since these samples are assumed to be independent of one another, there is no reason to hope that subtracting an x1 outlier from a z0 outlier will make any sense. We study these questions in Section 6 below.

Fig 3. The index d for the first 100 iterations, with values for random normal samples.

5. Variations

Several variations on the basic idea are as follows.
(a) Instead of using the actual data (x[1],...,x[n]) use a sample from an estimate of $F_X$, for example a bootstrap sample from the observed x.
(b) To add some smoothness to the algorithm, at each iteration replace x by x + ξ and/or y by y + η, where ξ and η are vectors of small Gaussian perturbations. We can use the same perturbations throughout, or we can use independent perturbations at each step of the algorithm.
(c) Similarly we can (independently) smooth z by adding ζ. If we arrange that var ζ = var ξ + var η, these smoothings should not introduce any bias into the estimate of $F_Y$, because X + ξ + Y + η is distributed like Z + ζ. Of course the efficiency of the method will degrade if the variance of ζ becomes large (unless each of X, Y, Z is Gaussian).
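Variations (b) and (c) can be sketched in Python as follows. This is only one reading of the text, with independent perturbations drawn afresh at every call; the names zeta_sd and smoothed_step and the standard-deviation parameters sd_xi and sd_eta are hypothetical. The only constraint taken from the text is var ζ = var ξ + var η.

```python
import math
import random


def zeta_sd(sd_xi, sd_eta):
    """Standard deviation of zeta such that var(zeta) = var(xi) + var(eta)."""
    return math.hypot(sd_xi, sd_eta)


def smoothed_step(x, z, oldy, sd_xi, sd_eta, rng):
    """One iteration with Gaussian smoothing of x, y and z (equal lengths)."""
    n = len(x)
    sd_zeta = zeta_sd(sd_xi, sd_eta)
    xs = sorted(v + rng.gauss(0.0, sd_xi) for v in x)      # sort(x + xi)
    zs = sorted(v + rng.gauss(0.0, sd_zeta) for v in z)    # sort(z + zeta)
    ys = [v + rng.gauss(0.0, sd_eta) for v in oldy]        # y + eta
    rng.shuffle(ys)                                        # random order
    w = [a + b for a, b in zip(xs, ys)]
    # order[k] is the index of the (k+1)-th smallest element of w,
    # so position order[k] receives the (k+1)-th smallest smoothed z.
    order = sorted(range(n), key=lambda i: w[i])
    newy = [0.0] * n
    for k, i in enumerate(order):
        newy[i] = zs[k] - xs[i]
    return sorted(newy)
```

Setting both standard deviations to zero recovers the unsmoothed iteration, which is a convenient check. Reusing one fixed set of perturbations throughout, the other option mentioned in (b), would require drawing ξ, η and ζ once outside the function.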
If m and n are not equal, to apply the algorithm we need to generate equal numbers of values of x and z. We can do this either by
(d) creating vectors of some length N by bootstrapping from the observed x and z (N could be very large, so that we are effectively regarding x and z as defining empirical distributions),
(e) if m > n, by taking z with a random sample (without replacement) from x; or similarly sampling z if n > m; or
(f) if m ≫ n, repeat z to fill out m values.
(g) In generating w we can use a bootstrap sample from y, possibly smoothed as above.

To achieve stability in the estimate of $F_Y$, we can
(h) apply the algorithm a moderate number of times, k say, and average the resulting sorted y vectors; or
(i) concatenate successive y vectors to form a pooled estimate of $F_Y$; if we do this we can at each stage
(j) generate w by sampling from this pooled estimate.

It is not clear how to generalize the idea to deal with multivariate observations.

6. Boundary conditions, and outliers

If some bound on Y is known a priori, for example if it is known (as in the motivating problem) that Y > 0, we need to decide what to do if the algorithm produces one or more negative values in y. Some possibilities in this case are:
(k) At each iteration, round negative values of yhat up to zero.
(l) At each iteration, replace negative values by randomly sampling from the positive ones.
(m) At each iteration, replace negative values in yhat by copies of the smallest values among the positive ones.
(n) At each iteration, reject a random permutation if it leads to offending values; draw further permutations until one is obtained that satisfies the positivity conditions.
(o) At each iteration, adjust the permutation by changing (at random) a few elements (as few as possible?) in such a way as to meet the conditions.

Our experience so far suggests that none of these proposals works very well. Proposals (n) and (o) are excessively tedious, and have been tried only in very small examples. At this point we recommend another strategy, namely
(p) Replace negative values in yhat by their absolute values.

We investigated two of these proposals as follows. We generated 100 pseudorandom exponential variates x0, and added a similar (independent) vector y0 to form the observed vector z0. We assumed that an independent vector x1 was also observed. We ran the iteration in three ways:
(q) no adjustment;
(l) replace negative values by a random sample from the positive values;
(p) remove negative values of yhat, replacing them by their absolute values. This can be done in S by an application of the abs function:

    newy ← sort(abs(sortz0[rank(sortx1 + oldy[rank(runif(n))])] − sortx1)).

All three methods performed similarly for values of yhat greater than 0.25. Figure 4 shows the lowest 20 values of yhat for the first nine iterations, plotted against standard exponential quantiles, for each of these three methods, together with the line through the origin with slope mean(z0) - mean(x1). We see that the naive method (q) produces a large number of negative values; method (l) avoids this but seems to introduce a positive bias; method (p) works well. Figure 5 shows the number of negative elements in yhat (before adjustment) for the three methods. The average numbers over the first 100 iterations are (q) 4.44, (l) 3.35, (p) 2.68. We have no understanding why the “absolute values” method works as well as it seems to.

Fig 4. The lowest 20 elements of the first nine iterations for each of three methods: Top:(q), Middle:(l), Bottom:(p).

Fig 5. The number of negative elements (before adjustment) for each of three methods: Top:(q), Middle:(l), Bottom:(p).

We have run similar trials for the case where both X and Y are uniform on (0, 1), so that Z has a triangular density supported on (0, 2). Here for method (p) we need to reflect values above y=1 to lie in (0, 1). Again, method (p) seems to be better than (q) and (l).

We have also investigated the performance of the “absolute values” method when there is a positivity condition and outliers are present. We find that the non-outlying part of the distribution is estimated satisfactorily. We took x0, x1, and y0 each to contain 95 samples from a standard exponential distribution (with mean 1), and 5 samples from an exponential distribution with mean 100. Figure 6 shows the first four iterations of our basic algorithm, using the option (p) to adjust negative estimates, plotted against sort(y0). Figure 7 expands the lower corner of this plot, with the line through the origin of unit slope. The iterates seem to be staying close to this line.

Fig 6. Four iterates of the basic algorithm, using option (p) to handle the positivity condition, when outliers are present.

Fig 7. Expansion of the lower corner of Figure 6.

At this point our recommendation (if m = n and the variables are continuous, so that there are no ties in the computed values) is to use the original method, i.e. do not bootstrap or smooth or use (j). If the variables are lattice-valued, for example integer-valued, it seems to help to add small random perturbations to x and y[rperm] at each stage to break the ties randomly. It is not clear whether it is as good to simply add small perturbations once and for all. To handle the boundary and outlier problems, we recommend using the absolute-values method (p) above.

Appendix 1. The S language

In S the basic units of discourse are vectors; most functions take vector arguments. The elements of a vector x of length n are x[1], ..., x[n]. c() is the “concatenate” function, which creates a vector from its arguments. Thus x = c(x[1], ..., x[n]). If the elements of an m-vector s are drawn from 1, ..., n (possibly with repetitions), x[s] is the vector c(x[s[1]], ..., x[s[m]]). The function sort() rearranges the elements of its argument into increasing order; so if x = c(2,6,3,4), sort(x) is c(2,3,4,6). The function rank() returns the ranks of the elements of its argument; i.e. rank(x)[i] is the rank of x[i] in x. Thus if x = c(2,6,3,4), rank(x) is c(1,4,2,3). sort(x)[rank(x)] is just x. Another function we have used is runif(), which generates pseudo-random uniform variables drawn from the interval (0,1). Thus rank(runif(n)) is a random permutation of 1, ..., n. rnorm(n) generates n standard normals; rexp(n) generates n random exponentials. The function abs() replaces the elements of its argument by their absolute values.

Appendix 2. The case m=n=3

Without loss of generality we may assume x1 = c(0,x,1) and z0 = c(-a,0,b) with 0 < x < 1/2 and a and b positive. Examination of the 36 possible values of (z0[perm1]-x1)[perm2] + x1 shows that the stationary distribution will change whenever any of a, b and a+b crosses any of the values x, 2x, 1, 1+x, 1-x, 1-2x, 2, 2-x, 2-2x. For a general x in (0,1/2) these cut-lines divide the positive quadrant of the a,b plane into 154 regions. The configuration of these regions changes when x passes through the values (1/6, 1/5, 1/4, 1/3, 2/5). Thus we need to consider six representative values of x, perhaps x = c(10,22,27,35,44,54)/120, and for each of these values of x we have 154 regions, 924 regions in all. We computed the transition matrix of the random walk for each of these 924 cases, and found 208 different stationary distributions. One of these, where one state is absorbing and the other five transient, occurs 84 times. Ten distributions occur only once each. A similar calculation for m or n larger than 3 seems impractical.

Acknowledgments. Thanks to Lorraine Denby, for showing me the problem, and to Lingsong Zhang, who did some of the early simulations. Also to Jim Landwehr, Jon Bentley and Aiyou Chen for stimulating comments. Two referees contributed insightful remarks.

References

[1] Carroll, R. J. and Hall, P. (1988). Optimal rates of convergence for deconvolving a density. J. Amer. Statist. Assoc. 83 1184–1186.
[2] Fan, J. (1991). On the optimal rates of convergence for nonparametric deconvolution problems. Ann. Statist. 19 1257–1272.
[3] Fan, J. (2002). Wavelet deconvolution. IEEE Trans. Inform. Theory 48 734–747.
[4] Goldenshluger, A. (1999). On pointwise adaptive nonparametric deconvolution. Bernoulli 5 907–925.
[5] Jansson, P. A., ed. (1997). Deconvolution of Images and Spectra. Academic Press, New York.
[6] Liu, M. C. and Taylor, R. L. (1989). A consistent nonparametric density estimator for the deconvolution problem. Canad. J. Statist. 17 427–438.
[7] Mendelsohn, J. and Rice, J. (1982). Deconvolution of microfluorometric histograms with B-splines. J. Amer. Statist. Assoc. 77 748–753.
[8] Starck, J.-L. and Bijaoui, A. (1994). Filtering and deconvolution by the wavelet transform. IEEE Trans. Signal Processing 35 195–211.
[9] Stefanski, L. A. and Carroll, R. J. (1990). Deconvoluting kernel density estimators. Statistics 21 169–184.
[10] Zhang, C. H. (1990). Fourier methods for estimating mixing densities and distributions. Ann. Statist. 18 806–830.
[11] Zhang, H. S., Liu, X. J. and Chai, T. Y. (1997). A new method for optical deconvolution. IEEE Trans. Signal Processing 45 2596–2599.
IMS Lecture Notes–Monograph Series Complex Datasets and Inverse Problems: Tomography, Networks and Beyond Vol. 54 (2007) 12–23 © Institute of Mathematical Statistics, 2007 DOI: 10.1214/074921707000000030
An iterative tomogravity algorithm for the estimation of network traffic

Jiangang Fang1, Yehuda Vardi1,∗ and Cun-Hui Zhang1,†

Rutgers University

Abstract: This paper introduces an iterative tomogravity algorithm for the estimation of a network traffic matrix based on one snapshot observation of the link loads in the network. The proposed method does not require complete observation of the total load on individual edge links or proper tuning of a penalty parameter as existing methods do. Numerical results are presented to demonstrate that the iterative tomogravity method controls the estimation error well when the link data is fully observed and produces robust results with moderate amount of missing link data.
1. Introduction

This paper concerns the estimation of network traffic based on link data. The traffic matrix of a network, which gives the amount of source-to-destination (SD) flow, is an essential element in a wide range of network administration and engineering applications. However, in today’s fast growing communication networks, it is often impractical to directly measure network traffic matrices due to cost, network protocol and/or administrative constraints, while measurements of the total traffic passing through certain individual links are more readily available. Thus, the problem of estimating SD traffic based on link data, called network tomography [10], is of great interest to communications service providers. In the network tomographic model [10]

(1.1)    $y = Ax,$
where $y$ is a vector of traffic loads on links, $A = (a_{ij})$ is a known routing matrix with elements $a_{ij} = 1$ if link $i$ is in the path for the $j$-th pair of SD nodes and $a_{ij} = 0$ otherwise, and $x$ is the SD traffic flow as a vectorization of the traffic matrix. Here the routing protocol $A$ is fixed. In typical network applications, the number of links (edges) is of the same order as the number of nodes (vertices) in the network graph, while the number of SD pairs is of the order of the square of the number of nodes. Thus, $\dim(y) \ll \dim(x)$ and the network tomographic model (1.1) is ill-posed. Vardi [10] identified the ill-posedness of (1.1) as the main difficulty of network tomography and proposed to estimate the expected traffic flow based on independent copies of $y$ by modeling the variance of $y$. The problem has since been considered by many research groups. See Vanderbei and Iannone [9] for MLE/EM, Cao et al. [1, 2] for MLE/EM in the model $x_j \sim N(\lambda_j, \phi\lambda_j^c)$ and non-stationarity issues, Medina et al. [8], Liang and Yu [6] for a more scalable pseudo-likelihood, Liang et al. [7] for additional direct observations of flow data for selected SD pairs, and Coates et al. [4] and Castro et al. [3] for surveys with additional references. In general, these methods require observations of multiple copies of $y$.

∗ Research partially supported by National Science Foundation Grant DMS-0405202.
† Research partially supported by National Science Foundation Grants DMS-0405202, DMS-0504387 and DMS-0604571.
1 Department of Statistics, Hill Center, Busch Campus, Rutgers University, Piscataway, New Jersey 08854, USA, e-mail: [email protected]; [email protected]; [email protected]
AMS 2000 subject classifications: 62P30, 62H12, 62G05, 62F10.
Keywords and phrases: network traffic flow, network tomography, Kullback-Leibler distance, network gravity model, regularized estimation.

An interesting and noticeable development in the area is the introduction of (tomo)gravity algorithms and related methods based on a single snapshot of the network, i.e. one copy of $y$. Zhang et al. [11] observed that in certain communications networks (e.g. a backbone network where each node represents a PoP, or point of presence), almost all the traffic flow is generated by and destined to a known set of edge nodes which do not serve as intermediate nodes in any SD paths. Thus, each SD path begins with a source edge node, traverses through an inbound edge link, an inner network, and then an outbound edge link to a destination edge node. Under this assumption, the total inbound flow $N^{(in)}_s$ from a source node $s$ is the sum of the loads over all the inbound edge links from $s$ and the total outbound flow $N^{(out)}_d$ to a destination node $d$ is the sum of the loads over all the outbound edge links to $d$. The edge nodes communicate with each other through an inner network with a directed graph composed of inner nodes and links and a routing protocol, but the inner nodes do not generate or receive traffic. Moreover, Zhang et al. [11] observed that for each fixed source node $s$, the distribution of the inbound traffic $N^{(in)}_s$ from $s$ to different destinations $d$ is approximately proportional to the total outbound loads $N^{(out)}_d$ these destinations receive.
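As a rough illustration of the proportional allocation just described, the following Python sketch splits each source's inbound total across destinations in proportion to the destinations' outbound totals. The function name `gravity_estimate` and the three-node loads are hypothetical and chosen only for illustration; the sketch also assumes that the inbound and outbound totals agree.

```python
def gravity_estimate(inbound, outbound):
    """Proportional allocation: x[s][d] = inbound[s] * outbound[d] / N.

    inbound[s]  -- total load entering the network from source s
    outbound[d] -- total load leaving the network to destination d
    Both vectors are assumed to sum to the same total flow N.
    """
    total = sum(inbound)
    if abs(total - sum(outbound)) > 1e-9 * max(total, 1.0):
        raise ValueError("inbound and outbound totals must agree")
    return [[n_in * n_out / total for n_out in outbound] for n_in in inbound]


# Hypothetical loads for a three-node network (illustrative numbers only).
inbound = [60.0, 30.0, 10.0]
outbound = [50.0, 25.0, 25.0]
x = gravity_estimate(inbound, outbound)

# Row sums reproduce the inbound totals, column sums the outbound totals.
row_sums = [sum(row) for row in x]
col_sums = [sum(x[s][d] for s in range(len(inbound)))
            for d in range(len(outbound))]
```

By construction the resulting matrix has rank one, so it is determined by its row and column totals alone.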
Formally, this is called the gravity model and can be written in Vardi’s [10] vectorization as

(1.2)    $\tilde{x}_j = \frac{N^{(in)}_{s_j} N^{(out)}_{d_j}}{N}, \qquad N = \sum_s N^{(in)}_s = \sum_d N^{(out)}_d,$

where $s_j$ and $d_j$ are respectively the source and destination nodes for the $j$-th SD pair, $\tilde{x}_j$ is the corresponding component of the simple gravity solution $\tilde{x}$ as an approximation of the vector $x$ in (1.1), and $N$ is the total flow. The gravity model is best described as

(1.3)    $\tilde{x}_{sd} = N^{(in)}_s N^{(out)}_d / N$

with a slight abuse of notation, where $\tilde{x}_{sd}$ is the traffic flow from source $s$ to destination $d$ in the gravity model, i.e. $\tilde{x}_j = \tilde{x}_{s_j d_j}$. Here, the relationship between the link data $y$ and the SD traffic flow $x$ is still governed by the tomographic model (1.1). Due to the additional information provided in the gravity model (1.2) about the nature of the SD traffic $x$, the number of unknowns in $x$ is square rooted. Thus, the ill-posedness of (1.1) is greatly alleviated. In particular, if all link loads are observed, the total inbound flow $N^{(in)}_s$ and outbound flow $N^{(out)}_d$ for individual edge nodes and thus the total traffic $N$ are all available network statistics in the gravity model. In addition to the gravity solution $\tilde{x}$ in (1.2), Zhang et al. [11] developed the simple tomogravity solution

(1.4)    $\arg\min_x \{ \|x - \tilde{x}\| : Ax = y \}$
and more general tomogravity solutions when the edge nodes are further classified as “access” or “peering”, while Zhang et al. [12] developed entropy regularized
tomogravity solution as

(1.5)    $\arg\min_x \|y - Ax\|^2 + \phi N^2 K(x/N, \tilde{x}/N),$

where $K(\cdot, \cdot)$ is the Kullback-Leibler information and $\phi$ is a tuning parameter for the penalty level. These tomogravity solutions require complete knowledge of the total inbound and outbound flow, i.e. $N^{(in)}_s$ and $N^{(out)}_d$, for all individual source and destination nodes. They perform reasonably well when such information is available and have been implemented in certain AT&T commercial networks.

In this paper, we propose an iterative tomogravity (ITG) algorithm which alternately seeks estimates as local optimal solutions in the tomographic space (1.1) and a gravity space of network traffic flow $x$. Our algorithm, described in Section 2, is based on a single snapshot of the link data and does not require the full knowledge of the total inbound and outbound flow for all individual edge nodes. The idea is to use the gravity space, instead of the specific simple gravity solution (1.2), to regularize the network tomography problem (1.1). In Section 3, we present the results of a real-data experiment to demonstrate that the ITG method is competitive compared with other tomogravity algorithms when the complete link data is available and robust when a moderate amount of link data is missing.

2. An iterative tomogravity algorithm

In a general network tomographic model, the observed link data, as a sub-vector $y^*$ of the vector $y$ in (1.1), satisfies

(2.1)    $y^* = A^* x,$
where the matrix $A^* = (a^*_{ij})$ is composed of the rows of the routing matrix $A$ in (1.1) corresponding to the observed links, and $x$ is the SD traffic flow as in (1.1). Let $J$ be the total number of SD-pairs of concern. For the observation $y^*$, the tomographic space of probability vectors is

(2.2)    $T^* = \{ f \in \mathbb{R}^J : y^* \propto A^* f,\ f \ge 0,\ 1^T f = 1 \},$
where 1 is the vector composed of 1’s and $v^T$ denotes the transpose of a vector $v$. Here and in the sequel, inequalities are applied to all components of vectors.

In the literature, different types of flow and load are often specifically denoted. Let $y^{(net)}$ be the link loads of the inner network, $y^{(edge)}$ the loads on the links between the edge nodes and inner network, $y^{(self)}$ the load on the links from the edge nodes to themselves, $x^{(net)}$ the traffic flow between distinct edge nodes (necessarily through the inner network), and $x^{(self)}$ the flow of the edge nodes to themselves. Since $x^{(self)}$ does not go through the inner network and the flow from an edge node to itself is the same as the load on the corresponding self-link, the tomographic model can be written as

(2.3)    $y = \begin{pmatrix} y^{(net)} \\ y^{(edge)} \\ y^{(self)} \end{pmatrix} = \begin{pmatrix} A^{(net)} & 0 \\ A^{(edge)} & 0 \\ 0 & I^{(self)} \end{pmatrix} \begin{pmatrix} x^{(net)} \\ x^{(self)} \end{pmatrix} = Ax,$

with $I^{(self)}$ being the identity matrix giving $y^{(self)} = x^{(self)}$, provided that the inner network does not generate traffic. This is a special case of Vardi’s [10] tomographic
model (1.1) describing decompositions of the SD traffic $x$ and link load $y$, but (1.1) can be also viewed as $y^{(net)} = A^{(net)} x^{(net)}$. In this paper, the observed $y^*$ in (2.1) is a general sub-vector of the $y$ in (2.3) to allow partial observation of $y^{(edge)}$ and networks without $y^{(self)}$ and $x^{(self)}$.

Suppose throughout the sequel that the list of the SD-pairs $(s_j, d_j)$, $j = 1, \ldots, J$, forms a product set composed of all the pairings from a set $S$ of source nodes to a set $D$ of destination nodes ($D = S$ allowed), so that $J = |S||D|$, where $|C|$ is the size of a set $C$. This gives a one-to-one mapping between $\mathbb{R}^J$ and the space of all $|S| \times |D|$ matrices:

$v = (v_1, \ldots, v_J)^T \sim (v_{sd})_{|S| \times |D|}, \qquad v_j = v_{s_j d_j}.$

In this notation, the gravity space of probability vectors is

(2.4)    $G = \{ g \in \mathbb{R}^J : g \sim (g_{sd})_{|S| \times |D|} = p q^T,\ g \ge 0,\ 1^T g = 1 \},$
i.e. $g_{sd} = p_s q_d$ or matrices of rank 1, where $p \in \mathbb{R}^{|S|}$ and $q \in \mathbb{R}^{|D|}$.

Zhang et al. [11] proposed (1.2) as the simple gravity algorithm and (1.4) as the simple tomogravity algorithm. Zhang et al. [12] proposed (1.5) as the entropy-regularized tomogravity algorithm. Their basic ideas can be summarized as follows: (i) The gravity model gives a rough approximation of the SD flow; (ii) When the simple gravity solution (1.2) is available, it can be used to regularize Vardi’s tomographic model (1.1). Motivated by their work, we propose the following algorithm which provides estimates of the SD flow $x$ in (2.1).

Iterative tomogravity algorithm (ITG):

(2.5)    Initialization:  $g = 1/J$
(2.6)    Iteration:       $f^{(new)} = \arg\min \{ K(f, g^{(old)}) : f \in T^* \}$
(2.7)                     $g^{(new)} = \arg\min \{ K(f^{(new)}, g) : g \in G \}$
(2.8)    Finalization:    $\hat{N} = 1^T y^* / (1^T A^* f^{(fin)})$
(2.9)                     $\hat{x} = \hat{N} f^{(fin)}$

where $K(f, g)$ is the Kullback-Leibler information defined as

(2.10)    $K(f, g) = \sum_{j=1}^{J} f_j \log \frac{f_j}{g_j}.$
As mentioned earlier, our basic idea is to use the gravity space (2.4), instead of the simple gravity solution (1.2), to regularize the tomographic model (2.2). A main advantage of this approach is that it does not require the knowledge of the simple gravity solution or equivalently, the complete observation of loads on all edge links. Numerical results in Section 3 demonstrate that when the complete link data y is observed, the ITG (2.9) and the entropy-regularized tomogravity (1.5) perform comparably in terms of estimation error, and they both outperform the simple gravity (1.2) and tomogravity (1.4). Moreover, the ITG without using the knowledge of the “access” or “peering” status of links has similar performance compared with the generalized tomogravity method [11] which requires such knowledge. We note that the ITG method does not need a tuning parameter as (1.5) does.
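The gravity step (2.7) can be illustrated with a small sketch. It relies on a standard fact: over rank-one probability matrices $g = p q^T$, the Kullback-Leibler information $K(f, g)$ is minimized by taking $p$ and $q$ to be the row and column marginals of $f$. The Python below is a minimal illustration under that fact, not the authors' implementation; the function names and the 2 × 2 matrix are hypothetical, and `kl` assumes $g_j > 0$ wherever $f_j > 0$.

```python
import math


def kl(f, g):
    """Kullback-Leibler information K(f, g) = sum_j f_j log(f_j / g_j).

    Terms with f_j = 0 contribute zero; g_j > 0 is assumed when f_j > 0.
    """
    return sum(fj * math.log(fj / gj) for fj, gj in zip(f, g) if fj > 0)


def gravity_projection(f_matrix):
    """Rank-one matrix closest to f in K(f, g): the product of its marginals."""
    p = [sum(row) for row in f_matrix]          # source (row) marginals
    q = [sum(col) for col in zip(*f_matrix)]    # destination (column) marginals
    return [[ps * qd for qd in q] for ps in p]


def flatten(matrix):
    """Vectorize a matrix row by row."""
    return [v for row in matrix for v in row]


# Hypothetical 2 x 2 probability matrix (rows: sources, columns: destinations).
f = [[0.40, 0.10],
     [0.20, 0.30]]
g_star = gravity_projection(f)

# Another rank-one probability matrix, for comparison.
p_alt, q_alt = [0.6, 0.4], [0.5, 0.5]
g_alt = [[ps * qd for qd in q_alt] for ps in p_alt]

k_star = kl(flatten(f), flatten(g_star))
k_alt = kl(flatten(f), flatten(g_alt))
```

In this example `k_star` is smaller than `k_alt`, as expected when `g_star` is the minimizer over the gravity space. The tomographic step (2.6) is harder because it carries linear constraints, which is why a separate relaxation algorithm is used for it.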
A main difference between ITG (2.9) and the simple tomogravity (1.4) is that the simple gravity solution $\tilde{x}$ in (1.2) is not explicitly used in ITG, since $g$ is treated as an unknown in the ITG algorithm. However, the information in the observed portions of $y^{(edge)}$ and $y^{(self)}$ is still utilized in the ITG iterations through the tomographic space (2.2), instead of directly computing $\tilde{x}$ from $y^{(edge)}$ and $y^{(self)}$ as in (1.3). If the simple gravity $\tilde{x}$ (or an approximation of it if $\tilde{x}$ is not fully available) is used as the initialization for ITG, the simple tomogravity solution is the result of a single ITG iteration. We may also treat $g = \tilde{x}/N$ as an unknown in (1.5), cf. Section 4, but then a tuning parameter is still required.

We use the relaxation algorithm of Krupp (1979) to compute (2.6) of the ITG, while (2.7) is explicit with

$g^{(new)}_{sd} = \Big( \sum_{d'} f^{(new)}_{sd'} \Big) \Big( \sum_{s'} f^{(new)}_{s'd} \Big)$

as in (1.3). Here is a full description of the relaxation algorithm. Let $g^{(old)} = (g^{(old)}_1, \ldots, g^{(old)}_J)^T$ be a given probability vector. The problem is to minimize $K(f, g^{(old)})$ under the linear constraints in (2.2). Since $y^*_i = 0$ implies $f_j = 0$ for all $j$ with $a^*_{ij} = 1$ and thus reduces the optimization problem to a subset of $j$, we assume $y^* = (y^*_1, \ldots, y^*_r) > 0$ where $r$ is the total number of links with observed load. Define

$h_{ij} = \begin{cases} a^*_{ij}/y^*_i - a^*_{rj}/y^*_r, & i = 1, \ldots, r-1, \\ 1, & i = r. \end{cases}$

The linear constraints $y^* \propto A^* f$ and $1^T f = 1$ for the tomographic space (2.2) can be written as $Hf = (0^T, 1)^T$, where $H = (h_{ij})$. Krupp’s [5] relaxation algorithm maximizes

(2.11)    $v_r - \sum_{j=1}^{J} g^{(old)}_j \exp\Big( \sum_{i=1}^{r} h_{ij} v_i - 1 \Big)$

over all vectors $v = (v_1, \ldots, v_r)^T$ and then sets

(2.12)    $f^{(new)}_j = g^{(old)}_j \exp\Big( \sum_{i=1}^{r} h_{ij} v_i - 1 \Big).$

As (2.11) is concave in $v$, its optimization is done by the Newton-Raphson method for individual components $v_i$, cycling through $i = 1, \ldots, r$. Since $h_{rj} = 1$ for all $j$, $f^{(new)}$ in (2.12) is properly normalized.

The iteration steps (2.6) and (2.7) are both monotone in $K(f, g)$, so that the ITG algorithm reaches a local minimum of the Kullback-Leibler information between the tomographic (2.2) and gravity (2.4) spaces. However, since $K(f, g)$ is not convex jointly in $(f, g)$ with $g$ in the gravity space, ITG is not guaranteed to converge to a global minimum.

3. An example

We conduct numerical experiments with data collected over the Abilene Network (an Internet2 high-performance backbone network in the United States) illustrated in
Fig 1. Abilene Network.
Figure 1, with 12 nodes, 144 total traffic pairs (132 SD pairs and 12 self pairs), 30 inner links, and 24 edge links. We collect the full 12 × 12 SD traffic matrices in 5 min intervals for 19 consecutive weeks in 2004. We randomly pick four different periods of 3 days and use the data in these four time periods. We call these four raw datasets X1, X2, X3, and X4. It turns out that the four datasets give different traffic patterns as the time periods cover different days of the week, cf. Figure 2. For each dataset and each hour, we compute $x$ as the hourly total SD flow and $y = Ax$ with a fixed routing matrix $A$ used in the Abilene data.

We compare four procedures using the complete data $y$ as $y^*$: the ITG (2.9), the simple tomogravity (STG) in (1.4), the generalized tomogravity (GTG) of Zhang et al. [11] utilizing the extra information of “access” or “peering” status of links, and the entropy regularized tomogravity (ERTG) in (1.5). Since the traffic flow for self pairs ($s = d$) is directly observable as the load on the self links, the ITG and STG estimate these components of $x$ without error. Thus, we measure the performance of all estimators by the relative total error for non-self SD pairs

(3.1)    $\sum_{s \ne d} |\hat{x}_{sd} - x_{sd}| \Big/ \sum_{s \ne d} x_{sd},$

where $x_{sd}$ is the flow from source $s$ to destination $d$. We compute the relative total error for (1.5) with various values of the tuning parameter $\phi$ and found that the performance of (1.5) is near the best in a wide neighborhood of $\phi = 10^{-3} = 0.001$. This confirms the results of Zhang et al. [12]. Thus, $\phi = 10^{-3} = 0.001$ is used for (1.5) in our experiment. We plot the relative total error (3.1) against hour for the four datasets in Figures 3, 4, 5, and 6. We tabulate the average relative error in Table 1. From the results of the experiments, we observed that the performance of the proposed ITG is comparable to the ERTG with the best choice of the tuning parameter and the GTG based on extra information, while all three outperform the STG.

Table 1
Average of relative total errors for 288 different hours (4 different 3-day periods) based on complete link data. The best tuning parameter is used for the ERTG, while extra information is used for the GTG.

Method                              Risk
Iterative Tomogravity (ITG)         0.3001
Entropy Regularized (ERTG)          0.2995
Simple Tomogravity (STG)            0.3139
Generalized Tomogravity (GTG)       0.3026

Fig 2. The total hourly traffic for the 4 non-overlapping 3 day periods.

Fig 3. Comparison of the error rate using different models, dataset X1.

Fig 4. Comparison of the error rate using different models, dataset X2.

Fig 5. Comparison of the error rate using different models, dataset X3.

Fig 6. Comparison of the error rate using different models, dataset X4.

We also examine the relative errors for different SD pairs as functions of the total traffic flow for the SD pairs. We compute the relative total error over 3-day time periods (3.2)
    $\sum_{t=1}^{t^*} |\hat{x}^{(t)}_{sd} - x^{(t)}_{sd}| \Big/ \sum_{t=1}^{t^*} x^{(t)}_{sd}$

for fixed SD pairs in individual datasets, where $t$ indicates time points with $t^* = 72$. We group the values of (3.2) for SD pairs in all datasets according to the total flow $\sum_{t=1}^{t^*} x^{(t)}_{sd}$ with the grid {0, 1/4, 1/2, 3/4, 1, 1.5, 2, 2.5, 3, 4, 5, 7} in the unit of $10^{10}$ packets, and tabulate in Table 2 the average of (3.2) within groups. From Table 2, we observe that the estimation error is essentially a decreasing function of the amount of traffic for individual SD pairs.

Finally, we check the robustness of the ITG (2.9) with missing link data (i.e. $y^*$ is a proper sub-vector of $y$). We focus on the case of missing data in edge links as the ITG is the only procedure among the four that does not require observations for all edge links. Let $k$ be the number of edge links with missing data. We use only
Table 2
Relative total errors over 72 hours for fixed SD pairs and 3-day periods, grouped according to the total flow. The relative total errors are decreasing functions of the flow for all 4 procedures.

Flow Level     # in Group    ITG       ERTG      STG       GTG
0 – 0.25       215           4.4799    5.8725    4.5833    5.3545
0.25 – 0.5     100           0.4320    0.4279    0.4548    0.4158
0.5 – 0.75     73            0.3457    0.3449    0.3666    0.3467
0.75 – 1       30            0.2997    0.2992    0.3379    0.2505
1 – 1.5        46            0.2286    0.2305    0.2402    0.2588
1.5 – 2        25            0.2878    0.2859    0.2934    0.3089
2 – 2.5        18            0.1836    0.1828    0.1802    0.2080
2.5 – 3        6             0.1583    0.1576    0.1689    0.1207
3 – 4          7             0.1143    0.1126    0.1335    0.1261
4 – 5          6             0.1456    0.1448    0.1514    0.1373
5 – 7          2             0.0887    0.0938    0.0767    0.0882
data for the first day in dataset X1 and compute the average of the relative total error for 10 random missing patterns for each given $k$. We plot this average against $k$ in Figure 7. From Figure 7, we find that the performance of the ITG method is robust against small or moderate amount of missing link data (up to 5 out of 24 edge links).

4. Discussion

We consider the estimation of SD traffic flow in a network based on observations of a snapshot of traffic loads on links. Based on the ideas of Vardi [10] and Zhang et al. [11, 12], we propose an iterative tomogravity method which allows incomplete observation of the link data. Our main idea is to use the gravity space (2.4), instead of the simple gravity solution (1.2), to regularize Vardi’s [10] tomographic model (1.1). A numerical study with a real-life dataset demonstrates that the proposed method has similar performance compared with the methods proposed in [11, 12] which demand complete observation of the link data. We discuss below a number of related issues.

There are two other possible ways of using the gravity space (2.4) to regularize (1.1) that we do not explore in this paper. The first is to use the ITG (2.9) instead of the simple gravity (1.2) in the penalty function in (1.5), resulting in

(4.1)    $\arg\min_x \|y^* - A^* x\|^2 + \phi \hat{N}^2 K(x/\hat{N}, g^{(fin)}/\hat{N})$

with the $\hat{N}$ in (2.8). The second is to alternate between the optimization in the gravity space and entropy-regularized solution, i.e. to replace (2.6) with

(4.2)    $N^{(new)} = 1^T y^* / (1^T A^* g^{(old)}),$
(4.3)    $f^{(new)} = \arg\min \{ \|y^* - N^{(new)} A^* f\|^2 + \phi \{N^{(new)}\}^2 K(f, g^{(old)}) : f \in T^* \}.$

A small numerical study seems to indicate that there is little difference between (4.1) and the ITG.

The proposed ITG (2.9) implicitly assumes that the measurement error in the tomographic model (2.1) is of smaller order than the bias representing the Kullback-Leibler distance $K(x/N, G^*)$ between $x/N$ and the gravity space (2.4). This seems
Fig 7. Relative total errors of the ITG versus the number of edge links with missing data. Average over 10 random missing patterns is used for each point in the plot. The ITG is robust against small or moderate amount of missing link data.
to be the case in our real-data experiments since ITG significantly improves upon the simple tomogravity (1.4) by formally reducing $K(x/N, \tilde{x}/N)$ to $K(x/N, G^*)$. In cases where the measurement error in the tomographic model is potentially of larger order than $K(x/N, G^*)$ [or $K(x/N, \tilde{x}/N)$] it would make sense to replace (2.6) by (4.2) and (4.3) in ITG [or to use (1.5)] with a proper tuning parameter $\phi$.

A possibility to further reduce the bias is to consider the mixed gravity model

(4.4)    $F_{mix} = \Big\{ f : f = \sum_{k=1}^{k^*} \pi_k f^{(k)},\ f^{(k)} \in G \Big\}.$

For example, we may compute a regularized mixed tomogravity solution

(4.5)    $\arg\min \Big\| y - N \sum_{k=1}^{k^*} \pi_k f^{(k)} \Big\| + N^2 \sum_{k=1}^{k^*} \phi_k K(f^{(k)}, g^{(k)})$

by alternately optimizing over $g^{(k)} \in G$, $f^{(k)}$, $k = 1, \ldots, k^*$ and the mixing vector $(\pi_1, \ldots, \pi_{k^*})^T$.

It seems that for a network with a fixed routing protocol, the ITG estimate $\hat{x}$ in (2.9) is a continuous map of $y^*$, so that $\hat{x} - Ex$ is asymptotically normal when $y^* - EA^* x$ is asymptotically normal with $Ex/N \in G$, as $N \to \infty$. Our simulation study in a small artificial network has demonstrated the validity of this asymptotic normality theorem for moderate sample sizes.

Estimation of traffic matrix based on link-load data alone is difficult as the estimation error is typically above 20%. More accurate results can be obtained if
additional information can be extracted from packets passing through routers. See for example Zhao, Kumar, Wang and Xu [13].

References

[1] Cao, J., Davis, D., Vander Wiel, S. and Yu, B. (2000). Time-varying network tomography: Router link data. J. Amer. Statist. Assoc. 95 1063–1075.
[2] Cao, J., Vander Wiel, S., Yu, B. and Zhu, Z. (2000). A scalable method for estimating network traffic matrices. Technical report, Bell Labs.
[3] Castro, R., Coates, M., Liang, G., Nowak, R. and Yu, B. (2004). Network tomography: Recent developments. Statist. Sci. 19 499–517.
[4] Coates, M. and Nowak, R. (2002). Sequential Monte Carlo inference of internal delays in nonstationary communication networks. IEEE Trans. Signal Process. 50 366–376.
[5] Krupp, R. S. (1979). Properties of Kruithof’s projection method. The Bell System Technical J. 58 517–538.
[6] Liang, G. and Yu, B. (2003). Maximum pseudo-likelihood estimation in network tomography. IEEE Trans. Signal Process. 51 243–253.
[7] Liang, G., Taft, N. and Yu, B. (2006). A fast lightweight approach to origin-destination IP traffic estimation using partial measurements. Special Issue of IEEE-IT and ACM Networks on Data Networks, January 2006.
[8] Medina, A., Taft, N., Salamatian, K., Bhattacharyya, S. and Diot, C. (2002). Traffic matrix estimation: Existing techniques compared and new directions. SIGCOMM, Pittsburgh, Aug. 2002.
[9] Vanderbei, R. J. and Iannone, J. (1994). An EM approach to OD matrix estimation. Technical Report SOR 94-04, Princeton Univ.
[10] Vardi, Y. (1996). Network tomography: Estimating source-destination traffic intensities from link data. J. Amer. Statist. Assoc. 91 365–377.
[11] Zhang, Y., Roughan, M., Duffield, N. and Greenberg, A. (2003). Fast accurate computation of large-scale IP traffic matrices from link loads. In ACM SIGMETRICS, San Diego, USA, June 2003.
[12] Zhang, Y., Roughan, M., Lund, C. and Donoho, D. (2003). An information-theoretic approach to traffic matrix estimation. In ACM SIGCOMM, Karlsruhe, Germany, August 2003.
[13] Zhao, Q., Kumar, A., Wang, J. and Xu, J. (2005). Data streaming algorithms for accurate and efficient measurement of traffic and flow matrices. ACM SIGMETRICS, Banff, Canada, June 2005.
IMS Lecture Notes–Monograph Series Complex Datasets and Inverse Problems: Tomography, Networks and Beyond Vol. 54 (2007) 24–44 In the Public Domain DOI: 10.1214/074921707000000049
Statistical inverse problems in active network tomography

Earl Lawrence1,∗, George Michailidis2,∗ and Vijayan N. Nair2,∗

Los Alamos National Laboratory and University of Michigan

Abstract: The analysis of computer and communication networks gives rise to some interesting inverse problems. This paper is concerned with active network tomography where the goal is to recover information about quality-of-service (QoS) parameters at the link level from aggregate data measured on end-to-end network paths. The estimation and monitoring of QoS parameters, such as loss rates and delays, are of considerable interest to network engineers and Internet service providers. The paper provides a review of the inverse problems and recent research on inference for loss rates and delay distributions. Some new results on parametric inference for delay distributions are also developed. In addition, a real application on Internet telephony is discussed.
1. The inverse problems

Consider a topology with a tree structure defined as follows: $T = \{V, E\}$ has a set of nodes $V$ and a set of links or edges $E$. Figure 1 shows two examples, a simple two-layer symmetric binary tree on the left and a more general four-layer tree on the right. Each member of $E$ is a directed link numbered after the node at its terminus. $V$ includes a (single) root node 0, a set of receiver or destination nodes $R$, and a set of internal nodes $I$. The internal nodes have a single incoming link and at least two outgoing links (children). The receiver nodes have a single incoming link but no children. For the tree on the right panel of Figure 1, R = {2, 3, 6, 8, 9, 10, 11, 12, 13, 14, 15} and I = {1, 4, 5, 7}.

All transmissions are sent from the root (or source) node to one or more of the receiver nodes. This generates independent observations $X_k$ at all links along the paths to those receiver nodes. Let $X$ denote this set of measurements. These data are not directly observable; rather we can collect only end-to-end data at the receiver nodes: $Y_r = f(X)$ for $r \in R$. The statistical inverse problem is to reconstruct the distributions of the link-level $X_k$’s from these path-level measurements.

Examples of $f(\cdot)$ are: $f(X) = \sum_{k \in P(0,r)} X_k$, $f(X) = \prod_{k \in P(0,r)} X_k$, $f(X) = \min_{k \in P(0,r)} X_k$, and $f(X) = \max_{k \in P(0,r)} X_k$, where $P(0, r)$ is the path between the root node 0 and the receiver node $r$. In this paper, we will be concerned only with the first two cases of $f(\cdot)$ above. To understand the statistical issues and challenges involved, let us examine some simple examples.

∗ The research was supported in part by NSF Grants CCR-0325571, DMS-0204247 and DMS-0505535.
1 Statistical Sciences Group, Los Alamos National Laboratory, Los Alamos NM 87545, USA, e-mail: [email protected]
2 Department of Statistics, University of Michigan, Ann Arbor MI 48109, USA, e-mail: [email protected]; [email protected]
AMS 2000 subject classifications: 62F10, 60G05, 62P30.
Keywords and phrases: Network tomography, internet, inverse problems, monitoring, nonlinear least squares.
Fig 1. Examples of tree network topologies. A binary two-layer tree is shown on the left panel and a general four-layer tree on the right panel. The path lengths from the root to nodes belonging to the same layer are the same.
Example 1. Consider the two-layer binary tree on the left panel of Figure 1, and suppose the $X_k$ are binary with $P(X_k = 1) = \alpha_k$, $k = 1, 2, 3$ for the three links. Further, the root node sends transmissions to the receiver nodes one at a time. Take $f(a, b) = ab$. Then, the observed data are $Y_{2j} = X_{1j} X_{2j}$ for transmission $j$ and $Y_{3m} = X_{1m} X_{3m}$ for transmission $m$. They are independent Bernoulli with probabilities $\alpha_1\alpha_2$ and $\alpha_1\alpha_3$, respectively. Suppose we send $M$ transmissions to receiver node 2 and $N$ transmissions to receiver node 3. Let $M_1$ and $N_1$ be the respective number of “ones”. Then, $M_1$ and $N_1$ are independent binomial random variables with success probabilities $\alpha_1\alpha_2$ and $\alpha_1\alpha_3$. From these data, we can estimate only $\alpha_1\alpha_2$ and $\alpha_1\alpha_3$. The individual link-level parameters $\alpha_1$, $\alpha_2$ and $\alpha_3$ cannot be fully recovered.

Example 2. Take the same two-layer binary tree with binary outcomes with $f(a, b) = ab$ as above. But now the root node sends transmissions to receiver nodes 2 and 3 simultaneously. In other words, the $m$-th transmission generates random variables $X_{1m}$, $X_{2m}$ and $X_{3m}$ on all of the links. We observe $Y_{2m} = X_{1m} X_{2m}$ and $Y_{3m} = X_{1m} X_{3m}$. The distinction from Example 1 is that the $X_{1m}$ is common to both $Y_{2m}$ and $Y_{3m}$. Now, each transmission has 4 possible outcomes: (1, 1), (1, 0), (0, 1), (0, 0) depending on whether the transmission reaches none, one, or both of the receiver nodes. If we send $N$ such transmissions to nodes 2 and 3 simultaneously, the result is a multinomial experiment with probabilities $\alpha_1\alpha_2\alpha_3$, $\alpha_1\alpha_2(1 - \alpha_3)$, $\alpha_1(1 - \alpha_2)\alpha_3$, and $1 - \alpha_1\alpha_2 - \alpha_1\alpha_3 + \alpha_1\alpha_2\alpha_3$ corresponding to the four outcomes. Let $N(i, j)$ denote the number of events with outcome $(i, j)$. Then, $E[N(1, 1)]/N = \alpha_1\alpha_2\alpha_3$, $E[N(1, 1) + N(1, 0)]/N = \alpha_1\alpha_2$, and $E[N(1, 1) + N(0, 1)]/N = \alpha_1\alpha_3$. It is easy to see that we can estimate all the three link-level parameters from these measurements. Thus, the data transmission scheme plays an important role in this type of
inverse problems.

Example 3. Again we have a two-layer binary tree but now $f(a, b) = a + b$. Then $Y_2 = X_1 + X_2$ and $Y_3 = X_1 + X_3$. Let $F_k$ be the distribution of the link-level random variables $X_k \in \mathbb{R}$, for $k = 1, 2, 3$. Assume, as in Example 2, that the root node sends transmissions simultaneously to both receivers. In this case, even with simultaneous transmission to both receivers, the link-level parameters are not always identifiable. Just take $X_k$ to be independent Normal($\mu_k$, 1), $k = 1, 2, 3$. Then $Y_2$ and $Y_3$ are bivariate normal with means $\mu_1 + \mu_2$ and $\mu_1 + \mu_3$, variance 2 and covariance 1 (correlation 1/2). One can see that the individual $\mu_k$ cannot be recovered from the joint distribution of $Y_2$ and $Y_3$. Additional assumptions on the distribution are needed in order to solve the inverse problem. We will revisit this issue.

Example 4. Consider now the more general tree on the right panel of Figure 1. Again, we send transmissions to all of the receiver nodes simultaneously. If the random variables are binary and $f(x_1, \ldots, x_p) = \prod_{j=1}^{p} x_j$, all the link-level parameters are identifiable. The same is true for a general $X_k$ with $f(x_1, \ldots, x_p) = \sum_{j=1}^{p} x_j$ under suitable conditions on the distribution of the $X_k$ (as discussed later in the paper). However, it may be “expensive” to send transmissions to all receiver nodes simultaneously. Instead, can we schedule transmissions to some judicious subsets of the receiver nodes at a time and combine the information appropriately to estimate all the link-level parameters? It is clear from Example 1 that it is not enough to send transmissions to one receiver node at a time. How should the transmission scheme be designed in order to estimate all the parameters? Are there some “good” schemes (according to some appropriate criteria)?

These examples are simple instances of issues that arise in the context of analyzing computer and communications networks and are collectively referred to as active network tomography.
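To make the identifiability claim in Example 2 concrete, here is a hedged Python sketch of a simple moment estimator. It uses only the three relations $P(Y_2 = 1) = \alpha_1\alpha_2$, $P(Y_3 = 1) = \alpha_1\alpha_3$ and $P(Y_2 = 1, Y_3 = 1) = \alpha_1\alpha_2\alpha_3$, which can be solved for the individual parameters. The function name and the counts are hypothetical, and this is an illustration rather than the estimator studied later in the paper.

```python
def link_success_estimates(n11, n10, n01, n00):
    """Moment estimates of (alpha1, alpha2, alpha3) from multicast counts.

    n_ij is the number of probes with outcome (Y2, Y3) = (i, j).
    """
    n = n11 + n10 + n01 + n00
    p2 = (n11 + n10) / n    # estimates alpha1 * alpha2
    p3 = (n11 + n01) / n    # estimates alpha1 * alpha3
    p23 = n11 / n           # estimates alpha1 * alpha2 * alpha3
    if p23 == 0:
        raise ValueError("no probe reached both receivers; estimates undefined")
    # Solve the three product relations for the individual parameters.
    return p2 * p3 / p23, p23 / p3, p23 / p2


# Hypothetical counts built from alpha = (0.9, 0.8, 0.5) with 1000 probes:
# P(1,1) = 0.36, P(1,0) = 0.36, P(0,1) = 0.09, P(0,0) = 0.19.
a1, a2, a3 = link_success_estimates(360, 360, 90, 190)
```

With unicast probing as in Example 1, the joint count of probes reaching both receivers is not available, so only the two products can be estimated.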
In the next section, we will describe the network application and the need for estimating quality-of-service (QoS) parameters such as loss rates and delays. Section 3 provides an overview of recent results in the literature on the design of transmission experiments and inference for loss rates and discrete delay distributions. A real application on data collected from the campus network at the University of North Carolina, Chapel Hill is used to illustrate some of the results. Section 4 develops some new results on parametric inference for delay distributions.

2. Active network tomography

The area of network tomography originated with the pioneering work of Vardi [14] where the term was first introduced. His work dealt with another type of inverse problem relating to origin-destination (OD) traffic matrix estimation. The OD information is important in network management, capacity planning, and provisioning. In this problem, one is interested in estimating the intensities of traffic flowing between the origin-destination pairs in the network. However, we cannot collect these data directly; rather, one places equipment at the individual nodes (routers/switches) and collects aggregate data on all traffic flowing through the nodes i ∈ V. The goal is to recover distributions of origin-destination traffic between all pairs of nodes in the network. There has been considerable work in this area, and a summary of the developments can be found in [3].

Active network tomography, on the other hand, is concerned with the “opposite” problem of estimating link-level information from end-to-end data. One sends test
Inverse problems in network tomography
probes (packets) from a source to one or more receiver nodes on the periphery of the network (active probing) and gets end-to-end path-level data on losses and delays. One then has to solve the inverse problem of reconstructing link-level loss and delay information from the end-to-end data. The specific goal is to estimate QoS parameters such as loss rates and delays at the link level.

The reason for probing the network from the outside is that Internet service providers or other interested parties often do not have access to the internal nodes of the network (which may be owned by a third party). Nevertheless, they have to assess QoS of the links over which they are providing service. Active tomography offers a convenient approach by probing the network from nodes located on the periphery. The probing and data collection are done with dedicated instruments at the root node and receiver nodes. These packets can be sent to one receiver at a time (unicast transmission scheme) or to a specified subset of receivers (multicast scheme). Some networks have turned off the multicast scheme for security reasons. In this case, one sends unicast packets to several receivers spaced closely in time with the goal of trying to mimic the multicast scheme.

What causes losses and delays of packets over the network? When a packet arrives at a node, it joins a queue of incoming packets. If the buffer is full, the packet is dropped, i.e., lost. Depending on the protocol, the packet may or may not be resent. Packets also encounter delays along the path, primarily due to the queueing process above.

In the case of losses, the binary outcome Xk = 0 or 1 indicates whether the packet is lost (dropped) or not. In terms of the examples in Section 1, $f(x_1, \ldots, x_K) = \prod_k x_k$, and the end-to-end loss $Y = \prod_{k \in P(0,r)} X_k = 1$ if the packet transmitted along the path P(0, r) reached the receiver node r and zero otherwise. For delays, $f(x_1, \ldots, x_K) = \sum_k x_k$, and the end-to-end observation is $Y = \sum_{k \in P(0,r)} X_k$, the path-level delay.

The physical topology of a network is usually complicated. But the logical topology with a single source node can often be represented as a tree. For example, the left panel of Figure 2 shows the physical topology of a subnetwork at the campus of the University of North Carolina at Chapel Hill. The right panel shows the corresponding logical topology, which is a tree with a directed flow. We will revisit this network later in the paper. It is possible to deal with topologies with multiple sources, other kinds of transmission schemes (two-way flows), and so on. But for simplicity, we will restrict attention to the tree structures in this paper.
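To make the two aggregation rules concrete, here is a minimal sketch of how end-to-end observations arise from link-level values on a logical tree. The tree (`PARENT`), the helper names and the numerical values are all hypothetical and chosen only for illustration; link k is taken to be the link ending at node k, and node 0 is the root.

```python
from math import prod

# Hypothetical two-layer logical tree, stored as child -> parent (root is node 0).
PARENT = {1: 0, 2: 1, 3: 1}

def path_links(receiver, parent=PARENT):
    """Links on the path P(0, receiver), listed from the root downwards."""
    links = []
    node = receiver
    while node != 0:
        links.append(node)
        node = parent[node]
    return links[::-1]

def end_to_end_loss(receiver, passed):
    """Product rule: 1 if the probe passed every link on the path, else 0."""
    return prod(passed[k] for k in path_links(receiver))

def end_to_end_delay(receiver, delay):
    """Sum rule: path-level delay is the sum of the link-level delays."""
    return sum(delay[k] for k in path_links(receiver))

passed = {1: 1, 2: 0, 3: 1}              # probe dropped on link 2 only
delay = {1: 0.002, 2: 0.001, 3: 0.004}   # seconds, illustrative values

print(path_links(3), end_to_end_loss(2, passed), end_to_end_loss(3, passed))
print(end_to_end_delay(3, delay))
```

Only the values returned by `end_to_end_loss` and `end_to_end_delay` are observable in a probing experiment; the dictionaries `passed` and `delay` play the role of the unobserved link-level variables.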
Fig 2. Left panel: Schematic of the UNC network; Right panel: Logical topology of the UNC network.
3. Literature review of loss and discrete delay inference

Most of the results in the literature on active tomography have been developed under the assumption that the loss rates and delay distributions are temporally homogeneous and are independent across links. We will also use this framework. The assumption of temporal homogeneity is reasonable as the probing experiments are done within the order of minutes. The assumption of independence across links is less likely to hold. However, the nature of the dependence will vary from network to network, and it is difficult to obtain general results.

3.1. Design of probing experiments

We noted in Example 1 that the link-level parameters are not identifiable under the unicast transmission scheme (sending probes to one receiver at a time). The multicast scheme, which sends packets to all the receivers in the network simultaneously, addresses this problem for loss rates and, under some additional conditions, for delay distributions as well. However, this scheme has a number of drawbacks. It creates more traffic than necessary for estimating the link-level parameters. Also, the data generated are very high-dimensional. For example, in a binary symmetric tree with L layers, there are $R = 2^{L-1}$ receiver nodes. A multicast scheme for measuring loss rates results in a multinomial experiment with $2^R$ possible outcomes. This is a large number even for moderately sized trees. The most important drawback, however, is that it is inflexible and does not allow investigation of subnetworks using different intensities and at different times. In practice, one may want to probe sensitive parts of the network as lightly as necessary to avoid disturbance.

So there is a need for more flexible probing experiments. As pointed out in Example 4, this raises interesting issues on how to design the probing experiments. A class of flexible probing experiments, called flexicast experiments, was introduced and studied in Xi et al.
[17] and Lawrence et al. [8]. A flexicast experiment consists of a combination of k-cast schemes for different values of k, with each scheme aimed at studying a subnetwork. However, each scheme by itself will not necessarily allow us to estimate the link-level parameters of that subnetwork. The data have to be combined across the various k-cast schemes to estimate the link-level parameters. To illustrate the ideas, consider the network on the right panel in Figure 1. The multicast scheme sends probes simultaneously to {2, 3, 6, 8, 9, 10, 11, 12, 13, 14, 15}. Two possible flexicast experiments are: (1)
{2, 3, 6, 12, 13, 14, 8, 15, 9, 10, 11}
and (2)
{2, 3, 6, 12, 13, 14, 15, 8, 9, 10, 11}.
The former consists of only bicast (two receiver nodes at a time) and unicast schemes. Intuitively, the latter scheme appears to be more “efficient”, but we will see shortly that it does not allow one to estimate all the link-level parameters. A full multicast scheme for this tree will result in 11-tuples, or 11-dimensional data. The first flexicast experiment using pairs and singletons can cover the whole tree with five pairs and one singleton. The resulting data are considerably less complex in terms of processing and computations for inference. This advantage is particularly important for trees with many layers.
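The difference in data complexity can be quantified by counting outcome cells for loss measurements, where each receiver either gets the probe or does not. The following sketch is a back-of-the-envelope count, with function names invented for illustration; it compares a full multicast to 11 receivers with a flexicast experiment made of five bicast schemes and one unicast scheme.

```python
def multicast_cells(num_receivers):
    """Number of distinct loss outcomes when one probe goes to every receiver."""
    return 2 ** num_receivers

def flexicast_cells(scheme_sizes):
    """Total number of outcome cells summed over separate k-cast schemes."""
    return sum(2 ** k for k in scheme_sizes)

full = multicast_cells(11)                          # one 11-cast scheme
pairs_and_single = flexicast_cells([2] * 5 + [1])   # five bicasts, one unicast
print(full, pairs_and_single)                       # prints: 2048 22
```

The count grows exponentially in the number of receivers for multicast but only linearly in the number of schemes for a bicast/unicast design, which is why the gap widens for trees with many layers.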
Of course, not all flexicast experiments will permit estimation of the link-level parameters. To discuss the technical issues associated with the identifiability problem, consider first the notion of a splitting node. For a k-cast scheme, an internal node is a splitting node if the scheme splits at that node. For example, for the tree on the right panel of Figure 1, the bicast scheme {6, 12} splits at node 4. Xi et al. [17] showed that the following conditions are necessary and sufficient for identifiability of link-level loss rates: (a) all receiver nodes are covered; and (b) every internal node in the tree is a splitting node for some k-cast scheme in the flexicast experiment. Lawrence et al. [8] studied the delay problem and showed that the same conditions are also necessary and sufficient for estimating delay distributions provided the distributions are discrete. The case where the delay distributions are not discrete is discussed in the next section.

Consider again the flexicast schemes in equations (1) and (2) for the tree on the right panel in Figure 1. The first one, based on a collection of bicast and unicast schemes, satisfies the conditions. For the second one, none of the k-cast schemes split at node 4.

There are many flexicast experiments that satisfy the identifiability requirements, and the choice among these has to be based on other criteria. Experiments based on just bicast and unicast schemes have minimal data complexity, with just 1- and 2-dimensional outcomes. However, these provide information on just first- and second-order dependencies and will be less efficient (in a statistical sense) than k-cast schemes with higher values of k. In particular, the full multicast scheme will be most efficient in this sense. So the overall choice of the flexicast experiment has to be a compromise between statistical efficiency and flexibility, including the ability to adapt over time to accommodate changes in network conditions.

3.2. Inference for loss rates

Inference for loss rates was first studied in Cáceres et al. [2] for the multicast scheme. A recent, up-to-date list of references can be found in Xi et al. [17], who developed MLEs based on the EM algorithm for flexicast experiments. We provide next a brief review of these results.

Each k-cast scheme in a flexicast experiment is a k-dimensional multinomial experiment. Specifically, each outcome is of the form $\{Z_{r_1}, \ldots, Z_{r_k}\}$ where $Z_{r_j} = 1$ or 0 depending on whether the probe reached receiver node $r_j$ or not. Let $N_{(r_1,\ldots,r_k)}$ denote the number of outcomes corresponding to this event, and let $\gamma_{(r_1,\ldots,r_k)}$ be the probability of this event. Then, up to an additive constant, the log-likelihood for the k-cast scheme is $\sum N_{(r_1,\ldots,r_k)} \log \gamma_{(r_1,\ldots,r_k)}$, where the sum runs over the possible outcomes. The overall log-likelihood is just the sum of the log-likelihoods for these individual experiments. However, the $\gamma_{(r_1,\ldots,r_k)}$ are complicated functions of αk, the link-level loss rates, so one has to use numerical methods to obtain the MLEs.

The EM algorithm is a natural approach for computing the MLEs and has been used extensively in network tomography applications (see [3, 5, 16]). The structure of the EM algorithm for general flexicast experiments was developed in Xi et al. [17]. While the E-step can be complex for arbitrary collections of k-cast schemes, it simplifies for flexicast experiments comprised of bicast and unicast schemes, as seen below.

Let $s_b$ be the splitting node for bicast pair $b = \langle i_b, j_b \rangle$. Then $\pi(0, s_b)$, $\pi(s_b, i_b)$ and $\pi(s_b, j_b)$, the three path probabilities for this bicast pair, are products of the αk. Starting with an initial value $\alpha^{(0)}$, let $\alpha^{(k)}$ be the value after the k-th iteration. Then, we can write the (k + 1)-th iteration of the E-step as follows:
E-step:

1. For each bicast pair:

(a) Use $\alpha^{(k)}$ to obtain the updated path probabilities $\pi^{(k)}(0, s_b)$, $\pi^{(k)}(s_b, i_b)$, $\pi^{(k)}(s_b, j_b)$ and $\gamma_{00}^{b\,(k)}$.

(b) For each node $\ell \in P(0, s_b) \cup P(s_b, i_b) \cup P(s_b, j_b)$, compute $V_{\ell,b}^{(k+1)} = E_{\alpha^{(k)}}[V_\ell \mid N_b]$, where $N_b = \{N_{00}^b, N_{01}^b, N_{10}^b, N_{11}^b\}$ are the collected counts of the four possible outcomes, as follows.

For node $\ell \in P(0, s_b)$,

$$V_{\ell,b}^{(k+1)} = N^b - N_{00}^b \, \frac{1 - \alpha_\ell^{(k)}}{\gamma_{00}^{b\,(k)}}.$$

For link $\ell \in P(s_b, i_b)$,

$$V_{\ell,b}^{(k+1)} = N^b - N_{01}^b \times \frac{1 - \alpha_\ell^{(k)}}{1 - \pi^{(k)}(s_b, i_b)} - N_{00}^b \, \frac{(1 - \alpha_\ell^{(k)})(1 - \pi^{(k)}(0, j_b))}{\gamma_{00}^{b\,(k)}}.$$

For link $\ell \in P(s_b, j_b)$,

$$V_{\ell,b}^{(k+1)} = N^b - N_{10}^b \times \frac{1 - \alpha_\ell^{(k)}}{1 - \pi^{(k)}(s_b, j_b)} - N_{00}^b \, \frac{(1 - \alpha_\ell^{(k)})(1 - \pi^{(k)}(0, i_b))}{\gamma_{00}^{b\,(k)}}.$$

2. Unicast schemes: Let node $\ell \in P(0, u)$ for a unicast transmission to receiver node u, and compute

$$V_{\ell,u}^{(k+1)} = N^u - N_0^u \times \frac{1 - \alpha_\ell^{(k)}}{1 - \pi^{(k)}(0, u)}.$$

M-step: The (k + 1)-th update for the M-step is simply

$$\alpha_\ell^{(k+1)} = \frac{\sum_{b \in B} V_{\ell,b}^{(k+1)} + \sum_{u \in U} V_{\ell,u}^{(k+1)}}{\sum_{b \in B} N^b + \sum_{u \in U} N^u}$$
where B is the set of bicast pairs that include node ℓ in their paths and U is the set of all unicast schemes that include node ℓ in their paths.

In our experience, the EM algorithm works reasonably well for small to moderate networks when used with a flexicast experiment that consists of a collection of bicast and unicast schemes. For large networks, however, it becomes computationally intractable. In ongoing work, we are developing a class of fast estimation methods based on least-squares methods and are studying their application to on-line monitoring of network performance.

3.3. Inference for discrete delay distributions

For the delay problem, let Xk denote the (unobservable) delay on link k, and let the cumulative delay accumulated from the root node to the receiver node r be $Y_r = \sum_{k \in P(0,r)} X_k$. Here P(0, r) denotes the path from node 0 to node r. The observed data are end-to-end delays consisting of Yr for all the receiver nodes.

Most of the papers on delay inference assume a discrete delay distribution. Specifically, if q denotes the universal bin size, Xk ∈ {0, q, 2q, . . . , bq} is the discretized
delay on link k and bq is the maximum delay. Let αk(i) = P{Xk = iq}. The inference problem then reduces to estimating the parameters αk(i) for k ∈ E and i in {0, 1, . . . , b} using the end-to-end data Yr.

Lo Presti et al. [10] developed a fast, heuristic algorithm for estimating the link delays. Liang and Yu [9] developed a pseudo-likelihood estimation method. Nonparametric maximum likelihood estimation under the above setting was investigated in Tsang et al. [13] and Lawrence et al. [8]. Shih and Hero [12] examined inference under mixture models. See also Zhang [18] for a more general discussion of the deconvolution problem.

We discuss nonparametric MLE with discrete delays in more detail. Let $\vec{\alpha}_k = [\alpha_k(0), \alpha_k(1), \ldots, \alpha_k(b)]$ and let $\vec{\alpha} = [\vec{\alpha}_0, \vec{\alpha}_1, \ldots, \vec{\alpha}_{|E|}]$. The observed end-to-end measurements consist of the number of times each possible outcome $\vec{y}$ was observed from the set of outcomes $\mathcal{Y}^c$ for a given scheme c. Let $N^c_{\vec{y}}$ denote these counts. These are distributed as multinomial random variables with corresponding path-level probabilities $\gamma_c(\vec{y}; \vec{\alpha})$. So the log-likelihood is given by

$$l(\vec{\alpha}; Y) = \sum_{c \in C} \sum_{\vec{y} \in \mathcal{Y}^c} N^c_{\vec{y}} \log[\gamma_c(\vec{y}; \vec{\alpha})].$$
This cannot be maximized easily, and one has to resort to numerical methods. Again, the EM algorithm is a reasonable technique for computing the MLEs. See [7] for multicast schemes and [8] for inference with flexicast experiments. However, the complexity of the EM algorithm, in particular computing conditional expectations of the internal link delays for each bin, is prohibitive for all but fairly small-sized networks. To deal with larger networks, [8] developed a grafting method which fits “local” EMs to the subtrees defined by the k-cast schemes and then combines the estimates through a fixed point algorithm. This hybrid algorithm is fast and has reasonable statistical efficiency compared to the full MLE. For bicast schemes, the resulting algorithm has third-order polynomial complexity, a substantial improvement over the full bicast MLE.

The heuristic algorithm in [10] is based on solving higher order polynomials and is much faster. However, it uses only part of the data and is quite inefficient. The pseudo-likelihood method of [9] uses only data from all pairs of probes in the multicast experiment. This is similar in spirit to a flexicast experiment comprised of only bicast schemes, although in this setting the schemes would be independent. The pseudo-likelihood method is computationally faster than the MLE based on the full multicast. It is comparable to doing a full EM based on data from all possible bicast schemes. This will still not scale up well to very large trees as it includes all possible bicasts, which can involve a large number of schemes. Furthermore, using the full MLE to combine the results across all schemes is computationally intensive. The flexicast experiments, on the other hand, are typically based on a much smaller number of schemes (even if one restricts attention to bicasts). Further, the grafting algorithm is much faster for combining the results across the schemes.

3.4. Application to the UNC network

We use a real example to demonstrate how the results from active tomography are used. The example deals with estimating the QoS of the campus network at the University of North Carolina at Chapel Hill and assessing its capabilities for Voice-Over-IP readiness. This network has 15 endpoints which were organized into the tree shown in Figure 2. Node 1 is the main campus router and it connects to the university gateway. Nodes 2, 3, and 9 are also large routers responsible for different portions of
Fig 3. Probability of large delay on 3/7/2005.
the campus. The accessible nodes are all located in dorms and other university buildings. The root node of the tree was Sitterson Hall, which houses the computer science department. The network was probed in pairs using the following flexicast experiment: {4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18}. A single probing session consisted of two passes through the collection of experiments, sending about 500 probes to each pair in a single pass. The experiment was conducted over the course of several days in order to evaluate both the network and the methodology. We have collected extensive data but show only selected results for illustrative purposes.

The data presented here were collected at 9:00 a.m., 12:00 p.m., 3:00 p.m., 6:00 p.m., and 9:00 p.m. on March 1 and 17 of 2005. March 17 was during spring break. For both days, we chose a bin size of q = .0001s to assess occurrences of large delays on the network. The large bin size also allowed us to use the full MLE to estimate the delay distributions.

Figures 3 and 4 provide a picture of the probability of large delay (larger than a specified threshold) throughout the course of the day. From Figure 3, we see that many buildings (Venable, Davis, Rosenau, Smith, Greenlaw, and South) show a typical diurnal pattern. These buildings are either administrative or departmental buildings, so the majority of users follow a regular 9 to 5 schedule. Other buildings are either more uniform throughout the day or show even more activity at night. Hinton, for example, is a large freshman dorm, and thus the drop during the day and increase at night are expected as the residents return from classes and other activities in the evening.

A comparison of Figures 3 and 4 shows the difference in dorm activity before
Fig 4. Probability of large delay on 3/17/2005.
and during spring break. Everett, Old East, Hinton, and Craige are dorms. The data collected during spring break reveal almost no large delays in three out of four of these buildings. This is of course to be expected. The Hinton dorm is especially interesting, since it experienced very little congestion over the break, but a significant increase to pre-break levels on the first day after the break (post-break results are not shown here).

As a consequence of this study, it became clear that many of the building links require upgrades in order to support delay-sensitive applications such as VoIP. Some of the departmental and administration buildings (Smith and South) already have large delays even without additional VoIP traffic.

4. Parametric inference for delay distributions

This section develops some new results on parametric inference for delay distributions. We start with a framework that includes two components: a zero delay and a (non-zero) finite delay. Specifically, let Xk be the delay on link k, and suppose

(3)    $X_k \sim p_k \, \delta\{0\} + (1 - p_k) \, F(x; \theta_k).$
Here we assume that F (x; θk ) does not give any mass to zero, for all k. So, a successful transmission (finite delay or no loss) experiences an empty queue (no delay) with probability pk and has some non-zero delay that is distributed according to a parametric distribution F (·) indexed by θk with probability 1 − pk .
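As an illustration of model (3), the sketch below draws link delays from the zero-inflated mixture and forms the end-to-end delays on a two-layer tree. It is a simulation sketch only: F is taken to be exponential purely for convenience (the text leaves F general), and the function names, probabilities and mean delays are hypothetical.

```python
import random

def sample_link_delay(p_zero, mean_delay, rng):
    """One draw of X_k: zero with probability p_zero, otherwise a positive
    delay from F, here assumed exponential with the given mean."""
    if rng.random() < p_zero:
        return 0.0
    # random.expovariate takes the rate, i.e. 1 / mean.
    return rng.expovariate(1.0 / mean_delay)

def sample_two_layer(p_zero, mean_delay, rng):
    """One probe on the two-layer tree: returns (Y2, Y3) = (X1 + X2, X1 + X3)."""
    x1, x2, x3 = (sample_link_delay(p_zero[k], mean_delay[k], rng) for k in range(3))
    return x1 + x2, x1 + x3

rng = random.Random(0)
p_zero = [0.3, 0.3, 0.3]             # illustrative p_1, p_2, p_3
mean_delay = [0.002, 0.001, 0.004]   # illustrative mean non-zero delays (seconds)

draws = [sample_two_layer(p_zero, mean_delay, rng) for _ in range(20000)]
frac_zero_y2 = sum(1 for y2, _ in draws if y2 == 0.0) / len(draws)
print(frac_zero_y2)   # should be close to p_1 * p_2 = 0.09
```

An end-to-end delay of exactly zero occurs only when every link on the path had zero delay, so the observed fraction of zeros at a receiver estimates the product of the pk along its path.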
Fig 5. Three-layer, binary tree
4.1. Identifiability
The basic issue for delay distributions is the one posed in Example 2 in the introductory section, viz., whether the parameters of a simple two-layer tree (left panel of Figure 1) are estimable from probes sent simultaneously to both receivers. If this holds, then the result extends readily to general flexicast experiments that satisfy the conditions in Section 3.1 (using the arguments in [8]). We discuss the details briefly. See also [4, 6] for a general discussion of identifiability issues. We consider two cases:

Case 1: If pk > 0 for all k, no additional assumptions on the distribution F(·) are needed. All the link-level delay parameters (pk and θk) are identifiable using flexicast experiments provided they satisfy the conditions in Section 3.1: a) every receiver node is covered and b) every internal node is a splitting node for some sub-experiment. To see this, consider the two-layer tree on the left panel of Figure 1. Condition on the subset of data with Y2,m = 0 and Y3,m > 0 for probes m = 1, . . . , M. Now, Y2,m = 0 implies that both links 1 and 2 had zero delay (X1,m = X2,m = 0), so Y3,m = X3,m. So we can use this subset of Y3,m to estimate F(x; θ3). A similar argument can be used to estimate F(x; θ2) using the subset with Y2,m > 0 and Y3,m = 0. Once these two distributions are estimated, we can easily estimate F(x; θ1).

Case 2: If pk = 0, then we need additional assumptions on the delay distributions F(x; θk). As we noted in Example 3, the means of the normal distributions are not identifiable. If the moments of order two and higher depend on the first moment, they will provide additional information for estimating the parameters. One such example is when the variance is a function of the mean (as is the case with exponential, gamma, log-normal, and Weibull distributions).

Example 5. We consider here a more general situation with the three-layer binary, symmetric tree shown in Figure 5. Let the delay on link k be distributed Gamma(αk, βk).
Suppose we use the flexicast probing experiment {⟨4, 5⟩, ⟨5, 6⟩, ⟨6, 7⟩}. The covariances yield the following moment equations:

$$\mathrm{Cov}(Y_5^{5,6}, Y_6^{5,6}) = \alpha_1 \beta_1^2,$$
$$\mathrm{Cov}(Y_4^{4,5}, Y_5^{4,5}) - \mathrm{Cov}(Y_5^{5,6}, Y_6^{5,6}) = \alpha_2 \beta_2^2,$$
$$\mathrm{Cov}(Y_6^{6,7}, Y_7^{6,7}) - \mathrm{Cov}(Y_5^{5,6}, Y_6^{5,6}) = \alpha_3 \beta_3^2.$$

Let E(Yr) = νr. We also get the following equations based upon third moments:

$$E(Y_5^{5,6} - \nu_5)^2 (Y_6^{5,6} - \nu_6) = 2\alpha_1 \beta_1^3,$$
$$E(Y_4^{4,5} - \nu_4)^2 (Y_5^{4,5} - \nu_5) - E(Y_5^{5,6} - \nu_5)^2 (Y_6^{5,6} - \nu_6) = 2\alpha_2 \beta_2^3,$$
$$E(Y_6^{6,7} - \nu_6)^2 (Y_7^{6,7} - \nu_7) - E(Y_5^{5,6} - \nu_5)^2 (Y_6^{5,6} - \nu_6) = 2\alpha_3 \beta_3^3.$$
The corresponding sample moments can be used to estimate the terms on the left. Then, estimators for α1, β1, α2, β2, α3, and β3 can be obtained by rearranging the above equations. The parameters for the receiver links can be estimated with just the first two moments. For example, the equations for link 4 are:

$$E(Y_4) = \alpha_1 \beta_1 + \alpha_2 \beta_2 + \alpha_4 \beta_4,$$
$$\mathrm{Var}(Y_4) = \alpha_1 \beta_1^2 + \alpha_2 \beta_2^2 + \alpha_4 \beta_4^2.$$

The unknown parameters are easily obtained from the observed values on the left and the estimated parameters on the right.
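The rearrangement in Example 5 can be written out explicitly. The sketch below is an illustration under stated assumptions: it uses population moments computed from hypothetical "true" parameters in place of sample moments, and the helper names are invented. It inverts the covariance and third-moment equations for the internal links and the mean and variance equations for receiver link 4.

```python
def gamma_from_cov_and_third(c2, c3):
    """Solve alpha * beta**2 = c2 and 2 * alpha * beta**3 = c3 for (alpha, beta)."""
    beta = c3 / (2.0 * c2)
    alpha = c2 / beta ** 2
    return alpha, beta

def gamma_from_mean_and_var(mean, var):
    """Solve alpha * beta = mean and alpha * beta**2 = var for (alpha, beta)."""
    beta = var / mean
    alpha = mean / beta
    return alpha, beta

# Hypothetical true (alpha, beta) for links 1, 2 and 4.
a1, b1 = 2.0, 0.5
a2, b2 = 3.0, 0.2
a4, b4 = 1.5, 0.4

# Population versions of the moments on the left-hand sides.
cov_56 = a1 * b1 ** 2
cov_45 = a1 * b1 ** 2 + a2 * b2 ** 2
third_56 = 2 * a1 * b1 ** 3
third_45 = 2 * a1 * b1 ** 3 + 2 * a2 * b2 ** 3
mean_y4 = a1 * b1 + a2 * b2 + a4 * b4
var_y4 = a1 * b1 ** 2 + a2 * b2 ** 2 + a4 * b4 ** 2

link1 = gamma_from_cov_and_third(cov_56, third_56)
link2 = gamma_from_cov_and_third(cov_45 - cov_56, third_45 - third_56)
link4 = gamma_from_mean_and_var(
    mean_y4 - link1[0] * link1[1] - link2[0] * link2[1],
    var_y4 - cov_45,
)
print(link1, link2, link4)
```

With real data the same functions would be applied to sample moments, and the estimates inherit the sampling variability of the third-moment terms, which is typically the largest.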
4.2. Maximum likelihood estimation
It turns out that pk, the probability of zero delay, can be estimated using methods analogous to those for loss rates discussed in Section 3.2. Recall that a zero delay will be observed at the receiver node if and only if there is zero delay at every link. On the other hand, a non-zero delay at the receiver node may include zero delays at some links, so we have to “recover” this information from the aggregate-level data. But this is equivalent to the problem of losses. A packet received at the receiver node implies “success” at all the links. A packet not received at the receiver node involves a combination of successes and losses, with at least one loss. Thus, we can use the data with zero delays and positive delays in an analogous manner to estimate the zero-delay probabilities pk. To simplify matters, therefore, we will focus on parametric estimation of Fk(x; θk) assuming that pk = 0.

Let us consider some simple examples with the two-layer tree with two receivers in the left panel of Figure 1 and with exponential and gamma distributions for delays. The gamma family is closed under convolution if the scale parameters are the same, so the distribution of the end-to-end delays belongs to the same family as the link-level delays. Even for these simple cases, we will see that the MLE computations are intractable.

a) Exponential Distributions: Suppose the delay distribution on each link is exponential with parameter λk. Further, we send N probes to both receivers 2, 3
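simultaneously. The log-likelihood function is

(4)    $$l(\lambda; Y) = N \log(\lambda_1) + N \log(\lambda_2) + N \log(\lambda_3) - N \log(\lambda_1 - \lambda_2 - \lambda_3) - \lambda_2 \sum_{i=1}^{N} y_{i,2} - \lambda_3 \sum_{i=1}^{N} y_{i,3} + \sum_{i=1}^{N} \log\left[1 - \exp\{-(\lambda_1 - \lambda_2 - \lambda_3) \min(y_{i,2}, y_{i,3})\}\right].$$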
There is no analytic solution for the maximizer of this equation over λ, so one would have to use an iterative technique, such as EM or Newton-Raphson, to find the MLEs even in this simple case. We examine the details for the EM algorithm. The exponential distribution is a member of the exponential family, so the sufficient statistics are the (unobserved) total link-level delays $\sum_{i=1}^{N} X_{i,1}$, $\sum_{i=1}^{N} X_{i,2}$, and $\sum_{i=1}^{N} X_{i,3}$. Since $X_{i,2} = Y_{i,2} - X_{i,1}$ and $X_{i,3} = Y_{i,3} - X_{i,1}$, we need to compute only the conditional expected value of $\sum_{i=1}^{N} X_{i,1}$ in the E-step. The conditional distribution [X1 | Y2 = a, Y3 = b] has density

(5)    $$g(x) = \frac{\exp\{-(\lambda_1 - \lambda_2 - \lambda_3) x\}}{C(a, b; \lambda_1, \lambda_2, \lambda_3)}, \qquad 0 < x < a \wedge b,$$

where a ∧ b = min(a, b) and the constant of proportionality is

$$C(a, b; \lambda_1, \lambda_2, \lambda_3) = \frac{1 - \exp\{-(\lambda_1 - \lambda_2 - \lambda_3)(a \wedge b)\}}{\lambda_1 - \lambda_2 - \lambda_3}.$$
Now $\int_0^{a \wedge b} x \, g(x) \, dx$ is an incomplete gamma function, and one can compute the expected value $\sum_{i=1}^{N} E(X_{i,1} \mid Y_{i,2}, Y_{i,3})$ as a ratio of the incomplete gamma function and the constant C(a, b; λ1, λ2, λ3). Thus, the MLEs of the λk can be computed without too much trouble in this simple two-layer binary case.

How well does this extend to more general cases? Suppose we have a three-layer binary tree (Figure 5), and we use bicast schemes ⟨4, 5⟩, ⟨6, 7⟩, and ⟨5, 6⟩. Consider the scheme ⟨4, 5⟩, which splits at node 2. We can try to mimic the computations for the two-layer tree above. However, we have to consider the combined path P(0, 2), whose delay is the sum of delays for links 1 and 2. The exponential distribution is not closed under convolution, so the distribution is now more complex. The details for more general trees will depend on the number of links involved before and after the splitting node. The problem is even more complex for multicast schemes with multiple splitting nodes. We see that the MLE computations are complicated even for simple exponential distributions.

b) Gamma Distributions: Gamma distributions with the same scale parameter are closed under convolution, i.e., the path delays, which are sums of link-level delays, are still gamma. Specifically, let Xk ∼ Gamma(αk, β), independent across k for k ∈ E. We start with the simple two-layer binary tree. Then, the likelihood function
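of the observed data is

$$\begin{aligned}
L(\text{data}) &= \prod_{i=1}^{n} \int_0^{y_{i,2} \wedge y_{i,3}} f_1(x) \, f_2(y_{i,2} - x) \, f_3(y_{i,3} - x) \, dx \\
&= \prod_{i=1}^{n} \int_0^{y_{i,2} \wedge y_{i,3}} \frac{1}{\Gamma(\alpha_1) \beta^{\alpha_1}} x^{\alpha_1 - 1} \exp\{-\tfrac{x}{\beta}\} \times \frac{1}{\Gamma(\alpha_2) \beta^{\alpha_2}} (y_{i,2} - x)^{\alpha_2 - 1} \exp\{-\tfrac{y_{i,2} - x}{\beta}\} \\
&\qquad\qquad \times \frac{1}{\Gamma(\alpha_3) \beta^{\alpha_3}} (y_{i,3} - x)^{\alpha_3 - 1} \exp\{-\tfrac{y_{i,3} - x}{\beta}\} \, dx \\
&= \prod_{i=1}^{n} \left[ \frac{1}{\Gamma(\alpha_1) \Gamma(\alpha_2) \Gamma(\alpha_3)} \, \frac{1}{\beta^{\alpha_1 + \alpha_2 + \alpha_3}} \exp\{-\tfrac{1}{\beta}(y_{i,2} + y_{i,3})\} \right. \\
&\qquad\qquad \left. \times \int_0^{y_{i,2} \wedge y_{i,3}} x^{\alpha_1 - 1} (y_{i,2} - x)^{\alpha_2 - 1} (y_{i,3} - x)^{\alpha_3 - 1} \exp\{\tfrac{x}{\beta}\} \, dx \right].
\end{aligned}$$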
As before, the MLEs will have to be obtained numerically. Let us consider the details of the EM algorithm. The Gamma distribution is a member of the exponential family with sufficient statistics X and log(X). For the two-layer tree, we need to compute just the conditional expectations of X1 and log(X1), the unknown delays on the first link. The conditional distribution [X1 | Y2 = a, Y3 = b] is now given by

(6)    $$g(x) = \frac{x^{\alpha_1 - 1} (a - x)^{\alpha_2 - 1} (b - x)^{\alpha_3 - 1} \exp\{\tfrac{x}{\beta}\}}{C(a, b; \alpha_1, \alpha_2, \alpha_3, \beta)}, \qquad 0 < x < a \wedge b,$$

where the proportionality constant is

$$C(a, b; \alpha_1, \alpha_2, \alpha_3, \beta) = \int_0^{a \wedge b} x^{\alpha_1 - 1} (a - x)^{\alpha_2 - 1} (b - x)^{\alpha_3 - 1} \exp\{\tfrac{x}{\beta}\} \, dx.$$

This can be used to compute E[X1 | Y2, Y3] and E[log(X1) | Y2, Y3] numerically. Note that

$$E[X_1 \mid Y_2, Y_3] = \frac{C(Y_2, Y_3; \alpha_1 + 1, \alpha_2, \alpha_3, \beta)}{C(Y_2, Y_3; \alpha_1, \alpha_2, \alpha_3, \beta)}.$$

How well does this extend to trees with more than two layers? It turns out that the full MLE is still not feasible. However, a combination of “local” MLEs and a grafting idea (along the lines of [8]) is feasible. Consider the 3-layer tree in Figure 5. Suppose we use a flexicast experiment with 3 bicasts ⟨4, 5⟩, ⟨6, 7⟩, and ⟨5, 6⟩. The bicast scheme ⟨4, 5⟩ splits at node 2. So we can combine links 1 and 2 into a single link and use the previous results for the two-layer tree to get estimates for this subtree. Note that the delay distribution for the combined links 1 and 2 is Gamma(α1 + α2, β). So we can get “local” MLEs for α1 + α2, α4, α5 and β from the bicast experiment ⟨4, 5⟩. Using a similar argument, we can get estimates for α1 + α3, α6, α7 and β from the bicast scheme ⟨6, 7⟩ and estimates for α1, α2 + α5, α3 + α6 and β from the bicast scheme ⟨5, 6⟩. Now we can use one of several methods to combine these estimates to get a unique set of estimates for all of the αk and β. Possible methods include ordinary or weighted LS. We do not pursue this strategy here as the specifics work only for special cases.

The main message here is that it is not easy to compute the full MLE even in very simple cases, and the problem becomes completely intractable as the size of the tree and the number of children per node grow.
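For the two-layer gamma case, the conditional expectation of X1 given the two end-to-end delays is a ratio of two proportionality constants, and each constant is a one-dimensional integral. The sketch below evaluates that ratio with a simple midpoint rule. It is a rough numerical illustration, not the authors' implementation: the function names and the number of grid points are arbitrary choices, and the shape parameters are assumed to be at least 1 so that the integrand has no endpoint singularities.

```python
import math

def c_const(a, b, alpha1, alpha2, alpha3, beta, n=4000):
    """Midpoint-rule approximation of C(a, b; alpha1, alpha2, alpha3, beta),
    the integral over (0, min(a, b)) of
    x^(alpha1-1) (a-x)^(alpha2-1) (b-x)^(alpha3-1) exp(x/beta)."""
    upper = min(a, b)
    h = upper / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h   # midpoints never touch 0 or min(a, b)
        total += (x ** (alpha1 - 1) * (a - x) ** (alpha2 - 1)
                  * (b - x) ** (alpha3 - 1) * math.exp(x / beta))
    return total * h

def expected_x1(a, b, alpha1, alpha2, alpha3, beta):
    """E[X1 | end-to-end delays a and b] as C(alpha1 + 1, ...) / C(alpha1, ...)."""
    return (c_const(a, b, alpha1 + 1, alpha2, alpha3, beta)
            / c_const(a, b, alpha1, alpha2, alpha3, beta))

# Check against a case with a closed form: all shapes equal to 1, so the
# conditional density is proportional to exp(x/beta) on (0, m), m = min(a, b).
a, b, beta = 1.0, 2.0, 0.5
m = min(a, b)
closed_form = m * math.exp(m / beta) / (math.exp(m / beta) - 1.0) - beta
print(expected_x1(a, b, 1.0, 1.0, 1.0, beta), closed_form)
```

In an EM iteration this ratio would be evaluated once per probe, which already hints at the computational burden that motivates the method-of-moments approach.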
4.3. Method-of-moments estimation

We discuss the use of the method of moments, which estimates the parameters by matching the population moments to the sample moments using some appropriate loss function. General losses are possible, but squared error loss leads to more tractable optimization and the large-sample properties are easy to establish.

Let H = {1, . . . , m} be the index set of the probes used in the probing experiment. Denote the observed data Yr(1), . . . , Yr(m) as Yr(H). Let $M_j^i(H)$ be the observed i-th moment for the j-th scheme based on the probes in H. Let $M_j^i(\theta)$ be the functional form of the i-th moment from the j-th probing scheme. For example, for the two-layer tree in Figure 1 with Gamma(αk, βk) distributions on each link, we get the following relationships:

$$E(Y_2) = \alpha_1 \beta_1 + \alpha_2 \beta_2, \qquad E(Y_3) = \alpha_1 \beta_1 + \alpha_3 \beta_3,$$
$$\mathrm{Cov}(Y_2, Y_3) = \alpha_1 \beta_1^2, \qquad E(Y_2 - \nu_2)^2 (Y_3 - \nu_3) = 2\alpha_1 \beta_1^3,$$
$$\mathrm{Var}(Y_2) = \alpha_1 \beta_1^2 + \alpha_2 \beta_2^2, \qquad \mathrm{Var}(Y_3) = \alpha_1 \beta_1^2 + \alpha_3 \beta_3^2.$$

We can now estimate the parameters by minimizing the squared error loss

$$Q(\theta; M(H)) = \sum_{j=1}^{m} \sum_{i} \left[ M_j^i(H) - M_j^i(\theta) \right]^2.$$
This is a special case of the nonlinear least squares problem and can be solved using iterative methods such as the Gauss-Newton procedure (see [1] for example). After rewriting the loss function as a single sum over all the moments, we consider the derivatives

(7)    $$\frac{\partial Q(\theta; M(H))}{\partial \theta_j} = -2 \sum_i \left[ M_i(H) - M_i(\theta) \right] \frac{\partial M_i(\theta)}{\partial \theta_j}.$$

These can be expressed in matrix form as

$$\frac{\partial Q(\theta; M(H))}{\partial \theta} = -2 D^{\top} [M(H) - M(\theta)],$$

where $D_{i,j} = \partial M_i(\theta) / \partial \theta_j$. The moments at the true value can be expanded using a Taylor series expansion around an initial guess $\theta^{(0)}$ as

$$M(\theta_0) \approx M(\theta^{(0)}) + D(\theta_0 - \theta^{(0)}).$$

Computing the residuals and replacing the true value with the observed moments gives an updating scheme based on solving a linear system. Thus at iteration q, we have the following linear system:

$$M(H) - M(\theta^{(q)}) = D\beta.$$

Solving this for $\hat{\beta}$, we get the next iteration as $\theta^{(q+1)} = \theta^{(q)} + \hat{\beta}$. In general, each iteration should be closer to the minimizer. However, there can be situations where the step increases the sum of squares. To avoid this, we recommend the modified Gauss-Newton in which the next iteration is given by $\theta^{(q+1)} = \theta^{(q)} + r\hat{\beta}$ where 0 < r ≤ 1. This fraction can be chosen adaptively at each
step. If the full step reduces the sum of squares, then it is taken. Otherwise, we can set r = .5. If the half step fails to reduce the sum of squares, then it can be halved again. This guarantees that the loss function is reduced with every step and gives convergence to a stationary point. Examination of the derivatives will indicate whether the point is a minimum.

The algorithm has useful complexity properties in terms of both memory and computation. Since the estimation is based only on the moments, these values are all that need to be stored. This is a vast improvement over algorithms that require all of the data or the counts of the binned data. Further, an efficient implementation of the algorithm, involving a QR factorization and one matrix inversion, has computational complexity $O(m^3)$, where m is the number of required moments. Again, this is a large improvement over other methods that have exponential complexity. Further improvement is achieved in many cases using sparse matrix techniques. These two points make the approach well suited to applications requiring real-time estimates.

The ordinary nonlinear least-squares (OLS) method-of-moments (MOM) estimators can be inefficient, as the different moments are correlated and have unequal variances. Since these variances and covariances can often be computed and estimated easily, one can use generalized least-squares (GLS) to improve the efficiency. A limited comparison of the efficiencies is given in the next subsection.

It is easy to show that the method-of-moments estimators based on OLS or GLS are consistent and asymptotically normal as the sample sizes (numbers of probes) increase on a given network. The large-sample distribution can be used to compute approximate confidence regions and hypothesis tests, which are useful in monitoring applications.
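The modified Gauss-Newton iteration with step halving described above can be sketched in Python as follows, under several assumptions that are not from the paper: the moment function is the six-moment Gamma example of the two-layer tree, the derivative matrix D is approximated by forward differences, and `numpy.linalg.lstsq` stands in for an explicit QR-based solve. All function names are hypothetical.

```python
import numpy as np

def moments(theta):
    """Model moments M(theta), with theta = (a1, a2, a3, b1, b2, b3)."""
    a1, a2, a3, b1, b2, b3 = theta
    shared = a1 * b1**2
    return np.array([a1 * b1 + a2 * b2, a1 * b1 + a3 * b3, shared,
                     2.0 * a1 * b1**3, shared + a2 * b2**2, shared + a3 * b3**2])

def jacobian(theta, rel_step=1e-7):
    """Forward-difference approximation of D[i, j] = dM_i / dtheta_j."""
    base = moments(theta)
    D = np.empty((base.size, theta.size))
    for j in range(theta.size):
        h = rel_step * max(1.0, abs(theta[j]))
        shifted = theta.copy()
        shifted[j] += h
        D[:, j] = (moments(shifted) - base) / h
    return D

def modified_gauss_newton(m_obs, theta0, max_iter=100, tol=1e-20):
    """Minimize the squared error loss, halving the step until the loss drops."""
    theta = np.array(theta0, dtype=float)
    loss = np.sum((m_obs - moments(theta)) ** 2)
    for _ in range(max_iter):
        if loss < tol:
            break
        residual = m_obs - moments(theta)
        # Least squares solution of residual = D beta.
        beta_hat, *_ = np.linalg.lstsq(jacobian(theta), residual, rcond=None)
        r = 1.0
        while r > 1e-10:
            candidate = theta + r * beta_hat
            candidate_loss = np.sum((m_obs - moments(candidate)) ** 2)
            if candidate_loss < loss:
                break                      # accept the first fraction that helps
            r /= 2.0
        else:
            break                          # no fraction reduced the loss: stop
        theta, loss = candidate, candidate_loss
    return theta, loss
```

Only the vector of observed moments is passed to the routine, which reflects the memory advantage noted above: the raw probe data are not needed once the moments have been computed.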
4.4. Relative efficiency of the method-of-moments: a limited study
We conducted a small simulation study to assess the performance of the MOM estimators versus the MLE. This was done on a two-layer binary tree (left panel of Figure 1) with exponential link delay distributions, which is one of the few instances in which it is practical to construct the MLE. We used two MOM estimators: the OLS and the GLS schemes described above. The GLS method used a weighting scheme based on an empirical estimate of the covariance of the observed moments. Relative efficiency is defined here as the ratio of the variance of the estimator of interest to the variance of the MLE, so that values close to 1 indicate little loss relative to the MLE. We considered two cases: a) each link has the same mean, i.e., 1/λk = 1/2, k = 1, 2, 3, and b) each link has its own mean, 1/λ1 = 1/2, 1/λ2 = 1/4, and 1/λ3 = 1/6, respectively. For both scenarios, 1000 data sets of size 3000 were generated.

Boxplots of the three sets of estimators are shown in Figures 6 and 7. The procedures appear to be unbiased. When the means are the same, the relative efficiencies of the OLS MOM are 1.72, 1.33, and 1.41, and the relative efficiencies of the GLS MOM are 1.12, 1.11, and 1.12. When the means are different, the relative efficiencies for the OLS are 2.43, 5.09, and 9.13, and the relative efficiencies for the GLS are 1.07, 1.24, and 1.34. In this example, the GLS MOM appears to be quite efficient compared to the MLE. However, a much more extensive study is clearly needed to quantify the performance of the MOM estimators.
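The way such a relative efficiency is computed from replicated simulations can be sketched in Python as below. This is a deliberately simplified toy, not the tree study itself: it assumes a single exponential sample rather than the two-layer tree, compares the MLE of the mean (the sample mean) with a moment estimator based on the sample standard deviation, and takes relative efficiency as the variance of the candidate estimator divided by the variance of the MLE, so that values above 1 indicate a loss relative to the MLE. The function names are hypothetical; the replication counts mirror those quoted above.

```python
import numpy as np

def relative_efficiency(mle_estimates, candidate_estimates):
    """Variance of the candidate estimator divided by the variance of the MLE."""
    return np.var(candidate_estimates, ddof=1) / np.var(mle_estimates, ddof=1)

def toy_study(mean=0.5, n=3000, n_rep=1000, seed=0):
    """Replicated comparison of two estimators of an exponential mean."""
    rng = np.random.default_rng(seed)
    data = rng.exponential(scale=mean, size=(n_rep, n))
    mle = data.mean(axis=1)               # MLE of the mean
    mom = data.std(axis=1, ddof=1)        # moment estimator: the sd equals the mean
    return relative_efficiency(mle, mom)
```

In this toy setting the large-sample variances are roughly $\mu^2/n$ for the sample mean and $2\mu^2/n$ for the sample standard deviation, so the ratio returned by `toy_study()` should be near 2.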
Fig 6. Boxplots comparing the MLE with MOM: Case 1 – All means are equal.
Fig 7. Boxplots comparing the MLE with MOM: Case 2 – The three means are different.
Inverse problems in network tomography
4.5. Analysis using the NS-2 simulator

We now describe a study of the MOM estimators in a simulated network environment. The ns-2 package is a discrete event simulator that allows one to generate network traffic and transmit packets using various network transmission protocols, such as TCP and UDP [15], over wired or wireless links. The simulator allows the underlying link delays to exhibit both spatial and temporal dependence, with correlation between sister links (children of the same parent, i.e., 4, 5, and 6) around .25 and autocorrelation about .2 on all of the links. Thus, we can study the performance of the active tomography methods under more realistic scenarios.

We used the topology shown in Figure 8 with a multicast transmission scheme. The capacity of all links was set to the same size (100 Mbits/sec), with 11 sources (10 TCP and one UDP) generating background traffic. The UDP source sent 210-byte packets at a rate of 64 kilobits per second with burst times exponentially distributed with mean .5s, while the TCP sources sent 1,000-byte packets every .02s. The main difference between these two transmission protocols is that UDP transmits packets at a constant rate, while TCP sources linearly increase their transmission rate to the set maximum and halve it every time a loss is recorded. The length of the simulation was 300 seconds, with probe packets 40 bytes long injected into the network every 10 milliseconds for a total of about 3,000. Finally, the buffer size of the queue at each node (before packets are dropped and losses recorded) was set to 50 packets. We studied only the continuous component of the delay distribution, i.e., the portion of the path-level data that contains zero or infinite delays was removed. The traffic-generating scenario described above resulted in approximately uniform waiting times in the queue (see Figure 9).
This is somewhat unrealistic, since traffic in real networks tends to be fairly bursty [11], but it provides a simple scenario for our purposes. Estimating the unknown parameters for this model is equivalent to estimating the maximum waiting time for a random packet. Figure 10 shows quantile-quantile plots using simulated values from the fitted distributions versus the observed ns-2 delays for both the links and paths. Specifically, we estimated the parameters for the uniform distributions using the moment estimation procedure and then generated data based on these parameter values. The fitted values were $\hat b = [.89, .79, .53, 1.10, 1.09, 1.13]$. The estimation procedure does quite well on all of the links except for the interior link 3. The algorithm seems
Fig 8. Portion of the UNC network used in the ns-2 simulation scenario.
Fig 9. Histogram of link delays for the ns-2 simulation.
to compensate for this under-estimation (about 40%) by slightly overestimating the parameters for each of the descendants of link 3, as evidenced by the closely matched quantiles for the end-to-end data. This error is probably the result of several factors. First, link 3 deviates the most from the uniform distribution, with the last bin in Figure 9 being too thin. Second, the algorithm appears to be moderately affected by the violations of the independence assumptions, particularly the spatial dependence among the children of link 3. This could likely be somewhat relieved by using a larger sample size and accounting for the empty queue probabilities. Nevertheless, the estimation performs well overall.

5. Summary

There are a number of other interesting problems that arise in active network tomography. There are usually multiple source nodes, which raises the issue of how to optimally design flexicast experiments for the various sources. We have also assumed that the logical topology of the tree is known. However, only partial knowledge of the network topology is typically available, and one would be interested in using the path-level information to simultaneously discover the topology and estimate the parameters of interest. The topology discovery problem is computationally difficult (NP-hard), but methods using a Bayesian formulation as well as those based on clustering ideas have been proposed in the literature (see [3] and references therein).

Active tomography techniques are useful for monitoring network quality of service over time. However, this application requires that path measurements be collected sequentially over time and appropriately combined. The probing intensity, the type of control charts to be used for monitoring purposes, and the use of path- versus link-level information are topics of current research.
Fig 10. QQ plots for the links and paths of the ns-2 simulator example. The fitted quantiles come from simulating delays from distributions with parameters given by the fitted values of the data.
Acknowledgments. The authors are grateful to: Jim Landwehr, Lorraine Denby, and Jean Meloche of Avaya Labs for making their ExpertNet tool available for VoIP data collection and for many useful discussions on network monitoring; Jim Gogan and his team from the IT Division at UNC for their technical support in deploying the ExpertNet tool on their campus network, for troubleshooting hardware problems and for providing information about the structure and topology of the network; Don Smith of the CS Department at UNC for helping us establish the collaboration with the UNC IT group; and to two referees for several useful comments that have improved the paper.

References

[1] Bates, D. and Watts, D. (1988). Nonlinear Regression Analysis and Its Applications. Wiley, New York.
[2] Cáceres, R., Duffield, N. G., Horowitz, J. and Towsley, D. F. (1999). Multicast-based inference of network-internal loss characteristics. IEEE Transactions on Information Theory 45 2462–2480.
[3] Castro, R., Coates, M., Liang, G., Nowak, R. and Yu, B. (2004). Network tomography: Recent developments. Statist. Sci. 19 499–517.
[4] Chen, A., Cao, J. and Bu, T. (2005). Network tomography: Identifiability and Fourier domain estimation. Submitted.
[5] Coates, M. J. and Nowak, R. D. (2000). Network loss inference using unicast end-to-end measurement. In Proceedings of the ITC Conference on IP Traffic, Modelling and Management.
[6] Lawrence, E. (2005). Flexicast network delay tomography. Ph.D. thesis, University of Michigan.
[7] Lawrence, E., Michailidis, G. and Nair, V. N. (2003). Maximum likelihood estimation of internal network link delay distributions using multicast measurements. In Proceedings of the 37th Conference on Information Sciences and Systems.
[8] Lawrence, E., Michailidis, G. and Nair, V. N. (2006). Network delay tomography using flexicast experiments. J. Roy. Statist. Soc. Ser. B 68 785–813.
[9] Liang, G. and Yu, B. (2003). Maximum pseudo likelihood estimation in network tomography. IEEE Transactions on Signal Processing 51 2043–2053.
[10] Lo Presti, F., Duffield, N. G., Horowitz, J. and Towsley, D. (2002). Multicast-based inference of network-internal delay distributions. IEEE Transactions on Networking 10 761–775.
[11] Park, K. and Willinger, W. (2000). Self-Similar Network Traffic and Performance Evaluation. Wiley Interscience, New York.
[12] Shih, M.-F. and Hero, A. O. (2003). Unicast-based inference of network link delay distributions with finite mixture models. IEEE Transactions on Signal Processing 51 2219–2228.
[13] Tsang, Y., Coates, M. and Nowak, R. D. (2003). Network delay tomography. IEEE Transactions on Signal Processing 51 2125–2135.
[14] Vardi, Y. (1996). Network tomography: Estimating source-destination traffic intensities from link data. J. Amer. Statist. Assoc. 91 365–377.
[15] Walrand, J. (2002). Communications Networks: A First Course. McGraw-Hill.
[16] Xi, B., Michailidis, G. and Nair, V. N. (2003). Least squares estimates of network link loss probabilities using end-to-end multicast measurements. In Proceedings of the 37th Conference on Information Sciences and Systems.
[17] Xi, B., Michailidis, G. and Nair, V. N. (2006). Estimating network loss rates using active tomography. J. Amer. Statist. Assoc. 101 1430–1438.
[18] Zhang, C.-H. (2005). Estimation of sums of random variables: Examples and information bounds. Ann. Statist. 33 2022–2041.
IMS Lecture Notes–Monograph Series
Complex Datasets and Inverse Problems: Tomography, Networks and Beyond
Vol. 54 (2007) 45–61
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000238
Network tomography based on 1-D projections

Aiyou Chen¹ and Jin Cao¹
Bell Laboratories, Alcatel-Lucent Technologies

Abstract: Network tomography has been regarded as one of the most promising methodologies for performance evaluation and diagnosis of the massive and decentralized Internet. This paper proposes a new estimation approach for solving a class of inverse problems in network tomography, based on marginal distributions of a sequence of one-dimensional linear projections of the observed data. We give a general identifiability result for the proposed method and study the design issue of these one-dimensional projections in terms of statistical efficiency. We show that for a simple Gaussian tomography model, there is an optimal set of one-dimensional projections such that the estimator obtained from these projections is asymptotically as efficient as the maximum likelihood estimator based on the joint distribution of the observed data. For practical applications, we carry out simulation studies of the proposed method for two instances of network tomography. The first is for traffic demand tomography using a Gaussian Origin-Destination traffic model with a power relation between its mean and variance, and the second is for network delay tomography where the link delays are to be estimated from the end-to-end path delays. We compare estimators obtained from our method with those obtained from the joint distribution and from other lower dimensional projections, and show that in both cases, the proposed method yields satisfactory results.
¹ Communications and Statistical Sciences Department, Bell Laboratories, Alcatel-Lucent Technologies, 600 Mountain Ave, Murray Hill, NJ, USA 07974, e-mail: [email protected]; [email protected]
AMS 2000 subject classifications: primary 62F35, 62F30; secondary 62P30.
Keywords and phrases: network tomography, one dimensional projection, identifiability, asymptotically efficient.

1. Introduction

Network performance monitoring and diagnosis is a challenging problem due to the size and the decentralized nature of the Internet. For instance, when an end-to-end measurement indicates the performance degradation of an Internet path, the exact cause is hard to uncover because the path may traverse several autonomous systems (AS) that are often owned by different entities, e.g., service providers, and the service providers generally do not share their internal performance data. Even if they do, there is no scalable way to correlate the link-level measurements with end-to-end performance in a large network like the Internet. Similarly, the service providers may be interested in end-to-end path characteristics that they cannot observe directly. Network tomography is a technology for addressing these issues [1, 4, 5, 21] (see [8] for an excellent review of this topic). The key advantage of network tomography is that it does not require collaboration between the network elements and the end users.

There are two main classes of problems being studied in the literature. The first estimates the link-level characteristics, such as packet loss or delay, based on end-to-end measurements [1, 3, 4, 10, 13, 16, 18]. The loss tomography problem
can be viewed as a special case of the delay tomography problem if loss is treated as a very large delay. Here we consider the case where packets are transmitted using the multicast protocol, that is, a packet is sent from a source to multiple destinations simultaneously. The second is traffic demand tomography, where one predicts end-to-end path-level traffic intensity based on link-level traffic measurements [5, 6, 21].

For both network delay tomography and traffic demand tomography, the statistical models can be unified by
\[
Y = AX, \tag{1}
\]
where $X = (X_1, \ldots, X_I)^T$ is an $I$-dimensional vector of unobservable network dynamic parameters with mutually independent components, $Y = (Y_1, \ldots, Y_J)^T$ is a $J$-dimensional vector of measurements, and $A$ is a $J \times I$ routing matrix with elements 0 or 1. The objective is to estimate the distribution of X given independent observations from the distribution of Y.

As with other inverse problems, the main difficulty in network tomography lies in the fact that $I > J$. For traffic demand tomography, $I$ can be as large as $J^2$, and for the delay tomography model, $I$ can be as large as $2J - 1$. As a result, A is not invertible and the tomography problem is ill-posed. However, if the components of X are assumed independent, it can be shown that the distribution functions of X under model (1) can be uniquely determined up to their means under mild conditions [9]. This mean ambiguity can be further removed if the component distributions of X satisfy some additional constraints, for example, a relation between their means and higher order moments (such as the Poisson distribution), positive point mass at zero, etc. [5, 12, 19, 21].

To estimate the distribution of X, likelihood based inference has been proposed [4, 5, 15, 16]. However, since the likelihood involves a high order convolution, such inference is computationally expensive except for specific distributions. This can be limiting in some circumstances. For example, in delay tomography where the link delays are to be estimated, the continuous distributions for both end-to-end and link delays are approximated by non-parametric discrete distributions using the same set of bins of equal widths. This is problematic for a heterogeneous network such as the Internet, where the link delay distributions can differ significantly.
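As a small illustration of model (1), the following Python sketch builds the routing matrix of a two-leaf tree, in which the two path delays are $Y_1 = X_1 + X_2$ and $Y_2 = X_1 + X_3$, and simulates path measurements from independent link delays. The exponential link delays, the chosen means, and the function name `simulate_paths` are assumptions made for the example. The sketch shows that A has more columns than rows, so it cannot be inverted, while the covariance of the two path delays still recovers the variance of the shared link.

```python
import numpy as np

# Routing matrix of a two-leaf tree: rows are paths (Y1, Y2),
# columns are links (X1 shared, X2 and X3 leaf links).
A = np.array([[1, 1, 0],
              [1, 0, 1]])

def simulate_paths(link_means, n, seed=0):
    """Draw independent exponential link delays X and return (X, Y = A X)."""
    rng = np.random.default_rng(seed)
    x = rng.exponential(scale=link_means, size=(n, A.shape[1]))
    return x, x @ A.T

J, I = A.shape                       # J = 2 measurements, I = 3 unknown links
rank = np.linalg.matrix_rank(A)      # rank 2 < I, so A is not invertible
```

Because the links are independent, the sample covariance of $Y_1$ and $Y_2$ estimates $\mathrm{Var}(X_1)$, which is one way of seeing how independence restores information that the linear map alone does not carry.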
To overcome the computational difficulties in likelihood based inference, a characteristic-function based generalized moment estimator has been proposed for general distributions [9], where the model parameters are estimated by minimizing a contrast function between the empirical characteristic function and the parametrized characteristic function of Y under the model. However, it has been realized that inference based on either the full likelihood or the joint characteristic function may still be expensive when the dimension of X is high. A solution to this is the pseudo-likelihood approach proposed in [16], where the parameters are estimated by maximizing a pseudo-likelihood function that is constructed by multiplying the marginal likelihoods of a sequence of subsets $Y_s$ of the high-dimensional observation Y. The idea is that each of these marginal likelihoods focuses only on a small subset $Y_s$ and is thus computationally much cheaper. Specifically, the authors considered constructing these subsets using all pairwise components of Y, i.e., $Y_s = (Y_{j_1}, Y_{j_2})$, $1 \le j_1 < j_2 \le J$, and found that such a pseudo-likelihood based estimator is computationally efficient compared to the estimator based on the full likelihood, with little loss in statistical efficiency. The same idea has been used in the characteristic-function based estimator in [9] by considering characteristic functions of a sequence of low dimensional subsets $Y_s$, and in the local likelihood estimator in [15] by considering $Y_s$ of both one and two dimensions.
Method of 1D projections
In all the above approaches, each $Y_s$ can be considered as a projection of a high dimensional measurement Y onto a low dimensional subspace, taken along the axes of components of Y. One can further generalize this and take projections in arbitrary directions, for example, using $B_s^T Y$ where $B_s$ is a matrix with a small number of columns. However, an optimal choice of these lower-dimensional projections so as to balance the computational complexity and statistical efficiency has not been studied previously. For statistical efficiency, one might want to use relatively high dimensional projections so that the information on multivariate dependency will not be lost. For computational efficiency, one might prefer a small set of lower dimensional projections.

This paper provides a partial solution to the design issue of these lower dimensional projections. Here we consider the extreme case: one-dimensional (1D) linear projections of the observed data Y. That is to say, the statistical inference regarding the distribution of X is based on the marginal distributions of a series of 1D-projections of Y, say $\beta_k^T Y$, $\beta_k \in R^J$ for $k = 1, \ldots, K$. The contributions of this paper are two-fold. First, we give a sufficient condition for the choice of these 1D-projections so that the X distribution can be uniquely determined. Second, we study the design issue of such 1D-projections in terms of statistical efficiency. For a simple Gaussian tomography model where the component distributions of X are Gaussian, we show that there exists an optimal choice of 1D-projections, selected by a correlation based rule, from which the obtained estimator is asymptotically as efficient as the maximum likelihood estimator (MLE) based on the joint distribution of Y.

For more realistic tomography models, we carry out simulation studies of two instances: the first is for traffic demand tomography where the Origin-Destination traffic is also Gaussian but its mean and variance are related through a power equation, and the second is for network delay tomography where the link delays are to be estimated using a continuous mixture distribution. For both cases, we show that the proposed method based on 1D-projections yields satisfactory results as compared to estimators using other choices of projections and the complete data.

The remainder of the paper is organized as follows. In Section 2, the method of 1D-projections is proposed, and the identifiability issue and parameter estimation are discussed. In Section 3, a simple Gaussian tomography model is analyzed, and the optimal design of 1D-projections and its efficiency are studied. In Section 4, simulation studies of the 1D-projection method are presented for traffic demand tomography and network delay tomography. We conclude in Section 5.

The following conventions will be used throughout the paper. 1D-projections represent one-dimensional projections of the form $\beta_k^T Y$. 2D-projections represent pairwise components of Y, e.g. $(Y_j, Y_{j'})$. A lower case letter represents a vector and an upper case letter represents a matrix, with the exception of X and Y, which represent random vectors as in model (1). $M_{ab}$ or $M(a, b)$ represents the $(a, b)$th element of a matrix $M$. $v_i$ is the $i$th element of a vector $v$; $\beta_k \in R^J$ is a column vector and $\beta_{ki}$ is its $i$th element. $M^T$ and $v^T$ represent the transposes of $M$ and $v$ respectively. $M^{-T}$ is the transpose of $M^{-1}$, the inverse of $M$.

2. Method of one-dimensional projections

In this section, we formally describe the method of 1D-projections for solving the inverse problem in (1) in the context of network tomography. We first present a necessary and sufficient condition for identifiability and then discuss the parameter estimation in this setting.
A. Chen and J. Cao
2.1. Identifiability

One fundamental question of the 1D-projection method is whether the distribution of X can be uniquely determined by the marginal distributions of these 1D-projections. This is the so-called identifiability issue. For simplicity we shall start with a simple matrix A derived from the two-leaf tree delay tomography model (Figure 1), and then generalize it to an arbitrary matrix.

For illustration purposes, we use a special set of 1D-projections on the two-leaf tree to explain the main idea behind the identifiability. For the two-leaf tree in Figure 1, let $X = (X_1, X_2, X_3)^T$ be the internal link delays from top to bottom and left to right, and $Y = (Y_1, Y_2)^T$ be the end-to-end delays from the root node to the two leaves from left to right. Since the end-to-end delay is the sum of internal link delays on the path, we can write $Y_1 = X_1 + X_2$ and $Y_2 = X_1 + X_3$. That is, $A = [1, 1, 0; 1, 0, 1]$. Following [9], we assume that the characteristic function of each component of X is analytic¹ and we refer to this as the analytic condition in this paper.

Lemma 2.1. For the two-leaf tree in Figure 1, assume that X has mutually independent components and satisfies the analytic condition. Then the distribution of X is determined up to a shift in its mean by the marginal distributions of $Y_1$, $Y_2$, $Y_1 + aY_2$ if $a \ne 0$ and $a \ne -1$, where the mean of X satisfies the constraints $E[X_1] + E[X_2] = E[Y_1]$ and $E[X_1] + E[X_3] = E[Y_2]$.

Proof. Let $\beta_1 = (1, 0)^T$, $\beta_2 = (0, 1)^T$ and $\beta_3 = (1, a)^T$; then the three projections can be written as $\beta_k^T Y$, $k = 1, 2, 3$. Let $\gamma_k = A^T \beta_k = (\gamma_{k1}, \gamma_{k2}, \gamma_{k3})^T$; then the characteristic function of $\beta_k^T Y$ is equal to
\[
E e^{it\beta_k^T Y} = \prod_{i=1}^{3} \psi_i(\gamma_{ki} t),
\]
where $\psi_i$ is the characteristic function of $X_i$. Suppose that there exists another set of characteristic functions $\bar\psi_i$ which also satisfy the above three equations. Let $\omega_i(t) = \log \psi_i(t) - \log \bar\psi_i(t)$. Then we have for $k = 1, 2, 3$,
\[
\sum_{i=1}^{3} \omega_i(\gamma_{ki} t) = 0 .
\]
Let $d_{in}$ be the $n$th order derivative of $\omega_i(t)$ at $t = 0$. By evaluating the $n$th order derivatives at zero, we have for $k = 1, 2, 3$,
\[
\sum_{i=1}^{3} \gamma_{ki}^n d_{in} = 0 .
\]
For each $n \ge 2$, let $M_n$ be the coefficient matrix of the above linear equations of $\{d_{in} : i = 1, 2, 3\}$. Since $A = [1, 1, 0; 1, 0, 1]$, we have
\[
M_n = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ (1 + a)^n & 1 & a^n \end{pmatrix} .
\]

¹ An analytic characteristic function corresponds to a distribution function which has moments $m_k$ of all orders $k$ and for which $\limsup_{k \to \infty} [|m_k|/k!]^{1/k}$ is finite.
Fig 1. Two-leaf tree
Fig 2. Four-leaf tree
Fig 3. One router network
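As a numerical sanity check on the coefficient matrix $M_n$ that appears in the proof of Lemma 2.1, the following Python sketch builds $M_n$ from the routing matrix and the three projection directions and compares its determinant with the expression $(1 + a)^n - a^n - 1$ obtained by expanding the determinant. The function names are hypothetical and the tested values of $a$ and $n$ are arbitrary choices for the example.

```python
import numpy as np

def coefficient_matrix(a, n):
    """M_n for the two-leaf tree with projections Y1, Y2 and Y1 + a * Y2."""
    A = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0]])
    betas = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, a]])
    gamma = betas @ A            # row k is gamma_k = A^T beta_k
    return gamma ** n            # elementwise nth powers

def closed_form_det(a, n):
    """Determinant of M_n obtained by cofactor expansion."""
    return (1.0 + a) ** n - a ** n - 1.0
```

For instance, `np.linalg.det(coefficient_matrix(-1.0, 3))` is zero up to rounding, whereas the same call with `n = 2` gives -2, which matches the role that the value $a = -1$ plays for odd orders only.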
The identifiability problem is equivalent to asking whether $d_{in} = 0$ ($i = 1, 2, 3$), i.e. $\det(M_n) \ne 0$, for all $n \ge 2$. Note that $\det(M_n) = (1 + a)^n - a^n - 1$. Let $f(x) = (1 + x)^n - x^n - 1$. Thus $f'(x) = n[(1 + x)^{n-1} - x^{n-1}]$. If $n \ge 2$ is even, $f'(x) > 0$ and thus $f(x)$ is monotone with a unique zero $x = 0$. If $n > 2$ is odd, $f'(x)$ has a unique zero $x = -1/2$ and thus $f(x)$ has two zeros $x = 0$ and $x = -1$. Under the assumptions $a \ne 0$ and $a \ne -1$, $\det(M_n) = f(a) \ne 0$, and thus $d_{in} = 0$ ($i = 1, 2, 3$) for all $n \ge 2$. Hence each $\omega_i(t)$ is a linear function. The conclusion follows readily.

Lemma 2.1 states that if chosen properly, only three 1D-projections are needed to determine the distribution of X (ignoring its mean). The condition $a \ne 0$ merely says that the third projection $Y_1 + aY_2$ does not coincide with either $Y_1$ or $Y_2$. From the proof we can see that the condition $a \ne -1$ is not needed in order to identify the even order cumulants of each $X_i$, i.e. the even order derivatives of $\log \psi_i(t)$ at $t = 0$, but $a \ne -1$ is required in order to identify the odd order cumulants.

Remark. It is not hard to show that all even order cumulants can be determined by the marginal distributions of arbitrary three distinct² projections $a_j Y_1 + b_j Y_2$, $j = 1, 2, 3$. But some further constraints, i.e. $a_j + b_j \ne 0$ ($j = 1, 2, 3$), which are similar to the condition $a \ne -1$ in the above lemma, are needed in order to determine the odd order cumulants except the first order.

² Two projections are the same if one is a scalar multiple of the other.

Lemma 2.1 provides a simple set of 1D-projections which can be used to identify the X distributions. This can be easily extended to the case where A corresponds to a tree topology. For a general matrix A, the identifiability issue is addressed by the following theorem.

Theorem 2.1. For a general tomography problem $Y = AX$ introduced by (1), let $\beta_k^T Y$, $k = 1, \ldots, K$, be $K$ 1D-projections of Y onto the linear space of its components. Let $M$ be a $K \times I$ matrix whose rows consist of $\beta_k^T A$, and let $M_n$ be a matrix whose elements are the $n$th powers of the corresponding elements of $M$. Then the distribution of X can be determined up to the ambiguity of its mean by the marginal distributions of $\beta_k^T Y$, $k = 1, \ldots, K$, if and only if $M_n$ has full column rank for all $n \ge 2$, where the mean of X satisfies the constraint $AE[X] = E[Y]$.

The proof of Theorem 2.1 follows the same idea as that of Lemma 2.1. The details are omitted to avoid technical redundancy.

Remark. Theorem 2.1 provides a sufficient and necessary condition for the identifiability of the X distribution by using the marginal distributions of a set of
1D-projections of the observed measurements Y. Unfortunately, the full column rank condition on $M_n$, for all $n \ge 2$, is hard to verify for an arbitrary set of projections. Further, it is worth pointing out that for identifiability, $K \ge I$ is required since each $M_n$ has $I$ columns. However, when A satisfies some sufficient conditions such that the distribution of X can be identified from the joint distribution of Y (see [9] for discussions of such A matrices), then we would expect that in most cases, the marginal distributions of a set of $K = I$ properly chosen 1D-projections of Y can be used to identify the distribution of X. This is due to the fact that, by solving the polynomial equations $\det(M_n) = 0$, the set of projection directions $\{\beta_k : k = 1, \ldots, I\}$ (ignoring the scales) which violates the full rank condition has Lebesgue measure zero.

2.2. Parameter estimation

Let $f(X|\theta)$ be the distribution of X with unknown parameter $\theta$. We consider two estimators of $\theta$ derived from the marginal distributions of the 1D-projections $\{\beta_k^T Y, k = 1, \ldots, K\}$, one based on their marginal likelihoods and the other based on their marginal characteristic functions. In later sections we shall give examples of each of these for instances of network tomography problems.

2.2.1. Likelihood based inference

Provided that it is easy to evaluate the univariate distribution of $\beta_k^T Y$, the Kullback-Leibler divergence between the empirical and model distributions of each $\beta_k^T Y$, $k = 1, \ldots, K$, can be used to obtain an estimator of $\theta$. Let $P_n$ be the empirical distribution of Y based on $n$ i.i.d. samples and $l_k(\cdot, \theta)$ be the logarithmic likelihood function of $\beta_k^T Y$. Similar to the maximum pseudo likelihood method in [16], an estimator of $\theta$ based on the 1D-projections can be defined by
\[
\hat\theta_{1D} = \arg\min_\theta \sum_{k=1}^{K} \int -l_k(\beta_k^T Y, \theta) \, dP_n . \tag{2}
\]
Usually there is no closed form solution for $\hat\theta_{1D}$, and the pseudo EM algorithm in [16] or other numerical optimization algorithms may be used.

2.2.2. Characteristic function based inference
Often the distribution of $\beta_k^T Y$ is hard to evaluate since it is a high order convolution of the distributions of $(X_1, \ldots, X_I)$. In this case, as proposed in [9], the generalized method of moments (GMM) based on characteristic functions [7, 14] can be used to obtain an estimator of $\theta$. Let $\psi_i(\cdot, \theta)$ be the characteristic function of $X_i$ and $\hat\varphi_k(\cdot)$ be the empirical characteristic function of $\beta_k^T Y$. Let $A^i$ be the $i$th column of A. Then the characteristic function of $\beta_k^T Y$, say $\varphi_k(\cdot, \theta)$, is equal to
\[
\varphi_k(t, \theta) = E e^{it\beta_k^T Y} = \prod_{i=1}^{I} \psi_i(\beta_k^T A^i t, \theta) .
\]
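The product formula above can be checked numerically with a short Python sketch. The assumptions here are not from the paper: the two-leaf tree routing matrix is used, the link delays are taken to be exponential so that each $\psi_i$ has a closed form, and a finite grid of $t$ values stands in for a weighting measure when a squared distance between the empirical and model characteristic functions is averaged. All function names are hypothetical.

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])       # two-leaf tree routing matrix

def link_cf(t, mean):
    """Characteristic function of an exponential delay with the given mean."""
    return 1.0 / (1.0 - 1j * mean * t)

def model_cf(t, beta_k, means):
    """Product over links of psi_i evaluated at (beta_k^T A^i) t."""
    gamma = A.T @ beta_k              # gamma_i = beta_k^T A^i
    out = np.ones_like(t, dtype=complex)
    for g, m in zip(gamma, means):
        out *= link_cf(g * t, m)
    return out

def empirical_cf(t, projection):
    """Empirical characteristic function of a 1D projection at each t."""
    return np.exp(1j * np.outer(t, projection)).mean(axis=1)

def cf_contrast(y, betas, means, t_grid):
    """Average squared distance between empirical and model CFs over the grid."""
    total = 0.0
    for beta_k in betas:
        diff = empirical_cf(t_grid, y @ beta_k) - model_cf(t_grid, beta_k, means)
        total += np.mean(np.abs(diff) ** 2)
    return total
```

Evaluated on data simulated from the model, the contrast should be small at the generating means and larger elsewhere, which is the property a characteristic-function based estimator relies on.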
Now an estimator of $\theta$ based on characteristic functions of the 1D-projections can be defined as follows:
\[
\hat\theta_{CF} = \arg\min_\theta \sum_{k=1}^{K} \int |\hat\varphi_k(t) - \varphi_k(t, \theta)|^2 \, d\mu(t), \tag{3}
\]
where $\mu(\cdot)$ is a predetermined distribution measure on $R$. In general, (3) may be solved numerically. For computational convenience, we may use an empirical distribution of $\mu$ in the above as an approximation. When each $X_i$ follows a mixture distribution, i.e. $\psi_i(t) = \sum_k p_k^i \Psi_k^i(t)$, $p_k^i \ge 0$, $\sum_k p_k^i = 1$, where $\Psi_k^i$ are the characteristic functions of the corresponding mixture components, and $\theta$ consists of the coefficient parameters $p_k^i$, then for each $i$, the above objective function is a quadratic function of $\{p_k^i\}$. Thus the optimization can be done iteratively by quadratic programming (see [9] for details). Asymptotic properties such as consistency and asymptotic normality have been studied for this estimator under mild conditions [7]. An improved estimator that takes into account the correlation of the empirical characteristic function $\hat\varphi_k$ at different points $t$ can also be considered, following similar techniques developed in [9].

3. Optimal design of 1D projections for the Gaussian tomography model

In Section 2, it has been shown that the distribution of X can be identified using the marginal distributions of a set of $K$ ($K \ge I$) properly chosen 1D-projections. In this section, we consider the optimal design of these projections. We consider two main factors: 1) the statistical efficiency of the estimators based on these projections and 2) the computational complexity, determined primarily by the number of such 1D-projections, i.e. $K$. To achieve an optimal design, we first consider a simple Gaussian tomography model defined below and investigate the design issue in depth. Surprisingly, we found that for this simple model there is a minimal set of $I$ 1D-projections such that the estimator based on these projections is asymptotically as efficient as the MLE. Such a set of projections constitutes the optimal design.

3.1. The Gaussian tomography model

The Gaussian tomography model is defined by the tomography model in (1) when the components of X have mutually independent Gaussian distributions, i.e., $X_i \sim N(\mu_i, \theta_i)$, $i = 1, \ldots, I$, where $N(\mu_i, \theta_i)$ stands for the normal distribution with mean $\mu_i$ and variance $\theta_i$. Notice that for a set of 1D-projections of Y that satisfies the condition of Theorem 2.1, $\theta = (\theta_1, \ldots, \theta_I)^T$ is identifiable but $(\mu_1, \ldots, \mu_I)^T$ is not because of the mean ambiguity. For simplicity assume that $\mu_i = 0$ for $i = 1, \ldots, I$. The Gaussian tomography model is then
\[
Y \sim N(0, A\Theta A^T), \tag{4}
\]
where $\theta$ is the parameter of interest and $\Theta$ is a diagonal matrix with diagonal elements $\theta_i$. Since the distribution of $\beta_k^T Y$ is also Gaussian and thus easy to evaluate, given a set of 1D-projections, we use the likelihood based method given by (2) in Section 2 to estimate $\theta$. We investigate its statistical efficiency as compared to that of the MLE, denoted as $\hat\theta_{MLE}$. Let $\Sigma = A\Theta A^T$, and let $\to_d$ indicate convergence in distribution. The following lemmas state the asymptotic distributions of the estimators $\hat\theta_{MLE}$ and $\hat\theta_{1D}$.

Lemma 3.1. Suppose that $\theta$ is identifiable. Then
\[
\sqrt{n}(\hat\theta_{MLE} - \theta) \to_d N(0, I_F^{-1}),
\]
where $I_F$ is the Fisher information matrix with elements $I_F(a, b) = \frac{1}{2} U_{ab}^2$ and $U_{ab}$ is the $(a, b)$th element of $U = A^T \Sigma^{-1} A$.
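Proof. The log-likelihood function of Y can be written as
$$\log p_Y(y; \theta) = -\frac{1}{2} Y^T \Sigma^{-1} Y - \frac{1}{2} \log\{(2\pi)^J \det(\Sigma)\},$$
where $\Sigma = A\Theta A^T$ and J is the dimension of Y. Notice that $\Sigma = \sum_{j=1}^{I} \theta_j A^j A^{jT}$, where $A^j$ denotes the jth column of A. By taking partial derivatives w.r.t. each $\theta_i$, we have
$$\frac{\partial \log p_Y(y; \theta)}{\partial \theta_i} = \frac{1}{2} \left\{ Y^T \Sigma^{-1} A^i A^{iT} \Sigma^{-1} Y - \mathrm{vec}(\Sigma^{-1})^T \mathrm{vec}(A^i A^{iT}) \right\},$$
where vec(M) vectorizes M. So the score functions, say $s_i$, i = 1, . . . , I, are
$$s_i(Y) \equiv -\frac{\partial \log p(Y; \theta)}{\partial \theta_i} = \frac{1}{2} \left\{ -(Y^T \Sigma^{-1} A^i)^2 + \mathrm{trace}(\Sigma^{-1} (A^i A^{iT})) \right\},$$
where trace(M) denotes the trace of M. Hence the result follows from
$$\mathrm{cov}(s_j(Y), s_k(Y)) = \frac{1}{4} \mathrm{cov}\left((Y^T \Sigma^{-1} A^j)^2, (Y^T \Sigma^{-1} A^k)^2\right) = \frac{1}{2} \left\{\mathrm{cov}(Y^T \Sigma^{-1} A^j, Y^T \Sigma^{-1} A^k)\right\}^2 = \frac{1}{2} (A^{jT} \Sigma^{-1} A^k)^2.$$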
The second equality in the above uses the fact that for any zero-mean bivariate Gaussian random vector $(N_1, N_2)$, $\mathrm{cov}(N_1^2, N_2^2) = 2[\mathrm{cov}(N_1, N_2)]^2$.

Lemma 3.2. For a set of 1D-projections $\{\beta_k^T Y, k = 1, \ldots, K\}$, let $\gamma_k = A^T \beta_k$ and $\sigma_k^2 = \beta_k^T \Sigma \beta_k$. Suppose that θ can be identified by these 1D-projections. Then the likelihood based estimator $\hat\theta_{1D}$ defined by (2) using these projections has the asymptotic distribution
$$\sqrt{n}(\hat\theta_{1D} - \theta) \to_d N(0, C_{1D}^{-1} I_{1D} C_{1D}^{-1}), \tag{5}$$
where $C_{1D} = \frac{1}{2} V^T V$ and $I_{1D} = \frac{1}{2} V^T W V$. Here V is a K × I matrix with elements $V_{ka} = \sigma_k^{-2} \gamma_{ka}^2$ and W is a K × K symmetric matrix with elements $W_{kk'} = \sigma_k^{-2} \sigma_{k'}^{-2} (\beta_k^T \Sigma \beta_{k'})^2$. Furthermore, if K = I and V is invertible, then the limit covariance matrix in (5) simplifies to $2 V^{-1} W V^{-T}$.

Proof. Notice that
$$\beta_k^T Y \sim N(0, \sigma_k^2).$$
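Let $l_k(\cdot, \theta)$ be the marginal logarithmic likelihood function of $\beta_k^T Y$, that is,
$$l_k(z, \theta) = -\frac{1}{2} \sigma_k^{-2} z^2 - \frac{1}{2} \log(2\pi\sigma_k^2).$$
By the classical theory on M-estimators, we have
$$\sqrt{n}(\hat\theta_{1D} - \theta) \to_d N(0, C_{1D}^{-1} I_{1D} C_{1D}^{-1}), \tag{6}$$
where
$$C_{1D} = -\sum_{k=1}^{K} E\left[\frac{\partial^2 l_k(\beta_k^T Y, \theta)}{\partial\theta \, \partial\theta^T}\right] = \sum_{k=1}^{K} E\left[\frac{\partial l_k(\beta_k^T Y, \theta)}{\partial\theta} \left(\frac{\partial l_k(\beta_k^T Y, \theta)}{\partial\theta}\right)^T\right] \tag{7}$$
and
$$I_{1D} = \mathrm{var}\left(\sum_{k=1}^{K} \frac{\partial l_k(\beta_k^T Y, \theta)}{\partial\theta}\right). \tag{8}$$
It is not hard to verify that
$$\frac{\partial l_k(z, \theta)}{\partial\theta_j} = \frac{\partial l_k(z, \theta)}{\partial\sigma_k^2} \frac{\partial\sigma_k^2}{\partial\theta_j} = \frac{\gamma_{kj}^2 (z^2 - \sigma_k^2)}{2\sigma_k^4}.$$
Thus
$$C_{1D}(a, b) = \frac{1}{2} \sum_{k=1}^{K} \sigma_k^{-4} \gamma_{ka}^2 \gamma_{kb}^2$$
and
$$I_{1D}(a, b) = \sum_{k,k'=1}^{K} \frac{\gamma_{ka}^2 \gamma_{k'b}^2}{4\sigma_k^4 \sigma_{k'}^4} \mathrm{cov}\left((\beta_k^T Y)^2, (\beta_{k'}^T Y)^2\right) = \frac{1}{2} \sum_{k,k'=1}^{K} \frac{\gamma_{ka}^2}{\sigma_k^4} \frac{\gamma_{k'b}^2}{\sigma_{k'}^4} (\beta_k^T \Sigma \beta_{k'})^2.$$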
Hence, $C_{1D} = \frac{1}{2} V^T V$ and $I_{1D} = \frac{1}{2} V^T W V$.

3.2. Optimal projections

Given the asymptotic variance $C_{1D}^{-1} I_{1D} C_{1D}^{-1}$ of the estimator $\hat\theta_{1D}$ (see (6), (7) and (8)), it is not obvious how one can choose a set of 1D-projections so that the asymptotic covariance matrix is minimized. By (7), $C_{1D}$ is the sum of the information matrices of the individual projections $\beta_k^T Y$. Note that $\mathrm{cov}((\beta_k^T Y)^2, (\beta_{k'}^T Y)^2) = 2(\beta_k^T \Sigma \beta_{k'})^2 \ge 0$. By the expression for $\partial l_k / \partial\theta_j$ derived in the proof of Lemma 3.2, this implies that the score functions of the individual 1D-projections, i.e. $\partial l_k(\beta_k^T Y, \theta) / \partial\theta$ (k = 1, . . . , K), are non-negatively correlated. Intuitively, the directions $\beta_k$ should be chosen such that the scores, and thus the projections $\beta_k^T Y$, are as uncorrelated with each other as possible. More precisely, the aim is to reduce $I_{1D}$ significantly while keeping $C_{1D}$ stable. Since each projection is a linear function of the components of Y and the individual components of X are mutually independent, there will not be much information overlap if the projections are chosen such that each is close to an individual component of X. This motivates the following design for the Gaussian tomography model. Take K = I. For each k ∈ {1, . . . , I}, pick $\beta_k$ such that the correlation between $\beta_k^T Y$ and $X_k$, say $\mathrm{corr}(\beta_k^T Y, X_k)$, is maximized. Since
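$$\mathrm{corr}(X_k, \beta^T Y) = \frac{\mathrm{cov}(X_k, \beta^T A X)}{\sqrt{\mathrm{var}(X_k) \, \mathrm{var}(\beta^T Y)}} = \frac{\beta^T A^k \theta_k^{1/2}}{\sqrt{\beta^T \mathrm{cov}(Y) \beta}},$$
the scale on β does not affect the correlation. Furthermore, notice that the scale does not affect either $C_{1D}$ or $I_{1D}$. Assuming $\mathrm{var}(\beta_k^T Y) = 1$, $\beta_k$ can be determined as
$$\beta_k = \lambda_k^{-1} \Sigma^{-1} A^k, \tag{9}$$
where $\lambda_k = (A^{kT} \Sigma^{-1} A^k)^{1/2}$. Indeed, by the Cauchy-Schwarz inequality, $\beta^T A^k \le (A^{kT} \Sigma^{-1} A^k)^{1/2} (\beta^T \Sigma \beta)^{1/2}$, with equality when β is proportional to $\Sigma^{-1} A^k$. Theorem 3.1 below shows that the projection directions chosen by the above correlation rule lead to an efficient estimator of θ.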
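The correlation rule (9) can be checked numerically on a small case. The following is a minimal sketch, not the authors' code: the 2 × 3 routing matrix and the variances in `theta` are hypothetical values chosen for illustration. It builds the directions in (9), forms V and W as in Lemma 3.2, and compares the resulting limit covariance with the inverse Fisher information of Lemma 3.1.

```python
import numpy as np

# Hypothetical toy routing matrix (an assumption, not from the paper):
# two paths sharing link 1, i.e. Y1 = X1 + X2 and Y2 = X1 + X3.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
theta = np.array([1.0, 2.0, 3.0])    # illustrative link variances

Sigma = A @ np.diag(theta) @ A.T     # covariance of Y under model (4)
Sigma_inv = np.linalg.inv(Sigma)

# Fisher information of Lemma 3.1: I_F(a, b) = U_ab^2 / 2, U = A^T Sigma^{-1} A.
U = A.T @ Sigma_inv @ A
fisher = 0.5 * U**2

# Maximum-correlation rule (9): beta_k = Sigma^{-1} A^k / lambda_k.
lam = np.sqrt(np.diag(U))            # lambda_k = (A^{kT} Sigma^{-1} A^k)^{1/2}
B = (Sigma_inv @ A) / lam            # column k holds beta_k

# Quantities of Lemma 3.2.
cross = B.T @ Sigma @ B              # entries beta_k^T Sigma beta_k'
sigma2 = np.diag(cross)              # sigma_k^2, equal to 1 by construction
gamma = A.T @ B                      # gamma[a, k] = (A^T beta_k)_a
V = (gamma.T ** 2) / sigma2[:, None]             # V[k, a] = gamma_ka^2 / sigma_k^2
W = cross**2 / np.outer(sigma2, sigma2)

# Limit covariance 2 V^{-1} W V^{-T} of the 1D-projection estimator (K = I),
# to be compared with the MLE limit covariance I_F^{-1}.
V_inv = np.linalg.inv(V)
cov_1d = 2.0 * V_inv @ W @ V_inv.T
cov_mle = np.linalg.inv(fisher)

print(np.allclose(cov_1d, cov_mle))
```

In this toy case the two covariance matrices agree up to rounding, which is what the theorem stated next asserts for the general Gaussian tomography model.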
Theorem 3.1. Under the same condition as Lemma 3.1, if the 1D-projections are taken by the maximum correlation rule given by (9), then for the simple Gaussian tomography model, the 1D-projection estimator given by (2) is asymptotically as efficient as MLE.

Proof. By (9), $\gamma_{ka} = \lambda_k^{-1} A^{aT} \Sigma^{-1} A^k$. Plugging this into (5), it is not hard to verify that
$$V = 2 S^{-1} I_F \quad \text{and} \quad W = 2 S^{-1} I_F S^{-1},$$
where $I_F$ is the same as in Lemma 3.1 and S is a diagonal matrix with diagonal elements $\lambda_k^2$, k = 1, . . . , I. Thus V is invertible and the limit covariance matrix in (5) simplifies to $I_F^{-1}$. The result follows.

Notice that the optimal 1D-projections $\beta_k$ in (9) depend on the unknown parameters θ and thus are unavailable. Fortunately, $\beta_k$ only depends on Σ, and the sample covariance of Y, say $\hat\Sigma$, is a $\sqrt{n}$-consistent estimator of Σ. Thus we can plug in $\hat\Sigma$ and obtain empirical estimates of $\beta_k$. Assuming that θ belongs to a compact subset of $R^I$, it follows from the theory of generalized M-estimators in [2] that this plug-in 1D-projection estimator is asymptotically efficient. We omit the tedious technical verification but refer to Chapter 7 of [2] for details.

More realistic Gaussian tomography models assume a mean-variance relationship and thus the above theorem may not hold. But simulation studies in the next section suggest that the above plug-in 1D-projections still work better than arbitrary projections and that the performance of the corresponding estimator is close to MLE.

3.3. Comparison with other projections

We now compare the statistical efficiency of $\hat\theta_{1D}$ based on the optimal set of 1D-projections given by (9) with that based on other choices of projections. We consider two such choices. The first is based on the set of all pairwise 2D-projections of Y proposed in [16], called $\hat\theta_{2D}$; the second is based on a set of K randomly chosen 1D-projections while adjusting for the correlation of Y, i.e., each random projection is generated by
$$\beta_k^T Y = \alpha_k^T \Sigma^{-1/2} Y, \tag{10}$$
where $\alpha_k$ is drawn independently from the standard multivariate Gaussian distribution with an identity covariance matrix. Let $\hat\theta_{2D}$ be the estimator of θ based on all pairwise 2D-projections of Y and let $f((Y_k, Y_{k'})|\theta)$ be the distribution of the bivariate variable $(Y_k, Y_{k'})$. Then $\hat\theta_{2D}$ is defined by
$$\hat\theta_{2D} = \arg\min_{\theta} \sum_{1 \le k < k' \le J} \int -\log f((Y_k, Y_{k'})|\theta) \, dP_n. \tag{11}$$
Following similar arguments as in Lemma 3.2, it can be shown that
$$\sqrt{n}(\hat\theta_{2D} - \theta) \to_d N(0, C_{2D}^{-1} I_{2D} C_{2D}^{-1}), \tag{12}$$
Fig 4. The limit standard deviations of $\sqrt{n}(\hat\theta - \theta)$ for four estimators for the 16 variance parameters $\theta_i$: the two random 1D-projection estimators with K = 2I = 32 and K = 10I = 160 projections (reporting medians of 100 replications), $\hat\theta_{2D}$ and the optimal $\hat\theta_{1D}$. A is given in (13) below. The x-axis represents the values of the 16 specified variance parameters, $\theta_i$, i = 1, . . . , 16.
where
$$C_{2D}(a, b) = \frac{1}{2} \sum_{k < k'} \left( [A_{ka}, A_{k'a}] \, [\Sigma^{kk'}_{kk'}]^{-1} \, [A_{kb}, A_{k'b}]^T \right)^2,$$
$$I_{2D}(a, b) = \frac{1}{2} \sum_{k < k'} \sum_{l < l'} \left( [A_{ka}, A_{k'a}] \, [\Sigma^{kk'}_{kk'}]^{-1} \, \Sigma^{ll'}_{kk'} \, [\Sigma^{ll'}_{ll'}]^{-1} \, [A_{lb}, A_{l'b}]^T \right)^2,$$
and
$$\Sigma^{ll'}_{kk'} = \begin{pmatrix} \Sigma_{kl} & \Sigma_{kl'} \\ \Sigma_{k'l} & \Sigma_{k'l'} \end{pmatrix}$$
is a 2 × 2 matrix consisting of the elements of Σ.

Since it is hard to evaluate in closed form the expectation of the asymptotic covariance matrix of $\hat\theta_{1D}$ using random 1D-projections, we use simulations to study its performance. In the simulation, we use a 7 × 16 matrix A, shown in (13) below, taken from a later simulation study of a traffic demand tomography problem on the one-router network in Figure 3. That is, I = 16 and J = 7. The parameters $\theta = (\theta_1, \ldots, \theta_{16})^T$ are generated i.i.d. from the exponential distribution with mean 1, and remain fixed throughout the simulation. In each simulation run, we randomly draw K = 2I and K = 10I 1D-projection directions $\beta_k$ as in (10), and then calculate the limit covariance matrices of these two estimators using (5). We then replicate this procedure 100 times. For comparison, we also calculate the limit covariance matrix of $\sqrt{n}(\hat\theta_{1D} - \theta)$ for the optimal 1D-projections using (9) (the same as that of $\sqrt{n}(\hat\theta_{MLE} - \theta)$), and that of $\sqrt{n}(\hat\theta_{2D} - \theta)$. Figure 4 shows the limit standard deviations of the 16 variance parameters $\theta_i$ for four estimators: the two random 1D-projection estimators (reporting medians of 100 replications), $\hat\theta_{2D}$ and the optimal $\hat\theta_{1D}$. The plot shows that the 2D-projection estimator is not asymptotically efficient when compared with the optimal 1D-projection estimator or MLE. The plot also suggests that as the number of projections grows, the limiting covariance matrix for the random 1D-projections converges to the optimal one, but we leave this to future studies.
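The comparison between random and optimal 1D-projections described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' code. The helper names `router_matrix`, `limit_cov` and `median_sd` are hypothetical, and the routing matrix assumes OD pairs ordered origin by origin, with four origin-link rows followed by the first three destination-link rows. The sandwich covariance in (5) is evaluated as $2 V^{+} W V^{+T}$, where $V^{+}$ is the pseudo-inverse; this equals $C_{1D}^{-1} I_{1D} C_{1D}^{-1}$ whenever V has full column rank.

```python
import numpy as np

def router_matrix(n_links=4):
    """7 x 16 routing matrix for a one-router network (assumed OD ordering)."""
    ones = np.ones((1, n_links))
    origin = np.kron(np.eye(n_links), ones)   # row i: OD pairs entering on link i
    dest = np.kron(ones, np.eye(n_links))     # row j: OD pairs leaving on link j
    return np.vstack([origin, dest[:-1]])     # the last destination row is redundant

def limit_cov(A, theta, B):
    """Limit covariance of Lemma 3.2 for the directions in the columns of B."""
    Sigma = A @ np.diag(theta) @ A.T
    cross = B.T @ Sigma @ B                   # entries beta_k^T Sigma beta_k'
    sigma2 = np.diag(cross)
    V = ((A.T @ B).T ** 2) / sigma2[:, None]  # V[k, a] = gamma_ka^2 / sigma_k^2
    W = cross**2 / np.outer(sigma2, sigma2)
    V_pinv = np.linalg.pinv(V)                # (V^T V)^{-1} V^T for full column rank
    return 2.0 * V_pinv @ W @ V_pinv.T

rng = np.random.default_rng(0)
A = router_matrix()
J, I = A.shape
theta = rng.exponential(1.0, size=I)          # variances, fixed across replications
Sigma = A @ np.diag(theta) @ A.T

# Optimal directions (9); the scale of each beta_k does not matter.
B_opt = np.linalg.solve(Sigma, A)
sd_opt = np.sqrt(np.diag(limit_cov(A, theta, B_opt)))

# MLE benchmark from Lemma 3.1.
U = A.T @ np.linalg.solve(Sigma, A)
sd_mle = np.sqrt(np.diag(np.linalg.inv(0.5 * U**2)))

# Random directions (10): beta_k = Sigma^{-1/2} alpha_k.
evals, evecs = np.linalg.eigh(Sigma)
Sigma_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T

def median_sd(K, n_rep=100):
    """Median limit standard deviations over n_rep draws of K random directions."""
    sds = []
    for _ in range(n_rep):
        B = Sigma_inv_half @ rng.standard_normal((J, K))
        sds.append(np.sqrt(np.diag(limit_cov(A, theta, B))))
    return np.median(sds, axis=0)

sd_rand_2I = median_sd(2 * I)
sd_rand_10I = median_sd(10 * I)
```

Plotting `sd_rand_2I`, `sd_rand_10I` and `sd_opt` against `theta` gives a picture in the spirit of Figure 4; the 2D-projection estimator is omitted here to keep the sketch short.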
4. Simulation studies for realistic models in network tomography

In Section 3, we have shown that for the simple Gaussian tomography model (4), the estimator $\hat\theta_{1D}$ using the optimal set of 1D-projections of Y is asymptotically as efficient as MLE (Theorem 3.1). For other models, however, it is not clear whether the 1D-projection method can lead to an efficient estimator, since the linear structure inherent in the Gaussian model no longer exists. In addition, the correlation rule (9), which gives an optimal set of 1D-projections for the simple model, may no longer be optimal. Since theoretical investigations of efficiency are difficult for general models, we resort to simulations to study the performance of the 1D-projection method (i.e. $\hat\theta_{1D}$ and $\hat\theta_{CF}$), especially using 1D-projections that are obtained from the correlation rule (9). We study the performance of the method under two realistic models in network tomography. The first example is traffic demand tomography [5, 21] with a Gaussian OD traffic model where the mean and variance of traffic counts are related by a power equation [5], and the second example is delay tomography [16, 18, 20] with mixture models for link delay distributions. We demonstrate that in both cases the 1D-projection method yields satisfactory results.

4.1. Traffic demand tomography using the Gaussian OD traffic model

Let $X = (X_1, \ldots, X_I)^T$ be the Origin-Destination (OD) traffic counts in a network. In traffic demand tomography, we observe Y = AX, where Y is the vector of measured link traffic counts, collected for instance using SNMP, and A is the network routing matrix with elements 0 or 1. The problem is to estimate the distribution of each $X_i$ from independent samples of Y. Following [5], assume a Gaussian OD traffic model with a power relation between the variance and mean of traffic counts, i.e., each $X_i$ (i = 1, . . . , I) is an independent Gaussian random variable with $\mathrm{var}(X_i) = \phi (E X_i)^c$, where φ is an unknown scale parameter and c > 0 is a known power exponent. Let $\theta_i = E X_i$ be the mean count of the ith OD traffic. The parameter of interest is $\theta = (\theta_1, \ldots, \theta_I)^T$ and φ is an unknown nuisance parameter.

In the simulation, we use the same one-router network with four attached input/output links as in [5], reproduced here in Figure 3. For this network, there are a total of 16 OD pairs from all pairwise combinations of an input and output link. For a certain arrangement of the 16 OD pairs, as described in [5], the routing matrix A can be written as the following 7 × 16 matrix of zeros and ones:
$$A = \begin{pmatrix}
1&1&1&1&0&0&0&0&0&0&0&0&0&0&0&0\\
0&0&0&0&1&1&1&1&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&1&1&1&1&0&0&0&0\\
0&0&0&0&0&0&0&0&0&0&0&0&1&1&1&1\\
1&0&0&0&1&0&0&0&1&0&0&0&1&0&0&0\\
0&1&0&0&0&1&0&0&0&1&0&0&0&1&0&0\\
0&0&1&0&0&0&1&0&0&0&1&0&0&0&1&0
\end{pmatrix}. \tag{13}$$
For each OD pair $X_i$, we generate the mean traffic rate $\theta_i$ independently from a lognormal distribution with mean 10 and standard deviation 0.95 on the log scale. The variance of the traffic rate is generated from the mean-variance relation $\mathrm{var}(X_i) = 1000\,\theta_i$, i.e., the scale parameter is φ = 1000 and the power exponent is c = 1. These parameters are chosen to be compatible with the real data observed on the same one-router network in [5].
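The data-generating step of this simulation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the routing matrix is built assuming OD pairs ordered origin by origin (four origin-link rows, then the first three destination-link rows), and the seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Routing matrix of the one-router network under the assumed OD ordering.
ones = np.ones((1, 4))
A = np.vstack([np.kron(np.eye(4), ones),        # four origin-link rows
               np.kron(ones, np.eye(4))[:-1]])  # first three destination-link rows

phi, c, n = 1000.0, 1.0, 1000                   # scale, power exponent, sample size
theta = rng.lognormal(mean=10.0, sigma=0.95, size=16)   # mean OD traffic rates
sd = np.sqrt(phi * theta**c)                    # var(X_i) = phi * theta_i^c

X = theta + sd * rng.standard_normal((n, 16))   # OD counts (not observed)
Y = X @ A.T                                     # link counts (observed), n x 7

# Moments of Y implied by the model; these are the quantities that a
# moment-based estimator would match against their sample versions.
mean_Y = A @ theta
cov_Y = A @ np.diag(phi * theta**c) @ A.T

print(np.round(Y.mean(axis=0) / mean_Y, 3))     # ratios should be close to 1
```

Only `Y` would be available to an estimator in practice; `X` is kept here solely to make the relation Y = AX explicit.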
Fig 5. The median estimation errors of mean traffic rates for the 16 OD pairs from 100 simulation runs. The estimation error is measured by $|\log\hat\theta_i - \log\theta_i|$, i = 1, . . . , 16. The four estimates are: the moment estimator (MM), estimates obtained from the marginal likelihoods of 2D-projections (2D), estimates obtained from the marginal likelihoods of 1D-projections using the optimal set of projections in (9) (1D), and MLE. The x-axis represents the specified mean traffic intensity values for the 16 OD pairs on a log scale. Details of the simulation setup are described in Section 4.1.
In each simulation run, we generate n = 1000 independent samples of OD traffic counts and estimate the mean OD traffic rates from the resulting link traffic counts. We consider four estimators for performance comparison: MLE, the likelihood based estimator obtained from the correlation-based 1D-projections, the likelihood based estimator obtained from all pairwise 2D-projections as proposed in [16], and a moment estimator. Similar to that used in [21], the moment estimator here is obtained by solving a system of over-determined linear equations constructed using the mean and variance of the link traffic counts Y. The moment estimator is also used as the starting value for the other estimators, which are obtained by numerical optimization. Figure 5 shows the median estimation errors of mean traffic rates for the 16 OD pairs from 100 simulation runs. The estimation error for each $\theta_i$ is measured by $|\log\hat\theta_i - \log\theta_i|$ for all estimates. The plot shows that the correlation based 1D-projection estimator performs slightly worse than MLE, but much better than the 2D-projection estimator and the moment method. We have observed very similar results for other parameter settings. This suggests that the correlation based 1D-projections may be close to optimal for the Gaussian OD traffic model with a power mean-variance relation.

4.2. Network delay tomography using mixture modeling of link delays

In the context of network delay tomography, one is interested in inferring the distributions of network link delays$^3$ from the end-to-end delay measurements, obtained either passively or actively. As an example, consider the four-leaf network shown in Figure 2. Here I = 7 and J = 4. Suppose that at the root of the tree, we send active multicast probes to the four leaves, resulting in four simultaneous end-to-end delay measurements for each probe. Let $X = (X_1, \ldots, X_7)^T$ be the internal link delays, and $Y = (Y_1, \ldots, Y_4)^T$ be the end-to-end delay measurements. Then Y = AX, where A is the routing matrix with elements 0 or 1 derived from the tree. The network delay tomography problem is to infer the distribution of X from independent samples of Y.

$^3$ To be precise, the delay here is the queuing delay, which excludes the constant link propagation delay; we omit "queuing" when the context is clear.

There has been a significant amount of work on network delay tomography [3, 10, 16, 18, 20], and most of these approaches use likelihood based estimation. The likelihood based estimation uses a discrete distribution with equal width bins to approximate a continuous link delay distribution, where the same bin width is used for all the links. However, for a heterogeneous network such as the Internet, such a discretization process will result in an over-parameterized link delay model and hence lead to high computational complexity. Recently, a characteristic function based estimation approach has been proposed as an alternative that accommodates network heterogeneity [9], using flexible mixture modeling of link delays. As described in Section 2.2.2, instead of minimizing the Kullback-Leibler distance between the empirical and model distributions, from which the maximum likelihood estimator is derived, their estimator is derived by minimizing an L2 distance between the empirical and model characteristic functions. It is also found that the estimator derived from comparing the characteristic functions of low dimensional components of Y yields better performance than the one based on Y itself, where the difficulty of the latter is how to choose an appropriate high dimensional weight function µ. In the following, we run simulations of the delay tomography model on the four-leaf tree, and compare the performance of $\hat\theta_{CF}$ defined in (3) based on 1D-projections with that based on all pairwise 2D-projections of Y.
We do not compare the estimates against MLE because MLE is computationally infeasible for the flexible mixture model that we use in deriving these estimates.

Fig 6. The estimated cumulative distribution functions of link delays on a four-leaf tree (Figure 2) from 1000 end-to-end delay measurements. The true delay distributions, shown as solid lines, are those generated from an M/M/1 queue. All three estimates are obtained using the marginal characteristic function based approach described in Section 2.2.2 using a mixture model of piecewise uniform of 6 bins and an exponential tail. The estimates are: 2D-projection method (2D), 1D-projection method using projections in (9) (1D.Cor), and 1D-projection method using I random projections in (10) (1D.Rand). Details of the simulation setup are described in Section 4.2.

In the simulation, each link delay distribution $X_i$, i = 1, . . . , I, is generated independently from a queuing distribution of an M/M/1 queue of the following form [11]:
$$P(X_i > x) = u_i \exp(-v_i^{-1} x), \quad x > 0, \quad \text{and} \quad P(X_i = 0) = 1 - u_i,$$
where $0 < u_i < 1$ is the utilization of the queue, and $v_i^{-1} > 0$ is the service rate of the queue times $(1 - u_i)$. For each link delay $X_i$, we generate a corresponding $u_i$ from a uniform distribution on the interval [0.3, 0.7], and $v_i$ from an exponential distribution with mean 3. To obtain our estimates, we first model each link delay by a piecewise uniform distribution with an exponential tail, using 10 bins placed at quantiles of the distribution$^4$. The estimates of the link delay distributions based on 1D-projections of Y are obtained by (3), where a Gaussian weight function, with standard deviation 5 after normalizing each projection, is used for µ to calculate the L2 distance, and the integrals are approximated by Monte Carlo methods. The estimates based on 2D-projections are computed similarly.

$^4$ The quantiles are unknown in real applications, but they are assumed to be predetermined here for simplicity. Otherwise, they can be estimated through iterations of the estimation process with an initial value. This has been suggested in [9] as a strategy for placing bins for modeling the link delays and has been shown to yield accurate estimators.

In each simulation run, a total of 1000 delay samples are generated for each link in the four-leaf tree from its specified delay distribution, and then the end-to-end delays are computed. For each of the seven link delay distributions, we consider three estimators: $\hat\theta_{CF}$ using the correlation based 1D-projections as in (9), $\hat\theta_{CF}$ using K = I = 7 random 1D-projections adjusted for the covariance of Y, and the characteristic function based estimator using all pairwise 2D-projections as in [16]. Figure 6 plots the cumulative distributions of the estimated link delays along with the generated true distribution for one simulation run. From the figure, we observe that all estimators yield satisfactory results. To measure the accuracy of the estimates, we use the Mallows' distance [17], defined for a cumulative distribution function F and its estimate $\hat F$ by
$$M(F, \hat F) = \int_0^1 \left| F^{-1}(p) - \hat F^{-1}(p) \right| dp,$$
where $F^{-1}$ and $\hat F^{-1}$ are the inverse cumulative distribution functions. The Mallows' distance can be viewed as the average absolute difference in quantiles between two distributions. Because the Mallows' distance scales linearly with the scale of the distributions, we use $M(F, \hat F)/\sigma_F$, the normalized Mallows' distance, to measure the difference between F and the estimate $\hat F$, where $\sigma_F$ is the standard deviation of F. We repeat the simulation 100 times and compute the normalized Mallows' distance between the estimated and the simulated distributions for all links. Figure 7 reports the median of the normalized Mallows' distance between the three estimates and the generated true distributions for each link. We observe that the estimator using 2D-projections yields the best results overall, the estimator using the correlation based 1D-projections is a close second, and the one using random 1D-projections performs the worst. This indicates that although the correlation based 1D-projections may not be the optimal directions, the information loss from using these 1D-projections as compared to all pairwise 2D-projections is not significant.
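The delay model and the accuracy measure used in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `sample_mm1_delay` and `normalized_mallows` are hypothetical, the values of `u` and `v` are arbitrary, and empirical quantiles of samples stand in for the inverse distribution functions $F^{-1}$ and $\hat F^{-1}$, with the integral over p approximated on a midpoint grid.

```python
import numpy as np

def sample_mm1_delay(u, v, size, rng):
    """Draw delays with P(X = 0) = 1 - u and P(X > x) = u * exp(-x / v) for x > 0."""
    busy = rng.random(size) < u                  # delay is positive with probability u
    return np.where(busy, rng.exponential(v, size), 0.0)

def normalized_mallows(sample_true, sample_est, n_grid=999):
    """Approximate the Mallows' distance divided by the standard deviation of F.

    The integral of |F^{-1}(p) - Fhat^{-1}(p)| over (0, 1) is replaced by an
    average over a midpoint grid of p values, using empirical quantiles.
    """
    p = (np.arange(n_grid) + 0.5) / n_grid
    q_true = np.quantile(sample_true, p)
    q_est = np.quantile(sample_est, p)
    return np.mean(np.abs(q_true - q_est)) / np.std(sample_true)

rng = np.random.default_rng(2)
x = sample_mm1_delay(u=0.5, v=3.0, size=1000, rng=rng)   # illustrative u and v

d_same = normalized_mallows(x, x)          # identical distributions
d_shift = normalized_mallows(x, x + 2.0)   # every quantile shifted by 2
```

A pure location shift moves every quantile by the same amount, so the unnormalized distance in the second call equals the shift, which gives a quick sanity check on an implementation.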
Fig 7. Medians of the normalized Mallows' distance for the three link delay distribution estimates from 100 simulation runs, where each simulation run uses the same setup as in Figure 6.
5. Conclusion and future work

This paper proposes a one-dimensional linear projection method for solving a class of linear inverse problems in network tomography. For the simple Gaussian tomography model, the optimal set of 1D-projections is derived and it is shown that a likelihood based estimator based on these 1D-projections is asymptotically as efficient as the usual maximum likelihood method. For more realistic models in network tomography, simulation studies show that the estimators derived from the marginal distributions or marginal characteristic functions of 1D-projections perform well for large sample sizes. Future work includes generalization of the optimal design of 1D-projections to non-Gaussian tomography models and small sample studies of the proposed method.

Acknowledgments. We would like to thank Gang Liang at UC Irvine for sharing the code for their pseudo likelihood method and David James in our department for helpful comments on the paper.

References

[1] Adams, A., Bu, T., Friedman, T., Horowitz, J., Towsley, D., Caceres, R., Duffield, N., Presti, F., Moon, S. and Paxson, V. (2000). The use of end-to-end multicast measurements for characterizing internal network behavior. IEEE Communications Magazine, May 2000.
[2] Bickel, P., Klaassen, C., Ritov, Y. and Wellner, J. (1993). Efficient and Adaptive Estimation for Semiparametric Models. The Johns Hopkins Univ. Press.
[3] Bu, T., Duffield, N., Presti, F. and Towsley, D. (2002). Network tomography on general topologies. ACM SIGCOMM 2002 21–30.
[4] Caceres, R., Duffield, N. G., Horowitz, J. and Towsley, D. (1999). Multicast-based inference of network internal loss characteristics. IEEE Trans. Inform. Theory 45 2464–2480.
[5] Cao, J., Davis, D., Vander Wiel, S. and Yu, B. (2000). Time-varying network tomography: router link data. J. Amer. Statist. Assoc. 95 1063–1075.
[6] Cao, J., Vander Wiel, S., Yu, B. and Zhu, Z. (2000). A scalable method for estimating network traffic matrices. Bell Labs Technical Report.
[7] Carrasco, M. and Florens, J. (2000). Generalization of GMM to a continuum of moment conditions. Econometric Theory 797–834.
[8] Castro, R., Coates, M., Liang, G., Nowak, R. and Yu, B. (2004). Network tomography: recent developments. Statist. Sci. 19 499–517.
[9] Chen, A., Cao, J. and Bu, T. (2007). Network tomography: Identifiability and Fourier domain estimation. Proc. IEEE INFOCOM.
[10] Coates, M. and Nowak, R. (2002). Sequential Monte Carlo inference of internal delays in nonstationary data networks. IEEE Trans. Signal Process. 50 366–376.
[11] Cooper, R. B. (1972). Queueing Theory. Macmillan, New York.
[12] Duffield, N., Horowitz, J. and Presti, F. (2001). Adaptive multicast topology inference. Proc. IEEE INFOCOM 1636–1645.
[13] Duffield, N. and Presti, F. (2004). Network tomography from measured end-to-end delay covariance. IEEE Trans. Networking 12 978–992.
[14] Feuerverger, A. and Mureika, R. (1977). The empirical characteristic function and its applications. Ann. Statist. 5 88–97.
[15] Lawrence, E., Michailidis, G. and Nair, V. (2006). Network delay tomography using flexicast experiments. J. Roy. Statist. Soc. Ser. B 68 785–813.
[16] Liang, G. and Yu, B. (2003). Maximum pseudo likelihood estimation in network tomography. IEEE Trans. Signal Process. 51 2043–2053.
[17] Mallows, C. L. (1972). A note on asymptotic joint normality. Ann. Math. Statist. 43 508–515.
[18] Presti, F., Duffield, N., Horowitz, J. and Towsley, D. (2002). Multicast-based inference of network-internal delay distributions. IEEE Trans. Networking 10 761–775.
[19] Shih, M. and Hero, A. (2003). Unicast-based inference of network link delay distributions using finite mixture models. IEEE Trans. Signal Process. 51 2219–2228.
[20] Tsang, Y., Coates, M. and Nowak, R. (2003). Network delay tomography. IEEE Trans. Signal Process. 51 2125–2136.
[21] Vardi, Y. (1996). Network tomography: Estimating source-destination traffic intensities from link data. J. Amer. Statist. Assoc. 91 365–377.
IMS Lecture Notes–Monograph Series
Complex Datasets and Inverse Problems: Tomography, Networks and Beyond
Vol. 54 (2007) 62–75
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000058
Using data network metrics, graphics, and topology to explore network characteristics

A. Adhikari$^1$, L. Denby$^1$, J. M. Landwehr$^1$ and J. Meloche$^1$

Avaya Labs

Abstract: Yehuda Vardi introduced the term network tomography and was the first to propose and study how statistical inverse methods could be adapted to attack important network problems (Vardi, 1996). More recently, in one of his final papers, Vardi proposed notions of metrics on networks to define and measure distances between a network's links, its paths, and also between different networks (Vardi, 2004). In this paper, we apply Vardi's general approach for network metrics to a real data network by using data obtained from special data network tools and testing procedures presented here. We illustrate how the metrics help explicate interesting features of the traffic characteristics on the network. We also adapt the metrics in order to condition on traffic passing through a portion of the network, such as a router or pair of routers, and show further how this approach helps to discover and explain interesting network characteristics.
$^1$ Avaya Labs, 233 Mt Airy Road, Basking Ridge, New Jersey 07920, USA, e-mail: [email protected]; [email protected]; [email protected]; [email protected]
AMS 2000 subject classifications: 62H15, 62C10, 62C12, 62C25.
Keywords and phrases: VoIP, network paths, network visualization, network QoS, traffic analysis.

1. Introduction

Problems in a wide range of fields can be expressed in terms of network structures. Examples range across fields such as social relationships, vehicular traffic, the spread of epidemics, and Internet Protocol (IP) data networks. Data that live on a network go well beyond the classical cases-by-variables paradigm in terms of how the data are represented, the formulation of problems to attack, and the wide range of analytical methods that might be useful in one situation or another.

In a groundbreaking and influential paper, Yehuda Vardi [7] identified the problem area of estimating source-destination traffic intensities from link count data and proposed how statistical inverse methods could be adapted to attack this problem, coining the term network tomography. More recently, in one of his final papers, Vardi [8] raised notions of defining and using metrics of one type or another in the context of network tomography problems.

Our interest is data networks, in particular test traffic mimicking Voice-over-IP (VoIP) traffic. Data obtained from even limited testing on a corporate network of realistic size are massive and complicated, so some sort of data reduction method is a natural topic to consider among a range of analytical options when attacking certain problems. Measures of distance and similarity provide one way to summarize and reduce the raw data. In this paper, we explore how adapting some of Vardi's ideas on metrics can supply useful insights for interesting problems relating to VoIP traffic flows. We think of the methods as data analysis and data summarization tools, possibly useful in initial analytical steps, whether or not more extensive analytical capabilities using the mathematical properties of metrics are exploited.

It is natural to think of a data network in terms of a graph in which the vertices represent either routers in the core of the network or communication endpoints at the periphery of the network, and the edges in the graph represent network links on which traffic can flow. A whole range of distance measures can be thought of in this context. For example, the distance from one vertex to another can be defined to be the length of the shortest path between the two vertices; that is, the distance is the minimum number of hops required to traverse through edges from one vertex to the other in the graph. If the edges have a direction, this is a directed graph and this distance measure is not necessarily symmetric. If the edges are all bi-directional, however, this distance is symmetric.

The above idea of measuring the distance between vertices by the length of the shortest path can clearly be extended to take edge performance values such as latency, loss or jitter into account. Indeed, a distance measure from x to y can be defined by minimizing the sum of edge values over all paths that go from x to y. The original measure corresponds to the case where all the values are 1.

Real data networks have the additional structure of routing. Routing is the process by which messages sent by one communication endpoint find their way through the network to the destination endpoint. Routing is performed by devices called routers and is partly dictated by protocols such as OSPF. These protocols, in turn, are based on metrics such as the above shortest path and also other considerations such as security or policy. One way to describe routing is to specify the paths through the network between the endpoints of interest.
These paths can be thought of as sequences of vertices, starting with the source endpoint and ending with the destination endpoint. Routing in real networks is often asymmetric (the path from B to A is not simply the reversed sequence of vertices in the path from A to B), it is time dependent, and it depends on the nature of the end-to-end traffic (e.g., www traffic is not necessarily routed along the same path as VoIP traffic). Given this additional routing structure, there is a need to go beyond approaches that assign distances between vertices. Vertices are routers and endpoints, and the edges between vertices are the basic elements of interaction in a network. In principle, a path through the network could be characterized either by a sequence of routers and/or by a sequence of edges. From an information standpoint, however, the sequence of edges constitutes a more complete description of the path since it amounts to a sequence of router interfaces rather than simply a sequence of routers. Moreover, including routers alone seems naive in that having interaction or not between adjacent routers is an important network characteristic – that is, are the routers connected by a network link or edge? For these reasons, we prefer to think of paths as sequences of edges rather than as a sequence of vertices. These points motivate considering topics around distances between the edges of a network. Clearly many notions of distance could be created and studied. Our interest is around the similarity of traffic across various links, or across certain sets of links (paths) going from one communication endpoint to another endpoint. Thus, we are primarily interested in distances (or similarities) between pairs of edges, rather than between pairs of vertices (nodes). Taking this perspective, we investigate ways to construct and use distances that are defined between pairs of edges in the network. 
The definition of the distances is conditional on the following network traffic characteristics: the quantity and
Adhikari, Denby, Landwehr and Meloche
intensity of the end-to-end communication traffic sent through the network; the paths taken by this traffic; and the dependence of the routing over time. The values of the metric could include performance measures of the end-to-end traffic. Section 2 presents the definitions needed for our approach. In Section 3, we provide some initial examples and graphical displays to visualize the set of distances between pairs of edges in the network. Section 4 proposes an application of the distances-between-edges calculations in which the communication endpoints are automatically assigned in a hierarchy that reflects end-to-end performance. In Section 5, we describe a second application of the distances in which the entire network is automatically classified into one of several canonical types based on the distribution of the distances between the edges. Section 6 provides a few concluding comments. There are, of course, many statistical problems related to IP network data. For discussions of other problems, an issue of Statistical Science contains three companion review articles that survey many recent developments in the field: see Duffield [5], Castro et al. [4], and Cao et al. [3], each of which also points to many further references. Our research interests have grown out of the needs to develop technology to test and assess enterprise networks for VoIP Quality of Service (QoS); see Bearden, et al. [1, 2] and Karacali, et al. [6]. 2. Definitions Consider a data network including a set of communication endpoints used as traffic sources and destinations, along with a specific pattern of test traffic generated among pairs of these devices. Conditional on this setup, we wish to calculate distances among the edges in the network based on the similarities of the traffic flowing across the edges, in ways that we will make precise. Traffic is routed according to a given routing matrix A. 
The rows of A represent the edges in the network and the columns represent the paths of interest. The paths correspond to each source-destination pair of endpoints that is tested. The element at row e and column p is
(2.1)    a_{ep} = \begin{cases} w_p & \text{if } e \in p, \\ 0 & \text{otherwise,} \end{cases}
where w_p is whatever path level quantity is of interest. Examples of w_p could be the indicator variable 1, or the path frequency within the test plan, or an end-to-end performance measure such as delay or loss on this path. On the other hand, w could be defined to be edge effects derived from end-to-end values such as those resulting from network tomography calculations on delays. Vardi proposed to measure the distance between edges in terms of the rows of the matrix A. Let A_e be the row of A that corresponds to the edge e. Specifically, he proposed
(2.2)    d(e, f) = |A_e - A_f|
as a distance between edges when w_p = 1 and there is exactly one path for each source-destination pair, and he demonstrated that d satisfies the properties of a metric. We propose to explore variations on this original idea. For edges e and f, define
(2.3)    d_1(e, f) = \frac{|A_e - A_f|}{\max(|A_e|, |A_f|)}
Using data network metrics
in which the difference between the two row vectors is normalized by the larger of their norms. Continuing the extension, let P_e be the set of paths containing edge e. Let R Δ S denote the symmetric difference of the two sets R and S, so
(2.4)    R \,\Delta\, S = (R \cup S) - (R \cap S).
We also define
(2.5)    d_2(e, f) = \frac{\sum_{p \in P_e \Delta P_f} w_p}{\sum_{p \in P_e \cup P_f} w_p},
which reduces to
(2.6)    d_3(e, f) = \frac{|P_e \,\Delta\, P_f|}{|P_e \cup P_f|}
when wp ≡ 1. Vardi’s d is the numerator of d3 when there is exactly one path for each source-destination pair. We find that normalizing his quantity by the denominator of d3 , a quantity essentially representing the total frequency with which edges e and f appear in the paths under test from this traffic configuration, gives an intuitively more plausible measure of distance. Moreover, in d2 we extend the concept that Vardi proposed to the case where multiple paths from source to destination are observed, a common situation in real networks at least over a sufficient period of time. This extension also opens up the possibility of incorporating alternative path level quantities into the distance measure. 3. Examples In this section we give two examples to better understand the edge distance measure. We focus on the similarity measure s = 1 − d2 . We start with a large example, then specialize it to a subgraph in the smaller second example to illustrate the calculations. The data come from a real, world-wide corporate network. It consists of paths between 28 endpoints deployed on the corporate network. The paths are obtained with the common traceroute utility which was run between randomly selected pairs of endpoints about 1.3 million times in a period of 10 days which amounts to not quite 2 tests running every second. The traceroute test works by sending packets from source to destination with increasing time to live (TTL) values. The time to live value of a packet is the number of routers the packet is allowed to traverse on its way to the destination. Each intervening router decrements the time to live value and when it has reached the value 0, the router sends its address back to the source in a TTL Exceeded (ICMP) message. The result is a sequence of addresses from source to destination at increasing distances from the source. The testing process resulted in 1825 paths observed at one time or another. Figure 1 is a histogram of the path frequencies. 
If routing never changed, we would have observed only 756 = 28×27 of them, each of which would have occurred about 1900 times. Figure 1 reveals an interesting aspect of the paths on this network. There are accumulations of frequencies around 1800, and at binary divisions of 1800, with the bumps at 900 and 450 clearly visible. This suggests that there are certain paths that occur each time a certain source-destination pair is tested, while for other S-D pairs there are two or four paths that occur roughly equally frequently.
Fig 1. Distribution of path frequency throughout collection period.
Table 1 shows the number of observed paths for each of the source-destination pairs. Note that this table shows only 27 rows and columns since no traceroute data originated from one of the endpoints so that row and column were eliminated. The first two endpoints have a large number of paths to destination due to load balancing and fail-over nodes near their location. Load balancing nodes are used in turn according to some schedule such as round robin in order to equalize traffic over alternative network links. Fail-over nodes are only used when a primary node fails. Both types of nodes are the primary explanations for the multiplicity of the paths from source to destination. One remarkable aspect of this multiplicity is that it is not symmetric at all. The second example involves only two endpoints, both in New Jersey with one in Lincroft and the other in Basking Ridge. For this example, the routing matrix A is formed by restricting the data to these two endpoints. Figure 2 shows this restricted network. The endpoints in Lincroft and Basking Ridge are shown as triangles at the left and right edge of the network and the circles represent the routers involved in the end-to-end traffic between the two endpoints. The direction of the traffic is shown as an arrow on the edge or as a dot for bidirectional edges. Figure 2 shows the network topology but fails to express how this topology is used in the end-to-end communication between the two endpoints. The corresponding routing information is reported in Table 2, which amounts to the weighted routing matrix A. In this case, we have three paths, two from Lincroft to Basking Ridge and one in the reverse direction. Table 2 reports how frequently each edge is used in each of the paths during the 10 day period. Over time, the network kept shifting between Path 1 and Path 2 for tests from Lincroft to Basking Ridge, while tests in the opposite direction always followed Path 3.
The entries in the table are the w1 , w2 , and w3 which we plug into Equation 2.5 to calculate the edge distances. The edge traffic similarity 1 − d2 can be used to measure the similarity of the end-to-end traffic that goes through a pair of edges. By way of example, (3.1)
s(Lin − B, I − G) = 0
Table 1 Number of paths per source-destination pair
    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
 0  -  1  2  2  8  4  2  3  2  1  1  2  1  1  2  2  2  2  1  2  2  1  1  1  1  2  1
 1  1  -  4  2  8  4  2  3  1  1  2  2  1  2  1  1  2  2  2  1  2  1  1  1  1  2  1
 2 12 12  -  4  8  4  4  3  2  2  4  2  2  2  2  2  2  2  2  2  2  4  2  2  2  2  2
 3  6  6  4  -  4  2  4  2  1  1  2  1  2  1  1  1  2  1  2  1  2  3  2  2  1  1  2
 4 11 10  2  1  -  2  2  2  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  1  1  1
 5  6  6  4  3  4  -  4  2  1  1  2  1  2  1  2  2  2  2  2  2  2  3  2  2  2  1  1
 6 10 10  2  2  4  2  -  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 7 10 10  2  2  4  2  3  -  1  2  2  1  1  1  1  2  2  2  2  2  2  2  1  1  2  1  2
 8  6  6  2  3  8  2  2  1  -  1  1  2  1  2  1  2  1  1  1  1  1  2  2  1  2  1  2
 9 10 10  2  2  4  2  2  1  1  -  2  1  1  1  1  2  1  1  1  1  1  1  2  1  1  1  1
10  1  1  2  3  1  1  2  1  1  1  -  1  1  1  1  2  1  1  1  1  1  1  1  1  1  1  1
11 10 10  2  3  4  2  2  1  2  1  1  -  1  1  2  1  1  1  1  1  1  1  1  1  2  1  1
12 10 10  2  2  4  2  2  1  1  1  1  1  -  1  1  1  1  1  1  1  1  1  1  1  1  1  1
13 10 10  2  2  4  2  2  1  2  1  1  1  1  -  1  1  1  1  1  1  1  1  1  1  1  1  1
14 10 10  2  3  4  2  2  1  2  1  1  1  1  1  -  1  1  1  1  1  1  1  2  1  1  1  1
15 10 10  1  3  1  1  2  2  1  1  1  1  1  1  1  -  1  1  1  1  1  1  1  1  1  1  1
16 10 10  2  2  4  2  2  2  1  1  1  1  1  1  1  1  -  1  1  1  1  2  1  1  1  1  1
17 10 10  2  2  4  2  2  2  1  1  1  1  1  1  1  1  1  -  1  1  1  1  1  1  1  1  1
18 10 10  2  2  4  2  1  2  1  1  1  1  1  1  1  2  1  1  -  1  1  2  1  1  1  1  1
19 10 10  2  1  4  2  2  2  1  1  1  1  1  1  1  1  1  1  1  -  1  2  1  1  1  1  1
20 10 10  2  1  4  2  1  2  1  1  1  1  1  1  1  1  1  1  1  1  -  1  1  1  1  1  1
21 10 10  2  5  1  2  2  3  2  1  1  1  1  1  1  1  2  1  2  2  1  -  2  2  1  2  1
22  6  6  1  2  1  2  2  1  1  1  2  1  1  1  1  1  1  1  1  1  1  2  -  1  1  1  1
23  6  6  2  3  4  2  4  2  2  1  3  2  2  1  1  1  2  2  1  2  2  3  2  -  2  1  2
24  6  6 10  1  4  4  2  1  5  4  4  4  4  4  4  4  4  4  8  8  4  4  1  8  -  1  4
25  1  1  2  1  1  1  2  1  1  1  2  1  1  1  1  1  1  1  1  1  1  2  1  1  1  -  1
27  1  1  2  1  1  1  2  2  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  -
Fig 2. New Jersey Topology.
since none of the end-to-end traffic that goes through Lin-B goes through I-G. On the other hand, (3.2)
s(Lin − B, B − D) = 1336/(0 + 1336 + 552) = 0.71
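The two similarity values above can be reproduced directly from Equation 2.5. The following is a minimal Python sketch, assuming the path weights 552, 1336 and 1889 and the edge memberships given in the New Jersey weighted routing matrix; the names `weights`, `paths`, `paths_through`, `d2` and `similarity` are illustrative and are not part of the original analysis.

```python
# Path weights w_p: how often each path was observed in the New Jersey example.
weights = {"Path 1": 552, "Path 2": 1336, "Path 3": 1889}

# Edges traversed by each path (assumed from the weighted routing matrix).
paths = {
    "Path 1": ["Lin-B", "B-C", "C-F", "F-H", "H-I", "I-BR"],
    "Path 2": ["Lin-B", "B-D", "D-F", "F-H", "H-I", "I-BR"],
    "Path 3": ["BR-I", "I-G", "G-E", "E-C", "C-A", "A-Lin"],
}

def paths_through(edge):
    """Return P_e, the set of paths that contain the given edge."""
    return {p for p, edges in paths.items() if edge in edges}

def d2(e, f):
    """Weighted edge distance of Equation 2.5."""
    pe, pf = paths_through(e), paths_through(f)
    numerator = sum(weights[p] for p in pe ^ pf)    # symmetric difference
    denominator = sum(weights[p] for p in pe | pf)  # union
    return numerator / denominator

def similarity(e, f):
    """Edge traffic similarity s = 1 - d2."""
    return 1.0 - d2(e, f)

print(round(similarity("Lin-B", "I-G"), 2))  # 0.0
print(round(similarity("Lin-B", "B-D"), 2))  # 0.71
```

Calling `similarity("Lin-B", edge)` for each remaining edge gives the other entries of the similarity column in the same way.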
The rest of the similarity values are reported in Table 2. Figure 3 shows the same information graphically. For better visibility, the edges that have similarity of zero to Lin-B are colored grey and the others are shaded in a gradation from pink to red. We end this section with a full map of the corporate topology which expresses the similarity of traffic between all edges and the Lin-B edge. Figure 4 illustrates the important idea that because traffic is end-to-end, similarities can be found between edges that are far apart in the usual sense of hop distance or geography. Figure 4 shows that overall few edges have some similarity to Lin-B and that those that do have some similarity are widely dispersed in a geographical and in a network sense. Furthermore, manipulating the color rendering on a computer screen (which we cannot convey in a static picture) reveals that among edges that have some similarity to Lin-B, a number of the inter-continental edges of the network have stronger similarity to Lin-B than the intra-continental ones. This reflects the fact that there are relatively few cross-continental edges, which are left to concentrate the traffic from one continent to another.
Table 2 New Jersey weighted routing matrix
Edge     Path 1   Path 2   Path 3   Similarity to Lin-B
Lin-B       552     1336        0   1.0
B-C         552        0        0   0.29
C-F         552        0        0   0.29
F-H         552     1336        0   1.0
H-I         552     1336        0   1.0
I-BR        552     1336        0   1.0
B-D           0     1336        0   0.71
D-F           0     1336        0   0.71
BR-I          0        0     1889   0
I-G           0        0     1889   0
G-E           0        0     1889   0
E-C           0        0     1889   0
C-A           0        0     1889   0
A-Lin         0        0     1889   0
Fig 3. Edge similarity to Lin-B.
4. Automatic hierarchy of endpoints
Active monitoring of a network involves sending probes to and from each endpoint. Such pairwise testing of all pairs does not scale easily with an increasing number of endpoints, since adding one endpoint when n endpoints are already in use adds 2n pairs to be tested. Thus the number of pairs to be tested grows like n^2. In designing the monitoring plan one must balance network coverage with the intensity of the monitoring data over the links. One does not want to swamp the links, yet one does not want to visit a particular region of the network (or pair) too infrequently. One way to handle this balance is by assigning the endpoints to a hierarchical tree that defines the test scheduling between endpoints. For example, take the network in Figure 4. One possible tree is shown in Figure 5. This tree was specified by the network engineer who is familiar with the routing pattern of the network along with the existence of certain applications on the network, e.g., VoIP or video conferencing. This hierarchy specifies the scheduling of the tests. Tests are performed consecutively for all pairs in one level of the hierarchy and then between pairs in the next level of the hierarchy. Thus, on the first round there will be tests performed between pairs of EMEA, APAC, US and Backbone regions where one of the endpoints in the leaves below that region is chosen at random to represent the region. After those six tests are completed, the 10 tests within EMEA are performed, followed by the 10 tests within APAC, followed by an East-West US test, and then the test within the Backbone. Then we go to the next level and perform the 10 tests within the West and the North-South test and lastly the tests within North and within South. This completes one round of testing and we can start over at the top level again.
So, this hierarchy defines a round of testing comprised of 64 tests instead of the 378 that it would take to cover all n(n − 1)/2 pairs of endpoints. The advantage of this is less traffic on the network and more frequent visits to regions of interest. Setting up this hierarchy by hand requires considerable knowledge of the network and can prove to be time consuming for large networks. Thus, there is a need for an automatic way of determining the hierarchy and assigning the endpoints to each node of the hierarchy. A way to automatically determine the hierarchy is to use the routing incidence matrix (probably collected at a moderate pace between all pairs of endpoints) along with clustering to determine how groups of endpoints combine to form the hierarchy. For example, suppose we take the routing matrix which contains counts of the number of times each link is touched by a path such
Fig 4. Edge similarity to Lin-B.
Fig 5. Manual hierarchy.
as in Table 2 and collapse the columns related to the paths between the same source-destination pair of endpoints by summing them and dividing by the total number of times that source-destination pair was used in testing. For example, columns 1 and 2 of Table 2 are replaced by a single column with values (1, 0.29, 0.29, 1, 1, 1, 0.71, 0.71, 0, 0, 0, 0, 0, 0). In this case a link entry for a source-destination pair represents the proportion of times that the link has been touched in the paths between this source-destination pair. The distance between source and destination is the sum of the resulting column. Thus, in this case the distance is essentially the number of hops between source and destination, adjusted for the multiple paths having varying lengths. To symmetrize the distance matrix D we use (D + D^T)/2. It is this symmetric distance matrix that is then used in a straightforward clustering algorithm to determine the hierarchy. Figure 6 shows the hierarchical clustering based on this simple algorithm. Generally the clustering follows what might be set intuitively by the network administration, such as in Figure 5, thus showing that using this routing matrix as the basis of an automatic hierarchy algorithm is promising. In fact, one might think of enhancing the A matrix to reflect the network metrics collected along a path or other network parameters such as throughput or capacity, thus automatically constructing a distance that is based not only on the number of hops but also on measures of the performance between the endpoints.
5. Network classification
A second application of edge distance is to classify networks according to the distances between the edges in that network. The idea is that this distribution could
Fig 6. Automatic hierarchy.
serve as a signature of the paths on the network. To illustrate how such a signature may be useful, we start by producing the signature distribution for four networks on which there are four fundamentally different routing patterns. The networks are
• ring (unidirectional), where the nodes are disposed in a ring and the links are unidirectional;
• ring (bidirectional), where the nodes are disposed in a ring and the links are bidirectional;
• star, where the nodes are disposed in a star around a middle router;
• mesh, where all nodes connect directly to one another.
Figure 7 illustrates the four networks based on 10 nodes. For each of the four networks, we calculated d3 for all edge pairs and formed a histogram of the resulting similarities. The calculations for d3, however, were made for networks with 100 nodes. Figure 8 displays the resulting histograms for the corresponding similarities. The star and mesh histograms look the same, but while mesh actually has all of its mass at 0, star has a bit of it off 0. The histograms show, as expected, that the edges on the star and mesh have (little or) no similarity because distinct edges (hardly) ever carry the same end-to-end traffic. The rings, however, are different and the unidirectional ring, in particular, has a lot of similar edges. In fact, similarity for the rings is a function of the hop distance between the edges. Figure 9 displays the histogram signature for a worldwide corporate network. The signature is similar to the star network, with much but not all of its mass near 0, which tells us that the edges on the corporate network have little or no similarity.
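The signature computation for the four candidate topologies can be sketched as follows. This is a hedged illustration in Python rather than the authors' code: it assumes that traffic runs between every ordered pair of nodes, that the unidirectional ring has directed links while the other three networks have undirected links, and that traffic on the bidirectional ring takes the shorter way round with ties broken clockwise. The function names are hypothetical, and 10 nodes are used instead of 100 to keep the run short.

```python
from itertools import combinations

def similarities(paths):
    """Return 1 - d3 for every pair of distinct edges that carry traffic."""
    path_sets = {}  # edge -> P_e, the indices of the paths through that edge
    for index, path in enumerate(paths):
        for edge in path:
            path_sets.setdefault(edge, set()).add(index)
    # 1 - |symmetric difference| / |union| equals |intersection| / |union|.
    return [len(pe & pf) / len(pe | pf)
            for pe, pf in combinations(path_sets.values(), 2)]

def ring_unidirectional(n):
    # Directed link (k, k + 1 mod n); traffic from i to j travels clockwise.
    return [[((i + h) % n, (i + h + 1) % n) for h in range((j - i) % n)]
            for i in range(n) for j in range(n) if i != j]

def ring_bidirectional(n):
    # Undirected links; shorter way round, ties broken clockwise (assumption).
    paths = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            cw = (j - i) % n
            if cw <= n - cw:
                nodes = [(i + h) % n for h in range(cw + 1)]
            else:
                nodes = [(i - h) % n for h in range(n - cw + 1)]
            paths.append([frozenset(pair) for pair in zip(nodes, nodes[1:])])
    return paths

def star(n):
    # One undirected link from each node to a central router called "hub".
    return [[frozenset((i, "hub")), frozenset((j, "hub"))]
            for i in range(n) for j in range(n) if i != j]

def mesh(n):
    # Every pair of nodes is joined by its own undirected link.
    return [[frozenset((i, j))] for i in range(n) for j in range(n) if i != j]

for name, build in [("unidirectional ring", ring_unidirectional),
                    ("bidirectional ring", ring_bidirectional),
                    ("star", star), ("mesh", mesh)]:
    s = similarities(build(10))
    zero_share = sum(v == 0 for v in s) / len(s)
    print(f"{name}: {len(s)} edge pairs, "
          f"share with zero similarity {zero_share:.2f}, max {max(s):.2f}")
```

Under these assumptions every mesh similarity is exactly 0 and every star similarity is small but positive, which agrees with the qualitative description of the histograms; a histogram of the list returned by `similarities` plays the role of the signature.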
Fig 7. Candidate networks (panels: unidirectional ring, bidirectional ring, star, mesh).
Fig 8. Distributions of edge similarities.
Fig 9. Distributions of edge distances on the corporate network.
6. Summary and conclusions
Data networks incorporate many features: hops, traffic flow, delay, jitter, loss, topology, random paths, etc. They are complex and the data available to characterize them is voluminous. In order to help study their characteristics and to compare two data networks, Vardi suggested an approach for creating network metrics based on traffic flow conditional on source/destination endpoints. In this paper we show that the distribution of edge similarities is related to the structure of the network topology. In addition, displaying network metrics on top of the topology is useful since it helps in the following: 1) understanding of traffic flow on the network; 2) identifying multiple paths used in the network; 3) recognizing where topologically separate parts of the network have similar traffic; and 4) recognizing where adjacent links have dissimilar traffic flows. Lastly, using end-to-end traffic flow metrics can lead to automatically creating useful hierarchies among the endpoints that can be used in designing scalable monitoring strategies. In conclusion, Vardi’s idea of using network metrics of traffic flow has merit and is a fruitful topic for current and future research. We thank Yehuda for pointing out the value of using traffic flow metrics to help understand and compare networks.
References
[1] Bearden, M., Denby, L., Karacali, B., Meloche, J. and Stott, D. T. (2002). Assessing network readiness for IP telephony. Proc. of the 2002 IEEE Intl. Conf. on Communications.
[2] Bearden, M., Denby, L., Karacali, B., Meloche, J. and Stott, D. T. (2002). Experiences with evaluating network QoS for IP telephony. Proc. of the Tenth Intl. Workshop on Quality of Service 259–268.
[3] Cao, J., Cleveland, W. S. and Sun, D. X. (2004). Bandwidth estimation for best-effort internet traffic. Statist. Sci. 19 499–517.
[4] Castro, R., Coates, M., Liang, G., Nowak, R. and Yu, B. (2004). Network tomography: Recent developments. Statist. Sci. 19 499–517.
[5] Duffield, N. (2004). Sampling for passive internet measurements: A review. Statist. Sci. 19 472–498.
[6] Karacali, B., Denby, L. and Meloche, J. (2004). Scalable network assessment for IP telephony. Proc. of the 2004 IEEE Intl. Conf. on Communications.
[7] Vardi, Y. (1996). Network tomography: estimating source-destination traffic intensities from link data. J. Amer. Statist. Assoc. 91 365–377.
[8] Vardi, Y. (2004). Metrics useful in network tomography studies. IEEE Signal Processing Letters 11 353–355.
IMS Lecture Notes–Monograph Series Complex Datasets and Inverse Problems: Tomography, Networks and Beyond Vol. 54 (2007) 76–91 © Institute of Mathematical Statistics, 2007 DOI: 10.1214/074921707000000067
A flexible Bayesian generalized linear model for dichotomous response data with an application to text categorization Susana Eyheramendy1 and David Madigan2 Oxford University and Rutgers University Abstract: We present a class of sparse generalized linear models that include probit and logistic regression as special cases and offer some extra flexibility. We provide an EM algorithm for learning the parameters of these models from data. We apply our method in text classification and in simulated data and show that our method outperforms the logistic and probit models and also the elastic net, in general by a substantial margin.
1 Statistics Dept., Oxford University, Oxford, OX1 3TG, United Kingdom
2 Department of Statistics, 501 Hill Center, Rutgers University, Piscataway, NJ 08855, USA, e-mail: [email protected]
Keywords and phrases: Generalized linear model, text classification, binary regression.
AMS 2000 subject classifications: 62-02, 62J12.
1. Introduction
The standard approach to modelling the dependence of binary data on explanatory variables under the generalized linear model setting is through a cumulative distribution function (cdf) Ψ. For example, for a vector of explanatory variables x and a random variable y that takes values in {0, 1}, the conditional probability of y given x and a vector of parameters β is modelled using a cdf Ψ, i.e. Pr(y = 1|x, β) = Ψ(x^T β). The most commonly used cdfs are the logistic and normal cdfs. The corresponding “link functions”, Ψ^{−1}, are the logit and probit link functions respectively. Albert and Chib [1] proposed a Student-t inverse cdf as the link function. This model includes the logit and probit as special cases, at least approximately. For probabilities between 0.001 and 0.999, logistic quantiles are approximately a linear function of the quantiles of a Student-t distribution with 8 degrees of freedom. Also, when the degrees of freedom in a Student-t distribution are large (say υ > 20), the t-link approximates the probit model. The degrees of freedom υ control the thickness of the tail of the t-density, allowing for a more flexible model. One can also benefit from this model as it can be presented via a latent variable representation that allows one to estimate the parameters of the model easily using the EM algorithm (Dempster et al. [5]). For the logistic, normal and Student-t, the corresponding probability density functions (pdf) are symmetric around zero. This implies that their corresponding cdfs approach 1 at the same rate that they approach 0, which may not always be reasonable. In some applications, the overall fit can significantly improve by using a cdf that approaches 0 and 1 at different rates. Many authors have proposed models with asymmetric link functions that can approximate, or have as special cases, the logit and probit links. Stukel [15] proposes
a class of models that generalizes the logistic model. Chen et al. [4] introduce an alternative to the probit models where the cdf is the skew-normal distribution from Azzalini and Della Valle [2]. Fernandez and Steel [6] propose a class of skewed densities indexed by a scalar δ ∈ (0, ∞) of the form:
(1)    p(y|\delta) = \frac{2}{\delta + \frac{1}{\delta}} \left\{ f\!\left(\frac{y}{\delta}\right) I_{[0,\infty)}(y) + f(y\delta)\, I_{(-\infty,0)}(y) \right\}, \quad y \in \mathbb{R},
where f(.) is a univariate probability density function (pdf) with the mode at 0 and symmetry around the mode. The parameter δ determines the amount of mass at each side of the mode, and hence the skewness of the densities. Capobianco [3] considers the Student-t pdf as the univariate f(.) density in equation (1). This is an appealing model as it contains a parameter that controls the thickness of the tails and another parameter that determines the skewness of the density. We apply an extended version of this model to textual datasets, where the problem is to classify text documents into predefined categories. We consider a Bayesian hierarchical model that contains, in addition to the parameter that controls the skewness of the density and the parameter that controls the thickness of the tails, a third parameter that controls the sparseness of the model, i.e. the number of regression parameters with zero posterior mode. In studies where there are a large number of predictor variables, this methodology gives one way of discriminating and selecting relevant predictors. In what follows we describe the model that we propose.
2. The model
Suppose that n independent binary random variables Y_1, . . . , Y_n are observed together with vectors of predictors x_1, . . . , x_n, where Y_i = 1 with probability P(Y_i = 1|β, x_i). Under the generalized linear model setting, models for binary classification satisfy P(Y_i = 1|β, x_i) = Ψ(x_i^T β) where Ψ is a nonnegative function whose range is between 0 and 1. For instance, the probit model is obtained when Ψ is the normal cumulative distribution and the logit model when Ψ is the logistic cdf. Under Bayesian learning one starts with a prior probability distribution for the unknown parameters of the model. Prediction of new data utilizes the posterior probability distribution. More specifically, denote by π(β) the prior density function for the unknown parameter vector β.
Then the posterior density of β is given by
\pi(\beta \mid \{(x_1, y_1), \ldots, (x_n, y_n)\}) = \frac{\pi(\beta) \prod_{i=1}^{n} \Psi(x_i^T\beta)^{y_i} (1 - \Psi(x_i^T\beta))^{1-y_i}}{\int \pi(\beta) \prod_{i=1}^{n} \Psi(x_i^T\beta)^{y_i} (1 - \Psi(x_i^T\beta))^{1-y_i} \, d\beta}
and the posterior predictive distribution for y given x is
\pi(y \mid x, \{(x_1, y_1), \ldots, (x_n, y_n)\}) = \int \Pr(y \mid x, \beta)\, \pi(\beta \mid \{(x_1, y_1), \ldots, (x_n, y_n)\}) \, d\beta,
which in general are intractable due to the many integrals. In the model that is proposed in this paper, we estimate the vector of parameters β as the mode of the posterior density π(β|{(x_1, y_1), . . . , (x_n, y_n)}) and prediction of new data is performed using the following rule:
(2)    \hat{y} = \begin{cases} 1 & \text{if } \pi(y = 1 \mid x, \{(x_1, y_1), \ldots, (x_n, y_n)\}) > 0.5, \\ 0 & \text{if } \pi(y = 1 \mid x, \{(x_1, y_1), \ldots, (x_n, y_n)\}) \le 0.5, \end{cases}
which is an optimal rule in the sense that it has the smallest expected prediction error (see Hastie et al. [10] for more details). We consider two general cases for the form of Ψ. First, we consider a cdf Ψ that approaches 1 and 0 at the same rate. This is referred to hereon as the symmetric case. Second, we generalize this model assuming that the inverse of the link function is the cdf of a skewed density. In this way, the inverse of the link function approaches 0 and 1 at different rates.
2.1. Symmetric case
The model that we propose assumes that the conditional distribution of Y_i given a vector of predictors x_i and a vector of unknown parameters β is determined by the cdf of the Student-t distribution with υ degrees of freedom evaluated at x_i^T β, i.e. P(Y_i = 1|β, x_i) = T_υ(x_i^T β). Figure 1 shows how the Student-t cdf can approximate the normal and logistic cdfs. The black continuous line corresponds to the normal cdf, the dashed line corresponds to the logistic cdf and the dotted lines to the Student-t cdf with different degrees of freedom. Also Figure 2 shows the logistic quantiles against the quantiles of a Student-t distribution with 8 degrees of freedom for probabilities between 0.0005 and 0.9995. The straight line corresponds to the linear model fit. Assume that a priori the distribution of β_j is normal with mean 0 and variance τ_j, N(0, τ_j), and the distribution of the hyperparameters τ_j is exponential, exp(2/γ), with pdf equal to (γ/2) e^{−γ τ_j / 2}. Integrating with respect to τ_j, one obtains Pr(β_j|γ) = (√γ/2) exp(−√γ |β_j|), which is the double exponential prior. In Section 3 it will become clear that decomposing the double exponential into a two-level Bayesian hierarchical model allows us to estimate the parameters β via the EM algorithm, where in addition to the latent variables Z (introduced below) the τ_j parameters are seen as missing data.
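The claim that a normal prior with exponentially distributed variance integrates to the double exponential prior can be checked numerically. The sketch below is only an illustration in Python, assuming the exponential pdf (γ/2) exp(−γτ/2) for the variance; the function names `mixture_density` and `double_exponential` and the choice of quadrature (substitution τ = u² followed by the trapezoidal rule) are ours and are not taken from the paper.

```python
import math

def mixture_density(beta, gamma, steps=20000):
    """Integrate N(beta; 0, tau) * (gamma/2) * exp(-gamma*tau/2) over tau > 0.

    The substitution tau = u**2 removes the 1/sqrt(tau) factor, leaving
    gamma/sqrt(2*pi) * exp(-beta**2/(2*u**2) - gamma*u**2/2) du,
    which is integrated with the trapezoidal rule on [0, upper].
    """
    upper = 12.0 / math.sqrt(gamma)  # the integrand is negligible beyond this point
    h = upper / steps
    const = gamma / math.sqrt(2.0 * math.pi)
    # Value of the integrand at u = 0: 1 when beta is 0, otherwise 0.
    total = 0.5 * (1.0 if beta == 0 else 0.0)
    for k in range(1, steps + 1):
        u = k * h
        value = math.exp(-beta ** 2 / (2.0 * u ** 2) - gamma * u ** 2 / 2.0)
        total += value if k < steps else 0.5 * value
    return const * h * total

def double_exponential(beta, gamma):
    """Closed form (sqrt(gamma)/2) * exp(-sqrt(gamma) * |beta|)."""
    root = math.sqrt(gamma)
    return 0.5 * root * math.exp(-root * abs(beta))

for beta in (-2.0, 0.0, 0.3, 1.5):
    print(beta, mixture_density(beta, 2.0), double_exponential(beta, 2.0))
```

The two printed columns should agree to many decimal places for any β and any γ > 0, which is the identity that the two-level hierarchy relies on.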
Fig 1. The black continuous line corresponds to the normal cdf, the dashed line corresponds to the logistic cdf and the dotted lines to the Student-t cdf with different degrees of freedom.
Fig 2. Plot of the logistic quantiles against the quantiles of a Student-t distribution with 8 degrees of freedom for probabilities between 0.00005 and 0.99995. The straight line corresponds to the linear model fit.
The normal prior on the β parameters has the effect of shrinking the maximum likelihood estimator of β towards zero, which has been shown to give a better generalization performance (see Hastie et al. [10] for more details). A different variance in the priors of the β gives the flexibility of having the parameters shrunk by a different amount, which is relevant when the predictors influence the response unevenly. Moreover, the particular distribution of the variances τ_j that we use will shrink some of the parameters β_j to be exactly equal to zero. If the hyperprior distribution for τ_j places significant weight on small values of τ_j then it is likely that the estimate of the corresponding β_j will also be small and can have zero posterior mode. On the other hand, hyperprior distributions that favor large values of the τ_j’s will lead to posterior modes for the β_j’s that are close to the maximum likelihood estimates. Note that, since E(τ_j) = 2/γ for all j, γ effectively controls the sparseness of the model. Figure 3 shows the different shapes that the hyperprior distribution for τ_j can take as the parameter γ varies. Analogous models have been applied by many authors, e.g. Genkin et al. [9], Figueiredo and Jain [7], Neal [13] and MacKay [11]. We show below that by introducing some latent variables, the generalized linear model that takes Ψ to be equal to the Student-t cdf T_υ with υ degrees of freedom can be seen as a missing data model and is hence amenable to the EM algorithm. This procedure offers a tractable way to estimate the parameters of the model. The latent variable representation for this model was first introduced by Albert and Chib [1], who derive a Gibbs sampling algorithm to find an estimator of the parameter β. In the same spirit, Scott [14] introduced a latent variable representation for the logistic model.
Fig 3. Shapes that the exponential hyperprior for the τj parameters takes when the hyper parameter γ changes.
Let Z_1, . . . , Z_n be n independent random variables where Z_i is normally distributed with mean x_i^T β and variance λ_i^{−1}, N(x_i^T β, λ_i^{−1}). Define the probability of Y_i given Z_i as a Bernoulli distribution with parameter equal to the probability that Z_i is positive, Pr(Z_i > 0), i.e. Y_i = 1 if Z_i > 0 and Y_i = 0 otherwise. Let the inverse of the variance of Z_i, λ_i, be distributed as Gamma(υ/2, 2/υ) with pdf proportional to λ_i^{υ/2−1} exp(−υλ_i/2). Note that the marginal distribution p(z|β, x, υ) is t_υ(x^T β), a Student-t density with υ degrees of freedom and location x^T β, and p(y = 1|β, x) = p(z > 0|β, x) = T_υ(x^T β) is the Student-t cumulative distribution with υ degrees of freedom, centered at 0 and evaluated at x^T β. Therefore, we see that by integrating out the latent variables z_i we recover the original model.
2.2. General case
Fernandez and Steel [6] propose a class of skewed densities indexed by a scalar δ ∈ (0, ∞) of the form:
p(y|\delta) = \frac{2}{\delta + \frac{1}{\delta}} \left\{ f\!\left(\frac{y}{\delta}\right) I_{[0,\infty)}(y) + f(y\delta)\, I_{(-\infty,0)}(y) \right\}
where f(·) is a univariate probability density function (pdf) with mode at 0 and symmetric around the mode. The parameter δ determines the amount of mass on each side of the mode, and hence the skewness of the density. When δ lies in (0, 1) the distribution has negative skewness, and when δ is bigger than 1 the distribution has positive skewness.

Fig 4. Graphical model representation of the model with the skew Student-t link. [Diagram not reproduced; node labels: γ, τj, βj, zi, λi, δ, xij, yi, υ; plates: j = 1, ..., d and i = 1, ..., n.]

We replace the function f(·) in the Fernandez and Steel density above by the Student-t density and consider the corresponding cdf, which we denote by $T_{\upsilon,\delta}$. Let Y1, ..., Yn be n independent binary random variables, Bernoulli distributed with probability of success $P(Y_i = 1 \mid \beta, x_i) = T_{\upsilon,\delta}(x_i^T\beta)$. As in the symmetric case of Section 2.1, we introduce latent variables that allow us to find the parameter estimates using the EM algorithm. Consider n random variables Z1, ..., Zn such that the distribution of Zi is the skew normal with mode at $x_i^T\beta$ given by

$$p(z_i \mid x_i, \beta, \lambda_i, \delta) = \frac{2}{\delta + \frac{1}{\delta}} \sqrt{\frac{\lambda_i}{2\pi}}\, \exp\left\{ -\frac{\lambda_i}{2} \left( r_i - x_i^T\beta \right)^2 \right\} \tag{3}$$

where $r_i = \frac{z_i}{\delta}\, I(z_i \ge x_i^T\beta) + z_i \delta\, I(z_i < x_i^T\beta)$. Define Yi = 1 if Zi > 0 and Yi = 0 otherwise. Note that we can derive the pdf of equation (3) by replacing the function f(·) of the Fernandez and Steel density with the normal density with mean $x_i^T\beta$ and variance $\lambda_i^{-1}$, and accordingly dividing the mass of the distribution on each side of the mode $x_i^T\beta$. Assume, as in the symmetric case, that a priori the distribution of βj is normal with mean 0 and variance τj, the distribution of λi is Gamma(υ/2, 2/υ) and the distribution of the hyperparameters τj is exponential, exp(2/γ). See Figure 4 for a graphical model representation of this model. Note that by marginalizing the pdf in equation (3) with respect to λi we get

$$p(z_i \mid x_i, \beta, \delta) = \frac{2}{\delta + \frac{1}{\delta}}\, \frac{\Gamma\!\left(\frac{\upsilon+1}{2}\right)}{\Gamma\!\left(\frac{\upsilon}{2}\right) \sqrt{\pi\upsilon}} \left\{ 1 + \frac{(z_i - x_i^T\beta)^2}{\upsilon} \times \left[ \frac{1}{\delta^2}\, I_{[x_i^T\beta, \infty)}(z_i) + \delta^2\, I_{(-\infty, x_i^T\beta)}(z_i) \right] \right\}^{-\frac{\upsilon+1}{2}} \tag{4}$$
which is the skew Student-t distribution with mode at $x_i^T\beta$, υ degrees of freedom and δ the parameter that controls the skewness of the pdf. Figure 5 shows this pdf for different values of δ and 8 degrees of freedom. The continuous line corresponds to the symmetric Student-t distribution. We can derive the pdf in equation (4) by replacing the function f(·) of the Fernandez and Steel density with a Student-t density with location $x_i^T\beta$ and υ degrees of freedom, and accordingly dividing the mass of the pdf on each side of the mode.

Fig 5. This graph shows the different shapes of the skew Student-t distribution with 8 degrees of freedom. The continuous line corresponds to the symmetric case.

3. The algorithm

3.1. The EM algorithm

In this section we derive the equations for the EM algorithm in the symmetric and general cases explained above. The EM algorithm, originally introduced by Dempster, Laird and Rubin [5], provides an iterative procedure to compute maximum likelihood estimators (MLE). Each iteration of the EM algorithm has two steps, the E-step and the M-step.

E-step: Compute the expected value of the complete log-posterior, given the current estimate of the parameter and the observations, usually denoted as Q.

M-step: Find the parameter value that maximizes the Q function.

3.2. Symmetric case

The complete log posterior for β is given by

$$\begin{aligned} \log p(\beta \mid y, z, \lambda, \upsilon) &\propto -(z - H\beta)^T A (z - H\beta) - \beta^T W \beta \\ &\propto 2\beta^T H^T A z - \beta^T H^T A H \beta - \beta^T W \beta \end{aligned} \tag{5}$$
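where $A = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, $H = [x_1, \ldots, x_n]^T$ is the design matrix and $W = \mathrm{diag}(\tau_1^{-1}, \ldots, \tau_m^{-1})$ is the inverse of the covariance matrix of the normal prior on β. From equation (5) we see that the matrices A and W and the vector z correspond to the missing data. Therefore, the E-step needs to compute $E(\tau_j^{-1} \mid y, \hat\beta^{(t)}, \gamma, \upsilon)$ for j = 1, ..., m, and $E(\lambda_i \mid y, \hat\beta^{(t)}, \gamma, \upsilon)$ and $E(z_i \lambda_i \mid y, \hat\beta^{(t)}, \gamma, \upsilon)$ for i = 1, ..., n.

In order to do these computations we first need the pdfs $p(z_i \mid \hat\beta^{(t)}, y_i, \lambda_i, \gamma, \upsilon)$ and $p(\lambda_i \mid \hat\beta^{(t)}, y_i, \upsilon)$. Note that the conditional distribution of zi given $\hat\beta^{(t)}$, yi, λi, γ and υ is a normal distribution with mean $x_i^T\beta$ and variance $\lambda_i^{-1}$, left truncated at zero if yi = 1 and right truncated at zero if yi = 0. The posterior distribution of λi also has a closed form, given by

$$p(\lambda_i \mid \hat\beta^{(t)}, y, \upsilon) = \begin{cases} \dfrac{p(\lambda_i \mid \upsilon)\, \Phi\!\left(\sqrt{\lambda_i}\, x_i^T\beta\right)}{T_\upsilon(x_i^T\beta)} & y_i = 1, \\[2ex] \dfrac{p(\lambda_i \mid \upsilon)\, \Phi\!\left(-\sqrt{\lambda_i}\, x_i^T\beta\right)}{T_\upsilon(-x_i^T\beta)} & y_i = 0, \end{cases}$$

where Φ denotes the cumulative distribution function of the standard normal with mean 0 and variance 1. Now we can compute the expected values of λi and zi, which are given by

$$E(\lambda_i \mid \hat\beta^{(t)}, y_i, \upsilon) = \begin{cases} \dfrac{T_{\upsilon+2}\!\left(x_i^T\beta \sqrt{\frac{\upsilon+2}{\upsilon}}\right)}{T_\upsilon(x_i^T\beta)} & y_i = 1, \\[2ex] \dfrac{T_{\upsilon+2}\!\left(-x_i^T\beta \sqrt{\frac{\upsilon+2}{\upsilon}}\right)}{T_\upsilon(-x_i^T\beta)} & y_i = 0, \end{cases} \tag{6}$$

and

$$E(z_i \mid \hat\beta^{(t)}, \lambda_i, y, \upsilon) = \begin{cases} x_i^T\hat\beta^{(t)} + \dfrac{1}{\sqrt{\lambda_i}}\, \dfrac{\phi\!\left(\sqrt{\lambda_i}\, x_i^T\hat\beta^{(t)}\right)}{\Phi\!\left(\sqrt{\lambda_i}\, x_i^T\hat\beta^{(t)}\right)} & y_i = 1, \\[2ex] x_i^T\hat\beta^{(t)} - \dfrac{1}{\sqrt{\lambda_i}}\, \dfrac{\phi\!\left(\sqrt{\lambda_i}\, x_i^T\hat\beta^{(t)}\right)}{\Phi\!\left(-\sqrt{\lambda_i}\, x_i^T\hat\beta^{(t)}\right)} & y_i = 0, \end{cases} \tag{7}$$

respectively, where φ denotes the standard normal density. The next computation in the E-step is $E(\lambda_i z_i \mid y_i, \hat\beta^{(t)}, \upsilon)$. Observe that $E(\lambda_i z_i \mid y_i, \hat\beta^{(t)}, \upsilon) = E\{E(\lambda_i z_i \mid y_i, \hat\beta^{(t)}, \upsilon, \lambda_i)\} = E\{\lambda_i E(z_i \mid y_i, \hat\beta^{(t)}, \upsilon, \lambda_i)\}$. Therefore, by substituting first eq. (7) and then eq. (6) in this expectation we obtain

$$E(\lambda_i z_i \mid y_i, \hat\beta^{(t)}, \upsilon) = \begin{cases} x_i^T\beta\, \dfrac{T_{\upsilon+2}\!\left(x_i^T\beta \sqrt{\frac{\upsilon+2}{\upsilon}}\right)}{T_\upsilon(x_i^T\beta)} + \dfrac{t_\upsilon(x_i^T\beta)}{T_\upsilon(x_i^T\beta)} & y_i = 1, \\[2ex] x_i^T\beta\, \dfrac{T_{\upsilon+2}\!\left(-x_i^T\beta \sqrt{\frac{\upsilon+2}{\upsilon}}\right)}{T_\upsilon(-x_i^T\beta)} - \dfrac{t_\upsilon(x_i^T\beta)}{T_\upsilon(-x_i^T\beta)} & y_i = 0. \end{cases} \tag{8}$$

Finally, the expectation for $\tau_j^{-1}$ is given by

$$E(\tau_j^{-1} \mid y, \hat\beta^{(t)}, \gamma, \upsilon) = \frac{\int_0^\infty \frac{1}{\tau_j}\, p(\tau_j \mid \gamma)\, p(\hat\beta_j^{(t)} \mid \tau_j)\, d\tau_j}{\int_0^\infty p(\tau_j \mid \gamma)\, p(\hat\beta_j^{(t)} \mid \tau_j)\, d\tau_j} = \frac{\gamma}{|\hat\beta_j^{(t)}|}. \tag{9}$$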
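The E-step quantities of equations (6), (8) and (9) can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: it uses only the Python standard library, evaluates the Student-t cdf by Simpson integration of the density (an implementation choice made here), and writes `m` for the linear predictor $x_i^T\beta$. All function names are hypothetical.

```python
import math


def t_pdf(x, nu):
    """Student-t density with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))


def t_cdf(x, nu, n=2000):
    """Student-t cdf via composite Simpson integration of the density on [0, |x|]."""
    if x == 0:
        return 0.5
    a = abs(x)
    h = a / n                                  # n is even, as Simpson's rule requires
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def e_lambda(m, y, nu):
    """Eq. (6): E(lambda_i | y_i) with m = x_i^T beta."""
    s = m if y == 1 else -m
    return t_cdf(s * math.sqrt((nu + 2) / nu), nu + 2) / t_cdf(s, nu)


def e_lambda_z(m, y, nu):
    """Eq. (8): E(lambda_i z_i | y_i) with m = x_i^T beta."""
    sign = 1.0 if y == 1 else -1.0
    return m * e_lambda(m, y, nu) + sign * t_pdf(m, nu) / t_cdf(sign * m, nu)


def e_inv_tau(beta_j, gamma):
    """Eq. (9): E(1 / tau_j) = gamma / |beta_j|; undefined when beta_j is exactly 0."""
    return gamma / abs(beta_j)
```

For instance, `e_lambda(0.0, 1, 5)` equals 1 because both cdfs are evaluated at zero, and `e_lambda_z(0.0, 1, 1)` equals 2/π since the Cauchy density at zero is 1/π. How coefficients that reach exactly zero are handled in equation (9) is not covered by this sketch.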
Denote

$$W^* = \mathrm{diag}\!\left(E(\tau_1^{-1} \mid y, \hat\beta^{(t)}, \gamma, \upsilon), \ldots, E(\tau_m^{-1} \mid y, \hat\beta^{(t)}, \gamma, \upsilon)\right),$$

$$A^* = \mathrm{diag}\!\left(E(\lambda_1 \mid y, \hat\beta^{(t)}, \gamma, \upsilon), \ldots, E(\lambda_n \mid y, \hat\beta^{(t)}, \gamma, \upsilon)\right),$$

$$z^* = \left(E(z_1 \lambda_1 \mid y, \hat\beta^{(t)}, \gamma, \upsilon), \ldots, E(z_n \lambda_n \mid y, \hat\beta^{(t)}, \gamma, \upsilon)\right)^T.$$

Then the M-step that results from maximizing equation (5) with respect to β is given by

$$\hat\beta = (H^T A^* H + W^*)^{-1} H^T z^* \tag{10}$$

Summarizing, in the (t + 1)th iteration of the EM algorithm,

E-step: Compute W∗, A∗ and z∗ using equations (9), (6) and (8) respectively.

M-step: Obtain a new estimate $\hat\beta^{(t+1)}$ by substituting the new values of W∗, A∗ and z∗ in eq. (10). If $\|\hat\beta^{(t+1)} - \hat\beta^{(t)}\| / \|\hat\beta^{(t)}\| < \Delta$ stop, otherwise go back to the E-step. We fix Δ = 0.005 following Figueiredo and Jain [7].

3.3. General case

In the general case of Section 2.2, the complete log-posterior can be written as

$$\begin{aligned} \log p(\beta \mid \lambda, \upsilon, y, z, \delta) &\propto -(r - H\beta)^T A (r - H\beta) - \beta^T W \beta \\ &\propto 2\beta^T H^T A r - \beta^T H^T A H \beta - \beta^T W \beta \end{aligned} \tag{11}$$
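where $r_i = \frac{z_i}{\delta}\, I(z_i > 0) + z_i \delta\, I(z_i < 0)$, and W and A are defined in Section 3.2. To get the new equations for the EM algorithm we have to compute $E(\lambda_i r_i \mid y_i, \hat\beta^{(t)}, \upsilon)$ and $E(\lambda_i \mid y_i, \hat\beta^{(t)}, \upsilon)$ for i = 1, ..., n. Using the same trick that we used before, i.e. $E(\lambda_i z_i \mid y_i, \hat\beta^{(t)}, \upsilon, \delta) = E\{E(\lambda_i z_i \mid y_i, \hat\beta^{(t)}, \upsilon, \lambda_i, \delta)\} = E\{\lambda_i E(z_i \mid y_i, \hat\beta^{(t)}, \upsilon, \lambda_i, \delta)\}$, we obtain $E(\lambda_i r_i \mid y_i, \hat\beta^{(t)}, \upsilon, \lambda_i)$ for yi = 0 and yi = 1 in the following way:

$$E(\lambda_i r_i \mid y_i = 1, \beta, \upsilon, \lambda_i) = \begin{cases} \begin{aligned} &\delta\, x_i^T\beta\, \frac{T_{\upsilon+2,\delta}\!\left(-x_i^T\beta \left(\tfrac{1}{\delta} - 1\right) \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right)}{T_{\upsilon,\delta}(x_i^T\beta)} + \delta\, \frac{t_{\upsilon,\delta}\!\left(x_i^T\beta \left(\tfrac{1}{\delta} - 1\right)\right)}{T_{\upsilon,\delta}(x_i^T\beta)} \\ &+ \frac{x_i^T\beta}{\delta}\, \frac{T_{\upsilon+2,\delta}\!\left(x_i^T\beta\, (\delta - 1) \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right) - T_{\upsilon+2,\delta}\!\left(-x_i^T\beta \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right)}{T_{\upsilon,\delta}(x_i^T\beta)} \\ &+ \frac{1}{\delta}\, \frac{t_{\upsilon,\delta}(-x_i^T\beta) - t_{\upsilon,\delta}\!\left(x_i^T\beta\, (\delta - 1)\right)}{T_{\upsilon,\delta}(x_i^T\beta)} \end{aligned} & x_i^T\beta \ge 0, \\[4ex] \delta\, x_i^T\beta\, \dfrac{T_{\upsilon+2,\delta}\!\left(x_i^T\beta \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right)}{T_{\upsilon,\delta}(x_i^T\beta)} + \delta\, \dfrac{t_{\upsilon,\delta}(-x_i^T\beta)}{T_{\upsilon,\delta}(x_i^T\beta)} & x_i^T\beta < 0, \end{cases} \tag{12}$$

$$E(\lambda_i r_i \mid y_i = 0, \beta, \upsilon, \lambda_i) = \begin{cases} \dfrac{x_i^T\beta}{\delta}\, \dfrac{T_{\upsilon+2,\delta}\!\left(-x_i^T\beta \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right)}{T_{\upsilon,\delta}(-x_i^T\beta)} - \dfrac{1}{\delta}\, \dfrac{t_{\upsilon,\delta}(x_i^T\beta)}{T_{\upsilon,\delta}(-x_i^T\beta)} & x_i^T\beta \ge 0, \\[4ex] \begin{aligned} &\delta\, x_i^T\beta\, \frac{T_{\upsilon+2,\delta}\!\left(-x_i^T\beta \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right) - T_{\upsilon+2,\delta}\!\left(x_i^T\beta \left(\tfrac{1}{\delta} - 1\right) \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right)}{T_{\upsilon,\delta}(-x_i^T\beta)} \\ &+ \delta\, \frac{t_{\upsilon,\delta}\!\left(x_i^T\beta \left(\tfrac{1}{\delta} - 1\right)\right) - t_{\upsilon,\delta}(-x_i^T\beta)}{T_{\upsilon,\delta}(-x_i^T\beta)} - \frac{1}{\delta}\, \frac{t_{\upsilon,\delta}\!\left(x_i^T\beta\, (\delta - 1)\right)}{T_{\upsilon,\delta}(-x_i^T\beta)} \\ &+ \frac{x_i^T\beta}{\delta}\, \frac{T_{\upsilon+2,\delta}\!\left(x_i^T\beta\, (\delta - 1) \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right)}{T_{\upsilon,\delta}(-x_i^T\beta)} \end{aligned} & x_i^T\beta < 0. \end{cases} \tag{13}$$

The expected value for λi given yi, $\hat\beta^{(t)}$ and υ, $E(\lambda_i \mid y_i, \hat\beta^{(t)}, \upsilon)$, can be computed as follows:

$$E(\lambda_i \mid y_i = 1, \beta, \upsilon) = \begin{cases} \begin{aligned} &\frac{1}{\delta}\, \frac{T_{\upsilon+2,\delta}\!\left(x_i^T\beta\, (\delta - 1) \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right)}{T_{\upsilon,\delta}(x_i^T\beta)} - \frac{1}{\delta}\, \frac{T_{\upsilon+2,\delta}\!\left(-x_i^T\beta \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right)}{T_{\upsilon,\delta}(x_i^T\beta)} \\ &+ \delta\, \frac{T_{\upsilon+2,\delta}\!\left(-x_i^T\beta \left(\tfrac{1}{\delta} - 1\right) \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right)}{T_{\upsilon,\delta}(x_i^T\beta)} \end{aligned} & x_i^T\beta \ge 0, \\[4ex] \delta\, \dfrac{T_{\upsilon+2,\delta}\!\left(x_i^T\beta \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right)}{T_{\upsilon,\delta}(x_i^T\beta)} & x_i^T\beta < 0, \end{cases} \tag{14}$$

$$E(\lambda_i \mid y_i = 0, \beta, \upsilon) = \begin{cases} \dfrac{1}{\delta}\, \dfrac{T_{\upsilon+2,\delta}\!\left(-x_i^T\beta \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right)}{T_{\upsilon,\delta}(-x_i^T\beta)} & x_i^T\beta \ge 0, \\[4ex] \begin{aligned} &\frac{1}{\delta}\, \frac{T_{\upsilon+2,\delta}\!\left(x_i^T\beta\, (\delta - 1) \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right)}{T_{\upsilon,\delta}(-x_i^T\beta)} + \delta\, \frac{T_{\upsilon+2,\delta}\!\left(-x_i^T\beta \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right)}{T_{\upsilon,\delta}(-x_i^T\beta)} \\ &- \delta\, \frac{T_{\upsilon+2,\delta}\!\left(x_i^T\beta \left(\tfrac{1}{\delta} - 1\right) \sqrt{\tfrac{\upsilon+2}{\upsilon}}\right)}{T_{\upsilon,\delta}(-x_i^T\beta)} \end{aligned} & x_i^T\beta < 0. \end{cases} \tag{15}$$

Denote

$$r^* = \left(E(\lambda_1 r_1), \ldots, E(\lambda_n r_n)\right), \tag{16}$$

$$W^* = \mathrm{diag}\!\left(E(\tau_1^{-1} \mid y, \hat\beta^{(t)}, \gamma, \upsilon), \ldots, E(\tau_m^{-1} \mid y, \hat\beta^{(t)}, \gamma, \upsilon)\right), \tag{17}$$

$$A^* = \mathrm{diag}\!\left(E(\lambda_1 \mid y, \hat\beta^{(t)}, \gamma, \upsilon), \ldots, E(\lambda_n \mid y, \hat\beta^{(t)}, \gamma, \upsilon)\right). \tag{18}$$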
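The M-step of equation (10), which is shared by the symmetric and general cases once the starred quantities are available, amounts to solving a small penalized linear system. The sketch below is one way to write it with the Python standard library only; the Gaussian-elimination solver, the use of the Euclidean norm in the stopping rule and all function names are assumptions of this example rather than details given in the text.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def m_step(H, a_star, w_star, z_star):
    """Eq. (10): beta = (H^T A* H + W*)^{-1} H^T z*.

    H is a list of n rows of length d; a_star and w_star hold the diagonals
    of A* and W*; z_star is the vector z* (or r* in the general case).
    """
    n, d = len(H), len(H[0])
    lhs = [[sum(H[i][j] * a_star[i] * H[i][k] for i in range(n))
            + (w_star[j] if j == k else 0.0)
            for k in range(d)] for j in range(d)]
    rhs = [sum(H[i][j] * z_star[i] for i in range(n)) for j in range(d)]
    return solve(lhs, rhs)


def converged(new, old, tol=0.005):
    """Relative change ||new - old|| / ||old|| < tol (Euclidean norm assumed)."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(new, old)))
    den = math.sqrt(sum(b * b for b in old))
    return num / den < tol
```

With a single predictor the update reduces to a ratio of sums: for `H = [[1.0], [2.0]]`, unit weights, `w_star = [1.0]` and `z_star = [1.0, 2.0]`, `m_step` returns 5/6.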
The new steps for the EM algorithm are as follows. In the (t + 1)th iteration,

E-step: Compute W∗, A∗ and r∗ using equations (9), (14)-(15) and (12)-(13) respectively.

M-step: Obtain a new estimate $\hat\beta^{(t+1)}$ by substituting the new values of W∗, A∗ and r∗ in eq. (10), where now z∗ is replaced by r∗. If $\|\hat\beta^{(t+1)} - \hat\beta^{(t)}\| / \|\hat\beta^{(t)}\| < \Delta$ stop, otherwise go back to the E-step. We set Δ = 0.005.

In both cases, the symmetric and the general, we initialized the algorithm by setting $\hat\beta^{(0)} = (\epsilon I + H^T H)^{-1} H^T y$ (with ε = 1e−6), which corresponds to a weakly penalized ridge-type estimator.

4. Simulation study

In this section we present a simulation study that compares the flexible Student-t link (hereafter FST) with the probit and logistic models and also with the recently introduced elastic net (Zou and Hastie [16]). We assess the performance of the models by measuring the misclassification rate on simulated datasets. Two examples are presented here.

Example 1: We generated 10 datasets consisting of 10 predictors and 100 observations. The response random variable was generated as a random binomial of size 1 and probability 0.6, and the 10 predictors were generated as random binomials of size 1 with probabilities equal to 0.3 (in three predictors), 0.5 (in five predictors) and 0.8 (in two predictors).

Example 2: We generated 10 datasets consisting of 10 predictors and 100 observations. In these datasets we allow high correlation between the response variable and the predictors, and also high correlation among the predictors. We generated the response and the predictors as multivariate normal with mean zero and some structure in the covariance matrix, and then we dichotomized these variables by assigning a 0 to negative values and a 1 to positive values. The assumed covariance matrix structure is the following: the response together with the first 4 predictors all have correlation equal to 0.8 with each other. The next three predictors have correlation 0.3 with each other, and the last three predictors have correlation 0.4 with each other. The other correlations are all equal to 0.01.

Our model has three tuning parameters: υ (the degrees of freedom of the Student-t distribution), which controls the thickness of the tails of the distribution; γ, which controls the sparseness of the parameter estimates; and δ, which controls the skewness of the distribution. For different values of the parameters (υ ∈ {1, . . . , 8, 15, 30, 50, 100}, γ ∈ {0.01, 0.1, 1, . . . , 10, 20, 50, 100} and δ ∈ {0.01, 0.1, 0.5, 0.7, 1, 1.2, 1.5, 2, 3, 4, 5, 10}) we estimate the misclassification rate of our proposed model. We compare the best performance with the performance of the generalized linear models with probit and logit links, which were fitted using the R statistical package (see results in Table 1). In Example 1, our flexible t link function FST consistently gives better performance than the logit and probit, ranging from 1% improvement in dataset 9 to 16% in the first dataset. Note that most of the best models choose the degrees of freedom of the Student-t distribution to be equal to 1, i.e. they prefer fat-tailed distributions.

We also compare our FST model with a related method introduced by Zou and Hastie [16], the so-called elastic net. We estimated this model over a grid of points for the two parameters of the model (λ ∈ {0.01, 0.1, 1, 10, 100}) using the statistical software R and selected the best performance. The results are shown in the last column of Table 1.
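The data-generating mechanism of Example 2 described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the variables are taken to have unit variance so that the covariance matrix equals the correlation matrix, index 0 is taken to be the response, and the function names are hypothetical. Only the Python standard library is used.

```python
import math
import random


def example2_correlation():
    """11 x 11 correlation matrix: index 0 is the response, 1..10 the predictors."""
    blocks = [(range(0, 5), 0.8), (range(5, 8), 0.3), (range(8, 11), 0.4)]
    c = [[0.01] * 11 for _ in range(11)]       # all remaining correlations
    for idx, rho in blocks:
        for i in idx:
            for j in idx:
                c[i][j] = rho
    for i in range(11):
        c[i][i] = 1.0
    return c


def cholesky(c):
    """Lower-triangular factor L with L L^T = c (c must be positive definite)."""
    n = len(c)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = c[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def simulate_example2(n_obs=100, seed=0):
    """Draw correlated normals, then dichotomize: 1 for positive values, 0 otherwise."""
    rng = random.Random(seed)
    L = cholesky(example2_correlation())
    y, X = [], []
    for _ in range(n_obs):
        e = [rng.gauss(0.0, 1.0) for _ in range(11)]
        v = [sum(L[i][k] * e[k] for k in range(i + 1)) for i in range(11)]
        b = [1 if t > 0 else 0 for t in v]
        y.append(b[0])
        X.append(b[1:])
    return y, X
```

For two standard normal variables with correlation ρ, the probability that they share the same sign is 1/2 + arcsin(ρ)/π, so with ρ = 0.8 the response and the first predictor should agree in roughly 80% of the observations.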
Our method gives slightly poorer performance than the elastic net only in Datasets 9 and 10.

We look at the best performance of the flexible probit (approximated by Student-t30) and logit (approximated by Student-t8) models (results in Table 2) and the choice of parameters that gives their best performance. Note that in most cases the same set of parameters gives the best performance in both models. The three link functions consistently choose δ = 1.2 in most datasets, i.e. they prefer skewed distributions.

Table 3 shows the results for the simulated datasets in Example 2. FST outperforms the three other methods, in general by a substantial margin.

We compare the sparseness of the elastic net with that of the FST. In general, the choice of parameters for the best performing FST model favors small values of the γ parameter, which do not induce sparseness in the model. Compared to the elastic net, the FST model appears to be less sparse. Results are shown in Table 4 for the 10 datasets of Example 2.

5. Applications to text categorization

Text categorization concerns the automated classification of documents into categories from a predefined set C = {c1, . . . , cm}. A standard approach is to build m independent binary classifiers, one for each category, that decide whether or not a document dj is in category ci for i ∈ {1, . . . , m} and j ∈ {1, . . . , n}. Construction of the binary classifiers requires the availability of a corpus of documents D with assigned categories. We learn the classifiers using a subset of the corpus Tr ⊂ D, the training set, and we test the classifiers using what remains of the corpus, Te = D \ Tr, the testing set.
Table 1
Example 1. Misclassification rates for the generalized linear model with probit and logit link functions computed using the statistical software R are shown in the first two columns. The third column corresponds to the best performance achieved, and the following three columns correspond to the parameters of the model that achieved the best performance. A ∗ in a cell means that the minimum misclassification rate is achieved by more than one value of the parameter.

        Misclassification           Parameters
Data    probit   logit   FST       υ     δ      γ       enet
1       0.4      0.4     0.26      1     ∗      ∗       0.27
2       0.27     0.28    0.23      1     1.2    0.01    0.27
3       0.4      0.4     0.36      ∗     1.2    0.1     0.36
4       0.4      0.4     0.38      1     ∗      0.1     0.38
5       0.28     0.28    0.25      1     ∗      ∗       0.29
6       0.4      0.4     0.36      1     ∗      0.01    0.38
7       0.34     0.33    0.30      1     1.2    ∗       0.31
8       0.39     0.37    0.31      ∗     ∗      ∗       0.36
9       0.32     0.32    0.31      ∗     ∗      ∗       0.29
10      0.33     0.33    0.31      ∗     ∗      ∗       0.28

Table 2
Example 1. Best performance achieved by the probit and logit models when approximated by a t30 and t8 respectively, with the corresponding parameters. A ∗ in a cell means that the minimum misclassification rate is achieved by more than one value of the parameter.

        Probit    Parameters      Logit    Parameters
Data    probit    δ      γ        logit    δ      γ
1       0.28      1.2    1        0.27     1.2    1
2       0.26      ∗      0.1      0.26     ∗      0.1
3       0.37      1.2    0.1      0.36     1.2    0.1
4       0.4       ∗      1        0.4      ∗      1
5       0.28      ∗      ∗        0.27     1.2    1
6       0.39      ∗      0.1      0.39     ∗      0.1
7       0.31      1.2    0.1      0.32     1.2    0.1
8       0.31      ∗      3        0.32     ∗      3
9       0.31      ∗      1        0.31     ∗      1
10      0.31      ∗      1        0.31     ∗      1

Table 3
Example 2. Misclassification rates for the generalized linear model with probit and logit link functions computed using the statistical software R are shown in the first two columns. The third column corresponds to the best performance achieved, and the following three columns correspond to the parameters of the model that achieved the best performance. A ∗ in a cell means that the minimum misclassification rate is achieved by more than one value of the parameter.

        Misclassification           Parameters
Data    probit   logit   FST       υ     δ     γ       enet
1       0.08     0.08    0.03      1     ∗     ∗       0.06
2       0.15     0.14    0.06      1     ∗     0.01    0.13
3       0.13     0.12    0.06      1     ∗     ∗       0.12
4       0.13     0.13    0.1       ∗     ∗     ∗       0.11
5       0.08     0.08    0.05      1     ∗     ∗       0.08
6       0.11     0.11    0.08      ∗     ∗     ∗       0.10
7       0.09     0.09    0.07      ∗     ∗     ∗       0.09
8       0.13     0.14    0.09      ∗     ∗     ∗       0.13
9       0.15     0.15    0.07      1     ∗     0.01    0.14
10      0.18     0.18    0.12      ∗     ∗     ∗       0.16

Table 4
Number of parameter estimates equal to zero in the elastic net and FST models for each of the datasets in Example 2.

Data     1    2    3    4    5    6    7    8    9    10
enet     0    0    0    2    5    3    9    6    3    7
FST      1    0    0    0    2    1    4    0    0    5
Usually, documents are represented as vectors of weights dj = (x1j, . . . , xdj), where xij is a function of the frequency of appearance of word wi in document dj, for d words w1, . . . , wd. This is the so-called "bag of words" representation (see e.g. Mladenic [12]). Because of the large number of possible features, or different words, that can be gathered from a set of documents (usually one hundred thousand or more), the classifiers are commonly built with a subset of the words. The text classification literature has tended to focus on feature selection (or word selection) algorithms that compute a score independently for each candidate feature; this is the so-called filtering approach. The scores typically depend on the counts of occurrences of the word in documents within the class and outside the class in the training documents. For a predefined number of words to be selected, say d, the d words with the highest score are chosen. Several score functions exist; for a thorough comparative study of many of them see Forman [8]. We consider the 100 best words according to the information gain criterion. Before selecting these 100 words we remove common noninformative words taken from a standard stopword list of 571 words.

We performed the experiments on one standard text dataset. The dataset that we use comes from the Reuters news story collection, which contains 21,450 documents that have been assigned zero or more categories among more than a hundred categories. We use a subset of the ModApte version of the Reuters-21578 collection, in which each document has at least one topic label (or category) and this topic label belongs to one of the 10 most populous categories: earn, acq, grain, wheat, crude, trade, interest, corn, ship, money-fx. It contains 6,775 documents in the training set and 2,258 in the testing set.

To evaluate the performance of the different classifiers we use Recall, Precision and the F1 measure.
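The evaluation measures just named, together with their macro and micro averages over categories, can be sketched as below. The sketch assumes the common convention in which the micro average is computed from true-positive, false-positive and false-negative counts pooled over all categories, and it returns 0 for undefined ratios; the function names and the example counts are hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from counts of true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_micro(counts):
    """counts maps each category to (tp, fp, fn); returns (macro, micro) triples."""
    per_category = [prf(*c) for c in counts.values()]
    # Macro average: arithmetic mean of the per-category measures.
    macro = tuple(sum(v[k] for v in per_category) / len(per_category) for k in range(3))
    # Micro average: measures computed from counts pooled over categories.
    micro = prf(*(sum(c[k] for c in counts.values()) for k in range(3)))
    return macro, micro


if __name__ == "__main__":
    example = {"earn": (90, 10, 10), "corn": (1, 1, 3)}   # made-up counts
    print(macro_micro(example))
```

In the made-up example the large category dominates the micro average while the macro average gives both categories equal weight, which is why the two can differ noticeably on skewed category distributions.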
Recall measures the proportion of documents correctly classified among the documents in the same category. Precision measures the proportion of documents correctly classified among all documents classified into the same category, and F1 is the harmonic mean of Recall and Precision. There are two ways to average Recall, Precision and the F1 measure over all categories, micro-averaging and macro-averaging. The micro-average is an average weighted by the class distribution and the macro-average is the arithmetic mean over all categories. All three measures depend on a specific threshold, which is chosen either by cross validation, by letting part of the dataset (a validation set) determine the best choice, or by fixing it arbitrarily. We set this threshold to be 0.5, i.e., we simply classify a document to the category with the highest probability.

To choose the model that will perform best on new data, we divide the dataset into three parts: a training set, a validation set and a testing set. On the training set, we fix the tuning parameters and learn the model, which is tested using the validation set. We repeat this process for every set of tuning parameters. We pick the set of tuning parameters that gives the best performance on the validation set. Then the algorithm learns the model with the chosen tuning parameters, this time using the training and validation sets. The performance of this final model is assessed using the testing set. We repeat this whole process 5 times, for different splits of the dataset into training, validation and testing sets, to evaluate the performance error. We use 50% of the documents for training, 25% for validation and 25% for testing in the 5 splits of the dataset. We vary the tuning parameters as follows: υ ∈ {1, 2, 5, 8, 30}, γ ∈ {0.01, 0.05, 0.1, 1, 2} and δ ∈ {0.1, 0.5, 1, 1.5, 2}. For each category, we pick the set of tuning parameters (υ, γ, δ) that gives the highest performance according to
the F1 measure on the validation set. Table 5 shows the values of these parameters for Split 1. Note that three of the categories (among the most populous ones) choose a symmetric link (δ = 1). The first column of Tables 6, 7 and 8 shows the micro and macro averages of the F1 measure, recall and precision respectively, for the FST models on the testing set in the 5 splits of the dataset. The second column corresponds to the model with symmetric t-link with 30 degrees of freedom, which approximates the probit link, and the third column corresponds to the model with symmetric t-link with 8 degrees of freedom. The last two rows of Tables 6, 7 and 8 are the average and standard deviation, respectively, across the five splits of the dataset. Note that the best performance is achieved by the FST model in the 5 splits according to the F1 measure and recall. For precision, the best performing model uses a probit link.

Table 5
Tuning parameters for each category in the Reuters dataset chosen using Split 1.

Category    υ    γ       δ
earn        8    0.01    1
acq         2    0.1     1
grain       1    0.05    1.5
wheat       1    0.1     0.5
crude       1    0.05    1.5
trade       1    0.05    1.5
interest    8    0.01    1
corn        1    0.01    0.5
ship        2    0.05    1.5
money-fx    2    0.05    1.5

Table 6
Micro F1 and macro F1 measures for different link functions. The last two rows show the average and standard deviation over the five splits.

         FST                Probit             Logistic
Split    micro    macro     micro    macro     micro    macro
1        0.862    0.801     0.855    0.781     0.857    0.786
2        0.853    0.796     0.849    0.788     0.853    0.795
3        0.863    0.802     0.859    0.788     0.861    0.793
4        0.871    0.807     0.864    0.788     0.866    0.795
5        0.874    0.816     0.867    0.802     0.869    0.804
mean     0.865    0.804     0.859    0.789     0.861    0.795
sd       0.008    0.008     0.007    0.008     0.006    0.006

Table 7
Micro recall and macro recall measures for different link functions. The last two rows show the average and standard deviation over the five splits.

         FST                Probit             Logistic
Split    micro    macro     micro    macro     micro    macro
1        0.818    0.747     0.790    0.693     0.796    0.703
2        0.806    0.749     0.785    0.715     0.795    0.730
3        0.821    0.748     0.796    0.699     0.803    0.710
4        0.828    0.749     0.803    0.703     0.809    0.715
5        0.831    0.759     0.805    0.712     0.812    0.723
mean     0.821    0.750     0.796    0.704     0.803    0.716
sd       0.01     0.005     0.008    0.009     0.008    0.011

Table 8
Micro precision and macro precision measures for different link functions. The last two rows show the average and standard deviation over the five splits.

         FST                Probit             Logistic
Split    micro    macro     micro    macro     micro    macro
1        0.910    0.864     0.932    0.894     0.930    0.891
2        0.906    0.849     0.923    0.878     0.920    0.872
3        0.909    0.865     0.932    0.902     0.928    0.898
4        0.919    0.874     0.934    0.897     0.931    0.894
5        0.921    0.881     0.940    0.917     0.935    0.907
mean     0.913    0.867     0.932    0.898     0.929    0.892
sd       0.007    0.012     0.006    0.014     0.006    0.013

6. Conclusions

This paper introduces a flexible Bayesian generalized linear model for dichotomous response data. We gain considerable flexibility by embedding the logistic and probit links into a larger class, the class of all symmetric and asymmetric t-link functions. The empirical results and simulations demonstrate the good performance of the proposed model. We find that the model with the t-link function consistently improves the performance, according to the F1 measure and misclassification rate, as compared with the models with probit or logistic link functions. We also compare our model with the elastic net, which is a related method, and show that our method in general outperforms the elastic net, usually by a substantial margin.

The FST model, being a Bayesian model that can also be interpreted as a penalized likelihood model, enjoys all the good properties of these models. Shrinking the parameter estimates, for example, is an important property of these models, which has been widely shown to lead to good generalization performance. We implemented an EM algorithm to learn the parameters of the model. A drawback is that our algorithm has been implemented to allow only categorical predictors. We plan to extend our work to allow for continuous predictors.

Acknowledgments. We are grateful to David D. Lewis for helpful discussions.

References

[1] Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. J. Amer. Statist. Assoc. 88 669–679.
[2] Azzalini, A. and Dalla Valle, A. (1996). The multivariate skew-normal distribution. Biometrika 83 715–726.
[3] Capobianco, R. (2003). Skewness and fat tails in discrete choice models.
[4] Chen, M., Dey, D. K. and Shao, Q. (1999). A new skew link model for dichotomous quantal response data. J. Amer. Statist. Assoc. 94 1172–1186.
[5] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39 1–38.
[6] Fernandez, C. and Steel, M. F. J. (1998). On Bayesian modelling of fat tails and skewness. J. Amer. Statist. Assoc. 93 359–371.
[7] Figueiredo, M. A. T. and Jain, A. K. (2001). Bayesian learning of sparse classifiers. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Hawaii, December 2001.
[8] Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. J. Machine Learning Research.
[9] Genkin, A., Lewis, D. D. and Madigan, D. (2007). Large-scale Bayesian logistic regression for text categorization. Technometrics. To appear.
[10] Hastie, T. J., Tibshirani, R. J. and Friedman, J. (2001). The Elements of Statistical Learning. Data Mining, Inference and Prediction. Springer, New York.
[11] MacKay, D. J. C. (1994). Bayesian non-linear modeling for the energy prediction competition. In ASHRAE Transactions 100 1053–1062.
[12] Mladenic, D. (1998). Feature subset selection in text-learning. In European Conference on Machine Learning.
[13] Neal, R. (1996). Bayesian Learning for Neural Networks. Springer, New York.
[14] Scott, S. L. (2003). Data augmentation for the Bayesian analysis of multinomial logit models. Proceedings of the American Statistical Association Section on Bayesian Statistical Science [CD-ROM]. Amer. Statist. Assoc., Alexandria, VA.
[15] Stukel, T. A. (1988). Generalized logistic models. J. Amer. Statist. Assoc. 83 426–431.
[16] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. Roy. Statist. Soc. Ser. B 67 301–320.
IMS Lecture Notes–Monograph Series
Complex Datasets and Inverse Problems: Tomography, Networks and Beyond
Vol. 54 (2007) 92–102
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000076
Estimating the proportion of differentially expressed genes in comparative DNA microarray experiments∗

Javier Cabrera1 and Ching-Ray Yu1

Rutgers University

Abstract: DNA microarray experiments, a well-established experimental technique, aim at understanding the function of genes in some biological processes. One of the most common experiments in functional genomics research is to compare two groups of microarray data to determine which genes are differentially expressed. In this paper, we propose a methodology to estimate the proportion of differentially expressed genes in such experiments. We study the performance of our method in a simulation study where we compare it to other standard methods. Finally we compare the methods on real data from two toxicology experiments with mice.
∗ The first and second authors contributed equally to this paper.
1 Department of Statistics, Rutgers University, NJ, 08854, USA, e-mail: [email protected]; [email protected]
Keywords and phrases: biological process, DNA microarray, differentially expressed genes, toxicology experiment.

1. Introduction

The human genome and a number of other genomes have been almost fully sequenced, but the functions of most genes are still unknown. The difficulty is that gene expression is only one of the pieces of the cellular processes sometimes called biological pathways or networks, and it is not yet possible to observe these pathways directly. DNA microarray technology has made it possible to quantify and compare relative gene expression profiles across a series of conditions, many thousands of genes at a time. By identifying groups of genes that are simultaneously expressed, the guesswork of reconstructing biological pathways is expedited. The information collected through the years on genes that participate in biological pathways or networks has been used to construct GO (Gene Ontology Consortium [8]). The information on differentially expressed genes from a microarray experiment is contrasted with the groupings that are known according to existing GO, and a determination is made on whether or not a certain cellular process is taking place. In addition, there might be a few genes that are differentially expressed in the experiment but were not known to be part of the biological process. These genes become candidates for further extending the pathway and will be confirmed by further experimentation and also by searching for annotations that describe their function in other processes.

However, determining biologically differentially expressed genes accurately is a nontrivial issue. Microarray experiments are high throughput in the sense that they evaluate the expression levels of thousands of genes at a time, but with little replication. It is often the case that the number of replicate chips (biological or technical) is 3 to 5 per condition. In addition, the distributions of gene expressions across samples tend to be skewed and/or heavy tailed, and hence they do not follow a normal distribution. In this situation, permutation tests and traditional t-tests do not work very well because they have very low power. One way to improve the power of the test is to incorporate the GO information into the process. Fisher's exact test (Fisher [7]) has been proposed as a way to detect whether a particular subgroup of genes as a whole is differentially expressed. The test is applied to a two-way table of the indicator variable for the significance of the individual gene versus the indicator variable of the group. Another test considers the Mean-Log-P statistic, mean(−log(p-value)) (Pavlidis et al. [13] and Raghavan et al. [14]), of the genes in the group and compares it to the distribution of the statistic under a random subset of genes. On the other hand, if the overall number of differentially expressed genes is large when applying real data to GO, then Fisher's exact test or the Mean-Log-P test would still have low power. In order to overcome these problems we propose a new modeling approach, which consists of the following steps:

1. Estimate the proportion of differentially expressed genes.
2. Estimate the distribution of p-values for genes that are not differentially expressed. One would expect this distribution to be uniform, but this is not the case in many examples that we have studied. The reason might be related to the processing of the data and the discarding of genes that takes place at some stages of the process. Therefore the model has to estimate the distribution of null p-values by a semi-parametric or nonparametric method.
3. Estimate the distribution of p-values corresponding to differentially expressed genes.
4. Proceed by modeling the distribution of Mean-Log-P statistics for genes that belong to a subgroup or network (see Raghavan et al. [14]), using the estimators of steps 1-3.

In this paper we concentrate on step 1 of the procedure, which corresponds to the estimation of π.
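The Mean-Log-P comparison mentioned above, in which the statistic of a gene group is set against its distribution over random subsets of genes, can be sketched as follows. This is a minimal illustration and not the procedure of the cited papers: the number of random subsets, the add-one correction in the empirical p-value and the function names are assumptions of this example, and all p-values are assumed to be strictly positive.

```python
import math
import random


def mean_log_p(pvalues):
    """Mean-Log-P statistic: the mean of -log(p) over a set of p-values."""
    return sum(-math.log(p) for p in pvalues) / len(pvalues)


def mean_log_p_test(pvalues, group, n_subsets=2000, seed=0):
    """Compare the statistic of `group` with random gene subsets of the same size.

    pvalues: per-gene p-values; group: indices of the genes in the group.
    Returns the observed statistic and an empirical p-value.
    """
    rng = random.Random(seed)
    observed = mean_log_p([pvalues[g] for g in group])
    exceed = 0
    for _ in range(n_subsets):
        subset = rng.sample(range(len(pvalues)), len(group))
        if mean_log_p([pvalues[g] for g in subset]) >= observed:
            exceed += 1
    # Add-one correction keeps the empirical p-value strictly positive.
    return observed, (exceed + 1) / (n_subsets + 1)
```

A group made up of the genes with the smallest p-values yields a small empirical p-value, whereas a group whose genes look like the bulk of the array yields a value near one.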
This quantity π is important also in other situations, for example to calculate q-values (moderated p-values) proposed by Storey and Tibshirani [16]. For step 2-4 of the procedure, we will publish elsewhere as well. In Section 2 we propose a method and an algorithm for estimating π. In Section 3 we report the results of extensive simulation that support the performance of our method as well as comparison with other simpler methods. Example mice and mice2: To illustrate the estimation of π, we apply our procedure for the mouse data sets from toxicology experiments (Amaratunga and Cabrera [3]). These datasets correspond to typical toxicology experiments where a group of mice is treated with a toxic compound and the objective is to find genes that are differentially expressed against samples from untreated mice. mice and mice2 are two of the data sets that consist n1 = n2 = 4 mice in the control and treatment groups and total number of genes are G = 4077 from mice and G = 3434 for mice2 respectively. They represent two examples of cDNA chips, the first one mice has a high proportion π of differentially expressed genes whereas mice2 has a much smaller π. The data from such experiments consist of suitably normalized intensities: Xgij , where g(g = 1, . . . , G) indicates the genes on the microarray, i(i = 1, 2) indexes the groups, and j(j = 1, . . . , ni ) is the i-th mouse in the j-th group. Our goal is to characterize Γ, a subset of genes, among the G genes in the experiment that are differentially expressed across two groups. Methods for determining Γ, researchers (e.g. Schena et al. [15]) use fold change, but they did not take variability into account. Subsequent improvements were t-test
J. Cabrera and C. Yu
statistics (Efron et al. [6], Tusher et al. [17], and Broberg [5]), median-based methods (Amaratunga and Cabrera [1]) and Bayes and Empirical Bayes procedures (Lee et al. [10], Baldi and Long [4], Efron et al. [6], Newton et al. [12], and Lonnstedt and Speed [11]). T-tests are the most widely used method for assessing differential expression. The t-test assumes that normalized intensities are approximately normally distributed with the same variance across the groups, i.e. $X_{gij} \sim N(\mu_{gi}, \sigma_g^2)$. For each gene g, a t-statistic is calculated in order to test the null hypothesis $\mu_{g1} = \mu_{g2}$ and a p-value is generated. For small samples the t-test might be replaced by SAM or the conditional t-test, Ct (Amaratunga and Cabrera [3]), in order to improve the power. Here we will follow the model proposed by Amaratunga and Cabrera [3] for the Ct method. Instead of trying to determine which genes are differentially expressed we will estimate the proportion of differentially expressed genes. Of course, as a consequence we could also produce an ordered list of genes that would be of interest to the biologist, but as we said above the entire procedure will be published elsewhere.

2. Statistical model and inference

The data for experiments typically consist of suitably normalized intensities:

(2.1) $X_{gij} = \mu_g + \tau_{gi} + \sigma_g \epsilon_{gij}$,

where $\mu_g$ and $\sigma_g^2$, g = 1, . . . , G, are the effect and variance of the g-th gene respectively, $\tau_{gi}$ is the effect of the g-th gene in the i-th group (i = 1, 2), and j (j = 1, . . . , $n_i$) indexes the samples. This is the same model as in Amaratunga and Cabrera [3]. The treatment effect of the g-th gene is $\tau_g = |\tau_{g2} - \tau_{g1}|$. We assume that the $\epsilon_{gij}$ are iid observations from an unknown distribution F and that $\sigma_g$ and $\tau_g$ are iid observations from unknown distributions $F_\sigma$ and $F_\tau$, respectively. $F_\sigma$ represents the distribution of the gene variances. $F_\tau$ is likely to have a mass at zero with probability π representing the proportion of genes that are not differentially expressed. If the sample sizes were bigger the unknown distributions could be readily estimated by their respective empirical cdf's, but for small sample sizes the empirical cdf's would produce very biased estimators. In the remainder of this section we will provide three procedures to estimate the three distributions F, $F_\tau$, and $F_\sigma$, which try to overcome the biases induced by small sample size. In the model step:

1. Estimation of the error distribution F: In (2.1), when the number of samples per group is very small (3, 4, 5) and the residuals are subject to two constraints (sample mean $\bar X = 0$, sample standard deviation $s = 1$), then if we pool the residuals together, the empirical distribution that is obtained gives a very poor estimator of the error distribution F. For example: Suppose we sample 1000 genes from a normal distribution with two groups of subjects of sizes 4 and 4. The empirical distribution of the residuals is close to the true error distribution (which is standard normal), as shown in the top-left graph of Figure 1, but if we also simulated the
Estimating the proportion of differentially expressed genes
t-distribution with df = 4 and 10, the qq-plot of the empirical distribution is not so good, as shown in Figure 1. One simple way to avoid this problem is to select a subset of genes $S_G$ that have small absolute t-values (say below 1, or some threshold that gives a large set of genes). For each gene in $S_G$, both samples are pooled together and normalized by subtracting the gene mean and dividing by the standard deviation. If the sample size per group is very small (3, 4, 5), instead of the sample mean and standard deviation it is much better to use the Huber M-estimators of location and scale (Huber [9]), as shown in Figure 1. This will result in a table of residuals $\hat\epsilon_{gij}$, $g \in S_G$. The error distribution F is estimated by

(2.2) $\hat F = \mathrm{EmpiricalCDF}\{\hat\epsilon_{gij},\ g \in S_G,\ i = 1, 2,\ j = 1, \ldots, n_i\}$.
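The construction of the estimate in (2.2) can be sketched as follows. This is a minimal sketch, not the authors' implementation: for brevity it standardizes with the sample mean and standard deviation rather than the Huber M-estimators recommended above for very small groups, the function names are hypothetical, and the data are assumed to have non-zero within-gene variance.

```python
import math
import statistics as st
from bisect import bisect_right

def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance (assumes variance > 0)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * st.variance(a) + (nb - 1) * st.variance(b)) / (na + nb - 2)
    return (st.mean(a) - st.mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

def null_residuals(genes, t_cut=1.0):
    """Pool standardized residuals over genes with |t| < t_cut.
    `genes` is a list of (group1_values, group2_values) pairs."""
    residuals = []
    for a, b in genes:
        if abs(pooled_t(a, b)) < t_cut:
            pooled = list(a) + list(b)
            m, s = st.mean(pooled), st.stdev(pooled)
            residuals.extend((x - m) / s for x in pooled)
    return sorted(residuals)

def empirical_cdf(sorted_values):
    """Return the empirical cdf of an already sorted list of values."""
    n = len(sorted_values)
    return lambda t: bisect_right(sorted_values, t) / n
```

Calling `empirical_cdf(null_residuals(genes))` then gives a function playing the role of the estimated error distribution under these simplifications.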
Figure 1 shows the qq-plot for the estimated error distribution for t-distributed errors. The improvement is very clear.

2. Estimating $F_\sigma$: We follow the method described in Amaratunga and Cabrera [2, 3]. They pointed out that the empirical distribution, $\hat F_\sigma$, of $s_g$ is a very poor estimator of the distribution $F_\sigma$, because on average $\hat F_\sigma$ is much more scattered than $F_\sigma$. They proposed an estimate $\tilde F_\sigma$ of $F_\sigma$ that shrinks $\hat F_\sigma$ towards its center, hence producing a better estimator of $F_\sigma$. A similar algorithm will be discussed in 3.

3. Estimating $F_\tau$ (determining the proportion of differentially expressed genes): We said earlier that $\tau_g$ is drawn from some distribution $F_\tau$. We expect that $F_\tau$ has a mass at zero of probability $F_\tau(0) \ge 0$, which represents the genes that are not differentially expressed. In order to estimate the probability $P(\tau_g = 0)$ we apply an algorithm that will produce an estimator $\tilde F_\tau$ such that $E_{\tilde F_\tau}(\hat F_\tau^*(t)) = \hat F_\tau(t)$, where $\hat F_\tau^*(t)$ is the random variable representing the empirical cdf of $\tau^{**}$ at value t, which is constructed in the following algorithm, and $\hat F_\tau(t)$ represents the actual observed value. The algorithm is as follows:

Algorithm:
Step 1:
1.1) Draw a random sample, $s^*$, from $\tilde F_\sigma$, which is our estimate of the distribution of σ.
1.2) Estimate the error distribution F with the empirical distribution $\hat F$ defined in (2.2).
1.3) Take a random sample (with replacement): $r_{gij} \sim \hat F$ for i = 1, 2, j = 1, . . . , $n_i$, g = 1, . . . , N.
1.4) Draw a sample $\tau_g^*$ from $\hat F_\tau(t) = I_{\{t \ge 0\}}$, where $I_{\{t \ge 0\}} = 1$ if $t \ge 0$ and $I_{\{t \ge 0\}} = 0$ if $t < 0$.
1.5) Construct the pseudo-data: $X^*_{g1j} = s^*_g r_{g1j}$, $X^*_{g2j} = \tau^*_g + s^*_g r_{g2j}$.
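1.6) Reconstruct the distribution $F^*_{\hat F_\tau} = E(\hat F_\tau^* \mid \hat F_\tau)$, where $\hat F_\tau^*$ is the distribution of $\tau^{**}$ obtained from the pseudo-data: $\tau_g^{**} = |\bar X^*_{g1} - \bar X^*_{g2}|$.
1.7) Start by setting $\hat F_\tau^{(old)} = \hat F_\tau$.
1.8) Let $\hat F_\tau^{(new)} = \hat F_\tau\big(F^{*-1}_{\hat F_\tau^{(old)}}(\hat F_\tau)\big)$.
1.9) Set $\hat F_\tau^{(old)} = \hat F_\tau^{(new)}$ and go to 1.3).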
Fig 1. A comparison of the error distribution estimates obtained from the empirical distribution (left) and our estimator (right), when the errors come from a Normal(0,1), t10 and t4 distributions.
1.10) Iterate until convergence (approximately 100 iterations). At convergence we get our final estimate $\tilde F_\tau = \hat F_\tau^{(new)}$.
1.11) Give a cutoff point, say η, which is a 95% quantile of the final $\tilde F_\tau(t)$.
Step 2:
2.1) Repeat 1.4)-1.8) using all original data $X_{gij}$ and the estimated $\hat F_\tau$.
2.2) Get the estimated percentage of $\tau_g^{**}$ which is greater than η × 95% quantile of the standard normal.

Theorem 2.1. At convergence the estimator $\tilde F_\tau$ is a fixed point of step 1.8) of the algorithm. That is, $\tilde F_\tau = \hat F_\tau\big(F^{*-1}_{\tilde F_\tau}(\hat F_\tau)\big)$; then we have

(2.3) $E_{\tilde F_\tau}(\hat F_\tau^*) = \hat F_\tau$.

Proof. If the algorithm converges, then $\tilde F_\tau = \hat F_\tau\big(F^{*-1}_{\tilde F_\tau}(\hat F_\tau)\big)$. Thus

$\hat F_\tau \circ \tilde F_\tau^{-1} \circ \hat F_\tau = F^*_{\tilde F_\tau} = E(\tilde F_\tau \mid \tilde F_\tau) = \tilde F_\tau$
$\Rightarrow \hat F_\tau \circ \tilde F_\tau^{-1} = \tilde F_\tau \circ \hat F_\tau^{-1}$
$\Rightarrow (\hat F_\tau \circ \tilde F_\tau^{-1})^2 = I$
$\Rightarrow \hat F_\tau \circ \tilde F_\tau^{-1} = I$ or $\hat F_\tau \circ \tilde F_\tau^{-1} = -I$ (impossible, since $\hat F_\tau, \tilde F_\tau \ge 0$).

Hence $E_{\tilde F_\tau}(\hat F_\tau^*) = E_{\tilde F_\tau}(\tilde F_\tau) = \tilde F_\tau = \hat F_\tau$.

Remark 1. Based on our simulations, the algorithm converges in at most 100 iterations.

Remark 2. At convergence, $\tilde F_\tau$ is very close to $\hat F_\tau$ and $\hat F_\tau^*$ is also very close to $\tilde F_\tau$, so that we have the result $E_{\tilde F_\tau}(\hat F_\tau^*) = \hat F_\tau$.

Remark 3. This is a two-stage estimation method. We split the data into two pieces. One is the non-informative data, which produces a good estimate of the error distribution. The other is the informative data, for which we use a shrinkage method to estimate the distribution of $\tau_g$, which gives a better result.

Performance assessment: To assess the performance of this method, we simulated data points, which are normally and independently distributed.
1. $X_{gij} \sim N(\tau_g, 1)$, where G = 10000, n1 = n2 = 4 and we assume that $G_{sig}$ = 1000, . . . , 9000 of the G genes were differentially expressed between the two groups and their difference was δ, i.e. $\tau_g = \delta$ (δ = 1, 2) for all g = 1, . . . , $G_{sig}$, and $\tau_g = 0$ otherwise.
2. $X_{gij} \sim N(\tau_g, \sigma_g^2)$, where G = 10000, n1 = n2 = 4 and we assume that $G_{sig}$ = 1000, . . . , 9000 of the G genes were differentially expressed between the two groups and their difference was δ = 1, 2, for all g = 1, . . . , $G_{sig}$, and $\tau_g = 0$ otherwise, and the $\sigma_g^2$ are chi-square distributed with 3 degrees of freedom. We calibrate the mean of $\sigma_g^2$ to 1, i.e. we use $\sigma_g^2/3$.
We compare our method to permutation tests and t-tests using a threshold of 0.05 to determine significance. These two methods are standard in biological applications. Our method is much more accurate than the other two methods (Tables
1-4, Figure 2). Each cell in the tables is the mean (standard deviation) based on 10 simulations for each condition. In Figure 2, the straight line represents the true values and the red line is obtained by a smoothing spline. We also calculate the pFDR of our method for different values of λ (Tables 5-6). The pFDR decreases when the true value increases.

3. Discussion and extensions

In this paper we propose an algorithm for estimating the proportion of differentially expressed genes in a microarray experiment. We also show that the estimator of the distribution of the variance converges to a fixed point. We performed a simulation study to check the performance of our estimate; it is shown to be "satisfactory", and our method performs better than alternatives such as permutation tests and the standard two-sample t-test. The simulations were performed under normal and gamma error distributions and with constant variances and chi-square variances. In addition we illustrate the method with real data examples on mice and mice2 (Table 7, Figure 3). In the real data examples we obtained estimates of the proportion of significant genes that were more realistic than those produced by the other methods. Hence, this algorithm gives more accurate estimates for detecting differentially expressed genes. The same method is generally extendable to other more complicated modeling procedures such as the one-way ANOVA F-test and other linear models. The same model is used and the same ideas are easily extendable; this will be the subject of a second paper. Another paper will deal with the GO issues, by modeling the p-values and obtaining a null distribution that will be used to detect differentially expressed gene networks and subsets.

Table 1
Normal(0,1)

        δ = 1                                        δ = 2
true λ  t-test         Permutation    New method     t-test         Permutation    New method
0.1     0.066 (0.002)  0.039 (0.002)  0.071 (0.058)  0.109 (0.002)  0.074 (0.003)  0.087 (0.020)
0.2     0.085 (0.003)  0.051 (0.002)  0.163 (0.091)  0.171 (0.002)  0.120 (0.002)  0.196 (0.022)
0.3     0.103 (0.002)  0.063 (0.002)  0.226 (0.084)  0.234 (0.003)  0.168 (0.002)  0.321 (0.034)
0.4     0.119 (0.003)  0.073 (0.003)  0.282 (0.072)  0.294 (0.003)  0.214 (0.003)  0.431 (0.033)
0.5     0.136 (0.003)  0.085 (0.003)  0.304 (0.049)  0.354 (0.003)  0.259 (0.004)  0.522 (0.030)
0.6     0.154 (0.005)  0.096 (0.003)  0.422 (0.081)  0.415 (0.004)  0.305 (0.003)  0.635 (0.045)
0.7     0.171 (0.004)  0.107 (0.003)  0.473 (0.105)  0.474 (0.005)  0.350 (0.005)  0.720 (0.034)
0.8     0.186 (0.004)  0.116 (0.003)  0.479 (0.145)  0.534 (0.004)  0.397 (0.005)  0.823 (0.022)
0.9     0.207 (0.004)  0.130 (0.003)  0.518 (0.120)  0.595 (0.004)  0.442 (0.004)  0.923 (0.021)
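The first simulation design and the permutation-test baseline behind these tables can be sketched as below. This is an illustrative sketch under assumptions, not the code used for the reported results: the function names and the small default sizes are hypothetical, the shift δ is placed in the second group, and the exact permutation test enumerates all splits of the pooled sample (70 splits for n1 = n2 = 4).

```python
import random
from itertools import combinations

def permutation_p_value(a, b):
    """Exact two-sided permutation p-value for the difference in means."""
    pooled = list(a) + list(b)
    n, k = len(pooled), len(a)
    total = sum(pooled)
    observed = abs(sum(a) / k - sum(b) / (n - k))
    hits = count = 0
    for idx in combinations(range(n), k):
        s = sum(pooled[i] for i in idx)
        diff = abs(s / k - (total - s) / (n - k))
        hits += diff >= observed - 1e-12  # tolerance for rounding error
        count += 1
    return hits / count

def simulate_design_1(G=200, G_sig=100, delta=2.0, n=4, seed=1):
    """Unit-variance normal data; the first G_sig genes are shifted by delta."""
    rng = random.Random(seed)
    genes = []
    for g in range(G):
        shift = delta if g < G_sig else 0.0
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treated = [rng.gauss(shift, 1.0) for _ in range(n)]
        genes.append((control, treated))
    return genes

def naive_proportion(genes, alpha=0.05):
    """Share of genes declared significant at level alpha."""
    return sum(permutation_p_value(a, b) <= alpha for a, b in genes) / len(genes)
```

With 4 observations per group the smallest attainable two-sided p-value is 2/70, which illustrates why a thresholding approach has limited power in this setting.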
Table 2
N(0, a), a ∼ χ²(3)/3

        δ = 1                                        δ = 2
true λ  t-test         Permutation    New method     t-test         Permutation    New method
0.1     0.066 (0.001)  0.045 (0.002)  0.079 (0.072)  0.105 (0.002)  0.080 (0.003)  0.111 (0.027)
0.2     0.087 (0.004)  0.060 (0.002)  0.145 (0.096)  0.172 (0.002)  0.134 (0.002)  0.207 (0.034)
0.3     0.110 (0.002)  0.077 (0.002)  0.153 (0.040)  0.237 (0.003)  0.186 (0.003)  0.311 (0.032)
0.4     0.134 (0.004)  0.095 (0.003)  0.301 (0.069)  0.303 (0.003)  0.241 (0.004)  0.413 (0.030)
0.5     0.157 (0.003)  0.112 (0.002)  0.327 (0.062)  0.370 (0.004)  0.295 (0.003)  0.514 (0.025)
0.6     0.180 (0.004)  0.129 (0.003)  0.436 (0.119)  0.435 (0.003)  0.347 (0.004)  0.609 (0.022)
0.7     0.204 (0.003)  0.148 (0.004)  0.513 (0.138)  0.498 (0.003)  0.400 (0.004)  0.712 (0.017)
0.8     0.227 (0.004)  0.163 (0.003)  0.576 (0.138)  0.565 (0.006)  0.451 (0.005)  0.811 (0.018)
0.9     0.252 (0.003)  0.182 (0.002)  0.577 (0.116)  0.630 (0.003)  0.508 (0.005)  0.914 (0.015)
Table 3
Gamma(1, 1)

        δ = 1                                        δ = 2
true λ  t-test         Permutation    New method     t-test         Permutation    New method
0.1     0.067 (0.002)  0.053 (0.001)  0.059 (0.043)  0.108 (0.002)  0.090 (0.003)  0.126 (0.048)
0.2     0.094 (0.004)  0.075 (0.002)  0.151 (0.035)  0.177 (0.002)  0.151 (0.002)  0.232 (0.045)
0.3     0.123 (0.003)  0.099 (0.003)  0.225 (0.075)  0.246 (0.003)  0.212 (0.002)  0.310 (0.024)
0.4     0.150 (0.004)  0.120 (0.003)  0.310 (0.062)  0.313 (0.003)  0.272 (0.003)  0.417 (0.020)
0.5     0.178 (0.004)  0.143 (0.004)  0.321 (0.099)  0.381 (0.005)  0.330 (0.004)  0.515 (0.023)
0.6     0.207 (0.004)  0.168 (0.003)  0.377 (0.110)  0.450 (0.003)  0.391 (0.004)  0.613 (0.010)
0.7     0.233 (0.004)  0.190 (0.003)  0.482 (0.094)  0.521 (0.004)  0.454 (0.003)  0.712 (0.015)
0.8     0.264 (0.004)  0.213 (0.006)  0.504 (0.119)  0.588 (0.005)  0.514 (0.005)  0.802 (0.014)
0.9     0.292 (0.003)  0.237 (0.003)  0.626 (0.107)  0.657 (0.005)  0.576 (0.004)  0.912 (0.013)

Table 4
t5

        δ = 1                                        δ = 2
true λ  t-test         Permutation    New method     t-test         Permutation    New method
0.1     0.065 (0.003)  0.043 (0.002)  0.074 (0.060)  0.112 (0.002)  0.083 (0.002)  0.113 (0.030)
0.2     0.086 (0.003)  0.058 (0.002)  0.141 (0.100)  0.177 (0.002)  0.136 (0.003)  0.205 (0.013)
0.3     0.109 (0.005)  0.075 (0.004)  0.208 (0.065)  0.241 (0.002)  0.190 (0.002)  0.309 (0.028)
0.4     0.130 (0.004)  0.090 (0.003)  0.212 (0.074)  0.309 (0.005)  0.246 (0.004)  0.411 (0.027)
0.5     0.153 (0.003)  0.106 (0.003)  0.319 (0.080)  0.373 (0.003)  0.298 (0.003)  0.516 (0.022)
0.6     0.174 (0.004)  0.122 (0.003)  0.368 (0.091)  0.440 (0.003)  0.352 (0.004)  0.610 (0.016)
0.7     0.197 (0.003)  0.138 (0.004)  0.490 (0.133)  0.507 (0.004)  0.408 (0.004)  0.718 (0.027)
0.8     0.218 (0.004)  0.153 (0.003)  0.530 (0.128)  0.575 (0.005)  0.461 (0.006)  0.811 (0.017)
0.9     0.243 (0.002)  0.170 (0.003)  0.641 (0.084)  0.639 (0.005)  0.517 (0.006)  0.918 (0.013)

Table 5
pFDR for our method with Normal(0,1) error distribution

true λ  δ = 1            δ = 2
0.1     0.5471 (0.1679)  0.1963 (0.0753)
0.2     0.3768 (0.1371)  0.1924 (0.0741)
0.3     0.3823 (0.0537)  0.2416 (0.0876)
0.4     0.2372 (0.0535)  0.1533 (0.0393)
0.5     0.2372 (0.0535)  0.1215 (0.0406)
0.6     0.1924 (0.0354)  0.0965 (0.0242)
0.7     0.1486 (0.0363)  0.0841 (0.0255)
0.8     0.0860 (0.0209)  0.0601 (0.0093)
0.9     0.0482 (0.0131)  0.0465 (0.0112)
Fig 2. Example comparing our method to the Permutation and t methods. The true errors are N (0, σ 2 ), σ 2 ∼ χ2(3) /3.
Fig 3. Density estimators for the p-values obtained from two toxicology datasets.
Table 6
pFDR for our method with Normal(0, σ²), σ² ∼ χ²(3)/3 error distribution

true λ  δ = 1           δ = 2
0.1     0.634 (0.069)   0.325 (0.099)
0.2     0.480 (0.060)   0.226 (0.054)
0.3     0.375 (0.060)   0.167 (0.042)
0.4     0.323 (0.040)   0.139 (0.022)
0.5     0.233 (0.053)   0.119 (0.020)
0.6     0.185 (0.048)   0.107 (0.016)
0.7     0.102 (0.018)   0.074 (0.017)
0.8     0.094 (0.017)   0.063 (0.014)
0.9     0.047 (0.0135)  0.037 (0.0047)

Table 7
Results for the three methods applied to two real examples from toxicology

Estimated π        Mice    Mice2
t-test             0.245   0.499
Permutation test   0.220   0.443
New method         0.107   0.363
References
[1] Amaratunga, D. and Cabrera, J. (2001). Statistical analysis of viral microchip data. J. Amer. Statist. Assoc. 96 1161–1170.
[2] Amaratunga, D. and Cabrera, J. (2003). Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, New York.
[3] Amaratunga, D. and Cabrera, J. (2006). Differential expression in DNA microarray and protein array experiment. Technical Report 06-001, Department of Statistics, Rutgers University.
[4] Baldi, P. and Long, A. D. (2001). A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics 7 509–519.
[5] Broberg, P. (2003). Ranking genes with respect to differential expression. Genome Biology 4 R41.
[6] Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96 1151–1160.
[7] Fisher, R. A. (1934). Statistical Methods for Research Workers. Oxford University Press.
[8] Gene Ontology Consortium (2000). Gene ontology: Tool for the unification of biology. Nature Genet. 25 25–29.
[9] Huber, P. J. (1981). Robust Statistics. Wiley, New York.
[10] Lee, M. L. T., Kuo, F. C., Whitmore, G. A. and Sklar, J. (2000). Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proceedings of the National Academy of Sciences 97 9834–9839.
[11] Lonnstedt, I. and Speed, T. P. (2002). Replicated microarray data. Statist. Sinica 12 31–46.
[12] Newton, M. A., Kendziorski, C. M., Richmond, C. S., Blattner, F. R. and Tsui, K. W. (2001). On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. J. Comp. Biol. 8 37–52.
[13] Pavlidis, P. et al. (2004). Using the gene ontology for microarray data mining: A comparison of methods and application to age effect in human prefrontal cortex. Neurochemical Research 29 1213–1222.
[14] Raghavan, R., Amaratunga, D., Cabrera, J., Nie, A., Qin, J. and Mcmillian, M. (2006). On methods for gene function scoring as a means of facilitating the interpretation of microarray results. J. Comp. Biol. 13 798–809.
[15] Schena, M., Shalon, D., Davis, D. R. and Brown, P. O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270(5235) 467–470.
[16] Storey, J. D. and Tibshirani, R. (2001). Estimating false discovery rates under dependence, with applications to DNA microarrays. Technical Report 2001–18, Dept. Statistics, Stanford University, Stanford.
[17] Tusher, V., Tibshirani, R. and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. PNAS 98 5116–5124.
IMS Lecture Notes–Monograph Series Complex Datasets and Inverse Problems: Tomography, Networks and Beyond Vol. 54 (2007) 103–120 © Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000085
Functional analysis via extensions of the band depth
arXiv:0708.1107v1 [stat.ME] 8 Aug 2007
Sara López-Pintado1 and Rebecka Jornsten2,∗
Universidad Pablo de Olavide and Rutgers University

Abstract: The notion of data depth has long been in use to obtain robust location and scale estimates in a multivariate setting. The depth of an observation is a measure of its centrality, with respect to a data set or a distribution. The data depths of a set of multivariate observations translate to a center-outward ordering of the data. Thus, data depth provides a generalization of the median to a multivariate setting (the deepest observation), and can also be used to screen for extreme observations or outliers (the observations with low data depth). Data depth has been used in the development of a wide range of robust and non-parametric methods for multivariate data, such as non-parametric tests of location and scale [Li and Liu (2004)], multivariate rank-tests [Liu and Singh (1993)], non-parametric classification and clustering [Jornsten (2004)], and robust regression [Rousseeuw and Hubert (1999)]. Many different notions of data depth have been developed for multivariate data. In contrast, data depth measures for functional data have only recently been proposed [Fraiman and Muniz (1999), López-Pintado and Romo (2006a)]. While the definitions of both of these data depth measures are motivated by the functional aspect of the data, the measures themselves are in fact invariant with respect to permutations of the domain (i.e. the compact interval on which the functions are defined). Thus, these measures are equally applicable to multivariate data where there is no explicit ordering of the data dimensions. In this paper we explore some extensions of functional data depths, so as to take the ordering of the data dimensions into account.
∗ Corresponding author is funded by NSF Grant DMS-0306360 and EPA Star Grant RD83272101-0.
1 Departamento de Economia, Metodos Cuantitativos e Historia economica, Universidad Pablo de Olavide, Edif. n3-Jose Monino-3 planta Ctra de Utrera, Km. 1 41013 Sevilla, Spain, e-mail: [email protected]
2 Department of Statistics, Rutgers University, Piscataway, NJ 07030, USA, e-mail: [email protected]
AMS 2000 subject classifications: 62G35, 62G30.
Keywords and phrases: data depth, functional data, band depth, robust statistics.

1. Introduction

In functional data analysis, each observation is a real function $x_i$, i = 1, . . . , n, defined on a common interval in R. Functional data are observed in many disciplines, such as medicine (e.g. EEG traces), biology (e.g. gene expression time course data), economics and engineering (e.g. financial trends, chemical processes). Many multivariate methods (e.g. analysis of variance, and classification) have been extended to functional data (see Ramsay and Silverman [22]). A basic building block of such statistical analyses is a location estimate, i.e. the mean curve for a group of data objects, or data objects within a class. When analyzing functional data, outliers can affect the location estimates in many different ways, e.g. altering the shape and/or magnitude of the mean curve. Since measurements are frequently noisy, statistical analysis may thus be much improved by the use of robust location estimates, such as
the median or trimmed mean curve. Data depth provides the tools for constructing these robust estimates. We first review the concept of data depth in the multivariate setting, where data depth was introduced to generalize order statistics, e.g. the median, to higher dimensions. Given a distribution function F in $R^d$, a statistical depth assigns to each point x a real, non-negative bounded value D(x|F), which measures the centrality of x with respect to the distribution F. Given a sample of n observations X = {x1, . . . , xn}, we denote the sample version by D(x|Fn) or D(x|X). D(x|X) is a measure of the centrality of a point x with respect to the sample X (or the empirical distribution function Fn). The point x can be a sample observation, or constitute independent "test data". For x = xi ∈ X, D(xi|Fn) provides a center-outward ordering of the sample observations x1, . . . , xn. Many depth definitions have been proposed for multivariate data (e.g. Mahalanobis [19], Tukey [26], Oja [20], Liu [12], Singh [25], Fraiman and Meloche [3], Vardi and Zhang [27] and Zuo [31]). To illustrate the data depth principle and the variety of depth measures, we will briefly review two very different notions of depth: the simplicial depth of Liu [12], and the L1 depth of Vardi and Zhang [27] (a detailed discussion of the different types of data depths can be found in Liu, Parelius and Singh [14] and Zuo and Serfling [32]). To compute the simplicial depth of a point $x \in R^d$ with respect to the sample X = {x1, . . . , xn}, we start by forming the set of $\binom{n}{d+1}$ unique (d+1)-simplices with vertices in the sample. Consider the two-dimensional case illustrated in Figure 1a. We depict a subset of the $\binom{n}{3}$ 3-simplices (triangles) in $R^2$ defined by a set of objects (x1, x2, x3) ∈ X. A point x is considered deep within the sample X if many simplices contain it, and vice versa. Formally, the simplicial depth of a point x is defined as

$SD(x|X) = \binom{n}{d+1}^{-1} \sum_{1 \le i_1 < \cdots < i_{d+1} \le n} I\{x \in \mathrm{simplex}(x_{i_1}, \ldots, x_{i_{d+1}})\}$,

where I{A} is an indicator of the event A, equal to 1 if A is true and 0 otherwise. We can see from Figure 1a that the point marked with a triangle is covered by many simplices, resulting in a high depth measure, whereas the point marked with a plus attains the minimum depth measure (i.e. $\binom{n}{3}^{-1}$ if the point is a sample observation, and 0 otherwise).
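For the two-dimensional case, the simplicial depth defined above can be computed by brute force over all triangles. The sketch below is an illustration only: the function names are hypothetical, triangles are treated as closed sets (boundary points count as contained), and the cost grows like $n^3$.

```python
from itertools import combinations

def _cross(o, a, b):
    """Signed area (times 2) of the triangle o, a, b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def in_triangle(p, a, b, c):
    """True if p lies in the closed triangle with vertices a, b, c."""
    d1, d2, d3 = _cross(a, b, p), _cross(b, c, p), _cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)

def simplicial_depth(p, sample):
    """Proportion of sample triangles that contain the point p."""
    triangles = list(combinations(sample, 3))
    return sum(in_triangle(p, *t) for t in triangles) / len(triangles)
```

For the four corners of a square, a point inside the square but off the diagonals is contained in 2 of the 4 triangles, giving depth 0.5, while a point far outside has depth 0.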
Fig 1. (a) The simplicial data depth, (b) The L1 data depth.
To calculate the L1 data depth of a point x with respect to the sample X, we start by forming the unit vectors $e(x, x_i)$ that point from x to $x_i \in X$ (Figure 1b). The L1 depth of x is defined as

$LD(x|X) = 1 - \|\bar e(x)\|$, where $\bar e(x) = \frac{1}{n} \sum_{x_i \in X} e(x, x_i)$.

I.e., $\bar e(x)$ is the sample average of the unit vectors $e(x, x_i)$. If x is on the periphery of the sample, all unit vectors $e(x, x_i)$ point in an almost identical direction such that $\|\bar e(x)\| \simeq 1$, and $LD(x|X) \simeq 1 - 1 = 0$ (point marked with a plus in Figure 1b). If x is in the center of the sample, the unit vectors $e(x, x_i)$ will point in many different directions and almost cancel out in the computation of $\bar e(x)$, resulting in a high depth measure $LD(x|X) \simeq 1 - 0 = 1$ (point marked with a triangle in Figure 1b). Focusing on the case when x is a sample observation, x ∈ X, we see from the above examples that data depth can be used to rank-order the data set X, from the deepest to the least deep. We classify x with high D(x|X) as the most representative of the sample X, and x with low D(x|X) as the most extreme observations, which may be considered outliers. The deepest observation is a generalization of the median to a multivariate setting, and the center-outward ordering can be used to construct trimmed mean estimates. Robust multivariate estimates based on data depth have been used in a wide range of non-parametric analyses, such as non-parametric testing of location and/or scale (Li and Liu [11]), multivariate rank-tests (Liu and Singh [15]), non-parametric classification and clustering (Jornsten [10]), and robust regression (Rousseeuw and Hubert [23]). In this paper we discuss data depth measures for functional data. We review the band depths of López-Pintado and Romo [16] (Section 2), and propose some extensions of these depths to better address the functional characteristic of the data (Section 3). We also propose a re-sampling based estimation scheme that speeds up the data depth calculation for sample objects, which is otherwise a computationally intensive task for large data sets (Section 4). In Section 5 we compare the performance of the new notions of depth to that of the band depth, under various simulation scenarios.
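The L1 depth reviewed earlier in this section (one minus the length of the averaged unit vectors) admits a very short sketch. The handling of sample points that coincide with x is an assumption of this sketch (they are simply skipped and the average is taken over the remaining points), and the function name is hypothetical.

```python
import math

def l1_depth(x, sample):
    """One minus the norm of the average unit vector from x to the sample."""
    dim = len(x)
    total = [0.0] * dim
    used = 0
    for xi in sample:
        diff = [xi[k] - x[k] for k in range(dim)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm > 0:  # assumption: points equal to x are skipped
            for k in range(dim):
                total[k] += diff[k] / norm
            used += 1
    mean_norm = math.sqrt(sum(t * t for t in total)) / used
    return 1.0 - mean_norm
```

A point at the center of a symmetric sample gets depth 1 because the unit vectors cancel, whereas a point far from the sample gets a depth close to 0.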
While we cannot identify a depth measure that dominates all other measures across all simulation scenarios, we do find that data depths that account for the functional characteristics of the data can improve on data depths that do not have this property.

2. The band depth and the generalized band depth

In recent years, some definitions of depth for functional data have been proposed. Fraiman and Muniz [3] considered a concept of depth based on the integral of univariate depths. López-Pintado and Romo [16] introduced the notion of band depth, which is based on the graphs of the curves and the bands they delimit in the plane. Let X = {x1, . . . , xn} be a sample of continuous curves defined in the compact interval T. The graph of a function x is given by G(x) = {(t, x(t)) : t ∈ T}, where x is either an observation from the sample, or independent test data. The band in $R^2$ is the region delimited by j curves $x_{i_1}, \ldots, x_{i_j}$, and defined as

$B(x_{i_1}, x_{i_2}, \ldots, x_{i_j}) = \{(t, y) : t \in T,\ \min_{r=1,\ldots,j} x_{i_r}(t) \le y \le \max_{r=1,\ldots,j} x_{i_r}(t)\}$
$= \{(t, y) : t \in T,\ y = \alpha_t \min_{r=1,\ldots,j} x_{i_r}(t) + (1 - \alpha_t) \max_{r=1,\ldots,j} x_{i_r}(t),\ \alpha_t \in [0, 1]\}$.

Fig 2. (a) Band determined by two curves, (b) Band determined by two curves, where the curves cross, and (c) Band determined by three curves.
Here, the j curves $x_{i_1}, \ldots, x_{i_j}$ are chosen from the sample x1, . . . , xn. A sample of size n can thus generate $\binom{n}{j}$ possible bands. Figures 2a and 2b show examples of bands determined by two curves. In Figure 2a the curves form a band that has a non-zero width across the entire compact interval T. In Figure 2b the curves cross, and the band is degenerate (has width 0) at the points of crossing. In Figure 2c we depict a band determined by three curves.

Definition 1. Given a sample of curves x1, . . . , xn, the band depth (BD) of any curve x is

(2.1) $S_{n,J}(x|X) = \sum_{j=2}^{J} S_n^{(j)}(x|X), \quad J \ge 2$,

where

(2.2) $S_n^{(j)}(x|X) = \binom{n}{j}^{-1} \sum_{1 \le i_1 < i_2 < \cdots < i_j \le n} I\{G(x) \subset B(x_{i_1}, x_{i_2}, \ldots, x_{i_j})\}, \quad j \ge 2$.
That is, the band depth of object x is defined as the proportion of bands delimited by j curves (B(xi1 , xi2 , . . . , xij )) containing the graph of x. In a multivariate setting, the band depth resembles the simplicial depth of Liu [12]. In fact, in R2 the band B(xi1 , xi2 , . . . , xij ) corresponds to the smallest rectangle with sides parallel to the axes that contain j objects xi1 , . . . , xij , compared with a triangle (3-simplex) used in the definition of the simplicial depth. In a functional data setting, the intuition behind the band depth is as follows: If a curve x has a shape that differs from sample curves xi ∈ X, then few bands can contain it, and vice versa. Thus, a curve that is not representative of the sample will be associated with a low band depth value, and a representative curve will be associated with a high band depth value. The median curve defined by the band depth is xdeepest = argmax Sn,J (xi |X). For J = 2 the band depth is simple and fast to compute. However, there are some practical limitations: The curves xi1 , xi2 that determine the band B(xi1 , xi2 ) may cross (Figure 2b). At the point of crossing tc , the band B(xi1 , xi2 ) is degenerate. Unless a curve x coincides with xi1 and xi2 at the cross-point tc , curve x is not contained in the band. If many curves in the sample cross, most bands B(xi1 , xi2 ) will not contribute to the band depth measure (equation 2.2). This generates many ties between data objects, and may thus result in a non-unique median curve and center-outward rank-order of the data objects. This limits the practical
use of the band depth to construct robust estimates for functional analysis, e.g. non-parametric testing or classification. If we use J = 3, i.e. let 3 curves define the delimiting band $B(x_{i_1}, x_{i_2}, x_{i_3})$, we reduce the impact of curves that cross, and reduce the number of ties. However, the band depth with J > 2 is computationally intensive to work with since the number of unique bands grows at rate $n^J$. In addition, the band delimited by 3 data objects (Figure 2c) does not provide the same intuition as the band delimited by 2 objects (Figure 2a). The shape of the band delimited by J = 3 objects may differ substantially from the individual objects. Thus, if curve x is contained in band $B(x_{i_1}, x_{i_2}, x_{i_3})$, this does not necessarily imply that x is similar to any of the objects $x_{i_1}, x_{i_2}, x_{i_3}$. When the curves are very irregular, the band depth can be too restrictive. Few bands will fully contain a data object. This will again result in too many ties between data objects, and a poorly defined center-outward ranking. A more flexible notion of depth (the generalized band depth) was therefore proposed in López-Pintado and Romo [16]. It is obtained by replacing the indicator function in definition (2.2) by the proportion of time the curve x is inside a corresponding band. Given the sample x1, . . . , xn, let for any curve x

$A(x; x_{i_1}, x_{i_2}, \ldots, x_{i_j}) = \{t \in T : \min_{r=i_1,\ldots,i_j} x_r(t) \le x(t) \le \max_{r=i_1,\ldots,i_j} x_r(t)\}, \quad j \ge 2$,

be the set of points in the interval T where the function x is inside the band delimited by $x_{i_1}, x_{i_2}, \ldots, x_{i_j}$. If λ is the Lebesgue measure in R, $\lambda(A_j(x))/\lambda(T)$ is the proportion of time that x is inside the band. We define

(2.3) $GS_n^{(j)}(x|X) = \binom{n}{j}^{-1} \sum_{1 \le i_1 < i_2 < \cdots < i_j \le n} \frac{\lambda(A(x; x_{i_1}, x_{i_2}, \ldots, x_{i_j}))}{\lambda(T)}, \quad j \ge 2$,

as a generalized version of $S_n^{(j)}(x|x_1, \ldots, x_n)$.

Definition 2. Given a sample of curves x1, . . . , xn, the generalized band depth (GBD) of a curve x is

(2.4) $GS_{n,J}(x|X) = \sum_{j=2}^{J} GS_n^{(j)}(x|X), \quad J \ge 2$.
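For J = 2 and curves observed on a common grid, the band depth of Definition 1 and the generalized band depth of Definition 2 can be sketched together as follows. This is a minimal sketch under assumptions: curves are equal-length lists of values, the Lebesgue proportion $\lambda(A)/\lambda(T)$ is approximated by the share of grid points inside the band, and the function name is hypothetical.

```python
from itertools import combinations

def band_depths(x, sample):
    """Return (BD, GBD) of the curve x with respect to the sample, J = 2."""
    pairs = list(combinations(sample, 2))
    bd = gbd = 0.0
    for a, b in pairs:
        # inside[t] is True when x lies between the two curves at grid point t
        inside = [min(u, v) <= y <= max(u, v) for y, u, v in zip(x, a, b)]
        bd += all(inside)                  # indicator of full containment
        gbd += sum(inside) / len(inside)   # proportion of time inside the band
    return bd / len(pairs), gbd / len(pairs)
```

A curve that leaves every band at a single grid point has band depth 0 but keeps a generalized band depth close to 1, which is the behavior discussed in the text.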
A curve x may be representative of the sample X, with the exception of an isolated set of points t ∈ T . The generalized band depth assigns a high depth measure to such observations, whereas the band depth ascribes a minimum depth value. In addition, for J = 2 and irregular data, the delimiting objects of a band may cross. While for the band depth, such bands do not contribute to the computation of the depth measures, the generalized band depth will simply down-weigh the contribution of those bands by the proportion of cross-points tc in T . The GBD thus computes the data depth by considering each data dimension t separately, similar to the data depth proposed by Fraiman and Muniz [3]. 3. Extensions of the band depth The definitions of the band depth and the generalized band depth allow for delimiting objects of a band to cross. We saw that this poses a problem when applying the
band depth to noisy and irregular curve data, creating too many ties between the sample curves. Moreover, by allowing the delimiting objects to cross, we lose some of the intuition behind the band depth: the delimiting objects should gather curves of similar shape within the band, such that curves that are contained in the most bands are the most representative of the sample. If delimiting objects are allowed to cross, the shape of the band may in fact differ substantially from the shapes of the sample objects.

In addition, while the generalized band depth allows for excursions outside a band, it makes no distinction between randomly scattered excursions and excursions over a contiguous region. Clearly, however, single excursions at random points on the compact interval T are not as informative as an excursion that persists across a set of consecutive points. The latter is more likely to demonstrate that the shape of a curve x differs from those of the band delimiting objects $x_{i_1}, x_{i_2}$.

From the above discussion, and the definitions in Section 2, we see that the band depth (BD) and generalized band depth (GBD) are both invariant with respect to permutations of the data dimensions, i.e. permutations of t ∈ T. This raises the question of whether the performance of these depths can be improved via extensions of the depth measures that take the explicit ordering of the data dimensions for functional data into account. We will explore such extensions of the band depth and generalized band depth in Sections 3.1 and 3.2. In what follows, we focus on the band depth where J = 2 delimiting objects form each band, though generalizations to J > 2 can be made at the expense of an increased computational burden.

3.1. The corrected (generalized) band depth

We begin by revisiting the definition of a delimiting band. Ideally, we want each band to gather curves of similar shape inside it.
To achieve this, we make the following adjustments to the definition of a band: If two curves cross, the band will be defined only where one of the curves is the upper curve and the other one is the lower curve. Hence, for each pair of curves that cross, there are two possible bands (depending on which curve is considered the upper curve). We will choose the longest of the two bands. The notions of corrected band depth and its generalized version are defined similarly to the band depth and the generalized band depth, but with the band re-defined as described above.

We first define the corrected band depth (cBD). Let
\[ a(i_1, i_2) = \{ t : x_{i_2}(t) - x_{i_1}(t) \ge 0 \} \quad \text{and} \quad L_{i_1,i_2} = \frac{\lambda(a(i_1, i_2))}{\lambda(T)} \]
($\lambda$ is the Lebesgue measure). By exchanging the roles of the lower ($x_{i_1}$) and upper ($x_{i_2}$) curves we obtain $L_{i_2,i_1}$ in a similar fashion. We define the corrected band $B_c$ as
\[ B_c = I\{L_{i_1,i_2} \ge 1/2\}\, B_c(x_{i_1}, x_{i_2}) + I\{L_{i_2,i_1} > 1/2\}\, B_c(x_{i_2}, x_{i_1}), \]
where
\[ B_c(x_{i_1}, x_{i_2}) = \{ (t, y) : t \in a(i_1, i_2),\ x_{i_1}(t) \le y \le x_{i_2}(t) \}, \]
and similarly for $B_c(x_{i_2}, x_{i_1})$. We also form the corrected graph $G(x^*)$ as
\[ G(x^*) = \begin{cases} \{(t, x(t)) : t \in a(i_1, i_2)\}, & \text{if } L_{i_1,i_2} \ge 1/2, \\ \{(t, x(t)) : t \in a(i_2, i_1)\}, & \text{if } L_{i_2,i_1} > 1/2. \end{cases} \]
We can now define the corrected band depth of a curve x with respect to the sample X as
\[ cBD(x|X) = \binom{n}{2}^{-1} \sum_{1 \le i_1 < i_2 \le n} \max(L_{i_1,i_2}, L_{i_2,i_1})\, I\{G(x^*) \subseteq B_c\}. \tag{3.1} \]
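The construction of the corrected band and definition (3.1) can be sketched in Python as follows. This is a minimal sketch for curves on a common grid, with the Lebesgue measure replaced by a count of grid points (an assumption). When both $L_{i_1,i_2} \ge 1/2$ and $L_{i_2,i_1} > 1/2$ hold because the curves coincide at some points, the sketch gives precedence to the first case; the text does not spell out this tie rule, so treat it as a hypothetical choice. All names are illustrative.

```python
from itertools import combinations


def corrected_band(a, b):
    """Return (grid indices, lower curve, upper curve, weight) of the corrected band."""
    m = len(a)
    idx_ab = [k for k in range(m) if b[k] - a[k] >= 0]   # a(i1, i2): b on top
    idx_ba = [k for k in range(m) if a[k] - b[k] >= 0]   # a(i2, i1): a on top
    l_ab, l_ba = len(idx_ab) / m, len(idx_ba) / m
    weight = max(l_ab, l_ba)
    if l_ab >= 0.5:                  # assumed precedence when both cases apply
        return idx_ab, a, b, weight
    return idx_ba, b, a, weight      # l_ab < 1/2 implies l_ba > 1/2


def corrected_band_depth(x, sample):
    """cBD: weighted share of corrected bands containing x on the band's own domain."""
    pairs = list(combinations(sample, 2))
    total = 0.0
    for a, b in pairs:
        idx, lower, upper, weight = corrected_band(a, b)
        if all(lower[k] <= x[k] <= upper[k] for k in idx):
            total += weight
    return total / len(pairs)


# Two curves that cross at the last grid point; x is only judged on the first three.
a = [0.0, 0.0, 0.0, 3.0]
b = [1.0, 1.0, 1.0, 2.0]
x = [0.5, 0.5, 0.5, 5.0]
depth = corrected_band_depth(x, [a, b])   # 0.75, the weight of the single band
```

The band contributes its weight 3/4 rather than 1, which is the down-weighting that (3.1) applies to crossing delimiters.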
The term $\max(L_{i_1,i_2}, L_{i_2,i_1})$ acts as a weight. cBD will thus down-weight the contribution of bands delimited by curves that cross, but not as drastically as BD, where such bands would not contribute at all. In the simulation study (Section 5) we see that cBD can substantially improve on the BD when the data is contaminated by curves with different shapes from the rest of the data.

To allow for random excursions of a curve x outside the corrected band $B_c$, as is likely to happen with noisy data and irregular curves, we also propose the corrected generalized band depth as a more flexible alternative. We define the cGBD of a curve x with respect to the sample X as
\[ cGBD(x|X) = \binom{n}{2}^{-1} \sum_{1 \le i_1 < i_2 \le n} \frac{\lambda(A_c(x; x_{i_1}, x_{i_2}))}{\lambda(T)}, \tag{3.2} \]
where
\[ A_c(x; x_{i_1}, x_{i_2}) = \begin{cases} \{ t \in a(i_1, i_2) : x_{i_1}(t) \le x(t) \le x_{i_2}(t) \}, & \text{if } L_{i_1,i_2} \ge 1/2, \\ \{ t \in a(i_2, i_1) : x_{i_2}(t) \le x(t) \le x_{i_1}(t) \}, & \text{if } L_{i_2,i_1} > 1/2. \end{cases} \]
Hence, the difference between the corrected (generalized) band depth and the (generalized) band depth is that the band is modified in order to consider only the proportion of the domain where the delimiting curves define a contiguous region of non-zero width.

3.2. GBDI and GBDO: accounting for consecutive band excursions

If a curve is only partially contained in a delimiting band, it may still share many characteristics with the delimiting objects. However, this similarity may not be optimally measured by the number of excursions outside the band, as with GBD, but perhaps by how these excursions present themselves. We therefore propose two alternative definitions of the generalized band depth called GBDI and GBDO.

We propose GBDI as a more conservative alternative to GBD. We replace $\lambda(A(x; x_{i_1}, x_{i_2}))$ (the number of t ∈ T where $x \in B(x_{i_1}, x_{i_2})$) in equation (2.3) by a measure $CI(x; x_{i_1}, x_{i_2})$, where
\[ CI(x; x_{i_1}, x_{i_2}) = \max_{t_S} \Big\{ \lambda(t_S) : \min_{r=i_1,i_2} x_r(t) \le x(t) \le \max_{r=i_1,i_2} x_r(t),\ \forall t \in t_S \Big\}, \]
where $t_S$ is a compact interval. That is, $CI(x; x_{i_1}, x_{i_2})$ is the longest consecutive stretch for which x is inside the band delimited by $x_{i_1}, x_{i_2}$ (Consecutive Inside). Given sample functions $x_1, x_2, \ldots, x_n$, the GBDI of a curve x is
\[ GBD_I(x|X) = \binom{n}{2}^{-1} \sum_{1 \le i_1 < i_2 \le n} \frac{CI(x; x_{i_1}, x_{i_2})}{\lambda(T)}. \tag{3.3} \]
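On a discrete grid, the Consecutive Inside measure reduces to the longest run of grid points at which the curve lies inside the band. The sketch below assumes a common, equally spaced grid and approximates interval length by a count of consecutive grid points; the helper names are hypothetical.

```python
from itertools import combinations


def longest_run(flags):
    """Length of the longest block of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def gbd_inside(x, sample):
    """GBDI sketch: average of (longest run inside the band) / (grid size)."""
    pairs = list(combinations(sample, 2))
    total = 0.0
    for a, b in pairs:
        inside = [min(u, v) <= xt <= max(u, v) for xt, u, v in zip(x, a, b)]
        total += longest_run(inside) / len(x)
    return total / len(pairs)


# One band (two flat curves) and a curve with a single spike at the third point.
sample = [[0.0] * 6, [1.0] * 6]
spiky = [0.5, 0.5, 2.0, 0.5, 0.5, 0.5]
depth = gbd_inside(spiky, sample)   # longest inside run is 3 of 6 points: 0.5
```

For the same curve, the GBD proportion would be 5/6, so a single spike costs far more under GBDI than under GBD.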
With GBDI, only the longest non-contaminated portion of a curve x contributes to its depth value. If a curve x weaves in and out of a band, the GBD may still allot a high depth value to x, whereas the GBDI requires that the curve x resides within the band over a large compact set.

We further propose GBDO, where we penalize band excursions that are consecutive. We view a long stretch of a band excursion as evidence that the curve x is not similar to the band delimiting objects $x_{i_1}, x_{i_2}$. Thus, in terms of the total number of band excursions, GBDO is less conservative than GBD. GBDO penalizes deviations that are persistent, and can thus serve as an indicator that the functional characteristic of curve x differs from the band delimiting objects. Similar to the GBDI, we start by replacing the GBD band measure $\lambda(A(x; x_{i_1}, x_{i_2}))$ by an alternative measure $CO(x; x_{i_1}, x_{i_2})$, where
\[ CO(x; x_{i_1}, x_{i_2}) = \max_{t_S} \Big\{ \lambda(t_S) : \min_{r=i_1,i_2} x_r(t) > x(t) \ \text{or} \ x(t) > \max_{r=i_1,i_2} x_r(t),\ \forall t \in t_S \Big\}, \]
and $t_S$ again denotes a compact set on the interval T. That is, $CO(x; x_{i_1}, x_{i_2})$ is the longest contiguous region of the curve outside the delimiting band (Consecutive Outside). The GBDO of curve x with respect to sample $x_1, \ldots, x_n$ is defined as
\[ GBD_O(x|X) = \binom{n}{2}^{-1} \sum_{1 \le i_1 < i_2 \le n} \Big( 1 - \frac{CO(x; x_{i_1}, x_{i_2})}{\lambda(T)} \Big). \tag{3.4} \]
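A grid-based sketch of definition (3.4) follows, again assuming a common, equally spaced grid with lengths approximated by counts of consecutive grid points (function names are illustrative). The two test curves spend the same total time outside the band, yet receive different GBDO values.

```python
from itertools import combinations


def longest_run(flags):
    """Length of the longest block of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def gbd_outside(x, sample):
    """GBDO sketch: average of 1 - (longest run outside the band) / (grid size)."""
    pairs = list(combinations(sample, 2))
    total = 0.0
    for a, b in pairs:
        outside = [xt < min(u, v) or xt > max(u, v) for xt, u, v in zip(x, a, b)]
        total += 1.0 - longest_run(outside) / len(x)
    return total / len(pairs)


sample = [[0.0] * 6, [1.0] * 6]
weaving = [0.5, 2.0, 0.5, 2.0, 0.5, 2.0]      # three isolated excursions
persistent = [2.0, 2.0, 2.0, 0.5, 0.5, 0.5]   # one excursion of length three
d_weaving = gbd_outside(weaving, sample)        # 1 - 1/6
d_persistent = gbd_outside(persistent, sample)  # 1 - 3/6 = 0.5
```

Both curves have a GBD proportion of 1/2, so only GBDO separates the isolated spikes from the persistent shift.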
With the depth measure GBDO, a curve x is penalized if the contamination is persistent ($CO(x; x_{i_1}, x_{i_2})$ is large for many bands $B(x_{i_1}, x_{i_2})$), whereas random and isolated excursions are largely ignored. Thus, unlike GBD, GBDO makes a distinction between noise contaminations (random “spikes”) and shape contaminations.

4. Fast computation via data re-sampling

The computational cost of calculating the data depth of objects in a sample $X = \{x_1, \ldots, x_n\}$ grows with the sample size at rate $n^J$. If the number of curves in the sample of interest is large, the computational burden of data depth based methods puts a serious limit on their applicability. Moreover, non-parametric testing, clustering and classification frequently involve iterative procedures. For example, Li and Liu [11] use the bootstrap to compute the sampling distribution of the non-parametric test statistic under the null; the clustering method of Jornsten [10] iterates between updating the cluster center (deepest object) and the cluster allocation, until convergence. For these computationally challenging non-parametric and robust analyses to be competitive with standard approaches, in a practical sense, we need to make each data depth estimation step fast and efficient. Therefore, we propose a simple method for computing the depth of each curve in a sample based on re-sampling. This re-sampling based data depth calculation is also applicable to multivariate data, and can easily be adapted to other notions of depth.

We begin by dividing the sample $x_1, \ldots, x_n$ into K randomly selected and roughly equal size parts. We refer to the data parts as $X_1, \ldots, X_K$, where each $X_k$ is a set of roughly n/K objects. For each curve x, we obtain independent depth measures with respect to each data part: $D(x|X_1), \ldots, D(x|X_K)$,
Fig 3. Re-sampling based rank versus full data rank using (a) BD and (b) cBD.
where D refers to any of the depth measures discussed in this paper (e.g. GBDI). We finally define the re-sampling based depth of a curve x as
\[ D_r(x|X) = \frac{1}{K} \sum_{k=1}^{K} D(x|X_k). \tag{4.1} \]
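The re-sampling scheme in (4.1) can be sketched as below. The depth function used here is a grid-based GBD with J = 2, chosen only for illustration; any depth with the signature `depth(x, sample)` could be passed instead. The seed argument and the way the parts are formed (shuffling indices and taking every K-th one) are assumptions, since the text only asks for random parts of roughly equal size. Each part must hold at least two curves for J = 2.

```python
import random
from itertools import combinations


def gbd(x, sample):
    """Generalized band depth (J = 2) on a common grid, counting grid points."""
    pairs = list(combinations(sample, 2))
    total = 0.0
    for a, b in pairs:
        inside = sum(min(u, v) <= xt <= max(u, v) for xt, u, v in zip(x, a, b))
        total += inside / len(x)
    return total / len(pairs)


def resampled_depth(x, sample, depth, n_parts, seed=0):
    """Average of depth(x, part) over n_parts random, roughly equal parts."""
    order = list(range(len(sample)))
    random.Random(seed).shuffle(order)
    parts = [[sample[i] for i in order[k::n_parts]] for k in range(n_parts)]
    return sum(depth(x, part) for part in parts) / n_parts


sample = [[float(level)] * 4 for level in range(12)]   # 12 flat toy curves
x = [5.5] * 4
full = gbd(x, sample)                                  # all C(12, 2) = 66 bands
approx = resampled_depth(x, sample, gbd, n_parts=3)    # 3 * C(4, 2) = 18 bands
```

With J = 2, n = 150 and K = 10, the full computation visits C(150, 2) = 11175 bands per curve, whereas the re-sampled version visits 10 × C(15, 2) = 1050.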
Using simulated data, we investigate the feasibility of replacing the data depth with the re-sampling based version. We generate 150 curves $x_i(t)$ from the model $x_i(t) = 4t + e_i(t)$, $1 \le i \le n$, where $e_i(t)$ is a sample from a Gaussian stochastic process with zero mean and covariance function $\gamma_1(s, t) = \exp\{-|t - s|^2\}$. We compute the depth-based ranks for the sample curves, and the corresponding re-sampling based ranks. In Figures 3, 4 and 5 we compare the rank-orders induced by the re-sampling based method to the rank-orders obtained with the full data. For each of b = 1, ..., B = 50 simulated data sets of size n = 150, we obtain the ranks of each of the i = 1, ..., n data objects from the full data: $D^b(X) = \{D^b(x_1|X), \ldots, D^b(x_n|X)\}$.
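Curves of the kind described above can be simulated roughly as follows. This sketch discretizes the process on a finite grid and draws from the resulting multivariate normal distribution with a plain Cholesky factorization; the text does not say how the authors simulated the process, so this is a swapped-in technique. The grid of 30 points, the small diagonal jitter added for numerical stability, and all names are assumptions.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    m = len(matrix)
    low = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = matrix[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def simulate_curves(n, grid, mean_fn, power, seed=0, jitter=1e-8):
    """Draw n curves mean_fn(t) + e(t) with Cov(e(s), e(t)) = exp(-|t - s|**power)."""
    m = len(grid)
    cov = [[math.exp(-abs(s - t) ** power) for t in grid] for s in grid]
    for k in range(m):
        cov[k][k] += jitter          # assumed stabilizer, not part of the model
    low = cholesky(cov)
    rng = random.Random(seed)
    curves = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        noise = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(m)]
        curves.append([mean_fn(t) + e for t, e in zip(grid, noise)])
    return curves


grid = [k / 30 for k in range(1, 31)]
curves = simulate_curves(150, grid, lambda t: 4 * t, power=2.0)
```

Changing `power` gives the other covariance functions used in the simulations, under the same discretization assumption.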
Fig 4. Re-sampling based rank versus full data rank using (a) GBD, (b) cGBD.
Fig 5. Re-sampling based rank versus full data rank using (a) GBDI , and (b) GBDO .
The rank-order is $i_b(1), \ldots, i_b(n)$. We partition the data into K = 10 parts, and compute the re-sampling based ranks: $D_r^b(X) = \{D_r^b(x_1|X), \ldots, D_r^b(x_n|X)\}$. We sort the re-sampling based ranks using the full data rank order to obtain $D_r^{b,*}(X) = \{D_r^b(x_{i_b(1)}|X), \ldots, D_r^b(x_{i_b(n)}|X)\}$. If $D_r^{b,*}(X) \simeq \{1, \ldots, n\}$, the re-sampling based ranks closely agree with the full data ranks. In the figures we plot the mean of the re-sampling based ranks for the 50 simulations against $\{1, \ldots, n\}$. We also depict the standard deviations as vertical bars in the figures. If the re-sampling based depth estimates are competitive with the full data based estimates, we expect the $D_r^{b,*}(X)$ to fall on the line y = x in each plot.

The simulations confirm that, with the exception of the band depth, BD, all re-sampling based depth rank-orders agree closely with the full data depth rank-orders. The BD re-sampling based estimates are not accurate for the deepest set of observations since many ties are generated with a smaller sample of size roughly n/K. Using the corrected band depth, cBD, alleviates this problem. For the more extreme observations, the BD and cBD rank-orders fall on the y = x line. The GBD, cGBD, GBDI and GBDO all show a strong agreement between the full data and re-sampling based depth rank-orders. In addition, the standard deviations are quite small. We obtain similar results in several simulations, including both functional data and multivariate data (omitted to conserve space). We thus conclude that the ranks obtained from re-sampling based depths are nearly equivalent to the ranks obtained using the full data to compute the depths. This justifies using the computationally faster re-sampling based versions of the depths in practical applications.

5. Simulations and illustrations

In this section we compare our new notions of depth to the band depth and generalized band depth proposed by López-Pintado and Romo [16], on several simulated data sets.
In each simulation, we generate curves from a base model. We then randomly contaminate the data set with six different types of contaminations. The contamination types include those previously analyzed by Fraiman and Muniz [4] and by López-Pintado and Romo [16]. The base model is $X_i(t) = g(t) + e_i(t)$, $1 \le i \le n$, where $e_i(t)$ is a Gaussian stochastic process with zero mean and covariance function $\gamma_1(s, t) = \exp\{-|t - s|^{1.5}\}$
and $g(t) = 4t$, with $t \in [0, 1]$. There are many different ways of defining an outlier or a contamination within a sample of curves. For instance, a curve could be very distant from the mean (magnitude outlier) or have a pattern different from the other curves, e.g. decreasing when the remaining ones are increasing, or very irregular in a set of smooth curves (shape outlier). We describe the six different contamination scenarios below (illustrated in Figure 6).

Model 0. All curves come from the base model.

Model 1. An asymmetric contamination is included in Model 1: $Y_i(t) = X_i(t) + c_i M$, $1 \le i \le n$, where $c_i$ is 1 with probability q and 0 with probability 1 − q; M is the contamination size constant.

Model 2. In Model 2 the contamination is symmetric: $Y_i(t) = X_i(t) + c_i \sigma_i M$, $1 \le i \le n$, where $c_i$ and M are defined as in Model 1 and $\sigma_i$ is a sequence of random variables independent of $c_i$ taking values 1 and −1 with probability 1/2.

Model 3. A partial contamination constitutes Model 3: $Y_i(t) = X_i(t) + c_i \sigma_i M$ if $t \ge T_i$, $1 \le i \le n$, and $Y_i(t) = X_i(t)$ if $t < T_i$, where $T_i$ is a random number generated from a uniform distribution on [0, 1].

Model 4. Model 4 is contaminated by peaks: $Y_i(t) = X_i(t) + c_i \sigma_i M$ if $T_i \le t \le T_i + l$, $1 \le i \le n$, and $Y_i(t) = X_i(t)$ if $t \notin [T_i, T_i + l]$, where $l = 2/30$ and $T_i$ is a random number from a uniform distribution on $[0, 1 - l]$. The contamination only occurs for a short subinterval of length l.

Model 5. We also propose a new model for comparison, where the contamination is like the one in Model 4, but it will occur at k different points uniformly distributed in the domain. Specifically, the model is $Y_i(t) = X_i(t) + c_i \sigma_i M$ if $t \in \{T_1, \ldots, T_k\}$, and $Y_i(t) = X_i(t)$ otherwise, where $\{T_1, \ldots, T_k\}$ are k random numbers uniformly chosen from the interval [0, 1].

Model 6. In Model 6 we consider shape contaminations (López-Pintado and Romo [16]). To include shape outliers, we use the covariance function structure proposed in Wood and Chan [29], $\gamma(s, t) = k \exp\{-c|t - s|^{\mu}\}$, with $s, t \in [0, 1]$, and $k, c, \mu > 0$. Different values of k, c and $\mu$ change the shape of the generated functions. For example, increasing $\mu$ and k makes the curves smoother, whereas increasing c makes the curves more irregular. Model 6 is a mixture of $X_i(t) = g(t) + e_{1i}(t)$, $1 \le i \le n$, with $g(t) = 4t$ and $e_{1i}(t)$ a Gaussian stochastic process with zero
mean and covariance function $\gamma_1(s, t) = \exp\{-|t - s|^2\}$ and $Y_i(t) = g(t) + e_{2i}(t)$, $1 \le i \le n$, with $e_{2i}(t)$ a Gaussian process with zero mean and covariance function $\gamma_2(s, t) = \exp\{-|t - s|^{0.2}\}$. The contaminated Model 6 is $Z_i(t) = (1 - \varepsilon) X_i(t) + \varepsilon Y_i(t)$, $1 \le i \le n$, where $\varepsilon$ is a Bernoulli variable Be(q) and q is a small contamination probability; thus, we contaminate a sample of smooth curves from $X_i(t)$ with more irregular curves from $Y_i(t)$.

We analyze the performance of different notions of depth in terms of robustness. The notions of depth considered are: the band depth with J = 2 and J = 3 (BD2, BD3), the generalized band depth (GBD), the corrected version of the band depth and generalized band depth (cBD, cGBD), and the generalized band depths that account for consecutive band excursions, GBDI and GBDO. We compare the mean and the α-trimmed mean, given by
\[ \hat{\mu}_n(t) = \frac{\sum_{i=1}^{n} Y_i(t)}{n} \quad \text{and} \quad \hat{m}_n^{\alpha}(t) = \frac{\sum_{i=1}^{n-[n\alpha]} Y_{(i)}(t)}{n - [n\alpha]}, \]
where $Y_{(1)}, Y_{(2)}, \ldots, Y_{(n)}$ is the sample ordered from the deepest to the most extreme (least deep) curve and $[n\alpha]$ is the integer part of $n\alpha$. The rank-orders are computed using the re-sampling based depths. For each model, we consider R = 200 replications, each generating n = 150 curves, with contamination probability q = 0.1 and contamination constant M = 25. The integrated errors (evaluated at V = 30 equally spaced points in [0, 1]) for each replication j are
\[ EI_{\mu}(j) = \frac{1}{V} \sum_{k=1}^{V} \big[ \hat{\mu}_n(k/V) - g(k/V) \big]^2 \quad \text{and} \quad EI_D^{\alpha}(j) = \frac{1}{V} \sum_{k=1}^{V} \big[ \hat{m}_n^{\alpha}(k/V) - g(k/V) \big]^2, \]
where D refers to one of the data depths (BD2, BD3, cBD, GBD, cGBD, GBDI or GBDO). All methods are applied to each simulated data set. Across simulated data sets we see a lot of variability since the contaminations are random. We therefore summarize the results as follows: (1) For each modeling scenario and each simulated data set j, we compute the minimum integrated error across all methods,
\[ EI_*^{\alpha}(j) = \min_{D = BD2, BD3, cBD, GBD, cGBD, GBD_I, GBD_O} EI_D^{\alpha}(j); \]
(2) We adjust the integrated errors by subtracting the minimum value, such that the best method has adjusted integrated error 0, $EAI_D^{\alpha}(j) = EI_D^{\alpha}(j) - EI_*^{\alpha}(j)$. We summarize the simulation results in terms of the mean and standard deviation of the adjusted integrated errors $EAI_D^{\alpha}$.

We will first examine each simulation model separately, and then discuss the overall results at the close of the section.

Model 0 generates uncontaminated data. From Table 1 we see that the mean is the best estimate, as expected. The generalized band depths (GBD, cGBD, GBDI and GBDO) perform better than the band depths (BD2, BD3 and cBD) in this setting, but all of the robust estimates perform reasonably well on uncontaminated data. Some loss of estimation efficiency is unavoidable since a fixed trimming factor α = 0.2 was used.

Simulation settings 1 through 3 correspond to simple contaminations, i.e. a positive or negative mean offset, or a partial mean offset contamination. In this
Table 1. Simulation results for the seven modeling scenarios. Mean and standard deviations (in parentheses) of the adjusted integrated errors (subtracting the integrated error of the winning method for each simulation), with R = 200 replications, q = 0.1 and α = 0.2.

Model  Mean           BD2            BD3            cBD            GBD            cGBD           GBDI           GBDO
M0     0.002 (0.003)  0.007 (0.009)  0.007 (0.009)  0.009 (0.012)  0.005 (0.005)  0.004 (0.005)  0.005 (0.005)  0.005 (0.007)
M1     6.600 (3.319)  4.864 (3.832)  3.916 (3.189)  5.353 (4.137)  0.085 (0.286)  0.117 (0.381)  0.260 (0.619)  0.001 (0.003)
M2     0.463 (0.606)  0.286 (0.395)  0.163 (0.227)  0.277 (0.365)  0.005 (0.008)  0.005 (0.008)  0.007 (0.010)  0.003 (0.007)
M3     0.204 (0.278)  0.204 (0.278)  0.053 (0.119)  0.091 (0.274)  0.050 (0.074)  0.048 (0.075)  0.055 (0.072)  0.060 (0.090)
M4     0.036 (0.027)  0.021 (0.019)  0.012 (0.014)  0.006 (0.011)  0.047 (0.034)  0.046 (0.034)  0.043 (0.031)  0.049 (0.036)
M5     0.063 (0.025)  0.036 (0.028)  0.014 (0.018)  0.003 (0.008)  0.074 (0.034)  0.074 (0.033)  0.043 (0.026)  0.083 (0.037)
M6     0.012 (0.012)  0.008 (0.010)  0.005 (0.005)  0.005 (0.010)  0.013 (0.010)  0.013 (0.010)  0.003 (0.006)  0.022 (0.022)
setting we expect only marginal gains from the use of shape sensitive or restrictive depths such as cBD and GBDI. On the other hand, the data is noisy so we expect that cGBD and GBDO should perform well.

For the case of asymmetric contamination (Model 1), we find that GBDO outperforms the other methods. The trimming factor is α = 0.2, while the contamination probability is q = 0.1. Thus, for many of the simulated data sets, some of the uncontaminated curves will be trimmed. The band depths (BD2, BD3 and cBD) all struggle in this setting. The uncontaminated curves frequently cross, leading to too many ties in the rank-order. While the GBD, cGBD and GBDI perform much better than the band depths, they are not competitive with GBDO. The source of the problem lies in the asymmetry of the contaminations, and in the fact that the magnitude outliers contribute to the computation of the depth values of all other observations (see Figure 6). Since GBD, cGBD and GBDI are more restrictive than GBDO, these three depth measures trim more curves from the lower portion of the uncontaminated set than from the upper portion, creating a bias in the trimmed mean estimate. This suggests that perhaps an iterative procedure, where outliers are dropped one at a time and the data depths re-estimated after each step, would perform better.

As expected, when the contamination is symmetric (Model 2), GBD, cGBD and GBDI perform almost as well as GBDO. The trimming of the uncontaminated data set is now mostly symmetric, and the trimmed mean estimates essentially unbiased. The band depths are not competitive.

Model 3 generates data objects that have been partially contaminated. This symmetric contamination is easily identified by GBD, cGBD, GBDI and GBDO.

Simulation settings 4 through 6 can loosely be seen as shape contaminations (Figure 6). Here, we expect the cBD and GBDI to perform well, whereas the performances of the less restrictive GBD, cGBD and GBDO may deteriorate.
Indeed, for Model 4 (peak contamination), cBD is the best method, followed by BD3 and BD2. cBD improves on the band depths since the band correction accounts for curves that cross. The generalized band depths (GBD, cGBD, GBDI and GBDO ) do not perform well in this setting, even occasionally resulting in an integrated error exceeding that of the mean.
Fig 6. Curves generated from Model 1: asymmetric contamination (top left), Model 2: symmetric contamination (top right), Model 3: partial contamination (middle left), Model 4: peaks contamination (middle right), Model 5: uniform contamination (bottom left), Model 6: shape contamination (bottom right). The mean curve is depicted in red, and the trimmed mean in green in all cases.
In Model 5 we allow for multiple contaminations of each curve. The performance of the GBD, cGBD and GBDO deteriorates since these short excursions are not recognized as a contamination. GBDI performs a little better, since the uniform contamination leads to the contaminated curves residing in the bands for only short consecutive stretches. Still, GBDI is not competitive with the band depths (though better than the mean). The best method is again the cBD, followed by the computationally expensive BD3.

In Model 6 (shape contamination), a set of smooth curves is contaminated by a set of irregular curves. We expect the shape sensitive methods (band depths, GBDI) to excel in this setting. Indeed, the GBDI is the best method, followed closely by cBD and BD3. Again, the less restrictive depths (GBD, cGBD and GBDO) do not perform well in this setting.

In the first three simulations (Models 1 through 3), the generalized band depths, with GBDO in the lead, outperform the band depths. The contaminations in Models 1 through 3 are essentially magnitude outliers, and persistent across the compact interval T. In simulations from Models 4 through 6, the contaminations are more subtle (random peaks, or a different covariance structure). In such settings, the less restrictive data depths, which discard short excursions as non-informative, perform poorly. The band depths, with the corrected band depth in the lead, as well as the GBDI, perform well in this setting.

It is clear, then, that no single depth can be called the “best” across all possible scenarios. Therefore, in practice one needs to consider the type of contaminations to screen against. We recommend that several depths be applied to each data set, and that the outliers identified by each depth measure be examined graphically. From the above simulation results we see that GBDO and cBD are two candidate measures that are fast and easy to compute, and would highlight different structures in the data.
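The α-trimmed mean and the integrated error used throughout this section can be sketched on a deliberately simple, deterministic toy sample. The sketch assumes parallel lines $4t + \text{offset}$ instead of Gaussian-process draws, a single magnitude outlier shifted by M = 25 (in the spirit of Model 1), and a grid-based GBD with J = 2 computed on the full sample rather than by re-sampling. All names and the toy data are illustrative, not the authors' setup.

```python
from itertools import combinations


def gbd(x, sample):
    """Generalized band depth (J = 2) on a common grid, counting grid points."""
    pairs = list(combinations(sample, 2))
    total = 0.0
    for a, b in pairs:
        inside = sum(min(u, v) <= xt <= max(u, v) for xt, u, v in zip(x, a, b))
        total += inside / len(x)
    return total / len(pairs)


def trimmed_mean(sample, depth, alpha):
    """Pointwise mean of the n - [n * alpha] deepest curves."""
    ranked = sorted(sample, key=lambda c: depth(c, sample), reverse=True)
    kept = ranked[: len(sample) - int(len(sample) * alpha)]
    return [sum(vals) / len(kept) for vals in zip(*kept)]


def integrated_error(estimate, truth):
    """Average squared deviation from the true mean curve over the grid."""
    return sum((e - g) ** 2 for e, g in zip(estimate, truth)) / len(truth)


V = 30
grid = [k / V for k in range(1, V + 1)]
g = [4 * t for t in grid]
offsets = [-0.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4]
curves = [[gt + o for gt in g] for o in offsets]
curves.append([gt + 25.0 for gt in g])        # one magnitude outlier, M = 25

mean_curve = [sum(vals) / len(curves) for vals in zip(*curves)]
ei_mean = integrated_error(mean_curve, g)                             # 2.5 ** 2
ei_trim = integrated_error(trimmed_mean(curves, gbd, alpha=0.2), g)   # 0.05 ** 2
```

With α = 0.2 and n = 10, two curves are trimmed: the outlier and the lowest clean curve, which share the smallest depth. The trimmed mean therefore keeps a small upward offset of 0.05, a toy version of the asymmetric trimming bias noted for Model 1.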
To illustrate the screening properties of the depth measures, we apply six notions of depth to a data set consisting of the daily temperature at 35 different Canadian weather stations for one year (Ramsay and Silverman [22]). The original data was smoothed using a Fourier basis with sixty-five elements. In Figure 7 we show the mean and the median (deepest) curve identified by the BD, cBD, GBD, cGBD, GBDI and GBDO. In addition, in each figure we highlight the 20% least deep curves in green (wide lines). From the above discussion, and the definitions in Sections 2 and 3, we know that GBD, cGBD and GBDO are the least restrictive depths. These depths will largely identify magnitude outliers as the least deep. In Figure 7 we see that this is indeed the case for the temperature data set. In contrast, BD, cBD and GBDI are the most restrictive, and will identify shape outliers. Temperature profiles that are flatter than the rest of the data, or irregular with large, local fluctuations, are identified by these depths.

6. Discussion

We introduce several extensions of the band depth for functional data. These new notions of depth account for the explicit ordering of the data dimensions inherent to functional data. We find that a simple alteration of the definition of a delimiting band, a band correction, can improve on the previously proposed band depth. In addition, a differential treatment of band excursions that are consecutive versus isolated can boost the performance of the generalized band depth. While two of our proposed extensions, the corrected band depth cBD and the GBDO, improve on the band depth and the generalized band depth respectively, we
Fig 7. Comparison of the 6 notions of depth. In each figure, we depict the mean curve (red), and median curve (black). The 20% least deep curves are plotted in green (wide lines), whereas the 80% most central curves are depicted in cyan (thin lines). Top panel: BD and cBD. Middle panel: GBD and cGBD. Bottom panel: GBDI and GBDO .
cannot identify a “best” data depth for functional data that uniformly dominates the other measures across all forms of contamination. The GBDO is the most competitive in simple contamination scenarios (magnitude outliers), whereas the cBD is more competitive when the samples are contaminated by curves of a different structure or shape. Our recommendation is that a set of data depths be used to screen the data for contaminations. A graphical examination of the data set can elucidate the potential outliers or extreme observations. One must make a case-by-case decision as to which functional shapes constitute outliers in a particular data set.

We propose a fast and simple re-sampling based data depth calculation procedure. The rank-orders induced by the re-sampling based method closely agree with the rank-orders induced by the full data. With the computationally efficient re-sampling based method, the new notions of depth can be used as building blocks in non-parametric functional analysis such as clustering and classification.

Acknowledgments. The authors would like to thank the reviewer for many insightful comments and helpful suggestions. R. Jornsten is partially supported by NSF Grant DMS-0306360 and EPA Star Grant RD-83272101-0. The work for this paper was conducted while S. López-Pintado was a postdoctoral researcher at Rutgers University, Department of Statistics.

References

[1] Arcones, M. A. and Giné, E. (1993). Limit theorems for U-processes. Ann. Probab. 21 1494–1542. MR1235426
[2] Brown, B. and Hettmansperger, T. (1989). The affine invariant bivariate version of the sign test. J. Roy. Statist. Soc. Ser. B 51 117–125. MR0984998
[3] Fraiman, R. and Meloche, J. (1999). Multivariate L-estimation. Test 8 255–317. MR1748275
[4] Fraiman, R. and Muniz, G. (2001). Trimmed means for functional data. Test 10 419–440. MR1881149
[5] Ghosh, A. K. and Chaudhuri, P. (2005). On data depth and distribution-free discriminant analysis using separating surfaces. Bernoulli 11 1. MR2121452
[6] Hettmansperger, T. and Oja, H. (1994). Affine invariant multivariate multisample sign tests. J. Roy. Statist. Soc. Ser. B 56 235–249. MR1257810
[7] Inselberg, A. (1981). N-dimensional graphics, Part I-lines and hyperplanes. In IBM LASC, Technical Report, G320-27111.
[8] Inselberg, A. (1985). The plane parallel coordinates. Invited paper. Visual Computer 1 69–91.
[9] Inselberg, A., Reif, M. and Chomut, T. (1987). Convexity algorithms in parallel coordinates. Journal of ACM 34 765–801. MR0913842
[10] Jornsten, R. (2004). Clustering and classification via the L1 data depth. J. Multivariate Anal. 90 67–89. MR2064937
[11] Li, J. and Liu, R. (2004). New nonparametric tests of multivariate locations and scales using data depth. Statist. Sci. 19 686–696. MR2185590
[12] Liu, R. (1990). On a notion of data depth based on random simplices. Ann. Statist. 18 405–414. MR1041400
[13] Liu, R. (1995). Control charts for multivariate processes. J. Amer. Statist. Assoc. 90 1380–1388. MR1379481
[14] Liu, R., Parelius, J. M. and Singh, K. (1999). Multivariate analysis by data depth: Descriptive statistics, graphics and inference. Ann. Statist. 27 783–858. MR1724033
[15] Liu, R. and Singh, K. (1993). A quality index based on data depth and multivariate rank test. J. Amer. Statist. Assoc. 88 257–260. MR1212489
[16] López-Pintado, S. and Romo, J. (2006a). Depth-based inference for functional data. Comput. Statist. Data Anal. In press.
[17] López-Pintado, S. and Romo, J. (2006b). Depth-based classification for functional data. In Data Depth: Robust Multivariate Analysis, Computational Geometry and Applications. Dimacs Series in Discrete Mathematics and Theoretical Computer Science 72 (R. Y. Liu, R. Serfling and D. L. Souvaine, eds.). Amer. Math. Soc., Providence, RI. In press.
[18] López-Pintado, S. and Romo, J. (2006c). On the concept of depth for functional data. J. Amer. Statist. Assoc. Submitted. Second revision.
[19] Mahalanobis, P. C. (1936). On the generalized distance in statistics. Proc. Nat. Acad. Sci. India 12 49–55.
[20] Oja, H. (1983). Descriptive statistics for multivariate distributions. Statist. Probab. Lett. 1 327–332. MR0721446
[21] Pollard, D. (1984). Convergence of Stochastic Processes. Springer, New York. MR0762984
[22] Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis, 2nd ed. Springer, New York. MR2168993
[23] Rousseeuw, P. and Hubert, M. (1999). Regression depth (with discussion). J. Amer. Statist. Assoc. 4 388–433. MR1702314
[24] Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York. MR0595165
[25] Singh, K. (1991). A notion of majority depth. Unpublished document.
[26] Tukey, J. (1975). Mathematics and picturing data. Proceedings of the 1975 International Congress of Mathematics 2 523–531. MR0426989
[27] Vardi, Y. and Zhang, C.-H. (2001). The multivariate L1-median and associated data depth. Proc. Nat. Acad. Sci. USA 97 1423–1426. MR1740461
[28] Wegman, E. (1990). Hyperdimensional data analysis using parallel coordinates. J. Amer. Statist. Assoc. 85 664–675.
[29] Wood, A. T. A. and Chan, G. (1994). Simulation of stationary gaussian processes in C[0, 1]. J. Comput. Graph. Statist. 3 409–432. MR1323050
[30] Yeh, A. and Singh, K. (1997). Balanced confidence sets based on the Tukey depth. J. Roy. Statist. Soc. Ser. B 3 639–652. MR1452031
[31] Zuo, Y. (2003). Projection based depth functions and associated medians. Ann. Statist. 31 1460–1490. MR2012822
[32] Zuo, Y. and Serfling, R. (2000). General notions of statistical depth function. Ann. Statist. 28 461–482. MR1790005
IMS Lecture Notes–Monograph Series Complex Datasets and Inverse Problems: Tomography, Networks and Beyond Vol. 54 (2007) 121–131 © Institute of Mathematical Statistics, 2007 DOI: 10.1214/074921707000000094
A representative sampling plan for auditing health insurance claims∗ Arthur Cohen1 and Joseph Naus1 Rutgers University Abstract: A stratified sampling plan to audit health insurance claims is offered. The stratification is by dollar amount of the claim. The plan is representative in the sense that with high probability for each stratum, the difference in the average dollar amount of the claim in the sample and the average dollar amount in the population, is “small.” Several notions of “small” are presented. The plan then yields a relatively small total sample size with the property that the overall average dollar amount in the sample is close to the average dollar amount in the population. Three different estimators and corresponding lower confidence bounds for over (under) payments are studied.
∗ Research supported by NSF Grant DMS-0457248 and NSA Grant H 98230-06-1-007.
1 Department of Statistics and Biostatistics, Rutgers University, Hill Center, Busch Campus, 110 Frelinghuysen Road, Piscataway NJ 08854-8019, USA, e-mail: [email protected]; [email protected]
AMS 2000 subject classifications: primary 60K35; secondary 60K35.
Keywords and phrases: stratified sampling plan, dollar amount of claim, over (under) payments, unbiased estimator, separate ratio estimator, combined ratio estimator, lower confidence bound.

1. Introduction

Auditing health insurance claims is, of necessity, extensively practiced. Statistical sampling plans and analyses of data from such plans are also extensive and diverse. One recent bibliography is Sampling for Financial and Internal Audits compiled by Yancey [5]. An annotated bibliography is given in Statistical Models and Analysis in Auditing (SMAA) [3] compiled by the Panel on Nonstandard Mixtures of Distributions, Committee on Applied and Theoretical Statistics of the Board on Mathematical Sciences, National Research Council. Most standard sampling methodologies have been considered, namely simple random sampling, stratified random sampling, and dollar unit sampling. In addition a variety of estimators of overpayments (underpayments) have been considered. (See SMAA [3].) These include the mean-per-unit estimator, the difference estimator, two types of ratio estimators, weighted averages of the above three, the dollar unit estimator based on estimating the proportion of items in which overpayment occurs, and Stringer estimators based on an estimator of the above proportion and also the data corresponding to actual overpayments (underpayments).

In this study we seek a stratified sampling plan whose main objective is to produce a lower confidence bound for the amount of overpayments (underpayments). Stratification is done on dollar amount of the claim (book). It is envisioned that this lower bound could justify a repayment by a health care provider to a client whose employees are covered by the provider’s health insurance plan. Desirable features sought include simplicity, representativeness, relatively small total sample size yet somewhat adequate sample size in each stratum, no samples from zero dollar claims, and relatively larger samples from strata with high dollar claims. Furthermore the plan should audit all extremely high dollar claims and treat them separately. That is, such claims should not be included as part of the statistical sample.

To achieve the stated objectives we propose a “dollar representative stratified sampling plan” and refer to it as RepStrat sampling. The strata are class intervals of dollar amounts of claims. Class boundaries of strata are chosen according to typical guidelines (ample number of claims in each stratum, rounded numbers for boundaries, not too few, not too many). In addition the strata are adjustable so that the sample from each stratum is chosen in such a way that the average dollar amount of the claim for the population is “close” to the average dollar amount of the claim in the sample. The notion of “close” is made explicit in the next section. The closeness of the average dollar amounts of claims in sample and in population is what we mean by representativeness. Sample sizes for each stratum are chosen to ensure this closeness with high probability. We demonstrate that closeness within each stratum guarantees a higher degree of closeness between the averages in the overall population and overall sample.

Representativeness, as measured by closeness of average dollar amounts in sample and population, is important for several reasons. First, it is intuitively desirable. Second, oftentimes estimates of overpayments are needed by an agency or company in order to recover money that was excessively paid out. Since larger dollar amounts of claims have more opportunity for larger overpayments, the agency would welcome higher average dollar amounts in the sample strata than in the population strata. On the other hand the agency’s adversary would prefer smaller average dollar amounts in the sample. Since there is always a chance of litigation for the sake of recovering money, the least biased situation is to have representativeness in terms of the closeness notion.
Representativeness is not the only feature of the plan proposed here. We want the sample size to be moderate: not too large, because of auditing expense, and yet large enough to get an estimate of overpayment whose variance is not too large. Furthermore, choosing samples within strata randomly allows for plausible estimators of means and variances in an unbiased way. Still further, the plan offered here is not model based and does not require distributional or other assumptions. Thus the plan offered here is a balance of a sense of fairness in a litigation setting, representativeness, randomness, distributional robustness, and adequate sample size.

In Sections 2 and 3 several different definitions of closeness are offered and their properties are studied. In Section 4 we display customary formulas for lower confidence bounds based on several different estimators of overpayments. Section 5 contains an example. Standard textbook references containing sections on stratified sampling are Cochran [1], Scheaffer, Mendenhall, and Ott, 6th edition [4], and Lohr [2].

2. Determination of strata and sample sizes

Let $N$ be the total number of claims in the population under study. Each claim has a dollar amount (book) and the first step in a stratified sampling plan is to determine $L$ strata. The $L$ strata will be formed as class intervals $(A_i, B_i)$, $i = 1, 2, \ldots, L$. These $L$ strata can be determined iteratively if necessary so that certain properties of the sample (prior to auditing) are achieved. Justification for stratification is an intuitive impression that higher book values are correlated with higher or greater likelihood of overpayments. In our case representativeness is easier to achieve with stratification. In addition we want to assure that relatively more samples are drawn from higher dollar strata.

Once strata boundaries are determined (not necessarily finalized), sample sizes for each stratum need to be determined. In general, the researcher seeks to estimate a population parameter (say $\mu$) within a certain precision $g$, with a certain confidence level $1 - \alpha$. In stratified sampling $n_i$ is the sample size for each stratum and the total sample size is $n = \sum_{i=1}^{L} n_i$. Let $Y_{ij}$ be the known dollar amount of the $j$th claim in the $i$th stratum, $i = 1, \ldots, L$; $j = 1, \ldots, N_i$, where $N_i$ is the number of claims in stratum $i$. Let $\bar Y_i = \sum_{j=1}^{N_i} Y_{ij}/N_i$ and $V_i = \sum_{j=1}^{N_i} (Y_{ij} - \bar Y_i)^2/N_i$ be the mean and variance respectively of the $Y_{ij}$ in stratum $i$. Let $y_{ij}$ and $x_{ij}$ be respectively the sample book amount and audited amount of the $j$th claim in the $i$th stratum, $i = 1, \ldots, L$; $j = 1, \ldots, n_i$. Also let $d_{ij} = \max(0, y_{ij} - x_{ij})$ be the sample amount overpaid on an audited claim. The $i$th stratum mean and variance of $d_{ij}$ are respectively denoted by $\mu_i$ and $\sigma_i^2$, and the mean of the overpayment variable for the population is denoted by $\mu$, where $\mu = \sum_{i=1}^{L} N_i \mu_i/N$. Finally let $W_i = N_i/N$ and $w_i = n_i/n$.

Note that $x_{ij} \neq y_{ij}$ means the auditor has detected an error in treating the claim. A desirable or even acceptable error rate (that includes underpayments as well as overpayments) is 1%, which is the standard in some industries. However it is not uncommon to see higher rates of 3 to 5 or even 8 percent. In the latter cases overpayments, or even overpayments minus underpayments, can be in the millions of dollars. In [3] the $d_{ij}$’s are modeled as coming from a mixture distribution. One distribution of the mixture is degenerate at the point $\{0\}$. More discussion regarding the other distribution is given in [3], where it is mentioned that the distribution may depend on the book amount. In our study no distributional assumptions are made.
There are a variety of ways sample sizes are determined or allocated to the strata. These include equal allocation ($w_i = 1/L$), proportional allocation ($w_i = W_i$) and Neyman allocation ($w_i = W_i \sigma_i / \sum_{j=1}^{L} W_j \sigma_j$). Neyman allocation is “optimal” in the sense of requiring the least total sample size to achieve the required overall precision. However knowledge of $\sigma_i$ is rarely available, and proportional allocation is often preferred to Neyman allocation since it offers a type of “representativeness” in the sense that each stratum appears in the sample the same fraction of the time that it appears in the population. Researchers are sometimes willing to take a somewhat larger sample for “representativeness” and for the simplicity gained.

In RepStrat sampling we allocate the sample to gain representativeness in the dollar amount of the claims in each stratum and in the overall population. Toward this end the researcher specifies a level of precision $g_i$ and a confidence level $1 - \gamma$ such that the sample is representative for each stratum in the sense that

$$P\{|\bar y_i - \bar Y_i| \le g_i\} \ge 1 - \gamma, \qquad i = 1, \ldots, L. \tag{2.1}$$
The practitioner may specify equal absolute precision in estimating stratum means, by choosing equal values for $g_i$. This is considered in more detail under case (a) below. Alternatively, the practitioner may specify equal relative precision in estimating stratum means, i.e., $g_i = f \bar Y_i$, $0 < f < 1$; case (b) below deals with this case. Other choices for specifying stratum precision are considered in cases (c) through (e) below. We show that certain types of specifications are related to proportional allocation in stratified random sampling (case c), or Neyman optimal allocation (case d). We first deal with general $\{g_i\}$.

Given that the practitioner specifies the desired stratum precisions in terms of the $\{g_i\}$ and the confidence $1 - \gamma$, the stratum sample sizes $\{n_i\}$ can be determined as follows. Strata boundaries are chosen so that the sample sizes determined will hopefully be large enough that $\bar y_i$ will be approximately normal. In light of this and (2.1) we find $n_i$ so that

$$P\left\{ |Z| < g_i \sqrt{\frac{n_i (N_i - 1)}{V_i (N_i - n_i)}} \right\} \ge 1 - \gamma, \tag{2.2}$$

where $Z$ is a standard normal variable. This leads to

$$n_i = \frac{z_{\gamma/2}^2 V_i N_i}{g_i^2 (N_i - 1) + z_{\gamma/2}^2 V_i}, \tag{2.3}$$

where $z_{\gamma/2}$ is the $1 - \gamma/2$ quantile of a standard normal. Should $N_i$ be large, so that the finite population correction factor (fpc) is close to 1, then (2.3) reduces to

$$n_i = z_{\gamma/2}^2 V_i / g_i^2. \tag{2.4}$$
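As a small illustration of (2.3) and (2.4), the sketch below computes stratum sample sizes for the relative-precision choice $g_i = f \bar Y_i$. The function name `stratum_sample_size` is hypothetical, the inputs are the stratum summaries given in Table 1 of Section 5, and rounding up to the next integer is an assumption of this sketch (the text does not state a rounding rule).

```python
import math
from statistics import NormalDist


def stratum_sample_size(N_i, V_i, g_i, gamma=0.05, use_fpc=True):
    """Stratum sample size from (2.3) (with fpc) or (2.4) (fpc ignored).

    Rounding up is an assumption of this sketch.
    """
    z = NormalDist().inv_cdf(1 - gamma / 2)  # z_{gamma/2}
    if use_fpc:
        n = z**2 * V_i * N_i / (g_i**2 * (N_i - 1) + z**2 * V_i)
    else:
        n = z**2 * V_i / g_i**2
    return math.ceil(n)


# Stratum summaries as listed in Table 1 (Section 5).
N = [4000, 2200, 1000, 500, 200, 100]            # number of claims N_i
Ybar = [120, 313, 620, 1148, 2374, 5061]         # population means
V = [703, 3500, 10000, 30000, 110000, 250000]    # population variances
f = 0.05                                         # relative precision, g_i = f * Ybar_i

sizes = [stratum_sample_size(Ni, Vi, f * Yi) for Ni, Yi, Vi in zip(N, Ybar, V)]
print(sizes, sum(sizes))
```

With these inputs and $\gamma = .05$ the rounded-up sizes agree with the $n_i$ column of Table 1. Passing `use_fpc=False` gives the (2.4) approximation, which is never smaller than the (2.3) value.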
Remark 2.1. Typically strata with larger dollar amounts of claims will also have larger values of $V_i$. Thus from (2.3) we see that relatively larger samples will be drawn from such strata. This was one of the properties felt to be desirable in a sampling plan of this type.

Now let $\bar y_{st} = \sum_{i=1}^{L} N_i \bar y_i / N$ and $\bar Y = \sum_{i=1}^{L} N_i \bar Y_i / N$ be respectively the sample estimate of, and the true, population mean of book dollar amounts. Once the stratum sample sizes $\{n_i\}$ are determined from the stratum precisions $\{g_i\}$ and $1 - \gamma$, the known stratum variances $V_i$ of the $Y_{ij}$’s yield the variance of $\bar y_{st}$. We can use this to approximate the distribution of $|\bar y_{st} - \bar Y|$. For a specified value of $g$, we can approximate $P\{|\bar y_{st} - \bar Y| \le g\}$; we will denote this probability by $1 - \alpha$. For a chosen value of $g$ we consider

$$P\{|\bar y_{st} - \bar Y| \le g\} = P\left\{ \frac{|\bar y_{st} - \bar Y|}{\sqrt{\sum_{i=1}^{L} W_i^2 V_i / n_i}} \le \frac{g}{\sqrt{\sum_{i=1}^{L} W_i^2 V_i / n_i}} \right\}. \tag{2.5}$$

If we ignore the fpc and substitute (2.4) in (2.5), we find that (2.5) is approximately

$$P\left\{ |Z| \le z_{\gamma/2} \Big/ \sqrt{\sum_{i=1}^{L} W_i^2 (g_i/g)^2} \right\}. \tag{2.6}$$

Thus if

$$\sum_{i=1}^{L} W_i^2 (g_i/g)^2 \le 1 \tag{2.7}$$

it follows from (2.6) that the probability that the overall sample mean of claims is close to the population mean of claims is at least $1 - \gamma$. That is, if (2.7) holds,

$$P\{|\bar y_{st} - \bar Y| \le g\} = 1 - \alpha \ge 1 - \gamma. \tag{2.8}$$

Clearly in this instance $\alpha \le \gamma$. Note that (2.7) and (2.8) are easily satisfied in the case where $g_i = g$ for all $i$; they are also satisfied if $g_i = f \bar Y_i$ and $g = f \bar Y$ for constant $f$. Thus, in the case where the precisions of the stratum mean estimates are equal (either absolutely or relative to the means), the combined stratum estimate has at least as good precision.
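To make condition (2.7) and the approximation (2.6) concrete, the following sketch evaluates both for the relative-precision choice $g_i = f \bar Y_i$, $g = f \bar Y$. The helper name `overall_coverage` is hypothetical, and the stratum counts and means are those listed in Table 1 of Section 5; the calculation ignores the fpc, as (2.6) does.

```python
from statistics import NormalDist


def overall_coverage(W, g_i, g, gamma=0.05):
    """Return (S, coverage).

    S is the left-hand side of (2.7); coverage is the normal
    approximation (2.6) to P{|ybar_st - Ybar| <= g}, fpc ignored.
    """
    S = sum((w * gi / g) ** 2 for w, gi in zip(W, g_i))
    z = NormalDist().inv_cdf(1 - gamma / 2)
    coverage = 2 * NormalDist().cdf(z / S**0.5) - 1
    return S, coverage


# Stratum counts and means as listed in Table 1 (Section 5).
N = [4000, 2200, 1000, 500, 200, 100]
Ybar_i = [120, 313, 620, 1148, 2374, 5061]
W = [Ni / sum(N) for Ni in N]
Ybar = sum(w * y for w, y in zip(W, Ybar_i))   # overall population mean

f = 0.05
S, coverage = overall_coverage(W, [f * y for y in Ybar_i], f * Ybar)
print(S, coverage)   # S below 1, so the overall coverage exceeds 1 - gamma
```

Because $S$ is well below 1 for these strata, the implied overall confidence $1 - \alpha$ is considerably higher than the per-stratum level $1 - \gamma$, which is the sense in which closeness within strata gives a higher degree of closeness overall.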
In the auditing application considered in this paper, claims are stratified by claim amount into $L$ strata. A special type of stratified random sample is picked with $n_i$ claims from the $N_i$ claims within the $i$th stratum, for $i = 1, \ldots, L$. The stratum sample sizes are chosen to make the sample “representative” in that for every stratum, there is a high probability $1 - \gamma$ that the stratum sample mean will be “close” to (for the $i$th stratum, within $g_i$ of) the true stratum mean. Given the $\{g_i\}$ and $\gamma$, the $\{n_i\}$ are completely determined, as is $P\{|\bar y_{st} - \bar Y| \le g\}$, for any $g$. Formulas (2.1) through (2.8) detail this.

In general stratified random sampling the desired closeness of $\bar y_{st}$ to $\bar Y$ is specified together with some type of allocation to determine the $\{n_i\}$. This in turn determines the within stratum precision, which is typically of minor or secondary interest. By contrast, the representative stratified random sampling approach specifies the desired “closeness” (of $\bar y_i$ to $\bar Y_i$) within individual strata to find the $\{n_i\}$. In the next section we relate these two approaches to specifying sample sizes in stratified random sampling.

3. Relation between representative and general stratified sampling approaches

In general stratified random sampling, some particular method of allocating the total sample size $n$ to the $L$ individual strata is chosen. In equal allocation, $n_i = n/L$. In proportional allocation, $n_i = n N_i / \sum_{j=1}^{L} N_j$; that is, the $n_i$ are proportional to the $N_i$. How does representative stratified random sampling relate to various other types of allocation in stratified random sampling? To analyze this, fix the precision of the estimator $\bar y_{st}$; that is, specify $g$ and $\alpha$ in equation (2.8). Various ways to specify the within stratum precision (the $g_i$’s and $\gamma$) are related to types of allocation (proportional, optimal and other) in stratified random sampling.
For a given desired precision of the estimator $\bar y_{st}$, given by $\alpha$ and $g$, and a choice of specification of the $g_i$ and $\gamma$, we can find the total sample size $n$, as well as the sample weights $w_i = n_i/n$. For example, we will show that if the $g_i$ are all taken to be equal, then $w_i$ is proportional to the stratum variances, and the overall $n$ is given by equation (3.4) below. More generally, we show how to determine $\{n_i\}$ and $n$ given any three of $\{g_i\}$, $\alpha$, $\gamma$, $g$.

Divide both $|\bar y_{st} - \bar Y|$ and $g$ in (2.8) by $\sqrt{\sum_{i=1}^{L} W_i^2 V_i / n_i}$; use the normal approximation and ignore the fpc, to find that approximately

$$\sum_{i=1}^{L} W_i^2 V_i / n_i = g^2 / z_{\alpha/2}^2.$$

Thus choosing $n_i$ as in (2.4) yields

$$\sum_{i=1}^{L} W_i^2 g_i^2 = g^2 z_{\gamma/2}^2 / z_{\alpha/2}^2. \tag{3.1}$$

At this point we present a variety of choices of $g_i$.

Case (a): $g_i = C$ for all $i = 1, \ldots, L$. Then from (3.1)

$$C^2 = \frac{g^2 z_{\gamma/2}^2}{z_{\alpha/2}^2 \sum_{i=1}^{L} W_i^2}, \tag{3.2}$$

and should we want $n_i$ to satisfy both (2.1) and (2.8), we have from (2.4) that

$$n_i = V_i z_{\alpha/2}^2 \sum_{j=1}^{L} W_j^2 \Big/ g^2. \tag{3.3}$$

That is, $n_i$ is proportional to $V_i$. The overall sample size in this case is

$$n = (z_{\alpha/2}/g)^2 \sum_{i=1}^{L} W_i^2 \sum_{j=1}^{L} V_j \tag{3.4}$$

and the sample weights $w_i = n_i/n$ are given by

$$w_i = V_i \Big/ \sum_{j=1}^{L} V_j. \tag{3.5}$$
Case (b): $g_i = f \bar Y_i$ for all $i = 1, \ldots, L$. If $g_i$ is a fixed proportion of the stratum mean book amount, then from (3.1) we have approximately

$$f^2 = \frac{g^2 z_{\gamma/2}^2}{z_{\alpha/2}^2 \sum_{i=1}^{L} W_i^2 \bar Y_i^2} \tag{3.6}$$

and thus if we want $n_i$ to satisfy both (2.1) and (2.8), we have from (2.4)

$$n_i = \frac{z_{\alpha/2}^2 V_i \sum_{j=1}^{L} W_j^2 \bar Y_j^2}{g^2 \bar Y_i^2}. \tag{3.7}$$

Thus in this case $n_i$ is proportional to $V_i / \bar Y_i^2$, which is the squared coefficient of variation of the book value for the $i$th stratum. The sample weights $w_i = n_i/n$ are

$$w_i = \frac{V_i / \bar Y_i^2}{\sum_{j=1}^{L} V_j / \bar Y_j^2}. \tag{3.8}$$

Case (c): $g_i = f \sqrt{V_i / W_i}$. From (2.4)

$$n_i = z_{\gamma/2}^2 W_i / f^2 = K W_i, \tag{3.9}$$

where $K$ is a constant. This amounts to proportional allocation related to the number of claims in a stratum. This method of sample size determination does not really satisfy our goals.

Case (d): $g_i = f V_i^{1/4} / \sqrt{W_i}$. From (2.4)

$$n_i = z_{\gamma/2}^2 W_i V_i^{1/2} / f^2. \tag{3.10}$$

From (3.1)

$$f = \frac{g z_{\gamma/2}}{z_{\alpha/2} \left( \sum_{i=1}^{L} W_i V_i^{1/2} \right)^{1/2}}. \tag{3.11}$$

This is Neyman-optimum allocation for estimating mean claim size. For this case, using (3.10) and (3.11),

$$n = \left( \sum_{i=1}^{L} W_i V_i^{1/2} \right)^2 z_{\alpha/2}^2 \Big/ g^2 \tag{3.12}$$

and

$$w_i = W_i V_i^{1/2} \Big/ \sum_{j=1}^{L} W_j V_j^{1/2}. \tag{3.13}$$

Case (e): $g_i = f \bar Y_i^{1/2}$. This designation of strata precision is a compromise between case (a), which seeks the same absolute precision for each stratum, and case (b), which seeks the same relative precision. Here, using (2.4) and (3.1),

$$f = \frac{g z_{\gamma/2}}{z_{\alpha/2} \sqrt{\sum_{i=1}^{L} W_i^2 \bar Y_i}}, \qquad n_i = \frac{V_i z_{\alpha/2}^2 \sum_{j=1}^{L} W_j^2 \bar Y_j}{\bar Y_i g^2}, \tag{3.14}$$

so

$$w_i = (V_i / \bar Y_i) \Big/ \sum_{j=1}^{L} V_j / \bar Y_j. \tag{3.15}$$
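The five cases share one mechanism: by (2.4), $n_i$ is proportional to $V_i / g_i^2$, so the sample weights follow once the shape of $g_i$ is chosen. The sketch below, with the hypothetical helper `allocation_weights`, computes $w_i$ for cases (a) through (e) in that way (fpc ignored; the constant $f$ or $C$ cancels in the weights, so it is set to 1). The stratum summaries are those listed in Table 1 of Section 5.

```python
import math


def allocation_weights(case, W, V, Ybar):
    """Sample weights w_i = n_i / n implied by (2.4) for cases (a)-(e).

    Each case fixes the shape of g_i; then n_i is proportional to V_i / g_i^2.
    """
    shape = {
        "a": lambda w, v, y: 1.0,                      # g_i = C
        "b": lambda w, v, y: y,                        # g_i = f * Ybar_i
        "c": lambda w, v, y: math.sqrt(v / w),         # g_i = f * sqrt(V_i / W_i)
        "d": lambda w, v, y: v**0.25 / math.sqrt(w),   # g_i = f * V_i^(1/4) / sqrt(W_i)
        "e": lambda w, v, y: math.sqrt(y),             # g_i = f * sqrt(Ybar_i)
    }[case]
    raw = [v / shape(w, v, y) ** 2 for w, v, y in zip(W, V, Ybar)]
    total = sum(raw)
    return [r / total for r in raw]


# Stratum summaries as listed in Table 1 (Section 5).
N = [4000, 2200, 1000, 500, 200, 100]
Ybar = [120, 313, 620, 1148, 2374, 5061]
V = [703, 3500, 10000, 30000, 110000, 250000]
W = [Ni / sum(N) for Ni in N]

for case in "abcde":
    print(case, [round(w, 4) for w in allocation_weights(case, W, V, Ybar)])
```

Case (c) returns the population weights $W_i$ (proportional allocation) and case (d) returns weights proportional to $W_i V_i^{1/2}$ (Neyman allocation), in line with (3.9) and (3.13).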
Remark 3.1. In deriving most formulas for $n_i$ we have been ignoring the fpc. Should the fpc not be close to 1, a modification of the formulas may be necessary. The modification entails replacing $V_i$ by $V_i (N_i - n_i)/(N_i - 1)$ and solving the resulting equation for $n_i$. So for example in case (a), (3.3) would be replaced by

$$n_i = \frac{N_i V_i z_{\alpha/2}^2 \sum_{j=1}^{L} W_j^2}{(N_i - 1) g^2 + V_i z_{\alpha/2}^2 \sum_{j=1}^{L} W_j^2}. \tag{3.16}$$

In case (b) the $n_i$ of (3.7) would be replaced by

$$n_i = \frac{V_i z_{\alpha/2}^2 N_i \sum_{j=1}^{L} W_j^2 \bar Y_j^2}{(N_i - 1) g^2 \bar Y_i^2 + V_i z_{\alpha/2}^2 \sum_{j=1}^{L} W_j^2 \bar Y_j^2}. \tag{3.17}$$
Of course if we were concerned only with $n_i$ satisfying (2.1), recognizing the implications this has for (2.8), then $n_i$ can be determined from (2.3), which includes the fpc.

4. Estimating overpayments

In this section we describe several point estimators and corresponding lower confidence bounds for $OP$, the total amount overpaid. Most of the material here is essentially drawn from the cited SMAA [3] article appearing in Statistical Science, and Cochran [1]. We focus on three of the estimators, although the SMAA article discusses others as well. It would not be an unreasonable approach, when seeking recovery of money, to give several estimates of $OP$ to see if they agree. If they do not, the parties might choose to compromise among the competing estimators. We are somewhat guided in the selection of the three estimators by the findings of the SMAA committee, in the sense that some of the following estimators lead to conservative lower bounds. The three estimators are as follows.

The difference estimator:

$$\widehat{OP}_d = \sum_{i=1}^{L} N_i \bar d_i, \tag{4.1}$$

where

$$\bar d_i = \sum_{j=1}^{n_i} d_{ij} / n_i.$$

The separate ratio estimator:

$$\widehat{OP}_{RS} = \sum_{i=1}^{L} N_i \bar Y_i r_i, \tag{4.2}$$

where

$$r_i = \bar d_i / \bar y_i.$$

The combined ratio estimator:

$$\widehat{OP}_{RC} = r_c \sum_{i=1}^{L} N_i \bar Y_i, \tag{4.3}$$

where

$$r_c = \sum_{i=1}^{L} N_i \bar d_i \Big/ \sum_{i=1}^{L} N_i \bar y_i. \tag{4.4}$$
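The three point estimators (4.1) through (4.4) can be sketched directly from stratum summaries. The function names below are hypothetical; the inputs are the stratum counts and population means listed in Table 1 and the sample means listed in Table 3 of Section 5 (with $\bar d_i$ as printed there, to four decimals).

```python
def difference_estimator(N, dbar):
    """(4.1): sum of N_i * dbar_i."""
    return sum(Ni * di for Ni, di in zip(N, dbar))


def separate_ratio_estimator(N, Ybar, dbar, ybar):
    """(4.2): sum of N_i * Ybar_i * r_i with r_i = dbar_i / ybar_i."""
    return sum(Ni * Yi * di / yi for Ni, Yi, di, yi in zip(N, Ybar, dbar, ybar))


def combined_ratio_estimator(N, Ybar, dbar, ybar):
    """(4.3)-(4.4): r_c * sum of N_i * Ybar_i."""
    r_c = sum(Ni * di for Ni, di in zip(N, dbar)) / sum(
        Ni * yi for Ni, yi in zip(N, ybar)
    )
    return r_c * sum(Ni * Yi for Ni, Yi in zip(N, Ybar))


# Population summaries (Table 1) and sample summaries (Table 3).
N = [4000, 2200, 1000, 500, 200, 100]
Ybar = [120, 313, 620, 1148, 2374, 5061]
ybar = [115, 300, 650, 1200, 2400, 5000]
dbar = [4.2432, 27.6296, 22.8718, 118.9394, 450.2222, 799.5000]

op_d = difference_estimator(N, dbar)
op_rs = separate_ratio_estimator(N, Ybar, dbar, ybar)
op_rc = combined_ratio_estimator(N, Ybar, dbar, ybar)
print(op_d, op_rs, op_rc)
```

Up to the rounding of the printed $\bar d_i$, these values agree with the point estimates reported in Table 4.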
Denote the sample variance of $d_{ij}$ by $s_{d_i}^2$, where

$$s_{d_i}^2 = \sum_{j=1}^{n_i} (d_{ij} - \bar d_i)^2 / (n_i - 1). \tag{4.5}$$

Then the estimated variance of $\widehat{OP}_d$ is

$$\hat V(\widehat{OP}_d) = \sum_{i=1}^{L} N_i (N_i - n_i) s_{d_i}^2 / n_i. \tag{4.6}$$

A $(1 - \beta)$ lower confidence bound based on an estimate $\hat\theta$ for total overpayment is

$$\hat\theta - z_\beta \hat V(\hat\theta)^{1/2}. \tag{4.7}$$

Note that $\hat V(\widehat{OP}_d)$ depends on the $s_{d_i}^2$, which can be computed from (4.5) or, in the case where many of the $d_{ij}$’s are zero, more simply from

$$s_{d_i}^2 = \left[ n_i \sum_{j=1}^{n_i} d_{ij}^2 - \left( \sum_{j=1}^{n_i} d_{ij} \right)^2 \right] \Big/ \big( n_i (n_i - 1) \big),$$

where $\sum_{j=1}^{n_i} d_{ij}$ and $\sum_{j=1}^{n_i} d_{ij}^2$ can be computed from just the non-zero $d_{ij}$. This type of simplification is of even greater help for the ratio estimates that follow.

The estimated variance of $\widehat{OP}_{RS}$ is

$$\hat V(\widehat{OP}_{RS}) = \sum_{i=1}^{L} N_i s_{RS_i}^2 (N_i - n_i) / n_i, \tag{4.8}$$

where

$$s_{RS_i}^2 = \sum_{j=1}^{n_i} (d_{ij} - r_i y_{ij})^2 / (n_i - 1) = s_{d_i}^2 + r_i^2 s_{y_i}^2 - 2 r_i s_{d_i y_i} \tag{4.9}$$

and

$$r_i = \bar d_i / \bar y_i, \qquad s_{d_i y_i} = \sum_{j=1}^{n_i} (d_{ij} - \bar d_i)(y_{ij} - \bar y_i) / (n_i - 1) = \left[ \sum_{j=1}^{n_i} d_{ij} y_{ij} - \bar y_i \sum_{j=1}^{n_i} d_{ij} \right] \Big/ (n_i - 1).$$

Note that $s_{d_i y_i}$ and $s_{d_i}^2$ can be computed from $\sum_j d_{ij}$, $\sum_j d_{ij}^2$, and $\sum_j d_{ij} y_{ij}$, which can be computed from just the non-zero $d_{ij}$. Thus, just knowing $(d_{ij}, y_{ij})$ for the non-zero $d_{ij}$, together with $\bar y_i$ and $s_{y_i}^2$, is sufficient to compute (4.8).

The estimated variance of $\widehat{OP}_{RC}$ is

$$\hat V(\widehat{OP}_{RC}) = \sum_{i=1}^{L} N_i s_{r_i}^2 (N_i - n_i) / n_i, \tag{4.10}$$

where

$$s_{r_i}^2 = \sum_{j=1}^{n_i} (d_{ij} - r_c y_{ij})^2 / (n_i - 1) = \left[ \sum_{j=1}^{n_i} d_{ij}^2 + r_c^2 \sum_{j=1}^{n_i} y_{ij}^2 - 2 r_c \sum_{j=1}^{n_i} d_{ij} y_{ij} \right] \Big/ (n_i - 1) \tag{4.11}$$

and $r_c$ is defined in (4.4). Knowing $\bar y_i$ and $s_{y_i}^2$ gives $\sum_{j=1}^{n_i} y_{ij}^2$, and knowing these together with the $(d_{ij}, y_{ij})$ for non-zero $d_{ij}$ is sufficient to compute (4.10).

5. Example

We have constructed the following fictitious data set to be similar, in number of claims and in book and audited amounts of claims, to data sets we have seen. Table 1
contains population data broken down by class intervals (strata) of designated dollar amount of claims. For each stratum we give the number of claims, the population mean ($\bar Y_i$) and variance ($V_i$) of dollar amount of claim, and the sample sizes ($n_i$) determined from (2.3) for $\gamma = .05$ and $g_i = .05 \bar Y_i$. Table 2 contains sample data for those pairs $(d_{ij}, y_{ij})$ with positive overpayment, $d_{ij} > 0$. Note that there are several sample pairs $(d_{ij}, y_{ij})$ where $y_{ij} - d_{ij} = x_{ij} = 0$. Such cases result when payments were made on claims that shouldn’t have been paid. Table 3 contains the stratum sample means and variances for the $d_{ij}$ and $y_{ij}$. Table 4 compares the difference estimate (4.1), the separate ratio estimate (4.2), and the combined ratio estimate (4.3) for the total amount of overpayment. Also included are the 95% lower confidence bounds for total overpayment based on each estimator, derived using (4.7).

First we note that $|\bar y_{st} - \bar Y| = |418.7500 - 417.9375| = 0.8125 < 0.02 \bar Y$. This indicates that RepStrat has accomplished one of its main goals, namely that the average dollar amount of the claims in the sample is very close to the average dollar amount of claims in the population. From Table 4 we note that all three estimation procedures yield results that are very close to each other. In light of the SMAA report on conservativeness of the lower confidence bounds and the closeness of the three estimation procedures, we feel that RepStrat has provided a satisfactory analysis of the data set.

Table 1
Population data and sample sizes

Strata   Dollar Value    Number of     Population     Population       Sample
Number   Strata          Claims N_i    Mean Ybar_i    Variance V_i     Size n_i
(1)      0–199           4000          120            703              74
(2)      200–499         2200          313            3,500            54
(3)      500–999         1000          620            10,000           39
(4)      1,000–1,999     500           1148           30,000           33
(5)      2,000–3,999     200           2374           110,000          27
(6)      4,000–6,999     100           5061           250,000          14
Total                    8000                                          241
Table 2
Sample pairs $(d_{ij}, y_{ij})$ when $d_{ij} > 0$

Strata   Pairs
1        (9, 44), (105, 105), (57, 57), (143, 143)
2        (8, 288), (422, 422), (115, 380), (93, 455), (495, 495), (359, 359)
3        (530, 530), (76, 516), (12, 736), (124, 540), (54, 711), (96, 674)
4        (804, 1804), (628, 1000), (718, 1000), (475, 1000), (500, 1500), (800, 1500)
5        (1120, 2520), (2607, 2607), (389, 3456), (1990, 3265), (3900, 3900), (100, 3900), (1550, 3000), (500, 3000)
6        (1220, 6102), (1750, 6999), (3, 5232), (6900, 6900), (100, 6671), (1220, 6102)
Table 3
Sample statistics

Strata   n_i    ybar_i    dbar_i      s^2_{y_i}
1        74     115       4.2432      680
2        54     300       27.6296     3400
3        39     650       22.8718     10500
4        33     1200      118.9394    30300
5        27     2400      450.2222    111000
6        14     5000      799.5000    250000
Table 4
Estimators and lower 95% confidence bounds

Type of Estimator    Estimator    Lower Confidence Bound
Difference           330,094      214,037
Separate Ratio       329,833      220,286
Combined Ratio       329,453      215,323
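As a check on the difference-estimator row of Table 4, the sketch below applies (4.1), (4.6) and (4.7) using only the non-zero overpayments of Table 2 and the shortcut formula for $s_{d_i}^2$ from Section 4. The helper name `s2_from_nonzero` is hypothetical, and taking $z_\beta = 1.645$ for the 95% bound is an assumption of this sketch (the text does not state the value used).

```python
import math

# Stratum sizes (Table 1) and sample sizes (Table 3).
N = [4000, 2200, 1000, 500, 200, 100]
n = [74, 54, 39, 33, 27, 14]

# Non-zero overpayments d_ij per stratum (first coordinates in Table 2);
# all remaining sampled claims in a stratum have d_ij = 0.
nonzero_d = [
    [9, 105, 57, 143],
    [8, 422, 115, 93, 495, 359],
    [530, 76, 12, 124, 54, 96],
    [804, 628, 718, 475, 500, 800],
    [1120, 2607, 389, 1990, 3900, 100, 1550, 500],
    [1220, 1750, 3, 6900, 100, 1220],
]


def s2_from_nonzero(d_nonzero, n_i):
    """Sample variance of d_ij from the non-zero values only (shortcut form)."""
    total = sum(d_nonzero)
    total_sq = sum(d * d for d in d_nonzero)
    return (n_i * total_sq - total**2) / (n_i * (n_i - 1))


# Difference estimator (4.1) and its estimated variance (4.6).
op_d = sum(Ni * sum(d) / ni for Ni, ni, d in zip(N, n, nonzero_d))
var_d = sum(
    Ni * (Ni - ni) * s2_from_nonzero(d, ni) / ni
    for Ni, ni, d in zip(N, n, nonzero_d)
)

z_beta = 1.645  # assumed normal point for a one-sided 95% bound
lower_bound = op_d - z_beta * math.sqrt(var_d)  # (4.7)
print(round(op_d), round(lower_bound))
```

Under these assumptions the point estimate and lower bound come out close to the Difference row of Table 4. The ratio-estimator bounds would additionally need the cross-products $\sum_j d_{ij} y_{ij}$ from Table 2 and the $s_{y_i}^2$ of Table 3.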
References

[1] Cochran, W. G. (1977). Sampling Techniques, 3rd ed. Wiley, New York.
[2] Lohr, S. L. (1999). Sampling: Design and Analysis. Duxbury, New York.
[3] Panel on Nonstandard Mixtures of Distributions (1989). Statistical models and analysis in auditing. Statist. Sci. 4 2–33.
[4] Scheaffer, R. L., Mendenhall, W. III and Ott, R. L. (2006). Elementary Survey Sampling, 6th ed. Thomson, U.S.
[5] Yancey, W. (2006). Sampling for financial and internal audits. Bibliography. Available at http://www.willyancey.com/sampling-financial.htm.
IMS Lecture Notes–Monograph Series
Complex Datasets and Inverse Problems: Tomography, Networks and Beyond
Vol. 54 (2007) 132–150
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000102
Confidence distribution (CD) – distribution estimator of a parameter

Kesar Singh1,∗, Minge Xie1,† and William E. Strawderman1,‡

Rutgers University

Abstract: The notion of confidence distribution (CD), an entirely frequentist concept, is in essence a Neymanian interpretation of Fisher’s Fiducial distribution. It contains information related to every kind of frequentist inference. In this article, a CD is viewed as a distribution estimator of a parameter. This leads naturally to consideration of the information contained in CD, comparison of CDs and optimal CDs, and connection of the CD concept to the (profile) likelihood function. A formal development of a multiparameter CD is also presented.
∗ Research partially supported by NSF DMS-0505552.
† Research partially supported by NSF SES-0241859.
‡ Research partially supported by NSA Grant 03G-112.
1 Department of Statistics, Hill Center, Busch Campus, Rutgers University, Piscataway, NJ 08854, USA, e-mail: [email protected]; [email protected]; [email protected]; url: stat.rutgers.edu
AMS 2000 subject classifications: primary 62F03, 62F12, 62G05; secondary 62G10, 62G20.
Keywords and phrases: confidence distribution, frequentist inference, fiducial inference, optimality, bootstrap, data-depth, likelihood function, p-value function.

1. Introduction and the concept

We are happy to dedicate this article to the memory of our colleague Yehuda Vardi. He was supportive of our efforts to develop this research area and in particular brought his paper with Colin Mallows (Mallows and Vardi [19]) to our attention during the discussion.

A confidence distribution (CD) is a compact expression of frequentist inference which contains information on just about every kind of inferential problem. The concept of a CD has its roots in Fisher’s fiducial distribution, although it is a purely frequentist concept with a purely frequentist interpretation. Simply speaking, a CD of a univariate parameter $\theta$ is a data-dependent distribution whose $s$-th quantile is the upper end of a $100s\%$-level one-sided confidence interval of $\theta$. This assertion clearly entails that, for any $0 < s < t < 1$, the interval formed by the $s$-th and $t$-th quantiles of a CD is a $100(t - s)\%$ level two-sided confidence interval. Thus, a CD is in fact a Neymanian interpretation of Fisher’s fiducial distribution (Neyman [21]).

The concept of CD has appeared in a number of research articles. However, the modern statistical community has largely ignored the notion, particularly in applications. We suspect two probable causes lie behind this: (I) the first is its historic connection to Fisher’s fiducial distribution, which is largely considered as “Fisher’s biggest blunder” (see, for instance, Efron [8]); (II) statisticians have not seriously looked at the possible utility of CDs in the context of modern statistical practice. As pointed out by Schweder and Hjort [22], there has recently been a renewed interest in this topic. Some recent articles include Efron [7, 8], Fraser [11, 12], Lehmann [15], Schweder and Hjort [22, 23], Singh, Xie and Strawderman [25, 26], among others. In particular, recent articles emphasize the Neymanian interpretation of the CD and present it as a valuable statistical tool for inference.
For example, Schweder and Hjort [22] proposed reduced likelihood functions from the CDs for inference, and Singh, Xie and Strawderman [26] developed attractive comprehensive approaches through the CDs to combining information from independent sources. The following quotation from Efron [8] on Fisher’s contribution of the Fiducial distribution seems quite relevant in the context of CDs: “. . . but here is a safe prediction for the 21st century: statisticians will be asked to solve bigger and more complicated problems. I believe there is a good chance that objective Bayes methods will be developed for such problems, and that something like fiducial inference will play an important role in this development. Maybe Fisher’s biggest blunder will become a big hit in the 21st century!”

In the remainder of this section, we give a formal definition of a confidence distribution and the associated notion of an asymptotic confidence distribution (aCD), and provide a simple method of constructing CDs as well as several examples of CDs and aCDs. In the following formal definition of the CDs, nuisance parameters are suppressed for notational convenience. It is taken from Singh, Xie and Strawderman [25, 26]. The CD definition is essentially the same as in Schweder and Hjort [22]; they did not define the asymptotic CD however.

Definition 1.1. A function $H_n(\cdot) = H_n(X_n, \cdot)$ on $\mathcal{X} \times \Theta \to [0, 1]$ is called a confidence distribution (CD) for a parameter $\theta$, if
(i) for each given sample set $X_n$ in the sample set space $\mathcal{X}$, $H_n(\cdot)$ is a continuous cumulative distribution function in the parameter space $\Theta$;
(ii) at the true parameter value $\theta = \theta_0$, $H_n(\theta_0) = H_n(X_n, \theta_0)$, as a function of the sample set $X_n$, has a uniform distribution $U(0, 1)$.
The function $H_n(\cdot)$ is called an asymptotic confidence distribution (aCD), if requirement (ii) above is replaced by (ii)′: at $\theta = \theta_0$, $H_n(\theta_0) \xrightarrow{W} U(0, 1)$, as $n \to +\infty$, and the continuity requirement on $H_n(\cdot)$ is dropped.
We call, when it exists, $h_n(\theta) = H_n'(\theta)$ a CD density. It is also known as a confidence density in the literature. It follows from the definition of CD that if $\theta < \theta_0$, $H_n(\theta) \overset{sto}{\le} 1 - H_n(\theta)$, and if $\theta > \theta_0$, $1 - H_n(\theta) \overset{sto}{\le} H_n(\theta)$. Here, $\overset{sto}{\le}$ is a stochastic comparison between two random variables; i.e., for two random variables $Y_1$ and $Y_2$, $Y_1 \overset{sto}{\le} Y_2$ if $P(Y_1 \le t) \ge P(Y_2 \le t)$ for all $t$. Thus a CD works, in a sense, like a compass needle. It points towards $\theta_0$, when placed at $\theta \neq \theta_0$, by assigning more mass stochastically to that side (left or right) of $\theta$ that contains $\theta_0$. When placed at $\theta_0$ itself, $H_n(\theta) = H_n(\theta_0)$ has the uniform $U[0, 1]$ distribution and thus it is noninformative in direction.

The interpretation of a CD as a distribution estimator is as follows. The purpose of analyzing sample data is to gather knowledge about the population from which the sample came. The unknown $\theta$ is a characteristic of the population. Though useful, the knowledge acquired from the data analysis is imperfect in the sense that there is still a, usually known, degree of uncertainty remaining. Statisticians can present the acquired knowledge on $\theta$, with the left-over uncertainty, in the form of a probability distribution. This appropriately calibrated distribution, which reflects statisticians’ confidence regarding where $\theta$ lives, is a CD. Thus, a CD is an expression of inference (an inferential output) and not a distribution on $\theta$. What is really fascinating is that a CD is loaded with a wealth of information about $\theta$ (as is detailed later), as is a posterior distribution in Bayesian inference.

Before we give some illustrative examples, let us describe a general substitution scheme for the construction of CDs that avoids inversion of functions; see also Schweder and Hjort [22]. Although this scheme does not cover all possible ways of constructing CDs (see, for example, Section 4), it covers a wide range of examples involving pivotal statistics. Consider a statistical function $\psi(X_n, \theta)$ which involves the data set $X_n$ and the parameter of interest $\theta$. Besides $\theta$, the function $\psi$ may contain some known parameters (which should be treated as constants) but it should not have any other unknown parameter. On $\psi$, we impose the following condition:

• For any given $X_n$, $\psi(X_n, \theta)$ is continuous and monotonic as a function of $\theta$.

Suppose further that $G_n$, the true c.d.f. of $\psi(X_n, \theta)$, does not involve any unknown parameter and is analytically tractable. In such a case, $\psi(X_n, \theta)$ is generally known as a pivot. Then one has the following exact CD for $\theta$ (provided $G_n(\cdot)$ is continuous):

$$H_n(x) = \begin{cases} G_n\big(\psi(X_n, x)\big), & \text{if } \psi \text{ is increasing in } \theta, \\ 1 - G_n\big(\psi(X_n, x)\big), & \text{if } \psi \text{ is decreasing in } \theta. \end{cases}$$
In most cases, $H_n(x)$ is a continuous c.d.f. for fixed $X_n$ and, as a function of $X_n$, $H_n(\theta_0)$ follows a $U[0,1]$ distribution. Thus, $H_n$ is a CD by definition. Note the substitution of $\theta$ by $x$.

In case the sampling distribution $G_n$ is unavailable, including the case in which $G_n$ depends on unknown nuisance parameters, one can turn to an approximate or estimated sampling distribution $\hat G_n$. This could be the limit of $G_n$, an estimate of the limit, or an estimate based on bootstrap or some other method. Utilizing $\hat G_n$, one defines
\[
H_n(x) =
\begin{cases}
\hat G_n\big(\psi(X_n, x)\big), & \text{if } \psi \text{ is increasing in } \theta, \\
1 - \hat G_n\big(\psi(X_n, x)\big), & \text{if } \psi \text{ is decreasing in } \theta.
\end{cases}
\]
In most cases, $H_n(\theta_0) \xrightarrow{L} U[0,1]$ and $H_n$ is thus an asymptotic CD. The above construction resembles Beran's construction of prepivot (see Beran [4], page 459), which was defined at $\theta = \theta_0$ (the true value of $\theta$). Beran's goal was to achieve second order accuracy in general via double bootstrap.

We now present some illustrative examples of CDs.

Example 1.1. (Normal mean and variance) The most basic case is that of sampling from a normal distribution with parameters $\mu$ and $\sigma^2$. Consider first a CD for the mean when the variance is unknown. Here the standard pivot is $\psi(X_n, \mu) = (\bar X_n - \mu)/(s_n/\sqrt{n})$, which has the Student t-distribution with $(n-1)$ d.f. Using the above substitution, the CD for $\mu$ is
\[
H_n(x) = 1 - P\left(T_{n-1} \le \frac{\bar X - x}{s_n/\sqrt{n}}\right) = P\left(T_{n-1} \le \frac{x - \bar X}{s_n/\sqrt{n}}\right),
\]
where $T_{n-1}$ is a random variable that has the Student's $t_{n-1}$-distribution.

For $\sigma^2$, the usual pivot is $\psi(X_n, \sigma^2) = (n-1)s_n^2/\sigma^2$. By the substitution method, the CD for $\sigma^2$ is
\[
H_n(x) = P\left(\chi^2_{n-1} \ge \frac{(n-1)s_n^2}{x}\right), \quad x \ge 0, \tag{1.1}
\]
where $\chi^2_{n-1}$ is a random variable that has the Chi-square distribution with $n-1$ degrees of freedom.
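The substitution scheme can be sketched in a few lines of Python. The sketch below makes a simplifying assumption that Example 1.1 does not: $\sigma$ is taken as known, so that the pivot is the Z statistic and $G_n = \Phi$ is available in the standard library. The function names and the simulation settings are illustrative only. The simulation draws $H_n(\theta_0)$ repeatedly, which should behave like a $U[0,1]$ sample.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # plays the role of G_n for the Z pivot (sigma known: an assumption)


def cd_normal_mean(xs, sigma, x):
    """H_n(x) = 1 - G_n(psi(X_n, x)) for psi = (xbar - mu) / (sigma / sqrt(n)).

    psi is decreasing in mu, so the 'decreasing' branch of the scheme applies.
    """
    n = len(xs)
    xbar = sum(xs) / n
    pivot_at_x = (xbar - x) / (sigma / math.sqrt(n))
    return 1.0 - STD_NORMAL.cdf(pivot_at_x)


def simulate_cd_at_truth(mu0=1.0, sigma=2.0, n=25, reps=2000, seed=0):
    """Draw H_n(mu0) repeatedly; the values should look like a U[0, 1] sample."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        xs = [rng.gauss(mu0, sigma) for _ in range(n)]
        values.append(cd_normal_mean(xs, sigma, mu0))
    return values


if __name__ == "__main__":
    vals = simulate_cd_at_truth()
    print("mean of H_n(mu0):", sum(vals) / len(vals))
```

Replacing `STD_NORMAL.cdf` by a Student t c.d.f. (not in the standard library) and `sigma` by $s_n$ would give the CD for $\mu$ of Example 1.1.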
Example 1.2. (Bivariate normal correlation) For a bivariate normal population, let $\rho$ denote the correlation coefficient. The asymptotic pivot used in this example is Fisher's Z,
\[
\psi(X_n, \rho) = \left[\log\frac{1+r}{1-r} - \log\frac{1+\rho}{1-\rho}\right]\Big/ 2,
\]
where $r$ is the sample correlation. Its limiting distribution is $N\big(0, \frac{1}{n-3}\big)$, with a fast rate of convergence. So the resulting asymptotic CD is
\[
H_n(x) = 1 - \Phi\left(\sqrt{n-3}\left[\frac12 \log\frac{1+r}{1-r} - \frac12 \log\frac{1+x}{1-x}\right]\right), \quad -1 \le x \le 1.
\]

Example 1.3. (Nonparametric bootstrap) Turning to nonparametric examples based on bootstrap, let $\hat\theta$ be an estimator of $\theta$, such that the limiting distribution of $\hat\theta$, properly normalized, is symmetric. Using symmetry, if the sampling distribution of $\hat\theta - \theta$ is estimated by the bootstrap distribution of $\hat\theta - \hat\theta_B$, then an asymptotic CD is given by
\[
H_n(x) = 1 - P_B(\hat\theta - \hat\theta_B \le \hat\theta - x) = P_B(\hat\theta_B \le x).
\]
Here, $\hat\theta_B$ is $\hat\theta$ computed on a bootstrap sample. The resulting asymptotic CD is the raw bootstrap distribution of $\hat\theta$.

If the distribution of $\hat\theta - \theta$ is estimated by the bootstrap distribution of $\hat\theta_B - \hat\theta$, which is what bootstrappers usually do, the corresponding asymptotic CD is
\[
H_n(x) = 1 - P_B(\hat\theta_B - \hat\theta \le \hat\theta - x) = P_B(\hat\theta_B \ge 2\hat\theta - x).
\]

Example 1.4. (Bootstrap-t method) By the bootstrap-t method, the distribution of the asymptotic pivot $(\hat\theta - \theta)/\widehat{SE}(\hat\theta)$ is estimated by the bootstrap distribution of $(\hat\theta_B - \hat\theta)/\widehat{SE}_B(\hat\theta_B)$. Here $\widehat{SE}_B(\hat\theta_B)$ is the estimated standard error of $\hat\theta_B$, based on the bootstrap sample. Such an approximation has so-called second order accuracy (see Singh [24], Babu and Singh [2, 3]). The resulting asymptotic CD would be
\[
H_n(x) = 1 - P_B\left(\frac{\hat\theta_B - \hat\theta}{\widehat{SE}_B(\hat\theta_B)} \le \frac{\hat\theta - x}{\widehat{SE}(\hat\theta)}\right).
\]
Such a CD, at $x = \theta_0$, typically converges to $U[0,1]$, in law, at a rapid pace.
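As an illustration of Example 1.2, the Fisher Z asymptotic CD can be evaluated and inverted with the Python standard library. This is a minimal sketch; the function names are hypothetical, and the code requires $x$ strictly inside $(-1, 1)$ because the logarithm is undefined at the endpoints.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def fisher_z(r):
    """Fisher's transformation (1/2) log((1 + r) / (1 - r))."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def cd_correlation(r, n, x):
    """Asymptotic CD of Example 1.2 at x, for sample correlation r and sample size n."""
    return 1.0 - STD_NORMAL.cdf(math.sqrt(n - 3) * (fisher_z(r) - fisher_z(x)))


def cd_quantile(r, n, p):
    """Invert the CD: the x in (-1, 1) with cd_correlation(r, n, x) = p."""
    z = fisher_z(r) + STD_NORMAL.inv_cdf(p) / math.sqrt(n - 3)
    return math.tanh(z)  # tanh is the inverse of Fisher's transformation


if __name__ == "__main__":
    r, n = 0.6, 30  # illustrative values, not taken from the text
    print("CD median:", cd_quantile(r, n, 0.5))
    print("central 95% region:", cd_quantile(r, n, 0.025), cd_quantile(r, n, 0.975))
```

The CD median recovers $r$ itself, and the quantiles of the CD give the familiar Fisher Z interval for $\rho$.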
Example 1.5. (Bootstrap 3rd order accurate aCD) Hall [13] came up with the following increasing function of the t-statistic, which does not have the $1/\sqrt{n}$-term in its Edgeworth expansion:
\[
\psi(X_n, \mu) = t + \frac{\hat\lambda}{6\sqrt{n}}(2t^2 + 1) + \frac{1}{27n}\hat\lambda^2 t^3.
\]
Here $t = \sqrt{n}(\bar X - \mu)/s_n$, $\lambda = \mu_3/\sigma^3$, $\hat\lambda$ is a sample estimate of $\lambda$ and the assumption of population normality is dropped. Under mild conditions on the population distribution, the bootstrap approximation to the distribution of this function of $t$ is third-order correct. Let $\hat G_B$ be the c.d.f. of the bootstrap approximation. Then, using the substitution, a second-order correct CD for $\mu$ is given by
\[
H_n(x) = 1 - \hat G_B\big(\psi(X_n, x)\big).
\]
One also has CDs that do not involve pivotal statistics. A particular class of such CDs is constructed from likelihood functions. We will discuss the connections between CDs and likelihood functions in detail in Section 4.

For each given sample $X_n$, $H_n(\cdot)$ is a cumulative distribution function. We can construct a random variable $\xi$ such that $\xi$ has the distribution $H_n$. For convenience of presentation, we call $\xi$ a CD random variable.
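One way to realize such a $\xi$ is inverse-c.d.f. sampling, $\xi = H_n^{-1}(U)$ with $U$ uniform and independent of the data. The sketch below assumes, purely for illustration, a normal sample with known $\sigma$, so that $H_n$ is the $N(\bar X, \sigma^2/n)$ c.d.f.; the function names are hypothetical. Because $H_n(\theta_0)$ is $U[0,1]$, the draw $\xi$ should fall at or below $\theta_0$ about half of the time.

```python
import math
import random
from statistics import NormalDist


def draw_cd_random_variable(xs, sigma, rng):
    """Return xi = H_n^{-1}(U), where H_n is the N(xbar, sigma^2 / n) c.d.f."""
    n = len(xs)
    xbar = sum(xs) / n
    h_n = NormalDist(xbar, sigma / math.sqrt(n))
    u = rng.random()
    while u == 0.0:  # inv_cdf requires 0 < u < 1
        u = rng.random()
    return h_n.inv_cdf(u)


def fraction_below_truth(theta0=0.0, sigma=1.0, n=20, reps=4000, seed=1):
    """Monte Carlo estimate of P(xi <= theta0) over repeated samples."""
    rng = random.Random(seed)
    below = 0
    for _ in range(reps):
        xs = [rng.gauss(theta0, sigma) for _ in range(n)]
        if draw_cd_random_variable(xs, sigma, rng) <= theta0:
            below += 1
    return below / reps


if __name__ == "__main__":
    print("P(xi <= theta0) is roughly", fraction_below_truth())
```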
Definition 1.2. We call $\xi = \xi_{H_n}$ a CD random variable associated with a CD $H_n$, if the conditional distribution of $\xi$ given the data $X_n$ is $H_n$.

As an example, let $U$ be a $U(0,1)$ random variable independent of $X_n$; then $\xi = H_n^{-1}(U)$ is a CD random variable. Let us note that $\xi$ may be viewed as a CD-randomized estimator of $\theta_0$. As an estimator, $\xi$ is median unbiased, i.e., $P_{\theta_0}(\xi \le \theta_0) = E_{\theta_0}\{H_n(\theta_0)\} = \frac12$. However, $\xi$ is not always mean unbiased. For example, the CD random variable $\xi$ associated with (1.1) in Example 1.1 is mean biased as an estimator of $\sigma^2$.

We close this section with an equivariance result on CDs, which may be helpful in the construction of a CD for a function of $\theta$, for example, to derive a CD for $\sigma$ from that of $\sigma^2$ given in Example 1.1. The equivariance is shared by Efron's bootstrap distribution of an estimator, which is of course an aCD (Example 1.3 above) under conditions.

Proposition 1.1. Let $H_n$ be a CD for $\theta$ and $\xi$ be an associated CD random variable. Then, the conditional distribution function of $g(\xi)$, for given $X_n$, is a CD of $g(\theta)$, if $g$ is monotonic. When the monotonicity is limited to a neighborhood of $\theta_0$ only, then the conditional distribution of $g(\xi)$, for given $X_n$, yields an asymptotic CD at $\theta = \theta_0$, provided, for all $\varepsilon > 0$, $H_n(\theta_0 + \varepsilon) - H_n(\theta_0 - \varepsilon) \xrightarrow{p} 1$.

Proof. The proof of the first claim is straightforward. For the second claim, we note that, if $g(\cdot)$ is increasing within $(\theta_0 - \varepsilon, \theta_0 + \varepsilon)$,
\[
P\big(g(\xi) \le g(\theta_0) \mid x\big) = P\big(\{\xi \le \theta_0\} \cap \{\theta_0 - \varepsilon \le \xi \le \theta_0 + \varepsilon\} \mid x\big) + o_p(1) = H_n(\theta_0) + o_p(1).
\]
One argues similarly for decreasing $g(\cdot)$.

The rest of the paper is arranged as follows. Section 2 is devoted to comparing CDs for the same parameter and related issues. In Section 3, we explore, from the frequentist viewpoint, inferential information contained within a CD. In Section 4, we establish that the normalized profile likelihood function is an aCD. Lastly, Section 5 is an attempt to formally define and develop the notion of joint CD for a parameter vector.

Parts of Sections 2 and 3 are closely related to the recent paper of Schweder and Hjort [22], and also to Singh, Xie and Strawderman [25, 26]. Schweder and Hjort [22] present essentially the same definition of the CD and also compare CDs as we do in this paper (see Definition 2.1). They also develop the notion of an optimal CD which is quite close to that presented here and in Singh, Xie and Strawderman [25]. Our development is based on the theory of UMPU tests and differs slightly from theirs. The material on p-values in Section 3.3 is also closely related to, but somewhat more general than, that of Fraser [11].

2. Comparison of CDs and a notion of optimal CD

The precision of a CD can be measured in terms of how little probability mass a CD wastes on sets that do not include $\theta_0$. This suggests that, for $\varepsilon > 0$, one should compare the quantities $H_1(\theta_0 - \varepsilon)$ with $H_2(\theta_0 - \varepsilon)$ and also $1 - H_1(\theta_0 + \varepsilon)$ with $1 - H_2(\theta_0 + \varepsilon)$. In each case, a smaller value is preferred. Here $H_1$ and $H_2$ are any two CDs for the common parameter $\theta$, based on the same sample of size $n$.

Definition 2.1. Given two CDs $H_1$ and $H_2$ for $\theta$, we say $H_1$ is more precise than $H_2$, at $\theta = \theta_0$, if for all $\varepsilon > 0$,
\[
H_1(\theta_0 - \varepsilon) \stackrel{sto}{\le} H_2(\theta_0 - \varepsilon) \quad \text{and} \quad 1 - H_1(\theta_0 + \varepsilon) \stackrel{sto}{\le} 1 - H_2(\theta_0 + \varepsilon)
\]
when $\theta_0$ is the prevailing value of $\theta$.
An essentially equivalent definition is also used in Singh, Xie and Strawderman [25] and Schweder and Hjort [22]. The following proposition follows immediately from the definition.

Proposition 2.1. If $H_1$ is more precise than $H_2$ and they both are strictly increasing, then for all $t$ in $[0, 1]$,
\[
\big(H_1^{-1}(t) - \theta_0\big)^+ \stackrel{sto}{\le} \big(H_2^{-1}(t) - \theta_0\big)^+ \quad \text{and} \quad \big(H_1^{-1}(t) - \theta_0\big)^- \stackrel{sto}{\le} \big(H_2^{-1}(t) - \theta_0\big)^-.
\]
Thus,
\[
\big|H_1^{-1}(t) - \theta_0\big| \stackrel{sto}{\le} \big|H_2^{-1}(t) - \theta_0\big|. \tag{2.1}
\]
The statement (2.1) yields a comparison of confidence intervals based on $H_1$ and $H_2$. In general, an endpoint of a confidence interval based on $H_1$ is closer to $\theta_0$ than that based on $H_2$. In some sense, it implies that the $H_1$ based confidence intervals are more compact. Also, the CD median of a more precise CD is stochastically closer to $\theta_0$.

Let $\phi(x, \theta)$ be a loss function such that $\phi(\cdot, \cdot)$ is non-decreasing for $x \ge \theta$ and non-increasing for $x \le \theta$. We now connect the above defined CD-comparison to the following concept of the $\phi$-dispersion of a CD.

Definition 2.2. For a CD $H(x)$ of a parameter $\theta$, the $\phi$-dispersion of $H(x)$ is defined as
\[
d_\phi(\theta, H) = E_\theta \int \phi(x, \theta)\, dH(x).
\]
In the special case of square error loss, $d_{sq}(\theta, H) = E_\theta \int (x - \theta)^2\, dH(x)$. In general, we have the following:

Theorem 2.1. If $H_1$ is more precise than $H_2$ at $\theta = \theta_0$, in terms of Definition 2.1, then
\[
d_\phi(\theta_0, H_1) \le d_\phi(\theta_0, H_2). \tag{2.2}
\]
In fact, the above theorem holds under a set of weaker conditions: For any $\varepsilon > 0$,
\[
E\{H_1(\theta_0 - \varepsilon)\} \le E\{H_2(\theta_0 - \varepsilon)\} \quad \text{and} \quad E\{H_1(\theta_0 + \varepsilon)\} \ge E\{H_2(\theta_0 + \varepsilon)\}. \tag{2.3}
\]
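Proof. The claim in (2.2) is equivalent to
\[
E\{\phi(\xi_1, \theta_0)\} \le E\{\phi(\xi_2, \theta_0)\}, \tag{2.4}
\]
where $\xi_1$ and $\xi_2$ are CD random variables associated with $H_1$ and $H_2$ (see Definition 1.2), respectively. From (2.3), via conditioning on $X_n$, it follows that
\[
(\xi_1 - \theta_0)^+ \stackrel{sto}{\le} (\xi_2 - \theta_0)^+ \quad \text{and} \quad (\xi_1 - \theta_0)^- \stackrel{sto}{\le} (\xi_2 - \theta_0)^-.
\]
Due to the monotonicity of $\phi(\cdot, \theta_0)$, we have
\[
\phi(\xi_1, \theta_0) I_{(\xi_1 \ge \theta_0)} \stackrel{sto}{\le} \phi(\xi_2, \theta_0) I_{(\xi_2 \ge \theta_0)} \quad \text{and} \quad \phi(\xi_1, \theta_0) I_{(\xi_1 < \theta_0)} \stackrel{sto}{\le} \phi(\xi_2, \theta_0) I_{(\xi_2 < \theta_0)}.
\]
The above inequalities lead to (2.4) immediately.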
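The ordering of dispersions can be checked by simulation in a toy setting. The sketch below is an illustration under assumptions that are not in the text: a normal sample with known $\sigma$, with $H_1$ the $N(\bar X_n, \sigma^2/n)$ CD from the full sample and $H_2$ a deliberately wasteful CD that uses only the first $m < n$ observations. For a normal CD the inner integral $\int (x - \theta_0)^2\, dH(x)$ has the closed form (CD mean $-\ \theta_0)^2$ + CD variance, so only the outer expectation is simulated.

```python
import random


def sq_dispersions(theta0=0.0, sigma=1.0, n=40, m=20, reps=4000, seed=2):
    """Monte Carlo estimates of d_sq(theta0, H1) and d_sq(theta0, H2).

    H1 = N(xbar_n, sigma^2 / n) uses all n observations;
    H2 = N(xbar_m, sigma^2 / m) uses only the first m (a hypothetical, less precise CD).
    """
    rng = random.Random(seed)
    d_full = 0.0
    d_part = 0.0
    for _ in range(reps):
        xs = [rng.gauss(theta0, sigma) for _ in range(n)]
        xbar_n = sum(xs) / n
        xbar_m = sum(xs[:m]) / m
        # closed form of the inner integral for a normal CD
        d_full += (xbar_n - theta0) ** 2 + sigma ** 2 / n
        d_part += (xbar_m - theta0) ** 2 + sigma ** 2 / m
    return d_full / reps, d_part / reps


if __name__ == "__main__":
    d1, d2 = sq_dispersions()
    print("d_sq(H1) =", d1, " d_sq(H2) =", d2)
```

In this setting the exact values are $2\sigma^2/n$ and $2\sigma^2/m$, so the simulated dispersions should be close to 0.05 and 0.10 for the default arguments.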
Suppose now that there is a family of Uniformly Most Powerful Unbiased (UMPU) tests for testing $K_0: \theta \le \theta_0$ versus $K_1: \theta > \theta_0$, for every $\theta_0$. The underlying family of distributions may have nuisance parameter(s). Let the corresponding p-value (the inf of $\alpha$ at which $K_0$ can be rejected) $p(\theta_0) = p(X_n, \theta_0)$ be strictly increasing and continuous as a function of $\theta_0$. It is further assumed that $1 - p(\theta_0)$ is the p-value of a UMPU test for testing $K_0: \theta \ge \theta_0$ vs $K_1: \theta < \theta_0$. Let the distribution of $p(\theta_0)$ under $\theta_0$ be $U[0, 1]$ and let the range of $p(\cdot)$ be $[0, 1]$. Define the corresponding CD, $H^*(x) = p(X_n, x)$. We have the following result.

Theorem 2.2. The CD $H^*$ defined above is more precise than any other CD for the parameter $\theta$, at all $\theta_0$.

Proof. Let $\theta = \theta_0$ be the true value. Note that $P_{\theta_0}\big(H^*(\theta_0 - \varepsilon) < \alpha\big)$ is the power (at $\theta = \theta_0$) of the UMPU test when $K_0$ is $\theta \le \theta_0 - \varepsilon$ and $K_1$ is $\theta > \theta_0 - \varepsilon$. Given any other CD $H$, one has the following unbiased test for testing the same hypotheses: Reject $K_0$ iff $H(\theta_0 - \varepsilon) < \alpha$. Therefore,
\[
P_{\theta_0}\big(H^*(\theta_0 - \varepsilon) < \alpha\big) \ge P_{\theta_0}\big(H(\theta_0 - \varepsilon) < \alpha\big)
\]
for all $\alpha \in [0, 1]$. Using the function $1 - p(\cdot)$, one similarly argues for
\[
P_{\theta_0}\big(1 - H^*(\theta_0 + \varepsilon) < \alpha\big) \ge P_{\theta_0}\big(1 - H(\theta_0 + \varepsilon) < \alpha\big).
\]
Thus, $H^*$ is most precise.
It should be mentioned that the property of CDs exhibited in Theorem 2.2 depends on the corresponding optimality properties of hypothesis tests. The basic ideas behind this segment could be traced to the discussions of confidence intervals in Lehmann [14].

Remark 2.1. If the underlying parametric family has the so-called MLR (monotone likelihood ratio) property, there exists a UMP test for one-sided hypotheses whose p-value is monotonic.

Example 2.1. In the testing problem of normal means, the Z-test is UMPU (actually UMP) for the one-sided hypotheses when $\sigma$ is known. The t-test is UMPU for the one-sided hypotheses when $\sigma$ is a nuisance parameter (see Lehmann [14], Chapter 5). The conclusion: $H^*(x) = \Phi\big(\frac{x - \bar X}{\sigma/\sqrt{n}}\big)$ is the most precise CD for $\mu$ when $\sigma$ is known, and $H^{**}(x) = F_{t_{n-1}}\big(\frac{x - \bar X}{s_n/\sqrt{n}}\big)$ is the most precise CD when $\sigma$ is not known. Here, $F_{t_{n-1}}$ is the cumulative distribution function of the t-distribution with $n - 1$ degrees of freedom.

The optimality theory presented above can be expressed in the decision theoretic framework by considering the "target distribution" towards which a CD is supposed to converge. Given $\theta_0$ as the true value of $\theta$, the target distribution is $\delta(\theta_0)$, the Dirac $\delta$-measure at $\theta_0$, which assigns its 100% probability mass at $\theta_0$ itself. A loss function can be defined in terms of "distance" between a CD $H(\cdot)$ and $\delta(\theta_0)$. Perhaps the most popular distance between two distributions $F$ and $G$ is the Kolmogorov-Smirnov distance $= \sup_x |F(x) - G(x)|$. However, it turns out that this particular distance, between a CD and its target $\delta(\theta_0)$, is useless for comparing two CDs. To see this, note that
\[
E_{\theta_0} \sup_x \big|H(x) - I_{[\theta_0, \infty)}(x)\big| = E_{\theta_0} \max\big\{H(\theta_0), 1 - H(\theta_0)\big\} = 3/4 \quad (\text{free of } H!),
\]
since $H(\theta_0)$ follows the $U[0, 1]$ distribution. Note, $I_{[\theta_0, \infty)}$ is the cdf of $\delta(\theta_0)$. So, we instead consider the integrated distance
\[
\tau\big(H, \delta(\theta_0)\big) = \int \psi\big(\big|H(x) - I_{[\theta_0, \infty)}(x)\big|\big)\, dW(x)
\]
where $\psi(\cdot)$ is a monotonic function from $[0, 1]$ to $R^+$ and $W(\cdot)$ is a positive measure. The risk function is $R_{\theta_0}(H) = E_{\theta_0}\big[\tau(H, \delta(\theta_0))\big]$.
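The constant 3/4 obtained above for the expected Kolmogorov-Smirnov distance is simply $E \max\{U, 1 - U\}$ for $U \sim U[0, 1]$, which is easy to confirm numerically. The following sketch (function name and simulation size are illustrative) uses a uniform draw as a stand-in for $H(\theta_0)$; no particular CD enters the computation, which is exactly why this distance cannot discriminate between CDs.

```python
import random


def expected_ks_distance(reps=100_000, seed=3):
    """Estimate E max{U, 1 - U} for U ~ U[0, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        u = rng.random()  # stands in for H(theta0), which is U[0, 1] for any exact CD
        total += max(u, 1.0 - u)
    return total / reps


if __name__ == "__main__":
    print("E max{U, 1 - U} is roughly", expected_ks_distance())
```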
Theorem 2.3. If $H_1$ is a more precise CD than $H_2$ at $\theta = \theta_0$, then $R_{\theta_0}(H_1) \le R_{\theta_0}(H_2)$.
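Theorem 2.3 is proved by interchanging the expectation and the integration appearing in the loss function, which is allowed by Fubini's Theorem.

Now, for asymptotic CDs, we define two asymptotic notions of "more precise CD". One is in the Pitman sense of "local alternatives" and the other is in the Bahadur sense, when $\theta = \theta_0$ is held fixed and the a.s. limit is taken on the CDs themselves. First, the Pitman-more precise CD:

Definition 2.3. Let $H_1$ and $H_2$ be two asymptotic CDs. We say that $H_1$ is Pitman-more precise than $H_2$ if, for every $\varepsilon > 0$,
\[
\lim P\Big(H_{1n}\Big(\theta_0 - \frac{\varepsilon}{\sqrt{n}}\Big) \le t\Big) \ge \lim P\Big(H_{2n}\Big(\theta_0 - \frac{\varepsilon}{\sqrt{n}}\Big) \le t\Big)
\]
and
\[
\lim P\Big(1 - H_{1n}\Big(\theta_0 + \frac{\varepsilon}{\sqrt{n}}\Big) \le t\Big) \ge \lim P\Big(1 - H_{2n}\Big(\theta_0 + \frac{\varepsilon}{\sqrt{n}}\Big) \le t\Big),
\]
where all the limits (as $n \to \infty$) are assumed to exist, and the probabilities are under $\theta = \theta_0$.

Thus, we are requiring that, in terms of the limiting distributions,
\[
H_{1n}\Big(\theta_0 - \frac{\varepsilon}{\sqrt{n}}\Big) \stackrel{sto}{\le} H_{2n}\Big(\theta_0 - \frac{\varepsilon}{\sqrt{n}}\Big)
\quad \text{and} \quad
1 - H_{1n}\Big(\theta_0 + \frac{\varepsilon}{\sqrt{n}}\Big) \stackrel{sto}{\le} 1 - H_{2n}\Big(\theta_0 + \frac{\varepsilon}{\sqrt{n}}\Big).
\]

Example 2.2. The most basic example allowing such a comparison is that of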
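\[
H_{1n}(x) = \Phi\left(\frac{x - \hat\theta_1}{a/\sqrt{n}}\right) \quad \text{and} \quad H_{2n}(x) = \Phi\left(\frac{x - \hat\theta_2}{b/\sqrt{n}}\right),
\]
where $\hat\theta_1$, $\hat\theta_2$ are two $\sqrt{n}$-consistent asymptotically normal estimators of $\theta$, with asymptotic variances $a^2/n$ and $b^2/n$, respectively. The one with a smaller asymptotic variance is Pitman-more precise.

Next, the Bahadur-type comparison:

Definition 2.4. We define $H_1$ to be Bahadur-more precise than $H_2$ (when $\theta = \theta_0$) if, for every $\varepsilon > 0$, a.s.,
\[
\lim \frac1n \log H_1(\theta_0 - \varepsilon) \le \lim \frac1n \log H_2(\theta_0 - \varepsilon)
\]
and
\[
\lim \frac1n \log\big[1 - H_1(\theta_0 + \varepsilon)\big] \le \lim \frac1n \log\big[1 - H_2(\theta_0 + \varepsilon)\big],
\]
where the limits, as $n \to \infty$, are assumed to exist.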
Here too, we are saying that, in a limit sense, $H_1$ places less mass than $H_2$ does on half-lines which exclude $\theta_0$. This comparison is on the $H_i$ directly and not on their distributions.

Example 2.3. Let us return to the example of Z vs t for normal means. The fact that $\Phi\big(\frac{x - \bar X}{\sigma/\sqrt{n}}\big)$ is Bahadur-more precise than $F_{t_{n-1}}\big(\frac{x - \bar X}{s_n/\sqrt{n}}\big)$ follows from the well-known limits:
\[
\frac1n \log \Phi(-\sqrt{n}\,\varepsilon) \longrightarrow -\varepsilon^2/2 \quad \text{and} \quad \frac1n \log F_{t_{n-1}}(-\sqrt{n}\,\varepsilon) \longrightarrow -\frac12 \log(1 + \varepsilon^2).
\]

Remark 2.2. Under modest regularity conditions,
\[
\lim \frac1n \log H_n(\theta_0 - \varepsilon) = \lim \frac1n \log h_n(\theta_0 - \varepsilon)
\]
and
\[
\lim \frac1n \log\big[1 - H_n(\theta_0 + \varepsilon)\big] = \lim \frac1n \log h_n(\theta_0 + \varepsilon).
\]
The right hand sides are CD-density slopes, which have significance of their own for CDs. The faster the CD density goes to 0, at fixed $\theta = \theta_0$, the more compact, in limit, the CD is.

3. Information contained in a CD

This section discusses inference on $\theta$ from a CD or an aCD. We briefly consider basic elements of inference about $\theta$, including confidence intervals, point estimation, and hypothesis testing.

3.1. Confidence intervals

The derivation of confidence intervals from a CD is straightforward and well known. Note that, according to the definition, the intervals $(-\infty, H_n^{-1}(\alpha)]$ and $[H_n^{-1}(\alpha), +\infty)$ are one-sided confidence intervals for $\theta$, with coverage levels $\alpha$ and $1 - \alpha$ respectively, for any $\alpha \in (0, 1)$. It is also clear that the central regions of the CD, i.e., $(H_n^{-1}(\alpha/2), H_n^{-1}(1 - \alpha/2))$, provide two-sided confidence intervals for $\theta$ at coverage level $1 - \alpha$, for each $\alpha \in (0, 1)$. The same is true for an aCD, where the confidence level is achieved in limit.

3.2. Point estimators
A CD (or an aCD) on the real line can be a tool for point estimation as well. We assume the following condition, which is mild and almost always met in practice.

For any $\varepsilon$, $0 < \varepsilon < \frac12$, and each fixed $\theta_0$,
\[
L_n(\varepsilon) = H_n^{-1}(1 - \varepsilon) - H_n^{-1}(\varepsilon) \to 0, \text{ in probability, as } n \to \infty. \tag{3.1}
\]
Condition (3.1) states that the CD based information concentrates around $\theta_0$ as $n$ gets large.

One natural choice for a point estimator of the parameter $\theta$ is the median of a CD (or an aCD), $M_n = H_n^{-1}(1/2)$. Note that $M_n$ is a median-unbiased estimator, even if the original estimator on which $H_n$ is based is not. For instance, this does happen in the case of the CD for the normal variance, based on the $\chi^2$-distribution. This median unbiasedness follows from the observation that $P_{\theta_0}(M_n \le \theta_0) = P_{\theta_0}(1/2 \le H_n(\theta_0)) = 1/2$.

The following is a consistency result on $M_n$.
Theorem 3.1. (i) If condition (3.1) is true, then $M_n \to \theta_0$, in probability, as $n \to \infty$. (ii) Furthermore, if $L_n(\varepsilon) = O_p(a_n)$, for a non-negative $a_n \to 0$, then $M_n - \theta_0 = O_p(a_n)$.

Proof. (i) We first note the identity: for $\alpha \in (0, \frac12)$,
\[
P_{\theta_0}\big(|M_n - \theta_0| > \delta\big) = P_{\theta_0}\big(\{|M_n - \theta_0| > \delta\} \cap \{L_n(\alpha) > \delta\}\big) + P_{\theta_0}\big(\{|M_n - \theta_0| > \delta\} \cap \{L_n(\alpha) \le \delta\}\big).
\]
Under the assumed condition, the first term on the r.h.s. $\to 0$. We prove that the second term is $\le 2\alpha$. This is deduced from the set inequality:
\[
\{|M_n - \theta_0| > \delta\} \cap \{L_n(\alpha) \le \delta\} \subseteq \{H_n(\theta_0) \le \alpha\} \cup \{H_n(\theta_0) \ge 1 - \alpha\}.
\]
To conclude the above set inequality, one needs to consider the two cases $M_n > \theta_0 + \delta$ and $M_n < \theta_0 - \delta$ separately. The first case leads to $\{H_n(\theta_0) \le \alpha\}$ and the second one to $\{H_n(\theta_0) \ge 1 - \alpha\}$. Part (i) follows, since $\alpha$ is arbitrary. One can prove part (ii) by using similar reasoning.
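In practice, the median $M_n = H_n^{-1}(1/2)$ and the central region $(H_n^{-1}(\alpha/2), H_n^{-1}(1 - \alpha/2))$ can be read off any continuous, increasing CD by numerical inversion. The sketch below uses plain bisection; the helper names, the bracketing interval, and the normal CD used in the usage lines are assumptions made for illustration, not part of the paper.

```python
from statistics import NormalDist


def cd_quantile(cd, p, lo, hi, tol=1e-10):
    """Solve cd(x) = p by bisection.

    Assumes cd is continuous and increasing with cd(lo) < p < cd(hi).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cd(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def cd_summary(cd, alpha, lo, hi):
    """Return the CD median and the central region with tail mass alpha / 2 on each side."""
    median = cd_quantile(cd, 0.5, lo, hi)
    lower = cd_quantile(cd, alpha / 2, lo, hi)
    upper = cd_quantile(cd, 1 - alpha / 2, lo, hi)
    return median, (lower, upper)


if __name__ == "__main__":
    # Hypothetical CD: N(3.0, 0.5^2), e.g. xbar = 3.0 and sigma / sqrt(n) = 0.5.
    cd = NormalDist(3.0, 0.5).cdf
    print(cd_summary(cd, 0.05, -10.0, 10.0))
```

Bisection is used here only because it needs nothing beyond monotonicity of $H_n$; when $H_n^{-1}$ is available in closed form, as for the normal CD above, it should of course be used directly.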
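One can also use the average of a CD (or an aCD), $\bar\theta_n = \int_{-\infty}^{+\infty} t\, dH_n(t)$, to consistently estimate the unknown $\theta_0$. Indeed, $\bar\theta_n$ is the frequentist analog of the Bayesian estimator of $\theta$ under the usual square loss.

Theorem 3.2. Under condition (3.1), if $r_n = \int_{-\infty}^{+\infty} t^2\, dH_n(t)$ is bounded in probability, then $\bar\theta_n \to \theta_0$, in probability.

Proof. Using the Cauchy-Schwarz inequality, we have, for any $0 < \varepsilon < 1/2$,
\[
\left| \int_{-\infty}^{H_n^{-1}(\varepsilon)} t\, dH_n(t) + \int_{H_n^{-1}(1-\varepsilon)}^{+\infty} t\, dH_n(t) \right| \le 2 r_n \varepsilon^{1/2}.
\]
Thus, $(1 - \varepsilon) H_n^{-1}(\varepsilon) - 2 r_n \varepsilon^{1/2} \le \bar\theta_n \le (1 - \varepsilon) H_n^{-1}(1 - \varepsilon) + 2 r_n \varepsilon^{1/2}$. Now, with $M_n = H_n^{-1}(1/2)$,
\[
|\bar\theta_n - \theta_0| \le |M_n - \theta_0| + |\bar\theta_n - M_n| \le |M_n - \theta_0| + \big|H_n^{-1}(1 - \varepsilon) - H_n^{-1}(\varepsilon)\big| + 2 r_n \varepsilon^{1/2} + 2\varepsilon |M_n|.
\]
Since $\varepsilon > 0$ is arbitrary, the result follows using Theorem 3.1.

Denote $\hat\theta_n = \arg\max_\theta h_n(\theta)$, the value that maximizes the CD (or aCD) density function $h_n(\theta) = \frac{d}{d\theta} H_n(\theta)$. Let $\varepsilon_n = \inf_{0 < \varepsilon \le 1/2} \{\varepsilon : \hat\theta_n \in [H_n^{-1}(\varepsilon), H_n^{-1}(1 - \varepsilon)]\}$. The event $\varepsilon_n > \varepsilon^*$ is that $\hat\theta_n$ will not be in the tails having probability less than $\varepsilon^*$. We have the following theorem.

Theorem 3.3. Assume condition (3.1) holds. Suppose there exists a fixed $\varepsilon^* > 0$, such that $P(\varepsilon_n \ge \varepsilon^*) \to 1$. Then, $\hat\theta_n \to \theta_0$ in probability.

Proof. Note that $\hat\theta_n \in [H_n^{-1}(\varepsilon^*), H_n^{-1}(1 - \varepsilon^*)]^c$ implies $\varepsilon_n \le \varepsilon^*$. The claim follows immediately using (3.1) and Theorem 3.1.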
3.3. Hypothesis testing

Now, let us turn to one-sample hypothesis testing, given that a CD $H_n(\cdot)$ is available on a parameter of interest $\theta$. Suppose the null hypothesis is $K_0: \theta \in C$ versus the alternative $K_1: \theta \in C^c$. A natural line of thinking would be to measure the support that $H_n(\cdot)$ lends to $C$. If the support is "high," the verdict based on $H_n$ should be for $C$ and if it is low, it should be for $C^c$. The following two definitions of support for $K_0$ from a CD are suggested by classical p-values. These two notions of support highlight the distinction between the two kinds of p-values used in statistical practice, one for the one-sided hypotheses and the other for the point hypotheses.

I. Strong-support $p_s(C)$, defined as $p_s(C) = H_n(C)$, which is the probability content of $C$ under $H_n$.

II. Weak-support $p_w(C)$, defined as $p_w(C) = \sup_{\theta \in C} 2 \min\big(H_n(\theta), 1 - H_n(\theta)\big)$.

See, e.g., Cox and Hinkley [6], Barndorff-Nielsen and Cox [1], and especially Fraser [11], for discussions on this topic related to p-value functions. Our results are closely related to those in Fraser [11] but they are developed under a more general setting. We use the following claim for making the connection between the concepts of support and the p-values.

Claim A. If $K_0$ is of the type $(-\infty, \theta_0]$ or $[\theta_0, \infty)$, the classical p-value typically agrees with the strong-support $p_s(C)$. If $K_0$ is a singleton, i.e. $K_0$ is $\theta = \theta_0$, then the p-value typically agrees with the weak-support $p_w(C)$.

To illustrate the above claim, consider tests based on the normalized statistic $T_n = (\hat\theta - \theta)/SE(\hat\theta)$, for an arbitrary estimator $\hat\theta$. Based on the method given in Section 1.1, a CD for $\theta$ is $H_n(x) = P_{\eta_n}\big(\hat\theta - SE(\hat\theta)\eta_n \le x\big)$, where $\eta_n$ is independent of the data $X_n$ and $\eta_n \stackrel{L}{=} T_n$. Thus,
\[
p_s\big((-\infty, \theta_0]\big) = H_n(\theta_0) = P_{\eta_n}\left(\eta_n \ge \frac{\hat\theta - \theta_0}{SE(\hat\theta)}\right).
\]
This agrees with the p-value for the one-sided test $K_0: \theta \le \theta_0$ versus $K_1: \theta > \theta_0$. Similar demonstrations can be given for the tests based on studentized statistics.

If the null hypothesis is $K_0: \theta = \theta_0$ vs $K_1: \theta \ne \theta_0$, the standard p-value, based on $T_n$, is twice the tail probability beyond $(\hat\theta - \theta_0)/SE(\hat\theta)$ under the distribution of $T_n$. This equals $2 \min\big\{H_n(\theta_0), 1 - H_n(\theta_0)\big\} = p_w(\theta_0)$.

Remark 3.1. It should be remarked here that $p_w(\theta_0)$ is Tukey's depth (see Tukey [27]) of the point $\theta_0$ w.r.t. $H_n(\cdot)$.

The following inequality justifies the names of the two supports.

Theorem 3.4. For any set $C$, $p_s(C) \le p_w(C)$.

Proof. Suppose the sup in the definition of $p_w(C)$ is attained at $\theta'$, which may or may not be in $C$, and $\theta' \le M_n$ (recall $M_n$ = the median of $H_n$). Let $\theta''$ be the point to the right of $M_n$ such that $1 - H_n(\theta'') = H_n(\theta')$. Then $C \subseteq (-\infty, \theta'] \cup [\theta'', \infty)$, ignoring a possible null set under $H_n$. As a consequence,
\[
p_s(C) = H_n(C) \le H_n\big((-\infty, \theta']\big) + H_n\big([\theta'', \infty)\big) = p_w(C).
\]
Similar arguments are given when $\theta' \ge M_n$.

The next three theorems justify Claim A.
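The two supports are straightforward to compute once a CD is available. The sketch below assumes, for illustration only, that $\eta_n$ is standard normal, so that $H_n$ is the $N(\hat\theta, SE^2)$ c.d.f.; the function names and the numerical values are hypothetical. It shows the strong support of $(-\infty, \theta_0]$ matching the one-sided Z-test p-value and the weak support of $\{\theta_0\}$ matching the two-sided one.

```python
from statistics import NormalDist


def strong_support_left(cd, theta0):
    """p_s(C) = H_n(C) for the half-line C = (-inf, theta0]."""
    return cd(theta0)


def weak_support_point(cd, theta0):
    """p_w({theta0}) = 2 min{H_n(theta0), 1 - H_n(theta0)}."""
    h = cd(theta0)
    return 2.0 * min(h, 1.0 - h)


if __name__ == "__main__":
    theta_hat, se, theta0 = 1.2, 0.5, 0.0        # illustrative values
    cd = NormalDist(theta_hat, se).cdf            # normal CD: an assumption of this sketch
    z = (theta_hat - theta0) / se
    one_sided = 1.0 - NormalDist().cdf(z)         # classical p-value for K0: theta <= theta0
    print("p_s:", strong_support_left(cd, theta0), " one-sided p-value:", one_sided)
    print("p_w:", weak_support_point(cd, theta0), " two-sided p-value:", 2 * one_sided)
```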
Theorem 3.5. Let $C$ be of the type $(-\infty, \theta_0]$ or $[\theta_0, \infty)$. Then $\sup_{\theta \in C} P_\theta\big(p_s(C) \le \alpha\big) = \alpha$.

Proof. Let $C = (-\infty, \theta_0]$. For a $\theta \le \theta_0$,
\[
P_\theta\big(H_n((-\infty, \theta_0]) \le \alpha\big) \le P_\theta\big(H_n((-\infty, \theta]) \le \alpha\big) = \alpha.
\]
When $\theta = \theta_0$, one has the equality. A similar proof is given when $C = [\theta_0, \infty)$.

A limiting result of the same nature holds for a more general null hypothesis, namely a union of finitely many disjoint closed intervals (bounded or unbounded). Assume the following regularity condition: as $n \to \infty$,
\[
\sup_{\theta \in [a, b]} P_\theta\big(\max\{H_n(\theta - \varepsilon), 1 - H_n(\theta + \varepsilon)\} > \delta\big) \to 0 \tag{3.2}
\]
for any finite $a$, $b$ and positive $\varepsilon$, $\delta$. Essentially, the condition (3.2) assumes that the scale of $H_n(\cdot)$ shrinks to 0, uniformly in $\theta$ lying in a compact set.

Theorem 3.6. Let $C = \bigcup_{j=1}^{k} I_j$ where the $I_j$ are disjoint intervals of the type $(-\infty, a]$ or $[c, d]$ or $[b, \infty)$. If the regularity condition (3.2) holds, then $\sup_{\theta \in C} P_\theta\big(p_s(C) \le \alpha\big) \to \alpha$, as $n \to \infty$.
Proof. It suffices to prove the claim with the sup over $\theta \in I_j$ for each $j = 1, \ldots, k$. Consider first the case when $I_j = (-\infty, a]$. For this $I_j$ and any $\delta > 0$,
\[
\sup_{\theta \in I_j} P_\theta\big(p_s(C) \le t\big) \ge P_{\theta = a}\big(p_s(C) \le t\big) \ge P_{\theta = a}\big(p_s(I_j) \le t - \delta\big) + o(1) = t - \delta + o(1).
\]
The second inequality is due to (3.2). Also, from Theorem 3.5,
\[
\sup_{\theta \in I_j} P_\theta\big(p_s(C) \le t\big) \le \sup_{\theta \in I_j} P_\theta\big(p_s(I_j) \le t\big) = t,
\]
which completes the proof for this $I_j$. The case of $I_j = [b, \infty)$ is handled similarly.

Turning to the case $I_j = [c, d]$, $c < d$, we write it as the union of $I_{j1} = \big[c, \frac{c+d}{2}\big]$ and $I_{j2} = \big[\frac{c+d}{2}, d\big]$, and note that, for any $\delta > 0$, it follows from (3.1) that
\[
\sup_{\theta \in I_{j1}} P_\theta\big(p_s(C) \le t\big) \ge P_{\theta = c}\big(p_s(C) \le t\big) \ge P_{\theta = c}\big(p_s([c, \infty)) \le t - \delta\big) + o(1) = t - \delta + o(1).
\]
Furthermore, from Theorem 3.5, we have for any $\delta > 0$,
\[
\sup_{\theta \in I_{j1}} P_\theta\big(p_s(C) \le t\big) \le \sup_{\theta \in I_{j1}} P_\theta\big(p_s(I_j) \le t\big) \le \sup_{\theta \in I_{j1}} P_\theta\big(p_s([c, \infty)) \le t + \delta\big) + o(1) = t + \delta + o(1).
\]
The case of the sup over $\theta \in I_{j2}$ is dealt with in a similar way. In the arguments, $\theta = c$ is replaced by $\theta = d$.

Remark 3.2. The result of Theorem 3.6 still holds if $p_s(C)$ is replaced by $p_s^* = \max_{1 \le j \le k} p_s(I_j)$. The use of $p_s^*$ as a p-value amounts to the so-called Intersection Union Test (see Berger [5]). Since $p_s^*$ as a p-value gives a larger rejection region than $p_s\big(\bigcup_{j=1}^{k} I_j\big)$ as a p-value, testing by $p_s^*$ will have better power, for the same asymptotic size. If the intervals $I_j$ are unbounded, it follows that
\[
\sup_{\theta \in C} P_\theta\big(p_s(C) \le t\big) \le \sup_{\theta \in C} P_\theta\big(p_s^* \le t\big) \le t.
\]
Moving on to the situation when $K_0$ is $\theta = \theta_0$ and $K_1$ is $\theta \ne \theta_0$, it is immediate that $p_w(\theta_0) = 2 \min\{H_n(\theta_0), 1 - H_n(\theta_0)\}$ has the $U[0, 1]$ distribution, since $H_n(\theta_0)$ does so. Thus $p_w(\theta_0)$ can be used like a p-value in the case of such a testing problem. In a more general situation when $K_0$ is $C = \{\theta_1, \theta_2, \ldots, \theta_k\}$ and $K_1$ is $C^c$, one has the following asymptotic result.

Theorem 3.7. Let $K_0$ be $C = \{\theta_1, \ldots, \theta_k\}$ and $K_1$ be $C^c$. Assume that $H_n(\theta_i - \varepsilon)$ and $1 - H_n(\theta_i + \varepsilon) \xrightarrow{p} 0$ under $\theta = \theta_i$, for all $i = 1, 2, \ldots, k$. Then $\max_{\theta \in C} P_\theta\big(p_w(C) \le \alpha\big) \to \alpha$, as $n \to \infty$.

Proof. For simplicity, let $C = \{\theta_1, \theta_2\}$, $\theta_1 < \theta_2$. Under the condition, clearly $p_w(\theta_2) \le 2\{1 - H_n(\theta_2)\} \xrightarrow{p} 0$, if $\theta = \theta_1$. Since $p_w(\theta_1)$ has the $U[0, 1]$ distribution under $\theta = \theta_1$, it follows, using standard arguments, that $\max\{p_w(\theta_1), p_w(\theta_2)\} \xrightarrow{L} U[0, 1]$ when $\theta = \theta_1$. The same holds under $\theta = \theta_2$. The result thus follows.
Example 3.1. (Bio-equivalence). An important example of the case where $C$, the null space, is a union of closed intervals is provided by the standard bioequivalence problem. In this testing problem $K_0$ is the region $\frac{\mu_1}{\mu_2} \in (-\infty, .8] \cup [1.25, \infty)$, where $\mu_1$, $\mu_2$ are the population means of bioavailability measures of two drugs being tested for equivalence.

Example 3.2. In the standard classification problem, the parameter space is divided into $k$ regions. The task is to decide which one contains the true value of the parameter $\theta$. A natural (but probably over-simplified) suggestion is to compare the CD contents of the $k$ regions and attach $\theta$ to the one which has the maximum CD probability.

4. Profile likelihood functions and CDs

We examine here the connection between the concepts of profile likelihood function and asymptotic CD. Let $x_1, x_2, \ldots, x_n$ be independent sample draws from a parametric distribution with density $f_{\boldsymbol\eta}(x)$, where $\boldsymbol\eta$ is a $p \times 1$ vector of unknown parameters. Suppose we are interested in a scalar parameter $\theta = s(\boldsymbol\eta)$, where $s(\cdot)$ is a second-order differentiable mapping from $\mathbb{R}^p$ to $\Theta \subset \mathbb{R}$. To make an inference about $\theta$, one often obtains the log-profile likelihood function
\[
\ell_n(\theta) = \sum_{i=1}^{n} \log f_{\hat{\boldsymbol\eta}(\theta)}(x_i), \quad \text{where } \hat{\boldsymbol\eta}(\theta) = \arg\max_{\{\boldsymbol\eta :\, s(\boldsymbol\eta) = \theta\}} \sum_{i=1}^{n} \log f_{\boldsymbol\eta}(x_i).
\]
Denote $\ell_n^*(\theta) = \ell_n(\theta) - \ell_n(\hat\theta)$, where $\hat\theta = \arg\max_\theta \ell_n(\theta)$ is the maximum likelihood estimator of the unknown parameter $\theta$. We prove below that $e^{\ell_n^*(\theta)}$, after normalization (with respect to $\theta$, so that the area under its curve is one), is the density function of an aCD for $\theta$. The technique used to prove the main result is similar to that used in the proofs of Theorems 2 and 3 in Efron [7].

Let $i_n^{-1} = -\frac1n \ell_n''(\hat\theta)$ and $i_0^{-1}(\theta) = -\lim_{n \to +\infty} \frac1n \ell_n''(\theta)$. Assume that the true value $\theta = \theta_0$ is in $\Theta^o$, the interior of $\Theta$, and $i_0^{-1}(\theta_0) > 0$. The key assumption is that
\[
\sqrt{n}\,(\hat\theta - \theta_0)/\sqrt{i_n} \to N(0, 1).
\]
In addition to the regularity conditions that ensure the asymptotic normality of $\hat\theta$, we make the following three mild assumptions. They are satisfied in the cases of commonly used distributions.
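(i) There exists a function $k(\theta)$, such that $\ell_n^*(\theta) \le -n k(\theta)$ for all large $n$, a.s., and $\int_\Theta e^{-c k(\theta)}\, d\theta < +\infty$, for a constant $c > 0$.

(ii) There exists an $\varepsilon > 0$, such that
\[
c_\varepsilon = \inf_{|\theta - \theta_0| < \varepsilon} i_0^{-1}(\theta) > 0 \quad \text{and} \quad k_\varepsilon = \inf_{|\theta - \theta_0| > \varepsilon} k(\theta) > 0.
\]

(iii) $\ell_n''(\theta)$ satisfies a Lipschitz condition of order 1 around $\theta_0$.

For $y \in \Theta$, write
\[
H_n(y) = \frac{1}{\sqrt{2\pi i_n/n}} \int_{(-\infty, y] \cap \Theta} e^{-\frac{(\theta - \hat\theta)^2}{2 i_n/n}}\, d\theta \quad \text{and} \quad G_n(y) = \frac{1}{c_n} \int_{(-\infty, y] \cap \Theta} e^{\ell_n^*(\theta)}\, d\theta,
\]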
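where $c_n = \int_{-\infty}^{+\infty} e^{\ell_n^*(\theta)}\, d\theta$. We assume that $c_n < \infty$ for all $n$ and $X_n$; condition (i) implies $c_n < \infty$ for $n$ large. We prove the following theorem.

Theorem 4.1. $G_n(\theta) = H_n(\theta) + o_p(1)$ for each $\theta \in \Theta$.

Proof. We prove the case with $\Theta = (-\infty, +\infty)$; the other cases can be proved similarly. Let $\varepsilon > 0$ be as in condition (ii). We first prove, for any fixed $s > 0$,
\[
\int_{-\infty}^{\theta_0 - \varepsilon} e^{\ell_n^*(\theta)}\, d\theta = O_p\Big(\frac{1}{n^s}\Big) \quad \text{and} \quad \int_{\theta_0 + \varepsilon}^{+\infty} e^{\ell_n^*(\theta)}\, d\theta = O_p\Big(\frac{1}{n^s}\Big). \tag{4.1}
\]
Note, by conditions (i) and (ii), when $n$ is large enough, we have $\ell_n^*(\theta) + s \log n < -n k(\theta) + n k_\varepsilon/2 = -n\big(k(\theta) - k_\varepsilon/2\big) \le -n k_\varepsilon/2$, for $|\theta - \theta_0| > \varepsilon$. By Fatou's Lemma,
\[
\limsup_{n \to +\infty} \int_{-\infty}^{\theta_0 - \varepsilon} e^{\ell_n^*(\theta) + s \log n}\, d\theta
\le \int_{-\infty}^{\theta_0 - \varepsilon} \limsup_{n \to +\infty} e^{\ell_n^*(\theta) + s \log n}\, d\theta
\le \int_{-\infty}^{\theta_0 - \varepsilon} \lim_{n \to \infty} e^{-n k_\varepsilon/2}\, d\theta = 0.
\]
This proves the first equation in (4.1). The same is true for the second equation in (4.1).

In the case when $|\theta - \theta_0| < \varepsilon$, by Taylor expansion,
\[
\ell_n^*(\theta) = \frac12 \ell_n''(\tilde\theta)(\theta - \hat\theta)^2, \quad \text{for } \tilde\theta \text{ between } \theta \text{ and } \hat\theta. \tag{4.2}
\]
From condition (ii) we have $\ell_n^*(\theta) \le -\frac{n}{2} c_\varepsilon (\theta - \hat\theta)^2$, when $n$ is large. Thus, one can prove
\[
\sqrt{n} \int_{\theta_0 - \varepsilon}^{\theta_0 - \frac{\varepsilon}{\sqrt{n}} \log n} e^{\ell_n^*(\theta)}\, d\theta = o_p(1) \quad \text{and} \quad \sqrt{n} \int_{\theta_0 + \frac{\varepsilon}{\sqrt{n}} \log n}^{\theta_0 + \varepsilon} e^{\ell_n^*(\theta)}\, d\theta = o_p(1). \tag{4.3}
\]
Now, consider the case when $|\theta - \theta_0| < \frac{\varepsilon}{\sqrt{n}} \log n$. By (4.2) and condition (iii), one has
\[
\sqrt{n} \int_{\theta_0 - \frac{\varepsilon}{\sqrt{n}} \log n}^{\theta} e^{\ell_n^*(\theta)}\, d\theta = \sqrt{n} \int_{\theta_0 - \frac{\varepsilon}{\sqrt{n}} \log n}^{\theta} e^{-\frac{n(\theta - \hat\theta)^2}{2 i_n}}\, d\theta + o_p(1). \tag{4.4}
\]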
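From (4.1), (4.3) and (4.4), it easily follows that
\[
\frac{1}{\sqrt{2\pi i_n/n}} \int_{-\infty}^{\theta} e^{\ell_n^*(\theta)}\, d\theta = \frac{1}{\sqrt{2\pi i_n/n}} \int_{-\infty}^{\theta} e^{-\frac{n(\theta - \hat\theta)^2}{2 i_n}}\, d\theta + o_p(1) = H_n(\theta) + o_p(1), \tag{4.5}
\]
for all $\theta \in (-\infty, \infty)$. Note that (4.5) implies that $c_n = \sqrt{2\pi i_n/n} + o_p\big(\frac{1}{\sqrt{n}}\big)$. So (4.5) is, in fact, $G_n(\theta) = H_n(\theta) + o_p(1)$ for all $\theta \in (-\infty, \infty)$. This proves the theorem.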
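Theorem 4.1 can be illustrated numerically in a one-parameter model, where the profile likelihood is just the likelihood. The sketch below is an illustration under assumptions that are not in the text: the data are exponential with rate $\theta$, so that $\ell_n^*(\theta) = n \log(\theta/\hat\theta) - S(\theta - \hat\theta)$ with $S = \sum x_i$, $\hat\theta = n/S$ and $i_n/n = \hat\theta^2/n$. The normalized $e^{\ell_n^*}$ is integrated by the trapezoid rule on a grid truncated at $\pm 10$ standard deviations, which is a numerical shortcut, and the result $G_n$ is compared with the normal c.d.f. $H_n$.

```python
import math
import random
from statistics import NormalDist


def profile_cd_vs_normal(xs, grid_size=4001):
    """Return (grid, G_n on grid, H_n on grid) for an exponential-rate model."""
    n, s = len(xs), sum(xs)
    theta_hat = n / s                      # MLE of the rate
    sd = theta_hat / math.sqrt(n)          # sqrt(i_n / n), since -l''(theta_hat) = n / theta_hat^2
    lo = max(theta_hat - 10 * sd, 1e-9)    # truncated grid (numerical shortcut)
    hi = theta_hat + 10 * sd
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    # exp(l*_n(theta)), with l*_n computed on the log scale to avoid overflow
    dens = [math.exp(n * math.log(t / theta_hat) - s * (t - theta_hat)) for t in grid]
    # cumulative trapezoid rule, then normalization by c_n
    cum = [0.0]
    for i in range(1, grid_size):
        cum.append(cum[-1] + 0.5 * (dens[i - 1] + dens[i]) * step)
    c_n = cum[-1]
    g = [c / c_n for c in cum]
    normal = NormalDist(theta_hat, sd)
    h = [normal.cdf(t) for t in grid]
    return grid, g, h


if __name__ == "__main__":
    rng = random.Random(4)
    data = [rng.expovariate(2.0) for _ in range(400)]   # simulated data, true rate 2.0
    _, g, h = profile_cd_vs_normal(data)
    print("max |G_n - H_n| on the grid:", max(abs(a - b) for a, b in zip(g, h)))
```

For moderate $n$ the two curves differ mainly through the skewness of the likelihood, which the normal $H_n$ cannot capture; the gap shrinks as $n$ grows, in line with the $o_p(1)$ statement.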
Remark 4.1. At $\theta = \theta_0$, $H_n(\theta_0) = \Phi\Big(\frac{\theta_0 - \hat\theta}{\sqrt{i_n/n}}\Big) + o_p(1) \xrightarrow{W} U(0, 1)$. It follows from this theorem that $G_n(\theta_0) \xrightarrow{W} U(0, 1)$, thus $G_n$ is an aCD.

Remark 4.2. It is well known that at the true value $\theta_0$, $-2\ell_n^*(\theta_0) = -2\{\ell_n(\theta_0) - \ell_n(\hat\theta)\}$ is asymptotically equivalent to $n(\theta_0 - \hat\theta)^2/i_n$; see, e.g., Murphy and van der Vaart [20]. In the proof of the above theorem, we need to extend this result to a shrinking neighborhood of $\theta_0$, and control the "tails" in our normalization of $e^{\ell_n^*(\theta)}$ (with respect to $\theta$, so that the area underneath the curve is 1). This normalization produces a proper distribution function, and Theorem 4.1 is an asymptotic result for this distribution function.

As a special case, the likelihood function in the family of one-parameter distributions (i.e., $\eta$ is a scalar parameter) is proportional to an aCD density function. There is also a connection between the concepts of aCD and other types of likelihood functions, such as Efron's implied likelihood function, Schweder and Hjort's reduced likelihood function, etc. In fact, one can easily conclude from Theorems 1 and 2 of Efron [7] that in an exponential family, both the profile likelihood and the implied likelihood (Efron [7]) are aCD densities, after a normalization (with respect to $\theta$). Schweder and Hjort [22] proposed the reduced likelihood function, which itself is proportional to a CD density for a specially transformed parameter. See Welch and Peers [28] and Fisher [10] for earlier accounts of likelihood function based CDs in the case of single parameter families.

5. Multiparameter joint CD

Let us first note that in higher dimensions, the cdf is not as useful a notion, at least for our purposes here. The main reasons are: (a) The region $F(x) \le \alpha$ is not of much interest in $\mathbb{R}^k$. (b) The property $F(X) \stackrel{L}{=} U[0, 1]$, when $X \stackrel{L}{=} F$, is lost!

5.1. Development of multiparameter CD through Cramér-Wold device

The following definition of a multiparameter CD has the make of a random vector having a particular multivariate distribution (arguing via characteristic functions).

Definition 5.1. A distribution $H_n(\cdot) \equiv H_n(X_n, \cdot)$ on $\mathbb{R}^k$ is a CD in the linear sense (l-CD) for a $k \times 1$ parameter vector $\boldsymbol\theta$ if and only if for any $k \times 1$ vector $\boldsymbol\lambda$, the conditional distribution of $\boldsymbol\lambda' \boldsymbol\xi_n$ given $X_n$ is a CD for $\boldsymbol\lambda' \boldsymbol\theta$, where the $k \times 1$ random vector $\boldsymbol\xi_n$ has the distribution $H_n(\cdot)$ given $X_n$.

Using the definition of asymptotic CD on the real line, one has a natural extension of the above definition to the asymptotic version. With this definition, for example,
the raw bootstrap distribution in $\mathbb{R}^k$ remains an asymptotic CD, under asymptotic symmetry.

An l-CD $H_{1,n}$ is more precise than $H_{2,n}$ (both for the same $\boldsymbol\theta$ in $\mathbb{R}^k$), if the CD for $\boldsymbol\lambda' \boldsymbol\theta$ given by $H_{1,n}$ is more precise than that given by $H_{2,n}$, for all $k \times 1$ vectors $\boldsymbol\lambda$. For the normal mean vector $\boldsymbol\mu$, with known dispersion $\Sigma$, $1 - \Phi\big(\sqrt{n}\, \Sigma^{-\frac12}(\bar X_n - \boldsymbol\mu)\big)$ is the most precise CD for $\boldsymbol\mu$.

Let $A_n(\hat{\boldsymbol\theta} - \boldsymbol\theta)$ have a completely known absolutely continuous distribution $G(\cdot)$, where $A_n$ is a non-singular, non-random matrix. Then $H_n(\cdot)$ defined by $H_n(\boldsymbol\theta) = 1 - G\big(A_n(\hat{\boldsymbol\theta} - \boldsymbol\theta)\big)$ is an l-CD for $\boldsymbol\theta$. If $G$ is the limiting distribution, then the above $H_n$ is an asymptotic CD, in which case $A_n$ can be data-dependent. A useful property of this extension to $\mathbb{R}^k$ is the fact that the CD content of a region behaves like p-values (in limit), as it does in the real line case. See Theorem 4.2 of Liu and Singh [18] for the special case of the bootstrap distribution in this context.

It is evident from the definition that one can obtain a CD for any linear function $A'\boldsymbol\theta$ from that of $\boldsymbol\theta$, by obtaining the distribution of $A'\boldsymbol\xi_n$. The definition also entails that for a vector of linear functions, the derived joint distribution (from that of $\boldsymbol\xi_n$) is an l-CD. For a non-linear function though, one can in general get an asymptotic CD only. Let $H_n(\cdot)$ be an asymptotic l-CD for $\boldsymbol\theta$. Suppose the random vector $\boldsymbol\xi_n$ follows the distribution $H_n$. Consider a possibly non-linear function $g(\boldsymbol\theta): \mathbb{R}^k \to \mathbb{R}^\ell$, $\ell \le k$. Let each coordinate of $g(\boldsymbol\theta)$ have continuous partial derivatives in a neighborhood of $\boldsymbol\theta_0$. Furthermore, suppose the vector of partial derivatives at $\boldsymbol\theta = \boldsymbol\theta_0$ is non-zero, for each coordinate of $g(\boldsymbol\theta)$.

Theorem 5.1. Under the setup assumed above, the distribution of $g(\boldsymbol\xi_n)$ is an asymptotic l-CD of $g(\boldsymbol\theta)$, at $\boldsymbol\theta = \boldsymbol\theta_0$, provided the $H_n$ probability of $\{\|\boldsymbol\theta - \boldsymbol\theta_0\| > \varepsilon\} \xrightarrow{P} 0$, for any $\varepsilon > 0$.

Proof.
The result follows from the following Taylor expansion over the set $\|\theta - \theta_0\| \le \epsilon$: $g(\xi_n) = g(\theta_0) + \Delta(\xi_n)(\xi_n - \theta_0)$, where $\Delta(\xi_n)$ is the matrix of partial derivatives of $g(\cdot)$ at a $\theta$ lying within the $\epsilon$-neighborhood of $\theta_0$.

Remark 5.1. Given a joint l-CD for $\theta$, the proposition prescribes an asymptotic method for finding a joint l-CD for a vector of functions of $\theta$, or for just a single function of $\theta$. This method can inherit any skewness (if it exists) in the l-CD of $g(\theta)$. Such skewness would be missed if direct asymptotics were done on $g(\hat\theta) - g(\theta_0)$.

On the topic of combining joint l-CDs, one natural approach is to use the univariate CDs of linear combinations, where the combination is carried out by the methods discussed in Singh, Xie and Strawderman [26]. The problem of finding a combined joint l-CD comes down to finding a joint distribution agreeing with the univariate distributions of linear combinations, if it exists. The existence problem (when it is not obvious) could perhaps be tackled via characteristic functions and Bochner's Theorem. It may also be noted that an asymptotic combined l-CD of a nonlinear function of the parameters can be constructed via Theorem 5.1 and the methods of Singh, Xie and Strawderman [26].

5.2. Confidence distribution and data depth

Another requirement on a probability distribution $H_n$ on $\mathbb{R}^k$ (based on a data set $X_n$) to be a confidence distribution for $\theta$ (a $k$-column vector) should naturally be: the 100t% "central regions" of $H_n$ are confidence regions for $\theta$, closed and bounded, having the coverage level 100t%. We define such an $H_n$ to be a c-CD, where c stands
for circular or central. We note in Remark 5.2 that the notions of l-CD and c-CD match in a special setting.

Definition 5.2. A function $H_n(\cdot) = H_n(\cdot, X_n)$ on $\Theta \subset \mathbb{R}^k$ is called a Confidence Distribution in the circular sense (c-CD) for a $k \times 1$ multiparameter $\theta$, if (i) it is a probability distribution function on the parameter space $\Theta$ for each fixed sample set $X_n$, and (ii) the 100t% "central region" of $H_n(\cdot)$ is a confidence region for $\theta$, having the coverage level 100t% for each $t$.

By central regions of a distribution $G$, statisticians usually mean the elliptical regions of the type $y'\Sigma_G^{-1}y \le a$, with $\Sigma_G$ the dispersion matrix of $G$. This notion of central regions turns out to be a special case of central regions derived from the notion of data-depth. See Liu, Parelius and Singh [16], among others, for various concepts of data-depth and more. The elliptical regions arise out of the so-called Mahalanobis depth (or distance). The phrase data-depth was coined by J. Tukey. See Tukey [27] for the original notion of Tukey's depth or half-space depth. In recent years, the first author, together with R. Liu, has been involved in developing data-depth, especially its application. For the reader's convenience, we provide here the definition of Tukey's depth: $TD_G(x) = \min P_G(H)$, where the minimum is over all half-spaces $H$ containing $x$. On the real line, $TD_G(x) = \min\{G(x), 1 - G(x)\}$. A notion of data-depth is called affine-invariant if $D(X, x) = D(AX + b, Ax + b)$, where $D(X, x)$ is the depth of the point $x$ w.r.t. the distribution of a random vector $X$. Here $AX + b$ is a linear transform of $X$. The above mentioned depths are affine-invariant. For an elliptical population, the depth contours agree with the density contours (see Liu and Singh [17]).

For a given depth $D$ on a distribution $G$ on $\mathbb{R}^k$, let us define the centrality function $C(x) = C(G, D, x) = P_G\{y : D_G(y) \le D_G(x)\}$.
Thus, the requirement (ii) stated earlier on $H_n$ in Definition 5.2 can be restated as: $\{x : C(H_n, D, x) \ge \alpha\}$ is a $100(1 - \alpha)\%$ confidence region for $\theta$, for all $0 \le \alpha \le 1$. This is equivalent to

Requirement (ii)': $C(\theta_0) = C(H_n, D, \theta_0)$, as a function of the sample set $X_n$, has the $U[0, 1]$ distribution.

For a c-CD $H_n$, let us call the function $C(\cdot) = C(H_n, D, \cdot)$ a confidence centrality function (CCF). Here $D$ stands for the associated depth. Going back to the real line, if $H_n$ is a CD, one important CCF associated with $H_n$ is $C_n(x) = 2\min\{H_n(x), 1 - H_n(x)\}$. This CCF gives rise to the two-sided equal tail confidence intervals. The depth involved is Tukey's depth on the real line.

Next, we present a general class of sample dependent multivariate distributions which meet requirement (ii)'. Let $A_n(\hat\theta - \theta)$ have a cdf $G_n(\cdot)$ independent of parameters. The nonsingular matrix $A_n$ could involve the data $X_n$. Typically $A_n$ is a square root of the inverse dispersion matrix of $\hat\theta$ (or an estimated dispersion matrix). Let $\eta_n$ be an (external) random vector, independent of $X_n$, which has the cdf $G_n$. Let $H_n$ be the cdf of $\hat\theta - A_n^{-1}\eta_n$ for a given vector $X_n$.

Theorem 5.2. Let $D$ be an affine-invariant data-depth such that the boundary sets $\{x : C(G_n, D, x) = t\}$ have zero probability under $G_n$ (recall the centrality function $C(\cdot)$). Then $H_n(\cdot)$ defined above meets Requirement (ii)' and it is a c-CD.
Proof. For any $t \in (0, 1)$, in view of the affine-invariance of $D$, the set

$\{X_n : C(H_n, D, \theta_0) \le t\} = \{X_n : C(G_n, D, A_n(\hat\theta - \theta_0)) \le t\} = \{X_n$ corresponding to $A_n(\hat\theta - \theta_0)$ lying in the outermost 100t% of the population $G_n\}$.

Its probability content is $t$, since $A_n(\hat\theta - \theta)$ is distributed as $G_n$.
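As a numerical illustration of Theorem 5.2 (not part of the original development), the following sketch assumes a simple special case: i.i.d. $N_k(\theta_0, I)$ data, $\hat\theta = \bar{X}_n$, $A_n = \sqrt{n}\,I$, $G_n = N(0, I)$, and the Mahalanobis depth $D(x) = 1/(1 + n\|x - \bar{X}_n\|^2)$. Under these assumptions $H_n$ is $N(\bar{X}_n, I/n)$. The function names, sample sizes and seed are illustrative choices. If Requirement (ii)' holds, the simulated values of $C(H_n, D, \theta_0)$ should look approximately $U[0, 1]$.

```python
import numpy as np

def centrality_at(theta0, xbar, n, n_draws, rng):
    """Monte Carlo estimate of C(H_n, D, theta0) for H_n = N(xbar, I/n)
    with Mahalanobis depth D(x) = 1 / (1 + n * ||x - xbar||^2)."""
    k = xbar.shape[0]
    # Draws xi = theta_hat - A_n^{-1} eta, with eta ~ N(0, I) (symmetric, so + is equivalent).
    xi = xbar + rng.standard_normal((n_draws, k)) / np.sqrt(n)
    dist_draws = n * np.sum((xi - xbar) ** 2, axis=1)
    dist_theta0 = n * np.sum((theta0 - xbar) ** 2)
    # Depth decreases with distance, so {depth(y) <= depth(theta0)}
    # is the same event as {dist(y) >= dist(theta0)}.
    return np.mean(dist_draws >= dist_theta0)

def simulate_centralities(n=30, k=2, reps=500, n_draws=2000, seed=0):
    """Repeat the experiment under theta = theta0 and collect the centralities."""
    rng = np.random.default_rng(seed)
    theta0 = np.zeros(k)
    out = np.empty(reps)
    for r in range(reps):
        x = theta0 + rng.standard_normal((n, k))  # sample with identity dispersion
        out[r] = centrality_at(theta0, x.mean(axis=0), n, n_draws, rng)
    return out

centralities = simulate_centralities()
alpha = 0.1
# Fraction of data sets whose region {x : C(H_n, D, x) >= alpha} contains theta0.
coverage = np.mean(centralities >= alpha)
print(f"mean centrality = {centralities.mean():.3f}, coverage = {coverage:.3f}")
```

With these settings the mean centrality should be close to 0.5 and the empirical coverage close to 0.9, up to Monte Carlo error.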
Remark 5.2. The discussion preceding Theorem 5.1 and the result in Theorem 5.2 imply that this $H_n$ is both an l-CD and a c-CD of $\theta$ when $A_n$ is independent of the data. The l-CD and c-CD coincide in this special case! When $A_n$ is data-dependent, one has an exact c-CD, but only an asymptotic l-CD.

Given two joint c-CDs $H_{1n}$ and $H_{2n}$, based on the same data set, their precision could be compared using stochastic comparison between their CCFs involving the same data-depth. More precisely, let $C_1$, $C_2$ be the CCFs of two joint CDs $H_{1n}$ and $H_{2n}$, induced by the same notion of data-depth, i.e., $C_i(x) = C(H_{in}, D, x)$ = {fraction of the $H_{in}$ population having $D$-depth less than or equal to that of $x$}. One would define: $H_{1n}$ is more precise than $H_{2n}$ when $\theta = \theta_0$ prevails, if $C_1(x) \stackrel{\mathrm{sto}}{\le} C_2(x)$, under $\theta = \theta_0$, for all $x \ne \theta_0$.

References
[1] Barndorff-Nielsen, O. E. and Cox, D. R. (1994). Inference and Asymptotics. CRC Press.
[2] Babu, G. J. and Singh, K. (1983). Inference on means using the bootstrap. Ann. Statist. 11 999–1003.
[3] Babu, G. J. and Singh, K. (1984). On one term Edgeworth correction by Efron's bootstrap. Sankhya Ser. A Indian J. Statist. 46 219–232.
[4] Beran, R. (1987). Prepivoting to reduce level error of confidence sets. Biometrika 74 457–468.
[5] Berger, R. L. (1982). Multiparameter hypothesis testing and acceptance sampling. Technometrics 24 295–300.
[6] Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. Chapman and Hall, London.
[7] Efron, B. (1993). Bayes and likelihood calculations from confidence intervals. Biometrika 80 3–26.
[8] Efron, B. (1998). R. A. Fisher in the 21st century. Statist. Sci. 13 95–122.
[9] Fisher, R. A. (1930). Inverse probability. Proc. Cambridge Philos. Soc. 26 528–535.
[10] Fisher, R. A. (1973). Statistical Methods and Scientific Inference, 3rd ed. Hafner Press, New York.
[11] Fraser, D. A. S. (1991). Statistical inference: Likelihood to significance. J. Amer. Statist. Assoc. 86 258–265.
[12] Fraser, D. A. S. (1996). Comments on Pivotal inference and the fiducial argument (95V63p 309–323). Internat. Statist. Rev. 64 231–235.
[13] Hall, P. (1992). On the removal of skewness by transformation. J. Roy. Statist. Soc. Ser. B 54 221–228.
[14] Lehmann, E. L. (1986). Testing Statistical Hypotheses, 2nd ed. Wiley, New York.
[15] Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: one theory or two? J. Amer. Statist. Assoc. 88 1242–1249.
[16] Liu, R. Y., Parelius, J. and Singh, K. (1999). Multivariate analysis by data depth: Descriptive statistics, graphics and inference. Ann. Statist. 27 783–858.
[17] Liu, R. Y. and Singh, K. (1993). A quality index based on data depth and multivariate rank tests. J. Amer. Statist. Assoc. 88 252–260.
[18] Liu, R. Y. and Singh, K. (1997). Notions of limiting P values based on data depth and bootstrap. J. Amer. Statist. Assoc. 92 266–277.
[19] Mallows, C. A. and Vardi, Y. (1982). Bounds on the efficiency of estimates based on overlapping data. Biometrika 69 287–296.
[20] Murphy, S. A. and van der Vaart, A. W. (2000). On profile likelihood. J. Amer. Statist. Assoc. 95 449–465.
[21] Neyman, J. (1941). Fiducial argument and the theory of confidence intervals. Biometrika 32 128–150.
[22] Schweder, T. and Hjort, N. L. (2002). Confidence and likelihood. Scand. J. Statist. 29 309–332.
[23] Schweder, T. and Hjort, N. L. (2003). Frequentist analogues of Priors and Posteriors. In Econometrics and the Philosophy of Economics 285–317. Princeton University Press.
[24] Singh, K. (1981). On asymptotic accuracy of Efron's bootstrap. Ann. Statist. 9 1187–1195.
[25] Singh, K., Xie, M. and Strawderman, W. (2001). Confidence distributions - concept, theory and applications. Technical Report, Department of Statistics, Rutgers University. Updated 2004.
[26] Singh, K., Xie, M. and Strawderman, W. (2005). Combining information through confidence distributions. Ann. Statist. 33 159–183.
[27] Tukey, J. W. (1975). Mathematics and the picturing of data. Proceedings of the International Congress of Mathematicians 2 523–531.
[28] Welch, B. L. and Peers, H. W. (1963). On formulae for confidence points based on integrals of weighted likelihoods. J. Roy. Statist. Soc. Ser. B 25 318–329.
IMS Lecture Notes–Monograph Series Complex Datasets and Inverse Problems: Tomography, Networks and Beyond Vol. 54 (2007) 151–160 © Institute of Mathematical Statistics, 2007 DOI: 10.1214/074921707000000111
Empirical Bayes methods for controlling the false discovery rate with dependent data Weihua Tang1 and Cun-Hui Zhang1,∗ Rutgers University Abstract: False discovery rate (FDR) has been widely used as an error measure in large scale multiple testing problems, but most research in the area has been focused on procedures for controlling the FDR based on independent test statistics or the properties of such procedures for test statistics with certain types of stochastic dependence. Based on an approach proposed in Tang and Zhang (2005), we further develop in this paper empirical Bayes methods for controlling the FDR with dependent data. We implement our methodology in a time series model and report the results of a simulation study to demonstrate the advantages of the empirical Bayes approach.
1. Introduction

The false discovery rate (FDR), proposed by Benjamini and Hochberg ([1], BH hereafter), has been widely used as an error measure in multiple testing problems. Let $R$ be the number of rejected hypotheses (discovered items) and $V$ be the number of falsely rejected hypotheses. The FDR is defined as

(1.1) $\mathrm{FDR} \equiv E\big[(V/R)\, I\{R > 0\}\big].$
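The BH procedure is referred to throughout but is not restated in this paper. As a reminder, here is a minimal sketch of the standard step-up rule: with ordered p-values $p_{(1)} \le \cdots \le p_{(m)}$, reject the hypotheses with the $k$ smallest p-values, where $k$ is the largest $i$ such that $p_{(i)} \le i\alpha/m$. The function name and the example p-values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha):
    """Boolean rejection mask from the standard BH step-up rule at level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                         # indices sorting p in increasing order
    thresholds = alpha * np.arange(1, m + 1) / m  # i * alpha / m for i = 1, ..., m
    below = np.nonzero(p[order] <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size > 0:
        k = below[-1] + 1                         # largest i with p_(i) <= i * alpha / m
        reject[order[:k]] = True                  # step-up: reject the k smallest p-values
    return reject

# Hypothetical p-values, for illustration only.
p_example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_example, alpha=0.05)
print(int(rejected.sum()))  # prints 2
```

In this example only the two smallest p-values fall below their thresholds (0.005 and 0.010), so $R = 2$.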
Since most discovered items truly contain signals when the ratio $V/R$ is small, FDR controlling methods allow a scientific inquiry to move ahead from a screening experiment to more focused systematic investigations of discovered items. Moreover, since FDR controlling methods do allow a small number of items without signal to slip through, they often provide sufficiently many discovered items in such screening experiments. Thus, the FDR seems to strike a suitable balance between the needs of multiple error control and sufficient discovery power in many applications (e.g. microarray, imaging, and astrophysics studies), compared with the more conservative family-wise error rate (FWER) $P(V > 0)$ and the more liberal per-comparison error rate (PCER) $EV/m$, where $m$ is the total number of hypotheses being tested. Mathematically, we notice that PCER ≤ FDR ≤ FWER. The FDR is closely related to the positive predictive value (PPV) for diagnostic tests, since

(1.2) $\mathrm{PPV} = 1 - V/R$

based on ground truth.

∗ Research partially supported by National Science Foundation Grants DMS-0405202, DMS-0504387 and DMS-0604571.
1 Department of Statistics, Hill Center, Busch Campus, Rutgers University, Piscataway, New Jersey 08854, USA, e-mail: [email protected]; [email protected]
AMS 2000 subject classifications: 62H15, 62C10, 62C12, 62C25.
Keywords and phrases: multiple comparisons, false discovery rate, conditional false discovery rate, most powerful test, Bayes rule, empirical Bayes, dependent data, time series.

Most papers in the FDR literature have focused on procedures for controlling the FDR based on independent test statistics or properties of such procedures with dependent test statistics. For example, in BH [1], the p-values for individual hypotheses are assumed to be independent and uniformly distributed under the null. Benjamini and Yekutieli [3] proved that the BH [1] procedure still controls the FDR under a "positive regression dependence" condition on test statistics under the null hypotheses. Storey, Taylor and Siegmund [14] proposed a certain modification of the BH [1] procedure and proved its FDR controlling properties under conditions on the convergence of the empirical processes of the p-values. For related work on controlling decision errors in multiple testing, see Cohen and Sackrowitz [4, 5], Finner and Roters [9], Sarkar [11] and Simes [12], in addition to references cited elsewhere in this paper.

In Tang and Zhang [15], we showed that the optimal solution of the problem of

(1.3) maximizing $ER$ subject to $E\big[(V/R)\, I\{R > 0\}\big] \le \alpha$
may produce undesirable multiple testing procedures, and proposed Bayes and empirical Bayes (EB) approaches under a conditional version of (1.3) given certain test statistics which determine $R$. This means maximizing the total number of discoveries $R$ (i.e. power) of multiple testing procedures subject to a preassigned level of the conditional FDR:

(1.4) $\mathrm{CFDR}(X) \equiv E\big[V/(R \vee 1) \,\big|\, X\big],$
for certain statistics $X$ satisfying $E[R|X] = R$, where $(x \vee y) = \max(x, y)$. These approaches, which provide a general framework for controlling the FDR with dependent data, are discussed in Section 2. We note that the concept of conditional FDR (1.4) allows conditioning on a mixture of observations, parameters, and variables generated by statisticians to implement randomized rules. The CFDR (1.4) becomes the posterior FDR when $X$ is the vector of all observations of concern. If we observe $X_0$ and generate variables $\varepsilon_0$ to execute a randomized multiple testing rule, the constraint $E[R|X] = R$ demands $X = (X_0, \varepsilon_0)$ in our optimization problem, cf. (2.1) below, but the posterior FDR is computed given $X_0$ alone for such randomized tests.

In Section 3, we develop EB methods for controlling the (conditional) FDR in a time series model based on our approach. In Section 4, we present simulation results which demonstrate the advantages of our method compared with the BH [1] rule based on marginal p-values and the additional knowledge of the proportion of true hypotheses.

2. Bayes and empirical Bayes approaches

Let $H_1, \ldots, H_m$ be the null hypotheses to be tested and

$\theta_i = I\{H_i \text{ is not true}\}, \qquad \delta_i = I\{H_i \text{ is rejected}\}.$
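To make the indicator notation concrete, the sketch below computes the number of rejections $R = \sum_i \delta_i$, the number of false rejections $V = \sum_i \delta_i(1 - \theta_i)$, and the realized ratio $V/(R \vee 1)$ from given vectors $\theta$ and $\delta$. The function name and the toy vectors are hypothetical, chosen only for illustration.

```python
import numpy as np

def rejection_summary(theta, delta):
    """Return (R, V, V / max(R, 1)) for indicator vectors theta and delta."""
    theta = np.asarray(theta, dtype=int)   # theta_i = 1 if H_i is not true
    delta = np.asarray(delta, dtype=int)   # delta_i = 1 if H_i is rejected
    R = int(delta.sum())                   # total number of rejections
    V = int((delta * (1 - theta)).sum())   # rejections of true null hypotheses
    return R, V, V / max(R, 1)

# Toy example: six hypotheses, three rejected, one of them a true null.
theta_toy = [0, 0, 1, 1, 0, 1]
delta_toy = [1, 0, 1, 1, 0, 0]
print(rejection_summary(theta_toy, delta_toy))  # R = 3, V = 1, ratio = 1/3
```

Using `max(R, 1)` in the denominator mirrors the convention $R \vee 1$, so the ratio is 0 when nothing is rejected.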
In a Bayes approach, we treat $\theta = (\theta_1, \ldots, \theta_m)$ as a random vector and assume that for certain observations $X$ (not necessarily all observations), the joint distribution of $\{\theta, X\}$ is known given the knowledge of the joint prior distribution of $\theta$ and all nuisance parameters (if any). Consider the problem of

(2.1) maximizing $R$ subject to $E\big[V/(R \vee 1) \,\big|\, X\big] \le \alpha$ and $E[R|X] = R$,

where $R = \sum_{i=1}^m \delta_i$ and $V = \sum_{i=1}^m \delta_i(1 - \theta_i)$. Let $\pi_i = P\{\theta_i = 0 | X\}$ be the posterior probability of $H_i$ given $X$. The Bayes rule, which solves (2.1), is given by

(2.2) $\delta_{(i)} = I\{i \le k^*\}, \qquad k^* \equiv k^*(\alpha) \equiv \max\Big\{k \le m : \frac{1}{k}\sum_{i=1}^{k} \pi_{(i)} \le \alpha\Big\},$
with the convention $\max \emptyset \equiv 0$, where $\{(1), \ldots, (m)\}$ is the ordering of $\{1, \ldots, m\}$ determined by $\pi_{(1)} \le \cdots \le \pi_{(m)}$. The above Bayes solution is optimal for any type of data as long as the joint distribution of $(X, \theta)$ is available. The Bayes rule (2.2) provides the most powerful solution for controlling the (conditional) FDR in the sense of (2.1) with general dependent test statistics for general null hypotheses, provided the knowledge of the joint prior distribution of unknown parameters and the computational feasibility for the conditional probabilities of the hypotheses $H_i$ given $X$. The Bayes rule yields $R = k^*$, since it rejects $H_i$ iff $\pi_i \le \pi_{(k^*)}$. It clearly controls the FDR at level $\alpha$, since the constraint in (2.1) is stronger than $\mathrm{FDR} = E[V/(R \vee 1)] \le \alpha$.

In applications where the full knowledge of the joint prior distribution is not available or the computation of the posterior probability of the null hypotheses given data is not feasible, the Bayes rule (2.2) motivates empirical Bayes rules of the form

(2.3) $\delta_{[i]} = I\{i \le \hat k\}, \qquad \hat k = \max\Big\{k : \frac{1}{k}\sum_{i=1}^{k} \hat\pi_{[i]} \le \alpha\Big\},$

where $\hat\pi_i$ are estimates of the posterior probability $\pi_i = P\{H_i|X\}$ and $\{[1], \ldots, [m]\}$ is an estimate of the ordering $\{(1), \ldots, (m)\}$ in (2.2). The performance of the Bayes rule serves as a benchmark, as the goal of the EB rule (2.3) here is to approximately achieve optimality in the sense of (2.1). This EB approach is applicable for dependent data as long as suitable estimates $\hat\pi_i$ can be obtained. The ordering $[i]$ can be derived from $\hat\pi_i$ if the ordering $(i)$ is not a known functional of the data. Tang and Zhang [15] proposed the above Bayes and empirical Bayes approaches and proved the asymptotic optimality of the BH [1] procedure as EB in the sense of (2.1). In the rest of the section, we discuss implementation of the EB method and some related work on Bayes and EB aspects of FDR problems.

Suppose that the expectation in (2.1) depends on an unknown parameter $\xi$, so that the main constraint in (2.1) becomes $\mathrm{CFDR}(X, \xi) \le \alpha$. Let

(2.4) $\pi_i(X, \xi) = P\{\theta_i = 0 \,|\, X, \xi\}$

denote the conditional probabilities of the null hypotheses $H_i$ given statistics $X$ and parameters $\xi$. If $f(X|\xi)$ belongs to a regular parametric family, we may use an EB rule with an estimate $\hat\xi$ (e.g. the MLE) of $\xi$ and $\hat\pi_i = \pi_i(X, \hat\xi)$ in (2.3), or a hierarchical Bayes rule with a prior on $\xi$ and $\pi_i = \int \pi_i(X, \xi) f(\xi|X)\,\nu(d\xi)$ in (2.2). If $(\theta_1, X_1), \ldots, (\theta_m, X_m)$ are independent vectors for certain test statistics $X_i$, the conditional probabilities $\pi_i(X, \xi) = \pi_i(X_i, \xi)$ or the average $k^{-1}\sum_{i=1}^{k} \pi_{(i)}$ of their ordered values in (2.2) may still be estimated sufficiently accurately even for high-dimensional $\xi$, e.g. the asymptotic optimality of the BH [1] rule as EB. How do we implement (2.3) when (2.4) depends on many components of $X$ and $\xi$ is high-dimensional? We propose EB rules (2.3) with

(2.5) $\hat\pi_i = \pi_{m,i}(T_{m,i}(X), \hat\xi), \qquad \pi_{m,i}(T_{m,i}(X), \xi) = P\{\theta_i = 0 \,|\, T_{m,i}(X), \xi\},$
with certain lower-dimensional statistics $T_{m,i}(X)$ which are informative about $\theta_i$. This can also be viewed as the EB version of the approximate Bayes rule

(2.6) $\delta_{(i)} = I\{i \le k^*\}, \qquad k^* \equiv \max\Big\{k \le m : \frac{1}{k}\sum_{i=1}^{k} \pi_{m,(i)} \le \alpha\Big\},$

where $\pi_{m,(i)}$ are the ordered values of $\pi_{m,i}(T_{m,i}(X), \xi)$. In applications with time series, image, or networks data, $T_{m,i}(X)$ are typically composed of observations "near" the location of the $i$-th hypothesis $H_i$. The idea is to reduce the dimensionality of the function $\pi_i$ to be estimated. In the time series example in Section 3 and certain models of Markov random fields, $\pi_{m,i}(T_{m,i}(X), \xi)$ depend on $\xi$ only through a lower-dimensional functional of $\xi$, so that the complexity of the estimation problem is further reduced. The cost of such dimension reduction is the bias $b_{m,i} = \pi_{m,i}(T_{m,i}(X), \xi) - \pi_i(X, \xi)$. We note that $E b_{m,i} = 0$ and $\mathrm{Var}(b_{m,i})$ decreases when we add more variables to the vector $T_{m,i}(X)$.

Efron et al. [6] proposed a different EB approach based on the conditional probability $\mathrm{fdr}(x) = P\{\theta_i = 0 | X_i = x\}$ given a univariate statistic $X_i$, called local fdr, and developed multiple-testing methodologies based on a certain estimate of it in mixture models. Efron and Tibshirani [7] and Efron [8] further developed this EB approach using an integrated version of local fdr. These notions of FDR and related quantities have been studied in Storey [13] and Genovese and Wasserman [10]. Sarkar and Zhou (personal communication) recently considered the posterior FDR (i.e. the conditional FDR given all observed data) for a number of multiple testing procedures, including a randomized version of (2.2).

3. EB methods in a time series model

In this section, we develop EB methods for controlling the (conditional) FDR in the time series model

(3.1) $X_i = \mu_i + \varepsilon_i,$

where $\{\varepsilon_i\}$ is a stationary Gaussian process (e.g. moving average) with

(3.2) $E\varepsilon_i = 0, \qquad \mathrm{Cov}(\varepsilon_i, \varepsilon_{i+k}) = \gamma(k).$
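A minimal sketch of how data from model (3.1)–(3.2) can be generated is given below. It assumes a Cholesky factorization of the Toeplitz covariance matrix with entries $\gamma(|i - j|)$, which is one convenient way to simulate a stationary Gaussian process and not necessarily how the authors generated their data. The autocovariances and the counts of zero and nonzero means are those reported for the simulation in Section 4; the positions of the nonzero means, the seed, and the function name are illustrative assumptions.

```python
import numpy as np

def simulate_series(mu, gamma, rng):
    """Draw X = mu + eps, with eps ~ N(0, Sigma) and Sigma[i, j] = gamma(|i - j|)."""
    mu = np.asarray(mu, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    m = mu.size
    gamma_full = np.zeros(m)                      # gamma(k) = 0 beyond the supplied lags
    gamma_full[: min(m, gamma.size)] = gamma[:m]
    lags = np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
    sigma = gamma_full[lags]                      # Toeplitz covariance matrix
    chol = np.linalg.cholesky(sigma)              # raises LinAlgError if sigma is not positive definite
    return mu + chol @ rng.standard_normal(m)

# Settings reported in Section 4: m = 1000, 900 zero means, 100 means equal to 2.
gamma_sim = np.array([1.0, 0.6, 0.4, 0.2, 0.1])
rng = np.random.default_rng(0)                    # seed is an arbitrary choice
mu = np.zeros(1000)
# The positions of the nonzero means are not specified in the text; random positions are assumed here.
mu[rng.choice(1000, size=100, replace=False)] = 2.0
x = simulate_series(mu, gamma_sim, rng)
print(x[:5])
```

The dense Cholesky factor costs $O(m^3)$ operations, which is unproblematic for $m = 1000$; for much longer series a moving-average representation or an FFT-based method would be a more economical choice.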
Our problem is to test $H_i : \mu_i = 0$ versus $K_i : \mu_i \ne 0$, $i = 1, \ldots, m$, based on observations $X = (X_1, \ldots, X_m)$. We assume that the null distribution of $X_i \sim N(0, \gamma(0))$ is known. We set $\gamma(0) = 1$ without loss of generality. We derive an EB procedure (2.3) of the form (2.5) under a nominal mixture model in which $(\theta_i, \mu_i)$ are iid vectors with

(3.3) $\mu_i \,|\, (\theta_i = 0) = 0, \qquad \mu_i \,|\, (\theta_i = 1) \sim N(\eta, \tau^2)$

for certain unknown $(\eta, \tau^2)$. We assume $\sum_k \gamma^2(k) < \infty$, which allows us to take advantage of the diminishing correlation. This leads to

(3.4) $T_{m,i}(X) = (X_j, |j - i| \le k), \qquad \xi = \big(\eta, \tau^2, w_0, \gamma(1), \ldots, \gamma(k)\big),$

in (2.5), where $w_0 = P\{\theta_i = 0\}$. As we have mentioned in Section 2, conditioning on the lower-dimensional statistics $T_{m,i}(X)$ is helpful in two important ways: computational feasibility and reduction of the set of nuisance parameters involved. Since the components of $X$ are all correlated, the posterior probability $\pi_i(X, \xi)$ in (2.4) and thus the Bayes rule $k^*$ in (2.2) demand the inversion of $m$-dimensional conditional covariance matrices and summation over $2^m$ possible values of $\theta$. Thus, the Bayes rule is computationally intractable. Exact computation of $\pi_{m,i}(T_{m,i}(X), \xi)$ in (2.5) is much more manageable since it involves the inversion of $(2k + 1)$-dimensional covariance matrices and summation over $2^{2k+1}$ possible values of $T_{m,i}(\theta) = (\theta_j, |j - i| \le k)$. Also, the conditional probability $\pi_{m,i}(T_{m,i}(X), \xi)$ of $H_i$ given $T_{m,i}(X)$ does not depend on $\gamma(j)$ for $j > k$, so that the dimensionality of $\xi$ is much smaller than the sample size if we choose $k = o(m)$. Given $w_0$, we estimate $\xi$ as follows:

(3.5) $\hat\eta = \frac{1}{m}\sum_{i=1}^{m} X_i \big/ (1 - w_0),$

(3.6) $\hat\tau^2 = \frac{1}{m}\sum_{i=1}^{m} \frac{X_i^2 - 1}{1 - w_0} - \frac{\sum_{1 \le i \le j \le m,\ j - i > \rho m} X_i X_j / (1 - w_0)^2}{(1 - \rho)^2 m^2 / 2},$

(3.7) $\hat\gamma(j) = \frac{\sum_{i=1}^{m-j} X_i X_{i+j}}{m - j} - \frac{\sum_{1 \le i \le j \le m,\ j - i > \rho m} X_i X_j}{(1 - \rho)^2 m^2 / 2},$
where $0 < \rho < 1$. Estimates (3.5), (3.6) and (3.7) are based on the method of moments, since (3.1), (3.2) and (3.3) imply $EX_i = E\mu_i = (1 - w_0)\eta$, $EX_i^2 = 1 + E\mu_i^2 = 1 + (1 - w_0)(\tau^2 + \eta^2)$, $EX_i X_{i+j} = \gamma(j) + (E\mu_1)^2$, and $EX_i X_j \approx (E\mu_1)^2$ for large $|j - i|$. In order to reduce the bias [composed of terms involving $\gamma(j - i)$], we use the average of $X_i X_j$ with $|i - j| > \rho m$ in (3.6) and (3.7) to estimate $(EX_i)^2 = (E\mu_i)^2$, instead of $\big(\sum_{i=1}^{m} X_i / m\big)^2$. In the simulation study discussed in Section 4, we take $\rho = 0.1$.

For the estimation of the proportion $w_0$ of null hypotheses, we use a Fourier method (Tang and Zhang [15] and Zhang [16]) and its parametric bootstrap version. The Fourier method estimates $w_0$ by

(3.8) $\hat w_0^{(F)} \equiv \frac{1}{m}\sum_{i=1}^{m} \psi(X_i; h_m), \qquad \psi(z; h) \equiv \int h\psi_0(ht)\, e^{t^2/2} \cos(zt)\,dt,$

where $\psi_0$ is a density function with support $[-1, 1]$ and $h_m = \{\kappa(\log m)\}^{-1/2}$ is the bandwidth, $\kappa \le 1$. This estimator is derived from

$E\big[\psi(X_i; h) \,\big|\, \mu_i\big] = \int h\psi_0(ht)\, e^{t^2/2} \cos(\mu_i t)\, E e^{\mathrm{i}t\varepsilon_i}\,dt = \begin{cases} 1, & \mu_i = 0,\ h > 0, \\ \int \psi_0(t) \cos\big((\mu_i/h)t\big)\,dt \to 0, & \mu_i \ne 0,\ h \to 0+, \end{cases}$

by Riemann-Lebesgue. For the bootstrap version of (3.8), we generate bootstrap samples of $X$ under the parameter value of $\hat\xi$ in (3.5)-(3.8) and then estimate $w_0$ by

(3.9) $\hat w_0^{(B)} = 2\hat w_0^{(F)} - \hat w_0^{*},$

where $\hat w_0^{*}$ is the average of the estimator (3.8) based on the bootstrap samples. In the simulation study, we use uniform $[-1, 1]$ as $\psi_0$ and $\kappa = 1/2$ for (3.8), and we bootstrap 100 times for (3.9).

The Empirical Bayes procedure rejects the null hypotheses associated with the first
$\hat k$ smallest estimated conditional probabilities $\pi_{m,i}(T_{m,i}(X), \hat\xi)$, with

(3.10) $\hat k = \max\Big\{k : \frac{1}{k}\sum_{i=1}^{k} \hat\pi_{[i]} \le \alpha\Big\},$

where $\hat\pi_{[1]} \le \cdots \le \hat\pi_{[m]}$ are the ordered values of $\pi_{m,i}(T_{m,i}(X), \hat\xi)$, and $\hat\xi$ is defined through (3.5)-(3.8) with the alternative of using the bootstrap estimate (3.9) for $w_0$.

The conditional probability

$\pi_{m,i}(T_{m,i}(X), \xi) = P\{\theta_i = 0 \,|\, T_{m,i}(X), \xi\},$

with $\xi$ replaced by $\hat\xi$, is computed by conditioning on $T_{m,i}(\theta) = (\theta_j, |j - i| \le k)$. To save notation, we may drop the subscript $m$ in the rest of the paragraph, e.g. $T_{m,i} = T_i$. Under the nominal mixture model (3.3), the conditional joint distribution of $T_i(X)$ is multivariate normal

$T_i(X) \,|\, T_i(\theta) \sim N\big(\eta T_i(\theta), \Sigma_i(\theta)\big),$

where the covariance matrix $\Sigma_i(\theta) = \mathrm{Cov}\big(T_i(\varepsilon)\big) + \tau^2\,\mathrm{diag}\big(T_i(\theta)\big)$ depends on the unknown parameters $T_i(\theta)$, $\{\gamma(j) : 1 \le j \le k\}$ and $\tau^2$. The joint density of this conditional distribution is

$f_i\big(v_i \,|\, T_i(\theta)\big) = \frac{\exp\big[-(v_i - \eta T_i(\theta))\{\Sigma_i(\theta)\}^{-1}(v_i - \eta T_i(\theta))'/2\big]}{(2\pi)^{d_i/2}\{\det \Sigma_i(\theta)\}^{1/2}},$

where $d_i = \#\{j : |j - i| \le k\}$ is the dimensionality of $T_i(X)$, ranging from $1 + k$ to $2k + 1$ depending on whether $i$ is close to the endpoints $i = 1$ and $i = m$, and $v_i$ are $d_i$-dimensional row vectors. This gives

(3.11) $\pi_{m,i}(T_{m,i}(X), \xi) = \frac{\sum_{T_i(\theta) \in \Omega_i^{(0)}} f_i\big(T_i(X) \,|\, T_i(\theta)\big)\, w_0^{d_i - s_i(\theta)} (1 - w_0)^{s_i(\theta)}}{\sum_{T_i(\theta) \in \Omega_i} f_i\big(T_i(X) \,|\, T_i(\theta)\big)\, w_0^{d_i - s_i(\theta)} (1 - w_0)^{s_i(\theta)}},$

where $s_i(\theta) = \sum_{|j - i| \le k} \theta_j$, $\Omega_i = \{0, 1\}^{d_i}$, and $\Omega_i^{(0)} = \{T_i(\theta) \in \Omega_i : \theta_i = 0\}$.

The estimation of the proportion of null hypotheses is an important aspect of the FDR problem. Benjamini and Hochberg [1] simply used the conservative $\hat w_0 = 1$ to control the FDR at level $\alpha$, while Benjamini and Hochberg [2] suggested the possibility of power enhancement with estimated $w_0$ in the BH [1] procedure. Different estimators of $w_0$ were proposed by Storey [13] and Storey, Taylor and Siegmund [14] based on the tail proportion of p-values, and by Efron et al. [6] based on $\min_x\{f(x)/f_0(x)\}$, in the context of controlling the FDR. Tang and Zhang [15] proved the consistency of (3.8) for normal mixtures.

4. Simulation results

In this section, we describe the results of our simulation study. We compare five procedures: the BH [1] rule using the true $w_0$, the approximate Bayes rule (2.6), and
three EB rules (3.10) using $T_{m,i}(X) = (X_j, |j - i| \le 2)$, estimators (3.5), (3.6) and (3.7), and the following three values for (estimated) $w_0$: the true $w_0$, the Fourier estimator (3.8), and the bootstrap estimator (3.9). We denote the BH procedure by BH-w0, and the three EB rules by EB-w0, EB-Fourier and EB-bootstrap respectively. The target (conditional) FDR level is $\alpha = 0.1$ throughout the simulation study. Simulation data are generated as follows: $m = 1000$, $\#\{i : \mu_i = 0\} = 900$, $\#\{i : \mu_i = 2\} = 100$, i.e. $w_0 = 0.9$, and $\gamma = (1, 0.6, 0.4, 0.2, 0.1, 0, 0, \ldots)$, e.g. $\gamma(1) = 0.6$. We note that this is a singular point in the nominal mixture model (3.3) for the derivation of EB procedures. A realization of the simulated $X$ is plotted in Figure 1.

Fig 1. A realization of the time series data.

We plot the simulated pairs of $(V/R, R)$, i.e. the proportion of false rejections on the x-axis and the total number of rejections on the y-axis, for the BH-w0 and the three EB procedures in Figure 2, with dashed lines at the means of the simulated data. The mean and standard deviation of $V/R$, $R$, and $V$ for all five procedures are given in Table 1. It is clear that the EB procedures are much more powerful than BH as they have a much higher number $R$ of rejections, while the false discovery
ratios $V/R$ are similar among EB and BH procedures.

Fig 2. The proportion of false rejections (x-axis) and the total number of rejections (y-axis) for BH-w0, EB-w0, EB-Fourier, and EB-bootstrap, clockwise from top-left; solid vertical lines for the target FDR of α = 10% and dashed lines for the means of simulated points in the plots.

Table 1 Mean (µ) and Standard Deviation (σ) of V/R, R, and V

Procedure            V/R mean   V/R SD   R mean   R SD    V mean   V SD
BH-w0                0.11       0.10     13.52    7.57    1.66     1.82
Approximate Bayes    0.12       0.03     76.86    6.62    9.12     3.16
EB-w0                0.10       0.09     34.42    7.95    3.27     3.05
EB-Fourier           0.13       0.11     34.49    13.44   5.06     4.95
EB-bootstrap         0.13       0.12     35.59    15.05   5.89     6.56

5. Conclusion

For multiple testing problems, we describe the Bayes and empirical Bayes approaches for controlling the (conditional) FDR proposed in Tang and Zhang [15]. While these approaches are completely general for dependent data, their implementation is subject to computational feasibility and the availability of sufficient information for the estimation of certain conditional probabilities of the individual null hypotheses. We propose in this paper to use the conditional probabilities of the null hypotheses given certain low-dimensional statistics to ease the computational burden and possibly to reduce the number of nuisance parameters involved. We implement this EB approach in a time series model with general stationary Gaussian errors. Simulation results demonstrate that the EB procedures have a much higher number of correct rejections and a similar false rejection ratio compared with an application of the procedure of Benjamini and Hochberg [1] based on individual p-values. This clearly demonstrates the feasibility and superior power of the proposed EB approach for controlling the false rejection with dependent data.

References
[1] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57 289–300.
[2] Benjamini, Y. and Hochberg, Y. (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics. J. Educ. Behav. Stat. 25 60–83.
[3] Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29 1165–1188.
[4] Cohen, A. and Sackrowitz, H. (2005 a). Characterization of Bayes procedures for multiple endpoint problems and inadmissibility of step-up procedure. Ann. Statist. 33 145–158.
[5] Cohen, A. and Sackrowitz, H. (2005 b). Decision theory results for one-sided multiple comparison procedures. Ann. Statist. 33 126–144.
[6] Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96 1151–1160.
[7] Efron, B. and Tibshirani, R. (2002). Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 23 70–86.
[8] Efron, B. (2004). Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J. Amer. Statist. Assoc. 99 96–104.
[9] Finner, H. and Roters, M. (1998). Asymptotic comparison of step-down and step-up multiple test procedures based on exchangeable test statistics. Ann. Statist. 26 505–524.
[10] Genovese, C. R. and Wasserman, L. (2002). Operating characteristics and extensions of the FDR procedure. J. Roy. Statist. Soc. Ser. B 64 499–518.
[11] Sarkar, S. K. (2002). Some results on false discovery rate in stepwise multiple testing procedures. Ann. Statist. 30 239–257.
[12] Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73 751–754.
[13] Storey, J. (2002). A direct approach to false discovery rate. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 479–498.
[14] Storey, J., Taylor, J. and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates; a unified approach. J. R. Stat. Soc. Ser. B Stat. Methodol. 66 187–206.
[15] Tang, W. and Zhang, C.-H. (2005). Bayes and empirical Bayes approaches to controlling the false discovery rate. Technical Report # 2005-004, Department of Statistics, Rutgers University.
[16] Zhang, C.-H. (1990). Fourier methods for estimating mixing densities and distributions. Ann. Statist. 18 806–831.
IMS Lecture Notes–Monograph Series Complex Datasets and Inverse Problems: Tomography, Networks and Beyond Vol. 54 (2007) 161–171 © Institute of Mathematical Statistics, 2007 DOI: 10.1214/074921707000000120
A smoothing model for sample disclosure risk estimation

Yosef Rinott1,∗ and Natalie Shlomo2,†

Hebrew University and Southampton University

Abstract: When a sample frequency table is published, disclosure risk arises when some individuals can be identified on the basis of their values in certain attributes in the table called key variables, and then their values in other attributes may be inferred, and their privacy is violated. On the basis of the sample to be released, and possibly some partial knowledge of the whole population, an agency which considers releasing the sample has to estimate the disclosure risk. Risk arises from non-empty sample cells which represent small population cells and from population uniques in particular. Therefore risk estimation requires assessing how many of the relevant population cells are likely to be small. Various methods have been proposed for this task, and we present a method in which estimation of a population cell frequency is based on smoothing using a local neighborhood of this cell, that is, cells having similar or close values in all attributes. We provide some preliminary results and experiments with this method. Comparisons are made to two other methods:

1. A log-linear models approach in which inference on a given cell is based on a “neighborhood” of cells determined by the log-linear model. Such neighborhoods have one or some common attributes with the cell in question, but some other attributes may differ significantly.
2. The Argus method in which inference on a given cell is based only on the sample frequency in the specific cell, on the sample design and on some known marginal distributions of the population, without learning from any type of “neighborhood” of the given cell, nor from any model which uses the structure of the table.
1. Introduction

When a microdata sample file is released by an agency, directly identifying variables, such as name, address, etc., are always deleted, variable values are often grouped (e.g., Age-Groups instead of precise age), and the data is given in the form of a frequency table. However disclosure risk may still exist, that is, some individuals in the file may be identified by their combination of values in the variables appearing in the data. Samples often contain information on certain variables on which the agency’s information for the whole population is limited, such as expenditure on specific items in a Household Expenditure Survey, or detailed information on variables such as children’s extracurricular activities in the Social Survey of the Israel Central Bureau of Statistics.

∗ Research supported by the Israel Science Foundation (grant No. 473/04).
† Research supported in part by the Israel Science Foundation (grant No. 473/04).
1 Department of Statistics, Hebrew University, Jerusalem, Israel, e-mail: [email protected]
2 Department of Statistics, Hebrew University of Jerusalem, Southampton Statistical Sciences Research Institute, University of Southampton, United Kingdom, e-mail: [email protected]
AMS 2000 subject classifications: primary 62H17; secondary 62-07.
Keywords and phrases: sample uniques, neighborhoods, microdata.
Often agencies have to assess the disclosure risk involved in the release of sample data in the form of a frequency table when the corresponding population table may be unknown, or only partially known. Risk arises from cells in which both sample and population frequencies are small, allowing an intruder who has the sample data and access to some information on the population, and in particular on individuals of interest, to identify such individuals in the sample with high probability. Thus, the disclosure risk depends both on the given sample and on the population. In this paper we are concerned with estimating the disclosure risk involved in releasing a sample on the basis of the sample alone, assuming the population is unknown.

Let $f = \{f_k\}$ denote an $m$-way frequency table, which is a sample from a population table $F = \{F_k\}$, where $k = (k_1, \dots, k_m)$ indicates a cell and $f_k$ and $F_k$ denote the frequency in the sample and population cell $k$, respectively. Formally, the sample and population sizes in our models are random and their expectations are denoted by $n$ and $N$ respectively, and the number of cells by $K$. We can either assume that $n$ and $N$ are known, or that they are estimated by their natural estimators: the actual sample and population sizes, assumed to be known. In the sequel when we write $n$ or $N$ we formally refer to expectations.

If the $m$ attributes in the table can be considered key variables, that is, variables which are to some extent accessible to the public or to potential intruders, then disclosure risk arises from cells in which both $f_k$ and $F_k$ are positive and small, and in particular when $f_k = F_k = 1$ (sample and population uniques). Suppose an intruder locates a sample unique in cell $k$, say, and is aware of the fact that the combination of values $k = (k_1, \dots, k_m)$ happens to be unique or rare in the population.
If this combination matches an individual of interest to the intruder then identification can be made with high probability on the basis of the $m$ attributes. If the sample contains information on the values of other attributes, then these can now be inferred for the individual in question, and his privacy is violated. In many countries this would constitute a violation of law. For example The Central Bureau of Statistics in Israel operates under the Statistics Ordinance (1972) which says “No information. . . shall be so [published] as to enable the identification of the person to whom it relates”.

A global risk measure quantifies an aspect of the total risk in the file by aggregating risk over the individual cells. For simplicity we shall focus here only on two global measures, which are based on sample uniques:

\[ \tau_1 = \sum_k I(f_k = 1, F_k = 1), \qquad \tau_2 = \sum_k I(f_k = 1)\,\frac{1}{F_k}, \]

where $I$ denotes the indicator function. Note that $\tau_1$ counts the number of sample uniques which are also population uniques, and $\tau_2$ is the expected number of correct guesses if each sample unique is matched to a randomly chosen individual from the same population cell. These measures are somewhat arbitrary, and one could consider measures which reflect matching of individuals that are not sample uniques, possibly with some restrictions on cell sizes. Also, it may make sense to normalize these measures by some measure of the total size of the table, by the number of sample uniques, or by some measure of the information value of the data. Various individual and global risk measures have been proposed in the literature, see e.g., Benedetti et al. [1, 2], Skinner and Holmes [12], Elamir and Skinner [6], Rinott [8].

In Section 3 we propose and explain a new method of estimation of quantities like $\tau_1$ and $\tau_2$, using a standard Poisson model, and local smoothing of frequency
tables. The method is based on the idea that one can learn about a given population cell from neighboring cells, if a suitable definition of closeness is possible, without relying on complex modeling. In Sections 2.1 and 2.2 we briefly describe two known methods of estimation of quantities like $\tau_1$ and $\tau_2$, and in Section 4 we provide real data experiments which compare the methods discussed.

We consider the case that $f$ is known, and $F$ is an unknown parameter (on which there may be some partial information) and the quantities $\tau_1$ and $\tau_2$ should be estimated. Note that they are not proper parameters, since they involve both the sample $f$ and the parameter $F$. The methods discussed in this paper consist of modeling the conditional distribution of $F \mid f$, estimating parameters in this distribution and then using estimates of the form

\[ \hat\tau_1 = \sum_k I(f_k = 1)\,\hat P(F_k = 1 \mid f_k = 1), \qquad \hat\tau_2 = \sum_k I(f_k = 1)\,\hat E\Big[\frac{1}{F_k} \,\Big|\, f_k = 1\Big], \tag{1} \]

where $\hat P$ and $\hat E$ denote estimates of the relevant conditional probability and expectation. For a general theory of estimates of this type see Zhang [14] and references therein. Some direct variance estimates appear in Rinott [8].

2. Models

For completeness we briefly introduce the Poisson and Negative Binomial models. More details can be found, for example, in Bethlehem et al. [3], Cameron and Trivedi [4], Rinott [8]. A common assumption in the frequency table literature is $F_k \sim \mathrm{Poisson}(N\gamma_k)$, independently, where $N$ is assumed to be a known parameter, and $\sum_k \gamma_k = 1$. Binomial (or Poisson) sampling from $F_k$ means that $f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi_k)$, where each $\pi_k$ is a known constant which is part of the sampling design, called the sampling fraction in cell $k$. By standard calculations we then have

\[ f_k \sim \mathrm{Poisson}(N\gamma_k\pi_k) \quad \text{and} \quad F_k \mid f_k \sim f_k + \mathrm{Poisson}(N\gamma_k(1 - \pi_k)), \tag{2} \]

leading to the Poisson model of subsection 2.1 below. Under this model the population size is random with expectation $N$, and so is the sample size, with expectation $N\sum_k \gamma_k\pi_k$ which we denote by $n$. In practice we have in mind that $N$ and $n$ could be estimated by the actual population and sample sizes, and these estimates could be “plugged in” where needed.

If one adds the Bayesian assumption $\gamma_k \sim \mathrm{Gamma}(\alpha, \beta)$ independently, with $\alpha\beta = 1/K$ to ensure that $E\sum_k \gamma_k = 1$, then $f_k \sim NB\big(\alpha,\ p_k = \frac{1}{1 + N\pi_k\beta}\big)$, the Negative Binomial distribution defined for any $\alpha > 0$ by

\[ P(f_k = x) = \frac{\Gamma(x + \alpha)}{\Gamma(x + 1)\Gamma(\alpha)}\,(1 - p_k)^x p_k^{\alpha}, \qquad x = 0, 1, 2, \dots, \]

which for a natural $\alpha$ counts the number of failures until $\alpha$ successes occur in independent Bernoulli trials with probability of success $p_k$. Further calculations yield $F_k \mid f_k \sim f_k + NB\big(\alpha + f_k,\ \frac{N\pi_k + 1/\beta}{N + 1/\beta}\big)$, $(F_k \ge f_k)$. Note that in this model the population size is again random with expectation $N$, and now the sample size has expectation $N\sum_k \pi_k / K$ which we denote again by $n$. As $\alpha \to 0$ (and hence $\beta \to \infty$) we obtain $F_k \mid f_k \sim f_k + NB(f_k, \pi_k)$, which is exactly the Negative Binomial assumption in Section 2.2 below. As $\alpha \to \infty$ the
Poisson model of Section 2.1 is obtained, and in this sense the Negative Binomial with parameter $\alpha$ subsumes both models.

Next we discuss two methods which have received much attention. They have been applied in some bureaus of statistics recently, and are being tested by others.

2.1. The Poisson log-linear method

Skinner and Holmes [12] and Elamir and Skinner [6] proposed and studied the following approach. Assuming a fixed sampling fraction, that is, $\pi_k = \pi$, the first part of (2) implies $f_k \sim \mathrm{Poisson}(n\gamma_k)$, where $n = N\pi$. Using the sample $\{f_k\}$ one can fit a log-linear model using standard programs, and obtain estimates $\{\hat\gamma_k\}$ of the parameters. Goodness of fit measures for selecting models having good risk estimates were studied in Skinner and Shlomo [11]. Using the second part of (2) it is easy to compute individual risk measures for cell $k$, defined by

\[ P(F_k = 1 \mid f_k = 1) = e^{-N\gamma_k(1 - \pi_k)}, \qquad E\Big[\frac{1}{F_k} \,\Big|\, f_k = 1\Big] = \frac{1}{N\gamma_k(1 - \pi_k)}\big[1 - e^{-N\gamma_k(1 - \pi_k)}\big]. \tag{3} \]
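The formulas in (3) are simple to evaluate numerically. The following Python sketch is illustrative only: the function names and the example values are hypothetical, and it assumes that values of $N\gamma_k$ (in practice, estimates from some fitted log-linear model) are already available. It computes the two individual risk measures for a cell and sums them over sample uniques as in (1).

```python
import math

def poisson_risk(n_gamma, pi):
    """Individual risk measures of (3) for a sample unique.

    n_gamma stands for N * gamma_k (in practice an estimate) and
    pi for the sampling fraction pi_k.
    """
    mu = n_gamma * (1.0 - pi)              # mean of F_k - f_k in (2)
    p_unique = math.exp(-mu)               # P(F_k = 1 | f_k = 1)
    # E[1/F_k | f_k = 1] = (1 - e^{-mu}) / mu, which tends to 1 as mu -> 0
    inv_size = 1.0 if mu == 0.0 else -math.expm1(-mu) / mu
    return p_unique, inv_size

def global_risk(sample_counts, n_gammas, pi):
    """Estimates of tau_1 and tau_2 as in (1), summing over sample uniques."""
    tau1 = tau2 = 0.0
    for f, n_gamma in zip(sample_counts, n_gammas):
        if f == 1:
            p_unique, inv_size = poisson_risk(n_gamma, pi)
            tau1 += p_unique
            tau2 += inv_size
    return tau1, tau2

# Hypothetical values of N * gamma_k for four cells, with pi = 0.1.
counts = [1, 0, 1, 3]
fitted = [2.0, 0.5, 12.0, 30.0]
tau1_hat, tau2_hat = global_risk(counts, fitted, pi=0.1)
```

Only the two cells with $f_k = 1$ contribute to the sums; `math.expm1` is used so that the second measure stays accurate when $N\gamma_k(1 - \pi_k)$ is very small.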
Plugging $\hat\gamma_k$ for $\gamma_k$ in (3) leads to the desired estimates $\hat P(F_k = 1 \mid f_k = 1)$ and $\hat E[\frac{1}{F_k} \mid f_k = 1]$ and then to $\hat\tau_1$ and $\hat\tau_2$ of (1).

For each $k$ we therefore obtain estimates of $P(F_k = 1 \mid f_k = 1)$ and $E[\frac{1}{F_k} \mid f_k = 1]$ which depend on $\hat\gamma_k$, which in turn depends on the frequencies in other cells. For example, in a log-linear model of independence, $\hat\gamma_k$ depends on the frequencies in all cells which have a common attribute with $k$. Thus cells that are rather different in nature, having values which are very different from those of cell $k$ in most of the attributes, influence the estimates of the parameter $\gamma_k$ pertaining to this cell. The main goal of this paper is to study the possibility of estimating $\gamma_k$ using cells in more local “neighborhoods,” having attribute values which are closer to those of the cell $k$ in cases where closeness can be defined.

2.2. The Argus method

This method, proposed by Benedetti et al. [1, 2], was originally oriented towards individual risk estimation, but was subsequently also applied to global risk measures, see, e.g., Polettini and Seri [7], and Rinott [8]. Argus has recently been implemented in some European statistical bureaus. In the Argus model it is assumed that $F_k \mid f_k \sim f_k + NB(f_k, \pi_k)$ with an implicit assumption of independence between cells. Since the $\pi_k$ are assumed known we could now calculate $P_{\pi_k}(F_k = 1 \mid f_k = 1)$ and $E_{\pi_k}[\frac{1}{F_k} \mid f_k = 1]$. However because of non response, sampling biases and errors, Argus does not use the known $\pi_k$, but rather estimates them from the sampling weights as discussed next.

At statistics bureaus, each statistical unit responding to a sample survey is assigned a sampling weight. This weight $w_i$ is an inflating factor that informs on the number of units in the population that are represented by sample unit $i$, to be used for inference from the sample to the population. It is calculated by the inverse sampling fraction that is adjusted for non-response or other biases that may occur in the sampling process. These adjustments are often carried out within post-strata (weighting classes) defined by known marginal distributions of the populations,
such as Age, Sex and Geographical Location. The inverse sampling fractions are calibrated so that the weighted sample count in each post-stratum is equal to the known population total; this calibration reduces under or over representation of the chosen strata due to any bias, or sampling errors. The Argus method provides initial estimates of the population cell sizes of the form $\hat F_k = \sum_{i \in \text{cell } k} w_i$, where $w_i$ denotes the sampling weight of individual $i$ described above (see also example below).

Here is a simple example: Suppose for simplicity that the sampling weights are based only on the sampling design, and on post stratification by a single variable, say Sex, and that the sample is designed to be a random subset consisting of one percent of the population and therefore we have the same sampling fraction of $\pi = 1/100$ in each Sex group. If males, say, have a non-response rate of 20%, and females of 0%, then the sampling weight for women in the sample would be $w_i = 100$, and for men $w_i = 100/0.8 = 125$. If in the sample table there is a cell $k = (k_1, k_2)$ where $k_1$ stands for Male, and $k_2$ stands for the level in another attribute, such as Income, and $f_k = 20$, then in this cell all $w_i$ are 125, and $\hat F_k = 20 \times 125 = 2500$.

Now suppose Sex is not one of the variables in the table to be released, but the agency knows it for all individuals in the sample. Suppose the variables in the table are Income and Occupation, and suppose now $k = (k_1, k_2)$, where $k_1$ stands for a given Income group, and $k_2$ for a given Occupation. Suppose $f_k = 20$, meaning that in the sample there are 20 individuals with the given income group and occupation, and suppose that there are 10 males and 10 females in this group. The weight $w_i = 100$ for the 10 females, and 125 for the 10 males, and therefore $\hat F_k = 10 \times 100 + 10 \times 125 = 2250$. In the above example sampling weights reflect non response.
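The worked example above amounts to summing sampling weights within a cell. A minimal Python sketch follows; the function names are hypothetical, and the weight is taken to be the inverse of the sampling fraction times the response rate, which is what the example assumes rather than a general weighting scheme.

```python
def sampling_weight(sampling_fraction, response_rate):
    """Inverse sampling fraction, inflated for non-response (as in the example)."""
    return 1.0 / (sampling_fraction * response_rate)

def initial_cell_estimate(weights):
    """Initial estimate of the population cell size: the sum of the weights in the cell."""
    return sum(weights)

w_female = sampling_weight(0.01, 1.0)   # about 100
w_male = sampling_weight(0.01, 0.8)     # about 125

# Cell defined by (Male, Income level) with f_k = 20: every weight is w_male.
f_hat_male_cell = initial_cell_estimate([w_male] * 20)                      # about 2500

# Cell defined by (Income, Occupation) containing 10 women and 10 men.
f_hat_mixed_cell = initial_cell_estimate([w_female] * 10 + [w_male] * 10)   # about 2250
```

The values are marked as approximate because they are computed in floating point.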
In principle a bureau may arrive at such weights also because in the original sampling design men are under represented, or because it finds out that this is the case after post stratifying on Sex and observing that males are under represented due to some reasons (some bias, including non-response, or sampling error).

Returning to Argus, recall its initial estimates of the population cell sizes $\hat F_k = \sum_{i \in \text{cell } k} w_i$. Using the relation $E_{\pi_k}[F_k \mid f_k] = f_k/\pi_k$, the parameters $\pi_k$ are estimated using the moment-type estimate $\hat\pi_k = f_k/\hat F_k$. Note that if $F_k$ were known, this would be the usual estimate of the binomial sampling probability. Straightforward calculations with the Negative Binomial distribution show

\[ P_{\hat\pi_k}(F_k = 1 \mid f_k = 1) = \hat\pi_k \quad \text{and} \quad E_{\hat\pi_k}\Big[\frac{1}{F_k} \,\Big|\, f_k = 1\Big] = -\frac{\hat\pi_k}{1 - \hat\pi_k}\,\log(\hat\pi_k). \]
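The two closed forms above can be checked against the Negative Binomial series directly. The sketch below is a hedged illustration: the function names and the weight of 125 are hypothetical, and it only uses the fact that, for a sample unique, $F_k - 1$ follows $NB(1, \hat\pi_k)$, so that $P(F_k = 1 + j \mid f_k = 1) = \hat\pi_k(1 - \hat\pi_k)^j$.

```python
import math

def argus_risk(f_k, f_hat_k):
    """Argus individual risk for a sample unique, using pi_hat = f_k / F_hat_k."""
    pi_hat = f_k / f_hat_k
    p_unique = pi_hat                       # P(F_k = 1 | f_k = 1)
    if pi_hat >= 1.0:
        inv_size = 1.0                      # limit of the formula as pi_hat -> 1
    else:
        inv_size = -pi_hat * math.log(pi_hat) / (1.0 - pi_hat)
    return p_unique, inv_size

def nb_series_inverse_mean(pi, terms=20000):
    """E[1/F | f = 1] summed term by term: sum_j pi (1 - pi)^j / (j + 1)."""
    return sum(pi * (1.0 - pi) ** j / (j + 1) for j in range(terms))

# A sample unique whose single sampling weight is 125 (hypothetical value).
p_unique, inv_size = argus_risk(1, 125.0)
series_value = nb_series_inverse_mean(1 / 125.0)   # should agree with inv_size
```

Summing `p_unique` and `inv_size` over all sample uniques then gives the Argus versions of the two global estimates in (1).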
Plugging these estimates for $\hat P$ and $\hat E$ in (1) we obtain the estimates $\hat\tau_1$ and $\hat\tau_2$ of the global risk measures. Note that in this method the cells are treated completely independently, each cell at a time, and the structure of the table, or relations between different cells play no role. Moreover, since this method does not involve a model which reduces the number of parameters, it is required to estimate essentially $K$ parameters, which is typically hard in sparse tables of the kind we have in mind.

3. Smoothing polynomials and local neighborhoods

The estimation question here is essentially the following: given, say, a sample unique, how likely is it to be also a population unique, or arise from a small population cell.
If a sample unique is found in a part of the sample table where neighboring cells (by some reasonable metric, to be discussed later) are small or empty, then it seems reasonable to believe that it is more likely to have arisen from a small population cell. This motivates our attempt to study local neighborhoods, and to compare the results with the model-driven neighborhoods of the log-linear method, and with the Argus method which uses no neighborhoods.

Consider frequency tables in which some of the attributes are ordinal, and define closeness between categories of an attribute in terms of the order, or more generally, suppose that for a certain attribute one can say that some values of the attribute are closer to a given value than others. For example, Age and Years of Education are ordinal attributes, and naturally the age of 5 is closer to 6 than to 7 or 17, say, while Occupation is not ordinal, but one can try to define reasonable notions of closeness between different occupations. Classical log-linear models do not take such closeness into account, and therefore, when such models are used for individual cell parameter estimation, the estimates involve data in cells which may be rather remote from the estimated cell. On the other hand, as mentioned above, the Argus method bases its estimation only on the sampling weight of the estimated population cell. There is no learning from other cells, the structure of the table plays no role, and each cell’s parameter is estimated separately.

We now describe our proposed approach which consists of using local neighborhoods of the estimated cell. Returning to (2) we assume that $f_k \sim \mathrm{Poisson}(\lambda_k = N\gamma_k\pi_k)$. Apart from constants, the sample log-likelihood is $\sum_{k=1}^{K} [f_k \log\lambda_k - \lambda_k]$. However if we use a model for $\lambda_k$ which is valid only in some neighborhood $M$ of a given cell, we shall consider the log-likelihood of the data in this neighborhood, that is

\[ \sum_{k' \in M} [f_{k'} \log\lambda_{k'} - \lambda_{k'}]. \tag{4} \]

For convenience of notation we now assume that $m = 2$, that is, we consider two-way tables; the extension to any $m$ is straightforward. Following Simonoff [10], see also references therein, we use a local smoothing polynomial model. For each fixed $k = (k_1, k_2)$ separately, we write the model below for $\lambda_{k'}$ in terms of the parameters $\boldsymbol{\alpha} = (\beta_0, \beta_1, \gamma_1, \dots, \beta_t, \gamma_t)$, with $k' = (k_1', k_2')$ varying in some neighborhood of $k$:

\[ \log\lambda_{k'}(\boldsymbol{\alpha}) \equiv \log\lambda_{(k_1', k_2')} = \beta_0 + \beta_1(k_1' - k_1) + \gamma_1(k_2' - k_2) + \cdots + \beta_t(k_1' - k_1)^t + \gamma_t(k_2' - k_2)^t, \tag{5} \]

for some natural number $t$. One can hope that such a polynomial model is valid with a suitable $t$ for $k' = (k_1', k_2')$ in some neighborhood $M$ of $k = (k_1, k_2)$. Substituting (5) into (4) we maximize the concave function

\[ L(\boldsymbol{\alpha}) = L(\beta_0, \beta_1, \gamma_1, \dots, \beta_t, \gamma_t) = \sum_{(k_1', k_2') \in M} \big[f_{(k_1', k_2')} \log\lambda_{(k_1', k_2')} - \lambda_{(k_1', k_2')}\big] \tag{6} \]

with respect to the coefficients in $\boldsymbol{\alpha}$ of the regression model (5). With $\arg\max L(\boldsymbol{\alpha}) = \hat{\boldsymbol{\alpha}}$, and $\hat\beta_0$ denoting its first component, we finally obtain our estimate of $\lambda_k = \lambda_{(k_1, k_2)}$ in the form

\[ \hat\lambda_k \equiv \lambda_k(\hat{\boldsymbol{\alpha}}) = \exp(\hat\beta_0), \tag{7} \]
where the second equality is explained by taking $k' = k = (k_1, k_2)$ in (5). The maximization by the Newton-Raphson method is rather straightforward and fast.

Each of the estimates $\hat\lambda_k$ requires a separate maximization as above which leads to a value $\hat{\boldsymbol{\alpha}}$ that depends on $k = (k_1, k_2)$, and a set of estimates $\lambda_{k'}(\hat{\boldsymbol{\alpha}})$, of which only $\hat\lambda_k$ of (7) is used. For the risk measure discussed in this paper, it suffices to compute these estimates for cells $k$ which are sample uniques, that is, $f_k = 1$. Equating the partial derivative of the function of (6) with respect to $\beta_0$ to zero we obtain $\sum_{k' \in M} \lambda_{k'}(\hat{\boldsymbol{\alpha}}) = \sum_{k' \in M} f_{k'}$, and other derivatives yield moment identities. Note, however, that these desirable identities hold for $\lambda_{k'}(\hat{\boldsymbol{\alpha}})$ which are obtained for a fixed $k = (k_1, k_2)$, and not for our final estimates in (7), which are the ones we use in the sequel.

With the estimate of (7), recalling $\lambda_k = N\gamma_k\pi_k$ and setting $U = \{k : f_k = 1\}$, the set of sample uniques, we now apply the Poisson formulas (3), see also (1), to obtain the risk estimates

\[ \hat\tau_1 = \sum_{k \in U} e^{-\hat\lambda_k(1 - \pi_k)/\pi_k}, \qquad \hat\tau_2 = \sum_{k \in U} \frac{1}{\hat\lambda_k(1 - \pi_k)/\pi_k}\big[1 - e^{-\hat\lambda_k(1 - \pi_k)/\pi_k}\big]. \tag{8} \]
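The local fit described in this section can be sketched in plain Python for a two-way table. This is an illustration under stated assumptions, not the SAS implementation used in the experiments: the function names and the example table are hypothetical, the neighborhood is the square one with non-existing cells set to zero, the step-halving inside the Newton-Raphson loop is a safeguard added for the sketch, and nothing is done about degenerate neighborhoods (for example a sample unique surrounded only by empty cells, where the maximum may not exist).

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[r][:] + [b[r]] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= fac * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def design(table, k, c, t):
    """Regressors of model (5) and counts f_{k'} over the square neighborhood of k."""
    X, y = [], []
    for d1 in range(-c, c + 1):
        for d2 in range(-c, c + 1):
            i, j = k[0] + d1, k[1] + d2
            inside = 0 <= i < len(table) and 0 <= j < len(table[0])
            y.append(table[i][j] if inside else 0)   # non-existing cells count as zero
            row = [1.0]
            for p in range(1, t + 1):
                row += [float(d1 ** p), float(d2 ** p)]
            X.append(row)
    return X, y

def linear_predictor(beta, row):
    return sum(b * v for b, v in zip(beta, row))

def loglik(beta, X, y):
    """Local Poisson log-likelihood (6), up to constants."""
    try:
        return sum(f * linear_predictor(beta, row) - math.exp(linear_predictor(beta, row))
                   for f, row in zip(y, X))
    except OverflowError:
        return -math.inf

def fit_local(table, k, c=3, t=2, max_iter=100, tol=1e-10):
    """Newton-Raphson maximization of (6); returns (alpha_hat, fitted values, counts)."""
    X, y = design(table, k, c, t)
    q = len(X[0])
    beta = [math.log(max(sum(y) / len(y), 1e-8))] + [0.0] * (q - 1)
    for _ in range(max_iter):
        lam = [math.exp(linear_predictor(beta, row)) for row in X]
        grad = [sum((f - li) * row[a] for f, li, row in zip(y, lam, X)) for a in range(q)]
        hess = [[sum(li * row[a] * row[b] for li, row in zip(lam, X)) for b in range(q)]
                for a in range(q)]
        step = solve(hess, grad)
        current = loglik(beta, X, y)
        for _halving in range(30):           # safeguard added for this sketch
            candidate = [b + s for b, s in zip(beta, step)]
            if loglik(candidate, X, y) >= current:
                break
            step = [s / 2.0 for s in step]
        beta = candidate
        if max(abs(s) for s in step) < tol:
            break
    fitted = [math.exp(linear_predictor(beta, row)) for row in X]
    return beta, fitted, y

def risk_estimates(lambdas, pi):
    """Risk estimates (8) from the lambda_hat values of the sample uniques."""
    tau1 = tau2 = 0.0
    for lam in lambdas:
        mu = lam * (1.0 - pi) / pi
        tau1 += math.exp(-mu)
        tau2 += 1.0 if mu == 0.0 else -math.expm1(-mu) / mu
    return tau1, tau2

# Hypothetical 9 x 9 sample table with mild variation, fitted at an interior cell.
table = [[3 + (i + 2 * j) % 4 for j in range(9)] for i in range(9)]
alpha_hat, fitted, counts = fit_local(table, k=(4, 4), c=3, t=2)
lambda_hat = math.exp(alpha_hat[0])          # estimate (7)
```

In an application one would call `fit_local` once for each sample unique, collect the resulting values of `lambda_hat`, and pass them to `risk_estimates` together with the sampling fraction. At convergence the fitted values over the neighborhood sum to the observed counts, which is the moment identity for $\beta_0$ noted above.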
In our experiments we defined neighborhoods $M$ of $k$ by varying around $k$ coordinates corresponding to attributes that are ordinal, and using close values in non-ordinal attributes when possible (e.g., in Occupation). Attributes in which closeness of values cannot be defined remain constant in the whole neighborhood. Thus in our experiments, neighborhoods always consist of individuals of the same Sex. For more details see Section 4.

4. Experiments with neighborhoods

We present a few experiments. They are preliminary as already mentioned and more work is needed on the approach itself and on classifying types of data for which it might work. In the experiments we used our own versions of the Argus and log-linear models methods, programmed on the SAS system. Throughout our experiments two log-linear models are considered, one of independence of all attributes, the other including all two-way interactions. The weights $w_i$ for the Argus method in all our examples were computed by post-stratification on Sex by Age by Geographical location (the latter is not one of the attributes in any of the tables, but it was used for post-stratification). These variables are commonly used for post-stratification; other strata may give different, and perhaps better results.

In all experiments we took a real population data file of size $N$ given in the form of a contingency table with $K$ cells, and from it we took a simple random sample of size $n$. Since the population and the sample are known to us, we can compute the true values of $\tau_1$ and $\tau_2$ and their estimates by the different methods, and compare.

Example 1. In this small example the population consists of a small extract from the 1995 Israeli Census with individuals of age 15 and over, with N = 15,035 and K = 448. From this population we took a random sample of size n = 1,504, using a fixed sampling fraction, that is $\pi_k = n/N$ for all $k$. The sampling fraction is constant in all our experiments.
The attributes (with number of levels in parentheses) were Age Groups (32), and Income Groups (14), both ordinal. As mentioned above, throughout our experiments two log-linear models are considered, one of independence, the other including all two-way interactions (which in the present case of two attributes, is a saturated model). In this experiment we tried our proposed smoothing polynomial approach of (5) for $t = 1, 2$. We considered one type of neighborhood here, constructed by changing each attribute value in $k$ by at most 3 values up or down, that is, the neighborhood of each cell $k$ is

\[ M = \{k' : \max_{1 \le i \le m} |k_i' - k_i| \le c\}, \tag{9} \]

with $m = 2$, $c = 3$ and hence size $|M| = 49$. For cells near the boundaries some of the cells in their neighborhoods do not exist; here we set non-existing cells’ frequencies to be zero, but other possibilities can be considered. Table 1 presents the true $\tau$ values and their estimates by the methods described above.

Table 1
                                        Example 1          Example 2
Model                                   τ1       τ2        τ1       τ2
True Values                             2        12.4      2        19.9
Argus                                   7.8      19.6      14.7     37.2
Log Linear Model: Independence          0.06     6.7       0.01     9.8
Log Linear Model: 2-Way Interactions    0.01     8.6       1.4      19.6
Smoothing t = 1, |M| = 49               3.2      12.0      7.0      22.5
Smoothing t = 2, |M| = 49               1.7      10.4      4.8      19.0

Example 2. The population consists of an extract from the 1995 Israeli Census, N = 37,586, n = 3,759, and K = 896. The attributes are Sex (2) * Age Groups (32) * Income Groups (14). We applied the smoothing polynomial of (5) for $t = 1, 2$ and neighborhoods obtained by varying the attributes of Age and Income as in Example 1 and keeping Sex fixed. In other words we used the neighborhoods

\[ M = \{k' : k_1' = k_1,\ \max_{2 \le i \le m} |k_i' - k_i| \le c\}, \tag{10} \]

with $m = 3$, $c = 3$ which are like (9) on each sub-table of males and females. The results are given in Table 1.

Example 3. Population: an extract from the 1995 Israeli Census. N = 37,586, n = 3,759, K = 11,648. Attributes: Sex (2) * Age Groups (32) * Income Groups (14) * Years of Study (13). We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods obtained by fixing Sex, so neighborhoods are as in (10), but with $m = 4$, $c = 2$, and since we now vary three variables, each over a range of five values, we have $|M| = 125$. The results are given in Table 2.

Table 2
Model                                   τ1       τ2
True Values                             187      452.0
Argus                                   137.2    346.4
Log Linear Model: Independence          217.3    518.0
Log Linear Model: 2-Way Interactions    167.2    432.8
Smoothing t = 2, |M| = 125              170.7    447.9
Table 3
Model                                   τ1       τ2
True Values                             191      568.0
Argus                                   79.2     315.6
Log Linear Model: Independence          364.8    862.3
Log Linear Model: 2-Way Interactions    182.2    546.2
Smoothing t = 2, |M| = 545              139.6    509.1
Smoothing t = 2, |M| = 625              154.7    528.5
Smoothing t = 2, |M| = 1025             215.7    647.2

Table 4
Model                                   τ1       τ2
True Values                             5        36.9
Argus                                   7.7      35.5
Log Linear Model: Independence          6.4      44.2
Log Linear Model: 2-Way Interactions    1.1      26.4
Smoothing t = 2, |M| = 125              3.3      31.3
Example 4. Population: an extract from the 2001 UK Census File. N = 944,793, n = 18,896, K = 152,100. Attributes: Sex (2) * Age Groups (25) * Number of Persons in Household (9) * Education Qualifications (13) * Occupation (26). We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods defined by fixing Sex and varying all other variables, including Occupation, which was coded as ordinal. The neighborhoods are

\[ M = \{k' : k_1' = k_1,\ \max_{2 \le i \le m} |k_i' - k_i| \le c,\ \sum_i |k_i' - k_i| \le d\}, \tag{11} \]

with $m = 5$, $c = 2$ and $d = 6, 8$, resulting in neighborhood sizes $|M| = 545$ and 625, respectively. We also tried $c = 3$, $d = 6$ and hence $|M| = 1025$. The results are given in Table 3.

Example 5. Population: an extract from the 1995 Israeli Census. N = 248,983, n = 2,490, K = 8,800. Attributes: Sex (2) * Age Groups (16) * Years of Study (25) * Occupation (11). We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods obtained by varying three attributes and fixing Sex, so neighborhoods are as in (10) with $m = 4$, $c = 2$, and $|M| = 125$. The results are given in Table 4.

Example 6. Population: an extract from the 1995 Israeli Census. N = 746,949, n = 14,939, K = 337,920. Attributes: Sex (2) * Age Groups (16) * Years of Study (10) * Number of Years in Israel (11) * Income Groups (12) * Number of Persons in Household (8). Note that this is a very sparse table. We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods obtained by varying all attributes except for Sex which was fixed. Neighborhoods are as in (11) with $m = 6$, $c = 2$, $d = 4, 6$, and $|M| = 581$ and 1,893, respectively. The results are given in Table 5.

Example 7. Population: an extract from the 1995 Israeli Census. N = 746,949, n = 7,470, K = 42,240. Attributes: Sex (2) * Age Groups (16) * Years of Study (10) * Number of Years in Israel (11) * Income Groups (12).
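The neighborhood sizes quoted in the examples can be checked by direct enumeration. The Python sketch below is illustrative: the function name is hypothetical, and it takes the second constraint in (11) to be a bound $d$ on the sum of the absolute coordinate changes, which is the reading assumed here. Only the attributes that vary are enumerated, since a fixed attribute such as Sex contributes no offsets.

```python
from itertools import product

def neighborhood_offsets(n_varying, c, d=None):
    """Offsets k' - k over the varying attributes.

    Each coordinate changes by at most c; if d is given, the sum of the
    absolute changes is also bounded by d (the reading of (11) assumed here).
    """
    offsets = []
    for offset in product(range(-c, c + 1), repeat=n_varying):
        if d is None or sum(abs(o) for o in offset) <= d:
            offsets.append(offset)
    return offsets

# Neighborhood sizes |M| for the settings used in the examples.
sizes = {
    "(9), m = 2, c = 3": len(neighborhood_offsets(2, 3)),
    "(10), m = 4, c = 2": len(neighborhood_offsets(3, 2)),
    "(11), m = 5, c = 2, d = 6": len(neighborhood_offsets(4, 2, 6)),
    "(11), m = 5, c = 2, d = 8": len(neighborhood_offsets(4, 2, 8)),
    "(11), m = 5, c = 3, d = 6": len(neighborhood_offsets(4, 3, 6)),
    "(11), m = 6, c = 2, d = 4": len(neighborhood_offsets(5, 2, 4)),
    "(11), m = 6, c = 2, d = 6": len(neighborhood_offsets(5, 2, 6)),
}
```

Under this reading the enumeration gives 49, 125, 545, 625, 1025, 581 and 1,893, in agreement with the sizes stated in Examples 1 to 6.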
Table 5
Model                                   τ1       τ2
True Values                             430      1,125.8
Argus                                   114.5    456.0
Log Linear Model: Independence          773.8    1,774.1
Log Linear Model: 2-Way Interactions    470.0    1,178.1
Smoothing t = 2, |M| = 581              287.1    988.4
Smoothing t = 2, |M| = 1,893            471.1    1,240.2

Table 6
Model                                   τ1       τ2
True Values                             42       171.2
Argus                                   20.7     95.4
Log Linear Model: Independence          28.8     191.5
Log Linear Model: 2-Way Interactions    35.8     164.1
Smoothing t = 2, |M| = 545              37.1     175.1
We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods obtained by varying all attributes except for Sex which was fixed. Neighborhoods are as in (11) with $m = 5$, $c = 2$, $d = 6$, and $|M| = 545$. Smaller neighborhoods did not yield good estimates. The results are given in Table 6.

Discussion of examples

The log-linear model method was tested in Skinner and Shlomo [11] and references therein, and it seems to yield good results for experiments of the kind done here. Di Consiglio et al. [5] presented experiments for individual risk assessment with Argus, which seems to perform less well than the log-linear method in many of our experiments with global risk measures. Our new method still requires fine-tuning. At present the results seem comparable to the log-linear method, and it seems to be computationally somewhat simpler and faster.

Naturally, more variables and sparse data sets with a large number of cells are typical and need to be tested. Such files will cause difficulties to any method, and this is where the different methods should be compared. In sparse multi-way tables, model selection will be crucial but difficult for the log-linear method, and perhaps simpler for the smoothing approach. We also think that our method may be easier to modify to complex sampling designs.

Our proposed method is at a preliminary stage and requires more work. Particular directions are the following:

1. Adjust the estimates $\hat\gamma_k$ of (7) to fit known population marginals obtained from prior knowledge and sampling weights. In log-linear models the total sum of these estimates corresponds to the sample size, but as commented in Section 3 this is not the case with the smoothing estimates of (7).
2. Use goodness of fit measures and information on population marginals and sampling weights to select the type and size of the neighborhoods, and the degree of the smoothing polynomial in (5). We have observed in experiments that when the sum of all estimates matches the sample size, we obtain good risk measure estimates, and further matching to marginals may improve the estimates.
3. Extend the smoothing approach to the more general Negative Binomial model which subsumes both the Poisson model implemented here, and the Negative Binomial discussed in Section 2.
4. Apply this method also for individual risk measure estimates, which are important in themselves, and may also shed more light on efficient neighborhood and model selection. Our preliminary experiments suggest that the smoothing approach performs relatively well in estimating individual risk.

References

[1] Benedetti, R., Capobianchi, A. and Franconi, L. (1998). Individual risk of disclosure using sampling design information. Contributi Istat 1412003.
[2] Benedetti, R., Franconi, L. and Piersimoni, F. (1999). Per-record risk of disclosure in dependent data. In Proceedings of the Conference on Statistical Data Protection, Lisbon March 1998. European Communities, Luxembourg.
[3] Bethlehem, J., Keller, W. and Pannekoek, J. (1990). Disclosure control of microdata. J. Amer. Statist. Assoc. 85 38–45.
[4] Cameron, A. C. and Trivedi, P. K. (1998). Regression Analysis of Count Data. Cambridge University Press.
[5] Di Consiglio, L., Franconi, L. and Seri, G. (2003). Assessing individual risk of disclosure: an experiment. In Proceedings of the Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, Luxemburg 286–298.
[6] Elamir, E. and Skinner, C. (2006). Record-level measures of disclosure risk for survey microdata. J. Official Statist. 22 525–539.
[7] Polettini, S. and Seri, G. (2003). Guidelines for the protection of social micro-data using individual risk methodology - Application within mu-argus version 3.2. CASC Project Deliverable No. 1.2-D3. Available at http://neon.vb.cbs.nl/casc/.
[8] Rinott, Y. (2003). On models for statistical disclosure risk estimation. In Proceedings of the Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, Luxemburg 275–285.
[9] Rinott, Y. and Shlomo, N. (2005). A neighborhood regression model for sample disclosure risk estimation. In Proceedings of the Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Geneva, Switzerland.
[10] Simonoff, J. S. (1998). Three sides of smoothing: Categorical data smoothing, nonparametric regression, and density estimation. Internat. Statist. Rev. 66 137–156.
[11] Skinner, C. and Shlomo, N. (2005). Assessing disclosure risk in microdata using record-level measures. In Proceedings of the Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Geneva, Switzerland.
[12] Skinner, C. and Holmes, D. (1998). Estimating the re-identification risk per record in microdata. J. Official Statist. 14 361–372.
[13] Willenborg, L. and de Waal, T. (2001). Elements of Statistical Disclosure Control. Springer, New York.
[14] Zhang, C.-H. (2005). Estimation of sums of random variables: examples and information bounds. Ann. Statist. 33 2022–2041.
IMS Lecture Notes–Monograph Series Complex Datasets and Inverse Problems: Tomography, Networks and Beyond Vol. 54 (2007) 172–176 © Institute of Mathematical Statistics, 2007 DOI: 10.1214/074921707000000139
A note on the U, V method of estimation*
Arthur Cohen¹ and Harold Sackrowitz¹
Rutgers University

Abstract: The U, V method of estimation provides unbiased estimators or predictors of random quantities. The method was introduced by Robbins [3] and subsequently studied in a series of papers by Robbins and Zhang. (See Zhang [5].) Practical applications of the method are featured in these papers. We demonstrate that for one U function (one for which there is an important application) the V estimator is inadmissible for a wide class of loss functions. For another important U function the V estimator is admissible for the squared error loss function.
1. Introduction

The U, V method of estimation was introduced by Robbins [3]. The method applies to estimating random quantities in an unbiased way, where unbiasedness is defined as follows: the expected value of the estimator equals the expected value of the random quantity to be estimated. More specifically, suppose $X_j$, $j = 1, \ldots, n$, are random variables whose density (or mass) function is denoted by $f_{X_j}(x_j \mid \theta_j)$. In this paper we consider estimands of the form

(1.1)   $S(X, \theta) = \sum_{j=1}^{n} U^*(X_j, \theta_j),$

where $X = (X_1, \ldots, X_n)$ and $\theta = (\theta_1, \ldots, \theta_n)$. An estimator $V(X)$ is an unbiased estimator of $S$ if

(1.2)   $E_\theta V(X) = E_\theta(S(X, \theta)).$
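As a numerical illustration of definition (1.2), the following minimal sketch checks unbiasedness for a single Poisson observation. It uses the estimator $V(X) = U(X-1)X$, $V(0) = 0$, which Section 2 quotes from Robbins [3], with $U$ the indicator of $X \le A$. The function names, the parameter values and the truncation point `x_max` are illustrative assumptions, not part of the paper.

```python
import math

def U(x, A):
    """Indicator U(x) = 1 if x <= A, else 0 (the U function of Section 2)."""
    return 1.0 if x <= A else 0.0

def V(x, A):
    """Poisson U,V estimator quoted in Section 2: V(x) = U(x - 1) * x, with V(0) = 0."""
    return 0.0 if x == 0 else U(x - 1, A) * x

def poisson_pmf(x, theta):
    """Poisson mass function, computed on the log scale to avoid overflow."""
    return math.exp(-theta + x * math.log(theta) - math.lgamma(x + 1))

def expectations(theta, A, x_max=200):
    """Return (E[V(X)], E[U(X) * theta]) for X ~ Poisson(theta).

    The sums are truncated at x_max (an assumption of this sketch); both
    summands vanish for x > A + 1, so the truncation is harmless here.
    """
    ev = sum(V(x, A) * poisson_pmf(x, theta) for x in range(x_max + 1))
    es = sum(U(x, A) * theta * poisson_pmf(x, theta) for x in range(x_max + 1))
    return ev, es

ev, es = expectations(theta=2.5, A=3)
print(ev, es)  # the two expectations should agree up to rounding error
```

The agreement reflects the identity $x\,e^{-\theta}\theta^{x}/x! = \theta\,e^{-\theta}\theta^{x-1}/(x-1)!$, which shifts the sum over $x$ by one.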
Of particular interest in applications are estimands of the form $U^*(X_j, \theta_j) = U(X_j)\theta_j$, where $U(\cdot)$ is an indicator function. Robbins [3] offers a number of examples of unbiased estimators using the U, V method. Zhang [5] studies the U, V method for estimating S and provides conditions under which the "U, V" estimators are asymptotically efficient. Zhang [5] then presents a Poisson example that deals with a practical problem involving motor vehicle accidents. In this note we demonstrate that for many practical applications the U, V estimators are inadmissible for many sensible loss functions. In particular, for the Poisson example given in Zhang [5], for the U function given, the V estimator is inadmissible for any reasonable loss function, since the estimator is positive for some X when S = 0 no matter which θ is true. Previously, Sackrowitz and Samuel-Cahn [4] showed that the U, V estimator of the selected mean of two independent negative exponential distributions is inadmissible for squared error loss.

* Research supported by NSF Grant DMS-0457248 and NSA Grant H98230-06-1-007.
¹ Department of Statistics and Biostatistics, Hill Center, Busch Campus, Piscataway, NJ 08854-8019, USA, e-mail: [email protected]; [email protected]
AMS 2000 subject classifications: primary 62C15; secondary 62F15.
Keywords and phrases: admissibility, unbiased estimators, asymptotic efficiency.
In the next section we examine examples in which S functions based on simple U functions are estimated by inadmissible V functions. For other simple U functions the resulting V estimators are admissible for squared error loss. These latter results are presented in Section 3.

2. Inadmissibility results

Let $X_j$, $j = 1, \ldots, n$, be independent random variables with density $f_{X_j}(x_j \mid \theta_j)$. Let $U^*(X_j, \theta_j) = U(X_j)\theta_j$, where, for some fixed $A \ge 0$,

(2.1)   $U(X_j) = \begin{cases} 1, & \text{if } X_j \le A, \\ 0, & \text{if } X_j > A. \end{cases}$

Consider the following four distributions for $X_j$.

(2.2)   Poisson: $f_X(x \mid \theta) = e^{-\theta}\theta^{x}/x!$   ($\theta > 0$, $x = 0, 1, \ldots$),
(2.3)   Geometric: $f_X(x \mid \theta) = (1 - \theta)\theta^{x}$   ($0 < \theta < 1$, $x = 0, 1, \ldots$),
(2.4)   Exponential: $f_X(x \mid \theta) = (1/\theta)e^{-x/\theta}$   ($\theta > 0$, $x > 0$),
(2.5)   Uniform scale: $f_X(x \mid \theta) = 1/\theta$   ($0 < x < \theta$, $\theta > 0$).
Let $W(t)$, $t \ge 0$, be a function with the property that $W(0) = 0$ and $W(t) > 0$ for $t > 0$. Consider loss functions

(2.6)   $W(a, S) = W(a - S),$

for action $a$. For the distributions in (2.2), (2.3), (2.4), (2.5), Robbins [3] finds unique unbiased estimators $V(X_j)$ for $U(X_j)\theta_j$.

Theorem 2.1. Let $X_j$, $j = 1, \ldots, n$, be independent random variables whose distribution is (2.2) or (2.3) or (2.4) or (2.5). Consider the loss function given in (2.6). Let $U(X_j)$ be as in (2.1). Then the unbiased estimator $V(X) = \sum_{j=1}^{n} V(X_j)$, where $V(X_j)$ is the unbiased estimator of $U(X_j)\theta_j$, is inadmissible for $S$ given in (1.1).

Proof. The idea of the proof is easily seen if $n = 1$. However, for $n > 1$ it is instructive to see how much improvement can be made. The proof for $n = 1$ goes as follows. Let $X_1$ be $X$ and $\theta_1$ be $\theta$. The $V(X)$ estimators for the four cases are given in Robbins [3]. For the Poisson case $V(X) = U(X - 1)X$ ($V(0) = 0$). Now let $[A]$ denote the largest integer less than or equal to $A$. Then $V([A] + 1) = [A] + 1$, whereas $S = U([A] + 1)\theta = 0$. If

$V^*(X) = \begin{cases} V(X), & \text{all } X \text{ except } X = [A] + 1, \\ 0, & X = [A] + 1, \end{cases}$

then clearly $V^*(X)$ is better than $V(X)$, since $W(V^*([A] + 1) - S) = 0$ for $V^*$ and $W(([A] + 1) - S) > 0$ for $V$. For the case of arbitrary $n$, $S = 0$ whenever all $X_j \ge [A] + 1$, whereas $V(X) \ne 0$ whenever at least one $X_j = [A] + 1$. If all $X_j = [A] + 1$, then $V = n([A] + 1)$. Clearly if $V^* = 0$ at such $X$, $V^*$ is better than $V$.

For the geometric distribution when $n = 1$, $V(X) = \sum_{i=0}^{X-1} U(i)$ ($V(0) = 0$). Note $S = 0$ for $X \ge [A] + 1$ but $V = [A] + 1$ for all such $X$. Again if $V^* = V$ for $X \le [A]$ and $V^* = 0$ for $X \ge [A] + 1$, $V^*$ is better than $V$. The case of arbitrary
Table 1
Improvement in risk for squared error loss function

  n     A = 1    A = 3    A = 5    A = 7    A = 9
  1     1.083    3.126    5.782    8.934   12.511
  2     1.872    4.763    8.268   12.268   16.694
  3     2.190    5.086    8.419   12.113   16.120
  4     2.148    4.626    7.364   10.328   13.490
  5     1.902    3.831    5.894    8.083   10.388
  6     1.575    2.982    4.447    5.976    7.568
  7     1.243    2.220    3.216    4.242    5.299
  8     0.947    1.599    2.253    2.919    3.600
  9     0.701    1.122    1.539    1.961    2.389
 10     0.508    0.771    1.031    1.292    1.556
n is even more dramatic than is the Poisson case, with $S = 0$ if all $X_j \ge [A] + 1$ whereas $V \ne 0$ on such points.

For the exponential distribution when $n = 1$, $V(X) = \int_0^X U(t)\,dt$, which equals $X$ if $X \le A$ and $A$ if $X > A$. For arbitrary $n$, $S = 0$ whenever all $X_j > A$, whereas $V(X) \ne 0$ on such points.

For the scale parameter of a uniform distribution with $n = 1$, $V(X) = XU(X) + \int_0^X U(t)\,dt$, which becomes $2X$ if $X \le A$ and $A$ if $X > A$. Hence, as in the previous case, for arbitrary $n$, $S = 0$ whenever all $X_j > A$ whereas $V(X) \ne 0$ on such points. This completes the proof of the theorem.

Remark 2.1. Theorem 2.1 applies to the Poisson example in Zhang [5].

Remark 2.2. If the loss function in (2.6) is squared error then the amount of improvement in risk of $V^*$ over $V$ depends on $n$, $A$, and $\theta$. It can be easily calculated. For the case where all the components of $\theta$ are equal and each $\theta_i$, $i = 1, \ldots, n$, is set equal to $[A] + 1$, the amount of improvement is equal to

(2.7)   $\sum_{i=1}^{n} \{i([A] + 1)\}^2 \, C_i^n \left( \frac{e^{-([A]+1)}([A] + 1)^{[A]+1}}{([A] + 1)!} \right)^{i} \cdot \left( 1 - \sum_{y=0}^{[A]+1} \frac{e^{-([A]+1)}([A] + 1)^{y}}{y!} \right)^{n-i}.$
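The improvement in Remark 2.2 can be evaluated directly. The sketch below assumes the reading of (2.7) in which, with all $\theta_i = [A] + 1$, exactly $i$ of the $n$ observations equal $[A] + 1$ and the remaining $n - i$ exceed it; the function name `improvement` is an illustrative choice. Under that reading the computed values can be compared with the entries of Table 1, for example 1.083 for $n = 1$, $A = 1$ and 1.872 for $n = 2$, $A = 1$.

```python
from math import comb, exp, factorial, floor

def improvement(n, A):
    """Risk improvement of V* over V under squared error loss, all theta_i = [A] + 1.

    On the event that every X_j >= [A] + 1 we have S = 0, V* = 0 and
    V = i * ([A] + 1), where i counts the X_j equal to [A] + 1.
    """
    m = floor(A) + 1                                   # common Poisson mean [A] + 1
    p_equal = exp(-m) * m**m / factorial(m)            # P(X = [A] + 1)
    p_above = 1.0 - sum(exp(-m) * m**y / factorial(y)  # P(X > [A] + 1)
                        for y in range(m + 1))
    return sum((i * m) ** 2 * comb(n, i) * p_equal**i * p_above ** (n - i)
               for i in range(1, n + 1))

for n in (1, 2, 3):
    print(n, [round(improvement(n, A), 3) for A in (1, 3, 5, 7, 9)])
```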
Table 1 offers the amount of improvement for $n = 1(1)10$ and for values of $A = 1, 3, 5, 7, 9$. We observe that as $n$ gets large the amount of improvement becomes smaller. Also, for small $n$, as $A$ gets large the improvement gets large. Such observations are consistent with the asymptotic efficiency of the U, V estimator as $n \to \infty$ and with Stirling's formula.

Remark 2.3. Theorem 2.1 also holds for predicting

$S^* = \sum_{j=1}^{n} Y_j U(X_j),$

where $Y_j$ has the same distribution as $X_j$ but is unobserved.

3. Admissibility results

In this section we consider the case

(3.1)   $U(X_j) = \begin{cases} 0, & \text{if } X_j \le A, \\ 1, & \text{if } X_j > A, \end{cases}$   $A \ge 0$; $j = 1, \ldots, n$.

Also we consider a squared error loss function.
Theorem 3.1. Suppose $X_j$ are independent with Poisson distributions with parameter $\lambda_j$. Then $V(X)$ is an admissible estimator of $S(X, \lambda)$ for squared error loss.

Proof. Let $n = 1$ and recall $V(X_1) = U(X_1 - 1)X_1$, $V(0) = 0$. Then

$V(X_1) = \begin{cases} 0, & \text{for } X_1 = 0, 1, \ldots, [A] + 1, \\ X_1, & \text{for } X_1 > [A] + 1, \end{cases}$

while

$U^*(X_1, \lambda_1) = U(X_1)\lambda_1 = \begin{cases} 0, & X_1 \le [A], \\ \lambda_1, & X_1 \ge [A] + 1. \end{cases}$
Since $U^*(X_1, \lambda_1) = 0$ for $X_1 \le [A]$, any admissible estimator of $U^*(X_1, \lambda_1)$ must estimate 0 for $X_1 \le [A]$, as $V(X_1)$ does. At this point we can restrict the class of estimators to all those which estimate by the value 0 for all $X_1 \le [A]$. For $X_1 \ge [A] + 1$, $U^*(X_1, \lambda_1) = \lambda_1$ and we have a traditional problem of estimating a parameter $\lambda_1$. Now we can refer to the proof of Lemma 5.2 of Brown and Farrell [1] to conclude that any estimator that can beat $V(X)$ would have to estimate 0 at $X_1 = [A] + 1$. Furthermore, for the conditional problem given $X_1 > [A] + 1$, it follows by results in Johnstone [2] that $X_1$ is an admissible estimator of $\lambda_1$.

For arbitrary $n$ the proof is more detailed. We give the details for $n = 2$. The extension for arbitrary $n$ will follow the steps for $n = 2$ and employ induction. For $n = 2$, suppose $V(X_1) + V(X_2)$ is inadmissible. Then there exists $\delta^*(X_1, X_2)$ such that

(3.2)   $\sum_{x_1=0}^{\infty} \sum_{x_2=0}^{\infty} \big[ V(x_1) + V(x_2) - U(x_1)\lambda_1 - U(x_2)\lambda_2 \big]^2 \frac{\lambda_1^{x_1} \lambda_2^{x_2} e^{-\lambda_1 - \lambda_2}}{x_1!\, x_2!} \ge \sum_{x_1=0}^{\infty} \sum_{x_2=0}^{\infty} \big[ \delta^*(x_1, x_2) - U(x_1)\lambda_1 - U(x_2)\lambda_2 \big]^2 \frac{\lambda_1^{x_1} \lambda_2^{x_2} e^{-\lambda_1 - \lambda_2}}{x_1!\, x_2!}$
for all $\lambda_1 > 0$, $\lambda_2 > 0$, with strict inequality for some $\lambda_1$ and $\lambda_2$. Now let $\lambda_2 \to 0$. Then by continuity of the risk function, (3.2) leads to

(3.3)   $E\big[ V(X_1) - U(X_1)\lambda_1 \big]^2 \ge E\big[ \delta^*(X_1, 0) - U(X_1)\lambda_1 \big]^2.$

Since $V(X_1)$ is admissible for $U(X_1)\lambda_1$ (the case $n = 1$), (3.3) implies that $V(X_1) = \delta^*(X_1, 0)$. At this point we proceed as in Brown and Farrell [1] by dividing both sides of (3.2) by $\lambda_2$. Reconsider (3.2), but now we can let the sum on $x_2$ run from 1 to $\infty$ since $V(X_1) = \delta^*(X_1, 0)$. Again let $\lambda_2 \to 0$, and this leads to $V(X_1) = \delta^*(X_1, 1)$. Repeat the process for $X_2 = 0, 1, \ldots, [A] + 1$. Furthermore, by symmetry, $V(X_2) = \delta^*(0, X_2) = \cdots = \delta^*([A] + 1, X_2)$. Thus $V(X_1) + V(X_2) = \delta^*(X_1, X_2)$ on all sample points except the set $B = (X_1 \ge [A] + 2, X_2 \ge [A] + 2)$. Here $V(X_1) + V(X_2) = X_1 + X_2$ and $S = \lambda_1 + \lambda_2$. We consider the conditional problem of estimating $\lambda_1 + \lambda_2$ by $X_1 + X_2$ given $X \in B$. Clearly when $\lambda_1 = \lambda_2 = \lambda$ no estimator can match, much less beat, the risk of $X_1 + X_2$ for this conditional problem, since $X_1 + X_2$ is a sufficient statistic, the loss is squared error, and $X_1 + X_2$ is an admissible estimator of $2\lambda$. Thus $\delta^*(X_1, X_2) = V(X_1) + V(X_2)$ on the entire sample space, proving the theorem.
References
[1] Brown, L. D. and Farrell, R. H. (1985). Complete class theorems for estimation of multivariate Poisson means and related problems. Ann. Statist. 13 706–726.
[2] Johnstone, I. (1984). Admissibility, difference equations and recurrence in estimating a Poisson mean. Ann. Statist. 12 1173–1198.
[3] Robbins, H. (1988). The u, v method of estimation. In Statistical Decision Theory and Related Topics. IV 1 (S. S. Gupta and J. O. Berger, eds.) 265–270. Springer, New York.
[4] Sackrowitz, H. B. and Samuel-Cahn, E. (1984). Estimation of the mean of a selected negative exponential population. J. R. Statist. Soc. 46 242–249.
[5] Zhang, C. (2005). Estimation of sums of random variables: examples and information bounds. Ann. Statist. 33 2022–2041.
IMS Lecture Notes–Monograph Series Complex Datasets and Inverse Problems: Tomography, Networks and Beyond Vol. 54 (2007) 177–186 © Institute of Mathematical Statistics, 2007 DOI: 10.1214/074921707000000148
Local polynomial regression on unknown manifolds
Peter J. Bickel¹ and Bo Li²
University of California, Berkeley and Tsinghua University

Abstract: We reveal the phenomenon that "naive" multivariate local polynomial regression can adapt to local smooth lower dimensional structure in the sense that it achieves the optimal convergence rate for nonparametric estimation of regression functions belonging to a Sobolev space when the predictor variables live on or close to a lower dimensional manifold.
1. Introduction

It is well known that worst case analysis of multivariate nonparametric regression procedures shows that performance deteriorates sharply as dimension increases. This is sometimes referred to as the curse of dimensionality. In particular, as initially demonstrated by [19, 20], if the regression function, $m(x)$, belongs to a Sobolev space with smoothness $p$, there is no nonparametric estimator that can achieve a faster convergence rate than $n^{-p/(2p+D)}$, where $D$ is the dimensionality of the predictor vector $X$.

On the other hand, there has recently been a surge in research on identifying intrinsic low dimensional structure from a seemingly high dimensional source, see [1, 5, 15, 21] for instance. In these settings, it is assumed that the observed high-dimensional data lie on a low dimensional smooth manifold. Examples of this situation are given in all of these papers; see also [14]. If we can estimate the manifold, we can expect that we should be able to construct procedures which perform as well as if we knew the structure. Even if the low dimensional structure obtains only in a neighborhood of a point, estimation at that point should be governed by actual rather than ostensible dimension. In this paper, we shall study this situation in the context of nonparametric regression, assuming the predictor vector has a lower dimensional smooth structure. We shall demonstrate the somewhat surprising phenomenon, suggested by Bickel in his 2004 Rietz lecture, that the procedures used with the expectation that the ostensible dimension $D$ is correct will, with appropriate adaptation not involving manifold estimation, achieve the optimal rate for manifold dimension $d$.
¹ 367 Evans Hall, Department of Statistics, University of California, Berkeley, CA, 94720-3860, USA, e-mail: [email protected]
² S414 Weilun Hall, School of Economics and Management, Tsinghua University, Beijing, 100084, China, e-mail: [email protected]
AMS 2000 subject classifications: primary 62G08, 62H12; secondary 62G20.
Keywords and phrases: local polynomial regression, manifolds.

Bickel conjectured in his 2004 Rietz lecture that, in predicting $Y$ from $X$ on the basis of a training sample, one could automatically adapt to the possibility that the apparently high dimensional $X$ that one observed, in fact, lived on a much smaller dimensional manifold and that the regression function was smooth on that manifold. The degree of adaptation here means that the worst case analyses for prediction are governed by smoothness of the function on the manifold and not on
the space in which $X$ ostensibly dwells, and that purely data dependent procedures can be constructed which achieve the lower bounds in all cases. In this paper, we make this statement precise with local polynomial regression. Local polynomial regression has been shown to be a useful nonparametric technique in various local modelling settings, see [8, 9]. We shall sketch in Section 2 that local linear regression achieves this phenomenon for local smoothness $p = 2$, and will also argue that our procedure attains the global IMSE if global smoothness is assumed. We shall also sketch how polynomial regression can achieve the appropriate higher rate if more smoothness is assumed.

A critical issue that needs to be faced is regularization, since the correct choice of bandwidth will depend on the unknown local dimension $d(x)$. Equivalently, we need to adapt to $d(x)$. We apply local generalized cross validation, with the help of an estimate of $d(x)$ due to [14]. We discuss this issue in Section 3. Finally we give some simulations in Section 4.

A closely related technical report, [2], came to our attention while this paper was in preparation. Binev et al. consider, in a very general way, the construction of nonparametric estimators of regression where the predictor variables are distributed according to a fixed, completely unknown distribution. In particular, although they did not consider this possibility, their method covers the case where the distribution of the predictor variables is concentrated on a manifold. However, their method is, for the moment, restricted to smoothness $p \le 1$ and their criterion of performance is the integral of pointwise mean square error with respect to the underlying distribution of the variables. Their approach is based on a tree construction which implicitly estimates the underlying measure as well as the regression.
Our discussion is considerably more restrictive by applying only to predictors taking values in a low dimensional manifold but more general in discussing estimation of the regression function at a point. Binev et al. promise a further paper where functions of general Lipschitz order are considered.

Our point in this paper is mainly a philosophical one. We can unwittingly take advantage of low dimensional structure without knowing it. We do not give careful minimax arguments, but rather, partly out of laziness, employ the semi-heuristic calculations present in much of the smoothing literature.

Here is our setup. Let $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, be i.i.d. $\mathbb{R}^{D+1}$-valued random vectors, where $X$ is a $D$-dimensional predictor vector and $Y$ is the corresponding univariate response variable. We aim to estimate the conditional mean $m_0(x) = E(Y \mid X = x)$ nonparametrically. Our crucial assumption is the existence of a local chart, i.e., each small patch of $\mathcal{X}$ (a neighborhood around $x$) is isomorphic to a ball in a $d$-dimensional Euclidean space, where $d = d(x) \le D$ may vary with $x$. (Since we fix our working point $x$, we will use $d$ for the sake of simplicity. The same rule applies to other notations which may also depend on $x$.) More precisely, let $B^d_{z,r}$ denote the ball in $\mathbb{R}^d$, centered at $z$ with radius $r$. A similar definition applies to $B^D_{x,R}$. For small $R > 0$, we consider the neighborhood of $x$, $\mathcal{X}_x := B^D_{x,R} \cap \mathcal{X}$, within $\mathcal{X}$. We suppose there is a continuously differentiable bijective map $\phi : B^d_{0,r} \to \mathcal{X}_x$. Under this assumption with $d < D$, the distribution of $X$ degenerates in the sense that it does not have positive density around $x$ with respect to Lebesgue measure on $\mathbb{R}^D$. However, the induced measure $Q$ on $B^d_{0,r}$, defined below, can have a non-degenerate density with respect to Lebesgue measure on $\mathbb{R}^d$. Let $S$ be an open subset of $\mathcal{X}_x$, and $\phi^{-1}(S)$ be its preimage in $B^{d(x)}_{0,r}$. Then $Q(Z \in \phi^{-1}(S)) = P(X \in S)$. We assume throughout that $Q$ admits a continuous positive density function $f(\cdot)$.
We proceed to our main result whose proof is given in the Appendix.
2. Local linear regression

[17] develop the general theory for multivariate local polynomial regression in the usual context, i.e., the predictor vector has a $D$-dimensional compact support in $\mathbb{R}^D$. We shall modify their proof to show the "naive" (brute-force) multivariate local linear regression achieves the "oracle" convergence rate for the function $m(\phi(z))$ on $B^d_{0,r}$. Local linear regression estimates the population regression function by $\hat\alpha$, where $(\hat\alpha, \hat\beta)$ minimize

$\sum_{i=1}^{n} \big\{ Y_i - \alpha - \beta^T (X_i - x) \big\}^2 K_h(X_i - x).$

Here $K_h(\cdot)$ is a $D$-variate kernel function. For the sake of simplicity, we choose the same bandwidth $h$ for each coordinate. Let

$X_x = \begin{pmatrix} 1 & (X_1 - x)^T \\ \vdots & \vdots \\ 1 & (X_n - x)^T \end{pmatrix}$

and $W_x = \mathrm{diag}\{K_h(X_1 - x), \ldots, K_h(X_n - x)\}$. Then the estimator of the regression function can be written as

$\hat m(x, h) = e_1^T (X_x^T W_x X_x)^{-1} X_x^T W_x Y,$

where $e_1$ is the $(D + 1) \times 1$ vector having 1 in the first entry and 0 elsewhere.

2.1. Decomposition of the conditional MSE

We enumerate the assumptions we need for establishing the main result. Let $M$ be a canonical finite positive constant.

(i) The kernel function $K(\cdot)$ is continuous and radially symmetric, hence bounded.
(ii) There exists an $\epsilon$ ($0 < \epsilon < 1$) such that the following asymptotic irrelevance conditions hold:
     $E\Big[ K^{\gamma}\Big(\frac{X - x}{h}\Big) w(X)\, 1\big( X \in (B^D_{x,h^{1-\epsilon}} \cap \mathcal{X})^c \big) \Big] = o(h^{d+2})$
     for $\gamma = 1, 2$ and $|w(x)| \le M(1 + |x|^2)$.
(iii) $v(x) = \mathrm{Var}(Y \mid X = x) \le M$.
(iv) The regression function $m(x)$ is twice differentiable, and $\big\| \frac{\partial^2 m}{\partial x_a \partial x_b} \big\|_\infty \le M$ for all $1 \le a \le b \le D$ if $x = (x_1, \ldots, x_D)$.
(v) The density $f(\cdot)$ is continuously differentiable and strictly positive at 0 in $B^d_{0,r}$.

Condition (ii) is satisfied if $K$ has exponential tails since, if $V = \frac{X - x}{h}$, the conditions can be written as

$E\Big[ K^{\gamma}(V)\, w(x + hV)\, 1\big( V \in (B^D_{0,h^{1-\epsilon}})^c \big) \Big] = o(h^{d+2}).$
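The local linear estimator $\hat m(x, h) = e_1^T (X_x^T W_x X_x)^{-1} X_x^T W_x Y$ defined at the start of this section can be sketched in a few lines of NumPy. This is an illustration only, under stated assumptions: the Gaussian kernel, the function names, and the use of a weighted least-squares solver in place of the explicit inverse are choices made for this sketch, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(u):
    """Radially symmetric Gaussian kernel K(u), evaluated on each row of u (n x D)."""
    D = u.shape[1]
    return np.exp(-0.5 * np.sum(u**2, axis=1)) / (2.0 * np.pi) ** (D / 2.0)

def local_linear(x, X, Y, h, kernel=gaussian_kernel):
    """Local linear estimate of m(x): the intercept of a kernel-weighted linear fit."""
    n, D = X.shape
    diff = X - x                                   # rows are (X_i - x)^T
    w = kernel(diff / h) / h**D                    # K_h(X_i - x) = h^{-D} K((X_i - x) / h)
    Xx = np.hstack([np.ones((n, 1)), diff])        # n x (D + 1) design matrix X_x
    sw = np.sqrt(w)
    # Weighted least squares solved with lstsq; when X_x^T W_x X_x is invertible this
    # equals (X_x^T W_x X_x)^{-1} X_x^T W_x Y, and it returns a minimum-norm solution
    # otherwise (a choice of this sketch).
    coef, *_ = np.linalg.lstsq(sw[:, None] * Xx, sw * Y, rcond=None)
    return coef[0]                                 # alpha_hat = m_hat(x, h)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = 1.0 + X @ np.array([2.0, -1.0, 0.5])           # exactly linear, noise-free response
x0 = np.array([0.1, -0.2, 0.3])
print(local_linear(x0, X, Y, h=0.5), 1.0 + x0 @ np.array([2.0, -1.0, 0.5]))
```

On an exactly linear, noise-free response the fit reproduces the regression function at any point, which gives a simple sanity check of the weighting and design matrix.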
Theorem 2.1. Let $x$ be an interior point in $\mathcal{X}$. Then under assumptions (i)-(v), there exist some $J_1(x)$ and $J_2(x)$ such that

$E\{\hat m(x, h) - m(x) \mid X_1, \ldots, X_n\} = h^2 J_1(x)(1 + o_P(1)),$
$\mathrm{Var}\{\hat m(x, h) - m(x) \mid X_1, \ldots, X_n\} = n^{-1} h^{-d} J_2(x)(1 + o_P(1)).$

Remark 1. The predictor vector doesn't need to lie on a perfectly smooth manifold. The same conclusion still holds as long as the predictor vector is "close" to a smooth manifold. Here "close" means the noise will not affect the first order of our asymptotics. That is, we think of $X_1, \ldots, X_n$ as being drawn from a probability distribution $P$ on $\mathbb{R}^D$ concentrated on the set

$\mathcal{X} = \{ y : |\phi(u) - y| \le \epsilon_n \text{ for some } u \in B^d_{0,r} \}$

and $\epsilon_n \to 0$ with $n$. It is easy to see from our arguments below that if $\epsilon_n = o(h)$, then our results still hold.

Remark 2. When the point of interest $x$ is on the boundary of the support $\mathcal{X}$, we can show that the bias and variance have similar asymptotic expansions, following Theorem 2.2 in [17]. But, given the extra complication of the embedding, the proof would be messier, and would not, we believe, add any insight. So we omit it.

2.2. Extensions

It's somewhat surprising but not hard to show that if we assume the regression function $m$ to be $p$ times differentiable with all partial derivatives of order $p$ bounded ($p \ge 2$, an integer), we can construct estimates $\hat m$ such that

$E\{\hat m(x, h) - m(x) \mid X_1, \ldots, X_n\} = h^p J_1(x)(1 + o_P(1)),$
$\mathrm{Var}\{\hat m(x, h) - m(x) \mid X_1, \ldots, X_n\} = n^{-1} h^{-d} J_2(x)(1 + o_P(1)),$

yielding the usual rate of $n^{-2p/(2p+d)}$ for the conditional MSE of $\hat m(x, h)$ if $h$ is chosen optimally, $h = \lambda n^{-1/(2p+d)}$. This requires replacing local linear regression with local polynomial regression with a polynomial of order $p - 1$. We do not need to estimate the manifold as we might expect, since the rate at which the bias term goes to 0 is derived by first applying Taylor expansion with respect to the original predictor components, then obtaining the same rate in the lower dimensional space by a first order approximation of the manifold map. Essentially all we need is that, locally, the geodesic distance is roughly proportionate to the Euclidean distance.

3. Bandwidth selection

As usual this tells us, for $p = 2$, that we should use bandwidth $\lambda n^{-1/(4+d)}$ to achieve the best rate of $n^{-2/(4+d)}$. This requires knowledge of the local dimension as well as the usual difficult choice of $\lambda$. More generally, dropping the requirement that the bandwidth for all components be the same, we need to estimate $d$ and choose the constants corresponding to each component in a simple data determined way. There is an enormous literature on bandwidth selection. There are three main approaches: plug-in ([7, 16, 18], etc.); the bootstrap ([3, 11, 12], etc.) and cross validation ([6, 10, 22], etc.). The first has always seemed logically inconsistent to
us since it requires higher order smoothness of $m$ than is assumed and if this higher order smoothness holds we would not use linear regression but a higher order polynomial. See also the discussion of [23].

We propose to use a blockwise cross-validation procedure defined as follows. Let the data be $(X_i, Y_i)$, $1 \le i \le n$. We consider a block of data points $\{(X_j, Y_j) : j \in J\}$, with $|J| = n_1$. Assuming the covariates have been standardized, we choose the same bandwidth $h$ for all the points and all coordinates within the block. A leave-one-out cross validation with respect to the block, while using the whole data set, is defined as follows. For each $j \in J$, let $\hat m_{-j,h}(X_j)$ be the estimated regression function (evaluated at $X_j$) via local linear regression with the whole data set except $X_j$. In contrast to the usual leave-one-out cross-validation procedure, our modified leave-one-out cross-validation criterion is defined as

$mCV(h) = \frac{1}{n_1} \sum_{j \in J} \big( Y_j - \hat m_{-j,h}(X_j) \big)^2.$

Using a result from [23], it can be shown that

$mCV(h) = \frac{1}{n_1} \sum_{j \in J} \frac{(Y_j - \hat m_h(X_j))^2}{(1 - S_h(j, j))^2},$

where $S_h(j, j)$ is the $j$th diagonal element of the smoothing matrix $S_h$. We adopt the GCV idea proposed by [4] and replace the $S_h(j, j)$ by their average $atr_J(S_h) = \frac{1}{n_1} \sum_{j \in J} S_h(j, j)$. Thereby our modified generalized cross-validation criterion is

$mGCV(h) = \frac{1}{n_1} \sum_{j \in J} \frac{(Y_j - \hat m_h(X_j))^2}{(1 - atr_J(S_h))^2}.$

The bandwidth $h$ is chosen to minimize this criterion function.

We give some heuristics justifying the (blockwise homoscedastic) mGCV. In a manner analogous to [23], we can show

$S_h(j, j) = e_1^T (X_x^T W_x X_x)^{-1} e_1 K_h(0) \big|_{x = X_j}.$

In view of (A.2) in the Appendix, we see $S_h(j, j) = n^{-1} h^{-d} K(0)(A_1(X_j) + o_p(1))$. Thus as $n^{-1} h^{-d} \to 0$,

$atr_J(S_h) = n^{-1} h^{-d} K(0) \Big( n_1^{-1} \sum_{j \in J} A_1(X_j) + o_p(1) \Big) = O_p(n^{-1} h^{-d}) = o_p(1).$

Then, as is discussed in [22], using the approximation $(1 - x)^{-2} \approx 1 + 2x$ for small $x$, we can rewrite $mGCV(h)$ as

$mGCV(h) \approx \frac{1}{n_1} \sum_{j \in J} (Y_j - \hat m_h(X_j))^2 + \frac{2\, tr_J(S_h)}{n_1} \cdot \frac{1}{n_1} \sum_{j \in J} (Y_j - \hat m_h(X_j))^2,$

where $tr_J(S_h) = \sum_{j \in J} S_h(j, j) = n_1\, atr_J(S_h)$. Now regarding $\frac{1}{n_1} \sum_{j \in J} (Y_j - \hat m_h(X_j))^2$ in the second term as an estimator of the constant variance for the focused block, the mGCV is approximately the same as the $C_p$ criterion, which is an estimator of the prediction error up to a constant.

In practice, we first use [14]'s approach to estimate the local dimension $d$, which yields a consistent estimate $\hat d$ of $d$. Based on the estimated intrinsic dimensionality $\hat d$, a set of candidate bandwidths $CB = \{\lambda_1 n^{-1/(\hat d + 4)}, \ldots, \lambda_B n^{-1/(\hat d + 4)}\}$ ($\lambda_1 < \cdots < \lambda_B$) is chosen. We pick the one minimizing the $mGCV(h)$ function.
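The blockwise selection rule described above can be sketched as follows. This is a minimal illustration under assumptions: a Gaussian kernel is used instead of the Epanechnikov kernel of Section 4 so that every weight is positive, the local Gram matrix is assumed invertible, and all function names, the block, and the multipliers `lambdas` are hypothetical.

```python
import numpy as np

def gaussian_kernel(u):
    """Radially symmetric Gaussian kernel K(u), evaluated on each row of u (n x D)."""
    D = u.shape[1]
    return np.exp(-0.5 * np.sum(u**2, axis=1)) / (2.0 * np.pi) ** (D / 2.0)

def kernel_at_zero(D, h):
    """K_h(0) = h^{-D} K(0)."""
    return gaussian_kernel(np.zeros((1, D)))[0] / h**D

def local_fit(x, X, Y, h):
    """Return (m_hat(x, h), e1^T (Xx^T Wx Xx)^{-1} e1) for local linear regression at x."""
    n, D = X.shape
    diff = X - x
    w = gaussian_kernel(diff / h) / h**D
    Xx = np.hstack([np.ones((n, 1)), diff])
    G = Xx.T @ (w[:, None] * Xx)                          # Xx^T Wx Xx (assumed invertible)
    g = np.linalg.solve(G, np.eye(D + 1)[:, 0])           # G^{-1} e1; G is symmetric
    return g @ (Xx.T @ (w * Y)), g[0]

def mgcv(h, X, Y, block):
    """Modified GCV over a block: mean squared residual divided by (1 - atr_J(S_h))^2."""
    D = X.shape[1]
    resid2, lev = [], []
    for j in block:
        m_hat, g11 = local_fit(X[j], X, Y, h)
        resid2.append((Y[j] - m_hat) ** 2)
        lev.append(g11 * kernel_at_zero(D, h))            # S_h(j, j)
    return np.mean(resid2) / (1.0 - np.mean(lev)) ** 2

def select_bandwidth(X, Y, block, d_hat, lambdas):
    """Pick h from {lambda * n^{-1/(d_hat + 4)}} minimizing mGCV over the block."""
    n = len(Y)
    candidates = [lam * n ** (-1.0 / (d_hat + 4.0)) for lam in lambdas]
    scores = [mgcv(h, X, Y, block) for h in candidates]
    return candidates[int(np.argmin(scores))], scores
```

A call such as `select_bandwidth(X, Y, block=range(50, 150), d_hat=1.0, lambdas=[0.5, 1.0, 2.0])` returns the selected bandwidth together with the criterion values for all candidates. The leave-one-out identity behind mCV can be checked numerically with `local_fit` by refitting without one observation.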
4. Numerical experiments

The data generating process is as follows. The predictor vector is $X = (X_{(1)}, X_{(2)}, X_{(3)})$, where $X_{(1)}$ is sampled from a standard normal distribution, $X_{(2)} = X_{(1)}^3 + \sin(X_{(1)}) - 1$, and $X_{(3)} = \log(X_{(1)}^2 + 1) - X_{(1)}$. The regression function is $m(x) = m(x_{(1)}, x_{(2)}, x_{(3)}) = \cos(x_{(1)}) + x_{(2)} - x_{(3)}^2$. The response variable $Y$ is generated via the mechanism $Y = m(X) + \varepsilon$, where $\varepsilon$ has a standard normal distribution. By definition, the 3-dimensional regression function $m(x)$ is essentially a 1-dimensional function of $x_{(1)}$. $n = 200$ samples are drawn. The predictors are standardized before estimation. We estimate the regression function $m(x)$ by both the "oracle" univariate local linear (ull) regression with a single predictor $X_{(1)}$ and our blind 3-variate (multivariate) local linear (mll) regression with all predictors $X_{(1)}, X_{(2)}, X_{(3)}$. We focus on the middle block with 100 data points, with the number-of-neighbors parameter $k$, needed for Levina and Bickel's estimate, set to 15. The intrinsic dimension estimator is $\hat d = 1.023$, which is close to the true dimension, $d = 1$. We use the Epanechnikov kernel in our simulation. Our proposed modified GCV procedure is applied to both the ull and mll procedures. The estimation results are displayed in Figure 1. The x-axis is the standardized $X_{(1)}$. From the right panel, we see the blind mll indeed performs almost as well as the "oracle" ull.

Next, we allow the predictor vector to only lie close to a manifold. Specifically, we sample $X'_{(1)} = X_{(1)} + \epsilon_1$, $X'_{(2)} = X_{(1)}^3 + \sin(X_{(1)}) - 1 + \epsilon_2$, $X'_{(3)} = \log(X_{(1)}^2 + 1) - X_{(1)} + \epsilon_3$, where $X_{(1)}$ is sampled from a standard normal distribution, and $\epsilon_1$, $\epsilon_2$ and $\epsilon_3$ are sampled from $N(0, \sigma^2)$. The noise scale is hence governed by $\sigma$. In our experiment, $\sigma$ is set to be 0.02, 0.04, ..., 0.18, 0.20 respectively. The predictor vector samples are visualized in the left panel of Figure 2 with $\sigma = 0.20$. In the maximum noise scale case, the pattern of the predictor vector is somewhat vague.
Again, a blind “mll” estimation is done with respect to new data generated in the aforementioned way. We plot the MSEs associated with different noise scales in the right panel of Figure 2. The moderate noise scales we’ve considered indeed don’t have a significant influence on the performance of the “mll” estimator in terms of MSE.
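The data generating process of this section, together with a nearest-neighbor maximum likelihood estimate of intrinsic dimension in the spirit of [14], can be sketched as below. Several points are assumptions of this sketch rather than statements from the paper: the exponents in $X_{(2)}$ and $X_{(3)}$ follow the reading $X_{(1)}^3$ and $X_{(1)}^2$, the response is computed from the observed (possibly noisy) predictors, the dimension estimate is averaged over all points rather than a block, and the function names and seed are arbitrary.

```python
import numpy as np

def generate(n=200, sigma=0.0, seed=0):
    """Simulate standardized predictors near a 1-dimensional curve in R^3, and responses."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(n)
    curve = np.column_stack([x1,
                             x1**3 + np.sin(x1) - 1.0,
                             np.log(x1**2 + 1.0) - x1])
    X = curve + sigma * rng.standard_normal(curve.shape)   # sigma = 0: perfect embedding
    m = np.cos(X[:, 0]) + X[:, 1] - X[:, 2] ** 2            # regression function m(x)
    Y = m + rng.standard_normal(n)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)            # standardize before estimation
    return X_std, Y

def mle_intrinsic_dimension(X, k=15):
    """Average over points of the k-nearest-neighbor MLE of intrinsic dimension.

    For each point, with T_j the distance to its j-th nearest neighbor, the local
    estimate is the inverse of the mean of log(T_k / T_j) over j = 1, ..., k - 1.
    """
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    dist = np.sqrt(np.sort(sq, axis=1)[:, 1:k + 1])         # drop the zero self-distance
    logs = np.log(dist[:, -1][:, None] / dist[:, :-1])      # log(T_k / T_j), j < k
    return np.mean((k - 1) / logs.sum(axis=1))

X, Y = generate(n=200, sigma=0.0)
print(mle_intrinsic_dimension(X, k=15))   # expected to be near the true dimension 1
```

The estimate returned here need not match the value 1.023 reported above, since the random draws and the averaging scheme differ.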
Fig 1. The case with perfect embedding. The left panel shows the complete data and fitting of the middle block by both univariate local linear (ull) regression and multivariate local linear (mll) regression with bandwidths chosen via our modified GCV. The focused block is amplified in the right panel.
Fig 2. The case with “imperfect” embedding. The left panel shows the predictor vector in a 3-D fashion with the noise scale σ = 0.2. The right panel gives the MSEs with respect to increasing noise scales.
Appendix

Proof of Theorem 2.1. Using the notation of [17], $H_m(x)$ is the $D \times D$ Hessian matrix of $m(x)$ at $x$, and

$Q_m(x) = \big[ (X_1 - x)^T H_m(x)(X_1 - x), \cdots, (X_n - x)^T H_m(x)(X_n - x) \big]^T.$

Ruppert and Wand have obtained the bias term

(A.1)   $E(\hat m(x, h) - m(x) \mid X_1, \cdots, X_n) = \frac{1}{2}\, e_1^T (X_x^T W_x X_x)^{-1} X_x^T W_x \{ Q_m(x) + R_m(x) \},$

where, if $|\cdot|$ denotes Euclidean norm, $|R_m(x)|$ is of lower order than $|Q_m(x)|$. Also we have

$n^{-1} X_x^T W_x X_x = \begin{pmatrix} n^{-1} \sum_{i=1}^{n} K_h(X_i - x) & n^{-1} \sum_{i=1}^{n} K_h(X_i - x)(X_i - x)^T \\ n^{-1} \sum_{i=1}^{n} K_h(X_i - x)(X_i - x) & n^{-1} \sum_{i=1}^{n} K_h(X_i - x)(X_i - x)(X_i - x)^T \end{pmatrix}.$

The difference in our context lies in the following asymptotics.

$E K_h(X_i - x) = E\big[ K_h(X_i - x)\, 1\big( X_i \in B^D_{x,h^{1-\epsilon}} \cap \mathcal{X} \big) \big] + E\big[ K_h(X_i - x)\, 1\big( X_i \in (B^D_{x,h^{1-\epsilon}} \cap \mathcal{X})^c \big) \big]$
$\overset{(ii)}{=} h^{-D} \int_{\mathcal{N}^d_{0,h^{1-\epsilon}}} K\Big( \frac{\phi(z) - \phi(0)}{h} \Big) f(z)\, dz + o_P(h^d)$
$= h^{d-D} \Big( f(0) \int_{\mathbb{R}^d} K(\nabla\phi(0) u)\, du + o_P(1) \Big)$
$= h^{d-D} \big( A_1(x) + o_P(1) \big).$

Thus, by the LLN, we have

$n^{-1} \sum_{i=1}^{n} K_h(X_i - x) = h^{d-D} \big( A_1(x) + o_P(1) \big).$
Similarly, there exist some $A_2(x)$ and $A_3(x)$ such that

$n^{-1} \sum_{i=1}^{n} K_h(X_i - x)(X_i - x) = h^{2+d-D} \big( A_2(x) + o_P(1) \big)$

and

$n^{-1} \sum_{i=1}^{n} K_h(X_i - x)(X_i - x)(X_i - x)^T = h^{2+d-D} \big( A_3(x) + o_P(1) \big),$

where we used assumption (i) to remove the term of order $h^{1+d-D}$ in deriving the asymptotic behavior of $n^{-1} \sum_{i=1}^{n} K_h(X_i - x)(X_i - x)$. Invoking Woodbury's formula, as in the proof of Lemma 5.1 in [13], leads us to

(A.2)   $\big( n^{-1} X_x^T W_x X_x \big)^{-1} = h^{D-d} \begin{pmatrix} A_1(x)^{-1} + o_P(1) & O_P(1) \\ O_P(1) & h^{-2} O_P(1) \end{pmatrix}.$

On the other hand,

$n^{-1} X_x^T W_x Q_m(x) = \begin{pmatrix} n^{-1} \sum_{i=1}^{n} K_h(X_i - x)(X_i - x)^T H_m(x)(X_i - x) \\ n^{-1} \sum_{i=1}^{n} \{ K_h(X_i - x)(X_i - x)^T H_m(x)(X_i - x) \}(X_i - x) \end{pmatrix}.$

In a similar fashion, we can deduce that for some $B_1(x)$, $B_2(x)$,

$n^{-1} \sum_{i=1}^{n} K_h(X_i - x)(X_i - x)^T H_m(x)(X_i - x) = h^{2+d-D} \big( B_1(x) + o_P(1) \big)$

and

$n^{-1} \sum_{i=1}^{n} \{ K_h(X_i - x)(X_i - x)^T H_m(x)(X_i - x) \}(X_i - x) = h^{3+d-D} \big( B_2(x) + o_P(1) \big).$

We have

(A.3)   $n^{-1} X_x^T W_x Q_m(x) = h^{d-D} \begin{pmatrix} h^2 \big( B_1(x) + o_P(1) \big) \\ h^3 \big( B_2(x) + o_P(1) \big) \end{pmatrix}.$

It follows from (A.1), (A.2) and (A.3) that the bias admits the following approximation.

(A.4)   $E(\hat m(x, h) - m(x) \mid X_1, \ldots, X_n) = h^2 A_1(x)^{-1} B_1(x) + o_P(h^2).$

Next, we move to the variance term.

(A.5)   $\mathrm{Var}\{\hat m(x, h) \mid X_1, \ldots, X_n\} = e_1^T (X_x^T W_x X_x)^{-1} X_x^T W_x V W_x X_x (X_x^T W_x X_x)^{-1} e_1.$

The upper-left entry of $n^{-1} X_x^T W_x V W_x X_x$ is

$n^{-1} \sum_{i=1}^{n} K_h(X_i - x)^2 v(X_i) = h^{d-2D} C_1(x)(1 + o_P(1)).$

The upper-right block is

$n^{-1} \sum_{i=1}^{n} K_h(X_i - x)^2 (X_i - x)^T v(X_i) = h^{1+d-2D} C_2(x)(1 + o_P(1))$
and the lower-right block is

$n^{-1} \sum_{i=1}^{n} K_h(X_i - x)^2 (X_i - x)(X_i - x)^T v(X_i) = h^{2+d-2D} C_3(x)(1 + o_P(1)).$

In light of (A.2), we arrive at

(A.6)   $\mathrm{Var}\{\hat m(x, h) \mid X_1, \ldots, X_n\} = n^{-1} h^{-d} A_1(x)^{-2} C_1(x)(1 + o_P(1)).$
The proof is complete.

Acknowledgment. We thank Ya'acov Ritov for insightful comments.

References
[1] Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15 1373–1396.
[2] Binev, P., Cohen, A., Dahmen, W., DeVore, R. and Temlyakov, V. (2004). Universal algorithms for learning theory part I: piecewise constant functions. IMI technical reports, SCU.
[3] Cao-Abad, R. (1991). Rate of convergence for the wild bootstrap in nonparametric regression. Ann. Statist. 19 2226–2231. MR1135172
[4] Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik 31 377–403. MR516581
[5] Donoho, D. and Grimes, C. (2003). Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. USA 100 5591–5596 (electronic).
[6] Dudoit, S. and van der Laan, M. (2005). Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Stat. Methodol. 2 131–154. MR2161394
[7] Fan, J. and Gijbels, I. (1995). Data-driven bandwidth selection in local polynomial fitting: variable bandwidth and spatial adaptation. J. Roy. Statist. Soc. Ser. B 57 371–394. MR1323345
[8] Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London. MR1383587
[9] Fan, J. and Gijbels, I. (2000). Local polynomial fitting. In Smoothing and Regression. Approaches, Computation and Application (M.G. Schimek, ed.) 228–275. Wiley, New York.
[10] Györfi, L., Kohler, M., Krzyżak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer, New York. MR1920390
[11] Hall, P., Lahiri, S. and Truong, Y. (1995). On bandwidth choice for density estimation with dependent data. Ann. Statist. 23 2241–2263. MR1389873
[12] Härdle, W. and Mammen, E. (1991). Bootstrap methods in nonparametric regression. In Nonparametric Functional Estimation and Related Topics (Spetses, 1990) 335 111–123. Kluwer Acad. Publ., Dordrecht. MR1154323
[13] Lafferty, J. and Wasserman, L. (2005). Rodeo: Sparse nonparametric regression in high dimensions. Technical report, CMU.
[14] Levina, E. and Bickel, P. J. (2005). Maximum likelihood estimation of intrinsic dimension. Advances in NIPS 17. MIT Press.
[15] Roweis, S. and Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290 2323–2326.
[16] Ruppert, D. (1997). Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation. J. Amer. Statist. Assoc. 92 1049–1062. MR1482136
[17] Ruppert, D. and Wand, M. P. (1994). Multivariate locally weighted least squares regression. Ann. Statist. 22 1346–1370. MR1311979
[18] Ruppert, D., Sheather, S. J. and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. J. Amer. Statist. Assoc. 90 1257–1270. MR1379468
[19] Stone, C. J. (1980). Optimal rates of convergence for nonparametric estimators. Ann. Statist. 8 1348–1360. MR594650
[20] Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10 1040–1053. MR673642
[21] Tenenbaum, J. B., De Silva, V. and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science 290 2319–2323.
[22] Wang, Y. (2004). Model selection. In Handbook of Computational Statistics 437–466. Springer, New York. MR2090150
[23] Zhang, C. (2003). Calibrating the degrees of freedom for automatic data smoothing and effective curve checking. J. Amer. Statist. Assoc. 98 609–628. MR2011675
IMS Lecture Notes–Monograph Series
Complex Datasets and Inverse Problems: Tomography, Networks and Beyond
Vol. 54 (2007) 187–202
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000157
Shape restricted regression with random Bernstein polynomials

I-Shou Chang^{1,*}, Li-Chu Chien^{2,*}, Chao A. Hsiung^{2}, Chi-Chung Wen^{3} and Yuh-Jenn Wu^{4}

National Health Research Institutes, Tamkang University, and Chung Yuan Christian University, Taiwan

Abstract: Shape restricted regressions, including isotonic regression and concave regression as special cases, are studied using priors on Bernstein polynomials and Markov chain Monte Carlo methods. These priors have large supports, select only smooth functions, can easily incorporate geometric information into the prior, and can be generated without computational difficulty. Algorithms generating priors and posteriors are proposed, and simulation studies are conducted to illustrate the performance of this approach. Comparisons with the density-regression method of Dette et al. (2006) are included.
* Partially supported by NSC grant, NSC 94-3112-B-400-002-Y.
1 Institute of Cancer Research and Division of Biostatistics and Bioinformatics, National Health Research Institutes, 35 Keyan Road, Zhunan Town, Miaoli County 350, Taiwan, e-mail: [email protected]
2 Division of Biostatistics and Bioinformatics, National Health Research Institutes, 35 Keyan Road, Zhunan Town, Miaoli County 350, Taiwan, e-mail: [email protected]; [email protected]
3 Department of Mathematics, Tamkang University, 151 Ying-chuan Road, Tamsui Town, Taipei County, 251, Taiwan, e-mail: [email protected]
4 Department of Applied Mathematics, Chung Yuan Christian University, 200 Chung Pei Road, Chung Li City 320, Taiwan, e-mail: [email protected]
AMS 2000 subject classifications: primary 62F15, 62G08; secondary 65D10.
Keywords and phrases: Bayesian concave regression, Bayesian isotonic regression, geometric prior, Markov chain Monte Carlo, Metropolis-Hastings reversible jump algorithm.

1. Introduction

Estimation of a regression function with shape restriction is of considerable interest in many practical applications. Typical examples include the study of dose response experiments in medicine and the study of utility functions, product functions, profit functions and cost functions in economics, among others. Starting from the classic works of Brunk [4] and Hildreth [17], there exists a large literature on the problem of estimating monotone, concave or convex regression functions. Because some of these estimates are not smooth, much effort has been devoted to the search for a simple, smooth and efficient estimate of a shape restricted regression function. Major approaches to this problem include: the projection methods for constrained smoothing, which are discussed in Mammen et al. [24] and contain smoothing spline methods, kernel and local polynomial methods and others as special cases; the isotonic regression approach studied by Mukerjee [25], Mammen [22, 23] and others; the tilting method proposed by Hall and Huang [16]; and the density-regression method proposed by Dette, Neumeyer and Pilz [11]. We note that both of the last two methods enjoy the same level of smoothness as the unconstrained counterpart and are applicable to general smoothing methods. Besides, the density-regression method is particularly computationally efficient and has a wide applicability. In fact, the density-regression method was used to provide an efficient and smooth convex estimate for convex regression by Birke and Dette [3].

This paper studies a nonparametric Bayes approach to shape restricted regression where the prior is introduced by Bernstein polynomials. This prior has a large support, selects only smooth functions, can easily incorporate geometric information, and can be generated without computational difficulty. We note that Lavine and Mockus [18] also discussed Bayes methods for isotonic regression, but the prior they use is a Dirichlet process, whose sample paths are step functions. In addition to the above desirable properties, our approach can be applied to quite general shape restricted statistical inference problems, although we consider mainly isotonic regression and concave (convex) regression in this paper.

To facilitate the discussion, we first introduce some notation. For integers $0 \le i \le n$, let $\varphi_{i,n}(t) = C_i^n t^i (1 - t)^{n-i}$, where $C_i^n = n!/(i!(n - i)!)$. The set $\{\varphi_{i,n} \mid i = 0, \dots, n\}$ is called the Bernstein basis for polynomials of order $n$. Let $B_n = \mathbb{R}^{n+1}$ and $\mathcal{B} = \bigcup_{n=1}^{\infty} (\{n\} \times B_n)$. Let $\pi$ be a probability measure on $\mathcal{B}$. For $\tau > 0$, we define $\mathbb{F} : \mathcal{B} \times [0, \tau] \to \mathbb{R}^1$ by

\[ \mathbb{F}(n, b_{0,n}, \dots, b_{n,n}, t) = \sum_{i=0}^{n} b_{i,n}\, \varphi_{i,n}\!\left(\frac{t}{\tau}\right), \tag{1.1} \]

where $(n, b_{0,n}, \dots, b_{n,n}) \in \mathcal{B}$ and $t \in [0, \tau]$. We also denote (1.1) by $F_{b_n}(t)$ if $b_n = (b_{0,n}, \dots, b_{n,n})$. The probability measure $\pi$ is called a Bernstein prior, and $\mathbb{F}$ is called the random Bernstein polynomial for $\pi$. It is a stochastic process on $[0, \tau]$ with smooth sample paths.

Important references for Bernstein polynomials include Lorentz [20] and Altomare and Campiti [2], among others. It is well-known that Bernstein polynomials are popular in curve design (Prautzsch et al. [28]). Bernstein bases have played important roles in nonparametric curve estimation and in Bayesian statistical theory. Good examples of the former include Tenbusch [31] and some of the references therein. For the latter, we note that the Beta density $\beta(x; a, b) = x^{a-1}(1 - x)^{b-1}/B(a, b)$ is itself a Bernstein polynomial and mixtures of Beta densities of the kind $\sum_{j=1}^{n} w_j \beta(x; a_j, b_j)$, where $w_j \ge 0$ are random with $\sum_{j=1}^{n} w_j = 1$, were used to introduce priors that only select smooth density functions on $[0, 1]$; see Dalal and Hall [8], Diaconis and Ylvisaker [9], and Mallik and Gelfand [21], and references therein. We also note that Petrone [26] and Petrone and Wasserman [27] studied priors on the set of distribution functions on $[0, 1]$ that are specified by $\sum_{i=0}^{n} \tilde{F}(i/n)\varphi_{i,n}(t)$ with $\tilde{F}$ being a Dirichlet process; this prior was referred to as a Bernstein-Dirichlet prior.

For a continuous function $F$ on $[0, \tau]$, $\sum_{i=0}^{n} F(i\tau/n)\varphi_{i,n}(t/\tau)$ is called the $n$-th order Bernstein polynomial of $F$ on $[0, \tau]$. We will see in Section 2 that much of the geometry of $F$ is preserved by its Bernstein polynomials and much of the geometry of a Bernstein polynomial can be read off from its coefficients. This, together with the Bernstein-Weierstrass approximation theorem, suggests the possibility of a Bernstein prior on a space of functions with large enough support and specific geometric properties. These ideas were developed in Chang et al.
[7] for Bayesian inference of a convex cumulative hazard with right censored data. The purpose of this paper is to indicate that the above ideas are also useful in the estimation of shape restricted regressions.

The regression model we consider in this paper assumes that, given $F_{b_n}$ satisfying a certain shape restriction,

\[ Y_{jk} = F_{b_n}(X_k) + \epsilon_{jk}, \]

where $X_k$ are design points, $Y_{jk}$ are response variables and $\epsilon_{jk}$ are errors. We will investigate isotonic regression and concave (convex) regression in some detail. In particular, we will examine the numerical performance of this approach and offer a comparison with the newly-developed density-regression method of Dette et al. [11] in simulation studies. We also indicate without elaboration that this approach can be used to study a regression function that is smooth and unimodal, or that is smooth and is constantly zero initially, then increasing for a while, and finally decreasing. We note that the latter can be used to model the time course expression level of a virus gene, as discussed in Chang et al. [5]; a virus gene typically starts to express after it gets into a cell and its expression level is zero initially, increases for a while and then decreases.

An important merit of this approach is that both the prior and the posterior are very easy to generate for inference with a Bernstein prior. Because the prior is defined on the union of subspaces of different dimensions, we adapt the Metropolis-Hastings reversible jump algorithm (MHRA) (Green [15]) to calculate the posterior distributions in this paper.

This paper is organized as follows. Section 2 introduces the model, provides statements that exemplify the relationship between the shape of the graph of (1.1) and its coefficients $b_0, \dots, b_n$, and gives conditions under which the prior has full support with desired geometric properties. Section 3 illustrates the use of Bernstein priors in conducting Bayesian inference; in particular, suitable Bernstein priors are introduced for isotonic regression and unimodal concave (convex) regression, and Markov chain Monte Carlo approaches to generate the posterior are proposed. Section 4 provides simulation studies to compare our methods with the density-regression method. Section 5 is a discussion on possible extensions.

2. The Bernstein priors

2.1. Bernstein polynomial geometry

Let $F_a(t) = \sum_{i=0}^{n} a_i \varphi_{i,n}(t/\tau)$. This subsection presents a list of statements concerning the relationship between the shape of $F_a$ and its coefficients $a_0, \dots, a_n$. This list extends that in Chang et al. [7] and is by no means complete; similar statements can be made for a monotone and convex or a monotone and sigmoidal function, for example. They are useful in taking into account the geometric prior information for regression analysis.

Proposition 1.
(i) (Monotone) If $a_0 \le a_1 \le \dots \le a_n$, then $F_a'(t) \ge 0$ for every $t \in [0, \tau]$.
(ii) (Unimodal Concave) Let $n \ge 2$. If $a_1 - a_0 > 0$, $a_n - a_{n-1} < 0$, and $a_{i+1} + a_{i-1} \le 2a_i$ for every $i = 1, \dots, n - 1$, then $F_a'(0) > 0$, $F_a'(\tau) < 0$, and $F_a''(t) \le 0$ for every $t \in [0, \tau]$.
(iii) (Unimodal) Let $n \ge 3$. If $a_0 = a_1 = \dots = a_{l_1} < a_{l_1+1} \le a_{l_1+2} \le \dots \le a_{l_2}$ and $a_{l_2} \ge a_{l_2+1} \ge \dots \ge a_{l_3} > a_{l_3+1} = \dots = a_n$ for some $0 \le l_1 < l_2 < l_3 \le n$, then there exists $s \in (0, \tau)$ such that $s$ is the unique maximum point of $F_a$ and $F_a$ is strictly increasing on $[0, s]$ and strictly decreasing on $[s, \tau]$. Furthermore, if $l_1 > 0$, then $F_a^{(i)}(0) = 0$ for $i = 1, \dots, l_1$, and if $l_3 < n - 1$, then $F_a^{(i)}(\tau) = 0$ for $i = 1, \dots, n - l_3 - 1$.

In this paper, derivatives at $0$ and $\tau$ are meant to be one-sided. We note that, in Proposition 1, (ii) provides a sufficient condition under which $F_a$ is a unimodal concave function and (iii) provides a sufficient condition under which $F_a$ is a unimodal smooth function whose function and derivative values at $0$ and $\tau$ may be prescribed to be $0$. Let $F_a$ satisfy (iii) with $\tau = 1$ and $l_1 > 1$, and let $F$ be defined by $F(t) = F_a\!\left(\frac{t - \tau_1}{\tau - \tau_1}\right) I_{(\tau_1, \tau]}(t)$ for some $\tau_1 \in (0, \tau)$; then $F$ is a non-negative smooth function on $[0, \tau]$ and it is zero initially for a while, then increases, and finally decreases. As we mentioned earlier, functions with this kind of shape restriction are useful in the study of the time course expression profile of a virus gene. We note that the results presented here are for concave regressions or unimodal regressions and similar results hold for convex regressions.

Proof. Without loss of generality, we assume $\tau = 1$. Noting that the derivatives are

\[ F_a'(t) = n \sum_{i=0}^{n-1} (a_{i+1} - a_i)\, \varphi_{i,n-1}(t) \quad\text{and}\quad F_a''(t) = n(n - 1) \sum_{i=0}^{n-2} (a_{i+2} - 2a_{i+1} + a_i)\, \varphi_{i,n-2}(t), \]

we obtain (i) and (ii) immediately. We now prove (iii) for the case $l_1 = 1$ and $l_3 = n - 1$; the proofs for other cases are similar and hence omitted.

Let $\phi(x) = \sum_{i=0}^{n-1} C_i^{n-1} (a_{i+1} - a_i) x^i$; then $F_a'(t) = n(1 - t)^{n-1} \phi\!\left(\frac{t}{1 - t}\right)$. Using $a_0 = a_1 < a_2 \le \dots \le a_l$ and $a_l \ge a_{l+1} \ge \dots \ge a_{n-1} > a_n$, we know $\phi$ is a polynomial whose number of sign changes is exactly one. This together with Descartes' sign rule (Anderson et al. [1]) implies that $\phi$ has at most one root in $(0, \infty)$, and hence $F_a'$ has at most one root in $(0, 1)$. Using $a_0 = a_1$ and $a_2 - a_1 > 0$, we know $F_a'(\epsilon) > 0$ for $\epsilon$ positive and small enough. This combined with $F_a'(1) = n(a_n - a_{n-1}) < 0$ shows that $F_a'$ has at least one root in $(0, 1)$. Thus, $F_a'$ has exactly one root $s$ in $(0, 1)$, and $F_a'$ is positive on $(0, s)$ and negative on $(s, 1]$. Therefore, the conclusion of (iii) follows. This completes the proof.

Proposition 2 below complements Proposition 1 and provides Bernstein-Weierstrass approximations for functions with specific shape restrictions.
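As a numerical aside on Proposition 1 (i) and (ii), the following minimal sketch evaluates $F_a$ and uses the derivative formula from the proof to check the sign conditions on a grid. The function names and the two coefficient vectors are hypothetical choices for illustration, not taken from the paper.

```python
from math import comb

def bernstein_poly(a, t, tau=1.0):
    """Evaluate F_a(t) = sum_i a_i * phi_{i,n}(t / tau), with n = len(a) - 1."""
    n = len(a) - 1
    u = t / tau
    return sum(a[i] * comb(n, i) * u**i * (1 - u)**(n - i) for i in range(n + 1))

def derivative_coeffs(a, tau=1.0):
    """Bernstein coefficients (order n - 1) of F_a', i.e. (n / tau) * (a_{i+1} - a_i)."""
    n = len(a) - 1
    return [n * (a[i + 1] - a[i]) / tau for i in range(n)]

# Hypothetical coefficient vectors chosen for illustration.
monotone = [0.0, 0.1, 0.4, 0.9, 1.0]   # a_0 <= a_1 <= ... <= a_n
concave = [0.0, 0.6, 0.9, 0.8, 0.3]    # a_1 > a_0, a_n < a_{n-1}, second differences <= 0

grid = [k / 200 for k in range(201)]

# Proposition 1 (i): the first derivative is non-negative on [0, tau].
d1 = derivative_coeffs(monotone)
mono_ok = all(bernstein_poly(d1, t) >= 0 for t in grid)

# Proposition 1 (ii): F'(0) > 0, F'(tau) < 0 and F'' <= 0 on [0, tau].
d1c = derivative_coeffs(concave)
d2c = derivative_coeffs(d1c)
concave_ok = (bernstein_poly(d1c, 0.0) > 0 and bernstein_poly(d1c, 1.0) < 0
              and all(bernstein_poly(d2c, t) <= 1e-12 for t in grid))

print(mono_ok, concave_ok)
```

Applying `derivative_coeffs` twice reproduces the factor $n(n-1)$ and the second differences $a_{i+2} - 2a_{i+1} + a_i$ that appear in the proof.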
Let $I_n^{(1)} = \{F_a \mid a \in B_n,\ a_0 \le a_1 \le \dots \le a_n\}$, $I_n^{(2)} = \{F_a \mid a \in B_n,\ a_1 - a_0 > 0,\ a_n - a_{n-1} < 0,\ a_{i+1} + a_{i-1} \le 2a_i \text{ for } i = 1, \dots, n - 1\}$, and $I_n^{(3)} = \{F_a \mid a \in B_n,\ a_0 = a_1 < a_2 \le \dots \le a_l,\ a_l \ge a_{l+1} \ge \dots \ge a_{n-1} > a_n \text{ for some } l = 2, \dots, n - 1\}$. Then we have

Proposition 2.
(i) (Monotone) Let $\mathcal{D}_1$ consist of linear combinations of elements in $\bigcup_{n=1}^{\infty} I_n^{(1)}$, with non-negative coefficients. Then the closure of $\mathcal{D}_1$ in uniform norm is precisely the set of increasing and continuous functions on $[0, \tau]$.
(ii) (Unimodal Concave) Let $\mathcal{D}_2$ consist of linear combinations of elements in $\bigcup_{n=2}^{\infty} I_n^{(2)}$, with non-negative coefficients. Let $\mathcal{S}$ denote the set of all continuously differentiable real-valued functions $F$ defined on $[0, \tau]$ with $F'(0) \ge 0$, $F'(\tau) \le 0$, and $F'$ decreasing. For two continuously differentiable functions $f$ and $g$, define $d(f, g) = \|f - g\|_{\infty} + \|f' - g'\|_{\infty}$, where $\|\cdot\|_{\infty}$ is the sup-norm for functions on $[0, \tau]$. Then the closure of $\mathcal{D}_2$, under $d$, is $\mathcal{S}$.
(iii) (Unimodal) Let $\mathcal{D}_3 = \bigcup_{n=3}^{\infty} I_n^{(3)}$. Let $\mathcal{S}$ denote the set of all continuously differentiable real-valued functions $F$ defined on $[0, \tau]$ satisfying the properties that $F'(0) = 0$ and that there exists $s \in [0, \tau]$ such that $F'(s) = 0$, $F'(x) \ge 0$ for $x \in [0, s]$, and $F'(x) \le 0$ for $x \in [s, \tau]$. For two continuously differentiable functions $f$ and $g$, define $d(f, g) = \|f - g\|_{\infty} + \|f' - g'\|_{\infty}$, where $\|\cdot\|_{\infty}$ is the sup-norm for functions on $[0, \tau]$. Then the closure of $\mathcal{D}_3$, under $d$, is $\mathcal{S}$.

Proof. We give the proofs for (i) and (ii), and omit the proof for (iii), because it is similar to that for (ii).
(i) It is obvious that the closure of $\mathcal{D}_1$ is contained in the set of increasing and continuous functions. We now prove the converse. Let $F$ be increasing and continuous. Taking $a_i = F(i\tau/k)$, for $i = 0, 1, \dots, k$, we set $F_{(k)}(t) = \sum_{i=0}^{k} a_i \varphi_{i,k}(t/\tau)$. Then $F_{(k)}$ is in $\mathcal{D}_1$. It follows from the Bernstein-Weierstrass approximation theorem that $F_{(k)}$ converges to $F$ uniformly. This completes the proof.

(ii) It follows from (ii) in Proposition 1 that $\mathcal{D}_2 \subset \mathcal{S}$. Because $\mathcal{S}$ is complete relative to $d$, we know $\overline{\mathcal{D}}_2 \subset \mathcal{S}$. Here $\overline{\mathcal{D}}_2$ is the closure of $\mathcal{D}_2$. We now prove $\mathcal{S} \subset \overline{\mathcal{D}}_2$. Let $F \in \mathcal{S}$. Let $d_0 = F'(0) + 1/n^3$, $d_{n-1} = F'(\tau) - 1/n^3$, $d_i = F'(i\tau/(n - 1))$ for $i = 1, \dots, n - 2$, and $H_{1,n}(t) = \sum_{i=0}^{n-1} d_i \varphi_{i,n-1}(t/\tau)$. Note that $H_{1,n}(0) > 0$ and $H_{1,n}(\tau) < 0$. Using the Bernstein-Weierstrass Theorem, we know $H_{1,n}$ converges to $F'$ uniformly.

Let $a_0 = F(0)n/\tau$, $a_i = a_0 + d_0 + \dots + d_{i-1}$, and $H_{0,n}(t) = (\tau/n) \sum_{i=0}^{n} a_i \varphi_{i,n}(t/\tau)$. Then $H_{0,n}'(t) = H_{1,n}(t)$ and $H_{0,n}(0) = F(0)$. Thus $H_{0,n}$ converges uniformly to $F$ (see, for example, Theorem 7.17 in Rudin [30]). Using the fact that $d_i$ is a decreasing sequence with $d_0 > 0$, $d_{n-1} < 0$, and $a_{i+2} - 2a_{i+1} + a_i = d_{i+1} - d_i$, we know $a_1 - a_0 > 0$, $a_n - a_{n-1} < 0$ and $a_{i+2} + a_i \le 2a_{i+1}$. Thus $H_{0,n}$ is in $\mathcal{D}_2$. This shows that $F$ is in $\overline{\mathcal{D}}_2$. This completes the proof.

2.2. Bayesian regression

We now describe a Bayesian regression model with the prior distribution, on the regression functions, introduced by random Bernstein polynomials (1.1). Assume that on a probability space $(\mathcal{B} \times \mathbb{R}^{\infty}, \mathcal{F}, \mathcal{P})$, there are random variables $\{Y_{jk} \mid j = 1, \dots, m_k;\ k = 1, \dots, K\}$ satisfying the property that, conditional on $\mathbf{B} = (n, b_n)$,

\[ Y_{jk} = F_{b_n}(X_k) + \epsilon_{jk}, \tag{2.1} \]

with $\{\epsilon_{jk} \mid j = 1, \dots, m_k;\ k = 1, \dots, K\}$ being independent random variables, $\epsilon_{jk}$ having known density $g_k$ for $j = 1, \dots, m_k$, $\mathbf{B}$ being the projection from $\mathcal{B} \times \mathbb{R}^{\infty}$ to $\mathcal{B}$, $F_{b_n}$ being the function on $[0, \tau]$ associated with $(n, b_n) \in \mathcal{B}$ defined in (1.1), $X_1, \dots, X_K$ being constant design points, and $\mathcal{F}$ being the Borel $\sigma$-field on $\mathcal{B} \times \mathbb{R}^{\infty}$. We also assume the marginal distribution of $\mathcal{P}$ on $\mathcal{B}$ is the prior $\pi$.

Our purpose is to illustrate a Bayesian regression method and the above mathematical formulation is only meant to facilitate a simple formal presentation. In fact, $\mathcal{P}$ is constructed after the prior and the likelihood are specified. A natural way to introduce the prior $\pi$ is to define $\pi(n, b_n) = p(n)\pi_n(b_n)$, with $\sum_{n=1}^{\infty} p(n) = 1$ and $\pi_n$ a density function on $B_n$; in fact, $\pi_n(\cdot)$ is the conditional density of $\pi$ on $B_n$ and is also denoted by $\pi(\cdot \mid \{n\} \times B_n)$.

Given $\mathbf{B} = (n, b_n)$, the likelihood for the data $\{(X_k, Y_{jk}) \mid j = 1, \dots, m_k;\ k = 1, \dots, K\}$ is

\[ \prod_{k=1}^{K} \prod_{j=1}^{m_k} g_k\big(Y_{jk} - F_{b_n}(X_k)\big). \]

Thus the posterior density $\nu$ of the parameter $(n, b_n)$ given the data is proportional to

\[ \prod_{k=1}^{K} \prod_{j=1}^{m_k} g_k\big(Y_{jk} - F_{b_n}(X_k)\big)\, \pi_n(b_n)\, p(n), \]

where $(n, b_n) \in \mathcal{B}$. We note that, although we assume $g_k$ is known in this paper, the method of this paper can be extended to treat the case that $g_k$ has a certain parametric form with priors on the parameters.

2.3. Support of Bernstein priors

The following propositions show that the support of the Bernstein priors can be quite large.
Proposition 3. (Monotone) Let $B_n^{(1)} = \{b_n \in B_n : F_{b_n} \in I_n^{(1)}\}$. Assume $p(n) > 0$ for $n = 1, 2, \dots$, and the conditional density $\pi_n(b_{0,n}, b_{1,n}, \dots, b_{n,n})$ of $\pi(\cdot \mid \{n\} \times B_n^{(1)})$ has support $B_n^{(1)}$ for infinitely many $n$. Let $F$ be a given increasing and continuous function on $[0, \tau]$. Then $\pi\{(n, b_n) \in \bigcup_{n=2}^{\infty} (\{n\} \times B_n^{(1)}) : \|F_{b_n} - F\|_{\infty} < \epsilon\} > 0$ for every $\epsilon > 0$.

Proof. Let $F_{(n)}(t) = \sum_{i=0}^{n} F(i\tau/n)\varphi_{i,n}(t/\tau)$. Using the Bernstein-Weierstrass Theorem, we can choose a large $n_1$ so that $\|F_{(n_1)} - F\|_{\infty} \le \epsilon/2$, $p(n_1) > 0$, and $\pi(\cdot \mid \{n_1\} \times B_{n_1}^{(1)})$ has support $B_{n_1}^{(1)}$. Combining this with $\|F_{b_n} - F_{(n)}\|_{\infty} \le \max_{i=0,\dots,n} |b_{i,n} - F(i\tau/n)|$ for $b_n \in B_n$, we get

\[ \pi\{(n, b_n) \in \mathcal{B} : \|F_{b_n} - F\|_{\infty} < \epsilon\} \ge \pi\Big\{(n_1, b_{n_1}) \in \mathcal{B} : \|F_{b_{n_1}} - F_{(n_1)}\|_{\infty} < \frac{\epsilon}{2}\Big\} \ge \pi\Big\{(n_1, b_{n_1}) \in \{n_1\} \times B_{n_1}^{(1)} : \max_{i=0,\dots,n_1} \Big|b_{i,n_1} - F\Big(\frac{i\tau}{n_1}\Big)\Big| < \frac{\epsilon}{2}\Big\}, \]

which is positive. This completes the proof.

Remarks. If we know $c_1 < F(\tau) < c_0$, then it suffices to assume $\pi_n$ has support $\{(b_{0,n}, \dots, b_{n,n}) \in B_n^{(1)} \mid c_1 \le b_{0,n},\ b_{n,n} \le c_0\}$ in Proposition 3.

Statements similar to Proposition 3 can also be made for concave functions. In fact, we have
Proposition 4. (Unimodal Concave) Let $B_n^{(2)} = \{b_n \in B_n : F_{b_n} \in I_n^{(2)}\}$. Assume $p(n) > 0$ for $n = 1, 2, \dots$, and the conditional density $\pi_n(b_{0,n}, b_{1,n}, \dots, b_{n,n})$ of $\pi(\cdot \mid \{n\} \times B_n^{(2)})$ has support $B_n^{(2)}$ for infinitely many $n$. Let $F$ be a continuously differentiable real-valued function defined on $[0, \tau]$ with $F'(0) \ge 0$, $F'(\tau) \le 0$, and $F'$ decreasing. Then $\pi\{(n, b_n) \in \bigcup_{n=2}^{\infty} (\{n\} \times B_n^{(2)}) : \|F_{b_n} - F\|_{\infty} + \|F_{b_n}' - F'\|_{\infty} < \epsilon\} > 0$ for every $\epsilon > 0$.
Proposition 5. (Unimodal) Let $B_n^{(3)} = \{b_n \in B_n : F_{b_n} \in I_n^{(3)}\}$. Assume $p(n) > 0$ for $n = 1, 2, \dots$, and the conditional density $\pi_n(b_{0,n}, b_{1,n}, \dots, b_{n,n})$ of $\pi(\cdot \mid \{n\} \times B_n^{(3)})$ has support $B_n^{(3)}$ for infinitely many $n$. Let $F$ be a continuously differentiable real-valued function defined on $[0, \tau]$ satisfying the properties that $F'(0) = 0$ and that there exists $s \in (0, \tau)$ such that $F'(s) = 0$, $F'(x) \ge 0$ for $x \in [0, s]$, and $F'(x) \le 0$ for $x \in [s, \tau]$. Then $\pi\{(n, b_n) \in \bigcup_{n=2}^{\infty} (\{n\} \times B_n^{(3)}) : \|F_{b_n} - F\|_{\infty} + \|F_{b_n}' - F'\|_{\infty} < \epsilon\} > 0$ for every $\epsilon > 0$.

3. Bayesian inference

Instead of defining priors explicitly, we propose algorithms to generate samples from the Bernstein priors that incorporate geometric information. We also propose algorithms for generating posterior distributions so as to do statistical inference. We consider only isotonic regression and unimodal concave (convex) regression in the rest of this paper, because we want to compare our method with the density-regression method.

Algorithms 1, 2 and 3 respectively generate priors for isotonic regression, unimodal concave regression and unimodal convex regression. Both Algorithm 4, an independent Metropolis algorithm (IMA), and Algorithm 5, a Metropolis-Hastings reversible jump algorithm (MHRA), can be used to generate the posterior for isotonic regression. Algorithms generating the posterior for unimodal concave (convex) regression can be similarly proposed; they are omitted to make the paper more concise.

Algorithm 1. (Bernstein isotonic prior)
Let $p(1) = e^{-\alpha} + \alpha e^{-\alpha}$, $p(n) = \alpha^n e^{-\alpha}/n!$ for $n = 2, \dots, n_0 - 1$, and $p(n_0) = 1 - \sum_{n=1}^{n_0 - 1} p(n)$. Let $q_1$ be a density so that its support contains $F(0)$. Let $q_2$ be a density so that its support contains $F(\tau)$. The following steps provide a way to sample from an implicitly defined prior distribution.
1. Generate $n \sim p$.
2. Generate $a_0 \sim q_1$, $a_n \sim q_2$.
3. Let $U_1, U_2, \dots, U_{n-1}$ be a random sample from $\mathrm{Uniform}(a_0, a_n)$. Let $U_{(1)}, U_{(2)}, \dots, U_{(n-1)}$ be the order statistics of $\{U_1, U_2, \dots, U_{n-1}\}$. Set $a_1 = U_{(1)}, a_2 = U_{(2)}, \dots, a_{n-1} = U_{(n-1)}$.
4. $(n, a_0, \dots, a_n) \in \mathcal{B}$ is a sample from the prior.

Algorithm 2. (Bernstein concave prior)
Let $p(2) = (1 + \alpha)e^{-\alpha} + \alpha^2 e^{-\alpha}/2$, $p(n) = \alpha^n e^{-\alpha}/n!$ for $n = 3, \dots, n_0 - 1$, and $p(n_0) = 1 - \sum_{n=2}^{n_0 - 1} p(n)$. Let $q$ be a density with its support containing the mode of the regression function. Let $\beta_1$ be a lower bound of $F(0)$. Let $\beta_2$ be a lower bound of $F(\tau)$. The algorithm has the following steps.
1. Generate $n \sim p$.
2. Randomly choose $l$ from $\{1, 2, \dots, n - 1\}$.
3. Generate $a_l \sim q$.
4. Generate $a_0 \sim \mathrm{Uniform}(-a_l + 2\beta_1, a_l)$ and $a_n \sim \mathrm{Uniform}(-a_l + 2\beta_2, a_l)$.
5. Let $U_1 \le U_2 \le \dots \le U_{l-1}$ be the order statistics of a random sample from $\mathrm{Uniform}(a_0, a_l)$. Denote by $c_1 \le c_2 \le \dots \le c_l$ the order statistics of $\{a_l - U_{l-1}, U_{l-1} - U_{l-2}, \dots, U_2 - U_1, U_1 - a_0\}$. Then set $a_1 = a_0 + c_l$, $a_2 = a_0 + c_l + c_{l-1}$, ..., $a_{l-1} = a_0 + c_l + \dots + c_2$.
6. Let $V_1 \le V_2 \le \dots \le V_{n-l-1}$ be the order statistics of a random sample from $\mathrm{Uniform}(a_n, a_l)$. Denote by $c_1 \le c_2 \le \dots \le c_{n-l}$ the order statistics of $\{a_l - V_{n-l-1}, V_{n-l-1} - V_{n-l-2}, \dots, V_2 - V_1, V_1 - a_n\}$. Then set $a_{l+1} = a_l - c_1$, $a_{l+2} = a_l - c_1 - c_2$, ..., $a_{n-1} = a_l - c_1 - \dots - c_{n-l-1}$.
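The steps of Algorithms 1 and 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the densities `q1`, `q2` and `q` are passed as callables that draw from uniform ranges chosen only for the example, and `q1` is assumed to lie below `q2` so that $a_0 \le a_n$.

```python
import math
import random

def order_pmf(alpha, n0, n_min):
    """Truncated-Poisson weights p(n) on {n_min, ..., n0} (n_min = 1 or 2)."""
    pois = lambda n: alpha**n * math.exp(-alpha) / math.factorial(n)
    p = {n_min: sum(pois(k) for k in range(n_min + 1))}
    for n in range(n_min + 1, n0):
        p[n] = pois(n)
    p[n0] = 1.0 - sum(p.values())
    return p

def draw_order(p, rng):
    orders = sorted(p)
    return rng.choices(orders, weights=[p[n] for n in orders])[0]

def sample_isotonic(p, q1, q2, rng):
    """Algorithm 1: a_0 ~ q1, a_n ~ q2, interior = sorted Uniform(a_0, a_n) draws."""
    n = draw_order(p, rng)
    a0, an = q1(rng), q2(rng)
    interior = sorted(rng.uniform(a0, an) for _ in range(n - 1))
    return [a0] + interior + [an]

def sorted_gaps(lo, hi, k, rng):
    """Spacings of [lo, hi] cut by k - 1 uniform points, in increasing order."""
    pts = [lo] + sorted(rng.uniform(lo, hi) for _ in range(k - 1)) + [hi]
    return sorted(pts[i + 1] - pts[i] for i in range(k))

def sample_concave(p, q, beta1, beta2, rng):
    """Algorithm 2: rise to a_l with shrinking steps, then fall with growing steps."""
    n = draw_order(p, rng)
    l = rng.randint(1, n - 1)
    al = q(rng)
    a0 = rng.uniform(2 * beta1 - al, al)
    an = rng.uniform(2 * beta2 - al, al)
    up = sorted_gaps(a0, al, l, rng)[::-1]      # c_l >= ... >= c_1
    down = sorted_gaps(an, al, n - l, rng)      # c_1 <= ... <= c_{n-l}
    a = [a0]
    for c in up[:-1]:
        a.append(a[-1] + c)
    a.append(al)
    for c in down[:-1]:
        a.append(a[-1] - c)
    a.append(an)
    return a

rng = random.Random(1)
p_iso = order_pmf(alpha=10.0, n0=20, n_min=1)
p_con = order_pmf(alpha=10.0, n0=20, n_min=2)
q1 = lambda r: r.uniform(-0.2, 0.2)   # hypothetical range covering F(0)
q2 = lambda r: r.uniform(0.8, 1.2)    # hypothetical range covering F(tau)
q = lambda r: r.uniform(1.0, 1.5)     # hypothetical range covering the maximum value
iso_draws = [sample_isotonic(p_iso, q1, q2, rng) for _ in range(200)]
con_draws = [sample_concave(p_con, q, 0.0, 0.0, rng) for _ in range(200)]
```

Sorting the spacings is what turns an arbitrary increasing (or decreasing) sequence into one with monotone increments, which is the second-difference condition of Proposition 1 (ii).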
In the above algorithms, the conditional distributions of $\pi(\cdot \mid \{n\} \times B_n)$ are defined to be those of $(a_0, a_1, \dots, a_n)$. Although these two algorithms might look a little ad hoc, the main idea is to produce a random sequence $a_0, a_1, \dots, a_n$ satisfying the conditions in the propositions in Section 2. It is obvious that $a_0 \le a_1 \le \dots \le a_n$ in Algorithm 1 and that $a_1 - a_0 > 0$, $a_n - a_{n-1} < 0$, and $a_{i+2} + a_i \le 2a_{i+1}$ in Algorithm 2. It follows from Proposition 1 that the support of the prior generated by Algorithm 1 (Algorithm 2) contains only monotone (unimodal concave) functions. The following Algorithm 3 will be used in the simulation study.

Algorithm 3. (Bernstein convex prior)
Let $p(2) = (1 + \alpha)e^{-\alpha} + \alpha^2 e^{-\alpha}/2$, $p(n) = \alpha^n e^{-\alpha}/n!$ for $n = 3, \dots, n_0 - 1$, and $p(n_0) = 1 - \sum_{n=2}^{n_0 - 1} p(n)$. Let $q$ be a density with its support containing the minimum value of the regression function. Let $\beta_1$ be an upper bound of $F(0)$, and $\beta_2$ be an upper bound of $F(\tau)$. The algorithm has the following steps.
1. Generate $n \sim p$.
2. Randomly choose $l$ from $\{1, 2, \dots, n - 1\}$.
3. Generate $a_l \sim q$.
4. Generate $a_0 \sim \mathrm{Uniform}(a_l, 2\beta_1 - a_l)$ and $a_n \sim \mathrm{Uniform}(a_l, 2\beta_2 - a_l)$.
5. Let $U_1 \le U_2 \le \dots \le U_{l-1}$ be the order statistics of a random sample from $\mathrm{Uniform}(a_l, a_0)$. Denote by $c_1 \le c_2 \le \dots \le c_l$ the order statistics of $\{a_0 - U_{l-1}, U_{l-1} - U_{l-2}, \dots, U_2 - U_1, U_1 - a_l\}$. Then set $a_1 = a_0 - c_l$, $a_2 = a_0 - c_l - c_{l-1}$, ..., $a_{l-1} = a_0 - c_l - \dots - c_2$.
6. Let $V_1 \le V_2 \le \dots \le V_{n-l-1}$ be the order statistics of a random sample from $\mathrm{Uniform}(a_l, a_n)$. Denote by $c_1 \le c_2 \le \dots \le c_{n-l}$ the order statistics of $\{a_n - V_{n-l-1}, V_{n-l-1} - V_{n-l-2}, \dots, V_2 - V_1, V_1 - a_l\}$. Then set $a_{l+1} = a_l + c_1$, $a_{l+2} = a_l + c_1 + c_2$, ..., $a_{n-1} = a_l + c_1 + \dots + c_{n-l-1}$.
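Steps 2 to 6 of Algorithm 3 can be sketched as follows for a given order $n \ge 2$. The function names, the callable `q` and the numerical ranges are hypothetical and only serve the illustration; drawing $n$ from $p$ is left out to keep the block short.

```python
import random

def spacings(lo, hi, k, rng):
    """Sorted spacings of [lo, hi] cut by k - 1 uniform points."""
    pts = [lo] + sorted(rng.uniform(lo, hi) for _ in range(k - 1)) + [hi]
    return sorted(pts[i + 1] - pts[i] for i in range(k))

def sample_convex_coeffs(n, q, beta1, beta2, rng):
    """Steps 2-6 of Algorithm 3: fall to a_l with shrinking drops, then rise."""
    l = rng.randint(1, n - 1)
    al = q(rng)                                   # value at the minimum
    a0 = rng.uniform(al, 2 * beta1 - al)
    an = rng.uniform(al, 2 * beta2 - al)
    down = spacings(al, a0, l, rng)[::-1]         # largest drop c_l first
    up = spacings(al, an, n - l, rng)             # smallest rise c_1 first
    a = [a0]
    for c in down[:-1]:
        a.append(a[-1] - c)
    a.append(al)
    for c in up[:-1]:
        a.append(a[-1] + c)
    a.append(an)
    return a

rng = random.Random(7)
q = lambda r: r.uniform(-0.2, 0.1)   # hypothetical range covering the minimum value
convex_draws = [sample_convex_coeffs(rng.randint(2, 12), q, 1.0, 1.0, rng)
                for _ in range(200)]
```

Algorithm 3 mirrors Algorithm 2: negating the coefficients, the density `q` and the bounds maps one onto the other.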
Algorithm 4. (IMA for the posterior in isotonic regression)
This algorithm uses the independent Metropolis approach (see, for example, Robert and Casella [29], page 276) to calculate the posterior distribution $\nu$ of $(n, b)$. Let $x = (n, a_0, \dots, a_n)$ be generated by the prior. We describe the transition from the current state $x^{(t)} = (n', a_0', \dots, a_{n'}')$ to a new point $x^{(t+1)}$ by

\[ x^{(t+1)} = \begin{cases} x, & \text{with prob. } \min\left\{1, \dfrac{\nu(x)\, \pi_{n'}(a_0', \dots, a_{n'}')\, p(n')}{\nu(x^{(t)})\, \pi_n(a_0, \dots, a_n)\, p(n)}\right\}, \\[2ex] x^{(t)}, & \text{otherwise.} \end{cases} \]
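Because the proposal in Algorithm 4 is the prior itself and $\nu$ is proportional to likelihood times prior, the prior terms cancel in the acceptance ratio and only a likelihood ratio remains. The sketch below relies on that cancellation. It is an illustration under assumptions: Gaussian errors with known $\sigma$, $\tau = 1$, a simulated data set, and a hypothetical `toy_prior` standing in for Algorithm 1.

```python
import math
import random
from math import comb

def bernstein(a, t):
    """F_a(t) on [0, 1] for coefficients a = (a_0, ..., a_n)."""
    n = len(a) - 1
    return sum(a[i] * comb(n, i) * t**i * (1 - t)**(n - i) for i in range(n + 1))

def log_lik(a, xs, ys, sigma):
    """Gaussian log-likelihood up to a constant, i.e. g_k = Normal(0, sigma^2)."""
    return -sum((y - bernstein(a, x))**2 for x, y in zip(xs, ys)) / (2 * sigma**2)

def independent_metropolis(sample_prior, xs, ys, sigma, n_iter, rng):
    cur = sample_prior(rng)
    cur_ll = log_lik(cur, xs, ys, sigma)
    chain = []
    for _ in range(n_iter):
        prop = sample_prior(rng)
        prop_ll = log_lik(prop, xs, ys, sigma)
        # Prior proposals: the acceptance ratio reduces to a likelihood ratio.
        if rng.random() < math.exp(min(0.0, prop_ll - cur_ll)):
            cur, cur_ll = prop, prop_ll
        chain.append(cur)
    return chain

def toy_prior(rng):
    # Hypothetical stand-in for Algorithm 1, with n uniform on {1, ..., 6}.
    n = rng.randint(1, 6)
    a0, an = rng.uniform(-0.3, 0.3), rng.uniform(0.7, 1.3)
    return [a0] + sorted(rng.uniform(a0, an) for _ in range(n - 1)) + [an]

rng = random.Random(0)
sigma = 0.1
xs = [rng.random() for _ in range(30)]
ys = [math.sin(math.pi * x / 2) + rng.gauss(0.0, sigma) for x in xs]

chain = independent_metropolis(toy_prior, xs, ys, sigma, 2000, rng)
kept = chain[500:]                                  # discard a burn-in period
grid = [k / 10 for k in range(11)]
post_mean = [sum(bernstein(a, t) for a in kept) / len(kept) for t in grid]
```

The posterior mean curve is an average of monotone Bernstein polynomials and is therefore itself monotone, which is the property the Bayes estimate relies on.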
The posterior distribution $\nu$ of $(n, a)$ in turn produces the posterior distribution of $F_a$, and the Bayes estimate we consider is the mean of $F_a$.

Algorithm 5. (MHRA for the posterior in isotonic regression)
Let $B^{(n)} = \{(n, a_0, \dots, a_n) \mid (a_0, \dots, a_n) \in B_n^{(1)}\}$. Let $x^{(t)} = (n, a_0, \dots, a_n) \in B^{(n)}$ be the current state. We describe the transition from $x^{(t)} \in B^{(n)}$ to a new point $x^{(t+1)}$ by the following algorithms. Randomly select one of three types of moves $H$, $H^+$, or $H^-$. Here $H$ is a transition of an element within $B^{(n)}$, $H^+$ a transition of an element from $B^{(n)}$ to $B^{(n+1)}$, and $H^-$ a transition of an element from $B^{(n)}$ to $B^{(n-1)}$. Denote by $P_H^n$, $P_{H^+}^n$ and $P_{H^-}^n$ the probabilities of selecting the three different types of moves $H$, $H^+$ and $H^-$ when the current state of the Markov chain is in $B^{(n)}$. We set $P_{H^-}^{1} = P_{H^+}^{n_0} = 0$. Consider $P_H^n = 1 - P_{H^+}^n - P_{H^-}^n$, $P_{H^+}^n = c \min\{1, p(n+1)/p(n)\}$ and $P_{H^-}^n = c \min\{1, p(n-1)/p(n)\}$, where $p$ is given in Algorithm 1 and $c$ is a sample parameter. Suppose $M_1 \le F \le M_2$.

If the move of type $H$ is selected, then
1. select $k$ randomly from $\{0, 1, 2, \dots, n\}$ so that there is probability $1/3$ of choosing $0$, probability $1/3$ of choosing $n$, and probability $1/(3(n - 1))$ of choosing any one of the rest; generate $V \sim \mathrm{Uniform}(c_{k-1}, c_{k+1})$, with $c_{-1} = M_1$, $c_{n+1} = M_2$, and $c_k = a_k$ for $k = 0, \dots, n$;
2. let $y^{(t)}$ be the vector $x^{(t)}$ with $a_k$ replaced by $V$;
3. set the next state
\[ x^{(t+1)} = \begin{cases} y^{(t)}, & \text{with prob. } \rho = \min\left\{1, \dfrac{\nu(y^{(t)})}{\nu(x^{(t)})}\right\}, \\[2ex] x^{(t)}, & \text{otherwise.} \end{cases} \]

If the move of type $H^+$ is selected, then
1. generate $V \sim \mathrm{Uniform}(a_0, a_n)$ and assume $a_k \le V < a_{k+1}$;
2. let $y^{(t)} = (n + 1, a_0, a_1, \dots, a_k, V, a_{k+1}, \dots, a_n) \in B^{(n+1)}$;
3. set the next state
\[ x^{(t+1)} = \begin{cases} y^{(t)}, & \text{with prob. } \rho = \min\left\{1, \dfrac{\nu(y^{(t)}) \times p(n) \times (a_n - a_0)}{\nu(x^{(t)}) \times p(n + 1) \times n}\right\}, \\[2ex] x^{(t)}, & \text{otherwise.} \end{cases} \]

If the move of type $H^-$ is selected, then
1. select $k$ uniformly from $\{1, 2, \dots, n - 1\}$;
2. let $y^{(t)} = (n - 1, a_0, a_1, \dots, a_{k-1}, a_{k+1}, \dots, a_n) \in B^{(n-1)}$;
3. set the next state
\[ x^{(t+1)} = \begin{cases} y^{(t)}, & \text{with prob. } \rho = \min\left\{1, \dfrac{\nu(y^{(t)}) \times p(n) \times (n - 1)}{\nu(x^{(t)}) \times p(n - 1) \times (a_n - a_0)}\right\}, \\[2ex] x^{(t)}, & \text{otherwise.} \end{cases} \]
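One transition of Algorithm 5 can be sketched as below, working on the log scale. Several ingredients are assumptions made for the illustration rather than statements from the paper: Gaussian errors with known $\sigma$; $q_1$ and $q_2$ uniform on hypothetical ranges, so that the prior density implied by Algorithm 1 is proportional to $(n-1)!/(a_n - a_0)^{n-1}$ on ordered coefficients; $\tau = 1$; a simulated data set; and, for $n = 1$ where no interior index exists, the index $k$ is drawn from $\{0, n\}$ with equal probability.

```python
import math
import random
from math import comb, lgamma, log

def bernstein(a, t):
    n = len(a) - 1
    return sum(a[i] * comb(n, i) * t**i * (1 - t)**(n - i) for i in range(n + 1))

def make_log_post(xs, ys, sigma, logp, q1, q2):
    """Return log nu up to a constant; q1 and q2 are (low, high) uniform ranges."""
    def log_post(a):
        n = len(a) - 1
        inside = q1[0] <= a[0] <= q1[1] and q2[0] <= a[-1] <= q2[1]
        ordered = a[0] < a[-1] and all(a[i] <= a[i + 1] for i in range(n))
        if not (inside and ordered):
            return -math.inf
        loglik = -sum((y - bernstein(a, x))**2 for x, y in zip(xs, ys)) / (2 * sigma**2)
        # Assumed prior density: (n-1)! / (a_n - a_0)^(n-1) on ordered coefficients.
        return loglik + lgamma(n) - (n - 1) * log(a[-1] - a[0]) + logp[n]
    return log_post

def mhra_step(a, log_post, logp, n0, c, M1, M2, rng):
    """One transition of the chain; returns the next coefficient vector."""
    n = len(a) - 1
    p_up = c * min(1.0, math.exp(logp[n + 1] - logp[n])) if n < n0 else 0.0
    p_down = c * min(1.0, math.exp(logp[n - 1] - logp[n])) if n > 1 else 0.0
    u = rng.random()
    if u < p_up:                                   # move H+: insert a coefficient
        y = sorted(a + [rng.uniform(a[0], a[-1])])
        log_rho = (log_post(y) - log_post(a) + logp[n] - logp[n + 1]
                   + log(a[-1] - a[0]) - log(n))
    elif u < p_up + p_down:                        # move H-: delete an interior one
        k = rng.randint(1, n - 1)
        y = a[:k] + a[k + 1:]
        log_rho = (log_post(y) - log_post(a) + logp[n] - logp[n - 1]
                   + log(n - 1) - log(a[-1] - a[0]))
    else:                                          # move H: perturb one coefficient
        if n == 1 or rng.random() < 2 / 3:
            k = 0 if rng.random() < 0.5 else n
        else:
            k = rng.randint(1, n - 1)
        lo = M1 if k == 0 else a[k - 1]
        hi = M2 if k == n else a[k + 1]
        y = a[:k] + [rng.uniform(lo, hi)] + a[k + 1:]
        log_rho = log_post(y) - log_post(a)
    return y if rng.random() < math.exp(min(0.0, log_rho)) else a

rng = random.Random(2)
n0, alpha, sigma = 20, 10.0, 0.1
pois = lambda n: alpha**n * math.exp(-alpha) / math.factorial(n)
p = {1: pois(0) + pois(1), **{n: pois(n) for n in range(2, n0)}}
p[n0] = 1.0 - sum(p.values())
logp = {n: log(v) for n, v in p.items()}
xs = [rng.random() for _ in range(40)]
ys = [math.sin(math.pi * x / 2) + rng.gauss(0.0, sigma) for x in xs]
log_post = make_log_post(xs, ys, sigma, logp, q1=(-0.5, 0.5), q2=(0.5, 1.5))
state, chain = [0.0, 0.5, 1.0], []
for _ in range(2000):
    state = mhra_step(state, log_post, logp, n0, 0.35, -0.5, 1.5, rng)
    chain.append(state)
```

A proposal that leaves the support (for example an endpoint outside the range of $q_1$ or $q_2$) gets log-posterior $-\infty$ and is rejected, so every state in `chain` stays ordered and within the allowed orders.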
4. Numerical studies

We now explore the numerical performance of the Bayes methods in this paper. Viewing the posterior as a distribution on regression functions, we can use the posterior mean $\hat{F}$ of the regression functions as the estimate; namely,

\[ \hat{F}(t) = \frac{1}{m} \sum_{j=1}^{m} F_{b^{(j)}}(t), \]

where $m$ is a large number, and $b^{(1)}, b^{(2)}, \dots$ are chosen randomly according to the posterior distribution. The performance of $\hat{F}$ is evaluated by the $L_1$-norm, sup-norm and mean square error (MSE) of $\hat{F} - F$, as a function on $[0, 1]$. Here $F$ denotes the true regression function. We note that, in the studies of isotonic regression and concave (convex) regression, $\hat{F}$ is a reasonable estimate because monotonicity and concavity (convexity) are preserved by linear combination with non-negative coefficients, and when studying other shape restricted regression, it might be more appropriate to use the posterior mode as the estimate.

Numerical studies in this section include comparison between our method and the density-regression method. In order to make the comparison more convincing, we study in this section both isotonic regression and convex regression, instead of concave regression, because they are studied by Dette and coauthors. In this section, $\tau = 1$; $K = 100$; $X_1, X_2, \dots, X_{100}$ are i.i.d. from $\mathrm{Uniform}(0, 1)$; $m_k = 1$ for every $k = 1, 2, \dots, 100$; $g_k$ is $\mathrm{Normal}(0, \sigma^2)$ with $\sigma = 0.1$ or $1$ in the data generation. When carrying out inference, $\sigma$ is estimated from the data.

4.1. Isotonic regression

Our simulation studies suggest that compared with the density-regression method of Dette et al. [11], our method performs comparably when the noise is large and better when the noise is small. We studied all the regression functions in Dette et al. [11], Dette and Pilz [12], and Dette [10], and all the results are similar; hence, we only report the results for two of the regression functions, which are defined by

\[ F_{\langle 1 \rangle}(t) = \sin\!\left(\frac{\Pi}{2} t\right), \tag{4.1} \]

\[ F_{\langle 2 \rangle}(t) = \begin{cases} 2t & \text{for } t \in [0, 0.25], \\ 0.5 & \text{for } t \in [0.25, 0.75], \\ 2t - 1 & \text{for } t \in [0.75, 1]. \end{cases} \tag{4.2} \]
We note that these are not polynomials and our method usually performs better for polynomials. We also note that the notation $\Pi$ in (4.1) is the ratio of the circumference of a circle to its diameter.

We use Algorithm 1 with $n_0 = 20$, $\alpha = 10$, $q_1 \sim \mathrm{Uniform}(q_{11}, q_{12})$, $q_2 \sim \mathrm{Uniform}(q_{21}, q_{22})$ to generate the prior distribution. Here $q_{11}$, $q_{12}$, $q_{21}$ and $q_{22}$ are defined as follows. Let $X_{(1)} < X_{(2)} < \dots < X_{(100)}$ be the order statistics for $\{X_1, X_2, \dots, X_{100}\}$. Let $Y_{[i]} = Y_j$ if $X_{(i)} = X_j$. Then $q_{11} = \min\{Y_{[i]} \mid i = 1, 2, \dots, 10\}$, $q_{12} = q_{21} = \frac{1}{100}\sum_{i=1}^{100} Y_i$ and $q_{22} = \max\{Y_{[i]} \mid i = 91, 92, \dots, 100\}$. We note that the choice of $n_0 = 20$ and $\alpha = 10$ makes the prior relatively uninformative in the order of the polynomial.

We use Algorithm 5 (MHRA) with $c = 0.35$, $M_1 = q_{11}$, $M_2 = q_{22}$ and estimated $\sigma^2$ to generate the posterior distribution. We note that this choice of $c$ allows relatively large probabilities of changing the order of the polynomial and $\sigma^2$ is estimated by $\frac{1}{198}\sum_{i=1}^{99} (Y_{[i+1]} - Y_{[i]})^2$, which is also used in the following density-regression method. We run MHRA for 100,000 updates; after a burn-in period of 10,000 updates, we use the remaining 90,000 realizations of the Markov chain to obtain the posterior mean.

Starting from generating the data $\{X_i, Y_i \mid i = 1, 2, \dots, 100\}$, the above experiment is replicated 200 times; the averages of the resulting 200 $L_1$-norms, sup-norms and MSE of $\hat{F} - F$ are reported in Table 1 and Table 2, which also contain the corresponding results using the density-regression method; the method of this paper is referred to as the Bayes method. Also contained in Table 1 and Table 2 are the posterior distributions of the polynomial order; these posterior distributions seem to suggest that $n_0 = 20$ in the prior is large enough. Figure 1 contains the autocorrelation plots of the $L_1$-norm of $F_{b^{(j)}}$ for one of the 200 replicates in the study of $F_{\langle 1 \rangle}$. Figure 1 indicates that MHRA behaves quite nicely.
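The data-driven quantities of this setup (the hyperparameters $q_{11}, q_{12}, q_{21}, q_{22}$ and the difference-based estimate of $\sigma^2$) can be sketched as follows for one simulated data set from $F_{\langle 1 \rangle}$. The variable and function names are hypothetical, and the grid approximation of the three error measures in `error_norms` is an assumption about how the norms on $[0, 1]$ are evaluated.

```python
import math
import random

rng = random.Random(0)
sigma, K = 0.1, 100
F1 = lambda t: math.sin(math.pi * t / 2)

X = [rng.random() for _ in range(K)]
Y = [F1(x) + rng.gauss(0.0, sigma) for x in X]

# Responses reordered by increasing design point: Y_[i] pairs with X_(i).
Y_sorted = [y for _, y in sorted(zip(X, Y))]

q11 = min(Y_sorted[:10])          # smallest response among the 10 leftmost points
q12 = q21 = sum(Y) / K            # overall mean of the responses
q22 = max(Y_sorted[-10:])         # largest response among the 10 rightmost points

# Difference-based variance estimate: (1/198) * sum_{i=1}^{99} (Y_[i+1] - Y_[i])^2
sigma2_hat = sum((Y_sorted[i + 1] - Y_sorted[i])**2 for i in range(K - 1)) / (2 * (K - 1))

def error_norms(f_hat, f_true, m=1000):
    """Midpoint-grid approximations of the L1-norm, sup-norm and MSE on [0, 1]."""
    grid = [(k + 0.5) / m for k in range(m)]
    d = [abs(f_hat(t) - f_true(t)) for t in grid]
    return sum(d) / m, max(d), sum(e * e for e in d) / m

l1, sup, mse = error_norms(lambda t: t, F1)   # e.g. error of the identity line
```

The variance estimate works because neighbouring responses share almost the same mean, so each squared difference has expectation close to $2\sigma^2$.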
Those for $F_{\langle 2\rangle}$ look similar and hence are omitted. The corresponding results using IMA are also omitted, because they are similar to those using MHRA, except for larger autocorrelation values and smaller effective sample sizes. Concepts of autocorrelation and effective sample size can be found in Liu [19], for example.

4.2. Convex regression

The main conclusion of our simulation studies regarding convex regression is similar to that reported for isotonic regression; namely, compared with the density-regression method for convex regression studied by Birke and Dette [3], our Bayes method performs comparably when the noise is large and performs better when the noise is small. To make the presentation concise, we only report the results for two of the regression functions, which are defined by

\[
F_{\langle 3\rangle}(t) = (16/9)(t - 1/4)^2,
\tag{4.3}
\]

\[
F_{\langle 4\rangle}(t) =
\begin{cases}
-4t + 1 & \text{for } t \in [0, 0.25],\\
0 & \text{for } t \in [0.25, 0.75],\\
4t - 3 & \text{for } t \in [0.75, 1].
\end{cases}
\tag{4.4}
\]

We note that both $F_{\langle 3\rangle}$ and $F_{\langle 4\rangle}$ are examples in Birke and Dette [3].
Table 1. Simulation study for F<1>(t) = sin(Πt/2).

(a) Average errors of F̂ − F

σ      Error       Bayes     Density-regression
0.1    L1-norm     0.0161    0.0238
0.1    Sup-norm    0.0532    0.0874
0.1    MSE         0.0004    0.0010
1      L1-norm     0.1148    0.1194
1      Sup-norm    0.2795    0.2651
1      MSE         0.0215    0.0233

(b) Posterior probability of the polynomial order n

n      σ = 0.1    σ = 1
1      0.0000     0.0006
2      0.0300     0.0026
3      0.0433     0.0077
4      0.0483     0.0196
5      0.0789     0.0406
6      0.0797     0.0673
7      0.0994     0.0950
8      0.1009     0.1164
9      0.1109     0.1308
10     0.1044     0.1252
11     0.0960     0.1104
12     0.0708     0.0878
13     0.0497     0.0687
14     0.0359     0.0495
15     0.0235     0.0319
16     0.0139     0.0200
17     0.0076     0.0113
18     0.0037     0.0068
19     0.0016     0.0042
20     0.0015     0.0036
Notes to Tables 1–4. In each table, part (a) gives the comparison between the Bayes method and the density-regression method, and part (b) gives the posterior probability of the order of the Bernstein polynomial. The effective sample sizes and acceptance rates of MHRA are as follows.

Table    Effective sample size        Acceptance rate
         σ = 0.1      σ = 1           σ = 0.1      σ = 1
1        5623         461             0.2923       0.6635
2        4913         364             0.3779       0.6943
3        1397         344             0.2413       0.3430
4        722          216             0.2217       0.3670

We use Algorithm 3 with $n_0 = 20$, $\alpha = 10$, $q \sim \mathrm{Uniform}(q_{01}, q_{02})$, $\beta_1 = q_3$ and $\beta_2 = q_4$ to generate the prior distribution, where $q_{01}$, $q_{02}$, $q_3$ and $q_4$ are defined as follows. Let $Y_{(1)} < Y_{(2)} < \cdots < Y_{(100)}$ be the order statistics for $\{Y_1, Y_2, \ldots, Y_{100}\}$. Then $q_{01} = \frac{1}{10}\sum_{i=1}^{10} Y_{(i)}$ and $q_{02} = \bigl|q_{01} + \frac{1}{100}\sum_{i=1}^{100} Y_i\bigr|/2$. Let $X_{(1)} < X_{(2)} < \cdots < X_{(100)}$ be the order statistics for $\{X_1, X_2, \ldots, X_{100}\}$, and let $Y_{[i]} = Y_j$ if $X_{(i)} = X_j$. Then $q_3 = \max\{Y_{[i]} \mid i = 1, 2, \ldots, 5\}$ and $q_4 = \max\{Y_{[i]} \mid i = 96, 97, \ldots, 100\}$. We use MHRA to generate the posterior distribution, with parameters similar to those given for Algorithm 5. Table 3 and Table 4 contain the main results. Entries in Table 3 and Table 4 have the same meanings as the corresponding entries in Table 1. Figure 2 carries information similar to that in Figure 1.
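The prior quantities for the convex case can be read off the sample in the same way. The sketch below follows the definitions of q01, q02, q3 and q4 given above; the function name is hypothetical and the code is only an illustration, not the implementation used in the study.

```python
from statistics import mean


def convex_prior_quantities(x, y):
    """Return (q01, q02, q3, q4) as defined in the text (hypothetical helper)."""
    y_sorted = sorted(y)                     # Y_(1) <= ... <= Y_(n)
    q01 = mean(y_sorted[:10])                # average of the 10 smallest responses
    q02 = abs(q01 + mean(y)) / 2             # |q01 + sample mean| / 2
    # Responses reordered by increasing x, i.e. Y_[1], ..., Y_[n].
    y_by_x = [yi for _, yi in sorted(zip(x, y), key=lambda pair: pair[0])]
    q3 = max(y_by_x[:5])                     # largest response among the 5 smallest x
    q4 = max(y_by_x[-5:])                    # largest response among the 5 largest x
    return q01, q02, q3, q4
```

For noise-free values of the quadratic in (4.3) on an equally spaced grid, q3 and q4 are simply the function values at the two endpoints.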
Table 2. Simulation study for F<2>(t) = 2t for t ∈ [0, 0.25], 0.5 for t ∈ [0.25, 0.75], 2t − 1 for t ∈ [0.75, 1].

(a) Average errors of F̂ − F

σ      Error       Bayes     Density-regression
0.1    L1-norm     0.0353    0.0413
0.1    Sup-norm    0.1002    0.1415
0.1    MSE         0.0019    0.0030
1      L1-norm     0.1255    0.1267
1      Sup-norm    0.3035    0.3509
1      MSE         0.0244    0.0256

(b) Posterior probability of the polynomial order n

n      σ = 0.1    σ = 1
1      0.0000     0.0004
2      0.0000     0.0016
3      0.0000     0.0062
4      0.0000     0.0177
5      0.0005     0.0379
6      0.0144     0.0611
7      0.0385     0.0871
8      0.0420     0.1100
9      0.0621     0.1245
10     0.1006     0.1284
11     0.1481     0.1169
12     0.1730     0.0992
13     0.1571     0.0794
14     0.1142     0.0539
15     0.0692     0.0335
16     0.0375     0.0195
17     0.0194     0.0113
18     0.0117     0.0060
19     0.0062     0.0029
20     0.0055     0.0025
Table 3. Simulation study for F<3>(t) = (16/9)(t − 1/4)^2.

(a) Average errors of F̂ − F

σ      Error       Bayes     Density-regression
0.1    L1-norm     0.0157    0.0775
0.1    Sup-norm    0.0525    0.2296
0.1    MSE         0.0004    0.0101
1      L1-norm     0.1362    0.1366
1      Sup-norm    0.4247    0.3643
1      MSE         0.0319    0.0292

(b) Posterior probability of the polynomial order n

n      σ = 0.1    σ = 1
2      0.0001     0.0413
3      0.0609     0.0602
4      0.0826     0.0728
5      0.0931     0.0849
6      0.1015     0.1037
7      0.1177     0.1094
8      0.1312     0.1175
9      0.1139     0.1110
10     0.0938     0.0886
11     0.0653     0.0717
12     0.0452     0.0516
13     0.0346     0.0313
14     0.0230     0.0204
15     0.0147     0.0141
16     0.0090     0.0096
17     0.0064     0.0062
18     0.0029     0.0028
19     0.0020     0.0015
20     0.0021     0.0014
Table 4. Simulation study for F<4>(t) = −4t + 1 for t ∈ [0, 0.25], 0 for t ∈ [0.25, 0.75], 4t − 3 for t ∈ [0.75, 1].

(a) Average errors of F̂ − F

σ      Error       Bayes     Density-regression
0.1    L1-norm     0.0405    0.1311
0.1    Sup-norm    0.1381    0.4663
0.1    MSE         0.0026    0.0282
1      L1-norm     0.1394    0.2265
1      Sup-norm    0.4603    0.7010
1      MSE         0.0338    0.0793

(b) Posterior probability of the polynomial order n

n      σ = 0.1    σ = 1
2      0.0000     0.0153
3      0.0000     0.0295
4      0.0000     0.0539
5      0.0000     0.0784
6      0.0000     0.1065
7      0.0000     0.1194
8      0.0205     0.1254
9      0.1144     0.1153
10     0.1028     0.0953
11     0.1498     0.0786
12     0.1710     0.0571
13     0.1301     0.0432
14     0.1130     0.0321
15     0.0778     0.0227
16     0.0474     0.0137
17     0.0263     0.0068
18     0.0158     0.0041
19     0.0124     0.0013
20     0.0187     0.0014
Fig 1. Autocorrelation plots for the MHRA in the data generation from the posterior distribution for F<1>(t) = sin(Πt/2).
Fig 2. Autocorrelation plots for the MHRA in the data generation from the posterior distribution for F<3>(t) = (16/9)(t − 1/4)^2.
5. Discussion

We have proposed a Bayes approach to shape restricted regression with the prior introduced by random Bernstein polynomials. In particular, algorithms for generating the priors and the posteriors are proposed, and the numerical performance of the Bayes estimate has been examined in some detail. The usefulness of this approach is demonstrated in simulation studies for isotonic regression and convex regression, which compare our method with the density-regression method. We note that certain frequentist properties of this Bayes estimate are established in Chang et al. [6].

The method of this paper can also be used to assess the validity of the shape restriction on the regression function, by considering the posterior predictive assessment proposed by Gelman et al. [14] and Gelman [13]; in fact, we have implemented this assessment and found it quite satisfactory.

We would like to remark that this approach can be used to study other shape restricted statistical inference problems. In fact, as pointed out by P. Bickel and M. Woodroofe at the Vardi memorial conference, the geometry of Bernstein polynomials may be utilized to propose a penalized likelihood approach to shape restricted regression; our preliminary simulation studies (not reported here) do suggest that the penalized maximum likelihood estimate looks promising, and further investigation is underway.

References

[1] Anderson, B., Jackson, J. and Sitharam, M. (1998). Descartes’ rule of signs revisited. Am. Math. Mon. 105 447–451.
[2] Altomare, F. and Campiti, M. (1994). Korovkin-type Approximation Theory and its Application. Walter de Gruyter, Berlin.
[3] Birke, M. and Dette, H. (2006). Estimating a convex function in nonparametric regression. Preprint.
[4] Brunk, H. D. (1955). Maximum likelihood estimates of monotone parameters. Ann. Math. Statist. 26 607–616.
[5] Chang, I. S., Chien, L. C., Gupta, P. K., Wen, C. C., Wu, Y. J. and Hsiung, C. A. (2006a). A Bayes approach to time course expression profile of a virus gene based on microarray data. Submitted.
[6] Chang, I. S., Hsiung, C. A., Wen, C. C. and Wu, Y. J. (2006b). A note on the consistency in Bayesian shape-restricted regression with random Bernstein polynomials. In Random Walks, Sequential Analysis and Related Topics (C. A. Hsiung, Z. Ying and C. H. Zhang, eds.). World Scientific, Singapore.
[7] Chang, I. S., Hsiung, C. A., Wu, Y. J. and Yang, C. C. (2005). Bayesian survival analysis using Bernstein polynomials. Scand. J. Statist. 32 447–466.
[8] Dalal, S. and Hall, W. (1983). Approximating priors by mixtures of natural conjugate priors. J. R. Stat. Soc. Ser. B 45 278–286.
[9] Diaconis, P. and Ylvisaker, D. (1985). Quantifying prior opinion. In Bayesian Statistics 2 (J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith, eds.) 133–156. North-Holland, Amsterdam.
[10] Dette, H. (2005). Monotone regression. Preprint.
[11] Dette, H., Neumeyer, N. and Pilz, K. F. (2006). A simple nonparametric estimator of a monotone regression function. Bernoulli 12 469–490.
[12] Dette, H. and Pilz, K. F. (2006). A comparative study of monotone nonparametric kernel estimates. J. Stat. Comput. Simul. 76 41–56.
[13] Gelman, A. (2003). A Bayesian formulation of exploratory data analysis and goodness-of-fit testing. Int. Stat. Rev. 71 369–382.
[14] Gelman, A., Meng, X. L. and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statist. Sinica 6 733–807.
[15] Green, P. G. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 711–732.
[16] Hall, P. and Huang, L. S. (2001). Nonparametric kernel regression subject to monotonicity constraints. Ann. Statist. 29 624–647.
[17] Hildreth, C. (1954). Point estimate of ordinates of concave functions. J. Amer. Statist. Assoc. 49 598–619.
[18] Lavine, M. and Mockus, A. (1995). A nonparametric Bayes method for isotonic regression. J. Statist. Plann. Inference 46 235–248.
[19] Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer, New York.
[20] Lorentz, C. G. (1986). Bernstein Polynomials. Chelsea, New York.
[21] Mallik, B. K. and Gelfand, A. E. (1994). Generalized linear models with unknown link functions. Biometrika 81 237–245.
[22] Mammen, E. (1991a). Estimating a smooth monotone regression function. Ann. Statist. 19 724–740.
[23] Mammen, E. (1991b). Nonparametric regression under qualitative smoothness assumptions. Ann. Statist. 19 741–759.
[24] Mammen, E., Marron, J. S., Turlach, B. A. and Wand, M. P. (2001). A general projection framework for constrained smoothing. Statist. Sci. 16 232–248.
[25] Mukerjee, H. (1988). Monotone nonparametric regression. Ann. Statist. 16 741–750.
[26] Petrone, S. (1999). Random Bernstein polynomials. Scand. J. Statist. 26 373–393.
[27] Petrone, S. and Wasserman, L. (2002). Consistency of Bernstein polynomial posteriors. J. R. Stat. Soc. Ser. B 64 79–100.
[28] Prautzsch, H., Boehm, W. and Paluszny, M. (2002). Bezier and B-Spline Techniques. Springer, Berlin, Heidelberg.
[29] Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods, 2nd ed. Springer, New York.
[30] Rudin, W. (1976). Principles of Mathematical Analysis, 3rd ed. McGraw-Hill, New York.
[31] Tenbusch, A. (1997). Nonparametric curve estimation with Bernstein estimates. Metrika 45 1–30.
IMS Lecture Notes–Monograph Series
Complex Datasets and Inverse Problems: Tomography, Networks and Beyond
Vol. 54 (2007) 203–223
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000166
Non- and semi-parametric analysis of failure time data with missing failure indicators∗

Irene Gijbels1,†, Danyu Lin2,‡ and Zhiliang Ying3,§

Katholieke Universiteit Leuven, University of North Carolina and Columbia University

Abstract: A class of estimating functions is introduced for the regression parameter of the Cox proportional hazards model to allow unknown failure statuses on some study subjects. The consistency and asymptotic normality of the resulting estimators are established under mild conditions. An adaptive estimator which achieves the minimum variance-covariance bound of the class is constructed. Numerical studies demonstrate that the asymptotic approximations are adequate for practical use and that the efficiency gain of the adaptive estimator over the complete-case analysis can be quite substantial. Similar methods are also developed for the nonparametric estimation of the survival function of a homogeneous population and for the estimation of the cumulative baseline hazard function under the Cox model.
∗ Part of the work was done while the first and the third authors were visiting the Mathematical Sciences Research Institute, Berkeley.
† Supported in part by the Belgium National Fund for Scientific Research and GOA-grant GOA/2007/04 of Research Fund K.U.Leuven.
‡ Supported in part by the National Institutes of Health.
§ Supported in part by the National Institutes of Health and the National Science Foundation.
1 Department of Mathematics and Center for Statistics, Katholieke Universiteit Leuven, W. de Croylaan 54, B-3001 Leuven (Heverlee), Belgium, e-mail: [email protected]
2 Department of Biostatistics, CB 7420, University of North Carolina, Chapel Hill, NC 27599-7420, USA, e-mail: [email protected]
3 Department of Statistics, Columbia University, 1255 Amsterdam Avenue, 10th Floor, New York, NY 10027, USA, e-mail: [email protected]
AMS 2000 subject classifications: primary 62J99; secondary 62F12, 62G05.
Keywords and phrases: cause of death, censoring, competing risks, counting process, Cox model, cumulative hazard function, failure type, incomplete data, Kaplan-Meier estimator, partial likelihood, proportional hazards, regression, survival data.

1. Introduction

Let $(T_i, C_i, Z_i)$ $(i = 1, \ldots, n)$ be $n$ independent replicates of the random vector $(T, C, Z)$, where $T$ and $C$ denote the failure and censoring times, and $Z$ denotes a $p \times 1$ vector of possibly time-varying covariates. The observations consist of $(X_i, \delta_i, Z_i)$ $(i = 1, \ldots, n)$, where $X_i = T_i \wedge C_i$ and $\delta_i = 1_{(T_i \le C_i)}$. Assume that $T_i$ and $C_i$ are conditionally independent given $Z_i$.

The widely-used Cox semiparametric regression model [4] postulates that, conditional on $Z(t)$, the hazard function $\lambda(t)$ for $T$ takes the form $e^{\beta_0' Z(t)}\lambda_0(t)$, where $\beta_0$ is a $p$-dimensional regression parameter and $\lambda_0(\cdot)$ is an unspecified baseline hazard function. The maximum partial likelihood estimator $\hat\beta_f$ for $\beta_0$ is obtained by maximizing

\[
L(\beta) = \prod_{i=1}^{n} \left\{ \frac{e^{\beta' Z_i(X_i)}}{\sum_{j=1}^{n} 1_{(X_j \ge X_i)}\, e^{\beta' Z_j(X_i)}} \right\}^{\delta_i},
\tag{1.1}
\]
or by solving $\{S(\beta) = 0\}$, where

\[
S(\beta) = \sum_{i=1}^{n} \int_0^\infty \left\{ Z_i(t) - \frac{\sum_{j=1}^{n} 1_{(X_j \ge t)}\, e^{\beta' Z_j(t)} Z_j(t)}{\sum_{j=1}^{n} 1_{(X_j \ge t)}\, e^{\beta' Z_j(t)}} \right\} \delta_i \, d1_{(X_i \le t)}.
\tag{1.2}
\]

Under suitable regularity conditions, $n^{-1/2} S(\beta_0) \xrightarrow{d} N(0, V)$ and $n^{1/2}(\hat\beta_f - \beta_0) \xrightarrow{d} N(0, V^{-1})$, where $V = -\lim_{n\to\infty} n^{-1}\,\partial S(\beta_0)/\partial\beta$ [1]. These asymptotic properties provide the basis for making inference about $\beta_0$. For the one-dimensional (dichotomous) $Z$, the nonparametric test based on $S(0)$ for testing $\beta_0 = 0$ is better known as the (two-sample) log rank test.

The estimation of the cumulative hazard function $\Lambda(t) = \int_0^t \lambda(s)\,ds$ and the survival function $F(t) = e^{-\Lambda(t)}$ is also of interest. In the one-sample case, where no covariates are modeled, $\Lambda(t)$ is commonly estimated by the Nelson-Aalen estimator

\[
\hat\Lambda_{NA}(t) = \int_0^t \frac{\sum_{i=1}^{n} \delta_i \, d1_{(X_i \le s)}}{\sum_{j=1}^{n} 1_{(X_j \ge s)}},
\tag{1.3}
\]

and the corresponding survival function estimator $\hat F_{NA}(t) = e^{-\hat\Lambda_{NA}(t)}$ is asymptotically equivalent to the well-known Kaplan-Meier estimator

\[
\hat F_{KM}(t) = \prod_{X_i \le t} \left\{ 1 - \frac{\delta_i}{\sum_{j=1}^{n} 1_{(X_j \ge X_i)}} \right\}.
\tag{1.4}
\]
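For fully observed failure indicators, the estimators (1.3) and (1.4) reduce to a sum and a product over the observed failure times. The following minimal sketch evaluates both at a fixed time point; the function name is hypothetical, and the code follows the per-subject form of the formulas above without any special handling of tied observation times.

```python
import math


def nelson_aalen_and_kaplan_meier(times, events, t):
    """Evaluate the estimators (1.3) and (1.4) at time t.

    times  : observed X_i = min(T_i, C_i)
    events : failure indicators delta_i (1 = failure, 0 = censored)
    Returns (Lambda_NA(t), exp(-Lambda_NA(t)), F_KM(t)).
    """
    cumulative_hazard = 0.0
    survival_km = 1.0
    for x_i, delta_i in zip(times, events):
        if delta_i == 1 and x_i <= t:
            # Size of the risk set at X_i: subjects with X_j >= X_i.
            at_risk = sum(1 for x_j in times if x_j >= x_i)
            cumulative_hazard += 1.0 / at_risk
            survival_km *= 1.0 - 1.0 / at_risk
    return cumulative_hazard, math.exp(-cumulative_hazard), survival_km
```

The two survival estimates differ in finite samples, since the first exponentiates the cumulative hazard while the second multiplies conditional survival probabilities.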
Motivated by the Nelson-Aalen estimator, Breslow [2] suggested that the cumulative baseline hazard function $\Lambda_0(t) = \int_0^t \lambda_0(s)\,ds$ under the Cox model be estimated by

\[
\hat\Lambda_B(t) = \int_0^t \frac{\sum_{i=1}^{n} \delta_i \, d1_{(X_i \le s)}}{\sum_{j=1}^{n} 1_{(X_j \ge s)}\, e^{\hat\beta_f' Z_j(s)}}.
\tag{1.5}
\]

Both $n^{1/2}\{\hat\Lambda_{NA}(\cdot) - \Lambda(\cdot)\}$ and $n^{1/2}\{\hat\Lambda_B(\cdot) - \Lambda_0(\cdot)\}$ converge weakly to zero-mean Gaussian processes [1, 3, 6, 14].

All of the aforementioned procedures assume complete measurements on the failure indicators $\delta_i$ $(i = 1, \ldots, n)$. In many applications, however, the values of $\{\delta_i\}$ are missing for some study subjects. We shall distinguish between two types of missingness. For Type I missingness, $\{\delta_i\}$ are missing completely at random among all subjects. For Type II missingness, $\{\delta_i\}$ take value 0 for some subjects and are missing completely at random among the remaining subjects. By missing completely at random, we mean that the missing mechanism is independent of everything else. The following two examples demonstrate how such missingness arises in practice.

Example 1. (Type I missingness). Suppose that a series system has two independent components I and II and let $T$ and $C$ represent times to failure of I and II respectively. The potential observations for a single system consist of $X = T \wedge C$ and $\delta = 1_{(T \le C)}$. Suppose that a large number of systems are operated until failure. Also suppose that the diagnosis of a system to identify which component failed is so costly that it can only be done for a random sample of the systems under testing. Thus we observe all $\{X_i\}$ and a random subset of $\{\delta_i\}$.

Example 2. (Type II missingness). In a medical study, investigators are often interested in the time to death attributable to a particular disease, in which case
$\delta_i = 1$ if and only if the $i$th subject died from that disease. Typically, the causes of death are unknown for some deaths because it requires extra effort (e.g., performing autopsies or obtaining death certificates) to gather such information. Thus the values of $\{\delta_i\}$ may be missing among the deaths. On the other hand, if the $i$th subject has been withdrawn from the study before its termination or is still alive at the end of the study, then $\delta_i$ must be 0. Hence, we have Type II missingness provided that the deaths with known causes are representative of all the subjects who died.

The most commonly adopted strategy for handling missing values is the complete-case analysis, which totally disregards all the subjects with unknown failure statuses. This approach is valid under Type I missingness; however, it can be highly inefficient if there is heavy missingness. For Type II missingness, the complete-case analysis does not even yield consistent estimators.

There have been a few articles on estimating the survival distribution of a homogeneous population in the presence of missing failure indicators. Notably, [5] used the nonparametric maximum likelihood method in conjunction with the EM algorithm to derive an estimator that is analogous to the Kaplan-Meier estimator (1.4). According to [10], however, the maximum likelihood as well as the self-consistent estimators are in general nonunique and inconsistent. Two alternative estimators are proposed in [10] under Type I missingness. As will be discussed in Section 3, these estimators have some undesirable properties.

On the more challenging regression problem, there has been little progress. The only solution seems to have been the modified log rank test for Example 2 proposed in [8]. As admitted by these authors, they made some rather unrealistic assumptions, including the independence between the covariate and the causes of death not under study as well as the proportionality of the hazard rate for the cause of interest and that of the other causes. On the other hand, further developments along the line of efficient estimation can be found in [11, 13, 15]. Furthermore, [17] deals with the additive hazards regression model.

This paper provides a treatment of the Cox regression analysis and the survival function estimation under both types of missingness. In the next section, we introduce a class of estimating functions for $\beta_0$ under Type I missingness which incorporates the partial information from the individuals with unknown $\delta_i$. The consistency and asymptotic normality of the resulting estimators are established. A simple adaptive estimator is constructed which has the smallest variance-covariance matrix among the proposed class of estimators, including the complete-case estimator. Simulation studies show that the adaptive estimator is suitable for practical use. Section 3 deals with the survival function estimation under Type I missingness. For the one-sample case, we derive an adaptive estimator which offers considerable improvements over the complete-case and Lo's estimators [10]. Estimation of the cumulative baseline hazard function for the Cox model is also studied. In Section 4, we apply the ideas developed in Sections 2 and 3 to Type II missingness to obtain consistent estimators with similar optimality properties. Note that some of the technical developments there are streamlined and may be traced to a technical report [7]. We conclude this paper with some discussions in Section 5.

2. Cox regression under Type I missingness

In this section, we propose estimating functions for the parameter vector $\beta_0$ which utilize the partial information from the subjects with unknown failure indicators.
The asymptotic properties of these functions and the resulting parameter estimators are studied in detail. Throughout the paper, we shall make the following assumption, which is satisfied in virtually all practical situations.

Boundedness condition. The covariate processes $Z_i(\cdot) = \{Z_{i1}(\cdot), \ldots, Z_{ip}(\cdot)\}$ $(i = 1, \ldots, n)$ are of bounded variation with a uniform bound, i.e., there exists $K > 0$ such that for all $i$,

\[
\sum_{j=1}^{p} \left\{ |Z_{ij}(0)| + \int_0^\infty |dZ_{ij}(t)| \right\} \le K.
\]
Let $\xi_i$ indicate, by the value 1 vs. 0, whether $\delta_i$ is known or not. Under Type I missingness, the data consist of i.i.d. random vectors $(X_i, \xi_i, \xi_i\delta_i, Z_i)$ $(i = 1, \ldots, n)$, where $\xi_i$ is independent of $(X_i, \delta_i, Z_i)$ for every $i$. Write $\rho = P(\xi_1 = 1)$. Note that the partial likelihood score function (1.2) is the sum over all the observed failure times of the differences between the covariate vectors of the subjects who fail and the weighted averages of the covariate vectors among the subjects under observation. In view of this fact, we introduce the following estimating function:

\[
S_1(\beta) = \sum_{i=1}^{n} \int_0^\infty \left\{ Z_i(t) - \bar Z(\beta, t) \right\} \xi_i \, dN_i^u(t),
\tag{2.1}
\]

where $\bar Z(\beta, t) = \sum_{j=1}^{n} 1_{(X_j \ge t)}\, e^{\beta' Z_j(t)} Z_j(t) \big/ \sum_{j=1}^{n} 1_{(X_j \ge t)}\, e^{\beta' Z_j(t)}$ and $N_i^u(t) = \delta_i 1_{(X_i \le t)}$. In the sequel, we shall also use the notation $Y_i(t) = 1_{(X_i \ge t)}$, $N_i(t) = 1_{(X_i \le t)}$ and $N_i^c(t) = (1 - \delta_i) 1_{(X_i \le t)}$. Note that $\{N_i^u, N_i^c\}$ may not be fully observable whereas $\{N_i, \xi_i N_i^u, \xi_i N_i^c\}$ are always observed. Another way of deriving (2.1) is to modify the partial likelihood function (1.1) by omitting the factors for which the $\delta_i$ are missing. Then $S_1(\beta)$ can be obtained in the usual way by differentiating the “log-likelihood function”.
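To make the estimating function (2.1) concrete, the sketch below evaluates it for a scalar, time-fixed covariate and locates its root by bisection. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the covariate does not vary with time, and bisection is used only because the function is non-increasing in β in this scalar case.

```python
import math


def weighted_covariate_mean(beta, t, times, covariates):
    """Z-bar(beta, t): risk-set average of Z weighted by exp(beta * Z)."""
    numerator = denominator = 0.0
    for x_j, z_j in zip(times, covariates):
        if x_j >= t:  # subject j is still under observation at time t
            weight = math.exp(beta * z_j)
            numerator += weight * z_j
            denominator += weight
    return numerator / denominator


def s1(beta, times, known, events, covariates):
    """S1(beta) of (2.1); events[i] is read only when known[i] == 1."""
    total = 0.0
    for x_i, xi_i, delta_i, z_i in zip(times, known, events, covariates):
        if xi_i == 1 and delta_i == 1:  # known failure: xi_i * dN_i^u jumps at X_i
            total += z_i - weighted_covariate_mean(beta, x_i, times, covariates)
    return total


def solve_s1(times, known, events, covariates, lower=-10.0, upper=10.0, tol=1e-10):
    """Root of S1 on [lower, upper] by bisection (assumes a sign change)."""
    if s1(lower, times, known, events, covariates) < 0 or \
            s1(upper, times, known, events, covariates) > 0:
        raise ValueError("S1 does not change sign on the bracket")
    while upper - lower > tol:
        middle = 0.5 * (lower + upper)
        if s1(middle, times, known, events, covariates) > 0:
            lower = middle
        else:
            upper = middle
    return 0.5 * (lower + upper)
```

Subjects with unknown indicators contribute nothing to the sum itself, but they remain in every risk set through `weighted_covariate_mean`, which is the feature that distinguishes this function from a complete-case score.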
Theorem 2.1. Let $S_1(\beta, t) = \sum_{i=1}^{n} \int_0^t \{Z_i(s) - \bar Z(\beta, s)\}\, \xi_i \, dN_i^u(s)$.

(i) The process $n^{-1/2} S_1(\beta_0, \cdot)$ converges weakly to a zero-mean Gaussian martingale with variance function

\[
V_1(t) = E\left[ \int_0^t \{Z_1(s) - \bar z(\beta_0, s)\}^{\otimes 2}\, \xi_1 \, dN_1^u(s) \right],
\tag{2.2}
\]

where $\bar z(\beta, t) = E\{Y_1(t) e^{\beta' Z_1(t)} Z_1(t)\} / E\{Y_1(t) e^{\beta' Z_1(t)}\}$.

(ii) Define $\tilde\beta$ as the root of $\{S_1(\beta) = 0\}$. If $V_1 = V_1(\infty)$ is nonsingular, then $n^{1/2}(\tilde\beta - \beta_0) \xrightarrow{d} N(0, V_1^{-1})$.

Remarks. (1) It is simple to show that $V_1 = \rho V$, where $V$ is the limiting covariance matrix for $\hat\beta_f$ defined in Section 1. By the arguments of [1], $V_1(t)$ can be consistently estimated by

\[
\hat V_1(t) = n^{-1} \sum_{i=1}^{n} \int_0^t \left\{ \frac{\sum_{j=1}^{n} Y_j(s)\, e^{\tilde\beta' Z_j(s)} Z_j(s)^{\otimes 2}}{\sum_{j=1}^{n} Y_j(s)\, e^{\tilde\beta' Z_j(s)}} - \bar Z^{\otimes 2}(\tilde\beta, s) \right\} \xi_i \, dN_i^u(s).
\]

(2) The nonsingularity of $V_1$ is a very mild assumption and it is true in practically all meaningful situations.
(3) The difference between the process $S_1(\beta, t)$ and the partial likelihood score process under the complete-case analysis

\[
S_d(\beta, t) = \sum_{i=1}^{n} \int_0^t \left\{ Z_i(s) - \bar Z_d(\beta, s) \right\} \xi_i \, dN_i^u(s),
\]

where $\bar Z_d(\beta, t) = \sum_{j=1}^{n} \xi_j Y_j(t)\, e^{\beta' Z_j(t)} Z_j(t) \big/ \sum_{j=1}^{n} \xi_j Y_j(t)\, e^{\beta' Z_j(t)}$, is that the subjects with unknown failure indicators are included in the calculation of $\bar Z$, but not in that of $\bar Z_d$. It is somewhat surprising to note that $S_d(\beta, \cdot)$ and the corresponding estimator $\hat\beta_d$ have the same asymptotic distributions as those of $S_1(\beta, \cdot)$ and $\tilde\beta$, respectively, even though $\bar Z(\beta, t)$ is a more accurate estimator of $\bar z(\beta, t)$ than $\bar Z_d(\beta, t)$ is. As will be seen in the proof of Theorem 2.1, however, $S_d(\beta, \cdot)$ and $\hat\beta_d$ themselves are not asymptotically equivalent to $S_1(\beta, \cdot)$ and $\tilde\beta$. Simulation results to be reported later in the section reveal that $\tilde\beta$ tends to be slightly more efficient than $\hat\beta_d$ for small and moderate-sized samples.

(4) The use of $S_1(\beta)$ may incur substantial loss of information, especially when $\rho$ is small, since the asymptotic distribution of $\tilde\beta$ is the same as that of $\hat\beta_d$, which only uses data with known failure indicators. Indeed, the purpose of this section is to construct a new estimator that combines $S_1(\beta)$ with an estimating function utilizing the counting processes $N_i(\cdot)$ associated with $\xi_i = 0$. In this connection, the estimating function $S_1$ plays only a transitional role.

Proof of Theorem 2.1. For notational simplicity, assume $p = 1$. Let $M_i(t) = N_i^u(t) - \int_0^t Y_i(s)\, e^{\beta_0 Z_i(s)} \lambda_0(s)\,ds$, which are martingale processes with respect to an appropriate σ-filtration [1]. Decompose $S_1(\beta_0, t)$ into two parts:

\[
\begin{aligned}
S_1(\beta_0, t) &= \sum_{i=1}^{n} \int_0^t \left\{ Z_i(s) - \bar Z(\beta_0, s) \right\} \xi_i \, dM_i(s) \\
&\quad + \int_0^t \sum_{i=1}^{n} (\xi_i - \rho) \left\{ Z_i(s) - \bar Z(\beta_0, s) \right\} e^{\beta_0 Z_i(s)} Y_i(s) \lambda_0(s)\,ds \\
&= S_{11}(t) + S_{12}(t), \quad \text{say.}
\end{aligned}
\]

Now $n^{-1/2} S_{11}(\cdot)$ is a martingale. By the arguments of [1], $n^{-1/2} S_{11}(t)$ is asymptotically equivalent to $n^{-1/2} \tilde S_{11}(t) = n^{-1/2} \sum_{i=1}^{n} \int_0^t W_i(s)\, \xi_i \, dM_i(s)$, where $W_i(s) = Z_i(s) - \bar z(\beta_0, s)$, and converges weakly in $D[0, \infty)$ to a Gaussian martingale with variance function $V_1(t)$. Note that the tightness of $n^{-1/2} S_{11}(\cdot)$ at $\infty$ can be easily handled along the lines of [6]. From Lemma 1(i) given at the end of the section, $n^{-1/2} S_{12}(t)$ is also tight and is asymptotically equivalent to

\[
n^{-1/2} \tilde S_{12}(t) = n^{-1/2} \sum_{i=1}^{n} (\xi_i - \rho) \int_0^t W_i(s)\, e^{\beta_0 Z_i(s)} Y_i(s) \lambda_0(s)\,ds.
\]

Hence, $n^{-1/2} S_1(\beta_0, \cdot)$ is asymptotically equivalent to $n^{-1/2}\{\tilde S_{11}(\cdot) + \tilde S_{12}(\cdot)\}$, which converges weakly to a zero-mean Gaussian process with covariance function at $(t, t')$ that can be shown to be equal to

\[
n^{-1} E\left[ \{\tilde S_{11}(t) + \tilde S_{12}(t)\}\{\tilde S_{11}(t') + \tilde S_{12}(t')\} \right] = V_1(t \wedge t').
\]

To prove part (ii) of the theorem, note that $-n^{-1}\,\partial S_1(\beta)/\partial\beta$ is positive (positive definite for $p > 1$) and converges to $E\left[\int_0^\infty \{Z_1(t) - \bar z(\beta, t)\}^2\, \xi_1 \, dN_1^u(t)\right]$. Thus, $\tilde\beta$
is uniquely defined and the arguments of [1] entail the convergence of $n^{1/2}(\tilde\beta - \beta_0)$ to $N(0, V_1^{-1})$.

To incorporate the partial survival information from those subjects with missing $\delta_i$, it is natural to consider the counting processes $(1 - \xi_i) N_i(\cdot)$ and to subtract off the jumps due to censoring. In this connection, we introduce

\[
S_2(\beta, t) = \sum_{i=1}^{n} \int_0^t \left\{ Z_i(s) - \bar Z(\beta, s) \right\} \left\{ (1 - \xi_i)\, dN_i(s) - \hat\rho^{-1}(1 - \hat\rho)\, \xi_i \, dN_i^c(s) \right\},
\tag{2.3}
\]

where $\hat\rho = n^{-1} \sum_{i=1}^{n} \xi_i$, noting that $E\{(1 - \xi_1) N_1(t) - \rho^{-1}(1 - \rho)\, \xi_1 N_1^c(t)\} = E\{N_1^u(t)\}$.
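Theorem 2.2. The process $n^{-1/2} S_2(\beta_0, \cdot)$ is asymptotically independent of $n^{-1/2} S_1(\beta_0, \cdot)$ and converges weakly to a zero-mean Gaussian process with covariance function

\[
\begin{aligned}
V_2(t, t') &= E\left[ \int_0^{t \wedge t'} W_1^{\otimes 2}(s)\,(1 - \xi_1)\, dN_1^u(s) \right] \\
&\quad + \rho^{-1}(1 - \rho)\, E\left[ \{N_1^{CZ}(t) - E N_1^{CZ}(t)\}\{N_1^{CZ}(t') - E N_1^{CZ}(t')\}' \right],
\end{aligned}
\]

where $N_i^{CZ}(t) = \int_0^t \{Z_i(s) - \bar z(\beta_0, s)\}\, dN_i^c(s)$ $(i = 1, \ldots, n)$.

Proof. Again assume $p = 1$. Since $\hat\rho - \rho = O_p(n^{-1/2})$, by the usual delta method,

\[
\begin{aligned}
S_2(\beta_0, t) &= \sum_{i=1}^{n} \int_0^t \left\{ Z_i(s) - \bar Z(\beta_0, s) \right\} (1 - \xi_i)\, dM_i(s) \\
&\quad + \sum_{i=1}^{n} \int_0^t \left\{ Z_i(s) - \bar Z(\beta_0, s) \right\} \{1 - \xi_i - (1 - \rho)\}\, e^{\beta_0 Z_i(s)} Y_i(s) \lambda_0(s)\,ds \\
&\quad + \sum_{i=1}^{n} \int_0^t \left\{ Z_i(s) - \bar Z(\beta_0, s) \right\} dN_i^c(s)\, \rho^{-1}(\rho - \xi_i) \\
&\quad + \sum_{i=1}^{n} \int_0^t \left\{ Z_i(s) - \bar Z(\beta_0, s) \right\} dN_i^c(s)\, \rho^{-1}(\hat\rho - \rho) + r_n(t) \\
&= S_{21}(t) + S_{22}(t) + r_n(t), \quad \text{say.}
\end{aligned}
\]

Here $S_{21}(t)$ collects the first two sums and $S_{22}(t)$ the last two, and the remainder term $r_n$ is uniformly negligible in the sense that $\sup_t |r_n(t)| = o_p(n^{1/2})$. Note that $S_{21}(t)$ is the same as $S_1(t)$ except that $\{\xi_i\}$ there are replaced by $\{1 - \xi_i\}$. Thus $S_{21}(t)$ is tight in $D[0, \infty)$ and is asymptotically equivalent to

\[
\tilde S_{21}(t) = \sum_{i=1}^{n} \int_0^t W_i(s)(1 - \xi_i)\, dM_i(s) + \sum_{i=1}^{n} \int_0^t W_i(s)(\rho - \xi_i)\, e^{\beta_0 Z_i(s)} Y_i(s) \lambda_0(s)\,ds.
\]

By Lemma 1(ii), $S_{22}(t)$ is asymptotically equivalent to

\[
\tilde S_{22}(t) = -\sum_{i=1}^{n} \rho^{-1}(\xi_i - \rho) \left\{ N_i^{CZ}(t) - E N_i^{CZ}(t) \right\}.
\]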
By writing

\[
\tilde S_{21}(t) = \sum_{i=1}^{n} \int_0^t W_i(s)(1 - \rho)\, dM_i(s) + \sum_{i=1}^{n} \int_0^t W_i(s)\, dN_i^u(s)\,(\rho - \xi_i),
\]

we can show that, for any $t$ and $t'$,

\[
E\left\{ \tilde S_{21}(t)\, \tilde S_{22}(t') \right\} = 0.
\tag{2.4}
\]

Thus $n^{-1/2}\{\tilde S_{21}(\cdot) + \tilde S_{22}(\cdot)\}$ converges weakly to a zero-mean Gaussian process with $V_2$ as its covariance function.

Similar to (2.4), $E[\{\tilde S_{11}(t) + \tilde S_{12}(t)\}\, \tilde S_{22}(t')] = 0$ for any $t$ and $t'$. Thus to prove the asymptotic independence between $S_1$ and $S_2$, it suffices to show

\[
E\left[ \{\tilde S_{11}(t) + \tilde S_{12}(t)\}\, \tilde S_{21}(t') \right] = 0.
\]

To this end, we can apply the same covariance calculation as employed in the proof of Theorem 2.1 to show that

\[
E\left[ \{\tilde S_{11}(t) + \tilde S_{12}(t)\}\, \tilde S_{21}(t') \right] = n E\left[ \int_0^t W_1(s)\, \xi_1 \, dM_1(s) \int_0^{t'} W_1(s')(1 - \xi_1)\, dM_1(s') \right] = 0.
\]

By combining $S_1$ and $S_2$, more efficient estimators of $\beta_0$ may be obtained. Specifically, given a $p \times p$ matrix $D$, we can define $\hat\beta$ as a solution to

\[
S_1(\beta) + D S_2(\beta) = 0.
\tag{2.5}
\]
Theorem 2.3. Suppose that $\{\rho V + (1 - \rho) D V\}$ is nonsingular. Let $V_2 = V_2(\infty, \infty)$. Then $n^{1/2}(\hat\beta - \beta_0) \xrightarrow{d} N(0, \Sigma(D))$, where

\[
\Sigma(D) = \{\rho V + (1 - \rho) D V\}^{-1} (\rho V + D V_2 D') \{\rho V + (1 - \rho) V D'\}^{-1}.
\tag{2.6}
\]

In particular, $D^* = (1 - \rho) V V_2^{-1}$ yields

\[
\Sigma(D^*) = \left\{ \rho V + (1 - \rho)^2 V V_2^{-1} V \right\}^{-1}
\]

and is optimal in the sense that $\Sigma(D) - \Sigma(D^*)$ is nonnegative definite for any $D$.
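For a single covariate (p = 1), the optimality statement of Theorem 2.3 can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, all matrices are treated as scalars, and the input values in the usage note are arbitrary.

```python
def sigma(d, rho, v, v2):
    """Scalar (p = 1) version of Sigma(D) in (2.6)."""
    bread = rho * v + (1.0 - rho) * d * v   # must be nonzero (nonsingularity)
    return (rho * v + d * v2 * d) / bread ** 2


def optimal_weight(rho, v, v2):
    """D* = (1 - rho) V V2^{-1} for p = 1."""
    return (1.0 - rho) * v / v2


def optimal_variance(rho, v, v2):
    """Sigma(D*) = {rho V + (1 - rho)^2 V V2^{-1} V}^{-1} for p = 1."""
    return 1.0 / (rho * v + (1.0 - rho) ** 2 * v * v / v2)
```

For instance, with the arbitrary values ρ = 0.5, V = 1 and V2 = 1, `optimal_weight` returns 0.5 and `sigma(0.5, 0.5, 1.0, 1.0)` agrees with `optimal_variance(0.5, 1.0, 1.0)`, while `sigma(0.0, 0.5, 1.0, 1.0)` gives the larger variance 1/(ρV) of the estimator based on S1 alone.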
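Remarks. (1) Let $V_{CZ} = E\left[ \{N_1^{CZ}(\infty) - E N_1^{CZ}(\infty)\}^{\otimes 2} \right]$. Then $V_2 = (1 - \rho) V + \rho^{-1}(1 - \rho) V_{CZ}$. For $p = 1$, $D^* = V/(V + \rho^{-1} V_{CZ})$ and $\Sigma(D^*) = (V + \rho^{-1} V_{CZ})/\{V(V + V_{CZ})\}$. This variance will be close to the ideal $V^{-1}$ if either $\rho$ is close to 1 (light missingness) or $V_{CZ}$ is close to zero (light censorship).

(2) A consistent estimator for $\Sigma(D)$ may be obtained by replacing $\rho$, $V$ and $V_2$ in (2.6) by $\hat\rho$, $\hat V(\hat\beta)$ and $\hat V_2(\hat\beta)$, where

\[
\hat V(\beta) = \frac{1}{\sum_{i=1}^{n} \xi_i} \sum_{i=1}^{n} \int_0^\infty \left\{ \frac{\sum_{j=1}^{n} Y_j(t)\, e^{\beta' Z_j(t)} Z_j(t)^{\otimes 2}}{\sum_{j=1}^{n} Y_j(t)\, e^{\beta' Z_j(t)}} - \bar Z^{\otimes 2}(\beta, t) \right\} \xi_i \, dN_i^u(t),
\]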
Table 1. Monte Carlo estimates for the sampling means and variances of four estimators of β0 and for the sizes of the corresponding 0.05-level Wald tests for testing H0: β0 = 0 under the model λ(t|Z) = 1.

                   20% censoring            50% censoring            70% censoring
ρ     Estimator    Mean     Var.    Size    Mean     Var.    Size    Mean     Var.    Size
0.8   β̂_f         -0.001   0.015   0.056   -0.001   0.023   0.054   0.002    0.038   0.052
0.8   β̂_d         -0.001   0.018   0.053   -0.002   0.030   0.052   0.002    0.049   0.049
0.8   β̃           -0.001   0.018   0.054   -0.001   0.029   0.053   0.002    0.049   0.050
0.8   β̂*          -0.001   0.015   0.056   -0.002   0.027   0.055   0.002    0.046   0.051
0.5   β̂_f         -0.001   0.015   0.056   -0.001   0.023   0.054   0.002    0.038   0.052
0.5   β̂_d         -0.002   0.032   0.057   -0.002   0.054   0.053   0.001    0.092   0.049
0.5   β̃            0.001   0.029   0.050   -0.001   0.048   0.052   0.003    0.082   0.050
0.5   β̂*          -0.001   0.017   0.056   -0.002   0.037   0.052   0.002    0.071   0.048

NOTE: Z is standard normal. The censoring time is exponentially distributed with hazard rate λc, where λc is chosen to achieve the desired censoring percentage. The sample size is 100. Each block is based on 10,000 replications. The random number generator of [16] is used.
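The design described in the note to Table 1 can be sketched as follows. Under the null model λ(t|Z) = 1 the failure time is standard exponential, so an exponential censoring time with rate λc is smaller than the failure time with probability λc/(1 + λc); solving for λc gives the rate for a target censoring fraction. The code is a hedged illustration: the function names and the record layout are hypothetical, and Python's built-in generator stands in for the generator of [16].

```python
import random


def censoring_rate(censoring_fraction):
    """lambda_c such that P(C < T) = censoring_fraction for T ~ Exp(1), C ~ Exp(lambda_c)."""
    return censoring_fraction / (1.0 - censoring_fraction)


def simulate_null_sample(n, censoring_fraction, rho, seed=0):
    """Draw n records (X, xi, delta or None, Z) under lambda(t|Z) = 1 with Type I missingness."""
    rng = random.Random(seed)
    lambda_c = censoring_rate(censoring_fraction)
    sample = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)            # standard normal covariate
        t = rng.expovariate(1.0)           # failure time, hazard 1 regardless of Z
        c = rng.expovariate(lambda_c)      # censoring time
        x = min(t, c)
        delta = 1 if t <= c else 0
        xi = 1 if rng.random() < rho else 0  # indicator that delta is observed
        sample.append((x, xi, delta if xi == 1 else None, z))
    return sample
```

For the three censoring levels in Table 1 this gives λc = 0.25, 1 and 7/3, respectively.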
Vˆ2 (β) = (1 − ρˆ)Vˆ (β) + ρˆ−1 (1 − ρˆ)VˆCZ (β), n ∞
1 ˆ ¯ t) ⊗2 ξi dN c (t) VCZ (β) = n Zi (t) − Z(β, i i=1 ξi i=1 0
1
− n
n
i=1 ξi i=1
∞
0
⊗2 ¯ t) ξi dN c (t) Zi (t) − Z(β, . i
Since we can estimate the optimal weight $D^*$ consistently by $\hat D^* = (1-\hat\rho)\hat V(\tilde\beta)\hat V_2^{-1}(\tilde\beta)$, an “adaptive” estimator of $\beta_0$ that achieves the lower variance-covariance bound $\Sigma(D^*)$ may be constructed. Specifically, we can first use $\tilde\beta$ from $\{S_1(\beta) = 0\}$ to compute $\hat D^*$ and then obtain the adaptive estimator by solving

$$S_1(\beta) + \hat D^* S_2(\beta) = 0. \tag{2.7}$$
Corollary 1. Let $\hat\beta^*$ be the estimator given by (2.7). Then under the same assumptions as Theorem 2.3, $n^{1/2}(\hat\beta^* - \beta_0) \stackrel{d}{\to} N(0, \Sigma(D^*))$. In addition, $\Sigma(D^*)$ can be consistently estimated by $\{\hat\rho\hat V(\hat\beta^*) + (1-\hat\rho)^2\hat V(\hat\beta^*)\hat V_2^{-1}(\hat\beta^*)\hat V(\hat\beta^*)\}^{-1}$.
We have carried out extensive Monte Carlo experiments to investigate the finite-sample behaviour of the proposed adaptive estimator $\hat\beta^*$ and to compare it with the full-data estimator $\hat\beta_f$, the complete-case estimator $\hat\beta_d$ and the $S_1(\beta)$ estimator $\tilde\beta$. The key results are summarized in Tables 1 and 2. The biases of all four estimators and of their variance estimators (the latter not shown here) are negligible, and the associated Wald tests have proper sizes. The adaptive estimator is always more efficient than $\hat\beta_d$ and $\tilde\beta$, as is reflected in the sampling variances of the estimators as well as in the powers of the Wald tests. The gains in the relative efficiencies increase as the missing probability increases and decrease as the censoring probability increases. The efficiency of $\hat\beta^*$ relative to $\hat\beta_f$ is close to 1 when censoring is light. The estimator $\tilde\beta$ seems to have slightly better small-sample efficiency than $\hat\beta_d$.

Proof of Theorem 2.3. From its definition, $-\partial S_1(\beta)/\partial\beta$ is, with probability 1, positive definite. Thus, following [1], we can show that $(\rho n)^{-1}S_1(\beta)$ converges uniformly
Table 2
Monte Carlo estimates for the sampling means and variances of four estimators of $\beta_0$ and for the powers of the corresponding 0.05-level Wald tests for testing $H_0: \beta_0 = 0$ under the model $\lambda(t \mid Z) = e^{0.5Z}$

                          20% Censoring            50% Censoring            70% Censoring
ρ     Estimator           Mean    Var.    Power    Mean    Var.    Power    Mean    Var.    Power
0.8   $\hat\beta_f$       0.509   0.017   0.984    0.511   0.026   0.912    0.514   0.041   0.755
      $\hat\beta_d$       0.511   0.021   0.956    0.514   0.033   0.844    0.518   0.053   0.655
      $\tilde\beta$       0.510   0.021   0.955    0.513   0.033   0.844    0.516   0.052   0.653
      $\hat\beta^*$       0.509   0.018   0.980    0.512   0.030   0.876    0.516   0.049   0.684
0.5   $\hat\beta_f$       0.509   0.017   0.984    0.511   0.026   0.912    0.514   0.041   0.755
      $\hat\beta_d$       0.516   0.038   0.813    0.522   0.061   0.624    0.530   0.102   0.435
      $\tilde\beta$       0.511   0.035   0.821    0.514   0.054   0.642    0.520   0.088   0.450
      $\hat\beta^*$       0.509   0.021   0.960    0.512   0.042   0.757    0.518   0.077   0.514

NOTE: See NOTE of Table 1.
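in any compact set to the nonrandom function

$$m(\beta) = E\left[\int_0^\infty \{Z_1(t) - \bar z(\beta,t)\}\,dN_1^u(t)\right].$$

For $S_2(\beta)$, we have, by the law of large numbers,

$$n^{-1}\sum_{i=1}^n \int_0^\infty \{Z_i(t) - \bar Z(\beta,t)\}(1-\xi_i)\,dN_i(t) - n^{-1}\sum_{i=1}^n \int_0^\infty \{Z_i(t) - \bar Z(\beta,t)\}(1-\rho)\,dN_i(t)$$

$$= o_p(1) + \int_0^\infty \bar Z(\beta,t)\,d\left\{n^{-1}\sum_{i=1}^n N_i(t)(\xi_i-\rho)\right\} = o_p(1),$$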
where the last equality follows from the facts that $\sup_t \left|n^{-1}\sum_{i=1}^n N_i(t)(\xi_i-\rho)\right| = o_p(n^{-1/4})$ and that the total variation of $\bar Z(\beta,\cdot)$ is at most $O(\log n)$ uniformly for $\beta$ in any compact region. Thus the order $o_p(1)$ is also uniform. Continuing this line of argument, we get

$$n^{-1}S_2(\beta) = n^{-1}\sum_{i=1}^n \int_0^\infty \{Z_i(t) - \bar Z(\beta,t)\}(1-\rho)\,dN_i^u(t) + o_p(1) = (1-\rho)m(\beta) + o_p(1)$$

with the same uniformity. Thus $n^{-1}\{S_1(\beta) + DS_2(\beta)\}$ is uniformly approximated by $\{\rho I + (1-\rho)D\}m(\beta)$, which has a unique root $\beta_0$. Hence, $\hat\beta \stackrel{p}{\to} \beta_0$.

The asymptotic normality is easier to show now. Taking the Taylor series expansion of $S_1(\hat\beta) + DS_2(\hat\beta)$ at $\beta_0$, we get

$$n^{1/2}(\hat\beta - \beta_0) = \left[\{\rho I + (1-\rho)D\}V\right]^{-1} n^{-1/2}\{S_1(\beta_0) + DS_2(\beta_0)\} + o_p(1),$$

which, by Theorems 2.1 and 2.2 and a straightforward matrix manipulation, converges to the desired normal distribution.
To verify the optimality of $D^*$, we note that the estimating function can be linearized around $\beta_0$ and the limiting normal random vectors may be used in place of $n^{-1/2}S_k(\beta_0)$ $(k = 1, 2)$. Specifically, we can consider the following “limiting” linear model

$$S_1^* = \rho V b + S_{01}^*, \qquad S_2^* = (1-\rho)V b + S_{02}^*,$$

where $S_{0k}^*$ $(k = 1, 2)$ are independent $N(0, V_k)$ $(k = 1, 2)$. Recall that $V_1 = \rho V$ and $V_2 = (1-\rho)(V + \rho^{-1}V_{CZ})$. By the Gauss-Markov theorem, the best linear estimator is

$$\hat b^* = \left\{\rho V + (1-\rho)^2 V V_2^{-1} V\right\}^{-1}\left\{S_1^* + (1-\rho)V V_2^{-1}S_2^*\right\}$$

with variance-covariance matrix $\{\rho V + (1-\rho)^2 V V_2^{-1} V\}^{-1}$, which is exactly $\Sigma(D^*)$.
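The optimality claim can be checked numerically. The following is a minimal sketch, not part of the original development: it draws arbitrary positive definite stand-ins for $V$ and $V_{CZ}$, forms the sandwich variance implied by the expansion of $n^{1/2}(\hat\beta - \beta_0)$ in the proof of Theorem 2.3, and compares it with the bound $\{\rho V + (1-\rho)^2 V V_2^{-1} V\}^{-1}$. The function names, the dimension $p = 3$ and the random matrices are illustrative assumptions.

```python
import numpy as np

def sigma(D, V, V2, rho):
    """Sandwich variance of n^{1/2}(beta_hat - beta_0) for a weight matrix D.

    Uses Var{n^{-1/2} S1} = rho * V and Var{n^{-1/2} S2} = V2, with S1 and S2
    asymptotically independent, as in the expansion of Theorem 2.3.
    """
    p = V.shape[0]
    bread = (rho * np.eye(p) + (1.0 - rho) * D) @ V
    meat = rho * V + D @ V2 @ D.T
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv.T

def random_spd(p, rng):
    """Arbitrary symmetric positive definite matrix (illustrative stand-in)."""
    a = rng.normal(size=(p, p))
    return a @ a.T + p * np.eye(p)

def optimal_weight(V, V2, rho):
    """D* = (1 - rho) V V2^{-1}."""
    return (1.0 - rho) * V @ np.linalg.inv(V2)

def bound(V, V2, rho):
    """Sigma(D*) = {rho V + (1 - rho)^2 V V2^{-1} V}^{-1}."""
    return np.linalg.inv(rho * V + (1.0 - rho) ** 2 * V @ np.linalg.inv(V2) @ V)

rng = np.random.default_rng(0)
p, rho = 3, 0.5
V = random_spd(p, rng)
V_cz = random_spd(p, rng)
V2 = (1.0 - rho) * (V + V_cz / rho)

D_star = optimal_weight(V, V2, rho)
gap_at_optimum = np.max(np.abs(sigma(D_star, V, V2, rho) - bound(V, V2, rho)))

# Smallest (scaled) eigenvalue of Sigma(D) - Sigma(D*) over a few random D.
# It should be nonnegative up to rounding error if D* is optimal.
worst = np.inf
for _ in range(20):
    D = rng.normal(size=(p, p))
    diff = sigma(D, V, V2, rho) - bound(V, V2, rho)
    diff = (diff + diff.T) / 2.0
    eigs = np.linalg.eigvalsh(diff)
    worst = min(worst, eigs.min() / max(1.0, np.abs(eigs).max()))

print(gap_at_optimum, worst)
```

In this sketch the first printed number measures how closely $\Sigma(D^*)$ matches the bound, and the second is the most negative scaled eigenvalue found among the random competitors; a random search of this kind illustrates the Gauss-Markov argument but does not replace it.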
Lemma 1. (i) The process

$$n^{-1/2}S_{12}(t) = n^{-1/2}\sum_{i=1}^n \int_0^t (\xi_i-\rho)\{Z_i(s) - \bar Z(\beta_0,s)\}e^{\beta_0^{\top}Z_i(s)}\lambda_0(s)\,ds$$

is tight in $D[0,\infty)$ and is asymptotically equivalent to

$$n^{-1/2}\tilde S_{12}(t) = n^{-1/2}\sum_{i=1}^n \int_0^t (\xi_i-\rho)\{Z_i(s) - \bar z(\beta_0,s)\}e^{\beta_0^{\top}Z_i(s)}\lambda_0(s)\,ds$$

in the sense that $\sup_t n^{-1/2}\left|\tilde S_{12}(t) - S_{12}(t)\right| = o_p(1)$.

(ii) The process $n^{-1/2}\sum_{i=1}^n (\xi_i-\rho)\int_0^t \{Z_i(s) - \bar Z(\beta_0,s)\}\,dN_i^c(s)$ is tight and asymptotically equivalent to

$$n^{-1/2}\sum_{i=1}^n \int_0^t (\xi_i-\rho)\{Z_i(s) - \bar z(\beta_0,s)\}\,dN_i^c(s).$$
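Proof. Without loss of generality, assume $p = 1$. For any $t_1 < t_2$, we have

$$\lim_{n\to\infty} E\left\{n^{-1/2}S_{12}(t_2) - n^{-1/2}S_{12}(t_1)\right\}^4$$

$$= \lim_{n\to\infty} n^{-2}\sum_{i\neq j} E\left(\left[(\xi_i-\rho)\int_{t_1}^{t_2}\{Z_i(s) - \bar Z(\beta_0,s)\}e^{\beta_0 Z_i(s)}Y_i(s)\lambda_0(s)\,ds\right]^2 \times \left[(\xi_j-\rho)\int_{t_1}^{t_2}\{Z_j(s) - \bar Z(\beta_0,s)\}e^{\beta_0 Z_j(s)}Y_j(s)\lambda_0(s)\,ds\right]^2\right)$$

$$\le (2K)^4 \lim_{n\to\infty} n^{-2}\sum_{i\neq j} E\left(\left[(\xi_i-\rho)\int_{t_1}^{t_2}e^{\beta_0 Z_i(s)}Y_i(s)\lambda_0(s)\,ds\right]^2 \times \left[(\xi_j-\rho)\int_{t_1}^{t_2}e^{\beta_0 Z_j(s)}Y_j(s)\lambda_0(s)\,ds\right]^2\right)$$

$$= (2K)^4\{\rho(1-\rho)\}^2\left[E\left\{\int_{t_1}^{t_2}Y_1(s)e^{\beta_0 Z_1(s)}\lambda_0(s)\,ds\right\}^2\right]^2.$$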
Since $\mu[t_1,t_2] = E\left\{\int_{t_1}^{t_2}Y_1(s)e^{\beta_0 Z_1(s)}\lambda_0(s)\,ds\right\}^2$ is a finite measure on $[0,\infty)$, the moment criterion ([12], page 52, formula (30)) implies the tightness of $n^{-1/2}S_{12}$. Likewise, $n^{-1/2}\tilde S_{12}$ is also tight. Furthermore, let $t_0 = \inf\{t : EY_1(t) = 0\}$. Then it is easy to see that $\sup_{s\le t}|\bar Z(\beta_0,s) - \bar z(\beta_0,s)| \stackrel{p}{\to} 0$ for any $t < t_0$. Thus the equivalence of $n^{-1/2}S_{12}$ and $n^{-1/2}\tilde S_{12}$ follows from the tightness just proved. Hence (i) holds.

The proof of (ii) is very much the same as that of (i). Because of possible discontinuity of $EN_1^c(t)$ in $t$, another moment condition ([12], page 51, formula (25)) should be used. Note that the tightness continues to hold even if the measure $\mu$ there is discontinuous.

3. Cumulative hazard function estimation under Type I missingness

In this section, we first deal with the problem of nonparametric estimation of the cumulative hazard function for a homogeneous population under Type I missingness. We shall discuss the estimators proposed in [10] and give our own solutions. We then apply the ideas to the estimation of the cumulative baseline hazard function for the Cox model. In both cases, asymptotic distributions of the relevant estimators are derived.

In the one-sample case, the observations consist of i.i.d. random vectors $(X_i, \xi_i, \xi_i\delta_i)$ $(i = 1, \ldots, n)$, where $X_i = T_i \wedge C_i$, $\delta_i = 1_{(T_i \le C_i)}$ and $\xi_i$ is the missing indicator independent of $(X_i, \delta_i)$. Assume that $T_i$ is independent of $C_i$ and that $T_i$ has a continuous distribution function. Let $F(t) = P(T_1 \le t)$, $\Lambda(t) = \int_0^t dF(s)/\{1 - F(s)\}$, $G(t) = P(C_1 \le t)$, $\Lambda_G(t) = \int_0^t dG(s)/\{1 - G(s-)\}$, $H(t) = \{1 - F(t)\}\{1 - G(t-)\}$, $A(t) = \int_0^t d\Lambda(s)/H(s)$ and $A_G(t) = \int_0^t dG(s)/[\{1 - G(s-)\}H(s)]$. The notation for $Y_i$, $N_i$, $N_i^u$, $N_i^c$, $\rho$, $\hat\rho$, etc. introduced in Section 2 will also be used.

Under the setup described above, [10] shows that the nonparametric maximum likelihood method typically does not yield a consistent estimator for $F$, indicating that this is far more complicated than the complete-data situation.
Two alternative estimators, $\hat F_A$ and $\hat F_B$, are also proposed there. It can be shown, by expanding $\log(1 - \hat F_A)$, that $\hat F_A$ is not a consistent estimator; in particular, Theorem 3 of [10] is not valid. In our notation, the second estimator is given by

$$\hat F_B(t) = 1 - \prod_{X_i \le t}\left\{1 - \frac{1}{\sum_{j=1}^n Y_j(X_i)}\right\}^{\xi_i\delta_i/\hat\rho}. \tag{3.1}$$

Motivated by (3.1), we modify (1.3) to obtain the following estimator for $\Lambda(t)$:

$$\hat\Lambda_1(t) = \int_0^t \frac{\sum_{i=1}^n \xi_i\,dN_i^u(s)}{\hat\rho\sum_{j=1}^n Y_j(s)}. \tag{3.2}$$

By the exponentiation formula of Doleans-Dade ([1], p. 897), the corresponding estimator for $F(t)$ is

$$\hat F_1(t) = 1 - \prod_{X_i \le t}\left\{1 - \frac{\xi_i\delta_i}{\hat\rho\sum_{j=1}^n Y_j(X_i)}\right\}. \tag{3.3}$$

It is easily seen that $\hat F_1$ and $\hat F_B$ are asymptotically equivalent; however, the cumulative hazard function approach is more convenient for our later developments.
Expression (3.2) also reveals that $\hat\Lambda_1$ (and hence $\hat F_B$ and $\hat F_1$) does not utilize the counting process information from the subjects with $\xi_i = 0$. To recover this information, we introduce

$$\hat\Lambda_2(t) = \int_0^t \frac{\sum_{i=1}^n (1-\xi_i)\,dN_i(s) - \hat\rho^{-1}(1-\hat\rho)\sum_{i=1}^n \xi_i\,dN_i^c(s)}{(1-\hat\rho)\sum_{i=1}^n Y_i(s)}, \tag{3.4}$$

which shares the same spirit as estimating function $S_2(\beta)$ given in (2.3). Thus, $\Lambda(t)$ can be estimated by

$$\hat\Lambda(\alpha,t) = \alpha\hat\Lambda_1(t) + (1-\alpha)\hat\Lambda_2(t), \tag{3.5}$$

where $\alpha \in [0,1]$.
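Estimators (3.2), (3.4) and (3.5) are sums over the observed times and can be computed directly. The following is a minimal Python sketch under assumptions that are not stated in this section: $\hat\rho$ is taken to be the observed fraction of known indicators, $N_i(t) = 1_{(X_i \le t)}$ is taken to count both failures and censorings, and there are no tied times. The function name and the simulation settings are illustrative.

```python
import numpy as np

def cumhaz_type1(x, delta, xi, t, alpha):
    """Lambda_hat(alpha, t) = alpha * Lambda_hat_1(t) + (1 - alpha) * Lambda_hat_2(t).

    x     : observed times X_i = min(T_i, C_i)
    delta : failure indicators (only used where xi == 1)
    xi    : 1 if delta_i is observed, 0 if it is missing
    Requires 0 < mean(xi) < 1 so that both pieces are defined.
    """
    x = np.asarray(x, dtype=float)
    xi = np.asarray(xi, dtype=float)
    # delta is unknown when xi == 0; zero it out so it never enters the sums.
    delta = np.where(xi == 1, np.asarray(delta, dtype=float), 0.0)
    n = x.size
    rho_hat = xi.mean()  # assumed definition of rho_hat

    # Size of the risk set at each X_i: sum_j Y_j(X_i) = #{j : X_j >= X_i}.
    at_risk = n - np.searchsorted(np.sort(x), x, side="left")
    jump = (x <= t) / at_risk

    # (3.2): failures with a known indicator, rescaled by rho_hat.
    lam1 = np.sum(xi * delta * jump) / rho_hat
    # (3.4): observations with a missing indicator, minus a correction
    # built from the observations known to be censored.
    correction = (1.0 - rho_hat) / rho_hat * xi * (1.0 - delta)
    lam2 = np.sum(((1.0 - xi) - correction) * jump) / (1.0 - rho_hat)
    return alpha * lam1 + (1.0 - alpha) * lam2

# Illustrative check: T ~ Exp(1), so Lambda(t) = t; C ~ Exp(1) gives 50% censoring.
rng = np.random.default_rng(1)
n, rho = 20000, 0.5
T = rng.exponential(1.0, n)
C = rng.exponential(1.0, n)
x = np.minimum(T, C)
delta = (T <= C).astype(float)
xi = rng.binomial(1, rho, n)
t = np.log(2.0)

est1 = cumhaz_type1(x, delta, xi, t, alpha=1.0)  # Lambda_hat_1 alone
est2 = cumhaz_type1(x, delta, xi, t, alpha=0.0)  # Lambda_hat_2 alone
print(est1, est2, t)
```

With a sample this large both pieces should land near $\Lambda(t) = t$; the choice of $\alpha$ then only affects the variance.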
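Theorem 3.1. Let $t_0 < H^{-1}(0)$. Then $n^{1/2}\{\hat\Lambda(\alpha,\cdot) - \Lambda(\cdot)\}$ converges weakly in $D[0,t_0]$ to a zero-mean Gaussian process with covariance function

$$\Gamma_\alpha(t,t') = \frac{\alpha^2}{\rho}\left\{A(t\wedge t') - (1-\rho)\Lambda(t)\Lambda(t')\right\} + \alpha(1-\alpha)\left\{2\Lambda(t)\Lambda(t') + \rho^{-1}\Lambda(t)\Lambda_G(t') + \rho^{-1}\Lambda(t')\Lambda_G(t)\right\}$$
$$\qquad + \frac{(1-\alpha)^2}{1-\rho}\left[A(t\wedge t') + \rho^{-1}A_G(t\wedge t') - \rho\left\{\Lambda(t) + \rho^{-1}\Lambda_G(t)\right\}\left\{\Lambda(t') + \rho^{-1}\Lambda_G(t')\right\}\right]. \tag{3.6}$$

For fixed $t$, $n^{1/2}\{\hat\Lambda(\alpha,t) - \Lambda(t)\} \stackrel{d}{\to} N(0, \Gamma_\alpha(t))$, where

$$\Gamma_\alpha(t) = \frac{\alpha^2}{\rho}\left\{A(t) - (1-\rho)\Lambda^2(t)\right\} + 2\alpha(1-\alpha)\left\{\Lambda^2(t) + \rho^{-1}\Lambda(t)\Lambda_G(t)\right\} + \frac{(1-\alpha)^2}{1-\rho}\left[A(t) + \rho^{-1}A_G(t) - \rho\left\{\Lambda(t) + \rho^{-1}\Lambda_G(t)\right\}^2\right], \tag{3.7}$$

which reaches its minimum when $\alpha$ equals

$$\alpha^* = \frac{\rho\left\{A(t) - \Lambda^2(t)\right\} + A_G(t) - \Lambda_G^2(t) - (1+\rho)\Lambda(t)\Lambda_G(t)}{A(t) - \Lambda^2(t) + A_G(t) - \Lambda_G^2(t) - 2\Lambda(t)\Lambda_G(t)}.$$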
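The minimizer $\alpha^*$ can be evaluated in closed form when $T$ and $C$ are both exponential. The sketch below is an illustration only: it assumes $T \sim \mathrm{Exp}(1)$ and $C$ exponential with rate `lam_c`, so that $\Lambda(t) = t$, $\Lambda_G(t) = \lambda_c t$ and $H(s) = e^{-(1+\lambda_c)s}$, and it compares the formula for $\alpha^*$ with a grid search over (3.7). All function and variable names are hypothetical.

```python
import numpy as np

def exponential_quantities(t, lam_c):
    """Lambda, Lambda_G, A, A_G at time t when T ~ Exp(1) and C ~ Exp(rate lam_c)."""
    lam = t
    lam_g = lam_c * t
    # H(s) = exp(-(1 + lam_c) s), so A(t) = int_0^t ds / H(s).
    a = np.expm1((1.0 + lam_c) * t) / (1.0 + lam_c)
    a_g = lam_c * a
    return lam, lam_g, a, a_g

def gamma_alpha(alpha, rho, lam, lam_g, a, a_g):
    """Asymptotic variance (3.7) of n^{1/2}{Lambda_hat(alpha, t) - Lambda(t)}."""
    term1 = alpha ** 2 / rho * (a - (1.0 - rho) * lam ** 2)
    term2 = 2.0 * alpha * (1.0 - alpha) * (lam ** 2 + lam * lam_g / rho)
    term3 = (1.0 - alpha) ** 2 / (1.0 - rho) * (
        a + a_g / rho - rho * (lam + lam_g / rho) ** 2)
    return term1 + term2 + term3

def alpha_star(rho, lam, lam_g, a, a_g):
    """Closed-form minimizer of (3.7) over alpha."""
    num = rho * (a - lam ** 2) + a_g - lam_g ** 2 - (1.0 + rho) * lam * lam_g
    den = a - lam ** 2 + a_g - lam_g ** 2 - 2.0 * lam * lam_g
    return num / den

t = np.log(2.0)   # median of Exp(1)
lam_c = 0.25      # P(C < T) = lam_c / (1 + lam_c) = 0.2
q = exponential_quantities(t, lam_c)
grid = np.linspace(0.0, 1.0, 10001)
for rho in (0.8, 0.5):
    closed_form = alpha_star(rho, *q)
    grid_min = grid[np.argmin(gamma_alpha(grid, rho, *q))]
    print(rho, closed_form, grid_min)
```

Under these assumptions the closed form gives $\alpha^* = 0.84$ for $\rho = 0.8$ and $\alpha^* = 0.60$ for $\rho = 0.5$ at 20% censoring, which agrees with the Monte Carlo means of $\hat\alpha^*$ reported in Table 3.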
Remarks. (1) If we choose $\alpha_n \stackrel{p}{\to} \alpha$, then $\hat\Lambda(\alpha_n,\cdot)$ has the same asymptotic distribution as $\hat\Lambda(\alpha,\cdot)$. Since $\alpha^*$ can be estimated consistently, the “optimal” estimator of $\Lambda$ can be constructed adaptively. To be specific, $\alpha^*$ may be estimated by

$$\hat\alpha^* = \frac{\hat\rho\left\{\hat A_1(t) - \hat\Lambda_1^2(t)\right\} + \hat A_G(t) - \hat\Lambda_G^2(t) - (1+\hat\rho)\hat\Lambda_1(t)\hat\Lambda_G(t)}{\hat A_1(t) - \hat\Lambda_1^2(t) + \hat A_G(t) - \hat\Lambda_G^2(t) - 2\hat\Lambda_1(t)\hat\Lambda_G(t)},$$

where $\hat A_1(t) = n\int_0^t d\hat\Lambda_1(s)/\sum_{j=1}^n Y_j(s)$, and $\hat\Lambda_G(t)$ and $\hat A_G(t)$ are the obvious analogs of $\hat\Lambda_1(t)$ and $\hat A_1(t)$.

(2) A consistent estimator for $\Gamma_\alpha(t,t')$ may be obtained by replacing $\rho$, $A$, $\Lambda$, $A_G$ and $\Lambda_G$ in (3.6) by $\hat\rho$, $\hat A_1$, $\hat\Lambda_1$, $\hat A_G$ and $\hat\Lambda_G$.

(3) Two special cases deserve extra attention. If $\alpha = 1$, then $\hat\Lambda(\alpha,t)$ reduces to $\hat\Lambda_1(t)$. In that case, the asymptotic variance $\Gamma_1(t) = \rho^{-1}\{A(t) - (1-\rho)\Lambda^2(t)\}$, which agrees with Lo’s result when the exponentiation is taken into account, and
Table 3
Simulation summary statistics for the adaptive estimator $\hat F(\hat\alpha^*, t)$ at $t = F^{-1}(0.5)$ under the exponential model $F(t) = 1 - e^{-t}$

                                                              20% Censoring        50% Censoring        70% Censoring
                                                              ρ = 0.8   ρ = 0.5    ρ = 0.8   ρ = 0.5    ρ = 0.8   ρ = 0.5
Mean of $\hat\alpha^*$                                        0.84      0.60       0.90      0.75       0.94      0.85
Mean of $\hat F(\hat\alpha^*, t)$                             0.497     0.497      0.496     0.495      0.493     0.490
Var of $n^{1/2}\hat F(\hat\alpha^*, t)$                       0.284     0.325      0.419     0.566      0.786     1.161
Mean of $\hat V_{\hat F}(\hat\alpha^*, t)$                    0.283     0.323      0.413     0.547      0.747     1.039
Var of $\hat F_d(t)$ / Var of $\hat F(\hat\alpha^*, t)$       1.21      1.72       1.14      1.36       1.12      NA
Var of $\hat F_B(t)$ / Var of $\hat F(\hat\alpha^*, t)$       1.10      1.33       1.06      1.13       1.06      1.08

NOTE: The censoring time is exponential. The sample size $n = 100$. Each block is based on 10,000 replications. $\hat V_{\hat F}(\hat\alpha^*, t)$ is the variance estimator for $n^{1/2}\hat F(\hat\alpha^*, t)$, which is $\hat F^2(\hat\alpha^*, t)$ multiplied by the estimator for $\Gamma_{\alpha^*}(t,t)$ mentioned in Remark (2) of Theorem 3.1. $\hat F_d(t)$ is the estimator based on complete cases only and $\hat F_B(t)$ is Lo’s second estimator. “Mean” and “Var” refer to the sampling mean and variance. NA indicates that the result for the complete-case estimator is not obtainable.
which is less than $\rho^{-1}A(t)$, the variance of the complete-case estimator. On the other hand, if we let $\alpha = \hat\rho$, then

$$\hat\Lambda(\alpha,t) = \int_0^t \frac{\sum_{i=1}^n \left\{\xi_i\,dN_i^u(s) + (1-\xi_i)\,dN_i(s) - \hat\rho^{-1}(1-\hat\rho)\xi_i\,dN_i^c(s)\right\}}{\sum_{i=1}^n Y_i(s)}$$

with asymptotic variance $\Gamma_\rho(t) = A(t) + \rho^{-1}(1-\rho)\{A_G(t) - \Lambda_G^2(t)\}$. Clearly, $\Gamma_\rho(t) \le \Gamma_1(t)$ if and only if $A_G(t) - \Lambda_G^2(t) \le A(t) - \Lambda^2(t)$. Note that $A_G(t) - \Lambda_G^2(t) = \mathrm{Var}\left\{\int_0^t dN^c(s)/H(s)\right\}$ and $A(t) - \Lambda^2(t) = \mathrm{Var}\left\{\int_0^t dN^u(s)/H(s)\right\}$.

(4) Let $\rho \uparrow 1$, i.e., the proportion of missing $\delta_i$’s shrinks to 0. Then $\alpha^* \to 1$. The resulting estimator is $\hat\Lambda_1(t)$. On the other hand, if the censorship shrinks to 0, which entails $\Lambda_G(t) \to 0$ and $A_G(t) \to 0$, then $\alpha^* \to \rho$, which was the case discussed in the previous remark.

Table 3 displays the main results from our Monte Carlo studies on the adaptive estimator $\hat F(\hat\alpha^*, t) = 1 - e^{-\hat\Lambda(\hat\alpha^*, t)}$. The biases of the adaptive estimator and its variance estimator are small. The efficiency improvements of the adaptive estimator over the complete-case analysis and (to a lesser extent) over estimator (3.1) are impressive, especially for light censoring and substantial missingness.
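Proof of Theorem 3.1. In analogy with the approximations given in Lemma 1, we can show that

$$\hat\Lambda_1(t) - \Lambda(t) = \frac{1}{n\rho}\left\{\sum_{i=1}^n \int_0^t \xi_i\frac{dM_i(s)}{H(s)} + \sum_{i=1}^n (\xi_i-\rho)\int_0^t \frac{Y_i(s)\,d\Lambda(s)}{H(s)} - \sum_{i=1}^n \Lambda(t)(\xi_i-\rho)\right\} + o_p(n^{-1/2}) = L_1(t) + o_p(n^{-1/2}), \text{ say,} \tag{3.8}$$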
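and

$$\hat\Lambda_2(t) - \Lambda(t) = \frac{1}{n(1-\rho)}\left[\sum_{i=1}^n \int_0^t (1-\xi_i)\frac{dM_i(s)}{H(s)} - \sum_{i=1}^n (\xi_i-\rho)\int_0^t \frac{Y_i(s)\,d\Lambda(s)}{H(s)} + \sum_{i=1}^n \Lambda(t)(\xi_i-\rho) - \sum_{i=1}^n \frac{\xi_i-\rho}{\rho}\left\{\int_0^t \frac{dN_i^c(s)}{H(s)} - \Lambda_G(t)\right\}\right] + o_p(n^{-1/2}) = L_2(t) + o_p(n^{-1/2}), \text{ say.} \tag{3.9}$$

Thus to characterize the limiting distribution of $\hat\Lambda(\alpha,\cdot)$, it suffices to derive the covariance functions $E\{L_j(t)L_k(t')\}$ $(j, k = 1, 2)$. Through some tedious, but otherwise routine calculations, we obtain

$$E\{L_1(t)L_1(t')\} = (n\rho)^{-1}\left\{A(t\wedge t') - (1-\rho)\Lambda(t)\Lambda(t')\right\}, \tag{3.10}$$

$$E\{L_1(t)L_2(t')\} = n^{-1}\left\{\Lambda(t)\Lambda(t') + \rho^{-1}\Lambda(t)\Lambda_G(t')\right\}, \tag{3.11}$$

$$E\{L_2(t)L_2(t')\} = \{n(1-\rho)\}^{-1}\left[A(t\wedge t') + \rho^{-1}A_G(t\wedge t') - \rho\left\{\Lambda(t) + \rho^{-1}\Lambda_G(t)\right\}\left\{\Lambda(t') + \rho^{-1}\Lambda_G(t')\right\}\right]. \tag{3.12}$$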
From (3.10)–(3.12), we can evaluate $nE[\{\alpha L_1(t) + (1-\alpha)L_2(t)\}\{\alpha L_1(t') + (1-\alpha)L_2(t')\}]$ to get the desired covariance function.

We now return to the regression model studied in Section 2. Let $\hat\beta$ be as defined by (2.5). To estimate the cumulative baseline hazard function $\Lambda_0$, it is natural to extend the class of estimators given in (3.5). To avoid complicated asymptotic variances, we shall only consider $\alpha = 1$ and $\alpha = \hat\rho$, the two special cases discussed in Remarks (3) and (4) following Theorem 3.1. The two estimators for $\Lambda_0(t)$ are given below:

$$\hat\Lambda_1(\hat\beta,t) = \int_0^t \frac{\sum_{i=1}^n \xi_i\,dN_i^u(s)}{\hat\rho\sum_{i=1}^n Y_i(s)e^{\hat\beta^{\top}Z_i(s)}}, \tag{3.13}$$

$$\hat\Lambda_2(\hat\beta,t) = \int_0^t \frac{\sum_{i=1}^n \left\{\xi_i\,dN_i^u(s) + (1-\xi_i)\,dN_i(s) - \hat\rho^{-1}(1-\hat\rho)\xi_i\,dN_i^c(s)\right\}}{\sum_{i=1}^n Y_i(s)e^{\hat\beta^{\top}Z_i(s)}}. \tag{3.14}$$
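Estimators (3.13) and (3.14) are Breslow-type sums and can be sketched in a few lines of Python. The sketch below assumes time-independent covariates, no tied times, $\hat\rho$ equal to the observed fraction of known indicators, and a value of $\beta$ supplied by the caller; in practice that value would be the solution $\hat\beta$ of (2.5), which is not computed here. The function name and the simulation settings are illustrative.

```python
import numpy as np

def baseline_cumhaz(x, delta, xi, z, beta, t):
    """Return (Lambda_1_hat, Lambda_2_hat) of (3.13)-(3.14) at time t.

    x     : observed times; delta : failure indicators (used only where xi == 1)
    xi    : 1 if delta_i is observed, 0 if missing
    z     : covariates, shape (n,) or (n, p), assumed time-independent
    beta  : regression coefficient(s), supplied by the caller
    """
    x = np.asarray(x, dtype=float)
    xi = np.asarray(xi, dtype=float)
    delta = np.where(xi == 1, np.asarray(delta, dtype=float), 0.0)
    z = np.asarray(z, dtype=float).reshape(x.size, -1)
    rho_hat = xi.mean()  # assumed definition of rho_hat

    # Risk-set sums sum_j Y_j(X_i) exp(beta' Z_j), via a reverse cumulative
    # sum over the observations sorted by time.
    order = np.argsort(x)
    w = np.exp(z @ np.atleast_1d(beta))
    s0 = np.empty_like(x)
    s0[order] = np.cumsum(w[order][::-1])[::-1]
    jump = (x <= t) / s0

    lam1 = np.sum(xi * delta * jump) / rho_hat
    weights = (xi * delta + (1.0 - xi)
               - (1.0 - rho_hat) / rho_hat * xi * (1.0 - delta))
    lam2 = np.sum(weights * jump)
    return lam1, lam2

# Illustrative check with lambda(t | Z) = exp(0.5 Z), so Lambda_0(t) = t.
rng = np.random.default_rng(2)
n, rho, beta0 = 20000, 0.5, 0.5
z = rng.normal(size=n)
T = rng.exponential(1.0, n) / np.exp(beta0 * z)
C = rng.exponential(1.0, n)
x = np.minimum(T, C)
delta = (T <= C).astype(float)
xi = rng.binomial(1, rho, n)
t = 0.5

# The true beta0 is plugged in here purely for illustration.
lam1, lam2 = baseline_cumhaz(x, delta, xi, z, beta0, t)
print(lam1, lam2, t)
```

Sorting once and accumulating the risk-set sums in reverse keeps the cost at $O(n \log n)$, which matters if the estimators are re-evaluated inside a simulation loop.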
Theorem 3.2. Suppose that the assumptions of Theorem 2.3 are satisfied. Let $t_0 > 0$ be any number such that $EY_1(t_0) > 0$.

(i) The process $n^{1/2}\{\hat\Lambda_1(\hat\beta,\cdot) - \Lambda_0(\cdot)\}$ converges weakly in $D[0,t_0]$ to a zero-mean Gaussian process with covariance function

$$\tilde\Gamma_1(t,t') = \rho^{-1}\int_0^{t\wedge t'}\frac{d\Lambda_0(s)}{H_Z(s)} - \rho^{-1}(1-\rho)\Lambda_0(t)\Lambda_0(t') + a^{\top}(t)\Sigma(D)a(t') - \rho^{-1}(1-\rho)\left[a^{\top}(t)\Omega E\{N_1^{CZ}(\infty)\}\Lambda_0(t') + a^{\top}(t')\Omega E\{N_1^{CZ}(\infty)\}\Lambda_0(t)\right], \tag{3.15}$$

where $H_Z(s) = E\{Y_1(s)e^{\beta_0^{\top}Z_1(s)}\}$, $a(t) = \int_0^t \bar z(\beta_0,s)\,d\Lambda_0(s)$ and $\Omega = \{\rho V + (1-\rho)DV\}^{-1}D$. For fixed $t$, $n^{1/2}\{\hat\Lambda_1(\hat\beta,t) - \Lambda_0(t)\} \stackrel{d}{\to} N(0, \tilde\Gamma_1(t))$, where

$$\tilde\Gamma_1(t) = \rho^{-1}\int_0^t \frac{d\Lambda_0(s)}{H_Z(s)} - \rho^{-1}(1-\rho)\Lambda_0^2(t) + a^{\top}(t)\Sigma(D)a(t) - 2\rho^{-1}(1-\rho)a^{\top}(t)\Omega E\{N_1^{CZ}(\infty)\}\Lambda_0(t). \tag{3.16}$$
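(ii) The process $n^{1/2}\{\hat\Lambda_2(\hat\beta,\cdot) - \Lambda_0(\cdot)\}$ converges weakly to a zero-mean Gaussian process with covariance function

$$\tilde\Gamma_2(t,t') = \int_0^{t\wedge t'}\frac{d\Lambda_0(s)}{H_Z(s)} + \rho^{-1}(1-\rho)\mathrm{Cov}\left\{N_1^{CH}(t), N_1^{CH}(t')\right\} + a^{\top}(t)\Sigma(D)a(t')$$
$$\qquad - \rho^{-1}(1-\rho)\left(a^{\top}(t)\Omega E\left[N_1^{CZ}(\infty)\left\{N_1^{CH}(t') - EN_1^{CH}(t')\right\}\right] + a^{\top}(t')\Omega E\left[N_1^{CZ}(\infty)\left\{N_1^{CH}(t) - EN_1^{CH}(t)\right\}\right]\right), \tag{3.17}$$

where $N_i^{CH}(t) = \int_0^t dN_i^c(s)/H_Z(s)$ $(i = 1, \ldots, n)$. For fixed $t$, $n^{1/2}\{\hat\Lambda_2(\hat\beta,t) - \Lambda_0(t)\} \stackrel{d}{\to} N(0, \tilde\Gamma_2(t))$, where

$$\tilde\Gamma_2(t) = \int_0^t \frac{d\Lambda_0(s)}{H_Z(s)} + \rho^{-1}(1-\rho)\mathrm{Var}\left\{N_1^{CH}(t)\right\} + a^{\top}(t)\Sigma(D)a(t) - 2\rho^{-1}(1-\rho)a^{\top}(t)\Omega E\left[N_1^{CZ}(\infty)\left\{N_1^{CH}(t) - EN_1^{CH}(t)\right\}\right]. \tag{3.18}$$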
Remarks. (1) Consistent estimators for the variances $\tilde\Gamma_1(t)$ and $\tilde\Gamma_2(t)$ may be obtained in a straightforward manner. For example, let $\hat a(t) = \int_0^t \bar Z(\hat\beta,s)\,d\hat\Lambda_1(\hat\beta,s)$ and $\hat\Omega = \{\hat\rho\hat V + (1-\hat\rho)D\hat V\}^{-1}D$. Then a consistent estimator for $\tilde\Gamma_1(t)$ is

$$n\hat\rho^{-1}\int_0^t \frac{d\hat\Lambda_1(\hat\beta,s)}{\sum_{i=1}^n Y_i(s)e^{\hat\beta^{\top}Z_i(s)}} - \hat\rho^{-1}(1-\hat\rho)\hat\Lambda_1^2(\hat\beta,t) + \hat a^{\top}(t)\hat\Sigma(D)\hat a(t) - 2\hat\rho^{-1}(1-\hat\rho)\hat a^{\top}(t)\hat\Omega\left[\frac{1}{\sum_{i=1}^n \xi_i}\sum_{i=1}^n \int_0^\infty \{Z_i(s) - \bar Z(\hat\beta,s)\}\xi_i\,dN_i^c(s)\right]\hat\Lambda_1(\hat\beta,t),$$

where $\hat\Sigma(D)$ is the consistent estimator given in Remark (2) following Theorem 2.3.

(2) If $D = 0$, then the last term on the right hand side of (3.16) disappears and the sum of the first and the third terms becomes the variance of the complete-case estimator. Thus, the use of $\hat\Lambda_1(\tilde\beta,t)$ reduces the variance by $\rho^{-1}(1-\rho)\Lambda_0^2(t)$.

Proof of Theorem 3.2. Taking the Taylor expansions at $\beta_0$, we get, for $l = 1, 2$,

$$\hat\Lambda_l(\hat\beta,t) = \hat\Lambda_l(\beta_0,t) - \int_0^t \bar Z^{\top}(\beta_0,s)\,d\hat\Lambda_l(\beta_0,s)\,(\hat\beta - \beta_0) + o_p(n^{-1/2}). \tag{3.19}$$
By the approximations given in the proofs of Theorems 2.1 and 2.2, we can express $\hat\beta - \beta_0$ approximately as a sum of $n$ i.i.d. random vectors. Furthermore, similar to (3.8),

$$\hat\Lambda_1(\beta_0,t) - \Lambda_0(t) = (n\rho)^{-1}\sum_{i=1}^n \int_0^t \frac{\xi_i\,dM_i(s)}{H_Z(s)} + (n\rho)^{-1}\sum_{i=1}^n \int_0^t \frac{(\xi_i-\rho)\tilde Y_i(s)}{H_Z(s)}\lambda_0(s)\,ds - (n\rho)^{-1}\Lambda_0(t)\sum_{i=1}^n (\xi_i-\rho) + o_p(n^{-1/2})$$
$$= J_1(t) + J_2(t) + J_3(t) + o_p(n^{-1/2}), \text{ say,} \tag{3.20}$$
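where $\tilde Y_i(s) = Y_i(s)e^{\beta_0^{\top}Z_i(s)}$. Let $\tilde S_k = \tilde S_{k1}(\infty) + \tilde S_{k2}(\infty)$ $(k = 1, 2)$, where $\tilde S_{kj}$ are defined in the proofs of Theorems 2.1 and 2.2. Then $E\{\tilde S_1 J_3(t)\} = 0$ and

$$E\{\tilde S_1 J_1(t)\} = (1-\rho)E\left\{\int_0^\infty W_1(s)\tilde Y_1(s)\,d\Lambda_0(s)\int_0^t \frac{dM_1(u)}{H_Z(u)}\right\}$$
$$= -(1-\rho)E\left\{\int_0^\infty\int_0^{t\wedge s} 1_{(u\le s)}W_1(s)\tilde Y_1(s)H_Z^{-1}(u)\tilde Y_1(u)\,d\Lambda_0(s)\,d\Lambda_0(u)\right\}.$$

Moreover, we can show that $E\{\tilde S_1 J_2(t)\} = -E\{\tilde S_1 J_1(t)\}$. Therefore

$$E\left[\tilde S_1\{J_1(t) + J_2(t) + J_3(t)\}\right] = 0. \tag{3.21}$$

Likewise, we can show that $E[\tilde S_{21}(\infty)\{J_1(t) + J_2(t) + J_3(t)\}] = 0$. Thus

$$E\left[\tilde S_2\{J_1(t) + J_2(t) + J_3(t)\}\right] = E\left[\tilde S_{22}(\infty)\{J_1(t) + J_2(t) + J_3(t)\}\right]$$
$$= -\rho^{-1}(1-\rho)E\left[\left\{N_1^{CZ}(\infty) - EN_1^{CZ}(\infty)\right\}\int_0^t \frac{dN_1^u(s)}{H_Z(s)}\right] = \rho^{-1}(1-\rho)EN_1^{CZ}(\infty)\Lambda_0(t). \tag{3.22}$$

From (3.21) and (3.22),

$$E\left[(\tilde S_1 + D\tilde S_2)\{J_1(t) + J_2(t) + J_3(t)\}\right] = \rho^{-1}(1-\rho)D\,E\{N_1^{CZ}(\infty)\}\Lambda_0(t). \tag{3.23}$$

It is also not difficult to show that

$$E\left[\{J_1(t) + J_2(t) + J_3(t)\}\{J_1(t') + J_2(t') + J_3(t')\}\right] = \frac{1}{n\rho}\int_0^{t\wedge t'}\frac{d\Lambda_0(s)}{H_Z(s)} - \frac{1-\rho}{n\rho}\Lambda_0(t)\Lambda_0(t'),$$

which, combined with (3.19), (3.20) and (3.23), yields the desired covariance function $\tilde\Gamma_1$.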
For (ii), first note that

$$\hat\Lambda_2(\beta_0,t) - \Lambda_0(t) = n^{-1}\sum_{i=1}^n \int_0^t \frac{dM_i(s)}{H_Z(s)} - (n\rho)^{-1}\sum_{i=1}^n \left\{N_i^{CH}(t) - EN_i^{CH}(t)\right\}(\xi_i-\rho) + o_p(n^{-1/2}). \tag{3.24}$$

From (3.24), we can show that $\hat\Lambda_2(\beta_0,t) - \Lambda_0(t)$ is asymptotically uncorrelated with $\tilde S_1$ and $\tilde S_{21}$. The desired covariance formula (3.17) then follows by evaluating the asymptotic covariance between $\hat\Lambda_2(\beta_0,t) - \Lambda_0(t)$ and $\tilde S_{22}$. The details are omitted.

4. Cox regression and cumulative hazard function estimation under Type II missingness

We now describe in detail the problem of Type II missingness mentioned in Section 1 using a slightly different notation. Let $(T_i^{(1)}, T_i^{(2)}, C_i, Z_i)$ $(i = 1, \ldots, n)$ be i.i.d. random vectors, where $T_i^{(1)}$ and $T_i^{(2)}$ denote two types of latent failure times, of which the first is of interest, and $C_i$ and $Z_i$ denote the censoring time and covariate vector as before. Suppose that, conditional on $Z_i$, the failure time $T_i^{(1)}$ is independent of $T_i^{(2)}$ and $C_i$, and has the hazard rate $\lambda(t \mid Z_i) = e^{\beta_0^{\top}Z_i}\lambda_0(t)$. Define $T_i = T_i^{(1)} \wedge T_i^{(2)}$, $\phi_i = 1_{(T_i^{(1)} \le T_i^{(2)})}$, $X_i = T_i \wedge C_i$ and $\delta_i = 1_{(T_i \le C_i)}$. Note that $\phi_i\delta_i$ indicates, by the value 1 vs. 0, whether or not the observation time $X_i$ is the failure time of interest $T_i^{(1)}$. In the standard competing risk setup, one observes $(X_i, \delta_i, \phi_i\delta_i, Z_i)$ for every $i$. With incomplete measurements on the failure types, however, the data consist of $(X_i, \delta_i, \xi_i, \xi_i\phi_i\delta_i, Z_i)$ $(i = 1, \ldots, n)$, where $\xi_i$ indicates, by the value 1 vs. 0, whether $\phi_i$ is known or unknown. We assume that $\xi_i$ is independent of all other variables with $P(\xi_i = 1 \mid X_i, \delta_i, \phi_i, Z_i) = \tau$. This has the same level of generality as assuming $P(\xi_i = 1 \mid X_i, \delta_i = 1, \phi_i, Z_i) = \tau$, since for $\delta_i = 0$ the value $\xi_i$ does not have any effect on the observations and can therefore be redefined to make the independence true.

We define $N_i^u$, $Y_i$, $\bar Z$ and $\bar z$ as in Section 2. In the absence of missing values, the partial likelihood score function for $\beta_0$ is

$$S^{\phi}(\beta) = \sum_{i=1}^n \int_0^\infty \{Z_i(t) - \bar Z(\beta,t)\}\phi_i\,dN_i^u(t).$$
By deleting all the cases with $\{\delta_i = 1, \xi_i = 0\}$, the complete-case estimating function is

$$S_d^{\phi}(\beta) = \sum_{i=1}^n \int_0^\infty \left[Z_i(t) - \frac{\sum_{j=1}^n \{\delta_j\xi_j + (1-\delta_j)\}Y_j(t)e^{\beta^{\top}Z_j(t)}Z_j(t)}{\sum_{j=1}^n \{\delta_j\xi_j + (1-\delta_j)\}Y_j(t)e^{\beta^{\top}Z_j(t)}}\right]\xi_i\phi_i\,dN_i^u(t).$$

Because the index set $\{j : (\delta_j\xi_j + (1-\delta_j))Y_j(t) = 1\}$ is not a random subset of the risk set $\{j : Y_j(t) = 1\}$, the complete-case analysis does not yield a consistent estimator for $\beta_0$. We shall use the ideas presented in Section 2 to estimate $\beta_0$ under Type II missingness.
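The analogs of estimating functions $S_1(\beta)$ and $S_2(\beta)$ studied in Section 2 are

$$S_1^{\phi}(\beta) = \sum_{i=1}^n \int_0^\infty \{Z_i(t) - \bar Z(\beta,t)\}\xi_i\phi_i\,dN_i^u(t),$$

$$S_2^{\phi}(\beta) = \sum_{i=1}^n \int_0^\infty \{Z_i(t) - \bar Z(\beta,t)\}\left\{(1-\xi_i) - \hat\tau^{-1}(1-\hat\tau)\xi_i(1-\phi_i)\right\}dN_i^u(t),$$

where $\hat\tau = \sum_{i=1}^n \delta_i\xi_i/\sum_{i=1}^n \delta_i$. We have the following results for $S_k^{\phi}(\beta_0)$ $(k = 1, 2)$, which are similar to those of $S_k(\beta_0)$ $(k = 1, 2)$ given in Theorems 2.1 and 2.2.

Theorem 4.1. The random vector $n^{-1/2}\{S_1^{\phi}(\beta_0)^{\top}, S_2^{\phi}(\beta_0)^{\top}\}^{\top}$ is asymptotically zero-mean normal with covariance matrix

$$\begin{pmatrix} V_1^{\phi} & 0 \\ 0 & V_2^{\phi} \end{pmatrix},$$

where $V_1^{\phi} = \tau V^{\phi}$, $V_2^{\phi} = (1-\tau)V^{\phi} + \tau^{-1}(1-\tau)E(N_1^{\phi} - EN_1^{\phi})^{\otimes 2}$, $N_i^{\phi} = \int_0^\infty \{Z_i(t) - \bar z(\beta_0,t)\}(1-\phi_i)\,dN_i^u(t)$ and $V^{\phi} = E\left[\int_0^\infty \{Z_1(t) - \bar z(\beta_0,t)\}^{\otimes 2}\phi_1\,dN_1^u(t)\right]$.

In analogy with (2.5) for $\hat\beta$, we define $\hat\beta^{\phi}$ as a solution to $S_1^{\phi}(\beta) + DS_2^{\phi}(\beta) = 0$, where $D$ is a given $p \times p$ matrix. Then the following theorem similar to Theorem 2.3 holds.

Theorem 4.2. Suppose that $\tau V^{\phi} + (1-\tau)DV^{\phi}$ is nonsingular. Then $n^{1/2}(\hat\beta^{\phi} - \beta_0) \stackrel{d}{\to} N(0, \Sigma^{\phi}(D))$, where

$$\Sigma^{\phi}(D) = \left\{\tau V^{\phi} + (1-\tau)DV^{\phi}\right\}^{-1}\left(\tau V^{\phi} + DV_2^{\phi}D^{\top}\right)\left\{\tau V^{\phi} + (1-\tau)V^{\phi}D^{\top}\right\}^{-1}.$$

The optimal choice for $D$ is $D^* = (1-\tau)V^{\phi}(V_2^{\phi})^{-1}$, in which case

$$\Sigma^{\phi}(D^*) = \left\{\tau V^{\phi} + (1-\tau)^2 V^{\phi}(V_2^{\phi})^{-1}V^{\phi}\right\}^{-1}.$$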
Proof of Theorem 4.1. As in the proofs of Theorems 2.1 and 2.2, we can define the martingales $M_i^{\phi}(t) = \phi_i N_i^u(t) - \int_0^t Y_i(s)e^{\beta_0^{\top}Z_i(s)}\lambda_0(s)\,ds$ $(i = 1, \ldots, n)$ and derive the following key approximations:

$$S_1^{\phi}(\beta_0) = \sum_{i=1}^n \int_0^\infty \{Z_i(t) - \bar z(\beta_0,t)\}\xi_i\,dM_i^{\phi}(t) + \sum_{i=1}^n \int_0^\infty \{Z_i(t) - \bar z(\beta_0,t)\}(\xi_i-\tau)Y_i(t)e^{\beta_0^{\top}Z_i(t)}\lambda_0(t)\,dt + o_p(n^{1/2}),$$

$$S_2^{\phi}(\beta_0) = \sum_{i=1}^n \int_0^\infty \{Z_i(t) - \bar z(\beta_0,t)\}(1-\xi_i)\,dM_i^{\phi}(t) - \sum_{i=1}^n \int_0^\infty \{Z_i(t) - \bar z(\beta_0,t)\}(\xi_i-\tau)Y_i(t)e^{\beta_0^{\top}Z_i(t)}\lambda_0(t)\,dt - \tau^{-1}\sum_{i=1}^n (N_i^{\phi} - EN_i^{\phi})(\xi_i-\tau) + o_p(n^{1/2}).$$
These two approximations can be used to show, through some tedious calculations, that the asymptotic variance-covariance matrix of $n^{-1/2}\{S_1^{\phi}(\beta_0)^{\top}, S_2^{\phi}(\beta_0)^{\top}\}^{\top}$ is $\mathrm{Diag}(V_1^{\phi}, V_2^{\phi})$. Hence, the theorem follows from the multivariate central limit theorem.

Proof of Theorem 4.2. This can be done by applying Theorem 4.1 and the arguments given in the proof of Theorem 2.3.

Using $\hat\beta^{\phi}$ with its asymptotic distribution given by Theorem 4.2, we can construct consistent estimators for the cumulative baseline hazard function $\Lambda_0(t)$. Two such estimators, which correspond to $\hat\Lambda_1(\hat\beta,t)$ and $\hat\Lambda_2(\hat\beta,t)$ defined by (3.13) and (3.14), are

$$\hat\Lambda_1^{\phi}(\hat\beta^{\phi},t) = \int_0^t \frac{\sum_{i=1}^n \xi_i\phi_i\,dN_i^u(s)}{\hat\tau\sum_{i=1}^n Y_i(s)e^{\hat\beta^{\phi\top}Z_i(s)}},$$

$$\hat\Lambda_2^{\phi}(\hat\beta^{\phi},t) = \int_0^t \frac{\sum_{i=1}^n \left\{\xi_i\phi_i + (1-\xi_i) - \hat\tau^{-1}(1-\hat\tau)\xi_i(1-\phi_i)\right\}dN_i^u(s)}{\sum_{i=1}^n Y_i(s)e^{\hat\beta^{\phi\top}Z_i(s)}}.$$

The kind of asymptotic properties given in Theorem 3.2 for $\hat\Lambda_k(\hat\beta,t)$ $(k = 1, 2)$ also hold for $\hat\Lambda_k^{\phi}(\hat\beta^{\phi},t)$ $(k = 1, 2)$. They can be derived from the arguments used in the proof of Theorem 3.2. To simplify the statements, we shall only present the asymptotic normality part although the weak convergence also holds.
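Theorem 4.3. Suppose that the assumptions of Theorem 4.2 hold and that $t$ satisfies $EY_1(t) > 0$. Then $n^{1/2}\{\hat\Lambda_k^{\phi}(\hat\beta^{\phi},t) - \Lambda_0(t)\} \stackrel{d}{\to} N(0, \sigma_{\phi,k}^2(t))$ $(k = 1, 2)$, where, with $H_Z(t)$ and $a(t)$ as defined in Theorem 3.2, $\Omega^{\phi} = \{\tau V^{\phi} + (1-\tau)DV^{\phi}\}^{-1}D$ and $N_i^{\phi H}(t) = \int_0^t (1-\phi_i)\,dN_i^u(s)/H_Z(s)$ $(i = 1, \ldots, n)$,

$$\sigma_{\phi,1}^2(t) = \tau^{-1}\int_0^t \frac{d\Lambda_0(s)}{H_Z(s)} - \tau^{-1}(1-\tau)\Lambda_0^2(t) + a^{\top}(t)\Sigma^{\phi}(D)a(t) - 2\tau^{-1}(1-\tau)a^{\top}(t)\Omega^{\phi}EN_1^{\phi}\Lambda_0(t),$$

$$\sigma_{\phi,2}^2(t) = \int_0^t \frac{d\Lambda_0(s)}{H_Z(s)} + \tau^{-1}(1-\tau)\mathrm{Var}\left\{N_1^{\phi H}(t)\right\} + a^{\top}(t)\Sigma^{\phi}(D)a(t) - 2\tau^{-1}(1-\tau)a^{\top}(t)\Omega^{\phi}E\left[N_1^{\phi}\left\{N_1^{\phi H}(t) - EN_1^{\phi H}(t)\right\}\right].$$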
For the one-sample case, where the data consist of i.i.d. random vectors $(X_i, \delta_i, \xi_i, \xi_i\phi_i\delta_i)$ $(i = 1, \ldots, n)$, we modify (3.2), (3.4) and (3.5) to obtain the following class of consistent estimators for the cumulative hazard function of $T_i^{(1)}$:

$$\hat\Lambda^{\phi}(\alpha,t) = \alpha\int_0^t \frac{\sum_{i=1}^n \xi_i\phi_i\,dN_i^u(s)}{\hat\tau\sum_{i=1}^n Y_i(s)} + (1-\alpha)\int_0^t \frac{\sum_{i=1}^n \left\{1 - \xi_i - \hat\tau^{-1}(1-\hat\tau)\xi_i(1-\phi_i)\right\}dN_i^u(s)}{(1-\hat\tau)\sum_{i=1}^n Y_i(s)}.$$
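The one-sample Type II estimator can be sketched in the same way as its Type I counterpart. The Python code below is an illustration under stated assumptions: failure types are simulated from independent exponential latent times, there are no tied times, and $\hat\tau = \sum_i \delta_i\xi_i/\sum_i \delta_i$ as defined above. The function name and the simulation settings are hypothetical.

```python
import numpy as np

def cumhaz_type2(x, delta, xi, phi, t, alpha):
    """One-sample estimator Lambda_hat^phi(alpha, t) for the cause of interest.

    x     : observed times X_i = min(T_i^(1), T_i^(2), C_i)
    delta : 1 if a failure (of either type) was observed, 0 if censored
    xi    : 1 if the failure type phi_i is known, 0 if it is missing
    phi   : 1 if the failure is of type 1 (used only where delta == xi == 1)
    """
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=float)
    xi = np.asarray(xi, dtype=float)
    known = (delta == 1) & (xi == 1)
    phi = np.where(known, np.asarray(phi, dtype=float), 0.0)
    n = x.size
    tau_hat = np.sum(delta * xi) / np.sum(delta)

    at_risk = n - np.searchsorted(np.sort(x), x, side="left")
    # dN_i^u jumps only at uncensored observation times.
    jump = delta * (x <= t) / at_risk

    first = np.sum(xi * phi * jump) / tau_hat
    weights = (1.0 - xi) - (1.0 - tau_hat) / tau_hat * xi * (1.0 - phi)
    second = np.sum(weights * jump) / (1.0 - tau_hat)
    return alpha * first + (1.0 - alpha) * second

# Illustrative check: T1 ~ Exp(1), so the cumulative hazard of interest is t.
rng = np.random.default_rng(3)
n, tau = 20000, 0.6
T1 = rng.exponential(1.0, n)
T2 = rng.exponential(2.0, n)   # competing failure time with hazard 0.5
C = rng.exponential(2.0, n)    # censoring time with hazard 0.5
T = np.minimum(T1, T2)
x = np.minimum(T, C)
delta = (T <= C).astype(float)
phi = (T1 <= T2).astype(float)
xi = rng.binomial(1, tau, n)
t = 0.5

est_a = cumhaz_type2(x, delta, xi, phi, t, alpha=1.0)
est_b = cumhaz_type2(x, delta, xi, phi, t, alpha=0.0)
print(est_a, est_b, t)
```

Both components should be close to $t$ here; combining them with a weight $\alpha$ changes the variance but not the target.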
The arguments given in the proof of Theorem 3.1 can be used to verify the following asymptotic normality for $\hat\Lambda^{\phi}(\alpha,t)$.
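Theorem 4.4. For $t$ satisfying $EY_1(t) > 0$, $n^{1/2}\{\hat\Lambda^{\phi}(\alpha,t) - \Lambda(t)\} \stackrel{d}{\to} N(0, \sigma_t^2(\alpha))$, where

$$\sigma_t^2(\alpha) = \frac{\alpha^2}{\tau}\left\{A(t) - (1-\tau)\Lambda^2(t)\right\} + 2\alpha(1-\alpha)\left\{\Lambda^2(t) + \tau^{-1}\Lambda(t)\Lambda_Q(t)\right\} + \frac{(1-\alpha)^2}{1-\tau}\left[A(t) + \tau^{-1}A_Q(t) - \tau\left\{\Lambda(t) + \tau^{-1}\Lambda_Q(t)\right\}^2\right],$$

where $A(t) = \int_0^t d\Lambda(s)/EY_1(s)$, $A_Q(t) = \int_0^t d\Lambda_Q(s)/EY_1(s)$ and $\Lambda_Q$ is the cumulative hazard function of $T_i^{(2)}$. In particular,

$$\sigma_t^2(1) = \tau^{-1}\left\{A(t) - (1-\tau)\Lambda^2(t)\right\},$$

$$\sigma_t^2(\tau) = A(t) + \tau^{-1}(1-\tau)\left\{A_Q(t) - \Lambda_Q^2(t)\right\}.$$

The variance $\sigma_t^2(\alpha)$ is minimized when $\alpha$ equals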
$$\alpha^* = \frac{\tau\left\{A(t) - \Lambda^2(t)\right\} + A_Q(t) - \Lambda_Q^2(t) - (1+\tau)\Lambda(t)\Lambda_Q(t)}{A(t) - \Lambda^2(t) + A_Q(t) - \Lambda_Q^2(t) - 2\Lambda(t)\Lambda_Q(t)}.$$

5. Discussions

We did not provide all the details for Type II missingness in Section 4 because of the similarity with Type I missingness. It should be noted that consistent estimators for the variance quantities such as $\Sigma^{\phi}(D)$, $\sigma_{\phi,k}^2(t)$ $(k = 1, 2)$ and $\sigma_t^2(\alpha)$ can be obtained in the same manner as their counterparts in Sections 2 and 3. Furthermore, the asymptotic approximations under Type II missingness are expected to have similar degrees of accuracy in finite samples as those of Type I missingness.

We have made the missing completely at random assumption in our developments. This assumption consists of two parts, the first part being the independence between $\xi_i$ and $(X_i, \delta_i, \phi_i, Z_i)$ for every $i$ and the second being the i.i.d. property of $\xi_i$ $(i = 1, \ldots, n)$. The first part of the assumption cannot be avoided without directly modeling the missingness process. The second part can be relaxed to the extent that a consistent estimator for $P(\xi_i = 1)$ is available for every $i$. For example, in a multi-institutional study, it may be reasonable to assume that the missing probabilities are constant within the same institution but vary among different institutions. In this case, we may stratify our data on the institutions and modify the methods described in the previous sections to incorporate the stratification factor.

In many applications, the measurements on the covariate vectors are incomplete. [9] and subsequent papers provide solutions to this problem. It is possible to combine the techniques developed in this paper with those of [9] to handle the situation where both the covariates and the failure indicators are partially measured. The details will not be presented here.

References

[1] Andersen, P. K. and Gill, R. D. (1982). Cox’s regression model for counting processes: a large sample study. Ann. Statist. 10 1100–1120.
[2] Breslow, N. E. (1972). Contribution to the discussion on the paper of D. R. Cox cited below. J. Roy. Statist. Soc. Ser. B 34 216–217.
[3] Breslow, N. and Crowley, J. (1974). A large sample study of the life table and product limit estimates under random censorship. Ann. Statist. 2 437–453.
[4] Cox, D. R. (1972). Regression models and life-tables (with discussion). J. Roy. Statist. Soc. Ser. B 34 187–220.
[5] Dinse, G. E. (1982). Nonparametric estimation for partially-complete time and type of failure data. Biometrics 38 417–431.
[6] Gill, R. (1983). Large sample behaviour of the product-limit estimator on the whole line. Ann. Statist. 11 49–58.
[7] Gijbels, I., Lin, D. Y. and Ying, Z. (1993). Non- and semi-parametric analysis of failure time data with missing failure indicators. Technical Report 039-93, Mathematical Sciences Research Institute, Berkeley.
[8] Goetghebeur, E. and Ryan, L. (1990). A modified log rank test for competing risks with missing failure type. Biometrika 77 207–211.
[9] Lin, D. Y. and Ying, Z. (1993). Cox regression with incomplete covariate measurements. J. Amer. Statist. Assoc. 88 1341–1349.
[10] Lo, S.-H. (1991). Estimating a survival function with incomplete cause-of-death data. J. Multivariate Anal. 39 217–235.
[11] McKeague, I. W. and Subramanian, S. (1998). Product-limit estimators and Cox regression with missing censoring information. Scand. J. Statist. 25 589–601.
[12] Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Applications to Statistics. Wiley, New York.
[13] Subramanian, S. (2000). Efficient estimation of regression coefficients and baseline hazard under proportionality of conditional hazards. J. Statist. Plann. Inference 84 81–94.
[14] Tsiatis, A. A. (1981). A large sample study of Cox’s regression model. Ann. Statist. 9 93–108.
[15] van der Laan, M. J. and McKeague, I. W. (1998). Efficient estimation from right-censored data when failure indicators are missing at random. Ann. Statist. 26 164–182.
[16] Wichmann, B. A. and Hill, I. D. (1982). An efficient and portable pseudorandom number generator. Appl. Statist. 31 188–190.
[17] Zhou, X. and Sun, L. (2003). Additive hazards regression with missing censoring information. Statist. Sinica 13 1237–1257.
IMS Lecture Notes–Monograph Series
Complex Datasets and Inverse Problems: Tomography, Networks and Beyond
Vol. 54 (2007) 224–238
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000175
Nonparametric estimation of a distribution function under biased sampling and censoring

Micha Mandel1,∗

The Hebrew University of Jerusalem

Abstract: This paper derives the nonparametric maximum likelihood estimator (NPMLE) of a distribution function from observations which are subject to both bias and censoring. The NPMLE is obtained by a simple EM algorithm which is an extension of the algorithm suggested by Vardi (Biometrika, 1989) for size-biased data. Application of the algorithm to many models is discussed and a simulation study compares the estimator’s performance to that of the product-limit estimator (PLE). An example demonstrates the utility of the NPMLE to data where the PLE is inappropriate.
1. Introduction In this paper, the EM algorithm [4] of Vardi [18] is extended from size-biased to W -biased observations, where W is a known, positive, increasing and right continuous function. Specifically, the algorithm finds the distribution function G that maximizes (1.1)
m dG(xi )
i=1
µ∗
n ¯ G(yj ) × , µ∗ j=1
∞ ¯ = 1 − G and x1 , . . . , xm and y1 , . . . , yn are given where µ∗ = 0 W (x)dG(x), G data points. The function (1.1) is proportional to likelihoods that arise in reliability and survival studies when data are subject to both bias and censoring. Several authors derive the nonparametric maximum likelihood estimator (NPMLE) of G in problems where the likelihood is a special case of (1.1). The size-biased case, W (x) = x, appears in cross-sectional sampling if the population is in steady state. An early work is Cox [3] who estimates G in the uncensored case (i.e., n = 0). Vardi [18] presents four problems that result in likelihood proportional to (1.1) with W (x) = x and develops a simple EM algorithm to find the NPMLE of G. In an earlier paper [16], he develops an EM algorithm to estimate the underlying lifetime distribution of a renewal process observed in a time window (see also [15]). Wijers [22] suggests the EM algorithm for the same window sampling in a population model under 1 Department of Statistics, The Hebrew University of Jerusalem, Mount Scopus, Jerusalem 91905, Israel, e-mail: [email protected] ∗ This paper is part of my PhD dissertation that was completed under the supervision of Yosef Rinott and Yehuda Vardi. I am grateful for their excellent guidance and many suggestions that improved this paper considerably. A reviewer provided helpful comments and suggestions. Research supported by the Israel Science Foundation (grant No. 473/04). AMS 2000 subject classification: 62N01. Keywords and phrases: cross-sectional sampling, EM algorithm, Lexis diagram, multiplicative censoring, truncated data.
the steady state assumptions. When the renewals or birth times are known for all sampled individuals, the likelihood is proportional to (1.1) with W (x) = x + C where C is the window width. Motivated by data on HIV infection, Kalbfleisch and Lawless [7] consider a population model in which entrances are according to an inhomogeneous Poisson process. Their likelihood is similar to the uncensored part of (1.1) with W being the cumulative rate of the process. They briefly remark on the censored case at the end of their paper. The random left truncation model is commonly used for survival data [12, 21, 23]. In that model, two independent variables A ∼ W and X ∼ G are truncated to the region A ≤ X. The variable X may be censored by A + C, where C is independent of (A, X). When the truncation distribution is known, the likelihood of the model is proportional to (1.1). More models that give rise to likelihoods proportional to (1.1) are reviewed in Section 3. In Section 2, we suggest a unified EM algorithm that provides the NPMLE for the general likelihood (1.1) and discuss its convergence properties. Two methods for developing the EM algorithm are considered; the first uses an extension of Vardi’s multiplicative censoring model [18], and the second utilizes the random left truncation model. By presenting the two approaches, the similarity between the two seemingly unrelated models is highlighted. In Section 3 we derive the likelihoods of the above mentioned papers and other models and show that they are special cases of (1.1). In Section 4 we compare the performances of estimators with full and no knowledge on W by simulation. The algorithm is then used to reanalyze the Channing House data [6]. We complete the paper with discussion in Section 5. Throughout the paper, we will refer to (1.1) and to similar expressions as “likelihood”, although they may be only proportional to the likelihood of the data. We will refer to the maximizer of (1.1) as the NPMLE of G. 
To distinguish, we will call the maximizer of the likelihood over G when W is not known the product-limit estimator (PLE). The latter is presented in Section 3.4 and is used in the simulation and application for comparison purposes.

2. An EM algorithm

There are several ways to develop the EM algorithm. We first adopt the approach used in [18] and solve a seemingly unrelated multiplicative censoring problem (similar to [18] problem A). This is shown to be almost equivalent to our original problem of maximizing (1.1). The second approach is motivated by cross-sectional sampling where subjects are selected into the sample at a random time point (similar to [18] problem B). It is assumed that ages at sampling are observed for all sampled individuals and residual lifetimes are subject to random censoring. Here the censoring times are independent, and are independent of the ages at sampling and the residual lifetimes. Both approaches eliminate the bias of the data and deal only with censoring. They demonstrate that bias is a secondary problem for estimation relative to censoring (when the bias is known and does not cause identification problems). After developing the algorithm, several convergence properties are discussed.

2.1. Vardi’s approach

Since W(x) > 0 is known, (1.1) is proportional to

\[
\prod_{i=1}^{m} dG^W(x_i) \prod_{j=1}^{n} \int_{y_j}^{\infty} \frac{1}{W(u)}\,dG^W(u), \tag{2.1}
\]
where $G^W(x) = \int_0^x W(u)\,dG(u)/\mu^*$ is the W-weighted version of G. The problem can be divided into two parts: maximization of (2.1) for $G^W$, and transformation using $G(dx) \propto G^W(dx)/W(x)$. To maximize (2.1), consider first the following multiplicative censoring model which extends the problem studied by Vardi [18].

Let $X_1^0, \dots, X_m^0, Z_1^0, \dots, Z_n^0$ be positive random variables from the distribution $G^0$ supported on (0, ∞) and let $U_1, \dots, U_n \sim U(0, 1)$, where all the random variables are independent. Let $Y_i^0 = W(Z_i^0)U_i$ (i = 1, . . . , n). Similar to Vardi who solves the problem for the special case W(x) = x, the $Y_i^0$’s describe the transformation from the ‘complete data’ $(X_1^0, \dots, X_m^0, Z_1^0, \dots, Z_n^0)$ to the observed ‘incomplete data’. This transformation is used in the E-step of the EM algorithm. The statistical problem is of estimating $G^0$ using $(x_1^0, \dots, x_m^0, y_1^0, \dots, y_n^0)$, a realization of $(X_1^0, \dots, X_m^0, Y_1^0, \dots, Y_n^0)$. First note that the density of $Y^0$ with respect to Lebesgue measure is

\[
f_{Y^0}(t) = \int_{v \ge t} \frac{1}{v}\,dF_{W(Z^0)}(v) = \int_{\{v : W(v) \ge t\}} \frac{1}{W(v)}\,dG^0(v),
\]

where $f_V$ and $F_V$ denote the density and distribution of a random variable V. The likelihood of the data $(x^0, y^0) = (x_1^0, \dots, x_m^0, y_1^0, \dots, y_n^0)$ is given by:

\[
L(G^0; x^0, y^0) = \prod_{i=1}^{m} dG^0(x_i^0) \prod_{j=1}^{n} \int_{\{v : W(v) \ge y_j^0\}} \frac{1}{W(v)}\,dG^0(v).
\]
By defining $W^{-1}(y) = \min\{v : W(v) \ge y\}$ (which exists from the right continuity assumption) and recalling that W is increasing, we can rewrite the likelihood as

\[
L(G^0; x^0, y^0) = \prod_{i=1}^{m} dG^0(x_i^0) \prod_{j=1}^{n} \int_{v \ge W^{-1}(y_j^0)} \frac{1}{W(v)}\,dG^0(v) \tag{2.2}
\]
and the similarity to our original likelihood (2.1) is apparent.

When W is strictly increasing, no information is lost by the transformation $X^0 \to W(X^0)$, and Vardi’s EM algorithm [18] can be applied to $W(X_1^0), \dots, W(X_m^0), Y_1^0, \dots, Y_n^0$. This gives the NPMLE of $F_{W(X^0)}$ and the NPMLE of $G^0$ is obtained by a simple transformation. Specifically, let $t_1 < t_2 < \dots < t_h$ be the distinct values of $x_1^0, \dots, x_m^0, W^{-1}(y_1^0), \dots, W^{-1}(y_n^0)$, and let $\xi_j$, $\zeta_j$ be the multiplicity of the $X^0$ and $W^{-1}(Y^0)$ samples at $t_j$:

\[
\xi_j = \sum_{i=1}^{m} I\{x_i^0 = t_j\}, \qquad \zeta_j = \sum_{i=1}^{n} I\{W^{-1}(y_i^0) = t_j\}.
\]
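Denote by $p_j$ the mass $G^0$ assigns to $t_j$, $p = (p_1, \dots, p_h)$; then the problem can be rewritten as

\[
\begin{aligned}
\text{maximize} \quad & L(p) = \prod_{j=1}^{h} p_j^{\xi_j} \Bigl( \sum_{k \ge j} \frac{1}{W(t_k)}\,p_k \Bigr)^{\zeta_j} \\
\text{subject to} \quad & \sum_{j} p_j = 1, \qquad p_j \ge 0 \ (j = 1, \dots, h).
\end{aligned} \tag{2.3}
\]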
An EM step is

\[
p_j^{\mathrm{new}} = (n + m)^{-1} \Bigl[ \xi_j + [W(t_j)]^{-1} p_j^{\mathrm{old}} \sum_{k \le j} \frac{\zeta_k}{\sum_{l \ge k} [W(t_l)]^{-1} p_l^{\mathrm{old}}} \Bigr], \tag{2.4}
\]
where $p_j^{\mathrm{old}}$ and $p_j^{\mathrm{new}}$ are the current and updated estimates of $p_j$. The derivation of (2.4) from (2.3) holds true whether or not W is strictly increasing. Maximization of (2.2) reduces to the discrete problem (2.3) by showing that the support of the NPMLE is discrete and determining $t_1, \dots, t_h$. This somewhat technical point is deferred to the Appendix for the case of a general non-decreasing W.

Finally, the connection between $G^0$ and $G^W$ is apparent from (2.1) and (2.2), where the only difference appears in the left limits of the integrals. These were used only to determine the points $t_1, \dots, t_h$ and their multiplicity in the two samples. Thus, to use the algorithm for maximizing (2.1), one needs to define $t_1, \dots, t_h$ as the distinct values of $x_1, \dots, x_m, y_1, \dots, y_n$ (the original data) and to change $\xi_j$ and $\zeta_j$ accordingly. The support of our original problem is a subset of the observations as discussed in the Appendix. After finding it, the problem reduces to (2.3) for which the algorithm (2.4) derives the NPMLE of $G^W$. The corresponding estimate of G is achieved by the inversion formula $dG(x) \propto [W(x)]^{-1}\,dG^W(x)$, as mentioned above.

The Kaplan-Meier estimate of G [8] is obtained by using W ≡ 1 in (2.4) (see Section 3.1 for more details). This estimator does not use the correct weights when it redistributes the mass of the censored observations, and in general it is inappropriate.

2.2. A direct approach

Cross-sectional samples of lifetimes usually contain the age at sampling A and the residual lifetime R; the latter is subject to random censoring. Vardi’s approach uses the sufficient statistic A + R [see (2.7) below] to develop an EM algorithm. In this subsection, the statistic (A, R) is used. The two approaches give different perspectives about the formation of the bias and censoring.

Assume that W(0) = 0, $W(t)\bar G(t) \to 0$ as t → ∞, and $\int_0^\infty W(dt)\,G(dt) = 0$. Mathematically this means that $\mu^* \equiv \int_0^\infty W(t)\,G(dt) = \int_0^\infty \bar G(t)\,W(dt)$.
Practically it means that the probability of leaving the population at the very instant of sampling is zero; hence, one does not need to worry about inclusion or exclusion of such observations in the sample. Let

\[
f_A(a) = \frac{\bar G(a)}{\mu^*}\,dW(a), \qquad a > 0, \tag{2.5}
\]

\[
f_{R|A}(r \mid a) = \frac{dG(a + r)}{\bar G(a)}, \qquad r \ge 0, \tag{2.6}
\]

so the joint density at (a, r) is

\[
f_{A,R}(a, r) = \frac{dG(a + r)}{\mu^*}\,dW(a), \qquad a > 0,\ r \ge 0. \tag{2.7}
\]
Now suppose we have one sample of m pairs $(a_i, r_i)$ from $f_{A,R}$ and another independent sample of n variables $y_j$ from $f_A$. This describes the so-called incomplete data and by denoting $x_i = a_i + r_i$ (i = 1, . . . , m) we arrive at the likelihood (1.1). Since W is known, the product $\prod_i dW(a_i) \prod_j dW(y_j)$ is irrelevant for maximization. The complete data are of course $x_1, \dots, x_m$ and $y_1 + \tilde r_1, \dots, y_n + \tilde r_n$, where $\tilde r_j$ is the unobserved residual lifetime of subject j of the second sample. Using the sum $x_i = a_i + r_i$ instead of its components is justified by noticing that the sum is the sufficient statistic for the complete data problem. By a change of variables and integrating a out in (2.7), it can be easily verified that the likelihood of $x_i$ (or the density of $y_j + \tilde r_j$) is

\[
\frac{W(x_i)\,dG(x_i)}{\mu^*}. \tag{2.8}
\]
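The verification of (2.8) is short enough to spell out. The following is a sketch rather than the author's own derivation; it assumes that, for a fixed value x of A + R, the age a ranges over (0, x], and it uses W(0) = 0 together with the right continuity of W:

\[
P(A + R \in dx) = \int_{a \in (0, x]} \frac{dG(x)}{\mu^*}\,dW(a)
= \frac{dG(x)}{\mu^*}\,\bigl[ W(x) - W(0) \bigr]
= \frac{W(x)\,dG(x)}{\mu^*}.
\]

The first equality sets r = x − a in (2.7), so that dG(a + r) = dG(x) no longer depends on a and can be taken out of the integral.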
For the support points $t_1, \dots, t_h$ described in the appendix, the E-step uses (2.6):

\[
E_{\Pi^{\mathrm{old}}}\bigl( I\{A_i + R_i = t_j\} \mid A_i = y_i \bigr) = \frac{\pi_j^{\mathrm{old}}}{\sum_k \pi_k^{\mathrm{old}} I\{t_k \ge y_i\}}\, I\{t_j \ge y_i\}, \tag{2.9}
\]

where $\Pi^{\mathrm{old}} = (\pi_1^{\mathrm{old}}, \dots, \pi_h^{\mathrm{old}})$ is the current estimate of the unbiased distribution G, i.e., the estimate at $t_j$ of the weighted distribution $G^W$ is $p_j^{\mathrm{old}} \propto W(t_j)\pi_j^{\mathrm{old}}$. The complete likelihood is a product of terms such as (2.8) which is the likelihood of a weighted sample. An M-step estimates the weighted distribution $G^W$ by the empirical distribution function. Combining the E-step and the M-step, an iteration is given by:

\[
p_j^{\mathrm{new}} = (n + m)^{-1} \Bigl[ \xi_j + \pi_j^{\mathrm{old}} \sum_{k \le j} \frac{\zeta_k}{\sum_{l \ge k} \pi_l^{\mathrm{old}}} \Bigr], \tag{2.10}
\]

where $\xi_j$ and $\zeta_j$ are the multiplicities of uncensored and censored observations, respectively. Put $\pi_j^{\mathrm{old}} = [W(t_j)]^{-1} p_j^{\mathrm{old}} / \sum_k \{[W(t_k)]^{-1} p_k^{\mathrm{old}}\}$ in the equation above to get (2.4).

2.3. Convergence of the algorithm

Several properties of the problem and the algorithms are sketched below. The appendix shows that for some functions W, such as step functions, the NPMLE is not unique and different choices of the support points yield the same (maximum) value of the likelihood. The properties below hold for a given choice of support points.

Property 2.1. Given the points $t_1, \dots, t_h$, the maximizer of (2.3) is unique.
Proof. We show that the problem can be replaced with a maximization of a strictly concave function over a convex region. Following [18], write $q_j = p_j / W(t_j)$, $Q_j = \sum_{k \ge j} q_k$. Since $W(t_j)$ is constant in the likelihood, we can replace (2.3) with:

\[
\begin{aligned}
\text{maximize} \quad & \log \prod_{j=1}^{h} q_j^{\xi_j} Q_j^{\zeta_j} \\
\text{subject to} \quad & \sum_{j} q_j W(t_j) \le 1, \qquad q_j \ge 0 \ (j = 1, \dots, h).
\end{aligned} \tag{2.11}
\]
Now $\log \prod_{j=1}^{h} q_j^{\xi_j} Q_j^{\zeta_j} = \sum_{j=1}^{h} \xi_j \log(q_j) + \sum_{j=1}^{h} \zeta_j \log(Q_j)$, and the assertion follows from log of a partial sum being a strictly concave function from $\mathbb{R}_+^h$ to $\mathbb{R}$, and since a sum of concave functions is concave (recall that for all j, $\xi_j, \zeta_j \ge 0$ and $\xi_j + \zeta_j \ge 1$).

Property 2.2. Let $p_n$ and $L_n = L(p_n)$ be the value of p and the value of the likelihood assigned by the EM algorithm (2.4) after its n’th iteration; then $L_n$ converges to a point $L^*$.

Proof. [1] and later [4] show that the likelihood increases in each iteration of the EM algorithm. The assertion follows from (2.3) that shows that the likelihood is bounded from above.

Property 2.3. If the maximizer of (2.3) assigns positive mass to all points $t_1, \dots, t_h$, then the algorithm (2.4) converges to that maximizer.

Proof. This follows from Theorem 4 of [24] by using the uniqueness proved in Property 2.1 and the fact that L is a polynomial on the simplex (which establishes the regularity conditions).

3. Examples

Many examples are easily described using the Lexis diagram (see Figure 1) that depicts changes in a population S over time. The horizontal and vertical axes of the diagram represent the calendar time and the lifetime or duration of subjects in S, respectively. Subjects of S are represented by 45◦ lines that start at their time of entering S and end at leaving. Lines that cross the vertical line x = t correspond to the population of S at t. In particular, a cross-sectional sample at time 0 contains those subjects (or lines) that intersect with the line x = 0. For a review of the Lexis diagram and its utility for studying different sampling designs and population quantities see [2] and [11].

3.1. Random censorship

A common sampling plan is of collecting data on subjects entering S during the time window [0, C]. This widely used design is known as the random censorship model and it is depicted in the top left panel of Figure 1. According to the model, a random sample $T_i$ (i = 1, 2, . . . , m + n) is selected from a distribution G, but instead of the $T_i$’s, observations are independent realizations of $\min(T_i, C_i)$, where $C_i \sim F_C$ is independent of $T_i$. The data contain also the information whether the observations were censored or not. Denote the uncensored observations by $x_1, \dots, x_m$ and the censored ones by $y_1, \dots, y_n$; then the likelihood of the data is

\[
\prod_{i=1}^{m} dG(x_i) \prod_{j=1}^{n} \bar G(y_j) \times \prod_{i=1}^{m} \bar F_C(x_i) \prod_{j=1}^{n} dF_C(y_j). \tag{3.1}
\]
Fig 1. Sampling designs in the Lexis diagram. Each 45◦ line represents a subject in the population. The horizontal axis is the calendar time and the vertical axis is duration or lifetime. Observed data are depicted as solid lines. Top left - random censoring model, I = (0, C). Top right - follow-up on cross-sectional sampling, I = (−∞, 0). Bottom left - window sampling, I = (−∞, C). Bottom right - cross-sectional with truncation, I = (−β, −α).

The NPMLE is the celebrated Kaplan-Meier estimator [8] and is based only on the left part of (3.1). This is a special case of (1.1) with W = 1, and can be solved (in a somewhat redundant way) by the EM algorithm (2.4). At convergence, the algorithm is exactly the self-consistent estimate of Efron [5] and (2.4) illustrates the redistribute-to-the-right principle. As Efron shows, it reduces to the non-iterative Kaplan-Meier estimator. According to Lemma A.3, an NPMLE assigns mass only to the uncensored observations, because W is constant everywhere (if the last observation is censored, then it assigns mass also to $\max\{y_j\}$). This is a well-known feature of the Kaplan-Meier estimator.
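The reduction to the Kaplan-Meier estimator can be checked numerically. The sketch below is an illustration only, not code from the paper: it implements iteration (2.4) in Python with NumPy, and the function name `em_npmle`, the uniform starting value and the stopping rule are assumptions made for the example. The support is taken to be all observed points, the inefficient but safe choice mentioned in Remark A.1.

```python
import numpy as np

def em_npmle(x, y, W, tol=1e-12, max_iter=100_000):
    """Iterate (2.4); return the support points t and the estimated masses of G."""
    x = np.asarray(x, dtype=float)                   # uncensored observations
    y = np.asarray(y, dtype=float)                   # censored observations
    t = np.unique(np.concatenate([x, y]))            # t_1 < ... < t_h
    xi = np.array([(x == tj).sum() for tj in t])     # multiplicities of the x's
    zeta = np.array([(y == tj).sum() for tj in t])   # multiplicities of the y's
    w = np.asarray(W(t), dtype=float)                # W(t_j), assumed positive
    p = np.full(t.size, 1.0 / t.size)                # uniform start for the masses of G^W
    for _ in range(max_iter):
        q = p / w                                    # p_l / W(t_l)
        tail = np.cumsum(q[::-1])[::-1]              # sum over l >= k of q_l
        p_new = (xi + q * np.cumsum(zeta / tail)) / (x.size + y.size)
        converged = np.max(np.abs(p_new - p)) < tol
        p = p_new
        if converged:
            break
    g = p / w                                        # inversion: dG proportional to dG^W / W
    return t, g / g.sum()

# W = 1 (random censorship): uncensored at 1, 3, 4 and censored at 2, 5.
t, g = em_npmle([1.0, 3.0, 4.0], [2.0, 5.0], lambda u: np.ones_like(u))
print(np.round(g, 4))
```

For these five observations the Kaplan-Meier masses are 1/5 at 1, 4/15 at each of 3 and 4, and the remaining 4/15 at the last (censored) point 5; the iteration approaches these values while the mass at the censored point 2 shrinks to zero. Passing `lambda u: u` instead gives the size-biased case W(x) = x.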
3.2. Population models with Poisson entrance process

The likelihood (1.1) is obtained in many designs where entrances to the studied population S are according to a Poisson process. Specifically, the model assumes a Poisson entrance process N(t) on (−∞, C) with rate ρ(·). The lifetimes, $X_1, X_2, \dots$, are determined by the law G, and N(·), $X_1, X_2, \dots$ are independent. The sample consists of all subjects who entered S during I and are in S sometime during (0, C), where I is a subset of (−∞, C), usually an interval. Figure 1 presents several common examples and the corresponding samples. It is assumed that $\rho(x) = \lambda \times \lambda_0(x)$ where $\lambda_0$ is known and that $\int_I \lambda_0(u)\bar G(-u)\,du = \mu^* < \infty$. For each sampled subject, we observe the entrance time a and the possibly censored residual lifetime r with the failure indicator δ. Denote by $(a_1, x_1), \dots, (a_m, x_m)$ and $(a_{m+1}, y_{m+1}), \dots, (a_{m+n}, y_{m+n})$ the data on the m uncensored and n censored observations; then the likelihood is given by

\[
L(G) = \prod_{i=1}^{m} \frac{dG(x_i)\lambda_0(a_i)}{\mu^*} \prod_{j=m+1}^{m+n} \frac{\bar G(y_j)\lambda_0(a_j)}{\mu^*} \times e^{-\lambda\mu^*} \frac{(\lambda\mu^*)^{n+m}}{(n + m)!}. \tag{3.2}
\]
The NPMLE of $\lambda\mu^*$ is n + m and since $\lambda_0$ is assumed known, the problem reduces to maximizing for G the function

\[
L(G \mid n + m) = \prod_{i=1}^{m} \frac{dG(x_i)}{\mu^*} \prod_{j=m+1}^{m+n} \frac{\bar G(y_j)}{\mu^*}, \tag{3.3}
\]

which has the form of (1.1). In this case $W(x) = \int_{(-x, C) \cap I} \lambda_0(u)\,du$, which is continuous and increasing.

Examples:

A homogeneous Poisson process - cross-sectional sampling - I = (−∞, 0). Early studies of this design are [14] and [9]. It is depicted in Figure 1 in the top right panel. The sample consists of all subjects who are in S at time 0 and the data contain their entrance time and follow-up until time C. Here W(x) = x and $\mu^* = EX$ is the mean lifetime. The algorithm (2.4) for this special case is derived by [18] who studies a slightly different model.

A homogeneous Poisson process - window sampling - I = (−∞, C). Here the sample consists of the cross-sectional population and those entering during the follow-up period (see the bottom left panel of Figure 1). Thus, there are two samples: i) a size biased sample comprising subjects who entered S before 0, and ii) an unbiased sample comprising those who entered during the time window [0, C]. Both samples are censored at C. This model is a mixture of the random censorship and the cross-sectional models discussed above, with a random number of observations from each model. Here W(x) = x + C and $\mu^* = \mu + C$. The model is studied by [9]; [22] provides an EM algorithm for it.

An inhomogeneous Poisson process - window sampling - I = (−∞, C). Kalbfleisch and Lawless [7] study entrances according to an inhomogeneous process in a somewhat different model, mainly focusing on the uncensored case. They consider many models in this framework and estimate both the rate function and the distribution of lifetimes. Under this model, $W(x) = \int_{-x}^{C} \lambda_0(t)\,dt$ is proportional to the cumulative rate.

A truncated Poisson process - I = (−β, −α). Wang [20] derives the NPMLE of G when data are collected only for subjects whose ages at sampling time are in [α, β] for some known 0 ≤ α < β. This design is depicted in the bottom right panel of Figure 1 and it is natural when data started to be recorded only β years ago (in that case α = 0) or when there is a specific interest in subjects who entered during the period (−β, −α) (e.g., when S is defined by some epidemic status and a specific treatment was used during (−β, −α)). Wang shows that when entrances to S are according to a homogeneous Poisson process the likelihood is given by

\[
\prod_{i} \frac{dG(x_i)}{\int_\alpha^\beta \bar G(u)\,du} \prod_{j} \frac{\bar G(y_j)}{\int_\alpha^\beta \bar G(u)\,du}.
\]

A simple calculation shows that the likelihood is of the form (1.1) with $\mu^* = E[\min(X, \beta) - \alpha]^+$ so the algorithm (2.4) can be applied with $W(x) = [\min(x, \beta) - \alpha]^+$. We note that in this case W(x) = 0 for x ∈ [0, α] so implementation of the
algorithm looks problematic. However, under this setting, G is not identifiable on [0, α] [20] and one can only hope to estimate G given X > α where W > 0. This model is used in Section 4 to analyze the Channing House data [6].

3.3. A discrete entrance process

Mandel and Rinott (unpublished) study a discrete version of the Poisson entrance model in which entrances to S occur at fixed time points $\sigma_K < \dots < \sigma_2 < \sigma_1 \le 0$. At $\sigma_k$, $N_k$ new subjects join S with lifetimes $X_{k1}, \dots, X_{kN_k}$. The model assumes that $N_k$ has a Poisson distribution with parameter $\lambda \times \lambda_0(k)$, where $\lambda_0$ is known, $X_{ki} \sim G$ (k = 1, . . . , K; i = 1, . . . , $N_k$), and $\{N_1, \dots, N_K, X_{11}, X_{12}, \dots, X_{KN_K}\}$ are independent. Under this model, the NPMLE is obtained by maximizing (1.1) with

\[
W(x) = \sum_{\{k : -\sigma_k \le x\}} \lambda_0(k).
\]

Here W is a step function and Lemma A.3 should be used to determine the support of G. When $\sigma_k = -k$, $\lambda_0$ is constant and lifetimes are integer valued, the model is a discrete size-biased model.

3.4. Truncation models

The left truncation model [12, 21, 23] assumes that two independent variables A ∼ W and T ∼ G can be observed only on the region A ≤ T. In addition, there is a random censoring variable C which is independent of T and satisfies P(C > A) = 1. Data comprise $(a_i, \min(t_i, c_i), \delta_i)$ (i = 1, . . . , m + n), which are m + n realizations of (A, min(T, C), ∆) restricted to the region A ≤ T, where ∆ = 1 if T ≤ C and ∆ = 0 otherwise. Let $\mu^* = P(A \le T) = EW(T)$. Changing the notation as in (3.2), the likelihood of the data is

\[
\prod_{i=1}^{m} \frac{dW(a_i)\,dG(x_i)}{\mu^*} \prod_{j=m+1}^{m+n} \frac{dW(a_j)\,\bar G(y_j)}{\mu^*} \tag{3.4}
\]
(note the similarity to (3.2)). If W is known, then the terms $dW(a_i)$ can be omitted from the likelihood and the problem of maximizing (3.4) is equivalent to maximizing (1.1). Equation (3.4) can be reexpressed as

\[
\prod_{i=1}^{m} \frac{dG(x_i)}{\bar G(a_i-)} \prod_{j=m+1}^{m+n} \frac{\bar G(y_j)}{\bar G(a_j-)} \times \prod_{i=1}^{m} \frac{dW(a_i)\,\bar G(a_i-)}{\mu^*} \prod_{j=m+1}^{m+n} \frac{dW(a_j)\,\bar G(a_j-)}{\mu^*}, \tag{3.5}
\]

where the first term is the likelihood of T | A = a, A ≤ T and the second term is the likelihood of A | A ≤ T. If W is completely unknown, maximizing (3.4) for G is equivalent to maximizing the left term in (3.5) and the maximizer is the product-limit estimator (PLE) defined by

\[
\frac{\tilde G(dt)}{1 - \tilde G(t-)} = \frac{\sum_{i=1}^{m} I\{x_i = t\}}{\sum_{i=1}^{m} I\{a_i \le t \le x_i\} + \sum_{j=m+1}^{m+n} I\{a_j \le t \le y_j\}}, \tag{3.6}
\]

see [20]. In the next section, the PLE and the NPMLE are compared on real and simulated data.
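The hazard formula (3.6) translates directly into a product over the distinct failure times. The following Python sketch is illustrative only; the function name `product_limit` and the data layout (arrays of truncation times `a`, observed times `z` = min(t, c), and failure indicators `delta`) are assumptions of the example, not notation from the paper.

```python
import numpy as np

def product_limit(a, z, delta):
    """PLE of the survival function from left-truncated, right-censored data, via (3.6)."""
    a = np.asarray(a, dtype=float)     # truncation (entry) times a_i
    z = np.asarray(z, dtype=float)     # observed times: x_i if delta = 1, y_j if delta = 0
    delta = np.asarray(delta)          # 1 = failure observed, 0 = censored
    times = np.unique(z[delta == 1])   # distinct failure times
    surv = []
    s = 1.0
    for t in times:
        d = np.sum((z == t) & (delta == 1))   # numerator of (3.6): failures at t
        r = np.sum((a <= t) & (t <= z))       # denominator of (3.6): risk set at t
        s *= 1.0 - d / r
        surv.append(s)
    return times, np.array(surv)

# A small risk set early on drives the estimate to zero at the first failure.
times, surv = product_limit([0.0, 5.0, 6.0], [1.0, 7.0, 8.0], [1, 1, 1])
print(times, surv)
```

In the usage example only the first subject is under observation at time 1, so the estimated hazard there is 1 and the survival estimate is zero from that point on. This is the kind of behaviour, caused by an almost empty risk set, that is reported for the Channing House data in Section 4.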
4. Illustration

Channing House data. The data set comprises 97 males who were residents of the Channing House retirement community in Palo Alto, California [6]. It contains the age at entry to the community and the age at death or censoring accompanied by the event indicator. In terms of Section 3.4, the age at entry is A and the age at death is T. The parameter of interest, G, is the distribution of age at death (in months) of male residents of the community. Wang [20] estimates the distribution of age (in months) at entry to be uniform on (782, 1073). Assuming W is the uniform distribution function on (782, 1073), $\bar G(\cdot)/\bar G(782)$ was estimated using algorithm (2.4) and $W(x) = [\min(x, 1073) - 782]^+$. Note that this is similar to the bias of the truncated Poisson process model described at the end of Section 3.2. The NPMLE and the PLE are depicted in the right panel of Figure 2. The survival under the uniform assumption is estimated to be somewhat higher.

For the analysis in the right panel of Figure 2, only 93 out of the 97 residents were used. Using all 97 individuals was problematic since the PLE approaches zero after the first two failures. This is shown in the left panel of Figure 2. Also shown is the survival estimated by (2.4) based on all 97 individuals and after estimating W to be U(751, 1073) (by the minimum and maximum age at entry of all 97 residents). The NPMLE is seen to be less sensitive to outliers. It shows that (2.4) can provide reasonable estimates when the PLE fails, a common phenomenon in small truncated data sets.

Fig 2. Comparison of the PLE (dashed line) and the NPMLE (solid line) of survival in the Channing House community. Left - all 97 individuals. Right - excluding the first two failures.

Simulation. A simulation study was conducted to compare the performances of the NPMLE and the PLE under the left truncation model described in Section 3.4. We used the EXP(1) model for both G and W and generated 400 data sets of 50 observations each. The first row of Figure 3 compares the performance of the estimators in terms of log MSE at the deciles of G, calculated by the average over the repeated replicates. Since the PLE is not well defined when the risk group is empty before the last observation [21], such data sets were not used to calculate the MSE of the PLE. They were used to calculate the MSE of the NPMLE. The columns of Figure 3 show the effect of censoring. Lifetimes were censored at A + C for a fixed C such that the probability of censoring is 10, 25 and 50 percent from left to right. The second row of Figure 3 shows the results of the same analysis applied to data sets of 200 observations. These were generated by combining the simulated data sets of 50 observations (total of 100 data sets). Figure 4 shows the sensitivity of the estimators to the assumption on W. The analysis was done assuming the exponential distribution as before while the data were generated using W = Gamma(2,1).

Fig 3. Comparison of log MSE of the NPMLE (solid line) and the PLE (dashed line) calculated at deciles. Data were generated using W = EXP(1). Sample size of 50 (top) and 200 (bottom), censoring probability of 0.1, 0.25 and 0.5 from left to right.

The results show that using knowledge on the bias improves estimation by 10%–25% in terms of MSE. Increasing the probability of censoring results in a higher MSE of both estimators, especially in the right tail. The relative performance does not change much by censoring, but some indication of better relative performance of the NPMLE in the right tail is seen in the simulation with 50% probability of censoring. More interesting are the results of the sensitivity study that show that the NPMLE is quite sensitive to the assumption on W. In general, the MSE of the PLE is smaller than that of the NPMLE. Moreover, the performance of the PLE improves when sample size increases from 50 to 200 while the performance of the NPMLE does not change. However, the performance of the NPMLE is better than that of the PLE in the left tail even when the model is incorrect. This phenomenon is seen in simulation with other distributions (not shown) and is probably attributed to the small risk group in the left tail that results in unstable estimation. In the simulated data sets, the PLE was not well defined in 2% to 20% of the samples depending on the setting.
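One way to generate data of the kind used in this simulation is rejection sampling from the left truncation model of Section 3.4. The sketch below is not the author's simulation code: the function name `simulate_left_truncated`, the batch size and the seed are assumptions, and the value of C is derived here from the memoryless property of the exponential distribution, which gives P(T − A > C | A ≤ T) = exp(−C) when G is EXP(1).

```python
import numpy as np

def simulate_left_truncated(n, c, rng):
    """Draw n triples (a, z, delta) with A ~ EXP(1), T ~ EXP(1), kept only if A <= T.

    Lifetimes are censored at A + c: z = min(T, A + c), delta = 1 if T <= A + c.
    """
    a_keep, t_keep = [], []
    while len(a_keep) < n:
        a = rng.exponential(1.0, size=4 * n)   # truncation times, A ~ W
        t = rng.exponential(1.0, size=4 * n)   # lifetimes, T ~ G
        ok = a <= t                            # left truncation: only A <= T is observed
        a_keep.extend(a[ok])
        t_keep.extend(t[ok])
    a = np.array(a_keep[:n])
    t = np.array(t_keep[:n])
    z = np.minimum(t, a + c)
    delta = (t <= a + c).astype(int)
    return a, z, delta

rng = np.random.default_rng(0)
c = -np.log(0.25)                              # censoring probability exp(-c) = 0.25
a, z, delta = simulate_left_truncated(50, c, rng)
print(a.shape, delta.mean())
```

Each call returns one data set of 50 observations; repeating it 400 times and changing `c` to `-np.log(0.10)` or `-np.log(0.50)` would mimic the three censoring settings described above, under the stated assumptions.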
Fig 4. Sensitivity analysis. Log MSE of the NPMLE (solid line) and the PLE (dashed line) at deciles. Data were generated using W =Gamma(2,1) and the NPMLE was calculated assuming W =EXP(1). Sample size of 50 (top) and 200 (bottom), censoring probability of 0.1,0.25 and 0.5 from left to right.
5. Discussion

The principal aim of this article is to provide a general framework and a unified algorithm for problems involving bias and censoring. A secondary aim is to contrast Vardi’s multiplicative censoring model with truncated data and to compare the NPMLE to the PLE.

An important question is which of the two estimators to use. The PLE cannot be calculated when all observations are censored or when data do not contain the truncation times $a_1, \dots, a_{m+n}$ (see Section 3.4), and may not be well defined in data sets that do contain the truncation times as illustrated by the Channing House example. In all of these situations, the NPMLE exists and can be used. The simulation study shows that the NPMLE is more efficient when the model is correctly specified. However, it also indicates that it is quite sensitive to the assumed form of W. The use of the NPMLE, therefore, should be limited to situations where there is a theoretical justification for the assumed model of W. Furthermore, when data contain truncation times, the assumed form of W can and should be tested. Wang [20] suggests a graphical goodness-of-fit test by plotting an estimate of W versus the assumed model, and [10] study generalized Pearson statistics that can provide formal goodness-of-fit tests for the current model. These and other goodness-of-fit tests are studied in [13].

Algorithm (2.4) can be nested in more complex algorithms to provide nonparametric estimates for other interesting problems. Several examples are given in the unpublished PhD dissertation of the author. For example, Wang [19] studies the semi-parametric left truncation model $\{W \in \mathcal{W}_\theta, G \text{ unrestricted}\}$, where $\mathcal{W}_\theta$ is a family of distributions indexed by θ. Her method, however, is applicable only for uncensored data and hence is of limited use. Estimates under Wang’s model for censored data can be obtained by an iterative algorithm that uses (2.4) in one of its steps. Preliminary simulation results reveal better performance of this estimator over the PLE.

The algorithm presented in this article can be easily extended to likelihood of the form

\[
\prod_{s=1}^{S} \Bigl[ \prod_{i=1}^{m_s} \frac{dG(x_{si})}{\mu_s^*} \times \prod_{j=1}^{n_s} \frac{\bar G(y_{sj})}{\mu_s^*} \Bigr], \tag{5.1}
\]
where $\mu_s^* = \int_0^\infty W_s(x)\,dG(x)$ for known increasing and right continuous functions $W_s$ (s = 1, . . . , S). This likelihood generalizes the model of [17] to the multiplicative censoring case. The complete likelihood in this problem involves products of likelihoods from different weight functions. The E-step is equivalent to (2.9), and the M-step uses Vardi’s algorithm for selection bias models [17].

Appendix A: The support of the NPMLE

This appendix discusses the determination of the support of the NPMLE for a nondecreasing right continuous function W. Although the NPMLE is not always unique (when W has steps) it is shown that there exists an NPMLE that assigns mass only to the observed points (Lemma A.1). Furthermore, Lemmas A.2 and A.3 characterize observed points that can be excluded from the support. Let $W^{-1}(y) = \min\{v : W(v) \ge y\}$; then for the likelihood (2.2) we have

Lemma A.1. There exists an NPMLE of $G^0$ that assigns mass only to the critical points $x_1, \dots, x_m, W^{-1}(y_1), \dots, W^{-1}(y_n)$.

Proof. If mass is assigned to points other than the critical ones (i.e., points other than $x_j$ or $W^{-1}(y_i)$), then shifting the mass to the closest critical point to the left will not decrease the likelihood. The assertion follows after noticing that mass assigned to the left of the minimal critical point contributes nothing to the likelihood since the integrals have left limits.

Next, suppose that m = n = 1, W is constant on an interval [a, b) and $a < x_1 < b$ and $W^{-1}(y_1) = a$. Drawing a step function W and inspecting the likelihood show that the NPMLE assigns mass only to $x_1$. In general, if there exist i and j such that $W(x_i) = W(W^{-1}(y_j))$, then the likelihood (2.2) increases if mass first assigned to $W^{-1}(y_j)$ is shifted to $x_i$. This excludes several of the critical points from being in the support of the NPMLE of $G^0$ and it is summarized by

Lemma A.2. Let $W^{-1}(y|x) = \min_{1 \le i \le m}\{x_i : W(x_i) = W(W^{-1}(y))\}$ if such $x_i$ exists, and $W^{-1}(y|x) = W^{-1}(y)$ otherwise. Then there exists an NPMLE of (2.2) that assigns mass only to the points $x_1, \dots, x_m, W^{-1}(y_1|x), \dots, W^{-1}(y_n|x)$.

The support of the NPMLE of $G^W$ from the likelihood (2.1) is determined by arguments similar to those leading to Lemma A.1 and Lemma A.2. This suggests that the support is a subset of $\{x_1, \dots, x_m, y_1, \dots, y_n\}$ (from the integrals appearing in the likelihood (2.1) it is seen that the support points are the y’s and not the
W −1 (y)’s). Moreover, if observations yj < xi exist such that W (xi ) = W (yj ), then the likelihood increases if mass initially assigned to yj is shifted to xi . Likewise, if there are yj < yj such that W (yj ) = W (yj ) then the likelihood increases if mass initially assigned to yj is shifted to yj . The following lemma summarizes this discussion: Lemma A.3. There always exists an NPMLE for (1.1) which assigns mass only to observed points. The complete observations xi (i = 1, . . . , m) are always points of support. A censored observation yj is not a point of support if: (i) there exists xi > yj such that W (xi ) = W (yj ) or (ii) there exists yj > yj such that W (yj ) = W (yj ). Remark A.1. Lemma A.3 tells us which yj ’s are not points of support and not which yj ’s are. The EM algorithm may assign mass zero to some of the yj ’s. One can always use the inefficient approach of considering all observations as support points and let the algorithm to assign zero mass to the redundant ones. References [1] Baum, L. E. Petrie, T. Soules, G. and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41 164–171. [2] Brillinger, D. R. (1986). The natural variability of vital rates and associated statistics (with discussion). Biometrics. 42 693–734. [3] Cox, D.R. (1969). Some sampling problems in technology. In New Developments in Survey Sampling (N. L. Johnson and H. Smith, eds.) 506–527. Wiley, New York. [4] empster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39 1–38. [5] Efron, B. (1967). The two sample problem with censored data. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 5 831–853. Univ. California Press, Berkeley. [6] Hyde, J. (1977). Testing survival under right censoring and left truncation. Biometrika 64 225–230. 
[7] Kalbfleisch, J. D. and Lawless, J. F. (1989). Inference based on retrospective ascertainment: An analysis of the data on transfusion-related AIDS. J. Amer. Statist. Assoc. 84 360–372.
[8] Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc. 53 457–481.
[9] Laslett, G. M. (1982). The survival curve under monotone density constraints with applications to two-dimensional line segment processes. Biometrika 69 153–160.
[10] Li, G. and Doss, H. (1993). Generalized Pearson–Fisher chi-square goodness-of-fit tests, with applications to models with life history data. Ann. Statist. 21 772–797.
[11] Lund, J. (2000). Sampling bias in population studies—How to use the Lexis diagram. Scandinavian J. Statist. 27 589–604.
[12] Lynden-Bell, D. (1971). A method of allowing for known observational selection in small samples applied to 3CR quasars. Monthly Notices of the Roy. Astronomical Society 155 95–118.
[13] Mandel, M. and Betensky, R. A. (2007). Testing goodness-of-fit of a uniform truncation model. Biometrics. To appear.
[14] Simon, R. (1980). Length biased sampling in etiologic studies. American J. Epidemiology 111 444–452.
[15] Soon, G. and Woodroofe, M. (1996). Nonparametric estimation and consistency for renewal processes. J. Statist. Plann. Inference 53 171–195.
[16] Vardi, Y. (1982). Nonparametric estimation in renewal processes. Ann. Statist. 10 772–785.
[17] Vardi, Y. (1985). Empirical distributions in selection bias models. Ann. Statist. 13 178–203.
[18] Vardi, Y. (1989). Multiplicative censoring, renewal processes, deconvolution and decreasing density: Nonparametric estimation. Biometrika 76 751–761.
[19] Wang, M. C. (1989). A semiparametric model for randomly truncated data. J. Amer. Statist. Assoc. 84 742–748.
[20] Wang, M. C. (1991). Nonparametric estimation from cross-sectional survival data. J. Amer. Statist. Assoc. 86 130–143.
[21] Wang, M. C., Jewell, N. P. and Tsai, W. Y. (1986). Asymptotic properties of the product limit estimate under random truncation. Ann. Statist. 14 1597–1605.
[22] Wijers, B. J. (1995). Consistent non-parametric estimation for a one-dimensional line segment process observed in an interval. Scand. J. Statist. 22 335–360.
[23] Woodroofe, M. (1985). Estimating a distribution function with truncated data. Ann. Statist. 13 163–177. [Correction Ann. Statist. 15 (1987) 883.]
[24] Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Ann. Statist. 11 95–103.
IMS Lecture Notes–Monograph Series Complex Datasets and Inverse Problems: Tomography, Networks and Beyond Vol. 54 (2007) 239–249 © Institute of Mathematical Statistics, 2007 DOI: 10.1214/074921707000000184
Estimating a Polya frequency function$_2$

Jayanta Kumar Pal$^{1,*}$, Michael Woodroofe$^{1,\dagger}$ and Mary Meyer$^{2,\dagger}$

University of Michigan and University of Georgia

Abstract: We consider non-parametric maximum likelihood estimation in the class of Polya frequency functions of order two, viz. the densities with a concave logarithm. This is a subclass of the unimodal densities and is fairly rich in general. The NPMLE is shown to be the solution to a convex programming problem in Euclidean space, and an algorithm is devised similar to the iterative convex minorant algorithm of Jongbloed (1998). The estimator achieves Hellinger consistency when the true density is a PFF$_2$ itself.
1. Introduction

The problem of estimating a unimodal density and its mode has attracted wide interest in the literature, beginning with the work of Barlow [1], Prakasa Rao [7], Robertson [8] and Wegman [15], [16] and continuing through [9], [2], [3], [5], and [4], which can be consulted for further references. Asymptotic properties of the maximum likelihood estimators have been developed but may be messy and suffer from some inconsistency in the region near the mode. Kernel estimation can avoid the inconsistency, but must confront the choice of a bandwidth. Here we investigate a smaller, easier version of the problem, estimating a Polya frequency function of order two [hereafter PFF$_2$]. By PFF$_2$, we mean a density $f$ whose logarithm is concave over the support of $f$; equivalently, a function $f$ whose logarithm is concave in the sense of Rockafellar [10]. Such functions are automatically unimodal. Moreover, an estimated PFF$_2$ supplies its own estimate of the mode. There is no need to estimate the mode separately. The non-parametric maximum likelihood estimator [hereafter, NPMLE] for this problem is derived in Section 2 and shown to be Hellinger consistent in Section 3. Simulations are reported in Section 4. Rufibach and Duembgen [11] and Walther [14] adopt a similar approach but obtain different results by different methods.

2. The NPMLE

Let $\mathcal{F}$ be the class of PFF$_2$ densities and suppose that a sample $X_1, \ldots, X_n \sim f \in \mathcal{F}$ is available. The problem is to estimate $f$ non-parametrically. Letting $-\infty < x_1 < x_2 < \cdots < x_n < \infty$ denote the order statistics, the log-likelihood function is

(2.1)  $\ell(g) = \sum_{i=1}^{n} \log[g(x_i)].$
∗ Supported by Horace Rackham Graduate School and National Science Foundation.
† Supported by National Science Foundation.
1 436 West Hall, 1085 South University, Department of Statistics, Ann Arbor, MI 48105, USA, e-mail: [email protected]; [email protected]
2 223 Statistics Building, University of Georgia, Athens, GA, USA, e-mail: [email protected]
AMS 2000 subject classifications: primary 62G07, 62G08; secondary 90C25.
Keywords and phrases: Polya frequency function, Iterative concave majorant algorithm, Hellinger consistency.
This is to be maximized with respect to $g \in \mathcal{F}$. Equivalently, this is to be maximized with respect to non-negative $g$ for which $\log(g)$ is concave and

(2.2)  $\int_{-\infty}^{\infty} g(x)\,dx = 1.$

If $g \in \mathcal{F}$, write $\theta_i = \log[g(x_i)]$. Then $\log[g(x)] \ge [(x - x_{i-1})\theta_i + (x_i - x)\theta_{i-1}]/(x_i - x_{i-1})$ for $x_{i-1} \le x \le x_i$ and, therefore,

(2.3)  $\int_{x_{i-1}}^{x_i} g(x)\,dx \ge \frac{e^{\theta_i} - e^{\theta_{i-1}}}{\theta_i - \theta_{i-1}}\,(x_i - x_{i-1})$

for $i = 2, \ldots, n$. It follows easily that (2.1) is maximized when $g \in \mathcal{F}$ has support $[x_1, x_n]$ and $\log(g)$ is a piecewise linear function with knots at $x_1, \ldots, x_n$. For if $g \in \mathcal{F}$, let $g^o$ be a function for which $\log(g^o)$ is piecewise linear, $g^o(x_i) = g(x_i)$, $i = 1, \ldots, n$, and $g^o(x) = 0$ for $x \notin [x_1, x_n]$. Then, using (2.3), $g(x) \ge g^o(x)$ for all $x$ with equality for $x \in \{x_1, \ldots, x_n\}$, and therefore,

$1 = \int_{-\infty}^{\infty} g(x)\,dx \ge \int_{x_1}^{x_n} g(x)\,dx \ge \int_{x_1}^{x_n} g^o(x)\,dx = \int_{-\infty}^{\infty} g^o(x)\,dx.$

So, there is a $c \le 1$ for which $g^* = g^o/c \in \mathcal{F}$ and $\ell(g) \le \ell(g^*)$. Thus, finding the NPMLE may be reformulated as a maximization problem in $\mathbb{R}^n$. Let $K$ be the set of $\theta = [\theta_1, \theta_2, \ldots, \theta_n] \in \mathbb{R}^n$ for which

$\frac{\theta_i - \theta_{i-1}}{x_i - x_{i-1}} \ge \frac{\theta_{i+1} - \theta_i}{x_{i+1} - x_i}$

for $i = 2, \ldots, n - 1$. The reformulated maximization problem is to maximize $\theta_1 + \cdots + \theta_n$ among $\theta \in K$ subject to the constraint

(2.4)  $\sum_{i=2}^{n} \frac{e^{\theta_i} - e^{\theta_{i-1}}}{\theta_i - \theta_{i-1}}\,(x_i - x_{i-1}) = 1.$
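As an illustration of constraint (2.4), the following minimal Python sketch evaluates its left side, that is, the total mass of the density whose logarithm linearly interpolates the values $\theta_i$ at the knots $x_i$. The function name `loglinear_mass` and the tolerance used for nearly equal neighbouring $\theta$ values are assumptions made for this example, not part of the original text.

```python
import math

def loglinear_mass(x, theta, tol=1e-12):
    """Left side of (2.4): integral over [x_1, x_n] of exp(h), where h is the
    piecewise linear interpolant of the points (x_i, theta_i)."""
    total = 0.0
    for i in range(1, len(x)):
        dx = x[i] - x[i - 1]
        dt = theta[i] - theta[i - 1]
        if abs(dt) < tol:
            # (e^b - e^a)/(b - a) tends to e^a as b -> a (flat segment)
            total += dx * math.exp(theta[i - 1])
        else:
            total += dx * (math.exp(theta[i]) - math.exp(theta[i - 1])) / dt
    return total
```

A vector $\theta$ is feasible for the reformulated problem when this quantity equals 1 and the slopes between consecutive knots are non-increasing.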
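Introducing a Lagrange multiplier, it is necessary to maximize

(2.5)  $\psi(\theta) = \sum_{i=1}^{n} \theta_i - \lambda \sum_{i=2}^{n} \frac{e^{\theta_i} - e^{\theta_{i-1}}}{\theta_i - \theta_{i-1}}\,(x_i - x_{i-1})$

subject to (2.4), for appropriate $\lambda$, for $\theta \in K$. The partial derivatives of $\psi$ are

$\frac{\partial\psi(\theta)}{\partial\theta_1} = 1 - \lambda\left[-\frac{e^{\theta_1}}{\theta_2 - \theta_1} + \frac{e^{\theta_2} - e^{\theta_1}}{(\theta_2 - \theta_1)^2}\right](x_2 - x_1),$

$\frac{\partial\psi(\theta)}{\partial\theta_2} = 1 - \lambda\left[\frac{e^{\theta_2}}{\theta_2 - \theta_1} - \frac{e^{\theta_2} - e^{\theta_1}}{(\theta_2 - \theta_1)^2}\right](x_2 - x_1) - \lambda\left[-\frac{e^{\theta_2}}{\theta_3 - \theta_2} + \frac{e^{\theta_3} - e^{\theta_2}}{(\theta_3 - \theta_2)^2}\right](x_3 - x_2),$

$\cdots$

$\frac{\partial\psi(\theta)}{\partial\theta_j} = 1 - \lambda\left[\frac{e^{\theta_j}}{\theta_j - \theta_{j-1}} - \frac{e^{\theta_j} - e^{\theta_{j-1}}}{(\theta_j - \theta_{j-1})^2}\right](x_j - x_{j-1}) - \lambda\left[-\frac{e^{\theta_j}}{\theta_{j+1} - \theta_j} + \frac{e^{\theta_{j+1}} - e^{\theta_j}}{(\theta_{j+1} - \theta_j)^2}\right](x_{j+1} - x_j),$

$\cdots$

and

$\frac{\partial\psi(\theta)}{\partial\theta_n} = 1 - \lambda\left[\frac{e^{\theta_n}}{\theta_n - \theta_{n-1}} - \frac{e^{\theta_n} - e^{\theta_{n-1}}}{(\theta_n - \theta_{n-1})^2}\right](x_n - x_{n-1}).$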
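At the maximizing $\theta$, $\nabla\psi(\theta)^t \mathbf{1} = 0$, so that

$n = \lambda\left[-\frac{e^{\theta_1}}{\theta_2 - \theta_1} + \frac{e^{\theta_2} - e^{\theta_1}}{(\theta_2 - \theta_1)^2}\right](x_2 - x_1)$
$\quad + \lambda \sum_{j=2}^{n-1} \left\{ \left[\frac{e^{\theta_j}}{\theta_j - \theta_{j-1}} - \frac{e^{\theta_j} - e^{\theta_{j-1}}}{(\theta_j - \theta_{j-1})^2}\right](x_j - x_{j-1}) + \left[-\frac{e^{\theta_j}}{\theta_{j+1} - \theta_j} + \frac{e^{\theta_{j+1}} - e^{\theta_j}}{(\theta_{j+1} - \theta_j)^2}\right](x_{j+1} - x_j) \right\}$
$\quad + \lambda\left[\frac{e^{\theta_n}}{\theta_n - \theta_{n-1}} - \frac{e^{\theta_n} - e^{\theta_{n-1}}}{(\theta_n - \theta_{n-1})^2}\right](x_n - x_{n-1}).$

There is some cancellation here, and

$n = \lambda \sum_{j=2}^{n} \frac{e^{\theta_j} - e^{\theta_{j-1}}}{\theta_j - \theta_{j-1}}\,(x_j - x_{j-1}).$

So, if (2.4) is to be satisfied, then $\lambda = n$. Now let

(2.6)  $\omega_1 = \theta_1, \qquad \omega_j = \frac{\theta_j - \theta_{j-1}}{x_j - x_{j-1}}$

for $j = 2, \ldots, n$. Then,

(2.7)  $\omega_2 \ge \cdots \ge \omega_n$

and

(2.8)  $\theta_j = \omega_1 + \sum_{i=2}^{j} (x_i - x_{i-1})\,\omega_i$

for $j = 1, \ldots, n$, where an empty sum is to be interpreted as 0. Let

(2.9)  $\phi(\omega) = \psi(\theta)$

and $\Delta x_j = x_j - x_{j-1}$. Then

$\sum_{j=1}^{n} \theta_j = n\omega_1 + \sum_{i=2}^{n} (n - i + 1)\,\Delta x_i\,\omega_i,$

$\sum_{i=2}^{n} \frac{e^{\theta_i} - e^{\theta_{i-1}}}{\theta_i - \theta_{i-1}}\,\Delta x_i = \sum_{j=2}^{n} e^{\theta_{j-1}}\,\frac{e^{(x_j - x_{j-1})\omega_j} - 1}{\omega_j} = \sum_{j=2}^{n} e^{\theta_{j-1}}\,\rho(\Delta x_j \omega_j)\,\Delta x_j,$

where

$\rho(t) = \frac{e^t - 1}{t}.$

So,

$\phi(\omega) = n\omega_1 + \sum_{i=2}^{n} (n - i + 1)\,\Delta x_i\,\omega_i - n \sum_{j=2}^{n} e^{\theta_{j-1}}\,\rho(\Delta x_j \omega_j)\,\Delta x_j.$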
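The change of variables (2.6)–(2.8) and the resulting objective $\phi(\omega)$ can be sketched in a few lines of Python. This is only an illustrative sketch with $\lambda = n$ substituted; the function names `rho`, `theta_from_omega` and `phi`, and the small-argument branch in `rho`, are assumptions introduced here.

```python
import math

def rho(t):
    # rho(t) = (e^t - 1)/t; near t = 0 use the expansion 1 + t/2
    return math.expm1(t) / t if abs(t) > 1e-12 else 1.0 + t / 2.0

def theta_from_omega(x, omega):
    # (2.8): theta_j = omega_1 + sum_{i=2}^{j} (x_i - x_{i-1}) * omega_i
    theta = [omega[0]]
    for i in range(1, len(x)):
        theta.append(theta[-1] + (x[i] - x[i - 1]) * omega[i])
    return theta

def phi(x, omega):
    # phi(omega) = sum_j theta_j - n * sum_{j>=2} e^{theta_{j-1}} rho(dx_j omega_j) dx_j
    n = len(x)
    theta = theta_from_omega(x, omega)
    value = n * omega[0]
    mass = 0.0
    for i in range(1, n):          # 0-based i is the 1-based index i + 1
        dx = x[i] - x[i - 1]
        value += (n - i) * dx * omega[i]   # coefficient (n - (i + 1) + 1)
        mass += math.exp(theta[i - 1]) * rho(dx * omega[i]) * dx
    return value - n * mass
```

Working with `rho` rather than the raw quotient avoids a division by zero on segments where consecutive $\theta$ values coincide.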
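Using (2.8), it follows that at the maximizing $\theta$ (or $\omega$)

$\frac{\partial\phi(\omega)}{\partial\omega_1} = n - n \sum_{j=2}^{n} e^{\theta_{j-1}}\,\rho(\Delta x_j \omega_j)\,\Delta x_j = 0.$

So, we are led to the problem of maximizing $\phi(\omega)$, subject to (2.7) and $\partial\phi(\omega)/\partial\omega_1 = 0$. Again using (2.8), the latter condition may be written

(2.10)  $e^{-\omega_1} = \sum_{j=2}^{n} \exp\Big(\sum_{i=2}^{j-1} \Delta x_i\,\omega_i\Big)\,\rho(\Delta x_j \omega_j)\,\Delta x_j.$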
To solve this, we need a version of the iterative concave majorant algorithm, similar to that of Jongbloed [6]. We start with an initial value $\omega^0 = (\omega_1^0, \ldots, \omega_n^0)$ for which (2.7) and (2.10) are satisfied. One such choice is to assume that $f$ is a normal density and estimate its mean and variance from the data; the corresponding $\omega^0$ can then be computed using a scaled piecewise linear version of $\log f$. Let $k = 0$. The idea behind our algorithm is to replace the concave function $\phi$ locally near $\omega^k$ by a quadratic form of the type

$q(\tilde\omega, \omega^k) = \tfrac{1}{2}\,\big(\tilde\omega - \omega^k + \Gamma(\omega^k)^{-1}\nabla\phi(\omega^k)\big)^t\,\Gamma(\omega^k)\,\big(\tilde\omega - \omega^k + \Gamma(\omega^k)^{-1}\nabla\phi(\omega^k)\big),$

where $\Gamma$ is a diagonal matrix with entries $\partial^2\phi(\omega)/\partial\omega_k^2$ and $\nabla\phi$ is the gradient vector. This maximization has a geometric solution given by the left-hand slopes of the concave majorant of the data cloud $\big(\sum_{i=1}^{l} d_i^r,\ \sum_{i=1}^{l} [d_i^r\omega_i^r + b_i^r]\big)$, where

(2.11)  $b_k^r = \frac{\partial}{\partial\omega_k}\phi(\omega)\Big|_{\omega=\omega^r}, \qquad d_k^r = -\frac{\partial^2}{\partial\omega_k^2}\phi(\omega)\Big|_{\omega=\omega^r}.$

This can also be characterized explicitly as

$\omega_i^{r+1} = \min_{2 \le j \le i}\ \max_{i \le k \le n}\ \frac{\sum_{h=j}^{k} [d_h^r\omega_h^r + b_h^r]}{\sum_{h=j}^{k} d_h^r}.$

Finally, to satisfy (2.10),

$\omega_1^{r+1} = -\log \sum_{j=2}^{n} \exp\Big(\sum_{i=2}^{j-1} \Delta x_i\,\omega_i^{r+1}\Big)\,\rho(\Delta x_j\,\omega_j^{r+1})\,\Delta x_j.$
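The explicit min-max characterization of the update can be sketched directly in Python. The version below is a deliberately naive cubic-time evaluation, written for clarity rather than speed; the function name `minmax_update` and the convention that the three input lists hold the entries for indices $2, \ldots, n$ only are assumptions of this example. It presumes every $d_h^r$ is positive.

```python
def minmax_update(omega, b, d):
    """Evaluate min over j <= i of max over k >= i of the weighted block average
    sum_{h=j}^{k} (d_h * omega_h + b_h) / sum_{h=j}^{k} d_h.
    All three lists hold the entries for indices 2..n; every d_h must be > 0."""
    m = len(omega)
    updated = []
    for i in range(m):
        best = float("inf")
        for j in range(i + 1):
            num = 0.0
            den = 0.0
            block_max = float("-inf")
            # grow the block j..k; only blocks with k >= i enter the maximum
            for k in range(j, m):
                num += d[k] * omega[k] + b[k]
                den += d[k]
                if k >= i:
                    block_max = max(block_max, num / den)
            best = min(best, block_max)
        updated.append(best)
    return updated
```

The returned sequence is non-increasing, so it satisfies (2.7); in practice a pool-adjacent-violators pass would produce the same values in linear time.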
To implement this, we need to compute the partial derivatives $(\partial/\partial\omega_k)\phi(\omega)$ for $k = 1, \ldots, n$. Clearly,

$\frac{\partial\phi(\omega)}{\partial\omega_1} = 0$

and

$\frac{\partial\phi(\omega)}{\partial\omega_k} = (n - k + 1)(x_k - x_{k-1}) - n\,\Delta x_k \sum_{j=k+1}^{n} e^{\theta_{j-1}}\,\rho(\Delta x_j\omega_j)\,\Delta x_j - n\,e^{\theta_{k-1}}\,\rho'(\Delta x_k\omega_k)\,\Delta x_k^2$

for $k = 2, \ldots, n$; and the second derivatives are

$\frac{\partial^2\phi(\omega)}{\partial\omega_k^2} = -n\left[\Delta x_k^2 \sum_{j=k+1}^{n} e^{\theta_{j-1}}\,\rho(\Delta x_j\omega_j)\,\Delta x_j + e^{\theta_{k-1}}\,\rho''(\Delta x_k\omega_k)\,\Delta x_k^3\right]$

for $k = 2, \ldots, n$. To achieve stability, we modify the algorithm using a line search method as follows. It is not certain that the new point $\tilde\omega$ will have a larger value of $\phi$. Therefore, we perform a binary search along the line segment joining $\omega^k$ and $\tilde\omega$ to get a point $\omega^{k+1}$ such that $\phi(\omega^{k+1}) > \phi(\omega^k)$. Finally, we stop the iteration when two consecutive iterates have very close $\phi$ values.

Example. Walker et al. [13] have reported the line-of-sight velocities of 178 stars in the Fornax dwarf spheroidal galaxy. The nature of the data is reported in Table 1. The full data set can be found in [13]. Figure 1 displays the estimated density of line-of-sight velocity, assuming that this density is a Polya frequency function$_2$. The sharp peak at the mode is, unfortunately, an artifact of the method.

Table 1
Angular distances from the center R and line-of-sight velocities V of stars in Fornax.

Star       Date            R (arcmin)   V (km/sec)
F1-1       29 Nov. 1992    14.5         55.8
F1-2       29 Nov. 1992    20.8         42.9
···        ···             ···          ···
F9-8025    15 Dec. 2002    42.3         69.2

3. Consistency

Let $F$ be a distribution function with density $f$ and suppose throughout that

(3.1)  $-\infty < \int_{\mathbb{R}} \log(f)\,dF < \infty.$
Let $X_1, X_2, \ldots \stackrel{\text{ind}}{\sim} F$; and let $F_n^{\#}$ be the empirical distribution function,

$F_n^{\#}(x) = \frac{\#\{k \le n : X_k \le x\}}{n}.$

Further, let $h$ denote the Hellinger distance between densities,

(3.2)  $h^2(g_1, g_2) = \int_{\mathbb{R}} (\sqrt{g_1} - \sqrt{g_2})^2\,dx = 2\left[1 - \int_{\mathbb{R}} \sqrt{g_1 g_2}\,dx\right].$

The purpose of this section is to prove: If $f$ is a PFF$_2$ for which (3.1) is satisfied and $\hat f_n$ is the maximum likelihood estimator, then $\lim_{n\to\infty} h^2(f, \hat f_n) = 0$ w.p.1.
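For readers who want to compute the squared Hellinger distance in (3.2) numerically, here is a minimal Python sketch using the midpoint rule. The helper names `hellinger_sq` and `normal_pdf`, the integration limits and the number of steps are all assumptions chosen for illustration; the second form of (3.2) is used, which presumes both densities integrate to one over the chosen interval.

```python
import math

def hellinger_sq(g1, g2, lo, hi, steps=20000):
    """Approximate h^2(g1, g2) = 2 * [1 - integral of sqrt(g1 * g2)] on [lo, hi]
    with the midpoint rule. Both densities should carry essentially all of
    their mass inside [lo, hi]."""
    width = (hi - lo) / steps
    affinity = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * width
        affinity += math.sqrt(g1(x) * g2(x)) * width
    return 2.0 * (1.0 - affinity)

def normal_pdf(mu, sigma):
    # density of the normal distribution with mean mu and standard deviation sigma
    def pdf(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return pdf
```

For two normal densities with common variance $\sigma^2$ the affinity has the closed form $\exp\{-(\mu_1 - \mu_2)^2/(8\sigma^2)\}$, which gives a convenient check on the quadrature.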
Fig 1. Estimated density of velocities for 178 stars.
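Lemma 1. If $f$ and $g$ are densities and $b > 0$, then

$\int_{\mathbb{R}} \log\Big(\frac{b + g}{b + f}\Big)\,dF \le \epsilon(b) - h^2(f, g),$

where

$\epsilon(b) = 2 \int_{\mathbb{R}} \sqrt{\frac{b}{b + f}}\,dF.$

Proof. In this case,

$\int_{\mathbb{R}} \log\Big(\frac{b + g}{b + f}\Big)\,dF \le 2\left[\int_{\mathbb{R}} \sqrt{\frac{b + g}{b + f}}\,dF - 1\right]$
$\quad \le 2\left[\int_{\mathbb{R}} \sqrt{\frac{b}{b + f}}\,dF + \int_{\mathbb{R}} \sqrt{\frac{g}{b + f}}\,dF - 1\right]$
$\quad \le \epsilon(b) + 2\left[\int_{\mathbb{R}} \sqrt{gf}\,dx - 1\right]$
$\quad = \epsilon(b) - h^2(f, g).$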
Lemma 2. Let $0 < b, c < \infty$. If $g$ is unimodal and $\sup_x g(x) \le c$, then

(3.3)  $\left|\int_{\mathbb{R}} \log(b + g)\,d(F_n^{\#} - F)\right| \le 2 \sup_x |F_n^{\#}(x) - F(x)|\,\log\Big(1 + \frac{c}{b}\Big).$

Proof. Integrating by parts, the left side of (3.3) is

$\left|\int_{\mathbb{R}} (F - F_n^{\#})\,d\log(b + g)\right|,$

which is at most the right side.

Now let $\mathcal{U}$ be a class of unimodal densities; let $\ell_n$ denote the log-likelihood function, so that

$\ell_n(g) = \sum_{i=1}^{n} \log[g(X_i)] = n \int_{\mathbb{R}} \log(g)\,dF_n^{\#}$

for $g \in \mathcal{U}$; and let $\hat f_n$ be the MLE in $\mathcal{U}$ (assumed to exist).

Theorem 3.1. If $f \in \mathcal{U}$ and $C_n = \sup_x \hat f_n(x)$ satisfies

(3.4)  $\log C_n = \sup_x \log \hat f_n(x) = o\Big[\frac{\sqrt{n}}{\log(n)}\Big]$ w.p.1,

then

$\lim_{n\to\infty} h(f, \hat f_n) = 0$ w.p.1.
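Proof. If $f \in \mathcal{U}$, then

$0 \le \ell_n(\hat f_n) - \ell_n(f) = n\left[\int_{\mathbb{R}} \log(\hat f_n)\,dF_n^{\#} - \int_{\mathbb{R}} \log(f)\,dF_n^{\#}\right].$

So, if $b > 0$, then

$0 \le \int_{\mathbb{R}} \log(b + \hat f_n)\,dF_n^{\#} - \int_{\mathbb{R}} \log(f)\,dF_n^{\#} = I_n + II_n + III_n,$

where

$I_n = \int_{\mathbb{R}} \log(b + \hat f_n)\,d(F_n^{\#} - F),$

$II_n = \int_{\mathbb{R}} \log\Big(\frac{b + \hat f_n}{b + f}\Big)\,dF$

and

$III_n = \int_{\mathbb{R}} \log(b + f)\,dF - \int_{\mathbb{R}} \log(f)\,dF_n^{\#}.$

With $C_n$ as in (3.4),

$|I_n| \le 2 \sup_x |F_n^{\#}(x) - F(x)|\,\log\Big(1 + \frac{C_n}{b}\Big) \to 0$ w.p.1

as $n \to \infty$, by Lemma 2 and the consistency of $F_n^{\#}$. Also,

$II_n \le \epsilon(b) - h^2(f, \hat f_n),$

by Lemma 1, and

$\lim_{n\to\infty} III_n = \int_{\mathbb{R}} [\log(b + f) - \log(f)]\,dF,$

by the Strong Law of Large Numbers. So, w.p.1,

$\limsup_{n\to\infty} h^2(f, \hat f_n) \le \int_{\mathbb{R}} [\log(b + f) - \log(f)]\,dF + \epsilon(b),$

which approaches zero as $b \to 0$.

Lemma 3. Let $g$ be a PFF$_2$ density. If $0 < g(a) \le g(b)$, then

$g(b) \le \frac{1}{|b - a|}\left[1 + \log\frac{g(b)}{g(a)}\right].$

Proof. There is no loss of generality in supposing that $a < b$. Let $h = \log(g)$. Then

$h(x) \ge h_a + \Big(\frac{x - a}{b - a}\Big)[h_b - h_a],$

where $h_a$ and $h_b$ were written for $h(a)$ and $h(b)$. So, writing $g_a = g(a)$ and $g_b = g(b)$,

$\int_a^b g(x)\,dx \ge \int_a^b \exp\Big\{h_a + \Big(\frac{x - a}{b - a}\Big)[h_b - h_a]\Big\}\,dx = e^{h_a}\Big(\frac{b - a}{h_b - h_a}\Big)\big(e^{h_b - h_a} - 1\big) = (b - a)\,\frac{g_b - g_a}{\log(g_b/g_a)}.$

Of course, $\int_a^b g(x)\,dx \le 1$, and $g(x) \ge g_a$ for $a \le x \le b$. So, $g_a \le 1/(b - a)$ and

$g_b \le g_a + \Big(\frac{1}{b - a}\Big)\log\Big(\frac{g_b}{g_a}\Big) \le \Big(\frac{1}{b - a}\Big)\left[1 + \log\frac{g(b)}{g(a)}\right],$

as asserted.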
Lemma 4. If $a, b, x > 0$ and $x \le a\log(x) + b$, then $x \le 2a\log(2a) + 2b$.

Proof. If $x \le a\log(x) + b$, then

$x \le a\log(2a + x) + b \le a\left[\log(2a) + \frac{x}{2a}\right] + b = \frac{1}{2}x + a\log(2a) + b,$

from which the lemma follows immediately.

Now let $\hat f_n$ be the MLE in the class of PFF$_2$ densities. Then $\hat f_n$ attains its maximum at an order statistic, say $x_m$. If $m \le n/2$, let $q = \lfloor 3n/4 \rfloor + 1$; and if $m > n/2$, let $q = \lceil n/4 \rceil$.

Theorem 3.2. If $f$ is a PFF$_2$ density, then

(3.5)  $C := \sup_{n \ge 1} \hat f_n(x_m) < \infty$ w.p.1.
Proof. Let $K_n = q$ or $n - q$, accordingly as $m > n/2$ or $\le n/2$. Then $K_n \ge n/4$, and

(3.6)  $\hat f_n(x_m) \le \frac{1}{|x_m - x_q|}\left[1 + \log\frac{\hat f_n(x_m)}{\hat f_n(x_q)}\right],$

by Lemma 3. If $f$ is a PFF$_2$, then $\ell_n(f) \le \ell_n(\hat f_n) \le K_n \log[\hat f_n(x_q)] + (n - K_n)\log[\hat f_n(x_m)]$. So,

(3.7)  $\log\frac{\hat f_n(x_m)}{\hat f_n(x_q)} \le \frac{1}{K_n}\Big\{n\log[\hat f_n(x_m)] - \ell_n(f)\Big\} \le 4\Big\{\log[\hat f_n(x_m)] - \frac{1}{n}\ell_n(f)\Big\}.$

Combining (3.6) and (3.7),

$\hat f_n(x_m) \le \frac{1}{|x_m - x_q|}\left[1 + 4\log[\hat f_n(x_m)] - \frac{4}{n}\ell_n(f)\right] = A_n \log[\hat f_n(x_m)] + B_n,$

where

$A_n = \frac{4}{|x_m - x_q|}$

and

$B_n = \frac{1}{|x_q - x_m|}\left[1 - \frac{4}{n}\ell_n(f)\right].$

So,

$\hat f_n(x_m) \le 2A_n\log(2A_n) + 2B_n,$

by Lemma 4. Here $\sup_n A_n < \infty$ and $\sup_n B_n < \infty$, by (3.1) and the choices of $m$ and $q$, establishing the theorem.

From the choice of $m = m_n$, (3.5) may be written $C = \sup_n \sup_x \hat f_n(x)$.

Corollary 1. If $\mathcal{U}$ is the class of PFF$_2$ densities and $f \in \mathcal{U}$, then $\lim_{n\to\infty} h^2(f, \hat f_n) = 0$ w.p.1.

4. Simulations

To assess the speed of convergence, we conducted a simulation study for different well-known members of the PFF$_2$ class. The densities we sampled from are: Gaussian, Double exponential, Gamma (with shape parameter 3 and scale parameter 1), Beta (with shape parameters 3 and 2) and Weibull (with parameters 3 and 1). Table 2 summarizes the approximate Hellinger distances between the estimate and the true underlying density for increasing sample sizes 50, 100, 200, 500, 1000. We also show graphically what the estimators look like in some of these examples. Figure 2 shows the corresponding plots for Normal, Double Exponential and Gamma for sample sizes 50, 100, 200 respectively.
Table 2
The Monte Carlo estimates of finite-sample Hellinger distances, for sample size n = 50, 100, 200, 500, 1000 and number of replications M = 500 for five different log-concave densities. Each entry is the estimate ± one standard deviation.

Sample size   Normal (0,1)      Double Exponential   Gamma (3,1)       Beta (3,2)        Weibull (3,1)
50            0.1658 ± 0.0531   0.1548 ± 0.0466      0.1518 ± 0.0447   0.1823 ± 0.0582   0.0854 ± 0.0263
100           0.0680 ± 0.0218   0.1146 ± 0.0426      0.0934 ± 0.0305   0.0988 ± 0.0363   0.0947 ± 0.0319
200           0.0624 ± 0.0167   0.0956 ± 0.0252      0.0532 ± 0.0164   0.1166 ± 0.0360   0.0312 ± 0.0067
500           0.0290 ± 0.0104   0.0626 ± 0.0203      0.0088 ± 0.0029   0.0766 ± 0.0214   0.0347 ± 0.0115
1000          0.0028 ± 0.0010   0.0139 ± 0.0031      0.0019 ± 0.0007   0.0170 ± 0.0031   0.0274 ± 0.0089
Fig 2. The estimated log-concave density for different simulation examples. The sample sizes are 50, 100 and 200 respectively for the first, second and third columns. The three rows correspond to simulations from a Normal(0,1), a double-exponential and a Gamma(3,1) density. The bold curve corresponds to the true density and the dotted curve is the estimator.
Acknowledgments. Thanks to Guenther Walther for helpful discussions. We benefited from reading Van de Geer [12] in constructing the proof of consistency.

References
[1] Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972). Statistical Inference under Order Restrictions. Wiley, London.
[2] Bickel, P. and Fan, J. (1996). Some problems in the estimation of unimodal densities. Statist. Sinica 6 23–45.
[3] Eggermont, P. P. B. and LaRiccia, V. N. (2000). Maximum likelihood estimation of smooth unimodal densities. Ann. Statist. 28 922–947.
[4] Hall, P. and Huang, L. S. (2002). Unimodal density estimation using kernel methods. Statist. Sinica 12 965–990.
[5] Meyer, M. (2001). An alternative unimodal density estimator with consistent estimation of the mode. Statist. Sinica 11 1159–1174.
[6] Jongbloed, G. (1998). The iterative convex minorant algorithm for nonparametric estimation. J. Comput. Graph. Statist. 7 310–321.
[7] Prakasa Rao, B. L. S. P. (1969). Estimation of a unimodal density. Sankhyā Ser. A 31 23–36.
[8] Robertson, T. (1967). On estimating a density which is measurable with respect to a σ-lattice. Ann. Math. Statist. 38 482–493.
[9] Robertson, T., Wright, F. and Dykstra, R. (1989). Order Restricted Inference. Wiley, New York.
[10] Rockafellar, R. T. (1971). Convex Analysis. Princeton University Press.
[11] Rufibach, K. and Duembgen, L. (2004). Maximum likelihood estimation of a log-concave density: basic properties and uniform consistency. Preprint, University of Bern.
[12] Van de Geer, S. (1993). Hellinger-consistency of certain nonparametric maximum likelihood estimators. Ann. Statist. 21 14–44.
[13] Walker, M., Mateo, M., Olszewski, E., Bernstein, R., Wang, X. and Woodroofe, M. B. (2006). Internal Kinematics of the Fornax Dwarf Spheroidal Galaxy. Astrophysical J. 131 2114–2139.
[14] Walther, G. (2002). Detecting the presence of mixing with multiscale maximum likelihood. J. Amer. Statist. Assoc. 97 508–513.
[15] Wegman, E. (1970). Maximum likelihood estimation of a unimodal density. Ann. Math. Statist. 41 459–471.
[16] Wegman, E. (1970). Maximum likelihood estimation of a unimodal density II. Ann. Math. Statist. 41 2169–2174.
IMS Lecture Notes–Monograph Series Complex Datasets and Inverse Problems: Tomography, Networks and Beyond Vol. 54 (2007) 250–259 © Institute of Mathematical Statistics, 2007 DOI: 10.1214/074921707000000193
A comparison of the accuracy of saddlepoint conditional cumulative distribution function approximations

Juan Zhang$^{1,*}$ and John E. Kolassa$^{1,*}$

Rutgers University

Abstract: Consider a model parameterized by a scalar parameter of interest and a nuisance parameter vector. Inference about the parameter of interest may be based on the signed root of the likelihood ratio statistic $R$. The standard normal approximation to the conditional distribution of $R$ typically has error of order $O(n^{-1/2})$, where $n$ is the sample size. There are several modifications for $R$, which reduce the order of error in the approximations. In this paper, we mainly investigate Barndorff-Nielsen's modified directed likelihood ratio statistic, Severini's empirical adjustment, and DiCiccio and Martin's two modifications, involving the Bayesian approach and the conditional likelihood ratio statistic. For each modification, two formats were employed to approximate the conditional cumulative distribution function; these are the Barndorff-Nielsen format and the Lugannani and Rice format. All approximations were applied to inference on the ratio of means for two independent exponential random variables. We constructed one- and two-sided hypothesis tests and used the actual sizes of the tests as the measure of accuracy to compare those approximations.
∗ The authors were supported by Grant DMS 0505499 from the National Science Foundation. The authors are grateful for the assistance of one of the volume editors, Bill Strawderman, and an anonymous referee.
1 Department of Statistics and Biostatistics, Hill Center, Bush Campus, Rutgers, The State University of New Jersey, 110 Frelinghuysen Rd., Piscataway, NJ 08854 USA, e-mail: [email protected]; [email protected]
AMS 2000 subject classifications: 62E60, 41A58.
Keywords and phrases: modified signed likelihood ratio statistic, saddlepoint approximation, conditional cumulative distribution.

1. Introduction

When analyzing data arising from a model with a single unknown parameter, statisticians frequently build tests of a simple null hypothesis around the likelihood ratio statistic, since the signed square root of the likelihood ratio statistic, $R$, often has a distribution that is well-approximated by a standard normal distribution under the null hypothesis. In the presence of nuisance parameters, the statistic $R$ depends on the nuisance parameters. Practitioners often replace the nuisance parameters in the likelihood function by their maximum likelihood estimates and examine the resulting profile likelihood as a function of the parameter of interest. Denote by $n$ the sample size. The standard normal approximation to the conditional distribution of $R$ typically has error of order $O(n^{-1/2})$, and $R$ can be used to construct approximate confidence limits for the parameter of interest having coverage error of that order. In large sample settings, this approximation works well. However, in small sample situations, with 10 or 15 observations, the standard normal approximation may not be adequate. Hence, various authors developed modifications for $R$ using saddlepoint approximation techniques. These modifications reduce the order of error in the standard normal approximation to the conditional distribution of $R$.

Barndorff-Nielsen [2] first proposed the modified directed signed root of the likelihood ratio statistic $R^*$. This statistic will be reviewed in the next section. The relative error in the standard normal approximation to the conditional distribution of $R^*$ is of order $O(n^{-3/2})$. Barndorff-Nielsen [3–5] also considered using a variation on this approximation, of the same form as the univariate expansion of Lugannani and Rice [10]. The drawback of these approximations is that their calculation requires an exact or approximate ancillary, and in some situations it is hard or impossible to construct this ancillary. For the other approximations that we will study in the following, no such ancillary needs to be specified, and hence the approximations are easier to apply in practice.

Severini [12] proposed an approximation $\hat R^*$ to Barndorff-Nielsen's $R^*$ based on empirical covariances. Under some assumptions and model regularity properties, $\hat R^*$ is distributed according to a standard normal distribution, with error $O(n^{-1})$, conditionally on the observed value of an ancillary statistic $A$. However, the construction of this $\hat R^*$ does not require the specification of $A$.

DiCiccio and Martin [8] proposed an alternative quantity to $R^*$, denoted by $R^+$, that is available without specification of $A$. The derivation of $R^+$ involves the Bayesian approach to constructing confidence limits considered by Welch and Peers [15] and Peers [11]. In the presence of nuisance parameters, Peers [11] chose a prior density for the parameters to satisfy a partial differential equation. With this prior, the standard normal approximation to the conditional distribution of $R^+$ has error of order $O(n^{-1})$.
If the parameter of interest and the nuisance parameter vector are orthogonal, solving the partial differential equation is relatively easy. In some cases where the parameters are not orthogonal, solving that equation numerically is problematic. Parameter orthogonality will be reviewed in the following section. For a parameter of interest that is orthogonal to the nuisance parameter vector, Cox and Reid [6] defined the signed root of the conditional likelihood ratio statistic $\bar R$. The standard normal approximation to the distribution of $\bar R$ has error of order $O(n^{-1/2})$. DiCiccio and Martin [8] defined $\bar R^+$ analogously to the $R^+$ mentioned above. The standard normal approximation to the conditional distribution of $\bar R^+$ has error of order $O(n^{-1})$. The use of $\bar R$ and its modifications is often effective in situations where there are many nuisance parameters. In such cases, the use of $R$ and its modified versions can produce unsatisfactory results; see DiCiccio, Field and Fraser [7] for examples.

The above variants on $R$ have never been systematically compared to each other as a group. This paper provides an accuracy comparison among the modifications stated above. Each of these approximations is used to generate an approximate one-sided p-value by approximating $P[R \ge r]$, for $r$ the observed value of $R$. Approximate two-sided p-values are calculated by approximating $2\min(P[R \ge r], P[R < r])$. One- and two-sided hypothesis tests of size $\alpha$ may be constructed by rejecting the null hypothesis when the p-value is less than $\alpha$. Both the Barndorff-Nielsen format approximation

(1)  $\Phi\{R + R^{-1}\log(U/R)\}$

and the Lugannani and Rice format approximation

(2)  $\Phi(R) + \phi(R)(R^{-1} - U^{-1})$
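The two formats (1) and (2) are simple to evaluate once values of $R$ and $U$ are available. The Python sketch below is only an illustration: the function names `bn_format` and `lr_format` are introduced here, the inputs are assumed to satisfy $R \neq 0$ and $U/R > 0$, and how $U$ is obtained depends on the modification being used.

```python
import math

def std_normal_cdf(z):
    # Phi(z), written with the error function from the standard library
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def std_normal_pdf(z):
    # phi(z), the standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def bn_format(r, u):
    """Format (1): Phi{R + R^{-1} log(U/R)}. Requires r != 0 and u/r > 0."""
    return std_normal_cdf(r + math.log(u / r) / r)

def lr_format(r, u):
    """Format (2): Phi(R) + phi(R) (1/R - 1/U). Requires r != 0 and u != 0."""
    return std_normal_cdf(r) + std_normal_pdf(r) * (1.0 / r - 1.0 / u)
```

When $U = R$ both corrections vanish and each format reduces to $\Phi(R)$; for $U$ close to $R$ the two formats give similar values.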
We calculated via simulation the size of tests constructed as above, and then compared the results among different approximations.

2. Methodology

We first review several statistics whose marginal distributions are very close to standard normal. Consider continuous variables $X_1, \ldots, X_n$ having a joint density function that depends on an unknown parameter $\omega = (\omega^1, \ldots, \omega^d)$. Suppose that $\omega = (\psi, \chi)$, where $\psi = \omega^1$ is a scalar parameter of interest and $\chi = (\omega^2, \ldots, \omega^d)$ is a nuisance parameter vector. Let $\hat\omega = (\hat\psi, \hat\chi)$ be the maximum likelihood estimator of $\omega$, and for fixed $\psi$, let $\hat\chi_\psi$ be the constrained maximum likelihood estimator of $\chi$. The signed root of the likelihood ratio statistic is $R = \mathrm{sgn}(\hat\psi - \psi_0)\{2(l(\hat\omega) - l(\psi_0, \hat\chi_0))\}^{1/2}$, where $\hat\chi_0$ will be shorthand for $\hat\chi_{\psi_0}$ and $l(\omega)$ is the log-likelihood function for $\omega$. The standard normal approximation to the distribution of $R$ typically has error of order $O(n^{-1/2})$, and $R$ can be used to construct approximate confidence limits for $\psi$ having coverage error of that order.

The earliest general conditional saddlepoint tail probability approximation was provided by Skovgaard [13], who applied double saddlepoint techniques to the problem of approximating tail probabilities for conditional distributions when the data arise from a full exponential family. In this case the double saddlepoint distribution function approximation can be expressed in terms of the quantities in the joint density function. Skovgaard's double saddlepoint approximation to the conditional distribution function is of form (2), with $U$ a Wald statistic. In this paper, we consider only models more complicated than canonical exponential families, and so do not apply this approximation.

2.1. Barndorff-Nielsen's modification

The modified signed root of the likelihood ratio statistic $R^*$ was first proposed by Barndorff-Nielsen [2] and is given by $R^* = R + R^{-1}\log(U/R)$, where

(3)  $U = \frac{\big|\, l_{\chi;\hat\omega}(\hat\omega_\psi) \quad l_{;\hat\omega}(\hat\omega) - l_{;\hat\omega}(\hat\omega_\psi) \,\big|}{|j_{\chi\chi}(\hat\omega_\psi)|^{1/2}\,|j(\hat\omega)|^{1/2}},$

and $j_{\chi\chi}(\hat\omega_\psi) = -l_{\chi\chi}(\psi_0, \hat\chi_0)$ and $j(\hat\omega) = -l_{\omega\omega}(\hat\omega)$, with $l_{\omega\omega}(\omega)$ the matrix of second-order partial derivatives of $l(\omega; \hat\omega, A)$ taken with respect to $\omega$ and $l_{\chi\chi}(\omega)$ the submatrix of $l_{\omega\omega}(\omega)$ corresponding to $\chi$. Here $U$ represents an approximate conditional score statistic, which, in the multivariate normal case, would exactly coincide with $R$. Outside the multivariate normal case, the difference between $R$ and $U$ is a measure of departure from normality. The quantity $l_{;\hat\omega}(\omega)$ is the $d \times 1$ vector of partial derivatives of $l(\omega; \hat\omega, A)$ taken with respect to $\hat\omega$, and $l_{\chi;\hat\omega}(\omega)$ is a $d \times (d - 1)$ matrix of mixed second-order partial derivatives of $l(\psi, \chi; \hat\omega, A)$ taken with respect to $\chi$ and $\hat\omega$. The sign of $U$ is the same as that of $R$ and the resulting $U$ is of the form $U = R + O_p(n^{-1/2})$. The relative error in the standard normal approximation to the conditional distribution of $R^*$ is of order $O(n^{-3/2})$. The conditioning is on an exact or approximate ancillary statistic $A$. The variable $U$ is parameterization invariant and does not depend on $\chi$.
The value of $\psi_0$ satisfying $\Phi(R^*) = \alpha$ is an approximate upper $1 - \alpha$ confidence limit which has relative coverage error of order $O(n^{-3/2})$ both conditionally and unconditionally. Barndorff-Nielsen [3–5] also considered using the alternative to $\Phi(R^*)$ provided by the Lugannani and Rice format approximation (2).

Consider the exponential family model for a random vector $T$ whose density evaluated at $t$ is $f_T(t; \theta) = \exp(\theta^\top t - H_T(\theta) - G(t))$. The random vector $T$ is the sufficient statistic and set $\tau(\theta) = E_\theta[T]$. In the presence of nuisance parameters, the calculation of $U$ requires the specification of the ancillary $A$. Barndorff-Nielsen [1] suggested an approximate ancillary statistic for use in conditional inference. Kolassa [9], in Chapter 8.4, presented this approximate ancillary $A$ as

$B(\hat\psi)(T - \tau(\hat\psi, \chi)),$

with $\chi$ held fixed, and

$B(\psi) = [(\partial\tau/\partial\psi)^{\perp}\,\Sigma\,(\partial\tau/\partial\psi)^{\perp\top}]^{-1/2}\,(\partial\tau/\partial\psi)^{\perp}.$

Suppose that $\theta$ is scalar. Let $\tilde l(\theta; \hat\theta, a) = l(\theta; \hat\theta, a)/n$. Then

(4)  $F_{\hat\Theta|A}(\hat\theta \,|\, a; \theta) = \big[\Phi(\sqrt{n}\,\hat\omega) + \phi(\sqrt{n}\,\hat\omega)\,[1/\hat\omega - 1/\check z]/\sqrt{n}\,\big]\,[1 + O_p(n^{-1})],$

with $\check z = [\tilde l^{;1}(\hat\theta; \hat\theta, a) - \tilde l^{;1}(\theta; \hat\theta, a)]/\sqrt{j(\hat\theta)}$, after reexpressing $t$ in terms of $\hat\theta$ and $a$, and the superscripts $;1$ on $\tilde l^{;1}$ represent differentiation of the likelihood with respect to $\hat\theta$. Here $a$ is the observed value of $A$; $F_{\hat\Theta|A}(\hat\theta \,|\, a; \theta)$ is the conditional cumulative distribution function and $\Phi(\cdot)$ is the standard normal cumulative distribution function.

In the computation of Barndorff-Nielsen's $R^*$, the calculation of $U$ requires the ancillary $A$ to be specified, which may present difficulties in practice. In the following, we will introduce several modifications that do not require the specification of $A$.
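2.2. An empirical adjustment

Severini [12] proposed an approximation $\hat R^*$ to Barndorff-Nielsen's $R^*$ based on empirical covariances. Recalling the formula (3) for $U$, the key step is to approximate $l_{\chi;\hat\omega}(\hat\omega_\psi)$ and $l_{;\hat\omega}(\hat\omega) - l_{;\hat\omega}(\hat\omega_\psi)$. Let $l^{(j)}(\omega)$ denote the log-likelihood function based on observation $j$ alone. Denote

$\hat Q(\omega; \omega_0) = \sum_j l^{(j)}(\omega)\,l^{(j)}_{\omega}(\omega_0)^T, \qquad \hat I(\omega; \omega_0) = \sum_j l^{(j)}_{\omega}(\omega)\,l^{(j)}_{\omega}(\omega_0)^T,$

and $\hat i = \hat I(\hat\omega; \hat\omega)$. The quantity $\omega_0$ is any point in the parameter space. Then $l_{;\hat\omega}(\hat\omega) - l_{;\hat\omega}(\omega)$ and $l_{\omega;\hat\omega}(\omega)$ may be approximated by

$\hat l_{;\hat\omega}(\hat\omega) - \hat l_{;\hat\omega}(\omega) = \{\hat Q(\hat\omega; \hat\omega) - \hat Q(\omega; \hat\omega)\}\,\hat i(\hat\omega)^{-1}\,\hat j$

and

$\hat l_{\omega;\hat\omega}(\omega) = \hat I(\omega; \hat\omega)\,\hat i(\hat\omega)^{-1}\,\hat j,$

where $\hat j = j(\hat\omega) = -l_{\omega\omega}(\hat\omega)$.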
Denote by $\hat U$ the approximation to the statistic $U$ based on the above quantities, and then denote
$$\hat R^* = R + R^{-1}\log(\hat U/R).$$
The quantity $\hat R^*$ can be used in approximation (1). This represents a correction similar to that of (3), with expectations of quantities replaced by sample means. Under some assumptions plus model regularity properties, $\hat R^*$ is distributed according to a standard normal distribution, with error $O(n^{-1})$, conditionally on $a$, the observed value of the ancillary $A$. However, the construction of $\hat R^*$ does not require the specification of $A$. Again, the alternative approximation (2) is also available as $\Phi(R) + \phi(R)(R^{-1} - \hat U^{-1})$.

2.3. DiCiccio and Martin’s modification

DiCiccio and Martin [8] proposed an alternative variable to $U$, denoted by $T$, which is available without specification of the ancillary $A$. The modification for approximation (1) is
$$R^+ = R + R^{-1}\log(T/R), \tag{5}$$
where $T$ is defined in (7). As with (3), the final term in $R^+$ represents the departure from normality; unlike (3), this measure represents the departure of the posterior for $\psi$ from normality, and involves the prior distribution. Once again, one might use the alternative probability approximation (2) with $T$ in place of $U$. Replacing $U$ by $T$ avoids the need to specify $A$ and hence simplifies the calculations.

The derivation of $T$ involves the Bayesian approach to constructing confidence limits considered by Welch and Peers [15] and Peers [11]. When $\omega = \psi$, that is, when the entire parameter is scalar and there are no nuisance parameters, Welch and Peers [15] showed that the appropriate choice is $\pi(\omega) \propto \{i(\omega)\}^{1/2}$, where $i(\omega) = E\{-d^2 l(\omega)/d\omega^2\}$. In the presence of nuisance parameters, Peers [11] showed that $\pi(\omega)$ must be chosen to satisfy the partial differential equation
$$\sum_{j=1}^{d} i^{1j}(i^{11})^{-1/2}\,\frac{\partial}{\partial\omega^j}(\log\pi) + \sum_{j=1}^{d} \frac{\partial}{\partial\omega^j}\{i^{1j}(i^{11})^{-1/2}\} = 0, \tag{6}$$
where $i_{jk}(\omega) = E\{-\partial^2 l(\omega)/\partial\omega^j\partial\omega^k\}$ and $(i^{jk})$ is the $d \times d$ matrix inverse of $(i_{jk})$. The variable $T$ is defined as
$$T = l_\psi(\psi_0, \hat\chi_0)\, \frac{|-l_{\chi\chi}(\psi_0, \hat\chi_0)|^{1/2}\, \pi(\hat\omega)}{|-l_{\omega\omega}(\hat\omega)|^{1/2}\, \pi(\psi_0, \hat\chi_0)}. \tag{7}$$
Here $l_\psi(\omega) = \partial l(\omega)/\partial\psi$, and $\pi(\omega)$ is a proper prior density for $\omega = (\psi, \chi)$ which satisfies equation (6). Then the resulting approximation (2) is
$$P(\psi \ge \psi_0 \mid X) = \Phi(R) + (R^{-1} - T^{-1})\phi(R) + O(n^{-3/2}),$$
where $T = U + O_p(n^{-1})$, and thus the approximation (1) to the conditional distribution based on $R + R^{-1}\log(T/R)$ has error of order $O(n^{-1})$. To error of order $O_p(n^{-1})$, $T$ is parameterization invariant under transformations $\omega \to \{\psi, \tau(\omega)\}$.

Parameter orthogonality may make a big difference in solving the partial differential equation (6). Orthogonality is defined with respect to the expected Fisher
information matrix. We define $\theta_1$ to be orthogonal to $\theta_2$ if the elements of the information matrix satisfy
$$i_{\theta_s\theta_t} = \frac{1}{n} E\left[\frac{\partial l}{\partial\theta_s}\frac{\partial l}{\partial\theta_t}; \theta\right] = \frac{1}{n} E\left[-\frac{\partial^2 l}{\partial\theta_s\,\partial\theta_t}; \theta\right] = 0 \tag{8}$$
for $s = 1, \ldots, p_1$, $t = p_1 + 1, \ldots, p_1 + p_2$, where $\theta = (\theta_1, \theta_2)$; $\theta_1$ and $\theta_2$ are of length $p_1$ and $p_2$ respectively. If equation (8) holds for all $\theta$ in the parameter space, then the parameterization is sometimes called globally orthogonal. If (8) holds at only one parameter value $\theta_0$, then the vectors $\theta_1$ and $\theta_2$ are said to be locally orthogonal at $\theta_0$. The most direct statistical interpretation of (8) is that the relevant components of the score statistic are uncorrelated. The definition of orthogonality can be extended to more than two sets of parameters, and in particular $\theta$ is totally orthogonal if the information matrix is diagonal. In general, it is not possible to have total parameter orthogonality at all parameter values, but it is possible to obtain orthogonality of a scalar parameter of interest $\psi$ to a set of nuisance parameters.

If the parameter of interest and the nuisance parameter vector are orthogonal, solving the partial differential equation (6) is considerably easier. Equation (6) reduces to
$$(i_{\psi\psi})^{-1/2}\,\frac{\partial}{\partial\psi}(\log\pi) + \frac{\partial}{\partial\psi}(i_{\psi\psi})^{-1/2} = 0, \tag{9}$$
whose solutions are of the form $\pi(\psi, \chi) \propto \{i_{\psi\psi}(\psi, \chi)\}^{1/2} g(\chi)$ (Tibshirani [14]), where $g(\chi)$ is arbitrary and the suggestive notation $i_{\psi\psi}(\psi, \chi)$ is used in place of $i_{11}(\psi, \chi)$. In some cases in which the parameters are not orthogonal, solving equation (6) numerically is problematic.

2.4. Conditional likelihood ratio statistic and its modification

For $\psi$ and $\chi$ orthogonal, Cox and Reid [6] defined the conditional likelihood ratio statistic for testing $\psi = \psi_0$ as $\bar W = 2\{\bar l(\bar\psi) - \bar l(\psi_0)\}$, where
$$\bar l(\psi) = l(\psi, \hat\chi_\psi) - \tfrac{1}{2}\log|-l_{\chi\chi}(\psi, \hat\chi_\psi)|$$
and $\bar\psi$ is the point at which the function $\bar l(\psi)$ is maximized. The signed root of the conditional likelihood ratio statistic is $\bar R = \mathrm{sgn}(\bar\psi - \psi_0)\,\bar W^{1/2}$, and the standard normal approximation to the distribution of $\bar R$ has error of order $O(n^{-1/2})$. Let $\bar R^+ = \bar R + \bar R^{-1}\log(\bar T/\bar R)$. One may use approximations (1) and (2), that is, $\Phi(\bar R^+)$ or $\Phi(\bar R) + \phi(\bar R)(\bar R^{-1} - \bar T^{-1})$, where
$$\bar T = \bar l^{(1)}(\psi_0)\,\{-\bar l^{(2)}(\bar\psi)\}^{-1/2}\, \frac{\pi(\bar\psi, \bar\lambda_{\bar\psi})}{\pi(\psi_0, \lambda_0)},$$
and $\bar l^{(j)} = d^j\,\bar l(\psi)/d\psi^j$, $j = 1, 2$. These approximations have errors of order $O(n^{-1})$. The use of $\bar R$ and its modifications is often effective in situations where there are many nuisance parameters. In such cases, the use of $R$ and its modified versions can produce unsatisfactory results; see DiCiccio, Field and Fraser [7] for examples.
3. Example: Exponential samples with orthogonal interest and nuisance parameters

Let $X$ and $Y$ be exponential random variables with means $\mu$ and $\nu$ respectively; the ratio of the means $\nu/\mu$ is the parameter of interest. The parameter transformation $\mu \to \lambda/\sqrt{\psi}$, $\nu \to \lambda\sqrt{\psi}$ makes the two new parameters $\psi$ and $\lambda$ orthogonal. Then $X$ and $Y$ have expectations $\lambda\psi^{-1/2}$ and $\lambda\psi^{1/2}$, respectively. Suppose we have $n$ independent replications of $(X, Y)$. Denote $\omega = (\psi, \lambda)$. We can obtain the log-likelihood function as
$$l(\omega) = -n\left[\frac{\psi\bar x + \bar y}{\lambda\sqrt{\psi}} + 2\log\lambda\right].$$
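As a check on the displayed log-likelihood, the following short derivation shows where each term comes from. It is a sketch that uses only the exponential densities and the reparameterization stated above; the observations are written $x_i, y_i$ with sample means $\bar x, \bar y$.

```latex
\begin{align*}
l(\omega)
  &= \sum_{i=1}^{n}\Bigl[-\log\mu - \frac{x_i}{\mu} - \log\nu - \frac{y_i}{\nu}\Bigr],
     \qquad \mu = \lambda\psi^{-1/2},\quad \nu = \lambda\psi^{1/2}, \\
  &= -n\Bigl[\log(\mu\nu) + \frac{\sqrt{\psi}\,\bar x}{\lambda}
       + \frac{\bar y}{\lambda\sqrt{\psi}}\Bigr] \\
  &= -n\Bigl[\frac{\psi\bar x + \bar y}{\lambda\sqrt{\psi}} + 2\log\lambda\Bigr],
     \qquad\text{since } \mu\nu = \lambda^{2}.
\end{align*}
```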
Each of the approximations in Section 2 may be used to generate an approximate one-sided p-value by approximating $P[R \ge r]$, for $r$ the observed value of $R$. Approximate two-sided p-values may be calculated by approximating $2\min(P[R \ge r], P[R < r])$. One- and two-sided hypothesis tests of size $\alpha$ may be constructed by rejecting the null hypothesis when the p-value is less than $\alpha$. Both the Barndorff-Nielsen format approximation (1) and the Lugannani and Rice format approximation (2) were considered. We calculated via simulation the size of tests constructed as above, and compared the results among different approximations. Some of the approximations in Section 2 require specific algebraic calculations, which we present below. The other approximations are generic, and no specific algebraic calculations are needed.

3.1. Some algebraic calculations

Barndorff-Nielsen’s modification

The expectations of the sufficient statistics $T = (\bar X, \bar Y)$ in the new parameterization are $\tau(\psi, \lambda) = (\lambda/\sqrt{\psi},\, \lambda\sqrt{\psi})$, and
$$d\tau(\psi, \lambda)/d\psi = \{-\lambda/(2\psi^{3/2}),\; \lambda/(2\psi^{1/2})\}.$$
A vector perpendicular to this is $(d\tau(\psi, \lambda)/d\psi)^{\perp} = \{\psi, 1\}$. The variance of the sample mean vector is
$$\Sigma(\psi, \lambda) = \frac{1}{n}\begin{pmatrix} \lambda^2/\psi & 0 \\ 0 & \lambda^2\psi \end{pmatrix}.$$
In our case, $B(\psi) = \sqrt{n}\,\{\sqrt{\psi}/(\sqrt{2}\,\lambda),\; 1/(\lambda\sqrt{2\psi})\}$, and
$$A = \sqrt{2n}\left(\sqrt{\bar X\bar Y}/\lambda - 1\right).$$
Using Barndorff-Nielsen’s formula [4],
$$\tilde l(\psi; \hat\psi, a) = -\frac{|a + \sqrt{2n}|\,(\psi + \hat\psi)}{\sqrt{2n\psi\hat\psi}} - 2\log\lambda,$$
and
√ 1 1 ω ˆ = sign(ψˆ − ψ)ψ 1/4 (a + 2n)(ψ 2 − ψˆ 2 ) /(ψˆ1/4 n1/2 ).
The negative of the second derivative of the log likelihood is
$$j(\psi) = \frac{(a + \sqrt{2n})\,(3\psi - \hat\psi)}{4\sqrt{2n\psi\hat\psi^5}},$$
and the derivative of $\tilde l(\psi; \hat\psi, a)$ with respect to $\hat\psi$ is
$$\frac{(a + \sqrt{2n})\,(\psi - \hat\psi)}{2\sqrt{2n\psi\hat\psi^3}}.$$
Then the quantity $\check z$ contributing to the tail probability approximation (4) is $\check z = -|a + \sqrt{2n}|\,(\psi - \hat\psi)/(2\sqrt{2n\psi\hat\psi})$.

DiCiccio and Martin’s modification

Based on the above, the information matrix is
$$i(\omega) = E[-l''(\omega)] = n\begin{pmatrix} 1/(2\psi^2) & 0 \\ 0 & 2/\lambda^2 \end{pmatrix}.$$
The maximum likelihood estimators are $\hat\psi = \bar Y/\bar X$ and $\hat\lambda = (\hat\psi\bar X + \bar Y)/(2\hat\psi^{1/2}) = \sqrt{\bar X\bar Y}$. For fixed $\psi$, let $\hat\lambda_\psi$ be the constrained maximum likelihood estimator of $\lambda$. Here, $\hat\lambda_\psi = (\psi\bar X + \bar Y)/(2\psi^{1/2})$. If $\psi = \psi_0 = 1$, then $\hat\lambda_{\psi_0} = \hat\lambda_0 = (\bar X + \bar Y)/2$.

In this case, the parameters are orthogonal. Using the simplified partial differential equation (9), we chose $g(\lambda) = 1$, and hence $\pi(\psi, \lambda) = \sqrt{n}/(\sqrt{2}\,\psi)$. In addition to the prior obtained from equation (9), we also studied the outcome from a uniform prior, that is, a prior with constant density, which is obviously not a solution to equation (9).

3.2. Simulation results

Simulation procedure

For sample size $n = 10$:
(1) Generate 10 draws of the pair $\{X, Y\}$, where $X$ and $Y$ both follow the standard exponential distribution;
(2) Calculate one- and two-sided p-values for each approximation;
(3) Compare the p-values to the $\alpha$ level, say 0.05; denote by $q$ the number of rejections (coverage misses); if the p-value is less than 0.05, then $q = q + 1$;
(4) Repeat steps (1)–(3) 10,000 times and report the final value of $q$; let $q^* = (q/10000) \times 100$, the Type I error probability in percent;
(5) Repeat steps (1)–(4) 100 times and report the average of $q^*$ as the measure of accuracy for the modifications.

Approximations (1) and (2) have a removable singularity at $R = 0$. Consequently, these and similar formulae require care when evaluated near $R = 0$. Specifically, we found (1) and (2) to exhibit adequate numerical stability as long as $|R| > 10^{-4}$. Out of 1,000,000 simulated data sets, 60 presented $R$ (or a modification of $R$) closer to zero.
In these cases, for all but the most extreme conditioning events, the resulting conditional p-value is large enough as to not imply rejection of the null hypothesis, and so these simulated data sets were treated as not implying rejection of the null hypothesis.
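The simulation procedure above can be sketched for the simplest approximation, $\Phi(R)$, as follows. This is a minimal illustration and not the authors' code: the function names `signed_root` and `type1_error_percent` are hypothetical, the null value $\psi_0 = 1$ is taken from the text, and $R$ is computed from the profile log-likelihood $l(\psi, \hat\lambda_\psi) = -n(2 + 2\log\hat\lambda_\psi)$, which follows from the log-likelihood of Section 3 and the constrained estimator given above. Only the one-sided test based on $P[R \ge r] \approx 1 - \Phi(r)$ is shown, and a single round of replications is run rather than 100.

```python
import math
import random
from statistics import NormalDist


def signed_root(xbar, ybar, n, psi0=1.0):
    """Signed root R of the likelihood ratio statistic for H0: psi = psi0."""
    lam_hat = math.sqrt(xbar * ybar)                         # unconstrained MLE of lambda
    lam_0 = (psi0 * xbar + ybar) / (2.0 * math.sqrt(psi0))   # constrained MLE of lambda
    # W = 2{l(omega_hat) - l(psi0, lam_0)} = 4 n log(lam_0 / lam_hat) >= 0
    w = 4.0 * n * (math.log(lam_0) - math.log(lam_hat))
    psi_hat = ybar / xbar
    return math.copysign(math.sqrt(max(w, 0.0)), psi_hat - psi0)


def type1_error_percent(n=10, reps=10_000, alpha=0.05, seed=0):
    """Percentage of one-sided rejections under H0 when Phi(R) is used."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(reps):
        # Step (1): n draws of (X, Y), both standard exponential, so psi = 1.
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        ybar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        # Step (2): one-sided p-value P[R >= r], approximated by 1 - Phi(r).
        p_value = 1.0 - std_normal.cdf(signed_root(xbar, ybar, n))
        # Step (3): count a rejection when the p-value falls below alpha.
        if p_value < alpha:
            rejections += 1
    # Step (4): express the rejection count as a percentage.
    return 100.0 * rejections / reps


if __name__ == "__main__":
    print(type1_error_percent())
```

The adjusted statistics in Tables 1 and 2 would replace the p-value line with the corresponding formula; near $R = 0$ those formulas need the guard described above, which the unadjusted $\Phi(R)$ does not.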
Table 1. Type I error probability, in percent (BN)

Approximation                                      One-sided average   Two-sided average
$\Phi(R)$                                           5.241               5.168
$\Phi(\bar R)$                                      4.807               4.575
$\Phi(R + R^{-1}\log(U/R))$                         5.046               4.760
$\Phi(R + R^{-1}\log(\hat U/R))$                    5.018               4.882
$\Phi(R + R^{-1}\log(T/R))$                         4.615               4.312
$\Phi(R + R^{-1}\log(T_u/R))$                      11.017               6.828
$\Phi(\bar R + \bar R^{-1}\log(\bar T/\bar R))$     4.883               4.411
$\Phi(\bar R + \bar R^{-1}\log(\bar T_u/\bar R))$  11.723               7.249

Table 2. Type I error probability, in percent (LR)

Approximation                                            One-sided average   Two-sided average
$\Phi(R)$                                                 5.241               5.168
$\Phi(\bar R)$                                            4.807               4.575
$\Phi(R) + \phi(R)(R^{-1} - U^{-1})$                      5.046               4.760
$\Phi(R) + \phi(R)(R^{-1} - \hat U^{-1})$                 5.017               4.881
$\Phi(R) + \phi(R)(R^{-1} - T^{-1})$                      4.613               4.308
$\Phi(R) + \phi(R)(R^{-1} - T_u^{-1})$                   11.274               6.943
$\Phi(\bar R) + \phi(\bar R)(\bar R^{-1} - \bar T^{-1})$      4.878           4.403
$\Phi(\bar R) + \phi(\bar R)(\bar R^{-1} - \bar T_u^{-1})$   12.190           7.510
Results

Tables 1 and 2 report the average of the Type I error probabilities (in percent) over the 100 simulation rounds. The quantities $T_u$ and $\bar T_u$ are computed with uniform prior densities. From the tables, we can see that for both the Barndorff-Nielsen format approximation (BN) and the Lugannani and Rice format approximation (LR), the empirical adjustment has the best performance. Barndorff-Nielsen’s modification has the best asymptotic error rate ($O(n^{-3/2})$ rather than $O(n^{-1})$), and hence we might expect the best performance from this approximation. Instead we observe the best performance from other modifications with worse asymptotic error. The authors hope to explore this discrepancy in later work. One may also notice that the performance of DiCiccio and Martin’s modification is not as good as expected. One explanation could be that, in small-sample settings, Bayesian methods are generally more sensitive to the choice of the prior density than in large-sample situations. The importance of the choice of prior is demonstrated by the poor performance of the approximations with the incorrect uniform priors.

References

[1] Barndorff-Nielsen, O. E. (1980). Conditionality resolutions. Biometrika 67 293–310.
[2] Barndorff-Nielsen, O. E. (1986). Inference on full or partial parameters based on the standardized signed log likelihood ratio. Biometrika 73 307–322.
[3] Barndorff-Nielsen, O. E. (1988). Discussion of the paper by N. Reid. Statist. Sci. 3 228–229.
[4] Barndorff-Nielsen, O. E. (1990). Approximate interval probabilities. J. Roy. Statist. Soc. Ser. B 52 485–496.
[5] Barndorff-Nielsen, O. E. (1991). Modified signed log likelihood ratio. Biometrika 78 557–563.
[6] Cox, D. R. and Reid, N. (1987). Parameter orthogonality and approximate conditional inference. J. Roy. Statist. Soc. Ser. B 49 1–39.
[7] DiCiccio, T. J., Field, C. A. and Fraser, D. A. S. (1990). Approximations of marginal tail probabilities and inference for scalar parameters. Biometrika 77 77–95.
[8] DiCiccio, T. J. and Martin, M. A. (1993). Simple modifications for signed roots of likelihood ratio statistics. J. Roy. Statist. Soc. Ser. B 55 305–316.
[9] Kolassa, J. E. (2006). Series Approximation Methods in Statistics, 3rd ed. Springer, New York.
[10] Lugannani, R. and Rice, S. (1980). Saddle point approximation for the distribution of the sum of independent random variables. Adv. in Appl. Probab. 12 475–490.
[11] Peers, H. W. (1965). On confidence points and Bayesian probability points in the case of several parameters. J. Roy. Statist. Soc. Ser. B 27 9–16.
[12] Severini, T. A. (1999). An empirical adjustment to the likelihood ratio statistic. Biometrika 86 235–247.
[13] Skovgaard, I. M. (1987). Saddlepoint expansions for conditional distributions. J. Appl. Probab. 24 875–887.
[14] Tibshirani, R. (1989). Noninformative priors for one parameter of many. Biometrika 76 604–608.
[15] Welch, B. L. and Peers, H. W. (1963). On formulae for confidence points based on integrals of weighted likelihoods. J. Roy. Statist. Soc. Ser. B 25 318–329.
IMS Lecture Notes–Monograph Series Complex Datasets and Inverse Problems: Tomography, Networks and Beyond Vol. 54 (2007) 260–267 © Institute of Mathematical Statistics, 2007 DOI: 10.1214/074921707000000201
Multivariate medians and measure-symmetrization∗ Richard A. Vitale1 University of Connecticut Abstract: We discuss two research areas dealing respectively with (1) a class of multivariate medians and (2) a symmetrization algorithm for probability measures.
1. Introduction

Geometric and stochastic ideas interact in a wide variety of ways over both theory and applications. For two connections that do not seem to have been mentioned before, we present here descriptive comments and suggestions for further investigation. The first deals with multivariate medians, an active area of research in which Yehuda himself was interested (see [11] with Cun-Hui Zhang). Using the intrinsic volume functionals for convex bodies, we define a class of multivariate medians and show that among them, as special cases, are the $L_1$-median and the Oja median. The second question deals with a measure-theoretic generalization of the classic Steiner symmetrization technique for convex bodies.

2. A class of multivariate medians

2.1. Background

In $\mathbb{R}^d$ with its usual algebraic and metric structures, we consider the class $\mathcal{K}^d$ of convex bodies (compact, convex), which is closed under scaling $\lambda K = \{\lambda x : x \in K\}$ and Minkowski addition $K + L = \{x + y : x \in K, y \in L\}$. A special class of convex bodies are sums of line segments, or zonotopes. The $\lambda$-parallel body to $K$ is $K + \lambda B_d$, where $B_d$ is the closed unit ball in $\mathbb{R}^d$. Steiner’s formula gives the volume of a typical parallel body:
$$\mathrm{vol}_d(K + \lambda B_d) = \sum_{j=0}^{d} V_{d-j}(K)\, \mathrm{vol}_j(B_j)\, \lambda^j.$$
∗ Invited talk: Complex Datasets and Inverse Problems: Tomography, Networks, and Beyond. A Conference in Memory of Yehuda Vardi. Rutgers University, October 21–22, 2005. 1 Department of Statistics, University of Connecticut, Storrs, CT 06269, USA, e-mail: [email protected] AMS 2000 subject classifications: primary 60D05; secondary 52A21, 52A40, 52B12, 60E99, 62H99. Keywords and phrases: double-mean iteration, intrinsic volumes, L1 -median, multivariate median, Steiner formula, Steiner symmetrization, zonotope.
The coefficients $V_j(K)$, $j = 0, 1, 2, \ldots, d$, are the so-called intrinsic volumes of $K$ ([10]). They arise in a variety of problems and can, for example, be defined quite differently:
$$V_j(K) = \frac{(2\pi)^{j/2}}{j!\, \mathrm{vol}_j(B_j)}\; E\, \mathrm{vol}_j\left(Z_{[j,d]} K\right),$$
where $Z_{[j,d]}$ is a $j \times d$ matrix of independent, standard Gaussian random variables ([17]). Some can be identified with familiar geometric functionals:

$V_0(K) = 1$
$V_1(K)$ = intrinsic width
$\vdots$
$V_{d-1}(K)$ = $1/2 \cdot$ surface area of $K$
$V_d(K)$ = $d$-dimensional volume of $K$
$V_j(K) := 0$ for $j > d$.
Here are some specific intrinsic volumes:
$$V_1([0, 1]) = 1,$$
$$V_1([a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_n, b_n]) = \sum_{i=1}^{n} (b_i - a_i),$$
and generally for $j = 1, 2, \ldots, d$:
$$V_j([a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_n, b_n]) = \sum_{i_1 < i_2 < \cdots < i_j} (b_{i_1} - a_{i_1})(b_{i_2} - a_{i_2}) \cdots (b_{i_j} - a_{i_j}).$$
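The box formula above is the elementary symmetric polynomial of degree $j$ in the side lengths, so it can be evaluated directly. The following is a minimal sketch; the function name `box_intrinsic_volume` and the representation of the box as a list of `(a, b)` pairs are assumptions made for illustration.

```python
from itertools import combinations
from math import prod


def box_intrinsic_volume(j, intervals):
    """V_j of the box [a_1, b_1] x ... x [a_n, b_n].

    Sums the products of side lengths over all index sets i_1 < ... < i_j.
    For j = 0 the single empty product gives 1; for j larger than the
    number of sides there are no index sets, so the sum is 0.
    """
    sides = [b - a for a, b in intervals]
    return float(sum(prod(chosen) for chosen in combinations(sides, j)))


if __name__ == "__main__":
    box = [(0, 1), (0, 2), (0, 3)]
    for j in range(5):
        print(j, box_intrinsic_volume(j, box))
```

For the 1 × 2 × 3 box this gives the sum of the side lengths for $j = 1$, the sum of the three face areas (half the surface area) for $j = 2$, and the volume for $j = 3$, in line with the identifications listed above.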
2.2. $V_j$-medians

Suppose that a point sample $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ is given, and a median is sought. We begin by creating, for every $x \in \mathbb{R}^d$, the zonotope
$$Z(x) = \overline{x - x_1} + \overline{x - x_2} + \cdots + \overline{x - x_n},$$
where $\overline{v}$ denotes the line segment joining the origin to $v$. $Z(x)$ evidently aggregates, as a polytope, the discrepancies between $x$ and the sample points $x_i$. This can be quantified as follows: from our previous discussion, it is evident that the intrinsic volumes are generally measures of size. More specifically, $V_j$ is a volume-like functional of homogeneity degree $j$. Consider the class of associated variational problems: for each $j = 1, 2, \ldots, d$, minimize $V_j(Z(x))$ over $x \in \mathbb{R}^d$. The minimizing point (if it exists) we call the $V_j$-median of the sample. This point of view unifies two well-known medians that, on other grounds, appear to have little in common:

• The $V_1$-median follows from minimizing $\sum_{i=1}^{n} \|x - x_i\|$ and thus coincides with the well-known $L_1$-median (an interesting treatment of computational issues and other properties appears in [11]).
• The $V_d$-median, on the other hand, depends on minimization of $\mathrm{vol}_d(Z(x))$ and thus coincides with the affine-equivariant median of Oja [7].

Further work:

• It would be of interest to investigate the $V_j$-medians for intermediate values of $j$, $2 \le j \le d - 1$. For such a $j$, one would seek to minimize
$$\sum_{1 \le i_1 < i_2 < \cdots < i_j \le n} \det\left[x - x_{i_1}, x - x_{i_2}, \ldots, x - x_{i_j}\right],$$
where each term is understood as the $j$-dimensional volume of the parallelepiped spanned by the listed vectors.

• The Wills functional
$$W(K) = 1 + V_1(K) + V_2(K) + \cdots + V_d(K)$$
is not homogeneous in $K$ but on account of its other remarkable properties (e.g. [15], [16], [17]) would likely be an interesting alternative size measure. The representation
$$W(Z(x)) = \int_{y \in \mathbb{R}^d} e^{-\pi\, \mathrm{dist}^2(y, Z(x))}\, dy$$
could be useful.

• One can focus on the inverse, or polar, body related to $Z(x)$. One would then seek to maximize, for example, its volume, which is proportional to
$$\frac{1}{d} \int_{u \in S^{d-1}} \left[\sum_{i=1}^{n} |\langle u, x - x_i \rangle|\right]^{-d} du.$$
• Finally, it would be of interest to relate these ideas to other instances of zonotopes’ use in data analysis, e.g., [6], [13].

3. Steiner symmetrization as mass transport

Steiner symmetrization is a geometric transformation of convex bodies that has been useful in a variety of problems, notably when an extremizing body is sought for a prescribed functional. Mass transport, on the other hand, embraces a number of issues that deal with shifts of mass (i.e., measure) from one location to another (see, for example, [8, 9]). Recently the question has arisen as to whether it is possible to generalize Steiner symmetrization along the lines of mass transport. Here we suggest a formulation and sketch some preliminary thoughts.

Classical Steiner symmetrization is as follows: given a convex body $K$ and a direction (= unit vector) $u$, locate each chord of $K$ parallel to $u$ and shift it along its line of inclusion so as to re-position its midpoint on the hyperplane $H_{u^\perp}$ perpendicular to $u$. The aggregate of such shifted chords forms a new convex body, known as the Steiner symmetral of $K$. The original idea for this transformation was apparently due to L’Huilier in the 1780s, but Steiner popularized it in his treatment of the isoperimetric inequality ([3], [4]).

Variants of Steiner symmetrization can be found in [1], [3], [5]. The present one is motivated by re-casting the procedure in an equivalent form: regard each chord as bearing a uniformly distributed mass and shift it so as to re-position its center of mass on $H_{u^\perp}$. More generally one can think of a non-negative (and otherwise nice)
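$f : \mathbb{R}^d \to \mathbb{R}_+$, which defines a finite mass distribution on $\mathbb{R}^d$ and therefore on each line $\{u^\perp + tu,\ -\infty < t < \infty\}$, $u^\perp \in H_{u^\perp}$. On such a line, the center of mass is
$$\left[\int_{t \in \mathbb{R}} f(u^\perp + tu)\, dt\right]^{-1} \int_{t \in \mathbb{R}} (u^\perp + tu)\, f(u^\perp + tu)\, dt = u^\perp + mu.$$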
The shift then amounts to replacing $t \to f(u^\perp + tu)$ with $t \to f(u^\perp + (t - m)u)$. The combined effect of such shifts (over $u^\perp$) amounts to a transport of mass. One can extend this even farther by regarding total mass as normalized to one and observing that the procedure has an equivalent formulation in terms of conditional expectations of random variables. This avoids issues of regularity for the density.

Suppose then that $X$ is a random vector in $\mathbb{R}^d$ for which $EX$ exists ($\iff E\|X\| < \infty$). Then its Steiner symmetrization in the direction $u \in S^{d-1}$ is defined to be the random vector
$$X_u = X - E[uu'X \mid \Pi_{u^\perp} X] = X - uE[u'X \mid \Pi_{u^\perp} X]. \tag{1}$$
Here $\Pi_{u^\perp} = I - uu'$ is (orthogonal) projection onto the subspace $u^\perp$. Thus, conditioned on $\Pi_{u^\perp} X$, the random vector $X$ is shifted so that its conditional expectation lies in $u^\perp$. It can be shown rather directly that this formulation extends classical Steiner symmetrization:

Theorem 1. Suppose that $X$ is uniformly distributed on a convex body $K$. Then (i) $X_u$ is uniformly distributed on the Steiner symmetral of $K$, and (ii) there is a sequence of symmetrizations that converge in distribution to uniform measure on the (centered) ball of the same volume as $K$.

As we discuss below, general results appear to be difficult to prove. But the special case of symmetrization of a Gaussian measure already presents some interesting phenomena. We provide details for completeness.

Theorem 2. Suppose that $X$ has Gaussian distribution in $\mathbb{R}^d$ with mean $\mu$ and (invertible) covariance matrix $\Sigma$, $X \sim N(\mu, \Sigma)$. Then (i) any Steiner symmetrization yields another Gaussian distribution, and (ii) there is a sequence of Steiner symmetrizations producing a limiting Gaussian distribution that is spherically symmetric about the origin.

Proof. (i) Direct by the linear nature of the transformation. (ii) There is no loss of generality in assuming $\mu = 0$ since, if not, a symmetrization with $u = \mu/\|\mu\|$ yields this centering. For the first actual symmetrization, fix $u$ and then as above $X_u = X - uE[u'X \mid \Pi_{u^\perp} X]$. For convenience, let $\Pi = \Pi_{u^\perp}$. Well-known properties of normal random variables provide that $E[u'X \mid \Pi X]$ is that linear functional of $\Pi X$ which minimizes $E(u'X - w'\Pi X)^2$. Setting $\nabla E[u'X - w'\Pi X]^2 = 0$ leads to
$$\begin{aligned} & \Pi\Sigma\Pi w = \Pi\Sigma u \\ \iff\ & \Sigma\Pi w = \Sigma u + cu \quad \text{for a constant } c \\ \iff\ & \Pi w = u + c\Sigma^{-1}u. \end{aligned}$$
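For $c$, we apply $u'$ to get $0 = 1 + cu'\Sigma^{-1}u \implies c = -\left(u'\Sigma^{-1}u\right)^{-1}$. Then
$$\Pi w = \Pi^2 w = \Pi\left(u + c\Sigma^{-1}u\right) = c\Pi\Sigma^{-1}u$$
so that
$$E[u'X \mid \Pi X] = w'\Pi X = cu'\Sigma^{-1}\Pi X$$
and
$$X_u = X - cuu'\Sigma^{-1}\Pi X = (I - cuu'\Sigma^{-1}\Pi)X.$$
It follows that
$$\Sigma_u = EX_uX_u' = E(I - cuu'\Sigma^{-1}\Pi)XX'(I - c\Pi\Sigma^{-1}uu') = (I - cuu'\Sigma^{-1}\Pi)\,\Sigma\,(I - c\Pi\Sigma^{-1}uu').$$
Now suppose that $(v_1, \lambda_1)$ and $(v_2, \lambda_2)$ are two eigenpairs of $\Sigma$ ($\|v_1\| = \|v_2\| = 1$). A convenient choice for symmetrization is $u = \frac{1}{\sqrt{2}}(v_1 + v_2)$. Then
$$\Pi = I - \frac{1}{2}(v_1 + v_2)(v_1 + v_2)',$$
and
$$\frac{1}{c} = -u'\Sigma^{-1}u = -\frac{1}{2}(v_1 + v_2)'\Sigma^{-1}(v_1 + v_2) = -\frac{1}{2}\left(\frac{1}{\lambda_1} + \frac{1}{\lambda_2}\right). \tag{2}$$
Further,
$$\begin{aligned} uu'\Sigma^{-1}\Pi &= \frac{1}{2}(v_1 + v_2)\left(\frac{1}{\lambda_1}v_1 + \frac{1}{\lambda_2}v_2\right)'\left[I - \frac{1}{2}(v_1 + v_2)(v_1 + v_2)'\right] \\ &= \frac{1}{2}(v_1 + v_2)\left(\frac{1}{\lambda_1}v_1 + \frac{1}{\lambda_2}v_2\right)' - \frac{1}{4}\left(\frac{1}{\lambda_1} + \frac{1}{\lambda_2}\right)(v_1 + v_2)(v_1 + v_2)' \\ &= \frac{1}{4}(v_1 + v_2)\left[\frac{2}{\lambda_1}v_1 + \frac{2}{\lambda_2}v_2 - \frac{1}{\lambda_1}v_1 - \frac{1}{\lambda_1}v_2 - \frac{1}{\lambda_2}v_1 - \frac{1}{\lambda_2}v_2\right]' \\ &= \frac{1}{4}(v_1 + v_2)\left[\frac{1}{\lambda_1}(v_1 - v_2) + \frac{1}{\lambda_2}(v_2 - v_1)\right]' \\ &= \frac{1}{4}\left(\frac{1}{\lambda_1} - \frac{1}{\lambda_2}\right)(v_1 + v_2)(v_1 - v_2)'. \end{aligned}$$
With (2), this gives
$$cuu'\Sigma^{-1}\Pi = -\frac{1}{2}\,\frac{\lambda_2 - \lambda_1}{\lambda_1 + \lambda_2}\,(v_1 + v_2)(v_1 - v_2)'.$$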
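Then
$$\begin{aligned} \Sigma_u &= \left[I + \frac{1}{2}\frac{\lambda_2 - \lambda_1}{\lambda_1 + \lambda_2}(v_1 + v_2)(v_1 - v_2)'\right] \Sigma \left[I + \frac{1}{2}\frac{\lambda_2 - \lambda_1}{\lambda_1 + \lambda_2}(v_1 - v_2)(v_1 + v_2)'\right] \\ &= \left[\Sigma + \frac{1}{2}\frac{\lambda_2 - \lambda_1}{\lambda_1 + \lambda_2}(v_1 + v_2)(\lambda_1 v_1 - \lambda_2 v_2)'\right] \times \left[I + \frac{1}{2}\frac{\lambda_2 - \lambda_1}{\lambda_1 + \lambda_2}(v_1 - v_2)(v_1 + v_2)'\right] \\ &= \Sigma + \frac{1}{2}\frac{\lambda_2 - \lambda_1}{\lambda_1 + \lambda_2}(v_1 + v_2)(\lambda_1 v_1 - \lambda_2 v_2)' + \frac{1}{2}\frac{\lambda_2 - \lambda_1}{\lambda_1 + \lambda_2}(\lambda_1 v_1 - \lambda_2 v_2)(v_1 + v_2)' \\ &\quad + \left(\frac{1}{2}\frac{\lambda_2 - \lambda_1}{\lambda_1 + \lambda_2}\right)^2 (v_1 + v_2)(\lambda_1 v_1 - \lambda_2 v_2)'(v_1 - v_2)(v_1 + v_2)' \\ &= \Sigma + \frac{1}{2}\frac{\lambda_2 - \lambda_1}{\lambda_1 + \lambda_2}\Big[(v_1 + v_2)(\lambda_1 v_1 - \lambda_2 v_2)' + (\lambda_1 v_1 - \lambda_2 v_2)(v_1 + v_2)' + \frac{1}{2}(\lambda_2 - \lambda_1)(v_1 + v_2)(v_1 + v_2)'\Big] \\ &= \Sigma + \frac{1}{2}\frac{\lambda_2 - \lambda_1}{\lambda_1 + \lambda_2}\left[\frac{3\lambda_1 + \lambda_2}{2}v_1v_1' - \frac{\lambda_1 + 3\lambda_2}{2}v_2v_2' - \frac{1}{2}(\lambda_2 - \lambda_1)(v_1v_2' + v_2v_1')\right]. \end{aligned}$$
Ignoring all but the first two terms of the spectral decomposition $\Sigma = \lambda_1v_1v_1' + \lambda_2v_2v_2' + \cdots + \lambda_dv_dv_d'$, one gets
$$\begin{aligned} & \lambda_1v_1v_1' + \lambda_2v_2v_2' + \frac{1}{2}\frac{\lambda_2 - \lambda_1}{\lambda_1 + \lambda_2}\left[\frac{3\lambda_1 + \lambda_2}{2}v_1v_1' - \frac{\lambda_1 + 3\lambda_2}{2}v_2v_2' - \frac{1}{2}(\lambda_2 - \lambda_1)(v_1v_2' + v_2v_1')\right] \\ &\quad = \frac{\lambda_1^2 + 6\lambda_1\lambda_2 + \lambda_2^2}{4(\lambda_1 + \lambda_2)}(v_1v_1' + v_2v_2') - \frac{1}{4}\frac{(\lambda_2 - \lambda_1)^2}{\lambda_1 + \lambda_2}(v_1v_2' + v_2v_1'). \end{aligned}$$
As a $2 \times 2$ array, this has eigenvalues
$$\frac{\lambda_1^2 + 6\lambda_1\lambda_2 + \lambda_2^2}{4(\lambda_1 + \lambda_2)} \pm \frac{(\lambda_2 - \lambda_1)^2}{4(\lambda_1 + \lambda_2)},$$
which simplify to
$$\frac{1}{2}(\lambda_1 + \lambda_2), \qquad \left[\frac{1}{2}\left(\frac{1}{\lambda_1} + \frac{1}{\lambda_2}\right)\right]^{-1}.$$
It follows that $\Sigma_u$ has the same eigenvalues as $\Sigma$ except for $\lambda_1$ and $\lambda_2$, which are replaced by their arithmetic and harmonic means. Iterating this double-mean transformation leads to a common limit ([2]), i.e., $\lambda_1$ and $\lambda_2$ can be replaced by a pair of eigenvalues arbitrarily close to one another. Since this is true for any pair of eigenvalues, the result is established.

Remarks

(1) The appearance of the two means in the proof is of historical interest, as the so-called “double-mean iteration.” See, for example, [2] with a discussion of the work of Archimedes and Gauss among others. It would be of interest to know whether it appears here merely as a curiosity or whether there is a deeper theory connecting symmetrization of Gaussian measures and eigenvalue means.

(2) In analogy with the classical theory, one would like to have results along the following lines: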
Conjecture 1. For some sequence of Steiner symmetrizations [resp. for almost every sequence of Steiner symmetrizations (i.e. utilizing independent, random selections of $u$, cf. [14])], the induced sequence of symmetrized distributions is asymptotically spherically symmetric.

Conjecture 2. Any sequence of Steiner symmetrizations yields a (weakly) convergent sequence of probability measures.

These do not seem easy to establish, but it seems plausible that results along the following lines may be helpful:

Proposition 1. Steiner symmetrization is norm-reducing in mean-square:
$$E\|X\|^2 - E\|X_u\|^2 = E\|X - X_u\|^2 = E\left[E(u'X \mid \Pi_{u^\perp}X)\right]^2 \ge 0.$$

Proof. The following case is generic: $d = 2$, $u = (1, 0)'$,
$$X = \begin{pmatrix} X^1 \\ X^2 \end{pmatrix} \longrightarrow X_u = \begin{pmatrix} X^1 - E(X^1 \mid X^2) \\ X^2 \end{pmatrix},$$
$$E\|X\|^2 - E\|X_u\|^2 = E\|X - X_u\|^2 = E\left[E(X^1 \mid X^2)\right]^2 \ge 0.$$
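The double-mean step in the proof of Theorem 2 can be checked numerically. The sketch below works in the two-dimensional eigenbasis $(v_1, v_2)$, builds $\Sigma_u = M\Sigma M'$ with $M = I - cuu'\Sigma^{-1}\Pi$ taken from the closed form derived in the proof, and iterates on the resulting eigenvalues. The function names `symmetrized_covariance`, `eigenvalues_sym2` and `double_mean_iteration` are hypothetical, and restricting attention to a single eigenpair is an assumption made to keep the example small.

```python
import math


def symmetrized_covariance(lam1, lam2):
    """2x2 covariance, in the (v1, v2) eigenbasis, after one symmetrization
    in the direction u = (v1 + v2) / sqrt(2)."""
    # c u u' Sigma^{-1} Pi = k (v1 + v2)(v1 - v2)', as derived in the proof.
    k = -0.5 * (lam2 - lam1) / (lam1 + lam2)
    # In the eigenbasis (v1 + v2)(v1 - v2)' = [[1, -1], [1, -1]], so M = I - k * that.
    m = [[1.0 - k, k], [-k, 1.0 + k]]
    sigma = [[lam1, 0.0], [0.0, lam2]]
    m_sigma = [[sum(m[i][r] * sigma[r][j] for r in range(2)) for j in range(2)]
               for i in range(2)]
    # Sigma_u = M Sigma M'
    return [[sum(m_sigma[i][r] * m[j][r] for r in range(2)) for j in range(2)]
            for i in range(2)]


def eigenvalues_sym2(s):
    """Eigenvalues (smaller, larger) of a symmetric 2x2 matrix."""
    center = 0.5 * (s[0][0] + s[1][1])
    radius = math.hypot(0.5 * (s[0][0] - s[1][1]), s[0][1])
    return center - radius, center + radius


def double_mean_iteration(lam1, lam2, steps):
    """Apply the symmetrization repeatedly to the current eigenvalue pair."""
    for _ in range(steps):
        lam1, lam2 = eigenvalues_sym2(symmetrized_covariance(lam1, lam2))
    return lam1, lam2


if __name__ == "__main__":
    print(eigenvalues_sym2(symmetrized_covariance(1.0, 3.0)))
    print(double_mean_iteration(1.0, 3.0, 6))
```

With $\lambda_1 = 1$ and $\lambda_2 = 3$, one step yields the harmonic and arithmetic means, 1.5 and 2. Because the product of the two means equals $\lambda_1\lambda_2$ at every step, the common limit of the iteration is the geometric mean $\sqrt{\lambda_1\lambda_2}$.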
(3) Finally, it would be of interest to place the transformation $X \longrightarrow X_u$ in a martingale context.

References

[1] Ehrhard, A. (1983). Symétrisation dans l’espace de Gauss. Math. Scand. 53 281–301.
[2] Foster, D. M. E. and Phillips, G. M. (1984). The arithmetic-harmonic mean. Math. Comp. 42 183–191.
[3] Gardner, R. J. (1995). Geometric Tomography. Cambridge Univ. Press, Cambridge.
[4] Gruber, P. M. (1993). History of convexity. In Handbook of Convex Geometry (P. M. Gruber and J. M. Wills, eds.) 1–15. North-Holland, Amsterdam.
[5] Lieb, E. H. and Loss, M. (1997). Analysis. Amer. Math. Soc., Providence, RI.
[6] Koshevoy, G. A., Möttönen, J. and Oja, H. (2003). A scatter matrix estimate based on the zonotope. Ann. Statist. 31 1439–1459.
[7] Oja, H. (1983). Descriptive statistics for multivariate distributions. Statist. Probab. Lett. 1 327–332.
[8] Rachev, S. and Rüschendorf, L. (1998). Mass Transportation Problems. I. Theory. Springer, New York.
[9] Rachev, S. and Rüschendorf, L. (1998). Mass Transportation Problems. II. Applications. Springer, New York.
[10] Schneider, R. (1993). Convex Bodies: the Brunn-Minkowski Theory. Cambridge Univ. Press, New York.
[11] Vardi, Y. and Zhang, C.-H. (2000). The multivariate L1-median and associated data depth. Proc. Natl. Acad. Sci. 97 1423–1426.
[12] Vitale, R. A. (1985). The Steiner point in infinite dimensions. Israel J. Math. 52 245–250.
[13] Vitale, R. A. (1991). Expectation inequalities from convex geometry. In Stochastic Orders and Decision under Risk (K. Mösler and M. Scarsini, eds.) 372–379. IMS Lecture Notes and Monograph Series 19.
[14] Vitale, R. A. (1994). Stochastic smoothing of convex bodies: two examples. Supplement to Rend. Circ. Mat. Palermo 35 315–322.
[15] Vitale, R. A. (1996). The Wills functional and Gaussian processes. Ann. Probab. 24 2172–2178.
[16] Vitale, R. A. (1999). A log-concavity proof for a Gaussian exponential bound. In Contemporary Math.: Advances in Stochastic Inequalities (T. P. Hill, C. Houdré, eds.) 234 209–212.
[17] Vitale, R. A. (2000). Intrinsic volumes and Gaussian processes. Adv. in Appl. Probab. 33 354–364.
IMS Lecture Notes–Monograph Series Complex Datasets and Inverse Problems: Tomography, Networks and Beyond Vol. 54 (2007) 268–273 © Institute of Mathematical Statistics, 2007 DOI: 10.1214/074921707000000210
Statistical thinking: From Tukey to Vardi and beyond Larry Shepp1,∗ Rutgers University Abstract: Data miners (minors?) and neural networkers tend to eschew modelling, misled perhaps by misinterpretation of strongly expressed views of John Tukey. I discuss Vardi’s views of these issues as well as other aspects of Vardi’s work in emission tomography and in sampling bias.
Statistics is not in my main skill set but I take this opportunity to record a few thoughts about it that I have accumulated over the years. In the ’60s, John Tukey and his followers brought exploratory data analysis into statistics, partly as a revolt against what was then perceived as an overly rigid and brittle mathematical modelling philosophy that held sway at that time. Some problems seemed to demand such a purely data-driven approach, where data mining methods in the absence of mathematical modelling are the driving philosophical methodology. One did not want to be biased by preconceived ideas about the origin of the data by formulating a model but instead allowed the data to “speak for itself”. Vardi liked mathematical modelling and was very good at it. He also promoted data mining, depending on the problem, and thus straddled both philosophies. He and I often debated these issues, and were often in friendly disagreement. I will try to argue, with concrete examples of work of Vardi and others in statistics, that the pendulum should again swing back a bit towards encouraging more mathematical modelling, to obtain maximal benefit from the use of statistical procedures by allowing physics, biology, and other fields of science to enter the statistical problem formulation via mathematical modelling of the specific statistical problem at hand. I would argue that the solution to a specific problem ought to somehow depend on the problem itself, which is not the case with neural nets and other data-driven approaches that live mostly or entirely within the data or training set of the problem. Data-driven statistics has the danger of isolating statistics from the rest of the scientific and mathematical communities by not allowing valuable cross-pollination of ideas from other fields. To illustrate these ideas I will discuss, among other concrete examples of statistics problems: emission tomography, machine learning, sampling bias.
∗ Research partially supported by National Science Foundation Grant DMS-0504387.
1 Department of Statistics and Biostatistics, Rutgers University, Hill Center, Busch Campus, Piscataway NJ 08854, USA, e-mail: [email protected]
Keywords and phrases: Data mining, neural nets, statistical modelling, emission tomography, sampling bias.
AMS 2000 subject classification: 6207.
These topics were debated frequently by Vardi and me. I will do my best to give Vardi’s side as honestly as possible. Needless to say, I wish he were here to continue the debate. I will quote Tukey and/or Vardi on issues I will raise, and you should be aware that people who quote absentees don’t allow the quotees to modify the positions they are being quoted on unless they are in agreement with the
quoter’s own positions. However, if I quote the current views of Tukey and Vardi inaccurately, it is inadvertent!
1. Emission tomography and the EM algorithm
Emission tomography (ET) has the advantage over CT scanning that it can be used to study metabolism. Thus one can in principle learn, say, where in the brain higher cognition is taking place by tagging psychoactive substances with radioactivity. These substances move to the part of the brain that is active during the performance of a higher cognition task and then radiate γ rays, which are detected. Vardi and I wrote [1] on ET, and then he wrote [2] and several others without my participation. His role was large indeed. I was trying to find a maximum likelihood estimate of the unknown emission density, because one can write down in closed form the likelihood of the detected counts for an arbitrary emission density using the well-known Poisson model for radioactive emissions. Since emission CT is “photon-limited”, it is reasonable to use maximum likelihood estimation as the driving modelling approach. Vardi was familiar with the EM algorithm, with the notion of “over-fitting”, and also with Kullback-Leibler theory and alternating convergence techniques. I was indeed lucky that Vardi was at Bell Labs at the time and that he got very interested in this project. Our second paper above, which is much better written than the first, was entirely his and Linda Kaufman’s doing.
Later on, with the advent of functional MRI, I felt that brain physiology is better modelled by the physics of MRI. The reason is that emission CT is too slow to study rapid events in the brain, which take place in a fraction of a second. Statistics still plays a major role in fMRI brain physiology, but the methodology of the EM algorithm is no longer involved, because the physics is now different and there is no likelihood to maximize in fMRI.
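For the emission tomography likelihood described in this section, the following is a minimal Python sketch of the multiplicative update usually associated with maximum likelihood EM for Poisson counts. The names `mlem`, `counts`, and `p` are illustrative and do not come from the text. The matrix `p[b][d]`, the probability that an emission in pixel b is recorded in detector unit d, is assumed to be known; in a real scanner it would come from the geometry. This is a toy under those assumptions, not the implementation of [1] or [2].

```python
def mlem(counts, p, iterations=50):
    """Toy EM iteration for a Poisson emission model.

    counts[d] : number of emissions recorded in detector unit d
    p[b][d]   : assumed known probability that an emission in pixel b
                is recorded in detector unit d (hypothetical input)
    Returns an estimate of the emission intensity in each pixel.
    """
    n_pix, n_det = len(p), len(counts)
    # Probability that an emission in pixel b is detected at all.
    sens = [sum(p[b]) for b in range(n_pix)]
    # Start from a flat, strictly positive image.
    lam = [1.0] * n_pix
    for _ in range(iterations):
        # Expected counts in each detector unit under the current estimate.
        expected = [sum(lam[b] * p[b][d] for b in range(n_pix))
                    for d in range(n_det)]
        # Multiplicative update: rescale each pixel by how well the
        # expected counts explain the observed counts.
        lam = [
            lam[b] / sens[b] * sum(counts[d] * p[b][d] / expected[d]
                                   for d in range(n_det) if expected[d] > 0)
            for b in range(n_pix)
        ]
    return lam
```

With strictly positive `p`, each pass keeps the estimates non-negative and makes the sensitivity-weighted sum of the estimates equal to the total number of detected counts, which is one reason the update suits photon-limited data.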
I feel that one should employ methods that reflect the physics of the problem at hand rather than the methods one happens to know. I said this so often in our discussions that Vardi used to laugh at me. Vardi turned his attention to problems in which he could use parallel EM-type methodology to solve reconstruction problems of a similar kind, but without a clear basis in physics for setting a criterion to maximize. We generally agreed to disagree on this methodological point. I must admit that he got some very nice results on various problems just by using convergence methods of EM type, even with no basis in physics. Vardi showed, for example, how to reconstruct the traffic on each leg of a graphical network from the overall traffic between the nodes of the network, which he called “network tomography” and which is indeed strongly analogous to emission CT. This debate between Vardi and me became even more relevant in the next topic. Of course it was always a friendly debate, though sometimes I got too loud.
2. What would the founding fathers have said about neural nets and machine learning?
I often tease the neural net community by asking them to design a neural net that would take the CT, ET, or MRI reconstruction and try to find a tumor in it or decide that no tumor is present. This might be possible, of course, and indeed there are many papers written on this topic which use image enhancement methods
to delineate blobs, so that a radiologist can then decide whether to suspect a tumor, which would presumably save effort. But to make things more interesting, I dared the neural net community to work not with the CT, ET, or MRI reconstruction, but instead with the raw measured data. After all, if neural net technology really works, who needs Radon inversion or Fourier transforms? Why not use it on the raw data directly? I made this challenge to point out the mindlessness of throwing mathematical modelling away completely.
Another difficult problem area, one in which mathematics is harder to apply and in which there have been many attempts to use neural nets or other statistical approaches to pattern recognition, is the automated understanding of human speech. I must admit that much progress has been made on it by Allen Gorin, who used his ideas about “salience” to build systems like “How May I Help You” for AT&T, which involves direct discourse with a human being who makes one of a limited number of requests of an operator or a business office. Salience led quickly to a working system which accomplished this limited task. It avoids the usual confusion of a system that uses prompts like “if you want . . . press one, if . . . press two”, which irks many people no end. Here machine learning avoids mathematical modelling or true understanding, and yet it does accomplish the job. Vardi often would rub my nose in this fact.
At the other end of the spectrum of machine understanding of speech would be a problem such as machine translation of Russian text into English. Here decision theory or salience can hardly be expected to find an appropriate L2-loss function with which to train an algorithm to converge to a good translator.
It seems clear that when we do construct a robotic translator, it will have to somehow understand the “meaning” of the passage to be translated, and it will no doubt use some modular system based on key words and rules rather than on the standard statistical decision theory, salience, or neural net approaches.
What was Vardi’s opinion of this controversy? Did he agree with Tukey’s position, with mine, or with neither? I’m not sure; probably neither, but I have to say that he liked to reuse the methods of statistics with no basis in modelling more than I do, and did it occasionally more successfully than I expected. He used clever convergence methods and did not require there to be any physics behind the model. I would often argue that this gives no way for mathematics and physics to interact with statistics. I think he agreed with me at least once, but I am sure both Tukey and Vardi agree completely with my position today!
There seems to me to be a disturbing trend among statisticians to use “standard statistical methodology” to solve problems which may not be amenable to simple approaches. Let me give a few examples of such problems, and let’s try to guess what John Tukey’s and Yehuda Vardi’s opinions would be on the issues.
Robotics is an important problem to solve for the future development of society. Can statistics play a key role in this important area? A useful robot would relieve us of the need to perform routine tasks and would also provide entertainment and companionship when desired. Most problems in robotics call for the development of algorithms for automated pattern recognition, often called machine learning, as we discussed above. Examples of robotic tasks include recognizing hand-written characters so that addresses on envelopes can be automatically sorted. More profound pattern recognition tasks include speech understanding, so that a robot could respond to us directly and converse with us in order to take instructions and perhaps provide companionship.
Chess playing robots are already well-developed and provide an opponent at any time for people to play with. Another conceptual problem I enjoy contemplating is to program a computer to
recognize whether a given picture of an animal is that of a dog or that of a cat. Of course small children can do this with high accuracy in most cases, but it may not be so easy to write such a program without some understanding of the “salient feature” that is the real difference between the two animals. Is it reasonable to try to find such a salient feature by a neural net on a training set, or is this likely to find some “feature” that has little or nothing to do with how a child actually does it? The neural net might just find some commonality of all the dogs in the training set that is totally irrelevant to the problem. It also seems likely that the neural net system would not be modular: there would be no way to build on it to include recognition of other animals, and one would have to start anew.
While I would argue that each particular problem of pattern recognition should bring a different solution, based on some feature from the real world that we can see is relevant and an algorithm that looks for that feature, I think Tukey might argue just the opposite. After all, he advocated looking at the data of a particular statistical problem without trying to model it, so as not to prejudice yourself with preconceived ideas. Exploratory data analysis is similar to the use of salient features found automatically via a neural net on a training set, except that Tukey himself or some other data miner would be doing the analysis rather than a program. But Tukey’s articulate urging made neural nets more attractive, and so in a sense Tukey may have stimulated neural nets. Would he have changed his mind today in the light of some of the overblown claims of neural nets? I think so, or at least I hope so.
Character recognition; an easy machine learning problem.
The problem of robotics has statistical components because machine learning is based on training set data which has lots of randomness.
Thus a neural net for the “post office problem” uses a training set of tens of thousands of examples of hand-written characters, where each example gives both the character that the person wrote and a digitized image of it (say a 16 × 16 array of zeros and ones). The algorithm classifies each new image and names a character. One statistical approach is to minimize the L2 distance between the new image and the set of all images for a given character. The neural net approach is similar: it uses a collection of linear functionals, and training chooses automatically which linear functional is most salient on the training set. This approach has the advantage that one does not have to think, and one does not have to use any physics or any preconceived notions to guide the convergence of the neural net to the meaningless salient feature. In this case, however, the problem is not all that hard, and almost any scheme can recognize characters with an error rate of a few percent.
While Tukey, Vardi, and I were at Bell Labs there was much work being done on the above problem of recognizing just the handwritten characters 0 to 9 for automated zipcode reading in the post office sorting application. Let me relate my recollections as an outside observer of this effort, which involved several large groups of people. One of the engineers, Patrice Simard, suddenly announced a great advance, and Trevor Hastie, who had been working on the problem with statistical methods and had been getting higher error rates, courageously invited Patrice to present his results in our statistics seminar. Instead of error rates in the region of 2 or 3 percent obtained by the methods of the neural nets and the statistical loss-function approach, Patrice announced an algorithm that was virtually error-free except in those cases where even a human being couldn’t recognize a character that was sloppily written. How did Patrice accomplish this?
He used “physics”, a modelling of the problem, in a clever way. Of course this problem is not as hard as the other automated learning problems mentioned above, but it is still hard enough to be done incorrectly if one does not think and decides to use statistics or neural nets blindly. Patrice took advantage of the fact that there are mathematical invariances present in the problem. The image of the number being written will depend on how much ink is in the pen, how thick the pencil point is, how the envelope is oriented relative to the wrist, etc. This type of inputting something from the real world to solve the problem is what I like to see.
Patrice introduced 7 transformations of the image of the character to be identified. Each character was thickened a bit (as if it were written by a pen with a larger tip) to get transformation one, was rotated to get transformation two, was translated in two directions to get transformations three and four, and so on. I don’t recall the other 3 transformations he used. This gave him 7 new points in 16 × 16 = 256-dimensional space, which together with the original image determined a 7-dimensional hyperplane in 256-dimensional space. He then dropped a perpendicular onto this plane from each of the 10,000 training characters and identified the new character as that of the training point closest to the hyperplane. He had a method that completely solved the problem. Clever! The sad part of the story was that he presented his solution to our seminar as a neural net method, which it was certainly not. Patrice thought about the problem from a modelling point of view and looked for a way to let the solution depend on the problem.
Since Patrice’s approach yielded a perfect algorithm, one would think that it would be the preferred method today, but it is not! The method had a drawback: it is slow. Dropping the 10,000 perpendiculars took a lot of time.
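The geometric step recalled above, dropping a perpendicular from each training character onto the hyperplane through the new image and its transformed copies, can be sketched in a few lines of Python. This is only an illustration of that description: the function names are hypothetical, the transformed images are assumed to be supplied by the caller (thickening, rotation, and translation are not implemented here), and nothing below is taken from the original algorithm’s code.

```python
import math

def distance_to_affine_hull(x, base, others):
    """Length of the perpendicular from x to the affine subspace
    passing through `base` and every vector in `others`."""
    # Build an orthonormal basis of the directions (other - base)
    # with Gram-Schmidt.
    basis = []
    for o in others:
        v = [oi - bi for oi, bi in zip(o, base)]
        for e in basis:
            c = sum(vi * ei for vi, ei in zip(v, e))
            v = [vi - c * ei for vi, ei in zip(v, e)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-12:  # skip directions that add nothing new
            basis.append([vi / norm for vi in v])
    # Strip from (x - base) its component inside the subspace;
    # the remainder is the perpendicular.
    r = [xi - bi for xi, bi in zip(x, base)]
    for e in basis:
        c = sum(ri * ei for ri, ei in zip(r, e))
        r = [ri - c * ei for ri, ei in zip(r, e)]
    return math.sqrt(sum(ri * ri for ri in r))

def classify(image, transformed, training):
    """image       : the new character as a flat vector (e.g. 256 pixels)
    transformed : the transformed copies of `image` (assumed given)
    training    : list of (vector, label) pairs
    Returns the label of the training vector nearest the hyperplane."""
    nearest = min(
        training,
        key=lambda t: distance_to_affine_hull(t[0], image, transformed),
    )
    return nearest[1]
```

Every new character requires one projection per training example, each costing on the order of the number of pixels times the number of transformations, which is consistent with the remark that dropping the 10,000 perpendiculars was slow.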
One might think that, in view of the great improvement in the error rate, effort would have gone into a computational speed-up of his method, but instead everyone worked still harder on the neural net approach. After considerable effort, the neural net community found that a fast algorithm could be obtained to find salient features which matched the performance of Patrice’s method. I would venture to guess that there was some guiding of the convergence of the neural net to incorporate some features that were understood only after seeing how Patrice’s algorithm performed. Alas, even today Patrice himself insists that neural nets are the better way to go for character identification. Argh!
3. Sampling bias
Vardi was a great statistician and would instinctively move to a simple and clear understanding of how to think about a new statistical problem. Often his way of thinking would involve little mathematical technique, so others could copy his ideas easily, and since he was always very open and generous, he sometimes lost some credit for his insightful ideas. One exception was the area of sampling bias, where he was the recognized expert. Others at this conference will characterize Vardi’s work on sampling bias better than I can, but he and I did solve one cute mathematics problem where the motivation originated in sampling bias. We wrote [3] with Ben Logan. The problem posed by Vardi was to find the class of all cdF’s, $F = F(x)$, for the lifetime of a lightbulb for which the residual lifetime cdF would be a scaled version of $F$, namely $F(qx)$. Vardi wanted to know whether there was an analog to the result for the exponential, $F(x) = 1 - e^{-x}$, that the residual lifetime cdF is the same as
the lifetime cdF itself, for other values of $q$. We showed that the class, $C_q$, of all $F$’s with Vardi’s scaling property is either empty (if $q > 1$), contains a unique element (if $q = 1$), or is uncountable and convex (if $0 < q < 1$). However, even for $q = 0.5$ there is essentially only a single cdF in $C_q$, in the sense that the graphs of all $F \in C_{0.5}$ would fit inside a single pencil-line curve, as we showed. The class $C_q$ gets large as $q \to 0$. Sampling bias plays its role in lifetime statistics, as Vardi emphasized by observing that if one samples lightbulbs or obituaries in the NYTimes, one gets an incorrect sampling of the performance of individuals, since one is looking mostly at long-lived ones. Vardi had great taste for what constituted a good mathematics problem. We all will miss his insights, his leadership, and his great sense of humor.
References
[1] Shepp, L. and Vardi, Y. (1982). Maximum likelihood reconstruction for emission tomography. IEEE Trans. on Medical Imaging 1 113–122.
[2] Vardi, Y., Shepp, L. A. and Kaufman, L. (1985). A statistical model for positron emission tomography. J. Amer. Statist. Assoc. 80 8–35.
[3] Vardi, Y., Shepp, L. A. and Logan, B. (1981). Distribution functions invariant under residual-lifetime and length-biased sampling. Z. Wahrsch. Verw. Gebiete 56 13–38.