Series in Biostatistics Vol.1
Development of Modern Statistics and Related Topics In Celebration of Prof Yaoting Zhang's 70th Birthday
Heping Zhang ian Huang
World Scientific
Development of Modern Statistics and Related Topics
This page is intentionally left blank
Series in Biostatistics Vol.1
Development of Modern Statistics and Related Topics In Celebration of Prof Yaoting Zhang's 70th Birthday
edited by
Heping Zhang Yale University School of Medicine, USA
Jian Huang University of Iowa, USA
V f e World Scientific wb
NewJersev New Jersey • London • Si, Singapore • Hong Kong
Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: Suite 202,1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
DEVELOPMENT OF MODERN STATISTICS AND RELATED TOPICS: In Celebration of Professor Yaoting Zhang's 70th Birthday Copyright © 2003 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-238-395-6
Printed in Singapore.
Preface This proceedings is dedicated to Professor Yaoting Zhang in celebration of his 70th birthday in November 2003. Professor Zhang is an internationally renowned statistician and a Chinese pioneer in the statistical development, applications, and education. During the Fifth International Chinese Statistical Association Conference held in Hong Kong, August 17-19, 2001, Professor Zhang reunited with many of his friends and former students inside and outside China. The idea for this proceedings was conceived on this unforgettable occasion. Inside this proceedings, Dr. Yao and Dr. Li present their interview article with Professor Zhang. In their article, they highlight Professor Zhang's career as a scientist and teacher, and his influence in many fields outside statistics including finance and geology. He is one of the statisticians in China who played a pivotal role in rebuilding statistics as a research discipline after the Culture Revolution. He has trained and inspired several generations of Chinese statisticians. This monograph includes two special invited papers and nineteen invited papers from Professor Zhang's friends and former students. The special invited paper by D. Siegmund and B. Yakir discusses approximation to the distribution of the maximum of certain Gaussian processes. The need for such approximation arises from multi-point linkage analysis for detecting chromosomal regions that harbor genes predisposing a trait of interest, such as a disease in human. The special invited paper by Z.L. Ying introduces an elegant approach for computing variances in a class of adaptive semiparametric statistics. This approach has applications in many incomplete and censored data models. The invited papers encompass a wide range of topics. They cover the following areas: asymptotic theory and inference, biostatistics, economics and finance, statistical computing and Bayesian statistics, and statistical genetics. In the areas of asymptotic theory and inference, L.Z. Lei, L.M. Wu and B. Xie study large deviation and deviation inequalities with L\ norm in kernel density estimation in a p-dimensional space. G. Lu investigates local sensitivity of model misspecification in likelihood inference. Y.S. Qin and Y.X. Wu study empirical likelihood confidence intervals for quantile differences. Q.W. Yao establishes exponential inequalities for spatial processes and uniform convergence rates in density estimation. In biostatistics, F.F. Hu discusses some recent advances in responsev
VI
adaptive randomized designs in clinical trials and industrial applications. L.X. Li studies estimation in linear regression models with interval-censored data. Y.C. Xia proposes a childhood epidemic model with birthrate dependent transmission to study the effect of birthrate on the epidemic dynamics. In economics and finance, T.J. Chen and H.F. Chen introduce a new regression model for studying stock volatility. Z.H. Chen proposes using ranked set sampling in observational economy. D.Y. Xie investigates explicit transitional dynamics in growth models. X.D. Zhu studies the linkage between uncertainty about estimated probabilities of catastrophic events and their high insurance premium. H.F. Zou studies a fiscal federalism approach for optimal taxation and intergovernmental transfers based on a dynamic model. In statistical computing and Bayesian statistics, the paper by M.H. Chen, X. He, Q.M. Shao, and H. Xu proposes a Monte Carlo gap test in computing the Bayesian highest posterior density regions. C.H. Liu investigates methods for accelerating the EM and Gibbs algorithms. The paper by M. Tan, G.L. Tian and H.B. Fang considers the problem of estimating restricted normal means using the EM-type algorithms and the noniterative inverse Bayes formulae sampling. In statistical genetics, J. Huang and K. Wang propose using a semiparametric normal copula model in linkage analysis of quantitative traits. Z.H. Li, M.Y. Xie, and J.L. Gastwirth use optimal design theory to suggest ways for improving the Haseman-Elston regression for detecting quantitative trait loci. Finally, Zhu and Zhang explore structural mixture models and their applications in genetic studies. We are grateful to Qiwei Yao and Zhaohai Li for their invaluable help in editing this monograph, and for conducting the biographical interview with Professor Zhang. We are also grateful to all the contributors for their enthusiastic support of this project. We appreciate the support and assistance of Dr. K K Phua, Dr. Ye Qiang, Ms. Tan Rok Ting, and Ms. Elaine Tham at the World Scientific Publishing Company. We thank Mr. Chang-Yung Yu for his able assistance in assembling the articles. Lastly, but most importantly, as Professor Zhang's students, we thank Professor Zhang wholeheartedly for opening our eyes to the statistical discipline and for teaching us the learning and research skills, and after all, for teaching us how to mentor the future generations. We are proud of you, Professor Zhang and wish you the happiest birthday and a good health. Jian Huang and Heping Zhang, March 2003
Contents
Preface
v
An Interview with Professor Yaoting Zhang Qiwei Yao and Zhaohai Li
1
Significance Level in Interval Mapping David O. Siegmund and Benny Yakir
10
An Asymptotic Pythagorean Identity Zhiliang Ying
20
A Monte Carlo Gap Test in Computing HPD Regions Ming-Hui Chen, Xuming He, Qi-Man Shao and Hai Xu
38
Estimating Restricted Normal Means Using the EM-type Algorithms and IBF Sampling Ming Tan, Guo-Liang Tian and Hong-Bin Fang
53
An Example of Algorithm Mining: Covariance Adjustment to Accelerate EM and Gibbs Chuanhai Liu
74
Large Deviations and Deviation Inequality for Kernel Density Estimator in Li(_Rd)-distance Liangzhen Lei, Liming Wu and Bin Xie
89
Local Sensitivity Analysis of Model Misspecification Guobing Lu Empirical Likelihood Confidence Intervals for the Difference of Two Quantiles of a Population Yongsong Qin and Yuehua Wu
VII
98
108
viii
Exponential Inequalities for Spatial Processes and Uniform Convergence Rates for Density Estimation Qiwei Yao
118
A Skew Regression Model for Inference of Stock Volatility Tuhao J. Chen and Hanfeng Chen
129
Explicit Transitional Dynamics in Growth Models Danyang Xie
140
A Fiscal Federalism Approach to Optimal Taxation and Intergovernmental Transfers in a Dynamic Model Liutang Gong and Heng-Fu Zou Sharing Catastrophe Risk under Model Uncertainty Xiaodong Zhu
156
179
Ranked Set Sampling: A Methodology for Observational Economy Zehua Chen
189
Some Recent Advances on Response-Adaptive Randomized Designs Feifang Hu
205
A Childhood Epidemic Model with Birthrate-Dependent Transmission Yingcun Xia
220
Linear Regression Analysis with Observations Subject to Interval Censoring Linxiong Li
236
When Can the Haseman-Elston Procedure for Quantitative Trait Loci be improved? Insights from Optimal Design Theory Zhaohai Li, Minyu Xie and Joseph L. Gastwirth
246
IX
A Semiparametric Method for Mapping Quantitative Trait Loci Jian Huang and Kai Wang Structure Mixture Regression Models Hongtu Zhu and Heping Zhang
262
272
X
Professor Yaoting Zhang
:-s#*K8SWIfry
.•
,•- • r . J. —•**•. • .a? 11 • ; -wSt 14/
;•-•*.
* • . " »••*
The participants of the Statistics Training Course in Wuhan University in 1980. The course was commissioned by the State Education Commission and had significant impact on the development of statistics in China. Zhang was the fifth from the left in the front row.
XI
In Peking University in 1961.
Professor Paolu Hsu, sitting at the center of the front row, and his students in Peking University. The second from the left in the front row is Zhang.
1, - •
r"*' 5 '^aulSfe.ss
**%$&%$ - v."" }V
3OT§#
. , * . '••'
*
.
'
•
\ ^ 1 f?
-
•
*
\M&&>
INf PPlf
" • ' . • . « »
^ ' WKk*-
Professor Yaoting Zhang and his student in Wuhan University.
A N INTERVIEW W I T H PROFESSOR YAOTING Z H A N G
Q I W E I YAO Department Guanghua
of Statistics,
London School of Economics, Houghton Street, London WC2A 2AE, UK School of Management, Peking University, 100871, Beijing, China Z H A O H A I LI
Department
of Statistics,
George Washington University, 2001 G Street NW., Washington DC 20052, USA Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 6120 Executive Blvd., EPS, Rockville, MD, 20852
Professor Yaoting Zhang was born in Shanghai, China in 1933. He entered the Department of Mathematics at Qinghua University in 1951, and then was transferred to the Department of Mathematics at Beijing (Peking) University. After completing his undergraduate education, he studied probability and statistics under Professor Paolu Hsu in 1955. He and Professor Hsu developed a lifetime friendship. During t h e 'Culture Revolution', he was sent to the Wu-Jiang Hydropower Station in Gui Zhou Province. After the 'Culture Revolution', he joined the faculty at Wuhan University. Professor Zhang is a devoted teacher in statistics and an enthusiastic adviser for his students. His contribution in statistical applications in China ranges over a wide spectrum of areas. At present, Professor Zhang holds a Professorship in the School of Economics of Shanghai University of Finance and Economics. He is also an Adjunct Professor at both the People's University of China and Beijing University.
This interview is based on several conversations with Professor Yaoting Zhang in 2001 and early 2002. Inevitably some editing has been required. The final version was firmed up, with Professor Zhang's approval, on 29 June 2002 in Peking University. Growing U p Yao: Professor Zhang, many thanks for agreeing to talk to us. Could we start by asking you to tell us some of your early days? Zhang: I was born in 1933 in an ordinary family in Shanghai. At that time China was still a society which valued boys above girls. My parents 1
2
were very happy to have me. As the youngest boy with one brother and three sisters in the family, I was treated with extra love. However before long the whole family fled from Japanese invasion after the notorious '8.13' event. The memory of my childhood was to flee from calamities, from Japanese invasion, and from Japanese brutality. Due to increasing difficulty to make ends meet, my mother, my brother and two sisters went back to my parents' hometown in Wujin to live on farm. My father, my second sister and I stayed in Shanghai in order for me to finish my primary school. We managed to live on a single grocery stand until the summer of 1945. After I finished my primary school, we went back to Wujin to join the rest of the family. My family went back to Shanghai after the end of the anti-Japanese war in the fall of 1945. I then entered the Shanghai Middle School, which changed the way of my thinking. The Shanghai Middle School was in a suburb of Shanghai, having boarding students only. The students fell into two clusters: those from wealth families such as bankers, senior officials and capitalists, and those from poor families. I remember clearly that the traditional Chinese-style cottongowns and jackets with buttons in the front were forbidden in the School. For the first time I wore shirts and overcoats. When the communist party took the power in Shanghai in 1949, some of my classmates went to Hong Kong or Taiwan. But most of us stayed in Shanghai. Out of more than 30 students in my class, 10 were admitted to the Qinghua University and the other 10 entered the Shanghai Jiao Tong University in 1951. I entered the Mathematics Department in Qinghua. Actually I did not have a clear idea on what I should do at that time. The three preferred subjects on my application form were mathematics, foreign language and chemical engineer. I was assigned to mathematics. Yao: As one of the very top universities in China, Qinghua can afford to be very selective in recruiting students. You must have a very exciting time in Qinghua. Zhang: When I entered Qinghua, there were 20 students in our class, more than all the students in the second, third and fourth years put together in the Mathematics Department at that time. We were warmly welcomed. The teachers in the department took us to the city for a day-out, visiting the Forbidden City and Temple of Heaven. Each of us was given a lunch bag with bread, sausages and eggs, which was pretty high-class. In those days, tourists typically carried with themselves some steam-buns and salted vegetables. Two of my classmates later became the Fellows of Chinese
3
Academy, and more than 10 were certified PhD student supervisors. Professor Paolu Hsu and Statistics Yao: As both a senior student and a young faculty member, you were very close to Professor Paolu Hsu. Would you please share with us some of your memories of Professor Hsu? Zhang: I met Mr. Hsu Paolu in 1955. Under the arrangement by the Department, I started to learn from him, as a part of the preparation for setting up a new teaching and research section in Probability and Statistics. At that time my own wish was to go to Tibet — to serve my Motherland in the poorest and the most needed place. However I was 'ordered' to stay in the Department as a teacher after the graduation. Mr. Hsu asked me to read two books: "Measure theory" by P.R. Halmos and "An introduction to probability and its applications" (1st edition) by W. Feller. Those two books gave me a foundation from which I immensely benefit in my whole life. In that time, probability and statistics were looked down within mathematical circle. At one stage even Mr. Hsu was thinking to change to algebra. The whole mathematics in China would not be as it is now if not for the publication of "The National Science Outlines" in 1956, which earmarked computational mathematics, differential equations, probability and statistics as the important mathematical subjects for further development. After learned that I was learning probability from Hsu, one of famous mathematician (who was also my teacher and taught me several courses) commented: "Nowadays some people are willing to learn quasi-mathematics, such as probability". This reflects the common view of probability and statistics in most people's mind at that time. In 1956 after participating the First National Science Planning Conference, Mr. Hsu said to me that in the whole country there were only 8 people, including me, working on Probability. This was how we started. In the beginning I learned Markov chains, as Mr. Hsu was interested in some limiting theorems of Markov processes at that time. In the 'big jump movement' in 1958, I realised the great potential of statistical application from doing practical work. Mr. Zhou Hua-Zhang in Qinghua, who came back from the United States together with Qian Xue-Sen, did a lot of work in statistical applications. I was greatly influenced by him. When the Meteorology Department in Beijing University could not find anybody to teach a statistics course, I offered my service. Actually I was learning the subject while teaching it. This is how I got into statistics.
4
Yao: Among your various expertise, you are known as an expert in Multivariate Analysis. Why did you choose this particular area? Zhang: In 1959 Mr. Hsu led a seminar series reading "Multivariate analysis" by Roy and "Mathematical statistics" by Wilks. At that time I learned the Neyman-Pearson theory and some of multivariate analysis methods more systematically. In the second half of the year, Zhang LiQian's work on the triangular scheme for experiment design caught Mr. Hsu's attention. So we changed to experiment design. In the same year the Annals of Mathematical Statistics published a long paper by Bose in this subject. My knowledge on experiment design started with that paper. In the seminar, we also read a small book entitled "Experiment design and analysis of variances" by H.B. Mann. That was the first time I came across to the formula for inverse partitioned matrices. I was impressed that Mr. Hsu could simplify the proof in Mann's book by using this formula. In the erratic 1960's, life was getting harder in Beijing after the rush for ultrasonic and political anti-rightist movement. I did a lot practical work in factories in that period. In order to carry out the forest survey, we read "Sampling Techniques"' by Cochran. The book "Sampling Theory" by Sun Shang-Ze was based on Mr. Hsu's summarised notes. In 1961 Mr. Hsu gave seminar talks in linear models, which was largely reflected in my joint book with Fang Kai-Tai on multivariate analysis. Mr. Hsu also gave talks on the derivation of exact sampling distributions in multivariate analysis, which we found difficult to master at the time. Because of this, I did a lot of exercises, including some rather tricky multiple integrals which I solved using Hsu's methods. Later in 1980 Krishaniah brought us the notes taken by Olkin and Deemer from Hsu's lectures at North Carolina Chapel Hill in 1947, which helped me to understand the technique properly. But with modern stochastic simulation, deriving an exact sampling distribution remains merely as an intellectual challenge nowadays and is losing its practical significance. Yao: This reminds me those lectures and seminar talks you gave us when I was a PhD student in Wuhan. You could always find a simple and often elegant treatment for a complex mathematical problem. This amazed us most! Your exquisite matrix techniques were beyond us. What is the trick to gain those abilities? Zhang: I learned the matrix techniques through several research projects led by Hsu, mainly on the PBIB and BIB schemes in experiment design. We worked on matrices defined on finite fields. The key to solve many problems was to count the number of canonical forms under various
5
transforms. The results were collectively published in Mathematics Advances in 1963 under the pen-name 'Ban-Cheng'. One of the important ideas from Hsu is to think of the invariance of a mathematical problem under certain transform group. Then the problem can be reduced to find the solution for the canonical from under the transforms. The canonical form of linear models is a case in point, which can be found in many books such as Lehmann's. This idea often works very well in experiment design and multivariate analysis. By incorporating this idea into the teaching and textbooks, I was able to provide simple and elegant treatments for some materials in my lectures as well as my books, which was well received by students. When we were poised for further advanced research in 1963 and 1964, the political turbulence such as the socialism education, culture revelation started one after another. The research was almost standstill for whole ten years. When it was all over, I was approaching my fifty already. Yao: I also remember your constant yearnings for the advancement of Applied Statistics and Statistical Applications in China on various occasions. The urgency of doing that must, at least partially, come from your own experiences. Would you like to tell us some of your applied work? Zhang: Since 1958 I have established wide connections with faculty members in other departments in Peking University as well as practical sections in the society. The practical projects which I was involved include the resource survey analysis in Xi-Xia-Bang-Ma area and the causing analysis of yellow earth jointly with the Geology and Geography Department; weather forecast, especially on forecasting the three key factors (namely the length of period, the intensity and the beginning date) for plum rains, and median and long term forecast jointly the Geophysics Department and the National Climatological Bureau; the forecast for earthquake wave; armyworm forecast jointly with the Biology Department; the survey and the forecast of water supply and waste-water treatment jointly with the Institute of Municipal Engineering. The most effective project with long-lasting impact was the foundation design blueprint for the high buildings in QianShang-Meng area in Beijing. Many key technicians in the Institute grew out of this project. In spite of the fact that I was denounced as counterrevolutionary in the 'Culture Revolution', many people from various applied areas still came to me for statistical consultation. In that period, I did some projects related to military techniques. The statistical techniques involved include outliers detection, spline regression, factor analysis etc, which also enhanced my appreciation on those techniques.
6
During the 'Culture Revolution' Li: We all know you suffered badly from the 'Culture Revolution'. Would you mind telling us what happened during those terrible years? Zhang: One thing which I never gave in during the 'Culture Revolution' is to disseminate and to popularise the orthogonal design. During that period, the university students were workers, peasants and soldiers. We had to give lectures in factories, and combined our own expertise with practical problems. So the orthogonal experimental design was an obvious and handy choice. We applied it successfully in real practice in both Peking Analytical Instrument Factory and Yan-Shang Petrochemical Engineering Factory. The results were assembled in a book on orthogonal design published by the High Education Press. Towards the end of the 'Culture Revolution', I, as an assigned counterrevolutionary, could not be accepted by universities. Beijing University sent me to the Wu-Jiang Hydropower Station in Guizhou where my wife was, which, as they thought, was the only place I could go. In October 1976, I left Beijing for Guizhou and ended as a school teacher there for two years. The life was hard, we ate boiled vegetables as we had no oil to stir-fry them. But on the other hand, I got on very well with the colleagues there and was very popular among my students and pupils. In fact with no political pressure and no discrimination, I felt relaxed and happy. During those two years I had time to go back to the work I started in the earlier years and was able to write more research papers. I visited Mao-Tai twice; once was on a business trip and the other was during a Spring-festival holiday in my colleague's home at his invitation. Even now I still cherish those very pleasant days in my life. After 'Culture Revolution' Li: After the 'Culture Revolution', you joined Wuhan University where both Qiwei and I met you at the first place. How would you summarise your major activities at Wuhan University? Zhang: After the culture revolution, I received invitations from several universities. I went to Wuhan University and spent my next 16 years there. When I arrived there I did not feel as energetic as used to be, but I decided to do my best to bring up some young people. At that time the State Education Commission entrusted me to organise and conduct a halfyear national training course in Statistics for university teachers ranked at
7
lecturer level or above. The course was attended by about 40 people from different universities in the country and the invited lecturers included Chen Xiru, Ni GuoXi and Deng Weicai. The course had a significant impact on the development of statistics in China. The second thing worth mentioning is that in 1985 I organized a postgraduate class for 45 students; 15 from Wuhan University and 15 from Central China Normal University plus the other 15 who were sent to Wuhan by other universities. Quite a few of them are now well-accomplished statisticians. Another thing that I was proud of was to put together the national selection examination for students studying statistics in the United States. The project was initiated by Professor George Tiao when I visited University of Chicago in 1983. The idea was to select 30 candidates out of 100 students sitting in the examination every year for PhD studies in American universities. Harvard, Berkeley, Chicago and Wisconsin were jointly responsible for the allocation of those candidates. The aim was to produce high quality applied statisticians for China. After continuous effort for two years, the project was finally brought into effect. Over the years dozens of students went to the United States. Some of them have become influential figures in statistics now. If I have made some contribution to Chinese statistics, it must be that I have encouraged the talented young statisticians coming forth in large numbers. Li: Your recent research interest has shifted to economics and finance. What was your motivation for such a change? Zhang: In early 1990's, I recognized the serious lack of expertise in quantitative economic analysis in China. The modern development of statistics has two backbones: medical science and economics. In terms of statistical application in these two areas, we were far behind. I decided to shift my main focus to introducing the relevant modern analytic methods and the related theory, doing research which is relevant to Chinese reality, and to promote the development of specialists in this area. China had a State-planned economy for a long time, which undermined the market research and development. Obviously the old survey methods based on report forms were no longer adequate. Actually I realised the importance of the sampling survey methods in early 1980's. Together with Wu Hui, the Director of the Foreign Affairs Office in the State Statistics Bureau, we translated Cochran's "Sampling Techniques" into Chinese. At that time the publisher thought that the book would not sell. The book had been held for almost two years before it was published in 1985 under an official intervention. In fact the publisher's initial estimate was proved wrong; the second print appeared within one year of its first publication.
8
To meet the practical needs, I published "Bayesian Statistical Inference" and "Statistical Analysis for Qualitative Data" in early 1990's. These books presented the relevant methods and theory together with some real Chinese examples. In late 1990's, I wrote a few economy-oriented textbooks, including "Theory and Methods for Data Ordering and Quantification", "Utility Functions and Optimisation" and "Information and Decision Theory". In three books, I put economy and mathematics well together. Hopefully the students in economics will find these new textbooks useful and helpful. I am glad to see that my book "Statistical Analysis of Financial Market" was very well received. Some universities listed it as the standard reference book for their entrance examination for graduate studies in finance. By writing this book, I became more familiar with statistical methodologies for analysing financial market. I read the literature from abroad and tried to get myself acquainted with new development in the world. During this process, Zhou Hengpu helped me greatly. He set up the Institute for Advanced Economics Research in Wuhan University which imported many new books and subscribed a number of overseas periodicals. This made it much easier for teachers and students to keep up with the newest development in the world. Many students have graduated from this Institute. While in the late 1980's I focused on reliability theory and its applications (organising a research group with people from six universities, undertaking dozens projects with the results assembled in two specific monographs), in late 1990's I worked on the applications in taxes and finance. Together with my students, we focused on stock price data. Over the years, we carried out research on the measures for stock market risk and on the forecasting the market direction and had obtained interesting results. Most results have been commercialised now; they cannot be published in academic journals. My plan for the rest of my life is to bring up some capable specialists on economic analysis, who can solve problems in finance, public and social security. I am Proud of M y Students Li: Many of your students, including myself, see you as a good friend, an excellent teacher and one of pioneer statisticians in China. You always enjoy interacting with your students. What would you like to say to your past and current students? Zhang: Looking back of my life, I am gratified that I have been involved in statistical applications in various areas, and have made substantial effort to disseminate and to popularize statistics in China. In spite
9
of little contribution on the theoretical side, I take comfort from the fact that many of my students are now well-accomplished theoretical or applied statisticians. They are working hard for our Motherland, or are doing their best to glorify her. I sincerely thank my students. Having progressed a long way ahead in their career, they still remember me — a teacher who happened to introduce them to statistics. I also sincerely thank my friends and teachers. Without their constant encouragement, affirmation and appreciation for what I did, I could not pull myself through those difficult times. Last but not least, I would like to thank my teacher Mr. Hsu Paolu wholeheartedly. Although he has left us for thirty years now, his earnest words, his passion and persistent drive for science, his heart-felt love for the Motherland has always been influential in my life. I regret deeply that I could not adequately accomplish his idea in statistics. But I believe that the achievements of the new generation will comfort his soul in Heaven.
S I G N I F I C A N C E LEVEL IN INTERVAL M A P P I N G
DAVID O . S I E G M U N D Department
of Statistics,
Stanford
University,
Stanford,
CA 94305,
USA
BENNY YAKIR Department
of Statistics,
Hebrew
University,
Jerusalem,
Israel
The false positive rate of a genome scan that uses interval mapping involves the distribution of the maximum of a Gaussian process, for which there are two approximations: one relying on the Rice-Davies formula, which is accurate for relatively sparsely placed markers, and one that is accurate for closely spaced markers. In this paper we combine these two approximations to obtain an approximation that is accurate for both sparse and dense markers. We also give a new proof of the Rice-Davies formula.
1. Introduction Genome scans lead to testing a large number of markers distributed throughout the genome for linkage to a trait of interest. Typically linkage is detected in any region of the genome where an appropriate asymptotically Gaussian stochastic process exceeds a threshold. The genomewide false positive error rate is the probability, under the null hypothesis that no markers are linked to the trait, that the maximum of the stochastic process exceeds this threshold. The stochastic process arises from marker data, ideally placed on an equally spaced grid throughout the genome. Often one also uses interval mapping (Lander and Botstein, 1989), which is a technique based on the EM algorithm to interpolate the observed process between markers. The "markers only" process is asymptotically the discrete skeleton of a non-differentiable, locally Markovian, Gaussian process, either the Ornstein-Uhlenbeck process or a close relative, while the interpolations of interval mapping are very smooth. For mapping quantitative traits in experimental genetics, there are two recommended approximations to control the genomewide false positive error rate. One is based on Rice's formula as applied by Davies (1987) (Rebai, Goffinet, Mangin, 1994, 1995, Dupuis and Siegmund, 1999) for the expected 10
11
number of upcrossings of a smooth Gaussian process. It is appropriate when intermarker spacing is reasonably large (ca. 20 cM) and interval mapping is used to interpolate between markers. The second is based on an approximation to the distribution of the maximum of the discrete skeleton of a locally Markov Gaussian process (Feingold, Brown and Siegmund, 1993, Dupuis and Siegmund, 1999) and is appropriate for closely spaced markers. Of these two approximations the first is overly conservative when markers are closely spaced, while the second is anti-conservative when markers are widely spaced and interval mapping is employed. The goals of this paper are (i) to suggest a combination of these two approximations, which seems to combine the best features of both, and (ii) to give a new derivation of the Davies approximation based on a likelihood ratio transformation along the lines of Yakir and Pollak (1998) and Siegmund and Yakir (2000a, 2000b). 2. Known Results We begin with the one dimensional case of a backcross (or an intercross where the possibiity of a dominance deviation is ignored). Let Xt denote a mean 0 variance 1 stationary Gaussian process with covariance function that satisfies R(t) = 1 — (3\t\ + o(t). A case of particular interest is the Ornstein-Uhlenbeck process, where R(t) = exp(— 0\t\), which arises from the Haldane (no interference) model for crossovers. For markers equally spaced at distance A, for a single chromosome of genetic length L, we have the approximation (Feingold, Brown and Siegmund, 1993) P{ max XiA > i } a l - $(x) + vLxipix),
(1)
0
where ip and $ are the standard normal probability density function and distribution function respectively, and v = ^[a;(2/3A)1/2]. The definition of v(y) is given by Siegmund (1985, p. 83). It is easily calculated numerically, and for 0 < y < 2 an excellent approximation is v{y) w exp(—0.583y). From this result and the independent assortment of chromsomes we obtain the genome wide approximation for an organism having C chromosomes of total genetic length £ = J2i Lc: P{max c
max
x
i% > x) » * " exp{-C[l - $(s)] + vixtp(x)}.
(1')
0
For the numerical results cited below, when we refer to the approximation (1), we have used the exponential form (1') and similarly for the approximations (2) and (6).
12
Let Zt denote the Gaussian process obtained from X , A by interval mapping. Hence ZiA = XiA, and between markers Zt has smooth sample paths. An explicit formula for Zt is given in Section 4. If A is fairly large (depending on /?), say A = 20cM, then Davies' (1987) formula provides a good approximation of the form P{ max Zt > x} < 1 - $(a:) + V ) E W i A i ( i + 1 ) A )
(2)
i
where Nsj denotes the number of upcrossings of the level x by Z during the interval [s,t]. An explicit expression for EA^A,(i+i)A is given below. The approximation (2) is very conservative if A is small, say less than 5 cM. The approximation (1) is reasonable for small A; but it is anticonservative if A is large, since we have failed to account for the interpolation between markers. A still simpler approximation, which is essentially a discrete version of the Rice-Davies approximation is P{ max XiA > x} < 1 - $(x) + (L/A)P{X0
< x, XA > x}.
(3)
0
3. A Combined Approximation The following argument combines (1)- (3) into a single approximation for the distribution of maxZt that is reasonably accurate for both large and small A. Observe that P{ max Zt > x\ < 1 - $(z) + Y^ PI l
0
V
' ~
'
t-J
U
max
Zt > x, max ZjA < x}.
(i-l)A
~
i<%
J
'
(4) Now for each i, P{
max
Zt > x, max ZiA < x] =
(i-l)A
j
y>oo
/
Jo
P(Z(j_i) A ex- dy)[l - P{ max ZjA > x\Z
xP{
max
Zt > x\Zu_i)A
= x - y}\
= x — dy\
(i-l)A
< P(A^(i_i)A,iA > 1) - /
Jo
W{Z(i-i)A ex-dy,
ZiA > x, max ZjA > x)
= P(JV (i _ 1)A , iA > l ) - P { Z ( i _ 1 ) A < x, ZiA > x}+P{ZiA
j<»-i
> x,m&xZjA
< x}. (5)
13
Substituting (5) into (4), using (1) and stationarity, we obtain as an approximate upper bound for P{maxo
x} the expression 1 - $(x) 4- (L/A)[EN0,A
- P{Z0 <x,ZA>
x}} + i/Lx
(6)
An explicit expression for EJV 0 ,A is given in (9) and (11). For a numerical example, we assume the Haldane map function and consider a backcross involving an idealized mouse genome of 20 chromosomes, each of length 80 cM. For A = 20 and a threshold of x = 3.52, the approximations (1), (2) and (6) yield 0.041, 0.051, and 0.052, respectively. For A = 1 and x = 3.79, they are 0.050, 0.077, and 0.051. Similar results hold for recombinant inbreds and for a two degree of freedom designed to detect either an additive or a dominance effect in an intercross. In the latter case for A = 20 and a threshold of x = 3.99, the approximations (1), (2) and (6) yield 0.034, 0.050, and 0.050, respectively, while for A = 1 and x = 4.28, they are 0.051, 0.070 and 0.052. 4. The Interval Mapping Process in the Gaussian Limit As a first step towards our likelihood ratio derivation of Rice's formula, in this section we give an explicit representation for the interval mapping process in the Gaussian limit and evaluate the first two derivatives of its correlation function, which play an important role below. Assume as above that markers are genotyped at loci 0, A, 2 A , . . . , rnA = L. At each such locus a standard normal statistic X^ is computed. Let E be the (m + 1) x (m + 1) matrix of correlations between markers. Denote by W its inverse. Given a putative QTL at locus s, denote by fi the non-centrality parameter of the associated test statistic Xs (had it been computed). Let as be the (m+1) vector of correlations between the putative test statistic Xs and the vector X = (X0, • • •, XmA)'. The log-likelihood ratio function of fi and s when X is observed (but Xs may not be) is equal to »X'Was
-
(p?l2)a'sW<js.
This leads to the generalized likelihood ratio statistic Denote Zs =
(X'W<JS)2/(a'sW<js).
X'Wasl{G'sWasfl2.
The process {Zs : 0 < s < mA} is a zero-mean unit variance Gaussian
14
process with continuous paths. Its correlation function is given by .
.
a'sWat
Note that ts — xZs —x2/2 is a log-likelihood ratio statistic for the distribution of the vector X. The likelihood in the denominator (null hypothesis) is that of the original zero mean distribution. In the numerator (alternative hypothesis) the mean value is \i = x/{a'sW
{o'sW
f{t) =
so for fixed s as a function of t, cov(Zs, Zt) = g(t)/f(t). from the symmetry of W that d gjt)
_ gjt)fjt)
dtf(t)
-
It is easy to see
fjt)gjt)
p(t)
vanishes when t — s. For the second derivative, use d g{t) dtf(t) d f(t)g(t) dt p{t)
=
=
g(t)f(t)
- f(t)g(t) p{t) 2f(t)2g(t) Pit) •
f(t)g(t) + f(t)g(t) p(t)
Plugging in t = s leads to
_pl = dHfit) #m
_ gis) - fis) _ ia'sWasf
fis)
-
i&'sWas)ia'sWas)
ia'sWas
which is negative. 5. Likelihood Ratio Transformation We are interested in approximating the (one-sided) probability of false detection: P(maxo<s x). The plan is to divide the interval [0, L] into segments much smaller than A, and to approximate the continuous process along that finer grid. Let S be the increment in the finer grid. Letting 5 and A converge to zero in various rates may produce a family of approximations
15
that may suit various scenarios. We consider in detail only the case that A is fixed while 5 —> 0. The discussion is mainly heuristic. Denote the index set of the finer grid by X. Let P s denote probability for the Gaussian process described in the preceding section with QTL at s and H = x, and let ls = dPs/dP = xZs—x2/2. A likelihood ratio transformation gives (max Zs > x) = N J E s sex
1
LEt e i
e*«
;ma,x£t > x2/2 tex
which can be re-written sex
?^I±Le[t.-m*xt«t-e.)] Etexet
e
>
x2/2
t€X
We approximate the sth term by replacing X by a local neighborhood of s, say Xs, and applying a Mill's ratio type of large-deviations approximation to get
^"VW^E S sex
max t € i s e"-
E tex
px(Zt-Z,)-
=
X~1ip(x)J2Es max t gi a e £t€l.e*C*-z.> sex
s
(7) where ^7 is the standard normal probability density function. (See, for example, Siegmund and Yakir (2000a) for a rigorous development in a related problem.) Under Wa, the process {x(Zt — Zs) : t G Xs} is Gaussian. From the results of the preceding section we see that the mean vector of this process is {x2(cov{Zt, Za)-l):t€
Xs} « {-p2s(t - sfx2/2
: t e Xs},
and the covariance matrix is {x2cov(Zt-Zs,Zr-Zs)
: (r,t) € XsxXs}
« {/32(t-s)(r-s)x2
: (r,t) e
XsxXs},
provided that the function (3S is continuous in s. Since the rank of the asymptotic covariance matrix is 1, the process has the asymptotic degenerate representation {x(Zt -Zs):t€
Xs) « {{3s(t - s)xZ - (32(t - s)2x2/2
: * € I,},
(8)
for some standard normal variable Z. This approximation is valid when the covariance function can be approximated by a quadratic approximation, i.e. when s is not too close to one of the sparser grid points.
16
6. Rice-Davies Approximation We assume that A > 0 is fixed, while <S —> 0. In this case, the part of the approximation that involves values of s in the vicinity of the points of the sparser grid can be ignored. For a typical s, the approximation of the process {x(Zt — Zs)} by the degenerate process given in (8) holds. Hence the individual terms in the summation in (7) can be approximated by Es
max.tei, e
x(Zt-Z.)
x( Z t -Z s ) J
E
E
max{i.i5+seJsye S {{i:i8+s€I i s}
= E
X{i:
i5+s£Ts}
il0,5x]Z-i2[0,5x]2/2-i 2
ei[f3BSx]Z-i
[0,5x]2/2
exp{-i(^s5x-Z)2}
If x5 —» 0 and if the region Is is wide enough to assure that for most values of Z an approximation of the summation is obtained by integration of a normal kernel, then we have E
PsSx
1 ex
-E{,:i5 + s 6 z s } P{ "
2
^(iPsSx-Z) }
Inserting an edge correction to account for the possibility that ZQ > x and adding over subintervals of I , we obtain from (7) the formula: P ( m a x Z s >x) w 1 - $(z) s€2T
2TT
/ Psds. Jo
(9)
7. Evaluation of (9) In the case where {Xt} is the Ornstein-Uhlenbeck process the matrix W has the explicit form:
W = {wij}»,j=o,
1/(1 - e - ^ A ) i = j = 0,m, (1 + e - 2 ^ A ) / ( l - e- 2/3A ) 0 < i = j < m, - e - ^ A / ( l - e- 2 / 3 A ) | * - j | = l.
Indeed, for u £ { 1 , . . . ,m - 1}, let
W0o Woi Ww Wn
17
The matrix Woo is (u + 1) x (u + 1) and the matrix W\\ is (m —u) x (m —u). Equivalently, take <x„ = (a 0 ,cri)'. Note that e~20A
1 PQWOOCTO = i1 _
3 , P'lWnCTi 1 ll"L ~= l _ e-2/3A'
-e~20A
^ J T , OiWW<J0 e-2/3A' - i " i u - u — ^ _
g_2/3A
-
If follows that / ' '\ixrru ' u I\I a b + (axbx - a0bi (o0
a1b0)e-2,3A '.
We will apply this formula first to the computation of the covariance between Zs and Zt when both s and t are between u and u + A. In this case, a 0 = e " ^ ' - " ' , b0 = e-^s~u\ ax = e^~u\ and h = e^ 3 ""). Consequently, e-0(t+s-2u) W
°'s °t
+
/ e /3(t+s-2u) _ eP(s-i)
=
! _
_
eP(t-s)\e-2f3A
e-2/?A
•
(10)
In particular, when s = t we get e-2/3(t-u) + Wcj
<
t
e-20(A-[t-u})
=
1 _
_
2e-20A
e -2/3A
•
Taking derivative with respect to t and putting t = s gives p-2/3(A-[t-u])
C ^=Px "
a[Wa * "ti
^ _
_
p-20(t-u)
e -2/3A
Next, taking derivatives ivati with respect to t, then with respect to s, and then plugging s = t gives 2
e-2/3(t-u) +
< 7 ^ t = S3 X
e-2/3(A-[i-u]) +
2e-2/3A
! _ e-2/3A
•
Hence, ,2
ft
{o'tWatf A/32 x
- (a'tWat)(a'tW
fe-20(t-u)
e-2/3A)e-2/3A
_|_ e - 2 / 3 ( A - [ i - u J ) _ 2 e - 2 / 3 A ) 2 '
A change of variables leads to
[A(3tdt = t a n - V A ( l - e -^)V2].
Jo Some manipulation of trigonometric identities shows that this last expression is identical to 2tan" 1 {[(l - e - " A ) / ( l + e'^)]1'2}
= 2tan- 1 {[/(l - #)] 1 / 2 },
(11)
18
where 6 is the recombination fraction between 0 and A. In terms of the recombination fraction this is exactly the expression found by Rebai, Goffinet and Mangin (1994) who used the Morgan rather than the Haldane map function. At first this may be surprising, but a simple argument shows this is quite generally true. Suppose we consider another map function with genetic distance r, which is a one to one function of t, say t = t(r). Let tilde's denote parameters computed relative to r. It follows from the expression derived for j3t and the chain rule that PT = Pti(r), so by a change of variable J0 $Td,T — JQ Ptdt. Hence if the markers are assumed to be separated by a given recombination fraction, it makes no difference which map function is used, while if markers are assumed to be a given genetic distance apart, the map function matters.
8. Remarks A similar likelihood ratio argument can be applied to the two degree of freedom statistic, say Zt = (Z\t, Z2t)', of an intercross. In this case under the transformed measure P s the mean of Zs is uniform on a circle of radius x. It would be interesting to let A —> 0 as S —> 0. For an appropriately chosen rate (presumably proportional to x~2), it would be necessary to consider both the smooth process interpolating the markers and the markers only process. We conjecture that this would lead to an approximation in the spirit of (6) that has a rigorous mathematical foundation. The likelihood ratio argument given above applies to a large class of smooth Gaussian processes and fields. The alternave P s is chosen so that the covariance function is unchanged but EsZt = xcov(Zs, Zt). Details for the much more difficult case of Gaussian random fields will be published elsewhere.
Acknowledgments This research was supported by the U.S.-Israel Binational Science Foundation, the National Science Foundation and the National Institutes of Health.
19
References 1. Davies, R. (1987). Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 74, 33-44. 2. Dupuis, J. and Siegmund, D. (1999). Statistical methods for mapping quantitative trait loci from a dense set of markers. Genetics 151, 373-386. 3. Feingold, E., Brown, P.O., Siegmund, D. (1993). Gaussian models for genetic linkage analysis using complete high resolution maps of identity-by-descent. Am. J. Hum. Genet. 53, 234-251. 4. Lander, E. S. and Botstein, D. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121, 185-199. 5. Rebai, A., Goffinet, B. and Mangin, B. (1994). Approximate thresholds of interval mapping test for QTL detection. Genetics 138, 235-240. 6. Rebai, A., Goffinet, B. and Mangin, B. (1995). Comparing power of different methods for QTL detection. Biometrics 51, 87-99. 7. Siegmund, D. (1985). Sequential Analysis: Tests and Confidence Intervals. Springer-Verlag, New York. 8. Siegmund, D. and Yakir, B. (2000a). Tail probabilities for the null distribution of scanning statistics. Bernoulli 6, 191-213. 9. Siegmund, D. and Yakir, B. (2000b). Approximate p-values for local sequence alighnments. Ann. Statist. 28, 657-680. 10. Yakir, B. and Pollak, M. (1998). A new representation for a renewal theoretic constant appearing in asymptotic approximations of large deviations. Ann. Appl. Probab. 8, 749-774.
AN ASYMPTOTIC PYTHAGOREAN IDENTITY ZHILIANG YING Department of Statistics, Columbia University New York, NY 10027, U.S.A.
We propose a simple method for computing asymptotic variances of adaptive statistics via geometrization. It results from the fact that many adaptive statistics can be viewed as projections of their nonadaptive counterparts with variances measuring squared lengths, making the Pythagorean Theorem applicable. The proposed method is intuitive and leads to easy-to-explain variance formulae. Its effectiveness is illustrated with several examples, some of which have been perceived as tricky and difficult.
1. Introduction Many popular parametric and semiparametric statistical models involve certain nuisance parameters (Bickel, Klaassen, Ritov and Wellner, 1993). Statistics for these models are typically derived by first assuming the nuisance parameters to be known, in which cases natural statistics are often easy to obtain, and then by replacing the nuisance parameters by their "best" estimates. Let S(6; 77) denote generically such a statistic with 6 being the parameter of interest and 77 estimating a nuisance parameter 77 (finite or infinite dimensional). A key step for making S(9; 17) useful is finding its asymptotic distribution, or, equivalently, its asymptotic variance whenever the central limit theorem becomes applicable. Often S(9; 77) is a sum of independent and zero-mean random variables and therefore its variance can be evaluated in a straightforward manner. Finding the (asymptotic) variance for S(9;fj), however, can be tricky, particularly if S is a functional of 77. Listed below are a few examples. Example 1.1. Cox's (1972) proportional hazards regression model under rightcensorship. Here one observes T, = min{Tj,Cj}, A^ = I(Ti
21
function of T; conditional on Zj = z has the form X(t\z) = exp(/3'z)A0(<)
( A(t\z) = exp((3'z)A0(t) ),
which is proportional to a baseline (cumulative) hazard Ao (Ao). If Ao were known, then the maximum likelihood score for j3 can be written as "
i-oo
Sph(p; Ao) = n-1'2 ]T /
Zi{Aid/(fi
(1)
»=i,'0
The widely used partial likelihood approach (Cox, 1975) can in a way be viewed as replacing the nuisance parameter Ao by its natural estimator, i.e., to use Spk(A0), where k0(t) = /„' £ " = 1 dNi(s)/ YJi=i exP(P'zi)I(fi>s)Example 1.2. M-statistic for the slope parameter in the simple linear regression model. Write the simple linear regression by Yi = ix + P'Zi + e% where e^ are independent with a common density / . Assuming Etp(e) = 0 and regarding fi as a nuisance parameter, the M-statistic for regression parameter 0 with score function ip is n
SM(P;{i) = n-i^ZiiP(Yi-(i-p'Zi),
(2)
i=l
where fi = fi((3) can be a root to either J^^O^i ~ P'Z% ~~ ") o r 5Z/'(^i — P'Zi - -)/f(Yi - pZi - •)•
Example 1.3. Buckley and James's (BJ) (1979) statistic for censored linear regression. For the censored regression data presented as in Example 1.1 , suppose that the (transformed) survival times are linearly related to the covariates Ti=p'Zi+ei
i = l,...,n,
(3)
where e, are independent with a common distribution (survival) function F(F = 1 — F) but need not have zero mean. Because Ti may not be fully observed, the least squares normal equation for /?, S|s(/3) = n - 1 / 2 Ylkzi ~ Z)(Ti — P'Zi) = 0, cannot be implemented. The BJ extension is to take the conditional expectation of Sis(/3) given the observed data, SBJ(p;F)
= E{Sls((3)
\Ai,Zi,fi,i
=
l,...,n}
22
= »"1/2 TLxVi ~ 2){UTM
+ (i - ^ ) % ^ } ,
(4)
where Tj(/3) — Ti — p'Zi. Thus, the natural adaptive statistic without assuming knowledge of F is SBJ(P,FP), where F@ is the Kaplan-Meier estimate of F from possibly censored data {T,(/3), Aj}.
E x a m p l e 1.4. Koul, Susarla and Van Ryzin's (KSV) (1981) extension of the least squares method. Suppose in the preceding censored linear regression, the censoring times Cj are independent of the (Tj, Zi) and have a common distribution (survival) function G(G). Let tj = e» — \i, where /u is the mean of et. Define 7' = (/i,/?') and Z? = {1,Z<). Then T4 = 7 % . To estimate 7, KSV observed that £{A i f l /<5(f i )|Z i } = E(Ti\Zi) = 7 % , warranting another extension of the least squares normal equation:
SKSV(T,G)
= n-^Y.{^§--izXzi.
(5)
Since G(G) is unknown but estimable by its Kaplan-Meier estimate G(G), they proposed to use the natural adaptive statistic SKSV{I\G). The KSV estimator 7 of 7 satisfies SKSV{I\ G) = 0. Assuming T, > 0 in addition to the independence between Ci and (Yi,Zl), Leurgans (1987) proposed a different extension
sL(T, 6) = n~^ J2{ r -¥- ~ ^'^Zi.
(6)
Note that like the KSV case, the conditional expectation of JQ ' ds/G(s) given Zi is 7 % .
E x a m p l e 1.5. Tsui, Jewell and Wu's (TJW) (1988) bias-corrected least squares method for truncated regression. T J W considered the linear regression model (3) subject to a one-sided truncation, that is, there exist truncation variables Vit which are independent of a, such that only Vi,Ti,Zi for which V* < Ti are observed. Rearrange subindices so that (^11 T\, Z\),..., (Vm, Tm, Zm) are the observed triples. Since E(Tt\Vi, Zi) = p'Zi + [ Js>Vi-P'Zi
sdF(s)/F(Vi
-
p'Zi),
23
a bias-corrected statistic with knowledge of F is
Wft F) = m-" g *{* - W - 5 ^ * - ^ " }•
P)
Without making any parametric assumption, TJW proposed to substitute F in (7) by its product-limit estimator F^ from truncated data (Ti - P'Zi: Vi - (3'Zi), i = 1 , . . . , m; see Woodroofe (1985). Their estimate of /3 is a solution, denoted by (3, to SrjwiP'i Flr) — 0.
2. Pythagorean Identity for Variance Calculation We have been concerned with a general adaptive statistics S(8;fj(8)), derived from S(8; rj) by substituting the nuisance parameter rj with its "best" estimate fj(6), which is obtained with knowledge of 8. In this section, we shall use #o and r/o to denote the true values for 0 and r/, and, unless otherwise specified, fi=7)(8o). Furthermore, for notational simplicity, S(rj) and S(T)) will be used to denote S(8Q; rj) and S(6Q; fj) when there is no ambiguity. If we are convinced that S(fi) is asymptotically normal, as is usually the case, then the task of finding the asymptotic distribution becomes evaluating its asymptotic variance. The variance calculation for S(r?o) is typically quite trivial, as can be seen from Examples 1.1-1.5 given in the previous section. We propose to deal with S(fi) via the following decomposition S(fj) = S{m) + {S(r}) - Sfa,)}.
(8)
We claim that under certain regularity conditions, this decomposition leads to an asymptotic variance identity avar{5(fy)} — avar{5(r/ 0 )} - avar{5(r)) - S(r/ 0 )}.
(9)
Here and throughout, "avar" means asymptotic variance, while "acov" means asymptotic covariance. The asymptotic variance identity (9) comes from the following geometric interpretation. Regarding 5(ryo) as a vector in an appropriate space, S(fj) can then be viewed as the projection of S(rjo) into the linear space spanned by all S(f}), where f\ is any estimator such that n 1 / 2 (r/—%) is asymptotically normal with zero mean. Asymptotic normality allows us to use linear space normed by (asymptotic) standard deviation. Figure 1 illustrates that (9) is a result of the Pythagorean identity.
24
S(rj0)S(rj)
Figure 1. Geometric interpretation of t h e relation (in the asymptotic sense between adaptive and nonadaptive statistics: S(fj) is regarded as the projection of S(rjo) to V spanned by all statistics X satisfying E„T = 0; variance corresponds to squared length.
An interesting phenomenon of (9) is the variance contraction of 5 when rjo is replaced by f). In other words, the additional variation S(fj) — S(r)o) in (8) not only does not inflate, it actually shrinks the variance. The variance identity was derived earlier by Pierce (1982) in the context of finding the asymptotic distribution for test statistics with nuisance parameters. Asymptotic normality for related statistics was studied by Randies (1982). We now give a simple and intuitive derivation for the asymptotic variance identity (9). Let Wi,..., Wn (abbreviated by W) be the observations. Assume that the nuisance parameter 77 is finite dimensional and that fj is an asymptotically efficient estimator of 770- By an asymptotically efficient estimator, we mean that there exists a parametric family indexed by 77, {f(w,r])ij,(dw)}, where // is a measure on the sample space, such that 77 is asymptotically equivalent to the maximum likelihood estimator. Because of the equivalence of 77 to the maximum likelihood estimator, we have V - Vo = J
-if(W;vo) + o {n-1'2) p f(W;vo)
(10)
where / is the gradient of / with respect to 77 at 770 and J is the corresponding Fisher information matrix at 770. The zero-unbiasedness of S{W;rf),
25
EnS(W;r]) = 0, gives, by differentiating with respect to 77 at 770, / S{w;rio)f(iv;r]o)fi(dw)
+ / S(w;r)0)f(w;rjo)fJ-(dw) = 0.
(11)
Taking the Taylor expansion of S(W; fj) at 770 we get = S(W;»ft,)W-Tto)+Op(l) = £S(W;77o)(7?-770) + o p (l). (12) Combining (10) with (12) gives S(W;T))-S(W
;T?O)
acov{5(W; 77) - S(W; 770), S(W;
Vo)}
= ES(W;r,o)J-1E{^^)S'(W;r]0)} = -ES(W; rk)J-lES'(W; r?o) + o(l),
+o(l) (13)
where the second equality follows from (11). On the other hand, from (10) we know that avar(/) — T70) = J - 1 . Consequently, the right-hand side of (13) becomes —&va.r{S(W;fj) — S(W;rjo)} in view of (12). Hence follows (9). The preceding proof is reminiscent of the commonly used algebraic manipulations involved in calculation of the Fisher information. In fact, when S is the efficient score for 9, var{5(?7o)} becomes (up to a scaling factor) the Fisher information for 6 with specified 770, and var{5(r})} turns out to be the Fisher information for 6 with 77 being unspecified. In that sense, (9) is just an extension of this information calculation to general score functions. In general, we believe that S need not be an efficient score, as with Examples 1.2 and 1.4. More interesting models are those with 77 being infinite dimensional. But then proving the asymptotic Pythagorean identity is conceivably more complicated, involving, naturally, some subtle regularity conditions. In Appendix, we shall supply a heuristic derivation of the identity when 77 is infinitely dimensional, ignoring regularity conditions. The examples worked out in section 3 should provide additional evidence to the validity of the variance identity. In practice, it is more advantageous to first use (9) as a short-cut to see the final variance form and to select a good 77 and then furnish with a rigorous proof for the particular problem being delt with. To see the efficiency of S for estimating or testing 9, it is necessary to take into account the asymptotic slope at #0 of the statistic S{9;i)(9)}, regarded here as a function of 9. As we can see, in some situations, S{9; fj(9)} does not depend on the choice of 77 (Examples 1.2 and refex4), while in others, it does (Examples 1.1, 1.3 and 1.5). The former indicates that, in
26
order to increase efficiency for making inference about 9, 17 should be chosen, among all legitimate ones, in such a way that var{5(#o; v) ~ S(6o; 770)} is as large as possible. Here legitimacy means that fj can be viewed as the "best" estimator of r\ under certain parametrization of r\. To illustrate, consider the M-statistic for regression as described in Example 1.2. Then equation ^ i/>(Y$ — fi'Zi — n) = 0 for /i (with P fixed) defines an asymptotically optimal estimator when the error density / is specified only up to J il>(t)f(t)dt = 0, while " ^ / ' / / " D e c o m e s optimal when / is completely specified. 3. Examples 3.1. Cox's Regression
Model
Following the notation and definitions of section 1, the Cox statistic SPh(P; Ao) = Sph(p; A0) + {Sph(P; A0) - Sph(P; A 0 )}. The asymptotic variance of SphiP; Ao) is n _ 1 £Z™=1 E$ Zf2I,f>t-, exp(P'Zi)dA0 by the maximum likelihood theory. Here and in sequel, a®2 for a column vector a denotes aa!. On the other hand, Sph{(3; Ao) - Sph(P; A0) = -n'1'2
/
£ 0
"'
Z i e " ' z < / ^ ^ { A o W - A0(<)},
i=l
whose asymptotic variance is 1 1 / 12^i=l Z '« e {Ti>t)S ,. ,.. dA « / V " P/3'Z; 7°^ n e JO 2-^=1 ^(Tt^t) by the extended Greenwood formula for martingales; see, for example, Gill (1980). Thus the Pythagorean rule (9) can be used to get
<w&r{Sph(P; A0)}
= i TZ.iEjZ°>I(ft>t) eMP'ZJdAo - I C ^ g ^ f f i ^ ' W ) , which agrees with results of Tsiatis (1981) and Andersen and Gill (1982). 3.2. M-estimate
for Linear
Regression
Continuing Example 1.2 in section 1, let jli{P) and faiP) solve, respectively, £ TPM -jJL-pZi) = 0 and £ / ' ( ^ - A - pZi)/f(Yi - /i - /3Z<) - 0. Their
27
corresponding adaptive M-statistics have expansions n
SM{/3;Afc) = SM((3; AO-n" 1 £ ^ { ^ Z ^ i f i k - ^ + o ^ l )
k = 1,2. (14)
Now avar{5^(/3;£t)} = n _ 1 $Z™=1 Z2Eip2(e\) and avar(/ii — /i) = £ t y 2 ( e i ) / W ( d ) } 2 > avar(A2 -/x) = l / E { / ' ( f i ) / / ( e i ) } 2 . The inequality arises from the asymptotic efficiency of maximum likelihood estimation. In view of (14) and (9), avar{S M (/?;£i)} = E ^ e ^ ^ Z f
-
Z2E^2(e1)
avar{S M (/?;£ 2 )} = B ^ i n) f ;- ^ ' " ^ W £ {^/7' (Te i 7) /7/ (7e i7) } 2 i=l
Since it can be shown that the asymptotic slopes of SM(@;p,k(P)) are the same for k = 1,2, the preceding variance calculation reveals that /ti(/3) provides a more efficient statistic (with respect to testing or estimation of (3) than feiP), though the latter is a more efficient estimator of the nuisance parameter \i. This seemingly paradoxical phenomenon has in fact a very simple and natural geometric interpretation: S^(/3; £1) is, asymptotically, the projection of SM(P', A4) into the space Vj. spanned by all SM (/3;/t) with nll2{fi, — fi) converging to normal under the model assumption Exp(ei) = 0, while 5 M (/3; £2) is the projection of SM{P',H) into the space V spanned by SM(P'IP) with n 1 / 2 (/i — fi) converging to normal under the model assumption e; ~ / . See Figure 2. Since V\ is a subspace of V, the length of the projection on V\ is shorter, or, equivalently, S M ( / ? ; £ I ) has a smaller variance. 3.3. BJ Statistic
for Censored
Linear
Regression
We now apply variance identity (9) to derive the asymptotic variance for the BJ statistic SBJ(P\ F) as introduced in Examplel.3. To do so, it suffices to find asymptotic variance for the difference SBJ{P; F) — SBJ{(3; F) since the variance formula for SBJ{P', F) is straightforward. This becomes easier when SBJ is viewed as a functional of the cumulative hazard function A (its estimator A). When a formal differentiation is applied, we know that, with ABJ denoting its "derivative", SBj(P; F) - SBJ(P; F) = J ABj(t)d{A(t)
~ A(i)},
28
Figure 2. V is spanned by T where T satisfies EnT = 0,1) e H; V\ is spanned by Ti where Ti satisfies EvTi = 0, for i) e H\. With H
whose variance is transparent in view of Greenwood's formula (Miller, 1981). Indeed, with a simple integration-by-parts manipulation, we have i>BAP,*)-i>BAP,*)-
^jJ_00-pwJ_00 ds
x
Jt°° f(s)
j i
^
f
^^ %i[«.)'iAW}
which is a martingale (in t) integral. Thus by evaluating its predictive variation we get the following simple variance formula a v a r j S ^ ; ^ ) - SBJ(/3;F)} = J ^ ^ { £ j ^ } d F { t ) , where T0(t) and T^t) are limits of n~l J2P{Ci~P'Zi Z)P(Ci — /3'Zi > t), respectively. Hence,
> t) and n'1
(15) Y,(zi~
avar{5 s ,(/?; F)} = 1 £ ( £ < - ^)® 2 var{A l f l (/3) + (1 - A , ) % ^ | Z<} -J-ooTvtT\
F(t) j ^ W -
The preceding formula can be shown to agree with the variance formula derived by Ritov (1990) and Lai and Ying (1991).
29
3.4. KSV-Leurgans-Zheng
Synthetic
Data
Regression
The main advantage of the KSV method over the BJ method is that the resulting estimator has an explicit form: 7 = ( ^ Z j Z j ' ) - 1 £^ AjTj/G(T;). This formula also implies y/n(j—7) = n(%2 ZiZ^)~1SKSv{l'-, G). Therefore, in order to derive the asymptotic distribution for 7, it suffices to do so for SKSV(1: G). Similarly, analysis of Leurgans's estimator reduces to deriving the asymptotic distribution for 5'L(7,G ! ) defined by (6). As explained in section 3.3 for the BJ statistic, we can also regard SKSV and SL as functional of the cumulative hazard function (estimator) KQ (AQ) of censoring variable C, so that use of martingale integrals for variance calculation can be exploited. In particular, SKsv(r,G)-SKSv(r,G)
1 ^/t°°«d{-Fz(u)} {AldI(fi^t)-I{ti^t)dAG(t)} F(t)G(t)
4E^ 712
and SL(T, G) - SL(T, G) « - j ] T
* 1= 1
{ dI
F{t)G{t)
^ (f,
-
I{ft>t)d^a(t)}.
Here F and Fz are the limits of n " 1 £ P(Ti > *) and n~l £ ZiP(Ti > t). Whence we have the following variance formulae a,va,i{SKSV(-y;G)} 1
"
n *-^
G(fi)
f
l'Zi \ Zi
J
F(t)G(t)
dAG(t),
—c
avar{5 L (7;G)} iZi}
f
Zi
[ft°° Fz(u)du\ -dAo(t), F(t)G(t)
J —c
which agree (via some algebraic manipulations) with the results of Koul et. al (1981) and Zhou (1992). A general method for generating "synthetic data" was due to Zheng (1984), who introduced synthetic responses Ti(G) = Aiip(fi) + (1 - Ai)^i(Tj) with ip and ^ satisfying integral equation oo
/
*(u)dG(u) = t.
(16)
For such Tj(G), he showed that the key unbiasedness equation E{Ti(G)\Zi} = -y'Zj always holds. There are infinitely many pairs (ip,^>)
30
satisfying (16). But one can easily make (16) to have a unique solution by imposing an additional constraint. In particular, \P = 0 or V = * leads to the KSV or the Leurgans statistic. Two general classes, called a-class and c-class, were also introduced there: t
A' a(u)dG(u)
T
, ,
, .
/•'
a(w)dG(w)
r
and
^ =f^h-f
<
*7Tr> *<«> = *<*> + c «
for c class
" '
y 0 G(u) y 0 G(w) where a and c can be arbitrarily chosen functions. Let Sa(G) = 52Zi(fi(G) - yZi) with (tp, * ) from the a-class and let SC(G) be similarly denned with (V', * ) from the c-class. Then we can use the proposed Pythagorean rule to get avar{5Q,(G)}
.it E [ { , 1 ( o ) _ y 4 } 4 ]-_ J r{£^ ! > + f t ( , W 0
F(t):
avar{S c (G)}
See Lai, Ying and Zheng (1995) for a formal proof of the identities. Note that the first formula becomes the asymptotic variance for the KSV statistic when a is chosen to be 0 and the second one becomes that of the Leurgans statistic when c = 0. Fygenson and Zhou (1994) used stratified samples to estimate separately the censoring distribution in the KSV or the Leurgans statistic for the K-sample problem, so that possible dependency between censoring and covariate variables can be accommodated. As a by-product, they observed that such stratification also shrinks variance, or equivalently, increases efficiency, of the corresponding statistic, even when censoring and covariate variables are independent. From the perspective of the Pythagorean rule, this is quite transparent because the stratified product-limit estimates of G makes the variance of the second term in (8) larger or (9) smaller. Again see Figure 2 for a geometric explanation.
31
3.5. TJW Statistic
for Truncated
Regression
We now discuss the bias-correction approach of Tsui, Jewell and Wu to the truncated regression as specified in Example 1.4. Recall that the TJW statistic STJW{P': Fjf) is to replace the unknown error distribution F in (7) by Fir, the product-limit estimate of F from the truncated data set. In this case, we can also explore, as in the analysis of censored regression, natural martingale techniques developed for truncated data; see Kieding and Gill (1990) and Lai and Ying (1991). Analogous to (15), the asymptotic variance
for Srjwtfiff)
-
STJW(P;F)
is Jff2(t)T^(t) x
JtF(u)duF-\t)dF(t), p v
where f 0 (t) and f\(£) are the limits of m- Y, { i m " 1 £ ( Z i - Z)P{Vi - P'Zi < t}. Thus
~ P'zi
^ *}
and
avar{5r J V y(/?;F| r )}
where E and var, the expectation and variance, are both conditional on the Zi. The preceding asymptotic variance can be shown to coincide with formula (18) of Lai and Ying (1992). 3.6. Bivariate Survival Censoring
Function
under
Univariate
Lin and Ying (1993) proposed a simple nonparametric estimator of bivariate survival function when censoring is univariate. Let (Tu,T2i) be i.i.d. with a common survival function F(ti,tz) = P(Tu > t\,T2i > £2)- Under univariate right censorship, observed values are Tji — mm(Tji,Ci), Aj, = I(T-i
p{tlM;G)=-j:I(fr-t^'>-t'\
(17)
where G(t) is the Kaplan-Meier estimator of G(t) = P(Ci > t) regarding Tu V T-n as the censoring times. Let S(ti,t2;G)=n1/2{F(ti,t2;G)-F(ti,t2)}. They showed that S is asymptotically normal with variance avar{5(ti,£ 2 ;G)} = scv&i{S(t1:t2;G)}
- avar{5(
S{h,t2;G)}. (18)
32
Note that the first term on the right-hand side is a sum of independent zero-mean random variables and the second can be written as a martingale integral. Because one can also use Tu or T2i as the censoring variable in constructing an estimator for G, alternative estimator to (17) can easily be obtained so that (18) still holds. From (18), it follows that one would use as inaccurate an estimator for G as possible so that the resulting variance can be made smaller. In fact, this is behind the more efficient modification proposed by Tsai and Crowley (1998). 3.7. Doubly
Censored
Regression
Nonparametric maximum likelihood estimation of a distribution under double censorship has been studied by many people, including Turnbull (1974), Tsai and Crowley (1985), Chang and Yang (1987), Chang (1990) and Gu and Zhang (1993). For the regression model (1.3), double censorship means that there are random censoring intervals [C/% C^] such that only Di = max{min(T i ,C/ ? ),Cf}, Af = I(Ti>ct) anc * ^ = A ^ ^ c f ) a r e °^" served. If /3 were known, then one would apply the E-M algorithm to Di{(3) = Di — (31'Zi, Af and A ^ to obtain the nonparametric maximum likelihood estimate Fic of the error distribution F. Note that an analogue to the BJ statistic is Sdc((3; F) = £ " = 1 ( Z i - Z)D*{(3; F), where
Thus from the Pythagorean rule, we can anticipate that the asymptotic variance for 5
•
- Z)®2vzr{D*((3; F)\Z{} - avarfS^/?; F*c) - Sdc(P; F)}.
1
Furthermore, the formula for avar{5dc(/?; F£c) — Sdc(f3; F)} can be obtained in principle from weak convergence of yfn{Fic(-) — F{-)} to a Gaussian process (Chang, 1990; Gu and Zhang, 1993). 4. Remarks We have introduced a general principle for evaluating asymptotic variances of certain adaptive statistics. It is motivated by a geometric interpretation that many such statistics may be viewed as projections of their nonadaptive
33
counterparts, which are more manageable. The resulting variance identity is simply a reinterpretation of the ancient Pythagorean Theorem for rightangled triangles. When the nuisance parameter is finite dimensional, a rigorous proof of this principle is provided. For more useful situations in which nuisance parameters are infinite dimensional, a heuristic derivation is supplied. Fully rigorous justification may be done under appropriate regularity conditions using compact differentiability and nonparametric maximum likelihood estimation. See, for example, Gill (1989). Furthermore, we have verified that the variance identity indeed holds for several examples for which the asymptotic variances were known. For practical purposes, (9) can be served as a short-cut to see final asymptotic distributions and as a guideline for rigorous proofs. A main advantage of the new method is its simplicity. It can provide, instantaneously, asymptotic variance formulae for rather complex statistics. Many of the examples we discussed are not trivial, but were treated quite effectively by this method as illustrated. In addition, we also see the emergence of unification. One may notice that almost all the examples we used arise from modeling and dealing with incompletely observed data. The reason for this is largely due to the fact that, to incomplete data, adaptive statistics are often very natural. However, there are other interesting statistics for complete data that involve nuisance parameters and satisfy (9). One such example is a method proposed by Robins, Mark and Newey (1992) in their study of a certain regression model for exposure effects in epidemiological studies. One can see that the asymptotic variance formula for their adaptive statistic has form (9). On the other hand, the examples discussed for incompletely observed data are far from being exhausted. For right- and left-truncated data, as discussed briefly in Gross and Huber-Carol (1992), we believe the same rule can be used. Other situations, such as right-censorship with possibly missing censoring indicators (Lo, 1991) may be treated likewise. There are less standard situations in which the asymptotic Pythagorean principle may still be applicable. For example, Cheng, Wei and Ying (1995) proposed a U-type estimating equation for the regression parameter in the family of semiparametric transformation models. When it is subject to independent right censorship, the asymptotic variance identity (9) still holds from their results, even though the estimating function is not in the form of an independent sum. When the estimator of nuisance parameter does not have root-n convergence rate, as in the case of interval-censored Cox model
34
(Huang, 1996), it is conceivable that the identity could hold in some cases, but may not be generally true as it is much more complicated to define efficient estimation for the nuisance parameter. Acknowledgements The manuscript was initially written when the author was with Department of Statistics at the University of Illinois at Urbana Champaign. The author was grateful to the Department for providing a collegial environment. He also thanks Professor James M. Robins for helpful discussions. The research was supported in part by grants from the National Science Foundation and the National Security Agency.
Appendix Clearly, to show (9), it suffices to show acov{5(^) — S(rjo), S(rjo)} = —a.va,r{S(fj) — S(r]o)}. For notational simplicity, assume dim{5(r;)} = 1. We again would like to point out that the derivation given here does not maintain a full rigor, but should provide convincing evidence that (9) is true when suitable regularity conditions are imposed. Taking the von Mises differential (Serfling, 1980), denoted by h, of S with respect to 77 at the true parameter value rjo, we get S(fj) - S(rjo) = (h,fj-
rjo) + o p (l)
= <M)-(Mo}+oP(l),
(19)
where (h, •} is the linear functional induced by h. Define v = (h, 77), so UQ = (h, rjo). Since 77 is the "best" estimate of 77, under appropriate parametrization, say W ~ /(• ; ^ ) , v is asymptotically equivalent to the maximum likelihood estimator. In other words, with J{u) = J f2(w;i/)/f(w;v)dw,
In view of (19) and (h, fj — rjo) = 0 — VQ, acov{5(?}) - ^(r/o), S^o)} = acov{r> - ^0,5(770)}
"
1 J(p0)
= jr^r
r/Qt>o)
Qfw/
a
5( ;r?o) f a c o vl /\7m^v } (W>o)' ^
J AW, "o)S(w; Vo)dw.
(21)
35
Now 77 = r){v) and E„S(W;r)(v)) = 0 for every v. Therefore by differentiating the preceding equation with respect to v, we get j f(w; v0)S(w;
Vo)dw
= ~E^S(W;
n[u))
\V=VQ
.
(22)
Since v = (h,r](u)), we have 1 = (/i, 77(1/0)). By the chain rule,
E±S(W-V(v))l=i/o=(h,r,M)
=h
recalling t h a t h is the von Mises differential of S. This, together with (21) and (22), implies acov{5(7)) - S(770), S(T70)} = 7 7 ^ -
(23)
On t h e other hand, J _ 1 ( J / O ) = avar(/> — VQ) = avar{5(r}) — S(»7o)}. From this and (23) we get acov{5(77) - 5(770), 5(77o)} = -avar{5(77) - 5(770)}.
References 1. Anderson, P. K. and Gill, R. D. (1982). Cox's regression model for counting processes: a large sample study. Ann. Statist. 10, 1100-20. 2. Bickel, P.J., Klaassen, C.A.J., Ritov, Y. and Wellner, J.A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore. 3. Buckley, J. and James, I. (1979). Linear regression with censored data. Biometrika 66, 429-36. 4. Chang, M.N. (1990). Weak convergence of a self-consistent estimator of the survival function with doubly censored data. Ann. Statist. 18, 391-404. 5. Chang, M.N. and Yang, G.L. (1987). Strong consistency of a nonparametric estimator of the survival function with doubly censored data. Ann. Statist. 15, 1536-47. 6. Cheng, S.C., Wei, L.J. and Ying, Z. (1995). Analysis of transformation models with censored data. Biometrika 82, 835-845. 7. Cox, D. R. (1972). Regression models and life tables (with discussion). J. R. Statist. Soc. B 34, 187-220. 8. Cox, D. R. (1975). Partial likelihood. Biometrika 62, 269-76. 9. Fygenson, M. and Zhou. M. (1994). On using stratification in the analysis of linear regression models. Ann. Statist. 22, 747-762. 10. Gill, R.D. (1989). Non- and semi-parametric maximum likelihood estimators and the von Mises method (part 1). Scan. J. Statist. 16, 97-128. 11. Gross, S.T. and Huber-carol, C. (1992). Regression models for truncated survival data. Scan. J. Statist. 19, 193-213.
36 12. Gu, M. and Zhang, C-H. (1993). Asymptotic properties of self-consistent estimators based on doubly censored data. Ann. Statist. 2 1 , 611-624. 13. Huang, J. (1996). Efficient Estimation for the Cox Model with Interval Censoring. Ann. Statist. 24, 540-568. 14. Koul, H., Susarla, V. and Van Ryzin, J. (1981). Regression analysis with randomly right censored data. Ann. Statist. 9, 1276-88. 15. Lai, T. L. and Ying, Z. (1991). Large sample theory of a modified BuckleyJames estimator for regression analysis with censored data. Ann. Statist. 19, 1370-402. 16. Lai, T. L. and Ying, Z. (1992). Asymptotic theory of a bias-corrected least squares estimator in truncated regression. Statistica Sinica 2, 519-39. 17. Lai, T. L., Ying, Z. and Zheng, Z. (1995). Asymptotic normality of a class of adaptive statistics with applications to synthetic data methods for censored regression. J. Multivariate Anal. 52, 259-279. 18. Leurgans, S. (1987). Linear models, random censoring and synthetic data. Biometrika, 74, 301-9. 19. Lin, D. Y. and Ying, Z. (1993). A simple nonparametric estimator of the bivariate survival function under univariate censoring. Biometrika 80, 573581. 20. Lo, S.-H.. (1991). Estimating a survival function with incomplete cause-ofdeath data. J. Multiv. Anal. 39, 217-235. 21. Miller, R. G. (1981). Survival Analysis. Wiley, New York. 22. Pierce, D.A. (1982). The asymptotic effect of substituting estimators for parameters in certain types of statistics. Ann. Statist. 10, 475-8. 23. Randies, R.H. (1982). On the asymptotic normality of statistics with estimated parameters. Ann. Statist. 10, 462-474. 24. Ritov, Y. (1990). Estimation in a linear regression model with censored data. Ann. Statist. 18, 303-28. 25. Robins, J.M., Mark, S.D. and Newey, W.K. (1992). Estimating exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics 48, 479-95. 26. Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York. 27. Tsai, W.-Y. and Crowley, J. (1985). A large sample study of generalized maximum likelihood estimators from incomplete data via self-consistency. Ann. Statist. 13, 1314-1334. 28. Tsai, W.-Y. and Crowley, J. (1998). A note on nonparametric estimators of the bivariate survival function under univariate censoring. Biometrika 85, 573-580. 29. Tsui, K-L., Jewell, N.P. and WU, C.F.J. (1988) A nonparametric approach to the truncated regression problem. J. Amer. Statist. Assoc. 83, 785-92. 30. Turnbull, B.W. (1974). Nonparametric estimation of a survivorship function with doubly censored data. J. Amer. Statist. Assoc. 69, 169-73. 31. Woodroofe, M. (1985). Estimating a distribution function with truncated data. Ann. Statist. 13, 163-77. 32. Zheng, Z. (1984). Regression Analysis with Censored Data. Ph.D. disserta-
37
tion. Columbia University. 33. Zhou, M. (1992). Asymptotic normality of the 'synthetic data' regression estimator for censored survival data. Ann. Statist. 20, 1002-21.
A M O N T E CARLO G A P T E S T IN C O M P U T I N G H P D REGIONS
MING-HUI CHEN Department
of Statistics,
University of Connecticut, Storrs, CT 06269, USA
215 Glenbrook
Road,
XUMING HE Department
of Statistics,
University of Illinois, 725 S. Wright, Illinois 61820, USA
Champaign,
QI-MAN SHAO Department
of Mathematics,
University of Oregon, Eugene, USA
OR
97403-1222,
HAIXU Department
of Statistics,
University of Connecticut, Storrs, CT 06269, USA
215 Glenbrook
Road,
We consider estimation of the Bayesian highest posterior density (HPD) regions for a parameter of interest and propose a Monte Carlo gap test based on a random or dependent sample of the relevant parameter generated from its posterior distribution to determine whether the Bayesian regions contain one or more intervals. We also provide a simple Monte Carlo procedure to determine the exact form of the HPD region based on the outcome of the gap test. The basic theory is developed and a simulation study is conducted in examining the performance of the Monte Carlo gap test.
1. Introduction Highest posterior density (HPD) region is a useful summary of a posterior distribution for some parameter of interest. Suppose the parameter is onedimensional. As discussed in Box and Tiao (1992), an HPD region has two main properties: (a) the density for every point inside the region is greater than that for every point outside the region; and 38
39
(b) for a given probability content, say 1 - a, the region is of the smallest size. Unlike other Bayesian posterior quantities, such as the posterior mean and the posterior variance, it is typically computationally intensive to compute the HPD region unless one is dealing with a simple model such as the standard normal model. Consider a Bayesian posterior density of the form <0, tp\D) = ^
W
(1)
where D denotes data, the parameter 8 is one-dimensional, and
= a / 2 and I I ( 0 ( 1 - Q / 2 ) | £ > ) = 1 - a / 2 .
When ir(8\D) is symmetric and unimodal, the Bayesian credible (0(a/2)#(i-a/2)) i s a i s o a n H p D m t e rval. However, when TT(8\D) symmetric, (#( a / 2 ),0( 1 - a / 2 )) is not an HPD interval in general. case, an HPD interval or region is more desirable, as it better the important features of the posterior distribution than does a interval. A 100(1 — a)% HPD region for 8 is given by R(ira) = {8: ir(8\D) > na},
(2)
interval is not In this displays credible (3)
where 7rQ is the largest constant such that P{9 € R(na)) > 1 — a. In (3), R(na) can be reduced to a 100(1 - a)% HPD interval for 8 when n(8\D) is unimodal. Due to the recent advances in computing technology and the development of Markov chain Monte Carlo (MCMC) sampling algorithms, several Monte Carlo algorithms have been proposed for computing Bayesian HPD intervals or regions. In particular, Tanner (1996), Wei and Tanner (1990),
40
and Hyndman (1996) provide a Monte Carlo (MC) algorithm to calculate the content and boundary of HPD regions. However, their MC algorithm requires evaluating the marginal posterior densities analytically or numerically. The implementation of their algorithm is also quite complicated and computationally intensive. Chen and Shao (1999) propose a simpler MC method for computing HPD intervals. Their method does not require knowing the closed forms of ir(d\D) and U(9\D), and can be applied to compute HPD intervals for the parameters of interest and also for functions of the parameters. However, the algorithm proposed by Chen and Shao (1999) requires the unimodality of the marginal posterior distribution TT(6\D). We note that the marginal posterior density ir(6\D) is given by
*(B\D) = J TT(0, V\D) d
P\D)K(0,
if) dip,
(4)
where the normalizing constant c(D) is often unknown and the analytical evaluation of the integral may not be available. This is particularly true when
Quantile
Approach
This approach was first introduced by Wei and Tanner (1990) and further elaborated by Hyndman (1996). Let { ( 0 , , ^ ) , i = 1,2, . . . , n } denote a
41
random (or MCMC) sample from ir(9,tp\D). Then, {6U i = l , 2 , . . . , n } is a random (or MCMC) sample from ir(6\D). Also, let & = ir{9i\D) for i — 1,2,..., n, and let f([Q„]) be the [an]™ smallest of {&}. Define Rn(a) = {9:
TT(9\D) > £ ( [ c m ] ) },
(5)
which is the set of 9's such that their density values are greater than or equal to £([Qn]) • In Hyndman (1996), it was stated that Rn{a) converges to R (7rQ) given in (3), as n —* oo when {9i, i = 1,2, . . . , n } is a random sample from TT(9\D). However, no formal proof was provided by Hyndman (1996). Wei and Tanner (1990) proposed a Monte Carlo method for calculating the boundary of an HPD region via data augmentation. Hyndman (1996) discussed how to compute Rn(a) using the contour function for a bivariate HPD region and the grid method with spline interpolation for a univariate HPD region. 2.2. Sampling
Quantile
Approach
Assume that 7r(#|.D) is unimodal and {(#,, y?j), i = 1, 2 , . . . , n) is an MCMC sample from the joint posterior distribution 7r(#,
Algorithm
Step 1. Obtain an MCMC sample {9t, i = 1,2,..., n} from n(6\D). Step 2. Sort {#j, i = 1, 2 , . . . , n} to obtain the ordered values: 0(1) < 0(2) < • • • < 0(n)Step 3. Compute the 100(1 — a)% credible intervals Rj(n) = (0O-),0(j+[(l-a)n])) for j = 1,2, . . . , n - [(1 -a)n]. Step 4. The 100(1 - a ) % HPD interval is the one, denoted by Rj*{n), with the smallest interval width among all credible intervals.
42
Under certain regularity conditions, Chen and Shao (1999) showed that the above procedure is asymptotically valid. We state this result in the following proposition.
Proposition 2.1. Assume that {Oj, i = 1,2,..., n) is an ergodic MCMC sample from n(9\D). If ir(9\D) is unimodal and min (\ir(9u\D) - ir(8L\D)\ + \U(9u\D) - U(dL\D) - (1 - a)|) SL<SU
has a unique solution, then Rj- (n) —> R(TTa) a.s. as n —> oo, where R(ira) is defined in (3). The proof of this proposition can be found in Chen and Shao (1999). Chen and Shao (1999) also considered the extension of their method to the multimodal case. The following conjecture is given in Chen and Shao (1999).
Conjecture 2.1. Assume that TT(9\D) has at most two modes. Let {61,62, • • • ,0n} be a sample from n(9\D) and let 9^ be the ordered values of the 9j. For 0 < a < 1, denote D=
min 0<m<[(l-a)n] i+m<j
min 0
U (9(i-+m-) - 9(i-)) + (%«+[„ Q ]- m *) - % • ) ) = D, then (%.), 9(i.+m.)) U (9(j-), #(j*+[na]-m*)) »s an approximate HPD region for 9. It is expected that when TT(9\D) is unimodal, (#(j-), #(i«+m-)) U c a n De (9Q*), ^(j*+[na]-m*)) automatically reduced to one interval. Similarly, this conjecture can be extended to cases where 7r(#|.D) has more than two modes. However, the question remains open as to how one can test the unimodality of -K{6\D) based on an MC sample {9i, i = 1,2,... , n } . We will address this issue in the next section. 3. Monte Carlo Gap Tests To develop a Monte Carlo gap test, we first show that the density quantile estimate of Rn{a) given in (5) is asymptotically correct. That is, Rn{a) —•
43
R(na), as n —> oo, where R(-!ra) is defined in (3). We formally state this result in the following theorem.
Theorem 3.1. Let 7r„ = £([„„]) denote the lower or*1 sample quantile of &=7r(0i|I?). Then, 7T„ —> 7T Q
OS
71 —> O O .
Consequently, Rn(a) —> fl(7rQ), as n —> oo. Proof: First note that 7rQ > 0 for any a > 0. Since 7rn is the sample quantile of {ir(8i)} and 7rQ is the corresponding population quantile, the theorem follows from the standard quantile convergence result. • Let nn(a)
= {di:
Zi=Tr(ei\D)>Z{[an])},
(6)
be the set of #j's such that their density values are greater than or equal to £([cm])- Denote
^ = sa Sin, JW* and °"W = !a 2& J6^-
6 n)
<7)
R e m a r k 3 . 1 . If n(G\D) is unimodal, then R{-Ka) reduces to an interval, called the HPD interval. In this case, we have Ra(n) = (et(n),
9ua{n)),
(8)
and by Theorem 3.1, ( ^ ( n ) , 0^{n)) is a consistent estimator of the HPD interval R(Tra).
Definition 3.1. Let R(na) denote a 100(1 - a)% HPD region. There are no gaps in R (na) if there exist constants a < b such that P(R(na) = (a,b)\D) = l. Roughly speaking, Definition 3.1 implies that if there are no gaps, R (ira) reduces to an interval. On the other hand, if R (7ra) consists of several disconnected intervals, then there are gaps. For example, if ir(6\D) is bimodal, typically R (na) consists of two intervals, and therefore there is a gap inside R(-Ka). But, we notice that it is not necessary that a bimodal distribution
44
always leads to a gap inside R(-7ra). Whether there is a gap in R(TTa) depends on the shape of the density and the choice of 1 — a. For example, if the density values are similar at the two modes, then the larger 1 — a or the shorter the distance between the two modes, the less likely there is a gap. Next, we introduce a useful lemma.
Lemma 3.1. Let {Xi, 1 < i < n} be a random sample of size n from a population distribution F. Suppose that the density function f of Xi exists and is continuous on R1 with f(x) > 0 for x in the support of Xi. Assume that for any a < b, there exist C > 0,r > 0 such that \f(x) — f(y)\ < C\x — y\r for any a < x < y < b. Let Xn
max
< logn + x) -> e -< a a - a i > e ~*. (9)
(*„,* - Xn,i-i)f(XnJ
ain
Proof: The outline of the proof is given as follows. Let Ui = F(Xi) and let {&, 1 < i < n + 1 } be i.i.d. exponential random variables with mean 1. Put Sk = ]Ci=i&- Then Ui are i.i.d. uniformly distributed random variables over (0,1) and {Un^, 1 < i < n) and {Si/Sn+i, 1 < i < n} have the same distribution. Observe that F(Xn,i)
— F(Xnti-i)
= f(Xnti
- Si)(Xnti - Xn^-i),
(10)
where 0 < Si < Xn^ — Xn^\. It is easy to see that for any 0 < c*i < a.2 < 1, there exist a\ < a? such that ai < Xnt[ain]
< X ni [ a2 „] < a 2 a.s.
as n —> oo. Thus, by (10) and by the assumption that / is continuous and positive in the support of Xi max
\X„ti —
Xnti-i)
oc\n
= 0(1)
max
( F ( X „ , J ) - F ( X „ , i _ 1 ) ) = 0 ( n - 1 l o g n ) a.s.
Therefore, by the Lipschitz condition for / n
max
= 0{n)
(Xn^ - Xn^-i)f{Xn^) max
oc\n
- n
max
{Un^ - Un^-i)
{Xn
(11)
= o(l) a.s.
45
It is easy to show that n/Sn+i n
max
—> 1, and £i/Sn+i —
max
£j —> 0.
(12)
Furthermore, p ( m a x Q i n < i < a 2 „ 6 < l o g n + x) = (1 - e -0°8»+*))(<w-"i)" /
\(a2-ai)n
fl-e~x/n)
=
_
—»• exp(-(a2 - ai)e
x
)
as n —> oo. The lemma then follows from (11), (12), and the fact that {Un,i, 1 < i < ft} and {Si/Sn+i, 1 < i < n} have the same distribution. • Let a n = a i j n n denote the number of 0j's that are less than equal to 0a(n) a n d bn = a2,nn denote the number of 0j's that are less than equal to 0%(n), where £(n) and 9%(n) are defined in (7). We are led to the following theorem.
Theorem 3.2. Assume that TT(8\D) is continuous and R (ira) has no gaps. Also assume that for any a < b, there exist C > 0, r > 0 such that \TT(0\D) — 7r(0*|£>)| < C\6-6*\r for any a < 9 < 9* < b. Suppose {6^ i = l , 2 , . . . , n } is a random sample from tr(9\D). Let 9n>\ < #n,2 < • • • < Bn,n be the order statistics. Then, P(n
max (0n>i - 0„,i-i)7r(0nii|£>) < logn + x) - e ^ 1 " 0 ^ - * .
(13)
o„
Proof: Notice that when R(na) has no gaps, by the definition of £ln(a), 9nti € Cln(a) for all i's such that an < i < bn. Also, it is easy to show that »2,n — &\,n = (bn — an)/n —> 1 — a as n —> oo. The rest of the proof follows from Lemma 3.1. • Determining whether there are any gaps inside a 100(1 — a)% HPD region R(ira) is a standard hypothesis problem. That is, we wish to test a null hypothesis i?o : n o gaps in R(na) versus an alternative hypothesis Ha: one or more gaps in R(TTa). Theorem 3.2 directly leads to the following Monte Carlo gap test. Monte Carlo Gap Test: Step 1: Generate a random sample from {9i, i — 1, 2 , . . . , n} from ir(9\D). Step 2: Compute & = n(9i\D) and the a lower sample quantile £([««])•
46
Step 3: Sort {0< : 0( e O n (a)}, where Qn(a) = {0t : ir(di\D) > £([cm])}, to obtain the ordered values denoted by 0 n „ , l < #n Q ,2 < • • • <
9na,na,
where na is the size of Q n (a). Step 4. Compute the test statistic T = n max (enati-enaii-i)ir(enQii\D)-logn.
(14)
l
Step 5. Compute the asymptotic p-value p-value = P{T > f ) = 1 - e ^ 1 - ^ " ' * ,
(15)
where t* denotes the observed value of T given in (14). Step 6. If p-value < a*, a prespecified level of significance, we reject Ho, and hence, we conclude that R(-7ra) has one or more gaps. If p-value > a*, we do not reject HQ, and therefore, we conclude that there are no gaps in R(ira). Remark 3.2. Suppose that the gap test leads to the rejection of Ho and assume ia is the integer such that (Ona,ia
- 0na,ia-l)ir(9na,iJD)
=
m a x
(&na,i ~
dna,i-l)A^na,i\D)-
l
Then, the lower limit of the gap is 0n a ,t„-i and the upper limit of the gap is 0naiia- In this case, we can further apply the gap test for # n Q , l < Qna,2 <: • • • <
&na,ia-li
8na,ia
8na,nQ,
and < 9na,2
< ••• <
respectively. Let tj and *£ denote the respective observed values. Then, the p-values can be approximated by 1
_e-((i«-l)/n)e-t!
a n d
1
_ e - ( ( n „ - i „ + l)/n)e-t5)
respectively. We continue this process until Ho is accepted. We notice that a large sample size is required when the number of gaps is large. We also caution making a large number of such tests without controlling for the false positive rate.
47
Remark 3.3. Suppose n(6\D) is bimodal and R(ira) has exactly one gap. Assume that the gap test in fact detects the gap. Using the same notation introduced in Remark 3.2, as a byproduct, R(na) can be approximated by (#na,l,
# n Q , « Q - l ) U (0na,ia,
@na,na)-
Remark 3.4. In practice, -K(9\D) is unknown. In this case, we propose to use a Monte Carlo estimate rr(9\D) to replace TT(6\D) in the gap test. There are several density estimation methods available in the literature, including the kernel density estimate, the conditional marginal density estimate (CMDE) (see, for example, Gelfand, Smith and Lee (1992)), and the importance-weighted marginal posterior density estimate (IWMDE) of Chen (1994). Let {9t, i = 1,2,...} denote a random sample from ir(0\D). Then, the kernel density estimate has the form
WW
= ;i-i>(^),
(16)
where the kernel K, is a bounded density on R1, hn is the bandwidth, and 9 is a point in R1. As recommended by Silverman (1986), if a Gaussian kernel, i.e., K.(9) = (l/v / 27r)e -61 / 2 , is used, a good choice of hn is 1.06cr*n_1/'5, where a* is the sample standard deviation of the 0j's. Let {(#,, ) = - y > (%><,/?), n *—i
(17)
i=i
where 7r(^|<^, D) denotes the conditional posterior density of 9. We note that in (17), ir(9\ip,D) needs to be completely known. Here, "completely known" means that n(9\(fi,D) can be evaluated at any point of {9,
48 TT(8\D)
takes the following form:
^ . W ) - -L^I^)Z(^^DRC^)'
(18)
where w(6\ip) is a completely known conditional density whose support is contained in or equal to the support of the true conditional density TT(0\
49
400
500
1000
3000
5000
7000
Figure 1. Estimates of Type I Error Probabilities (a*) under the Bimodal Normal Distribution (left) and under the Bimodal Cauchy (right) for various sample sizes.
t
1
Sampta SizftilOOO
SunptoSiza=IM)0
Sample S U M K K X )
Sampta Size=5000
Figure 2. Estimates of the Powers (/3) of the Gap Test under the Bimodal Normal Distribution (left) and under the Bimodal Cauchy (right).
Third, we investigate how accurate our HPD region estimate is based on the gap test. We choose 7r(0|.D) to be the bimodal normal 0.5 * [iV(-2.05,1) + ./V(2.05,0.25)] density. The Monet Carlo sample size is n = 5000. We repeat the calculations 1000 times. In this case, the 95% HPD region contains two intervals. Therefore, we have four end-points. The true density and the estimates of these four end-points are plotted in Figure 3. A numerical summary of the boxplots shown in Figure 3 is given
50
End Point Estimate
Figure 3. Boxplots of End-Point Estimates of HPD Regions (left) and the True Bimodal Probability Density Function (right).
in Table 1. From both Figure 3 and Table 1, it can be seen that the Monte Carlo estimates are quite good. Table 1. End-Point 1 2 3 4
True and Estimated End-Points
True Value -3.854 -0.246 0.960 3.127
Estimated Value Mean Std Dev IQR 0.038 -3.851 0.028 0.038 -0.249 0.027 0.964 0.013 0.018 0.016 3.123 0.012
Finally, we examine the performance of the Monte Carlo gap test when ir(6\D) is replaced by the kernel density estimate and CMDE. We consider TT(9,
No
-2.1 ,Si -2.1
+JV 2
2.1 2.1
1 0.8^ Straightforward algebra shows that the ^0.8 1 marginal distribution for 6 is 0.5[iV(-2.1,l) + iV(2.1,l)]. For sample size n = 1000, the powers calculated based on the true density IT(9\D), the kernel density estimate, and the CMDE for ir(0\D) are (3 = 0.54, Pkemei =0.198, and pcmde = 0.571, respectively. For sample size n = 5000, where Si
So =
51
t h e powers calculated based on t h e t r u e density TT(9\D), the kernel density estimate, and t h e C M D E for n(8\D) are /? = 0.99, /3kernei = 0.542, and Pcmde = 0.984, respectively. In addition, we take TT(9,
I , S i I so t h a t the marginal distribution for 9 is N(0,1).
In this
case, t h e estimated type I error probabilities are a*mde = 0.05 when n = 50 for t h e C M D E and a*kernel = 0.033 when n = 500 for the kernel density estimate. This simulation study clearly demonstrates t h a t with the C M D E , t h e performance of t h e gap test with an estimated density is as good as t h e one with the known density. However, the gap test with t h e kernel density estimate performs less favorably. 5. C o n c l u d i n g R e m a r k s In this paper, we proposed a novel Monte Carlo gap test. T h e aim of t h e gap test is to determine whether an H P D region consists of only one or more intervals using a Monte Carlo sample from the posterior distribution. T h e simulation study presented in Section 4 shows t h a t t h e gap test performs well with modestly large sample sizes. Today's computer power has m a d e it relatively easy to generate thousands of random numbers from a posterior distribution, so t h e sample size requirement for the proposed gap test is quite realistic. Although we only consider a random sample from ir(9\D), our results may be extended a dependent M C M C sample. However, the performance of the gap test with dependent samples needs further studies. Extension of t h e gap test to higher dimensional parameters will also be worthwhile. References 1. Box, G.E.P. and Tiao, G.C. (1992). Bayesian Inference in Statistical Analysis. Wiley, New York. 2. Chen, M.-H. (1994). Importance-weighted marginal Bayesian posterior density estimation. J. Amer. Statist. Assoc. 89, 818-824. 3. Chen, M.-H. and Shao, Q.-M. (1999). Monte Carlo estimation of Bayesian credible and HPD intervals. J. Comp. Graph. Statist. 8, 69-92. 4. Chen, M.-H., Shao, Q.-M. and Ibrahim, J.G. (2000). Monte Carlo Methods in Bayesian Computation. Springer-Verlag, New York. 5. Gelfand, A.E., Smith, A.F.M. and Lee, T.M. (1992). Bayesian analysis of constrained parameter and truncated data problems using Gibbs sampling. J. Amer. Statist. Assoc. 87, 523-532. 6. Hyndman, R.J. (1996). Computing and graphing highest density regions. Amer. Statist. 50, 120-126.
52 7. Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, London. 8. Tanner, M.A. (1996). Tools for Statistical Inference. Third Edition, SpringerVerlag, New York. 9. Wei, G.C.G. and Tanner, M.A. (1990). Calculating the content and boundary of the highest posterior density region via data augmentation. Biometrika 77, 649-652.
ESTIMATING R E S T R I C T E D N O R M A L M E A N S U S I N G T H E E M - T Y P E ALGORITHMS A N D IBF S A M P L I N G
MING TAN, GUO-LIANG TIAN AND HONG-BIN FANG Division
of Biostatistics, University of Maryland Greenebaum Cancer 22 South Greene Street, Baltimore, MD 21201, USA
Center,
Restricted parameter problems arise in many applications, for example, in ordinal regression, sample surveys, bioassay, dose-response, variance components models, and factor analysis models. We proposed a unified method to estimate normal means with or without nuisance variance parameters subject to a class of restrictions. This class encompasses the simple order, simple tree order, umbrella order, increasing convex and increasing concave restrictions. A nonproduct parameter space is then transformed into a product parameter space via a linear mapping. To overcome the difficulty due to nuisance parameters, we introduce a sequence of latent variables which vary in each conditional maximization step. The maximum likelihood estimates and Bayesian estimates of the parameters in normal distribution with these restrictions are derived by the EM-type algorithms and the noniterative inverse Bayes formulae (IBF) sampling. The formulation and implementation of the EM or the expectation/conditional maximization (ECM) and IBF sampling for both cases with known and unknown variances are derived. Two real datasets are used to illustrate the proposed methods.
1. Introduction Restricted parameter problems arise in a variety of applications including, for example, ordinal categorical data, sample surveys (Chiu and Sedransk, 1986), bioassay (Ramgopal, Laud and Smith, 1993), dose finding (Gasparini and Eisele, 2000), and variance components and factor analysis models (e.g., Dempster, Laird and Rubin, 1977; Rubin and Thayer, 1982). The most common parameter restriction is an ordering of the model parameters, namely, the order restrictions (Barlow et al., 1972). From a frequentist perspective, isotonic regression techniques such as the pool-adjacent-violators procedure (Robertson, Wright and Dykstra, 1988) provide explicit maximum likelihood estimates (MLEs) of the parameters for some simplest cases while for more general cases the solution is only obtained from a nonlinear programming approach (Schmoyer, 1984; Geyer, 1991). For example, closed-form estimation of the means in the normal 53
54
model (1) below with restriction (5) or (6) is not available. Furthermore, for a restricted likelihood ratio test the usual asymptotic chi-square distribution theory does not apply. In some cases, the resulting distributions are known to be a mixture of \ 2 distributions (Hirotsu, 1998), but in other cases some parametric bootstrap tests (Geyer, 1991), or an asymptotic conservative approximation method (Schmoyer, 1984) have to be utilized. One alternative to the isotonic regression is to use the EM algorithm (Dempster, Laird and Rubin, 1977) or the expectation/conditional maximization (ECM) algorithm (Meng and Rubin, 1993). Liu (2000) shows how the EM algorithm can be used for MLE of discrete distributions with a class of simplex constraints. However, based on these EM-type methods, statistical inference such as the confidence bounds and hypothesis testing depends heavily on large sample theories and the ability to compute standard errors of the estimates (Louis, 1982). Since the second order partial derivatives are involved, standard errors are quite complicated to compute, especially, in models with many parameters but limited sample sizes (Meng and Rubin, 1991; Liu, 1998). Thus, the Bayesian approach provides an attractive alternative for restricted parameter problems with small to moderate sample sizes (Gelfand, Smith and Lee, 1992). In the Bayesian framework, the Gibbs sampling is especially appealing since the restrictions can be imposed directly on the samples drawn so that the restrictions are satisfied. For example, the conditional support of /ij|/u_j for the normal model (1) below subject to restrictions (5) to (6) is of the form [/ij(/i_j), h^i^-i)], where h\ and h? are two functions and H-i = (HI, ... , / i j _ i , / i i + i , . . . , / i m ) T . However, slow convergence may occur because of the highly correlated iterations. A partial explanation for this slow convergence is that the interval [hi(fi-i), /i2(M-i)] within which each Hi must be generated from its full conditional can be very narrow so that fii may change very little between successive iterations (Chen, Shao and Ibrahim, 2000, p.37). Chen and Deely (1996) also pointed out that constraints in general may slow down Gibbs sampler convergence while certain transformations may typically be helpful. To avoid the highly correlated iterations, we propose a unified approach in this article to represent the restrictions imposed on the parameters in a form such that the reparameterization results in a product parameter space. In addition, the burden of Gibbs sampling is shifted to the assessment of the convergence of the Markov chain to its stationary distribution. To avoid convergence problems, we adopt the non-iterative inverse Bayes formulae (IBF) sampling method, proposed in Tan, Tian and Ng (2003), to obtain independent sam-
55
pies approximately from the observed posterior distribution. To motivate t h e restricted parameter problems, we consider several examples. First toxicities or responses at ordering doses of a drug are often ordered. Let di,...,dm denote t h e dose levels of the drug associated with m groups and n^ mice be administered t o dose di (i = 1 , . . . , m ) , where the first (m — 1) groups are the t r e a t m e n t groups and t h e last one is t h e control group without the drug (i.e., dm = 0 ) . Denote by Yij t h e logarithm of the t u m o r volume for mouse j (j = 1 , . . . , n») under dose di. Suppose t h a t all outcomes are independent and t h e outcomes in group i have the same normal distribution, t h a t is, Yil,...,Yin,i^N(fil,^),
t = l,...,m.
(1)
If we assume t h a t d\ > • • • > dm and t h e decreasing dose levels will increase the outcomes, t h e n it is reasonable t o impose the monotonically ordered restrictions on t h e means by Mi < • • • < Mm,
(2)
which is called the simple order in t h e setting of isotonic regression/inference. Similarly, other ordered restrictions on t h e means may arise in different designed experiments. For example, an experiment where several treatments are expected to reduce t h e response (e.g., systolic blood pressure) in t h e control would result in a simple tree order on t h e means in t h e form Mi < Vm,
i = l , . . . , r o - 1.
(3)
Some authors study the so-called umbrella orderings (Shi, 1988; Geng and Shi, 1990), i.e., t h e means exhibit an unimodal trend Mi < • •' < Mh > • • • > Vm-
(4)
For a class of discrete distributions without nuisance parameters, Liu (2000) considers a general increasing convex restriction: H\ < ^ < • • • < Mm a n d M2 — Mi ^ M3 ~ M2 < """ < IJ-m — Mm-i; which is equivalent to 0 < M2 - Ml < M3 - M2 < • • • < Mm - Mm-i-
(5)
He also considers the increasing concave restriction: fii < M2 < • • • < ^m and M2 — Mi •> M3 — M2 > • • • > ^m — Mm-i, which are equivalent to M2 - Mi > M3 - M2 > • • • > Mm - Mm-i > 0.
(6)
T h e purpose of this article is to develop a unified framework for estimating order restricted normal means where t h e variances may be unknown.
56
The presence of nuisance variance parameters presents a statistical challenge for estimation in this problem since it is impossible to introduce the latent variables. We first propose a linear mapping that transforms the nonproduct parameter space (NPPS) into a product parameter space (PPS). This unified approach encompassing various classes of restrictions such as the simple order, simple tree order, umbrella order, increasing convex and increasing concave restrictions. The MLEs and Bayesian estimates of the parameters in normal distributions with these restrictions are derived by using the EM-type algorithms and the non-iterative IBF sampling. In Section 2, we introduce the concepts of PPS and NPPS and present the linear mapping that transforms an NPPS into a PPS and unifies the treatment of the restrictions (2) to (6). Sections 3 and 4 derive the MLEs and Bayesian estimates of parameters by using the EM-type algorithms and the IBF sampling when the variances are known and unknown, respectively. The methods are illustrated with two real examples in Section 5. A discussion is given in Section 6 and a brief introduction of the IBF sampling in the context of this article is given in Appendix. 2. Nonproduct versus Product Parameter Space Let S{ni) denote the support of parameter pn and «S(/u) the joint support of [i = (fj,\,..., fim)T• If S(n) = n ^ i ^(Mt)) * n e n ^(AO i s called a product parameter space; otherwise, it is called a nonproduct parameter space. Note that S(p.) with the simple order restriction (2) on /j, is an NPPS. Making a linear transformation on the simple order restriction (2), we have /j, = B\9, where 8 belongs to Si,m(e)
= {(0i,...,em)T:
- o o < 0 i < + o o , 6i>0,
i = 2,...,m},
(7)
B\ = A m and A m is an m x m matrix defined by /10--0\ 1 1 ••• 0 (8)
Vn-i/ Obviously, <Si,m(0) = (—oo, +oo) x n l ^ P ' + 0 0 ) ls a PPS- In other words, an NPPS is transformed into a PPS via a linear mapping. Similarly, for the simple tree order restriction (3), we obtain fi = B2O, where 6 € <Si,m(#) and
S2
="
/J-m —1
1
—
-'m— 1
0m— 1
57
here lfc denotes the fc-dimensional vector with component 1, Ofc the kdimensional vector with component 0, and Ik the identity matrix of order k. For the umbrella order restriction (4), we have \x = B36, where 9 S Si,m(9) and Ah
B,
O
where Afc is defined in (8). For the increasing convex restriction (5), we have /z = B46, where 9 e <Si,m(6») and / B4
1
0m — X 1
1 2
0 1
• 00\ • 00
am— 1
fim-L
(9)
m — 2 m —3 \ m — 1 m —2
• •
• 10 •2lJ
For the increasing concave restriction (6), we have /x = B$0, where 9 € Si,m(0) and B5 =
lm-l
1
0Tm-l
3. Estimation W h e n Variances Are Known In this section, we assume that a2 = (o\,... ^ J , ) 1 is a known constant vector. The objective is to estimate the mean vector fi = (/JI, . . . ,/i m ) subject to a linear restriction /i = B 0. According to the discussion in Section 2, we can assume without loss of generality that B = (6^) is a known m x q scalar matrix and 9 belongs to Sr,,(0) = {(0i, • • •, 0,) T : Ok € 1R1, 1 < k < r, 9k > 0, r + l < k < q}. (10) Therefore, the linear restriction /J, = B9 implies that yn — Y^k=\bik9k, i = 1 , . . . ,m. 3.1. MLE via the EM
Algorithm
Note that Yi = ^- 'YT^=\ ^ij ls a sufficient statistic of /Ji when a2 is known. Denote the observed data by Xn,s = {Yi : i = 1,..., m}. Similar to the EM structure of Lange and Carson (1984) for the Poisson emission tomography and Liu (2000) for discrete distributions, we augment the observed data
58
X>bs by the latent data Z(a2) = {Zik(cr2), i = 1 , . . . , m, k = 1 , . . . ,q — 1} to obtain a complete-data Ycom{(j2) = {Zik(cr2), 1 < i < m, 1 < k < q}, where ^ ( a
2
)
1
^ ^ ^ ^ ,
%-),
i = l,...,m,
k = l,...,q,
(11)
i
Yi = Y^zik{
i=
l,...,m.
k=l
The mapping Y. = J2l=i Zik{cr2) from ^ om (<7 2 ) to l£bs preserves the observed-data likelihood Yl^Li N(Y,\/j,i,cr2/ni). The likelihood function of 6 for the complete-data Ycom{a2) is given by
L(^com(^))cx{n(^)-<'/2} 1 _m. r
(
q
2
•^A-\zZ ^ £ ( ^ ) - ^ )
2
1 •>
\> (12)
where 0 € Sr,q(&) and Srtq(6) is defined in (10). Therefore, the sufficient statistics for 6>fc are Sk = YT^A1^)2^2) for fc = 1 , . . . ,g. The complete-data MLEs of Ok are ^ = Sk/ YHLi{ni iH) for fc = 1 , . . . , r, and fife = max{0, Sk/ E™ l t 2 ^ ) } for k = r + l,...,q. To derive the conditional predictive distribution, we first need to prove the following result. Proposition 3.1. Let random variables W\,... ,Wq be independent and Wk~N((3k,62),k = l,...,q. Then (W1,...,V%-1)T
w Nt=i 'Z2k=l ^k =1
Nn-,_i ULq+
yr_
S2
2 *-v ; diag{5i )-^~^
J:2
• £„>
where ft, = ( f t , . . . , ft-i)T and 62q = (<5 2 ,..., S^^. S2 = 52, then
v
(
Especially, if Sj =
(±Wk=W)
(W1,...,\%-i)T N, .
(13)
q
^
^
fc=i
.
'
V
^
t
/
H
-
^
)
)
-
<">
59 PROOF.
W = (Wu
Let (3 = (f3u.. .,(3q)T and S2 = (52 ..., Wq)T ~ Nq((3, diag(«52)), then
[ , • • • , S
t
w
///3\
Nq+1
rj
q
2\T
We note that
/diag(<52) S2
wJ'\
*2T n«a
From the property of multivariate normal distributions, we have 9
(Wi,...,W,)T
( ! > * = «;)
~iV„ /3 + M - E L I & •62, diag(«52) 1^2
2 : <52X2T <5
(15)
1 ^
Let E = diag(<52) - S262T/(lTqS2), then E = £>1/2(/(? - E ^ i ) 1 / 2 , where 2 12 £> = diag(<5 ) and E x = D ' lq(l'q'Dlq)-1iq'D1/2. Since Iq - Ei is a projection matrix and the rank of a projection matrix is equal to its trace, then rank(i,j — Ei) = q — 1, which means rank(E) = q — 1 < q. That is, the distribution in (15) is a degenerate g-dimensional normal distribution. From (15), we immediately obtain (13) and (14). Define Zi(a2) = {Za(
f(Z(a2)\Yobs,e)
= n / Z^ 2 )(£^c(a 2 )=^),0 »=i
^
fe=i
= n ^ - i [ZiW2) 1=1
E(Zz{cj2)\Yoh&,e),
'
%-(lq-i-
q-1
1n%
^
9
1
" 1 ) ) , (16) ' J
where E(Zi((T2)\Yohs,8)
=
(bil91,...,bi^1eq_1)T+lq-i
Yi — E k = i bikQk
(17)
for i = 1 , . . . , m. Given the observed data l£bs and the current estimate of the parameter vector 9, written as 9^ = (9[ , . . . ,9q ) T , the E-step requires to calculate the conditional expectation of the complete-data sufficient statistic Sk,
S? = E(Sk\Yohs,9^) = £ (^-(b^J^^f^)'
<18)
60
for k = 1 , . . . , q. The M-step is to update the estimates of 9
Or]=S^/±(^\ I
k = l,...,r, l
»=i ^
'
( ? i * + 1 > = m a x { o , s i Y f ; ( ^ ) } > fc = r + l , . . . , , .
(19)
The algorithm is iterated until ||#' t+1 ) — 0O|| is sufficiently small. Assume that the EM algorithm converged at the (t + l)-th iteration, then the MLE of 9 is 6 = 0(t+1\ The standard errors of 9 can be calculated by the method of Louis (1982). Therefore, using the 5-method (e.g., see Tanner, 1996, p.34), we can obtain the restricted MLE and standard errors of p via the linear relationship fj. = B 9. Remark 3.1. The choice of the initial value 9^ is a crucial problem. Denote the restricted MLE of \i by jl = (fii,..., / i m ) T . Theorem 1.6 of Barlow et al. (1972, p.29-30) shows that min{K, • • • ,Ym} < fn < max{F 1 ; ... ,Ym} for i = l , . . . , m . Therefore, the initial value /x' 0 ' can take some partial order of { Y i , . . . , Y m } such that /J(°' e £(M)- For example, if <S(/i) = {(pi,. .^ Hm)T IJJ-i < ••• < Mm}, then we take /J^> = ( V ^ ) , . . . , Y( m )) T , where Y^),..., Y(m) denote the ordered values of Y\,..., Ym. Hence, we can take the initial value 9^ = B+ /j,(°>, where B+ denotes the MoorePenrose inverse matrix of Bmxq. In particular, we have B+ = B~l when q = m. Obviously, statistical inference on n such as confidence bounds and hypothesis testing depends on the large sample theories. Thus, for small sample size, Bayesian method is an appealing alternative. 3.2. Bayesian
Estimation
via the IBF
Sampling
In the Bayesian setting, the priors of 9 for given complete data are specified. More specifically, we assume that the components of the 9 are independent and for k = 1 , . . . , r, 9k ~ N(9ko, ^ho)* a n o - f° r ^ = r + 1, • • •, 9, @k is distributed as a normal with mean 9ko a n d variance o\§ but truncated to the interval [0, +00), denoted by 9k ~ TN(9ko,a^0; 0, +00), where 9ko and cr!0 are known scalars. With prior distribution r
n(9) = l[N(9k\9k0,a2k0)fc=l
q
J } TN(9k\9k0,<j2k0; k=r+l
0,+00),
(20)
61
the complete-data posterior distribution of 9 is r
f(9\Yohs,Z(a2))
Y[N(ek\uk(
= ~ =1
JJ
TN{ekuk{a2),v2k(a2)-
0,+oo),
(21)
fe=r+l
where / 2\
2, 2\( ®ka . sr^ qnibikZlk(cr
«2(0=f4- +f;22#>) .
) \
* = !,•••,9-
(22)
The conditional predictive distribution of the latent data Z(a2) given 5£bs and 9 is given in (16). To obtain iid posterior samples of //, note \i = B 6, we only need to obtain an iid sample of 9 from f(0\X>bs)- According to Appendix, it suffices to generate an iid sample of Z(a2) from f(Z(er2)\%ba)- From the samplingwise IBF (37), we have f{Z(*2)\Yohs) oc / ( f f r 2 ) ^ ) ,
(23)
where 9 is the mode of the observed posterior f{9\Y0\,s). If we use diffuse priors, i.e., <j\0 —> +oo for k = 1,... ,q, then the posterior mode is as same as the MLE, we have 9 = 9. Therefore, we can utilize the MLE 9 obtained in Section 3.1 as the best initial value in IBF sampling.
4. Estimation when variances are Unknown From assumption (1), the observed data is l£bs = {Y%j : i = 1 , . . . , m, j = l , . . . , 7 i i } . The objective here is to estimate the mean vector /J, = (fii,... ,HmY and the variance vector a2 = (a2,... ^ J , ) 1 , where fi is estimated subject to a linear restriction \i — B 9, or equivalently fa = E L i bikek (i = 1 , . . . ,m), and 9 = (6U.. .,9Qf € S r ,,(0) given by (10). The unknown parameter vector is denoted by ip — (^J 0 " 2 )-
62
4.1. MLE via the EM-type
Algorithms
The likelihood function of ip for the observed data Xibs *s g i v e n by
= ft ft - ^ — exp f - bs,9), we have
2i=-Y. [Yij-Y.bik6k) Hi
j=l
V
'
fc=l
* = l,---,m.
(25)
'
Similarly, given X>bs and a 2 G iR+, the conditional MLE of 9 maximizes the conditional likelihood function m
2
/ _
q
b
L(9\Yohs,a ) = ]jN [Yi J2 ^k, »=i
^
fc=i
rr2\
-*-), n i
fl
€ 5r,,(fl),
(26)
'
which results in ^ = arg e 6 max e ) L(e|y o b s ,
(27)
Thus, with the introduction of the same latent data Z(a2) as in Section 3.1 (see, (11)), we can use (18) and (19) to find 6 satisfying (27). Remark 4.1. In the absence of missing data, the ECM algorithm is a special case of the cyclic coordinate ascent method for function optimization. In fact, the formulae (25) and (27) can be viewed as the first CM-step and the second CM-step of an ECM algorithm without missing data. In each CM-step, we use the EM algorithm to find the conditional MLE of 9 given o 2 by introducing latent data Z(a2), which varies with a2. Liu and Rubin (1994) called this EM-type algorithm as ECME algorithm. Remark 4.2. To implement the EM-type algorithm, one need to choose the initial value of of. When Hi is free, the unrestricted sample variance •~- Y^jLiO^ij ~ ^i) 2 c a n be considered as the initial value of of.
63
4.2. Bayesian
Estimation
via the IBF
Sampling 2
We consider independent priors on 9 and a , where 9 ~ TT(9) given by (20), and a2 ~ I G ( ^ , ^ ) with inverse gamma density I G ^ I ^ , ^ f ) = TL 12) tx ~ 1 ~ g '° e x P{~llf}> where qw and A,o are known constants, i = 1 , . . . , m. The observed data posterior distribution /(^|i^bs) is proportional to the joint prior TT(9) Yl^Li 7r(cr?) times the observed data likelihood function L(tp\X,bs) given by (24). We obtain
f(a2\Yohs,9) = n i G H
jyjLl(Yij-Ylk=lbik0k)2' 2
QiO + rii ^iO + 2 '
/(0|l£b B ,
9 6 5 r ,,(0),
(29)
2
where 7r(0) and L(#|y£bs,°' ) are given by (20) and (26), respectively. Since f{9\X>bs, c 2 ) is quite intractable, we augment the observed data l£bs by the latent data Z(a2) defined in (11). Similar to (21) and (16), we have f{9\Yohs,Z{a2),a2) r
q
= \{N(6k\uk(a2),v2(<j2)) fc=l
-llTN^Ukia2)^2^2);
0,+oo), (30)
k=r+l
f(Z(a2)\Yohs,9,a2) rn
,
= I ] ^ - i (Ziia2)
1
2
E(Zi(a2)\Yobe,e),
^-(lg-i
iT
- "'' ' " M
\
, (31)
where Mfc(<72), w2(o-2) and ^ ( ^ ( ( T 2 ) | ^ b s , 6>) are given by (22) and (17), respectively. In order to implement the IBF sampling, we first find the posterior mode ip — (9,a2). Let xj> = (9:a2) denote the MLE obtained in Section 4.1. If we use flat priors, i.e., U\Q —> +00 for k = 1 , . . . , q and (qto, AJO) = (—2,0) for i = 1 , . . . ,m, then the posterior mode tp is as the same as the MLE ip. However, in practice, a flat prior on 9 and a noninformative prior on a2 are common used (Box and Tiao, 1973), that is, (qio,\io) = (0,0) for i = 1 , . . . , m. Therefore we have 9 = 9,
ai2=(-^\(fi2,
i = l,...,m.
(32)
Having obtained (9, a2), by (31) we can easily compute Z0 = E(Z(t2)\Yohs,8).
(33) 2
To sample from the observed posterior distribution /(#, a |l£bs), noting that f(9,a2\Y>bs) = f(9\%bs) • f(cr2\Ybs,9) and sampling from the second
64
term as given in (28) is straightforward, the rest boils down to sampling from /(#|X>bs)- Prom (39), we have f{d\Yobs,a
) oc
—^-,
(34)
where Zo is given by (33). Therefore, based on (30) and (31), we can obtain iid samples approximately from /(#|X>bs, v2) by using IBF sampling. From (37), we have
'<«*•> « f W
<35)
Then, with (34) and (28), we can obtain iid samples approximately from f{e\Yohs). 5. Applications We analyze two real datasets to illustrate the proposed method. The first is binomial response data with normality obtained by arcsin transformation. The second is a case with unknown variances. Both have relatively small sample sizes. 5.1. Diesel Fuel Aerosol
Experiment
Dalbey and Lock (1982) conducted an experiment to assess the lethality of diesel fuel aerosol smoke screens on rats. Rats were enclosed in chambers in which a specified dose of diesel fuel aerosol could be monitored and controlled. Let pi denote the proportion of rats that died at dose di, i = 1 , . . . , m. If rt, the number of rats tested at dose di, is reasonably large, 1 /2
then arcsin (pi ) can be considered normal with mean /ij and variance of = l/(4rj), see, for example, Schoenfeld (1986, p.187). Table 1 lists the proportions of rats that died at the various doses and the corresponding values of arcsin ( p / )• Schmoyer (1984) analyzed this data set and obtained MLEs of pi subject to a sigmoid constraint. We use the normal model (1) to fit the tranformed data. Now we have m = 8, all m = 1, Yi = arcsin(p/ ) ~ iV(/ij,of), where Yi,...,Ym are independent and of = l/(4rj) are known. The objective is to estimate the mean vector /x = (//i,... , ^ m ) T subject to (i) the monotonically ordered restriction (2) and (ii) the increasing convex restriction (5). Case 1: The monotonically ordered restriction (2). In (10), set r — 1 and q = m = 8, we have fj, = B6, where B = (bik) = As given by
65 Table 1.
Group i 1 2 3 4 5 6 7 8
Dose di (h-mg/L) 8 16 24 28 32 48 64 72
Data from the diesel fuel aerosol experiment. r,, number of rats tested 30 40 40 10 30 20 10 10
Pi, proportion of dead 0.000 0.025 0.050 0.500 0.400 0.800 0.600 1.000
arcsin (pi = Yi 0.000 0.158 0.225 0.785 0.684 1.107 0.886 1.571
)
Variance
*? = V(4ri) 1/120 1/160 1/160 1/40 1/120 1/80 1/40 1/40
Source: Schmoyer (1984).
(8) and 6 = {6U... ,<98)T e Sh8(6) given by (7). Let /x(°> be the order value of Yi,... ,Y&, we can take the initial value 6^°> = B_1 //°) = (0.000,0.158,0.067,0.459,0.101,0.101,0.221,0.464) T . Using (18) and (19), the EM with the initial value 6^ converged to the restricted MLE 8 given in the second column of Table 2 after 180 iterations with precision 10~ 3 . The corresponding restricted MLE (i are displayed in the third column of Table 2. Table 2.
ML and Bayesian estimates for monotonically ordered restriction.
Group i 1 2 3 4 5 6 7 8
MLE of ^ 0.000 0.158 0.067 0.484 0.000 0.324 0.000 0.538
MLE of Hi 0.000 0.158 0.225 0.709 0.709 1.033 1.033 1.571
Posterior mean of in 0.000 0.158 0.225 0.708 0.727 1.051 1.085 1.621
Posterior SD of in 0.018 0.026 0.033 0.040 0.042 0.055 0.061 0.095
95% Posterior interval estimates of Hi [-0.036, 0.037] [0.106, 0.210] [0.160, 0.289] [ 0.629, 0.788] [ 0.646, 0.811] [0.943, 1.159] [ 0.965, 1.210] [ 1.438, 1.809]
Note: fi = BO and B = A 8 given by (8).
In the Bayesian setting, we use diffuse priors, i.e., a\Q —» +oo for k = 1 , . . . , 8. Therefore, the posterior mode is the same as the MLE, we have 6 = 6. Based on (23), the IBF sampling is implemented by first generating an iid sample of size J = 8000 of Z{
66 Table 3. Group i 1 2 3 4 5 6 7 8
ML and Bayesian estimates for increasing convex restriction. MLE Of0i 0.007 0.141 0.000 0.075 0.000 0.000 0.000 0.201
MLE of Mi 0.007 0.148 0.289 0.505 0.721 0.937 1.153 1.571
Posterior mean of /i; 0.007 0.147 0.292 0.512 0.740 0.981 1.244 1.604
Posterior
SD of m 0.018 0.018 0.020 0.025 0.033 0.044 0.060 0.103
95% Posterior interval estimates of m [-0.030, 0.042] [0.109, 0.184] [ 0.250, 0.333] [ 0.460, 0.562] [ 0.673, 0.807] [ 0.895, 1.070] [ 1.127, 1.362] [ 1.507, 1.911]
Note: /x = B6 and B = B4 given by (9).
Figure 1. Comparison among the unrestricted, the monotone ordered and the increasing convex MLEs of fj, = (fj.\,..., fig) .
given Z(a2) = Z^e\a2). Then 9^,..., 6»(M> are iid samples of 0 approximately from /(#|5£bs)- The posterior samples of \i can be obtained by fj, = B0. The corresponding Bayesian estimates of fi are given in Table 2. By comparing the third column and the fourth column in Table 2. As shown in this table, the MLE or posterior mode of \ii is slightly different from the posterior mean of Hi for i = 2 , . . . , 8. This is expected as the mode of a truncated normal distribution differs from to its mean. Case 2: The increasing convex restriction (5). From (9), we have fi = B9, where B = (bik) = B4 given by (9) and 9 = (9u...,9a)T G Sits(9) given by (7). Let n^ be the order value of Yi,...,Yg, we have
67
£ - V ( 0 ) = (0,0.158,-0.091,0.392,-0.358,0,0.120,0.243) T . Replacing the two negatives by 0, let the initial value 6^ = (0,0.158,0,0.392,0,0,0.120, 0.243)T. Using (18) and (19), the EM with the initial value 0<°) converged to the restricted MLE 6 given in the second column of Table 3 after 2000 iterations with precision 10~ 3 . The corresponding restricted MLE ft are displayed in the third column of Table 3. For the Bayesian analysis, the same diffuse priors as in Case 1 can be used, the corresponding Bayesian estimates of \i are given in the last three columns of Table 3. Figure 1 gives a comparison among the unrestricted, the monotone ordered and the increasing convex MLEs of /i = ( / i i , . . . , n$)T. 5.2. Half-Life
of An Antibiotic
Drug
The effect of an antibiotic drug is estimated from an experiment where increasing doses are administered to groups with five (m = 5) rats and the half-life of the antibiotic drug is listed in Table 4. The usual analysis of variance is obviously inappropriate, because of the ordering of the doses. Hirotsu (1998) applies the cumulative \2 test to this data set and his testing result is in favor of a monotone relationship in the mean half-life, i.e., Mi < • • • < Ms- We assume that Yix,... ,Yini ~ N(/j,i,cr2), i = 1 , . . . ,5. The objective is to estimate /J, = (fix,... , ^ s ) T and a2 = (a2,...,
Half-life of an antibiotic drug in rats.
rii, number of Group Dose di (mg/kg) rats administered i 5 1 5 2 10 5 3 25 4 4 50 5 5 200 5
Half-Life (hour) 1.17 1.00 1.55 1.21 1.78
1.12 1.21 1.63 1.63 1.93
1.07 1.24 1.49 1.37 1.80
0.98 1.14 1.53 1.50 2.07
1.04 1.34 1.81 1.70
Average Sample Yi variance 1.076 0.005 1.186 0.016 1.550 0.004 1.504 0.054 1.856 0.021
Source: Hirotsu (1998).
Let n^ be the order value of Y i , . . . , I5, and let the initial value 8^ = B - V ( 0 ) = (1.076,0.110,0.318,0.046,0.306) T , where B = (bik) = A 5 is given by (8). Given 6(°\ we first implement the CM-step 1 to compute
68
the second column of Table 6, respectively. Table 5. Group i 1 2 3 4 5
ML and Bayesian estimates of \i for monotone restriction.
MLE of ^ 1.076 0.110 0.361 0.000 0.308
MLE of in 1.076 1.186 1.547 1.547 1.856
Posterior mean of in 1.076 1.185 1.546 1.562 1.871
Posterior SD of m 0.006 0.010 0.013 0.018 0.029
95% Posterior interval estimates of Hi [1.062, 1.088] [1.165, 1.207] [1.520, 1.574] [1.529, 1.601] [1.813, 1.928]
Note: /* = BO and B = A 5 given by (8).
Table 6. Group i 1 2 3 4 5
MLE of a? 0.004 0.013 0.003 0.045 0.017
ML and Bayesian estimates of a2 for monotone restriction. Posterior mode of of 0.003 0.009 0.002 0.032 0.012
Posterior mean of a2 0.007 0.021 0.005 0.077 0.030
Posterior SD of of 0.011 0.025 0.008 0.092 0.044
95% Posterior interval estimates of a2 [0.001, 0.024] [0.004, 0.073] [0.001, 0.023] [0.017, 0.291] [0.006, 0.108]
To implement the IBF sampling, we first find the posterior mode 9 and a2. We use the flat prior on 6 and the noninformative priors on of. Therefore, from (32), we obtain 9 = 8 given in the second column of Table 5 and IT2 given in the third column of Table 6. Then we compute ZQ by using (33). The first IBF sampling is implemented by drawing an iid sample of size J = 8000 of 9 from f(9\YDbs, Zo,a2) and, then, obtaining an iid sample of size J\ = 6000 of 9 approximately from f(9\Y0bs,a2) based on (34). Based on (35), the second IBF sampling is implemented by generating an iid sample of size M = 5000 of 9 from f(6\Yohs), denoted by 6^\ I = 1 , . . . , M. Finally, we generate a2^ from f(a2\Yohs, 9^) for given 9 = 9^\ Then a2^l\ ..., a2(~M) are iid samples of a2 from f(a2\Y0bs)- The posterior samples of fi can be obtained by \i = B 9. The corresponding Bayesian estimates of fi and a2 are given in Table 5 and Table 6, respectively. 6. Discussion The statistical inference concerning the restricted parameters is extremely difficult, e.g., the MLEs are obtained usually at the boundary of the restricted parameter space. We proposed a unified method to estimate normal
69
means with or without the nuisance variance parameter subject to a class of restrictions. This class encompasses the simple order, simple tree order, umbrella order, increasing convex and increasing concave restrictions. To overcome the difficulty due to nuisance parameters, we introduced a sequence of latent variables which vary in each conditional maximization step. To derive small sample inference, we adopted the Bayesian approach with computation performed using a noniterative IBF sampling. This method is appealing for its simplicity since it builds on the result from the EM algorithm. However, the method also has certain limitations. The EM appears to converge very slowly requiring 2000 iterations in the Case 2 of Section 5.1. To accelerate the EM algorithm, we may use the parameter expandedexpectation maximization algorithm (Liu, Rubin and Wu, 1998). Another method is to reduce the number of latent variables introduced. Noting that in (11) we introduced total m{q — 1) normal latent variables, which are universal for arbitrarily known m x q scalar matrix B = (bik). For some specific B, e.g., B = A m given by (8), we only need to introduce latent data Z(a2) — {Zik(cr2), i = 2, . . . , m , k = l,...,i — 1} and to obtain a complete-data Ycom(a2) = {Yohs, Z(a2)} = {Zik(a2), i = l,...,m, k = 1 , . . . , i}, where 2
Zik{a2)
l d
~ N(0k,
^-),
i = l,...,m,
k = l,...,i,
(36)
and J2k=i Zik(&2) = Yi(i = l,..., m). So there are only m(m — l ) / 2 latent variables introduced in (36) and the convergence of EM is expected to be faster. Furthermore, it would be of interest to extend the method to cases where informatively censored data occur and to binomial responses as in the dose-response models. Appendix: The IBF sampling Let YQfos denote the observed data and 6 the parameter vector of interest. Using the concept of data augmentation (Tanner and Wong, 1987), we augment the observed data ^ ^ g with missing data Z so that both the complete-data posterior distribution /(0|y, tz){6\Y0^s, Z) and the conditional predictive distribution / ( z | y , ie)(Z|l^)jDS,^) are available. The objective is to obtain an iid sample from the observed posterior distribution f(t>\Yohs)(0\Yobs)-
70
Let S(Qz\Yh ) denote the joint support of (0,Z) conditional on X>bs, S(e\Yohs) and «S(^|y ) the conditional supports of 6 and Z for given l£bs, respectively. Under the assumption of conditional product measurable, i.e., S(e,z\Yohs) = S(0|yobs) x s(z\Yohs), we have a samplingwise IBF:
fel |Z
P7>
« "'*"Cw'.wy
for some arbitrary 9Q € S^y, ) and all Z £ «S(z|y, )• Based on samplingwise IBF (37), Tan, Tian and Ng (2003) proposed a noniterative sampling approach called IBF sampler. Using the sampling/importance resampling (SIR) technique (Rubin, 1988), the IBF sampler is as follows: (i) Draw J independent samples of Z from f(z\Y . ,0)(-^l^obs' ^°)' denoted by Z{1\ ..., Z^; (ii) Calculate the weights f(9*Y, ,Z)(eo\Yobs> Zi:>)) r obs ZLifv\Yoha,z)Vo\Yoh8,ZWy
•, J;
(38)
(iii) Choose a subset from {Z^l\ ..., Z^} via resampling without replacement from the discrete distribution on {Z™'} with probabilities {u>j} to obtain an iid sample of size m (< J ) approximately from f(z\Y , )(-^l^obs)' denoted by Z^kl\ ..., Z^km^; (iv) Generate 0W from the augmented posterior pdf f(e\Yohs,z)(0\YobS'Z{k']) for i = l,...,m. Then 6™,..., 6™ *~ fWYoh8MYoJNote that any subset of independent samples is still independent. The first part of Step (iii) implies {Z^kl\ . . . , Z^km^} is a independent sample. The second part of Step (iii), i.e., "resampling from the discrete distribution on {Z^} with probabilities {u>j}", implies {Z< fcl \ . . . ,Z(fc"*>} are approximately from f(z\Yb )(Z\X>bs) with the approximation "improving" as J increases (Smith and Gelfand, 1992). However, resampling with replacement results in dependent samples. In principle, the samplingwise IBF (37) holds for any given 0Q G S(0|y, ). However, the efficiency (but not the correctness) of the IBF sampling depends on how well the proposal density f(z\Yh ,e)(Z\Yobs, #o) approximates the target function f(z\Ybs)(^l^obs)- Tan, Tian and Ng (2003) have showed that f(z\Yohs)(Z\Yobs)
= fiZ\YobB,0)(Z\Yobs,dohs){l
+
O(l/N)}.
71
where 0 o b s denotes t h e mode of t h e observed posterior distribution f(fl\Yh )(^l^obs) a n d AT denotes t h e sample size of t h e observed d a t a Y0\>s. Heuristically, if #o is chosen t o be t h e observed posterior mode 90\,s, t h e overlap area under t h e two functions would be substantial since t h e approximation is accurate t o t h e order of 0(l/N). T h e y further suggested using t h e E M algorithm t o find locate 0obs- T h e advantage of using t h e E M is t h a t t h e samplingwise I B F (37) a n d t h e E M have t h e same structure of augmented posterior distribution/conditional predictive distribution, thus, no e x t r a derivations are needed for t h e I B F sampling. By exchanging t h e roles of Z and 8 in (37), alternatively, we have, f{8\Yobs,Z)(0\X>bs,Zo
w>ci*.) - 7!T-z^: f(z\Y e){ZQ\Y ,ey ;;• ohs
<*»
d all 6 G S^\Y.
)• Tan, T i a n and Ng
ohs<
for some arbitrary ZQ € S^z\Yh
)
an
(2003) suggest taking Z0 = E{Z\Yohs,6ohs). Gelfand a n d Dey (1994) pointed out t h a t t h e harmonic mean estimate of Newton a n d Raftery (1994) is likely t o suffer from numeric instability since t h e reciprocals of conditional densities m a y approach infinity. However, in our proposed method, t h e weights {cOj} in (38) is a ratio and is free from this kind of numeric instability. In fact, u>j can be rewritten as
„ / ( fliV obs , Z )(Wbs,z w ) ^fwy^z^Yo^Z^) When / ( e | V o b s , Z ) ( W b s , ^ o ) ) = maxi oo, we have uij0 —> 1. To further enhance numeric stability, we can use the exponent of t h e logarithm of t h e ratio in calculating uij. Ross (1996) pointed out t h a t resampling algorithm results in some loss of information. However, since we sample from t h e augmented posterior t h a t is already quite close t o t h e observed posterior (39), such loss is minimal and t h e computational gain is tremendous. References 1. Barlow, R.E., Bartholomew, D.J., Bremner, J.M. and Brunk, H.D. (1972). Statistical Inference under Order Restrictions. John Wiley & Sons, New York. 2. Box, G.E.P. and Tiao, G.C. (1973). Bayesian Inference in Statistical Analysis. John Wiley & Sons, New York. 3. Chen, M.-H. and Deely, J.J. (1996). Bayesian analysis for a constrained linear multiple regression problem for predicting the new crop of apples. J. Agric. Biol. nvir. Statist. 1, 467-489.
72
4. Chen, M.-H., Shao, Q.-M. and Ibrahim, J.G. (2000). Monte Carlo Methods in Bayesian Computation. Springer, New York. 5. Chiu, H.Y. and Sedransk, J. (1986). A Bayesian procedure for imputing missing values in sample surveys. J. Amer. Statist. Assoc. 8 1 , 667-676. 6. Dalbey, W. and Lock, S. (1982). Inhalation toxicology of diesel fuel obscurant aerosol in sprague-dawley rats. ORNL/TM-8867, Biology Division, Oak Ridge National Laboratory. 7. Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc. B 39, 1-38. 8. Gasparini, M. and Eisele, J. (2000). A curve-free method for phase I clinical trials. Biometrics 56, 609-615. 9. Gelfand, A.E. and Dey, D.K. (1994). Bayesian model choice: asymptotic and exact calculations. J. R. Statist. Soc. B 56, 501-514. 10. Gelfand, A.E., Smith, A.F.M. and Lee, T.M. (1992). Bayesian analysis of constrained parameter and truncated data problems using Gibbs sampling. J. Amer. Statist. Assoc. 87, 523-532. 11. Geng, Z. and Shi, N.Z. (1990). Isotonic regression for umbrella orderings. Appl. Statist. 39, 397-402. 12. Geyer, C. J. (1991). Constrained maximum likelihood exemplified by isotonic convex logistic regression. J. Amer. Statist. Assoc. 86, 717-724. 13. Hirotsu, C. (1998). Isotonic inference. In Encyclopedia of Biostatistics (P. Armitage and T. Colton, eds.), 2107-2115. John Wiley & Sons, New York. 14. Lange, K. and Carson, R. (1984). EM reconstruction for emission and transmission tomography. J. Comput. Assist. Tomography 8, 306-312. 15. Liu, C.H. (1998). Information matrix computation from conditional information via normal approximation. Biometrika 85, 973-979. 16. Liu, C.H. (2000). Estimation of discrete distributions with a class of simplex constraints. J. Amer. Statist. Assoc. 95, 109-120. 17. Liu, C.H. and Rubin, D.B. (1994). The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika 8 1 , 633-648. 18. Liu, C.H., Rubin, D.B. and Wu, Y.N. (1998). Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika 85, 755-770. 19. Louis, T.A. (1982). Finding observed information using the EM algorithm. J. R. Statist. Soc. B 44, 98-130. 20. Meng, X.L. and Rubin, D.B. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. J. Amer. Statist. Assoc. 86, 899-909. 21. Meng, X.L. and Rubin, D.B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 80, 267-278. 22. Newton, M.A. and Raftery, A.E. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap (with discussions). J. R. Statist. Soc. B 56, 3-48. 23. Ramgopal, P., Laud, P.W. and Smith, A.F.M. (1993). Nonparametric Bayesian bioassay with prior constraints on the shape of the potency curve. Biometrika 80, 489-498.
73
24. Robertson, T., Wright, F.T. and Dykstra, R.L. (1988). Order Restricted Statistical Inference. John Wiley & Sons, New York. 25. Ross, S.M. (1996). Bayesian should not resample a prior sample to learn about the posterior. Amer. Statistician 50, 116. 26. Rubin, D.B. (1988). Using the SIR algorithm to simulate posterior distributions (with discussion). In Bayesian Statistics, 3 (J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith, eds.), 395-402. Oxford University Press, Oxford. 27. Rubin, D.B. and Thayer, T.T. (1982). EM algorithm for ML factor analysis. Psychometrika 47, 69-76. 28. Schmoyer, R.L. (1984). Sigmoidally constrained maximum likelihood estimation in quantal bioassay. J. Amer. Statist. Assoc. 79, 448-453. 29. Schoenfeld, D.A. (1986). Confidence bounds for normal means under order restrictions with application to dose-response curves, toxicology experiments and low-dose extrapolation. J. Amer. Statist. Assoc. 8 1 , 186-195. 30. Shi, N.Z. (1988). A test of homogeneity for umbrella alternatives and tables of the level probabilities. Comm. Statist. Theory & Methods 17, 657-670. 31. Smith, A.F.M. and Gelfand, A.E. (1992). Bayesian statistics without tears: A sampling-resampling perspective. Amer. Statistician 46, 84-88. 32. Tan, M., Tian, G.L. and Ng, K.W. (2003). A noniterative sampling method for computing posteriors in the structure of EM-type algorithms. Statistics Sinica, to appear. 33. Tanner, M.A. (1996). Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, 3rd ed. Springer, New York. 34. Tanner, M.A. and Wong, W.H. (1987). The calculation of posterior distributions by data augmentation (with discussion). J. Amer. Statist. Assoc. 82, 528-540.
A N E X A M P L E OF A L G O R I T H M M I N I N G : COVARIANCE A D J U S T M E N T TO ACCELERATE EM A N D GIBBS
C H U A N H A I LIU Department of Statistics and Data Mining Technologies, 600 Mountain Avenue,
Research, Bell Laboratories, Lucent Murray Hill, NJ 07974, USA
The EM and Data Augmentation (or more generally, Gibbs sampling) algorithms are popular tools for parameter estimation but are often criticized for their slow convergence. In the last decade, there have been tremendous efforts to create efficient EM-type and Gibbs sampling algorithms, including interest in the recent parameter expanded (PX)-EM algorithm and its stochastic versions. With the 'covariance adjustment' interpretation of PX-EM as the theme, this paper provides an overview of the work on PX-EM and PX-EM inspired Data Augmentation algorithms, including the most recent one proposed by Liu, Liu, and Wu (2003). The student-t distribution is used as an illustrative example.
1. An Overview Missing data make statistical analysis difficult in general and make simple complete-data methods inapplicable in particular (Rubin, 1987, 1996; and Hopke, Liu, and Rubin, 2001). Filling in missing values has a strong intuitive appeal because simple standard complete-data methods can be applied. The induced simplicity by this general strategy can also help explain the popularity of the EM algorithm (Dempster, Laird, and Rubin, 1977), the Data Augmentation (DA) algorithm (Tanner and Wong, 1987), the Gibbs sampler (Gelfand and Smith, 1990), and their extensions in modern statistical computations for fitting complex models (see, for example, the recent books by Robert and Casella (1999), Chen, Shao, and Ibrahim (2000), and J. Liu (2001) on this topic). The EM algorithm is an iterative algorithm for maximum likelihood (ML) estimation from incomplete data. By incomplete data, we mean the observed data that can be augmented so that it is simple to analyze the augmented complete data. Denote by l^bs the observed data, by Ym;s the missing data, and by Ycom = {5^,bs,^mis} the augmented complete data. The underlying complete-data model f(Ycom\0) is required to preserve the 74
75
observed-data model /(Yobs|#) = J f(YCOm\9)dYmis, where 9 is the vector of parameters. Each iteration of EM consists of two steps: an Expectation (E)-step to impute missing values, and a Maximization (M)-step to analyze the imputed complete data. EM is simple and stable in the sense that each iteration increases the observed-data likelihood. Both the simplicity and stability have made EM a popular tool for ML estimation from incomplete data. To some extent, the statistical interpretations of the E-step and the M-step have also inspired many other algorithms, including the DA algorithm for Bayesian estimation, and the PX-EM algorithm for accelerating EM. For Bayesian estimation from incomplete data, Tanner and Wong (1987) developed the DA algorithm, which can be viewed as the stochastic version of EM. DA is now typically represented as a two-step Gibbs sampler. DA is formulated by replacing the E-step of EM with an Imputation (I)-step and replacing the M-step of EM with a Posterior (P)-step. The I-step imputes the missing data by taking a draw of Ym;s from its predictive distribution /(^misl^obs,^)- The P-step performs Bayesian analysis by drawing 9 from the complete-data posterior f(6\Ycom) oc f(Ycom\9)f(9), where f(9) denotes the prior distribution of the parameter 9. As for EM, the simplicity of DA or Gibbs has made it a powerful tool for fitting complex Bayesian models. While being popular tools for ML and Bayesian estimations, EM and DA are often criticized for their slow convergence. To accelerate the EM algorithm, Liu, Rubin, and Wu (1998) considered a more efficient analysis of the imputed complete data, which are effectively imputed by the E-step under the 'wrong' model before the convergence. The intuitive idea of PX-EM is to use a 'covariance adjustment' to correct the analysis of the M-step, capitalizing on extra information captured in the imputed complete data. To accomplish this, Liu et al (1998) introduced the technique of parameter expansion and proposed the parameter expanded (PX)-EM algorithm. PX-EM expands the complete-data model f(Ycom\9) further to obtain the parameter-expanded complete-data model f(Ycom\® = (9X,XX)) while preserving the observed-data model in the sense that there is a manyto-one onto mapping, called the reduction function, 9 — R(Q) such that /(^obs|©) = f(Y0bs\9 = ^(©))- That is, the expansion parameter is unidentifiable from the observed data. After parameter expansion, PX-EM is simply an EM algorithm applied to the parameter-expanded complete-data model with the M-step followed by the explicit 'covariance adjustment' 9 = R(9X, Xx). As a result, PX-EM shares with EM its simplicity and stability. For more discussion on PX-EM, such as its limitations, see Liu et al
76
(1998). Since the development of PX-EM for maximum likelihood estimation, there have been some considerable efforts in studying its stochastic version for Bayesian analysis (Meng and van Dyk, 1999; Liu and Wu, 1999; and Liu 2003). To provide an overview of the work, a new methodology for studying algorithms seems necessary. This explains the use of the phrase "Algorithm Mining" in the title, which is borrowed from the "Data Mining" and "Statistical Methods Mining" literature and may serve a good way to overview the recent work in the development of the algorithms for EM and Bayesian computation. By Algorithm Mining, or more specifically, Statistical Algorithm Mining, here we mean thinking statistically about algorithms so that we can understand, organize, and expand better the creative thoughts built into the existing algorithms. To construct parameter expanded (PX)-DA algorithm (or the Marginal DA algorithm) both Meng and van Dyk (1999; see also van Dyk and Meng, 2001) and Liu and Wu (1999) started with the same framework where the expanded parameter space consists of the original parameter and the expansion parameter, that is, 0 = (9, Xx), and a proper prior distribution f(\x) on the expanded parameter Xx. Consider, for example, the scheme in Meng and van Dyk (1999) and Liu and Wu (1999) with the following two steps, henceforce MvDLW. Step 1. Draw A^' from f{Xx), and then draw Ymjs from f{Ymis\Y0bs,9, Xx ). Step 2. Draw (8,\x) from f(0,\x\Yohs,Ymis) = /(A x |Y obs , Ymis) • f{9\Xx, Y0bs, Ymjs). Statistically, Step 1 can be viewed as collecting data in two inner steps: (1) randomly select a "design" Ax , and (2) collect the data with the design X{xd). Step 2 can be viewed as analyzing the data with the known design Xx treated as unknown. Although this idea is very interesting and useful, it should be noted that the underlying philosophy is quite different from that of PX-EM. Given the design Xxd^ and the data Y com , PX-EM would make a more efficient data analysis by adjusting for the difference between observed statistics and those expected from the correct model with the known design Xx . This philosophical difference has motivated some extensive and interesting work on the use of improper priors to target the DA version of PX-EM using limiting arguments. The use of the conjugate priors for the expansion parameter Xx, investigated by van Dyk and Meng (2001), is appealing and consistent with the general strategy of data augmenta-
77
tion that makes the corresponding algorithms easy to implement. Since the improper priors are not unique, optimization techniques for finding the optimal improper prior are necessary. Prom the point of view discussed above, Scheme 2 of Liu and Wu (1999), formulated using group transformations, is very different from MvDLW and is philosophically closer to PX-EM. As with PX-EM, however, we are concerned with the following two questions. (a) Can PX-DA be formulated without using group transformations? (b) Can PX-DA be applied to parameter-expanded models that do not have group representations? Here, we present in Section 6 an alternative formulation of PX-DA proposed recently by Liu, Liu, and Wu (2003), which preserves both the formulation and the covariance adjustment interpretation of PX-EM. The new definition can also be viewed as a generalization of Scheme 2 of Liu and Wu (1999). For convenience, we call a complete-data model parameter-expandable (PXable) if it allows for a parameter expansion in the sense of Liu et al (1998). The current results of Liu et al (2003) show that under mild conditions, the expanded prior is uniquely defined for the PX-able models that have group representations, where the result on the existence of the expanded prior is due to Liu and Wu (1999) and Liu and Sabatti (2000). According to the new definition of PX-DA, the unique expanded prior for the expansion parameter A^ can also be obtained from the conjugate priors for Xx, without using the group theoretical results. Thus, we have a positive answer to Question (a). Nevertheless, the group-theory-based formulation provides a mathematically elegant way of doing PX-DA. For an answer to Question (b), a 'counter' example is sufficient. Liu et al (2003) show that the Poisson imaging model of Liu et al (1998) provides such a counter example. In the context of DA, Liu (2003) noted that there is an alternative way of accomplishing the covariance/covariate adjustment. Using statistics, instead of extra parameters, to capture extra information in the imputed complete data, Liu (2003) proposed the so-called covariance-adjusted (CA) DA algorithm. CA-DA extends ordinary DA in such a way that the P-step of CA-DA performs a covariance adjustment for a better analysis. The way that CA-DA does this for PX-able models is to re-impute a sufficient statistic for the underlying expansion parameter. Formulated to adjust for statistics imputed under the wrong model, CA-DA is somewhat more general than PX-DA, as shown by the artificial example in Section 7 as well as the real world examples in Liu and Sun (2000), Liu and Rubin (2002),
78
and Liu (2003). The rest of the article is arranged as follows. Section 2 describes the student-t distribution, which is to be used as a running example. The special case of the t distribution with a single observation, the unknown scale parameter, and the known location parameter is used to demonstrate the improved rate of convergence by more efficient analyses introduced in PX-EM, PX-DA, and CA-DA. Sections from 3 to 7 describe EM, DA, PXEM, PX-DA, and CA-DA, respectively. Finally, Section 8 concludes with a brief discussion. 2. The Student-t Distribution The student-t distribution has served as a useful tool for robust statistical inference (Lange, Little, and Taylor, 1989; Liu, 1995, 1996; and Pinheiro, Liu, and Wu, 2002). Liu (1997) noted that ML estimation of the t-distribution has also motivated many EM-type algorithms, including the ECME algorithm (Liu and Rubin, 1994), the efficient EM algorithm (Kent, Tyler, and Vardi, 1994; and Meng and van Dyk, 1997), and the algorithm of Liu (1997), which is a PX-ECME algorithm. Liu et al (1998) also used the t-distribution as the motivating example of PX-EM. Here we consider the univariate t-distribution t(/u, a2, v) with the unknown location parameter /i, the unknown scale parameter a (a > 0), and the fixed degrees of freedom v (y > 0). The complete-data model commonly used in practice for the t-distribution is as follows. For i — 1, ...,n, (T^,?/;) are independent with y%\{8,Ti) ~ N f n, — J
and
Tj|(9 ~ Gamma ( - , - J ,
where T^S are called the weights and the density function of Gamma (a, /3) is proportional to ra~x exp{—0T} for all r > 0. This leads to n\{0,yi) ~ Gamma I —^—,
1.
(1)
3. The EM Algorithm Denote by Y^bs the observed data, by /(l^>bs|#) the observed-data model with parameter 6, by l^0m = {Yobs,Ymis} the augmented complete data that consist of the observed data y o b s and the missing data Ymis, and by f(YCOm\8) the complete-data model that preserves the observed-data model,
79
that is, j f(Ycom\9)dYmis
= f(Yobs\6).
Let
Q{6\6') = E {In f(Ycom\9)\Yohs,
9'},
the conditional expectation of the complete-data log-likelihood. The EM algorithm maximizes the observed-data likelihood L(9) = ln/(y o bs|#) by iteratively maximizing Q(0\6') over 9 with 9' replaced with the current estimate of 9. More precisely, starting with 9^ in the parameter space of 9, the (t + l)st (t > 1) iteration of the EM algorithm consists of two steps: an E step and an M step, which are given as follows. E step. Compute Q(6>|(*+1) = arg maxgQ(e\0^). Dempster et al (1977) showed that (i) each iteration of EM increases L(6), which implies that EM is stable, and (ii) if EM converges to 9*, then 9* is a (local) maximum of L{9) (see Wu (1983) for more discussion). For the t distribution, we have l^bs = {j/j : i = 1, •••, n}, Ymls = {T, : i = 1, ...,n}, Ycom = {(yi,Ti) : i = 1, ...,n}, and 9 = (/i,cr2). The complete-data log-likelihood function is linear in the weights. More specifically, we write ln/(y c o m |0) = - | l n a 2 - — ^TiGfc - nf +
h(Ycom,v)
i=l
with the term h(Ycom, i>) independent of 9. From (1), we have
T ^ E N 0 , ^
+
J + * )2AT2
(2)
for i = 1, ...,n. Let 0W = (/iW, (o- 2 ) (t) ) be the estimate of B at the ith iteration. Then, at the (t + l)st iteration EM for t-distribution has the following two steps. E step. Assume 9 = 9^, impute the expected values of the weights fj|0(t) based on (2) for i = 1, ...,n. M step. Maximize the expected complete-data log-likelihood, giving ^t+D
=
T,7=ifi\ewyi
a n d ( C T 2 ) ( t+i) =
Er=iV'>(^-M(t+1))2^
For the special case with a single observation y = yi ~ t(/j, = 0, a2, v) with the fixed degrees of freedom v, the EM sequence { ( c 2 ) ^ : t = 0,1,...} is as follows.
(^"-(^^0-.,
.-0.1....
80
It is easy to show that <x2 = \im.t^00(a2)(-t^
= y2 and that
ln(-lnfr 2 = 1 2 2 tiJS. l n ( a ) ( ' ) - l n a ~ 1/+ 1' Thus, when i/fsO, EM has a very slow rate of convergence. 4. The D A Algorithm The setting for the DA algorithm is similar to that for EM, except for the extra prior distribution f(9) for the parameter 9, which is required for Bayesian inference. DA is formulated by replacing the E-step and the Mstep with the I-step and the P-step, respectively. More specifically, given the draw of 8 obtained at the tth iteration, the (t + l)st iteration of DA consists of the following two steps. / step. Impute F m i s by drawing Y^1] P step. Draw 0(' +1 ) from f(e\Yc(o^1})
from /(Ymis|yobs, # (t) )oc f ( Y ^ \9) f (9), where
v(' +1 ) - zv L v r(t+1 h 'com
— Vobsi •'mis
/•
For the t-distribution, we use the prior distribution /(/i, a2) oc <j~2 (—oo < /x < oo; a2 > 0). Let IG(a, (3) denote the inverse Gamma distribution. Then, DA iterates between the following two steps. / step. Draw n from Gamma (^±1, v+^-^)2la"\ P step. Draw a2 from IG ( ^ ^ , ^
N
^
for
i = 1,..., n.
J and then draw fj, from
{*> rfc)' w h e r e y= Ster-
For the special case with a single observation y = y\ ~ t(/j, = 0,a2,u) with the fixed degrees of freedom v, the P-step of DA is implemented to draw a2 fromlG U, ^ ) . The resulted DA sequence {(
J -,
„ , ^+
^-^1-ZJ^l L^ + l
( ff 2)(t+l)
„
+ 1
it = o,i,...),
where F,, „ + { is an independently F^+i-distributed sequence. As with EM, DA has a very slow rate of convergence when i / « 0 . 5. The P X - E M Algorithm The PX-EM algorithm expands the complete-data model f(Ycom \ 9) used in EM to a larger model fx(Ycom | 6X,\X) by including the extra parameter
81
Ax, which is typically hidden in /(l^ 0 m I 0) with a fixed value Ax — X. Let © = (8X, Xx) be the expanded parameter. The expansion preserves the observed-data model in the sense that there is a many-to-one onto mapping, called the reduction function, 6 = R(@) such that f(Y0hs\8) = f(Y0bs \ 0 =
R(0)). PX-EM extends EM by replacing the complete-data model f{Ycom\0) with the expanded complete-data model fx(Ycom\6x, A x ). More specifically, starting with (0XO) = 0(°>,AXO) = A), the (t + l)st iteration of PX-EM consists of a parameter-expanded E-step and a parameter-expanded M-step as follows. PX-E step. Compute the conditional expectation of the completedata log-likelihood
Qx{Q\e(t)) = v{infx(Ycom\e)\Yohs,ex PX-M step. Find ©C+1) that maximizes Qx(G\6(t)) then obtain 6>
= ev,\x = x). over G, and
For the t-distribution, we expand the parameter space by activating the scale parameter Xx of the missing weights, which is fixed at A = 1 in the complete data model used for EM. Let G = (nx,cr2,Xx). Then we write the expanded complete-data model as follows. For i = l,...,n, (TJ,?/J) are independent with
!/i|(e,Ti)~N[/ia:)^)
and
-£ G ~ Gamma ^ - , - J . *x
The marginal distribution of y, obtained from the expanded model is t(/j,x,a2/Xx,i/). This leads to the reduction function /i = fix
and
a2 = o1/Xx
with the null value of Ax: A = 1. Thus, PX-EM for the t distribution iterates between the following two steps. PX-E step. This is the same as the E-step of EM. PX-M step. This is the same as the M-step of EM except that (
82
This algorithm for the t-distribution was first proposed by Kent et al (1994). Meng and van Dyk (1997) showed that this algorithm can be obtained as an EM with a different complete-data model. Liu et al (1998) provided the above PX-EM version. Liu et al (1998) showed that in general PX-EM has a faster rate of convergence than its parent EM. To demonstrate a dramatically improved rate of convergence obtained by PX-EM, here we consider the the special case with a single observation y = yi ~ t(/j. = Q,o2,v) with the fixed degrees of freedom v. For any starting value (cr2)^ > 0, the PX-M step gives (er2)^1) = /'"i|0(*>2/i/'ri|0(*) = y\- That is, for this special case PX-EM converges in one iteration, whereas EM can be hopelessly slow, as shown in Section 3. The explicit interpretation of the PX-M step as a covariance adjustment is given in Liu et al (1998). Because the expansion parameter A^ is hidden and takes a fixed value \ x = A in the unexpanded model, 0 = (9X,\X) is typically constructed in such a way that the complete-data likelihood has two factors with distinct parameters 6X and Ax. In this case, the estimate of 6X obtained in the PX-M step is often the estimate of 6 given by the M step of EM. This parameterization makes it simple to implement PX-EM by modifying the M step of the parent EM. Note that R(0X,\) = 8. This also leads to the following explicit interpretation of the PX-M step as a covariance adjustment 0(t+D where 3 = —if' aA
*
=
fl(0(t+i), Ai t+1 )) « 9xt+V +/3(Aii+1> - A),
*'
. For the t-distribution, this becomes the identity AX=A
ln( = ln(a 2 )( t+1 > - (lnA£+1> - In A). 6. The P X - D A Algorithm As discussed in Section 1, we present a new formulation of PX-DA proposed recently by Liu et al (2003). Making the use of the setting of PX-EM, this PX-DA expands the prior distribution f(6) for 9 to fx(&) = fx(0x,Xx) for the expanded parameter 0 while preserving the observed-data posterior in the sense that fx(0 = R(&),\x\Ymis,Yohs)f(Ymis\Yohs)dYmisd\x = f(9\Yohs), (3) / where the fx(.) is the posterior density function obtained from the expanded model. Condition (3) is also necessary for PX-DA to converge properly. Note that the corresponding condition for DA is
83
J f(S\Ymis, Vrobs)/(^mis|^'obs)^^mis = f(0\Yobs). Since the expansion parameter in PX-EM is used only in the PX-M step for a more efficient analysis, there is no problem in using improper priors for Ax as long as / x (©|l^ o m ) is proper. With the expanded prior / x (©), the PX-DA is obtained by applying DA to the expanded model in the way that PX-EM applies EM. More precisely, each iteration of PX-DA consists of a PX-I step and a PX-P step as follows. PX-I step. Impute Ym\s with a draw from /(Vmisl^obs, #)• PX-P step. Draw 6 from / x ( e | F c o m ) oc / x (y c om|©)/ x (©), and then obtain 6 = R{&). Interesting enough, one can show that under mild conditions such as that Ax and 0 are independent a priori, the required expanded prior, excluding the trivial cases with prob(Ax = A) > 0, is uniquely defined for commonly used PX-able models. Thus, for example, from Liu and Wu (1999) we have fx(Xx) oc const if Ax is a 'location' parameter, and /X(AX) oc |A x | _fc if Ax is a (kx k) matrix 'scale' parameter, where location and scale parameters are corresponding to the (location) translation and scale transformations. Alternatively, one can also find the expanded prior for /X(AX) by applying the fundamental condition (3) to a class of conjugate priors for Ax, without using the group theory. For the t distribution, the expanded prior for / x ( 0 ) is / x (/i x ,<7 2 ,A x ) oc
CTJ2AJ1
(-00 < yux < oo;al > 0; Ax > 0).
Hence, PX-DA for the t distribution iterates between the following two steps. PX-I step. This is the same as the I-step of DA. PX-P step. Draw (fj,x,ax) in the same way that the P-step of DA draws (//,
v
^ T' ) i a n d then
set /x = nx and a2 =
84
7. The C A - D A Algorithm The CA-DA algorithm modifies the P-step of DA to re-impute a sufficient statistic S,(Y'mis) of the expansion parameter of PX-EM. Formally, making the use of a one-to-one mapping: (S m i s ,C m i s ) = (S(F m i s ),C(F m i s )) = ^(yinis), e a c n iteration of CA-DA steps through the following two steps. CA-I step. Draw y m i s from f{Ymis\8). CA-P step. Draw (Smis,9) from /(S , mis ,0|y o bs,C' m i s ). For the student-t distribution, we use the one-to-one mapping with fi unchanged, (fi2 = cr 2 /s, s = Xl?=i Ti-> wi = Til S j = i Tj for i = 1,..., n with the constraint X^iLi Wi = 1- Since fi is kept unchanged, the Jacobian of the inverse transformation is the same as that of the transformation a = s(fi2,
T\ = s I 1 — \_.wi
a n d Ti = STi
for i = 2,...,n, where (fi2 is a positive scalar. Because da2/d(fi2 = s and dri/d(fi2 = 0 for i = l,...,n, the Jacobian of the transformation from (/X,CT 2 ,Ti,...,T n ) tO (ll,(fi2,S,W2,-..,Wn) 3ri ds
din
8T„
dW2
du)2
dwn
dvin
8w2
is 1
- E " = 2 WiW2 —S S
...Wn
dT„,
This leads to the conditional distribution of (/u,
(n/2+l)
exp
2 I
gnu/2-
1
exp I ——s > dfid(fi2ds.
Therefore, s is independent of both (fi, (fi2) and w = (u>i,..., wn) and follows the Gamma distribution Gamma(ni//2, z//2). Letting S' m j s = s = Yl7=iTi and letting C m j s = (wx,..., W2), w e have the following CA-DA algorithm: CA-I step. This is the same as the I-step of DA. CA-P step. Draw s from Gamma ( ^ , | ) , replace Tj by T;^ S T ., and then do the P-step of DA. It is clear that this CA-DA is effectively the same as PX-DA. Typically, implementation of CA-DA involves somewhat tedious derivation of the Jacobian of the one-to-one mapping (5 m i s , Cmis) = M(Ym-ls). For this, Zhang and Fang (1983) and Fang and Zhang (1990) provide useful tools.
85
Since it corrects imputed statistics directly, CA-DA can be applied to models where PX-DA is not applicable. To demonstrate this, we consider the following complete-data model, which is obtained from the special case of the t distribution considered in previous sections by replacing the missing weight with an incomplete Gamma variable. The missing data consist of a single incomplete Gamma variable T, that is, f(r\v) oc W 2 _ 1 exp{—VT/2} (0 < r < TQ), where v and TO are constant. The observed data consist of a single observation y. The complete-data model is complete with y\{r, a} ~ N(0, CT 2 /T), where er2 is the unknown parameter with the prior /(
8. Discussion We provided in this paper a brief review of the work on the PX-EM algorithm and PX-EM inspired DA algorithms for Bayesian inference. This paper shows that statistical thinking played an important role in both creating efficient EM and DA schemes and understanding different schemes. Although the author has tried to be objective by using the 'covariance adjustment' interpretation of PX-EM as a golden criterion, the review itself is certainly subjective. Nevertheless, different schemes have their own values in understanding PX-EM from different perspectives. In the past five years, it has been discovered that PX-EM and PX-DA can be applied to a variety of statistical models. This includes the multivariate student-t distribution, factor analysis models, mixed-effects models, Poisson imaging models, and multivariate probit models. The detailed discussions on these models appear in various' places (see, for example, Liu et al, 1998; Meng and van Dyk, 1999, Liu and Wu, 1999; Chen and Liu, 1999; Liu and Sun, 2000; Liu and Sun, 2002; van Dyk and Meng, 2001; Liu and Protassov, 2001; Liu, 2001, 2003; and Pinheiro, Liu, and Wu, 2001). We expect to see more applications in the future. Care must be taken, however, when the I-step of DA consists of a sequence of Gibbs-steps, for example, as in the CA-DA implementation for the multivariate probit models (Liu, 2001). The fundamental condition (3) does not imply the needed 'consistency' condition f fx(Q,^x\ymis,Yobs)d\x = f(S\Ym-ls). In this case, an explicit adjustment to Vmis is necessary and can be done using, for example, CA-DA as in Liu (2001), or equivalently, Scheme 2.1 of Liu and Wu (1999). The Marginal DA algorithm of Meng
86
and van Dyk (1999) can also be modified to handle this problem, as implied in t h e van Dyk and Meng's reply to the discussion of Liu (2001) on van Dyk and Meng (2001). For this, t h e ' d a t a transformation' argument of Liu and Wu (1999) is expected t o be useful. It is also important in practice t o identify slow-converging components for which it is worthwhile to consider alternative or new algorithms. Liu and Rubin (2002) considered a model-based approach to identifying slowconverging components. T h e y showed t h a t ordinary DA can be so hopelessly slow t h a t no meaningful results can be obtained. Leaving many other i m p o r t a n t topics undiscussed, we conclude the paper by quoting Rubin (1997): "And there is still much to do!"
Acknowledgments T h e paper was based on a recent invited talk given in t h e Department of Statistics at the University of Michigan. T h e author thanks Professor Yingnian Wu for insightful discussions and Professor J u n Liu and two referees for their thoughtful comments.
References 1. Chen, M. and Liu, C. (1999). Comments on "Simulated sintering: Markov chain Monte Carlo with spaces of varying dimensions" by J. S. Liu and C. Sabatti, Bayesian Statistics 6, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford University Press, 402-405. 2. Chen, M., Shao, Q., and Ibrahim, J. G. (2000). Monte Carlo Methods in Bayesian Computation, Springer-Verlag, New York. 3. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm (with Discussion), J. R. Statist. Soc. B 39, 1-38. 4. van Dyk, D. A. and Meng, X.-L. (2001). The art of data augmentation (with discussion), J. Comput. Graph. Statist. 10, 1-111. 5. Fang, K-T. and Zhang, Y-T. (1990). Generalized Multivariate Analysis, Springer-Verlag, New York and Science Press, Beijing. 6. Gelfand, A. E. and Smith, A. F. M. (1990). Sampling based approaches to calculating marginal densities, J. Amer. Statist. Assoc. 85, 398-409. 7. Hopke, P. K., Liu, C , and Rubin, D. B. (2001). Multiple imputation for multivariate data with missing and below-threshhold measurements: timeseries concentrations of pollutants in the Arctic. Biometrics 57, 22-33. 8. Kent, J. T., Tyler, D. E. and Vardi, Y. (1994). A curious likelihood identity for the multivariate t distribution. Coram. Statist. Simul. Comp. 23, 441-53. 9. Lange, K. L., Little, R. J. A., and Taylor, J. M. G. (1989). Robust statistical modeling using the t distribution. J. Amer. Statist. Assoc. 84, 881-896.
87 10. Liu, C. (1995). Monotone Data augmentation using the multivariate t distribution. J. Multi. Anal. 53, 139-158. 11. Liu, C. (1996). Bayesian robust multivariate linear regression with incomplete data. J. Amer. Statist. Assoc. 9 1 , 1219-1227. 12. Liu, C. (1997). ML estimation of the multivariate t distribution and the EM algorithms. J. Multi. Anal. 63, 296-312. 13. Liu, C. (2001). Bayesian analysis of multivariate probit models — comments on "The art of data augmentation" by Van Dyk and Meng (with discussion). J. Comput. Graph. Statist. 10, 75-81. 14. Liu, C. (2003). Alternating subspace-spanning resampling to accelerate Markov Chain Monte Carlo simulation. J. Amer. Statist. Assoc. 98, to appear. 15. Liu, C , Liu, J. S., and Wu, Y. (2002). A note on parameter-expanded data augmentation, Technical Report, Bell-Labs, Lucent Technologies. 16. Liu, C. and Protassov, R. S. (2001). Bayesian estimation of the multivariate ordinal probit regression model, Qualifying paper, Department of Statistics, Harvard University. 17. Liu, C. and Rubin, D. B. (1994). The ECME algorithm: An simple extension of EM and ECM with faster monotone convergence. Biometrika 8 1 , 633-48. 18. Liu, C. and Rubin, D. B. (2002). Model-based analysis to improve the performance of iterative simulations. Statistica Sinica 12, 751-767. 19. Liu, C , Rubin, D. B., and Wu, Y. N. (1998). Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika 85, 755-770. 20. Liu, C. and Sun, D. X. (2000). Analysis of interval-censored data from fractionated experiments using covariance adjustments. Technometrics 42, 353365. 21. Liu, C. and Sun, X. (2002). Maximum likelihood estimation from incomplete data: the EM-Type algorithms (in Chinese), in Advanced Medical Statistics, eds. J. Fang and Y. Lu, 695, People's Medical Publishing House, Beijing, 695-705. 22. Liu, J. S., (2001). Monte Carlo Strategies in Scientific Computing, SpringerVerlag, New York. 23. Liu, J. S. and Sabatti, C. (2000). Generalized Gibbs sampler and multigrid Monte Carlo for Bayesian computation. Biometrika 87, 353-369. 24. Liu, J. S. and Wu, Y. (1999). Parameter expansion for data augmentation. J. Amer. Statist. Assoc. 94, 1264-1274. 25. Meng, X. L., and van Dyk, D. (1997). The EM algorithm — an old folk song sung to a fast new tune. J. R. Statist. Soc. B 59, 511-567. 26. Meng, X. L., and van Dyk, D. (1999). Seeking efficient data augmentation schemes via conditional and marginal augmentation. Biometrika 86, 301-320. 27. Pinheiro, J. C , Liu, C. and Wu, Y-N. (2002). Efficient algorithms for robust estimation in linear mixed-effects models using the multivariate tdistribution. J. Comput. Graph. Statist. 10, 249-276. 28. Robert, C. P. and Casella, G. (1999). Monte Carlo Statistical Methods, Springer-Verlag, New York. 29. Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys, Wiley,
88
New York. 30. Rubin, D. B. (1996). Multiple imputation after 18+ years. J. Amer. Statist. Assoc. 9 1 , 473-489. 31. Rubin, D. B. (1997). Comment on "The EM algorithm — an old folk song sung to fast new tune" by X.L. Meng and van Dyk. J. R. Statist. Soc. B 59, 541-542. 32. Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). J. Amer. Statist. Assoc. 82, 528-550. 33. Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Ann. Statist. 11, 95-103. 34. Zhang, Y-T. and Fang, K-T. (1983). An Introduction to Multivariate Analysis (in Chinese), Science Press, Beijing.
LARGE DEVIATIONS A N D DEVIATION I N E Q U A L I T Y FOR K E R N E L D E N S I T Y ESTIMATOR IN L i ( # D ) - D I S T A N C E
LIANGZHEN LEI, LIMING WU 1 AND BIN XIE Department of Mathematics and Statistics,
Wuhan
University
Universite Blaise Pascal In this paper we establish large deviation estimations for the nonparametric kernel density estimator /£ in L 1 (K d ) and provide a gaussian type deviation inequality for the L1(Rd)-distance between f£ and the underlying density.
1. Introduction Let {Xi,i > 1} be a sequence of independent and identically distributed random variables (i.i.d.r.v in short), taking values in M.d, defined on probability space (fi,^ 7 , P) with distribution measure d/j, = f(x)dx, where the density / e L 1 (R d ) is unknown. The empirical measure is Ln = n S"=i fiXi • Let K be a measurable function such that (HI)
K > 0, JRd Kdx = l.
and set Kh(x) = -^K{^). as usually as:
The kernel density estimator of / is defined
S(.) = «i..i.-it^(^), * e * W where {hn,n > 1} is a sequence of bands that is, a sequence of positive numbers satisfying (H2)
hn —> 0,
nh^ —> +oo
as n —> oo.
The limit behavior of /* in (L 1 (R d ), || • ||i := || • ||LI(R'*)) is a subject of current study. L. Devroye (1983), in a fundamental paper on the subject, proved under (HI) and (H2), that ||/* — / I I L 1 ^ ) —> 0, a.s. (the strong 89
90
consistence), and proved the following exponential convergence: for each r > 0, there are C, S > 0 such that P(ll/„*-/Hi>r-) l
(2)
More recently Louani (2000) obtains the large deviation estimation associated with (2), i.e., giving an exact identification of S as n —> oo. To state his result, define, for any 0 < a < 1, the following functions:
r + ( r ) :=
J (° + §) M l + £ ) + (1 - a - S) log(l - ^ y ) I +oo,
if 0 < r < 2 • 2a otherwise (3)
=
J (a-§)log(l-£) + (l-a+f)log(l + ^ ) 1 '-oo,
if 0 < r < 2a otherwise (4)
ro(r):=min{r+(r),r-(r)}; l(r) := inf{r a (r) : 0 < a < 1}
(5)
Now the main result of Louani (2000) is stated as lim - l o g P ( | | / * - / | | i > r ) = - i ( r ) , Vr > 0.
(6)
n—>oo n
To complement those two results, we study in this paper the following two questions: Question 1. What is the large deviation estimation of P(||/jJ — g||i < 5) for any given density function gl Question 2. For each fixed n and deviation r > 0, how can one bound
n\\fn-fh>r)? For large deviations on L°°(Rd) which is much more difficult, the reader is referred to Gao (2001) and the references therein. 2. Main Results For the language of large deviation, the reader is referred to Dembo (1998) and Wu (1997). Our first result may seem a little abstract. It extends the well known Sanov theorem.
91
Proposition 2.1. Assume (HI) and hn —> 0 (without (H2)). Then as n goes to infinity, !P(fn £ •) satisfies the large deviation principle (LDP in short) on L1(dx) w.r.t. the weak topology a(Lx, L°°)), with the rate function given by K \
=
lEntM
•= SRd9(x) log jf^dx,
if ge V,g(x)dx
[+oo
otherwise
(7) where V is the set of all probability density functions on R d . More precisely (1) (GRF) I is a Good Rate Function on (L 1 (E' i ),cr(L 1 , L°°)), i.e., for any L>0, [I < L] is compact in (L 1 (R d ),(r(L 1 , L°°)). (2) (LLD) (Lower bound of Large Deviation) For any open subset G C (L\Rd),a(L\L°°)), lim i n f - l o g P ( / * e G) > - inf 1(g). n—>oo n
g£G
(3) (ULD) (Upper bound of Large Deviation) For any closed subset F C
limsup - logP (/* e F) < - inf 1(g). n^oo
n
g£F
Remark 2.1. Because I is not a good rate function on L1(Rd) w.r.t. the norm || • ||i-topology, P(/* e •) does not satisfy the LDP on (L1^), || • \\i). The following result says however that P(/£ € •) satisfies the weak*LDP on (L 1 (R d ), || • ||i). It gives a satisfactory answer to Question 1) in the Introduction. Theorem 2 . 1 . Assume (HI) limlimini
Then for any g G L1
and (H2).
-\ogF(\\f*
- g\\L1{Rd) v
5—>0 n—>oo n
< S) '
(8)
= lim l i m s u p - log P( ||/* - g\\LH&i) o—*v n—*oo
< S) =
-1(g).
'I'
Moreover for any || • \\\~open and convex subset G in Ll(WLd), lim - l o g P ( / „ * e G ) = -inf/(). n-»oo n
Remark 2.2. By (8), for any || • ||i-open subset G in lim - l o g P ( / : G G ) > - i n f / ( 5 ) . n—>oo n
(9)
geG
g£G
L1(Rd),
92
In particular, we have for any r > 0, lim i l D g P ( | | / ' - / | | > r ) > - i n f { / ( f f ) ;
\\g-fh>r}.
n—>oo n
We can prove that inf {/(); \\g — / | | i > r } = l(r), the rate function (5) found by Louani (2000) In other words the lower bound here is much more general. We now present our answer to Question 2. Theorem 2.2. Assume (HI). Let Jn := JRd \f*(x)-f(x)\dx Then for any n > 1 and r > 0,
(
= ||/*-/||i.
TIT \ —§")•
(10)
As KJn —> 0 under (HI) and (H2), the inequality above is much more precise than (2). About the deviation inequality L.Devroye (1988) has in his paper got under (HI), Tie
P(|J„ - E J „ | > e) < 2 e x p - ^ ^ but only for e small. His method is based on Poissonisation of n. Our proof will be completely different and based on the known deviation inequality of product measure w.r.t. the Hamming Distance, see the excellent monograph by Ledoux (1999) 3. Proofs of the Main Results 3.1. Proof of Proposition
2.1
Step 1 (identification of the Cramer function). For any k e L°° = (L 1 )*, we calculate the Cramer functional by means of the i.i.d. property, A(/c) := lim - l o g E e x p ( n ( / * , / c ) d x ) n—>oo n
=
n
=
n
where X = Xi.
l i m I l o limlog
g
E e x p ( | : ^ / ^ ( ^ ) ^ x-X
E
exp^yK^^-J
f c
(^
But it is well known that ^ / K ( ^ ) k(x)dx
to k(y) in measure dy.
tends
Since the law of X is absolutely continuous,
93
then -p- J K ( % ^ ) k(x)dx converges to k(X) in P-probability. Moreover, | p - / K ( ^j^-) k(x)dx\ < ||fcj|oo• Consequently by dominated convergence, we obtain A(fc) = logEexp(/c(X)) l o g / , zkix)n(dx) Moreover by the famous variational formula of entropy of DonskerVaradhan, the Legendre transformation A* : L^R^) —> [0, +00] of A is given by A*(g):=sup{(g,k)dx-A(k);
k e L°°}
= sup{(ff, k)dx - log J ek^n(dx); =
[Ent^g),
if
l+oo
otherwise
k e L°°}
g£V;
In other words A* (g) = 1(g) given in (7). Step 2: LLD. Since A(fc) is Gateaux-differentiable on L°°(]Rd), hence the LLD in (ii) follows by the abstract Ellis-Gartner theorem (see Wu (1997), Theorem 2.7). Step 3: G F R + ULD. Again by the abstract Ellis-Gartner theorem (see Wu (1997), Theorem 2.1), it is enough to show that if g : k —•> (g, k) is a linear form on L°°(Md) such that A*(g) := sup{( 5 , k) - A(ifc); k G L°°} < +00, 1
then g € L ^ ) . By following the proof of Wu (1997), Theorem 3.3, it suffices to prove that A(kn) —> 0 for any sequence (fcn)n>o of functions in L°°(K d ) decreasing dx — a.e. to zero. The last property follows by the explicit expression (11) of A(kn) and dominated convergence, for e"'00"00 >
3.2. Proof of Theorem
2.1.
The following basic result of Devroye is crucial. Lemma 3.1. (due to Devroye) Under (HI) |/* - /|cfcc -> 0,
and P-a.s
(H2),
94
We can now prove the lower bound in Theorem 2.1, and (H2), we have for any g G L1
Lemma 3.2. Under (HI)
liminf i IogP(||/^ - g\\LH&,}
<5)>
-1(g),
V«5 > 0
(12)
Proof. We may assume that 1(g) < +oo (trivial otherwise). Without loss of generality we assume that (fi,.F,P) = ((M d ) N \ BN",nm") and Xi(w) = u>i,i > 1 are the coordinates on the product space CI. Let Fn •.= a(X\,--- ,Xn). The following method of changement of measure is standard for treating the LLD. Since 1(g) < +oo, g is a probability density such that u(dx) := g(x)dx <€; /j.(dx) = f(x)dx and Ent^g) = Jlog(g(x)/f(x))dv(x) < +oo. Consider the product measure Q := v®N . Then
Qu„=.(x1,..x„)=n^PK We have for any e > 0, r
n\\rn-9h<s)>
d¥
^ k •i[„/.-.„1
L exp f - ^ l o g ^ X O j
•l[||/;-fl||1<«]dQ
>« : exp(-n[£ni M (<;) + e])Q (^„, e f | [||/^ - g^ < S\) where A„, e := [± £ ? = 1 log(//)(X,) < £«*„((,) + e]. Now to prove (12), it remains to show that Q(An^E) —* 1 and Q([ll/n ~~ slli < ^]) —> 1, as n goes to infinity (for any e > 0). The first is easy, because by the law of large number,
-J^log^X,)
^E^log^-(X)
= Ent„(g),
Q-a.s.
The second, i.e., Q([||/£ - g\\i < 6]) —> 1, follows by Lemma 3.1 (applied to (Xi) under Q). D We now turn to give the Proof of Theorem 2.1. By Lemma 3.2, we have already limjnf liminf i l o g P ( | | / * - g\\Lim
< 5) >
-1(g).
95
For the upper bound, note that B(g,6) := {g € L1^4); \\g - g\\ < 5} is convex, closed w.r.t. || • \\\, then it is closed w.r.t. a(Ll,L°°). Thus by the ULD in Proposition 2.1, we have limsuplimsup - logP(||/* - g\\Lx{Rd) < 5) 5^0
n->oo
v
n
'
< limsuplimsup —log P(/* € B(g,5)) S—>0
n—*oo
<-liminf
inf
n
1(g) = -1(g)
where the last equality follows from the lower semi-continuity of I (by the GRF in Proposition 2.1). Hence (8) is established. The proof of (9) is similar. For any g G G, choosing some ball B(g, 5) = {g G L 1 ^ ) ; \\g - g\\ < 8} C G, we have by Lemma 3.2, 1(G) := liminf - l o g P ( / ; G G) > n—»oo
-1(g).
fl
As 5 G (3 is arbitrary, we obtain 1(G) > — info / . For the upper bound, note that the closure of the convex subset G w.r.t. || • ||i or w.r.t.
fl
g€G
Hence for (9), it remains to show that inf„e(5 1(g) = mfglzQ 1(g). For this purpose let g\ € dG := G\G. By a known result in Banach space theory (see Holmes (1975)), (11.1), p59), for any 50 £ G fixed, gt := tgi + (l-t)g0 £ G for any f 6 (0,1), As / is convex on L 1 (R d ) (for it is the Legendre transformation of A), t —> I(gt) is convex on [0,1]. Thus / ( 5 i ) > l i m / ( 5 t ) > inf 1(g) where the desired equality follows for g\ G dG is arbitrary. 3.3. Proof of Theorem
2.2.
All is based on the following Lemma 3.3. Let (Ei,Bi,/j,i),i = 1,- • • ,n, be arbitrary probability space, and let P = /zi (g> • • • ® fin be a product measure on the product space En = E\ x • • • x En- A generic point in En is denoted by x = (x\, • • • ,xn). Then, for every real measurable F on E such that \F(x) — F(y)\ < 1 whenever
96
x = (x\, • • • ,xn) and y = (yi,- • • ,yn) onM) differ by one coordinate (this is equivalent to say that the Lipchizian coefficiant of F w.r.t. the Hamming distance is not greater than one). Then /
P ( F > ErF + r) < exp
r2-
- — \ 2n
See Ledoux (1999) (Section 3.1, p l 6 2 ) for presentation of this important result. We now prove Theorem 2.2. We now prove Theorem 2.2. W i t h o u t loss of generality we may assume t h a t X\,--- ,Xn are coordinates xi,--- ,xn on the product space (fl,T,F) = ((Rd)n,Bn,n®n). Put F(xi,---
,xn)
:= - J „ ( x i , - - - ,xn) I
r /
= -
i ." - S~\Kh{z
A j-^d
- Xi)dz - f(z)
n
where h = hn. For any i = 1, • • • ,n and for any x,y Xj = yj for all j ^ i, we have \F(x) - F(y)\ <\
I
\Kh(z
-
Xl)
- Kh(z
-
€ (Rd)n
Vl)\dz
dz
such t h a t
< 1.
Consequently by Lemma 3.3 (applied t o F and —F), P ( | J n - EJn\
> r) < P (F - EF > y ) + P (F - EF <
^
2exp
1 rnri2\
(-2^LT]
=2exp
/
~ ) nr2\
(~T ''
the desired (10).
Acknowledgements We are grateful t o Prof. Gao Fuqing for useful conversations and communications of papers on the subject. This research was supported by t h e National Natural Science Foundation of China.
References 1. A. Dembo, O. Zeitouni (1998). Large Deviations Techniques and Applications. Second Edition. Springer. 2. L. Devroye (1983). The Equivaluce of Weak, Strong and Complete Convergence in L\ For Kernel Density Estimates. Ann. Statist. 11, 896-904. 3. L. Devroye (1988). The Kernel Estimate is Relatively Stable. Prob. Theo. Rela. Field. 77, 537-550.
97
4. F.Q.Gao.,(2001). Moderate Deviations and Large Deviations For the Kernel Density Estimators. Preprint. 5. R.B. Holmes, (1975). Geometric Functional Analysis and its Applications, G.T.M. 24, Springer-Verlag. 6. M. Ledoux (1999). Concentration of measure and logarithmic Sobolev inequalities, Seminaire de Probab. XXXIII, LNM 1709, 120-216, Springer. 7. D. Louani (2000). Large deviations for the L\-distance in kernel density estimation. J. Statist. Plan. Infer. 90, 177-182. 8. Wu, L.M. (1997). An introduction to large deviation, in Several Topics in Stochastics Analysis (J.A.Yan, S.G.Peng, S.Z.Fang, L.M.Wu.Eds), 225-336, Academic Press of China,Beijing (In Chinese).
LOCAL SENSITIVITY ANALYSIS OF MODEL MISSPECIFICATION
G U O B I N G LU University
MRC HSRC, Department of Social Medicine of Bristol, Canynge Hall, Whiteladies Road, Bristol
BS8 2PR,
UK
The effects of small misspecification for a parametric model are analysed in a general framework. Following a standard paradigm for local sensitivity analyses, we embed the postulated model in a tubular-like class of models introduced by Copas and Eguchi (2001), and then investigate the resulting biases in estimating the model parameters. The richness of the tubular-like class, as is shown in this paper, permits general analyses of local sensitivity to small model misspecification. The main cause for the biases is revealed to be the perturbations of the tangent space of the postulated model.
1. Introduction Model misspecification is frequently encountered in practice. In essence, a statistical model is a set of postulates based on the modeler's knowledge about the problem to be solved and hence only an approximation to the generating mechanism behind data. Thus a diagnosis for model specification is more or less a common need in model-build processes to examine the validity of postulates. Suppose that we have a random sample Y\,..., Yn of random vector Y with unknown density function h(y) and our interest is to make inference about some characteristics of the distribution, say 9(h). Let & = {f{y; 0) : 9 £ Q C W) be a parametric model satisfying some regularity conditions, and let 9n denote the maximum likelihood estimate (MLE) of 9 under &. Note that some authors prefer the term pseudo or quasi maximum likelihood estimate when the model specification is under inspection. Then, according to White (1982), as n —> oo, we have §n -^ 0* =
Argmme€eEh log 98
f(Y;9)
(1)
99
where #* is the so-called pseudotrue value of the parameter, and Men
- 0.) -^ N^AZ'B^A;1),
(2)
where A* =
-\im-Eh n
E
c>2 log f(Yk, 9) 89d9T
(3)
_fc=i
B*
=\im-Eh n
E
dlogf(Yk,0) 09
d\og
f(Yk,0) 89T
(4)
fc=i
Hence 9n is the best choice of 9 in the principle of (pseudo) maximum likelihood, provided there is no any extra knowledge available about the underlying density h(y). In this case, the reliability of model postulates mainly lies on the depth of our understandings about the background problem and the degree to which the model fits the data. If the model & is correctly specified (to some accuracy), then A* ~ £?*, and we have the ordinary results. In some other situations, certain extra knowledge may be available. For example, we may have an alternative model, say, £f', joining the competition. The validity of the use of the MLE under either model can be checked by the specification test. Cox (1961, 1962) considered the problem of testing separate families of hypotheses with the following form Hf:h(y)e{f(y;e):eeG}
vs
Hg : h{y) e {g{y; 7 ) : 7 € V].
(5)
This is known as the non-nested hypothesis test. Other tests for model misspecification include the Lagrange multiplier test, the Hausman test, White's information matrix test, and Newey's conditional moment test (see, e.g., White, 1994). Sensitivity analysis (SA) can also provide useful information about the effects of model misspecification on the parametric inference. In a broad sense, SA is profitable in studies wherever mathematical models are involved. It can be applied to dealing simply with uncertainties in the input variables and model parameters, as well as to incorporating model conceptual uncertainty, i.e. uncertainty in model structures, assumptions, and specifications (Saltelli, 2000). Compared with the non-nested specification test, SA requires less extra knowledge or assumptions about model specification, and local SA (LSA) is even more appealing when the extent of misspecification is small. Good references include Cook (1986),
100
Gould and Lawless (1988), McCulloch (1989), Lavine (1991), Neuhaus et al (1992), Guatafson and Wasserman (1995), Tsou and Royall (1995), Guatsfson (2001), and Oakley and O'Hagan (2002), among many others. A general approach to LSA is embedding the postulated model into a larger class of models and then comparing the change of estimates due to the small perturbations of the model. Suppose the embedding is parameterized as ^ = { / ( y ; 0 ) : 0 e e } C <S = {g(y; 0, 7 ) : 0 e 9 , 7 £ T}, where 7 is an additional parameter, and there exits a value of 7, say, 70, such that g(y,0,70) = f(y,Q),Vy- Let #**(7) denote the pseudotrue value of 0 under the enlarged model <& for 7 in a small neighbourhood of 70, that is 0**(7) = Argmin eee f?h
log-g(Y-,e,j) W
(6)
Then the bias due to model misspecification may be defined as the deviation of the pseudotrue values of parameter under Sf from that under &, i.e.,
*>(7) = M 7 o ) - M 7 ) -
(7)
Clearly 6(7) can be estimated consistently by 6,1(7) = @n — #71(7), where 9n(j) is the MLE under <S for given 7. The behaviour of the function b(-) or its estimate &„(•) at 70 reflects the local sensitivity of pseudotrue value of the parameter to model misspecification. However, a severe difficulty with this approach arises from the problem: how to choose a class & rich enough to include as many as possible models that are close to the postulated model? The motivation of this paper is to try to answer this question. We introduce a rich class of candidate models in section 2 and discuss its generality. The formulation for the model misspecification bias is given in section 3, where the cause of the bias is also analysed. The exponential family is taken as an example. Last section offers some comments.
2. A Tubular Class of Surrounding Models We use the Kullback-Leibler (KL) information distance to measure the departure of a density / from another density g and denote it by Dg{g, / ) = Eg\og(g/f). For the LSA, our interest is focused on the cases where the underlying model h is assumed to lie somewhere close to the postulated model with the parameter in a neighbourhood of the pseudotrue value.
101
Consider a class of models proposed by Copas and Eguchi (2001):
{
oo
g(y,9) = f(y,9)exp{
^
oo
~ «(*)} = £ A « ? = l'6
e£A^(*/,0)
G
° f' (8)
where {itj(y,#)} is an orthonormal basis of the linear space <2C = { u (y; 0) : E / u ( y ; 9) = 0,
£ / u ( F ; 6>)2 < oo}
(9)
with the inner product defined by product moment under / , and K(6) = log Ef exp{e ] T XiUi(Y, 9)}. i
Then for small e, it is easy to show that Df(f,g) ~ Dg(g, / ) ~ K(0) ~ e 2 /2.
(10)
Geometrically, for a small e, ^ e is a tubule-like set surrounding the postulated model & with an approximate "radius" e 2 /2 in the KL information distance and in the direction of A = {Aj}. Copas and Eguchi (2001) give a variety of examples to illustrate how LSA can be carried out by means of this model in the context of observational studies. We further discuss two basic properties of the class: (i) Generality A more general form of model (8) is ^ = {9(y;9) = f(y;9)ex1?{u(y,9)
- K(9)} : 8 € 0,u(y,0) e «r}.
(11)
In fact % = {g(y;8)e&:\\u(y;8)\\
= e},
where || . || is the inner product norm in &, and & = UeSfe. Theorem 2.1 Let h(y) be a density function with the same support as f such that Ef [log h(Y)] exists. Then there is a member g G & such that h (v) = 9(y), a.s. y. Proof This is equivalent to proving that, for any given /i(y), we can find a point 9o € & and a function u*(y, 9) e fy, such that h(y) = g(y, 0O)Let #o / 0 be a fixed point in 0 . Define
^* = {g(y, 9) = f(y, 9) exp{u*(y, 9) - K*(9)}
:9ee},
102
where u (y,9) =
,
lo
h(y)
„ ,
a \ ~ Ef J
§ tt.
f(y,0o)
lo
h(Y) S•
°f(Y,o0).
and
w = -£%**. w Po II ' *f(Y,6 y 2
0
It is easy to varify that u*(y, 9) € W and % ) = g(y, 90) £<£* C&.
D
The generality of $ lies upon the fact that & is actually a semiparametric model with the parameter space 0 and semiparameter space ^ , in which model & serves as a core family, and any densities in Sf are indexed by functions u 6 'W. Rather than a model aimed at ordinary statistical inference, <£ provides a mathematically convenient way of representing the possible distributions of Y close to the core model &, and hence provides a general framework for sensitivity analysis of model misspecification. (ii) Local Approximation By Taylor expansion and the standard results in minimizing quadratic function, we can prove: Theorem 2.2 Suppose that g(y, 6) € S?e for a small e. Let 6$ £ 0 be an inner point such that there is a neightbourhood Go C © of 9Q with diameter 0(e). Then we have the following local property Argmmee@oDgeo
(g(y; 90), f(y; 9)) ~ 90 + e £ ) XJ(90)-1Eg0
[ui(y, 90)£(e0)}. (12)
where 1(9) = {d/86) log f(y; 9). We remark the equivalence Argmin6,£)ft(/t, 5) = Argmax0.E/jlog<7. Thus theorem 2.2 can be applied to the approximation of MLE. 3. Bias and Score Basis Let U(i) = (u\,... ,um)T denote the vector of the first m basis functions, ^(1) the sub-space of °tt spanned by U(i), and A(i) = ( A i , . . . , A m ) T the corresponding coefficient vector. In this section we consider the sub-class
103
in which ^{\) is spanned by the score functions of f(y; 9), i.e.
VI = {
= i,ee &}. (13)
A direct application of theorem 2.2 leads to Theorem 3.1 Suppose that the underlying distribution h is included in &* for some A(i) and a small e. Then, based on a sample of size n, we have bn = ei(6n)-1/2\{1)+0(e2)
-^
6 = e7(^)-1/2A(1)+0(62),
(14)
where 9n is the MLE of 9 and 6* is the pseudotrue value under the postulated model &'. Thus, the size of the smallest eigenvalue of the Fisher information matrix provides an approximate measure for the sensitivity of estimates to small model misspecification. The smaller the smallest eigenvalue, the more sensitive the model would be. Moreover, theorems 2.2 and 3.1 convince us that the effect of small misspecification of a model on its parameter is mainly produced by perturbations of the tangent space spanned by the score functions. From the other angle of view, the submodel S^f can capture the main features of the bias problem in general cases, taking the "worst" possibility into consideration. A special yet important case is the exponential family. When covariates available, the postulated model & is often built on the conditional distribution and the generalized linear models are frequently used in practice. Substantial mathematical simplification can be gained from the analogical exponential form of the models of & and Sf. Let & be a full-rank exponential family with densities given by f(y; 9) = exp{9TS(y)
- C(9) + D(y)},
(15)
where S(y) is a jj-dimensional statistic and 9 is a p-dimensional natural parameter. Then model <S takes the form 5(2/;
9) = exp{S(y)T[9
+ eC*(^)-1/2A(1)]
- \C{9) + eC{9)TC{9)-1'2\{l)]
+ D(y) + 0(e2)}.
(16)
If we make a parametric transformation locally in an 0(e)- neighborhood of an inner point 9 by 4>(9;\{1)) = 9 + eC(9)-l/2\(1),
(17)
104
then, since C(0) + eC(0) T C(0)- 1 / 2 A ( i) = C (O + eC(0)- 1 / 2 A ( 1 ) ) +
0(e2),
model (16) can be rewritten as g(y; 4>) = exp{S(y)T - C(<j>) + D(y) + 0(e2)}
(18)
which is also an exponential family with a new natural parameter <j>. It is clear to see that, locally at a parameter value 6Q, the bias due to model misspecification can be approximated by ^o;A(i))-e0~eC'^o)-1/2A(1). Now consider the sensitivity of the moments for the sufficient statistic S. Let EgS and VargS denote the corresponding moments of S under the postulated model &, and E^S and Var^S the corresponding moments of S under the enlarged model (18). It is easy to varify the following first-order approximation: Theorem 3.2 For exponential family model (15) and enlarged model (18), the following approximations hold E^S(Y) Var^SiY)
= E0S(Y) = VareS(Y)
+ eC(e)1/2\{1)
+ 0(e 2 )
and
+ e[C(0)][C(0)- 1/2 A (1) ] + 0(e2),
where C(6) denotes the p x pxp 1998, Appendix A) is given by
(19) (20)
array, and the array multiplication (Wei,
[aikj\pxpxp[bk\pxl
—
/
,aikjbk pXp
Example 3.1 Let Y have the binomial distribution B(n,n) with parameter n € (0,1), where n is a fixed positive integer. Then the density of Y f(y;v)=(
jexplylog
y^— + n l o g ( l
-rj)
can be rewritten by f(r, 0) = r \ exp {yO - n log(l + ee)} ,
(21)
105
where 9 = log j^— is the natural parameter. Now S(y) = y, C(6) = nlog(l + ee) and A(i) = 1. Thus, under the enlarged model, we obtain that E+Y-EaY
+ eCid)1'2 nV2ee/2
= EeY + e[nr)(l - 77)]1/2 and Var+Y ~ Var9Y +
tC(6)C{6)-ll2 ni/2ee/2{1
_
efl)
= F a r ^ r + e n 1 ' 2 ^ - r})]^2(l - 277). Note that the above first order term in e vanishes if 77 = 1/2. This reveals the intuitive fact that the binomial model is relatively stable when the pseudotrue value of the parameter is near 1/2. Example 3.2 (Logistic Regression Model) Let y = (yi,- • • ,yn), where yi,i = l , . . . , n , are independent Binary variables with parameters 77$, and Xi,i = l , . . . , n , be the associated pdimensional covariates. If the link function is defined as 8i = log[77,/(1 — 77;)] = x[0, where /3 is the p-dimensional regression coefficient vector, then we have a logistic regression model with the conditional distribution of Y given X as follows
f(y\x; (5) = exp j ( £ V^Tj P ~ E l o ^ + ^
+ D^ } •
(22)
Let (3 and (3 denote the MLEs of (3 under model (15) and the enlarged model (16) respectively. Then eXi &X;
w> = E ^ and ex'0
It follows that p~P
+ e(XTAX)-l/2\{1),
(23)
106
where X is an n x p matrix with t h e i-row xj, and A is an n x n diagonal matrix with t h e i-th elements eXi @/(l + eXi ^ ) 2 . T h e LSA can be made by analysing t h e eigenstructure of t h e matrix XTAX.
4.
Comments
T h e LSA based on t h e enlarged model (8) or (13) requires only one extra assumption t h a t the underlying model h(y) lies e 2 / 2 away in KullbackLeibler information measure from t h e postulated model &. A significant property of t h e model is its generality as described by theorem 2.1. Of course we have t o pay a price for the weakness of t h e extra assumption and the generality of t h e model. To see this, recall theorems 2.2 and 3.1. It is in accord with our intuition t h a t t h e bias is a composition of three factors, namely, the magnitude of the closeness e described by t h e extra assumption, t h e square root of t h e precision matrix I(6)~1^2 given by t h e postulated model, and t h e misspecification direction A(i). However t h e parameter vector X^ is neither specifiable nor estimable without extra assumptions or further knowledge about model uncertainty. This is the price paid for the flexibility of t h e model, but it seems reasonable. Model (8) or (13) provides a general framework for LSA in model specification problems. T h e focus of this paper is only t h e basic cases where independent and identically distributed samples are modelled and analysed. In other cases, some covariates are often available and various kinds of models may be involved. T h e flexibility of our model may leave much room for more delicate analyses. Acknowledgement I am grateful t o professor J o h n Copas and professor Shinto Eguchi for valuable discussions and fruitful comments on this paper.
References 1. Cook, R.D. (1986). Assessment of local influence (with discussion). J. R. Statist. Soc. B 48, 133-155. 2. Copas, J.B. and Eguchi, S. (2001). Local sensitivity approximations for selection bias. J. R. Statist. Soc. B 63, 871-895. 3. Cox, D.R. (1961). Tests of separate families of hypotheses. Proc. 4th Berkeley Symp. 1, 105-123.
107 4. Cox, D.R. (1962). Further results on tests of separate families of hypotheses. J. R. Statist. Soc. B 24, 406-424. 5. Gould, A. and Lawless, J.F.(1988). Consistency and efficiency of regression coefficient estimates in location-scale model, Biometrika 75, 5353-540. 6. Gustafson, P. (2001). On measuring sensitivity to parametric model misspecification. J. R. Statist. Soc. B 63, 81-94. 7. Gustafson, P. and Wasseman, L. (1995). Local sensitivity diagnostics for Bayesian inference. Ann. Statist. 23, 2153-67. 8. Lavine, M. (1991). Sensitivity in Bayesian statistics: the prior and the likelihood. J. Amer. Statist. Assoc. 86, 396-399. 9. McCulloch, R.E. (1989). Local model influence. J. Amer. Statist. Assoc. 84, 473-478. 10. Neuhaus, J.M., Hauck, W.W. and Kalbfieisch, J.D. (1992). The effects of mixture distibution misspecification when fitting mixed-effects logistic model. Biometrika 79, 755-762. 11. Oakley, J. and O'Hagan, A. (2002). A Bayesian approach to probabilistic sensitivity analysis of complex models, Research Report, University of Sheffield, Sheffield, UK. 12. Saltelli, A. (2000). What is sensitivity analysis, in Saltelli, A., Chan, K. and Scott, E.M. (ed.). Sensitivity Analysis. Wiley, 1-15. 13. Tsou, T.S. and Royall, R.M. (1995). Robust likelihood. J. Amer. Statist. Assoc. 90, 316-320. 14. Wei, B.C. (1998). Exponential family nonlinear model. Berlin: Springer. 15. White, H. (1982). Maximum likelihood estimation of misspecified models, Econometrica 50, 1-25. 16. White, H. (1994). Estimation, Inference and Specification Analysis. Cambridge University Press.
EMPIRICAL LIKELIHOOD CONFIDENCE INTERVALS FOR T H E D I F F E R E N C E OF T W O QUANTILES OF A POPULATION
YONGSONG QIN Department of Mathematics, Guangxi Normal Guilin, Guangxi, China
University
YUEHUA WU Department of Mathematics and Statistics, York University, 4700 Keele Street, Toronto, Ontario, Canada M3J IPS In this paper, by employing the empirical likelihood method, an extension of Wilks' theorem is developed, which is then used to construct a confidence interval for the difference of two quantiles with asymptotically correct coverage.
1. I n t r o d u c t i o n Let X\, • • • , Xn be univariate random variables which are independent and distributed as X with unknown distribution and density functions F(-) and / ( • ) , respectively. For given 0 < q\ < q-z < 1, assume t h a t t h e qith quantile, 9 = F~1(qi), and t h e 52th quantile, 6 + A = F~l{q2), are uniquely defined respectively. T h e purpose of this paper is t o construct confidence intervals for A . Here A is t h e range of X between t h e q± and q2 quantiles. If X is a normal random variable, q\ = 0.025 and qi = 0.975, then A is 1.96er where a is t h e s t a n d a r d deviation of X. In this case, a confidence interval for a can be easily obtained from the confidence interval for A. In practice, one may ignore t h e effect of extreme values below or above some quantiles. Hence A may be viewed as the effective range of X. It is an important practical issue t o make inferences for ranges of populations, especially in medical field. For reference, see Wade (1994,1998) and A m a r a t u n g a (1997), among others. T h e empirical likelihood method as a nonparametric technique for constructing confidence regions in the nonparametric setting has been introduced by Owen (1988,1990). These confidence regions have sampling prop108
109
erties similar to those based on the bootstrap. But instead of resampling, the empirical likelihood method works by profiling a multinomial likelihood supported on the sample points. Chen and Hall (1993) constructed smoothed confidence intervals for a quantile by using the empirical likelihood. Qin (1994) studied the construction of semi-empirical likelihood ratio confidence intervals for the difference of two sample means. Some related work in regard to comparing the differences of two populations can be found in Jing (1995), Qin (1997), and Qin and Zhao (1997), among others. It is noted that empirical likelihood and its related properties have been well studied in Owen (1988,1990,1991), Hall (1990), Diciccio (1991), Qin and Lawless (1994), among others. One approach to determine a confidence interval for A is based on the asymptotic distribution of the difference of two sample quantile estimates for F^1(qi) and F~x{q2). However, it involves the unknown values of the probability density at the two population quantiles. As a result, this method has seldom been used in practice. In this paper, by employing the empirical likelihood method, an extension of Wilks' theorem for the empirical likelihood ratio statistic is developed, which is then used to construct a confidence interval for the A with asymptotically correct coverage. The paper is organized as follows. Section 2 states the main results. Some simulation results are presented in Section 3. The main results are proved in Section 4. In the sequel, we will use —> to denote convergence in distribution, 11 • 11 to stand for the Euclidean norm.
2. M a i n R e s u l t s Let K(-) be a kernel function and h = hn(h —> 0 as n —> oo) be a bandwidth. Denote Kh{-) = K(-/h), G{t) = fJ^K^du, u>x{x,8) = G{6 - x) qx, w2(x,0,A) = G(9+A-x)-q2,&ndu)(x,9,A) = {UJI(X,9),UJ2(X,9,A))', where 6* denotes the transpose of a vector b. Similar to Qin and Lawless (1994) and Chen and Hall (1993), we define the following profile empirical likelihood ratio statistic:
R(A,6) =
sup
Y[{nPi),
pi,-,P»i=1
110
where p\, • • • ,pn are subject to the following restrictions: n
P i > o , ] [ > = i,
(l)
i=l
^2piu3(Xl,9,A)=0.
(2)
i=l
The log-version of R(A, 9) is given as follows: n
H{A,9) =
sup
^log(npi),
pii
where Pi, • • • ,p„ are subject to restrictions Eq. (1) and Eq. (2). Note that H(A, 0) may be found via Lagrange multipliers. It can be shown that the optimal value for pt is given by
where rj(9) is the solution of the following equation:
"(*"g'A> f(0)u(Xi,e,A)
I V 2=1
=0
(3)
Hence,
H(A, e) = -J2 iog(i + v'VMXi, e, A)) i=l
Denote jtfXi, 61, A) = / i -
1
^ ^ - Xi), Kh(6 + A-
Xi)Y.
Let dH(A, 6) j'89 = 0. We have the following empirical likelihood equations:
itY^m^m9'{X,Mm^
<4)
Assume the true value of 9 is 9Q. Let
a = (f(90),f(90
+
A)y,
V0=(^-
^ - ^ X
\ 9 i ( i - 9 2 ) 92(1-92) y We make the following assumptions: (i) f(6o)f{9o + A) > 0. For some m > 2, /( m ~ 1 ) exist in some neighbourhoods of #0 and #0 + A, and / and / ( m _ 1 ) are continuous at #0 and 90 + A.
Ill
(ii) K is a bounded Borel measurable function with a compact support; K' and K" exist and are bounded; and K satisfies the following condition: = | 1, 3 = 0, 0, 1 < j < m - 1.
fujK(u)du
(iii) nh —> oo and nh2m —> 0 as n —• oo. Condition (i) is basically the same as the condition on the density function in Chen and Hall (1993). The larger the m in Condition (i), the weaker is the Condition (iii) on h. We now state our main results. Theorem 2.1. Suppose that Conditions (i) to (iii) hold and 1/3 < r < 1/2. Then as n —> oo, with probability tending to 1, H(A,6) (as a function of 6) attains its maximum value at OE, and 6E, as a root of equations Eq. (4), is in the interior of the interval \9 — 6Q\ < n~r. Moreover, (aX-'a)-1),
Vn~(8E - e0)^N(0, and V^r](eE)-^N(0,V0-1{I2
{a'V^ay^aa'V^1}).
-
From Theorem 2.1 we can get a by-product that the asymptotic variance of 6E is less than that of the usual sample <7ith quantile, i.e., 9i(l ~ 9i)// 2 (#o)- It can be justified as follows: Note that aty a
^
= -71
rtet 1 - ^)f\e0)
\?
9i (1 - 9 2 X 9 2 - 9 1 ) - 2 ? i ( l " q2)f(0o)f(0O + A) + qi(l - qi)f2(90
+ A)}
> —r, ^7 7 [92(1 - q2)f((>o) <7i(l-92)(<72-9i) - 9 i ( l - 92){/ 2 (0o) + /*(0o + A)} + —f, ^7 -M2 - 9i)/ 2 (0o). 9i(l -92j(92 - 9 i ) It follows that tT/
-l
N-l ^ 9 l ( l
-9l)
P(0o)
112
Actually, 6E can not be used as an estimate for 9 if A is unknown. However, this result does imply that the empirical likelihood method can be applied to make more accurate statistical inference by employing auxiliary information efficiently. Some related work can be found in Chen and Qin (1993) and Zhang (1995), among others. T h e o r e m 2.2. Suppose that Conditions (i) to (Hi) hold and 6E is as given in Theorem 2.1. Then —2\ogR{&.,QE)—>x?
as
n
—* °°i
where n
R(A,6E) = l[(npi(eE)), t=l
PJ(^E) =
-7-—tlo
,
—77,
l v a n 1+ T7*(6»E)w(X i, 6E, A)
i = !,-••
,n.
By Theorem 2.2, a 100(1 — a)% confidence interval for A can be constructed by Ja = {A:-21ogfl(A,0E)
P{A€la}
= l-a
+ o(l).
3. A Simulation S t u d y In our simulation study, we intend to illustrate the coverage accuracy of the Ia given in Section 2. We generate two sets of data from the standard normal distribution N(Q, 1) and chi-squared distribution with two degrees of freedom xii respectively. For both cases, we take q\ = 0.025 and q^ = 0.975. The sample size n ranges from 80 to 200. For each sample, 90% and 95% empirical likelihood confidence intervals for the difference A of two quantiles are computed. We let K{u) = y | ( l — u 2 ) 2 7(|u| < 1), which satisfies Condition (ii) in Section 2, and choose the smoothing bandwidth h = n - 1 / 4 ( l o g 7 i ) - 1 / 2 . As argued in Chen and Hall (1993), the choice of h is not generally settled, but choices of h in some range as indicated in the above paper generally provide quite good coverage accuracy. We here report the average estimated coverage values based on 1,000 replications.
113
We also report the average estimated coverage values for 1,000 replications based on the asymptotic distribution of the difference of two sample quantile estimates for F~1(qi) and F~1(q2). This can be done by using Theorem B in Serfling (1980) (page 80) with the probability density function being estimated through the ordinary kernel method. We call this method as EQ method, and the empirical likelihood method discussed in this paper as EL method. It can be seen from the simulation results that all empirical coverage levels are close to the nominal levels when the sample size n is moderately large, and the EL method performs much better than the EQ method. Table 1. n l-a=90%(EL) l-a=90%(EQ) l-a=95%(EL) l-a=95%(EQ)
Table 2. n l-a=90%(EL) 1-Q=90%(EQ) 1-Q=95%(EL)
l-a=95%(EQ)
X ~ JV(0,1), qi = 0.025, q2 = 0.975, A = 1.96 100 .8756 .8301 .9324 .9101
80 .8624 .8234 .9235 .8935
X ~ x|, 80 .8553 .7785 .9132 .8288
120 .8867 .8355 .9367 .9175
160 .8908 .8405 .9433 .9257
140 .8872 .8373 .9421 .9203
180 .8929 .8511 .9446 .9267
200 .8931 .8562 .9452 .9285
91 = 0.025, q2 = 0.975, A = 7.3274 100 .8604 .7812 .9223 .8302
120 .8655 .8043 .9266 .8345
160 .8732 .8233 .9333 .8572
140 .8713 .8124 .9300 .8383
180 .8751 .8210 .9376 .8601
200 .8766 .8313 .9384 .8675
4. Proofs of Theorems 2.1 and 2.2 The main purpose of this section is to prove Theorems 2.1 and 2.2. First, we need some introductory lemmas. Lemma 4.1. Suppose that Conditions (i) and (ii) hold and r > 0. Then as n —> oo,
r1(9) = Op(n-r
+ hm +
n-1'2),
uniformly for 9 e {6 : \9 - 90\ < n~r}. Proof.
Denote
u{9)
=
^T,7=1^(Xi,9,A),
LJ* = maxi^nllwp^fl.A)!!, Vn(9) = i E , # , 9 , A y ( I „ « , A ) . Let
114
TJ(0) = p(9)a(6), where p{9) > 0 and \\a{9)\\ = 1. Similar to the proof of Theorem 2 in Owen (1991), we have 0 < <xt{9)u,{9) -
P 1 +
^l)u.
min{A!(V n (0)), A2(V„(0))},
(5)
where Ai(^4) and A2(^4) denote the eigenvalues of a 2 x 2 matrix A respectively. It may be shown that w(0) = u>(0o) + Op(n-r),
Vn{9) = Vn(60) + o p (l),
Eu,(90) = 0 ( / i m ) , w(0 o ) = O p (/i m + n " 1 / 2 ) , and Vn(90) = V0 + op(l). Therefore, u,(6») = O p ( n - r + / i m + n - 1 / 2 ) . Combining with Eq. (5), it follows that p{6) = Op{n-T +
hm+n-1'2).
This completes the proof of Lemma 4.1 . Lemma 4.2. Suppose that Conditions (i) to (Hi) hold and 1/3 < r < 1/2. Then as n —> oo, with probability tending to 1, H{A, 9) ( as a function of 9 ) attains its maximum value at some point 9E, and 9E, as a root of equations Eq. (4), is in the interior of the interval \9 — 9$\ < n~r. Proof.
Let ji{9) = T}t(9)uj(Xi, 9, A). It follows from Lemma 4.1 that max \-fi(8)\ = OJn-r
+ hm +
n~1'2).
l
From Eq. (3), 0 = T Ju>(Xi, 9, A) - - V u(Xi, 9, A^iXi, n ^— n •*—' i=l
n{^
9, A)v{9)
«=1
l+1f(9)
'
It follows that
0 = -Y^LjiXiA n
i=i
A) - -J2^(Xi,0, n
,=i
+ O p { ( n " r + /i m + / i - 1 / 2 ) 2 } .
Ay(X„9,A)))(«)
115
Therefore, 7,(0) = V-l{6)G>(6) + Op{(n-r
+ hm+ n-1'2)2}.
(6)
By (6) and Taylor's expansion, we can show that n
n
.
O, A)}2
-H(A, 9) = J2 V^MXi, 6, A) - - Y.tfVMXi, t=l
1=1
+ Op{n(n-r +
hm+n-1/2f}.
Noting that 1 — 2r > 0, we have our result by following the proof of Lemma 1 in Qin and Lawless (1994). We now return to the proof of main results. Proof of Theorem 2.1 Let 6E as defined in Lemma 4.2. Let rj = r](6), VE = V(0E),
and
By Lemma 4.2, we have Qin(9E,riE)
= 0, » = 1,2.
In view of Lemmas 4.1 and 4.2, by applying Taylor's expansion, we can show that 0=
Qin{0E,r)E) dQ e o ^ e '°\eE-0o)
= Qm(0o,o) + 3Qm(0o,O) + aUiiZr)r,E
+ 0P{(n~r
+ h™ + n~V*)%
i = 1,2.
It may be shown that dQin(0o,0) 06
>0
dQ2n(e0,0)
—m—
'°-fl=0
dQin(Oo,0) — a ^ dQ2n(60,0)
' —a^
T/
>-Voa.s., t
>a a s
--
Hence, it follows that
{eE-e0)=s~1
(~Qlf°'0))+oP{(n-r
+ h™ + n-^f }.
(7)
116
where
It is easy to obtain that (a^arWo"1
^
(o^o"1")-1
J'
Since yfiQm(6o,0)-?-*N(0,Vo), Theorem 2.1 then follows from Eq. (7), 1/3 < r < 1/2 and Condition (hi). Proof of Theorem 2.2 By Lemma 4.1 and Taylor's expansion, we can show that \ogR(A,9E)
= -nr1t{eE)Q{eE) + Op{n{n~r
+
^\B E )V n (8> E ) n {6 E )
+ hm +
n'1'2)3},
where u>(0E) and Vn(6E) are defined in the proof of Lemma 4.1. Denote A = VQ1{I2 - ( a V o ^ c O - W V o - 1 } , In view of Eq. (6) and Eq. (7), it follows that -21ogi?(A,0 B ) = {V^Vt(0E))Vn{eE)(V^r1{eE)) = (yftQin(6o,
0))tAtVn(eE)A(^Qm(0o,
+ Op{n(n-r
0)) + Op{n(n-r
+
hm+n-
+ hm + n " 1 / 2 ) 3 }
(V^V-1/2Qln(e0,0))X/2AtVn(6E)AV01/2(V^V-1/2Qln(80,0))
= +Op{n{n-r
+ hm +
n-1/2)3}.
Moreover, Vn(0E) = Vo + op(l), V^V0-1/2Qln(60,0)^N(0,1). It is easy to see that V0 A'VoAVJ) is symmetric and idempotent, with trace 1. Theorem 2.2 then follows from 1/3 < r < 1/2 and Condition (iii). Acknowledgments We wish to thank Professor Art B. Owen for introducing us to this problem. The research was partially supported by the Natural Sciences and Engineering Research Council of Canada.
117
References 1. Amaratunga, D. (1997), Reference ranges for screening preclinical drug safety data. Journal of Biopharmaceutical Statistics. 7, 417-422. 2. Chen, J. and Qin, J. (1993). Empirical likelihood estimation for finite population and the effective usage of auxiliary information. Biometrika 80 , 107-116. 3. Chen, S. X. and Hall, P. (1993). Smoothed empirical likelihood confidence intervals for quantiles. Ann. Statist. 2 1 , 1166-1181. 4. Diciccio, T. J., Hall, P. and Romano, J. P. (1991). Bartlett adjustment for empirical likelihood. Ann. Statist. 19, 1053-1061. 5. Hall, P. and La Scala, B. (1990). Methodology and algorithms of empirical likelihood. Internat. Statist. Rev. 58, 109-127. 6. Jing, B.-Y. (1995). Two-sample empirical likelihood method. Statistics and Probability Letters. 24, 315-319. 7. Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75, 237-249. 8. Owen, A. B. (1990). Empirical likelihood confidence regions. Ann. Statist. 18, 90-120. 9. Owen, A. B. (1991). Empirical likelihood for linear models. Ann. Statist. 19, 1725-1747. 10. Qin, J. (1994). Semi-empirical likelihood ratio confidence intervals for the difference of two sample means, Ann. Inst. Statist. Math. 46, 117-126. 11. Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations. Ann. Statist. 22, 300-325. 12. Qin, Y. S. (1997). Semi-parametric likelihood ratio confidence intervals for various differences of two populations. Statistics and Probability Letters. 33, 135-143. 13. Qin, Y. S. and Zhao, L. C. (1997). Empirical likelihood ratio statistics for the difference of two population means. Chinese Ann. Math. 18, 687-694. 14. R. J. Serfling, Approximation Theorems of Mathematical Statistics (John Wiley & Sons, New York, 1980). 15. Wade, A. M. and Ades, A. E. (1994). Age-related reference ranges: significance tests for models and confidence intervals for centiles. Statistics in Medicine. 13, 2359-2367. 16. Wade, A. M. and Ades, A. E. (1998). Incorporating correlations between measurements into the estimation of age-related reference ranges. Statist. Med. 17, 1989-2002. 17. Zhang, B. (1995). M-estimation and quantile estimation in the presence of auxiliary information. J. Statistical Planning and Inference. 44, 77-94.
E X P O N E N T I A L INEQUALITIES FOR SPATIAL PROCESSES A N D U N I F O R M C O N V E R G E N C E RATES FOR D E N S I T Y ESTIMATION
QIWEI YAO Department
of Statistics, London School of Economics and Political Science, Houghton Street, London, WC2A 2AE, UK Department of Business Statistics and Econometrics, Guanghua School of Management, Peking University, 100871, Beijing, China
We establish an exponential type of inequality for a-mixing spatial processes. Based on it, an optimum convergence rate of a kernel density estimator for stationary spatial processes is obtained. Its asymptotic mean and variance are also derived.
1. Introduction The classical asymptotic theory in statistics is built upon central limit theorems and laws of large numbers for the sequences of independent random variables. In the study of the asymptotic properties for linear time series which are sequences of dependent random variables, the conventional approach is to express a time series in terms of an moving average form in which the white noise is assumed to be independent. This idea also prevails for linear spatial processes; see Yao and Brockwell (2001) and Hallin, Lu and Tran (2002). Unfortunately the moving average representation becomes irrelevant in the context of nonlinear processes for which more complicated dependence structure is encountered and certain asymptotic independence, typically represented by a mixing condition, will be characterised. We refer to §2.6 of Fan and Yao (2003) for an introductory survey on the mixing conditions for time series. A mixing spatial process may be viewed as a process for which random variables from distant locations are nearly independent; see §1.3.1 of Doukhan (1994). The goal of this paper is two-fold. First we establish an exponential type inequality for a-mixing spatial processes. Exponential type inequalities are powerful tools in estimating tail probabilities of partial sums of random sequences; see, for example, §1.4 of Bosq (1998). They imply weak 118
119
laws of large numbers for mixing processes, and give sharp large deviation estimations. Note that the stationarity is not required. Further, based on an established exponential inequality, we derive an optimum uniform convergence rate for density estimators of stationary spatial processes. The application of those results in additive modelling for spatial processes will be reported elsewhere. Nonparametric kernel estimation for spatial processes is still in its infant stage. A limited references include Diggle (1985), Diggle and Marron (1988), Hallin, Lu and Tran (2002), and Zhang, Yao, Tong and Stenseth (2002). Related to current work, Hallin et al. established the asymptotic normality for density estimators for linear spatial processes. 2. Exponential Type Inequalities 2.1. a-mixing
Processes
Let {X(s)} be a real-valued spatial process indexed by s = (u,v) € Z2, where Z consists of all integers. Since the index s is two dimensional, {X(s)} is also called a random field. For any A C Z2, let T{A) denote the (T-algebra generated by {^"(s), s G A). We write |^4| for the number of elements in A. For any A,B c Z2, define a(A, B) =
sup
\P(UV) -
P(U)P(V)\,
and d(A,B)
= min{||si - s 2 | | | si e A, s 2 e B}.
where 11 • 11 denotes the Euclidean norm. Now the a-mixing coefficient for the process {X(s)} is defined as a(k;i,j)
=
sup
{a(A,B)\\A\k},
(1)
A,BCZ2
where i,j,k are positive integers and i,j may take values of infinite. The process {X(s)} is a-mixing (or strong mixing) if a(k;i,j) -» 0 as fc -> oo. A typical choice is i = j = oo. Note that a(k;i,j) is monotonically non-decreasing as a function of i or j . Hence a(fc;oo, oo) —> 0 implies a(k; i,j) —• 0 for any values of i and j . In the context of stochastic processes with single index, a-mixing is the weakest among the most frequently used mixing conditions (§2.6 of Fan and Yao 2003). Such a simple statement is no longer pertinent for spatial processes. For example if we let i = j = oo, a-mixing is equivalent
120
to p-mixing, and further, /3-mixing reduces to simple m-dependence; see §1.3.1 of Doukhan (1994) and references within. Further in contrast to single-indexed processes, we need to choose appropriate values for i and j according to the nature of a spatial problem in hand. For example, the coefficient a(fc;oo,oo) is not useful for Gibbs fields (Dobrushin 1968).
Remark 2.1. If {X(s)} is a causal and invertible ARMA process (Whittle 1954) defined in terms an independent and identically distributed white noise sequence {e(s)}, JE|e(s)|<5 < oo for some 5 > 2, and density function pe of e(s) satisfies the condition that \pe(x + z) - p£(x)\dx
< C\z\,
zeR,
it follows from Corollary 1.7.3 of Guyon (1995) and Lemma 1 of Yao and Brockwell (2001) that a(k;i,j)
i,j,k>l,
(2)
where C > 0, p £ (0,1) are constants independent of i,j and k.
2.2. Exponential
Inequalities
Suppose we have observations {X(u, v); u = 1,- • • ,Ni, v = l , - - - , ^ } . Let JV = N\N2- Define the partial sum
v=lu=\
Theorem 2.1 below presents upper bounds for the tail probabilities of |5jv|, which resembles the exponential type inequalities for single-indexed stochastic processes; see Theorem 1.3 of Bosq (1998). We introduce some notation first. Define an auxiliary continuous-indexed process £(ti,t 2 ) = A-([ti],[t 2 ]),
(h,t2)eR2,
where [t] denotes the integer part of t. For an integer q between 1 and JVi ATVi = min{ATi,iV"2}, let pi = Ni/(2q) (i = 1,2). For i,j = !,••• ,q,
121
define /-(2»-l)pi
A2j-l)pi
VIP = / ,(2)
_
dh /
1)P2 J2(j-1)
J2(i-l)pi r(2i-i)pi /•(•"-i-lPi
//
r*3P2
*r = J2(i-l)p
d«i *i /
Z{ti,t2)dt2,
*i /
S(h,t2)dt2,
J(2j-l)p2 /"(2j-i)p2
1
r2ipi y
d
i? = / J(2i-l)Pl
W
£(h,t2)dt2,
J2(J-i)P2 dtl
=
/
J(2i-l)pi
C(«l,*2)d*2J(2j-l)p2
It is easy to see that (3)
For some constant e, K > 0, let 32g4
,2
Ke
l
Now we are ready to present the theorem. Theorem 2.1. Let {X(s)}
be a zero-mean spatial process with
p j s u p |X(s)| < if i = 1. Then for any integer q between 1 and N\ A N\ and s > 0, it holds that -2„1-
P(\SN\>Ns)
< 8exp(-|^) q2 a
+ 44 1 +
(4)
No
2q
A [2q
N
VJ'
N
and 2 2
P(\SN\>Ne)
< 1/2
44
('•")
q2 a
8exp[-^LMQY N1 ~N2
\ ] A 2q
2q
(5) ' TV "
) 4g2
N
122
Proof. The key idea of this proof is to divide the rectangular {(u,v) : 1 < u < TVi, 1 < v < N2} into 4<72 small blocks (see (3)), and then apply Bradley's coupling lemma to replace the sums of random variables on those blocks by their independent counterparts. The required inequalities (4) and (5) then follow from Hoeffding's inequality and Bernstein's inequality respectively. We outline the main steps of the proof as follows. Let p=plP2 = N/(4q2), 8 = l+e/(2K), 0 = mm{Ne/(8q2), (5-l)Kp}, and c = 5Kp. Then \V^ + c\ > c - \V^\ > (6 - \)Kp almost surely. Applying Bradley's coupling lemma (see, e.g. Lemma 1.2 of Bosq 1998) recursively, we may define a sequence of independent random variables {W/- } such that Wj
PQW,T -
and Vi • share the same marginal distribution, and further
v,<'»i > « < 11 ( m l n W '( 8 ;,;^i 1 ) K p } )*° ( i p i 1 A W ; w ' w ) 1/2
11(1 + ^ )
a([pi]Ab2];b],A0.
Since {W^ } are independent, it follows from Hoeffding's inequality (Theorem 1.2 of Bosq 1998) that
(
*
2exp
N2e2
\
(rm^vW
n
= 2exp
(
Ns2
V^io)
\
• (6)
123
Combining the above two inequalities, we have that (i)
(i) ij
>
> X'
|y
^ ~ WP] -p
for a11
*•j I
+ ?2F(|
^ 1) - <•'I > ^
>^-^)+92p(|v;;i)-<)i>/3)
< P *.J
< P
£<
(i)
>-«-
)| + ^ P ( l ^(!)- WU^/ (|! > /3)
»>J
< 2 exp
-
2
8K J
1/2
4K 11 1 + —
9 2 a([ Pl ]Ab2];[p],^V)-
It is easy to see that the above inequality also holds for {V^ } for I = 2, 3,4. Now (4) follows immediately from the relation
P(\SN\>Ne)
£^
>
ATe
l
,J
which is implied by (3). Inequality (5) may be proved in the same manner with (6) replaced by
5><
(i)
Ne , > — < 2 exp < 2 exp
e*N2/64 r(l)
^ZliE(Wi;i)^
+ 2PKNe/8/
eW
which is guaranteed by Bernstein's inequality (Theorem 1.2 of Bosq 1998). 3. Density Estimation for Spatial Processes 3.1. Estimators
and Regularity
Conditions
We assume now that the process {X(s)} is strictly stationary with marginal density function /(•). The kernel estimator for / is defined as /(*) = T ; £ £ ^ { X ( « , u=lv=l
«)-*},
(7)
124
where Wh(-) — h~1W(-/h), W(-) is a probability density function defined in R and h > 0 is a bandwidth. We introduce some regularity conditions first. We use C to denote some positive generic constant which may be different at different places. (CI) As N = N1N2 —» oo, Ni/N2 converges to a positive and finite constant. (C2) As N -> oo, h - • 0 and N/3-5h0+5(logN)-«3+11 -> oo, where /? > 5 is a constant. (C3) The kernel function W(-) is bounded, symmetric and Lipschitz continuous. (C4) The density function /(•) has continuous second derivative /(•). Further the joint density function {X(u, v),X(u + i, v + j)} is bounded by a constant independent of (i,j). (C5) It holds that a(k; k',j) < CkT0 for any k,j and k' = 2 0{k ). Conditions (C2) - (C4) are standard in kernel estimation. For optimum bandwidth h = 0(N-^5), (C2) requires /3 > 7.5. For causal and invertible ARMA processes satisfying conditions in Remark 2.1, a(k;k', oo) decays at an exponential rate as k —+ oo. Therefore, condition (C5) fulfils for any /?>0. Remark 3.1. We assume in this paper that the observations were taken from a rectangular. This assumption can be relaxed. In fact Proposition 3.1 and Theorem 3.1 below still hold if the observations were taken over a connected region in Z2, and both minimal length of the side of the squares containing the region and the maximal length of side of the squares contained in the region converge to infinite at the same rate. For general discussion on the condition of sampling sets, we refer to Perera (2001). 3.2. Asymptotic
Means and
Variances
Proposition 3.1. Let conditions (CI) and (C3) — (C5) hold with /3 > 4. Then for h —• 0 and Nh —> oo as N —» oo, it holds that + o(h2),
(8)
Var{/(a:)} = ^ - / ( x ) j W(u)2du + o(7V" 1 / l - 1 ).
(9)
E{f(x)}
= f[x) + \h2f{x)
J u2W(u)du
and
125
Proof. Equation (8) follows from simple algebraic manipulation. Put ZUv = Wh{X{u,v)-x). Then, Var{/(x)} = -^Var(Zn) + ±
£
Cov(Zuv,
ZtJ).
The first term on the RHS of the above expression is equal to the RHS of (9). We only need to prove the second term is of the order o{-^). To this end, note that for (u, v) ^ (i, j), \Cov(Zuv,Zij)\
< Ci,
where C\ > 0 is a constant independent of u,v,i,j. Define a unilateral order in Z2 as follows: (u, v) > 0 if either u > 0 or u — 0 and v > 0, and further (u,v) > (i,j) if (u — i,v — j) > 0. Let Sn = {(u,v) : 1 < u < Ni, 1 < v < N2}. Then 1
—
9
^2
\Cov(Zuv,Zi:j)\
=—
(u,v)^(i,j)
]T
\Cov(Zuv,Zi:j)\
(u,v)<(i,j)
o =
Jj2
Z^ (u,v)eSN
Z-,
2
~ N2
^
(u,v)€Sp,
\Cov(Zuv,Zu+itV+j)\
(i.3)>0 (»+>,«+))6^
5Z
+ 5Z >|Cov(Zu„,Z„ +i,v+j
(i,j»0, i2+32
= 0(T2/iV)+^
^
)
i2+J'2>r2
^
|Cov(Z u l ( ) Z u + i > „ + i )|,
(10)
(u,i;)65Ni2+j2>r2
where T = T(N) > 0 is a constant. By Billingsley's inequality (see, e.g. Corollary 1.1 of Bosq 1998) and condition (C5), \Cov(Zuv,Zu+i,v+j)\
Nh2TP-2 4^ 2 v r 2 ;
r2
V^/i2r^-v
"{NhJ'
provided T = / i " 2 ^ for /3 > 4. This also ensures O ( ^ ) = o ( ^ ) ; see (10). The proof is completed now.
126
3.3. Optimum
Uniform
Convergence
Rates
Theorem 3.1. Under conditions (CI) - (C5), it holds that for any finite a
f{x) - Ef(x)
sup
log AT
Or,
1/21
(11)
. Nh
xe[a,b]
Further, sup x£[a,6]
/(*) - m - < M l ^ V + *
(12)
Remark 3.2. Theorem 3.1 presents the uniform convergence rates for the kernel estimator /(•)• For h = 0{(logiV/./V) 1/5 }, (12) admits the form
/(*) - / ( * )
sup xG[o,6]
.cUl'^)""' N
which is the optimal convergence rate according to Hasminskii (1978). The similar results for single-indexed processes may be found in, for example, Masry (1996) and Theorem 5.3 of Fan and Yao (2003). Proof of Theorem 3.1 . Write the N observations as X(si), • • • , X(SN)Partition [a, b] into L subintervals {Ij} of equal length. Let Xj be the centre of Ij. Since |/(x) - f(x')\
1 N < - £ \Wh{X(Sj)
-x}-
Wh{X(Sj)
C - x'}\ <-r\x-
x'\,
J=I
it holds that \Ef(x) - Ef(x')\
< f \x - x'\. Hence
sup \f[x) - Ef(x)\
< \f(Xj)
- Ef(xj)\
+ ^ .
Therefore, sup \f(x) - Ef(x)\ Since \Wh{-)\ < Ch'1,
< max \f(Xj)
- Ef(xj)\
+ •£-.
(13)
it follows from (5) that
P{\f(x)-Ef{x)\>e}
< 8exp
eV 8i/(«) 2
/2 ( 4 C \ 11/2 2 + 44 ( 1 + — J 9 a([p!] A [p2]; [Pip2},N),
(14)
127
where Pi = Ni/(2q). Let q = [ ^ ( i V j AN2)}. By (9), u(q)2 < C/(plP2h) + for s o m e lar e Ce/h < Ce/h. Now let e2 = (^N^h S constant a > 0. It is easy t o see t h a t
6XP
{-tow) * 6XP {-
8C
) =N •
(15)
On t h e other hand, condition (C5) entails t h a t (eftJ-^VadpijAlpaljIpipa],^) 12
= O^^^NhT ' ) Let L = (N/h)1/2.
=
< C{eh)-l'2q2{Pl
Ap2)^
(16)
3 4
0{AT-' / +3/4/l-^4-3/4(logiV)'3/4+1/4}
It follows from (14) - (16) and condition (C2) t h a t
P{max\f(xj)-Ef(xj)\>e} < L{N~a
+ CN-V4+3/4h-V4-V4{\og
AT)^4+1/4} _, 0.
Note t h a t e = 0 { ( ^ ) 1 / 2 } . Now (11) follows from (13) immediately. Note t h a t sup
f(x)-f(x)
xe[a,b]
< sup xS[o,6]
f(x)-Ef(x)\+
sup
\Ef(x)-f(x)\
xG[a,b] '
and t h e second t e r m on t h e RHS of t h e above expression is non-random. Simple algebraic manipulation shows t h a t it is of the order h2 under t h e condition t h a t W is symmetric and / has two continuous derivatives. Now (12) follows from (11). Acknowledgements T h e research was partially supported by a Leverhulme Trust grant. References 1. Bosq, D. (1998). Nonparametric Statistics for Stochastic Processes (2nd edition). Springer, New York. 2. Diggle, P.J. (1985). A kernel method for smoothing point process data. Appl. Statist. 34, 138-147. 3. Diggle, P.J. and Marron, J.S. (1988). Equivalence of smoothing parameter selectors in density and intensity estimation. J. Amer. Statist. Assoc. 83, 793-800. 4. Dobrushin, R.L. (1968). The description of the random field by its conditional distribution. Theory Probab. Appl. 13, 201-229. 5. Doukhan, P. (1994). Mixing. Springer-Verlag, New York. 6. Fan, J. and Yao, Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, New York.
128
7. Guyon, X. (1995). Random Fields on a Network: Modeling, Statistics, and Application. Springer-Verlag, New York. 8. Hallin, M., Lu, Z. and Tran, L.T. (2001). Density estimation for linear processes. Bernoulli 7, 657-668. 9. Hasminskii, R.Z. (1978). A lower bound on the risks of nonparametric estimates densities in the uniform metric. Theory Prob. Appl. 23, 794-798. 10. Masry, E. (1996). Multivariate local polynomial regression for time series: uniform strong consistency and rates. J. Time Series Analy. 17, 571-559. 11. Perera, G. (2001). Random fields on Z : limit theorems and irregular sets. In Spatial Statistics: Methodological Aspects and Applications, M. Moore (edit.), p.57-82. Springer, New York. 12. Whittle, P. (1954). On stationary processes in the plane. Biometrika 4 1 , 434-449. 13. Yao, Q. and Brockwell, P.J. (2001). Gaussian maximum likelihood estimation for ARMA models II: spatial processes. A preprint. 14. Zhang, W., Yao, Q., Tong, H. and Stenseth, N.C. (2002). Smoothing for spatio-temporal models and its application in modelling muskrat-mink interaction. A preprint.
A S K E W R E G R E S S I O N MODEL FOR I N F E R E N C E OF STOCK VOLATILITY
T U H A O J. C H E N A N D H A N F E N G C H E N Department
of Mathematics and Statistics, Bowling Green State Bowling Green, Ohio 43403, USA.
University,
The relation between log return of a stock and its market proxy is usually modeled by a normal regression model. However, the distribution of the random error is often not normal, but rather skewed. This paper proposes a skew normal model for inference of stock volatility. Distribution theory of the skew normal distribution family is reviewed and developed. Statistical inference procedures for the volatility parameter under the skew normal regression model are outlined. In particular, the maximum likelihood estimate is discussed in details.
1. Introduction According to Sharpe (1963, 1964), Fama (1965) and Hsu (1979), in stock portfolio analysis there is a relation between log returns of a particular stock and its market proxy. One of the frequently used models is the Gaussian (normal) regression model. Let Pt be the price of a stock of interest at time t. Define Yt to be the log return from time t — 1 tot, i.e., Yt = \og(Pt/Pt_i). Let Mt be the price of the overall market proxy, and let Xt = log(M t /Mt_i). The oft-used Gaussian regression model is:
Yt = a + Pxt+et,
(1)
where et, t = l , - - ,n are assumed to be independently distributed as iV(0,cr2). Here xt is the realization of Xt at time t. In the model (1), the parameter /3 is a measure of volatility of the specific stock explained by the overall market. It is endowed with the following interpretation. Suppose that the absolute value of j3 is greater than 1. When the market changes (rises or falls), the return of the security will change more rapidly (rise or fall corresponding to the sign of j3) and thus the stock is more risky than its market. If the absolute value of (3 is smaller than 1, the change of return does not follow a market change at the same pace and the stock is more stable. Accordingly, j3 as a measure of risk of a security, is 129
130
frequently used as an index to evaluate a portfolio of investment. Reliability and validation of inference of /3 relies on the normality assumption of the random error term et- In many situations, however, reality can departure far from the normality assumption. For example, in a bear market, a stock is highly likely to be over-sold at a time, so that the random error would be more likely on the negative side than on the positive side and hence follow a negatively skewed distribution, rather than a symmetric one. It is also easy to understand that various major events, such as a sudden social/political change, big company-specific news/rumors and earning surprises, may also contribute to the skewness of the distribution of the random error. These catastrophic events are totally unpredictable and can occur at a random time and anytime in random manner and greatness. Whenever one of such events occurs, it could cause over-sold for a stock (or over-bought in case of a positive event). JPMorgan (1995, page 47) points out that the random error distribution is often negatively skewed and tends to have fat tails. Surprisingly, it does not appear to have any research papers in the literature to discuss this problem with interpretation bridging the normal and fat tail financial models. We believe that part of the reasons for this is the lack of fundamental theory to describe the external interference on the intractable market in stock exchanges. In this article, we explore and propose a skew regression model for inference of stock volatility. The paper is organized as follows. A detailed description of the skew normal regression model is given in Section 2. Statistical inference procedures for the volatility parameter (3 are developed and discussed in Section 3. 2. A Skew Normal Regression Model We proceed to give a brief account on skew normal distributions. 2.1. Skew normal
distributions
According to Azzalini (1985), a random variable Z is distributed as a skew normal distribution with parameter A, if Z has the probability density function (pdf) (f)(z; A) = 2
- o o < z < oo,
(2)
where <j>(z) and $(z) are the pdf and the CDF (cumulative distribution function) of standard normal N(0,1), respectively.
131
Azzalini (1985) discusses the skew normal distribution family SN(X). A skew normal random variable Z retains many statistical properties of the standard normal distribution. For example, if Z follows a skew normal distribution, then Z2 ~ Xi, for any A. Compared with the normal distribution family, the skew normal distribution family, however, features the capability of inferring a underlying distribution which is suspected to be skewed. If A > 0, the skew normal distribution is skewed to the right; if A < 0, it is skewed to the left. Thus, the parameter A serves as another dimension to shape the underlying population distribution. When A = 0, the pdf (2) reduces to the pdf of the standard normal random variable. A scale parameter a2 can also be included in the skew normal distribution family to make the family even more flexible as follows: f(z; A, a 2 ) = 2(z/(T)*(\z/a). Let the distribution of f(z;X,a2) be denoted by SN(X,a2), and let T = {SN(X,a2) : - c o < A < oo,<7 > 0}. The parameter A in SN(X,a2) is called the skew parameter and a2 the scale parameter. Let Z ~ SN(X,cr2). The moment generating function of Z has an explicit form as follows: M(t) = 2exp{t2a2/2)${Xat/(l
+ A 2 ) 1 / 2 }.
(3)
From this, it is easy to obtain the first two moments of Z: E(Z) = a(2/ 7 r) 1 / 2 A/(l + A 2 ) 1 / 2 E{Z2}
= a2.
Some constructive stochastic representations given below for a skew normal random variable shed light on insight of a skew distribution that shows how the skew normal random errors occur and what would be the possible sources so that the model can be used to fit skew data appropriately. Proposition 2.1. Let £i ~ N(0,a2) and £2 ~ N(0, 0, Z = £i + £2 is distributed as a skew normal distribution with the skew parameter X = —cri/0-2 and the scale parameter a2 = a\-\-a\., i.e., Z ~SN(X,a2). Proof. For any y, P(Z0)
=
2P(Z0) />oo (
= 2/
II
ry+v
-\
(27r)- 1 /V 2 - 1 e-" 2 /(2<, 2 2 ) d u l x
{2^)-1'2a^e-v2'^Uv.
132
Hence the pdf f(y) of Z, given £1 > 0, is f(y) = 2 / Jo
(2TT)- 1 / 2 ( 7 2 - 1
exp{-(j/ + v)2 l{2a2)}{2^l2o-^
exp{-v2/(2a2)}dv
/•OO
= (n^y1
/ exp{-(2o-2)- 1 ( 2/ 2 + 22/o-iv + a\v2 +
^[u
+ a\)]2du
+ yoilipl
2,o~2 />oo
= 2a'1 <j>(y/o) \ Jo
4){a(u +
yai/a-2)/a2}du,
where a2 = a2 + a\. Note that
I
4>{o(u + yai/cr2)/a2}du
=
{a2/cr)^{-ya1/(a2a)}.
lo Jo
We thus have f(y) = 2<j-l4>{y/o)<S>{-yal/{<j2cj)}, i.e., f(y) is the pdf of the skew normal distribution with the skew parameter A = — 0) - 1} ~ where A = p/{l — p2)1^2
SN(X,a2),
and /(•) is the indicator function.
Proof. The result is implied by Azzalini and Capitanio (1999). 2.2. The skew regression
model
Let Yt be the log return of a stock from time t — 1 tot, and let xt be the log return of a market proxy at time t. Consider the skew normal regression model of Yt over Xt as follows: Yt=a
+ 0xt + eu
t = l,--- ,n,
(4)
where t\, • • • , en are independently and identically distributed with E(ei) = 0 and a + ti ~ SN(X,a2). The model can have two parameterization systems. One uses the skew normal distribution parameters, i.e., /?, A, and a2;
133
the other uses directly the regression parameters: a, (3, and r 2 = Var(Yt). It is easy to see the following relationship between the two systems: a = aA{2/[7r(l + A 2 )]} 1 ' 2 ,
(5)
and r 2 = a2{\
- (2/TT)A2/(1 +
A 2 )}.
3. Inference of the Volatility Parameter /3 Since the model (4) has a Gauss-Markov simple regression set-up, the least square estimate (LSE) of the volatility parameter j3, which is the best linear unbiased estimate, is a natural method for inference of /3. On the other hand, under the model (4), p has regular Fisher information and hence a likelihood-based method, for example, the maximum likelihood method (MLE), may be preferable to many users. Both methods are discussed in this section. 3.1.
LSE
The least square estimate of the volatility parameter /? is the ordinary one given as follows: P
=
^xy/^xxi
where 1
™
-
1 ™
Sxy = - Yl(Xt ~ *)(Yt - Y), Sxx = - Y^(xt ~ xfn
t=i
t=i
Note that
et)2=E(Yt-(3xt)2.
The parameter a2 can thus be estimated by
Similarly, one can use (5) to obtain a least square estimate of A when a is estimated by
a = Y-J3x.
134
To construct an exact confidence interval for /3 based on J3, the sampling distribution of $ has to be available explicitly. Since J3 is a linear combination of a skew normal sample, it might not be unreasonable to hope that (3 or a standardized form of it also follows a skew normal distribution so that its square would have Xi-distribution and an exact confidence interval could be easily obtained. Unfortunately, the following result shows that any linear combination of a random sample from a skew normal distribution does not follow a skew normal distribution, except for trivial cases. This is a sharp contrast to the Gaussian linear regression model. Proposition 3.1. Let Z\, • • • ,Zn be a random sample of size n > 1, from SN(X, T2) with A ^ 0. If a. = (ai, • • • , an)' is a constant vector with at least one pair of (i,j) such that a, ^ a, ^ 0, then n
a'z =
yajZj i=l
is not a skew normal random variable, where z' = (Zi, • • • , Zn). The proof of Proposition 3.1 is given in Appendix. By this result, it is implied that under the skew normal regression model (4), an exact confidence interval for /? is intractable computationally. An approximate confidence interval based on the asymptotic normal distribution theory of J3 is therefore recommended. See, for example, Sen and Singer (1993). Suppose the following conditions: Condition 1. As n —* oo, x approaches XQ € (—00,00). Condition 2. As n —> 00, Sxx approaches a2 e (0, 00). Condition 3. Noether's condition holds. Let c\t = (xt — x)2/(nSxx). max{c^ t : t = 1, • • • , n} approaches 0.
As n —> 00,
Then as n —> 00, n li/2 (/3 — (3) has the limiting normal distribution with mean 0 and variance T2/ax where r 2 = Var(et). Consequently, for large n, an approximate 100(1 — rf)% confidence interval for (3 based on the least square estimate is $±zv/2t/(nSxx)1'2 where z^/2 is such that 1 — $(2^/2) = rj/2, and
T2 = - f > t - & - / f c : t ) 2 .
(6)
135
Note that the consistency of f 2 is an immediate result of the model (4) and does not require Conditions 1-3. 3.2. MLE For a parametric problem, there are strong reasons to use a likelihood-based procedure for statistical inference. In particular, the MLE has a number of advantages over other estimation methods. The likelihood equations for a random sample of size n under the model (4) are readily written down. The log-likelihood function of (/?,
l(P,a,X) = - n l o g a - f^
{Yt
^
t ?
a
t=\
+£${\(Yt
- (3xt)/a}.
t=i
Put ht((3,{X(Xt - 0xt)/
- Pxt)/
Then differentiating l(/3,cr,X) with respect to /3, er and A, respectively, yields:
J2 * n
~
+
*t-(AAr)^>WM) = 0
(7)
n
y
3
E ( * - Pxtfl" t=l n
2
- (A/T) £(1$ - 0xt)ht(0, a, A) = 0
(8)
t=l
J2(Yt - Pxt)<j>{\{Yt - Pxt)/a}/${\(Yt
- /3xt)/
(9)
«=i
The solution of (7), (8) and (9) for (/?, a, A) gives the MLEs of (3, a and A, denoted by (3, a and A, respectively. An explicit solution for /3, a and A does not seem to be possible. However, because of the log-concavity of the skew normal density as remarked by Azzalini (1985) and Burridge (1981), for any fixed A, the likelihood equations (7) and (8) have an unique solution for (3 and a2. Furthermore, combining (8) and (9) establishes the relationship between the MLEs J3 and a as follows:
which is a well-known equation for the MLE's under a normal regression model. As suggested by Azzalini (1985), a practical way of proceeding is as follows. For a fixed value A, solve the likelihood equations for (/?, a) taking
136
into account (10); repeat these steps for a reasonable range of values of A. Denote the MLE of (3 by J3. The average Fisher information I for (/3, a, A) can be computed. Azzalini (1985) gives the formulas for an iid model which can be used to derive the information for the present regression model. Let b = (2/n)1'2/(l + A 2 ) 2 / 3 , and for k = 0,1, 2, define /ifc =
E{Zk{(XZ)/$(XZ)}%
where Z ~ SN(X, 1). Put x2 = n _ 1 Ylx%> and let x be the usual sample mean. Then / = a~2 A, where the 3 x 3 matrix A = (a^) is symmetric with « n = (1 + AVo)z 2
a 22 = 2 + A2/i2
a 33 = \i2a2
aw = {b\(l + 2\2) + X2 fii}x
ai3 = a(b — Xfii)x
a23 = -Xfj,2cr
Let u> be the element at the upper left corner of J - 1 , i.e., 2^2cr4 where \A\ is the determinant of A. Suppose that Conditions 1-3 holds. Then as n —* oo, n 1 / 2 (/3 — P)/\/u approaches ./V(0,1) in distribution. Thus the asymptotic 100(1 — T])% confidence interval of /? based on the MLE is p±zn/2(Q/n)V2,
(11)
where Q is any consistent estimate of w, for example, the estimate obtained by substituting the parameters with their MLEs in w. Note that the information matrix I is positive-definite even at A = 0, provided that Sxx = x2 — x2 > 0. In fact, when A = 0, a\2 =
and u> = a2 /Sxx. In this case, a2 — Var(Yt). val (11) for (3 is the same on the LSE. Indeed, it is model, the LSE and MLE
Consequently, the asymptotic confidence interas the asymptotic confidence interval (6) based well-known that under the Gaussian regression procedures are equivalent.
137
3.3. On the profile MLE and LSE Since the confidence interval (11) based on the exact MLE method is computing-demanding, an approximate MLE for j3 may be of interest. The profile MLE is an appealing approximation method to the MLE when the exact MLE is computationally difficult. Let 6 be the parameter of interest and ip be a nuisance parameter (vector) which has a consistent estimator, say tp. Let L(6,ip) be the likelihood function of (6, ip). The profile MLE 6* is the maximizer of the profile likelihood function L{9,ip). In the present problem, (3 is the parameter of interest and (a, A) is nuisance. If (
M(t) = £ [ e x p { t £ ) a i Z i } ] i=l n
= J^E[exp{taiZi}] n
n
= 2" exp{r2i2 Y,
fl2
/2} I I H^tai/(1
i=l
+ A2)1/2}.
i=\
Put Cl
If a'z ~
SN(XQ,
TO) n
for some
= ATOi/(l + A 2 ) 1 / 2 . AQ
and
TQ
> 0, we have,
n
2" exp{r 2 t 2 Y, *i/2} \{ H^t) i=l
= 2 exp{T 2 t 2 /2}${A 0 Tot/(l + A 2 ) 1 / 2 },
i=l
i.e., n 2
2
2" exp{t (r Y »=1
n
«? - To )/2} R H^t) i=l
= 2${A 0 r 0 i/(l + A 2 ,) 1 / 2 },
(12)
138
for all t. Clearly, A ^ 0 implies that Ao 7^ 0. Since the skew normal distribution family T is closed under a scale transformation, we assume Ao/(l + Ag)1//2 = 1 without loss of generality. Suppose that all c, > 0. Since any Cj = 0 leads to the corresponding term missing in (12) and since n > 2 and there is at least one pair of i ^ j such that Cj ^ 0 and Cj ^ 0, we can assume that all c, > 0. Now if T 2 Y2 °H < r o > then as t —> 00, the left-hand side of (12) approaches 0 , but the right-hand side approaches 2. A contradiction is met. If f2 ]T}a2 > T^, (12) leads to 2nU*(ciO<2$(7tf),
(13)
for all t > 0. Let c = min{cj}. Then c > 0 and we have 71
2 n JI$( Ci t) >2"$n(ci),
(14)
i=l
for all £ > 0. So combining (13) and (14), we have for all t > 0, 2"$"(tf) < 2$(r 0 t).
(15)
Letting £ —> 00, the left-hand side of (15) approaches 2 n , but the right-hand side goes to 2. A contradiction is again established. Next consider the case that there is i such that Ci < 0. First note that
Um - * f e L = x->-oo lim
x ^ - 0 0 —X$(x)
-X
—X(j)(x) — $ ( x )
->(l) + X2(j){x) x^oc ^ ( g ) -
2
fls)
=
L
(16)
Let m be the number of c, < 0 and denote
E"-E. IT-IIi:c;<0
i:Cj<0
Applying (16) yields lim 2" exp{t 2 (r 2 V a? - r 2 ) / 2 } J ] $(c^) t—>CX>
* — '
-*•-*•
1=1 n
= Hm 2"exp{i 2 (r 2 ^ a
t=l
2
- r 0 2 )/2}n'*(ci*)
n
= lim 2 " e x p { t 2 ( r 2 ^ a 2 - r 2 ) / 2 } < — e x p { - £ 2 ^ ' c 2 / 2 } / n ' l ^ - ( 1 7 ) £—•00
*
•*
i=l
* — *
•*•-*•
139 It is seen t h a t (17), hence t h e left-hand side of (12), approaches 0 if
-2E«r-02-E'c^°: i=\
(17) tends t o infinity if
-2i>2--o2-£'c2>°i=l
On the other hand, t h e right-hand side of (12) gives lim 2 $ ( T 0 < ) = t—>oo
2.
A contradiction is again discovered. We have exhausted all possible cases and t h e proof is completed. References 1. Azzalini, A. (1985). A class of distribution which includes the normal ones. Scandinavian Journal of Statistics 12, 171-178. 2. Azzalini, A. and Capitanio, A. (1999). Statistical applications of the multivariate skew normal distribution. J. R. Statist. Soc. B 3, 579-602. 3. Burridge, J. (1981). A note on maximum likelihood estimation for regression models using grouped data. J. R. Statist. Soc. B 43, 41-45. 4. Dichev, I.D. and Piotroski, J. D. (2001). The long-run stock returns following bond ratings changes. The Journal of Finance, Vol. LVI, No. 1. 173-203. 5. Fama, E. F. (1965). The behavior of stock market prices. The Journal of Business 38, 34-105. 6. JPMorgan (1995). RiskMetrics - Technical Document, 3rd Edition. Issued on May 26, 1995 in New York by Morgan Guaranty trust Company, Global Research. 7. Sen, P. K., and Singer, J. M. (1993). Large Sample Methods in Statistics. Chapman & Hall: New York. 8. Sharpe, W. F. (1963). A simplification model for portfolio selection. Management Science 9, 277-293. 9. Sharpe, W. F. (1964). Capital asset prices: A theory of market equilibrium under conditions of risk. The Journal of Finance 3, 425-442.
EXPLICIT TRANSITIONAL DYNAMICS IN GROWTH MODELS
DANYANG XIE International
Monetary Fund, 700 19th Street, N.W., IMF Institute, Division, Washington, DC 20431, USA
Asian
Transitional dynamics in growth models have been subject to much attention recently. With a few exceptions, existing studies rely on computational techniques. This paper uses a set of examples to illustrate that qualitative insights on the transitional dynamics can be gained at the expense of using special utility-production pairs. In continuous time framework, necessary and sufficient conditions are established for a utility-production pair to yield explicit dynamics. These conditions are potentially useful for applications in other dynamic settings.
1. Introduction In growth models, the equilibrium allocation is characterized by a system of differential or difference equations in state and co-state variables constrained by a set of initial and transversality conditions. Furthermore, if the models are in infinite horizon, the transversality conditions are in limit form, which is hard to deal with when computing the transitional dynamics, namely, the transitional trajectory to a steady state or a balanced growth path.. If the modelled economies are shown to have a unique stable steady state, computation of the transitional path is reasonably straightforward. First, we solve for the steady state. Then the system of equations is linearized around the steady state. Finally, the linearized system can be solved using standard softwares. This routine shall provide decent approximation of the transitional dynamics when the initial state variables are close to the steady state. Multiple shooting and backward shooting methods are also used on some occasions. If the modelled economies are shown to have a unique balanced growth path, Mulligan and Sala-i-Martin (1993) suggests that we transform the variables into "state-like" and "control-like" variables which converge to a steady state. The transformed system of equations can then be solved as 140
141
described above. The numerical procedures outlined above are powerful when conducting a simulation exercise, but they are not without limitations. Qualitative details are usually buried in the computational complexities. In this paper, we highlight the alternative method in studying transitional dynamics: namely the explicit dynamics method. This method makes use of utility-production pairs (henceforth U-P pairs) that allow for explicit transitional dynamics. Its limitation lies in that certain restrictions are put on the utility and production functions; these restrictions may not be consistent with the real economy and hence should be discarded when simulation is the purpose. The benefit of using the U-P pairs is that qualitative results are seen explicitly and therefore provide guidance for simulation and estimation. Explicit dynamics method is employed in Long and Plosser (1983) and McCallum (1989) in stochastic discrete time framework to analyze real business cycles. The U-P pair used there is the now familiar log utility and Cobb-Douglas production function. Benhabib and Rustichini (1994) finds that the pair of log utility and Cobb-Douglas production function can be extended to a class of CES utility and CES production function with a common parameter. In deterministic continuous time framework, Xie (1991) uses a class of U-P pairs to show that with externality in production, growth rate can increase over time and approach an upper bound which depends on preference and technology parameters. With similar U-P pairs, Xie (1994) shows explicitly that the two-sector growth model of Lucas (1988) contains multiple equilibria, and the global transitional dynamics is found to display surprising features (See Figure 1 in Xie (1994)). In an independent study, Benhabib and Perli (1994) uses local approximation technique to uncover similar characteristics of the Lucas model. Because the set of papers above uses explicit dynamics method in specific and complex context, the method itself does not receive enough exposure. This paper aims at putting explicit dynamics method into the hands of the researcher who is interested in transitional dynamics. The rest of the paper is organized as follows. Section 2 deals with discrete time framework. We present a one-sector RBC model and use the U-P pair a la Benhabib and Rustichini (1994) to illustrate the explicit dynamics. Also, a deterministic variant is used to show that chaos can easily arise in the presence of externality. Section 3 deals with continuous time framework. We present a one-sector growth model and demonstrate how a U-P pair leads to explicit solution. Section 4 studies the one-sector
142
growth model more closely and delivers necessary and sufficient condition for a U-P pair to permit explicit dynamics. Applications of the necessary conditions are illustrated in examples. Section 5 examines continuous time stochastic models. Section 6 concludes with a discussion of future research. 2. U-P Pairs in Discrete Time Framework This section contains two models. The first is a one-sector RBC model. The second is a one-sector growth model. In the first model, the U-P pair a la Benhabib and Rustichini (1994) is used to illustrate the explicit dynamics. In the second model, we use log utility and Cobb-Douglas production function with externality to show the possibility of chaos. We see that chaos can arise no matter what value the discount factor assumes. 2.1. One-Sector
RBC
Model
Consider an economy with long-lived representative agent. The preferences of the agent are as follows: oo
Et Y, F [Ou(ct+T) + (1 - 9)v(l - nt+T))
(1)
r=0
where Et is mathematical expectation conditional on time t information; (3 G (0,1) is the discount factor; 0 is the weight on the utility of the single consumption good, Ct+T, produced in this economy; 1 — n t + T is the leisure at time t + r (the total time available to the representative individual is normalized at unity every period). The production technology exhibits constant returns to scale in capital k and labor n: yt = Atf(kt,nt)
(2)
where At is the technology shock and is assumed to be i.i.d. Capital accumulation takes the natural form: h+i = Vt + (1 - S)kt - ct
(3)
where 5 is rate of capital depreciation. If we let u(c) = lnc, v(l — n) = ln(l — n), f(k, n) = kanl~a and 5 = 1, then this model is exactly the one-sector Long and Plosser model reported in McCallum (1989). The consumption, output, and capital are shown explicitly to display cyclical behavior in response to technology shock. One
143
minor unsatisfactory point is that labor is irresponsive to the shock, which Long and Plosser (1983) explained is due to the special combination of log utility, Cobb-Douglas production function and 100% depreciation*. Benhabib and Rustichini (1994) suggests that the pair of log utility and Cobb-Douglas production function is a special example of the following class of pairs which permits explicit dynamics with 100% depreciation:
and
f(k, n) = [ah1-* + (1 - a)**—] 1 / ( 1 _ C T ) .
(5)
It is easy to see that the traditional combination of log utility and CobbDouglas production function is the limiting case with a —> 1. The above class of U-P pairs can be indexed by the common parameter a. To see how the special class of U-P pairs allow for explicit dynamics, we write down the Lagrangian for the optimization problem:
£ = Et E ~ o r {fl^fe 1 + a - ) (1""tt:t1"'~1 +At+T [At+T[akfc + (1 - a X ^ / a - ) _ Ct+T -
fct+T+1]}
The first order conditions are: Et6c^T Et(l-6)(l-nt+T)-°
= EtXt+T
= Et\t+T{l-a)At+T[ak}-°
EtXt+T = pEt {xt+T+1aAt+T+1{akj-^
(6) +
0--(*)ri\lTY,{1~a)^+T (7)
+ (1 - a)n\^Y'^^k^T+1)
.
(8) These equations hold for any T > 0. The transversality condition is given by Etf3t+T Xt+Tkt-\-T+i - » 0 a s r - > oo. Similar to McCallum (1989), we guess that the solution takes the following form: ct = byt and kt+i = (1 — b)yt with b constant for any t. When At is i.i.d., so is A\~a. Let us denote the unconditional mean of A\~a by 7r, i.e. EA\~" = IT. Then we can easily verify that the above guess is correct when 6=1a
(a/37r)1/
(9)
100 percent depreciation is a technical assumption often made in order to have analytical solution in this type of models.
144
And we have: kt+1 = (af37r)1^At[ak1t-'T ct
1 - (a/?7r)1/,rl At[ak\-a
+ (1 - a)^-']1^1-^
(10)
+ (1 - a)n t 1 -"] 1/(1 ~
nt =
^ - ^ ^ „ [0(1 - a ) ] 1 / - + (1 - eyi° [1 - (a/37r)i/CT] j ^ ^ - ' ) / ' To check our calculation, set a = 1. In this case, 7r = 1 and n
(11)
(12)
6(1 - a) ' ~ 6(1 - a) + {1 - 0){1 - a0)'
which is identical to (1.19) in McCallum (1989) after making the notations consistent. Note that among all possible solutions to the first order conditions, the solution given above is the unique one that satisfies the transversality condition. We see from equation (10) to (12) that productivity shock, even if it is i.i.d., can generate persistent business fluctuations. In particular, equation (12) says that the employment can be pro-cyclical if a < 1. Clearly, it may not be reasonable to assume that the utility and production function share a common parameter a. But the literature in RBC seems to have used the special pair of a = 1 without realizing the fact. This exercise seems to suggest, in the light of equation (12), that the existing RBC simulation can be improved upon if we use a log utility function and a CES production function with a < 1: the employment variation may be raised. The class of U-P pairs can also be used in variants of the model to generate explicit dynamics. For instance, labor indivisibility described in Rogerson (1984) and used in Hansen (1985) for RBC simulation can be incorporated. These pairs can also be useful in multi-sector models to simplify the dynamics. 2.2. One-Sector
Growth Model with Externality:
Chaos
Consider a representative individual's utility maximization problem:
max ^ / 3 ' l n ( c ( ) t=o
145
subject to: kt+i = Akf B(kt) — c t , where B(kt) captures the external effect with kt representing average stock of capital in this economy. This is a model that uses log utility and Cobb-Douglas production function. We will see that the presence of externality does not jeopardize the nature of explicit dynamics. Again, 100% depreciation is assumed. The first order conditions can be written as follows:
At =
f3\t+1aAk?-?B{kt+1).
The transversality condition is: Atfct+i/?' —» 0 as t —> oo. The solution in this model is very much similar to the one in previous model, namely we have: ct = (1 - aP)yt
(13)
kt+i = a(3yt.
(14)
Note that the above derivation does not require any specification for B(kt). Suppose the external effects are two-fold, B(kt) = fct1_a(l — kt)- The term k]~a captures the positive effect considered in endogenous growth literature such as Romer (1986) and Lucas (1988). The other term (1 — kt) captures the negative effect such as pollution or congestion considered in Day (1982). Then equation (14) becomes: fct+i = aPAhiX ~ h)
(15)
where the equilibrium condition kt = kt has been substituted in. Equation (15) shows that depending on the values of a@A, the dynamics of capital can be simple or complex. For example, if aj3A = 4, chaos arises. Day (1982) shows that chaos may arise in two variants of the Solow growth model. In the first variant, he lets the saving rate be constant but introduces a pollution effect. In the second variant, he leaves the production function in the Solow model untouched but allows variable saving rate. While he demonstrates chaotic possibilities, his two models do not involve any optimization decision. In contrast, our conclusion comes from individual's well-defined optimization problem which can be decentralized to a competitive equilibrium. Another point is worth noting. In competitive chaos literature, for example Boldrin and Montrucchio (1984), Neumann et al (1988), a small discount factor f3 is usually needed for complex dynamics. In our simple
146
model here, it is shown explicitly that for any f3, chaos can arise if a and A assume appropriate values. 3. Continuous Time Framework Now we turn to continuous time framework. Let us use the following standard one-sector growth model as illustration: pt max / u{c)e- dt, Jo subject to: k = f(k) — 5k — c, with fco given. Write down the Hamiltonian,
Ti = u(c) + A(/(fc)
(16)
-Sk-c)
The first order conditions are: u'{c) = A
(17)
A = p\ - \(f'(k)
- 6)
(18)
pt
The transversality condition is \ke~ —> 0 as t —> oo. Let us use CES utility function and production function: c1_CT - 1 u(c) = -T—-
(19)
f(k) = Ak0 with /? e (0,1).
(20)
and
Then we find that
H=-(*)i-i
,
x
(21)
= f - * ^ + ( i -1)/(*)/* Clearly, if a = (3 G (0,1), equation (21) is much simplified. Indeed in this case, we find the obvious solution: c = P+ (l-a)5 k a which is the only solution that satisfies the transversality condition. As a result, the evolution of capital becomes: k = Ak" - [p + (1 -
(23)
147
Explicit solution for the above equation exists but is omitted here. This class of U-P pairs is the same in spirit as the class in discrete time framework that the utility and production function share a common parameter. The difference in form should be noted however: the class in discrete time involves CES utility and CES production function whereas the class in continuous time involves CES utility function and Cobb-Douglas production function. Also, in continuous time, we do not need a 100% depreciation of capital to have explicit solution. This class of U-P pairs still permits explicit dynamics when the production function is modified to include externality or other productive factors. As a result, it has a number of applications. In one of these applications, Xie (1991) shows that in the presence of positive externality, growth rate of capital can increase over time and approach an upper bound that depends on preference and technological parameters in an intuitive way. In another application, Xie (1994) shows that the two-sector growth model of Lucas (1988) contains multiple equilibria; furthermore, it shows explicitly that the transitional dynamics conjectured by Lucas is incorrect. 4. More on Continuous Time Framework In this section, we give a complete characterization of the U-P pairs in the continuous time one-sector growth model that allow for closed form solution. By closed form solution, we mean that the solution has the form c = g(k), with g(.) as a known function. For convenience, in the following discussion, we assume zero percent depreciation <5 = 0. 4.1.
Propositions
Proposition 4.1. Let utility function be given by u(c). There is a closed form solution c = g(k) if and only if the production function belongs to the following class:
m__g{k)+d^m^mi
(21)
and that the transversality condition is satisfied. Proof: For the "necessary" part, suppose that there is closed form solution c = g{k). The first order conditions (16)-(18) (note that we assume 5 = 0) yield
^mmif{k)-g{k)]=P-m
148
Rearranging terms we obtain u'(g(k))f'(k)
+ u"(g(k))g'(k)f(k)
= pu'(g(k)) +
The left-hand side is the derivative of u'(g(k))f(k) taking integral of both sides, we have: u'(g(k))f(k)
u"(g(k))g'(k)g(k)
with respect to k. Thus,
=pf u'{g{k))dk + J u"(g(k))g'(k)g(k)dk = pfu'(g(k))dk + u'(g(k)g(k) - f u'(g(k))g'(k)dk = pju'(g(k))dk + u'(g(k)g(k) - u(g(k))
Prom this equation we get:
f(k) = g(k) +
pju'(g(k))dk-u(g(k)) u'(g(k))
Furthermore, the transversality condition has to be satisfied. For the "sufficient" part, it is easy to verify that if f(k) can be written as in equation (24) with a known function g(.), then c = g(k) coupled with k = f(k) — g(k) satisfies all the first order conditions. If furthermore, the transversality condition is satisfied, then c = g(k) is the optimal consumption rule. Hence, the solution is in closed form.
Proposition 4.2. Let production function be given by f(k). There is a closed form solution c = g(k) if and only if u(c) belongs to the following class
•"-/-i/*3^^:
dc,
(25)
and that the transversality condition is satisfied. In equation (25), h(.) =
Proof: For the "necessary part", suppose that there is closed form solution c = g(k). By the definition of h(.), we have k = h(c). Thus, k = h'(c)c. Combining this with the first order conditions (16)-(18) yields f(h(c)) - c = h'(c)£Q
[p - f'(h(c))]
(26)
Rearranging terms we obtain: u"(c) u'(c)
=
h'(c)[p-f'(h(c))} f(h(c))-c
(
'
149
Taking an integral from both sides of equation (27) and further calculating, we arrive: u'(c) = exp
h'(c)[p-f'(h(c))] /
f(h(c)) -
-de
(28)
Therefore, u(c) must belong to the class given by equation (25). In order to be sure that c = g(k) is the solution, the transversality condition must be verified. For the "sufficient" part, suppose the utility function can be written as in (25) for some explicit function h(c). Let g(.) = h~l(.). To show that c = g(k) is the optimal consumption rule, we reverse the above process to verify that all the first order conditions are satisfied. Since the transversality condition is also satisfied, we have shown that c = g(k) is optimal and in closed form. Proposition 4.1 and 4.2 can be used to find U-P pairs that permit closed form solution. We shall use a few examples to demonstrate the usefulness of these results. 4.2.
Examples
Example 4.1. Suppose u(c) — [c1_CT — l] / ( l — a) and suppose we want a closed form solution c = 7/c with 7 constant. In the language of Proposition 4.1, g(k) = 7/c. The appropriate production function is given by equation (24):
/(fc)=fl(fc) + - 1
K K u
_
7K
pJu (g
' l* ( »g ) 7 ( g ( f c ) )
pJ(7fc)-"dfc-[(7fc)1-'-l1/(l-(r) {fk)-" , p7-'fc1-V(l-cr) + .;-[(7fc)1-'-l]/(l-,T) + (7k)-" ,
+
= 7fc [1 + (P/7 - 1)/(1 - o)\ + (J + V ( l - °)){lkY where J is any constant. Note that when 7 = [p+ (1 — a)S]/cr, the formula above yields f(k) = Aka — 6k, which is exactly the case we discussed in detail in Section 3. The transversality condition is easily verified for 0 < a < 1 because the capital k approaches a steady state and so does the consumption c.
Example 4.2. Suppose f(k) = Aka (a € (0,1)), and suppose we want a closed form solution c = (p/a)k. Then Proposition 4.2 can be used to
150
find out the appropriate utility functions. The calculation is a bit difficult because we have to do integral twice. Nonetheless, the answer is that u(c) = Jicl~" + Ji with constant J\ and J 2 (Ji positive). In general, Proposition 4.1 is easier to use than Proposition 4.2.
Example 4.3. Suppose u(c) = - e c , and g(k) = 2{pk)1/2.Equation implies that the production function must be of the following form: tll.\ — n(U\ J- P-f J[K) - g{K) H ,
u'(g(k))dk-u(g(k)) u'(g(k)) n f p-2
>i / o
1/2
= 2(pk)
(24)
+
Pf
= 1/2 + {pkf2 +
e-J"tl
Je2^1'2
where J is any constant. The simplest member in the above class is the one with J = 0. In this case, f(k) = 1/2 + (pfc) V2 , which behaves nicely as a production function except that /(0) ^ 0. To make sure that c = 2(pfc)1//2 is the solution, we need to check the transversality conditions. Note that k — f(k) — c = 1/2 — (pfc)1/2. k converges to a steady state and obviously so does c. Hence the transversality condition is satisfied.
Example 4.4. Suppose u(c) = —e c , and g(k) = 7lnfc. The equation (24) says that we need the following function to make c = g(k) optimal: f(k) = 7 In k + Jk1 + pk/(l - 7) + 1, with J an arbitrary constant. For instance, when we set J = 0, we get a rather simple concave production function f(k) = jink + pk/(l — 7) + 1 provided that 7 is in the interval (0,1). To verify that c = 7 In A: is indeed optimal in this case, again we need to check the transversality condition. In fact, from the following first order conditions e~c = A,
(29)
fc = 7lnfc + p f c / ( l - 7 ) + l - c ,
(30)
A = pA-A[7A + p / ( l - 7 ) ] ,
(31)
and
151
we see that c = 7 In A; implies k = pk/(l — 7) + 1. Also, equation (29) says
that \ = e-'llnk
=k-i.
Thus \ke-<* =
k^e'"1,
which converges to [fco + (1 — 7 ) / p ] p ' ' 1 _ 7 ' and therefore the transversality condition is violated. As a result c = 7 In k is NOT the optimal consumption rule for the case u(c) = —e~c and f(k) = 7 In A; 4- pk/(l — 7) + 1. WARNING: Example 4.4 shows that after deriving the production function, we need to verify the transversality condition to make sure that the consumption rule is indeed optimal for the utility and production pair.
Remark 4.1. In Example 4.4, we can verify that for any arbitrary constant J, the utility-production pair will violate the transversality condition if c = 7lnfc. Therefore, we can conclude that when u(c) — —e~c, there exists no production function such that the optimal consumption rule has the form: c = -ylnk. 5. Continuous Time Stochastic Models In Rebelo and Xie (1998), we gave examples of explicit solutions in stochastic monetary models with continuous time. Again, the examples made use of the U-P pairs described in the last section. We showed that with inelastic labor supply, a constant nominal interest rate (it does not necessarily have to be zero) is optimal in a monetary model. In this section, I would like to point out the same U-P pairs can also yield explicit dynamics in a model with the spirit of capitalism (see discrete time model of Zou (1994) and continuous time AK model of Gong and Zou (1998)). b 5.1. Additively
Separable
Utility
00 r 1 a
•f/o
max
c-
l-ar
Function k1-" 0-.
~ptdt
1-cr
Jo
subject to : dk = {Aka — c)dt + ekdz fco given b
Zou (1994) models the spirit of capitalism (Weber 1958: capitalists accumulate wealth for the sake of wealth) by putting capital stock directly into the utility function.
152
where dz is an increment of standard Wiener process. Note that k is included in the objective function to capture the idea of capitalist spirit. We show below that when a = a £ (0,1), there will be an explicit solution and we can then draw some qualitative conclusions for this model. The Hamilton-Jacobi-Bellman equation for this problem is, 0 = max
l - < 7
+i
pj{k)
1-0"
+J'(k) (Aka -c) +
1
^J"(k)s2k2
We conjecture that the value function takes the following form:
J{k) = a
b~°k
\-a
(l-<7)'
where a 6 (-co, oo) and b > 0 are to be determined later. The optimum condition for c implies: c-" = J'(fc) = b-ak~a Namely, c = bk. Putting this back into the Hamilton-Jacobi-Bellman equation, we obtain, 0= — l-o-
+61 -or
+b-°A-b1-"k1-'T
p a+
b-"kl-a~
{I-a) 1, 2 1 o-b-°e k 2
Comparing coefficients, we find that 0 = cr6 + Ob"
p+
a{l-a)-s2
pba An increase in 6, namely a stronger capitalist spirit, leads to a lower b. Thus, consumption will be a smaller proportion of capital. Growth rate is higher. A higher e, namely a greater uncertainty, leads to a higher b. The consumption will be a higher proportion of capital stock. Hence saving rate is lower. To understand this, note that an increase in uncertainty has two effects. The first relates to the precautionary motive to save and implies that saving should be higher. The second relates to the fact that higher
153
uncertainty in the returns reduces incentive to invest. In the model specified here when a = a < 1, the second effect dominates and in equilibrium, saving is lower. Whether the result holds in general when a < 1 but a > 1 remains unanswered. 5.2.
Multiplicatively
Separable
Utility
Function
Let us take an alternative specification of the objective function. pt e - dt
max /
subject to : dk = (Aka — c)dt + ekdz k0 given where 7 € (0,cr). To be consistent with the idea of capitalist spirit, we again need to constrain a to be less than 1. The Hamilton-Jacobi-Bellman equation for this problem is, c1~ak1 0 = max —
pJ{k)
1 — cr
+J'{k) {Aka -c) + i
J"(k)e2k2
where J(.) is the value function. We conjecture that the value function takes the following form:
J(k) = a + -, r y } (7 + 1 - a) The optimum condition for c implies: c-
ff
F = J'(k) = b-ak"<-a,
namely, c = bk. Putting this back into the Hamilton-Jacobi-Bellman equation, we obtain, 6l-
0=
6
_CTfci_ff+7-
p a+
(7 + 1 - ) I—a a a + r F - [Ak - bk] + (7 - a) V ^ f c CT
1
-^
7
If a = a — 7, then our conjecture is confirmed if the coefficients satisfy: t
P l & ^ j + (1 ~ «0 fr - 7) j e 3
154
A a = —— pb" An increase in 7 leads to a lower b. Thus, consumption will be a lower proportion of capital. Growth rate is higher. This is consistent with the result in the case with additively separable utility function. An increase in e increases b and lowers savings. This simply says that with multiplicatively separable utility function, the precautionary saving motive is stronger than the disincentive to invest. Again, this qualitative result has to be attached with a qualifier: a = a — 7. 6. Conclusion In this paper, we highlight the U-P pairs that permit explicit dynamics in growth models. Both the discrete time case and the continuous time case are treated. We point out the similarities and differences in the two frameworks. Although the study is conducted in one-sector model, similar usage of the U-P pairs is possible in multi-sector model. See for example Long and Plosser (1983) and Devarajan et. al (1998) for discrete time case; and Xie (1994) and Mino (2001) for continuous time case. In continuous time case, the two propositions in Section 4 give researchers plenty choices over the U-P pairs that lead to closed form solutions. Corresponding propositions for the discrete time case have yet to be found. The paper also shows that closed form solutions for continuous time, stochastic one-sector model found in Chang (1988) and Rebelo and Xie (1999) can be extended to models with capitalist spirit. We hope that the results presented here about the U-P pairs lead to their more frequent application in a variety of context to obtain qualitative insights that provide guidance for simulation and estimation. Acknowledgments I thank Jess Benhabib, Reza Vaez-Zadeh, Heng-fu Zou and an anonymous referee for comments. The views expressed here are those of the author and do not necessarily represent those of the IMF. References 1. Benhabib, J. and Aldo R. (1994). A note on a new class of solutions to dynamic programming problems arising in economic growth. Journal of Economic Dynamics and Control 18, 807-813.
155 2. Benhabib, J. and Roberto, P. (1994). Uniqueness and indeterminacy: Transitional dynamics in a model of endogenous growth. Journal of Economic Theory 63, 113-142. 3. Boldrin, M. and Montrucchio, L. (1984). The emergence of dynamic complexities in models of optimal growth: The role of impatience. Working Paper # 7 , Rochester Center for Economic Research. 4. Chang, F.-R. (1988). The inverse optimal problem: A dynamic programming approach. Econometrica 56, 147-172. 5. Day, R. (1982). Irregular growth cycles. American Economic Review 72, 406414. 6. Devarajan S., D. Xie and Zou, H.-P. (1998). Should public capital be subsidized or provided? Journal of Monetary Economics, April. 7. Gong, L.T. and Zou, H.-F. (2002). Direct preferences for wealth, the risk premium puzzle, growth, and policy effectiveness. Journal of Economic Dynamics and Control 26, 247-70. 8. Hansen, G. (1985). Indivisible labor and the business cycle. Journal of Monetary Economics 16, 309-327. 9. Long J., Jr. and Plosser, C. (1983). Real business cycles. Journal of Political Economy 9 1 , 39-69. 10. Lucas, R., E. Jr. (1988). On the mechanics of economic development. Journal of Monetary Economics 22, 3-42. 11. McCallum, B. (1989). Real business cycle models, modern business cycle theory. Harvard University Press, 16-50. 12. Kazuo, M. (2001). Human capital formation and patterns of growth with multiple equilibria, in long-run growth and economic development: From theory to practice, (edited by Michele Boldrin, Been-Lon Chen and Ping Wang), Edward Elger, forthcoming. 13. Mulligan C.B. and Sala-i-Martin, X. (1993). Transitional dynamics in twosector models of endogenous growth. Quarterly Journal of Economics 108, 739-773. 14. Neumann, D., O'Brien, T., Hoag, J. and Kim, H. (1988). Policy functions for capital accumulation paths. Journal of Economic Theory 46, 205-214. 15. Sergio, R. and Xie, D.Y. (1999). On the optimality of interest rate smoothing. Journal of Monetary Economics 4 3 , 263-282. 16. Rogerson, R. (1984). Indivisible Labor, Lotteries, and Equilibrium. University of Rochester. 17. Romer, P. (1986). Increasing returns and long run growth. Journal of Political Economy 94, 1002-1037. 18. Weber, M. (1958). The Protestant Ethic and the Spirit of Capitalism. Charles Scribner's Sons, New York. 19. Xie, D.Y. (1991). Increasing returns and increasing rates of growth. Journal of Political Economy 99, 429-435. 20. Xie, D.Y. (1994). Divergence in economic performance: Transitional dynamics with multiple equilibria. Journal of Economic Theory 63, 97-112. 21. Zou, H.-F. (1994). The spirit of capitalism and long-run growth. European Journal of Political Economy 10, 279-93.
A FISCAL FEDERALISM A P P R O A C H TO O P T I M A L TAXATION A N D INTERGOVERNMENTAL TRANSFERS IN A D Y N A M I C MODEL
LIUTANG GONG Guanghua Institute
School of Management, Peking University, Beijing, 100871, China for Advanced Study, Wuhan University, Wuhan, 430072, China H E N G - F U ZOU
Development
Research
Group, The World Bank, Washington, DC 20433,
MC2-611, USA
1818 H St.
NW,
In this paper, we study the optimal choices of the federal income tax, federal transfers, and local taxes in a dynamic model of capital accumulation and with explicit game structures among private agents, the local government, and the federal government. When the federal government is t h e leader and the local government is the follower in a Stackelberg game with both the consumption tax and property tax available to the local government, the optimal local property tax is zero, and local consumption tax is positive. But federal transfers to the local government are negative, and the federal income tax can be positive or negative. In this case, the local consumption tax is used to finance both local and federal public spending.
1. Introduction This paper considers optimal choices of the federal income tax, local property tax, local consumption tax, and federal transfers to local governments in an intertemporal model of capital accumulation. There exists an enormous literature on optimal income and commodity taxation. Classical contributions include, for example, Ramsey (1927), Mirrlees (1971), Diamond and Mirrlees (1971), Atkinson and Stiglitz (1972, 1976), and Samuelson (1986). Comprehensive literature reviews are provided by Atkinson and Stiglitz (1980), and Myles (1995). In most of these contributions, the government has often taken to be a single identity without introducing the structure of tax assignments and expenditure assignments among multiple levels of government. But in reality, income tax is mainly collected by central governments in Europe and jointly by the federal government and 156
157
state governments in the United States, property tax is mainly collected by local governments, and commodity tax is collected by both central governments and local governments in Europe or by local governments in the United States. In most developed countries, each level of government has the power to determine tax rates and tax bases. In addition, intergovernmental transfers in various forms exist among different levels of government in every country of reasonable population size. It is natural to see how the structure of fiscal federalism affects optimal taxation and intergovernmental transfers. In an earlier contribution to optimal taxation and revenue sharing in the context of fiscal federalism, Gordon (1983) has utilized a static model to consider how local governments set the rules of local taxes including tax rates and types of taxes in a decentralized form of decision-making while allowing the central government the role of correcting externalities through grants, revenue sharing, and regulations on local tax bases. Recently, Persson and Tabellini (1996a, 1996b) have considered risk sharing and redistribution across local governments in a federation using static models involving risk. In this paper, on the basis of the contributions by Gordon (1983), and Persson and Tabellini (1996a, 1996b), we analyze the optimal choices of federal taxes, federal transfer, and local taxes in a dynamic model of capital accumulation and with explicit game structures among private agents, local governments, and the federal government. a For ease of the treatment, we focus on federal income tax, local property tax, local consumption tax, and federal matching grant for local public spending. Our dynamic approach is timely because the optimal design of tax assignments, expenditure assignments, and intergovernmental transfers among different levels of government has received considerable attention in the 1990s in the context of fiscal federalism, public sector reforms, and economic growth for both developing and developed countries. One of the most important goals of establishing a sound intergovernmental fiscal relationship is supposed to promote local as well as national economic growth (see Rivlin (1992),Bird (1993),Gramlich (1993),and Oates (1993)). The paper intends to provide an analytical framework for the ongoing discussion on fiscal federalism and economic growth. a
See Zou (1994,1996), Brueckner (1996), Devarajan, Swaroop, Zou (1996), Davoodi and Zou (1998), and Zhang and Zou (1998) for related dynamic approaches to multi-level government spending, intergovernmental transfers, federal taxes, and local taxes in a "federation".
158
Section 2 presents the optimal choices of taxes and transfer from the dynamic Cournot-Nash game between the federal and local government while assuming the Stackelberg (leader-follower) games between the local government and the private agent and between the federal government and the private agent. Section 3 derives the optimal choices of taxes and transfer by studying the Stackelberg game between the local government and the federal government, while retaining the same Stackelberg games between the two levels of government and the private agent. Section 4 concludes. 2. The Framework There are three actors in the economy: a representative agent, a local government, and the federal government. 2.1. The
Agent
Like Arrow and Kurz (1970),Barro (1990),Turnovsky (1995),, and Turnovsky and Fisher government expenditures are introduced into the representative agent's utility function. Unlike those studies, public expenditures are divided into the federal and local ones in the model. The agent derives a positive, but diminishing, marginal utility from the expenditures of both the federal and local governments and private consumption. Let / , s, and c be federal expenditure, local expenditure, and private consumption, respectively. If the utility function u{c, / , s) is twice differentiable, the assumption is equivalent to: uc > 0, Uf > 0, us > 0, ucc < 0, Uff < 0, uss < 0.
(1)
For the cross effects ucf, ucs, and Ufs, they are assumed to be positive in general. In addition, u(c, f, s) satisfies the Inada condition: lim us = oo, limy_,o u / = °°! lim c ^o w c = °°
(2)
s—*0
lim^ooUs = 0, lim/^ooW/ = 0, linv^ooUc - 0 The representative agent's discounted utility is given by poo
/ u(c,f,s)e-'>tdt, Jo where p is the positive, constant time preference. U=
(3)
159
Again following Arrow and Kurz (1970),Barro (1990)and Turnovsky (1995),output y is produced by a constant-return-to-scale production function with three inputs: private capital stock, k, federal government expenditure, / , and local government expenditure, s, namely y = y(k,f,s),
(4)
where all variables are in per capita terms. For simplicity, the size of population or the labor force is assumed to be constant. The marginal productivity of private capital stock, federal government expenditure, and local government expenditure are positive and decreasing: Vk > 0, yf > 0, ys > 0, ycc < 0, yff
< 0, yss < 0.
(5)
Federal government expenditure, / , is financed by the income tax on the agent. Local government expenditure, s, is the sum of the consumption tax b , TCC, the capital or property tax, T/.k, and federal government's transfer, gs. Tf, TC, and T\. are the federal income tax rate, local consumption tax rate, and local capital or property tax rate, respectively, and g is the rate of federal matching grant for local spending. Hence, the budget constraints for the federal government and local government can be written as follows f = Tfy-
gs
s = gs + Tkk + TCC
(6)
(7)
respectively, and the budget constraint for the representative agent can be written as dk — = (1 - Ty)y(k, / , s)-Skrkk - (1 + TC)C (8) where 6 is the rate of capital depreciation. The representative agent is assumed to have an infinite planning horizon, to face a perfect capital market, and to have perfect foresight. Given these assumptions, he chooses his consumption path and capital-accumulation path to maximize his discounted utility °The consumption tax has been analyzed recently in growth models with one level of government by King and Rebelo (1990), Rebelo (1991) and Jones, Manuelli, and Rossi (1993), and Turnovsky (1995)
160
poo
max U v{f) + w(s))e-ptdtptdt (9) U== / (u(c) +(u(c)+v(f)+w(s))eJo subject to (8). His initial capital stock is given by fc(0) = fco- For simplicity, we have taken the utility function to be separable in c, / , and s in (9). The Hamiltonian associated with the optimization problem is defined as H = u(c) + v(f) + w(s) + A((l - Ty)y(k, / , s) - 5k - rkk - (1 + TC)C) (10) where A is the costate variable, and it represents the marginal utility of wealth. The first-order conditions for individual optimization are dk -g = (i - T,)v(t, / , s) - (i + TC)C - (S + n)k
(ii)
§ — * [ < 1 - T , ) | * - P - « - H ]
(12)
Wc = (l+T C )A.
(13)
And from the last condition (13), we have c = c(A,r c ). 2.2. Local
(14)
Government
The local government and the private agent play the Stackelberg game with the local government as the leader and private agent the follower0. At the same time, in this section, we also assume that the local and the federal government react to each other along Cournot-Nash lines. That is to say, given the federal income tax rate, federal matching grant, and federal spending, the local government maximizes the agent's welfare by fully incorporating the agent's first-order conditions in section 2.1 into its own maximization. Specifically, the local government will choose optimal C
A similar technique is used by Chamley (1985,1986), Lucas (1990) and Devarajan et al. (1996) in the treatment of optimal taxation of capital income with one level of government and a representative agent.
161
taxes r c and Tk, public expenditure, s, private capital stock, k(t), and the marginal utility of private wealth, X(t), to maximize the agent's welfare /•OO
[u(c(X,Tc))+v(f)+w(s)}e-ptdt
ma.xxtk,rc,rk,s
(15)
Jo subject to its own budget constraint s-gs
= Tcc + Tkk
(16)
and the first-order conditions for private agent's optimization dk — = (1 - T,Uk,f,s)
- (1 + TiMA.Tj - (S + Tt)k
§-»[
(17)
<1»
where we have already used the optimal consumption for the private agent in the objective function: c = c(A,r c ). Define the Hamiltonian for local government's optimization problem as
H = u(c(\,Tc))
+
v(f)+w(8)+l3{-\[(l-Tf)Q-p-5-Tk]}
+a[(l - Tf)y(k, f, s) - (1 + r c )c(A, r c ) -(5 + rk)k] +£[T C C(A, T C ) + Tfcfc + gS - s] + /iTfc + UTC
where a is the "local" costate variable associated with the agent's dynamic budget constraint; (3 is the "local" costate variable associated with the agent's Euler equation of optimal consumption; £ is the multiplier for local government's budget constraint; /J, is the multiplier for the inequality constraint that 0 < Tfc < 1, v is the multiplier for the nonnegative consumption tax constraint TC > 0. The first-order conditions for local government's optimization are 9
-§ = W + a ( l - r
/ }
3H —— =u cTc-ac-
| -
W
a(l +
-rf)^L+
TC)CTC
VTC = 0,V>0
fr
- 1) = 0
+ £c + ^ c c T c + u = 0
(19)
(20) (21)
162
—
= -ak + (3\ + tk + fi = 0
(22) (23)
llTk = 0, (J, > 0
da Tt=pa--dk
dH
= pa —
Q[(1
(24)
-',%-'-*+
dp dt
(25) = p(3 - u'c A + a ( l +
2.3. The Federal
TC)CX
+ P[(l -Tf)^-
p-S-Tk]-
£TCC\.
Government
We assume that the federal government and the agent play the Stackelberg game with the federal government as the leader and the agent the follower, whereas the federal government and the local government play the CournotNash game. Therefore, taking as given local government's choices of r c , T^, and s, the federal government incorporates the private agent's first-order conditions for his optimization into the federal optimization program by choosing federal income tax, -ry, federal public spending, / , the rate of federal transfer to the local government, g, private capital stock, k, and the marginal utility of private wealth, A, to maximize the agent's welfare, namely, /•OO
t max / [u(c{\,Tc))+v{f)+w{s))e-P dt ./o subject to the agent's optimization conditions:
(26)
dk — = (1 - Tf)y(k, / , s) - (1 + rc)c(A, r c ) - (5 + rk)k ft = -A[(l - rf)
dy{k
dl'
s)
-p-5-rk(k,
A, a, /?)]
(27) (28)
and the federal budget constraint f + gs = Tfy
(29)
163
with the initial private capital stock fc(0) given. Define the Hamiltonian function for the federal government as Hf = u(c(X,rc)) + v(f) + w(s) + 0 2 {-A[(l - rf)dy{k^s) +01 [(1 - Tf)y(k,
+v[TfV - f
/ , S) - (1 + TC)C(A, TC) -(5
-p-S-rk}}
+ Tk)k]
-gs]+wg
where 9\ is the "federal" costate variable associated with the agent's dynamic budget constraint; #2 is the "federal" costate variable associated with the agent's Euler equation of optimal consumption; 77 is the multiplier for the federal budget constraint; and to is the multiplier for the requirement of a non-negative rate of federal transfer, i.e., g > 0. The first-order conditions for the federal government's optimization are dII
i-v' + [vTf + 0l{i _ T / ) ] | | _ e2\(i - rf)-^L _ v = 0 df — v -+-[T)Tf - h W i l dHf drf
d9x dt =
p8i-
(30)
-0iZ/ + 0 2 A | | + rry = 0
(31)
dHf dk
(32)
= ^-W-r/)|-*-7,]+M[(l-r/)0-^| (33)
"df ~ ^ ~ ~aT = p62 - u cx + 6>i(l +
TC)CX
+ 02[(1 - T/)-^| - p - 5 - rfc]
-??s + w = 0,a;5 = 0,w > 0 . 2.4. Some Results from the Cournot-Nash the Federal and Local Governments
(34)
Equilibrium
for
The full dynamic system is extremely complicated. But some results regarding the optimal choices of taxes and federal transfer along the CournotNash lines can be derived from the steady-state or long-run analysis of the full dynamic system. In the steady state, dk
dX
dc
df
ds
drf
drc
drk
^~^~^~a^~'dl~~dT^'dT~~dT~'di~
dg
. .
^'
164
and so are various costate variables and multipliers: da
d(3 d£ dp
d6\ dQ-i
drj
dt
dt
dt
dt
dt
dt
dt
duj = 0. dt
(36) '
Therefore, (1 - Tj)y{k, f, s)-(l + TC)C(A, TC) - (8 + Tk)k = 0 -X[(l-Tf)^-p-S-Tk]=0 «c = (1 + rc)X s - gs =
TCC
u cTc - ac - a ( l + UTC
TC)CTC
(40)
+ t(g - 1) = 0
+ £c + ^ c Cr c + u = 0
= 0, t> > 0.
-ak + PX + £k + n = 0
(42) (43) (44)
(45)
^(1-^)0-^ = 0
(46)
+ a ( l + TC)CA - £r c c A = 0
f + gs = Tfy
[TJT/
(41)
nn = o, n > o
pP-ucx
t, +
(38) (39)
+ rkk
w + a ( l - r , ) g - /3A(1 - r , ) J ^
(37)
+ 6»!(1 -
T,)]^
- 02A(1 - r , ) ^ - „ = 0
(47)
(48) (49)
165
-Oiy + e2X^ +r]y = 0
^(1-^)0-^1=0
(50)
(51)
p62-ucx+e1{l+Tc)cx=0 —7]s + w = 0, ujg = 0, u) > 0.
(52) (53)
P r o p o s i t i o n 2 . 1 . TTie steady-state optimal property tax rate is zero, but the steady-state consumption tax is positive. Proof: First from equation (46), we have
/?A(l-r / )0=er f c >O.
(54)
Hence, (3 < 0. Suppose the optimal property tax rate is strictly positive: 7> > 0 . From equations (44) and (47), we have (£ - a)k + f3\ = 0,
(55)
u -a{l+Tc)+irc
= ^-. (56) c\ Substituting equations (55) and (56) into equation ( 42), we obtain
PP.cTc-(3\^+v = 0. c\ k From equation (39), we have 1+Tc c\ =
Urn
A ,cTc = U,cc .
(57)
(58)
Substituting equation (58) into equation (57), we get
» ,_£) + „ = „.
( 1 + TC
k
m
Substituting equations (37) and (38) into equation ( 59), we have
A 1 +TC
" - U - / > + « = 0.
(60)
166
Because of the assumption on the production function, we have
%k < y-
(61)
If TC > 0, we have v > 0. Now, equation ( 60) implies [3 = 0.
(62)
Hence, from equations (44) and (46), we have £ = a = 0.
(63)
Then, from equation (41), we obtain w'(s) = 0,
(64)
which is impossible because w (s) is strictly positive by our assumptions. Therefore, we must have Tfc = 0. Q.E.D. This result is rather intuitive. For the local government, consumption tax has no distortionary effect on private production and private capital accumulation, whereas local property tax directly reduces private capital accumulation. It is always welfare maximizing for the local government to finance local public spending through the less distortionary consumption tax instead of capital or property tax. Proposition 2.2. The steady-state federal transfer is zero: g = 0. Proof: Suppose that g is not equal to zero. Then, from equation ( 53), we have ^ = 0,77 = 0.
(65)
That is to say, from equation (51), 6>2 = 0.
(66)
9X = 0.
(67)
Then, from equation (50),
Now from equation (49) we must have «'(/) = 0.
(68)
167
This contradicts our assumption that v (/) > 0. Therefore, we must have
3 = 0.
(69)
Q.E.D. This is also intuitively convincing. The federal government and the local government decide their optimal choices along the Cournot-Nash lines without taking into consideration the interactions of their choices. In this case it is always in the federal government's interest to provide zero subsidy to the local government. In the next section this picture will dramatically change when the two governments play a Stackelberg game with the federal government as the leader and the local government as the follower. Before we conclude this section, please note the following three points. First, the steady-state consumption tax and property tax cannot be zero at the same time. This is true because, from proposition 2.2, we know that the steady-state government matching grant is zero. From equation (40 ), if Tk = TC = 0, we have s = 0, which cannot be optimal in view of the Inada conditions (2) on the utility function. Second, if the local consumption tax is set to zero, i.e., r c = 0, then local spending must be financed by a positive capital or property tax. Still, the Cournot-Nash game between the federal and local governments will result in a zero federal transfer to locality: g = 0. The proof is similar to the one in proposition 2.2. Finally, along the Cournot-Nash lines, the optimal federal income tax must be positive because of the Inada condition for federal spending. 3. The Stackelberg Game between the Federal Government and Local Government In section 2, we find that if the federal and local governments play the Cournot-Nash game in choosing their individually optimal taxes and transfer, respectively, then federal transfer to the local government is zero. Here we suppose that the federal and the local governments play a Stackelberg game while retaining the same game structures of the private agent versus the two levels of government. In the new setting, it is natural to let the federal government be the leader, and the local government be the follower. In order to by-pass the complexity of the general solutions of these complicated, multi-stage Stackelberg games, and provide some explicit solutions to the optimal choices of taxes and federal transfer, we use specific utility function and production technology.
168 3.1. The
Agent
The production function of the agent is assumed to take the following form y = kafSi,
(70)
where a > 0, (3 > 0, 7 > 0 and a + (3 + 7 < 1. In equation (70), the output and inputs are all measured in terms of the representative agent's labor input. This is why a + /3 + 7 < 1. For simplicity, the agent's labor input is assumed to be constant. His utility function is logarithmic: u(c,f,s)
= l n c + tfiln/ +
tf2ms
(71)
where $1 and i?2 are constant and positive. With these choices of preferences and technology, it is simple to show that steady-state capital, output, and consumption are the functions of federal income tax, federal spending, local property tax, local consumption tax, local spending, and various technology and preference parameters:
a(l -
Tf)
p + 6 + Tk ._=_ , _ 2 _ _a_ a ( l - Tf)
p+(l-a)(5
+ Tk) p + 5 + Tk ^
a ( l + Tc)
3.2. The Local
K
a{l-T})>
^
^
J
Government
The local government maximizes the steady-state agent's welfare
maxTc>T/, i5) lnc + ^ i l n / + ^2his
(73)
subject to the individual's optimal choices of consumption and capital stock given in equation (72), and its own budget constraint: s- gs = rcc + rkk. Substituting equation (72) into equation (16) yields
(74)
169
TC p + (l-g)(6 l-<7
—
a ( l + rc)
+ Tk) p + 5 + rk ^ ^ ( 1 - T , ) '
«.
^
J
^k_(P + S + Tk _^_ _g_ ^ _ + l-/a(l-r/)) ' r c ( p + ( l -a)<5) + (r c + Q:)Tfc /9 + (5 + rfc _ L -1 .. lot — 1 f a ( l + TC) a(l - r/)
1 1- g
1 — at g l~-ot •
Therefore, we have K
a(l+r c )
'
4V
^(l-r/);
7
(75) Now, the local government's objective function upon substitution becomes: lnc + t?2lns = ln(p + (1 - a)(8 + Tk)) - ln(l + r c ) + +(h i?2)lns + constant 1—a = \n(p + (1 - a){6 + n)) ~ ln(l + r c ) +
-ln(/> + <5 + rfc) a —1
-ln(p + 5 + n) a —i
+ ( — y — + iM 1 - Q [ln(r c (p + (1 - a)*) + (r c + a ) ^ ) - ln(l + 1—a 1 —a — 7 7 1 -(h tii)-., ln(p + S + Tk) + constant 1—a 1—a — 7 (76)
Hence, the local government's optimization problem is equivalent to maximizing equation (76) by determining r c and Tfc.The first-order conditions for the optimal choices of taxes of the local government are: 1—a *?2(1 ~ a) + 7 rc + a p+(l -a)(5 + Tk) l - a - 7 Tc(p + (1 - a)6) + (r c + a)rk tf2(l-a) +7 1 1 1 ^Q | (1 —a)(l—a-j) p +5 +Tk a-lp + 5 + Tk
(77)
170
1 l+rc
|
fla(l-q)+7 p+{l-g)5+Tk 1 - a - 7 Tc{p + (1 - a) 5) + (TC + a)"rfe
—1=0 1 + TC (78)
To simplify the calculations, the rate of capital depreciation is set to zero: <5 = 0. Now we have the following results Proposition 3.1. If it is required that r^ > 0, the optimal property tax and consumption tax are rfc = 0
(79)
rc = Ml^)±2.
(80)
1—a—7 Therefore, the constrained optimal property tax is always zero as shown in proposition 2.1. With the specific example in this section, we can obtain explicit solutions to optimal local tax rates Proposition 3.2. If Tk can take any value, we have i?2
Tk = -Pz 1—a —7 __tf2(l-a)+7 1—a —7
7
Pjz rrz r, ( 1 — a ) ( l — a —7) p(02(i_Q)+7)
(81)
(1 — a)(l — a — 7) — i?2(l — 01) — 7
In this case, the optimal r^ for the local government is in fact negative, whereas r c is strictly positive. This result conforms to our intuition. The local government taxes consumption to subsidize capital investment. This tax subsidy scheme leads to more welfare for the agent in the long run. 3.3. The Federal
Government
Unlike the Cournot-Nash game between the federal and local governments, in the Stackelberg game the federal government takes into consideration the optimal choices of both the agent and the local government when it maximizes the agent's steady-state welfare maxlnc +tfi l n / +
tf2lns
(82)
by choosing 77, g, r c , Tk, f, and s. The federal budget constraint is still f + gs = Tfy.
(83)
171
Substituting equations (72) and (75) into equation ( 83), we have t= 1
+ s + Tk,sL1(Tc(p+ (1 -a)6) + (r c + a)r fc __i__ Tf{ )a l a(l-Tf) ~ a ( l + rc) > , 1 ^ 2 ,p + 5 + rk ^ i i_ s. (P
X(
) 1-o-T ( -
y
l-g}
g{ =
f T 1
a(l-Tf)} TC(P + (1 - a)5) + (TC + a)rk a ( l + rc) ^ W(P + 8 + Tk^,Tc{p+{l XfK K a{l-T{)>
X(
^
) l - a — r 1-°
{
) 1-c-T ( -
l ) l - a -
T
f
1-a—r
J
1 . ^ _ p + 6 + Tk K l-g> a{l-rs)} - a)5) + (TC + a)rk ^ ^ a(l + rc) >
^±_
TJ^_
;
K
l - Q
J
1-a
Tc(p+(l-a)8) + (Tc + a)TkT±^_ ; a ( l + Tc)
4 -
1 ^ ( P + i + T t , ^ ; ^a(l-T/); '' 5
Therefore, we have 1
if
a(l-Tf)A
l-gA
K
*
{
a{l-rf)>
l-g>
(84) where A =
( ^ ( 1 - f f t M . h ^ v
a(l+rc)
'
Substituting equations (72), (75), and (84) into the federal government's objective function yields lnc + # i l n / + i?2lns -i-ln(l i —a
Tf)
+ ( - ^ - + ti>2)[1— a
1
" Q ln(l " 9) + . 1—a — 7
+ [( — — + d2)+ T - — + ^ i ] l n / + constant 1—a 1—a — 7 1— a - i - l n ( l - rf) + ( - L _ + tf2)[ 1 ~ Q ln(l - 5 ) + \ —a 1— a 1—a — 7
] HI - rj)\ 1— a — 7
* ln(l - rf)\ 1— a —7
+ [(TX- + ^ 7 - ^ + 7 ^ - + *>iK 1-a 1-a — 7 1-a 1 — a — /? — 7 a ( l — 17) 1—g 1 7 +ln(l -Tf)ln(l -g)} 1 — a — [3 — 7 1 — a — fi — 7
+ constant. (85)
172
Given substitutions above, r c , T^, / , and s are all functions of ry and g. Now the federal government's optimization is equivalent to maximize the agent's welfare in equation (85) by choosing Tf and g. The first-order conditions for maximization are
Ji±*!2. + , ( T ^ + * ) r - £ _ 1—a—7
1—a
1—a—7
+ T ^_ + 1—a
*l]
(86)
= 0,
«f-«> + ? + [(-I1—a —7 7
f 1
+
tf2)-^_
+^ - +^]
1 —a 1 —a —7 l - a - 7
1 - a - £- 7
(87)
1- a -A1'"
1 - a - /? - 7 Tf ff/_+r^ (1 - 5 ) A T - ffA1-"
= 0. From these two first-order conditions, we have Proposition 3.3. The optimal federal income tax and federal transfer to the local government are
T
f
M
Ca | l+atf2— 7 l 1—a—0—7 1—a—7 1+^2 | C I _ 2^41-0-7 1 —a—7 ' 1 —a —/3—7 Ca | l+ai?2—7 c* 4 l —a—7 1—a—0—7 1—a—7 fl2(l-a)+7 1 Cy I — a^l-a-7 l-a-7 l-a-/3—c />
where
C={-^+d2)—-£
+ -J—+0!
(89)
1—a 1—a — 7 1— a ,r (p + (1 a)5) + (T + a)r A = c C fc _ i _ j ^ a ( l + TC) ° "' It is interesting that when the consumption tax is available to the local government, and when the federal government acts as the leader in the Stackelberg game with the local government, federal transfer to the local government can be negative. At the same time, it is unclear whether federal income tax must be positive. The reason is now obvious enough: with a less distortionary consumption tax at the local level, and given the Stackelberg
173
game between the federal and local governments, the federal government can impose a negative transfer to locality and at the same time ask the local government to levy a high rate of consumption tax. Hence, the local consumption tax can be used to finance both federal and local spending, and subsidize private investment. To see how the signs of federal income tax and federal transfer are determined, we make some numerical calculations based on propositions 3.1 and 5. In this case, r^ = 0 and r c = 2 1 _ ~ _ + 7 . We let the marginal utility of local public spending, i?2, take different values. Other parameters are fixed as follows: a = 0.3, (3 = 0.2, 7 = 0.1, i?i = 0.1, and p = 0.05. Prom Table 1, as i?2 rises from 0.05 to 0.30, (i.e., the marginal utility of local public spending rises), r c increases from 18.5% to 92.5%. Without the constraint that g > 0, g is negative all the time, and the "reverse" transfer rate from locality to the federal government rises from 18.8% to 191.7%. At the same time, the federal income tax, r / , decreases and eventually becomes negative. Thus the local consumption tax finances local public spending, federal public spending through a negative federal transfer, and federal subsidies to private production through a negative income tax.
Table 1. 02 Tc T
f
9
0.05 0.184615 0.232575 -0.0188034
Optimal Tax Rates and Federal Transfer
0.10 0.283333 0.187222 -0.345833
0.15 0.40 0.132 -0.675676
0.20 0.54 0.0641538 -1.03143
0.25 0.711111 -0.0203292 -1.43593
0.30 0.925 -0.1275 -1.91731
If we impose the condition that federal transfer to the local government must be positive in the federal government's optimization problem, it is easy to show the next proposition. Proposition 3.4. If g > 0, the optimal federal income tax is C(l-a-y) 1 — a—1
1—Q—0—7
5 = 0. In this case, since it is not feasible for the federal government to collect any revenues from the local government, federal income tax is strictly positive with the Inada conditions on the utility function. As an illustration,
174
we still choose rk = 0 and r c = 2^_^_ from local government's optimal choices of tax rates in proposition 3.1. We also let $ 2 vary and let other parameters be fixed at a = 0.3, (3 = 0.2, 7 = 0.1, i?i = 0.1, and p = 0.05. The optimal federal transfer is always zero, and the optimal federal income tax and optimal local consumption tax are calculated in Table 2.
Table 2. 1?2 Tc
7
0.05 0.184615 0.248529
0.10 0.283333 0.233333
Optimal Tax Rates
0.15 0.40 0.218766
0.20 0.54 0.204918
0.25 0.711111 0.191972
0.30 0.925 0.180282
In Table 2, as i?2 rises from 0.05 to 0.30, the marginal utility of local public services rises sharply, and so does the local consumption tax, which increases from 18.5% to 92.5%. At the same time, astf1 = 0.1, the marginal utility of federal public spending falls relative to local public spending. Therefore, it is less socially desirable to finance as much of federal public spending as before. Hence, federal income tax falls from 24.9% to 18%. 3.4. The Case of Zero Consumption
Tax
In our optimal-tax framework with a multiple levels of government the optimal property tax is always zero or negative given the availability of consumption tax for the local government. In reality, of course, local governments in most developed countries rely on property tax to finance their local public services. While we do not want to argue whether the reality deviates from the theoretical optimality, we can allow some role of a positive, optimal property tax if we set local consumption tax to zero. Then, letting r c = 0 in equation (77), we have
Ll2
Ja(i-q) + 7 l
p + (1 - a)(5 + rk) I - Q - 7 rfe fl2(l-a)+7 1 1 (1 - a ) ( l - a - 7 ) p + S + Tk a-lp = 0.
(90)
v
1 + S + Tk
Now from equation (90) we have the optimal local property tax rkFrom equations (86) and (87), we have the optimal federal income tax and federal transfer, Tf and g, respectively.
175
Proposition 3.5. The optimal property tax, optimal federal income tax, and optimal federal transfer are Tk
V [ 4 a ( 7 + (1 - a)fl 2 ](l - a ) ( l +192)/>2 + [<*2(1 + 0 2 ) + 7 + ^2 - (3^ 2 + 2)a 2a(l-a)(l+02) [a 2 (l + tf2) + 7 + ^2 - (3^2 + 2)a]/g + 2a(l-a)(l+tf2) (91)
1 —a—/3—7
Tf = l
1 —a—7
-^
(92)
l-a-/3-7
9 = 1-
, l+m92—1 1-0-/3-7 1—a—7 fl2(l-a)+7 C7 l-a-7 ~ 1_Q_/3_7
a p Tk 1 - «TJ, p
(93)
where C = ( - ^ - + t?2)T—^ + ^ - + ^11—a 1—a — 7 1— a To provide some intuition on how the optimal tax and transfer rates are determined, we compute the three optimal rates in Table 3 for different values of a, which measures the productivity of private capital stock. For all other parameters, their values are fixed at /? = 0.2, 7 = 0.1, i?i = 0.1, tf2 = 0.1, and p = 0.05.
Table 3.
Optimal Property Tax, Income Tax, and Transfer
a
0.2
Tk
0.0446251 0.24934 0.0541665
T f 9
0.25 0.035477 0.262329 0.180578
0.3 0.0232356 0.264068 0.230509
0.35 0.0187604 0.261329 0.249001
0.40 0.0157411 0.256353 0.250828
From Table 3 it is clear that, because local property tax is highly distortionary, the optimal property tax declines steadily from 4.46% to 1.57% as the productivity of private capital stock rises from .2 to .4. At the same time, the federal government raises its transfer rate to the local government sharply from 5.4% to 25%, without significantly altering the rate of federal income tax.
176
4.
Summary
In this paper, we have studied the optimal choices of federal income t a x , federal transfer, and local taxes in a dynamic model of capital accumulation and with explicit game structures among t h e representative agent, t h e local government, and t h e federal government. We summarize our main findings as follows. W h e n t h e federal and t h e local governments choose their optimal t a x rates and transfer scheme along t h e Cournot-Nash lines, optimal local property t a x is zero, optimal local consumption t a x is strictly positive, optimal federal income t a x is strictly positive, and optimal federal transfer is zero. W h e n t h e federal government is t h e leader and t h e local government is t h e follower in a Stackelberg game with b o t h consumption t a x and property t a x available to t h e local government, again, t h e optimal local property t a x is zero, and local consumption t a x is positive. B u t federal transfer t o t h e local government is negative, and federal income t a x can be positive or negative. In this case, t h e local consumption t a x can be used t o finance b o t h local and federal public spending. This "reverse" transfer from t h e local government t o t h e federal government is optimal from the perspective of welfare maximization because t h e local consumption t a x is less distortionary t h a n b o t h local property t a x and federal income t a x . W h e n t h e local consumption t a x is set t o zero, optimal local property t a x can be positive, as are t h e federal income t a x and federal transfer t o t h e local government.
Acknowledgements Project 70271063 was supported by t h e National N a t u r a l Science Foundation of China. References 1. Arrow, K. and Kurz, M. (1970). Public Investment, the Rate of Return and Optimal Fiscal Policy. Johns Hopkins University Press. 2. Atkinson, A. and Stiglitz, J. (1972). The structure of indirect taxation and economic efficiency. Journal of Public Economics 1, 97-119 3. Atkinson, A. and Stiglitz, J. (1976). The design of tax structure: Direct versus indirect taxation. Journal of Public Economics 6, 55-75. 4. Atkinson, A. and Stiglitz, J. (1980). Lectures on Public Economics. McGrawHill. 5. Barro, R. J. (1990). Government spending in a simple model of endogenous growth. Journal of Political Economy 98, S103-S125.
177 6. Bird, R. (1993). Threading the fiscal labyrinth: Some issues in fiscal decentralization. National Tax Journal XLVI, 207-227. 7. Brueckner, J. (1996). Fiscal federalism and capital accumulation. Mimeo, Department of Economics, University of Illinois at Urbana-Champaign. 8. Chamley, C. (1985). Efficient taxation in a stylized model of intertemporal general equilibrium. International Economic Review 26, 451-468. 9. Chamley, C. (1986). Optimal taxation of capital income in general equilibrium with infinite lives. Econometrica 54, 607-622. 10. Davoodi, H. and Zou, H. (1998). Fiscal decentralization and economic growth: A cross-country study. Journal of Urban Economics 4 3 , 244-257. 11. Devarajan, S., Swaroop, V. and Zou, H. (1996). The composition of government expenditure and economic growth. Journal of Monetary Economics 37, 313-344. 12. Devarajan, S., Xie, D. and Zou, H. (1998). Should public capital be subsidized or provided? Journal of Monetary Economics 41, 319-331. 13. Diamond, P. and Mirrlees, J. (1971). Optimal taxation and public production 1: Production efficiency and 2: Tax rules.American Economic Review 6 1 , 827 and 281-278. 14. Gong, L. and Zou, H. (2002). Optimal taxation and intergovernmental transfer in a dynamic model with multiple levels of government. Journal of Economic Dynamics and Control 26, 1975-2003. 15. Gordon, R. (1983). An optimal taxation approach to fiscal federalism. Quarterly Journal of Economics 98, 567-586. 16. Gramlich, E. (1993). A policy maker's guide to fiscal decentralization. National Tax Journal XLVI, 229-235. 17. Jones, L., Manuelli, R. and Rossi, P. (1993). Optimal taxation in models of endogenous growth. Journal of Political Economy 101, 485-517. 18. King, R. G. and Rebelo, S. (1990). Public policy and economic growth: Developing neoclassical implications. Journal of Political Economy 98, S126-S150. 19. Lucas, R. (1990). Supply-side economics: An analytical review. Oxford Economic Papers 42, 293-316. 20. Mirrlees, J. (1971). An exploration in the theory of optimum income taxation. Review of Economic Studies 38, 175-208. 21. Myles, G. (1995). Public Economics. Cambridge University Press. 22. Oates, W. (1972). Fiscal Federalism. New York: Harcourt Brace Jovanovic. 23. Oates, W. (1993). Fiscal decentralization and economic development. National Tax Journal XLVI, 237-243. 24. Persson, T. and Tabellini, G. (1996a). Federal fiscal constitutions, risk sharing and moral hazard. Econometrica 64, 623-646. 25. Persson, T. and Tabellini, G. (1996b). Federal fiscal constitutions: Risk sharing and redistribution. Journal of Political Economy 104, 979-1009. 26. Ramsey, F^ (1927). A contribution to the theory of taxation. Economic Journal 37, 47-61. 27. Rebelo, S. (1991). Long-run policy analysis and long-run growth. Journal of Political Economy 99, 500-521. 28. Rivlin, R. (1992). Reviving the American Dream: The Economy, the States,
178 and the Federal Government. Brookings Institution, Washington, D.C. 29. Samuelson, P. (1986). Theory of optimal taxation. Journal of Public Economics 30, 137-143. 30. Turnovsky, S. (1995). Methods of Macroeconomic Dynamics. MIT Press. 31. Turnovsky, S. and Fisher, W.H. (1995). The composition of government expenditure and its consequences for macroeconomic performance. Journal of Economic Dynamics and Control 19, 747-786. 32. Zhang, T. and Zou, H. (1998). Fiscal decentralization, public spending, and economic growth in China. Journal of Public Economics 67,221-240. 33. Zou, H. (1994). Dynamic effects of federal grants on local spending. Journal of Urban Economics 36, 98-115. 34. Zou, H. (1996). Taxes, federal grants, local public spending, and growth. Journal of Urban Economics 39, 303-317.
S H A R I N G CATASTROPHE RISK U N D E R MODEL UNCERTAINTY
X I A O D O N G ZHU Department
of Economics, Toronto,
University of Toronto, 150 St. George Ontario M5S 3G7 Canada
Street
According to the standard Arrow-Debreu theory of risk sharing, catastrohpe risk can be easily diversified in capital markets and therefore the risk premium associated with the risk should be small. Empirically, however, the premium on catastrophe insurance is extremely high relative to expected losses, especially for very low probability events. In this paper I argue that in reality there is significant uncertainty about estimated probabilities of catastrophic events and that the high premium observed in the market is due to people's concern about the uncertainty. I show this by using a simple model of asset pricing with Knightian uncertainty.
1. Introduction Traditionally, financial losses from natural disasters such as hurricanes, earthquakes, floods, etc. are covered by the insurance industry. Several disasters that happened in the last decade, however, have exposed the vulnerability of the insurance industry to catastrophic losses. Hurricane Andrew in 1992 and the Northridge Earthquake in 1994 resulted in almost $31 billion in insured industry losses and caused the failure of more than ten insurance companies. The estimated financial losses from the tragic event on September 11, 2001 are more than $40 billion. The US insurance industry now regularly discusses potential catastrophic losses of $50-$100 billion. As large as these losses are, they are trivial compared to the aggregate income and wealth in the US economy. For example, the US GDP in 2001 was more than $10 trillion, and the capitalization at the New York Stock Exchange alone is now about $16 trillion. A mere half percent drop in the stock market would imply $80 billion financial losses, much larger than any catastrophic losses in history. So, there should be plenty room for the insurance industry to diversify the catastrophe risk in capital markets. In the last decade, several new financial instruments were innovated aiming at 179
180
realizing the potential of capital markets for diversifying catastrophe risk. These instruments are usually securities with payoffs contingent on some potential catastrophic event. By issuing these catastrophe-linked securities, insurance companies can receive valuable capital from capital markets if a catastrophic event occurs. Investors of these securities will receive returns that are significantly higher than those of standard securities if the anticipated catastrophic event does not occur, and will lose their investments if the catastrophic event does occur. Since the probabilities of catastrophic events are generally very low, e.g., once in 50 years, these securities provide higher than average returns to investors. At the same time, since the catastrophe risk is generally uncorrelated with other financial market risks, these securities also provide investors a good way to diversify their portfolio. Given the size of capital markets, and given the increasing demand by the insurance industry for risk capital, it was expected that the market for catastrophe-linked securities would grow rapidly. After almost ten years of development, however, the market is not nearly as large or as active as expected. Catastrophe bonds have been the most popular catastrophelinked securities in the market. But after a period of rapid growth, the amount of risk capital raised through catastrophe bonds seemed to have hit a plateau at somewhere between $1 billion and $2 billion, way below the amount of risk capital needed by the insurance industry to deal with catastrophic losses. The catastrophe-linked futures and options listed on Chicago Board of Trade were never able to generate enough liquidity, and their trading was stopped since the summer of 1999 due to inadequate open interests. Why have catastrophe-linked securities not been successful in attracting investors? In this paper, we argue that the lack of liquidity in the market for catastrophe-linked securities is due to the difficulty in estimating the probabilities of catastrophic events. Using a simple static model of risk sharing under model uncertainty, I show that investors' aversion to model uncertainty will drive the price of catastrophe risk to an extremely high level and therefore reduce the demand by insurance industry to download such risk through capital markets. The paper is organized as follows. In Section 2, I discuss some patterns about market prices of catastrophe-linked securities and the empirical evidence that the behavior of these prices is not consistent with the standard Arrow-Debreu theory of risk-sharing. In Section 3 and 4,1 construct a simple static model of catastrophe risk sharing and analyze the Arrow-Debreu market equilibrium in this economy. Then, In Section 5 I incorporate model
181
uncertainty and investors' aversion to uncertainty into the model and show that there exists an uncertainty premium for insurance against catastrophic event, independent of how diversifiable the catastrophe risk is. I discuss why the model introduced here is useful for understanding catastrophe risk sharing In Section 6. Finally I conclude in section 7.
2. Empirical Observations about Pricing of Catastrophe Risk According to standard theory of risk sharing, protection by insurers and firms against the largest loss events such as hurricanes and earthquakes is most valuable. However, Froot (2001) examines the reinsurance market for catastrophic event risk and finds that most insurers purchase relatively little reinsurance against large loss event. He also finds that premiums are high relative to expected losses. To explain these findings, Froot argues that the lack of demand and high premiums of insurance against large loss events are due to the limited supply of capital to the reinsurance market. Sun (2002) examines the futures and options market for catastrophe risk. In particular, she analyzes the market prices of PCS options that were traded at Chicago Board of Trade. These options are securities that provide buyers a payoff when an index of aggregate insurance industry losses associated with a particular catastrophic event exceeds certain threshold. Because in principle any investor can particpant in trading these options, limited capital supply should not be a serious problem in this market. However, Sun still finds that the market prices of PCS options are significantly higher than expected payoffs. Furthermore, she finds that catastrophic losses are uncorrelated with all the financial market variables she considered, including stock returns and interest rates, which suggests that the risk premium associated with catastrophic losses should be small. Finally, she finds that the premium observed in the market is negatively related to the probability of the catastrophic event that is covered. The premium is largest for events that are least likely to occur. Sun also used the standard option pricing method to value the options and finds the theoretical prices are much lower than the market prices. Evidence presented by both Froot and Sun suggests that the traditional Arrow-Debreu theory of risk sharing is incapable of explaining the price behavior in the market for catastrophe risk.
182
3. A Simple Model of Risk Sharing Consider a static economy with two representative agents, A and B. There are two states, 1 and 2. For i = A, B and j = 1,2, let agent i's endowment income and consumption in state j be denoted by yl, and A, respectively. To focus on catastrophic risk, I assume that agent B faces no income risk, i.e., yf = XJ2 = yB• Agent A, however, faces the risk of having a lower endowment income in the first state, i.e., yf = pyQ for some constant p G [0,1). Clearly, the lower the value of p, the higher the risk faced by agent A. We call the first state the catastrophic state. Let ql be the subjective probability of the catastrophic state for agent i, i = A, B. The agents' attitudes toward risk are summarized by a standard Von Neumann-Morgenstern expected utility function: Ui=qi\og{c[)
+ (W)log(4),
i = A,B.
(1)
Because the agents' utility function, log(c), is concave in consumption, they would like to smooth their income across the two states by trading financial securities. By assumption catastrophe risk only affects the income of agent A. So, there are obvious incentives for agent A to diversify the risk. There is a financial market where two types of financial securities can be traded: b>i= (1,0) and b 2 = (0,1). That is, b i is the security that pays one unit of good in state 1 and nothing in state 2, and b 2 is the opposite. Since there are only two possible states in this economy, the market is complete when both of these securities are traded. Agents can sell these securities against their endowed incomes and buy financial securities to finance their consumptions. Let pi and p 2 be the prices of the two financial securities, b i and b2, respectively. Without loss of generality, we normalize so that Pi +P2 = 1- Let p = p\. For a given ql, agent i's optimal consumption allocation problem is Vi(qi) =
max {c\>0,cl2>0}
{9Mog(ci) + (1 - g ^ l o g ^ ) }
(2)
subject to the budget constraint: pc[ + (1 - p ) 4 = py{ + (1 - p)yl
(3)
This optimization problem can be easily solved and has the following solution: c\=
[pyi +
ii-P^tf/p, (4)
4=[py\
+
(l-p)yl}(l-qi)/(l-p).
183
Thus, the value function in Eq. (2) can be written as V\q*) = g« In tf/p] + (1 -
cf = yf+yB,
j = 1,2.
(6)
That is, in both states, aggregate consumption should equal to aggregate income. Substituting Eq. (4) into Eq. (6) and solving for p yields
+ qBVB
<7 V y
A
+ y
B _
{ l
_
q
A
) { 1
_
p ) y
(?)
A
4. Risk Premium of Catastrophic Events As a benchmark, I first consider the case when there is no uncertainty about the probability distribution of the states. I further assume that both agents share the same subjective probability of the catastrophic event: qA = qB = 7r. From Eq. (7), then, I have
«* = P
^yA + vB) yA+yB_{1_7T){1_p)yA-
(8) W
Note that p* is the price of the security that pays one unit of good in the catastrophic state. The expected payoff of the security is simply the probability of the catastrophic state, IT. SO, the ratio (p* — ir)/Tr measures the premium one has to pay to buy insurance against catastrophic event, relative to the expected payoff. From Eq. (8) one can see that the premium is always greater than zero, i.e., there is always a risk premium associated with the catastrophe risk. However, the size of the premium depends on how important the risk is to the aggregate income. Let z — (1 — p)yA/(yA + yB), which measures the importance of catastrophic loss to aggregate income, and let 7*(z,7r) = (p* — -K)/TT. From Eq. (8), then, we have
?(*>*)= i{1nv)\
(9)
1 — (1 — -K)Z So, the risk premium increases with z and converges to one as z goes to zero. In other words, the more important the catastrophic loss is to the aggregate economy, the larger a risk premium one has to pay to insure against the catastrophe risk; and if the catastrophic loss is small relative to aggregate
184
income, then the risk premium should be fairly small. As I argued in the introduction, catastrophic losses are usually small relative to the whole economy. Therefore, without uncertainty, one should not expect a large risk premium associated with catastrophe risk in the market, independent of the probability of catastrophic events. Next, I examine the impact of uncertainty about the probability of catastrophic event on the price of catastrophe risk. 5. Uncertainty Premium of Catastrophic Events Following Knight (1921) and Epstein and Wang (1994), in this section I distinguish between risk and uncertainty. The risk refers to the fact that agents do not know which state will occur. Uncertainty, on the other hand, refers to the case when agents are not sure about the probability distribution of the states. What the agents do know is that the probability of catastrophic state is in a particular set of prior probabilities. Let Q be the prior set for both agent A and B. I assume that the prior set takes the following form: Q={q:\q-n\
(10)
Here, IT is the so-called reference probability, which can be interpreted as the agents' best estimate of the probability of the catastrophic event. k(ir) measures the degree of uncertainty the agent has about the distribution of states, which is allowed to be related to the reference probability -K. The idea is that the information agents use in obtaining the estimated reference probability can also help them to determine the degree of confidence they have about the estimate. To ensure that these prior sets contain only probabilities between zero and one, I assume that fc(7r) < min{7r, 1 — -it}. As is shown by Gilboa and Schmeidler (1989) and Epstein and Wang (1994), in the presence of uncertainty, the agents' optimal decision problem becomes min Vi(qi). From Eq. (5), the value function can be written as follows:
VKJ) = Vtf) + In [py\ + (1 - p)y{] ,
(11)
where VW) = ql In [g'/p] + (1 - q*) In [(1 -
(12)
185
Since the last term in Eq. (5) is not a function of q\ agent i's problem becomes min F(
(13)
It is easy to show that V(q) is a convex function on the interval (0,1) with an unconstrained minimum at q = p. Because Q is a compact subset of (0,1) and V(.) is a continuous function on (0,1), the constrained minimization problem in Eq. (13) is also well defined and always has a unique solution, q(p), for any given p £ (0,1) and for both i = A and i = B. In other words, both agents will choose the same probability in evaluating their expected utilities, independent of their income. Substituting qA = qB = q(p) into Eq. (7) and simplifying yields the following equation:
P=.
fo*
, ,
(14)
1 - [1 - q\p)\ z where z = (1 — p)yA/(V2 + VB) a s defined earlier. An equilibrium exists if and only if there exists a p e (0,1) that solves the Eq. (14). From Eq. (14) it is clear that, if exists, the equilibrium price has to be higher than the agents' subjective probability of the catastrophic event, q(p). Because q(p) is the solution to the constrained minimization problem in Eq. (13), and because p is the solution to the unconstrained minimization of V(q) on (0,1) and V(q) is convex, p > q{p) holds if and only if 7r + k(-n) < p, in which case q{p) = n + k(ir). So, in the presence of uncertainty about the probability of the catastrophic event, both agents in the economy will choose the most pessimistic probability in evaluating their expected utilities. Substituting it into Eq. (14), then, I have __ P=
-ir + k(ir) l-[l-7r-fc(7r)]z'
As one might have expected, the presence of uncertainty, measured by k(n), increases the equilibrium price of the catastrophe risk relative to the case when there is no uncertainty. More importantly, the premium one has to pay over the expected payoff for insurance against the catastrophic event is bounded away from zero even when the importance of the catastrophic loss to aggregate income shrinks to zero. Specifically, let j(z,n) = (p — TT)/TT, I have 7 ( 2 , 7f)
[1 + 7 * ( Z , 7T + fc(7T))] - 1 « MZ[) + 7*(Z,7T + fc(TT)).
186
So, the premium agents pay for catastrophe insurance has two components, one is the standard risk-premium, 7*(z,7r + A;(7r)), and the other is the premium associated with uncertainty about probabilities, k{-n)/ir, which I call the uncertainty premium. As the impact of catastrophic loss on aggregate income diminishes, the risk premium shrinks to zero, but the uncertainty premium is always there: lim 7(*,7r) = ^ i z
»0
(15)
IT
6. W h y Uncertainty Premium is High for Catastrophic Events? Even though I have referred the state when agent A has a lower endowment income as the catastrophic state, the state could be any bad state, not necessarily catastrophic. What I have shown so far is that when there is uncertainty about the probability of the bad state, there is always an uncertainty premium in the market price of insurance against the bad state, no matter how small the impact of the loss of income is for the aggregate economy. In this section, I provide some intuitive argument about why there are reasons to believe that uncertainty premium is most pronounced in the case of catastrophic events. From Eq. (14) one can see that the uncertainty premium in this model is positively related to the degree of uncertainty, as measured by fc(7r). One salient feature of catastrophic events is that they are rare events. For example, the type of the catastrophic events that are covered by PCS options have only occurred once or twice in every 50 years. Therefore, market participants have very little historical information that can be used to predict precisely the probability that such an event occurs. This suggests that the degree of uncertainty is high for catastrophic event. Even if people do have some limited data so that they can predict the probability with some confidence, the relative degree of uncertainty, as measured byfc(7r)/7ris still likely to be high when 7r is small, which is usually true for catastrophic events. Furthermore, I use an example here to argue that this measure is likely to increase as the probability of a catastrophic event declines. Consider the following example: Suppose that income loss associated with a particular type of event is Xt, and suppose that the event actually happens relative frequently, so that we observe Xt, t = 1,2, ...N. Furthermore, assume that event losses are identically and independently distributed, with mean \i. An event is called a catastrophic event if loss exceeds some large number L > fi. To predict
187
the probability of having a catastrophic event, P(Xt > L), one can use the well known empirical distribution estimator: number of Xt's > L Because asymptotically the estimator FJV converges to a normal distribution with mean P(Xt > L) and standard deviation y/P(Xt > L) [1 — P(Xt > L)], one may estimate the standard deviation of FN by y/F^il — FN). If we think about the degree of uncertainty agents have about the probability estimator ir is captured by some confidence interval, then we should have k(n) = a\/7r(l — 7r). for some constant a. As L increases, both the estimated probability of the event, 7r, and the degree of uncertainty about the estimator, k(n), converge to zero. However, the uncertainty premium, which is the ratio of the two, is given by fc(7r)
/ l — 7r =
7T
a
•
\
V
7T
Thus, in this example, the uncertainty premium increases with L or decreases with the estimated probability, 7r. As 7r goes to zero, the uncertainty premium goes to infinity!
7. Conclusion In this paper, I use a simple model of risk sharing to show that when agents are uncertain about probability distribution of states, market price of risk may include an uncertainty premium that is independent of how diversifiable the risk is. In addition, I argue that this premium is likely to be especially high for small probability events. While the simple model's prediction is consistent with empirical observations about market price of catastrophe risk, it would be interesting to build a more realistic model along the same line that can be estimated using the data on PCS options. Another interesting question is how to formally establish the link between agents' degree of uncertainty about a distribution to the length of the confidence interval for the distribution estimator. I leave these interesting questions for future research.
188 References 1. Epstein, L.G. and Wang, T. (1994). Intertemporal asset pricing under Knightian uncertainty. Econometrica 62, 283-322. 2. Froot, K.A. (2001). The market for catastrophe risk: a clinical examination. J. of Financial Econ. 60, 529-571. 3. Gilboa, I. and Schmeidler, D. (1989). Maxmin expected utility with nonunique prior. J. Math. Econ. 18, 141-153. 4. Knight, F.H. (1921). Risk, Uncertainty and Profit. Houghton and Mifflin, Boston. 5. Sun, Y. (2002). An Empirical Analysis of Catastrophe-linked Securities. Ph.D. Dissertation, University of Toronto, Toronto.
R A N K E D SET SAMPLING: A M E T H O D O L O G Y FOR OBSERVATIONAL E C O N O M Y
ZEHUA CHEN Department
of Statistics & Applied Probability, National University Singapore, 3 Science Drive 2, Singapore 117543
of
The notion of ranked set sampling (RSS) provides a methodology for achieving observational economy. In this article, we give a brief review on the developments of RSS and discuss some of its theoretical aspects. Then we further discuss some new directions of research on RSS. The discussion is focused on three topics: (i) RSS with concomitant variables, (ii) design with observational data using RSS and (iii) application of RSS to data mining.
1. Introduction The needs for observational economy arise from the fact that, in many practical problems, the procurement of the values of the variables of interest is costly and/or time-consuming. For example, in the assessment of the status of hazard waste sites, it usually involves expensive radio-chemical techniques to get the value of the variable of interest. Take for another example, in fishery aging studies, the determination of the age of a fish is a time-consuming and expensive process. In the early 1950's, in seeking to estimate effectively the yield of pasture in Australia, Mclntyre (1952) proposed a sampling method which later became known as ranked set sampling (RSS). The idea of RSS provides an effective way to achieve observational economy under certain particular conditions. The idea of RSS was buried in the literature for a long time. However, because of its cost-effective nature, its value was gradually rediscovered in the last 20 years or so. There has been a surge of research interest of statisticians on RSS in recent years. The earliest theoretical works on RSS appeared in Takahasi and Wakimoto (1968) and Dell and Clutter (1972). The estimation of cumulative distribution function with various settings of RSS was considered by Stokes and Sager (1988), Kvam and Samaniego (1993) and Chen (2001b). The RSS version of distribution-free test procedures such as sign test, signed rank test and Mann-Whitney-Wilconxon 189
190
test were investigated by Bohn and Wolfe (1992, 1994), and Hettmansperger (1995). The estimation of density function and population quantiles using RSS data were studied by Chen (1999, 2000a). The RSS counterpart of ratio estimate was considered by Samawi and Muttlak (1996). The list at istic and M-statistic based on RSS were considered, respectively, by Presnell and Bohn (1999) and Zhao and Chen (2002). The RSS regression estimate was tackled by Patil et al. (1993), Yu and Lam (1997) and Chen (2001c). The parametric RSS assuming the knowledge of the family of the underlying distribution was studied by many authors, e.g., Fei, Sinha and Wu (1994), Stokes (1995), Abu-Dayyeh and Muttlak (1996), Bhoj (1997), Chen (2000b). The optimal design in the context of unbalanced RSS was considered by Stokes (1995), Kaur, Patil and Taillie (1997), Ozturk and Wolfe (1999), Chen and Bai (2000) and Chen (2001a). A general theory on parametric and non-parametric RSS was developed by Bai and Chen (2002). The use of concomitant variables in RSS were developed by Stokes (1977), Chen (2002) and Chen and Shen (2002). An extensive bibliography of RSS before 1999 was given by Patil et al. (1999). A comprehensive coverage of RSS is given in the monograph by Chen et al. (2002). In this article, we consider some theoretical aspects of RSS and discuss some new directions of research on RSS. The outline of this article is as follows. In §2, we describe the methodology of RSS and discuss the basic properties of RSS. In §3, we consider RSS with concomitant variables. In §4, we explore designs with observational data using RSS. In §5, we discuss the application of RSS in data mining.
2.
R a n k e d Set Sampling a n d I t s Basic P r o p e r t i e s
The ranked set sampling in its general form can be described as follows. Suppose a total of n sampling units are to be measured on the variable of interest. First, n simple random samples (sets), each of size k, are drawn at random from the population under study. Then the units of each set are ranked by a certain mechanism other than using the actual measurements of the variable of interest. Finally, one and only one unit in each set with a pre-specified rank is measured on the variable of interest. It ends up with a sample of n measurements on the variable of interest which is to be referred to as a ranked set sample. Here, it is implicitly assumed that the measurement of the variable of interest is costly and/or time-consuming but the ranking of a set of sampling units can be easily done with a negligible cost. It should be noted that, in the procedure, nk units are sampled but
191
only n of them are actually measured on the variable of interest. Let nr be the number of measurements on units with rank r, r = 1 , . . . , k, such that X] r =i nr = n- Denote by Yjr]j the measurement on the ith measured unit with rank r. Then the ranked set sample can be represented as Y{1]1 Y[i]2 • •
-•[ljm
^[2]l Y[2] 2 ' '
Y[2]n
Y[k]i Y[k]2 ••
Y
[k}nk
(1) •
If Tii = ri2 = • • • = rife, the RSS is said to be balanced, otherwise, it is said to be unbalanced. The ranks which the units in a set receive may not necessarily tally with the numerical orders of their latent Y values. If the ranks do tally with the numerical orders, the ranking is said to be perfect, otherwise, it is said to be imperfect. The ranking mechanism could be: (i) ranking according to the latent values of the variable of interest by judgment, (ii) ranking according to the measured values of certain easily obtainable concomitant variables, and (iii) ranking by using some other auxiliary information. Let F denote the distribution function of Y. Let F[rj denote the distribution function of the rth ranked statistic under the ranking mechanism. A ranking mechanism is said to be consistent, if F
1
k
( 2 > ) = f c E F M ( y ) ' ^ ally.
(2)
r=l
Equality (2) plays a crucial role in RSS and is referred to as the fundamental equality. We also refer to an RSS with consistent ranking mechanism as a consistent RSS. The essence of RSS is conceptually similar to the classical stratified sampling. The RSS can be considered as post-stratifying the sampling units according to their ranks in a sample. Although the mechanism is different from the stratified sampling, the effect is the same in that the population is divided into sub-populations such that the units within each sub-population are as homogeneous as possible. From an information point of view, an ranked set sample contains not only the information of the measurements but also the information of the ranks. Therefore, if compared with a simple random sample of the same number of measurements, we can expect that a statistical procedure based on a ranked set sample will be more efficient than based on a simple random sample. We present some rigorous results in this aspect in the remainder of this section. We consider balanced RSS for
192
three statistical procedures: the nonparametric estimation of means, the nonparametric estimation of smooth functions of means and the maximum likehood estimation of parameters for parametric families. Let h(y) be any function of y. Denote by fih the mean of h(Y), i.e., Hh = Eh(Y). Examples of h(y) include: (a) h(y) = yl,l = 1,2, ••• , corresponding to the estimation of population moments, (b) h(y) = I{y < c} where /{•} is the usual indicator function, corresponding to the estimation of the distribution function, (c) h(y) = JK(^YL), where K is a given function and A is a given constant, corresponding to the estimation of the density function. Consider the estimate of fih given by k m
1
r—1
i=l
The estimate is referred to as the RSS estimate of /%. A smooth function of means is a function of the form: g(fihi, • • •, Hhp), where g is smooth and \ih = Ehj(Y) and hj are any functions of Y. Typical examples of smooth-function-of-means are (i) the variance, (ii) the coefficient of variation, and (iii) the correlation coefficient, etc. An RSS estimate of g(/J.hi, • • •, A*hp) is given by SRSS
=
g\^i•hl•Rss^
•••>
HhpRssJ-
We have the following general results (see Bai and Chen (2002)): Theorem 2.1. If an RSS is balanced and consistent, then (i) The RSS estimates fih-Rss and gRSS are consistent and asymptotically normally distributed. (ii) The RSS estimate fih-Rss has a smaller variance than its simple random sampling (SRS) counterpart. (iii) The RSS estimate gRSS has a smaller asymptotic variance than its SRS counterpart. In a parametric setting, let f(y; 6) and F(y; 6) denote, respectively, the density function and cumulative distribution function of Y, where 0 = (6i, • • • , 6q)T is a vector of unknown parameters. Let 1(6) denote the Fisher information contained in a single observation and /jj(0) denote the Fisher information contained in a ranked set sample of size n = mk. In the case of imperfect ranking, let psr denote the probability with which a unit whose
193
numerical order of Y is s is given a rank r. Let
*<»> = S^( f c - S )!( S -l)! F 5 " 1 ( ^ ) [ 1 -
Fiy)rS
-
We have, see Chen (2000b), that Theorem 2.2. RSS.
Assume certain regularity conditions. Consider a balanced
(i) If ranking is perfect, then rn(e) =
nI(8)+n(k-l)A(0),
where A(0) = E •
1
(dF{Y-6)\
F{Y-e)[l-F{Y-e)\ V
(dF(Y-9)\T\
06 ) \ 89 J J '
(ii) If ranking is imperfect, then rn(6) = nI(0) + A(O), where dgr(Y) r=l
dgr(Y)
9r(Y)
The expectations above are taken with respect to y ( ~ F). Theorems 2.1 and 2.2 imply that the balanced RSS is more efficient than SRS in terms of the mean square error (MSE) or asymptotic MSE of an estimate. However, the efficiency gain by using RSS differs in different statistical procedures. We can measure the efficiency gain by the relative efficiency (RE) or asymptotic relative efficiency (ARE) defined as the ratio of the MSE (or asymptotic MSE) of the RSS estimate and that of its SRS counterpart. Typically, the RE or ARE is larger in the parametric setting than in the nonparametric setting. In the estimation of means, the RE is roughly (fc 4- l ) / 2 , see Dell and Clutter (1972). But the RE or ARE in other cases is smaller. In certain cases, it can be near 1, indicating that it is not worth adopting the balanced RSS in those cases. This has motivated considerations of optimal design of unbalanced RSS. We do not elaborate in this direction any further. The reader is referred to Chen and Bai (2000) and Chen (2001a).
194
3. Ranked Set Sampling with Concomitant Variables We mentioned briefly in the previous section several ranking mechanisms in RSS. Originally, the ranking is made by judgment with respect to the latent values of the variable of interest, like what Mclntyre (1952) had done in estimating the yield of pastures. But the cases where sampling units can be ranked by judgment relating to the latent values of the variable of interest are rare. This partially explains why the idea of RSS had remained dormant for many years since it was first proposed. Ranking mechanisms determined by concomitant variables, however, have a quite different prospect. There are abundant situations where, apart from the expensive variable of interest, the sampling units are also associated with other easily obtainable variables which are referred to as concomitant variables. For instance, in the fishery aging study mentioned before, a fish's length and weight, which are closely correlated to the age, are easily measurable. In medical studies, many variables are expensive to measure, however, certain surrogates of the variables can be obtained easily and cheaply. Contrary to the abundance of the availability of concomitant variables, the literature on RSS with concomitant variables is not abundant. Many aspects of RSS with concomitant variables are yet to be investigated. In this section, we briefly summarize the available results in the literature and discuss some directions of research on RSS with concomitant variables. There are two roles for concomitant variables to play in RSS: as a means to determine ranking mechanisms and as a means to improve inferences. We discuss the use of concomitant variables for the determination of ranking mechanisms first. There are typically two types of problems: (i) only inferences on the features of the variable of interest are to be made and (ii) the relationship between the variable of interest and the concomitant variables is to be inferred. Different ranking mechanisms are needed for different purposes. When there is only one concomitant variable involved, the solution becomes straightforward: for either purpose, sampling units are ranked according to the numerical order of the concomitant variable. The RSS with a single concomitant variable was first tackled by Stokes (1977). However, when multiple concomitant variables are involved, the question of how to determine a ranking mechanism becomes nontrivial at all. Any function of the concomitant variables, including any single concomitant variable, can be used as a criterion for ranking. The resultant ranking mechanism will be consistent. But, what is the best function to use? Chen (2002) argued that, for the inference on the features of variable of interest,
195
the best ranking function is the conditional distribution of the variable of interest given the concomitant variables. Since the conditional expectation is an unknown function, Chen (2002) developed an adaptive RSS procedure with continually updated estimate of the conditional expectation as the ranking criterion. However, this ranking mechanism is not appropriate if our purpose is to infer on the relationship between the variable of interest and the concomitant variables. Chen and Shen (2002) developed a scheme called multi-layer RSS which can serve for both purposes. The detail of the multi-layer RSS will be discussed in the next section. We now turn to the discussion on how to incorporate the concomitant variables into the inferences on the features of the variable of interest. This is one of the areas on RSS where more research needs to be done. A regression estimate of the mean of the variable of interest was considered by Yu and Lam (1997). The estimate is a straightforward extension of the regression estimate of the mean in the context of SRS to the context of RSS. A more general treatment was given in Lam et al. (2001). However the treatment can not be extended to unbalanced RSS which, when properly designed, is much more efficient than balanced RSS. In the following, we propose some methods for incorporating the concomitant variables into the inferences on the features of the variable of interest. We only discuss the case where only a single concomitant variable is involved. The discussion can be extended to the case of multiple concomitant variables easily. When the ranking is done by using the measurements on the concomitant variable X, we have the following data: (y[1]i)x(1)i)xi1---A:{fci = i , . . . , n i W[2]i, •X'(2)i) -X21 ' ' ' X2k
(Y[k)i,X{k)i)
i = 1, . . • , 7l 2
Xlkl • • • Xlkk i =
l,...,nk
where (X^ • • • X^k) is the simple random sample used to obtain X^y, and Yjr]i is the measurement on Y corresponding to X^y. Let fy\x{y,x) denote the conditional density function of Y given X. Let G(x) and g(x) be the cumulative distribution function and the density function of X respectively. Denote by f(y,x) the joint density function of Y and X and by f[r]{y,x) the joint density of Yjr] and X(ry We have, see Chen and Shen (2002),that f[r)(y,x) = }Y\x{y,x)CkrGr-\x)[l = f(y,x)C'Gr-1(x)[l-G(x)}k-r.
-
G(x)}k-rg(x) (3)
196
where Ck = fc!/(r - l)!(jfc - r)!. Denote u>r(x) = C*Gr-1(x)[l
-
G(x)]k~r.
Then, from (3), we have f{y,x)
=
f[r}{y,x)/ujr{x).
Thus, for any vector q = (q\,... ,qk)T satisfying qr > 0 and Y^r=i 9r = 1> we have / ( y , * ) = X]9r/[r](y,a:)/wr(a0r=l
A particular choice of q is the one with qi — • • • = qk = 1/k. Then we can express the cumulative distribution function of Y as k
ry
F
(y) = ^2lr r_j
roo
/
/[ r ](z,x)/o; r (x)(ia;^.
J — oo J — oo
Making use of the expression of F(y) above, we propose the following estimate of F(y):
where if is a known kernel function, say, K can be taken as the density function of the standard normal distribution, hr's are bandwidthes which can be chosen by standard methods in kernel density estimation, and tjr(x) - C*Gr-l{x)[l
-
G{x)\k~r.
Here G is the estimate of G using all the observations X\s. For features of Y which can be expressed as functionals of F, say 9(F), the plug-in method can be used to obtain the estimate 0(F). For example, by the plug-in method, the pth quantile of Y is estimated by
iP =
m£{y:F{y)>p}.
The method proposed above can tackle both balanced and unbalanced RSS data. Furthermore, it opens the way for the consideration of optimal design problems in the inference on the features of Y, like that considered by Kaur et al. (1997), which differs from the design problems we are going to discuss in the next section. Further research needs to be done in this direction.
197
4.
Design with Observational Data
Assume that Y = p0 + ftXi + • • • + PpXp + e, where Y is the response variable which is expensive to measure, X\,..., Xp are concomitant variables or functions of concomitant variables which are easily obtainable, and e is a random error independent of the concomitant variables. Consider an RSS with ranking mechanism determined by the measurements of the concomitant variables. Let R = 1,...,K, denote the ranks generated by the ranking mechanism. Here the ranks do not necessarily have a correspondance with the numerical orders of either Y or X's. Suppose a total of n sampling units are to be measured on both Y and the X ' s where UR of them have rank R. The data will follow the same regression model, i.e., Y[R]t = A) + PiXi[R]i H R =
\,...,K,
i =
l,...,nR,
h PpXp[R}i + em,
(4)
where em's are i.i.d. with the same distribution as e. The design problem we are going to consider in this section is how to determine the n^'s subject to J2R=I TIR = n SO that certain optimality can be achieved for the estimation of the regression coefficients. Let (5 = (PQ, Pi,..., PP)T. Denote by y and X(q) the vector of Y values and the design matrix in (4) where q = (ni,... ,nj<)T/n. The regression coefficients can be estimated by the ordinary least squares method. The estimate is given by P=
{X{q)TX{q)}-lX{q)Ty.
The variance-covariance matrix of (3 can be derived as Var03) =
o*E[X(q)TX(q)]-\
where the expectation is taken with respect to the distribution of the X's. We can consider several types of optimality such as ^4-optimality and Doptimality which entail the minimization of the trace and the determinant of Var(/3) respectively. However the problem is intractable for fixed sample size n. Instead, we consider the optimality criteria based on the asymptotic version of the variance-covariance matrix. Suppose q —> q, as n —• oo, where q = (qi,... ,qx). Note that the components of q need not necessarily
198
all bigger than zero. Let m,j[R] = E[Xj[R]], m^R] = E[Xj[R]Xi[R]] and m
j(v)
=
E R = I QRI^JIR],
rrijiiq) =
(
1
Y,R=I
m1(q)
9fl"ty[iq- Then we have • • • mp(q)
m
m
p(q)
m
ip(q)
"'"
\
m
\{q) ™n(g) • • •
ip(q)
= A(q), say.
m
pp(q) I
The optimization based on asymptotic ^4-optimality and D-optimality is then equivalent to the minimization of the trace and the determinant of A~l{q) respectively. To carry out the optimization, the moments of the ranked statistics must be supplied. In the case that the joint distribution of the concomitant variables is known, these moments can be computed numerically. In general, the moments need to be estimated from the data. The estimation of the moments does not pose any difficulty. Since the values of the concomitant variables are easily obtainable and the number of measurements on the concomitant variables is usually large, the moments can be estimated very accurately by using a bootstrap procedure. Let X — {Xi^X^, • • • ,XN}, where X\ = (Xn,... ,Xip), be the set of all measurements on the concomitant variables. The bootstrap procedure is as follows. First, draw B bootstrap samples of size K from X with replacement: ( X * V - - , X £ ) , 6 = 1,... ,B. Then apply the ranking mechanism to get the ranks of these samples to yield ( - ^ [ i ] ! " ' ' .-X"[«•]). b - !,-••
,B.
Finally, the estimates of the moments are approximated by
b=l
b=l
R=l,--,K;j,l = l,...,p. l an Since, as B —> oo, -j^ J2R=I '^ i(R) ~~* ^j' where rhj — jfY^i=i^-ij' d the similarly, -zY,R=i™ji(R) ~> ™jl = l E i = i ^ i j ^ i bootstrap size B is determined such that max{|-^ X^H=I ^jiR) ~ ^j'li 1^ Efi=i ™ji(R) ~ "ij/|} < c f° r some specified precision c. The special case of polynomial regression with a single concomitant variable was considered by Chen and Wang (2002). They applied the methodology to a fish-aging problem and a lung cancer study problem and
199
found that the design methodology can greatly improve the efficiency for the estimation of the regression coefficients. For the fish-aging study, it results in an asymptotic relative efficiency of 1.44 in terms of integrated mean square error of the estimated regression function. For the lung cancer study, the interest is to test the association between smoking status and three biomarkers. The aim is to minimize the variances of the estimated regression coefficients so that the power of hypothesis testing can be maximized. The asymptotic relative efficiencies in terms of the sum of the variances of the estimated regression coefficients for the three biomarkers are 1.98, 2.44 and 2.53 respectively. In the remainder of this section, we consider another special case: multiple regression with multi-layer RSS. The multi-layer RSS was mentioned in the last section. Here we provide more details on the multi-layer RSS. For the ease of notation, we discuss the case of two concomitant variables without loss of generality. Denote by X^, X^ the concomitant variables. Let k,l be positive integers. A two-layer RSS procedure goes as follows. In the first layer, I independent sets, each of size k, are drawn from the population. The units in each of these sets are ranked according to X^. Then, for each ranked sets, the units with X ^ - r a n k 1 are selected. Let the values of (Y, X^,X^) of these selected units be denoted by ( y [i]i.- x '[i]i.- x '[i]i) • • • (Y[i]l>X[i]i>X[i)i)
(5)
where the values of (XAJ, ( X L ' ) a r e measured and Y^jj are latent. In the second layer, the units represented in (5) are ranked according to X^. Then, the unit with X' 2 '-rank 1 are selected and its value on Y is measured. Thus the triplet (Y[i][i]>x\i]\i]>x\u\u) ls obtained. Repeat this procedure Tin times, we then get t i n copies of the triplet: (Y[W]i>X[l][l]i<X[l][l]i)
: i =
1
' •" •»nH-
The procedure can be carried out for any pair (r, s) for any number of times. Finally, we arrive at a two-layer ranked set sample: {(Yj r][s]i , Xff[s]i, Xff[s]i)
•• r = 1 , . . . , k; s = 1 , . . . , I; i = 1 , . . . , nrs}.
The procedure can be extended to general multi-layer RSS straightforwardly only with increasing complexity of notations. It was shown that the ranking mechanism of the multi-layer RSS is consistent. Note that the ranks in the multi-layer RSS are actually multiple ranks. In principle, the general methodology can be applied to optimize the propor-
200
tions of the multiple-ranked statistics. However, it soon becomes intractable to carry out the optimization as the number of concomitant variables increases. For the RSS to be efficient, the set sizes, say I and k, should be taken large. But when the set sizes are large the number of multiple ranks is huge. For example, in the case of three concomitant variables, if the set size at each layer is taken as 5 then the number of multiple ranks is 125. To get around this problem, we might make a connection between the marginal ranks of a concomitant variable and the levels of a factor in the model of experimental designs. We need not to consider all but only a few levels, say, low, high and medium. The ranking mechanism is then used as a means to determine the levels of the concomitant variables. The set sizes have the effect to distinguish the levels and hence to increase efficiency but no longer affect the number of ranks (combination of levels) to be considered. The rich results in the experimental design might be borrowed in dealing with the RSS designs. The hybrid of multi-layer RSS and traditional experimental design can pave a way for the designs with observational data.
5.
Ranked Set Sampling and Data Mining
The notion of RSS is to procure as small number of measurements as possible while a certain amount of information can be obtained. In other words, it is to sample the units which have more information contents. The notion can be well applied to data reduction problems in data mining. Contrary to the traditional problem facing statistics that sample sizes are small, the data size in data mining is tremendously huge. It is common in data mining to deal with data sets in gigabytes or even terabytes. It is simply impossible to store a whole data set of such size in the central memory of a computer. Data reduction becomes a necessary step in dealing with such data sets. A data reduction procedure is essentially a procedure to discard data with low information contents and retain data with high information contents. The technique of RSS can be applied for this purpose. In this section, we briefly discuss some aspects of RSS in the application to data reduction. Suppose that the original data can be arranged in a random sequence as X i , X2, • • •, X j , . . . , where the Xj's are i.i.d.. Let the sequence be divided into segments of length k. Then, for each segment, let the data be ranked according to some ranking mechanism. For example, in the case where the Xj's are scalars, the data in each segment are ordered according to the numerical values. In each ranked segment, one and only one datum
201
is to be retained. However, which one is to be retained depends on what information is to be extracted from the data. We discuss a special case in the following. Consider the scalar case. Suppose information is to be extracted to make inference on a few quantiles, say, £ Pl , £ P 2 , . . . , £ Pm . The following result can be used to determine which datum is to be retained in the segments, see Chen (2001a). Let q = (q\,...,%) be the vector of proportions of the order statistics to be retained. Then the optimal q can be obtained by minimizing either the trace or the determinant of V(q) given by B-1(q)A(q)B-\q),
V(q) = where ^ £ r = l 9 r d r ( P l ) ••• B ( q ) =
0 k
\
0
•••
•••
J2r=l1rdr(Pm)
Er=l9rC r (Pl)[l -C r (pi)] ••• Er=l9rC r (pi)[l -C r (p„ A{q) E r = l QrCr{Pl)[l ~ Cr(pm)} • • • £ r = 1 qrCr(pm)[l - Cr(pm)} Here Crip) = Bir,k-r
+ l,p),
^^ir-mk-ry^1-^ where £?(r, k — r + \,p) is the cumulative distribution function of the beta distribution with shape parameters r and k — r + 1. Denote the optimizer by q* = ((/J,..., q£). The data reduction procedure then goes as a random selection procedure. For each ranked segment, the r t h order statistic is selected to be retained with probability q*. Let k
Fq.{x) = Y,£F(r){x). r=l It is easy to see that fc Q' (*) = E 1*Bir, k-r r=l
F
+ 1, F(x)).
Let x = £Pj. Then k q*i£pj) = ^tiBir^-r
F
+ l^j) = sjt say.
202
This implies that the p^th quantile of F is the s^th quantile of Fq*. Therefore, the quantiles, £ P l ,£ P 2 , • • • ,Cpm> c a n be estimated by the sith, . . . , s m t h quantiles of the retained data. If the retained data is still too huge to handle, a re-ranked-set procedure which repeats the above process can be carried out. The idea of re-ranked-set procedure is closely related to the idea of re-median, see Chen and Chen (2002). A one-pass algorithm can be developed to implement the re-ranked-set procedure. We conclude this section by mentioning that, when the Xi's are vectors, if the information is to be extracted to make inference on the contours of the distribution of X, then an adaptive ranking mechanism based on the quadratic form (.X" — fj,)TT,~1(X — /x), where /* and £ are the mean vector and variance-covariance matrix of X respectively, can be adopted, and a procedure similar to the scalar case discussed above can be developed.
References 1. Abu-Dayyeh, F.L. and Muttlak, H.A. (1996). Using ranked set sampling for hypothesis tests on the scale parameter of the exponential and urn loan distributions. Pakistan Journal of Statistics. 12, 131-138. 2. Bai, Z. D. and Chen, Z. (2002). On the theory of ranked set sampling and its ramifications. The special issue of Journal of Statistical Planning and Inference in honor of C. R. Rao, Vol. 3. in press. 3. Bhoj, D.S. (1997). Estimation of parameters of the extreme value distribution using ranked set sampling. Communications in Statistics - Theory and Methods. 26, 653-667. 4. Bohn, L. L. and Wolfe, D. A. (1992). Nonparametric two-sample procedures for ranked-set samples data. Journal of the Aimierican Statistical Association 87, 552-561. 5. Bohn, L. L. and Wolfe, D. A. (1994). The effect of imperfect judgment rankings on properties of procedures based on the ranked-set samples analog of the Mann-Whitney-Wilcoxon statistics. Journal of the American Statistical Association 89, 168-176. 6. Chen, H. and Chen, Z. (2002). Asymptotic properties of the remedian. Manuscript. 7. Chen, Z. (1999). Density estimation using ranked-set sampling data. Environmental and Ecological Statistics 6, 135-146. 8. Chen, Z. (2000a). On ranked-set sample quantiles and their applications. Journal of Statistical Planning and Inference 83, 125-135. 9. Chen, Z. (2000b). The efficiency of ranked-set sampling relative to simple random sampling under Multi-parameter Families. Statistica Sinica 10, 247263. 10. Chen, Z. (2001a). The optimal ranked-set sampling scheme for inference on population quantiles. Statistica Sinica 1 1 , 23-37.
203
11. Chen, Z. (2001b). Nonparametric inferences based on general unbalanced ranked-set samples. Journal of nonparametric statistics. 13, 291-310. 12. Chen, Z. (2001c). Ranked-set sampling with regression type estimators. Journal of Statistical Planning and Inference 92, 181-192. 13. Chen, Z. (2002). Adaptive ranked set sampling with multiple concomitant variables: an effective way to observational economy. Bernoulli 8, 313-322. 14. Chen, Z. and Bai, Z. D.(2000). The Optimal Ranked-set Sampling Scheme for Parametric Families. Sankaya A 62, 178-192. 15. Chen, Z., Bai, Z. D. and Sinha, B. K. (2002). Ranked Set Sampling: Theory and Applications. Springer-Verlag, New York. 16. Chen, Z. and Shen, L. (2002). Two-layer ranked set sampling with concomitant variables. Journal of Statistical Planning and Inference, in press. 17. Chen, Z. and Wang, Y. (2002). Optimal sampling strategies using ranked sets with applications in fish aging and lung cancer studies. Manuscript. 18. Dell, T. R. and Clutter, J. L. (1972). Ranked set sampling theory with order statistics background. Biometrics 28, 545-555. 19. Fei, H., Sinha, B. K. and Wu, Z. (1994). Estimation of parameters in twoparameter Weibull and extreme-value distributions using ranked set sampling. Journal of Statistical Research 28, 149-161. 20. Hettmansperger, T. P. (1995). The ranked-set sampling sign test. Nonparametric Statistics 4, 263-270. 21. Kaur, A., Patil, G. P. and Taillie, C. (1997). Unequal allocation models for ranked set sampling with skew distributions. Biometrics 53, 123-130. 22. Kvam, P. H. and Samaniego, F. J. (1993). On maximum likelihood estimation based on ranked set sampling with applications to reliability. In Advances in Reliability. A. Basu, ed. North Holland, Amsterdam. 215-229. 23. Lam, K. F., YU, P. L. H. and Lee, C. F. (2001). Kernel method for the estimation of the distribution function and the mean with auxiliary information in ranked set sampling. Manuscript. 24. Mclntyre, G. A. (1952). A method of unbiased selective sampling, using ranked sets. Australian Journal of Agriculture Research 3, 385-390. 25. Ozturk, O. and Wolfe, D. A. (1999). Optimal allocation procedure in ranked set sampling for uniniodal and multi-modal distributions. Environmental and Ecological Statistics 7, 343-356. 26. Patil, G.P. Sinha, A.K., and Taillie, C. (1993). Relative precision of ranked set sampling: Comparison with the regression estimator. Environmetrics 4, 399-412. 27. Presnell, B. and Bohn, L. L. (1999). U-statistics and imperfect ranking in ranked set sampling. Journal of Nonparametric Statistics 10, 111-126. 28. Samawi, H. M. and Muttlak, H. A. (1996). Estimation of ratio using rank set sampling. Biometrical Journal 38, 753-764. 29. Stokes, L. (1993). Parametric ranked set sampling. Annals of the Institute of Statistical Mathematics 47, 465-482. 30. Stokes, S. L. (1977). Ranked set sampling with concomitant variables. Communications in Statistics - Theory and Methods A 6 ( 12), 1207-1211. 31. Takahasi, K. and Wakimoto, K. (1968). On unbiased estimates of the popu-
204
lation mean based on the sample stratified by means of ordering. Annals of the Institute of Statistical Mathematics 30, 814-824. 32. Yu, P. L. H. and Lam, K. (1997). Regression estimator in ranked set sampling. Biometrics 53, 1070-1080. 33. Zhao, X. and Chen, Z. (2002). On the Ranked-Set Sampling M-Estimates for Symmetric Location Families. Annals of the Institute of Statistical Mathematics, to appear.
SOME R E C E N T A D V A N C E S O N R E S P O N S E - A D A P T I V E RANDOMIZED DESIGNS
FEIFANG HU Department
of Statistics,
University
of Virginia,
Charlottesville,
VA 22904,
USA
Response-adaptive randomized design uses sequentially accruing data in allocation decisions to reach certain objective. It is useful in clinical research, industrial experiments, and bioassay. We review some common used response-adaptive randomized designs with emphasis on some recent proposed designs. Advantages and disadvantages of these adaptive designs are discussed. Some further research topics and directions are also discussed.
1. Introduction Sequential design is a subfield of experimental design dealing with the sequential selection of design points, or treatments. When these design points are chosen according to outcomes at previously selected design points, such designs are called adaptive. Many industrial and biomedical experiments are sequential by their very nature, for example, the sequential selection of patients for a clinical trial, or the sequential testing of items on an assembly line. Since future design points depend on outcomes at previous design points, objectives can often be targeted more efficiently than if little information is known or only pre-experimental knowledge is available. The two main goals of adaptive designs are: (i) to develop early stopping rules so that a trial can be terminated early with the possibility of reducing the overall number of patients on a randomized clinical trial; and (ii) to develop methods which make use of the accruing outcome data that allow changing the treatment allocation rule during the course of the trial. To achieve the goals, optimal allocation proportions are usually determined according to some multiple objective optimality criteria. In Section 2, we briefly review some optimal allocations of designs based on some criterions. Response-adaptive randomized designs that are not based on formal optimality criteria have been proposed over the last five decades. For example, biased-coin design (Efron, 1971) is proposed for balance of designs. Play-the-winner rule (Zelen, 1969) and the randomized play-the-winner rule 205
206
(Wei and Durham, 1978) are proposed to assign more patients to the better treatment. These designs can be viewed as special cases of generalized Friedman's urn model. Sequential maximum likelihood procedure (Melfi and and Page, 1998, 2000) and doubly adaptive biased-coin designs (Eisele, 1994) have been proposed to target these optimal allocations (or any given allocations). In Section 3, we briefly review three main classes of adaptive designs: (i) urn models; (ii) Sequential maximum likelihood procedure; and (iii) biased coin designs and doubly adaptive biased-coin designs. We also discuss their advantages and disadvantages of these three main classes of adaptive designs in this section. Some further topics are discussed in Section 4. 2. Optimal Allocation Jennison and Turnbull (2000) describe a general procedure for determining optimal allocation according to some criterion for two normal samples. Suppose responses from a population on treatment A follow a normal distribution with mean /J,A and variance aA and responses from a population on treatment B follow a normal distribution with mean /J.B and variance aB. Let 6 = HA — HB and suppose we wish to test Ho : 8 = 0. The usual test would be based on the difference of sample means. Suppose, we fix the variance of the test to be some constant. Then we can minimize the expected value of a loss function of the form L(9)=u(9)nA+v(9)nB,
(1)
where nA and ng are the numbers of subjects assigned to treatments A and B, respectively, and u(9) and v(9) are strictly positive with v{9) increasing in 9 for 6 > 0 and u{9) increasing in 6 for 6 < 0. The minimum of E(L(6)) under the constraint that the variance of the test is constant is given by
where w{9) = y/v(8)/u(9). Note that when w(9) = 1 we allocate proportionately to the standard deviations, and this is simply Neyman allocation, which we can write as _1\A_ n
_
In general, we would like to select u{9) and v(8) according to some ethical or cost criterion, such as minimizing the total monetary cost of the experiment or the expected number of treatment failures in a clinical trial.
207
Consider a clinical trial of two treatments (A and B) with binary response. Let PA be the probability of a treatment success on treatment A, let PB be the probability of a treatment success on treatment B, and let qA = 1 — PA, QB = 1 ~ PB- Let p be the proportion of patients to be assigned to treatment A. Rosenberger, Stallard, Ivanova, et al. (2001) considered the following optimality criterion: for fixed variance of the test statistic, find the allocation p to minimize the expected number of treatment failures. Let /(PAIPB) be a function for comparing two binomial probabilities (this function / is usually related to the particular measure of the treatment effect). Let PA and PB be the estimators of PA and PB based on the responses of the first n patients. The asymptotic variances of these estimators, denoted avar(f(pA,pB)) can be found by delta method. The optimal allocation p, which, for fixed avar(f(pA,PB)) (reflecting the power of the test), minimizes the expected number of treatment failures. Thus, the optimal allocation is p* = argmin {nAqA + nBqB} = argmin {pnqA + (1 p
where n —
TI(P,PA,PB)
p)nqB},
p
is found by solving the equation avar(f(pA,PB))
=C
for some constant C. If we consider the case that f(pA,PB) = PA — PB, it can be shown that the optimal allocation p =
2*—. (2) y/P~A + VPB Note that for the simple example above, Neyman allocation is given by
-^M
(3)
^/PAqA + ^JPBqB
Unfortunately, optimal allocation cannot be accomplished in practice, because the optimal allocation depends on unknown parameters. Hence, we require some adaptive designs to achieve an approximate optimal allocation. In some applications, we need to compare more than two treatments. Dunnett (1955) considered an allocation rule for a set of multiple comparisons of K — 1 treatments versus a single control. He proposed the following unbalanced rule with allocation ratios 1 : 1 : 1 : - - - : 1 : \JK — 1. In general the optimal allocation proportions will depend on the test statistics and are not well studied in the literature. It is still unknown how to generalize the Jennison and Turnbull approach to /^-treatments to find optimal allocations for applications in clinical trials and industrial experiments.
208
3. Adaptive Designs Because the optimal allocation rule, p, in Section 2 is typically a function of the unknown parameters, one cannot plan a study based on the optimal allocation. But we can use accruing data to select dynamically design points to target the optimal allocation. This is the realm of adaptive designs. Here we distinguish among three types of adaptive designs: (i) adaptive designs based on urn models; (ii) sequential maximum likelihood estimation (SMLE) procedures; and(iii) doubly adaptive biased coin designs (DBCD). Here we discuss these designs in details. 3.1. Urn
Models.
The study of randomized urn model begins with the Polya's urn model,a well-known example in elementary probability book. An urn contains Yu type A balls and Y\i type B balls. A ball is drawn from the urn and is replaced in the urn. If a type A (or B) ball is drawn then add /? type A (or B) balls into the urn. A draw is then completed. After n — 1 draws, the urn composition can be described by a stochastic process Y n = (^ni, ^ n 2 ) , n = 1 ; 2,3,.... The asymptotic properties of this process is wellknown. A modified version of Polya's urn was proposed by Friedman (1949). In controlled clinical trials, Zelen (1969) propose the following adaptive design called the play-the-winner rule. Consider a two-arm clinical trial (treatment A and B) with dichotomous response (success and failure). Patients enter sequentially and are assigned to one of the two treatments. A success on a particular treatment generates a future trial on the same treatment with a new patient. A failure on a treatment generates a future trial on the alternate treatment. Let NnA and NnB be the number of patients assigned to the treatment A and B respectively for the first n patients. It is shown (Zelen, 1969) that NnA
n
qj3
_^
qA + QB
almost surely and further Vn(—— - ———) n
-•
N{0,aPW)
qA + qB
in distribution, where 2 °~PW
_ qAQBiPA+PB) — "i ; vi • (qA + qB)3
209
As pointed in Wei and Durham (1978) and Wei (1979), Zelen's play-thewinner rule is deterministic, and hence carries with it the attendant biases of non-randomized studies. Wei and Durham (1978) modified Zelen's rule to the following randomized play-the-winner (RPW) rule: A urn contains (ao,ao) balls (with ao type A balls and ao type B balls) initially. Suppose patients enter the trial sequentially. For a given patient, a ball is drawn and replaced, the patient is assigned to treatment with the type of the ball. The patient's response is observed. A success on treatment A or a failure on treatment B generates a type A ball to the urn. A success on treatment B or a failure on treatment A generates a type B ball to the urn. The urn is updated after each patient. Wei (1979) first recognized that the RPW rule can be described as the following generalized Friedman's urn model (Athreya and Karlin, 1968 and Bai and Hu, 2002). Consider an urn containing particles of K types, respectively representing K 'treatments' in a clinical trial. These treatments are to be allocated sequentially in n stages. At the beginning, the urn contains Yo = (YOI,...,YOK) particles, where Yofc denotes the number of particles of type k, k = 1,..., K. At stage i, i = 1,..., n, a particle is drawn from the urn and replaced. If the particle is of type k, then the treatment k is assigned to the i-th patient, k = 1,...,K, i = 1, ...,n. We then wait for observing a random variable £(i), the response of the treatment at the patient i. After that, an additional Dkq(i) particles of type q, q = 1,...,K are added to the urn, where Dkq{i) is some function of f(i). This procedure is repeated throughout the n stages. After n splits and generations, the urn composition is denoted by the row vector Y„ = (Yn\, ...,Ynx), where Ynk represents the number of particles of type k in the urn after the n-th split. This relation can be written as the following recursive formula:
where X„ is the result of the n-th draw, distributed according to the urn composition at the previous stages, i.e., if the n-th draw is type k particle, then the fc-th component of X„ is 1 and other components are 0. D n is the K x K matrix with Dkq{n) as its (k, q) — th element. Furthermore, write N„ = (Nni,...,NnK), where Nnk is the number of times a type k particle drawn in the first n stages. In clinical trials, Nnk represents the number of patients assigned to the treatment k in the first n trials. For notation, denote D , a K x K matrix with Dkq as its (fc, q) — th element and let Ti be a sequence of increasing u-fields generated by {Yj}J = 0 ,
210
{Xj}} =1 and [Dj})=1.
Define H, = {{E^D^T^),
q,k =
1,...,K)),
i — l,...,n. The matrix D,'s are named as the adding rules and Hj's as the generating matrices. In the literature, the adding rule Di usually depends only on the last treatment used and the outcome. In these cases, the adding rule D» = D and the generating matrices H , = H = ET> are fixed. A GFU model is said to be homogeneous if Hi = H for all i = 1, ...,n. Suppose that the matrix H (homogeneous GFU model with Hj = H) has a unique maximal eigenvalue A with associated left eigenvector v = (v\,..., VK) with ^,Vi = 1. Under certain assumptions, Athreya and Karlin (1968) prove that J'nfc > Vk
j and
ink —=p
• vk
n VK Y almost surely as n —> oo. As a simple example, £ might be the primary outcome of a clinical trial, such as death or cure. Assuming that Yi is deterministic, let Dij — (K — l)Sij if cure on treatment i, and D^ = (1 —<5y) if death on treatment i, where Sij is the Kronecker delta. Assuming that £ is immediately observable after the patient is randomized, we have | Y n | = | Y i | + (K — l)(n — 1). This design is proposed and studied by Wei (1979). When K = 2, this is the randomized play-the-winner rule of Wei and Durham (1978), which has been used occasionally in clinical trials [see, for example, Bartlett, Roloff, Cornell, et al. (1985) and Tamura, Faries, Andersen, et al. (1994)]. Wei, Smythe, Lin, et al. (1990) give a simple probability model for the randomized play-the-winner rule, letting PA be the probability of success on treatment A and PB be the probability of success on treatment B. Under this model, n qA+qs almost surely, and hence the rule allocates according to the relative risk of failure on treatment 2 versus treatment 1. The asymptotic distribution and variance of the urn composition had eluded researchers for years. The result for the randomized play-the-winner rule was derived in Rosenberger (1992). Bai and Hu (1999) proved the asymptotic normality of the urn composition (Y n ) under very general conditions (include non-homogeneous case). Also a general form of the variance was eventually found by Bai and Hu (1999). The form involved the Jordan decomposition of H.
211
Of more interest to sequential design researchers is the asymptotic distribution of (Nni,..., N„K), the number of subjects assigned to each treatment. In the clinical trials context, these are the number of patients allocated to each treatment. The asymptotic distribution and asymptotic variance of (Nni/n,..., NnK/n) are more difficult propositions. For the randomized play-the-winner rule, Matthews and Rosenberger (1997) obtained the asymptotic variance of NnA/n for PA + PB < 3/2. For general urn model, the asymptotic normality of (Nni/n,..., Nnx/n) was shown by Bai and Hu (2002). In that paper, they also obtained a general formula for the asymptotic variance. The form involves very complicated calculation of the Jordan decomposition of H, and would take more than a full page to completely define. However, for a given urn model, we can use this formula to calculate the asymptotic variance of Nnj/n. For the randomized play-the-winner rule, when PA + PB < 1-5 (or qA + QB > -5), we have
M^r--^-)^N{^iPW) n
QA+qB
in distribution, where 2
aRPW
=
g^ga(5 - 2(qA + qB))
(2(qA+qB)-l)(qA+qB)2'
The asymptotic normality was first given in Smythe and Rosenberger (1995). When qA + qB < 0.5, the asymptotic distribution of NnA is still unknown. This is an open problem. In some applications, the adding rule D» depends on the history of the urn composition (see Andersen, Faries and Tamura (1994) and Bai, Hu and Shen (2002)), then the general matrices Hj is the conditional expection of Dj given Ti-\. Therefore, the general matrices Hj are usually random matrices. Recently, Bai and Hu (2002) consider this general case. The easy extension for designs of K treatments make urn designs particularly attractive (see Wei, 1979 and Bai, Hu and Shen, 2002 for K > 2). The main disadvantage of urn models is they target a specific p (for example, p = qB/(qA + qB) for the RPW rule), that is not selected based on formal optimality criteria. The investigator cannot select p. However, as noted by Ivanova and Rosenberger (2001), in many cases the asymptotic limit is very close to certain optimal allocations. It is interesting to note that these randomized urn models have been used extensively in adaptive learning algorithms and game theory, as well as dynamical systems (e.g., Artur, Ermol'ev, and Kaniovskii, 1987; Posch, 1997).
212
The urn model we proposed so far is based on one assumption that we get immediate response from each patient. Typically, clinical trials do not result in immediate outcomes, and urn models are simply not appropriate for today's oft-performed long-term survival trials, where outcomes may not be ascertainable for many years. However, there are many trials where many or most outcomes are available during the recruitment period, even though individual patient outcomes may not be immediately available prior to the randomization of the next patient. Consequently, the urn can be updated when outcomes become available, and this does not involve any additional logistical complexities. Wei (1988) suggested such updating for the randomized play-the-winner rule and introduced an indicator function, Sjk, j < k, that takes the value 1 if the response of patient j occurs before patient k is randomized and 0 otherwise. He did not explore its properties. When there are delayed responses, Bai, Hu and Rosenberger (2002) prove a very general central limit theorem of the urn composition (Y„) for a generalized Friedman's urn of K treatments and L outcomes. Hu and Zhang (2002b) further show the asymptotic normality of the (Nn\/n,..., Nnji/n). Based on these results, we can apply urn models to clinical trial with delayed-response. The main disadvantage of urn model is that it only targets a specific allocation p. Therefore it can not be used to target the optimal allocation in Section 2. 3.2. Sequential Maximum Procedures
Likelihood
Estimation
(SMLE)
The second major type of adaptive design replaces the unknown parameters in the optimal allocation with the current version of the maximum likelihood estimators. For the binary response case, let Xi,...,T n be the treatment assignment indicators, which take the values one for treatment A and zero for treatment B. Let Z\,...,Zn be the response of the first n patients. Now let the pA,i and ps,i be the maximum likelihood estimators of PA and pe based on the results of the first i patients. The maximum likelihood estimator of the optimal allocation p is pi = p{pA,i,PB,i)- Define Ti be the a-field generated from {Xi, ...,T*, Z\,..., Zi). The adaptive design is then defined as follows: E(Tt\Ti-{)
=
Pl^.
By a theorem of Melfi, Page and Geraldes (2001), the maximum likelihood
213
estimators PA and ps are strongly consistent for PA and ps and —^
-^ P(PA,PB)
(4)
almost surely. Hence the limiting allocation is optimal. In general, for a parameter 6, any estimator 8 of 6 will be strongly consistent for 8 following a sequential maximum likelihood procedure, provided: (i) 9 is strongly consistent for 6 following a non-adaptive design; (ii) NnA —> oo and NnB —> oo almost surely, as n —> oo. Under these conditions, we can generally assure that the limiting allocation is optimal, as in (4). Melfi and Page (1998, 2000) have applied this design for the Neyman allocation. This sequential maximum likelihood estimation procedure was also used in Rosenberger et al. (2001) for the optimal allocation in (2). For two treatments {K = 2), Melfi, Page and Geraldes (2001) show the consistency of NnA/n. The asymptotic variance and the asymptotic distribution of NnA/n are studied in Hu and Zhang (2002a). Further, Hu and Zhang (2002a) obtain the asymptotic variance and the asymptotic distribution of N n for general K > 2. They find that the sequential maximum likelihood estimation procedure is a special case of the doubly adaptive biased coin designs discussed in next subsection. The main advantages of these designs are: i) they can target any given value p; ii) they are simple to use and easy to interpret. One disadvantage is that they might result in severe imbalance in small sample experiments. This can be seen by the special case with a fixed p = 1/2 (K = 2) (Efron, 1979). In this case, the sequential maximum likelihood estimation procedure reduces to complete randomization (50%-50%). 3.3. Biased Coin Design Designs
and Doubly Adaptive
Biased
Coin
In the case of K = 2, when balance (50%-50%) is desired in the allocation, Efron (1971) and Wei (1977, 1978) proposed subject assignment algorithms offering a compromise between complete randomization and perfect balance to reduce experimental bias and to increase the precision of inference about treatment difference. These designs, named biased coin designs, achieve balance more quickly than complete randomization, but retain enough randomization to preclude effective guessing of the next treatment to be as-
214
signed. Wei (1978), Smith (1984) and Wei, Smythe and Smith (1986) extended these designs to multi-treatment case when balance is desired or the desired allocation proportions are known. When the desired allocation proportion p is a function of parameters PA and PB (as in the optimal allocation in Section 2), Eisele (1994) and Eisele and Woodroofe (1995) proposed a doubly adaptive biased coin designs to achieve the desired allocation proportion p. Let pi = p{pA,i:PB,i) as in Section 3.2, the estimation of p based on the responses of the first i subjects. The assignment algorithm is defined as follows. Let t be a function from [0, l ] 2 to [0,1] such that the following four conditions hold: (i) t is jointly continuous; (ii) t(r,r) = r; (hi) t(p,r) is strictly decreasing in p and strictly increasing in r on (0, l ) 2 and (iv) t has bounded derivatives in both arguments. The function t represents the closeness of i V ^ / i to p in some sense. Then we allocate to treatment A with probability
E(Ti\Fi-1)=t(^f,pi-iy The properties of this design will depend on the function t. Eisele and Woodroofe (1995) show that (4) holds but under somewhat restrictive conditions. Hu and Zhang (2002a) consider the asymptotic properties under more general conditions. They also extend the doubly adaptive biased coin designs to the cases of more than two treatments. As pointed out by Melfi, Page and Geraldes (2001), the conditions on the allocation function t are rather restrictive. Indeed, the example given in the paper makes use of a function t which does not even satisfy these conditions. Under more general conditions (the example of Eisele and Woodroofe (1995) satisfies these conditions), Hu and Zhang (2002a) consider the asymptotic normality of NnA — np and obtain the asymptotic variance of NnA — np. Further we can develop the asymptotic normality for general K treatments. Here we use K = 2 to illustrate the asymptotic result. Hu and Zhang (2002a) propose the following allocation function: t(0,p) = 1, t(l,p) = 0, t(x,p) = p{,)a+P^_ap){^)a:
(5)
where a > 0. This allocation function is also studied in Hu and Rosenberger (2002). We found that this function has particularly favorable properties. When a — 0, t(x, p) = p, this leads to the SMLE procedure discussed in last subsection. In general, we can choose the parameter a to control the
215
randomness and variation of the procedure. When a = oo, the variance of the DBC design is minimized, but the design is completely predictable. Now, we use the doubly biased coin design to keep the same desired allocation proportions p = qB/(
= OU
V
° g ° g n ) a.s. n
nV2(Nn,A/n-vi)
and
->
N(0,a2DBCD),
where qAqB(pA+PB) {QA+QB)3
DBCD
,
+ (l
qAiB 2a)(qA+qB)3'
+
For the SMLE procedure, a — 0, we have nl>2(Nn,A/n
cr2SMLE),
- vi) -» N(0,
with 2 SMLE
°
_ qAqBJPA +PB) (qA + qB)3
, +
qAiB (U+lsr
4. Some Further Topics and Discussions In the literature of response-adaptive randomization procedure, it has been focused on proposing new designs and the properties of these designs. It is very important to evaluate and compare different designs. Recently Hu and Rosenberger (2002) provide a theoretical template for comparing different response-adaptive randomization procedures for clinical trial based on the optimality, variability and power of the design. In that paper, they show explicitly the relationship between the power of a test with the variability of randomization procedure for a given target allocation proportion. This formulation allows us to directly evaluate and compare different responseadaptive randomization procedures and different target allocations in terms
216
of power (variability of design) and expected treatment failure rate without relying on simulation. So we can compare adaptive designs by calculating their covariance matrices. Based on the results of Hu and Rosenberger (2002), we can just compare the variations of designs for a fixed allocation p = qB/(lA + <1B) for binary response trial with K = 2 as follows, (i) For the SMLE and DBCD, the asymptotic normality holds for all 0 < PA < 1 and 0 < PB < 1, while the asymptotic normality only holds for PA+PB < 1-5 for the RPW rule. So for PA +PB > 1-5, the RPW rule should not be used in clinical trials. The SMLE procedure is a special case of DBCD (a = 0). (ii) It is easily seen that O\)BCD i s a strictly decreasing function of a > 0 and a\ —> crpw as a —> +oo. Also, O\>BCD < aRPW f° r a u a > 1> whenever qA + QB > -5. Furthermore, if qA + qB is near 1/2, then CTDBCD is much smaller than GRPW- SO, this adaptive design is more stable than the RPW rule. The larger a is, the more stable and less random is the design. So, the doubly adaptive biased coin design compromises between the stability in the PW rule and the randomization in the RPW rule. The doubly biased coin design can share the spirit of the RPW rule in the sense that it assigns more patients to the better treatment and allows delayed response by the patient, (iii) The DBCD can target any given allocation p (include the optimal allocations in Section 2), but the RPW rule (or PW rule) only target p — qB/(qA + qB)- See Hu and Rosenberger (2002) for details, (iv) For K > 2, a lot of problems remain to be solved. Another important problem is how to estimate the required sample size for response-adaptive randomization procedures. The main difficult is that 7Vni,..., NnK are random vector and the allocation probabilities keep changing during the course of the trials. Hu (2002) calculates the required sample size for K = 2 based on the asymptotic distribution of A^,^. It remains a further research topic to estimate the required sample size for K > 2. In urn models, Bai, Hu and Rosenberger (2002) and Hu and Zhang (2002b) obtain the asymptotic distribution of Y„ and N n , when there are delayed responses. For SMLE and DBCD, we have to estimate the unknown parameters based on data. The delayed responses will effect these estimations. The asymptotic properties of N „ remain unknown for both SMLE and DBCD.
217 Acknowledgments Professor Hu is also affiliated with Department of Statistics and Applied Probability, National University of Singapore. The research was supported in part by Grant R-155-000-015-112 from National University of Singapore and by a grant from University of Virginia. This work was partially done while the author were visiting the Institute for Mathematical Sciences, National University of Singapore in 2002. The visit was supported by the Institute and by a grant from BMRC-NSTB of Singapore.
References 1. Andersen, J., Faries, D. and Tamura, R.N. (1994). Randomized play-thewinner design for multi-arm clinical trials. Communications in Statistics, Theory and methods 23, 309-323. 2. Athreya, K. B. and Karlin, S. (1968). Embedding of urn schemes into continuous time branching processes and related limit theorems. Ann. Math. Statist. 39, 1801-1817. 3. Athreya, K. B. (1972). Branching Processes. Springer, Berlin. 4. Bai, Z. D. and Hu, F . (1999). Asymptotic theorem for urn models with nonhomogeneous generating matrices, Stochastic Process. Appl. 80, 87-101. 5. Bai, Z. D. and Hu, F. (2001). Strong consistency and asymptotic normality for urn models, Submitted. 6. Bai, Z. D., Hu, F. and Rosenberger (2002). Asymptotic properties of adaptive designs for clinical trials with delayed response, Ann. Statist. 30, 122-139. 7. Bai, Z. D., Hu, F. and Shen, L. (2002). An adaptive design for multi-arm clinical trials, Journal of Multivariate Analysis 8 1 , 1-18. 8. Bai, Z. D., Hu, F. and Zhang, L.X. (2002). The Gaussian approximation theorems for urn models and their applications. Annals of Applied Probability, In press. 9. Eisele, J. (1994). The doubly adaptive biased coin design for sequential clinical trials, J. Statist. Plann. Inf. 38, 249-262. 10. Eisele, J. (1995). Biased coin designs: some properties and applications, In Adaptive Designs (Flournoy, N. and Rosenberger, W. F. eds.) Hayward, CA: Institute of Mathematical Statistics, pp. 48-64. 11. Eisele, J. and Woodroofe, M. (1995). Central limit theorems for doubly adaptive biased coin designs, Ann. Statist. 2 3 , 234-254. 12. Efron, B. (1971). Forcing a sequential experiment to be balanced, Biometrika 62, 347-352. 13. Flournoy, N. and Rosenberger, W. F., eds. (1995). Adaptive Designs, Hayward, Institute of Mathematical Statistics. 14. Friedman, L.M., Furberg, C D . and DeMets, D.L. (1981). Fundamentals of Clinical Trials. Wright PSD, Boston. 15. Hall, P. and Heyde, C. C. (1980). Martingale Limit Theory and its Applications, Academic Press, London.
218
16. Hayre, L.S. (1979). Two-population sequential tests with three hypotheses. Biometrika 66, 465-474. 17. Hayre, L.S.and Turnbull, B.W. (1981). Estimation of the odds ratio in the two-armed bandit problem. Biometrika 68, 661-668. 18. Hu, F. (1997). The consistency of the maximum relevance weighted likelihood estimator. Canadian Journal of Statistics 25, 45-60. 19. Hu, F. (2002). Sample size of response-adaptive randomized designs. Submitted. 20. Hu, F. and Rosenberger, W. F. (2000). Analysis of time trends in adaptive designs with application to a neurophysiology experiment. Statistics in Medicine 19, 2067-2075. 21. Hu, F. and Rosenberger, W. F. (2002). Optimality, Variability, Power: a tamplate for comparing response-adaptive randomization procedures. Submitted 22. Hu, F., Rosenberger, W. F. and Zidek, J. V. (2000). Relevance weighted likelihood for dependent data, Matrika 5 1 , 223-243. 23. Hu, F. and Zhang, L.X. (2002a). Asymptotic properties of doubly adaptive biased coin designs for multi-treatment clinical trials. Annals of Statistics, Revised. 24. Hu, F. and Zhang, L.X. (2002b). The asymptotic normality of adaptive designs with delayed response. Submitted . 25. Jennison, C. and Turnbull, B.W. (2000). Group Sequential Methods with Applications to Clinical Trials. Boca Raton: Chapman and Hall/CRC. 26. Matthews, R C . and Rosenberger, W. F. (1997). Variance in randomized play-the- winner clinical trials. Statistics and Probability Letters 35, 223-240. 27. Melfi, V. and Page, C. (1998). Variability in adaptive designs for estimation of success probabilities. In New Developments and Applications in Experimental Design (Flournoy, N., Rosenberger, W.F. and Wong, W.K., eds.) Hayward, CA: Institute of Mathematical Statistics, pp. 106-114. 28. Melfi, V., Page, C. and Geraldes, M. (2001). An adaptive randomized design with application to estimation. Canadian Journal of Statistics 29, 107-116. 29. Rosenberger, W. F. (1996). New directions in adaptive designs, Statist. Sci. 11, 137-149. 30. Rosenberger, W. F. (2002). Randomized urn models and sequential design (with discussions). Sequential Analysis 2 1 , 1-21. 31. Rosenberger, W. F., Flournoy, N. and Durham, S. D. (1997). Asymptotic normality of maximum likelihood estimators from multiparameter responsedriven designs, J. Statist. Plann. Inf. 60, 69-76. 32. Rosenberger, W.F. and Hu, F. (1999). Bootstrap methods for adaptive designs. Statistics in Medicine 18, 1757-1767. 33. Rosenberger, W.F., Stallard, N., Ivanova, A., Harper, C.N. and Ricks, M.L. (2001). Optimal adaptive designs for binary response trials. Biometrics 57, 909-913. 34. Rosenberger, W. F. and Sriram, T. N. (1997). Estimation for an adaptive allocation design, J. Statist. Plann. Inf. 59, 309-319. 35. Smith, R. L. (1984). Properties of Biased coin designs in sequential clinical trials, Ann. Statist. 12, 1018-1034.
219
36. Smythe, R. T. (1996). Central limit theorems for urn models, Stochastic Process. Appl. 65, 115-137. 37. Smythe, R.T. and Rosenberger, W.F. (1995). Play-the-winner designs, generalized Polya urns, and Markov branching processes, In Adaptive Designs (Flournoy, N. and Rosenberger, W. F. eds.) Hayward, CA: Institute of Mathematical Statistics, pp.13-22. 38. Tamura, R.N.; Faries, D.E., Andersen, J.S. and Heiligenstein, J.H. (1994). A case study of an adaptive clinical trial in the treatment of out-patients with depressive disorder. J. Amer. Statist. Assoc. 89, 768-776. 39. Wei, L.J. (1978). The adaptive biased coin design for sequential experiments, Ann. Statist. 6, 92-100. 40. Wei, L. J. (1979). The generalized Polya's urn design for sequential medical trials, Ann. Statist. 7, 291-296. 41. Wei, L. J. (1988). Exact two-sample permutation tests based on the randomized play-the-winner rule, Biometrika 75, 603-606. 42. Wei, L. J. and Durham, S. (1978). The randomized pay-the-winner rule in medical trials, J. Amer. Statist. Assoc. 73, 840-843. 43. Wei, L.J., Smythe, R.T. and Smith, R. L. (1986). .K'-treatment comparisons with restricted randomization rules in clinical trials, Ann. Statist. 14, 265274.
A CHILDHOOD E P I D E M I C MODEL W I T H BIRTHRATE-DEPENDENT TRANSMISSION
YINGCUN XIA Department of Zoology, University of Cambridge, U.K. Department of Statistics, National University of Singapore, Singapore E-mail: [email protected]
Most infectious diseases have strong age-dependent and density-dependent transmission rates. Birthrate can change the age structure and susceptible density of childhood epidemics. In this paper, we introduce birthrate into transmission parameters in the times series susceptible-infected-recovered (TSIR) model to analyze the effect of birthrate on the epidemic dynamics. Nonparametric methods are used to build models and an adaptive method is proposed to estimate the models. Applied to measles data in New York city and London, our approach can describe the dynamics quite well in views of both prediction and epidemic pattern. Our analyses suggest that dynamics of measles in these cities are deterministic systems with observation errors and exhibit nonlinear and even chaotic signatures depending on the birthrate.
1. Introduction It is well known that many infectious diseases have strong age-dependent contact rate. A typical example is measles, which is still a major childhood killer in west and central Africa. The measles data from some big cities in the world provide a benchmark for the methodology in dynamical epidemics. The realistic age-structured (RAS) model can capture many features of measles dynamics (Schenzle, 1984; Keeling and Grenfell, 1997). However, it is difficult to develop a direct statistical link between measles time series and the RAS models because of the present state of the complexity of the model and lack of observations for different age groups. It is noticeable that birthrate can change the density of susceptibles in different age groups as well as age structure. In this paper, we shall study the effect of birthrate on transmission parameters in epidemiological models and therefore reveal the functions of birthrate in childhood infectious diseases. Based on a stochastic version of the SEIR model introduced by Fine and Clarkson (1982), Finkenstadt and Grenfell (2000) proposed the so-called 220
221
time series susceptible-infected-recovered (TSIR) model as follows. £ ? ( / t | / t . 1 ) S t _ i , t ) = M a - i 5 7 - i . St = Bt.d + St.1-It, (1) where S^ is the number of susceptibles, It is the number of infected cases, and Bt is the birth number at time t; a, (3t and 7 are transmission parameters. The TSIR model can capture some features of the epidemics of measles in England and Wales. Especially, it can predict the epidemics remarkably well in the largest cities. However, the model does not reflect directly the relation between the birth rate and transmission parameters, such as a and 7. When applied to situations with fast changing birthrates, the model fails to capture the dramatic changes in epidemic patterns; see Figure 2 in section 2. Figure 1 shows the cases of measles in different age groups in London. The top line in Figure 1(a) is the birthrate (scaled for easy visualization). When the birth rate is high, the epidemics has annual cycles; otherwise biennial cycles. More statistical evidences for such a relationship were given in Finkenstadt et al. (1998). They attributed the annual cycles to the quick replacement of susceptibles after a major epidemics in the high birth rate periods and the biennial cycle to the time needed to accumulate sufficient susceptibles for the next big epidemics when the birth rate is low. In Figure 1(b), the top line is the ratio (scaled for easy visualization) of the number of cases of school-aged children over that of non-school-aged children. It is notifiable that when the cases of school aged children dominate the epidemics, it has biennial cycles; otherwise it has annual cycles. This agrees with the motivation of Schenzle (1984). The school-aged children have higher contact rate than the non-school aged children, which is supported by the fact that when the epidemics is dominated by non-school-aged children, there are less cases than that dominated by the school-aged children. To formulate the above observations, we take a look at the marginal utility of infectious hosts (or susceptibles), i.e. the increment of current cases when the number of the infected (or susceptibles) in the previous time unit increases 1 percent. It is easy to see from model (1) that
?k idSt~i -
?k /^Iizi _ h I h-1
^
h I St.!
T
A mass assumption behind the above relations is that one percent increment of infected cases in both school and non-school will result in the same increment of new cases (similar for the susceptibles). This is not in line with our observations above. The transmission parameters such as a and 7 might depend on the age structure. Since birthrate affects of the age
222
i \ /
J
V
i
\
'--,
[_
45
50
55
60
65
45
50
55
60
65
Figure 1. (a): the thin line is the total number of cases in all age groups; the thick line is the birthrate (scaled for easy visualization), (b): the line on the top is the ratio of cases of school-aged children over those of non-school-aged children; the second to fourth lines from the top to the bottom are the cases of school-aged, 3-5 years old and 1-3 years old children respectively; the other lines are the cases in the other age groups.
structure and the density of susceptibles, it is reasonable for us to take the transmission parameters as functions of the birthrate. Thus, we introduce the birth rate, bt, to transmission parameters a, 7 and /3 t . Denote them by a (bt), l{bt) and Pt(bt) respectively. We have model
E(It\It-1,St^t)=pt(bt)I?}b1t)S?{bt),
St = St-i+Bt-It.
(2)
The transmission functions a(-), 7(-), and (3t(-) are the main concern of the paper. They reflect the effect of birthrate on the transmission of epidemics and therefore determine the epidemic patterns. To choose appropriate forms for the transformation functions, we shall use some nonparametric methods which allows data to speak themselves and avoid mis-specification. We shall address this problem in section 2 and suggest simple functional forms. Because we only have a small range of observations for the birth rate, to use it more efficiently, we proposed a multi-steps ahead estimation method. This method can provide more reliable estimators of parameters. We shall discuss it in details in section 3. Finally, we shall apply model (2) and the estimation method to the measles data in New York city and London.
223
2. The Effect of Birth Rate To see the effect of birthrate on the transmission parameters, we consider measles data in the 50 biggest cities in England and Wales. The available data are the time series of cases in all the cities and the birth numbers in the corresponding period. Under the assumption of homogeneous mixing for all ages, Ellner et al. (1998) and Finkenstadt & Grenfell (2000) proposed a simple balance-equation for the susceptibles: St = St-i + Bt — It, where Bt is the number of births in biweek t. Because of the underreporting of It, this balance cannot be used directly. Instead, if we write St — S + zt, then zt can be obtained approximately by some nonparametric methods; see Finkenstadt & Grenfell (2000). A simple approximation used by Finkenstadt & Grenfell (2000) for model (1) is E(yt+i\yt,St,t)
=rt+1
+ayt + izt,
(3)
where rt = log(/3t) and yt = \og(It). This approximation can fit the measles data in England and Wales quite well. However, it fails to capture the dramatic changing patterns of the epidemics when the birth rate changes substantially; see, for example, the complicated cycle patterns before 1944 in New York city and the annual cycle around 1950 in London as shown in Figure 2. Next, we give a more delicate approximation to model (2). It is reasonable to assume that the average number of susceptibles is proportional to the birth number, i.e. S = ex (birth number). Taking logarithm to model (2) and using Taylor expansion, we have the following approximation T
E(yt+1\yt,
Bt, St,t) = V /3T(bt)DT,t + a(bt)yt-i
+ 7 i ( M # + 72 (M
T= l
lo
s(Bt),
^
where bt and Bt are the birthrate and birth number respectively at time t. For the same city, this two variables are related as Bt = bt x (population size). DTt is employed to describe seasonal variations in infection rate. DTtt = 1 if T = mod(t, T); 0 otherwise, where T is the period of a seasonality cycle. In order to merge the data sets from the 50 cities, we consider a unified cross-section time series model as follows T
E(yk,t\yk,t,Bk,t,Sk,t,t)
= ^2Pt(bk,t)DT,t
+
a(bktt)yk,t-i
fc=i
+li{h,t)zk,t/Bk!t
+ 72(6fc,t) log(B M ),(4)
k — 1,2, ••• ,50, where bk,t is the relative birth rate in city k, i.e. bk,t = (birth number of city fc)/(average birth number of city k), yk,t,Bk,t
224
relative birthrate
relative birthrate
Figure 2. The lower thin lines in the first two panels are the time series of observed cases. The lower thick lines are deterministic realizations of the estimated models. The lines on the tops indicate patterns of dynamic cycles: one line indicates annual cycle; two lines biennial cycle and so forth. The middle curves are the relative birthrate. The panels on the bottom are the corresponding bifurcation diagrams.
and Sk,t are the logarithm of cases, birth number and susceptibles at time t for city k. Note that model (4) is a typical functional coefficient linear regression model; see Hastie and Tibshirani (1993). The model can be estimated using the method proposed by Xia and Li (1999). We use the data observed in the 50 cities to estimate the model. The estimation results are shown in Figure 3. The parameters a, 71,72 and the variation of seasonality, i.e. std(/3 T ,r = 1,2, ••• ,T), varies with the birthrate significantly, whereas the mean of seasonality does not show significant changes. Finally, we approximate the RAS mechanism by the following statistical
225
i
.
,
0.8
1
1.2
1
i
.
.
0.8
1
1.2
1
oi i
0.8
1
1
1.2
Figure 3. The thick lines are the estimated transmission functions. The thin lines are the 95% point-wise confidence intervals.
model T
E(yt+i\yt,zt,bt,t)
= r(bt)'^2(3TDT!t+i
+ a(bt)yt-i
+j(bt)zt/bt
+ p(bt)bt,
T= l
zt = zt-i + Bt - It,
(5)
where r{bt) = r0+ri(bt-to)+-r2(bt-t0)and a(bt) = aa -a\bt -a2(bth)+, l{bt) = 7o+7i&t. p(bt) = po + P\bt with ao,a1,a2,r0,ri,r2,ji,pi positive; see Figure 4. Our choices of the functions are simple.
Figure 4.
The functional forms of the transmission parameters.
Explanations for the estimated transmission functions are as follows. It is known the transmission rate among school children is faster than that in non-school children. Taking school as a community, the density of susceptibles does not depend on the whole population and the transmission is relatively independent of the whole population. Thus the density of susceptible in schools is stable with respect to the population and the birthrate, whereas that in non-school children changes with the number of susceptibles and the birthrate substantially. When the birth rate is low, because
226
the density of susceptible in non-school children is low and the transmission rate among non-school susceptibles is low. The transmission is mainly carried out in schools and the infected hosts are mainly in schools. Thus, averagely speaking, the infected hosts have high marginal utility, i.e. a is big. When the birth rate is high, the density of susceptibles in non-school children is high. The proportion of infected cases in non-school children is high. Because they have relatively smaller contact rate with the other susceptible than those in schools, their transmission rate is low. Therefore, the overall transmission rate reduces with the birthrate. On the contrary, when the birthrate is low, the density of susceptibles of non-school children is low, there is less probability for them to be contacted. Therefore the overall ratio of susceptibles contacted by the infected is low, i.e. 7 is small. Note that the mean contact rate, i.e. the mean of ftT, T = 1,2, • • • , T, does not change significantly with the birthrate, which seems to contradict with the results based on the SEIR differential equations; see Olsen and Schaffer (1990) and Earn et al. (2000). Their models are based on the assumption of the "mass action". This assumption can be improved by the "generalized mass action"; see for example Liu et al. (1987) and Finkenstadt and Grenfell (2000). Our result suggests that it is not the overall contact rate, i.e. mean(/? r ), that changes with the age-structure but the overall proportion of different age groups among the susceptibles or the infected that change with the birthrate. If we assume the "mass action", i.e. fixing the transmission parameters a = 1 and 7 = 1 , then the changes in a and 7 are transferred to the changes in the contact rate numerically. 3. Estimation of the Model To estimate model (5), we first have to determine the attribute of the model. With no doubt, the data we observed are contaminated by error terms. Therefore, there are two types of settings: deterministic dynamics with observation errors and stochastic system. Different settings result in different estimation approaches. However, few statistical methods are available for distinguishing these two systems. Therefore, a robust estimation method is essential. For a dynamical system {zt}, models for {zt} under the above two settings are respectively (yt = g(yt-i,Xt,a) i , {Zt
, ' a n d
zt=g(zt-1,Xt,a)+et,
— yt + £fi
where Xt is an exogenous variable vector. For ease of exposition, write Yt = (yt,Xt) and Zt = (zt,Xt). For the first setting, the commonly used
227
estimation procedure is minimizing n
£||yt-9'(X0,a)||2,
(6)
t=i
with respect to a and Xo, where gl{-,a) denotes the i-fold composition of g; see, for example, Berliner (1991). The difficulty with the estimation is that the nuisance parameters XQ has to be estimated and it is too sensitive to the observations. For the second setting, the estimation procedure is to minimize n
Y,\\Zt-9{Zt-u*)\\2
(7)
4=1
with respect to a. The estimation strongly depends on the correctness of the model, i.e. the estimation method is not robust to the mis-specification of the model. Next, we shall combine these two methods to propose a new one which is expected to keep the advantages of the original estimations. Let r^(k) and r^(k), k = 0,1, 2, • • • , be the auto-covariance functions (ACF) for {£t} and {^} respectively. The difference between the two time series can be measured by oo
L(a) = J > 5 ( f c | a ) - r c ( f c ) } 2 .
(8)
k=0
The best estimate of a based on the cost function of L(a) is a =argmin(L(a)). Let yt+m\t = E{yt+m\yt,Xut), which is the m-step ahead prediction. It is easy to see that Yt+i\t = g(Yt,a). Therefore, the estimation method in (7) is just minimizing one-step ahead errors. The estimate based on one-step ahead prediction errors approximates the observed data only by its low order of ACF rather than that of all orders. As an example, suppose we use the simplest AR(1) model £t = a(t-i +£t to fit £t and assume £t is standardized. The estimator of a is a = J ] £t£t-i/ ^2 C^-iAs n —•> oo, we have a = rg(l). It is easy to see that r^(k) = H-(l), which does not minimize L(a). Thus, the estimation of the parameter aims to approximate the observed time series {^} by the model mere in the sense of its first order ACF. Motivated by this example, we propose to estimate the parameters by minimizing all steps prediction error n — m n\
EEfe+™-M2 t=l
m
(9)
=l
with respect to a, where n\ —> oo as n —* oo. Note that the estimation is
228
1930
1935
1940
1945
1950
1955
1960
Figure 5. The first and the second panels are the observations (thin lines) from New York city and London and deterministic realizations (thick lines) of the estimated models respectively.
2.5
4 yearly cases
t
l|
1
!l5
- — - '
11 0.5
BiUili iB::
1.2 1.4 1 elative birthrate
0.8
2
1e
1
,, 3 power
9
a2 1
u
1.4
x 10 1!
;
OS
1.2 relative birthrate
.A^__ period(months)
i.i.L
i period(years)
Figure 6. The first and second panels are the corresponding bifurcation diagrams of the two New York and London respectively. The last two panels are the corresponding spectral power diagrams; the thin lines are the spectral of the observations and the thick lines are that of the realizations.
still based on fitting and prediction, but at the same time it can minimise the ACF difference between the model and the real data. We call the
229
estimation based on (9) all-step ahead estimation. More discussions of the method shall be given in the appendix. Applied the model and the estimation method to the measles data in New York city and London, we obtained the estimates of the parameter as listed in Tables A.l and A.2 in the appendix. Figures 5 and 6 show that the deterministic realizations of the estimated models, bifurcation diagrams and the spectral power diagrams. The estimated model can trace the real data remarkably well as well as capture the changing cycles accurately.
4. Signatures of Measles Dynamics An important question is whether the dynamics of measles in the big cities have some common signatures. To answer this question, we adjust the birth rates in the cities to make it comparable; see the upper two panels in Figure 7. We redraw the bifurcation diagrams and put them together as shown in the lower panel of Figure 7. These two cities share most of their dynamic patterns when the have the same birth rate. When the birthrate is higher than 0.0025, the dynamic cycle is annual; When the birth rate is between 0.002 and 0.0025, the dynamic cycle is biennial; When the birth rate is under 0.002, the dynamic cycle is 4-year or chaotic. Because of complex performance, the measles data from New York city has been paid special attentions. The focus is whether the dynamics reflect a noisy limit cycle or chaos; see for example, Schwartz (1985) and Casdagli (1991) for the former claim and Sugihara & May (1990) and Olsen & Schaffer (1990) for the latter. There are mainly two kinds of models; the (parametric) SEIR model as used by Olsen & Schaffer (1990) and nonparametric method as used by Sugihara & May (1991) and Casdagli (1991). Correct model and correct estimation of parameters are essential to the problem. The models used by Casdagli (1991) and Sugihara & May (1990) are simple univariate nonparametric time series models. They did not consider the effect of susceptibles. The differential equations SEIR model is a reasonable model for the data. However, the problem is how to estimate the parameters in the model and how to check the assumptions, such as "mass action". Moreover, none of these models are checked from both data fitting and ACF points of views, which are necessary for checking a time series model; see Tong (1990) and Sugihara & May (1990). Because of the lack of necessary data, the SEIR model cannot be tested in view the statistics fitting although their cycle patterns can fit that of the real data quite well; see Olsen and Schaffer (1990) and Earn et al. (2000).
230
0.035 0.03
Annual cycle
S 0.025 Biennial cycle 3
0.02
1930
1940 1950 New York
1960
1945
1950
1955 1960 London
1965
0.02
Figure 7. The upper panels are the birthrates in New York and London. The dotted lines partition the panels into serval parts corresponding to different dynamical cycles. The Lower panel is the bifurcation diagram (y-axis is rescaled for easy visualization); dots denote the bifurcation diagram of New York city and the solid line that of London.
Our model is based on the mechanism of SEIR and can fit the data quite well from both data fitting and the ACF point of views; see Figures 5 and 6. The bifurcation diagram in Figure 6 shows that when the relative birthrate is low (around 0.85), the dynamics has some chaotic patterns; when the birthrate is in the middle, it shows biennial cycle; when the
case* of Ihe previous year
of the previous year
Figure 8. Plots of next year's cases to the current year's at birthrates 0.85 (left panel) and 0.9 (right panel) respectively.
231
birthrate is high, it has annual cycle. Figure 8 shows the relations of next year's cases to the current year's at birthrates 0.85 and 0.9 respectively. These Figures are very similar to those obtained by Olsen and Schaffer (1990) and show obvious fingerprint of chaos. Olsen and Schafer (1990) discovered the chaotic pattern by changing the seasonal component, which is equivalent to the the variation of (3T. Different from them, our analysis suggests that the birthrate at some levels can result in chaotic dynamics in the measles. 5. Conclusions In this paper, we investigate the effects of birthrate on childhood epidemics. By introducing the birthrate to the TSIR model, the new model can fit the measles data and approximate the ACF of the data quite well. It also trace the real data accurately. These results convince us to believe that the model is a reasonable approximation of the real mechanism. Based on these analyses, we have the following conclusions. (1) The birthrate affects not only the number of susceptibles but also change the relative structure of susceptible. (2) The high birthrate tends to reduce overall transmission parameters a and increase 7 because the proportion of transmissions among non-school children is high, and vise versa. The birthrate can also change the seasonal component. The conclusions about the changes of parameters agree with the RAS model and Olsen and Schaffer (1990). (3) Changing the transmission parameters can change the patterns of the epidemics; see for example Olsen and Schaffer (1990) and Eran et al. (2000). By affecting the transmission parameters, the birthrate can then change the patterns of epidemics. (4) In the large cities, the dynamics of measles is deterministic system with observation errors rather than stochastic system. The nonlinearity and even chaos may appear in the dynamics depending on the level of birthrate. Acknowledgments The author thanks NUS research grant R-155-000-032-112, the Friends of London School of Economics (Hong Kong) and the Wellcome Trust for partial support.
232
Appendix: Estimation Method and values of the parameters in the model The theory of the estimation method is too complicated and need more intensive investigations. Here, we report some simulation results. Suppose {yt} follows model
yt = 0.6jft-i - 0.4j/ t _ 2 + 0.1»t_3 + £t
(10)
where st ~ N(0,1). Suppose a realization from the model is {yt : t = 1, • • • , n}. When sample size is not large, e.g. 200, it is likely for us to model the series from (10) by a AR(2) model using the AIC criteria. (Our simulations suggest with 82% probability we choose lag 2). We generate 1000 realizations from model (10) for sample size 100 and 200 each. Figure 9(a)-(b) shows the difference of ACF based on the estimated AR(2) model and the true model (10). All-step ahead estimation proposed in the paper always outperforms the commonly used method (i.e. one-step fitting method). Next, we consider the TSIR model contaminated by autocorrelated time series. Suppose a community with constant birth number 100 and an epidemic follows the following dynamical SIR model.
It = 0.3 J ^ S ^ ,
St = St.! +Bt-
It,
(11)
where It, St and Bt are the numbers of infected and susceptibles and number of births in time t respectively, log(£t) = 0.1et — 0.05et with £t ~ N(0,1). The residuals have negative correlations because the more infected number in the previous years will reduce the number in this term, and vise versa. 1000 realizations with sample sizes 200 and 500 are drawn from the model. The difference of ACF of the real data and he estimated model based on the one-step ahead estimation and all-step ahead estimation are shown in Figure 9(c)-(d). Our simulations shows that the proposed estimation method has better performance than the commonly used method in the sense of minimizing the difference of ACF between the estimated model and the observations. The all-step ahead estimation is robust to the model miss-specification, even in the cases that statistical methods cannot diagonal the miss-specification.
233
Table A.l. Estimation of parameters in the model for London parameter estimate t-value parameter estimate t-value 0.5452 8.1957 0.1417 2.3160 01 020 0.4660 6.9071 0.4544 7.5362 Pi 021 0.4878 7.1091 0.5172 8.4474 03 022 7.0125 0.7045 10.0818 0.4386 04 023 0.5433 7.5013 0.3551 5.5937 05 024 0.4481 6.0613 0.3702 5.7861 06 025 0.4127 6.3876 0.3945 5.2818 07 026 4.7041 130.5858 0.3527 0.9600 Q0 08 0.2666 3.5614 Ql 0 * 09 0.3080 4.1605 -2.15 -2.3626 £*2 0io 5.2793 2.87xl0-6 16.0043 0.3875 70 0n 0.4072 5.5631 0 * 012 71 0.2927 4.0042 1.216 * 013 to 0.2495 3.4518 -0.05 * ao 014 0.2452 3.4441 0.19 * a\ 015 0.1793 2.5567 1 * 016 ro 0.7552 0.0519 0 * Pn n -0.1323 -1.9820 0 * r2 018 -0.1503 -2.3518 -23950 * z\ 019 Table A.2. Estimation of parameters in the model for New parameter parameter estimate estimate t-value 1.2077 12.1156 0.888 01 ao 1.1992 11.1022 ai 0.392 02 1.3296 11.4409 0.439 03 (22 1.1186 8.8892 4 . 4 1xl0~5 04 70 1.0272 7.8339 0 05 71 0.6879 5.1515 1 06 to -0.0756 -0.5833 0 07 P0 -0.4739 -4.1396 0 08 Pi -0.2276 -2.3650 1 09 ro 0.6804 0.225 8.0560 010 n 0.9934 11.5520 0.18 011 ri 1.1378 12.3795 1 t\ 012 z\ 2610
York t-value 57.6879
* * 14.0593 0
* 0 0
* * * * *
References 1. Anderson, R.M. and May, R. M. (1991). Infectious Disease of Humans: Dynamics and Control. Oxford: Oxford University Press. 2. Bartlett, M. S. (1957). Measles periodicity and community size. J. R. Statist. Soc. A 123, 37-44.
234
(0)
(d)
Figure 9. (a) and (b) are simulation results from model (10) with sample size 200 and 500 respectively; (c) and (d) are simulations from model (11) with sample size 200 and 500 respectively. 3. Berliner, L. M. (1991). Likelihood and Bayesian prediction of chaotic system. J. Amer. Statist. Ass. 86, 938-952. 4. Casdagli, M. (1991). Chaos and Deterministic versus Stochastic Non-linear Modelling. J. R. Statist. Soc. B 54 303-328. 5. Earn, D.J.D. Rohani, P., Bolkler, B. M. and Grefell, B. T. (2000). A simple model for complex dynamical transitions in epidemics. Science 287, 667-670. 6. Ellner, S. P., B. A. Bailey, G.V. Bobashev, A.R. Gallant, B. T. Grenfell and D.W. Nychka (1998). Noise and nonlinerity in Measles Epidemics: Combining Mechanistic and Statistical Approaches to Population Modeling. Am. Nat. 151, 425-440. 7. Fine, P. E. M. and Clarkson, J. A. (1982). Measles in England and Wales: I, an analysis of factors underlying seasonal patterns. Int. J. Epidem. 11, 5-14. 8. Finkenstadt, B. F. and Grenfell, B. T. (2000). Time series modelling of childhood diseases: a dynamical systems approach. Appl. Statist. 49, 187-205. 9. Finkenstadt, B. F. Keeling, M. J. and Grenfell, B. T. (1988). Patterns of density dependence in measles dynamics. Proc. R. Soc. Lond. B 265, 753762. 10. Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models (with discussion). J. R. Statist. Soc. B 55, 757-796. 11. Keeling, M.J. and Grenfell, B.T. (1997). Disease extinction and community size: Modeling the persistence of measles. Science 275, 65-67 12. Kermack, W. O. and McKendrick, A. G. (1933). A contribution to the methematical theory of epidemics: Part III, Further studies of the problem of endemicity. Proc. R. Soc. Lond. A. 141, 92-122. 13. Liu,R.M., Hethcote, H.W. and Levin,S.A. (1987). Dynamical behaviour of epidemiological models with nonlinear indendence rates. J. Math. Biol. 98,
235
543-468. 14. Olsen, L.F. and Schaffer, W. M. (1990). Chaos versus noisy periodicity: Aternative Hypotheses for Childhood Epidemics. Science 249 499-504. 15. Schenzle, D. (1984). An age-structured model of pre- and post-vaccination measles transmission. IMA J. Math. Appl. Med. Biol. 1, 169-191. 16. Sugihara, G and May, R. (1990). Nonlinear forecasting as a way of distinguishing chaos from measurement error in a data series. Nature 344, 734-741. 17. Tong, H. (1990). Nonlinear Time Series Analysis: a Dynamical System Approach. Oxford University Press, London. 18. Xia, Y. and Li, W. K. (1999). On the estimation and testing of functionalCoefficient linear models. Statistica Sinica 9 735-758.
LINEAR REGRESSION ANALYSIS WITH OBSERVATIONS SUBJECT TO INTERVAL CENSORING
L I N X I O N G LI Dept.
of Math,
Univ.
of New Orleans,
New Orleans,
LA
70148
In biomedical studies it is not uncommon that an observation is not observed exactly but only known falling between two (censoring) time points. The Cox proportional hazards model has been studied by many authors when d a t a are subject to interval censoring. In this paper, a log-linear model with interval-censored data is considered. Two methods for estimating regression coefficients, the leastsquare estimation and a non-parametric approach, are proposed. To illustrate the methods proposed, an application of the methods to a cancer study is presented.
1. Introduction Interval-censored data are commonly seen in medical follow-up studies. Two commonly used interval censoring mechanisms are case 1 interval censoring (also called current status) and case 2 interval censoring. A lifetime is said to be subject to case 1 interval censoring if it is either right- or left-censored, but never observable. And a lifetime is subject to case 2 interval censoring if it always falls between two censoring points and is never observed exactly. Nonparametric estimation of survival functions based on the censoring schemes can be found in Groeneboom and Wellner (1992), among others. In this paper, we consider a censoring scheme that contains both case 2 interval-censored data and exact observations. An example of our censoring scheme can be found in Odell et al. (1992). Estimation of survival functions with both case 2 interval-censored data and exact observations can be found in Peto (1973), Turnbull (1976), and Li et al. (1997). Various regression models with right-censored data have been extensively studied. We in this paper focus on interval-censored data. For case 1 interval-censored data, Huang (1996) studied properties of efficient estimators for proportional hazards models. When data are case 2 intervalcensored, Finkelstein (1986) used the proportional hazards model to compare two treatments for breast cancer patients, and Rabinowitz et al. (1995) proposed a class of score statistics to estimate parameters of a log-linear 236
237
model. For a data set containing both case 2 interval-censored and exact observations, Li and Pu (1999) modified Buckley and James (1979) leastsquare method to estimate the regression coefficient of a single-covariate log-linear model. However, Li and Pu did not discuss how to physically find the estimator. In this paper, we will extend Li and Pu (1999) to multiple regression along with an algorithm for obtaining the least-square estimator (Section 3). Then, we will propose a non-parametric approach using ranks to estimating regression coefficients (Section 4). In Section 2 we present censoring schemes and the log-linear model. An application of the proposed methods and a comparison between the proposed methods and existing results are provided in Section 5. 2. Data and Model Let the random variable T > 0 denote a lifetime, which is subject to censoring by two observable (random) censoring times, L and R, where 0 < L < R. The lifetime T always falls into the interval [L,R], called the censoring interval. In clinic trials, if the lifetime is the time to event of interest under investigation and [L, R] is two check-up times, then due to various reasons the lifetime may not be observed exactly and is only known occurred between two check-ups, i.e., L < T < R. This makes the lifetime T to be interval-censored. There are two possibilities regarding T G [L, R]: (l)HL = R, lifetime T = L = R is observed exactly, and (2) if L < R, T is case 2 interval-censored. So, for each lifetime, we observe an interval [L, R\. Note that when R = oo, the lifetime is right-censored, and in this case the interval [L, R] is interpreted as [L, R). Let [li, r,], i = 1, 2 , . . . , n be a random sample of observed intervals from [L, R\. Let £={li, i = 1, 2 , . . . , n} and 1Z={n, i = l , 2 , . . . , n}. Following Turnbull's (1976) notation (also see Li et at, 1997), an interval [q,p] with q G L and p £ 72. is called an innermost interval if [q,p] does not contain any li or r« except themselves. In other words, [q,p] is an innermost interval if [q,p] Di^-U^-} = {l^p}Apparently, each exact observation comprises an innermost interval, and the innermost intervals are mutually exclusive. We suppose that there are m such innermost intervals for a given data set. To associate the lifetime with covariates, we assume that there are A; covariates and consider a log-linear model k
Yi = \ogTi=p0
+ ^f3jxij j=i
+ ei=Xi(5+ei,
z = l,2,...,n,
(1)
238
where e, are assumed to be independent, identically distributed with unspecified distribution function F, and the covariates (x\,X2, • • • ,Xk) are assumed to be scalars. Let [y\ = \ogli,y\ = logr,] denote an observed log-interval and e* = y% — Xij3 denote the error. When lifetime T (or y) is observed exactly, the error is a single value; when T is interval-censored, the corresponding error e is also censored by interval [y\ - X&yr - Xip] = [e{,ej]. Let [ej.ej], i = 1,2,... , n represent the observed error intervals. Since as assumed there are m innermost intervals based on {[/j, r*], i = 1,2,..., n}, there are also m innermost intervals based on {[e',e[], i = 1 , 2 , . . . , n } . We denote the latter innermost intervals by [Qj,Pj],3 = 1,2, . . . , m . 3. Least-square E s t i m a t i o n Let X = (X[,X'2,..., X'n)' be the n x (k + 1) design matrix and Y = (2/1,2/2, ••• ,2/n)'- For uncensored data, the least-square estimator of /? is known as /3=(X'X)-1X'F.
(2)
When T is censored, (2) cannot, however, be used directly. For rightcensored data with a single covariate, Buckley and James (1979) proposed a method to use (2) to estimate /?. Li and Pu (1999) generalized the method to interval-censored data with one covariate. We now extend it to multiple covariates. Based on observed error intervals [e\, e\], i = 1, 2 , . . . , n, let F be the generalized maximum likelihood estimator (GMLE) of F (Turnbull, 1976; Li et al., 1997). It is known that the GMLE F only assigns weight to innermost intervals (Peto, 1973). Let Wj = F(pj) — F{qe^ — 0) denote the weight assigned to the innermost interval [
239
If yi is censored, we estimate it by
I/r=^[r < |yi
+
ei\ei£{e\,eri)}
l^k=1dikwk ,x
^
2
,„ g^[l + J f e = 0°) +PemHPem < 00)] ,
Then the estimator /? of p is a solution of the equation (3={X'X)-1X%,
(3)
where Y£, a function of /?, is the same as Y except with censored yi replaced To solve (3), note that when (3 varies, the innermost intervals (and hence F) change accordingly. For a fixed /?°, there exists a small neighborhood such that when P is in the neighborhood, the innermost intervals do not change, and hence the weights wi,W2,---,wm are also fixed for P varying inside the neighborhood. As a consequence, the y* and the solution of (3) do not change either. We now present an approach to obtaining (3. The idea here is similar to Yu and Wong (2000). First assume there is only one covariate x. Since the estimator PQ equals y — p{x, it suffices to discuss p\. As mentioned earlier, when /? varies within a small neighborhood, the vector Yp remains unchanged. So, we can assume that the real line (—oo, +oo) has been partitioned into N subintervals such that YS remains the same on each of the subintervals. For a given sample of size n, we shall show that Af is finite. Thus one can compare these N values of {X'X)"1 X'Y^ with corresponding /3i and choose the one satisfying (3) to be the estimate /3i of Pi. The proof of the statement that N is finite is outlined below. Suppose there is only one covariate. For each given i and j with 1 < i ^ j < n, solve four equations e' — e' = 0, e[ — eT, = 0, e\ — ej = 0, and e^—elj = 0. Obviously, for a sample of size n, there is atotalof 2n(n—1) such equations. Suppose there are d distinct solutions out of these equations. These d values partition the real line into d + 1 subintervals. Notice that when Pi varies within such a subinterval, the order of elements in C [J 1Z does not change, and hence YS remains unchanged within each subinterval. When there are two covariates, solutions of each of the above four equations determine a straight line on a two dimensional plane. There are still
240
2n(n — 1) equations. Suppose there are d different solutions (lines). These d lines partition the plane into many but finite number of areas. Similar to the one covariate case, when (/3i,/32) varies within each such area, the vector Y£ does not change. This indicates that finding a solution of equation (3) is equivalent to checking to see whether or not (X'X)~1X'Yp falls into the same area as where /? belongs. If they do belong to the same area, then there is a solution in that area. Otherwise, there is no solution in that area. Since there are finite number of areas to check, the solution of (3) can be found. The case of k covariates is analogous. The above discussions provide us with an algorithm to find (3. 4. Non-parametric Estimation In this section, we assume there is only one covariate, X, which is random but not subject to censoring. Our goal is to estimate (3\. The estimation of Po is less interesting (although not difficult). To simplify notation, we will use j3 instead of (3\ for the rest of the section. To use a statistic similar to Kendall's correlation coefficient, one needs to rank Tj, i = 1,2,..., n, which are however not observable due to censoring. So we consider ranking the corresponding censored intervals, instead. For two error intervals [e\, e[) and [e', e p , we are not able to rank them if they overlap. We now define a statistic based on the following equation
Kn(P) = J > ( * i < Xj)-I(Xi
> XjWiem
< el3m-I(e\(P)
> eJ(/3))].
Since the error intervals are i.i.d., the estimate of f3 should be a value (3 such that Kn0) = 0. This is actually the Theil-Sen estimator discussed by Akritas et al. (1995) for right-censored data. We state the following lemma. The proof is straightforward and thus omitted. l e m m a 1. Kn(P) is non-increasing in /3. Since Kn((3) is discrete, equation Kn((3) = 0 either has no solutions or has multiple solutions. For this reason, we let /3*. = sup{f3 : Kn{(3) > 0} and (3** = inf{(3 : Kn{j3) < 0}, and define $ = 0* + (3**)/2 to be the estimator of (3. We now discuss large sample properties of /3. Assumption 1. X is independent of (U, V). Assumption 2. The joint distribution of (Xi, Yt, Ui, V{) has a bounded density. Theorem 4 . 1 . Under assumption 1, E[Kn(@)] = 0, where /? is the true
241
regression parameter. Proof. It suffices to show that E[w(/3)} = 0. In fact, by assumption 1 and the fact that {(e',e^),i = l , . . . , n } are i.i.d., we have E[w((3)} — E{E[w({3) \xilXj}} = E{{I{xi < Xj) - I(Xi > Xj)]E[I{eV(p) < e<(/?)) I{e\{f3) > e'j(P)) \xi,Xj]} = 0. This proves the theorem. For any b, rewrite w(b) = [I(xi < Xj) - I(xi > Xj)] [IMP)
< ejtf) - (b -/3)(Xj
-
Xi))
-I(ei(J3)>ej(0)-(b-P)(xj-xi))] d
=k(Qi:Qj;b-(3),
where Qi = (xj,e',e[). Define a U-statistic
Kn{b)=(n\
Zn{b-(3)=(f\ \
/
\
^(Q^Qjib-P). t<j
/
Let n(s) = E[Zn(s)} = E[k(Q1,Q2;s)] and 7i(s) = va.r(E[k(Qi,Q2;s)\Qi}). From the definitions, it is easy to see that r](s) and 7i(s) are bounded. Furthermore, 77(0) = 0, and r](s), under assumption 2, is differentiable in a small neighborhood of 0. Theorem 4.2. Under assumptions iV(0,47i(0)/r?'(0) 2 ) as n -> 00.
1 and
2,
n 1 ' 2 (/3 — P)
—*
Proof. Since P{nl'20
-p)<s)
= P(0 > Kn{n-x/2s
+ /3))
by Lemma 1 and Kn{0) = 0 = P(Jn~Zn(n-xl2s)
< 0),
it suffices to consider the limiting distribution of
y/KZn(n-^2s)
= V^[Zn(n-1/2s)
- vin'^s)}
+
^ L I ^ ^ I
S
.
By Theorem 3.3.1 of Ronald and Wolfe (1979), Vn-[Zn(s)-r,(s)]±>N(0,4li(s)). Since N(0,iji(s)) is a continuous random variate, the above convergence is uniform in s in a small neighborhood of 0. Thus yfa[Zn(n-V23)
- V{n-1,2s)\
h JV(0,4 7 i(0)).
242
Meanwhile r](nn~ll2s
^ s ^ r f
'(0)s.
As a consequence, y/nZn(n -l'2s)
-h N{r]'(0)s ,4 7 i(0))
The theorem is proved. To find a consistent estimator of the variance of 0, let 72 (s) = cov[k(Qi,Q2;s),k(Qi,Q2;s)]. Note that, for any u> € (0,0.5), HQi,Qi\n~°-5+UJ + 0-0) = w(0 + n-°-5+") does not depend on 0. It is known (See (3.1.18) of Ronald and Wolfe, 1979) that
Var{Zn{s)) =
Theorem 4.3. For any
UJ
^3iM+
n(n — 1)
*»(»)_
n(n — 1)
e (0,0.5), n 0 5 " " Zn(n-°-5+"
+ 0-0)^
T/'(0).
Proof. Write
- V(n-0^n)
= n™-«[Zn(n-™+n
+ ^ ' ^ . i Z
^
(4)
•
Since E [n°- 5 - u [Z n (n-°- 5 + w ) - r/(n-°- 5 + w )]] 2 = nl~2uVar{Zn{n-°-b+UJ))
-+ 0
and [ri(n-0-5+u) - 77(0)]/n" 0 5 ^ -> r?'(0), we have (4)-> r/(0). So, if we rewrite n0-5'"Zn{n~°-5+UJ + 0 - 0) = n 0 - 5 -"[Z„(n-°- 5 + " + 5+ 5 ; 0 - 0) - Zn(n-°- ")} + n°- -" Z n (n- 0 - 5 + a '), it suffices to show that n°- 5 - w [Z n (n-°- 5 + w +0-0)Zn(n-°-5+u)] -^ 0. To this end, note that P(|n°- 5 - w [Z n (n-°- 5 + w +0-0)
5(1 w)
-
Z„(n-°- 5 + w )]| > e)
|/3-/3|>e)
5 a;
+P(|n°- - [Z„(n- 0 - 5 + w + n-°- 5 ( 1 - w ) e) - Zn{n-°-5+UJ)}\ 5
+P(|n°- -"[Z„(n-°-
5+w
- „-°-5(i-<")c) - Z n ( n - ° -
5+u
> e)
) ] | > e).
By Theorem 4.2 and Chebychev's inequality, the first term of the inequality converges to zero. We now show that the third term converges to zero. The second term can be done in a similar fashion. Since, by the expression of Var(Z„(-)), n 1-2aj 'E{Z n {n~ 0 ^ +UJ -
243 n-0.5(l-U)e)_TI(n-0.S+u,
n -o.5(i-u;) e )
_ „-0.5(l-u,) e )]2 _,
_ v(n-o.5+u
5 w
_ n -o.5(i- w ) e )j 4
5
Qj w e g e t n 0 . 5 - W [ Z n ( n - 0 . 5 + w
_
Analogously, we can show
0
that n°- - [Z n (n-°- +") - r?(n-°- )] •£ 0, and that n°- 5 -"[r/(n- 0 - 5 +" n -o.5(i-w) e ^ _ ^^-0.5+u^] _> o. Combining the above inequalities yields n0-5-"[Zn(n-0-5+" orem.
5+u
-n-0^1-"^)
- Zn(n-°-B+")}
4 0. This proves the the-
We now consider the estimation of 71 (0). It can be verified that 71 (s) = E[k(Qi,Q2, s)k(Qi,Q3, s)], which is continuous in s. Define
7i(*) = ^ D T ^ T W E
E
KQl,QJ,s)k(Qi,Qhs)
+ faTTyjE*2<^. **)]•
(5)
T h e o r e m 4.4. As n —> 00, 71 (s) — 7i(s) —> 0 uniformly in s.
Proof. Since k2(Qi,Qj,s)
< 1, j^Ej^i^iQuQj.s)
<
J^J-
Hence the second term of (5) converges to zero as n —> 00, and we only need to consider the first term. Given Qi,
,w
2
xE
E
HQi,Qj,s)HQi,Qi,s)
is a [/-statistic with kernel k(Qi,Qj,s)k(Qi,Qi,s), 60(Qi,s) = E[k(Qi,Qj,s)k(Qi,Q,,s) h(Qi,s)
I ^ j , I ^ i. Define
\QX], I / j , I ? i
= cov[k(Qi,Qj,s)k(Qi,Qi1,s),k(Qi,Qj,s)k(Qi,Qi2,s)
\Qi],
3±h +h ^ i h(Qi,s)
= cov[k(Qi,Qj,s)k(Qi,Qi,s),k(Qi,Qj,s)k(Qi,Qi,s)
\Qi],
Then \5k(Qi, s)\ < 1, k = 0,1,2. One can verify straightforwardly that Var(0(Qi,s)|QO =
_
2
_
[2(n - 3)5^,3)
+ 52(Qi,s)},
I ^ j , I ± i.
Thus, for 0 < w < 0.5, (n - l ) 1 " * ^ [[<j>{Qu s) - S0(Qi, s)]2\Ql] -» 0 uniformly in Qi a n d s. Consequently, letting A» = 4>(Qi,s) — So(Qi,s)
244
yields E(Af) = 0(n 1) uniformly in Qi and s. Based on the above discussion, we can write, approximately,
7i(s) =
j^w2 h
^^ h
Furthermore, since E[± £ " = 1 A * ] 2 ^ l?™2 • max(£(A 2 )) = O ^ - 1 ) , we have 7j(a) « ^Ei=i"*o(Qi,*) - £[Q[*(Qi,Qj,*)*(Qi,Qj,*) IQi]] = 7i(s) uniformly in s, which was to be shown. T h e o r e m 4.5. As n —• oo, 71 (/3 - /3) - 71 (0) —> 0. Proof. ^[[7i(/3-/?)-7i(0)]2'
= £ [s[(7i(/3 - /?) - 7i(4 - P) + 7 i 0 -P)-
7i(0))2|/3 - /?]]
< 2E [£[(71 (/? - /?) - 71 (/? - /?))2 I/? - fl +2£
[E[(7I(/3
- / ? ) - 7i(0)) 2 |/3 - / ? ] ] .
By Theorem 4.4 and the continuity of 7i(-) at 0, the right hand side of the above inequality converges to zero in probability. This proves the theorem. It is readily seen that n0*-" Zn(n-°-5+u
+ /3-/3) and 71 (p - p) do not
depend on the parameter (3, so the limiting variance of j3, 47i(0)/?/(0) 2 , can be estimated consistently by [47i(/3-/3)]/[n°- 5 " u 'Z n (n~ 0 - 5+u ' + ^ - / ? ) ] 2 . Confidence intervals for /3 can thus be constructed using the asymptotic normal distribution. 5. Application a n d Conclusion Finkelstein and Wolfe (1985) compared two treatments for breast cancer patients. The interval-censored observations arose in the follow-up studies for patients treated with radiotherapy and chemotherapy or with radiotherapy alone. The survival time (in months) is the time until cosmetic deterioration determined by the appearance of breast retraction. To compare the treatments, we use a dummy variable as the only covariate. The least-square estimate of (3 is -0.28, and the non-parametric estimate is -0.32 using u) = 0.03. These results are similar to Finkelstein and Wolfe and
245
indicate t h a t t h e patients receiving chemotherapy alone h a d a significantly slower deterioration t h a n those receiving b o t h chemotherapy and radiotherapy. We have also used other values for w and obtained similar results. T h e impact of different values of u is not significant. We performed simulations for b o t h of t h e methods. T h e simulation results show t h a t the proposed methods work very well. In the case of one covariate, these two methods provide similar point estimates. For t h e non-parametric method, t h e coverage probability of confidence interval for (3 is close t o its nominal probability whether the d a t a set contains exact observations or not. So, we believe t h a t t h e proposed methods can be used in practice for small to moderate sample sizes.
References 1. Buckley, J. and James, L. (1979). Linear regression with censored data. Biometrika 66, 429-436. 2. Finkelstein, D.M. (1986). A proportional hazards model for interval censored failure time data. Biometrics 42, 845-854. 3. Finkelstein, D.M. and Wolfe, R.A. (1985). A semiparametric model for regression analysis of interval censored failure time data. Biometrics 4 1 , 933-945. 4. Groeneboom, P. and Wellner, J. A. (1992). Information bounds and nonparametric maximum likelihood estimation. Birkh&user Verlag, Basel. 5. Huang, J. (1996). Efficient estimation for proportional hazards models with inter censoring. Ann. Statist. 24, 540-568. 6. Li, L. and Pu, Z. (1999). Regression models with arbitrarily interval-censored observations. Comm. Statist.-Theory and Methods 28, 1547-1563. 7. Li, L., Watkins, T. and Yu, Q. (1997). An EM algorithm for smoothing the self-consistent estimator of survival functions with interval-censored data. Scand. J. Statist. 24, 531-542. 8. Odell, P.M., Anderson, K.M. and D'Agostino, R.B. (1992). Maximum likelihood estimation for interval-censored data using a Weibull-based accelerated failure time model. Biometrics 48, 951-959. 9. Peto, R. (1973). Experimental survival curve for interval-censored data. Applied Statistics 22, 86-91. 10. Randies, R. H. and Wolfe, D. A.(1979). Introduction to The Theory of Nonparametric Statistics. Wiley: New York. 11. Rabinowitz, D., Tsiatis, A., and Aragon, J. (1995). Regression with interval censored data. Biometrika 82, 501-513. 12. Turnbull, B. W. (1976). The empirical distribution function with arbitrary grouped, censored and truncated data. JRSS B 38, 290-295. 13. Yu, Q. and Wong, G. (2000). A semi-parametric maximum likelihood approach to linear regression with right censored data. Submitted for publication.
W H E N C A N T H E H A S E M A N - E L S T O N P R O C E D U R E FOR QUANTITATIVE T R A I T LOCI BE I M P R O V E D ? INSIGHTS FROM OPTIMAL D E S I G N THEORY
Z H A O H A I L I 1 - 2 , M I N Y U X I E 1 ' 3 A N D J O S E P H L. G A S T W I R T H 1 ' 2 1
Department
of Statistics,
' Biostatistics Branch, Cancer Institute, Department
George Washington University, Washington DC 20052.
2201 G Street
NW.,
Division of Cancer Epidemiology and Genetics, National 6120 Executive Blvd., EPS, Rockville, MD, 20852
of Mathematics,
Central
China Normal
University,
Wuhan,
China
A wide variety of statistical methods are used by geneticists to locate human disease genes. The purpose of this paper is to show that theory of statistical optimum design provides insight into the properties of the study designs used to discover genetic linkage and indicates situations where they might be improved. The properties of the classic Haseman-Elston (H-E) procedure are examined from the viewpoint of D-optimal design theory. The H-E method regresses the squared difference of the trait values of a pair of siblings on the number of alleles they share that are identical by descent (IBD). When the available genetic information enables the IBD status of all the sib-pairs in the study to be fully determined, the H-E method is optimal. When the IBD status of a meaningful fraction of the sib-pairs needs to be estimated, i.e., there is a potential for misclassification, an alternative design is shown to have more power. This is relevant to late onset diseases as IBD status typically must be estimated from genetic information on relatives rather than parents.
1. Introduction Recent advances in molecular genetics have provided new tools for the genetic study of complex human traits. Sib pair methods are commonly used to determine whether a complex trait has a genetic component. Penrose (1935, 1953) introduced this approach based on the idea that sib-pairs with similar (dissimilar) phenotypes have an excess (deficit) of allele sharing. These methods were further developed by Haseman and Elston (1972), Suarez, et. al. (1978), Risch and Zhang (1995, 1996), Zhang and Risch 246
247
(1996), Zhao, et. al. (1997), Gu and Rao (1997a, 1997b), and others. Under suitable assumptions, Haseman and Elston (1972) showed that the expected squared difference between the quantitative trait values of two sibs is a linear function of the proportion of alleles IBD shared at the marker locus being tested. When the marker locus is linked to the trait locus, this slope will be negative, i.e., it is a function of the recombination frequency between the trait and marker loci. When the marker is not linked to the trait locus, the slope is zero. Thus, testing whether the slope 71 — 0 or < 0 has been widely used in sib-pair linkage analysis. The power of the original test can be increased when sib-pairs that are highly discordant can be obtained (Cary and Williamson, 1991; Eaves and Meyer, 1994; Risch and Zhang, 1995; Dudoit and Speed, 2000; Li and Gastwirth, 2001 and Feingold, 2001). The reason for this is that under the alternative of linkage these pairs should have lower than expected IBD sharing. Since the classical HasemanElston (H-E) methods are based on the allele sharing principle and are easy to implement, they remain very useful techniques for linkage studies (Stoesz, 1997; Xu et al, 2000). In section 2, D-optimal design theory is applied to the H-E sib-pair procedure assuming we could sample sib-pairs according to their IBD status. The optimal design focuses attention on pairs with IBD=0 (discordant) and IBD=2 (concordant) providing insights into the properties of procedures using phenotypes (Risch and Zhang, 1995). The power and sample size of various procedures is discussed in section 3. The asymptotic variance of original H-E test statistic when the alternative (linkage) holds is presented there.
2. The Optimal Design for the Haseman-Elston Procedure Let x\ and x2 denote the trait values for each member of a pair of sibs. Following Haseman and Elston (1972), we assume that the trait values have the following structure: xi =H + gi + e i ,
x2 = /i + 92 + e2
where fi is the overall mean, the gi represent genetic contributions to the trait value x,, and e% the residual. We consider a single locus with two alleles A\ and A2. The allele frequencies of Ai and A2 are p and q = 1 — p, respectively. The distribution
248
of gi is as follows: A2A2 -a q2
MM d 2pq
Ai_Ai a p2
Then, the additive genetic variance is a2 = 2pq[a + (q — p)d\2 and the dominance variance is a\ = (2pqd)2. The total genetic variance is the sum of the additive and dominance variances, that is, a2 — a\-\- a\. Let a2 = E(ei - e 2 ) 2 . Following Haseman and Elston (1972), we assume that there is no dominance, i.e., G\ = 0. Haseman and Elston (1972) showed that the expected squared difference of the observed trait values of a sib-pair, conditional on the proportion of alleles shared IBD at the marker locus satisfies the regression equation E(y\irm) = a\ + 2a2V + 2a2g(l - 2tf)7rm =
7o
+ 7 i7r m = (l,7r m ) 7 ,
(1)
where j / = (an - z 2 ) 2 , 7o =
(2)
where a2 is the variance of the observation value y and
( is the corresponding design matrix.
'•no ^ « o
\
l « i 2^"i I ln2 ln2
/
(3)
249
The notation lni in (3) is a column vector with rij elements of ones, i = 0, land 2. Note that
MC*)" 1 (?) - i5kj.
W
where |X'X| is the determinant of the matrix X'X. Thus, a design minimizes (2) if and only if it maximizes i | X ' X | , i.e., it is D-optimal. The D-optimal design for model (1) is ( f , 0 , §) (Section 9.5, Pukelsheim, 1993). That is, half of the sib-pairs should have IBD = 0 (n = 0) and the other half should have IBD = 2 (IT = 1). An elementary proof of the D-optimality of this design follows: Prom (3)
\x'x\ = \(n1
fni + M |
= (n2 +
I ni)(no + i n i )
< ( » 2 + -n 1 )(no + ^ n 1 ) < ( | ) 2 .
*
n
(5)
Let r = ( " 2 4 j " 1 ) and 1 - r = ( " 0 + j n i ) . The fact that r ( l - r) attains its maximum value | at r = ^ implies the last inequality in (5). Thus, when no = §,14 = 0,n2 = f, l-^'-Xl = (f ) 2 attains its maximum value. It is worth emphasizing that the D-optimal design for H-E regression procedure derived above is based on the number of alleles at the marker locus that are IBD. In practice one observes the trait value before genotyping so the D-optimal design cannot be used directly. It can be approximated, however, by selecting sib-pairs with extreme phenotypes. If linkage exists, the trait values of those sib-pairs with IBD = 0 (2) at the marker locus are likely to be very different (similar). Therefore, the D-optimal design for model (1) suggests that the sib-pairs with extremely discordant (ED) or extremely concordant (EC) trait values contain most of the information relevant to linkage. This yields additional insight into the properties of methods based on these types of sib-pairs. (Risch and Zhang, 1995; Li and Gastwirth, 2001). As noted by Haseman and Elston (1972), the proportion of marker alleles sharing IBD is usually estimated. In terms of the estimated proportion 7fm (1) becomes E(y\nm)
= 70 4- 7i7rm,
(6)
where 70 and 71 are as those in (1). Now nm takes values | , i = 0,1, 2, 3,4
250
(Haseman and Elston, 1972). The design matrix becomes
X
/ln00 \ lni (l/4)lni l „ a (2/4)l n a l n a (3/4)lns L
n4
A
(7)
n4
where n* is the number of sib-pairs with the estimated proportion of marker alleles sharing IBD being \ (%m = \). Note that (2) and (4) still hold where X is now given by (7). The D-optimal design for model (6) is: no = §, rii = n-i = 713 = 0,714 = §. This follows from l^i=0
\X'X\ Z-ji=0
n
i i
n2[y(1)2^1
i=0
7TI;
Z-ii=0\i)
Ui
_ , y 1111)2] < n 2[V- ijh _ , y i«ix 2l i=0 4
i=0
-j(i:i>-Ei?)s(5)". J n.
.2/
i=0
i=0
w
i=0
This upper bound is achieved when no = f ,ni = n2 = TI3 = 0,ri4 = ^, When the IBD status of some sib-pairs needs to be estimated we show that the linkage test based on the above D-optimal design is more powerful than the usual one in the next section. 3. Power Comparison This section presents analytical and numerical comparisons of the power of two tests: the usual Haseman-Elston procedure and the one using the D-optimal design. First, we consider the regression model (1). The standard H-E test statistic for testing Ho : 8 = \ vs. H\ : 8 < | is based on . _ (n 0 + \nx) YJjLi V2j + \{np ~ n2) E " l i Vij ~ ("2 + \n{) Y.%i Voj "o"2 + \n0m -I- \nin2 (9) where y^ is the squared difference of the trait values of the j'-th sib-pair with i marker alleles shared IBD. We calculate the approximate power or required sample size under the large sample theory for the statistic (9). As a random sample of sib-pairs is taken, their IBD distribution is 0 (w.p. | ) , 1 (w.p. i ) , and 2 (w.p. | ) .
251
When the sample size is large, the law of large numbers (LLN) implies that 77.
71 n
0 -
T,
n
-
l
4 Then, 71 (9) is approximately
77. a n d
7T
n
2
-
T-
2
4
n/4
n/4
ft = - £ > * - ! > * ) • Theorem 3.1.
(io)
Under HQ and assumption (13), 8<72
^0(71*) = 0 ,
Vartf 0 (7?) = — ,
(H)
w/iere a 2 = 2 ^ + CT2(4CT2 + ^
VarHl(T1)
= ^
2
+ \
(12)
Proof: Recall that the trait values of a sib-pair satisfy i y = A* +
(13)
For notational simplicity, we drop the index j for j t h sib-pair and let y = (xi — X2)2 and e = ei — e2Under assumption (13), we first show that Var(y\7rm) = 2o\ + 2<j2g(4a2 + a 2 * + a 2 ) + 2
2^)(4CT 2
+ CT2 + a2)7Tm (14)
where tf = 02 + (1 - 6)2. As a preliminary step, following Table I in Haseman and Elston (1972), we present the conditional distributions of the squared difference of the genetic contributions g\ and g^ defined at the beginning of the section 2 given the IBD status of the pair in Table 1. Using the assumption that the trait locus is in Hardy-Weinberg equilibrium and the joint distribution of IBD and identity by state (IBS), the entries in Table 1 are derived. First, we compute the conditional second moments of y given the proportion nt of alleles IBD at the trait locus. From assumption (13) and
252 Table 1.
The Conditional Distributions of (
0
(31 - 92 ) 2 \ 7T(
0
4
(p + 9 + 4 p V ) 4pg(p2 + q2)
2
a 4a 2
1
2 4
2pV
z
2
(P + ) 2pq 0
1 0 0
Table 1, we have E(y2\irt = 1) = Ee4 = 3<xe4, 2
(15)
4
4
2
2
^(y |7Tt = 0) = Ee + E[(9l - g2) \nt = 0] + 6 £ e £ [ ( 5 l - 2) |7rt = 0] = 3
+ 3a2+a2).
(16)
Similarly, it can be shown that E(y2\ift = \)=
3«7e4 + a2(a2 + 6a2 + 3a 2 ) - 3a 4 .
(17)
Formulas (15), (16) and (17) are summarized in E(y2\nt)
= 3ae4 + 2a2(6ae2 + 3a 2 + a2) - 2a 2 (6a e 2 + 3a 2 + a2)vt - 12a47r((l - 7rt).
(18)
From (18) and Table IV of Haseman and Elston (1972) and their assumption that given the IBD configuration at the trait locus the trait value is independent of the marker, we obtain the conditional second moment of y given nm: E(y2\nm)
= ^2E(y2\nt)P(7Tt\irm)
= 3ae4 + 2a 2 (6a 2 + 3a 2 + a2)
- 2a 2 (6a e 2 + 3a 2 + a2)E(-Kt\nm) - 12
(19)
2 where c = 2a 2 (6a 2 + 3cr2 + aa2).) 2 Since Var(y\irm) = E(y \Trm) - [E(y\Trm)}2, (14) follows from (19) and
(I)-
Because least squares estimators are unbiased, Ei{0(Afl)
EHAII)
=7i-
= 0 and
253
Prom (14), the null variance of y is a2 = Var(y) = VarHo(y\nm)
=
Var(y\irm)\y=i
= 2oi+
(20)
Under H0, from (10) ._. 16 n , 8<72 VarHo{ll) = -2-2o2 = — Under the alternative, (14) yields a2 = VarHl {y\nm = 0) = 1a\ +
2
+ a2gV + a2)*
a\ = VaruM^m
2
+ a2g+ a2)
(21)
and = 1) = 2a\ +
- 2a2gV(4(T2 + cr2g(2 - V) + a2).
(22)
Using (10), it follows that VarHl (7i*) = ^2j[VarHl{y\^m
= 1) + ^ar W l (y|7r m = 0)]
completing the proof. For an a level test, the H-E statistic rejects Ho : 0 = ^ when the standardized 7i is less than Za =
UR =
. 4 ( l 2 - W { Z a " Zl-/3[1 + ^
( 1
" 2^)2]"}2-
(23)
Under the D-optimal design for model (1), the least squares estimator 7i becomes n/2
n/2
Under #o, EHo{l?)
= 0, VarHo{tf)
= ^ ,
(25)
Under Hi, ^ ( 7 ? ) = 71, Vor f f l ( 7 l D ) = ^[
(26)
254
Formulas (25) and (26) are derived by the same technique used in Theorem 3.1. Hence, the required number of sib-pairs to achieve the desired power 1 — (3 for a level a test based on 7 D is UD
=
.4(1 - 2 2 * ) 2 { Z " "
Zl 0[1
~
+
^
4
(
1
~ 2*)2]"}2"
( 2 7 )
The sample size no for the D-optimal design is one half that required for the standard H-E procedure. Thus, the D-optimal H-E method appears to be more powerful than the standard one. In practice, however, in order to obtain nr> sib-pairs where about one half have 0 IBD and the other half have 2 IBD, one needs to examine approximately 2no sib-pairs, i.e., UR = 2nn- Thus, when the IBD status of the marker locus is known, the original Haseman and Elston procedure is essentially equivalent to the Doptimal design as both require the same number of sib-pairs to be screened. Next, we compare the powers of the tests when IBD status is estimated (6). To distinguish this situation from the case where IBD status at the trait is completely known, we use A* in place of 71 The H-E linkage test statistic is A
i =
Eto(ELo(* ~ fcK)(i E"4i Vij) wt ,i,n 7^3—:—~—' nELiCD^-tElii^)2
^
where j/y is the squared difference of the trait values of the j - t h sib-pair with Ttjm
=
41 * =
" 1 1> •••) 4 .
From Table V of Haseman and Elston (1972), the estimated proportion of marker alleles IBD has the distribution: ) = = 4) -) = = -^Pmti )-mim, P(*m = -4) = PP(7T ^rnm = 1 3 p j) = = P(n P{*mm = = 7) 4) = = 2p '^PmqmiP'm km/' {*m = 7) ++q^), mqm(pm p
2
(*m = 4) = pm + 5p2mqm + qmwhere pm is the allele probability at the marker locus and qm = 1 — pm and rii are defined as in (7). By the LLN, n 0 — TI4 and nj ~ 713 so Ai is approximately x. = 1
4 ( E " ° 1 V4j - S " ° 1 voj) + 2 ( S ^ ! y3j - E ; = i yy) 4n 0 + ni
Theorem 3.2.
Under Ho and assumption (13), EHo(Xl) = 0,
VarHo{\\)=
8(T2
4n 0 + ni
.
(30)
255
Under Hi and assumption (13),
*»,<*!> - 7 ,
Va^.W) . _ ^ _
+
j^^.^(1_„,..
(31)
Proof: It is easy to show that EH0(^\) = 0, ifo^A*) = 71. Under ifo> ^ = 5 so (31) reduces to the variance in (30). Hence, we only derive (31). From Table V of Haseman and Elston (1972), we have J5(7rm|7rm) = 7rm and E{irm(l -irm)\irm} = 7r m (l -7r m )[§ - 8 ( 6 - \)(itm - \)(nm - §)], where
b=srr^^T=rE(y2\nm)
Using (19),
= ^ E ( 2 / 2 | 7 r m ) P ( 7 r m | 7 r m ) = 3ae4 + ctf + c(l - 2^)E(7rm\nm)
6
- *)
- 12
= 3<74 + C* - 6
- 12a 4 (l - 2*) 2 7r m (l - * r o )[§ - 8(6 - |)(7r m - !)(*„> - | ) ] . (32) As ^ar(2/|7r m ) = £(2/2|7rm) - [£(y|7r m )] 2 , (32) and (6) imply Var(y\#m)
= 2o\ + 2
Let Si = Var(y\wm = | ) , i = 0,1,2, 3,4. From (31), (21) and (22), the formulas for Si are: 1 2 Sx = 2o-l + 2
2
,
(34)
J 3 = 2ae4 + 2
(35)
<^o = erg and £4 = a2,
(36)
2
where a , and a\ are given in (21) and (22). Using (34), (35), and (36), we obtain 5 3 + ^ i = 2
- 2
256
Prom (29), ,r
,\*\ _ 16no[yar gl (y 4j -) + VarHl(y0j)] + 4ni[Vfflrgl{y3j) + ^ar// x (yij)] (4n 0 + n1)2 _ 16n0[Var(y\nm = \) + Var{y\nm = \)\ (4n 0 +n{)2 4ni[yar(y|7r m = f) + yar(t/|7Tm = |)] (4n 0 + n i ) 2 _ 16n 0 (J 4 + <5Q) + 4ni(fe + Si) (4n0+ni)2 8a 2 4n 0 + n!
8no-ni (4n 0 + n i ) 2
+
4(1 _
2^)2
9
In order for the test statistic, AJ, to achieve the desired power 1 — (5 with level a, the following equation must hold: 4710 + n i =
4 ^ ( 1 - 2M>)2
•
(37)
By the law of large numbers, at the marker locus we have TlO
—
1
2
2
n
n
^l
-> -PmQm,
—
z
I 1
- » IPmqmiPm
1 \
+ Qm)-
n
Thus
8n 0 - n! (pmqm)2 1 ; —; • 4n 0 + Tli 1 - Pm9m From (37), the required number of sib-pairs for the usual H-E procedure
o / 2 , , 2\ A 4n 0 + nx ~ 2pmqm{pm + PrnQm + 9 m ) n
and
is 1 2pmqm (p2, +pmqm + qm)
R
{Za2V2a
- Z1.0[8a2
X
- ( M ^ 2 q 4 ( l - 2*)2]^}2
4
(38)
The D-optimal design H-E test for model (6) is based on n/2
n/2
U
U
n
Prom (39) and (36), we have VarHl(\?)
= VarHl(tf)
=V
= ±%(50 2
+ ^(l-2*) ].
+ &A) = ~[2a2
+
(40)
257
The test statistic Af not only is similar to -yf, it has the same mean and variance of j ^ under HQ and H\, respectively. By standard arguments, the required number (n*D) of sib-pairs for the D-optimal H-E test when the 7rm are estimated, i.e., (6), is the same as n ^ in (27). An important consequence is that the sample size (no) needed to achieve a preset power 1 — j3 using the D-optimal design is less than the sample size (n*R) required by the original H-E design. Formally, we have Theorem 3.3. n*R > nD.
(41)
Proof: To prove the inequality (41), let k = 2pmqm(p2n Expression (27) for no can be written as UD
+pmqm+qm)-
4.f(l-2tt) 2
=
•
(42)
By comparing (38) and (42) and recalling Za < 0 and Z\~$ > 0, we only need to prove that 8<72
,
>4
(43)
and I[8(T2 _ k
\2
2(Pm
1
—
_
2$)
2] >
4(T2 +
2 o .4 ( 1
_
2^2
(44)
PmQm
Since k < 2pmqm < \, 8a 2 9 9 - — > 16o-2 > 4cr2. k To obtain (44), the fact that /1
o,T,\2 ^ i „„A (?™_ ~ 9m)_2 _ 1 - 4 p m q m
VJ.
* * ,
-
1 - PmQm
1
<
1
PmQm
implies (>m - 9m) 2
^
•*•
= \{^t +
2 K
PmQm
^(4
2
+
2
a )] +
2ag4}
[by (20)]
> 16[2CT4 4• • 7 2 ( 4 ( T 2 + a 2 ) ] + 4 ( 7 4 > 4 [ 2 ^ 4
2
= 4<J2 + 2or ( l - 2 * ) .
[by (20)]
Usa2- i°t\
4 + a 2 (4<7 2 +a 2 )] + 2a* + 2ag (l - 2*)^
258
Since the D-optimal design utilizes only the sib-pairs with 0 or 2 alleles IBD, and only half of a screened sample are expected to have these IBD values, one should double no in order to compare the sample sizes of the optimal design to the usual H-E procedure. The number of required sib-pairs to achieve power 1 — /? for a level a test for an additive model with a = 1.5, d = 0, a\ = 1 and allele frequency p = 0.5 at the trait locus is given in Table 2. The numerical results show that the sample sizes for the D-optimal H-E procedure are noticeably smaller than those needed by the original H-E procedure. However, based on formulas (23) and (27) the same number of sib-pairs need to be screened since the D-optimal design ignores pairs with IBD=1. In order to have at least ^- sib-pairs of both types (0 or 2 IBD) in practice one will need to sample slightly more than 2no sib-pairs. Simulation showed this excess was small. Of course, the expected power of the test will also be slightly increased. Table 2. T h e Number of Sib Pairs t o Achieve Power 1 — /3 with Significance Level a for Standard and D-Optimal H-E Procedures a = 0.01
n
R
e
1-/3 = 0.90 0.1
0.2
0.3
0.4
0.1
2692
8529
43221
691698
0.3 0.5
1332 1180 446
4214 3730 1402
21340 18883 7084
341475 301238 113306
Pm\
2nD a = 0.01
0.2
0.3
0.4
0.1
3258
10330
52354
837881
0.3 0.5
1613 1429 542
5104 4518 1700
25850 22874 8584
413642 365992 137252
Pm\
n
R
2nD
e
1-/3 = 0.95 0.1
4. Discussion The H-E (1972) linkage test is commonly used for locating a quantitative trait locus involved in complex diseases. By replacing a random sample of sib-pairs with a selected sample increases the power of the test (Cary and Williamson, 1991; Eaves and Meyer, 1994; Risch and Zhang, 1995). In this paper, we use D-optimal design theory to provide insight into the Haseman
259
and Elston and related methods of linkage analysis. These results imply t h a t one should select t h e sib-pairs with either 0 or 2 alleles I B D . Currently, it is not feasible t o sample pairs of individuals on t h e basis of allele sharing ( I B D = 0 or 2). Thus, the D-optimal design cannot be implemented directly. T h e results indicate t h a t designs maximizing the fraction of sib-pairs with estimated I B D = 0 or 2 provide t h e greatest information for detecting linkage. W h e n there is linkage, procedures selecting highly discordant or concordant sib-pairs should have t h e largest fraction of sib-pairs with estimated I B D = 0 or 2 and yield a powerful study (Risch and Zhang, 1995; Li and Gastwirth, 2000). It should be noted t h a t there are other reasons for using sib-pairs with I B D = 0 or 2, which is equivalent to using b o t h E D and E C pairs in linkage studies. Elston et al. (1996) observe t h a t t h e E D or I B D = 0 pairs can be regarded as a control group t o which the E C or I B D = 2 pairs can be compared (Guo and Elston, 2000). W i t h the completion of t h e h u m a n genome project and new technology for genotyping and storing samples, in t h e future it m a y be possible to use genotype information in t h e design of studies (Human Genome Project, 2001). Of course, appropriate privacy and confidentiality procedures would need t o be established before such information is m a d e available t o researchers. Acknowledgments This research was partly supported by National Cancer Institute grant CA64363 (Z.L. and M.X.), National Eye Institute EY14478 (Z.L.), an N S F grant (J.L.G.), and China N a t u r e Science grant (M.X.). T h e work for t h e first author was partially done while t h e author was visiting t h e Institute for Mathematical Science, National University of Singapore in 2002. References 1. Carey, G. and Williamson, J. (1991). Linkage analysis of quantitative traits: increased power by using selected samples. American Journal of Human Genetics 49, 786-796. 2. Dudoit, S. and Speed, T.P. (2000). A score test for the linkage analysis of qualitative and quantitative traits based on identity by descent data from sib-pairs. Biostatistics 1, 1-26. 3. Eaves, L. and Meyer, J. (1994). Locating human quantitative trait loci: guidelines for the selection of sibling pairs for genotyping. Behavior Genetics 24, 443-455.
260 4. Feingold, E. (2001). Methods for linkage analysis of quantitative trait loci in humans. Theoretical Population Biology 60, 167-180. 5. Gu, C. and Rao, D.C. (1997a). A linkage strategy for detection of human quantitative trait loci. I. Generalized relative risk ratios and power of sib pairs with extreme trait values. American Journal of Human Genetics 6 1 , 200-210. 6. Gu, C. and Rao, D.C. (1997b). A linkage strategy for detection of human quantitative trait loci. II. Optimization of study design based on extreme sib pairs and generalized relative risk ratio. American Journal of Human Genetics 6 1 , 211-222. 7. Guo, X. and Elston, R.C. (2000). Two-stage global search designs for linkage analysis II: including discordant relative pairs in the study. Genetic Epidemiology 18, 111-127. 8. Elston, R.C., Guo, X., and Williams LV. (1996). Two-stage global search design for linkage analysis using pairs of affected relatives. Genetic Epidemiology 13, 535-558. 9. Haseman, J.K. and Elston, R.C. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behavior Genetics 2, 3-19. 10. Li, Z. and Gastwirth, J.L. (2001). A weighted test using both extreme discordant and concordant sibpairs for detecting linkage. Genetic Epidemiology 20, 34-43. 11. Penrose, L.S. (1935). The detection of autosomal linkage in data hich consist of pairs brothers and sisters of unspecified parentage." Annals of Eugenics, 6, 133-138. 12. Penrose, L.S. (1953). The general purpose sib-pair linkage test. Annals of Eugenics 18, 120-124. 13. Pukelsheim, F. (1993). Optimal design of experiments. John Wiley & Sons, New York. 14. Rao, D.C. (1998). CAT scans, PET scans, and genomic scans. Genetic Epidemiology 15, 1-18. 15. Risch, N. and Zhang, H. (1995). Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268, 1584-1589 16. Risch, N. and Zhang, H. (1996). Mapping quantitative trait loci with extreme discordant sib pairs: sample size considerations. American Journal of Human Genetics 58,836-843. 17. Stoesz, M.R., Cohen, J.C., Mooser, V., et al. (1997). Extension of the Haseman-Elston method to multiple alleles and multiple loci: theory and practice for candidate genes. Annals of Human Genetics 6 1 , 263-274. 18. Suarez, B.K., Rice, J., and Reich, T. (1978). The generalized sib pair IBD distribution: its use in the detection of linkage. Annals of Human Genetics 42 87-94. 19. Xu, X., Weiss, S., Xu, X., et al. (2000). A unified Haseman-Elston method for testing linkage with quantitative traits. American Journal of Human Genetics 67, 1025-1028. 20. Zhang, H. and Risch, N. (1996). Mapping quantitative trait loci in humans using extreme concordant sib pair: selected sampling by parental phenotypes.
261
American Journal of Human Genetics 59, 951-957. 21. Zhao, H., Zhang, H., and Rotter, J.I. (1997). Cost-effective sib-pair designs in the mapping of quantitative trait loci. American Journal of Human Genetics 60: 1211-1221.
A S E M I P A R A M E T R I C M E T H O D FOR M A P P I N G QUANTITATIVE T R A I T LOCI
JIAN HUANG1'2 AND KAI W A N G 2 Department
Department
of Statistics
and Actuarial Science, 241 SH, University Iowa City, I A 52242, USA
of Biostatistics, Division of Statistical Genetics, Iowa, Iowa City, IA 52242, USA
of Iowa,
University
of
We propose a semiparametric normal copula model for mapping quantitative trait loci when the normality assumption of the phenotypic values may not be satisfied. This model is constructed by first making an appropriate transformation (the normal-quantile distribution transformation) of the original data so that it is marginally normally distributed, then the joint distribution of the transformed data is assumed to be multivariate normal. Since we do not know a priori what form of the transformation will result in the marginally normal distribution for the data, we estimate it nonparametrically by use of the empirical distribution function. We then propose using a pseudo-likelihood ratio (LR) statistic for the detection of linkage of quantitative traits.
1. Introduction Identification of the chromosomal regions that affect a quantitative trait is an important first step towards understanding its genetic determinant. The Haseman-Elston (H-E) regression [Haseman and Elston (1972)] is a widely used method for such a purpose. This method is originally designed for sib pair data, it tests the slope in the regression of the squared trait difference of the pair on the estimated proportion of the marker alleles shared identicalby-descent (IBD). An important advantage of the H-E method is that it does not require normality assumption of the trait values. However, it has been noted that the sum of the trait values also contains information for linkage [Wright (1997); Drigalenko (1998)]. To fully extract the information contained in the data, Fulker and Cherney (1996) used the likelihood ratio (LR) test based on the multivariate normality assumption (normal LR test), using the individual sib values. Because the LR test does not require the preliminary data reduction as in the H-E method, it is more powerful when 262
263
the normality assumption is satisfied. However, in practice, the normality assumption is often violated. The violation of the normality assumption is a major concern for the methods based on the normal distributions, see for example, Lynch and Walsh (1998), Chapter 11; and Allison et al. (1999). It has been demonstrated that, when the trait distribution deviates markedly from normality and when there is residual correlation among sibs, the LR method based on the normality assumption can have an excess type 1 error rate or reduced power. [Allison et al. (1999)]. Therefore, while the H-E method is robust against departure from the normality assumption in their type 1 error rates, they do not make full use of the information contained in the data; and while the normal LR method is efficient when the normality assumption holds, it tends to have inflated type 1 error rate or reduced power when the normality assumption is violated. To combine the strength of the H-E and LR methods, we propose a semiparametric normal copula model for linkage analysis of quantitative traits. Our proposed approach reduces the reliance on the multivariate normality assumption and in the mean time, enable us to efficiently use the full information of multiplex sibship data. An important feature of the normal copula model is that there are two types of parameters. The parameters of main interest are the correlation parameters in the normal copula model. Evaluation of whether the trait is linked to a certain chromosomal location can be done by testing suitably formulated hypotheses concerning these correlation parameters. The second type of parameter is the (nonparametric) transformation. We usually do not know a priori what form of the transformation will result in the normal distribution for the data, since we do not know the underlying marginal distribution. We estimate it nonparametrically by use of the empirical distribution function. By use of this empirical distribution function in the likelihood, we obtain a pseudo-likelihood function. We then use a (pseudo) LR statistic for testing linkage.
2. The LR Test for Linkage Based on Semiparametric Normal Copula Distributions In this section, we first give the general form of the multivariate normal copula model. We then describe how to apply this model to quantitative trait data. Finally, we propose a LR test for linkage based on this model.
264
2.1. The Semiparametric
Normal
Copula
Model
Consider a random vector Y = ( Y i , . . . , Yky. Let Fj be the marginal distribution of Yj, j = 1 , . . . , k. Suppose that Fj is continuous. Then without any further assumption, we have FJ(YJ-) ~Uniform(0,l).
Let $ be the distribution function of N(0,1), and let Zj=$-1(Fj(Yj)), where $ _ 1 represents the inverse of $. Then Zj ~ N(0,1) without any assumption on the form of Fj. We call this transformation a normalquantile distribution transformation. To complete the specification of the distribution of Y through the joint distribution of Z — (Z\,..., Zky, we assume that Z is distributed as fc-dimensional multivariate normal distribution Nk(0,T,). Because Zj ~ 7V(0,1), the diagonal elements of £ are equal to 1. Let <&£ and !>£ denote the distribution function and density of iVfc(0, £ ) , respectively. Because P(YX < yi,...,
Yk < yk) = P(Z1 < Q-'iFM),
...,Zk<
$- 1 (F f c (y,))),
the joint distribution function of Y is $ s ( $ _ 1 ( F i ( t / i ) ) , . . . , $>~1(Fk(yk)))The joint density function of Y is
(i) Thus the dependence structure of Y is determined by the normal distribution, however, the marginal distributions of Y are unspecified. If we let Fj be the distribution function of N(nj,aj),j — 1 , . . . ,fc,equation (1) reduces to the multivariate normal distribution. Thus (1) is a natural semiparametric extension of the multivariate normal model. This distribution is often referred to as the multivariate normal copula model. For more discussions on copula models, see e.g., Genest and MacKay (1986) and Klaassen and Wellner (1997). 2.2. The Normal
Copula Model for Quantitative
Trait
Let Y denote the vector of the trait values of the sibship, and let M denote all the maker genotype data, including those of the parents if they are available. For example, in a genome screen with / markers covering part of the genome of interest, then M = ( m i , . . . , m(), where each rrij, 1 < j < I,
265
represents the marker genotype of the family at the jth marker location. Let there be J possible configurations of alleles shared IBD among the sibs at trait locus t. Let s(t) be the IBD configuration of the sibship of size k at the locus. The joint probability density of the data (Y, M) is J
p(y, M) = 5>(l/K*) = j)P(s(t) = j\M)P(M),
(2)
where j represents the j t h IBD configuration, and we have used that p(y\s(t) = j , M) = p(y\s(t) = j). Here we assumed that, given the IBD configuration at trait locus t, the trait is independent of the markers. We use the multivariate normal copula model for p(y\s(t) = j ) , the density of the trait values given that the IBD configuration s is j at trait locus t. We assume that the marginal distributions of Y are equal, and denote this common marginal distribution by F. This assumption is reasonable because the designation of the order of the sibs is arbitrary and because the allele sharing pattern does not affect the marginal distributions. We can represent s(t) by a symmetric matrix (sij, 1 < i, j < k), where su = 2, and for i ^ j , s^ = 0,1 or 2 is the number of alleles shared IBD by sibs i and j . Given an IBD configuration s, we define a correlation matrix E = (pij, 1 < i,j < k), where pa = 1, and for i ^ j , Pij
=Pl= C o r ^ - H T O ) ) , * " 1 ^ ) ) ! ^ ) = 0, * = 0,1,2.
Thus pij = po,Pi or pi according to s^ = 0,1 or 2. In other words, pij is the correlation coefficient of the normalized trait values given that the number of alleles shared IBD Sij(t) = I, I = 0,1,2. Let p = (po, Pi, p-i)- The elements of T, are completely determined by p. To explicitly indicate the parameters involved in the model, we write Pj(y;p,F)
=p(y\s(t)
=j),j
= 1,...,J,
where Pj(y; p, F) = p(y; T,j, F,..., F), as defined in (1). Because the entries of every T,j are determined by p, we only need to include p (and F) to indicate the parameters involved in the model. 2.3. The Likelihood
Ratio
Test
Suppose there are n nuclear families in the data set, the ith family has /q sibs. Let Yi = (Yn,..., YifcJ denote the trait values, and let Mi denote the marker data of the ith family, 1 < i < n. Let s;(£) be the IBD sharing
266
pattern of the ith family at locus t, which can take J, possible patterns. Denote the probability of the j t h pattern given the observed marker data by iTij(t) — P(si(t) = j\Mi). This probability can be computed from the Genehunter program [Kruglyak et al. (1996)]. By expression (2), the likelihood of the data at t is
Ln(p,F;t) = n 2{^(*)Pi(^;p.-F')}^(Mi)
(3)
i=l
where P(Mi) is the Mendelian probability of the marker data, which is independent of the parameters of interest and thus can be considered as a constant in the likelihood. We note that likelihood (3) is based on the joint distribution of marker and trait. Thus this likelihood is only correct for randomly selected families and approximately correct for moderately selected sibship data. This likelihood cannot be simply applied to extremely discordant sib pair data. However, analysis of extremely discordant sib pair data can be considered as a missing data problem, in which the pairs not selected for genotyping can be viewed as having their marker data missing. We do not discuss this issue in detail here. However, we note that one approach for applying our method to the extremely discordant sib pair data is to include all the untyped pairs and assign the prior IBD sharing probabilities to these pairs (Eaves et al., 1996; Dolan et al., 1999). Other approaches for analyzing extremely discordant sib pair data include the method based on IBD sharing scores or weighted IBD sharing scores [Risch and Zhang (1995); Dudoit and Speed (2000)] or the method based on the conditional likelihood of marker data given trait values [Dudoit and Speed (2000); Sham (2000)]. Likelihood (3) is a semiparametric likelihood, since it contains a finitedimensional parameter p and a nonparametric component F. In principle, we can base our inference on Ln(p,F;t) by treating F as an infinitedimensional nuisance parameter. However, there are two difficulties in implementing such a proposal. First, computation of the maximum likelihood estimator (MLE) of (p, F) is difficult, because the dimension of the MLE of F is proportional to the total number of individuals in the sample. Second, the distributional property of the likelihood ratio statistic based on estimation of both p and F in the present model has not been worked out, although previous work suggests that it may be similar to the standard results, see Murphy and Van der Vaart (1997). Therefore, we use a pseudo-likelihood obtained as follows. First, we
267
estimate F using t h e following modified empirical distribution function: , /V
n
ki
I
Fn(y) i v -t-
i
/v
—
=1 j = l
where N = YH=i ^»- ^ is clear t h a t Fn is a consistent estimator of F. Next, we substitute Fn for F in (3), which gives Ji
Ln(p,Fn;t) = Yl 5]{7ry(t)pJ-(y<;p)Fn)}P(Mi) J'=I
This is not a likelihood in t h e usual sense, because it contains an estimator Fn of the parameter F. However, it can still be used as a basis for statistical inference. Such a likelihood is called a pseudo-likelihood (or a pseudolikelihood) [Besag (1975); Gong and Samiengo (1981); Lindsay (1988)]. Before defining t h e test statistic, we need t o consider the parameter space for p. Because the correlation between the trait values increases along with the increased number of alleles shared IBD, the following restrictions should be satisfied: 0 < Po < Pi < 92T h e restriction t h a t po > 0 arises from the consideration t h a t even when two sibs share 0 alleles IBD at locus t, they may still share t h e same polygenes and similar environmental factors. Therefore, t h e residual correlation should b e nonnegative. Under the null hypothesis of no linkage, 0 < Po = Pi = P2We consider t h e following three parameter spaces:
n0 = {P 0 < Po = Pi= P2< 1}, Ki = {p 0
1},
n2 = {P 0 < po < Pi < l,pi = cip0
C2P2},
where C\ and c2 are known nonnegative constants and c\ + c2 = 1 T h e parameter space 7£o corresponds t o t h e null hypothesis space. 7^i -n0 corresponds t o t h e general alternative hypothesis, in which a n a t u r a l order restriction is imposed. In 1Z2, by choosing different values of c\ and c2, we get different alternative models. For example, if we let c\ = c2 = .5, then p\ = .5(po + P2)- This is equivalent to the additive model assumption used in t h e s t a n d a r d H-E method in which t h e dominance variance is assumed t o be zero. Other choices are also possible. It is important to note t h a t the restriction in 1Z2 is only an assumption used in the analysis, it may not
268
correspond to the underlying genetic model. However, this restriction does not affect the validity of the test, in the sense that the type 1 error rate is controlled at the nominal level for a given critical value. Corresponding to the three parameter spaces are the hypotheses: Ho : p G 7?o, H\ : p € 72-1 — 7?-o, and Hi : p G 72-2 — ~R-o- The test of these hypotheses can be based on the LR statistic at location t: Ln{po,Fn;t) Specifically, we consider two cases: (1) for hypothesis HQ versus Hi, jon is the maximum pseudo-likelihood estimator (MPLE) under the restriction 0 < po < Pi < P2', (2) for HQ verses H2, pn is the MPLE under the restriction 0 < po < P2, Pi = cipo + C2P2 for some given (01,02). In both cases, pb is the MPLE of the common value of po, pi and pi under Holt can be shown that, under HQ, for any fixed t, A„ for testing Ho versus Hi is asymptotically distributed as a mixture of \ 2 distributions (0.5 — a)xo + 0.5xi + <*X2> where a is a constant depend on the distribution of allele sharing proportions at each chromosomal location t. Here \o denotes the degenerate distribution that puts probability 1 at point 0. Unfortunately, this asymptotic distribution is difficult to use because a needs to be calculated at each location t. A much simpler result holds for A„ for testing Ho versus Hi for any fixed values of (01,02). In this case, A„ is asymptotically distributed as 0.5^0 + 0.5xi- This asymptotic distribution is independent of t. For results concerning the asymptotic distribution of a constrained LR test statistic, see for example, Chernoff (1954) and Self and Liang (1987). We have conducted simulation studies in a range of models to evaluate the type I error rate and power of the proposed test. We found that the observed type I error rates are within the random sampling fluctuation of the nominal significance levels based on the asymptotic distribution for data sets consisting 100 and 200 sib pairs. The power behavior of the proposed normal copula composite LR test is robust with respect to a range of nonnormal models. In particular, when the trait distributions are skewed, the proposed method tend to have appreciably higher power than the normal LR and H-E methods. 3. Discussion We have proposed a semiparametric normal copula model for the quantitative trait data. This modeling approach amounts to making a normal-
269
quantile distribution transformation of the data, so the marginal distribution of the data is normal, and then use the multivariate distribution to the transformed data. Based on this model, we proposed a LR test for assessing linkage between a quantitative trait and a chromosomal location. This approach can be naturally applied to multiplex sibship data without completely assuming the multivariate normality assumption. We have considered the performance of an asymptotically equivalent test statistic, the score statistic, in a separate report [Wang and Huang (2002)]. The simulation results there suggest that the normal copula model has consistent type 1 error rate and preferable power behavior when comparing with some existing methods, such as the variance component (VC) method, when the normality assumption is not satisfied. We have only considered sibship data. However, because the copula model can be formulated for multivariate data with arbitrary covariance structures, this method can be generalized to data with multigenerational families. The formulation can be done similarly as in the VC method [Goldgar (1990); Amos (1994); Almasy and Blangero (1998)], by use of relationship coefficients to determine the covariance structure. The main difference is that the VC method is based on the multivariate normal distribution, but the proposed method is based on the multivariate normal copula distribution. The marginal distribution of the trait can still be estimated based on the empirical distribution, using the trait values of all the individuals in the data set. This is equivalent to first transform the data based on the empirical normal-quantile distribution transformation, and then apply the VC method. However, we note that after making this transformation, the total variance in the VC method should be constrained to equal one. Then a LR test can be carried out. We expect that this method should also have good power behavior in this case for a broad range of underlying trait distributions. An interesting question is the efficiency of the LR test based on the normal copula model when the true underlying distribution is normal. Based on the result of Klaassen and Wellner (1997), in which they showed that the maximum likelihood estimator of the correlation coefficient in the bivariate normal copula model is fully efficient, and based on the correspondence between an efficient estimator and an efficient test, we conjecture that, if the underlying distribution is indeed normal, our method is asymptotically locally equivalent to the normal LR test in the sense that they have exactly the same local power function (hence their Pitman relative efficiency is one). In other words, for normal data, the power of our method is approximately
270
the same as t h e power of the normal L R approach. O u r limited simulation results (data not shown) suggest this is true. However, a rigorous analytical proof of this conjecture is needed. We hope to work this out in t h e near future.
References 1. Allison, D. B., Neale, M. C , Zannolli, R., Schork, N. J., Amos C. I., Blangero J. (1999). Testing the robustness of the likelihood ratio test in a variancecomponent quantitative trait loci (QTL) mapping procedure. Am J Hum Genet 65, 531-544. 2. Almasy, L., Blangero, J. (1998). Multipoint quantitative trait linkage analysis in general pedigrees. Am J Hum Genet 62, 1198-1211. 3. Amos, C. I. (1994). Robust variance-components approach for assessing genetic linkage in pedigrees. Am J Hum Genet 47, 842-853. 4. Besag, J. E. (1975). Statistical analysis of non-lattice data. The Statistician 24, 179-195. 5. ChernofF, H. (1954) On the distribution of the likelihood ratio. Ann Math Statist 25, 573-578 6. Dolan, C. V., Boomsma, D. I. and Neale, M. C. (1999). A simulation study of the effects of assignment of prior identity-by-descent probabilities to unselected sib pairs, in covariance-structure modeling of a quantitative-trait locus. Am J Hum Genet 64, 268-280. 7. Drigalenko, E. (1998). How sib-pairs reveal linkage. Am J Hum Genet 63,1243-45. 8. Dudoit, S. and Speed, T. P. (2000). A score test for linkage analysis of qualitative and quantitative traits based on identity by descent data on sib-pairs. Biostatistics 1, 1-26. 9. Eaves, L. J., Neale, M. C , and Maes, H. (1996). Multivariate multipoint linkage analysis of quantitative trait loci. Behav Genet 26, 519-525. 10. Fulker, D. W. and Cherny, S. S. (1996). An improved multipoint sib-pair analysis of quantitative traits. Behav Genet 26, 527-532. 11. Genest, C. and MacKay, J. (1986). The joy of copulas: bivariate distributions with uniform marginals. Amer Statistician 40, 280-283. 12. Goldgar, D. E. (1990). Multipoint analysis of human quantitative genetic variation. Am J Hum Genet 47, 957-967. 13. Gong, G. and Samiengo, F. J. (1981). Pseudo maximum likelihood estimation: theory and applications. Ann Statist 9, 861-869. 14. Haseman, J. K. and Elston, R. C. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behavior Genetics 2, 3-19. 15. Klaassen, C. A. J. and Wellner, J. A. (1997). Efficient estimation in the bivariate normal copula model: normal margins are least favorable. Bernoulli 3, 55-77. 16. Kruglyak, L. and Lander, E. S. (1995). Complete multipoint sib-pair analysis of qualitative and quantitative traits. Amer J Hum Genet 57, 439-454.
271
17. Kruglyak, L., Daly, M. J., Reeve-Daly, M. P., and Lander, E. S. (1996). Parametric and Nonparametric Linkage Analysis: A Unified Multipoint Approach. Amer J Hum Genet 58, 1347-1363. 18. Liang, K-Y and Self, S.G. (1996). On the asymptotic behavior of the pseudolikelihood ratio test statistic. J Royal Statist Soc, B 58, 785-796. 19. Lindsay, B. G. (1988). Composite likelihood methods. Contemporary Mathematics 80, 221-239. 20. Lynch, M. and Walsh, B. (1998). Genetics and Analysis of Quantitative Traits. Sinauer Associates, Inc. Publishers, Sunderland, Massachusetts. 21. Murphy, S. A. and Van der Vaart, A. W. (1997). Semiparametric likelihood ratio inference Ann Statist 25, 1471-1509. 22. Risch, N. and Zhang, H. (1995). Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268, 1584-1589. 23. Schork, N. J. (1993). Extended multipoint identity-by-descent analysis of human quantitative traits: efficiency, power, and modeling considerations. Amer J Hum Genet 53, 1306-1319. 24. Self, S. G. and Liang, K-Y. (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J Am Statist Assoc 82, 605-610. 25. Sham, P. C., Zhao, J. H., Cherny, S. S., and Hewitt, J. K. (2000). Variancecomponents QTL linkage analysis of selected and non-normal samples: conditioning on trait values. Genet Epidemiol 10, Suppl 1, S22-S28. 26. Wang, K. and Huang, J. (2002). A score statistic approach for mapping quantitative trait loci with sibships of arbitrary size. Amer J Hum Genet 70, 412-424. 27. Wright, F. A. (2000). The phenotypic difference discards sib-pair QTL linkage information. Amer J Hum Genet 60, 740-742.
S T R U C T U R E M I X T U R E R E G R E S S I O N MODELS
H O N G T U ZHU A N D H E P I N G Z H A N G Department
of Epidemiology
and Public Health, Medicine, New Haven, CT 06520-8034,
Yale University
School of
USA.
A difficult question commonly arising from many applications is whether the d a t a at hand come from a homogeneous or heterogeneous population. Finite mixture models have been widely used to analyze the d a t a from a heterogeneous population. We present a family of structure mixture regression models and investigate the related statistical issues including the asymptotic theory. The motivating example for our general model is from family studies where we need to assess familial aggregation of certain diseases. We examine three standard testing statistics and derive their asymptotic distributions under null and local alternative distributions. A resampling procedure for approximating the p—values is also proposed. Our results are applicable in a wide variety of applications. This research is supported in part by NIH grants DA12468 and AA12044.
1. Introduction Finite mixture models have broad applications in many fields including biology, psychology, and genetics. Statistical inference and computation based on these models pose a serious challenge; see Titterington, Smith and Markov (1985), Lindsay (1995) and McLachlan and Peel (2000) for systematic reviews. Recently, major progress has been made for finite mixture models by Chen, Chen and Kalbfleish (2001), Dacunha-Castelle and Gassia (1999), Liu and Shao (2003) and references therein, but there still exists a major gap between the theory and applications. For example, those work assume independent and identical distributed (i.i.d.) data, and in regression models, they exclude the covariates. Thus it is important to relax these restrictions. The technological advancement has strengthened our ability to conduct data of large sizes from different resources and populations. In the statistical literature, structure mixture regression has emerged as a new technique, 272
273
reflecting the technological trend, especially in genetic studies; see Devlin et al. (2002) and Chiu, et al. (2002). Our examples will also demonstrate the great need to consider the structure mixture models (1). This work is a further development from the asymptotic theory of Andrews (1999, 2001) and Zhu and Zhang (2002, 2003). Specifically, our main contribution is to establish the asymptotic theory in the presence of covariates, nuisance parameters and high dimensional parameters under the new framework (1). This paper is organized as follows. In Section 2, we introduce structure mixture regression models and present two motivating examples. Section 3 focuses on developing asymtotic theory for two general cases. Subsection 3.1 considers an admixture (or a single mean mixture) model and presents the related asymptotic theory for three testing statistics. Based on these results, a resampling procedure for approximating the p-value is also included. In subsection 3.2, we examine a general two-mean mixture regression and investigate the large sample behavior of three testing statistics. We concludes the paper with some discussions in Section 4.
2. Structure Mixture Regression Models We consider a data structure that arises from both longitudinal and family studies. Data are collected from n subjects. Let y^, x^, and z^ denote the Ui x mi matrix of response, the n, x m,2 matrix of covariates, and the rii x m3 matrix of covariates, respectively, for the i—th subject, i = 1, • • • , n. Assume that we obtain a random sample of n independent observations {yi,x;,Zi}" from density function
= {[1 -a(z i ;7)]/j(y i ,x i ;/3,)Ui) + a(z i ;7)/ l (y i ,x i ;/3,^2)}9i(xi,z i ), (1) where 6 = (7,/?,Mi,^2) is the unknown parameter vector and <7J(X,,ZJ) is the distribution function of (XJ,ZJ). Typically, (3 (qi x 1) and 7 (g2 x 1) measure the strength of association contributed by the covariates and the two 93 x 1 vectors, /ii and /X2, represent the contributions from two subpopulations. The parameter space 0 is defined as 0 = {6 = (7,/?, ^1,^2) : 7 £ r,/3 € B,/J,I,H2 € U), where B and U are, respectively, the subsets of Rqx and Rq3, and T is a compact subset of i? 92 . We call the model (1) as the structure mixture regression model. If a(zj;7) does not depend on subject i and can take any value in the interval [0,1], the model (1) reduces to the commonly used finite mixture regression considered in Zhu and Zhang (2003). We allow a(zj;7) to depend
274
on Zj to reflect the fact that data clusters are characterized by covariates in heterogeneous populations. For instance, in genetic studies, there are attempts to divide the study sample into more "homogeneous" strata using genetic markers or other covariate information before employing the standard methods for linkage and association studies [see, e.g., Satten et al., 2001; Shannon et al., 2001; Zhang et al. 2001]. Before presenting our method and results, let us examine a few examples. Example 1. Mixture Models for Linkage Analysis of Affected Sibling Pairs Devlin et al. (2002) proposed mixture models for linkage analysis of affected sibling pairs(ASP). Suppose that there are n ASP's and for the ith pair, multi-locus marker data y; and a set of covariates z; are observed, i = 1, • • • , n. An important concept in linkage analysis is identity-by-decent (IBD). For sib pairs, it refers to the same allele inherited from the same parent by the two sibs. Because each person has two alleles, thus the number of the alleles shared IBD by two sibs is 0, 1, or 2, which may not be determined due to lack of information. Let Si be the the number of the alleles shared IBD by the i-th sib pair. Thus, the probability of observing Si = s is given by p(s\zi) = [1 - a(zi)]p(s; A) +
a(zi)p0(s),
where p0(s) = o ^ S ^ ^ ' + ' ^ ^ O . S ' ^ 1 ) and p(s; A) is defined by
p( s;A ) = (-L)^=o) 0 .5 / ( s=1 )[0.5- ^ ] / ( s = 2 ) Here A is Risch's (1990) recurrence risk ratio for the sibling of an individual. Then, the likelihood for observing y, conditional on z; is
p(yi\zi) = ^2p(yi\s)p(s\zi) s
= [1 - a(zi)] ^2p{s; X)p(yi\s) + a(z^) ^ P o ( s ) p ( y i | s ) , s
(2)
s
where the weighting function p(yi\s) can be directly calculated by GENEHUNTER [Kruglyak, et al., 1996]. In this example, we have 7,(2/,, Xi;/3,/ii) = ^ p ( s ; A ) p ( v i | s ) s
and /i(^,Xi;/3,^ 2 ) = ^ P o ( s M y i l s ) ,
275
where /xi = j satisfying the constraint fii < 1 and \ii = \, and /3 refers to the frequencies of the alleles. To model a(zj) in different circumstances, Devlin et al. (2002) proposed two models: Pre-clustering and COV-IBD models. The pre-clustering model assumes that some covariates Zj such as genetic markers [Pritchard, et al. (2000)] contain enough information about the membership in the linked and unlinked subpopulations. Therefore, standard clustering algorithms can be applied to determine a(zj)'s. In this case, parameter 7 is either unnecessary or completely determined. If the covariates Zj only contain partial information about the membership, the COV-IBD model relates the mixing distribution with the covariates Zj by a link function, i.e., a(zjj7) = exp(7 T Zj)/[l + exp(7 T Zj)]. For the ASP linkage analysis, we are interested in testing the existence of linkage. The statistical hypotheses are H0 : (ii = 1 v.s. Hi : Hi < 1. Devlin et al (2002) considered the log likelihood-ratio testing (LRT) statistic, but did not provide the asymptotic distribution of this statistic. We will investigate a general admixture model in subsection 3.1 and establish related asymptotic results. Example 2. Backcross Population. Consider two parental inbred lines Pi and P 2 , differing substantially in a trait of interest, and two markers M and N at the same chromosome. Assume that both loci M and N have two alleles and the marker genotypes of Pi and P 2 at {M, N} are AB/AB and ab/ab, respectively. Crossing Pi with P 2 produces Pi population with heterozygous genotypes at every locus. Then, the markers genotype of Pi is Aa/Bb. A backcross between Pi and Pi produces a backcross population JBi. Let Q be a putative QTL (quantitative trait locus) flanked by the markers M and N. The possible genotypes at M, N, and Q in the Bi population are (AA,Aa), (BB,Bb), and (QQ,Qq). Suppose that QTL mapping data for the P i population have n individuals and a triplet (y;,Zj,Xj) is observed for the i-th individual, consisting of a quantitative (or ordinal) trait, the genetic marker genotypes at M and AT, and other explantory variables, i — 1, • • • , n. We consider a conditional interval mapping (CIM) statistical model [Kao and Zeng, 1997] Yi = Vim + (1 - Vi)fi2 + xf (3 + et,
(3)
where e, is a random variable and Vi is a indicator function of the event
276
{QTLj(7) = QQ}, in which 7 is a putative position (in cM) of QTL between the markers M and TV. Let p(zj) denote the frequence of the genotypes z» at the markers and p(vi\zi, 7) be conditional probability of a putative QTL at 7 given the flanking marker genotypes z,. Table 1 of Kao and Zeng (1997) lists all corresponding probabilities {p(zi),p(vi\zi,^)}, which depend on the recombination fractions between M and QTL, QTL and N, and M and N. Some assumptions such as ignoring double recombination and only one QTL in the testing interval are assumed. The likelihood for the observed quantities of the i-th individual is 1
p(yi,Xi,Zi)
= ^p(yi|ui,xi)p(t;i|zi)p(zi). Ui=0
Thus, a(z,;7) = p(vi = 0 | Z J , 7 ) . For the quantitative trait, we usually assume that e; follows TV(0, cr2), then p(yi, x,, Zj) equals HiYi - Mi - xJ/3)/cr}p(vi = l|Zi,7)p(zi) +0{(Yi - A*2 - x f 0)/a}p(vi
= 0|Zi,7)p(zi),
where >(•) denotes the standard normal density function. Many health conditions including cancer and psychiatric disorders are assessed by an ordinal scale. Thus, it is useful to consider ordinal traits. Without loss of generality, let yj denote an ordinal trait taking an ordinal value of 0, 1, 2. The method is similar when there are two categories only or more than three categories. Following Zhang et al. (2003), we assume that logit(P{yi = 0|ui,Xi}) = vmi + (1 - Vi)^2 + x f 0, logit(P{yi < l\vi,Xi})
= Vim + (1 - Vi)fi2 + x}/3.
For the QTL mapping in the B\ population, we are interested in whether there is a QTL within the marker interval [M, TV]. The hypotheses are Ho : [ii = H2 for all positions
in [M, TV]
Hi : fii ¥" V2 for at least a position
(4)
in [M, TV].
Many efforts have been made for testing the hypotheses in (4); for example, Lander and Botstein (1989), Rebai et al. (1995), Dupuis and Siegmund (1999) and many others. It is noteworthy that they only consider a single trait, that is, y^ is one-dimensional. More recently, it has been suggested that when multiple traits are available, it can be more advantageous to
277
consider them together. See for example Czerwinski et al. (1999). Our theoretical framework enables us to consider multiple traits. Nettleton and Praestgaard (1998) presented a constraint version of the QTL interval mapping by taking account of the fact that certain restrictions on model parameters are known. With extra information, the power of the testing procedures is improved. However, Nettleton and Praestgaard (1998) needed to fix 7 at a position, and hence their procedures do not test the simultaneous hypotheses in (4). Our results will allow us to perform a simultaneous test. 3. Main Results Let P» be the true model from which the data are generated and ©, = {#* £ 6 : Pem = P*) represent the set of true model parameters. From now on, the use of an asterisk in the subscript means that the parameter value belongs to 9«. One of the key hypotheses is whether the data come from a heterogeneous population. In linkage analysis, it implies whether there exists a linked group. Formally, this hypothesis can be stated as follows: H0:
||/*a —/x2|| = 0, v.s. i f i : | | M l - / i 2 | | ^ 0 ,
(5)
where 11 • 11 is the Euclidean norm of a vector. Interestingly, different techniques must be adopted under the null and alternative hypotheses when establising the asymptotic theory for statistical inference. Basically, if the alternative hypothesis is true, the standard asymptotic theory is applicable; see for example Andrews (1999). When the null hypothesis is true, recent asymptotic results established in Andrews (2001) and Zhu and Zhang (2002) can be used. 3.1. Admixture
Regression
Now we need to introduce some notation. Let /i(/3, fJ.) = fi{yi, x,; (3, fi). Define a>i(j) = a(zj;7), F M (/3,/x) = d0fi(P,£*)//« and F M , = Fi,i(/3,,n*), and Fii2(n) = 9M/i(/?,.,/x)//i* and F ii2 * = F ii2 (/i*)- Let w-X)(7) = (*&., [l-ai(7)]*&.)T,
where W^1'^) is a (91+93) x 1 vector and J« (7) i s a (91+93) x (91+93) matrix. Further, let 77 = (P,m) and At] = (A/3, A ^ ) , in which A ^ —
278
Hi — fi* and A/3 = f3 — f3„. The log-likelihood function Ln{r), 7) is given by n
Ln(v,l)
= £ > g [ ( l - a < ( 7 ) ) / i ( | 9 . w ) / / i * +a<(7)/i(/J)//i.],
where /,* = /i(/?«,/z*) and /,(/3) = fi(P,fi*)- The maximum likelihood estimate (MLE) of 77 for each 7 G T, 7)1(7) = (^(7)1 A I ( T ) ) I is £n(^1(7).7) =
sup
Ln(ji,i),
where 771(7) is a function of 7. Finally, let 7)1 = 7)1(7) s u c n
tna
t
in(»7i(7).7) = sup Ln(6), where 0 = {9 € © : fi<x = /J«}. Motivated by Example 1, we consider the admixture regression with M2 = /J*- Thus, the hypotheses in (5) can be stated as H0 : fiu = fi* v.s. Hi:fiu^fi*-
(6)
To construct a test statistics for (6), we need an asymptotic expansion for Ln(rf, 7), which is the cornerstone of our asymptotic theory. Under assumptions (A.1)-(A.4) in the appendix, we can show that 2Ln(77,7) = Wi 1 ) T (7)4 1 ) (7)- 1 W^ 1 ) (7) - Q n ( v ^ A r 7 , 7 ) + o p ( l ) holds uniformly in {6 e 9 : where
||AJJ||
(7)
< Co/y/n,fi2 = fi*} for any constant Co,
Qnivn) = [V- 41)(7)_1W^1)(7)]r41)(7)['7 - 41)(7)-1W^1)(7)]First, we start with the log-likelihood ratio test statistic defined by LRTn = sup Ln(r?, 7) - sup eee 0ee o
Ln(r],j),
where 60 = {9 € 6 : fi\ = fi2 = fi*}- Under 0 O , Ln(9) simplifies to Yli=\ l°g[/»(/3)//i*] and the standard asymptotic theory leads to sup 2Ln{9) = ^ 1 ) T ( 7 ) 4 1 ) ( 7 ) - 1 W W ( 7 ) - inf Q n ( V S A % ) 7 ) + o p ( l ) , where A770 = (A/3 T ,0 T ) T . By using (7) and the assumptions (A.1)-(A.4) in the appendix, we find that 2 L n ( 7 7 i ( 7 ) , 7 ) = ^ 1 ) ( 7 ) T 4 1 ) ( 7 ) - 1 W « ( 7 ) - ^ inf
Q„(v^A77, 7 )+o P (l)
279
holds uniformly for all 7 € I\ Finally, the log-likelihood ratio statistic has a simple form given by LRTn=
inf Qn(vo,l) peA/3
~
jnf Q (v,l) ^eA/sxA^ n
+ op(l),
where r]0 = (/3 T ,0 T ) T , and A^ and AM are defined in the appendix. Second, we consider the Wald test statistic (Wald, 1943). The Wald test statistic is relevant to fti(-y) under the alternative hypothesis. As a matter of fact, the Wald test statistic, WLSn, can be defined as follow: WLSn
= sup WLSn{-y), -r€T
where WLSn(7) = n ( £ i ( 7 ) - ti*)T[Hf Jn1] (-y)-1!!^^^) - /*.) and T H\ = [0,Iq3] is a (
9„(A*i,7) = [MI -^ii 1 ) (7)- 1 ^ 1 ) (7)]Vi T ^ I ) (7)- 1 ^i)- 1 [vi-HTjpw-iwVw]. Finally, the score test statistic, SCSn, is defined by SCSn =
supd1fiJ(HlJ^(f)-1H1)-1d^n1.
Similar to the Wald test, the score test statistic requires the evaluation of both Wn(-f) and Jn{l) at (/3o,M*)To assess the power of the above three test statistics, we exploit the notion of asymptotic local power. In our case, the distribution of Jn {l)Wn (7) plays a key role in determining the asymptotic local power of LRSn, WLSn and SCSn. So, we explore its property under a sequence of local alternatives 77™ = (/3",/x") consisting of (3n = (3* + n - 1 / 2 h i and /i™ = /x» 4- n _ 1 / 2 b.2, where h i and ri2 are qi- and (fe-vectors, respectively. We have the following theorem.
280
1. Under Assumptions (A.1)-(A.4) in the appendix, the following results hold as n —> oo: (i) If both Ap andhfj, are convex, 7)1(7) —(/?»,/x«) = O p ( n - 1 / 2 ) uniformly for 7 € T. (it) The Z / „ ( T ] I ( 7 ) , 7 ) converges to a stochastic process £1(7) and LRSn converges to the maxima of L 1(7) in distribution. (Hi) The test statistics WLSn and SCSn are asymptotically equivalent. (iv) Under the alternatives wn, THEOREM
41)(7)-1W)(1>(7)-
^ 1 ) m (7) = E ^ ( 1 ) ( 7 K r o / ^ and
-HfJn1H1)-lW^m(j)]T(HTJn1\1)-1H1)-1
C(/*i,7) = [MI
Mi-ff?'41)(7)-1W^1)ro(7) Step 3. Minimize <7™(//i,7) to get dy/it™ and calculate
SCS™ =
supd-y^iHlJ^d^y'd^T•y€T
281
Step 4. Repeat Steps 1-3 J times and obtain a sample of {SCS™ : m = 1, • • • , J } . It can be shown that the empirical distribution of SCS™ converges to the asymptotic distribution of SCSn. Therefore, the empirical distribution of the realization forms the basis for calculating the critical values in the hypothesis testing as well as the power calculation. Example 1. (continue) Let us consider the COV-IBD model with Q(ZJ) = exp(7 T Zi)/[l +exp(7 T Zi)]. According to (2), there is no /3 in the model. Moreover, it is clear that 1
wwH)
P(yi|0)-p(yi|2)
= r
* ™
l+exp(7 Zi)
P(yi|0)+2p(yi|l)+p(y<|2)-
After some algebra, we find that LRSn = WLSn + o p (l) = SCSn + op(l) = supmaxijWfrr^W^M.O}2.
3.2. General
Mixture
Regression
Now we need to introduce some new notation. Let
f! 2) (7) = ( i f t . , if 2 *, [ai(7)-0.5]Jf 2 .) r ,
^
i=\
»=1
where W„ '(7) is a () = (AP,0.5([*i+[*2) —I**,1*2 — /ii). The log-likelihood function Ln(6) is given by n
in(«) = X) 1 °eKl-ai(7))/i(/?,A*i)//i* + «i(7)/i(/9,^2)//i.]. »=1
The maximum likelihood estimate (MLE) of w for each 7 € T is defined as £n(&(7).7)=
sup
L n (w,7).
ueSxWxM
In particular, o> = £(7) satisfies that L n (a>(7),7) = sup06@ Ln(6).
282
Similar to the admixture regression, in order to derive test statistics, we need to develop an asymptotic expansion for Ln(?7,7). Under assumptions in the appendix, we have 2L„(o;,7) = W^ 2 ) r (7)4 2 ) (7) _ 1 W I i 2 ) (7) - Q™{yfrK{w)n)
+
0p(l)
holds uniformly in {6 € 6 : ||K(w)|| < Co/y/n} for any constant Co, where Q< 2) (A, 7 ) = [A - j W f r r ^ M f j W M l A
-
^ ( ^ W f W l -
First, we start with the log-likelihood ratio statistic defined by LRTn = supL„(w,7) - sup Gee 0eeo
Ln{w,i),
where Go = {6 G G : Hi = ^2}- Under the null hypothesis, Ln{6) = 5^™-i log[/»(/3, fJ-)/fi*}- The standard asymptotic theory gives rise to sup 2L n (0) 6>ee0 = W i 2 ) ( 7 ) T 4 2 ) ( 7 ) - 1 W ^ 2 ) ( 7 ) - , f l \nf
Q ^ v ^ W , / * ) ^ ) + op(l),
where K0{P,n) = (A/3, A/z, 0). Furthermore, we have 2L n (u»(7) )7 ) = ^2)(7)T42)(7)"1^2)(7)-
inf f Q^(v^^H,7)+op(l) uEBxUxU
holds uniformly for all 7 € I \ Finally, the log-likelihood ratio statistic has a simple form given by LRTn=
inf
QV\yfcKo{p,ii),i)-
{(3,^)&BxU
Q
inf ueBxUxU
Second, the Wald test statistic is defined as follow: WLSn
= sup •yer
WLSn(j),
where WLSn(j)
=n[£2(7) - £ i ( 7 ) f [ F J J ^ ) " [£2(7)
T
1
^ ] "'
~AI(7)]
and i?2 = [0, / 9 3 ] is a (gi + 2q3) x 53 matrix. Finally, the score test statistic is a quadratic form of a directed score vector dyH, which is a function of 7. Explicitly, for each 7 € T, the directed
283 (2)
score d 7 /i is the minima of a quadratic form q\ (/ii — ^2,7) for both /ii and /i2 in AM, where
e\nn)
= [/i-ff 2 r 4 2 )(7)- 1 wW(7)] r (flJ^ 2 ) (7)- 1 ^)- 1 M-/^
2
^)"
1
^
2
^)'
Thus, the score test statistic, STSn, is defined by STSn = s u p d y / i T ( i ^ J ^ ( 7 ) - 1 / r 2 ) - 1 d Y / x .
(8)
To assess the power of the three test statistics, we will exploit the distribution of Jn (j)Wn '(7) under a sequence of local alternatives K(cjn) = K(u)t,) + n _ 1 / 2 h , where h is (gi + 2<73)-vector. We can prove the following theorem. THEOREM 2. Under Assumptions (A.1)-(A.4) in the appendix, the following results hold as n —» 00: (i) If both A/3 and AM are convex, 01(7) — (/3*,/x*,/i„) = O p ( n - 1 / 2 ) «niformly for 7 € T. (M_) TTie L n (d)(7), 7) converges to a stochastic process £2(7) ari<^ LRSn converges to the maxima 0/^2(7) in distribution. (Hi) The test statistics WLSn and SCSn are asymptotically equivalent. (iv) Under the alternatives w",
4 2) (7)- 1 W^ 2) (7) - d iV(j( 2 )( 7 )- 1 j( 2 )(7,/ i ,)h, J ^ ) " 1 ) , where both J^(~y) and J^(j, /i#) are (gi+2q 3 ) x ((71+293) matrices defined in the appendix. A similar resampling procedure can be established for computing the critical values for these potentially complicated distributions of test statistics. The details are omitted. Example 2. (continue) Consider the model (3) with e, ~ N(0,cr2). It can be shown that
-P'(7)= ( * # . & - £ . ^.W-.M-OJO)'. By using Table 1 of Kao and Zeng (1997), we find that E[a(zi\j)\ = 0.5 and Var[a(zj|7)] > 0. We can check all assumptions (A.1)-(A.4) in this example and apply our results to make statistical inference.
284
4. Discussion This paper introduces a family of structure mixture regression models with wide applications. We mainly investigate the asymptotic distributions of the three well-known testing statistics under some mild conditions. Our contribution includes a unification of many new mixture models, the consideration of non-i.i.d observations, three different testing statistics and high dimensional /i, and hence greatly broadens the applicability of the asymptotic theory. There are a few issues that merit further research: (a) One major issue is the empirical performance of the three test statistics in finite samples under different situations; (b) The asymptotic admissibility of the proposed test statistics is another interesting topic; (c) In the paper, we use a resampling algorithm to approximate the p—value. An interesting topic is how to apply the Gaussian random field theory and tube formula to approximate the p—value; (d) In genetic studies, most structure mixture regression models have more than two modes, which create potentially interesting difficult problems.
Appendix
Before presenting the assumptions, we define the second-order derivatives of $f_i(\beta,\mu)$ as follows: $F_{i,3}(\beta,\mu) = \partial_\beta^2 f_i(\beta,\mu)/f_i^*$, $F_{i,4}(\mu) = \partial_\beta\partial_\mu f_i(\beta_*,\mu)/f_i^*$, and $F_{i,5}(\mu) = \partial_\mu^2 f_i(\beta_*,\mu)/f_i^*$. The following assumptions are sufficient conditions for deriving our asymptotic results.
ASSUMPTIONS:
(A.1) The sets $n^{1/2}(B-\beta_*)/b_n$ and $n^{1/2}(U-\mu_*)/b_n$ can be locally approximated by cones $\Lambda_\beta$ and $\Lambda_\mu$, respectively, where $b_n\to\infty$ and $b_n n^{-1/2}\to 0$, and $B-\beta_* = \{\beta-\beta_* : \beta\in B\}$ and $U-\mu_* = \{\mu-\mu_* : \mu\in U\}$.
(A.2) (Identifiability) As $n\to\infty$, $\sup_{\theta\in\Theta} n^{-1}|L_n(\theta) - \bar L_n(\theta)| \to 0$ in probability, where $\bar L_n(\theta) = E\{L_n(\theta)\}$. For every $\delta > 0$, we have $\liminf_{n\to\infty} n^{-1}\{\bar L_n(\bar\theta_n) - \sup_{\theta\in\Theta\setminus\Theta_{*,\delta}} \bar L_n(\theta)\} > 0$, where $\bar\theta_n$ is the maximizer of $\bar L_n(\theta)$ and $\Theta_{*,\delta} = \{\theta : \|\Delta\beta\| < \delta, \|\Delta\mu_1\| < \delta, \|\Delta\mu_2\| < \delta\}\cap\Theta$.
(A.3) For a small $\delta_0 > 0$, let $B_{\delta_0} = \{(\beta,\mu)\in\Theta : \|\Delta\beta\| < \delta_0 \text{ and } \|\Delta\mu\| < \delta_0\}$. Then
$$\sup_{(\beta,\mu)\in B_{\delta_0}}\Big\{\Big\|\frac{1}{n}\sum_{i=1}^n F_{i,1}(\beta,\mu)\Big\| + \Big\|\frac{1}{n}\sum_{i=1}^n F_{i,3}(\beta,\mu)\Big\| + \Big\|\frac{1}{n}\sum_{i=1}^n F_{i,2}(\mu)\Big\| + \sum_{k=4}^{5}\Big\|\frac{1}{n}\sum_{i=1}^n F_{i,k}(\mu)\Big\|\Big\} = o_p(1),$$
$$\sup_{(\beta,\mu)\in B_{\delta_0}}\Big\|\frac{1}{\sqrt{n}}\sum_{i=1}^n F_{i,k}(\mu)\Big\| = O_p(1), \quad k = 4,5,$$
$$\sup_{(\beta,\mu)\in B_{\delta_0}}\frac{1}{n}\sum_{i=1}^n\Big\{\|F_{i,1}(\beta,\mu)\|^3 + \|F_{i,3}(\beta,\mu)\|^3 + \|F_{i,2}(\mu)\|^3 + \sum_{k=4}^{5}\|F_{i,k}(\mu)\|^3\Big\} = O_p(1).$$
Moreover, $\max_{1\le i\le n}\sup_{(\beta,\mu)\in B_{\delta_0}}\{\|F_{i,1}(\beta,\mu)\| + \|F_{i,2}(\mu)\|\} = O_p(n^{1/2})$.
(A.4) $(W_n^{(1)}(\cdot), J_n^{(1)}(\cdot)) \Rightarrow (W^{(1)}(\cdot), J^{(1)}(\cdot))$, where these processes are indexed by $\gamma\in\Gamma$, and the stochastic process $\{(W^{(1)}(\gamma), J^{(1)}(\gamma)) : \gamma\in\Gamma\}$ has bounded continuous sample paths with probability one. Each $J^{(1)}(\gamma)$ is a symmetric matrix, and $\infty > \sup_{\gamma\in\Gamma}\lambda_{\max}[J^{(1)}(\gamma)] \ge \inf_{\gamma\in\Gamma}\lambda_{\min}[J^{(1)}(\gamma)] > 0$ holds almost surely. The process $W^{(1)}(\gamma)$ is a mean-zero $R^{q_1+q_3}$-valued Gaussian stochastic process $\{W^{(1)}(\gamma) : \gamma\in\Gamma\}$ such that $E[W^{(1)}(\gamma)W^{(1)}(\gamma)^T] = J^{(1)}(\gamma)$ and $E[W^{(1)}(\gamma)W^{(1)}(\gamma')^T] = J^{(1)}(\gamma,\gamma')$ for any $\gamma$ and $\gamma'$ in $\Gamma$. In addition, we assume that a similar condition holds for $(W_n^{(2)}(\gamma), J_n^{(2)}(\gamma))$; the statement is omitted to save space.
Sketch of Proof of Theorem 1. We proceed in three steps. Step 1 establishes the convergence of the maximum likelihood estimate by using Assumption (A.2). In Step 2, let
$$s_i(\eta,\gamma) = [1-\alpha_1(\gamma)]\,[f_i(\beta,\mu_1)-f_i^*]/f_i^* + \alpha_1(\gamma)\,[f_i(\beta,\mu_2)-f_i^*]/f_i^*.$$
Then $s_i(\eta,\gamma)$ can be written as $s_i(\eta,\gamma) = \Delta\eta^T U_i^{(1)}(\gamma) + e_i(\eta,\gamma)$, where $e_i(\eta,\gamma)$ is the remainder term. Assumption (A.3) gives the order of $e_i(\eta,\gamma)$, and it follows from Assumption (A.3), the inequality $\log(1+x)\le x - x^2/2 + x^3/3$, and the consistency shown in the first step that
$$L_n(\eta,\gamma) \le \sqrt{n}\,\Delta\eta^T W_n^{(1)}(\gamma) - n\lambda_1\,\Delta\eta^T J^{(1)}(\gamma)\,\Delta\eta + o_p(n\|\Delta\eta\|^2 + 1),$$
and hence $\Delta\eta = O_p(n^{-1/2})$ by noting $L_n(\hat\eta_1(\gamma),\gamma) \ge 0$. Further, by using Assumption (A.3), the quadratic expansion
$$2L_n(\eta,\gamma) = 2\sqrt{n}\,\Delta\eta^T W_n^{(1)}(\gamma) - n\,\Delta\eta^T J_n^{(1)}(\gamma)\,\Delta\eta + o_p(1)$$
holds uniformly for all $\eta$ in $N[O_p(n^{-1/2})]$. In Step 3, we directly apply results in Andrews (1999) and Zhu and Zhang (2002) to conclude the proof.
References
1. Andrews, D. W. K. (1999). Estimation when a parameter is on a boundary: theory and applications. Econometrica 67, 1341-1383.
2. Andrews, D. W. K. (2001). Testing when a parameter is on the boundary of the maintained hypothesis. Econometrica 69, 683-734.
3. Chen, H., Chen, J., and Kalbfleisch, J. (2001). A modified likelihood ratio test for homogeneity in finite mixture models. J. R. Statist. Soc. B 63, 19-29.
4. Chiu, Y.F., Liang, K.Y., and Beaty, T.H. (2002). Multipoint linkage detection in the presence of heterogeneity. Biostatistics 3, 195-211.
5. Czerwinski, S.A., Mahaney, M.C., Williams, J.T., Almasy, L., and Blangero, J. (1999). Genetic analysis of personality traits and alcoholism using a mixed discrete continuous trait variance component model. Genetic Epidemiology 17 (Suppl 1), S317-S322.
6. Dacunha-Castelle, D. and Gassiat, E. (1999). Testing the order of a model using locally conic parameterization: population mixtures and stationary ARMA processes. Ann. Statist. 27, 1178-1209.
7. Devlin, B., Jones, B.L., Bacanu, S.A., and Roeder, K. (2002). Mixture models for linkage analysis of affected sibling pairs and covariates. Genetic Epidemiology 22, 52-65.
8. Dupuis, J. and Siegmund, D. (1999). Statistical methods for mapping quantitative trait loci from a dense set of markers. Genetics 151, 373-386.
9. Hansen, B. E. (1996). Inference when a nuisance parameter is not identified under the null hypothesis. Econometrica 64, 413-430.
10. Kao, C.H. and Zeng, Z.B. (1997). General formulas for obtaining the MLEs and the asymptotic variance-covariance matrix in mapping quantitative trait loci when using the EM algorithm. Biometrics 53, 653-665.
11. Kruglyak, L., Daly, M.J., Reeve-Daly, M.P., and Lander, E.S. (1996). Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. Hum. Genet. 58, 1347-1363.
12. Lander, E.S. and Botstein, D. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121, 185-199.
13. Liu, X. and Shao, Y.Z. (2003). Asymptotics for the likelihood ratio test under loss of identifiability. Ann. Statist., preprint.
14. Lindsay, B.G. (1995). Mixture Models: Theory, Geometry and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, Vol. 5. IMS, Hayward, CA.
15. McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley, New York.
16. Nettleton, D. and Praestgaard, J. (1998). Interval mapping of quantitative trait loci through order restricted inference. Biometrics 54, 74-87.
17. Pritchard, J.K., Stephens, M., Rosenberg, N.A., and Donnelly, P. (2000). Association mapping in structured populations. Am. J. Hum. Genet. 67, 170-181.
18. Rebai, A., Goffinet, B., and Mangin, B. (1995). Comparing power of different methods for QTL detection. Biometrics 51, 87-99.
19. Risch, N. (1990). Linkage strategies for genetically complex traits. I. Multilocus models. Am. J. Hum. Genet. 46, 222-228.
20. Satten, G.A., Flanders, W.D., and Yang, Q. (2001). Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am. J. Hum. Genet. 68, 466-477.
21. Shannon, W.D., Province, M.A., and Rao, D.C. (2001). Tree-based recursive partitioning methods for subdividing sibpairs into relatively more homogeneous subgroups. Genetic Epidemiology 20, 293-306.
22. Titterington, D.M., Smith, A.F.M., and Makov, U.E. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley, New York.
23. Zhang, H.P., Feng, R., and Zhu, H.T. (2002). A latent variable model of segregation analysis for ordinal outcome. Technical Report, Yale University School of Medicine.
24. Zhang, H.P., Tsai, C.P., Yu, C.Y., and Bonney, G. (2001). Tree-based linkage and association analyses of asthma. Genetic Epidemiology 21 (Suppl 1), S317-S322.
25. Zhu, H.T. and Zhang, H.P. (2002). Asymptotics for estimation and testing procedures under loss of identifiability. Technical Report, Yale University School of Medicine.
26. Zhu, H.T. and Zhang, H.P. (2003). Hypothesis testing for finite mixture regression models (mathematical details). Technical Report, Yale University School of Medicine.