Acknowledgments

This volume is based on a special symposium held in honor of Newton E. Morton on the occasion of his 70th birthday. The following organizations, whose support is greatly appreciated, generously sponsored the symposium: Washington University in Saint Louis; the Division of Biostatistics, the Department of Genetics, and the Institute for Biomedical Computing at Washington University; the National Institute on Aging/National Institutes of Health; and the Promega Corporation.

The symposium was guided by a Program Advisory Committee, whose members are gratefully acknowledged for their enthusiastic participation, advice, and commitment: James F. Crow (chair), Eric Boerwinkle, Claude Bouchard, David States, Robert Waterston, and the late Roger Williams. We are very grateful for the dedicated efforts of the Local Organizing Committee, who did an excellent organizational job: Derek Morgan, Nancy Grafton, Chris Young, Sharon Stark, and, most notably, Kristen Dolgos and Jeanne Cashman. We thank all the symposium speakers and session chairs for their fine presentations and for keeping the program on track. The participants of the symposium deserve special recognition for their overwhelming interest and enthusiasm. The following individuals generously contributed their time for prompt reviewing of the individual chapters, often at very short notice. Their services and support are greatly appreciated: Laura Almasy, Christopher I. Amos, Ingrid B. Borecki, James M. Cheverud, Chi Gu, Nancy Cox, Robert C. Elston, Peter Holmans, Leonid Kruglyak, Newton E. Morton, Rosalind Neuman, Nancy Lim Saccone, Duncan C. Thomas, Glenys Thomson, Daniel E. Weeks, and Jeff T. Williams. Finally, we thank Kristen Dolgos for extensive editorial assistance and Ingrid Borecki for considerable editorial consultation. Last but not least, it was a pleasure working with Craig Panner and Hilary Rowe at Academic Press; their diligent efforts are responsible for the attractive organization and timely publication of this volume.

The Editors
Preface

This volume is based on a symposium held in honor of one of the founding fathers of genetic epidemiology, Newton E. Morton, on the occasion of his 70th birthday. We hope that it constitutes a fitting tribute to the man who continues to make pioneering contributions to the field. The primary goal of this volume is to ask how best to achieve, even partially, the genetic dissection of complex traits, which do not have simple, single-gene causes. Toward that goal, this volume documents state-of-the-art methods and strategies and provides guidelines for undertaking the genetic dissection of complex traits. Is genetic dissection of complex traits achievable? The answer seems to be a resounding "yes," now more than ever before. This is an exciting time to be a genetic epidemiologist, with unprecedented new opportunities unfolding that we could only dream of a few short years ago. At the same time, we must recognize the limitations of some of the current approaches. It is generally recognized that genes and environments affect most human biological processes in complex and often interacting ways. Investigators the world over may be divided by differences in language and methodologies, but they are united in the conviction that genetic dissection of complex traits, though a formidable challenge, is possible. Recent failures have actually strengthened our resolve to succeed. Recent years have witnessed a flurry of activity in the development of novel methods as well as promising new strategies. Therefore, the time is ripe to undertake a realistic evaluation of contemporary methods, to critique their real utility, and to project promising new directions. Toward this end, the current volume, written by leading experts in genetic epidemiology, has three objectives. First, to provide a comprehensive and well-balanced review intended to quickly bring scientists and students up to speed with the state of the art in an important and rapidly growing field.
Second, to place contemporary methodologies in their proper perspective by including critical evaluations of their real value. And finally, to project promising new directions for the future. To assist investigators in asking precise questions about complex phenomena, the chapter authors endeavor to convey balanced opinions about the various methods. The genetic dissection of complex traits is a challenge to the practitioners who seek to uncover the genetic architecture underlying complex phenotypes, as well as to those who advocate specific approaches and methodologies. Now that even well-designed and thoughtful investigations are beginning to produce ambiguous results, we are coming to realize that unanticipated challenges underlie what were hitherto considered to be very promising approaches. Thus we need new strategies, as well as a dose of humility, as we apply them thoughtfully in ways that anticipate failures and frustrations. This book contains 32 chapters, divided into nine sections, and an appendix. It fills an important void by including a comprehensive account of contemporary methods and a detailed overview of current methodological trends. Section 1, which summarizes briefly Newton Morton's impact on science, is supplemented by a complete list of his research contributions in the appendix. In Section 2 we provide an overview of the methods for genetic dissection of complex traits and a succinct summary of the fundamental concepts of heritability, linkage, and association. In Section 3 we cover phenotypic and genotypic issues, such as quality and refinement, and discuss ways of handling multivariate phenotypes. In Section 4 we discuss the most powerful model-based methodology for linkage analysis, including a critical discussion of its strengths and weaknesses. In Section 5 we present contemporary and promising model-free methods, including variance components methods and transmission disequilibrium tests (TDT). Section 6 presents a comprehensive discussion of emerging new methodologies with considerable potential, including meta-analysis, classification methods, neural networks, and genome partitioning methods. In Section 7 we discuss optimum strategies for mapping complex trait loci, including gene-gene and gene-environment interactions, and special studies of population isolates. In Section 8 we deal with the thorny issues of multiple comparisons and significance levels. Finally, in Section 9 we offer some thoughts for the new millennium. This book can be used as a handbook for a wide audience of pre- and postdoctoral scientists, methodologists who seek an overview of some of the latest thinking in this area, and, most importantly, the scores of investigators who seek to evaluate the etiological basis of complex traits. It can be used as a reference book for upper-level undergraduate students, as well as a textbook for graduate students majoring in quantitative aspects of human genetics or in genetic epidemiology. It also provides an excellent account for statisticians interested in methodological opportunities.
We believe that it represents an important resource that will have some longevity as we march forward in the new millennium. As the saying goes, "Wisdom comes from experience, and experience comes from making mistakes." Surely we have all made our share of mistakes, hence are experienced, and may now hope that wisdom is just waiting for us to claim it! We have tried to fill this volume with the wisdom the authors have accumulated from their combined experience. Perhaps the most important sign of wisdom is the humility with which we recognize the limitations of some of our current approaches. Predicting the future is a perilous exercise, especially when it comes to complex traits. It is nearly impossible for us to know today which methods will have been the most useful in dissecting the genetic architecture of complex traits when we look back 10 to 20 years from now. We can only judge what appear to be the most promising approaches from our current perspective, and that is what we have tried to document here. No doubt there will be many surprises ahead. Genetic dissection of complex traits is the greatest challenge in genetics at the start of the new millennium, and we hope that the wisdom conveyed herein will shed some light. What a time it is to be a genetic epidemiologist!

D. C. Rao and M. A. Province
I
Newton Morton: The Wisconsin Years

James F. Crow
Department of Genetics
University of Wisconsin
Madison, Wisconsin 53706
I first heard about Newton Morton from a graduate school classmate, Gordon Mainland, who was on the faculty of the University of Hawaii. He recommended Morton for graduate work, saying he was the brightest student he had ever had. Mainland also said that Morton might be a challenge and implied that I was the person to accept it. Accept it I did, and the challenge was a sheer delight. Newton plunged immediately into the problem of measuring Wright's effective population number experimentally with Drosophila. He finished this and a master's thesis in short order. Measuring effective population number is still an active field of research, still based on the theory that Newton and I developed together while he was a first-year graduate student (Crow and Morton, 1955). His real interest, however, was human genetics, and after his master's work he took leave from graduate school to go to Japan and join the study of the Hiroshima and Nagasaki populations. While there, he met a gifted Japanese mathematical geneticist, Motoo Kimura, and told me about him. As a result, Kimura eventually joined my group. At about the same time, Sewall Wright moved from Chicago to Madison. So I had two brilliant students, and we all three had the opportunity to interact with Wright whenever we wished. It was a heady time. Kimura and Morton had different interests. Kimura was applying the Kolmogorov equations to stochastic problems in population genetics. Morton was developing methodology for human genetics. Although our interests were different, the three of us talked regularly, and we had the unique privilege of having Wright as a consultant.

Advances in Genetics, Vol. 42. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved. 0065-2660/01 $35.00
By the time his Ph.D. was completed, Morton had written 10 papers. He stayed in Madison as a postdoc and faculty member until moving to Hawaii a few years later. While in Madison, he won the Lederle Award in 1958 and the Allan Award from the American Society of Human Genetics in 1963. From the beginning, Morton was a self-starter. I played a role, but it was more as a sympathetic listener and sounding board for his ideas than as a guide. It was a very rich period, for during this time he laid the foundation for much of the standard methodology of human genetics and genetic epidemiology. He was also a pioneer in the use of computers for genetic analysis. And of course, he has continued to produce important and voluminous work ever since. Here are a few of his accomplishments during the Wisconsin period. Most important is his work in linkage, his Ph.D. thesis problem. He used lod scores, prior probabilities, and the sequential probability ratio test for efficient detection of linkage and estimation of the recombination fraction (Morton, 1955a). One of his first studies, a classic, showed linkage of elliptocytosis and the Rh locus, but with a new twist. The pedigrees fell into two clearly distinct groups, strikingly revealed by plotting lods against the recombination fraction. This was the first use of linkage to show that a clinical entity was genetically heterogeneous (Morton, 1956). During his short stay in Japan, Newton did a number of studies involving pedigree and consanguinity analysis. He provided empirical risks for consanguineous marriages (Morton, 1958) and showed the various components of birth weight, including a large maternal component (Morton, 1955b). After returning to Wisconsin, together with H. J. Muller and me, he developed genetic load theory. We worked out a way in which consanguinity data on viability could be used to infer heterozygous selection (partial dominance) and estimate genomic deleterious mutation rates (Morton et al., 1956).
Newton then extended the load analysis to phenotypic traits (Morton, 1960). A combination of consanguinity, pedigree, and segregation analysis permitted him to show, for deafness, the large contribution of recessive genes and, for the first time, to provide a minimum estimate of their number (Chung et al., 1959). Morton provided a substantial extension of the work of Weinberg, Fisher, and others on segregation analysis. Making use of computer availability, he worked out the theory of testing segregation ratios under various forms of incomplete ascertainment and provided tables to ease the computation (Morton, 1959). Among several other things, he showed selection for heterozygous mothers at the MN locus (Morton and Chung, 1959), clarified the four genetic types of muscular dystrophies, worked out the genetics of spherocytosis, and developed a discriminant function for the detection of PKU carriers. Finally, he was one of the first to appreciate the pioneering work of Gustave Malécot on inbreeding theory and population structure. Happily, Newton was fluent in
French, and he did more than anyone else to bring Malécot's great work to the attention of English-speaking geneticists. Much of what Newton did in Wisconsin was ahead of its time. This is strikingly true of linkage analysis. At the time there were only two good examples of autosomal linkage. For many geneticists, Newton's linkage work had little impact, simply for want of opportunities to apply it. Now things are different, and it is a great satisfaction for me to note that the field has finally caught up with him. Every human geneticist now knows what a lod score is, but it wasn't always that way. I hope that this belated recognition is bringing satisfaction to him; it is clearly bringing vicarious satisfaction to me. Newton Morton has pioneered in one area after another. This volume attests not only to his early pathbreaking work, but to his continued output. Human genetics is a far richer field for his having been involved in it.
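For readers who know the lod score only by name, it can help to see one computed. The sketch below is a textbook simplification (phase-known, fully informative meioses), not Morton's general sequential procedure, and the counts are hypothetical: the lod is the base-10 log of the likelihood at recombination fraction theta relative to free recombination (theta = 1/2).

```python
import math

def lod(theta, recombinants, meioses):
    """Lod score for phase-known, fully informative meioses:
    log10 of L(theta) / L(1/2), where L(theta) is the binomial
    likelihood of observing the given number of recombinants."""
    k, n = recombinants, meioses
    return (k * math.log10(theta / 0.5)
            + (n - k) * math.log10((1.0 - theta) / 0.5))

# 1 recombinant among 10 informative meioses; the lod peaks at
# theta = k/n = 0.1, where it is about 1.6.
score = lod(0.1, 1, 10)
```

By construction lod(1/2) = 0, and a maximum lod of 3 became the conventional threshold for declaring linkage.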
References

Chung, C. S., Robison, O. W., and Morton, N. E. (1959). A note on deaf mutism. Ann. Hum. Genet. 23, 357-366.
Crow, J. F., and Morton, N. E. (1955). Measurement of gene frequency drift in small populations. Evolution 9, 202-214.
Morton, N. E. (1955a). Sequential tests for the detection of linkage. Am. J. Hum. Genet. 7, 277-348.
Morton, N. E. (1955b). The inheritance of human birth weight. Ann. Hum. Genet. 20, 125-134.
Morton, N. E. (1956). The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood group. Am. J. Hum. Genet. 8, 80-96.
Morton, N. E. (1958). Empirical risks from consanguineous marriages. Am. J. Hum. Genet. 10, 344-349.
Morton, N. E. (1959). Genetic tests under incomplete ascertainment. Am. J. Hum. Genet. 11, 1-16.
Morton, N. E. (1960). The mutational load due to detrimental genes in man. Am. J. Hum. Genet. 12, 348-364.
Morton, N. E., and Chung, C. S. (1959). Are the MN blood groups maintained by selection? Am. J. Hum. Genet. 11, 237-251.
Morton, N. E., Crow, J. F., and Muller, H. J. (1956). An estimate of the mutational damage in man from data on consanguineous marriages. Proc. Natl. Acad. Sci. USA 42, 855-863.
Newton Morton's Influence on Genetics: The Morton Number

Daniel E. Weeks
Department of Human Genetics
University of Pittsburgh
Pittsburgh, Pennsylvania 15261
I. The Morton Number
II. Data
References
At the symposium we have heard several talks about the lod score statistic, which was one of Newton Morton’s most significant contributions to statistical genetics. I think that it is time to introduce a new statistic, which I believe not only will honor the contributions of Newton Morton in an appropriate manner, but also will come to play a significant and important role in our field for years to come.
I. THE MORTON NUMBER

The new statistic I would like to introduce is one that will provide an accurate measure of Newton Morton's contribution to statistical genetics. This new statistic is "the Morton number." The Morton number is defined inductively as follows (using wording from Grossman, 1998): Newton has Morton number 0. For each n ≥ 0, a person not yet assigned a Morton number who has a joint publication with a person having Morton number n has Morton number n + 1. Anyone who is not assigned a Morton number by this process is said to have Morton number ∞. Thus a person's Morton number is just the minimum distance from that person to Newton Morton in the collaboration graph (in which two
authors are joined by an edge if they have a joint publication). Note that, in terms of acquaintances, it is believed that "everyone in the world is connected to everyone else through a chain of at most six mutual acquaintances" (Collins and Chow, 1998; Gladwell, 1999; Shulman, 1998). For example, Jurg Ott has Morton number 1 since he has published with Newton Morton. I have Morton number 2, as I have published with Jurg Ott but not with Newton Morton. This is shown graphically in Figure 2.1. While my Clinton number is also 2 (but not through Monica), I am much more interested in lowering my Morton number to 1, I hope in the near future.

Figure 2.1. Illustration of the Morton numbers for Newton Morton (0), Jurg Ott (1), and Dan Weeks (2).
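The inductive definition above is simply a breadth-first search over the collaboration graph. A minimal sketch, using the toy graph of Figure 2.1 plus one invented, unconnected pair of authors ("Alice Example" and "Bob Example" are hypothetical names, not real coauthors):

```python
from collections import deque

def morton_numbers(edges, root="Newton Morton"):
    """Breadth-first search from Newton Morton over the collaboration
    graph; the distance to each reachable author is his or her Morton
    number. Authors absent from the result have Morton number infinity."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    dist = {root: 0}
    queue = deque([root])
    while queue:
        person = queue.popleft()
        for coauthor in sorted(graph.get(person, ())):
            if coauthor not in dist:
                dist[coauthor] = dist[person] + 1
                queue.append(coauthor)
    return dist

# The toy graph of Figure 2.1, plus the invented unconnected pair:
edges = [("Newton Morton", "Jurg Ott"),
         ("Jurg Ott", "Dan Weeks"),
         ("Alice Example", "Bob Example")]
numbers = morton_numbers(edges)   # Ott -> 1, Weeks -> 2
```

Anyone missing from the returned dictionary is unreachable in the graph and so has Morton number ∞.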
II. DATA

The data presented here are based on the electronic reference database PubMed, and so are definitely not representative of Morton's full contribution to our field. Even using incomplete data, however, we still find that Morton had a huge influence on our field. Morton has 291 PubMed entries, with approximately 335 coauthors (as of August 23, 1999). These coauthors each have Morton number 1. Using "as complete an author name as possible" on all publications consistently (Grossman and Ion, 1995) would greatly facilitate future research into Morton numbers.

Let us now consider the 37 Allan Award winners; the Allan Award has been given by the American Society of Human Genetics annually since 1962 in recognition of outstanding contributions to human genetics. Note that the first Allan Award winner has Morton number 0 (i.e., it is Newton Morton himself, who won this prize in 1962, but we won't mention that I was only one year old at the time). A remarkably high percentage of Allan Award winners have Morton numbers of 1 or 2 (Table 2.1), showing Morton's influence on the whole field of human genetics!

Table 2.1. The Distribution of Morton Numbers for the 37 Allan Award Winners

  Morton number   Count
  0                1 (3%)
  1                4 (11%)
  2               26 (70%)
  >2               6 (16%)

I have also computed the Morton numbers for the 30 speakers and moderators listed as participating in the Newton Morton Symposium (Table 2.2).

Table 2.2. The Distribution of Morton Numbers for the 30 Speakers and Moderators at the Newton Morton Symposium

  Morton number   Count
  1                8 (27%)
  2               19 (63%)
  3                1 (3%)
  4                2 (7%)

Judging by the Morton numbers in Table 2.2, the group of people assembled here to honor Morton appear to be even more distinguished than the Allan Award winners. And finally, I'd like to point out that examination of collaboration graphs is of current scientific interest, as evidenced by a paper in Nature in which Watts and Strogatz (1998) examined collaboration graphs and showed that "it only takes a small number of well-connected people to turn a large world into a small world" (Collins and Chow, 1998). Certainly, judging by the empirical results presented here, Newton Morton is one of those well-connected and influential people to whom all of us are more closely connected than we might realize.
Acknowledgments

I thank Jeffrey R. O'Connell for helping me improve this presentation and for pointing out that my John F. Kennedy number is 2, through my great-aunt, Dr. Janet Travell.
References

Collins, J. J., and Chow, C. C. (1998). It's a small world. Nature 393, 409-410.
Gladwell, M. (1999). Six degrees of Lois Weisberg. New Yorker, January 11, pp. 51-63.
Grossman, J. W. (1998). Erdős number update. http://www.oakland.edu/~grossman/enu.ps
Grossman, J. W., and Ion, P. D. F. (1995). On a portion of the well-known collaboration graph. Congressus Numerantium 108, 129-131.
Shulman, P. (1998). From Muhammad Ali to Grandma Rose. Discover, December, pp. 85-89.
Watts, D. J., and Strogatz, S. H. (1998). Collective dynamics of 'small-world' networks. Nature 393, 440-442.
Genetic Dissection of Complex Traits: An Overview

D. C. Rao
Division of Biostatistics and Departments of Psychiatry and Genetics
Washington University School of Medicine
St. Louis, Missouri 63110
I. Summary
II. Introduction
III. Existence of Genetic Effects
IV. Overall Study Design
V. Other Issues Related to Study Design
VI. Lumping and Splitting as a Strategy
VII. Discussion
References
I. SUMMARY

Genetic dissection of even simple Mendelian traits has been sufficiently challenging. Complex traits are proving to be much more challenging and frustrating than previously thought. The concepts, methods, and strategies discussed in this volume emphasize the critical importance of study design, appropriate methods of analysis (including relatively newer and emerging methods), and issues relating to the interpretation of results from genome scans; some thoughts on the future the new millennium holds are offered as well. This chapter overviews the key steps involved in the study of complex traits, which are discussed in detail in subsequent chapters. It is suggested that a combination of lumping and splitting strategies is more appropriate for the analysis of complex traits, and large-scale collaborations should make this possible. For example, by
pooling data and/or results from multiple studies on a given disease/trait, one may attain a sample size large enough to permit the division of the data into multiple, relatively more homogeneous subgroups. The sample size of the subgroups may still be sufficiently large, but the genetic dissection within each subgroup should be much less daunting. The expectation is that analyses within subgroups will enhance gene finding, especially when any interacting determinants are taken into account at the time of dividing the data into subgroups. Perhaps the methods are not yet optimal, but the future holds much promise. In the meantime, the cutting-edge methods discussed in this volume by leading experts should help. There is an increasing and healthy tendency for investigators to collaborate by pooling materials and results across studies, with the goal of increasing the sample size and thus the power. We believe that such efforts are essential for the genetic dissection of complex traits and should contribute to greater success, especially if there is a real commitment to meaningful collaboration. After all, for most complex traits, the question is not whether there are genes, but only when and how they might be found.
II. INTRODUCTION

Genetic dissection of simple Mendelian traits and Mendelian-like traits has been greatly enhanced by the numerous pioneering contributions of Newton E. Morton over the decades. The same methods are widely used today for investigating, and sometimes determining, the genetic basis of complex traits. The lod score method (Morton, 1955), which constitutes the basis of most linkage studies, has been singularly recognized as a pivotal contribution. It certainly represents a major milestone in the genetic dissection of human traits, benefiting from the computational enhancements due to Elston and Stewart (1971), which made possible LIPED, a widely used linkage analysis package (Ott, 1974). The LINKAGE package took this work a step further (Lathrop et al., 1984), with numerous computational enhancements. It is hard to imagine what the current state would be had Morton not developed the lod score method. Morton's contributions have been rich and varied, and some of his earlier work on population structure in particular is now being recognized for its critical role in the genetic dissection of complex traits. In recent years, investigators are increasingly coming to realize that complex traits pose unique challenges that require more imaginative approaches, not necessarily more sophisticated methods. Often, even large-scale investigations end up with disappointing results (perhaps all that is simple about complex traits begins and ends with the spelling). One thing we can all agree on is that complex traits are determined by the joint action of multiple genes and environmental factors. Not surprisingly, routine approaches have largely failed to identify genes for truly complex traits like blood pressure and have generated
much controversy, with "conflicting" findings from multiple studies. It is possible that sometimes what appear to be "conflicting" results may not actually be conflicting, in the sense that different genes may segregate in different study populations and different investigations report on different genes. When this is true, mandating replication in different populations may yield far too many false negatives (i.e., may miss real genes). Perhaps replication studies should be designed more appropriately with this in mind. For complex traits, the effect sizes of most of the multiple etiologic factors are likely to be rather modest. It appears to be a characteristic feature of complex traits that genes with large individual effects are rare. Therefore, as the experience of recent years has shown, methodologies meant for detecting genes with large effects (major genes) are less likely to be successful in finding most of the genes for complex traits. The genetic component of many complex traits is oligogenic (a few genes, each with a moderate effect) or even polygenic (many genes, each with a small effect). Even though the individual effect of a gene may appear to be small, interactions with other genes and/or environments could make a substantial contribution to the final manifestation of the trait. In fact, failure to recognize and accommodate such interactions may often mask the effects of the individual genes. Therefore, to unmask the gene effects and aid in the discovery of disease/trait genes, we must pay attention to all relevant aspects of gene discovery, including study design, optimal methods of analysis, and interpretation of the results. Brute force (that is, a very large sample size alone) may not achieve the desired goal, although sufficiently large sample sizes are necessary.
A. Lumping and splitting as a strategy

A common approach to enhance the power of any study is to utilize larger sample sizes. Fortunately, the concept of multicenter genetic and family studies (e.g., Higgins et al., 1996) is rapidly evolving as a means of generating large samples of family data collected by using standardized protocols. Even in preplanned collaborations of this sort, where common protocols are used and data collection is standardized, one must remain cognizant that the frequency and distribution of risk factors, both genetic and environmental, may well differ among the participating study centers. Clearly, pooling data from different studies that are conducted independently without any standardization poses even greater challenges, as there may be considerable differences in sampling strategy, in phenotypic measurement, or in the ancillary information available for subclassification of a phenotype. When it is not possible to pool the data directly, it may be useful to pool the results from different studies. Some of these issues have been considered in the development of meta-analytic methods for pooling results from multiple linkage studies, as discussed in this volume. For complex traits, where we expect etiologic heterogeneity, one may argue that pooling data across studies may make things worse. Ideally, one
wishes to maximize the signal-to-noise ratio by analyzing the largest possible sample of families sharing the same predominant etiologic factor(s). Strategies that enable investigators to subdivide the pooled data into relatively more homogeneous subgroups are extremely desirable. An approach that holds promise is the classification and regression trees (CART) methodology (Breiman et al., 1984). We believe that a combination of the lumping (pooling data from multiple studies) and splitting (as done in the CART applications) strategies will be very useful. Although the lumping and splitting approach is very attractive from a practical point of view, it is not the only desirable strategy for enhancing gene finding. Very large pedigrees of genetic isolates, such as the Icelandic study or the more recently begun Nepalese study (Blangero et al., personal communication), are very promising. It is not clear which of the two approaches is the more feasible and/or the more cost-effective, or which has the greater potential for finding genes.
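The core splitting step behind CART can be sketched as a variance-reduction search: try every threshold on a covariate and keep the one that makes the resulting subgroups most homogeneous in the phenotype. The toy data below (a hypothetical smoking indicator and invented BMI-like values, not data from any study cited here) show the idea in miniature; real CART implementations recurse over many covariates and prune the resulting tree.

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(covariate, phenotype):
    """One CART-style splitting step: scan every threshold on a single
    covariate and keep the one minimizing the weighted within-subgroup
    phenotypic variance, i.e., the split yielding the most homogeneous
    subgroups. Returns (threshold, weighted_variance)."""
    pairs = sorted(zip(covariate, phenotype))
    n = len(pairs)
    best_threshold, best_score = None, variance(phenotype)
    for i in range(1, n):
        left = [p for _, p in pairs[:i]]
        right = [p for _, p in pairs[i:]]
        score = (len(left) * variance(left)
                 + len(right) * variance(right)) / n
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold, best_score

# Invented toy data: a 0/1 smoking indicator and BMI-like values.
smoking = [0, 0, 0, 0, 1, 1, 1, 1]
bmi = [24, 25, 23, 26, 31, 30, 32, 29]
threshold, score = best_split(smoking, bmi)   # splits at 0.5
```

Applied recursively to pooled multicenter data, splits of this kind are what carve the "lumped" sample into the more homogeneous subgroups discussed above.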
B. Model-based versus model-free methods

Although calling a method "model based" or "model free" serves no real purpose, methods of analysis are usually classified as one or the other. The distinction between the two can be useful in terms of which methods are appropriate for a given purpose. In this context, the "model" in a model-based method simply indicates that a specific model has been assumed for the disease/trait inheritance. Therefore, "model" carries a specific meaning here, unlike the generic meaning referring to a certain mathematical formulation of the overall method (not necessarily meaning the inheritance model). The traditional lod score method is a model-based method because it assumes a trait inheritance model that specifies a gene frequency and the relationship between the latent trait locus genotypes and the phenotype. This volume discusses several model-free methods, so called because a specific trait inheritance model is not assumed. The literature refers to all these methods variously as model based, or parametric, and model free, or nonparametric. Strictly speaking, some of the so-called model-free methods are in fact "parametric" in the sense that unknown parameters are to be estimated which are related in some way to the trait model; therefore, "model free" and "nonparametric" are not necessarily synonymous. Another point to note is that methods are not always strictly model based or model free. While in the model-based lod score method the trait inheritance is "strongly" modeled, in some of the so-called model-free methods, such as the variance components methods, trait inheritance is "weakly" modeled. Therefore, methods may be classified as strongly model based (the traditional lod score method), weakly model based (variance components methods), or model free. Thus, the variance components methods may be viewed as hybrids in that they are neither strongly model-based methods nor totally model free.
III. EXISTENCE OF GENETIC EFFECTS

Genetic dissection presupposes that there indeed exist some trait genes. Investigators undertaking an evaluation of the genetic architecture of a disease/trait usually ensure, or should ensure, that there is considerable familial aggregation, with additional evidence showing that at least part of the familial resemblance is genetic. However, the same cannot be said of all gene finding studies. With the advent of the molecular revolution, it has become customary to undertake genome-wide scans with hundreds of anonymous markers, whose goal is often twofold: first, to identify and locate genomic regions that may harbor genes influencing the trait variability, and second, to thus demonstrate the very existence of trait genes with detectable effect sizes. In light of this dual role, it may be useful to review the potential complexity underlying the trait variability and to consider how simplistically it is modeled in real-life applications.
A. Varying causes of phenotypic variation

When confronting a complex problem, it sometimes helps to acquire a broader perspective by going back to the basics and reviewing the causes of the underlying complexity. In this spirit, Figure 3.1 shows how multiple causes, both genetic and nongenetic, and interactions among them, contribute to the variation in a given phenotype, such as the body mass index (BMI). Some investigators believe that incorporating the full complexity of causation into our analytical models is necessary and important if we are to succeed in finding the genes and understanding their effects. Unfortunately, such complexity often renders the models intractable or indeterminate. Often, lack of data on appropriate family structures and/or on the relevant interacting determinants (e.g., smoking and alcohol consumption, to name just two) prevents us from even entertaining full-blown models. Therefore, despite the awareness that identification of important interactions involving multiple genetic and nongenetic determinants is necessary for the detection of the very genes we are seeking, the complex reality is often approximated by simple but feasible models as shown, for example, at the bottom of Figure 3.1. Using such models, one can investigate the degree of familial resemblance for the phenotype, as shown in Chapter 4 by Rice and Borecki. It is important to note that when data are limited to certain family structures (e.g., intact nuclear families), it is possible to determine the role of familial factors, including both genes and environments, but it may be impossible to resolve the effects of genes from those of the environment(s).
B. Genetic effects or familial environment

We have noted that most often it is possible to determine the strength or magnitude of familial resemblance but difficult to resolve this into genetic versus
D. C. Rao
Figure 3.1. Varying causes of phenotypic variation. Hypothetical model of the underlying genetic and environmental factors giving rise to a complex phenotype such as the body mass index (BMI). In the modeling approximations shown at the bottom, P is the phenotype, h2 is genetic heritability, c2 is the familial environmental component, r2 is residual heritability, and t2 is maximal heritability (due to both genetic and familial environmental effects). [From D. C. Rao and Treva Rice, Path analysis in genetics, in "Encyclopedia of Biostatistics," Vol. 4 (Peter Armitage and Theodore Colton, eds.), 1998. Copyright John Wiley & Sons Limited. Reproduced with permission.]
other components. This difficulty is reflected in certain ambiguous models, as shown at the bottom left of Figure 3.1, which gives rise to the notion of maximal heritability. Maximal heritability, as discussed in Chapter 4 by Rice and Borecki, includes all sources of familial resemblance, including familial environmental effects, and may be regarded as pseudo-polygenic heritability. While maximal heritability of a trait says a lot about the degree to which a phenotype clusters within families, it says nothing about the very existence of possible genetic effects. Some investigators use the maximal heritability value as a means of justifying genome-wide scans, but we must recognize the potential limitation of such an argument. Even a very large value of maximal heritability does not guarantee
that there exist trait genes. It should be clarified that while genomic scans may not be justified in the absence of definite knowledge about the existence of genetic effects, genomic approaches do not suffer from the confounding of genetic and environmental effects. For this reason, a significant maximal heritability may be required as a prerequisite before a genome-wide scan is launched.
IV. OVERALL STUDY DESIGN

One can hardly overemphasize the importance of study design in the planning of any genetic study. Some would argue that the choice of analysis methods is less important than a carefully developed study design. Feasibility of the study, statistical power, and cost-effectiveness all depend critically on the design. It is important that all the available information about the disease/trait (e.g., physiology, etiology) be used fully when decisions are made about the sampling schemes, sampling units, and analytical methods. More information should lead to better designs. The major steps involved in a study design, discussed in the subsections that follow, are amplified throughout the volume. In particular, see Chapter 26 on optimum study designs by Gu and Rao.
A. Definition and refinement of phenotype

Although definition of the phenotype may at first seem to be a trivial issue, some thought should be given to whether the current definition of the phenotype, however expertly done the original statement, is still the right one to use in gene finding studies. After all, our goal is finding the trait genes, not deciding whether to follow a traditional approach. Different definitions of the phenotype do lead to different results. Certain definitions tend to dwarf the signal, while others might at least have the potential to enhance it. The fundamental idea behind any approach to gene finding is one of evaluating the correlation between the degree of phenotypic similarity and genotypic similarity among relatives. Needless to say, this relationship will be weakened if either type of similarity (phenotypic or genotypic) is underestimated. In particular, to avoid underestimating the phenotypic similarity, or at least to minimize the extent of underestimation, we must require that the phenotype be reasonably highly reproducible. Multiple measurements and/or use of other pertinent information (e.g., age of onset, severity of the disease, family history) can lead to refined phenotypes and may result in dramatic improvement in power. For quantitative traits, the average of multiple measurements increases power by reducing the measurement error, and thus by increasing the genetic signal-to-noise ratio (e.g., Rao, 1998). Finally, it would seem desirable to study phenotypes that are not highly reproducible in smaller family units rather than in extended pedigrees. Detailed discussion of issues relating to the phenotype may be found in Chapter 6 by Rice et al.
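The gain from averaging replicate measurements can be quantified with the classical Spearman-Brown formula for the reliability of a mean of k replicates. The sketch below is illustrative (the formula is standard psychometrics, not taken from Rao, 1998, and the numerical values are hypothetical):

```python
def reliability_of_mean(r, k):
    """Spearman-Brown formula: reliability of the average of k
    replicate measurements, given single-measurement reliability r."""
    return k * r / (1 + (k - 1) * r)

# Hypothetical: a phenotype measured with reliability 0.6 per visit.
# Averaging three visits raises the reliability, and with it the
# usable genetic signal, substantially.
r1 = reliability_of_mean(0.6, 1)   # 0.6
r3 = reliability_of_mean(0.6, 3)   # 1.8/2.2, about 0.82
```

Because the observed heritability is attenuated roughly in proportion to this reliability, even a modest number of replicates can noticeably improve the genetic signal-to-noise ratio.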
B. Sampling

Sampling considerations play a critical role in any study design. The most critical among them are the sampling unit, the sampling method, and the sample size; no one of these should be decided independently of the other two. For genetic studies of complex traits, sibpairs of one type or another are commonly used in conjunction with model-free methods of analysis. When sibpair methods are used, sampling larger sibships yields more power per sampled subject than sampling independent sibpairs (Todorov et al., 1997). Other more powerful sampling units, such as extremely discordant (ED) sibpairs (Eaves, 1994; Risch and Zhang, 1995) or extremely discordant and extremely concordant (EDAC) sibpairs (Gu et al., 1996), can reduce the required sample size and increase the power. Sampling some sibs from above the 90th percentile of a trait distribution and other sibs from below the 30th percentile appears to provide an optimum strategy for many quantitative traits. This includes ED sibpairs, sibpairs both above the 90th percentile (high concordant, HC), and sibpairs both below the 30th percentile (low concordant, LC). The HC design is analogous to the affected sibpair (ASP) method. However, for discrete diseases, an affected individual and an unaffected sib do not constitute an ED sibpair; they represent a "discordant" sibpair, not necessarily an "extremely discordant" one, and simply discordant sibpairs do not necessarily constitute an optimum design. In general, any selective sampling strategy could be used to enhance power, although some are more efficient than others in achieving the desired goal. For relatively rare discrete diseases or traits, the ASP (and ASP-like) design is a good sampling unit for linkage studies. One may also select on severity of the disease, where such a measure exists. Finally, it should be noted that extended families can add substantially more power even under random sampling, as discussed in Chapter 12 by Blangero et al.
Feasibility, cost, and reproducibility of the phenotype(s) should all be taken into account when one is choosing a sampling unit. The sampling method has an important bearing on the cost of a genetic study.
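As a concrete illustration of these sampling units, the following sketch classifies a sib pair as ED, HC, or LC given trait-scale cutoffs at the 90th and 30th percentiles. The function name and the numerical cutoffs are illustrative assumptions, not from the chapter:

```python
def classify_sibpair(trait1, trait2, hi_cut, lo_cut):
    """Label a sib pair for selective sampling.
    hi_cut and lo_cut are the trait values at, e.g., the 90th and
    30th percentiles of the trait distribution.
    Returns 'ED', 'HC', 'LC', or None (pair not selected)."""
    low, high = sorted((trait1, trait2))
    if low >= hi_cut:
        return "HC"   # both sibs in the upper extreme
    if high <= lo_cut:
        return "LC"   # both sibs in the lower extreme
    if low <= lo_cut and high >= hi_cut:
        return "ED"   # one sib in each extreme
    return None

# With cutoffs at roughly the 90th (1.28) and 30th (-0.52) percentiles
# of a standardized trait:
pair_type = classify_sibpair(2.0, -1.1, 1.28, -0.52)   # 'ED'
```

An EDAC screen would retain the ED, HC, and LC pairs for genotyping and discard the rest.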
C. Genotyping issues

Most studies pay due attention to issues concerning which type of markers to use, the number of markers to use (density), and whether to genotype all relatives, especially the parents (e.g., Elston, 1992; Elston et al., 1996; Holmans, 1993). With the advent of large-scale genotyping using automated robotics, special attention should be directed toward data management and quality control issues. In general, genotyping errors can have a significant effect on the power of a study. For example, a 10% genotypic (not allelic) error rate requires a sample of 250 ED sibpairs to yield the same 80% power as would a sample of 190 ED sibpairs with zero genotypic error rate when the marker involves 8 alleles (Rao, 1998). Note that the difference of 60 ED sibpairs is huge and that the situation
is much worse for markers with fewer alleles. Weber and Broman review various genotyping issues and their implications for genome-wide scans in Chapter 7.
D. Linkage versus association

Designing a genetic study depends on what the investigator wishes to pursue: linkage or association. A "linkage study" analyzes the cosegregation of two genetic loci in families (some loci may be latent: e.g., a disease locus), while an "association study" investigates their coexistence (nonindependence) in individuals. The premise of a genetic association study is based on the hope that association induced by linkage disequilibrium (LD) will lead us to the gene and that "spurious" association will be excluded by other means (such as the use of family-based controls). While association studies are most meaningful for evaluating the role of physiological candidate genes, opinions differ with respect to the promise and limitations of genome-wide association scans (see Risch and Merikangas, 1996). It is commonly believed that linkage studies have a limited genetic resolution of about 1 cM, and therefore fine mapping requires other approaches. On the other hand, association studies, although notorious for false positives, have a much finer resolution because the recombination history is in a sense used in the calculation. The transmission disequilibrium test (TDT) alleviates the false positive problem caused by population admixture (Spielman et al., 1993; Spielman and Ewens, 1998), except in the presence of loose linkage (Elston, 1998). Perhaps every gene finding study should first apply linkage analysis to identify and narrow down interesting genomic regions, which could then be followed up using dense single-nucleotide polymorphism (SNP) maps. Some of these issues are discussed later by Schork et al. (Chapter 14), Amos and Page (Chapter 15), Terwilliger (Chapter 23), and Chapman and Thompson (Chapter 25).
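For a biallelic marker, the TDT reduces to a simple McNemar-type statistic on transmissions from heterozygous parents (Spielman et al., 1993). The sketch below uses a normal approximation for the p value; the counts are hypothetical:

```python
from math import erf, sqrt

def tdt(b, c):
    """Transmission disequilibrium test for one allele of a biallelic
    marker: b = transmissions and c = non-transmissions of that allele
    from heterozygous parents to affected offspring.
    Returns the 1-df chi-square statistic and a two-sided p value
    computed from the normal approximation."""
    chi2 = (b - c) ** 2 / (b + c)
    z = sqrt(chi2)
    # P(|Z| > z) for a standard normal Z, via the error function
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0))))
    return chi2, p

# Hypothetical counts: 60 transmissions vs. 40 non-transmissions.
chi2, p = tdt(60, 40)   # chi2 = 4.0, p about 0.046
```

Because only heterozygous parents contribute, the test is immune to population stratification in the sense described above.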
E. One-stage versus two-stage designs

Elston (1992) proposed a two-stage strategy as a cost-effective way of designing genomic scans: a relatively sparse marker map is used in the first stage to detect linkage signals, followed by a second stage with a denser marker map around the signals detected in the first stage. Later, Elston et al. (1996) investigated the properties and performance of the one-stage and two-stage strategies, concluding that a two-stage procedure could result in a study costing half the amount of a study done by a one-stage procedure. In such designs, it is critical that the first stage have excellent power, well over the usual 80%, since the second stage cannot recover any linkages missed in the first stage. Also, use of the same sample for both stages may not help in pruning false positives; it is desirable to use independent samples of relative pairs in each stage.
As noted earlier, using the same sample for both stages of a genome scan appears to result in cost savings, but other optimal properties may be absent. We may then ask how often the second stage is able to prune out false signals detected in the first stage. Perhaps a better two-stage design is one in which the first stage carries out a linkage analysis using a relatively dense map and identifies (and narrows down via fine-structure mapping) potential regions to be assessed in the second stage with association studies. For the latter, dense SNP maps are attractive. The effect of using the same sample in such a design is unclear. See Chapter 27 by Guo and Elston and Chapter 26 by Gu and Rao.
F. Sample size and power

For a given significance level, the power of a study (its ability to detect true genes) increases with sample size. Likewise, for a given sample size (i.e., after a study has been carried out), the power increases with the significance level. Unfortunately, we cannot increase the power simply by relaxing the significance level, since false positives also increase with the significance level (except in the first stage of a two-stage design; see Chapter 27 by Guo and Elston). It is important to note that in earlier days the real cost of a false positive was unaffordably high, and accordingly the emphasis was on keeping false positives to an absolute minimum. False positives were minimized by using very stringent significance thresholds (e.g., LOD > 3). Unfortunately, for the genetic dissection of complex traits, this results in low power. This was not a big problem at the time because linkage analysis was used primarily to map disease genes that were already known to exist. These days, linkage analysis of complex traits is used as much to prove that such genes exist as to map them. Therefore, in the current situation, failing to find genes is not desirable either, which renders the issue of sample size and power even more important. It is generally known that detection of complex trait genes is difficult, and replication of a particular linkage finding is even more difficult, especially under genetic heterogeneity (e.g., Suarez et al., 1994). The need to detect genes for complex traits has prompted several new investigations of optimal sampling, since the sample size alone cannot be increased indefinitely. Nonrandom sampling has been exploited to deliver much more power for the same sample size, as discussed earlier in connection with sampling. It is clear that both types of errors need to be minimized as much as possible at the stage of designing a study.
Therefore, the required sample size should be calculated so that it will yield power as high as possible when a relatively more stringent significance level is used. The usual practice of calculating a sample size to yield barely 80% power at a nominal significance level of α = 0.05, and that too with an exaggerated effect size of the gene, does not serve us well. Instead, these calculations should require greater power, such as
90%, and a much smaller significance level, such as α = 0.0023 (see Chapter 29 by Rao and Gu). Understandably, the necessary sample sizes will be huge, and budget projections may not permit this expense. Nonetheless, as long as studies are designed in isolation from each other, with the sample size heavily compromised (for whatever reason), one should not expect miraculous results. Well-designed and well-coordinated multicenter studies appear to be particularly attractive for this purpose. Several other chapters also allude to sample size and power issues; in particular, see Chapter 12 by Blangero et al.
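The impact of moving from the usual (α = 0.05, 80% power) convention to (α = 0.0023, 90% power) can be illustrated with a generic normal-approximation sample-size formula. This is a deliberately simplified sketch (a one-sided test of a standardized per-observation effect d), not a linkage-specific power calculation, and the effect size is hypothetical:

```python
from statistics import NormalDist

def n_required(alpha, power, d):
    """Generic one-sided normal-approximation sample size for a
    standardized per-observation effect d:
    n = ((z_{1-alpha} + z_{power}) / d) ** 2."""
    nd = NormalDist()
    return ((nd.inv_cdf(1 - alpha) + nd.inv_cdf(power)) / d) ** 2

# A small standardized effect, d = 0.1 (hypothetical):
lenient = n_required(0.05, 0.80, 0.1)     # roughly 620 observations
strict = n_required(0.0023, 0.90, 0.1)    # roughly 1690 observations
```

Under these assumptions the stricter design requires about 2.7 times the sample, which is one reason coordinated multicenter studies become attractive.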
G. Cost-benefit analysis

In the end, a fixed budget and practical constraints will likely determine the type of study, regardless of what an investigator wishes to pursue. Especially in the planning of an ED or EDAC sibpair study, we must be realistic in determining the feasibility of a proposed sample size. The EDAC approach was developed primarily to render extreme sibpair studies more feasible (Gu et al., 1996). For extreme sibpair studies, a simple method may allow a quick assessment of the cost versus the benefit (e.g., see Gu and Rao, 1997). For example, let the cost be CP for phenotyping and CG for genotyping of one person. If we need to screen N sibpairs to obtain a certain required number (nED) of ED sibpairs for genotyping, the total cost of the study is 2N·CP + 2nED·CG. The total cost for a corresponding EDAC study would be 2N1·CP + 2(ned + nlc + nhc)·CG, where N1 is the total number of sibpairs needing to be screened to obtain a smaller number (ned) of ED pairs (ned < nED) and a certain number (nlc + nhc) of extremely concordant sibpairs of both types. Clearly, N1 < N and ned < nED. Therefore, the EDAC design not only is more feasible than an ED study, it is also less expensive, as long as CG < CP(N - N1)/(ned + nlc + nhc - nED). Gu and Rao (1997) provide detailed guidelines for cost-benefit analysis. See also Chapter 26 by Gu and Rao.
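The cost comparison above is easy to automate. The numbers below are entirely hypothetical and serve only to exercise the two cost formulas and the break-even threshold from Gu and Rao (1997):

```python
def ed_cost(N, n_ED, CP, CG):
    """Total cost of an ED design: phenotype the 2N screened sibs,
    then genotype the 2*n_ED sibs in the selected ED pairs."""
    return 2 * N * CP + 2 * n_ED * CG

def edac_cost(N1, n_ed, n_lc, n_hc, CP, CG):
    """Total cost of the corresponding EDAC design, which screens
    fewer pairs (N1 < N) but genotypes concordant pairs as well."""
    return 2 * N1 * CP + 2 * (n_ed + n_lc + n_hc) * CG

def breakeven_CG(N, N1, n_ED, n_ed, n_lc, n_hc, CP):
    """EDAC is cheaper than ED whenever CG is below this threshold."""
    return CP * (N - N1) / (n_ed + n_lc + n_hc - n_ED)

# Hypothetical: ED screens 10,000 pairs for 200 ED pairs; EDAC screens
# 4,000 pairs for 80 ED plus 240 concordant pairs; CP = 10, CG = 50.
ed = ed_cost(10000, 200, 10, 50)                          # 220,000
edac = edac_cost(4000, 80, 120, 120, 10, 50)              # 112,000
limit = breakeven_CG(10000, 4000, 200, 80, 120, 120, 10)  # 500.0
```

Here the genotyping cost (50) is well below the break-even threshold (500), so the EDAC design is cheaper despite genotyping more pairs.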
V. OTHER ISSUES RELATED TO STUDY DESIGN

It is not uncommon for study design to be viewed as merely an issue of calculating and justifying a sample size. While at least some study designs address the issues discussed earlier, we feel that optimum study designs must also take into account such additional issues as the methods of analysis and the interpretation of results.
A. Methods of analysis

Genetic linkage is the phenomenon whereby alleles at different loci cosegregate in families. The strength of this cosegregation is measured by the recombination
fraction θ, the probability of an odd number of recombinations. In most cases, unknown phase and other limitations necessitate estimation of the value of θ and statistical testing by means of appropriate methods. Three classes of methods are most commonly used today for linkage analysis: the classic lod score method (discussed in detail in Chapters 8 by Rice et al., 9 by Clerget-Darpoux, and 10 by Ott), the so-called model-free relative pair methods (discussed in Chapters 11 by Elston and Cordell and 17 by Goldgar), and finally the hybrid variance components methods (discussed in considerable detail in Chapters 12 by Blangero et al., 13 by Province, and 21 by Schork; see also Almasy and Blangero, 1998). Especially for complex traits, one lacks reasonable trait models, and therefore routine use of the (strongly) model-based lod score method may not be very attractive. This realization has given rise to the development of alternative methods that are not based on strong assumptions about the trait inheritance. It is natural to reason that the existence of a susceptibility gene should lead to an elevated probability that a pair of affected siblings would inherit the same allele(s) from the parents. Based on this premise, a class of nonparametric methods, or more generally, model-free methods, has been developed based on the sharing of alleles identical by descent (IBD) among relative pairs (see Chapter 11 by Elston and Cordell). Risch (1990a) later gave a rigorous characterization of the IBD distribution using a parameter called the relative-risk ratio (λ) for various types of affected relative pairs. The values of λ are estimable, and based on such estimates, the strength of a linkage signal, the power of a study, and the possibility of the existence of a multilocus model can be deduced (Risch, 1990b; for exceptions, see Rybicki and Elston, 2000).
For quantitative traits, the first insightful method was presented by Haseman and Elston (1972), who took the squared difference of trait values of a sibpair as the outcome variable and regressed it on the proportion of alleles shared IBD by the sibpair, using the model E(Yj | πj) = β0 + β1πj. A significantly negative regression coefficient β1 implies genetic linkage to the marker. Goldgar (1990) and Amos (1994) used maximum likelihood methods to model directly the covariance structure of sibpairs and arrived at a variance components method that was more powerful than the original Haseman-Elston (HE) method. Fulker and Cardon (1994) extended the HE method to estimate the location of a quantitative trait locus by means of flanking markers, specifically by applying the interval mapping method of Goldgar (1990). As a hybrid of the (strongly) model-based and model-free methods, the variance components method combines the strengths of both approaches and seems to perform well. The hybrid methods use all the data available within a pedigree, without excluding subjects with partially missing information or producing redundancy in the statistics by counting all relative pairs. See also related discussions by Terwilliger (Chapter 23), Gauderman and Thomas (Chapter 24), Lalouel (Chapter 31), and Morton (Chapter 32).
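The original Haseman-Elston regression can be sketched in a few lines of ordinary least squares. The toy data are invented purely to show the direction of the expected effect:

```python
def haseman_elston(sq_diffs, pi_hats):
    """Regress squared sib-pair trait differences Y on the estimated
    proportion of alleles shared IBD (pi-hat): E(Y | pi) = b0 + b1*pi.
    Linkage predicts a significantly negative slope b1."""
    n = len(sq_diffs)
    mx = sum(pi_hats) / n
    my = sum(sq_diffs) / n
    sxx = sum((x - mx) ** 2 for x in pi_hats)
    sxy = sum((x - mx) * (y - my) for x, y in zip(pi_hats, sq_diffs))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

# Toy data: pairs sharing more alleles IBD are phenotypically closer.
pi_hat = [0.0, 0.5, 0.5, 1.0]
y_sq = [4.0, 2.5, 3.0, 1.0]
b0, b1 = haseman_elston(y_sq, pi_hat)   # here b1 = -3.0
```

In a real analysis the slope would of course be tested against its standard error, one-sided, rather than merely inspected for sign.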
Another promising methodology is that of artificial neural networks (ANNs). The ANNs, developed originally for modeling interactions among neurons in the brain, consist of interconnected layers of nodes (neurons). A classical feed-forward ANN includes an "input layer," an "output layer," and perhaps one or more "hidden layers" in between. In the general formulation, each input node is connected to every node in the hidden layer (if any) by weighted links, and each node in the hidden layer is connected to every node in the output layer. Weights associated with these connecting links determine the relative importance of various connections. In genetics applications, markers (genotypes or alleles) and environmental factors may be regarded as the input variables, with the phenotypic outcomes represented by the output nodes. In this approach, first applied to human linkage analysis by Lucek and Ott (1997), gene finding may be viewed as a mapping from the set of markers and environmental factors to the set of phenotypic outcomes. An advantage of applying ANNs to complex traits is that the method enables analysis of the entire genome at one time, with the potential for detecting gene-gene and gene-environment interactions (Saccone et al., 1999). Sherriff and Ott present in Chapter 20 a basic description of ANNs, along with applications to affected sibpairs, case-control studies, and family data.

Methods for the analysis of associations have undergone many enhancements in the last decade. Although candidate genes have been the primary focus of association studies, genome-wide association scans are being projected as a promising approach (Risch and Merikangas, 1996; see also Chapter 14 by Schork et al. in this volume). It remains to be seen to what degree the unpredictable pattern of linkage disequilibrium proves to be a limitation for genome-wide association scans.
While the full promise of this approach remains unclear, partial association scans in the narrow genomic regions identified by linkage seem to be particularly attractive. Methods of association are discussed in Chapters 13 (Province), 14 (Schork et al.), 16 (Eaves and Sullivan), 23 (Terwilliger), and 25 (Chapman and Thompson).
B. Interpretation of the results

When one is interpreting genome scan results, caution should be exercised before either positive or negative conclusions are drawn. Three issues are important: consistency of results across studies, split-sample analyses when sample size permits, and the choice of significance level used to interpret the results. While the level of evidence from a single study may be small (e.g., LOD = 1.5), consistently obtaining similar levels of evidence from other studies should increase our confidence in the result. If the sample size is large enough, this concept can be applied even within a study by randomly splitting a sample into two subgroups and analyzing each one separately. Concordance of results between the subgroups
would indicate that the result may be true. This approach is inherently attractive, especially if the splitting and within-subgroup analyses are repeated several times. Genome scan analyses, as they are performed today, give rise to the problem of multiple testing, thus necessitating caution in the choice of a significance level. Failure to correct for multiple testing will inevitably yield an unrealistically high number of positive results, many of which would be false. Likewise, overcorrecting for multiple testing will inevitably lead to far fewer positive results, thus missing some or even most of the very signals we seek in the first place. Perhaps some reasonable guidelines need to be followed to minimize both types of error (false positives that cannot be replicated and false negatives that will remain undetected). Such guidelines must be based on a balanced consideration of, among other things, the real costs associated with false positives and false negatives. It is important to note that a significance level merely indicates one's tolerance for inferring a false result. Choice of appropriate significance levels for genome-wide scans has been a thorny issue (Risch, 1991; Thomson, 1994; Lander and Kruglyak, 1995; Rao, 1998; Morton, 1998). While everyone agrees that reporting false positives is both undesirable and misleading, the issue of false negatives (i.e., missing true signals) has not received much attention. For example, Lander and Kruglyak (1995) have recommended that the genome-wide significance level be set at 0.05 (with a pointwise significance level of α = 0.000022). On the other hand, Thomson (Chapter 28) suggests that perhaps we should not make any corrections for multiple testing, while Rao and Gu (Chapter 29) suggest accepting, on average, one false positive per genome scan, to strike a balance between false positives and false negatives (with α = 0.0023).
It seems advisable to use a less stringent significance level such as α = 0.0023 for the purpose of identifying genomic regions that may be pursued further in follow-up studies. False positives generated by less stringent significance levels should then be resolved by other means, as discussed earlier. It is clear that the choice of a significance level is complicated on account of multiple testing. Can multiple testing be avoided altogether by analyzing all markers simultaneously? Province (Chapter 30) presents an exciting new methodology based on sequential methods of analysis. One of his methods, called the SMDP method, performs just one global hypothesis test making simultaneous use of all the marker data available, and does so by using the smallest number of observations necessary (i.e., without necessarily using the entire sample, thus saving some of it for replication/validation). We anticipate that genome analyses of the future will use this and similar methods.
VI. LUMPING AND SPLITTING AS A STRATEGY

There are several other methods and strategies that are believed to be very promising in the context of complex traits, some of which are covered in this
volume: meta-analysis, multivariate methods, context-dependent effects, classification methods, and neural networks. Lumping and splitting as an analysis strategy is illustrated in Figure 3.2. We believe that it is particularly useful for finding complex trait loci. As the ongoing experience with complex diseases shows, most studies, however large, find ambiguous or unconvincing levels of evidence, which tends to dampen enthusiasm for any follow-up efforts. By pooling various studies into a common database, whenever the protocols permit, we make it possible for other novel approaches to be applied. These emerging methods have considerable potential
Figure 3.2. Lumping and splitting as an analysis strategy for genetic dissection of complex traits. [Diagram: lumping of individual studies leads to joint analysis or meta-analysis; splitting leads to tree linkage, clustering, and context-dependent analyses.] The strategy reflects the recognition that, since QTL effects are likely to be modest, analysis of separate individual studies would severely compromise sample size and power. By lumping or pooling individual studies when appropriate, we have the option of carrying out joint (combined) analysis of the entire pooled data set or meta-analysis of the basic results from individual studies. Since pooled data would likely increase the heterogeneity, we also have the opportunity to subdivide the pooled data into relatively more homogeneous subgroups, each of which might be large enough to permit analysis of the subgroups separately. Subgrouping may be based on tree linkage, clustering, context dependency, or other algorithms that utilize covariate information for splitting.
at least for taking us to the next level of data analysis, and sometimes finding genes that were missed previously.
A. Meta-analysis

The term "meta-analysis" is applied to a wide variety of statistical procedures developed for summarizing results from multiple studies (see Olkin, 1995, for a review). Meta-analysis techniques were first applied to genetic studies in the late 1990s (Li and Rao, 1996; Rice, 1998; Gu et al., 1998). For meta-analysis of linkage results, Gu et al. (1998) proposed using the proportion of alleles shared IBD at a marker locus by a sibpair (with specified trait outcomes) as the common effect, and they presented methods for pooling results from model-free sibpair analyses. A random effects model was used to characterize the among-study variability, and the weighted least-squares method was used to generate a weighted estimate of the overall effect and the variance components. A heterogeneity test was also proposed to assess variability among studies. This model was then extended to incorporate study-specific covariates by using a mixed effects model that enables explanation of possible heterogeneity among studies (Gu et al., 1999). Chapter 18 by Gu et al. presents a detailed discussion of the methods. This area can benefit from further methodological work.
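The inverse-variance pooling at the core of such weighting schemes, together with Cochran's Q heterogeneity statistic, can be sketched as follows. This is a fixed-effect simplification of the random-effects model described above, and the per-study estimates (excess IBD sharing) and variances are hypothetical:

```python
def pool_fixed(effects, variances):
    """Inverse-variance (fixed-effect) pooled estimate and its
    variance, from per-study effect estimates and sampling variances."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total
    return pooled, 1.0 / total

def cochran_q(effects, variances, pooled):
    """Cochran's Q heterogeneity statistic (chi-square on k-1 df)."""
    return sum((e - pooled) ** 2 / v for e, v in zip(effects, variances))

# Three hypothetical studies: mean IBD sharing minus 0.5, with variances.
effects = [0.04, 0.06, 0.02]
variances = [0.0004, 0.0009, 0.0001]
pooled, pooled_var = pool_fixed(effects, variances)  # pooled about 0.027
q = cochran_q(effects, variances, pooled)            # about 2.1 on 2 df
```

A random-effects analysis would add an estimated between-study variance component to each study's sampling variance before reweighting.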
B. Multivariate methods

Another characteristic feature of complex traits is that the effects of individual loci often manifest in a battery of correlated traits, and this additional information can be exploited by using appropriate multivariate methods of analysis. Almasy et al. (1997) have used multivariate linkage analysis to separate direct causal effects from indirect effects through pleiotropy. Todorov et al. (1998) proposed a treatment of multiple traits in which structural relationships among multiple phenotypes are used to differentiate direct causal effects from secondary influences of genes (see Chapter 13 by Province). Other multivariate methods such as principal component analysis (PCA) and factor analysis are often used to reduce the dimensionality of data (e.g., Bartholomew, 1987). With respect to genetic studies, the method of PCA can be used to construct a few "summary phenotypes" (explaining most of the variance) from a large number of correlated traits, which can be used in turn for genetic analysis. Alternatively, the method of latent factor analysis can be used to select a group of latent factors that could explain the observed correlation structure in the multivariate phenotypic data (Comuzzie et al., 1997; Neuman et al., 1999). The ideal case would be that each of the latent factors selected is influenced by one gene, so that at least some genes and their major effects are reconstructed. However, linkage analysis of "major" principal components alone is known to
increase false negatives and the linkage signals may actually be found in the “minor” components not analyzed (Olson et al., 1999; Goldin and Chase, 1999). Alternatively, full multivariate methods for simultaneous analysis of multiple traits may be used with greater power, as Ghosh and Majumder discuss in Chapter 22.
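For the simplest two-trait case, the "summary phenotypes" of PCA can be obtained in closed form from the 2x2 covariance matrix. In practice one would use a numerical linear-algebra library for many traits, so the sketch below is purely illustrative:

```python
from math import sqrt

def pca_2x2(a, b, c):
    """Principal components of the 2x2 covariance matrix [[a, b], [b, c]].
    Returns (eigenvalue, unit eigenvector) pairs, largest variance first."""
    disc = sqrt((a - c) ** 2 / 4.0 + b * b)
    comps = []
    for lam in ((a + c) / 2.0 + disc, (a + c) / 2.0 - disc):
        # (b, lam - a) is an eigenvector for eigenvalue lam when b != 0
        if b != 0:
            vx, vy = b, lam - a
        else:
            vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
        norm = sqrt(vx * vx + vy * vy)
        comps.append((lam, (vx / norm, vy / norm)))
    return comps

# Two standardized traits correlated 0.6: the leading summary phenotype
# weights both traits equally and explains 1.6/2 = 80% of the variance.
pcs = pca_2x2(1.0, 0.6, 1.0)
```

Note how even here 20% of the variance sits in the "minor" component, which is exactly where a linkage signal could hide if only the leading component were analyzed.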
C. Context dependency

It is generally recognized that complex traits are derived from the interactions among many genes and nongenetic determinants. However, most analyses of risk factors attempt to identify determining agents that have a consistent effect in all contexts. The genes contributing to complex traits do not have the same effect across all time or across all environments. In general, context-dependent effects, or gene-environment interactions, may be more amenable to detection when the contexts are explicitly considered (Cheverud and Routman, 1995; Turner et al., 1999; Kardia, 2000). Thus, in exploring the relationships between genetic variation and phenotypic variation, it is instructive to carry out some of the analyses within particular contexts. The most basic and important specific contexts seem to be age, sex, and family history. Additional contexts may also be warranted depending on the trait of interest, such as general body size in investigations of cardiovascular and fitness-related traits. Likewise, socioeconomic factors might provide an additional context when one is dealing with diagnoses based on assessments. Some may at first regard context dependency as merely another way of incorporating covariate effects into the analysis models (as was done in linkage analysis by Towne et al., 1997). We caution, however, that it is not always possible to model covariate effects in an unconstrained manner. Therefore, it may be prudent to investigate at least the primary findings of a study by further specializing within contexts. We subscribe to the view that although modeling covariate effects can reveal useful information about context-dependent effects, such modeling alone may not fully reflect the intricacies of context-dependent effects.
VI. Regression tree linkage

Although analyses of aggregate samples by means of sophisticated models have the potential to uncover complex trait genes, analyses of relatively more homogeneous subgroups should enhance gene finding. While context dependency provides one type of subgrouping, the classification and regression trees methodology (Breiman et al., 1984) offers another. For complex traits, we cannot anticipate all possible complex interactions among the determinants, and even if we could, we are unlikely to succeed in modeling them adequately. It is therefore useful to consider alternative ways of simplifying the problem, which may then
30
D. C. Rao
render simpler solutions. One way of doing this would be to divide the data into potentially more homogeneous subgroups, with the expectation that a simpler model with very few interacting determinants might suffice for analyses of individual subgroups. The classification and regression trees methodology, or CART, offers one promising way to subdivide the data. CART provides a purpose-oriented methodology for partitioning a data set into relatively homogeneous subgroups that give a gradient of risk to an outcome. An inherent attraction of the CART methodology is its assumption that interactions among the independent variables (the predictors) are more the rule than the exception. The CART methods typically use one predictor at a time to partition the data through a series of binary splits. In genetic studies, CART can be used to focus attention on families in which the signal is the greatest. This could be done by using relevant covariate information to identify clinically and/or biologically more homogeneous subgroups, within each of which the disease etiology may be more homogeneous. Applications of this methodology to linkage studies are in the early stages (Rao, 1998; Shannon et al., 2000). Chapter 19 by Province et al. presents a coherent discussion of this topic, which will benefit greatly from further research.
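The core splitting step can be sketched in a few lines of Python. This is only an illustrative toy, not the full CART algorithm of Breiman et al. (1984) (no recursion, pruning, or cross-validation), and the family-level covariate (mean age of onset) and outcome (a per-family evidence score) are invented for the example.

```python
def sse(values):
    """Within-group sum of squared deviations from the group mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_binary_split(covariate, outcome):
    """Return (threshold, impurity) for the single-predictor binary
    split that minimizes the total within-group SSE of the outcome."""
    pairs = sorted(zip(covariate, outcome))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        impurity = sse(left) + sse(right)
        if impurity < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, impurity)
    return best

# Hypothetical family-level data: mean age of onset (covariate) and a
# per-family linkage-evidence score (outcome).
onset = [35, 38, 40, 42, 61, 63, 66, 70]
score = [2.1, 2.3, 2.0, 2.4, 0.4, 0.5, 0.3, 0.6]
threshold, impurity = best_binary_split(onset, score)
print(threshold)  # 51.5: families split into early- vs late-onset subgroups
```

A real regression tree would apply this search recursively over many candidate predictors, retaining at each node the predictor and threshold with the smallest impurity.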
VII.

The study of human disease is entering a new era. The Human Genome Project is progressing by leaps and bounds, thus providing the molecular tools and opportunities necessary for the genetic dissection of human diseases and disorders. Yet significant challenges remain, especially in understanding complex traits. Complex traits, such as coronary heart disease, hypertension, and most psychiatric disorders, to name just a few, are also common in the population, accounting for a significant proportion of the public health burden. This preponderance in turn emphasizes the importance of the endeavor and, at the same time, complicates the analysis. In contrast to Mendelian traits, one seeks the genes that influence predisposition to complex diseases rather than those that cause them, a considerably more daunting task that mandates newer and more imaginative approaches. With unprecedented opportunities unfolding, the new millennium promises to be an exciting time for genetic epidemiologists. Although investigators realize that complex traits arise from interactions among multiple genes and environments, not much has been done about it to date: most studies routinely ignore the interactions. We are learning the hard way that complex traits need to be handled more thoughtfully if we are to find their genes and understand how genes and environments work together to produce them. We begin the new millennium with challenges, but with them come enormous opportunities.
Early results from multiple genome-wide scans indicate that although the scans are successful in finding some of the trait genes with large individual effects, much of the evidence is inconclusive, thus leaving many of the genes yet to be found. This raises two important questions: How might we replicate the few initial findings of major genes? And how might we find the many other genes with moderate effects that yield inconclusive results? The choice of population is believed to play a key role in replication studies. In general, failure of a replication study to substantiate the original finding can mean one of three things: the original finding was a false positive, the replication sample had inadequate power (false negative), or the replication study was based on a different "population" whose characteristics differ from those of the original study. It is possible that the expression of one or more genes occurred in the original population because of a favorable environment, whereas different interacting environmental determinants in the replication study prevented expression under those conditions. Therefore, it is desirable for replication studies to be conducted in populations with characteristics comparable to those of the original population. What about inconclusive findings with much smaller levels of evidence? Sometimes what is more important than the within-study evidence is a pattern of consistency across studies (much in the spirit of replication), even though the evidence from any single study may be moderate. It may be more promising to find a lod score near 2 in each of several studies than a lod score of 4 in one study and near zero in others (thus failing to replicate the one big lod score). With sufficiently large sample sizes, split-sample analysis might be worth pursuing; concordance of evidence in the split samples would be encouraging.
Because of the arbitrariness of any single split-sample analysis, one should resort to repeated split-sample analyses and look for a pattern in the results. This approach is likely to yield additional information inherent in the data. Reliance on these and similar methods will likely identify genomic regions that harbor trait genes and, with some luck, these regions will be sufficiently narrow for further follow-up work. While opinions differ about the best ways to pursue follow-up work, it seems promising to pursue dense SNP mapping of each region as a way of sublocalizing the gene(s), especially if the region is sufficiently narrow. If not, one may be guided by physiology and pursue the (positional) candidate genes in those regions with dense SNP mapping. Either way, the potential exists to ultimately find the functional variants we seek. Especially when multiple genomic regions and/or multiple positional candidates are implicated, evidence of linkage in syntenic regions of well-characterized animal models might be used to prioritize the follow-up work. Although animal models have not yet played a major role in the identification of disease susceptibility genes in humans, the manipulability of animal models makes them important at least for testing the genes that are found in humans.
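The repeated split-sample idea can be sketched as follows. The per-family evidence scores and the simple mean-score criterion below are invented stand-ins for an actual linkage statistic; only the resampling logic is the point.

```python
import random

def split_concordance(scores, cutoff, n_splits=1000, seed=1):
    """Fraction of random half-splits of the sample in which BOTH
    halves independently exceed the evidence cutoff."""
    rng = random.Random(seed)
    half = len(scores) // 2
    concordant = 0
    for _ in range(n_splits):
        shuffled = scores[:]
        rng.shuffle(shuffled)
        a, b = shuffled[:half], shuffled[half:]
        if sum(a) / len(a) > cutoff and sum(b) / len(b) > cutoff:
            concordant += 1
    return concordant / n_splits

# A signal spread evenly across families is concordant in every split;
# consistency across many splits is the encouraging pattern.
uniform_signal = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 2.2]
print(split_concordance(uniform_signal, cutoff=1.0))  # 1.0
```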
Genetic dissection of complex traits needs multiple approaches, since a given method may work in one case but not in the next. As experienced investigators realize, reliance on a single analytical method, however appropriate it may seem under the circumstances, is not optimal. Likewise, reliance on any single analysis strategy, such as analysis of the aggregate sample versus analysis of subgroups, may also be less than optimal. As long as our primary objective is to find genes for complex diseases and disease-related traits, we should be willing to consider alternative strategies toward achieving that objective. In particular, we believe that the "lumping and splitting" strategy, which requires collaborative efforts, holds promise. There is tremendous opportunity for meaningful collaborations, and this may be the limiting factor in whether we succeed. For such collaborations to be productive, one must go beyond a mere willingness to share data. Only when investigators interact actively without barriers can the added benefits of constructive synergism be pitted against the complex challenges, with the promise that we may win at least some of the time. After all, for most complex traits, the question is not whether there are genes, only when and how they might be found.
Acknowledgments This work was partly supported by a grant from the National Institute of General Medical Sciences (GM 28719) of the National Institutes of Health.
References

Almasy, L., and Blangero, J. (1998). Multipoint quantitative trait linkage analysis in general pedigrees. Am. J. Hum. Genet. 62:1198-1211.
Almasy, L., Dyer, T. D., and Blangero, J. (1997). Bivariate quantitative trait linkage analysis: Pleiotropy versus co-incident linkages. Genet. Epidemiol. 14:953-958.
Amos, C. I. (1994). Robust variance-components approach for assessing genetic linkage in pedigrees. Am. J. Hum. Genet. 54:535-543.
Bartholomew, D. J. (1987). "Latent Variable Models and Factor Analysis." Oxford University Press, New York.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). "Classification and Regression Trees." Wadsworth, Belmont, CA.
Cheverud, J., and Routman, E. J. (1995). Epistasis and its contribution to genetic variance components. Genetics 139:963-971.
Comuzzie, A. G., Mahaney, M. C., Almasy, L., Dyer, T. D., and Blangero, J. (1997). Exploiting pleiotropy to map genes for oligogenic phenotypes using extended pedigree data. Genet. Epidemiol. 14:975-980.
Eaves, L. J. (1994). Effect of genetic architecture on the power of human linkage studies to resolve the contribution of quantitative trait loci. Heredity 72:175-192.
Elston, R. C. (1992). Designs for the global search of the human genome by linkage analysis. Proceedings of the XVIth International Biometric Conference, Hamilton, New Zealand, pp. 39-51.
Elston, R. C. (1998). Linkage and association. Genet. Epidemiol. 15:565-576.
Elston, R. C., and Stewart, J. (1971). A general model for genetic analysis of pedigree data. Hum. Hered. 21:523-542.
Elston, R. C., Guo, X., and Williams, L. V. (1996). Two-stage global search designs for linkage analysis using pairs of affected relatives. Genet. Epidemiol. 13:535-558.
Fulker, D. W., and Cardon, L. R. (1994). A sib-pair approach to interval mapping of quantitative trait loci. Am. J. Hum. Genet. 54:1092-1103.
Goldgar, D. (1990). Multipoint analysis of human quantitative genetic variation. Am. J. Hum. Genet. 47:957-967.
Goldin, L. R., and Chase, G. A. (1999). Comparison of two linkage inference procedures for genes related to the P300 component of the event related potential. Genet. Epidemiol. 17 Suppl. 1:S163-167.
Gu, C., and Rao, D. C. (1997). A linkage strategy for detection of human quantitative trait loci. II. Optimization of study designs based on extreme sibpairs and generalized relative risk ratios. Am. J. Hum. Genet. 61:211-222.
Gu, C., Todorov, A. A., and Rao, D. C. (1996). Combining extremely concordant sibpairs with extremely discordant sibpairs provides a cost-effective way to linkage analysis of QTL. Genet. Epidemiol. 13:513-533.
Gu, C., Province, M. A., Todorov, A. A., and Rao, D. C. (1998). Meta-analysis methodology for combining non-parametric sibpair linkage results: Genetic homogeneity and identical markers. Genet. Epidemiol. 15:609-626.
Gu, C., Province, M. A., and Rao, D. C. (1999). Meta-analysis of genetic linkage to quantitative trait loci with study-specific covariates: A mixed effects model. Genet. Epidemiol. 17 Suppl. 1:S599-604.
Haseman, J. K., and Elston, R. C. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2:3-19.
Higgins, M., Province, M. A., Heiss, G., et al. (1996). The NHLBI Family Heart Study: Objectives and design. Am. J. Epidemiol. 143:1219-1228.
Holmans, P. (1993). Asymptotic properties of affected sibpair linkage analysis. Am. J. Hum. Genet. 52:362-374.
Kardia, S. L. R. (2000). Context-dependent effects in hypertension. Current Hypertension Reports (submitted).
Lander, E., and Kruglyak, L. (1995). Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat. Genet. 11:241-247.
Lathrop, G. M., Lalouel, J.-M., Julier, C., and Ott, J. (1984). Strategies for multilocus linkage analysis in humans. Proc. Natl. Acad. Sci. USA 81:3443-3446.
Li, Z., and Rao, D. C. (1996). A random effect model for meta-analysis of multiple quantitative sibpair linkage studies. Genet. Epidemiol. 13:377-383.
Lucek, P. R., and Ott, J. (1997). Neural network analysis of complex traits. Genet. Epidemiol. 14:1101-1106.
Morton, N. E. (1955). Sequential tests for the detection of linkage. Am. J. Hum. Genet. 7:277-328.
Morton, N. E. (1998). Significance levels in complex inheritance. Am. J. Hum. Genet. 62:690-697.
Neuman, R. J., Todd, R. D., Heath, A. C., et al. (1999). Evaluation of ADHD typology in three contrasting samples: A latent class approach. J. Am. Acad. Child Adolesc. Psychiatry 38:25-33.
Olkin, I. (1995). Statistical and theoretical considerations in meta-analysis. J. Clin. Epidemiol. 48:133-146.
Olson, J. M., Rao, S., Jacobs, K., and Elston, R. C. (1999). Linkage of chromosome 1 markers to alcoholism-related phenotype by sibpair linkage analysis of principal components. Genet. Epidemiol. 17 Suppl. 1:S271-276.
Ott, J. (1974). Estimation of the recombination fraction in human pedigrees: Efficient computation of the likelihood for human linkage studies. Am. J. Hum. Genet. 26:588-597.
Rao, D. C. (1998). CAT scans, PET scans, and genomic scans. Genet. Epidemiol. 15:1-18.
Rice, J. P. (1998). The role of meta-analysis in linkage studies of complex traits. Am. J. Med. Genet. 74:112-114.
Risch, N. (1990a). Linkage strategies for genetically complex traits. I. Multilocus models. Am. J. Hum. Genet. 46:222-228.
Risch, N. (1990b). Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am. J. Hum. Genet. 46:229-241.
Risch, N. (1991). A note on multiple testing procedures in linkage analysis. Am. J. Hum. Genet. 48:1058-1064.
Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273:1516-1517.
Risch, N., and Zhang, H. (1995). Extreme discordant sibpairs for mapping quantitative trait loci in humans. Science 268:1584-1589.
Rybicki, B. A., and Elston, R. C. (2000). The relationship between the sibling recurrence-risk ratio and genotype relative risk. Am. J. Hum. Genet. 66:593-604.
Saccone, N. L., Downey, T. J., Jr., Meyer, D. L., Neuman, R. J., and Rice, J. P. (1999). Mapping genotype to phenotype for linkage analysis. Genet. Epidemiol. 17(suppl. 1):S703-S708.
Shannon, W. A., Province, M. A., and Rao, D. C. (2000). A CART method for subdividing linkage data into homogeneous subsets. Genet. Epidemiol. (submitted).
Spielman, R. S., and Ewens, W. J. (1998). A sibship test for linkage in the presence of association: The sib transmission/disequilibrium test. Am. J. Hum. Genet. 62:450-458.
Spielman, R. S., McGinnis, R. E., and Ewens, W. J. (1993). Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52:506-516.
Suarez, B. K., Hampe, C. L., and Van Eerdewegh, P. (1994). Problems of replicating linkage claims in psychiatry. In "Genetic Approaches to Mental Disorders" (E. S. Gershon and C. R. Cloninger, eds.), pp. 23-46. American Psychiatric Press, Washington, DC.
Thomson, G. (1994). Identifying complex disease genes: Progress and paradigms. Nat. Genet. 8:108-110.
Todorov, A. A., Province, M. A., Borecki, I. B., and Rao, D. C. (1997). Trade-off between sibship size and sampling scheme for detecting quantitative trait loci. Hum. Hered. 47:1-5.
Todorov, A. A., Vogler, G. P., Gu, C., Province, M. A., Li, Z., Heath, A. C., and Rao, D. C. (1998). Testing causal hypotheses in multivariate linkage analysis of quantitative traits: General formulation and application to sibpair data. Genet. Epidemiol. 15:263-278.
Towne, B., Siervogel, R. M., and Blangero, J. (1997). Effects of genotype-by-sex interaction on quantitative trait linkage analysis. Genet. Epidemiol. 14:1053-1058.
Turner, S. T., Boerwinkle, E., and Sing, C. F. (1999). Context-dependent associations of the ACE I/D polymorphism with blood pressure. Hypertension 34:773-778.
Familial Resemblance and Heritability

Treva K. Rice¹
Division of Biostatistics
Washington University School of Medicine
St. Louis, Missouri 63110
Ingrid B. Borecki
Division of Biostatistics and Department of Genetics
Washington University School of Medicine
St. Louis, Missouri 63110
I. Summary
II. Introduction
III. Familial Resemblance and Heritability
IV. Study Designs and Multifactorial Models
V. Discussion
References
I. SUMMARY

Familial resemblance, which arises when members within families are more similar than are unrelated pairs of individuals, may be estimated in terms of correlations (or covariances) among family members. The magnitudes of such correlations generally reflect both the extent of environmental sharing and the degree of biological relationship between the relatives. Heritability, or more appropriately multifactorial heritability or generalized heritability, quantifies the strength of the familial resemblance and represents the percentage of variance in
¹To whom correspondence should be addressed.

Advances in Genetics, Vol. 42
Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.
0065-2660/01 $35.00
a trait that is due to all additive familial effects including additive genetic effects and those of the familial environment. However, the traditional concept of heritability, which may be more appropriately called the genetic heritability, represents only the percentage of phenotypic variance due to additive genetic effects. Resolving the sources of familial resemblance entails other issues. For example, there may be major gene effects that may be largely or entirely nonadditive, temporal or developmental trends, and gene-gene (epistasis) and gene-environment interactions. The design of a family study determines which of these sources are resolvable. For example, in intact nuclear families consisting of parents and offspring, the genetic and familial environmental effects are not resolvable because these relatives share both genes and environments. However, extended pedigrees and twin and adoption study designs allow separation of the heritable effects and, possibly, more complex etiologies, including interactions. Various factors affect the estimation and interpretability of heritability, for example, assumptions regarding linearity and additivity of the effects, assortative mating, and the underlying distribution of the data. Nonnormality of the data can lead to errors in hypothesis testing, although it yields reasonably unbiased estimates. Fortunately, these and other complications can be directly modeled in many of the sophisticated software packages available today in genetic epidemiology.
II. INTRODUCTION

One of the earliest statistical concepts, correlation, was conceived and advanced by Pearson and Galton almost simultaneously with the idea of the quantification of genetic resemblance in close relatives. These two ideas fed off, and complemented, one another even before the term "heritability" was coined, as Fisher recounted in his seminal 1925 work:

One of the earliest and most striking successes of the method of correlation was in the biometrical study of inheritance. At a time when nothing was known of the mechanism of inheritance, or of the structure of the germinal material, it was possible by this method to demonstrate the existence of inheritance, and to "measure its intensity;" and this in an organism in which experimental breeding could not be practiced, namely, Man. (Fisher, 1925, p. 175)

Although the fundamental concepts of familial resemblance and heritability (which are reviewed in this chapter) were created for an earlier era of genetics when it was not possible to directly measure genes in humans, these ideas
remain at the cornerstone of genetic epidemiology even today. Indeed, some of the most recent and powerful developments in linkage and linkage disequilibrium analyses (in the form of variance components models) rely heavily on the estimation and partitioning of heritable components due to different measured sources, such as a linked genetic marker or a measured candidate gene. Even approaches that do not rely on the estimation of heritability per se often find it useful to "translate" effect sizes onto the heritability scale because it is such a convenient concept. Here we review the methods for estimating the magnitudes of unmeasured and measured effects and for testing hypotheses. We also review some of the relevant study designs and the corresponding statistical methods, factors affecting the estimation, and extensions developed to model more complex etiologic factors. This provides a foundation for the extensions that incorporate linkage and association developed in later chapters.
III. FAMILIAL RESEMBLANCE AND HERITABILITY

A. Familial resemblance

Familial resemblance arises when relatives who share genes and/or environmental factors exhibit greater phenotypic similarity than do pairs of unrelated individuals. The extent of the familial resemblance can be measured by familial correlations (e.g., sibling, parent-offspring, and spouse). In general, biological relatives such as sibs have both genes and familial environments in common. Thus, familial resemblance can be a function of shared genes, shared environments, or both. In contrast, under the assumption that there is no inbreeding or assortative mating, spouse pairs have no genes in common, but they do share common environments. Therefore, the magnitude of the spouse correlation provides an indication of the importance of the familial environment. The accuracy and reliability with which a trait is measured can influence these correlations. If the measurement is not reliable or accurate (i.e., high measurement error), then familial resemblance may be underestimated. On the other hand, certain types of measurement practices can lead to a false assertion of familial resemblance. For example, interrater differences or day-to-day variability in measurement can inflate familial resemblance if entire families are measured by the same rater and/or on the same day.
B. Heritability

The traditional concept of heritability, which in retrospect may be called the genetic heritability, was developed in the context of quantitative genetics to index
the relative contribution of genetic factors to trait variability. It was conceived in reference to a polygenic model, that is, one in which a large number of genes, each with a small, linear, and additive effect, influence the phenotypic variability. Under a pure polygenic model, the phenotype (P) is a function of genetic (G) and environmental (E) effects (i.e., P = G + E), usually expressed in terms of variance components (VP = VG + VE). The broad-sense genetic heritability is defined as the proportion of the total phenotypic variance due to genetic effects (h² = VG/VP). This genetic variance can be further decomposed into additive effects and dominance deviations (VG = VA + VD), which gives rise to the narrow-sense genetic heritability, defined as the proportion of total phenotypic variance due to strictly additive genetic effects (h² = VA/VP). Likewise, the environmental component can be decomposed into common or familial (C) and random nonfamilial (R) environmental effects (i.e., VP = VA + VD + VC + VR). Thus, analogous to genetic heritability, the cultural heritability or familial environmental variance component may be defined as c² = VC/VP, which is assumed to be due to a large number of linear and additive familial environmental effects. Heritability also can be used to quantify the strength of the genetic component underlying qualitative disease traits. In this case, one assumes an unobserved liability where individuals are affected when their liability crosses a particular threshold. The threshold is defined as the point at which the area under the tail equals the population prevalence. Under this formulation, replacing the phenotypic variability with that of the unobserved liability will render the definition of heritability valid.
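As a numerical illustration of this variance decomposition (all variance values below are invented):

```python
# Hypothetical variance components for a trait (arbitrary units).
VA, VD, VC, VR = 30.0, 5.0, 15.0, 50.0

VP = VA + VD + VC + VR   # total phenotypic variance: VP = VA + VD + VC + VR
VG = VA + VD             # total genetic variance

broad_h2 = VG / VP       # broad-sense genetic heritability, VG/VP
narrow_h2 = VA / VP      # narrow-sense genetic heritability, VA/VP
c2 = VC / VP             # cultural (familial environmental) heritability, VC/VP

print(broad_h2, narrow_h2, c2)  # 0.35 0.3 0.15
```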
IV. STUDY DESIGNS AND MULTIFACTORIAL MODELS

The specificity of the model in terms of which parameters can be estimated depends on the study design or the types of relatives included (e.g., twin, nuclear family, pedigree, and adoptive). Under some designs, the familial components (G and C) are not resolvable and are estimated as a single heritable component that represents all additive effects that are transmitted from parents to offspring. The assumptions underlying these study designs are also critical and are reviewed here.
A. Nuclear families Nuclear families consisting of parents and their biological offspring are commonly used for investigating familial aggregation. It is often assumed that there is no inbreeding or assortative mating, that the genetic variation is additive with no dominance or epistasis (gene-by-gene interaction), and that there is no
genotype-by-environment interaction or correlation, although each of these effects can be modeled and estimated provided that informative data are available. In the absence of additional information, nuclear family data cannot resolve familial resemblance into genetic vs cultural heritabilities, since family members share both genes and familial environments. Thus, because heritability estimates from nuclear families are confounded with familial environmental effects, they measure the maximal effect of genes. The simplest estimator of the maximal heritability (or generalized heritability) under this model is given by twice the average correlation for first-degree relatives (parent-offspring and sibling), because such relatives have (on the average) half of their genes in common. Spouse resemblance, if not accounted for, can lead to bias in both genetic and cultural heritabilities (McGue et al., 1989). However, in the absence of additional information such as from twins (Eaves et al., 1989) or multiple measurements on spouses (Heath and Eaves, 1985), it is difficult to distinguish whether the marital resemblance is due to phenotypic assortment or to correlated antecedents. In any case, heritability (or generalized heritability or multifactorial heritability) may be estimated by using a heuristic equation that automatically adjusts the estimate for the spouse correlation (Rice et al., 1997):

h² = (r_parent-offspring + r_sibling) / (1 + r_spouse),

where r_parent-offspring is the average parent-offspring correlation, r_sibling is the average sibling correlation, and r_spouse is the spouse correlation.
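Numerically, and assuming the Rice et al. (1997) heuristic takes the form of twice the average first-degree correlation divided by one plus the spouse correlation (the correlation values below are invented):

```python
def maximal_heritability(r_po, r_sib, r_spouse):
    """Maximal (generalized) heritability from the average
    parent-offspring and sibling correlations, adjusted for spouse
    resemblance. With r_spouse = 0 this reduces to twice the average
    first-degree correlation."""
    return (r_po + r_sib) / (1.0 + r_spouse)

# Invented correlations: r_po = 0.25, r_sib = 0.30, r_spouse = 0.10.
print(round(maximal_heritability(0.25, 0.30, 0.10), 3))  # 0.5
```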
1. Estimation of familial correlations and hypothesis testing

Various software packages are readily available for estimating the familial correlations using maximum likelihood procedures. Maximum likelihood methods are based on the assumption that the phenotypes of all members within a family jointly follow a multivariate normal distribution (Hopper and Mathews, 1982). Hypotheses can be tested by using the likelihood ratio test (LRT), which is −2 times the difference between the log-likelihoods of a reduced model and a general model. The LRT is asymptotically distributed as a chi-square, with the degrees of freedom given by the difference in the number of parameters estimated in the two models. In the simple nuclear family design, there are four types of individuals (fathers = f, mothers = m, sons = s, and daughters = d), leading to eight correlations: four parent-offspring (fs, fd, ms, md), three sibling (ss, dd, sd), and one spouse (fm). The most relevant hypothesis to test is whether the correlations are significantly different from zero. This hypothesis is tested by comparing the
log-likelihood of a model in which the correlations are fixed to zero with that of the general model in which all eight correlations are estimated. Similarly, hypotheses regarding the influence of sex may be tested by equating certain correlations (e.g., fs = fd = ms = md, ss = dd = sd) and comparing the result to the case in which the correlations are not equated.
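A toy numerical version of this comparison (the log-likelihood values are invented; 15.51 is the 0.05 critical value of a chi-square with 8 degrees of freedom):

```python
# Maximized log-likelihoods (hypothetical) for the two nested models.
loglik_general = -512.4   # all eight familial correlations estimated
loglik_reduced = -530.9   # all eight correlations fixed to zero

lrt = -2.0 * (loglik_reduced - loglik_general)  # likelihood ratio statistic
df = 8                                          # parameters fixed under the null
critical_value = 15.51                          # chi-square(8), alpha = 0.05

print(round(lrt, 1), lrt > critical_value)  # 37.0 True
```

Here the null hypothesis of no familial correlation would be rejected; a package such as one of those mentioned above would report the exact p-value from the chi-square distribution.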
B. Extended pedigrees

Genetic heritability can be accurately estimated in extended pedigrees, with their richness of relative pairs of different degrees. This data structure permits estimation of the genetic heritability without contamination by shared environmental influences for at least two reasons. First, extended family members are unlikely to share environmental influences to any great extent and, second, with a reasonable variety of relationships of differing degrees, there is a precise structuring of the expected covariances under a polygenic model. While analysis of extended pedigrees does involve increased computational demands, the advent of faster computers alleviates this constraint.
C. Twins

Twins provide one of the simplest designs for resolving genetic and cultural inheritance (Eaves, 1977). Monozygotic (MZ) twin pairs have 100% of their genes in common, while dizygotic (DZ) pairs are only as similar as full siblings, having (on average) 50% of their genes in common. In addition to the assumptions outlined for nuclear families, the common environment is assumed to be the same for both types of twins. The most common estimates of the genetic and cultural heritabilities are h² = 2(rMZ − rDZ) and c² = 2rDZ − rMZ, where rMZ and rDZ are the twin correlations. If the common environmental effect is greater for MZ than for DZ twins, as some would argue, h² would be overestimated and c² would be underestimated.
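Plugging invented twin correlations into these estimators:

```python
# Hypothetical twin correlations.
r_mz, r_dz = 0.60, 0.35

h2 = 2 * (r_mz - r_dz)   # genetic heritability: 2(rMZ - rDZ)
c2 = 2 * r_dz - r_mz     # cultural heritability: 2rDZ - rMZ
e2 = 1 - r_mz            # remainder: nonshared environment (and error)

print(round(h2, 2), round(c2, 2), round(e2, 2))  # 0.5 0.1 0.4
```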
D. Adoptions

A powerful design for assessing the proportion of variance due to genetic and cultural sources is the full adoption study. It is assumed that the resemblance between an adopted child and the biological parent is due only to genetic effects, while that between the adopted child and the adoptive parent is only familial environmental in origin. The full adoption design is rarely used, however, since it is difficult to obtain both biological parents of the adoptees. A noteworthy example is the Colorado Adoption Project (Plomin et al., 1990). Important assumptions in adoption studies are that the adoption occurs immediately after birth (which justifies attributing the resemblance between adopted child and biological
parents is entirely due to genetic effects), that the adoptive families are representative, and that there is no selective placement.
E. Multifactorial modeling extensions

Model extensions have been developed for complex etiologies and to take advantage of additional phenotypic information. For example, changes in familial resemblance may arise through temporal and/or developmental trends in the genetic and cultural heritabilities (Province and Rao, 1985). A variety of mechanisms can cause temporal variations. For example, there may be more than one set of genes acting independently over time, or genes having different effects at various points in development, or a variable lag time between gene action and observed product, or even environmental triggers of gene action. Models that take these complex etiologies into account, often using repeated measurements (e.g., Boomsma and Molenaar, 1987), have been developed for family data (e.g., Hopper and Mathews, 1982; Province and Rao, 1989), twin data (e.g., Eaves et al., 1986), and adoption data (e.g., Phillips and Fulker, 1989). The familial models also have been extended to address multiple correlated phenotypes (e.g., Lange and Boehnke, 1983; Vogler et al., 1987; Blangero and Konigsberg, 1991). The covariation among traits can result from common genetic effects (pleiotropy), from common (familial) or nonfamilial environmental effects, or even from the direct phenotypic influence of one trait on another. These hypotheses may be explored with a simple analysis of a single phenotype by contrasting the effects of different covariate adjustments. A change in the magnitude of the familial resemblance of trait X before and after adjusting for trait Y can provide indirect evidence of pleiotropy. Another important concern when modeling complex inheritance arises on account of gene-environment interactions (Ottman, 1990). The phenotype corresponding to a particular genotype may depend, in part, on exposure to certain environmental factors.
There are numerous examples of gene-environment interaction models (e.g., Martin and Eaves, 1977; Cavalli-Sforza and Feldman, 1978; Blangero et al., 1990). In all cases, the data and family structure types dictate the complexity of the model needed.
F. Factors affecting heritability estimation

Violations of model assumptions can potentially affect any estimation procedure. For example, linearity and additivity are fundamental assumptions of most polygenic or multifactorial models, and complex traits are likely to be affected by factors that act nonadditively. Assumptions regarding assortative mating have been discussed. Additional assumptions about the underlying distributional properties of the data can be critical. For example, Rao et al. (1987) have shown that
Rice and Borecki
even relatively large departures from multivariate normality can yield reasonably unbiased parameter estimates, while errors in hypothesis testing can occur even with moderate departures from normality. In many situations (for example, when nonnormality is due to group differences such as male vs. female), such deviations from normality may be controlled with suitable data transformations or by specifically modeling the group effect. While data transformations will not guarantee multivariate normality, they will tend to minimize the impact. Various factors can lead to nonnormally distributed data, and if the source is recognized, these factors can be specifically modeled. For example, the presence of a major gene leads to nonnormality. Either admixture analysis or segregation analysis is an appropriate method for investigating this possibility (e.g., Lalouel and Morton, 1981; Blangero and Konigsberg, 1991). Nonnormality also may be induced via gene-environment interaction (Pooni and Jinks, 1976). In fact, nonnormality due to group differences (e.g., sex) can be modeled by considering group as an “environmental” condition that interacts with the genotype (Konigsberg et al., 1991). Finally, a variety of methods have been developed to correct for nonnormality due to the effects of ascertainment, whether the selection conditions are well defined or ambiguous (e.g., Hanis and Chakraborty, 1984; Boehnke and Lange, 1984; Rao et al., 1988).
V. DISCUSSION

In 1986, Newton Morton wrote: “Genetic epidemiology is now assimilating the rapid advances in molecular biology that promise a complete linkage map and molecular definition of disease loci. . . . When they are complete our morning will have passed into the afternoon of molecular epidemiology.” Our morning of genetic epidemiology has revealed that complex traits generally involve additive and nonadditive interactions among several genes and environmental factors, none of which may entail large effects. Are the methods discussed here likely to be useful in the new millennium? We believe that they will be useful for at least two reasons. First, a preliminary understanding of the magnitude of heritability of a complex trait is necessary and useful before a massive and expensive genome analysis involving large-scale genotyping is undertaken. Often, estimates of heritability or even more specific model parameters are needed for power computations. Second, a preliminary understanding of the pattern of familial correlations and heritability provides a basis for undertaking more appropriate modeling of the myriad of etiological effects. By assessing familial resemblance and heritability both before and after accounting for the effects of measured genotypes, we can seek evidence for the involvement of any additional familial effects. Eventually, the molecular variants
4. Familial Resemblance and Heritability
responsible for the disease or the complex risk factor will be found, which should lead us into our afternoon of molecular epidemiology.
Acknowledgments

This work was supported in part by a grant from the National Institute of General Medical Sciences (GM-28719), a grant from the National Heart, Lung, and Blood Institute (HL47317), and the Medical Research Council of Canada (PG-1181 and MT-13960).
References

Boehnke, M., and Lange, K. (1984). Ascertainment and goodness of fit of variance components for pedigree data. In “Genetic Epidemiology of Coronary Heart Disease: Past, Present and Future” (D. C. Rao, R. C. Elston, L. H. Kuller, M. Feinleib, C. Carter, and R. Havlik, eds.), pp. 173-192. Liss, New York.
Boomsma, D. I., and Molenaar, P. C. M. (1987). The genetic analysis of repeated measures. I. Simplex models. Behav. Genet. 17, 111-123.
Blangero, J., and Konigsberg, L. W. (1991). Multivariate segregation analysis using the mixed model. Genet. Epidemiol. 8, 299-316.
Blangero, J., MacCluer, J. W., Kammerer, C. M., Mott, G. E., Dyer, T. D., and McGill, H. C., Jr. (1990). Genetic analysis of apolipoprotein A-I in two dietary environments. Am. J. Hum. Genet. 47, 414-428.
Cavalli-Sforza, L. L., and Feldman, M. W. (1978). The evolution of continuous variation. III. Joint transmission of genotype, phenotype and environment. Genetics 90, 391-425.
Eaves, L. J. (1977). Inferring the causes of human variation. J. R. Stat. Soc. Ser. A 140, 324-355.
Eaves, L. J., Long, J., and Heath, A. C. (1986). A theory of developmental change in quantitative phenotypes applied to cognitive development. Behav. Genet. 16, 143-162.
Eaves, L. J., Fulker, D. W., and Heath, A. C. (1989). The effects of social homogamy and cultural inheritance on the covariances of twins and their parents: A LISREL model. Behav. Genet. 19, 113-122.
Fisher, R. A. (1925). “Statistical Methods for Research Workers.” Hafner Publishing, New York.
Hanis, C. L., and Chakraborty, R. (1984). Nonrandom sampling in human genetics: Familial correlations. IMA J. Math. Appl. Med. Biol. 1, 193-213.
Heath, A. C., and Eaves, L. J. (1985). Resolving the effects of phenotype and social background on mate selection. Behav. Genet. 15, 15-30.
Hopper, J. L., and Mathews, J. D. (1982). Extensions to multivariate normal models for pedigree analysis. Ann. Hum. Genet. 46, 373-383.
Konigsberg, L. W., Blangero, J., Kammerer, C. M., and Mott, G. E. (1991). Mixed model segregation analysis of LDL-C concentration with genotype-covariate interaction. Genet. Epidemiol. 8, 69-80.
Lalouel, J. M., and Morton, N. E. (1981). Complex segregation analysis with pointers. Hum. Hered. 31, 312-321.
Lange, K., and Boehnke, M. (1983). Extensions to pedigree analysis. IV. Covariance components models for multivariate traits. Am. J. Med. Genet. 14, 513-524.
Martin, N. G., and Eaves, L. J. (1977). The genetical analysis of covariance structure. Heredity 38, 79-95.
McGue, M., Wette, R., and Rao, D. C. (1989). Path analysis under generalized marital resemblance: Evaluation of the assumptions underlying the mixed homogamy model by the Monte Carlo method. Genet. Epidemiol. 6, 373-388.
Morton, N. E. (1986). Foundations of genetic epidemiology. J. Genet. 65, 205-212.
Ottman, R. (1990). An epidemiologic approach to gene-environment interaction. Genet. Epidemiol. 7, 177-185.
Phillips, K., and Fulker, D. W. (1989). Quantitative genetic analysis of longitudinal trends in adoption designs with application to IQ in the Colorado Adoption Project. Behav. Genet. 19, 621-658.
Plomin, R., DeFries, J. C., and McClearn, G. E. (1990). “Behavioral Genetics: A Primer,” 2nd ed. Freeman, New York.
Pooni, H. S., and Jinks, J. L. (1976). The efficiency and optimal size of triple test cross design for detecting epistatic variation. Heredity 36, 315-327.
Province, M. A., and Rao, D. C. (1985). A new model for the resolution of cultural and biological inheritance in the presence of temporal trends: Application to systolic blood pressure. Genet. Epidemiol. 2, 363-374.
Province, M. A., and Rao, D. C. (1989). Path analysis of family resemblance with temporal trends: Applications to height, weight and Quetelet index in Northeastern Brazil. Am. J. Hum. Genet. 37, 178-192.
Rao, D. C., Vogler, G. P., Borecki, I. B., Province, M. A., and Russell, J. M. (1987). Robustness of path analysis of family resemblance against deviations from multivariate normality. Hum. Hered. 37, 107-112.
Rao, D. C., Wette, R., and Ewens, W. J. (1988). Multifactorial analysis of family data ascertained through truncation: A comparative evaluation of two methods of statistical inference. Am. J. Hum. Genet. 42, 506-515.
Rice, T., Després, J. P., Daw, E. W., Gagnon, J., Borecki, I. B., Pérusse, L., Leon, A. S., Skinner, J. S., Wilmore, J. H., Rao, D. C., and Bouchard, C. (1997). Familial resemblance for abdominal visceral fat: The HERITAGE Family Study. Int. J. Obes. 21, 1024-1031.
Vogler, G. P., Rao, D. C., Laskarzewski, P. M., Glueck, C. J., and Russell, J. M. (1987). Multivariate analysis of lipoprotein cholesterol fractions. Am. J. Epidemiol. 125, 706-719.
Linkage and Association: Basic Concepts

Ingrid B. Borecki¹
Division of Biostatistics and Department of Genetics
Washington University School of Medicine
St. Louis, Missouri 63110

Brian K. Suarez
Departments of Psychiatry and Genetics
Washington University School of Medicine
St. Louis, Missouri 63110
I. Summary
II. Introduction
III. Historical Perspective
IV. Fundamentals and Methods
V. Contemporary Approaches
VI. Challenges and Issues
References
I. SUMMARY

Many investigators are turning their efforts to dissecting the etiology of complex traits. The primary tools for gene discovery, localization, and functional analysis are linkage and association studies. While the conceptual underpinnings of these approaches have long been known, advances in recent decades in molecular genetics, in the development of efficient computational algorithms, and in computing power have enabled the large-scale application of these methods.
¹To whom correspondence should be addressed.
Here, we review the biological basis of linkage and association among loci and the common methods used to assess these relationships with respect to observed phenotypes. We further consider the two most common approaches, genome scans and candidate gene studies, especially their respective strengths, weaknesses, and resource requirements. Finally, we highlight some of the major challenges that arise from these investigative approaches and those that are inherent in the nature of complex traits. The chapters that follow elaborate on many of these topics.
II. INTRODUCTION

With the advent of appropriate tools and numerous successes in mapping genes for Mendelian disorders such as Duchenne muscular dystrophy, cystic fibrosis, and Huntington’s disease, investigators are enthusiastically turning their interest and efforts to mapping genes influencing complex traits, so called because of the anticipated complexity of their underlying etiology. Complex traits (i.e., diseases and their risk factors) are assumed to be determined by several genes, environmental exposures, and interactions among these factors. Many complex traits, such as coronary heart disease and hypertension, are common in the population and, therefore, account for a significant proportion of the public health burden. Although the specific methodological details may differ for the analysis of complex traits compared to Mendelian traits, they share a common fundamental strategy: demonstration that a particular locus is associated with the trait implicates that locus in the trait etiology. The two main approaches include linkage and association studies. While linkage studies seek to identify loci that cosegregate with the trait within families, association studies seek to identify particular variants that are associated with the phenotype at the population level. These are complementary methods that, together, provide the means to probe the genome and describe the genetic etiology of complex human traits, potentially elucidating the mechanisms leading to some of the most important contemporary health problems. Linkage studies continue to play an important role in gene discovery and localization, and methods have been developed to exploit the genetic information present in kindreds ranging from sibling pairs to extended pedigrees. Some approaches strongly model the genetic effects (the so-called model-based approaches), while others make few assumptions regarding the etiologic model underlying the phenotypic distribution (the so-called model-free approaches).
In either case, there is growing interest in developing models that begin to capture the etiologic complexity by allowing for oligogenic inheritance, concomitants, epistasis, and gene-environment interaction. While linkage tests are powerful and specific for gene discovery, localization of the locus can be achieved only to a certain level of precision (on the order of megabases), which represents a region that potentially can include hundreds of genes. Additionally, genes with small or subtle effects may not be detectable by linkage at all, regardless of study design. Recently, there has been increased interest in the potential of association studies for gene discovery. The appeal of this approach was fueled by Risch and Merikangas (1996), who suggested that under certain circumstances, association tests are more powerful than linkage tests and may be capable of detecting loci with smaller effects. Moreover, current fine-mapping strategies for complex trait loci rely on associations between phenotype and marker variants that arise from linkage disequilibrium and, finally, putative functional variants are verified by association with phenotypic variability. However, because of the central role of linkage disequilibrium in this paradigm and the variable nature of disequilibrium in the human genome, the ultimate utility of association tests vis-à-vis these different goals remains to be seen. These complementary approaches comprise the primary tools for contemporary genetic epidemiological analysis of complex traits. In this chapter, we present a historical perspective on the development of these approaches and review the basic concepts and issues surrounding linkage and association studies.
III. HISTORICAL PERSPECTIVE

One hundred and thirty-six years ago a Bohemian monk announced the results of his breeding experiments with the common garden pea (Pisum sativum). Gregor Mendel’s manuscript (1866) was published the following year and remained virtually unnoticed and unappreciated until it was “rediscovered” 34 years later by three botanists whose own experiments led them to the same conclusions Mendel had reached more than a generation earlier. Mendel’s experiments revealed two principles that have been elevated to the status of “laws.” Mendel’s first principle dealt with the segregation of alleles at a single locus, and his second postulate, the law of independent random assortment, dealt with the joint behavior of alleles at two loci. There is a peculiar irony in the rediscovery of Mendel’s work. Although other theories of “particulate” inheritance (e.g., Darwin, 1868; Galton, 1889) had been developed during the period that Mendel’s insights lay dormant, none presented sufficient experimental evidence to be persuasive. Yet with the rediscovery of Mendel’s work, bolstered by the experimental evidence independently produced by Correns, de Vries, and Tschermak, the principle of independent random assortment rapidly became so strongly fixed that there was a reluctance to give it up in the face of rapidly accumulating evidence to the contrary (Oliver, 1967). Genetic linkage, of course, is the exception that “proves” Mendel’s second rule. Accordingly, although Bateson and Punnett in a series of papers published between 1905 and 1908 clearly described many exceptions to the independence principle [and despite the well-known work of Sutton (1903) and Boveri (1904) that demonstrated the parallelism between chromosome behavior and segregation], neither Bateson nor Punnett was able to adequately explain the concept of cosegregation. This task fell to Morgan (1911), who had the good fortune of studying sex-linked traits (Morgan called them sex-limited) in Drosophila melanogaster. Morgan proposed that genes are linked as a consequence of lying close together on a chromosome. He also recognized that the formation of chiasmata during synapsis, a phenomenon described two years earlier by Janssens (1909), provided an explanation for the occurrence of recombinants. Sturtevant realized that if Morgan’s explanation was correct, recombination could be exploited to construct a linear gene map that would reflect a gene’s position, relative to other genes, along a chromosome. Within two years Sturtevant (1913) had produced the first such map for Drosophila. With the publication of Sturtevant’s seminal paper, all the essentials were in place for what, today, we label “linkage analysis.” For the next four decades, linkage studies in species such as Drosophila made slow but steady progress. By comparison, linkage studies in our species languished primarily because of the lack of easily assayable markers and the inability (and inappropriateness) of conducting experimental crosses.
Indeed, by the time Sturtevant had produced the first linkage map for the common fruit fly, only one useful human marker was known, the ABO blood groups (Landsteiner, 1900), and the details of its inheritance would not be fully understood for another 11 years, when Bernstein (1924) eloquently demonstrated the system’s three-allele structure. It can be argued that the spectacular accomplishments in human linkage studies of the last quarter-century can be attributed more to advances in molecular genetics (which allow the rapid and reproducible genotyping of thousands of markers), the development of efficient computational algorithms, and the unprecedented progress in computer science than to the development of new theoretical or statistical methods, which, for the most part, antedate the discovery of DNA’s structure. It should be noted that one of the most influential pioneers in this field, whose statistical contributions and insights into the challenges of linkage analysis in humans have launched the vigorous developments we have seen, is Newton Morton.
IV. FUNDAMENTALS AND METHODS
A. Linkage analysis

Linkage refers to the physical proximity of loci along a chromosome. Two loci are linked because of their physical connection by a stretch of DNA and are sufficiently close together that their alleles tend to cosegregate within families, a departure from Mendel’s second law of independent assortment. Such cosegregating haplotypes are broken up by the process of recombination. The probability of a recombination between two loci becomes small as the distance between the loci decreases and, conversely, recombination occurs more frequently between loci that are farther apart. Thus, recombination is a function of the distance between the two loci, although it is not a simple linear relationship. The parameter θ is defined as the frequency of an odd number of recombinations between two loci; gametes resulting from an even number of recombinations cannot be recognized as recombinants because they look like the parental types and are, therefore, incorrectly scored as nonrecombinants. For two loci far apart on the same chromosome or on two different chromosomes, θ = 1/2 and independent assortment obtains. That is, recombinant and nonrecombinant gametes occur with equal frequency. In linkage analysis, we seek to identify loci for which θ < 1/2. The expected frequency distribution of gametes under the alternative hypothesis of linkage and the null hypothesis of independent assortment from a completely informative parent is shown in Figure 5.1. As a class, recombinants occur with frequency θ, while nonrecombinants occur with frequency (1 − θ), the probability of not having a recombination. The simplest estimator of θ is the
Figure 5.1. The distribution of gametes from a fully informative, doubly heterozygous parent (A1B1/A2B2) under a linkage hypothesis and under the null.
50
Borecki and Suarez
sample proportion of gametes that are recombinants. However, since the number of offspring in human families is relatively small, it is necessary to find a way to combine information on the distribution of recombinants across families and, also, to consider how to statistically test for the presence of linkage. Morton (1955) proposed a method that has turned out to be one of his earliest and most enduring contributions to human genetics. He introduced the concept of the lod score (the logarithm of an odds ratio). The odds ratio is the probability of observing the specific genotypes in the family given linkage at a particular recombination fraction versus the same probability computed conditional on independent assortment. Thus, high values of the odds ratio would favor the linkage hypothesis, while values close to 1 provide evidence against linkage. The log (to the base 10) of the odds ratio is taken for convenience; it has more desirable scale properties and provides an easy way to combine information across families by simply summing the lod scores computed from each family. While Morton proposed this statistic for use in a sequential test, it has been more typically applied to fixed samples. Nonetheless, the original “landmarks” for significance suggested by Morton still have meaning to many investigators for the interpretation of their results. A lod score of 3 was taken as statistically significant evidence for linkage at the test recombination fraction; this means that the linkage hypothesis is 10³ times more likely than the hypothesis that the two loci are not linked. The significance level, α, is that which is associated with a likelihood ratio test computed to the base e: χ² = lod × (2 ln 10). For a lod of 3, α ≈ 0.0001. A lod score of 1.5 or greater is considered to be “suggestive” (α < 0.004) and, at the opposite end of the spectrum, a lod score of −2 or less is considered evidence against linkage (odds of 1:100 or less in favor of the linkage hypothesis).
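To make the computation concrete, here is a minimal Python sketch (an illustration, not from the chapter) of the phase-known lod score: with r recombinants observed among n informative meioses, lod(θ) = r·log₁₀θ + (n − r)·log₁₀(1 − θ) − n·log₁₀(1/2). The counts used below are hypothetical.

```python
import math

def lod(r, n, theta):
    """Phase-known lod score: r recombinants among n informative
    meioses, comparing recombination fraction theta to the null of 1/2."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie in (0, 1)")
    log_like_theta = r * math.log10(theta) + (n - r) * math.log10(1.0 - theta)
    log_like_null = n * math.log10(0.5)
    return log_like_theta - log_like_null

# Hypothetical data: 2 recombinants among 20 meioses (theta-hat = r/n = 0.1)
z = lod(2, 20, 0.1)
chi_sq = 2 * math.log(10) * z   # equivalent likelihood-ratio statistic
print(round(z, 2))              # 3.2 -- just past Morton's landmark of 3
```

Because the lod is a sum over independent meioses, scores from separate families at the same test θ are simply added, exactly as described above.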
Chapter 8 in this volume covers the lod score method in considerable detail. The lod score method proved to be an extremely powerful method for linkage analysis whose utility was galvanized when Elston and Stewart (1971) provided the algorithm by which the likelihood of more complex pedigrees could be computed and when Ott (1974) provided a user-friendly computer program to quickly compute lod scores in arbitrary pedigrees. In addition, iterative methods were implemented to obtain a maximum likelihood estimate of the recombination fraction. The key to the power of the model-based or “parametric” lod score method is the ability to distinguish recombinants from nonrecombinants; thus, this method has been ideal for developing the map of the human genome where, for molecular markers, the genotype and phenotype are equivalent. Lod score analysis also has been used to great advantage for mapping single-locus Mendelian disorders with accommodation for such complicating factors as dominance, reduced penetrance, phenocopies, and variable age at onset. Even genetic heterogeneity could be resolved, as demonstrated by Morton (1956) in the analysis of linkage between Rh and elliptocytosis. However, these complications reduce the power of a particular study because of the uncertainty in
inferring the underlying trait genotype from the phenotype. Further, recombination is confounded by reduced penetrance, and misspecification of the mode of inheritance and of the genetic model parameters can lead to further reductions in power. Complex phenotypes present novel challenges since, by definition, there is no simple mode of inheritance. Moreover, some traits of interest, such as cardiovascular disease, prostate cancer, and Alzheimer’s disease, have a late age at onset, and it is difficult or impossible to obtain multigenerational pedigrees without a prospective approach. These problems stimulated renewed interest in sibpair methods (Penrose, 1935) that rely on patterns of allele sharing to infer linkage, although with contemporary methods, this approach can be applied as well to more distant relative pairs and intact pedigrees (Weeks and Lange, 1988; Curtis and Sham, 1994; Kruglyak et al., 1996; Almasy and Blangero, 1998). The expected pattern of allele sharing at a locus in relative pairs of a particular class is given by the coefficient of consanguinity, which is defined as the probability that an allele pair drawn at random, one from each of the two individuals, will be identical by descent. For sibpairs, the distribution of alleles shared identically-by-descent (IBD) is shown in Table 5.1. In one-quarter of the pairs, sibs share none of their
Table 5.1. The Expected Distribution of Alleles Shared IBD in Sibpairs from Parents with the Completely Informative Mating Type AB × CD

  Sib 1    Sib 2    Alleles IBD
  AC       AC       2
  AC       AD       1
  AC       BC       1
  AC       BD       0
  AD       AC       1
  AD       AD       2
  AD       BC       0
  AD       BD       1
  BC       AC       1
  BC       AD       0
  BC       BC       2
  BC       BD       1
  BD       AC       0
  BD       AD       1
  BD       BC       1
  BD       BD       2

Totals: IBD = 0 in 4/16 of pairs (1/4); IBD = 1 in 8/16 (1/2); IBD = 2 in 4/16 (1/4).
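The 1/4 : 1/2 : 1/4 distribution in Table 5.1 can be verified by brute-force enumeration; this small Python sketch (an illustration, not part of the original text) enumerates all equally likely sibpairs from the AB × CD mating.

```python
from itertools import product
from collections import Counter

# Each offspring of the fully informative mating AB x CD inherits one
# paternal allele (A or B) and one maternal allele (C or D).
offspring = list(product("AB", "CD"))  # 4 equally likely genotypes

ibd_counts = Counter()
for sib1, sib2 in product(offspring, repeat=2):  # 16 equally likely pairs
    # Alleles are IBD when the sibs received the same parental copy.
    ibd = (sib1[0] == sib2[0]) + (sib1[1] == sib2[1])
    ibd_counts[ibd] += 1

probs = {k: v / 16 for k, v in sorted(ibd_counts.items())}
print(probs)  # {0: 0.25, 1: 0.5, 2: 0.25}
```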
52
Borecki and Suarez
alleles at a particular locus IBD, in half the pairs they share one allele IBD, and in the remaining quarter of the pairs they share two alleles IBD, which leads to the general expectation that the probability that an allele is IBD in sibs is 0.5. It should be noted that what is relevant for linkage analysis is the inheritance (or co-inheritance) of alleles at adjacent loci; therefore, it is of critical importance to determine whether the alleles are identical by descent (i.e., copies of the same parental allele) or only identical by state (i.e., appearing the same, but derived from two different copies of the allele). For example, let us consider a pair of sibs’ genotypes AB and AC. The A allele is identical by state, but without information on the parental mating type, it is uncertain whether the A is identical by descent. If the parents are AD × BC, then the A is IBD, coming from the paternal A. In contrast, if the parental mating type is AB × AC, then the A allele is identical by state only, each sibling inheriting a different copy of the A allele. In the absence of information on parental mating type, the probability that the allele in question is shared IBD can be computed by considering all possible mating types; rarer alleles that occur in each member of a sibpair are more likely IBD than common alleles. To reiterate, only IBD information is useful for linkage studies; however, identity-by-state information can be used for tests of association. Thus, since unrelated individuals share no genes identical by descent (when there is no inbreeding), such samples provide no information for linkage. The linkage test is based on the relationship between phenotypic similarity and π, the proportion of alleles shared IBD for sibpairs (or other pairs of relatives). For disease traits, sibling pairs that are both affected are ascertained for study, thereby obviating the problem of incomplete penetrance.
The idea is that sibpairs will tend to share a greater proportion of marker alleles IBD if the marker is linked to the disease locus. That is, the null hypothesis of no linkage is π = 1/2 while the alternative hypothesis is π > 1/2. Many statistical tests have been developed to evaluate this hypothesis (e.g., Suarez et al., 1978; Suarez and Van Eerdewegh, 1984; Blackwelder and Elston, 1985; Risch, 1990; Kruglyak et al., 1996). For quantitative traits, one can look at phenotypic similarity directly without ascertaining the sibships: simply, under the linkage hypothesis, sibs that share a greater proportion of alleles IBD will have more similar phenotypes while, under the null, the similarity in sibs’ phenotypes will have no relation to the degree of allele sharing. It should be noted, however, that ascertaining sibpairs can considerably improve the power for detection of quantitative trait loci (Risch and Zhang, 1995). While Haseman and Elston (1972) used the squared phenotypic difference in sibs, current more favorable methods look at the phenotypic covariance or correlation as a function of π (Amos, 1994; Almasy and Blangero, 1998; Province et al., 2000) using a variance components approach. These methods are referred to as “model-free” or “robust,” since there are no assumptions regarding the underlying genetic model although, certainly, there
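As a sketch of the simplest version of this idea (a "mean test" on affected sibpairs; the counts below are invented for illustration), one can compare the average IBD sharing in the sample to its null value of 1/2. For fully informative pairs, the per-pair sharing proportion π̂ takes the values 0, 1/2, or 1 with null probabilities 1/4, 1/2, 1/4, so its null variance is 1/8.

```python
import math

def mean_ibd_test(pi_hat_values):
    """One-sided test of H0: mean IBD proportion = 1/2 in affected
    sibpairs against H1: pi > 1/2. pi_hat_values holds the estimated
    proportion of alleles shared IBD for each pair (0, 0.5, or 1 for
    fully informative pairs)."""
    n = len(pi_hat_values)
    mean = sum(pi_hat_values) / n
    # Null variance of the per-pair sharing proportion is 1/8
    # (values 0, 0.5, 1 with probabilities 1/4, 1/2, 1/4).
    se = math.sqrt((1 / 8) / n)
    return (mean - 0.5) / se

# Hypothetical sample: 100 affected sibpairs sharing 60% of alleles IBD
pairs = [1.0] * 30 + [0.5] * 60 + [0.0] * 10
z = mean_ibd_test(pairs)
print(round(z, 2))  # 2.83 -- excess sharing over the null value of 1/2
```

A large positive z indicates excess sharing among affected pairs, the signature of linkage in this design.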
are the usual assumptions about the distribution of the test statistic. Such model-free methods are elaborated in several subsequent chapters, led by Chapter 11, an overview chapter by Elston and Cordell.
B. Association analysis

Association at the population level between a marker and the phenotype of interest can arise under either of two circumstances: (1) when the functional variant is measured directly, or (2) when the marker variant is in linkage disequilibrium with the functional variant. In the former case, genotypes can be characterized in a number of ways (e.g., electrophoretically, immunologically, or directly at the DNA sequence level), and such assessment gives rise to a “measured genotype,” a term coined by Boerwinkle et al. (1986). In this case, one can directly test the effect of the measured genotype on the phenotypic outcome. However, if anonymous rather than functional markers are used, then the test of association relies on linkage disequilibrium, which is a population concept. Let us consider two linked loci, A and B. The pairwise combinations of one allele from locus A with one allele from locus B constitute the gametic haplotypes. Let us assume that a new mutation at locus A, Ai, occurs on a chromosome that has the allele Bj. The gametes produced in the next generation are of two kinds, recombinants and nonrecombinants; if there is no recombination between the loci, then the original haplotype (AiBj) is preserved. Recombination breaks up this haplotype such that eventually the mutation Ai would occur on haplotypes involving the other alleles at the B locus in proportion to their relative frequency; this is linkage equilibrium. However, initially, the frequency of the haplotype AiBj is greater than the product of the two allele frequencies, which is linkage disequilibrium. In a randomly mating population, the approach to equilibrium is a function of the recombination fraction (θ); the departure from the equilibrium value is reduced by a fraction equal to θ in each generation. For unlinked loci, the haplotype frequency goes halfway to the equilibrium value each generation.
Thus, the tighter the linkage, the longer the disequilibrium (association) will persist in the population. It is useful to talk about the time required to go halfway to the final equilibrium value, similar to the concept of a half-life for radioactive decay. For unlinked loci, the median time is 1 generation; when θ = 0.10 it is about 7 generations, when θ = 0.01 it is 69 generations, and when θ = 0.001 it is about 693 generations. With these examples, it is clear that to have appreciable disequilibrium that may be useful for association studies, two loci must be very close; moreover, the prospects for detecting linkage disequilibrium are less for ancient mutations than for more recent mutations, and evolutionary forces such as drift, migration, admixture, and rapid population expansion also play an important role in shaping the pattern of linkage disequilibrium.
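These half-lives follow directly from the decay relation D_t = D_0(1 − θ)^t; solving (1 − θ)^t = 1/2 gives t = ln(1/2)/ln(1 − θ). A short Python check (an illustration added here) reproduces the figures quoted above:

```python
import math

def ld_half_life(theta):
    """Generations for linkage disequilibrium to decay to half its
    initial value: D_t = D_0 * (1 - theta)**t, so t solves
    (1 - theta)**t = 1/2."""
    return math.log(0.5) / math.log(1.0 - theta)

for theta in (0.5, 0.10, 0.01, 0.001):
    print(theta, round(ld_half_life(theta)))
# 0.5 -> 1 generation, 0.10 -> 7, 0.01 -> 69, 0.001 -> 693
```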
There are several standard methods of testing for association. In principle, the effect of measured genotypes can be assessed for any given phenotypic outcome, either qualitative (e.g., disease presence or absence) or quantitative. The traditional epidemiologic case-control study was among the first approaches utilized to determine whether a particular genetic variant is associated with increased risk of disease. Early on, Woolf (1955) proposed a relative risk statistic that could be used to assess genotype-dependent risk. However, a persistent concern regarding these studies is the adequacy of the matching of cases and controls. In particular, population stratification can produce false positive associations. In response to this concern, Falk and Rubinstein (1987) suggested a method for assessing relative risk that uses family-based controls, obviating this source of potential error. Basically, the method uses as the control sample the parental alleles or haplotypes not transmitted to affected offspring. Thus, in the fully informative mating shown in Figure 5.2, the “case” genotype is AC and the “control” genotype is BD. Many different test statistics have been proposed, including the haplotype relative risk (Falk and Rubinstein, 1987; Terwilliger and Ott, 1992), affected family-based controls (Schaid and Sommer, 1994), and the transmission disequilibrium test (TDT) introduced by Spielman et al. (1993). In particular, the TDT has gained wide popularity in recent years; this method also focuses on alleles transmitted to affected offspring, but it is formulated to take account of both the linkage and the disequilibrium that underlie the association. This test requires genotype information on trios of individuals, namely, affected children and both biological parents; and at least one parent must be heterozygous for the test to be informative. The proposed test statistic is actually a McNemar’s chi-square and tests the null hypothesis that the putative
Figure 5.2. Construction of family-based controls for association studies: “case,” transmitted alleles A and C; “control,” nontransmitted alleles B and D.
5. Linkage and Association: Basic Concepts
disease-associated allele is transmitted 50% of the time from heterozygous parents; under the alternative hypothesis, the disease-associated allele will be transmitted more often. Subsequently, Ewens and Spielman (1995) demonstrated that, indeed, the TDT is not affected by population stratification or admixture. Many investigators have contributed to extending the TDT to other relevant situations, including multiallelic marker loci (Sham and Curtis, 1995; Bickeboeller and Clerget-Darpoux, 1995; Rice et al., 1995) and cases in which only one parent may be available for study (Sun et al., 1999). Considerable attention also has been focused on situations in which parental data may be absent or impossible to obtain (such as for late-onset disorders), and such approaches involve the use of discordant sibpairs (Spielman and Ewens, 1998; Horvath and Laird, 1998; Boehnke and Langefeld, 1998). The discordant unaffected sibling provides information on the alleles not segregating to affected individuals. Development of association tests for quantitative traits has proceeded on a parallel track. Analysis of variance, classifying subjects by their measured genotype, can reveal effects of a test locus on quantitative variation. A classic demonstration of this approach was presented by Boerwinkle et al. (1987), who examined the effect on lipids and lipoproteins of apo E genotypes, comprising the isoelectric variants ε2, ε3, and ε4. In this statistical approach, it is straightforward to simultaneously control for the effects of covariates, thereby improving the power to detect associations. Analyses of these types are usually aimed at samples of unrelated individuals and treat the measured genotype simply as a fixed main effect using standard statistical methods.
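The TDT described above reduces to a McNemar chi-square on transmissions from heterozygous parents. A minimal sketch, using hypothetical transmission counts that are our own illustration (not data from this chapter):

```python
# A minimal sketch of the TDT computation, using hypothetical transmission
# counts (not data from this chapter). For a diallelic marker, tally
# transmissions from heterozygous parents to affected offspring:
# b = transmissions of the candidate allele, c = non-transmissions.
# Under the null hypothesis, each heterozygous parent transmits the
# candidate allele with probability 1/2, so E[b] = E[c] = (b + c) / 2.

def tdt_statistic(b: int, c: int) -> float:
    """McNemar chi-square with 1 df: (b - c)**2 / (b + c)."""
    if b + c == 0:
        raise ValueError("no informative (heterozygous-parent) transmissions")
    return (b - c) ** 2 / (b + c)

# Hypothetical example: 62 transmissions vs. 38 non-transmissions among
# 100 informative parental transmissions.
chi2 = tdt_statistic(62, 38)
print(chi2)  # 5.76, which exceeds the 1-df chi-square critical value of 3.84
```

Note that only heterozygous parents contribute: a homozygous parent transmits the same allele either way, which is why such parents are uninformative for the test.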
However, there are extensions to these variance components models that use the information from family data to simultaneously model residual familial resemblance (Boerwinkle et al., 1986; Almasy and Blangero, 1998; Province et al., 2000). One of the problems in assessing associations lies in determining how to accommodate analysis of multiallelic markers. A simple approach such as increasing the number of classes leads to an increase in the degrees of freedom, with a consequent loss of power and statistical problems with sparse cells. Another approach consists of choosing an index allele and collapsing all other alleles, reducing the locus to a diallelic situation. While this may be a valid approach when independent evidence implicates a particular allele as either functional or in strong disequilibrium with the functional variant, in general, it can lead to biases and statistical problems with multiple comparisons when testing is being done over all possible alleles. One approach has been to model the marker genotype as a random variable rather than a fixed effect (Almasy and Blangero, 1998). An alternative likelihood-based approach for tests of disequilibrium that also accommodates
multiple markers has been suggested by Terwilliger (1995); the test becomes more conservative as the number of marker alleles increases, while maintaining high power. The concept of the TDT test also has been attractive to investigators seeking quantitative trait loci. The basic idea of Spielman et al. (1993) has been extended for quantitative traits (Allison, 1997), with generalizations for multiple sibs, multiple alleles, and relaxation of parametric assumptions on the distribution of the trait by Rabinowitz (1997). Other developments have cast the problem in a flexible regression framework that allows for analysis of pedigree data and easy incorporation of concomitants (George et al., 1999) and for varying degrees of linkage disequilibrium between the marker and trait locus alleles (Xiong et al., 1998). In summary, there has been a vigorous development of association methods to detect complex trait loci with the goal of exploiting linkage disequilibrium (when it exists) to detect and map loci with relatively small effect. When substantial disequilibrium exists between a functional variant and a marker allele under study, or when the marker allele is the actual functional variant, the promise for mapping complex trait loci is great. Several chapters in this volume provide insightful discussions involving disequilibrium-based methods.
V. CONTEMPORARY APPROACHES
There is great interest in exploring the genetic etiology of many novel traits, as well as traditional ones. To establish the feasibility of finding loci affecting complex traits by linkage and/or association studies, it is important to demonstrate that (1) the trait can be measured reproducibly on a large number of subjects, (2) the trait is significantly heritable, and (3) there is an appropriate study design and methodology that will yield sufficient power to detect loci of a plausible effect size, relative to the total trait heritability. There are two general strategies for identifying complex trait loci. The candidate gene approach is motivated by what is known about the trait biologically. For example, reasonable candidate genes for adiposity would be appetite-regulating factors or, for blood pressure, the proteins and enzymes involved in the renin-angiotensin system. This can be characterized as a hypothesis-testing approach because of the biological foundation supporting the proposed candidate genes. A completely different kind of experiment is a genome scan. In this case, anonymous polymorphisms that are uniformly distributed throughout the genome are tested for the presence of a linked trait locus at each of hundreds (or thousands) of locations; in this sense, it is a hypothesis-generating approach. The importance of genome scan experiments is that they represent a means of detecting previously unknown trait loci.
Candidate genes can be evaluated either by linkage or by association studies. The effectiveness of these approaches will be influenced by the type and density of the markers utilized in the study, as well as by study design and sample size. While a few highly polymorphic microsatellite markers in the candidate gene region are ideal for assessment by linkage, sparsity of markers is problematic for association studies. The spacing of linkage markers is usually not sufficiently dense to permit detection of associations arising from linkage disequilibrium, and there are further challenges in analyzing multiallelic markers with the potential for allelic heterogeneity. Diallelic markers, such as single nucleotide polymorphisms (SNPs), placed at a density on the order of kilobases (rather than megabases, as for linkage studies), are far more suitable for association studies and have the potential to localize functional variants. Since a generally usable SNP map is still under construction, some investigators are probing specific candidate gene regions by identifying new markers in strategic locations, such as in the promoter, within exons, or in putative regulatory regions. Similarly, genome scans can be carried out using either a linkage or an association approach and, again, the map requirements differ. Whereas the typical marker map for linkage studies includes approximately 300-400 highly polymorphic markers with an average density of 5-15 cM, Kruglyak (1999) has estimated that approximately 500,000 SNPs will be required for whole-genome association studies, since a useful level of linkage disequilibrium is unlikely to extend over distances greater than roughly 3 kb in the general population. While such a dense SNP map is not currently available, there are intense efforts on the part of the Human Genome Project and various academic and industrial groups to produce dense maps and to devise the means to quickly and easily genotype large numbers of subjects.
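The contrast between the two map densities can be checked with back-of-envelope arithmetic. The genome sizes below are rounded assumptions of ours (roughly a 3500-cM genetic map and a 3.2-Gb sequence), not figures from the chapter; the SNP spacing of 6.4 kb is chosen so that any site lies within about 3 kb of a marker, inside the useful range of disequilibrium quoted above:

```python
# Back-of-envelope marker counts for the two kinds of genome scan.
# Genome lengths are rounded assumptions, not values from the chapter.
GENETIC_MAP_CM = 3500   # approximate human genetic map length, centimorgans
GENOME_BP = 3.2e9       # approximate human genome length, base pairs

linkage_markers = GENETIC_MAP_CM / 10   # microsatellites every ~10 cM
association_snps = GENOME_BP / 6400     # SNPs every ~6.4 kb, so every site
                                        # lies within ~3.2 kb of a marker
print(int(linkage_markers), int(association_snps))  # 350 500000
```

The results are consistent with the figures in the text: a few hundred markers suffice for linkage, while on the order of half a million SNPs are needed for whole-genome association.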
One of the principal weaknesses of the association approach is that there is no guarantee that linkage disequilibrium exists in the region of interest and, if there is no disequilibrium, association tests will have no power unless the marker represents an actual functional variant. This concern led Terwilliger and Weiss (1998) and Wright et al. (1999) to argue that an important consideration is the type of population studied: preferably, one with a history that would enhance the presence of disequilibrium and with relatively high genetic homogeneity. Data on the type of variation that exists in the human genome are sparse but becoming increasingly available. Huttley et al. (1999) examined over 5000 short tandem repeat polymorphisms in a set of European mapping families (at CEPH) and found extremely variable patterns of disequilibrium throughout the genome. In another study, of the lipoprotein lipase gene, investigators sequenced 9.7 kb in 71 individuals from three different populations; while significant disequilibrium was found among the 88 variable sites that were identified, there was also a complexity in the structure of the variation that may pose challenges to the interpretation of disease association studies (Clark et al.,
1998). Certainly, the prospects for whole-genome linkage disequilibrium mapping are among the hotly debated issues of our time.
VI. CHALLENGES AND ISSUES
A. Type I and type II errors
One of the obvious problems in interpreting the results of a genome scan is sorting out the true positive signals from the false ones. Fundamentally, this problem has its roots in the complications arising from multiple comparisons: a test for linkage (or association) is carried out hundreds (or thousands) of times in a typical scan, possibly for multiple phenotypes. One simple way to control the rate of false positives in a linkage study is to adopt a stringent alpha level, such as has been suggested by Lander and Kruglyak (1995). While such brute-force control of the type I error certainly is one way to approach the problem, it does not come without a cost. The power to detect loci of modest effect, which is what is anticipated for complex trait loci, is severely compromised. Even the classic experimental paradigm of replication in an independent population is problematic. Certainly, replication of a particular finding in different populations lends additional credibility to the finding; but failure to replicate can be due to etiologic heterogeneity or simply to the statistical consequences of attempting to map several genes involved in an oligogenic trait (Suarez et al., 1994). More basically, the substantial imprecision of location estimates obtained from genome scans begs the question of whether two location estimates in a particular region represent the same gene (Roberts et al., 1999). The multiple comparisons problem also extends to the assessment of candidate genes, since it is common for investigators to examine multiple polymorphisms in a number of candidate genes, to carry out individual tests over multiple alleles at a marker locus, or to test multiple phenotypes. These practices can inflate the actual type I error rate, and the problem will become even more monumental with the analysis of sequence data with the goal of identifying functional variants.
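The power cost of a stringent genome-wide threshold can be illustrated with a one-sided Z test. The numbers below (400 markers and a locus whose true expected Z score is 3, i.e., a modest effect) are our illustrative assumptions, not values from the chapter:

```python
from statistics import NormalDist

def one_sided_threshold(alpha: float) -> float:
    """Z threshold whose upper-tail probability is alpha."""
    return NormalDist().inv_cdf(1.0 - alpha)

def power(z_threshold: float, effect: float) -> float:
    """Power of a one-sided Z test when the true mean of Z is `effect`."""
    return 1.0 - NormalDist().cdf(z_threshold - effect)

# Hypothetical scan: 400 markers, genome-wide alpha of 0.05, and a locus
# with a true expected Z score of 3 (our assumption of a "modest effect").
nominal = one_sided_threshold(0.05)          # single-test threshold, ~1.64
corrected = one_sided_threshold(0.05 / 400)  # Bonferroni-style threshold, ~3.66
print(round(power(nominal, 3.0), 2),
      round(power(corrected, 3.0), 2))  # about 0.91 vs. about 0.25
```

Under these assumptions the modest-effect locus is detected roughly nine times out of ten at the nominal threshold, but only about one time in four at the genome-wide threshold, which is the power loss described in the text.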
Thus, the problem of multiple comparisons is relevant for candidate gene studies, as well. Strategies to improve the interpretability of these important experiments aimed at gene discovery and localization are needed. The chapters in Section 8 of this volume deal with these issues in greater detail.
B. Study design
Among the many challenges in carrying out studies of complex phenotypes, the issue of study design may prove to be one of the most critical. Is there an optimal sampling design in terms of family structure for linkage analysis of complex
phenotypes? This is a contentious issue, with some investigators arguing that large extended pedigrees are optimal, while others favor smaller sampling units such as sibpairs (which allow greater ascertainment stringency than larger sampling units). While some of these important study design issues are discussed in several chapters, Chapter 26, by Gu and Rao, attempts to provide a comprehensive discussion of the most important design issues. Clearly all families are not equal with respect to the information they bring to a linkage analysis. By judiciously choosing the types of families to study, we can dramatically lower or increase the probability of detecting linkage. Although not always explicitly stated, the objective in choosing a sample of families for linkage analysis is simple: we wish to maximize the number of informative (i.e., segregating) meioses at the loci that affect susceptibility. Unfortunately, we do not know which loci are susceptibility loci in advance of a linkage study. It is obvious that certain sampling units, by virtue of their structure, have advantages over other sampling units. For instance, a strength of three-generation pedigrees is that they provide direct information about phase. Among two-generation families, larger sibships will be better than smaller sibships for the same reason: they allow better inference about phase. But if the proportion of affected individuals in a sibship is large, one or both parents could be homozygous at the susceptibility loci and the family will not be informative for linkage. For complex non-Mendelian phenotypes, the optimal family structure for mapping susceptibility loci will depend on the true mode of transmission which, in general, is unknown to a researcher. Indeed, for any particular mode of transmission, the optimal family structure may depend on such “nuisance” parameters as the allele frequencies at the susceptibility loci, since these will influence population prevalence.
For the common complex diseases, the correspondence between the genotype at a specific locus and the phenotype is expected to be poor, resulting in a dramatic decrease in allele sharing among affected family members. All linkage techniques, both “model based” and “model free,” rely on increased allele sharing among affected kindred members to detect and map disease susceptibility genes. So when the increase is barely above the “null” value, large sample sizes will be required to establish linkage. As an illustration, consider a common disease with a 10% population prevalence and 100% heritability that results from the additive effects of susceptibility alleles at N loci (see Suarez et al., 1994, and Rice et al., 2000, for details of the model). Figure 5.3 shows that as N goes from 1 to 10, the average allele sharing between a pair of randomly selected affected sibs decreases from about 71% to about 52%. Since under the null hypothesis of no linkage the expectation is 50% sharing, it is easy to see why it will be so difficult to detect and map genes when many loci contribute to the phenotype.
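The dilution of allele sharing can be checked by direct simulation. The sketch below is our own Monte Carlo version of an additive threshold model of this general kind, not the authors' code; the risk-allele frequency of 0.29 is our choice, picked so that requiring 6 or more risk alleles across 6 diallelic loci yields a prevalence near 10%:

```python
import random

def mean_ibd_sharing(n_loci=6, p=0.29, threshold=6, n_families=20000, seed=7):
    """Mean proportion of alleles shared IBD at one susceptibility locus,
    among sib pairs in which both sibs are affected. A person is affected
    iff their total risk-allele count across n_loci diallelic loci is
    >= threshold. Monte Carlo sketch; parameters are our assumptions."""
    rng = random.Random(seed)
    shared_sum, n_pairs = 0.0, 0
    for _ in range(n_families):
        # Each parent carries two alleles per locus; True marks a risk allele.
        father = [(rng.random() < p, rng.random() < p) for _ in range(n_loci)]
        mother = [(rng.random() < p, rng.random() < p) for _ in range(n_loci)]
        sibs = []
        for _ in range(2):
            # picks[l] records WHICH parental allele (0 or 1) was transmitted
            # at locus l, so IBD status can be read off directly.
            picks = [(rng.randrange(2), rng.randrange(2)) for _ in range(n_loci)]
            risk = sum(father[l][fp] + mother[l][mp]
                       for l, (fp, mp) in enumerate(picks))
            sibs.append((picks, risk >= threshold))
        (picks1, aff1), (picks2, aff2) = sibs
        if aff1 and aff2:
            # IBD sharing at locus 0: paternal match plus maternal match, / 2.
            shared = (picks1[0][0] == picks2[0][0]) + (picks1[0][1] == picks2[0][1])
            shared_sum += shared / 2
            n_pairs += 1
    return shared_sum / n_pairs if n_pairs else float("nan")

sharing = mean_ibd_sharing()
print(f"mean IBD sharing among affected sib pairs: {sharing:.3f}")
```

With these settings the mean sharing comes out only a few points above the null value of 50%, in line with the roughly 54% that Figure 5.3 reports for six loci, which makes the sample-size problem concrete.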
Figure 5.3. Mean proportion of susceptibility alleles shared IBD in a pair of affected sibs for an additive oligogenic model. (Plot not reproduced; x-axis, number of susceptibility loci, 1 to 10; y-axis, percent of alleles shared IBD.)
As noted above, this realization has led some genetic epidemiologists to suggest that association methods will be required to detect genes of small effect. While association methods may prove useful in some instances, we believe that linkage studies will continue to be the primary tool to map susceptibility loci although, as we illustrate shortly, more attention will need to be given to experimental design. For example, suppose that a common disease (10% population prevalence) results from the interaction of up to 6 unlinked diallelic susceptibility loci. For simplicity, suppose the allele that increases risk has the same frequency at each of the 6 loci and that, to be affected, a person needs to have inherited a total of 6 or more of these “at-risk” alleles. For this very simple model, there are 3^6 = 729 genotypes, of which 435 give rise to the disease. Figure 5.3 indicates that a random collection of affected sibs will share, on average, 54.1% of their alleles IBD at any one of these 6 loci. Now suppose we ascertain a sample of families consisting of four children, at least two of whom are affected, and inventory these sibships according to the number and pattern of affected family members back to the sibship’s grandparents. We do not need to genotype the parents or grandparents (who may not be available); it will suffice to keep track of who is affected. There are 63 possible nonredundant configurations for this simple pedigree, of which the 10 most frequent are shown in Figure 5.4. To determine the relative proficiency of these different family structures to detect genes of small effect, we simulated 36 sets of 250 families of each configuration shown in Figure 5.4 and analyzed them for linkage using the t2 statistic of Blackwelder and Elston (1985). This statistic uses information from the entire
(Pedigree diagrams of the 10 configurations are not reproduced; the tabulated values are as follows.)

Relative abundance (%)    t2 score    Relative efficiency
15.7                      3.100       0.44
12.9                      4.697       1.00
10.7                      3.166       0.45
10.0                      4.523       0.93
 7.7                      3.412       0.53
 5.0                      3.762       0.64
 3.7                      2.499       0.28
 2.9                      2.598       0.32
 2.7                      3.049       0.42
 2.5                      2.884       0.38

Figure 5.4. Top 10 most frequent nonredundant configurations for a sibship consisting of four offspring (at least two of whom are affected) and their immediate ancestors. The disease model consists of six unlinked susceptibility loci, each with equal effect, and gives rise to a common disease with a population prevalence of 10%. Relative abundance is the frequency of the configuration among all ascertained sibships of size 4. The t2 statistic was computed for 250 families of identical configuration, averaged over the 6 susceptibility loci and over 36 replicates.
sibship, not just the affected sibs. Figure 5.4 reports the mean t2 score averaged over all 6 susceptibility loci. There is a great deal of variation in the amount of information provided by the various configurations. Since the t2 statistic has an asymptotically standard normal distribution, the relative efficiency of any two
configurations can be obtained by computing the ratio of the squared scores. Figure 5.4 reports these relative efficiencies by comparing the various configurations to the one that provides the greatest evidence for linkage (i.e., two of four affected sibs, no affected parents, and no affected grandparents). The results from this simulation provide a number of interesting insights that may be of help in designing future linkage studies for complex phenotypes. It has long been recognized that for diseases with a simple genetic architecture, bilineal families (i.e., families containing affected ancestors on both the mother’s and the father’s side) offer less linkage information than unilineal families (Hodge, 1992). This also appears to hold true for a complex threshold trait of the sort simulated here, since bilineal families have an average relative efficiency of only 0.34. Among nonbilineal families, those that contain an affected parent have a relative efficiency of 0.48, versus 0.97 when neither parent has the common disease. This is a reasonable finding, since unaffected parents capable of having a minimum of two of four affected children must have a liability close to the threshold and must, therefore, be segregating at many of the susceptibility loci. It is likely that the underlying genetic architecture of the common diseases will prove much more complicated than the simple 6-locus model considered above. Nonetheless, the wide variability in linkage information contained in the different family configurations suggests that more attention should be given to experimental design and, in particular, to ascertainment. A simple “multiplex” ascertainment strategy will probably prove inefficient for complex phenotypes.
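The relative efficiencies in Figure 5.4 can be reproduced directly from the mean t2 scores, since for asymptotically standard normal statistics the efficiency ratio is the ratio of squared scores:

```python
# The t2 scores below are the ten configuration means read from Figure 5.4.
# Efficiency relative to the best configuration = (score / best_score)**2.
t2_scores = [3.100, 4.697, 3.166, 4.523, 3.412, 3.762, 2.499, 2.598, 3.049, 2.884]
best = max(t2_scores)  # 4.697: two of four sibs affected, no affected
                       # parents or grandparents
relative_efficiency = [round((s / best) ** 2, 2) for s in t2_scores]
print(relative_efficiency)
# [0.44, 1.0, 0.45, 0.93, 0.53, 0.64, 0.28, 0.31, 0.42, 0.38]
# This matches the published column except that the eighth entry prints 0.31
# where Figure 5.4 lists 0.32, presumably because the published scores are
# themselves rounded.
```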
C. Genetic heterogeneity
When Morton (1962) first argued that the “detection of genetic heterogeneity is the principal purpose of a linkage test in man,” the only documented autosomal example known was the heterogeneity between the Rh blood groups and elliptocytosis (Morton, 1956). To many, this heterogeneity seemed to be more a curiosity than an expected finding, since elliptocytosis appeared to be a simple and clinically homogeneous phenotype. Since then, more powerful heterogeneity tests have been developed (Smith, 1962; Risch, 1988) and many more examples of “locus heterogeneity” are known. Alleles at 11 unlinked loci, for instance, are known to cause autosomal dominant retinitis pigmentosa (RP). Alleles at another 7 unlinked loci, in homozygotes or compound heterozygotes, can cause recessive RP. In addition, 4 X-linked RP loci (1 dominant, 3 recessive) have been identified (Heckenlively and Daiger, 1996). Other well-known examples with documented locus heterogeneity include maturity-onset diabetes of the young (Froguel and Velho, 1999), prolongation of the QT interval (Viskin, 1999), early-onset Alzheimer’s disease
(Selkoe, 1996), and breast cancer (Hall et al., 1990; Wooster et al., 1994). These data support the idea that genetic heterogeneity may be one of the prominent features of complex phenotypes. Morton’s suggestion that careful linkage analysis may be a powerful tool for dissecting locus heterogeneity was extremely insightful in its time and may well serve in contemporary studies. However, the task for complex traits poses additional challenges in that the underlying loci may be neither necessary nor sufficient to cause disease, may have modest marginal effects on quantitative variation, or may be involved in complex patterns of interaction with other loci or environmental factors. Nonetheless, strategies like those used in the examples just cited may prove to be similarly useful in the analysis of complex traits. In the chapters that follow, various authors address the challenges and opportunities in mapping and characterizing complex trait loci. While the opportunities have never been greater, with the imminent availability of human genome sequence, technical advances that make high-throughput genotyping a reality, and improved computational power coupled with novel and powerful statistical methods, the challenges also remain great, as is the promise of improved health and quality of life. It is an exciting time in human genetics.
Acknowledgments
This work was supported in part by GM28719 (IBB), MH31302 (both authors), and an award from the Urological Research Foundation (BKS).
References
Allison, D. B. (1997). Transmission-disequilibrium tests for quantitative traits. Am. J. Hum. Genet. 60, 676-690.
Almasy, L., and Blangero, J. (1998). Multipoint quantitative-trait linkage analysis in general pedigrees. Am. J. Hum. Genet. 62, 1198-1211.
Amos, C. I. (1994). Robust variance-components approach for assessing genetic linkage in pedigrees. Am. J. Hum. Genet. 54, 535-543.
Bateson, W., and Punnett, R. C. (1905-1908). Reports to the Evolution Committee of the Royal Society. Reports 2, 3, and 4.
Bernstein, F. (1924). Ergebnisse einer biostatistischen zusammenfassenden Betrachtung über die erblichen Blutstrukturen des Menschen. Klin. Wschr. 3, 1495-1497.
Bickeboeller, H., and Clerget-Darpoux, F. (1995). Statistical properties of the allelic and genotypic transmission/disequilibrium test for multiallelic markers. Genet. Epidemiol. 12, 865-870.
Blackwelder, W. C., and Elston, R. C. (1985). A comparison of sib-pair linkage tests for disease susceptibility loci. Genet. Epidemiol. 2, 85-97.
Boehnke, M., and Langefeld, C. D. (1998). Genetic association mapping based on discordant sib pairs: The discordant-alleles test. Am. J. Hum. Genet. 62, 950-961.
Boerwinkle, E., Chakraborty, R., and Sing, C. F. (1986). The use of measured genotype information in the analysis of quantitative phenotypes in man. I. Models and analytical methods. Ann. Hum. Genet. 50, 181-194.
Boerwinkle, E., Visvikis, S., Welsh, D., Steinmetz, J., Hanash, S. M., and Sing, C. F. (1987). The use of measured genotype information in the analysis of quantitative phenotypes in man. II. The role of the apolipoprotein E polymorphism in determining levels, variability, and covariability of cholesterol, betalipoprotein, and triglycerides in a sample of unrelated individuals. Am. J. Med. Genet. 27, 567-582.
Boveri, T. (1904). Ergebnisse über die Konstitution der chromatischen Kernsubstanz. Fischer, Jena.
Clark, A. G., Weiss, K. M., Nickerson, D. A., Taylor, S. L., Buchanan, A., Stengard, J., Salomaa, V., Vartiainen, E., Perola, M., Boerwinkle, E., and Sing, C. F. (1998). Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Hum. Genet. 63, 595-612.
Curtis, D., and Sham, P. C. (1994). Using risk calculation to implement an extended relative pair analysis. Ann. Hum. Genet. 58, 151-162.
Darwin, C. (1868). “The Variation of Animals and Plants under Domestication.” Murray, London.
Elston, R. C., and Stewart, J. (1971). A general model for the genetic analysis of pedigree data. Hum. Hered. 21, 523-542.
Ewens, W., and Spielman, R. S. (1995). The transmission/disequilibrium test: History, subdivision and admixture. Am. J. Hum. Genet. 57, 455-464.
Falk, C. T., and Rubenstein, P. (1987). Haplotype relative risks: An easy, reliable way to construct a proper control sample for risk calculations. Ann. Hum. Genet. 51, 227-233.
Froguel, P., and Velho, G. (1999). Molecular genetics of maturity onset diabetes of the young. Trends Endocrin. Metab. 10, 121-125.
Galton, F. (1889). “Natural Inheritance.” Macmillan, London.
George, V., Tiwari, H. K., Zhu, X., and Elston, R. C. (1999). A test of transmission/disequilibrium for quantitative traits in pedigree data, by multiple regression. Am. J. Hum. Genet. 65, 236-245.
Hall, J. M., Lee, M. K., Newman, B., Morrow, J. E., Anderson, L. A., Huey, B., and King, M. C. (1990). Linkage of early-onset familial breast cancer to chromosome 17q21. Science 250, 1684-1689.
Haseman, J. K., and Elston, R. C. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2, 3-19.
Heckenlively, J. R., and Daiger, S. P. (1996). Hereditary retinal and choroidal degenerations. In “Emery and Rimoin’s Principles and Practice of Medical Genetics,” 3rd ed. (D. L. Rimoin, J. M. Connor, and R. E. Pyeritz, eds.). Churchill Livingstone, Edinburgh.
Hodge, S. E. (1992). Do bilineal pedigrees represent a problem for linkage analysis? Basic principles and simulation results for single-gene diseases with no heterogeneity. Genet. Epidemiol. 9, 191-206.
Horvath, S., and Laird, N. M. (1998). A discordant-sibship test for disequilibrium and linkage: No need for parental data. Am. J. Hum. Genet. 63, 1886-1897.
Huttley, G. A., Smith, M. W., Carrington, M., and O’Brien, S. J. (1999). A scan for linkage disequilibrium across the human genome. Genetics 152, 1711-1722.
Janssens, F. A. (1909). La théorie de la chiasmatypie. Cellule 25, 389-411.
Kruglyak, L. (1999). Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22, 139-144.
Kruglyak, L., Daly, M. J., Reeve-Daly, M. P., and Lander, E. S. (1996). Parametric and nonparametric linkage analysis: A unified multipoint approach. Am. J. Hum. Genet. 58, 1347-1363.
Lander, E., and Kruglyak, L. (1995). Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat. Genet. 11, 241-247.
Landsteiner, K. (1900). Zur Kenntnis der antifermentativen, lytischen und agglutinierenden Wirkungen des Blutserums und der Lymphe. Zbl. Bakt. 27, 357-362.
Mendel, G. (1866). Versuche über Pflanzen-Hybriden. Verh. Naturforsch. Verein Brünn 4, 3-47.
Morgan, T. H. (1911). An attempt to analyze the constitution of the chromosomes on the basis of sex-limited inheritance in Drosophila. J. Exp. Zool. 11, 365-412.
Morton, N. E. (1955). Sequential tests for the detection of linkage. Am. J. Hum. Genet. 7, 277-318.
Morton, N. E. (1956). The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am. J. Hum. Genet. 8, 80-96.
Morton, N. E. (1962). Segregation and linkage. In “Methodology in Human Genetics” (W. J. Burdette, ed.), pp. 17-52. Holden-Day, San Francisco.
Oliver, C. P. (1967). Dogma and the early development of genetics. In “Heritage from Mendel” (R. A. Brink, ed.), pp. 3-10. University of Wisconsin Press, Madison.
Ott, J. (1974). Estimation of the recombination fraction in human pedigrees: Efficient computation of the likelihood for human linkage studies. Am. J. Hum. Genet. 26, 588-597.
Penrose, L. S. (1935). The detection of autosomal linkage in data which consist of pairs of brothers and sisters of unspecified parentage. Ann. Eugen. (London) 6, 133-138.
Province, M. A., Rice, T., Borecki, I. B., Gu, C., and Rao, D. C. (2000). A multivariate and multilocus variance components approach using structural relationships to assess quantitative trait linkage via SEGPATH. Genet. Epidemiol., in press.
Rabinowitz, D. (1997). A transmission disequilibrium test for quantitative trait loci. Hum. Hered. 47, 342-350.
Rice, J. P., Neuman, R. J., Hoshaw, S. L., Daw, E. W., and Gu, C. (1995). TDT with covariates and genomic screens with mod scores: Their behavior on simulated data. Genet. Epidemiol. 12, 659-664.
Rice, J. P., Saccone, N. L., and Suarez, B. K. (2000). The design of studies for investigating linkage and association. In “Analysis of Multifactorial Diseases” (T. Bishop and P. Sham, eds.). BIOS, Oxford.
Risch, N. (1988). A new statistical test for linkage heterogeneity. Am. J. Hum. Genet. 42, 353-364.
Risch, N. (1990). Linkage strategies for genetically complex traits. I. Multilocus models. Am. J. Hum. Genet. 46, 222-228.
Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273, 1516-1517.
Risch, N., and Zhang, H. (1995). Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268, 1584-1589.
Roberts, S. B., MacLean, C. J., Neale, M. C., Eaves, L. J., and Kendler, K. S. (1999). Replication of linkage studies of complex traits: An examination of variation in location estimates. Am. J. Hum. Genet. 65, 876-884.
Schaid, D. J., and Sommer, S. S. (1994). Comparison of statistics for candidate gene studies using cases and parents. Am. J. Hum. Genet. 55, 402-409.
Selkoe, D. J. (1996). Amyloid β-protein and the genetics of Alzheimer’s disease. J. Biol. Chem. 271, 18295-18298.
Sham, P. C., and Curtis, D. (1995). An extended transmission/disequilibrium test (TDT) for multiallele marker loci. Ann. Hum. Genet. 59, 323-336.
Smith, C. A. B. (1962). Homogeneity test for linkage data. Proc. 2nd Int. Congr. Hum. Genet. 1, 212-213.
Spielman, R. S., and Ewens, W. J. (1998). A sibship test for linkage in the presence of association: The sib transmission/disequilibrium test. Am. J. Hum. Genet. 62, 450-458.
Spielman, R. S., McGinnis, R. E., and Ewens, W. J. (1993). Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52, 506-516.
Sturtevant, A. H. (1913). The linear arrangement of six sex-linked factors in Drosophila, as shown by their mode of association. J. Exp. Zool. 14, 43-59.
Suarez, B. K., Rice, J., and Reich, T. (1978). The generalized sib pair IBD distribution: Its use in the detection of linkage. Ann. Hum. Genet. (London) 42, 87-94.
Suarez, B. K., and Van Eerdewegh, P. (1984). A comparison of three affected-sib-pair scoring methods to detect HLA-linked disease susceptibility genes. Am. J. Med. Genet. 18, 135-146.
Suarez, B. K., Hampe, C. L., and Van Eerdewegh, P. (1994). Problems of replicating linkage claims in psychiatry. In “Genetic Approaches to Mental Disorders” (E. S. Gershon and C. R. Cloninger, eds.), pp. 23-46. American Psychiatric Press, Washington, DC.
Sun, F., Flanders, W. D., Yang, Q., and Khoury, M. J. (1999). Transmission disequilibrium test (TDT) when only one parent is available: The 1-TDT. Am. J. Epidemiol. 150, 97-104.
Sutton, W. S. (1903). The chromosomes in heredity. Biol. Bull. 4, 231-251.
Terwilliger, J. D. (1995). A powerful likelihood method for the analysis of linkage disequilibrium between trait loci and one or more polymorphic marker loci. Am. J. Hum. Genet. 56, 777-787.
Terwilliger, J. D., and Ott, J. (1992). A haplotype-based “haplotype relative risk” approach to detecting allelic associations. Hum. Hered. 42, 337-346.
Terwilliger, J. D., and Weiss, K. M. (1998). Linkage disequilibrium mapping of complex disease: Fantasy or reality? Curr. Opin. Biotechnol. 9, 578-594.
Viskin, S. (1999). Long QT syndromes and torsade de pointes. Lancet 354, 1625-1633.
Weeks, D. E., and Lange, K. (1988). The affected-pedigree-member method of linkage analysis. Am. J. Hum. Genet. 42, 315-326.
Woolf, B. (1955). On estimating the relation between blood group and disease. Ann. Hum. Genet. 19, 251-253.
Wooster, R., Neuhausen, S. L., Mangion, J., Quirk, Y., Ford, D., Collins, N., Nguyen, K., et al. (1994). Localization of a breast cancer susceptibility gene, BRCA2, to chromosome 13q12-13. Science 265, 2088-2090.
Wright, A. F., Carothers, A. D., and Pirastu, M. (1999). Population choice in mapping genes for complex diseases. Nat. Genet. 23, 397-404.
Xiong, M. M., Krushkal, J., and Boerwinkle, E. (1998). TDT statistics for mapping quantitative trait loci. Ann. Hum. Genet. 62, 431-452.
6

Definition of the Phenotype

John P. Rice,¹ Nancy L. Saccone, and Erik Rasmussen
Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri 63110
I. Summary
II. Introduction
III. The Benefits of a Narrowly Defined Disease Phenotype
IV. Endophenotypes and Quantitative Traits
V. The Impact of Diagnostic and Measurement Error
VI. Discussion
References
I. SUMMARY

Definition of the phenotype is a key issue in designing any genetic study whose goal is to detect disease genes. This chapter describes strategies to increase the power to detect susceptibility loci for complex diseases. A narrowly defined disease phenotype can offer advantages over broad definitions. Studies of clinical disease can also benefit from judicious selection of endophenotypes and related quantitative traits for analysis. The effect of diagnostic and measurement error is also discussed; power is maximized when strategies to reduce error are incorporated into a study design.
¹To whom correspondence should be addressed.

Advances in Genetics, Vol. 42. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved. 0065-2660/01 $35.00
II. INTRODUCTION

To identify susceptibility loci for common, complex human diseases, researchers must first define the disease or phenotype of interest. Although genetic studies may lead to a molecular basis for disease definition, uncertainty in the clinical diagnosis or confounders and measurement error for quantitative risk factors may preclude the discovery of linkage. A dichotomous disease phenotype is often of primary interest to clinical investigators. High heritability h2 of the disease, defined as the ratio of genetic variance to total phenotypic variance, can indicate that direct linkage analysis of the disease phenotype may be fruitful. However, because a disease may be influenced by multiple loci, each of which makes only a small contribution, detection of any one locus may be difficult, even for clearly heritable diseases.
III. THE BENEFITS OF A NARROWLY DEFINED DISEASE PHENOTYPE

One option to counter the difficulties in analyzing common, complex, oligogenic diseases may be to narrow the disease definition or define subtypes for analysis. Such strategies aim to identify more severe, more "biological," or early onset forms of illness that are perhaps due to one or a few genes. Focusing on subtypes can also identify more homogeneous families for analysis. Successful applications of this approach include the subdivision of Alzheimer's disease according to age of onset, which has led to identification of disease mutations involved in autosomal dominant, early onset forms, and to discovery of the association of the apoE e4 allele with late-onset Alzheimer's disease, as reviewed in Tilley et al. (1998). An additional advantage of a "narrow-phenotype" approach is purely statistical: simulations have shown that the population prevalence of a trait can affect the ability to detect linkage, with more common diseases requiring larger sample sizes for detection in a sibpair study (Rice et al., 2000). Oligogenic traits were generated using the models of Suarez (1994), assuming equal action of the multiple loci. The results in Table 6.1 show the increased power for a disease of 1% prevalence compared to a disease of 10% prevalence, for a fixed heritability h2 and a fixed number of trait loci. This effect is a reflection of the comparative gene frequencies of disease alleles segregating when affected sibpairs are sampled. The advantage of increased ability to detect linkage for less prevalent diseases, however, is somewhat counteracted by the practical difficulty of ascertaining for a rarer phenotype.
Table 6.1. Simulation Results for Oligogenic Models^a

                          Heritability, h2 (%)
Number of loci      100       75       50       25
  10% Prevalence
       1             59       59       70      131
       2            159      251      389      983
       4            669    1,106    1,815    4,112
       6          1,510    2,592    6,264    9,048
       8          3,094    3,740   12,502   20,360
      10          4,926    7,645   20,075   63,078
  1% Prevalence
       1             42       42       42       43
       2             62      160      154      163
       4            192      369      466      870
       6            449      685      834    2,012
       8            711    1,149    1,802    4,438
      10          1,357    1,720    2,624    7,118

^a Sample size N (= number of affected sibpairs) required to detect linkage at a significance level of α = 0.0001 and 80% power.
IV. ENDOPHENOTYPES AND QUANTITATIVE TRAITS

The study of clinical disease phenotypes can also be enhanced by investigating related quantitative traits. Complex diseases may be influenced by several genes of small effect that are difficult to detect by direct analysis of the clinical phenotype. However, it may be possible to detect such genes if they have a major effect on related traits under study. Such associated biological traits, called endophenotypes or risk factors, offer several advantages. Often these endophenotypes are quantitative measures subject to minimal or quantifiable measurement error. Both unaffected and affected individuals may be measured and included in analysis. A quantitative phenotype provides a range of values with potentially more information than a discrete or threshold scale, and a well-chosen endophenotype may be more "biological" than a clinical diagnosis, and more directly tied to gene expression. Finally, there are quantitative phenotypes that are important to study in their own right, such as adiposity, fat distribution, and blood pressure, even as
opposed to obesity and hypertension. There is considerably more information for linkage in quantitative variation than for an arbitrarily discretized trait (Duggirala et al., 1997). For example, plasma cholesterol levels are quantitative indicators of risk for coronary heart disease (CHD) and have been found to be significantly associated with genotypes at candidate genes such as apoE (Boerwinkle and Sing, 1987; Kaprio et al., 1991; Kamboh et al., 1995). Further refinements on the basis of more precise knowledge of the component phenotypes or metabolic processes also can be useful. For example, the study of genes underlying CHD risk can be extended by studying the various fractions of total cholesterol (low-density lipoprotein cholesterol, LDL-c; high-density lipoprotein cholesterol, HDL-c; and triglycerides), which have heterogeneous effects on disease risk. In fact, the identification of quantitative trait loci (QTLs) will be easiest for quantitative phenotypes most proximal to the genotype, simply because the relative contribution of a single major locus will be greater. Thus, a more informative and successful study may consist of a search for genes affecting apolipoprotein B and AI levels rather than LDL-c and HDL-c, respectively. This strategy of refining the quantitative phenotype is analogous to that of using narrowly defined disease phenotypes discussed earlier.

Human event-related potentials (ERPs) have been studied as possible endophenotypes for psychiatric diseases. The P50 sensory gating response appears promising in genetic studies of schizophrenia (Freedman et al., 1997). Schizophrenic probands and their first-degree relatives exhibit reduced suppression, or nongating, of the P50 auditory-evoked response when presented with repeated auditory stimuli. Neurobiological studies of the P50 phenotype in rodents and humans strongly suggest the response is mediated by the α7-nicotinic cholinergic receptor gene (CHRNA7) on chromosome 15 (Freedman et al., 1997).
Freedman et al. (1997) conducted a genome-wide scan for nongating QTL using the P50 response and schizophrenia as phenotypes in nine pedigrees of European ancestry in which schizophrenia was present in at least two members of a family. For the P50 phenotype, the greatest lod score of 5.30 (θ = 0.0, P < 0.001) was obtained with D15S1360, < 120 kb from the first exon of CHRNA7. Another example of a proposed endophenotype is the P3 component of human ERPs, which shows reduced amplitude in alcoholics even after long-term abstinence (Porjesz et al., 1998). Quantitative linkage analysis has given evidence of several loci linked to the amplitude of the P3 component of the ERP (Begleiter et al., 1998). These loci may lead to candidate genes and further understanding of genetic factors underlying susceptibility to alcohol dependence. Care should be exercised in selecting and pursuing endophenotypes as a means for studying clinical disease. Platelet monoamine oxidase (MAO)
activity has been suggested as an endophenotype for alcoholism, based on early reports of association between alcohol dependence and low enzyme activity. MAO activity has furthermore been shown to be heritable, and it exhibits a commingled distribution that suggests a major gene for activity level. However, recent findings (Whitfield et al., 2000) indicate that the association between MAO activity and alcohol dependence is most likely explained by the confounding effect of cigarette smoking. While MAO activity may still warrant genetic study in its own right (Saccone et al., 1999), it no longer is expected to directly shed light on the genetics of alcoholism susceptibility.
V. THE IMPACT OF DIAGNOSTIC AND MEASUREMENT ERROR

We now return to the dichotomous trait setting and examine the impact of diagnostic error on the heritability h2 and risk ratio λ of a dichotomous polygenic trait. We assume that a trait is determined by an underlying liability scale, which in turn is determined by the additive effects of many genes. Individuals above a set threshold are affected; this threshold determines the true prevalence K of the trait. The joint distribution of liability within a family is assumed to be multivariate normal, with familial resemblance given by the correlation in liability between family members. Assuming no dominance, the correlation between first-degree relatives is one-half the heritability. Let s be the sensitivity of diagnosis (the probability of correctly diagnosing a true case) and let t be the specificity (the probability of correctly diagnosing a noncase). We will see that, in particular, reduced specificity has a significant effect on the power to detect genes. Recall that K is the true prevalence of the trait, and let Kr represent the (true) risk to a first-degree relative of a true case. Note that the observed prevalence of the trait is given by K* = sK + (1 - t)(1 - K). If all probands are in fact true cases, the observed rate Kr* in relatives is similarly defined, Kr* = sKr + (1 - t)(1 - Kr). However, assuming that probands include false positives, a proband is a true case with probability sK/K*, so the observed rate in relatives is Kr** = [sKr + (1 - t)(1 - Kr)]sK/K* + (1 - t)(1 - K). To illustrate the significance of the preceding formulas, consider the case of a threshold set to yield a true prevalence K of 10%. Setting the true heritability to be 100%, the risk Kr of a first-degree relative being a true case is 32.4% (Rice et al., 1987). Thus the true risk ratio is λ = Kr/K = 3.24. Now consider the effect when the specificity t is held fixed and the sensitivity s is reduced. If t = 1.0 and s = 0.95, then the observed prevalence K* = 0.095, h2 is 97%, and the observed λ = Kr**/K* = 3.24.
If t = 1.0 and s = 0.90, then the observed prevalence K* = 0.090, h2 is 94%, and the observed λ is still 3.24. Hence there is only a minor impact of reduced sensitivity on the observed heritability and risk ratio.
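The observed prevalence and risk ratio under diagnostic error can be checked directly from the sensitivity and specificity formulas above; a minimal Python sketch (the observed heritabilities require the bivariate normal liability model and are not reproduced here):

```python
def observed_rates(K, Kr, s, t):
    """Observed prevalence and familial risk under diagnostic error.

    K  : true population prevalence
    Kr : true risk to a first-degree relative of a true case
    s  : sensitivity, P(diagnosed affected | true case)
    t  : specificity, P(diagnosed unaffected | true noncase)
    """
    K_obs = s * K + (1 - t) * (1 - K)
    # A proband is a true case with probability s*K/K_obs; relatives of
    # true cases are observed affected at rate s*Kr + (1-t)*(1-Kr), while
    # relatives of false-positive probands contribute the (1-t)*(1-K) term.
    Kr_obs = (s * Kr + (1 - t) * (1 - Kr)) * s * K / K_obs + (1 - t) * (1 - K)
    return K_obs, Kr_obs

K, Kr = 0.10, 0.324  # true prevalence 10%, h2 = 100% (Rice et al., 1987)
for s, t in [(0.95, 1.0), (0.90, 1.0), (1.0, 0.95), (1.0, 0.90)]:
    K_obs, Kr_obs = observed_rates(K, Kr, s, t)
    # reproduces the observed risk ratios 3.24, 3.24, 2.01, and 1.56
    print(f"s={s:.2f} t={t:.2f}  K*={K_obs:.3f}  lambda={Kr_obs / K_obs:.2f}")
```

Running the four scenarios confirms the asymmetry discussed in the text: reducing sensitivity leaves the observed risk ratio at 3.24, while reducing specificity erodes it sharply.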
In contrast, if sensitivity is fixed at s = 1 and specificity t = 0.95, then K* = 0.145, h2 = 68%, and the observed λ = Kr**/K* = 2.01. If specificity is further reduced to t = 0.90, we find that K* = 0.190, h2 is 50%, and the observed λ = 1.56. The inclusion of false positives (reduced specificity) has a major impact. A specificity of 90% reduces h2 and λ by 50% or more, and would have a dramatic effect in reducing power compared to the situation of no diagnostic error. A more detailed table of the effects of varying degrees of reduced sensitivity and specificity appears in Rice et al. (2000). However, the lesson is clear from the foregoing examples, which underscore the necessity of making clinical diagnoses as carefully as possible. Incorporating repeated measures into a study design may be a potentially useful strategy to reduce error.

Similarly, measurement error of quantitative traits can reduce the power to detect and localize QTLs. However, various simple strategies can be used to maximize the signal-to-noise ratio. First, the average of multiple measurements can be used to minimize measurement error; this is commonly done in studies of blood pressure. Second, the effect of known or suspected confounders can be controlled for by regressing out the effects of the predictor variables on the phenotype. The resulting adjusted phenotype should reflect much reduced "noise" variance.
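The confounder-adjustment strategy can be sketched in a few lines. The data below are hypothetical: a simulated quantitative phenotype built from a genetic component plus a known confounder (analogous to smoking for MAO activity), with arbitrary effect sizes chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: phenotype = confounder effect + genetic signal + noise.
confounder = rng.normal(10.0, 3.0, n)   # e.g., cigarettes smoked per day
genetic = rng.normal(0.0, 1.0, n)       # unobserved genetic component
phenotype = 0.5 * confounder + genetic + rng.normal(0.0, 1.0, n)

# Regress the phenotype on the confounder; the residuals serve as the
# adjusted phenotype for subsequent linkage or association analysis.
X = np.column_stack([np.ones(n), confounder])
beta, *_ = np.linalg.lstsq(X, phenotype, rcond=None)
adjusted = phenotype - X @ beta

# The adjusted phenotype retains the genetic signal but sheds the
# "noise" variance contributed by the confounder.
print(round(float(np.var(phenotype)), 2), round(float(np.var(adjusted)), 2))
```

In this toy setting the residual phenotype is uncorrelated with the confounder by construction, so its variance drops while the correlation with the simulated genetic component is preserved.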
VI. DISCUSSION

In contrast to Mendelian phenotypes, the effect sizes of genes for complex phenotypes are unknown. The sample sizes in Table 6.1 range from 59 to 63,078 affected sibpairs for a disease with 10% prevalence. Even once a linkage has been detected, the identification of the disease gene may be problematic because there is a wide support interval for the linkage signal. It is clear that phenotype definition can play a key role in gene discovery. As noted earlier, the use of a narrowly defined phenotype or a refined quantitative phenotype can lead to a dramatic increase in power. Moreover, the elimination of false positive cases or adjustment for confounders can have a similar benefit. Similar arguments pertain to a quantitative endophenotype. The statistical power for the detection of a gene associated with a quantitative measure may be high, even though that gene is a minor susceptibility gene for the disease phenotype of interest. Thus attention should be given not only to the definition of the disease phenotype, but also to related phenotypes. Phenotype definition should be considered along with sampling strategy and analytic procedures in the design of a study. The power depends on all these factors as well as the true underlying state of nature. A phenotype
with high measurement error or marked heterogeneity is likely to be problematic.
Acknowledgments

This work was supported in part by grants MH37685, MH31302, AA12239, and MH17104 (NLS, ER) from the National Institutes of Health. Special thanks to Christine Roark for preparation of this manuscript.
References

Begleiter, H., Porjesz, B., Reich, T., Edenberg, H. J., Goate, A., Blangero, J., Almasy, L., Foroud, T., Van Eerdewegh, P., Polich, J., Rohrbaugh, J., Kuperman, S., Bauer, L. O., O'Connor, S. J., Chorlian, D. B., Li, T.-K., Conneally, P. M., Hesselbrock, V., Rice, J. P., Schuckit, M. A., Cloninger, R., Nurnberger, J., Jr., Crowe, R., and Bloom, F. E. (1998). Quantitative trait loci analysis of human event-related brain potentials: P3 voltage. Electroencephalogr. Clin. Neurophysiol. 108, 244-250.
Boerwinkle, E., and Sing, C. F. (1987). The use of measured genotype information in the analysis of quantitative phenotypes in man. III. Simultaneous estimation of the frequencies and effects of the apolipoprotein E polymorphism and residual polygenetic effects on cholesterol, betalipoprotein and triglyceride levels. Ann. Hum. Genet. 51, 211-226.
Duggirala, R., Williams, J. T., Williams-Blangero, S., and Blangero, J. (1997). A variance component approach to dichotomous trait linkage analysis using a threshold model. Genet. Epidemiol. 14, 987-992.
Freedman, R., Coon, H., Myles-Worsley, M., Orr-Urtreger, A., Olincy, A., Davis, A., Polymeropoulos, M., Holik, J., Hopkins, J., Hoff, M., Rosenthal, J., Waldo, M., Reimherr, F., Wender, P., Yaw, J., Young, D., Breese, C., Adams, C., Patterson, D., Adler, L., Kruglyak, L., Leonard, S., and Byerley, W. (1997). Linkage of a neurophysiological deficit in schizophrenia to a chromosome 15 locus. Proc. Natl. Acad. Sci. USA 94, 587-592.
Kamboh, M. I., Evans, R. W., and Aston, C. E. (1995). Genetic effect of apolipoprotein(a) and apolipoprotein E polymorphisms on plasma quantitative risk factors for coronary heart disease in American black women. Atherosclerosis 117, 73-81.
Kaprio, J., Ferrell, R. E., Kottke, B. A., Kamboh, M. I., and Sing, C. F. (1991). Effects of polymorphisms in apolipoproteins E, A-IV, and H on quantitative traits related to risk for cardiovascular disease. Arterioscler. Thromb. 11, 1330-1348.
Kardia, S. L., Haviland, M. B., Ferrell, R. E., and Sing, C. F. (1999). The relationship between risk factor levels and presence of coronary artery calcification is dependent on apolipoprotein E genotype. Arterioscler. Thromb. Vasc. Biol. 19, 427-435.
Porjesz, B., Begleiter, H., Reich, T., Van Eerdewegh, P., Edenberg, H. J., Foroud, T., Goate, A., Litke, A., Chorlian, D. B., Stimus, A., Rice, J., Blangero, J., Almasy, L., Sorbell, J., Bauer, L. O., Kuperman, S., O'Connor, S. J., and Rohrbaugh, J. (1998). Amplitude of visual P3 event-related potential as a phenotypic marker for a predisposition to alcoholism: Preliminary results from the COGA Project. Alcohol Clin. Exp. Res. 22, 1317-1323.
Rice, J. P., Endicott, J., Knesevich, M. A., and Rochberg, N. (1987). The estimation of diagnostic sensitivity using stability data: An application to major depressive disorder. J. Psychiat. Res. 21, 337-345.
Rice, J. P., Saccone, N. L., and Suarez, B. K. (2000). The design of studies for investigating linkage and association. In "Analysis of Multifactorial Disease" (T. Bishop and P. Sham, eds.). BIOS Scientific Publishers, Oxford. In press.
Saccone, N. L., Rice, J. P., Rochberg, N., Goate, A., Reich, T., Shears, S., Wu, W., Nurnberger, J. I., Foroud, T., Edenberg, H. J., and Li, T. K. (1999). Genome screen for platelet monoamine oxidase (MAO) activity. Am. J. Med. Genet. 88, 517-521.
Suarez, B. K., Hampe, C. L., and Van Eerdewegh, P. (1994). Problems of replicating linkage claims in psychiatry. In "Genetic Approaches to Mental Disorders" (E. S. Gershon and C. R. Cloninger, eds.), pp. 23-46. American Psychiatric Press, Washington, DC.
Tilley, L., Morgan, K., and Kalsheker, N. (1998). Genetic risk factors in Alzheimer's disease. J. Clin. Pathol. Mol. Pathol. 51, 293-304.
Whitfield, J. B., Pang, D., Bucholz, K. K., Madden, P. A. F., Heath, A. C., Statham, D. J., and Martin, N. G. (2000). Monoamine oxidase: Associations with alcohol dependence, smoking, and other measures of psychopathology and personality. Psychol. Med. 30, 443-454.
7

Genotyping for Human Whole-Genome Scans: Past, Present, and Future

James L. Weber¹
Center for Medical Genetics, Marshfield Medical Research Foundation, Marshfield, Wisconsin 54449
Karl W. Broman
Department of Biostatistics, School of Hygiene and Public Health, Johns Hopkins University, Baltimore, Maryland 21205
I. Summary
II. Introduction: Genotyping Past
III. Genotyping Present
IV. Genotyping Future
V. Conclusions
References
I. SUMMARY

Efficient and effective whole-genome 10-cM short tandem repeat polymorphism (STRP) scans are now available. Doubling or tripling STRP density to an average spacing of 3-5 cM is readily achievable. However, if typing costs for diallelic polymorphisms can be brought close to, or preferably less than, one-third those of STRPs, then diallelics may gradually supplement or supplant STRPs in whole-genome scans. The power of higher density genome scans for gene mapping by association and for many other research and clinical applications is great. It would be wise to continue investing heavily for many years in genotyping technology.

¹To whom correspondence should be addressed.
II. INTRODUCTION: GENOTYPING PAST

In their landmark paper in 1980, Botstein, White, Skolnick, and Davis outlined the use of restriction fragment length polymorphisms (RFLPs) to map disease genes through linkage analysis (Botstein et al., 1980). The breakthrough achieved by these authors was the concept that highly abundant DNA polymorphisms, as opposed to protein polymorphisms or other phenotype-based markers, could be utilized for whole-genome scans. Throughout the 1980s, hundreds of RFLPs were identified and combined into whole-genome linkage maps. Several important disease genes were mapped, including those for Duchenne muscular dystrophy, Huntington's disease, and cystic fibrosis (Gusella, 1986). Unfortunately, RFLPs were largely diallelic and therefore low in informativeness. Also, the methods required for analysis of RFLPs were relatively complicated and inefficient. Analysis involved digestion of genomic DNA with one or more restriction enzymes, separation of the resulting DNA fragments by size through electrophoresis on agarose gels, Southern blotting of the DNA fragments to membranes, and detection of specific DNA fragments on the membranes by hybridization to highly radioactive, cloned DNA probes.

In 1989 a new type of abundant, multiallelic DNA polymorphism, the short tandem repeat polymorphism (STRP) (also called microsatellite or simple sequence length polymorphism), was reported (Weber and May, 1989). STRPs are based on variations in the numbers of tandem repeats in relatively short (usually < 60 bp) runs of primarily mono-, di-, tri-, and tetranucleotide repeats. Many STRPs have heterozygosities in the range of 70-90%. Analysis of STRPs involved just two simple steps: PCR amplification of a short (70-400 bp) segment of genomic DNA, followed by sizing of the amplified fragment through electrophoresis on denaturing polyacrylamide gels.
Because the PCR primers annealed to unique sequences flanking the runs of tandem repeats, each pair of primers was specific for a single locus in the genome. The only equipment required was a thermal cycler and electrophoresis apparatus. Since STRPs were more informative and easier to type than RFLPs, the former quickly supplanted the markers introduced earlier. Throughout the 1990s, about 10,000 human STRPs were identified and mapped. Linkage mapping successes for disease genes with STRPs were quickly achieved. Many hundreds of disease genes have since been mapped with the use of these markers.
III. GENOTYPING PRESENT

Today, mapping genes for monogenic disorders using STRPs is routine. When sufficient family material is available, a single experienced lab worker can map a monogenic disorder in less than a month, in optimal cases over a weekend. However, for genetically more complex disorders, at least one to two orders of magnitude more DNA samples may be required for linkage mapping success. Typing STRPs on such a large scale has motivated the growth of large dedicated genotyping centers. Genotyping output at these centers has increased greatly over the last few years, and concomitantly genotyping costs have rapidly dropped. Table 7.1, for example, presents 1990s Marshfield output for 400-marker STRP scans.

Most of the whole-genome polymorphism scans carried out at Marshfield are supported by the National Heart, Lung, and Blood Institute (NHLBI) Mammalian Genotyping Service. Genotyping is offered for all types of disorders, not just those involving the heart, lung, or blood. Genotyping through the service is free; however, brief applications must be submitted, which are subject to peer review and NHLBI staff evaluation. Capacity of the Mammalian Genotyping Service is currently about 5.5 million genotypes per year and is steadily increasing. The service is funded through September 2006. More information can be obtained from www.marshmed.org/genetics. The
Table 7.1. Marshfield Genotyping Output

Year    DNA samples with 400-marker    Total cost per
        genome scans                   genome scan^a
1993        350^b                      $1,200
1994        674^b                        $920
1995      2,150^b                        $600
1996      3,600                          $428
1997      7,700                          $272
1998     11,400                          $192
1999     14,200                          $160

^a Total cost is comprehensive and includes salaries, supplies, equipment, overhead, and miscellaneous expenses.
^b In 1993-1995 much of the lab's genotyping was with CEPH families instead of for disease-gene mapping. Therefore, for comparison purposes, total genotypes were divided by 400 to obtain equivalent numbers of DNA samples scanned.
Center for Inherited Disease Research (CIDR), an intramural program of the National Institutes of Health, offers a similar genotyping service (www.cidr.jhmi.edu).
A. Marker screening sets

Typically, human whole-genome polymorphism scans involve 350-400 STRPs with average sex-equal spacing of about 10 cM. Lower density screens are occasionally carried out, particularly for monogenic disorders. Since about 10,000 human STRPs have been identified and many more can now be easily developed from the human genomic sequence, the selection of a small subset of markers for the whole-genome scans is an important issue. STRPs differ greatly in quality. They vary widely in informativeness, amplification efficiency, and the ease with which the alleles can consistently be called (see also later). At Marshfield, we are currently putting the finishing touches on the tenth version of our whole-genome STRP screening set (see www.marshmed.org/genetics). Average marker heterozygosity in the Marshfield screening set is about 76%. The Marshfield set is composed primarily of tri- and tetranucleotide STRPs, with dinucleotide STRPs used only at positions along the genetic map where a high-quality tri- or tetranucleotide STRP could not yet be found. Other labs utilize screening sets based primarily or exclusively on dinucleotide repeat STRPs (see, e.g., Reed et al., 1994; www2.perkin-elmer.com/ab).

Marker spacing in the whole-genome screening sets is not uniform. An example of the marker spacing for chromosome 2 in our Marshfield Screening Set 10 is shown in Table 7.2. Because human linkage maps are based upon the typing of relatively few meioses in the CEPH families (Broman et al., 1998), the estimated map distances have quite limited precision. There is also growing evidence that recombination rates along chromosomes differ among individuals (Yu et al., 1996; Broman et al., 1998).
Therefore, although statistical geneticists often assume equal marker spacing in their simulations and theoretical work, in reality, screening set marker spacing will never be perfectly uniform and probably will always have a fair degree of uncertainty owing to individual differences in recombination patterns.
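To illustrate, the nonuniformity is easy to quantify from the sex-averaged map positions in Table 7.2 (positions transcribed from that table):

```python
# Sex-averaged map positions (cM) of the chromosome 2 markers in
# Marshfield Screening Set 10 (Table 7.2).
positions = [0, 10, 18, 28, 38, 48, 56, 64, 74, 87, 91, 103, 114, 125,
             133, 145, 152, 165, 173, 186, 200, 210, 216, 227, 237, 252, 265]

# Adjacent-marker spacings and their summary statistics.
spacings = [b - a for a, b in zip(positions, positions[1:])]
mean_spacing = sum(spacings) / len(spacings)
print(f"mean {mean_spacing:.1f} cM, min {min(spacings)} cM, max {max(spacings)} cM")
# → mean 10.2 cM, min 4 cM, max 15 cM
```

Even this carefully optimized set ranges from 4 to 15 cM between adjacent markers, which underlines the point that perfectly uniform spacing is an idealization.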
B. Genotyping quality

Clearly, polymorphism genotypes of relatively high quality are essential for successful completion of gene mapping projects. A summary of genotyping quality for large genotyping projects (> 700 samples) completed at Marshfield in 1998-1999 is shown in Table 7.3. Average genotyping completeness, after correction for samples that amplify poorly under our standard PCR conditions, was
Table 7.2. Marshfield Chromosome 2 Screening Set (from Set 10)

Locus      Marker        Heterozygosity   Map position (cM)^a   Marker spacing (cM)
TPO        SRA                0.64                0                     -
D2S1780    GATA72G11          0.71               10                    10
D2S2952    GATA11BB01         0.77               18                     8
D2S1400    GGAA20G10          0.67               28                    10
D2S1360    GATA11H10          0.82               38                    10
D2S405     GATA8F07           0.67               48                    10
D2S1788    GATA86E02          0.87               56                     8
D2S1356    ATA4F03            0.76               64                     8
D2S1352    ATA27D04           0.67               74                    10
D2S441     GATA8F03           0.74               87                    13
D2S1394    GATA69E12          0.71               91                     4
D2S1790    GATA88G05          0.78              103                    12
D2S2972    GATA176C01         0.73              114                    11
D2S410     GATA4E11           0.81              125                    11
D2S1328    GATA27A12          0.75              133                     8
D2S1334    GATA4D07           0.81              145                    12
D2S1399    GGAA20G04          0.82              152                     7
D2S1353    ATA27H09           0.81              165                    13
D2S1776    GATA71D01          0.76              173                     8
D2S1391    GATA65C03          0.74              186                    13
D2S1384    GATA52A04          0.76              200                    14
D2S2944    GATA30E06          0.79              210                    10
D2S434     GATA4G12           0.76              216                     6
D2S1363    GATA23D03          0.77              227                    11
D2S427     GATA12H10          0.70              237                    10
D2S2968    GATA178G09         0.61              252                    15
D2S2986    2QTEL47            0.68              265                    13

^a Based on sex-averaged map.
97.2%. Completeness is dependent upon the genotyping process, but it is also highly dependent upon the quality of the DNA samples. It is unfortunate but true that many groups involved in gene mapping projects take great care in phenotyping and analysis but skimp on the issues of DNA extraction and handling. This is a major mistake because projects cannot be successful without high-quality, accurately labeled DNA. PCR will not be effective unless the DNA is pure and at the correct concentration in the correct solute. Analysis will be substantially weakened if significant numbers of DNA samples are mislabeled. Substantial DNA quality problems are encountered with roughly 20% of the projects undertaken by the Mammalian Genotyping Service.
Table 7.3. Marshfield Genotyping Quality

Project   Completion   Number of     Average genotyping   Estimated genotyping   Average marker
          date         DNA samples   completeness (%)^a   error rate (%)^b       heterozygosity (%)^c
1         7/17/98        1,049            98.5                  0.4                    76
2         9/18/98          893            97.1                  0.7                    77
3         10/2/98          841            96.5                  1.0                    74
4         11/11/98         780            96.9                  0.7                    79
5         12/14/98         705            96.6                  1.0                    77
6         3/5/99           734            97.0                  0.6                    77
7         4/23/99          728            97.1                  0.5                    74
8         6/18/99          833            98.1                  0.5                    76
9         7/22/99        1,068            97.3                  0.6                    77

^a Completeness was calculated after excluding all samples that had amplified especially poorly under standard PCR conditions (< 75% complete).
^b Error rates were determined by blind, duplicate, or triplicate genotyping of CEPH family individuals on different gels.
^c Heterozygosity calculations excluded sex chromosome polymorphisms.
Genotyping error rate at Marshfield has averaged about 0.7% (Table 7.3). Note that this is the genotype and not the allele error rate. Since one of the two alleles is correct for most incorrect genotypes, the allele error rate is approximately 60% of the genotyping error rate. Genotyping accuracy is monitored by blindly typing CEPH family DNA samples in duplicate or triplicate along with the remainder of the DNA samples. Family and individual numbering schemes for these control samples are disguised to match those of the remaining samples. The duplicated or triplicated CEPH family DNA samples are loaded on different gels, as opposed to loading in adjacent lanes of the same gel, so that error rates determined using these CEPH family samples are near the worst-case scenario. Marshfield genotyping error rates have been confirmed by collaborating labs that send their own blinded, duplicate DNA samples. Genotyping accuracy is substantially improved when family structure is used as a final check on the allele calls. Under ideal conditions, such as the CEPH families with large sibships, genotyping accuracy improves to about 99.8%. Accuracy in this case refers to the consistency of allele calling within a single family. This is of course perfectly acceptable for linkage analysis, but consistency across families, gels, and time is required for association studies. Consistency requires the use of standard DNA with known screening set marker genotypes. At Marshfield, for example, amplified DNA from two of the CEPH family parents (133101 and 133102) is loaded about six times on each 200-lane gel.
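The quoted relation between genotype and allele error rates follows from simple accounting over the two alleles of each genotype. In the sketch below, the 80% single-allele fraction is an assumed figure chosen to reproduce the ~60% ratio stated in the text; it is not a number reported by the authors:

```python
def allele_error_rate(genotype_error, frac_one_allele_wrong=0.8):
    # Each genotype carries two alleles. An erroneous genotype has either
    # one or both alleles miscalled; frac_one_allele_wrong is the assumed
    # share of erroneous genotypes with only one bad allele.
    wrong_alleles = frac_one_allele_wrong * 1 + (1 - frac_one_allele_wrong) * 2
    return genotype_error * wrong_alleles / 2

# With a 0.7% genotype error rate, the allele error rate comes out to
# 60% of that, i.e., roughly 0.4%.
print(f"{allele_error_rate(0.007):.4f}")  # → 0.0042
```

Under this assumption the allele error rate is always 0.6 times the genotype error rate, matching the "approximately 60%" figure above.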
Genotyping accuracy is also dependent upon specific laboratory processes. We have found, for example, that accuracy drops for the two or three lanes at the very edges of the gels, where there is often substantial skewing of fragment mobility compared with the interior portions of the gels. Error rates for the outer lanes are typically two to three times those for interior lanes. Also, we have determined that there is substantial difference in accuracy among different classes of STRPs. Dinucleotide and noninteger (see later) STRPs have higher error rates (≤ 2%) than tri- and tetranucleotide STRPs. This is the reason for the emphasis on tri- and tetranucleotide markers in the Marshfield screening sets. Finally, it is important to note that the foregoing discussion applies to genotyping as carried out specifically at Marshfield. Genotyping centers using different processes, different markers, and different equipment will likely show at least modest variation in quality from the Marshfield results.
C. Genotyping cost

As shown in Table 7.1, STRP genotyping costs at Marshfield have dropped dramatically over the last few years. Current costs are about $150 per 400-marker whole-genome scan, or $0.38 per genotype (one STRP typed on one DNA sample). Superior markers, more experienced personnel, and economies of scale have all played important roles in the cost reductions, but the greatest factor has been improvements in technology. Dedicated genotyping instruments have been designed and built, especially including high-capacity water bath thermal cyclers and multidye fluorescence-based scanning electrophoretic instruments. Our largest thermal cycler has a capacity of 600 microtiter plates per day. Our scanning fluorescence detectors (SCAFUDs) utilize 200-lane gels, and nearly all gels are used for four separate runs. SCAFUD throughput is currently over 16,000 genotypes per day. Sophisticated software packages have been generated for allele calling, for genotype checking, and for data storage and management. Laboratory process improvements include amplification of three to six markers simultaneously and the introduction of robotics for semiautomated sample handling.

Table 7.4 breaks down the genotyping costs at Marshfield by the steps in the genotyping process. Administration costs include the handling and managing of the DNA samples. These costs are unlikely to change greatly regardless of the type of marker or the approach used for genotyping. The PCR amplification step of the operation consumes most of the laboratory supplies for genotyping. Plastic microtiter plates, thermostable DNA polymerase, and fluorescent dye-labeled PCR primers currently comprise the great majority of the supply costs. The electrophoresis step is often cited as a drawback of utilizing STRPs. The costs of running the gels are not as high as often imagined, however: we utilize 200-lane gels and three marker dyes per gel run (and in the future more
Weber and Broman

Table 7.4. Marshfield 1998 Genotyping Costs by Operation

Operation          Cost (%)
Administration     14
Amplification      32
Electrophoresis    25
Scoring            29
than four), and we reuse each of the gels four times. The scoring step in the operation involves the greatest amount of labor because genotypes called by the computer must be manually checked. Overall, STRP genotyping remains a labor-intensive process, with about half the total cost devoted to salaries and fringe benefits. Labor costs could potentially be reduced substantially by conversion to a genotyping system in which allele calling is completely automated. Low genotyping costs are dependent upon use of optimized markers in the whole-genome scans. Use of strongly amplifying and easily scored polymorphisms improves genotyping efficiency as well as quality. Substantial efficiencies are gained through purchase (or synthesis) of large quantities of fluorescent dye-labeled PCR primers and through the establishment of combinations of markers that amplify well together. These efficiencies are possible only with screening set markers, which are used in many different genome scans. The cost of typing non-screening-set markers, as in fine-mapping in a specific chromosome region to confirm and/or extend initial linkage mapping results, is roughly twice the cost of typing standard screening set markers. These factors have substantial implications for two-stage linkage mapping strategies in which low-density whole-genome scans are followed by fine-mapping by means of nonoptimized markers. Genotyping quality is tightly connected to genotyping cost. By altering the genotyping process, as in the extreme example of typing each marker in duplicate, genotyping accuracy could be improved substantially. However, this improvement would be accompanied by significantly increased costs. Conversely, if quality were relaxed, then genotyping costs could be reduced. Automated STRP allele calling at Marshfield is currently about 94% accurate. Tedious and expensive manual editing of the genotypes is required to bring the error rate down below 1%. 
Through changes and improvements in the software and/or modified laboratory processes, it may, at least for some markers, be possible to get the automated genotyping accuracy up to 99%. Throughput in whole-genome scans is becoming large enough to permit researchers to contemplate genotyping entire human populations. Decode Genetics, for example, has plans to complete genome scans on essentially all
7. Genotyping for Human Whole-Genome Scans
residents of Iceland (www.decode.is). At about $150 per 400-marker whole-genome scan, genotyping costs are becoming a small fraction of the total cost of a linkage mapping project. Except for phenotypes such as height and weight, which are unusually inexpensive to obtain, the costs of contacting, visiting, and phenotyping family members and of analyzing the genotype and phenotype data usually greatly exceed the costs of genotyping. The possible scales of gene mapping projects are therefore largely limited by the phenotyping and analysis costs. This conclusion does not, of course, apply to whole-genome association studies, in which marker densities will generally be much greater than 400 per genome.
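The quality-cost tradeoff described above, typing each marker in duplicate, can be made concrete with a back-of-the-envelope calculation. This is a simplified sketch of our own: it assumes the two replicate typings err independently and that all discordant pairs are caught and correctly resolved, assumptions that real laboratory errors, which are often systematic, would violate.

```python
def duplicate_typing(error_rate, cost_per_genotype):
    """Model typing every marker in duplicate: discordant replicate pairs
    are flagged for manual resolution, so an error slips through only when
    both replicates are wrong in the same way. As a rough upper bound we
    take error_rate squared; the per-genotype cost roughly doubles."""
    residual_error = error_rate ** 2
    return residual_error, 2 * cost_per_genotype

# Hypothetical inputs: a 1% single-pass error rate at $0.38 per genotype.
residual, cost = duplicate_typing(0.01, 0.38)
print(f"residual error ~{residual:.6f}, cost per genotype = ${cost:.2f}")
```

Under these (optimistic) independence assumptions, duplicate typing pushes the residual error down by two orders of magnitude while doubling cost, which matches the direction of the tradeoff described in the text.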
D. Genotyping limitations

Several difficulties with STRP genotyping affect the quality of the genotyping data and considerations for future progress in whole-genome scans. These include PCR artifacts such as strand slippage and weak/null alleles, as well as the practice of using gel electrophoretic mobility to approximate true allele sequence. The problem of weak/null alleles also generally applies to typing of diallelic polymorphisms. Other limitations, including some not currently recognized, will undoubtedly plague any typing system for any class of polymorphisms.

Strand slippage (also called stuttering), an artifact seen in PCR with short tandem repeats, results in skipping of repeats during amplification and production of DNA fragments smaller in size than the original genomic fragment (see Figure 7.1). Strand slippage is highly dependent upon the repeat length. For mononucleotide repeats, strand slippage is so severe that despite the great abundance of these sequences in the human genome, they are only rarely used as polymorphic markers. For dinucleotides, strand slippage is manageable, and these markers can be scored accurately. However, in our many years of experience we have found that dinucleotide repeats are more difficult to score accurately than markers with higher repeat lengths. For trinucleotide and higher repeat lengths, strand slippage is minimal and is rarely a factor in genotyping. Despite considerable effort, no one has been able to devise a solution for strand slippage during PCR.

Weak or null alleles may occur in PCR when a second polymorphism occurs within one (or conceivably both) of the PCR primer annealing sites (see, e.g., Callen et al., 1993). If the primer/template mismatch occurs near the 5' end of the PCR primer, the effect may be only modest, and the intensity of an allele with the mismatch may just be relatively weak compared to the other alleles.
However, when the mismatch occurs near the 3' end of the primer, PCR can be disrupted entirely and only one of the two alleles may be amplified, resulting in the scoring of the individual as a pseudo-homozygote. Whether a specific allele
Figure 7.1. Electrophoretic profiles of amplified DNA from four unrelated individuals for each of three STRPs with mono-, di-, and trinucleotide repeats. Note that the relative amount of strand slippage decreases dramatically as the repeat length increases.
will be weak or null depends strongly on the PCR conditions. Detection of null alleles usually requires analysis of families rather than unrelated individuals. In nearly all cases, it should be possible to avoid weak/null alleles by shifting the offending PCR primer. Markers with frequent weak/null alleles are usually excluded from standard screening sets. Uncommonly large size differences between alleles may result in a relatively weak amplification for the longest alleles, but this effect is usually modest in comparison to that of mismatches between PCR primer and template.

Sizing of PCR products on denaturing acrylamide gels also limits STRP genotyping. STRP allele calling is nearly always based on the mobility of the amplified DNA fragment on the gels. Generally, gel mobility is a reasonably good indicator of the length of the PCR product and, therefore, of the numbers of tandem repeats within the allele. However, a number of situations exist in which mobility only approximates the true allele sequence (see, e.g., Primmer and Ellegren, 1998; Bergström et al., 1999). This phenomenon is sometimes called homoplasy. Figure 7.2 shows several hypothetical examples of different alleles that will have indistinguishable mobilities on the gels and, therefore, will
Figure 7.2. Sequences of hypothetical alleles, all with exactly the same nucleotide length and with likely indistinguishable mobilities on denaturing polyacrylamide gels. Note that in the last example the length difference is outside of the repeats.
all be assigned the same allele size. Note that in addition to imperfections in the array of tandem repeats, or even the presence of two or more different types of repeats, the insertion/deletion can lie outside the tandemly repeated region (Grimaldi and Crouau-Roy, 1997; Colson and Goldstein, 1999). For linkage studies in relatively small living families, homoplasy is unlikely to be a major factor. However, in association studies, homoplasy can be a major confounder, inasmuch as two alleles with exactly the same size and gel mobility can have very different ancestral histories.

Incomplete electrophoretic resolution of PCR products differing in length for various reasons (e.g., the use of short gels; long PCR products) can also limit STRP genotyping. For many STRPs, alleles differ in size only by integer multiples of the repeat length. So, for example, a tetranucleotide STRP might have alleles ranging from 8 to 14 full tandem repeats. However, at appreciable frequency, STRPs will also exhibit alleles that differ in size by other than integer multiples of the repeat lengths (Brinkmann et al., 1998). These "noninteger" alleles can be difficult to score consistently because they often differ in size from other alleles by only a single nucleotide. Noninteger alleles are known for di-, tri-, and tetranucleotide repeat markers but are most commonly recognized for the tetranucleotide repeat markers. The fraction of STRPs that have common noninteger alleles is uncertain. For the great majority of STRPs, only one or a few alleles have been sequenced. As much as possible, we have excluded STRPs with common noninteger alleles from our screening sets. However, noninteger alleles still arise unexpectedly for human populations that have not been typed.

In summary, an ideal STRP for genomic screening would have the following properties.
It would amplify strongly with little strand slippage and would produce sharp (as opposed to diffuse and fuzzy) bands upon gel electrophoresis. It would be highly informative and would produce few if any noninteger alleles. Accurate scoring by computers, without manual checking, would be possible. If STRP genotyping continues to be cost-competitive, then, eventually, each marker within a genome screen will have many alleles sequenced and will become at least reasonably close to this ideal.

In addition to genotyping limitations, sample labeling errors and pedigree structural errors detract substantially from the quality of gene mapping. In the Mammalian Genotyping Service, the rate of such pedigree structure or gender errors has ranged from near 0% in several projects, to an average of about 3% per project, to a high of 12% for one large project. In many cases, therefore, family structural errors substantially exceed genotyping errors. Although problematic individuals and families usually can be identified and excluded from subsequent analysis (see later), these problems still significantly reduce power. It cannot be emphasized strongly enough that for successful completion of a long-term, expensive gene mapping project, careful recording of family structure and careful handling and labeling of the DNA samples are absolutely vital.
E. Error detection and effects

Prior to the analysis of data from a genome scan, pedigree and genotyping errors must be identified and resolved. Pedigree errors include sample mislabeling, errors in the entry of pedigree information into a computer database, nonpaternities, and unreported adoptions or twinning. Genotyping errors occur when observed genotypes do not correspond to the true underlying genetic information, as a result of a mistake in data entry or the misinterpretation of a pattern on a gel. A mutation at a marker locus may mimic a genotyping error; it is just as important to identify and resolve mutations as errors.

The effects of genotyping errors on the estimation of genetic maps have been well characterized (Buetow, 1991; Lincoln and Lander, 1992). Errors generally introduce apparent recombination events and thus lead to an expansion in the estimated maps. When the markers are more tightly spaced, the relative effect of errors greatly increases. The effects of genotyping errors on the power to detect disease susceptibility genes in linkage studies are not well understood. It is clear that power will generally decrease, though the extent of the effect has not been quantified. Analytic methods that use multipoint marker information will be more greatly affected by genotyping errors than methods that use data on a single marker at a time. The effects of errors in pedigree information are also not well understood. A study on the effect of ignoring the relationship between two parents when they are first cousins (Merette and Ott, 1996) showed that the effect of such errors may be considerable. More work clearly needs to be done to evaluate the effects of errors in genotypes and pedigree information (see, e.g., Rao, 1998).
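The map-expansion effect can be seen in a small simulation (an illustrative sketch of our own, not the method of the cited papers): random genotyping errors at two linked markers create apparent recombinants, inflating the estimated recombination fraction.

```python
import random

def apparent_recomb_fraction(true_theta, error_rate, n_meioses=100_000, seed=1):
    """Simulate phase-known meioses at two linked markers and count
    apparent recombinants; each transmitted allele is independently
    miscalled with probability error_rate."""
    rng = random.Random(seed)
    recombinants = 0
    for _ in range(n_meioses):
        a = rng.random() < 0.5                           # allele at marker 1
        b = (not a) if rng.random() < true_theta else a  # recombine w.p. theta
        if rng.random() < error_rate:                    # genotyping error, marker 1
            a = not a
        if rng.random() < error_rate:                    # genotyping error, marker 2
            b = not b
        recombinants += (a != b)
    return recombinants / n_meioses

print(apparent_recomb_fraction(0.05, 0.00))  # near the true value of 0.05
print(apparent_recomb_fraction(0.05, 0.01))  # inflated by 1% errors: map expansion
```

The inflation grows relative to the true recombination fraction as markers are placed closer together, which is why, as noted above, the relative effect of errors increases for tightly spaced maps.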
While the detection and resolution of genotyping and pedigree errors are not yet completely automated, several computer programs are available to assist in the identification of errors. While the process is often tedious, if it is performed carefully, and preferably with the involvement of both the lab generating the data and the individuals ultimately responsible for data analysis, the resulting, more refined data set will have maximal power to detect disease genes.

The process of cleaning genotype data properly begins with the identification of pedigree errors. In many cases pedigree problems may be easily seen in checks for Mendelian inheritance. When parental data are missing, a more sophisticated approach may be necessary. Several computer programs are now available for verifying all pairwise relationships in a study (e.g., Boehnke and Cox, 1997; Göring and Ott, 1997; Broman and Weber, 1998). The relationship of each pair of individuals in a study is inferred from their entire set of genotype data and then compared with the reported relationship. In most cases it is sufficient to consider the five relationships monozygotic twins, parent-offspring, full sibs, half sibs, and other relationship or unrelated. One advantage to this type of approach is that the correct pedigree structure is often made clear, whereas when pedigree errors are observed by looking for large numbers of Mendelian inconsistencies, it can be tricky to determine what change in the pedigree structure will eliminate the problem. The approach of Boehnke and Cox (1997), implemented in the computer program RELPAIR, is especially valuable because it takes account of the known linkage relationship between the genetic markers and yet is very fast. Such a program should be used on the raw genetic data, prior to the resolution of any apparent genotyping errors, since apparently erroneous genotypes may provide important information about the relationship between individuals.
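The flavor of such pairwise checks can be illustrated with a toy identity-by-state (IBS) tally. This is a crude heuristic only; RELPAIR and the other cited programs use full likelihood calculations incorporating allele frequencies and the marker map. The genotypes below are invented for illustration.

```python
from collections import Counter

def ibs_count(g1, g2):
    """Number of alleles two unordered genotypes share identical by state
    (0, 1, or 2)."""
    if sorted(g1) == sorted(g2):
        return 2
    return 1 if set(g1) & set(g2) else 0

def ibs_profile(genos1, genos2):
    """Tally IBS sharing across markers for one pair of individuals.
    A true parent-offspring pair should essentially never show IBS 0
    (barring genotyping error or mutation), whereas unrelated individuals
    often will at multiallelic STRPs."""
    return Counter(ibs_count(a, b) for a, b in zip(genos1, genos2))

parent = [(1, 2), (3, 3), (2, 4), (5, 6)]
child = [(2, 5), (3, 4), (4, 4), (6, 6)]
print(ibs_profile(parent, child))  # no marker with IBS 0
```

Comparing such observed sharing profiles against the expectations for each candidate relationship (monozygotic twins, parent-offspring, full sibs, half sibs, unrelated) is what allows the reported pedigree structure to be confirmed or corrected.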
The detection of genotyping errors begins with the identification of genotypes that are inconsistent with Mendel's rules. One then determines the individual or individuals responsible for the problem. Generally one seeks the most parsimonious explanation for the problem, finding the fewest genotypes that must be removed to eliminate the inconsistency. Allele frequency information may be used to obtain probabilistic statements on which genotype is most likely in error. Several computer programs have been written that assist in the process of identifying and resolving Mendelian inconsistencies in genotype data. The programs PEDCHECK (O'Connell and Weeks, 1998) and a module in the Mendel package (Stringham and Boehnke, 1996) are especially good examples. Ideally, one would go beyond such one-marker-at-a-time checks for genotyping errors, looking further for unlikely multiple recombination events that may indicate the presence of genotyping errors (Broman et al., 1998). Unfortunately, the typical density at which most genome scans are performed
makes this a largely useless effort, since a double recombinant within 30 or 40 cM cannot be immediately ascribed to a genotyping error. In the future, if the use of diallelic markers becomes common, efforts to detect genotyping errors by looking for unlikely multiple recombination events will become much more important: less polymorphic markers have fewer possible genotypes, so erroneous genotypes will be more likely to conform to Mendel's rules. For example, if both parents are heterozygous at a diallelic marker, their children may have any of the three possible genotypes, and so checks for Mendelian inheritance may fail to disclose any errors.
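A minimal sketch of the one-marker-at-a-time Mendelian check (not the PEDCHECK algorithm itself) also makes the diallelic blind spot explicit; the genotypes are hypothetical.

```python
def mendelian_consistent(child, mother, father):
    """True if the child's unordered genotype can be formed by drawing one
    allele from the mother and one from the father."""
    c1, c2 = child
    return (c1 in mother and c2 in father) or (c2 in mother and c1 in father)

# A multiallelic STRP exposes the error: allele 7 occurs in neither parent.
assert not mendelian_consistent((3, 7), mother=(3, 5), father=(4, 6))

# The diallelic blind spot: with both parents heterozygous, every possible
# child genotype is Mendelian-consistent, so an erroneous call passes.
for child in [(1, 1), (1, 2), (2, 2)]:
    assert mendelian_consistent(child, mother=(1, 2), father=(1, 2))
print("all diallelic child genotypes pass the check")
```

With many alleles per marker, an erroneous genotype usually falls outside the set of parentally transmissible genotypes and is caught; with two alleles it frequently does not, which is why recombination-based checks become more important for diallelic scans.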
IV. GENOTYPING FUTURE

Human whole-genome polymorphism scans have important clinical as well as research applications. These scans can be used to detect chromosomal aneuploidies and segmental aneusomies (Gusella, 1986; Rosenberg et al., 2000). They can be used to propagate genetic information, such as the presence of a mutant gene, through kindreds (Weber, 1994). They can be used to confirm putative biological relationships in patient families, and, if marker densities become high enough, they can be used to identify autozygous regions in individual patients (Broman and Weber, 1999) and to suggest, from the presence of specific haplotypes, which mutations an individual is likely to carry. In the long run, it may well turn out that clinical needs for the scans will outweigh needs for research applications. Perhaps someday the whole issue of carrying out whole-genome polymorphism scans for research purposes will become irrelevant, because these scans will have been routinely carried out on essentially all individuals for clinical purposes. Nevertheless, since this volume is devoted to the mapping of genes, our discussion focuses on this particular application.

For gene mapping, the two primary factors to consider for future genome scans are marker type and marker density. Several different types of polymorphisms with different properties and typing methodologies are potentially available. Scans with a broad range of average marker densities are similarly conceivable. Genotyping costs, of course, will have important bearing on both factors.
A. Marker type

Currently, the only types of polymorphisms that can be practically considered for whole-genome scans are diallelic base substitution or short insertion/deletion polymorphisms and multiallelic STRPs. Other types of polymorphisms, such as minisatellites, and complex chromosomal rearrangements, such as large duplications and inversions, are not sufficiently abundant or amenable to automation
to be used in genome scans. Informative STRPs (excluding mononucleotide repeats) occur on average roughly every 20 kb in the human genome (unpublished results). Together, diallelic base substitution and short insertion/deletion polymorphisms occur with reasonable informativeness about once every 1.0 kb (Cargill et al., 1999; Halushka et al., 1999). Base substitution polymorphisms appear to be approximately 10 times more abundant than the insertion/deletion polymorphisms (Kwok et al., 1996; Wang et al., 1998).

STRP genotyping costs have dropped substantially (Table 7.1) and are likely to continue decreasing in coming years. PCR reaction volumes are being reduced to minimize supply expenses. The number of fluorescent dyes that can be simultaneously detected in gel electrophoresis is steadily increasing. Capillaries and/or thinner slab gels may speed the time required for electrophoresis. Software for allele calling is steadily being improved, as is the quality of markers within the screening sets. Automated loading of electrophoresis platforms is being introduced into the process. All these and other improvements indicate that the genotype cost for STRPs will continue to fall. Currently these costs are at $0.40 per genotype. It is likely that over the next 3-5 years the cost can be brought down to about $0.25 per genotype. Nevertheless, there is no technology on the horizon that would permit the cost of STRP genotyping to decrease to a penny or less per genotype. These very low costs may, however, be achievable with diallelic polymorphisms. If, for gene mapping, the 1980s was the decade of RFLPs and the 1990s the decade of STRPs, then perhaps the first decade of the twenty-first century may belong to diallelic polymorphisms. Many have speculated that because these markers can be analyzed without gel electrophoresis, typing costs will be dramatically reduced.
Although no one has yet achieved this feat, many groups in both the public and private sectors are devoting substantial resources to a wide spectrum of potentially promising approaches for typing diallelic polymorphisms. A number of promising closed-tube systems have recently been developed involving the TaqMan assay (Livak et al., 1995), molecular beacons (Tyagi et al., 1998), the Invader assay (Lyamichev et al., 1999), and thermal denaturation (Germer and Higuchi, 1999). These systems all have the attractive feature that samples are not handled after initial reaction setup. Approaches involving fluorescent microspheres (Fulton et al., 1997; Michael et al., 1998) and mass spectrometry (Ross et al., 1998; Griffin et al., 1999) have the potential to facilitate analysis of many polymorphisms simultaneously. Methods utilizing hybridization to dense microarrays of oligo probes or PCR products take advantage of the exceptional potential of miniaturization (Elango et al., 1996; Wang et al., 1998). In addition to these newer and more exotic approaches, a large group of effective though relatively expensive analysis methods currently exist (reviewed by Landegren et al., 1998). These include direct sequencing of PCR products, restriction enzyme digestion of PCR products, single-stranded conformation polymorphism (SSCP) analysis, high-performance liquid chromatography, and allele-specific single-base polymerase extension of synthetic oligos.

That so many approaches are being explored for diallelic polymorphism analysis, and so many dollars are being injected into this research, bodes well for the future. However, it is far too early to pick a winner or even a leader among the competing technologies. In addition, the big lead in efficiency enjoyed by STRP genotyping should not be discounted. Our many years of experience with genome scans have taught us that it is important to optimize each marker within screening sets. The cost required to optimize thousands of diallelic polymorphisms, even in highly efficient systems, will be large. Despite the long-term promise, arrival of efficient diallelic polymorphism systems for whole-genome scans may be farther away than most anticipate.

A number of groups have considered the question of how many lower informativeness diallelic polymorphisms will be required to match the information content of an average multiallelic STRP. Estimates have ranged from two to over five (Nickerson et al., 1992; Kruglyak, 1997; Chapman and Wijsman, 1998). Answering this question requires consideration of many factors, including the average assumed informativeness of the markers, the type of family structure used for gene mapping, sample size, and genotyping error rates. STRPs show relatively little variation in informativeness among different populations. For example, in large studies carried out at Marshfield, average informativeness for various populations ranged from a low of 73% for a group of Native Americans to 80% for a collection of African Americans. Diallelics, in contrast, show much more population-specific variation in frequency (see, e.g., Gelernter et al., 1999). Nearly all diallelics will be uninformative in a fraction of human populations.
More theoretical work and also real gene mapping tests will be required to rigorously determine the relative information content of diallelics versus STRPs. However, a working estimate at this time is that three informative diallelics are required to match each informative STRP.

Besides informativeness and typing methods, STRPs and diallelic markers differ substantially in mutation rate. Most STRPs have mutation rates in the range of 10^-3 to 10^-5 per gamete per generation (Weber and Wong, 1993; Brinkmann et al., 1998). In contrast, diallelic polymorphisms have much lower mutation rates, on the order of 10^-7 to 10^-9 per gamete per generation (Vogel and Motulsky, 1997). Mutation has little impact on linkage analysis when families with living members are tested, but the approximately 10,000-fold difference in mutation rates between STRPs and diallelics likely will have substantial impact on detection of linkage disequilibrium. Although disequilibrium can often be readily detected between closely linked STRPs and between STRPs and diallelic polymorphisms (Huttley et al., 1999; McPeek and Strahs, 1999),
shared haplotypes older than a few hundred years are usually affected by STRP mutation (Hästbacka et al., 1992). Because STRPs mutate most often by the gain or loss of a single repeat, it may be possible to correct for mutation by considering windows of STRP alleles that differ by a single repeat (McPeek and Strahs, 1999). Nevertheless, the relatively high rate of STRP mutation will introduce complexity into disequilibrium analysis.
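The informativeness gap between the two marker classes can be quantified with expected heterozygosity, H = 1 - Σp_i², a standard proxy for marker informativeness. The allele frequencies below are hypothetical, chosen only for illustration.

```python
def heterozygosity(freqs):
    """Expected heterozygosity under Hardy-Weinberg proportions: the
    probability that a random individual carries two different alleles."""
    assert abs(sum(freqs) - 1.0) < 1e-9  # frequencies must sum to 1
    return 1.0 - sum(p * p for p in freqs)

strp = heterozygosity([0.3, 0.25, 0.2, 0.15, 0.1])  # 5-allele STRP (hypothetical)
snp = heterozygosity([0.7, 0.3])                    # diallelic marker (hypothetical)
print(f"STRP H = {strp:.3f}, diallelic H = {snp:.3f}")
# A diallelic marker is capped at H = 0.5 (reached at frequencies 0.5/0.5),
# one reason several diallelics are needed to match one informative STRP.
```

The 0.5 ceiling for diallelics, together with their population-specific frequency variation noted above, is consistent with the working estimate that roughly three informative diallelics are needed per informative STRP.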
B. Marker density

For linkage analysis in families with living members, both for monogenic and more complex disorders, a 5- to 10-cM STRP scan appears satisfactory as a first step. Since gaps between markers in a 10-cM scan (see Table 7.2) can be fairly large, and an uninformative marker in such a gap can substantially decrease coverage of that portion of the genome, increasing STRP density to an average of one marker per 5 cM would be reasonable. However, for many other research and virtually all clinical applications of whole-genome scans, a much higher marker density would be preferred. The high-density polymorphism scan is analogous to a new, more powerful telescope in astronomy. It will permit us to study genetic phenomena, such as autozygosity, that were previously invisible (Broman and Weber, 1999). In terms of gene mapping, high-density scans are particularly important in the detection of association. Most human genetic variation was established before humans migrated out of Africa, roughly 100,000 years ago (Hacia et al., 1999). Disease alleles this old will tend to lie within quite short shared haplotypes in diverse, panmictic populations (Kruglyak, 1999). Detection of this ancient association may require marker densities up to 500,000 per genome, which is far beyond current technology. However, disease alleles that arose or were introduced into populations much more recently may be amenable to whole-genome association mapping. This approach has been very successfully applied to rare recessive diseases in isolated populations (Friedman et al., 1995; Peltonen and Uusitalo, 1997; Sheffield et al., 1998). Freimer, Sandkuijl, and colleagues have argued that densities as high as one marker every 3 cM may be sufficient for detection of association for complex disorders in isolated populations (Service et al., 1999).
Regardless of the validity of this last hypothesis, it is still clear that whole-genome polymorphism scans at marker densities of 1-3 cM (3000-1000 STRPs or equivalent) will find some useful application in gene mapping by association. These marker densities should be achievable within the next few years. It is also possible, of course, through additional effort and cost, to type higher densities of polymorphisms in selected chromosomal regions, for example, as a followup to initial linkage mapping for complex disorders.
V. CONCLUSIONS

Efficient, reasonably effective whole-genome 10-cM STRP scans are now available. Doubling or tripling STRP density to an average spacing of 3-5 cM is readily achievable. However, if typing costs for diallelic polymorphisms can be brought close to, or preferably below, one-third of those of STRPs, then diallelics may gradually supplement or supplant STRPs in whole-genome scans. The power of higher density genome scans for gene mapping by association and for many other research and clinical applications is great. It would be wise to continue investing heavily for many years in genotyping technology.
References

Bergström, T. F., Engkvist, H., Erlandsson, R., Josefsson, A., Mack, S. J., Erlich, H. A., and Gyllensten, U. (1999). Tracing the origin of HLA-DRB1 alleles by microsatellite polymorphism. Am. J. Hum. Genet. 64, 1709-1718.
Boehnke, M., and Cox, N. J. (1997). Accurate inference of relationships in sib-pair linkage studies. Am. J. Hum. Genet. 61, 423-429.
Botstein, D., White, R. L., Skolnick, M., and Davis, R. W. (1980). Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am. J. Hum. Genet. 32, 314-331.
Brinkmann, B., Klintschar, M., Neuhuber, F., Hühne, J., and Rolf, B. (1998). Mutation rate in human microsatellites: Influence of the structure and length of the tandem repeat. Am. J. Hum. Genet. 62, 1408-1415.
Broman, K. W., and Weber, J. L. (1998). Estimating pairwise relationships in the presence of genotyping errors. Am. J. Hum. Genet. 63, 1563-1564.
Broman, K. W., and Weber, J. L. (1999). Long homozygous chromosomal segments in the CEPH families. Am. J. Hum. Genet. 65, 1493-1500.
Broman, K. W., Murray, J. C., Sheffield, V. C., White, R. L., and Weber, J. L. (1998). Comprehensive human genetic maps: Individual and sex-specific variation in recombination. Am. J. Hum. Genet. 63, 861-869.
Buetow, K. H. (1991). Influence of aberrant observations on high-resolution linkage analysis outcomes. Am. J. Hum. Genet. 49, 985-994.
Callen, D. F., Thompson, A. D., Shen, Y., Phillips, H. A., Richards, R. I., Mulley, J. C., and Sutherland, G. R. (1993). Incidence and origin of "null" alleles in the (AC)n microsatellite markers. Am. J. Hum. Genet. 52, 922-927.
Cargill, M., Altshuler, D., Ireland, J., Sklar, P., Ardlie, K., Patil, N., Lane, C. R., Lim, E. P., Kalyanaraman, N., Nemesh, J., Ziaugra, L., Friedland, L., Rolfe, A., Warrington, J., Lipshutz, R., Daley, G. Q., and Lander, E. S. (1999). Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22, 231-238.
Chapman, N. H., and Wijsman, E. M. (1998). Genome screens using linkage disequilibrium tests: Optimal marker characteristics and feasibility. Am. J. Hum. Genet. 63, 1872-1885.
Colson, I., and Goldstein, D. B. (1999). Evidence for complex mutations at microsatellite loci in Drosophila. Genetics 152, 617-627.
Elango, R., Riba, L., Housman, D., and Hunter, K. (1996). Generation and mapping of Mus spretus strain-specific markers for rapid genomic scanning. Mamm. Genome 7, 340-343.
Friedman, T. B., Liang, Y., Weber, J. L., Hinnant, J. T., Barber, T. D., Winata, S., Arhya, I. N., and Asher, J. H., Jr. (1995). A gene for congenital, recessive deafness DFNB3 maps to the pericentromeric region of chromosome 17. Nat. Genet. 9, 86-91.
Fulton, R. J., McDade, R. L., Smith, P. L., Kienker, L. J., and Kettman, J. R., Jr. (1997). Advanced multiplexed analysis with the FlowMetrix system. Clin. Chem. 43, 1749-1756.
Gelernter, J., Cubells, J. F., Kidd, J. R., Pakstis, A. J., and Kidd, K. K. (1999). Population studies of polymorphisms of the serotonin transporter protein gene. Am. J. Med. Genet. 88, 61-66.
Germer, S., and Higuchi, R. (1999). Single-tube genotyping without oligonucleotide probes. Genome Res. 9, 72-78.
Göring, H. H. H., and Ott, J. (1997). Relationship estimation in affected sib pair analysis of late-onset diseases. Eur. J. Hum. Genet. 5, 69-77.
Griffin, T. J., Hall, J. G., Prudent, J. R., and Smith, L. M. (1999). Direct genetic analysis by matrix-assisted laser desorption/ionization mass spectrometry. Proc. Natl. Acad. Sci. USA 96, 6301-6306.
Grimaldi, M.-C., and Crouau-Roy, B. (1997). Microsatellite allelic homoplasy due to variable flanking sequences. J. Mol. Evol. 44, 336-340.
Gusella, J. F. (1986). DNA polymorphism and human disease. Annu. Rev. Biochem. 55, 831-854.
Hacia, J. G., Fan, J. B., Ryder, O., Jin, L., Edgemon, K., Ghandour, G., Mayer, R. A., Sun, B., Hsie, L., Robbins, C. M., Brody, L. C., Wang, D., Lander, E. S., Lipshutz, R., Fodor, S. P. A., and Collins, F. S. (1999). Determination of ancestral alleles for human single-nucleotide polymorphisms using high-density oligonucleotide arrays. Nat. Genet. 22, 164-167.
Halushka, M. K., Fan, J.-B., Bentley, K., Hsie, L., Shen, N., Weder, A., Cooper, R., Lipshutz, R., and Chakravarti, A. (1999). Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat. Genet. 22, 239-247.
Hästbacka, J., de la Chapelle, A., Kaitila, I., Sistonen, P., Weaver, A., and Lander, E. (1992). Linkage disequilibrium mapping in isolated founder populations: Diastrophic dysplasia in Finland. Nat. Genet. 2, 204-211.
Huttley, G. A., Smith, M. W., Carrington, M., and O'Brien, S. J. (1999). A scan for linkage disequilibrium across the human genome. Genetics 152, 1711-1722.
Kruglyak, L. (1997). The use of a genetic map of biallelic markers in linkage studies. Nat. Genet. 17, 21-24.
Kruglyak, L. (1999). Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22, 139-144.
Kwok, P.-Y., Deng, Q., Zakeri, H., Taylor, S. L., and Nickerson, D. A. (1996). Increasing the information content of STS-based genome maps: Identifying polymorphisms in mapped STSs. Genomics 31, 123-126.
Landegren, U., Nilsson, M., and Kwok, P.-Y. (1998). Reading bits of genetic information: Methods for single-nucleotide polymorphism analysis. Genome Res. 8, 769-776.
Lincoln, S. E., and Lander, E. S. (1992). Systematic detection of errors in genetic linkage data. Genomics 14, 604-610.
Livak, K. J., Marmaro, J., and Todd, J. A. (1995). Towards fully automated genome-wide polymorphism screening. Nat. Genet. 9, 341-342.
Lyamichev, V., Mast, A. L., Hall, J. G., Prudent, J. R., Kaiser, M. W., Takova, T., Kwiatkowski, R. W., Sander, T. J., de Arruda, M., Arco, D. A., Neri, B. P., and Brow, M. A. (1999). Polymorphism identification and quantitative detection of genomic DNA by invasive cleavage of oligonucleotide probes. Nat. Biotech. 17, 292-296.
McPeek, M. S., and Strahs, A. (1999). Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine-scale genetic mapping. Am. J. Hum. Genet. 65, 858-875.
Merette, C., and Ott, J. (1996). Estimating parental relationships in linkage analysis of recessive traits. Am. J. Med. Genet. 63, 386-391.
Michael, K. L., Taylor, L. C., Schultz, S. L., and Walt, D. R. (1998). Randomly ordered addressable high-density optical sensor arrays. Anal. Chem. 70, 1242-1248.
Nickerson, D. A., Whitehurst, C., Boysen, C., Charmley, P., Kaiser, R., and Hood, L. (1992). Identification of clusters of biallelic polymorphic sequence-tagged sites (pSTSs) that generate highly informative and automatable markers for genetic linkage mapping. Genomics 12, 377-387.
O'Connell, J. R., and Weeks, D. E. (1998). PedCheck: A program for identification of genotype incompatibilities in linkage analysis. Am. J. Hum. Genet. 63, 259-266.
Peltonen, L., and Uusitalo, A. (1997). Rare disease genes: Lessons and challenges. Genome Res. 7, 765-767.
Primmer, C. R., and Ellegren, H. (1998). Patterns of molecular evolution in avian microsatellites. Mol. Biol. Evol. 15, 997-1008.
Rao, D. C. (1998). CAT scans, PET scans, and genomic scans. Genet. Epidemiol. 15, 1-18.
Reed, P. W., Davies, J. L., Copeman, J. B., Bennett, S. T., Palmer, S. M., Pritchard, L. E., Gough, S. C., Kawaguchi, Y., Cordell, H. J., Balfour, K. M., Jenkins, S. C., Powell, E. E., Vignal, A., and Todd, J. A. (1994). Chromosome-specific microsatellite sets for fluorescence-based, semi-automated genome mapping. Nat. Genet. 7, 390-395.
Rosenberg, M. J., Vaske, D., Killoran, C. E., Ning, Y., Wargowski, D., Hudgins, L., Tifft, C. J., Meck, J., Blancato, J. K., Rosenbaum, K., Pauli, R. M., Weber, J., and Biesecker, L. G. (2000). The detection of chromosomal aberrations by a whole genome microsatellite screen. Am. J. Hum. Genet. 66, 419-427.
Ross, P., Hall, L., Smirnov, I., and Haff, L. (1998). High level multiplex genotyping by MALDI-TOF mass spectrometry. Nat. Biotech. 16, 1347-1351.
Service, S. K., Temple Lang, D. W., Freimer, N. B., and Sandkuijl, L. A. (1999). Linkage-disequilibrium mapping of disease genes by reconstruction of ancestral haplotypes in founder populations. Am. J. Hum. Genet. 64, 1728-1738.
Sheffield, V. C., Stone, E. M., and Carmi, R. (1998). Use of isolated inbred human populations for identification of disease genes. Trends Genet. 14, 391-396.
Stringham, H. M., and Boehnke, M. (1996).
Identifying marker typing incompatibilities in linkage analysis. Am. J. Hum. Germ. 59, 946-950. Tyagi, S., Bratu, D. P., and Kramer, E R. (1998). Multicolor molecular beacons for allele discrimination. Nar. Biotech. 16,49-53. Vogel, E, and Motulsky, A.G. (1997). “Human Genetics: Problems and Approaches,” 3rd ed. Springer-Verlag, Berlin. Wang, D. G., Fan, J.-B., Siao, C.*J., Bemo, A., Young, I’., Sapolsky, R., Ghandour, G., Perkins, N., Winchester, E., Spencer, J., Kruglyak, L., Stein, L, Hsie, L, Topaloglou, T., Hubbell, E., Robinson, E., Mittmann, M., Morris, M. S., Shen, N., Kilburn, D., Rioux, J., Nusbaum, C., Rozen, S., Hudson, T. J., Lipshutz, R., Chee, M., and Lander, E. S. (1998). Large-scale identification, map ping, and genotyping of single-nucleotide polymorphisms in the human genome. Science 280, 1077-1082. Weber, J. L. (1994). Know thy genome. Nut. Gener. 7,343-344. Weber, J. L., and May, P. M. (1989). Ab un d an t c 1ass of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am. I. Hum. Genet. 44,388-396. Weber, J. L., and Wong, C. (1993). Mutation of human short tandem repeats. Hum. Mol. Genet. 2, 1123-1128. Yu, J., Lazzeroni, L., Qin, J., Huang, M.-M., Navidi, W., Erlich, H., and Arnheim, N. (1996). Individual variation in recombination among human males. Am. J. Hum. Genet. 59, 1186-1192.
8

The Lod Score Method

John P. Rice,¹ Nancy L. Saccone, and Jonathan Corbett
Department of Psychiatry
Washington University School of Medicine
St. Louis, Missouri 63110
I. Summary
II. Introduction
III. The Generalized Single Major Locus Model
IV. The Lod Score
V. The Lod Score and Meta-Analysis
VI. The Probability of Detecting Linkage
VII. Model-Free Linkage Methods
VIII. Discussion
References
I. SUMMARY

The lod score method originated in a seminal article by Newton Morton in 1955. The method is broadly concerned with issues of power and the posterior probability of linkage, ensuring that a reported linkage has a high probability of being a true linkage. In addition, the method is sequential, so that pedigrees or lod curves may be combined from published reports to pool data for analysis. This approach has been remarkably successful for 50 years in identifying disease genes for Mendelian disorders. After discussing these issues, we consider the situation for complex disorders, where the maximum lod score (MLS) statistic shares some of the advantages of the traditional lod score approach but is limited by unknown power and the lack of sharing of the primary data needed to optimally combine analytic results. We may still learn from the lod score method as we explore new methods in molecular biology and genetic analysis to utilize the complete human DNA sequence and the cataloging of all human genes.

¹To whom correspondence should be addressed.

Advances in Genetics, Vol. 42
Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.
0065-2660/01 $35.00
II. INTRODUCTION

Advances in molecular biology have enabled the rapid mapping and identification of genes for single-locus disorders. These successes have relied on the lod score method for analysis. This statistic, introduced half a century ago, has proven to have several strengths in the analysis of genetic data. The lod score method, which utilizes this test statistic, originated in the classic 1955 paper of Newton Morton and is more broadly concerned with issues of power, false positive rates, and combining results from multiple studies.
III. THE GENERALIZED SINGLE MAJOR LOCUS MODEL

A. Description

In the single major locus (SML) model it is assumed that there is a single trait locus. The trait may be either continuous or polychotomous. We consider the case of a two-allele locus for a dichotomous phenotype. Denoting the two alleles by A and a, let p be the frequency of A and q = 1 − p be the frequency of a. Under the assumption of random mating, the probabilities for the three genotypes AA, Aa, aa are p², 2pq, q², respectively. For a continuous phenotype, it is typically assumed that the distribution within a genotype is normal, so that the parameters consist of the three genotypic means and a common variance. For a dichotomous trait, the penetrances f_AA, f_Aa, f_aa are defined as the probability that individuals of genotype AA, Aa, aa, respectively, are affected. Accordingly, the SML model can be described in terms of the four parameters p, f_AA, f_Aa, f_aa.

We further assume that all familial resemblance is due to the single locus. Thus, the phenotype of an individual depends only on his or her genotype. That is, if (X₁, . . . , X_m) denotes the phenotypes in a family of size m, and (g₁, . . . , g_m) denotes their genotypes, then P(X_i | g₁, . . . , g_m) = P(X_i | g_i) and P(X_i, X_j | g_i, g_j) = P(X_i | g_i) P(X_j | g_j) for individuals i and j, where P( ) denotes the probability of the event in parentheses.

For a marker M, let θ denote the recombination fraction between the trait locus and M. Then the genotype denotes the joint genotype of the trait locus and marker, so that the likelihood of the family depends on p, f_AA, f_Aa, f_aa, and θ. These assumptions enable the efficient computation of the likelihood of
large pedigrees using the Elston and Stewart (1971) algorithm as implemented in the LINKAGE package (Lathrop et al., 1984), and they have served as the "standard" analytic model for Mendelian traits.
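The SML assumptions lend themselves to direct computation. As a minimal sketch (the function names and the penetrance values below are illustrative, not from the chapter), the population prevalence implied by the four SML parameters follows from summing penetrance over the Hardy-Weinberg genotype frequencies:

```python
# Sketch of the generalized single major locus (SML) model: parameters are
# the allele frequency p and the penetrances f_AA, f_Aa, f_aa.

def genotype_freqs(p):
    """Hardy-Weinberg genotype probabilities for alleles A (freq p) and a."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

def prob_affected(p, f):
    """Population prevalence: sum over genotypes of frequency * penetrance."""
    g = genotype_freqs(p)
    return sum(g[gt] * f[gt] for gt in g)

# Illustrative dominant-like penetrances (values are hypothetical):
f = {"AA": 0.9, "Aa": 0.9, "aa": 0.01}
K = prob_affected(0.01, f)   # prevalence implied by p = 0.01
```

The same ingredients (genotype frequencies and penetrances), combined with transmission probabilities, are what the Elston-Stewart algorithm multiplies out over a pedigree.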
IV. THE LOD SCORE

A. Definition

The lod curve Z(θ) is defined by

Z(θ) = log₁₀ [L(θ) / L(½)],
where L denotes the likelihood. Note that we assume that the penetrances and gene frequency are known for the trait locus, so the only unknown is the recombination fraction θ. The maximum of Z(θ) is called the lod score and denoted Z. The likelihood ratio statistic is −2 times the difference between the log of the likelihood at θ = ½ and the log of the maximum likelihood value, where the logarithm is to the base e. Thus (2 log_e 10)Z = 4.6Z = χ₁², so that a lod score of, say, 3 is equivalent to χ² = 13.8 on one degree of freedom. Thus in a (one-sided) test of θ = ½, the corresponding p value is 0.0001.
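This lod-to-chi-square relationship is easy to verify numerically. A sketch (helper names are our own) converting a lod score to its one-degree-of-freedom chi-square equivalent and the corresponding one-sided p value:

```python
import math

def lod_to_chisq(Z):
    """Chi-square (1 df) equivalent of a lod score: 2*ln(10)*Z, about 4.6*Z."""
    return 2.0 * math.log(10.0) * Z

def chisq1_p_one_sided(x):
    """One-sided p value for a 1-df chi-square statistic x (test of theta = 1/2).
    Uses P(chi2_1 > x) = 2*(1 - Phi(sqrt(x))); the one-sided test halves this."""
    return 1.0 - 0.5 * (1.0 + math.erf(math.sqrt(x) / math.sqrt(2.0)))
```

For Z = 3 this reproduces the chapter's values: χ² ≈ 13.8 and a one-sided p of about 0.0001.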
B. Examples

For the sake of simplicity, assume that we are examining two loci, each of which has two alleles, and that we are able to distinguish heterozygotes from homozygotes of both types. Let us label the alleles at the first locus by A/a and the alleles at the second locus by B/b. Consider the case that the father of our nuclear family is heterozygous at both loci, whereas the mother is homozygous for the "lowercase" alleles (i.e., the father has genotype Aa Bb and the mother has genotype aa bb). Let us further assume that we know the phase of the father's genotype; for example, that we know he inherited both "uppercase" alleles from his father and both "lowercase" alleles from his mother. In this case we know the father's
haplotype is AB/ab and we can directly count the number of recombinant and nonrecombinant offspring. Assuming that we observe k recombinants in n matings, we can test the hypothesis of free recombination as a standard hypothesis test of the parameter of a binomial distribution. If we consider the simple hypothesis test
H₀: θ = ½,     (8.1)

H₁: θ = θ₁,     (8.2)

where 0 ≤ θ₁ < ½, then we define the lod score of the test as the (base 10) log of the likelihood ratio

LR = L(k/n | θ = θ₁) / L(k/n | θ = ½) = θ₁^k (1 − θ₁)^(n−k) / (½)^n.
We then see that in the particular case that k = 0, the maximum likelihood estimate for θ₁ is θ̂₁ = 0, and hence LR = 2^n, so that the lod score is log₁₀ LR = n log₁₀ 2. More specifically, in this case we would be able to reject the null hypothesis if n ≥ 10, since a lod score of 3 is traditionally accepted as sufficient to reject the hypothesis of free recombination and each observation adds about 0.3 to the observed lod score. We note that it is not necessary that these 10 observations all come from the same mating, only that they come from matings in which recombinants and nonrecombinants can be distinguished without ambiguity. While such matings are the most informative for the purposes of detecting linkage, we can also make use of many other types of matings in our attempts to detect or reject linkage.

Let us now consider a nuclear family with a pair of biallelic loci labeled as just shown. Assume that the father is doubly heterozygous with genotype (Aa Bb), whereas the mother is doubly homozygous with genotype (aa bb); unlike the foregoing case, however, we do not know the phase of the father's genotype. In other words, we do not know whether the father's haplotypes are (AB/ab) (I) or (Ab/aB) (II). While it is impossible to unambiguously decide whether any given offspring of this mating is a recombinant, we can use probabilistic methods to gain linkage information. We will assume, lacking any reason to believe otherwise, that the prior probabilities of (I) and (II) are equal.

We first observe that we gain no information for linkage in the case that this mating produces only one offspring. To see why this is so, let us consider the case of one offspring and assume that this offspring has genotype (Aa bb). Let R represent the event that the offspring is the result of recombination
during meiosis, and let N represent the event that the offspring is a nonrecombinant. Observe then that the event R coincides with the father's haplotype being of type (I), whereas event N coincides with the father's haplotype being of type (II). If we now let θ represent the true recombination fraction between the foregoing loci and y the offspring's observed genotype, we may conclude that the probability of the observed genotype in the offspring is given by

P(y) = P(y | I)P(I) + P(y | II)P(II) = θ/4 + (1 − θ)/4 = ¼.
It is straightforward to see that we would obtain the same result for each of the four possible genotypes that could arise from this mating. Specifically, we note that P(y) is not dependent on the true value of the recombination fraction. We cannot obtain a nonzero lod score for any of the possible genotypes of the offspring, and thus such a mating is not informative for linkage in the case that there is only one offspring.

However, in the case that this mating produces at least two offspring, we do gain information for linkage. Consider, for simplicity, the case in which this mating produces two offspring. Since we have two offspring, each of which has four possible genotypes, there are 16 possible combinations of genotypes. In this biallelic system, it turns out that out of these 16 possible combinations, there are really only two distinct states. Let us denote by y₁ the state that each of the offspring receives either both "uppercase" or both "lowercase" alleles from the father, as well as the state that each offspring receives one "uppercase" and one "lowercase" allele from the father. In other words, y₁ represents the case that both offspring are either recombinants or nonrecombinants. We denote by y₂ the set of eight possibilities not contained in case y₁, that is, the case that one offspring is a recombinant and the other is not. Note that in y₁ we do not know whether the offspring are recombinant or nonrecombinant, only that their recombination status is identical. We, following convention, state that, in this case, the offspring are concordant. Similarly, in y₂, we do not know which offspring is recombinant and which offspring is nonrecombinant, only that we have one of each. The offspring in this state are usually termed discordant.

Intuitively, we would believe that linkage between the loci under study is more likely under y₁ than under y₂. It turns out that our intuition is correct. Taking the same set of hypotheses used earlier, we calculate
f(y₁ | θ₀) = (1 − θ₀)² + θ₀²,     f(y₂ | θ₀) = 2θ₀(1 − θ₀),
whereas f(y₁ | θ = ½) = f(y₂ | θ = ½) = ½. Since (1 − θ₀)² + θ₀² > ½ provided that θ₀ ≠ ½, we see that we do obtain information on linkage in this mating scheme and, furthermore, that this information provides evidence for linkage in the case that y₁ holds and evidence against linkage in the case that y₂ holds. In the simple case of one mating with two offspring, in which these offspring are concordant and θ₀ = 0, we observe that the likelihood ratio is given by

LR = f(y₁ | θ₀) / f(y₁ | θ = ½) = 2.
Thus we obtain a lod score of log₁₀ LR = log₁₀ 2 = 0.30. We note that this is the same lod score as that obtained from one unambiguously observed nonrecombinant offspring, as we saw earlier.
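Both calculations in this section can be verified numerically. A sketch (function names are our own) of the phase-known lod for k recombinants among n fully informative meioses, and of the lod contributed by one phase-unknown mating with two concordant offspring:

```python
import math

def lod_phase_known(k, n, theta):
    """Lod for k recombinants among n phase-known meioses, versus theta = 1/2.
    The boundary cases k = 0 and k = n are handled by taking the limit terms."""
    l1 = (k * math.log10(theta) if k else 0.0) + \
         ((n - k) * math.log10(1.0 - theta) if n - k else 0.0)
    return l1 - n * math.log10(0.5)

def lod_concordant_pair(theta):
    """Lod from one phase-unknown mating with two concordant offspring:
    log10 of f(y1|theta) = (1-theta)^2 + theta^2 over the null value 1/2."""
    return math.log10((1.0 - theta) ** 2 + theta ** 2) - math.log10(0.5)
```

At θ = 0, ten nonrecombinant meioses give a lod of 10 log₁₀ 2 ≈ 3.01, and one concordant pair gives log₁₀ 2 ≈ 0.30, matching the text.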
V. THE LOD SCORE AND META-ANALYSIS

A. SML traits

Meta-analysis in the social sciences (Hedges and Olkin, 1985) is a technique used to integrate results from multiple studies. For SML traits, it is straightforward to combine data. If Z_i(θ) is the lod curve for family i, then the lod curve when N families are genotyped is

Z(θ) = Σ_{i=1}^{N} Z_i(θ).
Moreover, if multiple studies are used, the Z(θ) may be added from each study. Several studies in the 1950s reported linkage between elliptocytosis and the Rh system. Morton (1956) collected the published pedigrees and derived algebraic expressions for the lod curves to be combined in analysis. Moreover, heterogeneity was detected and explained conflicting reports of linkage. It became standard either to display the pedigrees with marker data or to publish lod curves in tabular form, to allow pooling of independent studies. It must be emphasized that it is not proper to add lod scores (the maxima of individual lod curves); rather, the lod curves must be added and then the maximum taken. For multipoint methods (i.e., analysis of more than one marker), a mapping function can be used to relate genetic distances to recombination fractions, so that the lod curve would be related to chromosomal location.
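The rule that lod curves, not lod scores, are added can be illustrated with a toy example (the curves and helper names below are illustrative, not data from the chapter):

```python
import math

def lod_curve(k, n):
    """Lod curve Z(theta) for k recombinants among n phase-known meioses."""
    def Z(theta):
        eps = 1e-12                       # avoid log(0) at the boundaries
        t = min(max(theta, eps), 1 - eps)
        return (k * math.log10(t) + (n - k) * math.log10(1 - t)
                - n * math.log10(0.5))
    return Z

def pooled_max_lod(curves, grid):
    """Add the curves pointwise over a theta grid, THEN take the maximum."""
    return max(sum(c(t) for c in curves) for t in grid)

grid = [i / 1000 for i in range(0, 501)]           # theta in [0, 0.5]
studies = [lod_curve(1, 10), lod_curve(2, 10)]     # two illustrative studies
pooled = pooled_max_lod(studies, grid)
```

Because the maximum of a sum is at most the sum of the maxima, adding the per-study maximum lod scores would overstate the pooled evidence whenever the studies peak at different θ values.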
As noted, the lod score is in fact a measure of statistical significance. Hedges and Olkin point out the fallacy of using outcomes of significance tests to interpret consistency and replication. These "vote counting" methods can give misleading conclusions. Significance or nonsignificance may reflect design or power differences. Even when two studies are significant, there may be heterogeneity: if two linkage studies give high lod scores at markers 50 cM apart, are they consistent? The genetic parameter of interest is x, the disease gene location, and the lod score is a test statistic. When a single marker is analyzed, θ has traditionally been used rather than x. The ultimate validation comes from the laboratory when the disease gene is identified.
B. Complex traits

For a complex trait, if a program such as LINKAGE is used and the assumptions of the SML model are incorrect, then the estimate of θ (or x) will be biased. This has led to the use of affected sibpair methods, as discussed in Section VII.A.

A major question for complex traits is how to combine results from multiple studies. The difficulties associated with combining results are well illustrated by analyses done by the Schizophrenia Linkage Collaborative Group for chromosomes 3, 6, and 8 (1996), in which data from 14 international research groups were assembled. Even when different studies reported significant lod scores at the same marker, they were computed using different diagnostic criteria (broad vs narrow); even when the same criteria were analyzed, reports used different analytic methods; and when a single method was used, the support for linkage was not always consistent with the primary analysis. In contrast to single-locus traits, where data were fully shared through publishing either the original pedigrees or lod curves, there would appear to be inadequate sharing of data for complex traits. As noted in Section V.A, a major strength of the lod score method was that published reports allowed combining of data either to ultimately localize the disease gene or to reject linkage. For complex traits, published reports contain p values computed in ways that cannot be compared across studies.

Another informative example is from Genetic Analysis Workshop 10 (summarized by Rice, 1997), where five bipolar disorder data sets were analyzed by 25 groups. Difficulties arise not only owing to variation in diagnostic criteria, but also to differences in markers used and in allele calling, even for the same marker. Dorr et al. (1997) advocated the use of percent IBD sharing in affected sibpairs as the dependent variable and used logistic regression to test for heterogeneity among the five studies.
This approach has some of the advantages of the lod score for SML traits.
More work is needed to develop methods as effective as combining lod curves for SML traits. Besides the methodologic issues, more uniform standards for reporting results, or even making genotypic data available, are needed to allow effective pooling of data.
VI. THE PROBABILITY OF DETECTING LINKAGE

Let "+" denote the presence of true linkage and "−" denote nonlinkage to a particular marker. Let L denote the rejection of the null hypothesis (i.e., the conclusion that there is linkage):

P(L | +) = 1 − β = power,

P(L | −) = α = significance level.
We are interested in the quantity P(+ | L), the probability that a linkage is correct when we have concluded that linkage is present. As noted by Morton (1955), "We are especially anxious to avoid the assertion that two genes are linked, since a misleading map is worse than no map at all." The quantity P(+ | L) depends on P(+), the prior probability of linkage. Under the assumption that there is a trait locus, Morton set a value of P(+) = 0.05, although others have argued 0.02 is more realistic. By Bayes' rule,

P(+ | L) = P(L | +)P(+) / [P(L | +)P(+) + P(L | −)P(−)]
         = (1 − β)P(+) / {(1 − β)P(+) + α[1 − P(+)]}.
This quantity increases as α becomes smaller and as the power (1 − β) becomes larger. Note that, for example, if P(+) = 0.02, (1 − β) = 1, and α = 0.05, then P(+ | L) = 0.29. That is, even with 100% power, less than one-third of reported linkages would be real. This led Morton to choose α = 0.001 to reduce the "posterior type I error."
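Morton's posterior calculation is a direct application of Bayes' rule and can be reproduced as follows (the function name is our own):

```python
def posterior_linkage(prior, power, alpha):
    """P(+ | L): probability that a declared linkage is true, given the prior
    probability of linkage, the power 1 - beta, and the significance level."""
    return (power * prior) / (power * prior + alpha * (1.0 - prior))
```

With prior 0.02, power 1.0, and α = 0.05 this gives 0.29, while tightening α to 0.001 with power 0.99 and prior 0.05 pushes the posterior above 0.95.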
A. The lod score of 3 criterion

The original suggestion of Morton (1955) was to use a sequential test for linkage. In this approach, families are added one at a time to the sample and the lod
score computed. If the running lod score goes above the cutoff log(A), we conclude linkage; if it goes below the cutoff log(B), we conclude no linkage; and if it is in between, we continue to sample more families. In this setting, the expected sample size is lower than that of a fixed sample achieving the same power. Wald (1947) showed that the cutoffs are given by

A = (1 − β)/α

and

B = β/(1 − α).

Thus with α = 0.001, 1 − β = 0.99, we have log(A) = 3 and log(B) = −2, the values proposed by Morton (1955). The lod score of 3 criterion has been used even when the analysis was not sequential. As noted earlier, this would correspond to α = 0.0001 for a single marker. Moreover, excluding regions with lod scores below −2 has proven effective in exclusion mapping.
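Wald's cutoffs are easy to check numerically (a sketch; the function name is ours):

```python
import math

def sprt_log10_bounds(alpha, beta):
    """Wald's sequential cutoffs A = (1-beta)/alpha and B = beta/(1-alpha),
    returned on the log10 (lod) scale."""
    A = (1.0 - beta) / alpha
    B = beta / (1.0 - alpha)
    return math.log10(A), math.log10(B)

# With alpha = 0.001 and beta = 0.01 (power 0.99), these reproduce
# Morton's boundaries of +3 and -2.
log_a, log_b = sprt_log10_bounds(0.001, 0.01)
```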
B. Genome-wide significance level

The theory of Morton was developed for a single marker. In the context of a modern-day genome scan, several hundred markers are used, although the statistical tests are correlated, since sets of markers are linked. Lander and Kruglyak (1995) note that for P(L | −) to be 0.05, the appropriate lod score cutoff is 3.6 for an infinitely dense map. That is, on average, 1 in 20 genome scans will provide a false positive lod score of 3.6 or more. The three chapters in Section 8 of this volume deal with these issues in greater detail.

In this setting, the probability of having a marker linked to the trait locus is 1.0, so the real question is: What is the power to detect the trait locus using a particular lod cutoff? For an SML trait, the computations are straightforward. Power calculations can be done with computer programs such as SIMLINK (Boehnke, 1986) or SLINK (Weeks et al., 1990) on the family material prior to genotyping and can model incomplete penetrance and incomplete marker information. In this setting, false positive high lod scores are not problematic; areas of interest from one study can be genotyped in an independent sample.

For complex traits, the number of loci involved in the trait is unknown, so that power cannot be estimated with certainty. As noted in Table 6.1 in
Chapter 6 of this book (Rice et al.), the number of affected sibpairs required to achieve 80% power with α = 0.0001 ranges from 59 for one trait locus to 4926 for 10 additive trait loci for a totally heritable trait with a 10% prevalence. This number is 63,078 for 10 loci if the heritability is 25%. Thus for any one study with a given sample size, it is impossible to estimate P(+ | L).

A more subtle issue is that of replication of findings. If there are six unlinked disease loci for a disorder, and the power to detect any one of them is, say, 20%, then the power to detect at least one is 1 − (0.8)⁶ = 74%. That is, there may be a high probability of correctly detecting a susceptibility locus, but in a replication study looking at the specific locus, the power is low (20% in our example). This presents a problem unless the replication sample is much larger than the original sample (Suarez et al., 1994).
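The replication arithmetic can be sketched as follows (an illustrative helper, not from the chapter):

```python
def power_at_least_one(per_locus_power, n_loci):
    """Chance of detecting at least one of n unlinked susceptibility loci,
    each detected independently with the given per-locus power."""
    return 1.0 - (1.0 - per_locus_power) ** n_loci
```

With six loci each detectable with 20% power, the chance of at least one hit is about 74%, yet a replication study aimed at the one specific locus found still has only 20% power.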
VII. MODEL-FREE LINKAGE METHODS

Model-free linkage methods, surveyed in Chapter 11 by Elston and Cordell, have gained popularity as research into the genetics of complex diseases has intensified. While the lod score has been a very successful tool for the study of Mendelian disorders, parametric lod score analysis is sometimes considered less appropriate for analyzing traits if the mode of inheritance is unclear, if several modes of inheritance seem plausible, or if the trait is complex and non-Mendelian. This issue is not straightforward, however. Several investigators have explored the implications of analyzing a trait under an incorrect assumed mode of inheritance. Clerget-Darpoux et al. (1986) evaluated the effects of misspecifying genetic parameters in lod score analyses of a single-gene disease. They showed that misspecifying the degree of dominance (e.g., analyzing a recessive trait under an assumed dominant mode of inheritance) can lead to underestimation of the evidence for linkage, as well as to a biased estimate of the disease gene location.
A. Affected sibpair methods and lod scores

Popular alternatives to parametric lod score analysis include affected sibpair (ASP) methods. Often described as "nonparametric" or "model-free," ASP analysis tests for excess sharing of marker alleles identical by descent (IBD) in affected-affected sibpairs. For example, the sibpair mean IBD test (Blackwelder and Elston, 1985) considers, for a sample of n sibpairs, the IBD test statistic t₂ = 2n₂ + n₁, where n_i is the number of sample pairs sharing i alleles IBD; t₂ is compared to n, the total IBD allele sharing that is expected under the null hypothesis of no linkage.
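A minimal sketch of the mean IBD test follows (the function name and the normal approximation are ours; it assumes fully informative markers and independent pairs, so that under the null the IBD count per pair has mean 1 and variance ½, giving t₂ mean n and variance n/2):

```python
import math

def mean_ibd_test(n0, n1, n2):
    """Blackwelder-Elston mean test: t2 = 2*n2 + n1, compared to its null
    expectation n. Returns t2 and a one-sided normal deviate based on the
    null variance n/2 (IBD probabilities 1/4, 1/2, 1/4 per pair)."""
    n = n0 + n1 + n2
    t2 = 2 * n2 + n1
    z = (t2 - n) / math.sqrt(n / 2.0)
    return t2, z
```

A sample with the null proportions (25, 50, 25) gives t₂ = n = 100 and z = 0, while excess sharing such as (10, 50, 40) gives a large positive deviate.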
More generally, affected pedigree member (APM) methods compare observed allele sharing patterns among affected pedigree members against those expected under random assortment. An affected relative pair statistic is given by a weighted sum of the frequencies of observed IBD configurations. For a sample of nuclear families with exactly two offspring, both affected, the mean IBD sibpair test is a simple case of an APM test, with weights of 0, 1, and 2 for the configurations of 0 IBD, 1 IBD, and 2 IBD, respectively, in the sibs.

Tests based on APM and ASP methods do not require explicit specification of a genetic model for the disease. Hence it might appear that APM and ASP methods have advantages over model-based lod score analysis for studying complex diseases of uncertain genetic etiology. However, this may be only somewhat true. As discussed later, it is important to be aware that equivalencies between ASP and parametric tests have been demonstrated in particular cases (Knapp et al., 1994). Furthermore, implicit model assumptions exist for APM and ASP tests, and the power of such tests is thus influenced by the appropriateness of these assumptions (Whittemore, 1996).

Knapp et al. (1994) have shown that commonly used affected sibpair tests are in practice equivalent to lod score analysis under certain assumed modes of inheritance, regardless of the true mode of inheritance. A similar equivalence was also noted in Hyer et al. (1991). These equivalencies imply that the power of parametric analyses under certain model assumptions remains comparable to the power of ASP tests in such settings. Knapp et al. prove these equivalencies for samples consisting of nuclear families with parents and two affected sibs. For lod score analysis, three cases of assumed parental status are considered: both parents unaffected (I), both parents affected (II), and both parents unknown (III). More specifically, Knapp et al. demonstrate the equivalence of tests based on certain IBD statistics t and tests based on the parametric lod score by showing that, under the respective assumptions for parental status, there exists a function f(t) mapping the test statistic t to the lod score statistic Z computed under a particular assumed mode of inheritance. The key is that this function is strictly monotone increasing over a range corresponding to significance levels (α's) of practical interest. Hence the two statistics are equivalent over this range and, after appropriate adjustment of the test critical values, describe the same test.

For cases I and III (the assumption of both parents unaffected and the assumption of both parents unknown), the equivalence is proved by using the sibpair mean IBD test statistic t₂ and an assumed recessive mode of inheritance for the lod score. The corresponding function f(t) is monotone in the range t ∈ [n, 2n]. Since under the null hypothesis of no linkage (θ = ½), Pr_H0(t₂ ≥ n) > 0.5, the two statistics induce equivalent tests for any significance level α ≤ 0.5.
For case II (both parents assumed affected), Knapp et al. prove equivalence of the sibpair "two-allele" or "proportion" test (based on the statistic t = n₂, the number of affected-affected pairs sharing two alleles IBD) and the lod score under an assumed dominant mode of inheritance. Here the restriction for α is more stringent: the tests are equivalent if the disease gene frequency is sufficiently small and α ≤ Pr_H0(n₂ ≥ 3n/7). Knapp and colleagues note that this range corresponds to α values of 0.0001 or smaller for sample sizes up to n = 88; as n increases, Pr_H0(n₂ ≥ 3n/7) approaches 0, and the corresponding limiting α values approach 0 as well.

Further insight into the relationship between classical lod scores and affected sibpair IBD statistics is provided by Whittemore (1996), via a unified framework for various linkage test statistics. Whittemore introduces the efficient score statistic, obtained from the likelihood function for the observed marker data and the observed trait data. This statistic is asymptotically equivalent to the lod score and includes affected pedigree member (hence affected sibpair) statistics as special cases. Writing affected pedigree member statistics as efficient score statistics, Whittemore shows that any affected pedigree member test implicitly assumes a model for the phenotypic risk ratios for a pedigree with a given IBD configuration at the trait gene, relative to an arbitrary pedigree having the same family structure. An example in the paper demonstrates that for samples of five-member nuclear families with one affected parent and three affected offspring, the mean IBD relative pair test (using the six possible affected relative pairs) corresponds to an implicit assumption of an additive genetic model plus an additional constraint on the allele effects.
Hence the appropriateness of IBD-based ASP or APM tests is not automatic for studies of complex disease, and such tests cannot strictly be called "model-free." The power of these tests depends upon how the underlying model assumption for the phenotypic risk ratios compares with the "true" model.
B. Maximum lod score for affected sibpairs

As noted earlier, the SML may be parameterized in terms of five parameters.
Suarez et al. (1978) noted that the IBD distribution of sibpairs, conditioned on their phenotype, does not depend on all of these parameters individually. Risch (1990) used an alternative parameterization {K, λs, λo, θ}, where λR = KR/K, with K the population prevalence and KR the risk to a relative of type R of an affected individual; λs and λo are the risk ratios for siblings and offspring, respectively. Moreover, Risch noted that the IBD distribution in affected sibpairs depends only on {λs, λo, θ}. With multipoint marker data, all three parameters {λs, λo, x}, where x is the chromosome location, can be estimated when a trait locus is present. Under the null hypothesis, the parameter space is degenerate, so that standard likelihood theory does not apply. However, Hauser et al. (1996) provide simulation results in terms of λ = λs = λo, where λ is the locus-specific recurrence risk ratio. The important point here is that these statistics, as implemented in programs such as ASPEX (Hinds and Risch, personal communication) or MAPMAKER/SIBS (Kruglyak and Lander, 1995), are in fact the lod score maximized over these parameters. The likelihood is P(marker data | both sibs affected, parents of unknown phenotype), and maximizing the likelihood is equivalent to maximizing the lod score (Clerget-Darpoux et al., 1986).
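Risch's observation that the affected-sibpair IBD distribution depends only on the risk ratios can be made concrete. The sketch below is our illustration (not from the chapter): it reweights the null sharing probabilities (1/4, 1/2, 1/4) by the risk ratio of the relative type corresponding to each IBD state, using the relation λMZ = 4λs − 2λo − 1 implied by Risch's single-locus variance decomposition, and gives the IBD distribution at the trait locus itself (θ = 0):

```python
def asp_ibd(lam_s: float, lam_o: float):
    """IBD-sharing probabilities (z0, z1, z2) for an affected sib pair at the
    trait locus, written in terms of the sibling and offspring risk ratios.
    Prior sharing is (1/4, 1/2, 1/4); each cell is reweighted by the risk
    ratio of the matching relative type (unrelated, parent-offspring, MZ twin)."""
    lam_mz = 4 * lam_s - 2 * lam_o - 1  # implied monozygotic-twin risk ratio
    z0 = 0.25 / lam_s
    z1 = 0.50 * lam_o / lam_s
    z2 = 0.25 * lam_mz / lam_s
    return z0, z1, z2

# With no locus effect (all risk ratios 1), the prior (1/4, 1/2, 1/4) is recovered.
print(asp_ibd(1.0, 1.0))
print(asp_ibd(2.0, 1.5))
```

Note that the three probabilities always sum to one, which is a quick internal check on the λMZ relation.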
C. MLS versus the lod score

The maximum lod score (MLS) just discussed is a lod score (maximized over λs and λo or, alternatively, over the IBD sharing probabilities at the trait locus), and it gives statistically consistent estimates of disease gene location. It is important, however, to recognize the differences between the MLS and the traditional lod score. In general, there will be more power when the mode of inheritance is known and the data are analyzed under the correct model. In the case of a simple recessive model, we know that affected parents are homozygous at the disease locus, and thus noninformative for linkage analysis, whereas the unaffected parent of two affected children is heterozygous. In contrast, for a dominant disease, the affected parent is informative and an unaffected one is not. This information is used in the traditional linkage approach, whereas parental phenotypes are not used in the MLS approach. To illustrate this, assume that we sample N families with affected sibpairs for a (rare) dominant disorder. We assume that the disease gene frequency is small, so that each family will have precisely one affected parent. Assume, further, that the disease gene is completely linked to the marker. Among the 2N parents, there will be 100% sharing with respect to the N affected parents and 50% sharing, on average, with respect to the N unaffected parents; thus there will be 75% sharing overall if parental diagnosis is ignored. In the model-free approach, the resulting chi-square value would be N/2 if the cell counts expected under the alternative were observed. In the model-based approach, 100% sharing would be observed with respect to the N affected parents, and the resulting chi-square value would be equal to N, twice that obtained when parental diagnosis is not used.
In sum, even though the approaches are equivalent under some sampling units (e.g., affected siblings with parental diagnosis unknown), the MLS and traditional lod score methods have important differences.
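The chi-square arithmetic in the dominant-disorder example can be checked directly. In this sketch (our own; N = 100 is an arbitrary choice), each parent either transmits the same marker allele to both affected sibs ("shares") or does not, with null probability 1/2 per parent:

```python
def chisq(observed_share: float, total: float, expected_p: float = 0.5) -> float:
    """One-df chi-square for parental sharing counts against expectation expected_p."""
    e_share = total * expected_p
    e_not = total * (1 - expected_p)
    o_not = total - observed_share
    return (observed_share - e_share) ** 2 / e_share + (o_not - e_not) ** 2 / e_not

N = 100  # families, each with one affected and one unaffected parent

# Model-free: pool all 2N parents. Affected parents always share (N),
# unaffected parents share half the time (N/2): 75% sharing overall.
model_free = chisq(N + N / 2, 2 * N)

# Model-based (dominant): use only the N informative affected parents, all sharing.
model_based = chisq(N, N)

print(model_free, model_based)  # N/2 and N, as stated in the text
```

The model-based statistic is exactly twice the model-free one, reproducing the factor of two claimed above.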
Rice et al.
VIII. DISCUSSION

In the near future, we will have a complete human genome sequence, a catalog of human genes, and a dense map of single-nucleotide polymorphisms in coding regions. However, the optimal way to utilize these technologies to identify susceptibility genes for complex disease remains unclear. We may still learn from the lod score method, which has proven so successful in identifying genes for Mendelian traits. It must be emphasized that the detection of genetic heterogeneity and of linkage of the Rh blood group to a subtype of elliptocytosis by Newton Morton was quite remarkable in 1956, even though it may appear routine today. We have highlighted two aspects of this method which, we believe, have contributed to its success. The first is the ability to share and pool data from published reports. This is consistent with the sequential nature of Morton's 1955 paper. The collective lod score will continue to rise for a true linkage, and will become strongly negative for false ones. Related to this is the ability to localize a disease gene once linkage has been detected: as the lod score increases, the support interval for the region containing the gene narrows. Given that one centimorgan corresponds roughly to one megabase of DNA, the importance of combining all available data is evident. Until recently, positional cloning efforts usually began once the region of interest was under 2 Mb. The other aspect of the lod score method we highlighted is the ability to estimate the power of a linkage study and to set cutoffs to control the posterior probability of linkage. Indeed, the success of this method has been its high positive predictive value (i.e., the high probability that a reported linkage is, in fact, true). Validation of a statistical linkage comes from the identification of the disease gene by the biologist. This dictates the need for methods with high predictive value.
The next decade should witness similar successes for the identification of genes that contribute to common diseases. The exact methods and study designs that will effect this are not yet clear. The MLS enjoys some of the properties of the traditional lod score: it provides unbiased estimates of disease gene location and the ability to do exclusion mapping for given values of the risk ratio λ. Alternative approaches to traditional linkage analysis, such as studying related quantitative phenotypes (endophenotypes) in addition to the dichotomous clinical diagnosis, genome-wide association studies, and the use of animal models, are all promising. The lessons learned from the lod score method are still timely and will have substantial impact on these new methodologies.
References

Blackwelder, W. C., and Elston, R. C. (1985). A comparison of sib-pair linkage tests for disease susceptibility loci. Genet. Epidemiol. 2, 85-97.
8. The Lod Score Method
Boehnke, M. (1986). Estimating the power of a proposed linkage study: A practical computer simulation approach. Am. J. Hum. Genet. 39, 513-527.
Clerget-Darpoux, F., Bonaiti-Pellie, C., and Hochez, J. (1986). Effects of misspecifying genetic parameters in lod score analysis. Biometrics 42, 393-399.
Dorr, D. A., Rice, J. P., Armstrong, C., Reich, T., and Blehar, M. (1997). A meta-analysis of chromosome 18 linkage data for bipolar illness. Genet. Epidemiol. 14, 617-622.
Elston, R. C., and Stewart, J. (1971). A general model for the analysis of pedigree data. Hum. Hered. 21, 523-542.
Hauser, E. R., Boehnke, M., Guo, S. W., and Risch, N. (1996). Affected-sib-pair interval mapping and exclusion for complex genetic traits. Genet. Epidemiol. 13, 117-137.
Hedges, L. V., and Olkin, I. (1985). "Statistical Methods for Meta-Analysis." Academic Press, San Diego, CA.
Hyer, R. N., Julier, C., Buckley, J. D., Trucco, M., Rotter, J., Spielman, R., Barnett, A., Bain, S., Boitard, C., Deschamps, I., Todd, J. A., Bell, J. I., and Lathrop, G. M. (1991). High-resolution linkage mapping for susceptibility genes in human polygenic disease: Insulin-dependent diabetes mellitus and chromosome 11q. Am. J. Hum. Genet. 48, 243-257.
Knapp, M., Seuchter, S. A., and Baur, M. P. (1994). Linkage analysis in nuclear families. 2. Relationship between affected sib-pair tests and lod score analysis. Hum. Hered. 44, 44-51.
Kruglyak, L., and Lander, E. S. (1995). Complete multipoint sib-pair analysis of qualitative and quantitative traits. Am. J. Hum. Genet. 57, 439-454.
Lander, E., and Kruglyak, L. (1995). Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat. Genet. 11, 241-247.
Lathrop, G. M., Lalouel, J. M., Julier, C., and Ott, J. (1984). Strategies for multilocus linkage analysis in humans. Proc. Natl. Acad. Sci. USA 81, 3443-3446.
Morton, N. E. (1955). Sequential tests for the detection of linkage. Am. J. Hum. Genet. 7, 277-318.
Morton, N. E. (1956). The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am. J. Hum. Genet. 8, 80-96.
Rice, J. P. (1997). Genetic analysis of bipolar disorder: Summary of GAW10. Genet. Epidemiol. 14, 549-561.
Rice, J. P., Saccone, N. L., and Suarez, B. K. The design of studies for investigating linkage and association. In "Genetic Analysis of Multifactorial Diseases" (D. Bishop and P. Sham, eds.). BIOS Scientific Publishers, Oxford. In press.
Risch, N. (1990). Linkage strategies for genetically complex traits. III. The effect of marker polymorphism on analysis of affected relative pairs. Am. J. Hum. Genet. 46, 242-253.
Schizophrenia Linkage Collaborative Group for Chromosomes 3, 6, and 8. (1996). Additional support for schizophrenia linkage on chromosomes 6 and 8: A multicenter study. Am. J. Med. Genet. (Neuropsychiatr. Genet.) 67, 580-594.
Suarez, B. K., Rice, J. P., and Reich, T. (1978). The generalized sib-pair IBD distribution: Its use in the detection of linkage. Ann. Hum. Genet. 42, 87-94.
Suarez, B. K., Hampe, C. L., and Van Eerdewegh, P. (1994). Problems of replicating linkage claims in psychiatry. In "Genetic Approaches to Mental Disorders" (E. S. Gershon and C. R. Cloninger, eds.), pp. 23-46. American Psychiatric Press, Washington, DC.
Wald, A. (1947). "Sequential Analysis." Wiley, New York.
Weeks, D. E., Ott, J., and Lathrop, G. M. (1990). SLINK: A general simulation program for linkage analysis. Am. J. Hum. Genet. 47, A204 (Supplement).
Whittemore, A. S. (1996). Genome scanning for linkage: An overview. Am. J. Hum. Genet. 59, 704-716.
Extension of the Lod Score: The Mod Score

Françoise Clerget-Darpoux
INSERM Unité 535, Génétique Épidémiologique et Structure des Populations Humaines, 94276 Le Kremlin-Bicêtre, France
I. Summary
II. Introduction
III. Misspecifying Genetic Parameters at the Disease Locus in Lod Score Analysis
IV. The Mod Score Function
V. Conclusion
References
I. SUMMARY

In 1955 Morton proposed the lod score method both for testing linkage between loci and for estimating the recombination fraction between them. If a disease is controlled by a gene at one of these loci, the lod score computation requires the prior specification of an underlying model that assigns the probabilities of genotypes from the observed phenotypes. To address the case of linkage studies for diseases with unknown mode of inheritance, we suggested (Clerget-Darpoux et al., 1986) extending the lod score function to a so-called mod score function. In this function, the variables are both the recombination fraction and the disease model parameters. Maximizing the mod score function over all these parameters amounts to maximizing the probability of the marker data conditional on the disease status. Under the absence of linkage, the mod score conforms to a chi-square distribution, with extra degrees of freedom in comparison to the lod score function (MacLean et al., 1993). The mod score is asymptotically maximum for the true disease model (Clerget-Darpoux and Bonaiti-Pellie, 1992; Hodge and Elston, 1994). Consequently, the power to detect linkage through the mod score will be highest when the space of models over which the maximization is performed includes the true model. On the other hand, one must avoid overparametrization of the model space. For example, when the approach is applied to affected sibpairs, only two constrained disease model parameters should be used (Knapp et al., 1994) for the mod score maximization. It is also important to emphasize the existence of a strong correlation between the disease gene location and the disease model. Consequently, there is poor resolution of the location of the susceptibility locus when the disease model at this locus is unknown. Of course, this is true regardless of the statistics used. The mod score may also be applied in a candidate gene strategy to model the potential effect of this gene in the disease. Since, however, it ignores the information provided both by disease segregation and by linkage disequilibrium between the marker alleles and the functional disease alleles, its power of discrimination between genetic models is weak. The MASC method (Clerget-Darpoux et al., 1988) has been designed to address more efficiently the objectives of a candidate gene approach.

Advances in Genetics, Vol. 42. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved. 0065-2660/01 $35.00
II. INTRODUCTION

The goal of genetic studies is to unravel the genetic component of diseases, with the hope of understanding the pathogenic process. Because of the rapid increase in the number of genetic markers provided by molecular technology, linkage analysis has become a pervasive approach in the study of human diseases. Indeed, since linkage with a marker implies the presence of a disease susceptibility gene in the marker region, genetic markers are useful reference points for localizing disease genes. For a long time, the most widely used method was Morton's lod score method. The method was meant to apply to traits with known mode of inheritance, and it was successfully used to locate genes of diseases with Mendelian inheritance. These successes generated great enthusiasm among genetic epidemiologists and promoted the naive idea that systematic screening of the genome by Morton's lod score method would allow us to determine the genetic basis of any human disease. Lod score computation, however, requires the specification of an underlying disease model that assigns the probabilities of genotypes from the observed phenotypes, while for a multifactorial disease the underlying model is unknown. If the specification is incorrect, the recombination fraction, which is the key parameter, will not be correctly estimated, and the true location of the risk factor may be wrongly excluded (Clerget-Darpoux et al., 1986; Clerget-Darpoux and Bonaiti-Pellie, 1993). This did not stop many genetic studies of complex inheritance diseases from resorting to this method. This in turn has led to a plethora of publications that exclude parts of the genome, many of them substantiating their claim through pages of negative lod scores! We proposed in our 1986 paper an extension of the lod score function, the mod score function. In this function, the variables are not only the recombination fraction θ, but also the disease model parameters G (i.e., allele frequencies and penetrances at the disease locus). We review here the advantages and limitations of the mod score function in terms of linkage testing and parameter estimation.
III. MISSPECIFYING GENETIC PARAMETERS AT THE DISEASE LOCUS IN LOD SCORE ANALYSIS

In 1955 Morton proposed the lod score method both for testing linkage between loci and for estimating the recombination fraction between them. The recombination fraction measures the proportion of recombinant gametes transmitted from parents to their children. Estimation is thus possible when, for each family member, the genotype is known at each locus. If a disease is controlled by a gene at one of these loci (a disease locus), the computation of a lod score for a given family requires the consideration of all possible genotypic configurations Gi at the disease locus and the computation of the probabilities P(Gi | Φ) of these configurations given Φ, the phenotypic information for the disease. When there is no selection and no uncertainty at the marker locus, the recombination fraction estimate is asymptotically unbiased as long as the probabilities P(Gi | Φ) are correct. This is not the case when the information in the pedigree is truncated with respect to the disease statuses (Vieland and Hodge, 1996) or when the disease model is misspecified (Clerget-Darpoux et al., 1986).

All phenotypic information available in the pedigree must be used. As Figure 9.1 shows, the lod score function differs depending on whether the two untyped sibs of family F (see Fig. 9.2) are included in the analysis. More generally, excluding untyped individuals with known disease status, or using only affected individuals in the analysis, leads to a biased recombination fraction estimate. Vieland and Hodge (1996) emphasize that the family member sampling scheme may lead to truncated phenotypic information and thus to a biased estimate.

Figure 9.1. Lod score functions for the family F (see Fig. 9.2): broken curve, the status of the two untyped sibs is taken into account; solid curve, the two untyped sibs are ignored in the analysis.

Figure 9.2. Family F, for which the lod score functions are shown in Fig. 9.1. Key: filled symbols, affected individuals; slashed symbols, deceased individuals; M1 and M2, marker alleles.

Moreover, the recombination fraction θ may be biased when errors are made in specifying the values of the disease locus parameters (disease allele frequency, dominance relationship between normal and disease alleles, penetrance values) (Clerget-Darpoux et al., 1986). In particular, as shown in Figure 9.3, incorrect specification of the disease allele frequency may greatly bias the estimation of the recombination fraction θ. A disease gene strictly linked to the marker may seem to be located at a large genetic distance if the lod score analysis is performed assuming a low susceptibility allele frequency when this allele is actually frequent. However, incorrect specification does not increase the lod score and thus does not lead to a false conclusion of linkage. This last statement was confirmed by Williamson and Amos (1990), who showed that, whatever the disease model specification, the lod score statistic is distributed as a chi-square under the absence of linkage. The lod score method is then, as a linkage test, robust to model misspecification. In contrast, when linkage does exist, a wrong specification may strongly decrease the lod score, and the true location of the disease locus may be falsely excluded (Clerget-Darpoux and Bonaiti-Pellie, 1992, 1993). This is illustrated by the following example. HLA typing was available on 58 nuclear families of multiple sclerosis (MS) patients. Lod score analyses between HLA and a putative MS susceptibility locus were performed under two different models at the disease locus. As shown in Table 9.1, when one assumes a rare recessive allele (q = 0.03) with a penetrance of 0.43 (the model proposed by Tiwari et al., 1980), the lod scores are extremely negative for small recombination fractions. Thus, under this model, the presence of a risk factor in the HLA region is strongly excluded. In contrast, evidence for a risk factor within the HLA region was obtained (lod score > 3 for θ = 0) when a frequent dominant allele (q = 0.18) with low penetrance (f = 0.02) was assumed (Clerget-Darpoux et al., 1984).

Figure 9.3. Expected lod scores for a nuclear family with two affected sibs, assuming different allele frequencies q in the analyses, whereas the true underlying model is a dominant disease allele of frequency q0 = 0.20, a penetrance of 0.05 for disease allele carriers, and a null recombination fraction between the disease and marker loci.
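The lod score curves discussed above come from full pedigree likelihoods with a specified disease model. For orientation, the shape of a lod score function can be written in closed form in the simplest phase-known, fully informative setting; the following is a simplified sketch of our own (not the pedigree computation behind Figures 9.1-9.3):

```python
from math import log10

def lod(theta: float, r: int, n: int) -> float:
    """Lod score for r recombinants among n phase-known informative meioses:
    log10 of the likelihood at recombination fraction theta versus theta = 1/2."""
    if theta == 0 and r > 0:
        return float("-inf")  # a single observed recombinant excludes theta = 0
    num = (theta ** r) * ((1 - theta) ** (n - r))
    return log10(num / 0.5 ** n)

# Ten non-recombinant meioses give the classic lod of about 3 at theta = 0.
print(lod(0.0, 0, 10))
```

At θ = 1/2 the numerator and denominator coincide, so the lod score is zero there for any data, which is why the curves in Figure 9.1 meet the axis at θ = 1/2.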
Table 9.1. Lod Score Values for Different Recombination Fractions θ between HLA and an MS Disease Susceptibility Locus, Computed under Two Different Models

  θ       Recessive (q = 0.03, f = 0.43)    Dominant (q = 0.18, f = 0.02)
  0.00              -31.34                             3.11
  0.01              -19.83                             3.06
  0.02              -14.08                             3.00
  0.03              -10.32                             2.94
  0.04               -7.59                             2.90
  0.05               -5.51                             2.80
  0.06               -3.87                             2.72
  0.07               -2.56                             2.64
  0.08               -1.49                             2.56

IV. THE MOD SCORE FUNCTION

We suggested that the lod score function be extended to a so-called mod score function (Clerget-Darpoux et al., 1986). In this function, the parameters of interest are both the recombination fraction and the disease model parameters. Maximizing the mod score function is equivalent to maximizing the conditional probability of the marker genotypes given the disease status, and thus is equivalent to the approach proposed by Risch in 1984 for modeling the HLA component in type I diabetes. The demonstration is given in the chapter appendix. The statistical properties of the mod score as a linkage test have been considered by MacLean et al. (1993). Under the absence of linkage, and in a large enough sample, the mod score conforms to a chi-square distribution with extra degrees of freedom compared with the usual lod score function. At boundary parameter values, the mod score statistic converges to a mixture of chi-squares. Hodge and Elston (1994) examined the relationships among the true lod scores, the lod scores under a wrong model (denoted "wrod scores"), and the mod scores. Under the assumption of absence of linkage, they showed that the lod score and wrod score have the same chi-square distribution; indeed, in that case the likelihood ratio no longer depends on the disease model. However, the use of the mod score implies testing linkage over many models, and consequently the degrees of freedom must be adjusted. Maximizing the mod score over the parameters θ and G is equivalent to maximizing the likelihood formed from the marker information conditional on the disease status of each family member. Thus it is an "ascertainment-assumption-free" method (Shute and Ewens, 1988), and the mod score is asymptotically maximum for the true parameter values (Clerget-Darpoux and Bonaiti-Pellie, 1992; Hodge and Elston, 1994). However, as stated by Vieland and Hodge (1996), this is no longer true if "the observed pedigree structure depends on which relatives within the pedigree happen to have been the probands."
For monogenic diseases, the mode of inheritance is usually known, and the lod score method is the most appropriate approach both for testing linkage and for localizing the disease locus. The mod score function may, however, be used once the disease gene is identified, to refine parameter estimates such as penetrances. For multifactorial diseases, two kinds of strategies may be considered to identify the genetic susceptibility factors: a genome scan by linkage analysis, or the study of candidate genes. We consider, for these two strategies, the benefits of using the mod score function.
A. Genome scan by linkage analysis

When screening the genome by linkage analysis, the objective is to find a genomic region in which a disease risk factor is present. This is the case if linkage between the disease and genetic markers can be demonstrated. For multifactorial diseases, the model underlying the effect of each genetic factor is generally unknown, and Morton lod scores computed under wrong models may be unable to detect linkage. Is the mod score then an efficient approach? Since the mod score is maximum under the true disease model, to maximize the power of detection we must choose a maximization space that includes the true model. On the other hand, we must avoid overparameterization of this space. Indeed, on a given family sample, the maximum mod score is obtained for many different disease models and recombination fractions. In particular, the mod score applied to affected sibpairs has been shown by Knapp et al. (1994) to be equivalent, as a test, to the sibpair maximum lod score (MLS) test proposed by Risch (1990), which means that maximizing over the disease model is equivalent to maximizing over the identity-by-descent sharing probabilities (z0, z1, z2), accounting for the triangle constraints (Suarez, 1978; Holmans, 1993) with z0 + z1 + z2 = 1 and 2z0 ≤ z1 ≤ 0.5. This implies that, regarding the disease model, the mod score maximization must be performed over two constrained parameters only: z0 and z1. In addition, one should emphasize the strict relation between the disease gene location and the disease model. This is obvious from Figure 9.3, which shows that the same mod score would be obtained for an infinite set of allele frequencies and recombination fractions. As a consequence, and this is true whatever statistics are used, there is poor resolution of the susceptibility locus position when the disease model at this locus is unknown.
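The constrained maximization just described can be sketched numerically. The following grid search (our own illustration; a production implementation such as MAPMAKER/SIBS uses more efficient machinery, and the IBD counts below are hypothetical) maximizes the sibpair lod Σ n_i log10(z_i / prior_i) over Holmans' possible triangle, z0 + z1 + z2 = 1 with 2z0 ≤ z1 ≤ 1/2:

```python
from math import log10

PRIOR = (0.25, 0.5, 0.25)  # null IBD-sharing probabilities for sib pairs

def mls(counts, step=0.001):
    """Maximum lod score over (z0, z1, z2) restricted to Holmans' triangle,
    for observed counts of pairs sharing 0, 1, and 2 alleles IBD."""
    n0, n1, n2 = counts
    best = 0.0  # the null point z = PRIOR lies in the triangle and gives lod 0
    z1 = 0.0
    while z1 <= 0.5 + 1e-12:
        z0 = 0.0
        while 2 * z0 <= z1 + 1e-12:
            z2 = 1 - z0 - z1
            if z0 > 0 and z1 > 0 and z2 > 0:
                lod = (n0 * log10(z0 / PRIOR[0])
                       + n1 * log10(z1 / PRIOR[1])
                       + n2 * log10(z2 / PRIOR[2]))
                best = max(best, lod)
            z0 += step
        z1 += step
    return best

# Hypothetical counts of pairs sharing 0, 1, 2 alleles IBD:
print(mls((10, 50, 40)))
```

When the raw sharing proportions already lie inside the triangle, as here, the grid maximum agrees with the unconstrained multinomial estimate; the constraints bite only when the observed sharing is genetically implausible.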
B. Candidate gene strategy

The mod score may also be used in a candidate gene strategy. In that approach, the recombination fraction is assumed to be negligible and thus is no longer a parameter. The objective of a candidate gene strategy is first to detect the involvement of the considered gene in the disease, then to model its effect and to evaluate the risks associated with the different genotypes. The mod score may be used to that end. It should be noted that for a multifactorial disease, the marginal effect of a single gene on an individual may depend on his or her ascertainment mode. This is the case if there is residual family resemblance, due, for example, to familial environment or additional susceptibility genes. The problem may be illustrated under a two-disease-locus model. For example, let us consider the following model. At the first locus, the alleles g and G have frequencies of 0.20 and 0.80, respectively. At the second locus, the alleles h and H have frequencies of 0.10 and 0.90, respectively. The penetrance matrix is the following:
        gg      gG      GG
hh      0.95    0.95    0
hH      0.95    0       0
HH      0       0       0
Under such a model, the marginal penetrances at locus G for an individual drawn from the general population are 0.18, 0.01, and 0 for genotypes gg, gG, and GG, respectively. For the sib of an affected individual, they are 0.57, 0.12, and 0, respectively. These penetrance values would again be different if one knew that the considered individual had several affected relatives. Consequently, mod score estimates at locus G depend on the structures and ascertainment mode of the families to which the mod score is applied. Moreover, the mod score function uses only the information provided by linkage (marker and disease cosegregation). Since it ignores the information provided both by disease segregation and by associations between the marker alleles and the functional disease alleles, its power of discrimination between genetic models is weak. The MASC method (Clerget-Darpoux et al., 1988) has been designed to fulfill more efficiently the objectives of a candidate gene approach. It applies to familial structures corresponding to a well-defined mode of ascertainment and uses simultaneously three kinds of information: familial disease segregation, disease-marker correlation at the population level (allelic association), and disease-marker correlation at the family level (linkage).
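The population marginal penetrances quoted above follow directly from the penetrance matrix and Hardy-Weinberg genotype frequencies; the sketch below (our own) reproduces the 0.18, 0.01, and 0 values. The sib-of-affected values require additional conditioning on the relative's status and are not recomputed here.

```python
# Allele frequencies at the two loci (g/G and h/H) from the example,
# expanded to Hardy-Weinberg genotype frequencies.
P_H = {"hh": 0.1**2, "hH": 2 * 0.1 * 0.9, "HH": 0.9**2}

# Penetrance matrix: rows are H-locus genotypes, columns are G-locus genotypes.
F = {
    ("hh", "gg"): 0.95, ("hh", "gG"): 0.95, ("hh", "GG"): 0.0,
    ("hH", "gg"): 0.95, ("hH", "gG"): 0.0,  ("hH", "GG"): 0.0,
    ("HH", "gg"): 0.0,  ("HH", "gG"): 0.0,  ("HH", "GG"): 0.0,
}

def marginal_penetrance(g_genotype: str) -> float:
    """Population-average risk for a given genotype at locus G, averaging
    over the Hardy-Weinberg genotype distribution at the unlinked locus H."""
    return sum(P_H[h] * F[(h, g_genotype)] for h in P_H)

for g in ("gg", "gG", "GG"):
    print(g, round(marginal_penetrance(g), 4))  # 0.1805, 0.0095, 0.0
```

Rounded to two decimals, these are the 0.18, 0.01, and 0 quoted in the text.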
V. CONCLUSION

The major strength of Morton's lod score method is the ability to precisely localize genes responsible for monogenic diseases. Such a goal cannot be achieved for multifactorial diseases, for which the marginal effect of each genetic factor is unknown. The strong correlation between the disease gene location and the disease model makes their simultaneous estimation by the mod score (or by any other statistic) overly ambitious. The mod score may be applied in a candidate gene strategy to help in modeling the gene effect. However, the conclusions may strongly depend on the ascertainment of the analyzed families and are limited because the mod score does not use the disease segregation information but instead is conditioned on it.
APPENDIX

Consider a disease locus D and a marker locus M. If we denote the marker locus data by Ma and the disease data by Φ, then the mod score is defined as a function of the recombination fraction θ between M and D and of the parameters G at the disease locus D:

    m(G, θ) = log10 [P(Ma, Φ | θ, G) / P(Ma, Φ | θ = 1/2, G)].    (A.1)

Assuming that Ma and Φ are independent when θ = 1/2, the mod score function is then

    m(G, θ) = log10 [P(Ma, Φ | θ, G) / (P(Ma) P(Φ | G))]
            = const + log10 [P(Ma | Φ; θ, G)].

Thus, provided that the marker has no effect on disease status and the loci involved are in equilibrium, maximizing the mod score function is equivalent to maximizing the conditional probability of Ma given Φ. Suppose that the true genetic model is denoted G0 and the true value of θ is θ0. The conditional probability in Equation (A.1) is asymptotically maximized with respect to all parameters at the true parameter values G0 and θ0 (Edwards, 1972). Thus the maximum value for the mod score is m(G0, θ0).
Acknowledgments

I am most grateful to my colleagues M.-C. Babron, E. Génin, and P. Jeannin for helpful discussions.
References

Clerget-Darpoux, F., and Bonaiti-Pellie, C. (1992). Strategies based on marker information for the study of human diseases. Ann. Hum. Genet. 56, 145-153.
Clerget-Darpoux, F., and Bonaiti-Pellie, C. (1993). An exclusion map covering the whole genome: A new challenge for genetic epidemiologists? Am. J. Hum. Genet. 52, 442-443.
Clerget-Darpoux, F., Govaerts, A., and Feingold, N. (1984). HLA and susceptibility to multiple sclerosis. Tissue Antigens 24, 160-169.
Clerget-Darpoux, F., Bonaiti-Pellie, C., and Hochez, J. (1986). Effects of misspecifying genetic parameters in lod score analysis. Biometrics 42, 393-399.
Clerget-Darpoux, F., Babron, M. C., Prum, B., Lathrop, G. M., Deschamps, I., and Hors, J. (1988). A new method to test genetic models in HLA-associated diseases: The MASC method. Ann. Hum. Genet. 52, 247-258.
Edwards, A. W. F. (1972). "Likelihood." Cambridge University Press, Cambridge.
Greenberg, D. A. (1989). Inferring mode of inheritance by comparison of lod scores. Am. J. Med. Genet. 34, 480-486.
Hodge, S. E., and Elston, R. C. (1994). Lods, wrods, and mods: The interpretation of lod scores calculated under different models. Genet. Epidemiol. 11, 329-342.
Holmans, P. (1993). Asymptotic properties of affected-sib-pair linkage analysis. Am. J. Hum. Genet. 52, 362-374.
Knapp, M., Seuchter, S. A., and Baur, M. P. (1994). Linkage analysis in nuclear families. 2. Relationship between affected sib-pair tests and lod score analysis. Hum. Hered. 44, 44-51.
MacLean, C. J., Bishop, D. T., Sherman, S. L., and Diehl, S. R. (1993). Distribution of lod scores under uncertain mode of inheritance. Am. J. Hum. Genet. 52, 354-361.
Morton, N. E. (1955). Sequential tests for the detection of linkage. Am. J. Hum. Genet. 7, 277-318.
Risch, N. (1984). Segregation analysis incorporating linkage markers. I. Single locus models with an application to type I diabetes. Am. J. Hum. Genet. 36, 363-386.
Risch, N. (1990). Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am. J. Hum. Genet. 46, 229-241.
Shute, N. C., and Ewens, W. J. (1988). A resolution of the ascertainment sampling problem. II. Generalizations and numerical results. Am. J. Hum. Genet. 43, 374-386.
Suarez, B. K. (1978). The affected sib pair IBD distribution for HLA-linked disease susceptibility genes. Tissue Antigens 12, 87-93.
Tiwari, J. L., Hodge, S. E., Terasaki, P. I., and Spence, M. A. (1980). HLA and the inheritance of multiple sclerosis: Linkage of 72 pedigrees. Am. J. Hum. Genet. 31, 103-111.
Vieland, V. J., and Hodge, S. E. (1996). The problem of ascertainment for linkage analysis. Am. J. Hum. Genet. 58, 1072-1084.
Williamson, J. A., and Amos, C. I. (1990). On the asymptotic behavior of the estimate of the recombination fraction under the null hypothesis of no linkage when the model is misspecified. Genet. Epidemiol. 7, 309-318.
Major Strengths and Weaknesses of the Lod Score Method

Jurg Ott
Laboratory of Statistical Genetics, Rockefeller University, New York, New York 10021
I. Summary
II. Introduction
III. Genes Underlying Complex Traits
IV. Weaknesses of the Lod Score Method
V. Strengths of the Lod Score Method
References
I. SUMMARY

Strengths and weaknesses of the lod score method for human genetic linkage analysis are discussed. The main weakness is its requirement for the specification of a detailed inheritance model for the trait. Various strengths are identified. For example, the lod score (likelihood) method has optimality properties when the trait to be studied is known to follow a Mendelian mode of inheritance. The ELOD is a useful measure of the information content of the data. The lod score method can emulate various "nonparametric" methods, and this emulation is equivalent to the nonparametric methods. Finally, the possibility of building errors into the analysis will prove to be essential for the large amount of linkage and disequilibrium data expected in the near future.
Advances in Genetics, Vol. 42. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved. 0065-2660/01 $35.00
II. INTRODUCTION

One of the main objectives of linkage analysis is to estimate the position of a disease locus on the human gene map. For simple Mendelian diseases with complete penetrance and no phenocopies, under favorable circumstances, this may be achieved by inspecting the haplotypes passed from parents to offspring. A crossover (recombination) on a haplotype (chromosome) may then divide the haplotype into two segments, one of which must contain the disease gene, while the other cannot possibly contain the disease gene. Observing recombinant haplotypes in different families eventually allows one to localize the disease gene to a small genomic region. The basis for this type of gene mapping is that crossovers occur randomly over the length of a chromosome. For details, see relevant textbooks (e.g., Ott, 1999). However, various complications generally make such an approach impossible. For example, penetrance may be incomplete, or some key individuals may be unavailable for study. Thus, the position of a disease gene on a map of markers must be estimated by the maximum likelihood method. In human genetics, this method is referred to as the lod score method because likelihoods are generally used in the form of logarithms of the likelihood ratio (odds for linkage). The numerator of the likelihood ratio refers to an assumed position of the disease locus, and the denominator refers to absence of the disease locus from the marker map. The maximum value of the lod score then serves as a statistic in the test for linkage. The lod score method was developed by Newton Morton, originally in the form of a sequential probability ratio test (Morton, 1955). It is ideally and directly suited for gene loci following a (generalized) Mendelian mode of inheritance, that is, genetic markers and diseases under the influence of a single gene. Likelihood analysis then involves estimating the relevant parameters such as allele frequencies and recombination fractions.
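To make the likelihood-ratio definition above concrete, here is a minimal Python sketch (illustrative only; the function names are ours, not from any published linkage package) for the simplest case of phase-known meioses in which recombinants can be counted directly:

```python
import math

def lod(n_rec, n_nonrec, theta):
    """Lod score for counted meioses: log10 of L(theta) / L(1/2)."""
    score = 0.0
    if n_rec:
        score += n_rec * math.log10(theta / 0.5)
    if n_nonrec:
        score += n_nonrec * math.log10((1 - theta) / 0.5)
    return score

def max_lod(n_rec, n_nonrec):
    """Maximum lod over theta, attained at theta_hat = n_rec / n (capped at 1/2)."""
    theta_hat = min(n_rec / (n_rec + n_nonrec), 0.5)
    return theta_hat, lod(n_rec, n_nonrec, theta_hat)

theta_hat, z_max = max_lod(2, 18)   # 2 recombinants in 20 meioses: z_max is about 3.2
```

By Morton's (1955) criterion, a maximum lod score of 3 or more is traditionally taken as significant evidence for linkage.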
The ability to estimate unknown parameters may be viewed as a benefit of the likelihood method. On the other hand, many researchers see it as a disadvantage that parameters must be specified, particularly in the case of diseases for which inheritance parameters are not well known. For such complex traits, nonparametric methods not requiring specification of a disease model are often preferred. Thus, strengths and weaknesses of the lod score method may be discussed by comparing its properties with those of nonparametric methods for linkage analysis. In classical statistics, nonparametric methods refer to approaches in which observations are replaced by their ranks. In nonparametric human linkage analysis, disease inheritance is replaced by inheritance of markers hypothesized to be close to disease loci.
The sections that follow indicate that in many cases parameters may be treated as nuisance parameters so that their actual values in the analysis are not crucial. In addition, it is shown that many if not all of the well-known "nonparametric" methods have a one-to-one correspondence to the lod score method. Thus, the lod score method in a wider sense comprises many of the methods that are sometimes seen as avoiding the need for parameter specification. Overall, there is no sharp boundary between "the lod score method" and most other methods of linkage analysis.
III. GENES UNDERLYING COMPLEX TRAITS

For Mendelian traits with (approximately) known inheritance parameters such as disease allele frequencies and penetrances, the lod score method is undoubtedly optimal. This statement is based on the known optimality properties of likelihood methods. So, the question of the relative merit of the lod score method arises mostly in connection with genes underlying complex traits. Such traits are thought to be due to multiple interacting genes of varying effect sizes. Implicitly or explicitly, essentially all current analysis methods imply a single disease gene in any one family, but allowance may be made for heterogeneity, that is, different disease genes occurring in different families. Therefore, any parametric model specification is necessarily "incorrect" for complex traits. Despite this shortcoming, the lod score method may profitably be applied for localizing complex trait genes, but researchers must be aware of possible pitfalls.
IV. WEAKNESSES OF THE LOD SCORE METHOD

The main weakness of the lod score method is its requirement for exact specification of the mode of inheritance of the trait. Even in the case of Mendelian traits, it is well known that analyzing a recessive trait under the assumption of a dominant mode of inheritance (and vice versa) dramatically reduces power (see, e.g., Risch and Giuffra, 1992). Thus, it is generally recommended that linkage analysis for a complex trait be carried out at least twice, once assuming a recessive and once assuming a dominant inheritance model for the trait. It is then still a question of what penetrances should be specified. This is not easy to answer. Many researchers determine penetrances so that the Mendelian model predicts known trait characteristics (prevalence, recurrence risks to various degrees of relatives, etc.) as accurately as possible. However, it is unclear whether this estimation technique is very beneficial for a situation
in which pedigrees are ascertained to contain large numbers of affected individuals. Abreu et al. (1999) conducted a simulation study to test the relative merit of lod score analysis versus nonparametric linkage analysis for various assumed modes of complex inheritance. For the lod score analysis, they assumed fixed penetrances of 0 and 0.50 for nongenetic and genetic cases, respectively. Each analysis was carried out twice, once under recessive and once under dominant inheritance, and the higher of the two resulting lod scores was chosen as the test statistic, with a lod score of 0.30 subtracted from this maximum as a penalty for multiple testing [Kidd and Ott, 1984: two tests, log10(2) = 0.301]. The lod score analysis was generally superior to the nonparametric analysis. Thus, there appear to be easy ways to overcome this inherent weakness of the lod score method. The choice of penetrances by Abreu et al. (1999) represents a "strong" locus in the sense that the penetrance ratio for genetic versus nongenetic cases is infinite. For loci with a (true) small penetrance ratio of, say, 10 or 20, the choice of a large analysis penetrance ratio may not be optimal. Another method allows for this by treating various model parameters as nuisance parameters and constraining them to the known population prevalence. Nuisance parameters are those that are estimated in numerator and denominator of the likelihood ratio. Their estimates are not of direct interest, but they must be carried along because the analysis method requires this. Curtis and Sham (1995) have developed such a method and implemented it in a program called MFLINK. For specific assumed two-locus modes of inheritance, Schork et al. (1993) quantified the amount of information loss incurred through analysis under a ("wrong") single-locus analysis model. Information contents of two-locus and various single-locus analysis models were measured by the expected lod score, which was approximated by computer simulation.
The best single-locus analysis approach turned out to be a Mendelian model that allowed for heterogeneity of the admixture type as implemented in the HOMOG3R and HOMOGM computer programs (Ott, 1999; see also http://linkage.rockefeller.edu/ott/homog.htm). Nonetheless, depending on the true two-locus trait inheritance, even this analysis approach incurred information losses of up to 37%. This is not to say that nonparametric approaches would do better; they probably would not. The problem here is the complexity of the trait inheritance and perhaps the inadequacy of the current analysis techniques. The investigation of Schork et al. (1993) exhibited another weakness of the lod score method: for a gene underlying a complex trait, the recombination fraction to a nearby marker tends to be overestimated (see also Risch and Giuffra, 1992). Thus, what may be seen as a strength for Mendelian traits,
namely, the capability of the lod score method to accurately estimate map positions of a disease gene, turns out to be a weakness in the case of complex traits.
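The admixture (heterogeneity) model behind the HOMOG-type programs mentioned in this section can be sketched in a few lines. The following is an illustrative grid search, not the actual HOMOG code, and the per-family lod curves used in the example are hypothetical: each family is linked with probability alpha (contributing likelihood ratio 10 to the power of its lod) and unlinked otherwise.

```python
import math

def admixture_lod(fam_lods):
    """Maximized admixture ('HOMOG-type') heterogeneity lod.

    fam_lods: one dict per family, mapping each candidate theta to that
    family's ordinary lod score at theta.  With probability alpha a family
    is linked (likelihood ratio 10**lod) and with 1 - alpha it is not.
    """
    best = 0.0
    for theta in fam_lods[0]:
        for a in (i / 100 for i in range(101)):     # grid search over alpha
            total = sum(math.log10(a * 10 ** f[theta] + (1 - a))
                        for f in fam_lods)
            best = max(best, total)
    return best

# two families supporting linkage at theta = 0.05, one family against it
fams = [{0.05: 1.0}, {0.05: 1.0}, {0.05: -2.0}]
```

Under homogeneity (alpha = 1) the three lods in this toy example simply sum to zero, whereas allowing alpha < 1 recovers a clearly positive heterogeneity lod.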
V. STRENGTHS OF THE LOD SCORE METHOD

As outlined in Section IV, some apparent weaknesses of the lod score method may be overcome by suitable analysis techniques (Abreu et al., 1999). The main advantages of the lod score method as I see them are outlined here.
A. Expected lod score: A measure of informativeness

The expected lod score (Ott, 1985), or ELOD, is a summary measure for the linkage information content of a given body of data under a specified analysis model. It allows for an objective comparison of different analysis techniques and may, thus, be used for the planning of optimal experimental designs. For example, consider a phase-known double-intercross mating (see Ott, 1999, p. 238). All offspring exhibit two known recombination events (recombination or nonrecombination) except for an ambiguous offspring type, which is the result of either two recombinations or two nonrecombinations. If all data are analyzed by maximum likelihood, under tight linkage, the ELOD per offspring is given by 0.44 (Ott, 1999, Table 11.1). If ambiguous offspring are deleted from the analysis, the resulting ELOD is 0.30, reflecting a sizable information loss. On the other hand, if all ambiguous offspring are declared to be nonrecombinants, the ELOD is 0.60, which is even higher than the ELOD under straight maximum likelihood analysis of the original data. The latter approach leads to an asymptotic bias in the recombination fraction estimate, which, however, is minimal when linkage is tight.
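The three ELOD figures just quoted can be checked numerically. The sketch below (an illustrative calculation of ours, not code from Ott, 1999) enumerates the offspring of a phase-known AB/ab x AB/ab double intercross and computes the expected lod per offspring under the three treatments of the ambiguous double heterozygote:

```python
import math

def gametes(theta):
    """Gamete probabilities from a phase-known AB/ab double heterozygote."""
    return {"AB": (1 - theta) / 2, "ab": (1 - theta) / 2,
            "Ab": theta / 2, "aB": theta / 2}

def offspring_classes(theta):
    """Probabilities of the observable two-locus offspring genotypes."""
    probs = {}
    for h1, p1 in gametes(theta).items():
        for h2, p2 in gametes(theta).items():
            key = "".join(sorted(h1[0] + h2[0])) + "".join(sorted(h1[1] + h2[1]))
            probs[key] = probs.get(key, 0.0) + p1 * p2
    return probs

def elod(theta, keep_ambiguous=True, force_nonrec=False):
    """Expected lod score per offspring at true recombination fraction theta."""
    num, den = offspring_classes(theta), offspring_classes(0.5)
    total = 0.0
    for k, p in num.items():
        if p == 0.0:
            continue
        if k == "AaBb" and force_nonrec:
            # score the ambiguous type as if it were two nonrecombinants (AB + ab)
            lod = math.log10(((1 - theta) ** 2 / 2) / (1 / 8))
        elif k == "AaBb" and not keep_ambiguous:
            continue                      # drop ambiguous offspring entirely
        else:
            lod = math.log10(p / den[k])  # proper (mixture) likelihood
        total += p * lod
    return total

th = 0.001  # tight linkage
full = elod(th)                            # close to 0.45 (Ott's table: 0.44)
dropped = elod(th, keep_ambiguous=False)   # close to 0.30
forced = elod(th, force_nonrec=True)       # close to 0.60
```

The three numbers reproduce the pattern described in the text: dropping ambiguous offspring loses information, while forcing them to be nonrecombinants raises the ELOD at the price of a (here small) bias.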
B. Exploit known mode of disease inheritance

It is a simple known fact that in dominant inheritance, an offspring carrying a disease allele must have received it from either the mother or the father. In contrast, under recessive inheritance, an affected offspring carries two disease alleles, having received one from the mother and one from the father. It is important to discriminate between these two cases in the analysis, and the lod score method achieves this by design. On the other hand, the classical form of the affected sibpair (ASP) analysis treats both parents the same; that is, it assumes that both are potentially informative for linkage. For this reason, ASP analysis is much less powerful for dominant-like than for recessive-like traits.
C. Emulate affected sibpair analysis

The affected sibpair design focuses on pairs of affected siblings and their parents. The typical analysis consists of determining for each sibpair the number of marker alleles shared identically by descent (IBD). The only connection to disease is through the ascertainment of affected siblings. All calculations are carried out on marker alleles. Clearly, this method is nonparametric in the sense that it does not require specification of a disease inheritance model for the analysis. However, this approach implies an inheritance model. As shown by Hyer et al. (1991) and Knapp et al. (1994), the affected sibpair method is equivalent to the lod score method when the latter is carried out under recessive inheritance with full penetrance, with parental trait phenotypes taken to be unknown. In practice, more than two affected individuals may be encountered in a sibship. It then becomes desirable to combine results from the multiple pairs that can be formed from such multiplex sibships. Various weighting schemes have been proposed to this end, for example, by Suarez and Hodge (1979) and by Hodge (1984). Collins and Morton (1995) recommend using all pairs in an unweighted manner but with significance levels established by computer simulation. The equivalence between ASP and lod score analysis offers a simple and "natural" solution to the multiple dependent pair problem: in linkage analysis, it is straightforward to evaluate multiple affected siblings. This approach has been implemented in Dr. Joseph Terwilliger's ANALYZE program, which emulates ASP analysis through lod score analysis.
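For comparison, the classical ASP "means" test that the lod score emulation reproduces can be sketched in a few lines. This is an illustrative version assuming IBD sharing can be counted unambiguously for every pair:

```python
import math

def asp_means_test(n0, n1, n2):
    """One-sided 'means' test for excess IBD sharing in affected sibpairs.

    n0, n1, n2 are counts of pairs sharing 0, 1, or 2 marker alleles IBD.
    Under no linkage the sharing proportion pi has mean 1/2 and variance 1/8.
    """
    n = n0 + n1 + n2
    mean_pi = (0.5 * n1 + 1.0 * n2) / n
    return (mean_pi - 0.5) / math.sqrt(0.125 / n)

z = asp_means_test(20, 50, 30)   # compare with the standard normal upper tail
```

With the counts shown, the mean sharing proportion is 0.55 and the statistic is about 1.41, not significant by the usual normal-theory criterion.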
D. Unified model for linkage and association

A rather comprehensive approach to emulating nonparametric linkage and disequilibrium analysis has been proposed by Terwilliger as follows (Trembath et al., 1997). A test statistic is defined, which consists of three components: (1) linkage within sibships, (2) linkage between sibships, and (3) association between pedigrees. Linkage within sibships is evaluated as outlined in the preceding section, that is, through emulation by lod score analysis under a recessive model. Similarly, with appropriate modifications, the second component may be emulated by a dominant mode of inheritance (distantly related individuals share at most one allele IBD). For a detailed description of the method and associated software, see Göring and Terwilliger (2000).
E. Build errors into analysis

Pedigree errors are a severe problem whose importance and impact often are insufficiently recognized (see Ott, 1999, Chapter 11.5). An "error" may be any
change from the true state of nature: for example, a genotype misreading, sample swap, or unrecognized nonpaternity. Here, the focus is on errors leading to changes in marker genotypes. Several authors have proposed methods to identify pedigrees and/or individuals with marker errors (Lathrop et al., 1983; Lincoln and Lander, 1992; Brzustowicz et al., 1993; Ott, 1993; Sasse et al., 1994; Stringham and Boehnke, 1996; Ehm et al., 1996). With the increasing use of single-nucleotide polymorphism (SNP) markers in small families, the question of error detection has gained renewed interest. Consider "trio" families, that is, two parents and a child each, and an SNP marker with alleles 1 and 2. Assume a simple error model such that, with a rate ε, the 1 allele is seen as a 2 allele, or the 2 allele is seen as a 1 allele. An error is taken to be observed if it leads to a Mendelian inconsistency, for example, to a child with genotype 2/2 while a parent has genotype 1/1. Gordon et al. (1999, 2000) showed analytically that under these conditions, error detection rates range between only 25 and 30%, the detection rate being lowest when the two marker alleles have equal frequencies. The true error rate is roughly 3.3-4 times higher than the apparent error rate. The greatest discrepancy between true and apparent error rates occurs when allele frequencies are equal. Because the amount of genotyping for SNP loci is expected to increase dramatically, it may eventually become impractical to look up and try to eliminate each and every error. Instead, it may be more economical to build errors into the analysis as originally proposed by Keats et al. (1990). The possibility of allowing for errors in the analysis is a major strength of the lod score method and may become very important in the future.
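The detection-rate result can be reproduced by simulation. The sketch below (illustrative code of ours, not the analytic derivation of Gordon et al.) applies the stated allele-flip error model to simulated trios and measures what fraction of errant trios show up as a Mendelian inconsistency:

```python
import random

def consistent(child, father, mother):
    """Can the child's genotype be formed from one paternal and one maternal allele?"""
    return any(tuple(sorted((a, b))) == child for a in father for b in mother)

def detection_rate(p=0.5, eps=0.01, n=200_000, seed=1):
    """Fraction of trios containing >= 1 misread allele that are detected
    as a Mendelian inconsistency (SNP alleles coded 1 and 2)."""
    rng = random.Random(seed)
    geno = lambda a, b: tuple(sorted((a, b)))
    draw = lambda: geno(*(1 if rng.random() < p else 2 for _ in range(2)))
    flip = lambda g: geno(*(3 - a if rng.random() < eps else a for a in g))
    errant = detected = 0
    for _ in range(n):
        f, m = draw(), draw()                       # parents under HWE
        c = geno(rng.choice(f), rng.choice(m))      # true child genotype
        of, om, oc = flip(f), flip(m), flip(c)      # observed genotypes
        if (of, om, oc) != (f, m, c):               # trio contains an error
            errant += 1
            if not consistent(oc, of, om):          # error is detectable
                detected += 1
    return detected / errant
```

At equal allele frequencies the detected fraction comes out near 25%, i.e., roughly four true errors for every apparent one, consistent with the figures quoted above.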
Acknowledgments

This work was supported by grant HG00008 from the National Human Genome Research Institute.
References

Abreu, P. C., Greenberg, D. A., and Hodge, S. E. (1999). Direct power comparisons between simple lod scores and NPL scores for linkage analysis in complex diseases. Am. J. Hum. Genet. 65, 847-857.
Brzustowicz, L. M., Merette, C., Xie, X., Townsend, L., Gilliam, T. C., and Ott, J. (1993). Molecular and statistical approaches to the detection and correction of errors in genotype databases. Am. J. Hum. Genet. 53, 1137-1145.
Collins, A., and Morton, N. E. (1995). Nonparametric tests for linkage with dependent sib pairs. Hum. Hered. 45, 311-318.
Curtis, D., and Sham, P. C. (1995). Model-free linkage analysis using likelihoods. Am. J. Hum. Genet. 57, 703-716.
Ehm, M. G., Kimmel, M., and Cottingham, R. W., Jr. (1996). Error detection for genetic data, using likelihood methods. Am. J. Hum. Genet. 58, 225-234.
Gordon, D., Heath, S., and Ott, J. (1999). True pedigree errors more frequent than apparent errors for single nucleotide polymorphisms. Hum. Hered. 49, 65-70.
Gordon, D., Leal, S. M., Heath, S. C., and Ott, J. (2000). An analytic solution to single nucleotide polymorphism error-detection rates in nuclear families: Implications for study design. In "Proceedings of Pacific Symposium on Biocomputing 2000" (R. B. Altman, A. K. Junker, L. Hunter, K. Lauderdale, and T. E. Klein, eds.), pp. 663-674.
Göring, H. H. H., and Terwilliger, J. D. (2000). Linkage analysis in the presence of errors. IV. Joint pseudomarker analysis of linkage and/or linkage disequilibrium on a mixture of pedigrees and singletons when the mode of inheritance cannot be accurately specified. Am. J. Hum. Genet. 66, 1310-1327.
Hodge, S. E. (1984). The information contained in multiple sibling pairs. Genet. Epidemiol. 1, 109-122.
Hyer, R. N., Julier, C., Buckley, J. D., Trucco, M., Rotter, J., Spielman, R., Barnett, A., Bain, S., Boitard, C., Deschamps, I., Todd, J. A., Bell, J. I., and Lathrop, G. M. (1991). High-resolution linkage mapping for susceptibility genes in human polygenic disease: Insulin-dependent diabetes mellitus and chromosome 11q. Am. J. Hum. Genet. 48, 243-257.
Keats, B. J. B., Sherman, S. L., and Ott, J. (1990). Human Gene Mapping 10.5: Report of the Committee on Linkage and Gene Order. Cytogenet. Cell Genet. 55, 387-394.
Kidd, K. K., and Ott, J. (1984). Power and sample size in linkage studies. Cytogenet. Cell Genet. 37, 510-511.
Knapp, M., Seuchter, S. A., and Baur, M. P. (1994). Linkage analysis in nuclear families. 2. Relationship between affected sib-pair tests and lod score analysis. Hum. Hered. 44, 44-51.
Lathrop, G. M., Hooper, A. B., Huntsman, J. W., and Ward, R. H. (1983). Evaluating pedigree data. I. The estimation of pedigree error in the presence of marker mistyping. Am. J. Hum. Genet. 35, 241-262.
Lincoln, S. E., and Lander, E. S. (1992). Systematic detection of errors in genetic linkage data. Genomics 14, 604-610.
Morton, N. E. (1955). Sequential tests for the detection of linkage. Am. J. Hum. Genet. 7, 277-318.
Ott, J. (1985). "Analysis of Human Genetic Linkage," 1st ed. Johns Hopkins University Press, Baltimore.
Ott, J. (1993). Detecting marker inconsistencies in human gene mapping. Hum. Hered. 43, 25-30.
Ott, J. (1999). "Analysis of Human Genetic Linkage," 3rd ed. Johns Hopkins University Press, Baltimore.
Risch, N., and Giuffra, L. (1992). Model misspecification and multipoint linkage analysis. Hum. Hered. 42, 77-92.
Sasse, G., Müller, H., Chakraborty, R., and Ott, J. (1994). Estimating the frequency of nonpaternity in Switzerland. Hum. Hered. 44, 337-343.
Schork, N. J., Boehnke, M., Terwilliger, J. D., and Ott, J. (1993). Two-trait-locus linkage analysis: A powerful strategy for mapping complex genetic traits. Am. J. Hum. Genet. 53, 1127-1136.
Stringham, H. M., and Boehnke, M. (1996). Identifying marker typing incompatibilities in linkage analysis. Am. J. Hum. Genet. 59, 946-950.
Suarez, B. K., and Hodge, S. E. (1979). A simple method to detect linkage for rare recessive diseases: An application to juvenile diabetes. Clin. Genet. 15, 126-136.
Trembath, R. C., Clough, R. L., Rosbotham, J. L., Jones, A. B., Camp, R. D. R., Frodsham, A., Browne, J., Barber, R., Terwilliger, J., Lathrop, G. M., and Barker, J. N. W. N. (1997). Identification of a major susceptibility locus on chromosome 6p and evidence for further disease loci revealed by a two stage genome-wide search in psoriasis. Hum. Mol. Genet. 6, 813-820.
Overview of Model-free Methods for Linkage Analysis

Robert C. Elston¹
Department of Epidemiology and Biostatistics
Case Western Reserve University
Cleveland, Ohio 44101

¹To whom correspondence should be addressed.
Heather J. Cordell
Department of Medical Genetics
Cambridge Institute for Medical Research
Cambridge, England
I. Summary
II. Introduction
III. Estimating Marker Identity by Descent
IV. Quantitative Traits
V. Qualitative Traits
VI. Discussion
References
I. SUMMARY

Methods of model-free linkage analysis do not require a detailed specification for the mode of inheritance of the trait locus being linked. Beginning with methods proposed by Penrose in the 1930s, which allowed detection of linkage only, these methods now allow one to use multipoint analysis both to locate trait genes and to estimate variance components that give information on the genetic mechanism underlying the trait. The newer methods can utilize data on multiple types of pairs of relatives other than just sibpairs, and they can detect multiple trait loci. In combination with special sampling schemes, these methods give hope
that they may play a crucial role in unraveling the genetic etiology of multifactorial traits, regardless of whether epistatic interactions are present. The results of such analyses can guide the use of more powerful model-based linkage analyses.
II. INTRODUCTION

Model-free methods of linkage analysis are those methods for which there is no need to specify a mode of inheritance for the trait being linked to one or more markers. They thus differ from model-based methods, for which the mode of inheritance must be specified by a set of penetrance functions, one for each genotype assumed to underlie the trait being analyzed, and genotype frequencies. Of course, all methods of analysis that test hypotheses assume a probability model. But whereas model-based methods typically make simplifying assumptions (such as that the trait is determined solely by segregation at a single diallelic locus) in order to make likelihood computations simple, model-free methods ignore detailed aspects of any model underlying the trait mode of inheritance and base inference on the correlation between relatives' similarity with respect to the trait and their similarity with respect to one or more markers. Model-based methods are sometimes called "lod score" methods, a term that is equally applicable to some model-free methods, and model-free methods have been called "nonparametric" and "robust," even though they involve parameters and can share with model-based methods many robust properties, especially with respect to test validity (Elston, 1998). Chapter 8 by Rice et al., on the lod score method, provides interesting additional discussion on model-free methods and their correspondence to lod scores. Model-free methods of linkage analysis were first proposed by Penrose for discrete traits (1935) and for quantitative traits (1938). These methods were based on marker identity in state, the marker similarity between two relatives being based solely on their similarity with respect to marker phenotypes. More modern methods are based on marker identity by descent, that is, the number of alleles at a marker locus that are shared by virtue of being identical copies of the same ancestral allele.
Methods based on identity by descent are more powerful, and so methods based on marker identity in state are only very briefly considered further here. It should be noted that these two measures of marker similarity, based on identity by descent and identity in state, respectively, approach each other as a marker becomes more and more polymorphic. We begin by discussing the estimation of marker identity by descent.
III. ESTIMATING MARKER IDENTITY BY DESCENT

Early methods of model-free linkage analysis, as well as most current methods, use samples of full siblings. If the sibs' parents are also typed for a marker, or if a
sufficient number of sibs are typed to permit the parental marker genotypes to be deduced, it is often possible to count the number of alleles a sibpair shares identical by descent (IBD). The early methods of using this information eliminated from the sample any pair of sibs for whom it was not possible to actually count the number of alleles shared IBD. For a very polymorphic marker system, such as HLA, this led to only small biases and loss of information; however, for a diallelic marker this is not the case. Haseman and Elston (1972) proposed estimating the IBD allele sharing probabilities at a marker locus by utilizing the marker information available on the sibs and their parents. Let f_i be the prior probability that a pair of sibs share, by virtue of their relationship alone, i alleles IBD. Thus for full sibs, f_0 = 1/4, f_1 = 1/2, f_2 = 1/4. Then by Bayes's theorem, the posterior probability that the sibs share i alleles IBD, given the available marker information I_m, is simply

    f̂_i = f_i P(I_m | the sibs share i alleles IBD) / P(I_m),    (11.1)

where P(I_m) = Σ_{k=0}^{2} f_k P(I_m | the sibs share k alleles IBD). Haseman and Elston (1972) considered only the availability of marker information on a pair of full sibs and up to two parents, but the same general expression can be used when other relatives are present, and pairs of relatives other than sibs can be easily accommodated (Amos et al., 1990). In general, these estimates f̂_i can depend on having accurate estimates of the marker genotype relative frequencies, but, provided both parents and the sibs are genotyped, they are independent of the genotype frequencies. The proportion of marker alleles that a sibpair shares IBD, π (which in fact can take on only the values 0, 1/2, or 1), is then estimated by π̂ = f̂_2 + f̂_1/2. It can be shown that, estimated this way, π̂ and π have the highest correlation possible for a single marker (Haseman, 1971). Kruglyak et al. (1995) and Kruglyak and Lander (1995) showed how the IBD allele sharing probabilities can be estimated in a multipoint fashion with greater accuracy (i.e., to obtain a higher correlation between π̂ and π) at each marker location and how they can be estimated at any intervening point along the genome. They based their computational method on the Lander-Green (1987) algorithm, which is eminently suitable for multipoint analysis in small families because with it the amount of computation increases only linearly in the number of marker locations, though exponentially in the size of the family. They implemented this algorithm in the computer program package MAPMAKER/SIBS. The full speed of the algorithm predicted by Lander and Green (1987) was not achieved for another decade, by Idury and Elston (1997), whose algorithm is implemented in the S.A.G.E. (1999) program GENIBD. An advantage of this implementation is that it involves virtually no
increase in computational time if the recombination fractions are made meiosis specific (e.g., dependent on the age or sex of the transmitting parent). However, the speed of the Lander-Green algorithm is critically dependent on the assumption of no linkage interference (i.e., independence of crossover events along the chromosome). Fulker et al. (1995) derived an alternative multipoint algorithm for estimating π_q, the proportion of the alleles IBD at any location q, from the single point estimates of IBD at a set of m marker loci on the same chromosome. Their method is based on a multiple linear regression of π_q on the m estimates π̂_1, π̂_2, ..., π̂_m for the m marker loci, using the fact that for any two loci i and j with recombination fraction θ_ij between them,

    E[Cov(π̂_i, π̂_j)] = √(V(π̂_i) V(π̂_j)) (1 − 2θ_ij)².
Their regression equation provides a multipoint estimate of π_q at any location other than the locations for which marker information is available; the estimates at those locations are the m estimates themselves. Thus the Lander-Green algorithm can be used to obtain the multipoint estimates at each marker location, and then the regression method can be applied to these estimates to obtain a very fast and good approximation to true multipoint estimates at all chromosomal locations on the assumption of no interference. Essentially the same method can be used to obtain very fast and good multipoint estimates of the f̂_i at all locations. Furthermore, it is very easy to incorporate into the Fulker et al. (1995) regression any mapping function, to allow for interference, provided the same mapping function is used to obtain the multipoint IBD probabilities at the marker locations themselves. Unfortunately, although this is possible (Lin and Speed, 1996), a fast algorithm to do it has not yet been developed. We now turn to the various model-free methods proposed for detecting linkage at any chromosomal location once the estimates f̂_i and π̂ at this location have been obtained.
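Equation (11.1) is easy to realize in code for a single fully typed nuclear family. The sketch below (illustrative only; it is not the GENIBD or MAPMAKER/SIBS implementation) enumerates which parental allele copies each sib received under each sharing state:

```python
from itertools import product

def geno(a, b):
    """Unordered genotype from two allele labels."""
    return tuple(sorted((a, b)))

def lik_given_ibd(father, mother, s1, s2, i):
    """P(observed sib genotypes s1, s2 | the sibs share i alleles IBD),
    with parental genotypes given as pairs of allele labels."""
    total = n = 0
    for fi, mi in product(range(2), range(2)):   # copies transmitted to sib 1
        if i == 2:                    # sib 2 shares both transmitted copies
            configs = [(fi, mi)]
        elif i == 0:                  # sib 2 gets the other copy from each parent
            configs = [(1 - fi, 1 - mi)]
        else:                         # i == 1: shared parent is father or mother
            configs = [(fi, 1 - mi), (1 - fi, mi)]
        for fj, mj in configs:
            n += 1
            total += (geno(father[fi], mother[mi]) == geno(*s1)
                      and geno(father[fj], mother[mj]) == geno(*s2))
    return total / n

def ibd_posterior(father, mother, s1, s2, priors=(0.25, 0.5, 0.25)):
    """Eq. (11.1): posterior IBD-sharing distribution for a full sibpair."""
    num = [priors[i] * lik_given_ibd(father, mother, s1, s2, i) for i in range(3)]
    z = sum(num)
    return [x / z for x in num]

# fully informative mating 1/2 x 3/4: sibs both 1/3 must share 2 alleles IBD
f_hat = ibd_posterior((1, 2), (3, 4), (1, 3), (1, 3))
pi_hat = f_hat[2] + f_hat[1] / 2     # estimated sharing proportion
```

For an uninformative marker (all genotypes identical) the posterior simply returns the prior (1/4, 1/2, 1/4), as it should.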
IV. QUANTITATIVE TRAITS

Letting x_1 and x_2 be the trait values for a pair of sibs, Haseman and Elston (1972) proposed regressing the squared sibpair trait difference, (x_1 − x_2)², on π̂. The expected slope of the regression line is −2(1 − 2θ)²σ²_g, where θ is the recombination fraction between the marker and a trait locus at which the genetic variance is σ²_g. (This assumes that only one such trait locus is linked to the marker, but the Haseman-Elston paper also gave an extension to allow for several trait loci acting nonepistatically.) If no linked trait locus exists, θ = 1/2
and the regression coefficient is expected to be 0. On the other hand, whenever a linked trait locus exists, the regression is expected to be negative. Similarity at the trait locus gives rise to a small value of (x_1 − x_2)², and when there is linkage this would be correlated with similarity at the marker locus, indicated by a relatively larger estimated probability of sharing marker alleles IBD. Conversely, dissimilarity at the trait locus, or a large value of (x_1 − x_2)², would be correlated with dissimilarity at the marker locus, indicated by a smaller estimated probability of sharing alleles IBD. Asymptotically the estimated regression coefficient is normally distributed (even though the dependent variable is not normally distributed), and so, especially in large samples, the usual one-sided t-test for a negative slope can be applied. At the location for which θ = 0, the slope of the regression line is −2σ²_g. It is sometimes stated that the Haseman-Elston test requires the assumption of a normally distributed trait and/or a random sample of sibpairs. Neither of these suppositions is true; in fact, the first application of the method was to a binary disease trait, affected or unaffected, measured on a sample of dizygotic twins ascertained via affected probands (Elston et al., 1973). A binary trait can be quantified by assigning any value a to the affected individuals, and any different value b to the unaffected individuals. Thus the squared sibpair trait difference is 0 for concordant pairs (whether affected or unaffected) and greater than 0 for discordant pairs; and testing whether the regression of this squared difference on π̂ is negative is identical to testing whether the mean value of π̂ for concordant pairs is greater than that for discordant pairs.
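The original Haseman-Elston procedure can be illustrated end to end with simulated data. The sketch below (all parameter values are arbitrary choices for illustration) generates sibpairs at a fully informative marker tightly linked to an additive diallelic trait locus and regresses the squared trait difference on the sharing proportion:

```python
import random

def simulate_pairs(n, theta=0.0, a=1.0, sd=1.0, seed=7):
    """Sibpairs at a fully informative marker linked (recombination fraction
    theta) to an additive diallelic trait locus; returns (pi, sqdiff) pairs."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        # each parental chromosome: (trait allele 0/1, unique marker label)
        parents = [[(rng.randint(0, 1), 2 * p + c) for c in range(2)]
                   for p in (0, 1)]
        def gamete(par):
            c = rng.randint(0, 1)
            t, m = par[c]
            if rng.random() < theta:       # recombination swaps the marker copy
                m = par[1 - c][1]
            return t, m
        sib1 = [gamete(par) for par in parents]
        sib2 = [gamete(par) for par in parents]
        x1 = a * sum(t for t, _ in sib1) + rng.gauss(0, sd)
        x2 = a * sum(t for t, _ in sib2) + rng.gauss(0, sd)
        ibd = sum(m1 == m2 for (_, m1), (_, m2) in zip(sib1, sib2))
        pairs.append((ibd / 2, (x1 - x2) ** 2))
    return pairs

def slope(pairs):
    """OLS slope of the squared sib difference on the IBD proportion."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sum((x - mx) * (y - my) for x, y in pairs) / sxx

b = slope(simulate_pairs(5000))   # clearly negative, signaling linkage
```

With these parameter choices the locus-specific additive variance is 2(0.5)(0.5)a² = 0.5, so at θ = 0 the expected slope is −2σ²_g = −1, and the fitted slope comes out near that value.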
Indeed, because the mean value of π̂ is expected to be 1/2 for all pairs if there is no linkage, we can formulate a large sample test based only on concordant pairs (with alternative mean π̂ > 1/2) or only on discordant pairs (with alternative mean π̂ < 1/2). Similarly, under the null hypothesis of no linkage, ascertainment of a sample on the basis of the trait alone will not induce a negative regression, and so the test remains valid for samples ascertained in this way. An extension to incorporate in the same analysis the squared trait differences between other pairs of relatives followed. For all other pairs of relatives who can be informative for linkage, the regression on π̂ is 0 if there is no linked trait and less than 0 if there is a linked trait (Amos and Elston, 1989). To allow for the various correlations between different pairs in the same pedigree, Olson and Wijsman (1993) developed general estimating equations. They suggested two asymptotic tests: one in which a separate slope is estimated for each type of relative pair, and a weighted average of these is used as the test statistic, and one in which a single common slope is estimated. It is noted in the original Haseman and Elston (1972) paper, as well as many papers extending the method, that one can include an additional regressor to detect dominant variance at the linked locus. Thus the original formulation was to include both π̂ and f̂_2 as regressors, in which case nonzero regression
140
Elston and Cordell
coefficients would be detecting total genetic variance (σ_g²) and dominance genetic variance (σ_d²), respectively, due to the linked trait locus. The regression equation can easily be extended to include, for example, π̂₁ and f̂₂₁ for a first location, and π̂₂ and f̂₂₂ for a second location, to detect two different trait loci. It is then in principle a simple matter to include four further regressors, π̂₁π̂₂, π̂₁f̂₂₂, f̂₂₁π̂₂, and f̂₂₁f̂₂₂, to pick up epistatic components of variance: additive × additive, additive × dominant, dominant × additive, and dominant × dominant (Elston, 1995; Tiwari and Elston, 1997). Such components of variance are not necessarily small (Tiwari and Elston, 1998), and they can be detected in special samples, as we indicate shortly for discrete traits. For the jth pair of sibs, with trait values x₁ⱼ and x₂ⱼ, the original Haseman and Elston regression can be written, for a marker at the trait locus:
E[(x₁ⱼ − x₂ⱼ)²] = f̂₀ⱼ(2σ² − 2α) + f̂₁ⱼ(2σ² − 2α − σ_a²) + f̂₂ⱼ(2σ² − 2α − 2σ_g²)
              = 2σ² − 2α − 2π̂ⱼσ_a² − 2f̂₂ⱼσ_d²,

where the last line follows from the preceding line from the facts that f̂₀ⱼ + f̂₁ⱼ + f̂₂ⱼ = 1 and the total locus-specific genetic variance σ_g² is the sum of additive (σ_a²) and dominance (σ_d²) components: σ_g² = σ_a² + σ_d²; α denotes the residual sib covariance. Here μ is any constant, but we shall take it to be the mean of all the sibs' trait values. Now it can be shown (Drigalenko, 1998) that for the squared sum of the mean-corrected trait values we can write:

E[{(x₁ⱼ − μ) + (x₂ⱼ − μ)}²] = 2σ² + 2α + 2π̂ⱼσ_a² + 2f̂₂ⱼσ_d².
It follows, subtracting the squared difference from the squared sum and dividing by 4, that we have

E[(x₁ⱼ − μ)(x₂ⱼ − μ)] = α + π̂ⱼσ_a² + f̂₂ⱼσ_d²,
where α is the resulting intercept. Thus, if we change the dependent variable in the Haseman-Elston regression from the squared sibpair difference to the sibs' mean-corrected cross-product, we can model the sib covariance directly to find, as the coefficients of π̂ⱼ and f̂₂ⱼ, respectively, the locus-specific additive genetic and dominance genetic components of variance; and, in this regression equation, α estimates all the residual sib covariance (whether genetic or environmental) that is not due to the trait locus assumed to be at the location at which π̂ⱼ and
11. Model-free Methods for Linkage Analysis
141
f̂₂ⱼ are estimated. This is the basis of a new Haseman-Elston regression (Elston et al., 2000) that is comparable in power to the variance component method, a model-free method that is discussed in Chapter 12 of this book. Clearly, ascertaining each pair of sibs via a proband with an extreme trait value can be expected to increase the probability of trait locus segregation in the sample, and hence increase the power to detect linkage. For this situation, Carey and Williamson (1991) essentially proposed a multiple linear regression of the nonproband sib's trait value on both π̂ and the proband's trait value, and demonstrated that power is then increased by comparing the result to an analogous regression analysis performed on random sibpairs. For the latter regression analysis, the one sib's value was multiply regressed on π̂, the other sib's value, and the interaction between these two variables. The investigators showed that this interaction, or the difference in regression of the one sib's value on the other's for different values of π̂, is proportional to −2(1 − θ)²σ_a². Risch and Zhang (1995, 1996) showed that even more power can be attained by sampling extremely discordant sibpairs: sibs with very high values of the trait each paired with a sib with a very low value of the trait. Because of the minimal variation among the very high or very low values, the data then become virtually dichotomous, and we test whether the proportion of alleles the sibs share IBD is less than half. If both concordant affected and discordant sibpairs are analyzed together, Risch and Zhang proposed using the original Haseman-Elston regression method for a quantitative trait as the method of analysis.
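The revisited regression, with the mean-corrected cross-product as the dependent variable, amounts to an ordinary least-squares fit on π̂ⱼ and f̂₂ⱼ. The sketch below is illustrative only, with our own function names and a generic normal-equations solver, not the published implementation.

```python
def ols(X, y):
    """Least squares via the normal equations (Gauss-Jordan elimination);
    X is a list of rows, each beginning with a 1 for the intercept."""
    m, k = len(y), len(X[0])
    A = [[sum(X[i][a] * X[i][b] for i in range(m)) for b in range(k)]
         for a in range(k)]
    b = [sum(X[i][a] * y[i] for i in range(m)) for a in range(k)]
    for c in range(k):
        piv = A[c][c]
        A[c] = [v / piv for v in A[c]]
        b[c] /= piv
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
                b[r] -= f * b[c]
    return b

def cross_product_he(x1, x2, pi_hat, f2_hat):
    """Revisited Haseman-Elston sketch: regress the sibs' mean-corrected
    cross-product on pi_hat and f2_hat.  The two slopes estimate the
    locus-specific additive and dominance variance components, and the
    intercept absorbs the residual sib covariance."""
    n = len(x1)
    mu = (sum(x1) + sum(x2)) / (2 * n)        # mean of all sibs' trait values
    y = [(a - mu) * (b - mu) for a, b in zip(x1, x2)]
    X = [[1.0, p, f] for p, f in zip(pi_hat, f2_hat)]
    return ols(X, y)        # [intercept, additive estimate, dominance estimate]
```

With a fully determined toy design the fit is exact, which makes the coefficients easy to check by hand.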
V. QUALITATIVE TRAITS

As just mentioned, it is possible to consider dichotomous traits (e.g., those resulting in a phenotype that is "affected" or "unaffected") as quantitative simply by assigning different values to the phenotypes affected and unaffected. Linkage analysis can then be carried out by using the methods for quantitative traits described earlier, although care may be necessary to ensure that the assumptions of a given method are not violated. Historically, however, qualitative traits have usually been considered differently from quantitative traits, leading to the development of separate methodologies that differ widely in approach and involve different types of statistical analysis. The main difference between the approaches is that linkage analysis for quantitative traits involves studying trait phenotypes conditional on marker genotypes, or IBD sharing, whereas methods for qualitative traits usually involve studying the marker genotypes, or IBD sharing, conditional on the trait phenotypes. This has led to the development of methods that examine IBD sharing among affected individuals only, the idea being that individuals in a pedigree who have inherited disease alleles in common are likely to share genetic material in the region of the disease locus more
often than would be expected by chance. Analysis of only affected individuals has some advantages: statistically it reduces the degrees of freedom by effectively removing one penetrance parameter from the model; in addition, affected individuals often contribute most of the information to a study; and, finally, for many complex diseases, particularly those with late ages of onset, an affected person is more likely to possess a disease susceptibility allele than an unaffected person is likely not to possess one. In principle, however, we may also use information on IBD sharing among unaffected individuals, or indeed lack of sharing between discordant individuals, as a test of linkage. The first author to use only affected sibpairs in a model-free linkage test was Penrose (1953), and he was followed some years later by a number of authors (e.g., Day and Simons, 1976; Suarez et al., 1978; Fishman et al., 1978). The basic idea of this methodology was to categorize affected sibpairs from different families according to whether they share 0, 1, or 2 alleles IBD at a marker locus. Under the null hypothesis that the marker is not linked to a trait-causing locus, pairs are expected to fall into the categories corresponding to 0, 1, 2 sharing in the proportions (¼, ½, ¼). A test of linkage therefore tests whether the observed IBD sharing in the sample of affected sibpairs is consistent with these null proportions, against the alternative hypothesis that there is a skewing toward higher numbers of alleles shared IBD, as would be expected for the case of the marker being linked to a trait locus. Let (z₀, z₁, z₂) denote the true underlying probabilities that an affected sibpair shares 0, 1, 2 alleles IBD at a location. A number of different test statistics have been proposed for testing the null hypothesis that (z₀, z₁, z₂) = (¼, ½, ¼) when the marker location is fully informative, that is, when we can know exactly how many alleles are shared IBD by each sibpair.
Several of these were compared by Blackwelder and Elston (1985), who concluded that the "mean" test, which is based on the mean number of alleles shared IBD, ẑ₁ + 2ẑ₂, where (ẑ₀, ẑ₁, ẑ₂) denotes sample estimates of (z₀, z₁, z₂), is generally more powerful than either the "proportion" test based on ẑ₂ (the proportion of sibpairs with two marker alleles IBD) or an overall chi-square goodness-of-fit test with two degrees of freedom (df). Note that the power of the various statistics, although model-free, will depend on the underlying (unknown) true genetic model, so that the performance of a test is influenced by the true mode of inheritance of the disease being studied. Knapp et al. (1994a) showed that the mean test is uniformly most powerful under a multiplicative monogenic mode of inheritance. If we let δᵢ be the probability that a person with i disease alleles at a locus is affected, then the multiplicative assumption is δ₁² = δ₀δ₂; a special case of this is recessive inheritance in the absence of sporadic cases (δ₀ = δ₁ = 0). Schaid and Nick (1990) proposed using the maximum of the mean and proportion statistics as a method that is more robust than using either test individually.
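For fully informative markers both statistics reduce to simple calculations on the observed IBD counts. The following sketch (function names ours) illustrates the two tests under the null proportions (¼, ½, ¼); it is an illustration of the idea, not any published program.

```python
import math

def mean_test(n0, n1, n2):
    """'Mean' test for affected sib pairs with fully informative IBD data:
    compares the total number of alleles shared IBD with its null
    expectation of 1 allele per pair under proportions (1/4, 1/2, 1/4).
    Large positive Z favors linkage."""
    n = n0 + n1 + n2
    shared = n1 + 2 * n2          # n times (z1_hat + 2*z2_hat)
    # under the null: E[shared] = n, Var[shared] = n/2
    return (shared - n) / math.sqrt(n / 2)

def proportion_test(n0, n1, n2):
    """'Proportion' test based on z2_hat, the fraction of pairs sharing
    two alleles IBD (null mean 1/4, null variance (1/4)(3/4)/n)."""
    n = n0 + n1 + n2
    return (n2 / n - 0.25) / math.sqrt(0.25 * 0.75 / n)
```

For example, counts of (25, 50, 25) match the null proportions exactly and give Z = 0 for both tests, while an excess of pairs sharing two alleles gives a large positive mean-test statistic.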
More generally, the mean and proportion tests fall into a class of one-df tests with test statistics of the form

T = [ẑ₂ + w₁ẑ₁ − E(ẑ₂ + w₁ẑ₁)] / √V(ẑ₂ + w₁ẑ₁),    (11.2)
where E and V denote null expected value and variance, respectively. Setting w₁ = ½ results in the mean test, while setting w₁ = 0 gives the proportion test. Whittemore and Tu (1998) proposed a "minmax" test in which the weight w₁ takes the value 0.275, which is very similar to the test proposed by Feingold and Siegmund (1997), in which w₁ takes the value 0.25. These tests tend to perform slightly better than the mean test, unless the true genetic model is additive, in which case the mean test is optimal. An alternative to the one-df tests just described is to use a two-df likelihood ratio test (originally called the "maximum lod score" but also called the "maximum likelihood statistic"), as described by Risch (1989, 1990). This test is based on the ratio of the likelihoods for the observed data
L(data | ẑ₀, ẑ₁, ẑ₂) / L(data | 0.25, 0.5, 0.25),

where ẑ₀, ẑ₁, ẑ₂ now denote maximum likelihood estimates. These estimates will coincide with the usual sample estimates when all IBD sharing is known unambiguously. More generally, estimation by maximum likelihood techniques also allows inclusion of data when IBD sharing cannot be precisely determined, via calculation of the posterior probabilities f̂ᵢ, defined in Equation (11.1), that a sibpair shares 0, 1, or 2 alleles IBD. Note that the maximum likelihood estimates (ẑ₀, ẑ₁, ẑ₂) can also be used in the one-df tests. Faraway (1993) and Holmans (1993) showed that the power of the one-df and two-df tests can be increased by constraining the maximization so that the estimates (ẑ₀, ẑ₁, ẑ₂) lie in a "possible triangle" defined by ẑ₀ ≥ 0, ẑ₁ ≤ 0.5, ẑ₁ ≥ 2ẑ₀, which corresponds to values of (z₀, z₁, z₂) that are consistent with simple Mendelian inheritance. These constraints correspond to requiring the additive and dominant genetic variances to be positive when the disease trait is considered to be quantitative (Olson, 1997). The constrained likelihood ratio test is the test currently implemented in the computer program package MAPMAKER/SIBS (Kruglyak and Lander, 1995). Note that the two-df likelihood ratio test can be converted into a one-df likelihood ratio test by adding further constraints to the estimates (ẑ₀, ẑ₁, ẑ₂) (Risch, 1992; Morton, 1996). In general, the power of the two-df likelihood-based test appears to be somewhat lower than that of one-df tests (Collins et al., 1996; Whittemore and Tu, 1998). However, the likelihood-based approach has
some advantages in that it can be easily extended to the case of multilocus models, that is, situations in which a single trait is caused by multiple disease loci that may themselves be linked or unlinked (Knapp et al., 1994b; Dupuis et al., 1995; Cordell et al., 1995; Farrall, 1997; Olson, 1997; Cordell et al., 2000). When there are more than two affected individuals per family, it is of interest to consider IBD sharing either for all possible affected pairs in the family (i.e., affected individual 1 paired with affected individual 2, affected individual 1 paired with affected individual 3, affected individual 2 paired with affected individual 3, etc.), or IBD sharing among the whole set of related affected individuals in the pedigree. The advantage of the pairwise approach is that methods developed for the situation of a single affected pair per family can be used, although some modification may be required to take into account the nonindependence of affected pairs coming from the same family. For sibships, several weighting schemes have been proposed to downweight the contributions of pairs coming from larger families (e.g., Suarez and Hodge, 1979; Hodge, 1984; Sham et al., 1997). These weights should be chosen to maximize power. Under the null hypothesis of no linkage, the asymptotic type 1 error, at least of the tests of the form shown in Equation (11.2), will not be increased even when all possible pairs are included with equal weights. Note, however, that in small samples these tests have increased type 1 error (Abel et al., 1998), especially in a relative sense in the extreme tail of the distribution corresponding to P values ≤ 10⁻⁴ (Elston et al., 1996). For the likelihood ratio test, some of the proposed weighting schemes tend to be very conservative, whereas the use of equal weights, which is in many situations approximately optimal for up to five affected sibs, gives approximately valid type 1 error rates.
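The constrained two-df likelihood ratio ("maximum lod score") test described earlier can be illustrated by maximizing the multinomial likelihood of the observed IBD counts over the possible triangle. The grid search below is a simple illustrative device, not how MAPMAKER/SIBS computes the statistic, and the function name is ours.

```python
import math

def constrained_mls(n0, n1, n2, grid=200):
    """Sketch of the constrained two-df likelihood ratio test: maximize the
    multinomial log likelihood of the observed IBD counts over the possible
    triangle (z1 <= 1/2, z1 >= 2*z0) by grid search, and compare with the
    null sharing probabilities (1/4, 1/2, 1/4).  Returns the lod score,
    i.e., log10 of the likelihood ratio."""
    def loglik(z0, z1, z2):
        ll = 0.0
        for n, z in ((n0, z0), (n1, z1), (n2, z2)):
            if n:
                if z <= 0:
                    return float("-inf")
                ll += n * math.log(z)
        return ll
    null = loglik(0.25, 0.5, 0.25)
    best = null                      # the null point lies inside the triangle
    for i in range(grid + 1):
        for j in range(grid + 1 - i):
            z0, z1 = i / grid, j / grid
            if z1 <= 0.5 and z1 >= 2 * z0:        # possible triangle
                best = max(best, loglik(z0, z1, 1.0 - z0 - z1))
    return (best - null) / math.log(10)
```

Counts matching the null proportions give a lod of 0, while marked excess sharing gives a large lod; in real use the maximization is done analytically or numerically rather than on a grid.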
However, simulation is still recommended to generate accurate P values (Greenwood and Bull, 1999). Several alternative statistics have been proposed based on IBD sharing between pairs of affected members in an extended pedigree. Davis et al. (1996) proposed a statistic, SimIBD, that assigns a score to each affected pair based on the probability that the two affected individuals share a specific allele IBD. The scores are summed over all possible pairs of affected persons to obtain a score for the pedigree as a whole. The pedigree scores are then summed, and the significance of the overall similarity statistic is evaluated by using simulation conditional on the genotypes of the unaffected individuals in the pedigree. This is a generalization of the "Affected-Pedigree-Member" method of Weeks and Lange (1988), which uses a similar scoring system but is based on identity in state rather than IBD. The computer program GENEHUNTER (Kruglyak et al., 1996) uses a comparable statistic, S_pairs, proposed by Whittemore and Halpern (1994). Given the inheritance vector v at a location on the genome (i.e., the origin of each nonfounder's two alleles at that location), S_pairs is defined to be the number
of pairs of alleles from distinct affected pedigree members that are IBD at that location. We construct a normalized score for pedigree i, Zᵢ(v) = [S_pairs(v) − μ]/σ, where μ and σ are the mean and standard deviation of S_pairs under the null hypothesis of no gene for the trait linked to that location. In the case of incomplete IBD information, we take the expected value of S_pairs over the distribution of inheritance vectors [i.e., Z̄ᵢ = (S̄ − μ)/σ, where S̄ = Σ_w S_pairs(w)P(v = w)]. Here, the null standard deviation of Z̄ᵢ is approximated by that of Zᵢ(v), namely 1, which has been termed the "perfect data approximation." However, the null standard deviation of Zᵢ(v) will tend to be larger than that of Z̄ᵢ, so inferences based on Z̄ᵢ can be overly conservative (Kong and Cox, 1997). To combine scores among p pedigrees into an overall score Z, we take a linear combination Z = Σᵢ γᵢZ̄ᵢ, where the γᵢ are weighting factors. Kruglyak et al. (1996) proposed using equal weights for all pedigrees, γᵢ = 1/√p for all i, and comparing Z to a standard normal distribution, or computing an exact P value to test for linkage. It can be more informative to consider IBD sharing in larger sets of affected relatives, rather than just pairs. This motivated use of another scoring function, S_all, also introduced by Whittemore and Halpern (1994) and also implemented in GENEHUNTER. The method proceeds as already described, except that instead of using S_pairs, we use the function S_all, which gives higher scores to IBD configurations in which several individuals share the same allele IBD. Another method for sibship data, which considers all affected individuals simultaneously, is the maximum likelihood binomial method (Abel et al., 1998). This method follows on from work by De Vries et al.
(1976) and Green and Woodrow (1977), with the likelihood for a sibship being expressed as a binomial distribution parameterized by the probability α that an affected child receives allele A from a heterozygous AB parent. There is a direct relationship between α and the proportion of alleles shared IBD by the sibpairs. In fact, this method and the mean test are equivalent in the case of independent sibpairs. In simulations this method appears to have good power, with type I error close to that expected even for quite small sample sizes. For hypothesis testing, an alternative to the likelihood ratio test is the classical efficient score statistic (Cox and Hinkley, 1994). For regular statistical problems, a test based on the score statistic is asymptotically equivalent to the test based on the likelihood ratio statistic, with the advantage that the score statistic is often computationally simpler. Whittemore (1996) showed that in the case of completely informative data, the test statistic Z = Σᵢ γᵢZᵢ (which corresponds to the NPL, or nonparametric linkage, score output by GENEHUNTER) is, in fact, the efficient score statistic corresponding to the likelihood for the observed marker data given the trait phenotype data, on the assumption that the probability that the inheritance vector vᵢ for the ith pedigree takes configuration w can be written P(vᵢ = w | δ) = P(vᵢ = w)[1 + δγᵢZᵢ(w)].
Here δ is a parameter to be estimated, representing the magnitude of deviation from null sharing. Kong and Cox (1997) considered the same likelihood, but proposed using a likelihood ratio test of the null hypothesis δ = 0, instead of the score statistic. In the case of incomplete data, the likelihood ratio statistic is easier to compute than the score statistic, which requires multipoint simulation to determine the variance of Z̄ᵢ. Approximating the variance by means of the "perfect data approximation" can lead, as mentioned earlier, to very conservative results. The likelihood ratio approach of Kong and Cox (1997) leads to less conservative P values and thus greater power to detect linkage. It also produces lod score curves that conform in shape to traditional model-based lod score curves, unlike the NPL curves output by GENEHUNTER, which tend to decrease at locations between markers where the IBD information is less complete. The likelihood ratio statistic proposed by Kong and Cox (1997) has been implemented in the computer program package GENEHUNTER-PLUS. Kong and Cox (1997) also considered a different form of likelihood in which they assumed P(vᵢ = w | δ) = P(vᵢ = w)rᵢ(δ)exp{δγᵢZᵢ(w)}, where rᵢ is a normalization constant. This has the same score function as the likelihood of Whittemore (1996), but with the advantage that, when used for a likelihood ratio test, it allows larger deviations from null sharing to be modeled. McPeek (1999) has investigated the optimal choice of scoring function S (e.g., S_pairs, S_all) and weights γᵢ to obtain the asymptotically most powerful test using the likelihood ratio and score tests described earlier. The optimal choice will depend on the (unknown) true genetic model, but in general S_pairs appears to perform well over a variety of different models. This is consistent with results obtained by Kong and Cox (1997) and Davis and Weeks (1997). Kruglyak et al.
(1996) found some models in which S_all performed better, but the distribution of S_all is very skewed (McPeek, 1999), which can lead to problems with the normal approximation for calculating P values. McPeek (1999) also proposed a different scoring function, S_robdom, which performs well for dominant and additive models.
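The combination of per-pedigree normalized scores into the overall statistic Z = Σᵢ γᵢZ̄ᵢ can be sketched directly. This illustrative fragment (function name ours) uses the equal weights γᵢ = 1/√p of Kruglyak et al. (1996) and assumes the per-pedigree scores have already been computed from the inheritance-vector distribution.

```python
import math

def npl_overall(z_scores, weights=None):
    """Combine per-pedigree normalized scores Z_i into an overall statistic
    Z = sum_i gamma_i * Z_i.  With the equal weights gamma_i = 1/sqrt(p),
    Z has null mean 0 and variance 1 (under the perfect data approximation)
    and can be referred to a standard normal distribution."""
    p = len(z_scores)
    if weights is None:
        weights = [1.0 / math.sqrt(p)] * p
    z = sum(w * zi for w, zi in zip(weights, z_scores))
    pval = 0.5 * math.erfc(z / math.sqrt(2.0))   # one-sided normal P value
    return z, pval
```

Because of the conservativeness discussed above, in practice the Kong-Cox likelihood ratio version, or an exact P value, is usually preferred to the plain normal comparison.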
VI. DISCUSSION

With the wide range of methodologies currently available for model-free linkage analysis, it is natural to ask which method should be used for a given data set. This question is difficult to answer because, although these methods are model-free in that they do not require explicit specification of the underlying genetic model, each will have optimal power to detect linkage for a particular true underlying genetic model, which in the case of complex traits is unknown. The performance of a method can be studied by using simulation, as has been done by Davis and Weeks (1997), who compared the performance of a large
number of popular model-free methods. Sometimes the choice of method is motivated by the structure of the data collected (e.g., whether they consist solely of affected sibpairs), which may be decided on the basis of convenience of sampling rather than on statistical grounds. Recently, several authors have investigated different ways of conditioning on previously known effects as a means to increase power for detecting lesser effects. For example, in the case of insulin-dependent (type 1) diabetes, conditioning on the strong HLA region effect increases power to detect linkage in other regions (Davies et al., 1994; Cordell et al., 1995; Farrall, 1997; Cox et al., 1999; Cordell et al., 2000). These methods hold promise for unraveling the genetics of complex diseases in the presence of multiple interacting disease loci. Dudoit and Speed (1999, 2000) have explored the use of score statistics in a unified likelihood-based approach to the analysis of either quantitative or qualitative traits. They considered a likelihood parameterized by the recombination fraction θ between the marker location and disease locus, and by nuisance parameters representing the parameters of the genetic model (such as penetrances and parental disease genotype frequencies). They showed that for independent affected sibpairs, a score test in θ leads to a test based on the usual mean statistic, ẑ₁ + 2ẑ₂, while for larger affected sibships of a given size, the linkage score statistic is equivalent to the statistic S_pairs. In the case of a qualitative trait, the score statistic proposed by Dudoit and Speed (1999, 2000) does not depend on the values of the nuisance parameters (i.e., parameters of the true genetic model), in contrast to the methods of Whittemore (1996), Kong and Cox (1997), and McPeek (1999), which have some dependence on the underlying genetic model.
In general, once enough is known about the number of trait loci and how they interact, it should be possible to conduct a full model-based analysis that will make best use of the maximum amount of information possible.
Acknowledgments

This work was supported in part by U.S. Public Health Service research grant GM28356 from the National Institute of General Medical Sciences and resource grant RR03655 from the National Center for Research Resources.
References

Abel, L., Alcais, A., and Mallet, A. (1998). Comparison of four sib-pair linkage methods for analyzing sibships with more than two affecteds: Interest of the binomial maximum likelihood approach. Genet. Epidemiol. 15, 371-390.
Amos, C. I., and Elston, R. C. (1989). Robust methods for the detection of genetic linkage for quantitative data from pedigrees. Genet. Epidemiol. 6, 349-360.
Amos, C., Dawson, D. V., and Elston, R. C. (1990). The probabilistic determination of identity-by-descent sharing for pairs of relatives from pedigrees. Am. J. Hum. Genet. 47, 842-853.
Blackwelder, W. C., and Elston, R. C. (1985). A comparison of sib-pair linkage tests for disease susceptibility loci. Genet. Epidemiol. 2, 85-97.
Cardon, L. R., and Fulker, D. W. (1994). The power of interval mapping of quantitative trait loci using selected sib-pairs. Am. J. Hum. Genet. 55, 825-833.
Carey, G., and Williamson, J. (1991). Linkage analysis of quantitative traits: Increased power by using selected samples. Am. J. Hum. Genet. 49, 786-796.
Collins, A., MacLean, C. J., and Morton, N. E. (1996). Trials of the β model for complex inheritance. Proc. Natl. Acad. Sci. USA 93, 9177-9181.
Cordell, H. J., Todd, J. A., Bennett, S. T., Kawaguchi, Y., and Farrall, M. (1995). Two-locus maximum lod score analysis of a multifactorial trait: Joint consideration of IDDM2 and IDDM4 with IDDM1 in type 1 diabetes. Am. J. Hum. Genet. 57, 920-934.
Cordell, H. J., Wedig, G. C., Jacobs, K. B., and Elston, R. C. (2000). Multilocus linkage tests based on affected relative pairs. Am. J. Hum. Genet. 66, 1273-1286.
Cox, D. R., and Hinkley, D. V. (1994). "Theoretical Statistics." Chapman & Hall, London.
Cox, N. J., Frigge, M., Nicolae, D. L., Concannon, P., Hanis, C. L., Bell, G. I., and Kong, A. (1999). Loci on chromosomes 2 (NIDDM1) and 15 interact to increase susceptibility to diabetes in Mexican Americans. Nat. Genet. 21, 213-215.
Davies, J. L., Kawaguchi, Y., Bennett, S. T., Copeman, J. B., Cordell, H. J., Pritchard, L. E., Reed, P. W., Gough, S. C. L., Jenkins, S. C., Palmer, S. M., Balfour, K. M., Rowe, B. R., Farrall, M., Barnett, A. H., Bain, S. C., and Todd, J. A. (1994). A genome-wide search for human type 1 diabetes susceptibility genes. Nature 371, 130-136.
Davis, S., and Weeks, D. E. (1997). Comparison of nonparametric statistics for detection of linkage in nuclear families: Single-marker evaluation. Am. J. Hum. Genet. 61, 1431-1444.
Davis, S., Schroeder, M., Goldin, L. R., and Weeks, D. E. (1996). Nonparametric simulation-based statistics for detecting linkage in general pedigrees. Am. J. Hum. Genet. 58, 867-880.
Day, N. E., and Simons, M. J. (1976). Disease susceptibility genes: Their identification by multiple case family studies. Tissue Antigens 8, 108-119.
De Vries, R. R. P., Fat, R. F. M. L. A., Nijenhuis, L. E., and Van Rood, J. J. (1976). HLA-linked genetic control of host response to Mycobacterium leprae. Lancet ii, 1328-1330.
Drigalenko, E. (1998). How sib-pairs reveal linkage. Am. J. Hum. Genet. 63, 1243-1245.
Dudoit, S., and Speed, T. P. (1999). A score test for linkage using identity by descent data from sibships. Ann. Stat. 27, 943-986.
Dudoit, S., and Speed, T. P. (2000). A score test for the linkage analysis of qualitative and quantitative traits based on identity by descent data on sib-pairs. Biostatistics 1, 1-26.
Dupuis, J., Brown, O., and Siegmund, D. (1995). Statistical methods for linkage analysis of complex traits from high-resolution maps of identity by descent. Genetics 140, 842-856.
Elston, R. C. (1995). The genetic dissection of multifactorial traits. Clin. Exp. Allergy 25, 103-106.
Elston, R. C. (1998). Statistical Genetics '98. Methods of linkage analysis - and the assumptions underlying them. Am. J. Hum. Genet. 63, 931-934.
Elston, R. C., Kringlen, E., and Namboodiri, K. K. (1973). Possible linkage relationships between certain blood groups and schizophrenia or other psychoses. Behav. Genet. 3, 101-106.
Elston, R. C., Guo, X., and Williams, L. V. (1996). Two-stage global search designs for linkage analysis using pairs of affected relatives. Genet. Epidemiol. 13, 535-558.
Elston, R. C., Buxbaum, S., Jacobs, K. B., and Olson, J. M. (2000). Haseman and Elston revisited. Genet. Epidemiol. 19, 1-17.
Faraway, J. J. (1993). Improved sib-pair linkage test for disease susceptibility loci. Genet. Epidemiol. 10, 225-233.
Farrall, M. (1997). Affected sibpair linkage tests for multiple linked susceptibility genes. Genet. Epidemiol. 14, 103-115.
Feingold, E., and Siegmund, D. O. (1997). Strategies for mapping heterozygous recessive traits by allele-sharing methods. Am. J. Hum. Genet. 60, 965-978.
Fishman, P. M., Suarez, B., Hodge, S. E., and Reich, T. (1978). A robust method for the detection of linkage in familial diseases. Am. J. Hum. Genet. 30, 308-321.
Fulker, D. W., Cherny, S. S., and Cardon, L. R. (1995). Multipoint interval mapping of quantitative trait loci, using sib pairs. Am. J. Hum. Genet. 56, 1224-1233.
Green, J. R., and Woodrow, J. C. (1977). Sibling method for detecting HLA-linked genes in disease. Tissue Antigens 9, 31-35.
Greenwood, C. M. T., and Bull, S. B. (1999). Down-weighting of multiple affected sib pairs leads to biased likelihood-ratio tests, under the assumption of no linkage. Am. J. Hum. Genet. 64, 1248-1252.
Haseman, J. K. (1971). The genetic analysis of quantitative traits. Ph.D. thesis, University of North Carolina at Chapel Hill.
Haseman, J. K., and Elston, R. C. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2, 3-19.
Hodge, S. E. (1984). The information contained in multiple sibling pairs. Genet. Epidemiol. 1, 109-122.
Holmans, P. (1993). Asymptotic properties of affected-sib-pair linkage analysis. Am. J. Hum. Genet. 52, 362-374.
Idury, R. M., and Elston, R. C. (1997). A faster and more general hidden Markov model algorithm for multipoint likelihood calculations. Hum. Hered. 47, 197-202.
Knapp, M., Seuchter, S., and Baur, M. (1994a). Linkage analysis in nuclear families. I. Optimality criteria for affected sib-pair tests. Hum. Hered. 44, 37-43.
Knapp, M., Seuchter, S. A., and Baur, M. P. (1994b). Two-locus disease models with two marker loci: The power of affected-sib-pair tests. Am. J. Hum. Genet. 55, 1030-1041.
Kong, A., and Cox, N. J. (1997). Allele-sharing models: Lod scores and accurate linkage tests. Am. J. Hum. Genet. 61, 1179-1188.
Kruglyak, L., and Lander, E. (1995). Complete multipoint sib pair analysis of qualitative and quantitative traits. Am. J. Hum. Genet. 57, 439-454.
Kruglyak, L., Daly, M. J., and Lander, E. S. (1995). Rapid multipoint linkage analysis of recessive traits in nuclear families, including homozygosity mapping. Am. J. Hum. Genet. 56, 519-527.
Kruglyak, L., Daly, M. J., Reeve-Daly, M. P., and Lander, E. S. (1996). Parametric and nonparametric linkage analysis: A unified multipoint approach. Am. J. Hum. Genet. 58, 1347-1363.
Lander, E. S., and Green, P. (1987). Construction of multilocus genetic linkage maps in humans. Proc. Natl. Acad. Sci. USA 84, 2363-2367.
Lin, S., and Speed, T. (1996). Incorporating crossover interference into pedigree analysis using the chi square model. Hum. Hered. 46, 315-322.
McPeek, M. S. (1999). Optimal allele-sharing statistics for genetic mapping using affected relatives. Genet. Epidemiol. 16, 225-249.
Morton, N. E. (1996). Logarithm of odds (lods) for linkage in complex inheritance. Proc. Natl. Acad. Sci. USA 93, 3471-3476.
Olson, J. M. (1997). Likelihood-based models for genetic linkage analysis using affected sib pairs. Hum. Hered. 47, 110-120.
Olson, J. M. (1999). A general conditional logistic model for affected-relative-pair linkage studies. Am. J. Hum. Genet. 65, 1760-1769.
Olson, J. M., and Wijsman, E. (1993). Linkage between quantitative trait and marker locus: Methods using all relative pairs. Genet. Epidemiol. 10, 87-102.
Penrose, L. S. (1935). The detection of autosomal linkage in data which consist of pairs of brothers and sisters of unspecified parentage. Ann. Eugen. London 6, 133-138.
Penrose, L. S. (1938). Genetic linkage in graded human characters. Ann. Eugen. London 8, 233-237.
Penrose, L. S. (1953). The general purpose sib pair linkage test. Ann. Eugen. London 18, 120-124.
Risch, N. (1989). Genetics of IDDM: Evidence for complex inheritance with HLA. Genet. Epidemiol. 6, 143-148.
Risch, N. (1990). Linkage strategies for genetically complex traits. III. The effect of marker polymorphism on analysis of affected relative pairs. Am. J. Hum. Genet. 46, 242-253.
Risch, N. (1992). Corrections to "Linkage strategies for genetically complex traits. III. The effect of marker polymorphism on analysis of affected relative pairs" (Am. J. Hum. Genet. 46, 242-253, 1990). Am. J. Hum. Genet. 51, 673-675.
Risch, N. J., and Zhang, H. (1995). Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268, 1584-1589.
Risch, N. J., and Zhang, H. (1996). Mapping quantitative trait loci with extreme discordant sib pairs: Sampling considerations. Am. J. Hum. Genet. 58, 836-843.
Schaid, D. J., and Nick, T. G. (1990). Sib-pair linkage tests for disease susceptibility loci: Common tests vs. the asymptotically most powerful test. Genet. Epidemiol. 7, 359-370.
Schaid, D. J., and Nick, T. G. (1991). A powerful test of sib-pair linkage for disease susceptibility; reply to "Sib-pair linkage tests for disease susceptibility loci." Genet. Epidemiol. 8, 141-143.
Sham, P. C., Zhao, J. H., and Curtis, D. (1997). Optimal weighting scheme for affected sib-pair analysis of sibship data. Ann. Hum. Genet. 61, 61-69.
Suarez, B. K., and Hodge, S. E. (1979). A simple method to detect linkage for rare recessive diseases: An application to juvenile diabetes. Clin. Genet. 15, 126-136.
Suarez, B. K., Rice, J., and Reich, T. (1978). The generalized sib pair I.B.D. distribution: Its use in detection of linkage. Ann. Hum. Genet. 42, 87-94.
Tiwari, H. K., and Elston, R. C. (1997). Linkage of multilocus components of variance to polymorphic markers. Ann. Hum. Genet. 61, 253-261.
Tiwari, H. K., and Elston, R. C. (1998). Restrictions on components of variance for epistatic models. Theor. Popul. Biol. 54, 161-174.
Weeks, D. E., and Lange, K. (1988). The affected-pedigree-member method of linkage analysis. Am. J. Hum. Genet. 42, 315-326.
Whittemore, A. S. (1996). Genome scanning for linkage: An overview. Am. J. Hum. Genet. 59, 704-716.
Whittemore, A. S., and Halpern, J. (1994). A class of tests for linkage using affected pedigree members. Biometrics 50, 118-127.
Whittemore, A. S., and Tu, I.-P. (1998). Simple, robust linkage tests for affected sibs. Am. J. Hum. Genet. 62, 1228-1242.
12. Variance Component Methods

John Blangero,¹ Jeff T. Williams, and Laura Almasy
Department of Genetics
Southwest Foundation for Biomedical Research
San Antonio, Texas 78245
I. Summary
II. Introduction
III. Variance Component-Based Linkage Analysis
IV. The Effects of Nonnormality
V. Power of Variance Component Linkage Analysis
VI. Conclusions
References
I. SUMMARY

Variance component-based linkage analysis has become a major statistical tool for the localization and evaluation of quantitative trait loci influencing complex phenotypes. The variance component approach has many benefits: it can, for example, be used to analyze large pedigrees, and it is able to accommodate multiple loci simultaneously in a true oligogenic model. Important biological phenomena such as genotype-environment interaction and epistasis are also examined easily in a variance component framework. In this chapter, we review the basic statistical features of variance component linkage analysis, with an emphasis on its power and robustness to distributional violations.
¹To whom correspondence should be addressed.
Advances in Genetics, Vol. 42
Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.
0065-2660/01 $35.00
II. INTRODUCTION

During the past few years enormous advances have been made in techniques for finding genes that influence human disease susceptibility. High-throughput genotyping methods have revolutionized the search for both monogenic and complex disease loci, and the resulting emphasis on linkage studies and genome-wide linkage scans represents the current state of the science. Classical penetrance-based linkage analysis methods have proven extremely successful for mapping the genetic determinants of monogenic diseases, making the localization of genes for Mendelian disorders a routine and readily achievable task. However, the success of penetrance-based linkage methods has not generalized to the case of complex diseases that are influenced by multiple quantitative trait loci (QTLs) and their interactions with each other and with the environment. Localizing and evaluating the relative importance of specific QTLs for complex traits presents a number of complications, such as quantitative variability, oligogenic inheritance, epistasis, and genotype-environment interaction. These phenomena are difficult to model with classical, penetrance-based approaches that require specification of parameters such as the allele frequencies at the QTL and the QTL-specific genotypic means and variances. The genetic analysis of complex phenotypes clearly requires new statistical techniques that can accommodate the more complicated and inherently oligogenic genetic architectures underlying these traits. Many complex diseases have quantitative correlates that are directly related to disease risk. Such quantitative characters have many benefits over their discrete counterparts for statistical genetic analysis (Blangero, 1995; Wijsman and Amos, 1997; Duggirala et al., 1997; Williams et al., 1999a,b).
One approach to localizing QTLs for complex disease has been to conduct linkage analyses of these quantitative disease-related phenotypes (Lander and Schork, 1994; Blangero, 1995; Williams et al., 1999a,b). Much of quantitative trait linkage analysis has been oriented toward applications utilizing only pairs of related individuals such as sibpairs (Haseman and Elston, 1972; Kruglyak and Lander, 1995). However, large extended pedigrees can provide greater linkage information, on a per-person basis, than can smaller sampling units, and in recent years there has been much progress in developing linkage methods that fully exploit all the information in nuclear families (Goldgar, 1990; Schork, 1993; Amos, 1994; Amos et al., 1996) and in extended kindreds (Blangero and Almasy, 1997; Duggirala et al., 1997, 1999; Almasy and Blangero, 1998; Williams et al., 1999a,b; Williams and Blangero, 1999a,b).
In particular, there has been a great deal of recent work on extending the classical variance component quantitative genetic approach to allow for linked QTLs. Hopper and Mathews (1982) first proposed such an extension, although they limited consideration to a random effects association model that incorporated information on the probability that pairs of individuals share haplotypes that are identical by state (IBS). Goldgar (1990) later extended the model for sibship data to use information on chromosomal segments that are identical by descent (IBD), thus producing the first linkage model based on variance components. Amos (1994) further generalized the model to nuclear families, provided the necessary framework for conducting two-point linkage analysis, and set the stage for much wider usage. Our more recent extensions to the variance component method have enabled penetrance model-free multipoint oligogenic linkage analyses of complex quantitative or qualitative traits in pedigrees of arbitrary size and complexity (Blangero and Almasy, 1997; Duggirala et al., 1997; Almasy and Blangero, 1998; Williams et al., 1999a,b). The Genetic Analysis Workshops (GAWs) have been instrumental in demonstrating the generality and power of variance component linkage analysis. The complex simulated data sets generated for these workshops have been used extensively to examine the operating characteristics of the method: there were 2 papers on variance component methods at GAW9 (1994), 13 at GAW10 (1996), and 16 at GAW11 (1998). The first empirical applications of variance component linkage analysis to quantitative traits in human families appeared in 1996 (Stern et al., 1996; Duggirala et al., 1996). Since then, a large number of applied QTL mapping studies using variance component analysis have appeared.
Among these are studies to localize QTLs influencing variation in such important disease-related risk factors as serum leptin levels (Comuzzie et al., 1997), LDL-cholesterol (Rainwater et al., 1999), HDL-cholesterol fractions (Almasy et al., 1999), event-related brain potentials (Begleiter et al., 1998; Williams et al., 1999b), personality dimensions (Cloninger et al., 1998), and body mass index (Duggirala et al., 1996; Mitchell et al., 1999). The variance component framework also facilitates investigation of important biological phenomena such as genotype-environment interaction (Blangero, 1993; Jaquish et al., 1997; Towne et al., 1997), epistasis (Blangero, 1993; Stern et al., 1996; Mitchell et al., 1997; Cloninger et al., 1998), and oligogenic inheritance (Blangero and Almasy, 1997; Blangero et al., 1999).
III. VARIANCE COMPONENT-BASED LINKAGE ANALYSIS

The variance component methodology is widely used in quantitative genetics (Lange et al., 1976; Hopper and Mathews, 1982), and the variance component approach to linkage analysis has been extended to pedigrees of arbitrary complexity (Blangero and Almasy, 1997; Almasy and Blangero, 1998). With this method, all possible biological relationships are used simultaneously to dissect the genetic architecture of a quantitative trait. The method is quite general, and a number of extensions have been made to relax some of the power-diminishing, simplifying assumptions that are implicit in most "penetrance model-free" methods. Furthermore, because the variance component method requires the estimation of fewer parameters than fully parametric penetrance-based linkage methods, it is also more efficient.
A. Modeling the phenotype

The variance component linkage method is based upon the following simple decomposition of the phenotype. Let the quantitative phenotype for the ith individual, y_i, be modeled as a linear function of the n quantitative trait loci that influence it,

    y_i = μ + Σ_{i=1}^{n} q_i + e,                                    (12.1)

where μ is the grand mean, q_i is the effect of the ith QTL, and e represents a random environmental deviation. Assume that q_i and e are uncorrelated random variables with expectation 0, so that the variance of y_i is σ_y² = Σ_{i=1}^{n} σ_{q_i}² + σ_e². For simplicity we ignore dominance effects and assume that the q_i represent purely additive effects; for a description of this model that allows dominance effects, see Amos (1994) and Almasy and Blangero
(1998). For this simple random effects model, the expected phenotypic covariance between the trait values of any pair of relatives is

    Cov(y₁, y₂) = Σ_{i=1}^{n} (k_{2i} + ½k_{1i}) σ_{q_i}²,            (12.2)

where k_{1i} and k_{2i} are the k coefficients of Cotterman (1941), with k_{ji} being the probability of the pair of relatives sharing j alleles IBD at the ith QTL. From this covariance, the expected phenotypic correlation between any pair of relatives is

    ρ = Σ_{i=1}^{n} (k_{2i} + ½k_{1i}) h_{q_i}²,                      (12.3)
where h_{q_i}² is the proportion of the total phenotypic variance due to the additive genetic contribution of the ith QTL. In classical quantitative genetic variance component models, we do not have information on specific QTLs and use instead the expectation over the genome of the probabilities k_{ji} to obtain the approximation

    ρ ≈ 2φ h_a²,                                                      (12.4)
where σ_a² = Σ_{i=1}^{n} σ_{q_i}² is the total additive genetic variance, φ = ½E[k_{1i}/2 + k_{2i}] is the expected kinship coefficient, and 2φ is the expected coefficient of relationship. Because we are interested in examining one or a few QTLs at a time, we employ the approximation in Equation (12.4) to reduce the number of parameters that must be considered. For example, if we are focusing on the effect of the ith QTL in Equation (12.1), the effects of all remaining QTLs can be absorbed in residual components of covariance. In terms of these residual covariances, the expected phenotypic covariance between relatives is well approximated by

    Cov(y₁, y₂) ≈ π_i σ_{q_i}² + 2φ σ_a²,                             (12.5)
where π_i = k_{1i}/2 + k_{2i} is the probability of a random allele being IBD at the ith QTL, and σ_a² now represents the residual additive genetic variance. For any given chromosomal location, π_i can be estimated from genetic marker data and information on the genetic map. The coefficients π_i and their expectations effectively structure the expected phenotypic covariances and are the basis for much of quantitative trait linkage analysis, such as the sibpair difference method of Haseman and Elston (1972). The simple additive model for a relative pair is easily extended to the situation in which n QTLs and an unknown number of residual polygenes influence a trait in a general pedigree. In this case the covariance between a pair of relatives is replaced by the covariance matrix for the pedigree,
    Ω = Σ_{i=1}^{n} Π̂_i σ_{q_i}² + 2Φ σ_a² + I σ_e²,                  (12.6)
where Π̂_i is the matrix whose elements π̂_{ijl} specify the expected proportion of genes that individuals j and l share IBD at the quantitative trait locus q_i, 2Φ is the matrix of average coefficients of relationship, and I is the identity matrix. The matrix Π̂_i is the estimated IBD matrix for a specific chromosomal location;
its estimation from genetic marker data is discussed in detail elsewhere (Almasy and Blangero, 1998).
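The covariance structure in Equation (12.6) is straightforward to assemble numerically. The following sketch (our own illustration, not from the chapter; the function name, matrices, and the assumed IBD sharing of 0.7 are hypothetical) builds Ω for a single sibpair at one QTL:

```python
import numpy as np

def omega(pi_hat, two_phi, s2q, s2a, s2e):
    """Eq. (12.6) for a single QTL: Omega = Pi_hat*s2q + 2*Phi*s2a + I*s2e."""
    return s2q * pi_hat + s2a * two_phi + s2e * np.eye(len(two_phi))

# Sibpair: 2*Phi has off-diagonal 0.5 (the expected coefficient of relationship);
# pi_hat's off-diagonal is the marker-estimated IBD proportion at the QTL.
two_phi = np.array([[1.0, 0.5],
                    [0.5, 1.0]])
pi_hat = np.array([[1.0, 0.7],
                   [0.7, 1.0]])
cov = omega(pi_hat, two_phi, s2q=0.4, s2a=0.3, s2e=0.3)
# Diagonal is the total phenotypic variance s2q + s2a + s2e = 1.0;
# the sib covariance is 0.7*0.4 + 0.5*0.3 = 0.43.
```

With a fully informative marker, stronger IBD sharing at the QTL raises the modeled sib covariance only through the σ_q² term, which is exactly what the linkage test exploits.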
B. Maximum likelihood estimation

If multivariate normality of the pedigree phenotypic vector y is assumed, the likelihood of any pedigree can be written easily and standard numerical methods used to estimate the model parameters. For the covariance model in Equation (12.6), and assuming that a single QTL is under consideration, the ln likelihood of a pedigree of t individuals is

    ln L(μ, σ_q², σ_a², σ_e² | y) = −(t/2) ln(2π) − ½ ln|Ω| − ½ Δ′Ω⁻¹Δ,   (12.7)
where μ is the vector of grand trait means and Δ = y − μ. If desired, the effects of covariates can be incorporated as mean effects in the multivariate normal model. In practice, we routinely estimate covariate effects simultaneously with all other linkage parameters, although Liang and Self (1996) have shown that there is little penalty for estimating covariate effects prior to genetic analysis and then using residual trait values. Maximum likelihood estimation involves finding the parameter values that maximize the likelihood function for the pedigree. To do this, we need to evaluate the first derivatives of the ln likelihood function with regard to its parameters. Let θ = [μ, σ_q², σ_a², σ_e²]′ represent the vector of parameters; then the vector of first derivatives, ∂ ln L/∂θ, has elements
    ∂ ln L/∂μ    = 1′Ω⁻¹(y − μ)
    ∂ ln L/∂σ_q² = −½ Tr(Ω⁻¹Π̂) + ½ (Δ′Ω⁻¹Π̂Ω⁻¹Δ)
    ∂ ln L/∂σ_a² = −½ Tr(Ω⁻¹2Φ) + ½ (Δ′Ω⁻¹2ΦΩ⁻¹Δ)                     (12.8)
    ∂ ln L/∂σ_e² = −½ Tr(Ω⁻¹) + ½ (Δ′Ω⁻¹Ω⁻¹Δ),
where Tr denotes the matrix trace operation. The vector ∂ ln L/∂θ is known as the score vector S(θ), and by solving the matrix equation S(θ) = 0 we obtain the maximum likelihood parameter estimates θ̂. The error variance-covariance matrix and the information matrix of the estimated parameters are obtained from the matrix of second derivatives
of the ln likelihood function. Let A = E[∂² ln L/∂θ∂θ′] be the expected matrix of second derivatives of the likelihood function, and let V be the variance-covariance matrix of the estimated parameters. Under the assumption of multivariate normality, V⁻¹ = −A. Furthermore, the variance components σ_i² and the mean μ are uncorrelated, and therefore A has some elements with expectation 0. The nonzero elements of V⁻¹ are as follows:

    E(−∂² ln L/∂μ²)        = 1′Ω⁻¹1
    E(−∂² ln L/∂σ_q²∂σ_q²) = ½ Tr(Ω⁻¹Π̂Ω⁻¹Π̂)
    E(−∂² ln L/∂σ_a²∂σ_a²) = ½ Tr(Ω⁻¹2ΦΩ⁻¹2Φ)
    E(−∂² ln L/∂σ_e²∂σ_e²) = ½ Tr(Ω⁻¹Ω⁻¹)                              (12.9)
    E(−∂² ln L/∂σ_q²∂σ_a²) = ½ Tr(Ω⁻¹Π̂Ω⁻¹2Φ)
    E(−∂² ln L/∂σ_q²∂σ_e²) = ½ Tr(Ω⁻¹Π̂Ω⁻¹)
    E(−∂² ln L/∂σ_a²∂σ_e²) = ½ Tr(Ω⁻¹2ΦΩ⁻¹).

The assumption of multivariate normality leads also to several important equivalences. Let B = E[S(θ)S(θ)′] = Cov[S(θ)] denote the covariance matrix of the score vector S(θ). Under normality, A = −B, so that A + B = 0; this is known as the information matrix equality (White, 1994). When this equality holds, then Cov(θ̂) = −A⁻¹ = B⁻¹; that is, the covariance matrix for the parameter vector is equal to the negative of the inverse of the covariance matrix for the score vector.
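The pedigree log likelihood of Equation (12.7) can be evaluated directly with standard linear algebra. A minimal sketch (our own illustration; the function name and example numbers are hypothetical):

```python
import numpy as np

def pedigree_loglik(y, mu, omega):
    """Eq. (12.7): ln L = -(t/2)ln(2 pi) - (1/2)ln|Omega| - (1/2) d' Omega^-1 d."""
    t = len(y)
    d = y - mu
    sign, logdet = np.linalg.slogdet(omega)   # numerically stable log-determinant
    quad = d @ np.linalg.solve(omega, d)      # d' Omega^-1 d without explicit inverse
    return -0.5 * t * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad

# Sibpair with unit variances and covariance 0.43:
omega = np.array([[1.0, 0.43],
                  [0.43, 1.0]])
ll = pedigree_loglik(np.array([0.2, -0.1]), 0.0, omega)
```

In a full analysis this likelihood would be maximized over the variance components (e.g., with a quasi-Newton optimizer) and summed over independent pedigrees; the sketch only shows the per-pedigree evaluation.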
C. The likelihood ratio statistic and the lod score

In the variance component model the null hypothesis that the additive genetic variance due to the ith QTL is zero (i.e., that there is no linkage) can be tested by comparing the likelihood of this restricted model with that of a model in which the variance due to the ith QTL is estimated. Formally, this leads to the likelihood ratio statistic

    Λ = 2[ln L(θ̂) − ln L(θ̃)],                                        (12.10)
where L(θ̂) is the likelihood under the alternative hypothesis of linkage and L(θ̃) is the likelihood under the null hypothesis of no linkage. The difference between the two log₁₀ likelihoods yields a lod score that is the equivalent of the classical lod score of linkage analysis, that is,

    LOD = log₁₀ L(θ̂) − log₁₀ L(θ̃) = Λ/(2 ln 10).
Under the assumption of multivariate normality, the statistic Λ is asymptotically distributed as a ½:½ mixture of a χ₁² variable and a point mass at zero (Self and Liang, 1987). When multiple QTLs are considered jointly, the resulting likelihood ratio test statistic has a more complex asymptotic distribution but continues to be a mixture of chi-square distributions.
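The relation between Λ, the lod score, and the ½:½ mixture distribution can be sketched as follows (our own illustration; the χ₁² tail is written with the complementary error function):

```python
import math

def lod_score(lnL_alt, lnL_null):
    """LOD = log10 L(alt) - log10 L(null), i.e., Lambda / (2 ln 10)."""
    return (lnL_alt - lnL_null) / math.log(10)

def linkage_pvalue(lam):
    """P-value of Lambda under the 1/2:1/2 mixture of chi2(1) and a point
    mass at zero, using P(chi2_1 > x) = erfc(sqrt(x/2))."""
    if lam <= 0:
        return 1.0
    return 0.5 * math.erfc(math.sqrt(lam / 2))

lam = 2 * math.log(10) * 3.0   # Lambda corresponding to LOD = 3
p = linkage_pvalue(lam)        # on the order of 1e-4, the familiar LOD-3 benchmark
```

The halving of the χ₁² tail probability reflects the point mass at zero: under the null, the variance component estimate is pinned at its boundary in half the replicates.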
D. Alternative test statistics

There are several alternative test statistics that can easily be calculated. A Wald-type test statistic can be obtained from the parameter estimates and their variance-covariance matrix calculated under the alternative hypothesis of linkage as follows:

    W = σ̂′ [Cov(σ̂)]⁻¹ σ̂,                                             (12.11)

where σ̂ is the vector of estimated QTL variance components. For the case of testing a single QTL component, (12.11) reduces to

    W = (σ̂_q²)² / Var(σ̂_q²).
Another alternative test statistic is the score test, given by

    𝒮 = S(θ̃)′ V S(θ̃),                                                (12.12)
where S(θ̃) is the score vector evaluated at the maximum likelihood estimates of parameters obtained under the null hypothesis, and V is the error covariance matrix of the estimated parameters. For the single-QTL linkage test in which σ_q² is the ith parameter, the score test reduces to

    𝒮 = S_i(θ̃)² / Var[S_i(θ̃)],

since v_ii = 1/Var[S_i(θ̃)]. If S_i(θ̃) is different from zero, it indicates that the likelihood is not at its maximum when σ_q² = 0; that is, there is evidence for
linkage. A larger score indicates greater support for the alternative hypothesis of linkage. The main advantage of the score test is that it does not require estimation of the parameters under the alternative hypothesis of linkage. The score test requires only the evaluation of the first and second derivatives of the likelihood function in the parameter neighborhood of the null hypothesis. Because of the large number of tests that are performed during a genome scan, the use of score tests may be an excellent first-pass approach. Asymptotically, Λ = W = 𝒮, so that the choice of tests can be based on other considerations (such as reducing computational burden) when the sample size is large.
IV. THE EFFECTS OF NONNORMALITY

A number of studies have documented some of the major statistical properties of the variance component method (Amos, 1994; Amos et al., 1996; Blangero and Almasy, 1997; Duggirala et al., 1997; Almasy and Blangero, 1998; Allison et al., 1999; Williams and Blangero, 1999a; Williams et al., 1999a). However, the effect on variance component linkage analysis of failures in the assumption of multivariate normality has received relatively little attention. Beaty et al. (1985) showed that variance component estimates obtained assuming multivariate normality were consistent regardless of the true underlying distribution, but estimates of standard errors for the variance component parameters can be biased low in the presence of nonnormality. For the specific case of variance component linkage analysis, Allison et al. (1999) showed how extreme deviations from normality can lead to increased type I errors. The main concern in this respect is the potential influence of the implicit and explicit underlying assumptions on the validity of our inferences. Under the alternative hypothesis of most interest here (specifically, that a QTL has an effect on the focal trait), the assumption of multivariate normality is obligately violated because the expected distribution then represents a finite mixture of QTL genotypic distributions. Therefore, it is important to test the operating characteristics of the variance component method over a range of potential distributional violations. Before examining the effects of nonnormality on variance component linkage analysis, we review the statistical framework for the moments of a finite mixture of normal variables. Most of the modeling and simulation analyses that were used to examine the main features of variance component linkage analysis employed simple genetic models consisting of a small number of QTL genotypes that exhibit homogeneous within-genotypic variance but lead to various deviations from normality.
A. Finite mixtures due to genotypic variation

Consider a phenotype y that is partially determined by variation at a locus having n alleles and m = n(n + 1)/2 possible genotypes. Let the allele frequencies at the locus be p₁, . . . , p_n, with Σ_{i=1}^{n} p_i = 1. Assume the alleles are in Hardy-Weinberg equilibrium; then the frequency ψ_ij of the ijth (i ≤ j) genotype G_ij is

    ψ_ij = p_i²  if i = j,   ψ_ij = 2 p_i p_j  if i < j,
and IZ$‘=1I.&j $ij = 1 Conditional upon the ijth genotype, assume that y is normally distributed with mean Iuij and residual variance (T$ then the conditional probability density function for y is 4dY I Oij) = *
exp
r
[-f(x$y].
(12.13)
The marginal distribution of y has a probability density function that is a frequency-weighted mixture of the conditional densities,

    f(y) = Σ_{i=1}^{n} Σ_{j≥i} ψ_ij φ(y | G_ij),                       (12.14)

such that ∫_{−∞}^{∞} f(y) dy = 1. The distribution function for the mixture is then given by

    F(y) = ∫_{−∞}^{y} f(x) dx.

Thus, the cumulative density for the finite mixture is just the weighted sum of the genotype-specific cumulative densities. The first four crude moments of the mixture distribution are easily obtained from the characteristic function of the mixture density; they are

    E[y]  = Σ_i ψ_i μ_i
    E[y²] = Σ_i ψ_i (μ_i² + σ_e²)
    E[y³] = Σ_i ψ_i (μ_i³ + 3μ_i σ_e²)
    E[y⁴] = Σ_i ψ_i (μ_i⁴ + 6μ_i² σ_e² + 3σ_e⁴),
where the two-dimensional genotypic subscript ij has now been reduced to one dimension so that the appropriate index for the klth genotype is i = l(l − 1)/2 + k. These formulas generalize with little difficulty to multiple-locus genotypes. The ith central moment m_i can be calculated from the crude moments using standard formulas: m₁ = E[y], m₂ = E[y²] − E[y]², m₃ = E[y³] − 3E[y²]E[y] + 2E[y]³, and m₄ = E[y⁴] − 4E[y³]E[y] + 6E[y²]E[y]² − 3E[y]⁴. From the central moments we obtain useful indicators of the asymmetry and peakedness of the distribution by calculating standard skewness (γ) and kurtosis (κ) according to

    γ = m₃ / m₂^{3/2},   κ = m₄ / m₂² − 3.
Because the error variance of the variance components is largely dependent upon the second and fourth distributional moments, kurtosis plays the major role in determining the deviation of the likelihood ratio statistic from its asymptotic distribution. Clearly, κ ≠ 0 for a QTL model based upon a finite mixture of normal genotypic distributions.
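The moment calculations above are easy to check numerically. A small sketch (our own hypothetical code) computes skewness and kurtosis for a mixture with common within-genotype variance; a single normal component correctly returns γ = κ = 0:

```python
def mixture_skew_kurt(freqs, means, var_e):
    """Skewness gamma and kurtosis kappa of a finite normal mixture with
    common within-genotype variance var_e, via crude -> central moments."""
    E1 = sum(f * m for f, m in zip(freqs, means))
    E2 = sum(f * (m**2 + var_e) for f, m in zip(freqs, means))
    E3 = sum(f * (m**3 + 3 * m * var_e) for f, m in zip(freqs, means))
    E4 = sum(f * (m**4 + 6 * m**2 * var_e + 3 * var_e**2) for f, m in zip(freqs, means))
    m2 = E2 - E1**2
    m3 = E3 - 3 * E2 * E1 + 2 * E1**3
    m4 = E4 - 4 * E3 * E1 + 6 * E2 * E1**2 - 3 * E1**4
    return m3 / m2**1.5, m4 / m2**2 - 3

# Additive diallelic QTL in Hardy-Weinberg proportions, allele frequency p = 0.15
# (the genotype means 0, 1, 2 and var_e = 0.5 are illustrative assumptions):
p = 0.15
freqs = [(1 - p)**2, 2 * p * (1 - p), p**2]
gamma, kappa = mixture_skew_kurt(freqs, [0.0, 1.0, 2.0], var_e=0.5)
```

A rare trait-increasing allele produces the positive skew one expects: most of the mass sits at the low-genotype mean with a tail toward carriers.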
B. Relationship between kurtosis and type I error

Allison et al. (1999) examined the effect of nonnormality of the trait distribution on type I error rates in variance component linkage analysis using sibpairs and showed that some markedly nonnormal trait distributions led to substantial excesses of type I error. In particular, leptokurtic distributions (i.e., those for which κ > 0) showed increased type I error rates compared with the nominal values expected for normally distributed traits. A second major determinant of the effect of nonnormality is the total magnitude of correlations among relative pairs. If we ignore environmental sources of covariation and restrict consideration to additive genetic effects, the magnitude of these correlations is adequately reflected in the total trait heritability h_T² = Σ_i h_{q_i}². For nonnormal traits with κ > 0, the type I error increases with increasing h_T². We have performed additional preliminary simulation experiments to evaluate the importance of the assumption of normality on parameter estimation and the distributional properties of the LOD score test. Figure 12.1 shows the relationship between type I error and kurtosis for a set of simulated traits with widely varying degrees of kurtosis. For this simulation, we used the same
Figure 12.1. Relationship between type I error and kurtosis of the trait distribution. Results are based on 100 replications of 500 sibpairs for a variety of distributions of completely heritable traits.
generating models as Allison et al. (1999). The expected nominal type I error rate was set to α = 0.05 (indicated by the dashed line in Figure 12.1) and the total heritability was fixed at 1.0 to model the most discrepant possible situation. For each model, 100 replicates of 500 randomly selected sibpairs were generated. Figure 12.1 clearly reveals the strong influence of kurtosis on type I error. The most leptokurtic distribution considered was the χ₂², which has an expected kurtosis of 6. With h_T² = 1, application of the variance component method under the assumption of multivariate normality resulted in greater than threefold inflation of the type I error (18.2%) over the expected nominal rate. When h_T² for the χ₂² distribution is decreased and the simulations are repeated, the observed type I error rate decreases and approaches the nominal value. For total heritabilities of 0.8, 0.6, 0.2, and 0.0, for example, the resulting type I error rates are 14.1, 11.5, 6.8, and 5.0%, respectively. The leptokurtic distributions used to generate Figure 12.1 represent worst-case scenarios because the correlation among relatives was maximized in the simulations. Even so, trait distributions with κ < 2 appear not to result in grossly inflated type I error. In practice, however, for most of the quantitative traits that are generally considered risk factors for common chronic diseases, the kurtosis is seldom large enough to cause a marked deviation. We determined the
kurtosis values for 105 quantitative traits related to heart disease, diabetes, and obesity for approximately 1400 individuals from the San Antonio Family Heart Study (Mitchell et al., 1996; MacCluer et al., 1999). About 82% of these traits have distributions with κ < 2 and could reasonably be analyzed under an assumption of multivariate normality of the within-pedigree trait vector (data not shown). The distributions for a number of the traits are markedly kurtotic, however, and special procedures will be necessary to ensure that linkage inferences for these traits are correct.
C. Alternative robust test statistics

There is a diverse body of statistical theory concerning the behavior of test statistics when these are computed under an incorrect probability model (Foutz and Srivastava, 1977; Beaty et al., 1985; Browne and Shapiro, 1987; Westfall, 1987; White, 1994). For example, Beaty et al. (1985) derived the expected value of the score statistic as a function of kurtosis and showed that distributions for which κ > 0 can lead to misestimation of the asymptotic covariance matrix for the parameter estimates. This result obtains for any variance component model in which there are more than two components and for traits with nonzero total trait heritability h_T². For the case of two variance components, as in the classical polygenic model with genetic and environmental random effects σ_g² and σ_e², kurtosis does not alter the asymptotic distribution of any of the test statistics (they remain distributed as ½:½ mixtures of a χ₁² variable and a point mass at zero). However, for models with three or more variance components, such as the variance component-based linkage model, significant deviations can occur when κ ≠ 0 and h_T² > 0.
1. The robust covariance matrix

The source of the deviation stems from violation of the information matrix equality A = −B (White, 1994). However, a number of corrections have been described that take into account the deviation of the true probability generating model from the assumed (misspecified) one (Foutz and Srivastava, 1977; Browne and Shapiro, 1987; White, 1994). The solution is to find a robust consistent estimator of V. In the nonnormal case, the score covariance matrix is augmented over its value expected under normality, that is, B = −A + F(κ, Ω) = Cov[S(θ)], where F denotes a matrix function and κ is the vector of kurtosis coefficients for each of the random effects. By employing a consistent estimator of B instead of assuming the analytical derivatives obtained under the assumption of multivariate normality, a sandwich estimator can be used to obtain a consistent estimate of V. One obvious consistent estimator of B is
    B_R = Σ_i S(θ̂)_i S(θ̂)_i′,                                        (12.15)
where B_R denotes a robust estimate of B and the summation is over all families (Beaty et al., 1985). If the pedigrees differ in structure, some weighting based upon effective pedigree size should be employed. Alternatively, a simulation-based method can be used to estimate B_R. Once B_R has been calculated, a consistent estimate of V is then

    V_R = Â⁻¹ B_R Â⁻¹.                                                (12.16)
2. Robust Wald and score tests

With a robust estimate of the error covariance matrix, we can derive a number of robust test statistics that will have the appropriate asymptotic distributions. A robust Wald test can be obtained as

    W_R = θ̂′ (Â⁻¹ Cov[S(θ̂)] Â⁻¹)⁻¹ θ̂,                                (12.17)

and a robust score test as

    𝒮_R = S(θ̃)′ (Â⁻¹ Cov[S(θ̃)] Â⁻¹)⁻¹ S(θ̃).                          (12.18)
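The sandwich construction behind these robust statistics, an inverse Hessian wrapped around the summed outer products of the per-family scores (Eq. 12.15), can be sketched as follows (our own illustrative code; the per-family score vectors and the toy numbers are assumed values):

```python
import numpy as np

def sandwich_cov(A_hat, family_scores):
    """Robust covariance A^-1 * B_R * A^-1, with B_R the sum over families
    of the outer products of their score vectors (Eq. 12.15)."""
    B_R = sum(np.outer(s, s) for s in family_scores)
    A_inv = np.linalg.inv(A_hat)
    return A_inv @ B_R @ A_inv

# Toy example with two parameters and three families:
A_hat = np.array([[-4.0, 0.0],
                  [0.0, -2.0]])
scores = [np.array([0.5, -0.2]),
          np.array([-0.3, 0.4]),
          np.array([0.1, 0.0])]
V_R = sandwich_cov(A_hat, scores)
```

Under exact multivariate normality B_R estimates −A, and the sandwich collapses back to the usual −A⁻¹; the robust tests then agree asymptotically with their classical counterparts.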
3. Robust likelihood ratio tests

The robust Wald and score tests can be used in quantitative trait linkage analysis whenever the trait distribution is markedly nonnormal. However, we may still be interested in expressing our results using a robust likelihood ratio statistic so that we can report it as a robust LOD score. To define a robust likelihood ratio statistic, we must examine the distribution of this statistic under an incorrect probability model. The necessary analytical machinery is found in the results of Foutz and Srivastava (1977). For the simple case of a test of a single QTL, inference is limited to the parameter vector u = [σ_q², σ_a², σ_e²]′. Partition the matrix A as

    A = [ A₁  A₂′ ]
        [ A₂  A₃  ],                                                  (12.19)
where A₁ is a scalar (A₁₁) pertaining to the second derivative of the likelihood function with regard to u₁ = σ_q². Let v denote the first diagonal element of V_R. It
can then be shown that

    Λ → ½ c(û) χ₁² + ½ (0),                                           (12.20)

where

    c(û) = v(−A₁ + A₂′ A₃⁻¹ A₂).
This remarkable result states that the distribution of the likelihood ratio statistic under model misspecification is equal to a constant times a χ₁² variate. Therefore, a robust alternative to the likelihood ratio statistic is

    Λ_R = [1/c(û)] Λ,                                                 (12.21)

and the analogous robust lod score is

    LOD_R = [1/c(û)] LOD.                                             (12.22)
Given the asymptotic equivalence of W_R, 𝒮_R, and Λ_R, the constant 1/c(û) is equal to Var(σ̂_q²)/Var_R(σ̂_q²), which is simply the ratio of the classical variance estimate to the robust variance estimate. This estimate of c(û) can be used to correct the lod score. For leptokurtic distributions, c > 1 and LOD_R < LOD. As an alternative that avoids the requirement of directly estimating V_R, simulation can be employed to generate a sample of test statistics under the null hypothesis (by gene-dropping unlinked markers) and the empirical distribution of the lod scores under the assumption of multivariate normality determined. The simulated markers can be completely informative, since the distribution of the test statistic is insensitive to the estimate of Π (provided Π̂ ≠ 2Φ, which is a requirement for identifiability of the parameter σ_q²). The empirical distribution of the simulated lod scores can then be used to assign percentiles to each replicate and an expected test statistic calculated based on the percentile. The expected lod scores are then regressed on the observed (and biased) lod scores to obtain the correction constant, which is then used to adjust all observed lod scores. The benefit of this approach over direct use of the empirical distribution of the lod scores is that it results in a lod statistic whose interpretation remains intact. As long as the sample is large enough so that asymptotic inference is valid, many fewer replicates are needed to estimate the
correction constant than are needed to estimate accurately from the empirical density function the small p values generally of interest in genome scans.
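The regression step for the correction constant can be sketched as follows (hypothetical code of ours): given null-simulation lod scores computed under normality and their percentile-matched expected values, a through-the-origin regression of expected on observed yields the multiplier applied to all observed lod scores:

```python
import numpy as np

def lod_correction_constant(observed, expected):
    """Through-the-origin slope of expected on observed null lod scores;
    multiplying observed lods by this constant yields the corrected LOD."""
    obs = np.asarray(observed, dtype=float)
    exp_ = np.asarray(expected, dtype=float)
    return float(obs @ exp_ / (obs @ obs))

# If the biased lods run about twice their expected size, c is about 1/2
# (the numbers below are illustrative, not simulation output):
c = lod_correction_constant([0.4, 1.0, 2.2, 3.0], [0.2, 0.5, 1.1, 1.5])
lod_R = c * 3.0   # deflate an observed lod of 3.0
```

Forcing the fit through the origin matches the interpretation of the correction as a pure rescaling of the lod statistic: a null lod of zero must remain zero after adjustment.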
4. The multivariate t distribution

Another robust alternative is to use the multivariate t distribution instead of the multivariate normal distribution to model the pedigree phenotypic vector (Lange et al., 1989). Through an additional parameter that is largely a function of the kurtosis, the multivariate t distribution has the natural benefit of downweighting outlying observations. As shown by Lange et al. (1989), the density function for the multivariate t is similar in form to the multivariate normal. Consequently, modification of computer programs assuming multivariate normality is relatively simple, and conventional maximum likelihood methods can be used to estimate the parameters of the linkage model. All the test statistics defined in subsections 1-3 (including the robust alternatives) can be used with the multivariate t distribution.
D. Accuracy of QTL effect estimation

We have noted elsewhere the effective unbiasedness of estimates of h_q² for a small range of generating models. Few studies, however, have examined the accuracy of the estimates of QTL effect size in the presence of nonnormality (Amos et al., 1996; Almasy and Blangero, 1998; Williams and Blangero, 1999a; Williams et al., 1999a). To further examine this question, we simulated a quantitative trait influenced by a single diallelic QTL with additive effects and by residual polygenes. Family structures, taken from GAW9, consisted of 1000 phenotyped individuals in 23 extended pedigrees. QTL allele frequencies were varied from 0.50 to 0.90, and the values of the QTL-specific heritability were varied from 0.05 to 0.90. Within-genotype distributions were assumed to be normal with constant variance. Completely informative linked markers were simulated, with zero recombination. Each of these generating models induces a unique finite mixture distribution whose moments can be calculated using the formulas provided earlier. The kurtosis for these simulated trait distributions ranged from -0.81 to 2.07. Linkage analyses of the simulated traits were performed using our computer package SOLAR. The mean QTL-specific heritability from 100 replicates of each generating model are summarized in Table 12.1. In general, the estimates are accurate for all generating models, and there is no indication of systematic bias. A plot of bias against kurtosis for this experiment showed no correlation, indicating that there is no relationship of estimation bias to deviation from the assumption of multivariate normality. Thus, QTL effect size estimates appear to be relatively robust to reasonable departures from multivariate
12. Variance Component Methods

Table 12.1. Consistency of Estimation of h_q² with Allele Frequency

             Estimated h_q²
True h_q²   p = 0.50   p = 0.70   p = 0.90
0.05        0.0479     0.0561     0.0492
0.10        0.0972     0.0951     0.0913
0.15        0.1379     0.1428     0.1376
0.20        0.2060     0.2057     0.2156
0.25        0.2548     0.2541     0.2404
0.30        0.3077     0.3105     0.3008
0.35        0.3498     0.3538     0.3647
0.40        0.4081     0.4003     0.4055
0.45        0.4619     0.4658     0.4493
0.50        0.5039     0.4986     0.4998
0.60        0.6019     0.6086     0.6066
0.70        0.7061     0.7080     0.7067
0.80        0.8082     0.8046     0.8013
0.90        0.8972     0.8984     0.9019

Note: Results are based on 100 replications of a simulation of 1000 individuals in 23 extended families. The total heritability was fixed at 0.95.
normality. This behavior is expected from the work of Beaty et al. (1985), who showed that the consistency of variance component parameter estimates is robust to distributional violations.
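The induced-mixture construction described above can be sketched as a simple Monte Carlo check. The following stand-alone simulation (illustrative parameter values, not the GAW9 pedigree design) draws a trait from a single additive diallelic QTL plus a normal residual and inspects the moments of the resulting finite mixture:

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(1)

def simulate_trait(p, h2q, n=200_000):
    """Trait from an additive diallelic QTL plus a normal residual.

    Genotype means are -a, 0, a for allele counts 0, 1, 2; the additive
    QTL variance 2*p*(1-p)*a**2 is scaled to give QTL heritability h2q
    against a total phenotypic variance of 1.
    """
    a = np.sqrt(h2q / (2 * p * (1 - p)))           # genotype displacement
    g = rng.binomial(2, p, size=n)                 # allele count 0/1/2
    resid = rng.normal(0.0, np.sqrt(1 - h2q), n)   # keeps total variance 1
    return (g - 1) * a + resid

# For p = 0.9, h2q = 0.5 the induced mixture is negatively skewed and
# leptokurtic, consistent with the range of kurtoses reported in the text.
y = simulate_trait(p=0.9, h2q=0.5)
print(round(skew(y), 2), round(kurtosis(y), 2))
```

Varying p and h2q over the grid used in the chapter reproduces mixtures spanning platykurtic to moderately leptokurtic shapes.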
E. Type I error under model misspecification

To examine the influence of extreme deviations from normality on type I error, we performed a simulation experiment using sibpair data. We examined some of the same distributions as did Allison et al. (1999), including a mixture of normal distributions and a χ²₂ (chi-square with 2 degrees of freedom) distribution. The mixture distribution was generated using a QTL that explained 33% of the phenotypic variance and had a recessive allele frequency p = 0.15 that led to increased trait values. The total heritability was fixed at 0.7, and within-genotype variances were held constant. For the experiments involving the χ²₂ distribution, we used total heritability values of 0.6 and 0.8. For each mixture experiment we simulated 500 sibpairs and a completely informative unlinked marker. A total of 100,000 replicates was generated, and SOLAR was used to perform a number of different types of linkage analyses. Both mixture models generated extremely leptokurtic and skewed distributions. In practice, no reasonable analyst would analyze such a trait
Blangero et al.
assuming multivariate normality, and some type of transformation would typically be applied to reduce the skewness. More radically, an inverse Gaussian transformation could be employed to guarantee normality. However, as a worst-case scenario, it is useful to see how the test statistics fare when calculated under the assumption of multivariate normality, and also to examine the accuracy of the alternatives just discussed. Specifically, we analyzed each trait assuming (incorrectly) that it followed a multivariate normal distribution, a multivariate lognormal distribution, and a multivariate t distribution. Under each assumption, conventional lod scores were calculated and, in addition, we calculated the robust LOD_R score using the regression method already described to estimate the correction constant from 1000 randomly chosen replicates.
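The regression method for the correction constant can be sketched as follows; this is a minimal illustration of the quantile-regression idea, not SOLAR's exact implementation. Under the null, 2 ln 10 × LOD is asymptotically a ½:½ mixture of a point mass at zero and χ²₁, which supplies the expected quantiles:

```python
import numpy as np
from scipy.stats import chi2

def lod_correction_constant(null_lods):
    """Estimate the lod inflation constant c (so that LOD_R = LOD / c)
    by a no-intercept regression of expected asymptotic lod quantiles
    on the observed null lod quantiles.

    Under the null, 2 ln 10 * LOD is asymptotically a 1/2:1/2 mixture of
    a point mass at 0 and chi-square(1), so the expected lod at upper-tail
    probability u < 1/2 is chi2.ppf(1 - 2u, 1) / (2 ln 10).
    """
    obs = np.sort(np.asarray(null_lods))[::-1]      # descending order
    n = len(obs)
    u = (np.arange(n) + 0.5) / n                    # empirical tail probs
    keep = u < 0.5                                  # positive-lod half
    expected = chi2.ppf(1 - 2 * u[keep], df=1) / (2 * np.log(10))
    x = obs[keep]
    slope = (x @ expected) / (x @ x)                # expected ~ slope * obs
    return 1.0 / slope

# Check on exactly null-distributed lods: the constant should be near 1.
rng = np.random.default_rng(0)
z = rng.standard_normal(5000)
lods = np.where(z > 0, z ** 2, 0.0) / (2 * np.log(10))
print(round(lod_correction_constant(lods), 2))
```

Applied to lods from a misspecified model, the same estimator returns the inflation factor by which the naive lod scores must be divided.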
1. Finite mixture distribution Figure 12.2a shows the density function and moments for the normal mixture. The kurtosis of 2.47 is substantial but is likely to be encountered in practice. The type I error rates observed in 100,000 replications of each simulating model
Figure 12.2. (a) Density function for the normal mixture (mean = 0.09, variance = 1.33, skewness = 0.81, kurtosis = 2.47). (b) Density function for the χ²₂ distribution.
Table 12.2. Type I Error Rates Observed in 100,000 Replicates of 500 Sibpairs for a Trait Following a Normal Mixture Distribution

Heritability   Nominal α   MVN      MVN [ln x]   MVT      LOD_R
0.7            0.05        0.0928   0.0768       0.0432   0.0496
               0.01        0.0312   0.0215       0.0076   0.0104
               0.001       0.0067   0.0036       0.0006   0.0010
are shown for several nominal α levels in Table 12.2, where the column labeled "MVN" gives the type I error rates when multivariate normality is assumed. There is some inflation of error under this assumption, and naive application of the multivariate normal model could lead to errors of inference. Log transformation of the trait diminished the discrepancy somewhat but did not eliminate it. Analysis using the multivariate t distribution gave acceptable, if slightly conservative, error rates. The corrected lod score LOD_R yielded exceptionally accurate type I error rates.
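As a quick sanity check on such tables, the binomial standard error of an empirical rate shows that the inflation in the MVN column far exceeds Monte Carlo noise (a minimal sketch):

```python
import math

def type1_with_se(k, n):
    """Empirical type I error rate k/n and its binomial standard error."""
    p = k / n
    return p, math.sqrt(p * (1 - p) / n)

# Nominal alpha = 0.05 with 100,000 replicates: the observed MVN rate of
# 0.0928 in Table 12.2 lies dozens of standard errors above 0.05.
rate, se = type1_with_se(9280, 100_000)
print(rate, round(se, 5))
```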
2. The χ²₂ distribution

The χ²₂ distribution represents an extreme deviation from normality. Figure 12.2b shows the density function and the moments for this strongly leptokurtic distribution, and the results of the simulation experiment are summarized in Table 12.3. As noted by Allison et al. (1999), this trait distribution leads to substantial inflation of the type I error rate when multivariate normality is assumed, and the excess error increases as the total trait heritability increases. When the trait is log-transformed and analyzed, the type I error rates are reduced but remain somewhat in excess of expectation. Type I error rates obtained when a multivariate t distribution is assumed are less than expected,
Table 12.3. Type I Error Rates Observed in 100,000 Replicates of 500 Sibpairs for a Trait Following a χ²₂ Distribution

Heritability   Nominal α   MVN      MVN [ln x]   MVT      LOD_R
0.6            0.05        0.1162   0.0633       0.0396   0.0505
               0.01        0.0469   0.0153       0.0067   0.0099
               0.001       0.0127   0.0021       0.0005   0.0011
0.8            0.05        0.1407   0.0720       0.0470   0.0504
               0.01        0.0635   0.0202       0.0089   0.0101
               0.001       0.0211   0.0032       0.0006   0.0010
again suggesting that this approach will be slightly conservative. The robust lod score again provides an excellent test statistic, giving error rates consistent with asymptotic expectation. These results suggest that LOD_R can be used for any trait distribution.
3. A real example of a markedly nonnormal trait Since in practice, we are occasionally required to analyze radically nonnormal traits, we consider a trait from our genome scan for loci influencing susceptibility to helminthic infections in the Jirels, an isolated Nepalese population. In this study, led by Dr. Sarah Williams-Blangero, we are examining parasitic worm and egg counts in approximately 1000 individuals belonging to a single complex pedigree. There is clear evidence for genetic factors influencing these traits even when shared environment is considered (Williams-Blangero et al., 1999). For illustration, we consider the distribution of worm counts for the parasite Ascaris lumbricoides. Typically, such count data exhibit a negative binomial distribution with an extremely long right tail. Figure 12.3 shows the distribution of this trait after log transformation and removal of covariate effects such as age and sex. Even after transformation, however, the trait distribution
Figure 12.3. Distribution of ln(worm count + 1) residuals in a Nepalese population (mean = −0.03, variance = 20.6, skewness = 3.08, kurtosis = 12.4).
remains highly leptokurtic (kurtosis > 12), and the assumption of multivariate normality is grossly violated. Using the Jirel pedigree structure, we simulated a completely informative unlinked marker and determined the type I error rates when multivariate normality is assumed and when the robust lod score is used. We performed 10,000 simulations, using a total additive genetic heritability for the trait of 0.47 (Williams-Blangero et al., 1999). The observed type I error rates were 0.145, 0.07, and 0.025 for nominal α values of 0.05, 0.01, and 0.001, respectively; thus the inflation of type I error with this trait is greater even than that seen for the χ²₂ distribution (Figure 12.2b and Table 12.3). Figure 12.4 illustrates the linear relationship between the robust lod score LOD_R and the observed incorrect lod score. The correction constant was estimated as 2.5, indicating that lod scores calculated under the assumption of multivariate normality are 2.5 times too large. However, when we examine the distribution of the LOD_R statistic (Figure 12.5), we find a nearly exact correspondence with asymptotic expectations. This extreme example again documents the validity of the robust statistic LOD_R. In addition, this example
Figure 12.4. Relationship between the expected robust lod score and the observed lod score calculated under the assumption of multivariate normality for the trait distribution in Figure 12.3. The regression line has slope 0.4, and was fitted using all points for which LOD < 3.75.
Figure 12.5. Cumulative distribution function for the LOD_R compared with its asymptotic expectation.
demonstrates the ability of variance component linkage analysis to accommodate extremely large and complex pedigrees.
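The behavior of long-tailed count traits under the ln(count + 1) transform can be sketched as follows; the negative binomial parameters here are invented for illustration and are not estimates from the Jirel data:

```python
import numpy as np
from scipy.stats import nbinom, kurtosis

rng = np.random.default_rng(42)

# Hypothetical overdispersed worm counts: a negative binomial with a
# very long right tail (theoretical excess kurtosis ~ 6/r = 20 here).
r, p = 0.3, 0.02
counts = nbinom.rvs(r, p, size=10_000, random_state=rng)

logged = np.log(counts + 1.0)    # the ln(count + 1) transform
resid = logged - logged.mean()   # stand-in for covariate removal

# The transform compresses the right tail but need not yield normality.
print(round(kurtosis(counts), 1), round(kurtosis(resid), 1))
```

This mirrors the chapter's point: transformation tames the worst of the tail, but the residuals of a trait like this can remain far from Gaussian, which is exactly the regime where the robust lod score earns its keep.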
V. POWER OF VARIANCE COMPONENT LINKAGE ANALYSIS

The failure of penetrance model-based linkage methods to generalize from simple Mendelian traits to complex phenotypes has motivated the development of numerous strategies to search for genes for complex diseases. An essential lesson emerging from these efforts is that different diseases will require different study designs, which will, in turn, require different analytical methods. We believe that disease prevalence offers the best basis for choosing the optimal sampling design and analytical method for a given trait. With rare complex diseases, linkage studies using only affected individuals will extract most of the information regarding the genetics of the trait (Risch, 1990). The inclusion of unaffected individuals in linkage analyses of rare complex diseases provides little additional information, and consequently it is most cost-effective to focus sampling effort on the collection of the affected individuals.
The situation is dramatically altered for common complex diseases, for which the ascertainment of affected-only individuals is less than optimal. For a common disease, with a prevalence greater than about 10%, incorporating unaffected individuals can markedly improve power, and studies limited to affected individuals begin to lose power precipitously. Furthermore, many common diseases can successfully be investigated by means of genetic analysis in extended kindreds of quantitative disease risk factors and other disease correlates, or of implicit models of continuous disease liability. A critical concern with any method of linkage analysis is its power, or the probability that the test will correctly reject a false null hypothesis of no linkage. Power studies can be difficult, however, and are often undertaken by making use of simulated data sets (Duggirala et al., 1997; Williams and Blangero, 1999a). Nevertheless, some general results have been developed (Risch, 1990; Williams and Blangero, 1999b) that can be used to evaluate different strategies for linkage analysis. In this section we compare the power of the variance component approach using quantitative correlates with that for an affected sibpair approach for a variety of disease models.
A. Affected sibpair linkage analysis

Risch (1990) has shown that for the affected sibpair linkage design the power to localize a disease-influencing QTL is a monotonic function of the relative risk to siblings, λ_s. Although λ_s is usually defined in terms of prevalences as K_S/K, where K_S is the prevalence in siblings of an affected proband and K is the population prevalence, it can also be expressed in terms of the trait heritability as

    λ_s = 1 + [(1 − K)/K] (h_q²/2 + d_q²/4),        (12.23)

where h_q² is the heritability due to the disease gene on the binary scale and d_q² is the relative proportion of variance due to dominance effects on the binary scale. We emphasize that the heritability and dominance effects are on the binary scale because this scale is inappropriate for diseases that truly have an underlying continuous liability, and this discrepancy generally leads to underestimates of the true heritability (Dempster et al., 1950). The dominant quantity in the expression for λ_s is the disease prevalence; the factor (1 − K)/K can take any positive value depending upon the magnitude of K, while the factor (h_q²/2 + d_q²/4) has a theoretical maximum of 1/2. Therefore λ_s can become very large as the disease prevalence decreases, even when the contribution of the QTL is very small. Equation (12.23) can be combined with other results in Risch (1990) to explore the dependence of the power of affected relative pair linkage analysis on disease prevalence and QTL heritability.
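Equation (12.23) can be computed directly; the prevalence and heritability values below are illustrative:

```python
def lambda_s(K, h2_binary, d2_binary=0.0):
    """Sibling relative risk from Equation (12.23):
    lambda_s = 1 + ((1 - K) / K) * (h2/2 + d2/4),
    with the heritability h2 and dominance d2 on the binary scale."""
    return 1.0 + ((1.0 - K) / K) * (h2_binary / 2.0 + d2_binary / 4.0)

# The same small additive QTL effect gives a large lambda_s for a rare
# disease but a negligible one for a common disease:
print(round(lambda_s(K=0.01, h2_binary=0.05), 2))
print(round(lambda_s(K=0.35, h2_binary=0.05), 2))
```

This makes the prevalence dominance concrete: with K = 0.01 the factor (1 − K)/K is 99, whereas with K = 0.35 it is below 2.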
B. Quantitative trait variance component linkage analysis

In contrast to the affected pair method, the power to detect linkage by means of the variance component method for quantitative traits is primarily a function of the disease heritability on the quantitative scale, and is only slightly influenced by disease prevalence. Williams and Blangero (1999b), who investigated in general terms the power of variance component linkage analysis of quantitative traits, derived exact expressions for the sample size required to achieve a given power with various sampling structures. In general, the power of the variance component method to detect linkage is determined by the distribution of the likelihood ratio statistic. Under regularity conditions this statistic asymptotically follows a noncentral chi-square distribution χ²(ν, ξ) having ν degrees of freedom and noncentrality parameter ξ given by

    ξ = (θ − θ̃)′ V⁻¹ (θ − θ̃),        (12.24)

where θ is the true parameter vector, θ̃ is its value under the null hypothesis, and V is the variance-covariance matrix of the parameter estimates.

The power of the likelihood ratio test is therefore given by

    Power = ∫ from χ²_α(ν, 0) to ∞ of χ²(ν, ξ) dx,        (12.25)

where χ²_α(ν, 0) is the 100(1 − α) percentage point of the central chi-square distribution with ν degrees of freedom. For a univariate test of linkage having a single degree of freedom and declared significant at a LOD = 3.0, the lower limit of integration in Equation (12.25) is χ²_{0.0001}(1, 0) = 3.0 × 2 ln 10 = 13.82. Under these conditions, the value of the noncentrality parameter required to achieve 90% power is 24.98; for 80% power the value is 20.78.

For certain relationship classes, the matrix equations in the general formulation can be manipulated relatively easily. Williams and Blangero (1999b) derived exact expressions for the sample size required to achieve a given asymptotic power in variance component linkage analysis of sibpairs, sib trios, two- and three-sib nuclear families, and general relative pairs. For simplicity, they considered a covariance model consisting of a major gene effect, a residual additive genetic effect, and an individual-specific random environmental effect. For sibpairs, the contribution by a single sibpair to the expected lod score is
    E_LOD = [1/(2 ln 10)] · (h_q²)²[(h_q²)² + 4] / {2[(h_q²)² − 4]²}.        (12.26)

For a test of linkage having 80% power at a LOD = 3.0, the critical value of ξ is 20.78; consequently the number of sibpairs required is n = 20.78/[(2 ln 10) E_LOD], and the total number of individuals required is 2n.
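These calculations are easy to reproduce numerically. The sketch below checks the stated noncentrality values against the noncentral chi-square tail of Equation (12.25) and then applies the sibpair expected lod of Equation (12.26) to obtain required sibpair counts (treat the printed sample sizes as illustrative):

```python
import numpy as np
from scipy.stats import ncx2

CRIT = 3.0 * 2 * np.log(10)   # lower integration limit for LOD = 3 (13.82)

def power(xi, df=1):
    """Equation (12.25): upper tail of the noncentral chi-square."""
    return ncx2.sf(CRIT, df, xi)

def elod_sibpair(h2q):
    """Expected lod per sibpair as a function of QTL heritability h2q."""
    h4 = h2q ** 2
    return h4 * (h4 + 4) / (2 * (h4 - 4) ** 2) / (2 * np.log(10))

def sibpairs_needed(h2q, xi_crit=20.78):
    """n = xi_crit / [(2 ln 10) E_LOD]; total individuals = 2n."""
    return xi_crit / (2 * np.log(10) * elod_sibpair(h2q))

print(round(power(24.98), 3), round(power(20.78), 3))  # ~0.90 and ~0.80
for h2q in (0.1, 0.3, 0.5):
    print(h2q, int(round(sibpairs_needed(h2q))))
```

The sample sizes grow rapidly as the QTL heritability falls, which is the quantitative basis for the design comparisons that follow.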
A result for arbitrary relative pairs is also of interest and can serve as a basis for estimating the power of any given pedigree. At a given locus let k₁ denote the probability that two individuals i, j share one allele IBD, and let k₂ denote the probability that i, j share both alleles IBD. The contribution per relative pair to the expected lod score is then

    E_LOD = [1/(2 ln 10)] · (h_q²)²[(h_q²)² k₁² + 1][k₁ + 4k₂(1 − 2k₁)] / {4[(h_q²)² k₁² − 1]²}.        (12.27)
This result can be used to estimate the power to detect linkage with a pedigree of any structure by summing the expected lod score for each relative class over the distribution of relative classes within the pedigree. Note, however, that this result is based on the assumption that the relative pair in question exhibits nonzero variance in the number of alleles they share IBD; consequently, parent-offspring pairs and monozygotic twin pairs must be excluded. This is of no particular disadvantage, however, for investigating the power to detect linkage, since neither of these relationships exhibits any variance in IBD sharing at a QTL and cannot be informative for linkage.
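Summing Equation (12.27) over relative classes gives a per-pedigree power sketch; the pedigree composition below is invented for illustration:

```python
import numpy as np

def elod_pair(h2q, k1, k2):
    """Expected lod per relative pair, Equation (12.27); k1 and k2 are
    the prior probabilities of sharing one or two alleles IBD."""
    h4 = h2q ** 2
    num = h4 * (h4 * k1 ** 2 + 1) * (k1 + 4 * k2 * (1 - 2 * k1))
    den = 4 * (h4 * k1 ** 2 - 1) ** 2
    return num / den / (2 * np.log(10))

# Hypothetical pedigree composition: relative class -> (k1, k2, n pairs).
# Parent-offspring and MZ pairs are excluded, as required in the text.
pairs = {"sib": (0.50, 0.25, 10),
         "half-sib": (0.50, 0.00, 4),
         "avuncular": (0.50, 0.00, 6)}
h2q = 0.3
total_elod = sum(n * elod_pair(h2q, k1, k2) for k1, k2, n in pairs.values())
print(round(total_elod, 4))
```

For full sibs (k₁ = 1/2, k₂ = 1/4) this expression reduces algebraically to the sibpair formula of Equation (12.26), which provides a useful consistency check.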
C. Different diseases, different designs, different methods

Our analytical results for the asymptotic power of affected sibpair and variance component linkage analysis allow us to systematically compare alternative sampling designs with regard to their utility for localizing genes that influence complex common diseases. Figure 12.6 compares the power to detect linkage with these approaches by comparing the number of individuals required in each analysis to achieve 80% power to detect a QTL with LOD ≥ 3. Five study designs are illustrated in each graph. The two broken lines show the numbers of individuals required in an affected sibpair study when the disease-associated allele is rare (p = 0.1) or common (p = 0.5). The three solid lines represent the required sizes of randomly selected samples when the variance component method is used with sibships of size 2 and 4 (labeled "S2" and "S4," respectively) and in extended pedigrees (labeled "P"). For the extended pedigrees, we used a typical family structure from the San Antonio Family Heart Study containing 48 individuals. The prevalence of the focal disease is varied in the four panels of Figure 12.6 to be 1, 15, 25, or 35%. The prevalence of 1% is typical of a rare complex disease, such as schizophrenia, whereas a prevalence of 35% is seen for numerous chronic diseases in the United States, including obesity and hypertension. It is evident from Figure 12.6 that each approach to linkage is optimal for different disease prevalences, with the quantitative trait variance component approach outperforming the affected sibpair approach for common diseases.

Figure 12.6. Analytical power curves demonstrating the effect of disease prevalence, QTL effect size, and study design on the power to detect linkage.

These power curves also reveal the dramatic advantage of pedigree-based designs over sibship-based designs and highlight the potential power of extended pedigrees for mapping QTLs at any disease prevalence. For example, QTLs accounting for as little as 5% of the variance can be mapped with only 10,000 people if large pedigrees are employed. Although this may at first appear to be a prohibitively large sample size, many epidemiological studies are larger than this, as are some collaborative genetic studies.
D. Effect of ascertainment and ascertainment correction

The sample sizes shown in Figure 12.6 for the variance component method were calculated assuming random sampling and are therefore conservative estimates. Substantial improvements in power can be obtained when consistent selective sampling is performed. Figure 12.7 illustrates this increase in power for sibships of size 2 and size 4 when sibships are ascertained through a single proband in the upper 10% of the quantitative trait distribution. To generate these results, the residual additive heritability was fixed at 0.3 and the frequency of the trait-causing allele was set to 0.1. Correction for ascertainment was made by conditioning the likelihood for the sibship on the phenotype of the proband (Hopper and Mathews, 1982; Boehnke and Lange, 1984). Each curve is based on 100 replications of 300 sibships.

Figure 12.7. Comparison of the sample sizes required to achieve 80% power to detect linkage at a LOD = 3 in sibships sampled randomly and in sibships singly ascertained through a proband in the upper 10% of the trait distribution.

For a given sibship size, Figure 12.7 shows that the power to detect linkage can be increased by ascertaining the sample. Over a considerable range of QTL heritabilities, the sample size required under ascertainment can be three to four times smaller than would be required under a random sampling design. Furthermore, the size of the sibship has little effect on the relative increase in power of an ascertained sample over a random sample (cf. the separation of the ascertained and unascertained curves for each sibship size). Note also that the use of a larger sampling unit, whether ascertained or not, dramatically reduces the sample size required to detect linkage at a given effect size. Although ascertainment on a focal phenotype can markedly lower the number of sibships required to detect linkage, selective sampling also introduces some potential disadvantages that should not be ignored. First, the appropriate correction for sampling bias can be difficult or impossible to implement, but is crucial if population parameters such as the QTL effect size are to be estimated accurately (Comuzzie and Williams, 1999). Selective sampling of only extremely
affected individuals can also markedly increase recruitment costs. Finally, although the ascertained sample may be efficient for linkage analysis of the focal phenotype, it will not in general be equally efficient for other phenotypes. In fact, the ascertained sample can be seriously underpowered even for linkage analysis of traits that are highly correlated with the focal phenotype. A randomly chosen sample, however, is equally powerful irrespective of phenotype chosen for linkage analysis.
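The proband-conditioning correction cited above (Hopper and Mathews, 1982; Boehnke and Lange, 1984) can be sketched for a sibpair under a bivariate normal model; all parameter values here are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def conditional_loglik(y, proband_idx, Sigma):
    """Ascertainment-corrected log-likelihood for one sibship:
    log L(sibship) - log L(proband), i.e., the likelihood of the data
    conditioned on the phenotype of the proband."""
    joint = multivariate_normal(mean=np.zeros(len(y)), cov=Sigma).logpdf(y)
    sd = np.sqrt(Sigma[proband_idx, proband_idx])
    proband = norm(0.0, sd).logpdf(y[proband_idx])
    return joint - proband

# Illustrative sibpair: unit variances, sib correlation 0.4, with sib 1
# ascertained from the upper tail of the trait distribution.
Sigma = np.array([[1.0, 0.4], [0.4, 1.0]])
y = np.array([1.9, 0.7])
print(round(conditional_loglik(y, 0, Sigma), 3))
```

For the bivariate normal this quotient is exactly the conditional density of the nonproband sib given the proband, which is why the correction removes the bias introduced by sampling on extreme probands.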
VI. CONCLUSIONS

In this chapter we have provided a basic review of variance component linkage analysis with an emphasis on its robustness to model misspecification and on its power. We believe that the variance component approach is well suited for the linkage analysis of quantitative traits. Its ability to fully utilize information from pedigrees of arbitrary size and complexity can be exploited to design more powerful linkage studies incorporating extended families, and so avoid a reliance on inefficient designs limited to sibpairs or sibships. Our work indicates that marked model misspecification can detrimentally affect the validity of variance component-based tests for linkage. However, any excess type I error is primarily a function of positive kurtosis, and it seems likely that only distributions with kurtosis greater than 1.5–2 will require alternative robust tests. We have described several tests that eliminate the problem of model misspecification, including a robust and easy-to-implement likelihood ratio test that can be used for any extreme distribution. Although we have concentrated on the most basic model, the variance component model is easily generalized to account for many biological complexities such as genotype-environment interaction, epistasis, oligogenic inheritance, and pleiotropy. In fact, many of the complexities of biological traits can be well approximated by the addition of only a few variance component parameters. Such modeling parsimony in the presence of statistical power, parameter consistency, and robustness to model misspecification provides for a remarkably flexible statistical genetic framework. The historical success of these simple models bodes well for their future widespread use in mapping human QTLs.
Acknowledgments

This research was supported in part by National Institutes of Health grants MH59490, AI1042, HL45522, GM31575, GM18897, and HL28972. We are grateful to Tom Dyer and Charles Peterson for their expert programming assistance. Interested individuals can obtain our computer package, SOLAR, by visiting our Web site at http://www.sfbr.org. We regret that page restrictions forced us to omit a considerable number of references relevant to the issues discussed in this chapter.
References

Allison, D. B., Neale, M. C., Zannolli, R., Schork, N. J., Amos, C. I., and Blangero, J. (1999). Testing the robustness of the likelihood-ratio test in a variance-component quantitative-trait loci-mapping procedure. Am. J. Hum. Genet. 65, 531–544.
Almasy, L., and Blangero, J. (1998). Multipoint quantitative-trait linkage analysis in general pedigrees. Am. J. Hum. Genet. 62, 1198–1211.
Almasy, L., Hixson, J. E., Rainwater, D. L., Cole, S., Williams, J. T., Mahaney, M. C., VandeBerg, J. L., Stern, M. P., MacCluer, J. W., and Blangero, J. (1999). Human pedigree-based quantitative-trait-locus mapping: Localization of two genes influencing HDL-cholesterol metabolism. Am. J. Hum. Genet. 64, 1686–1693.
Amos, C. I. (1994). Robust variance-components approach for assessing genetic linkage in pedigrees. Am. J. Hum. Genet. 54, 535–543.
Amos, C. I., Zhu, D. K., and Boerwinkle, E. (1996). Assessing genetic linkage and association with robust components of variance approaches. Ann. Hum. Genet. 60, 143–160.
Beaty, T. H., Self, S. G., Liang, K. Y., Connolly, M. A., Chase, G. A., and Kwiterovich, P. O. (1985). Use of robust variance components models to analyse triglyceride data in families. Ann. Hum. Genet. 49, 315–328.
Begleiter, H., Porjesz, B., Reich, T., Edenberg, H. J., Goate, A., Blangero, J., Almasy, L., Foroud, T., Van Eerdewegh, P., Polich, J., Rohrbaugh, J., Kuperman, S., Bauer, L. O., O'Connor, S. J., Chorlian, D. B., Li, T. K., Conneally, P. M., Hesselbrock, V., Rice, J. P., Schuckit, M. A., Cloninger, R., Nurnberger, J. Jr., Crowe, R., and Bloom, F. E. (1998). Quantitative trait loci analysis of human event-related brain potentials: P3 voltage. Electroencephalogr. Clin. Neurophysiol. 108, 244–250.
Blangero, J. (1993). Statistical genetic approaches to human adaptability. Hum. Biol. 65, 941–966.
Blangero, J. (1995). Genetic analysis of a common oligogenic trait with quantitative correlates: Summary of GAW9 results. Genet. Epidemiol. 12, 689–706.
Blangero, J., and Almasy, L. (1997). Multipoint oligogenic linkage analysis of quantitative traits. Genet. Epidemiol. 14, 959–964.
Blangero, J., Williams, J. T., Iturria, S. J., and Almasy, L. (1999). Oligogenic model selection using the Bayesian information criterion: Linkage analysis of P300 Cz event-related brain potential. Genet. Epidemiol. 17, S67–S72.
Boehnke, M., and Lange, K. (1984). Ascertainment and goodness of fit of variance component models for pedigree data. Prog. Clin. Biol. Res. 147, 173–192.
Browne, M. W., and Shapiro, A. (1987). Adjustments for kurtosis in factor analysis with elliptically distributed errors. J. R. Stat. Soc. 49, 346–352.
Cloninger, C. R., Van Eerdewegh, P., Goate, A., Edenberg, H. J., Blangero, J., Hesselbrock, V., Reich, T., Nurnberger, J. Jr., Schuckit, M., Porjesz, B., Crowe, R., Rice, J. P., Foroud, T., Przybeck, T. R., Almasy, L., Bucholz, K., Wu, W., Shears, S., Carr, K., Crose, C., Willig, C., Zhao, J., Tischfield, J. A., Li, T.-K., Conneally, P. M., and Begleiter, H. (1998). Anxiety proneness linked to epistatic loci in genome scan of human personality traits. Am. J. Med. Genet. 81, 313–317.
Comuzzie, A. G., and Williams, J. T. (1999). Correcting for ascertainment bias in the COGA data set. Genet. Epidemiol. 17 (suppl 1), S109–S114.
Comuzzie, A. G., Hixson, J. E., Almasy, L., Mitchell, B. D., Mahaney, M. C., Dyer, T. D., Stern, M. P., MacCluer, J. W., and Blangero, J. (1997). A major quantitative trait locus determining serum leptin levels and fat mass is located on human chromosome 2. Nat. Genet. 15, 273–276.
Cotterman, C. W. (1941). A calculus for statistico-genetics. Unpublished Ph.D. dissertation, Ohio State University, Columbus.
Dempster, E. R., Lerner, M. I., and Robertson, A. (1950). Heritability of threshold characters. Genetics 35, 212–236.
Duggirala, R., Stern, M. P., Mitchell, B. D., Reinhart, L. J., Shipman, P. A., Uresandi, O. C., Chung, W. K., Leibel, R. L., Hales, C. N., O'Connell, P., and Blangero, J. (1996). Quantitative variation in obesity-related traits and insulin precursors linked to the OB gene region on human chromosome 7. Am. J. Hum. Genet. 59, 694–703.
Duggirala, R., Williams, J. T., Williams-Blangero, S., and Blangero, J. (1997). A variance component approach to dichotomous trait linkage analysis using a threshold model. Genet. Epidemiol. 14, 987–992.
Duggirala, R., Blangero, J., Almasy, L., Dyer, T. D., Williams, K. L., Leach, R. J., O'Connell, P., and Stern, M. P. (1999). Linkage of type 2 diabetes mellitus and age of onset to a genetic location on chromosome 10q in Mexican Americans. Am. J. Hum. Genet. 64, 1127–1140.
Foutz, R. V., and Srivastava, R. C. (1977). The performance of the likelihood ratio test when the model is incorrect. Ann. Stat. 5, 1183–1194.
Goldgar, D. E. (1990). Multipoint analysis of human quantitative genetic variation. Am. J. Hum. Genet. 47, 957–967.
Haseman, J. K., and Elston, R. C. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2, 3–19.
Hopper, J. L., and Mathews, J. D. (1982). Extensions to multivariate normal models for pedigree analysis. Ann. Hum. Genet. 46, 373–383.
Jaquish, C. E., Leland, M. M., Dyer, T., Towne, B., and Blangero, J. (1997). Ontogenetic changes in genetic regulation of fetal morphometrics in baboons (Papio hamadryas subspp.). Hum. Biol. 69, 831–848.
Kruglyak, L., and Lander, E. S. (1995). Complete multipoint sib-pair analysis of qualitative and quantitative traits. Am. J. Hum. Genet. 57, 439–454.
Lander, E. S., and Schork, N. J. (1994). Genetic dissection of complex traits. Science 265, 2037–2048.
Lange, K., Westlake, J., and Spence, M. A. (1976). Extensions to pedigree analysis. III. Variance components by the scoring method. Ann. Hum. Genet. 39, 485–491.
Lange, K. L., Little, R. J. A., and Taylor, J. M. G. (1989). Robust statistical modeling using the t distribution. J. Am. Stat. Assoc. 84, 881–896.
Liang, K.-Y., and Self, S. G. (1996). On the asymptotic behaviour of the pseudolikelihood ratio test statistic. J. R. Stat. Soc. B 58, 785–796.
MacCluer, J. W., Stern, M. P., Almasy, L., Atwood, L. A., Blangero, J., Comuzzie, A. G., Dyke, B., Haffner, S. M., Henkel, R. D., Hixson, J. E., Kammerer, C. M., Mahaney, M. C., Mitchell, B. D., Rainwater, D. L., Samollow, P. B., Sharp, R. M., VandeBerg, J. L., and Williams, J. T. (1999). Genetics of atherosclerosis risk factors in Mexican Americans. Nutr. Rev. 57, S59–S65.
Mitchell, B. D., Kammerer, C. M., Blangero, J., Mahaney, M. C., Rainwater, D. L., Dyke, B., Hixson, J. E., Henkel, R. D., Sharp, R. M., Comuzzie, A. G., VandeBerg, J. L., Stern, M. P., and MacCluer, J. W. (1996). Genetic and environmental contributions to cardiovascular risk factors in Mexican Americans: The San Antonio Family Heart Study. Circulation 94(9), 2159–2170.
Mitchell, B. D., Ghosh, S., Schneider, J. L., Birznieks, G., and Blangero, J. (1997). Power of variance component linkage analysis to detect epistasis. Genet. Epidemiol. 14, 1017–1022.
Mitchell, B. D., Cole, S. A., Comuzzie, A. G., Almasy, L., Blangero, J., MacCluer, J. W., and Hixson, J. E. (1999). A quantitative trait locus influencing BMI maps to the region of the β-3 adrenergic receptor. Diabetes 48, 1863–1867.
Rainwater, D. L., Almasy, L., Blangero, J., Cole, S. A., VandeBerg, J. L., MacCluer, J. W., and Hixson, J. E. (1999). A genome search identifies major quantitative trait loci on human chromosomes 3 and 4 that influence cholesterol concentrations in small LDL particles. Arterioscler. Thromb. Vasc. Biol. 19, 777–783.
Risch, N. (1990). Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am. J. Hum. Genet. 46, 229–241.
Schork, N. J. (1993). Extended multipoint identity-by-descent analysis of human quantitative traits: Efficiency, power, and modeling considerations. Am. J. Hum. Genet. 53, 1306–1319.
Self, S. G., and Liang, K.-Y. (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Stat. Assoc. 82, 605–610.
Stern, M. P., Duggirala, R., Mitchell, B. D., Reinhart, L. J., Shivakumar, S., Shipman, P. A., Uresandi, O. C., Benavides, E., Blangero, J., and O'Connell, P. (1996). Evidence for linkage of regions on chromosomes 6 and 11 to plasma glucose concentrations in Mexican Americans. Genome Res. 6, 724–734.
Towne, B., Siervogel, R. M., and Blangero, J. (1997). Effects of genotype-by-sex interaction on quantitative trait linkage analysis. Genet. Epidemiol. 14, 1053–1058.
Westfall, P. H. (1987). A comparison of variance component estimates for arbitrary underlying distributions. J. Am. Stat. Assoc. 82, 866–874.
White, H. (1994). "Estimation, Inference and Specification Analysis." Econometric Society Monographs. Cambridge University Press, Cambridge.
Wijsman, E. M., and Amos, C. I. (1997). Genetic analysis of simulated oligogenic traits in nuclear and extended pedigrees: Summary of GAW10 contributions. Genet. Epidemiol. 14, 719–735.
Williams, J. T., and Blangero, J. (1999a). Comparison of variance components and sibpair-based approaches to quantitative trait linkage analysis in unselected samples. Genet. Epidemiol. 16, 113–134.
Williams, J. T., and Blangero, J. (1999b). Power of variance component linkage analysis to detect quantitative trait loci. Ann. Hum. Genet. 63, 545–563.
Williams, J. T., Van Eerdewegh, P., Almasy, L., and Blangero, J. (1999a). Joint multipoint linkage analysis of multivariate qualitative and quantitative traits. I. Likelihood formulation and simulation results. Am. J. Hum. Genet. 65, 1134–1147.
Williams, J. T., Begleiter, H., Porjesz, B., Edenberg, H. J., Foroud, T., Reich, T., Goate, A., Van Eerdewegh, P., Almasy, L., and Blangero, J. (1999b). Joint multipoint linkage analysis of multivariate qualitative and quantitative traits. II. Alcoholism and event-related potentials. Am. J. Hum. Genet. 65, 1148–1160.
Williams-Blangero, S., Subedi, J., Upadhayay, R. P., Manral, D. B., Rai, D. R., Jha, B., Robinson, E. S., and Blangero, J. (1999). Genetic analysis of susceptibility to infection with Ascaris lumbricoides. Am. J. Trop. Med. Hyg. 60, 921–926.
Linkage and Association with Structural Relationships

Michael A. Province
Division of Biostatistics, Washington University School of Medicine, St. Louis, Missouri 63110
I. Summary
II. Introduction
III. SEGPATH Linkage and Association Models
IV. Unique Features of SEGPATH Models
V. Discussion
References
I. SUMMARY

The use of structural equations (path analysis) provides an alternative, equivalent formulation to variance components models. Instead of partitioning the variance, we focus on modeling the underlying random variables themselves through a system of linear, mixed model, regression equations. A few specific examples of genetic path models for linkage and association (linkage disequilibrium) are discussed. This formulation provides a simple yet elegant framework that can continue to be extended to meet the challenges of modeling and dissecting the genetic nature of complex traits in the new century.
Advances in Genetics, Vol. 42. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.

II. INTRODUCTION

An alternative formulation of the variance components models discussed in the preceding chapter is provided by the theory of path analysis, which is sometimes
also referred to as structural equation modeling. Path analysis, introduced by Sewall Wright (1921), is at its heart mathematically very similar to variance component modeling, but the focus is different (see also Li, 1975). Instead of modeling the variance, we model the random variables themselves. This shift of focus means that models that are more cumbersome or less intuitive to posit in one framework can be straightforward to develop in the other. Taken together, the two approaches complement each other. We shall first discuss the similarities, differences, and equivalencies of the two formulations, turning then to a few specific models for linkage and association for complex traits that use the path analysis approach. Finally, we discuss how this formulation can be further extended to meet the challenges of modeling and dissecting the genetic nature of complex traits in the new century. Path analysis and variance components have much the same relationship to one another as their more traditional cousins, regression and analysis of variance (ANOVA): they are really two different formulations of the same underlying mathematical treatment. In the variance components paradigm, just as in ANOVA, we model on the variance scale. We attempt to decompose the total variance of a trait into its constituent explained and unexplained sources. This includes the parts due to different measured covariates, measured and unmeasured genetic and familial effects, linked and unlinked components, and so on. In fact, variance components models can be thought of as the generalization of ANOVA models to include so-called "random effects" in addition to the simple "fixed effects" modeled in ANOVA. The distinction between fixed and random effects is that we make no distributional assumptions whatsoever about the fixed effects when we estimate or test the model; they are simply "given" as data.
However, we do make additional assumptions about how the random effects are distributed in the data (marginally as well as in joint distribution). Usually this is a Gaussian or multivariate normality assumption, but it need not be. In fact, in ANOVA, the only statistical distribution assumption usually made is on the residual of the model (the unexplained variance), whereas in variance components we make stronger assumptions about all random effects, which allows us to model much more complicated systems. Similarly, path analysis, like regression, casts the same model in terms of a system of predictive linear equations in the underlying random variables themselves. Just as variance components is a generalization of ANOVA to include random effects, path analysis is a generalization of regression to include latent (unmeasured) random variables as well as the traditional measured ones. As in variance components, the variables in path analysis can be fixed or random. The differences between path analysis and variance components are more cosmetic than real, just as the differences between regression
and ANOVA are more in the way we conceive of models than in the underlying mathematics. In almost all cases, we can cast any variance component model into its path analysis equivalent, and vice versa, and obtain the same estimation and testing procedures, with the same degree of accuracy, precision, and power.
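To make the fixed-versus-random distinction above concrete, here is a toy simulation in Python (all effect sizes are invented for illustration): an age effect enters as a fixed effect with no distributional assumption, while a shared familial effect is a Gaussian random effect whose variance can be recovered, on the variance scale, from the between-sib covariance.

```python
import random

random.seed(1)

# Simulate sib pairs: trait = fixed age effect + shared familial random
# effect u (variance 0.36) + independent residual (variance 0.64)
families, trait = 2000, []
for fam in range(families):
    u = random.gauss(0, 0.6)
    pair = []
    for _ in range(2):
        age = random.uniform(20, 60)
        y = 0.02 * age + u + random.gauss(0, 0.8)
        pair.append(y - 0.02 * age)      # strip the fixed effect
    trait.append(pair)

# Variance-components view: the between-sib covariance of the adjusted
# trait estimates Var(u), the familial random-effect variance
flat = [y for pair in trait for y in pair]
mean = sum(flat) / len(flat)
cov_fam = sum((a - mean) * (b - mean) for a, b in trait) / families
print(round(cov_fam, 2))   # close to the simulated Var(u) = 0.36
```

Note that the fixed age effect needed no distributional assumption and was simply removed, whereas the familial effect needed one (Gaussian here), and its variance is exactly the quantity a variance-components fit would report.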
III. SEGPATH LINKAGE AND ASSOCIATION MODELS

One implementation of the path analysis/structural equations formulation of genetic models is provided through the SEGPATH formulation (Province and Rao, 1995; Province et al., 2000). The SEGPATH framework makes it possible to easily develop, expand, and extend such generalized linear models to include all genetic, familial, and nonfamilial sources and causes of phenotypic variation, including linkage equilibrium as well as disequilibrium information, in a consistent, coherent manner. Using the SEGPATH structure, models can be developed to perform segregation analysis, path analysis, linkage (equilibrium) analysis, linkage disequilibrium (allelic association) analysis, or combinations thereof, using any number of latent or observed factors (Rao and Province, 2000). These can accommodate multivariate phenotypes, linkage to a single marker or true multipoint linkage, environmental indices, and/or any number of measured covariate fixed effects (including measured genotypes), as well as genotype-specific covariate effects. Population heterogeneity models, repeated-measures models, longitudinal models, autoregressive models, developmental models, and gene-environment interaction models can all be analyzed via the SEGPATH method. Pedigree structures can be defined to be arbitrarily complex (without loops), and the data analyzed can have any missing value structure (assumed to be missing at random), with entire individuals missing, or missing on one or more measurements.
Corrections for ascertainment can be done on a vector of phenotypes and/or other measures, maximizing the likelihood conditionally on those exact values (Boehnke and Lange, 1984), which provides a simple yet quite parsimonious correction for ascertainment (Rao et al., 1988). Because the model specification syntax is general, the SEGPATH approach can also be used in nongenetic applications where there is a hierarchical structure, such as longitudinal, repeated-measures, time series, or nested models. Basically, any (consistent) path model that can be drawn is translated into the corresponding set of regression equations, with the proper implied constraints, which defines and implements that model. The complexity of the model is limited only by the data and computer resources available. The most
general SEGPATH model consists of a set of structural equations defined by a matrix equation of the form V = DV, where V is the complete vector of all k variables in the model (including all observed, latent, and residual factors), and D is a k × k matrix of regression (path) coefficients, some known and some to be estimated from the data. Thus, for every variable in the model, D has exactly one row, which corresponds to the one and only structural equation having that variable as the dependent one. For the primary (exogenous) variables, the corresponding row of D is formally 1 on the diagonal and 0 everywhere else (i.e., the identity equation). For all other variables, the diagonal element is zero and at least one other element in the row is nonzero. If we order the variables in V by their "causal order" (primary causes first, down to the "bottom" of the path diagram), then the matrix D is lower triangular for nonrecursive models (those without feedback loops or reciprocal interactions). For family data (or other block-diagonally covariance-structured data, such as arise in longitudinal, growth curve, or time series analysis), we can rewrite the model as AO + BL + CR = 0, where O is a matrix of all observed effects within a family (including discrete and/or continuous phenotypes, either single or multivariate, fixed effects, covariates, etc.), L is a matrix including all latent family effects (including trait loci, polygenes, cultural familial effects, household effects, etc.), R is a matrix of residuals, A, B, and C are parameter matrices defining the structural equations, and 0 is a matrix of zeros.
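As a toy illustration of this structural form (a sketch of the linear algebra only, not of SEGPATH itself; the variables and path coefficients below are invented), one can propagate the covariance of the exogenous variables through the system to obtain the model-implied covariance of the endogenous ones:

```python
import numpy as np

def implied_covariance(D_yx, D_yy, cov_x):
    """Implied covariance of the endogenous variables Y in the linear
    structural system Y = D_yx X + D_yy Y. Without feedback loops,
    I - D_yy is invertible, so Y = (I - D_yy)^{-1} D_yx X."""
    k = D_yy.shape[0]
    M = np.linalg.solve(np.eye(k) - D_yy, D_yx)   # reduced-form coefficients
    return M @ cov_x @ M.T

# Toy system: X1, X2 exogenous (uncorrelated, unit variance);
# Y1 = 0.5 X1, and Y2 = 0.3 X2 + 0.4 Y1
D_yx = np.array([[0.5, 0.0],
                 [0.0, 0.3]])
D_yy = np.array([[0.0, 0.0],
                 [0.4, 0.0]])
cov_y = implied_covariance(D_yx, D_yy, np.eye(2))
print(cov_y)
```

Here Var(Y1) = 0.25, Var(Y2) = 0.13, and Cov(Y1, Y2) = 0.10, which is exactly what path tracing on the corresponding diagram gives.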
In the context of a traditional genetic model predicting univariate complex trait phenotypes from their constituent genetic and nongenetic causes, we would typically rewrite this matrix equation in the more familiar form P = g + m + f + r to indicate that the phenotypes (P) are decomposed into a linear combination of a segregation component (g) [including any number of major genes, with whatever effects they may have on the phenotype(s)]; a multifactorial path model component (m) (including any latent trait genes linked to measured anonymous markers, polygenic components, or other familial nongenetic effects); fixed-effect covariates (f) (sex, age, smoking, dummy variables for linkage disequilibrium/allelic association measured genotypes, etc.); plus a residual (r) (which may have a complex autocorrelation structure).
A. Sibship linkage model

One of the simplest cases is linkage in sibships of varying sizes without parental marker genotypes, as shown in Figure 13.1.

[Figure 13.1 diagrams the model: a trait major gene, phenotypes, fixed-effect covariates, residuals, and a pseudo-polygenic background.] Figure 13.1. Path analysis linkage model for sibships with n sibs (parents not genotyped), for i, j, k = 1, 2, . . . , n sibs: g_i, marker locus genotype for the ith sib; G_ri, pseudo-polygenic background for the ith sib; h_g, effect of the marker locus on the phenotype (h_g² = heritability at the marker locus); h_r, effect of the pseudo-polygenic familial background on the phenotype (h_r² = "residual" heritability); π̂_jk, IBD proportion at the marker locus for sibpair j, k (from Province et al., 2000, Genetic Epidemiology. Reprinted by permission of Wiley-Liss, Inc., a subsidiary of John Wiley & Sons, Inc.).

For the ith of n siblings, the phenotype P_i is postulated to arise from a linear regression f_i = Σ_j β_j X_ij on a set of measured fixed-effect covariates X_ij (e.g., sex, age, interactions, measured genotypes, etc.), the latent trait locus genotype g_i (at which we have estimates of the IBD sharing between sib pairs), a pseudo-polygenic familial latent factor G_ri, and an independent residual {P_i}. The G_ri factor represents all remaining transmissible familial components, both genetic and nongenetic, after accounting for f_i as well as the locus of interest, g_i. The regression equation is therefore

P_i = f_i + h_g(g_i) + h_r(G_ri) + r{P_i},

where for all i, E[g_i] = E[G_ri] = E[{P_i}] = 0 and E[g_i²] = E[G_ri²] = E[{P_i}²] = 1; for all i, j, E[g_i G_rj] = E[g_i {P_j}] = E[G_ri {P_j}] = 0; for i ≠ j, E[g_i g_j] = π̂_ij and E[G_ri G_rj] = ½; subject to the constraint that h_g² + h_r² + r² = 1, where r² = σ_r²/σ².
Thus, h_g² is the heritability of P_i due to the trait locus g_i, and h_r² is the remaining, "residual" heritability of P_i.
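Numerically, these constraints imply a simple form for the sibship covariance matrix: unit variances on the diagonal and, off the diagonal, the trait-locus term plus the pseudo-polygenic term. A minimal Python sketch (assuming, for illustration, a sib-sib correlation of ½ for the pseudo-polygenic factor; the IBD estimates and phenotype values below are invented):

```python
import numpy as np

def sibship_cov(pihat, hg2, hr2):
    """Model-implied covariance for one sibship: unit trait variance, and
    Cov(P_j, P_k) = hg2 * pihat[j, k] + hr2 / 2 for j != k.
    pihat: matrix of estimated IBD proportions at the marker locus.
    (Assumes a sib-sib correlation of 1/2 for the pseudo-polygenic factor.)"""
    n = pihat.shape[0]
    S = hg2 * pihat + (hr2 / 2.0) * (1 - np.eye(n))
    np.fill_diagonal(S, 1.0)
    return S

def loglik(P, pihat, hg2, hr2):
    """Gaussian log-likelihood (up to a constant) of mean-centered phenotypes."""
    S = sibship_cov(pihat, hg2, hr2)
    sign, logdet = np.linalg.slogdet(S)
    return -0.5 * (logdet + P @ np.linalg.solve(S, P))

# Invented example: 3 sibs, with IBD estimates at the marker locus
pihat = np.array([[1.0, 0.5, 0.3],
                  [0.5, 1.0, 0.6],
                  [0.3, 0.6, 1.0]])
P = np.array([0.2, -0.1, 0.4])
ll = loglik(P, pihat, hg2=0.3, hr2=0.4)
print(round(ll, 3))
```

Maximizing this log-likelihood over (h_g², h_r²), and comparing the fit against one with h_g² constrained to zero, is the essence of the variance-components/path-analysis linkage test.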
B. Multilocus linkage models

The model can be extended by adding multiple linked (or unlinked) loci, as shown in Figure 13.2. The m linked loci can be "candidate" trait loci on top of actual markers, or at arbitrary places between finely spaced markers, as long as one has estimates of IBD proportions for every relative pair at each of those
Figure 13.2. Multilocus path analysis linkage model with m (linked) loci (sibs j, k shown), for sibs j, k = 1, . . . , n and loci i, l = 1, 2, . . . , m: P_j, phenotype for the jth sib; g_ij, ith locus genotype for the jth sib; G_j, pseudo-polygenic background for the jth sib; R_j, residual of the phenotype for the jth sib; h_i, effect of the ith locus on the phenotype (h_i² = heritability); h_G, effect of the pseudo-polygenic background on the phenotype (h_G² = "residual" heritability); r, effect of the residual on the phenotype [note that r² = 1 − (h_G² + Σ_{i=1}^m h_i²)]; π̂_ijk, IBD proportion at the ith locus for sibpair j, k; θ_il, recombination fraction between loci i, l; b, phenotypic residual sibling correlation. Dashed lines represent cross-sib/cross-locus correlations, which are functions of both the IBD proportions at each locus and their recombination fraction (see text for the exact formulas) (from Province et al., 2000, Genetic Epidemiology. Reprinted by permission of Wiley-Liss, Inc., a subsidiary of John Wiley & Sons, Inc.).
13. LinkageandAssociationwith StructuralRelationships
189
locations, say π̂_ijk for the ith marker and sibpair j, k. Each locus has its own heritability, h_i², so that the total heritability of the trait is h_G² + Σ_{i=1}^m h_i². In addition to the within-locus, cross-relative, IBD-induced correlations, there will be within-person, cross-locus correlations that are functions of the genetic distance between these loci; specifically, if θ_il is the recombination fraction between loci i and l, then this correlation is (1 − 2θ_il). If the map is known, we can use fixed values for these θ_il or, if the data are sufficiently rich, we can obtain the maximum likelihood estimate of the map by simultaneously estimating these parameters with the rest of the model. Finally, these two sets of correlations will in general induce cross-relative/cross-locus correlations, for example, between the jth sib at the ith locus and the kth sib at the lth locus, which is given by:
ρ(g_ij, g_lk) = [(π̂_ijk + π̂_ljk)/2](1 − 2θ_il) + (π̂_ijk π̂_ljk − 1)(1 − 2θ_il).
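The within-person, cross-locus correlation (1 − 2θ_il) can be obtained from a known map. The sketch below assumes Haldane's map function to convert map distance into a recombination fraction (an illustrative choice; any map function could be substituted):

```python
import math

def haldane_theta(d_morgans):
    """Recombination fraction from map distance (Haldane's map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))

def cross_locus_corr(d_morgans):
    """Within-person correlation (1 - 2*theta) between the latent
    genotype factors at two loci separated by d_morgans."""
    return 1.0 - 2.0 * haldane_theta(d_morgans)

print(cross_locus_corr(0.0))    # coincident loci: correlation 1.0
print(cross_locus_corr(0.1))    # loci 10 cM apart
```

Under Haldane's function, 1 − 2θ = exp(−2d), so this cross-locus correlation decays exponentially with map distance.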
IV. UNIQUE FEATURES OF SEGPATH MODELS

Although the variance components and path analysis models are mathematically equivalent, it can be easier to develop and work with some models on the random variable scale than on the variance-covariance scale. One such example is the possibility of modeling direct causal effects from one phenotype to another in the context of a combined genetic model. Some very basic and provocative questions may be answered when such models are used. For instance, how much does obesity alone explain of the phenotypic variance of blood pressure (BP)? If we have a candidate gene or a marker linked to a latent gene for obesity, how much will it affect BP secondarily, distal to the primary obesity phenotype, as opposed to affecting BP directly (proximally)? Are these effects the same in men and women? The same in all races or populations? Do these effects change over time, with development in individuals, or is there a secular trend with birth cohort? How do smoking, diet, or exercise affect these phenotypes simultaneously? Are there interactions, and do the interactions also change over time? We can easily develop models to answer these questions by building more complex vector-valued path diagrams (e.g., Todorov et al., 1998). In the SEGPATH framework, the residual component can also be more general than simple, completely independent error terms. Residuals may be correlated across multiple longitudinal measures over time, or between relatives, which would allow for resemblance due to unmeasured familial factors, even in the absence of heritable genetic components. Finally, in this formulation, means and variances are not necessarily constrained to be equal for all relatives, which
easily allows for secular trends, intergenerational effects, sex effects, and so on in these first two moments.
V. DISCUSSION

The SEGPATH approach to variance components linkage analysis is quite general, and one that gives a consistent, easy-to-specify framework for building complexity as it is needed, from the bottom up. This is in direct contrast to the approach taken by most, in which a single, general "megagenetic model" is defined a priori, and all other genetic models considered must be submodels of it. By using path diagram notation, we can take advantage of the fundamental theorem of path analysis and avoid deriving and solving large and complex sets of cumbersome regression equations, and instead concentrate on the essential features of model specification and creation. Using this approach, we can better tailor our models to fit our data, whatever their demands, instead of forcing our data to make compromises with the available models.
Acknowledgment

This work was partly supported by National Institutes of Health grant GM28719 from the National Institute of General Medical Sciences.
References

Boehnke, M., and Lange, K. (1984). Ascertainment and goodness of fit of variance component models for pedigree data. In "Genetic Epidemiology of Coronary Heart Disease: Past, Present, and Future" (D. C. Rao, R. C. Elston, L. H. Kuller, M. Feinleib, C. Carter, and R. Havlik, eds.), pp. 173-192. Liss, New York.
Li, C. C. (1975). "Path Analysis, A Primer." Boxwood Press, Pacific Grove, CA.
Province, M. A., and Rao, D. C. (1995). A general purpose model and a computer program for combined segregation and path analysis (SEGPATH): Automatically creating computer programs from symbolic language model specifications. Genet. Epidemiol. 12, 203-221.
Province, M. A., Rice, T., Borecki, I. B., Gu, C., and Rao, D. C. (2000). A multivariate and multilocus variance components method based upon structural relationships to assess quantitative trait linkage via SEGPATH. Genet. Epidemiol.
Rao, D. C., and Province, M. A. (2000). The future of path analysis, segregation analysis, and combined models for genetic dissection of complex traits. Hum. Hered. 50, 34-42.
Rao, D. C., Wette, R. W., and Ewens, W. J. (1988). Multifactorial analysis of family data ascertained through truncation: A comparative evaluation of two methods of statistical inference. Am. J. Hum. Genet. 42, 506-515.
Todorov, A. A., Vogler, G. P., Gu, C., Province, M. A., Li, Z., Heath, A. C., and Rao, D. C. (1998). Testing causal hypotheses in multivariate linkage analysis of quantitative traits: General formulation and application to sibpair data. Genet. Epidemiol. 15, 263-278.
Wright, S. (1921). Correlation and causation. J. Agric. Res. 20, 557-585.
The Future of Genetic Case-Control Studies

Nicholas J. Schork¹,²
Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio 44109
Program for Population Genetics and Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115
The Jackson Laboratory, Bar Harbor, Maine 04609
Dani Fallin
Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio 44109
Bonnie Thiel
Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio 44109
Xiping Xu
Program for Population Genetics and Department of Environmental Health, Harvard School of Public Health, Boston, Massachusetts 02115
Ulrich Broeckel and Howard J. Jacob
Human and Molecular Genetics Center, Medical College of Wisconsin, Milwaukee, Wisconsin 53226
Daniel Cohen
Genset SA, Paris, France 75008

¹On leave sponsored by The Genset Corporation of La Jolla, California, and Genset SA of Paris, France. Correspondence may be sent to the Cleveland address.
²These authors contributed equally to this work.
I. Summary
II. Introduction
III. Stratification
IV. Assessing Statistical Significance
V. Multiple Linked Loci and Haplotype Analysis
VI. Haplotype Diversity and Allelic Heterogeneity
VII. Genetic Background and Allelic Heterogeneity
VIII. Genetic Outlier Detection
IX. Genetic Matching
X. Assessing Power and Linkage Disequilibrium Strength
XI. Assessing Admixture
XII. Pleiotropy and Physiologic Significance
XIII. Discussion
References
I. SUMMARY

The case-control study design has been a veritable workhorse in epidemiological research since its inception and acceptance as a valid and valued field of inquiry. The reasons for this owe to the simplicity of the required sampling and the (potential) ease of analysis and interpretation of results. Unfortunately, a number of problems plague the use of the case-control design in assessing relationships between genetic variation and disease susceptibility in the population at large. Many of these problems are entirely analogous to problems that inhere in applications of the case-control design in nongenetic settings. These problems include stratification, the assessment of statistical significance, heterogeneity, and the interpretation of multiple outcomes or phenotypic information. In this chapter we describe 10 problems thought to plague genetic case-control studies and offer potential solutions to each. Many of our proposed solutions require the use of multiple DNA markers to accommodate the genetic background of the individuals sampled as cases and controls. It is hoped that our discussions and proposals will spark further debate about the analysis and ultimate utility of the case-control study in genetic epidemiology research.
II. INTRODUCTION

Epidemiological studies are often undertaken to examine the relationship between a putative disease risk or susceptibility factor and either an outcome associated with the disease or the disease itself. By far the easiest and most
convenient way of approaching such studies is through the traditional "case-control" design. Basically, the case-control design involves collecting a large number of individuals with the disease or trait in question ("cases") and a large number of individuals without the disease or trait ("controls") and then assessing the significance of the difference in the frequency of individual cases' and controls' exposures to the putative risk or susceptibility factor of interest (Schlesselman, 1982). If the cases have been exposed to the susceptibility factor more frequently than the controls, one can infer that the susceptibility factor is involved in disease pathogenesis. Although extremely simple, case-control studies are known to suffer from many, oftentimes correctable, defects, such as ignorance of a possible confounding variable that upsets a true, or induces a false, relationship between the exposure and disease status, and concerns over whether any resulting association is causal or merely reflects a spurious association (Schlesselman, 1982). Because of these and other defects, the value of the case-control design has been called into question in genetic epidemiology applications examining the relationship between a genetic disease susceptibility factor, such as a mutation, genic variant, or haplotype, and a disease outcome (Lander and Schork, 1994).
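The basic comparison just described can be put in code as a 2 × 2 test of exposure (here, carrying a variant) by case-control status; the counts below are invented for illustration:

```python
def case_control_chi2(a, b, c, d):
    """Pearson chi-square (1 df) and odds ratio for a 2x2 table:
    rows = cases/controls, columns = exposed/unexposed.
    a = exposed cases, b = unexposed cases,
    c = exposed controls, d = unexposed controls."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    odds_ratio = (a * d) / (b * c)
    return chi2, odds_ratio

# Invented counts: 60 of 100 cases carry the variant vs. 40 of 100 controls
chi2, orate = case_control_chi2(60, 40, 40, 60)
print(round(chi2, 3), round(orate, 3))
```

With these counts the chi-square statistic is 8.0 (1 df, p ≈ 0.005) and the odds ratio is 2.25; the sections below concern why such a nominal p value can mislead in the genetic setting.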
However, as the number and sophistication of technical innovations in the detection, cataloging, and dissemination of genetic polymorphism information have experienced exponential growth in the last few years, it is crucial to consider more fully the utility of study designs meant to relate such polymorphism to disease (Lander, 1996; Collins et al., 1998). Consider, for example, the great interest in applications involving single-nucleotide polymorphisms and the possibility of whole-genome association studies, both of which will require reliable study designs for their appropriate and successful implementation (Risch and Merikangas, 1996; Camp, 1997; Collins et al., 1997). In this chapter we consider 10 different problems or sets of issues thought to plague case-control designs in genetic epidemiologic investigations. We first provide a short description of these problems and then consider potential ways of approaching or overcoming them. We showcase many of the proposed methods with actual data and results. The following problems are considered:
1. Stratification, or inherent genetic differences over the entire genome between cases and controls
2. The assessment of the statistical significance of an association
3. The analysis of haplotypes or multiple loci, given that phase information is often unknown and not easily discerned among unrelated cases and controls
4. Haplotypic diversity due to allelic heterogeneity among the cases and controls
5. The assessment and accommodation of differences in genetic background that create allelic heterogeneity
6. Genetic outliers, or the possibility that individuals among the cases and controls have different population origins, hence possibly different disease-inducing mutational spectra
7. Genetic matching and controlling for differences in genetic background
8. The assessment of study power and linkage disequilibrium strength
9. The determination and utility of knowledge of genetic admixture among cases and controls
10. The assessment of pleiotropy and physiologic significance
We in no way claim that the problems, issues, and ideas put forward in this chapter are exhaustive or present ideal solutions for genetic case-control studies. Rather, we feel that they provide a reasonable point of departure for further discussion. We close the chapter with a brief discussion and a consideration of some areas for further research.
III. STRATIFICATION

One of the most vexing problems with genetic case-control studies, from both a historical and a practical point of view, concerns controlling for the effects of genetic stratification or population subdivision between the cases and controls (Spielman et al., 1993; Lander and Schork, 1994; Ewens and Spielman, 1995). Stratification arises when cases and controls are sampled, oftentimes unknowingly, from different populations. (As an obvious example, consider cases sampled from Africa and controls sampled from the Aleutian Islands.) This can occur when, for example, a disease is unique to (or more frequent in) one population rather than another, so that case ascertainment will almost by necessity involve sampling from the first population. If explicit attempts are not made to ascertain controls from that same population, then inherent genetic differences between the cases and controls may exist at many loci throughout the genome because of an inherent "genetic distance" between the populations (Cavalli-Sforza et al., 1994). This could lead to the erroneous inference that a particular genetic variant is more frequent in the cases than in the controls because of its role in disease pathogenesis rather than because of the inherent genetic differences between the case and control populations (Lander and Schork, 1994). Thus, if only a single polymorphism is tested, it might prove difficult to know whether any observed difference in its frequency between the cases and controls reflects the causal impact of the polymorphism on the pathogenesis of the disease or a simple overall population-level genetic difference between the cases and controls. However, if many polymorphisms are tested across the genome, and only one or a few are the candidate polymorphisms thought to be associated with disease, then one could empirically determine the evidence for stratification. Thus, if frequency differences are observed between cases and controls for a number of random or completely anonymous and biologically inert polymorphisms (e.g., potentially, those within intergenic regions or those resulting in synonymous amino acid sequence changes), then one could infer that the populations have inherent genetic differences consistent with stratification. Tests of "genetic distance" between the case and control groups could be constructed to empirically assess the existence of stratification in precisely this manner (Nei, 1978). In addition, by accommodating the genetic distance between the case and control groups when the significance of a statistical test result performed with the candidate polymorphism is assessed, one could conceivably control for stratification and thereby perform an unbiased test, as discussed by Pritchard and Rosenberg (1999) and in the next section. Table 14.1 describes the results of an inquiry into stratification in a case-control study examining the relationship between a candidate gene and renal failure. A more complete description of this study and its results is forthcoming (Broeckel et al., in preparation). Basically, five different groups (from two renal failure case-control studies) were collected, with an average of 72 individuals in each sample. Each subject was typed on markers in the candidate gene as well as 44 additional microsatellite markers (2 per autosome). Nei's genetic distance measure was computed for each pair of study populations. It is clear from Table 14.1 that some possible case and control groups show an inherent genetic distance or stratification (e.g., the cases in study 1 and the controls in study 2) that, if ignored, could have led to erroneous inferences about the difference in frequency of an allele at a locus in the candidate gene between renal failure and control subjects.

Table 14.1.
Genetic Distances and P Values for Comparisons between Hypertensives with Renal Failure (Cases), Hypertensives without Renal Failure (Controls), and Normotensive Groups from Two Study Samples

                          Study 1              Study 2
                      Cases    Controls    Cases    Controls   Normotensives
Study 1  Cases          1      0.00059    0.00007   0.0016      -0.0005
         Controls     0.307       1       0.001     0.002        0.002
Study 2  Cases        0.218     0.049       1       0.0004       0.001
         Controls     0.039     0.039     0.267       1          0.148
Normotensives         0.742     0.029     0.059     0.148          1

Key: Numbers above the diagonal are genetic distances using the standard Fst measure (Nei, 1978). Numbers below the diagonal are p values associated with the hypothesis Fst = 0. Bold typeface indicates significance at the 5% level.
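The Fst-style distance underlying Table 14.1 can be illustrated with a toy single-locus computation (a simplified classroom version of the idea, not Nei's 1978 unbiased estimator; the allele frequencies below are invented):

```python
def simple_fst(p1, p2):
    """Crude Fst for one biallelic locus from allele frequencies in two
    groups of equal size: (H_T - H_S) / H_T, where H = 2p(1 - p) is the
    expected heterozygosity. (A simplified illustration only, not
    Nei's 1978 unbiased estimator used in Table 14.1.)"""
    h1 = 2 * p1 * (1 - p1)
    h2 = 2 * p2 * (1 - p2)
    hs = (h1 + h2) / 2                      # mean within-group heterozygosity
    pbar = (p1 + p2) / 2
    ht = 2 * pbar * (1 - pbar)              # total (pooled) heterozygosity
    return (ht - hs) / ht if ht > 0 else 0.0

print(simple_fst(0.30, 0.30))               # identical frequencies: 0.0
print(round(simple_fst(0.30, 0.40), 4))     # modest frequency difference
```

In practice one would average such per-locus quantities over many markers (e.g., the 44 microsatellites here) and attach a p value, for instance by permuting case-control labels, which is the spirit of the Fst = 0 tests below the diagonal of Table 14.1.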
IV. ASSESSING STATISTICAL SIGNIFICANCE

The problem of assessing the statistical significance of a finding in population-based genetics initiatives has been both extremely difficult to overcome and highly contentious (Lander and Schork, 1994; Lander and Kruglyak, 1995; Curtis, 1996; Witte et al., 1996; Morton, 1998). Although theory and simulation-based procedures have been developed for assessing the statistical significance of linkage analysis results (Lander and Schork, 1994), it is not clear how to proceed with association studies. The development of relevant theory for association studies is impeded considerably because the correlation between test statistics computed with polymorphisms at neighboring loci induced by linkage disequilibrium (LD) varies from population to population (Tishkoff et al., 1996; Laan and Paabo, 1997) and from genomic region to genomic region (Jorde et al., 1994). This is unlike linkage analyses, where, by virtue of the ubiquity of Mendel's laws, correlations between linkage test statistics evaluated at neighboring loci can be computed analytically (Feingold et al., 1993; Lander and Schork, 1994; Lander and Kruglyak, 1995; Morton, 1998). The reasons for this LD variation are numerous and complex and implicate phenomena such as genomic site-specific mutation rates, gene conversion rates, the age of a population, the size of a population, the immigration rates of a population, and the physical distance between loci (Schork and Fallin, 2000). Without knowing the correlation between case-control association-based test statistics across multiple polymorphic loci, it is hard to make "adjustments" to the criteria for declaring statistical significance that will preserve a desired false positive rate for testing all the polymorphisms.
Instead of relying on theory, however, one can estimate the probability distribution of relevant test statistics under the null hypothesis of no association empirically by merely cataloging test statistics on a number of anonymous and/or inert polymorphisms (denoted the "test set"). One can then assess the significance of a candidate polymorphism by comparing the test statistic it produces against the empirical distribution provided by the test set. Although one or a few of the chosen polymorphisms used to estimate the null distribution of the test statistic may actually be associated with the trait (possibly even to a greater degree than the hypothesized candidate gene polymorphism(s)), this would result in a more conservative test of the candidate polymorphisms, which is preferable to liberal tests that could lead to extensive follow-up studies of false positives. In addition, the distribution of test statistics produced by a set of inert polymorphisms will capture purely noise-induced fluctuations obtainable with such test statistics better than any theory or simulation, since they would automatically build in appropriate linkage disequilibrium effects between the polymorphisms, population founder effects (if any), and other factors associated with the actual sample population. Note that this procedure would also provide a good control for stratification effects: if a case and control population did exhibit allele frequency differences at anonymous sites that are consistent with stratification, then the test
14. Genetic Case-Control Studies
statistics associated with the candidate polymorphisms would have to surpass this "basal" polymorphism frequency difference to be considered significant. Obviously, however, power would be lost if stratification existed and was taken into account in this way. It should be emphasized, however, that better methods for controlling for stratification can be devised (N. Risch, personal communication). Figure 14.1 shows the distribution of p values and standardized chi-square test statistics computed from the 44 microsatellite loci used in the renal
Figure 14.1. Histograms displaying (a) the frequency of p values and (b) standardized chi-square statistics calculated for each of the 44 microsatellite markers used in the renal failure study. Standardized chi-square values were computed as (X2 - df)/(2 × df)^(1/2).
failure study discussed in Section III. One locus in a candidate gene region produced a standardized chi-square statistic of 2.26 and a p value of 0.026, which, judging from Figure 14.1, is not a strong enough association to be considered anything other than one produced by chance or stratification (if any).
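The empirical strategy just described, and the standardization used in Figure 14.1, are easy to sketch in code. The following is a minimal Python illustration (the function names are ours, and the observed value and marker statistics would come from the actual study):

```python
import math

def standardized_chisq(stat, df):
    """Standardize a chi-square statistic as in Figure 14.1: (X2 - df)/(2*df)^(1/2)."""
    return (stat - df) / math.sqrt(2 * df)

def empirical_p(observed, null_stats):
    """Compare a candidate-locus statistic against the 'test set' of statistics
    computed at anonymous/inert markers (larger statistic = stronger signal)."""
    exceed = sum(1 for s in null_stats if s >= observed)
    return (exceed + 1) / (len(null_stats) + 1)   # +1 correction keeps p > 0
```

For the renal failure example, `null_stats` would hold the standardized statistics from the 44 microsatellites, and the candidate locus value of 2.26 would be referred to that distribution rather than to a theoretical one.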
V. MULTIPLE LINKED LOCI AND HAPLOTYPE ANALYSIS The proper assessment of the association of an allele at a single locus with a trait or disease can require enormous sample sizes if the polymorphism is rare, has only a moderate effect on the trait or disease, is only one of two (or a few) alleles at a locus, or is merely in weak or moderate linkage disequilibrium with a true functional polymorphism (Schaid and Rowland, 1998). Overcoming the first two of these problems is not easy and may require special ascertainment schemes and phenotypic "narrowing" strategies (Lander and Schork, 1994). The second two problems can be overcome, to some degree, by studying haplotypes, or the association of multiple polymorphisms within and around an unknown polymorphism of significance to the disease or trait of interest. Multilocus haplotypes can essentially be seen as the "signature patterns" of allelic variation on a chromosome harboring a disease polymorphism. As a result, haplotype analysis may identify a functional polymorphism in a more compelling way than an analysis of any single polymorphism merely linked to it (Hästbacka et al., 1992; Puffenberger et al., 1994). Unfortunately, since humans are a diploid species, the construction of haplotypes often requires either family data to sort out which polymorphisms have been transmitted together on a single chromosome (Weeks et al., 1995) or expensive and time-consuming molecular genetic assay systems (see, e.g., Clark et al., 1998, for an application). However, it is possible to estimate haplotype frequencies for a particular population from genotypic data using statistical algorithms such as the expectation-maximization (E-M) algorithm (Excoffier and Slatkin, 1995; Hawley and Kidd, 1995; Long et al., 1995).
The reliability of such haplotype frequency estimation is quite good for multiple biallelic loci, even when the loci exhibit departures from Hardy-Weinberg equilibrium (which is often assumed in such estimation schemes) (Fallin and Schork, 2000). It is thus entirely possible to construct tests of the equality of haplotype frequencies between cases and controls by using estimated haplotype frequencies (Fallin et al., 2000). Figure 14.2 presents the output from a program for such testing using case-control data from the study described previously in which the "first" study case and control groups were used (see Figure 14.1). Two-locus haplotype frequency estimates were obtained for two markers in the
[Figure 14.2: example program output. For each of the four two-locus haplotypes (tt, tc, ct, cc) it lists estimated case, control, and overall frequencies; a 2 × 2 chi-square statistic with its asymptotic p value; the P-excess measure; an odds ratio; and permutation-test summaries (Ave., SD, Max, Number >, and a permutation p value over 100 permutations). The omnibus LR test statistic was 5.88 (not significant).]

Figure 14.2. Analysis results from a study investigating estimated haplotype frequency differences between renal failure and non-renal failure patients. The first section shows the estimated frequencies for the program restart achieving the greatest likelihood, along with two measures of association: "P-excess" corresponds to the lambda statistic described by Devlin and Risch (1995), and "Apx. p-v" is the asymptotic p value for a simple 2 × 2 table chi-square statistic comparing a particular haplotype to all others combined between cases and controls. The bottom section describes the results of permutation tests (100 permutations) investigating the significance of the estimated haplotype frequency differences between the cases and controls: "Ave.," "SD," and "Max" all correspond to test statistics computed over the 100 permutations; "Number >" gives the number of test statistics from these permutations that surpassed the observed test statistic. The "Omnibus LR Test" gives the results of a test comparing overall haplotype frequency profiles between the cases and controls.
candidate gene. Allele labels at these loci were assigned arbitrarily. Since the distribution of test statistics computed from estimated haplotype frequencies is unknown, significance levels for tests can be assessed by using the empirical strategy described above or through the use of simple randomization tests (Good, 1994). In addition, since the E-M algorithm is a numerical likelihood maximization procedure, it has the potential to converge to a local maximum. Thus, restarting the algorithm with different initial values to test convergence is appropriate. Figure 14.2 suggests that no significant haplotype frequency differences exist, either individually or as a composite or profile across all possible haplotypes [i.e., the "omnibus likelihood ratio (LR) test" comparing the overall frequency differences was not significant via randomization tests].
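A minimal sketch of the E-M idea for two biallelic loci follows (Python; the 0/1/2 genotype coding and function name are illustrative, not the cited programs' actual interfaces). Only double heterozygotes are phase-ambiguous; the E-step splits them between the two possible phase configurations in proportion to the current frequency estimates, and random starting values support the restarting strategy mentioned above:

```python
import random

HAPS = ["AB", "Ab", "aB", "ab"]   # two biallelic loci, alleles A/a and B/b

def em_haplotype_freqs(genotypes, n_iter=100, seed=0):
    """EM estimate of two-locus haplotype frequencies from unphased genotypes.
    Each genotype is (g1, g2): counts (0/1/2) of allele 'A' at locus 1 and
    allele 'B' at locus 2."""
    rng = random.Random(seed)                     # vary `seed` to test restarts
    f = [rng.random() + 0.1 for _ in range(4)]
    total = sum(f)
    f = [x / total for x in f]                    # frequencies of HAPS
    n_chrom = 2.0 * len(genotypes)
    for _ in range(n_iter):
        c = [0.0] * 4
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: phase unknown; split between AB/ab and Ab/aB
                cis, trans = f[0] * f[3], f[1] * f[2]
                w = cis / (cis + trans) if cis + trans > 0 else 0.5
                c[0] += w; c[3] += w              # AB / ab
                c[1] += 1 - w; c[2] += 1 - w      # Ab / aB
            else:
                # phase is determined: pair off the alleles directly
                a1 = ["A"] * g1 + ["a"] * (2 - g1)
                a2 = ["B"] * g2 + ["b"] * (2 - g2)
                for x, y in zip(a1, a2):
                    c[HAPS.index(x + y)] += 1
        f = [x / n_chrom for x in c]              # M-step
    return dict(zip(HAPS, f))
```

Estimated frequencies produced this way can then be fed into case-control comparisons whose significance is assessed by permutation, as in Figure 14.2.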
VI. HAPLOTYPE DIVERSITY AND ALLELIC HETEROGENEITY If a disease has more than one independent genetic determinant (i.e., if an individual can express a disease or trait because he or she possesses any one of a number of mutations or defective genes), then it is possible that genetic subgroups exist among the cases, each subgroup being defined by the common predisposing gene the individuals within it possess. This subgrouping can result from "locus heterogeneity," in which these different mutations are located at different sites around the genome, or "allelic heterogeneity," in which multiple mutations at the same locus are segregating in different populations or in the population at large. Ways of assessing and accommodating locus heterogeneity are dealt with in the next section. Allelic heterogeneity can create situations in which multiple different alleles are associated with the disease, rather than a single, very specific allele. If an analysis of haplotypes, like the analyses discussed in the preceding section, is undertaken, then one might expect to see multiple haplotypes showing greater frequency among the cases than the controls, since each of these haplotypes may represent the signature pattern of alleles surrounding a locus harboring a disease allele. It may also be the case that allelic heterogeneity exists and that each mutation, although occurring in the same gene region, has been transmitted on an individual chromosome, with its own unique allelic pattern or haplotype (Terwilliger and Weiss, 1998). A related phenomenon is that a specific, possibly unilineally derived, disease mutation at a locus in question happens to occur on several different haplotypes as a result of intragenic recombination, mutation, gene conversion, or de novo origination. This is not allelic heterogeneity, but rather "haplotype" heterogeneity due to a loss of linkage disequilibrium in the surrounding area. At the level of analysis
discussed here, these two scenarios are indistinguishable and can be dealt with similarly. To address them, one can assess the equality of haplotype frequency profiles using, for example, the omnibus testing approach described in the preceding section, rather than the equality of single haplotype frequencies, under the assumption that more than one haplotype may be of greater frequency in the cases than the controls because it harbors a disease locus. Thus, a single haplotype may not be more pronounced in frequency among cases than controls; rather, some number of haplotypes may show slight to marginal increases in frequency which, when taken as a whole, suggest that significant differences in haplotype frequencies exist between cases and controls.
VII. GENETIC BACKGROUND AND ALLELIC HETEROGENEITY A form of heterogeneity related to that discussed in Section VI concerns the origin of independent disease-predisposing alleles and mutations. Consider that the cases for a study have been unknowingly ascertained from different populations in which the different genes or mutations originated or are segregating (i.e., a situation reflecting locus heterogeneity). This heterogeneity may further manifest itself in differences in the genetic background of the individuals with the different disease genes, inasmuch as these backgrounds may reflect the different populations from which they were ultimately drawn. One can assume further that the genetic backgrounds of the individuals with the different mutations are distinct and reflect their populations of origin. With multiple genetic markers, one can attempt to identify, or test for, the existence of clusters of individuals with similar genetic backgrounds among the cases (or controls) using appropriate analysis methods. If evidence for such clusters exists, then one may consider accommodating the potential locus and allelic heterogeneity produced by this "cryptic stratification" among the cases (or controls) when testing the significance of the association between candidate gene polymorphisms and a disease or trait. Figure 14.3 depicts the dendrogram associated with the results of the construction of a "tree of individuals" (Mountain and Cavalli-Sforza, 1997) using the 44 microsatellite markers gathered on the renal failure study subjects. Although there are a number of statistical and modeling issues in need of addressing, Figure 14.3 suggests that the cases and controls fall into three broad clusters. Table 14.2 describes the association between these clusters and the frequency of an allele at a marker locus. Since there is evidence for
Table 14.2. Association Analysis Results Measuring the Relationship between Cluster Identification and Possession of a Marker Allele(a)

                  Cluster 1    Cluster 2    Cluster 3    Total
Allele present    42 (35.6)    36 (45.9)    33 (29.5)    111
Allele absent     10 (16.4)    31 (21.1)    10 (13.5)     51
Totals            52           67           43           162

(a) A chi-square test with 2 degrees of freedom for assessing the relationship between cluster and allele yielded a statistic of 11.759 with a p value of 0.003.
(b) Numbers in parentheses correspond to the expected number of individuals in each cell.
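As a quick check, the chi-square statistic reported in Table 14.2 can be reproduced from the cell counts and margins (a minimal Python sketch; note that the closed-form p value used below is exact only for 2 degrees of freedom):

```python
import math

def chi2_independence(table):
    """Pearson chi-square test of independence for an r x c count table.
    Returns (statistic, degrees of freedom)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))
    return stat, (len(rows) - 1) * (len(cols) - 1)

# Allele present/absent by genetic-background cluster (Table 14.2)
stat, df = chi2_independence([[42, 36, 33], [10, 31, 10]])
p = math.exp(-stat / 2)   # chi-square survival function, exact for df = 2
```

Running this gives a statistic of about 11.76 with p of about 0.003, matching the footnote of Table 14.2.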
an association, the clustering could be considered a possible confounding factor in the assumed relationship between the marker allele and renal failure (at least in the sample used). This was assessed further in an analysis that accounted for the clusters via logistic regression (Table 14.3). The results of this analysis suggested that the allelic effect on renal failure was independent of the cryptic stratification in the sample (i.e., the p value testing the coefficient quantifying the effect of the allele was significant even after the cluster effects had been accommodated).
Table 14.3. Logistic Regression Analysis Results with Allelic Effects and Genetic Cluster Identifiers as Predictors for the Renal Failure Study

Variable(a)    Coefficient (SE)    Chi-square(b)    P value    Standardized estimate(c)    Odds ratio(c)
Intercept      -0.7282 (0.333)     4.766            0.029      --                          --
Cluster 2       0.1044 (0.423)     0.058            0.809      0.027                       1.110
Cluster 3       0.3377 (0.412)     0.673            0.412      0.092                       1.402
Allele          0.6959 (0.301)     5.339            0.021      0.216                       2.006

(a) The predictor variables assessed in the regression, where "allele" refers to possession of the marker allele.
(b) Chi-square and p values are associated with tests of whether the coefficient differs from 0.0.
(c) Statistics measuring the impact of the predictor variables on renal failure susceptibility.
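A logistic regression of this form is straightforward to fit by Newton-Raphson. The sketch below (Python with NumPy; the data are hypothetical, not the renal failure sample, and cluster dummies would simply enter as additional design-matrix columns) produces the coefficient, Wald chi-square, and odds ratio quantities of the kind reported in Table 14.3:

```python
import numpy as np

def logistic_fit(X, y, n_iter=30):
    """Logistic regression by Newton-Raphson (IRLS); returns (beta, SEs)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        H = X.T @ (X * (p * (1 - p))[:, None])     # observed information
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    cov = np.linalg.inv(X.T @ (X * (p * (1 - p))[:, None]))
    return beta, np.sqrt(np.diag(cov))

# Hypothetical data: 10 allele carriers (8 cases) and 10 noncarriers (3 cases)
allele = np.array([1] * 10 + [0] * 10)
status = np.array([1] * 8 + [0] * 2 + [1] * 3 + [0] * 7)
X = np.column_stack([np.ones(20), allele])
beta, se = logistic_fit(X, status)
wald = (beta / se) ** 2            # Wald chi-square, 1 df per coefficient
odds_ratio = np.exp(beta[1])
```

For this saturated 2 × 2 example the fitted allele coefficient equals the log odds ratio, and its standard error matches the familiar (1/a + 1/b + 1/c + 1/d)^(1/2) formula.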
VIII. GENETIC OUTLIER DETECTION Outliers are individuals among a larger set who appear to have characteristics unlike the others in the set. Such individuals are often removed from case-control statistical analyses involving the set because their presence could unduly influence the outcome of a test on that set, either by creating a heterogeneity, which could confound the detection of an association of interest, or by adversely affecting the properties of the chosen statistical test. With respect to genetic case-control studies, one could examine and test for the existence of genetic outliers by assessing the similarity of the genetic background of each individual with the rest. The motivation for these tests would be that individuals with a radically different genetic background may possess susceptibility (or protective) polymorphisms for a disease that are different from those possessed by the others, and hence could contribute to a genetic heterogeneity within the sample. With multiple polymorphisms available, testing for genetic outliers among a set of cases and controls could be achieved in a manner analogous to assessing the probability that an individual recently immigrated to a particular community (Rannala and Mountain, 1997). Criteria for assessing the appropriateness of removing individuals with outlying similarity scores need to be developed, although the problem can be likened to determining "leverage" statistics for possible outliers in regression analysis contexts (Neter et al., 1985).
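One crude way to operationalize this idea (a sketch only; the similarity measure and threshold are illustrative, not the likelihood-based approach of Rannala and Mountain, 1997) is to flag individuals whose average multilocus allele sharing with the rest of the sample is low:

```python
def allele_sharing(g1, g2):
    """Crude similarity: fraction of alleles shared across loci, with
    genotypes coded as counts (0/1/2) of a reference allele per locus."""
    return sum(2 - abs(a - b) for a, b in zip(g1, g2)) / (2.0 * len(g1))

def genetic_outliers(genos, threshold=0.5):
    """Flag individuals whose mean similarity to everyone else falls
    below a (study-specific) threshold."""
    out = []
    for i, gi in enumerate(genos):
        sims = [allele_sharing(gi, gj) for j, gj in enumerate(genos) if j != i]
        if sum(sims) / len(sims) < threshold:
            out.append(i)
    return out
```

In practice the threshold would be calibrated against the observed distribution of similarity scores, in the spirit of the leverage statistics mentioned above.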
IX. GENETIC MATCHING If multiple polymorphisms influence a disease or trait of interest, then, when testing the association of a particular polymorphism with that trait, it might be useful to match the cases and controls on the basis of the other polymorphisms. Such matching is done routinely for cofactors thought to influence an outcome in nongenetic case-control studies (Schlesselman, 1982). In the absence of known disease- or trait-influencing polymorphisms to use in a matching strategy, one could attempt to match on overall genetic background (as is the motivation in the assessment of stratification, genetic heterogeneity, and outlier assessment discussed earlier). Since one will not likely know an individual's genetic constitution prior to the initiation of a study, this may require the selection of individuals with previously collected genotype information (such as might be found in a large database) or the use of statistical analysis tools that accommodate multiple locus effects in case-control studies, such as logistic regression and log-linear modeling (Schlesselman, 1982).
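As an illustration of matching on genetic background (hypothetical and greedy; an optimal scheme would treat this as an assignment problem), one could pair each case with the most genetically similar as-yet-unused control:

```python
def allele_sharing(g1, g2):
    """Fraction of alleles shared across loci (genotypes coded 0/1/2)."""
    return sum(2 - abs(a - b) for a, b in zip(g1, g2)) / (2.0 * len(g1))

def match_controls(cases, controls):
    """Greedily pair each case with the unmatched control whose multilocus
    genotype is most similar (a stand-in for background matching)."""
    unused = list(range(len(controls)))
    pairs = []
    for i, case in enumerate(cases):
        best = max(unused, key=lambda j: allele_sharing(case, controls[j]))
        unused.remove(best)
        pairs.append((i, best))
    return pairs
```

With a large genotyped database of potential controls, this kind of matching would stand in for the cofactor matching routinely done in nongenetic case-control studies.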
X. EMPIRICALLY ASSESSING POWER Theoretical and simulation-based power studies are pursued routinely by statisticians to gauge the yield of a particular case-control design and sample size (Schlesselman, 1982). These studies can provide tremendous insight and direction for study design, but it is also of interest to assess the power of a particular data set that has already been collected. This kind of post hoc power assessment could in fact be empirically driven in the following way. If one had genotype information at a number of loci on a series of individuals, one could merely create hypothetical data sets by assuming that one of the polymorphisms studied (i.e., a randomly chosen one) was a disease-predisposing polymorphism. Further assumptions about the penetrance of the polymorphism could lead to the assignment of hypothetical case and control status to the individuals in the study. One could then test the association between polymorphisms that neighbor the one in question and the hypothetically assigned case-control status. By repeating this procedure a number of times, and recording the results as a function of characteristics such as the frequency of the chosen polymorphism and the assumed association strength, one could gauge how powerful a study might be on a candidate polymorphism whose frequency, position, and effect were similar to the hypothetical disease genes created in the process. This strategy is particularly appealing because, as with the empirical assessment of the statistical significance of a candidate polymorphism analysis result discussed earlier, it would preserve linkage disequilibrium relationships and capture properties of actual data rather than relying on purely hypothetical or theoretical constructions.
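The resampling scheme just described can be sketched as follows (Python; the penetrance values, critical value, and single-marker test are illustrative assumptions, and a real application would test markers neighboring the hypothetical disease locus rather than the locus itself):

```python
import random

def chi2_2x2(a, b, c, d):
    """Pearson chi-square for a 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

def empirical_power(genos, penetrance, crit=3.84, n_reps=200, seed=1):
    """Treat one typed marker as a hypothetical disease locus: assign
    case/control status from `penetrance` (P(disease | genotype 0/1/2)),
    test carrier status against status, and report the rejection rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        status = [1 if rng.random() < penetrance[g] else 0 for g in genos]
        carrier = [1 if g > 0 else 0 for g in genos]
        a = sum(1 for s, c in zip(status, carrier) if s and c)
        b = sum(1 for s, c in zip(status, carrier) if s and not c)
        c_ = sum(1 for s, c in zip(status, carrier) if not s and c)
        d = sum(1 for s, c in zip(status, carrier) if not s and not c)
        if chi2_2x2(a, b, c_, d) > crit:   # 3.84 ~ chi-square(1), 5% level
            hits += 1
    return hits / n_reps
```

Recording rejection rates across many randomly chosen "disease" markers and penetrance settings gives the power profile described in the text.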
XI. ASSESSING ADMIXTURE Two populations may show different disease frequencies as a result of differences in the existence or frequency of disease-predisposing mutations. When individuals from these populations mate, they produce offspring that are analogous to intercross hybrids in a model organism cross. Further mating of these offspring with individuals from either "parental" population will produce offspring that have more genetic similarity with one of the parental populations (as in repeated backcross studies of model organisms: Beebe et al., 1997). Knowledge of this admixture among cases and controls, coupled with the assessment of markers that can distinguish the two parental populations, can be used to identify the disease-predisposing polymorphisms that were unique or more frequent in one of the parental populations. This procedure has been referred to as "admixture mapping" and is enjoying considerable interest (Stephens et al., 1994;
McKeigue, 1998). Fundamental to the use of admixture mapping, however, is the knowledge that the disease or trait of interest exhibits frequency differences across the two populations because of genetic differences rather than, say, any exposure of the individuals within the populations to different environmental stimuli. This can be tested by examining the association between the degree of admixture possessed by an individual and the frequency, severity, or other manifestations of the disease. The degree of admixture of an individual can be computed either by merely tallying how often an individual has a polymorphism associated with one of the parental populations or, in the absence of such markers, as a measure of genetic similarity between that individual and the parental populations (such as those discussed in the context of genetic outlier detection).
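The tallying version of this check might look as follows (a sketch; the 0/1/2 coding and the correlation with a numeric severity score are illustrative assumptions):

```python
def admixture_proportion(genotype):
    """Simple tally: the fraction of an individual's alleles that are the
    variant characteristic of one parental population, with genotypes coded
    as per-locus counts (0/1/2) of that population's allele."""
    return sum(genotype) / (2.0 * len(genotype))

def admixture_disease_assoc(genos, severity):
    """Pearson correlation between per-individual admixture and disease
    severity, the preliminary check the text describes."""
    adm = [admixture_proportion(g) for g in genos]
    n = len(adm)
    ma, ms = sum(adm) / n, sum(severity) / n
    cov = sum((a - ma) * (s - ms) for a, s in zip(adm, severity))
    va = sum((a - ma) ** 2 for a in adm)
    vs = sum((s - ms) ** 2 for s in severity)
    return cov / (va * vs) ** 0.5
```

A strong admixture-severity association supports a genetic (rather than purely environmental) basis for the between-population frequency difference, which is the precondition for admixture mapping.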
XII. PLEIOTROPY AND PHYSIOLOGIC SIGNIFICANCE Assessing the impact of a particular genetic variant or allele on intermediate metabolism, biochemical networks or pathways, and general physiology has typically fallen under the research heading of "functional genomics." Experimental designs used in functional genomic initiatives are often highly specialized and include model organism-based knockout, transgene, homology, and gene expression profiling studies (Lander, 1996; DeRisi et al., 1997; Fields, 1997). However, by collecting multiple phenotypes on a number of individuals, one could assess the impact of a polymorphic locus on the "network" of relationships underlying those phenotypes (if any) and thereby draw inferences about the role of the gene (and its variants) in normal and pathological physiology. Although there are an enormous number of ways one could achieve this, we outline one such method briefly. Consider forming the correlation matrix of the phenotypes under study within each of the case and control groups. Testing the equality of these two matrices (and each individual corresponding element of these matrices) can reveal relationships among the variables that are different in the diseased vs. nondiseased individuals, possibly implying a dysregulation of relevant physiologic pathways or networks consistent with the diseased state. Graphical aids such as dendrograms can be used to make such an analysis more palatable and intuitive (Arkin and Ross, 1995; Arkin et al., 1997). To include polymorphism effects, consider for the moment a biallelic locus with a dominant allele. The correlation matrices could be computed within each genotypic category (i.e., susceptible/nonsusceptible) within each of the case and control groups. By testing the equality of the resulting four matrices and their elements, one could draw inferences about relationships between variables consistent with genotypic effects, disease state effects, and genotype-disease state interaction effects.
For example, by limiting contrasts to individuals with the susceptible genotype but across the case and control groups, one could infer dysregulatory phenomena
attributable to the susceptibility locus as exacerbated by (or at least consistent with) the disease process. Alternatively, if differences between the relationships of the variables were found only between cases with and without the susceptibility genotype, one could assume that those differences arise as a result of dysregulatory phenomena attributable to the disease state as exacerbated by the presence of the susceptibility genotype. Figure 14.4 depicts dendrograms computed from correlation matrices on a number of phenotypic variables gathered on normotensive subjects with
[Figure 14.4, panels (a) and (b): dendrograms over phenotypic variables including BMI, LVEDD, LVMIPENN, AGE, LN LMAX, and LN INTL; scale bar 0.1.]
elevated left ventricular mass (LVM). The dendrograms are computed based on patterns of correlation strength across groups of individuals defined by whether they possess 0, 1, or 2 copies of a candidate gene allele. Statistical issues aside, Figure 14.4 suggests that differences in the relationships of the variables exist across the genotype categories consistent with the notion that the allele has a broad, pleiotropic sphere of influence on physiology. A more complete description of these results is forthcoming (Broeckel et al., in preparation).
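The matrix-comparison step underlying such an analysis can be sketched with elementwise tests (Python; a hedged illustration in which correlations are compared two groups at a time via Fisher's z transform, rather than the full four-matrix comparison described in the text):

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def fisher_z_test(r1, n1, r2, n2):
    """Approximate z statistic for H0: two correlations are equal,
    via Fisher's variance-stabilizing transform (requires n > 3)."""
    z1 = 0.5 * math.log((1 + r1) / (1 - r1))
    z2 = 0.5 * math.log((1 + r2) / (1 - r2))
    return (z1 - z2) / math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
```

Applying `fisher_z_test` to each corresponding pair of phenotype-phenotype correlations across genotype or disease-status groups flags the elements whose relationships appear to be modified, which the dendrograms then summarize graphically.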
Figure 14.4. Dendrograms computed from correlation strengths between variables assessed in LVM individuals having 2, 1, or 0 copies of a putative susceptibility haplotype. The varying branch lengths and the order of the variables across the three genetic categories suggest modification of the relationships of the variables as a function of genotype.
The choice of polymorphisms used in any of the foregoing analyses may require some thought and care. For example, in detecting genetic clusters, outliers, or
admixture parameters, one may want to have markers that maximally discriminate between population subgroups (Shriver et al., 1997; McKeigue, 1998). Such markers can be chosen prior to the study in question based on studies investigating the utility of markers for such a purpose. In a like manner, markers used to estimate the null distribution of test statistics should ideally be biologically inert. Although it may be hard to know whether a polymorphism is truly inert a priori, there are guidelines one could adopt (e.g., intergenic SNPs, SNPs at third-codon positions or that otherwise do not result in an amino acid substitution, etc.). In addition, the polymorphic markers used to estimate a null test statistic distribution should match, to the degree that this is possible, the candidate polymorphisms to be tested (e.g., in general, inert SNPs should not be used to estimate the null distribution of tests conducted on multiallelic microsatellite markers). Further theoretical work on the development of appropriate statistical models and hypothesis testing procedures is also needed and encouraged. There is likely to be continuing debate on the utility of collections of polymorphisms that can be used in association studies. There is also likely to be further debate about how to design association studies. Such debate is extremely healthy because it will likely result in a more efficient utilization of relevant resources. Oftentimes, however, the choice of a particular study design or sampling strategy for an association study is not necessarily within the control of an investigator. Consider a retrospective study of participants in a clinical trial for pharmacogenetic analyses (Drazen et al., 1999). In this situation, the sample size and the nature of the phenotyping (e.g., responders to the compound tested considered as "cases" and nonresponders as "controls") are fixed.
The tools described here could help identify and accommodate problems that may inevitably arise in situations like this. However, irrespective of situations in which one may be forced to adopt a case-control design, it is likely that, given the methodologies described, and the availability of high-throughput genotyping and polymorphic markers, the case-control design will become an increasingly valuable and preferred tool rather than merely the default for many genetic epidemiology investigations.
Acknowledgments Aspects of this work were supported, in part, by U.S. National Institutes of Health grants HL94011 (NJS), HL54998-01 (NJS), and RR03655-11 (Robert Elston), and by generous support from the Genset Corporation. The authors thank Richard Cooper of Loyola University for supplying DNA for the renal failure study discussed in this chapter.
References Arkin, A., and Ross, J. (1995). Statistical construction of chemical reaction mechanisms from measured time series. J. Phys. Chem. 99, 970-999.
Arkin, A., Shen, P., et al. (1997). A test case of correlation metric construction of a reaction pathway from measurements. Science 277, 1275-1279. Beebe, A. M., Mauze, S., et al. (1997). Serial backcross mapping of multiple loci associated with resistance to Leishmania major in mice. Immunity 6, 551-557. Camp, N. J. (1997). Genomewide transmission/disequilibrium testing: Consideration of the genotype relative risks at disease loci. Am. J. Hum. Genet. 61, 1424-1430. Cavalli-Sforza, L. L., Menozzi, P., et al. (1994). "The History and Geography of Human Genes." Princeton University Press, Princeton, NJ. Clark, A. G., Weiss, K. M., et al. (1998). Haplotype structure and population-genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Hum. Genet. 63, 595-612. Collins, F. S., Guyer, M. S., et al. (1997). Variations on a theme: Cataloging human DNA sequence variation. Science 278, 1580-1581. Collins, F. S., Patrinos, A., et al. (1998). New goals for the U.S. Human Genome Project: 1998-2003. Science 282, 682-689. Curtis, D. (1996). Genetic dissection of complex traits (letter). Nat. Genet. 12, 356-358. DeRisi, J. L., Iyer, V. R., et al. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680-686. Devlin, B., and Risch, N. (1995). A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29, 311-322. Drazen, J. M., Yandava, C. N., et al. (1999). Pharmacogenetic association between ALOX5 promoter genotype and the response to anti-asthma treatment. Nat. Genet. 22, 168-170. Ewens, W. J., and Spielman, R. S. (1995). The transmission/disequilibrium test: History, subdivision, and admixture. Am. J. Hum. Genet. 57, 455-464. Excoffier, L., and Slatkin, M. (1995). Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12, 921-927. Fallin, D., and Schork, N. J. (2000). The accuracy of haplotype frequency estimation involving biallelic markers and genotypic data. Submitted. Fallin, D., Cohen, D., et al. (2000). The power of testing estimated haplotype frequency differences between cases and controls. (In preparation.) Feingold, E., Brown, P. O., et al. (1993). Gaussian models for genetic linkage analysis using complete high-resolution maps of identity-by-descent. Am. J. Hum. Genet. 53, 234-251. Fields, S. (1997). The future is function. Nat. Genet. 15, 325-327. Good, P. (1994). "Permutation Tests." Springer-Verlag, New York. Hästbacka, J., de la Chapelle, A., et al. (1992). Linkage disequilibrium mapping in isolated founder populations: Diastrophic dysplasia in Finland. Nat. Genet. 2, 204-211. Hawley, M. E., and Kidd, K. K. (1995). HAPLO: A program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J. Hered. 86, 409-411. Jorde, L. B., Watkins, W. S., et al. (1994). Linkage disequilibrium predicts physical distance in the adenomatous polyposis coli region. Am. J. Hum. Genet. 54, 884-898. Laan, M., and Paabo, S. (1997). Demographic history and linkage disequilibrium in human populations. Nat. Genet. 17, 435-438. Lander, E. S. (1996). The new genomics: Global views of biology. Science 274, 536-539. Lander, E., and Kruglyak, L. (1995). Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat. Genet. 11, 241-247. Lander, E. S., and Schork, N. J. (1994). Genetic dissection of complex traits. Science 265, 2037-2048. Long, J. C., Williams, R. C., et al. (1995). An E-M algorithm and testing strategy for multiple-locus haplotypes. Am. J. Hum. Genet. 56, 799-810. McKeigue, P. M. (1998). Mapping genes that underlie ethnic differences in disease risk: Methods for detecting linkage in admixed populations, by conditioning on parental admixture. Am. J. Hum. Genet. 63, 241-251.
Morton, N. E. (1998). Significance levels in complex inheritance. Am. J. Hum. Genet. 62, 690-697. Mountain, J. L., and Cavalli-Sforza, L. L. (1997). Multilocus genotypes, a tree of individuals, and human evolutionary history. Am. J. Hum. Genet. 61, 701-715. Nei, M. (1978). Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics 89, 583-590. Neter, J., Wasserman, W., et al. (1985). "Applied Linear Statistical Models." Irwin, Homewood, IL. Pritchard, J. K., and Rosenberg, N. A. (1999). Use of unlinked markers to detect population stratification in association studies. Am. J. Hum. Genet. 65, 220-228. Puffenberger, E. G., Kauffman, E. R., et al. (1994). Identity-by-descent and association mapping of a recessive gene for Hirschsprung disease on human chromosome 13q22. Hum. Mol. Genet. 3, 1217-1225. Rannala, B., and Mountain, J. L. (1997). Detecting immigration by using multilocus genotypes. Proc. Natl. Acad. Sci. USA 94, 9197-9201. Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273, 1516-1517. Schaid, D. J., and Rowland, C. (1998). Use of parents, sibs, and unrelated controls for detection of associations between genetic markers and disease. Am. J. Hum. Genet. 63, 1492-1506. Schlesselman, J. J. (1982). "Case-Control Studies." Oxford University Press, New York. Schork, N. J., and Fallin, D. (2000). Whither association? (In preparation.) Shriver, M. D., Smith, M. W., et al. (1997). Ethnic-affiliation estimation by use of population-specific DNA markers. Am. J. Hum. Genet. 60, 957-964. Spielman, R. S., McGinnis, R. E., et al. (1993). Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52, 506-516. Stephens, J., Briscoe, D., et al. (1994). Mapping by admixture linkage disequilibrium in human populations: Limits and guidelines. Am. J. Hum. Genet. 55, 809-824. Terwilliger, J. T., and Weiss, K. M. (1998). Linkage disequilibrium mapping of complex disease: Fantasy or reality? Curr. Opin. Biotechnol. 9, 578-594. Tishkoff, S. A., Dietzsch, E., et al. (1996). Global patterns of linkage disequilibrium at the CD4 locus and modern human origins. Science 271, 1380-1387. Weeks, D. E., Sobel, E., et al. (1995). Computer programs for multilocus haplotyping of general pedigrees. Am. J. Hum. Genet. 56, 1506-1507. Witte, J. S., Elston, R. C., et al. (1996). Genetic dissection of complex traits (letter). Nat. Genet. 12, 355-358.
15. Cost of Linkage versus Association Methods

Christopher I. Amos¹
Departments of Epidemiology and Biomathematics
The University of Texas M. D. Anderson Cancer Center
Houston, Texas 77030
Grier Page
Departments of Biostatistics and Epidemiology, Medicine, and Hollings Cancer Center
Medical University of South Carolina
Charleston, South Carolina 29425
I. Summary
II. Introduction
III. Methods
IV. Results
V. Discussion
References
¹To whom correspondence should be addressed.

Advances in Genetics, Vol. 42
Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.
0065-2660/01 $35.00

I. SUMMARY

Identifying genetic factors that influence disease risk is a major goal in genetic epidemiology. In the past, for diseases with relatively simple etiologies, genetic linkage methods have been highly effective for this purpose. However, as we begin to study more complex diseases and disorders for which each specific genetic factor may play a minor role in causation, the relative value of genetic linkage methods, which do not require linkage disequilibrium, versus association-based methods, which do require linkage disequilibrium, must be evaluated. Here,
we compare the cost-effectiveness of linkage and association methods for identifying a quantitative trait locus that explains 10% of interindividual variability. We find that the choice of analytical scheme depends upon the degree of disequilibrium in the population. Because this parameter has not been adequately assessed, planning association studies is currently difficult.
II. INTRODUCTION

Genetic linkage methods have proven to be highly effective for identifying genetic factors that influence highly penetrant diseases such as Alzheimer's disease. However, the efficacy of usual genetic linkage methods for identifying genetic factors for diseases or disorders with complex etiologies has not been as well established. Usual methods of genetic linkage analysis assume no disequilibrium between a susceptibility locus and the disease or trait under study. This assumption is at least approximately valid when a sparse genetic map is used for gene identification. When a dense map is used for gene identification, the linkage equilibrium assumption may lead to some loss of power for linkage methods. However, methods for jointly estimating linkage and linkage disequilibrium are not yet well established. Recently, emphasis in genetic epidemiology has focused on the use of linkage disequilibrium to help in identifying genetic factors (Risch and Merikangas, 1996). Newer association-based methods such as the transmission disequilibrium test (TDT) (Spielman et al., 1993) use family-based sampling designs to construct conditional tests that identify evidence for linkage while allowing for potential population stratification. Association tests for quantitative traits have only recently been developed. Allison (1997) developed five family-based tests for association. Of these, the first four are conditional tests and therefore would not be influenced by population admixture. Previously, Boerwinkle and colleagues (Boerwinkle et al., 1986) had developed the measured genotype approach. This approach essentially consists of an analysis of variance in which the genotypes form the classification factor. Measured genotype approaches are influenced by population admixture (Page and Amos, 1999). However, the power of this unconditional approach for nonadmixed populations is greater than the power of any of the conditional tests developed by Allison (1997).
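The measured genotype approach described above is, at its core, a one-way ANOVA with genotype as the classification factor. A minimal sketch in Python; the trait values are invented for illustration and `measured_genotype_F` is a hypothetical helper, not code from the chapter:

```python
# The measured genotype approach (Boerwinkle et al., 1986) as a one-way
# ANOVA in which marker genotypes define the groups.

def measured_genotype_F(groups):
    """One-way ANOVA F statistic; `groups` maps genotype -> list of trait values."""
    all_vals = [y for ys in groups.values() for y in ys]
    n, k = len(all_vals), len(groups)
    grand = sum(all_vals) / n
    means = {g: sum(ys) / len(ys) for g, ys in groups.items()}
    ss_between = sum(len(ys) * (means[g] - grand) ** 2 for g, ys in groups.items())
    ss_within = sum((y - means[g]) ** 2 for g, ys in groups.items() for y in ys)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

trait_by_genotype = {
    "AA": [1.1, 0.9, 1.3, 1.0],
    "Aa": [0.4, 0.6, 0.5, 0.3],
    "aa": [-0.2, 0.1, -0.1, 0.0],
}
F = measured_genotype_F(trait_by_genotype)  # large F => genotype means differ
```

Because the genotype classes are compared directly, any mean differences among population strata inflate F as well, which is exactly the admixture sensitivity noted in the text.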
In general, unconditional tests such as the measured genotype approach are more powerful than conditional approaches because the unconditional tests use all the trait information in the family, while the conditional tests use only trait information from the offspring. In addition, for qualitative traits and diseases, conditional tests such as the TDT require sampling a case and both parents per unit of study in order to characterize alleles as being associated with case status. Unconditional studies for discrete traits can use a single case and control and so require only two individuals per unit of study. Thus, unconditional case-control tests require a smaller sample size than the TDT. Moreover, as developed by Morton and Collins (1998), the efficiency of case-control studies can, in principle, be further enhanced by careful selection of controls. When one is considering quantitative traits, the genotypes are usually used to define groups of individuals, so that case and control groups are usually not specified. Unconditional tests detect association from any source, including population stratification, and have therefore been criticized for the study of heterogeneous U.S. populations. In principle, unconditional methods can be made more robust to potential population stratification by using empirical rather than asymptotic critical values to assign significance. Page and Amos (1999) compared the power of unconditional tests with an empirical critical value to that of the conditional tests for quantitative traits developed by Allison (1997). They found that the power of the unconditional test depended upon the amount of population stratification; however, power was greater for the unconditional tests than for the conditional tests at levels of stratification that would be expected in North American populations. Stephens et al. (1994) and Dean et al. (1994) studied linkage detection in association studies of admixed North American populations. Their results showed that for the major admixed populations in North America, minimal excess disequilibrium due to admixture is expected. Although association tests have been proposed for detecting genetic factors, these methods have rarely been successfully applied for genome-wide screening. In part, the lack of application reflects the relatively sparse nature of current marker maps. Existing marker sets provide a resolution of a little under 10 cM between anonymous markers.
Simulation studies for nonadmixed populations (Kruglyak, 1999) showed that a marker spacing as dense as 3 kb may be required to identify loci with old mutations if anonymous markers are used. Thus as many as 500,000 markers may be required for a genome-wide association scan. To reflect technologies that may soon exist, we have restricted our current studies of association methods to maps consisting of 100,000 markers. Although more markers are required to detect genetic factors by means of an association-based method, fewer subjects may be required. Therefore, the actual cost required to identify a genetic factor is a complex function of the competing costs of phenotyping, data collection, and genotyping the subjects. Here, we compare the costs of variance components linkage analysis and association methods for identifying a genetic locus for a trait with 10% of the interindividual variability attributable to the specific locus being mapped.
III. METHODS

Sample sizes to detect linkage with 90% power were calculated according to the methods described by Page et al. (1998) for the variance components method
and by Page and Amos (1999) for methods based on linkage disequilibrium. To simulate the data for the disequilibrium study, we assumed that a mutation occurred upon a founding haplotype. The founder haplotype consisted of a single trait allele, A, with frequency p, and the alleles at the marker locus had a frequency of 1/n, where n is the number of alleles at the locus. We allowed this founder haplotype to decay as a function of generations, t, and recombination fraction, θ. We assumed that the haplotypes including the A trait allele were in equilibrium. The expected proportion of alleles at a specific locus that will still be in disequilibrium with a particular allele is given by E[(1 − θ)^t] (Li, 1976). By varying t and θ, any amount of disequilibrium, D′, could be specified. The simulation of the data for the variance components model follows the descriptions given by Amos (1994) and Page et al. (1998). We constructed a cost per study according to the model:

total cost = N_p C_p + N_g C_g N_f N_m.

Here, N_p is the number of units to be sampled, C_p is the cost to sample and phenotype all members of a single unit, N_g is the number of families to be genotyped, C_g is the cost to genotype a single marker in a single individual, N_f is the number of individuals per family, and N_m is the number of markers typed per individual. Each method requires a different number of individuals to be collected for the analysis: the variance components method that we used required four people (two parents and two children), TDT analysis requires three (two parents and one child), and the other disequilibrium methods required only one individual. For the association tests we used diallelic markers to take advantage of their low mutation rates relative to microsatellites. For the variance components method we used a fully informative marker, similar to the information from a microsatellite (heterozygosity = 1.0).
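The cost model is easy to evaluate once the per-unit family size is fixed for each design. The sketch below uses the chapter's cost model and the stated people-per-unit figures, but the unit counts are hypothetical placeholders (the actual sample sizes come from the power calculations of Page et al., 1998, and Page and Amos, 1999), and for simplicity it sets the number of genotyped units equal to the number of phenotyped units:

```python
# Cost model from the text: total cost = Np*Cp + Ng*Cg*Nf*Nm, where
#   Np = units phenotyped, Cp = phenotyping cost per unit,
#   Ng = units genotyped,  Cg = cost per marker per person,
#   Nf = people per unit,  Nm = markers typed per person.
# This sketch assumes Ng = Np; unit counts below are invented placeholders.

def study_cost(n_units, cost_pheno, cost_geno, people_per_unit, n_markers):
    return n_units * cost_pheno + n_units * cost_geno * people_per_unit * n_markers

designs = {
    # name: (people per unit, markers typed, hypothetical units for 90% power)
    "variance components": (4, 400, 1000),
    "TDT":                 (3, 100_000, 400),
    "measured allele":     (1, 100_000, 300),
}
for name, (nf, nm, n_units) in designs.items():
    total = study_cost(n_units, 2000, 0.75, nf, nm)
    print(f"{name}: ${total:,.0f}")
```

Even with far fewer subjects, the association designs are dominated by the genotyping term once 100,000 markers are typed, which is why the phenotyping-heavy variance components design can be cheaper at these costs.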
The association tests described here are those that were most powerful among those we studied (Page and Amos, 1999). Generally, the most powerful of the unconditional tests was the measured allele test, a variant of the measured genotype method of Boerwinkle et al. (1986) that uses alleles rather than genotypes to categorize individuals. Selecting individuals for study because they had extreme values was more powerful than using the nonselected sample, and here we studied the measured allele test, sampling from the lowest and highest 30% of the distribution. The most powerful conditional test was Allison's test 1, which compares the mean levels of individuals who received one allele from a heterozygous parent to the mean levels of individuals who inherited the alternate allele from a heterozygous parent. Again, selecting individuals for having extreme phenotypes typically led to more powerful tests, and here we present results from studying individuals with values in the upper and lower 30% of their population trait distribution. We used the disequilibrium measure D′ in this study. Let p_i and q_j be the frequencies of the ith and jth alleles at loci p and q, respectively, and pq_ij be the proportion of haplotypes having the ith and jth alleles at loci p and q. Then D′ is given by δ_ij/X, where

δ_ij = pq_ij − p_i q_j.

If δ_ij is positive, X is min[p_i(1 − q_j), (1 − p_i)q_j], and if δ_ij is negative, X is min[p_i q_j, (1 − p_i)(1 − q_j)].
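As a check on these definitions, D′ and the decay factor E[(1 − θ)^t] can be computed directly. A small sketch using the standard Lewontin normalization; the example frequencies are invented:

```python
# Direct computation of the quantities defined in the text: Lewontin's D'
# for one allele pair, and the expected decay factor E[(1 - theta)^t]
# used to tune the simulated disequilibrium.

def d_prime(p_i, q_j, pq_ij):
    """p_i, q_j: marginal allele frequencies; pq_ij: i-j haplotype frequency."""
    delta = pq_ij - p_i * q_j
    if delta >= 0:
        x = min(p_i * (1.0 - q_j), (1.0 - p_i) * q_j)
    else:
        x = min(p_i * q_j, (1.0 - p_i) * (1.0 - q_j))
    return delta / x

def decay(theta, t):
    """Expected fraction of founder haplotypes still intact after t generations."""
    return (1.0 - theta) ** t

print(d_prime(0.1, 0.5, 0.10))  # complete association: 1.0
print(d_prime(0.1, 0.5, 0.05))  # linkage equilibrium: 0.0
print(decay(0.01, 20))          # disequilibrium remaining after 20 generations
```

Varying `theta` and `t` traces out the range of D′ values examined in the Results section.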
IV. RESULTS

In Figure 15.1 we present results of comparing the cost to identify a genetic factor by using anonymous markers for a trait that costs $2000 per unit to phenotype, with a cost for genotyping of $0.75 per marker. For the variance components analysis we assumed a completely informative marker, which is approximately correct for analysis of microsatellites, which are used for most linkage genome scans. We assumed that 400 markers were used for this analysis. For the association analyses, we assumed that 100,000 biallelic markers were studied. For the variance components analysis, we calculated costs assuming a target lod score of 3.0, while for the association tests, the significance level was set at 0.0001 for each marker (which may be less conservative than the significance required for the variance components test, since more tests are being performed for the association analysis). The appropriate significance level to control type 1 error rates while still being able to identify genetic factor(s) in an association study has not yet been adequately determined. Although a large number of tests may be performed for an association study, because of linkage disequilibria among the markers, the tests are unlikely to be independent. Here we are using
[Figure 15.1: line plot of total study cost (y-axis, on the order of 10⁶ dollars) against D′ from 0.9 down to 0.3 (x-axis) for each test.]

Figure 15.1. Cost for detecting a quantitative trait locus: 90% power, 0.0001 significance, 10% linked genetic variance, $0.75 genotyping cost, and $2000 phenotyping cost per unit. Key: VC, variance components; A1, Allison test 1; A3, Allison test 3; TMA, truncated measured allele test; MA, measured allele test.
approximately the same pointwise significance for the tests to compare association and variance components tests, since the appropriate significance levels for association studies will depend upon the specific population being studied and the joint disequilibria among all of the markers. Risch and Merikangas (1996) conservatively required a significance level of 1 × 10⁻⁸ in their studies, which would considerably raise the sample size requirements for the association methods we report. The association tests that are compared include the measured allele test and the truncated measured allele test for the upper and lower 30% of the population; Allison test 1, which compares the means of individuals according to the allele transmitted to them from their heterozygous parent; and Allison test 3, which modifies test 1 to include only individuals in the upper and lower 30% of the distribution. Results of this analysis show that all association tests are cheaper to conduct than variance components analysis, provided the disequilibrium is greater than 0.5. For traits having a higher proportion of variance from a specific genetic factor, the variance components tests had higher power than association tests over most of the range of D′. Variance components tests are more severely influenced by the cost of phenotyping individuals, since they typically require many more samples to be collected than the association methods (Page and Amos, 1999).
V. DISCUSSION

Here we provide some limited results comparing the cost of performing studies by means of variance components versus association-based methods. The combined linkage and linkage disequilibrium (LLD) method for performing both linkage and family-based association analysis has been devised for quantitative traits (Xiong et al., 1998; Fulker et al., 1999). In the absence of linkage disequilibrium, the LLD method reduces to a usual linkage test, except that additional parameters must be estimated for the association component. In the presence of linkage, the total genetic variance due to linked factors remains constant, but the variance is partitioned among parameters reflecting linkage and association. We can anticipate that the cost of completing an analysis via this method would lie between the cost of the measured allele tests and that of the tests provided by Allison (1997). The number of samples required to detect linkage with variance components methods is sharply influenced by the proportion of interindividual variability associated with a trait locus. The single-locus heritability of 10% that we have studied here is at the low end of effects detectable through variance components analysis. For traits with higher single-locus heritabilities, variance components methods are more cost-effective than linkage disequilibrium methods over a wider range of D′, while the opposite is true for traits with lower single-locus heritabilities.
The critical issue in deciding which method to use for identifying a genetic factor that explains a small amount of interindividual variability is the value of D′ that might be anticipated in a genome-wide study. Unfortunately, there is very little information available about genomic levels of disequilibrium in humans as a function of genetic or physical distance. Efforts to plan association studies are thus critically hampered by a lack of empirical data from which to design the studies. One other salient feature of Figure 15.1 is the relatively high cost of the studies that would be required to identify new genetic factors. These high costs indicate the need for collaborative research in which multiple groups provide families and resources for combined analyses. The preferred method for combining data is to have several groups use the same tools for data collection and processing. In the absence of shared tools across studies, the further development of meta-analytical tools is required (Guerra et al., 1999; Goldstein et al., 1999; Gu et al., 1999). In this brief analysis we have not presented any results from candidate gene studies. If there are specific candidates that merit testing, the costs of an association study can be greatly reduced. The cost to perform a variance components study is not as dramatically influenced by the availability of candidate genes, since the main costs in a variance components study come from data collection and phenotyping rather than from genotyping. Although the unconditional association methods may be influenced by population admixture, one can anticipate that when an entire genome scan is completed, an empirical p value might be constructed by comparing significant results against the distribution of results obtained across the entire genome scan.
Pritchard and Rosenberg (1999) studied methods to allow for population admixture in an association study by using either unlinked microsatellite or biallelic markers (i.e., single-nucleotide polymorphisms). They found that 20 unlinked microsatellite markers are sufficient to restore the empirical p value for association tests to a nominal 5% level, with more biallelic markers being required for this purpose. Bayesian methods have been developed as well to assist in the interpretation of genome-wide association tests (Devlin and Roeder, 1999). These methods should be robust to population admixture. A major advantage of unconditional association methods over conditional methods is the ability to jointly estimate effects from both genetic and environmental risk factors. Although the use of hypercontrols (Morton and Collins, 1998) may provide an efficient method to identify genetic factors, use of these controls may violate principles of comparability of cases and controls (Wacholder et al., 1992) if effects from both environmental and genetic factors are jointly studied. Therefore, as with any case-control study, the choice of controls in a genetic association study must be critically assessed. For quantitative traits, the data are not typically divided into case and control groups, but sampling individuals with extreme phenotypes will generally lead to less costly studies. However, because one usually studies many phenotypes jointly in the
analysis of quantitative data, selection on the basis of any particular phenotype may not lead to overall gains in efficiency if several different phenotypes are to be studied.
Acknowledgments This research was partially supported by grants ES-09912 and GM-52607.
References

Allison, D. B. (1997). Transmission-disequilibrium tests for quantitative traits. Am. J. Hum. Genet. 60, 676-690.
Amos, C. I. (1994). Robust variance-components approach for assessing genetic linkage in pedigrees. Am. J. Hum. Genet. 54, 535-543.
Boerwinkle, E., Chakraborty, R., and Sing, C. F. (1986). The use of measured genotype information in the analysis of quantitative phenotypes in man. I. Models and analytical methods. Ann. Hum. Genet. 50, 181-194.
Dean, M., Stephens, J. C., Winkler, C., Lomb, D. A., Ramsburg, M., Boaze, R., Stewart, C., Charbonneau, L., Goldman, D., Albaugh, B. J., Goedert, J. J., Beasley, P., Hwang, L.-Y., Buchbinder, S., Weedon, M., Johnson, P. A., Eichelberger, M., and O'Brien, S. J. (1994). Polymorphic admixture typing in human ethnic populations. Am. J. Hum. Genet. 55, 788-808.
Devlin, B., and Roeder, K. (1999). Genomic control for association studies. Biometrics 55, 997-1004.
Fulker, D. W., Cherny, S. S., Sham, P. C., and Hewitt, J. K. (1999). Combined linkage and association sib-pair analysis for quantitative traits. Am. J. Hum. Genet. 64, 259-267.
Goldstein, D. R., Sain, S. R., Guerra, R., and Etzel, C. J. (1999). Meta-analysis by combining parameter estimates: Simulated linkage studies. Genet. Epidemiol. 17 (suppl. 1), S581-S586.
Gu, C., Province, M., and Rao, D. C. (1999). Meta-analysis of genetic linkage to quantitative trait loci with study-specific covariates: A mixed-effects model. Genet. Epidemiol. 17 (suppl. 1), S599-S604.
Guerra, R., Etzel, C. J., Goldstein, D. R., and Sain, S. R. (1999). Meta-analysis by combining p-values: Simulated linkage studies. Genet. Epidemiol. 17 (suppl. 1), S593-S598.
Kruglyak, L. (1999). Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22, 139-144.
Li, C. C. (1976). "First Course in Population Genetics." Boxwood Press, Pacific Grove, CA.
Morton, N. E., and Collins, A. (1998). Tests and estimates of allelic association in complex inheritance. Proc. Natl. Acad. Sci. USA 95, 11389-11393.
Page, G. P., and Amos, C. I. (1999). Comparison of linkage-disequilibrium methods for localization of genes influencing quantitative traits in humans. Am. J. Hum. Genet. 64, 1194-1205.
Page, G. P., Amos, C. I., and Boerwinkle, E. (1998). The quantitative LOD score: Test statistic and sample size for exclusion and linkage of quantitative traits in human sibships. Am. J. Hum. Genet. 62, 962-968.
Pritchard, J. K., and Rosenberg, N. A. (1999). Use of unlinked genetic markers to detect population stratification in association studies. Am. J. Hum. Genet. 65, 220-228.
Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273, 1516-1517.
Spielman, R. S., McGinnis, R. E., and Ewens, W. J. (1993). Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus. Am. J. Hum. Genet. 52, 506-516.
Stephens, J. C., Briscoe, D., and O'Brien, S. J. (1994). Mapping by admixture linkage disequilibrium in human populations: Limits and guidelines. Am. J. Hum. Genet. 55, 809-824.
Wacholder, S., McLaughlin, J. K., Silverman, D. T., and Mandel, J. S. (1992). Selection of controls in case-control studies. I. Principles. Am. J. Epidemiol. 135, 1019-1028.
Xiong, M. M., Krushkal, J., and Boerwinkle, E. (1998). TDT statistics for mapping quantitative trait loci. Ann. Hum. Genet. 62, 431-452.
16. Genotype-Environment Interaction in Transmission Disequilibrium Tests

Lindon J. Eaves¹
Department of Human Genetics
Virginia Institute for Psychiatric and Behavioral Genetics
Virginia Commonwealth University
Richmond, Virginia 23298
Patrick Sullivan
Department of Psychiatry
Virginia Institute for Psychiatric and Behavioral Genetics
Virginia Commonwealth University
Richmond, Virginia 23298
I. Summary
II. Introduction
III. Model for Genetic and Environmental Risk
IV. Generalizing the Approach to Allow for G × E and Other Interactions
V. Evaluation of the Method through Simulation
VI. Conclusions and Discussion
References
¹To whom correspondence should be addressed.

Advances in Genetics, Vol. 42
Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.
0065-2660/01 $35.00
I. SUMMARY

Transmission disequilibrium tests (TDTs) provide an approach to the detection of associations between alleles at marker loci and risk of complex disorders. The logistic regression approach to TDTs proposed by Sham and Curtis (1995) is generalized to provide separate tests of the main effects of marker loci on genetic risk and of genotype-environment interaction (G × E) arising because multiple alleles differ in their sensitivity to specified environmental covariates. A modification of the same model may be used to detect the effects of genomic imprinting on the expression of susceptibility loci. In the presence of G × E, highly significant genetic effects may be present that will not produce marked twin or sibling resemblance and will not yield significant associations in conventional TDTs. However, simulation studies show how the logistic regression model can be used to detect the main effects of marker alleles and their interaction with covariates on continuous outcomes in offspring-parent trios, pairs of siblings and their parents, and monozygotic twin pairs and their parents. TDTs with MZ twin pairs permit the detection of alleles whose primary effects on the phenotype are mediated through the control of sensitivity to latent features of the within-family environment. It is shown that although the genotype-environment correlation caused by the environmental effects of parental alleles on offspring phenotypes can produce spurious marker-phenotype association in population studies, the outcome of TDTs is not biased thereby.
II. INTRODUCTION

Although genetic linkage studies provide a powerful approach for detecting the effects of individual genes on complex traits, it is commonly accepted that association studies may provide greater power and precision for localizing specific genes. The simplicity of association studies is especially appealing for testing the impact of candidate loci on complex phenotypes, since these are equivalent to completely correlated susceptibility and marker alleles. Population studies and case-control association studies can be seriously misleading (see, e.g., Sham, 1998, for a summary of the issues). Factors that create genetic association within and among sibships ("real" linkage, pleiotropy, and linkage disequilibrium) are confounded with a variety of spurious factors (e.g., population stratification, assortative mating) that create additional genetic associations among but not within families. A variety of strategies have been developed to provide tests of association between marker alleles and outcomes that reflect only the segregation of genetic effects within families, thus eliminating the spurious effects confounded
with genetic differences among families. Among these approaches, the most elegant appear to be those that examine the pattern of parental alleles transmitted and not transmitted to offspring in different phenotypic classes (e.g., Rubinstein et al., 1981; Falk and Rubinstein, 1987; Terwilliger and Ott, 1992). Spielman et al. (1993) proposed the transmission disequilibrium test (TDT), which relies on the distortion from 0.5 in the probabilities of marker allele transmission from heterozygous parents to affected offspring. Several approaches have been suggested for the analysis of TDT data (Schaid and Sommer, 1993; Knapp et al., 1995). Of particular interest is the approach of Sham and Curtis (1995), who proposed a simple and flexible method for TDT analysis of samples involving loci with multiple alleles based on logistic regression. Although they developed their own computer program, ETDT, for their approach, the authors indicate that it is easily implemented in standard software for statistical analysis. Waldman et al. have formulated a logistic regression model for the TDT involving the two-allele case and continuous outcome measures. They note that this approach allows for the inclusion of covariates and of interaction between a candidate locus and moderator variables such as indices of environmental exposure. The principal thrust of association studies has been the detection of the main effects of alleles at marker loci on risk of disease outcomes. This is an understandable initial focus in the process of gene discovery. However, genetic risk for some complex disorders may reflect the effects of alleles whose influences are expressed in some environments but not others.
Such genotype-environment interactions (G × E) have long been recognized as components of the genetic architecture of complex traits (see, e.g., Mather and Jinks, 1983), but their detection in humans has been restricted by our ability to characterize candidate loci and the salient environments (Martin et al., 1987). The example of interaction between allelic variants at the TGFA locus and maternal smoking in risk for cleft palate (Christensen et al., 1999) suggests that such interactions are only awaiting investigators with the energy to look carefully. Several other genetic mechanisms have formal consequences that are similar to those of G × E. These mechanisms include age-dependent gene expression (e.g., Eaves et al., 1986), in which age functions as the "environment" that interacts with genotype; genomic imprinting (e.g., Hall, 1990), in which the salient "environment" is the sex of the transmitting parent; and certain genetic mechanisms for comorbidity in which the "environment" is one or more associated clinical outcomes. Such interactions between covariates and the effects of alleles at candidate loci can be specified easily in the basic logistic regression model proposed by Sham and Curtis for TDT analysis, thus opening the way to screening loci and environments for a variety of G × E interactions. Like the original model proposed by these authors, the extended approach can be implemented in standard statistical software such as SAS, to capitalize on the flexibility of
proprietary software for the management of complex data sets involving many risk factors and outcomes.
Ill. MODELFOR GENETICAND ENVIRONMENTAL RISK Denote the phenotypic status (affected or not affected) of the ith offspring by Ai. Ai = 1 if the offspring is affected, 0 otherwise. Let - ~0 < Yi < 00 be the (continuous) liability of the ith individual to a given disorder. Yi is a function of genetic and environmental effects. The probability that the it-h individual is affected is assumed to be (16.1) The liability Yi is a function of the offspring’s maternally and paternally derived alleles (mi and pi, respectively) and the offspring’s environment. The latter is assumed to be due to two independent factors: environmental effects that can be measured (ei) and a residual random environmental term, ri, that is unmeasured. Thus we write Yi = Ul, + U2i + (bli + bzi)ei + ri,
(16.2)
where ali is the effect of the maternal allele, mi, and Uzi that of the paternally derived allele pi. The “environmental sensitivity” parameters bii and bzi are the regressions of liability on the measured environment resulting from the maternal and paternal alleles, respectively. If the a’s are the same for all alleles, then there will be no overall genetic effect on risk. Genetic effects are created by interallelic variation among the a’s The model assumes no dominance; that is, there are no interactions between alleles derived from mother and father. In the absence of imprinting, the a’s will be the same in maternally and paternally derived alleles. The effects of imprinting are represented by setting ali azi for some i. If the b’s are all zero, the measured environment has no effect on risk. If the b’s are constant across alleles, then there is a main effect of the measured environment but no genotype-environment interacd tion (G X E). If the b’s differ across alleles, then the alleles generate differ. ences in sensitivity to the environment, and there will be additional variation due to G X E (cf. references from Mather and Jinks, 1983). This is essentially the same model for G X E proposed for segregation analysis (Eaves, 1984). The model represented by Equation (16.2) assumes that the paternal and maternal genotypes have no environmental effects on the outcome (i.e., no
16. G x E Interaction in Transmission Disequilibrium Tests
227
genotype-environment correlation). Such effects may be added to the model by extending this relation to include an environmental term that regresses on the alleles in the maternal and paternal genotype (regardless of whether they are transmitted).
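To make the liability model concrete, the following Python sketch states Equations (16.1) and (16.2) directly. It is illustrative only: the allelic values are hypothetical, and it assumes, as in the chapter's simulations, that Equation (16.1) is the logistic function of liability.

```python
import math

def liability(a1, a2, b1, b2, e, r):
    """Offspring liability, Eq. (16.2): Y = a1 + a2 + (b1 + b2)*e + r."""
    return a1 + a2 + (b1 + b2) * e + r

def p_affected(y):
    """Probability of affection as a logistic function of liability, Eq. (16.1)."""
    return 1.0 / (1.0 + math.exp(-y))

# Hypothetical values: a carrier of two "sensitive" alleles (a = -3, b = 3)
# is at low risk in a neutral measured environment (e = 0) but at 50% risk
# when e = 1; G x E enters as allele-specific slopes on e.
print(round(p_affected(liability(-3, -3, 3, 3, e=0.0, r=0.0)), 4))  # 0.0025
print(p_affected(liability(-3, -3, 3, 3, e=1.0, r=0.0)))            # 0.5
```

Note that with equal b's the environment shifts every genotype by the same amount (a main effect only); interallelic variation in the b's is what generates G X E.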
A. Model for logistic regression of TDT data

Let i and j, i ≠ j, denote the two alleles in a parent at a putative susceptibility locus. For the alleles in a given offspring derived from this parent, we define a response variable, Sij = 1 or 0, such that i represents the transmitted allele and j the nontransmitted allele. For any allele pair, i ≠ j, we let Sij = 1 if i < j and Sij = 0 if i > j. The logistic regression approach of Sham and Curtis amounts to modeling the response variable S over both maternally and paternally derived alleles, i ≠ j, in a sample of affected offspring. Corresponding to Sij, we define the variable Xij such that

P(Sij = 1) = 1/[1 + exp(-Xij)].

Sham and Curtis's approach develops a linear model for the Xij in terms of parameters that reflect the relative odds of i and j being transmitted from a parent having both alleles. Thus, when the locus has n alleles, the model has parameters di, i = 1, . . . , n. Then,

Xij = cidi + cjdj, (16.3)

where ci = 1 and cj = -1 for i < j, and ci = -1 and cj = 1 for i > j. In the classical TDT study of transmission from parents of selected (usually affected) offspring, the d's are expected to be zero if there is no differential effect of one or more alleles due to pleiotropy or other within-family genetic association. This approach requires only n parameters to reflect the additive contrasts between the transmission of n(n - 1)/2 allele pairs. Being more parsimonious, the model yields tests of significance based on far fewer degrees of freedom (df) than are required if all pairwise associations are aggregated into a single test. It also avoids a priori (or, worse, a posteriori) decisions about which alleles are likely to be primarily responsible for a given association. Sham and Curtis note that the additive model seems to apply under a set of reasonable assumptions and that a test of goodness of fit can be constructed by comparing the likelihood under the additive model with that under a (saturated) model that allows also for the unique interactions between all i and j. A final advantage is that the approach is easily programmed in standard statistical software such as SAS.
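The sign coding of Equation (16.3) is easy to get wrong in practice, so a minimal sketch may help. This is not the authors' SAS code; the allele labels and d values below are made up, with the last d fixed at zero to set the scale.

```python
import math

def coding(i, j):
    """Sign coefficients of Eq. (16.3): ci = 1, cj = -1 when i < j,
    and ci = -1, cj = 1 when i > j (i, j are allele labels, i != j)."""
    if i == j:
        raise ValueError("only heterozygous parents are informative (i != j)")
    return (1, -1) if i < j else (-1, 1)

def x_ij(i, j, d):
    """Additive linear predictor X_ij = ci*d_i + cj*d_j, the logit of
    P(S_ij = 1), i.e., of the lower-numbered allele being transmitted."""
    ci, cj = coding(i, j)
    return ci * d[i] + cj * d[j]

# Hypothetical 3-allele effects with d3 = 0 fixing the scale:
d = {1: 0.5, 2: -0.2, 3: 0.0}
x = x_ij(1, 3, d)               # d1 - d3 = 0.5
p = 1.0 / (1.0 + math.exp(-x))  # P(allele 1 transmitted over allele 3)
print(round(p, 3))  # 0.622
```

Note that the coding makes the contrast independent of how the pair is ordered: x_ij(1, 3, d) and x_ij(3, 1, d) give the same value, since both model the same event (transmission of the lower-numbered allele).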
228
Eaves and Sullivan
IV. GENERALIZING THE APPROACH TO ALLOW FOR G x E AND OTHER INTERACTIONS

The logistic model was originally formulated with the goal of testing for TDT in the typical study of transmission in trios comprising affected offspring and their parents. One of its attractions is the relative ease with which it can be extended to a number of instructive situations, including other ways of selecting samples and defining the phenotype, G X E interaction, and imprinting. All these approaches build naturally on the basic model of Sham and Curtis and can be implemented very easily in standard statistical software. Formally, these extensions all amount to specifying parameters that reflect the interaction between the d's in the original model and other salient covariates such as phenotypic categories or values (in the case of continuous outcomes), environmental covariates (for G X E interaction), or the maternal versus paternal origin of the alleles (in the case of imprinting).

We first relax the assumption that the offspring are selected for being affected and further assume that corresponding to any Xij above there is a corresponding phenotypic measure Y. The data then comprise the Sij's and the Y's for the transmitted and nontransmitted maternal and paternal alleles for every subject in the sample. If alleles at the locus in question have an effect on the measured phenotype Y, then the relative probability of an allele being transmitted in a given trio will be a function of Y. The foregoing model can be modified to reflect the interaction between phenotype and the probability of transmission thus:

Xij = b + Ycidi + Ycjdj, (16.4)

where b is a constant. The unknown parameters di, i = 1, . . . , n, will be jointly significant if the locus has a significant impact on the measured phenotype Y. The effects of the candidate locus on a continuous phenotype Y are thus detected by a simple extension of the foregoing logistic regression model. If there is interaction between the candidate locus and a measured environmental variable Z, then the G X E interaction can be incorporated in the logistic regression model by including a further set of parameters gi, i = 1, . . . , n, to account for the differential sensitivity of the alleles at the candidate locus to the environment Z. The model then becomes

Xij = b + Ycidi + Ycjdj + ZYcigi + ZYcjgj, (16.5)

where the di are the parameters that account for the allelic differences in the average phenotypic response, with coefficients Yci, and the gi parameterize the G X E interaction, each with corresponding coefficient ZYci. Note that the effects of imprinting can be specified in the model by defining the environmental
covariate Z such that Z = 1 if transmission of a maternal allele is being considered and Z = 0 if a paternal allele is being considered. The joint significance of the gi then becomes a test of the effects of genomic imprinting at the locus. The model just described captures effects of interaction between allelic effects at a candidate locus and measured features of the environment Z. The model can be readily implemented when the offspring environment is assessed as well as the phenotype and alleles at a candidate locus in the offspring and parents. However, aspects of the environment that cannot currently be measured may also interact with genotype. One possible approach to the analysis of such interactions would be to exploit pairs of monozygotic (MZ) twins in TDT analysis. Insofar as MZ twins are exposed to different environments, interaction of genetic differences with the differential environment will contribute to intrapair differences (see, e.g., Jinks and Fulker, 1970). Genetic effects that do not interact with environment will contribute to pair means. The model described by Equation (16.2) can be used to resolve these two kinds of effects of candidate loci by applying TDT to MZ twins and their parents. If we fit the model to the parental transmitted and nontransmitted alleles, using the pair means as the Y variable, any significant allelic effects detected are contributing to the average trait expression, not G X E interaction. On the other hand, if we use the absolute intrapair differences as the Y variable in Equation (16.2), any significant effects will be due to alleles that mediate sensitivity to features of the differential twin environment. This method provides a within-family test of association for loci that affect the outcome primarily through the control of sensitivity to the environment (G X E interaction).
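The extended predictors of Equations (16.4) and (16.5), and the imprinting trick of treating maternal versus paternal origin as the "environment" Z, can be sketched in a few lines of Python. The two-allele effects below are hypothetical, with d2 = g2 = 0 fixing the scales.

```python
def x_ij_gxe(i, j, d, g, Y, Z, b=0.0):
    """Linear predictor of Eq. (16.5):
    X_ij = b + Y*ci*d_i + Y*cj*d_j + Z*Y*ci*g_i + Z*Y*cj*g_j.
    With all g_i = 0 (or Z = 0) this reduces to Eq. (16.4)."""
    ci, cj = (1, -1) if i < j else (-1, 1)
    return (b
            + Y * (ci * d[i] + cj * d[j])
            + Z * Y * (ci * g[i] + cj * g[j]))

# Hypothetical 2-allele effects; d2 = g2 = 0 fix the scales.
d = {1: 0.4, 2: 0.0}
g = {1: 0.3, 2: 0.0}

# Imprinting test: code Z = 1 for maternal transmissions and Z = 0 for
# paternal ones, so the g terms capture a maternal-specific effect.
x_maternal = x_ij_gxe(1, 2, d, g, Y=2.0, Z=1.0)
x_paternal = x_ij_gxe(1, 2, d, g, Y=2.0, Z=0.0)
print(x_maternal > x_paternal)  # True
```

The same dummy-variable device extends to any discrete covariate: whatever is coded into Z, the joint test of the g's asks whether allelic transmission distortion differs across its levels.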
V. EVALUATION OF THE METHOD THROUGH SIMULATION

A series of simulation studies was conducted to evaluate the TDT approach to the analysis of G X E interaction. The simulations focused mainly on the analysis of trio data comprising a single offspring on whom both phenotypic and genotypic data were obtained, and both parents on whom only genotypic data were gathered. In our studies, however, a sibling of the index case was always simulated to allow the sibling correlation and concordance rates to be estimated. Supplementary simulations were conducted (1) to evaluate the additional information that is obtained from the inclusion of siblings in TDTs and (2) to evaluate the possible role of TDTs applied to pairs of MZ twins in the presence of G X E interaction.
A. Analysis of trio data by means of logistic regression Offspring-parent trio data were simulated for a variety of different genetic architectures: (1) variation in risk is purely environmental, without either main
effects of a candidate locus or G X E interaction (the "baseline" case); (2) variation in risk is partly due to the main effects of alleles at a candidate locus and random environmental factors without G X E interaction (the "additive/classical" case); (3) variation is due to alleles at a candidate locus that affect only sensitivity to a measured environmental covariate ("pure G X E"); (4) variation is due to genetic effects at a candidate locus and interaction between alleles at the same locus and a measured environment, with different alleles contributing to the main effects and to sensitivity to the environment ("mixed genetic and G X E"); (5) the same alleles at a candidate locus contribute to average genetic risk and sensitivity to the measured environment ("scalar G X E"); (6) maternally and paternally derived alleles have different sizes of effects on risk ("imprinting"); (7) no direct additive or G X E effect of alleles on offspring risk, but the alleles carried by the parent exercise an indirect environmental effect on the risk in offspring ("genotype-environment correlation"). Case 5 is termed scalar G X E because the alleles having the greatest (or lowest) sensitivity to the environment are also those that have the highest mean liability. There is thus a correlation between mean trait value and sensitivity to the environment. Obviously, the number of conceivable simulations far outweighs what can reasonably be presented, so we summarize selected examples of each that illustrate the main trends. Simulation and analysis were conducted in PC SAS. An example of the program may be obtained from the first author. For each case we simulated 100 sets of 1000 nuclear families comprising mother, father, and two offspring. Following Sham and Curtis (1995), we assume random mating. Our models also assume that there are no heterozygous deviations ("dominance effects") between alleles contributing to overall genetic risk or sensitivity to the environment.
A candidate locus was assumed to have 10 equally frequent alleles whose effects on mean liability and sensitivity to the environment were specified by values of ak, k = 1, . . . , 10, and bk, k = 1, . . . , 10, respectively. In simulating the data for case 6 (imprinting), we set the additive effects a1, . . . , ak to a constant in the paternally derived alleles. The parameter values used in simulating the seven cases are summarized in Table 16.1. Genotypes at the candidate locus were simulated for mothers, fathers, and offspring. "Measured" environmental deviations ei = N[0, 1] and unmeasured environmental effects ri = N[0, s2] were simulated for each offspring. Continuous risk values Yi were then generated by substitution in Equation (16.2). The probabilities P(Ai = 1) that the ith individual would be affected were obtained by substitution in Equation (16.1). An individual was identified as "affected" (Ai = 1) if V = U[0, 1] < Pi, or "unaffected" (Ai = 0) otherwise. Preliminary statistics (prevalence rates, sibling concordance rates, probabilities that A = 1 conditional on allele, sibling correlations in risk, heritability of risk, etc.) were derived from 100,000 independent observations by using SAS procedures. Generalized linear models were fitted to the outcome measures to estimate the contribution to the simulated risk values of the maternal and paternal alleles and their interaction. Trios were formed comprising mother, father, and one unselected offspring. The transmitted and nontransmitted maternal and paternal alleles of each offspring were identified wherever possible, and the values of Xij, ci, cj, di, and dj of Equation (16.3) were derived for every offspring allele that could be assigned unambiguously to either maternal or paternal origin.

Table 16.1. Parameters Used in Simulations(a)

Effect   Case 1   Case 2   Case 3   Case 4   Case 5   Case 6(b)   Case 7(c)
a1        -3       -1       -3       -3       -1       -3          -3
a2        -3       -1       -3       -3       -1       -3          -3
a3        -3       -3       -3       -3       -3       -3          -3
a4        -3       -3       -3       -3       -3       -3          -3
a5        -3       -3       -3       -1       -3       -3          -3
a6        -3       -3       -3       -1       -3       -3          -3
a7        -3       -3       -3       -3       -3       -3          -3
a8        -3       -3       -3       -3       -3       -3          -3
a9        -3       -3       -3       -3       -3       -3          -3
a10       -3       -3       -3       -3       -3       -3          -3
b1         0        0        3        3        3       -1           1
b2         0        0        3        3        3       -1           1
b3         0        0        0        0        0       -1           1
b4         0        0        0        0        0       -4           1
b5         0        0        0        0        0       -4           1
b6         0        0        0        0        0       -4          -1
b7         0        0        0        0        0       -4          -1
b8         0        0        0        0        0       -4          -1
b9         0        0        0        0        0       -4          -1
b10        0        0        0        0        0       -4          -1

(a) The cases are described as follows: case 1, no genetic effect or G X E; case 2, genetic effects, no G X E; case 3, G X E, no genetic main effect; case 4, genetic effects and G X E, different alleles; case 5, genetic effects and G X E, same alleles; case 6, genomic imprinting; case 7, parental alleles affect offspring environment (genotype-environment correlation). In cases 1-5, a1, . . . , a10 denote the main genetic effects of the alleles and b1, . . . , b10 the sensitivities of the alleles to the environment (G X E interaction parameters).
(b) In the case of imprinting, a1, . . . , a10 denote the paternal allelic effects (no genetic variance) and b1, . . . , b10 denote the maternal allelic effects.
(c) In the case of genotype-environment correlation, a1, . . . , a10 denote the direct allelic effects (no genetic variance in this example), and b1, . . . , b10 denote the environmental effect of parental alleles on offspring.
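The simulation recipe just described is compact enough to restate as a Python sketch. This is not the authors' PC SAS program; the parameter vectors below are hypothetical, styled after the "scalar G X E" case (alleles 1 and 2 raise both mean liability and environmental sensitivity), and the simulated prevalence is not expected to reproduce the chapter's exact figures.

```python
import math
import random

# Hypothetical parameters: alleles 1 and 2 have a = -1 (vs -3) and b = 3 (vs 0).
A = {k: (-1 if k <= 2 else -3) for k in range(1, 11)}
B = {k: (3 if k <= 2 else 0) for k in range(1, 11)}

def simulate_offspring(rng, sigma=1.0):
    """One offspring under the chapter's recipe: random-mating parents at a
    10-allele equifrequent locus, e ~ N(0, 1), r ~ N(0, sigma^2), liability
    from Eq. (16.2), and affection status from the logistic Eq. (16.1)."""
    mother = (rng.randint(1, 10), rng.randint(1, 10))
    father = (rng.randint(1, 10), rng.randint(1, 10))
    m, p = rng.choice(mother), rng.choice(father)  # transmitted alleles
    e = rng.gauss(0.0, 1.0)
    r = rng.gauss(0.0, sigma)
    y = A[m] + A[p] + (B[m] + B[p]) * e + r
    affected = rng.random() < 1.0 / (1.0 + math.exp(-y))
    return m, p, affected

rng = random.Random(1)
n = 20000
prevalence = sum(simulate_offspring(rng)[2] for _ in range(n)) / n
print(f"simulated prevalence: {prevalence:.3f}")
```

Collecting the transmitted and nontransmitted parental alleles from such simulated trios, coded as in Equation (16.3), yields the data to which the logistic TDT model is fitted.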
232
Eavesand Sullivan
Tables 16.2 and 16.3 summarize the basic population data for the seven simulated cases. Table 16.2 presents the population prevalence rates and the overall probability of being affected associated with each allele. Table 16.3 presents the sibling correlations and heritability of the underlying normal risk, and the probability of the disorder in siblings of affected individuals with the 95% confidence intervals for the associated case-control odds ratios. Table 16.3 also summarizes the joint impact (measured by F ratio) on the continuous phenotype of the alleles derived from mothers and fathers. The F ratios were obtained by fitting a general linear model for the allelic effects on risk. The tables show a wide range of patterns in the statistics as a function of the underlying genetic process. In every case except the nongenetic case 1, one or more of the alleles increases the probability of the disorder. This occurs regardless of whether the allele interacts with the environment and even when the only "genetic" effect is the secondary environmental effect of the parental alleles on the offspring outcome (genotype-environment correlation). Similarly, all the cases except the first have sibling odds ratios that significantly exceed unity. The results for the latent continuous risk variable, however, are different. In the absence of a main effect of the maternal or paternal alleles on risk (case 3), there is no sibling correlation in risk even when there is considerable G X E interaction in risk and an increased probability of being affected in carriers of the "sensitive" allele. This superficially curious result arises because the G X E interaction affects the variation in liability within an allele but not the mean liability. However, because more sensitive alleles show greater variation in liability, it is expected that individuals carrying the more sensitive allele will have a higher
Table 16.2. Summary Statistics (N = 100,000) from Nuclear Family Data: Prevalence Rates (%) and Average Proportion (%) Affected for Each Allele(a)

             Case 1   Case 2   Case 3   Case 4   Case 5   Case 6   Case 7
Prevalence    4.3      7.9      6.7     10.4     10.4      5.4      6.9
Allele 1      4.2     15.3     12.3     13.3     22.8     12.4     10.1
Allele 2      4.2     15.4     11.2     13.5     22.6     12.7     10.0
Allele 3      4.1      5.7      5.6      7.1      7.4     12.1     10.0
Allele 4      4.4      5.8      5.3      7.7      7.0      2.3      9.9
Allele 5      4.4      6.1      5.9     17.2      7.4      2.3     10.0
Allele 6      4.3      5.7      5.3     17.3      7.3      2.6      3.5
Allele 7      4.5      6.1      5.3      7.1      7.4      2.6      3.5
Allele 8      4.2      6.6      5.4      7.2      6.7      2.4      3.9
Allele 9      4.2      6.3      5.6      7.2      7.3      2.2      3.8
Allele 10     4.4      5.8      5.4      7.1      7.2      2.4      3.8

(a) For summary description of cases, see note a to Table 16.1.
Table 16.3. Summary Statistics from Simulated Data (N = 100,000)(a)

Statistic                     Case 1      Case 2      Case 3      Case 4      Case 5      Case 6      Case 7
Sib correlation                0.00        0.06        0.00        0.04        0.04        0.09        0.31
Heritability                   0.00        0.12        0.00        0.09        0.09        0.16        0.00
F maternal                     0.87       780.5        0.69       608.7       586.7      2243.6      1039.7
F paternal                     0.66       812.2        1.09       574.2       574.1        1.4       1043.0
Risk to affected sib (%)       5.0         9.7         7.3        11.9        14.7         8.5        14.7
Sibling odds ratio (95% CI)  0.89-1.19   1.17-1.37   1.17-1.41   1.13-1.28   1.46-1.64   1.50-1.83   2.37-2.73

(a) Correlation and heritability are based on the continuous risk measure. The F statistics test for the additive effects of the maternal and paternal alleles from fitting a general linear model for the allelic effects on the continuous phenotype, ignoring G X E interaction. There are 9 numerator and 99,900 denominator degrees of freedom for each F.
than average probability of being affected. In practice, the risk to siblings will depend on the precise values of the allelic effects on mean liability and sensitivity to the environment (see Andrieu and Goldstein, 1996). We note that although case 7 (genotype-environment correlation) shows no genetic component in offspring liability, there is a high sibling correlation caused by the shared environmental effect of the parental alleles, and spuriously large F ratios for allelic main effects in the general linear model. That is, there will be overall evidence of association between the alleles and the continuous outcome that will have nothing directly to do with the function of the locus in the offspring. Thus, genotype-environment correlation behaves like population stratification in the analysis of allele-phenotype associations. Case 6 (imprinting) shows the expected difference in F ratios for the effects of maternally and paternally derived alleles in the general linear model for the continuous phenotype.

The LOGISTIC procedure of SAS 6.12 was used to fit the logistic regression model (16.5) to the observed data on transmitted and untransmitted alleles. Likelihood ratio chi-squares were obtained as a guide to the joint significance of the genetic and G X E effects (parameters dk and gk, k = 1, . . . , 9, in the model). The genetic effects are significant when the parameters di, i = 1, . . . , k, are jointly significant. These parameters will be jointly nonzero when the allelic effects [a1, . . . , ak in Equation (16.2)] are significantly heterogeneous. The G X E effects g1, . . . , gk will be jointly significant if the sensitivities b1, . . . , bk of the alleles to the measured environment are heterogeneous. If the b1, . . . , bk are all equal, then there will be a main effect of the measured environment but there will be no G X E, and the parameters g1, . . . , gk will not differ significantly from zero.
Following Sham and Curtis (1995), we arbitrarily fix dk = 0 to fix the scale for the allelic main effects. Similarly, we set gk = 0 in models for which G X E is specified, to fix the scale of
environmental sensitivity. In case 6 (imprinting), the effects of imprinting are tested by making the "measured environment" Z a dummy variable (coded, e.g., 1 for maternally derived alleles and 0 for paternally derived alleles). When Equation (16.5) is fitted by using these Z values, the test of G X E interaction (i.e., the test of significance of g1, . . . , gk) is a test of genomic imprinting.
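The authors fitted these models with PROC LOGISTIC in SAS. As an illustration of the same likelihood machinery, the following stdlib-Python sketch fits the additive TDT model by Newton-Raphson and forms the likelihood-ratio chi-square against the null of no transmission distortion. The three-allele locus and true d values are hypothetical, and d3 = 0 fixes the scale as in the text.

```python
import math
import random

def solve(mat, rhs):
    """Gauss-Jordan solution of the small Newton system mat x = rhs."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(mat)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_logistic(obs, n_par, iters=20):
    """Newton-Raphson ML fit of logit P(S = 1) = x . beta.
    obs is a list of (contrast row, 0/1 response); returns (beta, log-lik)."""
    beta = [0.0] * n_par
    for _ in range(iters):
        grad = [0.0] * n_par
        hess = [[0.0] * n_par for _ in range(n_par)]
        for x, s in obs:
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for a in range(n_par):
                grad[a] += (s - p) * x[a]
                for c in range(n_par):
                    hess[a][c] += p * (1.0 - p) * x[a] * x[c]
        beta = [b + st for b, st in zip(beta, solve(hess, grad))]
    ll = 0.0
    for x, s in obs:
        p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
        ll += math.log(p if s else 1.0 - p)
    return beta, ll

# Hypothetical 3-allele locus with true d = (0.8, 0.4, 0.0); the free
# parameters are (d1, d2), and each heterozygous parent contributes one
# contrast row coded as in Eq. (16.3).
true_d = {1: 0.8, 2: 0.4, 3: 0.0}
rows = {(1, 2): [1.0, -1.0], (1, 3): [1.0, 0.0], (2, 3): [0.0, 1.0]}
rng = random.Random(42)
obs = []
for _ in range(4000):
    i, j = rng.choice(sorted(rows))
    p1 = 1.0 / (1.0 + math.exp(-(true_d[i] - true_d[j])))  # P(S_ij = 1)
    obs.append((rows[(i, j)], 1 if rng.random() < p1 else 0))

beta, ll = fit_logistic(obs, 2)
chi2 = 2.0 * (ll - len(obs) * math.log(0.5))  # LR test vs. d1 = d2 = 0 (2 df)
print("d estimates:", [round(b, 2) for b in beta], "LR chi2:", round(chi2, 1))
```

Under the null every heterozygous transmission is a fair coin, so the null log-likelihood is simply n log(0.5); the same comparison with the g parameters added gives the joint G X E (or imprinting) test described above.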
B. Results of trio simulations

Table 16.4 summarizes the results of fitting the logistic regression TDT model (16.5) to predict the probability of differential allelic transmission as a function of the continuous phenotypic measure Y and the environmental measure Z. The table presents summary statistics based on 100 replicates of 1000 nuclear families. The values for case 1 show the expected pattern of TDT results in the

Table 16.4. Results of Simulation Studies: 100 Replicates of 1000 Nuclear Families

Parameter    Case 1   Case 2   Case 3   Case 4   Case 5   Case 6   Case 7
Constant     -0.01     2.35    -0.40    -0.23    -1.63    -1.80    -3.00
d1           -0.03     1.25    -0.36    -0.36    -1.32    -1.06    -0.10
d2           -0.01     1.18    -0.35    -0.38    -1.38    -1.01    -0.10
d3           -0.03     0.48    -0.08    -0.07    -0.31    -0.94    -0.06
d4            0.02     0.38    -0.07    -0.02    -0.25    -0.32    -0.09
d5           -0.01     0.33    -0.07    -0.64    -0.22    -0.25    -0.08
d6            0.01     0.23    -0.02    -0.63    -0.21    -0.24    -0.03
d7            0.00     0.19     0.02    -0.03    -0.13    -0.16     0.00
d8           -0.01     0.13    -0.01     0.01    -0.10    -0.14    -0.01
d9           -0.02    -0.02    -0.01     0.00    -0.02    -0.01     0.00
s(d)          0.24     0.27     0.23     0.25     0.27     0.33     0.25
chi2(9)       9.59    40.29     9.59    26.66    33.10    22.34    11.02
s(chi2)       4.74    11.71     4.23     8.55    11.77     8.55     5.39
Power (%)     6       98        8       85       94       67       10
g1            0.02    -0.01    -0.63    -0.68    -0.97     0.62    -0.06
g2            0.05     0.03    -0.60    -0.65    -0.94     0.63     0.03
g3            0.01    -0.02    -0.02    -0.02    -0.01     0.57     0.03
g4            0.01    -0.01    -0.02     0.02     0.00    -0.01     0.03
g5            0.01    -0.02    -0.01    -0.06     0.02    -0.02     0.03
g6            0.02     0.02     0.03    -0.07     0.00     0.03     0.01
g7            0.04    -0.03    -0.01     0.01     0.00    -0.02     0.03
g8            0.02    -0.03    -0.01     0.03     0.00     0.02     0.02
g9            0.04    -0.02    -0.01     0.00     0.01    -0.03     0.05
s(g)          0.23     0.24     0.22     0.26     0.25     0.41     0.23
chi2(9)      10.32     9.11    36.57    36.71    58.29    16.40    10.26
s(chi2)       5.15     4.13    10.09    11.26    14.95     7.10     5.75
Power (%)    12        5      100       97      100       36       10
absence of genetic and G X E effects at the locus. All the regression coefficients are symmetrically distributed around zero, and the chi-square tests of the additive genetic allelic effects and G X E interaction show the expected rate of significant values under the null hypothesis. The additive genetic case (case 2) shows a high rate of significant chi-square values for the overall genetic component in the logistic regression, but the expected false positive rate for the G X E components. Note that the mean values of some of the genetic parameters d1, . . . , d9 are no longer zero, although, as expected, the parameters representing the G X E effects g1, . . . , g9 do not differ significantly from zero. Although, in this simple case, the most significant values of d are generally associated with the a's that differ most from the average over genotypes, there is considerable variability across simulations. Thus, although the overall test of significance works well, it may be prudent to seek replication of inferences about the roles of specific alleles. The complementary simulation (case 3), in which the alleles differ only in their sensitivity to the measured environment ("pure" G X E), behaves as expected. The chi-square values for the overall additive genetic effects show the distribution expected under the null hypothesis, and the test of G X E interaction reaches significance in a very high proportion of the simulated data sets. Both cases 4 and 5, in which there are both genetic and G X E effects, behave as expected and generally yield highly significant chi-squares for both the genetic and the G X E components in the model. Again, although there is broad correspondence on average between the most significant regression coefficients and the most deviant a and b values, there is considerable variability in individual cases. Thus it would be unwise to make too much of the significance of individual parameters.

However, the broad pattern of coefficients bears reasonable correspondence to the action of the component alleles. That is, alleles that contribute most to the genetic main effect generally have more significant d values, and alleles that contribute mainly to atypical sensitivity to the environment have the most significant g values. The G X E interaction model is capable of detecting the effects of imprinting (case 6), although with the parameter values selected, the power of the test is rather smaller than for the other cases. In the simulated example, the main effects of the maternally derived alleles were set to those used in case 2 and those for the paternal alleles were all fixed at -3. The test of imprinting amounts to a test of the heterogeneity of the d's between maternally and paternally derived alleles and is expected to be less powerful than the test of equivalent overall values for d under the classical model without imprinting. When there are effects of the parental genotype on the offspring phenotype (genotype-environment correlation, case 7), the TDT tests show no
main effects of the alleles and no G X E interaction. This result is as it should be, since the TDT reflects only genuine within-sibship associations between alleles at a candidate locus and the phenotypic outcome. However, the tests of overall association, which include differences between families (see Table 16.2), detect highly significant differences between alleles in the outcome measure. Thus, spurious association between a marker and outcome arising because parents exercise a nongenetic effect on their children can be resolved by TDT. As with other causes of spurious marker-phenotype association in population-based association studies, the analysis of trios is not biased by this particular type of genotype-environment correlation.

C. Including other siblings

The foregoing analyses treat alleles derived from mothers and fathers as independent and consider effects of alleles only on randomly chosen offspring. We may include the other sibling in the analysis and allow for any correlation between the observations of sibling pairs by treating the sib pairs and the alleles derived from mothers and fathers as repeated measures in the generalized estimating equation (GEE) facility of the GENMOD procedure in SAS. The unit of analysis is now the sibling pair, and we specify an unstructured (4 X 4) covariance matrix for the repeated-measures terms. The results of the GEE procedure may be compared with those of the conventional logistic regression, which treats the maternal and paternal alleles of the two siblings as independent observations. To illustrate the point, this analysis was conducted with only one data set under one example (case 2, additive genetic effects). When the covariance matrix is assumed to be unstructured, the correlations between the observations are relatively small, and allowing for these in the repeated-measures analysis leads to very little change in the estimates or standard errors of the original logistic regression parameters (see Table 16.5). Such a finding is fairly typical in the analysis of data with relatively small clusters and low correlations between clustered observations.
D. Genetic effects and G x E in TDTswith MZ twins The original insight (links and Fulker, 1970) that heterogeneity of intrapair differences of MZ twin pairs will reflect genetic differences in sensitivity to the environment suggests a way of combining candidate gene data from MZ twin pairs in a TDT for the contributions of specific loci on G X E interaction. Pairs of MZ twins were simulated corresponding to cases 2, 3, 4, and 5 in Table 16.1. The environment was assumed to be uncorrelated between twins. A pair of MZ twins comprises two replicates of the same genetic events. Pair
Table 16.5. Comparison of Independent and Generalized Estimating Equation Analyses of TDT Data from Paired Observations(a)

             Assuming independence      Allowing for clustering
Parameter    Estimate    SE(d)          Estimate    SE(d)
Constant      0.253      0.050           0.254      0.052
d1            0.170      0.022           0.168      0.023
d2            0.125      0.021           0.126      0.023
d3            0.076      0.019           0.072      0.020
d4            0.074      0.018           0.071      0.020
d5            0.054      0.018           0.053      0.019
d6            0.060      0.017           0.057      0.019
d7            0.067      0.017           0.063      0.018
d8            0.002      0.016           0.014      0.017
d9            0.049      0.016           0.048      0.018

(a) Analysis is based on 987 informative clusters out of 1000 simulated pairs of siblings and parents for the additive genetic case in the absence of G X E interaction (cf. Table 16.1, case 2).
means and absolute intrapair differences were computed for each pair and then used as separate traits in the logistic regression model. The values of d obtained in the regression on the pair means provide a test of the main effects of the locus on the phenotype. The observations on MZ pairs are simply repeated measures that provide greater precision for testing the main effects of the alleles on the phenotype. The values of d obtained in the logistic regression on the absolute intrapair differences now reflect the effects of the alleles on sensitivity to the environment (G X E). The results of the four sets of simulations are summarized in Table 16.6. Once again, the analysis performs as expected. In case 2, which assumes that the locus has only additive effects on the phenotype, the genetic effects in the regression on pair means are typically highly significant, but the regressions on absolute intrapair differences show only the expected false positive rate. The converse is true when the alleles differ only in their sensitivities to the environment (case 3). Then the d's are generally not significant in regression on the MZ pair means but typically highly significant in regressions on the absolute intrapair differences. When there are both average genetic effects and G X E (cases 4 and 5), we typically find significant regressions on both pair means and intrapair differences. Thus, MZ twin data provide a means of separating the main effects of alleles at candidate loci from those having effects on sensitivity to the environment. In this case, however, the specific environmental covariates are not specified.
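The logic of the MZ design — pair means pick up allelic main effects, while absolute intrapair differences pick up sensitivity to the (unshared, possibly unmeasured) environment — can be sketched as follows. The genotypic values are hypothetical; twins share genotype but draw independent environments, as in the simulations above.

```python
import random

def mz_pair(a_sum, b_sum, rng):
    """Liabilities for one MZ pair sharing genotype (a_sum, b_sum) but with
    independent measured and residual environments (Eq. 16.2 applied twice);
    returns (pair mean, absolute intrapair difference)."""
    y1 = a_sum + b_sum * rng.gauss(0, 1) + rng.gauss(0, 1)
    y2 = a_sum + b_sum * rng.gauss(0, 1) + rng.gauss(0, 1)
    return (y1 + y2) / 2.0, abs(y1 - y2)

rng = random.Random(7)
# Two hypothetical genotypes with equal mean liability but different
# environmental sensitivity: only the sensitive one inflates |Y1 - Y2|.
sens = [mz_pair(-6, 6, rng)[1] for _ in range(5000)]
insens = [mz_pair(-6, 0, rng)[1] for _ in range(5000)]
print(sum(sens) / 5000 > sum(insens) / 5000)  # True
```

Using the pair mean as Y in the logistic TDT model therefore tests allelic main effects, while using the absolute difference as Y tests allelic control of environmental sensitivity, without requiring the environment itself to be measured.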
Table 16.6. Resolving the Effects of G X E Interaction at Candidate Loci in Pairs of MZ Twins: Logistic Regression TDT Analysis Using Pair Means and Absolute Intrapair Differences as Predictors of Transmission Distortion

                 Case 2             Case 3             Case 4             Case 5
Parameter    Genes    G x E     Genes    G x E     Genes    G x E     Genes    G x E
Constant      3.37     0.00     -0.09    -1.45      2.87     0.09      2.42    -1.44
d1            1.83    -0.07      0.01     1.65      0.65     0.90      1.28     1.94
d2            1.72    -0.03      0.01     1.60      0.59     0.92      1.25     1.68
d3            0.84    -0.13      0.01     0.54      0.51    -0.13      0.55     0.62
d4            0.73    -0.09      0.39     0.04      0.42    -0.10      0.55     0.46
d5            0.59    -0.08      0.01     0.38      0.31    -0.10      0.37     0.48
d6            0.48    -0.08      0.00     0.34      1.58    -0.13      0.35     0.32
d7            0.36    -0.01      0.00     0.19      1.48    -0.05      0.26     0.24
d8            0.21     0.00      0.01     0.11      0.21    -0.03      0.20     0.16
d9            0.18     0.01     -0.17     0.04      0.16    -0.11      0.06     0.14
s(d)          0.33     0.81      0.28     0.67      0.30     0.61      0.30     0.61
chi2(9)      53.71    10.83     10.84    27.34     46.20    21.62     39.66    27.85
s(chi2)      13.69     4.82      4.61     9.39     13.11     8.24     12.77     9.40
Power (%)   100       12        12       89       100       70        90       97
As in every simulation study, it is possible to deal with only a fraction of the possible cases and sets of parameter values. However, our examples illustrate a number of simple yet informative extensions of the logistic regression approach to TDT proposed by Sham and Curtis. In particular, we show how the method can be modified to resolve the main effects of alleles at a candidate locus from those of G X E interaction, which have typically been ignored in conventional marker studies. By incorporating environmental measures, the detection of G X E becomes feasible in conventional trio data. The method thus allows us to conduct a joint genetic and environmental "scan" to test hypotheses about the interaction of genetic effects with specific environmental factors. If G X E interaction involves environments that remain to be identified, the analysis of TDT in monozygotic twin pairs may provide an avenue for resolving G X E from the additive effects of alleles at a candidate locus. G X E interaction may be large, thus betokening a major genetic influence on outcome, yet produce no marked first-order association between the outcomes of sibling pairs. Thus, the conventional discussion of lambda values that dominates the design of gene discovery studies is marginalized in the presence of G X E interaction.
240
EavesandSullivan
References
Andrieu, N., and Goldstein, A. M. (1997). Use of relatives of cases as controls to identify risk factors when an interaction between environmental and genetic risk factors exists. Int. J. Epidemiol. 25, 649-657.
Christensen, K., Olsen, J., Norgaard-Pedersen, B., Basso, O., Stovring, H., Milhollin-Johnson, L., and Murray, J. C. (1999). Oral clefts, transforming growth factor alpha gene variants, and maternal smoking: A population-based case-control study in Denmark, 1991-1994. Am. J. Epidemiol. 149, 248-255.
Eaves, L. J. (1984). The resolution of genotype X environment interaction in segregation analysis of nuclear families. Genet. Epidemiol. 1, 215-228.
Eaves, L. J., Long, J., and Heath, A. C. (1986). A theory of developmental change in quantitative phenotypes applied to cognitive development. Behav. Genet. 16, 143-162.
Falk, C., and Rubinstein, P. (1987). Haplotype relative risk: An easy, reliable way to construct a proper control sample for risk calculations. Ann. Hum. Genet. 51, 227-233.
Hall, J. G. (1990). Genomic imprinting: Review and relevance to human disease. Am. J. Hum. Genet. 46, 857-873.
Jinks, J. L., and Fulker, D. W. (1970). Comparison of the biometrical genetical, MAVA, and classical approaches to the analysis of human behavior. Psychol. Bull. 73, 311-349.
Knapp, M., Wassmer, G., and Baur, M. P. (1995). The relative efficiency of the Hardy-Weinberg equilibrium-likelihood and the conditional on parental genotype-likelihood methods for candidate gene association studies. Am. J. Hum. Genet. 57, 1476-1485.
Martin, N. G., Eaves, L. J., and Heath, A. C. (1987). Prospects for detecting genotype X environment interaction in twins with breast cancer. Acta Genet. Med. Gemellol. 36, 5-20.
Mather, K., and Jinks, J. L. (1982). “Biometrical Genetics,” 3rd ed. Chapman & Hall, London.
Rubinstein, P., Walker, M., Carpenter, C., Carrier, C., Krassner, J., Falk, C., and Ginsberg, F. (1981). Genetics of HLA disease associations: The use of haplotype relative risk (HRR) and the ‘haplo-delta’ (Dh) estimates in juvenile diabetes from three racial groups. Hum. Immunol. 3, 384.
Schaid, D. J., and Sommer, S. S. (1993). Genotype relative risks: Methods for design and analysis of candidate gene studies. Am. J. Hum. Genet. 53, 1114-1126.
Sham, P. (1998). “Statistics in Human Genetics.” Wiley, New York.
Sham, P., and Curtis, D. (1995). An extended transmission/disequilibrium test (TDT) for multiallele marker loci. Ann. Hum. Genet. 59, 323-336.
Spielman, R. S., McGinnis, R. E., and Ewens, W. J. (1993). Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52, 506-516.
Terwilliger, J. D., and Ott, J. (1992). A haplotype-based “haplotype relative risk” approach to detecting allelic associations. Hum. Hered. 42, 337-346.
Waldman, I. D., Miller, M. B., Robinson, B. F., and Rowe, D. C. (1997). A continuous variable TDT using logistic regression analysis. Paper presented at the annual conference of the Behavior Genetics Association, Toronto, Ont., Canada, July 10-13.
Waldman, I. D., Robinson, B. F., and Rowe, D. C. (1999). A logistic regression based extension of the TDT for continuous and categorical traits. Ann. Hum. Genet. 63, 329-340.
16. G × E Interaction in Transmission Disequilibrium Tests
A simple variant of the model can be used to test for the effects of genomic imprinting. The method is not “fooled” when an overall gene-phenotype association is created by the environmental effect of parental genotypes for the marker on the offspring phenotype (genotype-environment correlation). The ability of the method to exploit all the data management and analytical capabilities of many statistical packages available on a variety of platforms (SAS 6.12 in Windows 95, in our case) removes some of the mystery and frustration experienced in setting up and running programs specifically written for genetic analysis. The approach makes it possible to rapidly screen sets of loci, outcomes, and environments for possible main effects of alleles and their interaction with specific environments as a prelude to replication or further study. We have not considered explicitly the clinically important case of the etiological heterogeneity of different patterns of comorbid disorders. Insofar as different alleles contribute to heterogeneity of outcome, the inclusion of ancillary outcomes as variables in the basic regression model, and tests of the corresponding “g” parameters in the logistic regression, should provide some leverage on whether phenotypic heterogeneity of the clinical outcome reflects underlying genetic heterogeneity. The method generalizes very simply to the inclusion of other types of covariates, including age, to analyze the effects of loci whose effects change with age, and genotypes at other loci to analyze patterns of epistatic interaction between loci. Further studies are needed to evaluate the power and feasibility of these approaches. Our treatment focuses on continuous outcome measures and unselected samples. This approach clearly exploits the entire phenotypic range and allows us to test for a rich variety of interactions. The same basic approach works well, of course with reduced power, when the phenotypic outcome Y is coded 1/0.
A few minor changes in code also permit the more familiar analysis of trios ascertained through affected offspring. Naturally, parameter values are less stable in small selected samples. However, as more functional markers are characterized at lower cost, and as geneticists pose more subtle questions about the action and interaction of genes and environment in risk to complex human disorders, it is to be expected that there will be a growing need to analyze the impact of known genetic markers in the context of far richer sets of measures than can be reflected in studies of samples selected for a single dichotomous outcome. The approach outlined here may help accomplish this goal.
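As a concrete illustration of the kind of analysis discussed above, the sketch below is our own toy example, not the authors' SAS implementation: it simulates transmissions from heterozygous parents in which the probability of transmitting the candidate allele depends on an environmental exposure (a G × E effect on transmission), and recovers the effect by fitting a logistic regression with Newton-Raphson. All effect sizes and sample sizes are arbitrary illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate n informative transmissions; the log-odds of transmitting the
# candidate allele depends on an environmental exposure E (toy values).
n = 4000
E = rng.binomial(1, 0.5, size=n).astype(float)   # exposed vs. unexposed trios
b0_true, b1_true = 0.0, 0.8                      # hypothetical log-odds parameters
p_transmit = 1.0 / (1.0 + np.exp(-(b0_true + b1_true * E)))
T = rng.binomial(1, p_transmit).astype(float)    # 1 = allele transmitted

# Fit logistic regression T ~ 1 + E by Newton-Raphson (IRLS).
X = np.column_stack([np.ones(n), E])
beta = np.zeros(2)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ beta))         # fitted transmission probabilities
    W = mu * (1.0 - mu)                          # IRLS weights
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (T - mu))

b0_hat, b1_hat = beta                            # b1_hat estimates the G x E term
```

With no environmental effect, b1 would be zero and the test reduces to an ordinary allelic TDT; a nonzero b1 here plays the role of the environment-dependent transmission distortion described in the text.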
Acknowledgments This work is supported by grant MH45268 (PI: L. J. Eaves) from the National Institutes of Health. We thank Dr. Irwin Waldman and two anonymous colleagues for their helpful comments on an earlier version of this chapter.
Major Strengths and Weaknesses of Model-free Methods
David E. Goldgar
Unit of Genetic Epidemiology
International Agency for Research on Cancer
69008 Lyon, France
I. Summary
II. Introduction
III. Nonparametric Methods in Statistics
IV. Do We Need Model-free Methods?
V. Quantitative Traits
VI. Discussion
References
I. SUMMARY This chapter discusses some of the principal advantages and disadvantages inherent in the use of model-free (MF) methods. The principal advantage is that one does not need to specify, a priori, a genetic model for the trait of interest, which is often not known for complex phenotypes of interest. On the other hand, as with all nonparametric approaches, use of model-free methods results in reduced power for detection of linkage compared with model-based methods when the model is correctly specified. The MF methods also have a potential for computational simplicity and are ideally suited for analysis of specific relative sets such as affected sibpairs. The MF methods are particularly well suited to the analysis of quantitative traits, for which finding and implementing a suitable genetic model for use in a parametric linkage analysis may be cumbersome. On the other hand, for discrete traits, most model-free methods allow for only a simple definition of “affected,” making it difficult to consider such factors as age
David E. Goldgar
at onset, diagnostic accuracy of phenotype, or sex-specific disease risks. A factor that can be viewed as both a strength and weakness of MF methods is the large number of statistical approaches and implementation options of model-free methods; while providing a number of choices for the more sophisticated users, such variety also may lead to the risk of overanalysis of the data by selecting the approach that gives the desired result. In the end, the choice between model-free and model-based methods will largely depend on the nature of the phenotype under study and the existing knowledge base about its underlying mode of inheritance.
II. INTRODUCTION The preceding chapters in this section describe a variety of methods that can be applied to family data in the absence of knowledge about the underlying genetic model relating specific genetic effects to the phenotype of interest. These methods can be applied to discrete (disease) traits as well as to quantitative traits to detect linkage and association. Because in practice one never knows the “true” underlying genetic model for a complex (or even simple) phenotype, these methods would seem to have a primary role in the analysis of complex phenotypes. However, it is fair to say that there is some disagreement on this issue within the genetic epidemiology community, and some investigators have taken up the practice of analyzing a given data set by both model-free (MF) and model-based (MB) methods. A glance at several issues of the American Journal of Human Genetics illustrates the point. Of six articles describing linkage analysis of what would be considered complex traits (bipolar affective disease, cancer, psoriasis, inflammatory bowel disease, cholesterol, osteoarthritis), four used only model-free methods, one used exclusively parametric linkage analysis, and one used both approaches. Interestingly, among the five articles using a model-free approach, four distinct programs were utilized for analysis of the data. As with most choices of analytic strategies, there are both advantages and disadvantages associated with the use of so-called model-free or nonparametric methods in genetic analysis. The goal of this brief chapter is to highlight both the strong points and weaknesses of the model-free methods for linkage analysis. It is only fair to say that some of these remarks may be subjective; what one person views as a weakness, others may view as an advantage of the model-free methods. It also must be recognized that the model-free methods represent a very diverse group of methodologies.
Thus, not all the advantages and disadvantages discussed in this chapter are applicable to each method; rather, we shall attempt to concentrate
17. Model-free Methods
on the overall philosophy behind the majority of such methods. Moreover, because model-free methods are, by definition, the alternative to model-based methods, many of the strengths of one approach are a weakness of the other, and the discussion of the strengths and weaknesses will reflect this (see Chapter 10). It also should be noted that the particular strengths and weaknesses depend, in part, on whether one is interested in the analysis of disease (discrete) traits or quantitative traits. Therefore, the discussion addresses these different classes of phenotypes. The major advantage of these model-free methods is that one does not have to specify a genetic model relating the genotypes at the underlying susceptibility locus to the observed phenotype(s). As discussed briefly in the introduction, the true underlying etiological model for any phenotype is never known with certainty. Even for many straightforward single-gene diseases such as neurofibromatosis or cystic fibrosis, there are complications that arise from phenotypic differences associated with different mutations, and from the presence or absence of specific associated traits due to other loci or environmental exposures. For most common diseases and quantitative traits studied today, there is a particularly high degree of uncertainty associated with the underlying genetic model. As discussed later in this chapter, parametric linkage analysis under an incorrect model can have serious consequences for the power to detect linkage. Thus, there is a clear need for methods that can be applied in the absence of knowledge of the underlying genetic architecture of a given phenotype. Apart from this rather obvious advantage, the other strengths of these methods can be stated as follows:
1. Potential for computational simplicity
2. Implicit allowance for effects of multiple loci, including locus heterogeneity and epistasis
3. Well suited for analysis of specific sets of relatives (e.g., sibpairs)
4. Availability of a wide variety of different methods
Disadvantages or weaknesses of the model-free approach are:
1. Loss of power compared with the model-based approach under the correct model.
2. Information about location/recombination is confounded with the magnitude of the effect of the linked disease locus or QTL.
3. Lack of flexibility in incorporating degrees of severity or diagnostic accuracy for disease phenotypes.
4. Most are not readily applicable to complex pedigree structures (e.g., loops).
5. Variety of different methods, each optimal under different conditions. How to choose?
III. NONPARAMETRIC METHODS IN STATISTICS Before talking specifically about the case of gene mapping studies, I think it is useful to examine the general motivations in statistics for the use of nonparametric and/or model-free methods. Most of the “classic” nonparametric methods used in statistics were developed as alternatives to the likelihood theory methods of Karl Pearson, based on the central limit theorem and the likelihood ratio test. In the days before computers, of course, one major advantage of these methods was their ease of calculation in comparison to their parametric counterparts. In addition, for some data sets, there were concerns about using tests that relied on normal theory and the central limit theorem, either because of particular distributional considerations and/or small sample sizes. Nonparametric methods provided an alternative that did not assume any underlying functional form for the distribution. However, the cost in terms of power (or equivalently, the sample size required to detect an effect with specified power) can be substantial if the assumptions underlying the parametric test can be justified. For example, the asymptotic efficiency of the nonparametric sign test compared to its parametric counterpart, the paired t-test, is 2/π ≈ 0.64; that is, the sign test would require a 57% larger sample size to achieve the same power as the paired t-test if the assumptions underlying the latter were justified. Given the robustness of the t-test to departures from normality, the parametric method would be preferred except in cases of very small sample sizes or clearly skewed distributions for which no normalizing transformation is available. In general, the usefulness of any nonparametric test will depend on its relative efficiency under the true model and the robustness of the parametric equivalent to departures from the underlying assumptions.
When the assumptions underlying a given parametric test are not met, a nonparametric alternative can be more powerful than the parametric test in detecting a given departure from the null hypothesis.
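The sign-test example above can be verified directly: the Pitman asymptotic relative efficiency of the sign test against the paired t-test under normality is 4σ²f(0)², where f is the density of the paired differences, which evaluates to exactly 2/π. A minimal check of this standard result (our own illustration, not from the chapter):

```python
import math

# Pitman ARE of the sign test vs. the paired t-test when the paired
# differences really are normal: ARE = 4 * sigma^2 * f(0)^2.
sigma = 1.0
f0 = 1.0 / (sigma * math.sqrt(2.0 * math.pi))  # normal density at its median
are = 4.0 * sigma**2 * f0**2                   # = 2/pi, about 0.64

# Sample-size inflation the sign test needs to match the t-test's power:
inflation = 1.0 / are                          # = pi/2, about 1.57 (57% more pairs)
```

Note that the result does not depend on the chosen sigma, since the σ² factor cancels against the 1/σ² in f(0)².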
IV. DO WE NEED MODEL-FREE METHODS? It seems appropriate to comment at some length on the principal reason for the tremendous increase in interest in these methods: the ability to detect linkage without specifying a genetic model. For the complex traits that now form the majority of ongoing linkage investigations around the world, it is not always apparent what the nature of the underlying genetic mechanism of disease susceptibility really is. For some of these traits, we are not even sure that there is a genetic component. Some investigators may argue that since in many cases little or nothing is known about the genetics of the trait of interest, there is no alternative but to use the model-free approaches. The counterargument to this
proposition is that, in the absence of any knowledge of the strength and pattern of the familial (one hopes, genetic) component of the trait of interest, one perhaps should not be investing in genome scans to identify susceptibility loci for that phenotype. Because the use of model-free methods results in a loss of power in detecting linkage in comparison to methods that specify the true genetic model of the trait locus, the use of parametric methods should not be ruled out, even for complex traits. Of course, analysis of family data under the wrong genetic model will also reduce power for linkage analysis, although a number of studies have shown parametric methods to be reasonably robust as long as the assumed model is not radically different from the true underlying genetic mechanism (Clerget-Darpoux et al., 1986; Greenberg et al., 1998). For most common diseases, data from epidemiological studies indicating the level of familial aggregation, prevalence, and so on usually can be found. In many cases there may be segregation analyses from which to derive appropriate genetic models. Even in the absence of formal segregation analysis, if one has some knowledge of the sibling relative risk λR and an idea of the population prevalence, a set of “reasonable” models can be constructed that fit the epidemiological data. For example, a disease in which the sibling risk due to a hypothesized locus is 4.0 and the disease population prevalence is 1% would be compatible with (among literally an infinite number of other models): (1) a common autosomal dominant locus with an allele frequency of 0.01 and a penetrance of 20% in carriers and 0.6% in noncarriers, and (2) an autosomal recessive locus with a disease allele frequency of 0.05 and a risk to individuals who are homozygous for this allele of 80% compared with 0.8% for heterozygous and normal homozygous individuals.
While both these models are associated with equivalent population prevalence and sibling relative risks, they differ with respect to the parent-offspring risk. Thus, if one also has an idea of the offspring risk, it may be possible to exclude one of them from further consideration. In any case, accurate knowledge of the basic epidemiological profile of the disease in question can aid in choosing a small number of consistent models to use in parametric lod score analysis. Then traditional lod score analysis (either two-point or multipoint) can be performed under this limited set of models, with appropriate correction of the significance threshold for the number of models tested (Hodge and Elston, 1994). If the model is even approximately correct, the use of the lod score method should provide greater power than nonparametric methods (Goldin and Weeks, 1993). More recent work by the group of Greenberg and Hodge has focused on an approach that tests two models, an intermediate recessive and an intermediate dominant, and then corrects for this by requiring a more stringent threshold for significance (the MMLS-C method). This approach has been found to be more powerful than both a sibpair approach and nonparametric multipoint linkage (NPL) scores under a variety of single- and two-locus models (Abreu et al., 1999; Durner et al., 1999) in simulated nuclear families. It is not known, however, to what extent this finding is generalizable to models and pedigree structures other than those examined by these authors; it may well be that in certain situations a model-free approach would be more powerful than the MMLS-C model-based approach.
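The two single-locus example models described above can be checked numerically. The sketch below is our own illustration, using the standard single-locus variance decomposition (λ_sib = 1 + (VA/2 + VD/4)/K² and λ_parent-offspring = 1 + (VA/2)/K²); the chapter's quoted sibling risk of 4.0 is approximate under this parameterization, but the key qualitative point survives: both models yield a prevalence near 1% and similar sibling risks, yet sharply different parent-offspring risks.

```python
def locus_summary(p, f2, f1, f0):
    """Prevalence and sibling / parent-offspring risk ratios for a diallelic
    locus under Hardy-Weinberg; p = disease allele frequency, f2/f1/f0 =
    penetrances for 2, 1, 0 copies of the disease allele."""
    q = 1.0 - p
    K = p * p * f2 + 2 * p * q * f1 + q * q * f0          # population prevalence
    alpha = p * (f2 - f1) + q * (f1 - f0)                 # average effect of substitution
    VA = 2 * p * q * alpha ** 2                           # additive variance
    VD = (p * q) ** 2 * (f2 - 2 * f1 + f0) ** 2           # dominance variance
    lam_sib = 1 + (VA / 2 + VD / 4) / K ** 2              # sibling risk ratio
    lam_po = 1 + (VA / 2) / K ** 2                        # parent-offspring risk ratio
    return K, lam_sib, lam_po

# Model 1: common dominant -- freq 0.01, penetrance 20% (carriers) / 0.6%
K1, ls1, lo1 = locus_summary(0.01, 0.20, 0.20, 0.006)
# Model 2: recessive -- freq 0.05, penetrance 80% (homozygotes) / 0.8%
K2, ls2, lo2 = locus_summary(0.05, 0.80, 0.008, 0.008)
```

Under these formulas the dominant model gives a parent-offspring risk ratio nearly as large as its sibling ratio, while the recessive model's parent-offspring ratio is far smaller, which is exactly the feature the text proposes exploiting to discriminate between them.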
A. Computational considerations The first methods that did not require specification of a genetic model were relatively simple statistics based on counts of shared alleles between affected sibpairs. Many of these methods required completely informative matings to be employed and thus could be used only for certain highly polymorphic systems. Because of its informativeness as a genetic system, the ease of serotyping, and its association with a wide range of diseases, many of these earlier studies focused on the major histocompatibility complex (MHC)/human leukocyte antigen (HLA) system, since it is (or can be made to be) informative in a high proportion of individuals. Because of missing parental genotype data, uninformative markers, and complex pedigree relationships, however, the estimation of allele sharing probabilities, in practice, is only slightly less computationally intensive than parametric linkage analysis. As computing power increases and more efficient algorithms are developed for pedigree analysis (e.g., Cottingham et al., 1993; O’Connell and Weeks, 1995), any differences between these two methods of analysis from a computational standpoint are likely to be relatively insignificant. This is especially true for multipoint analyses, in which most of the computational load is in calculating the multilocus haplotypes at the marker locus; the addition of a (usually) two-allele disease locus in a parametric multipoint linkage analysis requires only a proportionately small amount of extra computing time. Moreover, as more and more p values (for both MB and MF methods) are calculated by simulation, this will come to represent the largest computational burden. However, for classical affected sibpair studies with informative markers, the ability to calculate the test statistic “on the back of an envelope” still holds some attraction.
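For a concrete sense of the "back of the envelope" calculation, the mean-sharing test for affected sibpairs at a fully informative marker reduces to a one-line normal deviate: under no linkage each pair shares 0, 1, or 2 alleles IBD with probabilities 1/4, 1/2, 1/4 (mean 1, variance 1/2). The counts below are hypothetical, purely for illustration.

```python
import math

def mean_ibd_test(n0, n1, n2):
    """Mean IBD-sharing test for affected sibpairs at a fully informative
    marker; n0/n1/n2 = numbers of pairs sharing 0/1/2 alleles IBD."""
    n = n0 + n1 + n2
    mean_sharing = (n1 + 2 * n2) / n
    # Under the null, mean sharing is 1 with variance 1/2 per pair.
    z = (mean_sharing - 1.0) / math.sqrt(0.5 / n)   # one-sided normal deviate
    return mean_sharing, z

# Hypothetical counts: 30, 50, 40 pairs sharing 0, 1, 2 alleles IBD
sharing, z = mean_ibd_test(30, 50, 40)
```

With these counts, mean sharing is 130/120 ≈ 1.083 and z ≈ 1.29, which is indeed arithmetic one can do by hand.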
B. Multiple loci One aspect of the model-free methods that can be quite useful is that, by their very nature, they can easily accommodate locus heterogeneity in which multiple loci are independently involved in disease susceptibility. While in the case of model-based parametric methods, locus heterogeneity can be tested by using specific programs such as HOMOG (Ott, 1991) or by invoking a particular program option in programs such as GENEHUNTER (Kruglyak et al., 1996), there are a number of statistical issues that arise in interpretation of these tests. In particular, some argue that specific tests for locus heterogeneity should not be
performed in the absence of a significant lod score under homogeneity. In addition, the interpretation of multipoint heterogeneity lod scores can be problematic in terms of the heterogeneity lod score threshold required for statistical significance. While locus heterogeneity will certainly reduce the power for detecting linkage when both approaches are used, a significant result from a model-free analysis does not require further analyses allowing for heterogeneity. Inasmuch as almost all parametric linkage analyses assume a single locus, departures from this assumption usually result in a higher recombination fraction. One advantage of many of the model-free methods is that they provide an estimate of the strength of the effect of a locus linked to a marker or region of interest. For discrete traits, given the estimated allele sharing probabilities, one can easily calculate the corresponding locus-specific sibling risk. If one knows from other studies the overall population sibling relative risk, the contribution of the linked locus can be calculated under additive and multiplicative models. For quantitative traits, one either gets directly or can calculate the proportion of phenotypic variance accounted for by alleles at that locus. There is one caveat, however: since this proportion is confounded with the distance, some assumption must be made about the location of the true disease locus, which, as pointed out earlier, is not provided by model-free methods. However, in the multipoint case, one can choose the location with the lowest p value, or the one at which the disease effect is strongest. The sharing probabilities also can be used to exclude an effect greater than a certain magnitude at each position in the genome.
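The locus-specific calculation mentioned above is easy to make concrete. Using Risch's relation z0 = 0.25/λs for affected sibpairs, an estimated probability of sharing zero alleles IBD converts directly into a locus-specific sibling risk ratio, which can then be set against the overall λs under additive or multiplicative models. The numbers below are hypothetical, chosen only to illustrate the arithmetic.

```python
# Locus-specific sibling risk ratio from estimated IBD-sharing probabilities
# in affected sibpairs, via Risch's relation z0 = 0.25 / lambda_s.
z0_hat = 0.15                        # hypothetical estimate of P(share 0 IBD)
lam_locus = 0.25 / z0_hat            # locus-specific sibling risk ratio (~1.67)

lam_total = 4.0                      # hypothetical overall sibling risk ratio
# Apportioning the overall familial risk between this locus and the rest:
residual_mult = lam_total / lam_locus              # residual under a multiplicative model
frac_additive = (lam_locus - 1) / (lam_total - 1)  # locus's share under an additive model
```

Here the linked locus would account for about 22% of the excess sibling risk under an additive model, or leave a residual ratio of 2.4 under a multiplicative model.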
C. Phenotype definition The flexibility that can be incorporated into a parametric approach in terms of the complexity of phenotype and interaction with known environmental risk factors and so on is a great advantage. For many diseases of interest, earlier age at onset is an indicator of increasing genetic liability. This is generally true for cancer and for many of the inherited neurological disorders (e.g., familial Alzheimer disease). Moreover, there may be milder forms of the disease of interest associated with the underlying genetic susceptibility, but much more common in the general population. For example, when one is studying cancer as a phenotype, there are often precursor neoplastic lesions that, when properly incorporated into the genetic model, can dramatically increase power for linkage analysis. Similarly, psychiatric diseases are often characterized by a variety of levels of diagnostic certainty and severity that have varying frequencies in the population, and therefore, varying levels of genotype-specific relative risk. The use of multiple liability classes (or more complex penetrance functions) in a parametric analysis allows for affected (or unaffected) individuals to be treated
differently in the analysis depending on age, severity, associated phenotype, gender, and/or diagnostic criteria. This is difficult or impossible to do when the model-free methods available for analyzing discrete traits are used. In principle, one could construct a quantitative trait reflecting the underlying disease liability and then apply model-free methods designed for the analysis of quantitative traits. However, this in some sense negates some of the advantages of these methods, that is, simplicity and the lack of dependence of the result on a specific model. Most model-free methods by their very nature do not allow for other complex modes of inheritance that can be useful in certain situations. For example, the analysis of linkage of two loci jointly (Schork et al., 1993) can be useful when one is searching for an unknown locus conditional on linkage to a previously identified linked marker.
D. Too many methods? Another feature of model-free methods that can be viewed as both a strength and a weakness is that there is a wide variety of different analytic procedures and statistical approaches available. In contrast, the model-based lod score approach represents essentially a single likelihood-based analysis, with differences in specific implementations representing primarily more efficient computational algorithms for calculating the pedigree likelihoods, as first proposed by Morton (1955) and Elston and Stewart (1971). This plethora of different approaches within the overall class of model-free methods can be quite useful in exploratory data analysis but also can add confusion to the analysis, particularly if results differ according to the specific model-free method used. Even in relatively simple situations, such as analyzing sibpairs by means of relatively simple statistics based on the mean IBD sharing, or the proportion of sibpairs sharing 0 alleles IBD, there can be differences in outcome. This can arise because different methods may be sensitive to different underlying genetic models (Blackwelder and Elston, 1985). As is the case with parametric methods, there also will be differences in results between multipoint and two-point approaches. However, as the pedigrees become more complex, only a limited number of such methods can be used. Also, some of the methods (e.g., SimIBD, Davis et al., 1996) rely on simulation to obtain the complete pedigree IBD distribution, introducing a random element into the calculation of the test statistic (and p value). In analyzing pedigrees that contain more than two affected individuals, many programs examine allele sharing between all possible pairs of affected individuals. 
Since these pairs are usually nonindependent, difficulties arise in assessing the significance level associated with a set of such pedigrees; to address this problem, corrections for this nonindependence have been proposed (Hodge, 1984) and are implemented in several of the analysis programs. Alternatively, some methods (e.g., MapMaker/Sibs, Kruglyak and Lander, 1995)
examine all related individuals together rather than breaking up the pedigree into the possible pairs, thus avoiding the problem of nonindependence, but further complicating the test statistic and associated p values. Most model-free methods also do not properly allow for inbreeding loops, which can result in a loss of power for detecting linkage in consanguineous pedigrees.
V. QUANTITATIVE TRAITS Although there is considerable debate concerning the merits of model-free methods for discrete phenotypes, there is much less controversy concerning their application to quantitative traits. This stems from the much greater difficulty of specifying a genetic model for quantitative traits than for discrete traits. In principle, given knowledge of the overall heritability of the trait, one could derive a simple single-locus model that produced the desired heritability. However, this seems much less satisfactory somehow than a similar procedure for disease phenotypes. Given that many quantitative traits of interest show little, if any, evidence of multimodality, which would be expected under a major gene hypothesis, the use of a model-free approach is in many instances the only practical method of linkage analysis. Like earlier renditions of nonparametric tests in standard statistical applications, the original sibpair methods were based on quite simple statistics, notably on some aspect of the distribution of alleles shared identically by descent (IBD) in completely informative matings for a single locus or haplotype. Another advantage of model-free methods for quantitative traits is that they are readily adapted to specific sampling schemes designed to increase their power to detect linkage to a QTL, for example, preferentially selecting pairs in which the two members of the pair are at opposite ends of the distribution of the quantitative trait (so-called extremely discordant sibpairs, EDSPs; Risch and Zhang, 1995).
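The EDSP rationale can be illustrated with a small simulation (our own sketch, with arbitrary toy parameter values): sibpairs that are extremely discordant for a trait influenced by an additive QTL share fewer marker alleles IBD than the Mendelian expectation of one, so selecting them concentrates the linkage information.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sibpair simulation: diallelic QTL, allele frequency 0.5, additive
# effect beta, unit environmental noise (all values arbitrary).
n, beta, p = 20000, 1.0, 0.5

# Sib 1's maternal and paternal alleles; sib 2 re-inherits each parental
# allele identically (IBD) with probability 1/2, else an independent draw.
a1 = rng.binomial(1, p, size=(n, 2)).astype(float)
ibd = rng.binomial(1, 0.5, size=(n, 2))              # IBD indicator per parent
fresh = rng.binomial(1, p, size=(n, 2)).astype(float)
a2 = np.where(ibd == 1, a1, fresh)

y1 = beta * a1.sum(axis=1) + rng.normal(size=n)       # sib 1 trait value
y2 = beta * a2.sum(axis=1) + rng.normal(size=n)       # sib 2 trait value
ibd_count = ibd.sum(axis=1)                           # 0, 1, or 2 alleles IBD

# Keep the 10% most discordant pairs (largest trait difference).
cut = np.quantile(np.abs(y1 - y2), 0.9)
disc = np.abs(y1 - y2) >= cut

mean_ibd_all = ibd_count.mean()          # close to 1.0 by Mendelian expectation
mean_ibd_disc = ibd_count[disc].mean()   # depressed below 1.0 at the QTL
```

The depressed sharing among discordant pairs is exactly the signal the EDSP design selects for, so genotyping effort is spent on the most informative pairs.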
VI. DISCUSSION It is clear that there is no single best solution to the question of whether to use a model-free or model-based approach. From a personal standpoint, when dealing with disease locus mapping, I prefer the model-based approach, even in the absence of a specific trait genetic model. I feel that this provides me the opportunity to incorporate the existing body of knowledge regarding the genetic epidemiology and genetics of the disease of interest. From a purely philosophical point of view, the parametric approach is in fact measuring the quantity of interest, that is, the nonrandom cosegregation of a disease susceptibility locus and the marker(s) of interest. On the other hand, the nonparametric statistics
based on allele sharing are simply testing the departure of the observed proportion of alleles shared identical by state or descent from that expected. Many of these do not even require that the allele sharing fit a genetic model (Holmans, 1993). In my view, model-free methods are best applied to genetic analysis of quantitative traits, for which developing a reasonably plausible parametric genetic model may be more cumbersome. In practice, it is likely that many investigators will continue to use both parametric linkage analysis and their favorite model-free method in analyzing a complex discrete data set, while quantitative traits will be analyzed almost exclusively by model-free methods, notably methods utilizing a variance components approach. In using these methods (or any methods for that matter), one should be cautious of repeated analysis via different programs until a significant result is obtained unless, minimally, proper correction for the multiple testing is employed. The strength of these methods, at least for discrete phenotypes, is that they provide a useful exploratory analysis, to be followed up later with more rigorous model-based approaches and with more stringent criteria for significance. It is clear, however, that these methods will continue to play a major role in the analysis of complex traits, and that there will be continued methodological developments in this area. It will be up to the genetic epidemiology community at large to evaluate and determine the best use of these methods with appropriate interpretation of the results.
References
Abreu, P., Greenberg, D., and Hodge, S. (1999). Direct power comparisons between simple LOD scores and NPL scores for linkage analysis in complex disease. Am. J. Hum. Genet. 65, 847-857.
Blackwelder, W. C., and Elston, R. C. (1985). A comparison of sib-pair linkage tests for disease susceptibility loci. Genet. Epidemiol. 2, 85-97.
Clerget-Darpoux, F., Bonaiti-Pellie, C., and Hochez, J. (1986). Effects of misspecifying genetic parameters in lod score analysis. Biometrics 42, 393-399.
Cottingham, R. W., Jr., Idury, R. M., and Schaffer, A. A. (1993). Faster sequential genetic linkage computations. Am. J. Hum. Genet. 53(1), 252-263.
Davis, S., Schroeder, M., Goldin, L. R., and Weeks, D. E. (1996). Nonparametric simulation-based statistics for detecting linkage in general pedigrees. Am. J. Hum. Genet. 58, 867-880.
Durner, M., Vieland, V., and Greenberg, D. (1999). Further evidence for the increased power of LOD scores compared with nonparametric methods. Am. J. Hum. Genet. 64, 281-289.
Elston, R. C., and Stewart, J. (1971). A general model for the genetic analysis of pedigree data. Hum. Hered. 21, 523-542.
Goldin, L. R., and Weeks, D. E. (1993). Two-locus models of disease: Comparison of likelihood and nonparametric linkage methods. Am. J. Hum. Genet. 53(4), 908-915.
Greenberg, D. E., Abreu, P., and Hodge, S. E. (1998). The power to detect linkage in complex disease using simple genetic models. Am. J. Hum. Genet. 63, 870-879.
Hodge, S. E. (1984). The information contained in multiple sibling pairs. Genet. Epidemiol. 1, 109-122.
Hodge, S. E., and Elston, R. C. (1994). Lods, wrods, and mods: The interpretation of lod scores calculated under different models. Genet. Epidemiol. 11, 329-342.
Holmans, P. (1993). Asymptotic properties of affected-sib-pair linkage analysis. Am. J. Hum. Genet. 52, 362-374.
Kruglyak, L., and Lander, E. S. (1995). Complete multipoint sib-pair analysis of qualitative and quantitative traits. Am. J. Hum. Genet. 57, 439-454.
Kruglyak, L., Daly, M. J., Reeve-Daly, M. P., and Lander, E. S. (1996). Parametric and nonparametric linkage analysis: A unified multipoint approach. Am. J. Hum. Genet. 58, 1347-1363.
Morton, N. E. (1955). Sequential tests for the detection of linkage. Am. J. Hum. Genet. 7, 277-318.
O'Connell, J. R., and Weeks, D. E. (1995). The VITESSE algorithm for rapid exact multilocus linkage analysis via genotype set-recoding and fuzzy inheritance. Nat. Genet. 11, 402-408.
Ott, J. (1991). "Analysis of Human Genetic Linkage," rev. ed. Johns Hopkins University Press, Baltimore.
Risch, N., and Zhang, H. (1995). Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268, 1584-1589.
Schork, N. J., Boehnke, M., Terwilliger, J. D., and Ott, J. (1993). Two-trait-locus linkage analysis: A powerful strategy for mapping complex genetic traits. Am. J. Hum. Genet. 53(5), 1127-1136.
18. Meta-analysis for Model-free Methods

Chi Gu¹ and Michael A. Province
Division of Biostatistics
Washington University School of Medicine
St. Louis, Missouri 63110
D. C. Rao
Division of Biostatistics
Departments of Psychiatry and Genetics
Washington University School of Medicine
St. Louis, Missouri 63110
I. Summary
II. Introduction
III. Meta-analysis of Genetic Studies
IV. Practical Issues
V. Discussion
References
I. SUMMARY

The intricate nature of complex genetic traits dictates that novel methodologies be developed and utilized to achieve better power, better accuracy, and a more favorable balance between type I and type II errors than could be achieved by the traditional methods as they are used in mapping Mendelian traits. Meta-analysis provides one such method for synthesizing information from multiple studies. It has the advantage of being able to pool relatively weak signals from individual studies into collectively stronger evidence of genetic effects, while at the

¹To whom correspondence should be addressed.

Advances in Genetics, Vol. 42. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.
same time providing a quantitative framework for modeling variability among studies. The traditional lod score measures the significance level of a linkage effect in an individual study, and its additive property makes it a natural candidate for combining results across independent studies. To incorporate the within-study variation of the linkage effect into the pooled overall measure of genetic effect, the effect sizes (such as the proportion of genes shared identical by descent, IBD) should be pooled directly across studies. Traditional regression models and mixed effects models can be used to estimate the overall genetic effect size and its variance, and to test heterogeneity among studies. Our simulation studies show that designing studies with moderate power and pooling their results via meta-analysis may be more cost-effective than large dedicated studies. We believe that, as a newly emerging methodology, the meta-analysis approach has the potential to become an integral part of our toolbox that will expedite the search for complex human disease genes.
II. INTRODUCTION

As reports of conflicting genetic linkage/association claims continue to accumulate in the literature, it becomes useful to systematically review the multiple sources of evidence and, when possible, to pool the evidence. Complex traits likely involve only small genetic effects, and thus the need becomes even greater, since pooling may be the only way to obtain convincing evidence. In the fields of controlled clinical trials and the social sciences, a methodology called "meta-analysis" has been developed for the quantitative overview and synthesis of research results. Its application to epidemiological studies has also been studied (e.g., see DerSimonian and Laird, 1986). The development of such methods in the analysis of genetic linkage began only in the late 1990s (Li and Rao, 1996; Gu et al., 1996a; Allison and Heo, 1998).
A. Meta-analysis as a scientific process

Although the practice of research synthesis has a long history and early methodological work dates back to the 1930s (Cochran, 1937), the active development and use of quantitative techniques began in the 1970s (Glass and Smith, 1978; Rosenthal, 1979; DerSimonian and Laird, 1986). For the purpose of this chapter, "meta-analysis" refers to a variety of statistical procedures developed and applied to the quantitative review of summary statistics from controlled experimental designs and/or observational epidemiological studies. In a meta-analysis, the main effect of an experiment, a drug, or a psychological test is modeled either as a fixed parameter or as a random realization
from the distribution of effects from all primary studies. Methods analogous to the traditional analysis of variance may be employed to estimate the main effect and its standard error, and confidence intervals can be constructed to carry out significance tests. In a sense, summary statistics from primary studies are treated as "raw data" and are quantitatively analyzed to produce a grand estimate of the effect. A typical meta-analysis consists of three major steps: preparation, synthesis, and interpretation. The preanalysis preparation stage is responsible for defining the problem (including the main effect to be pooled), setting up inclusion/exclusion criteria, conducting an exhaustive search of the literature on the subject, and assessing the quality of individual studies. The next step involves extracting the effect sizes of interest from individual primary studies. Various forms of statistics, such as standardized mean differences or relative risks, are used as the main effect in the literature, though some argument against the use of standardized units has been raised because of possible spurious results (e.g., see Greenland, 1987). We discuss later in this chapter some statistics that can serve as main effects for meta-analysis of genetic studies. The final step of a meta-analysis involves constructing and testing models for the effect(s) and assessment of possible trends in the effect sizes. Other post-analysis tasks include interpreting the final results, resolving sources of any heterogeneity among studies, and formulating and testing new hypotheses. A workshop held in Potsdam, Germany, published a set of detailed guidelines for meta-analysis of randomized controlled trials (Cook et al., 1995). A flowchart for applications of meta-analysis in medical sciences may be found in Jenicek (1989).
B. Meta-analysis and complex traits

One characteristic that makes complex traits different from simple Mendelian traits is the involvement of several genetic and nongenetic factors. This not only makes the effects complex, involving interactions among the genes and environments, but also renders the effects of individual loci relatively small. Even replication studies may not work effectively, because of possible genetic heterogeneity (Suarez et al., 1994). Instead of focusing exclusively on how to maximize the chances of detecting the underlying genes within an individual study, it may be more effective to use appropriate meta-analysis methods to pool evidence from multiple studies. In a series of papers, we addressed how to define and extract common effect sizes from individual genetic studies, how to handle study-specific covariates, and how to pool genetic association studies (Gu et al., 1996a,b, 1999). Although each individual study may have been designed to achieve a level of power sufficient to detect genes above some minimal threshold of effect, the power is often misleadingly inflated when some random factors are ignored (see Section V, Discussion), and, therefore, most individual studies are likely to be underpowered. On the other hand, a carefully executed meta-analysis of existing studies can pool the evidence across all studies to achieve better power for detecting these genes of smaller effect. By accommodating and resolving inconsistencies among studies, we can hope to make better progress than with any single study. Instead of simply counting an individual study as a "success" or a "failure," if information from individual studies is quantitatively pooled, the variation among studies can be accounted for in arriving at a pooled measure of the overall effect of genetic linkage. Because meta-analysis is a newly emerging methodology, and its application to genetic studies is relatively recent, it is impossible to cover every aspect of the methodology in this chapter. The development of such a methodology for genetic analysis encounters many new challenges, some of which we discuss later. We first give an overview of the development of meta-analytic methods for linkage studies by reviewing how issues such as defining a "common effect" can be addressed, by providing some basic quantitative techniques for synthesizing effects from individual studies into a pooled measure of linkage, and by discussing practical issues that we deem important and critical to a successful meta-analysis. We also provide guidelines to be followed when one is designing and carrying out a meta-analysis of genetic studies.
III. META-ANALYSIS OF GENETIC STUDIES

Although in theory every genetic study should have in its design a careful analysis of power and sample sizes, it remains true that, with the exception of animal models, human genetic studies are almost always observational rather than well-controlled experiments. This observational nature has several important implications for the meta-analysis of genetic studies. First, study-specific features such as sampling schemes, analysis methods, and the overall quality of the study vary widely, giving ample opportunity for confounding bias in the pooled effects. In other words, differences in the results of linkage analysis among studies may be due as much to variations in design as to genetic factors. Also, our ability to extrapolate the findings may be greatly limited if the collected primary studies fail to cover the whole spectrum of the study population. On the other hand, if strenuous effort is devoted to reviewing all primary studies, and artifacts are carefully corrected, such heterogeneity could add to our confidence in the pooled findings. Real complication in meta-analysis of genetic studies arises on account of genetic heterogeneity, whereby the same phenotypic outcome may involve different genetic etiologies in different studies. This last source particularly complicates any attempt to perform meta-analysis of linkage results by brute force, since potentially genuine differences among at least subsets of the studies must be accommodated. In this sense, meta-analysis means something much broader in the genetics context than it does in the statistical literature.
A. Pooling lod scores and P values

The traditional lod scores calculated from independently sampled pedigrees are additive, thus making them natural candidates for combination as a common measure if they were calculated under the same genetic models (Morton, 1955). Even if they were calculated under different models, if the pedigrees are published, it is still possible to recalculate the lod scores for the individual studies under some uniformly defined models (see Leder et al., 1998). This coincides with the spirit of pooling P values, one of the simplest methods for combining individual tests into a single omnibus test, and perhaps the oldest meta-analytic technique. The method, developed by R. A. Fisher (1932) some 40 years before the term "meta-analysis" was ever coined, is based on the observation that under the null hypothesis, P values are uniformly distributed. Therefore, if $n$ independent tests result in P values $p_1, \ldots, p_n$, then the sum of the $-2\log(p_i)$ is asymptotically distributed as a $\chi^2$ with $2n$ degrees of freedom, which provides a combined P value for all $n$ tests. Namely, we can reject the omnibus null hypothesis if

$$X = -2\sum_{i=1}^{n} \log(p_i) \geq C_\alpha, \qquad (18.1)$$
where $C_\alpha$ is the $(1 - \alpha)$ critical value of a $\chi^2$ distribution with $2n$ degrees of freedom. In the case of linkage studies, one can easily work on the P-value scale if for some studies the lod scores are impossible to calculate, or on the lod scale if all lods are available, since there is a simple one-to-one correspondence between the two (Ott, 1991). This technique is remarkably general. The $n$ individual tests need not use the same statistic to produce the P values, and they may each even operate on very different sampling units. All that is required for the validity of the combined test is that the individual P values be from tests of the same hypothesis and be independent of one another. Thus, we can use this method to combine parametric (model-based) with nonparametric (model-free) linkage tests (including Haseman-Elston, variance components, etc.), to combine dichotomous with continuous phenotype definitions (e.g., blood pressure vs hypertension status), to combine samples of affected sibpairs, extremely discordant sibpairs, entire sibships, and/or extended pedigrees, and to combine single-point with multipoint linkage analyses. However, such generality comes with a price. Its biggest deficiency arises from the nature of the P value itself. Since the P value confounds onto a single scale both the magnitude of the effect and its standard error, there is no way to disentangle these effects when this technique is used. This fact has several implications. First, we have no "test of homogeneity of effect," as is available with other meta-analysis strategies, which can provide some level of protection against combining studies that are not really poolable. Second, it may sometimes be hard to interpret a significant result. For example, if the combined test of one study with highly significant linkage and other nonsignificant studies produces a significant P value, should we interpret the linkage as significant across all studies? The third concern is that directly pooling P values may be more sensitive than other techniques to large sample size differences among studies, precisely because effect sizes and standard errors are confounded. To alleviate this problem, a weighting scheme utilizing the sample sizes of individual studies may be used, and one may form the product

$$P_c = p_1^{\gamma_1} p_2^{\gamma_2} \cdots p_n^{\gamma_n} \qquad (18.2)$$

to test the combined hypothesis (Robbins, 1948; Good, 1955). Other study-specific features, such as the types of sibpairs used for analysis, can also enter the definition of the weights $\gamma_i$. Not only are such schemes often subjective, however, but many study-specific factors necessarily will remain unaddressed. Therefore, this method can serve as an attractive screening tool for promising genomic regions identified by many studies utilizing somewhat dissimilar approaches, but further parametric interpretation of combined findings requires more sophisticated modeling.
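As a concrete illustration, Fisher's combined test in equation (18.1) can be sketched in a few lines of code. The function names and the example P values below are our own inventions; the chi-square tail probability is computed from a standard series expansion of the regularized lower incomplete gamma function, so this is a sketch rather than a production implementation.

```python
import math

def chi2_sf(x, df):
    """Survival function of a chi-square distribution, via the series
    expansion of the regularized lower incomplete gamma function P(a, z)."""
    a, z = df / 2.0, x / 2.0
    if z <= 0:
        return 1.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= z / (a + n)
        total += term
    log_scale = -z + a * math.log(z) - math.lgamma(a)
    return max(0.0, 1.0 - total * math.exp(log_scale))

def fisher_combined(p_values):
    """Combine independent P values: X = -2 * sum(ln p_i) ~ chi2(2n) under H0."""
    x = -2.0 * sum(math.log(p) for p in p_values)
    return x, chi2_sf(x, 2 * len(p_values))

# Hypothetical P values from four independent linkage tests of the same hypothesis
x, p_combined = fisher_combined([0.04, 0.20, 0.07, 0.11])
print(f"X = {x:.4f}, combined P = {p_combined:.4f}")
```

Note that none of the four individual P values reaches 0.01, yet the combined test can be considerably more significant, which is exactly the pooling behavior described above.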
B. Pooling common linkage effects

Modern techniques of meta-analysis are marked by their quantitative treatment of common effect sizes drawn from individual studies. Unlike in other epidemiological studies, there is no clearly defined common effect per se for genetic linkage that is widely adopted by researchers (not counting the lod scores, which are really P values rather than direct measures of linkage effect sizes). For studies that employed certain designs and analytical approaches, one may find some relative risk or regression coefficient that can play the role of a common effect size. For example, in an earlier study by Li and Rao (1996), the regression coefficients derived from the Haseman-Elston analysis were pooled to get an overall estimated coefficient that allows for a combined test of linkage. However, for meta-analysts who ambitiously plan to pool a wide range of genetic studies, the lack of a clearly defined common effect poses a serious challenge. To mitigate the problem, we proposed a strategy to use allele sharing as the common effect for pooling (Gu et al., 1996a). Since the magnitude of allele sharing can be derived, at least in theory, this may offer a general solution to the problem.
1. Defining common effect

For most studies employing a nonparametric sibpair design, tests are based on the departures of the observed IBD distributions from those expected under the null hypothesis of no linkage, though different statistics might be used. For example, the affected sibpair (ASP) method is based on the idea that, under the alternative of tight linkage between the trait and marker loci, sibpairs with both members affected should have a higher probability of sharing more than one allele IBD, and sibpairs with one member affected and the other unaffected should have a higher probability of sharing none of the alleles IBD. The extreme sibpair (ESP) method relies on a similar idea but enhances the probability of IBD sharing by sampling from the extreme tails of the trait distribution. The Haseman-Elston (HE) method and its extensions, on the other hand, regress the square of the trait difference between the sibs on their IBD proportion at the marker and detect linkage through significant departure of the regression coefficient from zero. Each of these designs uses different statistics to test the null hypothesis of no linkage, but all are based on the distribution of IBD sharing among sibs. As for studies employing a parametric design, the test of linkage does not rely directly upon IBD sharing, but such parameters should be estimable if the original data are available or the original authors are willing to help. Information loss of this kind is anticipated for some individual studies, but the resulting broader range of studies will enhance the generality of the pooled estimates, hence compensating for the suboptimal usage of some primary studies. Based on such observations, we proposed to use $\pi(T)$, the observed IBD proportion at the marker of interest for sibpairs of a certain trait outcome $T$, as a common effect for pooling across studies. Various types of sibpairs can be considered depending on $T$, and the concept can be extended to other types of relatives.
Rice (1998) has made a similar proposal in the context of psychiatric disorders. We now briefly review methods for extracting such parameters from various primary studies and for deriving a pooled estimate and test for overall linkage.
2. Extracting common effect

a. Proportion of IBD in ASP and ESP studies

For ASP and ESP studies, information on IBD sharing should be available, one way or another, in the original publication. For example, if an ASP study used the $t_2$ statistic (see Blackwelder and Elston, 1985) and reported $n_j$ (the observed number of sibpairs having $j$ members affected) and $r_{jk}$ (the number of sibpairs with $j$ affected sibs that share $k$ alleles IBD), then an estimate of the IBD proportion $\pi(AA)$ can be calculated by

$$\hat{\pi}(AA) = \frac{\hat{\tau}_{21} + 2\hat{\tau}_{22}}{2} = \tfrac{1}{2}\hat{\tau}_{21} + \hat{\tau}_{22},$$

with sampling variance $S^2 = (1/n_j)\left[\tfrac{1}{4}\hat{\tau}_{j1} + \hat{\tau}_{j2} - \hat{\pi}^2\right]$, where $\hat{\tau}_{jk} = r_{jk}/n_j$ is the estimated probability that a sibpair with $j$ affected sibs will have $k$ alleles IBD. It is clear that if only affected sibpairs are used in the study, and the P value as well as the number of affected sibpairs $n_2$ are reported in the article, we can reconstruct $\hat{\pi}(AA)$. We can also get the sampling variance if values of $r_{jk}$ are available. When they are not reported in the original article, it may still be possible to recover them from the author(s), since these data were essential in calculating the test statistics. We stress that it is important for a meta-analyst to get as much information as possible from the original authors, since some unreported information, such as results at nonsignificant markers, could be vital to the meta-analysis. For ESP studies, the test statistic used,

$$\bar{\pi}(h, l) = \frac{1}{2n}\sum_{i=1}^{n}\left[X_{1i}(h, l) + X_{2i}(h, l)\right],$$

is the observed proportion of IBD sharing by a sibpair when one sib has outcome $h$ and the other $l$ (Gu et al., 1996b). As discussed earlier, the sampling variance of this estimate is $(1/n)\left[\tfrac{1}{4}\hat{\tau}_{1} + \hat{\tau}_{2} - \bar{\pi}^2\right]$, where $\hat{\tau}_{k}$ is now $\hat{\tau}_{k}(h, l)$, which depends on the type of extreme sibpairs used (e.g., $h$ could correspond to the 90th percentile and $l$ could correspond to the 30th percentile). Assume that the threshold of trait value used to classify a person as affected in the ASP studies is the same as that which would classify the person as having "extremely high trait values," or only varies randomly among studies. Then the $\pi$'s of extremely high-concordant (HC) sibpairs derived from the ESP studies and the $\pi$'s of affected-affected (AA) sibpairs derived from the ASP studies are all observed values of the same random variable, and appropriate models could be used to pool the two types of studies.
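To make the extraction step concrete, the estimate of π(AA) and its sampling variance can be computed directly from the reported counts. The counts below are hypothetical; the variance is the moment formula above, i.e., the variance of the mean of a per-pair sharing proportion taking values 0, 1/2, 1.

```python
def asp_ibd_effect(n_pairs, r_share1, r_share2):
    """Estimate the mean IBD proportion pi(AA) for affected sibpairs and its
    sampling variance, from n_pairs pairs of which r_share1 share one allele
    IBD and r_share2 share two alleles IBD."""
    tau1 = r_share1 / n_pairs          # P(share exactly 1 allele IBD)
    tau2 = r_share2 / n_pairs          # P(share exactly 2 alleles IBD)
    pi_hat = 0.5 * tau1 + tau2         # mean proportion of alleles shared IBD
    var = (0.25 * tau1 + tau2 - pi_hat ** 2) / n_pairs
    return pi_hat, var

# Hypothetical ASP study: 100 affected sibpairs; 50 share one allele, 35 share two
pi_hat, var = asp_ibd_effect(100, 50, 35)
print(f"pi(AA) = {pi_hat:.3f}, sampling variance = {var:.5f}")
```

As a sanity check, plugging in the null expectations (τ₁ = 1/2, τ₂ = 1/4, so π = 1/2) reduces the variance to the familiar $1/(8n)$ for the mean IBD proportion of $n$ sibpairs.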
b. Proportion of IBD in HE studies

As discussed earlier, a natural candidate of common effect for pooling HE studies is the regression coefficient $\beta_1$, whose meta-analysis is discussed in Li and Rao (1996). However, to pool such studies with others we need to extract the IBD sharing too. If results of an ASP analysis were also reported in the original publication, we may estimate the IBD proportion as discussed earlier. If not, we can make the best linear prediction of the IBD proportion from the regression equation

$$\hat{\pi} = E(\pi \mid Y) = \bar{\pi} + \hat{\beta}\,\frac{\hat{\sigma}^2(\pi)}{S^2(Y)}\,(Y - \bar{Y}),$$
where $\bar{\pi}$ and $\bar{Y}$ are the nonweighted sample averages, and $\hat{\sigma}^2(\pi)$ and $S^2(Y)$ are the sample variances of $\pi$ and $Y$, respectively (see Gu et al., 1998c for details). Note that for the purpose of pooling, only $\pi(ED)$ estimated in this way for extremely discordant (ED) sibpairs should be used for pooling with those derived from other ESP studies, because the HE design does not differentiate concordant sibpairs from the upper tail of the trait distribution from those from the lower tail. If pooling IBD on other types of sibpairs is desired, the recalculation should be done by the meta-analyst or obtained from the original authors.
c. IBD from other types of studies

If original data are available, such as in the case of large-scale multicenter studies, only some reanalysis of the data is necessary. Otherwise, derivation of IBD sharing often requires the meta-analyst to contact the original authors directly. If such efforts fail, less sophisticated methods, such as the meta-analysis of P values discussed earlier, can be used.

3. Quantitative synthesis
After common effects have been extracted from primary studies, the central questions are: How do they vary among studies, and can we derive a "better" estimate (grand mean) of such an effect with higher precision? If we are confident that the studies are all testing the same hypothesis, employing similar methods and sampling schemes (in other words, if they are rather homogeneous), then procedures of analysis of variance can be applied by treating the effects as raw data points. However, we believe that for genetic studies, even when the same disease is being examined in identical genomic regions, many other design factors are involved that warrant modeling by means of random effects (Gu et al., 1998c).
a. Random effects model

We assume that for the $i$th study the population parameter of IBD sharing, $\pi_i$, is drawn from a random distribution: namely,

$$\pi_i = \pi + \delta_i, \qquad (18.3)$$

where $\pi$ is the mean of the population of all comparable studies and $\delta_i$ is the value of a random variable $\delta$ with mean zero and variance $\sigma^2(\delta)$, which equals the among-study variance we are going to estimate. Suppose that we have from each of the primary studies an estimate of the IBD proportion $\hat{\pi}_i$ and its sample variance $S_i^2$, $i = 1, \ldots, n$. We have the random effects model

$$\hat{\pi}_i = \pi + \delta_i + \epsilon_i, \qquad (18.4)$$

where $\epsilon_i$ is the sampling error, whose variance equals $S_i^2$ in the $i$th study. Applying the weighted least-squares procedure, we have an estimate for the overall IBD sharing

$$\hat{\pi} = \frac{\sum_{i=1}^{n} w_i \hat{\pi}_i}{\sum_{i=1}^{n} w_i}, \qquad (18.5)$$

where the weights are computed by using the estimate of $\sigma^2(\delta)$:

$$w_i = \frac{1}{\hat{\sigma}^2(\delta) + S_i^2}, \qquad (18.6)$$

and

$$\hat{\sigma}^2(\delta) = \max\left\{0,\; \frac{Q - (n-1)}{\sum_{i} w_i^{*} - \sum_{i} w_i^{*2} \big/ \sum_{i} w_i^{*}}\right\}, \qquad (18.7)$$

where $w_i^{*} = 1/S_i^2$ and $Q = \sum_{i=1}^{n} w_i^{*} (\hat{\pi}_i - \bar{\pi})^2$ is the homogeneity statistic computed with the fixed-effect weighted mean $\bar{\pi}$ (cf. Section IV.C). For a given significance level $\alpha$, the approximate variance of the estimate of the mean IBD proportion, $\left[\sum_{i=1}^{n} 1/(\hat{\sigma}^2(\delta) + S_i^2)\right]^{-1}$, may be used to construct a confidence interval for the overall estimated IBD sharing and to test for linkage.
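The weighted least-squares pooling described above can be sketched as follows. The per-study estimates and variances below are invented for illustration, and the among-study variance is computed with the standard DerSimonian-Laird moment estimator, which we are taking as the intended form of the estimator here.

```python
import math

def pool_random_effects(pi_hats, variances):
    """Random effects pooling of per-study IBD proportions (DerSimonian-Laird).
    Returns the pooled estimate, its variance, and the among-study variance."""
    n = len(pi_hats)
    w_star = [1.0 / v for v in variances]              # fixed-effect weights 1/S_i^2
    pi_fixed = sum(w * p for w, p in zip(w_star, pi_hats)) / sum(w_star)
    q = sum(w * (p - pi_fixed) ** 2 for w, p in zip(w_star, pi_hats))
    denom = sum(w_star) - sum(w ** 2 for w in w_star) / sum(w_star)
    tau2 = max(0.0, (q - (n - 1)) / denom)             # among-study variance
    w = [1.0 / (tau2 + v) for v in variances]          # random-effects weights
    pi_pooled = sum(wi * p for wi, p in zip(w, pi_hats)) / sum(w)
    var_pooled = 1.0 / sum(w)
    return pi_pooled, var_pooled, tau2

# Hypothetical IBD estimates and sampling variances from five sibpair studies
pi_hats = [0.54, 0.65, 0.50, 0.60, 0.52]
variances = [0.0012, 0.0020, 0.0015, 0.0010, 0.0025]
pi, var, tau2 = pool_random_effects(pi_hats, variances)
lo, hi = pi - 1.96 * math.sqrt(var), pi + 1.96 * math.sqrt(var)
print(f"pooled pi = {pi:.4f}, 95% CI = ({lo:.4f}, {hi:.4f}), tau^2 = {tau2:.5f}")
```

If the resulting confidence interval excludes π = 1/2, the null hypothesis of no linkage is rejected, as in the combined test of linkage discussed later in this section.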
b. Mixed effects models

In general, genetic parameters also vary among the study populations. This may be due to factors specific to the study, such as sampling scheme and diagnostic battery, and such variance can be measured via study-specific covariates. If any of these covariates is associated with the effect sizes (which can be verified by checking plots of effect sizes against these covariates in the preanalysis), it should be incorporated in the model using a mixed effects approach. Namely, given a set of $k$ study-specific covariates $X_{i1}, \ldots, X_{ik}$, we can extend the random effects model to the following mixed effects model:

$$\pi_i = \gamma_0 + \gamma_1 X_{i1} + \gamma_2 X_{i2} + \cdots + \gamma_k X_{ik} + \delta_i + \epsilon_i \quad \text{for } i = 1, \ldots, n. \qquad (18.8)$$

Or, in matrix form:

$$\boldsymbol{\pi} = \mathbf{X}\boldsymbol{\gamma} + \boldsymbol{\delta} + \boldsymbol{\epsilon}. \qquad (18.9)$$

The overall effect $\gamma_0$ as well as the regression coefficients $\gamma_1, \ldots, \gamma_k$ can be derived again using a weighted least-squares procedure with the weights $w_i = 1/(\hat{\sigma}^2(\delta) + S_i^2)$, $i = 1, \ldots, n$. Derivation of the variance can be found in Gu et al. (1999), where the foregoing model was used to combine linkage results from 20 replicate studies, consisting of four study groups, of a simulated complex disease. The application demonstrated how a mixed effects model can help disentangle complex relationships.
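A minimal version of the weighted least-squares step for the mixed effects model (18.8), with a single hypothetical study-level covariate and an among-study variance assumed to have been estimated already, might look like this. All numbers, including the covariate and the assumed τ², are invented for illustration.

```python
def wls_two_param(x, y, weights):
    """Weighted least squares for y = g0 + g1*x, solving the 2x2 normal
    equations (X'WX) g = X'W y in closed form."""
    sw = sum(weights)
    swx = sum(w * xi for w, xi in zip(weights, x))
    swxx = sum(w * xi * xi for w, xi in zip(weights, x))
    swy = sum(w * yi for w, yi in zip(weights, y))
    swxy = sum(w * xi * yi for w, xi, yi in zip(weights, x, y))
    det = sw * swxx - swx ** 2
    g1 = (sw * swxy - swx * swy) / det
    g0 = (swy - g1 * swx) / sw
    return g0, g1

# Hypothetical per-study IBD estimates, a study-level covariate, and weights
# w_i = 1 / (tau^2 + S_i^2), with an assumed among-study variance tau^2 = 0.002
pi_hats = [0.54, 0.65, 0.50, 0.60, 0.52]
covariate = [40.0, 55.0, 35.0, 50.0, 38.0]   # e.g., mean age at onset per study
s2 = [0.0012, 0.0020, 0.0015, 0.0010, 0.0025]
weights = [1.0 / (0.002 + v) for v in s2]
g0, g1 = wls_two_param(covariate, pi_hats, weights)
print(f"gamma0 = {g0:.4f}, gamma1 = {g1:.5f}")
```

A significant γ₁ would flag the covariate as a source of among-study heterogeneity, pointing toward the subgroup analyses discussed under Heterogeneity below.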
c. Combined test of linkage

If the common effects (proportions of alleles shared IBD) are pooled for only one type of sibpair, say highly concordant affected sibpairs, then the resulting $\hat{\pi}$ and its estimated variance $S_{\hat{\pi}}^2$ can be used to construct a confidence interval at a given significance level $\alpha$. If the interval does not contain $\pi = 1/2$ (the expected value under the null hypothesis), we can reject the null hypothesis of no linkage with $100(1 - \alpha)\%$ confidence. Under the mixed model, the confidence bounds are calculated by using the asymptotic t-distribution with $df = n - k - 1$. When the number of primary studies available for pooling is too small to warrant a reasonable asymptotic approximation, a permutation procedure can be used to derive the correct significance level for the pooled measure (see Follmann and Proschan, 1999). If $\hat{\pi}$'s are available for various types of sibpairs, and there is enough evidence that the studies are homogeneous, we may use an appropriate linear combination of such pooled effects and their estimated variances to construct a pooled statistic and its confidence bounds, similar to the EDAC concept developed by Gu et al. (1996b). This enables pooling across studies with very different sibpair designs.
IV. PRACTICAL ISSUES

In the preceding sections, we reviewed the basics of two meta-analytic approaches for combining results from genetic linkage studies, namely, Fisher's elegant method for pooling P values and a more sophisticated quantitative synthesis of IBD sharing across studies. There are many issues pertinent to the successful practice of meta-analysis of genetic studies, but we have space for only a brief overview in this section. Some of these issues are open statistical problems for active research; others pose intrinsic challenges to genetic concepts and may not be solved until the underlying genetic problems are solved.
A. Publication bias

Because combined data may lead to artificially reduced variance and thus make results appear more conclusive than they are, publication bias, the "soft spot" of meta-analysis methodology, is always the easiest place to criticize. The ultimate solution is outrageously straightforward and at the same time unachievable (at least in the near future): a registry system for all genetic studies. However, contrary to popular myth, this problem is not completely uncontrollable. Just as with isolated single studies, one can make corrections for the bias if the sampling frame is well understood. So the most important thing is the assessment of possible publication bias in the preanalysis stage. Correlations with some study-specific features (sample sizes, significance levels, publication years, etc.) can be plotted to detect such biases. Rosenthal (1979) gave a conservative "file drawer" treatment to assess the correct significance level of a combined test by estimating the "fail-safe number," that is, the maximum number of unpublished nonsignificant results that would nullify the pooled result. Other analytic models are also available that incorporate the selection of studies into the correction of publication bias. For example, the weighted distribution theory of Patil and Rao (1977) can be used to directly assess the chance of bias entering the meta-analysis. These methods are still in their early development and are not foolproof. But until the reporting system is completely overhauled, careful application of such methods in a meta-analysis may be necessary if serious publication bias is detected.
B. Quality assessment of studies

Besides assessing publication bias, the preanalysis assesses the quality of each primary study and thus determines whether to include the study for pooling. There exist well-defined coding systems for assessing the quality of controlled clinical trials (Chalmers et al., 1981), but such systems for evaluating genetic studies remain to be developed. Since coding the quality of a study can sometimes be subjective, sensitivity analysis of such assessments is essential for unbiased incorporation of quality scores into the meta-analysis.
C. Heterogeneity

We mentioned in the introduction that heterogeneity among studies can be an advantage in interpreting the robustness of pooled results. But that by no means diminishes the importance of testing for heterogeneity in the preanalysis stage. On the contrary, a test of homogeneity should be mandatory before any pooling is done, simply because the test result will guide us to the analytical model that is better suited for synthesis (i.e., a fixed effects or a random effects model).
For the random effects model discussed earlier, the homogeneity test is given by

$$Q = \sum_{i=1}^{n} \frac{(\hat{\pi}_i - \bar{\pi})^2}{S_i^2},$$

where $\bar{\pi} = \left[\sum_{i=1}^{n} \hat{\pi}_i / S_i^2\right] \big/ \left[\sum_{i=1}^{n} 1/S_i^2\right]$. It asymptotically follows a chi-square distribution with $(n - 1)$ degrees of freedom if $\sigma^2(\delta) = 0$. Another type of heterogeneity is due to distinct genetic etiology of the disease, where different genetic pathways are responsible for the same disease phenotype. If such heterogeneity is associated with study populations through study-specific covariates, the mixed effects model can help to reveal it, and further subgroup analysis may lead to the resolution of the distinct genetic etiologies. However, if the aforementioned association does not exist or was not reported in the published results, the traditional meta-analytic procedure will not solve the problem. Nonetheless, pooling of results may still identify interesting regions/loci better than any single primary study, though with possibly suboptimal power.
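The homogeneity statistic Q is straightforward to compute. The sketch below uses invented study estimates, and the chi-square tail probability is obtained from a series expansion of the regularized lower incomplete gamma function rather than from a statistics library.

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function via the regularized lower incomplete gamma."""
    a, z = df / 2.0, x / 2.0
    if z <= 0:
        return 1.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= z / (a + n)
        total += term
    return max(0.0, 1.0 - total * math.exp(-z + a * math.log(z) - math.lgamma(a)))

def homogeneity_test(pi_hats, variances):
    """Cochran-style Q statistic for the study effects, with its asymptotic
    chi-square P value on len(pi_hats) - 1 degrees of freedom."""
    w = [1.0 / v for v in variances]
    pi_bar = sum(wi * p for wi, p in zip(w, pi_hats)) / sum(w)
    q = sum((p - pi_bar) ** 2 / v for p, v in zip(pi_hats, variances))
    return q, chi2_sf(q, len(pi_hats) - 1)

# Hypothetical IBD estimates and sampling variances from five studies
q, p = homogeneity_test([0.54, 0.65, 0.50, 0.60, 0.52],
                        [0.0012, 0.0020, 0.0015, 0.0010, 0.0025])
print(f"Q = {q:.3f}, P = {p:.4f}")
```

A small P value here argues against simple fixed-effect pooling and in favor of the random effects or mixed effects models of the previous section.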
D. Meta-analysis of genetic association studies

Meta-analysis of genetic association studies is further complicated by the ability of nongenetic factors to produce "spurious" associations. We proposed a model for pooling family-based association studies by using the difference of transmission probabilities as a common linkage disequilibrium effect (Gu et al., 1998a). It is similar in spirit to pooling relative risks, and the models already discussed can be modified for application. Details may be found elsewhere (Gu et al., 1998a,b).
V. DtSCllSSlON The problem of moderate effects of complex disease genes giving rise to increased rates of false signals in genome-wide scans has produced a healthy debate (Lander and Schork, 1994; Thomson, 1994; Curtis, 1996; Lander and Kruglyak, 1995; Witte et al., 1996; Todorov and Rao, 1997). The call for metaanalytic methodology for the quantitative overview of linkage results also has been voiced (Rao, 1998; Li and Rao, 1996; Allison and Heo, 1998), but perhaps it has yet to meet with the level of enthusiasm it deserves. Clearly, one of the benefits of meta-analysis is enhanced statistical power. We have undertaken a simulation study by generating a pool of primary studies, each employing the ASP or the ED method, with an underlying disease gene being additive with a frequency of 0.20 and a heritability of 0.30. The effect
268
Gu et al.
size (i.e., the expected IBD sharing of a sibpair) was assumed to follow a normal distribution N(0, σ²(S)), as a result of random factors unknown to the analyst. The simulation study showed how variability in the effect size among primary studies could have a detrimental effect on the power of individual studies, especially when the effect size in question is small. The actual power of an individual study may be much smaller than was originally believed (nominal power). Figure 18.1 shows the effect of among-study variability on the actual (averaged) power of an individual study. We see that when the among-study variability is relatively low (e.g., the ratio of among-study to within-study variability, R = 0.2), the effect on actual power due to the among-study variability is less serious for a nominal power of 95% (drops from 95 to 87.5%); even this low level of variability had a noticeable effect if the nominal power was 80% (nominal power of 80% vs actual power of 67.7%). When R = 1, an individual study with a nominal power of 80% will detect linkage with an actual power of
Figure 18.1. Actual power of individual study as a function of the ratio R of among-study to within-study variance. Three different values (70, 80, and 95%) were assumed for nominal power of individual primary studies; 100,000 replicates were simulated for calculating power at a significance level of 0.001 (reproduced by permission of Wiley-Liss, Inc., a subsidiary of John Wiley & Sons, Inc., from Gu et al., Meta-analysis methodology for combining non-parametric sibpair linkage results: Genetic homogeneity and identical markers, Genet. Epidemiol. 15, 609-626, copyright © 1998).
18. Meta-analysis for Model-free Methods
269
only 37%. On the other hand, combining 10 studies each with a nominal power of 60% would give an overall power higher than 80%, even when the variability among studies is very high (R = 5) (see Fig. 2 in Gu et al., 1998c). We also carried out an analysis of cost-effectiveness for conducting and pooling primary studies of different sizes and concluded that a large number of smaller primary studies, each with moderate nominal power, may be more cost-effective than a handful of large primary studies. However, we did not consider the extra cost of coordinating a large number of smaller studies, and we also did not consider that a larger number of studies will lead to greater among-study variability (R), hence less aggregate power. Nonetheless, the investigation served to demonstrate the power of meta-analytical methods in genetic studies, and to call for further investigation on how to balance these conflicting considerations in designing future studies.
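The qualitative behavior described above can be mimicked with a toy Monte Carlo calculation. This is a hedged illustration only: it uses a simple one-sided z-test with unit within-study standard error rather than the authors' sibpair simulation, so the numerical drops differ from those reported in Figure 18.1, but the direction of the effect is the same.

```python
import math
import random
from statistics import NormalDist

def actual_power(nominal_power, R, alpha=0.001, n_rep=50000, seed=7):
    """Average power of a one-sided z-test when the study-specific effect
    varies as N(mu, R) across studies (within-study variance fixed at 1).
    R is the ratio of among-study to within-study variance."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha)
    mu = z_alpha + nd.inv_cdf(nominal_power)  # effect giving the nominal power
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        effect = rng.gauss(mu, math.sqrt(R))  # study-specific effect size
        total += nd.cdf(effect - z_alpha)     # power of a study with this effect
    return total / n_rep

# with R = 0 the actual power equals the nominal power;
# it decays monotonically as R grows
p0 = actual_power(0.80, 0.0)
p1 = actual_power(0.80, 1.0)
```

In this simplified setting the average power also has a closed form, Φ(z_nominal/√(1 + R)), against which the Monte Carlo estimate can be checked.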
A. Guidelines for meta-analysis of genetic studies

There exist excellent guidelines for performing meta-analysis of controlled clinical trials, which can be useful for an analyst who plans to carry out a meta-analysis of genetic studies (e.g., see Cook et al., 1995). We recommend some guidelines for reporting individual study results, to make future meta-analysis easier and more efficient. However, development of more complete guidelines for meta-analysis of genetic studies requires the collective effort of many experts. Meanwhile, some simple guidelines may be followed:

A clear goal must be defined and stated before anything is done. A meta-analysis should be performed to answer a scientific/health care question. This is both ethical and essential for the scientific integrity of the pooled results.

Strenuous effort is required to obtain as much information from the original studies as possible, including the data sets if possible. Because of the observational nature of genetic studies, reanalysis of individual studies is sometimes essential, and it should be done as thoroughly as the available information permits.

Tests of heterogeneity are mandatory both before and after quantitative synthesis, especially after pooling. Possible avenues of contribution to the among-study variation must be exhausted and adjusted for before any overall conclusion is drawn.

The study goal, the inclusion/exclusion criteria, and the review method used should be clearly presented, along with detailed information about treatment of missing data and "fugitive studies."

Interpretation of results should be explicit and within context, and inferences must be made together with a clear statement of their limitations.
B. Pooled data analysis

There is no doubt that whenever the original data are available, they should be used for reanalysis to get better estimates of study effects in a meta-analysis. If the original data from the primary studies are available, as in the case of large-scale multicenter studies, pooled data sets can be constructed and statistical models with potentially important covariates fitted to the whole data set (as Rice and colleagues did with the pooled bipolar data set; see Dorr et al., 1997). Such methods also need to define a common effect, as in a meta-analysis, but can be more powerful than pooling summary statistics alone. However, a study by Olkin and Sampson (1998) showed that under certain conditions, meta-analysis of summary statistics can be as powerful as analysis of the combined data. Taking cost-effectiveness into account, meta-analysis could still be the method of choice, at least as an exploratory tool.

With an avalanche of genome-wide screens descending upon us, it becomes increasingly clear to investigators that systematic overview of such studies is necessary and that the methodology needs to be further developed to suit the characteristics of such studies. The increased attention paid to these issues at a recent Genetic Analysis Workshop is a positive sign. Many issues remain to be solved concerning both the methodology and the practice of meta-analysis when applied to genetic studies. For example, individual studies usually screen panels of (linked) markers for linkage. The correlation structure of, say, the IBD sharing by sibpairs at linked markers needs to be addressed in the meta-analytical model to obtain more accurate estimates of the pooled linkage effect.
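The Olkin-Sampson observation that summary statistics can lose nothing relative to combined data can be seen in the simplest possible setting (a toy numerical check under a common mean and equal within-study variances; the data below are illustrative):

```python
def pooled_mean_from_summaries(means, sizes):
    """Sample-size weighted mean of per-study means."""
    return sum(m * n for m, n in zip(means, sizes)) / sum(sizes)

# two hypothetical primary studies and their raw observations
study1 = [1.0, 2.0, 3.0]
study2 = [4.0, 6.0]
combined = study1 + study2
grand_mean = sum(combined) / len(combined)

# pool only the per-study summaries (mean, n)
meta_mean = pooled_mean_from_summaries(
    [sum(study1) / len(study1), sum(study2) / len(study2)],
    [len(study1), len(study2)],
)
# meta_mean equals grand_mean: summarizing loses no information on the mean
```

The equivalence is exact here; in more realistic settings with unequal variances or covariates it holds only under the conditions Olkin and Sampson describe.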
Moreover, ascertainment correction of the likelihood function can be nontrivial even at the level of individual studies, and can certainly become more problematic for the meta-analysis, especially when substantially distinct ascertainment schemes (e.g., pedigree-based vs selected sibpairs) are used in primary studies. We proposed using the mixed effects model as a treatment for the heterogeneity problem. But the problem is far from solved, especially when genetic heterogeneity exists but remains undetected within the original primary studies. The postanalysis process, in which the pooled result is explained and possible exploratory models are formulated by the meta-analyst, involves additional issues. For example, novel statistical procedures are needed to classify the primary studies into subgroups in the presence of heterogeneity and/or pleiotropy, and the covariance structure within and among subgroups needs to be addressed in a megamodel. This calls for development of new methodologies. We believe that as the Human Genome Project progresses, and the new era of dissecting complex traits begins, development of meta-analytical methodology and its practical use in synthesizing genome scan results will be both exciting and crucial to the successful resolution of genetic determinants of complex human diseases.
References

Allison, D. B., and Heo, M. (1998). Meta-analysis of linkage data under worst-case conditions: A demonstration using the human OB region. Genetics 148, 859-865.
Blackwelder, W. C., and Elston, R. C. (1985). A comparison of sib-pair linkage tests for disease susceptibility loci. Genet. Epidemiol. 2, 85-97.
Chalmers, T. C., Smith, H. J., Blackburn, B., Silverman, B., Schroeder, B., Reitman, D., and Ambroz, A. (1981). A method for assessing the quality of a randomized control trial. Control. Clin. Trials 2, 31-49.
Cochran, W. G. (1937). Problems arising in the analysis of a series of similar experiments. J. R. Stat. Soc. 4 (suppl.), 102-118.
Cook, D. J., Sackett, D. L., and Spitzer, W. O. (1995). Methodologic guidelines for systematic reviews of randomized control trials in health care from the Potsdam Consultation on Meta-analysis. J. Clin. Epidemiol. 48, 167-171.
Curtis, D. (1996). Letter to the editor. Nat. Genet. 12, 356-357.
DerSimonian, R., and Laird, N. (1986). Meta-analysis in clinical trials. Control. Clin. Trials 7, 177-188.
Dorr, D. A., Rice, J. P., Armstrong, C., Reich, T., and Blehar, M. (1997). A meta-analysis of chromosome 18 linkage data for bipolar illness. Genet. Epidemiol. 14, 617-622.
Fisher, R. A. (1932). "Statistical Methods for Research Workers," 4th ed. Oliver & Boyd, London.
Follmann, D. A., and Proschan, M. A. (1999). Valid inference in random effects meta-analysis. Biometrics 55, 732-737.
Glass, G. V., and Smith, M. L. (1978). Meta-analysis of research on the relationship of class size and achievement. Educ. Eval. Pol. Anal. 1, 2-16.
Good, I. J. (1955). On the weighted combination of significance tests. J. R. Stat. Soc. B 17, 264-265.
Greenland, S. (1987). Quantitative methods in the review of epidemiologic literature. Epidemiol. Rev. 9, 1-30.
Gu, C., Province, M., Li, Z., and Rao, D. C. (1996a). A meta-analysis methodology for combining non-parametric sibpair linkage results. Genet. Epidemiol. 13, 302.
Gu, C., Todorov, A. A., and Rao, D. C. (1996b). Combining extremely concordant sibpairs with extremely discordant sibpairs provides a cost-effective way to linkage analysis of QTLs. Genet. Epidemiol. 13, 513-533.
Gu, C., Province, M., and Rao, D. C. (1998a). A meta-analysis methodology for combining results of family-based genetic association studies. Genet. Epidemiol. 15, 547.
Gu, C., Province, M. A., and Rao, D. C. (1998b). Meta-analysis of family-based association studies. Manuscript.
Gu, C., Province, M., Todorov, A., and Rao, D. C. (1998c). Meta-analysis methodology for combining nonparametric sibpair linkage results: Genetic homogeneity and identical markers. Genet. Epidemiol. 15, 609-626.
Gu, C., Province, M. A., and Rao, D. C. (1999). A meta-analysis approach for pooling sibpair linkage studies with study-specific covariates: Mixed effects models. Genet. Epidemiol. 17 (suppl. 1), 599-604.
Jenicek, M. (1989). Meta-analysis in medicine: Where we are and where we want to go. J. Clin. Epidemiol. 42(1), 35-44.
Lander, E., and Kruglyak, L. (1995). Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat. Genet. 11, 241-247.
Lander, E., and Schork, N. J. (1994). Genetic dissection of complex traits. Science 265, 2037-2048.
Leder, R. O., Mansbridge, J. N., Hallmayer, J., and Hodge, S. E. (1998). Familial psoriasis and HLA-B: Unambiguous support for linkage in 97 published families. Hum. Hered. 48, 198-211.
Li, Z., and Rao, D. C. (1996). A random effect model for meta-analysis of multiple quantitative sibpair linkage studies. Genet. Epidemiol. 13, 377-383.
Morton, N. E. (1955). Sequential tests for the detection of linkage. Am. J. Hum. Genet. 7, 277-318.
Olkin, I., and Sampson, A. (1998). Comparison of meta-analysis versus analysis of variance of individual patient data. Biometrics 54(1), 317-322.
Ott, J. (1991). "Analysis of Human Genetic Linkage." Johns Hopkins Univ. Press, Baltimore, MD.
Patil, G. P., and Rao, C. R. (1977). The weighted distributions: A survey of their applications. In "Applications of Statistics" (P. R. Krishnaiah, ed.), pp. 383-405. North Holland, Amsterdam.
Rao, D. C. (1998). CAT scans, PET scans, and genomic scans. Genet. Epidemiol. 15, 1-8.
Rice, J. P. (1998). The role of meta-analysis in linkage studies of complex traits. Am. J. Med. Genet. 74, 112-114.
Robbins, H. E. (1948). The distribution of a definite quadratic form. Ann. Math. Stat. 19, 266-270.
Rosenthal, R. (1979). The "file-drawer problem" and tolerance for null results. Psychol. Bull. 86, 638-641.
Suarez, B. K., Hampe, C. L., and Van Eerdewegh, P. (1994). Problems of replicating linkage claims in psychiatry. In "Genetic Approaches to Mental Disorders" (E. S. Gershon and C. R. Cloninger, eds.), pp. 23-46. American Psychiatric Press.
Thomson, G. (1994). Identifying complex disease genes: Progress and paradigms. Nat. Genet. 8, 108-110.
Todorov, A., and Rao, D. C. (1997). Trade-off between false positives and false negatives in the linkage analysis of complex traits. Genet. Epidemiol. 14, 453-464.
Witte, J. S., Elston, R. C., and Schork, N. J. (1996). Letter to the editor. Nat. Genet. 12, 355-356.
Classification Methods for Confronting Heterogeneity

Michael A. Province¹
Division of Biostatistics
Washington University School of Medicine
St. Louis, Missouri 63110

W. D. Shannon
Divisions of Biostatistics and General Medical Sciences
Washington University School of Medicine
St. Louis, Missouri 63110

D. C. Rao
Division of Biostatistics and Departments of Psychiatry and Genetics
Washington University School of Medicine
St. Louis, Missouri 63110

I. Summary
II. Introduction
III. Coping with the Challenges
IV. Recursive Partitioning Models
V. Discussion
References
I. SUMMARY

Recursive partitioning/tree models are discussed as a method of dissecting the complex nature of traits with different causal mechanisms operating in different subsets of the data (e.g., different genes operating in different subsets of families). In addition to the straightforward application of classification and regression trees to define more homogeneous subsets of the data on which to conduct further analysis, developments incorporating linkage analysis into the definition of the regression trees (Shannon et al., 2000) are discussed. The pros and cons of recursive partitioning vs the related approach of context-dependent analysis (Turner et al., 1999) are also reviewed as two promising analysis strategies that may be useful for genetic dissection of complex traits.

¹To whom correspondence should be addressed.

274
Province et al.
II. INTRODUCTION

Complex diseases and disease-related traits such as hypertension and blood pressure involve heterogeneous genetic etiologies that do not fit simple Mendelian inheritance patterns. As a rule, such traits are determined by interactions among multiple genes and environmental attributes. Most show variability in age of onset and/or severity. Some are measured as multiple correlated phenotypes with unresolved pleiotropic effects. Some of these traits show different etiologies in different subgroups or families, and this heterogeneity can particularly bedevil attempts to detect, localize, and characterize the underlying genes. Unfortunately, many of the existing popular tools and methods for genetic analysis either cannot handle such intricacies, or handle them in the most rudimentary way. In the face of such difficulties, the options for the investigator are limited.
III. COPING WITH THE CHALLENGES

Three strategies arise for coping with the challenges just described. One strategy is to simply ignore the complexities, apply the existing simple models, and hope for the best. After all, every model has some kind of "error" or residual term, and one could argue that all the aforementioned complexities will simply manifest themselves as part of that error when we use such models. If there are any genes with large effects and relatively simple inheritance patterns, we should be able to apply our single-gene models to such cases and succeed in detecting them at least some of the time. Thus we can carry out genome-wide linkage scans, repeatedly using a monogenic linked-locus model. Each time, the polygenic residual component (or other appropriate error term, depending on the exact model) will contain all other trait loci whose effects we are not considering at that particular time. By this line of reasoning, we should be able to find the genes with large effects (if any) by using the simple traditional models. Unfortunately, such an approach can quickly lead to models that fit poorly because too much is left in the error term. For complex traits, it can easily happen that the noise quickly overwhelms the signal, so that there is very little power to resolve the effects we hope to discern. Therefore, this may be regarded as the least desirable strategy.

A second strategy is to study more homogeneous subgroups of the families, where even simple models may apply. Thus, instead of studying hyperlipidemia (high cholesterol/triglycerides) in general, we may choose to study a particular form, such as familial combined hyperlipidemia, which in turn can be divided into subtypes 2b, 4, and so on. Instead of studying the genetics of type II diabetes, we may subdivide further by focusing on early-onset type II diabetes (MODY: maturity-onset diabetes of the young). Likewise, we may subdivide hypertension into several subcategories of obesity-induced, sodium-sensitive, and so on, with the hope that each of these has a more homogeneous genetic etiology that can be captured by simple models. A related strategy is to study subphenotypes or intermediate phenotypes that are closer to the gene products than these complex traits themselves. Instead of looking for genes that directly cause overt hypertension, one may choose to study the renin-angiotensin system, which influences blood pressure levels. Strong science such as this has driven much of the progress in genetic epidemiology over the years. However, taking such a sensible approach may not completely solve the problem. What if, despite one's best efforts to study more homogeneous subgroups and more relevant subphenotypes, one discovers another, deeper layer of heterogeneity or further sub-subphenotypes underlying the ones studied? Do we simply delete the "impure" data? Study design alone may not always completely cure the problem of heterogeneity, because until the genetic dissection of a complex trait is complete, we only incompletely understand the full extent of heterogeneity.
A third strategy is to meet the challenge head on and utilize more sophisticated analysis methods by explicitly modeling the heterogeneity and complexity of the system. For instance, one may devise a multilocus model, with or without epistasis, with or without genotype-specific covariate effects, gene-environment and environment × environment interactions, variable age of onset, and/or developmental effects, and so on. Since many of the existing models are linear (or generalized linear), it is in principle easy to extend them by including additional terms to account for many, if not all, of these complex effects. For instance, in a variance components QTL linkage model (which is, after all, in the regression domain), additional linear terms can be added for environmental covariates, measured genotypes, interaction terms, and so on. These regression coefficients can be estimated simultaneously along with the primary linkage parameters (see Chapters 12 and 13 on variance components linkage methods for further details along these lines). This strategy can go a long way toward accommodating some of the sophisticated mechanisms we hope to explain, and in principle, can reach any level of complexity desired.
Practically speaking, there are inherent limitations to a complex linear model approach, particularly in the face of certain types of heterogeneity. The traditional generalized linear models are based upon a "one size fits all" philosophy. The same model is applied to the entire sample for the purpose of quantifying risk (or primary effect) in all individuals/families simultaneously, with the same degree of precision. The difficulty with this approach is best illustrated by an example. Suppose that our sample is actually a mixture of populations or families in which one gene, G0, is operating in one subset and a different gene, G1, is dominating the expression of the phenotype in another subset. Such a situation is not implausible for complex traits, and in fact may be quite common, resulting in a "many roads to Rome" phenomenon (i.e., many different ways to acquire the disease/trait). When traditional models are used, it is quite awkward to accommodate this type of heterogeneity. Even assuming that we can find the "right" way to measure and define the two subsets and have created a 0/1 dummy indicator variable T for subset group membership, to achieve this level of heterogeneity in the classical models we must add multiple interaction terms between this dummy variable and all other effects in the model. In fact, even in the simplest case of linear regression, we need to estimate six parameters to produce the needed degree of heterogeneity: one intercept, three main effects (T, G0, G1), and two interactions (T*G0, T*G1). These last two interaction terms exist simply as an artifice because the model is not designed to work differently in different subsets of the data. To force it to do so, we need to make the net effect of the G0 gene zero in the T = 1 group, by creating the interaction term T*G0 and restricting it to be exactly minus the main G0 effect, thus canceling it out (and likewise for the T*G1 term when T = 0).
Further extension of this model increases its complexity. For example, adding both AGE and SEX effects would result in a total of 24 terms, 18 of which are interactions. Only 16 of these parameters are needed to model what is happening separately in the two data subsets (eight each: intercept, AGE, SEX, Gene, AGE*SEX, AGE*Gene, SEX*Gene, AGE*SEX*Gene). The other eight are "overhead" parameters that are actually constrained so as to cancel out other terms in the model, in the appropriate subsets, to produce heterogeneity (just as in the preceding example). They arise because we are trying to fit the "round peg" of a heterogeneous sample through the "square hole" of a homogeneous model. Thus, in this framework, as heterogeneity and complexity increase, we need multiple interaction terms of higher and higher orders, with more and more constraints on the parameters, making the model overly complicated, difficult to estimate, and eventually, hopeless to interpret. Exacerbating this problem, the traditional generalized linear models are really not designed to handle interactions well in the first place. The primary focus of such models is on the main effects. In the framework of the traditional homogeneous models, interactions are typically considered rare entities,
which should be avoided unless they absolutely must be included. This is due in part to the overall philosophy that models should be kept as simple as possible. In addition, such models are not readily interpretable unless all lower order interaction terms that are subsets of the higher order terms are always included (e.g., all two-way interactions involving three-way terms). Thus, it is difficult to make an interaction model that is both simple enough and rich enough to describe the realities. Interactions may well be the rule, not the exception, and heterogeneity could very well be common, not rare, for complex traits. For many complex diseases and traits, very little progress can be made with homogeneous, main effects models. Therefore we need models and approaches that treat complexities as the main focus of an investigation, instead of relegating them to the back burner as nuisance effects to be avoided. Context-dependent analysis (e.g., Turner et al., 1999) systematically looks at all possible interactions and subgroups to find the most informative ones. A related but more structured approach to such complex modeling is the recursive partitioning method (e.g., the Classification and Regression Tree (CART™) models of Breiman et al., 1984). We will concentrate on the recursive partitioning models here, but contrast some of the pros and cons of these two strategies later in the discussion.
IV. RECURSIVE PARTITIONING MODELS

In the recursive partitioning framework, it is not assumed a priori that the data are homogeneous. Instead, the data are recursively partitioned into increasingly homogeneous subsets, with "finer" subgroups obtained each time. Ultimately, a number of homogeneous subsets are derived, for each of which a single model (but not necessarily the same model for each subset) fits well. Thus, "recursive partitioning" is a method to fit tree-based models (so called because the data can be displayed conveniently by using a rooted tree graph as defined in mathematical graph theory, so that the partitions of the data "branch out" by a series of binary splits into homogeneous subsets, much like a tree) for predicting the value of a continuous or categorical outcome from a potentially large pool of independent variables. This methodology has been developed and extensively enhanced over the last 40 years in statistics and computer science. Morgan and Sonquist (1963) introduced the idea of the automatic interaction detector, or AID, to analyze social sciences data in which interactions were also thought to be more the norm than the exception. The book Classification and Regression Trees (Breiman et al., 1984) formally developed and introduced this methodology to the statistics community. The accompanying software package, CART™, made it accessible to applied statisticians. CART™ is designed to fit classification tree models
to a single categorical outcome, and regression tree models to a single continuous outcome. Independent variables, or predictors, can be a mixture of discrete and continuous ones. In computer science, tree-based models developed along different lines, beginning with attempts to model human concept learning. They were extended to place objects into one of two classes in problems for which perfect discrimination was obtainable: that is, in cases having no uncertainty in the measurements, so that a specific covariate pattern always returned the same outcome value (Quinlan, 1986, 1993; Clark and Pregibon, 1992; Hand, 1998; Langley, 1996; Nakhaeizadeh and Taylor, 1997; Zhang and Singer, 1999).
A. Purity and splitting rules

The notation of recursive partitioning models can sometimes be intimidating to a newcomer, since there is a great deal of new terminology to learn, but the basic ideas are quite simple and attractive. A "node" is a subset of the data, and the "root node" is the entire data set. The data in a parent node are partitioned ("split") into two mutually exclusive subsets called the left and right child nodes. Each parent node is either partitioned into two (and only two) child nodes ("nonterminal node") or not further partitioned ("terminal node"). For categorical outcomes, splitting of nodes is done to improve "purity" (homogeneity), and conversely to reduce "impurity" (heterogeneity), within each node. When a parent node is split into two child nodes, an "improvement in purity of the system" occurs if each of the two subsets is more homogeneous than was the original combined set. For instance, suppose we are dealing with a simple binary response trait with a whole host of possible predictors: some measured genotypes, some environmental exposures, some predictors measured on a categorical scale, while others are continuous. We want to find the best ways of using these predictors to define subsets of the data that are the most "pure." The most "impure" nodes (subsets) would be those that contained equal numbers of diseased and nondiseased subjects, while the most "pure" would be those that were all of one kind or all of the other. This can be formalized by the use of a function known as the Gini index (Breiman et al., 1984). For any subset t, we denote by p_t(0) and p_t(1) the proportions of cases with D = 0 or D = 1 (i.e., nondiseased and diseased), respectively. For a node t, the Gini index of impurity is
i(t) = 2p_t(0)p_t(1),

which attains its maximum of 0.5 when p_t(0) = p_t(1) = 0.5, and its minimum of zero when either p_t(0) = 1 or p_t(1) = 1.

To select the best split of a node, we score every possible split that can be made and select the one with the best score (e.g., greatest reduction in impurity). To formalize this, let S be the set of all possible splits of a node and s be a particular split defined by a specified covariate X_i (whether continuous or categorical) and cut point c_j, where each split s induces a binary partition of the cases into the left child node, denoted t_L (where X_i ≤ c_j), and the right child node, denoted t_R (where X_i > c_j). If p(t_L) and p(t_R) are the proportions of cases in t that fall into t_L and t_R, respectively, then we can score each split s of a node t by

φ(s, t) = i(t) − p(t_L)i(t_L) − p(t_R)i(t_R),

which is the reduction in impurity of node t due to split s. We then select the split that maximizes φ(s, t). Note that φ(s, t) is at its maximum [i.e., φ(s, t) = i(t)] when a split at a node is found that perfectly separates the two classes, since this causes i(t_L) = i(t_R) = 0. To score each possible split, the recursive partitioning algorithm loops through each covariate and every cut point in the data set. The set of allowable cut points falls at the midpoints between neighboring, observed values of the covariate. For our disease state example (Figure 19.1), the best single split may be on a measured genotype separating the AA genotypes from all others,
[Figure 19.1: tree with root split AA vs aa/Aa; the aa/Aa branch splits at BMI ≤ 40 (P(1) = 0.2, I = 0.32) vs BMI > 40 (P(1) = 0.75, I = 0.375).]

Figure 19.1. Example of a recursively partitioned tree model for a disease state. The effect of the AA genotype is to increase risk, while genotypes aA and aa operate differently depending upon whether the BMI of the subject is low (decreased risk) or high (moderate risk).
which defines a node containing 90% diseased subjects. This node has an impurity of only I = 2(0.9)(0.1) = 0.18. The second split may be on a covariate, say body mass index (BMI), with the optimal cutoff determined to be 40 kg/m², which further subdivides the aa/Aa genotype subgroup into more obese subjects vs leaner subjects. This gives conditional probabilities of disease of 0.75 and 0.2, and impurities of 0.375 and 0.32, for the obese and lean groups, respectively. Depending upon the predefined threshold of minimal impurity, one may or may not continue to "grow" the tree, using additional covariates/genotypes to define even more pure subgroups. If high purity is imperative, many sparse subgroups would eventually be defined. In such a case, there is a risk of creating a model that is overdetermined and would never reproduce in a different data set. An obvious way to prevent this is to set a higher threshold of acceptable impurity and avoid growing such detailed and overdetermined trees. But through extensive work with this technology (Breiman et al., 1984), it has been demonstrated that it is actually better to "overgrow" the tree and then "prune back" (i.e., to delete post hoc some nodes and recombine others with similar posterior probabilities of disease) than to define early termination rules. This is analogous to the difference between a true stepwise (forward/backward) regression strategy and a forward-only regression strategy. Using the "overgrow/prune" technique, one is more likely to get reproducible trees that are fitting signal rather than noise. In fact, standard methods for pruning trees are often based on the concepts of cost vs complexity, which balance the size of the tree (complexity) with its accuracy (cost).
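The splitting machinery described above (Gini impurity, split scoring, exhaustive search over midpoint cut points) can be sketched in a few lines of Python. This is an illustrative toy, not the CART™ software:

```python
def gini(p1):
    """Gini impurity i(t) = 2 * p_t(0) * p_t(1) for a binary outcome."""
    return 2.0 * (1.0 - p1) * p1

def split_score(labels, x, cut):
    """phi(s, t) = i(t) - p(tL) i(tL) - p(tR) i(tR) for the split x <= cut."""
    left = [d for d, xi in zip(labels, x) if xi <= cut]
    right = [d for d, xi in zip(labels, x) if xi > cut]
    n = len(labels)
    i_t = gini(sum(labels) / n)
    i_l = gini(sum(left) / len(left)) if left else 0.0
    i_r = gini(sum(right) / len(right)) if right else 0.0
    return i_t - (len(left) / n) * i_l - (len(right) / n) * i_r

def best_split(labels, x):
    """Score cut points at midpoints between neighboring observed values."""
    xs = sorted(set(x))
    cuts = [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    return max(cuts, key=lambda c: split_score(labels, x, c))
```

For a node that is 90% diseased, gini(0.9) returns 2(0.9)(0.1) = 0.18, matching the worked example; growing a full tree would apply best_split recursively to each child node.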
An example of an actual classification tree fit to real data is shown in Figure 19.2, in which we show the effects of the adducin gene on hypertension in Caucasians, and its interactions with BMI, age, triglyceride levels, and urine potassium. As can be seen, this method allows us to define highly complex interactions compactly and efficiently, producing subgroups with a sharp gradient of risk. In fact, all terminal nodes but one (BMI < 25.8) involve multiply interacting covariates, a result very different from what would be produced by the more traditional main effects homogeneous sample models. When the outcome is a continuous variable instead of a categorical one, we are in the domain of "regression trees" instead of "classification trees." The exact same strategy is used, with the Gini impurity index replaced by a measure of the within-node sum of squares. With the wide availability of software (CART™; Breiman et al., 1984), these recursive partitioning models are beginning to be used in earnest for the analysis of complex traits. They are especially useful for dealing with measured genotypes in the presence of many covariates (e.g., Fann et al., 1999; Province et al., 2000), as well as both a precursor and follow-up to linkage analysis in order to cluster the data into more homogeneous subgroups (Merette et al., 1999; Wilcox et al., 1999).
19. Classification Methods for Heterogeneity
[Figure 19.2 appears here: a classification tree whose first split is BMI < 25.8 vs. BMI ≥ 25.8; terminal nodes carry hypertension probabilities including 0.20, 0.43, and 0.76.]
Figure 19.2. Recursively partitioned tree model for hypertension in Caucasian subjects from the HyperGEN study. The α-adducin gene operates in a complex interaction with BMI, age, and triglyceride levels (from M. Province et al., “Association between the α-adducin gene and hypertension in the HyperGEN study,” Am. J. Hypertension. Copyright © 2000. Reprinted by permission of Wiley-Liss, Inc., a subsidiary of John Wiley & Sons, Inc.).
6. Linkage tree models
Until recently, it was not possible to perform linkage analyses within a tree model. Recursive partitioning tree-based models have been extended to identify homogeneous subsets of the data, thus increasing the power and ability to detect linkage in the face of heterogeneity (Shannon et al., 2001). To add a formal linkage component to a tree model, one must operate on relative pairs, relative sets, or even entire pedigrees as sampling units, since this is where the linkage information lies. Shannon et al. (2001) have begun this process by extending the Haseman-Elston (HE) regression model on independent sibpairs (Haseman and Elston, 1972), under the assumption that some genes may be linked to the phenotype in some subsets of the data, while other genes may be linked in other subsets. Let yᵢ = (yᵢ₁ − yᵢ₂)² be the squared difference between quantitative trait measurements (the Haseman-Elston “response” variable), and let Mᵢ be the IBD sharing at marker M for the ith sibpair, i = 1, …, n. In the model yᵢ = β₀ + β₁Mᵢ + εᵢ, the coefficient β₁ indicates the strength of the linkage of M to the quantitative trait locus, and the test of β₁ = 0 is a test for linkage. We have implemented a recursive partitioning algorithm to identify subsets of sibpairs over which this standard HE linkage model applies. The subsets are defined by applying splitting/pruning rules to the values of a pool of other (non-HE-modeled)
covariate values on one or a combination of measures on both members of a sibpair. In other words, we split the data by a measured covariate if that covariate defines two subsets of the data in which the HE linkage to the marker operates differently. With this strategy, it is possible to identify multiple QTLs, each found through linkage to a different marker in a different subset of sibpairs defined by combinations of measured covariate splits. It is tempting to use the existing standard regression tree fitting methods to model the response of squared differences yᵢ = (yᵢ₁ − yᵢ₂)² for sibpairs. However, simply splitting sibpairs on the basis of their phenotypic resemblance alone, regardless of their corresponding IBD status, will not give the desired outcome. What is needed is to split sibpairs on the basis of the strength of the linkage evidence, as characterized by the relationship between their IBD status and their phenotypic resemblance. Thus we want to separate the subsets with a high IBD-to-phenotype relationship from those whose phenotypic agreement does not correspond to the level of IBD. External covariates that make the HE linkage regression parameter estimate strongly negative are preferable to those in which it is zero or positive. To evaluate our algorithms, we conducted a Monte Carlo simulation in which we generated a data set with 10 covariates (X₁, …, X₁₀) and 9 markers (M₁, …, M₉). Let X₄, X₅ be two binary variables describing the sibpair (e.g., X₄ might be race, and X₅ might indicate that both sibs are smokers). Suppose marker M₂ is linked to a QTL in those sibpairs with X₄ = 0 and X₅ = 1, while marker M₃ is linked to a different QTL in the sibpairs defined by X₄ = 1 and X₅ = 0; finally, sibpairs with X₄ = X₅ = 0 and X₄ = X₅ = 1 are linked to no gene. The model now becomes

yᵢ* = β₀ + β₁M₂,  if X₄ = 0, X₅ = 1
yᵢ* = β₂ + β₃M₃,  if X₄ = 1, X₅ = 0
yᵢ* = β₄,         otherwise.
Recursive partitioning provides a natural framework for this type of model. Good methods should partition the sibpairs into four nonoverlapping subsets (terminal nodes) defined by the patterns of X₄ and X₅. To split a node, we calculate the model sum of squares (SS) for y = β₀ + β₁Mᵢ + β₂Xⱼ + β₃MᵢXⱼ over all possible combinations of markers Mᵢ and covariates Xⱼ. To fit this model, we use all sibpairs in the node. The covariate in the model with the largest model SS is selected as the splitting covariate. When the covariate is binary, the sibpairs are partitioned according to their values for that covariate. The algorithm is recursively applied to each child node obtained from this split. To test whether the correct subsets can be identified using this recursive partitioning strategy, we simulated data according to the model y*. The
Table 19.1. Generating Linkage Model to Evaluate Haseman-Elston Linkage Tree Algorithms

Groupᵃ   X₄   X₅   Proportion   Generating linkage model
  1       0    0      0.11      yᵢ* = 1.9
  2       0    1      0.22      yᵢ* = 1.9 − 0.75M₂
  3       1    0      0.22      yᵢ* = 1.9 − 0.75M₃
  4       1    1      0.45      yᵢ* = 1.9

ᵃNo linkage to any marker in groups 1 and 4.
group proportions and genetic model used to generate the yᵢ* are shown in Table 19.1. In our simulations, all markers were assumed to be fully informative. All but M₂ and M₃ are unlinked to any QTL. All other covariates (other than X₄ and X₅) are independent of the linkage model and were included, as were the other markers, as noise. Two sets of simulations were performed with 100 replications each. In one set, a random sample of N = 1000 sibpairs was simulated per replication. In the second set, N = 3000. To measure the signal in the simulated data, standard HE regression models were fit. A recursive partitioning of each replication was performed to see how often the correct model (i.e., partition on X₄ and X₅) was found. The results are shown in Table 19.2. The “Homogeneous sample model” results are the percentage of times HE regression, when performed on the entire sample, correctly identified marker M₂ or M₃ as significantly linked in the 100 replications. The “Recursive partitioning” results are the percentage of times the correct tree and the correct model were identified in the 100 replications. A correct tree is defined as a partition into the four groups X₄ = 0 and X₅ = 1, X₄ = 1 and X₅ = 0, X₄ = X₅ = 0, and X₄ = X₅ = 1. Thus it appears that the recursive partitioning method is much more successful in detecting QTLs than an analysis that assumes homogeneity of the
Table 19.2. Monte Carlo Simulation Resultsᵃ

                               Identification of correct model (%)
Method                         N = 1000 sibpairs    N = 3000 sibpairs
Homogeneous sample model               8                   29
Recursive partitioning                33                   78

ᵃEmpirical power based upon 100 replications/condition.
overall sample. These simulations suggest that a recursive partitioning strategy might be successfully implemented to identify homogeneous subsets of sibpairs for detecting QTLs more easily and more often. By partitioning sibpairs we can increase the power to detect linkage by allowing local fitting of the HE regression models, and identify distinct subpopulations where different genetic mechanisms might be operating.
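The gain from local fitting can be illustrated with a small simulation in the spirit of the one above. This is a sketch only: the sample size, noise level, and subgroup proportions are our own illustrative choices, not those of Table 19.1.

```python
import random

# Sib-pairs in which marker M2 is linked to the trait only in the
# subgroup X4 = 0, X5 = 1 (generating slope -0.75, as in the text).
random.seed(0)
N = 3000
pairs = []
for _ in range(N):
    x4, x5 = random.randint(0, 1), random.randint(0, 1)
    m2 = random.choices([0, 1, 2], weights=[1, 2, 1])[0]  # IBD count at M2
    linked = (x4 == 0 and x5 == 1)
    y = 1.9 - (0.75 * m2 if linked else 0.0) + random.gauss(0, 0.5)
    pairs.append((x4, x5, m2, y))

def he_slope(data):
    """OLS slope of the squared trait difference y on the IBD count m
    (the Haseman-Elston regression coefficient beta_1)."""
    n = len(data)
    mbar = sum(m for _, _, m, _ in data) / n
    ybar = sum(y for _, _, _, y in data) / n
    num = sum((m - mbar) * (y - ybar) for _, _, m, y in data)
    return num / sum((m - mbar) ** 2 for _, _, m, _ in data)

overall = he_slope(pairs)                                         # diluted
subset = he_slope([p for p in pairs if p[0] == 0 and p[1] == 1])  # near -0.75
```

The slope fit within the correct subgroup recovers nearly the full generating coefficient, while the whole-sample slope is attenuated by the unlinked subgroups, mirroring the power contrast in Table 19.2.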
V. DISCUSSION
Broadly speaking, heterogeneity is of three types: genetic heterogeneity (the same phenotype is caused by any of several genes), phenotypic heterogeneity (the same genotype causes multiple forms of the phenotype, often called pleiotropy), and population heterogeneity (the population is actually a mixture of different subpopulations with different marginal prevalences of the disease/traits, genotypes, and/or causal etiologies). Population heterogeneity can be a particular challenge when it comes to gene finding precisely because different causal mechanisms are operating in different subsets of the data. This situation does not lend itself well to the traditional “one size fits all” homogeneity models, which require higher and higher order interaction terms to deal with such complexities as they attempt to fit a single increasingly complex model to the entire data set. In the face of such population heterogeneity, an approach such as recursive partitioning, which is specifically designed to look for and define the homogeneous subsets (each of which has a simple but perhaps different model operating), can be a much more fruitful avenue. We have focused on recursive partitioning tree methods as one approach to simultaneously define homogeneous subgroups and detect major interactions among the effects in the data. A related method is “context-dependent analysis” (e.g., Turner et al., 1999). Intuitively, context-dependent analyses also hypothesize that interactions are common rather than rare, and that there may be many homogeneous subsamples within a given sample. The context-dependent analysis, however, is not “tree” structured. As such, the relation of context-dependent methods to the recursive partitioning models is similar to that of “all possible subsets regression” to “stepwise regression.” When using context-dependent methods, one must evaluate all possible subsets of the data to screen for heterogeneity.
This lack of formal structure is both a strength and a weakness. It is a strength that many more subsets can be evaluated than are considered by recursive partitioning: if no single covariate produces a marginally better split than the whole data, the recursive partitioning stops right away, whereas one might find good subsets by considering covariates two or three at a time instead of just one at a time. This effect is well known in the area of regression modeling, where “all
possible subsets regression” can sometimes find good models that are never found by the stepwise procedures. But the lack of formal structure is also a weakness in that there are many more subsets to be evaluated in the context-dependent paradigm than with recursive partitioning. Thus, one must be careful to set up screening procedures that adjust for the multiple comparisons (such as using a Bonferroni adjustment), or in some other way balance the number of “false positive” subgroups that are defined against the “false negative” ones that might be missed. To analyze a data set in real life, trees would be fit independently to each putative marker, and the within-subgroup linkage analysis would be performed on this same marker. Conceptually, however, multiple markers could be included in the splitting rule to allow additive and nonadditive genetic effects to be modeled. More complicated models of multiple gene interactions could conceivably be developed by using a stepwise strategy analogous to stepwise regression methods (i.e., fit a tree to each marker independently, select the marker exhibiting the strongest linkage evidence, and then fit a tree independently to each combination of that marker with each of the other markers, etc.). Work along these lines is in progress, and further research is needed to extend these methods for application to genome-wide scans. But it should be clear that methods such as recursive partitioning models, which define, model, and exploit the heterogeneities inherent in complex traits, will play a big part in the dissection of their exact causative and genetic nature in the new millennium.
Acknowledgments This work was partly supported by grants from the National Heart, Lung, and Blood Institute (HL 54473) and the National Institute of General Medical Sciences (GM 28719).
References
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). “Classification and Regression Trees.” Chapman & Hall, New York.
Clark, L., and Pregibon, D. (1992). Tree-based models. In “Statistical Models in S” (J. Chambers and T. Hastie, eds.). Wadsworth and Brooks, Pacific Grove, CA.
Fann, C. S. J., Shugart, Y. Y., Lachman, H., Collins, A., and Chang, C. J. (1999). The effect of redefining affection status of alcohol dependence on affected sib-pair analysis. Genet. Epidemiol. 17, S151–S156.
Hand, D. (1998). “Construction and Assessment of Classification Rules.” Wiley, New York.
Haseman, J. K., and Elston, R. C. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2, 3–19.
Langley, P. (1996). “Elements of Machine Learning.” Morgan Kaufmann, San Francisco.
Merette, C., Gayer, M., Rouilard, E., Roy-Gagnon, M.-H., Guibord, P., Kovac, I., Ghazzali, N., Szatmari, P., Roy, M.-A., Maziade, M., and Palmour, R. (1999). Evidence of linkage in subtypes of alcoholism. Genet. Epidemiol. 17, S253–S258.
Morgan, J., and Sonquist, J. (1963). Problems in the analysis of survey data, and a proposal. J. Am. Stat. Assoc. 58, 415–434.
Nakhaeizadeh, G., and Taylor, C., eds. (1997). “Machine Learning and Statistics: The Interface.” Wiley-Interscience, New York.
Province, M. A., Arnett, D. K., Hunt, S. C., Leindecker-Foster, K., Eckfeldt, J. H., Oberman, A., Ellison, R. C., Heiss, G., Mockrin, S. C., and Williams, R. R. (2000). Association between the α-adducin gene and hypertension in the HyperGEN study. Am. J. Hypertension, in press.
Quinlan, J. (1986). Induction of decision trees. Machine Learning 1, 81–106.
Quinlan, J. (1993). “C4.5: Programs for Machine Learning.” Morgan Kaufmann, San Mateo, CA.
Shannon, W. D., Province, M. A., and Rao, D. C. (2001). Tree-based recursive partitioning methods for subdividing sibpairs into relatively more homogeneous subgroups. Genet. Epidemiol. Submitted.
Turner, S. T., Boerwinkle, E., and Sing, C. F. (1999). Context-dependent associations of the ACE I/D polymorphism with blood pressure. Hypertension 34, 773–778.
Wilcox, M. A., Smoller, J. M., Lunetta, K. L., and Neuberg, D. (1999). Using recursive partitioning for exploration and follow-up of linkage and association analyses. Genet. Epidemiol. 17, S391–S396.
Ye, J. (1998). On measuring and correcting the effects of data mining and model selection. J. Am. Stat. Assoc. 93, 120–131.
Zhang, H., and Singer, B. (1999). “Recursive Partitioning in the Health Sciences.” Springer-Verlag, New York.
Applications of Neural Networks for Gene Finding

Andrea Sherriff
University of Bristol
Bristol, United Kingdom BS8 1TH
Jurg Ott¹
Laboratory of Statistical Genetics
Rockefeller University
New York, New York 10021
I. Summary
II. Introduction
III. Description of Artificial Neural Networks
IV. Interaction among Genes Underlying Complex Traits
V. Applications of Neural Networks to Genetic Data
VI. Discussion
References
I. SUMMARY
A basic description of artificial neural networks is given and applications of neural nets to problems in human gene mapping are discussed. Specifically, three data types are considered: (1) affected sibpair data for nonparametric linkage analysis, (2) case-control data for disequilibrium analysis based on genetic markers, and (3) family data with trait and marker phenotypes and possibly environmental effects.
¹To whom correspondence should be addressed.
Advances in Genetics, Vol. 42
Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.
0065-2660/01 $35.00
II. INTRODUCTION
Artificial neural networks (ANNs) were originally developed as models for the intricate interactions among neurons in the brain. Nowadays, however, they are rather seen as mathematical or statistical devices. Section III, which provides a short description of ANNs and lists some of their statistical properties, also explains why ANNs may be seen as particularly suitable devices for the analysis of complex trait data. Section V discusses applications of ANNs to genetic data, with emphasis on the genetic applications tried in our laboratory. Much of our exploration into the use of ANNs for genetic data is still ongoing, and our results should be considered tentative. Unfortunately, much of the neural network terminology is very different from that used in statistics, even when reference is made to the same thing. For example, estimating the coefficients (weights) associated with input variables is referred to as training a neural network.
Briefly, a typical neural network functions as follows. It consists of various layers of nodes (“neurons”), for example, an input layer, a hidden layer, and an output layer (see Figure 20.1). Often the output layer consists of a single node. A given input node is connected with each node in the hidden layer, and a given hidden node is connected with each output node. Each such connection is associated with a particular “weight” or coefficient, designated v and w in Figure 20.1. One
[Figure 20.1 appears here: an input layer (x₁, …, x_d), a hidden layer (nodes 1, …, m), and an output layer (y₁, …, y_c), with weights vⱼᵢ on input-to-hidden connections and wₖⱼ on hidden-to-output connections.]

Figure 20.1. Schematic representation of a neural network.
of the tasks of the ANN is to change these weights (estimate the coefficients) according to rules outlined here (this task is also called “learning”; or one says that the network is being “trained”). Assume that the ith input node receives a certain input value xᵢ. This value is then transmitted to each hidden node. While an input value is sent to a hidden node, it is multiplied (adjusted) by the particular weight associated with this path. For example, the jth hidden node will receive from the ith input node the value vⱼᵢxᵢ. A given hidden node now carries out a rather simple mathematical operation. It sums up all adjusted values received from the input nodes and subjects this sum to a transformation, g(·), usually the logistic transformation, so that the result will be between 0 and 1. A given hidden node then transmits this result to each output node, where again each transmitted value is multiplied by the weight associated with the given path. A given output node repeats what a hidden node was doing; that is, it sums up all received modified values and subjects them to the same (or a different) transformation. That is, for a given set of weights, a set of input values (one at each input node) will lead to a certain value at each output node. Thus, an ANN transforms a set of d input values to a set of c output values. In fact, an ANN may be viewed as a universal function approximator from d-space to c-space (d input nodes, c output nodes). Without any hidden nodes and a single output unit, with g(·) being the logistic transform, the operation of an ANN may be shown to be equivalent to linear logistic regression (Bishop, 1995). With multiple hidden nodes, it will carry out nonlinear logistic regression, taking into account higher order functions among input values. A set of input values is sometimes referred to as a pattern, and neural networks are also said to be able to carry out pattern recognition tasks.
In addition to the nodes shown in Figure 20.1, the input and hidden layers usually have an additional node each, called a bias node. These transmit constant values (x₀ from the input layer, z₀ from the hidden layer) and correspond to the constant in linear logistic regression. Thus, for given sets of weights vⱼᵢ (from ith input node to jth hidden node) and wₖⱼ (from jth hidden node to kth output node), the value received by the kth output node is given by

yₖ = g( Σⱼ wₖⱼ · g( Σᵢ vⱼᵢxᵢ ) ),  j = 0, …, m; i = 0, …, d   (20.1)
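A minimal forward pass of this form, with logistic g and the bias nodes x₀ = z₀ = 1 prepended to the input and hidden layers (all function and variable names here are ours), might look like:

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, v, w):
    """Eq. (20.1): y_k = g(sum_j w_kj * g(sum_i v_ji * x_i)).
    v[j] holds (v_j0, ..., v_jd); w[k] holds (w_k0, ..., w_km);
    index 0 of each weight vector multiplies the bias node."""
    x = [1.0] + list(x)                                      # bias x0 = 1
    z = [1.0] + [logistic(sum(vji * xi for vji, xi in zip(vj, x)))
                 for vj in v]                                # bias z0 = 1
    return [logistic(sum(wkj * zj for wkj, zj in zip(wk, z))) for wk in w]
```

With all weights zero, every node passes g(0) = 0.5, a convenient sanity check.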
(Bishop, 1995). Let's call these values computed output values, in contrast to target output values to be defined shortly. In many applications, two types of data are fed to an ANN, and this is the case to be discussed here. Correspondingly, there may be a single output node. The two data types may refer to case and control data: that is, observations on case and on control individuals, where on each individual a multiplicity of observations (as many as there are input nodes) is available. Assume that case data are labeled 1 and control data are labeled 0. Input values are presented to the ANN one individual at a time, with all observations for that individual fed to the input nodes. For example, observations may be single nucleotide polymorphism (SNP) genotypes at different marker loci, where each genotype (1/1, 1/2, 2/2) is coded as (0, 1, 2) corresponding to the number of 2 alleles in the genotype. Each time a set of input values is presented to the network, it will come up with some computed output value. Also, the network will be “told” which data type (label 1 or 0) is being input. These labels serve as target output values. The task of the network is now to modify the weights in such a manner that computed output values will eventually be as close as possible to target output values. In other words, the weights are estimated by a minimization of the sum of squared differences between target and computed output values. This minimization is usually carried out iteratively by a simple downhill gradient procedure called backpropagation, although a number of more sophisticated but computationally expensive methods exist (Bishop, 1995; Ripley, 1996), including simulation-based approaches that locate global minima of the error surface. An ANN may be seen as a set of simple computing devices working in a highly parallel manner. In this way an ANN can perform very complicated tasks. ANNs are usually emulated on a computer. The computer program SNNS (Stuttgart Neural Network Simulator) is available at no charge from http://www.informatik.uni-stuttgart.de/ipvr/bv/projekte/snns/snns.html. The connection between neural networks and complex traits may be seen as follows. A heritable complex trait typically shows two phenotypes, affected and unaffected, but these are not inherited in a simple Mendelian fashion.
Rather, they are thought to be due to multiple underlying and presumably interacting susceptibility loci. Disease gene mapping may thus be viewed as a mapping from the set of markers to the set of phenotypes. The problem is to identify sets of marker loci that are each close to a disease locus, as discussed in more detail later. Most of the current linkage or disequilibrium analysis procedures in human genetics work with one marker or genetic location at a time; that is, they do not take interactions among putative disease loci into account. New methods need to be investigated that go beyond this intrinsically Mendelian approach (Hoh and Ott, 2000), and neural networks comprise one such method. Various network architectures exist. For example, there may be different numbers of hidden units in the hidden layer, or an ANN may have multiple hidden layers. Discussion of network architectures is beyond the scope of this chapter (e.g., see Ripley, 1995, 1996). Here, only a single hidden layer is considered. Then, the number of parameters (weights) to be estimated is given by (d + 1)m + (m + 1)c = m(d + c + 1) + c. If the number of observations is
not much larger than the number of parameters estimated, the danger of overfitting exists. In neural network language, the ANN learns the data rather than the structure underlying them. Generally it is not easy to see when an ANN is overdetermined. For this reason, it is conventional to randomly split the data in two, using one portion to train the network and the other portion to validate the network (Ripley, 1995). This protects against overfitting the data and promotes generalizability of the network. Often, ANNs are used for hypothesis generation rather than for analytical work. In our work, however, we are trying to apply ANNs as statistical tools in their own right (Lucek et al., 2000).
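The parameter count and the random train/validate split can be sketched as follows (the network and sample sizes are illustrative choices of ours):

```python
import random

def n_weights(d, m, c):
    """Weights in a one-hidden-layer network with bias nodes:
    (d + 1)m input-to-hidden plus (m + 1)c hidden-to-output."""
    return (d + 1) * m + (m + 1) * c

# The identity quoted in the text: (d+1)m + (m+1)c = m(d + c + 1) + c.
assert n_weights(50, 10, 1) == 10 * (50 + 1 + 1) + 1

# Random half-split of observation indices into training and validation sets.
random.seed(1)
idx = list(range(200))
random.shuffle(idx)
train, valid = idx[:100], idx[100:]
```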
IV. INTERACTION AMONG GENES UNDERLYING COMPLEX TRAITS
The simplest case of multilocus (oligogenic) inheritance is the one involving two underlying genes (see Chapter 14.1 in Ott, 1999). An interesting theoretical example is shown in Table 20.1. Locus 1 is an assumed trait gene with a recessive mode of inheritance if genotypes at locus 2 are bb or Bb, and incomplete penetrance only for heterozygotes if the genotype is BB at locus 2. Thus, locus 2 may be viewed as a modifier locus because it modifies the mode of inheritance of locus 1. With the given allele frequencies of P(A) = 0.20 and P(B) = 0.80, the trait has a population prevalence of 0.04. Also, the marginal penetrances for locus 2 are the same for each of its genotypes. That is, this locus by itself is expected to be “invisible” in samples of families from the general population (with constant penetrances, phenotypes are all equivalent to “unknown”). However, through preferential ascertainment of affected individuals, the (conditional) penetrances change, which is seen as follows.
Table 20.1. Assumed Penetrances for a Two-Locus Trait Inheritance Modelᵃ

            Locus 2
Locus 1     bb      Bb      BB
aa          0       0       0
Aa          0       0       0.125
AA          1       1       0
Marginal    0.04    0.04    0.04

ᵃThis model, with population allele frequencies of P(A) = 0.20 and P(B) = 0.80, leads to a trait prevalence of 0.04 and marginal penetrances of 0.04 for all genotypes at locus 2.
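The claims in the table footnote are easy to verify numerically under Hardy-Weinberg proportions (a sketch; the variable names are ours):

```python
pA, pB = 0.20, 0.80

def hw(p):
    """Genotype frequencies keyed by allele count (0, 1, 2) under HWE."""
    q = 1.0 - p
    return {0: q * q, 1: 2 * p * q, 2: p * p}

f1, f2 = hw(pA), hw(pB)          # locus 1 (copies of A), locus 2 (copies of B)
# Penetrances from Table 20.1: pen[copies of A][copies of B]
pen = {0: {0: 0.0, 1: 0.0, 2: 0.0},
       1: {0: 0.0, 1: 0.0, 2: 0.125},
       2: {0: 1.0, 1: 1.0, 2: 0.0}}

prevalence = sum(f1[g1] * f2[g2] * pen[g1][g2] for g1 in f1 for g2 in f2)
marginal = {g2: sum(f1[g1] * pen[g1][g2] for g1 in f1) for g2 in f2}
```

Both the prevalence and every locus-2 marginal penetrance come out to 0.04, confirming that locus 2 is marginally “invisible.”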
Table 20.2. IBD Allele Sharing Probabilities per Parent for the Inheritance Model in Table 20.1

                 Joint probabilities            Conditional probabilities
                      Locus 2                          Locus 2
Locus 1      IBD = 0   IBD = 1   Sum          IBD = 0   IBD = 1   Sum
IBD = 0      0.1275    0.1341    0.2616       0.49      0.51      1
IBD = 1      0.3072    0.4312    0.7384       0.42      0.58      1
Sum          0.4347    0.5653    1
Given that a family has two affected offspring, the IBD allele sharing probabilities predicted by this inheritance model are as shown in Table 20.2. They were computed with the IBD program developed by Harald Göring (Ott, 1999). The three right-hand columns of Table 20.2 demonstrate the interaction between the two loci at the IBD sharing level (the corresponding correlation coefficient is 0.06). Given no IBD sharing at locus 1, the IBD sharing probability at locus 2 is 0.51, while it is 0.58 in the presence of IBD sharing at locus 1. The marginal IBD sharing probability for locus 2 is roughly 0.56 (left-hand side of Table 20.2), distinctly exceeding the null value of 0.50. On the other hand, an IBD sharing of 0.56 is not very high and requires relatively large sample sizes for a significant detection. In contrast, locus 1 is easily detected, with a marginal IBD sharing probability of 0.74, well above the null value of 0.50.
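The conditional sharing probabilities in Table 20.2 follow from the joint probabilities by dividing through by the locus-1 row sums, e.g.:

```python
# Joint per-parent sharing probabilities from Table 20.2, keyed by
# (IBD at locus 1, IBD at locus 2).
joint = {(0, 0): 0.1275, (0, 1): 0.1341,
         (1, 0): 0.3072, (1, 1): 0.4312}

p_locus1 = {s1: joint[(s1, 0)] + joint[(s1, 1)] for s1 in (0, 1)}
cond = {(s1, s2): joint[(s1, s2)] / p_locus1[s1] for (s1, s2) in joint}

marginal_locus2 = joint[(0, 1)] + joint[(1, 1)]   # roughly 0.56
```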
V. APPLICATIONS OF NEURAL NETWORKS TO GENETIC DATA
This section describes our work on the application of ANNs to genetic data. We have learned quite a bit since starting out on this path, but much still needs to be explored. Therefore, the material in this section should be considered preliminary. Three data types have been considered in our work: affected sibpairs (linkage analysis), case-control data (disequilibrium analysis), and extended families (linkage analysis). In most of these applications, input nodes correspond to genotypes or alleles at genetic marker loci. Concomitant variables such as environmental effects are easy to accommodate through additional input nodes. ANNs are generally used as a “black box”; that is, a network is trained on some training set of data and then the trained network is applied to another set of data for prediction and/or hypothesis generation. Here, however, we are using ANNs in a very different manner. We work with all our data at once and
train the ANN as best we can. Then, we analyze the estimated weights (coefficients) to come up with suitable genetic interpretations. Details are provided in the next section.
A. Affected sibpair analysis
Affected sibpair data have a very simple structure. Each parent does (IBD sharing, x = 1) or does not (no IBD sharing, x = 0) pass the same allele to the two affected offspring. The data matrix may then be represented by {xᵢⱼ ∈ (0, 1)}, corresponding to the ith parent and the jth marker locus. These data are hypothesized to contain effects of disease genes in the vicinity of various markers in the genome, but the majority of markers presumably show no deviation from random Mendelian inheritance. Thus, the data contain “signal” and “noise” (Lucek et al., 1998) but are otherwise homogeneous and not directly applicable to neural network analysis. A solution was found by creating control data on the computer. That is, marker genotypes are simulated for sibpair data according to the Mendelian rules. These contain only “noise.” So, an ANN is asked to discriminate between observed and generated sibpair data. As a rule of thumb, found by trial and error, the latter data set was chosen to be nine times larger than the former (Lucek et al., 1998). In principle, a single output node would be sufficient to discriminate between the two classes of input data (i.e., observed and generated sibpair data). However, it turned out to be beneficial to use two output nodes, O₁ and O₂, where O₁ was expected to indicate “signal,” with O₂ absorbing “noise.” Correspondingly, target output values are set to (1, 1) for observed sibpair data and to (0, 1) for generated sibpair data. Guided by the heuristic principle that “signal + noise” minus “noise” equals “signal” (Lucek et al., 1998), contribution values CV were defined as follows to reflect the importance of input nodes through their effect on output nodes. Consider the m × d matrix v = {vⱼᵢ} of estimated weights for the jth hidden node and ith input node, and the c × m matrix w = {wₖⱼ} of estimated weights for the kth output node and jth hidden node.
Then the product, u = wv = {uₖᵢ}, is a c × d matrix representing the sums over all hidden nodes of products of weights, vⱼᵢ × wₖⱼ. For the setting of output nodes considered here (c = 2), the contribution value of the ith input node is defined as CVᵢ = |u₁ᵢ − u₂ᵢ| and reflects the relative contribution of the ith input node to the first output node. Contribution values may be plotted for all markers on the genome. The highest points are taken to be indicative of underlying genes. An application to published diabetes sibpair data clearly demonstrated the feasibility of the method just described (Lucek et al., 1998). Also, these authors carried out a power comparison with the MAPMAKER/SIBS program (Kruglyak and Lander, 1995) as follows. Disease data for affected sibpairs were
generated under a two-locus model similar to the one shown in Table 20.1, with a “strong” and a “weak” locus. Marker data were generated for a genome-wide screen with fully informative markers. It turned out that with an assumed sample size of 200 affected sibpairs, the neural network approach exhibited slightly more power than the state-of-the-art affected sibpair analysis to detect the “weak” locus (the “strong” locus was easily detected by both methods).
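The contribution-value computation defined earlier (u = wv, with CVᵢ the absolute difference of the two output rows) can be sketched with toy weight matrices; the weights below are arbitrary illustrations, not estimates from any data set:

```python
def contribution_values(w, v):
    """u = wv (c x d); for c = 2 output nodes, CV_i = |u_1i - u_2i|."""
    c, m, d = len(w), len(v), len(v[0])
    u = [[sum(w[k][j] * v[j][i] for j in range(m)) for i in range(d)]
         for k in range(c)]
    return [abs(u[0][i] - u[1][i]) for i in range(d)]

# Toy weights: two hidden nodes, three input nodes (markers).
v = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
w = [[1.0, 1.0],     # weights into the "signal" output node
     [1.0, -1.0]]    # weights into the "noise" output node
cv = contribution_values(w, v)
```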
B. Disequilibrium analysis in case-control data
Consider a number, d, of SNP markers, either in candidate genes or spread over the whole genome in a genomic screen. It is desired to carry out disequilibrium analysis in case-control data. Thus, at each marker, the frequencies of the 1 and 2 alleles are compared between cases and controls. A significant result is considered indicative of linkage disequilibrium. For neural network analysis, one may proceed in analogy to the description already given for affected sibpairs. There will be a single output node with a target output value of 1 for a case and 0 for a control observation. The ith marker (input node) may be coded xᵢ = 0, 1, or 2 depending on the number of 2 alleles contained in an individual's genotype. This representation assumes additive allelic effects. Alternatively, the three marker genotypes may be represented as categories by two dummy variables, xᵢ⁽¹⁾ and xᵢ⁽²⁾, often also called indicator variables. For example, the genotypes 1/1, 1/2, and 2/2 may be coded as (xᵢ⁽¹⁾, xᵢ⁽²⁾) = (1, 0), (0, 1), and (0, 0), respectively. Of course, this more general coding scheme greatly increases the number of parameters to be estimated. We have not yet tried this approach but plan on implementing it in the very near future. Interactions among marker loci may be seen by inspection of estimated weights. However, the often large number of weights may not be easy to interpret. An ad hoc procedure has been suggested by Paul Lucek (personal communication) along the following lines. Consider a trained network with fixed estimated weights and assume that a potential interaction between input nodes i = 3 and i = 11 is to be found. An input array of values, x = (x₁, …, x_d), will then produce some well-defined response y₁ at the output node. The response y₁ may be interpreted as the estimated probability of being a case when an individual has the observations given by x.
Consider now two arrays of input values, x^(1) and x^(2), differing only in their third element, with all other elements being equal to zero, say. Let x_3 = 0 in x^(1), and x_3 = 1 in x^(2). The resulting change in y_1 corresponds to the main effect at input variable 3 when x_3 changes from 0 to 1 and all other variables are held constant at zero. Analogously, such a main effect for element 11 may be constructed, and also an interaction effect for a simultaneous change of elements 3 and 11 from 0 to 1. In the preceding section, contribution values were described as they were developed intuitively (Lucek and Ott, 1997). A more formal derivation
20. Neural Networks for Gene Finding
may be based on Equation (20.1), which expresses the dependency of the kth output node on the ith input node. This dependency may be formalized as the partial derivative, ∂y_k/∂x_i (Sara Solla, personal communication). If one works this out for Equation (20.1), an expression very similar to the contribution value is obtained. Analogously, mixed partial derivatives may be calculated to show the dependency of y_k on both x_h and x_i, but we have not yet tried this.
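The perturbation probe and its derivative-based formalization can be sketched together on a toy one-hidden-layer network. This is our own minimal illustration with random stand-in weights, not the authors' code; for a logistic activation g, the partial derivative works out to ∂y_1/∂x_i = g'(a_out) Σ_j w_j g'(a_j) v_ij:

```python
import numpy as np

# Toy one-hidden-layer network y1 = g(g(x V) . w), logistic g; the weights
# below are random stand-ins for trained values (illustration only).
rng = np.random.default_rng(1)
d, h = 12, 3
V = rng.normal(size=(d, h))   # input-to-hidden weights
w = rng.normal(size=h)        # hidden-to-output weights

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def y1(x):
    return g(g(x @ V) @ w)

# Ad hoc perturbation probe: main effects of inputs 3 and 11, and their interaction.
base = np.zeros(d)
def bump(*idx):
    x = base.copy()
    for i in idx:
        x[i] = 1.0
    return y1(x)

main3 = bump(2) - y1(base)     # input 3 is index 2
main11 = bump(10) - y1(base)   # input 11 is index 10
interaction = bump(2, 10) - y1(base) - main3 - main11

# Derivative formalization: dy1/dx_i = g'(a_out) * sum_j w_j g'(a_j) v_ij.
def dy1_dx(x):
    a_hid = x @ V
    out = g(g(a_hid) @ w)
    gp_hid = g(a_hid) * (1 - g(a_hid))
    return out * (1 - out) * (V * gp_hid) @ w

# Finite-difference check of the first coordinate:
eps = 1e-6
e0 = np.zeros(d); e0[0] = eps
fd = (y1(base + e0) - y1(base - e0)) / (2 * eps)
```

The analytic gradient agrees with the finite-difference estimate, which is the same consistency one would expect between the intuitive contribution values and the derivative-based expression.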
C. Extended families

Originally, the neural network approach to linkage analysis was applied to the Genetic Analysis Workshop 10, problem 2a, in a very ad hoc fashion, as follows (Lucek and Ott, 1997). Consider families with individuals ordered in some systematic manner, for example, with parents preceding their children. At each marker, each individual has two alleles. Such a genotype was represented by an array of length a, with a being the number of alleles at this marker. The elements of this array were all set equal to zero except for the allele numbers present in the genotype. For example, for a marker with a = 5 alleles, genotypes 1/3 and 3/3 are coded as (1, 0, 1, 0, 0) and (0, 0, 1, 0, 0), respectively. For all individuals together in the data set to be analyzed, this resulted in roughly 2500 input nodes. The quantitative phenotype was represented through four output nodes corresponding to four quartiles of the trait. Clearly, there were far more parameters than observations. However, many of these parameters were highly interdependent, and the "effective" number of independent parameters must have been much smaller. Neural network analysis of these simulated data was highly successful. It found all six of the major genes involved in the disease.
VI. DISCUSSION

Neural networks have been used extensively in genetic analysis (e.g., Bjorkesten and Soderman, 1999; list in Lucek et al., 1998). Also, the well-known GRAIL program for recognizing protein coding regions (exons) in human genomic DNA sequences contains neural network applications (Xu et al., 1994). On the other hand, ANNs have seen very little use in human gene mapping (e.g., Lucek et al., 1998; Saccone et al., 1998). There are still various unsolved problems with our neural net approaches. For example, it is somewhat unclear how to handle missing observations. In affected sibpair data, the best possibility seems to be to proceed in two stages. In stage 1, the GENEHUNTER (Kruglyak et al., 1996) program is used to look at all markers on a chromosome jointly and to predict IBD sharing for markers with incomplete information. Then input data at a given marker are
Sherriff and Ott
not numbers of alleles shared IBD but rather a quantitative trait, that is, the estimated IBD sharing probability (Naimark and Paterson, 1999; Saccone et al., 1999). Another unsolved problem is the determination of significance levels for contribution values. For affected sibpair data, one possibility is to apply Monte Carlo methods by simulating random sibships that are known not to contain disease genes. However, this approach is very time-consuming (Lucek et al., 1998). For case-control disequilibrium data, the situation seems simpler. Random permutations among cases and controls are expected to furnish appropriate null data that can serve as the basis for determining appropriate thresholds for contribution values, but this avenue has not yet been tried. We initially applied ANNs on large numbers of input nodes (marker loci) with the aim of letting the networks decide which of the input nodes furnished relevant information. However, to estimate the weights accurately in a system that is not overdetermined, this approach no longer seems optimal, particularly in genome-wide screens. Instead, following the practice of many users of ANNs, there needs to be a first step, which is often referred to as feature extraction. In our case, the first step should preselect markers suitable for further analysis by the ANNs. For example, only markers with estimated IBD sharing probabilities exceeding the null values might be considered. This device may well eliminate half of all marker loci from further consideration. In classical statistical methods, a rule of thumb is that the number of observations should be at least 3-5 times higher than the number of parameters estimated. Various extensions exist for the simple so-called feed-forward networks described here. Most of them have some analogies to procedures in mathematical statistics.
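The permutation idea for case-control data can be sketched as follows. This is our own illustration on simulated data, using a simple per-marker allele-frequency difference as a stand-in for the contribution value; the data and the 500-permutation count are arbitrary choices:

```python
import numpy as np

# Sketch of permutation-based thresholds for case-control data (our toy
# example): permute case/control labels, recompute a per-marker statistic,
# record the maximum over markers, and take an upper quantile of these
# maxima as a screen-wide threshold.
rng = np.random.default_rng(0)
n, d = 200, 50
labels = np.array([1] * 100 + [0] * 100)   # 100 cases, 100 controls
X = rng.integers(0, 3, size=(n, d))        # additive genotype codes 0/1/2

def max_stat(y, X):
    diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return np.abs(diff).max()

null_max = np.array([max_stat(rng.permutation(labels), X)
                     for _ in range(500)])
threshold = np.quantile(null_max, 0.95)    # approximate 5% screen-wide threshold
```

A statistic exceeding `threshold` at any marker would then be declared significant at roughly the 5% level for the whole screen; with contribution values in place of the frequency difference, the same scheme applies.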
It would be interesting to compare ANNs with newer methods of artificial learning, particularly regarding their suitability to genetic problems. For example, support vector machines have been applied to microarray gene expression data (Brown et al., 2000).
Acknowledgment

This work was supported by grant MH44292 from the National Institute of Mental Health.
References

Bishop, C. M. (1995). "Neural Networks for Pattern Recognition." Clarendon Press, Oxford.
Bjorkesten, L. S., and Soderman, T. (1999). Artificial neural network used for the detection of mutations in DNA sequence of raw data traces. Am. J. Hum. Genet. 65 (suppl.), A221.
Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., Ares, M. Jr., and Haussler, D. (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA 97, 262-267.
Hoh, J. J., and Ott, J. (2000). Complex inheritance and localizing disease genes. Hum. Hered. 50, 85-89.
Kruglyak, L., and Lander, E. S. (1995). Complete multipoint sib-pair analysis of qualitative and quantitative traits. Am. J. Hum. Genet. 57, 439-454.
Kruglyak, L., Daly, M. J., Reeve-Daly, M. P., and Lander, E. S. (1996). Parametric and nonparametric linkage analysis: A unified multipoint approach. Am. J. Hum. Genet. 58, 1347-1363.
Lucek, P. R., and Ott, J. (1997). Neural network analysis of complex traits. Genet. Epidemiol. 14, 1101-1106.
Lucek, P., Hanke, J., Reich, J., Solla, S. A., and Ott, J. (1998). Multi-locus nonparametric linkage analysis of complex trait loci with neural networks. Hum. Hered. 48, 275-284.
Lucek, P., Hanke, J., Reich, J., Solla, S., and Ott, J. (2000). Neural network analysis of complex traits: Method description and optimization. Artificial Intelligence in Medicine (submitted).
Naimark, D. M. J., and Paterson, A. D. (1999). Application of probabilistic neural network analysis to a disease with complex inheritance: The GAW11 simulated data. Genet. Epidemiol. 17 (suppl. 1), S667-S671.
Ott, J. (1999). "Analysis of Human Genetic Linkage," 3rd ed. Johns Hopkins University Press, Baltimore, MD.
Ripley, B. D. (1995). Statistical ideas for selecting network architectures. In "Neural Networks: Artificial Intelligence and Industrial Applications" (B. Kappen and S. Gielen, eds.), pp. 183-190. Springer, New York.
Ripley, B. D. (1996). "Pattern Recognition and Neural Networks." Cambridge University Press, Cambridge. Errata may be found at http://www.stats.ox.ac.uk/~ripley/PRbook/PRNN-Errata.html.
Saccone, N. L., Rice, J. P., Downey, T. J., Goate, A., Neuman, R. J., Rochberg, N., Edenberg, H. J., Foroud, T., and Reich, T. (1998).
Mapping genotype to phenotype: Linear and neural network methods. Am. J. Hum. Genet. 63 (suppl.), A59.
Saccone, N. L., Downey, T. J. Jr., Meyer, D. L., Neuman, R. J., and Rice, J. P. (1999). Mapping genotype to phenotype for linkage analysis. Genet. Epidemiol. 17 (suppl. 1), S703-S708.
Xu, Y., Mural, R., Shah, M., and Uberbacher, E. (1994). Recognizing exons in genomic sequence using GRAIL II. Genet. Eng. (NY) 16, 241-253.
21. Genome Partitioning and Whole-Genome Analysis

Nicholas J. Schork¹
Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio 44109
Program for Population Genetics and Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115
The Jackson Laboratory, Bar Harbor, Maine 04609
I. Summary
II. Introduction
III. The Basic Model
IV. WG and GP Analysis with Human Family Data
V. WG and GP Analysis with Inbred Model Organism Crosses
VI. Human Population-Based Studies
VII. Extensions to Basic MIVC Model
VIII. Discussion and Conclusion
References
I. SUMMARY

Standard DNA marker-based approaches to mapping genes that influence complex traits typically consider a limited number of hypotheses. Most of these hypotheses concentrate on the effect of a single individual locus (or
¹Dr. Schork is currently on leave sponsored by The Genset Corporation, La Jolla, California, and Genset, SA of Paris, France. Correspondence may be sent to the Cleveland address.

Advances in Genetics, Vol. 42
Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.
0065-2660/01 $35.00
relatively few loci) on the trait of interest. Although of tremendous importance scientifically, such hypotheses do not accommodate the full range of genetic phenomena that may contribute to phenotypic expression. We present novel approaches to complex trait analysis that make as complete use of marker information as is possible. The proposed methodologies can be used to entertain a wide variety of hypotheses, including those that engage, for example, the contribution of a particular chromosome, genome-wide heterozygosity, and multiple genomic regions to phenotypic expression. We consider a number of possible extensions of the proposed methods as well as their limitations. Although we discuss many methodological details in the context of quantitative trait locus mapping involving sampling units such as human pedigrees and hybrids resulting from crosses between inbred strains of model organisms, our procedures can be easily adapted to standard sibpair and other sampling unit-based designs. Ultimately, the proposed approaches not only have the potential to increase power to identify individual loci that harbor trait-influencing genes, but also present a framework for testing a number of hypotheses about the nature of the genetic determinants of phenotypes in general.
II. INTRODUCTION

One of the most widely used tools for the identification of genes that influence traits or disease predisposition is meiotic or "linkage" mapping (Lander and Schork, 1994; Schork and Chakravarti, 1996). Meiotic mapping involves tracing cosegregation and recombination phenomena between alleles at observed genomic marker loci and hypothetical trait-influencing alleles at unobserved loci among related individuals. Meiotic mapping thus depends not only on the availability of genetic maps, high-throughput genotyping technologies, and sampling units such as families, but also on statistical methods for modeling and testing cosegregation (Schork and Chakravarti, 1996; Schork et al., 1997). Although great success has been had with linkage mapping strategies in the identification of fairly sizable chromosomal regions potentially harboring loci that influence many traits and diseases, the results of a number of studies of the same traits and diseases have often been inconsistent and have rarely led to the actual cloning of functional genes and polymorphisms. This lack of overt success is probably a function of two related factors: first, an overly optimistic faith in the power of linkage analysis methods to actually lead to precise disease gene identification (Risch and Botstein, 1996; Risch and Merikangas, 1996), and second, an inherent inability of many linkage analysis models to accommodate complexities associated
with multifactorial traits and diseases of contemporary interest (Schork, 1993b). In this chapter we develop and elaborate an approach to the analysis of complex traits and diseases with DNA markers that is meant to allow for a greater variety of hypothesis tests and modeling flexibility than are afforded by standard meiotic mapping models. The proposed models and methods are meant to extract as much information as possible from multiple marker data typically collected as part of a standard “genome scan” (Lander and Schork, 1994) associated with meiotic mapping studies. The models, which we group together as “whole-genome” (WG) and/or “genome partitioning” (GP) methods, can be applied to both inbred model organism crosses and human family studies; they have their foundation in variance component or random effects linear models. We first consider the basic intuitions and modeling constructs behind WG and GP analysis methods. We then describe their use in typical human family-based studies. Following the discussion on methodologies for human family data, we consider their use in inbred model organism cross-studies, and then describe their application in some novel human studies in which related individuals may comprise only a small fraction of the sample. We close with a few extensions and a discussion of areas for further research. We assume throughout the chapter that the reader has some familiarity with linkage analysis and DNA marker-based investigations. We assume further that the trait or disease liability factor under scrutiny is quantitative in nature and that one has marker data distributed throughout the genome, as in standard genome scan contexts.
III. THE BASIC MODEL

The basic modeling construct at the foundation of WG and GP methods is a multipoint identity-by-descent variance component (MIVC) model. Variance component models, which have received considerable attention in linkage analysis settings (Boerwinkle et al., 1986; Goldgar, 1990; Schork, 1993; Amos, 1994; Olson, 1995; Xu and Atchley, 1995; Amos et al., 1996; Schork et al., 1997), also have had a long history of use in other genetic analysis contexts (Fisher, 1918; Lange et al., 1976; Schork, 1992, 1993; Fulker and Cherny, 1996). We first describe generalities of MIVC models and then consider ways of accommodating the dual goals of modeling flexibility and broader hypothesis-testing capabilities within the MIVC modeling framework. More comprehensive treatments of the use of variance component models in genetic analyses are discussed elsewhere (Schork, 1993; Schork et al., 1997).
A. The MIVC framework

Let y denote a quantitative trait (e.g., blood pressure level, cholesterol level, body mass index) collected on N individuals in a defined sampling unit, some subsets of which can be related biologically (e.g., as sibpairs, members of nuclear families, extended pedigree members, litter mates). Assume that the trait value vector, Y = [y_1, . . . , y_N], which contains the trait values gathered on each of the N individuals in the sampling unit, can be modeled with an appropriate multivariate distribution with mean vector μ and variance-covariance matrix Ω, which permits partitions of the form:

\[
\Omega = \sum_{l=1}^{M} \Pi_l \sigma_l^2 + 2K\sigma_a^2 + D\sigma_d^2 + H\sigma_h^2 + I\sigma_e^2, \tag{21.1}
\]
where σ_l², σ_a², σ_d², σ_h², and σ_e² are variance terms characterizing locus-specific, residual (i.e., nonmajor locus) additive, dominance, shared household, and random or individual-specific effects, respectively. The coefficient terms are N × N matrices relating the variance terms to pairs of individuals. Thus, Π_l is an identity-by-descent matrix, K is the kinship coefficient matrix, D is Jacquard's delta-7 matrix (Lange et al., 1976), H is a matrix characterizing shared households (Moll et al., 1979), and I is the identity matrix. The elements of the allele sharing matrices, Π_l, contain estimates of the fraction of alleles shared either identical-by-descent (IBD) or identical-by-state (IBS) for two individuals at a specific genomic locus. They therefore reflect, in some sense, the genetic correlation between two individuals at a specific locus. The use of either IBD and/or IBS information depends on the nature of the hypotheses to be tested and the available sample material. Although we discuss the use of IBD/IBS information in detail later, we focus at present on the use of IBD information. Estimates of IBD sharing for the pairs of individuals in the sampling unit can be derived from multilocus genotype and haplotype data obtained from marker loci that flank a genomic locus of interest (Fulker et al., 1995; Kruglyak and Lander, 1995; Olson, 1995a,b; Kruglyak et al., 1996; Xu and Gessler, 1998). Thus, if π̂_{i,j|l} is a multipoint-based estimate of the fraction of alleles shared IBD by individuals i and j at locus l, then:

\[
\Pi_l = \begin{bmatrix}
1 & \hat{\pi}_{1,2|l} & \cdots & \hat{\pi}_{1,N|l} \\
\hat{\pi}_{1,2|l} & 1 & \cdots & \hat{\pi}_{2,N|l} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{\pi}_{1,N|l} & \hat{\pi}_{2,N|l} & \cdots & 1
\end{bmatrix}. \tag{21.2}
\]
The estimation of π̂_{i,j|l} from multilocus data involves a function φ(l | M_i, M_j; θ), which considers allele sharing at a locus l based on individuals i and j's multilocus marker data, M_i and M_j, and interlocus distances θ. φ(·) is often assumed to be a linear function of the marker data such that

\[
\hat{\pi}_{i,j|l} = \varphi(l \mid M_i, M_j; \theta) = b_0 + \sum_{k=1}^{L} b_k \hat{\pi}_{i,j|k}, \tag{21.3}
\]

where L is the number of marker loci and π̂_{i,j|k} is an estimate of the fraction of alleles shared IBD at marker locus k. The b_0 and b_k terms are obtained from an assumed mapping function relating decay in linkage (or linkage disequilibrium) to distance between the loci (Fulker et al., 1995; Kruglyak and Lander, 1995; Olson, 1995; Kruglyak et al., 1996; Xu and Gessler, 1998). Although much of the work outlining the estimation of allele sharing at genomic loci from DNA marker information at sites that flank the locus in question considers sibling pairs, Almasy and Blangero (1998) have considered sharing between arbitrary relative pairs. Assume further that μ can be modeled as μ = f(X, B), where X = (x_1, . . . , x_N) is a vector of pedigree member-specific covariates, B is a vector of estimable parameters, and f(·) is a function relating X and B to Y. Note that a number of genetic hypotheses can be tested through the use of estimable parameters in B. For example, X could include information about the presence (x = 1) or absence (x = 0) of a mutation within individuals. Information about actual genotypes (or haplotypes) could also be encoded as covariates in X. For example, if genotypes at a biallelic locus (with alleles A and a, say) are collected, and one assumes that one of the alleles, A, has a dominant effect over the other, a, then the genotype information could be encoded as x = 1 if the individual has the AA or Aa genotype and x = 0 if the individual has the aa genotype. The estimated coefficient measuring the effect of the genotype could then be tested for its significance to draw inferences about the relevant locus effect on the trait. Such modeling and testing could proceed while investigators controlled for, or simultaneously tested, variance component parameters meant to characterize the effects of other genetic and nongenetic factors. This strategy is considered in detail in the section on analysis of inbred model organism crosses.
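The linear multipoint form, with b_0 and the b_k derived from a mapping function, can be sketched as follows. This is purely illustrative: the exponential-decay weights and the decay constant are placeholders of our own, not the mapping function used in the cited work:

```python
import math

# Illustrative sketch only: estimate IBD sharing at a test locus as a linear
# combination b_0 + sum_k b_k * pi_hat_k of marker IBD estimates, with marker
# weights decaying in map distance (cM). The real b_0, b_k come from an
# assumed mapping function; the decay constant here is a placeholder.

def multipoint_pi(marker_pis, marker_cM, locus_cM, decay=0.04):
    w = [math.exp(-decay * abs(m - locus_cM)) for m in marker_cM]
    s = sum(w)
    b = [wk / (s + 1.0) for wk in w]   # marker weights b_k, summing to < 1
    b0 = 0.5 * (1.0 - sum(b))          # intercept pulls toward the prior mean 1/2
    return b0 + sum(bk * p for bk, p in zip(b, marker_pis))
```

With both flanking markers reporting sharing of 0.5, the estimate stays at 0.5; markers nearer the test locus receive more weight.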
The parameters B, σ_l², σ_a², σ_d², σ_h², and σ_e² can be estimated via maximum likelihood techniques. For example, if multivariate normality of Y is assumed, then the relevant log-likelihood equation becomes:

\[
\ln L = -\tfrac{1}{2}\ln|\Omega| - \tfrac{1}{2}\,[Y - f(X,B)]'\,\Omega^{-1}\,[Y - f(X,B)]. \tag{21.4}
\]
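Evaluating the log-likelihood in Equation (21.4) for a simple partition can be sketched as follows; the sibship, parameter values, and trait values below are illustrative choices of ours, with the mean modeled as a constant:

```python
import numpy as np

# Sketch: log-likelihood (21.4), up to the constant term, for the simple
# partition Omega = 2K sigma_a^2 + I sigma_e^2 in a sibship of size 3.
# All numeric values are illustrative, not from the text.

def loglik(Y, mu, Omega):
    r = Y - mu
    sign, logdet = np.linalg.slogdet(Omega)
    return -0.5 * logdet - 0.5 * r @ np.linalg.solve(Omega, r)

twoK = np.array([[1.0, 0.5, 0.5],
                 [0.5, 1.0, 0.5],
                 [0.5, 0.5, 1.0]])   # 2 * kinship matrix for three full siblings
sigma_a2, sigma_e2 = 0.6, 0.4
Omega = twoK * sigma_a2 + np.eye(3) * sigma_e2
Y = np.array([1.2, 0.7, -0.3])
ll = loglik(Y, Y.mean(), Omega)
```

With several sampling units, the same function would be applied per unit and the results summed, as noted in the text.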
If more than one sampling unit (e.g., large pedigree or litter) is collected, the log-likelihood becomes the sum of the individual sampling unit log-likelihoods. Note that multiple (M) locus effects can be estimated in the model through variance component terms, but additional locus effects can be tested through the use of covariates as described earlier.
B. Basic estimation and hypothesis testing

To carry out maximum likelihood estimation of the parameters, a number of different numerical schemes can be used. For example, the scoring algorithm and Newton-Raphson iteration work quite well for most variance component models (Lange et al., 1976; Searle et al., 1992). In addition, variance component models that make use of patterned covariance matrices like the MIVC models admit easy derivation of relevant partial derivative terms for their parameters, and thereby facilitate the computation of information matrices and other useful statistical analysis constructs (Searle et al., 1992). Testing the contribution of any element in the model, including individual locus effects, can proceed through likelihood ratio (LR) tests (Schork, 1993) or by using estimates of the standard errors of the parameters, whose calculation and estimation can be "robustified" to departures from the normality assumption (Beaty et al., 1985; Amos, 1994). Hypothesis tests more detailed and elaborate than those that consider only one parameter can be constructed. For example, one could simultaneously test the hypothesis that all locus effects modeled through a variance component (or some subset of them) are insignificant:

H_0: σ_1² = ⋯ = σ_M² = 0.
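A likelihood ratio test of such a hypothesis can be sketched numerically. The maximized log-likelihoods below are illustrative values of ours, and the simple chi-square reference distribution glosses over boundary corrections for variance components:

```python
import math

# Sketch of an LR test for one variance component (illustrative numbers;
# boundary-corrected null distributions are glossed over here).
ll_full, ll_null = -412.3, -415.9   # illustrative maximized log-likelihoods
lr = 2.0 * (ll_full - ll_null)       # LR statistic
# chi-square(1) upper-tail probability via the complementary error function:
p = math.erfc(math.sqrt(lr / 2.0))
```

Here the statistic is 7.2 with df = 1 (one constrained parameter), giving p below 0.01, so the component would be judged significant.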
IV. WG AND GP ANALYSIS WITH HUMAN FAMILY DATA

The basic intuition behind the WG and GP methods is to note that the standard emphasis in linkage mapping on testing individual loci for their phenotypic effect is limited. There are many reasons, however, to believe that multiple polymorphisms within a defined region of the genome may contribute to phenotypic variation. For example, allelic heterogeneity has been found in the genes responsible for many traits and diseases, suggesting that a larger genomic region than an individual locus or single base position within a gene can influence a trait (Terwilliger and Weiss, 1998). Also, it is known that because of epistasis, the genetic background of an individual can easily influence the effect of an individual locus, especially in model organism crosses. The genetic background of an individual has even been shown to act as a surrogate for the existence of trait-influencing genes within the genome (Cavalli-Sforza and Bodmer, 1971;
Brosseau et al., 1979; Knowler et al., 1988; Stern and Haffner, 1990; Cavalli-Sforza et al., 1994). Evidence for coordinated gene expression is also consistent with clusters of genes being under common regulatory control in eukaryotes (Blumenthal, 1998; Niehrs and Pollet, 1999). In addition, genome-wide heterozygosity is thought to influence general viability and response to environmental pathogens and other stimuli (see, e.g., Carrington et al., 1999). Finally, the estimation of the total genetic influence on a quantitative trait (the "heritability" of that trait) exploits the concept of kinship, which is a measure of genome-wide sharing of alleles (rather than allele sharing at a specific genomic locus) between two individuals (Schork et al., 1997). To accommodate nonlocus-specific allele sharing effects, WG and GP analysis methods rely on multipoint estimates of genomic and genomic region sharing between two individuals that is derived from knowledge of their ancestry and DNA marker data collected on them both (Almasy and Blangero, 1998).
A. Basic WG analysis

To showcase how the WG models are constructed, consider the standard way in which one can use human families as sampling units to estimate and test the "heritability" of a quantitative trait (Khoury et al., 1993a,b; Schork, 1993). Typically, a model of the form described in the preceding section is used, which assumes only the following simple covariance matrix partition:

\[
\Omega = 2K\sigma_a^2 + I\sigma_e^2. \tag{21.5}
\]
Thus, only additive genetic and random variance terms are estimated. The ratio ĥ = σ_a²/(σ_a² + σ_e²) gives an estimate of heritability of the trait. The kinship coefficient matrix 2K in Equation (21.5) is computed from knowledge of the genealogical relationships of the individuals in the sampling unit and ultimately has as its elements the probabilities that any pair of individuals in the sampling unit shares genes over the genome as a whole. Thus, the elements of 2K are measures of the fraction of genome shared by two individuals. The WG model for estimating heritability replaces the matrix 2K with an "empirical" kinship coefficient matrix, K̂. The elements of K̂ are computed as empirical estimates of the fraction of genes shared IBD for pairs of individuals over an entire genome. The basic constructs discussed in the context of the derivation of Equation (21.3) are used to compute K̂ from marker data. To clarify, consider that elements of 2K contain the expected genome sharing for two individuals, whereas the proposed empirical estimate of whole-genome allele sharing is calculated by integrating over allele sharing estimates at each locus in the genome computed from the multipoint calculations described in Equation (21.3).
To explain how these calculations can be pursued, first consider a single chromosome, denoted c, for which marker data at various sites are available and for which a pair of related individuals have been typed. Define

\[
\hat{s}_{i,j|c} = \frac{1}{\text{scale}} \int_{g_c(M)} \hat{\pi}_{i,j|l}\, dl, \tag{21.6}
\]

where g_c(M) is that part of the chromosome spanned by the markers and for which informative estimates of π̂ can be obtained, and scale is a constant used to ensure that the genome sharing measure varies between 0 and 1. A summation can be used to approximate the integral:

\[
\hat{s}_{i,j|c} = \frac{1}{P} \sum_{l=1}^{P} \hat{\pi}_{i,j|l} = \frac{1}{P} \sum_{l=1}^{P} \varphi(l \mid M_i, M_j; \theta),
\]

where P is the number of loci at which π̂_{i,j|l} is computed in the defined region. Therefore ŝ_{i,j|c} represents an average π̂_{i,j|l} over the region of interest and thus offers an estimate of the fraction of genome shared by persons i and j in the defined region. Summing the ŝ_{i,j|c} over all c chromosomes (or chromosomal segments or subsets) gives an estimate of whole-genome sharing ŵ_{i,j} = (1/23) Σ_{c=1}^{23} ŝ_{i,j|c} for individuals i and j. With this in mind, we write

\[
\hat{K} = \begin{bmatrix}
1 & \hat{w}_{1,2} & \cdots & \hat{w}_{1,N} \\
\hat{w}_{1,2} & 1 & \cdots & \hat{w}_{2,N} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{w}_{1,N} & \hat{w}_{2,N} & \cdots & 1
\end{bmatrix}. \tag{21.7}
\]
Note that the empirical kinship coefficient accommodates variation in the kinship coefficient among similarly related individuals (Guo, 1996a,b). It should be emphasized that for small regions of the genome, Goldgar (1990) introduced a method for estimating the fraction of genome shared IBD between two marker loci for siblings (see also Guo, 1995). Replacing 2K with K̂ in Equation (21.5) and then estimating the parameter σ_a² should give a more reliable estimate of σ_a², since K̂ accommodates variation in kinship among similarly related individuals and thus more adequately characterizes the genome sharing of two individuals than could be achieved through the use of expectations about such sharing (this of course assumes that one has calculated K̂ with confidence, a topic discussed later).
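The averaging scheme in Equations (21.6) and (21.7) can be sketched as follows. The toy dimensions and the random per-locus sharing matrices below are illustrative stand-ins of ours for multipoint output, not real data:

```python
import numpy as np

# Sketch of building the empirical kinship matrix K-hat: average per-locus
# IBD sharing matrices over the loci on each chromosome (eq. 21.6), then
# average the per-chromosome sharing matrices over chromosomes (eq. 21.7).
rng = np.random.default_rng(0)
N, n_chrom, loci_per_chrom = 4, 3, 10   # tiny toy example, not 23 chromosomes

def symmetrize(A):
    A = (A + A.T) / 2.0
    np.fill_diagonal(A, 1.0)   # an individual shares its whole genome with itself
    return A

K_hat = np.zeros((N, N))
for c in range(n_chrom):
    # per-locus pi-hat matrices for this chromosome (random stand-ins here)
    pis = [symmetrize(rng.uniform(0, 1, size=(N, N))) for _ in range(loci_per_chrom)]
    s_c = sum(pis) / loci_per_chrom   # average sharing over the region
    K_hat += s_c / n_chrom            # average over chromosomes
```

The resulting matrix is symmetric with unit diagonal and off-diagonal entries in [0, 1], the form required when K̂ replaces 2K in Equation (21.5).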
Hypothesis- and parameter-oriented testing can proceed via likelihood ratio tests. Once estimates of σ̂_a² and σ̂_e² have been obtained, one can get an estimate of the heritability, that is, the percentage of variation explained by "genetic" factors, ĥ, as with the standard model in Equation (21.5), which uses 2K to quantify kinship.
B. GP analysis

The method just described for empirically estimating from marker data the fraction of the entire genome shared by two individuals of known ancestry can be easily adapted to estimate sharing of any subregion of the genome, including individual chromosomes. The fraction of variation in the trait of interest that can be attributed to genetic variation in a subregion of the genome of interest can be estimated easily enough by substituting computed ŝ_{i,j|c} values (i.e., computed for the relevant subregion) for specific locus-based π̂_{i,j|l} in Equation (21.2). A sequential procedure that uses these values for testing and assessing the contribution of genomic regions to variation in the trait of interest can be constructed. This sequential procedure is the foundation for the proposed GP analysis methods. Consider first estimating the parameter σ_a² with the WG strategy described in the preceding section. After the "whole-genome" effect parameter σ_a² has been estimated, one can turn to a finer partition of the genome in which the contribution of individual chromosomal variation is considered. The covariance matrix partition would be
\[
\Omega = \sum_{c=1}^{C} S_c \sigma_c^2 + 2\hat{K}\sigma_a^2 + I\sigma_e^2, \tag{21.8}
\]
where the cr”, terms quantify individual chromosome effects. Relevant coefficient matrices, S, would take on a form similar to that of Equations (21.2) and (21.7) but would have estimates of chromosomal sharing as elements rather than locus-specific or whole-genome sharing estimates. Note that precisely how such a model would be fit to, and tested with, the data is open to debate and research. Consider that if sharing between all chromosome pairs was accommodated in the model [as in Equation (21.8)], it might not be necessary to assume a need to account for any residual genetic variation through the estimation of the parameter a: [as in Equation (21.8)]. However, it may be the case that some genetic effects are “wholistic” in nature and manifest themselves as true genome-wide phenomena that are greater than the sum of their parts (e.g., through interactions between loci on different chromosomes). In addition, tests
of relevant hypotheses may be better served through the inclusion of this parameter. Consider that after σ_a² has been estimated, one may want to "fix" this parameter under the assumption that estimates of more specific genetic components must sum to the total genetic variation estimated or captured in σ_a². Thus, one could test the hypothesis that a specific chromosome contributes to phenotypic variation by considering H_0: σ²_{c=i} = σ_a² − Σ_{j≠i} σ²_{c=j} = 0. The main point here is that after estimating σ_a², one could fix the value of the total genetic variation to this estimate and then proceed to estimate the contribution of each chromosome under the assumption that these contributions must sum to the value of σ_a² estimated earlier. With all these factors in mind, finer and finer levels of genomic resolution can be assessed for their contribution to the trait of interest. Thus, the contribution of specific chromosomal regions or specific loci can be assessed after the broader chromosomal effects have been characterized. The main caveat or protocol for these subsequent estimations and tests, however, would be to retain in each subsequent model parameters that explain some of the variations in the trait that had been assessed and estimated earlier. For example, consider some models that have been fit sequentially to a set of data for which relevant hypothesis tests have been conducted. Suppose that these tests have resulted in the suggestion that two entire chromosomes (denoted through the subscript c and the chromosome number), three chromosomal subregions (on chromosomes different from those showing entire chromosome effects but two subregions on the same chromosome, denoted through the subscript s and the chromosome number), and a residual "whole-genome" effect contribute to variation in a trait. The covariance matrix for this model would be

\[
\Omega = \sum_{c'} S_{c'}\sigma_{c'}^2 + \sum_{s'} S_{s'}\sigma_{s'}^2 + 2\hat{K}\sigma_a^2 + I\sigma_e^2,
\]

where c' ranges over the two chromosomes and s' over the three subregions showing effects.
A model fit subsequent to this model may assume that some of the effect attributable to the subregion on chromosome 1 is due to a specific locus. The covariance matrix for this subsequent model would be

\[
\Omega = S_{s=1}\sigma_{s=1}^{2*} + \Pi_l\sigma_l^2 + \cdots,
\]

with the remaining terms as before,
where the locus to be tested, denoted with the subscript l, is on chromosome 1 in the subregion of interest, Π_l is the IBD sharing matrix computed for the locus l, and the asterisk denotes that the variation attributable to the subregion on chromosome 1 has been partitioned. Note that σ²_{s=1} = σ²*_{s=1} + σ_l² (in theory), and this relation could be used to test the hypothesis
21. Genome Partitioning Analysis
Figure 21.1. Schematic representation of the genome partitioning approach to genome scan analysis (Step 1: Whole Genome; Step 2: Individual Chromosomes; Step 3: Chromosomal Regions; Step 4: Individual Loci). The darkness of the shading pattern and the size of the arrows correspond to the strength of the influence of the genomic region on the trait in question.
H0 : σ²_l = σ²_{s=1} - σ²*_{s=1} = 0, as described in the context of testing chromosomal regions, after the empirical genome sharing matrix K̂ has been used to estimate a whole-genome effect.
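The constrained, sequential testing logic described above can be illustrated with a small maximum-likelihood sketch. Nothing here is the chapter's own software: the data are simulated, the model Ω = σ²_π Π̂ + σ²_e I is the simplest one-component MIVC form, and the sample sizes, seed, and choice of scipy's Nelder-Mead optimizer are assumptions made only for the illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def neg_loglik(params, y, matrices):
    """Negative multivariate-normal log-likelihood for Omega = sum_k s2_k * K_k."""
    s2 = np.asarray(params)
    if np.any(s2 < 0):
        return np.inf                      # variance components are nonnegative
    omega = sum(v * K for v, K in zip(s2, matrices))
    sign, logdet = np.linalg.slogdet(omega)
    if sign <= 0:
        return np.inf
    n = len(y)
    quad = y @ np.linalg.solve(omega, y)
    return 0.5 * (n * np.log(2 * np.pi) + logdet + quad)

# Simulated example: 40 sib pairs; the IBD sharing matrix Pi is block
# diagonal with one 2 x 2 block per pair (off-diagonal = proportion shared).
n_pairs, n = 40, 80
shares = rng.choice([0.0, 0.5, 1.0], size=n_pairs, p=[0.25, 0.5, 0.25])
Pi = np.eye(n)
for p, s in enumerate(shares):
    i, j = 2 * p, 2 * p + 1
    Pi[i, j] = Pi[j, i] = s
I = np.eye(n)

# Trait simulated with a true sharing-component variance of 2, residual 1.
L = np.linalg.cholesky(2.0 * Pi + 1.0 * I)
y = L @ rng.standard_normal(n)

full = minimize(neg_loglik, x0=[1.0, 1.0], args=(y, [Pi, I]), method="Nelder-Mead")
null = minimize(neg_loglik, x0=[1.0], args=(y, [I]), method="Nelder-Mead")
lrt = 2.0 * (null.fun - full.fun)  # boundary test: 0.5*chi2(0) + 0.5*chi2(1) mixture
print(round(lrt, 2))
```

Fixing a component at an earlier estimate, as in the sequential GP scheme, amounts to constraining one of the `s2` entries during the later fits rather than re-estimating it freely.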
Nicholas J. Schork
The idea of testing genome- or chromosome-wide effects initially and then refining and assigning effects to smaller chromosomal regions is outlined in Figure 21.1. The goal of such analyses would not necessarily be to resolve or attribute effects to individual loci (although this would be of great interest if possible!), but rather to provide a more comprehensive dissection of the genetic architecture of a quantitative trait, and to do so in a way that accommodates both broad genome effects and the possibility that the restriction of estimations and tests to an isolated, individual locus might not be the most powerful or compelling way to proceed.
V. WG AND GP ANALYSIS WITH INBRED MODEL ORGANISM CROSSES

The constructs described in the preceding section can be adapted and extended to analyze data arising from model organism crosses (Schork et al., 1996b; Lui, 1998). Since many such crosses involve pairs of inbred strains with distinguishable alleles at sites throughout their genomes, phase information can be easily established. This information can be exploited in a number of ways in modeling devices that embrace a wide variety of genetic effects. Although we focus on situations in which simple intercross hybrids derived from two inbred strains are studied, the proposed methods can easily be adapted for other types of crosses and progeny. It is known that many factors can influence quantitative trait locus (QTL) mapping (Schork et al., 1996b; Lui, 1998). One of the most important factors is the "genetic background" of the organisms studied. Genetic background effects arise as a result of inbred strains typically being fixed for unique trait-influencing alleles that may result in epistatic interactions, polygenic effects, and gene-environment interactions that cannot be adequately accounted for in traditional isolated single-locus analyses (Lyon et al., 1996). In addition, it is known that factors such as genome-wide (or region-specific) heterozygosity can influence responses to environmental stimuli and hence contribute to phenotypic variation of interest (Lynch and Walsh, 1997; Lui, 1998). Thus, it is very important in QTL mapping studies to accommodate genetic background effects. One way to achieve this is through WG and GP analysis modeling devices that can be implemented as either "fixed" factors (i.e., in the context of the models described, this would mean "as independent variables in a regression model") or random effect variance components (or both) in the MIVC modeling framework. Both these approaches are outlined in the subsections that follow.
A. Simple regression modeling

Standard interval mapping models for intercross hybrids basically try to relate genotype information on each intercross hybrid to phenotype information in a regression model. Assume that the alleles in the two parental strains that generated the hybrids can be distinguished at each locus in the genome. Let the alleles for one parental strain be denoted as A and the alleles for the other strain as a. Thus, any intercross hybrid will be homozygous AA, heterozygous Aa, or homozygous aa at each locus. Let y_i be the phenotype for the ith hybrid. Further let M_i be the marker genotype information possessed by the ith hybrid at loci flanking a locus of interest, and let θ denote relevant interlocus distance information. The standard regression-based interval mapping model is of the form
f(y_i) = Σ_g π(g | M_i, θ) f(y_i | g),  (21.9)

where π(g | M, θ) is a function giving the probability that a hybrid has genotype g at a locus given flanking marker genotype information M, interlocus distance information θ, and a mapping function, and where f(y | g) is a function relating the genotype g to the phenotype value y. Formulas and models for π(g | M, θ) have been discussed extensively in the literature (see, e.g., Lui, 1998). Often f(y | g) is assumed to be linear: y = b_0 + b_g x_g + e, where x_g is assigned values such as 0, ½, or 1 depending on whether the genotype is assumed to be AA, Aa, or aa, and e is assumed to be normally distributed with mean 0 and variance σ² (Lander and Botstein, 1989). Thus the standard interval mapping model represents a simple mixture model in which the mixing weights reflect the probability that a hybrid has a particular genotype. Tests of the locus effect would concentrate on the coefficient b_g: H0 : b_g = 0. To accommodate WG and GP modeling within the standard interval mapping framework, one can simply add additional predictors to the regression equation for y:

y = b_0 + b_g x_g + Σ_k b_k x_k + e.  (21.10)

These additional regressors, x_k, can reflect phenomena such as the amount of genome, or the amount of genome in a genomic subregion, possessed by a hybrid that either emanated from one of the parental strains (e.g., consists of the A allele) or is heterozygous Aa. These regressors are computed as true fixed effects without the need to consider every possibility, as in the case of
assessing a single-locus effect. These regressors can be estimated by integrating a function that considers probabilities of genotypes over a particular region or chromosome. Thus,

x_k = ∫_C [ Σ_g φ(g) π(g | M, θ) ] dt,  (21.11)

where φ(g) is a function relating a genotype g to a measure of genome composition for an individual hybrid, π(·) is as defined for Equation (21.9), and the integral is taken over positions t in the region or chromosome C of interest. For example, if one is interested in testing to see whether the amount of genome that is heterozygous in a genomic region, entire chromosome, or entire genome influences a trait, then

φ(g) =
  0 if g = AA, aa
  1 if g = Aa  (21.12)
If one is interested in the amount of genome that is from a particular parental strain (say, e.g., the strain whose alleles are denoted "A"), then

φ(g) =
  0 if g = aa
  ½ if g = Aa
  1 if g = AA  (21.13)
Tests of hypotheses about the contribution of a chromosome, chromosomal region, or entire genome would focus on the coefficients b_k. Multiple regions could be tested in this way. In addition, GP sequential testing strategies could be adopted, but since tests would involve regression coefficients instead of variance component parameters, constraining estimates obtained in subsequent models to sum to values obtained in earlier models would not be likely to work.
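As a concrete sketch of Equations (21.11)-(21.13), the fragment below computes heterozygosity and parental-origin regressors from genotype probabilities and adds them to the interval mapping regression. The simulated probabilities (a Dirichlet draw standing in for multipoint genotype probabilities), the effect sizes, and all function and variable names are illustrative assumptions, not the chapter's own method or software.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: for each of N intercross hybrids we assume the
# posterior genotype probabilities P(g | M, theta) are available at a
# grid of loci spanning a region (here simulated directly).
N, n_loci = 200, 50
probs = rng.dirichlet([1.0, 2.0, 1.0], size=(N, n_loci))  # P(AA), P(Aa), P(aa)

def composition_regressor(probs, phi):
    """Average E[phi(g)] over the region: a discrete analogue of Eq. (21.11)."""
    weights = np.array([phi("AA"), phi("Aa"), phi("aa")])
    return (probs * weights).sum(axis=2).mean(axis=1)

het = lambda g: 1.0 if g == "Aa" else 0.0                 # Eq. (21.12)
from_A = lambda g: {"AA": 1.0, "Aa": 0.5, "aa": 0.0}[g]   # Eq. (21.13)

x_het = composition_regressor(probs, het)
x_A = composition_regressor(probs, from_A)

# Extended regression y = b0 + b_g*x_g + b1*x_het + b2*x_A + e, as in
# Eq. (21.10); here the simulated trait responds to regional
# heterozygosity with coefficient 3.
x_g = rng.choice([0.0, 0.5, 1.0], size=N)
y = 1.0 + 0.5 * x_g + 3.0 * x_het + rng.standard_normal(N) * 0.1
X = np.column_stack([np.ones(N), x_g, x_het, x_A])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 1))
```

The hypothesis of interest for a background effect would then be an ordinary test on the corresponding regression coefficient, e.g., H0 : b1 = 0.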
B. Extended MIVC modeling

Variance component parameters can be exploited in WG and GP models for model organism cross progeny in very intuitive ways. The elements of allele and genome sharing coefficient matrices in relevant MIVC WG and GP models can be calculated from marker genotype data by using the constructs described for simple regression modeling. For example, given that the meiotic events for each intercross offspring are independent, allele sharing at a particular locus for a pair of offspring can be defined as

π̂_{i,j} = Σ_{g_i} Σ_{g_j} δ(g_i, g_j) π(g_i | M_i, θ) π(g_j | M_j, θ),  (21.14)
where g_i, g_j ∈ {aa, Aa, AA}, and the function δ(g_i, g_j) maps the genotypes of the two offspring into the fraction of alleles that they share. For example, if one is interested in total allele sharing for the two offspring, then δ(g_1, g_2) is defined as
δ(g_1, g_2) =
  0 if (g_1, g_2) ∈ {(aa, AA), (AA, aa)}
  ½ if (g_1, g_2) ∈ {(aa, Aa), (Aa, aa), (AA, Aa), (Aa, AA)}
  1 if (g_1, g_2) ∈ {(aa, aa), (AA, AA), (Aa, Aa)}  (21.15)
Since genotypes of the offspring are not known except at the marker loci, they can be assigned only probabilistically. Hence, all possible genotypes at an unmarked locus must be considered in the evaluation of allele sharing, as in Equation (21.14). The allele sharing function can focus on other phenomena besides total allele sharing. For example, one might be interested in the sharing of only A alleles. A function characterizing A allele sharing, denoted δ^A(g_1, g_2), can be constructed easily for use in Equation (21.14). In addition, functions can be constructed for sharing homozygous genome δ^Hom(g_1, g_2), heterozygous genome δ^Het(g_1, g_2), homozygous A genome δ^HomA(g_1, g_2), and others. Example sharing functions are outlined in Table 21.1. If one is interested in sharing of entire genomes, chromosomes, and/or chromosomal segments, as in WG and GP modeling, then estimates of sharing can be constructed by integrating over the regions of interest using Equation (21.14) in an analogous manner to the procedures described for simple regression modeling and human family studies, writing
Table 21.1. Functions That Relate Intercross Progeny Genotype Pairs to Genomic Sharing

g_1   g_2   δ(g_1, g_2)   δ^A(g_1, g_2)   δ^Hom(g_1, g_2)   δ^Het(g_1, g_2)   δ^HomA(g_1, g_2)
AA    AA    1             1               1                 0                 1
AA    Aa    ½             ½               0                 0                 0
AA    aa    0             0               0                 0                 0
Aa    AA    ½             ½               0                 0                 0
Aa    Aa    1             ½               0                 1                 0
Aa    aa    ½             0               0                 0                 0
aa    AA    0             0               0                 0                 0
aa    Aa    ½             0               0                 0                 0
aa    aa    1             0               1                 0                 0
π̂_{i,j}(I) = ∫_I [ Σ_{g_i} Σ_{g_j} δ(g_i, g_j) π(g_i | M_i, θ) π(g_j | M_j, θ) ] dt,

where g ∈ I denotes that the genotypes considered are from a locus within the defined interval. GP sequential testing and modeling procedures can be used to test the variance component terms. However, two issues must be considered. First, there are an infinite number of hypotheses to test, depending on how one defines the size of certain genomic regions and whether one is interested in heterozygosity, homozygosity, and so on within each region. Second, one could actually consider models that combine the use of regressors to characterize some effects and variance components to characterize others. Unfortunately, such combined models may be computationally prohibitive owing to the potential need to consider all possible genotype configurations for each individual for testing single-locus effects [i.e., through the use of the mixture model in Equation (21.9)]. Since modeling covariation among the individuals using the patterned covariance matrix constructs described would consider the hybrids' trait values as exhibiting dependencies, the trait values cannot be evaluated independently as in the simple regression model used in traditional interval mapping [i.e., Equation (21.9)]. Thus, for example, if N intercross hybrids are studied, the covariance matrix (which would have to be inverted to get the information matrix) would be of dimension N × N, and 3^N different genotype configurations for the hybrids would have to be considered. (See Schork, 1991, 1992, for a discussion of similar models used in human genetic analysis.) One could avoid this, however, by breaking up the sampling unit into smaller units (such as litters) and modeling covariation only among individuals in these units. Even this strategy may prove to be computationally prohibitive if the smaller units are still rather large.
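The sharing functions of Equation (21.15) and Table 21.1, and the expectation over genotype pairs in Equation (21.14), can be sketched directly. The genotype probabilities below are assumed inputs (in practice they would come from a multipoint algorithm), and the function and variable names are invented for the illustration.

```python
import numpy as np
from itertools import product

G = ["AA", "Aa", "aa"]

# Sharing functions from Table 21.1 (total, heterozygous, and homozygous
# sharing), written as dictionaries over ordered genotype pairs.
d_tot = {(g1, g2): 1.0 if g1 == g2 else (0.0 if {g1, g2} == {"AA", "aa"} else 0.5)
         for g1, g2 in product(G, G)}
d_het = {(g1, g2): 1.0 if g1 == g2 == "Aa" else 0.0 for g1, g2 in product(G, G)}
d_hom = {(g1, g2): 1.0 if (g1 == g2 and g1 != "Aa") else 0.0
         for g1, g2 in product(G, G)}

def expected_sharing(p1, p2, delta):
    """Eq. (21.14): sum over genotype pairs of delta weighted by the
    posterior genotype probabilities of the two offspring."""
    return sum(delta[(g1, g2)] * p1[i] * p2[j]
               for (i, g1), (j, g2) in product(enumerate(G), enumerate(G)))

# Offspring genotypes known with certainty at a marker:
aa = np.array([0.0, 0.0, 1.0])
Aa = np.array([0.0, 1.0, 0.0])
print(expected_sharing(aa, Aa, d_tot))   # 0.5: aa vs Aa share half their alleles

# Between markers only probabilities are available; with F2 expectations
# (1/4, 1/2, 1/4) for both offspring:
f2 = np.array([0.25, 0.5, 0.25])
print(expected_sharing(f2, f2, d_tot))   # 0.625
```

Integrating such expectations over a chromosome or subregion, as in the unnumbered equation above, amounts to averaging `expected_sharing` over a grid of loci within the interval.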
VI. WG ANALYSIS OF POPULATION-BASED SAMPLES

As an extension to the WG models described for sampling units that contain related individuals, one can consider models that work with largely unrelated individuals or, rather, individuals whose ancestry and/or genealogical links are unknown and thought to be too remote and complicated to be estimated via standard IBD-based allele sharing methods. Such samples inevitably arise when random samples of individuals are collected across different ethnic or geographic strata for epidemiological investigations. We briefly consider two approaches to WG analysis of population studies, one that operates at the level of individual data and one that operates at the level of population data.
A. WG analysis of individuals in populations

Consider the collection of a random sample of N individuals from a population. Assume that each individual has been genotyped at M marker loci dispersed
throughout the genome. If the sample is random, there are bound to be individuals within it that are more related than others, either because of immediate genealogical links not known at the time (e.g., second cousins being sampled unknowingly) or because of common population origins (e.g., some individuals had ancestors that emigrated from Africa or China). In the absence of overt knowledge of the relatedness of individuals, one could use the DNA markers to estimate the overall genetic similarity of pairs of individuals and then relate this genetic similarity to phenotypic similarity using MIVC models. To estimate genetic similarity between individuals in the sample, procedures such as the following could be used. Calculate allele frequencies for all n_k alleles (denoted a_h^k, h = 1, . . . , n_k; k = 1, . . . , M) at each of the M marker loci from the sample. Let these allele frequencies be denoted through the use of superscripts and subscripts as follows:
f(a_1^1), . . . , f(a_{n_1}^1); f(a_1^2), . . . , f(a_{n_2}^2); . . . ; f(a_1^M), . . . , f(a_{n_M}^M). There are seven different possible IBS allele sharing scenarios for a pair of individuals. Define a measure of genetic similarity at a single locus k for two individuals, i and j, with alleles a_i^{k,1}, a_i^{k,2} and a_j^{k,1}, a_j^{k,2}, where the superscript after the comma denotes allele 1 or allele 2 of the two possessed by an individual (assigned arbitrarily just to distinguish them), and the subscript denotes the individual, given these seven scenarios as

s_{i,j}^k = ω^w(·) I_{i,j},
where w ranges over the seven possible allele sharing states (w = 1, . . . , 7) and I_{i,j} denotes the number of alleles shared IBS by the two individuals (i.e., I_{i,j} = 0, 1, 2). The weight functions ω^1(·), . . . , ω^7(·) are functions of allele and genotype frequencies that reflect intuitions about how common and rare alleles, homozygous genotypes, and so on shared by two individuals should be reflected in the similarity measure. Thus, for example, if two individuals share two very rare alleles at a locus (based on their frequency in the sample), it is quite likely that they are related at some level. This circumstance should be given more weight in their measure of similarity than, say, sharing of common alleles. Table 21.2 gives the seven allele sharing scenarios, their frequency, the number of alleles shared IBS, and the weight. By assuming that the weight functions are not equal, one achieves greater flexibility in the assignment of a similarity score. A measure of overall similarity s_{i,j} in genetic profile can be obtained as the average of the locus-specific similarity measures over the M loci:

s_{i,j} = (1/M) Σ_{k=1}^{M} s_{i,j}^k.  (21.17)
Table 21.2. Frequency of Allele Sharing Scenarios and Associated Information

Type^a   Configuration             IBS^b   Frequency                       Weight
1        a_h, a_h × a_h, a_h       2       f(a_h)^4                        ω^1(·)
2        a_h, a_h × a_b, a_b       0       2 f(a_h)^2 f(a_b)^2             ω^2(·)
3        a_h, a_h × a_h, a_b       1       4 f(a_h)^3 f(a_b)               ω^3(·)
4        a_h, a_h × a_b, a_c       0       4 f(a_h)^2 f(a_b) f(a_c)        ω^4(·)
5        a_h, a_b × a_h, a_b       2       4 f(a_h)^2 f(a_b)^2             ω^5(·)
6        a_h, a_b × a_h, a_c       1       8 f(a_h)^2 f(a_b) f(a_c)        ω^6(·)
7        a_h, a_b × a_c, a_d       0       8 f(a_h) f(a_b) f(a_c) f(a_d)   ω^7(·)

^a Refers to the allele sharing scenario.
^b Number of alleles shared identical-by-state given the configuration of alleles possessed by the pair.
If one assumes that the weight functions are all equal to 1, then the measure of genetic similarity is simply the fraction of alleles shared IBS over all loci. A sharing matrix can then be constructed and used to estimate variance component parameters reflecting whole-genome similarity effects within the MIVC modeling framework. The elements of this matrix would characterize the "genetic distance" (of sorts) between the individuals. The model would be of the form in Equation (21.3) with covariance matrix

Ω_{N×N} = S σ²_s + I σ²_e,  (21.18)

where S is the N × N matrix with ones on the diagonal and the pairwise similarity measures s_{i,j} as off-diagonal elements, I is the N × N identity matrix, σ²_s is the variance component reflecting the whole-genome similarity effect, and σ²_e is the residual variance.
One could additionally study subregions of the genome by confining the similarity measure to those regions and thereby pursue GP-like analyses as well. There are a few concerns with this strategy, however. The sharing matrix should be scaled to approximate genetic correlations (e.g., range between -1 and 1 or 0 and 1). This can be achieved easily enough by dividing each element in a sharing matrix by appropriate scaling factors. In addition, there is no guarantee that the matrix resulting from the use of the similarity measure will be positive definite. This can be corrected through the use of appropriate matrix manipulations, but the interpretability of outcomes may be affected (Astemborski et al., 1985). Note also that the choice of DNA markers used to estimate genetic background similarity is crucial. In this context it may be possible to use population-specific alleles (Shriver et al., 1997) or short stretches of sequence known to have a number of variants (i.e., haplotypes) to characterize ancestry, genealogical links, and genetic similarity in more clever ways.
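A minimal sketch of the IBS similarity matrix just described, with all weight functions set to 1 so that similarity reduces to the fraction of alleles shared IBS, followed by a check on the positive definiteness concern raised above. The biallelic markers and sample sizes are simulated assumptions made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: N individuals typed at M biallelic markers; each
# genotype is stored as an unordered pair of allele codes (0 or 1).
N, M = 30, 100
genos = rng.integers(0, 2, size=(N, M, 2))

def ibs_fraction(gi, gj):
    """Fraction of alleles shared identical-by-state at one locus
    (all weight functions equal to 1)."""
    a, b = sorted(gi), sorted(gj)
    if a == b:
        return 1.0
    if a[0] in b or a[1] in b:
        return 0.5
    return 0.0

# Average locus-specific similarity over the M loci, as in Eq. (21.17).
S = np.eye(N)
for i in range(N):
    for j in range(i + 1, N):
        s = np.mean([ibs_fraction(list(genos[i, k]), list(genos[j, k]))
                     for k in range(M)])
        S[i, j] = S[j, i] = s

# The matrix is symmetric with unit diagonal, but positive definiteness
# is not guaranteed in general; inspect the smallest eigenvalue.
print(round(float(np.linalg.eigvalsh(S).min()), 3))
```

A negative smallest eigenvalue would signal the need for the matrix manipulations mentioned in the text before the matrix is used as a covariance component.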
B. WG analysis involving populations

As an adjunct or complementary approach to the analysis of individuals sampled from populations, one could treat actual populations as the units of observation. For example, one could collect allele frequency information obtained from populations around the world, use this information to construct a genetic similarity measure between the populations (e.g., genetic distance; Cavalli-Sforza et al., 1994), and then determine whether genetic similarity between populations predicts similarity in outcomes in those populations, such as frequency of a disease or average level of a quantitative measure like blood pressure. The model would be similar to that outlined in the discussion of Equation (21.18), but the trait values in the trait vector would be population-based variables (e.g., the frequency of a disease or condition) and the similarity matrix (or matrices) would reflect genetic distance between the populations. Similar models have been discussed by Smouse et al. (1986). Such population-based studies run the risk of committing the "ecological fallacy" (Schwartz, 1994), but they would be of tremendous interest, especially if genomic subregions were assessed as in GP methods. In addition, one could compare the results of individual- and population-based approaches to WG and GP analysis.
VII. EXTENSIONS TO THE BASIC MIVC MODEL

The MIVC models can be extended in a number of useful ways that can be implemented within the WG and GP analysis frameworks. We focus on only three of these extensions.
A. Allelic and locus interactions

Much of the focus in this chapter has been on the additive effects of alleles. However, both allelic interactions (i.e., dominance relationships) and locus interactions can be accommodated as long as, for example, relevant IBD sharing measures are computed appropriately and interaction matrices are constructed properly. (For a discussion of relevant modeling issues, see, e.g., Guo, 1995, 1996a,b; Cheverud and Routman, 1995.) Obviously, testing for genetic dominance effects, interaction effects, and so on would increase the chance for a false positive result in the absence of corrections for multiple comparisons.
B. Accommodating uncertainty in the IBD calculations

As a complement to whole-genome sharing parameter (WGSP) estimation, one could compute whole-genome uncertainty parameters, which quantify the uncertainty with which the estimates of π̂ were obtained in the calculation of WGSPs (e.g., as a result of the use of noninformative markers, large intermarker distance, etc.) (Xu and Gessler, 1998). Once these parameters have been calculated, they can be incorporated into the analysis in a variety of ways. For example, they can be used to delineate parts of the genome for which more informative markers are needed, or, more importantly, they can be used to weight pairs of individuals by the "unambiguity" with which their genome or allele sharing estimates have been computed and thereby result in more reliable parameter estimates.
C. Pleiotropy and multiple phenotype analysis

The MIVC models can be extended to include multiple traits (Lange and Boehnke, 1983; Thompson and Shaw, 1992; Schork, 1993a,b; Schork et al., 1994). Such analyses can lead to estimates of genetic correlations between the various trait measures and also potentially increase the power to detect individual locus effects (Schork, 1993a,b). The basic strategy for analyzing multiple traits is to construct a multitiered trait vector and an expanded covariance matrix. The basic estimation and testing strategies would remain essentially as described (Lange and Boehnke, 1983).
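One standard way to build the expanded covariance matrix for T traits on n related individuals is a Kronecker-product construction; the sketch below assumes this form for illustration and is not necessarily the exact parameterization used in the cited papers.

```python
import numpy as np

# Assumed construction for illustration: stack the T trait vectors and form
# Omega = Sigma_G (x) Pi_hat + Sigma_E (x) I, where Sigma_G and Sigma_E are
# T x T genetic and environmental covariance matrices and Pi_hat is the
# n x n sharing matrix.
n, T = 4, 2
Pi = np.array([[1.0, 0.5, 0.0, 0.0],
               [0.5, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.5],
               [0.0, 0.0, 0.5, 1.0]])        # two sib pairs
Sigma_G = np.array([[2.0, 0.8], [0.8, 1.0]])  # genetic (co)variances
Sigma_E = np.array([[1.0, 0.2], [0.2, 1.0]])  # environmental (co)variances

Omega = np.kron(Sigma_G, Pi) + np.kron(Sigma_E, np.eye(n))
print(Omega.shape)  # (8, 8)

# Genetic correlation between the two traits implied by Sigma_G:
rho_g = Sigma_G[0, 1] / np.sqrt(Sigma_G[0, 0] * Sigma_G[1, 1])
print(round(float(rho_g), 3))  # 0.566
```

Estimation and likelihood-ratio testing then proceed exactly as in the univariate MIVC case, only with this larger nT × nT covariance matrix.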
VIII. DISCUSSION AND CONCLUSION

The concept of partitioning the variation of a quantitative trait into discrete and estimable components is hardly new. Sir Ronald Fisher, Oscar Kempthorne, and others pioneered such modeling efforts in genetics decades ago (Fisher, 1918; Kempthorne, 1957). However, it was not until DNA markers became available that efforts to partition the variation of a trait into components reflecting the specific effects of individual loci were pursued. Much of this modeling with DNA markers, however, only emphasized tests and estimates of individual loci in isolation from others. It was not until recognition of the actual utility of controlling for "residual" genetic effects (those acting beyond a locus whose effects are of immediate interest) that efforts to accommodate multiple-locus effects were pursued. Most of the statistical models used to achieve this accommodation of additional genetic effects were, in fact, based on standard variance components linear models (Lange et al., 1976; Hopper and Mathews, 1982; Goldgar, 1990; Schork, 1993b; Amos, 1994; Almasy and Blangero, 1998).
This need to consider or accommodate residual variation over and above an individual locus effect was also anticipated by experiences with the derivation of "mixed models" in classical segregation analysis (Ott, 1979; Hasstedt, 1982; Lalouel et al., 1983; Bonney, 1984; Schork, 1992; Schork et al., 1996a). What is interesting in this regard is that studies meant to identify actual loci that influence quantitative traits via "genome scan" technologies (studies that sequentially examine DNA markers dispersed throughout the genome for evidence that one of them, or some subset of them, is likely to be near a locus that actually influences the trait) traditionally focused only on the individual effects of each locus, while possibly accounting for some gross residual or genetic background effects (see, e.g., Hager et al., 1998; Ferraro et al., 1999). Although useful, this strategy ignores the fact that researchers pursuing these studies typically have marker information spread over the whole genome. The information about allele sharing at other sites around the genome can be used to accommodate a wide variety of genetic factors that influence a quantitative trait over and above that due to any particular locus. This chapter has described how marker information around the genome can be used in this manner, although there are some very notable issues that need further examination. Foremost among these issues are concerns over the utility of a particular set of DNA markers for capturing locus-, region-, chromosome-, and genome-wide effects. Simply put, greater insight into the required density and information content of DNA markers for achieving adequate power for WG and GP studies is needed. In addition, it is important to assess just what increases in power, if any, may result from studies assessing a single-locus effect on a trait that accommodate WG and GP constructs.
The power of the proposed methods will undoubtedly reflect the use of appropriate type I error rates, given that multiple comparisons and multiple models are likely to be fit. This topic has been virtually ignored in this chapter for the sake of brevity and because our focus was on the models themselves. Related issues such as accommodating ascertainment, handling missing data (especially missing marker data), and assessing qualitative traits have also been ignored. It is hoped that this chapter will motivate studies into these issues and others of relevance.
Acknowledgments

Aspects of this work were supported, in part, by U.S. National Institutes of Health grants HL94011 (N.J.S.), HL54998-01 (N.J.S.), and RR03655-11 (Robert Elston), and by generous support from the Genset Corporation.
References

Almasy, L., and Blangero, J. (1998). Multipoint quantitative trait linkage analysis in general pedigrees. Am. J. Hum. Genet. 62, 1198-1211.
Amos, C. I. (1994). Robust variance components approach for assessing genetic linkage in pedigrees. Am. J. Hum. Genet. 54, 535-543.
Amos, C. I., Zhu, D. K., et al. (1996). Assessing genetic linkage and association with robust components of variance approaches. Ann. Hum. Genet. 60, 143-160.
Astemborski, J. A., Beaty, T. H., et al. (1985). Variance components analysis of forced expiration in families. Am. J. Med. Genet. 21, 741-753.
Beaty, T. H., Self, S. G., et al. (1985). Use of robust variance components models to analyze triglyceride data in families. Ann. Hum. Genet. 49, 315-328.
Blumenthal, T. (1998). Gene clusters and polycistronic transcription in eukaryotes. BioEssays 20, 480-487.
Boerwinkle, E., Chakraborty, R., et al. (1986). The use of measured genotype information in the analysis of quantitative phenotypes in man. I. Models and analytical methods. Ann. Hum. Genet. 50, 181-194.
Bonney, G. E. (1984). On the statistical determination of major gene mechanisms in continuous human traits: Regressive models. Am. J. Med. Genet. 35, 816-826.
Brosseau, J. D., Eelkema, R. C., et al. (1979). Diabetes among the three affiliated tribes: Correlation with degree of Indian heritage. Am. J. Public Health 69, 1277-1278.
Carrington, M., Nelson, G. W., et al. (1999). HLA and HIV-1: Heterozygote advantage and B*35-Cw*04 disadvantage. Science 283, 1748-1752.
Cavalli-Sforza, L. L., and Bodmer, W. F. (1971). "The Genetics of Human Populations." Freeman, San Francisco.
Cavalli-Sforza, L. L., Menozzi, P., et al. (1994). "The History and Geography of Human Genes." Princeton University Press, Princeton, NJ.
Cheverud, J. M., and Routman, E. J. (1995). Epistasis and its contribution to genetic variance components. Genetics 139, 1455-1461.
Ferraro, T. N., Golden, G. T., et al. (1999). Mapping loci for pentylenetetrazol-induced seizure susceptibility in mice. J. Neurosci. 16, 6733-6739.
Fisher, R. A. (1918). The correlation between relatives on the supposition of Mendelian inheritance. Trans. R. Soc. Edinburgh 52, 399-433.
Fulker, D. W., and Cherny, S. S. (1996). An improved multipoint sibpair analysis of quantitative traits. Behav. Genet. 26, 527-532.
Fulker, D. W., Cherny, S. S., et al. (1995). Multipoint interval mapping of quantitative trait loci using sib pairs. Am. J. Hum. Genet. 56, 1224-1233.
Goldgar, D. E. (1990). Multipoint analysis of human quantitative variation. Am. J. Hum. Genet. 47, 957-967.
Guo, S. W. (1995). Proportion of genome shared identical by descent by relatives: Concept, computation, and applications. Am. J. Hum. Genet. 56, 1468-1476.
Guo, S. W. (1996a). Gametogenesis processes and multilocus gene identity by descent. Am. J. Hum. Genet. 58, 408-419.
Guo, S. W. (1996b). Variation in genetic identity among relatives. Hum. Hered. 46, 61-70.
Hager, J., Dina, C., et al. (1998). A genome-wide scan for human obesity genes reveals a major susceptibility locus on chromosome 10. Nat. Genet. 20, 304-308.
Hasstedt, S. J. (1982). A mixed-model likelihood approximation on large pedigrees. Comput. Biomed. Res. 15, 295-307.
Hopper, J. L., and Mathews, J. D. (1982). Extensions to multivariate normal models for pedigree analysis. Ann. Hum. Genet. 39, 485-491.
Kempthorne, O. (1957). "An Introduction to Genetic Statistics." Wiley, New York.
Khoury, M. J., Beaty, T. H., et al. (1993). "Fundamentals of Genetic Epidemiology." Oxford University Press, New York.
Knowler, W. C., Williams, R. C., et al. (1988). Gm and type II diabetes mellitus: An association in American Indians with genetic admixture. Am. J. Hum. Genet. 43, 755-760.
Kruglyak, L., and Lander, E. S. (1995). Complete multipoint sib pair analysis of qualitative and quantitative traits. Am. J. Hum. Genet. 57, 439-454.
Kruglyak, L., Daly, M. J., et al. (1996). Parametric and nonparametric linkage analysis: A unified multipoint approach. Am. J. Hum. Genet. 58, 1347-1363.
Lalouel, J. M., Rao, D. C., et al. (1983). A unified model for complex segregation analysis. Am. J. Hum. Genet. 35, 816-826.
Lander, E. S., and Botstein, D. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121, 185-199.
Lander, E. S., and Schork, N. J. (1994). Genetic dissection of complex traits. Science 265, 2037-2048.
Lange, K., and Boehnke, M. (1983). Extensions to pedigree analysis. IV. Covariance components models for multivariate traits. Am. J. Med. Genet. 14, 513-524.
Lange, K., Westlake, J., et al. (1976). Extensions to pedigree analysis. III. Variance components by the scoring method. Ann. Hum. Genet. 39, 485-491.
Lui, B. H. (1998). "Statistical Genomics: Linkage, Mapping, and QTL Analysis." CRC Press, Boca Raton, FL.
Lynch, M., and Walsh, B. (1997). "Genetics and Analysis of Quantitative Traits." Sinauer Associates, Sunderland, MA.
Lyon, M. F., Rastan, S., et al., eds. (1996). "Genetic Variants and Strains of the Laboratory Mouse." Oxford University Press, Oxford.
Moll, P. P., Powsner, R., et al. (1979). Analysis of genetic and environmental sources of serum cholesterol variation in Tecumseh, Michigan. V. Variance components estimated from pedigrees. Ann. Hum. Genet. 42, 343-354.
Niehrs, C., and Pollet, N. (1999). Synexpression groups in eukaryotes. Nature 402, 483-487.
Olson, J. M. (1995a). Multipoint linkage analysis using sib pairs: An interval mapping approach for dichotomous outcomes. Am. J. Hum. Genet. 56, 788-798.
Olson, J. M. (1995b). Robust multipoint linkage analysis: An extension of the Haseman-Elston method. Genet. Epidemiol. 12, 177-194.
Ott, J. (1979). Maximum likelihood estimation by counting methods under polygenic and mixed models in human pedigree analysis. Am. J. Hum. Genet. 31, 161-175.
Risch, N., and Botstein, D. (1996). A manic depressive history. Nat. Genet. 12, 351-353.
Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273, 1516-1517.
Schork, N. (1991). Efficient computation of patterned covariance matrix mixed models in quantitative segregation analysis. Genet. Epidemiol. 8, 29-46.
Schork, N. J. (1992). Extended pedigree patterned covariance matrix mixed models for quantitative phenotype analysis. Genet. Epidemiol. 9, 73-86.
Schork, N. J. (1993a). The design and use of variance component models in the analysis of human quantitative pedigree data. Biometr. J. 4, 387-405.
Schork, N. J. (1993b). Extended multipoint identity-by-descent analysis of human quantitative traits: Efficiency, power, and modeling considerations. Am. J. Hum. Genet. 53, 1306-1319.
Schork, N., and Chakravarti, A. (1996). A nonmathematical overview of modern gene mapping techniques applied to human diseases. In "Molecular Genetics and Gene Therapy of Cardiovascular Disease" (S. Mockrin, ed.), pp. 79-109. Dekker, New York.
Schork, N. J., Weder, A. B., et al. (1994). The contribution of pleiotropy to blood pressure and body-mass index variation: The Gubbio study. Am. J. Hum. Genet. 54, 361-373.
Schork, N. J., Allison, D. B., et al. (1996a). Mixture distributions in human genetics research. Stat. Methods Med. Res. 5, 155-178.
Schork, N. J., Nath, S. P., et al. (1996b). Extensions of quantitative trait locus mapping in experimental organisms. Hypertension 28, 1104-1111.
Schork, N. J., Thiel, B., et al. (1997). Linkage analysis, kinship, and the short-term evolution of chromosomes. J. Exp. Zool. 34, 101-115.
Schwartz, S. (1994). The fallacy of the ecological fallacy: The potential misuse of a concept and the consequences. Am. J. Public Health 84, 819-824.
Searle, S. R., Casella, G., et al. (1992). "Variance Components." Wiley, New York.
Shriver, M. D., Smith, M. W., et al. (1997). Ethnic affiliation estimation by use of population-specific DNA markers. Am. J. Hum. Genet. 60, 957-964.
Smouse, P. E., Long, J. C., et al. (1986). Multiple regression and correlation extensions of the Mantel test of matrix correspondence. Syst. Zool. 35, 627-632.
Stern, M. P., and Haffner, S. M. (1990). Type II diabetes and its complications in Mexican Americans. Diabetes Metab. Rev. 6, 29-45.
Terwilliger, J. T., and Weiss, K. M. (1998). Linkage disequilibrium mapping of complex disease: Fantasy or reality? Curr. Opin. Biotechnol. 9, 578-594.
Thompson, E. A., and Shaw, R. G. (1992). Estimating polygenic models for multivariate data on large pedigrees. Genetics 131, 971-978.
Xu, S., and Atchley, W. R. (1995). A random model approach to interval mapping of quantitative trait loci. Genetics 141, 1189-1197.
Xu, S., and Gessler, D. D. G. (1998). Multipoint genetic mapping of quantitative trait loci using a variable number of sibs per family. Genet. Res. 71, 73-83.
22. Genetic Architecture of a Multivariate Phenotype

Saurabh Ghosh and Partha P. Majumder*
Anthropology and Human Genetics Unit
Indian Statistical Institute
Calcutta 700 035, India

I. Summary
II. Introduction and Objective
III. Scenarios and Models
IV. Methodology
V. Results and Discussion
VI. Conclusions
References
I. SUMMARY

A heritable multivariate quantitative phenotype comprises several correlated component phenotypes that are usually pleiotropically controlled by a set of major loci and environmental factors. One approach to decipher the genetic architecture of a multivariate phenotype, in particular to map the underlying loci, is to reduce the dimensionality of the data by means of a data reduction technique, such as principal component analysis. The extracted principal components are then analyzed in conjunction with marker data to map the underlying loci.

*To whom correspondence should be addressed.

Advances in Genetics, Vol. 42. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved. 0065-2660/01 $35.00
We have examined the efficiency of this approach with and without taking into account the correlation structure of the multivariate phenotype when extracting principal components. We have assumed that genome-wide scan data on sibpairs are available for low-density (widely spaced) and high-density markers. Using extensive simulations, based on three models of the multivariate phenotype, we have shown that although ignoring the correlation structure of the multivariate phenotype does not have any serious impact on the efficiency of mapping the underlying trait loci in wide marker intervals, there is a significant adverse effect of this practice for fine-mapping. We, therefore, recommend that the correlation structure of the multivariate phenotype be carefully examined to decide on the strategy of extracting principal components for deciphering the genetic architecture of the multivariate phenotype.
II. INTRODUCTION AND OBJECTIVE

One of the major current challenges in genetic epidemiology is to unravel the genetic architectures of complex traits. Quantitative variables, possibly correlated, generally underlie complex traits. Often, a dichotomous trait definition is adopted for such traits, based on cutoff points defined on suitable functions of the underlying quantitative variable(s). Examples are diabetes, hypertension, and schizophrenia. Such dichotomization often leads to loss of power in estimating genetic and environmental contributions to such traits, and in mapping the loci controlling such traits. Further, this approach to defining a phenotype may lead to inconsistencies in inferences across studies. Therefore, it is desirable to use the information on the set of underlying multivariate phenotypes. Often, individual components of the multivariate phenotype vector are analyzed separately, both to estimate genetic and environmental contributions and for gene mapping. This approach has many obvious pitfalls, including the statistical problem of multiple comparisons, especially when genome-wide scans are performed. It has been emphasized that the genetic dissection of complex traits and diseases may require study designs and statistical methods that are more sophisticated than those used in the analysis of simple Mendelian genetic traits and diseases (Lander and Schork, 1994). There is currently major interest in using data on multivariate phenotypes for genetic epidemiological analysis of complex traits. Methodologies have been developed, and there have been attempts to jointly analyze data of sibpairs, or of other sets of family members, on several correlated quantitative phenotypes as a single multivariate phenotype. Many models and approaches have been used, including variance components (Lange and Boehnke, 1983; Schork, 1993), a regressive model (Bonney et al., 1988; Moldin and Van Eerdewegh, 1995), a multivariate extension of the
Haseman-Elston model (Amos et al., 1990; Amos and Liang, 1993), and a structural equations model (Eaves et al., 1996). It has been noted that with a large number of components in a multivariate phenotype vector, the power of a multivariate analysis to detect linkage can be substantially lower than the power of an analysis applied to a "genetically relevant" phenotype (Ott and Rabinowitz, 1999). To circumvent the above-mentioned problem of power reduction, one approach that has been adopted is the application of data reduction techniques, such as principal components analysis or factor analysis, by which the dimension of the original multivariate phenotype vector is reduced and subsequent analyses are performed on a lower dimensional vector of a few linear combinations of the original phenotypes (Zlotnik et al., 1983; Hasstedt et al., 1994; Boomsma, 1996; Allison and Beasley, 1998; Ott and Rabinowitz, 1999). While this approach may overcome the problem of loss of statistical power to a certain extent, it is important to realize that unless the choice of variables from the vector of variables to be combined as a new quantitative phenotype is made judiciously, by using certain statistical and genetic principles, inferences may be grossly incorrect. Hasstedt et al. (1994) and Ott and Rabinowitz (1999) have emphasized the need to choose, in the final analysis, principal components that have high heritabilities. Majumder et al. (1998) have emphasized that an initial correlational analysis should be performed, using individual components of the multivariate phenotype vector, and that only the subset of variables that show high correlations within individuals in families should be chosen for further data reduction. They have also suggested that only the principal components whose coefficients show consistency across family members be chosen for final analysis.
Since the multivariate phenotype underlying a complex trait may be controlled by more than one locus, it is unclear whether an initial analysis and examination of the correlation structure of the variables should be carried out to identify subsets of variables, within each of which data reduction may be performed. The purpose of this chapter is precisely to examine this issue in the context of gene mapping using data on sibpairs. Our overarching goal is to propose a methodology for analysis of a multivariate phenotype for the purpose of mapping the underlying loci controlling the phenotype.
III. SCENARIOS AND MODELS

We assume that we have a phenotype vector X = (X1, X2, . . . , Xp). A set of genetic loci pleiotropically control individual components Xi of this phenotype vector. Since we assume the existence of pleiotropic effects, the number of loci, l, will necessarily be much smaller than p. We also entertain the
possibility that some of the individual components of the phenotype vector may not be under any genetic control, and may be solely determined by environmental effects. Further, even when an individual component is under genetic control, we accommodate the possibility of environmental effects on this component. For purposes of illustration and the simulation studies described subsequently, we consider three simple scenarios: cases 1-3, depicted in Figure 22.1. We consider a multivariate phenotype vector comprising seven individual components. In case 1, the component phenotype X1 is under the control of an autosomal biallelic locus, which also pleiotropically controls the component phenotypes X2 and X3. The component phenotype X4 is under
Figure 22.1. Three models of a multivariate phenotype X = (X1, X2, . . . , X7) considered. [Path diagrams not reproduced.]
the control of another autosomal biallelic locus, unlinked to the first locus; X5, X6, and X7 are pleiotropically controlled by this second locus. In case 2, X1, X2, and X3 are controlled similarly to case 1, but X4, X5, X6, and X7 are not under any genetic influence and are influenced only by environmental factors. In case 3, both loci have direct effects on some of the component phenotypes, as depicted in Figure 22.1. The model that we consider for case 1 is that Xi, i = 1, 2, 3, is distributed as normal with mean αi, βi, or -αi and variance σi², according as the genotype at the first locus is AA, Aa, or aa, where A and a are the alleles at that locus. Similarly, Xi, i = 4, 5, 6, 7, is distributed as normal with mean αi, βi, or -αi and variance σi², corresponding to the genotype BB, Bb, or bb at the second locus, where B and b are the alleles at that locus. In case 2, the model for Xi, i = 1, 2, 3, is identical to that in case 1. However, since Xi, i = 4, 5, 6, 7, is influenced only by environmental factors, the underlying distribution is normal with mean αi and variance σi², irrespective of genotype. The model for case 3 is identical to that for case 1, except that X4 and X5 are also influenced by the first locus (see Figure 22.1), and so their means depend on the genotype at that locus. An additive genotypic effects model is assumed for these two phenotypic components. Under the three scenarios considered, the seven-dimensional phenotype vector really comprises two subvectors, (X1, X2, X3) and (X4, X5, X6, X7). Our problem is to decipher the genetic architecture of the seven-dimensional phenotype. That is, under case 1, we would like to collect appropriate data and analyze the data to be able to map both loci; under case 2, we should be able to identify and map one locus; and so on. If we denote the expected correlation matrix of X as

    ρ = [ ρ11  ρ12 ]
        [ ρ21  ρ22 ]
then under the models specified earlier, the elements of the submatrices ρij, i, j = 1, 2, will follow certain patterns and constraints under the three scenarios. These are:

Case 1. The elements of ρ12 are all small in magnitude and are less than the elements of ρ11 and ρ22. We note that if common environmental or other common small genetic effects are absent, all the elements of ρ12 are expected to be zero.

Case 2. As for case 1, elements of ρ12 are expected to be close to zero. Further, the elements of ρ22 are also expected to be smaller in magnitude than the elements of ρ11.

Case 3. Because certain phenotypic components are controlled by more than one locus (Figure 22.1), the correlation matrix of X can be further
partitioned as explained later in the text and in Table 22.1:

    ρ = [ ρ11  ρ12  ρ13 ]
        [ ρ21  ρ22  ρ23 ]
        [ ρ31  ρ32  ρ33 ]
The elements of ρ13 are expected to be close to zero. Further, the elements of ρ12, ρ22, and ρ23 are expected to be smaller in magnitude than the elements of ρ11 and ρ33.
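These expected block patterns can be checked on data. The following is an illustrative sketch of our own (not from the chapter; the function names are ours): it computes the largest absolute correlation between components of two candidate subvectors, a value near zero supporting the proposed partition.

```python
def corr(xs, ys):
    # Pearson correlation of two equal-length samples.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def max_cross_block_corr(data, block1, block2):
    # Largest |rho_ij| between components of two candidate subvectors
    # (data is a list of observation rows; blocks are column indices).
    cols = list(zip(*data))
    return max(abs(corr(cols[i], cols[j])) for i in block1 for j in block2)
```

For case 1, for example, one would expect `max_cross_block_corr(data, [0, 1, 2], [3, 4, 5, 6])` to be small relative to the within-block correlations.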
Table 22.1. Simulation Parameter Values of αi, βi, and σi² for the Different Components of the Multivariate Phenotype

Case 1
  Phenotype     αi       βi       σi²
  X1             5        2        1
  X2            10        3        3
  X3            35       10       10
  X4             2        0        0.1
  X5            20        5        5
  X6            30        8       10
  X7           100       20       15

Case 2
  X1             5        2        1
  X2            10        3        3
  X3            35       10       10
  X4             2        -        0.1
  X5            20        -        5
  X6            30        -       10
  X7           100       -        15

Case 3
  X1             5        2        1
  X2            10        3        3
  X3            35       10       10
  X4*         2, 23     0, 4     0.1, 7
  X5*        20, 45     5, 12    5, 12
  X6            30        8       10
  X7           100       20       15

*Since X4 and X5 are controlled by both loci, the two sets of values pertain to the effects of the two unlinked loci on these components. The genotypic effects of the two loci were assumed to be additive.
IV. METHODOLOGY

A. Data reduction

The problem we seek to examine under these models and expectations is to compare the efficiencies of statistically deciphering the genetic architecture of the multivariate phenotype, with or without ignoring the expected correlation structure under the three scenarios listed. We have used the principal components approach. Ignoring the correlation structure implies that principal components are extracted by using the observations on the seven-dimensional phenotype vector. The first two principal components extracted from these data are denoted as PC-1 and PC-2, respectively. For taking the correlation structure into account, we have, for cases 1 and 2, extracted principal components based on the observations of the subvectors (X1, X2, X3) and (X4, X5, X6, X7) separately. For case 3, principal components based on the observations of the subvectors (X1, X2, X3, X4, X5) and (X4, X5, X6, X7) were extracted. We note that the two component phenotypes X4 and X5 are common to both subvectors. This is because both loci have effects on X4 and X5. The first principal components extracted from observations of these subvectors are denoted PC(1) and PC(2), respectively. For linkage analysis, we have used PC-1, PC-2, PC(1), and PC(2). We note that while for cases 1 and 2 the expected correlation between PC(1) and PC(2) is zero, for case 3 it is positive. This poses no problem in linkage analysis, however, because our method does not, at any stage, consider PC(1) and PC(2) jointly. We note that because of the specific scenarios considered by us, partitioning of the seven-dimensional phenotype vector into subvectors occurs naturally. In practice, to determine the most appropriate partitioning of the phenotype vector it will be necessary to try different permutations of the rows and columns of the correlation matrix, and to examine and perform tests of hypotheses on the structures of the submatrices.
This was not done here, because our purpose was to examine the effect of ignoring the correlation structure among the phenotypic variables on the efficiency of gene mapping.
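As a minimal illustration of the data-reduction step (our own sketch, not the authors' code), the coefficient vector of the first principal component can be obtained by power iteration on the sample covariance matrix; applying it separately to the observations on (X1, X2, X3) and (X4, X5, X6, X7) would yield PC(1) and PC(2).

```python
def first_pc(data, iters=200):
    # Power iteration for the leading eigenvector of the sample covariance
    # matrix; its entries are the coefficients of the first principal component.
    n, p = len(data), len(data[0])
    means = [sum(row[i] for row in data) / n for i in range(p)]
    x = [[row[i] - means[i] for i in range(p)] for row in data]
    cov = [[sum(x[k][i] * x[k][j] for k in range(n)) / (n - 1)
            for j in range(p)] for i in range(p)]
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(t * t for t in w) ** 0.5
        v = [t / norm for t in w]
    return v
```

The sign of the returned vector is arbitrary, which is why only the consistency of signs and magnitudes across runs (as examined in Tables 22.2-22.4) is meaningful.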
B. Mapping the quantitative trait loci

We have used a semiparametric method of quantitative trait locus (QTL) mapping proposed by us (Ghosh and Majumder, 2000). The data comprise observations on the principal components on pairs of siblings. We assume that a genome-wide scan has been performed and that genotype data at the various marker loci are available on these pairs of siblings. A two-stage variable stringency strategy is used. For completeness, we describe this method briefly. (Details are being published elsewhere, including theoretical justifications and demonstration, through extensive simulations, that the method performs very
well and much more efficiently than some of the currently used statistical methods for QTL mapping.) In a genome-wide scan, the general practice is to saturate relevant chromosomal regions with low-density markers and then to coarse-map the QTLs. Based on the results of coarse localization within wide marker intervals, such intervals that are identified as possibly containing the QTLs are further saturated with high-density markers and analyses carried out for fine-mapping the QTLs. Obviously, in the first stage (coarse mapping), efforts are made to minimize the possibility of falsely concluding that the QTL is not in a particular marker interval when indeed it is located in that interval. In the first step of our procedure, we use a rank correlation technique to coarse-map the underlying trait loci. (Fine-mapping is done in the second step of our algorithm.) We assume that we have an ordered set of k markers, which are equally spaced but are not dense (say, ~5 cM). We test whether the trait locus is at all linked to the k ordered marker loci considered. A natural test for linkage between the trait locus and the lth marker locus (l = 1, 2, . . . , k) is a test for the strength of correlation between the yj's and the π̂jl's, where yj is the squared difference in extracted principal components of the jth sibpair and π̂jl is the estimated IBD score of that sibpair at the lth marker. A nonparametric technique of testing for no correlation between the yj's and the π̂jl's is based on Spearman's rank correlation (see Randles and Wolfe, 1979). Since π̂jl can assume only five distinct values (viz. 0, 1/4, 1/2, 3/4, 1) when the parental genotypes are known, it is expected that there will be many ties in the π̂jl values. Thus, we need to use Spearman's rank correlation formula for the case of ties, which is given by

    Rl = [ (n² - 1)/12 - (Tu + Tv)/2 - (1/2n) Σj=1..n dj² ] / √{ [ (n² - 1)/12 - Tu ] [ (n² - 1)/12 - Tv ] },

where dj = rank(yj) - rank(π̂jl), Tu = Σi=1..p (ui³ - ui)/12n, and Tv = Σi=1..q (vi³ - vi)/12n, there being p ties in the yj's of lengths u1, u2, . . . , up and q ties in the π̂jl's of lengths v1, v2, . . . , vq. The test statistic is √(n - 1) Rl, which is asymptotically distributed as N(0, 1) under the null hypothesis of no correlation. Thus for a level-α test, the critical region is given by √(n - 1) |Rl| > zα/2, where zα/2 is the (1 - α/2)th quantile of a standard normal variate. If the null hypothesis of no correlation is accepted for all the k marker loci (the level of significance adjusted to α/k to account for the multiple tests), then we conclude that the trait locus is most probably not located on the same chromosome as the k marker loci. Using the foregoing test procedure, we select those marker loci for which the null hypothesis of no correlation between the yj's and the π̂jl's is rejected
(i.e., the marker loci that show evidence of linkage with the trait locus). In the next step, we consider two such consecutive marker loci as candidate markers flanking the trait locus. This interval is then saturated with further markers of higher density (say, 1 cM), and marker genotype data on members of the sibpairs are generated. In the second step, we try to fine-map the trait locus by means of nonparametric regression. Suppose, without loss of generality, that the ordered consecutive markers 1 and 2 are found to be linked to the trait locus. We assume a nonparametric additive regression model given by

    yj = ψ1(π̂j1) + ψ2(π̂j2) + ej,    j = 1, 2, . . . , n,

where ψ1 and ψ2 are real-valued functions of π̂j1 and π̂j2, respectively, and the ej are random errors. Estimates of ψ1 and ψ2 are obtained in steps and iteratively by means of kernel smoothing techniques (see Silverman, 1986). In this technique of nonparametric regression, the domains of the explanatory variables are divided into a number of windows. Local smoothing is carried out within each window, and appropriate adjustments are made to ensure continuity at window boundaries. Let the residual sum of squares corresponding to the foregoing regression be denoted by CV(1, 2), and in general, by CV(l, l + 1) when the lth and (l + 1)th marker loci are considered. The most likely position of the trait locus is given by the interval flanked by the ith and (i + 1)th marker loci, where i corresponds to CV(i, i + 1) = minl CV(l, l + 1).
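Returning to the first-stage screen, the tie-corrected Spearman statistic can be sketched as follows. This is our own illustrative implementation (not the authors' code); it assumes the squared sibpair differences yj and the estimated IBD proportions π̂jl at one marker are already in hand.

```python
import math
from collections import Counter

def ranks_with_ties(values):
    # Midranks: tied observations share the average of the ranks they occupy.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return r

def tie_correction(values, n):
    # T = sum over tie groups of (t^3 - t) / (12n), as in the text.
    return sum(t ** 3 - t for t in Counter(values).values()) / (12.0 * n)

def spearman_linkage_stat(y, pihat):
    # y[j]: squared sibpair difference of the principal component;
    # pihat[j]: estimated IBD proportion at the marker (0, 1/4, 1/2, 3/4, 1).
    # Returns (R_l, z), where z = sqrt(n - 1) * |R_l| is ~ N(0, 1) under H0.
    n = len(y)
    ry, rp = ranks_with_ties(y), ranks_with_ties(pihat)
    d2 = sum((a - b) ** 2 for a, b in zip(ry, rp))
    tu, tv = tie_correction(y, n), tie_correction(pihat, n)
    c = (n * n - 1) / 12.0
    r = (c - (tu + tv) / 2.0 - d2 / (2.0 * n)) / math.sqrt((c - tu) * (c - tv))
    return r, math.sqrt(n - 1) * abs(r)
```

In the procedure described above, z would be compared with the normal quantile corresponding to the Bonferroni-adjusted level α/k over the k markers.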
The kernel function used is

    K(t) = (3/4)(1 - t²), if |t| < 1; 0, otherwise.
Since nonparametric regression tends to overfit data (Silverman, 1986), we use the "leave-one-out" technique; that is, we leave out the jth observation in order to predict yj. For a given window length h, the total error in prediction is given by Rh = Σj=1..n (yj - ŷj)². The process is repeated for different window lengths. The optimal window length h* is given by that h for which Rh is minimum.
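The smoothing and bandwidth-selection steps can be sketched with a simple Nadaraya-Watson estimator using the Epanechnikov kernel above. This is an illustrative stand-in of our own: the chapter fits an additive model in two markers by iterative smoothing, whereas here a single explanatory variable is used for brevity.

```python
import math

def epanechnikov(t):
    # K(t) = (3/4)(1 - t^2) for |t| < 1, and 0 otherwise.
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0

def nw_predict(x0, xs, ys, h, skip=None):
    # Kernel-weighted local average at x0; `skip` leaves one observation out.
    num = den = 0.0
    for j, (x, y) in enumerate(zip(xs, ys)):
        if j == skip:
            continue
        w = epanechnikov((x0 - x) / h)
        num += w * y
        den += w
    return num / den if den > 0.0 else float("nan")

def loo_error(xs, ys, h):
    # R_h = sum_j (y_j - yhat_{-j})^2, the leave-one-out prediction error.
    total = 0.0
    for j in range(len(xs)):
        pred = nw_predict(xs[j], xs, ys, h, skip=j)
        if math.isnan(pred):
            return float("inf")  # window too narrow to reach any neighbour
        total += (ys[j] - pred) ** 2
    return total

def best_bandwidth(xs, ys, grid):
    # h* is the window length minimizing R_h over a candidate grid.
    return min(grid, key=lambda h: loo_error(xs, ys, h))
```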
C. Simulation

Our simulation procedure comprises generation of trait values of siblings under the models considered and also IBD scores based on marker genotype data. This
is done in several steps; the number of steps is variable for the different scenarios considered by us.
1. Case 1
In the first step of our simulation method, we generate the genotypic mating type of parents at each of the two unlinked trait loci from six-nomial distributions, with the cell probabilities being the probabilities of the different mating types. In the second step, we generate the genotypes of the sibpair at each of the trait loci from trinomial distributions, with cell probabilities being the conditional probabilities of the different trait genotypes given parental genotype information. In the third step, we generate, for both sibs, the three phenotypic values controlled by the first trait locus from a trivariate normal distribution, with appropriate mean vector and dispersion matrix as described in an earlier section. In the fourth step, we generate, for both sibs, the four phenotypic values controlled by the second trait locus from a 4-variate normal distribution, with suitable mean vector and dispersion matrix described earlier. In the fifth step, we generate the IBD scores of the sibpair at the two trait loci conditional on their trait genotypes using Table 1 of Haseman and Elston (1972). In the sixth step, we use Table 4 of the same work to generate the IBD scores of the sibpair at the two pairs of marker loci flanking the two trait loci conditional on the IBD scores at the two trait loci. In the seventh step, we generate the IBD score of the sibpair at different marker loci sequentially conditional on the IBD scores at the generated marker locus nearest to them, again using Table 4 of Haseman and Elston (1972). In the eighth step, we generate the estimated IBD scores at the different marker loci conditional on the generated IBD scores at the marker loci using Table 5 of Haseman and Elston (1972).
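The first three steps of this scheme, for a single biallelic locus, can be sketched as follows. This is our own illustrative code, not the authors'; it assumes Hardy-Weinberg proportions with allele frequency p, codes genotypes as the number of A alleles, and draws the phenotype from the genotype-dependent normal model of Section III.

```python
import random

def sib_genotypes(p, rng):
    # Draw two parental genotypes as pairs of alleles (1 = A, 0 = a) under
    # Hardy-Weinberg with allele frequency p; each parent then transmits one
    # randomly chosen allele to each of the two sibs.
    parents = [(int(rng.random() < p), int(rng.random() < p)) for _ in range(2)]
    return [sum(rng.choice(parent) for parent in parents) for _ in range(2)]

def phenotype(genotype, alpha, beta, sigma2, rng):
    # Normal phenotype whose mean is alpha, beta, or -alpha according as the
    # genotype is AA (2), Aa (1), or aa (0), with variance sigma2.
    mean = {2: alpha, 1: beta, 0: -alpha}[genotype]
    return rng.gauss(mean, sigma2 ** 0.5)
```

Generating correlated component phenotypes, as in the third and fourth steps, would replace the univariate draw by a multivariate normal draw with the appropriate dispersion matrix.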
2. Case 2
In the first step of our simulation method, we generate the genotypic mating type of parents at the trait locus from a six-nomial distribution, with the cell probabilities being those of the different mating types. In the second step, we generate the genotypes of the sibpair at the trait locus from a trinomial distribution, with cell probabilities being the conditional probabilities of the different trait genotypes given parental genotype information. In the third step, we generate, for both sibs, the three phenotypic values controlled by the trait locus from a trivariate normal distribution, with appropriate mean vector and dispersion matrix as described in an earlier section. In the fourth step, we generate, for both sibs, the four phenotypic values that are environmental in nature from a 4-variate normal distribution with mean vector and dispersion matrix described earlier. The method of generation of IBD scores at different marker loci is identical to case 1.
3. Case 3

The method of generation of the genotypes of the sibpairs at the two trait loci conditional on the parental genotypes at the two loci is identical to that used in case 1. Next, we generate, for both sibs, the seven phenotypic values sequentially (the three phenotypes controlled solely by the first trait locus, then the two phenotypes controlled jointly by the first and the second trait loci, and finally the two phenotypes controlled solely by the second trait locus) from a 7-variate normal distribution with appropriate mean vector and dispersion matrix described in an earlier section. (The mean of each phenotypic value jointly controlled by the two trait loci is assumed to be the sum of the marginal means of the phenotypic value at each trait locus.) The method of generation of IBD scores at different marker loci is identical to that used in case 1. The parameter values used in the simulations are presented in Table 22.1. All results are based on 100 simulation runs for each case. For linkage analysis, each simulation run is based on 100 sibpairs.
V. RESULTS AND DISCUSSION

As mentioned earlier, after generating data on the multivariate phenotype vector, we extracted the first principal component from the data on each of the two subvectors (X1, X2, X3) and (X4, X5, X6, X7); these are denoted PC(1) and PC(2), respectively. We also extracted the first two principal components, PC-1 and PC-2, from data on the complete seven-dimensional phenotype vector. We present in Tables 22.2-22.4 the coefficients of the various principal components for the two siblings, and the percentages of variance explained (PVE) by the principal components, for five simulation runs, for each of the three cases considered. Consistency in the signs and magnitudes of the coefficients of the various principal components across simulation runs for each case is obvious. Both PC(1) and PC(2) explain between 85 and 90% of the variance under case 1. For this case, PC-1 and PC-2, on the other hand, each explain about 40-55% of the variance. As is intuitively expected, for case 2, while the percentages of variance explained by PC-1 and PC-2 remain roughly the same as for case 1, there is a large decrease in this percentage for PC(2) and a large increase for PC(1). For case 3, the percentages of variance explained by PC(1) and PC(2) are about the same as in case 1, while those explained by PC-1 and PC-2 are about 5% higher and lower, respectively, than in case 1. For cases 1 and 3, correlations of values of principal components between siblings are similar regardless of whether the correlation structure of the phenotypic variables is taken into account before data reduction (Table 22.5). For case 2, however, since the phenotypic components (X4, X5, X6, X7)
Table 22.2. Coefficients of Various Principal Components and Percentages of Variance Explained by the Components for a Pair of Siblings for Five Independent Simulation Runs for Case 1. [The column layout of this table is scrambled in the scan; individual entries are not reproduced.]
Table 22.3. Coefficients of Various Principal Components and Percentages of Variance Explained by the Components for a Pair of Siblings for Five Independent Simulation Runs for Case 2. [The column layout of this table is scrambled in the scan; individual entries are not reproduced.]
Table 22.4. Coefficients of Various Principal Components and Percentages of Variance Explained by the Components for a Pair of Siblings for Five Independent Simulation Runs for Case 3. [The column layout of this table is scrambled in the scan; individual entries are not reproduced.]
are not influenced by any genetic factor, the sib-sib correlation of PC(2) is very small (Table 22.5). Thus, sib-sib correlation values do not provide a very big clue about the underlying genetic architecture of the multivariate phenotype, except insofar as they help us to identify phenotypic components that may not be under any major genetic control.

Table 22.5. Means and Standard Deviations of Correlation Coefficients between Values of Various Principal Components for a Pair of Siblings under the Three Scenarios Considered

                        Sib-sib correlation
  Case   Type of PC     Mean     SD
  1      PC(1)          0.53     0.03
         PC(2)          0.55     0.05
         PC-1           0.55     0.08
         PC-2           0.54     0.08
  2      PC(1)          0.55     0.04
         PC(2)          0.12     0.02
         PC-1           0.46     0.07
         PC-2           0.44     0.08
  3      PC(1)          0.61     0.05
         PC(2)          0.58     0.05
         PC-1           0.58     0.07
         PC-2           0.57     0.07

To assess the performance of the rank correlation statistic in identifying the interval location of the QTL, we generated data on two unlinked sets of 100 ordered, equispaced markers, such that the recombination fraction between any two consecutive markers is 0.05. Simulated data were generated assuming no linkage between the two trait loci. For ease of presentation, we shall assume that the loci are on two separate chromosomes, arbitrarily called chromosomes 1 and 2. Each chromosome, as mentioned earlier, is saturated by a set of 100 markers. We arbitrarily assume that the first trait locus is flanked by the 24th and
25th markers on chromosome 1, and the second trait locus is flanked by the 60th and 61st markers on chromosome 2. We further assume that the recombination fraction between the first trait locus and the 24th marker of the first set is 0.02, while that between the second trait locus and the 60th marker of the second set is 0.03. The nature of the absolute rank correlation between the different markers and the squared difference in the selected principal components of the sibpairs mentioned earlier is presented in Figures 22.2 and 22.3 for case 1, Figure 22.4 for case 2, and Figures 22.5 and 22.6 for case 3. From Figure 22.2a, we find that the absolute rank correlation increases with the proximity of the considered marker to the first trait locus when PC(1) is considered. Moreover, the peaks were at the 24th marker on chromosome 1, correctly indicating the approximate location of the first trait locus. However, the absolute rank correlations are uniformly very low for all the 100 markers when PC(2) is considered. This is expected, since PC(2) is a function of X4, X5, X6, and X7, which are not controlled by the first trait locus on chromosome 1. Similarly, we find from Figure 22.2b that the absolute rank correlation increases with the proximity of the considered marker to the second trait locus on chromosome 2 when PC(2) is considered. The peaks were correctly detected at the 60th marker. The absolute rank correlations are also, as desirable, uniformly very low for all the 100 markers when PC(1) is considered. Each of PC-1 and PC-2, which are the first two principal components constructed on the basis of all seven component phenotypes, is, on the other hand, expected to detect both loci, and they indeed do so, as is evident from Figure 22.2c,d.
Thus, under the scenario that all component variables are genetically controlled, ignoring the correlation structure of the multivariate phenotype has no major effect on the ability to correctly identify the coarse (5-cM) marker interval in which the QTLs are located. In fact, as the graphs in Figure 22.3 show, there is very little variation across simulation runs, which indicates the high efficiency of the method. Having identified the interval in which the QTL may be located, in practice one saturates this interval with more dense markers (say, at 1-cM density) to obtain a finer location of the QTL. To simulate this practice, we considered data on multiple markers that are more densely located within the coarse interval identified at the preceding stage. In our simulations, we generated data on a set of five ordered markers. We use the following notation:

θ2, θ3 = recombination fractions between the trait locus and the nearest flanking markers 2 and 3, respectively
θ1 = recombination fraction between markers 1 and 2
θ4 = recombination fraction between markers 3 and 4
θ5 = recombination fraction between markers 4 and 5.
Figure 22.2. Rank correlations between squared difference of values of principal components and identity-by-descent scores estimated from genotype data at various ordered marker loci along a chromosome, based on a sample of 100 sibpairs for case 1: (a) along markers on chromosome 1; solid and dotted lines correspond to PC(1) and PC(2), respectively; (b) along markers on chromosome 2; solid and dotted lines correspond to PC(2) and PC(1), respectively; (c) along markers on chromosome 1; solid and dotted lines correspond to PC-1 and PC-2, respectively; (d) along markers on chromosome 2; solid and dotted lines correspond to PC-1 and PC-2, respectively.
Figure 22.3. Rank correlations between squared difference of values of principal components and identity-by-descent scores estimated from genotype data at various ordered marker loci along a chromosome, based on a sample of 100 sibpairs for case 1. Results of multiple simulation runs: (a) along markers on chromosome 1 for PC(1); (b) along markers on chromosome 2 for PC(2); (c) along markers on chromosome 1 for PC-1 and PC-2 (the two bands of lines correspond to these two principal components); (d) along markers on chromosome 2 for PC-1 and PC-2 (the two bands of lines correspond to these two principal components).
Figure 22.4. Rank correlations between squared difference of values of principal components and identity-by-descent scores estimated from genotype data at various ordered marker loci along chromosome 1, based on a sample of 100 sibpairs for case 2: (a) solid and dotted lines correspond to PC(1) and PC(2), respectively; (b) solid and dotted lines correspond to PC-1 and PC-2, respectively; (c) results of multiple runs for PC(1); (d) results of multiple runs for PC-1 and PC-2 (the two bands of lines correspond to these two principal components).
Figure 22.5. Rank correlations between squared difference of values of principal components and identity-by-descent scores estimated from genotype data at various ordered marker loci along a chromosome, based on a sample of 100 sibpairs for case 3: (a) along markers on chromosome 1; solid and dotted lines correspond to PC(1) and PC(2), respectively; (b) along markers on chromosome 2; solid and dotted lines correspond to PC(2) and PC(1), respectively; (c) along markers on chromosome 1; solid and dotted lines correspond to PC-1 and PC-2, respectively; (d) along markers on chromosome 2; solid and dotted lines correspond to PC-1 and PC-2, respectively.
[Figure 22.6. Results of multiple simulation runs for case 3: rank correlations plotted against marker position, panels (a)-(d).]
We used simulation parameter values of M = 5 and θ1 = θ2 = θ3 = θ4 = θ5 = 0.01. For each set of parameter values, we performed 100 replications; each replication is based on data on 100 sibpairs. Fine-mapping was performed by means of the nonparametric kernel smoothing regression technique described in an earlier section. The results are given in Table 22.6.

Table 22.6. Percentages of Correct Identification of the Two QTLs with Various Principal Components under the Three Scenarios Considered

Case 1
Type of PC    Locus 1 (%)    Locus 2 (%)
PC(1)         86             0
PC(2)         0              92
PC-1          43             52
PC-2          51             58

Case 2
Type of PC    Correct identification of Locus 1 (%)
PC(1)         83
PC-1          70
PC-2          66

Case 3
Type of PC    Locus 1 (%)    Locus 2 (%)
PC(1)         79             27
PC(2)         18             74
PC-1          52             47
PC-2          45             41

It is seen from Table 22.6 that, for the first scenario, while PC(1) identified the correct fine interval of the first locus in 86% of the runs, the corresponding percentages for PC-1 and PC-2 were 43 and 51%, respectively. As expected, PC(2) did not detect the correct interval. Similarly, PC(2) located the second trait locus correctly in 92% of the cases, while the corresponding percentages for PC-1 and PC-2 were much lower: 52 and 58%, respectively. PC(1) did not at all identify the correct location of the second QTL. Moreover, we have found that whenever PC(1) and PC(2) yielded incorrect fine-interval identifications for the first and second QTL, respectively, they identified the corresponding trait locus in an adjacent interval. However, the same was not true when PC-1 and PC-2 were used. The conclusion, therefore, is that while ignoring the correlation structure among component phenotypes results in no serious loss of efficiency in localizing the QTLs in coarse marker intervals when the multivariate phenotype is primarily controlled by major loci with large effects, there is a serious loss of efficiency for the purpose of fine-mapping the QTLs.

Under the second scenario, case 2, in which the multivariate phenotype is partially controlled by QTLs with major effects, we find that the detection of peaks by means of rank correlation for the first (and only) trait locus gives results almost identical to those of case 1 (Figure 22.4a). PC(2) did not falsely detect any peak on this chromosome, because it is solely influenced by environmental components (Figure 22.4a). Both PC-1 and PC-2 also correctly detected (Figure 22.4b) the interval location of the trait locus controlling (X1, X2, X3). Further, variation across simulation runs was small (Figures 22.4c,d). Under this scenario, at the fine-mapping stage, the percentage of correct identification of locus 1 using PC(1) is 83%, while those using PC-1 and PC-2 are 70 and 66%, respectively (Table 22.6, case 2). Since PC(2) is a function of purely environmental components, it is clearly irrelevant for localizing the first locus and therefore was not considered. Thus we find that although ignoring the correlation structure of the multivariate phenotype did not affect the efficiency of localizing the underlying trait locus in a coarse marker interval even when some of the components of the multivariate phenotype were not under genetic control, there was a serious adverse effect when fine-mapping was performed.
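The text refers to a nonparametric kernel smoothing regression for the fine-mapping stage. The authors' exact procedure is not reproduced here; as a hedged sketch, a generic Nadaraya-Watson smoother with a Gaussian kernel applied to invented five-marker data would look like this:

```python
import math

def nadaraya_watson(x_train, y_train, x_query, bandwidth):
    # Gaussian-kernel weighted average of y at each query point.
    out = []
    for xq in x_query:
        w = [math.exp(-0.5 * ((xq - xt) / bandwidth) ** 2) for xt in x_train]
        s = sum(w)
        out.append(sum(wi * yi for wi, yi in zip(w, y_train)) / s)
    return out

# Illustrative data: the mean squared sib-pair PC difference dips near
# position 2.0, mimicking a QTL between the middle markers of five.
positions = [0.0, 1.0, 2.0, 3.0, 4.0]   # marker positions (cM scale)
y = [1.0, 0.8, 0.2, 0.7, 1.1]           # mean squared PC difference
grid = [i * 0.1 for i in range(41)]     # fine grid across the interval
smooth = nadaraya_watson(positions, y, grid, bandwidth=0.5)
best = grid[min(range(len(grid)), key=lambda i: smooth[i])]
print(best)  # estimated fine location = argmin of the smoothed curve
```

The fine location estimate is taken as the minimizer of the smoothed curve, reflecting the idea that the squared sib-pair difference is smallest at markers tightly linked to the trait locus.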
Under the third scenario, case 3, we find that the results based on rank correlation are identical to those of case 1 (Figures 22.5a-d), except that PC(2) and PC(1) also detected the peaks at the 24th marker on chromosome 1 and at the 60th marker on chromosome 2, respectively (Figures 22.5c,d), although the absolute rank correlations were lower than those obtained when the other three principal components were considered. Since X4 and X5 are controlled by both trait loci, and PC(1) and PC(2) are functions of X4 and X5, these principal components located the trait loci fairly well on the basis of rank correlations. Moreover, the magnitudes and trends of the rank correlations over the various marker intervals are consistent across simulation runs (Figures 22.6a-d). At the fine-mapping stage, we found that the percentages of runs correctly localizing the trait loci were, in general, slightly smaller (Table 22.6, case 3) than the corresponding percentages for case 1. However, PC(1) was also able to detect the correct interval of the second locus in 27% of the cases and PC(2) the first locus in 18% of the runs. As explained earlier, this is because X4 and X5 are controlled by both trait loci.
VI. CONCLUSIONS

We have investigated the method of statistical treatment of sibpair data on a multivariate phenotype for deciphering its genetic architecture, specifically for fine localization of the underlying loci, if any. We have considered three scenarios: (1) disjoint components of the multivariate phenotype are pleiotropically controlled by a set of major loci; (2) a subset of the components of the multivariate phenotype is pleiotropically controlled by a major locus, but the remaining subset is influenced only by environmental factors; and (3) the multivariate phenotype is pleiotropically controlled by a set of major loci, some of which can jointly influence some component phenotypes. These three scenarios yield different expected correlation structures of the multivariate phenotype. We have examined the importance of taking these correlation structures into account for statistical data analyses. We have proposed, in keeping with a widely adopted strategy (Zlotnik et al., 1983; Hasstedt et al., 1994; Boomsma, 1996; Allison and Beasley, 1998; Ott and Rabinowitz, 1999), that the multivariate phenotype be reduced to a smaller number of phenotypes by the use of principal components. In extracting the principal components, the underlying correlation structure is often ignored, and the first few principal components are used for deciphering the genetic architecture of the multivariate phenotype. However, we posited that ignoring the underlying correlation structure may lead to loss in efficiency of mapping the loci controlling the multivariate phenotype. We therefore performed simulations under the three scenarios described in Section V and extracted principal components with and without taking the underlying correlation structure into account.
We have found that when principal components are used in low-density (wide) marker intervals, ignoring the underlying correlation structure of the multivariate phenotype has no major effect on the ability to map the loci controlling the phenotype. However, for the purpose of fine-mapping the quantitative trait loci, that is, localizing them in narrow (say, 1-cM) marker intervals, there is considerable loss in statistical efficiency unless the underlying correlation structure is taken into account at the data reduction stage, that is, for the extraction of principal components. We therefore suggest that the empirical correlation matrix of the components of a multivariate phenotype be critically examined, by means of suitable row/column permutations of the correlation matrix and appropriate tests of hypotheses pertaining to various submatrices of the correlation matrix, to identify patterns of relationships among the components. The patterns identified should then be used at the data reduction stage. If principal components are extracted without consideration of the underlying correlation structure of the multivariate phenotype, the loci controlling the phenotype cannot be mapped efficiently and huge sample sizes may be required.
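The contrast drawn here, extracting principal components from the full correlation matrix versus separately within its correlated blocks, can be illustrated with a small sketch. The block sizes and correlation values below are invented for illustration, and pure-Python power iteration stands in for a full eigendecomposition:

```python
def first_pc(corr):
    # Leading eigenvector of a correlation matrix via power iteration.
    n = len(corr)
    v = [1.0] * n
    for _ in range(200):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Block-structured correlation matrix: X1-X3 correlate with each other,
# X4-X7 correlate with each other, near-zero correlation across blocks.
r_in, r_out = 0.6, 0.05
corr = [[1.0 if i == j else (r_in if (i < 3) == (j < 3) else r_out)
         for j in range(7)] for i in range(7)]

# Ignoring structure: one PC over all 7 components mixes the two blocks.
pc_all = first_pc(corr)

# Using structure: extract a PC within each block separately
# (these play the role of PC(1) and PC(2) in the text).
block1 = [row[:3] for row in corr[:3]]
block2 = [row[3:] for row in corr[3:]]
pc1, pc2 = first_pc(block1), first_pc(block2)
print([round(x, 2) for x in pc1])  # equal loadings within the block
```

The block-wise components load only on the variables their QTL controls, which is why they retain fine-mapping power where the whole-matrix components dilute it.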
Acknowledgment This work was partly supported by a grant from the National Institute of General Medical Sciences (GM 28719).
References

Allison, D. B., and Beasley, M. (1998). A method and computer program for controlling the familywise alpha rate in gene association studies involving multiple phenotypes. Genet. Epidemiol. 15, 87-101.
Amos, C. I., and Liang, A. E. (1993). A comparison of univariate and multivariate tests for genetic linkage. Genet. Epidemiol. 10, 671-676.
Amos, C. I., Elston, R. C., Bonney, G. E., Keats, B. J. B., and Berenson, G. S. (1990). A multivariate approach for detecting linkage, with application to a pedigree with an adverse lipoprotein phenotype. Am. J. Hum. Genet. 47, 247-254.
Bonney, G. E., Lathrop, G. M., and Lalouel, J. M. (1988). Combined linkage and segregation analysis using regressive models. Am. J. Hum. Genet. 43, 29-37.
Boomsma, D. I. (1996). Using multivariate genetic modeling to detect pleiotropic quantitative loci. Behav. Genet. 26, 161-166.
Eaves, L. J., Neale, M. C., and Maes, H. (1996). Multivariate multipoint linkage analysis of quantitative trait loci. Behav. Genet. 26, 519-525.
Ghosh, S., and Majumder, P. P. (2000). A two-stage variable-stringency semi-parametric method for mapping quantitative trait loci using genome-wide scan data on sib-pairs. Am. J. Hum. Genet. 66, 1046-1061.
Haseman, J. K., and Elston, R. C. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2, 3-19.
Hasstedt, S. J., Hunt, S. C., and Wu, L. L. (1994). Evidence for multiple genes determining sodium transport. Genet. Epidemiol. 11, 553-568.
Lander, E. S., and Schork, N. J. (1994). Genetic dissection of complex traits. Science 265, 2037-2048.
Lange, K., and Boehnke, M. (1983). Extensions to pedigree analysis. IV. Covariance components models for multivariate traits. Am. J. Med. Genet. 14, 513-524.
Majumder, P. P., Moss, H. B., and Murrelle, L. (1998). Familial and nonfamilial factors in the prediction of disruptive behaviors in boys at risk for substance abuse. J. Child Psychol. Psychiat. 39, 203-213.
Moldin, S. O., and Van Eerdewegh, P. (1995). Multivariate genetic analysis of oligogenic disease. Genet. Epidemiol. 12, 801-806.
Ott, J., and Rabinowitz, D. (1999). A principal-components approach based on heritability for combining phenotype information. Hum. Hered. 49, 106-111.
Randles, R. H., and Wolfe, D. A. (1979). "Introduction to the Theory of Nonparametric Statistics." Wiley, New York.
Silverman, B. W. (1986). "Density Estimation for Statistics and Data Analysis." Chapman Hall, New York.
Schork, N. J. (1993). Extended multipoint identity-by-descent analysis of human quantitative traits: Efficiency, power, and modeling considerations. Am. J. Hum. Genet. 53, 1306-1319.
Zlotnik, L. H., Elston, R. C., and Namboodiri, K. K. (1983). Pedigree discriminant analysis: A method to identify monogenic segregation. Am. J. Med. Genet. 15, 307-313.
On the Resolution and Feasibility of Genome Scanning Approaches

Joseph D. Terwilliger
Department of Psychiatry and Columbia Genome Center, Columbia University, New York, New York 10032
Department of Neuroscience, New York State Psychiatric Institute, New York, New York
I. Summary
II. Introduction
III. Mapping a Locus with Known Genotypes
IV. Mapping a Locus Influencing Some Phenotype (Genotypes Uncertain)
V. Allowing for Trait Locus Genotype Misclassifications
VI. "Model-Based" and "Model-Free" Methods: A False Dichotomy
VII. Complex Disease Gene Mapping: Is It Possible?
VIII. Conclusions
References
I. SUMMARY

Before contemplating a genome scan to identify the map position of disease-predisposing genes, an investigator should have prior evidence of the genes' existence. It is therefore logically consistent to evaluate a genome scan experiment as an estimation problem, rather than as a hypothesis-testing problem, since absent prior evidence of the existence of disease genes, it is probably unwise to conduct the experiment at all. Recombination in a single meiosis can be modeled as a point process along the chromosome, and linkage or linkage disequilibrium (LD) mapping statistics are a simple function of the superposition of the recombination processes occurring in all meioses under study. Thus, multipoint lod scores are shown to be step functions, in the absence of ambiguity about the inheritance of chromosomal segments. The ability to map a disease gene is a function of how well the ascertained phenotypes predict the underlying trait locus genotypes. This chapter presents a thorough investigation of the properties of the multipoint lod score and uses results from renewal theory to examine the effects of deviations from a deterministic phenotype-genotype relationship. The quality of estimated gene locations is assessed through computing the mean and variance of the length of the expected 3-lod-unit support interval around the maximum likelihood estimate. The more deterministic the model, the smaller this interval is. A more exact quantification of details of this effect is used to describe the statistical properties of such genome scanning experiments from the perspective of estimation, with appropriately little regard to hypothesis testing. Hypothesis testing, however, is discussed as an appropriate context to describe linkage and LD analysis in situations where candidate genes are being screened, since only there does one have definable null and alternative hypotheses that have not been rejected before the beginning of the experiment. By contrast, it is hoped that the null hypothesis "there is no gene affecting this phenotype" has been rejected by other means before an expensive genome scan is even contemplated (though that this is often not done is probably the main problem!).

Advances in Genetics, Vol. 42. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.
II. INTRODUCTION Linkage disequilibrium (LD) and linkage alike are properties that correlate the genotypes of loci between and within defined pedigree structures, respectively. Both these sources of genotypic correlation have been exploited to identify the chromosomal locations of loci whose genotypes either are known or can be accurately predicted on the basis of some set of observed phenotypes. Genome scans have been performed to localize such genes against a genome spanning map of phenotypically silent marker loci, which have been used to infer the segregation patterns of each chromosomal location, such that one can map the new locus to the chromosomal location where it would least perturb the observed pattern of recombination among the marker loci. This is the fundamental basis of linkage analysis when meiotic segregation patterns can be directly observed. In LD analysis, it must be assumed that individuals with a given inferred genotype have received identical alleles from common ancestors, and in this sense, the same principle applies, though the exact pedigree structure is unknown, making it difficult to accurately determine which genomic segments are autozygous on the basis of marker genotypes alone (cf. Clark, 1999).
23. Genome Scans as Stochastic Processes
I begin this chapter with a description of a genome scan in terms of the underlying recombination processes in each meiosis. Initially, the theory of how to map a test locus with known genotype will be derived, assuming completely accurate knowledge of all genotypes at the test locus and complete knowledge of the chromosomal location of every recombination breakpoint in a data set. Deviations from this perfect world will be introduced one at a time, to describe the effects of each on estimability of the chromosomal map position of the test locus. It will be shown that it does not take much deviation from a deterministic phenotype-genotype relationship to suck all the power out of a mapping study. Renewal theory will be used to explore effects of trait locus modeling errors, marker genotyping errors, marker map inaccuracies, ascertainment schemes, and the like. In light of the results of these analyses, it may be necessary to develop better experimental designs, and better means of predicting which genotypes of which loci are most likely to be directly affecting the phenotype of interest, rather than relying on pure reverse genetics to unravel the etiology of multifactorial phenotypes.
III. MAPPING A LOCUS WITH KNOWN GENOTYPES

A. Linkage analysis

Let us begin by considering how one can use linkage and LD to estimate the chromosomal location of a test locus with known genotypes (GD) against a genome-spanning map of marker loci (GM), as outlined in Figure 23.1. In every meiosis, each chromosome undergoes the process of crossing over, and by definition, the expected number of crossovers per morgan (M) is 1. There is evidence that in any individual meiosis, the locations of individual recombination breakpoints are not independent (i.e., there probably is some recombination interference; e.g., Weber et al., 1993). Nevertheless, they do occur according to some point process. If we have a set of M meioses in a sample, and we can determine with precision where the recombination breakpoints occur in each meiosis, and we know without error the inheritance pattern of our test locus in the same meioses, we know that there must be some position in the genome that cosegregates with the test locus in every meiosis (see also Kruglyak and Lander, 1995). Figure 23.2 shows the recombination breakpoints in five meioses; the test locus is actually located at the position (D) indicated at the bottom of the figure. In each meiosis shown, the inherited chromosome alternates between the two parental homologs (shown as alternating between two states, one elevated and one depressed). In this case, the upper phase for each chromosome represents the parental homolog that was inherited at the test locus (D). At the bottom of the figure, there is an X at each chromosomal position where a recombination breakpoint occurred in one of the meioses in the study. It can be
Figure 23.1. Linkage and linkage disequilibrium (LD) are the phenomena that correlate the genotypes of marker loci (GM) with the genotypes of the test locus (GD), assuming that both can be determined without error. This is the best-case scenario for mapping a system with unknown chromosomal location through exploitation of these correlations. In a linkage and/or LD analysis attempting to identify the chromosomal location of a locus that affects some phenotype as well, this is the only source of correlation that is a function of linkage and/or LD.
shown that when the number of meioses in a study is large, this superposition of sparse point processes weakly converges to a Poisson process with mean distance between recombination events E[X] = 1/M morgans (Palm, 1943; Khintchine, 1960; Grigelionis, 1963). Because of the effects of length-biased sampling (cf. McFadden, 1962; Schaeffer, 1972; Simon, 1980; Hemenway, 1982; Resnick, 1994; Terwilliger et al., 1997), the distribution of the distance from the disease locus itself to the nearest recombination event in any meiosis on either side of it will be distributed as an exponential random variable with mean distance (1/M) morgans (X ~ exp(M)). This means that the length of the entire region to which the disease gene can be localized in M meioses would be the sum of two such exponential random variables, which is distributed as an E[2,M] (E = Erlang) random variable, which has mean length (2/M) morgans (cf. Nelson, 1995; Terwilliger et al., 1997). In Figure 23.3, the proportion of the observed meioses that cosegregate with the test locus is graphed as a step function across the chromosome being studied. In this case, the only region where all meioses cosegregate with the test locus is the region including the test locus. In general, the mean proportion of the genome where all M meioses will cosegregate with the test locus by chance will be (0.5)^M, while the mean length of the segment surrounding the test locus where all meioses cosegregate with it will be (2/M) morgans. For this sample set of five meioses shown in Figure 23.2, the resulting traditional multipoint lod score is a simple step function, as shown in Figure 23.4, where the lod score is −∞ at every genomic position where there is at least one meiosis in which that position does not cosegregate with the test locus
Figure 23.2. Chromosomal segregation in five meioses is diagrammed to illustrate the effects of the individual recombination processes on the overall multipoint lod score over the length of the entire chromosome. At genomic positions where the chromosome is shown to be slightly elevated, it has the same grandparental source as the test locus allele inherited in that meiosis. Recombination breakpoints alternate between the two parental homologs, such that the test locus does not cosegregate with chromosomal positions where the heavy line is depressed. The test locus to be mapped is actually located at the position indicated by D on the bottom of the figure, such that the test locus cosegregates with its own chromosomal location in all five meioses. The bottom line of the graph represents the superposition of these five meiotic recombination processes, where the distribution of the distance between the X's converges to exp[M] when the number of meioses M becomes large. The segment around position D, which must cosegregate with the test locus genotypes in all meioses, has mean length between adjacent recombination events that is twice as long, on average, and is distributed as E[2,M].
and has value Z = M log10(2) elsewhere (note that the expected proportion of the genome where all M meioses will cosegregate with the test locus is 10^(−Z), the pointwise false positive rate).
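These step-function properties are easy to check numerically. The sketch below (illustrative, not from the chapter) drops breakpoints as a Poisson process of rate 1 per morgan in each of M meioses on a 2-morgan chromosome with the test locus at its center, and measures the segment around the locus that cosegregates in all meioses; its mean length should be near 2/M morgans, while the lod ceiling is M·log10(2):

```python
import math
import random

random.seed(7)

def cosegregating_segment(n_meioses, chrom_len=2.0, locus=1.0):
    # For each meiosis, drop breakpoints as a Poisson process (rate 1/morgan)
    # and track the breakpoints nearest the test locus on either side,
    # over the superposition of all meioses.
    left, right = 0.0, chrom_len
    for _ in range(n_meioses):
        x = 0.0
        while True:
            x += random.expovariate(1.0)   # inter-breakpoint distance ~ exp(1)
            if x >= chrom_len:
                break
            if x < locus:
                left = max(left, x)        # nearest breakpoint left of locus
            else:
                right = min(right, x)      # nearest breakpoint right of locus
    return right - left                    # region cosegregating in all meioses

M = 50
lengths = [cosegregating_segment(M) for _ in range(2000)]
mean_len = sum(lengths) / len(lengths)
print(round(mean_len, 3))                  # close to 2/M = 0.04 morgans
print(round(M * math.log10(2), 1))         # lod ceiling M*log10(2)
```

The segment length is the sum of two approximately exponential distances with mean 1/M each, matching the E[2,M] result quoted in the text.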
B. Linkage disequilibrium analysis

LD is a much more stochastic phenomenon than linkage, and it is much less predictable. It results from the confluence of a number of population genetic phenomena (cf. Terwilliger et al., 1998; Terwilliger and Weiss, 1998) and decays rapidly as one moves away from the test locus. Assume that we know
Figure 23.3. Superposition of the five meioses diagrammed in Figure 23.2. Note that the graphed quantity, the proportion of the meioses in which the test locus cosegregates with each chromosomal position, either increases or decreases by increments of 1/M at each recombination breakpoint in the superposition of the five recombination processes (bottom of Figure 23.2). In the context of complex linkage analysis, this proportion is C/M (variables defined in the text).
the complete sequence of the genome in every individual, and for now, let us assume as well that we can also identify all autozygous chromosomal segments. If we assume that the rarer allele of the test locus was created by a mutation event at only one time in history, then all copies of this allele in the population today would be clonal descendants of this historical mutation event. However, the number of meiotic events connecting the alleles in any sample from today’s population to their common ancestor chromosome is a complicated function of evolutionary history, including population demographics, selection, and genetic drift (see Terwilliger et al., 1998, for more detailed discussion of the forces that generate and erode LD). Different samples of k alleles from today’s population will likely have resulted from different numbers of meiotic events connecting them back to their nearest common ancestor. The highly stochastic nature of evolutionary history can make it very difficult to make predictions about the length of the region that would be conserved and coinherited with the rare allele of the test locus throughout the history of the population. If we knew that our sample of chromosomes were separated from their common ancestor by M meioses, then the length of the autozygous shared segment, Y, would have a distribu-
23. GenomeScans as StochasticProcesses
357
Figure 23.4. Graph of traditional multipoint lod score based on meioses shown in Figure 23.2. The multipoint lod score is computed as logIo[(l)C(0)M - C]/(0.5)M, which has a value of M logra(2) at chromosomal positions that cosegregate with the test locus in all M meioses, and --w elsewhere, resulting in the simple biphasic step function shown. Note that on average the proportion of the genome in which the lod score will have value 2 = 2 logls(2) is (0.5)M = lo-‘, which happens to be the pointwise upper bound, from the Markov (1913) inequality, for the p value for complex-valued lod scores (see text) maximized over E.
tion that is E[2, M], as in the linkage model described in the foregoing. However, if there is uncertainty in the value of M, the variance of the length of this shared segment can become enormous. To evaluate how variable this region is, let us make some simplifying assumptions. First, let us assume that all k chromosomes in our sample are descendants of a single common ancestor, in some gigantic pedigree, as in Figure 23.5 (i.e., no recurrent mutation). Second, let us assume the Wright-Fisher model (Fisher, 1930; Wright, 1931) (i.e., constant population size N, absence of selection, etc.), and further that all k chromosomes have been ascertained to carry clonal copies of some allele, D, at our test locus, where allele D is assumed to have population frequency p_D. Under these highly restrictive assumptions, the number of generations between the time a sample of k chromosomes is ascertained and the time those k chromosomes coalesce to exactly k − 1 ancestors, T_k, is distributed as exp[k(k − 1)/(4Np_D)], as in Figure 23.5 (Kingman, 1982; Tajima, 1983). Accordingly, both the mean and variance of the T_k increase as k decreases, as E[T_k] = 4Np_D/[k(k − 1)] and Var[T_k] = {4Np_D/[k(k − 1)]}². The total number of meioses connecting such a random sample of k D alleles to their nearest common ancestor can be computed as M = Σ_{i=2}^{k} iT_i (see Figure 23.5), which has the following mean and variance:

E[M] = Σ_{i=2}^{k} i E[T_i] = 4Np_D Σ_{i=1}^{k−1} 1/i,

Var[M] = Σ_{i=2}^{k} i² Var[T_i] = 16(Np_D)² Σ_{i=2}^{k} 1/(i − 1)².

Joseph D. Terwilliger

Figure 23.5. Coalescent model for relatedness of k = 4 chromosomes ascertained to carry allele D at the test locus, identical by descent from some common ancestor. Under the Wright-Fisher model (Fisher, 1930; Wright, 1931), which assumes constant population size N and absence of selection among other regularity conditions, the distribution of the time (in generations) for a random sample of k chromosomes to have exactly (k − 1) ancestors for the first time is exp[k(k − 1)/(4Np_D)], as indicated. The time to the nearest common ancestor is distributed as the sum of such exponential distributions, T = Σ_{i=2}^{k} T_i, and the number of meioses separating the sample of k chromosomes from their nearest common ancestor is M = Σ_{i=2}^{k} iT_i, as can be seen from summing the branch lengths indicated in the coalescent tree.
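These moments are easy to check by simulation. The following sketch (Python; `simulate_M` is a hypothetical helper name, and the code assumes, as above, that each T_i is an independent exponential with rate i(i − 1)/(4Np_D)) draws replicates of M and compares the empirical mean with the closed-form expression:

```python
import random

def simulate_M(k, NpD, rng):
    """One draw of M = sum_{i=2..k} i*T_i, with T_i ~ Exp(rate = i(i-1)/(4*NpD))."""
    return sum(i * rng.expovariate(i * (i - 1) / (4.0 * NpD)) for i in range(2, k + 1))

def mean_M(k, NpD):
    """Closed form: E[M] = 4*NpD * sum_{i=1}^{k-1} 1/i."""
    return 4.0 * NpD * sum(1.0 / i for i in range(1, k))

def var_M(k, NpD):
    """Closed form: Var[M] = 16*(NpD)^2 * sum_{i=2}^{k} 1/(i-1)^2."""
    return 16.0 * NpD ** 2 * sum(1.0 / (i - 1) ** 2 for i in range(2, k + 1))

rng = random.Random(1)
k, NpD = 25, 100
draws = [simulate_M(k, NpD, rng) for _ in range(20000)]
emp_mean = sum(draws) / len(draws)
print(f"analytic E[M] = {mean_M(k, NpD):.1f}, simulated mean = {emp_mean:.1f}")
```

Note that the variance sum is dominated by its first term (i = 2), so most of the variability in M comes from the long terminal branches near the common ancestor.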
23. Genome Scans as Stochastic Processes

Figure 23.6. For the model outlined in Figure 23.5, the expected number of meioses connecting a sample of k chromosomes to their nearest common ancestor, E[M], is graphed as a function of Np_D. As the number of chromosomes in the sample increases, the expected number of meioses connecting them together increases by a progressively smaller amount, meaning that the resolution of the LD mapping study improves by a smaller and smaller amount per sample as the sample size increases.

In Figure 23.6, E[M] is graphed for a range of k, as a function of Np_D. Clearly, as the number of chromosomes in the sample increases, the number of meioses connecting them becomes larger, but the rate of increase slows down dramatically as k increases, meaning that the expected length of the autozygous segment shared by the entire set of ascertained chromosomes decreases only slowly with increasing k, since the distribution of the length of the segment shared by all k chromosomes ascertained to carry clonal copies of allele D would be E[2, M], where M has the distribution just outlined. What this translates into is an enormous variance in the size of the shared segment among all persons in the sample, due to the uncertainty in the pedigree structure that relates this random sample of individuals. The probability density function for the segment length would be:
f_Y(y) = ∫_{m=0}^{∞} f_{Y|M=m}(y) f_M(m) dm = ∫_{m=0}^{∞} m² y e^{−my} f_M(m) dm.
However, because M is a weighted sum of exponential random variables, whose density is difficult to describe in closed form, the density function f_Y(y) was estimated by simulation of 50,000 replicates of M as

f̂_Y(y) = Σ_{i=1}^{50,000} (1/50,000) m_i² y e^{−m_i y},
where in each replicate a value m_i was simulated from the distribution described for f_M(m). Figure 23.7a plots f̂_Y(y) for the case of Np_D = 100, k = 25. To show the effects of Np_D, Figure 23.7b plots an estimate of the 95% confidence interval for the length Y shared by k = 25 chromosomes as a function of Np_D. The 95% confidence limits were determined as c_L and c_U, such that

0.025 = ∫_{0}^{c_L} f̂_Y(y) dy and 0.025 = ∫_{c_U}^{∞} f̂_Y(y) dy.

Figure 23.7. The density function for the length of the autozygous segment shared by 25 chromosomes ascertained to contain allele D at the test locus was estimated by simulation to approximate the integral

f_Y(y) = ∫_{m=0}^{∞} f_{Y|M=m}(y) f_M(m) dm = ∫_{m=0}^{∞} m² y e^{−my} f_M(m) dm.

Because M is a weighted sum of exponential random variables, this integral is not trivial to evaluate. For given values of Np_D and k, 50,000 replicates of m were simulated from the distribution of M, to estimate f̂_Y(y) = Σ_{i=1}^{50,000} (1/50,000) m_i² y e^{−m_i y}. (A) The density function estimated for the case Np_D = 100 and k = 25 for the length of the autozygous segment shared by all k chromosomes (x-axis: y = shared segment length in cM). (B) The mean length of the autozygous segment shared by such a sample of k = 25 chromosomes as a function of Np_D, together with its upper and lower 95% confidence interval, as explained in the text.
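Under the same assumptions, the simulation behind Figure 23.7 can be sketched as follows (Python; function names are hypothetical): each replicate draws m from the distribution of M and then a segment length Y as the sum of two exponentials with rate m (the two-sided length), and the empirical 2.5th and 97.5th percentiles give c_L and c_U:

```python
import random

def simulate_M(k, NpD, rng):
    """M = sum_{i=2..k} i*T_i with T_i ~ Exp(rate = i(i-1)/(4*NpD))."""
    return sum(i * rng.expovariate(i * (i - 1) / (4.0 * NpD)) for i in range(2, k + 1))

def shared_segment_draws(k, NpD, n, rng):
    """Draw Y | M = m as the sum of two Exp(m) lengths (one in each direction)."""
    out = []
    for _ in range(n):
        m = simulate_M(k, NpD, rng)
        out.append(rng.expovariate(m) + rng.expovariate(m))
    return out

rng = random.Random(2)
ys = sorted(shared_segment_draws(25, 100, 50000, rng))
cL, cU = ys[int(0.025 * len(ys))], ys[int(0.975 * len(ys))]
print(f"c_L = {cL * 100:.3f} cM, c_U = {cU * 100:.3f} cM")
```

The heavy right skew of the simulated distribution is the point made in the text: the mean segment length lies well above the median, so long shared segments are the exception rather than the rule.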
Most of the research promoting LD analysis for mapping complex traits focuses on either the expected LD (ignoring its variance) or promotes specific examples of unusually high LD, which is expected to be observed in some proportion of studies. But the variance is so large that one must be careful about overstating the power of these studies. One cannot predict LD accurately because there is so much variation. Note that the distribution f_Y(y) is highly skewed, and the upper tail is rather thin, making the segments in the upper end of this range very much the exception rather than the rule. For the same reasons, it is also very difficult to get any accurate estimate of the time to coalescence of all k alleles based on the observed length of the shared autozygous segment in a sample: there is an enormous variance of this quantity for any fixed total time, T = Σ_{i=2}^{k} T_i, which has a very complicated relationship to Y, the length of the shared segment among all k ascertained chromosomes. This analysis was based on very strict assumptions. If we include a range of plausible selection in the model, both the mean and the variance of Y will tend to increase. Furthermore, if we include mutation in the model (such that new mutations in population history will make segments that are coinherited from a common ancestor differ at the sites where new alleles have arisen), the likely inference will be that the segments are shorter than the coinherited regions. This will decrease the mean length of IBD segments and likewise increase the variance of the observed LD. If the population has undergone a rapid expansion, the shared segment lengths are likewise highly stochastic (a crude approximation would be to replace N by the effective population size N_e in the analysis). One should note that even in the rare diseases of the Finnish disease heritage (Norio et al., 1973), there is enormous variability in observed shared segment lengths (Varilo, 1999). For example, shared autozygous segments of several centimorgans are not unusual in many of these rare recessive diseases (cf. Houwen et al., 1994; Nikali et al., 1995; Peltonen et al., 1999). However, polycystic lipomembranous osteodysplasia with sclerosing leukoencephalopathy (PLO-SL), a very rare recessive disease, had an observed autozygous segment of at most 153 kb in a sample of only 24 chromosomes, which is so small that an LD-based genome scan looking for completely autozygous shared segments would not have been successful, despite the rarity of the disease allele (p_D ≈ 1/150,000) (Pekkarinen et al., 1998). Linkage is predictable. One can map a locus with known genotypes against an infinitely dense map of fully informative markers with good predictability to a small proportion of the entire genome, which is a regular function of the number of informative meioses in the sample. LD has the potential to identify an extremely small chromosomal segment in which the test locus is likely to reside; but the actual length cannot be accurately predicted a priori.
If one makes inferences about the mean length of such segments, and formulates a study design without considering the enormous variance, substantially greater amounts of work may be needed, as in the example of PLO in Finland (Pekkarinen et al., 1998).
IV. MAPPING A LOCUS INFLUENCING SOME PHENOTYPE (GENOTYPES UNCERTAIN)

A. Linkage analysis

When a linkage analysis is performed to identify the chromosomal location of a gene that predisposes to some disease, the situation is more complicated than that described in Section III. As shown in Figure 23.8, one still hopes to identify marker loci in some chromosomal region (G_M) that always cosegregate with the genotypes (G_D) of the locus that are correlated with the disease phenotypes (Ph)
Figure 23.8. Figure 23.1 showed that linkage and linkage disequilibrium mapping is based on predictable correlations between marker genotypes (G_M) and genotypes of the test locus (G_D). Unfortunately, in studies of the genetic basis of observed phenotypes, we rarely know the genotypes of our test locus. Rather we observe a set of phenotypes (Ph), which are influenced by the genotypes of the test locus, which is correlated with its flanking marker loci due to linkage and LD. We can observe only Ph and G_M, which can be correlated only when each of them is correlated with the hidden node in this analysis, G_D. No matter what method is used to analyze the data, the existence of marker loci correlated to G_D due to linkage and/or LD will not ensure the success of a mapping study based on phenotypic observations. The most important factor in mapping is how well Ph predicts G_D. With one-to-one correspondence, mapping will be powerful. If correspondence is weak, it will be tough to map genes even when there is linkage and LD correlating G_D and G_M.
that are actually observed. However, one cannot often uniquely determine the genotype of each individual at this underlying test locus, and therefore one cannot infer with certainty whether the disease alleles cosegregate with any given chromosomal segments. To this end, one must partition the likelihood of the
data over all possible underlying disease locus genotypes as follows, as suggested by Figure 23.8:

L ∝ P(Ph, G_M) = Σ_{G_D} P(Ph | G_D) P(G_D | G_M) P(G_M),
whereas in the case of mapping a test locus with known genotype, the likelihood was simply L ∝ P(G_D, G_M) = P(G_D | G_M) P(G_M) (Figure 23.1). In the deterministic case, whenever the test locus recombined with the chromosomal position being tested, the likelihood was 0 (i.e., Z = −∞). When one averages the likelihood over all admissible combinations of test locus genotypes, however, it is rare to have obligate recombinants, and thus there is a great deal less precision in the fine-mapping of a test locus. There is a wide range of possible genotype-phenotype correlations that could exist. The more deterministic the relationship, the better will be our ability to map the genes (because in reality the phenotypes accurately predict the underlying genotypes). But if the effect of a risk genotype is to increase the probability of getting a disease by only a small factor, there can be very little information left about the location of the disease gene. Essentially, the effect is to smooth out the lod score curve by flattening everything toward zero. The shape of this curve will still look like a regular step function (see discussion of Figure 23.9 in Section IV.B). Every time one reaches a position on the chromosome where a recombination event has occurred in one of the meioses under study, the value of this likelihood will change upward or downward by some increment, and the distance between such recombination events will be distributed as exp[M], where M is the number of meioses in the sample. However, the peaks may be less high, and the valleys in the function will not be as low, leading to reduced ability to localize the underlying gene to a small region of the genome, and also making the variance of the maximum likelihood estimate of the location of the gene much greater. Instead of restricting the location of the gene to a segment with mean length 2/M morgans, it will be much larger.
The weaker the model, the larger the region to which the test locus can be mapped with a fixed sample size (see also Terwilliger, 1998).
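A toy numerical sketch (Python; all penetrances and probabilities below are hypothetical illustration values, not estimates from the chapter) shows how averaging the likelihood over the unobserved G_D flattens the evidence when the genotype-phenotype relationship is weak:

```python
import math

def likelihood(ph, penetrance, p_gd):
    """L ∝ Σ_{G_D} P(Ph | G_D) P(G_D | marker data); the P(G_M) factor cancels in ratios."""
    return sum(penetrance[g][ph] * p_gd[g] for g in penetrance)

# Hypothetical penetrance models for a two-class test locus genotype.
strong = {'D': {'aff': 0.99, 'unaff': 0.01}, 'd': {'aff': 0.01, 'unaff': 0.99}}
weak   = {'D': {'aff': 0.15, 'unaff': 0.85}, 'd': {'aff': 0.10, 'unaff': 0.90}}

linked   = {'D': 0.9, 'd': 0.1}   # marker strongly predicts G_D at this position
unlinked = {'D': 0.5, 'd': 0.5}   # marker independent of G_D (null)

for pen, name in [(strong, 'near-deterministic'), (weak, 'weak-risk')]:
    lr = likelihood('aff', pen, linked) / likelihood('aff', pen, unlinked)
    print(f"{name} model: lod contribution per affected = {math.log10(lr):.3f}")
```

Under the near-deterministic model each affected observation contributes about 0.25 lod units here, versus about 0.06 under the weak model: identical marker information yields far less mapping evidence, which is exactly the flattening toward zero described above.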
B. Linkage disequilibrium analysis

In LD analysis the problem is of the same nature as in linkage analysis. One can think of it as ascertainment of a set of chromosomes from people with a disease. This no longer means, however, that they are all expected to share the same disease allele in common, as would be the case if a fully penetrant homogeneous disease were being studied, as in the Finnish disease heritage (Norio et al., 1973; Varilo, 1999). In this case, only some proportion of the chromosomes from a sample of k diseased individuals are predicted to share a common disease allele.
Since LD analysis is the same as linkage analysis on a gigantic pedigree of unknown structure (i.e., the coalescent tree shown in Figure 23.5), one is again forced to convert the disease phenotypes into genotypes using some algorithm. One can again compute the likelihood as P(G_D | G_M) P(G_M) in the case of genotypes that are known at the test locus, but when test locus genotypes are unknown, one must use the formulation

P(Ph, G_M) = Σ_{G_D} P(Ph | G_D) P(G_D | G_M) P(G_M).

This makes it impossible to restrict the region based on shared segment analysis with precision, as one rarely knows whether any given chromosome contains a disease allele, thus increasing the variance of the estimate of gene location and decreasing the precision of those estimates. Since LD is a highly stochastic phenomenon to begin with, this imprecision in disease locus genotypes can lead either to complete absence of any detectable signals or to enormous support intervals, where one cannot tell whether a phenotypically active locus exists. Again, as in linkage analysis, the LD will decay according to an underlying step function, where the steps should occur at intervals with distribution that is exp[M], where M is distributed as a weighted sum of exponential distributions as earlier. Note that now there is the complicating factor that the number of chromosomes in the sample carrying the "D" allele is also unknown, adding another source of variance to the estimates of gene location resulting from the nondeterministic genotype-phenotype relationship at the trait locus.
V. ALLOWING FOR TRAIT LOCUS GENOTYPE MISCLASSIFICATIONS

A. Complex-valued recombination fractions

A simple way to model the nondeterministic relationships between genotypes and phenotypes would be to use a misclassification model. Let us assume that we can deterministically assign some reasonable set of genotypes for the test locus on the basis of the phenotypes, accepting that there will be errors. If we had a large enough sample, eventually we would find that no genomic position cosegregated with the assigned test locus genotypes in all meioses, and thus the traditional multipoint lod score would assume a value of −∞ across the entire genome, without discriminating with respect to how many meioses did not appear to perfectly cosegregate. Goring and Terwilliger (2000a) introduced a model of complex-valued recombination fractions to handle this situation explicitly. In their model, a parameter was added to the likelihood for each possible genomic location of the
disease gene that measured the frequency of cosegregation of a given genomic location with the assigned putative test locus genotypes. In their model, P(cosegregation of test locus genotypes and genomic location x_D) = (1 − ε), where ε is related to the probability of trait locus genotype misclassification. In a genome scan, this will allow for a discrimination between places in which there is only one meiosis in which no cosegregation occurs and those in which half the meioses are not cosegregating with disease locus genotypes. In "model-free" analysis, 1 − ε is related to the concept of IBD probabilities for sets of affected relatives (see later and Goring and Terwilliger, 2000d), and in "model-based" analysis, it is related to the inflation of the recombination fraction estimates endemic in linkage analysis with an incorrect model assumption for the trait (cf. Risch and Giuffra, 1992). If one allows for this extra parameter in the pedigree likelihood, then the discrepancies between "model-based" and "model-free" methods dissipate, and one can view the latter as a subtype of the former. Formally, Goring and Terwilliger (2000a) applied this extra parameter in a "complex-valued" extension of the concept of the recombination fraction, θ + εi, where θ = P(recombination occurs between marker M and x_D), and ε = P(inferred disease locus genotypes do not cosegregate with x_D). The latter represents the effect of errors in the inferred disease locus genotypes, where misclassifying them will lead to either a recombinant being inferred to be nonrecombinant or vice versa. In two-point linkage analysis with a single marker locus, the probability of an observed recombinant in a given meiosis would be θ + ε − 2θε, in which case one cannot separately estimate the two parameters because they are completely confounded, as analyzed in detail by Goring and Terwilliger (2000a).
In multipoint analysis there is only this one error parameter, relating to the cosegregation of inferred trait locus genotypes with x_D, which becomes estimable as the marker density increases, such that in the limiting case, one has essentially a return of the step functions described earlier in connection with "model-based" linkage analysis. In contrast to the earlier situation, where trait locus genotypes were assumed to be known without error and a single deviation from cosegregation of trait and marker loci was sufficient to exclude a chromosomal position from harboring "the gene," now it is the proportion of meioses that cosegregate with the test locus that determines the significance of the linkage evidence. If we return to the meioses diagrammed in Figure 23.2, and allow for complex-valued recombination fractions, a somewhat different step function results. In Figure 23.3, the proportion of meioses cosegregating with the test locus was graphed along the chromosome, and we can use multipoint linkage analysis with complex recombination fractions to test the null hypothesis that 50% of meioses cosegregate with the test locus at that chromosomal position
(i.e., H_0: ε = 0.5), with the alternative hypothesis that more than 50% of meioses have the test locus cosegregating with the chromosomal position being analyzed (i.e., H_1: ε < 0.5). Every time a crossover event occurs in one of the M meioses, the proportion of meioses cosegregating with the test locus either increases or decreases by 1/M. The likelihood of the test locus being at any given point, assuming some fixed set of inferred test locus genotypes, is given by L ∝ P(G_D | x_D) = ε^{M−C} (1 − ε)^C, where M is the total number of meioses, and C is the number of meioses in which the test locus genotypes cosegregate with chromosomal location x_D. The null hypothesis would be that the marker segregates independently of the test locus, in which case ε = 0.5, resulting in a "lod score" statistic of the following form:

Z = log_10 { max_ε [ε^{M−C} (1 − ε)^C] / 0.5^M } = log_10 { [(M − C)/M]^{M−C} [C/M]^C / 0.5^M }  ∀ C ≥ 0.5M;

Z = 0 otherwise.
However, if we were to conduct a two-point linkage analysis with a fully informative marker locus at chromosomal position x_D, the resulting traditional two-point lod score would be identical to this, since

Z = log_10 { max_θ [θ^{M−C} (1 − θ)^C] / 0.5^M } = log_10 { [(M − C)/M]^{M−C} [C/M]^C / 0.5^M }  ∀ C ≥ 0.5M;

Z = 0 otherwise,
because those meioses in which the test locus genotypes do not cosegregate with the marker genotypes would be inferred to be recombinants. The equivalence of multipoint lod scores with complex recombination fractions and traditional two-point lod scores has been proven in general (Goring and Terwilliger, 2000a).
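The step values of this statistic are simple to compute. A minimal sketch (Python; `complex_lod` is a hypothetical name) implements Z as a function of M and C, with Z = 0 whenever C ≤ 0.5M:

```python
import math

def complex_lod(M, C):
    """Z = log10( max_eps eps^(M-C) * (1-eps)^C / 0.5^M ) for C >= M/2, else 0.
    The maximizing value is eps_hat = (M - C)/M."""
    if C <= M / 2:
        return 0.0
    def xlogy(x, y):             # define 0 * log(0) := 0, for the C == M boundary
        return 0.0 if x == 0 else x * math.log10(y)
    return xlogy(M - C, (M - C) / M) + xlogy(C, C / M) - M * math.log10(0.5)

print(complex_lod(10, 10))   # complete cosegregation: 10*log10(2) ≈ 3.01
print(complex_lod(10, 8))    # two discordant meioses
print(complex_lod(10, 5))    # null proportion: 0.0
```

This is term-by-term the same quantity as the traditional two-point lod with θ̂ = (M − C)/M, as the equivalence result just cited requires.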
B. Consequences of test locus misclassification

As shown in Figure 23.9, this complex multipoint lod score is a predictable step function across the genome, which can be used to estimate the chromosomal location of the test locus. For those interested in some sort of hypothesis testing, the genome-wide distribution of the complex-valued lod score must be the same as that of the conventional two-point lod score with fully informative markers,
Figure 23.9. Graph of the complex multipoint lod score for the meiotic observations shown in Figure 23.2. The complex multipoint lod score is computed as Z(x, ε) = log_10 { max_{ε ∈ [0, 0.5]} [ε^{M−C(x)} (1 − ε)^{C(x)}] / 0.5^M }, where C(x) is the number of meioses in which the test locus cosegregates with chromosomal position x and M is the total number of meioses in the sample. Note that Z(x, ε) will exceed some critical value K by chance on a proportion of the genome that is asymptotically equal to 0.5 P[χ²₁ > 2K ln(10)], and in small samples, the Markov (1913) upper bound on this probability is 10^{−K}. Furthermore, the value of this complex lod score can never be less than 0, by definition.
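The two tail approximations in the caption can be compared directly (Python; using the identity P[χ²₁ > x] = erfc(√(x/2)) so that no statistics library is needed):

```python
import math

def genome_fraction_asymptotic(K):
    """Asymptotic proportion of the genome with complex lod > K:
    0.5 * P[chi2_1 > 2*K*ln(10)], with P[chi2_1 > x] = erfc(sqrt(x/2))."""
    return 0.5 * math.erfc(math.sqrt(K * math.log(10)))

for K in (1, 2, 3):
    print(f"K={K}: asymptotic {genome_fraction_asymptotic(K):.2e}  vs  Markov bound {10.0 ** -K:.0e}")
```

For K = 3 the asymptotic proportion is roughly 10^{−4}, about an order of magnitude below the Markov bound of 10^{−3}, consistent with the comparison to the deterministic case made in the text.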
the theory for which has been discussed elsewhere (cf. Dupuis et al., 1995; Terwilliger et al., 1997). In my opinion, as already stated, genome scans are not really hypothesis-testing experiments, but rather estimation experiments, and the main effect of adding this extra parameter to the lod score is a loss of precision in the estimate of chromosomal location of the disease locus. How can we quantify the loss of precision? Let us consider the effects of single recombination events along the chromosome on the lod score function. Clearly there is a strong effect of both the sample size and ε (and accordingly the maximum lod score). As one moves away from the maximum likelihood estimate of x_D, the distance to the next recombination event (at which point the lod score shifts to a new value) is distributed as exp(M), as before. However, with probability ε, a meiosis that previously was not cosegregating with disease will begin to cosegregate (i.e., C → C + 1), and with probability (1 − ε), the opposite occurs and the disease shifts from cosegregating to not cosegregating (C → C − 1). These relationships define the chromosome-wide behavior of the lod score statistic in general
(cf. Terwilliger et al., 1997) and indicate that when C is close to M/2, it is highly stochastic whether C (and accordingly the lod score when C ≥ M/2) goes up or down; if C is large, on the other hand, it will tend to drop toward M/2, the null hypothesis expected value of C. Thus the behavior of the lod score along the chromosome follows a random walk with a bias toward the null hypothesis. When the maximum lod score occurs with an estimate of ε̂ = 0, the lod score will be sharply peaked around the true location of the disease, while if ε̂ is closer to 0.5, the lod score curve will be very flat around the maximum, and the chromosomal localization of the disease locus will be poor. In the conventional multipoint lod score, the proportion of the genome that is predicted to show complete cosegregation with the test locus by chance (i.e., the false positive rate) was shown to be 10^{−K}, but once less than perfect cosegregation is allowed for, one must rely on asymptotic theory for likelihood ratio tests, which predicts that this proportion would be equal to 0.5 P[χ²₁ > 2K ln(10)], which is roughly an order of magnitude smaller than in the deterministic case. On the other hand, one needs substantially more data to exceed this threshold, and the lod score will tend to decay around the true location of the disease locus much more slowly, making the support interval for a fixed sample size tend to be larger (accordingly reducing the power of a study) when there is a less than deterministic phenotype-genotype relationship in a sample. Because the exact closed-form solution for the distribution of the length of the 3-lod-unit support interval is extremely complex, a simulation study was performed to determine the effect of ε̂ on the length of the support interval for different numbers of informative meioses, M.
In this case, conditional on C = (1 − ε̂)M meioses in which the inferred test locus genotypes cosegregate with some chromosomal position x_D, a random walk was simulated in both directions from x_D. The first quantity simulated in each replicate was the distance to the next recombination event in one of the M meioses as we move away from x_D. This distance was simulated from the distribution exp[M]. At this recombination breakpoint, with probability C/M, a cosegregating meiosis becomes a noncosegregating meiosis (C → C − 1), and otherwise the opposite occurs (C → C + 1). This simulated random walk is continued until the lod score falls below Z_max − 3 in each direction. The length of the support interval is the distance between the positions on either side of x_D where the lod score first drops below Z_max − 3. Based on 50,000 replicates of this simulation for an assumed value of ε̂ and M, the distribution of the length of this 3-lod-unit support interval was estimated. Interestingly, the ratio of the mean length of this support interval divided by the expected length of the support interval in the absence of test locus genotyping errors (which is distributed as E[2, M]) appears to be independent of the sample size (at least over the simulated range M ∈ [50, 5000]). Moreover, it follows an exponential shape, as shown in Figure 23.10, at least over the simulated
range ε̂ ∈ [0, 0.4]. For ε̂ > 0.4, sample sizes of 5000 informative meioses are insufficient to have a maximum lod score significantly greater than 3, in which case the entire genome would be in the support interval. To get a handle on the variability of the length of the support interval, for the case M = 200, the upper and lower bounds of a 95% confidence interval are presented, based on the simulated distribution of lengths, where 2.5% of the time the length was less than the lower curve in Figure 23.11, and 2.5% of the time the length was larger than the upper curve in Figure 23.11; the mean length is graphed in the middle. As ε̂ becomes larger, not only does the mean increase substantially, but also the variability. What this all means is that when the mode of inheritance cannot be accurately specified, the ability to fine-map a locus is not good at all, consistent with observations of Van Eerdewegh (1998) and Roberts et al. (1999), among others.

Figure 23.10. The effects of trait locus genotype misclassifications on the 3-lod-unit support interval around the maximum complex multipoint lod score. For each of a variety of sample sizes M, 50,000 replicates were simulated and the length of the resulting support interval was determined in each replicate. The average length was determined, along with the ratio of this length for given values of ε̂ = (M − C)/M to the length of the support interval when ε̂ = 0 (i.e., C = M). This ratio is invariant with sample size, at least on the range simulated, where M ∈ [50, 5000] and ε̂ ∈ [0, 0.4], as shown.
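The random-walk simulation just described can be sketched as follows (Python; 2,000 replicates rather than 50,000 to keep it quick, and with a guard, not part of the original description, that returns infinity when Z_max ≤ 3, since then the whole genome lies in the support interval):

```python
import math
import random

def lod(M, C):
    """Complex multipoint lod at a point where C of M meioses cosegregate."""
    if C <= M / 2:
        return 0.0
    t = 0.0 if C == M else (M - C) * math.log10((M - C) / M)
    return t + C * math.log10(C / M) - M * math.log10(0.5)

def support_interval(M, eps_hat, rng):
    """One replicate of the 3-lod-unit support interval length (in morgans)."""
    C0 = round((1.0 - eps_hat) * M)
    zmax = lod(M, C0)
    if zmax <= 3.0:
        return float('inf')          # entire genome falls in the support interval
    total = 0.0
    for _ in range(2):               # walk away from x_D in each direction
        C = C0
        while lod(M, C) > zmax - 3.0:
            total += rng.expovariate(M)      # distance to next crossover ~ Exp(M)
            if rng.random() < C / M:         # a cosegregating meiosis recombines
                C -= 1
            else:
                C += 1
    return total

rng = random.Random(3)
for e in (0.0, 0.1, 0.2, 0.3):
    mean_len = sum(support_interval(200, e, rng) for _ in range(2000)) / 2000
    print(f"eps_hat = {e}: mean 3-lod support interval ≈ {mean_len * 100:.1f} cM")
```

The mean interval grows sharply with ε̂, and dividing each mean by the ε̂ = 0 mean gives the kind of sample-size-invariant ratio curve shown in Figure 23.10.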
C. Minimization of local perturbations of the recombination process As described earlier, the multipoint lod score is a step function with steps occurring at the position of each crossover event in any of the meioses in a study.
Figure 23.11. The mean length of the 3-lod-unit support interval (SI) for the complex multipoint lod score from the simulations described in the legend of Figure 23.10, for a sample size of M = 200 meioses, is graphed together with its 95% confidence interval from the same simulated replicates. 2.5% of replicates had a 3-lod-unit support interval shorter than the lower curve in this graph, and 2.5% of replicates had 3-lod-unit support intervals longer than the upper curve.
When one wishes to map a disease locus against a genome-spanning map of markers, the goal is to place the test locus at the position in the genome that minimizes the number of additional crossover events required to explain the data. The recombination points on a given chromosome occur according to some point process, and over all meioses, the superposition of these M point processes is a Poisson process, with the distribution of the distance between adjacent crossover events over the whole genome being exponential with mean 1/M. When the disease locus genotypes are not known without error, then for each chromosomal test location of the disease locus, inserting the inferred test locus genotypes into the observed recombination process will cause the appearance of some number of additional recombination events, making the Poisson process have some locally altered rate. The goal of mapping studies, in another sense, is to find the position in the genome at which the test locus (or loci) can be inserted such that the recombination processes are minimally perturbed. This is a more logical way to think of multipoint linkage analysis conceptually: not in terms of recombination fractions on an interval-by-interval basis, but as point processes along the genome, which are locally altered by addition of the test locus in an incorrect location.
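The superposition property is easy to verify numerically. In the sketch below (Python; a hypothetical toy setup), each of M meioses lays down crossovers as a Poisson process with rate 1 per morgan, and the spacing between adjacent events of the pooled process comes out with mean approximately 1/M morgans:

```python
import random

def crossovers(length_morgans, rng):
    """Crossover positions in one meiosis: homogeneous Poisson process, rate 1/morgan."""
    pos, points = 0.0, []
    while True:
        pos += rng.expovariate(1.0)
        if pos > length_morgans:
            return points
        points.append(pos)

rng = random.Random(5)
M, L = 100, 50.0          # 100 meioses observed over a 50-morgan map
pooled = sorted(p for _ in range(M) for p in crossovers(L, rng))
gaps = [b - a for a, b in zip(pooled, pooled[1:])]
mean_gap = sum(gaps) / len(gaps)
print(f"{len(pooled)} pooled events; mean spacing {mean_gap:.5f} morgans (1/M = {1.0 / M})")
```

Inserting wrongly placed test-locus genotypes would add events locally to this pooled process, which is the local rate acceleration described in the following paragraphs.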
In our model of complex-valued recombination fractions (Goring and Terwilliger, 2000a), the parameter ε allows for a lengthening of the chromosomal map in the region of the test locus, such that the larger the value of ε, the greater the local perturbation of the apparent recombination process along the chromosome. In essence, if we model the superposition of the M recombination processes, the resulting Poisson process has rate 1/M. If we define this process locally to have rate ξ/M (ξ ≥ 1), then in mapping a test locus we want to identify the chromosomal position at which insertion of the test locus causes ξ to be as close to 1 as possible. For example, consider the five meioses shown in Figure 23.2. If the test locus were inserted in the map at a different chromosomal position, say "x_D" in Figure 23.12, instead of its true position "D," a large number of additional crossover events would be needed to explain the data, as shown, since the elevated state in each meiosis represents the observed segregation pattern of the test locus. Looking at the superposition of the meiotic recombination processes (labeled "All" in the figure), there is a massive acceleration of the recombination rate around x_D.
Figure 23.12. The same five meioses as shown in Figure 23.2 are redrawn, under the mistaken assumption that the test locus is at chromosomal position x_D, rather than its true location, D. It is shown that a large number of additional crossover events are required to allow x_D to cosegregate with the test locus genotypes in each meiosis. At the bottom of the figure, the acceleration of the rate of the superposition of the indicated recombination processes in a neighborhood around x_D is clearly shown, indicating that ξ ≫ 1 (see text), which implies that this is not a likely position for the test locus. By contrast, the rate of the superposition around D, the actual chromosomal location of the test locus, is unperturbed (i.e., ξ = 1; see Figure 23.2).
It is possible that the chromosomal recombination process is not stationary in a given sample, at least relative to genetic distance, since the definition of 1 cM is the chromosomal distance on which the expected number of crossovers per meiosis is 0.01. Further, the superposition of the M recombination processes converges to a stationary Poisson process when M is reasonably large. For many reasons, however, the observed recombination process may not appear to be stationary and regular. For example, if the marker locus order is inaccurately specified, there may be local accelerations of the recombination process (often seen in experimental data as an inflation of the estimated map length). Similarly, marker locus genotyping errors are unavoidable in real data, and these, much like the effects of errors in disease locus genotype inference described earlier, will cause local accelerations of the rate of the estimated recombination process around the loci with the highest rates of genotyping errors. Misclassification of marker locus genotypes will most often cause meioses that are truly nonrecombinant with the flanking markers to appear to be recombinant; thus a single marker genotyping error can cause two false recombination events to be inferred. This can cause rather extreme deviations from the true stationary Poisson process, which must underlie the true recombination process. When the intermarker recombination fractions also are misspecified (as they will be in all cases, since our best maps are based on estimates from small samples that are assumed to be without error), the resulting recombination process in a new sample will never appear to have a fully stationary rate, even when marker genotypes are known without error.
Eventually we hope to have technological solutions to these problems: that is, genotyping error rates will be minimized, and genetic (not physical) maps of the marker loci will be better characterized, though for the less informative single-nucleotide polymorphisms (SNPs) this will be a difficult task. Until then, however, it will often be necessary to look for the telltale characteristics of such errors and to allow for them as best we can. One simple solution (Goring and Terwilliger, 2000c) for handling errors in the marker map (both locus order and intermarker distances) would be to reestimate the marker map from one's own data, thus forcing the data to fit a stationary Poisson process model. The marker map is a nuisance parameter, so one can reestimate the marker map jointly with the disease locus inserted in all possible chromosomal positions. This is typically the procedure that is employed to map new marker loci, and there is no reason not to do the same thing to map a disease locus (in this case the primary effect is to minimize false negatives). Errors in specification of marker locus genotype frequencies and marker-marker LD can be dealt with in an analogous manner, by estimating them separately with and without the test locus included in the model to minimize the effects of erroneous assumptions (in this case eliminating false positives). Marker locus genotyping errors can be modeled by using parameters that are analogous to ε. Their effect is to locally accelerate the inferred recombination
374
Joseph D. Terwilliger
process, owing to misclassification of double nonrecombinant meioses as double recombinants in the neighborhood around the offending marker locus. Through the use of hypercomplex recombination fractions (Göring and Terwilliger, 2000b), one can model these local variations in the underlying recombination processes in a locus-specific manner. In reality, however, if one tries to map a disease locus by inserting the test locus at some position in the chromosome and estimating its effect on the local recombination processes, these inflations can be intrinsically taken account of. If the local recombination rate is already accelerated, an improperly mapped test locus will accelerate the rate to a lesser degree than if there were no errors (since the null hypothesis rate would be lower and the range of possible deviations would be larger). Thus, errors in marker genotyping will reduce the ability to fine-map a disease locus if the analysis appropriately takes account of the observed recombination process among the marker loci. This then leads to a reduction of fine-mapping ability above and beyond that intrinsic to test locus genotyping errors. The bottom line is that one should minimize the sources of error. When errors are unavoidable, however, it is incumbent on investigators to allow for them in some manner, in order to have any ability to do accurate mapping in their presence.
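A minimal numerical sketch of the inflation effect, under a toy misclassification model of my own (not the chapter's hypercomplex parameterization), in which each of the two adjacent marker readings is independently flipped with probability e:

```python
def observed_theta(theta, e):
    """Apparent recombination fraction between adjacent markers when each
    of the two marker phase readings is independently flipped with
    probability e (toy model; assumption for illustration only)."""
    gamma = 2 * e * (1 - e)  # probability exactly one of the two readings flipped
    return theta * (1 - gamma) + (1 - theta) * gamma

print(observed_theta(0.01, 0.00))  # no error: no inflation
print(observed_theta(0.01, 0.02))  # ~0.048: local map distance inflated about 5-fold
```

Even a 2% per-marker error rate dominates a true recombination fraction of 0.01, which is why the inferred process looks locally accelerated around error-prone loci.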
V. "MODEL-BASED" AND "MODEL-FREE" METHODS: A FALSE DICHOTOMY

A. Effects of incorrect models on estimation

The assumption of a weak mode of inheritance model in linkage analysis, as an attempt to accurately model the "true" mode of inheritance of a complex trait, often leads to the a priori assumption that a small proportion of the total set of available meioses is informative for linkage. If this model is completely accurate, it is probably a reasonable idea to model things that way. But in the real world, this is effectively impossible, especially when a single-locus marginal penetrance model is used. In such situations, a generally more powerful and more reasonable approach may be to overestimate the genotype-phenotype relationship rather than to underestimate it; in the latter case, one censors meioses before the study is even begun. Through the use of complex-valued recombination fractions, one can absorb errors in the mode of inheritance model in the parameter ε, rather than having them lead to false exclusions. To show the effects of the inferred model on the analysis, let us assume that we do the analysis with a model that assumes a proportion p of meioses to be informative at the test locus (i.e., every meiosis has prior probability p of being informative for linkage). In this case, the lod score would be
23. Genome Scans as Stochastic Processes
375
$$\mathrm{lod} = \max_{\theta}\,\log_{10}\frac{[1 - p + p(1-\theta)]^{C}\,[1 - p + p\theta]^{M-C}}{[1 - 0.5p]^{M}} = \max_{\theta}\,\log_{10}\frac{[1 - \theta p]^{C}\,[1 - (1-\theta)p]^{M-C}}{[1 - 0.5p]^{M}},$$
and the mean lengths of the 3-lod-unit support intervals, as a function of the percentage of meioses in the sample that are truly informative for linkage, are graphed (assuming M = 800) in Figure 23.13. Figure 23.14 gives the ratio of the mean support interval length (assuming fixed values of p) to the length of the support interval (assuming p = 1). Note that when p is fixed to be too small, the ability to estimate the map position is worse than if p is assumed to be too large, and the smaller the value of C/M in the total sample of chromosomes,
Figure 23.13. Effects of assuming a weak mode of inheritance model to infer test locus genotypes on complex multipoint lod score support interval lengths (N = 800). A simulation of the same conditions as in Figure 23.10 was performed, in which the complex multipoint lod scores were computed assuming a variety of fixed mode of inheritance models, parameterized in terms of p (see text). When p = 1, the length of the support interval has the same functional relationship to ε as that shown in Figure 23.10. However, when p is assumed to be smaller than the actual proportion of informative meioses (i.e., the assumed mode of inheritance is too weak), the support interval is elongated as shown. These curves correspond to complex multipoint lod score analysis assuming p = 0.3, p = 0.5, p = 0.7, p = 0.9, and p = 1, respectively, as shown.
Figure 23.14. Based on the same simulated conditions as Figure 23.13, here the ratios of the support intervals assuming different fixed values of p to the lengths of the support intervals for the same data set assuming p = 1 in the analysis are presented as a function of (M − C)/M = ε when p = 1. These values are shown for N = 800 but are roughly invariant as a function of sample size, although for different sample sizes, the steps in this step function occur at slightly different values of (M − C)/M.
the larger the inflation of the support interval, as expected. In fact, for any given sample value of ε, the length of the support interval is minimized when one does not assume that a proportion of meioses is uninformative a priori, justifying the preferential use of models that are too strong over models that are too weak, at least in studies of a large number of small pedigrees, as remarked on shortly. One can gain insight into this observation by examining the foregoing equations carefully. Whenever C/M is smaller than (1 − p)/(2 − p), the lod scores will be identical to those obtained when p = 1; but whenever C/M is larger than (1 − p)/(2 − p), the lod scores will be smaller than those obtained assuming p = 1, since a proportion of meioses has been assumed to be uninformative a priori. Of course, if one does not allow for complex-valued recombination fractions, fixing p to be too small will cause false exclusions of the true map location of the disease locus (cf. Risch and Giuffra, 1992). The observation that fixing a mode of inheritance that is weaker than the real situation in a given sample can decrease both power and estimability (if complex recombination fractions are admitted) leads to the suggestion that one might want to assume that p = 1 in all situations. Interestingly enough, this is the
assumption implicit in most so-called model-free analysis methods, and it is exactly why they tend to be more robust to errors in the mode-of-inheritance assumptions in small pedigrees. The latter qualification is necessary, for when pedigrees become large, the assumption that all meioses are independent and identically distributed becomes less appropriate (cf. Göring and Terwilliger, 2000d).
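The lod score formula given earlier in this section can be evaluated numerically; the grid-search sketch below (my own, with invented counts C and M) illustrates how fixing p too small discards linkage information in a strongly linked sample.

```python
import math

def lod(p, C, M, theta):
    """Lod score when a proportion p of meioses is modeled as informative,
    with C nonrecombinant meioses out of M (formula from the text)."""
    return (C * math.log10(1 - theta * p)
            + (M - C) * math.log10(1 - (1 - theta) * p)
            - M * math.log10(1 - 0.5 * p))

def max_lod(p, C, M, steps=5000):
    """Maximize the lod over a grid of recombination fractions in (0, 0.5]."""
    return max(lod(p, C, M, 0.5 * i / steps) for i in range(1, steps + 1))

# Strong linkage: 780 nonrecombinants out of 800 meioses. Assuming p = 0.5
# a priori censors half the meioses and sharply reduces the evidence.
print(round(max_lod(1.0, 780, 800), 1))
print(round(max_lod(0.5, 780, 800), 1))
```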
B. "Pseudo-marker" linkage analysis

Knapp et al. (1994) demonstrated an exact stochastic equivalence between the affected sib-pair mean test and a lod score analysis between a marker locus and a "pseudo-marker" at which parents in a sibship had genotype D/+ and their affected offspring each had genotype D/D. In these small sibships, this is identical to assuming that 100% of the meioses are informative for linkage (p = 1) and performing a conventional lod score analysis. In the multipoint case, however, this equivalence no longer applies because of the confounding of recombination and the lack of complete correspondence between these "pseudo-marker" genotypes and the actual underlying trait locus genotypes. Through the use of complex-valued recombination fractions, however, it can be demonstrated that the sib-pair mean test is equivalent to a "pseudo-marker" linkage analysis in which one assumes that p = 1, where the lack of complete correspondence between the pseudo-marker genotypes and the actual trait locus segregation pattern is taken account of through the parameter ε (for proof, see Göring and Terwilliger, 2000a,d). This result, in conjunction with the observation that the sensitivity of linkage analysis is greater when the model is overestimated, rather than underestimated, gives an intuitive explanation for why such "model-free" analysis methods tend to be more robust than "model-based" methods to uncertainties in the underlying mode of inheritance. Of course, paradoxically, assuming that all meioses are informative is about the most inaccurate genotype-phenotype relationship model one could possibly use in the analysis, leading to the conclusion that one can do better with a wrong model than with the right model in many situations, as long as one allows for the errors in the analysis model.
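The pseudo-marker genotype assignment described above can be written down directly (a schematic helper of my own, not code from the cited papers):

```python
def pseudo_marker_sibship(n_affected_offspring):
    """Knapp et al. (1994)-style pseudo-marker assignment for a nuclear
    family: both parents are coded D/+ and every affected offspring D/D,
    which amounts to assuming p = 1 (every meiosis informative)."""
    return {"father": "D/+",
            "mother": "D/+",
            "offspring": ["D/D"] * n_affected_offspring}

family = pseudo_marker_sibship(2)   # an affected sib pair
print(family["offspring"])          # ['D/D', 'D/D']
```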
Furthermore, it also shows that "model-free" analysis methods are equivalent to "model-based" methods that assume a very deterministic genotype-phenotype relationship, and are thus very "wrong models." Likewise, there is no justification for the platitude that "model-based" lod score analysis is not appropriate for complex disease analysis, especially on small families.
C. "Pseudo-marker" methods incorporating LD

Interestingly, the foregoing result can be extended to show that haplotype relative risk (HRR) (Falk and Rubinstein, 1987) and transmission disequilibrium
test (TDT) (Spielman et al., 1993; Ewens and Spielman, 1995) statistics are also equivalent to a "pseudo-marker" analysis assuming that p = 1 (Göring and Terwilliger, 2000d). Note that in the HRR, one is contrasting transmitted with nontransmitted alleles in a test of LD. For the moment, assuming that θ = 0, if one assigned "pseudo-marker" genotype D/+ to each parent in a "triad" and D/D to each child and performed a likelihood ratio test of LD between a marker locus and this pseudo-marker locus, one would have a test that is completely equivalent to the HRR test, as

$$2 \ln \frac{\max_{\delta} L(\theta = 0;\, \delta)}{L(\theta = 0;\, \delta = 0)},$$

for a marker locus with n alleles, where δ represents the vector of all possible LD relationships between the n marker alleles and the two pseudo-marker alleles. This statistic can be applied to any pedigree structure, however, adding extra flexibility over the conventional HRR test, which is confined to singleton affected individuals. If it is applied in larger pedigrees, it may be wise to allow the recombination fraction to vary as a nuisance parameter, as

$$2 \ln \frac{\max_{\theta,\,\delta} L(\theta;\, \delta)}{\max_{\theta} L(\theta;\, \delta = 0)}$$
(cf. Göring and Terwilliger, 2000c,d), rather than forcing θ to be 0 in both numerator and denominator (though in triads, there is no information about θ itself, since it is completely confounded with δ; Ott, 1989). A related statistical test is the TDT, which can be shown to be based on the same "pseudo-marker" genotype configurations, also assuming that p = 1. This test, however, is philosophically different, if not numerically (on triads), since it is testing for linkage, treating LD as a nuisance parameter, as

$$2 \ln \frac{\max_{\theta,\,\delta} L(\theta;\, \delta)}{\max_{\delta} L(\theta = \tfrac{1}{2};\, \delta)}.$$

The distribution of this statistic, however, is highly dependent on the sample size and structure, and it is somehow logically inappropriate, since in the denominator it allows for LD under the assumption of no linkage, which should never be different from 0 when the sampling is done in a reasonable manner. In most cases, a more logical statistic, which is not much different stochastically but has a distribution that is less dependent on the sample structure, would be
$$2 \ln \frac{\max_{\theta,\,\delta} L(\theta;\, \delta)}{L(\theta = \tfrac{1}{2};\, \delta = 0)},$$

which has a distribution that is approximately a 50:50 mixture of $\chi^2_{(n)}$ and $\chi^2_{(n-1)}$ when a large data set is available consisting of a variety of data structures. Interestingly,

$$2 \ln \frac{\max_{\theta,\,\delta} L(\theta;\, \delta)}{L(\theta = \tfrac{1}{2};\, \delta = 0)} = 2 \ln \frac{\max_{\theta} L(\theta;\, \delta = 0)}{L(\theta = \tfrac{1}{2};\, \delta = 0)} + 2 \ln \frac{\max_{\theta,\,\delta} L(\theta;\, \delta)}{\max_{\theta} L(\theta;\, \delta = 0)},$$
and in this case the first statistic is a standard linkage test in the absence of LD and the second is analogous to the HRR test of LD in the presence of linkage. The joint statistic will not often be significant, however, if neither of the component statistics is itself significant. Furthermore, since one is not interested in any LD that might exist when the trait locus is not linked to the marker locus, and such LD is unlikely to be present if a reasonable job of sampling was done, it is most logical to forego TDT tests in lieu of some combination of "pseudo-marker" linkage tests followed by "pseudo-marker" tests of LD conditional on the presence of linkage. Of note, one can combine case-control data with triads and larger sibships in the computation of these statistics as well, in which case linkage and LD can act synergistically to improve the power and accuracy of the resulting tests and can allow joint linkage and LD analysis of all data structures simultaneously (see Göring and Terwilliger, 2000d, for details) by means of general pedigree analysis software like ILINK of the LINKAGE package (Lathrop et al., 1984). This equivalence of "model-based" and "model-free" analysis methods can be extended to larger pedigrees as well, but it should be noted that there is no longer a guarantee that a model-free analysis will be optimal in those situations. Small pedigrees admit only a small number of degrees of freedom in terms of the possible observed sets of chromosomal and test locus cosegregation patterns in a pedigree. However, large pedigrees admit a much wider variety of combinations, and there is a much larger interdependency between different meioses in the same pedigree.
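For reference, the familiar counting form of the TDT (Spielman et al., 1993), which on triads reduces to a McNemar statistic for transmitted versus nontransmitted alleles, can be sketched as follows (counts invented; this is the classical form, not the likelihood formulation above):

```python
def tdt_statistic(b, c):
    """TDT in its counting form: b = transmissions of the candidate allele
    from heterozygous parents, c = nontransmissions; approximately
    chi-squared with 1 df under the null hypothesis of no linkage."""
    return (b - c) ** 2 / (b + c)

print(tdt_statistic(60, 40))  # 4.0, nominally significant at the 0.05 level
```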
Simple "model-free" algorithms may not cover the set of outcomes as efficiently, since they may represent an overly extreme structuring of the underlying outcome space of GM (see earlier discussions and also Göring and Terwilliger, 2000d). Nevertheless, it is still generally more powerful to use a model that is too deterministic rather than too nondeterministic, for the same reasons discussed earlier; but a wider range of deterministic trait
locus genotype assignment models can be conceived that are consistent with some plausible genotype-phenotype relationship model, as pedigree structures increase in size and complexity. Nevertheless, as a general rule, genetic analyses of a small set of large pedigrees tend to be more powerful than analyses of a larger set of small pedigrees, genotype for genotype.
VI. COMPLEX DISEASE GENE MAPPING: IS IT POSSIBLE?

A. How complex are complex traits?

This chapter began with a consideration of the simple relationships between test locus genotypes and marker locus genotypes, due to linkage and/or LD. Then, we demonstrated that there is a second relationship of major importance in mapping a locus based on phenotype, which is the relationship between disease locus genotypes and observed phenotypes. No matter whether a "model-based" or a "model-free" analysis method is used, the correlation between observed marker locus genotypes and trait phenotypes is a convolution of the correlation between marker and disease locus genotypes and the correlation between trait locus genotypes and observed phenotypes. If the linkage and/or LD relationships are very strong, then the power of a test is a function of the trait locus genotype-phenotype relationship (see Figure 23.10), and if the test locus genotype-phenotype relationship is strong, the power is solely a function of the strength of the linkage and/or LD relationships (see Figure 23.8). However, when both relationships are weak, nothing will work. The advent of new technologies has led us to the point that eventually we may be able to sequence the entire genome of an individual sufficiently cheaply and rapidly to permit the identification of loci with arbitrarily strong linkage and/or LD relationships with any underlying polymorphism, leaving the power solely a function of the genotype-phenotype relationships, having little or nothing to do with the markers. At present, the marker correlations are still sufficiently weak that they play a role in real-world power considerations, but in the coming decades this effect will rapidly dissipate, making it most appropriate for us to concentrate on the phenotype-genotype relationship as the main influence on the power of any study.
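The "convolution" of the two correlations can be made concrete with a toy chain Ph ← GD ← GM (all probabilities invented for illustration):

```python
# P(G_D | G_M): rows index marker genotypes, columns disease-locus genotypes.
p_gd_given_gm = [[0.9, 0.1],
                 [0.2, 0.8]]

# P(affected | G_D): a high-risk and a low-risk penetrance (hypothetical).
p_aff_given_gd = [0.5, 0.05]

# P(affected | G_M) = sum over G_D of P(affected | G_D) * P(G_D | G_M):
# the observable marker-phenotype correlation is the product of the two links,
# and is weaker than either link in the chain.
p_aff_given_gm = [sum(row[j] * p_aff_given_gd[j] for j in range(len(row)))
                  for row in p_gd_given_gm]
print([round(x, 3) for x in p_aff_given_gm])  # [0.455, 0.14]
```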
Let us consider a more realistic model of the etiological complexity of a multifactorial phenotype, as shown in Figure 23.15, which is admittedly still an oversimplified representation of the probable true state of nature. The inferential object of our analysis in Figure 23.15 is a small subset of the total etiological universe of the phenotype of interest. For a review of some details of the complexities involved in the etiology of actual complex phenotypes, see Weiss (1995, 1996), Terwilliger and Weiss (1998), and Terwilliger and Göring (2000).
Figure 23.15. A more complete, yet still abridged, view of the etiology of a complex trait. Linkage and LD analyses attempt to identify the chromosomal location of some disease-predisposing locus based on correlating phenotypic observations (Ph) with genotypes of marker loci that are tightly linked to it (GM). However, it is well known that for most such loci there are likely to be several other genes with important effects (here two other specific loci are shown, but there are likely to be many more). Other genes (here meaning polygenic components that individually have too little effect to ever be detectable), shared environmental factors, and cultural factors can contribute to familial aggregation of the phenotype, and individual environmental factors also likely contribute to the etiology; here individual environment is used to indicate environmental factors not shared with other family members. Further, each of these etiological agents is likely to have interactive effects with each of the other risk factors, and many of these risk factors may be correlated in populations independent of the phenotype as well. The curved arrows between risk factors are intended to acknowledge that all kinds of higher order interactions are likely and are not easily predictable a priori. Each of these complicating factors will decrease the power of genetic studies dramatically, making the power of reverse genetics approaches to understanding the etiology of multifactorial phenotypes difficult to predict with any confidence, other than best- and worst-case scenarios (the latter indicating that a reasonable confidence interval for most complex disease power calculations probably should not exclude the false positive rate!).
Environmental risk factors, genetic heterogeneity, and interactions between risk factors (gene-gene, gene-environment, etc.) can have enormous impact on the power and accuracy of any analysis, as outlined in Figure 23.15. In this figure, the central correlation on which a mapping study is based, that between a specific chromosomal region (GM) and the observed phenotypes (Ph), is seen to be only one small part of the total etiological spectrum. One can, of course, model any of these factors in the analysis without much complication. For example, to include environmental covariates in a parametric analysis, one would compute the likelihood as follows:

$$L \propto P(\mathrm{Ph}, G_M, \mathrm{Env}) = \sum_{G_D} P(\mathrm{Ph} \mid G_D, \mathrm{Env})\,P(G_D \mid G_M)\,P(G_M)\,P(\mathrm{Env}),$$
while the likelihood of the data allowing for epistasis and/or heterogeneity can be generalized in the same manner, by summing over the joint genotypes of several trait loci, permitting rapid multilocus analysis allowing for epistasis and heterogeneity in both "model-based" and "model-free" approaches (see Terwilliger, 2000). In general, the effect of heterogeneity, of all forms, on the etiology of any phenotype will decrease the correlations between any given chromosomal region (GM) (which harbors one genetic risk factor for disease) and the phenotype, leading to an inflation of ε in the sample and a decrease in sensitivity of the mapping experiment. In LD analysis, allelic heterogeneity plays an additional complicating role, which is similar to the effects of locus heterogeneity in a linkage analysis, because essentially there may be several unrelated disease alleles at a single locus that may affect the phenotype, and each will occur on a completely different haplotype. Thus, the chromosomes with disease alleles are probably not identical by descent from a single common ancestor, as was assumed in the coalescent model presented in Figure 23.5; rather, there are multiple such coalescent trees connecting sets of individuals with the same disease allele to their common ancestors. The problem is that a priori, on the basis of phenotype, one does not know which individuals have which disease alleles, nor does one know how many such disease alleles are present in the sample. In a study of a rare recessive disease in Finland, it might be reasonable to hypothesize the existence of one and only one disease allele, and thus one single ancestral pedigree. But in a study of BRCA1 alleles and breast cancer in New York City, there would likely be so many different disease-predisposing variants that no common ancestral
mutations would exist at all, and thus no shared segments would exist in the affected population in the area of the BRCA1 gene. LD methods, therefore, are highly sensitive to allelic heterogeneity, which can further decrease the ability to map loci by using this method. In linkage analysis, this kind of heterogeneity does not matter significantly, since the correlations being detected are within pedigrees, not between them. In the study of truly complex multifactorial phenotypes, the aforementioned complications are likely to be the rule rather than the exception (cf. Weiss, 1995, 1996). There is substantial evidence of allelic heterogeneity in most genes that have been studied in sufficient detail (Weiss, 1995, 1996; Terwilliger and Weiss, 1998). Moreover, there are many genes affecting most complex phenotypes (cf. Wright, 1968-1978; Cavalli-Sforza and Bodmer, 1971; Hartl and Clark, 1997; Schlichting and Pigliucci, 1998), which are probably characterized by complex networks of gene-gene (e.g., Burstein, 1995) and gene-environment interactions (e.g., Wright, 1920; Suzuki, 1970). Each of these factors can significantly decrease the precision of any estimates of gene location, and thus one should attempt to minimize the effects of each complicating factor up front (see Terwilliger and Göring, 2000, for more detailed discussion of this issue). Clearly, ascertaining larger pedigrees with a large number of people presenting the same phenotype will tend to maximize the familial components relative to random environmental ones. Ascertaining samples from more homogeneous populations can likewise minimize the number of risk factors (both environmental and genetic) affecting the phenotype in a sample, and controlling for the effects of known environmental and genetic risk factors up front (e.g., sampling nonsmokers in a study of lung cancer, or individuals without the risk genotypes at apoE in a study of Alzheimer's disease) can help improve power.
There are no magic cookbook recipes, but there is an obvious result that the smaller the pedigrees and the larger and more heterogeneous the population from which they are ascertained, the greater the etiological heterogeneity in any sample will be, since nothing would have been done to enrich for any of the numerous possible etiological pathways. This is always the worst-case scenario, and it should be avoided whenever possible. If one says multiplex pedigrees cannot be found, it is fair to question the hypothesis that there are genes of major influence on the phenotype anyway. (Why does one hypothesize traits that are not clustered in families to be genetic? And if they are clustered in families, why would one not be able to find families to study?)
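The environment-augmented likelihood displayed earlier in this section can be sketched numerically (all tables hypothetical; a minimal illustration, not production genetic-analysis code):

```python
def likelihood(affected, gm, env, penetrance, p_gd_given_gm, p_gm, p_env):
    """L ~ P(Ph, GM, Env) = sum over G_D of
    P(Ph | G_D, Env) * P(G_D | GM) * P(GM) * P(Env)."""
    total = 0.0
    for gd, p_cond in enumerate(p_gd_given_gm[gm]):
        f = penetrance[gd][env]                  # P(affected | G_D, Env)
        total += (f if affected else 1.0 - f) * p_cond
    return total * p_gm[gm] * p_env[env]

# Hypothetical tables: two disease genotypes, two marker genotypes, two exposures.
penetrance = [[0.6, 0.3],    # risk genotype: exposed vs. unexposed
              [0.1, 0.02]]   # nonrisk genotype
p_gd_given_gm = [[0.9, 0.1], [0.2, 0.8]]
p_gm = [0.3, 0.7]
p_env = [0.4, 0.6]

# Sanity check: affected + unaffected likelihoods must sum to P(GM) * P(Env).
l_aff = likelihood(True, 0, 0, penetrance, p_gd_given_gm, p_gm, p_env)
l_unaff = likelihood(False, 0, 0, penetrance, p_gd_given_gm, p_gm, p_env)
print(round(l_aff, 6), round(l_unaff, 6))
```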
B. Ascertainment and genome scanning: Bigger is better!

To demonstrate the effects of ascertainment on the ability to predict the underlying test locus genotypes on the basis of the observed phenotypes, let
us contrast two of the simplest ascertainment strategies, affected sib pairs (with parents) and triads (affected individuals and their parents). When people started large-scale investigations of complex traits, they were advised to pursue collections of large sets of affected sibling pairs (cf. Blackwelder and Elston, 1985; Risch, 1990). Subsequently, when even affected sibling pairs became difficult to ascertain, people were encouraged to ascertain large samples of unrelated individuals (and their parents where possible) to map disease genes through large-scale LD analyses (e.g., Spielman et al., 1993; Risch and Merikangas, 1996). It should be noted that Risch and Merikangas (1996) also demonstrated that one has substantially improved power under a wide variety of models when one performs joint linkage and association analyses on sib pairs, rather than just looking at singletons, for reasons now explained in detail. Let us assume that we have a marginal model for the inheritance of the disease, which represents the true underlying state of nature, in which we are looking at a biallelic disease-predisposing locus, where the disease allele D has frequency p, and the penetrances are modeled as P(Affected | DD) = f_DD; P(Affected | D+) = f_D+; and P(Affected | ++) = f_++. Let us further assume for the moment that we have a locus with dominant action, and a k-fold increased risk of being affected with the disease in the presence of disease locus genotype DD or D+ (i.e., f_DD = f_D+ = k f_++). Now, assuming that we have ascertained a random sample of individuals from the population, the probability that any singleton affected individual has genotype DD is

$$P_{\mathrm{Aff}}(DD) = \frac{kp^2}{k + (1-k)(1-p)^2},$$

where

$$P_{\mathrm{Aff}}(D+) = \frac{2kp(1-p)}{k + (1-k)(1-p)^2} \quad \text{and} \quad P_{\mathrm{Aff}}(++) = \frac{(1-p)^2}{k + (1-k)(1-p)^2}$$

(see Terwilliger, 2000). However, if we ascertain a sibship of size two, where both siblings are affected with the disease, the probability that a randomly chosen one of the two affected sibs has genotype DD would be

$$P_{\mathrm{SP}}(DD) = \frac{p^2 k \left[k - \tfrac{1}{4}(k-1)(1-p)^2\right]}{k^2\left[1-(1-p)^2\right] + (1-p)^2 - \tfrac{1}{4}(k-1)^2 p (1-p)^2 (4-p)},$$
where

$$P_{\mathrm{SP}}(D+) = \frac{2p(1-p)k\left[k - \tfrac{1}{4}(k-1)(1-p)(2-p)\right]}{k^2\left[1-(1-p)^2\right] + (1-p)^2 - \tfrac{1}{4}(k-1)^2 p (1-p)^2 (4-p)}$$

and

$$P_{\mathrm{SP}}(++) = \frac{(1-p)^2\left[1 + \tfrac{1}{4}(k-1)p(4-p)\right]}{k^2\left[1-(1-p)^2\right] + (1-p)^2 - \tfrac{1}{4}(k-1)^2 p (1-p)^2 (4-p)}.$$
Figure 23.16 graphs the ratio P_SP(DD)/P_Aff(DD) to illustrate the benefits of ascertaining sib pairs over singletons, where the benefits of the increased ability
Figure 23.16. The ability of phenotypes (Ph) to predict the underlying trait locus genotypes (GD) is a function of the ascertainment scheme. In this figure, two ascertainment schemes are compared: ascertainment of singleton affected individuals, and ascertainment of sib pairs in which both siblings are affected with the disease. P_Aff(DD), the probability of genotype DD in singleton affecteds, and P_SP(DD), the probability of genotype DD in a randomly selected sib from the sib pairs, were computed as described in the text, and their ratio is graphed for a range of disease allele frequencies p. A dominant model is assumed, in which f_DD/f_++, the penetrance ratio (cf. Terwilliger and Ott, 1994), is varied over the range [1, 100]. For the entire range of admissible values, the existence of an affected sibling increases the predictive value of phenotype on genotype. In general, the more densely packed the pedigree, the better the predictive value of phenotype on genotype.
to predict the proband’s test locus genotype are especially critical when one aspires to detect LD. The more densely packed a pedigree is with affected individuals, the more effectively the phenotypes can predict the underlying genotypes. To this end, it is generally advised to collect densely loaded multiplex pedigrees whenever possible.
C. Candidate genes and hypothesis testing

When genotypes of the test locus are known, genome scans based on linkage have very predictable behavior and can be a very powerful tool to localize the test locus to a small fragment of the total genome. But when one can observe only phenotypes influenced by some underlying test locus, and the genotype-phenotype relationship is not one-to-one, the predictability of linkage analysis suffers substantially. Of course, LD analysis suffers even more, given that it is less predictable to begin with, even when test locus genotypes are known. Much of the hope for LD methods to be effective in mapping complex traits is based on the hypothesis that the risk genotypes of the test locus can be tested directly. This assumes prior knowledge of which variants at which loci are involved in disease etiology. In addition, the number of such risk genotypes is assumed to be manageably small. Unfortunately, there is not much evidence that this is going to be the general case, and despite demonstrations that it can be a powerful approach in those circumstances, there is no evidence that the assumptions are likely to hold in most cases (cf. Weiss, 1995, 1996). As a quick example, nobody would question that the most powerful way to demonstrate that having genotype XY puts a person at greatly increased risk of prostate cancer would be to compare the prevalence in males and females directly (i.e., testing the effects of the candidate genotype in an association analysis). One could apply linkage analysis with pseudoautosomal marker loci and eventually find evidence of linkage, but this would not be efficient. Similarly, by analogy to association analysis, one could compare the prevalence among people who played football in high school with the prevalence among those who did not.
The fact that the vast majority of former high school football players have genotype XY will lead to a correlation of this factor with prostate cancer to some degree, though since the majority of males did not play high school football, the power of this analysis will be substantially less than that of a model in which the candidate genotype was tested directly. Nevertheless, the correlation between having played high school football and having prostate cancer is of the same quality as that between an SNP and a disease affected by a disease locus that is in LD with the SNP. Thus, for candidate gene analyses to be powerful, it may be necessary to guess the identity of the risk alleles themselves, while reliance on LD will be much less predictable, as demonstrated earlier. How to select an appropriate set of candidate risk factors
a priori is an open question, which may be of primary importance if anything is going to be learned about what genetic factors do affect complex multifactorial phenotypes in the real world.
VII. CONCLUSIONS

Genome scanning to detect linkage and/or LD is treated as an estimation problem rather than as a hypothesis-testing problem. The significance of a linkage or LD statistic is meaningful only relative to the values of the same statistic at other positions in the genome, because there is no null hypothesis being tested. If one accepts as the null hypothesis that "there is no disease-predisposing gene," then the whole genome scanning experiment is ill advised until such time as that null has been rejected by other epidemiological methods. On the other hand, candidate gene studies with markers in only a small portion of the genome are most appropriately considered as hypothesis-testing questions, since specific null and alternative hypotheses are being contrasted, namely, H0: The candidate gene is uncorrelated with the phenotype, and H1: The candidate gene is correlated with the phenotype. Note that one cannot explicitly test whether the candidate gene has a functional effect on the phenotype, but merely whether there is a statistical correlation between genotypes of loci in and around the candidate gene and the observed phenotypes (i.e., there could be other genes in the region that are actually functionally involved, etc.). In the latter case, one needs to pay careful attention to the distribution of the test statistic, and to be very cautious about possible reasons for spuriously rejecting the null hypothesis (e.g., bad population sampling, inaccurate marker locus genotype frequency assumptions, etc.; cf. Göring and Terwilliger, 2000c). In the former case, however, the genome scan itself provides an empirical basis for assessing the relative likelihood of each possible location of the putative disease-predisposing loci, and what matters is not how the statistic fluctuates under the null hypothesis, but how the statistic varies along the genome.
If the statistic is variable across the genome, and no small number of regions stand out against the background, one cannot localize the gene(s) very well, and in many cases, the entire genome may be within a predetermined support interval of the best estimate. Of course, in the real world, investigators generally behave accordingly, regardless of whether their most significant finding exceeds conventional thresholds of significance, and follow up the regions where their favorite statistic has its highest values. This "rank test" procedure essentially treats the genome scan as an estimation problem and assumes that the most likely location of the putative disease gene is where the statistic is maximized. When the statistic is overwhelmingly significant at some genomic location, that region
388
Joseph D. Terwilliger
will stand out far above the background variation in the statistic across the remainder of the genome, and in this sense, one has accordingly more confidence that there is a gene in that region that influences the phenotype. But this does not make the genome scan a hypothesis-testing experiment. It is still a matter of estimating the genomic location of the putative gene, and large values of the statistic have the effect of minimizing the length of the genomic region to which the gene has been localized. Still, absent any outstandingly significant findings, it is more likely (albeit perhaps not much more) that a gene influencing the phenotype resides in areas in which the statistic has its largest values, according to the principle of maximum likelihood. One simply needs to be aware that the support intervals can become enormous when phenotypes do not accurately determine the underlying genotypes, and accordingly one's ability to localize the genes in blind genomic scans may be extremely limited, as has been shown in great detail. In general, the only way to improve the power of genetic experiments is to increase the correlations between the marker loci and the phenotype experimentally. Improvements in statistical analysis methods will not play a major role in this respect (cf. Terwilliger and Göring, 2000). Sampling of larger and larger pedigrees from more and more genetically homogeneous populations (the definition of "homogeneous" is admittedly nebulous, but clearly one can say that there is less variation in, e.g., Saami or Greenlanders than in African Americans or Hispanics) will decrease the universe of etiological factors in a study sample, and clearly affected individuals from pedigrees with more people sharing a disease phenotype are more likely to be affected because of the action of genetic factors as well.
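The "rank test" procedure described above, which treats the scan as estimation and reports a support interval around the maximum of the statistic, can be sketched in a few lines. The function name, the one-unit support threshold, and the toy scores below are all hypothetical illustrations, not part of the original discussion:

```python
# A minimal sketch of the "rank test" / support-interval view of a
# genome scan: rank positions by the statistic and report every
# position whose value lies within `drop` units of the maximum.
# Function name, threshold, and toy scores are hypothetical.

def support_interval(positions, statistic, drop=1.0):
    """Return the best-supported position and the support interval:
    all positions with statistic >= max - drop."""
    best = max(statistic)
    best_pos = positions[statistic.index(best)]
    support = [p for p, s in zip(positions, statistic) if s >= best - drop]
    return best_pos, support

# Toy lod-like scores at five map positions (cM):
positions = [10, 20, 30, 40, 50]
scores = [0.4, 1.1, 3.2, 2.6, 0.9]
peak, interval = support_interval(positions, scores, drop=1.0)
print(peak, interval)  # 30 [30, 40]
```

When the statistic is nearly flat across the genome, `interval` grows to cover most positions, which is exactly the situation the text describes in which the entire genome may lie within the support interval of the best estimate.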
Traits that show more variation in prevalence among populations are more likely to be affected by a smaller number of etiological factors, most of which (certainly most of the genetic factors) typically have variable frequency across populations as well. While there are no cookbook answers, and each study needs to be carefully optimized on its own, these general principles are clear. If one jumps in without considering them, it is unlikely that one will have a successful complex trait mapping project, since even if one does take careful consideration of every relevant variable, there are no guarantees. Every assumption has some error, and the possible range of real conditions is enormous. If one allows for a range of error in each of the possible parameters of the etiology of some phenotype, there will be no predictive value at all, since the confidence intervals of the estimated power probably will include the entire range from 0 to 1. Optimists will say that 1 is within the confidence interval, and realists will note that so is 0. The bottom line is that true prediction is only as good as our assumptions, which are always bad. We can only optimize our chances by trying to sample in such a way as to skew things toward higher power, knowing that nothing will ever be guaranteed to work in the mapping of genes underlying a complex phenotype.
23. Genome Scans as Stochastic Processes
389
Acknowledgments

Support from a Hitchings-Elion fellowship from the Burroughs Wellcome Fund and from the Columbia Genome Center is greatly appreciated. Helpful discussions and/or comments on an earlier version of the manuscript from Sebastian Zöllner, Ken Weiss, Andrey Rzhetsky, Joe Lee, Harald Göring, and Andrew Clark are gratefully acknowledged. The organization of this chapter was developed in the context of a linkage course at the Center for Scientific Computing in Espoo, Finland, in the summer of 1999, organized by Kimmo Mattila and Pekka Uimari, whose assistance and input were greatly appreciated.
References

Blackwelder, W. C., and Elston, R. C. (1985). A comparison of sib-pair linkage tests for disease susceptibility loci. Genet. Epidemiol. 2, 85-97.
Burstein, Z. (1995). A network model of developmental gene hierarchy. J. Theor. Biol. 174, 1-11.
Cavalli-Sforza, L. L., and Bodmer, W. F. (1971). "The Genetics of Human Populations." Freeman, San Francisco.
Clark, A. G. (1999). The size distribution of homozygous segments in the human genome. Am. J. Hum. Genet. 65, 1489-1492.
Dupuis, J., Brown, P. O., and Siegmund, D. (1995). Statistical methods for linkage analysis of complex traits from high-resolution maps of identity by descent. Genetics 140, 853-856.
Ewens, W. J., and Spielman, R. S. (1995). The transmission/disequilibrium test: History, subdivision, and admixture. Am. J. Hum. Genet. 57, 455-464.
Falk, C. T., and Rubinstein, P. (1987). Haplotype relative risks: An easy way to construct a control sample for risk calculations. Ann. Hum. Genet. 51, 227-233.
Fisher, R. A. (1930). "The Genetical Theory of Natural Selection." Clarendon Press, Oxford.
Göring, H. H. H., and Terwilliger, J. D. (2000a). Linkage analysis in the presence of errors. I. Complex-valued recombination fractions and complex phenotypes. Am. J. Hum. Genet. 66, 1095-1106.
Göring, H. H. H., and Terwilliger, J. D. (2000b). Linkage analysis in the presence of errors. II. Marker-locus genotyping errors modeled with hypercomplex recombination fractions. Am. J. Hum. Genet. 66, 1107-1118.
Göring, H. H. H., and Terwilliger, J. D. (2000c). Linkage analysis in the presence of errors. III. Marker loci and their map as nuisance parameters. Am. J. Hum. Genet. 66, 1298-1309.
Göring, H. H. H., and Terwilliger, J. D. (2000d). Linkage analysis in the presence of errors. IV. Joint pseudomarker analysis of linkage and/or LD on a mixture of pedigrees and singletons when the mode of inheritance cannot be accurately specified. Am. J. Hum. Genet. 66, 1310-1327.
Grigelionis, B. (1963). On the convergence of sums of random step processes to a Poisson process. Theor. Prob. Appl. 8, 177-182.
Hartl, D. L., and Clark, A. G. (1997). "Principles of Population Genetics." Sinauer, Sunderland, MA.
Hemenway, D. (1982). Why your classes are larger than "average." Math. Mag. 55, 162-164.
Houwen, R. H. J., Baharloo, S., Blankenship, K., Raeymaekers, P., Juyn, J., Sandkuyl, L. A., and Freimer, N. B. (1994). Genome scanning by searching for shared segments: Mapping a gene for benign recurrent intrahepatic cholestasis. Nat. Genet. 8, 380-386.
Khintchine, A. I. (1960). Mathematical methods in the theory of queuing [in Russian]; translated by D. M. Andrews and M. H. Quenouille. In "Griffin's Statistical Monographs and Courses," Vol. 7. Hafner, New York.
Kingman, J. F. C. (1982). On the genealogy of large populations. J. Appl. Prob. 19A, 27-43.
Knapp, M., Seuchter, S. A., and Baur, M. P. (1994). Linkage analysis in nuclear families: Relationship between affected sib-pair tests and lod score analysis. Hum. Hered. 44, 44-51.
Kruglyak, L., and Lander, E. S. (1995). High-resolution genetic mapping of complex traits. Am. J. Hum. Genet. 56, 1212-1223.
Lathrop, G. M., Lalouel, J. M., Julier, C., and Ott, J. (1984). Strategies for multilocus linkage analysis in humans. Proc. Natl. Acad. Sci. USA 80, 4808-4812.
Markov, A. A. (1913). "Ischislenie Veroiatnostei" (The Calculus of Probabilities) [in Russian]. Gosizdat, Moscow.
McFadden, J. A. (1962). On the lengths of intervals in a stationary point process. J. R. Stat. Soc. Ser. B 24, 364-382.
Nelson, R. (1995). "Probability, Stochastic Processes, and Queuing Theory." Springer-Verlag, New York.
Nikali, K., Suomalainen, A., Terwilliger, J., Koskinen, T., Weissenbach, J., and Peltonen, L. (1995). Random search for shared chromosomal regions in four affected individuals: The assignment of a new hereditary ataxia locus. Am. J. Hum. Genet. 56, 1088-1095.
Norio, R., Nevanlinna, H. R., and Perheentupa, J. (1973). Hereditary diseases in Finland: Rare flora in rare soil. Ann. Clin. Res. 5, 109-141.
Ott, J. (1989). Statistical properties of the haplotype relative risk. Genet. Epidemiol. 6, 127-130.
Palm, C. (1943). Intensitätsschwankungen im Fernsprechverkehr. Ericsson Tech. 44, 1-189.
Pekkarinen, P., Kestilä, M., Paloneva, J., Terwilliger, J., Varilo, T., Järvi, O., Hakola, P., and Peltonen, L. (1998). Fine scale mapping of a novel dementia gene, PLOSL, by linkage disequilibrium. Genomics 54, 307-315.
Peltonen, L., Jalanko, A., and Varilo, T. (1999). Molecular genetics of the Finnish disease heritage. Hum. Mol. Genet. 10, 1913-1923.
Resnick, S. I. (1994). "Adventures in Stochastic Processes: The Random World of Happy Harry." Birkhäuser, Boston.
Risch, N. (1990). Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am. J. Hum. Genet. 46, 229-241.
Risch, N., and Giuffra, L. (1992). Model misspecification and multipoint linkage analysis. Hum. Hered. 42, 77-92.
Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273, 1516-1517.
Roberts, S. B., MacLean, C. J., Neale, M. C., Eaves, L. J., and Kendler, K. S. (1999). Replication of linkage studies of complex traits: An examination of variation in location estimates. Am. J. Hum. Genet. 65, 876-884.
Schaeffer, R. L. (1972). Size-biased sampling. Technometrics 14, 635-644.
Schlichting, C. D., and Pigliucci, M. (1998). "Phenotypic Evolution: A Reaction Norm Perspective." Sinauer, Sunderland, MA.
Simon, R. (1980). Length-biased sampling in etiologic studies. Am. J. Epidemiol. 111, 444-451.
Spielman, R. S., McGinnis, R. E., and Ewens, W. J. (1993). Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52, 506-516.
Suzuki, D. (1970). Temperature-sensitive mutations in Drosophila melanogaster. Science 170, 695-706.
Tajima, F. (1983). Evolutionary relationship of DNA sequences in finite populations. Genetics 105, 437-460.
Terwilliger, J. D. (1998). Linkage analysis: Model based. In "Encyclopedia of Biostatistics." Wiley, New York.
Terwilliger, J. D. (2000). A likelihood-based extended admixture model of oligogenic inheritance in "model-based" or "model-free," two-point or multipoint, linkage and/or LD analysis. Eur. J. Hum. Genet. 8, 399-406.
Terwilliger, J. D., and Göring, H. H. H. (2000). Gene mapping in the 20th and 21st centuries: Statistical methods, data analysis, and experimental design. Hum. Biol. 72, 63-132.
Terwilliger, J. D., and Ott, J. (1994). "Handbook of Human Genetic Linkage." Johns Hopkins University Press, Baltimore.
Terwilliger, J. D., and Weiss, K. M. (1998). Linkage disequilibrium mapping of complex disease: Fantasy or reality? Curr. Opin. Biotechnol. 9, 578-594.
Terwilliger, J. D., Shannon, W. D., Lathrop, G. M., Nolan, J. P., Goldin, L. R., Chase, G. A., and Weeks, D. E. (1997). True and false positive peaks in genomewide scans: Applications of length-biased sampling to linkage mapping. Am. J. Hum. Genet. 61, 430-438.
Terwilliger, J. D., Zöllner, S., Laan, M., and Pääbo, S. (1998). Mapping genes through the use of linkage disequilibrium generated by genetic drift: "Drift mapping" in small populations with no demographic expansion. Hum. Hered. 48, 138-154.
Van Eerdewegh, P. (1998). From gene detection to drug discovery: The challenge of complex diseases. Am. J. Hum. Biol. 10, 160.
Varilo, T. (1999). The age of the mutations in the Finnish disease heritage: A genealogical and linkage disequilibrium study. National Public Health Institute, Helsinki.
Weber, J. L., Wang, Z., Hansen, K., Stephenson, M., Kappel, C., Salzman, S., Wilkie, P. J., Keats, B., Dracopoli, N. C., Brandriff, F., et al. (1993). Evidence for human meiotic recombination interference obtained through construction of a short tandem repeat-polymorphism linkage map of chromosome 19. Am. J. Hum. Genet. 53, 1079-1095.
Weiss, K. M. (1995). "Genetic Variation and Human Disease." Cambridge University Press, Cambridge.
Weiss, K. M. (1996). Is there a paradigm shift in human genetics? Lessons from the study of human diseases. Mol. Phylogenet. Evol. 5, 259-265.
Wright, S. (1920). The relative importance of heredity and environment in determining the piebald pattern of guinea pigs. Proc. Natl. Acad. Sci. USA 6, 320-332.
Wright, S. (1931). Evolution in Mendelian populations. Genetics 16, 97-159.
Wright, S. (1968-1978). "Evolution and the Genetics of Populations": Vol. 1, "Genetics and Biometric Foundations" (1968); Vol. 2, "The Theory of Gene Frequencies" (1969); Vol. 3, "Experimental Results and Evolutionary Deductions" (1977); Vol. 4, "Variability within and among Natural Populations" (1978). University of Chicago Press, Chicago.
The Role of Interacting Determinants in the Localization of Genes

W. James Gauderman¹ and Duncan C. Thomas
Department of Preventive Medicine
University of Southern California
Los Angeles, California 90089
I. Summary
II. Introduction
III. Penetrance Models Incorporating Interaction Effects
IV. Segregation Analysis
V. Linkage Analysis
VI. Association Studies
VII. Treatment of Missing Covariate Data
VIII. Conclusions
References
I. SUMMARY

We describe the potential gains in power for localizing disease genes that can be obtained by allowing for interactions with environmental agents or other genes. The focus is on linkage and association methods in nuclear families with dichotomous phenotypes. A logistic model incorporating various main effects and interactions is used for penetrance, but similar methods apply to censored age-at-onset or continuous phenotypes. We begin by discussing the influence of gene-environment interactions in segregation analysis, illustrated with an analysis of smoking as a modifying factor for lung cancer. We then discuss a number of approaches to linkage analysis, model-free and model-based (including generalized estimating equations), incorporating interactions with environmental factors and other genes, either candidate genes or linked loci. We find that a test of heterogeneity in IBD sharing probabilities across strata defined by sharing of environmental factors can offer greater power for detecting linkage than the simple mean test, provided the interaction effect is sufficiently strong; we explore the conditions under which this gain in power occurs. Finally, we describe approaches for testing association and disequilibrium involving interactions, utilizing case-control, case-parent, and pedigree-based approaches. A technical problem that must be addressed in many analyses is the effect of missing data on environmental covariates; we use multiple imputation in an analysis of lung cancer segregation to illustrate an approach to this problem.

¹To whom correspondence should be addressed.

Advances in Genetics, Vol. 42. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.
II. INTRODUCTION

Geneticists have a wide range of tools at their disposal for localizing genes, broadly divided into linkage and association methods. The standard method for linkage mapping has long been the lod score, for which Morton (1955) derived the now conventional criteria for declaring significant evidence of linkage or exclusion. Because gene discovery has historically been the province of human genetics rather than genetic epidemiology, little attention has been paid to the potential effects of gene-environment (G × E) interactions on our ability to localize genes, and gene-gene (G × G) interactions have only recently been given much attention. Nevertheless, Morton was among the first to appreciate the importance of G × E (Rao and Morton, 1974) and G × G (Kirk et al., 1970) interactions. Although standard methods will produce the correct test size for detecting linkage in the presence of unrecognized interactions, their power may be reduced, the risk of false exclusion of true linkages will be increased, and alternative methods that allow interactions may be more powerful. In this chapter, we begin by describing a general class of penetrance models for binary phenotypes and their extensions to phenotypes of other types. We then describe methods that allow for G × E and G × G interactions in segregation, linkage, and association analyses. In the context of affected sib-pair linkage analysis, we review conditions under which allowing for G × E interaction can be expected to have superior power for detecting linkage. We also discuss approaches for dealing with missing environmental data in the modeling of G × E interactions.
III. PENETRANCE MODELS INCORPORATING INTERACTION EFFECTS

We consider a dichotomous phenotype y = 0 (unaffected) or 1 (affected). Let g denote an unobserved causal gene and m a marker gene. We assume that the
causal gene is diallelic with low-risk allele a and high-risk allele A. We also consider a dichotomous exposure variable z = 0 (unexposed) or 1 (exposed), although in practice z may be continuous or time dependent, and there may be multiple environmental covariates. Finally, let h denote a second causal locus, also diallelic with alleles b and B, possibly linked to another marker n. A fully saturated model for penetrance would require specification (and estimation) of 18 parameters for the risk of disease in each of the 3 × 3 × 2 possible combinations of genotypes at the two loci and the environment. For simplicity, we choose instead to limit consideration to a model for these penetrances in terms of main effects and interactions. For this purpose, the logistic model is the natural parameterization,
logit Pr(y = 1 | g, h, z) = α + βG(g) + γH(h) + ζz + ηG(g)z + ωG(g)H(h) + ···,   (24.1)

where G(g) (and similarly H(h)) denotes a coding of dominance as G(g = aa) = 0, G(g = aA) = δ, and G(g = AA) = 1; allele A is dominant if δ = 1, recessive if δ = 0, multiplicative if δ = ½, and δ is left free in a codominant model. The parameter α is the logit of the penetrance in the baseline group, for example, unexposed subjects with the low-risk genotype at all loci. The quantity exp(β) is the odds ratio for the main effect of g, with analogous interpretations of the parameters for the main effects of h and z. The quantity exp(η) is the G × E interaction odds ratio

exp(η) = [O(g = AA, z = 1) O(g = aa, z = 0)] / [O(g = aa, z = 1) O(g = AA, z = 0)],

where O(g, z) denotes the odds of disease Pr(y = 1 | g, z)/Pr(y = 0 | g, z); exp(ω) is the corresponding G × H interaction odds ratio. The symbol "···" in Equation (24.1) allows the possibility of including more complex interaction terms. For example, one might consider a three-way G × G × E interaction, or the possibility that g acts in a recessive manner in the absence of exposure and in a dominant manner in the presence of exposure. For simplicity, we will limit consideration to simple Mendelian inheritance models involving either G × E or G × G interactions, but not both. The model in Equation (24.1) can describe several plausible interaction models such as those described by Ottman (1990, 1996) and Khoury et al. (1993). For example, if β = ζ = 0 and η ≠ 0, then neither the mutant allele nor the environment has any effect alone, but only when both are present. Alternatively, β ≠ 0, ζ = 0, and η ≠ 0 corresponds to a genetic effect in the absence of environmental exposure, but an environmental effect only in the presence of genetic susceptibility.
The logistic model for penetrance can also be extended to other types of phenotypes in the framework of generalized linear models, which are specified in terms of a distribution function (e.g., binomial), a link function [e.g., the logit transformation in Equation (24.1)], and a linear predictor (the right-hand side of the equation). For example, for a continuous phenotype one could use a normal distribution function with the identity link function, and for a censored age-at-onset phenotype, one could use a proportional hazards model with a log link function. For any of these phenotypes, the linear predictor can be modified to include a regression on familial phenotypes in the so-called regressive model to account for residual familial correlation (Bonney, 1984, 1986; Abel and Bonney, 1990). Finally, let pE denote the population prevalence of exposure, φ the odds ratio for exposure concordance between sibs, qA and qB the population frequencies of the A and B alleles, respectively, θ and ϕ the recombination fractions between g and m and between h and n, respectively, and qm and qn the allele frequencies at the two markers. We assume that genes and environment are independently distributed in the population, and in two-locus models that g and h are unlinked and in linkage equilibrium. For association studies, let δ denote the disequilibrium parameter(s) between g and m and Δ the disequilibrium parameter(s) between h and n.
IV. SEGREGATION ANALYSIS

Interaction effects might arise in two situations when one is looking for evidence of a major gene. First, one might wish to allow for an environmental covariate that is shared within families and, if ignored, might provide spurious evidence for a major gene; in addition, one might wish to test whether the putative major gene interacts with that factor. Second, one might wish to look for evidence of a major gene allowing for the presence of another gene (possibly already localized or identified), which acts either independently (genetic heterogeneity) or interactively (epistasis) with the locus being tested. To conduct a segregation analysis incorporating environmental covariates, with or without interactions, it suffices to specify an appropriate model for penetrance, such as Equation (24.1), and form the appropriate likelihood, summing over all possible configurations of unobserved major genes. For example, if we let y, g, and z denote vectors of phenotypes, unobserved genotypes, and covariates in a given family, respectively, the likelihood for a model with G × E interaction is given by

L(ψ) = Σg Pr(g) Πi Pr(yi | gi, zi; ψ),   (24.2)
where ψ = {α, β, ζ, η} is the corresponding set of penetrance model parameters. One may also need a correction for ascertainment if families were selected based on the disease status of one or more of their members. Standard maximum likelihood analysis can be used for parameter estimation, utilizing the peeling algorithm (Elston and Stewart, 1971; Lange and Elston, 1975) to facilitate computation of the likelihood. A subtlety in the foregoing likelihood involves the treatment of missing covariate data, a point we shall return to later. In principle, however, one might treat these in the same way as the unobserved genotypes by summing over all possible values of the missing data. This is a nontrivial undertaking, however, particularly when the covariates are continuous and the familial dependencies are not as well defined as for a Mendelian major gene. We illustrate these approaches with some segregation analyses of lung cancer incorporating smoking information. Smoking is, of course, a powerful risk factor for lung cancer, and the habit tends to aggregate within families. Hence, it should produce some familial aggregation, even in the absence of any genetic cause. Furthermore, it would be of interest to know whether all people are equally susceptible to the effects of smoking or whether some genotypes confer higher sensitivity or resistance. With these objectives in mind, Sellers et al. (1990) conducted a segregation analysis of 337 families (4356 individuals) identified through white lung cancer probands in 10 Louisiana parishes between 1976 and 1979. Using the program REGTL in S.A.G.E. and adjusting for the main (but not interactive) effect of pack-years of smoking, they found evidence of a rare major gene affecting either the lifetime risk (with age at onset independent of genotype) or the age at onset (with lifetime risk independent of genotype).
No clear inferences about dominance were possible, although codominant inheritance appeared to fit best for either model. Using the fitted main effects model, the investigators showed that the penetrance ratios were smaller in heavy smokers than in nonsmokers, but this inference was a consequence of the model form, not a test of G × E interaction. In subsequent work, Sellers et al. (1992a) stratified the 337 pedigrees based on the age of the proband, as a surrogate for pre- and post-World War I trends in smoking behavior. They found differing segregation patterns in the two subsets and suggested that this might be a consequence of a gene-smoking interaction (Sellers et al., 1992b). In all their analyses, the penetrance contributions were omitted from the likelihood for the 1018 (23%) individuals on whom smoking information was missing. These data were later reanalyzed by Gauderman et al. (1997) with the Genetic Analysis Package, using a proportional hazards model of the form

λ(t, g, z) = λk exp[βG(g) + ζ′z + η′G(g)z],  τk−1 ≤ t < τk,   (24.3)
where λ(t, g, z) denotes the age-specific hazard rate for dying of lung cancer and the vector z includes linear and quadratic terms for pack-years of smoking, as well as sex. This analysis confirmed the evidence for a major gene and found no significant interaction with smoking, based on a test of the null hypothesis that η = 0 (χ²₁ = 0.13, p = 0.74 in the dominant model). In other words, the instantaneous hazard ratio λ(t, g = AA, z)/λ(t, g = aa, z) did not depend significantly on smoking. This conclusion was not changed by randomly imputing pack-years for the 1018 individuals on whom data were lacking, as described shortly. An important age interaction was noted by Sellers et al. (1990) but was not explicitly tested for significance either in their analysis or in that of Gauderman et al. (1997). Subsequently, Gauderman and Morrison (2000) allowed for an age-dependent genetic relative risk by letting λk in Equation (24.3) also depend on g; they showed that this model fit the data better than the proportional genetic relative risk model. In the best-fitting dominant model, the genetic hazard ratio declined monotonically from over 100 below age 60 to 1.6 by age 80, after adjusting for smoking and sex in a multiplicative model. Addition of a gene × smoking interaction term produced an estimate of exp(η) = 4.6, suggesting that gene carriers are more susceptible to the effects of smoking than noncarriers, but the likelihood ratio test of this effect was still not significant (χ²₁ = 2.64, p = 0.10). In the context of segregation analysis, ignoring G × E interaction leads to reduced power for detecting a major gene and produces biased estimates of penetrance parameters and the allele frequency (Tiret et al., 1993). The related problem of inferring the presence of G × E interaction in a segregation analysis also suffers from low power, unless the interaction effect is quite large (Gauderman and Thomas, 1994; Gauderman and Faucett, 1997a).
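The segregation likelihood of Equation (24.2), a sum of penetrance contributions over all unobserved genotype configurations, can be written out by brute force for a single mother-father-child trio. The sketch below enumerates all 27 configurations directly (the peeling algorithm computes the same sum efficiently); the penetrance function and allele frequency in the example are hypothetical:

```python
from itertools import product

GENOS = ("aa", "aA", "AA")

def hw_freq(g, q):
    """Hardy-Weinberg genotype frequency; q = frequency of allele A."""
    return {"aa": (1 - q) ** 2, "aA": 2 * q * (1 - q), "AA": q ** 2}[g]

def transmit(gm, gf, gc):
    """Pr(child genotype gc | parental genotypes) under Mendelian
    segregation, via each parent's probability of transmitting A."""
    def p_A(g):
        return {"aa": 0.0, "aA": 0.5, "AA": 1.0}[g]
    pm, pf = p_A(gm), p_A(gf)
    return {"aa": (1 - pm) * (1 - pf),
            "aA": pm * (1 - pf) + (1 - pm) * pf,
            "AA": pm * pf}[gc]

def likelihood(pheno, z, penetrance, q):
    """Likelihood (24.2) for a trio: sum over all genotype
    configurations of Pr(g) * prod_i Pr(y_i | g_i, z_i; psi)."""
    total = 0.0
    for gm, gf, gc in product(GENOS, repeat=3):
        prior = hw_freq(gm, q) * hw_freq(gf, q) * transmit(gm, gf, gc)
        contrib = 1.0
        for g, y, zi in zip((gm, gf, gc), pheno, z):
            f = penetrance(g, zi)
            contrib *= f if y == 1 else 1.0 - f
        total += prior * contrib
    return total

# Affected mother and child, unaffected father, all exposed; toy
# penetrance ignoring z, with a rare high-risk allele:
pen = lambda g, zi: 0.01 if g == "aa" else 0.20
print(likelihood((1, 0, 1), (1, 1, 1), pen, q=0.01))
```

Exhaustive enumeration grows exponentially in family size, which is exactly why the peeling algorithm cited in the text is used in practice.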
There is thus recent interest in linkage and association analysis techniques for finding genes involved in interactions and for characterizing such interactions.
V. LINKAGE ANALYSIS

Under the null hypothesis of no linkage, g and m segregate independently within a family, even if there is a population association between the two loci due to, say, population stratification. In a traditional model-based linkage analysis, one fixes the values of the trait model parameters (penetrances and allele frequency), maximizes the likelihood L(θ), and computes a "lod score" as log₁₀[L(θ̂)/L(0.5)]. A lod score exceeding 3.0 has historically been taken as significant evidence for linkage and a lod score of less than −2.0 as significant evidence against linkage (Morton, 1955). If the model for penetrance is misspecified, for example, by omitting a G × E or G × G interaction, the null
distribution of the lod score is unaffected and test sizes are preserved (Amos and Williamson, 1993; Williamson and Amos, 1995). Under the alternative, however, misspecification of the trait model can lead to a loss in power to detect linkage and biases estimates of θ upward (Clerget-Darpoux et al., 1986). Likewise, model misspecification will tend to increase the probability of declaring significant evidence against linkage. The magnitude of this potential loss of power, and the circumstances under which power can be improved by allowing for interactions, are most easily explored for simple sib-pair allele-sharing methods. Let π(1) and π(2) denote the probabilities that an affected sib pair shares 1 and 2 alleles at a marker locus identical by descent (IBD), and let π = π(1)/2 + π(2) be the expected proportion of alleles shared IBD. The standard "mean" test for linkage (Blackwelder and Elston, 1985) simply compares the observed proportion π̄ to its null expectation of 0.5, with binomial variance 0.25/(2N), where N is the number of independent sib pairs (for the time being, we ignore the complications of larger sibships). In the presence of G × E (or G × G) interaction, π is a weighted average of the sharing proportions in pairs in which both sibs are exposed (πEE), both are unexposed (πUU), or the sibs are exposure discordant (πEU), with the weights depending on the prevalence of the exposure factor. For example, consider a G × E interaction model in which we express the baseline penetrance as f0 = Pr(y = 1 | g = aa, z = 0) and the remaining penetrances in terms of "relative risks" RG, RE, RI as follows:

Pr(y = 1 | g = AA, z = 0) = f0 RG
Pr(y = 1 | g = aa, z = 1) = f0 RE
Pr(y = 1 | g = AA, z = 1) = f0 RG RE RI.

We assume an additive risk model within each exposure category, so that the penetrance for g = aA is the mean of the corresponding penetrances in carriers and noncarriers.
Consider first a "pure interaction" model, in which RE = 1 and RG = 1, so that the gene and environmental factors have an effect only in combination. For illustration, we fix f0 = 0.01, pE = 0.3, qA = 0.01, θ = 0.01, and assume a fully informative marker and no correlation in the environmental factor between siblings. Figure 24.1 shows the expected value of π, along with the exposure-stratum-specific sharing proportions πEE, πUU, and πEU, as the strength of RI varies. With increasing RI, the sibling relative risk λs (Risch, 1990) increases, leading to an increasing π. However, the increase is not uniform across the three exposure concordance strata: πEE increases much faster than π, whereas πUU and πEU remain constant at their null value of 0.5.
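The standard mean test described above, which compares π̄ to its null value of 0.5 using the binomial variance 0.25/(2N), reduces to a one-line Z statistic. The sharing counts in the example are hypothetical:

```python
import math

def mean_test(n0, n1, n2):
    """Z statistic of the affected-sib-pair mean test; n0, n1, n2 are
    the numbers of pairs sharing 0, 1, and 2 alleles IBD."""
    n = n0 + n1 + n2
    pi_bar = (0.5 * n1 + n2) / n      # observed mean IBD sharing
    se = math.sqrt(0.25 / (2 * n))    # null binomial standard error
    return (pi_bar - 0.5) / se

# 400 affected sib pairs with modestly elevated sharing:
print(round(mean_test(80, 190, 130), 2))  # 3.54
```

Under the stratified view in the text, the same statistic computed separately within the EE, EU, and UU strata would reveal the heterogeneity that the overall mean test averages away.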
Figure 24.1. Expected proportion of alleles shared IBD in affected sib pairs as a function of the interaction relative risk RI, when exposure and genotype have effects only in combination (RG = 1 and RE = 1): all sib pairs combined, and sib pairs stratified by exposure status (EE = both exposed, EU = one exposed, UU = both unexposed). Other parameters: qA = 0.01, pE = 0.3, no correlation in exposure status between sibs, θ = 0.01, and a fully informative marker.
These results suggest that instead of the mean test based on π̄, a more powerful test might be to compare πEE with πUU as two independent binomial proportions. Alternatively, one could form the regression relationship πi = α + βxi, where πi is the proportion of alleles shared IBD by the ith sib pair and xi is a covariate that codes the joint exposure profile of the pair. For example, one could let x = 0 for UU, x = 1 for EU, and x = 2 for EE to model a linear trend in allele-sharing proportions across exposure categories. As described by Gauderman et al. (1999a), this regression approach can be used to test the hypothesis that there is both linkage and G × E interaction, simply by testing the null hypothesis that β = 0 (there called the β-test). For varying values of RI, Figure 24.2 compares the expected power for detecting linkage using the β-test to the power using the standard mean test, assuming a sample size of 400 affected sib pairs, a pure interaction model, and holding the other model parameters fixed at the values listed earlier. As RI increases, the β-test is more powerful than the mean test, although even with a sample size of 400 sib pairs, the power is still quite low unless RI is very large. In other situations, the comparison is not so clear-cut. Suppose we know that there are marginal effects of a genetic factor and of an environmental
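The regression form of the β-test sketched above can be illustrated with ordinary least squares; the x = 0/1/2 coding follows the text, while the per-pair sharing values below are hypothetical:

```python
def ols_slope(x, y):
    """Least-squares slope of y on x (the beta in pi_i = alpha + beta*x_i)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

# Joint exposure coding per pair: UU = 0, EU = 1, EE = 2.
x = [0, 0, 1, 1, 2, 2]
pi = [0.50, 0.50, 0.50, 0.55, 0.60, 0.65]   # IBD sharing per pair
beta = ols_slope(x, pi)
print(beta)
```

A positive slope indicates that sharing rises with exposure concordance, the pattern expected under linkage with G × E interaction; testing beta = 0 then requires its standard error, which a full regression package would supply.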
24. Interacting Determinants in Localization of Genes
Figure 24.2. Power of the mean and β-tests for linkage in a sample of 400 affected sib pairs when exposure and genotype have effects only in combination (RG = 1 and RE = 1) and the other model parameters have the values given in the legend for Figure 24.1.
factor but are uncertain about the magnitude of their interaction. Here, it is natural to parameterize the model in terms of the factors that are likely to be known at the design stage: for example, the sib relative risk λs, the population disease frequency Kp, and the average exposure relative risk (i.e., the exposure relative risk that would be estimated in an epidemiologic study that did not consider the gene). As discussed elsewhere (Gauderman and Siegmund, 2000), we can then solve for the unknown parameters (e.g., f0, RE, and RG) and proceed as before to compute the π's and the power for the mean and β-tests. Figure 24.3 shows expected values of π and Figure 24.4 shows power for the mean and β-tests, varying the value of RI while holding λs = 1.25, Kp = 0.001, and the average exposure relative risk equal to 1.0. Now, as RI → 1, the three exposure-stratified π's converge (Figure 24.3), and since there is no variation in sharing among exposure groups, the power of the β-test goes to its nominal level of 5% (Figure 24.4). Because λs is fixed in this example, power for the mean test does not vary as RI varies. For these parameter choices, the β-test becomes more powerful than the mean test for RI > 11.0. For the pure interaction model depicted in Figure 24.1, it is clear that instead of the linear trend coding scheme for x, a more powerful test would be obtained by using an “EE contrast,” that is, letting x = 1 for EE pairs and x = 0 otherwise, or by simply testing H0: πEE = 0.5. These coding schemes might
Figure 24.3. Expected proportion of alleles shared IBD in affected sib pairs as a function of the interaction relative risk RI, holding λs = 1.25 and the average exposure relative risk equal to 1.0; for the definition of line styles and the values of the other model parameters, see the legend for Figure 24.1.
also lead to superior power for the type of allele-sharing patterns depicted in Figure 24.3, particularly when RI is large. However, the EE contrast coding schemes would have poor power if the picture were reversed, with allele sharing greater in UU pairs. For testing linkage when a G X E interaction is suspected but not established, one might also want to consider using a 2-df test of H0: π = 0.5 and β = 0, as described by Gauderman and Siegmund (2000). Additional work is required to determine whether there is an optimal coding scheme for x (i.e., one that provides good power under a variety of interaction models). The regression approach described earlier can also be applied to G X G interactions, with the specific implementation depending on whether the second locus is another linked marker (n) or a measured putative or candidate gene (h). For the former, if ψ denotes the expected number of alleles shared IBD at locus n, one can simply set x = ψ and use the β-test. This is equivalent to testing for correlation in allele sharing between the two marker loci. For a candidate gene, the approach depends upon the assumed form of dominance at locus h: if dominant or recessive, one first codes each member of the pair as “exposed” or not, depending upon whether it has the susceptible genotype at locus h; then the concordance of the pair (x = 0, 1, or 2) is coded as described earlier. For a codominant model, one might create two binary “exposure” variables for the bB
Figure 24.4. Power of the mean and β-tests for linkage in a sample of 400 affected sib pairs, holding λs = 1.25 and the average exposure relative risk equal to 1.0; the other model parameters have the values given in the legend for Figure 24.1.
and BB genotypes and two corresponding 0, 1, 2 concordance variables xbB and xBB, and then regress π on the x vector as a 2-df test. This approach is closely related to that of Cox et al. (1999), who assigned weights to each family's contribution to the nonparametric multipoint linkage (NPL) scores from GENEHUNTER based on whether their Z scores for linkage at NIDDM1 were positive or negative (or, for some analyses, the magnitude of the Z score if positive, 0 otherwise). In this way, they found much stronger evidence for linkage of diabetes to a region of chromosome 15 in families showing linkage at NIDDM1 than in those families without linkage at NIDDM1. Alternative approaches for incorporating covariates into the analysis of affected sib pairs to explore G X E and G X G interaction have been proposed by Greenwood and Bull (1997, 1999), Rice et al. (1999), and Holmans (1999); comparisons of the performance of these and the methods just described would be useful. An unsatisfying aspect of these approaches to G X G interaction is their lack of symmetry: one locus must be taken as the “dependent” variable, for example, in a regression of π on ψ. This asymmetry motivated us to develop an alternative approach based on generalized estimating equations (GEE), as described in greater detail in Thomas et al. (1999). Essentially, we model the means and covariances of the phenotypes jointly as a function of IBD sharing at perhaps multiple loci, as well as exposure concordance and their interactions.
Marker data on both affected and unaffected subjects are required. For a binary trait, we let μij denote E(yij | xij) = Pr(yij = 1 | xij) for member i of sibship j and postulate a logistic model of the form
We then construct the cross-products of the residuals Cijk = (yij − μij)(yik − μik) and model them as a linear function of the π's, x's, and their interactions,
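The displayed equations for the means and covariance models did not survive reproduction here. As a stand-in, the mechanics of the covariance model can be sketched numerically; all data below are simulated placeholders, and the plain least-squares fit is only illustrative (a real analysis would fit both models jointly by GEE with robust variance, as in Thomas et al., 1999):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500                                    # hypothetical sibships of size 2

pi = rng.binomial(2, 0.5, n) / 2.0         # IBD sharing at the marker
x = rng.integers(0, 3, n).astype(float)    # exposure concordance (0, 1, 2)
y = rng.binomial(1, 0.3, size=(n, 2)).astype(float)  # binary phenotypes
mu = np.full((n, 2), 0.3)                  # fitted means from the means model

# Cross-products of residuals, one per sib pair.
C = (y[:, 0] - mu[:, 0]) * (y[:, 1] - mu[:, 1])

# Linear model for C: intercept, linkage effect, exposure-concordance
# effect, and a linkage-by-exposure (G x E) interaction term.
Z = np.column_stack([np.ones(n), pi, x, pi * x])
coef, *_ = np.linalg.lstsq(Z, C, rcond=None)
print(coef.shape)
```

The coefficient on pi * x plays the role the text assigns to the G X E interaction term in the covariance model; additional loci and their products would simply add columns to Z.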
Thus the β's describe the main effect of each locus, the η's describe G X E interactions, and φ describes the G X G interaction; similarly, in the means model, γ tests allelic association with candidate locus h and ω tests G X E interactions with that locus. Thus, in principle, the approach could be used for joint linkage and linkage disequilibrium analysis by including the same locus in both models (Zhao et al., 1999). The approach is closely related to that of Elston et al. (2000), with the extension of allowing the means μij to be adjusted for person-specific covariates. To allow for the dependency within larger sibships, a GEE approach is taken to fitting the model and estimating the covariance of the regression coefficients. In the application to the problem posed at the Eleventh Genetic Analysis Workshop (GAW-11), we were able to fit a complex model involving several such interactions, including one three-way G X G X G interaction (involving two linked markers and one candidate gene). Based on the 25 replicates provided, we also showed better power for detecting the simulated G X E interaction than was obtained when the β-test was used, which in turn was better than the mean test. Candidate gene associations and interactions involving them can also be tested in the means model; in the GAW-11 data, a strong G X E interaction with the simulated candidate locus was found. Another approach to finding genes is model-based joint segregation and linkage analysis (JSLA), explicitly modeling possible interaction effects. Here, the likelihood in Equation (24.2) is expanded to Pr(y, m | α, qA, qm, θ), requiring modification of the second factor in the summation of that equation to Pr(g, m | qA, qm, θ). Thus, one performs a standard likelihood analysis, simultaneously estimating the recombination fraction θ, the allele frequencies q, and the parameters of the penetrance model, including any covariate and interaction effects.
In principle, the approach could also be applied to G X G interactions, but if the second trait gene is also unobserved, this would require jointly peeling over two loci, and the computations could be quite intensive. An alternative to JSLA is “mod score” analysis based on modeling Pr(m | y) (Hodge and
Elston, 1994), an approach that can also be used to jointly estimate linkage and G X E interaction parameters. This is useful when the ascertainment scheme is too complex to be modeled, although it can be substantially less efficient than JSLA because of the conditioning on all the trait information (Liang et al., 1996). In the context of affected sib pairs, the score test from the mod score approach has been shown to be equivalent to the mean test (Knapp et al., 1994). The performance of JSLA for a quantitative trait was studied by Gauderman and Faucett (1997a) by simulation and application to triglyceride levels in a single large pedigree. The investigators found that the correlation between triglyceride levels and body mass index depended on genotype (rAA = 0.72, rAa = 0.49, and raa = 0.20; LR χ² = 10.38, p = 0.006). Furthermore, the evidence for linkage to a marker on chromosome 2p was much stronger when this interaction was allowed for (θ = 0.100, SE = 0.113, LOD = 0.57) than without it (θ = 0.183, SE = 0.282, LOD = 0.05). Their subsequent simulation studies focused on the power to detect G X E interactions, and on bias and efficiency in estimating their magnitude. As might be expected, JSLA was generally more powerful for detecting G X E interactions than segregation analysis alone, with the relative efficiency of JSLA to segregation analysis increasing with tighter linkage between the trait and marker loci or larger marker heterozygosity. Failure to account for interaction effects represents a form of misspecification of the penetrance model, and general results by Clerget-Darpoux et al. (1986) show that this should lead to reduced power to detect linkage and estimates of θ that are biased toward 0.5, as exemplified in the application to the triglyceride data.
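As a quick arithmetic check on the likelihood-ratio result reported above (χ² = 10.38, p = 0.006): assuming the test compares three genotype-specific correlations against a single common correlation (an inference on our part, not stated explicitly in the text), it has 2 df, and for 2 df the chi-square survival function reduces to exp(−x/2), so no special functions are needed:

```python
import math

# LR chi-square on an assumed 2 df (three genotype-specific correlations
# versus one common correlation); for df = 2, P(X > x) = exp(-x / 2).
lr = 10.38
p = math.exp(-lr / 2.0)
print(round(p, 3))
```

The result, about 0.0056, rounds to the reported p = 0.006.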
Further research on the efficiency for detecting linkage of JSLA allowing for interaction effects, compared with pure model-based linkage analysis, JSLA ignoring interactions, or model-free linkage methods, would be very helpful.
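Before turning to association studies, the β-test regression described in this section can be sketched numerically. This is a minimal illustration, not the authors' implementation: all data are simulated placeholders, exposure concordance is coded x = 0/1/2 (with an EE-contrast alternative), and the per-pair IBD proportion is regressed on x by ordinary least squares with a Wald statistic for β = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400                                   # affected sib pairs, as in the text

# Joint exposure profile of each pair, and two candidate codings.
pairs = rng.choice(["UU", "EU", "EE"], size=n)
linear_trend = np.array([p.count("E") for p in pairs], dtype=float)  # 0/1/2
ee_contrast = (pairs == "EE").astype(float)                          # EE vs rest

# Simulated sharing that rises with exposure concordance, a stand-in
# for a pure G x E interaction model.
p_share = 0.5 + 0.05 * linear_trend
pi = rng.binomial(2, p_share) / 2.0       # proportion shared, in {0, .5, 1}

def beta_test(x, pi):
    """OLS fit of pi = alpha + beta * x; returns (beta_hat, Wald z)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, pi, rcond=None)
    resid = pi - X @ coef
    s2 = resid @ resid / (len(pi) - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return coef[1], coef[1] / se

b_lin, z_lin = beta_test(linear_trend, pi)
b_ee, z_ee = beta_test(ee_contrast, pi)
print(round(b_lin, 3), round(b_ee, 3))
```

In practice the sharing proportions would come from multipoint IBD estimation, and significance would be assessed against the appropriate one-sided alternative; this sketch only shows the mechanics of the two coding schemes.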
VI. ASSOCIATION STUDIES

The other approach to localizing genes is to test for allelic associations. Although spurious associations with unlinked markers can occur in case-control studies of unrelated individuals as a result of population stratification, the use of family controls (parents or siblings) eliminates this possibility. The most popular approach is the transmission disequilibrium test (TDT) (Spielman et al., 1993), which compares the distribution of transmitted and nontransmitted alleles in a matched fashion using the McNemar test. This can be shown to be equivalent to a case-pseudo-sib comparison (Self et al., 1991) under a multiplicative model for penetrance (Schaid, 1996). The case-pseudo-sib design is a 1:3 matched case-control design, analyzed by means of the
standard conditional likelihood, in which the fictitious controls have the three other possible genotypes that could have been transmitted to the proband, given the parents' genotypes. The advantage of this formulation over the standard TDT is that it naturally extends to other dominance models, and it provides an estimator of genotypic relative risks (as opposed to allelic relative risks). For diseases with a late age of onset, the parental genotypes necessary for the TDT will usually be unavailable, and a standard matched case-control study of affected subjects and their unaffected sibling(s) becomes an attractive alternative. For testing the main effect of a gene, the case-sibling design is generally less efficient than either the case-parent design or the standard case-control design with unrelated matched controls (Witte et al., 1999). However, the case-sibling design can be more efficient when the sample is restricted to case-sib pairs with an additional affected relative (Gauderman et al., 1999b). A subtlety in both the case-parent and case-sibling designs is that when there are two or more affected siblings in the former, or three or more siblings in the latter, the standard conditional likelihood does not provide valid tests of disequilibrium in the presence of linkage (Spielman and Ewens, 1998) or valid variance estimates for the induced relative risk parameter. Alternative approaches have been proposed in this situation for both the case-parent (Martin et al., 1997; Cleves et al., 1997; Lazzeroni and Lange, 1998) and case-sibling (Curtis, 1997; Horvath and Laird, 1998; Monks et al., 1998; Siegmund et al., 1999) designs. Interactions (G X E and/or G X G) can easily be incorporated into the conditional likelihood framework, simply by including the appropriate interaction terms in the logistic model for penetrance (Witte et al., 1999; Schaid, 1999).
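A minimal sketch of the case-pseudo-sib conditional likelihood for one matched set follows. The function names and the genotype relative-risk values are ours, chosen for illustration; the text does not prescribe them:

```python
from itertools import product

def offspring_genotypes(p1, p2):
    """All four equally likely transmissions from two parental genotypes,
    each genotype given as a tuple of two alleles, e.g. ('A', 'a')."""
    return [tuple(sorted((a, b))) for a, b in product(p1, p2)]

def conditional_likelihood(case, p1, p2, rr):
    """Case-pseudo-sib contribution: the case's relative risk divided by
    the summed relative risks of all four transmittable genotypes (the
    case plus its three fictitious controls). rr maps the count of 'A'
    (risk) alleles to a genotype relative risk."""
    risks = [rr[g.count("A")] for g in offspring_genotypes(p1, p2)]
    return rr[case.count("A")] / sum(risks)

# Illustrative multiplicative genotype relative risks: 1, 2, 4.
rr = {0: 1.0, 1: 2.0, 2: 4.0}
L = conditional_likelihood(("A", "a"), ("A", "a"), ("A", "a"), rr)
print(round(L, 3))
```

A G X E interaction enters simply by letting the relative risks in rr depend on the case's exposure, mirroring the inclusion of interaction terms in the logistic penetrance model described above.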
To obtain a valid test of interaction in the case-parent design, one must assume that G and E (or G and G) are independent, conditional on parental genotypes. Interestingly, although the case-sibling design is less efficient than the case-parent or population-control design for estimating genetic main effects, it can lead to increased efficiency for estimating G X E effects (Witte et al., 1999; Gauderman et al., 1999b). Siegmund et al. (1999) explored the performance of these approaches for detecting interactions using the GAW-11 simulated data. A highly significant interaction (p = 0.0009) between the simulated candidate gene and the environmental factor was found when the sibling-controls approach was used, but it was less significant (p = 0.025) with the pseudo-sibs approach. In this simulation, no such interaction with this locus was directly simulated, but there was a strong G X E interaction with another locus that was not in disequilibrium with any flanking markers; that interaction induced an indirect interaction with the candidate locus (see Gauderman et al., 1999a, for an explanation of this phenomenon). An alternative to the foregoing conditional likelihood approaches is one based on an extension of the segregation likelihood in Equation (24.2) to incorporate any observed genotypes at a putative or candidate locus. One
variation is the “genotyped proband design” (Gail et al., 1999), in which the genotype of the proband is determined, along with the phenotypes of his or her family members. Letting gp denote the genotype of the proband, the “modified segregation” likelihood based on gp and the observed phenotypes is
This likelihood is identical to that in Equation (24.2), except that the indicator function I(gp | g) restricts the summation over g to the set of genotype vectors that are consistent with gp. Genotypic data on additional family members can be incorporated naturally into the likelihood. An advantage of this approach is that it utilizes all available phenotypic information in a family, and it can be used regardless of whether data are available on parents or siblings, unlike the case-parent or case-sibling approaches. However, it does require estimation of the baseline penetrances and the allele frequency, which may be considered nuisance parameters if the main goal is to test for association. This “full” likelihood requires conditioning on the method of ascertainment, but an alternative “retrospective” likelihood of the form Pr(g | y) can be applied even if families have not been ascertained according to a formal statistical sampling procedure. Kraft and Thomas (2000) consider the statistical efficiency of these alternative likelihoods and the effects of heterogeneity in the nuisance parameters. It has been shown that association analysis based on the TDT can be (Risch and Merikangas, 1996), but is not always (Tu and Whittemore, 1999), more powerful than affected-sib-pair linkage analysis for finding a gene with a direct effect on the trait. Analogous comparisons would be useful to determine when one approach should be favored over the other for finding a gene involved in a G X E or G X G interaction. Similarly, comparisons of the various model-based methods for detecting associations would be helpful, taking into account the types of pedigree structure and the likely availability of phenotypic, genotypic, and environmental data.
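The displayed form of the modified segregation likelihood did not reproduce in this copy. A plausible rendering, inferred from the surrounding description (Equation (24.2) with its sum over genotype vectors restricted to those consistent with the proband's genotype) and not a quotation of the original, is:

```latex
% Hedged reconstruction; the exact notation of the original may differ.
L \;=\; \Pr(\mathbf{y}, g_p)
\;=\; \sum_{\mathbf{g}} I(g_p \mid \mathbf{g}) \,
      \Pr(\mathbf{y} \mid \mathbf{g}) \, \Pr(\mathbf{g} \mid q_A),
```

where $I(g_p \mid \mathbf{g}) = 1$ if the genotype vector $\mathbf{g}$ assigns genotype $g_p$ to the proband, and 0 otherwise.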
VII. TREATMENT OF MISSING COVARIATE DATA

In any model-based analysis (segregation, linkage, or association), problems can arise with missing covariate data. The validity of the “complete case” analysis, in which penetrance contributions from individuals with missing data are omitted from the likelihood, depends upon the assumption that the data are “missing completely at random,” that is, that the probability of missingness does not depend upon disease status or any other variables (Rubin, 1976). This assumption is unlikely to hold in most pedigree studies. For example, in the lung cancer data set described earlier, missingness of smoking data depended on
disease status (more in unaffected), sex (more in males), and generation (more in the upper generation). To address this problem, we have pursued a strategy based on “multiple imputation” (Rubin, 1987; Little, 1992), in which missing data are repeatedly sampled from appropriate distributions, conditional on factors that are considered likely to be relevant. In the lung cancer segregation analysis, these include such factors as year of birth, gender, disease status, and family members' smoking habits. In addition, smoking might depend upon genotype, which is unobserved. We described a Gibbs sampling approach in which, at each cycle, missing smoking information was sampled from its full conditional distribution, given the current assignment of genotypes and model parameters (as well as the other fixed data), and then new genotypes and model parameters were sampled, treating the imputed smoking data as known, via standard Gibbs sampling approaches (Gauderman and Thomas, 1994). Details are provided in Gauderman et al. (1997a) and Gauderman and Faucett (1997b). In the application to the lung cancer analysis, the use of imputed data had little influence on estimates of the allele frequency and the gene X pack-years interaction effect, but the estimated genetic main effect in the imputation analysis (RG = 9.0) was substantially lower than in the complete case analysis (RG = 17.3), probably because familial aggregation due to shared smoking habits among those with missing values was incorrectly attributed to the putative major gene. An alternative approach to handling missing data is simply to include in the penetrance model an indicator variable of covariate missingness (Andrieu and Demenais, 1997). For example, if smoking status is the exposure, one would code two dummy variables for smoking status: X1 = 1 if smoker and X1 = 0 otherwise, and X2 = 1 if smoking data are missing and X2 = 0 otherwise.
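The missing-indicator coding just described can be sketched as follows; the toy data are ours, and a real analysis would then include both dummies as terms in the logistic penetrance model:

```python
import numpy as np

# Missing-indicator coding for smoking status: X1 = 1 for a known
# smoker, X2 = 1 when smoking status is missing. None marks a missing
# value in this illustrative data.
status = ["smoker", "nonsmoker", None, "smoker", None]

X1 = np.array([1 if s == "smoker" else 0 for s in status])
X2 = np.array([1 if s is None else 0 for s in status])

print(X1.tolist(), X2.tolist())
```

Subjects with missing smoking data thus contribute through the X2 coefficient rather than being dropped, which is what makes the approach easy to implement with existing software.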
This approach has the advantage of being easy to implement, requiring no new software development. It has been studied in the context of epidemiologic studies (Greenland and Finkle, 1995; Jones, 1996; Huberman and Langholz, 1999), but its statistical properties are not known in the context of pedigree studies. Further work is required to establish its validity and to determine how it compares with the imputation approach described earlier.
VIII. CONCLUSIONS

The major impact of failure to account for G X E and G X G interactions in linkage analysis is not false positives but false negatives. The well-known loss of power and bias toward the null in estimates of θ resulting from misspecification of the penetrance model in parametric linkage analysis (Clerget-Darpoux et al., 1986) can be overcome by allowing for interactions in a joint segregation and linkage
analysis. Simulation studies have shown that this is an efficient technique for detecting and quantifying G X E interactions, but more research on the efficiency of JSLA for detecting linkage in the presence of interactions is needed. A similar loss of power due to the presence of interaction effects has been shown for nonparametric linkage methods. We have shown how this loss of power can be overcome by alternative tests based on regressing IBD sharing probabilities on measures of sibling concordance for the interacting factors, or by using GEE methods to model the phenotype means and covariances directly in terms of main effects and interactions. Similar approaches can be taken for modeling interactions with candidate genes or markers that are in linkage disequilibrium with a disease gene. We are working on a unified approach to linkage and linkage disequilibrium mapping of disease genes, including interaction effects. For complex traits such as cancer, cardiovascular disease, autoimmune disorders, and neurological diseases, methods that incorporate interacting determinants may mean the difference between success and failure in the localization of new genes.
Acknowledgments This work was supported by National Cancer Institute grant CA52862 and National Institute of Environmental Health Sciences grants 5P30-ES07048-02 and ES10421.
References
Abel, L., and Bonney, G. E. (1990). A time-dependent logistic hazard function for modeling variable age of onset in analysis of familial diseases. Genet. Epidemiol. 7, 391-407.
Amos, C., and Williamson, J. (1993). Robustness of the maximum-likelihood (lod) method for detecting linkage. Am. J. Hum. Genet. 52, 213-214.
Andrieu, N., and Demenais, F. (1997). Interactions between genetic and reproductive factors in breast cancer risk in a French family sample. Am. J. Hum. Genet. 61, 678-690.
Blackwelder, W., and Elston, R. (1985). A comparison of sib-pair linkage tests for disease susceptibility loci. Genet. Epidemiol. 2, 85-97.
Bonney, G. E. (1984). On the statistical determination of major gene mechanisms in continuous human traits: Regressive models. Am. J. Med. Genet. 18, 731-749.
Bonney, G. E. (1986). Regressive logistic models for familial disease and other binary traits. Biometrics 42, 611-625.
Clerget-Darpoux, F., Bonaiti-Pellie, C., and Hochez, J. (1986). Effects of misspecifying genetic parameters in lod score analysis. Biometrics 42, 393-399.
Cleves, M., Olson, J., and Jacobs, K. (1997). Exact transmission-disequilibrium tests with multiallelic markers. Genet. Epidemiol. 14, 337-347.
Cox, N., Frigge, M., Nicolae, D., Concannon, P., Hanis, C., Bell, G., and Kong, A. (1999). Loci on chromosomes 2 (NIDDM1) and 15 interact to increase susceptibility to diabetes in Mexican Americans. Nat. Genet. 21, 213-215.
Curtis, D. (1997). Use of siblings as controls in case-control association studies. Ann. Hum. Genet. 61, 319-333.
Elston, R., and Stewart, J. (1971). A general model for the genetic analysis of pedigree data. Hum. Hered. 21, 523-542.
Elston, R. C., Buxbaum, S., Jacobs, K. B., and Olson, J. M. (2000). Haseman and Elston revisited (abstract). Genet. Epidemiol. 19, 1-17.
Gail, M., Pee, D., Benichou, J., and Carroll, R. (1999). Designing studies to estimate the penetrance of an identified autosomal dominant mutation: Cohort, case-control, and genotype-proband designs. Genet. Epidemiol. 16, 15-39.
Gauderman, W., and Faucett, C. (1997a). Detection of gene-environment interactions in joint segregation and linkage analysis. Am. J. Hum. Genet. 61, 1189-1199.
Gauderman, W., and Faucett, C. (1997b). A Gibbs sampling approach for handling missing data in pedigree studies. Comput. Sci. Stat. 29, 250-258.
Gauderman, W., and Morrison, J. (2000). Evidence for age-specific genetic relative risks in lung cancer. Am. J. Epidemiol. 151, 41-49.
Gauderman, W., and Thomas, D. (1994). Censored survival models for genetic epidemiology. Genet. Epidemiol. 11, 171-188.
Gauderman, W., Morrison, J., Carpenter, C., and Thomas, D. (1997). Analysis of gene-smoking interaction in lung cancer. Genet. Epidemiol. 14, 199-214.
Gauderman, W., Morrison, J., Siegmund, K., and Thomas, D. (1999a). A joint test of linkage and G X E interaction using affected sib pairs. Genet. Epidemiol. 17, S563-S568.
Gauderman, W., Witte, J., and Thomas, D. (1999b). Family-based association studies. J. Natl. Cancer Inst. 26, 31-39.
Gauderman, W., and Siegmund, K. (2000). Gene-environment interaction and affected-sib pair linkage analysis. Hum. Hered., in press.
Greenland, S., and Finkle, W. D. (1995). A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am. J. Epidemiol. 142, 1255-1264.
Greenwood, C., and Bull, S. (1997). Incorporation of covariates into genome scanning using sib-pair analysis in bipolar affective disorder. Genet. Epidemiol. 14, 635-640.
Greenwood, C., and Bull, S.
(1999). Analysis of affected sib pairs, with covariates-with and without constraints. Am. J. Hum. Genet. 64, 871-885.
Hodge, S., and Elston, R. (1994). Lods, wrods, and mods: The interpretation of lod scores calculated under different models. Genet. Epidemiol. 11, 329-342.
Horvath, S., and Laird, N. (1998). A discordant-sibship test for disequilibrium and linkage: No need for parental data. Am. J. Hum. Genet. 63, 1886-1897.
Holmans, P. (1999). Methods for detecting gene-gene interaction using affected sib pairs. Abstract at IGES, St. Louis, MO.
Huberman, M., and Langholz, B. (1999). Application of the missing-indicator method in matched case-control studies with incomplete data. Am. J. Epidemiol. 150, 1340-1345.
Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. J. Am. Stat. Assoc. 91, 222-230.
Kirk, R. L., Kinns, H., and Morton, N. E. (1970). Interaction between the ABO blood group and haptoglobin systems. Am. J. Hum. Genet. 22, 384-389.
Khoury, M., Beaty, T., and Cohen, B. (1993). “Fundamentals of Genetic Epidemiology.” Oxford University Press, Oxford.
Knapp, M., Seuchter, S. A., and Baur, M. P. (1994). Linkage analysis in nuclear families. 2. Relationship between affected sib-pair tests and lod score analysis. Hum. Hered. 44, 44-51.
Kraft, P., and Thomas, D. C. (2000). Bias and efficiency in family-based gene-characterization studies: Conditional, prospective, retrospective, and joint likelihoods. Am. J. Hum. Genet. 66, 1119-1131.
Lange, K., and Elston, R. (1975). Extensions to pedigree analysis. I. Likelihood calculations for simple and complex pedigrees. Hum. Hered. 25, 95-105.
Lazzeroni, L., and Lange, K. (1998). A conditional inference framework for extending the transmission/disequilibrium test. Hum. Hered. 48, 67-81.
Liang, K.-Y., Rathouz, P., and Beaty, T. (1996). Determining linkage and mode of inheritance: Mod scores and other methods. Genet. Epidemiol. 13, 575-593.
Little, R. (1992). Regression with missing X's. J. Am. Stat. Assoc. 87, 1227-1237.
Martin, E., Kaplan, N., and Weir, B. (1997). Tests for linkage and association in nuclear families. Am. J. Hum. Genet. 61, 439-448.
Monks, S., Kaplan, N., and Weir, B. (1998). A comparative study of sibship tests of linkage and/or association. Am. J. Hum. Genet. 63, 1507-1516.
Morton, N. (1955). Sequential tests for the detection of linkage. Am. J. Hum. Genet. 7, 277-318.
Ottman, R. (1990). Epidemiologic approach to gene-environment interaction. Genet. Epidemiol. 7, 177-185.
Ottman, R. (1996). Gene-environment interaction: Definitions and study designs. Prev. Med. 25, 764-770.
Rao, D. C., and Morton, N. E. (1974). Path analysis of family resemblance in the presence of gene-environment interaction. Am. J. Hum. Genet. 26, 767-772.
Rice, J., Rochberg, N., Neuman, R., Saccone, N., Liu, K., Zhang, X., and Culverhouse, R. (1999). Covariates in linkage analysis. Genetic Analysis Workshop 11 preliminary manuscripts, pp. 567-571.
Risch, N. (1990). Linkage strategies for genetically complex traits. I. Multilocus models. Am. J. Hum. Genet. 46, 222-228.
Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273, 1616-1617.
Rubin, D. (1976). Inference and missing data. Biometrika 63, 581-592.
Rubin, D. (1987). “Multiple Imputation for Nonresponse in Surveys.” Wiley, New York.
Schaid, D. (1996). General score tests for associations of genetic markers with disease using cases and their parents. Genet. Epidemiol. 13, 423-449.
Schaid, D. (1999). Case-parents design for gene-environment interaction. Genet. Epidemiol. 16, 261-273.
Self, S., Longton, G., Kopecky, K., and Liang, K. (1991). On estimating HLA/disease association with application to a study of aplastic anemia. Biometrics 47, 53-61.
Sellers, T., Bailey-Wilson, J., Elston, R., Wilson, A., Elston, G., Ooi, W., and Rothschild, H. (1990). Evidence for mendelian inheritance in the pathogenesis of lung cancer. J. Natl. Cancer Inst. 82, 1272-1279.
Sellers, T., Bailey-Wilson, J., Potter, J., Rich, S., Rothschild, H., and Elston, R. (1992a). Effect of cohort differences in smoking prevalence on models of lung cancer susceptibility. Genet. Epidemiol. 9, 261-272.
Sellers, T., Potter, J., Bailey-Wilson, J., Rich, S., Rothschild, H., and Elston, R. (1992b). Lung cancer detection and prevention: Evidence for an interaction between smoking and genetic predisposition. Cancer Res. (Suppl.) 52, 2694s-2697s.
Siegmund, K., Gauderman, W., and Thomas, D. (1999). Association tests using unaffected-sibling versus pseudo-sibling controls. Genet. Epidemiol. 17, Suppl. 1: S731-S736.
Spielman, R., and Ewens, W. (1998). A sibship test for linkage in the presence of association: The sib transmission/disequilibrium test. Am. J. Hum. Genet. 62, 450-458.
Spielman, R., McGinnis, R., and Ewens, W. (1993). Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52, 506-516.
Thomas, D., Qian, D., Gauderman, W., Siegmund, K., and Morrison, J. (1999). A GEE approach to modeling disease concordance within sibships in relation to multiple markers and exposure factors. Genet. Epidemiol. 17, Suppl. 1: S737-S742.
Tiret, L., Abel, L., and Rakotovao, R. (1993). Effect of ignoring genotype-environment interaction on segregation analysis of quantitative traits. Genet. Epidemiol. 10, 581-586.
Tu, I., and Whittemore, A. (1999). Power of association and linkage tests when the disease alleles are unobserved. Am. J. Hum. Genet. 64, 641-649.
Williamson, J., and Amos, C. (1995). Guess LOD approach: Sufficient conditions for robustness. Genet. Epidemiol. 12, 163-176.
Witte, J., Gauderman, W., and Thomas, D. (1999). Asymptotic bias and efficiency in case-control studies of candidate genes and gene-environment interactions: Basic family designs. Am. J. Epidemiol. 149, 693-705.
Zhao, L. P., Quiaoit, F., Aragaki, C., and Hsu, L. (1999). An efficient, robust and unified framework for mapping complex traits. III. Combined linkage/linkage-disequilibrium analysis. Am. J. Med. Genet. 84, 433-453.
Linkage Disequilibrium Mapping: The Role of Population History, Size, and Structure

N. H. Chapman
Department of Biostatistics
University of Washington
Seattle, Washington 98195

E. A. Thompson¹
Departments of Biostatistics and Statistics
University of Washington
Seattle, Washington 98195
I. Summary
II. Introduction
III. IBD and Allelic Associations at a Single Locus
IV. Allelic Associations between Two Loci
V. Estimation of θ from Observed Associations
VI. Effects of Size and Structure on Allelic Associations
VII. Structure in Human Populations: Some Examples
References
I. SUMMARY

Linkage disequilibrium mapping attempts to infer the location of a disease gene from observed associations between marker alleles and disease phenotype. This approach can be quite powerful when disease chromosomes are descended from a single founder mutation and the markers considered are tightly linked to the disease locus. The success of linkage disequilibrium mapping in fine-scale localization has led to the suggestion that genome-wide association testing might be useful in the detection of susceptibility genes for complex traits. Such studies would likely be performed in small, relatively isolated founder populations, where heterogeneity of the disease is less likely. To interpret the patterns of association observed in such populations, we need to understand the effects of population size, history, and structure on linkage disequilibrium. In this chapter, we first review measures of allelic association at a single locus. Measures of association between two loci are then described, and some theoretical results are reviewed. We then consider some methods for inferring linkage between a marker and a rare disease, focusing on those that model the ancestry of the disease chromosomes. Next we discuss factors whose effects on disequilibrium are understood, and finally we describe the characteristics of some human populations that may be useful for disequilibrium mapping of complex traits.

¹To whom correspondence should be addressed.
II. INTRODUCTION

The goal of genetic linkage analysis is to infer the location of a disease gene based on coinheritance of the disease phenotype with some genetic marker whose chromosomal location is known. Disequilibrium mapping attempts to do the same thing, only without benefit of the pedigree relating disease chromosomes to one another. It relies on allelic associations between marker alleles and disease phenotype and is based on the idea that strong associations will be due to linkage, rather than chance. Thus identity by descent (IBD) due to coancestry is inferred from identity by state (IBS) data, in the form of observed allelic associations. In combination, linkage analysis and disequilibrium analysis have been quite successful in the localization of genes for a number of simple Mendelian disorders (e.g., Hastbacka et al., 1992, 1994; Puffenberger et al., 1994; Risch et al., 1995; Goddard et al., 1996). The two approaches are quite complementary in situations such as this. Linkage analysis using recombination events within a pedigree can map a disease locus to a region of approximately 1 cM (Boehnke, 1994). Disequilibrium analysis (also known as haplotype analysis) can then pinpoint the disease locus by assuming a common ancestor for the chromosomes carrying the disease allele, and using all recombinations on paths back to that ancestor. The success of disequilibrium testing in this context has led a number of investigators (e.g., Risch and Merikangas, 1996; Brown and Hartwell, 1998) to consider the use of genome-wide disequilibrium testing as an approach to finding susceptibility loci for common complex diseases. In particular, small isolated populations are of interest, since disease observed in these populations
may be due to one or two alleles present in founding individuals, potentially eliminating the problem of heterogeneity. Furthermore, recently founded small populations may exhibit more disequilibrium than larger outbred populations (Kruglyak, 1999a). The utility of a population for this kind of study depends on being able to distinguish disequilibrium maintained by linkage from background disequilibrium, which exists as a result of the population’s size and structure. This problem motivates our consideration of the effect of population structure on disequilibrium. In Section III, we review measures of association for a single locus and give an example of the effect of population structure on these measures. Measures of association between two loci are described in Section IV, and some theoretical results describing how their means and variances change over time are reviewed. In Section V, we consider some methods for inferring linkage between a marker and a rare disease, focusing on those that model the ancestry of the disease chromosomes. In Section VI we discuss factors (e.g., time, population size) whose effects on disequilibrium are understood, and finally, in Section VII, we describe the characteristics of some human populations that may be useful for disequilibrium mapping of complex traits.
III. IBD AND ALLELIC ASSOCIATIONS AT A SINGLE LOCUS

Wright (1922) introduced the single-locus measures of relationship in use today. As a measure of IBD, he defined the fixation index f. This is commonly called the coefficient of inbreeding, and as Malecot (1948, 1969) elaborated, it is the probability that the two genes at a single locus within an individual are IBD. A related quantity is the coefficient of kinship between two individuals, which is defined as the probability of IBD between two homologous genes, one segregating from each individual. The coefficient of inbreeding for an individual is equal to the coefficient of kinship between his or her parents. Wright also introduced a measure ρ of IBS, defined as the correlation between allelic states on uniting gametes. Consider a locus with alleles A and a, allele frequencies P_A and P_a, and genotype frequencies P_AA, P_Aa, and P_aa. For two uniting gametes, let X = 1 if the maternal gamete carries allele A (X = 0 otherwise) and similarly let Y = 1 if the paternal gamete carries allele A (Y = 0 otherwise). Then in an infinitely large population where matings happen between individuals whose coefficient of kinship is equal to f, Wright (1922) showed that

ρ = corr(X,Y) = (P_AA − P_A²)/(P_A P_a) = [f P_A + (1 − f)P_A² − P_A²]/(P_A P_a) = f.    (25.1)
This equation, probably the first presentation of the relationship between IBD and IBS, demonstrates the special case where the two measures are equal. Another concept related to allelic association at a single locus is Hardy-Weinberg Equilibrium, which was described independently by Hardy (1908) and Weinberg (1908). In an infinite random mating population, the genotype frequencies P_AA, P_Aa, and P_aa will be equal to P_A², 2P_A P_a, and P_a², respectively. This relationship between the genotype frequencies and the allele frequencies defines Hardy-Weinberg Equilibrium (HWE). Departures from HWE reflect associations between alleles at that locus. Equation (25.1) shows that when the population is in HWE, ρ = 0 and there is no association.

Population structure can give rise to allelic association. Consider in particular the example of population subdivision. Suppose there are k subpopulations of equal size, and P_i denotes P_A in the ith subpopulation. Suppose that each subpopulation is in HWE, and let

P̄ = (1/k) Σ_i P_i    and    σ² = (1/k) Σ_i P_i² − P̄²

denote the mean frequency of allele A in the pooled populations, and the variance of the k allele frequencies, respectively. Then the genotype frequencies in the pooled population are given by:

P_AA = P̄² + σ²,
P_Aa = 2P̄(1 − P̄) − 2σ²,
P_aa = (1 − P̄)² + σ².

The pooled population is not in HWE; there is an excess of homozygotes. Note that

ρ = (P_AA − P̄²)/(P̄(1 − P̄)) = (P̄² + σ² − P̄²)/(P̄(1 − P̄)) = σ²/(P̄(1 − P̄)),

so that when σ² is not zero, an association exists in the pooled population, even though no associations exist in the subpopulations (Wahlund, 1928). This example demonstrates the effect that population structure can have on allelic association at a single locus.

Estimation of f or ρ in human populations is difficult because it involves the estimation of variances and covariances. Edwards (1971) used data from the ABO blood group to consider the joint estimation of ABO allele frequencies and the level of IBD (f) in a population. He demonstrated that because of a
singularity in the likelihood, it is very difficult to get good estimates of small values of f, which are likely found in human populations. Morton et al. in "The bioassay of kinship" (1971) capitalized on an important point: population history is shared by all loci, and therefore better estimates of f or ρ can be obtained by pooling information from several unlinked loci. The bioassay of kinship used allelic associations resulting from isolation by distance in a structured population to estimate ρ, from which the authors hoped to infer coancestry (f). In practice, estimates of ρ have often been used in place of estimates of f, as if the two processes are equivalent. This is not generally the case, because allele frequencies drift, particularly in small populations. Thompson (1976) considered a single small (N < 500) population with no hierarchic structure and examined the joint evolution of heterozygosity (H = 1 − ρ) and nonidentity (1 − f) at a single locus, over time scales of up to 50 generations. In populations of this sort, which are typical of human populations of interest, the variances of both processes were very high, and correlations between the two were generally low and occasionally negative.
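The Wahlund calculation above is easy to check numerically. The sketch below uses arbitrary illustrative subpopulation frequencies, not data from the chapter:

```python
# Wahlund effect: pooling k HWE subpopulations of equal size yields an excess
# of homozygotes and a single-locus association rho = sigma^2/(pbar*(1-pbar)).
# The subpopulation allele frequencies here are arbitrary examples.
freqs = [0.1, 0.5, 0.9]          # P_i: frequency of allele A in each subpopulation
k = len(freqs)

pbar = sum(freqs) / k            # mean allele frequency in the pooled population
var = sum(p**2 for p in freqs) / k - pbar**2

# Pooled genotype frequencies (each subpopulation in HWE, equal sizes)
P_AA = sum(p**2 for p in freqs) / k              # = pbar^2 + var
P_Aa = sum(2*p*(1-p) for p in freqs) / k         # = 2*pbar*(1-pbar) - 2*var
P_aa = sum((1-p)**2 for p in freqs) / k          # = (1-pbar)^2 + var

rho = (P_AA - pbar**2) / (pbar * (1 - pbar))     # = var / (pbar*(1-pbar))

print(round(P_AA - pbar**2, 6))   # excess homozygosity equals var
print(round(rho, 6))
```

The excess of homozygotes in the pooled population is exactly the between-subpopulation variance of the allele frequency, as in the derivation above.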
IV. ALLELIC ASSOCIATIONS BETWEEN TWO LOCI

The gametic correlation ρ described in the preceding section is a locus-specific measure of allelic association between two gametes. A measure of allelic association of more interest in the context of mapping is that between two loci on the same gamete. In this section, we define two such measures and review theoretical properties of their means and variances in finite populations.

Consider two loci, one with alleles A and a, the other with alleles B and b. Let p_A and p_B denote the frequencies of alleles A and B, respectively. If the four possible haplotypes AB, Ab, aB, and ab occur with frequencies p_AB, p_Ab, p_aB, and p_ab, respectively, then let

D = p_AB − p_A p_B,

or equivalently,

D = p_AB p_ab − p_Ab p_aB.

If the alleles A and B are independently distributed on haplotypes according to their allele frequencies, D = 0. A population in which D is nonzero is said to exhibit disequilibrium.

Robbins (1918), who was the first to suggest the use of D, studied its properties in an infinite population. He showed that for loci separated by a
recombination fraction of θ,

D_t = (1 − θ)D_{t−1}.

Thus disequilibrium decays to zero over many generations (when θ > 0), so that in the limit, the alleles at each locus are independently distributed over haplotypes in the population.

To relate D to the measure ρ discussed earlier, consider a randomly selected haplotype, and let X = 1 if the haplotype carries allele A at locus 1 (X = 0 otherwise) and let Y = 1 if the haplotype carries allele B at locus 2 (Y = 0 otherwise). Then D = cov(X,Y). This suggests the use of a measure of disequilibrium directly analogous to ρ, given by

r = corr(X,Y) = D/√(p_A(1 − p_A)p_B(1 − p_B)).    (25.2)

This measure is sometimes preferred because it is less sensitive to allele frequencies than D, but its evolution is much more difficult to study because allele frequencies change over time. Studies of the properties of disequilibrium in finite populations are of more interest, since human and animal populations often are small or are descended from groups with small numbers of founders. Earlier work concentrated exclusively on random mating populations, presumably because of the complexity of the calculations involved.
A. Expectation of D in a finite population

Karlin and McGregor (1968) considered a population consisting of N diploid individuals, corresponding to 2N haplotypes. These 2N haplotypes give rise to the next generation by donating gametes to a "gamete pool" from which the 2N haplotypes of the next generation are randomly selected, with replacement. This is known as the "random union of gametes" model. The gamete pool is produced by considering all diploid genotypes possible in the parent generation, each contributing in proportion to its probability, which is assumed to be equal to the product of the haplotype frequencies. For example, consider the diploid genotype AB/ab. This genotype is assumed to occur with probability 2p_AB p_ab, and it contributes gametes of types AB, Ab, aB, and ab with probabilities ½(1 − θ), ½θ, ½θ, and ½(1 − θ), respectively. This model has also been called the haploid model, since it is equivalent to the case of each haplotype in the offspring generation being the result of a "mating" of two randomly selected haplotypes in the parent generation. Karlin and McGregor formulated the model as a Markov chain, with state-space described by the 4-vector of haplotype
counts. They showed that

E(D_{t+1}) = (1 − 1/(2N))(1 − θ)E(D_t),

and stated that

Var(D_t) ~ γβ^t,

where γ is a positive constant depending on the initial conditions, and β > (1 − 1/(2N))(1 − θ). While both E(D_t) → 0 and Var(D_t) → 0, the variance of D approaches zero at a slower rate than its expectation. This suggests that even when enough generations have passed for E(D_t) to be close to zero, Var(D_t) may be large, and therefore a particular population may have a value of D_t quite different from zero.

Watterson (1970) considered the same problem, using a slightly different model for random mating. The model, called the "random union of zygotes" model, assumes a constant population size of N diploids. Each member of the offspring generation is obtained by randomly selecting (with replacement) two individuals from the parent generation, and generating a gamete from each. This model is also referred to as the diploid model because it retains the diploid nature of the individuals in the parent generation. Watterson also used a Markov chain formulation, with state-space described by the 10-vector of diploid genotype counts. He showed that the expected disequilibrium again decays geometrically, at a rate differing from that of the haploid model only by terms of order θ/(2N), a result also found by Hill and Robertson (1966). The expected disequilibrium approaches zero somewhat more slowly for the haploid model, although the difference is small for small values of θ/(2N). In fact if θ = 0, the two models are exactly equivalent.
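A small Monte Carlo experiment under the haploid ("random union of gametes") model illustrates the geometric decay of E(D_t). The parameter values are arbitrary, and this is a sketch of the model itself, not of Karlin and McGregor's Markov chain analysis:

```python
import random

def simulate_D(N, theta, T, reps, seed=1):
    """Monte Carlo mean of D_T under the haploid ("random union of gametes")
    model: each of the 2N offspring haplotypes comes from two randomly chosen
    parent haplotypes, and is recombinant with probability theta.
    Starts from complete association (N copies of AB, N of ab, so D_0 = 0.25)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        pop = [(1, 1)] * N + [(0, 0)] * N   # haplotype = (allele at A, allele at B)
        for _ in range(T):
            new = []
            for _ in range(2 * N):
                h1, h2 = rng.choice(pop), rng.choice(pop)
                new.append((h1[0], h2[1]) if rng.random() < theta else h1)
            pop = new
        pAB = sum(1 for h in pop if h == (1, 1)) / (2 * N)
        pA = sum(h[0] for h in pop) / (2 * N)
        pB = sum(h[1] for h in pop) / (2 * N)
        total += pAB - pA * pB
    return total / reps

N, theta, T = 50, 0.05, 20
expected = ((1 - 1 / (2 * N)) * (1 - theta)) ** T * 0.25
print(round(simulate_D(N, theta, T, reps=300), 3), round(expected, 3))
```

The simulated mean of D_t tracks the finite-population expectation [(1 − 1/(2N))(1 − θ)]^t D_0, which decays slightly faster than the infinite-population rate (1 − θ)^t.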
B. Variance of D and r

The papers cited in Section IV.A confirmed that disequilibrium decays to zero in a finite randomly mating population, just as it does in an infinite population, but the rate of decay is slower for smaller populations. Karlin and McGregor's work suggests that the variability of disequilibrium could be quite important in small populations. This problem was first explored by Hill and Robertson (1968) for the haploid model, assuming a population initially in equilibrium
(i.e., D_0 = 0). They defined a vector of moments y_t, and used the multinomial distribution of the haplotype frequencies to find a matrix M(N,θ) such that y_{t+1} = M(N,θ)y_t. For the special case of θ = 0, they derived an explicit formula for E[D_t²]. For values of θ other than zero, the value of E[D_t²] can be found by iteratively applying M to y_0. Weir and Hill (1980) considered the same problem for several types of random mating in a diploid population:

- MS: a monoecious population of size N in which selfing is allowed (equivalent to Karlin and McGregor's diploid model).
- ME: a monoecious population of size N in which selfing is not allowed.
- DR: a dioecious population of M males, F females, and effective population size N = 4MF/(M + F), where each child is the offspring of a random pairing.
- DH: a dioecious population with lifetime pairing, M males, each mated to s females. Thus F = sM and N = 4MF/(M + F). This model describes monogamy when s = 1.
Their method is based on two-locus descent measures, which are defined as the joint probability of non-IBD of a with a′ and b with b′, where a and a′ denote alleles at one locus on different chromosomes, and b, b′ denote alleles at a second locus. There are three classes of descent measures, described as di-, tri-, and quadrigametic, according to whether the alleles being compared are on two, three, or four chromosomes. For example, a trigametic descent measure would be Pr(a is not IBD to a′, and b is not IBD to b′), with a, a′, b, and b′ as depicted in Figure 25.1. Weir and Hill (1980) show that for a population starting in equilibrium, Var(D_t) = E(D_t²) is a simple transformation of the descent measures in generation t and the initial allele frequencies. In a contemporaneous paper, Weir et al. (1980) developed transition matrices to describe the two-locus descent measures at time t + 1 as a function of the two-locus descent measures at time t, for each of the four types of random mating just listed. The transition matrices depend on the mating system, the size N of the population, and the recombination fraction θ between the two loci. Thus Weir and Hill's method allows the exact calculation of Var(D_t) for a particular N and θ: first the values of the relevant descent measures are calculated, and then they are transformed appropriately.
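Exact descent-measure calculations are beyond a short sketch, but the qualitative behavior of Var(D_t) is easy to see by simulation. The sketch below uses the simpler haploid model (not the monogamous diploid model treated by Weir and Hill), starting from linkage equilibrium with p_A = p_B = 0.5; all parameter values are arbitrary:

```python
import random

def var_D(N, theta, T, reps, seed=2):
    """Monte Carlo Var(D_T) under the haploid model, starting in equilibrium:
    the four haplotypes AB, Ab, aB, ab each at frequency 1/4, so D_0 = 0."""
    rng = random.Random(seed)
    vals = []
    for _ in range(reps):
        pop = [(1, 1), (1, 0), (0, 1), (0, 0)] * (N // 2)   # 2N haplotypes, D = 0
        for _ in range(T):
            new = []
            for _ in range(2 * N):
                h1, h2 = rng.choice(pop), rng.choice(pop)
                new.append((h1[0], h2[1]) if rng.random() < theta else h1)
            pop = new
        pAB = sum(1 for h in pop if h == (1, 1)) / (2 * N)
        pA = sum(h[0] for h in pop) / (2 * N)
        pB = sum(h[1] for h in pop) / (2 * N)
        vals.append(pAB - pA * pB)
    m = sum(vals) / reps
    return sum((v - m) ** 2 for v in vals) / (reps - 1)

# Drift builds up variance in D even though E(D_t) stays at zero,
# and the effect is stronger for tight linkage (small theta).
v_tight = var_D(N=20, theta=0.01, T=30, reps=200)
v_loose = var_D(N=20, theta=0.5, T=30, reps=200)
print(round(v_tight, 4), round(v_loose, 4))
```

Although E(D_t) remains zero here, individual populations wander away from D = 0, and much more so for tightly linked loci, which is the pattern shown for the monogamous populations in Figure 25.2.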
Figure 25.1 Three chromosomes, showing alleles a, a′ at one locus and b, b′ at a second locus.
Figure 25.2 shows the variance of D_t across monogamous random mating populations of size N = 20 or N = 50, where the founding population was in equilibrium, and the initial allele frequencies were p_A = p_B = 0.5. In this situation, although E(D_t) over all populations is zero, the high variance of D_t, particularly for small recombination fractions, suggests that the value of D_t in any particular population could be quite different from zero. The plots suggest that for larger populations, the peak variance is smaller and happens later in time. There are a number of examples of relatively young human isolates that were founded by small numbers of individuals, so it seems likely that some of these populations will exhibit substantial disequilibrium between linked loci.

Hill and Robertson (1968) and Weir and Hill (1980) were also interested in the behavior of the measure r [see Equation (25.2)] in finite populations. Hill and Robertson approached the problem by simulation, using a haploid population of size 16 chromosomes, starting with allele frequencies p_A = p_B = 0.5, and D_0 = 0. The correlation r has the property of being undefined when either of the loci is fixed, so they based their estimates of E(r_t²) on populations in which both loci were still segregating. These simulations apparently show E(r_t²) approaching a limiting value depending on θ, but the authors note that E(r_t²) is necessarily not well estimated for large numbers of generations, because so many lines have fixed. Weir and Hill (1980) chose to approximate E(r_t²) by E(D_t²)/E[p_A(1 − p_A)p_B(1 − p_B)], since both these quantities can be obtained exactly from their methods. This approximation does not appear to have been tested in the small populations that are of interest, so we do not discuss their findings further here.

Sved (1971) studied the joint evolution of IBD at linked loci in a finite random mating population of size N and found an exact expression for E(r_t²). He showed that E(r_t²) = Q_t, where Q_t is defined as the probability, for two haplotypes IBD at locus A in generation t, that there have been no crossover events between loci A and B on either of the two pathways from the common ancestor at locus A. Sved shows that for the haploid model, it is possible to obtain an iterative equation for Q_t as a function of Q_{t−1}, thereby showing that for a
Figure 25.2 Var(D_t) as a function of time (generation t), across monogamous random mating populations: (a) N = 20 and (b) N = 50. Curves are shown for no recombination and for recombination fractions of 1%, 10%, and 50%.
population that starts in equilibrium (i.e., r_0 = 0),

E(r_t²) = [1/(1 + 4Nθ)]{1 − [(1 − θ)²(1 − 1/(2N))]^t}    (25.3)

for small values of θ. As t → ∞ in Equation (25.3),

E(r_t²) → 1/(1 + 4Nθ).    (25.4)
This equation gives an expression for E(r²) in a population that has been evolving long enough to be close to drift-recombination equilibrium. For small populations, this approximation may be valid only for very large t. This two-locus result is similar in spirit to Wright's demonstration of the conditions under which ρ = f at a single locus [see Equation (25.1)], in that it relates the square of the correlation, a measure based on IBS, to a quantity that is a probability of IBD.
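Equation (25.4) is easy to tabulate, and it can also be inverted to give a rough θ from an observed r² under an assumed effective size N, which is essentially the idea exploited by Chakravarti et al. (1984) below. The values here are illustrative only:

```python
# E(r^2) at drift-recombination equilibrium, Equation (25.4), and its inversion.
def sved_r2(N, theta):
    """Sved (1971): E(r^2) ~ 1/(1 + 4*N*theta) at equilibrium."""
    return 1.0 / (1.0 + 4 * N * theta)

def theta_from_r2(r2, N):
    """Invert (25.4): theta = (1/r^2 - 1)/(4N), for an assumed effective size N."""
    return (1.0 / r2 - 1.0) / (4.0 * N)

for N in (100, 1000, 10000):
    print(N, [round(sved_r2(N, th), 3) for th in (0.001, 0.01, 0.05)])

# e.g., an observed r^2 of 0.2 with an assumed N of 1000 suggests theta = 0.001
print(theta_from_r2(0.2, 1000))
```

The table makes the chapter's point concrete: only small populations maintain appreciable equilibrium r² at anything but very tight linkage.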
V. ESTIMATION OF θ FROM OBSERVED ASSOCIATIONS

The possibility of strong disequilibrium between tightly linked loci has prompted attempts to use the observed disequilibrium between two loci to infer the recombination fraction θ between them. Chakravarti et al. (1984) considered the use of haplotype data from African-American, Italian, Greek, and Indian populations to estimate the recombination rate in a region of the human β-globin gene cluster. From Sved's result in Equation (25.4), they wrote

1/r² − 1 = 4Nkd,

where k is the rate of recombination per kilobase (kb) and d is the physical distance between two loci, measured in kilobases (θ = kd). This inspired a regression of 1/r² − 1 on d for a selection of pairs of loci known distances apart, yielding an estimate of k in terms of 4N. By assuming a value of N that represents the average size of the population over its evolutionary history, an estimate of k is obtained. Weir and Hill (1986) point out some flaws in this approach. First, Equation (25.4) is for a population with two segregating sites whose haplotypes were initially in equilibrium. In the situation considered by Chakravarti et al. (1984), it seems more likely that the variable sites arose by mutation, and so
there was initial disequilibrium. Weir and Hill (1986) show that this is an important factor, which renders Equation (25.4) inappropriate in this context. Second, Chakravarti et al. (1984) ignore the fact that their estimates are based on a relatively small sample of individuals, and therefore there is substantial sampling error in the estimation of r². Thompson et al. (1988) considered this issue further and demonstrated that power to detect disequilibrium can be very low, particularly if the rare alleles are in repulsion phase. While the work of Chakravarti et al. is an important early attempt to glean useful information from observed disequilibria, it is limited by the assumptions of the model used and by the inherent variability of the process being sampled.

Other approaches to drawing inferences about θ from observed disequilibria have attempted to take into account the history of the population in which the disequilibria are observed. When a new variant first arises, it necessarily exists on one and only one chromosomal background. There is initially complete association between the new variant and the alleles at other loci on that ancestral haplotype. This initially strong association is eroded away over subsequent generations by recombination between the locus where the variant arose and neighboring loci. Edwards (1981) elaborated on this concept by describing the "half-life" of a haplotype. For a chromosome segment whose ends are separated by recombination fraction θ, the half-life is the time (in generations) until the probability that the segment remains intact is 50%. For example, a segment for which θ = 0.005 has a half-life of approximately 3500 years (140 generations), whereas a segment in which recombination happens much more frequently has a much shorter half-life (e.g., θ = 0.02 corresponds to a half-life of 900 years).
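Edwards's half-life figures can be reproduced directly. The conversion to years assumes roughly 25 years per generation, which is our assumption rather than a figure stated in the text:

```python
import math

def half_life_generations(theta):
    """Generations until P(segment intact) = (1 - theta)^t drops to 0.5."""
    return math.log(0.5) / math.log(1 - theta)

for theta in (0.005, 0.02):
    g = half_life_generations(theta)
    print(theta, round(g), round(g * 25))   # assuming ~25 years per generation
```

This reproduces the figures quoted above: about 138 generations (roughly 3500 years) for θ = 0.005, and about 34 generations (roughly 900 years) for θ = 0.02.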
This demonstrates the importance of the time depth of the variant of interest to the genetic scale over which one might hope to see a conserved ancestral haplotype. Segments of conserved ancestral haplotype are the cause of association between the variant and alleles at nearby loci. Since associations are observed by sampling multiple haplotypes, it is not recombination events on a single lineage back to the ancestral haplotype that are important, but rather, recombination events on all sampled lineages tracing back to that ancestral haplotype. Therefore the important parameter in determining the likely length of the conserved segment from a sample of haplotypes bearing the variant is the total number of meioses on the entire tree relating the sample to the original ancestral haplotype (Arnason et al., 1977).

Arnason et al. (1977) were interested in estimating an upper bound for the recombination fraction between the HLA-B locus and the Bf locus. Their data consisted of 100 apparently unrelated haplotypes, sampled from Iceland, northwest Newfoundland and Labrador, central England, and Norway. The haplotypes were chosen because they all bore allele B8 at HLA-B. All haplotypes also carried the S allele at locus Bf, providing evidence of strong
Figure 25.3 Evolutionary relationships of the populations sampled by Arnason et al. (1977): Britain, Newfoundland and Labrador, Ireland, Iceland, and Norway.
disequilibrium. To estimate an upper bound for θ, Arnason et al. assumed that the observed allelic association in these apparently diverse populations was due to a common ancestral haplotype, which existed in approximately 3000 B.C. They then reasoned that the probability of none of c haplotypes experiencing visible recombination over t generations is given by:

Pr(no visible recombinations) = (1 − θ/3)^{ct},    (25.5)

since the frequency of the S allele at Bf is approximately two-thirds. To apply this reasoning to their data set, they used the tree shown in Figure 25.3 to describe the ancestry of their sample of haplotypes. Using the demographic histories of the populations as a guide, they obtained a minimal number of ancestral haplotypes (c) in each branch, and the number of generations (t) over which they existed, to estimate Σct ≈ 2000 over all links in the tree. Only very small values of θ produce values of (1 − θ/3)^{Σct} large enough to be consistent with the observed data, indicating very tight linkage between HLA-B and Bf. This paper appears to be the first to use the ancestral relationships of the sampled haplotypes along with the observed associations to infer the recombination fraction between two loci. Thompson (1978) estimated more precisely the number of ancestors of the 100 sampled haplotypes at particular times in history, and obtained a result whose practical implications were very similar to those of Arnason et al. (1977).

More recently, efforts to infer θ from disequilibria have focused on rare recessive diseases, often in isolated populations, where all chromosomes carrying the disease variant are assumed to be descended from a single founder haplotype. We restrict our attention here to methods that explicitly model the ancestry of the disease chromosomes. Thompson and Neel (1997) modeled the ancestry of a sample of chromosomes carrying a rare monophyletic variant. Conditional on the expected total population of variant-bearing haplotypes at
all times in history, they employ a continuous time Moran model approximation to obtain the distribution of the coalescence times. The distribution is intractable, but they describe a simulation approach that allows the sampling of ancestries from the correct distribution. Hastbacka et al. (1992), who considered a sample of Finnish diastrophic dysplasia (DTD) chromosomes, estimated θ between the disease locus and a marker locus where association was observed. Using the same reasoning as Arnason et al. (1977), they observed that
π = (1 − θ)^t ≈ e^{−tθ},    (25.6)

where π = Pr(randomly sampled disease chromosome still carries the ancestral marker allele). Assuming that the ancestral marker allele is that which was most common in their sample of disease chromosomes, they then equated the observed proportion of disease chromosomes carrying that allele to e^{−tθ} and solved for θ using t = 100 (based on the population history). They obtained an estimate of 64 kb for the distance between the disease locus and the marker of interest, which proved to be amazingly accurate: a gene for DTD 70 kb from the marker in question was later cloned (Hastbacka et al., 1994). Despite the accuracy of the estimate in this example, questions remain about the usefulness of this approach. Equation (25.6) describes the probability over all possible evolutionary histories of the population from founding to the present, while the data available are necessarily from a single observation of the evolutionary process. The estimate obtained is therefore a moment estimator based on a single observation. The confidence bounds suggested by the authors may not be appropriate: Kaplan et al. (1995) showed by simulation that the upper bound was less than the true value of θ over 40% of the time. The approach of Hastbacka et al. is an advance over the work of Chakravarti et al. (1984) because it does not assume that the population is in drift-recombination equilibrium. However, neither does it explicitly model the history of the population. Kaplan et al. (1995) presented the first likelihood approach to inference of θ from disequilibria, using simulation to take into account the population history. Assuming a single disease mutation occurring on a particular marker background at time zero, the number of disease chromosomes on that background at generation t + 1 is modeled as a function of the number at generation t.
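Both the Arnason-style bound of Equation (25.5) and the moment estimator of Equation (25.6) reduce to one-line calculations. In the sketch below, the total meiosis count S, the cutoff α, the observed proportion, the time depth, and the cM-to-kb conversion are all hypothetical illustrations, not the published figures:

```python
import math

# Upper bound on theta in the spirit of Equation (25.5): find the theta at
# which seeing NO visible recombination in S = sum(c*t) meioses becomes
# improbable.  S and alpha are hypothetical.
S, alpha = 2000, 0.05
theta_upper = 3 * (1 - alpha ** (1 / S))    # solves (1 - theta/3)^S = alpha
print(f"{theta_upper:.5f}")

# Moment estimator from Equation (25.6): pi = (1-theta)^t ~ exp(-t*theta).
def theta_moment_estimate(pi_hat, t):
    return -math.log(pi_hat) / t

theta_hat = theta_moment_estimate(0.95, 100)   # hypothetical: 95% ancestral, t = 100
kb = theta_hat / 1e-5                          # assumes 1e-5 recombination/kb (1 cM/Mb)
print(f"{theta_hat:.2e}", round(kb))
```

Under these invented inputs, the bound and the estimate both fall in the sub-centimorgan range, which is the regime in which this style of inference is useful.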
The disease chromosome population can then be simulated up to a time representing the time of sampling, and the probability of the observed data is calculated by means of either a binomial (for a diallelic marker) or a multinomial (for a multiallelic marker) distribution. Realizations that produce numbers of disease chromosomes inconsistent with that observed in the sampled population are excluded from the calculation, effectively conditioning
on the observed disease allele frequency. Simulation at many values of θ gives a likelihood curve.

Rannala and Slatkin (1998) present an approach similar to that of Kaplan et al. (1995), but they use a coalescent to model the ancestry of the disease variant. Conditional on the time of the initial mutation, and the number of disease alleles sampled, Rannala and Slatkin obtain a realization from the joint distribution of coalescence times for the sample. For a sample of i disease chromosomes, there are i coalescence times, counting the time at which the mutation first occurred. The number of disease chromosomes carrying a particular marker allele immediately after coalescence event i + 1 depends on the number carrying that allele immediately after event i, the mutation rates, and the recombination fraction between the disease and the marker. Forward simulation of the disease haplotype population continues until the number of disease chromosomes carrying that marker allele immediately after the ith coalescent event is realized. The probability of the observed data at the time of sampling is then calculated conditional on this realization. Repeated simulation at different values of θ gives a likelihood curve. Like the method of Kaplan et al. (1995), this method can produce realizations of the disease haplotype population that are almost incompatible with the data and therefore make a very small contribution to the likelihood.

Graham and Thompson (1998) also consider disequilibrium likelihoods in the situation of a rare monophyletic disease mutation. They assume that the pattern of population growth is known (although it does not need to be constant), and that the time of the initial mutation is known.
The ancestry relating the sampled chromosomes is realized by means of a two-stage process similar to that of Thompson and Neel (1997): (1) the size of the ancestral population is realized for each generation, from the present back to the time of the initial mutation, and (2) the coalescent relating the sampled haplotypes is realized, conditional on the ancestral population sizes. Once the coalescent has been realized, recombination events between the disease locus and the marker locus are placed on the tree, with probability depending on the branch length and the recombination fraction θ. These recombination events define recombinant classes, where a recombinant class is defined as the subset of the current sample that is descended from a given recombination event. Graham and Thompson describe an analytical expression for the probability of the observed data conditional on the recombinant classes. The simulation of recombinant classes rather than allelic classes eliminates the problem of obtaining realizations that are incompatible with the data, and as a result, the method of Graham and Thompson is likely to be more computationally efficient.

All the likelihood methods discussed earlier were developed in the context of a rare monophyletic disease mutation. As our attention is turning to
Chapman and Thompson
more common complex diseases, it is informative to consider the effect of heterogeneity on the strength of allelic associations. If heterogeneity is in the form of multiple disease mutations at a single locus, the picture is encouraging. Disease haplotypes carrying different disease mutations at the same locus will each exhibit association with the marker allele ancestral to that disease mutation. As long as the marker locus is reasonably polymorphic, these associations will still be apparent, as in the cases of cystic fibrosis (Tsui, 1995) and Werner's syndrome (Matsumoto et al., 1997), where the mutation of highest frequency makes up 70 and 51% (respectively) of the disease haplotype population. Multiple unlinked disease loci (rather than multiple disease mutations at a single locus) are more problematic. In such cases, the association between a marker allele and a disease allele at a closely linked disease locus will be swamped by the absence of an association between that marker allele and the other disease loci, to which the marker is not linked. The difficulty introduced by heterogeneity has led a number of authors (e.g., Chapman and Wijsman, 1998; Kruglyak, 1999b) to suggest the use of isolated founder populations for disequilibrium mapping, in the hope that because of small numbers of founders, or population bottlenecks, diseases observed in such populations will be homogeneous. Kruglyak (1999b) showed that for the disease to be homogeneous, either the population must be descended from a very small number of founders or the disease variant must be quite rare. To use founder populations effectively for gene mapping, it is important that we first understand the factors other than linkage that can influence disequilibrium in these populations.
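The dilution of association by locus heterogeneity can be illustrated with a toy calculation (our sketch, not from the chapter; all frequencies are hypothetical). If a marker allele M rides on the founder haplotype of one disease mutation, only the fraction of disease chromosomes descended from that mutation shows an elevated M frequency, so the excess of M among disease chromosomes shrinks linearly with that fraction:

```python
# Toy illustration of how heterogeneity dilutes allelic association.
# All numbers are hypothetical, chosen only to show the linear dilution.

P_BACKGROUND = 0.10  # frequency of marker allele M on normal chromosomes
P_FOUNDER = 0.90     # frequency of M on disease chromosomes descended
                     # from the one mutation linked to the marker

def excess_m_frequency(alpha):
    """Excess frequency of M on disease chromosomes when a fraction
    `alpha` of them descend from the marker-linked mutation; the rest
    (other mutations, or unlinked loci) carry M at background frequency."""
    p_disease = alpha * P_FOUNDER + (1.0 - alpha) * P_BACKGROUND
    return p_disease - P_BACKGROUND

for alpha in (1.0, 0.7, 0.51, 0.2):
    print(f"fraction linked = {alpha:4.2f} -> excess of M = {excess_m_frequency(alpha):.3f}")
```

With these illustrative numbers the excess is 0.8·alpha, so at the 70% and 51% major-mutation fractions cited for cystic fibrosis and Werner's syndrome the association remains strong, whereas a mutation accounting for only 20% of disease chromosomes leaves a much weaker signal.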
The problem of the effect of population size on observed disequilibrium can be compared to the effect of degree of relationship on observed IBD sharing, which is important in affected relative pair methods. Closely related pairs are expected to share large portions of their genome IBD, and therefore areas of sharing are easy to locate but do not localize genes well. Conversely, distantly related relative pairs are expected to share very little of their genome IBD, and as a result, the identification of shared regions (which is difficult) can localize a disease gene quite precisely (Thompson, 1997). Analogously, disequilibrium is expected to stretch over larger distances in small populations, and over smaller distances in large populations. Large regions detectable in small populations may be useful for initial mapping, whereas the small, more difficult-to-find regions of disequilibrium in larger populations are useful for fine-scale localization. Graham (1998) has considered the effect of population growth rate on the shape of the coalescent relating disease haplotypes. Visualization of the
25. Linkage Disequilibrium Mapping
coalescent relating disease chromosomes is helpful in thinking about the effects of growth rate on disequilibrium, since it is recombination events on this coalescent that reduce allelic association over time. For illustration, we consider two extreme examples of coalescent shape. Classical coalescent theory (Felsenstein, 1971; Kingman, 1982) was developed for a population of constant size, where the shape of the coalescent is similar to the one shown in Figure 25.4a. In this case, sampling additional haplotypes simply adds more tips to the tree. Most of the meioses in the tree are in the earliest branches, which means that most recombinations can be expected to occur in these branches. As a result, sampled disease haplotypes are likely to share recombination events, which means that allelic associations should be well preserved. Figure 25.4b shows an example of a so-called star phylogeny, which might exist when the population of disease chromosomes is growing extremely rapidly (Slatkin, 1996). In this case, coalescent events between haplotypes occurred long ago, and as a result, sampled haplotypes appear to be independent. Recombination events are not shared by many haplotypes, so allelic associations are greatly reduced. Graham (1998) showed that for a growing population, the coalescent is often similar in shape to that in Figure 25.4a, with most meioses being in the early branches of the tree. Even in a fast-growing population (10% per generation), coalescent events are fairly evenly spaced, and therefore a substantial fraction of recombinations will happen early enough in time to be shared by many disease haplotypes. Kruglyak (1999b) used a simulation approach to consider the effects of two different population growth scenarios on observed disequilibrium.
For the case of a population growing at a constant rate since founding, smaller growth rates result in higher levels of disequilibrium after a given number of generations, as the foregoing coalescent argument would suggest. The second scenario considered a population founded 20 generations ago by 100 individuals, with a current size of 10,000 individuals. The population was assumed to have a period of constant size, either early or late in its history. Disequilibrium in the current population is larger when the period of constant size happened earlier in the population's history. This is because genetic drift during the period of constant size effectively reduces the size of the founding population, thereby increasing disequilibrium. Unfortunately, the effect of population structure is much less well understood. In large populations, subdivision, isolation by distance, and admixture can all give rise to associations that could be misinterpreted as reflecting shared ancestry. Beerli (1999) has studied the effect of migration on the shape of the coalescent, and this work may give insight into the effect of migration on disequilibrium. In smaller populations, and within subgroups of larger ones, patterns of nonrandom mating, together with genetic drift, can also affect
Figure 25.4 Examples of typical coalescent shapes for different growth rates: (a) constant size and (b) fast growth.
disequilibrium. To interpret the patterns of disequilibrium we can observe in the wide variety of human populations available for study, we must quantify the effects of each of these aspects of population structure. Then we must learn how to differentiate these associations from those due to common ancestry and linkage. In the next section, we describe some of the populations that have been suggested for disequilibrium mapping, with attention to known aspects of their size, history, and structure.
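The coalescent-shape argument above can be made concrete with a small deterministic calculation (our illustration, using standard coalescent theory rather than anything computed in the chapter). Under the constant-size coalescent, the expected time during which exactly i ancestral lineages exist is 2/(i(i-1)) in units of N generations, so the deepest interval, when only two lineages remain, accounts for more than half of the expected time to the most recent common ancestor, whatever the sample size:

```python
# Expected coalescent interval lengths for a constant-size population
# (Kingman's coalescent; time measured in units of N generations).

def expected_intervals(n):
    """E[T_i] = 2 / (i(i-1)) for i = n lineages down to 2."""
    return {i: 2.0 / (i * (i - 1)) for i in range(n, 1, -1)}

def deepest_fraction(n):
    """Fraction of the expected tree depth spent with only 2 lineages."""
    times = expected_intervals(n)
    return times[2] / sum(times.values())

for n in (5, 20, 100):
    print(f"sample size {n:3d}: deepest interval = {deepest_fraction(n):.2%} of expected depth")
```

The expected depth sums to 2(1 - 1/n), of which the two-lineage interval contributes 1; this is why, in a tree like Figure 25.4a, the few deep branches ancestral to many sampled haplotypes carry such a large share of the recombination opportunities.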
VII. STRUCTURE IN HUMAN POPULATIONS: SIZES AND AGES

Table 25.1 shows examples of the wide variety of ages, sizes, and structures observed in human populations. Of the large isolates, the Amerindians are by far the oldest. About 500 of the earliest Americans are thought to have crossed the Bering land bridge, during a period about 30,000 years ago. The population grew quite slowly until about 8000 years ago, when agricultural practices were adopted, and the population grew rapidly to approximately 20 million by the time of Columbus (Fiedel, 1987; Denevan, 1992). This population must have been highly structured, both because of its enormous range (North and South America), which results in isolation by distance, and because of the tribal structure within smaller geographic areas. Because of the time depth of this population, disequilibrium between loci with respect to the initial founder population is unlikely to be detectable. Because of the recent rapid expansion of the population, however, most rare variants will have occurred recently (Thompson and Neel, 1997) and will be localized to a particular tribe or linguistic group. Disequilibrium around such a variant should be detectable in modern Amerinds. The modern Japanese population was founded approximately 94 generations ago, by approximately 1000 rice-growing immigrants who emigrated from the mainland (Benedict, 1989). Relatively little is known about the growth of the population until the period 1603-1867, when the population size remained remarkably constant, at about 30 million individuals. Political change in 1867 resulted in rapid growth, and the population of Japan today is about 120 million. This population is probably much less structured than the Amerindians,

Table 25.1. The Scope of Human Population Structure
Population            Time since founding (years)   Number of founders   Modern size      Structure
Large
  Amerindians                30,000                        500           20 million(a)    Tribal
  Japanese                    2,400                      1,000           120 million      Homogeneous
  Finns                       2,000                      1,000           5 million        Homogeneous
Small
  Icelanders                  1,200                     20,000           250,000          Homogeneous
  Ch-SLSJ(b)                    350                      8,000           300,000          Unknown
  Hutterites                    350                         80           36,000           Leut and colonies
Tiny
  Newfoundland                  180                        400           1,600            3 villages
  Tristan da Cunha              180                         19           300              Homogeneous

(a) At the time of Columbus.
(b) Charlevoix-Saguenay Lac Saint Jean.
although some historical subdivision due to geography seems likely. The time depth of the population (2400 years) proved ideal for the fine-scale localization of the Werner's syndrome gene, which apparently dates to the founding of the population (Graham and Thompson, 1998). Another large population in which disequilibrium mapping of a rare allele has been successful is that of Finland. Finland was founded by about 1000 settlers, approximately 2000 years ago. The population has remained quite isolated since then, and numbers about 5 million (Nevanlinna, 1972). Like the Japanese, the Finnish population is appropriate for fine-scale localization because of its time depth, as Hästbäcka et al. (1994) demonstrated when they cloned a gene for diastrophic dysplasia in Finns. Kruglyak (1999b) studied the properties of such a population by simulation and showed that even with this relatively small number of founders, a disease variant would have to be quite rare (<1%) to be monophyletic in such a population. The potential of smaller isolated populations has been less well explored. Iceland is a particularly interesting example. Iceland was first settled circa 900 by about 20,000 people who came primarily from Norway, but also from Ireland and Scotland (Bjarnason et al., 1973). The population grew very rapidly, both by births and by continued immigration, to about 70,000 by the eleventh century, and remained of roughly that size until the early 1900s, whereupon it grew dramatically to its current size of 250,000 people. While relatively isolated from other populations, there was mobility within the country, and as a result the population is probably quite homogeneous. There was likely considerable disequilibrium within the founding population because it was a Norse-Celtic mixture. Because of the shorter time depth (relative to Finns and Japanese), disequilibrium may be detectable over longer genetic distances.
However, the unusually large size of the founding population makes it likely that even rare alleles would have been represented in multiple copies. Of similar size, but much younger, is the population of the Charlevoix-Saguenay Lac Saint Jean (Ch-SLSJ) region of Quebec. Many of the residents of this area are the descendants of about 8000 French colonists who immigrated to the Charlevoix region during 1608-1760, while the area was under French control (Heyer, 1995). The population expanded into the Saguenay region in the mid-1800s and continued to grow, both internally and through considerable immigration, to its size of about 300,000 today (Heyer and Tremblay, 1995). Although disequilibrium may persist over moderate distances in such a young population, the relatively high level of immigration could result in heterogeneity for all but the rarest diseases. The Hutterites (Morgan and Holmes, 1982) are an isolate of similar age but much smaller (36,000 people). Descended from a remarkably small founder group (80 people), the Hutterites are a religious group who originated in Europe in the 1500s and immigrated to North America circa 1880. Upon
arrival in North America, they split into three almost completely separate subdivisions (known as leut). Within leut, the Hutterite population is divided into colonies, and when growth necessitates division, a colony will split into two colonies. Immigration into Hutterite colonies is almost nonexistent. Disequilibrium may extend over large genetic distances in the Hutterites, since they are a very young population and are descended from so few founders. In addition, the population is unusually rich in structure, and the structure is very well documented. Study of the Hutterite population may therefore yield important insight into the effects of both large- and small-scale structure on disequilibrium. Table 25.1 also shows two extremely small and young populations. The Newfoundland population includes the inhabitants of what are now three small villages on the west coast of the island. Many of the inhabitants are descendants of a single founder couple and their children and grandchildren, who founded the population in 1820 (Marshall et al., 1979). Others married into the population, and the current population traces back to about 400 founders. In 1975 the population numbered 1627 people, with 1521 descended from the original founding couple. The complete genealogy at that time contained just over 4000 people, 2600 of whom were descendants of the original founding couple. Because of the youth of the population, disequilibrium around a variant originating in the founder couple is likely to stretch over relatively long distances. The other extremely small population is that of Tristan da Cunha, a very remote island in the South Atlantic. The population, numbering about 275 in 1961, is descended from a total of 19 ancestors, 10 of whom were on the island prior to 1855 (Roberts, 1971). These founders are of extremely diverse origin: some were immigrants from the United States or Britain, and others were survivors of shipwrecks in the region.
The population has exhibited steady but slow growth, with two severe bottlenecks: the population was reduced from 103 to 33 in 1856, and from about 100 in 1880 to about 60 in 1892. The population has grown steadily since then, but emigration has kept the population small. A population of this size is necessarily quite homogeneous, and therefore not likely to tell us much about the effects of structure on disequilibrium. However, the youth of the population, the diversity of its founders, and the two bottlenecks in its history suggest that considerable disequilibrium likely exists in this population. Lonjou et al. (1999) compared observed disequilibria in two regions of the genome for a wide variety of populations, ranging from outbred large geographic areas such as Europe, the Near East, and the Americas, to small isolates, such as Basques, the Ainu of Japan, and the population on Tristan da Cunha. The Jewish populations of Europe, Africa, and the Middle East comprise another ethnic group considered by Lonjou et al. (1999) that shows a variety of histories and structures. Disequilibrium mapping in these populations has
been considered by Risch et al. (1995) and Levy et al. (1996). Lonjou et al. found that, generally, levels of disequilibrium in isolated populations were only slightly higher than in outbred populations. The loci they considered were diallelic and had moderate allele frequencies, and so their results suggest that isolates may not be as useful for complex disease mapping as had been hoped. Since, however, all locus pairs considered were extremely tightly linked (<0.2 cM apart), the similarity between observed disequilibria in small isolates and large outbred populations may simply demonstrate the very large time scale required to break down such associations. Kruglyak (1999a), in a discussion of the results of Lonjou et al. (1999), notes that these authors' conclusions are based on data for only two regions of the genome and therefore are subject to the unknown evolutionary histories of the two regions. He also notes that disequilibrium in the Ainu was significantly higher than in the large populations, suggesting that disequilibrium mapping might be feasible in this isolate and in others not considered in the Lonjou study. Each isolate has its own unique evolutionary history, and therefore some may prove more useful than others for disequilibrium mapping of complex disease. Most notably, Kruglyak (1999a) points out the need for systematic empirical study of disequilibrium across the entire human genome, both within isolates and within larger outbred populations, to identify populations useful for disease mapping. We believe that empirical studies of disequilibrium in human populations should be complemented by theoretical research undertaken with the goal of understanding the effects of population history and structure on disequilibrium.
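The time-scale point can be checked against the classical decay formula for disequilibrium, D_t = D_0(1 - θ)^t (this sketch is ours; the recombination fractions are illustrative). At the tight marker spacings considered by Lonjou et al. (1999), roughly θ < 0.002, disequilibrium decays very slowly:

```python
import math

def half_life(theta):
    """Generations needed for D_t = D_0 * (1 - theta)**t to fall to D_0 / 2."""
    return math.log(0.5) / math.log(1.0 - theta)

def remaining(theta, t):
    """Fraction of the initial disequilibrium left after t generations."""
    return (1.0 - theta) ** t

# ~0.2 cM (theta ~ 0.002): association persists for hundreds of generations.
print(f"theta = 0.002: half-life ~ {half_life(0.002):.0f} generations")
# ~1 cM (theta = 0.01): decay is several times faster.
print(f"theta = 0.01 : half-life ~ {half_life(0.01):.0f} generations")
print(f"left after 100 generations at theta = 0.002: {remaining(0.002, 100):.2f}")
```

At θ = 0.002 roughly four-fifths of the initial disequilibrium survives 100 generations, which is consistent with small isolates and large outbred populations showing similar disequilibria at such tight spacings.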
Acknowledgments This work was supported in part by funding from the Burroughs Wellcome Fund for the Program in Mathematics and Molecular Biology and by National Institutes of Health grant GM-46255.
References

Arnason, A., Larsen, B., Marshall, W. H., Edwards, J. H., MacKintosh, P., Olaisen, B., and Teisberg, P. (1977). Very close linkage between HLA-B and Bf inferred from allelic association. Nature 268, 527-528.
Beerli, P. (1999). Estimation of migration rates and population sizes in geographically structured populations. In "Advances in Molecular Ecology" (G. R. Carvalho, ed.), pp. 39-53. NATO Science Series A: Life Sciences. IOS Press, Amsterdam.
Benedict, R. (1989). "The Chrysanthemum and the Sword." Houghton Mifflin, Boston.
Bjarnason, O., Bjarnason, V., Edwards, J. H., Fredriksson, S., Magnusson, M., Mourant, A. E., and Tills, D. (1973). The blood groups of the Icelanders. Ann. Hum. Genet. 36, 425-455.
Boehnke, M. (1994). Limits of resolution of genetic linkage studies: Implications for the positional cloning of human disease genes. Am. J. Hum. Genet. 55, 379-390.
Brown, P. O., and Hartwell, L. (1998). Genomics and human disease: Variations on variation. Nat. Genet. 18, 91-93.
Chakravarti, A., Buetow, K., Antonarakis, S., Waber, P., Boehm, C., and Kazazian, H. (1984). Nonuniform recombination within the human β-globin gene cluster. Am. J. Hum. Genet. 36, 1239-1258.
Chapman, N. H., and Wijsman, E. M. (1998). Genome screens using linkage disequilibrium tests: Optimal marker characteristics and feasibility. Am. J. Hum. Genet. 63, 1872-1885.
Denevan, W. M. (1992). "The Native Population of the Americas in 1492." University of Wisconsin Press, Madison.
Edwards, A. W. F. (1971). Estimation of the inbreeding coefficient from ABO blood-group phenotype frequencies. Am. J. Hum. Genet. 23, 97-98.
Edwards, J. H. (1981). Allelic association in man. In "Population Structure and Genetic Disorders, Proceedings of the Seventh Sigrid Juselius Foundation Symposia" (A. W. Eriksson, ed.), pp. 239-256. Academic Press, New York.
Felsenstein, J. (1971). The rate of loss of multiple alleles in finite haploid populations. Theor. Popul. Biol. 2, 391-403.
Fiedel, S. J. (1987). "Prehistory of the Americas." Cambridge University Press, New York.
Goddard, K. A. B., Yu, C. E., Oshima, J., Miki, T., Nakura, J., Piussan, C., Martin, G. M., Schellenberg, G. D., Wijsman, E. M., and members of the International Werner's Syndrome Collaborative Group (1996). Toward localization of the Werner syndrome gene by linkage disequilibrium and ancestral haplotyping: Lessons learned from analysis of 35 chromosome 8p11.1-21.1 markers. Am. J. Hum. Genet. 58, 1286-1302.
Graham, J. (1998). Disequilibrium fine-mapping of a rare allele via coalescent models of gene ancestry. Ph.D. thesis, University of Washington, Seattle.
Graham, J., and Thompson, E. A. (1998). Disequilibrium likelihoods for fine-scale mapping of a rare allele. Am. J. Hum. Genet. 63, 1517-1530.
Hardy, G. H. (1908). Mendelian proportions in a mixed population. Science 28, 49-50.
Hästbäcka, J., de la Chapelle, A., Kaitila, I., Sistonen, P., Weaver, A., and Lander, E. (1992). Linkage disequilibrium mapping in isolated founder populations: Diastrophic dysplasia in Finland. Nat. Genet. 2, 204-211.
Hästbäcka, J., de la Chapelle, A., Mahtani, M., Clines, G., Reeve-Daly, M. P., Daly, M., Hamilton, B. A., Kusumi, K., Trivedi, B., Weaver, A., Coloma, A., Lovett, M., Buckler, A., Kaitila, I., and Lander, E. S. (1994). The diastrophic dysplasia gene encodes a novel sulfate transporter: Positional cloning by fine-structure linkage disequilibrium mapping. Cell 78, 1073-1087.
Heyer, E. (1995). Mitochondrial and nuclear genetic contribution of female founders to a contemporary population in North East Quebec. Am. J. Hum. Genet. 56, 1450-1455.
Heyer, E., and Tremblay, M. (1995). Variability of the genetic contribution of Quebec population founders associated with some deleterious genes. Am. J. Hum. Genet. 56, 970-978.
Hill, W. G., and Robertson, A. (1966). The effect of linkage on limits to artificial selection. Genet. Res. 8, 269-294.
Hill, W. G., and Robertson, A. (1968). Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38, 226-231.
Kaplan, N. L., Hill, W. G., and Weir, B. S. (1995). Likelihood methods for locating disease genes in nonequilibrium populations. Am. J. Hum. Genet. 56, 18-32.
Karlin, S., and McGregor, J. (1968). Rates and probabilities of fixation for two locus random mating finite populations without selection. Genetics 58, 141-159.
Kingman, J. F. C. (1982). The coalescent. Stochastic Process. Appl. 13, 235-248.
Kruglyak, L. (1999a). Genetic isolates: Separate but equal? Proc. Natl. Acad. Sci. USA 96, 1170-1172.
Kruglyak, L. (1999b). Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22, 139-144.
Levy, E. N., Shen, Y., Kupelian, A., Kruglyak, L., Aksentijevich, I., Pras, E., Balow, J. E., Jr., Linzer, B., Chen, X., Shelton, D. A., Gumucio, D., Pras, M., Shohat, M., Rotter, J. I., Fischel-Ghodsian, N., Richards, R. I., and Kastner, D. L. (1996). Linkage disequilibrium mapping places the gene causing familial Mediterranean fever close to D16S246. Am. J. Hum. Genet. 58, 523-534.
Lonjou, C., Collins, A., and Morton, N. E. (1999). Allelic association between marker loci. Proc. Natl. Acad. Sci. USA 96, 1621-1626.
Malecot, G. (1948). "Les Mathématiques de l'Hérédité." Masson, Paris.
Malecot, G. (1969). "The Mathematics of Heredity." Freeman, San Francisco.
Marshall, W. H., Buehler, S. K., Crumley, J., Salmon, D., Landre, M.-F., and Fraser, G. R. (1979). A familial aggregate of Hodgkin's disease, common variable immunodeficiency, and other malignancy cases in Newfoundland. I. Clinical features. Clin. Invest. Med. 2, 153-159.
Matsumoto, T., Imamura, O., Yamabe, Y., Kuromitsu, J., Tokutake, Y., Shimamoto, A., Suzuki, N., et al. (1997). Mutation and haplotype analyses of Werner's syndrome gene based on its genomic structure: Genetic epidemiology in the Japanese population. Hum. Genet. 100, 123-130.
Morgan, K., and Holmes, T. M. (1982). Population structure of a religious isolate: The Dariusleut Hutterites of Alberta. In "Current Developments in Anthropological Genetics," Vol. 2, pp. 429-448. Plenum, New York.
Morton, N. E., Yee, S., Harris, D. E., and Lew, R. (1971). Bioassay of kinship. Theor. Popul. Biol. 2, 507-524.
Nevanlinna, H. R. (1972). The Finnish population structure: A genetic and genealogical study. Hereditas 71, 195-236.
Puffenberger, E. G., Kauffman, E. R., Bolk, S., Matise, T. C., Washington, S. S., Angrist, M., Weissenbach, J., Garver, K. L., Mascari, M., Ladda, R., et al. (1994). Identity-by-descent and association mapping of a recessive gene for Hirschsprung disease on human chromosome 13q22. Hum. Mol. Genet. 3, 1217-1225.
Rannala, B., and Slatkin, M. (1998). Likelihood analysis of disequilibrium mapping, and related problems. Am. J. Hum. Genet. 62, 459-473.
Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273, 1516-1517.
Risch, N., de Leon, D., Ozelius, L., Kramer, P., Almasy, L., Singer, B., Fahn, S., Breakefield, X., and Bressman, S. (1995). Genetic analysis of idiopathic torsion dystonia in Ashkenazi Jews and their recent descent from a small founder population. Nat. Genet. 9, 152-159.
Robbins, R. B. (1918). Applications of mathematics to inbreeding problems. II. Genetics 3, 73-92.
Roberts, D. F. (1971). The demography of Tristan da Cunha. Popul. Stud. 25, 465-469.
Slatkin, M. (1996). Gene genealogies within mutant allelic classes. Genetics 143, 579-587.
Sved, J. A. (1971). Linkage disequilibrium and homozygosity of chromosome segments in finite populations. Theor. Popul. Biol. 2, 125-141.
Thompson, E. A. (1976). Population correlation and population kinship. Theor. Popul. Biol. 10, 205-226.
Thompson, E. A. (1978). The number of ancestral genes contributing to a sample of B8 alleles. Nature 272, 288.
Thompson, E. A. (1997). Conditional gene identity in affected individuals. In "Genetic Mapping of Disease Genes," pp. 137-146. Academic Press, London.
Thompson, E. A., and Neel, J. V. (1997). Allelic disequilibrium and allele frequency distribution as a function of social and demographic history. Am. J. Hum. Genet. 60, 197-204.
Thompson, E. A., Deeb, S., Walker, D., and Motulsky, A. G. (1988). The detection of linkage disequilibrium between closely linked markers: RFLPs at the AI-CIII apolipoprotein genes. Am. J. Hum. Genet. 42, 113-124.
Tsui, L. C. (1995). The cystic fibrosis transmembrane conductance regulator gene. Am. J. Respir. Crit. Care Med. 151, S47-S53.
Wahlund, S. (1928). Zusammensetzung von Populationen und Korrelationserscheinungen vom Standpunkt der Vererbungslehre aus betrachtet. Hereditas 11, 65-106.
Watterson, G. A. (1970). The effect of linkage in a finite random-mating population. Theor. Popul. Biol. 1, 72-87.
Weinberg, W. (1908). Über den Nachweis der Vererbung beim Menschen. Jahresh. Verein Vaterl. Naturk. Württemberg 64, 368-382.
Weir, B. S., and Hill, W. G. (1980). Effect of mating structure on variation in linkage disequilibrium. Genetics 95, 477-488.
Weir, B. S., and Hill, W. G. (1986). Nonuniform recombination within the human β-globin gene cluster. Am. J. Hum. Genet. 38, 776-778.
Weir, B. S., Avery, P. J., and Hill, W. G. (1980). Effect of mating structure on variation in inbreeding. Theor. Popul. Biol. 18, 396-429.
Wright, S. (1922). Coefficients of inbreeding and relationship. Am. Nat. 56, 330-338.
Chi Gu
Division of Biostatistics
Washington University School of Medicine
St. Louis, Missouri 63110
D. C. Rao
Division of Biostatistics
Departments of Psychiatry and Genetics
Washington University School of Medicine
St. Louis, Missouri 63110
I. Summary
II. Introduction
III. Phenotyping and Genotyping Issues
IV. Sampling Issues
V. Sample Size and Power
VI. Cost-Benefit Issues
VII. Discussion
References
I. SUMMARY Because simplistic designs will lead to prohibitively large sample sizes, the optimization of genetic study designs is critical for successfully mapping genes for complex diseases. Creative designs are necessary for detecting and amplifying the usually weak signals for complex traits. Two important outcomes of a study
¹To whom correspondence should be addressed.

Advances in Genetics, Vol. 42
Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.
0065-2660/01 $35.00
Gu and Rao
design (power and resolution) are implicitly tied together by the principle of uncertainty. Overemphasis on either one may lead to suboptimal designs. To achieve optimality for a particular study, therefore, practical measures such as cost-effectiveness must be used to strike a balance between power and resolution. In this light, the myriad of factors involved in study design can be checked for their effects on the ultimate outcomes, and the popular existing designs can be sorted into building blocks that may be useful for particular situations. It is hoped that imaginative construction of novel designs using such building blocks will lead to enhanced efficiency in finding genes for complex human traits.
II. INTRODUCTION It is clear that genetic dissection of complex traits requires optimum study designs. Many factors that influence the optimality of a study may be out of the investigator’s control, such as manipulation of human mating types. However, several aspects of a genetic study are still manipulable at least to some extent, and the success of a study may depend heavily on how these factors are optimized in the study design. We can classify these factors into three categories: those related to phenotyping and genotyping (e.g., refinement of phenotypes, quality and density of markers, etc.), those related to sampling (sampling units, sampling scheme, splitting rule, admixture, etc.), and those related to analytic procedures (linkage, association, genomic scanning, candidate genes, etc.). These factors determine the information content of a study design, which in turn regulates the maximum power a study can achieve and the finest precision the mapping can reach. Tweaking existing designs in current use may improve either the power of analysis or the precision of mapping results. For example, weighting schemes can be applied to an existing sibpair method to achieve a small gain in power, or the nominal significance level can be lowered to further ward off false positives. But the extent of such improvement is limited, and improvement in one outcome (e.g., power) will most likely compromise the other one (e.g., precision) because the total information content of the study is not substantially improved. To successfully map complex diseases, we require improvements of current designs so that the information content carried in a study can be substantially enhanced relevant to the genetic loci of interest. Such improvements have begun to emerge as the need for novel designs has become clearer. 
These include the extreme sibpair design for mapping quantitative trait loci (QTLs), genome-wide association scans, hybrid analysis combining linkage and linkage disequilibrium effects, and the pooling of several comparable studies using meta-analytical methods (e.g., see Risch and Zhang, 1995; Gu et al., 1996; Risch and Merikangas, 1996; Fulker et al., 1999; Gu et al., 1998).

26. Optimum Study Designs

This chapter includes a brief overview of available designs and a discussion of their advantages and limitations pertaining to the mapping of complex traits. By carefully examining the issues involved, we will see what is missing in the current designs as well as what kinds of improvements are needed and achievable for the successful mapping of complex diseases.
III. PHENOTYPING AND GENOTYPING ISSUES

The general approach to gene finding in genetic epidemiology is one of searching for correlation between the degrees of phenotypic and genotypic similarity among relatives. Clearly defined phenotypes and high-quality genotyping results are crucial in designing optimal studies for complex diseases.
A. Phenotypes, covariates, and context

The current paradigm of gene mapping is a procedure that leads us from a disease phenotype to the genomic location of susceptibility genes, with marker genotypes used as intermediate surrogates. Therefore, the posterior probability that an individual with a certain phenotype has a particular disease genotype implicitly determines the power of a study. Ideally, there would exist a one-to-one correspondence between all the phenotypic outcomes and possible genotypes, so that all such conditional probabilities take the value of "1" and the power of the study is optimized. However, in reality, the desired correspondence is eroded by two types of variability: one is the measurement error induced by instrumental or experimental protocols, and the other is the choice of a specific disease phenotype to be mapped. To minimize measurement errors, one may require that the phenotype be reasonably highly reproducible. The average of multiple measurements would represent a good phenotype for quantitative traits. For phenotypes that are not highly reproducible, smaller family units may be used for study (Rao, 1998). Definition of the phenotype has more profound effects on the power of a study, since complex traits are, by nature, influenced by many genetic and nongenetic factors and their interactions. It may be necessary to define genetically more homogeneous subtypes. Moreover, analysis may be performed within particular contexts (age, sex, etc.) that are believed to be important to the development of the disease (Turner et al., 1999). Critical covariates such as age at onset can also be incorporated in the modeling to further partition phenotypic variation or for defining subtypes of the phenotype. For example, consideration of early age at
442
Gu and Rao
onset and the higher prevalence of bilateral breast cancer has helped researchers identify a refined subtype of breast cancer, the inherited form, which led to the cloning of BRCA1 (Hall et al., 1990).
B. Genotyping issues
Technology has made it possible for a reasonably equipped lab to perform genome-wide screening with high-quality STRP markers (short tandem repeat polymorphisms, or microsatellites) at a density of one marker per 5-10 cM (see Chapter 7 by Weber and Broman). However, even with constant technological improvement in the automation of large-scale genotyping, errors in genotyping and allele calling do exist, and they become less conspicuous to analysts as automation intensifies. This fact, combined with possibly erroneous pedigree information, can cause a significant loss of power. For example, a 10% genotypic (not allelic) error rate requires an extra 60 extremely discordant (ED) sibpairs to yield the same 80% power as would a sample of 190 ED sibpairs with a zero genotypic error rate when the marker involves 8 alleles (Rao, 1998). Since ED sibpairs are scarce, the extra 60 ED sibpairs require a substantial increase in the cost of phenotypic screening. This shows the importance of data management and quality control. It is critical to check for errors in genotypic data by detecting Mendelian inconsistencies or unusual recombinations (Boehnke and Cox, 1996; Goring and Ott, 1997). Association mapping can be optimized by using very high density maps, and single-nucleotide polymorphisms (SNPs) seem able to provide such a map. The lower mutation rate of SNPs makes them good candidates for linkage disequilibrium mapping, and the expected lower costs of genotyping in the future make them attractive for genome-wide scanning. However, the informativeness of SNP markers differs across populations, and optimization of the genotyping process will be costly for large numbers of SNPs (see again Chapter 7). In addition, several statistical obstacles (e.g., the number of tests required) leave it uncertain when an efficient SNP map for genome-wide association scanning may become available.
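The Mendelian-consistency screen mentioned above can be sketched in a few lines; the following is an illustrative check (not from the chapter), treating each genotype as an unordered pair of marker alleles:

```python
# Hypothetical sketch of a Mendelian-consistency check for a parent-child
# trio at one marker: a child's genotype is consistent if one allele can
# have come from each parent.

def mendelian_consistent(child, mother, father):
    """Each genotype is an unordered pair of alleles, e.g. (1, 3)."""
    a, b = child
    return ((a in mother and b in father) or
            (b in mother and a in father))

# One consistent trio and one Mendelian inconsistency (alleles as integers):
print(mendelian_consistent((1, 3), (1, 2), (3, 4)))  # True
print(mendelian_consistent((2, 2), (1, 3), (3, 4)))  # False
```

In practice, such checks (together with screens for unlikely double recombinants) are run on the full marker data before any linkage analysis.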
IV. SAMPLING ISSUES
One goal of a study design is to maximize the degree of phenotype/genotype correlation among relatives via the selection of families (or other sampling units) from which phenotypic and genotypic data are collected. In mapping genes for monogenic diseases, analysis of large extended pedigrees has the advantage of unequivocal specification of the relationships among all members, and one can trace the
26. Optimum Study Designs
443
transmission of disease-bearing chromosomes within a pedigree. In general, the larger the pedigrees, the more powerful the genetic study (Wijsman and Amos, 1997). On the other hand, model-free methods using relative pairs become more valuable in the mapping of complex diseases, because they do not require trait model specification (and hence are robust to model misspecification).
A. Extended pedigrees
Large multiplex families with many affected members have a high posterior probability that disease-bearing haplotypes are transmitted within the pedigree, and the number of informative meioses in the sample also increases. However, as the size of a pedigree grows, the chance that novel mutations are introduced through married-in members also increases, and computational complexity grows with pedigree size as well. Moreover, the genetic etiology of large pedigrees may well be heterogeneous, which poses a serious problem unless one can partition the sample into homogeneous subgroups. On the other hand, the definition of relationships is less ambiguous in larger pedigrees, which are known to be more powerful under ideal conditions.
B. Unitary families, sibling pairs, and sibships
For late-onset diseases, it is difficult if not impossible to sample large numbers of multigenerational multiplex families. In such cases, samples will likely consist of only unitary families, or sibpairs and/or multiplex sibships. Naturally, these types of designs recognize the lack of a clear trait model, so alleles shared identical-by-descent (IBD) among relative pairs are used in a model-free approach to construct test statistics (e.g., Suarez et al., 1978). One problem with the sibpair method is that at times the IBD score is ambiguous for a particular sibpair or at a particular marker position, especially in a genome-wide analysis. Various types of imputation schemes have been proposed to overcome this challenge. For example, Kruglyak and Lander (1995) extended Green's method (Lander and Green, 1987) and gave a fast algorithm for computing the distribution of transmission vectors, which can be used for rapid imputation of relative-pair IBDs (see also Gu et al., 1995). For small sampling units, in which the number of affected persons is substantially smaller than in an extended multiplex family, the posterior probability that a haplotype harboring a disease allele exists within the family is reduced, thus reducing the power of such a model-free design. In any case, simple relationships among the members of a unitary family permit fast computations, and small families are easier and less expensive to sample. To
alleviate the problem of power loss, other steps can be taken. For example, variance-components methods model the correlation structure of the trait values within a family without specifying an underlying genetic model, combining the power of model-based designs with the robustness of model-free methods (Amos, 1994; Blangero and Almasy, 1996; Province et al., 2000). Selective sampling is another such remedy. For example, by selecting only extremely discordant pairs, Risch and Zhang (1995) showed that the power can be increased by as much as sevenfold. It was also shown that the power to detect linkage to QTLs is concentrated in three types of extreme sibpairs (ESPs): those with extremely discordant (ED) trait values, those with extremely high-concordant (HC) trait values, and those with extremely low-concordant (LC) trait values. This has important implications for optimal study designs using ESPs.
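As a toy illustration of such a model-free statistic (a sketch under simplifying assumptions, not the chapter's method), the simple "mean test" compares the average IBD proportion of ED sibpairs against its null value of 1/2; with a fully informative marker, each pair's estimated proportion has null variance 1/8:

```python
import math

# Sketch of the mean test for linkage using ED sibpairs.  Linkage pushes
# the IBD sharing of ED pairs BELOW 1/2, so a large negative z-score is
# evidence for linkage.

def ed_mean_test(pi_hats):
    n = len(pi_hats)
    mean = sum(pi_hats) / n
    se = math.sqrt(1.0 / (8.0 * n))  # null standard error of the mean
    return (mean - 0.5) / se

# 40 hypothetical ED pairs sharing fewer alleles IBD than expected:
z = ed_mean_test([0.0] * 12 + [0.5] * 22 + [1.0] * 6)
print(round(z, 2))  # -1.34
```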
C. Case-controls
The case-control design uses unrelated cases and controls and can precisely locate mutant genes if genetic influence is well established and the study is designed to preclude possible confounding factors and biases (e.g., population substructure). As argued elsewhere in this volume (see Chapter 14 by Schork et al.), this can be a powerful design when applied properly. Morton and Collins (1998) demonstrated that the efficiency of a transmission disequilibrium test (TDT) design is at best two-thirds that of a case-control design because, intuitively, a fully informative TDT trio is scored for two transmissions and two nontransmissions, whereas three fully informative individuals in a sample-based design give six informative alleles. Using DNA pooling instead of individual genotyping, Risch and Teng (1998) also compared the power of the case-control design with that of other association designs and found the case-control design to be the most powerful. The power of the case-control design can be further enhanced by selection of extreme phenotypes (i.e., by using hypernormal controls) (Morton and Collins, 1998). This approach is much more cost-effective than selective sampling in linkage analysis because, the subjects being unrelated, many more samples are available. However, care must be taken to preclude possible confounding between such selection and any unmatched factors in the design. More subtly, possible genetic relationships among the "unrelated" subjects are hidden in the long history of a population, and the sum of all random events in the history leading to a current study subject results in great variability, making it hard to track and predict linkage disequilibrium (LD) effects around a specific locus of interest. It seems that substantial progress in population studies will be necessary for case-control studies to realize their full potential in the mapping of complex traits.
D. Target populations
The genetic makeup of a population plays an important role in determining how one should choose the type of design for a study and how efficient the design can be. It is very likely that complex diseases have varying genetic mechanisms in different populations and that sampling subjects from the "right" population can substantially optimize a study.
1. Population isolates
Simply put, a population is characterized by its founders and by the way it has expanded. On rare occasions, because of a bottleneck of some sort in its history, a population isolate is formed in which genetic founder effects and/or genetic drift act on the population's gene pool. In such population isolates, the likelihood is high that a rare monogenic disease is consistently caused by the same mutation; the power of genetic studies of such diseases is therefore enhanced (e.g., see Peltonen, 1999, for the Finnish Disease Heritage). Gene mapping can also be refined to a resolution of 50-200 kb in an isolate, vs the 1-2 cM achievable by linkage analysis in a fast-growing large population (Peltonen, 1999). For complex diseases, the advantage of using population isolates for identifying susceptibility genes is less clear. It depends heavily on many factors in the population history, including the effective size of the founding population and the expansion rate during the relevant history (relative to the disease mutant alleles) of the population. For common complex diseases, an isolate may have a large number of disease-bearing haplotypes, comparable to that found in an admixed population. The detection of environmental and gene-environment interaction effects will also be somewhat hampered, because a population isolate most likely shares fairly homogeneous environmental conditions. Ethical and social issues can also add to the problems when a vast majority of the members of a closed community are drawn (voluntarily or involuntarily) into a medical study. This issue was echoed recently by a lengthy debate over a bill passed by the Icelandic government, which grants to a biotechnology company the right to construct a nationwide database containing health records and genetic information of everyone in the country, with presumed consent (Gulcher and Stefansson, 1999).
See also a related chapter in this volume (Chapter 25, by Chapman and Thompson).
2. Population admixture
For most population studies, samples are drawn from a large population that is likely admixed for a variety of reasons, including (im)migration and interracial marriage. Relative to the genetic effects, the noise generated by admixture in
such samples may be so high that detection of any disease genes of small or even moderate effect will be practically impossible unless care is taken to compensate for the admixture. For example, in an admixed population, even if allele frequencies were estimated correctly for the whole population, using such estimates for subpopulations with very different values will lead to misspecification of gene frequencies. Gu et al. (1998a) showed in a recent simulation study that false-positive rates are inflated when the affected pedigree member (APM) method is applied directly to samples drawn from an admixed population. The problem of admixture can be solved by using a matched-control design in which "untransmitted" haplotypes from parents are used to construct a hypothetical "control." Falk and Rubinstein (1987) proposed such a method and termed their statistic the "haplotype relative risk test" (HRRT). This design ensures that the controls are from the same subpopulation as the cases, thus eliminating spurious associations caused by admixture. Ott and colleagues proposed a haplotype-based haplotype relative risk (HHRR) test using McNemar's statistic (Ott, 1989; Terwilliger and Ott, 1992), whereas a more general treatment using a contingency-table test was studied by Thomson (1995) as the "affected family-based controls" (AFBAC) method. Spielman and colleagues (Spielman et al., 1993) showed that when McNemar's test is used in the matched-control design of Falk and Rubinstein, the resulting "transmission disequilibrium test" (TDT) is a valid test of linkage in the presence of association. All these designs are robust to population stratification, but Ewens and Spielman (1995) showed that only the TDT is robust to all mating patterns.
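The TDT computation itself is tiny; the following sketch (with illustrative counts, not data from the chapter) scores transmissions versus non-transmissions of a candidate allele from heterozygous parents using McNemar's statistic:

```python
# McNemar form of the TDT: b = transmissions of the candidate allele from
# heterozygous parents to affected offspring, c = non-transmissions.
# Under the null of no linkage, (b - c)^2 / (b + c) is approximately
# chi-squared with 1 df, regardless of population stratification.

def tdt_statistic(b, c):
    return (b - c) ** 2 / (b + c)

# Hypothetical counts from a sample of affected trios:
print(tdt_statistic(60, 40))  # 4.0, nominally significant at the 5% level
```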
As more sequence data become available, and more information on genetic variation in different populations accumulates in public databases, analysis conditional on many unlinked markers across the genome can be used to detect and correct for underlying population admixture. Such ideas have been explored by Pritchard and Rosenberg (1999) in a case-control design, where they found that about 15-20 microsatellite markers are sufficient to detect population stratification at an overall significance level of 0.05 (see also McKeigue, 1998). See also Chapter 14 (Schork et al.) in this volume.
A study design will not be optimal until the analysis procedures are also optimized to achieve the best power. It is desirable to choose the analytical methods at the design stage, so that the method becomes an integral part of the design in determining the power and the cost-effectiveness of a study. Even if data have already been collected and are ready for analysis, however, one should make an effort to optimize analytic procedures.
A. Combining sibpairs to enhance power
For model-free approaches, combining different types of sibpairs is an efficient way to increase the effective sample size for a more powerful analysis. As we pointed out earlier, in a sibpair design, selective sampling can produce very powerful extreme sibpairs (ESPs), among which the ED sibpairs are the most powerful, though scarce. In the process of sampling for ED pairs, one usually needs to screen a very large number of unselected sibpairs, which results in a collection of many extremely concordant (EC) sibpairs. It has been shown that genotyping these EC pairs and combining them with the ED pairs in the analysis can substantially enhance the power (the EDAC method; see Gu et al., 1996). In addition, incorporating a measure of effect size, such as the λ method of Risch (1990), can further optimize a design. The λ method was introduced by Risch (1990) by extending James's (1971) formula relating the relative recurrence risk K_R and the population prevalence K. The relative risk ratio λ_R = K_R/K is defined to satisfy
    λ_R = 1 + Cov(X_1, X_2)/K²,                                    (26.1)
where Cov(X_1, X_2) is the covariance between the trait values (0 or 1) of a relative pair. Based on this formula, relationships between the λ's for different types of relative pairs can be derived to infer (weakly) the underlying genetic model of the disease. For continuous quantitative traits, Equation (26.1) is not readily defined. The concept of relative risk ratio can, however, be generalized by polychotomization of the trait values, and one may consider relative risk ratios λ(h, l) for sibpairs with various trait configurations (one sib's trait value in the hth interval and the other's in the lth; see Gu and Rao, 1997a). The expected IBD proportions of various types of relatives can then be expressed in terms of λ_s and λ_o, the relative risk ratios for sibpairs and parent-offspring pairs, respectively. The effect of the recombination fraction θ can be incorporated by expressing the probability of IBD at the linked trait locus conditional on the IBD at the marker locus. Therefore, the distribution of sibpair statistics can be characterized by the λ's without directly involving other genetic model parameters, and the power of such a study design can be readily assessed using estimated values of the generalized λ's. For n ED sibpairs, for example, the power 1 − β satisfies
    z_{1−β} = [ |2λ_s − λ_o − 1| √n − √2 λ_s z_{1−α} ] / √[(λ_o + 2)(2λ_s − λ_o) − 1],
(26.2)
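A quick numerical reading of Equation (26.2) can be sketched as follows; the mean-test form written in the code comment is our transcription of the display above, and the λ values are purely hypothetical generalized ratios for ED pairs:

```python
import math

# Evaluate the ED-sibpair power of Equation (26.2), transcribed here as
#   z_{1-beta} = (|2*lam_s - lam_o - 1|*sqrt(n) - sqrt(2)*lam_s*z_{1-alpha})
#                / sqrt((lam_o + 2)*(2*lam_s - lam_o) - 1).

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ed_power(n, lam_s, lam_o, z_alpha):
    num = abs(2 * lam_s - lam_o - 1) * math.sqrt(n) - math.sqrt(2) * lam_s * z_alpha
    den = math.sqrt((lam_o + 2) * (2 * lam_s - lam_o) - 1)
    return phi(num / den)

# 100 ED pairs with hypothetical lambda_s = lambda_o = 0.6 and a one-sided
# alpha of 1e-4 (z_{1-alpha} = 3.719):
print(round(ed_power(100, 0.6, 0.6, 3.719), 2))  # 0.87
```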
By virtue of this method, one can focus on important factors affecting the power of a study design that are controllable, such as the combination of
ESP pairs and the cutoff thresholds defining trait values for extreme phenotypes. An optimal design may be obtained by searching on a grid of λ values, as prescribed in the following algorithm given by Gu and Rao (1997b).
Algorithm for optimization (with preexisting phenotypic data):
1. For each plausible division of trait values (or at least for the more promising divisions), obtain estimated values of the generalized relative risk ratios for sibpairs and parent-offspring pairs from the existing data set or from earlier population studies.
2. For each such division, calculate the necessary sample sizes of LC, ED, and HC sibpairs for the desired power 1 − β if a sole ESP design is used. Calculate and retain the ESP configuration with the desired power.
3. Select a configuration for combining the numbers of available LC, ED, and HC sibpairs in the sample. Retain it only if it has the desired power. Compare its cost with that of the retained ESP configuration; retain it if its cost is less.
4. After all possible combinations have been exhausted, the retained configuration attains the desired power at the least cost.
Such algorithms make it possible to develop optimal designs if the λ's are reliably estimated from population genetic epidemiological studies. For example, when fixed thresholds (such as the lower 30% and upper 5%) are used to sample extreme sibpairs, optimization can be performed over combinations of the various types of ESPs (ED, HC, and LC). Using the EDAC statistic proposed by Gu et al. (1996), which combines both extremely discordant and concordant sibpairs in the sample, such optimization over the combinations of ED and EC pairs is depicted in Figure 26.1. By allowing the thresholds to vary, we applied the foregoing algorithm and found that a design with a lower threshold at about 30% and an upper threshold at the upper 5% is consistently more cost-effective than the others over a variety of models (see Table 26.1). More details can be found in Gu and Rao (1997b).
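The retain-the-cheapest loop of the algorithm above can be sketched generically; here `required_pairs` is a hypothetical helper standing in for the sample-size calculations of steps 1-2:

```python
# Grid-search skeleton for the optimization algorithm: evaluate each
# candidate threshold division and retain the design that achieves the
# desired power at the least cost.  All numbers below are toy values.

def optimize_design(divisions, required_pairs, cost_screen=1, cost_genotype=1):
    best = None
    for d in divisions:
        screened, genotyped = required_pairs(d)  # pairs meeting the power target
        cost = screened * cost_screen + genotyped * cost_genotype
        if best is None or cost < best[1]:
            best = (d, cost)
    return best

# Toy table mapping threshold divisions to (screened, genotyped) counts:
table = {('30%', '5%'): (400, 120), ('20%', '5%'): (650, 90), ('40%', '10%'): (380, 200)}
print(optimize_design(table, lambda d: table[d]))  # (('30%', '5%'), 520)
```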
B. Combining linkage and association to enhance power
Since the magnitude of the disequilibrium induced by genetic linkage decays rapidly as a population expands, linkage disequilibrium has been suggested as a fine-mapping tool (Jorde, 1995; Xiong and Guo, 1997). In a sole LD mapping design, implicit assumptions are made about the sampled population; therefore, possible misspecification introduces considerable instability into the design. Moreover, the distribution of LD effects around disease loci is hard to track because of unspecified/unclear
Figure 26.1. Optimization of combining various types of ESP pairs in an ESP design (x axis: number of ED sibpairs). Contour curves of equal power are plotted for the various combinations, and their intersection points with the lines representing minimum cost give the optimum design. Thresholds for extreme traits were fixed at (50%, 5%).
random events in the population history. More recently, designs that combine linkage and LD analysis have been proposed to overcome some of these difficulties. Even though estimated haplotype frequencies were used early on in model-based linkage analyses, modeling linkage and linkage disequilibrium effects simultaneously can be carried out with a sibpair design, as suggested by Fulker et al. (1999). When linkage and LD are modeled simultaneously, the linkage effect in the full model dissipates quickly (absorbed by the LD part) as the marker position moves away from the true mutation. Compared with the results obtained by linkage alone, this should produce an elevated bump indicating the precise location of a mutant gene, which is particularly useful in fine-mapping of complex diseases.
C. Genome-wide versus gene-wide scanning
With the demonstrated power of association analyses that exploit linkage disequilibrium effects, the foreseeable availability of highly dense marker maps
Table 26.1. Optimization of Study Designs under Various Additive Models (h² = 0.30), Assuming Selective Sampling from Both Upper and Lower Tails and Equal Unit Cost for Genotyping and Phenotyping^a

                  ρ = 0.0                               ρ = 0.40
  p      Thresholds^b  EDSP/EDAC^c  Cost^d     Thresholds^b  EDSP/EDAC^c  Cost^d
Additive
  0.05   (60%, 5%)     (0,23,28)    148        (55%, 5%)     (0,38,25)    183
  0.10   (60%, 5%)     (0,32,35)    210        (40%, 5%)     (0,47,28)    266
  0.20   (50%, 5%)     (0,33,46)    285        (25%, 5%)     (0,44,34)    371
  0.30   (50%, 5%)     (0,4?,??)    355        (20%, 5%)     (0,48,40)    489
  0.40   (45%, 5%)     (0,44,65)    434        (20%, 5%)     (0,62,46)    627
  0.50   ( 5%, 45%)    (79,54,0)    525        ( 5%, 15%)    (54,61,0)    795

^a Key: p = frequency of the "risk" allele; h² = heritability of the trait locus; ρ = residual correlation.
^b Thresholds (x%, y%) define trait values below the xth percentile as "extremely low" and those above the (100 − y)th percentile as "extremely high."
^c The combination (n0, n1, n2) also indicates which type of ESP test is used for the optimum design, where n0 LC pairs, n1 ED pairs, and n2 HC pairs are used.
^d Cost is measured by the sum of the number of sibpairs to be screened and the number to be genotyped.
makes it tempting to design genome-wide association scans (Risch and Merikangas, 1996). Although success stories exist for rare monogenic diseases (Peltonen, 1999), it is not clear how successful such efforts can be at present for mapping complex diseases. Some simulation studies indicate that a large number of markers, perhaps on the order of 500,000 SNPs, will be needed to carry out a genome-wide association scan (Kruglyak, 1999). Most recently, Collins et al. (1999) used the Malecot model to study several published data sets, discovered that LD extends as far as several hundred kilobases, and argued that a total of 30,000 SNPs might be enough for an economical genome-wide association scan, with higher density for candidate regions (see also Huttley et al., 1999). In any case, because of the importance of the particular population history, the power of such genome-wide scans depends heavily on the population selected for analysis. Ignoring population considerations could lead to false positives that are not readily dealt with by current methods. On the other hand, a "gene-wide" scanning procedure, namely, scanning within a candidate region, seems more promising for mapping complex disease genes when the existence of a disease gene has been established through other means. Since the power of association tests at markers with weak LD effects is
negligible, we proposed a hybrid two-stage design that employs a linkage scan in the first stage, followed by an association scan in the implicated regions using high-density SNPs (Rao and Gu, 2000). At the initial stage, power is maximized to detect as many true positive signals as possible, at the price of some false positives. In the second stage, highly dense SNP markers could be used within candidate regions to increase precision. Use of a lower significance level at the second stage should then minimize false positives. The added synergism from the robustness of linkage designs and the great sensitivity of association scans seems more promising for an optimal design than a single-scan approach. As more SNPs become available and genotyping becomes less expensive, single genome-wide association scans will become more feasible. In the interim, the combined approach may be more practical. Comparison of the optimal properties of these hybrid two-stage designs will be helpful.
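The cost logic of such a two-stage design can be sketched with a back-of-the-envelope calculation (all numbers hypothetical): a lenient first-stage threshold keeps power high, and only the regions passing stage 1 incur the dense-SNP genotyping cost:

```python
# Expected cost of a hybrid two-stage scan: a genome-wide linkage scan at a
# lenient alpha1, then dense SNP follow-up only in implicated regions.

def two_stage_cost(n_regions, alpha1, stage1_cost, followup_cost_per_region):
    # Under the null, about alpha1 * n_regions regions pass stage 1
    # (true signals, ignored here, would add a few more).
    expected_followups = alpha1 * n_regions
    return stage1_cost + expected_followups * followup_cost_per_region

# 400 regions, alpha1 = 0.05, a stage-1 scan costing 100 units, and 50
# units of SNP follow-up per implicated region:
print(two_stage_cost(400, 0.05, 100, 50))  # 1100.0
```

Making alpha1 more lenient raises the follow-up cost but protects first-stage power, which is the trade-off described in the text.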
VI. COST-BENEFIT ISSUES
The Heisenberg uncertainty principle of quantum mechanics dictates that the simultaneous measurement of two conjugate variables (such as the momentum and position of a moving particle) entails a limitation on the precision (standard deviation) of each measurement. An analogous principle applies to the genetic study of complex human diseases. Under this principle, the information content of a study is determined by the major factors discussed earlier, among others. These factors, in turn, regulate the trade-off between the power of detecting the trait loci and the precision of the mapping results. A major task of a study design is to find a pivot point at which the uncertainty between power and precision can be balanced to achieve optimality. While optimality can be measured in several ways (e.g., by relative statistical efficiency, cost-effectiveness, or statistical power), we conclude this chapter by focusing on issues relevant to cost-effectiveness under several popular designs and suggest that the cost-effectiveness of a design be used as a common scale for balancing the uncertainties in the study of complex diseases.
A. Cost-effectiveness of the EDAC design
We argued earlier that ED sibpairs are scarce, especially when high sibling residual correlation exists, and that a combined EDAC test utilizing both ED sibpairs and extremely concordant (EC) sibpairs from the same sampling pool should be used. The cost-effectiveness advantage of the EDAC design can be seen as follows. Suppose that the cost of phenotyping per person is C_P and the cost of genotyping per person is C_G. If sibpairs are to be sampled randomly from the whole population, the total cost of the ED test is 2N·C_P + 2n_ED·C_G, and the total cost
of the EDAC test is 2 N’Cp + 2(n,d + nl, + q&o, where N is the total number of unselected sibpairs needed to obtain n,!?~ED sibpairs for the ED test, and N’ is the total number of sibpairs needed to obtain ned ED pairs and (q, + q,J EC pairs for the EDAC test (N’ < N). If we already have a cohort of “probands” with extreme phenotypes, assuming a uniform sibship size of 2, the total cost will be NCp + 2n&o for the ED test, and N’CP + 2(n,. + nt, f q,,)Co for the EDAC test. The additional cost for the ED test, compared with the EDAC, is 2Co[(N - N’)k - (ned + q, i- nh, - nau)] if sampled randomly Co[(N - N’)k - 2(r&d + nl, -I- nh, - nan)] if sampled via proband, where CP = k Co. Since the total number of sibpairs to be screened is much larger for the ED test than for the combined test (N - N’ > 0), the cost difference will be positive as long as k is not too small. Detailed results comparing the total costs of the ED approach and the EDAC approach can be found in Gu et al. (1996). We display in Figure 26.2 the extra cost for a sole ED design vs the cost of EDAC as a function of heritability
Figure 26.2. Cost-effectiveness of the EDAC design vs sole ED designs under additive and dominant models, assuming equal genotyping and phenotyping cost per subject (x axis: locus-specific heritability h²).
due to the disease locus under an additive and a dominant model when C_P = C_G. The results will be even stronger when C_P > C_G. It is clear that for complex diseases EDAC is a favorable design, because locus-specific heritabilities tend to be small.
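The total-cost expressions above are easy to tabulate; the following sketch plugs hypothetical counts into the random-sampling case with equal unit costs (k = 1), as assumed in Figure 26.2:

```python
# Total costs of the ED-only and EDAC designs under random sampling:
#   ED:   2*N*C_P  + 2*n_ED*C_G
#   EDAC: 2*N'*C_P + 2*(n_ed + n_lc + n_hc)*C_G
# All counts below are hypothetical.

def ed_cost(N, n_ed, c_p, c_g):
    return 2 * N * c_p + 2 * n_ed * c_g

def edac_cost(N_prime, n_ed, n_lc, n_hc, c_p, c_g):
    return 2 * N_prime * c_p + 2 * (n_ed + n_lc + n_hc) * c_g

# EDAC needs far fewer screened pairs (N' < N) for the same power:
print(ed_cost(5000, 60, 1, 1))           # 10120
print(edac_cost(1500, 40, 0, 80, 1, 1))  # 3240
```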
B. Cost-effectiveness of two-stage designs
Elston and colleagues investigated the cost-effectiveness of two-stage linkage designs for a global genome search under a variety of models and concluded that a two-stage procedure can be done for half the cost of a one-stage procedure (Elston et al., 1996). It has also been shown that the optimality of two-stage designs does not depend separately on the desired final (second-stage) significance level or on power, but rather on both (see Todorov and Rao, 1997). A more subtle problem in implementing such a two-stage design lies in ascertaining how the use of the same sample for both stages would affect the optimality of such designs. On the surface, using the same sample appears to save costs. It also assures homogeneity to the degree provided in the original sample. However, spurious correlations inherent in the same sample may lead to repeating the same false signals detected in the first stage, and this could drive up the ultimate cost of a mapping project. According to the theory of length-biased sampling (e.g., Terwilliger, 1997), given a threshold of significance, peaks of false positive signals along a chromosome segment tend to be narrower than those of true positives. This finding may be helpful for pruning out false positives, especially when the same sample is used for both stages. It is unclear whether the hybrid designs discussed earlier could be used with the same sample.
C. Cost-effectiveness of meta-analysis
Meta-analysis, which combines results from several studies to achieve increased power and precision, appears to offer a promising approach. Details on designing and carrying out meta-analyses are discussed separately in this volume (see Chapter 18 by Gu et al.). A brief discussion is presented here on the relationship between the cost-effectiveness of individual studies and that of an integrated study using meta-analysis. Our simulation studies involving meta-analysis of linkage results suggested that fewer but relatively very powerful primary studies may not be the most cost-effective strategy (Gu et al., 1996). Recent simulations involving meta-analysis of association studies also confirm that a fair number of moderately powered association studies performs better than only a few large-scale studies. In Figure 26.3, the extra cost for combining (up to 10 primary studies) and
Figure 26.3. Cost-effectiveness in meta-analysis of association studies using the TDT (x axis: number of primary studies). Individual studies are simulated under the same disease model, with parental heterozygosity h = 0.245 and genotypic relative risk γ = 2.0; the combined power of the pooled analysis is required to be 80 or 95%, at a genome-wide significance level α = 5 × 10⁻⁸ (from Risch and Merikangas, 1996).
for the meta-analysis of 10 primary studies are plotted against the number of primary studies. We let the among-study variance equal the within-study variance and assume that the individual studies were designed to have the same power. It appears from this simple exercise that a larger number of moderately powerful primary studies is more cost-effective. Of course, in reality there are other considerations as well. For example, coordination of a larger number of studies will entail additional costs that are ignored here. A promising route to optimal study designs would seem to comprise balancing these factors and exploring the cost-effectiveness of meta-analysis in other settings (e.g., using it to pool summary statistics for large-scale multicenter studies).
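The pooling step of a fixed-effects meta-analysis can be sketched as follows (illustrative numbers; the chapter's simulations also include an among-study variance component, omitted here):

```python
# Inverse-variance (fixed-effects) pooling of per-study effect estimates.
# The pooled variance is smaller than any single study's variance, which
# is the power/precision gain that meta-analysis buys.

def pool(estimates, variances):
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, 1.0 / sum(weights)

# Three moderately powered hypothetical studies:
est, var = pool([0.30, 0.22, 0.40], [0.04, 0.05, 0.06])
print(round(est, 3), round(var, 3))  # 0.301 0.016
```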
VII. DISCUSSION
As molecular technologies improve further and the complete sequence of the human genome emerges, genetic dissection of complex human traits seems more achievable than ever. Recent years have witnessed remarkable progress in the development of newer and better statistical tools, and there is no doubt that
these will continue to improve. Arguably, more progress needs to be made on the study design front. We believe that optimal study designs play a critical role in the successful mapping of complex diseases. Required sample sizes tend to be prohibitively large when computed realistically, so to amplify the weak but real signals, imaginatively designed studies are needed to circumvent the problems caused by gene-gene and gene-environment interactions. It is essential that the overall goal be broken into specific, smaller, and tractable ones and that the investigator recognize the relationships among these secondary goals. At one end of the scale, a goal should be as broad and general as possible so that the power of the design is optimized. At the other end, the goal should be as narrowly focused as possible to obtain the highest resolution. These two goals need not contradict each other if the total information content of a study design is optimized. Whereas the increased rates of false signals in genome-wide scans have given rise to a healthy debate (Thomson, 1994; Lander and Kruglyak, 1995; Todorov and Rao, 1997), we must still remember the uncertainty principle relating the power and precision of a design. It is the total desired information content of the study that will ultimately enhance gene finding. We have discussed several design issues that investigators should consider before finalizing a study design. Needless to say, cost is often the ultimate barrier, highlighting the need for cost-benefit analysis as part of any study design. As part of a study design, investigators should decide which sampling units and sampling schemes to use, how to place the markers, and what to analyze, according to the particular situation at hand. Population-based genetic epidemiological studies seem essential for providing the needed information on the genetic makeup of current populations.
With the expected high noise-to-signal ratio for complex diseases and the automation of large-scale genotyping, we also should pay closer attention to issues pertinent to data management and quality control. The quality of genotyping can have a substantial effect on the power of a study, and the quality of individual studies will influence the validity of a meta-analysis. Consideration of this myriad of issues ought to enable us to determine the necessary sample sizes more realistically. Clearly, good designs should lead to better results, and anticipated failure should not be a part of any study design. While investigators can control the design of their own individual studies, will that alone be enough? It is clear that more data from well-designed studies should enhance power, hence gene finding. We believe that lumping (pooling of data/results) and splitting (subdividing into homogeneous subgroups) strategies, as discussed by Rao in Chapter 3, are highly promising for this purpose, and these approaches in turn require collaboration among multiple studies and their investigators. The designers of individual studies should keep this collaboration in mind. In the end, true collaboration may hold the key to unlocking the etiological mysteries underlying complex human traits.
456
Gu and Rao
References

Amos, C. I. (1994). Robust variance-components approach for assessing genetic linkage in pedigrees. Am. J. Hum. Genet. 54, 535-543.
Blangero, J., and Almasy, L. (1996). SOLAR: Sequential oligogenic linkage analysis routines. Technical notes. Southwest Foundation for Biomedical Research, Population Genetics Laboratory, San Antonio, TX.
Boehnke, M., and Cox, N. J. (1997). Accurate inference of relationships in sib-pair linkage studies. Am. J. Hum. Genet. 61(2), 423-429.
Collins, A., Lonjou, C., and Morton, N. E. (1999). Genetic epidemiology of single-nucleotide polymorphisms. Proc. Natl. Acad. Sci. USA 96(26), 15173-15177.
Elston, R. C., Guo, X., and Williams, L. V. (1996). Two-stage global search designs for linkage analysis using pairs of affected relatives. Genet. Epidemiol. 13(6), 535-558.
Ewens, W. J., and Spielman, R. S. (1995). The transmission/disequilibrium test: History, subdivision, and admixture. Am. J. Hum. Genet. 57(2), 455-464.
Falk, C. T., and Rubinstein, P. (1987). Haplotype relative risks: An easy reliable way to construct a proper control sample for risk calculations. Ann. Hum. Genet. 51, 227-233.
Fulker, D. W., Cherny, S. S., Sham, P. C., and Hewitt, J. K. (1999). Combined linkage and association sib-pair analysis for quantitative traits. Am. J. Hum. Genet. 64, 259-267.
Göring, H. H., and Ott, J. (1997). Relationship estimation in affected sib pair analysis of late-onset diseases. Eur. J. Hum. Genet. 5(2), 69-77.
Gu, C., and Rao, D. C. (1997a). A linkage strategy for detection of human quantitative-trait loci. I. Generalized relative risk ratios and power of sib pairs with extreme trait values. Am. J. Hum. Genet. 61, 200-210.
Gu, C., and Rao, D. C. (1997b). A linkage strategy for detection of human quantitative-trait loci. II. Optimization of study designs based on extreme sib pairs and generalized relative risk ratios. Am. J. Hum. Genet. 61, 211-222.
Gu, C., Suarez, B. K., Reich, T., and Todorov, A. A. (1995). A chromosome-based method to infer IBD scores for missing and ambiguous markers. Genet. Epidemiol. 12, 871-876.
Gu, C., Todorov, A. A., and Rao, D. C. (1996). Combining extremely concordant sibpairs with extremely discordant sibpairs provides a cost-effective way to linkage analysis of QTL. Genet. Epidemiol. 13, 513-533.
Gu, C., Miller, M. A., Reich, T., and Rao, D. C. (1998a). The affected-pedigree-member method revisited under population stratification. In "Statistical Methods in Genetics: The IMA Volumes in Mathematics and Its Applications," Vol. 112, pp. 165-180. Springer-Verlag, New York.
Gu, C., Province, M., Todorov, A., and Rao, D. C. (1998b). Meta-analysis methodology for combining non-parametric sibpair linkage results: Genetic homogeneity and identical markers. Genet. Epidemiol. 15, 609-626.
Gulcher, J., and Stefansson, K. (1999). An Icelandic saga on a centralized health care database and democratic decision making. Nat. Biotechnol. 17(7), 620.
Hall, J. M., Lee, M. K., Newman, B., Morrow, J. E., Anderson, L. A., Huey, B., and King, M. C. (1990). Linkage of early-onset familial breast cancer to chromosome 17q21. Science 250, 1684-1689.
Huttley, G. A., Smith, M. W., Carrington, M., and O'Brien, S. J. (1999). A scan for linkage disequilibrium across the human genome. Genetics 152, 1711-1722.
James, J. W. (1971). Frequency in relatives for an all-or-none trait. Ann. Hum. Genet. 35, 47-49.
Jorde, L. B. (1995). Linkage disequilibrium as a gene-mapping tool. Am. J. Hum. Genet. 56, 11-14.
Kruglyak, L. (1999). Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22(2), 139-144.
Kruglyak, L., and Lander, E. (1995). Complete multipoint sib-pair analysis of qualitative and quantitative traits. Am. J. Hum. Genet. 57, 439-454.
Lander, E., and Green, P. (1987). Construction of multilocus genetic linkage maps in humans. Proc. Natl. Acad. Sci. USA 84, 2363-2367.
Lander, E., and Kruglyak, L. (1995). Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat. Genet. 11, 241-247.
McKeigue, P. M. (1998). Mapping genes that underlie ethnic differences in disease risk: Methods for detecting linkage in admixed populations, by conditioning on parental admixture. Am. J. Hum. Genet. 63, 241-251.
Morton, N. E., and Collins, A. (1998). Tests and estimates of allelic association in complex inheritance. Proc. Natl. Acad. Sci. USA 95(19), 11389-11393.
Ott, J. (1989). Statistical properties of the haplotype relative risk. Genet. Epidemiol. 6, 127-130.
Peltonen, L. (1999). Positional cloning of disease genes: Advantages of genetic isolates. Hum. Hered. 50(1), 66-75.
Pritchard, J. K., and Rosenberg, N. A. (1999). Use of unlinked genetic markers to detect population stratification in association studies. Am. J. Hum. Genet. 65, 220-228.
Province, M. A., Rice, T., Borecki, I. B., Gu, C., and Rao, D. C. (2000). A multivariate and multilocus variance components approach using structural relationships to assess quantitative trait linkage via SEGPATH. Genet. Epidemiol. (in press).
Rao, D. C. (1998). CAT scans, PET scans, and genomic scans. Genet. Epidemiol. 15, 1-18.
Rao, D. C., and Gu, C. (2000). Principles and methods in the study of complex phenotypes. In "Molecular Genetics and Human Personality" (J. Benjamin, R. Ebstein, and R. H. Belmaker, eds.). American Psychiatric Press, Washington, DC.
Risch, N. (1990). Linkage strategies for genetically complex traits. I. Multilocus models. Am. J. Hum. Genet. 46, 222-228.
Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273, 1516-1517.
Risch, N., and Teng, J. (1998). The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human disease. I. DNA pooling. Genome Res. 8(12), 1273-1288.
Risch, N., and Zhang, H. (1995). Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268, 1584-1589.
Spielman, R. S., McGinnis, R. E., and Ewens, W. J. (1993). Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52, 506-516.
Suarez, B. K., Rice, J., and Reich, T. (1978). The generalized sib pair IBD distribution: Its use in the detection of linkage. Ann. Hum. Genet. 42, 87-94.
Terwilliger, J. D. (1997). True and false positive peaks in genomewide scans: Applications of length-biased sampling to linkage mapping. Am. J. Hum. Genet. 61, 430-438.
Terwilliger, J. D., and Ott, J. (1992). A haplotype-based "haplotype relative risk" approach to detect allelic associations. Hum. Hered. 42, 337-346.
Thomson, G. (1994). Identifying complex disease genes: Progress and paradigms. Nat. Genet. 8, 108-110.
Thomson, G. (1995). Mapping disease genes: Family-based association studies. Am. J. Hum. Genet. 57, 487-498.
Todorov, A., and Rao, D. C. (1997). Trade-off between false positives and false negatives in the linkage analysis of complex traits. Genet. Epidemiol. 14, 453-464.
Turner, S. T., Boerwinkle, E., and Sing, C. F. (1999). Context-dependent associations of the ACE I/D polymorphism with blood pressure. Hypertension 34(part 2), 773-778.
Wijsman, E. M., and Amos, C. I. (1997). Genetic analysis of simulated oligogenic traits in nuclear families and extended families. Genet. Epidemiol. 14, 719-735.
Xiong, M., and Guo, S. W. (1997). Fine-scale genetic mapping based on linkage disequilibrium: Theory and applications. Am. J. Hum. Genet. 60, 1513-1531.
One-Stage versus Two-Stage Strategies for Genome Scans

Xiuqing Guo
Division of Medical Genetics, Department of Medicine and Pediatrics
Spielberg Pediatric Research Center
Burns and Allen Research Institute
Cedars-Sinai Medical Center
Los Angeles, California 90048

Robert C. Elston
Department of Epidemiology and Biostatistics
Case Western Reserve University
Cleveland, Ohio 44109
I. Summary
II. Introduction
III. One-Stage and Two-Stage Procedures
A. Using the Mean Statistic for Affected Sibpairs
IV. Two-Stage Global Search Designs Using Discordant Relative Pairs
V. Using Both Discordant and Affected Relative Pairs
VI. Effect of Genetic Heterogeneity and Incomplete Marker Information
VII. The Computer Program DESPAIR
VIII. Discussion
References
459
460
Guo and Elston
I. SUMMARY

One way to determine the genetic etiology of a complex disease is to find the chromosomal regions that tend to be shared among affected relatives and yet tend to differ between affected and unaffected relatives. This can be done by using equally spaced markers to perform a global linkage search of the whole genome. With the rapid development of molecular technology to readily type individuals, it is becoming relatively easy to carry out such genome searches for disease genes. The simplest approach is to type every individual in the sample at every marker locus. However, such an approach wastes a great deal of effort in genotyping large areas of the genome that are eventually found not to show any evidence for linkage. Efficient design of genome scans for linkage analysis is therefore an important issue. We first describe strategies in which a single sample is typed for markers in one or two stages, with neither the use of a second sample nor any attempt at replication of a linkage result. Then in the discussion we briefly mention other multistage strategies, including one that involves a degree of replication.
II. INTRODUCTION

Taking into account cost considerations in a global search design for linkage analysis, Elston (1992, 1994) proposed a two-stage procedure in which at the first stage affected pairs of relatives are typed for widely spaced markers and a relatively large significance level is used to determine where, at a second stage, more narrowly spaced markers should be placed around the markers that were significant at the first stage. In this way a tight grid of markers is typed only at locations presenting evidence that a disease gene might occur.

Assume a sample of n independent pairs of affected relatives of a particular type (full sib, half-sib, grandparent-grandchild, avuncular, or first cousin). Let p̂ be the observed proportion of those relative pairs that share 0 alleles identical by descent (IBD) at the marker locus (we assume initially that the markers are fully informative). Risch (1990a) showed that this proportion is a sufficient statistic for testing linkage if the disease is caused by alleles at a single autosomal locus acting additively on penetrance. Let the expected value of p̂ for a particular type of affected pair of relatives be p₀ if there is no linkage and p₀ − δ if there is linkage. Then δ can be expressed as a function of the recombination fraction θ between the disease and marker loci and a risk ratio factor λ (Risch, 1990b). In the case of a monogenic disease, for relatives of type R, λ_R = K_R/K, where K is the prevalence of the disease and K_R is the recurrence risk among type R relatives. If alleles at l unlinked loci determine disease susceptibility in a multiplicative fashion, then it is possible to write
27. Genome Scans: One- versus Two-Stage Strategies
P(first-degree relative of affected person is affected) / P(random member of the population is affected) = λ₁λ₂···λ_l,
where λ_i (i = 1, 2, ..., l) is a risk factor that measures specifically the effect of the ith disease locus. In this case, λ in the sequel will refer to the particular factor λ_i corresponding to the disease locus being linked to a marker. Define

T(δ) = √n (p̂ − p₀ + δ) / [p̂(1 − p̂)]^(1/2).

The null hypothesis θ = 1/2 (i.e., no linkage) is rejected for a particular locus if T(0) ≤ z_α, where z_α is the α fractile of the distribution of T(0); for large enough n, z_α can be approximated by the α fractile of the standard normal distribution. To have final power 1 − β of detecting a departure as small as some δ > 0 from p₀ at a significance level α, n must be chosen to be an integer such that P[T(0) ≤ z_α | δ] ≥ 1 − β, or

P[ T(δ) ≤ z_α + √n δ / [p̂(1 − p̂)]^(1/2) ] ≥ 1 − β.

Assuming that p̂(1 − p̂) can be considered to be a constant, n must satisfy

√n δ ≥ (z_(1−β) − z_α) [p̂(1 − p̂)]^(1/2),

where z_(1−β) is the (1 − β) fractile of the distribution of T(δ); similarly, for large enough n, z_(1−β) is well approximated by the (1 − β) fractile of the standard normal distribution. Substituting (p₀ − δ)(1 − p₀ + δ) for p̂(1 − p̂), we see that we should choose n to be the smallest integer such that

n ≥ (z_(1−β) − z_α)² (p₀ − δ)(1 − p₀ + δ) / δ².    (27.1)
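Equation (27.1) gives the required number of pairs directly. A minimal numerical sketch (the normal fractiles are computed by bisection; p₀ = 0.25 is the null proportion of full sibs sharing 0 alleles IBD, and δ = 0.05 is an invented, purely illustrative departure):

```python
import math

def norm_fractile(p):
    """p fractile (inverse CDF) of the standard normal, via bisection on erf."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def pairs_needed(p0, delta, alpha, power):
    """Smallest integer n satisfying Equation (27.1)."""
    z_alpha = norm_fractile(alpha)   # lower-tail fractile (negative for small alpha)
    z_power = norm_fractile(power)   # the (1 - beta) fractile
    n = (z_power - z_alpha) ** 2 * (p0 - delta) * (1 - p0 + delta) / delta ** 2
    return math.ceil(n)

# Full sibs share 0 alleles IBD with p0 = 0.25 under no linkage.
n = pairs_needed(p0=0.25, delta=0.05, alpha=0.0001, power=0.80)
```

As Equation (27.1) suggests, halving δ roughly quadruples the required n, since δ enters through δ².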
III. ONE-STAGE AND TWO-STAGE PROCEDURES

In a one-stage procedure, each person in the sample is typed for m markers equally spaced along the whole autosomal genome. Let C₁ be the cost of determining one marker on a person and C₂ be the cost of recruiting one person into
the study. Thus the total cost of this one-stage study is 2nC₂ + 2nmC₁. This cost can be written in units of C₁ as 2n(R + m), where R = C₂/C₁ is the ratio of the cost of recruiting a person to the cost of performing one marker assay on a person. In large samples, the value of m that minimizes this cost does not depend on α or β, but only on λ and the type of relative pair. Thus, for each value of λ and type of relative pair, there is an asymptotically optimal spacing for the markers.

In a two-stage procedure, we first study m equally spaced markers at the first stage as in a one-stage design, and so the maximum distance to any marker at this stage is x* = M/2m, where M is the total length of the genome (we assume, for simplicity, that we can consider the whole genome as a single chromosome of length M). Each of these markers is tested for linkage at a significance level α* (to be determined) and then, at the second stage, around each of the markers significant at the first stage, 2k further markers are typed, with k markers placed at each side. The maximum distance to any marker now becomes x = x*/(2k + 1) = M/2m(2k + 1). To convert a distance x in morgans into a recombination fraction θ, a mapping function can be used. Elston (1992) showed that the choice of mapping function is not critical provided that it allows for at least some interference. The cost of the first stage is 2n(R + m) as before, and the extra cost of the second stage in units of C₁ is 2nY, where Y is the number of markers typed at the second stage. Assuming an exact significance level α* and exact power 1 − β, the expected value of Y is 2k[α*m + (1 − β)d], where d is the number of disease loci with risk ratio factors at least as large as λ. Thus the total expected cost is

C = 2n{R + m + 2k[α*m + (1 − β)d]},    (27.2)

and we choose n, m, k, and α* to minimize this quantity. To ensure that we have a significance level of α and power 1 − β at the end of the second stage, n must be chosen to satisfy Equation (27.1) when δ is calculated on the basis of the spacing at the second stage, that is, when the value of θ corresponding to x = M/2m(2k + 1) is used. However, to determine the value of α* that will ensure power 1 − β at the first stage, let δ* be the value of δ calculated on the basis of the spacing at the first stage, and then α* must be chosen to satisfy, analogous to Equation (27.1),
n ≥ (z_(1−β) − z_α*)² (p₀ − δ*)(1 − p₀ + δ*) / δ*²,

that is,

z_α* ≥ z_(1−β) − √n δ* / [(p₀ − δ*)(1 − p₀ + δ*)]^(1/2).
Changing the inequality to an equality, α* is thus also expressed as a function of n and m in the total expected cost C. To define the optimal design, we find the integer values of n, m, and k that minimize the total expected cost C as given in Equation (27.2). It should be noted that we have been controlling the pointwise significance level α. If a genome-wide α is desired, then the necessary value of the pointwise α will be subject to modification for multiple testing, hence will depend on the density of loci being typed at the second stage.

For affected relative pair designs, Elston et al. (1996) described how the optimal two-stage design depends on the type of pairs studied, the effect of the disease locus measured by λ, the relative cost of recruiting affected persons to typing markers, the degree of genetic heterogeneity, and the informativeness of the markers (we briefly discuss the last two factors later). Asymptotically, the optimal design does not depend separately on either the desired final significance level or power, but rather on a function of the two. As expected, the cost of the study decreases as the effect of the gene increases, and increases as the relative cost of recruiting a subject increases. However, as both the effect of the disease locus and the relative cost of recruiting a subject increase, the optimal number of initial markers increases and the optimal number of pairs decreases. Assuming the same cost ratio R for relative pairs of different types, for large values of λ the cost of the optimal two-stage design increases in the sequence grandparent-grandchild < half-sib < avuncular < full sib. First-cousin pairs can lead to the most expensive design for small values of λ but, for large values of λ, they can lead to a design that is more expensive than only grandparent-grandchild and half-sib pairs. Compared to a one-stage procedure, a two-stage procedure typically halves the cost of a study.
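The optimization just described can be prototyped numerically. The sketch below is hedged: delta_at is an invented stand-in for the true dependence of δ on marker spacing (in reality it follows from λ and the relative-pair type; Risch, 1990b), the Kosambi map function is used because it allows some interference, and all numeric inputs are illustrative:

```python
import math

def norm_fractile(p):
    """p fractile of the standard normal, via bisection on erf."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def kosambi_theta(x):
    """Map distance x (morgans) to recombination fraction; Kosambi allows interference."""
    return 0.5 * math.tanh(2.0 * x)

def delta_at(theta, delta0=0.15):
    """Invented stand-in for delta(theta); the true form depends on lambda
    and the relative-pair type (Risch, 1990b)."""
    return delta0 * (1.0 - 2.0 * theta) ** 2

def two_stage_cost(m, k, R, M, p0, alpha, power, d):
    """Expected cost (in units of C1) of a two-stage design, Equation (27.2)."""
    # Second-stage spacing fixes delta, hence n via Equation (27.1).
    delta = delta_at(kosambi_theta(M / (2.0 * m * (2 * k + 1))))
    z_a, z_b = norm_fractile(alpha), norm_fractile(power)
    n = math.ceil((z_b - z_a) ** 2 * (p0 - delta) * (1 - p0 + delta) / delta ** 2)
    # First-stage spacing fixes delta*, hence the first-stage level alpha*.
    ds = delta_at(kosambi_theta(M / (2.0 * m)))
    z_as = z_b - math.sqrt(n) * ds / math.sqrt((p0 - ds) * (1 - p0 + ds))
    alpha_star = 0.5 * (1.0 + math.erf(z_as / math.sqrt(2.0)))
    # power = 1 - beta, so power * d is the (1 - beta)d term of Equation (27.2).
    cost = 2.0 * n * (R + m + 2 * k * (alpha_star * m + power * d))
    return cost, n, alpha_star

# Crude grid search over m and k (M = 33 morgans for the autosomal genome).
best = min((two_stage_cost(m, k, R=100, M=33.0, p0=0.25,
                           alpha=0.0001, power=0.80, d=1)[0], m, k)
           for m in range(50, 501, 25) for k in range(1, 6))
```

For each candidate (m, k), the second-stage spacing determines n and the first-stage spacing determines α*; the grid search then picks the cheapest combination, mirroring the minimization of Equation (27.2).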
A. Using the Mean Statistic for Affected Sibpairs

The statistic first used for testing linkage in a two-stage global search design (Elston, 1992, 1994; Elston et al., 1996) was based on the proportion of pairs sharing no marker alleles IBD. However, it has been established that for full sibpairs the mean statistic, equivalent to the proportion of alleles shared IBD, often has the greatest power (Blackwelder and Elston, 1985; Schaid and Nick, 1990; Knapp et al., 1994). Guo and Elston (2000a) studied two-stage global search designs in the case of affected full-sib pairs by using the mean statistic to test for linkage. They showed that when dominant genetic variance is present, using the mean statistic is usually more cost-efficient than using the proportion of pairs sharing no marker alleles IBD; when there is no dominant genetic variance, the mean statistic also leads to a better design provided that the relative risk ratio for first-degree relatives is small.
IV. TWO-STAGE GLOBAL SEARCH DESIGNS USING DISCORDANT RELATIVE PAIRS

Earlier studies (Blackwelder and Elston, 1985; Risch, 1990b) showed that an affected relative pair design is usually more powerful than one based on discordant relative pairs. However, in practice, discordant relative pairs are sometimes easier to sample than affected relative pairs, especially for severe diseases with early mortality. In addition, discordant pairs can be more useful if disease penetrance is relatively high, which might be achieved by selecting relatives who have been exposed to a stimulus that increases disease risk, such as unaffected sibs who smoke in a study of lung cancer. Finally, including discordant pairs provides an appropriate control for the study, which could be used to detect excess IBD sharing of alleles caused by inbreeding (Génin and Clerget-Darpoux, 1996) or by meiotic drive at the marker locus. It is therefore relevant to consider optimal two-stage designs when one is sampling discordant relative pairs, or both affected and discordant relative pairs.

Analogously to affected relative pairs, we define the relative risk ratio for an affected-unaffected type R relative pair as λ_R = (1 − K_R)/(1 − K). Notice that K_R should always be greater than or equal to K and less than 1, so that λ_R is less than or equal to 1 and greater than 0. The mean proportion of marker alleles shared IBD by discordant relative pairs was derived for the various relative types by Guo and Elston (2000b) as a function of the recombination fraction θ between the marker and disease loci, the risk ratio λ_O for discordant parent-offspring pairs, and the risk ratio λ_S for discordant sibpairs, and these are summarized in the middle column of Table 27.1. Also given in Table 27.1 is the expected mean proportion of marker alleles shared IBD for each type of relative pair under the null hypothesis of no linkage (π₀).
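As a quick check on the definition just given, λ_R = (1 − K_R)/(1 − K) always lies in (0, 1] when K ≤ K_R < 1. A minimal sketch (the prevalence and recurrence-risk values are invented for illustration):

```python
def discordant_risk_ratio(k_r, k):
    """lambda_R = (1 - K_R) / (1 - K) for an affected-unaffected type R pair."""
    assert k <= k_r < 1.0, "recurrence risk K_R should satisfy K <= K_R < 1"
    return (1.0 - k_r) / (1.0 - k)

# Invented values: population prevalence K = 0.02, sib recurrence risk K_S = 0.10.
lam_d = discordant_risk_ratio(0.10, 0.02)
```

Note that stronger familial aggregation (larger K_R) pushes λ_R further below 1, just as it pushes the affected-pair ratio λ_R = K_R/K further above 1.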
Similar to affected relative pair designs, Guo and Elston (2000b) found that a two-stage procedure can save 40-55% of the total cost, compared to a one-stage procedure. The cost of the optimal two-stage designs for the five types of relative pairs increases in the order full sib < grandparent-grandchild < half-sib < avuncular < first cousin. Unlike the affected relative pair designs, in which sampling sibpairs is the most expensive design, sibpairs afford the best cost-saving design when discordant relative pairs are used.
V. USING BOTH DISCORDANT AND AFFECTED RELATIVE PAIRS

When we have both affected and discordant relative pairs, the mean proportions of marker alleles shared IBD by affected relative pairs are the same as
those for discordant relative pairs under the null hypothesis of no linkage. Under the alternative hypothesis, affected relative pairs are expected to share a larger proportion of marker alleles IBD, whereas discordant relative pairs are expected to share a smaller proportion of alleles IBD. Let π̂_c and π̂_d be the estimated mean proportions, and π₀ + δ_c and π₀ − δ_d the expected mean proportions of marker alleles shared IBD under the alternative hypothesis, for affected and discordant relative pairs, respectively. Then, letting n_c and n_d be the numbers of concordant affected and discordant pairs, respectively, similar to Gu et al. (1996) we can define a weighted difference in the deviations of the mean proportions of marker alleles shared IBD from π₀ as

Δ = { n_c[(π₀ + δ_c) − π₀] − n_d[(π₀ − δ_d) − π₀] } / n,
which can be expressed as a function of θ, λ_O (the risk ratio for parent-offspring pairs), λ_S (the risk ratio for affected sibpairs), and ρ = n_c/n_d. Under the null hypothesis of no linkage, Δ equals zero. Under the alternative hypothesis, the value of Δ for each type of relative pair is given in the last column of Table 27.1. Linkage can be tested based on the value of Δ. Let
T(Δ) = √(2n) [ (π̂_d − ρπ̂_c)/(ρ + 1) − (1 − ρ)π₀/(1 + ρ) + Δ ] / √(π₀(1 − π₀))    for sibpairs,

T(Δ) = √n [ (π̂_d − ρπ̂_c)/(ρ + 1) − (1 − ρ)π₀/(1 + ρ) + Δ ] / √(π₀(1/2 − π₀))    for unilineal pairs.
For all values of Δ, T(Δ) has mean 0 and variance 1 if the sample size is large enough. The null hypothesis θ = 1/2 corresponds to Δ = 0 and is rejected if T(0) ≤ z_α. Thus n must be chosen to satisfy
n ≥ (z_(1−β) − z_α)² π₀(1 − π₀) / 2Δ²    for sibpairs,

n ≥ (z_(1−β) − z_α)² π₀(1/2 − π₀) / Δ²    for unilineal relative pairs,
where Δ is calculated on the basis of the spacing at the second stage. As before, if we let Δ* be the value of Δ calculated on the basis of the spacing at the first stage,
the significance level at the first stage, α*, must be chosen to satisfy
z_α* = z_(1−β) − √(2n) Δ* / √(π₀(1 − π₀))    for sibpairs,

z_α* = z_(1−β) − √n Δ* / √(π₀(1/2 − π₀))    for unilineal relative pairs.
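The combined-design sample size requirements are straightforward to evaluate numerically. A hedged sketch (π₀ = 1/2 for sibpairs and π₀ = 1/4 for grandparent-grandchild pairs are the standard null mean proportions of alleles shared IBD; the δ_c, δ_d, and ρ values are invented for illustration):

```python
import math

def norm_fractile(p):
    """p fractile of the standard normal, via bisection on erf."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def weighted_delta(delta_c, delta_d, rho):
    """Delta = (rho*delta_c + delta_d)/(rho + 1) for rho = n_c/n_d."""
    return (rho * delta_c + delta_d) / (rho + 1.0)

def combined_pairs_needed(delta_c, delta_d, rho, pi0, alpha, power, sibpairs=True):
    """Smallest n meeting the power requirement for the combined design."""
    D = weighted_delta(delta_c, delta_d, rho)
    z_a, z_b = norm_fractile(alpha), norm_fractile(power)
    if sibpairs:
        n = (z_b - z_a) ** 2 * pi0 * (1.0 - pi0) / (2.0 * D ** 2)
    else:
        n = (z_b - z_a) ** 2 * pi0 * (0.5 - pi0) / D ** 2
    return math.ceil(n)

# Invented departures delta_c = delta_d = 0.05, equal numbers of each pair type (rho = 1).
n_sib = combined_pairs_needed(0.05, 0.05, 1.0, pi0=0.5, alpha=0.0001, power=0.80, sibpairs=True)
n_gp = combined_pairs_needed(0.05, 0.05, 1.0, pi0=0.25, alpha=0.0001, power=0.80, sibpairs=False)
```

The two variance terms agree with the exact IBD distributions: for sibpairs the per-pair variance of the proportion shared is π₀(1 − π₀)/2 = 1/8, and for unilineal pairs it is π₀(1/2 − π₀) = 1/16.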
Guo and Elston (2000b) investigated the optimal two-stage designs when affected and discordant relative pairs are combined. The optimal two-stage designs depend on the population disease prevalence K, the ratio ρ of affected to discordant relative pairs in the sample, and the relative cost ratio R. Among the five types of relative pairs, sibpairs still afford the best cost-saving design when both affected and discordant relative pairs are used, provided ρ ≤ 1. Even when ρ > 1, sibpairs can still be the best design, as is the case when ρ = 3, R = 200, and λ = 1.5. Assuming that the total expected cost consists of only that spent on genotyping and phenotyping the pairs kept in the sample, designs using both affected and discordant relative pairs are usually more expensive. However, if we include the cost of screening the population to obtain the desired (affected/discordant) pairs, the cost will depend on the ratio of affected to discordant relative pairs in the sample. When the proportion of desired pairs in the population is smaller, it becomes harder to recruit the desired pairs, and combining both affected and discordant relative pairs will offer a better cost-saving design (Guo, 1998; Guo and Elston, 2000b).
VI. EFFECT OF GENETIC HETEROGENEITY AND INCOMPLETE MARKER INFORMATION

In reality, genetic heterogeneity is often present and markers are rarely fully informative. Suppose in the final sample a proportion h of the pairs are affected for causes other than segregation at the linked locus, while a proportion 1 − h are affected because of it. Then the expected mean proportion of marker alleles shared IBD by affected relative pairs, under the alternative hypothesis, is effectively reduced from π₀ + δ to π₀ + (1 − h)δ. Thus heterogeneity can be allowed for in the design by substituting (1 − h)δ for δ, and (1 − h)δ* for δ*, respectively. The effect of genetic heterogeneity is that the optimal designs use fewer initial markers but a much larger number of pairs (Elston et al., 1996; Guo and Elston, 2000b).

To measure a marker's informativeness in a model-free linkage analysis via pairs of relatives, Guo and Elston (1999) developed the concept of linkage information content (LIC), which is specific to a particular type of relative pair.
The LIC values of a marker measure the probabilities of its allowing one to determine IBD proportions for each particular type of relative pair. As a simplification, we assume that all the markers are equally informative and that the alleles at each marker are equally frequent. The number of equifrequent alleles that give the same polymorphic information content (PIC) value can be obtained by solving the equation

PIC = (n − 1)²(n + 1) / n³.
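Because PIC = (n − 1)²(n + 1)/n³ increases monotonically with n, the equation can be inverted numerically. A minimal sketch (the function names are ours):

```python
def pic_equifrequent(n):
    """PIC of a marker with n equally frequent alleles: (n-1)^2 (n+1) / n^3."""
    return (n - 1.0) ** 2 * (n + 1.0) / n ** 3

def equivalent_alleles(pic, tol=1e-10):
    """Solve pic = (n-1)^2(n+1)/n^3 for n by bisection (PIC is increasing in n)."""
    lo, hi = 1.0, 1e6
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pic_equifrequent(mid) < pic:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# A PIC of 0.75 (the value assumed in the Discussion) corresponds to
# between 4 and 5 equifrequent alleles.
n_eq = equivalent_alleles(0.75)
```

For example, a diallelic marker with equifrequent alleles has PIC = 1·3/8 = 0.375, while highly polymorphic microsatellites approach PIC = 1.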
The LIC value for type R relative pairs, LIC_R, can then be determined for any marker from its PIC value (Guo, 1998; Guo and Elston, 1999). In general, we could expect that when all the markers have the same information content value LIC_R, the optimal design would be obtained by assuming that the effective sample size is nf₁ and the effective number of markers is mf₂, with f₁f₂ = LIC_R. Elston et al. (1996) studied the effect of different values of f₁ and f₂ for a fixed f₁f₂ = PIC (here we shall use LIC_R) and found that when the relative cost ratio R increases, the expected cost usually decreases with increasing f₁. Even when R = 0, setting f₁ = 1 always resulted in a minimal expected cost within 1% of that obtained by simultaneously minimizing the cost over f₁ and f₂, while keeping f₁f₂ fixed. Therefore, we substitute mLIC_R for m when determining the required number of pairs to approximate the situation of markers that are not fully informative. This approximation results in a slightly larger number of initial markers, and a slightly smaller sample size, than optimal. The effect of reduced marker informativeness is an optimal two-stage design involving more initial markers and more relative pairs, hence one that is more expensive. However, the cost saving compared to a corresponding optimal one-stage design becomes larger (Elston et al., 1996; Guo, 1998; Guo and Elston, 2000a,b).
VII. THE COMPUTER PROGRAM DESPAIR

The computer program DESPAIR (S.A.G.E., 1998), written to find the optimal two-stage global search designs for the case of affected relative pairs, was extended to cover samples of discordant relative pairs, or combinations of both. DESPAIR calculates the values of n (number of pairs), m (initial number of markers), and α* (first-stage significance level) that minimize the expected cost for given values of M (the genome length), R (cost of recruiting a subject relative to the cost of a marker assay), α (final significance level), 1 − β (desired power), d (number of disease loci), and a range of values of 2k (number of flanking markers at the second stage).
VIII. DISCUSSION

Optimal two-stage global search designs typically halve the cost of a study compared to corresponding optimal one-stage designs, no matter what type of sample (affected relative pairs, discordant relative pairs, or both) is used. Assuming that R is no larger than 200 and markers with PIC values no less than 0.75, in the case of complex diseases an optimal initial screen should probably include no more than 200-300 markers, whatever the values of the recurrence risk ratios.

A multiple-stage procedure for multigeneration pedigrees using the affected pedigree member method of analysis was investigated by Brown et al. (1994). Similar to the two-stage strategy we described earlier, more widely spaced markers are typed initially, and regions with evidence of linkage are investigated with more markers at the next stage. For each stage other than the last, a cut point is specified and linkage is evaluated by an affected pedigree member (APM) statistic to determine whether it is above or below the cut point. Brown et al. (1994) used simulated data to evaluate different spacings of the starting grid of markers, different numbers of stages, different cut points, and whether a single marker or two adjacent markers should be required to be above the cut point to determine when an area should be further investigated. Assuming a fully penetrant dominant disease and assuming that the aim is to reach significance at the 5% level, Brown et al. found that the optimal strategy is a four-stage strategy with a 20-cM initial grid and a nonpairwise design (i.e., a region is explored further if only a single marker is above the current cut point, in contrast to a pairwise strategy, which requires that both of two adjacent markers be above the cut point).
Risch (1990c) suggested that an efficient strategy would be first to type the affected relative pairs at an array of polymorphic markers for a preliminary linkage analysis, and then to type the whole families for the loci that initially showed at least suggestive evidence of linkage. However, in a later paper, Risch (1992) declared that the gain in information from typing additional family members is not as great as first reported. Hauser et al. (1996), who used an interval mapping method, concluded that an efficient design is to type only affected sibpairs at markers spaced 10-20 cM apart and to use a lod score criterion of 1 to identify positive regions to be further studied by typing additional markers on additional families and family members. Holmans and Craddock (1997) studied this strategy in more detail and concluded that whereas it is most efficient not to type parents in the screening stage if Hardy-Weinberg equilibrium holds, when Hardy-Weinberg equilibrium does not hold, failure to type parents at the first stage increases the amount of genotyping required. Another strategy investigated by Holmans and Craddock (1997) is sample splitting. They considered a procedure in which part of the sample is
Guo and Elston
typed in a screening stage, and the remaining part of the sample is used to replicate a result suggested in the first part. Their approach involves the use of widely spaced markers on the first part of the sample, followed by genotyping the whole sample with more narrowly spaced markers around loci that were significant at the first stage. By simulation, using a fixed sample of 200 pairs of affected siblings with parents available for genotyping, they found that under a variety of plausible disease susceptibility models for complex diseases, both sample splitting and grid tightening (i.e., using more narrowly spaced markers) are important in increasing the efficiency of a genome scan. In particular, typing half the sample of affected pairs with a coarse grid of markers in the screening stage proved to be an efficient strategy under a variety of conditions, demonstrating that sample splitting, in addition to grid tightening, is useful for increasing the efficiency of a genome scan study. In principle, their first stage could be replaced by a grid tightening two-stage procedure of the type just described. In conclusion, typing markers in two or more stages when one is conducting a whole-genome linkage analysis, whether or not it is combined with a degree of replication, can lead to considerable economies. Although we have discussed these strategies specifically for a binary disease trait in samples of relative pairs, similar results can be expected in studies of quantitative traits and/or larger pedigrees. Specific results for a quantitative trait measured on a sample of relative pairs could be obtained by using the extended definition of the relative risk ratio (λ) proposed by Gu and Rao (1997a,b).
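The economics of these staged designs can be illustrated with a deliberately simple genotyping cost model. This is a sketch only: the map length, grid spacings, follow-up rule, and cut point below are illustrative assumptions, not the optimization frameworks of the papers cited.

```python
# Toy genotyping cost model for one- vs. two-stage genome screens.
# Assumed, illustrative numbers: ~3300 cM map, 200 affected sib pairs.
GENOME_CM = 3300
N_PAIRS = 200

def one_stage_cost(spacing_cm):
    """Genotypes needed to screen the whole genome once at a given density."""
    markers = GENOME_CM / spacing_cm
    return N_PAIRS * markers

def two_stage_cost(stage1_cm, stage2_cm, alpha1, n_true_regions=1):
    """Coarse screen on all pairs, then densify every region whose stage 1
    statistic exceeds the cut point (true signals plus alpha1 false hits)."""
    m1 = GENOME_CM / stage1_cm
    followed = n_true_regions + alpha1 * m1   # expected regions pursued
    extra = stage1_cm / stage2_cm - 1         # added markers per followed region
    return N_PAIRS * m1 + N_PAIRS * followed * extra

print(one_stage_cost(10))                  # 66000.0 genotypes
print(two_stage_cost(20, 5, alpha1=0.05))  # 38550.0 genotypes
```

Under these assumed numbers the two-stage screen uses roughly 40% fewer genotypes than the one-stage screen, in line with the cost savings described above.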
Acknowledgments

This work was supported in part by U.S. Public Health Service resource grant 1 P41 RR03655 from the National Center for Research Resources and research grant GM28356 from the National Institute of General Medical Sciences.
References

Blackwelder, W. C., and Elston, R. C. (1985). A comparison of sib-pair linkage tests for disease susceptibility loci. Genet. Epidemiol. 2, 85-97.
Brown, D. L., Gorin, M. B., and Weeks, D. E. (1994). Efficient strategies for genomic searching using the affected-pedigree-member method of linkage analysis. Am. J. Hum. Genet. 54, 544-552.
Elston, R. C. (1992). Designs for the global search of the human genome by linkage analysis. In "Proceedings of the XVIth International Biometric Conference, Hamilton, New Zealand, December 7-11, 1992," pp. 39-51.
Elston, R. C. (1994). P values, power, and pitfalls in the linkage analysis of psychiatric disorders. In "Genetic Approaches to Mental Disorders: Proceedings of the Annual Meeting of the American Psychopathological Association" (E. S. Gershon and C. R. Cloninger, eds.), pp. 3-21. American Psychiatric Press, Washington, DC.
27. Genome Scans: One- versus Two-Stage Strategies
Elston, R. C., Guo, X., and Williams, L. V. (1996). Two-stage global search designs for linkage analysis using pairs of affected relatives. Genet. Epidemiol. 13, 535-558.
Génin, E., and Clerget-Darpoux, F. (1996). Consanguinity and the sib-pair method: An approach using identity by descent between and within individuals. Am. J. Hum. Genet. 59, 1149-1162.
Gu, C., and Rao, D. C. (1997a). A linkage strategy for detection of human quantitative-trait loci. I. Generalized relative risk ratios and power of sib pairs with extreme trait values. Am. J. Hum. Genet. 61, 200-210.
Gu, C., and Rao, D. C. (1997b). A linkage strategy for detection of human quantitative-trait loci. II. Optimization of study designs based on extreme sib pairs and generalized relative risk ratios. Am. J. Hum. Genet. 61, 211-222.
Gu, C., Todorov, A. A., and Rao, D. C. (1996). Combining extremely concordant sibpairs with extremely discordant sibpairs provides a cost-effective way to linkage analysis of QTL. Genet. Epidemiol. 13, 513-533.
Guo, X. (1998). Designs of model-free linkage studies for qualitative and quantitative traits. Ph.D. dissertation, Case Western Reserve University, Cleveland, OH.
Guo, X., and Elston, R. C. (1999). Linkage information content of polymorphic genetic markers. Hum. Hered. 49, 112-118.
Guo, X., and Elston, R. C. (2000a). Two-stage global search designs for linkage analysis. I. Use of the mean statistic for affected sib pairs. Genet. Epidemiol. 18, 97-110.
Guo, X., and Elston, R. C. (2000b). Two-stage global search designs for linkage analysis. II. Including discordant relative pairs in the study. Genet. Epidemiol. 18, 111-127.
Hauser, E. R., Boehnke, M., Guo, S. W., and Risch, N. (1996). Affected-sib-pair interval mapping and exclusion for complex genetic traits: Sampling considerations. Genet. Epidemiol. 13, 117-137.
Holmans, P., and Craddock, N. (1997). Efficient strategies for genome scanning using maximum-likelihood affected-sib-pair analysis. Am. J. Hum. Genet. 60, 657-666.
Knapp, M., Seuchter, S. A., and Baur, M. P. (1994). Linkage analysis in nuclear families. I. Optimality criteria for affected sib-pair tests. Hum. Hered. 44, 37-43.
Risch, N. (1990a). Linkage strategies for genetically complex traits. I. Multilocus models. Am. J. Hum. Genet. 46, 222-228.
Risch, N. (1990b). Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am. J. Hum. Genet. 46, 229-241.
Risch, N. (1990c). Linkage strategies for genetically complex traits. III. The effect of marker polymorphism on analysis of affected relative pairs. Am. J. Hum. Genet. 46, 242-253.
Risch, N. (1992). Corrections to linkage strategies for genetically complex traits. III. The effect of marker polymorphism on analysis of affected relative pairs [Risch, 1990c]. Am. J. Hum. Genet. 51, 673-675.
S.A.G.E. (1998). Statistical Analysis for Genetic Epidemiology, Release 3.1. Computer program package available from the Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH.
Schaid, D. J., and Nick, T. G. (1990). Sib-pair linkage tests for disease susceptibility loci: Common tests vs. the asymptotically most powerful test. Genet. Epidemiol. 7, 359-370.
28. Significance Levels in Genome Scans

Glenys Thomson
Department of Integrative Biology, University of California, Berkeley, California 94720
I. Summary
II. Introduction
III. Approaches for Complex Diseases
IV. Genome-wide Linkage Scans in Complex Diseases
V. Genome-wide Association Scans in Complex Diseases
VI. Discussion
References
I. SUMMARY

Genome-wide linkage scans using affected sibpair families are being conducted on many complex diseases, such as type 1 and type 2 diabetes, multiple sclerosis, rheumatoid arthritis, schizophrenia, asthma, cardiovascular diseases, obesity, and alcoholism. Despite extensive efforts by many groups, progress has been exceedingly slow, and only a few genes and some genomic regions involved in complex diseases have been identified. The general picture is one of difficulty both in locating disease genes and in replicating reported linkages. This reflects the fact that complex diseases and traits may result principally from genetic variation that is relatively common in the general population and that involves a large number of genes, environmental factors, and their interactions. Genome-wide association studies are now feasible through the
use of PCR methodologies with pooled DNA samples and microsatellite variation and, more recently, single-nucleotide polymorphism (SNP) variation. Issues relating to significance levels in genome-wide linkage and association scans are discussed, and suggestions for dealing with false positive (type I) errors are proposed.
II. INTRODUCTION

Mendelian diseases are those caused by defects in a single major gene or biochemical pathway. Complicating factors, such as incomplete penetrance and variable age of onset, are often present. Nevertheless these diseases show basic Mendelian single-gene segregation patterns (Gelehrter et al., 1998). In contrast, complex or multifactorial diseases result from the interaction of environmental factors and multiple genes, some of which might have a major effect, although for many, the effect is relatively minor. While the underlying genes in complex diseases still follow the rules of Mendelian inheritance, the overall pattern of inheritance is not simple. Reference is made throughout to Mendelian or complex diseases, with the understanding that the statements also apply to nondisease traits. Although the boundary between Mendelian and complex traits is not precisely defined, a large number of diseases clearly fall into each category (Thomson and Esposito, 1999). It is also important to remember at this time that although we often use the shorthand notation of "disease gene" or "disease-predisposing gene" for both Mendelian and complex diseases, we are discussing variation in genes that are involved in normal human health and development, specific forms of which can lead to a disease state. Marker genes are usually the necessary starting point for mapping and elucidating the genetic components of diseases when the biochemical nature of the disease is unknown and candidate genes are not readily available. The development of highly informative microsatellite markers across the genome has greatly facilitated the localization of disease loci in Mendelian and complex diseases via both linkage and association (linkage disequilibrium) studies (see Chapter 5 by Borecki and Suarez for a detailed discussion of linkage and association). These approaches can be applied without prior knowledge of the biological basis of the disease by using genome-wide scans.
The aim is first to identify the genetic regions within which one or more disease-predisposing genes lie and, once these have been found, to localize the genes and determine their functional and biological role in the disease (positional cloning) (Gelehrter et al., 1998).
III. APPROACHES FOR COMPLEX DISEASES

A. Linkage studies

For linkage analyses, testing of about 300-400 highly polymorphic markers, usually microsatellites, distributed approximately evenly over the genome (average spacing between markers on the order of 10 cM, i.e., 10% recombination) is the usual practice in an initial genomic linkage scan. When large multigeneration pedigrees are available, lod score linkage analysis is a powerful technique for localizing disease genes. It has been successfully applied to a number of Mendelian traits (e.g., Huntington disease) and also to subsets of complex traits that show simple Mendelian inheritance (for example, early-onset Alzheimer's disease and the familial breast/ovarian cancer and breast cancer genes BRCA1 and BRCA2); with diabetes, it has been used to map MODY (maturity onset diabetes of the young) genes. For Mendelian traits, the standard of a lod score of 3 as evidence of linkage corresponds to a 5% probability of observing at least one false hit above this threshold in a genome scan (see Chapter 29 by Rao and Gu). For complex diseases, the involvement of many genes and the strong influence of environmental factors mean that large multigeneration pedigrees are rarely, if ever, seen. Linkage analysis of nuclear families with both parents and two children affected with the disease, although less powerful, is therefore more commonly used to map complex traits (Thomson and Esposito, 1999). Linkage of the major histocompatibility complex of humans, the HLA region on chromosome 6p21, to type 1 diabetes (until recently called insulin-dependent diabetes mellitus, IDDM) was first demonstrated by Cudworth and Woodrow (1975) using 15 affected sibpairs (p < 0.001), and this linkage, termed IDDM1, has been confirmed in many studies.
Observed sharing of 2, 1, and 0 parental alleles identical by descent (IBD) in a cumulative study of 711 affected sibpairs was 52, 40, and 8%, respectively (Payami et al., 1985): the mean sharing of parental alleles of 72% is high compared with the 50% average sharing expected between siblings if this region were not involved in disease (p < 10⁻⁵). For some other HLA-associated diseases (e.g., multiple sclerosis, rheumatoid arthritis), the initial number of affected sibpairs required to show evidence of linkage has been larger: at least 50, and sometimes around 100.
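The figures just quoted can be checked with a few lines of arithmetic. This is a rough sketch: it treats the 2 × 711 parental alleles as independent Bernoulli trials, which they are not exactly, so the p-value is indicative only.

```python
from math import sqrt, erfc

# IBD sharing at HLA in 711 affected sib pairs (Payami et al., 1985):
# 52%, 40%, and 8% of pairs shared 2, 1, and 0 parental alleles IBD.
n_pairs = 711
p2, p1 = 0.52, 0.40
mean_sharing = (2 * p2 + p1) / 2   # proportion of parental alleles shared
print(mean_sharing)                # 0.72

# Normal-approximation test against the 50% sharing expected under no
# linkage, treating 2*711 alleles as independent trials (approximate).
n_alleles = 2 * n_pairs
z = (mean_sharing - 0.5) / sqrt(0.25 / n_alleles)
p_value = 0.5 * erfc(z / sqrt(2))  # one-sided p-value
print(p_value < 1e-5)              # True: consistent with p < 10^-5
```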
B. Association studies

For Mendelian diseases, association mapping has often been used to sublocalize the disease-predisposing gene following initial localization of chromosomal
regions by linkage analysis. In these studies linkage disequilibrium extended on average over 500 kb (0.5 cM) (Jorde et al., 1994). There are a few exceptions to the use of association data in localizing disease genes; the breast cancer gene BRCA1, for example, could not be localized by means of linkage disequilibrium mapping because a different mutation was implicated in each family. Most disease association studies have involved candidate gene analyses. Association studies have been most successfully applied in mapping over 100 complex diseases to the HLA region of humans (see Thorsby, 1997). The HLA-associated diseases include type 1 diabetes, multiple sclerosis, and rheumatoid arthritis, mentioned earlier, as well as Crohn's disease, celiac disease, hemochromatosis, narcolepsy, AIDS, malaria, tuberculosis, and Hodgkin disease. The associations are often very strong. For example, over 90% of patients with ankylosing spondylitis carry a specific HLA variant, B27, compared with 9% of control subjects. In some cases, the HLA immune response genes have been directly implicated in disease (e.g., ankylosing spondylitis, type 1 diabetes, and narcolepsy), whereas with hemochromatosis the association was due to linkage disequilibrium. Association studies in the early 1970s distinguished type 1 diabetes (HLA associated) from type 2 diabetes (not HLA associated). In a small subset of type 2 diabetes cases, association studies have identified the direct role of the insulin gene, insulin gene receptor, glucokinase gene, and mitochondrial genes in disease (see Thomson, 1997, for references).
IV. GENOME-WIDE LINKAGE SCANS IN COMPLEX DISEASES

A. The application of genome-wide linkage scans

The existence of non-HLA genes in many of the HLA-associated diseases was established from theoretical considerations involving population prevalence, risks to relatives, and HLA IBD values in affected sibpairs (Risch, 1987). The overall contribution of HLA to the genetic component of type 1 diabetes is high, between 40 and 50% (Risch, 1987; Mein et al., 1998; Concannon et al., 1998). Given the relative ease with which linkage was demonstrated for many HLA-associated diseases, it seemed a logical progression to use genome-wide linkage scans on affected sibpair families to investigate all complex diseases. Such studies of many complex disorders are in progress: to map the non-HLA genes in a number of diseases such as type 1 diabetes, multiple sclerosis, rheumatoid arthritis, celiac disease, and Crohn's disease; and for many other complex
diseases (e.g., type 2 diabetes, hypertension, coronary artery disease, alcoholism, schizophrenia).
B. Significance levels in genome-wide linkage scans in complex diseases

Type 1 diabetes was the first complex disease for which genomic scans for linkage in affected sibpairs were carried out (reviewed in Pugliese, 1999). Apart from HLA, no evidence of major gene effects was observed. The question thus arose of what to classify as "reasonable" evidence that a disease-predisposing gene may reside in a particular region. In an editorial accompanying the Field et al. (1994) paper, Thomson (1994) proposed a pointwise significance level of 0.001, based on intuitive reasoning, for reporting what was called "putative" linkage. No specific statement linking this to genome-wide significance levels was made, except to say that "some of these preliminary linkages will turn out to be false (type 1 errors) although how many in the case of IDDM remains to be seen." The concern was to have a reasonable balance between false positives and false negatives, while regarding false negatives as the more serious concern (for further discussion see Todorov and Rao, 1997; Rao, 1998; and Chapter 5 by Borecki and Suarez and Chapter 29 by Rao and Gu). The issue of false positives (type 1 error) was addressed theoretically by Lander and Kruglyak (1995). They showed that, on average, with analysis of affected sibpair families there will be one false positive per genome scan at a pointwise nominal significance level of α = 0.00074, corresponding to a maximum lod score (MLS) of 2.2. The authors termed this "suggestive" linkage. When rounded to the more familiar number of 0.001, this corresponds to the "putative" linkage criterion of Thomson (1994). Lander and Kruglyak (1995) found that α = 0.000022 (MLS = 3.6) corresponded to a 5% chance of a false positive per genome scan. They termed this "significant" evidence of linkage and called α = 0.0000003 (MLS = 5.4, genome-wide significance level of 0.1%) "highly significant" evidence of linkage.
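The correspondence between MLS values and pointwise significance levels quoted here follows from the standard asymptotic result that 2 ln(10) times the lod score is distributed as chi-square with one degree of freedom, halved for the one-sided alternative. A short sketch, assuming that asymptotic result holds:

```python
from math import log, sqrt, erfc

def lod_to_pointwise_p(lod):
    """One-sided pointwise p-value for a maximum lod score (MLS),
    using the asymptotic chi-square(1) distribution of 2*ln(10)*lod."""
    chi2 = 2 * log(10) * lod
    # P(chi-square_1 > x) = erfc(sqrt(x/2)); halved for the one-sided test
    return 0.5 * erfc(sqrt(chi2 / 2))

for lod in (2.2, 3.6, 5.4):
    print(lod, lod_to_pointwise_p(lod))
```

Running this reproduces the thresholds in the text: MLS 2.2 gives approximately 0.00073, MLS 3.6 approximately 0.000023, and MLS 5.4 approximately 0.0000003.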
For a "confirmed" linkage, Lander and Kruglyak (1995) proposed significant linkage from one or a combination of studies that has been confirmed at a pointwise significance level of 0.01. In a simulation study of the multiple sclerosis affected sibpair families and microsatellite markers used in an actual scan (Sawcer et al., 1996), Sawcer et al. (1997) found that a lower MLS value, namely 3.2, corresponded to a genome-wide significance level of 5%. The theoretical results of Lander and Kruglyak (1995) appear to differ from these simulation results because they were based on the assumption of an infinitely dense map of markers (Sawcer et al., 1997). Based on sparse map assumptions, Rao and Gu (Chapter
29 in this volume) report that a lod score of 3.05 corresponds to a genome-wide significance level of 5%.
C. Observations from genome-wide linkage scans in complex diseases

Many other studies using genome-wide scans to detect non-HLA type 1 diabetes genes have followed since the 1994 papers (reviewed in Pugliese, 1999). Apart from HLA, no evidence of major gene effects has been found. In approximately 500 affected sibpairs, linkages for the chromosomal regions IDDM4 (11q13), IDDM5 (6q25), and IDDM8 (6q27) were confirmed (Luo et al., 1996) by means of the criteria of Lander and Kruglyak (1995). The mean IBD sharing values in these cases were much closer to the 50% expected randomly than that seen with IDDM1: 0.58, 0.58, and 0.60, respectively, for IDDM4, IDDM5, and IDDM8. IDDM6 (18q12-q21), IDDM10 (10p11-q11), and IDDM12 (2q33) were more recently considered to be confirmed, on the basis of results from a combination of affected sibpair and association/linkage tests (reviewed in Pugliese, 1999). Seven additional chromosomal regions have shown evidence of linkage to type 1 diabetes in one or more studies (reviewed in Pugliese, 1999; see also Mein et al., 1998; Concannon et al., 1998). A common feature of all complex disease studies has been difficulty in both detecting and replicating linkages, with considerable heterogeneity seen between data sets both within and between populations and ethnic groups. Success with this approach has been greatly hampered by the potential involvement in disease of relatively common alleles of a large number of loci, each with a relatively small overall effect, requiring study of many hundreds, and usually thousands, of affected sibpairs to establish linkage (Suarez et al., 1994; Mein et al., 1998; Risch and Merikangas, 1996). The involvement in complex diseases of subsets due to relatively rare mutations also remains a possibility. Two second-generation type 1 diabetes genome scans continue to illustrate this heterogeneity despite large sample sizes (Concannon et al., 1998; Mein et al., 1998).
Even more surprisingly, little or no support was found for most reported IDDM loci in the study of Concannon et al. (1998).
V. GENOME-WIDE ASSOCIATION SCANS IN COMPLEX DISEASES

A. Association (linkage disequilibrium) mapping

Association mapping may in many cases be more efficient than linkage analysis in detecting genetic regions involved in disease (Risch and Merikangas, 1996). IDDM2, which was easily detected by association, is difficult to detect by linkage analysis (see Thomson, 1997, for references; see also Pugliese, 1999). The potential power of association studies to detect disease genes depends on several unknown
parameters and cannot be determined accurately, especially given that common genetic variants may often be involved in complex diseases (Clark et al., 1998). The overall utility of association mapping also depends on the level of linkage disequilibrium seen across the genome. General population-level observations are that linkage disequilibrium is roughly proportional to the inverse of the recombination distance (Huttley et al., 1999). This rule breaks down in very closely linked regions (Jorde et al., 1994; Huttley et al., 1999), and linkage disequilibrium there may be less useful in very fine mapping of a region. Linkage disequilibrium is also nonrandomly distributed throughout the genome (Huttley et al., 1999). Some regions, such as HLA, show strong evidence of selection and significant linkage disequilibrium, which may span 3 cM or more, and an additional 10 genetic regions showing linkage disequilibrium equal to or greater than that in the HLA region have been identified (Huttley et al., 1999). The power of genome association scans will thus vary across the genome, as well as across populations. At least 12,000 highly polymorphic, evenly spaced markers, giving an average distance between markers of 0.25 cM (250 kb), are required for an initial disease genomic association scan. Association scans thus require over an order of magnitude more markers than the 300-400 markers at 10 cM spacing typically used for linkage genome scans. The use of pooled samples of DNA for the study of restriction fragment length polymorphism (RFLP), microsatellite (Barcellos et al., 1997a), and SNP variation (Germer et al., 2000), as well as the current development of DNA chip technology for the study of SNPs (Collins et al., 1998), has opened the way for future routine and extensive use of association mapping in the study of complex diseases. Genome-wide association scans in complex diseases are starting to be used, currently with DNA pooling and microsatellites.
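The marker counts above follow from simple arithmetic, assuming a genetic map of roughly 3000 cM (an illustrative round number):

```python
# Marker counts implied by the spacings quoted above, assuming a
# genetic map of roughly 3000 cM.
GENOME_CM = 3000

linkage_markers = GENOME_CM / 10        # ~10 cM grid for a linkage scan
association_markers = GENOME_CM / 0.25  # ~0.25 cM (250 kb) grid for association

print(int(linkage_markers))                   # 300
print(int(association_markers))               # 12000
print(association_markers / linkage_markers)  # 40.0-fold more markers
```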
The full development of SNPs will eventually permit routine typing for variation in every human gene and its regulatory region, the ultimate association study (Collins et al., 1998). The recent emphasis on use of only family-based association/linkage tests has ignored the readily available resource of case-control data (Morton and Collins, 1998), where new techniques make feasible the sampling and study of many thousands of samples. Provided the patient and control groups are carefully matched for ethnicity, population stratification effects creating spurious associations are eliminated. The large collections of multiplex families now available for linkage studies in many complex diseases are obviously another valuable resource for association scans.
B. Significance levels in genome-wide association scans in complex diseases

Many researchers feel that genome-wide association scans are unmanageable because of the number of false positives inevitably produced. A preliminary
discussion of power and a multistage strategy to reduce false positives in genome-wide association studies are outlined in Barcellos et al. (1997a). As with linkage studies (see Rao, 1998), power should be high in an initial scan, to reduce missed associations (false negatives); sample sizes of 500-1000 are recommended (Barcellos et al., 1997a; Long and Langley, 1999). Long and Langley (1999) have demonstrated that greater power is achieved by increasing the sample size than by increasing the number of markers.
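The effect of sample size on power can be sketched with a two-proportion normal approximation. The allele frequencies of 0.35 in cases versus 0.25 in controls and the Bonferroni-style threshold for a 12,000-marker scan are illustrative assumptions, not Long and Langley's simulation design:

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def assoc_power(n, p_case=0.35, p_ctrl=0.25, alpha=0.05 / 12000):
    """Approximate power of a one-sided two-proportion test comparing
    allele frequencies in n cases versus n controls, at a Bonferroni-style
    pointwise alpha for a 12,000-marker scan (illustrative assumptions)."""
    alleles = 2 * n  # each individual contributes two alleles
    se = sqrt(p_case * (1 - p_case) / alleles + p_ctrl * (1 - p_ctrl) / alleles)
    z_crit = Z.inv_cdf(1 - alpha)
    return Z.cdf((p_case - p_ctrl) / se - z_crit)

for n in (200, 500, 1000):
    print(n, round(assoc_power(n), 2))
```

Under these assumptions power rises steeply with sample size, from under 10% at 200 cases to well over 90% at 1000, consistent with the recommendation to use large initial samples.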
C. Observations from association studies in complex diseases

The potential of association studies for follow-up of regions showing preliminary evidence of linkage for complex diseases has been demonstrated (Mein et al., 1998) and awaits full utilization. As with linkage studies, heterogeneity in association results (based on case-control and nuclear family investigations) is also seen among studies, especially among populations (Mein et al., 1998). In their simulation study, Long and Langley (1999) concluded that association studies have low repeatability unless sample sizes are on the order of 500 individuals. There is debate with regard to the best type of population to use for association mapping. Lander and Schork (1994) state that the ideal population will be isolated, will have a narrow base, and will be sampled not too many generations from the time at which a disease-causing mutation occurred. The Finnish and Costa Rican populations are considered ideal, because they are relatively homogeneous and show linkage disequilibrium over a wider recombination distance than other populations. However, linkage disequilibrium is routinely seen for closely linked loci and around disease genes in all populations.
VI. DISCUSSION

Despite initial anxieties that genome-wide linkage scans would be flooded with false positives (Lander and Kruglyak, 1995), this has not been the case. An excess of false linkages has never been a problem with complex diseases. Instead we are scrambling to find any evidence of disease-predisposing genes. The increasing availability of more markers across the genome, combined with multipoint analyses that use a number of closely linked markers, will increase the power of linkage and association studies (Concannon et al., 1998; Lernmark and Ott, 1998; Barcellos et al., 1997b; Morton and Collins, 1998; Valdes et al., 1999a). In an insightful review of the many factors that need to be considered in the design of genome-wide scans and the interpretation of linkage results, Rao (1998) pointed out that application of stringent criteria drastically reduces power, and thus many disease genes remain undetected. Rao's recommendation
was that "we tolerate/accept, on average, one false positive per individual scan." He also emphasized that power should be high in an initial linkage genome-wide scan, to reduce missed linkages (false negatives). The same argument applies to genome-wide association scans (Long and Langley, 1999). In the context of genome-wide significance levels and false positives, the question also arises of whether conditional linkage analyses, stratified by linkage results from established loci such as the confirmed IDDM loci, should be carried out on all genome scan data (Mein et al., 1998), or whether conditional analyses should be performed only after linkage has been established for a specific region (Concannon et al., 1998; Lernmark and Ott, 1998). The former approach is strongly advocated here, given the intrinsically heterogeneous nature of complex disease genetics. Even though type 1 error will be increased (Todorov and Rao, 1997; Lernmark and Ott, 1998), stratification approaches have proven their worth as an aid in identifying linkages (Mein et al., 1998; Pugliese, 1999). The use of a number of different disease phenotype definitions in linkage studies is similarly debated. Success stories based on study of a range of disease definitions (Valdes et al., 1999a), subsets of the disease phenotype (Gibbs et al., 1999), sex effects (Paterson and Petronis, 1999), and age-of-onset effects (Day et al., 1999; Valdes et al., 1999b) support these approaches. It is clear that exceedingly large sample sizes of case-control, simplex, and multiplex family-based data, as well as multigeneration pedigrees, when available, are needed for our continuing studies of complex diseases. National and international cooperative efforts for sharing data are mandatory to achieve the large sample sizes required, which will also allow for appropriate stratified analyses taking account of diagnostic and genetic heterogeneity, age-of-onset effects, and so on.
It is important to remember, however, that the genes involved in multigeneration pedigrees may often be different from those in affected sibpair and/or "sporadic" cases of disease. Our studies must also include many different populations, including extensive ethnic variation, and must not be restricted to relatively homogeneous populations. Further study of population-level data is obligatory (Schork et al., 1998; Clark et al., 1998), including development of methods to understand the evolutionary history of a region (Clark et al., 1998; Grote et al., 1998; Huttley et al., 1999). Only then can the power of different linkage and association methods be completely assessed, and the processes understood by which disease-predisposing variants become established, and often frequent, in populations. The choice of significance levels may be a moot point in light of the considerations presented here. All studies of complex diseases should now be seen either as exploratory data analyses, without correction for multiple comparisons, or as confirmatory studies, as appropriate. Meta-analyses across data sets for linkage and association genome scans are necessary and can be powerful in
identifying genetic regions involved in a disease (Gu et al., 1998; Wise et al., 1999). The study of genetic effects common to multiple diseases will also be of increasing interest (Becker et al., 1998; Wise et al., 1999). All data from linkage and association genome-wide scans, and from follow-up studies of particular regions, should be made available to all researchers (e.g., on the World Wide Web). Only in this way will we identify all the genetic and environmental factors involved in complex diseases, and their interactions.
Acknowledgment

This work was supported by grant GM56688 from the National Institutes of Health.
References

Barcellos, L. F., Klitz, W., Field, L. L., Tobias, R., Bowcock, A. M., Wilson, R., Nelson, M. P., Nagatomi, J., and Thomson, G. (1997a). Association mapping of disease loci, by use of a pooled DNA genomic screen. Am. J. Hum. Genet. 61, 734-747.
Barcellos, L. F., Thomson, G., Carrington, M., Schafer, J., Begovich, A. B., Lin, P., Xu, X. H., Min, B. Q., Marti, D., and Klitz, W. (1997b). Chromosome 19 single-locus and multilocus haplotype associations with multiple sclerosis: Evidence of a new susceptibility locus in Caucasian and Chinese patients. J. Am. Med. Assoc. 278, 1256-1261.
Becker, K. G., Simon, R. M., Bailey-Wilson, J. E., Freidlin, B., Biddison, W. E., McFarland, H. F., and Trent, J. M. (1998). Clustering of non-major histocompatibility complex susceptibility candidate loci in human autoimmune diseases. Proc. Natl. Acad. Sci. USA 95, 9979-9984.
Clark, A. G., Weiss, K. M., Nickerson, D. A., Taylor, S. L., Buchanan, A., Stengard, J., Salomaa, V., Vartiainen, E., Perola, M., Boerwinkle, E., and Sing, C. F. (1998). Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Hum. Genet. 63, 595-612.
Collins, F. S., Brooks, L. D., and Chakravarti, A. (1998). A DNA polymorphism discovery resource for research on human genetic variation. Genome Res. 8, 1229-1233.
Concannon, P., Gogolin-Ewens, K. J., Hinds, D. A., Wapelhorst, B., Morrison, V. A., Stirling, B., Mitra, M., Farmer, J., Williams, S. R., Cox, N. J., Bell, G. I., Risch, N., and Spielman, R. S. (1998). A second-generation screen of the human genome for susceptibility to insulin-dependent diabetes mellitus. Nat. Genet. 19, 292-296.
Cudworth, A. G., and Woodrow, J. C. (1975). Evidence for HLA-linked genes in "juvenile" diabetes mellitus. Br. Med. J. 3, 133-135.
Day, E. W., Heath, S. C., and Wijsman, E. M. (1999). Multipoint oligogenic analysis of age-at-onset data with applications to Alzheimer disease pedigrees. Am. J. Hum. Genet.
64, 839-851.
Field, L. L., Tobias, R., and Magnus, T. (1994). A locus on chromosome 15q26 (IDDM3) produces susceptibility to insulin-dependent diabetes mellitus. Nat. Genet. 8, 189-194.
Gelehrter, T. D., Collins, F. S., and Ginsberg, D. (1998). "Principles of Medical Genetics." Williams & Wilkins, Baltimore.
Germer, S., Holland, M. J., and Higuchi, R. (2000). High-throughput SNP allele-frequency determination in pooled DNA samples by kinetic PCR. Genome Res. 10, 258-266.
Gibbs, M., Stanford, J. L., McIndoe, R. A., Jarvik, G. P., Kolb, S., Goode, E. L., Chakrabarti, L., Schuster, E. E., Buckley, V. A., Miller, E. L., Brandzel, S., Li, S., Hood, L., and Ostrander, E. A.
28. Significance Levels in Genome Scans
(1999). Evidence for a rare prostate cancer-susceptibility locus at chromosome 1p36. Am. J. Hum. Genet. 64, 776-787.
Grote, M. N., Klitz, W., and Thomson, G. (1998). Constrained disequilibrium values and hitchhiking in a three-locus system. Genetics 150, 1295-1307.
Gu, C., Province, M., Todorov, A., and Rao, D. C. (1998). Meta-analysis methodology for combining non-parametric sibpair linkage results: Genetic homogeneity and identical markers. Genet. Epidemiol. 15, 609-626.
Huttley, G. A., Smith, M. W., Carrington, M., and O'Brien, S. J. (1999). A scan for linkage disequilibrium across the human genome. Genetics 152, 1711-1722.
Jorde, L. B., Watkins, W. S., Carlson, M., Groden, J., Albertsen, H., Thliveris, A., and Leppert, M. (1994). Linkage disequilibrium predicts physical distance in the adenomatous polyposis coli region. Am. J. Hum. Genet. 54, 884-898.
Lander, E., and Kruglyak, L. (1995). Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat. Genet. 11, 241-247.
Lander, E. S., and Schork, N. J. (1994). Genetic dissection of complex traits. Science 265, 2037-2048.
Lernmark, A., and Ott, J. (1998). Sometimes it's hot, sometimes it's not. Nat. Genet. 19, 213-214.
Long, A. D., and Langley, C. H. (1999). The power of association studies to detect the contribution of candidate genetic loci to variation in complex traits. Genome Res. 9, 720-731.
Luo, D. F., Buzzetti, R., Rotter, J. I., Maclaren, N. K., Raffel, L. J., Nistico, L., Giovannini, C., Pozzilli, P., Thomson, G., and She, J. X. (1996). Confirmation of three susceptibility genes to insulin-dependent diabetes mellitus: IDDM4, IDDM5 and IDDM8. Hum. Mol. Genet. 5, 693-698.
Mein, C. A., Esposito, L., Dunn, M. G., Johnson, G. C., Timms, A. E., Goy, J. V., Smith, A. N., Sebag-Montefiore, L., Merriman, M. E., Wilson, A. J., Pritchard, L. E., Cucca, F., Barnett, A. H., Bain, S. C., and Todd, J. A. (1998).
A search for type 1 diabetes susceptibility genes in families from the United Kingdom. Nat. Genet. 19, 297-300.
Morton, N. E., and Collins, A. (1998). Tests and estimates of allelic association in complex inheritance. Proc. Natl. Acad. Sci. USA 95, 11389-11393.
Paterson, A. D., and Petronis, A. (1999). Sex of affected sibpairs and genetic linkage to type 1 diabetes. Am. J. Med. Genet. 84, 15-19.
Payami, H., Thomson, G., Motro, U., Louis, E. J., and Hudes, E. (1985). The affected sib method. IV. Sib trios. Ann. Hum. Genet. 49, 303-314.
Pugliese, A. (1999). Unraveling the genetics of insulin-dependent type 1A diabetes: The search must go on. Diabetes Rev. 7, 39-54.
Rao, D. C. (1998). CAT scans, PET scans, and genomic scans. Genet. Epidemiol. 15, 1-18.
Risch, N. (1987). Assessing the role of HLA-linked and unlinked determinants of disease. Am. J. Hum. Genet. 40, 1-14.
Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273, 1516-1517.
Sawcer, S., Jones, H. B., Feakes, R., Gray, J., Smaldon, N., Chataway, J., Robertson, N., Clayton, D., Goodfellow, P. N., and Compston, A. (1996). A genome screen in multiple sclerosis reveals susceptibility loci on chromosome 6p21 and 17q22. Nat. Genet. 13, 464-468.
Sawcer, S., Goodfellow, P. N., and Compston, A. (1997). The genetic analysis of multiple sclerosis. Trends Genet. 13, 234-239.
Schork, N. J., Cardon, L. R., and Xu, X. (1998). The future of genetic epidemiology. Trends Genet. 14, 266-272.
Suarez, B. K., Hampe, C. L., and Van Eerdewegh, P. (1994). Problems of replicating linkage claims in psychiatry. In "Genetic Approaches to Mental Disorders" (E. S. Gershon and C. R. Cloninger, eds.), pp. 23-46. American Psychiatric Press, Washington, DC.
Glenys Thomson
Thomson, G. (1994). Identifying complex disease genes: Progress and paradigms. Nat. Genet. 8, 108-110.
Thomson, G. (1997). Strategies involved in mapping diabetes genes: An overview. Diabetes Rev. 5, 106-115.
Thomson, G., and Esposito, M. S. (1999). The genetics of complex diseases. Trends Genet. 15, M17-M20.
Thorsby, E. (1997). Invited anniversary review: HLA associated diseases. Hum. Immunol. 53, 1-11.
Todorov, A. A., and Rao, D. C. (1997). Trade-off between false positives and false negatives in the linkage analysis of complex traits. Genet. Epidemiol. 14, 453-464.
Valdes, A. M., McWeeney, S. K., and Thomson, G. (1999a). Evidence for linkage and association to alcohol dependence on chromosome 19. Genet. Epidemiol. 17, S367-S372.
Valdes, A. M., Thomson, G., Erlich, H. A., and Noble, J. A. (1999b). Association between type 1 diabetes age of onset and HLA among sibling pairs. Diabetes 48, 1658-1661.
Wise, L. H., Lanchbury, J. S., and Lewis, C. M. (1999). Meta-analysis of genome searches. Ann. Hum. Genet. 63, 263-272.
29

False Positives and False Negatives in Genome Scans

D. C. Rao¹
Division of Biostatistics and Departments of Psychiatry and Genetics
Washington University School of Medicine
St. Louis, Missouri 63110

Chi Gu
Division of Biostatistics
Washington University School of Medicine
St. Louis, Missouri 63110
I. Summary
II. Introduction
III. False Positives and False Negatives
IV. Trade-off between False Positives and False Negatives
V. Discussion
References
¹To whom correspondence should be addressed.

Advances in Genetics, Vol. 42
Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.
0065-2660/01 $35.00

I. SUMMARY

It is emphasized that two types of errors are made in the testing of a hypothesis: false positive (type I) and false negative (type II). Genome-wide scans involving many markers give rise to the problem of multiple testing, which results in an increased number of false positives, thus necessitating a correction in the nominal significance level. While the literature has concentrated heavily on controlling false positives in genomic scans, the need to control false negatives has been largely neglected. This chapter highlights this need and attempts to strike a balance between the two error types. The need to develop alternative
methods for discriminating between false positives and true positives is also stressed.
II. INTRODUCTION

Complex traits are determined by interactions among multiple genetic and environmental factors. Accordingly, their genetic dissection is a formidable task, particularly since in many cases there are no genes with a big effect. The effect sizes of any of the multiple etiologic factors are likely to be rather modest. Therefore, methodologies meant for detecting genes with large effects (major genes) are unlikely to be successful with complex traits, as the experience of recent years has shown. Most complex traits are oligogenic (a few genes, each with a moderate effect), and may even be polygenic (many genes, each with a small effect). Even though the individual gene effects may be small, interactions among the genes and environments could make a substantial contribution to the final manifestation of the trait. Failure to recognize and accommodate such interactions may often mask the effects of the very genes we seek. Therefore, to unmask the gene effects and aid in their discovery, we must pay attention to all relevant aspects of gene finding, including study design, optimal methods of analysis, and interpretation of the results. Large sample sizes and optimum methods of analysis should help considerably. One must also be thoughtful when interpreting the results. The interpretation of results from genome-wide linkage analyses will be the main concern of this chapter, building upon the preceding chapter by Thomson on significance levels. A common approach to enhance the detection of genes with moderate effects is to boost the power. This, however, requires very large sample sizes, which in turn are difficult for any one center to achieve. Fortunately, the concept of multicenter genetic studies (e.g., Higgins et al., 1996) is rapidly evolving as a means of generating large samples of standardized family data.
The motivation is to provide an adequate sample size for traits (e.g., hypertension) where the effect size of any individual etiologic factor is likely to be modest. Choosing appropriate methods of analysis should enable us to extract maximum information from a given study. While sufficiently large sample sizes and powerful method(s) of analysis are critical, it is argued that thoughtful interpretation of the results is also important.
III. FALSE POSITIVES AND FALSE NEGATIVES

As with all hypothesis tests, a linkage analysis that tests the null hypothesis of no linkage against the alternative hypothesis of linkage can result in two types
of outcomes: correct outcomes and incorrect outcomes. A correct outcome results both when a correct null hypothesis is not rejected and when a false null hypothesis is rejected. The incorrect outcomes involve two types of errors: false positive (type I error), when a true null hypothesis (of no linkage) is rejected, thus claiming a false linkage, and false negative (type II error), when a false null hypothesis is not rejected, thus failing to detect a true linkage. Clearly, both types of errors have serious consequences. Unfortunately, it is not possible to eliminate either type of error, and therefore, one must try to minimize them both as much as possible.
A. Two errors, not just one

Using multiple markers in genome-wide scans gives rise to the problem of multiple testing. Therefore, some correction for multiple testing is necessary before a result is declared a "failure" or a "success." Failure to correct for multiple testing will inevitably yield far too many positive results, many of which would be false. Likewise, overcorrecting for multiple testing will inevitably lead to far fewer results, thus missing some or even most of the very signals we seek in the first place. Perhaps some reasonable guidelines need to be followed to minimize both types of error (false positives that cannot be replicated and false negatives that will remain undetected). Such guidelines must be based on a balanced consideration of, among other things, the real costs associated with false positives and false negatives. It should be clarified that linkage analysis was used in earlier days for localizing genes that were already known to exist. Because failure to localize a gene that was already known to exist was considered to be less serious, investigators could afford to use relatively more stringent criteria (such as a lod score of 3; Morton, 1955). Rather, falsely mapping a gene to a wrong location had dire consequences in terms of the cost associated with following up on false positives. In stark contrast, linkage analysis of complex traits serves a dual purpose these days: proving both the very existence of a trait gene and its localization. For many complex traits, there is no knowledge about the existence of genes with detectable effects, and therefore, failing to detect a true linkage (false negative) could delay establishing the very existence of such genes, hence delaying gene finding as well. Accordingly, errors in inference have a different meaning and value now, although pursuing false positives continues to be costly. This puts an extra degree of burden on analysts when they try to interpret the results.
In any case, there is far greater emphasis in the literature on the need to minimize false positives, as if that were the only error that matters. It is important to remember that there are indeed two errors, not just one. The frequencies of the two errors depend, in part, on the (nominal) significance level chosen for testing a linkage hypothesis. A significance level
(see Chapter 28 by Thomson) indicates one's tolerance for inferring a false result. Choice of significance levels for genome-wide scans has been somewhat controversial (Risch, 1991; Thomson, 1994; Lander and Kruglyak, 1995; Rao, 1998; Morton, 1998; Rao and Gu, 2000; Cheverud, 2000). Lander and Kruglyak (1995) have recommended that, under the assumption of continuous marker density, the genome-wide significance level αg be set at 0.05 and that the pointwise (nominal) significance level α for individual tests with each marker be determined approximately from the equation αg = (23 + 132Tα²)α, where Tα is the standard normal deviate corresponding to the nominal significance level α (see Feingold et al., 1993, for theoretical results). Under this recommendation, αg = 0.05 yields α = 0.000022. While this significance level will practically eliminate false positives, it will inevitably increase false negatives. Therefore, what is needed for a more balanced approach is a less stringent significance level that also minimizes false negatives (see also Rao and Gu, 2000). Lander and Kruglyak's (1995) recommendation requires that we accept, on average, one false positive in 20 genome scans (αg = 0.05, which corresponds to a nominal α of 0.000022 and a lod score of 3.63). To minimize false negatives, Rao (1998) suggested relaxing the threshold enough to tolerate, on average, one false positive per genome scan. Under continuous marker density, this corresponds to a nominal α = 0.00071 (corresponding to a lod score of 2.21), which is very similar to the recommendation of Thomson (1994) to use α = 0.001. If the marker density is adjusted to a total of 400 markers, this nominal α increases further to 0.0023 (which corresponds to a lod score of 1.75; see Rao and Province, 2000; Rao and Gu, 2000).
The Lander-Kruglyak nominal α also increases from 0.000022 (corresponding to a lod score of 3.63) to 0.000090 (corresponding to a lod score of 3.05, which is almost identical to Morton's time-tested value of 3). Using the approximations presented by Feingold et al. (1993), Figure 29.1 shows the nominal significance levels and the lod scores under both marker densities as functions of the genome-wide significance level. As can be seen, the specific values just discussed are all contained in the figure. Moreover, the nominal significance levels and lod score thresholds corresponding to other values of the genome-wide significance level can be easily read off, permitting investigators to choose the levels they are comfortable with.
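The specific conversions quoted in this section can be checked numerically. The sketch below is an illustrative back-calculation, not code from the chapter: it uses the standard relation between a lod score and its one-sided pointwise p-value (χ² = 2 ln 10 × lod, one-sided) together with the Feingold et al. (1993) continuous-density approximation αg ≈ (23 + 132Tα²)α discussed above.

```python
import math

def alpha_from_lod(lod):
    """One-sided pointwise p-value of a lod score:
    p = P(Z > sqrt(2 ln 10 * lod)) for a standard normal Z."""
    z = math.sqrt(2.0 * math.log(10.0) * lod)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def expected_false_positives(lod):
    """Expected number of genome-wide false positives per scan under
    continuous marker density, via the Feingold et al. (1993)
    approximation alpha_g ~ (23 + 132 * T^2) * alpha."""
    t = math.sqrt(2.0 * math.log(10.0) * lod)
    return (23.0 + 132.0 * t * t) * alpha_from_lod(lod)

for lod in (3.63, 3.05, 2.21, 1.75):
    print(f"lod {lod}: alpha = {alpha_from_lod(lod):.6f}, "
          f"E[false positives/scan] = {expected_false_positives(lod):.3f}")
```

A lod of 3.63 indeed gives α ≈ 0.000022 and about 0.05 expected false positives per scan, and a lod of 2.21 gives α ≈ 0.00071 and about one per scan; the 400-marker thresholds (lod 3.05 and 1.75) come from the discrete-density calculation and are not reproduced by this continuous-density formula.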
B. Need to control false negatives

In general, for a given effect size of a trait gene, the rate of false inferences is controlled by the sample size and the significance level chosen. Therefore, since increasing the sample size is seldom an option, changing the significance level may be the only way to control the errors. However, we have to examine the implications associated with each type of error before finally choosing a significance level.

Figure 29.1. Plot of the nominal (pointwise) significance level (α) along the vertical axis on the right and the lod score along the vertical axis on the left as functions of the genome-wide significance level (expected number of false positives per scan) along the horizontal axis. Two separate marker densities (continuous and 400) are shown.

While a false negative in earlier days simply meant that a certain trait gene known to exist could not be located on the genetic map, a false negative in a genome-wide scan for a complex trait these days could delay even establishing the preliminary information that genes with moderate effects indeed exist for the trait under study. Moreover, the convenience and the ability to readily pool the lod scores over families and studies de-emphasized the issue of power in earlier days, since the evidence was expected to increase gradually as more and more families were added. Although false positives can mislead follow-up efforts at least for a while, false negatives may delay the discovery of important disease genes, with the resulting delays in the development of potentially important pharmacological or gene interventions. The real cost of false negatives in this day and age, therefore, may well outweigh the cost of false positives, as long as best efforts are made to minimize false positives. Therefore, it would be necessary to develop means for pruning false positives other than practically eliminating them by accepting unreasonably high thresholds. It should be emphasized that the motivation here is not one of inflating false positives, but only controlling false negatives.
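The trade-off described in this section can be illustrated with a crude power calculation. The block below uses a normal approximation for a one-sided mean-IBD-sharing test in affected sib pairs (per-pair sharing variance 1/8 under the null); the excess sharing Δ = 0.05 and the sample size are hypothetical round numbers chosen for illustration, not parameters taken from this chapter.

```python
import math

def z_quantile(alpha):
    """Upper-tail standard normal quantile, by bisection on the tail area."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if 0.5 * math.erfc(mid / math.sqrt(2.0)) > alpha:
            lo = mid  # tail still too large: quantile lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

def asp_power(delta, n_pairs, alpha):
    """Approximate power of a one-sided mean-IBD-sharing test with n_pairs
    affected sib pairs, excess sharing delta above the null value 0.5, and
    nominal significance level alpha.  The noncentrality is
    delta * sqrt(8 * n_pairs) because the per-pair null variance is 1/8."""
    shift = delta * math.sqrt(8.0 * n_pairs)
    return 0.5 * math.erfc((z_quantile(alpha) - shift) / math.sqrt(2.0))

lenient = asp_power(0.05, 500, 0.0023)     # one expected false positive per scan
stringent = asp_power(0.05, 500, 0.00009)  # one expected false positive in 20 scans
```

With these hypothetical inputs the lenient threshold roughly doubles the power (about 0.63 versus about 0.28), which is the qualitative pattern argued for here.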
C. Alternative ways to control false positives

It is clear that using higher significance levels (for the purpose of minimizing false negatives) will also lead to more false positives. Therefore, alternative methods for discriminating between false positives and true positives need to be developed. Although some attention has been paid to this issue, it has been largely neglected and is in need of further work. There is some indication that the "breadth" of a lod score peak may be useful for this purpose, in that true peaks tend to be broader than false peaks (Terwilliger et al., 1997). Although replication is commonly required for accepting a positive finding (thus serving as an excellent method for pruning out false positives), it is not clear that failure to replicate a positive finding in a different study constitutes proof that the original finding is false (Suarez et al., 1994). This difficulty arises because complex traits involve multiple genes as well as interactions among genes and environments. It is therefore readily possible either for different genes to be involved with the same trait in different studies, or for the same gene to be involved in both studies but expressed in only one of them (say, because of the presence of interacting determinants in one study and their absence in the other). Therefore, replication studies may fail to replicate the original finding even when the original gene detection is true. Perhaps replication studies can be designed more efficiently, to maximize their ability to replicate. For example, carrying out replication studies on the same underlying population from which the original sample was drawn can form one useful criterion. Matching the two samples for other characteristics (age, sex, race, lifestyle, etc.) can also help. The goal here is one of matching for interacting determinants to maximize the ability to replicate a true finding.
Perhaps the best replication study is embedded in the original study itself. With a sufficiently large sample size, it is possible to randomly split the (original) sample into two subsamples so that one can be used for replicating the finding(s) from the other. Since different splits give rise to different subsamples, it would be preferable to carry out several split sample analyses and examine the distribution of a similarity index (e.g., concordance of evidence between the two split samples in one trial) to determine how best to interpret the results. In general, we believe that both consistency of the evidence across multiple studies (or split samples) and the level of within-study evidence are useful for discriminating between false positives and true positives.
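The split-sample idea can be sketched as follows. The per-family evidence scores and the threshold here are hypothetical placeholders (a real analysis would recompute the linkage statistic within each half-sample), but the repeated-split logic is the one described above.

```python
import random

def split_sample_concordance(family_scores, threshold, n_splits=1000, seed=1):
    """Repeatedly split the families at random into two halves and record
    how often BOTH halves independently exceed the evidence threshold.
    family_scores: hypothetical per-family contributions to the overall
    linkage statistic (a stand-in for recomputed half-sample lod scores)."""
    rng = random.Random(seed)
    scores = list(family_scores)
    hits = 0
    for _ in range(n_splits):
        rng.shuffle(scores)
        half = len(scores) // 2
        if sum(scores[:half]) > threshold and sum(scores[half:]) > threshold:
            hits += 1
    return hits / n_splits

# Illustrative data: 40 families, 10 of which carry most of the evidence.
concordance = split_sample_concordance([0.05] * 30 + [0.3] * 10, threshold=1.0)
```

Examining the distribution of such a concordance index over many random splits, rather than relying on a single split, is what the text recommends.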
D. Two-stage designs (linkage in stage 1 and association in stage 2)

Sometimes investigators consider two-stage studies to be a cost-effective way of designing genomic scans, whereby a relatively sparse marker map is used first to generate linkage signals, followed by a second stage with a denser marker map
around the suggested signals (Elston, 1992). More recently, Elston et al. (1996) have analyzed the properties and performance of the one-stage and two-stage strategies, concluding that a two-stage procedure could halve the cost of a study in comparison to a one-stage procedure (see also Chapter 27 in this volume, by Guo and Elston). In such designs, it is critical that the first stage have excellent power, well over the usual 80%, since the second stage cannot recover any linkages missed in the first stage. Also, use of the same sample for both stages may not be optimal for pruning false positives, since both stages would be based on the same allele sharing information. It is desirable to use independent samples of relative pairs in each stage. In general, the two-stage design appears to be cheaper when dense maps are used in the second stage. However, the rate of false positives appears to be better controlled in the one-stage study (Todorov and Rao, 1997). Perhaps a better two-stage design would feature a first stage that carries out a linkage analysis by means of a relatively dense map and identifies potential regions to be assessed in the second stage with association studies. In this case, since only the first stage will use allele sharing information, the effect of using the same sample twice in the design is unclear. Perhaps the same sample could indeed be used for the two stages. Alternatively, a combination of linkage and association may be more useful in follow-up studies. Finally, the cost-effectiveness of such a design warrants further study.
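The cost argument can be made concrete with simple arithmetic. The cohort size, marker counts, and the 10% follow-up fraction below are hypothetical round numbers, not figures from Elston et al. (1996); the point is only that genotyping the dense map solely in flagged regions shrinks the total genotyping bill.

```python
def genotyping_cost(n_subjects, n_markers, cost_per_genotype=1.0):
    """Total genotyping cost, assuming one genotype per subject per marker."""
    return n_subjects * n_markers * cost_per_genotype

def two_stage_cost(n_subjects, sparse_markers, dense_markers, follow_up_fraction):
    """Stage 1 types a sparse map on everyone; stage 2 types the dense map
    only in the flagged fraction of the genome (same sample in both stages)."""
    stage1 = genotyping_cost(n_subjects, sparse_markers)
    stage2 = genotyping_cost(n_subjects, int(dense_markers * follow_up_fraction))
    return stage1 + stage2

one_stage = genotyping_cost(1000, 800)            # dense map everywhere
two_stage = two_stage_cost(1000, 400, 800, 0.10)  # sparse scan, then 10% follow-up
```

With these numbers the two-stage design costs 480,000 genotypes against 800,000 for the one-stage design, a saving of the order reported by Elston et al. (1996).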
IV. TRADE-OFF BETWEEN FALSE POSITIVES AND FALSE NEGATIVES

As the preceding discussion shows, it is highly desirable to strike a balance between false positives and false negatives without neglecting or dismissing either type of error. This is even more pertinent for complex traits, since the effect of any gene is likely to be only moderate. For the purpose of trading one error for the other, it helps to recall how each error is affected. Whereas the rate of false positives depends on the significance level and the marker density (and assumptions related to the genetic maps, such as the total genetic map length), the rate of false negatives depends on the significance level and the sample size (besides the effect size of a trait gene). To get an idea about how the rate of false negatives varies with the significance level and the sample size, we show in Table 29.1 the power (which is the complement of the false negative rate, i.e., 1 − false negative rate) for the affected sibpair (ASP) design. The table considers three different prevalence values of a certain disease (Kp), two different values of the heritability due to the trait gene under consideration (h²), and several alternative values of the sample size (N). Corresponding to the various combinations of these parameters, the table shows the power under two different genome-wide significance
Table 29.1. Power under Alternative Genetic Models Using an ASP Study Design.ᵃ

                     Number of ASPs
Kp (%)        200           500            800

h² = 10%
  1        76.6/42.0     99.8/97.1     100.0/100.0
  5        23.4/5.0      69.6/34.3      92.0/68.8
 10         9.1/1.2      32.4/8.6       57.1/23.2

h² = 20%
  1       100.0/99.8    100.0/100.0    100.0/100.0
  5        90.1/63.7    100.0/99.7     100.0/100.0
 10        56.0/21.9     97.2/84.0      99.9/98.8

ᵃThree choices of the population prevalence (Kp), two choices of the locus-specific heritability (h²), and three choices of the sample size are considered. The residual sibling correlation was fixed at ρ = 0.20, which explains a residual heritability of 40% over and above that explained by the locus-specific heritability. Power is reported as x/y corresponding to two significance levels, each with a discrete map of 400 markers: for x we used α = 0.00227, corresponding to one false positive (on average) per scan; for y, we used α = 0.00009, corresponding to one false positive (on average) in 20 genome scans.
levels: α = 0.0023, which corresponds to one false positive per genome scan (on average), and α = 0.00009, which corresponds to one false positive in 20 genome scans (on average), each based on 400 markers (see Figure 29.1). Let us now examine how the rate of false positives varies with the significance level. For this purpose, the marker density was fixed at 400 (equally spaced) markers. Figure 29.2 shows that the frequency of false positives (right vertical axis) increases rapidly (exponentially) with the (logarithm of the) nominal significance level. Let us now consider why we need to trade between the two errors at all. To demonstrate the importance of balancing the two types of errors, we have also plotted the false negative rates in Figure 29.2 for affected sibpair studies. Figure 29.2 displays the results for two sample sizes (N = 200 and 500 affected sibpairs) under each of two levels of the locus-specific heritability (h² = 0.1 and 0.2). In each case, a residual sibling correlation of ρ = 0.20 was assumed, which accounts for an additional 40% of the phenotypic variance. As clarified earlier, the horizontal axis represents the logarithm of the nominal significance level and the vertical axis represents the false negative
Figure 29.2. Demonstrating the importance of balancing between false positives (right vertical axis) and false negatives (left vertical axis) using an ASP design with 400 (equally spaced) markers. The horizontal axis represents the nominal (pointwise) significance level (α). Combinations of two sample sizes (N = 200 and 500) and two levels of heritability (h² = 0.1 and 0.2) generate four situations. Two nominal significance levels (α) are used: the left-hand vertical dashed line corresponds to one expected false positive in 20 genomic scans (α = 0.00009), while the right-hand vertical dashed line corresponds to one expected false positive per genomic scan (α = 0.0023). The exponentially increasing dotted line on the right shows the expected number of false positives corresponding to a given value of α.
rate, shown on the left, and the expected number of false positives per genome scan, shown on the right. The left-hand vertical dashed line corresponds to the nominal significance level of α = 0.00009, which is based on one expected false positive in 20 genomic scans [as recommended by Lander and Kruglyak (1995), except for the discrete marker density used here]. The right-hand vertical dashed line corresponds to the nominal value of α = 0.0023, which is based on one expected false positive per scan, as suggested by Rao (1998) and Rao and Province (2000). Both significance levels are based on a marker density of 400. As can be seen in Figure 29.2, for N = 500 and h² = 0.2, the most favorable case considered, a study using α = 0.00009 will miss about 4% of the genes (power = 96%); practically all the genes will be detected with α = 0.0023, although we can then expect one false positive per scan. In this case, it is better to use the more conservative significance level. However, for N = 200 (with the same
h² = 0.2), the smaller significance level will miss about 45% of the trait genes, while the larger one will miss about 15% of the genes. This becomes more critical for the smaller heritability (h² = 0.1, which is more pertinent for complex traits), where even N = 500 will miss about 75% of the genes when the smaller significance level is used. On the other hand, when the larger α (0.0023) is used, more than 60% of the genes are expected to be found. Finally, when the heritability is low (0.1) and the sample size is low (N = 200), which is the most unfavorable of the four cases, a design using the more stringent alpha misses practically all the trait genes, while a design using the larger value (α = 0.0023) is capable of detecting at least some genes (about 20% of the genes). We hope that this discussion makes the choices clear.
V. DISCUSSION

Since the rate of false negatives depends on the sample size, and since both error rates depend on the significance level chosen, it is clear that our goal should be one of minimizing both types of errors (as much as possible) at the stage of designing a study. That is, the required sample size should be calculated so as to yield power as high as possible when a relatively more stringent significance level is used. The usual practice of calculating a sample size to yield 80% power at a nominal significance level of α = 0.05 does not serve us well. Instead, these calculations may be based on 90% power at α = 0.0023. Understandably, the necessary sample sizes will be huge, and the budget projections may not permit them. Nonetheless, as long as studies are designed by heavily compromising on sample size (for whatever reason), one should not expect miraculous results. In particular, one should not mandate subsequent use of extremely stringent significance levels at the later stages of data analysis and interpretation of results. Instead of too many studies designed less than optimally, perhaps a smaller number of multicenter studies could be designed and coordinated to fulfill our primary goal at the design stage (namely, one of minimizing both types of error). Once a study is conducted, the sample size is "fixed"; hence the error rates must be seen only as functions of the significance level chosen. We would like to argue that, at the stage of data analysis and interpretation, one should use relatively less stringent nominal significance levels (such as α = 0.0023) to identify genomic regions for the purpose of carrying out follow-up work. Finally, when carrying out two-stage genome scans, one should remember that the two errors play different roles in the two stages (e.g., see Todorov and Rao, 1997, and Chapter 27 by Guo and Elston in this volume).
In particular, since genes missed in stage 1 can never be found in stage 2, it makes sense to use a larger significance level in stage 1 than in stage 2. Finally, one may bypass the multiple testing problem altogether by carrying out one hypothesis test globally
by means of sequential methods of analysis, as argued in the next chapter by Province. Since we are advocating that less stringent significance levels (such as α = 0.0023, which corresponds to a lod score of 1.75) be used for identifying genomic regions for further follow-up work, we should also recall that this practice will increase the rate of false positives. Therefore, there is a need to develop alternative methods for discriminating between false positives and true positives. Two potentially useful ideas were discussed in this chapter. Custom-designing replication studies by matching against the original study for known interacting determinants might be helpful for maximizing the chances for replication. Another promising approach is that of split-sample analysis. Finally, we believe that genetic dissection of complex traits can benefit from lumping and splitting approaches to data analysis and interpretation (e.g., see Rao and Gu, 2000). For example, data from multiple studies may be pooled for obtaining much larger aggregate sample sizes. One may then split the pooled data, which may have been rendered heterogeneous by virtue of pooling, into multiple homogeneous subgroups with the prospect of finding reasonably large sample sizes even within subgroups. Splitting into relatively homogeneous subgroups may be carried out using the classification and regression trees approach (e.g., see Breiman et al., 1984; Shannon et al., 2000; and Chapter 18 in this volume by Province et al.). Since there is growing evidence that some of the gene effects may be context dependent (e.g., gender dependent, race dependent, etc.; see Turner et al., 1999), subgrouping may be carried out based on the relevant contexts.
In either case, partitioning the total sample into subgroups of meaningful sizes will be facilitated by data pooling. Pooling data from multiple studies involves tremendous opportunity for meaningful collaborations, and this may be the limiting factor in terms of whether we succeed. Only when investigators interact actively without barriers can they hope to have a real chance at finding the genes for complex diseases and disease-related traits. In these challenging pursuits, one should not hide the evidence by refusing to acknowledge signals unless they pass some extremely stringent criteria. After all, the question is not whether there are genes, only when and how they might be found.
Acknowledgments

This work was supported in part by grant GM 28719 from the National Institute of General Medical Sciences of the National Institutes of Health.
References

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). "Classification and Regression Trees." Wadsworth International Group, Belmont, CA.
Cheverud, J. (2000). A simple correction for multiple comparisons in interval mapping genome scans. Submitted.
Elston, R. C. (1992). Designs for the global search of the human genome by linkage analysis. In "Proceedings of the XVIth International Biometric Conference, Hamilton, New Zealand," pp. 39-51.
Elston, R. C., Guo, X., and Williams, L. V. (1996). Two-stage global search designs for linkage analysis using pairs of affected relatives. Genet. Epidemiol. 13, 535-558.
Feingold, E., Brown, P. O., and Siegmund, D. (1993). Gaussian models for genetic linkage analysis using complete high-resolution maps of identity-by-descent. Am. J. Hum. Genet. 53, 234-251.
Higgins, M., Province, M. A., Heiss, G., et al. (1996). The NHLBI Family Heart Study: Objectives and design. Am. J. Epidemiol. 143, 1219-1228.
Lander, E., and Kruglyak, L. (1995). Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat. Genet. 11, 241-247.
Morton, N. E. (1955). Sequential tests for the detection of linkage. Am. J. Hum. Genet. 7, 277-318.
Morton, N. E. (1998). Significance levels in complex inheritance. Am. J. Hum. Genet. 62, 690-697.
Rao, D. C. (1998). CAT scans, PET scans, and genomic scans. Genet. Epidemiol. 15, 1-18.
Rao, D. C., and Gu, C. (2000). Principles and methods in the study of complex phenotypes. In "Molecular Genetics and Human Personality" (J. Benjamin, R. Ebstein, and R. H. Belmaker, eds.). American Psychiatric Press, Washington, DC.
Rao, D. C., and Province, M. A. (2000). The future of path analysis, segregation analysis, and combined models for genetic dissection of complex traits. Hum. Hered. 50, 34-42.
Risch, N. (1991). A note on multiple testing procedures in linkage analysis. Am. J. Hum. Genet. 48, 1058-1064.
Shannon, W. A., Province, M. A., and Rao, D. C. (2000). A CART method for subdividing linkage data into homogeneous subsets. Submitted to Genetic Epidemiology.
Suarez, B. K., Hampe, C. L., and Van Eerdewegh, P. (1994).
Problems of replicating linkage claims in psychiatry. In "Genetic Approaches to Mental Disorders" (E. S. Gershon and C. R. Cloninger, eds.), pp. 23-46. American Psychiatric Press, Washington, DC.
Terwilliger, J. D., Shannon, W. D., Lathrop, G. M., et al. (1997). True and false positive peaks in genomewide scans: Applications of length-biased sampling to linkage mapping. Am. J. Hum. Genet. 61, 430-438.
Thomson, G. (1994). Identifying complex disease genes: Progress and paradigms. Nat. Genet. 8, 108-110.
Todorov, A. A., and Rao, D. C. (1997). Trade-off between false positives and false negatives in the linkage analysis of complex traits. Genet. Epidemiol. 14, 453-464.
Turner, S. T., Boerwinkle, E., and Sing, C. F. (1999). Context-dependent associations of the ACE I/D polymorphism with blood pressure. Hypertension 34 (part 2), 773-778.
Sequential Methods of Analysis for Genome Scans

Michael A. Province
Division of Biostatistics
Washington University School of Medicine
St. Louis, Missouri 63110
I. Summary
II. Introduction
III. Background and Significance
IV. Sequential Analysis Theory
V. Discussion
References
I. SUMMARY

As the preceding chapters illustrate, now that whole-genome scan analyses are becoming more common, there is considerable disagreement about the best way to balance between false positives and false negatives (traditionally called type I and type II errors in the statistical parlance). Type I and type II errors can be simultaneously controlled, if we are willing to let the sample size of the analysis vary. This is the secret that Wald (1947) discovered in the 1940s that led to the theory of sequential sampling and was the inspiration for Newton Morton in developing the lod score method. We can exploit this idea further and capitalize on an old, but nearly forgotten theory: sequential multiple decision procedures (SMDP) (Bechhofer et al., 1968), which generalizes the standard "two-hypothesis" tests to consider multiple alternative hypotheses. Using this theory, we can develop a single, genome-wide test that simultaneously partitions all markers into "signal" and "noise" groups, with tight control over both type I and type II errors (Province, 2000). Conceiving this approach as an analysis tool for

Advances in Genetics, Vol. 42
Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.
fixed sample designs (instead of a true sequential sampling scheme), we can let the data decide at which point we should move from the hypothesis generation phase of a genome scan (where multiple comparisons make the interpretation of p values and significance levels difficult and controversial) to a true hypothesis-testing phase (where the problem of multiple comparisons has been all but eliminated, so that p values may be accepted at face value).
II. INTRODUCTION

In 1976 Viking Orbiter 1 took a series of pictures of the surface of the planet Mars, including the area around Cydonia Mensae. Among the thousands of pictures taken from the 1000-mile-high orbit was a handful of a mountain that has come to be known as "The Face" (Figure 30.1). Almost everyone who sees the picture has the same reaction: "It looks like a face!" The pictures of this mountain, later estimated to be about 1.6 miles long, 1.2 miles wide, and 0.5 miles high, generated a great deal of excitement and controversy. NASA officials dismissed the resemblance to a human face as a "trick of light and shadow," while others suggested that it might be some kind of Martian sphinx, evidence of a long-extinct extraterrestrial civilization. At one level we can readily understand why it seems highly improbable that such a close resemblance to a human face could occur by chance alone. Published estimates of the odds of such a structure occurring "at random" vary depending upon the mathematical models used as well as the kinds of deviations from randomness considered, ranging somewhere between 43:1 and 152,600:1 in favor of "artificiality" (Carlotto, 1997). This corresponds to p values between p = 0.023 and p = 0.0000065, respectively. But on a deeper level, it is perhaps not so surprising that such an unusual image was found. When we consider how much of the Martian surface was photographed by the Viking Orbiters 1 and 2, is it not inevitable that we would eventually photograph some natural structure that would resemble another object we might recognize, such as a face, a dog, a building, or a geometric shape? The point of this rather whimsical example is to demonstrate that the "multiple comparisons" problem in statistics can erode the meaning of significance.
When we do many statistical tests, or a large enough search for "significant" findings (such as a genome-wide linkage scan with hundreds of anonymous markers or a linkage disequilibrium association scan with thousands of SNPs), we have to be very careful about what we finally consider to be "statistically significant." Formally, the famous Bonferroni correction is often used in this situation. If we call the significance level for each test singly the "testwise error rate" αT, and that for M independent tests considered together the "experiment-wise error rate" αE, then

αE = 1 − (1 − αT)^M

gives the probability of making at least one type I error among the M tests (under the complete null
Figure 30.1. The Martian "face" at Cydonia Mensae, from Viking Orbiter 1.
hypothesis, i.e., if all M null hypotheses are true). But in many cases, it is unclear what the value of "M," the number of multiple comparisons, should be. In the Martian face example, should we consider the total number of photographs taken by Viking Orbiter 1, or the number of nonoverlapping regions photographed? Or should we count different substructures separately, such as each side of the same mountain? Or should we also consider the photos taken by Viking 2 in the total M? In the context of genome scans, the value of M can be equally ambiguous. In fact, Lander and Kruglyak (1995), building on the work of Feingold et al. (1993), have essentially argued for using testwise error rates which would hold for the limiting case when M, the number of anonymous markers genotyped, approaches the total number in the genome, that is, an infinitely precise map!
III. BACKGROUND AND SIGNIFICANCE

A. "Significance" in genome-wide scans

Currently, a debate is raging about the proper interpretation of evidence from such genome-wide scan data (Lander and Kruglyak, 1995; Curtis, 1996; Risch and Botstein, 1996; Witte et al., 1996; Rao, 1998; Morton, 1998; Weller et al., 1998). Since the number of markers in a genome scan is usually very large, even for a very small testwise type I error rate, the number of observed "false positives" (significant by chance alone) can be large and can actually overwhelm the number of true positives. This has led some authors, such as Lander and Kruglyak (1995), to argue for very stringent testwise alpha levels, αT, so that the total, experiment-wise (genome-wide) type I error, αE, will be small. Thus, Lander and Kruglyak would require the location-wise evidence to be strong enough to produce no more than one genome-scan-wise type I error in 20 scans. Other authors (e.g., Rao, 1998; see also the preceding two chapters in this volume) have pointed out that an approach that adjusts the alpha levels alone pays a dear price in increasing the number of false negatives (i.e., missing true signals). They have advocated less severe adjustments and call for achieving a "balance" between the two error rates by requiring smaller individual type I and type II testwise errors in a priori sample size determinations. One way out of this dilemma is to recognize that by its very nature, a genome scan is a hypothesis generation exercise, which by itself cannot definitively prove anything until the findings have been replicated in independent samples. In the statistical literature, the distinction is often made between hypothesis testing and hypothesis generation (e.g., Thomas et al., 1985).
In true hypothesis testing, the number of tests is small and each hypothesis is "well motivated." Thus, no corrections for multiple comparisons are needed, and p values can be taken at (approximate) face value, whereas the hypothesis generation mode is more of a "fishing expedition," where all the vagaries discussed earlier (such as the "true" number of tests conducted) come into play to make the interpretation of "significance" murky. Thus, a hypothesis-generating genome scan should be completed with the minimum resources necessary, and the resources saved to independently replicate or refute the few findings generated in the first stage using true hypothesis testing. Several large-scale studies of complex traits have implemented such a two-stage, fixed sample design [e.g., the COGA study, the National Heart, Lung, and Blood Institute Family Heart Study (NHLBI FHS), the San Antonio Family Heart Study, and the HyperGEN study]. An initial hypothesis-generating genome scan is conducted on approximately half of the sample, followed by selected hypothesis tests only in the "hit regions" using the other half. The logic is that once a relatively small number of promising regions have been identified in the hypothesis generation phase,
these few hits can be formally tested on the remaining sample by using traditional hypothesis tests, with little or no adjustments in the p values required, since here only a few, well-motivated tests are actually being conducted and they are being tested in a completely independent set of data. An immediate issue in such two-stage designs is the question of the optimal relative sample sizes for the two stages. If the total "N" is divided into the "k" part for the hypothesis generation phase and the "N − k" part for the replication phase, then if "k" is too small, there may not be enough power to find "hit regions" in the first place. But if "k" is too big, there may not be enough power in the remaining sample to replicate/refute the findings. Moreover, since we are unsure how exactly to interpret the significance levels in that stage, we want "k" to be as small as possible. Statisticians, on the other hand, have over 100 years of experience and tradition in testing a small number of well-motivated hypotheses in independent data, to be able to interpret the significance levels in the hypothesis-testing phase. Why not let the data decide on the optimal split point between generation and replication samples?
B. Sequential analysis, not sequential sampling

Sequential analysis methods give one very elegant solution to this problem. They allow us to tailor the relative sample sizes for each phase to fit each situation, even to the level of each particular phenotype within the same study, while maintaining tight control on the false positive and false negative rates in each stage. While sequential testing methods have been widely available for the past 50 years, they are still relegated to a relatively small universe of devoted followers and are largely ignored by the "fixed sampling" world of investigators. This is due partly to the analytic difficulty of obtaining some of the solutions and partly to the practical difficulty of conducting a truly sequentially sampled study. Indeed, genotyping in small family or relative pair units is not very cost-efficient, and the prospect of having to quickly clean, transform, and reanalyze data at every sampling point is daunting. Even the more practical block sequential designs (Whitehead, 1983) require a higher degree of organization and a more immediate response to data than most investigators are willing or able to provide. Perhaps the most unappealing aspect of true sequential sampling is the idea that the sample size for a given study should be so completely dictated by the test of a single hypothesis. Usually, there are many outcomes and many hypotheses to be tested. This is particularly true in the context of a genome scan, when one may want to use the same family data to search for genes for many phenotypes. The requirement of true sequential sampling, which would be prespecifying exactly one primary hypothesis on which to make sampling decisions to the exclusion of all others, could very easily leave one with an inadequate sample for all other hypotheses of interest.
While actual sequential sampling may not be very practical, sequential analysis of fixed sample data not only is practical but can be quite efficient, at least in the special case of genome-wide linkage or association scans. The theory of sequential sampling predicts (and our simulations confirm; Province, 2000) that the same power can be achieved at a substantially lower average sample number when sequential analysis is used. Fixed sample advocates usually argue that this "saving" is meaningless if one has already decided to use a fixed sampling scheme. Even if a sequential test needs fewer samples, they ask, what does one really gain? What practical use will one make of these "extra" samples that were not analyzed? Once having gone to the trouble of collecting these data, why not use all sample points to get even greater power? The answer is that, in the special case of genome-wide scans, we can gain a lot, because whatever samples we save in the hypothesis generation phase, we can use to great advantage in the hypothesis-testing phase. The beauty of the sequential analysis approach is that we minimize the samples needed for each specific hypothesis in the generation phase and thereby maximize the sample we can use in the confirmatory testing phase, without compromising on either the type I error or the power. Most importantly, sequential multiple decision procedures (Bechhofer et al., 1968) allow us to go even a step further and obviate the need for conducting a large number of tests in the first place. Instead, the multiple hypothesis tests across the genome can be replaced by a single, genome-wide test procedure, which has multiple possible outcomes and automatically identifies the "best" subset of markers among those in the entire scan at a prespecified "p value." In traditional (fixed sampling) theory, we hold both type I error (α) and sample size (N) constant and take whatever power the data provide.
For a well-designed experiment, power calculations have been done beforehand to ensure that, on average, the power to detect the target effect size is reasonable (using, e.g., the more stringent levels advocated by Rao, 1998). However, this by itself will not guarantee that adequate power will be achieved in the sample actually collected once the particular experiment has been conducted. Once completed, the sample may belie expectations, having greater or lesser power than anticipated a priori (because the effect size, D, may be greater or lesser than presumed). One could (and often should) redo such power calculations at the end of a study, to ensure that adequate power was achieved, especially if the results are negative (as they will always be for some markers in a whole-genome scan). If adequate power was not achieved, then additional samples are required to ensure that too many true linkages were not missed (false negatives), which would in turn require a further sample size determination upon completion of the recalculations. Such a process amounts to a very ad hoc, unbalanced, inefficient block sequential design. By contrast, the sequential sampling analysis approach (Wald, 1947) fixes both type I error (significance) and type II error (1 − power) and allows the
sample size to vary as a random variable with the experiment. It therefore accounts for such sampling fluctuations automatically, as they occur at every stage of the process, requiring just the right amount of additional samples to achieve the target precision in type I and type II errors for the target effect size D. It thereby provides the most efficient way to make sure that both target errors are achieved for the specified effect size with the smallest possible sample sizes. However, the "price" we pay for being able to fix both types of error is that the sample size N is open ended and could theoretically extend indefinitely. Fortunately, sequential sampling theory yields two important results:

1. The sequential process will terminate at a finite N with probability "1".
2. On average, the N required under sequential sampling (called the average sample number, or ASN) will be smaller than that for the "best" fixed sample test that gives the same power (Wald, 1947).
IV. SEQUENTIAL ANALYSIS THEORY

The very first systematic treatment of sequential analysis was the development of the sequential probability ratio test (SPRT) by Wald (1947). Wald originally advocated sequential sampling as a way to avoid the extra costs in sampling when there was a high cost of observing each data point (e.g., firing an experimental rocket). Here, however, we conceive of the SPRT as an analysis tool that makes efficient use of the fixed samples collected in the first "k" samples to screen the genome, saving the remaining "N − k" for validation. One of the earliest and most profound applications of sequential theory in the field of human genetics was in the development of the lod score method by Newton Morton, which was the first sequential test for linkage (Morton, 1955). Since that time, sequential theory has become quite a rich and well-developed field (Wetherill, 1966; Ghosh, 1970; Whitehead, 1983; Siegmund, 1985) and is readily applicable to many other situations in genetics. There are sequential tests for binary, ordinal, continuous, and survival traits, so that sequential sampling versions of virtually any fixed sampling linkage or association test can be readily developed, using various types of family designs (sibs, nuclear families, or extended pedigrees). As an example of this approach, we will discuss sequential approaches to the well-known simple "nonparametric" (or "model-free") test of linkage to a quantitative trait on sibpairs: the Haseman-Elston method (Haseman and Elston, 1972). Details can be found in Province (2000). It should be clarified that sequential methods are equally applicable to other situations, such as the variance components tests of linkage or regression-based tests of association in disequilibrium mapping.
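The classical Haseman-Elston regression referred to here can be sketched in a few lines. In its minimal form, the squared trait difference of each sibpair is regressed on the estimated proportion of alleles shared identical by descent (IBD) at a marker; a significantly negative slope is evidence of linkage. The simulated data below (the slope, noise level, and sample size) are illustrative assumptions, not values from the chapter.

```python
import random

def haseman_elston(ibd, sq_diff):
    """Ordinary least-squares slope and error variance for the H-E regression
    of squared sibpair trait differences (sq_diff) on IBD sharing (ibd)."""
    n = len(ibd)
    mx = sum(ibd) / n
    my = sum(sq_diff) / n
    sxx = sum((x - mx) ** 2 for x in ibd)
    sxy = sum((x - mx) * (y - my) for x, y in zip(ibd, sq_diff))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(ibd, sq_diff)]
    mse = sum(r * r for r in resid) / (n - 2)  # estimate of the error variance
    return slope, mse

# Illustrative simulation: at a linked marker, pairs sharing more alleles IBD
# have smaller squared trait differences, so a negative slope is expected.
random.seed(1)
pairs = 500
ibd = [random.choice([0.0, 0.5, 1.0]) for _ in range(pairs)]
sq_diff = [max(0.0, 1.0 - 0.5 * p + random.gauss(0, 0.3)) for p in ibd]
slope, mse = haseman_elston(ibd, sq_diff)
print(f"H-E slope = {slope:.3f} (negative suggests linkage), MSE = {mse:.3f}")
```

The error variance (MSE) returned here is the quantity the SMDP discussion below uses to rank markers.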
A. Sequential probability ratio test (SPRT)

In general, if θ is the parameter of interest and f(x_n | θ) is the likelihood function of the cumulative data x_n up to the nth data point (e.g., a sibpair), then to test the simple hypothesis H0: θ = θ0 against the simple alternative H1: θ = θ1, the SPRT procedure is to define the log of the ratio of the probabilities:

Z_n = ln [ f(x_n | θ1) / f(x_n | θ0) ]    for n = 1, 2, . . . .
By the maximum likelihood principle, if the data are more compatible with parameter θ1 (and in particular, if the simple hypothesis H1 is true), then this ratio will tend to be positive and will grow larger with increasing n. Conversely, if the data are more compatible with parameter θ0 (or if the simple H0 is true), then Z_n will tend to become more and more negative. Before sampling, prespecified limits are defined, a* < 0 < b*, which are functions of the desired type I and type II errors of the SPRT. Analysis of additional data stops when the first Z_n falls outside these limits, with a decision for the corresponding hypothesis. The method can be readily extended to both one-sided and two-sided compound hypothesis tests (Wetherill, 1966; Ghosh, 1970).
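A minimal SPRT for a normal mean illustrates the mechanics. Wald's classical approximations give the stopping limits as functions of the desired errors, b* ≈ ln[(1 − β)/α] and a* ≈ ln[β/(1 − α)]. Everything below (the normal model, known σ = 1, the particular θ0 and θ1) is an illustrative assumption, not part of the chapter.

```python
import math
import random

def sprt(data_stream, theta0, theta1, alpha=0.05, beta=0.05, sigma=1.0):
    """Wald's SPRT for H0: mean = theta0 vs H1: mean = theta1,
    normal data with known sigma. Returns (decision, n_used)."""
    a = math.log(beta / (1 - alpha))   # lower limit a* < 0
    b = math.log((1 - beta) / alpha)   # upper limit b* > 0
    z = 0.0                            # cumulative log-likelihood ratio Z_n
    for n, x in enumerate(data_stream, start=1):
        # log f(x | theta1) - log f(x | theta0) for a normal density
        z += ((x - theta0) ** 2 - (x - theta1) ** 2) / (2 * sigma ** 2)
        if z <= a:
            return "H0", n
        if z >= b:
            return "H1", n
    return "undecided", n

random.seed(0)
stream = (random.gauss(0.5, 1.0) for _ in range(10_000))  # truth: mean 0.5 (= theta1)
decision, n = sprt(stream, theta0=0.0, theta1=0.5)
print(decision, "after", n, "observations")
```

Note that the sample size at which the test stops is random: it depends on how quickly Z_n drifts toward one of the limits, which is exactly the property exploited in the genome-scan setting.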
B. Sequential multiple decision procedures (SMDP)

The SPRT approach gives the investigator greater control over both the type I and type II errors of each individual significance test than do fixed sampling methods. This, in turn, allows the investigator to better balance between the number of false positives and false negatives that must occur when a large number of such tests are conducted on a single body of data (i.e., to control the experiment-wise type I and type II errors). Unfortunately, even the SPRT approach does not meet the core of the problem head-on. One still must conduct a large number of significance tests as part of a whole-genome scan, which is the root cause of "inflation" of type I error in the first place. Is there a way of avoiding such a large number of tests altogether, so that nominal significance levels can be used instead of "adjusted" ones? Instead of conducting a large series of binary-response decision procedures (i.e., significance tests), each of which asks the simple yes/no question "Is the gene here?", does it not make more sense to ask the single, multiple-response question "Where are the genes?" (The implicit assumption is that there is at least one gene, albeit perhaps of very small effect, somewhere in the genome.) For this approach, we need to go beyond traditional hypothesis-testing theory. The sequential multiple decision procedures (SMDP) of Bechhofer et al. (1968) provide a powerful generalization of the traditional two-hypothesis
paradigm and give one potential solution to this problem. Here we partition the decision universe into U mutually exclusive and exhaustive hypotheses, H_i, where i = 1, 2, . . . , U ≥ 2, of which we want to select one. In the particular case of a genome-wide scan, we form every possible subset of markers and try to select the one subset that contains only the truly linked (or associated) ones. In traditional hypothesis testing, we have only two hypotheses, H0 and H1, two corresponding decisions, D0 and D1, and two types of errors, α and β, with the relationships, for any test procedure φ:

(1 − α) = Prob_φ [make decision D0 | H0 is true]
(1 − β) = Prob_φ [make decision D1 | H1 is true].

But in the SMDP framework, we will have "U" types of errors, α_i, where:

(1 − α_i) = Prob_φ [make decision D_i | H_i is true]    for i = 1, 2, . . . , U.

We wish to minimize the probability of any incorrect decision, or conversely, to maximize the Probability of a Correct Decision (PCD), which we denote by P*, subject to the condition that the "distance" between hypotheses (using an appropriately defined metric) is at least at a certain prespecified level of "effect size." An outline of the general theory and its derivation in this particular case are given in Province (2000), which is based upon the theory of Bechhofer et al. (1968). Basically, in the case of a genome-wide scan, we compare the relative evidence for linkage (or association) among the markers, instead of looking for absolute evidence at each marker. As evidence accumulates with each successive data point, the few "signal" markers will eventually distinguish themselves from the more prevalent background "noise" markers as a distinct subgroup, thus terminating the sequential analysis and identifying the hypothesis to select.
For the particular situation of a genome scan via the Haseman-Elston (H-E) method using a large number of markers, M, the first goal is to rank the evidence for linkage on some appropriate scale, θ[1] ≤ θ[2] ≤ . . . ≤ θ[M], where [i] denotes the index of the ith-ranked marker. In this case, we use the (sequential estimate of the) error variance from the H-E regression, σ²[1] ≤ σ²[2] ≤ . . . ≤ σ²[M]. Intuitively, this makes sense, since the regressions showing the smallest error variances will be the most significant ones, while the nonsignificant "noise" markers should have error variances nearly equal to the total variance of the response variable. Having achieved this ranking goal, we will then seek to divide the markers into two subsets: the highest "t" showing "nonsignificant" linkage and the lowest (M − t) "significant" ones, by splitting between rank [M − t] and rank [M − t + 1]. Formally, in this multiple-decision universe, out of all possible ways (U) to select t from M populations [so that there are U = M!/(t!(M − t)!) hypotheses], we want to select the one that
correctly separates the (M − t) truly linked ones from the "t" unlinked ones. The sufficient statistics for this procedure at sibpair h + 1 will be the (transformed) sequential sums of squared residuals, using the prediction (H-E) equation for all the preceding sibpairs. Then, the target effect size D* will be characterized in terms of a minimum "distance" D between the two adjacently ranked error variances at the critical juncture between ranks [M − t] and [M − t + 1] (the distance metric is defined in Province, 2000).
With these definitions, the SMDP procedure guarantees that

Prob {correct selection} > P*    whenever D > D*.

At each analysis stage, we have SMDP estimates of the mean-square error (MSE) as well as of the total SMDP R² of the regression for each marker, which are defined in a way analogous to fixed sampling regression, based upon the (sequential) sums of squares. This allows us to track the separation of the M markers into linked and unlinked groups as sampling progresses, on a uniform, understandable, standardized scale (e.g., see Figure 30.2).
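The bookkeeping behind this ranking can be sketched as follows. To be clear, this is not Province's SMDP stopping rule (the exact procedure and its distance metric are given in Province, 2000, and Bechhofer et al., 1968); it only illustrates the sufficient statistics: after each batch of additional sibpairs, recompute the H-E error variance for every marker, rank the markers, and watch the gap between adjacently ranked error variances. All data sizes, effect sizes, and the batch schedule below are illustrative assumptions.

```python
import random

def he_error_variance(x, y):
    """Error variance (MSE) of the simple H-E regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    ssr = sum((b - my - slope * (a - mx)) ** 2 for a, b in zip(x, y))
    return ssr / max(n - 2, 1)

# Illustrative data: M markers, exactly one truly linked (index 0).
random.seed(2)
M, pairs = 10, 200
ibd = [[random.choice([0.0, 0.5, 1.0]) for _ in range(pairs)] for _ in range(M)]
y = [1.0 - 0.6 * ibd[0][j] + random.gauss(0, 0.3) for j in range(pairs)]  # linked to marker 0

ranking = None
for n in range(20, pairs + 1, 20):  # "sequential" looks at the accumulating sample
    mse = [(he_error_variance(ibd[m][:n], y[:n]), m) for m in range(M)]
    ranking = sorted(mse)            # ascending: smallest error variance = strongest signal
    gap = ranking[1][0] - ranking[0][0]  # gap between ranks [1] and [2] (here t = M - 1 noise markers)
    print(f"n={n:3d}  best marker={ranking[0][1]}  gap to next={gap:.4f}")

# With this illustrative seed and a strong effect, the linked marker (0) ranks first.
print("final best marker:", ranking[0][1])
```

In the real procedure the gap between ranks [M − t] and [M − t + 1] is compared against the target effect size D*, and sampling (or, here, analysis of further fixed-sample sibpairs) stops once the split is sufficiently clear.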
C. A real data example

Use of this sequential method is illustrated in an example data set distributed with the S.A.G.E. package (1997), which is used to demonstrate Haseman-Elston linkage analysis of the DBH phenotype using 22 markers assessed on a sample of sibships with some 500 sibpairs (Wilson et al., 1988). In Figure 30.2, we show the results of the SMDP test. The markers are listed in the final order ranked by the SMDP method at its terminal sibpair. For the sequential method, we analyze the sibpairs in the order of increasing ID numbers (the actual sampling order is not given with the data set). For simplicity, in this example, we ignore the fact that sibpairs from sibships of size greater than 2 are nonindependent (since this is an equally annoying problem in any adaptation of the H-E approach and not unique to sequential methods; were we to contrast the fixed sample variance components approach to linkage with its sequential counterpart, the sampling unit would be pedigrees and not sibpairs, which would obviate the problem). We therefore use all possible sibpairs with uncorrected df, which may tend to inflate the "false positive" evidence for linkage for both methods considered. Traditional fixed sampling H-E applied to the entire data set finds significant linkage at four markers: C3 (p = 0.0025), ABO (p = 0.0005), BF (p = 0.0369), and ADA (p = 0.0013) at an αT = 0.05.

Figure 30.2. Change in SMDP R² with sequential sampling for the S.A.G.E. DBH example. The progress in R² is tracked for each marker as the number of sibpairs increases. Markers are listed in the order determined by the SMDP at its decision pair, N = 132. The two markers determined to have the highest R² (lowest MSE), C3 and ABO, are shown in bold (from Michael A. Province, A single, sequential, genome-wide test to simultaneously identify all promising areas in a linkage scan, Genetic Epidemiology © 2000. Reprinted by permission of Wiley-Liss, Inc., a subsidiary of John Wiley & Sons, Inc.).

With 22 markers, if we apply the Bonferroni correction to ensure that the experiment-wise type I error is no larger than αE = 0.05, then the individual testwise type I error should be set at αT = 0.0023. Thus, only ABO and ADA would be considered "significant" in this context, with C3 just barely missing the cutoff. The SMDP method (using the simplified procedure described in Province, 2000) with a P* of 0.95 and a D* of 0.08 reached its terminal decision at the 132nd sibpair. From it, we conclude that 2 of the 22 markers, C3 and ABO, are "significantly" linked, with an overall (experiment-wise) "p value" < (1 − P*) = 0.05, with SMDP R² of 0.10 and 0.05, respectively. The rest of the markers all had SMDP R² < 0.007. The change in SMDP R² for all 22 markers as sampling progressed is shown in Figure 30.2. As can be seen, not until sibpair 132 is reached is it clear that the SMDP R² values for C3 and ABO are substantially better than those for the rest of the markers, and this conclusion remains the same if we continue to sample until the end of the data set. It is interesting to note that these
two markers are not consistently the ones with the highest SMDP R² throughout the entire sampling phase, and in fact showed some of the lowest SMDP R² values at the early stages of sampling.
D. Monte Carlo simulation of genome-wide scan

As instructive as it is to apply these methods to real data, doing so can provoke endless debate, since the underlying truth is unknown. If the new procedures give the same answer as the traditional ones, one might ask, why bother with them? And if they do not, which is the correct answer? A simulation experiment can often be a better vehicle for judging the relative merits of alternative methods. To evaluate the utility of the SPRT and SMDP approaches in comparison to fixed sampling methods, a Monte Carlo simulation experiment was conducted: of a 400-marker genome scan on up to 200 sibpairs, only one of the markers was truly linked to the single quantitative trait. This corresponds to the kind of "dense marker" genome scan that is being conducted by many contemporary family studies (e.g., the NHLBI Family Heart Study, and the four hypertension networks sponsored by the NHLBI: HyperGEN, SAPPHIRe, GENOA, and GenNet). Indeed, the CHLC-10 set consists of about 404 markers. In this simulation, 100 replications were generated, with the true linked locus having an H-E slope of β = −0.5 and an error variance of σ² = 0.09, which gives an R² of 0.31. For each replication, the same data were analyzed by both methods: fixed sample H-E (at various fixed Ns) and SMDP. All simulated data were generated, analyzed, and summarized in SAS (1988). For simplicity, all markers were generated as completely informative (IBD = IBS) and completely independent of one another (i.e., θ = ½ between every pair of markers). Thus, this represents a situation of fairly high power to detect the true linked locus; but because of the large number of loci considered, there should be a large number of false positive signals by chance alone. In fact, since we are generating data under 399 independent null hypotheses, we would expect 19.95 of them to be significant at the α = 0.05 level.
This simulation is summarized in Table 30.1, in which we tabulate the number of test results that are incorrect decisions (false positive or false negative). To get the expected number of false/true positives/negatives in a single 400-marker genome scan, divide the corresponding table entries by 100 replications. For the fixed sampling H-E, there are 400 tests conducted in 100 replications, which yields a total of 40,000 significance tests. For SMDP H-E, there is only ONE test per replication, for a total of 100 decisions. The fixed sampling results are shown in Table 30.1 for N = 75 sibpairs, which corresponds roughly to the ASN = 73 for the SMDP method, so that we may more easily compare the misclassification rates at approximately the same level of sampling.
Table 30.1. Monte Carlo Simulation (100 Replications) of a 400-Marker Genome Scan for a Single QTL Using Sibpairs, Comparing Fixed Sampling and SMDP Versions of the Haseman-Elston Test: Total Number of Test Results by Category (a)

(False positives: truth H0 (unlinked), conclusion H1; false negatives: truth H1 (linked), conclusion H0.)

                                           False       False       Incorrect    Total decisions
Conclusion from analysis                   positives   negatives   decisions    (tests)
Fixed sampling, N = 75 sibpairs
  Nominal significance [α = 0.05]          2,035       0           2,035        40,000
  Bonferroni corrected (b) [α = 0.000128]  2           11          13           40,000
  Lander-Kruglyak [α = 0.000022]           0           24          24           40,000
SMDP, ASN = 73 sibpairs,
  range = (31-136), P* = 0.95, D* = 10     1           1           1            100

(a) Table entries are totals across all 400 markers × 100 replications. To get the expected number per genome scan, divide by 100.
(b) Bonferroni gives test-wise αT for M = 400 independent tests to achieve a replication-wise type I error of αE = 0.05.
Source: Province (2000).
Michael A. Province
With fixed sampling H-E, at a nominal αT = 0.05 significance level for each test, we detect the true locus in all 100 replications, but at the cost of 2,035 false positive signals (this is slightly higher than the expected number of 1,995). If we use the Bonferroni correction on M = 400 tests, then we require a significance level of αT = 0.000128 for a given test to assure that the experimentwise (replication-wise) error is αE = 0.05. This reduces the number of false positives to only 2, but at the cost of missing the truly linked gene 11 times (11 false negatives), for 13 incorrect decisions overall. If we go further, following the recommendations of Lander and Kruglyak (1995), by requiring an individual marker to have a significance level of αT = 0.000022, then we succeed in their goal of eliminating all false positives, but at the expense of missing 24 true genes (false negatives). In this case, this highly conservative correction actually produces 24 total misclassifications, which nearly doubles the number obtained when the Bonferroni correction was used. If we count missing a true signal as at least as onerous as inferring a false one, then the Lander and Kruglyak levels are too conservative, as argued by several authors (e.g., Morton, 1998; Rao, 1998). The SMDP (with P* = 0.95, D* = 10) reached the correct conclusion for 99 of the 100 replications with an ASN = 73 sibpairs (range 31-136). Thus, the total number of incorrect decisions was only 1 for SMDP (the single false negative is the flip side of the same false positive; both represent the single replication in which one of the other 399 markers was chosen as "significant"). Of course, by reducing substantially the number of tests required, from 400 per genome scan for the fixed sampling approach to only 1 for SMDP, we have gone a long way toward reducing the misclassification rates.
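The per-test levels quoted above can be recovered directly; the 0.000128 figure is the Šidák form of the Bonferroni correction for M = 400 independent tests (the Lander-Kruglyak level of 0.000022 comes from their Gaussian-process theory and is simply quoted in the text):

```python
M, alpha_E = 400, 0.05

# Sidak form of the Bonferroni correction: choose alpha_T so that the
# probability of at least one false positive among M independent null tests
# is alpha_E; this reproduces the 0.000128 used in Table 30.1.
alpha_T = 1 - (1 - alpha_E) ** (1 / M)
print(round(alpha_T, 6))             # 0.000128

# Expected number of chance hits among the 399 null markers at alpha = 0.05
print(round(399 * 0.05, 2))          # 19.95
```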
V. DISCUSSION

If we accept the logical distinction between hypothesis generation and hypothesis testing, it is clear that no matter what procedure is used during hypothesis generation (even if one uses sequential methods), there will be arguments about the proper interpretation of "significance levels" produced in this phase. Whether one takes an extreme view or a more moderate position on the false positive/false negative trade-off issue, it is clear that p values cannot be taken at simple face value in the case of hypothesis generation, and all other "interpretations" are subject to "reinterpretation." The goal of every investigator should therefore be to move as quickly, efficiently, and accurately as possible from the ambiguous hypothesis generation phase to the more solid, traditional hypothesis-testing phase, where "a p value is a p value." The sequential approach accomplishes this with a much smaller ASN than fixed sampling methods, while giving tight error control. The SMDP provides a single, genome-wide test, which essentially partitions all loci into two distinct groups: the linked and unlinked
30. Sequential Analysis for Genome Scans
loci. In this context, the absolute evidence of linkage at each location is not at issue; rather, the procedure focuses on the relative evidence of linkage as being more important. In other words, the approach not only grants that the trait in question is genetic and that unmeasured genes exist somewhere in the genome linked to one or more markers, but asks "Where are the most likely locations of these trait genes?" Since for any trait there will be many truly unlinked regions in a whole-genome scan, using the relative rather than the absolute evidence will still include the "null" hypothesis of no linkage. This approach is called a multiple-decision procedure because it is a generalization of the traditional hypothesis test, in which a decision is made between two, and only two, mutually exclusive possibilities. In the SMDP setting, the decision is to choose exactly one from a partitioning into more than two hypotheses. In the case of a genome scan, one forms the set of all subsets of loci, U, so that the test is to find that one subset out of all possible subsets that contains the linked genes and only the linked genes. Since it is sequential, the SMDP retains the desirable properties discussed earlier: namely, it zeros in on the hit regions with predefined, analyst-specified type I and type II errors, using (on average) a smaller sample than the corresponding fixed sampling test. Also, since it is a single test for all regions simultaneously, questions about the differences between the locuswise and genomewise type I and type II errors do not arise (as they do when one conducts multiple fixed sampling marker-by-marker tests). Because of these very compelling advantages, the sequential analysis line of research may continue to be as important in dissecting the genetic nature of complex traits as it was in the very early days when Newton Morton introduced the lod score and revolutionized genetic epidemiology.
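For the two-hypothesis case that the SMDP generalizes, the "tight error control" of a sequential test is explicit in Wald's SPRT stopping boundaries. A minimal sketch of the classical SPRT boundaries (not the SMDP itself):

```python
import math

def sprt_bounds(alpha, beta):
    """Wald's SPRT: keep sampling while log B < log LR < log A. Crossing log A
    accepts H1 with type I error <= alpha; crossing log B accepts H0 with
    type II error <= beta (approximately, ignoring boundary overshoot)."""
    A = (1 - beta) / alpha
    B = beta / (1 - alpha)
    return math.log(B), math.log(A)

lo, hi = sprt_bounds(alpha=0.05, beta=0.05)
print(lo, hi)   # symmetric boundaries, about -2.94 and +2.94
```

The log-likelihood ratio is updated after each new sib pair, so the test stops as soon as the evidence is decisive, which is why the ASN (here 73) falls below the comparable fixed sample size.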
Acknowledgments This work was partly supported by National Institutes of Health grants GM28719 from the National Institute of General Medical Sciences, and HL54473 from the National Heart, Lung, and Blood Institute. The S.A.G.E. results were obtained by using the program package S.A.G.E., which is supported by a U.S. Public Health Service resource grant (1 P41 RR03655) from the National Center for Research Resources.
References
Bechhofer, R. E., Kiefer, J., and Sobel, M. (1968). "Sequential Identification and Ranking Procedures." University of Chicago Press, Chicago.
Carlotto, M. J. (1997). Evidence in support of the hypothesis that certain objects on Mars are artificial in origin. J. Sci. Explor. 11(2), 123-146.
Curtis, D. (1996). Genetic dissection of complex traits. Nat. Genet. 12, 356-357.
Feingold, E., Brown, P. O., and Siegmund, D. (1993). Gaussian models for genetic linkage analysis using complete high-resolution maps of identity by descent. Am. J. Hum. Genet. 53, 234-257.
Ghosh, B. K. (1970). "Sequential Tests of Sequential Hypotheses." Addison-Wesley, Reading, MA.
Haseman, J. K., and Elston, R. C. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2, 3-19.
Lander, E., and Kruglyak, L. (1995). Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat. Genet. 11, 241-247.
Morton, N. E. (1955). Sequential tests for the detection of linkage. Am. J. Hum. Genet. 7, 277-318.
Morton, N. E. (1998). Significance levels in complex inheritance. Am. J. Hum. Genet. 62, 690-697.
Province, M. A. (2000). A single, sequential, genome-wide test to simultaneously identify all promising areas in a linkage scan. Genet. Epidemiol., in press.
Rao, D. C. (1998). CAT scans, PET scans and genomic scans. Genet. Epidemiol. 15, 1-18.
Risch, N., and Botstein, D. (1996). A manic depressive history. Nat. Genet. 12, 351-353.
S.A.G.E. (1997). Statistical Analysis for Genetic Epidemiology, Release 3.0. Computer program package available from the Department of Epidemiology and Biostatistics, Rammelkamp Center for Education and Research, MetroHealth Campus, Case Western Reserve University, Cleveland, OH.
SAS Institute, Inc. (1988). "SAS/IML User's Guide, Release 6.03 Edition." SAS Institute, Inc., Cary, NC.
Siegmund, D. (1985). "Sequential Analysis: Tests and Confidence Intervals." Springer, New York.
Thomas, D. C., Siemiatycki, J., and Dewar, R. (1985). The problem of multiple inference in studies designed to generate hypotheses. Am. J. Epidemiol. 122, 1080-1095.
Wald, A. (1947). "Sequential Analysis." Dover, New York.
Weller, J. I., Song, J. Z., Heyen, D. W., Lewin, H. A., and Ron, M. (1998). A new approach to the problem of multiple comparisons in the genetic dissection of complex traits. Genetics 150, 1699-1706.
Wetherill, G. B. (1966). "Sequential Methods in Statistics." Wiley, New York.
Whitehead, J. (1983). "The Design and Analysis of Sequential Clinical Trials." Wiley, New York.
Wilson, A. F., Elston, R. C., Siervogel, R. M., and Tran, L. D. (1988). Linkage of a gene regulating dopamine-β-hydroxylase activity and the ABO blood group locus. Am. J. Hum. Genet. 42, 160-166.
Witte, J. S., Elston, R. C., and Schork, N. J. (1996). Genetic dissection of complex traits. Nat. Genet. 12, 355-356.
From Genetics to Mechanism of Disease Liability

Jean-Marc Lalouel
Howard Hughes Medical Institute
University of Utah Health Sciences Center
Salt Lake City, Utah 84112
I. Summary
II. Introduction
III. Statistical Inference
IV. Functional Significance at the Molecular Level
V. Mechanism at the Cellular and Organ Levels
VI. Analysis at the Level of the Whole Organism
VII. Evolutionary Implications
References
I. SUMMARY

The molecular basis of single-gene Mendelian disorders resulting from gain or loss of function is being clarified at a rapid pace. Progress in the genetics of common disease, by contrast, has been frustratingly limited, as we discuss by reference to essential hypertension (EH). The application of standard genetic paradigms to hypertension research has yielded remarkable findings. Arterial pressure (AP) variation in laboratory rats has been correlated with various genes. Likewise, rare Mendelian hypertension syndromes are increasingly understood in molecular terms. The implications of these findings for EH have proven to be modest, however. Genetic methods have been applied to investigate essential hypertension directly in humans, with mixed results. The power of such methods to identify genetic determinants of EH has been questioned.

Advances in Genetics, Vol. 42. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.
The issues confronting the genetic analysis of EH are discussed by drawing on our ongoing work on the hypothesis that molecular variants of the angiotensinogen gene may constitute inherited predispositions to the condition. Simply establishing correlation is already a daunting task. Far more challenging yet is to establish causation for a physiological phenotype, that is, to understand the mechanism by which a genetic factor may predispose to essential hypertension. Susceptibility imparted by genetic variation, modest and quantitative, modulates response to environmental exposure over time. The product of the gene under examination may be highly pleiotropic, being involved with multiple physiological processes in multiple tissues. Finally, as physiological phenotypes are defined at the level of the entire organism, ultimate demonstration of genetic determination may require specific genetic manipulations in entire organisms.
II. INTRODUCTION

The advent of molecular techniques in the early 1980s held the promise that the genetic basis of human disorders would soon be understood in molecular terms. While such expectations have been met for a host of classical Mendelian disorders resulting from gain or loss of function at single loci, the same cannot be said of common human disease. This is evident for common cardiovascular disease, particularly essential hypertension (EH), which we use as an illustrative example throughout. The application of standard paradigms has yielded remarkable findings. Analysis of variation in arterial pressure (AP) in laboratory rats has confirmed suspicion about the involvement of certain genes in circulatory homeostasis. Likewise, genetic analysis of classical Mendelian hypertension syndromes has led to the identification of the molecular defects accounting for the unique physiology of these rare single-gene disorders. The implications of these findings for EH, a condition of unknown cause with a lifetime prevalence of 30-50% in affluent societies, have proven to be modest, however. Ambiguous phenotypic delineation, genetic heterogeneity, and various confounding factors have been appropriately cited to account for this sorry state. Such challenges have been encountered and handled adequately in classical genetics. The defining feature of the genetics of common disease is that susceptibility imparted by genetic variation is modest and quantitative, modulates response to environmental exposure, and achieves significance only through cumulative integration of lifetime experience. Major conceptual issues for which we have no obvious answers limit the power of genetic investigations. Is EH to arterial pressure what mental retardation is to IQ? Many genes affect brain development, but is IQ a measure of
brain function? Mental retardation can result from various genetic mechanisms, but it can also result from environmental insult. Similarly, alterations in very many genes affect global physiology and as such impact on AP. Most investigators agree that EH is unlikely to represent the collection of a large number of rare single-gene Mendelian disorders. Was Pickering right and Platt wrong? It is a common view that hypertension is clinically defined by an arbitrary threshold on the continuous distribution of AP and that it is a true polygenic disease resulting from the cumulative effect of a large number of genes, each individually negligible. Under this model, there would be no merit in pursuing genetic investigations of EH, inasmuch as the genetic effects are by definition unidentifiable. There are reasons to entertain a more optimistic view, however, but this requires the acceptance of a few fundamental concepts that shape the genetic strategy. First, the challenge of clinical definition based on AP measurement, however real, in no way implies that there are not underlying factors that exert discrete effects. The lack of obvious bimodality or inflection points in the distribution of AP is actually compatible with models postulating heterogeneity and one or more genetic variants exerting major effects in progression to hypertension and/or its complications. To be risk factors for as common a clinical outcome as EH, they must themselves occur at high frequency in the population. These "oligogenes," then, are the only effects that stand a chance of being identified. Each of these variants may affect clinical progression through one or more distinct pathophysiological mechanisms. Common genetic factors affecting distinct physiological pathways would generate overlapping entities that would defy conventional classification. These genetic effects would be major only by contrast to the undetectable polygenic effects.
Because they affect clinical progression over several decades, their net, instantaneous effects may prove to be quite small and therefore difficult to demonstrate. Indeed, by reference to infinitesimal calculus, what is "small" after integration over time? These views lead to a model of EH (Figure 31.1) that includes significant over-diagnosis (maybe 10-15% of EH), a large proportion of sporadic cases or "phenocopies" primarily associated with overweight (60-80% of EH?), and more severe and progressive entities involving underlying oligogenic factors, each factor promoting a particular physiological imbalance. For simplicity, only two such factors are postulated here, predisposing to a sodium-sensitive form and to an atherosclerosis-related form of EH, respectively. This perspective reflects physiological evidence for at least two distinct pathophysiological mechanisms of hypertension. Phenocopies involve undetectable polygenic determination and are defined as such only relative to the postulated oligogenic effects.
Jean-Marc Label
Figure 31.1. Conceptual representation of etiological heterogeneity of essential hypertension.
A common feature of classical Mendelian disorders is that genetic variation not only establishes correlation between gene and disease but also often directly demonstrates causality. Gain or loss of function can be inferred from observation of a frameshift or nonsense mutation. It is unlikely that
Figure 31.2. From genetics to mechanism of disease liability. (The figure arranges the levels of investigation as questions: What gene? What variant? In what cell? What does it do in/out of the cell? What organ? What process? Does it affect arterial pressure? How?)
common genetic determinants of EH result from such discrete genetic alterations; rather, they probably reflect small, graded effects operating over time. With the foregoing model in mind, it becomes evident that understanding genetic mechanisms of EH should prove challenging at all levels of observation (Figure 31.2). We need to understand the genetic mechanism at the molecular level. The next stage of integration is the cellular level. Evidently, unlike cancer, hypertension is not defined at this level. Rather, we must understand regulation of gene expression in cells expressing the gene at hand, or the mediating function resulting from its expression. The issue may be complicated by the expression of the gene in various tissues and organs. For a physiological phenotype such as hypertension, defined at the whole-body level, the ultimate demonstration of the causal effect of a genetic substitution rests on the development of an appropriate animal model. Our discussion draws on our investigations of angiotensinogen as a risk factor for EH.
III. STATISTICAL INFERENCE

A very large number of genes, maybe a thousand or more, are involved in circulatory homeostasis. For any of these genes to contribute to interindividual variation in AP, evidently they must host functional molecular variants. Genetic mapping in rat models of hypertension, relying on chance occurrence of such variants in laboratory stocks used in experimental crosses, has led to the identification of significant linkage on the majority of rat chromosomes. Assuming that laboratory strains constitute a limited sampling of all possible functional variants that can affect AP, it becomes evident that significant genetic determinants of AP could occur on all chromosomes, and perhaps in many chromosomal segments within each chromosome. Whether random variation occurring in the rat genome offers a means to identify genetic determinants of EH in humans remains an unresolved issue. The direct search for genetic determinants of hypertension in humans, despite the obvious lack of experimental control, may be justified on several grounds. The large number of genetic variants randomly sampled through animal models may not appreciably reduce the number of candidate loci that may have to be considered in humans. The few oligogenic variants relevant in human EH, considering the large number of possible candidate genes and the independence of mutational events, may show no correlation with the animal variants. Human genetic variants accounting for EH must have an evolutionary origin rooted in the unique experience of the species. Finally, genetic determinants of EH in humans are not likely to have arisen through direct selection on the level of AP.
A. Linkage and association between angiotensinogen variants and EH

As with other common diseases, the likelihood of successfully identifying genetic determinants of EH rests on critical choices concerning phenotypic definition, family structures sampled, selection of homogeneous subsets, choice of markers, and method of analysis. These issues are well covered in the literature, including chapters in this volume, and are not reviewed here. We only summarize our initial findings, as they constitute the first step of our discussion. In a collaborative study involving scientists in France and in Utah, tests of linkage were performed in the three candidate genes encoding components of the renin-angiotensin system (RAS) that had been cloned at the time: renin, angiotensin-converting enzyme (ACE), and angiotensinogen (AGT, the gene). Genetic linkage and allelic association were identified for AGT (Jeunemaitre et al., 1992). Lacking unambiguous intermediate phenotypes, phenotypic definition was based on clinically defined hypertension requiring medical intervention. Without obvious clues to sort out heterogeneity and phenocopies, the only attempts at generation of more homogeneous subsets entailed the selection of cases on the basis of age at first diagnosis and clinical severity, through the proxy of the use of at least two antihypertensive agents for AP control. Linkage was tested in hypertensive sibling pairs, using a multiallelic microsatellite marker. Internal validation of the inference rested on replication in two independently ascertained samples. Significant associations with EH and plasma angiotensinogen (Ang, the protein) concentration were observed with the variant T235, encoding the presence of a threonine instead of a methionine at amino acid 235 of mature Ang. Further confirmation was obtained in a case-control study of hypertensive and normotensive subjects in Japan (Hata et al., 1994).
The same variant was found in significant association with preeclampsia (Ward et al., 1993). From the outset, it was well understood that whether T235 was causal or simply a marker for one or more unidentified causal variants could not be established from statistical evidence of association alone (Jeunemaitre et al., 1992).
B. Replication and power

Various attempts at replication of the statistical association with AGT T235 illustrate the challenge of ascertaining the "true" significance of a reported association. The many studies published have been reviewed on at least two occasions (Kunz et al., 1997; Staessen et al., 1999). The potential pitfalls of case-control studies are well known, particularly the concept of bias through unidentified population stratification. Two other critical issues in evaluating such studies are too often underappreciated: replication and power. Although
trivial and rather obvious, these concepts are so often ignored that we have found it necessary to address them in a separate publication. For a study to constitute a valid replication of an earlier report, other factors being controlled, it must use similar criteria in the definition of cases and controls. By contrast with the initial report (Jeunemaitre et al., 1992), few of the replication attempts gathered in meta-analyses have used a clinical definition of hypertension and ascertainment through a specialized clinic. Some have used self-report of hypertension without validation through review of medical records; others have selected cases based on arbitrary cutoffs derived from single measurements of AP, medical insurance reimbursement rate as a "proxy" for hypertension, or hypertension among survivors of myocardial infarction. None of these reports have used the definition of clinical hypertension applied in the original report. Nor have they used selection of cases from sibships with multiple hypertensive patients. The statistical interpretation of tests of association has proven to be even more misleading and confusing. Attempts at replication of the T235 association have been split, showing either a significant or a nonsignificant association. In the latter case, it is common to refer to such reports as "negative" evidence, evidence "against," or "lack" of association. When studies are considered in the aggregate, it is usual to qualify the results as "discordant" or "inconsistent" and the evidence as "controversial." Notwithstanding the replication issue considered earlier, this analysis and language reflect only a lack of understanding of the statistical concepts underlying the Neyman-Pearson theory of statistical testing. For the statistically minded audience, this issue need not be discussed in detail.
Suffice it to say that both type I and type II errors play important and symmetrical roles in statistical inference (e.g., see Chapter 29 by Rao and Gu in this volume). In the original report of an association, emphasis naturally is on type I error, namely, the probability that the claim for a significant association does not result from chance fluctuation alone. In a replication attempt, by contrast, where the authors seek to verify whether a significant association does indeed exist, type II error and its complement, power, are critical. Specifically, it is critical to measure the probability that a difference of the magnitude reported in the initial report be identified in the replicating study. Two studies commonly cited as evidence against the initial report of Jeunemaitre et al. (1992) have power equal to or less than 50%, largely as a result of their small sample size (Bennett et al., 1993; Caulfield et al., 1994). The appropriate interpretation is that such studies are inconclusive. Indeed, when the concept of power is kept in mind, the collection of reports with either significant or nonsignificant associations is consistent: it is actually expected from first principles of statistical theory. The current trend in genetic studies of common disease is an emphasis on tests of association and their possible advantages over linkage studies (Risch
Jean-Marc Lalouel
524
and Merikangas, 1996). Leaving aside the delicate issue of multiple comparisons, one can only wonder how many significant findings will be buried through a collection of replicating attempts of dubious value or significance, how many leads will be lost, and how many papers and grant applications will be rejected. Indeed, by what means will meaningful reports be separated from the background noise of type I errors? More than ever, geneticists will continue to love hating association studies.
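The power argument above can be made concrete with a standard normal-approximation calculation for a case-control comparison of allele frequencies. The frequencies and sample sizes below are hypothetical; the chapter does not give the figures for the studies it criticizes.

```python
import math

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_two_prop(p1, p2, n1, n2, z_alpha=1.96):
    """Approximate power of a two-sided two-proportion z-test
    (normal approximation, ignoring the small lower rejection tail)."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return norm_cdf(abs(p1 - p2) / se - z_alpha)

# Hypothetical T235 frequencies: 0.45 in cases vs 0.38 in controls
small = power_two_prop(0.45, 0.38, 100, 100)   # small replication study
large = power_two_prop(0.45, 0.38, 500, 500)   # larger replication study
print(round(small, 2), round(large, 2))
```

Under these assumptions the small study has power well under 50%, so a nonsignificant result from it is inconclusive rather than "negative" evidence, which is exactly the point being made here.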
C. Haplotype studies For any gene under investigation, multiple genetic polymorphisms are usually observed. This can be both a bane and a plague. Multiple variants can be used to construct haplotypes in an attempt to ascribe association to specific subsets. Multiple variants at a locus are likely to exhibit extended linkage disequilibrium, and as a result functional tests may have to be applied to multiple diallelic markers, either singly or in combination, drastically increasing the effort to define function at the molecular level. The situation is well illustrated by AGT. Many common variants have been identified, as illustrated in Figure 31.3. Haplotype studies, while resolving AGT variants into a limited number of most common haplotypes (Figure 31.4), have not led to the unambiguous identification of a haplotype displaying greater association with EH than seen with T235 alone (Jeunemaitre et al., 1997). Extensive linkage disequilibrium, however, further complicates functional studies.
Figure 31.3. Molecular variants of human angiotensinogen.
  Common (>0.05): nucleotide -1074, -830, -793, -776, -532, -216, -20, -6, 67, -13(Int3), 2054; codon 174, 235.
  Rare (<0.05): nucleotide -366, -152, 31; codon 10, 104, 199, 246, 271, 339, 359, 388.
Figure 31.4. Common haplotypes of human angiotensinogen. (Labels in the figure include M235, T235, and A(-20).)
IV. FUNCTIONAL SIGNIFICANCE AT THE MOLECULAR LEVEL

New reports every week grace the pages of prestigious journals documenting the gene and the mutations accounting for a rare Mendelian disorder. The global news value of the report owes more to surprise concerning the natural function of the gene uncovered than to the sheer intellectual brilliance of the work involved. Those of us interested in the genetics of common disease have abandoned the hope for such instant recognition. Mutations observed in a candidate gene are not likely to establish mechanism, as would gain or loss of function. Rather, we expect modest, quantitative differences that bear significance in the long run. Functional proof will be arduous, proceeding through all levels of observation, including molecular, cellular, organ, and whole body. AGT again can serve as an example. Of the multiple variants identified, which are functionally significant? Given the large number of such variants, it may not be practical to test them all. Even when we focus on diallelic markers exhibiting the most significant association, more than one diallelic marker may have to be considered, given the extensive degree of linkage disequilibrium encountered. For AGT, we soon found that T/M(235) was in complete association with a polymorphism occurring 6 nucleotides upstream from the initiation site of AGT transcription, A/G(-6): with rare exceptions, genes carrying T235 also carry A(-6), while genes with M235 exhibit G(-6). Either or
both could be functionally significant, or they could serve as markers for yet another unknown, causal determinant. To test whether T/M(235) affects protein function, both variants were expressed transiently in cultured cells, and the Ang so produced was tested for its reaction rate with purified human renin (Inoue et al., 1997). No significant differences were found in the kinetic parameters of the reaction under the standard conditions of the tests. Furthermore, both variants exhibited similar glycosylation, secretion, and stability in this model. The A/G(-6) polymorphism occurs in a segment of the proximal promoter of AGT that, on the basis of deletion studies (Fukamizu et al., 1990), appeared to be of critical significance in AGT transcription. DNA binding studies revealed that variation at the site affected specific interactions with nuclear proteins. To test whether the A/G(-6) polymorphism could affect the basal rate of transcription of AGT, segments of the AGT promoter with either A or G at (-6) were cloned upstream of the Luciferase reporter gene and transfected into cultured cells. The resulting reporter activity detected reflects promoter transactivation. Such promoter assays are routinely performed in the transcription field, and most experimental aspects of the work are well defined: experiments are performed in parallel in triplicate; positive and negative controls are included in all experiments; a second reporter is cotransfected to standardize for variation in transfection efficiency; and multiple experiments are performed to provide increased power and consistency, using several independent DNA preparations. Experiments were performed following and exceeding such standards; they were presented in a report reflecting the results of more than 2000 individual transfections (Inoue et al., 1997). The data were first analyzed by two-way analysis of variance that appropriately reflected the actual design of the experiments.
While very large, well-anticipated differences in overall expression level were observed among experiments, the nucleotide substitution at (-6) led to moderate but reproducible differences in transcriptional activity that achieved statistical significance at levels ranging from 10⁻¹⁰ to 10⁻¹⁶. In presenting our data in detail (Figure 31.5), without the customary rounds of standardization performed in such expression studies, we exposed ourselves to a variety of criticisms, not the least of which was the consensus view that such experiments cannot detect differences in gene expression that are less than twofold. This view, received orally from experts as well as in pertinent reviews in response to our submitted manuscript, failed to recognize the significance of statistical methodology as a tool to ascertain signal over noise, and it delayed publication by over a year. A simple cure was to follow custom by standardizing data within experiments and pooling replicates to apply a t-test to the resulting data. The same significance levels were achieved, but this standardization produced a twofold difference when means were expressed relative
Figure 31.5. Transactivation experiments testing the significance of the G/A(-6) polymorphism: detailed presentation.
Figure 31.6. Transactivation experiments testing the significance of the G/A(-6) polymorphism: conventional presentation.
to the standard deviation (Figure 31.6). Evidently, the notion that an absolute difference cannot be appreciated without reference to its corresponding standard deviation could not resolve well-entrenched misconceptions. Presenting data in appropriate detail, in a field where convention favors so much reduction that no troubleshooting can be performed, had proven to be quite unproductive. Our data provided evidence that the G/A(-6) polymorphism could affect the basal transcriptional activity of the core AGT promoter in vitro. The extent of the difference noted, 20-50%, was small compared with typical results achieved in classical expression studies. As we see it, this again stresses a fundamental distinction between the statistical inference required to ascertain graded, quantitative differences and the direct deduction from observations that can be achieved when discrete, all-or-none differences are under investigation. Evidently, we acknowledged that "it is clearly not possible to directly extend the results of transfection experiments done with truncated AGT promoters in cultured cells to the function of an intact AGT gene at the level of the whole organism" (Inoue et al., 1997). Such experiments, however, do provide the opportunity to test the potential impact of such a polymorphism on gene expression and the elaboration of a molecular hypothesis for further examination. The difficulty was further compounded by the observation that an additional variant of AGT occurring 67 residues downstream from initiation of transcription was also in complete disequilibrium with G/A(-6) (Ishigami et al., 1999, and our unpublished observations). In preliminary work, we have found that this additional variant could also impact on AGT transcription in vitro.
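The statistical point of the passage above, that a reproducible 20-50% allelic effect can reach high significance when large between-experiment variation is modeled explicitly, can be illustrated with simulated data. The sketch below is not the published data: the experiment levels, effect size, and noise are invented for illustration.

```python
import random

random.seed(1)

# Illustrative simulation (not the published data): 10 independent transfection
# experiments, alleles G and A at (-6), 3 replicate wells per cell.
n_exp, n_rep = 10, 3
allele_shift = {"G": 0.0, "A": 0.2}                           # assumed ~20% allelic effect
exp_level = [random.uniform(0.5, 3.0) for _ in range(n_exp)]  # large batch effects

data = {(e, a): [exp_level[e] * (1 + allele_shift[a]) + random.gauss(0, 0.05)
                 for _ in range(n_rep)]
        for e in range(n_exp) for a in "GA"}

# Balanced two-way ANOVA (experiment x allele), computed from first principles.
all_y = [y for cell in data.values() for y in cell]
grand = sum(all_y) / len(all_y)
mean_a = {a: sum(sum(data[(e, a)]) for e in range(n_exp)) / (n_exp * n_rep)
          for a in "GA"}
mean_e = [sum(sum(data[(e, a)]) for a in "GA") / (2 * n_rep) for e in range(n_exp)]

ss_allele = n_exp * n_rep * sum((m - grand) ** 2 for m in mean_a.values())
ss_exp = 2 * n_rep * sum((m - grand) ** 2 for m in mean_e)
ss_within = sum((y - sum(data[k]) / n_rep) ** 2 for k in data for y in data[k])
df_within = n_exp * 2 * (n_rep - 1)
f_allele = ss_allele / (ss_within / df_within)   # allele effect has 1 df

print(f"G mean {mean_a['G']:.2f}, A mean {mean_a['A']:.2f}")
print(f"SS(experiment) {ss_exp:.1f} dwarfs SS(allele) {ss_allele:.2f}, "
      f"yet F(allele) = {f_allele:.0f}")
```

The between-experiment sum of squares is far larger than the allelic one, yet the allelic F statistic is large because the contrast is tested against within-cell error, which is the point of the two-way design.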
31. From Genetics to Mechanism of Disease Liability
For any investigation of a genetic determinant of common disease, the goal is to progress toward an understanding of disease mechanism at higher levels of organization, ultimately demonstrating directly the impact of genetic variation on the clinical phenotype at the level of the intact organism. New challenges are anticipated with each level of biological integration. In the case of AGT, challenges arise from the expression of the gene in multiple tissues. AGT secreted by the liver is a constituent of an endocrine renin-angiotensin system (RAS) in the circulation, which also includes renin originating in the juxtaglomerular apparatus and ACE at the luminal surface of capillary endothelium. AGT is also involved with multiple tissue RASs, including the heart, the vessel wall, the brain, and the kidney. Through which of these systems, then, does genetic variation of AGT affect AP regulation? Physiological studies have well demonstrated the pleiotropy of the effector hormone angiotensin II (A-II), as observed with most peptide hormones. How do tissues participate in the multiple effects of A-II, and which of these is most relevant to genetic variation of AP regulation? By affecting vascular tone, vascular structure, and vascular volume, A-II is involved in both acute and chronic regulation of AP (Figure 31.7). Further progress will require reduction of complexity through a hypothesis derived from earlier work in the field. One may assume that some AGT variants mediate predisposition to EH through a chronic effect on baseline AP as a result of subtle differences in sodium handling under the habitual high salt consumption typical of economically advanced societies. The work
Figure 31.7. Acute and chronic effects of angiotensin II on circulatory homeostasis.
of Guyton and colleagues in experimental physiology over the past three decades emphasizes the significance of the kidney, and more specifically, the direct effect on sodium balance of A-II of intrarenal origin in response to variation in dietary salt (Hall, 1993). The significance of the proximal tubule in sodium reabsorption and the expression of AGT at this site make it an attractive target. To progress toward a better understanding of the function of AGT at this site, we have established a conditionally immortalized cell line of mouse proximal tubule (Loghman-Adham et al., 1997), and we are pursuing investigations to understand the regulation of function of components of the RAS at the level of the nephron. Clarification of this functional aspect may help delineate specific animal models of AGT-mediated predisposition to EH.
VI. ANALYSIS AT THE LEVEL OF THE WHOLE ORGANISM
Infusion of Ang or of antibodies directed toward Ang has been shown to increase or decrease AP. Similar conclusions have been drawn through direct genetic manipulations involving inactivation or overexpression of the gene in transgenic animals (Corvol et al., 1999). A less extreme manipulation devised by Smithies and colleagues (Smithies and Kim, 1994; Kim et al., 1995) allowed them to generate animals with zero to four copies of AGT. Through these animal models, the investigators were able to show that a 50% increase in gene copy number (from two to three genes) led to a 10% increase in AP and a 20% increase in circulating Ang. These experiments provide direct proof that a modest increase in AGT transcription, no greater than 50% and probably less, could lead to increased AP. This proof of principle, however, does not afford an identification of the actual mechanism by which genetic variants of AGT affect its expression and eventually lead to essential hypertension. Evidence for a renal mechanism related to sodium homeostasis is suggested by work involving extensive inpatient studies (Hopkins et al., 1996). These authors showed that after a low-dose infusion of A-II in subjects under a high-sodium diet, individuals homozygous for AGT T235 exhibited a significantly blunted reduction in renal blood flow compared with individuals of other genotypes. These data suggest that AGT variants may indeed impact on blood volume regulation through direct intrarenal effects of A-II. The development of animal models reflecting the genetic differences observed in humans will be further complicated by the existence of multiple differences between human and mouse core promoters (Figure 31.8). A thorough, parallel investigation of the two promoters will be required before such models can be entertained.
Figure 31.8. Comparison of human and rodent core promoters.
VII. EVOLUTIONARY IMPLICATIONS
We conclude with a brief discussion of the manner in which common genetic variants may have arisen in human populations, ultimately to contribute to such an important proportion of human morbidity. Most rare Mendelian disorders are the consequence of spontaneous mutations that randomly inactivate genes. The rare deleterious variants are eliminated from the population through natural selection. As such, this genetic variation represents the cost of our mutation load (Morton et al., 1956), with little evolutionary significance. Common molecular variants imparting predisposition to common disease, by contrast, present an apparent dilemma: If they are detrimental to their carriers, by what mechanism have they achieved and can they maintain such high frequency in modern populations? This issue was contemplated over 35 years ago by James Neel in connection with diabetes (Neel, 1962). He proposed and expanded (Neel, 1967) a simple but powerful hypothesis generally referred to as the "thrifty genotype hypothesis." He proposed that common disorders of modern, affluent societies, such as diabetes, hypertension, atherosclerosis, and obesity, have arisen recently as a result of excess abundance and intake of dietary components that were rare in populations living under traditional lifestyles. Homeostatic processes fine-tuned by natural selection to operate under conditions of environmental scarcity were no longer optimal under conditions of excess dietary intake. Common diseases, then, would represent the lag of genetic evolution behind cultural change. Genetic factors predisposing to such metabolic imbalances have not increased in frequency over time. Rather, they were common in early human evolution because of their selective advantage.
It is tempting to suggest that this mechanism was indeed operational in the case of AGT. Variants associated with EH, T235 and A(-6), vary in frequency among major ethnic groups, homozygotes representing 16, 50, and 80% of Caucasians, Japanese, and African-Caribbeans, respectively (Inoue et al., 1997). In limited population samples from equatorial Africa, this genotype achieved a frequency of 100% (unpublished observations). To ascertain which of the alleles at these two sites represented the ancestral form of the gene, this segment of AGT was sequenced in several primates. All exhibited the T235 and A(-6) alleles. In accordance with the thrifty genotype hypothesis, the ancestral form of AGT, best suited for the sodium-deprived environment characteristic of early human evolution and of living conditions in Africa until recent times, could prove to be disadvantageous in the high-sodium environment of affluent societies. The gene carrying M235 and G(-6), then, would constitute a neomorph better suited to the rather recent conditions that may have appeared out of Africa. The selective force, then, leading to expansion of the neomorph relative to its counterpart, may rest with the occurrence of preeclampsia, a condition common enough in affluent societies, and severe enough in the recent past, to have generated a modest selection differential. If anything, this very concept represents yet another major fundamental difference between the genetics of common disease and that of rare Mendelian disorders.
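Under Hardy-Weinberg proportions (my assumption; the text quotes homozygote frequencies without stating it), the homozygote proportions above convert to allele frequencies by a square root; a minimal sketch:

```python
import math

# Homozygote frequencies for the T235/A(-6) haplotype quoted in the text;
# under Hardy-Weinberg equilibrium the allele frequency is p = sqrt(f_homozygote).
homozygote_freq = {"Caucasians": 0.16, "Japanese": 0.50, "African-Caribbeans": 0.80}
allele_freq = {pop: math.sqrt(f) for pop, f in homozygote_freq.items()}

for pop, p in allele_freq.items():
    print(f"{pop}: homozygotes {homozygote_freq[pop]:.0%} -> allele frequency ~{p:.2f}")
```

So even the lowest homozygote proportion, 16%, implies an ancestral-allele frequency of about 0.4.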
Acknowledgments
The work discussed here was supported in part by grant HL45325 from the National Institutes of Health. The author acknowledges and thanks collaborators and associates who have contributed to the work, particularly Xavier Jeunemaitre, Pierre Corvol, Roger Williams, Paul Hopkins, Steve Hunt, Gordon Williams, Ituro Inoue, Toshiaki Nakajima, Mahmoud Loghman-Adham, and Andreas Rohrwasser. J.M.L. is an investigator of the Howard Hughes Medical Institute.
References
Bennett, C. L., Schrader, A. P., and Morris, B. J. (1993). Cross-sectional analysis of Met235 → Thr variant of angiotensinogen gene in severe, familial hypertension. Biochem. Biophys. Res. Commun. 197, 833-839.
Caulfield, M., Lavender, P., Farrall, M., Munroe, P., Lawson, M., Turner, P., and Clark, A. J. (1994). Linkage of the angiotensinogen gene to essential hypertension. N. Engl. J. Med. 330, 1629-1633.
Corvol, P., Persu, A., Gimenez-Roqueplo, A. P., and Jeunemaitre, X. (1999). Seven lessons from two candidate genes in human essential hypertension: Angiotensinogen and epithelial sodium channel. Hypertension 33, 1324-1331.
Fukamizu, A., Takahashi, S., and Murakami, K. (1990). Expression of the human angiotensinogen gene in human cell lines. J. Cardiovasc. Pharmacol. 16 (Suppl. 4), S11-S13.
Hall, J. E. (1993). Intrarenal and circulating angiotensin II and renal function. In "The Renin-Angiotensin System" (J. Robertson and M. G. Nicholls, eds.), pp. 26.1-26.43. Gower Medical Publishing, New York.
Hata, A., Namikawa, C., Sasaki, M., Sato, K., Nakamura, T., Tamura, K., and Lalouel, J.-M. (1994). Angiotensinogen as a risk factor for essential hypertension in Japan. J. Clin. Invest. 93, 1285-1287.
Hopkins, P. N., Hunt, S. C., Wu, L. L., Williams, G. H., and Williams, R. R. (1996). Hypertension, dyslipidemia, and insulin resistance: Links in a chain or spokes on a wheel? Curr. Opin. Lipidol. 7, 241-253.
Inoue, I., Nakajima, T., Williams, C. S., Quackenbush, J., Puryear, R., Powers, M., Cheng, T., Ludwig, E. H., Sharma, A. M., Hata, A., Jeunemaitre, X., and Lalouel, J.-M. (1997). A nucleotide substitution in the promoter of human angiotensinogen is associated with essential hypertension and affects basal transcription in vitro. J. Clin. Invest. 99, 1786-1797.
Ishigami, T., Tamura, K., Fujita, T., Kobayashi, I., Hibi, K., Kihara, M., Toya, Y., Ochiai, H., and Umemura, S. (1999). Angiotensinogen gene polymorphism near transcription start site and blood pressure: Role of a T-to-C transition at intron I. Hypertension 34, 430-434.
Jeunemaitre, X., Soubrier, F., Kotelevtsev, Y. V., Lifton, R. P., Williams, C. S., Charru, A., Hunt, S. C., Hopkins, P. N., Williams, R. R., Lalouel, J.-M., et al. (1992). Molecular basis of human hypertension: Role of angiotensinogen. Cell 71, 169-180.
Jeunemaitre, X., Inoue, I., Williams, C., Charru, A., Tichet, J., Powers, M., Sharma, A. M., Gimenez-Roqueplo, A. P., Hata, A., Corvol, P., and Lalouel, J.-M. (1997). Haplotypes of angiotensinogen in essential hypertension. Am. J. Hum. Genet. 60, 1448-1460.
Kim, H. S., Krege, J. H., Kluckman, K. D., Hagaman, J. R., Hodgin, J. B., Best, C. F., Jennette, J. C., Coffman, T. M., Maeda, N., and Smithies, O. (1995). Genetic control of blood pressure and the angiotensinogen locus. Proc. Natl. Acad. Sci. USA 92, 2735-2739.
Kunz, R., Kreutz, R., Beige, J., Distler, A., and Sharma, A. M. (1997). Association between the angiotensinogen 235T-variant and essential hypertension in whites: A systematic review and methodological appraisal. Hypertension 30, 1331-1337.
Loghman-Adham, M., Rohrwasser, A., Helin, C., Zhang, S., Terreros, D., Inoue, I., and Lalouel, J.-M. (1997). A conditionally immortalized cell line from murine proximal tubule. Kidney Int. 52, 229-239.
Morton, N., Crow, J., and Muller, H. (1956). An estimate of the mutational damage in man from data on consanguineous marriages. Proc. Natl. Acad. Sci. USA 42, 855.
Neel, J. (1962). Diabetes mellitus: A "thrifty" genotype rendered detrimental by "progress"? Am. J. Hum. Genet. 14, 353-362.
Neel, J. (1967). Current concepts of the genetic basis of diabetes mellitus and the biological significance of the diabetic predisposition. Supplement to the Proceedings of the Sixth Congress of the International Diabetes Federation, pp. 68-78.
Risch, N., and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273, 1516-1517.
Smithies, O., and Kim, H. S. (1994). Targeted gene duplication and disruption for analyzing quantitative genetic traits in mice. Proc. Natl. Acad. Sci. USA 91, 3612-3615.
Staessen, J. A., Kuznetsova, T., Wang, J. G., Emelianov, D., Vlietinck, R., and Fagard, R. (1999). M235T angiotensinogen gene polymorphism and cardiovascular renal risk. J. Hypertens. 17, 9-17.
Ward, K., Hata, A., Jeunemaitre, X., Helin, C., Nelson, L., Namikawa, C., Farrington, P. F., Ogasawara, M., Suzumori, K., Tomoda, S., et al. (1993). A molecular variant of angiotensinogen associated with preeclampsia. Nat. Genet. 4, 59-61.
32. Complex Inheritance: The 21st Century
Newton E. Morton
Human Genetics Research Division
University of Southampton
Southampton SO16 6YD, United Kingdom
I. Summary
II. Introduction
III. Mapping
IV. Positional Cloning
V. Pooling Evidence
VI. Fully Parametric Analysis
VII. Pessimistic and Optimistic Views of the 21st Century
References
I. SUMMARY
At least for the early years of the twenty-first century we can anticipate some of the advances to be made in mapping, positional cloning, pooling of evidence over samples for linkage and allelic association, and fully parametric methods that combine the latter with segregation analysis. This preoccupation with problems the twentieth century failed to solve is not grounds for pessimism if the new century provides solutions and applies them to problems of biological interest. We may hope that genetic epidemiology will be part of a community that addresses the needs of geneticists for international communication, a stable nomenclature, genome databases, and a consensus on ethical, legal, and social issues transcending regional prejudices.
Advances in Genetics, Vol. 42. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved. 0065-2660/01 $35.00
II. INTRODUCTION
Crystal balls are inherently dangerous. Military experts a century ago were uncertain whether the ultimate weapon was the sabre or the lance: machine guns proved disconcerting. Science moves more rapidly than warfare, and our science that is not yet centenarian provides stunning examples of false prophecy. Seventy years ago Thomas Hunt Morgan predicted that the next great advance in genetics would be through developmental biology, which is only now beginning to show promise. Forty years ago Bentley Glass thought that molecular biology had reached its limits with Watson and Crick, whereas the double helix was only a first step in tracing gene action back to DNA sequence. With these illustrious exemplars before me, the prospect of making a fool of myself is not daunting, although the time frame cannot realistically be a millennium or even a century, but perhaps a decade.
III. MAPPING
Let us imagine a time when the human sequence is almost completely known, and the full impact of genetic diversity is beginning to be felt. The sequence will not be continuous through highly repetitive polynucleotides, and among sequenced contigs there will be variable insertions, deletions, duplications, and repeats that defy numbering of base pairs along a chromosome. A large fraction of expressed loci will be of uncertain function, and almost nothing will be known about the role of sequences that are not expressed but control chromosome pairing, localization of chiasmata, and disjunction. The mechanisms governing gene expression through RNA binding, methylation, and promoter sequences will still be obscure. Most oligogenes will be unrecognized. What will be the utility of mapping in this postsequencing era? The most certain prediction is that the current rate of increase in the number of mapped genes cannot continue. The annual rate of increase, which has been more than 45% since 1979, soon will saturate the genome (Morton and Collins, 1997). Further advance in the postsequencing era will be through improving map quality at three levels of resolution. The relation between the genetic and physical maps will be clarified at low resolution by a map that includes expressed loci and nonexpressed microsatellites. Emphasis at this level is not on completeness, which depends on the quality of sequence annotation, but on loci useful for mapping. The information will include physical location, perhaps in the form XxX.Xx, where the integer is in megabases (Mb) as in the current location database ldb (Collins et al., 1996). Genetic location in centimorgans (cM) will be sex specific, and standard computer programs will use the sex difference in positional cloning and studies of imprinting and sex-biased
transmission. Cytogenetic band assignments and homologous locations in the mouse will be refined. Locations in centirays (cR) from radiation hybrids, with no validity outside a particular panel and radiation protocol, initially interpolated into the physical map, ultimately will be discarded as sequencing approaches completeness. Algorithms will be developed to recognize locus synonyms and to combine information over these aliases, retaining for each set only one rule-based symbol not requiring arbitration by a committee (Morton, 1998b). Other algorithms will improve the relations among genetic, cytogenetic, and physical assignments. As far as this low resolution permits, recombination cold-spots will not delay positional cloning (Lonjou et al., 1998) or be mistaken for selective sweeps (Huttley et al., 1999). The functional properties of cytogenetic bands and isochores will be studied as the map becomes more reliable. Ordered loci in this low-resolution map will provide framework markers at distances of roughly 1 cM to localize and determine the polarity of sequenced contigs, making nucleotide numbering from pter as unnecessary as it is impractical. Within the interval between adjacent framework markers, the order and location of other loci in the low-resolution map will become sequence-based.
IV. POSITIONAL CLONING
Most genetic epidemiologists are interested less in map construction than in using the available maps to localize a gene of unknown function that influences disease susceptibility or a quantitative trait. The steps from location to cloning to sequence to function constitute positional cloning. Let us suppose that completion of the Human Genome Project coincides with virtual completion of a high-resolution map. Will positional cloning from such a map proceed by a genome scan or by selection of candidate regions? Will reliance be placed on linkage or association? And will the chosen markers be microsatellites or SNPs? This book is largely devoted to these unanswered questions and to the methods that might be used if the questions were answered in a particular way. A genome scan can detect a susceptibility gene in a region for which no candidate is known, at the expense of more intensive investigation
of candidate regions. Linkage is relatively powerful at distances of 10 cM, whereas allelic association is of questionable value at 1 cM or less. A coalescence model that assumes an exponentially increasing effective population size predicts little allelic association beyond 3 kb (Kruglyak, 1999), but this is contradicted by available data (Collins et al., 1999; Huttley et al., 1999; Jorde et al., 1999). Population bottlenecks associated with disasters in situ, migration, hybridization, or selective sweeps are too important to be neglected. As a consequence, the number of SNPs required for a genome scan may be closer to 30,000 than to the 500,000 proposed by Kruglyak, making a genome scan of SNPs by chip technology more feasible (Collins et al., 1999). Microsatellites often have greater heterozygosity than SNPs and therefore are more informative for linkage, but less amenable to chips. Dichotomizing microsatellite alleles is powerful for positional cloning of major genes (Lonjou et al., 1998), but probably not oligogenes. It will be several years before these competing claims are resolved, and the decision will affect the choice of methods for analysis.
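The two proposed densities imply very different average marker spacings; a back-of-envelope check, assuming a genome of roughly 3.3 × 10⁹ bp (the genome size is my assumption, not from the text):

```python
# Average spacing implied by each proposed genome-scan density, assuming a
# ~3.3e9 bp genome. 500,000 SNPs put a marker within ~3.3 kb of any point,
# matching Kruglyak's pessimistic LD range; 30,000 relies on much longer-range
# linkage disequilibrium.
GENOME_BP = 3.3e9
spacing_kb = {n: GENOME_BP / n / 1000 for n in (500_000, 30_000)}
for n, s in spacing_kb.items():
    print(f"{n:>7,} SNPs -> one marker every ~{s:.0f} kb")
```

The tenfold-plus difference in spacing (about 6.6 kb versus 110 kb) is exactly what turns on whether useful allelic association extends beyond a few kilobases.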
V. POOLING EVIDENCE
Whatever methods are chosen, it will be necessary to combine evidence over linkage and association and over samples that differ in markers, phenotypes, ascertainment, and other factors. This re-creates the situation when blood groups and isozymes provided a small number of markers, and linkage evidence was presented as lods that could be summed over samples for each marker at fixed recombination fractions (Morton, 1998a). With dense markers, this practice is no longer appropriate, but multilocus lods may be evaluated at chosen locations within a candidate interval, regardless of whether markers at those locations were tested. Some practical problems remain, such as the effect of typing errors on estimated map intervals in a particular study and the choice of phenotype scores and analytical methods. Lods based entirely on parameters specified a priori are unique among statistics in that the evidence they provide may be efficiently combined by simple addition without increasing the number of degrees of freedom. This gives them a clear advantage over alternatives such as chi squares and significance levels. A further advantage is that evidence in favor of the alternative hypothesis H1 provides a conservative significance level that makes no large-sample assumption,

P < 1/A,

where A = 10^Z > 1 and Z is the maximum of the lod sums with respect to a single parameter. However, these properties are lost if a lod is invalid, as it will be if
nuisance parameters are estimated differently for H0 and H1 or if negative lods are reflected to zero (Morton, 1998a). Many genetic epidemiologists violate the first condition by fitting multiple models and the second condition by using a computer program that cannot fit negative lods to data that favor H0. Variance component methods and "possible triangle" or other constraints do not give negative lods, but there are valid alternatives (Self and Liang, 1987; Zhang et al., in preparation). By contrast, additive lods cannot be constructed for fully nonparametric methods that specify no alternative to H0. This effectively precludes their use to combine evidence over samples.

Among methods that give valid lods, the most powerful has the highest mean lod under H1. For a normally distributed trait in random samples, variance components give the most powerful "model-free" or "weakly parametric" linkage test. However, this advantage may be lost if the trait is not normal or if the families are selected. There is scope for healthy competition among various methods. Can a valid ascertainment measure be derived for variance components, so that power is retained when randomness and normality are violated? Can any relative pair method achieve the power of variance components when randomness and normality hold? Does the robustness to ascertainment bias of the beta model, which considers the identity-by-descent probabilities conditional on the phenotype cross-product, entail significant loss of power when ascertainment bias is absent? Association tests pose similar questions.

It is tempting to assume that the most powerful test is also the best for combining evidence, but this may not be true. A valid lod Zi that is based on estimation of a single parameter may be added over n samples and converted to χ²(n df) = (2 ln 10) ΣZi, but the quantity ΣZi is not a valid lod. One alternative is to base Zi on a maximum likelihood score under the null hypothesis,

Zi = [Σj (xij − uij)]² / [Ki (2 ln 10)],

where the sign is taken from Σj (xij − uij), but this is not a most powerful test except in the limit as H1 → H0. Another alternative is to create a standard lod table over fixed values of the nuisance parameter: then no parameter has been estimated, and therefore the maximum value of ΣZi is a valid lod. These unfamiliar niceties add to the complexity of pooling evidence, which is essential if controversy over multiple samples of inadequate size and inconsistent significance is to be resolved.

Although experience with real data is invaluable, the operating characteristics of methods for complex inheritance are best determined by simulation. For this purpose, some of the data sets of Genetic Analysis Workshops
(GAW) are useful because they provide many replicates. If this material is used as a tournament of methods, it cannot be long before the current profusion is reduced to a small number of proven winners.
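The additive property of valid lods described above can be sketched numerically. The per-sample lods below are invented for illustration; the conversions follow the relations in the text (a conservative significance bound for a valid maximized lod, and the chi-square conversion (2 ln 10)·ΣZi):

```python
import math

# Valid lods from independent samples testing the same a-priori hypothesis
# combine by simple addition (illustrative values, including a negative lod).
lods = [0.8, 1.1, -0.4, 1.6]
Z = sum(lods)                  # combined lod
p_bound = 10 ** (-Z)           # conservative significance bound, P < 1/10**Z
chi2 = 2 * math.log(10) * Z    # chi-square conversion; n samples carry n df

print(f"combined lod Z = {Z:.1f}, conservative P < {p_bound:.1e}, chi2 = {chi2:.2f}")
```

Note that the negative lod is added as-is; reflecting it to zero would invalidate both the bound and the chi-square conversion, which is the point made above.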
VI. FULLY PARAMETRIC ANALYSIS
Success in positional cloning of major loci was based on fully parametric segregation analysis that gave reliable estimates of gene frequency, penetrance, and dominance. Segregation analysis fails with oligogenes and has been supplanted by weakly parametric and nonparametric methods that do not depend on segregation analysis. Once a candidate locus has been identified, however, the frequency, penetrance, and dominance of susceptible alleles must be determined. Perhaps such analysis could be more useful at an earlier stage for detection of candidate loci. Single-locus models that attempt to describe both a candidate locus and residual family resemblance are so much in error that they inflate the estimate of marker recombination and so are incompatible with multilocus mapping. There have been several approaches to residual family resemblance, including a second locus, polygenes, and regression on parental phenotypes. A polygenic residual may be estimated by discrete numerical integration (Morton and MacLean, 1974), by Gauss-Hermite quadrature (Lalouel and Morton, 1981), or by a simpler approximation (Hasstedt, 1982). The class D regressive model corresponds to polygenes if both parents are tested (Demenais and Bonney, 1989), but other regressive models do not represent any genetic mechanism. For example, the class A model assumes for a normally distributed trait that ρss, the correlation between sibs, depends on the parent-offspring correlation ρpo as ρss = 2ρpo²/(1 + ρm), where ρm is the spouse correlation. This coincides with polygenes if ρss = ρpo = 0 or if ρm = 0 and ρss = ρpo = 1/2, but in no other case. Despite its invalidity, the class A model has not been abandoned (Wang et al., 1999). All regressive models are genetically meaningless when applied to dichotomous traits, and provision for incomplete ascertainment, linkage, or association has not been made.
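The failure of the class A model to reproduce polygenic expectation except at the two points noted can be checked directly; a minimal sketch, taking ρss = ρpo as the polygenic prediction under random mating (ρm = 0):

```python
# Class A regressive model: sib correlation implied by the parent-offspring
# correlation rho_po and the spouse correlation rho_m.
def class_a_sib(rho_po, rho_m):
    return 2 * rho_po ** 2 / (1 + rho_m)

# The two cases in which it coincides with the polygenic prediction rho_ss = rho_po:
assert class_a_sib(0.0, 0.0) == 0.0    # rho_ss = rho_po = 0
assert class_a_sib(0.5, 0.0) == 0.5    # rho_m = 0 and rho_ss = rho_po = 1/2

# Elsewhere it departs from polygenic expectation:
print(class_a_sib(0.25, 0.0))          # 0.125, not the polygenic 0.25
```

Solving 2ρ² = ρ confirms that ρ = 0 and ρ = 1/2 are the only fixed points, so any intermediate parent-offspring correlation forces a sib correlation the polygenic model cannot produce.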
Regressive models introduced logistic models and environmental covariates to segregation analysis, but it does not seem that further development of regressive models would be useful. This leaves two-locus models as perhaps the best basis for fully parametric analysis, assuming additivity on the liability or logistic scale. The ascertainment measure, as the conditional probability that a pedigree with fixed structure and phenotypes be ascertained, may be calculated exactly for any defined type of ascertainment in nuclear families with pointers, but extension to general pedigrees is difficult. If multilocus analysis is implemented and if
segregation, linkage, and association analysis were combined as in prototypic programs like COMBIN and COMDS that provide estimates and tests of hypotheses for single marker loci (MacLean et al., 1984; Shields et al., 1994), two-locus parametric analysis would be competitive with alternatives. Our science has advanced to the point where there is no place for programs that incorporate arbitrary parameters without estimating them efficiently.
VII. PESSIMISTIC AND OPTIMISTIC VIEWS OF THE 21st CENTURY

It is easy to present a dark view of genetic epidemiology by emphasizing weaknesses. Statistical methodology has proliferated with little attempt to demonstrate that a new variant has better operating characteristics than a standard method. Many samples are of inadequate size, suboptimal selection, and doubtful analysis. Combination of evidence over samples is necessary to reach a firm conclusion, but no way to do this has been validated. Genetic epidemiology is fortunate in having a society (IGES) that is consciously international, although its membership is inevitably biased toward the countries in which our science developed. However, the larger genetics community embodied in the full members of the International Federation of Human Genetics Societies makes no attempt to represent Asia, Africa, Latin America, or international societies like IGES: the Australasian society is included, while Japan, with a society four times greater, is excluded. The federation has so far failed to address the needs of human geneticists for international participation, a stable nomenclature, genome databases, or consensus on ethical, legal, and social issues (Morton, 1998b). On the other hand, the strengths of genetic epidemiology justify a bright view of its future. Path analysis for complex inheritance and methods for major loci are highly developed and stable. None of the problems that now perplex us in characterizing oligogenes is profound, although the required weight of evidence is costly (see Chapter 31 in this volume, by Lalouel). The power of the best methods is nearly equal, and we are concerned primarily to devise a tournament that will identify the winners in the shortest possible time. Journals will soon learn to require that papers presenting new variants provide evidence that they are better than the standard method.
When methods are as stable for oligogenes as they are for major loci, genetic epidemiology will turn to unsolved problems of more biological interest (Morton, 2000). Interactions of genes with each other and with specific environments will become approachable only when their main effects are identified. There are indications that sets of oligogenes within one conventionally single disease may interact to produce nosological entities with different recurrences and potentially unique responses
to prevention and therapy (Cox et al., 1999; Lalouel, Chapter 31, this volume). To meet the needs of expanding human genetics, the International Federation of Human Genetics Societies either will become international and a federation or will be succeeded by a more adequate structure. Nearly 50 years ago Curt Stern resolved a dispute between two research groups that differed in ways that at the time were apocalyptic but now seem trivial. Scientific problems, he said, were like German cheeses in the Great War (1914-1918), which were totally consumed when maggots entered from many directions. We are fortunate in being among the first maggots in a tasty cheese.
References

Collins, A., Frezal, J., Teague, J., and Morton, N. E. (1996). A metric map of humans: 23,500 loci in 850 bands. Proc. Natl. Acad. Sci. USA 93, 14771-14775.
Collins, A., Lonjou, C., and Morton, N. E. (1999). Genetic epidemiology of single-nucleotide polymorphisms. Proc. Natl. Acad. Sci. USA 96, 15173-15177.
Cox, N. J., Frigge, M., Nicolae, D. L., Concannon, P., Hanis, C. L., Bell, G. I., and Kong, A. (1999). Loci on chromosomes 2 (NIDDM1) and 15 interact to increase susceptibility to diabetes in Mexican Americans. Nat. Genet. 21, 213-215.
Demenais, F. M., and Bonney, G. E. (1989). Equivalence of the mixed and regressive models for genetic analysis. I. Continuous traits. Genet. Epidemiol. 6, 597-617.
Hasstedt, S. J. (1982). A mixed-model likelihood approximation on large pedigrees. Comput. Biomed. Res. 15, 295-307.
Huttley, G. A., Smith, M. W., Carrington, M., and O'Brien, S. (1999). A scan for linkage disequilibrium across the human genome. Genetics 152, 1711-1722.
Jorde, L., Watkins, W. S., Kere, J., Nyman, D., and Eriksson, A. W. (1999). Gene mapping in isolated populations: New roles for old friends? Hum. Hered. 50, 57-65.
Kruglyak, L. (1999). Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22, 139-144.
Lalouel, J.-M., and Morton, N. E. (1981). Complex segregation analysis with pointers. Hum. Hered. 31, 312-321.
Lonjou, C., Collins, A., Ajioka, R. S., Jorde, L. B., Kushner, J. P., and Morton, N. E. (1998). Allelic association under map error and recombinational heterogeneity: A tale of two sites. Proc. Natl. Acad. Sci. USA 95, 11366-11370.
MacLean, C. J., Morton, N. E., and Yee, S. (1984). Combined analysis of genetic segregation and linkage under an oligogenic model. Comput. Biomed. Res. 17, 471-480.
Morton, N. E. (1998a). Significance levels in complex inheritance. Am. J. Hum. Genet. 62, 690-697.
Morton, N. E. (1998b). Genetics without frontiers. Nat. Genet. 20, 329-330.
Morton, N. E. (2000). Unsolved problems in genetic epidemiology. Hum. Hered. 50, 5-13.
Morton, N. E., and Collins, A. (1997). The future of gene mapping. Genet. Anal. 14, 25-27.
Morton, N. E., and MacLean, C. J. (1974). Analysis of family resemblance. III. Complex segregation of quantitative traits. Am. J. Hum. Genet. 26, 489-503.
Self, S. G., and Liang, K.-Y. (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Stat. Assoc. 82, 605-610.
Shields, D. C., Ratanachaiyavong, S., McGregor, A. M., Collins, A., and Morton, N. E. (1994). Combined segregation and linkage analysis of Graves disease with a thyroid antibody diathesis. Am. J. Hum. Genet. 55, 540-554.
Wang, H.-M., Jones, M. P., and Burns, T. L. (1999). Regressive diagnostics for the class A regressive model with quantitative phenotypes. Genet. Epidemiol. 17, 174-187.
Contributors

Numbers in parentheses indicate the pages on which the authors' contributions begin.
Laura Almasy (151), Department of Genetics, Southwest Foundation for Biomedical Research, San Antonio, Texas 78284
Christopher I. Amos (213), Department of Epidemiology and Biostatistics, University of Texas, M. D. Anderson Cancer Center, Houston, Texas 77030
John Blangero (151), Department of Genetics, Southwest Foundation for Biomedical Research, San Antonio, Texas 78284
Ingrid B. Borecki (34, 45), Department of Genetics, Division of Biostatistics, Washington University School of Medicine, St. Louis, Missouri 63110
Ulrich Broeckel (191), Department of Physiology, Medical College of Wisconsin, Milwaukee, Wisconsin 53226
Karl W. Broman (77), Department of Biostatistics, School of Hygiene and Public Health, Johns Hopkins University, Baltimore, Maryland 21205
Nicola H. Chapman (413), Department of Biostatistics, University of Washington, Seattle, Washington 98195
Françoise Clerget-Darpoux (115), Inserm U-535, Génétique Épidémiologique et Structure des Populations Humaines, Bâtiment Gregory Pincus, 94275 Le Kremlin-Bicêtre Cedex, France
Daniel Cohen (191), Genset SA, 75008 Paris, France
Jonathan Corbett (99), Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri 63110
Heather J. Cordell (135), Department of Medical Genetics, Addenbrooke's Hospital, Wellcome Trust-Cambridge Institute for Medical Research, Cambridge CB2 2XY, England
James F. Crow (3), University of Wisconsin, Madison, Wisconsin 53706
Lindon J. Eaves (223), Department of Human Genetics, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, Virginia 23298
Robert C. Elston (135, 459), Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio 44109
Dani Fallin (191), Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio 44109
W. James Gauderman (393), Department of Preventive Medicine, University of Southern California, Los Angeles, California 90033
Saurabh Ghosh (323), Anthropometry and Human Genetics Unit, Indian Statistical Institute, Calcutta, India 700-035
David E. Goldgar (241), Unit of Genetic Epidemiology, International Agency for Research on Cancer, Lyon, 69008 France
Chi Gu (255, 439, 487), Division of Biostatistics, Washington University School of Medicine, St. Louis, Missouri 63110
Xiuqing Guo (459), Department of Preventive Medicine and Epidemiology, Stritch School of Medicine, Loyola University, Maywood, Illinois 60153
Howard J. Jacob (191), Department of Physiology, Medical College of Wisconsin, Milwaukee, Wisconsin 53226
Jean-Marc Lalouel (517), University of Utah, Howard Hughes Medical Institute, Salt Lake City, Utah 84112
Partha P. Majumder (323), Anthropometry and Human Genetics Unit, Indian Statistical Institute, Calcutta, India 700-035
Newton E. Morton (535), Human Genetics Research Division, University of Southampton, Southampton General Hospital, Southampton SO16 6YD, United Kingdom
Jurg Ott (125, 287), Laboratory of Statistical Genetics, Rockefeller University, New York, New York 10021
Grier Page (213), Departments of Biostatistics and Epidemiology and Medicine, Hollings Cancer Center, Medical University of South Carolina, Charleston, South Carolina 29425
Michael A. Province (183, 255, 273, 499), Division of Biostatistics, Washington University School of Medicine, St. Louis, Missouri 63110
D. C. Rao (255, 273, 439, 487, 545), Division of Biostatistics, Departments of Psychiatry and Genetics, Washington University School of Medicine, St. Louis, Missouri 63110
Erik Rasmussen (69), Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri 63110
John P. Rice (69, 99), Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri 63110
Treva K. Rice (35), Division of Biostatistics, Washington University School of Medicine, St. Louis, Missouri 63110
Nancy L. Saccone (69, 99), Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri 63110
Nicholas J. Schork (191, 299), Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio 44109
W. D. Shannon (273), Division of General Medical Sciences and Division of Biostatistics, Washington University School of Medicine, St. Louis, Missouri 63110
Andrea Sherriff (287), University of Bristol, Royal Hospital for Sick Children, Bristol, BS2 8BJ United Kingdom
Brian K. Suarez (45), Departments of Psychiatry and Genetics, Washington University School of Medicine, St. Louis, Missouri 63110
Patrick Sullivan (223), Virginia Institute for Psychiatric and Behavioral Genetics, and Department of Psychiatry, Virginia Commonwealth University, Richmond, Virginia 23298
Joseph D. Terwilliger (351), Department of Psychiatry, Columbia University, Columbia Genome Center, New York State Psychiatric Institute, New York, New York 10032
Bonnie Thiel (191), Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio 44109
Duncan C. Thomas (393), Department of Preventive Medicine, University of Southern California, Los Angeles, California 90033
Elizabeth A. Thompson (413), Department of Statistics, University of Washington, Seattle, Washington 98195
Glenys Thomson (475), Department of Integrative Biology, University of California - Berkeley, Berkeley, California 94720
James L. Weber (77), Center for Medical Genetics, Marshfield Medical Research Foundation, Marshfield, Wisconsin 54449
Daniel E. Weeks (7), Department of Human Genetics, University of Pittsburgh, Pittsburgh, Pennsylvania 15261
Jeff T. Williams (151), Department of Genetics, Southwest Foundation for Biomedical Research, San Antonio, Texas 78245
Xiping Xu (191), Program for Population Statistics and Department of Environmental Health, Harvard School of Public Health, Boston, Massachusetts 02115
Appendix

Research Contributions of Newton E. Morton

An undergraduate encounter with Dobzhansky's book "Genetics and the Origin of Species" led to Newton Morton's lifelong fascination with population genetics, initially expressed in studies of Drosophila under James Crow, who also stimulated his interest in human genetics. As a graduate student, Morton was seconded to the Atomic Bomb Casualty Commission in Hiroshima, where he conducted preliminary analysis of the genetic data and developed an interest in linkage. On his return to Wisconsin he created lod score methods that were exact in small samples and summarized evidence as standard tables that could be combined over pedigrees and studies. This quickly led to resolution of genetic heterogeneity through linkage (9), then to construction of genetic maps (208) and positional cloning of disease genes.* Subsequent papers applied these developments to many loci and finally to map integration, where connectivity of the genetic and radiation hybrid maps adds value to the precision of disconnected physical maps (392, 401). Currently the Southampton group led by Andrew Collins, a former doctoral student and close colleague, is replacing physical maps with sequence-based maps. Another direction stems from the Wisconsin period. On his return from Hiroshima, Morton continued with Crow and Muller to develop genetic load theory for inbreeding effects on mortality and morbidity, providing a lower limit to deleterious mutation rates and gene frequencies across the genome and showing that most of this load is due to genes of high penetrance (12). Later work established the small effect of preferential consanguineous marriage in reducing this load, confirmed by Sarah Bundey and others. Inbreeding studies led to analysis of interracial crosses, which showed no evidence of heterosis or hybrid dysgenesis in 180,000 births (147).
*Numbers in parentheses designate publications in the accompanying list.

Advances in Genetics, Vol. 42. Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.

Field work in Brazil confirmed his interest in population structure, with emphasis on the consistency of different approaches (polymorphisms, quantitative traits, migration, genealogies, and isonymy). This led to the Eastern Carolines, where the "Pingelap eye disease," affecting 5% of this population, was shown to be caused by a subsequently
mapped recessive gene for achromatopsia (110). Studies of other populations confirmed the coherence of different approaches to population structure. A third interest was in pedigree analysis, and Morton began using computers to solve problems with sporadic cases that had been intractable and later applying the mixed model that includes both a major locus and polygenes (137). This revealed the contributions of rare dominants, recessives, and sex-linked genes to many heterogeneous diseases. With D. C. Rao he developed path analysis of genetic and cultural inheritance in terms of hypothesis testing, largely in response to insistence of doctrinaire geneticists that heritability of cognition is not measurable. In disproof, Morton and Rao showed by path analysis that cultural heritability is clearly demonstrated (most strikingly in Galton's hereditary genius), that little if any of the academic deficit then shown by Afro-Americans was genetic, and that genetic heritability of IQ is not less than 25% (248). At this point it became apparent that analysis of disease-related traits in contemporary populations was using methods different from evolutionary genetics to answer different questions, corresponding to the differing interests of Mendel and Darwin. Geneticists and epidemiologists were attracted to this rapidly growing field, for which Morton coined the name "genetic epidemiology." Molecular genetics was developing explosively, and the advantage of Hawaii as a center of research in population genetics was diminishing. These considerations led to two years in Manhattan, and then relocation to Southampton. Cancer genetics provided one line of research, directed to the role of trisomy 21 in leukemia and segregation analysis of neurofibromatosis and other common cancers, establishing parameters of mutation, gene frequency, and penetrance that were later refined through linkage and positional cloning by other workers (343).
Another research interest was in forensic populations, the structure of which was controversial and led in the United States to invalid presentation of evidence (340). Population structure is part of genetic epidemiology, which therefore extends beyond disease-related traits to forensic science and other applications of its methods and results. The scope of genetic epidemiology was broadened and the forensic controversy resolved (405). Much recent research has been devoted to the construction of integrated maps, to the investigation of meiotic abnormalities for which the theory had been provided (293), to studies of dynamic mutation for trinucleotide repeats (376), and to the development and application of genetic analysis for complex inheritance, where the smallness and therefore obscurity of single-gene effects is a challenge to positional cloning, whereby a gene of unknown function is characterized through structure only after its location has been determined by linkage and allelic association (416). The central problem is first to establish the most powerful study design and methods of analysis and then to combine evidence over types of data. The Southampton group is active in
developing and assessing these methods, using type 1 diabetes, asthma, and atopy as test systems (402, 418). The latest paper in this series shows that allelic association extends over much greater distances than a recent simulation had predicted and proposes an explanation for its failure, which is favorable to the use of allelic association for positional cloning (431). Other contemporary issues, scientific and ethical, have been summarized in a special edition of Human Heredity celebrating Morton's seventieth birthday (425).
Publications of Newton E. Morton

1953
1. Neel, J. V., Schull, W. J., McDonald, D. J., Morton, N. E., Kodani, M., Takeshima, K., Anderson, R. C., Wood, J., Brewer, R., Wright, S., Yamazaki, J., Suzuki, M., and Kitamura, S. The effect of exposure to the atomic bombs on pregnancy termination in Hiroshima and Nagasaki: Preliminary report. Science 118:537-541.
2. Neel, J. V., Morton, N. E., Schull, W. J., McDonald, D. J., Kodani, M., Takeshima, K., Anderson, R. C., Wood, J., Brewer, R., Wright, S., Yamazaki, J., Suzuki, M., and Kitamura, S. The effect of exposure of parents to the atomic bombs on the first generation offspring in Hiroshima and Nagasaki: Preliminary report. Jap. J. Genet. 28:211-218.
1954
3. Morton, N. E., Moloney, W. C., and Fujii, T. Linkage in man. Pelger's nuclear anomaly, taste and blood groups. Am. J. Hum. Genet. 6:38-43.
1955
4. Fujii, T., Moloney, W. C., and Morton, N. E. Data on linkage of ovalocytosis and blood groups. Am. J. Hum. Genet. 7:72-75.
5. Crow, J. F., and Morton, N. E. Measurement of gene frequency drift in small populations. Evolution 9:202-214.
6. Morton, N. E. Non-randomness in consanguineous marriage. Ann. Hum. Genet. 20:116-134.
7. Morton, N. E. The inheritance of human birth weight. Ann. Hum. Genet. 20:125-134.
8. Morton, N. E. Sequential tests for the detection of linkage. Am. J. Hum. Genet. 7:277-318.
1956
9. Morton, N. E. The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am. J. Hum. Genet. 8:80-96.
10. Morton, N. E., Stone, W. H., and Irwin, M. R. Linkage and fitness of cattle blood factors. Genetics 41:655.
11. Steinberg, A. G., and Morton, N. E. Sequential test for linkage between cystic fibrosis of the pancreas and the MNS locus. Am. J. Hum. Genet. 8:177-189.
12. Morton, N. E., Crow, J. F., and Muller, H. J. An estimate of the mutational damage in man from data on consanguineous marriage. Proc. Natl. Acad. Sci. USA 42:855-863.
1957
13. Morton, N. E. Further scoring types in sequential linkage tests, with a critical review of autosomal and partial sex linkage in man. Am. J. Hum. Genet. 9:55-75.
1958
14. Morton, N. E. Segregation analysis in human genetics. Science 127:79-80.
15. Morton, N. E. Empirical risks in consanguineous marriages: Birth weight, gestation time, and measurement of infants. Am. J. Hum. Genet. 10:344-349.
16. Chung, C. S., and Morton, N. E. Discrimination of genetic entities in muscular dystrophy. Proceedings of the Tenth International Congress on Genetics, Vol. II.
17. Morton, N. E., and Chung, C. S. Formal genetics of muscular dystrophy. Proceedings of the Tenth International Congress on Genetics, Vol. II.
1959
18. Morton, N. E. Genetic tests under incomplete ascertainment. Am. J. Hum. Genet. 11:1-16.
19. Chung, C. S., Robinson, O. W., and Morton, N. E. A note on deaf mutism. Ann. Hum. Genet. 23:357-366.
20. Morton, N. E. Methods of study in human genetics. In "Genetics and Cancer" (Symposium on Fundamental Cancer Research, 15th Collection of Papers), pp. 391-407. University of Texas Press, Austin.
21. Morton, N. E., and Chung, C. S. Are the MN blood groups maintained by selection? Am. J. Hum. Genet. 11:237-251.
22. Chung, C. S., and Morton, N. E. Discrimination of genetic entities in muscular dystrophy. Am. J. Hum. Genet. 11:339-359.
23. Morton, N. E., and Chung, C. S. Formal genetics of muscular dystrophy. Am. J. Hum. Genet. 11:360-379.
1960
24. Chung, C. S., Morton, N. E., and Peters, H. A. Serum enzymes and genetic carriers in muscular dystrophy. Am. J. Hum. Genet. 12:52-66.
25. Morton, N. E. The mutational load due to detrimental genes in man. Am. J. Hum. Genet. 12:348-364.
26. Crow, J. F., and Morton, N. E. The genetic load due to mother-child incompatibility. Am. Nat. 94:413-419.
27. Chung, C. S., Matsunaga, E., and Morton, N. E. The ABO polymorphism in Japan. Jap. J. Hum. Genet. 5:124-134.
1961
28. Morton, N. E. (letter to the editor). Phenodeviants and genetic homeostasis. Am. J. Hum. Genet. 13:104.
29. Chung, C. S., and Morton, N. E. Selection at the ABO locus. Am. J. Hum. Genet. 13:9-27.
30. Morton, N. E. Morbidity of children from consanguineous marriages. In "Progress in Medical Genetics" (A. G. Steinberg, ed.), pp. 261-291. Grune & Stratton, New York.
31. Chung, C. S., Matsunaga, E., and Morton, N. E. The MN polymorphism in Japan. Jap. J. Hum. Genet. 6:1-11.
32. Morton, N. E. Review of H. L. LeRoy, Statistische Methoden der Populationsgenetik. J. Am. Stat. Assoc. 56:760-761.
33. Wang Hwa, L., Morton, N. E., and Waisman, H. A. Increased reliability for the determination of the carrier state in phenylketonuria. Am. J. Hum. Genet. 13:255-261.
34. Morton, N. E., and Chung, C. S. Genetics of muscular disorders. Proceedings of the Second International Congress on Human Genetics, pp. 1599-1601.
35. Kosower, N., Christiansen, R., and Morton, N. E. Sporadic cases of hemophilia and the
question of a possible sex difference in mutation rates. Proceedings of the Second International Congress on Human Genetics.
36. Chung, C. S., and Morton, N. E. Genetics of interracial crosses in Hawaii. Proceedings of the Second International Congress on Human Genetics, pp. 134-138.
1962
37. Morton, N. E., Mackinney, A. A., Kosower, N., Schilling, R. F., and Gray, M. P. Genetics of spherocytosis. Am. J. Hum. Genet. 14:170-184.
38. Morton, N. E. Genetics of interracial crosses in Hawaii. Eugen. Q. 9:23-24.
39. Kosower, N. M., Christiansen, R., and Morton, N. E. Sporadic cases of hemophilia and the question of a possible sex difference in mutation rates. Am. J. Hum. Genet. 14:159-169.
40. Morton, N. E. Segregation and linkage. In "Methodology in Human Genetics" (W. J. Burdette, ed.), pp. 17-52. Holden-Day, San Francisco.
41. Morton, N. E. Discussion of information storage, retrieval and processing. Seminar on the Use of Vital and Health Statistics for Genetic and Radiation Studies, pp. 167-170. World Health Organisation, Geneva.
42. Morton, N. E., and Yasuda, N. The genetical structure of human populations. In "Les Déplacements Humains" (J. Sutter, ed.), Entret. Monaco Sci. Hum., pp. 185-203. Hachette, Paris.
43. Conneally, P. M., Patel, J. R., Morton, N. E., and Stone, W. H. The J substance of cattle. VI. Multiple alleles at the J locus. Genetics 47:797-805.
44. MacKinney, A. A., Jr., Morton, N. E., Kosower, N. S., and Schilling, R. F. Ascertaining genetic carriers of hereditary spherocytosis by statistical analysis of multiple laboratory tests. J. Clin. Invest. 41:544-567.
1963
45. Morton, N. E., Chung, C. S., and Peters, H. A. Genetics of muscular dystrophy. In "Muscular Dystrophy in Man and Animals" (G. H. Bourne and M. A. N. Golarz, eds.), pp. 323-365. Karger, Basel.
46. Morton, N. E. The components of genetic variability. In "The Genetics of Migrant and Isolate Populations" (E. Goldschmidt, ed.), pp. 225-236. Williams & Wilkins, Baltimore.
47. Conneally, P. M., Stone, W. H., Tyler, W. J., Casida, L. E., and Morton, N. E. Genetic load expressed as fetal death in cattle. J. Dairy Sci. 46:232-236.
1964
48. Morton, N. E. Models and evidence in human population genetics. In "Genetics Today" (Proceedings of the 11th International Congress on Genetics, The Hague, September 1963), (S. J. Geerts, ed.), pp. 935-951. Pergamon Press, Oxford.
49. Azevedo, E., Krieger, H., and Morton, N. E. Smallpox and the ABO blood groups in Brazil. Am. J. Hum. Genet. 16:451-454.
50. Morton, N. E. Genetic studies of northeastern Brazil. Cold Spring Harbor Symp. Quant. Biol. 29:69-79.
51. Chapman, A. B., Hansen, J. L., Havenstein, G. B., and Morton, N. E. Genetic effects of cumulative irradiation on prenatal and early postnatal survival in the rat. Genetics 50:1029-1042.
1965
52. Azevedo, E., Krieger, H., Mi, M. P., and Morton, N. E. PTC taste sensitivity and endemic goiter in Brazil. Am. J. Hum. Genet. 17:87-90.
53. Mi, M. P., Azevedo, E., Krieger, H., and Morton, N. E. Malformations in northeastern Brazil. Acta Genet. 15:177-189.
54. Barrai, I., Mi, M. P., Morton, N. E., and Yasuda, N. Estimation of prevalence under incomplete selection. Am. J. Hum. Genet. 17:221-236.
55. Dewey, W. J., Barrai, I., Morton, N. E., and Mi, M. P. Recessive genes in severe mental defect. Am. J. Hum. Genet. 17:237-256.
56. Krieger, H., Morton, N. E., Mi, M. P., Azevedo, E., Freire-Maia, A., and Yasuda, N. Racial admixture in northeastern Brazil. Ann. Hum. Genet. 29:113-125.
57. Morton, N. E., Krieger, H., Steinberg, A. G., and Rosenfield, R. E. Genetic evidence confirming the localization of Sutter in the Kell blood-group system. Vox Sang. 10:608-613.
58. Morton, N. E. A search for a natural selection. Am. J. Hum. Genet. 17:94-95.
1966
59. Morton, N. E., Mi, M. P., and Yasuda, N. Bivalent alleles. Am. J. Hum. Genet. 18:233-242.
60. Morton, N. E., Mi, M. P., and Yasuda, N. A special theory of hemagglutination. Vox Sang.
11:12-20.
61. Morton, N. E., Mi, M. P., and Yasuda, N. A study of the S" alleles in northeastern Brazil. Vox Sang. 11:194-208.
62. Chung, C. S., Morton, N. E., and Yasuda, N. Genetics of interracial crosses. Ann. NY Acad. Sci. 134:666-687.
63. Mi, M. P., and Morton, N. E. Blood factor association. Vox Sang. 11:434-449.
64. Morton, N. E., Krieger, H., and Mi, M. P. Natural selection on polymorphism in northeastern Brazil. Am. J. Hum. Genet. 18:153-171.
65. Morton, N. E. (book review). "The Effects of Inbreeding on Japanese Children": W. J. Schull and J. V. Neel, Harper & Row, New York, 1965. Eugen. Q. 13:276-278.
66. Morton, N. E. (letter to the editor). Effects of inbreeding on mortality. Am. J. Hum. Genet. 18:504.
1967
67. Morton, N. E. The detection of major genes under additive continuous variation. Am. J. Hum. Genet. 19:23-34.
68. Yasuda, N., and Morton, N. E. Studies on human population structure. Proceedings of the Third International Congress of Human Genetics (Chicago, September 1966), (J. F. Crow and J. V. Neel, eds.), pp. 249-265. Johns Hopkins Press, Baltimore.
69. Morton, N. E. Population genetics of mental illness. Eugen. Q. 14:181-184.
70. Morton, N. E., and Rosenfield, R. E. A new Rh allele, ry" (R-1*2,w3*4). Transfusion 7:117-119.
71. Morton, N. E. Genetic studies of northeastern Brazil: Summary and conclusions. Cienc. Cult. 19:14-30.
1968
72. Morton, N. E. Problems and methods in the genetics of primitive groups. Am. J. Phys. Anthrop. 28:191-202.
73. Morton, N. E., Miki, C., and Yee, S. Bioassay of population structure under isolation by distance. Am. J. Hum. Genet. 20:411-419.
74. Morton, N. E., Yasuda, N., Miki, C., and Yee, S. Population structure of the ABO blood groups in Switzerland. Am. J. Hum. Genet. 20:420-429.
75. Morton, N. E., and Miki, C. Estimation of gene frequencies in the MN system. Vox Sang. 15:15-24.
76. Fu, L., Azevedo, E., and Morton, N. E. Evidence against the reported linkage of phosphoglucomutase (PGM1) and phenylthiocarbamide-testing (PTC). Acta Genet. 18:416-419.
77. Wright, S. W., and Morton, N. E. Genetic studies on cystic fibrosis in Hawaii. Am. J. Hum. Genet. 20:157-169.
78. Wright, S. W., and Morton, N. E. The incidence of cystic fibrosis in Hawaii. Am. J. Hum. Genet. 20:361-367.
79. Morton, N. E., Chung, C. S., and Friedman, L. D. Relation between homozygous viability and average dominance in Drosophila melanogaster. Genetics 60:601-614.
1969
80. Morton, N. E. Population structure. In "Computer Applications in Genetics" (N. E. Morton, ed.), pp. 61-71. University of Hawaii Press, Honolulu.
81. Morton, N. E. Segregation analysis. In "Computer Applications in Genetics" (N. E. Morton, ed.), pp. 129-139. University of Hawaii Press, Honolulu.
82. Azevedo, E., Morton, N. E., Miki, C., and Yee, S. Distance and kinship in northeastern Brazil. Am. J. Hum. Genet. 21:1-22.
83. Hunt, H. W., and Morton, N. E. Quantitative hemagglutination in the ABO system. Am. J. Hum. Genet. 21:84-98.
84. Morton, N. E. Human population structure (H. L. Roman, ed.). Annu. Rev. Genet. 3:53-74.
85. Imaizumi, Y., and Morton, N. E. Isolation by distance in Japan and Sweden compared with other countries. Hum. Hered. 19:433-443.
86. Azevedo, E., Krieger, H., and Morton, N. E. Ahaptoglobinemia in northeastern Brazil. Hum. Hered. 19:609-612.
87. Morton, N. E. Preface to "Probabilités et Hérédité." In "The Mathematics of Heredity" (rev., G. Malécot; ed. and trans., D. M. Yermanos). Freeman, San Francisco.
1970
88. Morton, N. E. Birth defects in racial crosses. In "Congenital Malformations" (Proceedings of the Third International Conference, The Hague, September 1969), (F. C. Fraser and V. A. McKusick, eds.), pp. 264-274. Excerpta Medica, Amsterdam and New York.
89. Yee, S., and Morton, N. E. (letter to the editor). Re Schull and Ito. Am. J. Hum. Genet. 22:112-113.
90. Roisenberg, I., and Morton, N. E. Population structure of blood groups in Central and South American Indians. Am. J. Phys. Anthropol. 32:373-374.
91. Imaizumi, Y., and Morton, N. E. Isolation by distance in New Guinea and Micronesia. Archaeol. Phys. Anthropol. Oceania 5:218-235.
92. Morton, N. E., and Hussels, I. Demography of inbreeding in Switzerland. Hum. Biol. 42:65-X.
93. Morton, N. E., Yee, S., Elston, R. C., and Lew, R. Discontinuity and quasi-continuity: Alternative hypothesis of multifactorial inheritance. Clin. Genet. 1:81-94.
94. Kirk, R. L., Kinns, H., and Morton, N. E. Interaction between the ABO blood group and haptoglobin systems. Am. J. Hum. Genet. 22:384-389.
95. Imaizumi, Y., Morton, N. E., and Harris, D. E. Isolation by distance in artificial populations. Genetics 66:569-582.
96. Todorov, A., Jequier, M., Klein, D., and Morton, N. E. Analyse de la segregation dans la dystrophie myotonique. J. Genet. Hum. 18:387-406.
1971
97. Morton, N. E. Genetic structure of northeastern Brazilian populations. In "The Ongoing Evolution of Latin American Populations" (Burg Wartenstein, Austria, August 1969), (F. M. Salzano, ed.), pp. 251-276. Thomas, Springfield, IL.
98. Morton, N. E. (book review). "Problems in Human Biology. A Study of Brazilian Populations": F. M. Salzano and N. Freire-Maia, Wayne State University Press, Detroit, 1970. Am. J. Hum. Genet. 23:327-328.
99. Morton, N. E., Roisenberg, I., Lew, R., and Yee, S. Pingelap and Mokil atolls: Genealogy. Am. J. Hum. Genet. 23:350-360.
100. Morton, N. E., Harris, D. E., Yee, S., and Lew, R. Pingelap and Mokil atolls: Migration. Am. J. Hum. Genet. 23:339-349.
101. Morton, N. E., Imaizumi, Y., and Harris, D. E. Clans as genetic barriers. Am. Anthrop. 73:1005-1010.
102. Carr, R. E., Morton, N. E., and Siegel, I. M. Achromatopsia in Pingelap Islanders: Study of a genetic isolate. Am. J. Ophthalmol. 72:746-756.
103. Morton, N. E. Kinship and population size. In "Génétique et Populations," L'Institut National d'Études Démographiques, Travaux et Documents Cahier No. 60. Presses Universitaires de France, Paris.
104. Morton, N. E., Yee, S., Harris, D. E., and Lew, R. Bioassay of kinship. Theor. Pop. Biol. 2:507-524.
105. Morton, N. E. (letter to the editor). Reply to Harpending: Treatment of random phenotype pairs. Am. J. Hum. Genet. 23:538-539.
106. Morton, N. E. Population genetics and disease control. Soc. Biol. 18:243-251.
1972
107. Morton, N. E., Lew, R., Hussels, I. E., and Little, G. F. Pingelap and Mokil atolls: Historical genetics. Am. J. Hum. Genet. 24:277-289.
108. Morton, N. E. Pingelap and Mokil atolls: Clans and cognate frequencies. Am. J. Hum. Genet. 24:290-298.
109. Morton, N. E., and Greene, D. L. Pingelap and Mokil atolls: Anthropometrics. Am. J. Hum. Genet. 24:299-303.
110. Hussels, I. E., and Morton, N. E. Pingelap and Mokil atolls: Achromatopsia. Am. J. Hum. Genet. 24:304-309.
111. Morton, N. E. Foreword. In “Genetic Factors in Schizophrenia” (A. R. Kaplan, ed.), pp. xv-xvi. Thomas, Springfield, IL.
112. Pollock, N., Lalouel, J.-M., and Morton, N. E. Kinship and inbreeding on Namu atoll (Marshall Islands). Hum. Biol. 44:459-473.
113. Morton, N. E. (book review). “Theoretical Aspects of Population Genetics”: M. Kimura and T. Ohta, Princeton University Press, Princeton, NJ, 1971. Am. J. Hum. Genet. 24:488-489.
114. Morton, N. E. (book review). “The Genetics of Human Populations”: L. L. Cavalli-Sforza and W. F. Bodmer, Freeman, San Francisco, 1971. Soc. Biol. 19:405-408.
115. Morton, N. E. Population genetics. In “Human Genetics” (Proceedings of the Fourth International Congress on Human Genetics, Paris, September 1971), (J. de Grouchy, F. J. G. Ebling, and I. W. Henderson, eds.), pp. 131-132. Excerpta Medica, Amsterdam and New York.
116. Morton, N. E. The future of human population genetics. In “Progress in Medical Genetics,” vol. III (A. G. Steinberg and A. G. Bearn, eds.), pp. 103-124. Grune & Stratton, New York.
117. Morton, N. E. Human behavioral genetics. In “Genetics, Environment and Behavior” (L. Ehrman, ed.), pp. 247-271. Ann. NY Acad. Sci., Academic Press, New York and London.
1973
118. Morton, N. E., and Yamamoto, M. Blood groups and haptoglobins in the Eastern Carolines. Am. J. Phys. Anthropol. 38:695-698.
119. Steinberg, A. G., and Morton, N. E. Immunoglobulins in the Eastern Carolines. Am. J. Phys. Anthropol. 38:699-702.
120. Morton, N. E., and Lalouel, J.-M. Bioassay of kinship in Micronesia. Am. J. Phys. Anthropol. 38:709-719.
121. Lalouel, J.-M., and Morton, N. E. Bioassay of kinship in a South American Indian population. Am. J. Hum. Genet. 25:62-73.
122. Piazza, A., and Morton, N. E. A formal genetic analysis of the HL-A system. Am. J. Hum. Genet. 25:119-133.
123. Tolarova, M., and Morton, N. E. Cleft lip and palate: Recurrence risk and genetic counseling. Proceedings of the Second Conference of the European Teratology Society, Prague, May 1972 (E. Klika, ed.), pp. 83-90. Acta Universitatis Carolinae Medica-Monographia LVI-LVII.
124. Morton, N. E., Klein, D., Hussels, I. E., Dodinval, P., Todorov, A., Lew, R., and Yee, S. Genetic structure of Switzerland. Am. J. Hum. Genet. 25:347-361.
125. Morton, N. E., and Lalouel, J.-M. Topology of kinship in Micronesia. Am. J. Hum. Genet. 25:422-432.
126. Morton, N. E. Population structure of Micronesia. In “Methods and Theories of Anthropological Genetics” (M. H. Crawford and P. L. Workman, eds.), pp. 333-366. University of New Mexico Press, Albuquerque.
127. Morton, N. E., Hurd, J. N., and Little, G. F. Pingelap and Mokil atolls: A problem in population structure. In “Methods and Theories of Anthropological Genetics” (M. H. Crawford and P. L. Workman, eds.), pp. 315-332. University of New Mexico Press, Albuquerque.
128. Eriksson, A. W., Eskola, M. R., Workman, P. L., and Morton, N. E. Population studies on the Aland Islands. II. Historical population structure: Inference from bioassay of kinship and migration. Hum. Hered. 23:511-534.
129. Rao, D. C., and Morton, N. E. Large deviations in the distribution of rare genes. Am. J. Hum. Genet. 25:594-597.
130. Morton, N. E. Population structure and historical genetics of isolates. Israel J. Med. Sci. 9:1299-1307.
1974
131. Morton, N. E. Kinship bioassay. In “Genetic Distance” (Workshop, Fourth International Congress of Human Genetics, Paris, September 1971), (J. F. Crow and C. Denniston, eds.), pp. 97-104. Plenum, New York.
132. Freire-Maia, A., Freire-Maia, D., and Morton, N. E. Sex effect on intelligence and mental retardation. Behav. Genet. 4:269-272.
133. Freire-Maia, A., Stevenson, C., and Morton, N. E. Hybridity effect on mortality. Soc. Biol. 21:232-234.
134. Morton, N. E. (letter to the editor). Reply to Cavalli-Sforza: Controversial issues in human population genetics. Am. J. Hum. Genet. 26:259-262.
135. Morton, N. E. Analysis of family resemblance. I. Introduction. Am. J. Hum. Genet. 26:318-330.
136. Rao, D. C., Morton, N. E., and Yee, S. Analysis of family resemblance. II. A linear model for familial correlation. Am. J. Hum. Genet. 26:331-359.
137. Morton, N. E., and MacLean, C. J. Analysis of family resemblance. III. Complex segregation of quantitative traits. Am. J. Hum. Genet. 26:489-503.
138. Rao, D. C., and Morton, N. E. (brief communication). Path analysis of family resemblance in the presence of gene-environment interaction. Am. J. Hum. Genet. 26:767-772.
139. Morton, N. E. (book review). “Génétique des populations humaines”: A. Jacquard, Presses Universitaires de France, Paris, 1974. Am. J. Hum. Genet. 27:127-128.
140. Morton, N. E. (book review). “Genetic Variation in Britain”: D. F. Roberts and E. Sunderland (eds.), Symposia of the Society for the Study of Human Biology, vol. 12. Barnes & Noble, New York, 1973. Am. J. Hum. Genet. 26:266.
141. Chung, C. S., Ching, G. H. S., and Morton, N. E. A genetic study of cleft lip and palate in Hawaii. II. Complex segregation analysis and genetic risks. Am. J. Hum. Genet. 26:177-188.
1975
142. Morton, N. E. Kinship, fitness and evolution. In “The Role of Natural Selection in Human Evolution” (Burg Wartenstein, Austria, August 1974), (F. M. Salzano, ed.), pp. 133-154. North Holland, Amsterdam.
143. Halperin, S. L., Rao, D. C., and Morton, N. E. (short communication). A twin study of intelligence in Russia. Behav. Genet. 5:83-86.
144. Morton, N. E. Kinship, information and biological distance. Theor. Pop. Biol. 7:246-255.
145. MacLean, C. J., Morton, N. E., and Lew, R. Analysis of family resemblance. IV. Operational characteristics of segregation analysis. Am. J. Hum. Genet. 27:365-384.
146. Rao, D. C., MacLean, C. J., Morton, N. E., and Yee, S. Analysis of family resemblance. V. Height and weight in northeastern Brazil. Am. J. Hum. Genet. 27:509-520.
147. Morton, N. E. Interracial crosses and group differences. In “Racial Variation in Man” (Institute of Biology Symposium 22), (F. J. Ebling, ed.), pp. 151-169. Blackwell, London.
148. Morton, N. E., and Rao, D. C. (notes and comments). Monomorphism and heterozygosity. Heredity 34:427-431.
149. Morton, N. E. (book review). “Genetics and Social Structure: Mathematical Structuralism in Population Genetics and Social Theory”: P. Ballanoff (ed.), Benchmark Papers in Genetics. Dowden, Hutchinson & Ross, Stroudsburg, PA, 1974. Am. J. Hum. Genet. 27:255-256.
150. Morton, N. E., Jacobs, P. A., Frackiewicz, A., Law, P., and Hilditch, J. The effect of structural aberrations of the chromosomes on reproductive fitness in man. I. Methodology. Clin. Genet. 8:159-168.
151. Jacobs, P. A., Frackiewicz, A., Law, P., Hilditch, O. J., and Morton, N. E. The effect of structural aberrations of the chromosomes on reproductive fitness in man. II. Results. Clin. Genet. 8:169-178.
152. Morton, N. E. (book review). “Genealogical Mathematics” (Proceedings of the MSSB Conference on Genealogical Mathematics, Houston, TX, February 1974), (P. A. Ballanoff, ed.), Mouton, Paris. Am. J. Hum. Genet. 27:696-697.
153. Morton, N. E. Analysis of family resemblance and group differences. Soc. Biol. 22:111-116.
154. Freire-Maia, A., Freire-Maia, N., Morton, N. E., Azevedo, E. S., and Quelce-Salgado, A. Genetics of acheiropodia (the handless and footless families of Brazil). VI. Formal genetic analysis. Am. J. Hum. Genet. 27:521-52.
155. Morton, N. E. Appendix: Theory of inbreeding effect on diploid bees. In W. E. Kerr, Population genetic studies in bees (Apidae, Hymenoptera). 1. Genetic load. An. Acad. Brasil. Cienc. 47:317-334.
1976
156. Morton, N. E., Smith, C., Hill, R., Frackiewicz, A., Law, P., and Yee, S. Population structure of Barra (Outer Hebrides). Ann. Hum. Genet. Lond. 39:339-352.
157. Morton, N. E. (review article). Genetic markers in atherosclerosis: A review. J. Med. Genet. 13:81-90.
158. Rao, D. C., Morton, N. E., and Yee, S. Resolution of cultural and biological inheritance by path analysis. Am. J. Hum. Genet. 28:228-242.
159. Morton, N. E., and Keats, B. Human microdifferentiation in the Western Pacific. In “The Origin of the Australians” (R. L. Kirk and A. G. Thorne, eds.), pp. 379-399. Australian Institute of Aboriginal Studies, Canberra.
160. Morton, N. E., Stout, W. T., and Fischer, C. Academic performance in Hawaii. Soc. Biol. 23:13-20.
161. Morton, N. E., and Lindsten, J. Surveillance of Down's syndrome as a paradigm of population monitoring. Hum. Hered. 26:360-371.
162. MacLean, C. J., Morton, N. E., Elston, R. C., and Yee, S. Skewness in commingled distributions. Biometrics 32:695-699.
163. Jacobs, P. A., Mayer, M., and Morton, N. E. Acrocentric chromosome associations in man. Am. J. Hum. Genet. 28:567-576.
164. Morton, N. E., MacLean, C. J., Kagan, A., Gulbrandsen, C. L., Rhoads, G. G., Yee, S., and Lew, R. Commingling in distributions of lipids and related variables. Am. J. Hum. Genet. 29:52-59.
165. Morton, N. E., Rao, D. C., and Yee, S. An inferred chiasma map of Drosophila melanogaster. Heredity 37:405-411.
166. Morton, N. E. Forces maintaining polymorphism. Acta Anthropol. 1:3-14.
167. Morton, N. E. (letter to the editor). Heritability of IQ. Science 194:9-10.
168. Lindsten, J., Cerasi, E., Luft, R., Morton, N. E., and Ryman, N. Significance of genetic factors for the plasma insulin response to glucose in healthy subjects. Clin. Genet. 10:126-134.
1977
169. Morton, N. E., MacLean, C. J., Kagan, A., Gulbrandsen, C. L., Rhoads, G. G., Yee, S., and Lew, R. Commingling in distributions of lipids and related variables. Am. J. Hum. Genet. 29:52-59.
170. Morton, N. E., Rao, D. C., Lang-Brown, H., MacLean, C. J., Bart, R. D., and Lew, R. Colchester revisited: A genetic study of mental defect. J. Med. Genet. 14:1-9.
171. Rao, D. C., Morton, N. E., Lindsten, J., Hulten, M., and Yee, S. A mapping function for man. Hum. Hered. 27:99-104.
172. Morton, N. E., Rao, D. C., Lindsten, J., Hulten, M., and Yee, S. A chiasma map of man. Hum. Hered. 27:38-51.
173. Morton, N. E. Isolation by distance in human populations. Ann. Hum. Genet. Lond. 40:361-365.
174. Rao, D. C., Morton, N. E., Elston, R. C., and Yee, S. Causal analysis of academic performance. Behav. Genet. 7:147-157.
175. Morton, N. E., and Lalouel, J.-M. (letter to the editor). Genetic epidemiology of Lesch-Nyhan disease. Am. J. Hum. Genet. 29:304-307.
176. Morton, N. E. Genetic aspects of prematurity. In “The Epidemiology of Prematurity” (D. M. Reed and F. J. Stanley, eds.), pp. 213-230. Urban & Schwarzenberg, Baltimore and Munich.
177. Rao, D. C., and Morton, N. E. Residual family resemblance for PTC taste sensitivity. Hum. Genet. 36:317-320.
178. Morton, N. E. Some aspects of the genetic epidemiology of common diseases. In “Gene-Environment Interaction in Common Disease” (Proceedings of Symposium, Japan Medical Research Foundation, Tokyo, February 1976), pp. 21-40. University of Tokyo Press, Tokyo.
179. Morton, N. E., Dick, H. M., Allan, N. C., Izatt, M. M., Hill, R., and Yee, S. Bioassay of kinship in northwestern Europe. Ann. Hum. Genet. Lond. 41:249-255.
180. Keats, B. J. B., Morton, N. E., and Rao, D. C. Likely linkage: Inv with Jk. Hum. Genet. 39:157-159.
181. Lalouel, J.-M., Morton, N. E., MacLean, C. J., and Jackson, J. Recurrence risk in complex inheritance with special regard to pyloric stenosis. J. Med. Genet. 14:408-414.
182. Gulbrandsen, C. L., Morton, N. E., Rhoads, G. G., Kagan, A., and Lew, R. Behavioral, social, and physiological determinants of lipoprotein concentrations. Soc. Biol. 24:289-293.
183. Jacobs, P. A., and Morton, N. E. Origin of trisomics and polyploids. Hum. Hered. 27:59-72.
184. Morton, N. E. Resolution of cultural inheritance, polygenes and major loci. In “Human Genetics” (Proceedings of the Fifth International Congress of Human Genetics, Mexico City, Mexico, October 1976), (S. Armendares and R. Lisker, eds.), pp. 236-243. Excerpta Medica, Amsterdam and Oxford.
185. Jackson, J. F., Currier, R. D., Terasaki, P. I., and Morton, N. E. Spinocerebellar ataxia and HLA linkage. N. Engl. J. Med. 296:1138-1141.
1978
186. Gerrard, J. W., Rao, D. C., and Morton, N. E. A genetic study of immunoglobulin E. Am. J. Hum. Genet. 30:46-58.
187. Morton, N. E. Analysis of crossing over in man. In “Human Gene Mapping 4” (Winnipeg Conference 1977, University of Manitoba, Canada, August 1977), (J. Hamerton, ed.). Cytogenet. Cell Genet. 22:15-36.
188. Morton, N. E., Matsuura, J., Bart, R., and Lew, R. Genetic epidemiology of an institutionalized cohort of mental retardates. Clin. Genet. 13:449-461.
189. Morton, N. E., and Rao, D. C. Quantitative inheritance in man. Yearb. Phys. Anthropol. 21:12-41.
190. Rao, D. C., Keats, B. J. B., Morton, N. E., Yee, S., and Lew, R. Variability of human linkage data. Am. J. Hum. Genet. 30:516-529.
191. Morton, N. E. Effect of inbreeding on IQ and mental retardation. Proc. Natl. Acad. Sci. USA 75:3906-3908.
192. Rao, D. C., Morton, N. E., and Yee, S. (letter to the editor). Resolution of cultural and biological inheritance by path analysis: Corrigenda and reply to Goldberger letter. Am. J. Hum. Genet. 30:445-448.
193. Rhoads, G. G., Morton, N. E., Gulbrandsen, C. L., and Kagan, A. Sinking pre-β-lipoprotein and coronary heart disease in Japanese-American men in Hawaii. Am. J. Epidemiol. 108:350-356.
194. Morton, N. E., Gulbrandsen, C. L., Rhoads, G. G., Kagan, A., and Lew, R. Major loci for lipoprotein concentrations. Am. J. Hum. Genet. 30:583-589.
195. Morton, N. E., Gulbrandsen, C. L., Rhoads, G. G., and Kagan, A. The Lp lipoprotein in Japanese. Clin. Genet. 14:207-212.
196. Rao, D. C., and Morton, N. E. IQ as a paradigm in genetic epidemiology. In “Genetic Epidemiology” (N. E. Morton and C. S. Chung, eds.), pp. 145-182. Academic Press, New York.
197. Jackson, J. F., Whittington, J. E., Currier, R. D., Terasaki, P. I., Morton, N. E., and Keats, B. J. B. Genetic linkage and spinocerebellar ataxia. In “Advances in Neurology,” vol. 21 (R. A. Kirk, R. N. Rosenberg, and L. J. Schut, eds.), pp. 315-318. Raven Press, New York.
198. Keats, B. J. B., Morton, N. E., and Rao, D. C. Possible linkages (lod score over 1.5) and a tentative map of the Jk-Km linkage group. Cytogenet. Cell Genet. 22:304-308.
199. Rao, D. C., Keats, B. J. B., and Morton, N. E. Characteristics of a linkage heterogeneity test. Cytogenet. Cell Genet. 22:711-713.
1979
200. Lalouel, J.-M., Morton, N. E., and Jackson, J. Neural tube malformations: Complex segregation analysis and calculations of recurrence risks. J. Med. Genet. 16:8-13.
201. Morton, N. E., Jacobs, P. A., and Mayer, M. (letter to the editor). Response to Carothers' letter. Am. J. Hum. Genet. 31:84-85.
202. Rao, D. C., Chung, C. S., and Morton, N. E. Genetic and environmental determinants of periodontal disease. Am. J. Med. Genet. 4:39-45.
203. Williams, W., Morton, N. E., Lew, R., and Yee, S. The likely region of overlap (LRO) method for physical assignment of loci. Hum. Genet. 47:297-304.
204. Rao, D. C., Morton, N. E., Gulbrandsen, C. L., Rhoads, G. G., Kagan, A., and Yee, S. Cultural and biological determinants of lipoprotein concentrations. Ann. Hum. Genet. Lond. 42:446-477.
205. Morton, N. E., and Rao, D. C. Causal analysis of family resemblance. In “Genetic Analysis of Common Diseases: Applications to Predictive Factors in Coronary Disease” (Workshop, Snowbird, Utah, August 1978), (C. F. Sing and M. Skolnick, eds.), pp. 431-452. Liss, New York.
206. Rao, D. C., Morton, N. E., and Cloninger, C. R. Path analysis under generalized assortative mating. I. Theory. Genet. Res. 33:175-188.
207. Morton, N. E. Genetics of hyperuricemia in families with gout. Am. J. Med. Genet. 4:103-106.
208. Rao, D. C., Keats, B. J. B., Lalouel, J.-M., Morton, N. E., and Yee, S. A maximum likelihood map of chromosome 1. Am. J. Hum. Genet. 31:680-696.
209. Gulbrandsen, C. L., Morton, N. E., Rao, D. C., Rhoads, G. G., and Kagan, A. Determinants of plasma uric acid. Hum. Genet. 50:307-312.
210. Keats, B. J. B., Morton, N. E., and Rao, D. C. Possible linkages (lod score over 1.5) and a tentative map of the Jk-Km linkage group. Cytogenet. Cell Genet. 22:304-308.
211. Rao, D. C., Keats, B. J. B., and Morton, N. E. Characteristics of a linkage heterogeneity test. Cytogenet. Cell Genet. 22:711-713.
212. Zavala, C., Morton, N. E., Rao, D. C., Lalouel, J.-M., Gamboa, I. A., Tejeda, A., and Lisker, R. Complex segregation analysis of diabetes mellitus. Hum. Hered. 29:325-333.
213. Morton, N. E. Comments in Spielman, R. S., Migliazza, E. C., Neel, J. V., Gershowitz, D. E., and Arauz, R. T. The evolutionary relationships of two populations: A study of the Guaymi and the Yanomama. Curr. Anthropol. 20:377-378.
214. Morton, N. E. (book review). “Kinometrics: Determinants of Socio-Economic Success within and between Families”: Paul Taubman, ed. North-Holland, Amsterdam, 1977. Soc. Biol. 26:84-85.
215. Morton, N. E., and Lalouel, J.-M. Genetic counselling in sex linkage. In “Risk, Communication, and Genetic Counselling” (C. J. Epstein, C. J. R. Curry, S. Packman, S. Sherman, and B. D. Hall, eds., Birth Defects Orig. Art. Ser., vol. XV-5C, National Foundation), pp. 9-24. Liss, New York.
216. Morton, N. E. Genetic epidemiology in pedigrees: Kinship and path analysis. Braz. J. Genet. 1:1-15.
217. Morton, N. E. Diseases determined by major genes. Soc. Biol. 26:94-103.
218. Shows, T. B., Alper, C. A., Bootsma, D., Dorf, M., Douglas, T., Huisman, T., Kit, S., Klinger, H. P., Kozak, C., Lalley, P. A., Lindsley, D., McAlpine, P. J., McDougall, J. K., Meera Khan, P., Meisler, M., Morton, N. E., Opitz, J. M., Partridge, C. W., Payne, R., Roderick, T. H., Rubenstein, P., Ruddle, F. H., Shaw, M., Spranger, J. W., and Weiss, K. International System for Human Gene Nomenclature (1979). Cytogenet. Cell Genet. 25:96-116.
1980
219. Morton, N. E., and Yasuda, N. Transition matrices with mutation. Am. J. Hum. Genet. 32:202-211.
220. Morton, N. E., Gulbrandsen, C. L., Rao, D. C., Rhoads, G. G., and Kagan, A. Determinants of blood pressure in Japanese-American families. Hum. Genet. 53:261-266.
221. Rao, D. C., Lalouel, J.-M., Morton, N. E., and Gerrard, J. W. (brief communication). Immunoglobulin E revisited. Am. J. Hum. Genet. 32:620-625.
222. Rao, D. C., and Morton, N. E. Path analysis of quantitative inheritance. In “Current Developments in Anthropological Genetics,” vol. 1 (J. H. Mielke and M. H. Crawford, eds.), pp. 355-372. Plenum, New York.
223. Morton, N. E. Genetic epidemiology of isolates. In “Population Structure and Genetic Disorders” (Sigrid Juselius VII Symposium, Mariehamn, Aland Islands, Finland, August 1978), (A. W. Eriksson et al., eds.), pp. 43-56. Academic Press, London.
224. Morton, N. E., Lalouel, J.-M., Jackson, J. F., Currier, R. D., and Yee, S. Linkage studies in spinocerebellar ataxia (SCA). Am. J. Med. Genet. 6:251-257.
225. Krieger, H., Morton, N. E., Rao, D. C., and Azevedo, E. Familial determinants of blood pressure in northeastern Brazil. Hum. Genet. 53:415-418.
226. Morton, N. E., and Rao, D. C. Hereditary genius: A centennial problem in resolution of cultural and biological inheritance. Soc. Biol. 27:48-52.
227. Morton, N. E. (book review). “Mathematical Theory of Quantitative Genetics”: M. G. Bulmer, The Clarendon Press, Oxford University Press, New York, 1980. Soc. Biol. 27:164-165.
228. Tiwari, J. L., Morton, N. E., Lalouel, J.-M., Terasaki, P. I., Zander, H., Hawkins, B. R., and Cho, Y. W. Joint report: Multiple sclerosis. In “Histocompatibility Testing 1980” (Eighth International Histocompatibility Workshop, Los Angeles, February 1980), (P. I. Terasaki, ed.), pp. 687-692. UCLA Tissue Typing Laboratory, Los Angeles.
1981
229. Haile, R. W., Hodge, S. E., Iselius, L., Morton, N. E., and Detels, R. Segregation and linkage analysis of 40 multiplex multiple sclerosis families. Hum. Hered. 31:252-258.
230. Barbosa, C. A. A., Rao, D. C., and Morton, N. E. Analysis of family resemblance for immunoglobulin M, G and A levels. Hum. Hered. 31:8-14.
231. Morton, N. E., and Lalouel, J.-M. Resolution of linkage for irregular phenotype systems. Hum. Hered. 31:3-7.
232. Keats, B. J. B., Morton, N. E., and Rao, D. C. Reduction of physical assignments to a standard lod table: Chromosome 1. Hum. Genet. 56:353-359.
233. Morton, N. E. Mutation rates for human autosomal recessives. In “Population and Biological Aspects of Human Mutation” (Proceedings of Birth Defects Symposium XI, Albany, September 1980), (E. B. Hook and I. H. Porter, eds.), pp. 65-89. Academic Press, New York.
234. Morton, N. E., and Barbosa, C. Age, area, and acheiropody. Hum. Genet. 57:420-422.
235. Lalouel, J.-M., and Morton, N. E. Complex segregation analysis with pointers. Hum. Hered. 31:312-321.
236. Rao, D. C., Morton, N. E., Gottesman, I. I., and Lew, R. Path analysis of qualitative data on pairs of relatives: Application to schizophrenia. Hum. Hered. 31:325-333.
237. Simpson, S. P., and Morton, N. E. Complex segregation analysis of the locus for β-aminoisobutyric acid excretion (BAIB). Hum. Genet. 59:64-67.
238. Barbosa, C. A. A., Morton, N. E., Rao, D. C., and Krieger, H. Biological and cultural determinants of immunoglobulin levels in a Brazilian population with Chagas' disease. Hum. Genet. 59:161-163.
239. Morton, N. E., Kennet, R., Yee, S., and Lew, R. Bioassay of kinship in populations of Middle Eastern origin and controls. Curr. Anthropol. 23:157-167.
240. Ho, H. Z., Tiwari, J. L., Haile, R. W., Terasaki, P. I., and Morton, N. E. HLA-linked and unlinked determinants of multiple sclerosis. Immunogenetics 15:509-517.
1982
241. Lauder, I. J., Morton, N. E., and Yee, S. Estimation of polymorphism in quantitative measurements on homologous pairs of human chromosomes. Comput. Biomed. Res. 11:89-101.
242. Morton, N. E. Kinship and inbreeding in populations of Middle Eastern origin and controls. In “Current Developments in Anthropological Genetics: Ecology and Population Structure,” vol. 2 (M. H. Crawford and J. H. Mielke, eds.), pp. 449-466. Plenum, New York.
243. Freire-Maia, A., Freire-Maia, D., and Morton, N. E. Epidemiology and genetics of endemic goiter. II. Genetic aspects. Hum. Hered. 32:176-180.
244. Morton, N. E., Hassold, T. J., Funkhouser, J., McKenna, P. W., and Lew, R. Cytogenetic surveillance of spontaneous abortion. Cytogenet. Cell Genet. 33:232-239.
245. Morton, N. E. Segregation and linkage analysis. In “Human Genetics,” Part B: “Medical Aspects” (Proceedings of the Sixth International Congress of Human Genetics, Jerusalem, September 1981), (B. Bonne-Tamir et al., eds.), pp. 3-14. Liss, New York.
246. Morton, N. E., Williams, W. R., and Lew, R. Trials of structured exploratory data analysis. Am. J. Hum. Genet. 34:489-500.
247. Morton, N. E. (letter to the editor). Heterogeneity in nonsyndromal congenital glaucoma. Am. J. Med. Genet. 12:97-102.
248. Rao, D. C., Morton, N. E., Lalouel, J.-M., and Lew, R. Path analysis under generalized assortative mating. II. American IQ. Genet. Res. Cambridge 39:187-198.
249. Dunsworth, T. S., Rich, S. S., Morton, N. E., and Barbosa, J. Heterogeneity of insulin-dependent diabetes: New evidence. Clin. Genet. 21:233-236.
250. Green, A., Morton, N. E., Iselius, L., Svejgaard, A., Platz, P., and Ryder, L. P. Genetic studies of insulin-dependent diabetes mellitus: Segregation and linkage analysis. Tissue Antigens 19:213-221.
251. Iselius, L., Lindsten, J., Morton, N. E., Efendic, S., Cerasi, E., Haegermark, A., and Luft, R. Evidence for an autosomal recessive gene regulating the persistence of the insulin response to glucose in man. Clin. Genet. 22:180-194.
252. Morton, N. E. Estimation of demographic parameters from isolation by distance. Hum. Hered. 32:37-41.
253. Morton, N. E., Lindsten, J., Iselius, L., and Yee, S. Data and theory for a revised chiasma map of man. Hum. Genet. 62:266-270.
254. Morton, N. E. Interactions in multifactorial systems. In “Immunogenetics in Rheumatology” (D-Pen-HLA 82 Workshop, Perth, Western Australia, April 1982), (R. L. Dawkins, F. T. Christiansen, and P. J. Zilko, eds.), pp. 48-51. Excerpta Medica, Amsterdam, Oxford, and Princeton.
255. Morton, N. E. The design of genetic studies. In “Immunogenetics in Rheumatology” (D-Pen-HLA 82 Workshop, Perth, Western Australia, April 1982), (R. L. Dawkins, F. T. Christiansen, and P. J. Zilko, eds.), pp. 73-77. Excerpta Medica, Amsterdam, Oxford, and Princeton.
1983
256. Williams, W. R., Thompson, M. W., and Morton, N. E. Complex segregation analysis and computer-assisted genetic risk assessment for Duchenne muscular dystrophy. Am. J. Med. Genet. 14:315-333.
257. Morton, N. E., Green, A., Dunsworth, T., Svejgaard, A., Barbosa, J., Rich, S. S., Iselius, L., Platz, P., and Ryder, L. P. Heterozygous expression of IDDM determinants in the HLA system. Am. J. Hum. Genet. 35(2):201-213.
258. Williams, W. R., Morton, N. E., Rao, D. C., Gulbrandsen, C. L., Rhoads, G. G., and Kagan, A. Family resemblance for fasting blood glucose in a population of Japanese-Americans. Clin. Genet. 23:287-293.
259. Rao, D. C., Williams, W. R., McGue, M., Morton, N. E., Gulbrandsen, C. L., Rhoads, G. G., Kagan, A., Laskarzewski, P., Glueck, C. J., and Russell, J. M. Cultural and biological inheritance of plasma lipids. Am. J. Phys. Anthropol. 63:33-49.
260. Cloninger, C. R., Rao, D. C., Rice, J., Reich, T., and Morton, N. E. A defense of path analysis in genetic epidemiology. Am. J. Hum. Genet. 35:733-756.
261. Jackson, J. F., Currier, R. D., and Morton, N. E. Dominant spinocerebellar ataxia: Genetic counselling. J. Neurogenet. 1:87-90.
262. Morton, N. E., Simpson, S. P., Lew, R., and Yee, S. Estimation of haplotype frequencies. Tissue Antigens 22:257-262.
263. Morton, N. E. An exact linkage test for multiplex case families. Hum. Hered. 33:244-249.
264. Iselius, L., Morton, N. E., and Rao, D. C. Family resemblance for blood pressure. Hum. Hered. 33:277-286.
265. Rao, D. C., Morton, N. E., Glueck, C. J., Laskarzewski, P. M., and Russell, J. M. Heterogeneity between populations for multifactorial inheritance of plasma lipids. Am. J. Hum. Genet. 35:468-483.
266. Morton, N. E., and Simpson, S. P. Kinship mapping of multilocus systems. Hum. Genet. 64(2):103-104.
267. Morton, N. E. “Prospects in Genetic Epidemiology,” pp. 1-3. Japan Medical Research Foundation Tenth Anniversary Volume, University of Tokyo Press, Tokyo.
268. Lalouel, J.-M., Rao, D. C., Morton, N. E., and Elston, R. C. A unified model for complex segregation analysis. Am. J. Hum. Genet. 35:816-826.
269. Morton, N. E. (book review). “Biometrical Genetics,” 3rd ed.: Sir Kenneth Mather and John L. Jinks, Chapman & Hall, London, 1982. Am. J. Hum. Genet. 35:777-778.
1984
270. Sherman, S. L., Morton, N. E., Jacobs, P. A., and Turner, G. Marker X syndrome: A cytogenetic and genetic analysis. Ann. Hum. Genet. 48:21-37.
271. Morton, N. E. Trials of segregation analysis by deterministic and macro simulation. In “Methods in Human Population Genetics” (C. C. Li Symposium, Pittsburgh, October 1982), (A. Chakraborty, ed.). Van Nostrand Reinhold, New York.
272. Morton, N. E. Linkage and association. In “Proceedings of Workshop on Genetic Epidemiology of Coronary Heart Disease: Past, Present, and Future” (Washington University, St. Louis, MO, August 1983). Liss, New York.
273. Tiwari, J. L., Betuel, H., Gebuhrer, L., and Morton, N. E. Genetic epidemiology of coeliac disease. Genet. Epidemiol. 1:37-42.
274. Barbosa, C. A. A., Morton, N. E., Wette, R., Rao, D. C., and Krieger, H. Race, height, and blood pressure in northeastern Brazil. Soc. Biol. 30:211-217.
275. Morton, N. E., Sherman, S. L., MacLean, C. J., Yee, S., and Lew, R. Genetic Analysis Workshop II: Combined segregation, linkage, and association analysis. Genet. Epidemiol. 1:195-199.
276. MacLean, C. J., Morton, N. E., and Yee, S. Combined analysis of genetic segregation and linkage under an oligogenic model. Comput. Biomed. Res. 17:471-480.
277. Morton, N. E., and MacLean, C. J. Multilocus recombination frequencies. Genet. Res. Cambridge 44:99-108.
278. Haile, R. W. C., Iselius, L., Fine, P. E. M., and Morton, N. E. Segregation and linkage analyses of 72 leprosy pedigrees. Hum. Hered. 35:43-52.
279. Iselius, L., Carlson, L. A., Morton, N. E., Efendic, S., Lindsten, J., and Luft, R. Genetic and environmental determinants for lipoprotein concentrations in blood. Acta Med. Scand. 217:161-170.
1985
280. Morton, N. E., MacLean, C. J., and Lew, R. Tests of hypotheses on recombination frequencies. Genet. Res. Cambridge 45:279-286.
281. Sherman, S. L., Jacobs, P. A., Morton, N. E., Froster-Iskenius, U., Howard-Peebles, P. N., Nielsen, K. B., Partington, M. W., Sutherland, G. R., Turner, G., and Watson, M. Further segregation analysis of the fragile X syndrome with special reference to transmitting males. Hum. Genet. 69:289-299.
282. Morton, N. E., Berg, K., Dahlen, G., Ferrell, R. E., and Rhoads, G. G. Genetics of the Lp lipoprotein in Japanese-Americans. Genet. Epidemiol. 2:113-121.
283. MacLean, C. J., Morton, N. E., and Lew, R. Efficiency of lod scores for representing multiple locus linkage data. Genet. Epidemiol. 2:145-154.
284. MacLean, C. J., and Morton, N. E. Estimation of myriad haplotype frequencies. Genet. Epidemiol. 2:263-272.
285. Green, A., Svejgaard, A., Platz, P., Ryder, L. P., Jakobsen, B. K., Morton, N. E., and MacLean, C. J. The genetic susceptibility to insulin-dependent diabetes mellitus: Combined segregation and linkage analysis. Genet. Epidemiol. 2:1-15.
286. Morton, N. E., and Lew, R. Mapping genetic systems by the supratype method. Hum. Genet. 70:231-235.
287. Sherman, S. L., and Morton, N. E. Genetic Analysis Workshop III: Construction of genetic maps using two-point lod tables. Genet. Epidemiol. 2:223-224.
288. Lalouel, J.-M., LeMignon, L., Simon, M., Fauchet, R., Bourel, M., Rao, D. C., and Morton, N. E. Genetic analysis of idiopathic hemochromatosis using both qualitative (disease status) and quantitative (serum iron) information. Am. J. Hum. Genet. 37:700-718.
289. Haile, R. W. C., Iselius, L., Fine, P. E. M., and Morton, N. E. Segregation and linkage analysis of 72 leprosy pedigrees. Hum. Hered. 35:43-52.
290. Brahe, C., Serra, A., and Morton, N. E. Erythrocyte catechol-O-methyltransferase activity: Genetic analysis in nuclear families with one child affected by Down syndrome. Am. J. Med. Genet. 21:373-384.
291. Povey, S., Morton, N. E., and Sherman, S. L. Report of the Committee on the Genetic Constitution of Chromosomes 1 and 2. Cytogenet. Cell Genet. 40:67-106.
1986
292. Morton, N. E., MacLean, C. J., Lew, R., and Yee, S. Multipoint linkage analysis. Am. J. Hum. Genet. 38:868-883.
293. Shahar, S., and Morton, N. E. Origin of teratomas and twins. Hum. Genet. 74:215-218.
294. Sherman, S. L., Iselius, L., Gallano, P., Buckton, K., Collyer, S., DeMey, R., Kristoffersson, U., Lindsten, J., Mikkelsen, M., Morton, N. E., Newton, M., Nordenson, I., Petersen, M. B., and Wahlstrom, J. Segregation analysis of balanced pericentric inversions in pedigree data. Clin. Genet. 30:87-94.
295. Horn, N., and Morton, N. E. Genetic epidemiology of Menkes disease. Genet. Epidemiol. 3:225-230.
296. Rhoads, G. G., Dahlen, G., Berg, K., Morton, N. E., and Dannenberg, A. L. Lp(a) lipoprotein as a risk factor for myocardial infarction. JAMA 256:2540-2544.
297. Morton, N. E. Foundations of genetic epidemiology. J. Genet. 65:202-212.
1987
298. Pascoe, L., and Morton, N. E. The use of map functions in multipoint mapping. Am. J. Hum. Genet. 40:174-183.
299. Rich, S. S., Green, A., Morton, N. E., and Barbosa, J. A combined segregation and linkage analysis of insulin-dependent diabetes mellitus. Am. J. Hum. Genet. 40:237-249.
300. Morton, N. E., Chiu, D., Holland, C., Jacobs, P. A., and Pettay, D. Chromosome anomalies as predictors of recurrence risk for spontaneous abortion. Am. J. Med. Genet. 28:19-26.
Research Contributions of Newton E. Morton
301. Pascoe, L., and Morton, N. E. (letter). The inheritance of cutaneous malignant melanoma (CMM) and dysplastic nevus syndrome (DNS). Am. J. Hum. Genet. 40:464.
302. Morton, N. E. (letter to the editors). Tests of order in multipoint linkage. Ann. Hum. Genet. 51:265.
303. Hinkle, L. E., Thaler, H. T., Merke, D. P., Renier-Berg, D., and Morton, N. E. The risk factors for arrhythmic death in a sample of men followed for 20 years. Am. J. Epidemiol. 127:500-515.
304. Kanamori, M., Morton, N. E., Fujiki, K., and Kondo, K. Genetic epidemiology of Duchenne muscular dystrophy in Japan: Classical segregation analysis. Genet. Epidemiol. 4:425-432.
305. Klemmer, S. J., Pascoe, L., DeCosse, J., and Morton, N. E. The occurrence of desmoids in patients with familial adenomatous polyposis of the colon. Am. J. Med. Genet. 28:385-392.
1988
306. Morton, N. E., and Wu, D. Alternative bioassays of kinship between loci. Am. J. Hum. Genet. 42:173-177.
307. Morton, N. E. Multipoint mapping and the emperor's clothes. Ann. Hum. Genet. 52:309-318.
308. Morton, N. E., Jacobs, P. A., Hassold, T., and Wu, D. Maternal age in trisomy. Ann. Hum. Genet. 52:227-235.
309. Morton, N. E., Wu, D., and Jacobs, P. Origin of sex chromosomal aneuploidy. Ann. Hum. Genet. 52:82-92.
310. Sherman, S. L., Aston, C. E., Morton, N. E., Speiser, P. W., and New, M. I. A segregation and linkage study of classical and nonclassical 21-hydroxylase deficiency. Am. J. Hum. Genet. 42:830-838.
311. Morton, N. E. (letter to the editor). Lod score redivivus. Nature 334:477-478.
1989
312. Morton, N. E., and Andrews, V. MAP, an expert system for multiple pairwise linkage analysis. Ann. Hum. Genet. 53:263-269.
1990
313. Basta, M., Morton, N. E., Mulvihill, J. J., Radovanovic, Z., Radojicic, C., and Marinkovic, D. Inheritance of acute appendicitis: Familial aggregation and evidence of polygenic transmission. Am. J. Hum. Genet. 46:377-382.
314. Iselius, L., Jacobs, P., and Morton, N. E. Leukaemia and transient leukaemia in Down syndrome. Hum. Genet. 85:477-485.
315. Littler, M., and Morton, N. E. Segregation analysis of peripheral neurofibromatosis (NF1). J. Med. Genet. 27:307-310.
316. Morton, N. E., Keats, B. J., Jacobs, P. A., Hassold, T., Pettay, D., Harvey, J., and Andrews, A. A centromere map of the X chromosome from trisomies of maternal origin. Ann. Hum. Genet. 54:39-47.
317. Morton, N. E., and Collins, A. Counting algorithms for linkage. Ann. Hum. Genet. 54:103-106.
318. Morton, N. E., and Collins, A. Standard maps of chromosome 10. Ann. Hum. Genet. 54:235-251.
319. Morton, N. E. (review). Genetic analysis of complex traits: Insulin-dependent diabetes mellitus and affective disorders. Ann. Hum. Genet. 54:181-182.
320. Morton, N. E. Genetic linkage in complex diseases: A comment. Genet. Epidemiol. 7:33-34.
Appendix
321. Morton, N. E. Pitfalls and prospects in genetic epidemiology of cancer. In "Recent Progress in the Genetic Epidemiology of Cancer" (H. T. Lynch and P. Tautu, eds.), pp. 18-26. Springer-Verlag, Heidelberg.
322. White, R. L., Lalouel, J.-M., Nakamura, Y., Donis-Keller, H., Green, P., Bowden, D. W., Mathew, C. G. P., Easton, D. F., Robson, E. B., Morton, N. E., Gusella, J. F., Haines, J. L., Retief, A. E., Kidd, K. K., Murray, J. C., Lathrop, G. M., and Cann, H. M. The CEPH Consortium primary linkage map of human chromosome 10. Genomics 6:393-412.
323. Collins, A., and Morton, N. E. Significance of maximal lods. Ann. Hum. Genet. 55:39-41.
324. Dracopoli, N. C., O'Connell, P., Elsner, T. I., Lalouel, J.-M., White, R. L., Buetow, K. H., Nishimura, D. Y., Murray, J. C., Helms, C., Mishra, S. K., Donis-Keller, H., Hall, J. M., Lee, M. K., King, M.-C., Attwood, J., Morton, N. E., Robson, E. B., Mahtani, M., Willard, H. F., Royle, N. J., Patel, I., Jeffreys, A. J., Verga, V., Jenkins, T., Weber, J. L., Mitchell, A. L., and Bale, A. The CEPH consortium linkage map of human chromosome 1. Genomics 9:686-700.
325. Keats, B. J., Sherman, S., Morton, N. E., Robson, E. B., Buetow, K. H., Cartwright, P. E., Chakravarti, A., Francke, U., Green, P., and Ott, J. Guidelines for human linkage maps. Genomics 9:557-560; Ann. Hum. Genet. 55:1-6.
326. Iselius, L., Slack, J., Littler, M., and Morton, N. E. Genetic epidemiology of breast cancer in Britain. Ann. Hum. Genet. 55:151-159.
327. Iselius, L., and Morton, N. E. (letter to the editor). Transmission probabilities are not correctly implemented in the computer program POINTER. Am. J. Hum. Genet. 49:459.
328. Morton, N. E. (book review). "Genetic Maps. Locus Maps of Complex Genomes" (Stephen J. O'Brien, ed.). Ann. Hum. Genet. 55:151.
329. Shields, D. C., Collins, A., Buetow, K. H., and Morton, N. E. Error filtration, interference, and the human linkage map. Proc. Natl. Acad. Sci. USA 88:6501-6505.
330. Morton, N. E. Parameters of the human genome. Proc. Natl. Acad. Sci. USA 88:7474-7476.
331. Lawrence, S., Morton, N. E., and Cox, D. R. Radiation hybrid mapping. Proc. Natl. Acad. Sci. USA 88:7477-7480.
332. Morton, N. E. Gene maps and location databases. Ann. Hum. Genet. 55:235-241.
333. Morton, N. E. (book review). "Statistics and Truth" (C. Radhakrishna Rao). J. Genet. 70:63-64.
334. Maher, E. R., Iselius, L., Yates, J. R. W., Littler, M., Benjamin, C., Harris, R., Sampson, J., Williams, A., Ferguson-Smith, M. A., and Morton, N. E. Von Hippel-Lindau disease: A genetic study. J. Med. Genet. 28:443-447.
335. Morton, N. E. Genetic epidemiology of hearing impairment. In "Genetics of Hearing Impairment." Ann. N.Y. Acad. Sci. 630:16-31.
336. Morton, N. E., Shields, D. C., and Collins, A. Genetic epidemiology of complex phenotypes. Ann. Hum. Genet. 55:301-314.
337. Houlston, R. S., Collins, A., Slack, J., Campbell, S., Collins, W. P., Whitehead, M. I., and Morton, N. E. Genetic epidemiology of ovarian cancer: Segregation analysis. Ann. Hum. Genet. 55:291-299.
338. Sham, P. C., Morton, N. E., and Rice, J. P. Segregation analysis of the NIMH collaborative study. Family data on bipolar disorder. Psychiatr. Genet. 2:175-184.
1992
339. Lawrence, S., and Morton, N. E. Physical mapping by multiple pairwise analysis. Cytogenet. Cell Genet. 59:107-109.
340. Morton, N. E. Genetic structure of forensic populations. Proc. Natl. Acad. Sci. USA 89:2556-2560.
341. Morton, N. E., and MacPherson, J. N. Population genetics of the fragile-X syndrome: Multiallelic model for the FMR1 locus. Proc. Natl. Acad. Sci. USA 89:4215-4217.
342. Collins, A., Keats, B. J., Dracopoli, N., Shields, D. C., and Morton, N. E. Integration of gene maps: Chromosome 1. Proc. Natl. Acad. Sci. USA 89:4598-4602.
343. Houlston, R. S., Collins, A., Slack, J., and Morton, N. E. Dominant genes for colorectal cancer are not rare. Ann. Hum. Genet. 56:99-103.
344. Ceccherini, I., Romeo, G., Lawrence, S., Breuning, M. H., Harris, P. C., Himmelbauer, H., Frischauf, A. M., Sutherland, G. R., Germino, G. G., Reeders, S. T., and Morton, N. E. Construction of a map of chromosome 16 by using radiation hybrids. Proc. Natl. Acad. Sci. USA 89:104-108.
345. Iselius, L., Littler, M., and Morton, N. E. Transmission of breast cancer: A controversy resolved. Clin. Genet. 41:211-217.
346. Kaye, C. I., Martin, A. O., Rollnick, B. R., Nagatoshi, K., Israel, J., Hermanoff, M., Tropea, B., Richtsmeier, J. T., and Morton, N. E. Oculoauriculovertebral anomaly: Segregation analysis. Am. J. Med. Genet. 43:913-917.
347. Morton, N. E., Collins, A., Lawrence, S., and Shields, D. C. Algorithms for a location database. Ann. Hum. Genet. 56:223-232.
348. Morton, N. E. The development of linkage analysis. In "The History and Development of Human Genetics: Progress in Different Countries" (K. R. Dronamraju, ed.), pp. 48-56. World Scientific, Singapore.
349. Lawrence, S., Keats, B. J., and Morton, N. E. The AD1 locus in familial Alzheimer disease. Ann. Hum. Genet. 56:295-301.
350. Morton, N. E. Genes for intelligence on the X chromosome. J. Med. Genet. 29:71-72.
351. Morton, N. E. (editorial). Major loci for atopy? Clin. Exp. Allergy 22:1041-1043.
352. Morton, N. E. The future of genetic epidemiology. Ann. Med. 24:557-562.
1993
353. Morton, N. E. DNA in court. Eur. J. Hum. Genet. 1:172-178.
354. Morton, N. E., Collins, A., and Balazs, I. Kinship bioassay on hypervariable loci in blacks and Caucasians. Proc. Natl. Acad. Sci. USA 90:1892-1896.
355. Lawrence, S., Collins, A., Keats, B. J., Hulten, M., and Morton, N. E. Integration of gene maps: Chromosome 21. Proc. Natl. Acad. Sci. USA 90:7210-7214.
356. McCarthy, M. I., Hitchins, M., Hitman, G. A., Cassell, P., Hawrami, K., Morton, N., Mohan, V., Ramachandran, A., Snehalatha, C., and Viswanathan, M. Positive association in the absence of linkage suggests a minor role for the glucokinase gene in the pathogenesis of type 2 (non-insulin-dependent) diabetes mellitus amongst South Indians. Diabetologia 36:633-641.
357. Morton, N. E. Recent developments in genetic epidemiology. In "Human Population Genetics" (P. P. Majumder, ed.), pp. 227-290. Plenum, New York.
358. Morton, N. E. Genetic epidemiology. Annu. Rev. Genet. 27:523-538.
1994
359. Shields, D. C., Marlow, A. J., Houlston, R. S., Eccles, D. M., and Morton, N. E. Prediction of genetic risks from segregation analyses of morbid risks. Hum. Hered. 44:52-55.
360. Morton, N. E. Disomic locus content mapping. Proc. Natl. Acad. Sci. USA 91:1421-1422.
361. Collins, A., and Morton, N. E. Likelihood ratios for DNA identification. Proc. Natl. Acad. Sci. USA 91:6007-6011.
362. Wang, L. H., Collins, A., Lawrence, S., Keats, B. J., and Morton, N. E. Integration of gene maps: Chromosome X. Genomics 22:590-604.
363. MacDonald, M., Hassold, T., Harvey, J., Wang, L. H., Morton, N. E., and Jacobs, P. The origin of 47,XXY and 47,XXX aneuploidy: Heterogeneous mechanisms and role of aberrant recombination. Hum. Mol. Genet. 3(8):1365-1371.
364. Morton, N. E. Genetic structure of forensic populations. Am. J. Hum. Genet. 55:587-588.
365. Morton, N. E. Fundamentals of genetic epidemiology. Book review. Genet. Epidemiol. 11:389-390.
366. Lawrence, S., Beasley, R., Doull, I., Begishvili, B., Lampe, F., Holgate, S. T., and Morton, N. E. Genetic analysis of atopy and asthma as quantitative traits and ordered polychotomies. Ann. Hum. Genet. 58:359-368.
367. Shields, D. C., Ratanachaiyavong, S., McGregor, A. M., Collins, A., and Morton, N. E. Combined segregation and linkage analysis of Graves disease with a thyroid autoantibody diathesis. Am. J. Hum. Genet. 55:540-554.
368. Eccles, D., Marlow, A., Royle, G., Collins, A., and Morton, N. E. Genetic epidemiology of early-onset breast cancer. J. Med. Genet. 31:944-949.
369. Scapoli, C., Ponz De Leon, M., Sassatelli, R., Benatti, P., Roncucci, L., Collins, A., Morton, N. E., and Barrai, I. Genetic epidemiology of hereditary non-polyposis colorectal cancer syndromes in Modena, Italy: Results of a complex segregation analysis. Ann. Hum. Genet. 58:275-295.
1995
370. Houlston, R. S., Collins, A., Kee, F., Collins, B. I., Shields, D. C., and Morton, N. E. Segregation analysis of colorectal cancer in North Ireland. Hum. Hered. 45:41-48.
371. Fisher, J. M., Harvey, J. F., Morton, N. E., and Jacobs, P. A. Trisomy 18: Studies of the parent and cell division of origin and the effect of aberrant recombination on nondisjunction. Am. J. Hum. Genet. 56:669-675.
372. Morton, N. E. Applicability of the Rao mapping function. Hum. Hered. 45:178-180.
373. Morton, N. E. DNA forensic science 1995. Eur. J. Hum. Genet. 3:139-144.
374. Morton, N. E. LODs past and present. Genetics 140:7-12.
375. Morton, N. E. Alternative approaches to population structure. Genetica 96:139-144.
376. Morris, A., Morton, N. E., Collins, A., Lawrence, S., and MacPherson, J. N. Evolutionary dynamics of the FMR1 locus. Ann. Hum. Genet. 59:283-289.
377. Forabosco, P., Collins, A., and Morton, N. E. Integration of gene maps: Updating chromosome 1. Ann. Hum. Genet. 59:291-305.
378. Morris, A., Morton, N. E., Collins, A., Macpherson, J., Nelson, D., and Sherman, S. An n-allele model for progressive amplification in the FMR1 locus. Proc. Natl. Acad. Sci. USA 92:4833-4837.
379. Collins, A., and Morton, N. E. Nonparametric tests for linkage with dependent sib pairs. Hum. Hered. 45:311-318.
380. Morton, N. E. Meta-analysis in complex diseases. Clin. Exp. Allergy 25(suppl. 2):110-112.
381. Collins, A., Forabosco, P., Lawrence, S., and Morton, N. E. An integrated map of chromosome 9. Ann. Hum. Genet. 59:393-402.
382. Watson, M., Lawrence, S., Collins, A., Beasley, R., Doull, I., Begishvili, B., Lampe, F., Holgate, S. T., and Morton, N. E. Exclusion from proximal 11q of a common gene with megaphenic effect on atopy. Ann. Hum. Genet. 59:403-411.
383. Wang, L. H., Collins, A., Lawrence, S., and Morton, N. E. Integration of gene maps: Mouse chromosome X. Braz. J. Genet. 18(3):373-383.
384. Morton, N. E., and Collins, A. Statistical and genetic aspects of quality control for DNA identification. Electrophoresis 16:1670-1677.
385. Morton, N. E. Genetic studies on atopy and asthma in Wessex. Clin. Exp. Allergy 25(suppl. 2):107-109.
1996
386. Morton, N. E. Logarithm of odds (lods) for linkage in complex inheritance. Proc. Natl. Acad. Sci. USA 93:3471-3476.
387. Collins, A., Teague, J., Keats, B. J., and Morton, N. E. Linkage map integration. Genomics 36:157-162.
388. Collins, A., MacLean, C. J., and Morton, N. E. Trials of the β model for complex inheritance. Proc. Natl. Acad. Sci. USA 93:9177-9181.
389. Morton, N. E. Statistical considerations for genetic analysis of atopy and asthma. In "The Genetics of Asthma" (S. B. Liggett and D. A. Meyers, eds.). Dekker, New York.
390. Teague, J. W., Collins, A., and Morton, N. E. Studies on locus content mapping. Proc. Natl. Acad. Sci. USA 93:11814-11818.
391. Morton, N. E. (letter to the editor). DNA identification. Ann. Hum. Genet. 59:1398-1399.
392. Collins, A., Frezal, J., Teague, J., and Morton, N. E. A metric map of humans: 23,500 loci in 850 bands. Proc. Natl. Acad. Sci. USA 93:14771-14775.
393. Morton, N. E., and Teague, J. W. Kinship, inbreeding, and matching probabilities. In "Molecular Biology and Human Diversity" (A. J. Boyce and C. G. N. Mascie-Taylor, eds.), pp. 51-62. Cambridge University Press, Cambridge.
394. Morton, N. E. Committee on DNA Forensic Science: An update (1996). National Research Council, National Academy Press. Review. Ann. Hum. Genet. 60(5):442-444.
395. Morton, N. E. (letter to the editor). R. Stat. Soc. News 24(3):4.
396. Almeida, R., Morton, N., Fidalgo, P., Leitao, N., Mira, C., Rueff, J., and Monteiro, C. APC intragenic haplotypes in familial adenomatous polyposis. Clin. Genet. 50:483-485.
397. Morton, N. E. Recent developments in kinship analysis. Riv. Antropol. 74:5-13.
1997
398. Morton, N. E. Conference consternation. Nat. Genet. 15:15.
399. Murray, A., Macpherson, J. N., Pound, M. C., Sharrock, A., Youings, S. A., Dennis, N. R., McKechnie, N., Linehan, P., Morton, N. E., and Jacobs, P. A. The role of size, sequence and haplotype in the stability of FRAXA and FRAXE alleles during transmission. Hum. Mol. Genet. 6(2):173-184.
400. Morton, N. E. Genetic epidemiology. Ann. Hum. Genet. 61:1-13.
401. Morton, N. E., and Collins, A. (commentary). The future of gene mapping. Genet. Anal. Biomol. Eng. 14:25-27.
402. Lio, P., and Morton, N. E. Comparison of parametric and nonparametric methods to map oligogenes by linkage. Proc. Natl. Acad. Sci. USA 94:5344-5348.
403. Eccles, D. M., Forabosco, P., Williams, A., Dunn, B., Williams, C., Bishop, D. T., and Morton, N. E. Segregation analysis of ovarian cancer using diathesis to include other cancers. Ann. Hum. Genet. 61:243-252.
404. Morton, N. E., and Lio, P. Oligogenic linkage and map integration. In "Genetic Mapping of Disease Genes" (Pawlowitzky, Edwards, and Thompson, eds.), pp. 17-21.
405. Morton, N. E. The forensic DNA endgame. Jurimetrics 37(summer):477-494.
406. Morton, N. E. Discussion of the paper by Foreman, Smith and Evett. J. R. Stat. Soc. A 160(part 3).
407. Morton, N. E., Teague, J. W., and Collins, A. The generalized product rule. Proceedings from the First European Symposium on Human Identification, 1996, pp. 34-43.
408. Collins, A., and Morton, N. E. Human genome mapping. In "Human Genome Methods" (K. W. Adolph, ed.). CRC Press, Boca Raton, FL.
1998
409. Morton, N. E. (correspondence). Hippocratic or hypocritic: Birth pangs of an ethical code. Nat. Genet. 18:18.
410. Teague, J. W., Morton, N. E., Dennis, N. R., Curtis, G., McKechnie, N., Macpherson, J. N., Murray, A., Pound, M. C., Sharrock, A. J., Youings, S. A., and Jacobs, P. A. FRAXA and FRAXE: Evidence against segregation distortion and for an effect of intermediate alleles on learning disability. Proc. Natl. Acad. Sci. USA 95:719-724.
411. Collins, A., and Morton, N. E. Mapping a disease locus by allelic association. Proc. Natl. Acad. Sci. USA 95:1741-1745.
412. Morton, N. E. Quantitative scores for asthma and atopy. Clin. Exp. Allergy 28:95-97.
413. Morton, N. E. Significance levels in complex inheritance. Am. J. Hum. Genet. 62:690-697.
414. Bugge, M., Collins, A., Petersen, M. B., Fisher, J., Brandt, C., Hertz, J. M., Tranebjaerg, L., de Lozier-Blanchet, C., Nicolaides, P., Brondum-Nielsen, K., Morton, N., and Mikkelsen, M. Nondisjunction of chromosome 18. Hum. Mol. Genet. 7(4):661-669.
415. Morton, N. E., and Collins, A. Tests and estimates of allelic association in complex inheritance. Proc. Natl. Acad. Sci. USA 95:11389-11393.
416. Lonjou, C., Collins, A., Ajioka, R. S., Jorde, L. B., Kushner, J. P., and Morton, N. E. Allelic association under map error and recombinational heterogeneity: A tale of two sites. Proc. Natl. Acad. Sci. USA 95:11366-11370.
417. Lonjou, C., Collins, A., Beckmann, J., Allamand, V., and Morton, N. Limb-girdle muscular dystrophy type 2A (CAPN3): Mapping using allelic association. Hum. Hered. 48:333-337.
418. Wilkinson, J., Grimley, S., Collins, A., Thomas, N. S., Holgate, S. T., and Morton, N. Linkage of asthma to markers on chromosome 12 in a sample of 240 families using quantitative phenotype scores. Genomics 53:251-259.
419. Morton, N. E. Genetics without frontiers. Nat. Genet. 20:329-330.
1999
420. Morton, N. E. Nomenclature and the internationalization of genetics. Genet. Soc. 38:22-23.
421. Lonjou, C., Collins, A., and Morton, N. E. Allelic association between marker loci. Proc. Natl. Acad. Sci. USA 96:1621-1626.
422. Morton, N. E. Genetic aspects of population policy. Clin. Genet. 56:105-109.
423. Murray, A., Webb, J., MacSwiney, F., Shipley, E. L., Morton, N. E., and Conway, G. S. Serum concentrations of follicle stimulating hormone may predict premature ovarian failure in FRAXA premutation women. Hum. Reprod. 14(5):1217-1218.
424. Murray, A., Webb, J., Dennis, N., Conway, G., and Morton, N. Microdeletions in FMR2 are a significant cause of premature ovarian failure. J. Med. Genet. 36:767-770.
425. Morton, N. E. Unsolved problems in genetic epidemiology. Hum. Hered. 50:5-13.
In press
426. Morton, N. E. LODs past and present. In "Perspectives in Genetics" (J. F. Crow, ed.).
427. Morton, N. E. Complex inheritance: The 21st century. In "Genetic Dissection of Complex Traits" (D. C. Rao and M. A. Province, eds.). Academic Press, San Diego, CA, 2001.
428. Gomes, I., Collins, A., Lonjou, C., Thomas, N. S., Wilkinson, J., Watson, M., and Morton, N. E. Hardy-Weinberg quality control. Ann. Hum. Genet.
429. Collins, A., Ennis, S., Tapper, W., and Morton, N. E. Mapping oligogenes for atopy and asthma by meta-analysis. Genet. Mol. Biol.
430. Lonjou, C., Collins, A., Ennis, S., Tapper, W., and Morton, N. E. Meta-analysis and retrospective collaboration: Two methods to map oligogenes for atopy and asthma. Clin. Exp. Allergy.
431. Collins, A., Lonjou, C., and Morton, N. E. Genetic epidemiology of single nucleotide polymorphisms. Proc. Natl. Acad. Sci. USA.
Books published
Morton, N. E., Chung, C. S., and Mi, M. P. "Genetics of Interracial Crosses." Karger, Basel, 1967.
Morton, N. E. (ed.). "Computer Applications in Genetics." University of Hawaii Press, Honolulu, 1969.
Morton, N. E. (ed.). "A Genetics Program Library." University of Hawaii Press, Honolulu, 1969.
Morton, N. E. (ed.). "Genetic Structure of Populations." University of Hawaii Press, Honolulu, 1973.
Morton, N. E., and Chung, C. S. (eds.). "Genetic Epidemiology." Academic Press, New York, 1978.
Keats, B. J. B., Morton, N. E., Rao, D. C., and Williams, W. "A Source Book for Linkage in Man." Johns Hopkins University Press, Baltimore, 1979.
Morton, N. E. "Outline of Genetic Epidemiology." Karger, New York, 1982.
Morton, N. E., Rao, D. C., and Lalouel, J. M. "Methods in Genetic Epidemiology." Karger, New York, 1983.
Subject Index

A
Admixture mapping
  assessment in case-control studies, 205-206
  population sampling, 446
Adoption, familial resemblance heritability estimation, 40-41
Affected sibpair analysis, see also Linkage analysis
  artificial neural networks, 293-294
  cost-effectiveness, 451-453
  emulation using lod scores, 130
  identical-by-descent proportions, 262
  lod scores compared, 108-111, 121, 130
  meta-analysis, 262
  multivariate phenotype analysis, 333-346
  sampling issues, 443
  sibship linkage model, 186-188, 443-444
  significance levels in genome scans, 475-484
  two-stage procedure, 463-467
    discordant relative pairs, 464-467
    mean statistic analysis, 463
  variance component methods, 173, 178
Allelic associations
  case-control heterogeneity studies
    genetic background, 201-203
    haplotype analysis, 200-201
  genome partitioning analysis, multipoint identity-by-descent variance component model, 317
  multipoint identity-by-descent variance component model, 317
  single locus, 415-417
  size effects, 428-430
  structure effects, 428-430
  two loci, 417-423
    D expectation in a finite population, 418-419
    D variance, 419-423
    p variance, 419-423
Angiotensinogen, essential hypertension genetics, 522-524
Artificial neural networks, applications, 287-296
  complex trait gene interactions, 291-292
  data applications, 292-295
    affected sibpair analysis, 293-294
    disequilibrium analysis, 294-295
    extended family linkage analysis, 295
  neural network function, 288-291
  overview, 287-288, 295-296
Association studies
  allelic associations, 415-423
    case-control heterogeneity studies
      genetic background, 201-203
      haplotype analysis, 200-201
    single locus, 415-417
    size effects, 428-430
    structure effects, 428-430
    two loci, 417-423
      D expectation in a finite population, 418-419
      D variance, 419-423
      p variance, 419-423
  artificial neural networks, 294-295
  challenges and issues
    genetic heterogeneity, 62-63, 203
    study design, 58-62
    type I and II errors, 58
  complex traits, 21
  contemporary approaches, 56-58
  future research directions, 538-540
  historical perspectives, 47-48
  human population structure, 431-434
  identity-by-descent associations, 415-417
  linkage analysis compared
    costs, 213-220
      methods, 215-217
      overview, 213-215, 218-220
      results, 217-218
    methods, 21, 49-53, 215-217
  localization determinant interactions, 393-409
    analysis, 405-407
    missing covariate data treatment, 407-408
    overview, 393-394, 408-409
    penetrance models, 394-398
    segregation analysis, 396-398
  meta-analysis, 267
  multiple-linked loci, 198-200
  overview, 45-53, 413-415
  power assessment, 205
  scanning theory, see Scanning technique
  significance levels in genome scans, 475-484
  single locus associations, 415-417
  structural relationships, 183-190
    overview, 183-185, 190
    SEGPATH models, 185-190
      multilocus linkage model, 188-189
      sibship linkage model, 186-188
      unique features, 189-190
  θ estimation from observed associations, 423-428
  transmission disequilibrium tests, 223-239
    genetic and environmental risk model, 226-227
    genotype-environment interactions, 228-229, 236-238, 405-407
    logistic regression model, 227
    overview, 224-226, 238-239
    simulation, 229-238
      logistic regression, 229-234
      MZ twins, 236-238
      other sibling inclusion, 236
      trio data analysis, 229-236
  two loci associations, 417-423
    D expectation in a finite population, 418-419
    D variance, 419-423
    p variance, 419-423
  unified model, 130
C
Case-control studies, 191-210
  admixture assessment, 205-206, 446
  allelic heterogeneity
    genetic background, 201-203
    haplotype analysis, 200-201
  association strength, 205
  genetic matching, 204
  haplotype analysis
    allelic heterogeneity, 200-201
    multiple-linked loci, 198-200
  linkage disequilibrium strength, 205
  outlier detection, 204
  overview, 192-194, 209-210
  physiologic significance, 206-208
  pleiotropy, 206-208, 318
  power assessment, 205
  sampling issues, 444-445
  statistical significance assessment, 196-198
  stratification, 194-195
Classification and regression trees, complex trait dissection, 15, 28-29, 281-284
Cloning, future research directions, 537-538
Coefficient of inbreeding, description, 415-417
Complex traits
  artificial neural networks, gene interactions, 291-292
  classification methods, 273-285
    challenges, 274-277
    overview, 273-274, 284-285
    recursive partitioning models, 277-284
      linkage trees, 15, 28-29, 281-284
      purity, 278-281
      splitting rules, 278-281
  genetic effects existence, 16-19
    familial environment compared, 17-19
    phenotypic variation causes, 17
  heterogeneity, see Heterogeneity
  linkage analysis, see Linkage analysis
  lod score method, 100, 105-106, 110, 127
  lumping and splitting strategies, 15-16, 26-29
    classification and regression trees, 15, 28-29, 281-284
    context-dependency, 28
    description, 15-16
    meta-analysis, 27, 257-258
    multivariate analysis, 27-28
  mapping technique, 380-387
    candidate genes, 386-387
    future research directions, 536-537, 541-542
    genome scanning, 121, 329-330, 383-386
    hypothesis testing, 386-387
    trait complexity, 380-383
  model-based versus model-free methods, 16, 374-380
  model-free analysis, see Model-free methods
  multivariate analysis
    data reduction, 329
    models and scenarios, 325-328
    overview, 27-28, 323-325, 333, 346
    quantitative trait loci mapping, 329-331
    results, 333-346
    simulation, 331-333
  overview, 13-16, 29-31
  phenotype analysis, see Phenotype
  scanning analysis, see Scanning technique
  significance levels in genome scans, 475-484
  study design, see Study design
  variance component detection methods, 151-178
    linkage analysis, 154-159, 172-178
      affected sibpair analysis, 173, 178
      alternative test statistic, 158-159
      ascertainment correction and effect, 176-178
      likelihood ratio statistic, 157-158
      lod score, 157-158
      maximum likelihood estimation, 156-157
      phenotype modeling, 154-156
      power, 175-176
      quantitative trait variance component, 152, 174-175
    nonnormality effects, 159-172
      alternative robust test statistics, 163-166
      covariance matrix test, 163-164
      finite mixtures test, 160-161, 168-169
      genotypic variation tests, 160-161
      kurtosis-type I error relationship, 161-163
      likelihood tests, 164-166
      model misspecification type I errors, 167-172
      multivariate t distribution, 166
      quantitative trait loci accuracy estimation, 166-167
      score tests, 164
      Wald tests, 164
      χ² distribution, 169-170
    overview, 152-154, 178
  whole-genome analysis, see Whole-genome analysis
Cost-benefit analysis
  association versus linkage analysis, 213-220
    methods, 215-217
    overview, 213-215, 218-220
    results, 217-218
  EDAC design, 451-453
  genetic dissection, 23
  meta-analysis, 453-454
  two-stage design, 453
  whole-genome scans, 83-85
Cross-breeding studies, see Inbred model organism crosses
D
DESPAIR, two-stage scanning, 468
Discordant relative pairs, affected sibpair analysis, 464-467
Disease, see also Association studies; Case-control studies; specific diseases
  future research directions, 535-536, 541-542
  inheritance, known mode exploitation using lod score method, 129
  liability, 517-532
    cellular mechanisms, 529-530
    evolutionary implications, 531-532
    functional significance at the molecular level, 525-528
    organism level mechanisms, 529-530
    overview, 517-521
    statistical inference, 521-525
      angiotensinogen variants linkage and association, 522
      haplotype studies, 524
      power, 522-524
      replication, 522-524
  phenotype definition, 69-70, 74-75, 247-248
  quantitative traits, 71-73
  significance levels in genome scans, 475-484, 525-528
Disequilibrium analysis
  allelic associations, 415-423
    case-control heterogeneity studies
      genetic background, 201-203
      haplotype analysis, 200-201
    single locus, 415-417
    size effects, 428-430
    structure effects, 428-430
    two loci, 417-423
      D expectation in a finite population, 418-419
      D variance, 419-423
      p variance, 419-423
  artificial neural networks, 294-295
  challenges and issues, 58-63
    genetic heterogeneity, 62-63, 203
    study design, 58-62
    type I and II errors, 58
  complex traits, 21
  contemporary approaches, 56-58
  future research directions, 538-540
  historical perspectives, 47-48
  human population structure, 431-434
  identity-by-descent associations, 415-417
  linkage analysis compared
    costs, 213-220
      methods, 215-217
      overview, 213-215, 218-220
      results, 217-218
    methods, 21, 49-53, 215-217
  localization determinant interactions, 393-409
    analysis, 405-407
    missing covariate data treatment, 407-408
    overview, 393-394, 408-409
    penetrance models, 394-398
    segregation analysis, 396-398
  meta-analysis, 267
  multiple-linked loci, 198-200
  overview, 45-53, 413-415
  power assessment, 205
  scanning theory, see Scanning technique
  significance levels in genome scans, 475-484
  single locus associations, 415-417
  structural relationships, 183-190
    overview, 183-185, 190
    SEGPATH models, 185-190
      multilocus linkage model, 188-189
      sibship linkage model, 186-188
      unique features, 189-190
  θ estimation from observed associations, 423-428
  transmission disequilibrium tests, 223-239
    genetic and environmental risk model, 226-227
    genotype-environment interactions, 228-229, 236-238, 405-407
    logistic regression model, 227
    overview, 224-226, 238-239
    simulation, 229-238
      logistic regression, 229-234
      MZ twins, 236-238
      other sibling inclusion, 236
      trio data analysis, 229-236
  two loci associations, 417-423
    D expectation in a finite population, 418-419
    D variance, 419-423
    p variance, 419-423
  unified model, 130
E
Environment, see Association studies; Case-control studies; Transmission disequilibrium tests
Epidemiological studies, see Case-control studies; Study design
Error types
  challenges and issues, 58
  false positives and negatives in genome scans, 487-497
    association studies, 58, 492-493
    error detection, 88-90
    false negative control, 490-492
    linkage analysis, 492-493
    lod score analysis method, 130-131
    multiple testing, 489-490
    overview, 487-488, 496-497
    trade-offs, 493-496
    two-stage designs, 492-493
  variance detection nonnormality effects, 161-163, 167-172
    kurtosis relationship, 161-163
    meta-analysis, 256
    model misspecification, 167-172
      examples, 170-172
      finite mixture distribution, 168-169
      χ² distribution, 169-170
Essential hypertension, liability genetics and mechanisms, 517-532
  cellular mechanisms, 529-530
  evolutionary implications, 531-532
  functional significance at the molecular level, 525-528
  organism level mechanisms, 529-530
  overview, 517-521
  statistical inference, 521-525
    angiotensinogen variants linkage and association, 522
    haplotype studies, 524
Subject Index
    power, 522-524
    replication, 522-524
Evolution, disease liability, 531-532
Experimental design, see Study design
Extended pedigrees
  artificial neural network analysis, 295
  familial resemblance heritability estimation, 40
  sampling issues, 443
F
False positive and negative errors, see Type I and II errors
Familial resemblance
  extended linkage analysis
    artificial neural networks, 295
    multifactorial models, 40
    sampling optimization, 443
  genetic effects compared in complex traits, 17-19
  heritability, 35-43
    multifactorial models, 38-42
      adoption, 40-41
      extended pedigrees, 40
      extensions, 41
      twins, 40
    overview, 36-37, 42-43
    study design, 38-42
      affecting factors, 41-42
      correlations, 39-40
      hypothesis testing, 39-40
      nuclear families, 39-40
G
Generalized estimation equation, description, 236
Genetic dissection, complex traits, 13-31
  genetic effects existence, 16-19
    familial environment compared, 17-19
    phenotypic variation causes, 17
  lumping and splitting strategies, 15-16, 26-29
    classification and regression trees, 15, 28-29, 281-284
    context-dependency, 28
    description, 15-16
    meta-analysis, 27, 257-258
    multivariate analysis, 27-28
  model-based versus model-free methods, 16, 374-380
  overview, 13-16, 29-31
  study design, 19-26
    analysis methods, 23-25
    cost-benefit analysis, 23
    genotyping issues, 20-21
    linkage versus association, 21
    one-stage versus two-stage designs, 21-22, 540-541
    phenotype definition and refinement, 19
    power, 22-23, 205, 447-451
    results interpretation, 25-26
    sample size, 22-23
    sampling methods, 20
Genetic heritability, see Heritability
Genetic heterogeneity, see Allelic associations; Heterogeneity
Genetic markers, see Markers
Genetic stratification, case-control study design, 194-195
Genetic studies, see Case-control studies; Study design; specific types
Genetic traits, see Complex traits; specific traits
Genome analysis, see Genome partitioning analysis; Scanning technique; Whole-genome analysis
Genome partitioning analysis, 299-319
  human population-based studies, family data analysis, 307-310
  inbred model organism crosses, 310-314
    extended multipoint identity-by-descent variance component modeling, 311-312
    regression modeling, 311-312
  multipoint identity-by-descent variance component model
    allelic interactions, 317
    calculation uncertainty, 318
    estimation, 304
    framework, 302-304
    hypothesis testing, 304
    inbred model organism crosses, 312-314
    locus interactions, 317
    methods, 301-304
    multiple phenotype analysis, 318
    pleiotropy analysis, 318
  overview, 300-301, 318-319
  recursive partitioning, heterogeneity classification model, 277-284
    linkage trees, 15, 28-29, 281-284
    purity, 278-281
    splitting rules, 278-281
Genome scans, see Scanning technique; Short tandem repeat polymorphism scans
Genotyping
  complex traits, 20-21, 160-161, 380-387
  environmental interactions, 228-229, 236-238, 405-407
  study design optimization, 442
  whole-genome scans
    future research directions, 90-93
      marker density, 93
      marker types, 90-93
    genotype misclassification allowance, 365-374
      complex-valued recombination fractions, 365-367
      consequences, 367-370
      local recombination perturbation minimization, 370-374
    historical perspectives, 78-79
    locus mapping with known genotypes, 353-362
      association analysis, 356-362
      future research directions, 536-537, 541-542
      linkage analysis, 353-356
      linkage disequilibrium analysis, 356-362
    locus mapping with uncertain genotypes, 362-365
    overview, 78, 94, 352-353, 387
    present concepts and methods, 79-90
      costs, 83-85
      error detection, 88-90
      genotyping quality, 81-83
      limitations, 85-88
      marker screening sets, 80
H
Haplotype analysis, see Disequilibrium analysis
Haseman-Elston test
  identical-by-descent proportions, 262
  lod score pooling, 260
  quantitative trait analysis, 138-141, 249, 281
Heritability, see also Phenotype; Whole-genome analysis
  disease inheritance known mode exploitation using lod score method, 129
  kurtosis-type I error relationship, 161-163
  familial resemblance, 35-43
    multifactorial models, 38-42
      adoption, 40-41
      extended pedigrees, 40
      extensions, 41
      twins, 40
    overview, 36-37, 42-43
    study design, 38-42
      affecting factors, 41-42
      correlations, 39-40
      hypothesis testing, 39-40
      nuclear families, 39-40
Heterogeneity, see also Complex traits
  association challenges and issues, 62-63, 203
  case-control study design
    genetic background, 201-203
    haplotype analysis, 200-201
  classification methods, 273-285
    challenges, 274-277
    overview, 273-274, 284-285
    recursive partitioning models, 277-284
      linkage trees, 15, 28-29, 281-284
      purity, 278-281
      splitting rules, 278-281
  future research directions, 535-536
  meta-analysis for model-free methods, 266-267
  two-stage global search, 467-468
Human population-based studies
  association study structure, 431-434
  heritability, see Heritability
  heterogeneity, see Heterogeneity
  whole-genome analysis
    association studies, 431-434
    family data analysis, 307-310
    individuals in populations, 314-317
    whole populations, 317
Hybrid studies, see Inbred model organism crosses
Hypertension, see Essential hypertension
Hypothesis testing
  heritability analysis, 39-40
  multipoint identity-by-descent variance component model, 304
  scanning technique in complex trait mapping, 386-387
I
Identical-by-descent
  marker identification estimation, 136-138
  meta-analysis, 261-263
  multipoint variance component model, genome partitioning and whole-genome analysis
    allelic interactions, 317
    calculation uncertainty, 318
    estimation, 304
    framework, 302-304
    hypothesis testing, 304
    inbred model organism crosses, 312-314
    locus interactions, 317
    methods, 301-304
    multiple phenotype analysis, 318
    pleiotropy analysis, 318
Inbred model organism crosses
  coefficient of inbreeding, 415-417
  genome partitioning and whole-genome analysis, 310-314
    extended multipoint identity-by-descent variance component modeling, 311-312
    regression modeling, 311-312
K
Kurtosis, type I error relationship, 161-163
L
Linkage analysis
  artificial neural networks, 295
  association compared
    costs, 213-220
    methods, 215-217
    overview, 213-215, 218-220
    results, 217-218
  methods, 21, 49-53, 215-217
  challenges and issues, 58-63
    genetic heterogeneity, 62-63, 203
    study design, 58-62
    type I and II errors, 58
  complex traits
    lod score method, 100, 105-106, 110, 127
    variance component detection methods, 154-159, 172-178
      affected sibpair analysis, 173, 178
      alternative test statistic, 158-159
      ascertainment correction and effect, 176-178
      likelihood ratio statistic, 157-158
      lod score, 157-158
      maximum likelihood estimation, 156-157
      overview, 152-153
      phenotype modeling, 154-156, 331-333
      power, 175-176
      quantitative trait variance component, 152, 174-175
  contemporary approaches, 56-58
  false positives and negatives, 487-497
    description, 492-493
    false negative control, 490-492
    multiple testing, 489-490
    overview, 487-488, 496-497
    trade-offs, 493-496
    two-stage designs, 492-493
  future research directions, 535-542
  genome scanning, see Scanning technique
  haplotype analysis, 198-200
  historical perspectives, 47-48
  linkage disequilibrium, see Disequilibrium analysis
  localization determinant interactions, 393-409
    analysis, 398-405
    missing covariate data treatment, 407-408
    overview, 393-394, 408-409
    penetrance models, 394-398
    segregation analysis, 396-398
  lod score, see Lod score method
  methods, 49-53
  model-free methods
    affected sibpair methods, 108-111, 121
    computational considerations, 246
    incorrect model effects, 374-377
    marker identification estimation by descent, 136-138
    maximum lod score, 110-111
    meta-analysis, see Meta-analysis
    multiple loci, 246-247
    nonparametric methods, 244
    overview, 135-136, 146-147, 241-243, 249-250
    phenotype definition, 247-248
    pseudo-markers, 377
    qualitative traits, 141-146
    quantitative traits, 138-141, 249, 263-265
    use rationale, 244-249
  mod score, see Mod score
  overview, 45-47
  quantitative trait loci mapping, 329-331
  scanning methods, see Scanning technique
  significance levels in genome scans, 475-484
  structural relationships, 183-190
    overview, 183-185, 190
    SEGPATH models, 185-190
      multilocus linkage model, 188-189
      sibship linkage model, 186-188
      unique features, 189-190
  unified model, 130
Localization, determinant interactions, 393-409
  association studies, 405-407
  linkage analysis, 398-405
  missing covariate data treatment, 407-408
  overview, 393-394, 408-409
  penetrance models, 394-398
  segregation analysis, 396-398
Lod score method, see also Mod score
  complex traits, 100, 105, 110, 127, 157
  definition, 101
  examples, 101-104
  genetic parameter misspecification, 117-119
  likelihood ratio statistic, 157-158
  linkage detection probability, 106-108
    genome-wide significance level, 107-108
    three criterion score, 107
  meta-analysis, 104-106, 259-260
  model-free linkage methods
    affected sibpair methods, 108-111, 121
    maximum lod score, 110-111
  overview, 100, 112, 125-127
  single major locus traits, 100-101, 104-105, 110
  strengths, 129-131
    affected sibpair analysis emulation, 130
    disease inheritance known mode exploitation, 129
    error analysis, 130-131
    expected score, 129
    informativeness measurement, 129
    linkage and association unified model, 130
  weaknesses, 125, 127-129
Lumping and splitting strategies, complex trait dissection, 15-16, 26-29
  classification and regression trees, 15, 28-29, 281-284
  context-dependency, 28
  description, 15-16
  meta-analysis, 27
  multivariate analysis, 27-28
M
Markers
  identification estimation by descent, 136-138
  multivariate phenotype analysis, 333-346
  two-stage global search, 467-468
  whole-genome scan analysis
    future research directions
      marker density, 93
      marker types, 90-93
    model-based versus model-free methods, 377-380
      linkage analysis, 377
      linkage disequilibrium incorporating methods, 377-380
    multipoint identity-by-descent variance component model, 301-304, 319
    screening sets, 80
Mean statistic analysis, affected sibpair analysis, 463
Meta-analysis
  complex traits, 257-258
    lumping and splitting strategies, 27
  cost-benefit analysis, 453-454
  guidelines, 269
  issues, 265-267
    association studies, 267
    heterogeneity, 266-267
    publication bias, 266
    quality assessment, 266
  linkage effect pooling, 260-265
    combined test of linkage, 265
    common effect, 261-263
    identical-by-descent proportions, 261-263
    mixed effects model, 264-265
    quantitative synthesis, 263-265
    random effects model, 263-264
  lod score pooling, 104-106, 259-260, 270
  overview, 256-257, 267-270
  P values, 259-260
Model-based analysis, see Association studies; Linkage analysis; Segregation analysis
Model-free methods
  affected sibpair methods, 108-111, 121
  computational considerations, 246
  cost-effectiveness, 453-454
  heterogeneity analysis, 266-267
  incorrect model effects, 374-377
  marker identification estimation by descent, 136-138
  maximum lod score, 110-111
  meta-analysis, see Meta-analysis
  model-based methods compared, 16, 374-380
  multiple loci, 246-247
  nonparametric methods, 244
  overview, 135, 146, 241-243, 249-250
  phenotype definition, 247-248
  pseudo-markers, 377-380
    linkage analysis, 377
    linkage disequilibrium incorporating methods, 377-380
  qualitative traits, 141-146
  quantitative traits, 138-141, 249, 263-265
  use rationale, 244-249
Mod score, see also Lod score method
  examples, 123
  function, 119-122
    candidate gene strategy, 121-122
    genome scan by linkage analysis, 121
  genetic parameter misspecification, 117-119
  overview, 115-117, 122-123
Monte Carlo simulation, sequential analysis of whole-genome scans, 510-512
Morton, Newton
  influence on genetics, 7-9
  Morton number, 7-9
  research contributions, 545-568
  Wisconsin years, 3-5
Multipoint identity-by-descent variance component model, genome partitioning and whole-genome analysis
  allelic interactions, 317
  calculation uncertainty, 318
  estimation, 304
  framework, 302-304
  hypothesis testing, 304
  inbred model organism crosses, 312-314
  locus interactions, 317
  methods, 301-304
  multiple phenotype analysis, 318
  pleiotropy analysis, 318
Multivariate traits, see Complex traits
N
Neural networks, see Artificial neural networks
Nonparametric methods, see Model-free methods
O
One-stage scanning technique, 459-470
  complex trait dissection, 21-22
  DESPAIR program, 468
  heterogeneity effects, 467-468
  overview, 459-461, 469-470
  procedure, 461-463
P
Path analysis, linkage and association with structural relationships, 183-190
  overview, 183-185, 190
  SEGPATH models, 185-190
    multilocus linkage model, 188-189
    sibship linkage model, 186-188
    unique features, 189-190
Pedigrees, extended pedigrees
  artificial neural network analysis, 295
  familial resemblance heritability estimation, 40
  sampling issues, 443
Penetrance models, localization determinant interactions, 394-398
Phenotype, see also Heritability; Heterogeneity
  association, see Association studies
  binary phenotype, penetrance models, 394-398
  definition, 69-70, 74-75, 247-248
  endophenotypes, 71-73
  measurement error impact, 73-74
  multivariate trait analysis
    data reduction, 329
    genetic dissection
      definition and refinement, 19
      variation causes, 17
    linkage analysis, 362-364
    maximum likelihood estimation, 156-157
    models and scenarios, 325-328
    multipoint identity-by-descent variance component model, 318
    overview, 323-325, 333, 346
    quantitative trait loci mapping, 329-331, 362-365
    results, 333-346
    simulation, 154-156, 331-333
    study design optimization, 441-442
    variance component detection modeling, 154-156
    whole-genome analysis, see Whole-genome analysis
  narrowly defined disease phenotypes, 70
  quantitative traits, 71-73, 329-331, 362
Pleiotropy
  physiologic significance, 206-208
  whole-genome analysis, 318
Polymorphism scans, see Short tandem repeat polymorphism scans
Pooled data, meta-analysis for model-free methods
  data analysis, 270
  linkage effect, 260-265
    combined test of linkage, 265
    common effect, 261-263
    identical-by-descent proportions, 261-263
    mixed effects model, 264-265
    quantitative synthesis, 263-265
    random effects model, 263-264
  lod score, 104-106, 259-260
Population-based studies
  future research directions, 535-536, 541-542
  population heterogeneity, see Heterogeneity
  sampling issues
    admixture, 205-206, 446
    target populations, 445-446
  whole-genome analysis
    family data analysis, 307-310
    individuals in populations, 314-317
    whole populations, 317
Positional cloning, future research directions, 537-538
Power assessment
  case-control studies, 205
  complex trait dissection, 22-23
  disease liability, 522-524
  enhancement, 447-451
    disequilibrium combination, 448-449
    genome-wide versus gene-wide scanning, 449-451
    linkage combination, 448-449
    sibpair combination, 447-448
  study design, 22-23, 205, 447-451
  variance component detection methods, 175-176
Publication bias, meta-analysis for model-free methods, 266
Q
Quantitative traits
  linkage analysis
    common effects pooling, 263-265
      combined test of linkage, 265
      mixed effects model, 264-265
      random effects model, 263-264
    loci mapping, 329-331
    model-free methods, 138-141, 249, 263-265
    tree models, 15, 28-29, 281-284
    variance component detection methods, 152, 174-175
  loci accuracy estimation, nonnormality effects, 166-167
  multivariate analysis, 329-331
  phenotype definition, 71-73, 329-331, 362
R
Recursive partitioning, heterogeneity classification model, 277-284
  linkage trees, 15, 28-29, 281-284
  purity, 278-281
  splitting rules, 278-281
Regression and analysis of variance
  genome partitioning and whole-genome analysis, inbred model organism crosses, 311-312
  linkage analysis, variance component methods compared, 184-185
  lumping and splitting strategies, 15, 28-29, 281-284
  transmission disequilibrium tests
    logistic regression model, 227
    simulation, 229-234
Risk models, genetic and environmental risk, 226-227
Robust test statistics, 163-166
  covariance matrix, 163-164
  likelihood tests, 164-166
  multivariate t distribution, 166
  score tests, 164
  Wald tests, 164
S
Sampling
  extended linkage analysis, 443
  optimization, 443-446
    admixtures, 446
    case-controls, 444-445
    extended pedigrees, 443
    isolates, 445-446
    sibling pairs, 443-444
    sibships, 443-444
    target populations, 445-446
    unitary families, 443-444
  population-based study issues
    admixture, 205-206, 446
    target populations, 445-446
  sequential scanning analysis compared, 502-505
  size-power relationship, 447-451, 488
    association combination, 448-449
    genome-wide versus gene-wide scanning, 449-451
    linkage combination, 448-449
    sibpair combination, 447-448
  study design issues, 20, 22-23, 444-445
Scanning technique
  complex trait mapping, 380-387
    candidate genes, 386-387
    future research directions, 536-537, 541-542
    genome scanning, 121, 329-330, 383-386
    hypothesis testing, 386-387
    trait complexity, 380-383
  false positives and negatives, 487-497
    association studies, 492-493
    false negative control, 490-492
    linkage analysis, 492-493
    multiple testing, 489-490
    overview, 487-488, 496-497
    trade-offs, 493-496
    two-stage designs, 492-493
  genotype misclassification allowance, 365-374
    complex-valued recombination fractions, 365-367
    consequences, 367-370
    local recombination perturbation minimization, 370-374
  locus mapping
    influencing some phenotype, 362-365
      association analysis, 364-365
      linkage analysis, 362-364
      linkage disequilibrium analysis, 364-365
    known genotypes, 353-362
      association analysis, 356-362
      linkage analysis, 353-356
      linkage disequilibrium analysis, 356-362
    uncertain genotypes, 362-365
  model-based versus model-free methods, 16, 374-380
    incorrect model effects, 374-377
    pseudo-markers, 377-380
      linkage analysis, 377
      linkage disequilibrium incorporating methods, 377-380
  one-stage versus two-stage strategies, 459-470
    complex trait dissection, 21-22
    DESPAIR program, 468
    future research directions, 540-541
    heterogeneity effects, 467-468
    one-stage procedure, 461-463
    overview, 459-461, 469-470
    two-stage procedure
      affected sibpair analysis, 463-467
      global search design using discordant relative pairs, 464-468
      heterogeneity effects, 467-468
      incomplete marker information, 467-468
      methods, 461-463
  overview, 352-353, 387-388
  sequential methods, 499-513
    analysis versus sampling, 502-505
    examples, 508-510
    historical perspective, 505-506
    Monte Carlo simulation, 510-512
    overview, 499-502, 512-513
    sequential multiple decision procedures, 506-508
    sequential probability ratio test, 506
    significance, 502-503
  short tandem repeat polymorphism scans, see Short tandem repeat polymorphism scans
  significance levels in genome scans, 475-484
    association studies, 477, 480-482
    complex diseases
      description, 477-478
      genome-wide scans, 478-480
    linkage studies
      description, 477
      genome-wide scans, 478-480
    overview, 475-477, 482-484
SEGPATH models, linkage and association with structural relationships, 185-190
  multilocus linkage model, 188-189
  sibship linkage model, 186-188
  unique features, 189-190
Segregation analysis, localization determinant interactions, 396-398
Sequential analysis, whole-genome scans, 499-513
  analysis versus sampling, 502-505
  examples, 508-510
  historical perspective, 505-506
  Monte Carlo simulation, 510-512
  overview, 499-502, 512-513
  sequential multiple decision procedures, 506-508
  sequential probability ratio test, 506
  significance, 502-503
Short tandem repeat polymorphism scans
  scanning theory, see Scanning technique
  whole-genome scans, 77-94
    future research directions, 90-93
      marker density, 93
      marker types, 90-93
    historical perspectives, 78-79
    overview, 78, 94
    present concepts and methods, 79-90
      costs, 83-85
      error detection, 88-90
      genotyping quality, 81-83
      limitations, 85-88
      marker screening sets, 80
Sibpair analysis, see also Linkage analysis
  analysis emulation using lod scores, 130
  artificial neural networks, 293-294
  cost-effectiveness, 451-453
  emulation using lod scores, 130
  identical-by-descent proportions, 262
  lod scores compared, 108-111, 121
  meta-analysis, 262
  multivariate phenotype analysis, 333-346
  sampling issues, 443
  sibship linkage model, 186-188, 443-444
  significance levels in genome scans, 475-484
  two-stage procedure, 463-467
    discordant relative pairs, 464-467
    mean statistic analysis, 463
  variance component methods, 173, 178
Single major locus traits
  allelic associations, 415-417
  lod score method, 100-101, 104-105, 110
Splitting strategies, see Lumping and splitting strategies
Statistical analysis, see specific tests
Stratification, case-control study design, 194-195
Structural equations modeling, see Path analysis
Study design
  association analysis, see Association studies
  case-control studies, 191-210
    admixture assessment, 205-206, 446
    allelic heterogeneity
      genetic background, 201-203
      haplotype analysis, 200-201
    association strength, 205
    genetic matching, 204
    haplotype analysis
      allelic heterogeneity, 200-201
      multiple-linked loci, 198-200
    linkage disequilibrium strength, 205
    outlier detection, 204
    overview, 192-194, 209-210
    physiologic significance, 206-208
    pleiotropy, 206-208, 318
    power assessment, 205
    sampling issues, 444-445
    statistical significance assessment, 196-198
    stratification, 194-195
  complex trait dissection, 19-26
    analysis methods, 23-25
    cost-benefit analysis, 23
    genotyping issues, 20-21
    linkage versus association, 21
    one-stage versus two-stage designs, 21-22, 540-541
    phenotype definition and refinement, 19
    power, 22-23
    results interpretation, 25-26
    sample size, 22-23
    sampling methods, 20
  cost-benefit analysis
    EDAC design, 451-453
    genetic dissection, 23
    meta-analysis, 453-454
    two-stage design, 453
  future research directions, 535-536, 541-542
  heritability analysis, 38-42
    affecting factors, 41-42
    familial correlations, 39-40
    hypothesis testing, 39-40
    nuclear families, 39-40
  hypothesis testing
    heritability analysis, 39-40
    multipoint identity-by-descent variance component model, 304
    scanning technique in complex trait mapping, 386-387
  linkage analysis, see Linkage analysis
  one-stage design, two-stage design compared, 21-22, 540-541
  optimization, 439-455
    cost-benefit analysis, 451-454
      EDAC design, 451-453
      meta-analysis, 453-454
      two-stage design, 453
    genotyping issues, 442
    overview, 440-441, 454-455
    phenotype issues, 441-442
    power enhancement, 447-451
      disequilibrium combination, 448-449
      genome-wide versus gene-wide scanning, 449-451
      linkage combination, 448-449
      sibpair combination, 447-448
    sampling issues, 443-446
      admixtures, 446
      case-controls, 444-445
      extended pedigrees, 443
      isolates, 445-446
      sibling pairs, 443-444
      sibships, 443-444
      target populations, 445-446
      unitary families, 443-444
  phenotype definition
    endophenotypes, 71-73
    measurement error impact, 73-74
    narrowly defined disease phenotypes, 70
    overview, 69-70, 74-75
    quantitative traits, 71-73
  power assessment
    case-control studies, 205
    complex trait dissection, 22-23
    enhancement, 447-451
      disequilibrium combination, 448-449
      genome-wide versus gene-wide scanning, 449-451
      linkage combination, 448-449
      sibpair combination, 447-448
  two-stage design
    cost-benefit analysis, 453
    one-stage design compared, 21-22, 540-541
T
Tandem repeat polymorphism, see Short tandem repeat polymorphism scans
Transmission disequilibrium tests, 223-239
  genetic and environmental risk model, 226-227
  genotype-environment interactions, 228-229, 236-238, 405-407
  logistic regression model, 227
  overview, 224-226, 238-239
  simulation, 229-238
    logistic regression, 229-234
    MZ twins, 236-238
    other sibling inclusion, 236
    trio data analysis, 229-236
Tree models, complex trait linkage, 15, 28-29, 281-284
Twin studies
  familial resemblance heritability estimation, 40
  transmission disequilibrium tests, genotype-environment interactions in MZ twins, 236-238
Two-stage scanning technique, 459-470
  complex trait dissection, 21-22
  DESPAIR program, 468
  false positives and negatives in genome scans, 492-493
  future research directions, 540-541
  heterogeneity effects, 467-468
  one-stage technique compared, 21-22, 540-541
  overview, 459-461, 469-470
  procedure
    affected sibpair analysis, 463-467
    global search design using discordant relative pairs, 464-468
    heterogeneity effects, 467-468
    incomplete marker information, 467-468
    methods, 461-463
Type I and II errors
  challenges and issues, 58
  false positives and negatives in genome scans, 487-497
    association studies, 58, 492-493
    error detection, 88-90
    false negative control, 490-492
    linkage analysis, 492-493
    lod score analysis method, 130-131
    multiple testing, 489-490
    overview, 487-488, 496-497
    trade-offs, 493-496
    two-stage designs, 492-493
  variance detection nonnormality effects, 161-163, 167-172
    kurtosis relationship, 161-163
    meta-analysis, 256
    model misspecification, 167-172
      examples, 170-172
      finite mixture distribution, 168-169
      χ² distribution, 169-170
V
Variance component methods, 151-178
  linkage analysis, 154-159, 172-178
    affected sibpair analysis, 173, 178
    alternative test statistic, 158-159
    ascertainment correction and effect, 176-178
    likelihood ratio statistic, 157-158
    lod score, 157-158
    maximum likelihood estimation, 156-157
    phenotype modeling, 154-156, 331-333
    power, 175-176
    quantitative trait variance component, 152, 174-175
    regression and analysis of variance compared, 184-185
  nonnormality effects, 159-172
    alternative robust test statistics, 163-166
      covariance matrix, 163-164
      likelihood tests, 164-166
      multivariate t distribution, 166
      score tests, 164
      Wald tests, 164
    finite mixtures, 160-161
    genotypic variation, 160-161
    kurtosis-type I error relationship, 161-163
    model misspecification type I errors, 167-172
      examples, 170-172
      finite mixture distribution, 168-169
      χ² distribution, 169-170
    quantitative trait loci accuracy estimation, 166-167
  overview, 152-154, 178
W
Wald test, variance component detection, 164
Whole-genome analysis, 299-319
  false positive and negative errors, see Error types
  future research directions, 535-536, 541-542
  genotyping scans, 77-94
    future research directions, 90-93
      marker density, 93
      marker types, 90-93
    historical perspectives, 78-79
    overview, 78, 94
    present concepts and methods, 79-90
      costs, 83-85
      error detection, 88-90
      genotyping quality, 81-83
      limitations, 85-88
      marker screening sets, 80
    scanning theory, see Scanning technique
  human population-based studies
    association studies, 431-434
    family data analysis, 307-310
    individuals in populations, 314-317
    whole populations, 317
  inbred model organism crosses, 310-314
    extended multipoint identity-by-descent variance component modeling, 311-312
    regression modeling, 311-312
  multipoint identity-by-descent variance component model
    allelic interactions, 317
    calculation uncertainty, 318
    estimation, 304
    framework, 302-304
    hypothesis testing, 304
    inbred model organism crosses, 312-314
    locus interactions, 317
    methods, 305-307
    multiple phenotype analysis, 318
    pleiotropy analysis, 318
  overview, 300-301, 318-319
  sequential methods, 499-513
    analysis versus sampling, 502-505
    examples, 508-510
    historical perspective, 505-506
    Monte Carlo simulation, 510-512
    overview, 499-502, 512-513
    sequential multiple decision procedures, 506-508
    sequential probability ratio test, 506
    significance, 502-503
X
χ² distribution, model misspecification type I errors, nonnormality effects, 169-170