about the book…

This volume examines factors that may predict the response to treatment, outcome, and survival by exploring:

• design considerations in molecular epidemiology, including:
  – case-only designs
  – family-based designs
  – approaches for evaluating genetic susceptibility to exposure and addiction
  – pharmacogenetics
  – incorporation of biomarkers in clinical trials
• measurement issues in molecular epidemiology, including:
  – DNA biosampling methods
  – principles for high-quality genotyping
  – haplotypes
  – biomarkers of exposure and effect
  – exposure assessment
• methods of statistical inference used in molecular epidemiology, including:
  – gene-gene and gene-environment interaction analysis
  – novel high-dimensional analysis approaches
  – pathway-based analysis methods
  – haplotype methods
  – dealing with race and ethnicity
  – risk models
  – reporting and interpreting results
• a specific discussion and synopsis of these methods, with concrete examples drawn from primary research in cancer

Covering design considerations, measurement issues, and methods of statistical inference, and filled with scientific tables, equations, and pictures, Molecular Epidemiology: Applications in Cancer and Other Human Diseases presents a solid, single-source foundation for conducting and interpreting molecular epidemiological studies.

about the editors...

TIMOTHY R. REBBECK is Professor of Epidemiology, Director of the Center for Genetics and Complex Traits, and Associate Director for Population Science of the Abramson Cancer Center at the University of Pennsylvania in Philadelphia. Dr. Rebbeck received his Ph.D. from the University of Michigan, Ann Arbor, and has written over 150 peer-reviewed articles.

CHRISTINE B. AMBROSONE is Chair of the Department of Cancer Prevention and Control, Roswell Park Cancer Institute, Buffalo, New York. Dr. Ambrosone received her Ph.D. from the Roswell Park Cancer Institute, State University of New York at Buffalo, and has written over 115 peer-reviewed articles.

PETER G. SHIELDS is Professor of Medicine and Oncology, Interim Academic Chair of the Department of Medicine, and Deputy Director of the Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Washington, D.C. Dr. Shields received his M.D. from the Mount Sinai School of Medicine, New York, New York, and has written over 140 peer-reviewed articles.
Molecular Epidemiology: Applications in Cancer and Other Human Diseases
Edited by
Timothy R. Rebbeck
University of Pennsylvania Philadelphia, Pennsylvania, USA
Christine B. Ambrosone
Roswell Park Cancer Institute Buffalo, New York, USA
Peter G. Shields
Georgetown University Medical Center Washington, D.C., USA
Informa Healthcare USA, Inc.
52 Vanderbilt Avenue
New York, NY 10017

© 2008 by Informa Healthcare USA, Inc.
Informa Healthcare is an Informa business
No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1

International Standard Book Number-10: 1-4200-5291-8 (Hardcover)
International Standard Book Number-13: 978-1-4200-5291-6 (Hardcover)

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequence of their use.

No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Molecular epidemiology : applications in cancer and other human diseases / edited by Timothy R. Rebbeck, Christine B. Ambrosone, Peter G. Shields.
p. ; cm.
Includes bibliographical references and index.
ISBN-13: 978-1-4200-5291-6 (hardcover : alk. paper)
ISBN-10: 1-4200-5291-8 (hardcover : alk. paper)
1. Cancer—Epidemiology. 2. Molecular epidemiology. 3. Disease susceptibility—Genetic aspects. I. Rebbeck, Timothy R. II. Ambrosone, Christine B. III. Shields, Peter G.
[DNLM: 1. Neoplasms—epidemiology. 2. Neoplasms—genetics. 3. Epidemiology, Molecular—methods. 4. Genetic Predisposition to Disease—epidemiology. 5. Genetic Predisposition to Disease—genetics. QZ 220.1 M718 2008]
RA645.C3M62 2008
614.5'999—dc22
2008000698

For Corporate Sales and Reprint Permissions call 212-520-2700 or write to: Sales Department, 52 Vanderbilt Avenue, 16th floor, New York, NY 10017.

Visit the Informa Web site at www.informa.com and the Informa Healthcare Web site at www.informahealthcare.com
Preface
Since the first textbook dedicated to molecular epidemiology was published (1), there have been enormous advances in science and biotechnology that have been exploited by molecular epidemiologists to understand human disease. The human genome has been largely characterized. There have been significant advances in the ability to interrogate inherited and somatic genetic variability as well as other biomarkers in relation to disease risk. It has also become possible to conduct the studies needed to evaluate the complex relationships between risk factors and disease outcome. Thus, molecular epidemiology can inform the etiology, prevention, and treatment of important diseases in a variety of ways:

ETIOLOGY

A major focus of molecular epidemiology has been, and remains, the determination of disease susceptibility based on genetic variability. Interindividual variation in behavior, exposure, and response is a fundamental feature that explains why some people get cancer and others do not. We recognize that most cancers are driven by exposure, so the study of gene-environment interactions is critical to the understanding of human carcinogenesis. This research began with relatively simple studies of single genetic variants in individual genes in known metabolic pathways in relation to disease risk. Over the last decade, this area alone has exploded. Technology led to high-throughput genotyping of single nucleotide polymorphisms in candidate genes, and a formidable amount of research has been conducted to identify functional effects of known polymorphisms. At present, the entire variability across genomic regions can be estimated through the use of haplotype block tags (chap. 14), and the conduct of genomewide association studies (chap. 15) has led to discoveries of genes in pathways not previously explored in relation to disease risk. Pathway-based approaches have also evolved from these early studies and from the explosion in knowledge of underlying biology (chap. 13). Despite the large amount of research in this area, only a limited number of believable and biologically plausible genetic loci have been identified, and there is a critical need for thoughtful strategies for appropriate study designs (chaps. 1–4), data collection methods (chaps. 5–8), analysis tools (chaps. 10–16), and improved ability to make meaningful inferences (chaps. 17–18).

With these rapid advances in technology and genomics comes the increasingly difficult task of evaluating the role of the environment in disease risk in concert with genetic variability (chap. 11). While this was a challenge when studying single nucleotide
polymorphisms in candidate genes in known metabolic pathways, understanding population genomic structure (chap. 10) and haplotype effects (chap. 14), as well as gene-gene or gene-environment interactions in the context of genomewide association studies or pathways with hundreds of single nucleotide polymorphisms, is much more complex. Thus, there is a growing need for the development and thoughtful application of sophisticated analytic tools to evaluate higher-order relationships between genes and/or environments (chap. 12).

Although much of molecular epidemiology has focused on the role of genetic variability in disease risk and prognosis, the use of biomarkers of exposure and effect has always been a component of this discipline (chap. 7). In some ways, these phenotypes might be more informative for cancer risk because they represent complex genotypes and multiple genetic pathways. Past studies have mainly assessed levels of circulating biomarkers of interest, as well as DNA, protein, and hemoglobin adducts, and rigorous standardized approaches to sample collection, processing, and storage have been acknowledged as important. With growing interest in the assessment of novel biomarkers, such as those of the proteome or metabolome, there is renewed interest in the science of biorepositories and the establishment of guidelines to collect and maintain the integrity of biospecimens (chap. 5).

PREVENTION

For many common diseases, it is likely that effective treatment and cure remain long-term and challenging goals. Molecular epidemiology aims to enable more rational prevention studies by identifying high-risk individuals. Early detection methods, chemoprevention, and primary prevention might be more successful if they can focus on those most vulnerable. As a result, the development of effective strategies for interventions to prevent common disease takes on a heightened importance (chap. 3). In times of limited health care resources, targeted prevention strategies in susceptible populations (chap. 16) and the development of new biologically based markers for use in early detection and screening are critical components of disease prevention strategies (chap. 7).

DISEASE PROGRESSION, PROGNOSIS, AND TREATMENT

It is assumed that the complex genetic makeup of tumors arises from gene-environment interactions. This results in the "wiring" of tumors to be more or less aggressive and more or less resistant to treatment. Thus, many studies today are investigating whether the etiological factors in cancer risk also result in tumors with worse prognosis (chap. 4). In the coming years, it is likely that there will be more focus on tissue characterization in relation to etiology, elucidating factors that contribute to genetic and epigenetic alterations. This information will become critically important for the development of novel treatment regimens as well as biologically based personalized medicine. The concepts of molecular epidemiology have also been applied to the study of disease outcomes, including pharmacogenetics, or the role of genetic variability in treatment-related toxicities as well as disease recurrence and mortality (chap. 9). More recently, there have been efforts to provide information on the effects of modifiable factors, as well as gene-environment interactions, on disease prognosis. Approaches to studying the molecular epidemiology of disease
prognosis, and to building comprehensive models for prediction of disease outcomes, also require innovative approaches to data collection and analysis.

THE (IMMEDIATE) FUTURE

How can the topics covered in this volume contribute to the next phase of research in the molecular epidemiology of common disease? There are a number of areas that have yet to fulfill their promise and that the approaches discussed here can facilitate:

• Development and meaningful application of novel designs (chaps. 1–4) and analytical methods (chaps. 11–16) that explain complex disease etiology and outcomes.
• Understanding the heterogeneous etiology, presentation, and progression of disease to better inform disease treatment and prevention (chaps. 3, 4, 9, 16).
• Providing means of effective two-way exchange, so that information from molecular epidemiology can be translated to basic science, and observations from basic and clinical sciences can direct molecular epidemiological research (chap. 13).
• Enhancing the clinical translation of molecular epidemiological research to inform improved prevention and treatment practices, including a better understanding of how genetic and other biomarkers can be used to improve disease risk prediction, treatment, and prognosis (chaps. 9, 16–18).
Discoveries in the basic sciences are emerging rapidly and offer vast opportunities for application in human populations. This requires that today's molecular epidemiologists be well versed not only in the concepts and language of their traditional domains of study design and analysis, but also in biology, biochemistry, genomics, and bioinformatics. Because it is impossible to be an expert in all of these areas, it is also crucial to form multidisciplinary collaborative teams of researchers that together span the wide range of fields needed to address these complex problems. Novel transdisciplinary approaches that emerge from these interactions will be required. The challenge is daunting, but the opportunities to make important contributions to the understanding, prevention, and cure of common diseases are vast. Molecular epidemiology is poised to make substantive contributions to these goals.

Timothy R. Rebbeck
Christine B. Ambrosone
Peter G. Shields

REFERENCE

1. Schulte P, Perera F. Molecular Epidemiology: Principles and Practices. New York: Academic Press, 1993.
Contents

Preface  iii
Contributors  ix

1. Design Considerations in Molecular Epidemiology  1
   Montserrat García-Closas, Qing Lan, and Nathaniel Rothman
2. Family-Based Study Designs  19
   Audrey H. Schnell and John S. Witte
3. Trials and Interventions in Molecular Epidemiology  29
   James R. Marshall and Mary E. Reid
4. Molecular Epidemiological Designs for Prognosis  41
   Cornelia M. Ulrich and Christine B. Ambrosone
5. Biosampling Methods  53
   Regina M. Santella and Susan E. Hankinson
6. Principles of High-Quality Genotyping  63
   Stephen J. Chanock
7. Biomarkers of Exposure and Effect  81
   Christopher P. Wild
8. Questionnaire Assessment  99
   James R. Marshall
9. Pharmacogenetics in Cancer Chemotherapy  113
   Xifeng Wu and Jian Gu
10. Human Genetic Variation and Its Implication in Understanding "Race"/Ethnicity and Admixture  129
    Jill Barnholtz-Sloan, Indrani Halder, and Mark Shriver
11. Statistical Approaches to Studies of Gene-Gene and Gene-Environment Interactions  145
    Nilanjan Chatterjee and Bhramar Mukherjee
12. Novel Analytical Methods for Association Studies  169
    Jason H. Moore, Margaret R. Karagas, and Angeline S. Andrew
13. Pathway-Based Methods in Molecular Cancer Epidemiology  189
    Fritz F. Parl, Philip S. Crooke, David V. Conti, and Duncan C. Thomas
14. Haplotype Association Analysis  205
    Peter Kraft and Jinbo Chen
15. Genomewide Association Studies  225
    Michael B. Bracken, Andrew DeWan, and Josephine Hoh
16. Validation and Confirmation of Associations  239
    John P. A. Ioannidis
17. Models of Absolute Risk: Interpretation, Estimation, Validation, and Application  259
    Mitchell H. Gail
18. Reporting and Interpreting Results  275
    Julian Little

Index  293
Contributors
Christine B. Ambrosone  Department of Cancer Prevention and Control, Roswell Park Cancer Institute, Buffalo, New York, U.S.A.
Angeline S. Andrew  Community and Family Medicine, Dartmouth Medical School, Lebanon, New Hampshire, U.S.A.
Jill Barnholtz-Sloan  Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, Ohio, U.S.A.
Michael B. Bracken  Center for Perinatal, Pediatric, and Environmental Epidemiology, Yale University, New Haven, Connecticut, U.S.A.
Stephen J. Chanock  Laboratory of Translational Genomics, National Cancer Institute, Advanced Technology Center, Gaithersburg, Maryland, U.S.A.
Nilanjan Chatterjee  Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Department of Health and Human Services, Rockville, Maryland, U.S.A.
Jinbo Chen  Department of Biostatistics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, U.S.A.
David V. Conti  Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, California, U.S.A.
Philip S. Crooke  Department of Mathematics, Vanderbilt University, Nashville, Tennessee, U.S.A.
Andrew DeWan  Center for Perinatal, Pediatric, and Environmental Epidemiology, Yale University, New Haven, Connecticut, U.S.A.
Mitchell H. Gail  Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, U.S.A.
Montserrat García-Closas  Division of Cancer Epidemiology and Genetics, National Cancer Institute, Department of Health and Human Services, Bethesda, Maryland, U.S.A.
Jian Gu  Department of Epidemiology, The University of Texas M.D. Anderson Cancer Center, Houston, Texas, U.S.A.
Indrani Halder  Cardiovascular Behavioral Medicine Research Training Program, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, U.S.A.
Susan E. Hankinson  Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, and Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, U.S.A.
Josephine Hoh  Center for Perinatal, Pediatric, and Environmental Epidemiology, Yale University, New Haven, Connecticut, U.S.A.
John P. A. Ioannidis  Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece, and Department of Medicine, Tufts University School of Medicine, Boston, Massachusetts, U.S.A.
Margaret R. Karagas  Community and Family Medicine, Dartmouth Medical School, Lebanon, New Hampshire, U.S.A.
Peter Kraft  Departments of Epidemiology and Biostatistics, Harvard School of Public Health, Boston, Massachusetts, U.S.A.
Qing Lan  Division of Cancer Epidemiology and Genetics, National Cancer Institute, Department of Health and Human Services, Bethesda, Maryland, U.S.A.
Julian Little*  Department of Epidemiology and Community Medicine, University of Ottawa, Ottawa, Canada
James R. Marshall  Roswell Park Cancer Institute, Buffalo, New York, U.S.A.
Jason H. Moore  Departments of Genetics and Community and Family Medicine, Dartmouth Medical School, Lebanon, New Hampshire; Department of Computer Science, University of New Hampshire, Durham, New Hampshire; and Department of Computer Science, University of Vermont, Burlington, Vermont, U.S.A.
Bhramar Mukherjee  Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, U.S.A.
Fritz F. Parl  Department of Pathology, Vanderbilt University, Nashville, Tennessee, U.S.A.
Mary E. Reid  Roswell Park Cancer Institute, Buffalo, New York, U.S.A.
Nathaniel Rothman  Division of Cancer Epidemiology and Genetics, National Cancer Institute, Department of Health and Human Services, Bethesda, Maryland, U.S.A.
Regina M. Santella  Department of Environmental Health Sciences, Mailman School of Public Health, Columbia University, New York, New York, U.S.A.
Audrey H. Schnell  Department of Epidemiology and Biostatistics, University of California, San Francisco, California, U.S.A.
Mark Shriver  Departments of Anthropology and Genetics, The Pennsylvania State University, University Park, Pennsylvania, U.S.A.
Duncan C. Thomas  Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, California, U.S.A.
Cornelia M. Ulrich  Cancer Prevention Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, U.S.A.
Christopher P. Wild  Molecular Epidemiology Unit, Centre for Epidemiology and Biostatistics, Leeds Institute of Genetics, Health and Therapeutics, Faculty of Medicine and Health, University of Leeds, Leeds, U.K.
John S. Witte  Department of Epidemiology and Biostatistics, University of California, San Francisco, California, U.S.A.
Xifeng Wu  Department of Epidemiology, The University of Texas M.D. Anderson Cancer Center, Houston, Texas, U.S.A.

*Canada Research Chair in Human Genome Epidemiology.
1
Design Considerations in Molecular Epidemiology

Montserrat García-Closas, Qing Lan, and Nathaniel Rothman
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Department of Health and Human Services, Bethesda, Maryland, U.S.A.
INTRODUCTION

There is a wide range of biomarkers that can be used in population-based molecular epidemiological studies of cancer. These include biomarkers of exposure, intermediate endpoints (e.g., biomarkers of early biological effect), disease, and susceptibility (1–7) (Fig. 1). Hypothesis-driven biomarkers have been used for many years in molecular epidemiology studies of cancer (e.g., measurement of xenobiotics and endogenous carcinogens, macromolecular adducts, cytogenetic endpoints in cultured lymphocytes, DNA mutations in tumor suppressor genes, and phenotypic and genotypic measures of genetic variation in candidate genes). Perhaps the most revolutionary change in molecular epidemiology in the past several years has been the emergence of discovery technologies that can be incorporated into a variety of study designs, including genome-wide scans of common genetic variants, messenger RNA (mRNA) and microRNA expression arrays, proteomics, and metabolomics (also referred to as metabonomics) (8–14). These approaches are allowing investigators to explore biological responses to exogenous and endogenous exposures, to evaluate potential modification of those responses by variants in essentially the entire genome, and to define tumors at the chromosomal, DNA, RNA, and protein levels.

At the same time, with the incorporation of more powerful technologies into molecular epidemiology studies, there has been greater concern that the rights and confidentiality of study subjects be protected. A discussion of informed consent is outside the scope of this chapter, but we note the critical need to consider ethical issues and informed consent procedures at the outset of designing a study.

The focus of this chapter is on design considerations for epidemiological studies of cancer that use biomarkers, primarily in the context of etiological research. We first discuss the advantages and disadvantages of classical epidemiological study designs for the application of biomarkers. We then describe biospecimen collections and sample size requirements for certain types of molecular epidemiology studies.
Figure 1 A continuum of biomarker categories reflecting the carcinogenic process resulting from xenobiotic exposures. Source: From Ref. 1.
STUDY DESIGNS IN MOLECULAR EPIDEMIOLOGY

A description of the general principles of study design (15–17) is outside the scope of this chapter. Instead, we will discuss the advantages and disadvantages of classical epidemiological study designs (i.e., cross-sectional, case-control, and prospective cohort) that are particularly relevant to the collection and use of biological specimens.

Potential new biomarkers for epidemiological research continually arise from advances in the understanding of disease etiology and in molecular laboratory techniques. When a promising new biomarker emerges from the laboratory, some very basic issues, such as assay accuracy and reliability, need to be assessed before considering its application in epidemiological studies (1). These initial efforts to characterize biomarkers for use in epidemiological studies have been called transitional studies by some investigators (18–21), a term that serves to heighten awareness of the critical need to characterize the determinants of biomarker levels and assays before they are used in molecular epidemiological studies with precious, nonreplenishable biological samples. In this section, we will focus on study design considerations for the use of biomarkers that have already been characterized.

Cross-Sectional Studies with Biomarker Endpoints

Cross-sectional studies are used when there is interest in studying the relationship between particular exposures or demographic characteristics and a biomarker, which is treated as the outcome variable; they are generally carried out in healthy subjects. Biomarkers of exposure and intermediate endpoints can be measured at one or several points in time, depending on the temporal variability of the exposure and intraindividual variation in the response. The standard design is to have one group of "exposed" study subjects and a comparably sized group of "unexposed" subjects, drawn from the same base population and often matched on several factors, such as age, sex, and tobacco use, to improve efficiency. When biomarker endpoints have relatively short half-lives and a population can be studied before exposure begins, an alternative design can be used, in which subjects are sampled before exposure begins and again after an appropriate length of time, as in the sketch below.
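To make the two sampling schemes concrete, below is a minimal sketch, in Python with simulated placeholder data, of how a biomarker endpoint might be compared under each design; the effect sizes and the log-scale analysis are illustrative assumptions, not prescriptions from this chapter.

```python
# Sketch: comparing a biomarker endpoint under the two cross-sectional
# designs described above. All data here are simulated placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Design A: exposed vs. unexposed groups drawn from the same base population.
exposed = rng.lognormal(mean=1.2, sigma=0.4, size=50)    # e.g., adduct levels
unexposed = rng.lognormal(mean=1.0, sigma=0.4, size=50)

# Biomarker distributions are often right-skewed, so compare on the log scale
# (or use a rank-based test such as Mann-Whitney).
t, p = stats.ttest_ind(np.log(exposed), np.log(unexposed))
print(f"two-group design: t = {t:.2f}, p = {p:.3f}")

# Design B: the same subjects sampled before exposure begins and again after,
# appropriate when the endpoint has a short half-life.
pre = rng.lognormal(mean=1.0, sigma=0.4, size=30)
post = pre * rng.lognormal(mean=0.15, sigma=0.2, size=30)  # within-person change
t, p = stats.ttest_rel(np.log(post), np.log(pre))
print(f"pre/post design: paired t = {t:.2f}, p = {p:.3f}")
```

The pre/post design uses each subject as his or her own control, which removes stable between-person variability from the comparison.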
Design Considerations in Molecular Epidemiology
3
Cross-sectional studies generally focus on biomarkers of exposure and intermediate endpoints. This design is often used to determine whether a study population has been exposed to a particular compound, the level of exposure, and the determinants of the exposure (22,23), and it is sometimes used to validate approaches to measuring external exposure (e.g., questionnaires, environmental monitoring). Biomarkers of exposure, discussed in chapter 7, measure internal exposure levels of exogenous or endogenously produced compounds in either tissues or body fluids. A wide range of exposures can be measured biologically, including environmental factors, nutrients, infectious agents, and endogenous compounds.

Cross-sectional studies can also be used to evaluate intermediate endpoints from exposures in the diet, general environment, and workplace, as well as from lifestyle factors such as obesity and reproductive status. This design can provide mechanistic insight into well-established exposure-disease relationships and can supplement suggestive but inconclusive evidence of the carcinogenicity of an exposure (24). As such, these studies complement classic epidemiological studies that use cancer endpoints. In addition, intermediate biomarkers can provide initial clues about the carcinogenic potential of new exposures years before cancer develops (1,6,25–27).

One group of intermediate biomarkers, biomarkers of early biological effect (1,28) (Fig. 1), generally measures early, nonclonal, and generally nonpersistent biological changes. Examples of early biological effect biomarkers include measures of cellular toxicity; chromosomal alterations; DNA, RNA, and protein expression; and early nonneoplastic alterations in cell function (e.g., altered DNA repair, altered immune function). Generally, early biological effect markers are measured in substances such as blood and blood components (e.g., red blood cells, white blood cells, DNA, RNA, plasma, sera) because they are easily accessible and because in some instances it is reasonable to assume that they can serve as surrogates for other organs. Early biological effect markers can also be measured in other accessible tissues such as skin, cervical and colon biopsies, epithelial cells from surface tissue scrapings or sputum samples, exfoliated urothelial cells in urine, colonic cells in feces, and epithelial cells in breast nipple aspirates. Other early effect markers include measures of circulating biologically active compounds in plasma that may have epigenetic effects on cancer development (e.g., hormones, growth factors, cytokines).

Cross-sectional studies can also be used to extensively evaluate the genetic determinants of a biomarker endpoint. Traditionally, the candidate gene approach has been employed, in which functional or putatively functional variants in biologically relevant genes are analyzed to determine how they influence biomarker levels (22,23,29). With the advent of genome-wide scanning technology, a new generation of studies is being launched that will agnostically evaluate a large number of genetic variants for their potential influence on biomarker endpoints. These include classic genotoxicity, cytogenetic, hematological, and immunological biomarkers; a new generation of assays that include measures of genomic stability and epigenetic alterations, such as telomere length and global methylation status (30–32); and biomarkers identified by the discovery technologies described earlier.

A distinct advantage of the cross-sectional study is that very detailed and accurate information can be collected on current as well as past exposure patterns (23,33) and on potential confounders and effect modifiers.
Further, as the sample size in these studies can typically be much smaller than in case-control or prospective cohort studies, it is feasible to invest substantial resources in very extensive processing of biological samples, often beyond what resources allow in a larger study. This also enables the evaluation of new technologies that require biological samples to be collected and processed in very precise and intensive ways (23,33). At the same time, an important caveat in these studies is that it is often unknown whether the intermediate biomarker under study is predictive of developing cancer (25).
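As a concrete illustration of the candidate gene analyses described above, a biomarker endpoint can be regressed on genotype (coded additively) with covariate adjustment. The following is a minimal sketch with simulated data; the variable names, effect sizes, and choice of statsmodels are illustrative assumptions, not this chapter's protocol.

```python
# Sketch: evaluating genetic determinants of a biomarker endpoint in a
# cross-sectional study. Data and effect sizes are simulated placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({
    "genotype": rng.binomial(2, 0.3, n),   # 0/1/2 copies of the variant allele
    "exposed": rng.binomial(1, 0.5, n),    # e.g., current smoking
    "age": rng.normal(55, 10, n),
})
# Simulated log-biomarker with a small additive genotype effect.
df["log_marker"] = (0.2 * df["genotype"] + 0.5 * df["exposed"]
                    + 0.01 * df["age"] + rng.normal(0, 1, n))

# Additive genetic model, adjusted for exposure and age; a genotype-by-exposure
# product term could be added to probe effect modification.
fit = smf.ols("log_marker ~ genotype + exposed + age", data=df).fit()
print(fit.summary().tables[1])
```

The same structure extends to genome-wide analyses, where the regression is repeated across many variants with appropriate multiple-testing control.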
4
Garcı´a-Closas et al.
As such, it is important to interpret results from these study designs cautiously, as a particular exposure may cause measurable biological perturbations that are of uncertain relevance.

Case-Control Studies

In contrast to cross-sectional studies, where biomarkers are the outcome variables, in case-control and prospective cohort studies the risk of disease is the outcome of interest. In case-control studies, risk factors, measured by questionnaire, medical record abstraction, external databases, biomarkers, etc., are compared between subjects with and without a particular disease. This design allows efficient enrollment of large numbers of cancer cases in relatively short periods of time. This is of particular importance for the study of uncommon tumors that occur in small numbers in prospective cohort studies.

Case-control studies can be hospital- or population-based, depending on how the cases and controls are identified (Table 1). Population-based studies attempt to identify all cases occurring in a predefined population during a specified period of time, and controls are a random sample of the source population from which the cases arise. On the other hand, cases and controls in hospital-based studies are identified among subjects admitted to, or seen in clinics associated with, specific hospitals. As in the population-based design, the distribution of exposures in the control group should represent that of the source population of the cases. However, the source population is often more difficult to define in hospital-based studies. Molecular epidemiology studies often use the hospital-based case-control design because the hospital setting facilitates the enrollment of subjects as well as the collection and processing of biological specimens. Enrollment of subjects is also facilitated by in-person contact with study participants by doctors, nurses, or interviewers, which usually results in higher participation rates (34). Because study subjects are generally less geographically dispersed than those in population-based or cohort studies, rapid shipment of specimens to central laboratories for more elaborate processing protocols, such as cryopreservation of lymphocytes, is facilitated. Rapid ascertainment of cases through the hospitals also facilitates the collection of specimens from cases before treatment, thus avoiding the potential influence of treatment on some biomarker measurements.

Exposure Assessment in Case-Control Studies

Exposure assessment through questionnaires in case-control studies of a single disease, or of multiple diseases sharing risk factors (e.g., breast, ovarian, and endometrial cancer), can be more detailed and focused than in prospective cohort studies, which often examine multiple, unrelated diseases. However, exposure information and biological specimens are collected after diagnosis of the disease, and sometimes after treatment, and therefore are vulnerable to exposure measurement error or misclassification that is differential between cases and controls. Differential errors or recall bias from questionnaire information collected in case-control studies, although of concern, have been proven for only a few exposures. Similarly, the influence of the disease process or treatment on a biomarker of interest is often raised as a concern, but rarely proven.
Differences in biomarker levels among cases diagnosed at different stages of disease can help evaluate whether differences in biomarker levels between cases and controls reflect an influence of the disease on the biomarker rather than the contrary. The applicability of exposure biomarkers in case-control studies depends on certain intrinsic features of the marker itself (e.g., half-life, variability, specificity) and on the exposure pattern.
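The consequences of the differential misclassification discussed above can be made concrete with a small simulation. The sketch below assumes an arbitrary true odds ratio and illustrative reporting sensitivities and specificity; none of these values come from this chapter.

```python
# Sketch: how differential exposure misclassification distorts a case-control
# odds ratio. All inputs are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n_cases = n_controls = 2000
p_exp_controls = 0.30
true_or = 2.0
# Exposure prevalence among cases implied by the true odds ratio.
odds = true_or * p_exp_controls / (1 - p_exp_controls)
p_exp_cases = odds / (1 + odds)

def observed_or(sens_cases, sens_controls, spec=0.95):
    """Odds ratio after applying imperfect sensitivity/specificity to reports."""
    exp_ca = rng.binomial(1, p_exp_cases, n_cases)
    exp_co = rng.binomial(1, p_exp_controls, n_controls)
    rep_ca = np.where(exp_ca == 1, rng.binomial(1, sens_cases, n_cases),
                      rng.binomial(1, 1 - spec, n_cases))
    rep_co = np.where(exp_co == 1, rng.binomial(1, sens_controls, n_controls),
                      rng.binomial(1, 1 - spec, n_controls))
    a, b = rep_ca.sum(), n_cases - rep_ca.sum()
    c, d = rep_co.sum(), n_controls - rep_co.sum()
    return (a * d) / (b * c)

print("nondifferential:", round(observed_or(0.8, 0.8), 2))  # biased toward 1
print("differential   :", round(observed_or(0.9, 0.7), 2))  # cases recall better
```

With equal error in both groups the estimate is pulled toward the null; when cases report exposure more completely than controls, the estimate is inflated above the true value.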
Table 1  Comparison of Advantages and Limitations Relevant to the Collection of Biological Specimens and Data Interpretation in Different Molecular Epidemiology Study Designs

Hospital-based case-control
  Advantages:
  - Facilitates intense collection and processing of specimens (e.g., freshly frozen tumor samples, cryopreserved lymphocytes).
  - Participation rates for biological collections might be enhanced.
  - Facilitates follow-up of cases for treatment response, recurrence, and survival.
  Limitations:
  - More prone to selection and differential misclassification biases than other designs.
  - Some biomarkers and responses to certain types of questions might be affected by the disease process or the stay at the hospital.

Population-based case-control
  Advantages:
  - Less subject to biases (e.g., selection, exposure misclassification) than hospital-based studies.
  Limitations:
  - Some biomarkers and responses to certain types of questions might be affected by the disease process.
  - May be harder to obtain high participation rates for biological collections than in hospital-based designs.
  - Implementation of intense, specialized blood and tumor collection and processing protocols can be challenging.
  - May be more difficult to carry out response-to-treatment and survival studies if cases are treated at many hospitals and clinics.

Prospective cohort
  Advantages:
  - Allows study of multiple disease endpoints.
  - Allows study of transient biomarkers and biomarkers affected by the disease process.
  - Selection bias and differential misclassification are avoided.
  - Nondifferential misclassification may be reduced for some exposures.
  - Nested case-control or case-cohort studies can be used to improve efficiency of the design.
  Limitations:
  - Implementation of intense, specialized collection and processing protocols for the entire cohort can be logistically challenging and overly costly.
  - Obtaining tissue samples and following up cases for treatment response and survival can be challenging.
  - Unless repeat biomarkers and questionnaires are collected, risk factor data may not reflect a relevant time period.
  - Loss to follow-up can cause a potential bias.

Source: From Ref. 77.
The first prerequisite for the successful application of an exposure marker is that the assay is reliable and accurate, the marker is detectable in human populations, and important effect modifiers (e.g., nutritional and demographic variables) and kinetics are known (24). Second, the timing of sample collection in combination with the biological half-life of a biomarker of exposure is key, as this determines the exposure time window that the marker reflects. The time of collection may be critical if the exposure is of brief duration, is highly variable in time, or has a distinct exposure pattern (e.g., diurnal variation for certain endogenous markers such as hormones). Chronic, near-constant exposures pose fewer problems. Ideally, the biomarker should persist over time and, in case-control studies, not be affected by disease status. However, most biomarkers of internal dose provide information about recent exposures (hours to days), with the exception of markers such as persistent pesticides, dioxins, polychlorinated biphenyls, certain metals, and serological markers related to infectious agents, which can reflect exposures received many years before. If the pattern of exposure being measured is relatively continuous, short-term markers may be applicable in case-control studies of patients with early disease, so that disease bias would be less likely. In general, however, short-term markers have limited use in case-control studies, as they are less likely to reflect usual exposure patterns, and the disease or its treatment might influence the marker's absorption, metabolism, storage, or excretion.

Biomarkers of Susceptibility in Case-Control Studies

The approaches to studying genetic susceptibility factors for cancer have evolved very quickly over the last several years, owing to advances in genotyping technologies, substantial reductions in genotyping costs, and improvements in the annotation of common genetic variation, namely the most common type of variant, the single nucleotide polymorphism (SNP). The principles of, and quality control approaches for, the use of genetic markers in epidemiological studies are described in chapter 6 of this book. Because inherited genetic markers measured at the DNA level are stable over time, the timing of measurement relative to disease diagnosis is irrelevant. In addition, it is highly likely that most genetic markers are not related to factors influencing the likelihood of participation in a study, and therefore selection bias in case-control studies is less of a concern for studying the main effects of genetic risk factors. Indeed, the robustness of genetic associations with disease across study designs has been demonstrated by findings from consortia of studies that have shown remarkably consistent estimates of relative risk across studies of different designs (35,36). Because genetic markers might influence disease progression, incomplete ascertainment of cases in case-control studies can introduce survival bias, particularly for cancers associated with high morbidity and mortality rates, such as pancreatic and ovarian cancers. This is a particular concern for population-based studies, unless a very rapid ascertainment system is put in place that enrolls cases as close as possible to the time of diagnosis.

Susceptibility biomarkers can also be measured at the functional/phenotypic level [e.g., metabolic phenotypes, DNA repair capacity (DRC)] (7).
While phenotypic measures are likely to be closer to the disease process and can integrate multiple genetic and posttranscriptional influences on protein expression and function, genotypic measures are considerably easier to study, since they are stable over time and much less prone to measurement error (37). Thus, from a logistic point of view, genotype assays are usually preferred to phenotype assays. However, when complex combinations of genetic variants and/or important posttranscriptional events determine a substantial portion of the interindividual variation in a particular biological process, phenotypic assays may be the only means to capture important variation in the population.
For example, a number of studies have assessed the role of DRC in cancer risk by using in vitro phenotypic assays, mostly on circulating lymphocytes (e.g., mutagen sensitivity, host cell reactivation assays). These studies have shown differences in DRC between cases and controls; however, interpretation of these results needs to take into account study design limitations, such as the use of lymphocytes to infer DRC in target tissues, the possible impact of disease status on assay results, and confounding by unmeasured risk factors that influence the assay (38–40). The application of functional assays in multiple, large-scale epidemiological studies will require the development of less costly and labor-intensive assays. In the future, assays that assess nonclonal mutations in DNA isolated from circulating white blood cells may capture some of the same information as the above functional assays and have wider application because of greater logistic ease.

Molecular Classification of Tumors in Case-Control Studies

Molecular characterization of tumor samples in epidemiological studies at the DNA, mRNA, microRNA, chromosomal, or protein level permits the analysis of genetic and environmental risk factors and clinical outcomes by biologically important tumor subtypes. These analyses can lead to improvements in risk assessment by identifying tumors with distinct risk profiles. In addition, identifying classes of tumors of different etiology can help in understanding the carcinogenic pathways to disease as well as in developing targeted prevention programs (e.g., use of hormonal chemoprevention for women at high risk of ER-positive tumors). Review of pathology and medical records can be used to obtain information on tumor characteristics determined in clinical practice, e.g., histological type and tumor grade. However, more detailed characterization of tumors requires large collections of tissue samples (see section on "Design Considerations in Biospecimen Collection"). Screening cohorts provide special advantages for etiological investigations because specimens may be available from both cases and unaffected subjects. Unfortunately, the cervix is currently the only organ for which population-based screening typically includes pathological examination (Pap tests).

Follow-Up of Cases to Determine Clinical Outcomes

The prospective collection of clinical information from cases enrolled in case-control studies (e.g., treatment, recurrence of disease, and survival) greatly increases the value of these studies, since critical questions about the relationship between biomarkers and disease progression can be addressed in well-characterized populations (described in detail in chap. 4). Designing a survival study within a case-control study is clearly easier to do at the beginning of the case-control study rather than later, after subject enrollment is completed. Given the value that such studies have for carrying out translational research in a very efficient manner, consideration should be given to implementing this type of study whenever possible. The collection of clinical information is facilitated in hospital-based studies when cases are diagnosed in a relatively small number of hospitals and in stable populations where patients are likely to be followed up in the diagnostic hospitals or associated clinics. Survival in biomarker-defined subgroups of cases can then be summarized with standard methods, as in the sketch below.
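For illustration, survival by a tumor marker among followed-up cases can be summarized with the Kaplan-Meier estimator; the following hand-rolled sketch uses hypothetical follow-up times and marker groups (in practice a dedicated survival package and a log-rank or Cox analysis would typically be used).

```python
# Sketch: Kaplan-Meier survival for cases, stratified by a hypothetical tumor
# marker. 'time' is follow-up in months; 'event' is 1 for death/recurrence,
# 0 for censored observations.
import numpy as np

def kaplan_meier(time, event):
    """Return event times and the product-limit survival estimate at each."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    order = np.argsort(time)
    time, event = time[order], event[order]
    surv, out_t, out_s = 1.0, [], []
    for t in np.unique(time[event == 1]):
        at_risk = np.sum(time >= t)                 # still under observation
        deaths = np.sum((time == t) & (event == 1))
        surv *= 1 - deaths / at_risk
        out_t.append(t); out_s.append(surv)
    return out_t, out_s

# Hypothetical follow-up for marker-positive vs. marker-negative cases.
t_pos, e_pos = [5, 8, 12, 20, 24, 30], [1, 1, 0, 1, 1, 0]
t_neg, e_neg = [10, 18, 26, 36, 40, 48], [1, 0, 1, 0, 0, 1]
for label, (t, e) in {"marker+": (t_pos, e_pos), "marker-": (t_neg, e_neg)}.items():
    times, s = kaplan_meier(t, e)
    print(label, [f"S({ti:g})={si:.2f}" for ti, si in zip(times, s)])
```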
Information on clinical outcomes can be obtained through active follow-up of the cases where patients are contacted individually through the course of their treatment and medical follow-up or through passive follow-up by extracting information from medical records. Passive follow-up is less costly; however, it is often limited by difficulties in obtaining detailed information on treatment from medical records or by loss to follow-up
in populations where patients change cities or hospitals during clinical follow-up. Use of database resources, such as death registries in the populations where cases are diagnosed, can be helpful in determining survival for cases lost to follow-up.

Prospective Cohort Studies

In prospective cohort studies, exposure information and biological specimens are collected from healthy subjects, who are then followed up to identify those who develop disease. In fact, case-control studies can be conceptualized as a retrospective sampling of cases and controls from an underlying prospective cohort, referred to as the source population (15,17). Although establishing a cohort study is initially very costly and time consuming, in the long run it becomes more cost efficient, since it can study multiple disease endpoints and provides a well-defined population that can be easily sampled for efficiency (41). Biological specimens are collected before disease diagnosis and, ideally, before the beginning of the disease process. It is therefore the only design able to study biomarkers that are directly or indirectly affected by the disease process (42). Although cohort studies have the theoretical advantage of collecting serial biological samples over time, many large studies have been able to collect only a single biological sample at one point in time. Although this is not a concern for DNA-based assays of inherited susceptibility markers, it poses some limitations for several other categories of markers, particularly short-term exposure markers that may vary substantially from day to day. In addition, it can be difficult to evaluate the relevant time window of exposure for disease causation unless serial collections of specimens over time are available.

The advantage of prospective cohorts over case-control studies for the study of genetic associations, even though DNA-based markers are not influenced by disease, has been advocated on the basis that prospective designs are better suited for studying interactive effects of environmental and genetic exposures (43–46). In particular, prospective studies are better suited to evaluate genotype associations and interactions with biomarkers of exposure or intermediate endpoints if these biomarkers are influenced by disease status, or if measures taken close to the time of diagnosis do not reflect past events relevant to disease onset. Although cohort studies can minimize the occurrence of differential misclassification, nondifferential misclassification of exposures or biomarkers can still limit the assessment of genotype-environment and genotype-phenotype interactions. As mentioned before, case-control studies evaluating only one disease outcome, or a few related diseases, that focus on particular exposure hypotheses can obtain more detailed questionnaire information than cohort studies, thus reducing exposure misclassification. Therefore, unless cohort studies can measure exposure accurately and with repeated measures over time, they might not have clear advantages over case-control studies for the study of certain hypotheses.

Given that most members of a cohort will not develop cancer, nested case-control and, less commonly, case-cohort studies are used to improve efficiency (47). In these designs, only samples from cases and a random subset of noncases are analyzed, reducing the laboratory requirements and cost considerably.
The nested case-control design includes all cases identified in the cohort up to a particular point in time and, for each case, a random sample of subjects free of disease at the time of the case's diagnosis. Increasing the case-to-control ratio to two or three controls per case can easily increase the efficiency of nested case-control studies. A case-cohort design includes a random sample of the cohort population at the onset of the study and all cases identified in the cohort up to a particular point in time. The case-cohort design allows for the evaluation of several disease endpoints using the same comparison group (referred to as a subcohort); however, since the same disease-free subjects are repeatedly used as "controls" for different disease endpoints, depletion of samples from this group can be an issue.
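The control-sampling step of the nested case-control design can be sketched as follows; the subject records and the two-controls-per-case choice are hypothetical, and in practice the resulting matched sets would be analyzed with conditional logistic regression.

```python
# Sketch: risk-set sampling of controls within a cohort for a nested
# case-control study. 'exit_time' is end of follow-up; cases have is_case=1
# and exit at their diagnosis time. All records here are hypothetical.
import random

random.seed(4)
cohort = [
    # (subject_id, exit_time, is_case)
    (1, 4.0, 1), (2, 9.5, 0), (3, 6.2, 0), (4, 8.1, 1),
    (5, 3.3, 0), (6, 9.9, 0), (7, 7.4, 0), (8, 5.6, 1),
]

def sample_risk_set(cohort, case, n_controls=2):
    """Draw controls from subjects still at risk when the case is diagnosed."""
    _, t_case, _ = case
    at_risk = [s for s in cohort
               if s is not case and s[1] >= t_case]  # still followed at t_case
    return random.sample(at_risk, min(n_controls, len(at_risk)))

for case in [s for s in cohort if s[2] == 1]:
    controls = sample_risk_set(cohort, case)
    print(f"case {case[0]} (t={case[1]}): controls {[c[0] for c in controls]}")
```

Note that a subject who later becomes a case may validly serve as a control for an earlier case, since sampling is from the risk set at each diagnosis time.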
Perhaps some historical biomarker data from a subcohort can be compared against a future series of cases with newly analyzed data (e.g., genetic biomarkers, which are now analyzed with extremely high accuracy and precision). In general, however, biomarkers should be analyzed in cases and controls, or in a comparison subcohort, at the same time, in the same laboratory, on the same platform, with the same reagents, and by the same personnel whenever feasible, to minimize assay errors that are differential between cases and controls, as well as the influence of secular trends.

Multiple prospective cohort studies are currently being followed up for cancer incidence with basic risk factor information from questionnaires and stored blood components, including white blood cells that can be used as a source of DNA. At the completion of ongoing collections, current studies will have stored DNA samples on over two million individuals (7). These studies will provide very large numbers of cases of the more common cancer sites (e.g., breast, lung, prostate, and colon) for evaluating genetic markers of susceptibility and biomarkers in serum or plasma, such as hormone levels, chemical carcinogen levels, and proteomic patterns. Most cohort studies do not have cryopreserved blood samples, since the procedure is very expensive and logistically challenging in large studies. Cohort studies also often have a limited capability to collect tumor samples on large numbers of subjects and to follow up cases for survival studies. New cohort studies based in large institutions such as health maintenance organizations (HMOs) could enable access to tumor samples and easier follow-up of cases for treatment response and survival.

Prospective cohort studies are sometimes designed within screening cohorts. In this design, screening failures lead to prevalent cases being missed among cohort participants and misclassified as controls (48). Although repeated screening reduces misclassification of subjects, cases discovered during follow-up cannot be distinguished as prevalent cases missed by the initial screening or as incident disease. However, the degree of misclassification of prevalent and incident cases can be assessed by analyses of time to diagnosis or of pathological characteristics. Intensive screening may also uncover a reservoir of latent disease that would not otherwise become clinically relevant and that might differ from disease detected through clinical symptoms (49,50).

Other Study Designs

Case-Series Design

In the so-called case-series, case-case, or case-only design, only subjects with the disease of interest, and no controls, are enrolled in the study. This design has been proposed to evaluate etiological heterogeneity using tumor markers. The degree of etiological heterogeneity is quantified by the ratio of the odds ratio for the effect of exposure on marker-positive tumors to the odds ratio for marker-negative tumors. This parameter is equivalent to the odds ratio for the association between exposure and tumor marker among the cases (51). However, case-only studies are limited to the estimation of the ratio of odds ratios and cannot be used to obtain estimates of the odds ratios for the different tumor types. It should also be noted that the odds ratio from a case-only design would underestimate the odds ratio derived in a case-control design when the exposure of interest is associated with more than one tumor type.
In addition, demonstrating that expected associations between established risk factors and a particular type of cancer are identifiable in a given study population provides reassurance about the generalizability of findings. Case-series studies in which cases can be identified and obtained through well-characterized population-based registries could overcome some of these limitations.
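Numerically, the heterogeneity parameter described above is simply the exposure-marker odds ratio computed within cases. A worked sketch with illustrative counts (not data from any study cited here):

```python
# Sketch: case-only estimate of etiological heterogeneity. Counts are
# illustrative. Among cases:
#   exposed, marker+  : 60    exposed, marker-  : 40
#   unexposed, marker+: 45    unexposed, marker-: 90
a, b, c, d = 60, 40, 45, 90
case_only_or = (a * d) / (b * c)  # exposure-marker OR within cases
print(f"ratio of odds ratios = {case_only_or:.2f}")

# With control data one could estimate the exposure OR separately for
# marker-positive and marker-negative tumors; the case-only analysis recovers
# their ratio but not the individual ORs.
controls_exposed, controls_unexposed = 300, 700   # hypothetical control counts
or_marker_pos = (a / c) / (controls_exposed / controls_unexposed)
or_marker_neg = (b / d) / (controls_exposed / controls_unexposed)
print(f"OR(marker+) = {or_marker_pos:.2f}, OR(marker-) = {or_marker_neg:.2f}, "
      f"ratio = {or_marker_pos / or_marker_neg:.2f}")
```

With these counts the within-case odds ratio equals 3.0, exactly matching the ratio of the two tumor-type-specific odds ratios computed with controls, which is the equivalence the design exploits.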
The case-only study has also been proposed as a valid design to evaluate multiplicative gene-gene (52) and gene-environment interactions (53). However, this design has important limitations: most notably, it cannot be used to obtain estimates of relative risk for disease or of additive interactions, it is susceptible to misinterpretation of the interaction parameter (54), and it is highly dependent on the assumption of independence between the exposure and the genotype under study (55). Because of these limitations, case-control designs are preferable to case-series designs when an appropriate control group can be enrolled.

Clinical Trials

Randomized clinical trials are the gold standard for the evaluation of therapeutic or preventive interventions. The key advantage over observational studies, such as case-control and prospective cohort studies, is the potential to avoid selection and confounding biases through randomization of interventions. Within the limits of chance variation, randomization ensures similar distributions of known and unknown confounding factors in the groups of patients being compared. Although clinical trials cannot be used to address etiological questions because of the lack of a control population, assessment of risk factors for disease through questionnaires and biomarkers can be valuable for studying etiological heterogeneity using the case-only analyses described above. In addition, this design is well suited to evaluate the influence of genetic and environmental risk factors for disease on disease progression and response to treatment. A potential limitation is the lack of generalizability of findings, as discussed for other case-only designs based on highly selected cohorts of patients.

Other Study Designs

Alternative study designs have been proposed to address some of the limitations of the classical epidemiological designs. For instance, two-phase sampling designs can be used to improve efficiency and reduce the cost of measuring biomarkers in large epidemiological studies (56). The first phase of this design could be a case-control or cohort study with basic exposure information and no biomarker measurements. In a second phase, more elaborate exposure information and/or determination of biomarkers (with collection of biological specimens, if these were not collected in the first phase) is carried out in an informative sample of individuals defined by disease and exposure (e.g., subjects with extreme or uncommon exposures); see the sketch at the end of this section. Multiple statistical methods, such as simple conditional likelihood (57) or estimated-score (58) methods, have been developed to analyze data from two-phase sampling designs. Another example is the use of the kin-cohort design as a more efficient alternative to case-control or cohort studies when the main aim is to estimate the age-specific penetrance of rare inherited mutations in the general population (59,60). In this design, relatives of selected individuals with genetic testing form a retrospective cohort that is followed from birth to onset of disease or censoring.
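As a sketch of the phase-two selection step in a two-phase design (subject counts, sampling fractions, and field names are all hypothetical):

```python
# Sketch: selecting a balanced phase-two subsample for expensive biomarker
# assays, stratified on disease and exposure from phase one. Hypothetical data.
import random
from collections import defaultdict

random.seed(5)
# Phase one: (subject_id, diseased, exposed) for the full study.
phase1 = [(i, random.random() < 0.2, random.random() < 0.3) for i in range(1000)]

strata = defaultdict(list)
for rec in phase1:
    strata[(rec[1], rec[2])].append(rec)

n_per_stratum = 50   # equal allocation oversamples the rare, informative strata
phase2 = {key: random.sample(members, min(n_per_stratum, len(members)))
          for key, members in strata.items()}

for key, members in sorted(phase2.items()):
    print(f"diseased={key[0]!s:5} exposed={key[1]!s:5} "
          f"stratum n={len(strata[key]):4d} sampled={len(members)}")
# The analysis must then account for the stratum-specific sampling fractions,
# e.g., via the conditional likelihood or estimated-score methods cited above.
```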
Other aspects related to biospecimens, such as informed consent, sample sources, processing and storage protocols, biobanks, and quality control considerations, are addressed in chapters 5 (“Biosampling Methods”) and 6 (“Principles of High-Quality Genotyping”).
Biological specimens in prospective cohort studies are collected before the clinical onset of disease, ensuring identical sample collection, processing, and storage conditions for samples from individuals who develop the disease and those who remain disease free. In addition, the potential effects of disease processes on biomarker measurements make the collection of specimens, particularly sequential collections over time, very valuable in prospective studies.

Biomarker measurements can be very sensitive to differences in the handling of samples, e.g., fasting status at blood collection and time between collection and processing of specimens. Therefore, to avoid or minimize spurious differences in case-control studies, it is important that samples from cases and controls are collected during the same time frame and using identical protocols. Ideally, the nursing and laboratory staff should be blinded to the case-control status of the subjects to avoid differences in collection, processing, and storage. However, because differences between cases and controls in the handling of samples cannot always be completely avoided, it is important to record key information, such as date and time of collection, processing and storage problems, time since last meal, current medication, and current tobacco and alcohol use, to be able to account for the influence of these variables at the data analysis stage. In fact, this information should be collected in all study designs. In addition, since biomarkers requiring elaborate and expensive protocols are often measured only in a subset of study participants, this information can be used to match cases and controls selected for biomarker measurements. This will ensure efficient adjustment for these extraneous factors during data analysis.

Biomarkers measured in samples collected from subjects during a hospital stay might not reflect measurements from samples collected outside the hospital, because many habits and exposures change during hospitalization, e.g., dietary habits, medication use, and physical activity. Therefore, even if cases and controls are selected through a hospital-based design, collection of specimens after the patients return home and are no longer taking medications for the conditions that brought them to the hospital should be considered, if feasible. On the other hand, specimens to measure biomarkers that are influenced by long-term effects of treatment should be collected before treatment is started at the hospital, within logistic limitations.

Blood and Buccal Specimens

Because new molecular techniques often require special processing of biological samples, it is important to design protocols that maximize opportunities to apply future assays to samples being collected in epidemiological studies. For instance, blood samples are a very valuable source of specimens that can be used for the determination of a wide range of biomarkers. Leukocytes or white blood cells (granulocytes, lymphocytes, and monocytes), erythrocytes or red blood cells, platelets, and plasma/serum can be obtained through appropriate separation of blood components. Blood samples can also be a source of viable lymphocytes for performing phenotypic assays (62–64). In spite of these advantages, the use of venipuncture to obtain blood samples in large-scale epidemiological studies has two important limitations: relatively high cost and, in some populations, relatively low acceptability.
Small amounts of DNA can be obtained from finger-prick blood spots dried on filter paper, avoiding the need for venipuncture. Advantages of blood spots include lower costs for collection, shipping, and storage (65). Epidemiological studies often need less expensive collection methods with lower levels of discomfort for the study participant to increase participation rates. Further, in some instances, methods that are suitable for self-collection, such as expectorated buccal epithelial cells (66–69), may be particularly advantageous. Although the use of venipuncture to
collect blood samples has clear advantages, alternative, less invasive methods to collect genomic DNA, at a minimum, should be considered in most etiological studies.

Urine Specimens

A wide variety of biological markers of exposure and metabolic markers can be measured in urine samples (70). Often, the collection and processing of urine samples are uncomplicated, with the sample being kept cold both to maintain the stability of the analytes and to avoid bacterial overgrowth. Generally, urine is simply aliquoted and frozen; however, for some analytes, collection and storage containers have specific requirements and preservatives may be needed. For most exposure markers, the gold standard is the 24-hour urine collection, followed, in general, by the 12-hour evening/overnight sample, the 8-hour overnight collection, the first morning voided sample, and the so-called single spot urine sample. The utility of a single spot urine sample, relative to longer, timed collections, is highly specific to the kinetics associated with the pattern of exposure and the half-life of the biomarker.

Tissue Specimens

Consideration of tissue protocols during the design of the study is critical for the successful retrieval and use of specimens. As with other types of specimens, study protocols should be designed in conjunction with experts, in this case study pathologists, who can assess options for obtaining, processing, storing, and testing specimens. Consideration of specimen labeling, storage, tracking, and shipping during the planning stages can also greatly facilitate the use of specimens in the future.

Paraffin-embedded tissue blocks are often the most accessible source of tissue, since these are routinely prepared in pathology laboratories for clinical purposes. To optimize the utility of tissue blocks in epidemiological studies, it is important to collect information on the protocols used for tissue preparation. This information includes the dates when blocks were prepared, the criteria used for tissue sampling, and the methods used for the processing and storage of blocks. These factors are relevant in the analysis and interpretation of assays performed on the tissues. For instance, for surgical specimens it is critical to know whether patients underwent presurgical chemotherapy or radiotherapy, since these therapies can lead to extensive necrosis that can affect the representativeness of the tissue specimens.

Collection of paraffin-embedded tumor samples is easier in hospital-based than in population-based studies, since the number of pathology departments where cases are diagnosed tends to be smaller. Hospitals typically discard diagnostic tissue blocks some years after the initial diagnosis, and thus retrieval of archived specimens years after the diagnosis of disease often results in low success rates. Requesting the retrieval of archived tissue blocks shortly after diagnosis increases the chances of obtaining tissues; however, these specimens usually need to be returned to the hospitals, since they might be needed for the medical care of the patients.

Tissue microarray (TMA) technology can be used to sample small tissue cores of pathological targets from paraffin blocks and transfer them systematically into one or a few recipient blocks containing multiple tissue cores (71). Sections of single TMA blocks can provide representations of hundreds of cases suitable for testing in a single batch, thereby reducing cost and interbatch variability.
Although TMAs offer opportunities to standardize immunohistochemistry (IHC) performance, many important factors that affect the reliability of IHC data still need to be addressed.
These concerns can be pronounced in multicenter studies in which tumor tissues are collected from different centers with varying tissue-processing protocols. Primary factors influencing IHC results include delays in the time to formalin fixation (72), variation in the adequacy of formalin fixation (73), improper storage of cut and unstained slides (74,75), and variable reproducibility in IHC interpretation (76). Development of markers of tissue quality and processing (77) could be a useful quality control measure to improve the interpretation and analysis of IHC information, particularly in multicenter studies.

Implementation of tissue-processing protocols not routinely performed for clinical care can be of interest in epidemiological investigations. For instance, the use of tissue fixatives that preserve RNA, or snap freezing of tissue samples, may be required to obtain high-quality RNA for gene expression arrays. The proximity to laboratory facilities and pathology departments in hospital-based studies facilitates the implementation of specialized protocols, since it allows for rapid processing of specimens.

SAMPLE SIZE CONSIDERATIONS

Sample size considerations during the design of a molecular epidemiology study are important to ensure adequate numbers to evaluate the questions of interest with sufficient statistical power. All general epidemiological principles that apply to power and sample size (16) apply to molecular epidemiology studies. For example, the main determinants of sample size requirements for a given test in a case-control study are the rate of disease in the population under study; the type and distribution of the biomarker (e.g., the frequency of a categorical biomarker such as a genotype, or the distribution of a continuous biomarker such as serum hormone levels); the magnitude of biomarker differences (e.g., measured as the odds ratio for a biomarker-disease association, or as differences in mean biomarker levels between two groups in cross-sectional studies); the desired statistical power to detect these differences; and the alpha level of the test.

Sample size considerations are critical for the design of studies of genetic associations and gene-environment interactions (78). Generally, sample size requirements are large (hundreds to thousands of subjects) because the expected effects of individual genetic variants are relatively small. Recent findings from large-scale studies that have confirmed associations between common polymorphisms and cancer risk, using candidate gene and genome-wide association approaches, have shown associations with odds ratios (ORs) of about 1.2 to 1.6. For instance, a deletion in the glutathione S-transferase M1 (GSTM1) gene and the slow acetylation genotype of the N-acetyltransferase 2 (NAT2) gene are associated, respectively, with 1.5- and 1.4-fold increases in bladder cancer risk (79,80); uncommon versus common homozygous genotypes at novel breast cancer susceptibility loci are associated with ORs ranging from 1.2 to 1.6 (36,81,82); and a polymorphism in the tumor necrosis factor gene (TNF −308G>A) was associated with a 1.6-fold increased risk of diffuse large B-cell non-Hodgkin lymphoma (83). Because genome-wide association studies, in which hundreds of thousands of genetic markers are evaluated in thousands of individuals, are very expensive at current genotyping costs, staged designs are commonly used. The sample size needs for these types of studies are described in chapter 15.
However, as the costs of primary genome-wide scans continue to decrease, there may be less need for two-stage designs in the future. Sample size requirements for more complex analyses of genotype data, such as pathway-based analyses, haplotype analyses, and novel high-dimensional analyses, are less well understood (chaps. 12–14).
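To put these determinants in concrete terms, the short calculation below (our illustration, not part of the original chapter; the genotype frequency and ORs are invented for the example) uses the standard two-proportion sample size formula to approximate the number of cases needed in a 1:1 case-control study to detect a given genotype-disease odds ratio with 80% power at alpha = 0.05.

from math import sqrt
from scipy.stats import norm

def cases_needed(p0, odds_ratio, alpha=0.05, power=0.80):
    # Approximate cases needed (1:1 case-control design) to detect a
    # genotype-disease association; p0 is the genotype frequency in controls.
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))  # frequency in cases
    pbar = (p0 + p1) / 2
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    num = (z_a * sqrt(2 * pbar * (1 - pbar))
           + z_b * sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return num / (p1 - p0) ** 2

# ORs in the 1.2-1.5 range for a genotype carried by 25% of controls:
for or_ in (1.5, 1.3, 1.2):
    print(or_, round(cases_needed(0.25, or_)))  # ~470, ~1150, ~2400 cases

Note how quickly the requirement grows as the OR approaches 1, which is one reason the consortia discussed below are often needed.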
Evaluation of gene-environment interactions often requires large sample sizes (see chap. 11, “Statistical Approaches to Studies of Gene-Gene and Gene-Environment Interactions”), and sample size needs are further increased by the presence of errors in measuring environmental and/or genetic exposures, even when the errors are small (84,85). Although multiplicative parameters for gene-environment interactions tend to be attenuated by nondifferential misclassification of exposure (84), this does not hold for the estimation of exposure main effects, joint effects, and subgroup effects or additive interactions. In addition, misclassification leads to biased estimates of risk (86). Thus, high-quality exposure assessment and nearly perfect genotype determinations are required for the evaluation of gene-environment interactions. This highlights the importance of validating genotype assays and including quality control samples during genotype determinations to assess the reproducibility of the assays (see chap. 6, “Principles of High-Quality Genotyping”).

Current case-control or cohort studies usually include between a few hundred and a few thousand cases and similar numbers of controls. Therefore, to meet the larger sample size requirements needed to identify weak associations and interactions, especially when considering histological subtypes of cancers, an increasing number of consortia of existing studies are being formed. Consortia can achieve the large sample sizes necessary to confirm or refute associations by coordinating the analysis of pooled data from many studies, as well as evaluate the consistency of findings across studies of different quality and with different sources of bias (see chap. 17 for a discussion of the value of consortia in validating and confirming associations through meta-analyses and pooled analyses).
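The attenuation from nondifferential misclassification noted above can be made concrete with a small calculation (our sketch, not from the chapter; the sensitivity, specificity, and odds ratio values are arbitrary). For a binary exposure classified with the same sensitivity and specificity in cases and controls, the expected observed odds ratio follows directly from the misclassified cell probabilities:

def observed_or(true_or, p_exp_controls, sensitivity, specificity):
    # Expected odds ratio after nondifferential exposure misclassification,
    # derived from the expected probabilities of being *classified* exposed.
    p0 = p_exp_controls
    p1 = true_or * p0 / (1 + p0 * (true_or - 1))  # true exposure freq. in cases
    classified = lambda p: sensitivity * p + (1 - specificity) * (1 - p)
    q0, q1 = classified(p0), classified(p1)
    return (q1 / (1 - q1)) / (q0 / (1 - q0))

# A true OR of 2.0, 30% exposed controls, 90% sensitivity and specificity:
print(observed_or(2.0, 0.3, 0.9, 0.9))  # ~1.7, attenuated toward the null

Even modest measurement error pulls the estimate toward 1, and the same mechanism inflates the sample sizes required to detect interaction parameters.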
CONCLUDING REMARKS

The field of molecular epidemiology is undergoing a transformational change with the recent incorporation of powerful genomic technology, which should continue to improve in comprehensiveness, cost, and efficiency for the foreseeable future and which provides an unprecedented opportunity to understand the fundamental process of carcinogenesis. At the same time, large and high-quality case-control studies have been established with detailed exposure data and stored biological specimens; previously established cohorts with biological samples are being followed up; and new cohort studies with biological samples are still being established, particularly in developing countries. The confluence of extraordinary technology and the availability of large epidemiological studies should ultimately lead to new preventive, screening, and treatment strategies. However, this will only be achieved if the field of molecular epidemiology adheres to the time-tested and fundamental epidemiological principles of high-quality study design, vigilant quality control, thoughtful data analysis and interpretation, and well-powered replication of important findings.
ACKNOWLEDGMENTS

This chapter has been adapted and updated from a book chapter, “Application of Biomarkers in Cancer Epidemiology,” by Garcia-Closas et al. (7). We thank the other coauthors of the earlier chapter, Drs. Roel Vermeulen, Mark E. Sherman, Lee E. Moore, and Martyn T. Smith, for their valuable contributions.
REFERENCES

1. Committee on Biological Markers of the National Research Council. Biological markers in environmental health research. Environ Health Perspect 1987; 74:3–9.
2. Rothman N, Wacholder S, Caporaso NE, et al. The use of common genetic polymorphisms to enhance the epidemiologic study of environmental carcinogens. Biochim Biophys Acta 2001; 1471:C1–C10.
3. Schulte PA. Methodologic issues in the use of biologic markers in epidemiologic research. Am J Epidemiol 1987; 126(6):1006–1016.
4. Perera FP. Molecular cancer epidemiology: a new tool in cancer prevention. J Natl Cancer Inst 1987; 78(5):887–898.
5. Perera FP. Molecular epidemiology: on the path to prevention? J Natl Cancer Inst 2000; 92(8):602–612.
6. Toniolo P, Boffetta P, Shuker DEG, et al. Application of Biomarkers in Cancer Epidemiology. Lyon: IARC, 1997.
7. Garcia-Closas M, Vermeulen R, Sherman ME, et al. Application of biomarkers in cancer epidemiology. In: Schottenfeld D, Fraumeni JF Jr, eds. Cancer Epidemiology and Prevention. 3rd ed. New York: Oxford University Press, 2006.
8. Aardema MJ, MacGregor JT. Toxicology and genetic toxicology in the new era of “toxicogenomics”: impact of “-omics” technologies. Mutat Res 2002; 499(1):13–25.
9. Wang W, Zhou H, Lin H, et al. Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards. Anal Chem 2003; 75(18):4818–4826.
10. Hanash S. Disease proteomics. Nature 2003; 422(6928):226–232.
11. Baak JP, Path FR, Hermsen MA, et al. Genomics and proteomics in cancer. Eur J Cancer 2003; 39(9):1199–1215.
12. Sellers TA, Yates JR. Review of proteomics with applications to genetic epidemiology. Genet Epidemiol 2003; 24(2):83–98.
13. Staudt LM. Molecular diagnosis of the hematologic cancers. N Engl J Med 2003; 348(18):1777–1785.
14. Strausberg RL, Simpson AJ, Wooster R. Sequence-based cancer genomics: progress, lessons and opportunities. Nat Rev Genet 2003; 4(6):409–418.
15. Wacholder S, McLaughlin JK, Silverman DT, et al. Selection of controls in case-control studies. I. Principles. Am J Epidemiol 1992; 135(9):1019–1028.
16. Breslow NE, Day NE. Design considerations. In: Breslow NE, Day NE, eds. Statistical Methods in Cancer Research, Volume II: The Design and Analysis of Cohort Studies. Lyon: IARC Press, 1987.
17. Rothman KJ, Greenland S. Modern Epidemiology. Philadelphia: Lippincott-Raven, 1998.
18. Schulte PA, Perera FP, Toniolo P, et al. Transitional studies. In: Application of Biomarkers in Cancer Epidemiology. Lyon, France: IARC Scientific Publications, 1997:19–29.
19. Rothman N. Genetic susceptibility biomarkers in studies of occupational and environmental cancer: methodologic issues. Toxicol Lett 1995; 77(1–3):221–225.
20. Hulka BS, Margolin BH. Methodological issues in epidemiologic studies using biologic markers. Am J Epidemiol 1992; 135(2):200–209.
21. Hulka BS. ASPO Distinguished Achievement Award Lecture. Epidemiological studies using biological markers: issues for epidemiologists. Cancer Epidemiol Biomarkers Prev 1991; 1(1):13–19.
22. Kim S, Lan Q, Waidyanatha S, et al. Genetic polymorphisms and benzene metabolism in humans exposed to a wide range of air concentrations. Pharmacogenet Genomics 2007; 17(10):789–801.
23. Lan Q, Zhang L, Li G, et al. Hematotoxicity in workers exposed to low levels of benzene. Science 2004; 306(5702):1774–1776.
24. Rothman N, Stewart WF, Schulte PA. Incorporating biomarkers into cancer epidemiology: a matrix of biomarker and study design categories.
Cancer Epidemiol Biomarkers Prev 1995; 4(4):301–311.
25. Schatzkin A, Freedman LS, Schiffman MH, et al. Validation of intermediate end points in cancer research. J Natl Cancer Inst 1990; 82(22):1746–1752.
26. Schulte PA, Rothman N, Schottenfeld D, et al. Design considerations in molecular epidemiology. In: Molecular Epidemiology: Principles and Practices. San Diego, CA: Academic Press, 1993:159–198.
27. Schatzkin A, Gail M. The promise and peril of surrogate end points in cancer research. Nat Rev Cancer 2002; 2(1):19–27.
28. Forrest MS, Lan Q, Hubbard AE, et al. Discovery of novel biomarkers by microarray analysis of peripheral blood mononuclear cell gene expression in benzene-exposed workers. Environ Health Perspect 2005; 113(6):801–807.
29. Lan Q, Zhang L, Shen M, et al. Polymorphisms in cytokine and cellular adhesion molecule genes and susceptibility to hematotoxicity among workers exposed to benzene. Cancer Res 2005; 65(20):9574–9581.
30. Bollati V, Baccarelli A, Hou L, et al. Changes in DNA methylation patterns in subjects exposed to low-dose benzene. Cancer Res 2007; 67(3):876–880.
31. Chen H, Li S, Liu J, et al. Chronic inorganic arsenic exposure induces hepatic global and individual gene hypomethylation: implications for arsenic hepatocarcinogenesis. Carcinogenesis 2004; 25(9):1779–1786.
32. Morla M, Busquets X, Pons J, et al. Telomere shortening in smokers with and without COPD. Eur Respir J 2006; 27(3):525–528.
33. Vermeulen R, Li G, Lan Q, et al. Detailed exposure assessment for a molecular epidemiology study of benzene in two shoe factories in China. Ann Occup Hyg 2004; 48(2):105–106.
34. Morton LM, Cahill J, Hartge P. Reporting participation in epidemiologic studies: a survey of practice. Am J Epidemiol 2006; 163(3):197–203.
35. Cox A, Dunning AM, Garcia-Closas M, et al. A common coding variant in CASP8 is associated with breast cancer risk. Nat Genet 2007; 39(3):352–358.
36. Easton DF, Pooley KA, Dunning AM, et al. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 2007; 447(7148):1087–1093.
37. Ahsan H, Rundle AG. Measures of genotype versus gene products: promise and pitfalls in cancer prevention. Carcinogenesis 2003; 24(9):1429–1434.
38. Wu X, Gu J, Spitz MR. Mutagen sensitivity: a genetic predisposition factor for cancer. Cancer Res 2007; 67(8):3493–3495.
39. Berwick M, Vineis P. Markers of DNA repair and susceptibility to cancer in humans: an epidemiologic review. J Natl Cancer Inst 2000; 92(11):874–897.
40. Spitz MR, Wei Q, Dong Q, et al. Genetic susceptibility to lung cancer: the role of DNA damage and repair. Cancer Epidemiol Biomarkers Prev 2003; 12(8):689–698.
41. Potter JD, Toniolo P, Boffetta P, et al. Logistics and design issues in the use of biological specimens in observational epidemiology. In: Application of Biomarkers in Cancer Epidemiology. Lyon, France: IARC Scientific Publications, 1997:31–37.
42. Hunter DJ, Toniolo P, Boffetta P, et al. Methodological issues in the use of biological markers in cancer epidemiology: cohort studies. In: Application of Biomarkers in Cancer Epidemiology. Lyon, France: IARC Scientific Publications, 1997:39–46.
43. Banks E, Meade T. Study of genes and environmental factors in complex diseases. Lancet 2002; 359(9312):1156–1157 (author reply 1157).
44. Burton P, McCarthy M, Elliott P. Study of genes and environmental factors in complex diseases. Lancet 2002; 359(9312):1155–1156 (author reply 1157).
45. Clayton D, McKeigue PM. Epidemiological methods for studying genes and environmental factors in complex diseases.
Lancet 2001; 358(9290):1356–1360.
46. Wacholder S, Garcia-Closas M, Rothman N. Study of genes and environmental factors in complex diseases. Lancet 2002; 359(9312):1155 (author reply 1157).
47. Wacholder S. Practical considerations in choosing between the case-cohort and nested case-control designs. Epidemiology 1991; 2(2):155–158.
48. Franco EL. Statistical issues in human papillomavirus testing and screening. Clin Lab Med 2000; 20(2):345–367.
49. Welch HG, Black WC. Using autopsy series to estimate the disease “reservoir” for ductal carcinoma in situ of the breast: how much more breast cancer can we find? Ann Intern Med 1997; 127(11):1023–1028.
50. Morrison AS, Rothman KJ, Greenland S. Screening. In: Modern Epidemiology, 2003:499–518.
51. Begg CB, Zhang ZF. Statistical analysis of molecular epidemiology studies employing case-series. Cancer Epidemiol Biomarkers Prev 1994; 3(2):173–175.
52. Yang Q, Khoury MJ, Sun F, et al. Case-only design to measure gene-gene interaction. Epidemiology 1999; 10(2):167–170.
53. Khoury MJ, Flanders WD. Nontraditional epidemiologic approaches in the analysis of gene-environment interaction: case-control studies with no controls! Am J Epidemiol 1996; 144(3):207–213.
54. Schmidt S, Schaid DJ. Potential misinterpretation of the case-only study to assess gene-environment interaction. Am J Epidemiol 1999; 150(8):878–885.
55. Albert PS, Ratnasinghe D, Tangrea J, et al. Limitations of the case-only design for identifying gene-environment interactions. Am J Epidemiol 2001; 154(8):687–693.
56. White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. Am J Epidemiol 1982; 115(1):119–128.
57. Cain KC, Breslow NE. Logistic regression analysis and efficient design for two-stage studies. Am J Epidemiol 1988; 128(6):1198–1206.
58. Chatterjee N, Chen Y, Breslow N. A pseudoscore estimator for regression problems with two-phase sampling. J Am Stat Assoc 2003; 98:10.
59. Wacholder S, Hartge P, Struewing JP, et al. The kin-cohort study for estimating penetrance. Am J Epidemiol 1998; 148(7):623–630.
60. Chatterjee N, Shih J, Hartge P, et al. Association and aggregation analysis using kin-cohort designs with applications to genotype and family history data from the Washington Ashkenazi Study. Genet Epidemiol 2001; 21(2):123–138.
61. Holland NT, Smith MT, Eskenazi B, et al. Biological sample collection and processing for molecular epidemiological studies. Mutat Res 2003; 543(3):217–234.
62. Kleeberger CA, Lyles RH, Margolick JB, et al. Viability and recovery of peripheral blood mononuclear cells cryopreserved for up to 12 years in a multicenter study. Clin Diagn Lab Immunol 1999; 6(1):14–19.
63. Beck JC, Beiswanger CM, John EM, et al. Successful transformation of cryopreserved lymphocytes: a resource for epidemiological studies. Cancer Epidemiol Biomarkers Prev 2001; 10(5):551–554.
64. Hayes RB, Smith CO, Huang WY, et al. Whole blood cryopreservation in epidemiological studies. Cancer Epidemiol Biomarkers Prev 2002; 11(11):1496–1498.
65. Steinberg KK, Sanderlin KC, Ou CY, et al. DNA banking in epidemiologic studies. Epidemiol Rev 1997; 19(1):156–162.
66. Hansen TV, Simonsen MK, Nielsen FC, et al. Collection of blood, saliva, and buccal cell samples in a pilot study on the Danish nurse cohort: comparison of the response rate and quality of genomic DNA. Cancer Epidemiol Biomarkers Prev 2007; 16(10):2072–2076.
67. Garcia-Closas M, Egan KM, Abruzzo J, et al. Collection of genomic DNA from adults in epidemiological studies by buccal cytobrush and mouthwash. Cancer Epidemiol Biomarkers Prev 2001; 10(6):687–696.
68. Paynter RA, Skibola DR, Skibola CF, et al. Accuracy of multiplexed Illumina platform-based single-nucleotide polymorphism genotyping compared between genomic and whole genome amplified DNA collected from multiple sources. Cancer Epidemiol Biomarkers Prev 2006; 15(12):2533–2536.
69. Feigelson HS, Rodriguez C, Robertson AS, et al.
Determinants of DNA yield and quality from buccal cell samples collected with mouthwash. Cancer Epidemiol Biomarkers Prev 2001; 10(9):1005–1008.
70. Gunter EW, McQuillan G. Quality control in planning and operating the laboratory component for the Third National Health and Nutrition Examination Survey. J Nutr 1990; 120(suppl 11):1451–1454.
71. Kononen J, Bubendorf L, Kallioniemi A, et al. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med 1998; 4(7):844–847.
72. Oyama T, Ishikawa Y, Hayashi M, et al. The effects of fixation, processing and evaluation criteria on immunohistochemical detection of hormone receptors in breast cancer. Breast Cancer (Tokyo, Japan) 2007; 14(2):182–188.
73. Goldstein NS, Ferkowicz M, Odish E, et al. Minimum formalin fixation time for consistent estrogen receptor immunohistochemical staining of invasive breast carcinoma. Am J Clin Pathol 2003; 120(1):86–92.
74. Jacobs TW, Prioleau JE, Stillman IE, et al. Loss of tumor marker-immunostaining intensity on stored paraffin slides of breast cancer. J Natl Cancer Inst 1996; 88(15):1054–1059.
75. Fergenbaum JH, Garcia-Closas M, Hewitt SM, et al. Loss of antigenicity in stored sections of breast cancer tissue microarrays. Cancer Epidemiol Biomarkers Prev 2004; 13(4):667–672.
76. Rhodes A, Borthwick D, Sykes R, et al. The use of cell line standards to reduce HER-2/neu assay variation in multiple European cancer centers and the potential of automated image analysis to provide for more accurate cut points for predicting clinical response to trastuzumab. Am J Clin Pathol 2004; 122(1):51–60.
77. De Marzo AM, Fedor HH, Gage WR, et al. Inadequate formalin fixation decreases reliability of p27 immunohistochemical staining: probing optimal fixation time using high-density tissue microarrays. Hum Pathol 2002; 33(7):756–760.
78. Garcia-Closas M, Lubin JH. Power and sample size calculations in case-control studies of gene-environment interactions: comments on different approaches. Am J Epidemiol 1999; 149(8):689–692.
79. Garcia-Closas M, Malats N, Silverman D, et al. NAT2 slow acetylation, GSTM1 null genotype, and risk of bladder cancer: results from the Spanish Bladder Cancer Study and meta-analyses. Lancet 2005; 366(9486):649–659.
80. Rothman N, Garcia-Closas M, Hein DW. Commentary: reflections on G. M. Lower and colleagues’ 1979 study associating slow acetylator phenotype with urinary bladder cancer: meta-analysis, historical refinements of the hypothesis, and lessons learned. Int J Epidemiol 2007; 36(1):23–28.
81. Hunter DJ, Kraft P, Jacobs KB, et al. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet 2007; 39(7):870–874.
82. Stacey SN, Manolescu A, Sulem P, et al. Common variants on chromosomes 2q35 and 16q12 confer susceptibility to estrogen receptor-positive breast cancer. Nat Genet 2007; 39(7):865–869.
83. Rothman N, Skibola CF, Wang SS, et al. Genetic variation in TNF and IL10 and risk of non-Hodgkin lymphoma: a report from the InterLymph Consortium. Lancet Oncol 2006; 7(1):27–38.
84. Garcia-Closas M, Rothman N, Lubin J. Misclassification in case-control studies of gene-environment interactions: assessment of bias and sample size. Cancer Epidemiol Biomarkers Prev 1999; 8(12):1043–1050.
85. Deitz AC, Garcia-Closas M, Rothman N, et al. Impact of misclassification in genotype-disease association studies: example of N-acetyltransferase 2 (NAT2), smoking, and bladder cancer. Proc Am Assoc Cancer Res 2000; 41:559.
86. Armstrong BK, White E, Saracci R, et al. Exposure measurement error and its effects. In: Principles of Exposure Measurement in Epidemiology. New York: Oxford University Press, 1992.
2
Family-Based Study Designs

Audrey H. Schnell and John S. Witte
Department of Epidemiology and Biostatistics, University of California, San Francisco, California, U.S.A.
INTRODUCTION

Family-based designs are unique in that they use relatives to assess the genetic and molecular epidemiology of disease. The number of relatives studied can range from two family members to enormous pedigrees. The most commonly used studies are of familial aggregation, twins, segregation, linkage, and association. The first three designs evaluate the potential genetic basis of disease using patterns of coaggregation and do not require the collection of biospecimens (e.g., DNA). In contrast, linkage and association studies directly evaluate genetic markers—commonly searching across the entire human genome for regions harboring potentially causal risk factors—and thus require the collection of biospecimens from study subjects.

Historically, family-based studies have been the primary approach to detecting disease-causing genes. Segregation and linkage studies have had a number of successes cloning highly penetrant, rare disease-causing genes (e.g., BRCA1). These approaches are well suited to detecting such genes, though recently there has been growing interest in detecting lower-risk but common disease-causing variants. Association studies may have more power than linkage studies to detect such variants, and much recent work has used non-family-based association studies to decipher the genetic basis of cancer (1). Nevertheless, family-based designs remain valuable; a Nature Genetics editorial noted that such designs might be required for publishing association study results (2).

A key benefit of family-based association studies is the control for confounding bias due to population stratification, albeit at a potential loss of power (3,4). Moreover, family members may be easier to recruit for some disorders than unrelated individuals, since they can have higher motivation to participate given their affected family member. Of course, this assumes that any geographical distance among family members does not limit their ability to take part in a study, and that a family is large enough to be eligible for study. Another benefit is the potential for quality assurance measures when the same data are collected on more than one family member. In addition, if genotyping is performed, quality assurance measures such as checking for Mendelian inheritance are also possible.
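As a minimal illustration of the Mendelian-consistency check mentioned above (our sketch, not from the chapter; genotypes are coded as unordered pairs of allele labels), a child's genotype at an autosomal marker is consistent with the parents if one allele can be drawn from each parent:

from itertools import product

def mendelian_consistent(child, mother, father):
    # True if the child's unordered genotype, e.g., ('a', 'b'), could
    # arise by taking one allele from each parental genotype.
    return any(sorted((m, f)) == sorted(child)
               for m, f in product(mother, father))

print(mendelian_consistent(('a', 'c'), ('a', 'b'), ('c', 'd')))  # True
print(mendelian_consistent(('b', 'b'), ('a', 'b'), ('c', 'd')))  # False

In practice, such checks are run across all genotyped markers, and families with repeated inconsistencies are flagged for possible sample mix-ups or nonpaternity.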
Details on these family-based designs, along with further consideration of their strengths and weaknesses, are presented here and highlighted with examples.

FAMILIAL AGGREGATION AND TWIN STUDIES

Familial Aggregation

The clustering of disease within families suggests genetic and/or shared environmental risk factors. In fact, this clustering is often the first indication that a disease may have a genetic component. A pattern consistent with genetic factors occurs when the similarity or correlation of a trait among closely related individuals (e.g., siblings) is greater than for more distant relatives, and/or greater among relatives than among unrelated individuals. One way to look at familial correlation is to compare the overall population prevalence with the risk of disease to other family members when there is an identified affected individual in the family. The degree of risk can be computed for different types of family members on the basis of their relatedness to the case (i.e., first- or second-degree relative). This risk can also be based on additional characteristics of the case (e.g., age of onset). For example, the relative risk of prostate cancer for a man with an affected first-degree relative is 2.6, and increases to 3.3 if the relative was diagnosed before age 65 (5).

Twin Studies

Evidence for genetic involvement in disease can also come from studies of twins. If a disease occurs more often among monozygotic twins than among dizygotic twins or siblings, this suggests that genes play a role (assuming environmental factors are the same). Dizygotic twins are genetically no more similar than siblings but have the distinct advantage of being the same age, which is a known factor in the occurrence of many diseases. Matching on age also controls for trends in environmental influences or exposures. The zygosity of the twins must be ascertained; well-validated questionnaires are available for this purpose, as is DNA testing (6,7). Despite the intuitive appeal of twin studies, twins may represent a unique and rare population, and a drawback of twin studies is their limited generalizability to the general population. Twin studies also rest on several assumptions, such as random mating, equal environments for monozygotic and dizygotic twins, and no gene-by-environment interaction. If these assumptions do not hold, the conclusions from twin studies may not be valid. Nevertheless, these limitations do not negate the usefulness of twin studies (8).

Twin Studies Example

A study of 44,788 pairs of twins from the Swedish, Danish, and Finnish twin registries by Lichtenstein et al. was used to assess the risk of 28 cancers (9). The authors believed that the use of twins rather than families allowed them to better separate heritable from environmental factors. The zygosity of twins was validated by questionnaire, and the twins were chosen from population-based registries. The concordance of disease among monozygotic and dizygotic twins was calculated. The authors found an increased risk for a twin when the other twin had certain types of cancer (stomach, colorectal, lung, breast, and prostate cancer) (9).

SEGREGATION ANALYSIS

Segregation analysis is a method of establishing the genetic inheritance of disease and can only be performed using family data. This approach helps determine whether a disease is largely caused by the segregation of a single major gene and, if so, what mode of inheritance best
fits the data. Families are collected on the basis of identification of a family member with the disease of interest, and ideally all families are collected following a set ascertainment scheme, which allows the necessary ascertainment correction to be applied in the analyses. Choosing families at random would not yield sufficient numbers of families with affected members. Therefore, a case is typically ascertained (e.g., from a hospital-based registry) and then that case's family is studied. Because of this nonrandom selection, an ascertainment correction must be applied to avoid bias in the estimates of parameters such as gene frequency. This becomes problematic if the method of ascertainment is not uniform across families.

Segregation analysis is performed by testing models of varying degrees of generality. Models with various restrictions (e.g., dominant or recessive inheritance) are compared with the most general model, in which all parameters are estimated, to see which model(s) best fit the data. Families with large pedigrees and many affected individuals are particularly informative, both for establishing that genes are important and for identifying specific genes (10).

Collecting families for segregation analysis can be extremely time consuming and costly. If not all family members are interviewed or directly assessed and information is only obtained by proxy, the segregation results can be subject to bias and misclassification. In addition, there may not be enough information in the data to discriminate among the models, and hence it may not be possible to determine a single best-fitting mode of inheritance. However, if no information is available on the familial aggregation of a disease, segregation analysis is a method for establishing a genetic component to justify further studies. In addition, if a mode of inheritance can be identified, this information can be used in model-based linkage analysis and will increase the power of the analyses (see below). Note that once families are collected for segregation analyses, they are available for further analyses (e.g., linkage and association). Stand-alone segregation analyses are increasingly uncommon, as investigators focus on more common but complex diseases and genotyping becomes less expensive. Segregation analyses can be incorporated jointly into linkage analyses, helping to determine the best-fitting model for model-based linkage analysis and in turn increasing power (11,12).

Segregation Analysis Example

Schaid et al. undertook a large segregation analysis of prostate cancer (13). Men who underwent radical prostatectomies were identified, and these “probands” and their relatives were studied. Information on cancer was obtained via questionnaire, with specific emphasis on gathering a history of prostate cancer in male relatives. The segregation analysis found that no single-gene model best explained the overall pattern of inheritance, although a rare autosomal dominant model fit the data when the proband's age at diagnosis was less than 60 years, reflecting the complex and heterogeneous nature of prostate cancer. Findings such as these suggest that families be selected for linkage analysis in ways that increase potential genetic homogeneity (e.g., by age at diagnosis of the proband) (13).

LINKAGE ANALYSIS

Linkage occurs when two loci on the same chromosome are inherited together.
That is, when loci are relatively close together, recombination during meiosis is uncommon, and they are “linked.” Linkage analysis builds on this phenomenon by investigating the
cosegregation of genetic markers and a disease trait within families, where the trait can be qualitative (e.g., presence or absence of prostate cancer) or quantitative (e.g., Gleason score for prostate cancer). If markers and the trait cosegregate in families, one infers that the disease-causing variants are near the markers. Hence, linkage can be considered “intrafamilial” association (the cosegregating marker allele could differ across families). Families are generally recruited into linkage studies on the basis of having at least one identified affected individual, as a randomly sampled collection of families would be relatively uninformative unless the traits of interest were common. For a quantitative trait, sampling individuals with extreme values will increase the power of a linkage study (14).

One generally studies either large pedigrees or affected sibling pairs for linkage analyses. With large pedigrees one often uses model-based linkage analysis (sometimes referred to as parametric), and with affected sibling pairs a model-free (nonparametric) approach is used. Markers are spaced evenly over the entire genome, and linkage can be performed using all or a subset of these markers. Single-point linkage analysis uses information from one marker at a time, while multipoint linkage combines information from closely spaced markers. Multipoint linkage analysis can provide more power, as there is more information when the markers are analyzed together, but it is computationally more demanding and hence an issue when used with large pedigrees.

Model-Based (Parametric) Linkage

In model-based linkage analysis, one must specify the mode of inheritance of the trait being studied, including the number of loci involved, the number of alleles at each locus and their frequencies, and the penetrances of each genotype. The marker allele frequencies specified have no effect on the evidence for linkage if the marker genotypes of all the pedigree founders—those family members from whom all other pedigree members are descended—are known or can be inferred with certainty. As noted above, it is possible to estimate the penetrance to be used in a linkage study by first performing a segregation analysis of families that have been ascertained according to a specified sampling scheme and corrected for this ascertainment. Segregation analysis may not be practical or may not provide a single best-fitting model; in that case the investigator may use several models based on what is known about the disease. Model-based linkage analysis is more powerful than model-free analysis, provided that the correct model is used. While use of an incorrect model will reduce power, it should not lead to false-positive results (14). Different models can be tested, but then multiple tests are performed, and this reduces the overall statistical power to detect linkage (15).

The results of model-based linkage analysis are generally given in terms of log-odds (LOD) scores. These are the logarithms of the observed likelihood for the data if there is linkage with a given marker divided by the expected likelihood if there is no linkage. For each marker, LOD scores are calculated over a range of recombination fraction values (θ), and the maximum LOD score is determined. The recombination fraction (θ) is the probability of crossover between two loci; if there is no linkage (the two loci are sufficiently far apart), the recombination fraction is 0.5.
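For intuition, the sketch below (ours, not the authors') computes a single-point LOD score in the simplest phase-known situation, where each informative meiosis can be scored as recombinant or nonrecombinant; the likelihood at recombination fraction theta is theta^R * (1 - theta)^N, and the LOD compares it with the value at theta = 0.5.

from math import log10

def lod(recombinants, nonrecombinants, theta):
    # Phase-known single-point LOD score: log10 of the likelihood at the
    # given recombination fraction over the likelihood at theta = 0.5.
    r, n = recombinants, nonrecombinants
    return log10((theta ** r) * ((1 - theta) ** n) / 0.5 ** (r + n))

# Scan theta and keep the maximum LOD, as described in the text;
# 2 recombinants in 20 meioses gives a maximum LOD of ~3.2 at theta = 0.10.
thetas = [i / 100 for i in range(1, 50)]
print(max((lod(2, 18, t), t) for t in thetas))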
A LOD score of 3 has historically been used as a cutoff for statistical significance, and corresponds to an alpha level of approximately 10^-4. A more stringent cutoff for LOD scores has been suggested that would provide a genome-wide false-positive rate of 0.05 (16). However, the use of arbitrary cut points to declare the importance of a finding has been questioned (17). One can sum LOD scores across families and different studies provided the studies are
comparable (i.e., the cases are all diagnosed using the same criteria). Because the disease model specified may be incorrect, model-based linkage analysis is not a robust method for the discovery of genes.

Model-Based Linkage Analysis Example

In an effort to follow up on the numerous but inconsistently replicated candidate genes for prostate cancer, the ACTANE Consortium performed a genome-wide linkage analysis of 65 families, each with multiple affected members. The 65 most informative families were chosen, all but one having at least three affected members. Family data were used to perform quality assurance tests such as checking for Mendelian consistency. Multipoint model-based linkage analysis was performed using three models, two of which were based on segregation analyses previously reported by other authors. The authors did not find linkage to any of the previously reported genes, nor did they find strong evidence for any new candidates. This is possibly due to the heterogeneity of the sample, which came from different countries with different screening protocols, and the probable heterogeneity of the disease (18).

Model-Free (Nonparametric) Linkage

As noted above, model-free linkage does not require one to specify a mode of inheritance. This approach relies on estimates of identity-by-descent for markers between sets of relatives and on estimated functions of the recombination fraction (θ). In the simplest case, if linkage is present, affected sib pairs will share more alleles identical-by-descent (IBD) in the identified region than the 50% expected solely on the basis of being siblings. One tests this by calculating the difference between the expected and observed sharing of alleles IBD for a given marker, divided by the estimated standard error. Model-free analysis is also appropriate for a first-stage “screening” of markers, as it is computationally easy and fast. One can study sib pairs (affected and possibly unaffected) as well as other relative pairs. Given the current emphasis on studying complex diseases, where the mode of inheritance cannot be easily specified, model-free linkage is especially useful. Because model-free linkage analysis is not computationally demanding, multipoint linkage analysis is very feasible. Model-free linkage can be less powerful than model-based linkage, but it may be easier to recruit large numbers of sib pairs than extensive pedigrees. All linkage approaches are suitable for localizing candidate regions (e.g., down to 10 Mb), but not for identifying a particular causal variant.

Model-Free Linkage Example

A recent genome-wide linkage analysis of prostate cancer aggressiveness was undertaken by Schaid et al. (19). Because of the lack of consistent results from previous studies, the authors analyzed only Caucasians and used Gleason score, a measure of severity, as the outcome to obtain a more homogeneous sample. In addition, families were selected for eligibility on the basis of a required number of prostate cancer cases in the family: cases either had to have three or more first-degree relatives with prostate cancer, or prostate cancer in three generations (maternal or paternal), or two first-degree relatives diagnosed prior to age 65. A total of 183 families were analyzed. Multipoint model-free linkage analysis was performed, and evidence of linkage was found on chromosomes 19q and 5q, with lower significance on chromosomes 3q and 7q.
These results confirmed earlier positive findings for linkage on chromosome 19q and identified additional regions for future study (19).
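The "mean test" for affected sib pairs sketched in the preceding section can be written down in a few lines (our illustration; it assumes IBD sharing is known rather than estimated from marker data, and the pair counts are invented). Under the null hypothesis of no linkage, each pair shares 0, 1, or 2 alleles IBD with probabilities 1/4, 1/2, and 1/4, so the mean proportion shared is 0.5 with variance 1/8 per pair:

from math import sqrt

def mean_ibd_test(n_pairs, mean_ibd_share):
    # z-statistic for excess IBD sharing among affected sib pairs at a
    # marker; mean_ibd_share is the observed mean proportion of alleles
    # shared IBD (0.5 expected under no linkage, null variance 1/8).
    return (mean_ibd_share - 0.5) / sqrt(0.125 / n_pairs)

# 200 affected sib pairs sharing 57% of alleles IBD on average:
print(mean_ibd_test(200, 0.57))  # ~2.8, one-sided evidence of linkage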
FAMILY-BASED ASSOCIATION STUDIES

While parametric and nonparametric linkage analysis methods have been successful for detecting high-risk, relatively rare disease-causing loci, as noted above, association studies may have more power to detect common causal variants. Moreover, although the resolution of linkage is limited by the small number of recombination events within single generations, association studies may allow one to get closer to a disease-causing variant. Association studies use a case-control design, with cases coming from a hospital or disease registry. Controls are either unrelated (population or hospital/registry based) or are the cases' family members (e.g., parents or siblings). The occurrence of a given allele in cases versus controls is compared to see if an “association” exists between genes and disease. With the reduction in the cost of large-scale single-nucleotide polymorphism (SNP) genotyping, association studies are increasingly common and are quickly expanding from focused candidate gene studies to genome-wide association studies.

Population Stratification

Spurious associations can be detected if cases and controls come from different source populations that have varying allele frequencies (20). This phenomenon is termed “population stratification,” and there is wide debate regarding how much bias may realistically result from such confounding. Some argue that this is a nonissue in well-designed studies of nonadmixed populations (21), while others suggest that it could lead to substantive bias (22,23). One can address this issue when studying unrelated cases and controls through the use of genomic information (24,25). Population stratification can also be circumvented by using family-based study designs: when studying parents and their offspring, or siblings, one is assured that cases and controls within each family arise from the same source population.

The most common family-based case-control designs use case-parent trios [e.g., the Transmission Disequilibrium Test (TDT) approach] or sibling controls. One could also study other relatives (e.g., cousins) or simultaneously study a large number of different family members. Another possibility is to over-sample cases and controls from families with a more extensive disease history. We give further details on these designs below.

Case-Parent Trios

This design starts with an affected individual (the “case”) and recruits their parents as the “controls.” These are not conventional controls, however, as one compares the alleles transmitted from the parents to the case versus those not transmitted (i.e., the “controls”). These controls are often referred to as “pseudo-sibs” or pseudocontrols. For example, assume that a case and their parents are genotyped for a particular marker and that one parent has genotype (a,b) and the other has (c,d). Suppose that the case received genotype (a,c) at this marker—these are then the “transmitted” alleles. The other three genotypes, (a,d), (b,c), and (b,d), are the nontransmitted genotypes, and these can be considered “pseudo-sibs.” This design looks across numerous trios to assess whether a specific allele or combination of alleles is preferentially transmitted to the cases, indicating an association between the corresponding allele and disease.

The TDT makes use of this design (26). The TDT seeks to detect linkage between a marker locus and a disease allele when linkage disequilibrium or any other type of allelic
association is present. If linkage is assumed and one is testing for association, it is important to remember that if more than one sibling in a family is studied, the siblings are not independent. The basic TDT, which is for binary traits, has been expanded to include additional family members and to analyze multiallelic markers and quantitative traits. The case-parent design has been extended to add additional family members, which leads to an increase in sample size and information (27). Other extensions include analyzing two transmitted and two nontransmitted alleles jointly (28). A common problem with the TDT is missing parental data, and several authors have proposed strategies for dealing with this situation, for example, by incorporating genotypic information from other family members to infer missing genotypes (29–32).

The case-parent design is very efficient for rare diseases (33). Though efficient, it may not be a practical design. First, it may be limited to disorders that occur at such young ages that the parents of the cases are still likely to be alive. In addition, cost may be a factor, as three people are being genotyped (i.e., two parents and the case) as opposed to just two in a traditional case-control design (34). If only one parent can be genotyped and the other is missing, there can be bias in the TDT (35). In addition, if the cause of missing parental data is associated with the genotype being studied and these cases are excluded, the remaining cases may not be representative of the entire population (36).

TDT Example

Ho et al. used the combined TDT and sib-TDT to further explore the reported link between prostate cancer and the CAG and GGC repeats of the X-linked androgen receptor gene and the autosomal gene coding for glutathione S-transferase pi (GSTP1) (37). They studied 79 North American pedigrees, most of which had three or more affected first-degree relatives. They used the reconstruction-combined TDT (RC-TDT), which combines the TDT and the sib-TDT (31). The RC-TDT allowed them to use information from families in which parental genotypes were either typed or inferred, as well as families in which parental genotypes were unavailable but the genotypes of unaffected sibs were available. The authors were not able to replicate the previous findings, suggesting either that their study was underpowered to detect associations or that the original finding was due to population stratification.

Case-Sibling Association Study Design

With this design, each case is matched to one or more unaffected siblings. In general, eligible controls should be those unaffected siblings who have reached the age of diagnosis of the case. If incident cases are studied, this will most likely lead to the controls being older siblings. One could use younger siblings and address any potential bias toward the null by including information on population rates of disease (34). Nevertheless, the age difference between siblings and cases could lead to cohort effects or time- or trend-dependent environmental exposures (i.e., different-aged siblings may be subject to different time-dependent exposure variables). In addition to addressing potential issues of population stratification, siblings may also be matched on many other potential confounding variables (genetic and environmental). However, this advantage can result in siblings being overmatched on many variables (including genotypes), which may result in a loss of power.
Specifically, using siblings as controls can be 50% less efficient than using unrelated individuals, indicating that twice the sample size may be required to maintain the same power (36). When
looking at gene-environment interactions, using sibling controls can be even more efficient than using unrelated controls, especially if the genotype is less common (34). Siblings may be more willing to participate in association studies than an unrelated subject from the general population. Cross-validation of questionnaire information is also possible, for example, by comparing answers from siblings about the disease status of grandparents. Unfortunately, not all cases will have an eligible and willing sibling available, and, if availability is related to both disease risk and allele frequency, this may result in biased estimates of effect (34).

If siblings are not available, one might also consider matching each case to an unaffected first cousin. First cousins could potentially allow for better age matching than siblings, and there may also be a larger pool to draw from, which can increase the inclusion rate for identified cases. On the other hand, cousins may have a reduced participation rate, as they may be less motivated to participate than siblings or parents and may be more geographically distant. Moreover, cousin controls do not provide as complete a solution to population stratification as sibling controls do, since cousins are related through only one parent, and they can also be less efficient than unrelated controls (36). For the case-sib design, subjects are analyzed in a matched manner using conditional logistic regression.

Restricted Study Designs

One can modify the family-based association study designs to include subjects with a positive family history of the disease of interest. Doing so may increase the frequency of the causal gene, improving the power to detect associations (especially for rare genes) (34). The same criteria applied to case selection (e.g., having an affected sibling) must also be applied to the controls. This might require a multistage design in which unrestricted cases and controls are identified first and the desired cases and controls are then selected from this first collection.

Gauderman et al. examined asymptotic relative efficiency (ARE) when using family-based controls with restrictions on family history (34). Overall, the gains in efficiency are greatest when the attributable risk is small and the relative risk is large; thus, restriction is best used with a rare gene conferring a high relative risk. Affected sib pairs were found to be most efficient for a dominant gene when the restriction of having an affected parent was applied. When the added restriction of an additional affected sib was applied, the gain in efficiency was seen for both dominant and recessive genes.

Restricted Study Design Example

Douglas et al. studied 715 discordant brothers from 266 families, evaluating the association between prostate cancer and four candidate genes involved in the synthesis or metabolism of androgens (38). All the families had at least two living first- or second-degree relatives with prostate cancer, or the case was diagnosed by age 55. Affected brothers and at least one unaffected brother were studied, with the oldest unaffected brother being preferentially selected. Familial association tests and conditional logistic regression were used to analyze the data. Stratified analyses were undertaken on the basis of age at diagnosis, progression of disease, and number of affected family members.
Restricted Study Design Example

Douglas et al. studied 715 discordant brothers from 266 families, evaluating the association between prostate cancer and four candidate genes involved in the synthesis or metabolism of androgens (38). All families had at least two living first- or second-degree relatives with prostate cancer, or the case was diagnosed by age 55. Affected brothers and at least one unaffected brother were studied, with the oldest unaffected brother preferentially selected. Familial association tests and conditional logistic regression were used to analyze the data. Stratified analyses were undertaken on the basis of age at diagnosis, progression of disease, and number of affected family members. In the unstratified analyses, CYP17 showed a preferential transmission of the minor allele to unaffected individuals, but this was not evident in the stratified analyses. After stratification, the CYP19 minor allele was preferentially transmitted to affected men in the subset of families with early age of diagnosis.
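As a concrete illustration of the matched analysis used in such case-sib studies, the following hedged sketch fits a conditional logistic regression to simulated sibling pairs. The data frame, variable names, and effect sizes are all hypothetical; the model class is the ConditionalLogit implementation in statsmodels (version 0.10 or later).

```python
# A minimal sketch of a matched case-sibling analysis using conditional
# logistic regression. Data and effect sizes are invented for illustration.
import numpy as np
import pandas as pd
from statsmodels.discrete.conditional_models import ConditionalLogit

rng = np.random.default_rng(0)
n_pairs = 300

# Simulate sibling pairs: one case and one unaffected sibling per family.
# 'carrier' flags a hypothetical risk genotype, enriched among the cases.
df = pd.DataFrame({
    "family": np.repeat(np.arange(n_pairs), 2),
    "case": np.tile([1, 0], n_pairs),
})
df["carrier"] = rng.binomial(1, np.where(df["case"] == 1, 0.35, 0.25))

# Conditioning on the family stratum removes every effect shared within
# a family, including shared ancestry, which is why the matched design
# resists population stratification.
model = ConditionalLogit(df["case"], df[["carrier"]], groups=df["family"])
result = model.fit()
print(result.summary())        # log-odds ratio for the genotype
print(np.exp(result.params))   # matched (within-family) odds ratio
```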
SUMMARY

While the use of family-based studies is currently decreasing, they have a number of positive practical and statistical attributes. Family-based studies can help distinguish whether a disease or trait is indeed influenced by genetics, for example, by studying familial aggregation and excess risk to relatives of a case. Such work can be complemented with a segregation analysis to identify the mode of inheritance. Information from a segregation analysis can add power to a linkage analysis, which searches across the entire genome in an attempt to locate regions containing causal genes. Linkage analysis, however, does not aim to identify the causal gene itself. Family-based association studies can get much closer to directly identifying disease variants and help address issues of population stratification. These studies can be designed with a number of different relative controls, although these can result in substantially reduced power in comparison with population-based association studies. The most appropriate study design, whether linkage or association, family or population based, will largely depend on available resources and the trait one is investigating.

REFERENCES

1. Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science 1996; 273(5281):1516–1517.
2. Freely associating. Nat Genet 1999; 22:1–2.
3. Witte JS, Gauderman WJ, Thomas DC. Asymptotic bias and efficiency in case-control studies of candidate genes and gene-environment interactions: basic family designs. Am J Epidemiol 1999; 149(8):693–705.
4. Thomas DC, Witte JS. Point: population stratification: a problem for case-control studies of candidate-gene associations? Cancer Epidemiol Biomarkers Prev 2002; 11(6):505–512.
5. Zeegers MP, Jellema A, Ostrer H. Empiric risk of prostate carcinoma for relatives of patients with prostate carcinoma: a meta-analysis. Cancer 2003; 97(8):1894–1903.
6. Bonnelykke B, Hauge M, Holm N, et al. Evaluation of zygosity diagnosis in twin pairs below age seven by means of a mailed questionnaire. Acta Genet Med Gemellol (Roma) 1989; 38(3–4):305–313.
7. Jackson RW, Snieder H, Davis H, et al. Determination of twin zygosity: a comparison of DNA with various questionnaire indices. Twin Res 2001; 4(1):12–18.
8. Winerman L. A second look at twin studies. Monitor on Psychology 2004; 35(4):46.
9. Lichtenstein P, Holm NV, Verkasalo PK, et al. Environmental and heritable factors in the causation of cancer—analyses of cohorts of twins from Sweden, Denmark, and Finland. N Engl J Med 2000; 343(2):78–85.
10. Terwilliger JD, Goring HH. Gene mapping in the 20th and 21st centuries: statistical methods, data analysis, and experimental design. Hum Biol 2000; 72(1):63–132. (Review).
11. Zhao LP, Hsu L, Davidov O, et al. Population-based family study designs: an interdisciplinary research framework for genetic epidemiology. Genet Epidemiol 1997; 14(4):365–388. (Review).
12. Wijsman EM, Yu D. Joint oligogenic segregation and linkage analysis using Bayesian Markov chain Monte Carlo methods. Mol Biotechnol 2004; 28(3):205–226.
13. Schaid DJ, McDonnell SK, Blute ML, et al. Evidence for autosomal dominant inheritance of prostate cancer. Am J Hum Genet 1998; 62(6):1425–1438.
14. Amos CI, Witte JS, Newman B. Identifying causal genetic factors. In: Runge MS, Patterson C, eds. Principles of Molecular Medicine. Totowa, NJ: Humana Press, 2006:19–26.
15. Weeks DE. A likelihood-based analysis of consistent linkage of a disease locus to two nonsyntenic marker loci: osteogenesis imperfecta versus COL1A1 and COL1A2. Am J Hum Genet 1990; 47(3):592–594.
16. Lander E, Kruglyak L. Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet 1995; 11:241–247.
17. Witte JS, Elston RC, Schork NJ. Genetic dissection of complex traits. Nat Genet 1996; 12:355–356.
18. Edwards S, Meitz J, Eeles R, et al.; International ACTANE Consortium. Results of a genome-wide linkage analysis in prostate cancer families ascertained through the ACTANE consortium. Prostate 2003; 57(4):270–279.
19. Schaid DJ, Stanford JL, McDonnell SK, et al. Genome-wide linkage scan of prostate cancer Gleason score and confirmation of chromosome 19q. Hum Genet 2007; 121(6):729–735.
20. Cardon LR, Palmer LJ. Population stratification and spurious allelic association. Lancet 2003; 361(9357):598–604. (Review).
21. Wacholder S, Rothman N, Caporaso N. Counterpoint: bias from population stratification is not a major threat to the validity of conclusions from epidemiological studies of common polymorphisms and cancer. Cancer Epidemiol Biomarkers Prev 2002; 11(6):513–520.
22. Thomas DC, Witte JS. Point: population stratification: a problem for case-control studies of candidate-gene associations? Cancer Epidemiol Biomarkers Prev 2002; 11(6):505–512.
23. Gorroochurn P, Hodge SE, Heiman G, et al. Effect of population stratification on case-control association studies. II. False-positive rates and their limiting behavior as number of subpopulations increases. Hum Hered 2004; 58(1):40–48.
24. Devlin B, Bacanu SA, Roeder K. Genomic control to the extreme. Nat Genet 2004; 36(11):1129–1130.
25. Pritchard JK, Rosenberg NA. Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet 1999; 65(1):220–228.
26. Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 1993; 52(3):506–516.
27. Abecasis GR, Cookson WO, Cardon LR. Pedigree tests of transmission disequilibrium. Eur J Hum Genet 2000; 8(7):545–551.
28. Bickeboller H, Clerget-Darpoux F. Statistical properties of the allelic and genotypic transmission/disequilibrium test for multiallelic markers. Genet Epidemiol 1995; 12(6):865–870.
29. Spielman RS, Ewens WJ. A sibship test for linkage in the presence of association: the sib transmission/disequilibrium test. Am J Hum Genet 1998; 62(2):450–458.
30. Sun F, Flanders WD, Yang Q, et al. Transmission disequilibrium test (TDT) when only one parent is available: the 1-TDT. Am J Epidemiol 1999; 150(1):97–104.
31. Knapp M. The transmission/disequilibrium test and parental-genotype reconstruction: the reconstruction-combined transmission/disequilibrium test. Am J Hum Genet 1999; 64(3):861–870.
32. Sebastiani P, Abad MM, Alpargu G, et al. Robust transmission/disequilibrium test for incomplete family genotypes. Genetics 2004; 168(4):2329–2337.
33. Laird NM, Lange C. Family-based designs in the age of large-scale gene-association studies. Nat Rev Genet 2006; 7(5):385–394. (Review).
34. Gauderman WJ, Witte JS, Thomas DC. Family-based association studies. J Natl Cancer Inst Monogr 1999; 26:31–37. (Review).
35. Curtis D, Sham PC. A note on the application of the transmission disequilibrium test when a parent is missing. Am J Hum Genet 1995; 56(3):811–812.
36. Witte JS, Gauderman WJ, Thomas DC. Asymptotic bias and efficiency in case-control studies of candidate genes and gene-environment interactions: basic family designs. Am J Epidemiol 1999; 149(8):693–705.
37. Ho GY, Knapp M, Freije D, et al. Transmission/disequilibrium tests of androgen receptor and glutathione S-transferase pi variants in prostate cancer families. Int J Cancer 2002; 98(6):938–942.
38. Douglas JA, Zuhlke KA, Beebe-Dimmer J, et al. Identifying susceptibility genes for prostate cancer—a family-based association study of polymorphisms in CYP17, CYP19, CYP11A1, and LH-beta. Cancer Epidemiol Biomarkers Prev 2005; 14(8):2035–2039.
3

Trials and Interventions in Molecular Epidemiology

James R. Marshall and Mary E. Reid
Roswell Park Cancer Institute, Buffalo, New York, U.S.A.
INTRODUCTION

Experimentation is an important resource for therapeutic and prevention research in the era of molecular epidemiology. It is understood that well-executed clinical trials can add greatly to our understanding of the impact of exposures on risk and of treatment on disease outcome. Trials add significantly to our understanding of genetic variability; this variability can predict or determine risk or treatment outcome, and it can also be critical in interacting with exposure to predict risk and with treatment to predict outcome. The goal of molecular epidemiologic research is to understand the effects of different exposures on the risk of disease; genetic variability may alter these effects. For prevention, researchers seek to distinguish those exposures or experiences that can be modified or blocked from those that cannot. Within the therapeutic trial, the molecular epidemiologist assesses whether treatment alters the course of disease and whether genetic variation alters the impact of treatment. Within the prevention trial, the molecular epidemiologist considers both whether intervention alters the risk of disease and whether genetic variability alters the intervention effect. This chapter considers (1) the structure of and rationale for clinical trials, (2) critical distinctions between therapeutic and prevention trials, (3) the uses of biomarkers in clinical trials, and (4) pharmacogenetics within clinical trials.

STRUCTURE OF AND RATIONALE FOR CLINICAL TRIALS

The clinical trial is a relatively recent instrument of medical science; it did not become a mainstay of medical research until the late 1940s. To be sure, rudimentary clinical trials were known as early as the 18th century. Lind, for example, undertook a clinical trial of citrus fruit for 12 seamen stricken by scurvy (1). Two were assigned to receive a daily ration of citrus fruits; the others, in groups of two, received various other medicinal concoctions. Assignment to citrus fruit as opposed to the other compounds was not randomized; Lind reported, however, that all 12 men were about equally ill. The recovery
of the two men who received citrus fruits was, within two weeks, markedly better than that of the other 10. Application of the scientific method to medicine continued with contributions from Farr, from Louis, and from Guy early in the 19th century (2). Key statistical concepts and techniques, such as multiple regression, were introduced during the same general period (3).

The clinical trial was proposed and promulgated by Austin Bradford Hill in the late 1940s to accomplish three important goals: (i) ensure uniformity in treatment or in nontreatment; (ii) ensure that subject and clinician expectations do not bias treatment assignment or outcome evaluation; and (iii) eliminate confounding, by ensuring that experimental and control subjects differ only by treatment (4). The clinical trial is focused on intervention and is conducted prospectively. The subject is randomized to one among a series of distinct interventions; neither the subject nor the investigator picks the intervention the subject receives. Assignment to intervention is random, not haphazard; assignment is by preset, random allocation. When possible, the subject's trial assignment is blinded: neither the subject nor the investigator knows treatment assignment until after the trial is complete and the data analyzed. To the degree possible, the clinicians evaluating the patients are blinded to subject assignment. In some instances, such as with behavioral interventions, it is impossible to blind subjects to their experimental or control status; even so, the investigator attempts to blind the clinician evaluating the subject. In dietary intervention trials, subjects clearly know whether they have been assigned to the treatment or to the control condition; they are urged, however, not to discuss their intervention assignment with their clinician (5,6). The statistical analyst knows only that some subjects received one intervention while others received another. The goal of this blinding is to discourage the statistician from coaxing a finding from the data.
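The allocation step described above can be made concrete with a short sketch. The block size, arm labels, and seed below are illustrative assumptions; the point is only that the schedule is random yet preset, generated before enrollment and concealed from subjects and evaluators.

```python
# A minimal sketch of preset random allocation using permuted blocks,
# which keeps the arms balanced throughout accrual. Details (block size,
# arm labels, seed) are illustrative, not prescriptive.
import random

def permuted_block_schedule(n_subjects, arms=("treatment", "control"),
                            block_size=4, seed=20240101):
    assert block_size % len(arms) == 0, "block must divide evenly across arms"
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_subjects:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)  # each block is a fresh random permutation
        schedule.extend(block)
    return schedule[:n_subjects]

# Generated once, before enrollment, and held by a third party so that
# neither subjects nor evaluating clinicians can anticipate assignments.
print(permuted_block_schedule(8))
```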
In the standard clinical trial, the intervention and the outcome are operationally defined. The investigator, on the basis of his or her understanding of the probable effect of the intervention, determines how many subjects need to be included in the trial, how long the treatment needs to extend, and for how long after treatment the subjects need to be observed. The subjects may be assigned to any one of two or more distinct interventions. Subjects are recruited to participate, are informed about the study, are then assigned by a random process to one of the treatments, and are then observed for a fixed period of time.

The strength of the clinical trial resides in its superior ability to address one of the major threats to internal study validity: confounding. In observational research, a wide range of exposures may occur in concert; thus, variability in outcome can be attributed to the focal exposure or to any of a number of the exposure's correlates. The researcher uses the association of the outcome with the focal exposure to assess the exposure's importance; the investigator will conclude that the focal exposure is a genuine risk factor if that association is substantively and statistically significant, and the researcher's confidence in this conclusion will be strengthened if the association is stronger than the association of this outcome with any other exposure. In testing this association, the investigator will also consider the degree to which varying levels of the exposure are associated with increased risk of the outcome (7). A critical test of the assertion that an exposure affects an outcome involves statistical control; the investigator uses any of a series of techniques to evaluate the association of this exposure and the outcome at fixed levels, or within categories, of other study variables. This process is also known as holding the other variables constant. Thus, for example, researchers may want to know whether a new drug, drug X, is more effective than drug Y. A straightforward way of understanding the relative effectiveness of drugs X and Y would be to compare cure or remission rates of patients who are treated with drug X as opposed to drug Y. But drug X may be more expensive than drug Y, or it may only be
used by certain doctors, or it may be more available in certain regions. A concern will be that wealthy patients who can afford drug X, or patients treated by doctors who use drug X, or patients in regions where drug X is more available, are different from patients treated by drug Y, and that one of these differences is responsible for any apparent difference in the relative effects of drugs X and Y. The standard approach to this problem is to group patients according to their status on the other factors. For example, patients may be categorized by the region in which they live: region A, where drug X is more available, or region B, where it is less available. Among patients in region A, those who received drug X can be compared with patients who received drug Y. The same comparison of X and Y can be undertaken in region B. A weighted comparison of drugs X and Y in regions A and B is described as controlled for region. If drug X is truly better, it might be expected to be associated with better cure or remission likelihood in both regions A and B. The same process would be undertaken to address the possibility of confounding by wealth, or by physician's use of the drug.

The investigator's ability to effect statistical control requires identification of factors that might confound comparisons of drugs X and Y; it also requires that exposure to these factors be measured. It is on this point that the randomized trial holds an undeniable advantage: randomization almost inevitably, in studies of reasonably large numbers of subjects, causes the intervention—the focal study variable—to be uncorrelated with those other exposures. Thus, the clinical trial enables the researcher to understand the importance of this one exposure, net of the impact of other suspected exposures.
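The "controlled for region" comparison just described corresponds to the classical Mantel-Haenszel analysis of stratified 2 × 2 tables. The sketch below uses fabricated counts purely for illustration; StratifiedTable in statsmodels computes the pooled (weighted) odds ratio and the associated test.

```python
# A hedged sketch of comparing drug X with drug Y "controlled for region"
# via a Mantel-Haenszel pooled odds ratio. All counts are fabricated.
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# One 2x2 table per region; rows: drug X, drug Y; columns: remission, none.
region_a = np.array([[40, 60],
                     [25, 75]])
region_b = np.array([[15, 35],
                     [10, 40]])

strata = StratifiedTable([region_a, region_b])
print(strata.oddsratio_pooled)  # weighted X-versus-Y comparison across regions
print(strata.test_null_odds())  # Mantel-Haenszel test that the common OR is 1
```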
THERAPEUTIC AND PREVENTION TRIALS

There is little argument: the definitive test of an agent's therapeutic efficacy is a randomized double-blinded clinical trial, with the agent compared to placebo or to the accepted standard of care (8). Even with respect to prevention, the superiority of experimentation is generally recognized. Nonetheless, therapeutic and prevention trials are subject to substantially different challenges. Clinical trials are expensive in dollar costs and time; this expense has blocked clinical trials from becoming a widely utilized component of prevention and epidemiologic research. Subjects must, after treatment, be monitored for the outcome of concern. Monitoring must be frequent; subjects must not be lost to follow-up, and their experience of any outcome must be documented. Side effects or complications of treatment or the experimental exposure must be accounted for; in many settings, responding to these side effects will require modification of the exposure protocol. These factors can add enormously to the expense of a prevention trial.

In general, it is more difficult to execute a prevention trial than a treatment trial. The difference stems largely from the fact that cancer patients are dealing with a life-threatening illness, and the conventional approach is to administer, for a short time, a frequently toxic agent that kills or disrupts both healthy and cancer cells: the agent's effectiveness stems from its ability to kill a greater proportion of cancer cells than of healthy cells. Toxicity is widely understood to accompany chemotherapy, and patients and clinicians in chemotherapeutic trials commonly expect it. In a prevention trial, on the other hand, the patients are well; they may be, because of family history or a biomarker such as a premalignant lesion, at elevated risk of subsequent cancer; nonetheless, they are not sick, and they are much less willing than a cancer patient to tolerate an agent that makes them feel sick. The design of the trial of finasteride for prevention of prostate
cancer included an allowance for substantial noncompliance due to modest sexual side effects (9). In colon cancer prevention trials of diet change and of dietary fiber supplements among adenomatous polyp patients, dropout or treatment noncompliance was substantial (5,10).

Second, even if subjects are at increased risk, their actual risk—the probability that they will develop cancer—is relatively low. Patients in a therapeutic trial will learn in relatively short order whether the intervention has been effective; the drug kills the tumor or lessens the tumor burden and the patient feels better, or it fails and the tumor remains or progresses. In a prevention trial, most patients, even those at elevated risk, will not experience the cancer the treatment is expected to prevent, and they will not know whether their chances have been improved until several years after the intervention begins. In the Polyp Prevention Trial, for example, some 2000 adenomatous polyp patients were randomized to diet change; by the end of the trial, only 40% of subjects had experienced an adenoma and fewer than 7% had experienced an advanced adenoma. Later, 15 years after the intervention began, fewer than 15—less than 1%—of the participants had developed colon cancer (11).

Third, as noted, in a therapeutic trial, treatment is of relatively short duration; in the trial of bevacizumab for treatment of metastatic colon cancer, median disease-free survival in the group that received bevacizumab along with a standard treatment cocktail was 10.6 months, compared with 6.2 months for those who received the standard. Median survival was 20.3 months for those whose treatment included bevacizumab, compared with 15.6 months for those who received the standard treatment. Clearly, toxicity experienced for 6 to 12 months is an issue most cancer patients would prefer to avoid, but it tends to be relatively short term (12). On the other hand, the Prostate Cancer Prevention Trial testing finasteride among average-risk men called for subject treatment by finasteride or placebo for up to seven years (9). Thus, in a treatment trial, the success of the intervention is known within a reasonably short period; there is a partial or complete response, or there is none. A prevention trial, however, requires some understanding of the extended period during which a change in exposure or treatment will change outcome. In the Polyp Prevention Trial, participants were enrolled after a colonoscopy identified adenomatous polyps; patients were then randomized to a diet change program or to receive a copy of the National Cancer Institute dietary guidelines. Patients received a follow-up colonoscopy with polyp ablation after one year; they were scheduled to receive a final follow-up colonoscopy after being monitored for an additional three years (5). This trial design was based on the understanding that diet change would change polyp recurrence within three years. One explanation of its failure to do so has been that it was of too short duration. The recent follow-up, nearly 15 years after the intervention began, confirmed that the intervention had not altered polyp recurrence risk (11).

Fourth, because healthy patients are reluctant to accept even minor toxicity, because most prevention trial participants will never know whether their participation decreased their risk of disease, and because treatment in a prevention trial can well last for several years, noncompliance can be substantial.
The design of the just mentioned Prostate Cancer Prevention Trial of finasteride included an assumption that compliance with the medication would be a good deal less than 100% (9,13). Prevention clinical trials are particularly burdensome in expense and time requirements. Several colon cancer prevention trials have been focused on a risk biomarker—adenomatous polyp recurrence—rather than on colon cancer. Adenomatous polyps, a necessary precursor to colon cancer, occur at a much higher rate than colon cancer does; in addition, the occurrence of adenomatous polyps in people who have already had one or more polyps diagnosed has been understood to be much higher than it
is among those who have never had them diagnosed. Although not all adenomatous polyps eventuate in cancer, virtually all colon cancers begin as adenomatous polyps. Thus, a trial of an agent's ability to prevent adenomatous polyp recurrence can be smaller and involve a shorter period of follow-up than a trial of the same agent's ability to prevent colon cancer.

Whether the advantages of clinical trials are great enough to legitimize the expense they engender has been widely argued. Some have argued that it is enough simply to rely on observation, taking advantage, through cohort and case-control studies, of the natural experiments that lead to variance in exposure (14). Competing hypotheses are tested by means of statistical control. Others have argued that many of the exposures that are of interest for prevention cannot be induced except by an experimental intervention (15,16). Control for confounding remains an enormous challenge. Unless exposure to confounders can be measured with great accuracy, under circumstances involving minimal measurement error, effective statistical control is impossible. It has been understood in epidemiology for over 50 years that random error in the measurement of an exposure tends to cause attenuation of the exposure's association with the study outcome (17). Errors in measurement of exposure to suspected confounders will lead to underestimation of their importance. Thus, if several exposures are equally predictive of an outcome but are measured with varying degrees of error, the strength of each exposure's association with the outcome will be inversely related to the degree of error in its measurement (18): the variables that are measured well will be strongly associated with risk, while those measured poorly will be more weakly associated with risk, or even uncorrelated with it. It has also been well documented (19) that statistical control is inhibited by error in the measurement of these exposures: the need for statistical control is obscured, and the effectiveness of this control is lessened (18).
A final distinction between observational and trial-based prevention research concerns change in exposure. What one generally learns from a cohort or case-control study is the impact of a given level of exposure; whether those who more frequently consume red meat, for example, may be at elevated risk of colon cancer (20). Researchers may seek to evaluate the impact of change in exposure, but it is extremely difficult to measure this change in an observational setting; there is little evidence that it can readily be gauged. In the face of an association of risk with exposure, it may be tempting to propose that the risk of those at a given level of exposure would tend to merge with the risk of those at another level, if those at the given level assumed that other level of exposure. As attractive as this proposition might be, it is based on little to no evidence and is generally without foundation. For example, those with elevated levels of red meat intake may experience elevated risks of colon cancer; whether those people would soon—or ever—decrease their risk by decreasing their red meat intake is not addressed by the data: how long it would take and how much change would be required to effect any substantial change are not known. Whether the risk of those consuming red meat at a given level can be changed after a given age is similarly not known. On the other hand, a trial imposes and tests a change. In a therapeutic trial, the patient is treated by an agent—a chemotherapeutic—to which he or she has not been previously exposed. In prevention trials, many of the participants are already exposed to the agent or intervention being tested; the object of the intervention is to substantially increase or decrease the exposure. In the trial of wheat bran fiber supplementation for adenomatous polyp patients, the intervention was an increase in fiber; virtually all of the participants were already consuming some dietary fiber, and all had developed their index adenoma while consuming this preintervention diet. On average, they more than doubled their fiber intake (5). In the trial of change to a plant-based diet among women who had
undergone definitive first-line treatment for breast cancer, all the participants were already consuming a diet that contained some plant-based foods. Many, in fact, consumed a diet that was intensely focused on plant-based foods (6).

BIOMARKERS IN CLINICAL TRIALS

Biomarkers in trials and interventions perform a range of functions. They reflect exposure to possible confounders, effect-modifying genetic factors, the extent to which treatment has reached the target tissue, and biologic response. As in all prospective studies, it is valuable to measure exposure to disease risk factors as well as possible. In a clinical trial, exposure to most risk factors will be, by study design, uncorrelated with the intervention: as there will be no correlation between the risk factor exposure and the intervention, there will, in all likelihood, be no confounding. Nonetheless, properly interpreted biomarkers of exposure to possible confounders offer the opportunity to substantially lessen the attenuation of associations that would result from more error-laden verbal reports.

A salient issue for the clinical trial is the impact of baseline risk factor exposure as a modifier of any intervention effect. For example, in a widely cited trial of selenium treatment of nonmelanoma skin cancer patients, the baseline selenium blood level was associated with substantial alteration of the impact of treatment (21). In the Women's Health Initiative dietary intervention study, the intervention was not successful in decreasing breast cancer risk; it did, however, decrease risk among women with baseline fat intake in the highest quartile (22). Genetically governed modification of intervention effects is a very real possibility that has only recently begun to attract widespread attention. Poor measurement of baseline exposure status could, however, well obscure the evidence that baseline status alters the impact of the intervention. This is especially important for prevention trials that evaluate lifestyle interventions; recent studies have considered nutritional supplements and diet change. The subjects in these studies have in many cases had some exposure to these supplements or to a given dietary pattern prior to their enrollment in the intervention. In evaluating supplementation, or dietary or lifestyle change, prior status will be critical to interpreting the findings. To the degree that biomarkers increase the precision of baseline exposure measurement, they will be valuable for evaluating how baseline status modifies the impact of intervention. Biomarkers of genetic status will likewise be critical to understanding baseline modification of the impact of intervention. An individual's baseline exposure status may modify the impact especially of preventive interventions: nutritional status with respect to vitamin or trace element repletion; baseline stores of toxic chemicals; any status likely to result from a constellation of exposures and genetic predispositions. Biomarkers of such facets of baseline vulnerability play, and will continue to play, critical roles.

A critical issue in any biologic experiment is that the intervention be documented to have affected the target organ or organ system. In drug treatment trials, this will have been established prior to the intervention. It is not always as clear in prevention trials.
Thus, in an important study of antioxidant administration to prevent the recurrence of adenomatous polyps, Greenberg (23) presented data documenting that the blood antioxidant levels of subjects were substantially increased by the intervention. In the Women’s Healthy Eating and Living study, Pierce and colleagues documented that dietary biomarkers of a plant-based diet—blood carotenoids—were increased by the intervention (6). In some instances, what reaches the target organ may be a metabolite of the agent administered; in prostate cancer prevention trials presently underway at Roswell Park,
selenomethionine is administered, but androgen receptor activity, understood to be affected in prostatic tissue by a selenomethionine metabolite, is the key effect biomarker (24). As such, a biomarker may be a more precise and immediate indicator of agent activity than either tumor incidence or growth.

In prevention, biomarkers are used both as risk indicators and as interventional targets. By identifying populations at higher risk, they enable intervention to focus on individuals who can be expected to experience the outcome of interest. In addition, risk biomarkers are observed more frequently than the endpoint of cancer; the use of a biomarker endpoint may therefore enable a study to be conducted and completed in far less time than would be necessary if cancer were the study endpoint. The adenomatous polyp is considered, as a premalignant lesion, to denote populations at elevated colon cancer risk. Many adenomas will not ever progress to colon cancer; nonetheless, colon cancers virtually always emerge from an adenoma. Thus, the adenoma is a necessary, but not sufficient, premalignant lesion for colon cancer (25). The focus of several large interventional trials has been the recurrence of adenomatous polyps among individuals who, having had adenomas identified and ablated, are at elevated risk, not just of having new ones detected, but also of colon cancer (5,10,26). In each instance, the adenomatous polyp is used as a biomarker to denote a population at risk; the recurrence of adenomatous polyps is then used as a biomarker of interventional efficacy.

Breast cancer biomarkers have been considered in several intervention trials. These biomarkers include mammographic density or pattern, and indicators of cellular proliferation and apoptosis. Atypical ductal hyperplasia, recognized as predictive of substantially increased breast cancer risk, has been used as a risk biomarker (27,28). High-grade prostatic intraepithelial neoplasia (HGPIN), in the opinion of many a premalignant lesion, has been used to identify individuals at increased risk of prostate cancer (29). The rationale for linking HGPIN to prostate cancer stems from the facts that prostate cancer patients often have extensive fields of HGPIN, that HGPIN occasionally appears to have cancer emanating from it, and that populations at elevated risk of prostate cancer have elevated HGPIN prevalence. Nonetheless, the predictive value of HGPIN has become the subject of increasing debate. Early studies (30–34) that found foci of HGPIN without cancer often, on rebiopsy, found cancer; whether this was because HGPIN leads to cancer or because these early studies were limited by inadequate prostate sampling is not clear. On the other hand, some recent studies based on very complete prostatic sampling have found HGPIN to be highly predictive of subsequent prostate cancer (35). Three studies to date have focused chemoprevention interventions on men with HGPIN: Steiner et al., in a study of several doses of an estrogen modulator, toremifene (36); Alberts et al., in a study of an antiandrogen, flutamide (37); and Marshall et al., in a study of selenomethionine (29).

PHARMACOGENETICS

This chapter has already noted that individual variability in the myriad processes that affect carcinogenesis is, in all likelihood, profound in its impact on the effects of chemotherapeutic or chemopreventive agents. A part of the experience of any living organism is exposure to a range of chemicals and compounds; irritants and toxins are particularly important.
A range of metabolic systems have evolved to detoxify these compounds and to protect the organism from them. These systems can be crudely characterized as phase I and phase II systems. The phase I systems transform possibly harmful substances for excretion or for additional modification by a second set of systems. The products of these
phase I systems can be more toxic than the initial agents they modify. At that point, phase II systems can proceed to further degradation or to excretion. Genetically governed variance in the speed and the efficiency with which the organism deals with these compounds can be substantial.

Oxidative stress, recognized as a likely source of genetic damage, stems from cellular exposure to both endogenously generated and exogenous compounds. Genes damaged by oxidative stress may function aberrantly. Living organisms are equipped with extensive and highly redundant systems to regulate oxidative stress; individual variability in genes that govern these systems is highly possible. Thus, variability in systems that protect against oxidative stress could alter the efficiency of the body in coping with an excess of oxidative stress (38). In an important recent paper, Ahn et al. (39) used a functional assay to show that intake of fruits and vegetables, likely sources of antioxidants, interacted with allelic variation of a gene, catalase, that regulates intracellular oxidative stress.

Detoxification and protection against oxidative stress are likely governed by pharmacogenetic processes directed toward limiting cellular damage. After cells are damaged, however, a series of steps involved in carcinogenesis governs the formation and advance of neoplasia to invasion. According to Hanahan and Weinberg (40), at least six distinct processes play roles in carcinogenesis: enhanced replicative potential, angiogenesis, apoptosis evasion, growth signal self-sufficiency, insensitivity to antigrowth signaling, and tissue invasion and metastasis. All of these processes represent aberrations of normal, genetically governed cellular processes; potentially, there is great variability in the genes that normally govern each of them. This naturally occurring variability could affect the degree to which neoplastic growth advances and becomes invasive.

Clearly, the clinical trial, in which an intervention of great interest is administered and evaluated such that confounding is extremely unlikely, provides a distinctly advantageous setting for evaluating pharmacogenetics for both therapeutic and preventive intervention. Gene systems could readily alter one another's effects: thus, to take the example from phase I and phase II enzyme systems, one phase I enzyme could lead subjects to very efficiently metabolize a foreign substance to an intermediate stage; that intermediate could be highly reactive and toxic to the cell. A phase II enzyme could then prepare it for excretion. An active form of the phase I enzyme coupled with an inactive form of the phase II enzyme could lead to excessive cellular damage, and to increased risk of carcinogenesis. If both enzymes were active, or both inactive, or if the phase I enzyme were inactive and the phase II enzyme active, exposure might be less likely to have an effect. Clearly, as has been mentioned, it is possible that environmental exposures could interact with critical gene systems to govern the extent to which the organism is subject to environmentally induced damage.

Two critical statistical issues are raised by pharmacogenetics. The first is the significance of interactions in the face of null overall genetic or environmental effects. The second concerns statistical hypothesis testing and exploration.
A profound implication of the specter of substantial gene-environment interactions is that a common means of sifting data—evaluating bivariate associations between each exposure and each variant gene under study—is not adequate. A polymorphic variant of the same gene could reverse its effect. Not taking the polymorphic variants of this gene into account, the investigator would understand treatment to have no effect; he or she would only see its effects by categorizing subjects by their status on the gene. Similarly, the effects of the variant forms of the gene would not be seen unless the subjects were categorized by their treatment or nontreatment.
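A simulated crossover interaction makes the point concrete: the marginal treatment effect below is essentially null, yet modeling the gene-treatment interaction recovers strong, opposite effects in carriers and noncarriers. The data and effect sizes are hypothetical; the models use the statsmodels formula interface.

```python
# A minimal sketch of why bivariate screening can miss a crossover
# gene-treatment interaction: the marginal effect is near zero while
# the stratified effects are large. Everything here is simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 4000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "variant": rng.integers(0, 2, n),
})
# Treatment helps carriers and harms noncarriers equally: a pure
# crossover interaction with no marginal treatment effect.
logit = 1.2 * df["treated"] * (2 * df["variant"] - 1)
df["response"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

marginal = smf.logit("response ~ treated", data=df).fit(disp=False)
joint = smf.logit("response ~ treated * variant", data=df).fit(disp=False)
print(marginal.params["treated"])        # near zero: the effect is hidden
print(joint.params["treated:variant"])   # large: the effect is recovered
```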
The second implication issues from the sheer numbers of genetic variants that can be examined. It is well known that statistical significance testing refers to the probability that an association of a given magnitude or larger would be seen, given the absence of a real association. A common approach is to denote as statistically significant a result that would have only a 5% probability of being observed, given the truth of the null hypothesis. In other words, a test of a null variable has about a 95% probability of not indicating that the null variable has an effect. If two genes are evaluated, the possible outcomes are that the first gene is statistically significant and the other is not, that the first gene is not statistically significant and the other one is, that neither is statistically significant, or that both are. The probability that both are not statistically significant, even with neither representing a genuine association, is smaller than the probability that, if only one gene is tested, it is statistically nonsignificant. If three null genes are tested, the probability that none of the three appears statistically significant is smaller still; if 10 null genes are tested, the probability that none appears statistically significant is smaller yet. In general, the probability that none in a series of independent null genes will be found to be statistically significant declines geometrically with the number of genes tested: it is 0.95 raised to the power of the number of tests. If the investigator is testing a large number of null genes, say 30, the probability that none of the 30 will be found to be statistically significant is only about 20%; in other words, the probability that one or more of these null genes will be found to be statistically significant is approximately 80%. If several hundred null genes are tested, the probability that none of them will be found to be statistically significant is essentially zero. Several authors have suggested adjustments or corrections for the testing of multiple hypotheses; essentially, these require increasing the strength of the association required for it to be recognized as statistically significant.

In a study in which a number of exposures and a number of genes are equally of interest, the problem of multiple hypothesis testing reaches astronomical proportions; the number of combinations is equal to the product of the number of gene constellations and the number of exposures. In a study in which 35 exposures and 30 gene patterns are of interest, the number of two-way interactions alone is 1050; the number of gene-exposure interactions studied could readily be several times the number of subjects in the study. In a clinical trial, this problem is lessened because only one intervention is focal. The number of genetic factors interacting with treatment may well be large, but the number of interactions is restricted by there being only one treatment. Clearly, of course, using baseline exposures as predictors of risk increases the number of genetic interactions that can be considered. At present, a number of options have been proposed for handling the vast amounts of data generated by pharmacogenetic analysis. Adjusting for multiple hypothesis testing offers one approach; another is to simply regard statistical significance tests as convenient fictions, using them more as means of sifting and comparing the data on associations than as strict hypothesis testing exercises.
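The arithmetic behind these statements, and the Bonferroni-style correction they motivate, can be written out in a few lines; the numbers of tests below are illustrative.

```python
# Worked numbers for the multiple-testing argument: with each test run at
# alpha = 0.05, the chance of at least one false positive grows quickly.
alpha = 0.05

for n_tests in (1, 2, 3, 10, 30, 300):
    p_any = 1 - (1 - alpha) ** n_tests
    print(f"{n_tests:4d} independent null tests: "
          f"P(>=1 'significant') = {p_any:.3f}")
# 30 tests give roughly 0.785, matching the approximately 80% quoted above.

# Bonferroni holds the family-wise error rate near alpha by demanding a
# stronger association: each test is run at alpha / n_tests.
print("Bonferroni per-test threshold for 30 tests:", alpha / 30)
```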
SUMMARY

This chapter has considered the clinical trial as a resource for molecular epidemiology. Although the clinical trial is in general more expensive than the purely observational study, it is in many ways superior: it provides for standardization of treatment, for eliminating bias in treatment assignment, and for control of confounding. There is no question but that the clinical trial is the criterion standard of therapeutic research. While the clinical trial holds great potential to strengthen inference for preventive options, it is an order of magnitude more difficult than the therapeutic trial to execute.
Biomarkers in observational epidemiology are of particular value for control of confounding; their value in the clinical trial, especially for molecular epidemiology, resides primarily in their ability to predict differences in response to treatment. Biomarkers linked to pharmacogenomics may prove critical to progress in molecular epidemiology, but their use raises two critical issues. First, discovery of gene-environment interactions may not follow from the standard epidemiologic approach of beginning with the evaluation of first-order associations of exposure and risk. Second, the number of interactions to be potentially evaluated is so large as to render meaningless the common use of statistical significance criteria.

REFERENCES

1. Stewart CP, Guthrie D, eds. Lind's Treatise on Scurvy. Edinburgh, Great Britain: Edinburgh University Press, 1953.
2. Lilienfeld AM, ed. Aspects of the History of Epidemiology: Times, Places, and Persons. Baltimore, MD: The Johns Hopkins University Press, 1980.
3. Stigler SM. The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge, MA and London, England: The Belknap Press of Harvard University Press, 1986.
4. Gail MH. Statistics in action. J Am Stat Assoc 1996; 91(433):1–13.
5. Schatzkin A, Lanza E, Corle D, et al. Lack of effect of a low-fat, high-fiber diet on the recurrence of colorectal adenomas. The Polyp Prevention Trial Study Group. N Engl J Med 2000; 342(16):1149–1155.
6. Pierce JP, Natarajan L, Caan BJ, et al. Influence of a diet very high in vegetables, fruit, and fiber and low in fat on prognosis following treatment for breast cancer: the Women's Healthy Eating and Living (WHEL) randomized trial. JAMA 2007; 298(3):289–298.
7. Hill AB. The environment and disease: association or causation? Proc R Soc Med 1965; 58:295–300.
8. Bailar JC, Mosteller F, eds. Medical Uses of Statistics. 2nd ed. Boston, MA: NEJM Books, 1992.
9. Feigl P, Blumenstein B, Thompson I, et al. Design of the Prostate Cancer Prevention Trial (PCPT). Control Clin Trials 1995; 16:150–163.
10. Alberts DS, Martinez ME, Roe DJ, et al. Lack of effect of a high-fiber cereal supplement on the recurrence of colorectal adenomas. N Engl J Med 2000; 342(16):1156–1162.
11. Schatzkin A, Mouw T, Park Y, et al. Dietary fiber and whole-grain consumption in relation to colorectal cancer in the NIH-AARP Diet and Health Study. Am J Clin Nutr 2007; 85(5):1353–1360.
12. Hurwitz H, Fehrenbacher L, Novotny W, et al. Bevacizumab plus irinotecan, fluorouracil, and leucovorin for metastatic colorectal cancer. N Engl J Med 2004; 350(23):2335–2342.
13. Thompson IM, Goodman PJ, Tangen CM, et al. The influence of finasteride on the development of prostate cancer. N Engl J Med 2003; 349(3):215–224.
14. Willett WC, Stampfer MJ. Dietary fat and cancer: another view. Cancer Causes Control 1990; 1:103–109.
15. Prentice RL, Sheppard L. Dietary fat and cancer: consistency of the epidemiologic data, and disease prevention that may follow from a practical reduction in fat consumption. Cancer Causes Control 1990; 1(1):81–97.
16. Prentice RL, Kakar F, Hursting S, et al. Aspects of the rationale for the Women's Health Trial. J Natl Cancer Inst 1988; 80(11):802–814.
17. Bross I. Misclassification in 2 × 2 tables. Biometrics 1954; 10:478–486.
18. Marshall JR, Hastrup JL. Mismeasurement and the resonance of strong confounders: uncorrelated errors. Am J Epidemiol 1996; 143(10):1069–1078.
19. Greenland S, Robins JM. Confounding and misclassification. Am J Epidemiol 1985; 122(3):495–506.
20. Willett WC, Stampfer MJ, Colditz GA, et al. Relation of meat, fat, and fiber intake to the risk of colon cancer in a prospective study among women. N Engl J Med 1990; 323(24):1664–1672.
21. Duffield-Lillico AJ, Dalkin BL, Reid ME, et al. Selenium supplementation, baseline plasma selenium status and incidence of prostate cancer: an analysis of the complete treatment period of the Nutritional Prevention of Cancer Trial. BJU Int 2003; 91:608–612.
22. Prentice RL, Caan B, Chlebowski RT, et al. Low-fat dietary pattern and risk of invasive breast cancer: the Women's Health Initiative randomized controlled dietary modification trial. JAMA 2006; 295(6):629–642.
23. Greenberg ER, Baron JA, Tosteson TD, et al. A clinical trial of antioxidant vitamins to prevent colorectal adenoma. N Engl J Med 1994; 331(3):141–147.
24. Dong Y, Zhang H, Gao AC, et al. Androgen receptor signaling intensity is a key factor in determining the sensitivity of prostate cancer cells to selenium inhibition of growth and cancer-specific biomarkers. Mol Cancer Ther 2005; 4(7):1047–1055.
25. Lance P, Grossman S, Marshall JR. Screening for colorectal cancer. Semin Gastrointest Dis 1992; 3:22–33.
26. Sandler RS, Halabi S, Baron JA, et al. A randomized trial of aspirin to prevent colorectal adenomas in patients with previous colorectal cancer. N Engl J Med 2003; 348(10):883–890.
27. Fabian CJ, Kimler BF, Zalles CM, et al. Short-term breast cancer prediction by random periareolar fine-needle aspiration cytology and the Gail risk model. J Natl Cancer Inst 2000; 92(15):1217–1227.
28. O'Shaughnessy JA, Kelloff GJ, Gordon GB, et al. Treatment and prevention of intraepithelial neoplasia: an important target for accelerated new agent development. Clin Cancer Res 2002; 8:314–346.
29. Marshall JR, Sakr W, Wood D, et al. Design and progress of a trial of selenium to prevent prostate cancer among men with high grade prostatic intraepithelial neoplasia. Cancer Epidemiol Biomarkers Prev 2006; 15(8):1479–1484.
30. Bostwick DG, Qian J. High-grade prostatic intraepithelial neoplasia. Mod Pathol 2004; 17:360–379.
31. Brawer MK. Prostatic intraepithelial neoplasia: a premalignant lesion. Hum Pathol 1992; 23(3):242–248.
32. Brawer MK, Bigler SA, Sohlberg OE, et al. Significance of prostatic intraepithelial neoplasia on prostate needle biopsy. Urology 1991; 38(2):103–107.
33. Bostwick DG, Qian J, Civantos F, et al. Does finasteride alter the pathology of the prostate and cancer grading? Clin Prostate Cancer 2004; 2(4):228–235.
34. Davidson D, Bostwick DG, Qian J, et al. Prostatic intraepithelial neoplasia is a risk factor for adenocarcinoma: predictive accuracy in needle biopsies. J Urol 1995; 154:1295–1299.
35. Steiner M, Boger R, Barnette G, et al. Prospective study confirms men with high grade prostatic intraepithelial neoplasia (PIN) are at high risk for prostate cancer. Cancer Epidemiol Biomarkers Prev 2004; 13(11).
36. Steiner MS, Boger R, Barnette G, et al. Evaluation of toremifene in reducing prostate cancer incidence in high risk men. Cancer Epidemiol Biomarkers Prev 2004; 13(11).
37. Alberts SR, Novotny PJ, Sloan JA, et al. Flutamide in men with prostatic intraepithelial neoplasia: a randomized, placebo-controlled chemoprevention trial. Am J Ther 2006; 13(4):291–297.
38. Institute of Medicine (IOM). Dietary Reference Intakes for Vitamin C, Vitamin E, Selenium, and Carotenoids. Washington, DC: National Academy Press, 2000.
39. Ahn J, Gammon MD, Santella RM, et al. Associations between breast cancer risk and the catalase genotype, fruit and vegetable consumption, and supplement use. Am J Epidemiol 2005; 162(10):943–952.
40. Hanahan D, Weinberg RA. The hallmarks of cancer. Cell 2000; 100:57–70.
4

Molecular Epidemiological Designs for Prognosis

Cornelia M. Ulrich
Cancer Prevention Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, U.S.A.

Christine B. Ambrosone
Department of Cancer Prevention and Control, Roswell Park Cancer Institute, Buffalo, New York, U.S.A.
INTRODUCTION—WHY STUDIES OF CANCER PROGNOSIS AND OUTCOMES?

As a result of improved strategies for cancer detection at earlier stages, as well as improved treatment modalities, the number of cancer survivors in the population continues to increase. As of January 2004, it was estimated that there were 10.8 million cancer survivors in the United States, almost triple the number in 1971 (1). Approximately 14% of the 10.8 million estimated survivors were diagnosed more than 20 years ago, and cancer survivors now represent approximately 3.7% of the population (2). As the number of survivors increases, research on cancer survivorship has been identified as an area of major importance by the Institute of Medicine (1), the President's Cancer Panel (3), and the National Cancer Institute (4).

For this large population of cancer survivors, there are many unanswered questions. Many cancer patients want to know what they can do to reduce symptoms during treatment, how they can protect themselves against recurring or secondary tumors, and how they can return to an active, healthy life (5). It is at this time after a cancer diagnosis that individuals are often most motivated to change their diet, exercise habits, and other health behaviors, and vast improvements in public health could be made among cancer survivors (6–12). Unfortunately, however, as stated by the American Institute for Cancer Research, "the painstaking process that yields science-based recommendations on diet and exercise for cancer survivors has not yet reached its conclusion" (5), and there are few guidelines or recommendations for cancer patients. Because many cancer patients are in search of factors that may improve their health and reduce risk of recurrence, there is widespread use of nutritional supplements and complementary and alternative medicines (13). However, there is a paucity of data from
rigorously conducted studies that would support many behavior changes that may be adopted by cancer patients. Although it would be relatively harmless to find that many lifestyle factors have no beneficial effects for cancer patients, there is the possibility that some factors could actually increase risk of adverse outcomes. For example, there are data indicating that antioxidant supplements may interfere with radiation therapy and many chemotherapeutic agents (14,15), that folate supplements may accelerate growth of malignant tumors (16,17), and that some herbal supplements, such as St. John's wort, may directly affect the pharmacokinetics of cancer chemotherapeutic drugs (18). Thus, there is an urgent need for results from well-conducted studies to address and answer some of these questions.

Identifying predictors of cancer prognosis has, to date, been largely understudied by molecular epidemiologists, but it is becoming a more prominent research priority (19). In the absence of solid scientific data, it is frequently assumed that factors that reduce the risk of cancer must also have a positive influence on cancer survival. However, this assumption may be inaccurate, as noted above with regard to supplement use. Preliminary data also suggest that weight gain or weight reduction may have different effects on cancer prognosis than on etiology. Reasons for this dual response may be related to the transformation state of cells (e.g., a differential biological mechanism depending on the state of the cell, as detailed below for folate) or to the effects of health behaviors on quality of life and related immunological defense mechanisms. This suggests that epidemiologists should evaluate carefully the potential effects of exposures on prognosis, independent of their associations with etiology.

There is a wide range of research topics that can be addressed in molecular epidemiological studies of cancer prognosis, and such studies will be most fruitful if addressed with an interdisciplinary approach that includes strong biological knowledge. Comprehensive studies of cancer prognosis need to consider the role of molecular characteristics of the tumor in relation to treatment response, as well as the role of inherited genetic variability (polymorphisms) in drug metabolism pathways and in the response to treatment-related DNA damage, as determinants of therapeutic toxicity and response to treatment (pharmacogenetics) (20). As an additional benefit, studies of prognosis may inform studies of cancer etiology. Genetic polymorphisms are more likely to show measurable effects if a system is under stress, e.g., during chemo- or radiation therapy, which puts immense pressure on a cell's DNA repair capacity and metabolism; identifying the most relevant genetic factors in this scenario can then inform epidemiological studies of gene-environment interaction where the environmental stressor may be comparatively weaker (e.g., moderate smoking).

CANCER PROGNOSIS—A MULTIFACTORIAL OUTCOME

A large range of factors can contribute to cancer patients' prognosis. Not surprisingly, studies of cancer treatment outcomes and prognosis are undertaken by a multitude of researchers who are interested in different research questions and outcomes. The most important outcomes include overall survival, disease-free survival, and, more recently, quality of life. In clinical trials investigating new treatment regimens, treatment-related toxicity is an important secondary outcome that is clearly associated with symptoms and quality of life.
In addition, investigations may focus on surgical outcomes, comorbidities, a cancer patient's ability and success in returning to work, economic outcomes, and more. Figure 1 illustrates the many influences and the complex interplay of factors that affect cancer prognosis. While this list is by no means comprehensive, it demonstrates the
necessity for an interdisciplinary approach that also takes account of several interrelationships between prognostic factors.

Figure 1 A multitude of factors can influence cancer prognosis. Many of these factors are also interrelated or modify associations between other factors and cancer outcomes.

Factors that are known to affect prognosis include tumor characteristics (e.g., stage and biological characteristics, such as microsatellite instability in colorectal cancers) (21), treatment modalities, surgical technique (which is of great prognostic significance for cancers that are difficult to resect, e.g., pancreatic cancer or rectal cancer), access to care, race/ethnicity, lifestyle factors (including smoking, body mass index, and physical activity), psychosocial factors, nutritional status, and inherited genetic characteristics (polymorphisms). Several interrelations emerge with respect to race/ethnicity. Racial factors are indisputably linked to access to care and thus to the quality of the surgeon. In addition, racial factors are correlated with genetic factors and tumor biology; for example, African-American women are more likely to be diagnosed with breast tumors that are high grade and negative for estrogen and progesterone receptors, characteristics associated with a poorer prognosis (22). Cancer prognosis itself, or how well a patient responds to an initial treatment regimen, will conversely affect the choice of future treatment modalities as well as quality of life, nutritional status, physical activity and BMI (summarized here as energy balance), and other lifestyle factors. A number of factors can modify the efficacy and toxicity of chemotherapeutic agents, such as genetic polymorphisms (20), gene expression and other tumor characteristics (23), and the nutritional status of the patient. Finally, it is expected that genetic polymorphisms will modify associations among lifestyle factors, including nutrition or energy balance, and cancer prognosis, similar to gene-environment interactions that have been observed in studies of cancer etiology (24–26).

Traditionally, researchers have studied cancer prognostic factors in isolation. For example, they may have focused exclusively on tumor characteristics, such as gene expression. Alternatively, they may have studied primarily genetic polymorphisms without taking into account relevant epidemiological factors. Or, finally, they may have focused entirely on lifestyle factors without consideration of tumor characteristics or other biomarkers. While all of these approaches are valid and have yielded useful insights into the effects of specific factors on cancer prognosis, important gaps remain. A more comprehensive, integrated approach to studying cancer prognosis seems essential for understanding cancer outcomes, as illustrated below for folate status and prognosis. In addition, the interconnectedness of several components in this circle of cancer prognosis, as discussed above, can render an isolated approach incomplete and
Researching the "larger picture" of cancer prognosis creates challenges and new opportunities for molecular epidemiologists.

Folate—an Example of Integrated Prognostic Research

The paradigm of integrated prognostic studies is illustrated here with the example of folate status and prognosis after colorectal cancer (Fig. 2). Within this thematic area, again, there are multiple interrelated components that may affect outcomes and may modify 5-fluorouracil (5-FU)-based treatment outcomes. 5-FU directly targets a key enzyme in folate metabolism, thymidylate synthase (TS), which converts uridylate to thymidylate. This inhibition results in a deficiency of thymidylate for DNA synthesis, which the cell compensates for by misincorporating uracil. Misincorporation of uracil into DNA causes repeated repair cycles with a greatly increased likelihood of single- and double-strand breaks. Thus, 5-FU functions as an antimetabolite. The use of folic acid (FA)-containing supplements before and after cancer diagnosis is thought to affect survival, possibly in opposite directions: while higher folate status is associated with reduced cancer risk, presumably because of reduced mutation rates, there is increasing concern that the administration of folate once neoplastic or early neoplastic lesions are present can "feed the tumor," i.e., foster growth of these lesions via a greater provision of nucleotides for DNA synthesis (16,17,27). Such a growth-enhancing effect is consistent with the upregulation of folate receptors and folate-related enzymes in many cancer types; most likely this upregulation reflects a greater need for folate for DNA synthesis to support rapidly growing tumors (28–30). Accordingly, folate-related tumor characteristics, such as the gene and protein expression of TS and other enzymes in folate metabolism (31), are known to affect survival and to modify 5-FU-based treatment response. There is now evidence that gene expression in both the cancerous and the normal parts of the tissue plays a role in the tumor's folate status and in cancer outcomes (32,33). It is not yet clear whether FA-containing supplement use prior to diagnosis affects these tumor characteristics, yet this is an important clinical question. Finally, inherited genetic polymorphisms in the folate pathway, such as 5,10-methylenetetrahydrofolate reductase (MTHFR) and functional TS variants, can modify 5-FU toxicity and efficacy (20,34,35).
Figure 2 The paradigm of integrated prognostic studies illustrated for folate status and colorectal cancer survival. Abbreviations: FA, folic acid-containing supplements (see text for explanation); 5-FU, 5-fluorouracil.
Inherited genetic polymorphisms are, by default, also reflected in the tumor, unless loss of heterozygosity (i.e., loss of one allele in the cell due to chromosomal instability) has altered the tumor's genotype at a polymorphic locus.

STUDY DESIGNS FOR PROGNOSTIC STUDIES

Multiple study designs can be employed to investigate questions on cancer outcomes and prognosis. Table 1 summarizes and contrasts the general advantages and limitations of the various approaches. These study designs should be considered complementary and chosen depending on the current scientific knowledge in a specific research area. For example, if it is not yet clear whether a factor, such as nutritional supplement use, may benefit or potentially harm patients, then observational studies constitute the first step in investigating the research question. Once observational evidence has accumulated and consistently supports a benefit, intervention studies or randomized clinical trials (for secondary cancer prevention) are needed to confirm such an association with certainty.

Population-Based Cohort Studies

Molecular epidemiological studies of cancer prognosis may be based on follow-up of cases who participated in a case-control study of cancer risk or who were identified in the context of a cohort study of cancer etiology. A cohort may also be established specifically as an observational study of cancer outcomes to investigate factors contributing to cancer prognosis. Each of these approaches has strengths and weaknesses.

Follow-Up from Studies of Cancer Etiology: Case-Control and Cohort Studies

The conversion of a case-control study of cancer risk to one of cancer prognosis can be efficient and can benefit from all of the effort already put into ascertaining and enrolling cases, procuring blood specimens, and collecting extensive epidemiological data. Such a study may be conducted at a number of levels: from simply following up cases to ascertain recurrence and survival, evaluated in relation to characteristics prior to and at diagnosis; to recontacting cases to assess behaviors after diagnosis and treatment; to an in-depth study involving medical record review to determine disease characteristics as well as treatments received, in addition to the collection of postdiagnostic epidemiological data. Each of these approaches entails determination of disease outcomes among the cases, which can be ascertained in several ways. In general, follow-up is conducted by recontacting those who participated in the study to determine recurrence status; this approach requires foresight in the design of the original study, with permission in the study consent to recontact participants in the future and permission to obtain their future medical records. Deaths of cases can also be ascertained through checks on vital status through state vital records and the National Death Index (NDI).

Follow-up of cases from an etiological study that was not initially planned for conversion to a study of prognosis has a number of inherent weaknesses. Unless patients are recontacted at predetermined intervals to capture them at the same timepoints postdiagnosis, the questions that can be addressed are limited to behaviors prior to diagnosis in relation to treatment outcomes.
Because there are gaps in understanding of the potential lifestyle changes that patients can make to enhance their survival, the lack of data on the effects of postdiagnostic factors, such as diet, physical activity, weight gain, and supplement use, makes these types of studies less informative than those that obtain information from cases postdiagnosis.
Table 1  Comparisons of Epidemiological Study Designs for Prognostic and Pharmacogenetic Studies Among Cancer Patients

Population-based cohort study
• General population
• "Real-world" treatment
• Heterogeneous treatment regimens
• Variable outcome assessment
• Excellent assessment of health behaviors
• Single- or multicenter
• Logistic challenges of the multiple hospitals participating in the prospective study; HIPAA regulations may differ
• Prospective
• Cannot establish causality

Ancillary studies to clinical trials
• Selected population
• Selected treatment (often higher quality)
• Uniform treatment regimens
• Excellent outcome assessment
• Limited assessment of health behaviors
• Multicenter
• Logistic challenges of many sites in a cooperative group setting; advantage: already established collaborative setting
• Prospective or retrospective (for genetic testing)
• Cannot establish causality

Intervention studies or randomized controlled trials (secondary prevention)
• Selected population
• Treatment of choice
• Potentially heterogeneous treatment regimens
• Uniform exposure (treatment) to the prognostic variable being tested
• Excellent outcome assessment
• Intervention or randomization on health behaviors
• Single- or multicenter
• Often small sample size
• Prospective
• Can establish causality (randomized controlled trials)
Of more use are etiological studies that are designed to conduct follow-up for recurrence and survival outcomes. With prognosis studies in mind, the cases can be consented at baseline for permission to recontact them, to obtain medical record information, and to retrieve their tissue blocks. It may be important to collect these data and biospecimens soon after enrollment into the study, because some Institutional Review Boards will not honor such consents for data and sample retrieval after a specified time period has elapsed. Additional queries regarding potential predictors of cancer prognosis, such as lifestyle factors, psychosocial factors, and complementary and alternative medicines, should also optimally be planned for implementation at specified timepoints after diagnosis. Of course, the obvious limitation on collection of these data in the context of an etiological study is funding. Although some aspects, such as permission to recontact, to review medical records, and to retrieve tissues, can easily be incorporated into an etiological study, the labor-intensive aspects of follow-up, recontact, and chart review can seldom be conducted within the context of a funded study of cancer risk. An additional complication in follow-up of cases ascertained in the context of a cohort study is the variable time from initial data collection among the healthy participants to cancer diagnosis, and then the variable times between diagnosis and follow-up assessments, unless there are resources to contact each case at the same specific timepoints postdiagnosis.

Prospective Observational Studies of Cancer Prognosis

Many of the limitations of conducting follow-up of cases in etiological studies can be overcome by the design of a prospective cohort study of cancer prognosis (36). In such a study, patients newly diagnosed with the incident primary cancer of interest are ascertained and invited to participate. Ideally, cases are enrolled and interviewed, and a blood specimen is obtained, prior to cancer therapy. At enrollment, data can be collected on standard epidemiological factors prior to diagnosis and also on behaviors/characteristics at the time of diagnosis. Because the effects of some factors, such as folate or antioxidants, on treatment outcomes may depend most on their use during cancer therapy, the optimal study will collect data at baseline and throughout cancer therapy, as well as at predetermined intervals throughout the follow-up period. In the context of a study specifically designed to evaluate the effects of behaviors and other factors during the postdiagnostic period, data can be rigorously collected that will likely provide important information for recommendations to cancer survivors to improve their prognosis. In an ongoing prospective observational study, data can also be collected on quality of life, psychosocial factors, and other variables that are not usually ascertained in epidemiological studies. With a prospective design, it will also be easier to collect extensive information on treatments received, including surgical procedures, chemotherapies, radiation therapies, and, for hormonally related cancers, hormonal therapies.
This type of study has the power to evaluate the effects of lifestyle factors on treatment outcomes, as well as gene-environment interactions, and also to examine the effects of epidemiological factors on outcomes in relation to specific cancer subtypes, determined through molecular characterization of the tumor. However, one limitation of this type of study is the heterogeneity of treatments received, which may be overcome through the implementation of a prospective follow-up study in the context of an ongoing clinical trial.
Studies Ancillary to Clinical Treatment Trials

There are many advantages to conducting prognostic follow-up studies in the context of a therapeutic clinical trial. In most studies, patients on the trial have more homogeneous disease characteristics, with eligibility criteria usually limited to subsets of stage, grade, and nodal status. Because of the nature of the randomized clinical trial, initial chemotherapy regimens are consistent across each of the arms of the study, with all patients within an arm receiving the same drugs and dosages. Furthermore, endpoints are extremely reliable, with outcomes rigorously monitored for recurrence, progression, and survival, as well as toxicities experienced, usually using the NCI Common Toxicity Criteria or a similar standardized scale. All of these strengths reduce the number of sources of misclassification and minimize some sources of bias. As such, these studies may be quite advantageous for investigating the effects of pharmacogenetics on treatment outcomes, using DNA extracted from archived normal tissue, or for examining the role of tumor characteristics in cancer treatment outcomes. It is becoming increasingly common for Cooperative Groups to collect and bank blood specimens in the context of clinical trials. These samples will provide an excellent source of DNA for pharmacogenetic studies, but the utility of serum may be somewhat limited because of the logistics associated with shipping blood samples from around the country. As pointed out in the chapter by Hankinson and Santella, variability in time to processing and differences in sample handling and shipping could introduce systematic bias into subsequent studies.

One major limitation of conducting molecular epidemiological studies of cancer prognosis in the context of clinical trials is the lack of epidemiological and behavioral data on patients during and following treatment. However, this setting is ideal for the incorporation of questionnaires to assess diet, physical activity, supplement use, and other factors that may impact outcomes both during and following treatment. Recently, this has been initiated for specific studies in Cancer and Leukemia Group B (CALGB), resulting in findings of relationships between dietary patterns and colon cancer outcomes (37), and similar studies are underway in the Southwest Oncology Group. With comprehensive assessment of epidemiological factors during and following therapy, and with banked blood specimens as well as tissues that can be accessed, such studies can provide excellent data on predictors of cancer outcomes. A second major limitation is that the clinical trial setting does not reflect cancer care in "real life." Only 3% of adult U.S. cancer patients currently participate in clinical trials, which are usually conducted at academic institutions with greater expertise in cancer care and usually much better treatment facilities. Thus, studies in the context of clinical trials play an important role, yet they need to be complemented by research in more community-based settings.

Intervention Studies or Randomized Controlled Trials for Secondary Prevention

These studies are uniquely suited to test hypotheses about a prognostic factor. In particular, randomized controlled trials are considered the gold standard and the only study design that can establish causality beyond doubt.
Note that these trials are distinguished from those in the previous section in that they are not studies testing the efficacy or toxicity of cancer drugs themselves. Rather, they randomize cancer survivors to specific lifestyle activities or factors, such as physical activity, to test their ability to directly influence a prognostic outcome. For example, physical activity may reduce cachexia (wasting syndrome) among late-stage cancer patients and improve quality of life, while also impacting physiological states to enhance cancer-free survival.
There have also been randomized trials of low-fat, high fruit and vegetable diets among women with breast cancer, the null results of which were somewhat disappointing (38). However, even a trial, generally believed to have more rigor than an observational study, can be affected by common sources of bias. For example, in the Women's Healthy Eating and Living (WHEL) study, it was observed that the women who participated already had diets very high in fruits and vegetables; thus, self-selection into an intervention trial may dilute effects if the intervention does not produce a substantial change from baseline behavior. Furthermore, as noted above, because some specific dietary factors and complementary and alternative medicines may have adverse effects on treatment outcomes, it may be prudent to confirm beneficial effects in a number of observational studies before trials are embarked upon.

In summarizing approaches to studying cancer prognosis, it is clear that the optimal design for a comprehensive assessment of treatment outcomes is a well-controlled observational study in which patient populations are appropriately homogeneous in relation to disease characteristics and treatments received, complemented by in-depth assessments of epidemiological factors at baseline and throughout and following treatment, combined with biospecimens from blood and tumor tissue. However, there is also merit in conducting research on more limited aspects of cancer prognosis, bounded by the constraints of existing resources and research opportunities.

Statistical Methodologies Used in Studies of Cancer Prognosis

Standard statistical tools may be applied to molecular epidemiological studies of cancer prognosis. Predictors of treatment toxicities may be assessed using standard methods for binary data (chi-square tests, logistic regression models), with a focus on specific toxicities (blood counts, cardiac, diarrhea, fatigue, febrile neutropenia, liver function, mucositis, nausea and vomiting, sensory neuropathy, and pain) or on all toxicities combined, usually grades 3 and 4. To adjust for other known prognostic factors, logistic regression models can be applied. To determine the effects of predictors on recurrence and survival, standard time-to-event methods, such as log-rank tests, Cox models, and Kaplan-Meier estimates, are usually used for the analysis of disease-free survival. Cox regression models are generally used to adjust for other known prognostic factors. If the study is in the context of a therapeutic clinical trial, toxicities and/or disease-free survival may differ by treatment arm, and these differences may impact relationships between epidemiological factors, genetic polymorphisms, and treatment outcomes. Thus, careful analyses should be conducted to first determine whether relationships differ by treatment arm, and, if so, treatment arm should be considered in the analysis. In many studies, the effects of tumor characteristics, of genetic variability, and of epidemiological factors on treatment outcomes are examined singly, without consideration of the potential interactions among these factors. For a more comprehensive assessment of the molecular epidemiology of cancer prognosis, more sophisticated analytic techniques need to be implemented.
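As a minimal illustration of the standard time-to-event analyses described above, the sketch below estimates Kaplan-Meier curves, runs a log-rank test, and fits a Cox model adjusted for other prognostic factors. It assumes the open-source Python lifelines package; the data file and all column names (dfs_months, recurrence, genotype, age, stage) are hypothetical placeholders, not from any study cited in this chapter.

```python
# A minimal sketch of the standard analyses named above, using lifelines.
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

df = pd.read_csv("prognosis_cohort.csv")  # hypothetical study data

# Kaplan-Meier estimates of disease-free survival, one curve per genotype group
kmf = KaplanMeierFitter()
for name, group in df.groupby("genotype"):
    kmf.fit(group["dfs_months"], event_observed=group["recurrence"], label=str(name))
    kmf.plot_survival_function()

# Log-rank test comparing two genotype groups
g0 = df[df["genotype"] == 0]
g1 = df[df["genotype"] == 1]
result = logrank_test(g0["dfs_months"], g1["dfs_months"],
                      event_observed_A=g0["recurrence"],
                      event_observed_B=g1["recurrence"])
print(result.p_value)

# Cox model adjusting for other known prognostic factors (here, age and stage)
cph = CoxPHFitter()
cph.fit(df[["dfs_months", "recurrence", "genotype", "age", "stage"]],
        duration_col="dfs_months", event_col="recurrence")
cph.print_summary()  # hazard ratios with confidence intervals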
Kattan first developed "nomograms," using primarily clinical data to model cancer outcomes (39,40), in which a patient's predicted probability of disease-specific survival is assumed to be a function of both the baseline hazard function shared by all patients and a linear combination of the individual patient's predictor variable values. It would be of interest to build nomograms incorporating genetic and epidemiological data, as well as clinical characteristics.
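In Cox-model notation, this prediction can be written explicitly (a standard formulation of such models, not a formula reproduced from the cited papers):

\[
S(t \mid \mathbf{x}) \;=\; S_0(t)^{\exp(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)}
\]

where \(S_0(t)\) is the baseline survival function shared by all patients, \(x_1, \ldots, x_p\) are the individual patient's predictor values, and \(\beta_1, \ldots, \beta_p\) are the fitted regression coefficients. The nomogram simply converts each term of the linear combination into points and maps the point total to a predicted survival probability.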
The use of Classification and Regression Trees (CART) analysis may be particularly useful for studying the combined effects of genetic polymorphisms, tumor characteristics, and clinical and epidemiological factors on treatment outcomes. The model is fit using binary recursive partitioning, whereby the data are successively split along coordinate axes of the predictor variables so that, at any node, the split that maximally distinguishes the response variable in the left and right branches is selected. Splitting continues until nodes are pure or the data are too sparse; terminal nodes are called leaves, while the initial node is called the root. In practice, to avoid overfitting, typical decision tree systems then "prune" the tree to obtain a smaller tree that is nearly, though not necessarily completely, consistent with the data. Each leaf then makes the majority-class prediction among the data points that end at that leaf. Approaches such as these may lead to a better understanding of the multiple factors that impact treatment outcomes among cancer patients; a brief sketch of such a tree analysis follows.
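The sketch below shows recursive partitioning with pruning on a hypothetical prognostic data set, using scikit-learn's CART implementation; the feature and outcome column names are invented for illustration, and the pruning strength (ccp_alpha) is an arbitrary choice that would need tuning in practice.

```python
# A minimal sketch of CART with cost-complexity pruning, using scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("prognosis_cohort.csv")  # hypothetical study data
X = df[["genotype", "tumor_grade", "stage", "age", "bmi"]]
y = df["grade3_4_toxicity"]  # binary treatment-toxicity outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# ccp_alpha > 0 prunes the fully grown tree back to a smaller tree that is
# nearly, though not completely, consistent with the data, as described above
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)

print(export_text(tree, feature_names=list(X.columns)))  # splits from root to leaves
print("held-out accuracy:", tree.score(X_test, y_test))
```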
PROGNOSTIC STUDIES—OPPORTUNITIES AND OBSTACLES

There is growing interest in the epidemiological community in focusing on the molecular epidemiology of cancer prognosis. To date, this has been a highly understudied area, and there are few good data on which lifestyle recommendations to cancer patients can be based. With efforts toward personalized medicine based on tumor characteristics and genetic profiles, and with the application of molecular epidemiology to cancer prognosis, it is likely that this area will grow and that, through multidisciplinary research, a better understanding of predictors of treatment outcomes will emerge. However, there are numerous obstacles that the research community will have to confront, many of which are discussed above. Approaches to ascertaining and consenting patients, and to reviewing medical records, will need to be compliant with the Health Insurance Portability and Accountability Act (HIPAA) and with increasingly stringent requirements from institutional review boards (IRBs), while still enabling research. As in studies of cancer risk, methods for assessment of exposures and behaviors need to be refined, and rigorous study design applied. Most importantly, researchers from multiple fields, as well as pathologists and clinicians, will need to communicate well with each other, so that novel approaches to studying the multiple factors impacting treatment outcomes can be developed, leading to an elucidation of the molecular epidemiology of cancer prognosis.

REFERENCES

1. Institute of Medicine, National Research Council. From Cancer Patient to Cancer Survivor: Lost in Transition. Washington, DC: National Academies Press; 2006.
2. Ries LA, Wingo PA, Miller DS, et al. The annual report to the nation on the status of cancer, 1973–1997, with a special section on colorectal cancer. Cancer 2000; 88(10):2398–2424.
3. President's Cancer Panel. Living Beyond Cancer: Finding a New Balance. National Cancer Institute; 2003.
4. National Cancer Institute. Eliminating the Suffering and Death Due to Cancer. NCI Cancer Bulletin 2006; 3(40):1–8.
5. American Institute for Cancer Research. Nutrition Guidelines for Cancer Survivors After Treatment. Available at: http://www.aicr.org/information/survivor/guidelines.lasso. Accessed January 20, 2006.
6. Satia JA, Campbell MK, Galanko JA, et al. Longitudinal changes in lifestyle behaviors and health status in colon cancer survivors. Cancer Epidemiol Biomarkers Prev 2004; 13(6):1022–1031.
7. Maunsell E, Drolet M, Brisson J, et al. Dietary change after breast cancer: extent, predictors, and relation with psychological distress. J Clin Oncol 2002; 20(4):1017–1025.
8. Patterson RE, Neuhouser ML, Hedderson MM, et al. Changes in diet, physical activity, and supplement use among adults diagnosed with cancer. J Am Diet Assoc 2003; 103(3):323–328.
9. Maskarinec G, Murphy S, Shumay DM, et al. Dietary changes among cancer survivors. Eur J Cancer Care (Engl) 2001; 10(1):12–20.
10. Thomson CA, Flatt SW, Rock CL, et al. Increased fruit, vegetable and fiber intake and lower fat intake reported among women previously treated for invasive breast cancer. J Am Diet Assoc 2002; 102(6):801–808.
11. Salminen E, Bishop M, Poussa T, et al. Dietary attitudes and changes as well as use of supplements and complementary therapies by Australian and Finnish women following the diagnosis of breast cancer. 2004; 58(1):137–144.
12. Wayne SJ, Lopez ST, Butler LM, et al. Changes in dietary intake after diagnosis of breast cancer. J Am Diet Assoc 2004; 104(10):1561–1568.
13. Ambrosone CB, Ahn J, Schoenenberger V. Antioxidant supplements, genetics and chemotherapy outcomes. Current Cancer Therapy Reviews 2005; 1(3):1–8.
14. Bairati I, Meyer F, Gelinas M, et al. A randomized trial of antioxidant vitamins to prevent second primary cancers in head and neck cancer patients. J Natl Cancer Inst 2005; 97(7):481–488.
15. Labriola D, Livingston R. Possible interactions between dietary antioxidants and chemotherapy. Oncology (Williston Park) 1999; 13(7):1003–1008 (discussion 1008, 1011–1012).
16. Cole BF, Baron JA, Sandler RS, et al. Folic acid for the prevention of colorectal adenomas: a randomized clinical trial. JAMA 2007; 297(21):2351–2359.
17. Ulrich CM, Potter JD. Folate and cancer–timing is everything. JAMA 2007; 297(21):2408–2409.
18. Meijerman I, Beijnen JH, Schellens JH. Herb-drug interactions in oncology: focus on mechanisms of induction. Oncologist 2006; 11(7):742–752.
19. Ambrosone CB, Rebbeck TR, Morgan GJ, et al. New developments in the epidemiology of cancer prognosis: traditional and molecular predictors of treatment response and survival. Cancer Epidemiol Biomarkers Prev 2006; 15(11):2042–2046.
20. Ulrich CM, Robien K, McLeod HL. Cancer pharmacogenetics: polymorphisms, pathways and beyond. Nat Rev Cancer 2003; 3(12):912–920.
21. Popat S, Hubner R, Houlston RS. Systematic review of microsatellite instability and colorectal cancer prognosis. J Clin Oncol 2005; 23(3):609–618.
22. Amend K, Hicks D, Ambrosone CB. Breast cancer in African-American women: differences in tumor biology from European-American women. Cancer Res 2006; 66(17):8327–8330.
23. Popat S, Matakidou A, Houlston RS. Thymidylate synthase expression and prognosis in colorectal cancer: a systematic review and meta-analysis. J Clin Oncol 2004; 22(3):529–536.
24. Rebbeck TR, Ambrosone CB, Bell DA, et al. SNPs, haplotypes, and cancer: applications in molecular epidemiology. Cancer Epidemiol Biomarkers Prev 2004; 13(5):681–687.
25. Ulrich CM. Nutrigenetics in cancer research–folate metabolism and colorectal cancer. J Nutr 2005; 135(11):2698–2702.
26. Moran AE, Hunt DH, Javid SH, et al. Apc deficiency is associated with increased Egfr activity in the intestinal enterocytes and adenomas of C57BL/6J-Min/+ mice. J Biol Chem 2004; 279(41):43261–43272.
27. Kim YI. Folate: a magic bullet or a double edged sword for colorectal cancer prevention? Gut 2006; 55(10):1387–1389.
28. Bueno R, Appasani K, Mercer H, et al. The alpha folate receptor is highly activated in malignant pleural mesothelioma. J Thorac Cardiovasc Surg 2001; 121(2):225–233.
29. Wu M, Gunning W, Ratnam M. Expression of folate receptor type alpha in relation to cell type, malignancy, and differentiation in ovary, uterus, and cervix. Cancer Epidemiol Biomarkers Prev 1999; 8(9):775–782.
30. Ross JF, Chaudhuri PK, Ratnam M. Differential regulation of folate receptor isoforms in normal and malignant tissues in vivo and in established cell lines. Physiologic and clinical implications. Cancer 1994; 73(9):2432–2443.
31. Popat S, Chen Z, Zhao D, et al. A prospective, blinded analysis of thymidylate synthase and p53 expression as prognostic markers in the adjuvant treatment of colorectal cancer. Ann Oncol 2006; 17:1810–1817.
32. Yasuno M, Mori T, Koike M, et al. Importance of thymidine phosphorylase expression in tumor stroma as a prognostic factor in patients with advanced colorectal carcinoma. Oncol Rep 2005; 13(3):405–412.
33. Odin E, Wettergren Y, Nilsson S, et al. Altered gene expression of folate enzymes in adjacent mucosa is associated with outcome of colorectal cancer patients. Clin Cancer Res 2003; 9(16 pt 1):6012–6019.
34. Robien K, Boynton A, Ulrich CM. Pharmacogenetics of folate-related drug targets in cancer treatment. Pharmacogenomics 2005; 6(7):673–689.
35. Lenz H-J. Pharmacogenomics and colorectal cancer. Adv Exp Med Biol 2006; 587:211–231.
36. Kushi LH, Kwan ML, Lee MM, et al. Lifestyle factors and survival in women with breast cancer. 2007; 137(1 suppl):236S–242S.
37. Meyerhardt JA, Niedzwiecki D, Hollis D, et al. Association of dietary patterns with cancer recurrence and survival in patients with stage III colon cancer. JAMA 2007; 298(7):754–764.
38. Pierce JP, Natarajan L, Caan BJ, et al. Influence of a diet very high in vegetables, fruit, and fiber and low in fat on prognosis following treatment for breast cancer: the Women's Healthy Eating and Living (WHEL) randomized trial. JAMA 2007; 298(3):289–298.
39. Kattan MW, Leung DH, Brennan MF. Postoperative nomogram for 12-year sarcoma-specific death. J Clin Oncol 2002; 20(3):791–796.
40. Mariani L, Miceli R, Kattan MW, et al. Validation and adaptation of a nomogram for predicting the survival of patients with extremity soft tissue sarcoma using a three-grade system. Cancer 2005; 103(2):402–408.
5

Biosampling Methods

Regina M. Santella
Department of Environmental Health Sciences, Mailman School of Public Health, Columbia University, New York, New York, U.S.A.

Susan E. Hankinson
Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, and Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, U.S.A.
SAMPLE COLLECTION AND PROCESSING

The collection of biospecimens is now an essential component of almost every epidemiologic study. The specific samples to be collected and the way they are processed and stored are usually dependent on their ultimate use. However, the difficulties and cost of sample collection necessitate that thought be given to possible additional uses, even if these are unknown at the time of collection. Almost any material that can possibly be obtained from participants has been used in various studies, including blood, urine, oral cells, sputum, breast milk, hair, toenails, saliva, meconium, feces, and fat. In addition, frozen or paraffin tissues are routinely available from either surgery or, less frequently, autopsy. Blood and urine are the most frequently collected because of the ease of collection and general acceptability.

Prior to blood collection, decisions must be made about the timing of collection (time of day, month, season, fasting/nonfasting), what type of tube and needles will be used (e.g., acid washed for trace metals), and whether an anticoagulant will be used and, if so, which type. Other decisions include how much blood to collect and, if it is to be shipped, under what conditions. Chill packs limit degradation of specific analytes but lead to lower viability of lymphocytes (1). As a measure of quality control, lymphocytes should be routinely tested for their ability to be transformed with Epstein-Barr virus. The time to processing will also impact some analytes and can lead to hemolysis. For example, both interleukin-6 (IL-6) and tumor necrosis factor alpha (TNF-α) appear to degrade in blood after about four to six hours at room temperature (2). In contrast, levels of most circulating sex steroids in blood are quite robust, even with processing delays of up to 72 hours (3–5). Prolonged storage of blood prior to DNA isolation has been shown not to dramatically interfere with the quantity or quality of DNA obtained (6).
Although immediate processing of the blood or urine sample is preferred, because of constraints in some epidemiologic studies (e.g., participants are dispersed geographically and blood samples must be mailed to a central processing facility), a substantial literature regarding the influence of delays in sample processing is accruing (7). Possibilities for aliquots to be stored include whole blood, serum, plasma, total white blood cells (possibly viable cells for later immortalization), fractionated white blood cells (granulocytes and mononuclear cells), and red blood cells. Other decisions include whether the sample must be kept sterile or protected from sunlight (e.g., for measurement of carotenoids), or whether specific additives must be used, such as butylated hydroxytoluene (BHT) to prevent oxidation.

For urine, first morning void samples are more concentrated, while random spot samples are easier to collect but may not be representative. The most difficult to collect, but also the most accurate, is a 24-hour collection. In some situations, a preservative such as ascorbic acid may be added. While urine is most frequently used for measurement of excreted chemicals or hormones, it has also been reported that urine can be used for genotyping when blood or buccal cells are not available (8). Exfoliated cells have also been used for the measurement of DNA adducts (9).

There are several methods available for buccal cell collection, including the use of swabs, brushes, or mouthwash solutions (10–12). There are conflicting data on which method provides the highest yield, but sufficient DNA for genotyping, as well as for whole-genome amplification (WGA) (see below), can be collected. However, it must be remembered that a significant portion of the DNA is from bacteria. Studies suggest that polymerase chain reaction (PCR) amplification of fragments up to 1 kb is generally feasible.

Another major decision is the type, number, and volume of aliquots to be stored of any specific sample type. Multiple, smaller aliquots are preferable to avoid freeze-thaw cycles as a sample is used, but they take up more freezer space, thus increasing costs. Small polypropylene vials or cryotubes [such as Nunc CryoTubes (http://www.nuncbrand.com)] are frequently used to store samples; aliquot volumes generally range from 1 to 5 mL per cryovial. However, alternate straw-based systems [see CryoBioSystems (http://www.cryobiosystem-imv.com)] have also been successfully used in epidemiologic studies, such as the large European Prospective Investigation into Cancer and Nutrition (EPIC). Here, 300 to 1000 µL or larger straws are filled, plugged and heat-sealed, and stored in a goblet within a metal canister; this storage system has generally been used in nitrogen freezers, although use in mechanical freezers is possible.

SAMPLE STORAGE

A number of freezer storage options are available, depending on the sample type and analytes of interest. Blood or urine samples are generally stored in −70°C or −80°C mechanical freezers or in the vapor phase (−130°C or colder) or liquid phase (−196°C) of liquid nitrogen freezers. Although −20°C and −40°C mechanical freezers are available, a number of blood parameters are known to be unstable at these temperatures, and hence these freezers are generally not recommended for epidemiologic studies. For example, plasma carotenoids were reported to degrade by 15% at six months and 97% at 10 years when blood samples were stored at −20°C (13).
Note that in one study of 15 upright mechanical freezers, the freezer display panels reported temperatures of −81°C to −74°C, but the measured temperatures ranged from −90°C to −43.5°C (14). This suggests that liquid nitrogen freezers, with temperatures below −130°C, might be a better choice for very long-term storage of samples, such as in a prospective cohort study, since they maintain a consistently lower temperature.
At a minimum, careful calibration and ongoing monitoring of freezer temperatures are required. Optimally, samples should be split between two freezers to avoid loss of all samples from some individuals in the event of a freezer failure. Given the possibility of natural disasters as well as major electrical system shutdowns, it makes sense to ensure that valuable samples are split between locations, including different buildings and cities. Appropriate backups should be in place for freezers, as well as connection to a telephone alarm system with 24 hr/day response. If nitrogen freezers are used, oxygen sensors must be in place to monitor the ambient oxygen level. Further, including multiple replicate samples (e.g., plasma or urine) throughout the freezers will allow comparisons of frozen versus thawed sample values in the event of a freezer failure with resultant thawing of samples.

STABILITY OF SAMPLES WITH LONG-TERM STORAGE

Although biomarker stability with long-term storage is an important issue for many epidemiologic studies, it is difficult to directly assess the effects of long-term storage on analyte degradation. Two study designs are currently used. One method is to collect a sample at one time point and then measure the analyte(s) of interest several times over a number of years. Although this means that the baseline biomarker levels are the same for each person, laboratory variability and drift can make comparison of assay results over time difficult, especially if the assay changes. Interpretability of the results depends, in part, on the reliability or variability of the assay. An alternate study design is to collect samples from the same individuals or population over a number of years, storing them at each time point. The samples are then all assayed together at the end of the study, which reduces problems of assay variability. However, within-person changes in biomarker levels over time mean that it is unclear whether the levels at each time point are the same. To detect large changes, simply comparing levels between stored samples and newly collected samples is informative. Because of the difficulty in addressing storage degradation over time, it is important that any samples being compared (e.g., case vs. control samples) have comparable storage conditions and times, either by matching on storage times or by controlling for storage times in the data analysis.

DNA is usually stored in a Tris-EDTA solution for maximum stability. Short-term storage can be at 4°C, but for long-term storage, samples should be kept at −80°C. Information regarding storage stability for blood markers varies substantially by the marker of interest. As noted above, carotenoids are quite labile if stored at −40°C or warmer; however, if stored at −70°C or colder, levels are stable for up to 10 years. Long-term storage at −20°C is adequate for dehydroepiandrosterone sulfate levels, while levels of free estradiol change when stored at that temperature (15).

SAMPLE TRACKING

A critical component in setting up a repository is the database management system, both for documentation of sample characteristics (e.g., when the sample was collected, whether the sample was hemolyzed, sample volume) and for tracking the samples in the storage system.
The latter should include the location of both the original sample and any new subaliquots, where any samples were sent for analysis, and the assays to be conducted. Further, information such as the freeze-thaw history of a sample must be stored. A number of laboratory information management systems (LIMS) are now commercially available and can track data sets ranging from hundreds to millions of samples. The cost of obtaining and maintaining a database management system can be substantial; however, such a system is critical for maintaining and using a biorepository. The National Cancer Institute, as part of its Cancer Biomedical Informatics Grid (caBIG) initiative, is now developing a number of applications for general use by the research community (see https://cabig.nci.nih.gov). caBIG is the cornerstone of NCI's biomedical informatics effort to link investigators and institutions throughout the cancer community, with the hope of facilitating and accelerating research discoveries. One recently released tool, "caTissue," permits users to track the collection, storage, quality control, and distribution of tissue specimens, as well as the aliquotting of new specimens from existing ones. Plans exist to expand these tools to allow tracking of other specimens, such as blood. A minimal sketch of the kind of record such a system must maintain follows.
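The sketch below illustrates, in Python, the fields a sample-tracking record needs to capture per the requirements just described (location, subaliquots, shipments, assays, freeze-thaw history). This is a hypothetical schema for illustration only, not caTissue's or any commercial LIMS's actual data model.

```python
# A minimal sketch of a sample-tracking record; hypothetical schema.
from dataclasses import dataclass, field
from datetime import date
from typing import List

@dataclass
class Aliquot:
    sample_id: str                 # links the aliquot to its parent specimen/participant
    sample_type: str               # e.g., "plasma", "urine", "buffy coat"
    volume_ml: float
    collected_on: date
    hemolyzed: bool
    freezer: str                   # freezer identifier
    position: str                  # e.g., "rack 3, box 12, slot A4"
    freeze_thaw_cycles: int = 0
    sent_to: List[str] = field(default_factory=list)  # labs receiving subaliquots
    assays: List[str] = field(default_factory=list)   # assays to be conducted

# Example: record one plasma aliquot, then log a freeze-thaw event and an assay
a = Aliquot("P0001-01", "plasma", 1.8, date(2007, 5, 14), False,
            "LN2-02", "rack 3, box 12, slot A4")
a.freeze_thaw_cycles += 1
a.assays.append("estradiol")
```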
TISSUE MICROARRAYS

Tumor tissue is frequently collected in epidemiologic studies to assess tumor morphology and characteristics such as protein or gene expression. Formalin-fixed, paraffin-embedded (FFPE) tissue (vs. fresh frozen) is most commonly available. Until recently, to evaluate specimens immunohistochemically, FFPE tumor blocks were sectioned, stained, and microscopically examined. A typical tumor block might provide up to 100 sections (16). With tissue microarrays (TMA), small core samples from multiple blocks or patients are embedded into a single recipient paraffin block (16,17). Several hundred cores can be accommodated in one recipient block. This technique greatly extends tissue resources because only a few small cores (typically 0.6 mm) are used in the TMA; hence, the remainder of the donor block is available for other studies. Because multiple cases are contained on a single block—and sectioned on a single slide—standardization of staining is greatly enhanced and reagent use minimized compared with staining individual slides for each case. Further, bioinformatics programs have been developed to standardize the reading of TMA slides and the storage of data. The greatest potential limitation of this technology is whether several small cores can adequately represent the whole tissue section. To date, most studies suggest that two to four cores per donor block can characterize the tumor well (18–23). Cutting multiple sections at once from a TMA is most efficient (because of the need to trim or "face" a block each time it is used); hence, the potential loss of antigenicity on cut sections—due to oxidation—is an important concern (24,25). Although data remain limited, applying a thin coat of paraffin and placing the sections in a vacuum desiccator appear to preserve samples best (24,26).
DNA PREPARATION

DNA isolated from various biospecimens can be used for both DNA adduct measurement and genetic studies. DNA can be prepared from blood (total white blood cells or fractionated cells, plasma or serum, and blood clots), exfoliated buccal or bladder cells, or frozen tissue samples, as well as from paraffin tissue blocks. DNA is generally stable under appropriate storage conditions and thus can be prepared in batches as samples are received; alternatively, all samples can be prepared when needed. The method used for DNA extraction depends on the amount of sample available, the number of samples to be extracted, the availability of specific equipment such as robotics, and, in some situations, the ultimate use of the DNA (27).
While phenol/chloroform extraction has been used extensively, newer methods include salting out, commercial kits for single samples or in a 96-well-based format (Arcturus, Invitrogen, Qiagen, Stratagene, and others), and robotics (28). Kits are also available for extracting DNA from plasma or serum. While the yield is sometimes slightly lower than with other methods and there may be some degradation of DNA, the major advantage of these methods is that they are fast and easy. If small amounts of DNA are needed, 96-well DNA extraction kits can be used to isolate approximately 6 µg of DNA from 200 µL of whole blood. Methods, including specific commercial kits, are also available for extracting DNA from clotted blood, but the yields are lower than for comparable whole blood samples and the methods are more labor intensive (29). Yields from serum or plasma have been reported in the range of 0.4 to 4 µg/mL (30). For laboratories that carry out large-scale DNA extractions, the cost of investing in an automated system may be worthwhile. For example, Gentra (www.gentra.com) has a system that can handle whole blood samples ranging from 0.05 to 10 mL and yields DNA typically 50 kb in length within 30 minutes to 1 hour. Simple methods are available for extraction of DNA from paraffin [e.g., (31)], as are commercial kits (e.g., http://www.arctur.com). When DNA is extracted from paraffin sections, the size of the product is much smaller than for other types of samples. This may require the development of new PCR primers that produce smaller products; otherwise, a poor success rate will be obtained. Nonetheless, there are multiple reports of successful genotyping with paraffin DNA (30,32).

DNA is most easily quantitated by measurement of absorbance at 260 nm, with an extinction coefficient for double-stranded DNA of 6500 (28). One absorbance unit in a 1-cm cuvette corresponds to approximately 50 µg/mL. This method is limited to concentrations above 5 µg/mL and measures both single- and double-stranded DNA. For lower concentrations of DNA (25 pg/mL to 1 µg/mL), PicoGreen methods can be used. When intercalated into double-stranded DNA, this dye becomes fluorescent; thus, it is insensitive to single-stranded DNA or RNA. A standard curve must be generated using known quantities of DNA, but the assay is sensitive to the DNA length of the sample and standard (33). The most sensitive method of quantitation (for concentrations below 25 pg/mL) is real-time PCR with SYBR Green, another dye that binds to double-stranded but not single-stranded DNA. Again, a standard curve is generated using known amounts of DNA, plotting the threshold cycle against DNA concentration. Since most DNAs are prepared for genotyping, this method also ensures that the DNA is of sufficient quality for PCR. With high-quality double-stranded DNA, the three methods generally agree, but discrepancies can be found with some samples.
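The spectrophotometric conversion just described is simple enough to state exactly; the sketch below applies the 50 µg/mL-per-absorbance-unit factor quoted above for double-stranded DNA, along with the A260/A280 purity ratio discussed later in this chapter. The function names and example readings are invented for illustration.

```python
# A minimal sketch of spectrophotometric DNA quantitation as described above:
# one A260 unit in a 1-cm cuvette corresponds to ~50 ug/mL double-stranded DNA.
def dsdna_concentration_ug_per_ml(a260: float, dilution_factor: float = 1.0) -> float:
    """DNA concentration (ug/mL) from absorbance at 260 nm."""
    return a260 * 50.0 * dilution_factor

def purity_ratio(a260: float, a280: float) -> float:
    """A260/A280 ratio; values near 1.8 indicate good-quality DNA (see below)."""
    return a260 / a280

# Example: a 1:10 dilution reading A260 = 0.25 and A280 = 0.135
print(dsdna_concentration_ug_per_ml(0.25, dilution_factor=10.0))  # 125.0 ug/mL
print(round(purity_ratio(0.25, 0.135), 2))                        # 1.85
```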
WHOLE GENOME AMPLIFICATION

When DNA amounts are limited, the technique of WGA can be used. One approach uses random hexamer primers that anneal at multiple sites on denatured DNA and Phi29 DNA polymerase to initiate replication. Strand displacement downstream of the replicated DNA generates new single-stranded DNA to which additional primers bind. This isothermal process produces large quantities of high molecular weight, double-stranded DNA, on the order of 40 µg starting from nanogram quantities. Qiagen reports that the average product length using their REPLI-g kit is typically greater than 10 kb, with a range between 2 and 100 kb. A second method uses blunt-end ligation with T4 DNA polymerase, then self-ligation or cross-ligation with T4 DNA ligase and amplification via random primer-initiated multiple displacement amplification (34). Amplifications of approximately 1000-fold were reported, with better allelic retention and less allelic imbalance compared with multiple displacement amplification.
A number of publications have demonstrated that WGA gives reliable genotyping data compared with the original DNA sample, including DNA from white blood cells, buccal cells, and plasma (31,35,36). However, a recent detailed study demonstrated that multiple displacement amplification of a large number of DNAs from control subjects resulted in heterozygote loss at many single nucleotide polymorphisms (SNPs) (37). It is also clear that this process generates significant amounts of side products, leading to problems in quantitation of DNA yield when UV absorbance or PicoGreen fluorescence methods are used. Recently, it was reported that the use of real-time PCR amplification of human-specific Alu Yd6 accurately predicted which WGA DNA samples provided reliable genotyping results (38). To minimize errors in genotyping, sufficient input DNA should be used for WGA. In addition, it has been suggested that the product DNA should be screened with the Applied Biosystems AmpFLSTR Identifiler Amplification kit (37), a method that amplifies 15 tetranucleotide short tandem repeat loci, or by Alu PCR (38) prior to use. Studies of buccal cells that have been subjected to electron-beam irradiation, sometimes used to sterilize mail, have shown that mail irradiation may interfere with WGA of buccal cell DNA (39). It is also less clear that WGA can be done with paraffin DNA. Specific methylation of cytosine is lost during WGA; thus, WGA samples are not appropriate for methylation studies. However, it has been reported that WGA of bisulfite-treated DNA can be used to accurately estimate methylated cytosine levels (40).

QUALITY CONTROL OF DNA

The most frequently used method to determine DNA quality is the ratio of absorbance at 260 and 280 nm (A260nm/A280nm). Ratios around 1.8 indicate good-quality DNA, while lower values indicate protein contamination, since proteins have an absorption peak at 280 nm resulting from the aromatic amino acids. Higher values indicate RNA contamination. DNAs can also be run on a gel to determine size. As an identity check of a DNA sample, highly polymorphic microsatellite repeats can be determined, such as with Identifiler, discussed above, and compared to another source of DNA from the same individual. For example, if a filter card blood spot was made from the initial blood sample, DNA can be isolated from it and the alleles compared. This method can also be used to confirm that a cell line made from a blood sample corresponds to that individual. The presence of the Y chromosome can also be determined as a partial identity check. As indicated above, DNA is usually stored in a Tris-EDTA solution for maximum stability. Short-term storage can be at 4°C, but for long-term storage, samples should be kept at −80°C. Several reports have indicated variable stability of dilute and small-volume samples (41,42), but we have not had difficulties with long-term storage at −80°C of DNAs at around 200 ng/mL. Usually storage is in either sealed vials or 96-deep-well storage plates. Storage in well-sealed vials gives minimal evaporation and thus maintains concentration. However, the handling of large numbers of samples is more easily done robotically with 96-well plates. For most studies, especially when genotyping is to be carried out, the stock DNAs can be kept in 1.5-mL vials at their initial concentrations.
At the same time, 96-well storage plates can be prepared at a standard concentration appropriate for making replica plates for genotyping. Methods are also available to store DNAs on chemically treated paper in 96- or 384-well storage plates that can then be rehydrated when the DNA is needed (e.g., http://www.biomatrica.com/ or https://store.genvault.com/). These methods allow room temperature storage.
RNA PREPARATION

If RNA is needed, several methods are available that limit the activity of RNases and allow rapid isolation of relatively pure material. Because of potential changes in gene expression after blood collection, the goal is to lyse cells and stabilize RNA as quickly as possible. Several methods make use of guanidinium isothiocyanate or the commercial reagent Trizol LS (Invitrogen Life Technologies, Carlsbad, U.S.), a monophasic solution of phenol and guanidine isothiocyanate. During sample homogenization or lysis, the reagents used maintain the integrity of the RNA while disrupting cells and dissolving cell components. Specifically for blood sample collection, PAXgene Blood RNA System vacutainer tubes can be used. They contain a proprietary reagent that immediately stabilizes intracellular RNA, reportedly for days at room temperature and weeks at 4°C (http://www.preanalytix.com) (43–45). The most frequently used methods for RNA extraction, including from paraffin sections, employ commercial kits such as the RNeasy kit from Qiagen. Arcturus also produces a line of products that can be used for RNA isolation from paraffin sections, as well as for the amplification of as little as 5 ng of RNA. Quantitation is by absorption at 260 nm, with one absorbance unit equal to 40 µg/mL. Some of these methods are also appropriate for simultaneous RNA and protein extraction (43–45). The quality of the RNA can be evaluated from the A260nm/A280nm ratio, with values of 1.9 to 2.1 being ideal. Analysis by gel electrophoresis should show discrete bands of high molecular weight RNA between 7 and 15 kb in size, two predominant ribosomal RNA bands at approximately 5 kb (28S) and 2 kb (18S) in a ratio of approximately 2:1, and low molecular weight RNA between 0.1 and 0.3 kb (tRNA, 5S). Long-term storage should be at −80°C or as an ethanol precipitate at −20°C.

For laboratories that deal with large numbers of samples, a major concern is errors in sample identification and sample swapping. For these reasons, bar code labels should always be used, with no hand labeling. The use of robotics also minimizes sample swapping.

NCI BEST PRACTICES

As a result of concerns about the lack of common standard operating procedures (SOPs) and quality control/quality assurance measures across NCI-funded biorepositories, as well as the lack of a common database and a defined mechanism for accessing samples, the NCI has begun the development of best practices guidelines to optimize the quality and accessibility of biospecimens across the cancer research community. The draft guidelines for NCI Best Practices for Biospecimen Resources are available on the web (http://biospecimens.cancer.gov/biorepositories/guidelines_full_formatted.asp), with updates made as necessary. This draft document provides useful information on sample collection, processing, storage, retrieval, and dissemination. Quality control and quality assurance, biohazards, biospecimen and clinical data management, and ethical and legal issues are also discussed. Although only recommended at the present time, it is likely that all future NCI-funded biorepositories will be required to meet these best practices guidelines and to have data compatible with caBIG.
ACKNOWLEDGMENT

Supported by NIH grants P30ES09089, R01ES05116, P30CA013696, R01CA49449, and R01CA167262.

REFERENCES

1. Kristal AR, King IB, Albanes D, et al. Centralized blood processing for the selenium and vitamin E cancer prevention trial: effects of delayed processing on carotenoids, tocopherols, insulin-like growth factor-I, insulin-like growth factor binding protein 3, steroid hormones, and lymphocyte viability. Cancer Epidemiol Biomarkers Prev 2005; 14(3):727–730.
2. Flower L, Ahuja RH, Humphries SE, et al. Effects of sample handling on the stability of interleukin 6, tumour necrosis factor-alpha and leptin. Cytokine 2000; 12(11):1712–1716.
3. Hankinson SE, London SJ, Chute CG, et al. Effect of transport conditions on the stability of biochemical markers in blood. Clin Chem 1989; 35(12):2313–2316.
4. Key T, Oakes S, Davey G, et al. Stability of vitamins A, C, and E, carotenoids, lipids, and testosterone in whole blood stored at 4 degrees C for 6 and 24 hours before separation of serum and plasma. Cancer Epidemiol Biomarkers Prev 1996; 5(10):811–814.
5. Taieb J, Benattar C, Birr AS, et al. Delayed assessment of serum and whole blood estradiol, progesterone, follicle-stimulating hormone, and luteinizing hormone kept at room temperature or refrigerated. Fertil Steril 2000; 74(5):1053–1054.
6. Nederhand RJ, Droog S, Kluft C, et al. Logistics and quality control for DNA sampling in large multicenter studies. J Thromb Haemost 2003; 1(5):987–991.
7. Tworoger SS, Hankinson SE. Collection, processing, and storage of biological samples in epidemiologic studies: sex hormones, carotenoids, inflammatory markers, and proteomics as examples. Cancer Epidemiol Biomarkers Prev 2006; 15(9):1578–1581.
8. van Duijnhoven FJ, van der Hel OL, van der Luijt RB, et al. Quality of NAT2 genotyping with restriction fragment length polymorphism using DNA isolated from frozen urine. Cancer Epidemiol Biomarkers Prev 2002; 11(8):771–776.
9. Hsu TM, Zhang YJ, Santella RM. Immunoperoxidase quantitation of 4-aminobiphenyl- and polycyclic aromatic hydrocarbon-DNA adducts in exfoliated oral and urothelial cells of smokers and nonsmokers. Cancer Epidemiol Biomarkers Prev 1997; 6(3):193–199.
10. Feigelson HS, Rodriguez C, Robertson AS, et al. Determinants of DNA yield and quality from buccal cell samples collected with mouthwash. Cancer Epidemiol Biomarkers Prev 2001; 10(9):1005–1008.
11. King IB, Satia-Abouta J, Thornquist MD, et al. Buccal cell DNA yield, quality, and collection costs: comparison of methods for large-scale studies. Cancer Epidemiol Biomarkers Prev 2002; 11(10 pt 1):1130–1133.
12. Lum A, Le Marchand L. A simple mouthwash method for obtaining genomic DNA in molecular epidemiological studies. Cancer Epidemiol Biomarkers Prev 1998; 7(8):719–724.
13. Mathews-Roth MM, Stampfer MJ. Some factors affecting determination of carotenoids in serum. Clin Chem 1984; 30(3):459–461.
14. Su SC, Garbers S, Rieper TD, et al. Temperature variations in upright mechanical freezers. Cancer Epidemiol Biomarkers Prev 1996; 5(2):139–140.
15. Langley MS, Hammond GL, Bardsley A, et al. Serum steroid binding proteins and the bioavailability of estradiol in relation to breast diseases. J Natl Cancer Inst 1985; 75(5):823–829.
16. Rimm DL, Camp RL, Charette LA, et al. Amplification of tissue by construction of tissue microarrays. Exp Mol Pathol 2001; 70(3):255–264.
17. Kononen J, Bubendorf L, Kallioniemi A, et al. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat Med 1998; 4(7):844–847.
18. Camp RL, Charette LA, Rimm DL. Validation of tissue microarray technology in breast carcinoma. Lab Invest 2000; 80(12):1943–1949.
Biosampling Methods
61
19. Fernebro E, Dictor M, Bendahl PO, et al. Evaluation of the tissue microarray technique for immunohistochemical analysis in rectal cancer. Arch Pathol Lab Med 2002; 126(6):702–705. 20. Hoos A, Urist MJ, Stojadinovic A, et al. Validation of tissue microarrays for immunohistochemical profiling of cancer specimens using the example of human fibroblastic tumors. Am J Pathol 2001; 158(4):1245–1251. 21. Leversha MA, Fielding P, Watson S, et al. Expression of p53, pRB, and p16 in lung tumours: a validation study on tissue microarrays. J Pathol 2003; 200(5):610–619. 22. Rosen DG, Huang X, Deavers MT, et al. Validation of tissue microarray technology in ovarian carcinoma. Mod Pathol 2004; 17(7):790–797. 23. Rubin MA, Dunn R, Strawderman M, et al. Tissue microarray sampling strategy for prostate cancer biomarker analysis. Am J Surg Pathol 2002; 26(3):312–319. 24. DiVito KA, Charette LA, Rimm DL, et al. Long-term preservation of antigenicity on tissue microarrays. Lab Invest 2004; 84(8):1071–1078. 25. Fergenbaum JH, Garcia-Closas M, Hewitt SM, et al. Loss of antigenicity in stored sections of breast cancer tissue microarrays. Cancer Epidemiol Biomarkers Prev 2004; 13(4):667–672. 26. Su Y, Shrubsole MJ, Ness RM, et al. Immunohistochemical expressions of Ki-67, cyclin D1, beta-catenin, cyclooxygenase-2, and epidermal growth factor receptor in human colorectal adenoma: a validation study of tissue microarrays. Cancer Epidemiol Biomarkers Prev 2006; 15(9):1719–1726. 27. Austin MA, Ordovas JM, Eckfeldt JH, et al. Guidelines of the National Heart, Lung, and Blood Institute Working Group on Blood Drawing, Processing, and Storage for Genetic Studies. Am J Epidemiol 1996; 144(5):437–441. 28. Santella RM. Approaches to DNA/RNA Extraction and whole genome amplification. Cancer Epidemiol Biomarkers Prev 2006; 15(9):1585–1587. 29. Everson RB, Mass MJ, Gallagher JE, et al. Extraction of DNA from cryopreserved clotted human blood. Biotechniques 1993; 15(1):18–20. 30. Blomeke B, Bennett WP, Harris CC, et al. Serum, plasma and paraffin-embedded tissues as sources of DNA for studying cancer susceptibility genes. Carcinogenesis 1997; 18(6):1271–1275. 31. Sjoholm MI, Hoffmann G, Lindgren S, et al. Comparison of archival plasma and formalinfixed paraffin-embedded tissue for genotyping in hepatocellular carcinoma. Cancer Epidemiol Biomarkers Prev 2005; 14(1):251–255. 32. Xie B, Freudenheim JL, Cummings SS, et al. Accurate genotyping from paraffin-embedded normal tissue adjacent to breast cancer. Carcinogenesis 2006; 27(2):307–10. 33. Georgiou CD, Papapostolou I. Assay for the quantification of intact/fragmented genomic DNA. Anal Biochem 2006; 358(2):247–256. 34. Li J, Harris L, Mamon H, et al. Whole genome amplification of plasma-circulating DNA enables expanded screening for allelic imbalance in plasma. J Mol Diagn 2006; 8(1):22–30. 35. Lu Y, Gioia-Patricola L, Gomez JV, et al. Use of whole genome amplification to rescue DNA from plasma samples. Biotechniques 2005; 39(4):511–515. 36. Paynter RA, Skibola DR, Skibola CF, et al. Accuracy of multiplexed Illumina platform-based single-nucleotide polymorphism genotyping compared between genomic and whole genome amplified DNA collected from multiple sources. Cancer Epidemiol Biomarkers Prev 2006; 15(12):2533–2536. 37. Liang X, Trentham-Dietz A, Titus-Ernstoff L, et al. Whole-genome amplification of oral rinse self-collected DNA in a population-based case-control study of breast cancer. Cancer Epidemiol Biomarkers Prev 2007; 16(8):1610–1614. 38. 
Hansen HM, Wiemels JL, Wrensch M, et al. DNA quantification of whole genome amplified samples for genotyping on a multiplexed bead array platform. Cancer Epidemiol Biomarkers Prev 2007; 16(8):1686–1690. 39. Bergen AW, Qi Y, Haque KA, et al. Effects of electron-beam irradiation on whole genome amplification. Cancer Epidemiol Biomarkers Prev 2005; 14(4):1016–1019. 40. Mill J, Yazdanpanah S, Guckel E, et al. Whole genome amplification of sodium bisulfitetreated DNA allows the accurate estimate of methylated cytosine density in limited DNA resources. Biotechniques 2006; 41(5):603–607.
62
Santella and Hankinson
41. Madisen L, Hoar DI, Holroyd CD, et al. DNA banking: the effects of storage of blood and isolated DNA on the integrity of DNA. Am J Med Genet 1987; 27(2):379–390. 42. Sozzi G, Roz L, Conte D, et al. Effects of prolonged storage of whole plasma or isolated plasma DNA on the results of circulating DNA quantification assays. J Natl Cancer Inst 2005; 97(24):1848–1850. 43. Hummon AB, Lim SR, Difilippantonio MJ, et al. Isolation and solubilization of proteins after TRIzol extraction of RNA and DNA from patient material following prolonged storage. Biotechniques 2007; 42(4):467–470,472. 44. MacIntyre DA, Smith R, Chan EC. Differential enrichment of high- and low-molecular weight proteins and concurrent RNA extraction. Anal Biochem 2006; 359(2):274–276. 45. Thach DC, Lin B, Walter E, et al. Assessment of two methods for handling blood in collection tubes with RNA stabilizing agent for surveillance of gene expression profiles with high density microarrays. J Immunol Methods 2003; 283(1–2):269–279.
6

Principles of High-Quality Genotyping

Stephen J. Chanock
Laboratory of Translational Genomics, National Cancer Institute, Advanced Technology Center, Gaithersburg, Maryland, U.S.A.
INTRODUCTION

The generation of a draft sequence of the human genome has heralded a new age in the investigation of the contribution of genetic variation to human disease and phenotypes (1). Quickly following on the scaffold of a draft sequence, there has been a rush to annotate genetic variation and expression across the genome, some of which has been accomplished through large international consortial efforts [e.g., the International HapMap Project or the Encyclopedia of DNA Elements (ENCODE)], whereas other, smaller-scale programs and countless individual studies have provided detailed analyses of genetic variation in specific genes or regions of the genome (2–5). In parallel with a growing appreciation of the unexpectedly large scope of genetic variation in the genome, there has been the commercial development of new technical platforms with fixed content that can survey thousands of variants at once. This has expanded the opportunities for analysis, enabling investigation of common germline genetic variation across the genome in what are now known as genome-wide association studies (GWAS) (6). To keep pace with the generation of dense data sets, new computational and analytical processes continue to be developed, usually in an effort to solve a problem of efficiency or excessive computational requirements (Fig. 1).

Historically, genotyping projects were small in scope and typically included analysis of a handful of single nucleotide polymorphisms (SNPs) in one or more candidate genes. Previously, the field focused on sets of SNPs or microsatellite markers that were usually chosen because they satisfied one or more conditions: (i) known or putative function, either altering the coding region or the regulation of the gene or genomic region, (ii) prior functional evidence emanating from the laboratory or a prior association study, or (iii) exploration of regions flanking a locus based on patterns of linkage disequilibrium (LD). While linkage analyses have been successfully employed in family pedigrees to identify mutations, their application to the search for common genetic variants with low to moderate effect has been generally uninformative. In the late 1990s, Risch and Merikangas recognized the importance of shifting from microsatellites to SNPs for mapping common genetic variation in complex diseases, such as cancer or diabetes (7,8).
Figure 1 Milestones in human genomics and disease susceptibility.
It is now possible to look "agnostically" at markers across the genome in a GWAS to find markers for regions of the genome associated with the phenotype or disease of interest. Unlike the candidate gene approach, the criteria for significance in a GWAS must meet a very stringent (low) p-value threshold, either in the scan itself or in a joint analysis with follow-up studies designed to replicate the findings (9,10). The latter is critical because of the large number of false positives observed, which can be due to chance, heterogeneity in outcomes, problems in study design, or genotype performance. In this regard, replication is necessary to determine the true positives (9,11).

The initial success of the GWAS approach has yielded many candidate regions in diabetes, cancer, heart disease, arthritis, and a host of additional diseases. A GWAS can discover novel regions of the genome, not all of which contain suitable candidate genes. In the past year, a series of studies has identified new loci in breast, prostate, and colon cancer in GWAS with replication (12–23). Already, unexpected findings have emerged: (i) a handful of regions at chromosome 8q24 (spanning over 1.5 million base pairs) have been associated with risk for prostate cancer, some of which appear to be important in specific populations and not others (14,17,19); (ii) one of the regions at 8q24 was also discovered in four separate GWAS in colon cancer (15,18,20,23). Each of these large-scale GWAS required intense effort in organizing studies for high-throughput genotyping and association analyses. Substantive effort was required to extract, procure, and prepare samples prior to high-throughput genotype analysis under stringent protocols, followed by iterations of quality control assessment of data stability before beginning the process of sequential analysis (Fig. 2).
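To make the scale of the false-positive problem concrete, the following is a minimal back-of-the-envelope sketch (not from the chapter); the SNP count, the per-test significance level, and the assumption of independent tests are all illustrative.

```python
# Back-of-the-envelope multiple-testing arithmetic for a GWAS
# (the SNP count and alpha are assumed, illustrative values).
n_snps = 550_000          # assumed size of a fixed-content scan
alpha = 0.05              # conventional per-test significance level

# Expected chance findings if every SNP were tested at alpha under the
# global null (tests treated as independent for simplicity).
expected_false_positives = n_snps * alpha
print(f"Expected chance findings at p < {alpha}: {expected_false_positives:,.0f}")

# A simple Bonferroni correction controls the family-wise error rate.
bonferroni_threshold = alpha / n_snps
print(f"Bonferroni per-SNP threshold: {bonferroni_threshold:.1e}")
```

With these assumed numbers, a scan would be expected to produce tens of thousands of nominally significant results by chance alone, which is why joint analysis and replication are emphasized above.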
Figure 2 High-throughput genotype analysis.
GENETIC VARIATION

SNPs are the most common type of genetic variation in the human genome. Current estimates are that there are at least 8 to 10 million SNPs with a minor allele frequency (MAF) greater than 1% in at least one studied population (24–26). Interestingly, most high-frequency SNPs with a MAF greater than 15% to 20% are common to human populations and thus are of great interest for surveying the genome in GWAS (4,27). There are perhaps as many as five million SNPs with MAF greater than 10%, many of which have been selected and maintained in populations. One of the first large-scale surveys of SNPs in the genome indicated that 85% of more than 1.5 million SNPs were common to the three populations surveyed (of European, East Asian, and West African background), whereas a subset of high-frequency SNPs appears to be private to a single population (27). Historical bottlenecks associated with geographical isolation and evolutionary selection have shaped the pattern of common genetic variation; populations of African background have a greater number of common variants, and their segments of LD are shorter (28). In some cases, substantive differences in the allele frequencies of SNPs between populations can reflect major differential and regional selective pressures; infectious diseases such as malaria, or environmental stresses such as temperature or diet, could have shaped the pattern of common genetic variation (29,30).

The majority of SNPs appear to be functionally silent and have been maintained on the backbone of an inherited block of DNA through generations. Already, a small subset of SNPs that can alter the function or expression of a gene, known as "functional SNPs," has been investigated (31,32). Moreover, there are thousands of common SNPs predicted to change the amino acid sequence of proteins, yet only a small subset has been shown to be functionally significant. It is also notable that coding regions represent only approximately 2% of the genome, but a substantially larger percentage of the genome is transcribed, suggesting that variation can alter gene regulation and expression in more locations than previously anticipated (33,34). In silico tools have been generated to predict the effect of coding substitutions, but the predictive performance of these tools has not been sufficiently validated. Overall, it has been estimated that there could be between 50,000 and 250,000 SNPs that are either nonsynonymous coding variants or regulators of gene expression or splicing (31,32).

Even though the number of "common" SNPs approaches 10 million, this still represents less than 0.2% of the genome overall. Most SNPs are not inherited independently of the surrounding variants but instead are passed from one generation to the next on ancestral chromosomal segments (haplotypes) (26). It is possible to exploit estimates of LD to choose a fraction of SNPs that can monitor untested variants. Haplotypes can be estimated on the basis of LD between SNPs, which can be used to decrease the requirements for surveying across the genome.
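As a minimal illustration of the allele-frequency bookkeeping behind a term such as MAF, the snippet below derives allele frequencies from genotype counts; the counts are hypothetical.

```python
# Compute the minor allele frequency (MAF) from genotype counts
# (hypothetical counts for a biallelic SNP with alleles A and a).
n_AA, n_Aa, n_aa = 412, 86, 2   # assumed genotype counts

n_alleles = 2 * (n_AA + n_Aa + n_aa)
freq_a = (2 * n_aa + n_Aa) / n_alleles   # frequency of allele a
maf = min(freq_a, 1 - freq_a)
print(f"freq(a) = {freq_a:.3f}, MAF = {maf:.3f}")
```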
Figure 3 Genetic association testing: finding markers.
This indirect approach is predicated on finding markers only, relegating the search for causal or functional variants to follow-up of regions well marked by one or more "tag" SNPs (Fig. 3) (35,36). Quantitative methods have been developed to minimize the number of SNPs required to represent common haplotypes (37,38). On the basis of the size and complexity of LD patterns, the indirect approach using "tagging SNPs" has emerged as the preferred approach for scanning across the genome (39).

It is still too early to estimate the scope of uncommon, or rare, single-base sequence substitutions in the genome. Preliminary indications are that there are many more uncommon variants than common SNPs, but it will not be until we can compare a large number of complete sequence assemblies that an accurate assessment will be available (5). On the other hand, over 2000 rare variants, often known as disease mutations, have been reported in classical Mendelian disorders, namely, diseases with strong familial components that can be traced to highly penetrant mutations. The latter are cataloged in the database Online Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM/).

Recently, there has been intense interest in characterizing structural variation in the genome, which is defined as either cytologically visible or, more commonly, submicroscopic variants (40,41). These can include deletions, insertions, duplications, and large-scale copy number changes and are collectively known as copy number variations (CNVs). In particular, perhaps a tenth of genic regions have at least one paralog elsewhere (42); that is, the same sequence resides on different chromosomes or on remotely different segments of the same chromosome because of sequence duplication. Positional inversions and translocations, also forms of structural variation, can occur but less frequently. Current efforts to map CNVs are ongoing, and the results so far suggest that CNVs are more common than previously reported (40). An unexpected fallout of the International HapMap Project was the observation that a substantial percentage of SNPs that failed quality control resided in regions now known to be enriched for CNVs (3,43). However, the total number of common CNVs, namely, those with an estimated frequency of greater than 5% in a population, is still relatively low. Large-scale efforts to characterize the breadth and frequency of CNVs have been hampered by technical challenges, but recent efforts have begun to establish standards for the identification, validation, and reporting of CNVs (40).
Because of the daunting technical problems that have plagued the field thus far, new advances in techniques (e.g., tiling arrays, paired-end sequencing, and dense SNP platforms), together with sequence assembly comparisons, should provide eventual sequence resolution of all types of variation and in turn set a standard for future genotype platforms for the analysis of CNVs.

CNVs can be repeated elsewhere in the genome, either contiguously or on another chromosome. This has profound implications for SNP assay design because the current method for assaying SNPs is based on amplification of the local sequence surrounding the SNP of interest. Redundant sequence, either locally or elsewhere in the genome, is amplified, and if there is a polymorphism between at least two different segments, the fidelity of the SNP assay is undermined. Thus, great care must be exercised in designing SNP assays, and in particular, attention should be paid to segments in or near known CNVs.

The international public database for SNPs is dbSNP (www.ncbi.nih.gov/SNP/), a repository with over eight million human SNPs (44,45). Though the majority of entries are reliable SNPs, a subset of reported variants, perhaps 15%, are actually monoallelic and represent genotyping or, more likely, sequence-tracing errors (44,45). Major international efforts, such as the SNP Consortium and the International HapMap Project, have deposited SNPs verified by a genotyping assay, but only a small percentage have been verified by sequence analysis. Overall, these strategies have been biased toward high-frequency variants and poorly represent SNPs with MAF less than 5%. For CNVs, there are two major international databases, the Database of Genomic Variants (http://projects.tcag.ca/variation/) and the Human Structural Variation Database (http://humanparalogy.gs.washington.edu/structuralvariation/). Nearly a dozen additional databases provide further information on CNVs.

GENOTYPE OPTIONS

Since the industry standard for genotyping is an approach that targets specific base(s) after amplification of the flanking region using unique oligonucleotides, the analysis of an SNP or a CNV requires careful assessment of the unique flanking sequence. Failure to recognize CNVs or neighboring SNPs can undermine the fidelity of the assay, sometimes biasing allele calling (Fig. 3). It is important to recognize that genotyping differs from microarray expression analysis: the profile of all expressed messenger ribonucleic acids (mRNAs) can be captured with an oligo(dT) primer, whereas the unique sequence context of the SNP or CNV has to be used to assay the specific variant.

Although significant enthusiasm has been generated by the availability of a rapidly increasing set of common SNPs, great care must be taken in the design of an SNP assay (46). Since SNP frequencies, and in some cases presence, vary by population, assessment of the sequence context (i.e., the sequence flanking the target SNP) is critical, particularly to avoid "neighboring" SNPs that can alter performance. A previously unknown SNP underlying a primer or probe can erroneously alter the performance of an SNP assay (Fig. 4) (47). Often, an adjacent SNP, usually within 50 base pairs, can prevent the design of a robust and reproducible assay. Design issues [e.g., guanine-cytosine (GC) content, neighboring SNPs] represent one challenge, and as mentioned above, paralogous sequence can undermine the fidelity of the assay. Clearly, some regions of the genome cannot be uniquely amplified and thus cannot be assayed (42).
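The caution about neighboring SNPs under a primer or probe can be expressed as a simple positional screen. This sketch is only illustrative: the 50-base-pair window follows the rule of thumb mentioned above, and checking a dbSNP-style list of known positions is an assumption for the example, not a procedure described in the chapter.

```python
# Screen a candidate assay's flanking windows for known neighboring SNPs.
# Positions are hypothetical 1-based coordinates on one reference strand.

def neighboring_snps(target_pos, known_snp_positions, window=50):
    """Return known SNPs within `window` bp of the target (excluding it),
    i.e., positions that could sit under a primer or probe."""
    return sorted(p for p in known_snp_positions
                  if p != target_pos and abs(p - target_pos) <= window)

known = {10_150, 10_183, 10_201, 10_460}   # assumed dbSNP-style positions
target = 10_200                            # SNP the assay will interrogate

clashes = neighboring_snps(target, known)
if clashes:
    print(f"Redesign advised; neighboring SNPs at: {clashes}")
else:
    print("No known SNPs within the primer/probe window.")
```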
Optimization of SNP assays is critical for minimizing error in genotyping, which has been estimated at less than 0.4% for loss of a heterozygous marker in well-designed assays.
Figure 4 Effect of an unsuspected neighboring SNP on allele calling. An example of an SNP (rs16941) in the SNP500Cancer database in which a neighboring SNP, found primarily in East Asians, undermines the fidelity of the assay, leading to incorrect allele calls. Abbreviation: SNP, single nucleotide polymorphism.
For instance, the National Cancer Institute (NCI) SNP500Cancer program provides optimized genotype assays that are verified on two different platforms (sequence analysis is usually one of the two), which provides an opportunity for the discovery of unknown SNPs (47). Because of the large number of SNPs, mining genomic databases such as dbSNP or the University of California Santa Cruz (UCSC) browser can be a formidable task. For candidate gene analysis, the Web site Genewindow (genewindow.nci.nih.gov) offers an example of a tool for gene-centric visualization of genetic variation across a locus, whereas the HapMap Web site (http://hapmap.org) provides a comprehensive set of common SNPs already assayed in three reference populations (48). This resource rapidly enables assessment of LD across a region using tools such as Tagger (http://www.broad.mit.edu/mpg/tagger/) or TagZilla (http://tagzilla.nci.nih.gov/).

The scope of the project is the primary determinant for the choice of the genotyping platform(s) that will optimally address the goals of the study. There is no single SNP platform that can easily interrogate both a single SNP and, at a distinct time, one million SNPs. Thus, it is necessary to have more than one technology available to conduct large-scale genotyping. Selection of an appropriate SNP platform involves considerations of cost, flexibility in the choice and substitution of SNP assays, commercial availability, and magnitude of assay throughput. High-throughput genotyping efforts, especially of the extreme variety that generate millions of genotypes per month, require sophisticated robotics for efficient laboratory flow.
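The trade-offs among cost, throughput, and scale can be roughed out numerically; the sample size, SNP counts, and per-genotype prices below are purely hypothetical placeholders, not vendor figures.

```python
# Rough platform comparison for a study of 2,000 samples (assumed numbers).
n_samples = 2_000

# platform -> (SNPs per sample, assumed cost per genotype in US$)
platforms = {
    "single-SNP assay":      (25,      0.50),
    "mid-plex custom panel": (1_500,   0.05),
    "fixed-content chip":    (550_000, 0.001),
}

for name, (n_snps, cost_per_gt) in platforms.items():
    genotypes = n_samples * n_snps
    total_cost = genotypes * cost_per_gt
    print(f"{name:22s} {genotypes:>13,d} genotypes  ~${total_cost:>12,.0f}")
```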
Figure 5 Progress in genotyping technology.
In recent years, the availability of alternative technologies and platforms has increased dramatically; these options have substantially lowered the price of genotyping and accelerated genotyping throughput (Fig. 5). Options for genotyping technology span a wide spectrum, from single-SNP detection to thousands of SNPs in one reaction. Consequently, the technical capacity to generate genotypes requires sophisticated bioinformatics to handle both the size and complexity of data sets.

The first laboratory methods were manual, such as restriction fragment length polymorphism (RFLP) analysis, and have now given way to approaches that include differential hybridization, primer extension, ligation reactions, and allele-specific probe cleavage. Most commercial systems currently employ two-color technology because nearly all SNPs are biallelic. Triallelic SNPs are quite rare and require both validation and special assays to interrogate. The widely used single-SNP assay is TaqMan™ (Applied Biosystems, California, U.S.), which uses a single enzymatic step with universal reaction and thermocycling conditions; it is based on the unique binding of a probe to the target sequence followed by primer extension. Allelic discrimination is determined by selective annealing of exactly matching probe and primer sequences, which in turn generates an allele-specific fluorescent signal. The reaction is capped by modification of a nonfluorescent quencher with a minor groove binder. The TaqMan system can also be used for real-time assessment of copy number or measurement of semiquantitative differences.

There is a range of technologies available for analyzing between 5 and 100 SNPs in parallel, but the economy of cost and scale is certainly less than that of the extreme genotyping platforms described below. Optimization of assays represents a significant challenge, and there are few commercially available sets, requiring investigators to design and optimize assays for each project. The basic principle for each relies on amplification of the surrounding region before interrogating the SNP of interest. Assay designs are based on direct oligonucleotide ligation with probe fluorescence detection, single-base sequencing, or chip-based matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry. Several technologies have been developed to examine a thousand or more custom SNPs.
For instance, the custom bead-array technology in solution by Illumina (California, U.S.) enables custom design of more than 1500 SNPs with excellent performance, including analysis of high-quality DNA generated by whole-genome amplification (WGA) approaches.

The age of high-density genotyping systems, also known as "extreme genotyping," has commenced, accelerating the discovery of new loci in GWAS. Because of cost, there is little flexibility in choosing SNPs for inclusion in commercial panels that can scan the genome. There are several technologies that can efficiently genotype between 500,000 and 1 million SNPs at once, using hundreds of nanograms of high-quality DNA. Preparation of optimal samples is critical for high performance because decreases in per-sample call rates track with increasing error rates in genotype calls. Therefore, detailed and reproducible pregenotyping quality control measures are necessary to generate high-quality data. There are two transportable high-density genotyping platforms that require extensive bioinformatic support to achieve analytical capabilities of between 500,000 and 1 million SNPs as well as monoallelic probe content to interrogate CNVs using intensity of signal. For each of the extreme genotyping technologies, it is challenging to assay SNPs that reside close together (within 60 or fewer nucleotides); assays often cannot be tailored to accommodate these circumstances. It appears that for each platform a subset of the SNP assays does not perform well enough because of either previously unappreciated neighboring SNPs or local sequence peculiarities. There has been considerable controversy surrounding the software used to call the genotypes analyzed on the new dense chips, leading some groups to develop open-access products.

The Affymetrix (California, U.S.) microchip system uses a simplification of the genome by restriction digest with at least two common restriction endonucleases. After universal adapting linkers are added, the reaction is amplified prior to fragmentation, labeling, and analysis on a microchip using an address system. The initial 500K product had been partially designed to space SNP markers across the genome, with higher density across genes. The newest product, 6.0, provides not only a dense set of SNPs (augmenting the previous chip) but also monoallelic probes chosen to monitor CNVs across the genome. Some regions of the genome are not well addressed by this technology because of a paucity of restriction enzyme sites, thus limiting the opportunity to represent SNPs well. The Illumina (California, U.S.) system uses an allele-specific gap-fill process followed by ligation and amplification prior to a readout on a bead-based capillary microchip system. A third technology is available from Perlegen Sciences (California, U.S.) on a contract basis only; it is a polymerase chain reaction (PCR)-based sample preparation applied to high-density oligonucleotide arrays consisting of short DNA probes, synthesized on a glass surface and used to determine genotypes with great redundancy. The successive designs of the HumanHap 300, HumanHap 500, and HumanHap 1 Million have been based on selecting tagSNPs derived from analysis of the HapMap II data. The content of commercial SNP chips continues to evolve toward improved genetic coverage of the entire genome, not only for populations with histories similar to the set of Caucasians used in HapMap but also for populations similar to those of West Africa (Yoruba) and East Asia.
The choice of chip is driven by considerations of cost, coverage for the population to be analyzed, available technology, and sample quality. Whole-genome-amplified DNA is not optimal, but there are considerable published data for the Affymetrix system, whereas experience with the Illumina chips is accumulating. The scientific debate over the choice of chips has focused on the extent of genomic coverage, based on the HapMap II set of SNPs with MAF greater than 5% (4,49). Figure 6 illustrates the minimum LD for any SNP assay, assessed by the coefficient of correlation r² for a two-SNP comparison. Many investigators have embraced a threshold of r² greater than 0.8, but it appears that it is possible to monitor regions with slightly lower r².
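A minimal sketch of the two-SNP r² computation that underlies these coverage estimates, using the standard definitions of D, D', and r²; the haplotype frequencies are invented for illustration.

```python
# Two-SNP linkage disequilibrium from phased haplotype frequencies.
# Alleles A/a at locus 1 and B/b at locus 2; frequencies are hypothetical.
p_AB, p_Ab, p_aB, p_ab = 0.50, 0.10, 0.05, 0.35

p_A = p_AB + p_Ab            # allele frequency of A
p_B = p_AB + p_aB            # allele frequency of B

D = p_AB - p_A * p_B         # raw disequilibrium coefficient
D_max = min(p_A * (1 - p_B), (1 - p_A) * p_B) if D > 0 else \
        min(p_A * p_B, (1 - p_A) * (1 - p_B))
d_prime = D / D_max          # normalized D'
r2 = D**2 / (p_A * (1 - p_A) * p_B * (1 - p_B))

print(f"D = {D:.3f}, D' = {d_prime:.3f}, r^2 = {r2:.3f}")
```

In the tagging framework, an untyped SNP in strong LD (r² above the chosen threshold) with a typed "tag" SNP is considered covered by the chip.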
Multimarker strategies have been proposed for analyzing SNPs, particularly since LD patterns can be complex; for instance, monitoring selected regions can be more efficient with two or more SNPs considered together. It is also notable that the population to be studied should be an important consideration in deciding which product to choose, particularly as it relates to the issue of genomic coverage.

QUALITY CONTROL IN THE LABORATORY

The success of a genotyping study begins with the efficient and meticulous handling of the samples. Close coordination between the laboratory performing the extraction and the biorepository storing the DNA samples is necessary to guarantee reliable results (50). For instance, errors in the dilution of DNA following extraction, storage, and transport can undermine the integrity of the study (51). Thus, standard operating procedures (SOPs) are mandatory and should be regularly reviewed to maintain optimal practice, particularly as it relates to regular review of assay performance. Optimization of assays with standard control samples should be included in the workflow, and the results must be reviewed regularly. Genomic DNA of poor quality reduces completion rates and concordance, suggesting that DNA quality can alter the accuracy of genotyping (46).
Figure 6 Genomic coverage of fixed-content extreme-genotyping platforms. Graphic representation of the minimum LD (estimated by r²) for SNPs in HapMap II (Build 22) with MAF greater than 5%. Analysis is by continental populations and restricted to unrelated individuals: CEU (north European background; n = 60), YRI (Yoruba of West Africa; n = 60), and JPT/CHB (Japanese and Han Chinese of East Asia; n = 45 and n = 45). Abbreviations: SNP, single nucleotide polymorphism; MAF, minor allele frequency; LD, linkage disequilibrium.
SOPs should be in place for quantification and assessment of DNA quality. The quantity of DNA can be interrogated by spectrophotometric measurement of DNA optical density, by PicoGreen (Turner BioSystems, California, U.S.) analysis, or by real-time PCR analysis using a standardized TaqMan assay (52). The latter can provide a preliminary test of sample quality as it relates to robust analysis in a high-throughput laboratory. Spectrophotometry and the PicoGreen assay measure the total DNA present, regardless of source or quality, whereas a real-time PCR assay measures the total "amplifiable" human DNA. Establishing DNA quantity by real-time PCR is critical for DNA from buccal swabs, cytobrush samples, or other nonblood sources, particularly as it relates to estimating the amount of competing nonhuman DNA. Even small differences between these techniques are important in assessing the amounts of single- and double-stranded DNA because accurate quantification is critical for optimizing the genotyping results.

In the high-throughput setting, there should be strong consideration for DNA fingerprinting of each sample with either a set of SNPs (probably more than 60 with high MAF) or a forensic panel of 15 short tandem repeats and amelogenin, also known as the AmpFLSTR Identifiler assay (Applied Biosystems, California, U.S.). The former can be useful for assessing sample quality for the specific technology used for extreme genotyping. Both can be useful for identifying contaminated samples as well as those likely to fail on the chip technology (53). Moreover, the individual profiles can be useful for verifying known duplicates and identifying unexpected duplicates. When the latter are observed, investigation should consider not only pregenotyping laboratory or informatic errors but also laboratory errors with plates or reagents (46).

For the last five years, there has been intense interest in using WGA technology to rescue epidemiology studies with scant amounts of DNA (54–56). While the results have been encouraging, WGA has been widely used with varying results, partly because of hard lessons learned concerning the quality and quantity of input DNA. If performed optimally, WGA can generate large quantities of DNA for genotype assays, but with the caveat that approximately 5% to 8% of the genome is not faithfully represented. Regions with high GC content and telomeric regions are especially problematic, and data pertaining to these regions should be cautiously interpreted. With genomic technologies having evolved to permit the study of thousands of SNPs simultaneously from small quantities of DNA, the temptation to use WGA DNA in GWAS is great, but the performance does not reach the high standard observed with native DNA. Furthermore, efficiencies in WGA have generated considerable enthusiasm but have not yet reproducibly amplified the entire genome nor recaptured heavily degraded or damaged DNA. Two different approaches have been commercially optimized: multiple displacement amplification, which utilizes a high-performance bacteriophage φ29 DNA polymerase with degenerate hexamers, and generation of libraries of 200 to 2000 base-pair fragments created by random chemical cleavage of genomic DNA, followed by ligation of adaptor sequences to both ends and PCR amplification (54). Though there has been effort to amplify a spectrum of DNA sources, including whole blood, dried blood, buccal cell swabs, cultured cells, and buffy coat cells, varying degrees of success have been reported.
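Returning to the spectrophotometric step above, here is a minimal sketch using the standard conversion for double-stranded DNA (one A260 unit ≈ 50 µg/mL); the absorbance readings and dilution factor are hypothetical. As the text stresses, such a measurement reflects total DNA rather than amplifiable human DNA.

```python
# Spectrophotometric DNA quantification (standard dsDNA conversion factor).
A260, A280 = 0.084, 0.046    # assumed absorbance readings
dilution_factor = 50         # assumed dilution of the measured aliquot

conc_ug_per_ml = A260 * 50.0 * dilution_factor   # 1 A260 unit ~ 50 ug/mL dsDNA
purity = A260 / A280                             # ~1.8 expected for clean DNA

print(f"[dsDNA] ~ {conc_ug_per_ml:.0f} ug/mL, A260/A280 = {purity:.2f}")
```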
Under optimal conditions, the expected yield approaches 10,000-fold amplification of genomic DNA overall. Many laboratories have observed that WGA of water control specimens generates a small, monoallelic signal, which can be called as a single allele. The mid-range Illumina 1536-plex and Affymetrix chips have been shown to perform with WGA DNA, but with an increased loss of heterozygosity (57). This loss underscores the care that must be given to both the quality control analysis and the software programs for automating calls in high-throughput genotype analysis.
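A brief illustration of what a 10,000-fold yield implies for chip genotyping that consumes hundreds of nanograms per assay; the input mass and per-chip requirement are assumed values.

```python
# What a ~10,000-fold WGA yield implies for chip assays (assumed numbers).
input_ng = 10                     # assumed genomic DNA input
yield_fold = 10_000               # upper-end amplification cited in the text
chip_requirement_ng = 500         # assumed "hundreds of nanograms" per chip

output_ng = input_ng * yield_fold
print(f"~{output_ng / 1000:.0f} ug product; enough for "
      f"~{output_ng // chip_requirement_ng} chip assays")
```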
The design of all molecular epidemiology studies should include undisclosed duplicates taken from the same sample and, if possible, replicates of different samples taken from the same individual. Duplicate testing is necessary to assess the quality of the DNA, the extraction process, and its prior storage. For the new extreme genotyping technologies, the genotype concordance between duplicates usually exceeds 99.5%. Errors in genotyping, mainly due to loss of one of the heterozygous alleles, occur in well below 1% of samples for commercial and academic platforms and techniques of the highest quality, which should not significantly impact well-powered studies (53,58). If SOPs are followed closely, completion rates should be greater than 95% for most studies but may be slightly lower depending on the quality of the genomic DNA (59). Completion rates below 90% should raise concern about technical or analytical deficiencies, prompting repeat genotype analysis. In GWAS, because so many false-positive results are observed, some of which could be due to genotype error, it is recommended that a second technology, such as TaqMan or sequencing, be used to verify accuracy and establish concordance (9). Though some have advocated testing the fit to Hardy-Weinberg proportions [so-called Hardy-Weinberg equilibrium (HWE) testing], such testing can catch major genotype errors but should probably not be used as a stringent threshold for excluding SNPs from analysis.

INFORMATICS

Data management and analysis of dense SNP and CNV data sets represent a formidable challenge in conducting association studies. The efficiency and effectiveness of a genotyping laboratory are based on the flow of information, from the choice of markers, through the laboratory analysis (including quality control/quality assessment), to the presentation of data sets for analysis in preparation for publication. Specialized resources, including highly trained personnel, are required to generate and manage both laboratory and analytical data. High-throughput laboratory processing requires a specialized Laboratory Information Management System (LIMS) to track samples, assays, reagents, equipment, robotics, and processes through the entire flow of the laboratory. Particular attention must be given to the details of laboratory workflow, often beginning in the biorepository and running through the delivery of final genotype or copy number variant reports. The LIMS captures the movement of information in a relational database, incorporating the results of experimental data linked directly to in silico information. These annotations include the specific genomic coordinates and genotype assay, which ensures the fidelity and accuracy of the genotype analysis. LIMS software should be subjected to rigorous quality-control and quality-assurance checks. Select tasks, such as real-time monitoring, assay reproducibility, and control concordance, are validated regularly to ensure high performance of stable data sets. Standard data management techniques for quality control within a study and cross-validation (e.g., of the performance of technical platforms) against prior studies can ensure continued high-performance genotype analysis. Dense data sets, such as those from GWAS, comprise unprecedented amounts of genotype data, frequently billions of genotypes per study, and require scalable computational solutions.
Already, two general suites of tools have been developed and are freely available to accelerate the pace of research: PLINK and Genetic Library Utilities (GLU) (60). The first tool developed, PLINK, now in version 1.0, is a free, open-source whole-genome association analysis toolset developed to conduct basic, large-scale analyses in a computationally efficient manner. The primary purpose of PLINK is to perform analysis of genotype/phenotype data; there is no support for the steps prior to the final analysis build, namely, assistance with study design, generation of genotype calls from raw data, and quality control/quality assessment of genotype data sets.
A new suite of tools, GLU, has been developed to address issues related to storage, management, quality control, and genetic analysis. This is also an open-source framework and software package designed to effectively handle billions of genotypes. Data management features include the ability to import, export, merge, and split genotype data among several common formats and standards, as well as the capacity to filter based on powerful criteria for inclusion and exclusion as part of the quality control/quality assessment process. This suite can be used to perform all the critical steps in the generation of builds (see below). For instance, from large, dense data sets, one can quickly identify expected and unexpected duplicates, within and between studies, as well as verify sex determination. It also allows estimation of LD using TagZilla, a high-performance tagger-like application with the ability to augment SNPs from set panels with optimal tagSNPs using flexible criteria, including design scores from major genotyping vendors (http://tagzilla.nci.nih.gov/) (36).

High-throughput genotype analysis requires an iterative process to generate the final analytical data set used for publication-grade analyses and public posting (Table 1). When dense genotype data sets are generated, particular effort must be directed at documenting the serial steps required to filter the data after the genotype-calling algorithm has been finalized. After determining which samples fail to reach an acceptable threshold for calling SNPs (e.g., determination of a stable genotype call), a similar analysis should assess the performance of each SNP assay. An assessment of HWE can identify egregious genotype errors but should not be used to filter SNPs for lack of fit at p values greater than 0.001 (61–63). Once the failed samples and assays are removed, most experts would recommend a second reclustering of the data, which should result in a small number of additional called genotypes being reclaimed. At this point, it is critical to check known duplicates while investigating unexpected duplicates; removal of duplicate results should be done before conducting analyses to determine cryptic relatedness (first- and second-degree relatives). For most studies, two complementary analyses are conducted to identify individuals with a high degree of admixture: STRUCTURE analysis and principal component analysis (PCA) (64,65). For the latter, eigenvectors for each subject can be used to adjust analyses and better account for subtle differences in population structure. On the other hand, STRUCTURE analysis should identify individuals with greater than 10% or 15% admixture, who might be excluded from the study. It should be noted that admixture mapping can actually take advantage of the differences in structure but necessitates a specific design discussed elsewhere (66).
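A minimal sketch of the Hardy-Weinberg fitness check used during build generation, in its simplest chi-square form (an exact test is often preferred when expected counts are small); the genotype counts are hypothetical.

```python
# Chi-square goodness-of-fit test for Hardy-Weinberg proportions.
# Genotype counts are hypothetical; 1 degree of freedom for a biallelic SNP.
from scipy.stats import chi2

n_AA, n_Aa, n_aa = 840, 290, 45
n = n_AA + n_Aa + n_aa
p = (2 * n_AA + n_Aa) / (2 * n)          # frequency of allele A

expected = (n * p * p, n * 2 * p * (1 - p), n * (1 - p) * (1 - p))
observed = (n_AA, n_Aa, n_aa)

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_value = chi2.sf(chi_sq, df=1)
print(f"chi2 = {chi_sq:.2f}, p = {p_value:.3g}")
```

Consistent with the caution above, a departure flagged by this test is a prompt to inspect cluster plots and assay performance, not an automatic reason to discard the SNP.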
Table 1 Major Considerations for Generation of a Final, Publication-Grade Build of High-Density Genotype Data

Filter out samples with inadequate completion rate (<90%)
Filter out SNP assays with inadequate call rates (<90%)
Determination of fitness for Hardy-Weinberg proportion
Determine expected and unexpected duplicates
Assess concordance between duplicates
Search for cryptic relatedness between subjects
Assessment of population substructure
Determine admixture (STRUCTURE analysis)
Estimate population stratification (PCA)
Recluster genotype calls
Assess fidelity of significant genotype calls with comparison to a second technology

Abbreviations: PCA, principal component analysis; SNP, single nucleotide polymorphism.
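A deliberately simplified sketch of the first filtering steps from the table above: per-sample completion rate, per-SNP call rate among retained samples, and duplicate concordance. The data structures, toy genotypes, and thresholds are illustrative only.

```python
# Simplified build-filtering sketch over a samples x SNPs genotype matrix;
# None marks a failed call. Toy data and thresholds are illustrative only.

def completion_rate(calls):
    """Fraction of non-missing calls in a sequence of genotypes."""
    return sum(c is not None for c in calls) / len(calls)

def concordance(a, b):
    """Agreement between two call vectors over jointly called SNPs."""
    joint = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return sum(x == y for x, y in joint) / len(joint)

genotypes = {  # sample -> calls at 10 SNPs (toy data)
    "S1":     ["AA","AG","GG","AA","AG","AA","GG","AG","AA","AG"],
    "S2":     ["AA", None, None, None, None, None, None, None, None,"AG"],
    "S1-dup": ["AA","AG","GG","AA","AG","AA","GG","GG","AA", None],
}

# 1. Drop samples with completion rate < 90%.
kept = {s: g for s, g in genotypes.items() if completion_rate(g) >= 0.90}

# 2. Drop SNPs with call rate < 90% among the kept samples.
kept_snps = [i for i, col in enumerate(zip(*kept.values()))
             if completion_rate(col) >= 0.90]

# 3. Check concordance between known duplicates.
dup_conc = concordance(genotypes["S1"], genotypes["S1-dup"])

print("samples kept:", sorted(kept))   # S2 fails the completion filter
print("SNPs kept:", kept_snps)         # the last SNP fails its call rate
print(f"duplicate concordance: {dup_conc:.2f}")
```

In practice these steps are applied iteratively, with reclustering and re-assessment after each round, as the surrounding text describes.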
CONCLUSION

Recent developments in human genomics have energized the field of molecular epidemiology and established a foundation for dissecting the contribution of genetic variation to human disease. The struggle to generate a sufficient number of markers has given way to the daunting challenge of analyzing dense data sets. No longer are we concerned with whether or not a signal has been detected; instead, the problem lies in sorting out the true positives amid a tsunami of false positives. Computational solutions are needed to handle this flood of information so that we can dissect the contribution of germline variation to human disease. The bottleneck has moved from the laboratory assay to the data-analytical phase, where the major challenge now resides in the mining of enormously dense data sets. Far greater effort must now focus on the search for meaningful effects of individual genotypes, haplotypes, and, ultimately, interactions between genetic markers. Replication and validation have become critical steps in establishing genetic markers associated with human disease. However, to conduct the sequence of studies, decisions must be made that ensure optimal design, quality genotyping, and assembly of a final data set ready for sophisticated association analyses.

For the near future, research into genetic susceptibility will continue to focus on SNP-based genotyping techniques, but within the next five years, high-throughput sequencing technology should assume a prominent role in human genomics, perhaps supplanting genotyping technologies. Though high-throughput sequence analysis should enable the investigation of both common and uncommon genetic variants, providing an opportunity to identify a comprehensive set of variants, it will be daunting to search through the genetic variants to establish the credible associations. In this regard, the bioinformatic challenges will be particularly formidable: to develop paradigms for interpreting the significance of rarer variants not only in pedigrees but also in unrelated populations (67). When this occurs, it will be possible to comprehensively assess the contribution of all forms of genetic variation to human disease and perhaps develop suitable models appropriate for clinical implementation. However, the latter will emerge from a sequence of studies that take into account population genetics history, public health implications, and clinical paradigms designed to protect the confidentiality of individuals.
REFERENCES

1. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 2004; 431(7011):931–945.
2. The International HapMap Consortium. The International HapMap Project. Nature 2003; 426(6968):789–796.
3. The International HapMap Consortium. A haplotype map of the human genome. Nature 2005; 437(7063):1299–1320.
4. Frazer KA, Ballinger DG, Cox DR, et al. A second generation human haplotype map of over 3.1 million SNPs. Nature 2007; 449(7164):851–861.
5. Birney E, Stamatoyannopoulos JA, Dutta A, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007; 447(7146):799–816.
6. Hunter DJ, Thomas G, Hoover RN, et al. Scanning the horizon: what is the future of genome-wide association studies in accelerating discoveries in cancer etiology and prevention? Cancer Causes Control 2007; 18(5):479–484.
7. Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science 1996; 273(5281):1516–1517.
8. Risch N. The genetic epidemiology of cancer: interpreting family and twin studies and their implications for molecular genetic approaches. Cancer Epidemiol Biomarkers Prev 2001; 10(7):733–741.
9. Chanock SJ, Manolio T, Boehnke M, et al. Replicating genotype-phenotype associations. Nature 2007; 447(7145):655–660.
10. Skol AD, Scott LJ, Abecasis GR, et al. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet 2006; 38(2):209–213.
11. Hirschhorn JN, Altshuler D. Once and again-issues surrounding replication in genetic association studies. J Clin Endocrinol Metab 2002; 87(10):4438–4441.
12. Easton DF, Pooley KA, Dunning AM, et al. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 2007; 447(7148):1087–1093.
13. Hunter DJ, Kraft P, Jacobs KB, et al. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet 2007; 39(7):870–874.
14. Freedman ML, Haiman CA, Patterson N, et al. Admixture mapping identifies 8q24 as a prostate cancer risk locus in African-American men. Proc Natl Acad Sci U S A 2006; 103(38):14068–14073.
15. Gruber SB, Moreno V, Rozek LS, et al. Genetic variation in 8q24 associated with risk of colorectal cancer. Cancer Biol Ther 2007; 6(7) (Epub ahead of print).
16. Gudmundsson J, Sulem P, Manolescu A, et al. Genome-wide association study identifies a second prostate cancer susceptibility variant at 8q24. Nat Genet 2007; 39(5):631–637.
17. Haiman CA, Patterson N, Freedman ML, et al. Multiple regions within 8q24 independently affect risk for prostate cancer. Nat Genet 2007; 39(5):638–644.
18. Tomlinson I, Webb E, Carvajal-Carmona L, et al. A genome-wide association scan of tag SNPs identifies a susceptibility variant for colorectal cancer at 8q24.21. Nat Genet 2007; 39(8):984–988.
19. Yeager M, Orr N, Hayes RB, et al. Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat Genet 2007; 39(5):645–649.
20. Zanke BW, Greenwood CM, Rangrej J, et al. Genome-wide association scan identifies a colorectal cancer susceptibility locus on chromosome 8q24. Nat Genet 2007; 39(8):989–994.
21. Stacey SN, Manolescu A, Sulem P, et al. Common variants on chromosomes 2q35 and 16q12 confer susceptibility to estrogen receptor-positive breast cancer. Nat Genet 2007; 39(7):865–869.
22. Broderick P, Carvajal-Carmona L, Pittman AM, et al. A genome-wide association study shows that common alleles of SMAD7 influence colorectal cancer risk. Nat Genet 2007; 39(11):1315–1317.
23. Haiman CA, Le ML, Yamamato J, et al. A common genetic risk factor for colorectal and prostate cancer. Nat Genet 2007; 39(8):954–956.
24. Kruglyak L, Nickerson DA. Variation is the spice of life. Nat Genet 2001; 27(3):234–236.
25. Reich DE, Gabriel SB, Altshuler D. Quality and completeness of SNP databases. Nat Genet 2003; 33(4):457–458.
26. Reich DE, Cargill M, Bolk S, et al. Linkage disequilibrium in the human genome. Nature 2001; 411(6834):199–204.
27. Hinds DA, Stuve LL, Nilsen GB, et al. Whole-genome patterns of common DNA variation in three human populations. Science 2005; 307(5712):1072–1079.
28. Tishkoff SA, Verrelli BC. Patterns of human genetic diversity: implications for human evolutionary history and disease. Annu Rev Genomics Hum Genet 2003; 4:293–340.
29. Hughes A, Packer B, Welch R, et al. Effects of natural selection on inter-population divergence at polymorphic sites in human protein-coding loci. Genetics 2005; 170:1181–1187.
30. Sabeti PC, Varilly P, Fry B, et al. Genome-wide detection and characterization of positive selection in human populations. Nature 2007; 449(7164):913–918.
31. Risch NJ. Searching for genetic determinants in the new millennium. Nature 2000; 405(6788):847–856.
32. Chanock S. Candidate genes and single nucleotide polymorphisms (SNPs) in the study of human disease. Dis Markers 2001; 17(2):89–98.
33. Cheng J, Kapranov P, Drenkow J, et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 2005; 308(5725):1149–1154.
34. Kapranov P, Cheng J, Dike S, et al. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 2007; 316(5830):1484–1488.
35. Carlson CS, Eberle MA, Rieder MJ, et al. Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nat Genet 2003; 33(4):518–521.
36. Carlson CS, Eberle MA, Rieder MJ, et al. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet 2004; 74(1):106–120.
37. Salisbury BA, Pungliya M, Choi JY, et al. SNP and haplotype variation in the human genome. Mutat Res 2003; 526(1–2):53–61.
38. Stephens M, Smith NJ, Donnelly P. A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 2001; 68(4):978–989.
39. Barrett JC, Cardon LR. Evaluating coverage of genome-wide association studies. Nat Genet 2006; 38(6):659–662.
40. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007; 39(7 suppl):S7–S15.
41. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007; 39(7 suppl):S37–S42.
42. Bailey JA, Gu Z, Clark RA, et al. Recent segmental duplications in the human genome. Science 2002; 297(5583):1003–1007.
43. Freeman JL, Perry GH, Feuk L, et al. Copy number variation: new insights in genome diversity. Genome Res 2006; 16(8):949–961.
44. Marth G, Schuler G, Yeh R, et al. Sequence variations in the public human genome data reflect a bottlenecked population history. Proc Natl Acad Sci U S A 2003; 100(1):376–381.
45. Marth GT, Korf I, Yandell MD, et al. A general approach to single-nucleotide polymorphism discovery. Nat Genet 1999; 23(4):452–456.
46. Pompanon F, Bonin A, Bellemain E, Taberlet P. Genotyping errors: causes, consequences and solutions. Nat Rev Genet 2005; 6(11):847–859.
47. Packer BR, Yeager M, Burdett L, et al. SNP500Cancer: a public resource for sequence validation, assay development, and frequency analysis for genetic variation in candidate genes. Nucleic Acids Res 2006; 34(Database issue):D617–D621.
48. Staats B, Qi L, Beerman M, et al. Genewindow: an interactive tool for visualization of genomic variation. Nat Genet 2005; 37(2):109–110.
49. Barrett JC, Cardon LR. Evaluating coverage of genome-wide association studies. Nat Genet 2006; 38(6):659–662.
50. Bonin A, Bellemain E, Bronken EP, et al. How to track and assess genotyping errors in population genetics studies. Mol Ecol 2004; 13(11):3261–3273.
51. Taberlet P, Griffin S, Goossens B, et al. Reliable genotyping of samples with very low DNA quantities using PCR. Nucleic Acids Res 1996; 24(16):3189–3194.
52. Haque KA, Pfeiffer RM, Beerman MB, et al. Performance of high-throughput DNA quantification methods. BMC Biotechnol 2003; 3:20.
53. Morton NE, Collins AE. Statistical and genetic aspects of quality control for DNA identification. Electrophoresis 1995; 16(9):1670–1677.
54. Dean FB, Hosono S, Fang L, et al. Comprehensive human genome amplification using multiple displacement amplification. Proc Natl Acad Sci U S A 2002; 99(8):5261–5266.
55. Bergen AW, Haque KA, Qi Y, et al. Comparison of yield and genotyping performance of multiple displacement amplification and OmniPlex whole genome amplified DNA generated from multiple DNA sources. Hum Mutat 2005; 26:262–270.
56. Bergen AW, Qi Y, Haque KA, et al. Effects of electron-beam irradiation on whole genome amplification. Cancer Epidemiol Biomarkers Prev 2005; 14(4):1016–1019.
57. Paynter RA, Skibola DR, Skibola CF, et al. Accuracy of multiplexed Illumina platform-based single-nucleotide polymorphism genotyping compared between genomic and whole genome amplified DNA collected from multiple sources. Cancer Epidemiol Biomarkers Prev 2006; 15(12):2533–2536.
58. Morton NE, Collins A. Tests and estimates of allelic association in complex inheritance. Proc Natl Acad Sci U S A 1998; 95(19):11389–11393.
59. Ewen KR, Bahlo M, Treloar SA, et al. Identification and analysis of error types in high-throughput genotyping. Am J Hum Genet 2000; 67(3):727–736.
60. Purcell S, Neale B, Todd-Brown K, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 2007; 81(3):559–575.
61. Xu J, Turner A, Little J, et al. Positive results in association studies are associated with departure from Hardy-Weinberg equilibrium: hint for genotyping error? Hum Genet 2002; 111(6):573–574.
62. Wittke-Thompson JK, Pluzhnikov A, Cox NJ. Rational inferences about departures from Hardy-Weinberg equilibrium. Am J Hum Genet 2005; 76(6):967–986.
63. Salanti G, Amountza G, Ntzani EE, Ioannidis JP. Hardy-Weinberg equilibrium in genetic association studies: an empirical evaluation of reporting, deviations, and power. Eur J Hum Genet 2005; 13(7):840–848.
64. Price AL, Patterson NJ, Plenge RM, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 2006; 38(8):904–909.
65. Pritchard JK, Rosenberg NA. Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet 1999; 65(1):220–228.
66. Marchini J, Cardon LR, Phillips MS, et al. The effects of human population structure on large genetic association studies. Nat Genet 2004; 36(5):512–517.
67. Pritchard JK. Are rare variants responsible for susceptibility to complex diseases? Am J Hum Genet 2001; 69(1):124–137.
7

Biomarkers of Exposure and Effect

Christopher P. Wild
Molecular Epidemiology Unit, Centre for Epidemiology and Biostatistics, Leeds Institute of Genetics, Health and Therapeutics, Faculty of Medicine and Health, University of Leeds, Leeds, U.K.
INTRODUCTION

The majority of cancers have a complex etiology. Usually one or more environmental risk factors, such as lifestyle, diet, chemical pollutants, radiation, and infectious agents, are implicated. Individual response to these factors is influenced by genetic background, co-exposures to environmental agents, and other factors, including age, sex, and sociodemographic status. Despite the acknowledged importance of the environment in influencing cancer risk, the precise contribution of individual factors and their interaction, both with each other and with genotype, is difficult to elucidate. This is partially due to the challenges inherent in accurately measuring exposure, not least when the critical period relevant to disease risk may have occurred many years prior to diagnosis (1). It is in response to this need that one of the promises of molecular cancer epidemiology is to provide biomarkers that will refine exposure assessment (2). While the focus of this chapter is on biomarkers, it is important to note that other approaches to refining exposure assessment are also important, including geographic information systems, personal and environmental monitoring, and increasingly sophisticated questionnaires (3,4); it is a combination of tools that is most likely to provide the answers required.

The development of exposure biomarkers, including DNA and protein adducts, was well represented among early studies in molecular cancer epidemiology (5,6). However, in the late 1980s and early 1990s, there was a major shift in effort and resources away from exposure biomarkers toward the conduct of gene-disease association studies, corresponding to the development of the polymerase chain reaction. This technique permitted analysis of genetic polymorphisms using methodology that was simpler, cheaper, less demanding in terms of sampling, of higher throughput, and quicker than the methods required for exposure biomarkers. In addition, genotyping was applicable to the case-control study design, making it suitable for integration into a greater proportion of ongoing epidemiological studies. The accurate assessment of many environmental exposures has remained an outstanding and largely unmet challenge in molecular cancer epidemiology and consequently one that impairs understanding of the complex gene-environment interactions that contribute to the majority of human cancers (7,8).
Table 1  Examples of Biomarkers of Exposure and Effect

Exposure–internal dose
  Carcinogens and their metabolites: urinary biomarkers (1-hydroxypyrene, aflatoxins, tobacco-specific nitrosamines); fecal N-nitroso compounds; organochlorines in serum and adipose tissue; blood lead levels
  Nutrients: urinary sugars; urinary flavonoids; serum vitamins
  Circulating antibodies: antibodies to hepatitis B surface antigen and human papilloma virus antigens

Exposure–biologically effective dose
  DNA adducts: urinary AFB1-N7-gua; bulky DNA adducts; DNA strand breaks (comet assay)
  Protein adducts: aminobiphenyl-hemoglobin; aflatoxin-albumin

Biomarkers of effect
  Genetic alterations: chromosomal aberrations; micronuclei; somatic cell mutations in reporter genes (HPRT, glycophorin A) or proto-oncogenes and tumor suppressor genes (TP53 mutation spectra)
  Altered proteins: altered growth factors (e.g., IGF-1, IGFBPs) and cytokines (e.g., IL-6)
  Altered gene expression: lymphocyte CYP1A1 expression; panels of reporter genes using microarray
This chapter will explore not only biomarkers of exposure but also biomarkers of effect. The latter reflect occurrences subsequent to the initial exposure-related events and in general, but not always, may be more persistent than exposure biomarkers. It is difficult to define precisely the scope of this category of biomarker, but examples include chromosomal alterations, changes in gene expression, altered protein levels (e.g., growth factors, cytokines), and mutations. Other biomarkers that may be termed biomarkers of altered structure or function, such as precancerous lesions, will not be considered here. These different categories of biomarker have been fully described elsewhere (9).

In one sense, there is only a semantic distinction between the categories of biomarkers of exposure and effect because there are no sharp boundaries in the continuum leading from exposure to disease. A DNA adduct, for example, is not comfortably forced exclusively into one or other category. In addition, many of the required properties of biomarkers in both categories are common, e.g., sensitivity, specificity, validity, and reliability. Nevertheless, the categorization, if held lightly, can be helpful both for descriptive purposes and to inform discussions of disease mechanisms in the context of what, by definition, is the interdisciplinary research embraced by molecular epidemiology (9). Examples of biomarkers of exposure and effect are summarized in Table 1. Throughout the chapter the general principles applying to these types of biomarker are illustrated by examples from the literature.

THE VALUE OF BIOMARKERS OF EXPOSURE AND EFFECT

As highlighted above, accurately characterizing an individual's environmental exposure is critical to establishing disease etiology. Misclassification, both of the exposure of primary interest and of potential confounding factors, introduces uncertainty and limits the power of epidemiological studies. This is particularly damaging when the true strength of the association between exposure and disease is modest. In these cases misclassification may blur or completely obscure underlying causal associations. Many environmental exposures of interest occur at low levels but can be ubiquitous, posing further challenges to accurate exposure assessment. In response, biomarkers promise to provide a more objective measure of "true" exposure than previous approaches. The assays used often have great analytical sensitivity and can enable detection and quantification at the low levels occurring in the environment. A good example of the success in this area is the way biomarkers of exposure have been used to study environmental tobacco smoke (10,11).
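The attenuating effect of nondifferential exposure misclassification can be made concrete with a short simulation. The minimal Python sketch below is purely illustrative: the true odds ratio, exposure prevalence, baseline disease odds, and the sensitivity and specificity of the error-prone exposure measure are all assumptions, not values from any study cited here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                  # hypothetical cohort size
true_or = 2.0                # assumed true exposure-disease odds ratio
p_exposed = 0.3              # assumed exposure prevalence
sens, spec = 0.7, 0.9        # assumed sensitivity/specificity of the exposure measure

exposed = rng.random(n) < p_exposed
odds = 0.05 * np.where(exposed, true_or, 1.0)   # baseline disease odds of 0.05 (assumption)
disease = rng.random(n) < odds / (1 + odds)

# Nondifferential misclassification: measurement error does not depend on disease status
measured = np.where(exposed, rng.random(n) < sens, rng.random(n) > spec)

def odds_ratio(e, d):
    a, b = np.sum(e & d), np.sum(e & ~d)
    c, f = np.sum(~e & d), np.sum(~e & ~d)
    return (a * f) / (b * c)

print(f"OR using true exposure:     {odds_ratio(exposed, disease):.2f}")
print(f"OR using measured exposure: {odds_ratio(measured, disease):.2f}")  # attenuated toward 1.0
```

With these assumed error rates the observed odds ratio falls from about 2.0 to roughly 1.5, illustrating how a modest true association can be blurred; a more valid exposure measure, such as a validated biomarker, pulls the estimate back toward its true value.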
Biomarkers of exposure can provide far more, however, to the field of cancer epidemiology than improvement of the exposure metric (Table 2). For example, such biomarkers are also of value in establishing the biological plausibility of an association between a risk factor and disease. If a particular chemical exposure from ambient air is associated with increased risk, the additional information that exposed individuals have higher levels of DNA damage would add support to the exposure-disease association (11). Alternatively, if a particular genetic polymorphism associated with increased cancer risk is also linked to higher levels of a biomarker on the presumed causal pathway, e.g., DNA adducts, then this would provide support for the original association (12). These types of mechanistic data are increasingly being considered in the processes of hazard identification and cancer risk assessment.

Table 2  Applications of Exposure and Effect Biomarkers in Studies of Human Cancer Etiology

  Biological plausibility
  Intervention
  Animal-human extrapolations
  Risk assessment
  Early diagnosis, targeted intervention
  Biomonitoring; biosurveillance

Another valuable application of both biomarkers of exposure and effect is in the evaluation of potential intervention strategies (13). This may be primary prevention to reduce exposure, or more mechanism-based approaches such as chemoprevention. In either case biomarkers can be used as endpoints, permitting a proof of principle to be established in advance of longer-term interventions where precancerous lesions or cancer itself might be the outcome. The proof-of-principle stage may also provide valuable information pertinent to the design of interventions, for example, in establishing appropriate doses of chemopreventive agents.

Biomarkers of exposure and effect can prove valuable tools in extrapolating data from animals to humans (14). The occurrence of the same biomarker responses across species may, for example, indicate the appropriateness of one model compared to another for carcinogen bioassays. Similarly, biomarker data may help in the selection of the most appropriate model systems for initial mechanistic or prevention studies and contribute to the improvement of physiologically based pharmacokinetic models (15).

Biomarkers of exposure and effect may potentially be used to identify individuals at higher risk of development of cancer. This could permit more effective, targeted interventions at the individual level.
This application needs careful interpretation, though, because these events are indicators of early steps in the carcinogenic process; many other factors will impinge on the subsequent risk of progression to malignancy. It is notable that even in experimental animals few studies have examined this relationship. Among those that have, associations have been reported between biomarkers (both sister chromatid exchanges and adducts) and cancer incidence at the group (i.e., dose) level but not at the level of the individual animal (16,17).

Notwithstanding the above caveat, biomarker levels can be used for biomonitoring or biosurveillance in cross-sectional or longitudinal studies, where the prevalence and level of exposure to an environmental carcinogen can be established in a population and act as a signal for the need for interventions. Examples from the occupational field in relation to industrial chemicals are pertinent here (18,19), as are the monitoring of lead or organochlorines and the detection of antibodies to oncogenic viruses.

The types of application of biomarkers of exposure and effect mentioned above hold considerable promise, but there are a number of key questions to be considered that are generic to their application in epidemiological studies. Among these are: what to measure, where to measure, and when to measure? These questions are considered in turn below, with examples drawn from biomarkers across the spectrum considered in this chapter.

WHAT TO MEASURE?

Biomarkers of Exposure

Internal Dose

Environmental chemicals or their metabolites in human tissues and body fluids have been used to measure exposure. In these instances, the chemicals are not bound to a critical target in the cell but provide a measure of exposure and absorption and are thus termed markers of internal dose. This type of biomarker can provide an improved metric of exposure compared with levels in ambient air, food, or water. Examples are 1-hydroxypyrene-glucuronide as a measure of polycyclic aromatic hydrocarbon (PAH) exposure (20) and urinary heterocyclic amine metabolites (21). Certain biomarkers of internal dose that are a product of metabolism may also be used to characterize the metabolic phenotype of an individual (22). However, the role of metabolism and other factors such as absorption does mean there may not always be a simple relationship between external exposure and internal dose. Lipid-soluble organochlorines can be detected in serum and plasma as well as adipose tissue (23,24). Other examples of internal dose markers include hormones or nutrients in body fluids, while from the area of infections and cancer (e.g., hepatitis viruses, human papilloma viruses, Helicobacter pylori), the presence of pathogen proteins or antibodies to these proteins indicates that exposure to the infectious agent has occurred. The use of biomarkers in conjunction with questionnaire data to characterize diet, for example through the measurement of specific nutrients in body fluids, should be an area of significant contribution in the future (25,26).

Biologically Effective Dose

To date a broad range of different DNA adducts have been measured in human biological samples with assays involving 32P-postlabelling, immunoassays, and mass spectrometry, among other techniques (5,6). On the basis of the mechanistic role of DNA adducts in chemical carcinogenesis, these have been referred to as biomarkers of biologically effective dose, i.e., a measure of the amount of the carcinogen reaching the critical
cellular target. Protein adducts, by virtue of sharing bioactivation pathways with DNA adducts, may also be considered measures of biologically effective dose (19,27). However, it is important to keep in mind that, as with certain internal dose biomarkers, adducts reflect not only exogenous exposure but also other processes such as absorption, distribution, metabolism, and DNA repair. As a result, the measured adduct level will be a composite of exposure and these other variables, which will differ among individuals. Both DNA and protein adducts have been successfully applied in molecular epidemiology studies for different chemicals and in different study designs, such that neither can be said to be inherently superior to the other. Sample availability has, however, at times been more of a limitation with DNA than with protein.
Validation of Exposure Biomarkers

Validity of a biomarker has been defined as the (relative) lack of systematic measurement error when comparing the actual observation with a standard (reference) method, which represents the "truth" (28). This is true for biomarkers of exposure and effect, which may both be compared to a "gold standard" of exposure; additionally, biomarkers of effect may be validated against disease outcome and consequently serve as intermediate endpoints and predictors of cancer (29).

A lack of biomarker validity may simply result from analytical error in the assay. However, it may also result from a more complex set of factors that act to distort the relationship between the biomarker and the true exposure or disease outcome. In the case of exposure biomarkers, these factors can include a lack of a dose-response relationship between exposure and biomarker, failure of the biomarker to integrate exposure over the time period of interest, or interindividual variability in the exposure-biomarker relationship. Empirical data are therefore required to establish the validity of a biomarker. Early studies of exposure biomarkers in human subjects were indeed able not only to demonstrate the presence of adducts or carcinogen metabolites but also to quantitatively associate the levels with environmental exposures (20,21,30–32). Such studies, therefore, both established the adequate sensitivity of the analytical methods and contributed to the above process of validation. However, despite the analytical advances that have permitted measurement of such biomarkers in human samples over the last 20 years, the number of cases where the biomarker has been shown to quantitatively reflect exposure remains relatively few.

Unfortunately, exposure biomarkers may be applied to human studies without being fully validated. Proceeding without validation is a risk because some biomarkers do not accurately reflect exposure. For example, urinary aflatoxin P1, a metabolite of aflatoxin B1, is not a good indicator of aflatoxin exposure (33), while ochratoxin A in plasma did not correlate with intake in a detailed duplicate diet study (34). Of course, as mentioned above, a perfect linear relationship between exposure and biomarker will not be expected, both because of analytical measurement error and because of interindividual differences in, for example, metabolism. Nevertheless, one may be misled if biomarkers are applied to epidemiological studies without such validation. A reliably (reproducibly) measured biomarker in analytical terms does not therefore necessarily translate to a valid biomarker in the context of epidemiology.
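In practice, validation often amounts to quantifying how well the biomarker tracks a reference measure of exposure in a dedicated substudy, such as a duplicate diet study. The sketch below is hypothetical: the sample size, dose-response slope, and the sizes of the interindividual (e.g., metabolic) and assay error terms are assumptions chosen only to illustrate the calculation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 60                                                 # hypothetical validation-study size
intake = rng.lognormal(mean=2.0, sigma=0.6, size=n)    # reference ("gold standard") exposure

# Hypothetical biomarker: dose-dependent signal distorted by interindividual
# variation (metabolism, absorption) and multiplicative assay error
biomarker = 0.8 * intake * rng.lognormal(0.0, 0.4, size=n)

rho, p_rho = stats.spearmanr(intake, biomarker)
slope, intercept, r, p_lin, se = stats.linregress(np.log(intake), np.log(biomarker))
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.1g})")
print(f"log-log slope = {slope:.2f} (1.0 would indicate a proportional dose-response)")
```

Both quantities matter: a biomarker can correlate respectably with true exposure while departing from proportionality, and neither property is guaranteed by analytical reproducibility alone.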
Specificity of Exposure Biomarkers

A further consideration in selecting a biomarker of exposure is that of specificity. A chemical-specific assay of carcinogen-DNA adducts, protein adducts, or metabolites will, by definition, be specific for the target chemical. However, while some chemicals will equate to the exposure that is of interest to the epidemiologist, some may not. For example, urinary aflatoxin-DNA adducts are a specific measure of dietary exposure to aflatoxins. However, with benzo(a)pyrene [B(a)P]-DNA adducts, the B(a)P may originate from tobacco smoke, diet, air pollution, or certain occupational exposures. While the marker is chemical specific, giving an accurate reflection of B(a)P exposure, it is not specific to a given environmental exposure. It is often the environmental exposure, however, that is the focus of the epidemiological study, not one specific chemical from multiple sources. This point was illustrated in a study of PAH-DNA adducts in peripheral blood lymphocytes in firefighters (35). While there was no association between adducts and occupational smoke exposure (the main hypothesis), there was a significant correlation with the number of barbecued meals consumed. Here, the dietary source of PAH was more significant than smoke.

Other types of assay are less chemical specific than adducts or metabolites and provide a more general measure of exposure. For example, 32P-postlabelling of "bulky" aromatic DNA adducts may reflect a class of chemicals. The single cell gel electrophoresis ("comet") assay provides an even broader indication of DNA strand breaks, although in combination with DNA repair enzymes it may be targeted to more specific types of damage (36).

Genotoxic and Nongenotoxic Mechanisms

One notable feature of much of the literature on biomarkers of biologically effective dose is the emphasis on genotoxic pathways of carcinogenesis and the associated measurement of DNA and protein adducts. There is an increasing need, however, to consider appropriate endpoints for nongenotoxic exposures, given that many carcinogens work through these alternative mechanisms. One example is the mycotoxin fumonisin B1 (FB1), classed by the IARC as "possibly carcinogenic to humans" (37). FB1 is a structural analogue of the sphingoid bases and as such can act as an inhibitor of ceramide synthase, resulting in alteration of sphingolipid biosynthesis and in the ratio of sphinganine to sphingosine (38). This pathway may also be important in carcinogenesis, and thus the ratio may serve as a biomarker of biologically effective dose (39).

The increasing recognition that environmental and dietary factors act through epigenetic mechanisms (40) presents a new and exciting challenge to the field of molecular cancer epidemiology. One of the major areas of future development is likely to be biomarkers of events such as altered gene methylation or histone modification (41). It is possible that the discoveries from "omics" technologies (see later in this chapter) will increasingly lead to identification of relevant nongenotoxic carcinogenic pathways and stimulate the development of associated biomarkers.

Biomarkers of Effect

A number of different biomarkers of effect have been applied to human studies. Perhaps the most common are measures in peripheral blood cells of genetic alterations, including chromosomal aberrations, micronuclei, and sister chromatid exchanges. In general terms, these biomarkers are nonspecific, reflecting cumulative exposure to a variety of environmental factors. Alternative biomarkers of effect include somatic mutations, either in reporter genes or in the proto-oncogenes and tumor suppressor genes implicated directly in the carcinogenic process.
These mutation analyses may consider only the mutation frequency or may also investigate the pattern of mutations to infer something more specific about their environmental origin. In the case of reporter genes, the analyses are
generally conducted on blood cells, while studies of the proto-oncogenes and tumor suppressor genes may involve target organ cells (biopsies or exfoliated cells) or analysis of circulating plasma DNA. Examples of some of these biomarkers of effect and their association with exposures or disease are described below.
Chromosomal Aberrations, Micronuclei, and Sister Chromatid Exchanges

A number of prospective epidemiological studies have reported positive associations between elevated chromosomal aberrations and increased cancer risk (42–47). These positive findings have generally been limited to associations with cancer overall, rather than with cancer at specific sites, because of the limited number of cases accrued during follow-up. In the most recent study (47), data on a total of 6430 individuals were collated across nine laboratories in central Europe, with follow-up for an average of 8.5 years and 200 cancer cases identified. There was a higher relative risk (approximately 1.8) in subjects in the middle and upper tertiles of chromosomal aberrations compared with the lower tertile. In a similar study measuring micronuclei (48), individuals with medium and high frequencies had a 1.5- to 2-fold increased relative risk compared with the low-frequency subjects. Sister chromatid exchanges have not to date been shown to be associated with increased risk in prospective studies (49). This work on both chromosomal aberrations and micronuclei suggests that these biomarkers do indicate a higher risk of cancer at the population level and could be a suitable endpoint in future etiologic or intervention studies. As with the biomarkers of exposure discussed above, the evidence is not yet available to link the biomarker with risk at the individual level.
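The tertile comparison used in these cohort analyses is straightforward to reproduce. The sketch below uses entirely hypothetical counts (patterned loosely on the overall size of the pooled study described above, 6430 subjects and 200 cancers, but not taken from it) to show how relative risks by tertile of aberration frequency are computed.

```python
import numpy as np

# Hypothetical incident cancers and subject counts by tertile of
# chromosomal aberration frequency (illustrative numbers only)
cases = np.array([44, 78, 78])
n_subjects = np.array([2144, 2143, 2143])

risk = cases / n_subjects
rr = risk / risk[0]                  # relative risk versus the lowest tertile
for tertile, (r, ratio) in enumerate(zip(risk, rr), start=1):
    print(f"tertile {tertile}: risk = {r:.3f}, RR vs tertile 1 = {ratio:.2f}")
```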
Somatic Mutations in Reporter Genes and Cancer Genes

A number of assays have been developed that measure somatic mutation frequency in human population studies. These approaches tend to focus on reporter genes that are not a target for carcinogenesis but represent the mutational burden of an individual due to environmental exposures. Examples of reporter genes that have been widely applied to human population studies are hypoxanthine-guanine phosphoribosyltransferase (HPRT) (50) and glycophorin A (51), permitting investigation of a number of environmental exposures to physical and chemical carcinogens.

In terms of somatic mutations in cancer genes, the TP53 tumor suppressor gene has been a major focus of attention, with considerable efforts made to relate mutation spectra to environmental exposures (52). In the case of dietary aflatoxins, for example, a specific G to T transversion in codon 249 was geographically correlated with dietary aflatoxin exposure (37). An exciting subsequent development was the discovery of the same mutation in circulating plasma DNA and the demonstration that this was linked to an increased risk of liver cancer (53). Other than for aflatoxin, UV sunlight, and cigarette smoke (52), the number of instances to date where this approach has been informative is limited. Nevertheless, interesting data are emerging in other areas, for example, in relation to the potential role of the natural toxin aristolochic acid in the etiology of Balkan endemic nephropathy (54,55). In addition, a highly sensitive mutation assay applied to nontumor material in ulcerative colitis cases showed a high frequency of G to A transitions in inflamed tissue; this type of mutation is consistent with free radical-induced DNA damage (56). Quantitative mutation assays of this type, reflecting specific exposures and applicable to human biopsies, may permit assessment of longer-term past exposure or serve as indicators of increased risk of progression to cancer.

WHERE TO MEASURE?

Biomarkers of exposure and effect have been measured in a number of different biological media, including urine, feces, saliva, plasma, serum, exfoliated cells, white blood cells, biopsies, and other tissue samples. Each of these sources of material may give different qualitative and quantitative information, and it is important to bear in mind the information being sought when different biological materials are used. As described above, a DNA adduct, for example, is believed to be on the pathway from exposure to cancer. The critical dose of that DNA lesion, however, is defined at the tissue, cell, gene, and even DNA sequence level. Other exposure biomarkers, such as protein adducts or internal dose markers, have the additional caveat of not being on the disease pathway, although it is hoped that they are associated with it. These layers of complexity need to be considered when interpreting the information gained from such surrogate molecules or tissues in molecular epidemiology studies.

In practice, biomarkers of exposure and effect can rarely be measured in the target organ. Consequently, measurements are made in more easily available material, typically plasma or serum, white blood cells, or urine; often there are relatively few empirical data to demonstrate the relationship between a biomarker in the target organ and these surrogate materials. Indeed, from animal studies it is known that many carcinogens induce different levels of DNA adducts in different tissues, this being one determinant of the susceptibility of a tissue to carcinogenesis. In a study of four methylating agents, each with a differing primary target organ (colon, esophagus, liver, and lung), DNA adducts were measured in the target organs, the liver, and peripheral blood cells (PBC) (57). While adducts in PBC and the liver showed a relatively consistent relationship for all four agents, the relationship with the target organ varied markedly. This type of correlation between biomarkers in body fluids and internal organs has rarely been addressed in human studies, although opportunities do exist, for example, where procedures such as bronchoalveolar lavage and endoscopy that provide access to target tissues are performed (58–61).

If a biomarker in peripheral blood has been validated in relation to external exposure, then the lack of knowledge concerning its relationship with levels in target tissues does not mean that the biomarker is invalid. However, this limitation in understanding should be borne in mind in the interpretation of the data. There is also a strong case for exploring the relationship between target and surrogate tissue biomarker levels where possible; in practice, this is often done in the context of small-scale clinical studies.
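Where paired samples can be obtained, the surrogate-target relationship can be examined directly with a simple correlation analysis. The sketch below is purely illustrative: it assumes paired adduct measurements in peripheral blood cells and a target organ driven by a shared dose, with an extra tissue-specific component (differential metabolism or repair) that degrades the correlation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 40                                            # hypothetical number of paired samples

dose = rng.lognormal(1.0, 0.5, size=n)            # shared underlying exposure
pbc = dose * rng.lognormal(0.0, 0.3, size=n)      # adducts in peripheral blood cells
target = dose * rng.lognormal(0.0, 0.8, size=n)   # target organ: larger tissue-specific variation

rho, p = stats.spearmanr(pbc, target)
print(f"PBC vs target-organ adduct levels: Spearman rho = {rho:.2f} (p = {p:.2g})")
```

The larger the assumed tissue-specific term, the weaker the observed correlation, mirroring the variable PBC-target organ relationships seen for the methylating agents discussed above.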
WHEN TO MEASURE?

Environmental exposures will most often vary qualitatively and quantitatively during an individual's life because of changes in lifestyle, place of residence, occupation, etc. In addition, a given exposure may have a greater or lesser impact on disease risk at different times of life. Exposures in childhood may be particularly relevant to disease many decades later (1,62). For example, some dietary exposures reveal differing risks depending on the period of life when exposure is assessed (63). It is also well established that age at infection with hepatitis B virus (HBV) is a critical determinant of the risk of becoming a chronic HBV carrier, and therefore of developing liver cancer (64).

The temporal variation in exposure and the varying significance of that exposure over time pose particular challenges to biomarkers that are, by their inherent nature, transient.
It should be remembered, however, that on occasion a biomarker responding rapidly to changes in exposure is advantageous. Examples include monitoring the impact of intervention strategies, e.g., antismoking programs (65) or chemoprevention (66) (see also below), or ascribing exposure to occupation (67). The question of when to measure a biomarker of exposure in relation to the natural history of the disease requires some notion of the critical exposure period, the temporal variation in exposure, and the inherent stability of the biomarker concerned.

With viruses, serum antibodies to viral antigens often persist and indicate past exposure. This has meant that studies of infectious disease can progress more rapidly than is perhaps the case for chemical or dietary exposures. For example, with Helicobacter pylori and gastric cancer, only 11 years elapsed between identification of the organism and its classification as a Group 1 human carcinogen by the IARC (5). For chemical exposures the ability to measure long-term past exposure is more difficult to achieve. There are exceptions, where chemical exposures can be assessed by biomarkers that persist in the body. For example, certain pesticide residues have long half-lives in plasma and adipose tissue (68). Nevertheless, in the majority of instances the biomarker half-life for a chemical exposure will be relatively short and will consequently provide information only from a few days through to a few months prior to sampling. As with other areas of biomarker investigation mentioned above, there are few occasions where this has been determined empirically. More often it is implied from other evidence, e.g., the half-life of protein adducts may be inferred from the biological turnover of the protein concerned. Again, information on the persistence of biomarkers should be obtained where possible as part of the validation process. In all cases, the actual or assumed information regarding the biomarker should be taken into account in the design and analysis of molecular epidemiology studies.
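For a biomarker cleared by first-order kinetics, the window of exposure it can report on follows directly from its half-life. The sketch below makes the arithmetic explicit; the half-lives used are illustrative assumptions (an albumin adduct tracking the roughly three-week turnover of albumin, a hemoglobin adduct bounded by erythrocyte life span), not measured values for any specific chemical.

```python
def fraction_remaining(t_days: float, half_life_days: float) -> float:
    """Fraction of the biomarker signal remaining t days after exposure,
    assuming simple first-order (exponential) elimination."""
    return 0.5 ** (t_days / half_life_days)

# Illustrative half-lives only (assumptions, not measured values)
for label, half_life in [("urinary metabolite", 1.0),
                         ("albumin adduct", 20.0),
                         ("hemoglobin adduct", 60.0)]:
    print(f"{label:18s}: {fraction_remaining(30, half_life):6.1%} left after 30 days")
```

Under these assumptions a urinary metabolite is effectively gone within days, whereas adducts on longer-lived proteins still carry a detectable signal weeks after exposure, which is why protein turnover is so often used to infer the exposure window a biomarker integrates.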
One attempt to reconstruct past exposures has been to seek longer-lived protein adducts by examining the binding of chemicals to proteins that persist through cell division, such as histones (69). Alternatively, as discussed above, the analysis of mutation spectra may be informative concerning past exposure (53,56,70–72).

As mentioned earlier, the most common design used to assess cancer etiology has been the case:control study. While genotyping is well suited to this design, biomarkers of exposure, and possibly of effect, applied to biological samples at the time of or soon after cancer diagnosis may not reflect what has occurred in the past. In addition, the presence of disease could potentially influence the biomarker. This may be because the patient changes their lifestyle postdiagnosis or because the presence of disease alters physiology such that the biomarker is affected, resulting in reverse causation. Despite these limitations, DNA and protein adducts, for example, have been applied in case:control studies with positive findings (73–75).

An alternative is to establish prospective cohorts. A nested case:control study within the cohort or a case:cohort design (76) can then be used to limit the resources needed for biomarker analysis. As the aim is to have all individuals healthy at entry to the study, when biological samples are collected, the problems of reverse causation may be avoided. The design also provides an opportunity for one or more measures of exposure prior to disease onset. In practice, however, the periods of follow-up still tend to encompass a relatively short fraction of the carcinogenic process (a few years rather than decades), presenting some residual risk of reverse causation, and often involve only a single time point (most often recruitment) for biological sampling. Despite these caveats, it is the prospective cohort design that is best adapted to most currently available biomarkers of both exposure and effect.
Huge investment is being made in large prospective cohort studies that include the collection and banking of biological material (77–79). These studies are predicated on the exploitation of the associated biobanks using appropriate biomarkers to elucidate the genetic and environmental basis of common chronic disorders. The need for biomarkers of exposure to complement those of genetic factors, in order to fully benefit from these investments, has been stressed (7,8). The fact that many of these biomarkers inherently reflect only recent past exposure also implies value in collecting repeat samples from the same individual wherever possible.

It is within the prospective cohort design that the most successful examples of associations between biomarkers and disease outcome are found. For example, DNA and protein adducts as well as urinary metabolites have provided key evidence in establishing an exposure-disease association between aflatoxins and liver cancer (33,80). Other studies have associated "bulky" DNA adducts with lung cancer in smokers (81) and in never- or ex-smokers (82). Similarly, the chromosomal aberration and micronuclei studies mentioned earlier, although not strictly of a cohort design, related the biomarkers prospectively to increased cancer risk (47,48).

BIOLOGICAL PLAUSIBILITY

One of the criteria for establishing a causal association between an exposure and disease is biological plausibility. In this context, biomarkers may contribute by illuminating some of the carcinogenic steps linked to a particular risk factor. This is therefore an additional area, and possibly an undervalued one, where biomarkers can make significant contributions to cancer epidemiology. Some examples are described below.

One of the much-investigated risk factors for colorectal cancer is high meat intake, but the causal factor(s) remain unknown. One hypothesis is that red meat stimulates endogenous intestinal N-nitrosation, resulting in the formation of N-nitroso compounds (NOC). Lewin and colleagues (61) studied volunteers in a metabolic suite consuming defined diets high in red meat compared with individuals consuming a vegetarian diet. They demonstrated that both the formation of NOC in feces and NOC-related DNA adducts (O6-carboxymethylguanine) in exfoliated cells in feces were higher in people consuming the red meat diet. At the individual level there was a correlation between the two biomarkers. Such studies are often demanding to perform. For example, in this instance, a metabolic suite was required, with provision of controlled diets and rapid processing of fecal samples. Consequently, studies of this kind are normally limited to relatively small numbers of subjects. Nevertheless, this study serves as a good example of how carefully designed biomarker experiments in human volunteers can be used to investigate mechanisms of carcinogenesis and provide evidence for the biological plausibility of putative etiologic associations.

Another area where biomarkers have been applied to investigate the plausibility of etiologic hypotheses is that of genetic susceptibility. For example, associations have been predicted between genetic polymorphisms in carcinogen-metabolizing or DNA repair genes and higher DNA adduct levels. In one study in healthy individuals, a dose-response relationship between "bulky" DNA adducts in lymphocytes and polymorphisms in three different DNA repair enzymes was reported (83).
This type of study strengthens the rationale for examining the relationship between polymorphisms and cancer risk precisely because it provides evidence that the polymorphisms affect a significant step on the mechanistic pathway to cancer.
INTERVENTION STUDIES

Biomarkers can be incorporated into intervention studies for different purposes. First, the objective may be to examine the possibility of modulating a particular biochemical pathway, perhaps using a micronutrient or pharmaceutical agent. This type of intervention study tends to be conducted on a small scale with intensive analysis of one or more biomarkers. The outcome is a better understanding in vivo of mechanisms of carcinogenesis and a stronger scientific rationale for public health interventions. Second, the objective may be to use biomarkers of exposure or effect as surrogates for disease endpoints in intervention studies. This type of study will tend to be on a larger scale than the more mechanism-orientated studies and may serve as a prelude to the conduct of a full-scale intervention study with disease as the outcome. Some illustrations of the application of biomarkers in these types of intervention study are discussed briefly below.

Epidemiological evidence links fruit and vegetable consumption to decreased risk of cancer at a number of sites (84). One hypothesis is that antioxidants may be at least partially responsible, by protecting cells from oxidative DNA damage. However, when large-scale intervention studies with antioxidants have been conducted, the results have been equivocal (85). In each case, the precise mechanism by which the vitamin supplementation might be effective, and the dose required, were poorly understood. A number of small-scale intervention studies have been conducted to examine whether vitamin supplementation can decrease DNA damage in vivo, in order to provide additional supportive data for the underlying mechanistic hypotheses. For example, Collins and colleagues (86) reported that kiwi fruit consumption reduced both endogenous oxidative DNA damage and damage induced by an ex vivo challenge in peripheral lymphocytes. In addition, DNA repair capacity was enhanced by fruit consumption. These observations illustrate that an additional mechanism, notably induction of DNA repair capacity, may be of relevance. In another intervention study, in smokers, increased flavonoid consumption resulted in an increase in urine antimutagenicity (87) but no consistent effects on DNA adducts in exfoliated bladder cells (88). This type of data can add valuable information to the scientific rationale behind intervention studies using micronutrients.

Another example where biomarkers have been used in intervention studies is that of aflatoxins. In a community-based postharvest primary prevention trial targeted at the groundnut crop in Guinea, West Africa (89), aflatoxin-albumin adduct levels were more than 50% lower in subjects in the intervention villages. A panel of aflatoxin biomarkers, including urinary DNA adducts and metabolites, has also been applied successfully in an elegant series of chemoprevention studies in China using oltipraz and chlorophyllin (66,90–92). These studies were able to demonstrate the modulation of aflatoxin metabolism in exposed individuals and a decrease in the level of aflatoxin-N7-guanine adducts.
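Analyses of adduct data in such trials typically compare arms on the log scale, since adduct levels tend to be approximately lognormally distributed. The sketch below simulates a two-arm comparison of aflatoxin-albumin adduct levels with an assumed 50% reduction in the intervention arm; the distributional parameters and group sizes are invented for illustration and are not taken from the Guinea trial.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical adduct levels (e.g., pg adduct/mg albumin), lognormal by assumption
control = rng.lognormal(mean=3.0, sigma=0.7, size=150)
intervention = rng.lognormal(mean=3.0 + np.log(0.5), sigma=0.7, size=150)

t_stat, p = stats.ttest_ind(np.log(intervention), np.log(control))
reduction = 1 - np.exp(np.log(intervention).mean() - np.log(control).mean())
print(f"Reduction in geometric mean adduct level: {reduction:.0%} (t-test on logs, p = {p:.1g})")
```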
To date these intervention studies have employed biomarkers of chemical metabolites or adducts, rather than biomarkers of effect such as chromosomal aberrations or mutations. This has mainly reflected the aims of the studies, namely to establish proof of principle, and the associated requirement for biomarkers that respond rapidly to the intervention. In the future, it can be envisaged that longer-term interventions may use biomarkers of effect to monitor early outcomes. These may be considered analogous to the use of serum HBV surface antigen as a biomarker to demonstrate that HBV vaccination has been effective in preventing chronic HBV carriage (93). An argument
could be made to incorporate biomarkers of effect in longer-term intervention trials as early indicators of any adverse effects.

FUTURE BIOMARKERS OF EXPOSURE AND EFFECT

It is pertinent to ask whether the increasing application of transcriptomics, proteomics, and metabonomics in cancer research can contribute to the development of biomarkers of exposure and effect (15). Their value will primarily depend on whether specific exposures are reflected by altered levels of mRNA, proteins, or metabolites. Will there be signatures or fingerprints of environmental exposures across a broad spectrum of mechanisms of action, both genotoxic and nongenotoxic? Similarly, there is a need to examine whether the particular biological effects consequent to exposure will be represented by characteristic patterns of gene expression, proteins, or metabolites. If so, these new technologies may provide a step-change in the development of biomarkers of both exposure and effect.

There are some early indications that this is a fruitful area of research, albeit one that to date is relatively unexplored. For example, naturally occurring or industrial compounds with estrogenic activity alter the expression of similar genes in vitro (94), while ionizing radiation altered the expression of specific genes in human lymphocytes (95,96). There have also been a few gene expression studies in relation to exposure in a population setting, focusing on smoking (97), benzene (98), arsenic (99), metal fumes (100), and environmental air pollution (101). These preliminary data show that different exposures do elicit different changes in gene expression and encourage further exploration of the sensitivity, specificity, and stability of these changes.

Examination of the 3000 or so major metabolites (102) that constitute the metabonome also offers opportunities to address exposure assessment. In an experimental mouse study involving infection with Schistosoma mansoni, differences in urinary metabolite fingerprints were obtained that indicated effects on certain metabolic pathways (103). In terms of human studies, the change from a nonsoy to a soy-containing diet was shown to be associated with some changes in the plasma metabonome (104). The potential application of "omics" technologies to characterize dietary exposures and to understand the biological effects of diet at the cellular level therefore receives some support from these early investigations (104,105).

This new generation of technologies requires further extensive research before it can be decided whether it provides a marked advance in biomarkers of exposure and effect. Notably, it remains to be seen whether, in principle, mRNA, protein, or metabolite expression can be specific and sensitive enough to define exposures at low levels in human populations. It will be important to understand whether complex mixtures or families of chemicals act through the same pathways and can be represented by common targets on common mechanistic pathways. The dynamic nature of each of these systems may militate against long-term exposure assessment, unless some of the changes prove stable over time. The technology will also need to be tailored in terms of such properties as sensitivity, sample requirement, throughput, and cost to be applicable to population studies. Purification procedures in the case of metabonomics and proteomics will be essential to measure rare, and possibly more informative, proteins or metabolites among the background of quantitatively more dominant species.
Although not explicitly discussed here, the need for sophisticated statistical analysis emerges as crucial to any eventual application. As with the earlier generation of exposure biomarkers, a carefully planned strategy, starting with model systems and small-scale human studies, is likely to be most successful (33).
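One common statistical approach to deriving an exposure "signature" from high-dimensional expression data is penalized classification with cross-validation, which guards against the overfitting that is otherwise almost guaranteed when genes far outnumber subjects. The sketch below is a generic illustration on simulated data; the subject and gene counts, effect size, and the choice of an L1-penalized logistic model are assumptions, not a description of the cited studies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n_subjects, n_genes = 80, 500                  # hypothetical study dimensions
X = rng.normal(size=(n_subjects, n_genes))
exposed = rng.random(n_subjects) < 0.5
X[exposed, :10] += 0.8                         # assume 10 genes shift modestly with exposure

# An L1 penalty drives most gene coefficients to zero, yielding a sparse signature
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
accuracy = cross_val_score(clf, X, exposed, cv=5).mean()
print(f"Cross-validated accuracy of the expression-based exposure classifier: {accuracy:.2f}")
```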
SUMMARY

The new generation of mega-cohort studies (78,106–108) provides the framework for investigations of genetic variation, environment, lifestyle, and chronic disease. These studies represent substantial investment, with U.K. Biobank, for example, recruiting 500,000 adults at a cost of around £60 million in the initial phase. A considerable part of this cost is driven by the collection and banking of biological material. This investment is at least partially justified on the assumption that biochemical and molecular measures on this material will help resolve important etiologic questions. It is self-evident that unraveling complex environmental and genetic etiologies to plan effective public health interventions demands that both environmental exposures and genetic variation are reliably measured. Advances in statistical methods and in bioinformatics in relation to large data sets are also of critical importance. The further development, validation, and application of biomarkers of exposure and effect in this context are manifestly a critical part of the future of cancer epidemiology in the 21st century.

ACKNOWLEDGMENTS

CPW was supported by a grant from the NIEHS USA, no. ES06052. The author would also like to thank Margaret Jones for her help in preparing the manuscript and Dr. Renée Mijal for providing critical comments.

REFERENCES

1. Anderson LM, Diwan BA, Fear NT, et al. Critical windows of exposure for children's health: cancer in human epidemiological studies and neoplasms in experimental animal models. Environ Health Perspect 2000; 108:573–594.
2. Wild CP, Pisani P. Carcinogen-DNA and protein adducts as biomarkers of human exposure in environmental cancer epidemiology. Cancer Detect Prev 1998; 22:273–283.
3. Weis BK, Balshawl D, Barr JR, et al. Personalized exposure assessment: promising approaches for human environmental health research. Environ Health Perspect 2005; 113:840–848.
4. Nuckols JR, Ward MH, Jarup L. Using geographic information systems for exposure assessment in environmental epidemiology studies. Environ Health Perspect 2004; 112:1007–1015.
5. Poirier MC. Chemical-induced DNA damage and human cancer risk. Nat Rev Cancer 2004; 4:630–637.
6. Phillips DH. DNA adducts as markers of exposure and risk. Mutat Res-Fundam Molec Mech Mutag 2005; 577:284–292.
7. Vineis P. A self-fulfilling prophecy: are we underestimating the role of the environment in gene-environment interaction research? Int J Epidemiol 2004; 33:945–946.
8. Wild CP. Complementing the genome with an "exposome": the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epi Bio Prev 2005; 14:1847–1850.
9. Committee on Biological Markers of the National Research Council. Biological markers in environmental-health research. Environ Health Perspect 1987; 74:3–9.
10. Tang D, Warburton D, Tannenbaum SR, et al. Molecular and genetic damage from environmental tobacco smoke in young children. Cancer Epi Bio Prev 1999; 8:427–431.
11. Vineis P, Husgafvel-Pursiainen K. Air pollution and cancer: biomarker studies in human populations. Carcinogenesis 2005; 26:1846–1855.
12. Wojnowski L, Turner PC, Pedersen B, et al. Increased levels of aflatoxin-albumin adducts are associated with CYP3A5 polymorphisms in The Gambia, West Africa. Pharmacogenetics 2004; 14:691–700.
13. Sharma RA, Farmer PB. Biological relevance of adduct detection to the chemoprevention of cancer. Clin Cancer Res 2004; 10:4901–4912.
14. Walker VE, Wu KY, Upton PB, et al. Biomarkers of exposure and effect as indicators of potential carcinogenic risk arising from in vivo metabolism of ethylene to ethylene oxide. Carcinogenesis 2000; 21:1661–1669.
15. Albertini RJ. Developing sustainable studies on environmental health. Mutat Res-Fundam Molec Mech Mutag 2001; 480:317–331.
16. Aitio A, Cabral JRP, Camus AM, et al. Evaluation of sister chromatid exchange as an indicator of sensitivity to N-ethyl-N-nitrosourea-induced carcinogenesis in rats. Teratog Carcinog Mutagen 1988; 8:273–286.
17. Kensler TW, Gange SJ, Egner PA, et al. Predictive value of molecular dosimetry: individual versus group effects of oltipraz on aflatoxin-albumin adducts and risk of liver cancer. Cancer Epi Bio Prev 1997; 6:603–610.
18. Imbriani M, Ghittori S. Gases and organic solvents in urine as biomarkers of occupational exposure: a review. Int Arch Occup Environ Health 2005; 78:1–19.
19. Watson WP, Mutti A. Role of biomarkers in monitoring exposures to chemicals: present position, future prospects. Biomarkers 2004; 9:211–242.
20. Kang DH, Rothman N, Poirier MC, et al. Interindividual differences in the concentration of 1-hydroxypyrene-glucuronide in urine and polycyclic aromatic hydrocarbon-DNA adducts in peripheral white blood cells after charbroiled beef consumption. Carcinogenesis 1995; 16:1079–1085.
21. Strickland PT, Qian Z, Friesen MD, et al. Metabolites of 2-amino-1-methyl-6-phenylimidazo(4,5-b)pyridine (PhIP) in human urine after consumption of charbroiled or fried beef. Mutat Res-Fundam Molec Mech Mutag 2002; 506:163–173.
22. Hecht SS, Carmella SG, Yoder A, et al. Comparison of polymorphisms in genes involved in polycyclic aromatic hydrocarbon metabolism with urinary phenanthrene metabolite ratios in smokers. Cancer Epi Bio Prev 2006; 15:1805–1811.
23. Helzlsouer KJ, Alberg AJ, Huang HY, et al. Serum concentrations of organochlorine compounds and the subsequent development of breast cancer. Cancer Epi Bio Prev 1999; 8:525–532.
24. Rusiecki JA, Matthews A, Sturgeon S, et al. A correlation study of organochlorine levels in serum, breast adipose tissue, and gluteal adipose tissue among breast cancer cases in India. Cancer Epi Bio Prev 2005; 14:1113–1124.
25. Tasevska N, Runswick SA, McTaggart A, et al. Urinary sucrose and fructose as biomarkers for sugar consumption. Cancer Epi Bio Prev 2005; 14:1287–1294.
26. Nielsen SE, Freese R, Kleemola P, et al. Flavonoids in human urine as biomarkers for intake of fruits and vegetables. Cancer Epi Bio Prev 2002; 11:459–466.
27. Skipper PL, Peng XC, Soohoo CK, et al. Protein adducts as biomarkers of human carcinogen exposure. Drug Metab Rev 1994; 26:111–124.
28. Vineis P, Garte S. Biomarker validation. In: Wild CP, Garte S, Vineis P, eds. Molecular Epidemiology of Chronic Disease. John Wiley & Sons Ltd, 2008 (in press).
29. Schatzkin A, Freedman LS, Schiffman MH, et al. Validation of intermediate end-points in cancer research. J Natl Cancer Inst 1990; 82:1746–1752.
30. Phillips DH, Hewer A, Martin CN, et al. Correlation of DNA adduct levels in human lung with cigarette smoking. Nature 1988; 336:790–792.
31. Groopman JD, Hall AJ, Whittle H, et al. Molecular dosimetry of aflatoxin-N7-guanine in human urine obtained in The Gambia, West Africa. Cancer Epi Bio Prev 1992; 1:221–227.
32. Wild CP, Hudson GJ, Sabbioni G, et al. Dietary intake of aflatoxins and the level of albumin-bound aflatoxin in peripheral blood in The Gambia, West Africa. Cancer Epi Bio Prev 1992; 1:229–234.
33. Qian GS, Ross RK, Yu MC, et al. A follow-up study of urinary markers of aflatoxin exposure and liver cancer risk in Shanghai, People's Republic of China. Cancer Epi Bio Prev 1994; 3:3–10.
34. Gilbert J, Brereton P, MacDonald S. Assessment of dietary exposure to ochratoxin A in the UK using a duplicate diet approach and analysis of urine and plasma samples. Food Addit Contam 2001; 18:1088–1093.
35. Rothman N, Correa-Villasenor A, Ford DP, et al. Contribution of occupation and diet to white blood-cell polycyclic aromatic hydrocarbon-DNA adducts in wildland firefighters. Cancer Epi Bio Prev 1993; 2:341–347.
36. Collins AR. The comet assay for DNA damage and repair: principles, applications, and limitations. Mol Biotechnol 2004; 26:249–261.
37. IARC. Some traditional herbal medicines, some mycotoxins, naphthalene and styrene. IARC Monographs on the Evaluation of Carcinogenic Risks to Humans. Lyon, France: IARC Press, 2002:301–366.
38. Merrill AH, Sullards MC, Wang E, et al. Sphingolipid metabolism: roles in signal transduction and disruption by fumonisins. Environ Health Perspect 2001; 109:283–289.
39. Turner PC, Nikiema P, Wild CP. Fumonisin contamination of food: progress in development of biomarkers to better assess human health risks. Mutat Res-Genet Toxicol Environ Mutag 1999; 443:81–93.
40. Feil R. Environmental and nutritional effects on the epigenetic regulation of genes. Mutat Res-Fundam Molec Mech Mutag 2006; 600:46–57.
41. Esteller M. The necessity of a human epigenome project. Carcinogenesis 2006; 27:1121–1125.
42. Hagmar L, Brogger A, Hansteen IL, et al. Cancer risk in humans predicted by increased levels of chromosomal aberrations in lymphocytes: Nordic study group on the health risk of chromosome damage. Cancer Res 1994; 54:2919–2922.
43. Hagmar L, Stromberg U, Bonassi S, et al. Impact of types of lymphocyte chromosomal aberrations on human cancer risk: results from Nordic and Italian cohorts. Cancer Res 2004; 64:2258–2263.
44. Bonassi S, Abbondandolo A, Camurri L, et al. Are chromosome aberrations in circulating lymphocytes predictive of future cancer onset in humans? Preliminary results of an Italian cohort study. Cancer Genet Cytogenet 1995; 79:133–135.
45. Liou SH, Lung JC, Chen YH, et al. Increased chromosome-type chromosome aberration frequencies as biomarkers of cancer risk in a blackfoot endemic area. Cancer Res 1999; 59:1481–1484.
46. Rossner P, Boffetta P, Ceppi M, et al. Chromosomal aberrations in lymphocytes of healthy subjects and risk of cancer. Environ Health Perspect 2005; 113:517–520.
47. Boffetta P, van der Hel O, Norppa H, et al. Chromosomal aberrations and cancer risk: results of a cohort study from central Europe. Am J Epidemiol 2007; 165:36–43.
48. Bonassi S, Znaor A, Ceppi M, et al. An increased micronucleus frequency in peripheral blood lymphocytes predicts the risk of cancer in humans. Carcinogenesis 2007; 28:625–631.
49. Hagmar L, Bonassi S, Stromberg U, et al. Chromosomal aberrations in lymphocytes predict human cancer: a report from the European study group on cytogenetic biomarkers and health (ESCH). Cancer Res 1998; 58:4117–4121.
50. Albertini RJ. HPRT mutations in humans: biomarkers for mechanistic studies. Mutat Res-Rev Mutat Res 2001; 489:1–16.
51. Lindholm C, Murphy BP, Bigbee WL, et al. Glycophorin A somatic cell mutations in a population living in the proximity of the Semipalatinsk nuclear test site. Radiat Res 2004; 162:164–170.
52. Hussain SP, Harris CC. Molecular epidemiology and carcinogenesis: endogenous and exogenous carcinogens. Mutat Res-Rev Mutat Res 2000; 462:311–322.
53. Kirk GD, Lesi OA, Mendy M, et al. 249(ser) TP53 mutation in plasma DNA, hepatitis B viral infection, and risk of hepatocellular carcinoma. Oncogene 2005; 24:5858–5867.
54. Liu ZP, Hergenhahn M, Schmeiser HH, et al. Human tumor p53 mutations are selected for in mouse embryonic fibroblasts harboring a humanized p53 gene. Proc Natl Acad Sci USA 2004; 101:2963–2968.
55. Grollman AP, Shibutani S, Moriya M, et al. Aristolochic acid and the etiology of endemic (Balkan) nephropathy. Proc Natl Acad Sci USA 2007; 104:12129–12134.
56. Wild CP, Turner PC. The toxicology of aflatoxins as a basis for public health decisions. Mutagenesis 2002; 17:471–481.
57. Bianchini F, Wild CP. Comparison of 7-MedG formation in white blood cells, liver and target organs in rats treated with methylating carcinogens. Carcinogenesis 1994; 15:1137–1141.
58. Godschalk RWL, Maas LM, Van Zandwijk N, et al. Differences in aromatic-DNA adduct levels between alveolar macrophages and subpopulations of white blood cells from smokers. Carcinogenesis 1998; 19:819–825.
59. Nia AB, Maas LM, Brouwer EMC, et al. Comparison between smoking-related DNA adduct analysis in induced sputum and peripheral blood lymphocytes. Carcinogenesis 2000; 21:1335–1340.
60. Wiencke JK, Kelsey KT, Varkonyi A, et al. Correlation of DNA adducts in blood mononuclear cells with tobacco carcinogen-induced damage in human lung. Cancer Res 1995; 55:4910–4914.
61. Lewin MH, Bailey N, Bandaletova T, et al. Red meat enhances the colonic formation of the DNA adduct O-6-carboxymethyl guanine: implications for colorectal cancer. Cancer Res 2006; 66:1859–1865.
62. Wild CP, Kleinjans J. Children and increased susceptibility to environmental carcinogens: evidence or empathy? Cancer Epi Bio Prev 2003; 12:1389–1394.
63. Yu MC, Mo CC, Chong WX, et al. Preserved foods and nasopharyngeal carcinoma: a case-control study in Guangxi, China. Cancer Res 1988; 48:1954–1959.
64. Wild CP, Hall AJ. Hepatitis B virus and liver cancer: unanswered questions. Cancer Surveys 1999; 33:35–54.
65. Maclure M, Bryant MS, Skipper PL, et al. Decline of the hemoglobin adduct of 4-aminobiphenyl during withdrawal from smoking. Cancer Res 1990; 50:181–184.
66. Wang JS, Shen XN, He X, et al. Protective alterations in phase 1 and 2 metabolism of aflatoxin B-1 by oltipraz in residents of Qidong, People's Republic of China. J Natl Cancer Inst 1999; 91:347–354.
67. Ward EM, Sabbioni G, Debord DG, et al. Monitoring of aromatic amine exposures in workers at a chemical plant with a known bladder cancer excess. J Natl Cancer Inst 1996; 88:1046–1052.
68. Stellman SD, Djordjevic MV, Muscat JE, et al. Relative abundance of organochlorine pesticides and polychlorinated biphenyls in adipose tissue and serum of women in Long Island, New York. Cancer Epi Bio Prev 1998; 7:489–496.
69. Ozbal CC, Dasari RR, Tannenbaum SR. Stability of histone adducts in murine models: implications for long-term molecular dosimetry. Abstracts of Papers of the American Chemical Society 1998; 216:053-TOXI.
70. Greenblatt MS, Bennett WP, Hollstein M, et al. Mutations in the p53 tumor suppressor gene: clues to cancer etiology and molecular pathogenesis. Cancer Res 1994; 54:4855–4878.
71. Szymanska K, Lesi OA, Kirk GD, et al. Ser-249 TP53 mutation in tumour and plasma DNA of hepatocellular carcinoma patients from a high incidence area in The Gambia, West Africa. Int J Cancer 2004; 110:374–379.
72. Gormally E, Caboux E, Vineis P, et al. Circulating free DNA in plasma or serum as biomarker of carcinogenesis: practical aspects and biological significance. Mutat Res-Rev Mutat Res 2007; 635:105–117.
73. Jackson PE, Kuang SY, Wang JB, et al. Prospective detection of codon 249 mutations in plasma of hepatocellular carcinoma patients. Carcinogenesis 2003; 24:1657–1663.
74. Skipper PL, Tannenbaum SR, Ross RK, et al. Nonsmoking-related arylamine exposure and bladder cancer risk. Cancer Epi Bio Prev 2003; 12:503–507.
75. Gan JP, Skipper PL, Gago-Dominguez M, et al. Alkylaniline-hemoglobin adducts and risk of non-smoking-related bladder cancer. J Natl Cancer Inst 2004; 96:1425–1431.
76. Rundle AG, Vineis P, Ahsan H. Design options for molecular epidemiology research within cohort studies. Cancer Epi Bio Prev 2005; 14:1899–1907.
77. Manolio TA, Bailey-Wilson JE, Collins FS. Opinion: genes, environment and the value of prospective cohort studies. Nature Rev Genet 2006; 7:812–820.
78. Potter JD. Toward the last cohort. Cancer Epi Bio Prev 2004; 13:895–897.
79. Palmer LJ. UK Biobank: bank on it. Lancet 2007; 369:1980–1982.
80. Wang LY, Hatch M, Chen CJ, et al. Aflatoxin exposure and risk of hepatocellular carcinoma in Taiwan. Int J Cancer 1996; 67:620–625.
81. Tang DL, Phillips DH, Stampfer M, et al. Association between carcinogen-DNA adducts in white blood cells and lung cancer risk in the Physicians Health Study. Cancer Res 2001; 61:6708–6712.
82. Peluso M, Munnia A, Hoek G, et al. DNA adducts and lung cancer risk: a prospective study. Cancer Res 2005; 65:8042–8048.
83. Matullo G, Peluso M, Polidoro S, et al. Combination of DNA repair gene single nucleotide polymorphisms and increased levels of DNA adducts in a population-based study. Cancer Epi Bio Prev 2003; 12:674–677.
84. World Cancer Research Fund. Evidence you can trust on nutrition and cancer. WCRF, 2007. In press.
85. Taylor PR, Greenwald P. Nutritional interventions in cancer prevention. J Clin Oncol 2005; 23:333–345.
86. Collins AR, Harrington V, Drew J, et al. Nutritional modulation of DNA repair in a human intervention study. Carcinogenesis 2003; 24:511–515.
87. Malaveille C, Fiorini L, Bianchini M, et al. Randomized controlled trial of dietary intervention: association between level of urinary phenolics and anti-mutagenicity. Mutat Res-Genet Toxicol Environ Mutag 2004; 561:83–90.
88. Talaska G, Al Zoughool M, Malaveille C, et al. Randomized controlled trial: effects of diet on DNA damage in heavy smokers. Mutagenesis 2006; 21:179–183.
89. Turner PC, Sylla A, Gong YY, et al. Reduction in exposure to carcinogenic aflatoxins by postharvest intervention measures in west Africa: a community-based intervention study. Lancet 2005; 365:1950–1956.
90. Jacobson LP, Zhang BC, Zhu YR, et al. Oltipraz chemoprevention trial in Qidong, People's Republic of China: study design and clinical outcomes. Cancer Epi Bio Prev 1997; 6:257–265.
91. Kensler TW, He X, Otieno M, et al. Oltipraz chemoprevention trial in Qidong, People's Republic of China: modulation of serum aflatoxin albumin adduct biomarkers. Cancer Epi Bio Prev 1998; 7:127–134.
92. Egner PA, Wang JB, Zhu YR, et al. Chlorophyllin intervention reduces aflatoxin-DNA adducts in individuals at high risk for liver cancer. Proc Natl Acad Sci USA 2001; 98:14601–14606.
93. Viviani S, Jack A, Hall AJ, et al. Hepatitis B vaccination in infancy in The Gambia: protection against carriage at 9 years of age. Vaccine 1999; 17:2946–2950.
94. Amundson SA, Do KT, Shahab S, et al. Identification of potential mRNA biomarkers in peripheral blood lymphocytes for human exposure to ionizing radiation. Radiat Res 2000; 154:342–346.
95. Amundson SA, Grace MB, McLeland CB, et al. Human in vivo radiation-induced biomarkers: gene expression changes in radiotherapy patients. Cancer Res 2004; 64:6368–6371.
96. Lampe JW, Stepaniants SB, Mao M, et al. Signatures of environmental exposures using peripheral leukocyte gene expression: tobacco smoke. Cancer Epi Bio Prev 2004; 13:445–453.
97. Nylund R, Leszczynski D. Proteomics analysis of human endothelial cell line EA.hy926 after exposure to GSM 900 radiation. Proteomics 2004; 4:1359–1365.
98. Forrest MS, Lan O, Hubbard AE, et al. Discovery of novel biomarkers by microarray analysis of peripheral blood mononuclear cell gene expression in benzene-exposed workers. Environ Health Perspect 2005; 113:801–807.
99. Wu MM, Chiou HY, Ho IC, et al. Gene expression of inflammatory molecules in circulating lymphocytes from arsenic-exposed human subjects. Environ Health Perspect 2003; 111:1429–1438.
100. Wang ZX, Neuburg D, Li C, et al. Global gene expression profiling in whole-blood samples from individuals exposed to metal fumes. Environ Health Perspect 2005; 113:233–241.
101. van Leeuwen DM, van Herwijnen MHM, Pedersen M, et al. Genome-wide differential gene expression in children exposed to air pollution in the Czech Republic. Mutat Res-Fundam Molec Mech Mutag 2006; 600:12–22.
98
Wild
102. Wang YL, Holmes E, Nicholson JK, et al. Metabonomic investigations in mice infected with Schistosoma mansoni: an approach for biomarker identification. Proc Natl Acad Sci USA 2004; 101:12676–12681. 103. Solanky KS, Bailey NJC, Beckwith-Hall BM, et al. Application of biofluid H-1 nuclear magnetic resonance-based metabonomic techniques for the analysis of the biochemical effects of dietary isoflavones on human plasma profile. Anal Biochem 2003; 323:197–204. 104. Davis CD, Milner J. Frontiers in nutrigenomics, proteomics, metabolomics and cancer prevention. Mutat Res-Fundam Mol Mech Mut 2004; 551:51–64. 105. Elliott R, Ong TJ. Science, medicine, and the future–nutritional genomics. Br Med J 2002; 324:1438–1442. 106. Collins FS. The case for a US prospective cohort study of genes and environment. Nature 2004; 429:475–477. 107. Barbour V. UK Biobank: a project in search of a protocol? Lancet 2003; 361:1734–1738. 108. Cyranoski D, Williams R. Health study sets sights on a million people. Nature 2005 434:812.
8
Questionnaire Assessment

James R. Marshall
Roswell Park Cancer Institute, Buffalo, New York, U.S.A.
INTRODUCTION

The goal of epidemiologic observation is to take advantage of human experience, namely, subject exposure and disease, to unravel disease causation. The purpose of this observation is prevention: to identify risk-enhancing exposures that might be avoided and risk-diminishing exposures that might be encouraged. One strength of epidemiology is that it directly studies the object of our concern. A question arising from studies based on cell lines or animal models is whether or not the findings are generalizable to humans; generalizing an epidemiologic finding to human populations is, in general, a great deal more straightforward. However, in spite of the importance of epidemiology to prevention research, its reliance on observation rather than on experimentation poses at least a potential limitation to the validity of its findings.

The prevention trial represents an increasingly utilized attempt to meld epidemiology and experimentation and thereby increase the rigor of inference based on human experience. Prominent recent examples include a trial of the 5-alpha reductase inhibitor finasteride to prevent prostate cancer among low-risk men (1), a trial of selenium and vitamin E to prevent prostate cancer in average-risk men (2), and the use of the selective estrogen receptor modulators tamoxifen and raloxifene to prevent breast cancer in high-risk women (3).

Another limitation of epidemiologic observation is its vulnerability to bias and inconsistency. Measuring exposure at the wrong time, or incorrectly assuming linearity or monotonicity in the association of exposure with risk, can induce bias. Confounding, failure to assess and control for relevant factors, or adjustment for variables that should not be controlled for can lead to bias and inconsistency. Failure to recognize that the effects of some exposures may vary according to other subject exposures or resources, such as genetic predisposition, may also induce bias. These potential biases may have an even more profound effect on molecular epidemiologic studies, in which one key component is studying how genetic differences in metabolic pathways modify exposure-disease relationships. Although prevention trials can lessen the likelihood of some of these sources of bias, they cannot eliminate them. Not all exposures can be experimentally manipulated. And a precondition for prevention trials is the existence of some analysis of human experience, that is, epidemiology, as a guide to trial design.
Table 1(a)  True Exposure A and Disease

                        Disease
Exposure A        +          -          Total
+                 167        125        292
-                 83         125        208
Total             250        250        500

OR = 2.0. Abbreviation: OR, odds ratio.

Table 1(b)  Measured Exposure A and Disease: 30% Misclassification

                        Disease
Exposure A        +          -          Total
+                 142        125        267
-                 108        125        233
Total             250        250        500

OR = 1.3. Abbreviation: OR, odds ratio.
Among the sources of bias and inconsistency in epidemiology, imperfect measurement looms as one of the least tractable. Molecular epidemiology is in no way immune to limitations due to poor measurement. To assume that a variable as measured accurately represents the variable it is intended to represent amounts to a form of misspecification (4). What limited data we have indicate that epidemiologic measures often do not accurately represent what they are intended to represent; they are subject to substantial measurement error, or noise. The conventional epidemiologic wisdom is that poor measurement has little impact other than to attenuate measures of association; this effect can be profound, however.

Consider the following example: A and B are equally powerful predictors of a disease outcome, and each increases the risk of disease twofold. But A is measured less accurately than B is. Table 1(a) summarizes the true association between A and risk: an odds ratio of 2.0. Table 1(b) reveals what the investigator would see with 30% exposure misclassification: an odds ratio of 1.3 associating A and disease. With B associated to exactly the same degree with disease, the true association of exposure and disease would be that depicted in Table 2(a), while Table 2(b) shows what 10% misclassification would cause the investigator to observe: an odds ratio of 1.7.

Table 2(a)  True Exposure B and Disease

                        Disease
Exposure B        +          -          Total
+                 167        125        292
-                 83         125        208
Total             250        250        500

OR = 2.0. Abbreviation: OR, odds ratio.

Table 2(b)  Measured Exposure B and Disease: 10% Misclassification

                        Disease
Exposure B        +          -          Total
+                 159        125        284
-                 91         125        216
Total             250        250        500

OR = 1.7. Abbreviation: OR, odds ratio.

If exposures A and B were dependent, or closely correlated, the investigator would
estimate that, with both exposures considered together, exposure B would appear, net of control for exposure A, to be a risk factor, while control for exposure B would eliminate the association of exposure A and disease. Bias other than attenuation of risk estimates is also possible (5–7). A particularly vexing problem is that a strong confounder can cause a null variable to appear associated with risk, and this appearance can persist even with statistical control for that strong confounder (6,7).

Epidemiology, taking advantage of natural experiments in which individuals are not assigned to risk factor exposure or nonexposure but the coincidence of their exposure and disease is studied, can be conducted prospectively or retrospectively. In the prospective mode, individuals who have come to be exposed are compared with others who have not; the occurrence of disease among the exposed individuals is compared with that among the unexposed. If the exposure increases disease risk, then occurrence will be greater among the exposed than among the unexposed; if it decreases risk, then occurrence will be less among the exposed than among the unexposed. The coincidence of exposure and disease occurrence can also be studied by the case-control method, or retrospectively: a sample of individuals who have been diagnosed with a given disease is compared with a sample of individuals who have not been so diagnosed. Exposure among those who have been diagnosed (cases) is compared with exposure among those not so diagnosed (controls). If the exposure increases disease risk, then the cases will be more likely than the controls to have been exposed. In both settings, the accuracy of this exposure assessment is critical.

Molecular epidemiology adds a dimension to this analysis; whether in the prospective or retrospective mode, individuals are categorized by another natural experiment: a molecular genetic characteristic, such as a single nucleotide polymorphism or haplotype understood or suspected to affect a metabolic process or cascade. The association of disease and exposure is then studied within the categories of this polymorphism or haplotype. The first focus is whether the association is statistically significant within each category; what is also of interest is variability among the different categories. The researcher is interested in whether this association is far greater in one category than in the others (8). The sample sizes for these analyses will be smaller than for the overall sample. In addition, the comparison that is integral to molecular epidemiology requires that other sources of associational inconsistency be considered. As will be demonstrated later in this chapter, measurement error imposes two major strictures on molecular epidemiology. First, in different subsets of a study sample, such as subsets with different genotypes, even slight differences in measurement error can lead to the appearance of different strengths of association. Second, major differences in the association of exposure and risk can be masked by measurement error; the ability of a molecular epidemiologic inquiry to discern these differences can be substantially limited by inaccuracy in the data.
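The attenuation shown in Tables 1 and 2 can be verified directly. The following sketch is a minimal illustration of our own, not code from the chapter; it applies the stated misclassification probabilities to the true cell counts and recomputes the odds ratio.

```python
# Apply nondifferential misclassification to the true counts of Table 1(a)
# and recompute the odds ratio, reproducing Table 1(b).

def misclassify(exposed, unexposed, p):
    """Swap a fraction p of exposed to unexposed, and vice versa."""
    return ((1 - p) * exposed + p * unexposed,
            p * exposed + (1 - p) * unexposed)

def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)

# True table: cases 167 exposed / 83 unexposed; controls 125 / 125 (OR = 2.0).
a, b = misclassify(167, 83, 0.30)    # cases under 30% misclassification
c, d = misclassify(125, 125, 0.30)   # controls stay at 125 / 125

print(round(a), round(b), round(odds_ratio(a, b, c, d), 2))
# 142 108 1.31 -- the attenuated odds ratio of about 1.3 in Table 1(b)
```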
Whether the association of exposure and disease risk is based on a prospective or retrospective study design, exposure must often be gleaned from such data as subject reports; subjects must be asked about their exposure. They may be interviewed, the interview taking place in person, over a telephone, or by mail. The questionnaire either guides questioning of the study participants or is the instrument used for data acquisition. A common technique is for subjects to complete a written questionnaire. The accuracy of questionnaire-based data is critical because, as demonstrated, the observed association of an epidemiologic exposure and risk not only is a function of that factor's etiologic importance but is also determined by the accuracy with which that factor is measured. It does not matter how strong an epidemiologic association truly is; the observed association will be affected by the accuracy of both exposure and outcome measurement (5).

Much of epidemiology has historically emphasized categorical data; subjects may be categorized as smokers or nonsmokers, as vegetarians or omnivores, as those who consume or do not consume a given vegetable, or as exercisers or nonexercisers. On the other hand, the data may be quantitative: the investigator may seek to assess the extent of subject exposure to an environmental toxin, such as cigarette smoke, or may seek a quantitative, overall measure of exposure to a specific type of vegetable. In each setting, an important component of research based on the collection of exposure data by means of a questionnaire is the accuracy of the data collected by that questionnaire.

QUESTIONNAIRE ASSEMBLY

Total exposure to risk factors, or to protective factors, may stem from many sources. Exposure to asbestos, for example, may be transmitted by water, air, food, or other sources. The investigator will seek to assess all sources of that exposure. Exposure to a food contaminant may result from consumption of any food that contains it, so the investigator will want to assess consumption of all the foods that contain the contaminant. Aflatoxin, a mold-derived toxin that induces liver cancer, can be found in corn products and in peanuts, so that a questionnaire seeking to assess aflatoxin exposure will need to attend to a range of sources of corn and peanut exposure. As total physical activity may stem from exposures in the workplace, from household tasks, and from leisure pastimes, the investigator will seek to assess physical activity in each setting. Alternately, the investigator may compile a list of all the activities in which people normally engage and that involve some expenditure of physical activity; that investigator may then inquire of subjects how often they engage in each activity and how long, each time, they engage in it. To represent total exposure would require all likely sources of that exposure to be considered.

There are, in general, few rules for determining whether a facet of exposure needs to be included in an epidemiologic questionnaire. In general, epidemiologists are reluctant to exclude any source of exposure. A canon of nutritional epidemiology has been that both frequencies and quantities of food product consumption must be measured; however, Hunter has argued that food quantity in general adds little in the way of useful information (9). The importance of any one exposure depends less on the contribution of one source to total exposure than on the contribution of the variance due to that source to total exposure variance.
Assume, for example, that in a study population, average energy expenditure is 2000 kcal/day, with a variance of 400 kcal/day. Assume, further, that all members of the study population engage in a given energy-expending activity, sleeping, that is responsible for expenditure of some 500 kcal/day. This would represent a sizeable expenditure of energy, but, unless substantial variance in energy expenditure is associated with this activity, querying subjects about it will add little to the variance in total energy expenditure. Including the measure may bring the estimated mean energy expenditure
closer to the true expenditure, but it will have little to no effect on the ability of the investigator to link energy expenditure to any epidemiologic outcome. If, on the other hand, there is great variance in voluntary or recreational physical activity about a mean expenditure of 150 kcal/day, then measuring voluntary or recreational physical activity would add to the ability of the investigator to link energy expenditure to an epidemiologic endpoint. Clearly, if each person in a population walks to his or her mailbox and spends exactly 15 kcal/day in doing so, then measuring that exposure will add little to total energy expenditure and less yet to the ability of the measure to reflect any association between physical activity and disease risk.

It is possible that a specific exposure will add little enough to total exposure to render it negligible. The importance of environmental tobacco smoke to total smoke exposure has been widely debated. In an attempt to address environmental tobacco smoke exposure, Cummings et al. (10,11) focused on nonsmokers, assuming that the exposure resulting from direct tobacco smoking would completely obviate environmental tobacco exposure. Thus, nonsmokers completed a questionnaire that addressed smoke exposure in automobiles, in the workplace, in dining locations, and in the home.

QUESTIONNAIRE ACCURACY: RELIABILITY AND VALIDITY

The major issue for questionnaire-based research is the performance of the questionnaire used to assess exposure: the ability of the questionnaire to reflect the activity supposedly represented by the queries. Mention of an exposure measure in epidemiology, whether psychosocial stress, social support, indirect smoke, air pollution, or any nutrient or food contaminant, almost without fail turns to the performance of that measure. Willett's Nutritional Epidemiology text, for example, contains a chapter on the performance of diet measurement questionnaires. A primary criterion for inclusion of databases in the large, Harvard-based cancer pooling project was the existence of a validation study (12).

The accuracy of questionnaire-based data is often described first in terms of reliability. Reliability is readily measurable; repeatability of a questionnaire is the most widely considered reliability measure. A basic approach is to use the test questionnaire to query subjects on two different occasions (13,14). In a variation on this approach, male subjects in a case-control study were queried about their diets; their wives were then also queried (15). The data from repeat administration of a questionnaire describe the correspondence in the data extracted from the questionnaires on the separate occasions. An alternative form of a measure may be used: subjects and their significant others may be queried regarding the subjects' exposure. Different questionnaires may also be used. McCann et al., for example, administered a series of diet questionnaires to subjects (16); they then considered the correspondence of the data extracted from the different questionnaires. In each of these situations, the data obtained by a questionnaire are compared with data obtained by a different instrument, and this instrument is not in general a perfect standard. The correspondence of the separate administrations of the questionnaire depends on the accuracy of the test instrument, on the accuracy of the instrument or administration with which it is compared, and on the independence of the errors in the separate questionnaires.
Clearly, the repetition of errors in the separate instruments would invalidate the reliability test.

The total exposure surveys that dominate epidemiology are structurally distinct from those used in social science. In both disciplines, the goal of the questionnaire is to assay exposure more accurately. However, questionnaires in epidemiology seek to gather information on exposures that may have similar impacts but whose occurrences are highly independent. A person's exposure to vegetable A may have very little to do with his or her exposure to vegetable B, even though both vegetables may contain sizeable amounts of a given nutrient (17). On the other hand, questionnaires and measures used in social science
often emphasize alternate indicators of a given exposure; these indicators will, in general, be highly correlated.

In addition to repeatability, a widely used measure of reliability considers internal consistency (18). The internal consistency criterion revolves around the extent to which alternative measures of a given exposure are correlated. The most common of these measures, Cronbach's alpha (18), is a function of the number of alternative indicators used as well as of the average interitem correlation. If 10 items are used and the average interitem correlation is 0.8, Cronbach's alpha will be approximately 0.98. If, on the other hand, the average interitem correlation among these 10 items is 0.2, Cronbach's alpha will be only approximately 0.71. According to this criterion of internal consistency, many of the exposure questionnaires commonly used in epidemiology, based on average interitem correlations of 0.1 to 0.2, would have distressingly poor reliability.

In general, empirical data describing the repeatability or reliability of a measure are used to point to, or to suggest, the validity of the measure: the association of data obtained by the measure with the truth. Validity describes the association of the data the measure obtains with what a "gold standard" measure would reveal. The connection between reliability and validity is critical; we are interested, to be sure, in the reliability of our measures. But what we want to know is less reliability itself, whether repeatability or internal consistency, than what that reliability implies about the degree of noise or error in our measure; we need to know how much noise our measure contains so that we can understand what exposure truly portends for the risk of disease. Since a true "gold standard" measure rarely exists, validity has to be derived from reliability; it is, in general, dependent on what is assumed about the structure of what is measured and what the investigator wishes to measure.

The connection between the reliability of a measure X, as represented by its correspondence with an alternative measure Z, and the validity of X has been demonstrated by Carmines and Zeller (18). Assume X and Z are parallel measures, each an imperfect indicator of a true exposure t:

X = t + e_x
Z = t + e_z

where e_x and e_z represent the errors contaminating the measures X and Z, respectively. The product-moment correlation of X and Z, r(X,Z), is the ratio of the covariance of X and Z to the square root of the product of their variances. Because the measures X and Z are parallel, Var(e_x) = Var(e_z) = Var(e). Because the errors are independent of one another and of the true values,

Cov(t, e_x) = Cov(t, e_z) = Cov(e_x, e_z) = 0

Thus,

r(X, Z) = Cov(X, Z) / [Var(X) Var(Z)]^(1/2)
        = Cov(t + e_x, t + e_z) / [Var(t + e_x) Var(t + e_z)]^(1/2)
        = [Cov(t, t) + Cov(t, e_x) + Cov(t, e_z) + Cov(e_x, e_z)] / Var(t + e)
        = Cov(t, t) / Var(t + e)
        = Var(t) / Var(t + e)
        = true variance / total variance
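A short simulation makes the identity concrete. This is a sketch under the parallel-measures assumptions just stated (normally distributed true scores, independent errors of equal variance), not an analysis from the chapter.

```python
# Simulate two parallel measures of a true exposure and check that their
# correlation equals true variance / total variance, and that the validity
# coefficient r(t, X) is the square root of that reliability.
import numpy as np

rng = np.random.default_rng(0)
n, var_t, var_e = 200_000, 1.0, 1.0
t = rng.normal(0.0, np.sqrt(var_t), n)        # true exposure
X = t + rng.normal(0.0, np.sqrt(var_e), n)    # parallel measure 1
Z = t + rng.normal(0.0, np.sqrt(var_e), n)    # parallel measure 2

r_xz = np.corrcoef(X, Z)[0, 1]
print(round(r_xz, 2), var_t / (var_t + var_e))          # ~0.5 vs. 0.5
print(round(np.corrcoef(t, X)[0, 1], 2),
      round(np.sqrt(r_xz), 2))                          # ~0.71 vs. ~0.71

# Spot-check of the Cronbach's alpha values quoted above, for k items
# with average inter-item correlation r.
def cronbach_alpha(k, r):
    return k * r / (1 + (k - 1) * r)

print(round(cronbach_alpha(10, 0.8), 2),
      round(cronbach_alpha(10, 0.2), 2))                # 0.98 0.71
```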
Thus, the correlation between two parallel measures equals the ratio of true to observed variance in the study measure under consideration. Note, however, that the assumption of no correlation among the errors of measurement is a strong one. It can also be shown that the correlation of the measured variable X with the actual or true exposure t is equal to the square root of the correlation of the parallel measures:

r(t, X) = [r(X, Z)]^(1/2)

Validity and reliability are independent of statistical significance. Assume, for example, that on two repeated administrations of a serum marker, the Pearson correlation of the time 1 and time 2 levels of that marker is 0.2. Assume, for simplification, that this repeatability measure is made on the basis of a sample of 1000 subjects, drawn from an appropriate database. Clearly, the correlation, based as it is on observations from 1000 subjects, is statistically significant. Nonetheless, this low correlation, regardless of its statistical significance, indicates that the measure is not highly repeatable. It is difficult, therefore, to expect that the validity of the measure, its correlation with a "gold standard" measure, would be high. Clearly, the statistical significance of this correlation does not necessarily mean that the measure is reliable or valid. A demonstration that measurement errors made at time 1 are correlated or associated with measurement errors made at time 2 would raise additional questions regarding the validity of the measure.

The accuracy of a measure, whether represented by reliability or validity, or both, cannot in general be represented as a dichotomy, as either accurate or not accurate. A measure might be shown to, in some sense, reflect an exposure, and its reflection of that exposure may be characterized as statistically significant; nonetheless, that reflection may be poor enough to render the measure useless. One might view one's reflection in a pool, and it may be abundantly clear that something real is reflected by the pool; nonetheless, the reflection may be poor enough that one would not want to depend upon it as a guide for applying makeup, shaving, or combing one's hair.

Few measures, on the other hand, are perfect. Error that is random with respect to disease risk will attenuate the apparent association of disease and exposure; even minor inaccuracies in the measurement of exposure will cause at least minor attenuation in the association of exposure and risk. What is most problematic is that very modest errors in the measurement of a strong risk factor will cause the effects of that risk factor to resonate. This resonance will bias estimates of the importance of correlates of that strong risk factor (6,7). If, on the other hand, errors of measurement are associated with disease, an endless number of biases are possible (19). If overestimation of exposure is correlated with disease risk, the strength of the association of risk and exposure will be overestimated.
If the overestimation of exposure occurs less frequently among those who are at higher risk, in other words, if the association of exposure overestimation and risk is negative, the positivity of the association of risk and exposure will be underestimated or even reversed.

The investigator does not obviate the problem of poor measurement by measuring an exposure quantitatively, then collapsing the data and analyzing them as categories. Table 3 displays the association of a quantitative exposure variable with the actual, true exposure; also displayed is the association that would be observed if the investigator collapsed the data and then analyzed them as if they were categorical. The data represent expected values from repeated samples of cases and controls. To facilitate comparisons, we set the quantitative analysis so that the coefficient represents the change in the odds ratio associated with a 1.65 standard deviation increase in exposure.

Table 3  Expected, Observed Association of Exposure and Disease, by Validity: A Quantitative Exposure Analyzed Quantitatively or as a Dichotomy

Validity: ratio of true       Quantitative     Dichotomous
to total variance (%)         analysis         analysis
100                           10.4             10.0
90                            9.3              8.9
80                            8.1              7.8
70                            7.1              6.8
60                            6.1              5.8
50                            5.2              5.0
40                            4.4              4.2
30                            3.6              3.5
20                            2.8              2.8
10                            2.0              2.1
0                             1.0              1.0

It can be seen that, in the absence of measurement error (validity of 100%), both the quantitative and the dichotomous analyses indicate that exposure is associated with an approximately tenfold elevation of the odds of disease. As validity declines, the declines in association revealed by the quantitative and dichotomous analyses are virtually identical. If a quantitative exposure imparts a strong effect on disease risk, the impact of that exposure will be revealed whether the data are treated quantitatively or collapsed and treated by categorical analyses: collapsing makes no difference. Statistical significance is not at issue: in moderate to large samples, all of these associations would be statistically significant. Nonetheless, the strength of the association is vastly underestimated in the face of even modest measurement error. As other studies have shown (6,16,20–22), validity ratios of 0.5 are not uncommon in nutritional epidemiology; such limited validity indicates that, whether the data are treated quantitatively or as a dichotomy, an odds ratio of approximately 10 will appear as one of approximately 5.
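Why does attenuation follow this pattern? Under classical measurement error with a normally distributed exposure, the regression coefficient is attenuated by the validity ratio, while the observed standard deviation is inflated by one over its square root, so the odds ratio per observed standard deviation scales as the true odds ratio raised to the square root of the validity ratio. The sketch below is our assumed generating model, not the authors' code; it reproduces the quantitative column of Table 3 to within rounding.

```python
# Assumed model: OR_obs = OR_true ** sqrt(lam), where lam is the validity
# ratio (true variance / total variance) under classical measurement error.
import math

OR_TRUE = 10.4  # odds ratio per 1.65 SD increase at 100% validity
for lam in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1):
    print(f"validity {lam:4.0%}: observed OR {OR_TRUE ** math.sqrt(lam):.1f}")
# validity 100%: 10.4 ... 50%: 5.2 ... 10%: 2.1 (Table 3 prints 2.0)
```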
DIMENSIONS OF QUESTIONNAIRE VALIDITY

With anything resembling a "gold standard" almost universally lacking, validity is generally addressed indirectly. Several approaches have been advanced (18). Content validity, generally a subjective assessment, summarizes our confidence that a variable measures what we intend for it to measure. This is what we see in an index: we understand that total exposure is in some sense a resultant of component exposures. Thus, in the example previously considered, a questionnaire addressing secondhand smoke exposure may well attempt to capture all the separate sources (the sites, locations, times, or occasions) of that exposure. This is reasonable, and the result is that one number, rather than several, represents this exposure. On the other hand, the reliability, hence the validity, of an index composed of a sum of these exposures, rather than their separate impacts, is not entirely clear.

Criterion-related validity, or the degree to which a measure is associated with an objective measure of what we intend to measure (as a driver's test measures actual driving performance), may be invoked. This is what many epidemiologists conceptualize as a "gold standard" measure. It might involve comparison to a concurrent standard, if such a standard is available. Thus, for example, with indirect tobacco smoke exposure, one might question a person about his or her exposure and then compare the estimate derived from the questionnaire with data based on a biologic marker of exposure, such as plasma or urinary cotinine. To clarify the complexity of establishing criterion-related validity, envisage subjects reporting on all their sources of indirect tobacco smoke over the past 10 years; one "gold standard" would be to have placed a monitor on those persons over the past 10 years. Or, failing that, the investigator might have placed monitors on the subjects for the past year, assuming that what happened over the past year could be a flawless representation of what had happened over the past 10 years. Or perhaps the investigator might ask all the subjects to wear monitors for the month following their completion of the questionnaire. The investigator would have, of course, to assume that what is observed over the coming month would flawlessly represent what had happened over the past 10 years. In all cases, it would be necessary for the investigator to assume that the wearing of the monitors would not alter the behavior of the persons wearing them, that it would not lead them to become more, or less, likely to seek to avoid indirect smoke.

One of the more vexing questions of recent years concerns food frequency measures of exposure to dietary micronutrients or macronutrients; these are based on series of questions regarding the frequency of consumption and usual portion size of various individual food items, and they are usually intended to reflect diet during a vaguely defined period in the past. Often, the questionnaire will direct the respondent to exclude the past year or two; cases in case-control studies may be asked to exclude the period subsequent to their experience of disease symptoms that might have affected their diet. Validation of these questionnaires usually comprises comparison of what has been derived to some kind of "standard": a diary of several days of food consumption or a series of food records. These latter measures, collected over a recent period, will cover a brief time span: three days to as long as a month (23,24). However, if they are gold standards, they are so in only a limited sense of the term: they generally refer to time spans different from that of the questionnaire they are intended to validate. As brief snippets of time, they are subject to considerable temporal variation. Food diaries draw attention to diet and impose a burden on subjects that may well alter those subjects' dietary practice, and dietary recalls may be subject to substantial desirability effects. In addition, such "standards" are subject to some of the same reporting biases that limit food frequency questionnaires, such as emphasizing the consumption of foods perceived to be "healthful." If the extent of imperfection in other exposure measures is well understood, then that information can be used in interpreting the association of the exposure and risk (25). Interpreting these data is significantly complex, however, and depends heavily on what is understood about the structure of the measurement errors in the data (7).

Construct validation concerns the agreement of an index, such as one derived from a questionnaire, with another index or measure with which the questionnaire-based index could reasonably be expected to agree.
Whether these separate indices could be expected to agree would depend on theoretical understanding of the process that would have generated the two indicators. Comparison of a test questionnaire to different questionnaires that would seem to reflect this same exposure would be described as construct validation. The just-mentioned comparison of a diet questionnaire to a recent food record or series of short-term diet recalls could be justified as construct validation. This justification would be based on the understanding that the food frequency instrument, the diet record, and the series of short-term recalls reflect a general dietary tendency. Errors in their measurement might, in addition, be expected to be independent; thus, the correlation of food frequency data with the two other measures would be expected to provide a validation of the food frequency questionnaire (21).
Given that the testing of a questionnaire is a laborious prelude to the investigation of a substantive concern, the researcher may seek not to continue to retest the questionnaire repeatedly, in different settings, or with different populations. Nonetheless, this testing is necessary for valid epidemiologic use of a questionnaire, and this validation should be a component of each administration of the questionnaire. It has been shown that slight variations in the variance resulting from epidemiologic measurement can have a decided effect on the data and on the epidemiologic conclusions drawn. Differential variance, such as might issue from differences in the validity of a questionnaire used among cases and controls, can lead to substantial misinterpretation and bias (26). Clearly, differences in the validity of an indicator among study subjects characterized by different genetic polymorphisms or haplotypes could lead to substantial bias in conclusions drawn.
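To make the last point concrete, consider a sketch with assumed illustrative numbers, not data from the chapter: the same true odds ratio of 2.0 in two genotype strata, measured with 10% misclassification in one stratum and 30% in the other, creates the appearance that the genotype modifies the exposure effect.

```python
# Expected observed OR in a stratum with control exposure prevalence p0 and
# symmetric, nondifferential misclassification probability m.

def observed_or(true_or, p0, m):
    p1 = true_or * p0 / (1 - p0 + true_or * p0)   # case exposure prevalence
    q1 = (1 - m) * p1 + m * (1 - p1)              # observed prevalences
    q0 = (1 - m) * p0 + m * (1 - p0)
    return (q1 * (1 - q0)) / (q0 * (1 - q1))

print(round(observed_or(2.0, 0.5, 0.10), 2))  # stratum 1: 1.73
print(round(observed_or(2.0, 0.5, 0.30), 2))  # stratum 2: 1.31
# Identical true effects, but the observed stratum-specific ORs differ.
```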
QUESTIONNAIRE ASSESSMENT IN MOLECULAR EPIDEMIOLOGY

A convenient and useful distinction for testing the reliability, and hence the validity, of questionnaires in the molecular epidemiologic setting concerns whether the data obtained are by nature categorical or numerical. If the data are quantitative, then Pearson's correlation is appropriate. If the data are truly categorical, then Cohen's kappa (27) provides a useful index of the correspondence of questionnaire-based data to a "gold standard":

K = [A(o) − A(e)] / [1 − A(e)]

where A(o) represents the observed categorical agreement between two measures, and A(e) represents the agreement that would be expected on the basis of chance. The index ranges from 1.0, for perfect agreement with a gold standard, to −1.0, for perfectly negative agreement (perfect disagreement) with a gold standard. Kappa can be used for any number of categories, although if the number of categories exceeds 3 or 4, statistical precision declines and the interpretation of the data becomes more complicated.

Table 4 contains a simple example of the use of kappa. In these examples, the measurement error is extremely simple: the probability of false-positive reports of exposure is the same as the probability of false-negative reports, and these probabilities are the same for cases as for controls. Validity is here defined as the probability of correct classification. The numbers of cases and controls are equal, and the overall prevalence of exposure is 50%. The table displays odds ratios associated with a dichotomous exposure that imparts a relative risk of 4.

Table 4  Validity, Misclassification, Kappa, and Observed Odds Ratio (True Odds Ratio = 4.0)(a)

Validity (%)    P(misclassification)    Kappa       Observed odds ratio
100             0                       1.00        4.0
95              0.05                    0.90(b)     3.45
90              0.10                    0.80(b)     2.98
85              0.15                    0.70(b)     2.59
80              0.20                    0.60(b)     2.25
75              0.25                    0.50(b)     1.96
70              0.30                    0.40(b)     1.71
65              0.35                    0.30(b)     1.49
60              0.40                    0.20(c)     1.31
55              0.45                    0.10(d)     1.07
50              0.50                    0           1.00

(a) Expected values, case-control data, equal numbers of cases and controls, 50% control exposure prevalence.
(b) Kappa statistically significant with 50 cases, 50 controls.
(c) Kappa statistically significant with 100 cases, 100 controls.
(d) Kappa statistically significant with 250 cases, 250 controls.
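The kappa computation defined above is simple to carry out. The sketch below is a minimal illustration of our own; the counts are assumed for the example and match Table 4's validity = 90% row (50% exposure prevalence, 10% of subjects misclassified in each direction).

```python
# Cohen's kappa for an agreement table between a questionnaire report and a
# reference ("gold standard") measure.

def cohens_kappa(table):
    """table[i][j] = count classified i by the questionnaire, j by the standard."""
    total = float(sum(sum(row) for row in table))
    a_obs = sum(table[i][i] for i in range(len(table))) / total
    row_p = [sum(row) / total for row in table]
    col_p = [sum(col) / total for col in zip(*table)]
    a_exp = sum(r * c for r, c in zip(row_p, col_p))
    return (a_obs - a_exp) / (1 - a_exp)

print(round(cohens_kappa([[45, 5], [5, 45]]), 2))  # 0.8, as in Table 4
```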
Table 5  Observed OR, by True OR and by Kappa (K)(a)

                                      K
True OR    .10     .20     .30     .40     .50     .60     .70     .80     .90
2.0        1.07    1.15    1.23    1.32    1.41    1.51    1.62    1.74    1.86
4.0        1.07    1.31    1.49    1.71    1.96    2.25    2.59    2.98    3.45
7.0        1.20    1.44    1.72    2.10    2.51    3.04    3.70    4.54    5.61
10.0       1.23    1.52    1.87    2.32    2.90    3.63    4.59    5.87    7.60
20.0       1.29    1.67    2.16    2.82    3.72    4.97    6.75    9.38    13.42

(a) Expected values, equal numbers of cases and controls, 50% control exposure prevalence. Abbreviation: OR, odds ratio.
Also presented in Table 4 is whether the kappa statistic is statistically significant in a study with 50, 100, or 250 cases and a like number of controls. With misclassification represented by m, kappa is equal to (1 − 2m). Even a modest decrease in kappa, such as a score of 0.7 or 0.8, is indicative of substantial bias in the observed relative risk. Statistically significant kappa scores do not ensure against attenuation of the odds ratio. A kappa score of 0.3 might reflect a statistically significant association of true and reported exposure; the use of a questionnaire with this performance would result in an odds ratio less than half of the true odds ratio. The importance of small numbers in this example is also noteworthy. Although the data are not presented, with samples of less than 500 cases and 500 controls, the 95% statistical confidence intervals for adjacent kappa scores, and even for those that are two categories apart, overlap. Thus, the ability of the investigator to sense whether the validity observed in small subsets of a sample varies is extremely limited.

A critical effect of measurement-error-induced attenuation is to bunch odds ratios together, pushing them all toward unity. A large odds ratio is attenuated more by measurement error than a smaller one is. Table 5 displays this problem. The table begins with true odds ratios of 2, 4, 7, 10, and 20. The analyses are based on infinite samples, so the results represent expected values, or what would be seen in the probability limit. (The reader may object that odds ratios of 10 or 20 are rarely, if ever, observed; the objection is fair but misplaced. True odds ratios are never observed; what are observed are odds ratios that have been attenuated by substantial measurement error.) It can be seen that a true odds ratio of 2 would be seen, with a kappa of 0.9, as one of 1.86, while one of 20 would be seen as 13.42. As kappa declines, indicating that measurement error is increasing, the observed odds ratios continue to converge. With no measurement error, odds ratios of 10 and 20 are, respectively, 5 and 10 times that of 2.0. With a kappa of only 0.5, however, the odds ratio that would be 2 is actually 1.41, while the one that would be 10 is 2.9 and the one that would be 20 is 3.7. The same process continues, so that, with a kappa score of 0.2, an odds ratio of 2 is observed as 1.15, while one of 4 is seen as 1.3 and one of 20 is seen as 1.67. Mismeasurement has heavily blurred the differences between these odds ratios. It is critical to recognize that, in molecular epidemiology, the ability of the investigator to distinguish among very different odds ratios is distinctly limited by questionnaire-induced measurement error.
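A hedged sketch of one generating model that reproduces the expected values in Tables 4 and 5 is given below. Our assumptions, following the design stated in the text (equal numbers of cases and controls, 50% overall exposure prevalence, symmetric misclassification with m = (1 − K)/2), are that the case and control exposure prevalences sit symmetrically about one-half; the model is an illustration, not the authors' own code.

```python
# Expected observed OR as a function of the true OR and kappa, under
# symmetric misclassification and 50% overall exposure prevalence.
import math

def observed_or(true_or, kappa):
    m = (1 - kappa) / 2                                 # misclassification
    p1 = math.sqrt(true_or) / (1 + math.sqrt(true_or))  # case prevalence
    p0 = 1 - p1                                         # control prevalence
    q1 = (1 - m) * p1 + m * (1 - p1)                    # observed prevalences
    q0 = (1 - m) * p0 + m * (1 - p0)
    return (q1 * (1 - q0)) / (q0 * (1 - q1))

print(round(observed_or(4.0, 0.9), 2))   # 3.45, as in Table 4
print(round(observed_or(4.0, 0.3), 2))   # 1.49, as in Table 4
print(round(observed_or(20.0, 0.5), 2))  # 3.72, as in Table 5
```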
In molecular epidemiologic studies, there is also the potential for bias derived from misclassification of biomarker status. This could be relevant for markers of exposure, such as serum micronutrients, hormone assays, or DNA adducts, and such studies must include extensive preliminary work to establish that the assay is reliable and valid, including suitable quality control measures. The use of blinded duplicates, repeated assays, and known and blank controls will assist in reducing misclassification. As discussed above, the temporal relevance of the marker must also be considered; an extrapolation of current measures of a biomarker to the timepoint most important in disease etiology may or may not reflect true values. Furthermore, if biomarkers are used in case-control studies, the potential effects of disease status among cases on the biomarker of interest should be evaluated. If it is likely that disease processes have affected the biomarker, then it clearly cannot reflect predisease status.

Markers of susceptibility are also prone to misclassification. As approaches to genotyping become more sophisticated and our knowledge expands, the potential for error in genotyping assays becomes more apparent. Errors could result from human error, such as mislabeling of samples or error or contamination in the plating of DNA; they could also result from errors in the assay itself. Although distribution of blinded duplicates throughout the plates may reveal errors to some extent, it is also recommended that samples with known values, such as DNAs from pedigrees, be used in preparing the assays to ensure that results are correct. It may also be prudent to perform genotyping assays using several different platforms on a subset of samples to ensure that results are accurate.

SUMMARY

The goal of questionnaire assessment, the evaluation of questionnaire performance, in molecular epidemiology is to evaluate the extent to which measurement error may be obscuring or distorting associations present in the data. This chapter has addressed questionnaire assessment, considering questionnaire assembly, dimensions of validity, and questionnaire assessment in molecular epidemiology. It has attempted to address, compare, and contrast two often discussed facets of questionnaire assessment: reliability and validity. It has throughout emphasized the significant impact of measurement error, pointing out that interpretation of an observed association in an epidemiologic data set is not readily possible without careful consideration of the quality of the exposure data gleaned from the questionnaires that provide the backbone of epidemiologic measurement.

REFERENCES

1. Thompson IM, Goodman PJ, Tangen CM, et al. The influence of finasteride on the development of prostate cancer. N Engl J Med 2003; 349(3):215–224.
2. Klein EA, Thompson IM, Lippman SM, et al. SELECT: the next prostate cancer prevention trial. Selenium and Vitamin E Cancer Prevention Trial. J Urol 2001; 166:1311–1315.
3. Vogel VG, Costantino JP, Wickerham DL, et al. Effects of tamoxifen vs raloxifene on the risk of developing invasive breast cancer and other disease outcomes: the NSABP Study of Tamoxifen and Raloxifene (STAR) P-2 trial. JAMA 2006; 295(23):2727–2741.
4. Leamer EE. Specification Searches: Ad Hoc Inference with Nonexperimental Data. New York: John Wiley and Sons, 1978.
5. Goldberg JD. The effects of misclassification on the bias in the difference between two proportions and the relative odds in the fourfold table. J Am Stat Assoc 1975; 70:561–567.
6. Marshall JR, Hastrup JL. Mismeasurement and the resonance of strong confounders: uncorrelated errors. Am J Epidemiol 1996; 143(10):1069–1078.
7. Marshall J, Hastrup JL, Ross JS. Mismeasurement and the resonance of strong confounders: correlated errors. Am J Epidemiol 1999; 150(1):88–96.
8. Terry PD, Goodman M. Is the association between cigarette smoking and breast cancer modified by genotype? A review of epidemiologic studies and meta-analysis. Cancer Epidemiol Biomarkers Prev 2006; 15:602–611.
9. Hunter D. Biochemical indicators of dietary intake. In: Willett W, ed. Nutritional Epidemiology. Oxford, UK: Oxford University Press, 1998:174–243.
10. Cummings KM, Markello SJ, Mahoney M, et al. Measurement of lifetime exposure to passive smoke. Am J Epidemiol 1989; 130:122–132.
11. Cummings KM, Markello SJ, Mahoney M, et al. Measurement of current exposure to environmental tobacco smoke. Arch Environ Health 1990; 45:74–79.
12. Hunter DJ, Spiegelman D, Adami HO, et al. Cohort studies of fat intake and the risk of breast cancer: a pooled analysis. N Engl J Med 1996; 334(6):356–361.
13. Xing X, Burr JA, Brasure JR, et al. Reproducibility of food intake in a food frequency questionnaire used in a general population. Nutr Cancer 1995; 24(1):85–95.
14. Xing X, Burr JA, Brasure JR, et al. Reproducibility of food intake in a food frequency questionnaire used in a general population. Nutr Cancer 1996; 25:259–268.
15. Marshall J, Priore RL, Haughey BP, et al. Spouse-subject interviews and the reliability of diet studies. Am J Epidemiol 1980; 112:675–683.
16. McCann S, Freudenheim JL, Trevisan M, et al. Recent alcohol intake as estimated by the Health Habits and History food frequency questionnaire, the Harvard semi-quantitative food frequency questionnaire, and a more detailed alcohol intake questionnaire. Am J Epidemiol 1999; 150:334–340.
17. Byers T, Graham S, Haughey BP, et al. Diet and lung cancer risk: findings from the Western New York Diet Study. Am J Epidemiol 1987; 125(3):351–363.
18. Carmines EG, Zeller RA. Reliability and Validity Assessment. Thousand Oaks, CA: Sage Publications, 1979.
19. Goldberger AS. Structural equation methods in the social sciences. Econometrica 1972; 40(6):979–1001.
20. Freudenheim J, Marshall J. The problem of profound mismeasurement and the power of epidemiologic studies of diet and cancer. Nutr Cancer 1988; 11:243–250.
21. Marshall JR. The reliability and validity of dietary data as used in epidemiology. In: Kinlen L, Campbell C, eds. Cancer Surveys: Advances & Prospects in Clinical, Epidemiological and Laboratory Oncology. Oxford, UK: Oxford University Press, 1987:673–684.
22. Marshall JR, Lanza E, Bloch A, et al. Indexes of food and nutrient intakes as predictors of serum concentrations of nutrients: the problem of inadequate discriminant validity. Am J Clin Nutr 1997; 65(4 suppl):1269S–1274S.
23. Willett W, Stampfer M, Chu NF, et al. Assessment of questionnaire validity for measuring total fat intake using plasma lipid levels as criteria. Am J Epidemiol 2001; 154(12):1107–1112.
24. Feskanich D, Marshall J, Rimm EB, et al. Simulated validation of a brief food frequency questionnaire. Ann Epidemiol 1994; 4(3):181–187.
25. Spiegelman D, Schneeweiss S, McDermott A. Measurement error correction for logistic regression models with an "alloyed gold standard". Am J Epidemiol 1997; 145(2):184–196.
26. Gregorio DI, Marshall JR, Zielezny M. Fluctuations in odds ratios due to variance differences in case-control studies. Am J Epidemiol 1985; 121(5):767–774.
27. Fleiss JL. Statistical Methods for Rates and Proportions. 2nd ed. Ontario, Canada: John Wiley & Sons, 1981.
9
Pharmacogenetics in Cancer Chemotherapy

Xifeng Wu and Jian Gu
Department of Epidemiology, The University of Texas M.D. Anderson Cancer Center, Houston, Texas, U.S.A.
INTRODUCTION

It is well recognized that there is wide interindividual variability in drug response. Given the same dosage of a drug, some patients experience complete remission, some fail to respond, and others experience severe toxicities. These differences in drug response are influenced by many factors, including demographics (age, sex, ethnicity), organ function and comorbidities, environmental exposures (diet, carcinogen exposure, drug interactions), and genetics (Fig. 1). In general, genetic factors are estimated to account for 15% to 30% of interindividual variability in drug response, but for certain drugs, genetic factors could explain up to 95% of this variation (1–4).

The term "pharmacogenetics" was coined almost 50 years ago to refer to the role of inherited genetic variation in the prediction of drug response (5). Earlier pharmacogenetic studies focused on pharmacokinetics, or the way the body metabolizes a drug and controls the amount of the drug at the site of action, which includes the absorption, distribution, metabolism, and excretion (ADME) of the drug. Almost all of the classic examples of pharmacogenetics (e.g., NAT2, CYP2D6, CYP2C19, TPMT, UGT1A1) have come from studies of pharmacokinetics. This area of pharmacogenetics continues to attract much attention, with growing research on drug transporters and drug-metabolizing enzymes. The second component of pharmacogenetics is pharmacodynamics, which involves the mechanisms of drug action.

With the completion of the human genome, the international HapMap project, and the rapid advance of high-throughput whole-genome technologies, the field of pharmacogenetics has been undergoing a steady evolution from the study of monogenic traits to that of polygenic traits, and from small-scale candidate-gene approaches to comprehensive, whole-genome approaches. The newer term "pharmacogenomics" reflects this evolution, which often involves large-scale, genome-wide analysis to identify genetic factors relevant to drug response. However, the two terms are often used interchangeably.

Pharmacogenetics is particularly relevant and important to cancer chemotherapy, since chemotherapeutic agents are characterized by a narrow therapeutic index and high
toxicity. In this chapter, we will introduce phenotypic and genotypic measurements in pharmacogenetic studies, illustrate the application of these measurements in cancer pharmacogenetics through a few classic examples, summarize issues and challenges in pharmacogenetic studies of polygenic drug response, and provide perspectives on future personalized cancer medicine.

Figure 1  Sources of interindividual variability in drug responses, given the same dosage of a drug.

PHARMACOGENETIC MEASUREMENTS

Phenotyping

Pharmacogenetic variations can be evaluated using phenotypic and genotypic measurements. Measurement of phenotype has traditionally been applied to assess variations in drug metabolism. The early success stories of pharmacogenetics, such as NAT2, the CYP family, and TPMT, have stemmed from metabolic phenotype measurements. Metabolic phenotyping can classify individuals into three phenotype classes: extensive metabolism (EM) of the drug (characteristic of the normal population), poor metabolism (PM), and ultraextensive metabolism (UM). Effective and convenient metabolic phenotyping assays will continue to have wide applicability in pharmacogenetics.

The metabolic phenotype of a specific drug-metabolizing enzyme may be assessed by direct measurement of enzyme activity or by administration of a probe drug or substrate and subsequent measurement of the metabolites in serum or urine (6). Direct measurement of enzyme activity in human liver microsomes has been limited to small-scale laboratory experiments because of obvious ethical and technical issues related to human liver sample collection and processing, preparation of microsomes, and in vitro measurement. Direct measurement of enzyme activity in a surrogate tissue such as peripheral blood is an ideal alternative and the most convenient approach. However, many enzymes of pharmacogenetic interest are expressed at very low levels in surrogate tissues compared with target tissues (e.g., liver or small intestine mucosa), and the lack of correlation between surrogate and target tissues renders this approach less practical. There are only a few examples of direct phenotypic measurements using peripheral blood as a surrogate, including the glutathione S-transferases (GSTs) GSTM1 and GSTT1, the methyltransferase TPMT, paraoxonase, and sulfotransferase (6). Among these, the direct measurement of
TPMT activity in erythrocytes remains the most successful example of the clinical application of phenotyping.

Because of the limitations of direct phenotypic measurements of enzyme activity, indirect assays using probe drugs have been the most widely applied method of metabolic phenotyping. In a typical phenotyping assay of this kind, individuals are orally administered a probe drug that is specifically metabolized by the enzyme of interest. Following this, urine and/or blood samples are collected over a specific period of time (e.g., 8–12 hours). The levels of the parent drug and its metabolites are measured in urine and/or serum, and the metabolic ratio (MR), defined as the ratio of metabolite to parent compound, is calculated and plotted as a frequency distribution histogram. There are a few important considerations for this particular phenotypic approach. Probe drugs should be specific for the target enzyme and be well tolerated by the subject, and the chemical analysis should be simple and robust. Phenotyping of the CYP family enzymes using this approach has been extensively investigated, and well-validated probes have been identified for most CYP enzymes (6,7). Furthermore, simultaneous use of several specific probe drugs for different enzymes in a "cocktail" approach can be used to assess a metabolic profile. Numerous useful phenotype cocktails have been reported with good tolerability and no evidence of interactions among probe drugs (8). With continuing efforts to improve probe selection and phenotype metrics, drug cocktails can be a valuable, safe, and scientifically sound tool to characterize drug metabolic phenotypes.

Although drug metabolism is the predominant application of phenotyping in pharmacogenetics, phenotypic measurements have also been used to determine variations related to pathways of drug action, such as assessing DNA repair capacity (DRC) in the context of cisplatin-based chemotherapy. Cisplatin forms bulky DNA adducts, which are primarily repaired through the nucleotide excision repair (NER) system. Theoretically, poor DRC should favor cisplatin treatment efficacy, since more cisplatin-DNA adducts would be formed and not removed, enhancing cytotoxicity of the drug, whereas effective DRC may predispose to chemoresistance to cisplatin. Indeed, some previous studies support this notion. In a study investigating the role of DNA repair in intrinsic resistance of non-small cell lung cancer (NSCLC), the overall DRC was estimated by the ability of cells to repair a transfected reporter plasmid damaged by cisplatin (the "host cell reactivation" assay). A correlation was observed between reporter gene activity and intrinsic resistance to cisplatin, suggesting that elevated DRC may predispose to cisplatin resistance (9). In a case-control study of advanced NSCLC patients, DRC was measured in the patients' peripheral lymphocytes by the host cell reactivation assay (10). In patients treated with chemotherapy only, the relative risk of death increased by a factor of 1.11 (95% CI, 1.02–1.21) for every percentage-point increase in DRC, and patients in the top quartile of the DRC distribution had more than twice the relative risk of death of those in the lowest quartile (RR = 2.72; 95% CI = 1.24–5.95; p = 0.01). Effective DRC was not a risk factor for death in patients who were not treated with chemotherapy. These studies suggest that phenotypic measurement of DRC in surrogate cells may aid in the prediction of response to cisplatin-based chemotherapy.
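As an illustration of the probe-drug approach described above, the sketch below assigns a metabolic phenotype class from the metabolic ratio. This is an illustration only: the cutoffs are hypothetical placeholders, since real antimode cutoffs are enzyme- and probe-specific and are read off the observed MR frequency distribution.

```python
# Classify a metabolic phenotype from MR = metabolite / parent compound,
# following the chapter's definition (so poor metabolizers show low MR).

PM_CUTOFF, UM_CUTOFF = 0.3, 10.0  # hypothetical antimode cutoffs

def classify_metabolizer(metabolite, parent):
    mr = metabolite / parent
    if mr < PM_CUTOFF:
        return mr, "PM (poor metabolizer)"
    if mr > UM_CUTOFF:
        return mr, "UM (ultraextensive metabolizer)"
    return mr, "EM (extensive metabolizer)"

print(classify_metabolizer(metabolite=4.2, parent=1.5))
# (2.8, 'EM (extensive metabolizer)')
```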
There are several advantages and disadvantages to the use of direct phenotyping. The approach is more clinically relevant because it measures the combined effects of genetic, epigenetic, and environmental factors on enzyme activity. However, phenotyping requires collection of samples before treatment and usually requires repeated sample collection. The experimental process is labor intensive, time consuming, and not easily applicable to large clinical studies. Some additional drawbacks include risk of adverse drug reactions, incorrect phenotype assignment, and potential confounding by disease status and other nongenetic factors.
Genotyping

In the last two decades, advances in technology have allowed the rapid, accurate, and high-throughput detection of germline genetic variants, mostly single nucleotide polymorphisms (SNPs), facilitating an explosion in pharmacogenetics research and publications. Genotyping is much more widely used than phenotyping in pharmacogenetic studies, and numerous genotyping techniques have been developed and widely applied. Using a candidate gene approach, SNPs in genes relevant to specific treatment agents have been queried, and numerous associations between genetic polymorphisms and drug response have been reported.

The genotypic approach has several advantages over phenotyping. First, genotypes are fixed; although ultimate enzyme activity may be affected by other factors, genotypes will not be affected by treatment or environment, so sample collection and measurement can be done before, during, or after drug treatment. Second, genotyping can be performed from one blood sample or from even more easily obtained material (e.g., buccal cells, nail clippings, archived tissue). Third, current genotyping methodology is robust and accurate, and measurement error is negligible. Finally, the ever-increasing publicly available SNP data, coupled with high-throughput genotyping technology (e.g., custom SNP arrays and whole-genome SNP arrays), allow for the comprehensive or genome-wide analysis of hundreds of thousands of SNPs simultaneously. In December 2004, the U.S. FDA approved the first pharmacogenetic test, the AmpliChip CYP450, designed to screen common polymorphisms of two major drug metabolism genes (CYP2D6 and CYP2C19).

The determination of haplotypes to estimate variability across genomic blocks has been increasingly used in pharmacogenetic studies. Haplotypes provide more comprehensive analysis of single genes and may increase power to detect relationships of rare alleles with treatment outcomes. Most genes have a limited number of common haplotypes (11), which may reduce the number of SNPs necessary to analyze a gene. The ongoing international HapMap project has greatly facilitated haplotype analysis in pharmacogenetic studies.

However, genotyping is clinically relevant only to the degree to which it predicts phenotype. For some genes, there may be no strong candidate polymorphisms that robustly affect enzyme function. For many of the known polymorphisms evaluated, the functional impact is either not clear or not strong enough to be biologically and clinically relevant. With millions of SNPs deposited in public databases, selection and prioritization of candidate SNPs is becoming increasingly complex and challenging (12). Taking into consideration the polygenic nature of drug response, and the redundancy of cellular genes and functions, it is conceivable that the majority of individual SNPs and haplotypes have no noticeable effect on drug response. Most reported associations between individual genotypes or haplotypes and drug response have been inconsistent and/or contradictory. The excitement of the initial success in a few monogenic traits has been tempered by the reality of the complexity of polygenic drug responses. As we will discuss later, assessing the cumulative effects of multiple genes and polymorphisms may be a necessary trend in future pharmacogenetic studies. Furthermore, in addition to the currently prevalent hypothesis-driven candidate gene approach, the discovery-driven whole-genome scanning approach may soon come into the pharmacogenetic field.
APPLICATION OF PHARMACOGENETICS TO CANCER TREATMENT OUTCOMES

6-Mercaptopurine and TPMT

The thiopurine S-methyltransferase (TPMT) polymorphism has been a prototype for the clinical importance of pharmacogenetics. TPMT has several features that make it an ideal model for pharmacogenetics: monogenic inheritance of TPMT activity, a simple
measurement of phenotype (TPMT activity) in surrogate tissue (erythrocytes), existence of functional polymorphisms, near-complete concordance between genotype and phenotype, a clearly defined clinical endpoint (toxicity), and a direct causal relationship between genetic polymorphism and toxicity.

TPMT is a cytosolic metabolizing enzyme that catalyzes the S-methylation of thiopurine drugs, such as 6-mercaptopurine (6-MP), 6-thioguanine, and azathioprine. These drugs are used in the treatment of acute lymphoblastic leukemia and rheumatoid arthritis, and as immunosuppressants. There is wide interindividual variability in human TPMT enzymatic activity (13): about 90% of individuals have high activity, 10% have intermediate activity, and a small percentage (~1 in 300 individuals) have exceptionally low TPMT activity. If these low-activity patients are given standard doses of thiopurines, severe and sometimes life-threatening hematopoietic toxicity will occur, due to the accumulation of cytotoxic metabolites (14–17). Therefore, a substantial (~10-fold) dosage reduction is required in patients homozygous for mutant TPMT to avoid toxicity (17–19). In addition, patients with heterozygous genotypes (~10% of Caucasians and African-Americans) are also at a higher risk of thiopurine toxicity (20).

The TPMT gene has about 20 functional polymorphisms. In a large study of 1214 healthy Caucasians, Schaeffeler et al. (21) found that the overall concordance rate between TPMT genotypes and phenotypes was 98.4%, providing strong support for using genotypes as predictors of TPMT activity and thiopurine therapy-induced toxicity. In Caucasians, the frequencies of the three most common variant genotypes, TPMT*3A (Ala154Thr and Tyr240Cys), TPMT*3C (Tyr240Cys), and TPMT*2 (Ala80Pro), are 4.4%, 0.4%, and 0.2%, respectively; combined, these account for >95% of the low-activity phenotypes. In Asians and Africans, TPMT*3C is the most frequent variant allele. The U.S. FDA has added TPMT genetic information to the package inserts of these drugs, which recommends TPMT testing (genotypic testing for TPMT*3A, *3C, and *2 and/or phenotypic testing of thiopurine nucleotides and TPMT activity in erythrocytes) if patients exhibit clinical or laboratory evidence of severe toxicity, particularly myelosuppression.

Irinotecan and UGT1A1

The uridine diphosphoglucuronosyltransferases (UGTs) are a superfamily of membrane-bound enzymes that participate in one of the primary pathways of phase II drug metabolism. UGTs are expressed predominantly in the liver and catalyze the glucuronidation of a wide variety of drugs, including irinotecan (22). Irinotecan is an analogue of camptothecin, a topoisomerase I inhibitor that was initially approved for the treatment of advanced colorectal cancer and has since been extended to, or is under evaluation for, a few other solid tumors. Irinotecan is converted by hepatic microsomal carboxylesterases to the active metabolite SN-38, which is further conjugated into the inactive SN-38 glucuronide (SN-38G) by UGTs and eliminated via the bile. UGT1A1 is the major liver enzyme responsible for the glucuronidation of irinotecan, and, as with other chemotherapy drugs, irinotecan may cause life-threatening toxicities among patients with poor-metabolizer phenotypes. Phenotypic measurements using in vitro SN-38 and bilirubin glucuronidation assays with human liver microsomes clearly demonstrated wide interindividual variation (>15-fold) in liver UGT1A1 activity (23).
A common dinucleotide (TA) repeat polymorphism (containing 5, 6, 7, or 8 copies of the TA repeat) has been identified in the UGT1A1 promoter TATA box, with an inverse relationship between the number of TA repeats and promoter activity (23–25). The (TA)6TAA allele is
the most common allele, and (TA)7TAA (UGT1A1*28) is the major variant allele in Caucasians. The presence of more than six TA repeats leads to reduced UGT1A1 activity and less glucuronidation of SN-38 to SN-38G, resulting in excessive SN-38 accumulation and severe toxicity. Several retrospective and prospective studies have linked the UGT1A1*28 allele with severe irinotecan-related toxicity, particularly neutropenia (26–29). In 2005, the U.S. FDA approved the inclusion of UGT1A1 genotyping information in the irinotecan package insert and recommended reducing the starting dose of the drug for homozygous UGT1A1*28 carriers. Shortly thereafter, in August 2005, the FDA approved the first cancer pharmacogenetic test for the UGT1A1*28 allele, the Invader UGT1A1 Assay by Third Wave Technologies.

Two recent studies of UGT1A1 are worth mentioning. In a large prospective study, Toffoli et al. (30) followed 250 metastatic colorectal cancer patients treated with irinotecan-based therapy and observed a significantly increased risk of severe neutropenia among patients carrying the UGT1A1*28 allele; however, the increased neutropenia was evident only in the first cycle, not throughout the whole treatment period. In addition, the treatment response rate was significantly higher in homozygous UGT1A1*28 patients than in patients with the common alleles. The implication of this study is that careful consideration should be given to irinotecan dose reduction in patients with the UGT1A1*28 allele, because the toxicity is likely to be manageable and patients with the UGT1A1*28 allele have higher response rates. However, associations between UGT1A1*28 and irinotecan treatment efficacy have not been consistent. In another report, Han et al. (31) found that another common variant, UGT1A1*6 (Gly71Arg), present almost exclusively in Asians and more prevalent than UGT1A1*28 in that population, was associated with a higher incidence of severe neutropenia in Korean patients with advanced non-small cell lung cancer (NSCLC) receiving irinotecan and cisplatin treatment. No UGT1A1*28 homozygote was found in this study, suggesting that UGT1A1*28 testing may not be adequate to predict severe toxicity in Asians or other non-Caucasian populations. Another surprising observation from this Korean study was that patients homozygous for UGT1A1*6 had a lower tumor response and significantly shorter progression-free and overall survival. Biologically, less glucuronidation of the active SN-38 should favor efficacy at the cost of higher toxicity. This counterintuitive finding, as well as many other inconsistent observations between polymorphisms and drug efficacy, highlights the complexity of treatment outcomes. A comprehensive evaluation of germline variations in pharmacokinetic and pharmacodynamic genes, combined with genetic and molecular alterations in tumor tissue, is needed to predict treatment efficacy.

Pharmacogenetics of 5-Fluorouracil

The pharmacogenetics of 5-fluorouracil (5-FU) has focused on thymidylate synthase (TS), which is a drug target; dihydropyrimidine dehydrogenase (DPD), which is involved in drug metabolism; and the folate metabolic pathway genes that regulate TS, most notably methylenetetrahydrofolate reductase (MTHFR).

Thymidylate Synthase

TS is a folate-dependent enzyme that catalyzes the conversion of dUMP to dTMP, providing the sole intracellular source of dTMP, and is the main target for the fluoropyrimidine drugs.
5-FU is converted to FdUMP, which forms a tight covalent bond with TS and inhibits thymidylate synthesis. In the 5’ untranslated enhancer region of the
TS gene, there is a 28-bp variable number of tandem repeats (VNTR) polymorphism (32). The number of tandem repeats ranges from 2 (2R) to 9 (9R). In Caucasians, the 2R and 3R alleles are predominant. The 3R allele has been associated with higher TS protein expression than the 2R allele. Earlier pharmacogenetic studies evaluating this VNTR polymorphism suggested that the 3R allele conferred a lower treatment response rate (33,34), which led to the first genotype-guided clinical trial in North America based on this TS VNTR polymorphism (35,36). However, some recent studies have provided somewhat conflicting clinical results in advanced colorectal cancer, with one null result and another showing a significantly higher response rate for patients with the 3R/3R genotype (37,38).

These inconsistent results again highlight the challenges in predicting drug efficacy in pharmacogenetic studies using the candidate gene approach. Drug efficacy is a complex trait that involves many factors, and it is unlikely that any single polymorphism will have a dramatic effect on outcomes. The functional impact of any individual polymorphism may be negligible or modest, and other polymorphisms may enhance or counteract its effects. This complex scenario is best illustrated by the TS polymorphisms. There is an additional G/C SNP within the 3R VNTR, and the 3G allele has been associated with higher transcriptional and translational activity than the 3C allele (39). Ruzzo et al. (40) recently showed that the TS 5' UTR 3G genotypes (2R/3G, 3C/3G, 3G/3G) predicted worse outcome, suggesting that a double assessment of the VNTR and the G/C SNP might be more relevant than the simple VNTR analysis. The analysis of the predictive role of TS polymorphisms may be more complex still if the 6-bp deletion/insertion (del/ins) polymorphism in the 3' UTR is added for consideration; the 6-bp deletion allele may cause mRNA instability and reduce intratumoral TS mRNA levels (41). An improved design for studying the pharmacogenetic effect of TS polymorphisms should include a global evaluation of the combined VNTR, G/C, and 6-bp del/ins polymorphisms (42). The same comprehensive analysis applies to other genes, especially when the functional polymorphism is unclear. Haplotypes may be more informative than individual genotypes in predicting gene function, and haplotype tag SNPs (htSNPs) based on the international HapMap project may represent the next wave of pharmacogenomic studies.

Dihydropyrimidine Dehydrogenase

Up to 85% of administered 5-FU is metabolized to its inactive form, 5,6-dihydro-5-FU, by dihydropyrimidine dehydrogenase (DPD). In the general white population, 3% to 5% of individuals are heterozygous and 0.1% homozygous for the DPYD null genotype. DPD activity in peripheral blood mononuclear cells (PBMC) has been used as a surrogate for liver DPD activity, and a significant linear correlation between liver and PBMC DPD activity (r = 0.56, p = 0.002) has been noted (43). Patients who exhibited the lowest PBMC DPD activity also had very low liver DPD activity (43). There have been reports that DPD deficiency is associated with 5-FU toxicity (44,45), but in a population study assessing activity in PBMC, the risk of developing side effects was not linked to pretreatment DPD activity in patients treated with 5-FU (46), and the correlation between pretreatment DPD activity and 5-FU systemic clearance was weak.
Phenotyping of PBMC DPD activity is labor intensive and difficult to apply in clinical practice. Other phenotyping methods for DPD have been developed, such as measuring metabolic ratios of pyrimidines in plasma or urine, the fluorouracil test-dose approach, and the 13C-uracil breath test, but each has its inherent limitations (47). A common SNP (DPYD*2A, IVS14+1G>A), which disrupts the 5' splice donor site of intron 14 and results in exon 14 skipping, accounts
for >50% of variants and is linked to ~25% of severe 5-FU toxicities (48). However, this SNP is rare and is also present in individuals with normal DPD activity (49). The majority of DPD SNPs are very rare and not associated with low DPD activity. Collie-Duguid et al. (49) showed that only 17% of patients with a low DPD phenotype have a molecular basis for reduced activity, emphasizing the complex nature of the molecular mechanisms controlling polymorphic DPD activity in vivo. In addition, 5-FU-related toxicity also occurs in a subset of patients with normal DPD activity, indicating the contribution of other genes. In light of all these issues, the future of DPD in the pharmacogenetics of 5-FU is unclear.

MTHFR

The results for MTHFR and other folate-pathway gene polymorphisms in 5-FU pharmacogenetics are inconsistent. Some studies suggest that the 677T or 1298C alleles are associated with a better 5-FU response (38,50–52), and there might also be joint effects when these two SNPs are combined (50). However, the pharmacogenetic impact of this pathway on 5-FU efficacy is uncertain, and likely to be modest at best.

Pharmacogenetics of Cisplatin

Cisplatin is one of the most commonly used chemotherapy drugs. Platinum agents form intra- and interstrand DNA adducts that result in bulky distortion of DNA and inhibit DNA replication. The level of platinum-DNA adducts in the circulation is correlated with clinical outcome, and resistance to platinum agents has been linked to enhanced tolerance and repair of DNA damage. Therefore, the major focus of cisplatin pharmacogenetics has been on DNA repair genes (especially nucleotide excision repair genes), as well as on the glutathione S-transferases (GSTs), the major phase II detoxifying enzymes for platinum agents. Some positive associations have been observed, but there are also many null or conflicting results (37,40,50,53–57). The inconsistent findings, as well as the lack of major candidate markers in cisplatin pharmacogenetics, are not surprising, since platinum agents do not target a specific protein, and none of the numerous polymorphisms in DNA repair genes have shown a major functional impact experimentally. In addition, both the DNA repair and GST systems are highly complex and are probably tightly regulated in vivo, because of redundant and/or alternative mechanisms. Single germline polymorphisms in one or a few candidate genes would not be expected to have a strong effect on cisplatin response, and tumor heterogeneity, treatment heterogeneity, and other uncontrolled confounders may obscure the modest effects of genetic polymorphisms. Given the expected low impact of individual polymorphisms on drug response, we need to move beyond the candidate gene approach and apply a comprehensive polygenic approach, combined with somatic genetic events and environmental factors, to build multivariate models to predict drug response.

Somatic Mutations and Pharmacogenetics

Cancer is largely a somatic genetic disease characterized by specific mutations and genetic instabilities that lead to chromosome translocations, losses, and gains. All these types of somatic aberrations may exert effects on treatment efficacy similar to those of germline polymorphisms. One illustrative example is the pharmacogenetics of epidermal growth factor receptor (EGFR) inhibitors. Gefitinib (Iressa) and erlotinib (Tarceva) are kinase inhibitors that specifically target the EGFR kinase domain and have been used in the treatment of advanced NSCLC. Several studies have reported that EGFR
mutations are strong determinants of tumor response to gefitinib (58–60). The two most common gefitinib-sensitizing EGFR mutations are a 15-bp in-frame deletion in exon 19 (E746-A750del) and a point mutation in exon 21 (L858R). On the other hand, another specific somatic mutation (T790M) was shown to be associated with resistance to gefitinib or erlotinib therapy in NSCLC (61,62). The example of somatic mutations in the EGFR gene affecting gefitinib response highlights the need to integrate both germline and somatic variations in determining drug efficacy.

ISSUES AND CHALLENGES IN PHARMACOGENETICS

In addition to the examples listed above, numerous pharmacogenetic studies using a candidate gene approach have evaluated relationships between treatment outcomes and polymorphisms in genes involved in drug metabolism pathways, membrane transport, and drug action. Apart from the examples given above, where genotype very clearly predicts drug response, the results from many candidate gene studies have been inconsistent. The candidate gene approach was originally intended to identify a robust, independent main effect for a single locus, based on the assumption of a monogenic trait. Unfortunately, it is rare to find such a strong, monogenic candidate polymorphism for currently used chemotherapeutic drugs. Instead, most published pharmacogenetic studies have used the candidate gene approach to investigate complex polygenic drug response traits. In these studies, the hypothesis may not be particularly strong, the genotype-phenotype correlation may be weak, or the anticipated effect may be modest. Additional factors that affect pharmacogenetic study results include the small number of patients evaluated, patient and tumor heterogeneity, differing treatment regimens and schedules, and failure to evaluate the effect of multiple genes that are pathophysiologically related. Thus, it is not surprising that many pharmacogenetic studies using the candidate gene approach have had inconsistent results. To improve the reproducibility and applicability of pharmacogenetic findings, a number of issues and challenges need to be considered.

Study Design

Principles of epidemiologic study design should be applied to studies of treatment outcomes, including experimental designs (clinical trials) and observational designs (cohort and case-control studies). The case-series design predominates in pharmacogenetic studies. Most published studies were not prospectively designed, but instead were based on retrospective assessment of patients who had already received treatment. Although a retrospective approach in and of itself does not necessarily reduce the rigor of a study, the heterogeneity of study populations in these "convenience" studies may increase the likelihood of spurious or false-negative findings. This may be a particular problem when the study population is derived either from cases from a completed case-control study or from tumor registries or pathology departments, where disease characteristics and treatments received are heterogeneous. The optimal population for a pharmacogenetic study would be a relatively homogeneous group of patients treated on a clinical trial, stratified by drug regimen, with all patients receiving the same chemotherapy drugs at the same doses.
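Because these design considerations ultimately come down to whether a study can detect a plausibly modest genotype effect, a quick sample-size calculation is a useful design-stage check. The sketch below uses the standard normal-approximation formula for comparing two proportions; the toxicity rates by genotype group are hypothetical, chosen only to illustrate the calculation.

```python
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for comparing two
    proportions, e.g., grade 3-4 toxicity in carriers vs. non-carriers."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p1 - p2) ** 2

# Hypothetical effect: 15% toxicity in non-carriers vs. 30% in carriers.
print(f"{n_per_group(0.15, 0.30):.0f} patients per group")   # about 120
# A more modest effect (15% vs. 20%) needs far more patients.
print(f"{n_per_group(0.15, 0.20):.0f} patients per group")   # about 905
```

Even this simple calculation shows how quickly modest effects, of the size typical for single polymorphisms, push requirements into the high hundreds of patients per group.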
Lack of reproducibility may also be related to poor study design and execution, particularly if sample size is not adequate for detection of modest individual effects. Overinterpretation of marginal results in the context of multiple testing should also be
avoided. Population stratification is one of the most often cited reasons for spurious findings in association studies (63), and it may be particularly relevant to pharmacogenetic studies, since many clinical trials are conducted in admixed and heterogeneous populations in the United States. However, neither systematic testing for population stratification nor statistical methods to correct for it have generally been incorporated into published pharmacogenetic studies. An additional weakness of many pharmacogenetic studies is the lack of a proper comparison group. Without a comparable group that received no adjuvant therapy, it may be difficult to distinguish whether genotype is an effect modifier (a predictor of drug response) or a prognostic factor not necessarily related to the treatments received. Prospectively designed studies involving a sizable number of subjects and sufficient statistical power are warranted to confirm retrospective observations and validate effect modification of treatment outcomes by genetic variability. The study design that would produce the most scientifically rigorous data and the strongest evidence of pharmacogenetic benefit is the genotype-guided clinical trial. However, high cost and difficulties in enrolling participants often make this design infeasible, particularly since most current pharmacogenetic biomarkers fail to achieve a desirable cost-benefit ratio, even when validated in large observational studies.

Pathway-Based Polygenic Genotyping and Analysis

It is increasingly recognized that most common complex human diseases and most drug responses are under the control of many genes, each contributing a modest individual effect. Therefore, evaluation of numerous polymorphisms in multiple genes, and dissection of the complex interactions among genetic loci, is likely to be necessary to identify sensitive and specific "predictor profiles" for drug response. An illustrative example of the importance of taking a multigenic approach in a pharmacogenetic study is the case of warfarin. Warfarin is an anticoagulant drug that targets vitamin K epoxide reductase complex 1 (VKORC1) and is metabolized mainly by CYP2C9. VKORC1 genotypes account for about one-fourth of the variance in the warfarin maintenance dose, and CYP2C9 genotypes for 6% to 10%. However, genetic profiles combining both genes account for more than 50% of the variability in the maintenance dose of warfarin (64–66). Therefore, analysis of the combination of VKORC1 and CYP2C9 genotypes provides significantly enhanced power to identify warfarin-sensitive patients who would require a lower maintenance dose of the drug. The same multigenic approach can be applied to cancer pharmacogenomic research, and it may be especially relevant for cancer therapies that do not have major protein targets, work through a number of mechanisms, or involve combination therapy with multiple agents.

We will use cisplatin-based chemotherapy as an example to illustrate the application of a pathway-based approach in pharmacogenetic studies. There have been many pharmacogenetic studies of cisplatin-based therapy. The GST family of enzymes and DNA repair genes are the most commonly investigated. Several potential candidate polymorphisms in the GSTP1, ERCC1, XPD, and XRCC1 genes have been reported, but each significant association has also drawn contradictory reports (40,50,53–57). None of the evaluated candidate polymorphisms in these genes have particularly strong evidence of significant functional impact.
The associations were generally weak and would not be clinically usable as single predictors of treatment response, even if confirmed in larger prospective studies. Theoretically, however, a large number of deleterious alleles, each contributing a small yet meaningful share of drug response, should collectively enhance predictive power to a level that may be clinically relevant. Several recent "proof-of-principle" studies have demonstrated the
enhanced power of this pathway-based polygenic approach. Stoehlmacher et al. (56) jointly analyzed polymorphisms of four genes (XPD, ERCC1, GSTP1, TS) in colorectal cancer patients treated with 5-FU/oxaliplatin and found that an increasing number of favorable alleles was associated, in a stepwise manner, with significantly longer survival. Quintela-Fandino et al. (67) recently determined the associations between four DNA repair gene SNPs (XPD Asp312Asn and Lys751Gln, ERCC1 C8092A, and XRCC1 Arg399Gln) and clinical outcomes in patients with advanced squamous cell carcinoma of the head and neck (SCCHN) receiving cisplatin-based induction chemotherapy. In a Cox multivariate analysis, each variant allele reduced the risk of death by 2.1-fold, and patients with seven variant alleles exhibited a 175-fold decrease in the risk of death compared with those carrying all common alleles (p < 0.001). The probability of achieving a complete response increased 2.94-fold per additional variant allele (p = 0.041). Wu et al. (50) evaluated the role of nine SNPs in eight nucleotide excision repair (NER) genes in esophageal cancer patients treated with 5-FU/platinum chemoradiation. Although no significant individual associations were observed, there was a significant trend of decreasing risk of death with a decreasing number of unfavorable alleles (p for trend = 0.0008) (50).

In addition to simply summing the number of adverse alleles, several sophisticated data-mining and analysis tools, such as classification and regression trees (CART), multifactor dimensionality reduction (MDR), random forests, and artificial neural networks, have been used to analyze combinations of multiple polymorphisms in genes in drug-related pathways, to account for gene-gene interactions, and to identify the genotype profiles that best predict drug response (68). For example, CART uses a binary recursive partitioning method to identify subgroups of patients with worse or better drug responses. The method generates a tree structure with binary splits, identifying the optimal cut point for a covariate at each node. The recursive procedure is continued to yield subsequent nodes that are more homogeneous (with respect to the response variable) than the original node. Gordon et al. (69) recently performed a pathway-based pharmacogenomic study among rectal cancer patients treated with chemoradiation, in which they evaluated 21 polymorphisms in 18 genes involved in critical pathways related to cancer progression (drug metabolism, tumor microenvironment, cell cycle control, DNA repair). Using CART analysis, they found that a classification tree with four genes (IL-8, ICAM-1, TGF-β, and FGFR4), as well as TNM classification, was predictive of tumor recurrence. A large prospective trial has been initiated to validate this preliminary finding (69).

The pathway-based polygenic approach is an extension of the candidate gene approach. It gives a more comprehensive picture of how candidate genes affect drug response. By evaluating the combined impact of multiple polymorphisms, it may be possible to identify modest associations that would not have been detected by analyzing single polymorphisms. By using data-mining and analysis tools, it may be possible to identify high-order gene-gene interactions and to provide clinically relevant predictive value based on distinct genotype profiles. However, there are some limitations. The approach is still limited by our current knowledge of the function of selected genes and polymorphisms.
By tallying the number of adverse alleles, the approach also assumes an equal weight for each allele, which may not be true for all genes and polymorphisms. Assigning which allele is unfavorable can be arbitrary, since there may be no functional data showing that one specific allele is functionally inferior; minor alleles do not necessarily result in reduced gene expression or protein function. Analyses using data-mining tools become more complex with increasing numbers of polymorphisms, and they require larger sample sizes. In addition, validating genotype profiles identified by post hoc data mining is more demanding, and their biological plausibility is difficult to assess experimentally.
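A minimal sketch of the two analytic ideas described above, allele counting and CART-style recursive partitioning, is given below on simulated data. The SNPs, effect sizes, and tree settings are all invented, and scikit-learn's DecisionTreeClassifier is used only as a convenient stand-in for CART; the published studies cited above used dedicated implementations.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Simulate 400 patients at six biallelic loci (0/1/2 unfavorable alleles),
# with response probability declining as the unfavorable-allele count grows.
rng = np.random.default_rng(1)
n, n_snps = 400, 6
geno = rng.binomial(2, 0.3, size=(n, n_snps))
score = geno.sum(axis=1)                        # summed unfavorable alleles
p_resp = 1 / (1 + np.exp(-(1.0 - 0.35 * score)))
response = rng.binomial(1, p_resp)

# (1) Allele-count analysis: response rate by summed score.
for s in np.unique(score):
    grp = score == s
    print(f"score={s}: n={grp.sum():3d}  response rate={response[grp].mean():.2f}")

# (2) CART-style recursive partitioning on the individual genotypes.
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=40, random_state=0)
tree.fit(geno, response)
print(export_text(tree, feature_names=[f"SNP{i+1}" for i in range(n_snps)]))
```

The printed tree shows binary splits yielding increasingly homogeneous nodes, mirroring the recursive partitioning described above; in real data, it is the discovered tree structure that then requires validation in an independent patient series.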
Whole-Genome Study

With rapid advances in genotyping technology and falling costs, coupled with the progress of the International HapMap Project, genomewide scanning has become possible in association studies. However, in pharmacogenetic studies the sample size is much smaller than what is typically available in disease-association studies. Therefore, a high-density whole-genome SNP array study in pharmacogenetics would require large international collaborations to assemble comparable patient populations, tumor characteristics, and treatment regimens. The more realistic, immediate genomewide approach (with fewer patients) may be to perform low-resolution genomic screening to reduce multiple testing and sample size requirements, followed by htSNP-based fine mapping to locate the specific loci that predict response. The genomewide scanning approach has several distinct advantages. It is a hypothesis-generating, discovery-driven approach. It gives a global assessment of each individual gene and may uncover novel genes and gene-gene interactions in complex genetic traits. The major limitations are the huge number of potential false positives, large sample size requirements, high cost, and the need for sophisticated data management and statistical analysis. Nonetheless, the application of genomewide scanning to treatment outcome studies may be more productive than its application to cancer risk, particularly if all patients received the same treatments. In studies of cancer risk, multiple environmental factors are likely to modify gene-disease associations; in treatment outcome studies, the exposure (chemotherapy) is known, can be quantitated, and, in a clinical trial setting, is the same for all patients. Thus, uncovering genes that modify associations of treatment with toxicity and/or recurrence may be more straightforward.

PERSPECTIVES

Over the past 50 years, pharmacogenetics has shifted from focusing on monogenic traits to studying complex polygenic traits. Hypothesis-driven candidate gene approaches have produced some striking examples of the role of pharmacogenetics in treatment outcomes, which have elicited strong interest and raised high expectations for the clinical application of pharmacogenetics and personalized therapy. However, despite the progress of exploratory research, the translation of pharmacogenomics from research to bedside has not been as rapid as hoped. A major issue for current pharmacogenetic studies is a lack of reproducibility and validation of modest effects. Inconsistent results will first need to be confirmed in larger retrospective studies, and results from those retrospective studies will then need to be verified in prospective studies. More importantly, it is apparent that for complex polygenic drug responses, no single genetic predictor has proven powerful and independent enough to support stand-alone testing for most chemotherapy regimens. For adverse drug reactions and toxicities, germline genetic polymorphisms play a prominent role; for drug efficacy, somatic genetic and biochemical changes may play a more significant role. The technologic revolution of the past decade has created an entirely new paradigm for future research in personalized medicine. In parallel with pharmacogenomics, pharmaco-transcriptomics, pharmaco-proteomics, and pharmaco-metabonomics have come into the field of personalized drug therapy (70–73).
Molecular profiling in the form of mRNA expression, tissue proteomic, or metabolomic measurements has produced potential signatures that can predict either drug efficacy or adverse reactions (73). However, the potential synergy between genotype data and these molecular profiling data remains largely unexplored in the context of predicting drug responses. The path to personalized medicine will ultimately require a multivariate analysis incorporating
germline genetic polymorphisms, genetic and molecular biomarkers in tumor tissues, and patients' clinical and epidemiologic profiles.

REFERENCES

1. Weinshilboum R. Inheritance and drug response. N Engl J Med 2003; 348(6):529–537.
2. Evans WE, McLeod HL. Pharmacogenomics—drug disposition, drug targets, and side effects. N Engl J Med 2003; 348(6):538–549.
3. Weinshilboum RM, Wang L. Pharmacogenetics and pharmacogenomics: development, science, and translation. Annu Rev Genomics Hum Genet 2006; 7:223–245.
4. Eichelbaum M, Ingelman-Sundberg M, Evans WE. Pharmacogenomics and individualized drug therapy. Annu Rev Med 2006; 57:119–137.
5. Vogel F. Moderne Probleme der Humangenetik. Ergeb Inn Med U Kinderheilk 1959; 12:52–125.
6. Streetman DS, Bertino JS Jr., Nafziger AN. Phenotyping of drug-metabolizing enzymes in adults: a review of in-vivo cytochrome P450 phenotyping probes. Pharmacogenetics 2000; 10(3):187–216.
7. Daly AK. Development of analytical technology in pharmacogenetic research. Naunyn Schmiedebergs Arch Pharmacol 2004; 369(1):133–140.
8. Fuhr U, Jetter A, Kirchheiner J. Appropriate phenotyping procedures for drug metabolizing enzymes and transporters in humans and their simultaneous use in the "cocktail" approach. Clin Pharmacol Ther 2007; 81(2):270–283.
9. Zeng-Rong N, Paterson J, Alpert L, et al. Elevated DNA repair capacity is associated with intrinsic resistance of lung cancer to chemotherapy. Cancer Res 1995; 55(21):4760–4764.
10. Bosken CH, Wei Q, Amos CI, et al. An analysis of DNA repair as a determinant of survival in patients with non-small-cell lung cancer. J Natl Cancer Inst 2002; 94(14):1091–1099.
11. Stephens JC, Schneider JA, Tanguay DA, et al. Haplotype variation and linkage disequilibrium in 313 human genes. Science 2001; 293(5529):489–493.
12. Rebbeck TR, Spitz M, Wu X. Assessing the function of genetic variants in candidate gene association studies. Nat Rev Genet 2004; 5(8):589–597.
13. Weinshilboum RM, Sladek SL. Mercaptopurine pharmacogenetics: monogenic inheritance of erythrocyte thiopurine methyltransferase activity. Am J Hum Genet 1980; 32(5):651–662.
14. Lennard L, Van Loon JA, Weinshilboum RM. Pharmacogenetics of acute azathioprine toxicity: relationship to thiopurine methyltransferase genetic polymorphism. Clin Pharmacol Ther 1989; 46(2):149–154.
15. Lennard L, Van Loon JA, Lilleyman JS, et al. Thiopurine pharmacogenetics in leukemia: correlation of erythrocyte thiopurine methyltransferase activity and 6-thioguanine nucleotide concentrations. Clin Pharmacol Ther 1987; 41(1):18–25.
16. Lennard L, Lilleyman JS, Van Loon JA, et al. Genetic variation in response to 6-mercaptopurine for childhood acute lymphoblastic leukemia. Lancet 1990; 336(8709):225–229.
17. Evans WE, Horner M, Chu YO, et al. Altered mercaptopurine metabolism, toxic effects and dosage requirement in a thiopurine methyltransferase-deficient child with acute lymphoblastic leukemia. J Pediatr 1991; 119(6):985–989.
18. Lennard L, Lewis IJ, Michelagnoli M, et al. Thiopurine methyltransferase deficiency in childhood lymphoblastic leukaemia: 6-mercaptopurine dosage strategies. Med Pediatr Oncol 1997; 29(4):252–255.
19. Relling MV, Hancock ML, Boyett JM, et al. Prognostic importance of 6-mercaptopurine dose intensity in acute lymphoblastic leukemia. Blood 1999; 93(9):2817–2823.
20. Relling MV, Hancock ML, Rivera GK, et al. Mercaptopurine therapy intolerance and heterozygosity at the thiopurine S-methyltransferase gene locus. J Natl Cancer Inst 1999; 91(23):2001–2008.
21. Schaeffeler E, Fischer C, Brockmeier D, et al. Comprehensive analysis of thiopurine S-methyltransferase phenotype–genotype correlation in a large population of German Caucasians and identification of novel TPMT variants. Pharmacogenetics 2004; 14(7):407–417.
22. Nagar S, Remmel RP. Uridine diphosphoglucuronosyltransferase pharmacogenetics and cancer. Oncogene 2006; 25(11):1659–1672.
23. Iyer L, Hall D, Das S, et al. Phenotype-genotype correlation of in vitro SN-38 (active metabolite of irinotecan) and bilirubin glucuronidation in human liver tissue with UGT1A1 promoter polymorphism. Clin Pharmacol Ther 1999; 65(5):576–582.
24. Beutler E, Gelbart T, Demina A. Racial variability in the UDP-glucuronosyltransferase 1 (UGT1A1) promoter: a balanced polymorphism for regulation of bilirubin metabolism? Proc Natl Acad Sci U S A 1998; 95(14):8170–8174.
25. Raijmakers MT, Jansen PL, Steegers EA, et al. Association of human liver bilirubin UDP-glucuronyltransferase activity with a polymorphism in the promoter region of the UGT1A1 gene. J Hepatol 2000; 33(3):348–351.
26. Ando Y, Saka H, Ando M, et al. Polymorphisms of UDP-glucuronosyltransferase gene and irinotecan toxicity: a pharmacogenetic analysis. Cancer Res 2000; 60(24):6921–6926.
27. Rouits E, Boisdron-Celle M, Dumont A, et al. Relevance of different UGT1A1 polymorphisms in irinotecan-induced toxicity: a molecular and clinical study of 75 patients. Clin Cancer Res 2004; 10(15):5151–5159.
28. Marcuello E, Altes A, Menoyo A, et al. UGT1A1 gene variations and irinotecan treatment in patients with metastatic colorectal cancer. Br J Cancer 2004; 91(4):678–682.
29. Innocenti F, Undevia SD, Iyer L, et al. Genetic variants in the UDP-glucuronosyltransferase 1A1 gene predict the risk of severe neutropenia of irinotecan. J Clin Oncol 2004; 22(8):1382–1388.
30. Toffoli G, Cecchin E, Corona G, et al. The role of UGT1A1*28 polymorphism in the pharmacodynamics and pharmacokinetics of irinotecan in patients with metastatic colorectal cancer. J Clin Oncol 2006; 24(19):3061–3068.
31. Han JY, Lim HS, Shin ES, et al. Comprehensive analysis of UGT1A polymorphisms predictive for pharmacokinetics and treatment outcome in patients with non-small-cell lung cancer treated with irinotecan and cisplatin. J Clin Oncol 2006; 24(15):2237–2244.
32. Marsh S. Thymidylate synthase pharmacogenetics. Invest New Drugs 2005; 23(6):533–537.
33. Pullarkat ST, Stoehlmacher J, Ghaderi V, et al. Thymidylate synthase gene polymorphism determines response and toxicity of 5-FU chemotherapy. Pharmacogenomics J 2001; 1(1):65–70.
34. Villafranca E, Okruzhnov Y, Dominguez MA, et al. Polymorphisms of the repeated sequences in the enhancer region of the thymidylate synthase gene promoter may predict downstaging after preoperative chemoradiation in rectal cancer. J Clin Oncol 2001; 19(6):1779–1786.
35. Marsh S, McLeod HL. Pharmacogenomics: from bedside to clinical practice. Hum Mol Genet 2006; 15(S1):R89–R93.
36. McLeod HL, Tan B, Malyapa R, et al. Genotype-guided neoadjuvant therapy for rectal cancer. Proc Am Soc Clin Oncol 2005; 23:197.
37. Stoehlmacher J, Park DJ, Zhang W, et al. A multivariate analysis of genomic polymorphisms: prediction of clinical outcome to 5-FU/oxaliplatin combination chemotherapy in refractory colorectal cancer. Br J Cancer 2004; 91(2):344–354.
38. Jakobsen A, Nielsen JN, Gyldenkerne N, et al. Thymidylate synthase and methylenetetrahydrofolate reductase gene polymorphism in normal tissue as predictors of fluorouracil sensitivity. J Clin Oncol 2005; 23(7):1365–1369.
39. Kawakami K, Watanabe G. Identification and functional analysis of single nucleotide polymorphism in the tandem repeat sequence of thymidylate synthase gene. Cancer Res 2003; 63(18):6004–6007.
40. Ruzzo A, Graziano F, Kawakami K, et al. Pharmacogenetic profiling and clinical outcome of patients with advanced gastric cancer treated with palliative chemotherapy. J Clin Oncol 2006; 24(12):1883–1891.
41. Mandola MV, Stoehlmacher J, Zhang W, et al. A 6 bp polymorphism in the thymidylate synthase gene causes message instability and is associated with decreased intratumoral TS mRNA levels. Pharmacogenetics 2004; 14(5):319–327.
42. Graziano F, Kawakami K. Studying the predictive/prognostic role of thymidylate synthase genotypes in patients with colorectal cancer: Is one polymorphism enough? J Clin Oncol 2005; 23(28):7230–7231.
43. Chazal M, Etienne MC, Renée N, et al. Link between dihydropyrimidine dehydrogenase activity in peripheral blood mononuclear cells and liver. Clin Cancer Res 1996; 2(3):507–510.
44. Diasio RB, Beavers TL, Carpenter JT. Familial deficiency of dihydropyrimidine dehydrogenase. Biochemical basis for familial pyrimidinemia and severe 5-fluorouracil-induced toxicity. J Clin Invest 1988; 81(1):47–51.
45. Ezzeldin H, Diasio R. Dihydropyrimidine dehydrogenase deficiency, a pharmacogenetic syndrome associated with potentially life-threatening toxicity following 5-fluorouracil administration. Clin Colorectal Cancer 2004; 4(3):181–189.
46. Etienne MC, Lagrange JL, Dassonville O, et al. Population study of dihydropyrimidine dehydrogenase in cancer patients. J Clin Oncol 1994; 12(11):2248–2253.
47. Ploylearmsaeng SA, Fuhr U, Jetter A. How may anticancer chemotherapy with fluorouracil be individualised? Clin Pharmacokinet 2006; 45(6):567–592.
48. Raida M, Schwabe W, Hausler P, et al. Prevalence of a common point mutation in the dihydropyrimidine dehydrogenase (DPD) gene within the 5'-splice donor site of intron 14 in patients with severe 5-fluorouracil (5-FU)-related toxicity compared with controls. Clin Cancer Res 2001; 7(9):2832–2839.
49. Collie-Duguid ES, Etienne MC, Milano G, et al. Known variant DPYD alleles do not explain DPD deficiency in cancer patients. Pharmacogenetics 2000; 10(3):217–223.
50. Wu X, Gu J, Wu TT, et al. Genetic variations in radiation and chemotherapy drug action pathways predict clinical outcomes in esophageal cancer. J Clin Oncol 2006; 24(23):3789–3798.
51. Cohen V, Panet-Raymond V, Sabbaghian N, et al. Methylenetetrahydrofolate reductase polymorphism in advanced colorectal cancer: a novel genomic predictor of clinical response to fluoropyrimidine-based chemotherapy. Clin Cancer Res 2003; 9(5):1611–1615.
52. Etienne MC, Formento JL, Chazal M, et al. Methylenetetrahydrofolate reductase gene polymorphisms and response to fluorouracil-based treatment in advanced colorectal cancer patients. Pharmacogenetics 2004; 14(12):785–792.
53. Park DJ, Stoehlmacher J, Zhang W, et al. A Xeroderma pigmentosum group D gene polymorphism predicts clinical outcome to platinum-based chemotherapy in patients with advanced colorectal cancer. Cancer Res 2001; 61(24):8654–8658.
54. Stoehlmacher J, Park DJ, Zhang W, et al. Association between glutathione S-transferase P1, T1, and M1 genetic polymorphism and survival of patients with metastatic colorectal cancer. J Natl Cancer Inst 2002; 94(12):936–942.
55. Gurubhagavatula S, Liu G, Park S, et al. XPD and XRCC1 genetic polymorphisms are prognostic factors in advanced non-small-cell lung cancer patients treated with platinum chemotherapy. J Clin Oncol 2004; 22(13):2594–2601.
56. Stoehlmacher J, Park DJ, Zhang W, et al. A multivariate analysis of genomic polymorphisms: prediction of clinical outcome to 5-FU/oxaliplatin combination chemotherapy in refractory colorectal cancer. Br J Cancer 2004; 91(2):344–354.
57. Goekkurt E, Hoehn S, Wolschke C, et al. Polymorphisms of glutathione S-transferases (GST) and thymidylate synthase (TS)—novel predictors for response and survival in gastric cancer patients. Br J Cancer 2006; 94(2):281–286.
58. Lynch TJ, Bell DW, Sordella R, et al. Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N Engl J Med 2004; 350(21):2129–2139.
59. Paez JG, Janne PA, Lee JC, et al. EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science 2004; 304(5676):1497–1500.
60. Pao W, Miller V, Zakowski M, et al. EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib. Proc Natl Acad Sci U S A 2004; 101(36):13306–13311.
61. Kobayashi S, Boggon TJ, Dayaram T, et al. EGFR mutation and resistance of non-small-cell lung cancer to gefitinib. N Engl J Med 2005; 352(8):786–792.
62. Pao W, Miller VA, Politi KA, et al. Acquired resistance of lung adenocarcinomas to gefitinib or erlotinib is associated with a second mutation in the EGFR kinase domain. PLoS Med 2005; 2(3):e73.
63. Cardon LR, Palmer LJ. Population stratification and spurious allelic association. Lancet 2003; 361(9357):598–604.
64. Rieder MJ, Reiner AP, Gage BF, et al. Effect of VKORC1 haplotypes on transcriptional regulation and warfarin dose. N Engl J Med 2005; 352(22):2285–2293.
65. Sconce EA, Khan TI, Wynne HA, et al. The impact of CYP2C9 and VKORC1 genetic polymorphism and patient characteristics upon warfarin dose requirements: proposal for a new dosing regimen. Blood 2005; 106(7):2329–2333.
66. Wadelius M, Chen LY, Downes K, et al. Common VKORC1 and GGCX polymorphisms associated with warfarin dose. Pharmacogenomics J 2005; 5(4):262–270.
67. Quintela-Fandino M, Hitt R, Medina PP, et al. DNA-repair gene polymorphisms predict favorable clinical outcome among patients with advanced squamous cell carcinoma of the head and neck treated with cisplatin-based induction chemotherapy. J Clin Oncol 2006; 24(26):4333–4339.
68. Sabbagh A, Darlu P. Data-mining methods as useful tools for predicting individual drug response: application to CYP2D6 data. Hum Hered 2006; 62(3):119–134.
69. Gordon MA, Gil J, Lu B, et al. Genomic profiling associated with recurrence in patients with rectal cancer treated with chemoradiation. Pharmacogenomics 2006; 7(1):67–88.
70. Wang Y. Gene expression-driven diagnostics and pharmacogenomics in cancer. Curr Opin Mol Ther 2005; 7(3):246–250.
71. Nebert DW, Vesell ES. Can personalized drug therapy be achieved? A closer look at pharmaco-metabonomics. Trends Pharmacol Sci 2006; 27(11):580–586.
72. Wulfkuhle JD, Edmiston KH, Liotta LA, et al. Technology insight: pharmacoproteomics for cancer—promises of patient-tailored medicine using protein microarrays. Nat Clin Pract Oncol 2006; 3(5):256–268.
73. Stoughton RB, Friend SH. How molecular profiling could revolutionize drug discovery. Nat Rev Drug Discov 2005; 4(4):345–350.
10
Human Genetic Variation and its Implications in Understanding "Race"/Ethnicity and Admixture

Jill Barnholtz-Sloan
Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, Ohio, U.S.A.
Indrani Halder Cardiovascular Behavioral Medicine Research Training Program, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, U.S.A.
Mark Shriver Departments of Anthropology and Genetics, The Pennsylvania State University, University Park, Pennsylvania, U.S.A.
HUMAN GENETIC VARIATION

The human genome consists of 3 billion nucleotides, which are 99.5% to 99.8% similar between any two individuals (1). The 0.2% to 0.5% of the genome (or 6–15 million nucleotides) that varies is believed to underlie the wide interindividual variation in normal traits and in disease risk. Most genetic variation is between individuals within a specific population, as opposed to between populations on different continents (2–4). Genetic variation within and between populations results from natural selection, mutation, genetic drift, and admixture. These evolutionary forces, along with inbreeding and nonrandom mating, have predictable effects on the levels of variation among and within populations. A summary of whether each of these components of evolution increases or decreases variation within and between populations is given in Table 1. Given that genetic drift can be highly variable in its effects on different genomic regions, and that we have limited knowledge of the action of natural selection and of the demographic and migration histories of human populations, it is important to have empirical data on genetic variation across the world's populations and genomic regions. Projects such as the International HapMap (5) have played a key role in addressing these needs and can serve as starting points in discussions of which of these evolutionary forces best explains observed patterns of human genetic variation.
Table 1  The Effect of the Different Forces of Evolution on Variation Within and Between Populations (an increase or decrease of variation within and/or between populations is shown for each force)

Evolutionary component            Within populations   Between populations
Inbreeding and nonrandom mating   Decrease             Increase
Genetic drift                     Decrease             Increase
Mutation                          Increase             Increase
Admixture                         Increase^a           Decrease
Selection
  Stabilizing                     Increase             Decrease
  Directional                     Decrease             Increase and decrease
  Disruptive                      Decrease             Increase

^a Admixture can at first increase variation but, if it continues, it has the potential to decrease the levels of variation among populations.
Mutation

Ultimately, all new variation begins with a mutation, a change in the sequence of the bases in the DNA. Mutations in the DNA sequence can take many forms, including nucleotide substitutions, insertions/deletions, changes in the numbers of repeat units (short tandem repeats), and translocations. Alternate forms of a DNA sequence are commonly called alleles, and if a variant appears in at least 1% of the population, it is generally referred to as a polymorphism. Most mutations do not affect the protein product; these are called neutral variants. Some mutations occur in non-protein-coding or RNA-coding regions, which may or may not have regulatory roles. Hence, the most interesting variation is in regions where the sequence is important for function, since these variants can affect phenotypes. Given our current level of understanding of the function of the genome, protein folding, and the regulation of cellular processes, it is difficult to predict the biochemical and physiological effects of most sequence variations. Because the rate of most types of mutation is very small, evolution over the short term (i.e., within species) is often thought of in terms of changes in allele frequencies.

Darwin and Natural Selection

The mechanism of evolution was proposed by a number of people in the early nineteenth century, but it was Charles Darwin (6) who seriously addressed this hypothesis. He proposed that the cause of evolution was natural selection in the presence of variation. Natural selection is the process by which heritable traits that are advantageous to the individual become more common over the generations. Darwin based this theory on three key observations: (1) when conditions allow individuals in a population to survive and reproduce, they will have more offspring than can possibly survive (population size can increase exponentially); (2) individuals vary in their ability to survive and reproduce; and (3) no two individuals are the same, owing to variability in inherited characteristics, and therefore they vary in their ability to reproduce and survive. From these observations he deduced the following: (1) there is competition for survival and successful reproduction; (2) heritable differences that are favorable will allow individuals to survive and reproduce more efficiently than individuals with unfavorable characteristics, i.e., elimination is selective; and (3) subsequent generations of a population will have a higher frequency of the favorable
variants or alleles than previous generations. An increase in these favorable alleles in the population leads to an increase in the favorable genotype(s), so that the population gradually changes and becomes better adapted to the environment. This is the basis of fitness: individuals with genotypes of greater fitness produce more offspring than those with less fit genotypes. The fitness of a genotype is directly related to the ability and success of transmission of its alleles to the next generation. Individuals are forced to compete for resources to stay alive and reproduce successfully; therefore, genotypes that are genetic determinants of favorable traits will become more common than less favorable genotypes. As a result, different genotypes will have different likelihoods of success. Relative genotype frequencies, group interactions, and environmental effects can complicate this likelihood. Sexual selection is a key contributor to the likelihood of success and can operate through direct competition between individuals of the same sex for a mate or through reproductive inclusion via the process of mate choice (7).

There are three basic models of natural selection: (1) stabilizing selection, which removes individuals who deviate too far from the average and maintains an optimal population, i.e., selection for the average individual; (2) directional selection, which favors individuals at one extreme of the population, i.e., selection for one of the extremes; and (3) disruptive selection, which favors individuals at both extremes of the population and can cause either the fixation of one extreme or the splitting of the population into two separate populations.

GENETIC DRIFT, ADMIXTURE, AND MIGRATION

Genetic drift is a change in allele frequency that results from chance differences in the transmission of alleles between generations. Because populations do not grow exponentially, some of the potential offspring of each mating couple will not exist, and thus alleles will be lost from the population. In the long term, this results in random changes (drift) in allele frequency away from the starting frequency. The effects of genetic drift are inversely proportional to population size, with larger populations showing less random change in allele frequency than smaller populations. Because very strong selection is required to affect the frequency of rare alleles, drift can be very important: it has a greater effect on the transmission of rare alleles than selection does. In small populations, drift can cause certain allele frequencies to be much larger or smaller than would likely occur in a large population. When a small, unrepresentative group forms a new colony, the resulting distortion of allele frequencies is called the founder effect. The Amish in the United States are a good example of this effect, because the roots of this population can be traced to a small number of immigrant families. A related phenomenon occurs when a population is reduced to a small number, possibly because of disease or famine, and later expands in size, establishing a new larger population. Such a population is said to have been through a population bottleneck and is generally distinguished from a founder effect in that there is no clear extant parental population. The fundamental parameter of genetic drift is the effective population size, Ne.
This is the size of an idealized population of randomly mating individuals, half male and half female, that would generate the same rate of allelic fixation as is observed in the real population of total size N (8). Admixture also causes evolution in populations, through the genetic fusion of populations that had previously been separated. Ethnographically defined populations generally show variation between each other since they, by definition, have been reproductively isolated and thus are each subject to allele frequency changes through the
actions of genetic drift, mutation, and selection. Through admixture or migration, these previously separated populations can be reunited and form new, sustainable populations. Admixture can result in a rapid increase in the level of genetic variation, causing a decrease in the proportion of homozygous individuals, known as the Wahlund effect (9). In human populations, the main effect of this fusion of populations is a decrease in the overall frequency of children born with genetic defects caused by homozygous recessive alleles that have high frequency in one of the mixing populations.

Inbreeding and Nonrandom Mating (Assortative Mating)

Inbreeding and other forms of nonrandom mating, or assortative mating, can also have a profound effect on variation within a population. Inbreeding occurs when genetically related individuals mate more frequently than would be expected in a randomly mating population. Inbreeding mainly causes departures from Hardy-Weinberg equilibrium (HWE), and, as a consequence of this departure from equilibrium, there is an increase in homozygotes. Inbreeding can cause the offspring of a mating pair to carry replicates of specific alleles present in the shared ancestor of the parents. Thus, inbred individuals may carry two copies of an allele at a locus that are identical by descent (IBD) from a common ancestor. The proportion IBD is the frequency with which two offspring share copies of the same parental (ancestral) allele. The amount of inbreeding in a population can be measured by comparing the actual proportion of heterozygous genotypes in the population in question to the proportion expected in a randomly mating population. Inbreeding alone cannot change allele frequencies, but it can change how alleles come together to form genotypes. Inbreeding in human populations can result in a much higher frequency of recessive disease homozygotes, since recessive disease alleles generally have low frequencies in humans. Inbreeding affects all genes, and it can bring out rare recessive disorders that might not have appeared had no inbreeding occurred.

Nonrandom mating, or assortative mating, occurs when mates are chosen on the basis of a certain phenotype; that is, mates are more similar (or dissimilar) to each other than would be expected by chance in a randomly mating population. In positive assortative mating, an individual chooses a mate who phenotypically resembles him or her. In negative assortative mating, an individual chooses a mate who is phenotypically very different. Assortative mating affects only the alleles that are responsible for the phenotypes influencing mate choice. The genetic variance of the trait associated with the mating increases with successive generations of assortative mating for that trait. In humans, positive assortative mating occurs for traits like intelligence (IQ score), height, and certain socioeconomic and lifestyle variables. Negative assortative mating is suspected to occur in humans for genes involved in immunity, such as the HLA complex. These forms of nonrandom mating can cause deviations from HWE expectations, which is especially important when the environment also presents important risk factors, as is the case in most complex diseases.
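The heterozygosity comparison described above reduces to a one-line calculation. Below is a minimal sketch using made-up genotype counts at a single biallelic locus; the quantity F is the inbreeding coefficient, with positive values indicating a deficit of heterozygotes relative to the Hardy-Weinberg expectation.

```python
# Inbreeding coefficient from observed vs. expected heterozygosity:
# F = 1 - H_obs / H_exp, where H_exp = 2pq under random mating.
# The genotype counts below are invented for illustration.
n_AA, n_Aa, n_aa = 360, 280, 160
n = n_AA + n_Aa + n_aa
p = (2 * n_AA + n_Aa) / (2 * n)     # frequency of allele A
h_obs = n_Aa / n                    # observed heterozygosity
h_exp = 2 * p * (1 - p)             # expected heterozygosity under HWE
F = 1 - h_obs / h_exp
print(f"p = {p:.3f}, H_obs = {h_obs:.3f}, H_exp = {h_exp:.3f}, F = {F:.3f}")
# Here F is about 0.25, a marked heterozygote deficit.
```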
Genetic Structure and Wright's FST

The genetic structure of a species is characterized by the number of populations within it, the frequencies of alleles in each population, and the degree of genetic differentiation among the populations. As summarized in Table 1, the evolutionary forces reviewed above will lead to either greater or lesser differentiation among the subpopulations of a species. Wright (10–12) showed that every population can be described at three levels of complexity: I, the individual; S, the various subpopulations within the total population; and T, the total population. In order to assess this population substructure and test for allelic correlation within subpopulations, Wright defined three measurements, called fixation indices, that have correlational interpretations for genetic structure and are functions of heterozygosity: FIT is the correlation of alleles within individuals over all subpopulations; FST is the correlation of alleles of different individuals in the same subpopulation; and FIS is the correlation of alleles within individuals within one subpopulation. Cockerham (13,14) later derived analogous measures for these three fixation indices, which he called the overall inbreeding coefficient F, the coancestry θ, and f, respectively. The degree of genetic isolation of one subpopulation from another can be measured by genetic distance, which can be interpreted as the time since the subpopulations diverged from their original ancestral population.
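As a concrete illustration of FST, the sketch below uses the heterozygosity-based form FST = (HT − HS)/HT at a single biallelic locus. The allele frequencies are hypothetical; sample-size-corrected estimators such as Cockerham's differ in detail.

```python
import numpy as np

def fst(subpop_allele_freqs, subpop_sizes=None):
    """Wright's FST for a biallelic locus: FST = (HT - HS) / HT, where
    HS is the (weighted) mean within-subpopulation expected
    heterozygosity and HT is the expected heterozygosity computed
    from the pooled allele frequency."""
    p = np.asarray(subpop_allele_freqs, dtype=float)
    w = (np.ones_like(p) if subpop_sizes is None
         else np.asarray(subpop_sizes, dtype=float))
    w = w / w.sum()
    hs = np.sum(w * 2 * p * (1 - p))   # mean within-subpopulation heterozygosity
    p_bar = np.sum(w * p)              # pooled allele frequency
    ht = 2 * p_bar * (1 - p_bar)       # total expected heterozygosity
    return (ht - hs) / ht

# Two equally sized subpopulations with divergent allele frequencies:
print(fst([0.2, 0.8]))  # 0.36: strong differentiation
```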
DEFINITIONS OF "RACE" AND ETHNICITY

The concept of "race"/ethnicity is a point of contention for many, given that these categorizations have historically been used in the mistreatment of certain groups of people. Historically, "racial" groups have been defined by commonality of physical characteristics, such as skin color, hair color and texture, facial features, and various other physical attributes. Linnaeus's Systema Naturae (15) defined four "racial" groups: Europeanus, Asiaticus, Americanus, and Africanus. These four "racial" categories were later refined into five: Caucasian, Mongolian, Ethiopian, American, and Malay. Some of these "racial" categories are still used today when describing populations around the world (16), particularly by the United States Census Bureau (17), whose categories are in turn used in genetic studies to classify individuals. However, definitions of "race"/ethnic groups have continued to change over time and differ in different parts of the world (i.e., clinal variation) (18,19).

In most medical research studies, "race"/ethnicity is self-reported by the study subject. This variable, while allowing classification of individuals into large, nonhomogeneous groups for ease of data analysis, gives no indication of a person's ancestral background (20). For example, in one study approximately 20% of individuals from Washington, DC, who self-reported their "race" as African American showed a low African genetic ancestral component and a very high European or Native American genetic ancestral component to their individual admixture (21). The same study also showed that there is substantial variation in skin color within these groups and that, although the correlation between skin color and genetic ancestry is generally low, it is usually highly statistically significant (22). The use of self-reported "race" as a proxy for ancestral background is even more problematic for Hispanics, whose ancestral backgrounds vary widely by country of birth: Puerto Ricans have an average admixture of 37% African, 45% European, and 18% Native American, while Mexican Americans have an average admixture of 8% African, 61% European, and 31% Native American (23). Using genetic markers to classify individuals into one "racial"/ethnic group versus another can be quite complicated, given that individuals are mixtures of many different ancestral populations and/or "races"; more than 2.5% of the United States population reported belonging to more than one "racial"/ethnic group in the last census (24).
HARDY-WEINBERG AND LINKAGE EQUILIBRIUM

Most statistical analyses of genetic data rely on two important stipulations: statistical independence of alleles within a locus and statistical independence of alleles between loci. The first is HWE within loci. Independently in 1908, G. H. Hardy (25) and
W. Weinberg (26,27) published work showing that the frequencies of particular genotypes in a sexually reproducing diploid population reach equilibrium after one generation of random mating and fertilization with no selection, mutation, or migration, assuming a very large population size and nonoverlapping generations. They further showed that the genotype frequencies in the "equilibrium" population are simple products of the allele frequencies. Note that even if HWE holds in each subpopulation, the genotype frequencies of an admixed population will most likely deviate from HWE (28).

The second stipulation is independence of alleles between loci, or linkage equilibrium (LE) between loci. LE is a state of random gametic combination of the alleles of different genes, in which alleles at multiple loci occur together on gametes no more or less often than expected by chance. One major cause of linkage disequilibrium (LD) is subdivision of populations (i.e., population admixture). In randomly mating populations with no selection, LD is reduced in every generation at a rate that depends on the recombination fraction between loci and the number of generations of mating. The approach to LE is further retarded in subdivided populations whose members do not freely intermix (29).

Hence, equilibrium refers to the concept that genotypic proportions in a population do not change from generation to generation. This equilibrium will remain constant unless the frequencies of alleles in the population are disrupted. Distorting effects include selection, migration, nonrandom (assortative) mating and inbreeding, population substructure or admixture, and mutation or genetic drift [e.g., (28,30)]. Commonly, a greater than expected number of associations between unlinked markers (i.e., LD), together with a greater than expected number of violations of HWE at those markers, is indicative of recent admixture and/or population stratification (PS) (31). One further generation of random mating restores HWE in a population, while restoration of LE requires four or more generations (29). Therefore, testing for HWE and LD in potentially admixed populations is important because of the possible effects on inference of study results.

The recent completion of the human genome sequence (32) and the HapMap project (33) has renewed interest in LD mapping and admixture LD (ALD) mapping, which were proposed many years ago (34,35). LD mapping is based on the premise that regions adjacent to a mutation in a putative disease gene are transmitted through generations along with the mutation because of the presence of strong LD; the idea is to exploit LD between the putative disease locus and single nucleotide polymorphisms (SNPs), which are abundant throughout the genome. ALD mapping makes the additional assumption that differences in ancestral allele frequencies reflect differences in LD patterns, and so exploits not only the LD relationship but also ancestral differences between individuals within and between "racial"/ethnic groups. This strategy is useful for localizing genes for complex traits, which show non-Mendelian patterns of inheritance and are most likely affected by multiple genes acting together and/or by environmental factors.
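To make the HWE stipulation concrete, the sketch below implements the usual one-degree-of-freedom chi-square goodness-of-fit test at a biallelic locus. The counts are hypothetical; for small expected counts an exact test would be preferred.

```python
from scipy.stats import chi2

def hwe_chisq_test(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test of Hardy-Weinberg proportions
    at a biallelic locus (1 df: 3 classes - 1 - 1 estimated allele
    frequency)."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_AA, n_Aa, n_aa]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, chi2.sf(stat, df=1)

# Hypothetical genotype counts:
stat, pval = hwe_chisq_test(n_AA=489, n_Aa=430, n_aa=81)
print(f"chi-square = {stat:.2f}, p = {pval:.3f}")
```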
LD mapping uses population-based samples, such as case-control designs, instead of family-based samples, and provides greater statistical power for detection of genes for complex traits. However, LD mapping has a few disadvantages. For example, (1) when the extent of LD in population data is small, a very dense set of SNPs and a large number of cases and controls are necessary to have reasonable power to detect the gene of interest (i.e., 10,000 or more SNPs genotyped on at least 1000 cases and 1000 controls); and (2) when LD is a result of admixture, additional markers must be genotyped to adjust for the admixture in order to avoid false results, whether false positive or false negative, in the association analysis (36–38).
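The LD exploited by these mapping strategies is typically quantified by the coefficient D and its normalized versions D′ and r². A minimal sketch, assuming known haplotype and allele frequencies (the numbers are hypothetical):

```python
def ld_measures(pA, pB, pAB):
    """Pairwise LD between alleles A and B given the haplotype
    frequency pAB and marginal allele frequencies pA and pB.
    D = pAB - pA*pB; D' scales |D| by its maximum possible value;
    r^2 is the squared correlation between the two loci."""
    D = pAB - pA * pB
    if D >= 0:
        Dmax = min(pA * (1 - pB), (1 - pA) * pB)
    else:
        Dmax = min(pA * pB, (1 - pA) * (1 - pB))
    d_prime = abs(D) / Dmax
    r2 = D ** 2 / (pA * (1 - pA) * pB * (1 - pB))
    return D, d_prime, r2

# Under linkage equilibrium pAB = pA*pB, so all three measures are 0:
print(ld_measures(pA=0.3, pB=0.4, pAB=0.12))  # (0.0, 0.0, 0.0)
# Excess co-occurrence of the two alleles gives positive LD:
print(ld_measures(pA=0.3, pB=0.4, pAB=0.20))  # D=0.08, D'~0.44, r2~0.13
```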
PS AND ANCESTRY

PS, or admixture stratification, refers to the existence of variation in genetic ancestry within a single "race"/ethnicity group. PS is present not only in recently admixed populations like African Americans and Latinos (39–41) but also in European-American populations (42–45) and in historically isolated populations, including Icelanders (46). One consequence of PS is the potential for increased allelic associations (i.e., LD) and deviations from HWE (31). Another consequence of PS is bias in the estimates of genetic associations, which can lead to incorrect inferences as well as inconsistency across reports. In order for bias due to PS to exist, both of the following must be true: (1) the frequency of the marker genotype of interest varies significantly by "race"/ethnicity, and (2) the background disease prevalence varies significantly by "race"/ethnicity. If either condition is not fulfilled, bias due to PS cannot occur. Bias due to PS can produce both false-positive and false-negative associations. In some studies, this bias has been shown to be small in magnitude (47–49) and bounded by the magnitude of the difference in background disease rates across the populations being compared (50). Simulation studies have shown that the adverse effects of PS increase with increasing sample size (51,52). An unresolved question is how large the difference in disease rates or genotype frequencies must be for "meaningful" bias to arise.

Skin color is a good example of confounding due to PS. For most physical human traits, the majority of the variation is found within a population, not between populations. Skin color is the opposite of this pattern: about 10% of the variation occurs within a group and 90% occurs between groups (53). This shift in the distribution of variation indicates that skin color has been under significant selective pressure over time. In general, darker skin is associated with nearness to the equator, and lighter skin with higher latitudes (54).

When "race"/ethnicity can be accurately described in terms of actual ancestry and there is ancestral homogeneity in a study population, standard epidemiologic approaches of matching or statistical adjustment by "race"/ethnicity may be sufficient to remove or reduce bias due to PS. Controlling for self-reported "race" has generally been thought to suffice (55); however, self-reported "race"/ethnicity and/or ancestry can be quite unreliable. Burnett et al. (56) showed that only 49% to 68% of non-Hispanic European American siblings agreed on their ancestry. Recent data show that matching on ancestry is more robust, although in many populations, whether recently admixed or not, individuals are not aware of their precise ancestry (56,57).

No true consensus has been reached on how to test and/or adjust for PS (48,58), although many methods have been developed (36–38,59,60). Genomic control (36,60) and structured association (37,38,59) are the two techniques most commonly used to control and adjust for possible PS in association studies. Genomic control uses a set of noncandidate, unlinked loci to estimate an inflation factor, λ, attributable to the PS present, and then corrects the standard χ² association test statistic by this factor. This method considers group-level PS only and can help to protect against false-positive associations, but not against false-negative associations.
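A minimal sketch of the genomic control correction follows. The null-locus statistics here are simulated and hypothetical; in a real application λ would be estimated from genotyped, unlinked, noncandidate loci.

```python
import numpy as np
from scipy.stats import chi2

def genomic_control(candidate_stats, null_locus_stats):
    """Estimate the inflation factor lambda from 1-df chi-square
    statistics at unlinked, noncandidate loci, then deflate the
    candidate test statistics by lambda."""
    # Under no stratification the median 1-df chi-square is ~0.455;
    # lambda is conventionally floored at 1 (no deflation of clean data).
    lam = max(1.0, np.median(null_locus_stats) / chi2.ppf(0.5, df=1))
    corrected = np.asarray(candidate_stats) / lam
    pvals = chi2.sf(corrected, df=1)
    return lam, corrected, pvals

rng = np.random.default_rng(0)
null_stats = 1.3 * rng.chisquare(df=1, size=5000)   # inflated null statistics
lam, corr, p = genomic_control([10.8, 3.2], null_stats)
print(f"lambda ~ {lam:.2f}")  # ~1.3
```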
The structured association method uses Bayesian techniques to assign individuals to "clusters" or subpopulation classes using information from a set of noncandidate, unlinked loci, and then tests for association within each "cluster" or subpopulation class. This method therefore considers both individual- and group-level PS. Another technique involves the estimation of ancestral proportions, at the individual or group level, through the genotyping of ancestry informative markers (AIMs), which are markers that show large allele frequency differences between ancestral populations and are found throughout the genome (42,61–63). These estimates of either individual
or group-specific ancestry can then be used to delineate associations between genetic variants and traits of interest, using genetic ancestry instead of "race"/ethnicity to measure stratification within a study sample (64–68).

We demonstrate PS in three U.S. populations in Figures 1 and 2. The three populations studied include unrelated individuals who self-identified as European American (N = 41), African American (N = 42), and Puerto Rican (N = 20) (69). The triangle plot shows the variation in individual admixture estimates within each "racial"/ethnic group as well as the overlap in admixture proportions between "racial"/ethnic groups. It is clear from these plots that among African Americans, who are predominantly of mixed European and African ancestry, some individuals are genomically more similar to European Americans than to other African Americans. Similar inference can be drawn for self-identified Puerto Ricans, who exhibit significant European and Native American admixture. Even within self-identified European Americans, some individuals have slightly higher Native American ancestry, in some cases equivalent to the proportion of Native American admixture observed in some self-identified Puerto Ricans. However, the existence of variation in admixture proportions need not lead to strictly discernible PS, as in the European American sample shown in Figures 1 and 2.

AIMS AND ANCESTRY ESTIMATION

Estimation of genetic ancestry can be achieved through the genotyping of AIMs. AIMs are unlinked markers found throughout the genome that show large allele frequency differences between the relevant ancestral populations (42,61–63).
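Because AIMs are defined by large between-population allele frequency differences, screening candidate markers reduces to computing the differential δ per marker. A minimal sketch with hypothetical allele frequencies; the δ ≥ 0.6 cutoff anticipates the guidance discussed below.

```python
import numpy as np

def delta(freq_pop1, freq_pop2):
    """Allele frequency differential (delta) of candidate AIMs: the
    absolute per-marker difference in allele frequency between two
    ancestral populations."""
    return np.abs(np.asarray(freq_pop1) - np.asarray(freq_pop2))

# Hypothetical candidate markers in two parental populations:
f_pop1 = np.array([0.90, 0.50, 0.75, 0.10])
f_pop2 = np.array([0.15, 0.45, 0.05, 0.85])
d = delta(f_pop1, f_pop2)
print(d)                      # [0.75 0.05 0.70 0.75]
print(np.where(d >= 0.6)[0])  # indices of usable AIMs: [0 2 3]
```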
Figure 1  Plot of MLE individual admixture estimates of 41 European Americans (black dots), 42 African Americans (open circles), and 20 Puerto Ricans (grey triangles), all self-identified. Affymetrix 10K chip genotypes were used to infer individual admixture estimates. Each point represents one or more individuals with the same admixture proportions. The vertices represent 100% ancestry from each of the named groups; an individual at a vertex has 100% ancestry from that population. All individuals plotting along an axis have admixture from the two populations that bound that line. All other individuals have some proportion of admixture from each of the three populations identified. Abbreviation: MLE, maximum likelihood.
Figure 2 Plot of individual admixture estimates obtained with STRUCTURE of 41 European Americans (black dots), 42 African Americans (open circles) and 20 Puerto Ricans (grey triangles), all self-identified (same individuals as in Fig. 1).
Simulation studies show that anywhere from 50 to 100 AIMs are needed to accurately assign individual ancestry; fewer markers (~40 AIMs) are needed when the average allele frequency difference between ancestral populations (denoted by δ) of the panel of markers is 0.6 or above (16,39,70). The utility of individual genetic ancestry estimates for understanding complex disease risk has recently been shown for asthma (39,40), cardiovascular disease-related phenotypes (71), insulin-related phenotypes (72), and early-onset lung cancer (67). Wilson et al. observed that the frequency of risk genotypes in six drug-metabolizing genes varied by genetically defined ancestry and that self-reported "race"/ethnicity was an insufficient and inaccurate representation of these ancestral clusters (73).

Determination of the best method for estimating individual ancestry remains unresolved. The two most commonly used methods are maximum likelihood estimation (MLE) (74,75) and structured association clustering techniques as implemented in STRUCTURE (37,38,59); a minimal MLE sketch is given below. Although these methods have been shown to be comparable in terms of accuracy in some studies (37,70,76), their validity depends on the informativeness of the panel of AIMs being used as well as the availability of allele and genotype frequency data. In our example using unrelated individuals who self-identified as European American, African American, and Puerto Rican, the ancestry estimates in Figure 1 were obtained using MLE, while those in Figure 2 were obtained using STRUCTURE (69). The same ancestral allele frequencies/genotypes were used for both calculations. Although the two methods agree in general, the absolute values of the estimates differ: the MLE estimates have larger variance, while the STRUCTURE estimates are more tightly clustered. In addition, the STRUCTURE estimates fail to achieve absolute contributions (i.e., 100% admixture) from any one population. The extent to which this difference translates into discrepancies in downstream results when these individual admixture estimates are used remains untested.

Today there are several AIM panels to choose from (Table 2) (33,41,45,61–67,77,78,81–91). Most of these panels consist of SNPs, although some include microsatellites.
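The MLE approach can be sketched in a few lines: treating each AIM genotype as binomial given the individual's admixed allele frequency Σk qk fk, one maximizes the likelihood over the admixture proportions q on the simplex. This is a simplified illustration with simulated, hypothetical data; it is not the estimator of references (74,75) or the model fit by STRUCTURE, and it assumes HWE given the admixed allele frequency.

```python
import numpy as np
from scipy.optimize import minimize

def admixture_mle(genotypes, ancestral_freqs):
    """MLE of individual admixture proportions from unlinked biallelic
    AIMs. genotypes: 0/1/2 minor-allele counts per AIM.
    ancestral_freqs: (n_aims, K) minor-allele frequencies in each of
    K assumed parental populations."""
    g = np.asarray(genotypes)
    f = np.asarray(ancestral_freqs)
    K = f.shape[1]

    def neg_loglik(z):
        q = np.exp(z - z.max()); q /= q.sum()   # softmax keeps q on the simplex
        p = np.clip(f @ q, 1e-9, 1 - 1e-9)      # admixed allele frequency per AIM
        return -np.sum(g * np.log(p) + (2 - g) * np.log(1 - p))

    res = minimize(neg_loglik, np.zeros(K), method="Nelder-Mead")
    q = np.exp(res.x - res.x.max()); q /= q.sum()
    return q

# Hypothetical data: 100 AIMs, two parental populations.
rng = np.random.default_rng(1)
freqs = np.column_stack([rng.uniform(0.05, 0.25, 100),
                         rng.uniform(0.75, 0.95, 100)])
true_q = np.array([0.7, 0.3])
geno = rng.binomial(2, freqs @ true_q)
print(admixture_mle(geno, freqs))  # roughly [0.7, 0.3]
```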
Table 2  Published Genome-Wide Panels of AIMs Appropriate for Ancestry Analyses

| Reference | Type of markers | Population(s) studied | Total no. of individuals genotyped | Number of AIMs | Web site |
| Shriver et al., 1997 (62) and Parra et al., 1998 (81) | STRs | European American, African American, Hispanic | DNA pooling used | ~75–100 | dbSNP database (http://www.ncbi.nlm.nih.gov/SNP), keyword: PSUANTH |
| Smith et al., 2001 (61) | Microsatellites and diallelic insertions/deletions | African, Jamaican, African American, Hispanic, European American | >1000 | 744 | Laboratory of Genomic Diversity (http://lgd.nci.nih.gov) |
| Collins-Schramm et al., 2002 (77,78) | SNPs | European American, Mexican American, African American, Amerindian, Asian, African | 175 | 151 for Mexican American and 97 for African American | UC Davis, Rowe Program (http://roweprogram.ucdavis.edu) |
| Smith et al., 2004 (82) | SNPs and diallelic insertions/deletions | European American, African American, African, Chinese, Amerindian | >300 | 3011 | Laboratory of Genomic Diversity (http://lgd.nci.nih.gov); UCSC Human Genome Project Center (http://genome.ucsc.edu) |
| Collins-Schramm et al., 2004 (83) | SNPs | European American, Mexican American, Japanese, Amerindian | 123 | >500 | UC Davis, Rowe Program (http://roweprogram.ucdavis.edu); The SNP Consortium Allele Frequency Project (http://snp.cshl.org) |
| Hinds et al., 2005 (84) | SNPs | European American, African American, Asian American | 71 | 1,586,383 | Perlegen Genome Browser (http://www.hapmap.org/cgi-perl/gbrowse/gbrowse); haplotype data (http://research.calit2.net/hap/wgha) |
| Miller et al., 2005 (85) | SNPs | European American, African American, Asian | 85 | 877,351 polymorphic in all 3 groups; 75,997 monomorphic across all 3 groups | The SNP Consortium Allele Frequency Project (http://snp.cshl.org) |
| Altshuler et al. and The International HapMap Consortium, 2005 (33) | SNPs | African, European (CEPH), Chinese, Japanese | 269 | 1410 | The HapMap Project (http://www.hapmap.org) |
| Shriver et al., 2005 (86) | SNPs | 12 worldwide population samples | 203 | 11,555 | Shriver Laboratory (http://www.anthro.psu.edu/biolab/index.html) |
| Seldin et al., 2006 (44) | SNPs | 6 European populations, European American, Ashkenazi Jewish, Asian American, African American, Amerindian | >1000 | 400–800 | UC Davis, Rowe Program (http://roweprogram.ucdavis.edu) |
| Tian et al., 2006 (87) | SNPs | European American, CEPH Europeans, West African (including Yorubans), African American | >300 | >4000 | UC Davis, Rowe Program (http://roweprogram.ucdavis.edu) |
| Tian et al., 2007 (88) | SNPs | European American, 4 Amerindian populations, West African, Japanese, Chinese | >700 | >8000 | UC Davis, Rowe Program (http://roweprogram.ucdavis.edu) |
| Price et al., 2007 (89) | SNPs | African, European, Native American (North and South America) | >4000 | >4100 | Reich Laboratory (http://genepath.med.harvard.edu/~reich/) |
| Mao et al., 2007 (90) | SNPs | 5 different Amerindian populations, European American, Japanese, Chinese, Latino | >700 | >2000 | Shriver Laboratory (http://www.anthro.psu.edu/biolab/euroaims.pc1.xls) |
| Bauchet et al., 2007 (45) | SNPs | European Americans, 21 European and worldwide populations | 297 | 1200 | Shriver Laboratory (http://www.anthro.psu.edu/biolab/euroaims.pc1.xls) |
| Price et al., 2008 (91) | SNPs | European Americans | >300 | 300 | Reich Laboratory (http://genepath.med.harvard.edu/~reich/) |

Abbreviations: AIMs, ancestry informative markers; SNP, single nucleotide polymorphism; STRs, short tandem repeats; CEPH, Centre d'Etude du Polymorphisme Humain; dbSNP, Single Nucleotide Polymorphism database.
The choice of markers depends on each marker's informativeness for ancestry, which depends on the value of δ (61,62,77,78) and can also depend on other population variables (79), such as the relative proportional contributions from each of the parental populations (80). A practical understanding of the immigration and migration history of the study population is critical for accurately selecting an appropriate panel of AIMs. Knowledge of this history is also critical for establishing the analytical models, in terms of how many and which ancestral parental populations should be stipulated for robust ancestry estimation. Not all AIMs panels are equivalent. For example, an AIMs panel assembled for Mexican Americans, in whom the level of African ancestry is quite low, might well be inappropriate for Puerto Ricans, who have a higher African contribution.

Thus, estimation of ancestral proportions is highly dependent on (1) knowledge of the parental populations, (2) the choice of markers for ancestry estimation (i.e., their informativeness for ancestry analyses), (3) estimation of the parental allele frequencies, (4) the method of ancestry estimation, and (5) the level of PS in the admixed population. Applying generic AIM sets developed in one population to a different population may be suboptimal. Therefore, we recommend that AIMs for a specific study be chosen using a combination of the following information: (1) a δ value of 0.6 or higher; (2) a measure of informativeness (79,80) calculated for multiple possible combinations of ancestral proportions, prioritizing markers that are informative across multiple different ancestral proportion combinations; and (3) knowledge of immigration/migration patterns in the region from which the study population was drawn, which should inform the choice and number of ancestral parental populations.

Our knowledge of human genetic variation and its role in the risk of complex disease will continue to expand as new genome data emerge and new data analysis techniques are developed.

REFERENCES

1. Jobling MA, Hurles ME, Tyler-Smith C. Human Evolutionary Genetics: Origins, Peoples and Disease. New York, NY: Garland Science, 2004.
2. Barbujani G, Magagni A, Minch E, et al. An apportionment of human DNA diversity. Proc Natl Acad Sci U S A 1997; 94(9):4516–4519.
3. Jorde LB, Watkins WS, Bamshad MJ, et al. The distribution of human genetic diversity: a comparison of mitochondrial, autosomal, and Y-chromosome data. Am J Hum Genet 2000; 66(3):979–988.
4. Nei M. Genetic distance between populations. Am Nat 1972; 106:283–292.
5. Available at: http://www.hapmap.org
6. Darwin C. On the Origin of Species. London, UK: Murray, 1859.
7. Darwin C. The Descent of Man and Selection in Relation to Sex. New York, NY: D. Appleton and Company, 1871.
8. Weiss KM. Genetic Variation and Human Disease: Principles and Evolutionary Approaches. Cambridge, UK: Cambridge University Press, 1993.
9. Wahlund S. Zusammensetzung von Populationen und Korrelationserscheinungen vom Standpunkt der Vererbungslehre aus betrachtet. Hereditas 1928; 11:65–106.
10. Wright S. The genetic structure of populations. Ann Eugen 1951; 15:323–354.
11. Wright S. Evolution in mendelian populations. Genetics 1931; 16:97–159.
12. Wright S. Isolation by genetic distance. Genetics 1943; 28:114–138.
13. Cockerham CC. Analyses of gene frequencies. Genetics 1973; 74(4):679–700.
14. Cockerham CC. Analyses of gene frequencies of mates. Genetics 1973; 74(4):701–712.
15. Linnaeus C.
Systema Naturae (The System of Nature). Stockholm, Sweden: Laurentii Salvii, 1758.
16. Risch N, Burchard E, Ziv E, et al. Categorization of humans in biomedical research: genes, race and disease. Genome Biol 2002; 3(7):comment2007.
17. Available at: http://www.census.gov
18. Jacobson MF. Whiteness of a Different Color: European Immigrants and the Alchemy of Race. Cambridge, MA: Harvard University Press, 1998.
19. Snowden FM. Before Color Prejudice: The Ancient View of Blacks. Cambridge, MA: Harvard University Press, 1983.
20. Helgadottir A, Manolescu A, Helgason A, et al. A variant of the gene encoding leukotriene A4 hydrolase confers ethnicity-specific risk of myocardial infarction. Nat Genet 2006; 38(1):68–74.
21. Shriver MD, Parra EJ, Dios S, et al. Skin pigmentation, biogeographical ancestry and admixture mapping. Hum Genet 2003; 112(4):387–399.
22. Parra EJ, Kittles RA, Shriver MD. Implications of correlations between skin color and genetic ancestry for biomedical research. Nat Genet 2004; 36(11 suppl):S54–S60.
23. Hanis CL, Hewett-Emmett D, Bertin TK, et al. Origins of U.S. Hispanics. Implications for diabetes. Diabetes Care 1991; 14(7):618–627.
24. United States Census 2000: The Hispanic Population. Census 2000 Brief, 2001.
25. Hardy GH. Mendelian proportions in a mixed population. Science 1908; 28:449–450.
26. Weinberg W. Über den Nachweis der Vererbung beim Menschen. Jh Verein f vaterl Naturk in Württemberg 1908; 64:368–382.
27. Weinberg W. Über Vererbungsgesetze beim Menschen. Ztschr Abst u Vererb 1909; 1:277–330.
28. Nei M. Molecular Evolutionary Genetics. New York, NY: Columbia University Press, 1987.
29. Nei M, Li WH. Linkage disequilibrium in subdivided populations. Genetics 1973; 75(1):213–219.
30. Li CC. Population Genetics. 1st ed. Chicago, IL: The University of Chicago Press, 1955.
31. Chakraborty R, Weiss KM. Admixture as a tool for finding linked genes and detecting that difference from allelic association between loci. Proc Natl Acad Sci U S A 1988; 85(23):9119–9123.
32. Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature 2001; 409(6822):860–921.
33. Altshuler D, Brooks LD, Chakravarti A, et al. A haplotype map of the human genome. Nature 2005; 437(7063):1299–1320.
34. Morton NE. Linkage disequilibrium maps and association mapping. J Clin Invest 2005; 115(6):1425–1430.
35. Collins A, Morton NE. Mapping a disease locus by allelic association. Proc Natl Acad Sci U S A 1998; 95(4):1741–1745.
36. Devlin B, Roeder K. Genomic control for association studies. Biometrics 1999; 55(4):997–1004.
37. Pritchard JK, Rosenberg NA. Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet 1999; 65(1):220–228.
38. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics 2000; 155(2):945–959.
39. Choudhry S, Coyle NE, Tang H, et al. Population stratification confounds genetic association studies among Latinos. Hum Genet 2006; 118(5):652–664.
40. Salari K, Choudhry S, Tang H, et al. Genetic admixture and asthma-related phenotypes in Mexican American and Puerto Rican asthmatics. Genet Epidemiol 2005; 29(1):76–86.
41. Hanis CL, Chakraborty R, Ferrell RE, et al. Individual admixture estimates: disease associations and individual risk of diabetes and gallbladder disease among Mexican-Americans in Starr County, Texas. Am J Phys Anthropol 1986; 70(4):433–441.
42. Shriver MD, Mei R, Parra EJ, et al. Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation.
Hum Genomics 2005; 2(2):81–89.
43. Campbell CD, Ogburn EL, Lunetta KL, et al. Demonstrating stratification in a European American population. Nat Genet 2005; 37(8):868–872.
44. Seldin MF, Shigeta R, Villoslada P, et al. European population substructure: clustering of northern and southern populations. PLoS Genet 2006; 2(9):e143.
45. Bauchet M, McEvoy B, Pearson LN, et al. Measuring European population stratification using microarray genotype data. Am J Hum Genet 2007; 80:948–956.
46. Helgason A, Yngvadottir B, Hrafnkelsson B, et al. An Icelandic example of the impact of population structure on association studies. Nat Genet 2005; 37(1):90–95.
47. Wacholder S, Rothman N, Caporaso N. Population stratification in epidemiologic studies of common genetic variants and cancer: quantification of bias. J Natl Cancer Inst 2000; 92(14):1151–1158.
48. Wacholder S, Rothman N, Caporaso N. Counterpoint: bias from population stratification is not a major threat to the validity of conclusions from epidemiological studies of common polymorphisms and cancer. Cancer Epidemiol Biomarkers Prev 2002; 11(6):513–520.
49. Wang Y, Localio R, Rebbeck TR. Evaluating bias due to population stratification in case-control association studies of admixed populations. Genet Epidemiol 2004; 27(1):14–20.
50. Wang Y, Localio R, Rebbeck TR. Evaluating bias due to population stratification in epidemiologic studies of gene-gene or gene-environment interactions. Cancer Epidemiol Biomarkers Prev 2006; 15(1):124–132.
51. Marchini J, Cardon LR, Phillips MS, et al. The effects of human population structure on large genetic association studies. Nat Genet 2004; 36(5):512–517.
52. Reich DE, Goldstein DB. Detecting association in a case-control study while correcting for population stratification. Genet Epidemiol 2001; 20(1):4–16.
53. Relethford JH. Apportionment of global human genetic diversity based on craniometrics and skin color. Am J Phys Anthropol 2002; 118(4):393–398.
54. Harding RM, Healy E, Ray AJ, et al. Evidence for variable selective pressures at MC1R. Am J Hum Genet 2000; 66(4):1351–1361.
55. Dean M. Approaches to identify genes for complex human diseases: lessons from Mendelian disorders. Hum Mutat 2003; 22(4):261–274.
56. Burnett MS, Strain KJ, Lesnick TG, et al. Reliability of self-reported ancestry among siblings: implications for genetic association studies. Am J Epidemiol 2006; 163:486–492.
57. Ziv E, Burchard EG. Human population structure and genetic association studies. Pharmacogenomics 2003; 4(4):431–441.
58. Thomas DC, Witte JS. Point: population stratification: a problem for case-control studies of candidate-gene associations? Cancer Epidemiol Biomarkers Prev 2002; 11(6):505–512.
59. Pritchard JK, Donnelly P. Case-control studies of association in structured or admixed populations. Theor Popul Biol 2001; 60(3):227–237.
60. Devlin B, Roeder K, Wasserman L. Genomic control, a new approach to genetic-based association studies. Theor Popul Biol 2001; 60(3):155–166.
61. Smith MW, Lautenberger JA, Shin HD, et al. Markers for mapping by admixture linkage disequilibrium in African American and Hispanic populations. Am J Hum Genet 2001; 69(5):1080–1094.
62. Shriver MD, Smith MW, Jin L, et al. Ethnic-affiliation estimation by use of population-specific DNA markers. Am J Hum Genet 1997; 60(4):957–964.
63. Akey JM, Zhang G, Zhang K, et al. Interrogating a high-density SNP map for signatures of natural selection. Genome Res 2002; 12(12):1805–1814.
64. Williams RC, Long JC, Hanson RL, et al. Individual estimates of European genetic admixture associated with lower body-mass index, plasma glucose, and prevalence of type 2 diabetes in Pima Indians. Am J Hum Genet 2000; 66(2):527–538.
65. Fernandez JR, Shriver MD, Beasley TM, et al. Association of African genetic admixture with resting metabolic rate and obesity among women.
Obes Res 2003; 11(7):904–911.
66. Gower BA, Fernandez JR, Beasley TM, et al. Using genetic admixture to explain racial differences in insulin-related phenotypes. Diabetes 2003; 52(4):1047–1051.
67. Barnholtz-Sloan JS, Chakraborty R, Sellers TA, et al. Examining population stratification via individual ancestry estimates versus self-reported race. Cancer Epidemiol Biomarkers Prev 2005; 14(6):1545–1551.
68. Ziv E, John EM, Choudhry S, et al. Genetic ancestry and risk factors for breast cancer among Latinas in the San Francisco Bay Area. Cancer Epidemiol Biomarkers Prev 2006; 15(10):1878–1885.
69. Halder I, Nievergelt C, Ferrell R, et al. Variation of individual admixture within and between populations follows continuous distributions. Presented at the annual meeting of The American Society of Human Genetics, Salt Lake City, Utah, October 27, 2005 (abstr 1059). Available at: http://www.ashg.org/genetics/ashg/menu-annmeet-2005.shtml
70. Tsai HJ, Choudhry S, Naqvi M, et al. Comparison of three methods to estimate genetic ancestry and control for stratification in genetic association studies among admixed populations. Hum Genet 2005; 118(3–4):424–433.
71. Reiner AP, Ziv E, Lind DL, et al. Population structure, admixture, and aging-related phenotypes in African American adults: the Cardiovascular Health Study. Am J Hum Genet 2005; 76(3):463–477.
72. Gower BA, Fernandez JR, Beasley TM, et al. Using genetic admixture to explain racial differences in insulin-related phenotypes. Diabetes 2003; 52(4):1047–1051.
73. Wilson JF, Weale ME, Smith AC, et al. Population genetic structure of variable drug response. Nat Genet 2001; 29(3):265–269.
74. Chakraborty R, Kamboh MI, Nwankwo M, et al. Caucasian genes in American blacks: new data. Am J Hum Genet 1992; 50(1):145–155.
75. Chakraborty R. Gene admixture in human populations: models and predictions. Yearbook of Physical Anthropology 1986; 29:1–43.
76. McKeigue PM, Carpenter JR, Parra EJ, et al. Estimation of admixture and detection of linkage in admixed populations by a Bayesian approach: application to African-American populations. Ann Hum Genet 2000; 64(pt 2):171–186.
77. Collins-Schramm HE, Kittles RA, Operario DJ, et al. Markers that discriminate between European and African ancestry show limited variation within Africa. Hum Genet 2002; 111(6):566–569.
78. Collins-Schramm HE, Phillips CM, Operario DJ, et al. Ethnic-difference markers for use in mapping by admixture linkage disequilibrium. Am J Hum Genet 2002; 70(3):737–750.
79. Rosenberg NA, Li LM, Ward R, et al. Informativeness of genetic markers for inference of ancestry. Am J Hum Genet 2003; 73(6):1402–1422.
80. Pfaff CL, Barnholtz-Sloan J, Wagner JK, et al. Information on ancestry from genetic markers. Genet Epidemiol 2004; 26(4):305–315.
81. Parra EJ, Marcini A, Akey J, et al. Estimating African American admixture proportions by use of population-specific alleles. Am J Hum Genet 1998; 63(6):1839–1851.
82. Smith MW, Patterson N, Lautenberger JA, et al. A high-density admixture map for disease gene discovery in African Americans. Am J Hum Genet 2004; 74(5):1001–1013.
83. Collins-Schramm HE, Chima B, Morii T, et al. Mexican American ancestry-informative markers: examination of population structure and marker characteristics in European Americans, Mexican Americans, Amerindians and Asians. Hum Genet 2004; 114(3):263–271.
84. Hinds DA, Stuve LL, Nilsen GB, et al. Whole-genome patterns of common DNA variation in three human populations. Science 2005; 307(5712):1072–1079.
85. Miller RD, Phillips MS, Jo I, et al. High-density single-nucleotide polymorphism maps of the human genome. Genomics 2005; 86(2):117–126.
86. Shriver MD, Mei R, Parra EJ, et al. Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation. Hum Genomics 2005; 2(2):81–89.
87. Tian C, Hinds DA, Shigeta R, et al. A genomewide single-nucleotide-polymorphism panel with high ancestry information for African American admixture mapping. Am J Hum Genet 2006; 79(4):640–649.
88. Tian C, Hinds DA, Shigeta R, et al.
A genomewide single-nucleotide-polymorphism panel for Mexican American admixture mapping. Am J Hum Genet 2007; 80:1014–1023.
89. Price AL, Patterson N, Yu F, et al. A genomewide admixture map for Latino populations. Am J Hum Genet 2007; 80:1024–1036.
90. Mao X, Bingham AW, Mei R, et al. A genomewide admixture mapping panel for Hispanic/Latino populations. Am J Hum Genet 2007; 80:1171–1178.
91. Price AL, Butler J, Patterson N, et al. Discerning the ancestry of European Americans in genetic association studies. PLoS Genet 2008; 4(1):e236.
11
Statistical Approaches to Studies of Gene-Gene and Gene-Environment Interactions

Nilanjan Chatterjee
Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Department of Health and Human Services, Rockville, Maryland, U.S.A.
Bhramar Mukherjee Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, U.S.A.
INTRODUCTION

Most common human diseases have a multifactorial etiology involving a complex interplay among genetic susceptibilities and environmental exposures. Studying the "interaction" of multiple factors on the risk of a complex disease can improve the statistical power to detect the underlying causative factors of the disease, give insight into their biologic effects, and lead to public health strategies for prevention. The purpose of this chapter is to describe some classical and modern statistical approaches to the investigation of interaction in population-based epidemiologic studies. For the most part, the chapter focuses on interaction among pairs of risk factors. For notational convenience, the methods will often be described in the context of studies of gene-environment interaction, but the same approaches are also applicable to studies of gene-gene interaction unless otherwise specified.

The chapter begins with a review of statistical models for interaction and their biologic interpretations. The section "Inference Techniques for Alternative Study Designs" describes classical and modern statistical methods for inference on interactions under a variety of commonly used epidemiologic designs, including population-based cohort, case-control, two-phase stratified designs, and family-based case-sibling and case-parent designs. The section "Biases" describes the effects of selection bias, misclassification, and missing data. The section "Test for Association in Presence of Interaction" focuses on a hypothesis-testing framework for modern association studies that can improve the power of detecting disease-susceptibility genes by accounting for, but not necessarily testing for, gene-gene and gene-environment interactions. The section "Higher Order Interaction and Data-Mining Tools" gives a brief introduction
to modern data-mining techniques for studies of higher order interactions. The chapter concludes with a discussion of some of the statistical challenges associated with the investigation of interaction in modern molecular epidemiologic studies.

MODELS FOR INTERACTION

A model for interaction corresponds to a form of constraint on the joint effects of the risk factors. Table 1 shows the form of the relative risk of a disease associated with two binary exposures, say G and E, under some commonly used models for interaction in epidemiologic studies. The multiplicative and additive forms are the two most commonly used models in practice (1). The multiplicative model implies r11/r10 = r01/r00, i.e., the relative risk of the disease associated with G is the same irrespective of the value of E, and vice versa. The additive model corresponds to the constraint r11 − r10 = r01 − r00, which in turn implies a11 − a10 = a01 − a00, i.e., the risk difference of the disease associated with G is the same irrespective of the value of E, and vice versa. If the joint effect of the two exposures departs from the additive or multiplicative constraints, then an "interaction" is said to be present on the corresponding scale, with the corresponding indices of interaction given by AIge = r11 − r10 − r01 + r00 and MIge = r11r00/(r10r01), respectively; a small sketch of these calculations is given below.

There has been a long-standing debate about whether and when the additive and multiplicative models, the two statistical forms of interaction, correspond to any plausible biologic model. The fact that the presence or absence of these interactions depends on the scale on which risk is measured limits their biologic interpretation. Sometimes, under simplistic assumptions, statistical and biologic models for interaction can be related. Under a two-hit model for carcinogenesis, for example, if two risk factors affect the rates of transition for two different stages, i.e., normal to stage I and stage I to stage II, then the pattern of incident rates of the disease would fit the multiplicative model (2,3). On the other hand, under a number of alternative biologic models, such as the "single-hit" model (4) and the "sufficient-component-cause" model (5), it has been shown that "independent biologic actions" of two risk factors lead to additivity of their effects on the incident rate of the disease. Unfortunately, these relationships, although conceptually useful, rely on very simplistic assumptions, such as the absence of unknown risk factors, that are unlikely to hold in practice.

Interactions that are "nonremovable," i.e., present irrespective of the scale of measurement of association, can give biologic insights.
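For a pair of binary factors, the two indices can be computed directly from the relative risks of Table 1 (with r00 = 1 as the reference). A minimal sketch with hypothetical relative risks:

```python
def interaction_indices(r01, r10, r11):
    """Additive and multiplicative interaction indices for two binary
    factors, from relative risks with r00 = 1 as the reference:
    AI = r11 - r10 - r01 + 1 and MI = r11 / (r10 * r01)."""
    ai = r11 - r10 - r01 + 1
    mi = r11 / (r10 * r01)
    return ai, mi

# Joint effect exactly multiplicative (MI = 1) but super-additive (AI > 0):
print(interaction_indices(r01=2.0, r10=3.0, r11=6.0))  # (2.0, 1.0)
```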
Table 1  Risk of Binary Outcome D Associated with Joint Status of Two Binary Exposures G and E

|               | (G = 0, E = 0)      | (G = 0, E = 1)  | (G = 1, E = 0)  | (G = 1, E = 1)  |
| Absolute risk | a00                 | a01             | a10             | a11             |
| Relative risk | r00 = a00/a00 = 1   | r01 = a01/a00   | r10 = a10/a00   | r11 = a11/a00   |

Forms of relative risks under specific models:

| Model          | (G = 0, E = 0) | (G = 0, E = 1) | (G = 1, E = 0) | (G = 1, E = 1) |
| General        | 1              | fE             | fG             | fG·fE·θGE      |
| Additive       | 1              | fE             | fG             | fG + fE − 1    |
| Multiplicative | 1              | fE             | fG             | fG·fE          |
| NR-I           | 1              | 1              | 1              | θGE            |
| NR-II          | 1              | 1              | fG             | fG·θGE         |
| NR-III         | 1              | fE             | 1              | fE·θGE         |
Table 1 shows three forms of such interactions, NR-I, NR-II, and NR-III, in which the effect of one or both factors exists only in the presence of the other. Khoury et al. (6) and Ottman (7) describe several examples where such patterns of interaction have been observed in real studies. It is, however, unclear how one could test for such "pure" interaction using a standard hypothesis-testing framework. An extreme form of nonremovable interaction is known as a "crossover" effect, under which the effect of a causative factor is reversed by the presence of another. It has been noted, for example, that NAT2 slow-acetylation activity increases the risk of bladder cancer among smokers but can reduce the risk of the same disease among subjects exposed to benzidine, an occupational exposure present in textile dyes (8). Formal statistical tests are available for detecting crossover effects (9). From a biologic standpoint, however, it is believed that crossover forms of interaction are likely to be rare in practice.

Irrespective of their lack of biologic interpretation, statistical evaluation of interaction can be important in practice. Thomson (3) describes three primary reasons. First, assessment of interaction can enhance the detection of the underlying risk factors of a disease. If the effect of a factor on the risk of a disease is heterogeneous by the level of a second factor, then the power to detect the association of the disease with the first factor can be reduced if its interaction with the second factor is ignored. In the section "Test for Association in Presence of Interaction" of this chapter, we address this topic in depth because of its relevance to modern large-scale genetic association studies. Second, evaluation of statistical interaction can be important for understanding the public health impact of two exposures. In particular, it has been pointed out that evaluation of additive interaction is important for understanding whether elimination of a risk factor could be more beneficial if subjects are targeted based on the level of a second factor (5). Third, evaluation of interaction can be beneficial for building parsimonious models for predicting an individual's risk of disease based on his or her status for two or more risk factors.

INFERENCE TECHNIQUES FOR ALTERNATIVE STUDY DESIGNS

Cohort Studies

Standard epidemiologic study designs that have traditionally been used for studies of environmental factors can also be used for the investigation of genetic effects and gene-gene/gene-environment interactions. Ideally, one can use a prospective cohort study, which involves recruiting a random sample of healthy subjects from a well-defined population, collecting their biologic samples and questionnaire data on various demographic, lifestyle, and dietary factors at baseline, and then following these individuals over time to observe disease occurrence, with the possibility of updating some of the biologic samples and questionnaire data. Data from cohort studies can be used to study the relationship of genotypes and various types of questionnaire- and biomarker-based environmental exposures with the incident rate of a disease on a suitable timescale, such as biologic age or time since enrollment into the study. The incident rate focuses on time, rather than the individual, as the unit of observation.
If, for example, λge(t, t + u) denotes the incident rate of the disease during the time interval (t, t + u) for the subcohort of subjects having the specific genotype and exposure configuration (G = g, E = e), one can estimate λge(t, t + u) as

λ̂ge(t, t + u) = Nge / PYge,    (1)
where Nge denotes the number of subjects in that subcohort who developed the disease during the interval and PYge denotes the total person-time for which those subjects were "at risk" of the disease. If an environmental exposure changes over time, then a subject's contribution to the denominator of equation (1) is computed based only on the time intervals during which she/he had the relevant exposure E = e. Once the incident rates of the disease are estimated by joint status of the two exposures, the interaction between the two factors can be investigated on alternative scales like those described above. For binary factors, for example, the multiplicative and additive interactions can be estimated as

λ̂11λ̂00/(λ̂10λ̂01)  and  λ̂11 − λ̂10 − λ̂01 + λ̂00,

respectively. For testing and obtaining confidence intervals, one can estimate the variance of λ̂ge using the standard Greenwood formula (10),

Var(λ̂ge) = λ̂ge(1 − λ̂ge)/PYge.

Moreover, the covariance between hazard estimates for any pair of disjoint time intervals is zero. Thus, the variance estimate for any function of the hazard parameters, including those for the multiplicative and additive interactions, can be obtained by a standard application of the delta theorem.
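A small numeric sketch of these calculations, with hypothetical event counts and person-years in the four (G, E) cells:

```python
def interval_hazard(events, person_years):
    """Hazard estimate for one exposure cell over a fixed interval:
    lambda_hat = N / PY, with variance lambda_hat*(1 - lambda_hat)/PY
    as in the Greenwood-type formula above."""
    lam = events / person_years
    var = lam * (1 - lam) / person_years
    return lam, var

# Hypothetical cohort: cells indexed by (G, E).
events = {(0, 0): 40, (0, 1): 90, (1, 0): 75, (1, 1): 240}
pyears = {(0, 0): 10000, (0, 1): 10000, (1, 0): 10000, (1, 1): 10000}
lam = {ge: interval_hazard(n, pyears[ge])[0] for ge, n in events.items()}

# Interaction on the multiplicative and additive hazard scales:
MI = lam[1, 1] * lam[0, 0] / (lam[1, 0] * lam[0, 1])
AI = lam[1, 1] - lam[1, 0] - lam[0, 1] + lam[0, 0]
print(f"MI = {MI:.2f}, AI = {AI:.4f}")  # MI = 1.42, AI = 0.0115
```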
When one or both of the exposures under study have many different levels, a model-based approach for estimation of incident rates becomes necessary. The most popular approach is Cox's proportional hazards model (11), which measures association on the relative-risk scale. If λ{t|G, E(t)} denotes the instantaneous hazard of the disease at time t for a subject with genotype G and exposure history up to time t denoted by E(t), then a proportional hazards model for gene-environment interaction can be specified as

λ{t|G, E(t)} = λ0(t) R{G, E(t); β},    (2)
where λ0(t) denotes the baseline hazard rate of the disease associated with a reference genotype and a reference exposure level, say g0 and e0, and R{G, E(t); β} is a parametric function describing the relative hazard associated with the exposure (G, E(t)) in reference to (g0, e0). The relative-risk function can be further specified as

log R{G, E(t); β} = XG βG + XE(t) βE + XG,E(t) βGE,

where XG, XE(t), and XG,E(t) denote suitable design vectors representing the main effects of G, the main effects of E(t), and the interaction effect of G and E(t), with the corresponding regression coefficients denoted by βG, βE, and βGE, respectively. If, for example, G denotes the genotype data for a biallelic locus and E(t) denotes a quantitative exposure, then assuming a multiplicative (additive in log scale) trend effect for both G and E, one could choose XG to be the number of minor alleles in genotype G, XE(t) to be E(t) itself, and XG,E(t) = XG × XE(t).

Model (2) quantifies association on the relative-risk scale; thus, the interaction coefficient exp(βGE) measures the magnitude of multiplicative interaction. Estimation and testing of the regression coefficients can be conducted by partial-likelihood methods (11), widely implemented in standard statistical software packages. Alternatively, one could investigate the additive effects of genetic and environmental exposures on the risk of a disease on the basis of an additive hazards model (12) of the form

λ{t|G, E(t)} = λ0(t) + XG βG + XE(t) βE + XG,E(t) βGE,    (3)
where the regression coefficients βG and βE quantify association of the disease with genotype and exposure on a risk-difference scale, and the interaction coefficient βGE measures the magnitude of "additive" interaction. Methods for parameter estimation and testing in the additive model, although not as widely available as those for the proportional hazards model, have been well studied in the literature (13,14). A statistical software package for fitting additive hazards regression in R/S-Plus, primarily based on survival analysis techniques (15,16), is available at http://www.med.uio.no/imb/stat/addreg/.

Case-Control Studies

For rare diseases, like cancer, cohort studies can be very expensive, as they involve recruiting and gathering covariate information on a very large number of subjects, most of whom remain healthy during the course of the study. Case-control studies save cost relative to a cohort study by dramatically reducing the number of nondiseased subjects included. Typically, a case-control study involves recruiting all or a large fraction of the diseased subjects (cases) arising in an underlying study base and then sampling a comparable number of healthy subjects (controls), preferably from the same study base, possibly matched to the cases on socio-demographic characteristics such as race, age, and gender. As in cohort studies, both biologic samples and questionnaire-based data can be collected in case-control studies, but ascertainment of the environmental variables requires special attention, as the measurements need to reflect exposure occurring prior to disease onset for the cases and over a comparable window of time for the controls.

Data from case-control studies can be used to study the association of a disease with the exposures under study on the odds-ratio scale. Table 2 shows the 2 × 2 × 2 representation of case-control data with two binary exposures, say G and E. If pdge = pr(D = d|G = g, E = e) defines the absolute risk of disease status D = d for subjects with G = g and E = e, then the prospective population odds ratio of the disease for the cell (G = g, E = e) relative to the reference cell (G = 0, E = 0) can be defined as ORge = (p1ge p000)/(p0ge p100). For rare diseases, p0ge and p000 are both close to unity, and thus ORge approximates the relative risk RRge = p1ge/p100. Throughout this chapter, we assume that case-control studies are conducted for rare diseases; thus, we will often use the odds-ratio and relative-risk scales interchangeably. By the well-known Cornfield equality (17),

ORge = [pr(D = 1|G = g, E = e) pr(D = 0|G = 0, E = 0)] / [pr(D = 0|G = g, E = e) pr(D = 1|G = 0, E = 0)]
     = [pr(G = g, E = e|D = 1) pr(G = 0, E = 0|D = 0)] / [pr(G = g, E = e|D = 0) pr(G = 0, E = 0|D = 1)],

so ORge can be directly estimated from the "retrospective" case-control design as

ÔRge = (r1ge r000)/(r0ge r100),

and the variances and covariances of the corresponding log-odds ratios can be estimated by sums of the reciprocals of the corresponding cell entries (18). Once the joint odds ratios are estimated, interaction between the two exposures can be investigated on alternative scales. In particular, indices for the multiplicative and additive interactions can be defined as MIge = OR11/(OR10 OR01) and AIge = OR11 − OR10 − OR01 + 1, respectively. More generally, data from case-control studies can be analyzed using a flexible logistic regression model of the form

pr(D = 1|G, E) = 1 − 1/[1 + exp{β0 + m(G, E; β1)}],    (4)
Table 2  Data for an Unmatched Case-Control Study with a Binary Genetic Factor and a Binary Environmental Exposure

|       | G = 0, E = 0 | G = 0, E = 1 | G = 1, E = 0 | G = 1, E = 1 | Total |
| D = 0 | r000         | r001         | r010         | r011         | n0    |
| D = 1 | r100         | r101         | r110         | r111         | n1    |
where m(·) is a parametric function that defines the joint odds ratio of the disease as a function of G and E in terms of the association parameters β1. Typically, in the standard logistic regression model, one chooses

m(G, E; β1) = XG βG + XE βE + XG,E βGE,    (5)
where $X_G$, $X_E$, and $X_{G,E}$ denote suitable design vectors representing the main effects of G, the main effects of E, and the multiplicative interaction effect of G and E, with the corresponding regression coefficients denoted by $\beta_G$, $\beta_E$, and $\beta_{GE}$. The general form of model (4), however, allows alternative forms of interaction through the choice of the function m(·). When both G and E are binary, for example, the additive model shown in Table 1 corresponds to a choice of the m(·) function where the odds-ratio main effects and the interaction parameters are related by the constraint

$$\exp(\beta_G + \beta_E + \beta_{GE}) = \exp(\beta_G) + \exp(\beta_E) - 1.$$

Data from case-control studies are generally analyzed using any standard logistic regression software, ignoring the retrospective nature of the sampling design. It is well known that such "prospective" analysis of case-control studies yields the efficient maximum-likelihood estimate of the association parameter $\beta_1$ (19). The estimate of the intercept parameter $\beta_0$, however, is not unbiased for its true value in the population. If the cases and controls are sampled with individual matching, for example, by age, then the standard method for analysis of the resulting data is conditional logistic regression (18) (chap. 7, "Biomarkers of Exposure and Effect"), for which software is also widely available in packages such as SAS, R, and Stata.

A common feature of standard conditional and unconditional logistic regression methods for analysis of case-control data is that these methods allow the population distribution of the joint exposures to remain completely unrestricted (nonparametric). For studies of gene-gene or/and gene-environment interactions, however, a reasonable assumption may often be that these factors are independently distributed in the underlying population. In the following section, we describe some modern methods for analysis of case-control studies that can exploit the assumption of independence to gain efficiency.

Piegorsch et al. (20) observed that under gene-environment independence and the rare disease assumption, the odds-ratio interaction parameter between two exposures can be estimated using the cases alone. To understand this phenomenon, let us consider the situation with binary G and binary E as represented by the 2 × 2 × 2 data layout in Table 2. By virtue of Cornfield's equality (17), the odds-ratio interaction parameter can be expressed as a ratio of two odds ratios, namely,

$$\mathrm{MI}_{ge} = \frac{\text{Odds ratio between } G \text{ and } E \text{ among cases}}{\underbrace{\text{Odds ratio between } G \text{ and } E \text{ among controls}}_{=\,1\ \text{under } G\text{-}E\ \text{independence and rare disease}}}. \tag{6}$$
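The reduction of the denominator to unity is worth spelling out; the following two-line derivation (ours, using only the two stated assumptions) makes it explicit. Under the rare-disease approximation, $\mathrm{pr}(G, E \mid D = 0) \approx \mathrm{pr}(G, E)$, and under G-E independence, $\mathrm{pr}(G, E) = \mathrm{pr}(G)\,\mathrm{pr}(E)$, so that among the controls

$$\mathrm{OR}^{\text{controls}}_{GE} = \frac{\mathrm{pr}(G{=}1, E{=}1 \mid D{=}0)\,\mathrm{pr}(G{=}0, E{=}0 \mid D{=}0)}{\mathrm{pr}(G{=}1, E{=}0 \mid D{=}0)\,\mathrm{pr}(G{=}0, E{=}1 \mid D{=}0)} \approx \frac{\mathrm{pr}(G{=}1)\,\mathrm{pr}(E{=}1)\,\mathrm{pr}(G{=}0)\,\mathrm{pr}(E{=}0)}{\mathrm{pr}(G{=}1)\,\mathrm{pr}(E{=}0)\,\mathrm{pr}(G{=}0)\,\mathrm{pr}(E{=}1)} = 1.$$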
As indicated above, the denominator in $\mathrm{MI}_{ge}$, the population odds ratio between G and E among the disease-free subjects, reduces to unity under the gene-environment independence and rare disease assumptions. Thus, under those two assumptions, $\mathrm{MI}_{ge}$ can be estimated by the sample odds ratio between G and E among the cases alone. The standard case-control analysis, which does not require either of those assumptions, estimates $\mathrm{MI}_{ge}$ by replacing the odds ratios in the numerator and denominator of equation (6) by the sample G-E odds ratios for the cases and controls, respectively. Thus, the case-only estimator gains efficiency over its case-control counterpart by eliminating the variance associated with estimation of the odds ratio between G and E among the controls. More formally, if we denote by $\hat{\beta}_{CC}$ and $\hat{\beta}_{CO}$ the case-control and case-only estimators of $\log(\mathrm{MI}_{ge}) = \beta$ (say), with the corresponding formulae given by

$$\hat{\beta}_{CC} = \log\left(\frac{r_{001}\, r_{010}\, r_{100}\, r_{111}}{r_{000}\, r_{011}\, r_{101}\, r_{110}}\right) \quad\text{and}\quad \hat{\beta}_{CO} = \log\left(\frac{r_{100}\, r_{111}}{r_{101}\, r_{110}}\right),$$

then the corresponding estimated asymptotic variances are given by $\hat{\sigma}^2_{CC} = \sum_{d=0}^{1}\sum_{g=0}^{1}\sum_{e=0}^{1} (1/r_{dge})$ and $\hat{\sigma}^2_{CO} = \sum_{g=0}^{1}\sum_{e=0}^{1} (1/r_{1ge})$, respectively. Evidently, $\hat{\sigma}^2_{CO} < \hat{\sigma}^2_{CC}$.

One limitation of the case-only approach is that it does not allow estimation of the other parameters required for specification of the full joint effect, such as the main effects of G and E in the logistic regression model. When data on both cases and controls are available, assuming rare disease and gene-environment independence, maximum-likelihood estimates of all the parameters in a logistic regression model can be obtained by fitting a suitably constrained log-linear model for categorical data (21). For binary G and E, the log-linear approach produces the same estimator of the interaction parameter as the case-only analysis. For a rich model with many covariates, implementing the log-linear method may become challenging because of the many nuisance parameters needed to model the control distribution of E. A fully general framework for maximum-likelihood estimation under gene-environment independence, which does not require the rare disease assumption and retains the flexibility of a traditional logistic regression model in terms of adjusting for confounders and incorporating continuous exposures and/or confounders, was proposed in (22).

A potential criticism of modern methods for analysis of case-control data exploiting exposure distribution constraints is that they can incur severe bias when the underlying assumptions are violated (23,24). From the representation in equation (6), for example, it is clear that if gene-environment independence does not hold, i.e., when the odds ratio in the denominator of equation (6) departs from unity, the case-only estimator of the interaction parameter will remain asymptotically biased, by a magnitude exactly equal to the log of the G-E odds ratio in the control population. To reduce the bias, one could adopt a "two-step" (TS) procedure in which one first tests the gene-environment independence assumption in the control sample and then, based on the acceptance or rejection of the null hypothesis, uses the case-only or the case-control estimator at the second step. The procedure as a whole, however, can still be significantly biased under modest violations of gene-environment independence and small sample sizes, because the test used in the first step may not have adequate power to reject the null hypothesis.
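As a concrete illustration of these closed-form estimators, here is a minimal sketch (ours; the cell counts are hypothetical) that computes the joint odds ratios, the interaction indices $\mathrm{MI}_{ge}$ and $\mathrm{AI}_{ge}$, and the case-control and case-only estimators with their variances:

```python
import numpy as np

# Hypothetical cell counts r[d, g, e] from a 2 x 2 x 2 case-control table,
# indexed as r_dge: d = disease status, g = genotype, e = exposure.
r = np.array([[[200., 60.],    # controls: (g=0,e=0), (g=0,e=1)
               [50.,  20.]],   #           (g=1,e=0), (g=1,e=1)
              [[150., 80.],    # cases:    (g=0,e=0), (g=0,e=1)
               [55.,  65.]]])  #           (g=1,e=0), (g=1,e=1)

def joint_or(r, g, e):
    """Odds ratio OR_ge for cell (g, e) relative to the reference cell (0, 0)."""
    return (r[1, g, e] * r[0, 0, 0]) / (r[0, g, e] * r[1, 0, 0])

or01, or10, or11 = joint_or(r, 0, 1), joint_or(r, 1, 0), joint_or(r, 1, 1)
mi = or11 / (or10 * or01)        # multiplicative interaction index MI_ge
ai = or11 - or10 - or01 + 1.0    # additive interaction index AI_ge

# Case-control and case-only estimators of beta = log(MI_ge), with variances
# given by sums of reciprocals of the relevant cell counts.
b_cc = np.log((r[1, 0, 0] * r[1, 1, 1] * r[0, 0, 1] * r[0, 1, 0]) /
              (r[1, 0, 1] * r[1, 1, 0] * r[0, 0, 0] * r[0, 1, 1]))
var_cc = (1.0 / r).sum()         # all eight cells
b_co = np.log((r[1, 0, 0] * r[1, 1, 1]) / (r[1, 0, 1] * r[1, 1, 0]))
var_co = (1.0 / r[1]).sum()      # the four case cells only

print(f"MI = {mi:.3f}, AI = {ai:.3f}")
print(f"b_CC = {b_cc:.3f} (var {var_cc:.4f}); b_CO = {b_co:.3f} (var {var_co:.4f})")
```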
A novel solution has recently been proposed to tackle the bias-versus-efficiency dilemma created by the independence assumption. In the setting of the 2 × 2 × 2 table described above, the method estimates the log-odds-ratio interaction parameter by taking a weighted average of the case-only ($\hat{\beta}_{CO}$) and case-control ($\hat{\beta}_{CC}$) estimators using the formula

$$\hat{\beta}_{EB} = \frac{\hat{\sigma}^2_{CC}}{\hat{\theta}^2_{GE} + \hat{\sigma}^2_{CC}}\, \hat{\beta}_{CO} + \frac{\hat{\theta}^2_{GE}}{\hat{\theta}^2_{GE} + \hat{\sigma}^2_{CC}}\, \hat{\beta}_{CC}, \tag{7}$$
where $\hat{\sigma}^2_{CC}$ denotes the asymptotic variance estimator for $\hat{\beta}_{CC}$ and $\hat{\theta}_{GE}$ denotes the estimated log odds ratio between G and E among the controls. The original estimator was proposed from an empirical Bayes (EB) point of view, the details of which can be found in (25). To understand the intuitive rationale behind the estimator, observe that as $\hat{\theta}_{GE} \to 0$, i.e., as the data provide evidence in favor of G-E independence, $\hat{\beta}_{EB} \to \hat{\beta}_{CO}$, and as $\hat{\theta}_{GE} \to \infty$, i.e., as the uncertainty regarding G-E independence in the control population becomes stronger, $\hat{\beta}_{EB} \to \hat{\beta}_{CC}$. Also, when the true $\theta_{GE} \neq 0$, i.e., the independence assumption is violated, then as the sample size $n \to \infty$, $\hat{\beta}_{EB} \to \hat{\beta}_{CC}$, the unbiased case-control estimator. A variance formula for $\hat{\beta}_{EB}$ has also been derived, from which one can construct Wald-type tests and confidence intervals.

Table 3 contains a snapshot of the type I error and power of four different methods, namely, the case-control, case-only, TS, and EB methods, under varying values of $\theta_{GE}$ for the case of a binary G and a binary E. Both type I error and power are evaluated with $n_0 = n_1 = 500$ and $\alpha = 0.05$, with the power being evaluated at $\beta = \log(1.5)$ and $\log(2)$. Under gene-environment independence, i.e., $\theta_{GE} = 0$, all approaches except TS maintain the nominal $\alpha$-level of 0.05. In terms of power, the case-only approach is certainly the best option, with huge advantages over its case-control counterpart. The EB approach gives up some efficiency compared with case-only, but still maintains a major advantage over the case-control approach. When $\theta_{GE} \neq 0$, i.e., the gene-environment independence assumption is violated, the type I error of the case-control estimator is maintained, but those of case-only and TS are highly inflated. The EB approach provides much better control of type I error than case-only and TS. It is particularly encouraging that under small departures from independence, such as $\theta_{GE} = \log(1.1)$, which may arise often in practice (26) but would not be detectable by statistical tests, the EB approach provides very good control of type I error and yet offers a substantial power advantage over the case-control estimator. The EB estimation method has been extended to the general logistic regression setup of (22) and has been implemented in the same Matlab software referred to above. An Excel spreadsheet for computing all four estimators for binary G and E, and R code for computing power for the different methods, are available at http://www.sph.umich.edu/bhramar/public_html.
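Continuing the same sketch (ours, reusing r, b_cc, b_co, and var_cc from the block above), the empirical Bayes estimator of equation (7) is a short weighted combination:

```python
# Empirical Bayes combination (equation 7), reusing r, b_cc, b_co, var_cc
# from the previous sketch. theta_ge is the estimated log odds ratio between
# G and E among the controls (the quantity driving the bias of b_co).
theta_ge = np.log((r[0, 0, 0] * r[0, 1, 1]) / (r[0, 0, 1] * r[0, 1, 0]))
w = var_cc / (theta_ge ** 2 + var_cc)    # weight on the case-only estimator
b_eb = w * b_co + (1.0 - w) * b_cc
print(f"theta_GE = {theta_ge:.3f}, b_EB = {b_eb:.3f}")
```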
Table 3  Type I Error and Power for Detection of Multiplicative Interaction Between a Binary Genetic and a Binary Environmental Exposure Using Four Alternative Methods: (1) Case-Control (CC), (2) Case-Only (CO), (3) Two-Stage (TS), and (4) Empirical Bayes (EB)

                   Type I error                Power at β^a = log(1.5)       Power at β^a = log(2.0)
θ_GE^b         CC     CO     EB     TS        CC     CO     EB     TS        CC     CO     EB     TS
0              0.05   0.05   0.04   0.07      0.29   0.53   0.41   0.52      0.68   0.95   0.84   0.93
log(1.1)       0.05   0.08   0.05   0.09      0.30   0.70   0.50   0.66      0.68   0.98   0.85   0.92
log(1.2)       0.05   0.14   0.07   0.15      0.29   0.84   0.51   0.72      0.70   0.99   0.85   0.89
log(1.5)       0.04   0.50   0.08   0.28      0.29   0.98   0.45   0.54      0.69   1.00   0.79   0.73
log(2.0)       0.05   0.91   0.06   0.11      0.30   1.00   0.40   0.32      0.68   1.00   0.76   0.68

All calculations are done for a sample size of 500 cases and 500 controls, with pr(G = 1) = 0.3, pr(E = 1) = 0.3, and assuming no main effects for G and E.
^a Log-odds ratio of interaction.
^b Log-odds ratio between G and E among disease-free subjects.
Two-Phase Stratified Sampling Designs

The two-phase stratified sampling design has been proposed for epidemiologic studies as an efficient alternative to the traditional cohort and case-control designs when detailed covariate data collection on a large number of subjects is prohibitive due to cost and other practical considerations (27,28). Under this design, the disease-outcome information (D) and some inexpensive covariate data, possibly including error-prone surrogate measurements, are first collected for a relatively large number of subjects at phase I. At phase II, a small subset of the phase I subjects is selected, for whom the detailed and more expensive covariate data are gathered. Stratified random sampling, where strata are defined by both the disease outcome and the covariate information collected at phase I, can be much more efficient than simple random or case-control sampling for selection of the phase II subjects.

Two-phase designs can be particularly useful for studies of gene-environment interactions. Existing cohort studies are now routinely used for selecting case-control samples of subjects to be genotyped for investigation of disease susceptibility loci. In these studies, data on environmental factors available from the original cohort can potentially be used for oversampling subjects with rare exposures, thus increasing the efficiency of the nested case-control sample for investigation of gene-environment interactions. A two-phase design can also be used to reduce the cost associated with a gene-environment study by limiting evaluation of expensive environmental biomarkers to a small subsample of the main study. Cases and controls in the substudy can be selected on the basis of an inexpensive surrogate of the biomarker that may be available from the main study. Moreover, if genotyping has been performed in the main study, subjects in the substudy can be selected on the basis of their genotype status for known or putative disease susceptibility loci in the underlying biochemical pathway of the environmental exposure. A number of reports have studied the power of various types of two-phase designs for studies of interactions (29-33).

Methods for analysis of two-phase studies need to account for the underlying stratified sampling design. A variety of methods for logistic regression and Cox's proportional hazard analysis of two-phase data have been described in the literature, with related software packages, such as the survey and sampling packages, available on the R-cran Web site (http://www.r-project.org). The available methods can be classified into two broad types. The first class of methods considers the subjects who are selected at phase II, and hence have the complete detailed covariate information, as the primary unit of analysis. The effect of stratified sampling at phase II is accounted for by weighting the subjects according to the inverse of their selection probabilities (34) or by considering a conditional likelihood or partial likelihood of the data that can account for the nonrandom sampling design (35,36). The second class of methods considers the subjects in the larger phase I sample as the primary units of analysis, treating the subjects who are selected at phase I, but not at phase II, as having missing covariate information. This missing data approach is more efficient as it can make the most use of all the available data. In particular, in studies of interactions, if data on one of the interacting factors are available as part of the phase I study, then the efficiency of estimation of the main effect parameter associated with that factor can be greatly enhanced by the missing data approach (37-39). The assumption of gene-environment or/and gene-gene independence can also be exploited in some of these methods to gain efficiency in estimation of odds-ratio interaction parameters (40).
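To make the first, weighting-based class of methods concrete, here is a minimal simulated sketch (ours; the variable names, effect sizes, and sampling fractions are all hypothetical) of inverse-probability-weighted logistic regression for a two-phase sample:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 20000

# Phase I: disease status d and an inexpensive covariate z for everyone;
# the expensive covariate g is generated here for simulation purposes but
# treated as known only for the phase II subsample.
z = rng.binomial(1, 0.3, n)
g = rng.binomial(1, 0.25, n)
p = 1.0 / (1.0 + np.exp(-(-2.5 + 0.5 * g + 0.4 * z)))
d = rng.binomial(1, p)

# Phase II: stratified sampling probabilities pi(d, z), oversampling the
# cases and the rarer z = 1 stratum.
pi = np.where(d == 1, 1.0, np.where(z == 1, 0.5, 0.1))
selected = rng.random(n) < pi

# Inverse-probability-weighted logistic regression on the phase II subsample.
# Caveat: standard errors obtained via freq_weights are not design-based; in
# practice a sandwich (robust) variance estimator would be used.
X = sm.add_constant(np.column_stack([g, z])[selected])
fit = sm.GLM(d[selected], X, family=sm.families.Binomial(),
             freq_weights=1.0 / pi[selected]).fit()
print(fit.params)  # estimates of the intercept and the effects of g and z
```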
Family-Based Case-Control Studies

In population-based case-control studies, cases and controls are randomly selected from the diseased and nondiseased subjects that arise in an underlying population. Typically, the cases and controls in such designs are unrelated. In contrast, in family-based case-control studies, controls are selected from the families of the cases. An excellent review of the relative advantages and disadvantages of population- and family-based designs can be found in (41). While selection of population-based controls may be logistically more convenient, family-based designs can offer protection against spurious association induced by population stratification or admixture. Even when bias due to population stratification or admixture is not a concern, family-based designs may be preferred for efficiency reasons in studies of gene-environment interaction involving rare genetic variants (42,43) (Table 4). Two types of family-based designs, namely, case-siblings and case-parents, are particularly popular.

In the case-siblings design, healthy siblings of the cases are selected as the matched controls. Data from sibling case-control studies are usually analyzed using standard conditional logistic regression methods. Thus, if $(G_{i1}, E_{i1})$ and $(G_{i0}, E_{i0})$ denote the genotype-exposure configurations for the case and the control in the $i$th of $i = 1, \ldots, N$ case-control sibpairs, then the conditional likelihood of the data under a model of the form (4) is given by
$$L_{CLR} = \prod_{i=1}^{N} \frac{\exp\{m(G_{i1}, E_{i1}; \beta_1)\}}{\sum_{j \in R_i} \exp\{m(G_{ij}, E_{ij}; \beta_1)\}}, \tag{8}$$
where $R_i$ denotes the risk set containing the $i$th case-control pair. An important feature of equation (8) is that it allows the model (4) to have a family-specific intercept term $\beta_{0i}$ to account for between-family heterogeneity in disease risk. If one is willing to assume a rare disease and that G and E are independently distributed within families in the underlying population, then a more efficient conditional likelihood for analysis of case-siblings studies is given by (44)
$$L^{Ind}_{CLR} = \prod_{i=1}^{N} \frac{\exp\{m(G_{i1}, E_{i1}; \beta_1)\}}{\sum_{j \in R_i} \exp\{m(G_{ij}, E_{ij}; \beta_1)\}}, \tag{9}$$
which has a similar form to the traditional conditional likelihood (8), except that the $i$th risk set $R_i$ consists of four subjects with genotype-exposure configurations $(G_{i0}, E_{i0})$, $(G_{i1}, E_{i1})$, $(G_{i0}, E_{i1})$, and $(G_{i1}, E_{i0})$. The first two subjects in the set correspond to the selected case and control, and the two additional subjects with genotype-exposure configurations $(G_{i0}, E_{i1})$ and $(G_{i1}, E_{i0})$ can be viewed as "pseudo" siblings obtained by exchanging the genotypes of the observed siblings: under the G-E independence assumption, such "pseudo" subjects are equally likely to appear in a family as the observed subjects in that family. Similar to equation (8), inference based on equation (9) is robust to population stratification because it allows both the disease rate and the gene frequency to vary arbitrarily across families, and the G-E independence assumption only needs to hold within families. For gene-gene interaction analysis, the conditional likelihood (8) can be used for both linked and unlinked loci, but the conditional likelihood (9) should be used only for unlinked loci.

Table 4  Relative Efficiencies of Alternative Family-Based Designs Compared to a Population-Based Case-Control Design for Testing of Multiplicative and Additive Interactions

Designs:                  Case-parents      Case-sibling      Case-sibling      Hybrid
Methods^a:                L_CP              L_CLR             L^Ind_CLR         L_CCGP
p_G^b, θ_G^c  Parameter   Dom      Rec      Dom      Rec      Dom      Rec      Dom      Rec
0.01, 7       MI_GE       0.81     2.04     2.90     2.49     5.52     4.09     5.67     5.12
              AI_GE       NA       NA       2.90     2.51     5.56     4.67     5.64     5.78
              MI_GE       0.94     1.13     1.05     0.96     1.34     1.40     1.64     1.80
              AI_GE       NA       NA       1.05     0.93     1.40     1.39     1.72     1.86

Relative efficiencies are computed in reference to a population-based case-control study with the same number of cases and a 1:1 case-control ratio. All calculations assume Pr(E = 1) = 0.2, f_E = 1.3, and that the Pearson correlation in E between a pair of siblings is 0.2.
^a Alternative conditional likelihoods described in the formulae.
^b Genotype (G = I(Aa/aa) for dominant and G = I(aa) for recessive) frequency.
^c True values for main effects of G.
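The construction of the risk set in the likelihood (9) is mechanical; the following sketch (ours, assuming a hypothetical m(·) of the form (5) with binary G and E) evaluates a single family's contribution:

```python
import math

def m(g, e, b_g, b_e, b_ge):
    """Log odds-ratio function m(G, E; beta_1) of the form (5), binary G and E."""
    return b_g * g + b_e * e + b_ge * g * e

def family_contribution(case, sib, beta):
    """One family's factor in the conditional likelihood (9).

    `case` and `sib` are (G, E) pairs for the affected and unaffected siblings;
    the risk set adds two 'pseudo' siblings obtained by exchanging genotypes.
    """
    (g1, e1), (g0, e0) = case, sib
    risk_set = [(g0, e0), (g1, e1), (g0, e1), (g1, e0)]
    num = math.exp(m(g1, e1, *beta))
    den = sum(math.exp(m(g, e, *beta)) for g, e in risk_set)
    return num / den

# Hypothetical family: exposed carrier case, unexposed noncarrier sibling.
print(family_contribution(case=(1, 1), sib=(0, 0), beta=(0.3, 0.2, 0.5)))
```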
In the case-parents design, cases and their parents are genotyped, and the parental genotypes are used to construct a set of "pseudocontrols" consisting of the siblings the cases could have had given the parental genotypes and assuming Mendelian inheritance. If data on E are available for the cases, then the case-parents design can be used to estimate the gene-environment interaction parameter under the assumption that the distribution of the genotypes in the offspring does not depend on their exposure status given the parental genotypes (45). In particular, under the logistic model of form (4), one can use the conditional likelihood
$$L_{CP} = \prod_{i=1}^{N} \mathrm{pr}(G_{i1} \mid D_{i1} = 1, E_{i1}, G_{P_i}) = \prod_{i=1}^{N} \frac{\exp\{m(G_{i1}, E_{i1}; \beta_1)\}}{\sum_{G \in H_{G_{P_i}}} \exp\{m(G, E_{i1}; \beta_1)\}\, \mathrm{pr}(G \mid G_{P_i})}, \tag{10}$$
where $H_{G_{P_i}}$ denotes all possible genotype configurations for the offspring given the parental genotypes $G_{P_i}$, and $\mathrm{pr}(G \mid G_{P_i})$ denotes the corresponding conditional probability of observing G given $G_{P_i}$, computed according to the Mendelian mode of inheritance. Case-parents studies can also be used for studying gene-gene interaction involving linked and unlinked loci (46,47). A practical problem in case-parents studies is that some parents may be unavailable for genotyping. Families with partial parental genotype information, however, can remain informative for association analysis. Various advanced statistical methods are now available for efficient analysis of case-parents studies with partial parental genotype information (48-50).

A major limitation of the case-parents design for gene-environment studies is that it cannot estimate the main effect coefficient $\beta_E$ in models of the form (5) because of the lack of contrast in E between the cases and their pseudocontrols. Consequently, under this design, one cannot estimate or test for additive interaction either. For inference regarding the main effect of G, however, the case-parents design can have a major efficiency advantage over the case-sibling design (42). To combine the strengths of case-siblings and case-parents studies, one can consider a hybrid design that involves genotyping the cases and their parents and gathering environmental exposures on the cases and their siblings (44). A conditional likelihood for such studies, as described in (44), is given by
$$L_{CCGP} = \prod_{i=1}^{N} \frac{\exp\{m(G_{i1}, E_{i1}; \beta_1)\}}{\sum_{G \in H_{G_{P_i}}} \exp\{m(G, E_{i1}; \beta_1)\}\, \mathrm{pr}(G \mid G_{P_i}) + \sum_{G \in H_{G_{P_i}}} \exp\{m(G, E_{i0}; \beta_1)\}\, \mathrm{pr}(G \mid G_{P_i})}. \tag{11}$$

The conditional likelihood (11), similar to (9), requires a rare-disease approximation and the assumption that genotype and exposure status for pairs of siblings in the source population are independently distributed conditional on their parental genotype information. One could also consider a hybrid design that combines the strengths of case-parents and population-based case-control studies (51).
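Enumerating the set $H_{G_P}$ and the probabilities $\mathrm{pr}(G \mid G_P)$ is straightforward for a single biallelic locus; a sketch (ours) under Mendelian inheritance:

```python
from collections import Counter
from itertools import product

def offspring_genotypes(parent1, parent2):
    """Mendelian distribution pr(G | G_P) over offspring genotypes (number of
    copies, 0/1/2, of the variant allele) given parental genotype counts."""
    def gametes(g):
        # alleles (1 = variant, 0 = common) a parent with genotype g can transmit
        return [1] * g + [0] * (2 - g)
    counts = Counter(a + b for a, b in product(gametes(parent1), gametes(parent2)))
    total = sum(counts.values())
    return {geno: c / total for geno, c in counts.items()}

# Example: two heterozygous (Aa x Aa) parents.
print(offspring_genotypes(1, 1))  # {2: 0.25, 1: 0.5, 0: 0.25}
```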
A conditional likelihood similar to (11) can also be used to analyze such hybrid designs if the cases and population controls are individually matched. Hybrid designs involving population controls, however, are not completely robust to bias due to population stratification.

Table 4 shows the relative efficiencies of alternative family-based designs and related analytic methods compared with an unmatched population-based case-control design for estimation of additive and multiplicative interaction parameters. Briefly, in these comparisons all the studies include the same number of cases. The population- and sibling-based case-control designs include the same number of controls as cases, the case-parents design includes the parents of the cases, and the hybrid design includes the parents and one sibling for each case. Other details of the simulation study on which the power calculations were based can be found in (44).

A number of key observations can be made. For analysis of the ordinary sibling case-control design without additional parental genotype information, the proposed conditional likelihood $L^{Ind}_{CLR}$ leads to a major efficiency gain over the traditional conditional likelihood $L_{CLR}$ for inference on multiplicative and additive interaction parameters. The hybrid design, when analyzed using the novel conditional likelihood $L_{CCGP}$, can be far superior to an ordinary sibling case-control design, the case-parents design, or even a population-based case-control design in a wide variety of settings. Several previous studies (42,43) have compared the relative efficiencies of the case-siblings and case-parents designs for estimation of the multiplicative interaction parameter: they generally concluded that while the case-siblings design tends to be superior for dominant genes, the case-parents design is more efficient for recessive genes. In these comparisons, however, the method employed for analysis of the case-parents design implicitly assumes G-E independence, whereas that for the case-siblings design does not exploit any such assumption. Table 4 reveals that when both designs are analyzed under similar independence assumptions, the efficiency advantage of the case-siblings design over the case-parents design for dominant genes is even greater than reported before. Moreover, under the independence assumption, the case-siblings design can be more efficient than the case-parents design even for recessive variants.

BIASES

Observational epidemiologic studies can be prone to different types of biases. All types of designs can be affected by confounding to some degree. In population-based studies, an environmental or/and a genetic factor may seem to be associated with a disease merely because of the correlation of that factor with one or more unknown risk factors for the disease. The nature of such confounding bias for estimation of main effect parameters has been studied in depth in the classical environmental epidemiology literature (5,18). For inference on genetic main effects, much attention has recently been given to "population stratification," the phenomenon of confounding of the association between a disease and a genetic exposure due to the coupling of heterogeneity in allele frequency and disease risk across hidden substructures in an underlying population. Studies of the effect of confounding on studies of interaction have been much more limited.
A recent numerical study reported that the impact of population stratification on odds-ratio interaction is likely to be small unless there exists strong linkage disequilibrium among genes or correlation between the genes and the environments (52). A major strength of family-based designs, which restrict case-control comparisons to within homogeneous families, is that they are robust to population stratification for
studies of genetic effects. For studies of gene-environment interactions, however, family-based studies are not completely robust to confounding. The case-parents design may detect spurious multiplicative interaction when the underlying assumption of independence of genetic susceptibility and environmental exposure is violated. Similarly, the case-sibling and the hybrid designs, when analyzed under the assumptions of gene-environment independence, can lead to bias. The within-family gene-environment independence assumptions required in these methods, although quite weak compared with those required for the analogous methods for population-based case-control studies, can be violated due to a direct association between G and E. When such a direct association is plausible, the advantage of the sibling case-control design is that it can be analyzed by the standard conditional logistic regression method, which does not require the independence assumption.

All types of studies are prone to bias due to nondifferential measurement error in genotyping or/and ascertainment of environmental exposures. Case-control studies, in addition, are susceptible to measurement error that is differential by disease status. In particular, cases and controls may differentially recall their exposure history during interview. Differential measurement error in genotyping and biomarker evaluation may also arise because of differences in the handling and storage of DNA and other biologic samples between the cases and the controls. There is a vast literature on the effect of misclassification on studies of main effects (53). Nondifferential measurement error generally causes bias toward the null for the main effect of a single covariate. Thus, the test for the main effect of an exposure can be valid in the presence of nondifferential measurement error. Differential measurement error for an exposure, however, can cause bias away from the null in estimation of the main effect of a covariate, thus rendering the corresponding test biased as well. There have been relatively few theoretical studies of the effect of misclassification on interactions. Nevertheless, a number of empirical investigations have reported that independent nondifferential measurement errors for two exposures generally lead to bias toward the null for the multiplicative interaction parameter (54). The effect of differential misclassification of G or/and E on studies of interaction can be quite complex in general and has not been well investigated. If, however, it can be assumed that G and E are independently distributed in the underlying population, then one need not worry about nondifferential misclassification, because in this setting, as discussed earlier, the multiplicative interaction parameter can be estimated simply as the odds ratio between G and E among the cases alone (56). Further, if the measurement errors for E are uncorrelated with G and vice versa, then the corresponding case-only odds ratio is expected to be attenuated toward the null (55,56).

Case-control studies can also be susceptible to different types of selection bias due to improper selection of the cases and the controls. Ideally, cases and controls should be randomly sampled from a well-defined common study base so that the two groups of subjects have comparable population characteristics (57).
Because of logistical difficulties, however, practitioners often cannot adhere to the study-base principle and select controls from alternative sources, such as the hospitals or neighborhoods from which the cases arise. Moreover, even if an underlying population can be identified, the cases and controls who are willing to participate may be driven by different, possibly unknown, factors, creating potential bias due to differences in population characteristics between the two samples. In particular, the association between a disease and an exposure in a case-control study can be distorted if the underlying selection mechanism is directly or indirectly related to the exposure of interest itself and the nature of this relationship is differential by case-control status. In classical environmental epidemiology, such selection bias has traditionally been considered a major concern for case-control studies because of the potential influence of
dietary, lifestyle, and behavioral exposures on the participation rates of the cases and the controls. In modern genetic association studies, however, it is often argued that participation bias is less of a concern on the grounds that subjects' willingness to participate in epidemiologic studies is unlikely to be related to their genetic makeup. The topic, however, is controversial, given that it is quite possible that genes that influence human behavior and psychology could influence the phenotype of "participation."

There have been limited studies of the effect of selection bias in case-control studies of interaction. For family-based case-control studies, it has been noted that bias in the environmental relative risk can arise when the family-based controls do not reside in the same geographic region and the prevalence of an environmental risk factor varies by geographic region (58). However, assuming independence of genotype and environmental exposure, genetic relative risks and multiplicative gene-environment interaction parameters can be estimated in an unbiased fashion from studies with such controls. The effect of selection bias on studies of interaction has recently been investigated in the context of hospital-based case-control studies, for which there are simple established criteria for selecting controls so as to estimate the effect of a single factor without bias (59). It is noted that there is no bias in the estimate of the effect of E when G is associated with the control condition and vice versa, whether causally or because of confounding. There is no bias in estimating multiplicative interaction between G and E for the disease of interest when there is no multiplicative G-E interaction for the control condition, even when the control condition is caused by both G and E; if a mixture of several control groups is used, however, the absence of G-E interaction in each individual control condition does not ensure a lack of overall bias when the controls are pooled. Hospital control designs are much less robust for assessing additive interaction. These results extend to the general problem of distortion of joint effects by selection bias or confounding.

TEST FOR ASSOCIATION IN PRESENCE OF INTERACTION

The primary goal of modern large-scale association studies is to identify susceptibility genes that influence the risk of the diseases under study. A central statistical issue in this effort has been whether and how one could account for heterogeneity in genetic risk due to gene-gene and gene-environment interactions for more powerful discovery of the susceptibility loci. An omnibus hypothesis-testing framework is useful for this purpose. Suppose one is interested in testing the association of a disease outcome D with a genetic factor G in the background of an environmental risk factor E. For simplicity, let us assume all three factors D, G, and E are binary. The null hypothesis of no association of G with D can be stated as

$$H_0: \beta_{G|E=0} = 0 \quad\text{and}\quad \beta_{G|E=1} = 0, \tag{12}$$
where $\beta_{G|E=0}$ and $\beta_{G|E=1}$ denote the log odds ratios for D associated with G for subjects with E = 0 and E = 1, respectively. Thus, if G is associated with D in either the unexposed (E = 0) or the exposed (E = 1) subjects, then the corresponding single nucleotide polymorphism (SNP) will be considered a "susceptibility" SNP. We note that the global hypothesis $H_0$ could alternatively be stated as $\beta_{G|E=0} = 0$ and $\theta = 0$, where $\beta_{G|E=0}$ and $\theta\ (= \beta_{G|E=1} - \beta_{G|E=0})$ denote the main effect of G and the interaction coefficient between G and E in a logistic regression model that also includes a main-effect coefficient for E. For a given data set, the omnibus null hypothesis can be tested using a 2 d.f. chi-square test obtained by computing the squared Wald statistics for the test of
association between D and G once for subjects with E = 0 and once for subjects with E = 1 and then summing the statistics over the two strata. More generally, such omnibus hypothesis tests can be performed by simultaneously testing for the main effect $\beta_G$ and the interaction coefficient $\beta_{GE}$ in regression models of the form (4) or (8). Figures 1 and 2 illustrate potential advantages of omnibus tests over a simple 1 d.f. Wald test for the association of D with G, ignoring E. In Figure 1, it is assumed that the effect of G exists only for subjects with E = 1, with the corresponding odds ratio shown on the top axis. We observe that in this situation the 1 d.f. test of association suffers serious
Figure 1  Power at α-level of 0.0001 for 2 d.f. omnibus and 1 d.f. marginal tests for G when the effect of G exists only among subjects with E = 1. It is assumed that pr(G = 1) = 0.3, pr(E = 1) = 0.3, and the odds ratio of D associated with E marginalizing over G is 1.3. The top axis shows the disease odds ratio associated with G among subjects with E = 1, and the x-axis of the figure shows the corresponding odds ratio for the disease with G marginalizing over E.

Figure 2  Power at α-level of 0.0001 for 2 d.f. omnibus and 1 d.f. marginal tests for G when the odds ratio of D associated with G is the same for subjects with E = 0 and E = 1. It is assumed that pr(G = 1) = 0.3 and pr(E = 1) = 0.3.
loss of power, as the "marginal" odds ratio for D associated with G (shown on the x-axis) can be quite attenuated compared with the odds ratio for D associated with G among subjects with E = 1. In Figure 2, it is assumed that the odds ratio for D associated with G is the same for subjects with E = 0 and E = 1. In this situation, as one would expect, the 1 d.f. test has the highest power, but the omnibus test also performs well in the sense that the loss of power due to the extra d.f. is quite small. Overall, it can be observed that the omnibus test is quite robust in the sense that it either has the highest power itself or loses little power compared with the marginal test. A more detailed study of the power of the 2 d.f. omnibus test under alternative models for the gene-environment joint effect can be found in Kraft et al. (60).

The power of the omnibus test depends on the precision of both the main effect and interaction parameter estimates. Thus, any design and analytic strategies that increase the efficiency of estimation of the interaction parameters can also improve the power of omnibus tests. In particular, exploiting gene-environment or gene-gene independence can lead to dramatic gains in the power of omnibus tests, but caution is needed to protect against false-positive results when the assumptions are violated. Two-phase stratified sampling designs that enrich a study sample with rare but informative exposure or/and genotype values can also increase the power of omnibus tests.

The advantage of the omnibus over the marginal method for testing the association of a disease with an exposure diminishes in the presence of measurement error in the background risk factor by which the effect of the exposure of interest is allowed to be modified. Figure 3 shows how the power of the omnibus test for G accounting for G-E interaction can decrease as a function of the correlation (R2) between the observed E, assumed to be measured with error, and the true unobserved environmental exposure E*. We observe that if the measurement of the environmental exposure is poor, say an R2 with the true exposure of less than 0.5, then there may not be much benefit in accounting for G-E interaction for detection of genetic loci. The figure, however, illustrates the robustness of the omnibus test in that its power is never much lower than that of the marginal test even when E is measured very poorly.
Figure 3  Power at α-level of 0.0001 for 2 d.f. omnibus and 1 d.f. marginal tests as a function of the correlation (R2) between the measured and gold standard exposure. It is assumed that the effect of G exists only for subjects who are truly exposed (E = 1), with an odds ratio of 2.7, but the test is performed using a misclassified exposure E*.
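For reference, a minimal simulated sketch (ours; effect sizes and frequencies are hypothetical, loosely following the setup of Figure 1) of the 2 d.f. omnibus test described above:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 4000
g = rng.binomial(1, 0.3, n)
e = rng.binomial(1, 0.3, n)
p = 1.0 / (1.0 + np.exp(-(-2.0 + 0.7 * g * e)))  # G matters only when E = 1
d = rng.binomial(1, p)

# Squared Wald statistic for the D-G association within each stratum of E,
# summed into a 2 d.f. chi-square omnibus statistic.
chi2 = 0.0
for stratum in (0, 1):
    idx = e == stratum
    fit = sm.Logit(d[idx], sm.add_constant(g[idx])).fit(disp=0)
    chi2 += (fit.params[1] / fit.bse[1]) ** 2
print("omnibus 2 d.f. p-value:", stats.chi2.sf(chi2, df=2))
```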
A concern with the omnibus test is that its performance can become poor when the d.f. required for modeling interactions become large. In modern molecular epidemiologic studies, for example, the association between the disease and a genomic region, such as a candidate gene, is often investigated using a set of tagging SNPs. The number of parameters required in standard statistical models for gene-gene and gene-environment interactions can easily become very large in such settings. Chatterjee et al. (61) proposed the use of Tukey's "one degree-of-freedom" model for interaction to reduce the d.f. of omnibus tests. To illustrate the idea, suppose $G_1$ and $G_2$ are two candidate genes of interest for which $K_1$ and $K_2$ tagging SNPs have been genotyped. Let $S_1 = (S_{11}, S_{21}, \ldots, S_{K_1 1})$ and $S_2 = (S_{12}, S_{22}, \ldots, S_{K_2 2})$ denote the corresponding genotype data, recorded as 0, 1, or 2, counting the number of copies of the variant allele carried at the corresponding SNP. Chatterjee et al. considered specifying the risk of a binary disease outcome (D) using a model of the form

$$\mathrm{logit}\{\Pr(D = 1 \mid S_1, S_2)\} = \alpha + \sum_{k_1=1}^{K_1} \beta_{k_1 1} S_{k_1 1} + \sum_{k_2=1}^{K_2} \beta_{k_2 2} S_{k_2 2} + \theta \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2} \beta_{k_1 1}\, \beta_{k_2 2}\, S_{k_1 1} S_{k_2 2}, \tag{13}$$
which resembles a traditional logistic regression model where each SNP in each gene has a "main effect" and each pair of SNPs across the two genes has an "interaction" effect, but the interaction effects for different pairs of SNPs are not independent; they are related by the special functional form $\gamma_{k_1 k_2} = \theta\, \beta_{k_1 1}\, \beta_{k_2 2}$. Chatterjee et al. used a latent variable framework to show that this form of interaction is natural when the individual SNPs within a gene are associated with a disease through a common biologic mechanism; by contrast, many standard regression models are designed as if each SNP has unique functional significance. One simple but common example is the association of tagging SNPs with disease due to their linkage disequilibrium with the underlying causal variant(s). Models of the form (13) are very appealing for developing omnibus tests of association in the presence of interactions. In equation (13), for example, the null hypothesis of no association of the disease with a specific gene $G_i$ can be stated statistically as $H_0^i: \beta_i \equiv (\beta_{i1}, \beta_{i2}, \ldots, \beta_{iK_i}) = 0$, under which, we note, both the main effects of the SNPs in $G_i$ and their interactions with the SNPs in the other gene disappear. A technical complication, however, is that under the null hypothesis $\beta_1 = 0$, the parameter $\theta$ disappears from the model and hence is not estimable from the data. Thus, standard statistical tests, such as the score or likelihood-ratio tests that require estimation of all of the "nuisance parameters" under the null hypothesis, are not applicable. Nevertheless, Chatterjee et al. have shown that models of the form (13) can be used to construct simple score tests for genetic association that are implementable using standard regression software.

In practice, an association study may involve a variety of genetic and environmental exposures, each of which can potentially interact with the others. In such a setting, the association of a disease with a particular factor can be investigated using a max-omnibus test (61,63) that involves pairing the factor of interest with each of the other factors and then taking the maximum of the omnibus tests over all those different pairs. The null distribution of the test statistic can be computed using permutation-based resampling methods, which automatically adjust for multiple testing. If the omnibus test for one factor involves taking the maximum over a large number of other factors, then intuitively one would pay a price in terms of loss of power due to the adjustment for multiple testing. A number of reports, however, have indicated that in the presence of multiplicative interactions, omnibus tests can retain a significant gain in power over marginal methods even after proper adjustment for multiple testing (61-63).
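To fix ideas, here is a small sketch (ours) of the linear predictor in model (13); because the double sum factorizes, the entire interaction surface is controlled by the single parameter θ:

```python
import numpy as np

def tukey_logit(s1, s2, b1, b2, theta, alpha=0.0):
    """Linear predictor of model (13). The double sum over SNP pairs factorizes,
    so the interaction term is theta * (s1 . b1) * (s2 . b2): every pairwise
    interaction gamma_{k1,k2} = theta * b_{k1,1} * b_{k2,2} shares one theta."""
    s1, s2, b1, b2 = map(np.asarray, (s1, s2, b1, b2))
    return alpha + s1 @ b1 + s2 @ b2 + theta * (s1 @ b1) * (s2 @ b2)

# Genotypes coded 0/1/2 for K1 = 3 and K2 = 2 tagging SNPs (hypothetical values).
print(tukey_logit([1, 0, 2], [2, 1], b1=[0.2, 0.1, 0.3], b2=[0.25, 0.15], theta=0.4))
```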
HIGHER-ORDER INTERACTION AND DATA-MINING TOOLS

In this chapter, we have so far focused on studies of interactions involving pairs of exposures, but many of the inferential issues described above are also applicable to studies of third- or higher-order interactions. The problem of model selection poses an additional challenge for studies of higher-order interactions. When a large number of factors are studied together, searching for an optimal model in the very large space of all possible models for joint effects is a very complex task. In addition, even if an optimal model can be found, one cannot treat it as a fixed model for statistical inference because of the stochastic nature of the model selection. A large variety of data-mining methods are now available for model selection in the statistical and computer science literature. In the following paragraphs, we describe a few algorithms that have been specifically studied for exploring gene-gene and gene-environment interactions.

A traditional approach to model selection is stepwise regression, which uses statistical significance testing to add or drop higher- or lower-order interaction terms in standard parametric regression models. Millstein et al. (64) recently described such a stepwise forward selection algorithm for genetic association studies. The algorithm, known as the Focused Interaction Testing Framework, performs a series of marginal and omnibus tests for detection of disease susceptibility loci while properly controlling the false discovery rate (65) of the whole procedure.

A number of data-mining methods are available for exploring the space of joint effects in alternative ways. Classification and Regression Trees (CART) (66,67) uses a recursive partitioning algorithm that at each step splits the group of subjects in a node into two child nodes based on the exposure that yields the highest discrimination in disease risk, and then repeats the procedure for each of the child nodes. The procedure starts with the root node defined by the whole study sample and ends with a set of final nodes representing groups of subjects with homogeneous disease risk. If S1, S2, and S3 denote binary indicators for the presence (= 1) or absence (= 0) of the variant allele at three biallelic loci, for example, then a CART model can include a final node of the form {S1 = 0 and S2 = 1 and S3 = 1}, allowing subjects with the common allele at S1 but variant alleles at S2 and S3 to have homogeneous risk. The problem of overfitting is avoided by pruning or trimming the tree to an optimal size determined by cross-validation, so that the out-of-sample misclassification rate or prediction error is minimized. A weakness of CART is that the final model selected by this method can be highly sensitive to small perturbations of the data. Bagging stabilizes the output of CART by combining results from an ensemble of trees generated by repeated bootstrap sampling of the data (68). The Random Forest procedure minimizes correlation among the ensemble of trees by picking a random subset of the covariates for growing the tree in each bootstrap replication (69). An advantage of the ensemble approaches is that they can produce measures of variable importance, which can be used as an omnibus test statistic that captures information on both the main effect of a factor and its interactions with other factors. Permutation-based resampling methods can be used for generating p values associated with measures of variable importance.
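As one concrete instantiation (ours; it uses scikit-learn rather than the R package mentioned in the text, and the simulated data are hypothetical), a random forest with permutation-based variable importance on SNP data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n, n_snps = 1000, 10
snps = rng.binomial(2, 0.3, size=(n, n_snps))  # 0/1/2 genotype coding

# Disease risk depends on an interaction between the first two loci only.
p = 1.0 / (1.0 + np.exp(-(-1.5 + 0.9 * (snps[:, 0] > 0) * (snps[:, 1] > 0))))
y = rng.binomial(1, p)

forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                random_state=0).fit(snps, y)
imp = permutation_importance(forest, snps, y, n_repeats=20, random_state=0)
print(imp.importances_mean)  # loci 0 and 1 should stand out
```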
The Random Forest package available in R implements the above procedure. Logic regression (70) distinguishes itself from standard parametric regression models and CART by allowing predictors of the outcome to be defined by combinations of both "and" and "or" operations among the exposures. Thus, in the example involving three biallelic loci discussed above, a logic regression could include a construct of the form {S1 = 1 and (S3 = 1 or S2 = 1)}, allowing subjects with a variant allele at locus 1 and a variant allele at either locus 2 or locus 3 to have similar risk. The inclusion of the "or" operation is appealing because
it is biologically quite plausible that disruption of certain protein products that ultimately determine the risk of a disease requires only one mutation in a set of genetic loci, and that the risk associated with carrying multiple mutations in this class is no higher than that of carrying just one. Similar to CART, logic regression uses cross-validation to determine an optimal logic tree. Measures of variable importance can also be defined using a Markov chain Monte Carlo (MCMC) method that generates ensembles of logic trees (71). The accompanying LogicReg package is available through R-cran.

In contrast to tree-based methods that hierarchically build complex models, the multifactor dimensionality reduction (MDR) method (72) reduces the dimension of the joint effect associated with a set of genetic loci by pooling the multilocus genotype data into high-risk and low-risk groups and then evaluating the derived binary exposure variable in terms of its ability to predict the disease outcome, using cross-validation and permutation testing. If a large number of loci are involved, the MDR method attempts to identify a best multilocus model by screening through all possible two- to k-factor combinations, where the choice of k depends on computational feasibility. The method is appealing because of its parsimony, but its performance can vary substantially depending on how well a simple dichotomization of risk can fit the true joint effect of the underlying susceptibility loci (72-74). Information regarding the MDR software is available at http://phg.mc.vanderbilt.edu/Software/MDR, with an open source version accessible at http://www.epistasis.org/software.html.

The above data-mining methods, although promising for exploring complex high-order interactions, have the disadvantage that they cannot impose or exploit natural constraints on the model space. In studies of gene-gene interactions using SNP data, for example, it may be natural to assume that for any given locus, the effect of carrying two copies of a variant allele is always greater than that of carrying one copy, irrespective of the genotype status at the other loci. Parametric models, such as logistic regression, can easily impose such monotonicity constraints by assuming an additive or multiplicative effect for each copy of a variant allele within a locus. Although the biologic rationale for the underlying assumptions can be questioned, recent discoveries from a number of genomewide association studies have revealed that additive and multiplicative models often provide a good description of disease-genotype associations for individual susceptibility loci. Similarly, for studies of gene-environment interactions, it may be natural to assume some sort of "dose-response" relationship between disease and continuous environmental exposures. Nonparametric data-mining methods can potentially lose power by ignoring such constraints. The method called FlexTree is appealing in this regard, as it can impose parametric structure in binary tree-based regression models (75). A supporting R package can be requested at http://stat.stanford.edu/olshen/flexTree/.
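The core MDR pooling step is easy to sketch (ours; a full implementation would wrap this in the cross-validation and permutation-testing layers described above):

```python
import numpy as np

def mdr_labels(g1, g2, y):
    """MDR pooling step: label a two-locus genotype cell high-risk (1) when its
    case:control ratio exceeds the overall ratio, else low-risk (0)."""
    overall = y.mean() / (1.0 - y.mean())
    labels = {}
    for a in range(3):
        for b in range(3):
            cell = (g1 == a) & (g2 == b)
            cases, controls = y[cell].sum(), (1 - y[cell]).sum()
            labels[(a, b)] = int(cases / max(controls, 1) > overall)
    return labels

rng = np.random.default_rng(3)
g1, g2 = rng.binomial(2, 0.4, 800), rng.binomial(2, 0.4, 800)
p = 1.0 / (1.0 + np.exp(-(-1.0 + 0.8 * (g1 == 2) * (g2 == 2))))
y = rng.binomial(1, p)
print(mdr_labels(g1, g2, y))
```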
In summary, data-mining methods are potentially promising for exploring higher-order gene-gene and gene-environment interactions. Different methods have different strengths, and it is unlikely that a single method will perform uniformly well irrespective of the true state of nature. Thus, a robust strategy for data analysis would be to apply alternative methods with complementary strengths and follow up promising findings in replication studies.

DISCUSSION

A number of challenges remain for studies of interaction in the era of modern molecular epidemiologic studies. Exploring interactions in large-scale association studies remains a computationally daunting task. Most available statistical methods are not scalable for exploring gene-gene interactions in very large-scale association studies, such as
genomewide scans, which may involve hundreds to hundreds of thousands of SNPs. A two-stage method in which tests of interaction or joint effects are restricted to only those loci that show some evidence of a main effect is computationally practical and can have good power (62). A Bayesian statistical method, known as BEAM, has also recently been proposed for exploring interactions on a genomewide scale (76) (the software can be downloaded at http://www.people.fas.harvard.edu/junliu/BEAM/). It is expected that a number of other practical methods will evolve over the next few years. As data from genomewide association studies become increasingly available, it will be interesting to watch whether and how multilocus statistical tests are able to detect true disease susceptibility loci that may be missed by single-locus methods.

Measurement errors in environmental exposures pose a major challenge for studies of gene-environment interactions. As discussed in the section "Biases," measurement errors can seriously distort the joint effect of two exposures, limiting the power and interpretation of studies of interactions. The availability of very fine-scale genotyping data now enables researchers to capture the effect of underlying functional genetic variants with a fairly small amount of measurement error. In contrast, current measurements of environmental exposures, such as questionnaire-based evaluations of dietary and lifestyle exposures, are expected to be very inaccurate in terms of their ability to capture the underlying biologic doses of the exposures. Thus, in the future, finding good biomarkers for environmental exposures could be a key to success for studies of gene-environment interactions.

To understand individual variability in the risk for a complex disease that is associated with an environmental exposure, such as tobacco smoking and cancer, epidemiologists often study the genetic variants in the biochemical pathways possibly related to the exposure. The pathway information, however, is typically ignored during conventional association or interaction analysis of the data. Clearly, there is now a vast amount of information in various emerging "-omics" databases about the structure of various biochemical pathways. Hierarchical Bayesian methodologies (77) can potentially integrate such information into the analysis of pathway data, though quantifying prior information from various disparate sources remains a challenging task. As pathway-based research becomes increasingly important for molecular epidemiologic studies, we believe that there will be an increasing demand for statistical methods that can incorporate "prior" information into the analysis of the data in a robust way, so that misspecification of priors does not invalidate the inference.
REFERENCES

1. Yang Q, Khoury MJ. Evolving methods in genetic epidemiology. III. Gene-environment interaction in epidemiologic research. Epidemiol Rev 1997; 19:33-43.
2. Knudson AG Jr. Mutation and cancer: statistical study of retinoblastoma. Proc Natl Acad Sci U S A 1971; 68:820-823.
3. Thompson WD. Effect modification and the limits of biological inference from epidemiologic data. J Clin Epidemiol 1991; 42:221-232.
4. Iversen S, Arley N. On the mechanism of experimental carcinogenesis. Acta Pathol Microbiol Scand 1950; 27:773-803.
5. Rothman KJ, Greenland S. Modern Epidemiology, 2nd ed. Philadelphia: Lippincott Williams and Wilkins, 1998.
6. Khoury MJ, Beaty TH, Cohen BH. Fundamentals of Genetic Epidemiology. New York: Oxford University Press, 1993.
7. Ottman R. Gene-environment interaction: definitions and study designs. Prev Med 1996; 25:764-770.
8. Rothman N, Garcia-Closas M, Hein DW. Commentary: reflections on G. M. Lower and colleagues' 1979 study associating slow acetylator phenotype with urinary bladder cancer: meta-analysis, historical refinements of the hypothesis, and lessons learned. Int J Epidemiol 2007; 36:23-28.
9. Gail M, Simon R. Testing for qualitative interactions between treatment effects and patient subsets. Biometrics 1985; 41:361-372.
10. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. New York: Wiley, 1980.
11. Cox DR. Regression models and life tables (with discussion). J R Stat Soc B 1972; 34:187-220.
12. Breslow NE, Day NE. Statistical Methods in Cancer Research: The Design and Analysis of Cohort Studies. Lyon: IARC, 1987.
13. Cox DR, Oakes D. Analysis of Survival Data. New York: Chapman and Hall, 1984.
14. Lin DY, Ying Z. Semiparametric analysis of the additive risk model. Biometrika 1994; 81:61-71.
15. Aalen OO. A linear regression model for the analysis of life times. Stat Med 1989; 8:907-925.
16. Aalen OO. Further results on the non-parametric linear regression model in survival analysis. Stat Med 1993; 12:1569-1588.
17. Cornfield J. A method of estimating comparative rates from clinical data: applications to cancer of the lung, breast, and cervix. J Natl Cancer Inst 1951; 11:1269-1275.
18. Breslow NE, Day NE. Statistical Methods in Cancer Research: The Analysis of Case-Control Studies. Lyon: IARC, 1980.
19. Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika 1979; 66:403-411.
20. Piegorsch WW, Weinberg CR, Taylor JA. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Stat Med 1994; 13:153-162.
21. Umbach DM, Weinberg CR. Designing and analyzing case-control studies to exploit independence of genotype and exposure. Stat Med 1997; 16:1731-1743.
22. Chatterjee N, Carroll R. Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika 2005; 92:399-418.
23. Albert PS, Ratnasinghe D, Tangrea J, et al. Limitations of the case-only design for identifying gene-environment interaction. Am J Epidemiol 2001; 154:687-693.
24. Gatto NM, Campbell UB, Rundle AG, et al. Further development of the case-only design for assessing gene-environment interaction: evaluation of and adjustment for bias. Int J Epidemiol 2004; 33(5):1014-1024.
25. Mukherjee B, Chatterjee N. Exploiting gene-environment independence for analysis of case-control studies: an empirical Bayes-type shrinkage estimator to trade-off between bias and efficiency. (Epub 2007, doi:10.1111/j.1541-0420.2007.00953.x).
26. Liu X, Fallin MD, Kao WH. Genetic dissection methods: designs used for tests of gene-environment interaction. Curr Opin Genet Dev 2004; 14:241-245.
27. White JE. A two-stage design for the study of the relationship between a rare exposure and a rare disease. Am J Epidemiol 1982; 115:119-128.
28. Walker AW. Anamorphic analysis: sampling and estimation for covariate effects when both exposure and disease are known. Biometrics 1982; 38:1025-1032.
29. Andrieu N, Goldstein AM, Thomas DC, et al. Counter-matching in studies of gene-environment interaction: efficiency and feasibility. Am J Epidemiol 2001; 153(3):265-274.
30. Breslow NE, Chatterjee N. Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. Appl Stat 1999; 4:457-468.
31. Wacholder S, Weinberg CR. Flexible maximum likelihood methods for assessing joint effects in case-control studies with complex sampling. Biometrics 1994; 50:350-357.
32. Hanley JA, Csizmadi I, Collet JP. Two-stage case-control studies: precision of parameter estimates and considerations in selecting sample size. Am J Epidemiol 2005; 162:1225-1234.
33. McNamee R. Optimal design and efficiency of two-phase case-control studies with error-prone and error-free exposure measures. Biostatistics 2005; 6(4):590-603.
34. Fears TR, Brown CC. Logistic regression methods for retrospective case-control studies using complex sampling procedures. Biometrics 1986; 42(4):955-960.
35. Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika 1988; 75:11-20.
36. Langholz B, Borgan O. Counter-matching: a stratified nested case-control sampling method. Biometrika 1995; 82(1):69-79.
37. Scott AJ, Wild CJ. Maximum likelihood estimation for case-control data. Biometrika 1997; 84:57-71.
38. Breslow NE, Holubkov R. Maximum likelihood estimation of logistic regression parameters under two-phase outcome-dependent sampling. J R Stat Soc B 1997; 59:447-461.
39. Chatterjee N, Chen Y-H, Breslow NE. A pseudoscore estimator for regression problems with two-phase sampling. J Am Stat Assoc 2003; 98:158-168.
40. Chatterjee N, Chen Y-H. Maximum likelihood inference on a mixed conditionally and marginally specified regression model for genetic epidemiologic studies with two-phase sampling. J R Stat Soc B (Stat Methodol) 2007; 69(2):123-142.
41. Weinberg CR, Umbach DM. Choosing a retrospective design to assess joint genetic and environmental contributions to risk. Am J Epidemiol 2000; 152:197-203.
42. Witte JS, Gauderman WJ, Thomas DC. Asymptotic bias and efficiency in case-control studies of candidate genes and gene-environment interactions: basic family designs. Am J Epidemiol 1999; 149:693-705.
43. Gauderman W. Sample size requirements for matched case-control studies of gene-environment interaction. Stat Med 2002; 21:35-50.
44. Chatterjee N, Kalaylioglu Z, Carroll R. Exploiting gene-environment independence in family-based case-control studies: increased power for detecting associations, interactions and joint effects. Genet Epidemiol 2005; 28:138-156.
45. Schaid DJ. Case-parents design for gene-environment interaction. Genet Epidemiol 1999; 16:261-273.
46. Cordell HJ, Clayton DG. A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. Am J Hum Genet 2002; 70:124-141.
47. Cordell HJ, Barratt BJ, Clayton DG. Case/pseudocontrol analysis in genetic association studies: a unified framework for detection of genotype and haplotype associations, gene-gene and gene-environment interactions, and parent-of-origin effects. Genet Epidemiol 2004; 26:167-185.
48. Weinberg CR. Methods for detection of parent-of-origin effects in genetic studies of case-parents triads. Am J Hum Genet 1999; 65:229-235.
49. Rabinowitz D, Laird N. A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum Hered 2000; 50(4):211-223.
50. Lange C, DeMeo D, Silverman E, et al. Using the noninformative families in family-based association tests: a powerful new testing strategy. Am J Hum Genet 2003; 73:801-811.
51. Weinberg CR, Umbach DM. A hybrid design for studying genetic influences on risk of diseases with onset early in life. Am J Hum Genet 2005; 77:627-636.
52. Wang Y, Localio R, Rebbeck TR. Evaluating bias due to population stratification in epidemiologic studies of gene-gene or gene-environment interactions. Cancer Epidemiol Biomarkers Prev 2006; 15:124-132.
53. Carroll RJ, Ruppert D, Stefanski LA, et al. Measurement Error in Nonlinear Models: A Modern Perspective, 2nd ed. New York: Chapman and Hall, 2006.
54. Garcia-Closas M, Thompson WD, Robins JM. Differential misclassification and the assessment of gene-environment interactions in case-control studies. Am J Epidemiol 1998; 147:426-433.
Studies of Gene-Gene and Gene-Environment Interactions
167
55. Garcia-Closas M, Rothman N, Lubin J. Misclassification in case-control studies of geneenvironment interactions: assessment of bias and sample size. Cancer Epidemiol Biomarkers Prev 1999; 8:10431050. 56. Clayton D, McKeigue PM. Epidemiological methods for studying genes and environmental factors in complex diseases. Lancet 2001; 358:13561360. 57. Wacholder S, Silverman DT, McLaughlin JK, et al. Selection of controls in case-control studies. II. Types of controls. Am J Epidemiol 1992; 135(9):10291041. 58. Siegmund KD, Langholz B. Ascertainment bias in family-based case-control studies. Am J Epidemiol 2002; 155(9):875880. 59. Wacholder S, Chatterjee N, Hartge P. Joint effect of genes and environment distorted by selection biases: implications for hospital-based case-control studies. Cancer Epidemiol Biomarkers Prev 2002; 11(9):885889. 60. Kraft P, Yen YC, Stram DO, et al. Exploiting gene-environment interaction to detect genetic associations. Hum Hered 2007; 63(2):111119. 61. Chatterjee N, Kalaylioglu Z, Moslehi R, et al. Powerful multilocus tests of genetic association in the presence of gene-gene and gene-environment interactions. Am J Hum Genet 2006; 79:10021016. 62. Marchini J, Donnelly P, Cardon LR. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet 2005; 37:413417. 63. Chapman J, Clayton D. One degree of freedom for dominance in indirect association studies. Genet Epidemiol 2007; 31:261271. 64. Millstein J, Conti DV, Gilliland FD, et al. A testing framework for identifying susceptibility genes in the presence of epistasis. Am J Hum Genet 2006; 78(1):1527. 65. Storey JD. A direct approach to false discovery rates. J R Stat Soc B 2002; 64:479498. 66. Breiman L, Freidman JH, Olshen RA, et al. Classification and Regression Trees. Wadsworth, 1984. 67. Zhang HP, Bonney G. Use of classification trees for association studies. Genet Epidemiol 2000; 19:323332. 68. Breiman L. Bagging predictors. Machine Learning 1996; 24(2):123140. 69. Breiman L. Random Forests. Machine Learning 2001; 45:532. 70. Ruczinsiki I, Kooperberg C, LeBlanc ML. Logic regression. Journal of Computational and Graphical Statistics 2003; 12:475511. 71. Kooperberg C, Ruczinski I. Identifying interacting SNPs using Monte Carlo Logic Regression. Genet Epidemiol 2005; 28:157170. 72. Ritchie MD, Hahn LW, Roodi N, et al. Multifactor dimensionality reduction reveals highorder interactions among estrogen metabolism genes in sporadic breast cancer. Am J Hum Genet 2001; 69:138147. 73. Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics 2003; 19:376382. 74. Ritchie MD, Hahn LW, Moore JH. Power of multifactor dimensionality reduction for detecting gene-gene and gene-environment interactions. Genet Epidemiol 2003; 24:150157. 75. Huang J, Lin A, Narasimhan B, et al. Tree-structured supervised learning and the genetics of hypertension. Proc Natl Acad Sci U S A 2004; 101:1052910534. 76. Zhang Y, Liu JS. Bayesian inference of epistatic interactions in case-control studies. Nat Genet 2007; 39:11671173. 77. Conti DV, Cortessis V, Molitor J, et al. Bayesian modeling of complex metabolic pathways. Hum Hered 2003; 56(13):8393.
12

Novel Analytical Methods for Association Studies

Jason H. Moore
Departments of Genetics and Community and Family Medicine, Dartmouth Medical School, Lebanon, New Hampshire; Department of Computer Science, University of New Hampshire, Durham, New Hampshire; and Department of Computer Science, University of Vermont, Burlington, Vermont, U.S.A.

Margaret R. Karagas and Angeline S. Andrew
Community and Family Medicine, Dartmouth Medical School, Lebanon, New Hampshire, U.S.A.
INTRODUCTION

The initiation, progression, and severity of human cancer are complex processes that depend on many genes, many environmental factors, and chance events that are perhaps not measurable with current technology or are simply unknowable. Success in the design and execution of population-based association studies to identify the genetic and environmental factors that play an important role in cancer biology will depend on our ability to embrace, rather than ignore, complexity in the genotype-to-phenotype mapping relationship for any given human ecology. We review here several novel analytical strategies that assume complexity and thus complement traditional parametric statistical strategies, such as those based on logistic regression, that often make simplifying assumptions. The rapid advances in the speed and affordability of computing, along with the availability of powerful open-source software, have made these novel analytical strategies accessible to epidemiologists and geneticists. An important goal of human disease epidemiology is to understand the mapping relationship between interindividual variation in DNA sequences (i.e., the genome), variation in environmental exposure (i.e., ecology), and variation in disease susceptibility (i.e., the phenotype). Stated another way, how do one or more changes in an individual's DNA sequence increase or decrease the risk of developing cancer through complex networks of biomolecules that are hierarchically organized, highly interactive, and dependent on ecology? Understanding the role of genomic variation and ecological context in disease susceptibility is likely to improve diagnosis, prevention, and
treatment. Success in this important public health endeavor will depend critically on the degree of nonlinearity in the mapping from genotype to phenotype. That is, how complex is the transfer of information from the genome to the phenotype of interest? Nonlinearities can arise from phenomena such as locus heterogeneity (i.e., different DNA sequence variations leading to the same phenotype), phenocopy (i.e., environmentally determined phenotypes that do not have a genetic basis), and the dependence of genotypic effects on ecology (i.e., gene-environment interactions or plastic reaction norms) and on genotypes at other loci (i.e., gene-gene interactions or epistases). Each of these phenomena has been recently reviewed and discussed by Thornton-Wells et al. (1), who call for an analytical retooling to address these complexities. We direct the reader elsewhere for recent work on locus heterogeneity (1,2). We focus here on nonlinearities due to interactions between multiple genetic and environmental factors. We emphasize the important difference between biological interactions and statistical interactions and then discuss some novel analytical approaches for detecting and characterizing these patterns.

BIOLOGICAL INTERACTIONS

A major source of complexity in biology is the interaction between biomolecules in, for example, transcriptional networks, protein-protein interaction networks, and biochemical and metabolic systems. We review here the biological phenomena of gene-gene and gene-environment interactions. Gene-gene interaction or epistasis has been recognized for many years as deviation from the simple inheritance patterns observed by Mendel (3) or deviation from additivity in a linear statistical model (4), and is likely due, in part, to canalization or mechanisms of stabilizing selection that evolve robust (i.e., redundant) gene networks (5-8). Epistasis has been defined in multiple different ways (9-11). We have reviewed two types of epistasis, biological and statistical (12,13). Biological epistasis results from physical interactions between biomolecules (DNA, RNA, proteins, enzymes, etc.) and occurs at the cellular level in an individual. This type of epistasis is what Bateson (3) had in mind when he coined the term. Statistical epistasis, on the other hand, occurs at the population level and is realized when there is interindividual variation in DNA sequences. The statistical phenomenon of epistasis is what Fisher (4) had in mind. The relationship between biological and statistical epistasis is often confusing but will be important to understand if we are to make biological inferences from statistical results (12,13). The focus of this chapter is the detection, characterization, and interpretation of statistical patterns of interaction in human populations, since interaction or synergy among predictors in a data set is one of the primary sources of complexity. The role of the environment in biology also has a long history. The German researcher Woltereck (14) coined the term "reaction norm" to refer to the set of phenotypes that can be produced by a genotype in different environmental contexts. Reaction norms or gene-environment interactions were revived by Schmalhausen (15) and recently reviewed in books by Schlichting and Pigliucci (16) and Pigliucci (17). An excellent basic science example of gene-environment interaction can be found in a study of Escherichia coli by Remold and Lenski (18). In this study, 18 random insertion mutations were introduced in E. coli
on five different genetic backgrounds exposed to two different resource environments (glucose or maltose). The authors of the study found no examples of an environmental effect on fitness. However, 6 of the 18 mutations had an effect on fitness that was dependent on both genetic background and environmental
context, demonstrating a plastic reaction norm. These functional studies in model organisms document biological interactions and lay an important foundation for understanding the role of the environment in modulating genetic effects in humans. Understanding the nature of biomolecular interactions in model systems will play a very important role in helping us understand statistical patterns of interaction in human populations (13). Consider the study by Garcia-Closas et al. (19), which found statistical evidence of a gene-smoking interaction in bladder cancer in a human population-based study. To what extent does that statistical pattern reflect an underlying biological process? The importance of gene-environment interactions in cancer has been recently reviewed by Hunter (20).

STATISTICAL INTERACTIONS

As mentioned above, interactions between biomolecules and environmental agents occur at the cellular level in an individual. The focus of this chapter is detecting statistical patterns of interaction in human populations. As Moore (12) and Moore and Williams (13) have discussed, there is a significant disconnect between the biology that happens in an individual and a statistical summary of genotypic, environmental, and phenotypic variation in a population. To clarify this difference, consider the following simple example of statistical interaction (i.e., epistasis) in the form of a penetrance function. Penetrance is simply the probability (P) of disease (D) given a particular combination of inherited genotypes (G) [i.e., P(D|G)]. The model illustrated in Table 1 is an extreme example of epistasis between two single nucleotide polymorphisms (SNPs), A and B. Let us assume that genotypes AA, aa, BB, and bb have population frequencies of 0.25, while genotypes Aa and Bb have frequencies of 0.5 (values in parentheses in Table 1). What makes this model interesting is that disease risk is entirely dependent on the particular combination of genotypes inherited. Individuals have an elevated risk of disease if they inherit Aa or Bb but not both [i.e., the exclusive OR function]. The marginal penetrance of each individual genotype in this model is 0.05 and is computed by summing the products of the genotype frequencies and penetrance values. Thus, in this model there is no difference in disease risk across single genotypes, as specified by the single-genotype penetrance values (all 0.05). This model is labeled M170 by Li and Reich (21) in their categorization of genetic models involving two SNPs and is an example of a pattern that is not linearly separable. Heritability, or the size of the genetic effect, is a function of these penetrance values (22). The model specified in Table 1 has a heritability of 0.053, which represents a relatively small genetic effect size. This model is a special case in which all of the heritability is due to epistasis or nonlinear gene-gene interaction. How could cellular processes give rise to a pattern like this in a human population?
Table 1  Penetrance Values for Genotypes from Two SNPs

              AA (0.25)    Aa (0.50)    aa (0.25)
  BB (0.25)   0            0.1          0
  Bb (0.50)   0.1          0            0.1
  bb (0.25)   0            0.1          0

Abbreviation: SNP, single nucleotide polymorphism.
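The arithmetic behind these marginal penetrances is easy to verify directly. The following minimal Python sketch (variable names are ours, purely illustrative) recomputes the single-genotype penetrances and the population prevalence from the frequencies and values in Table 1:

    # Genotype frequencies and penetrance values taken from Table 1.
    freq_a = {"AA": 0.25, "Aa": 0.50, "aa": 0.25}
    freq_b = {"BB": 0.25, "Bb": 0.50, "bb": 0.25}
    pen = {("AA", "BB"): 0.0, ("Aa", "BB"): 0.1, ("aa", "BB"): 0.0,
           ("AA", "Bb"): 0.1, ("Aa", "Bb"): 0.0, ("aa", "Bb"): 0.1,
           ("AA", "bb"): 0.0, ("Aa", "bb"): 0.1, ("aa", "bb"): 0.0}

    # Marginal penetrance of each SNP A genotype: weighted sum over SNP B.
    for ga in freq_a:
        print(ga, sum(freq_b[gb] * pen[ga, gb] for gb in freq_b))  # 0.05 each

    # Population prevalence: weighted sum over all nine genotype cells.
    k = sum(freq_a[ga] * freq_b[gb] * pen[ga, gb]
            for ga in freq_a for gb in freq_b)
    print("prevalence:", k)  # 0.05

Every marginal penetrance equals the prevalence (0.05), so neither SNP shows any single-locus effect even though the joint model fully determines risk.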
INTERACTION ANALYSIS

As discussed above, one of the early definitions of epistasis was deviation from additivity in a linear model (4). The linear model plays a very important role in modern epidemiology because it has a solid theoretical foundation, is easy to implement in a wide range of software packages, and is easy to interpret. Despite these good reasons to use linear models, they have limitations for detecting nonlinear patterns of interaction (23). The first problem is that modeling interactions requires looking at combinations of variables. Considering multiple variables simultaneously is challenging because the available data get spread thinly across the many combinations of genotypes, for example. Estimation of parameters in a linear model can be problematic when the data are sparse. The second problem is that linear models are often implemented such that interaction effects are considered only after independent main effects have been identified. This certainly makes model fitting easier, but it assumes that the important predictors will have main effects. Further, it is well documented that linear models have greater power to detect main effects than interactions (24-26). For example, the focused interaction testing framework (FITF) approach of Millstein et al. (27) provides a powerful logistic regression approach to detecting interactions but conditions on main effects. Moore (28) argues that this is an unrealistic assumption for common human diseases. The limitations of the linear model and other parametric statistical approaches have motivated the development of computational approaches, such as those from machine learning and data mining (29), that make fewer assumptions about the functional form of the model and the effects being modeled. We review below a novel computational method called multifactor dimensionality reduction (MDR) that can be applied for detecting gene-gene and gene-environment interactions in cancer epidemiology studies. Since the focus of this review is on novel computational methods that embrace complexity, the reader is directed elsewhere for reviews of methods for detecting independent main effects. A recent series of seven reviews summarizes many of the basics of genetic and epidemiologic association studies in human populations, providing a starting point for those needing to learn more about basic analytical methods such as logistic regression (30-36). Several other recent reviews also cover the basic concepts (37).

MDR

MDR was developed as a nonparametric (i.e., no parameters are estimated) and genetic model-free (i.e., no genetic model is assumed) data-mining strategy for identifying combinations of discrete genetic and environmental factors that are predictive of a discrete clinical endpoint (38-44). Unlike most other methods, MDR was designed to detect interactions in the absence of detectable main effects and thus complements approaches such as logistic regression and random forests. At the heart of the MDR approach is a feature or attribute construction algorithm that creates a new variable or attribute by pooling, for example, genotypes from multiple SNPs. The general process of defining a new attribute as a function of two or more other attributes is referred to as constructive induction or attribute construction and was first described by Michalski (45). Constructive induction using the MDR kernel is accomplished in the following way.
Given a threshold T, a multilocus genotype combination is considered high-risk if the ratio of cases (subjects with disease) to controls (healthy subjects) exceeds T; otherwise it is considered low-risk. Genotype combinations considered high-risk are labeled G1, while those considered low-risk are labeled G0. This process constructs a new one-dimensional attribute with levels G0 and G1. It is this new single variable that is assessed using any classification method.
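The kernel just described is compact enough to sketch in a few lines of Python. This is our own illustration under simple assumptions (case-control status coded 1/0, genotypes coded 0/1/2); the function and argument names are hypothetical and are not the interface of the MDR software package:

    import numpy as np

    def mdr_construct(snp1, snp2, status, t=1.0):
        """Pool two SNPs into one binary attribute: a genotype combination is
        labeled high-risk (G1 = 1) if its case/control ratio exceeds t,
        otherwise low-risk (G0 = 0)."""
        new_attr = np.zeros(len(status), dtype=int)
        for g1 in np.unique(snp1):
            for g2 in np.unique(snp2):
                cell = (snp1 == g1) & (snp2 == g2)
                cases = np.sum(status[cell] == 1)
                controls = np.sum(status[cell] == 0)
                ratio = cases / controls if controls > 0 else float("inf")
                if ratio > t:
                    new_attr[cell] = 1  # high-risk combination -> G1
        return new_attr  # one-dimensional attribute with levels G0/G1

The constructed attribute can then be handed to any classifier, which is the point the text develops next.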
The MDR method is based on the idea that changing the representation space of the data will make it easier for methods such as logistic regression, classification trees, or a naive Bayes classifier to detect attribute dependencies. A tutorial on how to use MDR can be found in several November 2006 postings at compgen.blogspot.com. A user-friendly MDR software package written in Java is freely available from www.epistasis.org. Consider the simple example presented above and in Table 1. This penetrance function was used to simulate a data set with 200 cases (diseased subjects) and 200 controls (healthy subjects) for a total sample size of n = 400. The list of attributes included the two functional interacting SNPs (SNP1 and SNP2) in addition to three randomly generated SNPs (SNP3-SNP5). The SNPs each have three levels (0 = AA, 1 = Aa, 2 = aa), while the class (i.e., endpoint) has two levels (0 = control, 1 = case). Figure 1A illustrates the distribution of cases (left bars) and controls (right bars) for each of the three genotypes of SNP1 and SNP2. The dark-shaded cells have been labeled "high-risk" using a threshold of T = 1. The light-shaded cells have been labeled "low-risk." Note that when considered individually, the ratio of cases to controls is close to 1 for each single genotype. Figure 1B illustrates the distribution of cases and controls when the two functional SNPs are considered jointly. Note the larger ratios that are consistent with the genetic model in Table 1. Also illustrated in Figure 1B is the distribution of cases and controls for the new single attribute constructed using MDR. This new single attribute captures much of the information from the interaction and could be assessed using logistic regression, for example. The MDR method has been successfully applied for detecting gene-gene and gene-environment interactions for a variety of common human diseases and clinical endpoints including, for example, antiretroviral therapy (46), asthma (27,47), atrial fibrillation (43,48,49), autism (50), bladder cancer (51-53), cervical cancer (54), coronary calcification (55), coronary artery disease (56,57), diabetic nephropathy (58), drug metabolism (59), essential hypertension (60), familial amyloid polyneuropathy (61), multiple sclerosis (62,63), myocardial infarction (64,65), osteoporosis (66), preterm birth (67), prostate cancer (68), schizophrenia (69,70), sporadic breast cancer (38,71,72), and type 2 diabetes (73). The MDR method has also been proposed for pharmacogenetics and toxicogenetics (74).

STATISTICAL INTERPRETATION OF INTERACTION MODELS

MDR is a powerful method for detecting gene-gene and gene-environment interactions in epidemiologic studies of cancer. The models that it produces are by nature multidimensional and thus difficult to interpret. For example, an interaction model with four SNPs, each with three genotypes, summarizes 81 (i.e., 3⁴) different genotype (i.e., level) combinations. How does each of these level combinations relate back to biological processes in a cell? Why are some combinations associated with high risk of disease and some associated with low risk? Moore et al. (43) have proposed using information theoretic approaches with graph-based models to provide both a statistical and a visual interpretation of models from MDR and other novel methods such as symbolic discriminant analysis (75).
Statistical interpretation should facilitate biological interpretation because it provides a deeper understanding of the relationship between the attributes and the class variable. We describe next the concept of interaction information and how it can be used to facilitate statistical interpretation.
Figure 1  (A) Distribution of cases (left bars) and controls (right bars) across three genotypes (0, 1, 2) for two simulated interacting SNPs. Note that the ratios of cases to controls for these two SNPs are nearly identical. The dark-shaded cells signify "high-risk" genotypes. (B) Distribution of cases and controls across nine two-locus genotype combinations. Note that the two SNPs jointly reveal larger case-control ratios. Also illustrated is the use of the MDR attribute construction function that produces a single attribute (SNP1_SNP2) from the two SNPs. (C) An interaction dendrogram summarizing the information gain associated with constructing pairs of attributes using MDR. The length of the connection between two SNPs is inversely related to the strength of the information gain. Red lines indicate a positive information gain that can be interpreted as synergistic interaction. Brown lines indicate no information gain. Abbreviations: SNP, single nucleotide polymorphism; MDR, multifactor dimensionality reduction.
Jakulin and Bratko (76) have provided a metric for determining the gain in information about a class variable (e.g., case-control status) from merging two attributes into one (i.e., attribute construction) over that provided by the attributes independently. This measure of information gain allows us to gauge the benefit of considering two (or more) attributes as one unit. While the concept of information gain is not new (77), its application to the study of attribute interactions has been the focus of several recent studies (76). Consider two attributes, A and B, and a class label C. Let H(X) be the Shannon entropy (78) of X. The information gain (IG) of A, B, and C can be written as (i) and defined in terms of Shannon entropy (ii and iii):

    IG(A;B;C) = I(A;B|C) − I(A;B)                        (i)
    I(A;B|C) = H(A|C) + H(B|C) − H(A,B|C)                (ii)
    I(A;B) = H(A) + H(B) − H(A,B)                        (iii)
The first term in (i), I(A;B|C), measures the interaction of A and B conditional on C. The second term, I(A;B), measures the dependency or correlation between A and B. If the difference is positive, then there is evidence for an attribute interaction that cannot be linearly decomposed. If the difference is negative, then the information between A and B is redundant. If the difference is zero, then there is evidence of conditional independence or a mixture of synergy and redundancy. These measures of interaction information can be used to construct interaction graphs (i.e., network diagrams) and interaction dendrograms using the entropy estimates from (i), with the algorithms described first by Jakulin and Bratko (76) and more recently in the context of genetic analysis by Moore et al. (43). Interaction graphs are composed of a node for each attribute with pairwise connections between them. The percentage of entropy removed (i.e., information gain) by each attribute is visualized for each node. The percentage of entropy removed for each pairwise MDR product of attributes is visualized for each connection. Thus, the independent main effects of each polymorphism can be quickly compared to the interaction effect. Additive and nonadditive interactions can be quickly assessed and used to interpret the MDR model, which consists of distributions of cases and controls for each genotype combination. Positive entropy values indicate synergistic interaction, while negative entropy values indicate redundancy. Interaction dendrograms are also a useful way to visualize interactions (43,76). Here, hierarchical clustering is used to build a dendrogram that places strongly interacting attributes close together at the leaves of the tree. Jakulin and Bratko (76) define the following dissimilarity measure D (iv), which is used by a hierarchical clustering algorithm to build the dendrogram; the value 1000 serves as an upper bound to scale the dendrograms:

    D(A,B) = |I(A;B;C)|⁻¹ if |I(A;B;C)|⁻¹ < 1000, 1000 otherwise    (iv)
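These quantities are straightforward to estimate from observed counts for discrete attributes. The sketch below is our own illustrative Python code (not the implementation in the MDR package) applying equations (i)-(iv):

    import numpy as np
    from collections import Counter

    def entropy(*cols):
        """Shannon entropy (in bits) of the joint distribution of the columns."""
        counts = Counter(zip(*cols))
        p = np.array(list(counts.values())) / sum(counts.values())
        return -np.sum(p * np.log2(p))

    def interaction_information(a, b, c):
        """IG(A;B;C) = I(A;B|C) - I(A;B), per equations (i)-(iii);
        positive values indicate synergy, negative values redundancy."""
        i_ab = entropy(a) + entropy(b) - entropy(a, b)               # (iii)
        # H(X|C) = H(X,C) - H(C), so I(A;B|C) expands to:
        i_ab_c = (entropy(a, c) + entropy(b, c)
                  - entropy(a, b, c) - entropy(c))                   # (ii)
        return i_ab_c - i_ab                                         # (i)

    def dissimilarity(a, b, c):
        """Equation (iv): inverse absolute interaction information, capped at 1000."""
        ig = abs(interaction_information(a, b, c))
        return min(1.0 / ig, 1000.0) if ig > 0 else 1000.0

A dissimilarity matrix built from the last function can be passed to any standard hierarchical clustering routine to draw the dendrogram.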
Using this measure, a dissimilarity matrix can be estimated and used with hierarchical cluster analysis to build an interaction dendrogram. This facilitates rapid identification and interpretation of pairs of interactions. The algorithms for the entropy-based measures of information gain are implemented in the open-source MDR software package available from www.epistasis.org. Output in the form of interaction dendrograms is provided. Figure 1C illustrates an interaction dendrogram for the simple simulated data
set described above. Note the strong synergistic relationship between SNP1 and SNP2. All other SNPs are independent, which is consistent with the simulation model.
EXAMPLE APPLICATION TO CANCER SUSCEPTIBILITY

Consider the following case study. Andrew et al. (51) carried out an epidemiologic study to identify genetic and environmental predictors of bladder cancer susceptibility in a large sample of Caucasians (n = 914) from New Hampshire, U.S.A. This study focused specifically on genes that play an important role in the repair of DNA sequences that have been damaged by chemical compounds (e.g., carcinogens). Seven SNPs were measured, including two from the X-ray repair cross-complementing group 1 gene (XRCC1), one from the XRCC3 gene, two from the xeroderma pigmentosum group D (XPD) gene, one from the nucleotide excision repair gene XPC, and one from the AP endonuclease 1 gene (APE1). Each of these genes plays an important role in DNA repair. Smoking is a known risk factor for bladder cancer and was included in the analysis along with gender and age, for a total of 10 attributes. Age was dichotomized as less than 50 years versus 50 years or more. A parametric statistical analysis of each attribute individually revealed a significant independent main effect of smoking, as expected. However, none of the measured SNPs were significant predictors of bladder cancer individually. Andrew et al. (51) used MDR to exhaustively evaluate all possible two-, three-, and four-way interactions between the genetic and environmental variables. For each combination of attributes, a single constructed attribute was evaluated using a naive Bayes classifier. Training and testing accuracy were estimated using 10-fold cross-validation. The best model was selected as the one that maximized testing accuracy. The best model had a testing accuracy of 0.66 and included two SNPs from the XPD gene and smoking. The distribution of cases and controls for each genotype/smoking combination is illustrated in Figure 2A, along with the single variable constructed by MDR. The empirical p value of this model was less than 0.001, indicating that a testing accuracy of 0.66 or greater is unlikely under the null hypothesis of no association, as assessed using a 1000-fold permutation test. Decomposition of this model using the measures of information gain described above demonstrated that the effects of the two XPD SNPs were nonadditive or synergistic, suggestive of nonlinear interaction. The same analysis indicated that the effect of smoking was mostly independent of the gene-gene interaction effect. Figure 2B illustrates an interaction dendrogram summarizing these measures of information gain. Variables connected by short lines have stronger interactions than variables connected by longer lines. The color of the line indicates the type of interaction (i.e., synergistic or redundant). Note that the two XPD polymorphisms are connected by a short red line, indicating a strong synergistic interaction. In other words, combining these two polymorphisms into a single variable using MDR provides more information about case-control status than considering them additively. The short green line connecting pack-years of smoking to gender suggests that these two variables are correlated or redundant. In other words, information about case-control status is lost when these two variables are put together. The long yellow line between smoking and the two XPD polymorphisms indicates independence. It is important to note that parametric logistic regression was unable to model this three-attribute interaction because of lack of convergence. Huang et al.
(53) also found nonlinear interactions in bladder cancer using MDR.
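The permutation test used in this example is simple to reproduce in outline. In the sketch below, fit_and_score is a stand-in (hypothetical name) for the entire MDR procedure, returning the cross-validated testing accuracy of the best model; everything else is generic Python:

    import numpy as np

    def permutation_p_value(observed_accuracy, fit_and_score, x, y,
                            n_perm=1000, seed=0):
        """Refit the whole analysis on label-shuffled data n_perm times and
        report how often the null testing accuracy reaches the observed one."""
        rng = np.random.default_rng(seed)
        null = np.array([fit_and_score(x, rng.permutation(y))
                         for _ in range(n_perm)])
        # Add-one correction avoids reporting an exact zero p value.
        return (1 + np.sum(null >= observed_accuracy)) / (n_perm + 1)

Crucially, the full model-selection procedure, including cross-validation, is repeated inside every permutation, so the p value reflects the entire search and not just the final model.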
Figure 2 (A) Distribution of cases (left bars) and controls (right bars) for each genotype-smoking combination in the bladder cancer example. (B) An interaction dendrogram summarizing the information gain associated with constructing pairs of attributes using MDR. Note that the XPD polymorphisms are connected by a short red line, indicating a strong synergistic interaction. Pack years of smoking is connected to the XPD polymorphisms by a brown line suggesting the effects are independent or additive. It is also interesting to note that gender is connected to smoking with a short green line indicating these two attributes are correlated or redundant. Abbreviation: MDR, multifactor dimensionality reduction.
This study, and the study by Huang et al. (53), illustrates the power of MDR to identify complex relationships among genes, environmental factors such as smoking, and susceptibility to a common disease such as bladder cancer. The MDR approach works well in the context of a candidate gene study, but how does it scale to genomewide analysis of thousands of attributes?

MODELING INTERACTIONS IN GENOMEWIDE ASSOCIATION STUDIES

Epidemiology is undergoing an information explosion and an understanding implosion. That is, our ability to generate data is far outpacing our ability to interpret it. This is especially true in genetics and epidemiology, where it is now technically and economically feasible to measure thousands of SNPs from across the human genome. At least one SNP is anticipated to occur approximately every 100 nucleotides across the 3 × 10⁹-nucleotide human genome. An important goal in human genetics is to determine which of the many thousands of SNPs are useful for predicting who is at risk for common diseases. This "genomewide" approach is expected to revolutionize the genetic analysis of common human diseases (79,80) and, for better or worse, is quickly replacing the traditional "candidate gene" approach that focuses on several genes selected for their known or suspected function. Moore and Ritchie (81) have outlined three significant challenges that must be overcome if we are to successfully identify genetic predictors of health and disease using a genomewide approach. First, powerful data-mining and machine-learning methods will need to be developed to statistically model the relationship between combinations of DNA sequence variations and disease susceptibility. Traditional methods such as logistic regression have limited power for modeling high-order nonlinear interactions (23). The MDR approach was discussed above as an alternative to logistic regression. A second challenge is the selection of genetic features or attributes that should be included for analysis. If interactions between genes explain most of the heritability of common diseases, then combinations of DNA sequence variations will need to be evaluated from a list of thousands of candidates. Filter and wrapper methods (described below) will play an important role because there are more combinations than can be exhaustively evaluated. A third challenge is the interpretation of gene-gene interaction models. Although a statistical model can be used to identify DNA sequence variations that confer risk for disease, this approach cannot be translated into specific prevention and treatment strategies without interpreting the results in the context of human biology. Making etiological inferences from computational models may be the most important and the most difficult challenge of all (13). Combining the concept of interaction described above with the challenge of variable selection yields what Goldberg (82) calls a needle-in-a-haystack problem. That is, there may be a particular combination of SNPs, or of SNPs and environmental factors, that together with the right nonlinear function is a significant predictor of disease susceptibility, even though individually those attributes may look no different from thousands of other SNPs that are not involved in the disease process and are thus noisy. Under such models, the learning algorithm is truly looking for a genetic needle in a genomic haystack.
A recent report from the International HapMap Consortium (83) suggests that approximately 300,000 carefully selected SNPs may be necessary to capture all of the relevant variation across the Caucasian human genome. Assuming this is true (it is probably a lower bound), we would need to scan 4.5 × 10¹⁰ pairwise combinations of SNPs
to find a genetic needle. The number of higher-order combinations is astronomical. What is the optimal approach to this problem? There are two general approaches to selecting attributes for predictive models. The filter approach preprocesses the data by algorithmically or statistically assessing the quality or relevance of each variable and then using that information to select a subset for analysis. The wrapper approach iteratively selects subsets of attributes for classification using either a deterministic or stochastic algorithm. The key difference between the two approaches is that the analysis method (e.g., MDR) plays no role in selecting which attributes to consider in the filter approach. As Freitas (84) reviews, the advantage of the filter approach is speed while the wrapper approach has the potential to do a better job classifying subjects as sick or healthy. We discuss each of these general approaches in turn for the specific problem of detecting epistases or gene-gene interactions on a genomewide scale.
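Before turning to the two strategies, the scale of the search space is worth quantifying. A quick arithmetic check (ours, not from the source) confirms the numbers above:

    from math import comb

    n_snps = 300_000
    print(comb(n_snps, 2))  # 44,999,850,000: about 4.5 x 10^10 pairs
    print(comb(n_snps, 3))  # about 4.5 x 10^15 three-way combinations

At a million model evaluations per second, the pairwise scan alone would take roughly half a day, while the three-way scan would take more than a century, which is why filters and wrappers are needed.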
A FILTER STRATEGY FOR GENOMEWIDE ASSOCIATION ANALYSIS

There are many different statistical and computational methods for determining the quality of attributes. A standard strategy in human genetics and epidemiology is to assess the quality of each SNP using a chi-square test of independence, followed by a correction of the significance level that takes into account the increased false-positive (i.e., type I error) rate due to multiple tests. This is a very efficient filtering method, but it ignores the dependencies or interactions among genes. Kira and Rendell (85) developed an algorithm called Relief that is capable of detecting attribute dependencies. Relief estimates the quality of attributes through a type of nearest-neighbor algorithm that selects neighbors (instances) from the same class and from the different class on the basis of the vector of values across attributes. Weights (W) or quality estimates for each attribute (A) are estimated on the basis of whether the nearest neighbor (nearest hit, H) of a randomly selected instance (R) from the same class and the nearest neighbor from the other class (nearest miss, M) have the same or different values. This process of adjusting weights is repeated for m instances. The algorithm produces weights for each attribute ranging from -1 (worst) to +1 (best). The Relief pseudocode is outlined below:

    set all weights W[A] = 0
    for i = 1 to m do begin
        randomly select an instance Ri
        find nearest hit H and nearest miss M
        for A = 1 to a do
            W[A] = W[A] - diff(A, Ri, H)/m + diff(A, Ri, M)/m
    end

The function diff(A, I1, I2) calculates the difference between the values of attribute A for two instances I1 and I2. For nominal attributes such as SNPs, it is defined as:

    diff(A, I1, I2) = 0 if genotype(A, I1) = genotype(A, I2), 1 otherwise
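A compact, runnable version of this pseudocode for SNP data might look as follows. This is a sketch under our own conventions (0/1/2 genotype coding, Hamming distance for the neighbor search), not the production implementation distributed with the MDR package:

    import numpy as np

    def relief(x, y, m=100, seed=0):
        """Relief for nominal attributes: x is an (n, a) genotype matrix,
        y a binary class vector; returns a weight in [-1, +1] per attribute."""
        rng = np.random.default_rng(seed)
        n, a = x.shape
        w = np.zeros(a)
        for _ in range(m):
            i = rng.integers(n)                # randomly selected instance Ri
            dist = (x != x[i]).sum(axis=1)     # Hamming distance to all instances
            dist[i] = a + 1                    # never pick Ri as its own neighbor
            same = np.flatnonzero(y == y[i])
            other = np.flatnonzero(y != y[i])
            hit = same[np.argmin(dist[same])]      # nearest hit H
            miss = other[np.argmin(dist[other])]   # nearest miss M
            w += ((x[i] != x[miss]).astype(float)
                  - (x[i] != x[hit]).astype(float)) / m
        return w

Attributes whose values tend to differ between an instance and its nearest miss, but not its nearest hit, accumulate positive weight; this is how Relief rewards attributes that separate the classes, including attributes that matter only in combination.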
The time complexity of Relief is O(m × n × a), where m is the number of instances randomly sampled from a data set with n total instances and a attributes. Kononenko (86) improved upon Relief by choosing n nearest neighbors instead of just one. This new ReliefF algorithm has been shown to be more robust to noisy attributes (86-88) and is widely used in data-mining applications. ReliefF is able to capture attribute interactions because it selects nearest neighbors using the entire vector of values across all attributes. However, this advantage is also a disadvantage, because the presence of many noisy attributes can reduce the signal the algorithm is trying to capture. Moore and White (89) proposed a "tuned" ReliefF algorithm (TuRF) that systematically removes attributes with low quality estimates so that the ReliefF values of the remaining attributes can be reestimated. The pseudocode for TuRF is outlined below:

    let a be the number of attributes
    for i = 1 to n do begin
        estimate ReliefF
        sort attributes
        remove worst n/a attributes
    end
    return last ReliefF estimate for each attribute

The motivation behind this algorithm is that the ReliefF estimates of the true functional attributes will improve as the noisy attributes are removed from the data set. Moore and White (89) carried out a simulation study to evaluate the power of ReliefF, TuRF, and a naive chi-square test of independence for selecting functional attributes in a filtered subset. Five genetic models in the form of penetrance functions (cf. Table 1) were generated. Each model consisted of two SNPs that define a nonlinear relationship with disease susceptibility. The heritability of each model was 0.1, which reflects a moderate-to-small genetic effect size. Each of the five models was used to generate 100 replicate data sets with sample sizes of 200, 400, 800, 1600, 3200, and 6400, a spectrum consistent with small- to medium-sized genetic studies. Each data set consisted of an equal number of case (disease) and control (no disease) subjects. Each pair of functional SNPs was embedded within a genomewide set of 998 randomly generated SNPs for a total of 1000 attributes. A total of 600 data sets were generated and analyzed. ReliefF, TuRF, and the univariate chi-square test of independence were applied to each of the data sets. The 1000 SNPs were sorted according to their quality using each method, and the top 50, 100, 150, 200, 250, 300, 350, 400, 450, and 500 SNPs out of 1000 were selected. From each subset we counted the number of times the two functional SNPs were selected out of each set of 100 replicates. This proportion estimates the power, or how likely we are to find the true SNPs if they exist in the data set. The number of times each method found the correct two SNPs was statistically compared; a difference in counts (i.e., power) was considered statistically significant at a type I error rate of 0.05. Moore and White (89) found that the power of ReliefF to pick (filter) the correct two functional attributes was consistently better (p ≤ 0.05) than that of a naive chi-square test of independence across subset sizes and models when the sample size was 800 or larger. These results suggest that ReliefF is capable of identifying interacting SNPs with a moderate genetic effect size (heritability = 0.1) in moderate sample sizes. Next, Moore and White (89) compared the power of TuRF with the power of ReliefF. They found that
the TuRF algorithm was consistently better (p ≤ 0.05) than ReliefF across small SNP subset sizes (50, 100, and 150) and across all five models when the sample size was 1600 or larger. These results suggest that algorithms based on ReliefF show promise for filtering interacting attributes in this domain, and the recently developed Evaporative Cooling ReliefF algorithm supports this (90). The disadvantage of the filter approach is that important attributes might be discarded prior to analysis. Stochastic search or wrapper methods provide a flexible alternative. The ReliefF algorithm is included in the open-source MDR software package available from www.epistasis.org.

A WRAPPER STRATEGY FOR GENOMEWIDE ASSOCIATION ANALYSIS

Stochastic search or wrapper methods may be more powerful than filter approaches because no attributes are discarded in the process. As a result, every attribute retains some probability of being selected for evaluation by the classifier. There are many different stochastic wrapper algorithms that can be applied to this problem. Moore and White (91) have explored the use of genetic programming (GP). GP is an automated computational discovery tool that is inspired by Darwinian evolution and natural selection (92-98). The goal of GP is to evolve computer programs that solve specific problems. This is accomplished by first generating random computer programs composed of the basic building blocks needed to solve or approximate a solution to the problem. Each randomly generated program is evaluated, and the good programs are selected and recombined to form new computer programs. This process of selection based on fitness and recombination to generate variability is repeated until a best program or set of programs is identified. GP and its many variations have been applied successfully in a wide range of problem domains including data mining and knowledge discovery (84), electrical engineering (95), and bioinformatics (99). Moore and White (91) developed and evaluated a simple GP wrapper for attribute selection in the context of an MDR analysis. The goal of this study was to develop a stochastic wrapper method that is able to select attributes that interact in the absence of independent main effects. At face value, there is no reason to expect that GP or any other wrapper method would perform better than a random attribute selector, because there are no "building blocks" for this problem when accuracy is used as the fitness measure. That is, the fitness of any given classifier would look no better than any other with just one of the correct SNPs in the MDR model. Preliminary studies by White et al. (100) support this idea. For GP or any other wrapper to work, there need to be recognizable building blocks. Moore and White (91) specifically evaluated whether including preprocessed attribute quality estimates from TuRF (see above) in a multiobjective fitness function improved attribute selection over a random search or over using accuracy alone as the fitness. Using a wide variety of simulated data, Moore and White (91) demonstrated that including TuRF scores in addition to accuracy in the fitness function significantly improved the power of GP to pick the correct two functional SNPs out of 1000 total attributes. Subsequent studies showed that using TuRF scores to select trees for recombination and mutation performed significantly better than using TuRF in a multiobjective fitness function (101,102).
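The flavor of such a knowledge-guided stochastic wrapper can be conveyed in a few lines. The sketch below is a deliberately simplified stand-in for GP (mutation-only hill climbing over SNP subsets, with preprocessed quality scores such as TuRF weights biasing which attributes enter a candidate model); all names are illustrative and this is not Moore and White's implementation:

    import numpy as np

    def stochastic_wrapper(x, y, quality, score_fn, subset_size=2,
                           n_iter=500, seed=0):
        """Mutate one attribute of the current best subset, drawing replacements
        in proportion to softmax(quality), and keep the mutant if score_fn
        (e.g., cross-validated accuracy of an MDR/naive Bayes model on those
        columns) improves."""
        rng = np.random.default_rng(seed)
        a = x.shape[1]
        probs = np.exp(quality - quality.max())
        probs /= probs.sum()                   # quality-biased sampling weights
        best = rng.choice(a, size=subset_size, replace=False)
        best_score = score_fn(x[:, best], y)
        for _ in range(n_iter):
            cand = best.copy()
            cand[rng.integers(subset_size)] = rng.choice(a, p=probs)
            if len(set(cand)) < subset_size:
                continue                       # skip subsets with duplicates
            s = score_fn(x[:, cand], y)
            if s > best_score:
                best, best_score = cand.copy(), s
        return best, best_score

Without the quality bias, this search degenerates to the random attribute selector discussed above; the expert-knowledge term is what supplies usable building blocks. The GP system it gestures at additionally evolves whole trees of attribute-construction operations.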
This study presents preliminary evidence suggesting that GP might be useful for the genomewide genetic analysis of common human diseases that have a complex genetic architecture. The results raise numerous questions. How well does GP do when faced with finding three, four, or more SNPs that interact in a nonlinear manner to predict disease
susceptibility? How does extending the function set to additional attribute construction functions impact performance? How does extending the attribute set impact performance? Is using GP better than filter approaches such as ReliefF? To what extent can GP theory help formulate an optimal GP approach to this problem? Does GP outperform other evolutionary or nonevolutionary search methods? Does the computational expense of a stochastic wrapper like GP outweigh the potential for increased power? These studies provide a starting point for addressing some of these questions.

SUMMARY

The initiation, progression, and severity of human cancer are complex processes that depend on many genes, many environmental factors, and other chance events that are perhaps not measurable with current technology or simply unknowable. Success in the design and execution of population-based association studies to identify those genetic and environmental factors that play an important role in cancer biology will depend on our ability to embrace, rather than ignore, complexity in the genotype-to-phenotype mapping relationship for any given human ecology. We have reviewed several phenomena, including gene-gene interactions (i.e., epistases) and gene-environment interactions (i.e., plastic reaction norms), that are an important part of the genetic architecture of common human diseases and that partly explain the nonlinear mapping relationship between genotype and phenotype. The success of any cancer epidemiology study will depend on whether an analytical strategy is employed that embraces, rather than ignores, these complexities. We have introduced MDR as an example of a novel computational method that can be used to complement (not replace) traditional parametric methods such as logistic regression. MDR was designed specifically for the purpose of detecting, characterizing, and interpreting nonlinear interactions among multiple factors in the absence of detectable main effects. MDR is available in an easy-to-use and freely available software package that comes with tools for visualizing and interpreting interactions. We have also reviewed a filter method using ReliefF and a stochastic wrapper method using GP for the analysis of interactions on a genomewide scale with thousands of attributes. These data-mining and knowledge-discovery methods, among others, will play an increasingly important role in cancer epidemiology as the field moves away from the candidate-gene approach, which focuses on a few targeted genes, toward the genomewide approach, which measures DNA sequence variations from across the genome.

ACKNOWLEDGMENTS

This work was supported by National Institutes of Health (U.S.A.) grants LM009012, AI59694, HD047447, RR018787, and HL65234.

REFERENCES

1. Thornton-Wells TA, Moore JH, Haines JL. Genetics, statistics and human disease: analytical retooling for complexity. Trends Genet 2004; 20:640–647.
2. Thornton-Wells TA, Moore JH, Haines JL. Dissecting trait heterogeneity: a comparison of three clustering methods applied to genotypic data. BMC Bioinformatics 2006; 7:204.
3. Bateson W. Mendel's Principles of Heredity. Cambridge, UK: Cambridge University Press, 1909.
4. Fisher RA. The correlations between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh 1918; 52:399–433.
5. Waddington CH. Canalization of development and the inheritance of acquired characters. Nature 1942; 150:563–565.
6. Waddington CH. The Strategy of the Genes. New York, New York: MacMillan, 1957.
7. Gibson G, Wagner G. Canalization in evolutionary genetics: a stabilizing theory? BioEssays 2000; 22:372–380.
8. Proulx SR, Phillips PC. The opportunity for canalization and the evolution of genetic networks. Am Nat 2005; 165:147–162.
9. Hollander WF. Epistasis and hypostasis. J Hered 1955; 46:222–225.
10. Phillips PC. The language of gene interaction. Genetics 1998; 149:1167–1171.
11. Brodie ED III. Why evolutionary genetics does not always add up. In: Wolf J, Brodie B III, Wade M, eds. Epistasis and the Evolutionary Process. New York, New York: Oxford University Press, 2000:3–19.
12. Moore JH. A global view of epistasis. Nat Genet 2005; 37:13–14.
13. Moore JH, Williams SW. Traversing the conceptual divide between biological and statistical epistasis: systems biology and a more modern synthesis. BioEssays 2005; 27:637–646.
14. Woltereck R. Weitere experimentelle Untersuchungen über Artveränderung, speziell über das Wesen quantitativer Artunterschiede bei Daphnien. Verhandlungen der Deutschen Zoologischen Gesellschaft 1909; 19:110–173.
15. Schmalhausen II. Factors of Evolution: The Theory of Stabilizing Selection. Chicago, Illinois: University of Chicago Press, 1949.
16. Schlichting CD, Pigliucci M. Phenotypic Evolution: A Reaction Norm Perspective. Sunderland, Massachusetts: Sinauer, 1998.
17. Pigliucci M. Phenotypic Plasticity: Beyond Nature and Nurture. Baltimore, Maryland: Johns Hopkins Press, 2001.
18. Remold SK, Lenski RE. Pervasive joint influence of epistasis and plasticity on mutational effects in Escherichia coli. Nat Genet 2004; 36:423–426.
19. García-Closas M, Malats N, Silverman D, et al. NAT2 slow acetylation, GSTM1 null genotype, and risk of bladder cancer: results from the Spanish Bladder Cancer Study and meta-analyses. Lancet 2005; 366:649–659.
20. Hunter DJ. Gene-environment interactions in human diseases. Nat Rev Genet 2005; 6:287–298.
21. Li W, Reich J. A complete enumeration and classification of two-locus disease models. Hum Hered 2000; 50:334–349.
22. Culverhouse R, Suarez BK, Lin J, et al. A perspective on epistasis: limits of models displaying no main effect. Am J Hum Genet 2002; 70:461–471.
23. Moore JH, Williams SW. New strategies for identifying gene-gene interactions in hypertension. Ann Med 2002; 34:88–95.
24. Lewontin RC. The analysis of variance and the analysis of causes. Am J Hum Genet 1974; 26:400–411.
25. Lewontin RC. Commentary: statistical analysis or biological analysis as tools for understanding biological causes. Int J Epidemiol 2006; 35(3):536–537.
26. Wahlsten D. Insensitivity of the analysis of variance to heredity-environment interactions. Behav Brain Sci 1990; 13:109–161.
27. Millstein J, Conti DV, Gilliland FD, et al. A testing framework for identifying susceptibility genes in the presence of epistasis. Am J Hum Genet 2006; 78:15–27.
28. Moore JH. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered 2003; 56:73–82.
29. McKinney BA, Reif DM, Ritchie MD, et al. Machine learning for detecting gene-gene interactions: a review. Appl Bioinformatics 2006; 5:77–88.
30. Burton PR, Tobin MD, Hopper JL. Key concepts in genetic epidemiology. Lancet 2005; 366:941–951.
31. Teare MD, Barrett JH. Genetic linkage studies. Lancet 2005; 366:1036–1044.
32. Cordell HJ, Clayton DG. Genetic association studies. Lancet 2005; 366:1121–1131.
33. Palmer LJ, Cardon LR. Shaking the tree: mapping complex disease genes with linkage disequilibrium. Lancet 2005; 366:1223–1234.
34. Hattersley AT, McCarthy MI. What makes a good genetic association study? Lancet 2005; 366:1315–1323.
35. Hopper JL, Bishop DT, Easton DF. Population-based family studies in genetic epidemiology. Lancet 2005; 366:1397–1406.
36. Smith G, Ebrahim S, Lewis S, et al. Genetic epidemiology and public health: hope, hype, and future prospects. Lancet 2005; 366:1484–1498.
37. Balding DJ. A tutorial on statistical methods for population association studies. Nat Rev Genet 2006; 7:781–791.
38. Ritchie MD, Hahn LW, Roodi N, et al. Multifactor dimensionality reduction reveals high-order interactions among estrogen metabolism genes in sporadic breast cancer. Am J Hum Genet 2001; 69:138–147.
39. Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics 2003; 19:376–382.
40. Ritchie MD, Hahn LW, Moore JH. Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, phenocopy, and genetic heterogeneity. Genet Epidemiol 2003; 24:150–157.
41. Hahn LW, Moore JH. Ideal discrimination of discrete clinical endpoints using multilocus genotypes. In Silico Biol 2004; 4:183–194.
42. Moore JH. Computational analysis of gene-gene interactions in common human diseases using multifactor dimensionality reduction. Expert Rev Mol Diagn 2004; 4:795–803.
43. Moore JH, Gilbert JC, Tsai C-T, et al. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol 2006; 241:252–261.
44. Moore JH. Genome-wide analysis of epistasis using multifactor dimensionality reduction: feature selection and construction in the domain of human genetics. In: Zhu X, Davidson I, eds. Knowledge Discovery and Data Mining: Challenges and Realities with Real World Data. IGI Global, 2007:17–30.
45. Michalski RS. A theory and methodology of inductive learning. Artif Intell 1983; 20:111–161.
46. Haas DW, Geraghty DE, Andersen J, et al. Immunogenetics of CD4 lymphocyte count recovery during antiretroviral therapy: an AIDS Clinical Trials Group Study. J Infect Dis 2006; 194:1098–1107.
47. Chan IH, Leung TF, Tang NL, et al. Gene-gene interactions for asthma and plasma total IgE concentration in Chinese children. J Allergy Clin Immunol 2006; 117:127–133.
48. Tsai CT, Lai LP, Lin JL, et al. Renin-angiotensin system gene polymorphisms and atrial fibrillation. Circulation 2004; 109:1640–1646.
49. Asselbergs FW, Moore JH, van den Berg MP, et al. A role for CETP TaqB polymorphism in determining susceptibility to atrial fibrillation: a nested case control study. BMC Med Genet 2006; 7:39.
50. Coutinho AM, Sousa I, Martins M, et al. Evidence for epistasis between SLC6A4 and ITGB3 in autism etiology and in the determination of platelet serotonin levels. Hum Genet 2007; 121:243–256.
51. Andrew AS, Nelson HH, Kelsey KT, et al. Concordance of multiple analytical approaches demonstrates a complex relationship between DNA repair gene SNPs, smoking, and bladder cancer susceptibility. Carcinogenesis 2006; 27:1030–1037.
52. Andrew AS, Karagas MR, Nelson HH, et al. Assessment of multiple DNA repair gene polymorphisms and bladder cancer susceptibility in a joint Italian and U.S. population: a comparison of alternative analytic approaches. Hum Hered 2007, in press.
53. Huang M, Dinney CP, Lin X, et al. High-order interactions among genetic variants in DNA base excision repair pathway genes and smoking in bladder cancer susceptibility. Cancer Epidemiol Biomarkers Prev 2007; 16:84–91.
54. Chung HH, Kim MK, Kim JW, et al. XRCC1 R399Q polymorphism is associated with response to platinum-based neoadjuvant chemotherapy in bulky cervical cancer. Gynecol Oncol 2006; 103:1031–1037.
55. Bastone L, Reilly M, Rader DJ, et al. MDR and PRP: a comparison of methods for high-order genotype-phenotype associations. Hum Hered 2004; 58:82–92.
56. Tsai CT, Hwang JJ, Ritchie MD, et al. Renin-angiotensin system gene polymorphisms and coronary artery disease in a large angiographic cohort: detection of gene-gene interactions. Atherosclerosis 2007, in press.
57. Agirbasli D, Agirbasli M, Williams SM, et al. Interaction among 5,10 methylenetetrahydrofolate reductase, plasminogen activator inhibitor and endothelial nitric oxide synthase gene polymorphisms predicts the severity of coronary artery disease in Turkish patients. Coron Artery Dis 2006; 17:413–417.
58. Hsieh CH, Liang KH, Hung YJ, et al. Analysis of epistasis for diabetic nephropathy among type 2 diabetic patients. Hum Mol Genet 2006; 15:2701–2708.
59. Sabbagh A, Darlu P. Data-mining methods as useful tools for predicting individual drug response: application to CYP2D6 data. Hum Hered 2006; 62:119–134.
60. Williams SM, Ritchie MD, Phillips JA III, et al. Multilocus analysis of hypertension: a hierarchical approach. Hum Hered 2004; 57:28–38.
61. Soares ML, Coelho T, Sousa A, et al. Susceptibility and modifier genes in Portuguese transthyretin V30M amyloid polyneuropathy: complexity in a single-gene disease. Hum Mol Genet 2005; 14:543–553.
62. Motsinger AA, Brassat D, Caillier SJ, et al. Complex gene-gene interactions in multiple sclerosis: a multifactorial approach reveals associations with inflammatory genes. Neurogenetics 2007; 8:11–20.
63. Brassat D, Motsinger AA, Caillier SJ, et al. Multifactor dimensionality reduction reveals gene-gene interactions associated with multiple sclerosis susceptibility in African Americans. Genes Immun 2006; 7:310–315.
64. Coffey CS, Hebert PR, Ritchie MD, et al. An application of conditional logistic regression and multifactor dimensionality reduction for detecting gene-gene interactions on risk of myocardial infarction: the importance of model validation. BMC Bioinformatics 2004; 4:49.
65. Mannila MN, Eriksson P, Ericsson CG, et al. Epistatic and pleiotropic effects of polymorphisms in the fibrinogen and coagulation factor XIII genes on plasma fibrinogen concentration, fibrin gel structure and risk of myocardial infarction. Thromb Haemost 2006; 95:420–427.
66. Xiong DH, Shen H, Zhao LJ, et al. Robust and comprehensive analysis of 20 osteoporosis candidate genes by very high-density single-nucleotide polymorphism screen among 405 white nuclear families identified significant association and gene-gene interaction. J Bone Miner Res 2006; 21(11):1678–1695.
67. Menon R, Velez DR, Simhan H, et al. Multilocus interactions at maternal tumor necrosis factor-alpha, tumor necrosis factor receptors, interleukin-6 and interleukin-6 receptor genes predict spontaneous preterm labor in European-American women. Am J Obstet Gynecol 2006; 194:1616–1624.
68. Xu J, Lowery J, Wiklund F, et al. The interaction of four inflammatory genes significantly predicts prostate cancer risk. Cancer Epidemiol Biomarkers Prev 2005; 14:2563–2568.
69. Qin S, Zhao X, Pan Y, et al. An association study of the N-methyl-D-aspartate receptor NR1 subunit gene (GRIN1) and NR2B subunit gene (GRIN2B) in schizophrenia with universal DNA microarray. Eur J Hum Genet 2005; 13:807–814.
70. Yasuno K, Ando S, Misumi S, et al. Synergistic association of mitochondrial uncoupling protein (UCP) genes with schizophrenia. Am J Med Genet B Neuropsychiatr Genet 2007; 144(2):250–253.
71. Oestergaard MZ, Tyrer J, Cebrian A, et al. Interactions between genes involved in the antioxidant defence system and breast cancer risk. Br J Cancer 2006; 95:525–531.
72. Nordgard SH, Ritchie MD, Jensrud SD, et al. ABCB1 and GST polymorphisms associated with TP53 status in breast cancer. Pharmacogenet Genom 2007; 17:127–136.
73. Cho YM, Ritchie MD, Moore JH, et al. Multifactor-dimensionality reduction shows a two-locus interaction associated with Type 2 diabetes mellitus. Diabetologia 2004; 47:549–554.
186
Moore et al.
74. Wilke RA, Reif DM, Moore JH. Combinatorial pharmacogenetics. Nature Reviews Drug Discovery 2005; 4:911–91s8. 75. Moore JH, Barney N, Tsai CT, et al. Symbolic modeling of epistasis. Hum Hered 2007; 63:120–133. 76. Jakulin A, Bratko I. Analyzing attribute interactions. Lecture Notes in Artificial Intelligence 2003; 2838:229–240. 77. McGill WJ. Multivariate information transmission. Psychometrica 1954; 19:97–116. 78. Pierce JR. An Introduction to Information Theory: Symbols, Signals, and Noise. New York, New York: Dover, 1980. 79. Hirschhorn JN, Daly MJ. Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics 2005, 6, 95–108. 80. Wang WY, Barratt BJ, Clayton DG, Todd JA. Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet 2005; 6:109–118. 81. Moore JH, Ritchie MD. The challenges of whole-genome approaches to common diseases. J Amer Med Assoc 2004; 291:1642–1643. 82. Goldberg DE. The Design of Innovation. Boston, Massachusetts: Kluwer, 2002. 83. Altshuler D, Brooks LD, Chakravarti A, et al. International HapMap Consortium. A haplotype map of the human genome. Nature 2005; 437:1299–1320. 84. Freitas A. Data Mining and Knowledge Discovery with Evolutionary Algorithms. New York, New York: Springer, 2002. 85. Kira K, Rendell LA. A practical approach to feature selection. In: Sleeman DH, Edwards P. eds. Proceedings of the Ninth International Workshop on Machine Learning. San Francisco, California: Morgan Kaufmann Publishers, 1992:249–256. 86. Kononenko I. Estimating attributes: analysis and extension of Relief. Proceedings of the European Conference on Machine Learning. New York, New York: Springer, 1994:171–182. 87. Robnik-Sikonja M, Kononenko I. Comprehensible interpretation of Relief’s Estimates. Proceedings of the Eighteenth International Conference on Machine Learning. San Francisco, California: Morgan Kaufmann Publishers, 2001:433–440. 88. Robnik-Siknja M, Kononenko I. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning 2003; 53:23–69. 89. Moore JH, White BC. Tuning Relief F for genome-wide genetic analysis. Lect Notes Comput Sci 2007; 4447:166–175. 90. McKinney BA, Reif DM, White BC, et al. Evaporative cooling feature selection for genotypic data involving interactions. Bioinformatics 2007; 23c2113–2120. 91. Moore JH, White BC. Genome-wide genetic analysis using genetic programming. The critical need for expert knowledge. In: Riolo R, Soule T, Worzel B eds. Genetic Programming Theory and Practice IV. New York, New York: Springer, 2007 New York, New York: 11–28. 92. Koza JR. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambrindge, Massachusetts: The MIT Press, 1992. 93. Koza JR. Genetic Programming II: Automatic Discovery of Reusable Programs. Cambrindge, Massachusetts: The MIT Press, 1994. 94. Koza JR, Bennett III FH, Andre D, et al. Genetic Programming III: Darwinian Invention and Problem Solving. San Francisco, California: Morgan Kaufmann Publishers, 1999. 95. Koza JR, Keane MA, Streeter MJ, Mydlowec W, Yu J, Lanza G. Genetic Programming IV: Routine Human-Competitive Machine Intelligence. New York, NY: Springer, 2003. 96. Banzhaf W, Nordin P, Keller RE, et al. Genetic Programming: An Introduction : On the Automatic Evolution of Computer Programs and Its Applications. San Francisco, California: Morgan Kaufmann Publishers, 1998. 97. Langdon WB. Genetic Programming and Data Structures: Genetic Programming þ Data Structures ¼ Automatic Programming! 
Boston, Massachusetts: Kluwer, 1998. 98. Langdon WB, Poli R. Foundations of Genetic Programming. New York, New York: Springer, 2002. 99. Fogel GB, Corne DW. Evolutionary Computation in Bioinformatics. San Francisco, California: Morgan Kaufmann Publishers, 2003.
Novel Analytical Methods
187
100. White BC, Gilbert JC, Reif DM, et al. A statistical comparison of grammatical evolution strategies in the domain of human genetics. Proceedings of the IEEE Congress on Evolutionary Computing. New York, New York: IEEE Press, 2005:676–682. 101. Moore JH, White BC. Exploiting expert knowledge in genetic programming for genomewide genetic analysis. Lecture Notes in Computer Science 2006, 4193, 696-977. 102. Greene CS, White BC, Moore JH. An expert knowledge-guided mutation operator for genetic analysis using genetic programming. Lecture Notes in Bioinformatics 2007; 4774:30–40.
13
Pathway-Based Methods in Molecular Cancer Epidemiology

Fritz F. Parl
Department of Pathology, Vanderbilt University, Nashville, Tennessee, U.S.A.
Philip S. Crooke
Department of Mathematics, Vanderbilt University, Nashville, Tennessee, U.S.A.
David V. Conti and Duncan C. Thomas
Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, California, U.S.A.
INTRODUCTION

Cancer is a complex phenotype resulting from the interaction of inherited and environmental factors. Many studies have sought to unravel this complexity by investigating one gene at a time or by considering pairwise gene-gene and gene-environment interactions. These studies have not proven as successful as hoped and have led to a growing appreciation that a more comprehensive strategy is required (1). This broader strategy centers on the concept of the biochemical pathway, which has been defined as “the sequence of reactions undergone by a compound or class of compounds in a living organism” (2). A reaction is “a chemical change, where the transformation of one or more components into new substances occurs, accompanied by energy changes.” Some biochemical pathways are linear, proceeding in a step-by-step fashion from one molecule to another. Other pathways are branching, generating two or more products. Pathways can also have feedback loops, for example, when the product of a pathway controls the rate of its own synthesis through inhibition of an early step. Each pathway is organized by the links in its chemical reactions, with the product of one reaction providing a substrate for an enzyme that catalyzes a subsequent reaction (3,4). It is evident that a pathway-wide perspective provides a broader conceptual and analytical strategy for the detection of potential disease associations. One can expect the analysis of genetic variation across entire biological pathways to be more likely to reveal the association of candidate genes with cancer risk than studies limited to single genes.
Figure 1 Schematic overview of a pathway-based model for the relationship among genes, exposures, disease, and biomarkers: boxes represent measured quantities and circles represent unmeasured latent variables or parameters to be estimated.
An inherent feature of pathways is the pictorial representation of their component reactions. Pathway-based epidemiological models generally entail some combination of the following elements (Fig. 1):

• Measured inputs in the form of genotypes G at loci thought to be relevant to the hypothesized pathway and exposures E to their environmental substrates.
• A measured outcome phenotype Y—in the case of cancer, a disease outcome, and the age at diagnosis or censoring, although more generally this could be a continuous trait or vector of traits, possibly measured longitudinally.
• Possibly, additional biomarker measurements B of intermediate states in the postulated pathway.
• Some model for the underlying unobservable pathogenic process, represented by a series of latent variables X, structured by some postulated topology L and involving a vector of parameters θ, typically representing rates of the intermediate steps and their dependence on genotypes.
• Some external knowledge Z about the structure of the model and the parameters.
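As a piece of illustrative bookkeeping (not an estimation method from the literature), the measured components of this framework could be collected per subject as follows; the class and field names are our own invention:

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class PathwaySubject:
    """One subject's measured data, in the notation above (Fig. 1)."""
    G: Sequence[int]                      # genotypes at the candidate loci
    E: Sequence[float]                    # exposures to environmental substrates
    Y: float                              # outcome phenotype (e.g., disease status)
    B: Optional[Sequence[float]] = None   # intermediate biomarkers, if measured
    # The latent pathway states X, the topology L, the parameters theta,
    # and the external knowledge Z are not per-subject measurements; they
    # are model-level quantities to be specified or estimated.
```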
Within this general paradigm, an investigator has many specific tools that could be used to represent such a model, ranging from purely exploratory methods like classification and regression trees (CART) or neural nets (NN) to highly parametric physiologically based pharmacokinetic (PBPK) models, with hypothesis-driven statistical approaches like hierarchical Bayes (HB) or logic regression (LR) models somewhere in between. Although exploratory methods can be useful for detecting subtle patterns in complex multidimensional data sets (5–7), they generally do not attempt to incorporate prior knowledge about the structure of a pathway, so we will not consider them further here. Instead, we will focus on mechanistic and empirical methods that in one way or another allow an investigator to formally incorporate biological knowledge (or beliefs) about specific pathways hypothesized to be relevant to the disease and causal factors under study. We begin with a discussion of PBPK modeling principles, illustrated with their application to estrogen metabolism and its role in breast cancer, and show how such
metabolic models could be incorporated into an epidemiological model for disease risk. We then describe a more empirical hierarchical modeling framework that allows biological information to enter the analysis in the form of a higher-level model for the parameters of the lower-level model for the epidemiological data. We conclude with some thoughts about how the two approaches could be combined.

An example of a PBPK model is the oxidative estrogen metabolism pathway (Fig. 2). Estrogens have long been recognized as the prime risk factor for the development of breast cancer (8). Current models of breast cancer risk prediction are based mainly on cumulative estrogen exposure and incorporate such factors as age, age at menarche, parity, and age at first live birth. While these traditional exposure data are valuable in risk calculation, they do not directly reflect mammary estrogen metabolism. Furthermore, previous models do not address genetic variability among women exposed to estrogen metabolites, including catechols and quinones, reactive molecules that have been shown experimentally to damage DNA and cause cancer (9,10). Recently, a novel kinetic-genomic model of estrogen exposure in relation to breast cancer risk prediction was proposed (11). The model incorporates the main components of mammary estrogen metabolism, i.e., the conversion of 17β-estradiol (E2) by the enzymes cytochrome P450 (CYP) 1A1 and 1B1, catechol-O-methyltransferase (COMT), and glutathione S-transferase P1 (GSTP1) into oxidative metabolites, including catechol estrogens and estrogen quinones (Fig. 2A).

LIMITATIONS OF CURRENT TECHNIQUES

How should biological pathways such as the estrogen metabolism pathway be investigated to discover potential risk associations? The most logical and practical approach would be to employ the same methodology, i.e., DNA sequencing, that has been useful in the detection of the most common form of genetic variation, the single nucleotide polymorphism (SNP) (Table 1). The strengths of DNA sequencing have been well documented in the genomewide, high-throughput analysis of the Human Genome Project. The human genome consists of approximately three billion nucleotides of DNA sequence containing an estimated 25,000 genes (12). While current DNA sequencing methods offer the advantage of high-throughput, genomewide analysis with reliable identification of SNPs, this approach has certain limitations in detecting pathway-based disease associations (Table 1). It has been estimated that there are approximately seven million common SNPs with a minor allele frequency of at least 5% across the entire human population (13). An additional four million SNPs exist with an allele frequency between 1% and 5%, yielding a total of approximately 11 million for the human genome. As millions of SNPs are being identified and genotyped, the estimation of haplotype frequencies becomes increasingly important in the mapping of complex disease genes, which has led to the International Haplotype Mapping Project (HapMap) (14). One of the goals of HapMap is to identify common haplotypes across the human genome. HapMap release 16 has genotyped 269 individuals from four ethnic groups for approximately one million SNPs (http://www.hapmap.org) and there are plans to increase the total to five million SNPs. A separate study examined the completeness of SNP data provided by HapMap release 16 by resequencing 335 genes in 90 different individuals.
The analysis revealed that 54% of the identified SNPs had not yet been deposited in the dbSNP database (http://www.ncbi.nlm.nih.gov/SNP/index.html) (15). Thus, at current and planned levels of DNA sequencing, HapMap will not tag all common SNPs, resulting in some loss of statistical power to detect associations with SNPs not directly assayed, although current estimates are that the majority of common SNPs can be tagged at an r² of greater than 80% (16–18). Another limitation in pathway-driven research is introduced with the combined analysis of
Figure 2 Estrogen metabolism pathway. (A) The estrogen metabolism pathway is regulated by oxidizing phase I and conjugating phase II enzymes. CYP1A1 and CYP1B1 catalyze the oxidation of E2 to catechol estrogens 2-OHE2 and 4-OHE2. The catechol estrogens are either methylated by COMT to methoxyestrogens (2-MeOE2, 2-OH-3-MeOE2, 4-MeOE2) or further oxidized by CYPs to semiquinones (not shown) and quinones (E2-2,3-Q, E2-3,4-Q). The estrogen quinones are conjugated by GSTP1 to GSH-conjugates (2-OHE2-1-SG, 2-OHE2-4-SG, 4-OHE2-2-SG). Alternatively, the quinones can form quinone-DNA adducts (e.g., 4-OHE2-N7-guanine, 2-OHE2-N2-deoxyguanosine) or cause oxidative adducts (e.g., 8-OH-deoxyguanosine) via reactive oxygen species resulting from redox-cycling between semiquinones and quinones. These adducts and their estrone and adenine counterparts have been identified in human breast tissues. Recently, we demonstrated experimentally that CYP1B1-mediated oxidation of E2 in the presence of deoxyguanosine caused the formation of the 4-OHE2-N7-guanine adduct. Our results provide direct evidence that metabolism of the parent hormone can initiate DNA damage. (B) Comparison of experimental data with in silico model. The metabolism of E2, 2-OHE2, 4-OHE2, 2-MeOE2, 2-OH-3-MeOE2, 4-MeOE2, 2-OHE2-1-SG, 2-OHE2-4-SG, and 4-OHE2-2-SG is shown as a function of time. In each graph, the concentration is expressed in μM, the blue dots represent the experimental data, and the red curves are derived from the in silico model. Source: From Refs. 11,23,37–39.
Table 1 Strengths and Limitations of DNA Sequencing and Protein Studies in Pathway Analysis

DNA sequencing
  Strengths:
  • Genomewide
  • High throughput
  • Reliable identification of SNPs
  Limitations:
  • HapMap does not identify all common haplotypes
  • Uncertain assignment of complex haplotypes in heterozygous individuals
  • DNA analysis provides qualitative, not quantitative information about protein function
  • Effect of SNPs on pathway is uncertain
  • Changes in protein expression levels are invisible to DNA sequencing

Protein studies
  Strengths:
  • Direct analysis of candidate proteins
  • Functional comparison of wild-type and variant proteins
  Limitations:
  • Labor intensive
  • Time consuming, expensive
  • Uncertain concentration of membrane-bound proteins

Abbreviations: HapMap, haplotype mapping; SNP, single nucleotide polymorphism.
several genes. Let us assume that there are two genes in a pathway, one with two and the other with four nonsynonymous SNPs. Thus, there are 3 × 10 = 30 possible joint genotypes associated with a change in amino acid sequence. The inclusion of a third gene containing a single SNP leads to an exponential increase in possible joint genotypes (3 × 10 × 3 = 90). A fundamental limitation of DNA analysis and SNP identification is the inability to predict the effect of different alleles on the structure and function of the encoded protein. Like DNA, proteins are linear heteropolymers, but instead of four types of bases they are composed of 20 amino acids, each with a range of chemical properties. There is therefore a great diversity of possible protein sequences. The linear chains fold into specific three-dimensional conformations, which are determined by the sequence of amino acids. The three-dimensional structures of proteins are therefore equally diverse, ranging from completely fibrous to globular. Protein structures can be determined at the atomic level by X-ray diffraction and neutron-diffraction studies of crystallized proteins (X-ray crystallography), and by high-field nuclear magnetic resonance (NMR) spectroscopy of proteins in solution. The Human Genome Project has led to an enormous increase in protein sequence data. However, the database for protein structures has lagged far behind the number of known sequences. In principle, it should be possible to translate a gene sequence into an amino acid sequence, and to predict the three-dimensional structure of the resulting chain from this amino acid sequence by performing a detailed simulation of
molecular dynamics of protein folding on the basis of underlying physicochemical interactions among the constituent amino acids. However, the number of possible structures that proteins may possess is extremely large. A particular sequence may be able to assume multiple conformations depending on its environment, and the biologically active conformation may not be the most thermodynamically favorable. The prediction of structures for small proteins by ab initio modeling and for larger proteins by homology modeling or protein threading to previously solved structures are current initiatives designed to bridge the gap between sequence data and three-dimensional protein structure (19). Yet, in many cases it is computationally not possible to predict the structure of a protein with a high degree of certainty. To put computational structural biology and DNA sequencing in perspective, the latter establishes the presence of wild-type or variant alleles with near 100% certainty, whereas current methods of protein structure modeling cannot reliably predict the actual structure, much less the effect of a substituted amino acid (20,21). Extrapolating from protein structure to function is even less reliable. The effect of an amino acid substitution on protein function can range from none to impaired or enhanced. As an extreme, the substitution can render the protein functionless. In summary, the ease of obtaining DNA sequence data should not hide the fact that the information is only qualitative with respect to the encoded protein until detailed structure-function studies are performed to quantify the effect of a nucleotide substitution.
Example: The Estrogen Metabolism Pathway

As mentioned earlier, biochemical pathways consist of reaction sequences, which in one form or other depend on proteins (enzymes, growth factors, etc.). Understanding the effect of SNPs on a pathway requires assessment not only of the individual variant proteins but also of their complex interactions in the pathway. Using the estrogen metabolism pathway as an example, consider the three polymorphic genes CYP1A1, CYP1B1, and COMT. DNA analysis identifies the variant alleles but does not quantify the variation in the dynamics of the pathway. In contrast, functional protein analysis provides a quantitative assessment, with each variation in protein structure likely to have a different effect. Each of the three proteins also has a certain concentration that determines their interaction in the pathway. Obviously, changes in protein expression levels cannot be discerned by DNA sequencing. An additional advantage of protein studies is the direct comparison of wild-type and variant proteins, resulting in the generation of quantitative parameters such as rate or binding constants (Table 1). In the case of the estrogen metabolism pathway, the kinetic parameters (Km, Vmax) for wild-type and variant CYP1A1, CYP1B1, and COMT were obtained (11,23–27). These data permitted quantitative rather than qualitative assessment of the metabolic interactions in the pathway (Fig. 2B). These observations clearly demonstrate the importance of protein analyses in molecular epidemiological investigations. However, protein studies are time consuming and relatively expensive. To quantify the effect of an amino acid substitution on protein function is a laborious experimental task requiring cloning, expression, and purification of the recombinant protein followed by careful functional studies of the variant protein in comparison to the wild type. Determining the concentration of cellular proteins can be technically challenging, especially for membrane-bound proteins. In view of the time-consuming, technically demanding nature of protein studies, it becomes a challenge to explore the multitude of potential reactions in a pathway that may result from the complex genetic variations of the component proteins. Experiments can be performed either
in vitro or in vivo or, with the help of computers and mathematical tools, by an “in silico” approach (27). Using again the estrogen metabolism pathway as an example, an in silico model of mammary estrogen metabolism was developed (11). The in silico model incorporates the main components of mammary estrogen metabolism (Fig. 2A), i.e., the conversion of E2 by CYP1A1, CYP1B1, COMT, and GSTP1 into oxidative metabolites, including catechol estrogens and highly reactive estrogen quinones, which can damage DNA in breast epithelium (9,10). Using experimentally determined enzyme reaction rate constants, the model allows kinetic simulation of the conversion of E2 into eight major estrogen metabolites. The simulations showed excellent agreement with experimental results and provided a quantitative assessment of the metabolic interactions (Fig. 2B). Using rate constants of genetic variants of CYP1A1, CYP1B1, and COMT, the model further allowed examination of the kinetic impact of enzyme polymorphisms on the entire estrogen metabolic pathway. Furthermore, the model identified those joint genotypes in CYP1A1, CYP1B1, and COMT that produce the largest amounts of catechols and quinones. Application of the model to a breast cancer case-control population defined the estrogen quinone E2-3,4-Q as a potential breast cancer risk factor.

MECHANISTIC MODELS FOR METABOLIC PATHWAYS

The in silico model demonstrates that mathematical simulations, paired with experiments, can be successfully used to understand and predict the quantitative behavior of complex systems, such as biochemical pathways (Table 2). In view of the increasing utility of in silico modeling, we will present a brief mathematical analysis of the main types of reactions occurring in biochemical pathways. In our discussion of each type of reaction we use a basic conservation equation of compartmental analysis. Suppose that we have a compound, A, which is being created at a rate Rin and converted to another compound at a rate Rout.
The dynamic rate of change of A is then

da/dt = Rin − Rout,

where a(t) is the concentration of A at time t and Rin and Rout may depend on various substrate concentrations, as described below. We will use this conservation law to analyze more complicated reactions.

Reaction Type I

The mathematical model for the reaction is

da/dt = −k1 a
db/dt = k1 a − k2 b
dc/dt = k2 b − k3 c
where a(t), b(t), c(t) are the concentrations of compounds A, B, and C, respectively. Given an initial condition for each compound, it is possible to obtain exact solutions of these differential equations, e.g., a(t) = a(0)e^(−k1 t).
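Continuing this example (a short sketch, assuming b(0) = 0 and k1 ≠ k2), the standard integrating-factor calculation also gives an exact solution for the second compound:

```latex
b(t) = \frac{k_1\, a(0)}{k_2 - k_1}\left(e^{-k_1 t} - e^{-k_2 t}\right), \qquad k_1 \neq k_2 .
```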
Reaction Type II
This series of reactions has a feedback reaction between compounds A and C. The mathematical model is

da/dt = −k1 a + k4 c
db/dt = k1 a − k2 b
dc/dt = k2 b − (k3 + k4) c

In theory, it is possible to find abstract solutions, but they are highly dependent on the parameters of the system. Here is an illustration for particular values of the parameters.
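Since the original illustration is graphical, here is a minimal sketch of how such an illustration could be reproduced numerically; the rate constants and initial condition are hypothetical values chosen only for demonstration:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Reaction Type II: A -> B -> C with feedback C -> A.
# Hypothetical rate constants, for illustration only.
k1, k2, k3, k4 = 1.0, 0.5, 0.3, 0.2

def rhs(t, y):
    a, b, c = y
    return [-k1 * a + k4 * c,        # da/dt
            k1 * a - k2 * b,         # db/dt
            k2 * b - (k3 + k4) * c]  # dc/dt

# Start with all mass in compound A: a(0) = 1, b(0) = c(0) = 0.
sol = solve_ivp(rhs, (0.0, 20.0), [1.0, 0.0, 0.0],
                t_eval=np.linspace(0.0, 20.0, 201))
a, b, c = sol.y
print(f"t = 20: a = {a[-1]:.3f}, b = {b[-1]:.3f}, c = {c[-1]:.3f}")
```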
Reaction Type III
This is a classical enzymatic reaction between two compounds A and B with an intermediate C. The reaction can be written as

A + E ⇌ C → B + E,

where k1 and k2 are the forward and reverse rate constants of the first step and k3 is the rate constant of product formation. The mathematical model for this reaction is a system of nonlinear differential equations for the concentrations of the compounds:

da/dt = −k1 e a + k2 c
db/dt = k3 c
dc/dt = k1 e a − (k2 + k3) c
de/dt = −k1 e a + (k2 + k3) c

This system has two conservation laws that can be obtained by manipulation:

dc/dt + de/dt = 0 ⟹ c(t) + e(t) = c(0) + e(0) = ē
da/dt + dc/dt + db/dt = 0 ⟹ a(t) + c(t) + b(t) = a(0) + c(0) + b(0) = ā

This, in turn, allows us to reduce the system of differential equations to two differential equations with initial conditions:

da/dt = −k1 ē a + (k1 a + k2) c,  a(0) = ā
dc/dt = k1 ē a − (k1 a + k2 + k3) c,  c(0) = 0

Here we have assumed that b(0) = c(0) = 0. Having c(t), it is a straightforward calculation to recover b(t):

b(t) = k3 ∫₀ᵗ c(τ) dτ.

It is possible to derive approximations for c(t) in terms of a(t) (28). In particular,

c(t) = ē a(t) / (km + a(t)),  where km = (k2 + k3)/k1.
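As a quick numerical check of this quasi-steady-state approximation, one can integrate the reduced two-equation system and compare c(t) with ē a(t)/(km + a(t)); the constants below are hypothetical, chosen so that enzyme is scarce relative to substrate:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Reaction Type III (enzyme kinetics), reduced two-equation form.
# Hypothetical constants: e_bar is total enzyme, a_bar total substrate.
k1, k2, k3 = 10.0, 1.0, 0.5
e_bar, a_bar = 0.1, 1.0
km = (k2 + k3) / k1          # Michaelis constant

def rhs(t, y):
    a, c = y
    return [-k1 * e_bar * a + (k1 * a + k2) * c,
            k1 * e_bar * a - (k1 * a + k2 + k3) * c]

sol = solve_ivp(rhs, (0.0, 50.0), [a_bar, 0.0],
                t_eval=np.linspace(0.0, 50.0, 501))
a, c = sol.y
c_approx = e_bar * a / (km + a)   # quasi-steady-state approximation
print("max |c - c_approx| =", np.abs(c - c_approx).max())
```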
The in silico model of the estrogen metabolism pathway was built from a set of nonlinear differential equations corresponding to reaction type III (11). Since the in silico model integrates both kinetic and genomic data, it can be used to simulate the effect of the possible 128 composite CYP1A1, CYP1B1, and COMT genotypes on all metabolites in the pathway. Once additional experimental results for the GSTP1-mediated glutathione conjugation of estrogen metabolites have been obtained, GSTP1 can be included in the model. An advantage of the in silico model is that it can readily accommodate additional enzymes contributing to estrogen metabolism, such as sulfotransferases and UDP-glucuronyltransferases, provided experimental data are
Table 2 Strengths and Limitations of In Silico Modeling in Pathway Analysis

Strengths:
• Speed of running simulations
• Additional variables can be included; expandability of in silico model
• Easy testing of hypotheses
• Multiple simulations help in narrowing choice of possible experiments

Limitations:
• Development of in silico model depends on quality of experimental data
• Validation of model requires experiments
available. Another important feature of the in silico model is that it allows efficient simulation of various experimental conditions and prediction of outcomes. Agreement of the experimental data with the data predicted by the in silico model would constitute a successful validation of the latter. Alternatively, a poor fit of the experimental and simulated data would allow further refinement of the model by incorporation of the new experimental results. Overall, these simulations narrow the choice of multiple possible experiments and thereby save time and expense. In the end, it is the quality of the experimental data that determines the usefulness and validity of any in silico model (Table 2).

HIERARCHICAL MODELS FOR DISEASE-PATHWAY ASSOCIATIONS

The preceding section provides an overview of how one might go about building a PBPK model for the intermediate variables X in a latent variables model of the type shown in Figure 1. How does one then incorporate this into an analysis of epidemiological data on individual exposures, genotypes, biomarkers, and disease? Let θi = (θi1, θi2, . . ., θiK) denote the vector of metabolic rate parameters (e.g., Vmax and km for a Michaelis–Menten kinetic model) involving K reactions specific to individual i with vector of genotypes Gi = (Gi1, Gi2, . . ., GiK) at the relevant loci, and let XiM denote the predicted final metabolite concentration in this process. The relationships among these variables are depicted in Figure 3. Here Xim denotes the steady-state solution for metabolite m in subject i from the system of differential equations forming the PBPK model, represented by triangles to indicate deterministic nodes in a graphical model. The inputs to this system comprise the individual’s exposures Ei and genotypes Gi, which in turn influence that person’s metabolic rate parameters θi. For example, one might assume that there is some person-to-person variability in these rates among people with the same genotype because of various other unmeasured characteristics and adopt a statistical model for this unobserved variability, such as a lognormal distribution with logarithmic mean μGim and logarithmic standard deviation σm. Ideally, these genotype-specific population means and standard deviations would have prior distributions that were informed by laboratory assays of experimentally measured rates in appropriate model systems. One might also have short-term biomarker measurements (B in the notation of Fig. 1), such as “metabolomic” measurements of intermediate metabolite concentrations Mim or “proteomic” measurements of enzyme activity levels Pim. These measurements also might be assumed to be lognormally distributed around their respective unobserved long-term average values with logarithmic standard deviations τm and ωm, respectively. Finally, one might assume a logistic or proportional hazards model for disease risk as a function of the final metabolite concentration, e.g., logit Pr(Yi = 1 | XiM) = β0 + β1 XiM. A Markov chain Monte Carlo
Figure 3 Schematic representation of one intermediate step in a PBPK model for intermediate metabolite concentrations in a metabolic model. Boxes represent measured quantities, circles latent variables or model parameters, and triangles represent deterministic quantities given by the steady-state solution to the differential equations in the PBPK model. Abbreviation: PBPK, physiologically based pharmacokinetic.
approach to fitting this entire combination of deterministic and stochastic models was described by Cortessis and Thomas (29).

A less mechanistic approach is provided by hierarchical modeling, in which the epidemiological data are fitted using a conventional “empirical” model for the main effects and interactions among the various input genotypes and exposures, denoted here generically as X = (E, G, G × G, G × E, G × G × E, . . .), for example, a logistic regression model of the form

logit Pr(Yi = 1 | Xi) = β0 + Σp βp Xip    (1)

the sum being taken over the range of terms included in the X vector, and then each of the regression coefficients in this model is in turn regressed on a vector Zp of “prior covariates,” giving characteristics of the corresponding terms in X, e.g.,

βp ~ N(Σv πv Zpv, σ²)    (2)
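To make the two-level structure of Eqs. (1)–(2) concrete, here is a minimal empirical-Bayes sketch in which first-stage per-term estimates are shrunk toward their second-stage predictions Zπ. All data and the prior-covariate coding are simulated placeholders, and σ² is treated as known for simplicity, whereas a full hierarchical analysis would estimate it:

```python
import numpy as np

rng = np.random.default_rng(0)

# First stage: suppose term-by-term logistic regressions (Eq. 1)
# yielded log-odds-ratio estimates b_hat with standard errors se.
# Second stage (Eq. 2): beta_p ~ N(Z_p' pi, sigma2), where Z codes
# prior covariates such as pathway membership.
P = 20
Z = rng.integers(0, 2, size=(P, 3)).astype(float)   # placeholder Z
b_hat = rng.normal(Z @ np.array([0.3, 0.0, -0.2]), 0.25)
se = np.full(P, 0.25)
sigma2 = 0.05                                        # assumed known here

# Estimate pi by least squares, then shrink each b_hat toward its
# second-stage prediction with precision weights.
pi_hat, *_ = np.linalg.lstsq(Z, b_hat, rcond=None)
prior_mean = Z @ pi_hat
w = sigma2 / (sigma2 + se**2)                        # shrinkage weight
beta_post = w * b_hat + (1 - w) * prior_mean
print(np.round(beta_post[:5], 3))
```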
There are many possibilities for what could be included in the set of prior covariates, ranging from indicator variables for which of several pathways each gene might act in (30), to in silico predictions of the functional significance of polymorphisms in each gene (31), or genomic annotation from formal ontologies (32). The general hierarchical modeling strategy has been described by Conti et al. (33), who also considered uncertainty about the set of main effects and interactions to be included in X using Bayesian model averaging. Specifically, they replaced the second-level model by a pair of models: a logistic regression for the probability that βp = 0
and a linear regression of the form of Eq. (2) for the expected value of the coefficient given that it was not zero. Alternatively, one could model the variances, e.g.,

βp ~ N(π′Zp, σ² exp(γ′Zp))    (3)

For example, suppose the X vector comprised effects for different polymorphisms within each gene and one had some prior predictors of the effects of each polymorphism (e.g., in silico predictions of functional effects or evolutionary conservation) and other predictors of the general effects of genes (e.g., their roles in different pathways). Then it might be appropriate to include the former in the π′Z part of the model for the means and the latter in the γ′Z part of the model for the variances.

This description raises the more general question about how to deal with model uncertainty in any of these modeling strategies. In Figure 1, we represented the “topology” of the model L and the corresponding vector of model parameters θL as unknown quantities, about which we might have some general prior knowledge in the form of the “ontology” Z. In the microarray analysis world, Bayesian network analysis has emerged as a powerful technique for inferring the structure of a complex network of genes (34). Might such a technique prove helpful for epidemiological analysis? One promising approach is “logic regression,” which considers a set of tree-structured models relating measurable inputs (genes and exposures) to a disease trait through a network of unobserved intermediate nodes representing logical operators (AND, OR, XOR, etc.) (35). To allow for uncertainty about model form, a Markov chain Monte Carlo method is used to update the structure of the graphical model by adding, deleting, moving, or changing the types of the intermediate nodes. Although appealing as a way of representing biochemical pathways, logic regression does not exploit any external information about the form of the network and also assumes that all the intermediate nodes are binary; it is hence more suitable for modeling regulatory pathways than metabolic pathways, where the intermediate nodes would represent continuous metabolite concentrations. In principle, however, similarly tree-structured models could be devised for continuous variables, and one could incorporate external knowledge in the form of a prior model on the space of possible topologies, say using Hamming distance (36) as a metric to compare each fitted tree structure to the assumed prior model.

COMBINING MECHANISTIC AND STATISTICAL MODELS

One could try to marry the mechanistic and empirical modeling approaches by using the output of a PBPK model as prior covariates in a hierarchical model (Table 3). Let Zge = E[X(Gg, Ee)] denote the predicted steady-state concentration of the final metabolite from a differential equations model for a particular combination of genes and/or exposures (thus, Zgg′ might represent the predicted effect of a G × G interaction between genes g and g′, or Zgee′ a G × E × E interaction). As discussed above, other Zs could comprise variances of predicted Xs across a range of input values as a measure of the sensitivity of the output to variation in that particular combination of inputs. Zge could also be a vector of several different predicted metabolite concentrations if there were multiple hypotheses about which was the most etiologically relevant. So far, no information about disease outcomes or specific study subjects’ covariate values has been used. We now build an empirical model for the epidemiological data in terms of main effects and interactions among the various exposures and genotypes, as in Eq. (1), and treat each Zge for the various combinations of genes and exposure variables as prior
Table 3 Strengths and Limitations of Approaches to Pathway Modeling of Gene-Environment and Gene-Gene Interactions in Molecular Cancer Epidemiology

Physiologically based pharmacokinetic models
  Strengths:
  • Incorporates biological understanding of physiological mechanisms
  • Allows for individual variation in metabolic rates
  • Unifies analyses of genotypes, exposures, and biomarkers in a natural way
  • Allows external information about pathway structure and rates to be included as priors
  Limitations:
  • Highly parametric and model dependent
  • Computationally intensive
  • Involves many unobservable latent variables (e.g., intermediate metabolite concentrations, individual rate parameters)

Empirical hierarchical Bayes models
  Strengths:
  • Less dependent on models for underlying disease processes
  • Can incorporate external data about contribution of genes and exposures to various pathways
  • Can allow for uncertainty about model form (e.g., which main effects and interactions to include)
  • Well-founded statistical theory for hypothesis testing and estimation
  Limitations:
  • Does not exploit quantitative knowledge about pathway dynamics
  • Relationship of disease to exposure and genotypes is descriptive, not model based

Unified mechanistic and empirical modeling approach
  Strengths:
  • Combines strong biological models for biological processes with weak empirical models for resulting risk of disease
  Limitations:
  • Largely untested methodology
covariates for the corresponding regression coefficients βge in a prior model of the form of Eqs. (2) or (3).

CONCLUSION

While there exist many obstacles hindering the general application of pathway-based methods, there are many potential advantages to conceptualizing the analysis in this fashion. First, these approaches formalize the incorporation of prior knowledge before we examine empirical data, thus reducing the reliance on post hoc justification of significant associations generated when testing a broad set of genetic and environmental factors. In addition, by structuring the relations between measured genetic and environmental factors via a specified PBPK topology, a considerable reduction in the dimensionality of the space of possible models may be achieved. This occurs mainly through the incorporation of external knowledge, reducing the possible factors that may act on a single component, latent variable, or reaction within the topology. Preliminary investigation has revealed that while the formal introduction of the underlying topology may not increase the power to detect genetic effects, it can aid in their characterization by localizing the genetic effect to a specific step within the topology. In contrast, a less mechanistic hierarchical modeling approach stays within the standard association framework and can suffer from a large number of main effects and interactions (Eq. 1). However, the higher-level models
(Eq. 2) help to stabilize first-stage effect estimates and allow for summary inference on exchangeable sets defined by the prior covariates and their corresponding effect estimates (the πs).

As the costs of high-volume genotyping platforms have declined, agnostic genomewide association scans have recently become feasible. Coupled with gene expression arrays and emerging proteomic, metabolomic, and other -omics technologies for genomewide interrogation of biological processes, investigators are facing a choice between pathway-driven or exploratory modes of research, or better yet, finding creative ways to combine the two. We anticipate that the latter will be a particularly fruitful but challenging line of research, allowing investigators to exploit the wealth of genomic and biomarker data that will soon become available to better define the biological processes that the pathway-driven approach aims to describe.

In summary, the analysis of biochemical pathways encompassing multiple genes provides a broad conceptual strategy. The analysis of genetic variation across entire biological pathways is more likely to reveal the association of candidate genes with cancer risk than studies limited to single genes. To be successful in this endeavor, DNA sequencing will need to be combined with the analysis of proteins and the development of in silico models. Such models should also include environmental variables to truly represent the exposure to both inherited and environmental risk factors responsible for carcinogenesis, and ultimately be linked with stochastic models for disease risk resulting from the hypothesized pathways.

ACKNOWLEDGMENTS

This research was supported in part by NIH grants R01 CA/ES83752, U54 CA113007, R01 CA52862, P50 HG002790, and P30 ES07048.

REFERENCES

1. Thomas DC. The need for a systematic approach to complex pathways in molecular epidemiology. Cancer Epidemiol Biomarkers Prev 2005; 14:557–559.
2. Thagard P. Pathways to biomedical discovery. Philosophy Sci 2003; 70:235–254.
3. Borodina I, Nielsen J. From genomes to in silico cells via metabolic networks. Curr Opin Biotechnol 2005; 16:350–355.
4. Karp PD. Pathway databases: a case study in computational symbolic theories. Science 2001; 293:2040–2044.
5. Cook NR, Zee RYL, Ridker PM. Tree and spline based association analysis of gene-gene interaction models for ischemic stroke. Statist Med 2004; 23:1439–1453.
6. Hoh J, Ott J. Mathematical multi-locus approaches to localizing complex human trait genes. Nature Rev Genet 2003; 4:701–709.
7. Moore JH. The ubiquitous nature of epistasis in determining susceptibility in common human diseases. Hum Hered 2003; 56:73–82.
8. Pike MC, Krailo MD, Henderson BE, et al. ‘Hormonal’ risk factors, ‘breast tissue age’ and the age-incidence of breast cancer. Nature 1983; 303:767–770.
9. Cavalieri EL, Li KM, Balu N, et al. Catechol ortho-quinones: the electrophilic compounds that form depurinating DNA adducts and could initiate cancer and other diseases. Carcinogenesis 2002; 23:1071–1077.
10. Li KM, Todorovic R, Devanesan P, et al. Metabolism and DNA binding studies of 4-hydroxyestradiol and estradiol-3,4-quinone in vitro and in female ACI rat mammary gland in vivo. Carcinogenesis 2004; 25:289–297.
11. Crooke PS, Ritchie MD, Hachey DL, et al. Estrogens, enzyme variants, and breast cancer: a risk model. Cancer Epidemiol Biomarkers Prev 2006; 15:1620–1629.
12. Stein LD. End of the beginning. Nature 2004; 431:915–916.
13. Collins FS, Morgan M, Patrinos A. The Human Genome Project: lessons from large-scale biology. Science 2003; 300:286–290.
14. Altshuler D, Brooks LD, Chakravarti A, et al. A haplotype map of the human genome. Nature 2005; 437:1299–1320.
15. Taylor JA, Xu ZL, Kaplan NL, et al. How well do HapMap haplotypes identify common haplotypes of genes? A comparison with haplotypes of 334 genes resequenced in the environmental genome project. Cancer Epidemiol Biomarkers Prev 2006; 15:133–137.
16. Barrett JC, Cardon LR. Evaluating coverage of genome-wide association studies. Nature Genetics 2006; 38:659–662.
17. de Bakker PIW, Yelensky R, Pe’er I, et al. Efficiency and power in genetic association studies. Nature Genetics 2005; 37:1217–1223.
18. Pe’er I, de Bakker PIW, Maller J, et al. Evaluating and improving power in whole-genome association studies using fixed-marker sets. Nature Genetics 2006; 38:663–667.
19. Nayeem A, Sitkoff D, Krystek S. A comparative study of available software for high-accuracy homology modeling: from sequence alignment to structural models. Protein Sci 2006; 15:808–824.
20. Bateman A. Editorial. Nucl Acids Res 2006; 34:D1.
21. Yeats C, Maibaum M, Marsden R, et al. Gene3D: modelling protein structure, function and evolution. Nucleic Acids Res 2006; 34:D281–D284.
22. Dawling S, Hachey DL, Roodi N, et al. In vitro model of mammary estrogen metabolism: structural and kinetic differences in mammary metabolism of catechol estrogens 2- and 4-hydroxyestradiol. Chem Res Toxicol 2004; 17:1258–1264.
23. Dawling S, Roodi N, Mernaugh RL, et al. Catechol-O-methyltransferase (COMT)-mediated metabolism of catechol estrogens: comparison of wild-type and variant COMT isoforms. Cancer Res 2001; 61:6716–6722.
24. Dawling S, Roodi N, Parl FF. Methoxyestrogens exert feedback inhibition on cytochrome P450 1A1 and 1B1. Cancer Res 2003; 63:3127–3132.
25. Hachey DL, Dawling S, Roodi N, Parl FF. Sequential action of phase I and II enzymes cytochrome P450 1B1 and glutathione S-transferase P1 in mammary estrogen metabolism. Cancer Res 2003; 63:8492–8499.
26. Hanna IH, Dawling S, Roodi N, et al. Cytochrome P450 1B1 (CYP1B1) pharmacogenetics: association of polymorphisms with functional differences in estrogen hydroxylation activity. Cancer Res 2000; 60:3440–3444.
27. Di Ventura B, Lemerle C, Michalodimitrakis K, et al. From in vivo to in silico biology and back. Nature 2006; 443:527–533.
28. Crooke PS, Tanner RD, Aris R. The role of dimensionless parameters in the Briggs-Haldane and Michaelis-Menten approximations. Chem Eng Sci 1979; 34:1357–1359.
29. Cortessis V, Thomas DC. Toxicokinetic genetics: an approach to gene-environment and gene-gene interactions in complex metabolic pathways. In: Bird P, Boffetta P, Buffler P, Rice J, eds. Mechanistic Considerations in the Molecular Epidemiology of Cancer. Lyon, France: IARC Scientific Publications, 2003.
30. Capanu M, Begg CB. The use of hierarchical models for estimating relative risks of individual genetic variants: an application to the Genes Environment and Melanoma Study. J Am Statist Assoc 2007; in review.
31. Conti DV, Lewinger JP, Swan GE, Tyndale RF, Benowitz NL, Thomas PD. Using ontologies in hierarchical modeling of genes and exposures in biologic pathways. In: Swan GE, ed. NCI Monograph 22: Phenotypes, Endophenotypes, and Genetic Studies of Nicotine Dependence; in press.
32. Hung RJ, Brennan P, Malaveille C, et al. Using hierarchical modeling in genetic association studies with multiple markers: application to a case-control study of bladder cancer. Cancer Epidemiol Biomarkers Prev 2004; 13:1013–1021.
33. Conti DV, Cortessis V, Molitor J, et al. Bayesian modeling of complex metabolic pathways. Hum Hered 2003; 56:83–93.
34. Friedman N. Inferring cellular networks using probabilistic graphical models. Science 2004; 303:799–805.
35. Kooperberg C, Ruczinski I. Identifying interacting SNPs using Monte Carlo logic regression. Genet Epidemiol 2005; 28:157–170.
36. Hamming RW. Error detecting and error correcting codes. Bell System Tech J 1950; 29:147–160.
37. Embrechts J, Lemiere F, Van Dongen W, et al. Detection of estrogen DNA-adducts in human breast tumor tissue and healthy tissue by combined nano LC-nano ES tandem mass spectrometry. J Am Soc Mass Spectrom 2003; 14:482–491.
38. Markushin Y, Zhong W, Cavalieri EL, et al. Spectral characterization of catechol estrogen quinone (CEQ)-derived DNA adducts and their identification in human breast tissue extract. Chem Res Toxicol 2003; 16:1107–1117.
39. Belous AR, Hachey DL, Dawling S, et al. Cytochrome P450 1B1-mediated estrogen metabolism results in estrogen-deoxyribonucleoside adduct formation. Cancer Res 2007; 67:812–817.
14
Haplotype Association Analysis

Peter Kraft
Departments of Epidemiology and Biostatistics, Harvard School of Public Health, Boston, Massachusetts, U.S.A.
Jinbo Chen
Department of Biostatistics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, U.S.A.
INTRODUCTION

The analysis of association between a trait and genetic markers is in many ways no different from the analysis of association between a trait and nongenetic exposures. Conceptual issues such as confounding bias and inflated type I error rate due to multiple testing arise in both contexts, and, at least for studies of unrelated individuals, standard statistical techniques such as logistic regression (binary traits), multivariable regression (continuous traits), the log-rank test, or Cox regression (censored survival traits) are readily applicable to genetic association studies by treating genotypes as categorical or ordinal variables. Haplotype association analysis—broadly conceived as an analysis that exploits local patterns of correlation among physically close genetic markers, a.k.a. linkage disequilibrium—attempts to improve statistical efficiency and interpretability of genetic association studies by incorporating additional modeling assumptions based on principles of Mendelian and population genetics. It has become an important tool for both the design and analysis of genetic association studies. At the design stage, it can be used to select an informative subset of markers to genotype, greatly reducing study cost. At the analysis stage, it can be used to directly or indirectly infer genotypes at untyped loci, account for correlation among tested markers, or test hypotheses about the joint effects of alleles at linked markers.

There are two principal reasons to undertake haplotype association analysis. First, researchers may wish to comprehensively scan a candidate region for disease susceptibility loci. Before the advent of the HapMap (1,2) (http://www.hapmap.org/) or other resources that provide empirical data on local patterns of linkage disequilibrium among genetic variants (http://pga.mbt.washington.edu/; http://egp.gs.washington.edu/), and before the development of inexpensive and flexible genotyping techniques, this was not possible. Researchers could test a few known variants but could not extend their results to
other unknown or untyped variants. Failure to find an association with the typed variants could not rule out a causal role for the gene, as there might have been untyped causal variants present that were not strongly correlated with the typed variants. Conversely, finding an association with a marker need not implicate that specific marker in the causal process, as it might be strongly correlated with another causal variant in the same gene, in nearby regulatory regions, or even in a different gene. Now that a more detailed picture of local variation is available [nearly but not quite complete for variants (1) with minor allele frequency above 5%], researchers can use the local pattern of linkage disequilibrium to interpret results and even choose a subset of markers that serve as surrogates for all variation in a region. Second, researchers may hypothesize that two nearby loci act in cis, that is, variant alleles at the two loci only affect gene function if they are on the same chromosome. Haplotype analysis can increase power to detect such effects and, in principle, can be used to test whether alleles at the two loci act in cis or affect trait distribution regardless of whether they are on the same chromosome (3,4).

In this article, we first discuss some of the reasons that led researchers to propose using haplotypes when testing association between common variation in a candidate gene or region and disease risk. [We do not focus on haplotype sharing methods (5), which may be better suited for identifying rare variants with large effects (1).] Then we briefly review methods for estimating phased haplotypes from unphased genotype data and algorithms for selecting a subset of highly informative single nucleotide polymorphisms (SNPs) to study, along with their implications for data analysis and interpretation. Finally, we review methods that explicitly model associations between individual haplotypes and different types of traits (binary, continuous, survival), and discuss modern methods that implicitly use haplotype information to infer genotypes at markers that have not been genotyped in the main study. There is a vast and ever expanding literature on each of these topics and we do not aim to give a comprehensive review. But we hope readers come away with some intuition for the basic issues and an introduction to the relevant literature.

WHAT ARE HAPLOTYPES, AND HOW CAN THEY SERVE AS SURROGATES FOR ALL VARIATIONS IN A REGION?

A haplotype is an ordered sequence of alleles at a subset of loci along a chromosome. Table 1 shows the frequency of distinct haplotypes defined by 63 common SNPs spanning the ATM gene in the HapMap CEU trios (Utah residents with ancestry from northern or western Europe). (Throughout the rest of this article, we adopt the convention that “common” SNP alleles or haplotypes have minor allele frequency greater than 5%.) There are 10 unique haplotypes in this sample of 120 founder chromosomes, far fewer than would be expected if each of the 2⁶³ > 10¹⁸ possible haplotypes were equally likely. The empirical observation that such limited haplotype diversity extended over relatively long physical distances (6) helped spark interest in haplotype association analysis (7). Limited haplotype diversity suggests that genetic variation across a large set of polymorphic markers can be measured using a much smaller subset of markers. In Table 1, for example, the complete 63-SNP haplotype can be identified using only the alleles at the nine starred markers.
This suggests that genotyping costs can be greatly reduced with little loss in power by choosing a representative subset of markers (often called “haplotype tagging SNPs” or just “tagging SNPs”) that capture genetic variation in a gene (or under a linkage peak or spanning the whole genome).
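To give a flavor of how tagging works in practice, here is a minimal sketch of one common greedy heuristic: select the SNP that covers the most still-untagged SNPs at pairwise r² above a threshold, then repeat. This illustrates the general idea rather than the specific algorithms cited in this chapter; the function name and threshold are our own:

```python
import numpy as np

def greedy_tag_snps(genotypes, r2_min=0.8):
    """genotypes: (n_subjects, n_snps) array of 0/1/2 allele counts.
    Returns indices of a tagging subset under a pairwise-r2 criterion."""
    r2 = np.corrcoef(genotypes.T) ** 2
    untagged = set(range(genotypes.shape[1]))
    tags = []
    while untagged:
        # Pick the SNP that tags the most remaining SNPs (including itself).
        best = max(untagged,
                   key=lambda s: sum(r2[s, t] >= r2_min for t in untagged))
        tags.append(best)
        untagged -= {t for t in untagged if r2[best, t] >= r2_min}
    return tags

# Toy usage with simulated genotypes.
rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(100, 12)).astype(float)
print(greedy_tag_snps(geno))
```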
Table 1 Haplotypes of 63 Common SNPs Spanning the ATM Gene and Their Frequencies

Label  Haplotype                                                        Frequency
1      TGCGCAGAGGCAGAAGCAGAGGCACCTAAGGAGACCCCACAGAATACTTTAGTCGATTTTCGG  0.200
2      CATACAATTAAGGGGGTGAGCAAGCCCGCAAAACTTATACGAGGGCGTGCAGGGTGCCATATT  0.167
3      TGCGCAGAGGCAAAAGCAGAGGCACCTAAGGAGACCCCACAGAATACTTTAGTCGATTTTCGG  0.150
4      CATACAATTAAGGGAATGAGCAAGCCCGAAGTACTTATATGAGGGCGTGCGAGGTGCCAGCTT  0.105
5      CATACAATTAAGGGAGTGAGCAAGCCCGAAGTACTTATACGAGGGCGGGCAGGGTGCCATCTT  0.092
6      CATACAATTAAGGGAATGAGCAAGCCCGAAGTACTTATATGAGGGCGTGCAAGGTGCCAGCTT  0.073
7      TGCGGCGAGGCAGAAGCAGAGGCATCTAAGGAGACCCCACAGAATACTTTAGTCGATTTTCGG  0.058
8      TGCGCAGAGGCAGAAGCAGAGGCACCTAAGGAGACCCCGCAGAATACTTTAGTCGATTTTCGG  0.058
9      TGCGCAGAGGCAGAAGCAGAGGCACTTAAGGAGACCCCACAGAATACTTTAGTCGATTTTCGG  0.050
10     CATACAATTAAGGGAATGAGCAAGCCCGAAGTACTTATATGAGGGCGTGCAAGGTGCCAGCGT  0.039

Abbreviation: SNPs, single nucleotide polymorphisms.
Figure 1 Linkage disequilibrium patterns across ATM (A) and CYP19A1 (B) genes. Dark shading indicates high linkage disequilibrium (measured by D′); light shading indicates low linkage disequilibrium. Bold lines indicate “block” boundaries.
In addition to saving genotyping costs, limited haplotype diversity can also be used to avoid overcorrection for multiple testing and thus increase power of association analyses. Say researchers had measured all 63 SNP alleles depicted in Figure 1A on samples of haplotypes taken from cases and controls. Two naive approaches to test the global null hypothesis of no association between genetic variation among these 63 SNPs and case-control status would be to perform 63 tests (one for each SNP) comparing allele frequencies between cases and controls, or to perform one multivariable test by regressing case-control status on allele counts from the 63 SNPs. For the first approach, standard techniques for controlling the familywise error rate (the probability that even one test incorrectly rejects the null hypothesis of no association) or controlling the false discovery rate (the fraction of rejected hypotheses that are falsely rejected) are notoriously inefficient when the tests are highly correlated, as they are here (8–10). Intuitively, the number of nonredundant tests is much smaller than 63. The first four SNPs in Table 1 are perfectly correlated, so if one of those SNPs has been tested, then all have been. The second approach leads to a single chi-squared test statistic (e.g., from a likelihood ratio test) but with a large number of degrees of freedom (63). Both of these approaches require relatively large test statistics to overcome the multiple-testing adjustment and reach statistical significance.

Testing association between these 10 ATM haplotypes and case-control status, instead of individual SNPs, reduces the multiple-testing penalty. If each haplotype is tested for association in turn (say, by comparing carrier frequencies between cases and controls) then only 10 tests have been conducted, not 63. Alternatively, regressing case-control status on haplotype counts leads to a nine-degree-of-freedom test (one haplotype has to be left out as the referent), not a 63-degree-of-freedom test. This reduction in the multiple-testing penalty does not guarantee greater power, as there is a trade-off between test signal (effect size) and overall number of tests or degrees of freedom (11). For example, if SNP 1 is the causal SNP, the single test of SNP 1 perfectly captures the
association signal, while any single haplotype is an imperfect proxy. Still, the roughly sixfold reduction in the multiple-testing penalty in this case seems likely to outweigh any losses in information from considering haplotypes rather than individual SNPs.

However, realizing these potential gains from haplotype analysis often proves difficult in practice. The key to these gains is limited haplotype diversity, such as that depicted in Table 1 and Figure 1A. Not all regions that investigators wish to study exhibit limited haplotype diversity. The larger the region, the more the diversity. But there is no rule of thumb that says how large is too large. The CYP19A1 gene is slightly smaller than ATM (129.1 kb vs. 146.3 kb), but none of the haplotypes defined by 119 common SNPs spanning CYP19A1 has frequency above 5% in the HapMap CEU sample. The cumulative frequency of the 26 haplotypes with frequency greater than 1% is only 56.6%—so more than 40% of the 120 haplotypes in this sample are unique. Even assuming these haplotypes could be measured directly—as we discuss below, haplotypes are typically inferred from genotype data—the association analysis approaches outlined above will not be much more powerful than an SNP-by-SNP approach and in fact may not be valid due to sparse data. Testing each haplotype in turn will require over 90 tests, and regressing on haplotype counts will lead to a test statistic with over 90 degrees of freedom. Since each haplotype is carried by very few subjects, estimates of haplotype-specific association parameters will be unstable. This problem can be ameliorated somewhat by restricting analysis to haplotypes observed at least (say) 10 times, but even for large sample sizes the cumulative frequency of excluded rare haplotypes may be greater than 20%. Including these rare haplotypes in the referent category or as a separate polyglot category can reduce power and make the association parameters difficult to interpret. Just as association analysis cannot be used to find causal alleles that do not vary, standard association analyses that estimate distinct association parameters for each haplotype break down when everybody carries two unique haplotypes. Nor is it clear that choosing a subset of SNPs that identify individual haplotypes in this case will lead to a great reduction of genotyping costs without a loss in information. The alleles at 19 SNPs are required to distinguish the 119-SNP haplotypes spanning CYP19A1 above 1% frequency in the HapMap CEU sample—so the proportionate reduction in genotyping costs from using haplotype tagging SNPs is smaller for CYP19A1 than for ATM. (We adopt the convention that “haplotype tagging SNPs,” a.k.a. htSNPs, are chosen to distinguish haplotypes, while “tagging SNPs” are chosen as surrogates for individual SNPs.) However, these 19 haplotype-tagging SNPs selected using the HapMap CEU panel may not distinguish CYP19A1 haplotypes even in another sample of subjects with northern or western European ancestry. To begin with, it is very likely the new sample will contain haplotypes not observed in the CEU panel. Furthermore, methods for choosing htSNPs based on the correlation between htSNP haplotypes or genotypes of htSNPs and underlying haplotypes will tend to overestimate the performance of htSNPs, as correlations for rare alleles tend to be overestimated in small samples (12).
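As a concrete illustration of the regression-based global test described above, here is a minimal sketch using Python's statsmodels package; the haplotype counts, frequencies, and case-control labels are simulated placeholders, and in a real analysis the counts would come from a phasing step such as those discussed in the next section:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulated placeholders: n subjects, H = 10 haplotypes; each subject
# carries two haplotypes, so each row of counts sums to 2.
n, H = 1000, 10
freqs = np.array([.200, .167, .150, .105, .092, .073, .058, .058, .050, .039])
freqs /= freqs.sum()
counts = np.array([np.bincount(rng.choice(H, size=2, p=freqs), minlength=H)
                   for _ in range(n)])
y = rng.integers(0, 2, size=n)  # case-control status, simulated under the null

# Drop one haplotype as the referent, giving an (H - 1)-df
# likelihood-ratio test against the intercept-only model.
X = sm.add_constant(counts[:, 1:].astype(float))
fit = sm.Logit(y, X).fit(disp=0)
print(f"LR chi2 = {fit.llr:.2f} on {fit.df_model:.0f} df, p = {fit.llr_pvalue:.3f}")
```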
One way to avoid these problems is to break the larger region of interest into smaller contiguous, nonoverlapping regions called "blocks" that do exhibit limited haplotype diversity. There are many algorithms for partitioning a set of SNPs into blocks, some based on measures of pairwise LD (as measured by |D′|), some based on measures of haplotype diversity, and some based on both between-block LD and haplotype diversity (13–16). Figures 1A and 1B show the block partitions for ATM and CYP19A1 based on the HapMap CEU data using the default blocking algorithm implemented in the program haploview (http://www.broad.mit.edu/mpg/haploview/). The block structure of ATM is quite simple, while CYP19A1 can be broken into nine multi-SNP blocks and several interblock regions containing 15 SNPs. (This structure could be simplified slightly by
combining blocks with high pairwise LD as long as the resulting merged block retains limited haplotype diversity—defined, e.g., as having a cumulative frequency of common haplotypes greater than 80%. This algorithm can be implemented by hand in haploview by clicking and dragging to define custom blocks.) Although association analysis of haplotypes within blocks is relatively straightforward, parsing the region of interest into multiple blocks creates its own set of problems. First, this reintroduces the problem of multiple, correlated tests, as all the haplotypes in multiple blocks need to be tested. Second, the block partition will depend on the density of SNPs in the panel used to define blocks and the particular blocking algorithm used (17,18). This can complicate comparison of results across studies that used different panels and algorithms to select htSNPs. Third, htSNPs that capture variation within the blocks may not capture information about interblock SNPs. To achieve comprehensive coverage of the target region, these individual SNPs will have to be genotyped and tested. In fact, the concept of contiguous, nonoverlapping haplotype blocks—although useful in terms of describing patterns of variation—is rather ad hoc and not rooted in any population genetics model. Local variation in recombination rates and population history do create block-like patterns of linkage disequilibrium, but common haplotypes often overlap block boundaries and recombination hotspots (2). A better metaphor might be to consider each haplotype as an imperfect mosaic of other haplotypes (19,20). These considerations have led researchers to adopt an alternative paradigm for choosing tagging SNPs and testing association between unmeasured variants and disease. Instead of testing association between individual haplotypes and a trait, this paradigm uses the underlying haplotype structure to select simple proxies for individual untyped SNPs (21,22). These proxies are then tested in lieu of unmeasured SNPs or used to infer alleles at the unmeasured SNPs. We discuss these methods further in subsequent sections.

HOW CAN HAPLOTYPES BE INFERRED FROM GENOTYPES?

Although the scale and speed of SNP genotyping technologies have greatly increased and per-genotype costs have dramatically fallen in the last five years, it remains time-consuming, labor-intensive, and expensive to directly measure autosomal haplotypes. Thus, as alluded to above and illustrated in Figure 2, one of the principal statistical challenges in haplotype analysis is how to infer phased haplotypes from unphased genotypes. If an individual is heterozygous at more than one locus, then—absent additional information about population haplotype frequencies—there are multiple possible haplotype configurations that are consistent with the observed genotypes.
Figure 2 Example of haplotype ambiguity (phase uncertainty) given observed genotypes.
Modern statistical approaches treat in silico inference of haplotypes from observed genotypes as a missing data problem, where the missing data is haplotype phase, that is, which alleles lie on the maternal chromosome and which alleles lie on the paternal chromosome. To illustrate how these methods work, we will assume for now that the study subjects (including the triply heterozygous subject illustrated in Fig. 2) are drawn completely at random from a population in Hardy-Weinberg equilibrium. The implications of sampling conditional on a trait that may be associated with the locus under study (e.g., case-control sampling or sampling the extremes of a continuous phenotype) and departures from Hardy-Weinberg equilibrium will be discussed in later sections. In general, the probability of a particular haplotype pair given the observed genotypes and known haplotype frequencies can be calculated using Bayes' theorem as

\[
\Pr(H \mid G; q) \;=\; \frac{I[H \sim G]\,\Pr(H; q)}{\sum_{H^{*}} I[H^{*} \sim G]\,\Pr(H^{*}; q)}, \qquad \text{(i)}
\]
where H is a haplotype pair, G is the set of observed genotypes, I[H ∼ G] is an indicator that H is consistent with G, and q is a vector of haplotype frequencies. Since we have assumed Hardy-Weinberg equilibrium, the probability of a haplotype pair is simply the product of haplotype frequencies: Pr(H = (h_m, h_p); q) = q_{h_m} q_{h_p}. If only two of the eight possible haplotypes are known to exist in the population, say AAG and GGA, then only one of the four possible haplotype configurations in Figure 2 (AAG-GGA) is possible, since all the others involve haplotypes that do not occur in the population. On the other hand, if haplotype AAA has 30% frequency and all others have 10%, then there is much more ambiguity in the haplotype configuration. Each of the four configurations in Figure 2 is possible, although one pair of haplotypes (AAA-GGG) is more likely (50% vs. 17%) than each of the others. How to account for this ambiguity when assessing the association between haplotypes and a trait is a major topic in the section on haplotype association analysis below. This approach to inferring unobserved phased haplotypes from genotype data can also be used when there is missing genotype information—the set of haplotype pairs consistent with the observed genotypes is simply larger in this case. For example, if the genotype at the second locus in Figure 2 were missing, the sum in the denominator of expression (i) would be over eight unique haplotype configurations of the forms AX1A-GX2G or AX1G-GX2A, where X1 and X2 could be A or G. The relative probability of different possible haplotype pairs again depends on the haplotype frequencies q. As a nice by-product of this approach, the probability that a missing locus has a particular genotype can be inferred by summing the probabilities of the haplotypes that are consistent with that genotype and the observed genotypes. This is loosely how newer methods for imputing untyped genotypes proceed. Of course, the haplotype frequencies q are rarely known a priori. Most methods for haplotype inference proceed iteratively, first guessing a value for q, then inferring haplotypes given q and the observed genotypes, and then updating the guess for q based on the inferred haplotypes. Many popular software packages for estimating haplotype frequencies—for example, PROC HAPLOTYPE in the SAS Genetics package, SNPHAP (http://www-gene.cimr.cam.ac.uk/clayton/software/snphap.txt), and tagSNPs (http://www-rcf.usc.edu/~stram/tagSNPs.html)—use some version of the expectation-maximization (EM) algorithm (23–25), a statistical method for inference in the presence of missing data (here: haplotype phase).
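To make the EM iteration concrete, here is a minimal sketch (our illustration, not code from any of the packages above) of haplotype-frequency estimation for two diallelic SNPs under Hardy-Weinberg equilibrium; the function and variable names are ours:

```python
from itertools import product

def em_haplotype_freqs(genotypes, n_iter=50):
    """EM estimation of two-SNP haplotype frequencies from unphased genotypes.

    genotypes: list of (allele count at SNP 1, allele count at SNP 2), each in
    {0, 1, 2}. Only double heterozygotes (1, 1) have ambiguous phase.
    """
    haps = list(product((0, 1), repeat=2))          # 00, 01, 10, 11
    freqs = {h: 1.0 / len(haps) for h in haps}      # initial guess for q
    for _ in range(n_iter):
        expected = {h: 0.0 for h in haps}
        for g in genotypes:
            # E-step: enumerate haplotype pairs consistent with genotype g
            pairs = [(h1, h2) for h1 in haps for h2 in haps
                     if all(h1[i] + h2[i] == g[i] for i in range(2))]
            probs = [freqs[h1] * freqs[h2] for h1, h2 in pairs]   # HWE
            total = sum(probs)
            for (h1, h2), p in zip(pairs, probs):
                w = p / total                       # Bayes posterior, as in (i)
                expected[h1] += w
                expected[h2] += w
        # M-step: re-estimate q from expected haplotype counts
        n_chrom = 2 * len(genotypes)
        freqs = {h: expected[h] / n_chrom for h in haps}
    return freqs

# toy data: (allele count at SNP 1, allele count at SNP 2) per subject
data = [(1, 1), (1, 1), (2, 2), (0, 0), (1, 0), (2, 1)]
print(em_haplotype_freqs(data))
```

Each E-step distributes every ambiguous genotype's probability mass over the haplotype pairs consistent with it, exactly as in expression (i); the M-step re-estimates q from the expected haplotype counts.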
Although conceptually simple, fast, and accurate over short distances, the EM algorithm does not utilize any subject-matter-specific knowledge, other than assuming the distribution of haplotype pairs follows Hardy-Weinberg proportions. The EM algorithm does not even "notice" marker order: haplotype frequency estimates will not change if markers are reordered (although of course the alleles along the haplotypes will have been reordered). This naïve approach can break down over long distances, where recombination plays a larger role. To improve haplotype-frequency and individual-haplotype estimation, a number of researchers have suggested more sophisticated approaches that explicitly model the recombination and mutation processes (26,27), as implemented, for example, in the software PHASE (http://stephenslab.uchicago.edu/software.html). These approaches are more accurate than the EM algorithm, especially when applied to large regions (e.g., entire chromosomes) (26). This is particularly relevant when using haplotype structure to impute missing SNP genotypes. However, analyses that explicitly consider the association between haplotypes and a trait are often either restricted to two loci thought to interact in cis or restricted to small regions of limited recombination. In these cases the EM algorithm and the more sophisticated methods should return very similar frequency and individual-haplotype estimates. Most haplotype estimation algorithms return subject-specific haplotype probabilities Pr(H | G; q̂) as well as estimated haplotype frequencies. These subject-specific probabilities will be useful later for haplotype association analyses. Up to this point we have restricted our attention to methods for inferring haplotype phase from multilocus genotypes measured in unrelated samples. These methods can be extended to data from related individuals—e.g., offspring-parent trios (28). Family data can improve the accuracy of inferred haplotypes, because haplotypes that are inconsistent with Mendelian inheritance can be excluded.

HOW CAN A SUBSET OF (HAPLOTYPE) TAGGING SNPs BE SELECTED?

Choosing (haplotype) tagging SNPs requires that researchers have resequencing or dense genotyping data on the region to be studied in a screening sample from a representative population. Five years ago, investigators had to acquire such data themselves—a painstaking, slow, and expensive process given the relatively low-throughput genotyping methods and limited information about sequence variation then available. Now public resources such as the HapMap and large-scale resequencing projects can provide a screening sample. The patterns of variation estimated from this sample are used to select the (haplotype) tagging SNPs. This raises several questions. How dense does the genotyping need to be so that the observed variation also captures the unobserved variation? (I.e., when can we safely assume we have captured information about both the "known knowns"—the SNPs genotyped in the screening panel—and the "known unknowns"—the common SNPs that are there but not genotyped in the screening panel?) How large does the screening panel need to be to accurately estimate the relevant covariance patterns? How representative does the screening panel need to be? Could the HapMap Utah sample of people of northern and western European ancestry be used to select tagging SNPs for a study of diabetics from the United Kingdom? From Spain? Latinos from Los Angeles?
Empirical studies suggest that the law of diminishing returns applies to increasing marker density: if about one SNP every 10,000 base pairs has been genotyped in the screening panel, doubling the genotyping density can greatly improve the performance of
(haplotype) tagging SNPs, but if the screening panel has already been genotyped at about one SNP every 1000 base pairs, doubling the genotyping density (essentially genotyping every SNP) provides little increase in information (22). This is related to another empirical observation. Most SNPs are in high linkage disequilibrium with several other SNPs, but there is a minority of SNPs that are not in high linkage disequilibrium with any other SNP. Each of these "unhappy" SNPs (unhappy because they do not have any "friends") has to be genotyped directly if investigators want to completely cover a region. Very loosely stated: to achieve complete coverage, about 80% of the genotyping budget will be spent measuring 20% of the variation. Similarly, the law of diminishing returns applies to the size of the screening sample. With only 30 independent chromosomes (15 unrelated subjects), the correlations among markers can be overestimated, and so can the performance of tagging SNPs selected from such small samples. However, screening panels with 90 to 120 independent chromosomes estimate the ability of tagging SNPs to capture common variants relatively well; adding samples to the screening panel does not appreciably improve the performance of tagging SNPs (29). (We note that some overestimation of tagging SNPs' performance is inevitable—as with any procedure that builds a predictor of multiple outcomes using multiple inputs. Researchers could use a bootstrap procedure to estimate tagging SNPs' performance (12), or simply select tagging SNPs using conservative performance criteria, e.g., requiring every SNP to have a pairwise r² above 90% with at least one tagging SNP when an r² threshold of 80% would suffice.) These considerations suggest that the phase II HapMap has adequate marker density (about one SNP per 1000 base pairs) and sample size (45 to 60 unrelated subjects in each of its four panels) to serve as a useful screening panel for tagging SNP selection. Recent empirical studies have also shown that tagging SNPs chosen using the HapMap CEU panel effectively capture variation in European or U.S.-European–descended samples (30) and that tagging SNPs selected from the pooled Han Chinese in Beijing (CHB) and Japanese in Tokyo (JPT) panels perform well in other east Asian samples (31). Interestingly, tagging SNPs selected from a "cosmopolitan" panel combining all four HapMap reference panels perform well in a wide range of populations, including admixed populations such as African-Americans and Mexicans living in Los Angeles (30). Thus the HapMap should serve as a good reference panel even for samples from populations with no obvious analog among the four HapMap panels (32). Given a screening panel, a large number of algorithms for selecting tagging SNPs are available (7,21,22,25,33–36). Precisely because there are so many, we are unaware of any comprehensive comparison of these algorithms across a wide range of situations (degree of linkage disequilibrium, marker density) in terms of power to detect a genetic association or efficiency, i.e., the number of markers needed to achieve a given power. But for a screening panel with the marker density of the HapMap, the existing comparisons suggest that the choice of tagging algorithm does not greatly affect the efficiency of tagging SNPs (37). The biggest determinant of tagging SNP performance is the density of the tagging SNPs themselves.
For fixed density, using local linkage disequilibrium patterns to choose SNPs can improve performance over simply selecting at random without reference to linkage disequilibrium (22). But it is not clear that, for fixed density, one method of tagging SNP selection will consistently yield much greater power than others. We will discuss two general approaches to tagging SNP selection. The first, haplotype tagging, chooses htSNPs with the aim of distinguishing phased haplotypes. This is done by maximizing some measure of how much information the haplotype tagging SNPs contain, such as the haplotype diversity or entropy of tagging-SNP haplotypes or the squared correlation between tagging-SNP haplotypes and complete
haplotypes (7,35,36). Of course, complete enumeration of all potential tagging SNP subsets of sizes 1, 2, 3, etc., can be quite time-consuming if the region to be tagged contains more than 20 SNPs, and in general there is no unique solution to this problem: different subsets may yield the same haplotype diversity. Most tagging-SNP selection algorithms use some fast search algorithm to find tagging SNP sets with optimal or near-optimal performance. Hence the proliferation of tagging-SNP selection algorithms: not only are there many choices for the metric to be maximized, but there are also many choices for the search algorithm to find the subset(s) of SNPs that maximize this metric. However, these measures of haplotype-tagging-SNP informativeness assume that the phase of the haplotypes in the screening sample is known and that the phase of the tagging-SNP haplotypes in the main study will be known, even though phase almost always has to be estimated from unphased genotype data, as described in the previous section. Alternatively, the squared correlation between complete haplotypes and unphased tagging-SNP genotypes, Rh² (25), accounts for the uncertainty in determining haplotype phase from unphased genotypes. This measure also has a nice interpretation in terms of the relative efficiency of tagging SNPs: the power of a study with n cases and n controls to detect an association between a directly measured haplotype h and disease is the same as that of a study with (1/Rh²) n cases and (1/Rh²) n controls using haplotype-tagging SNPs. Because of this, Rh² × n is often referred to as the "effective size" of a study using tagging SNPs. Haplotype-tagging methods work best in regions of limited haplotype diversity. In regions of high haplotype diversity, the power of haplotype-tagging SNPs can be degraded because the correlations Rh² tend to be overestimated in the screening sample. Furthermore, if most haplotypes are rare, any analysis based on associations between individual haplotypes and a trait will face problems because of sparse data and multiple testing. In contrast to haplotype tagging, which captures information about individual SNPs only through the correlation between SNPs and haplotypes, pairwise tagging selects diallelic markers that serve as direct surrogates for unmeasured SNPs: every SNP in the target region has a pairwise correlation (r²) above some user-defined threshold with at least one of the tagging SNPs. An influential early implementation of this approach (21) simply partitioned the set of SNPs to be tagged into subsets or "bins," where all pairs of SNPs within each bin were highly correlated with each other, and any SNP within a bin can be selected as a tagging SNP for the remainder of the SNPs in the bin; a sketch of this greedy binning idea appears after the next paragraph. Note that these linkage-disequilibrium bins are not the same as haplotype blocks: for example, bins are not required to be contiguous and are typically highly interdigitated. One nice advantage of this approach over haplotype tagging is that it does not require the region to be preprocessed into haplotype blocks. Because many SNPs are "unhappy," some bins will contain only a single SNP, increasing the number of tagging SNPs needed to cover a region comprehensively. One way to boost the efficiency of pairwise tagging is to consider multimarker haplotypes as well as individual SNPs as potential surrogate markers (22). This differs from haplotype tagging in several respects. First, the goal is not to measure the haplotype per se, but to use the haplotype as a predictor for some unmeasured SNP.
Second, the markers used to define the haplotype need not be contiguous. Finally, not all haplotypes are tested; rather, one multimarker haplotype is used to define a diallelic marker, and that "pseudo-SNP" is tested as a surrogate for one or more unmeasured SNPs. This "aggressive tagging" approach has been implemented in the tagger program, which has been integrated into haploview and is also available through an online server (http://www.broad.mit.edu/mpg/tagger/).
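The greedy binning idea referenced above can be sketched in a few lines. This is our illustration of the general approach, not the actual algorithm of reference 21: it assumes a precomputed matrix of pairwise r² values and simply keeps the seed SNP of each bin as its tag, whereas in practice any SNP in the bin that exceeds the threshold with all the others could serve.

```python
import numpy as np

def greedy_bins(r2: np.ndarray, threshold: float = 0.8):
    """Partition SNPs (indices 0..n-1) into LD bins by greedy selection.

    r2: symmetric n x n matrix of pairwise r^2 values (r2[i, i] == 1).
    Returns (tag_snp, covered_snps) pairs; singleton bins correspond to the
    "unhappy" SNPs that must be genotyped directly.
    """
    untagged = set(range(r2.shape[0]))
    bins = []
    while untagged:
        # seed: the SNP exceeding the threshold with the most untagged SNPs
        seed = max(untagged,
                   key=lambda i: sum(r2[i, j] >= threshold for j in untagged))
        covered = {j for j in untagged if r2[seed, j] >= threshold}
        bins.append((seed, covered))
        untagged -= covered
    return bins

# toy example: SNPs 0-2 mutually correlated, SNP 3 "unhappy"
r2 = np.array([[1.00, 0.90, 0.85, 0.10],
               [0.90, 1.00, 0.95, 0.20],
               [0.85, 0.95, 1.00, 0.15],
               [0.10, 0.20, 0.15, 1.00]])
print(greedy_bins(r2))  # one three-SNP bin plus a singleton bin for SNP 3
```

Note how the bins, unlike blocks, are defined purely by correlation and need no prior partition of the region.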
The differences between haplotype-tagging and pairwise-tagging SNPs have implications for data analysis. Because haplotype tagging SNPs are chosen to collectively distinguish haplotypes, they may not be effective surrogates for individual SNPs if analyzed one at a time (multiple marginal tests). On the other hand, there is no guarantee that haplotypes of pairwise tagging SNPs accurately predict all common haplotypes in a region. Of course, once selected, the performance of a set of tagging SNPs can be evaluated using any metric, so a set of haplotype-tagging SNPs can be evaluated in terms of its pairwise-tagging properties and vice versa. Which measure of performance is more appropriate will depend on researchers' a priori beliefs about the properties of causal variants for the trait under study. If they believe most causal variants are themselves SNPs or are highly correlated with individual known SNPs, then pairwise tagging is more appropriate. If they believe causal variants are likely to be highly correlated with common haplotypes but not with any individual known SNP—perhaps the causal variant is itself a haplotype due to cis interaction, or a copy number variant in strong linkage disequilibrium with a haplotype—then haplotype tagging may be more appropriate. Much remains unknown about the spectrum of causal variants underlying disease-related traits, so it is not obvious which of these two scenarios is more likely. However, the success of multiple genomewide studies that have taken a pairwise tagging approach (38–44) suggests that there are more causal variants in sufficient linkage disequilibrium with individual SNPs waiting to be found.

HOW DO I ANALYZE ASSOCIATIONS BETWEEN A TRAIT AND HAPLOTYPES FROM UNPHASED GENOTYPE DATA?

We group the wide range of haplotype-association methods (45–53) into three categories, roughly in order of statistical sophistication: those that compare haplotype frequencies between cases and controls, "plug-in" or single-imputation methods that use estimates of individuals' haplotypes in standard regression models as if they were observed, and marginal regression methods that integrate over the unknown phase information. The marginal methods are, in principle, more efficient (48,54), although, as we discuss below, the plug-in methods are somewhat more flexible (association analyses can be conducted with standard software such as SAS, R, or STATA) and have relatively good efficiency in most practical situations.

Comparing Haplotype Frequencies in Cases and Controls

The earliest methods of haplotype association analysis were developed for case-control studies and use a likelihood ratio statistic to test for differences in case and control haplotype frequencies (55). This statistic will be approximately chi-squared distributed (with degrees of freedom equal to the number of haplotypes minus one) as long as each haplotype appears at least five times (say) in cases and in controls. In practice, however, such rare haplotypes are the rule rather than the exception, and in these situations a permutation procedure should be used to evaluate statistical significance. Permutation p values for this test can be calculated using PROC HAPLOTYPE in SAS Genetics, for example. Individual haplotype odds ratios can be calculated from the two-by-two table of haplotype frequencies in cases and controls (as in SAS PROC HAPLOTYPE).
However, these odds ratios do not have the usual interpretation (increase in log odds of disease for an individual per copy of the putative risk haplotype relative to all other haplotypes), except in the special case where the study base is in Hardy-Weinberg equilibrium and
disease risk is log-linear in the number of putative risk haplotypes carried (56). This simple approach also cannot easily adjust for important measured covariates that may be needed to control for potential confounding or to assess gene-environment interaction, and the approach is obviously restricted to the analysis of binary traits. These limitations sparked the development of more flexible analytic methods.

Haplotypes in a Regression Framework

If all subjects' haplotypes were known, haplotype analysis would simply be a special case of analyzing the association between a multiallelic locus and a trait. For example, for binary disease traits one could fit a logistic regression model of the form

\[
\log\frac{m}{1-m} = \alpha + \beta' Z(H) + \gamma' X, \qquad \text{(ii)}
\]

where m is the probability of disease, X is a vector of observed covariates, and Z(H) is a numeric coding relating the haplotype pair H to risk of disease. In principle, there is a wide range of possible codings Z(H) (57), but in practice two are commonly used. The first regresses disease risk on a vector of haplotype counts, Z(H) = (n_1, ..., n_{J−1})′, where n_j is the number of copies of the jth haplotype carried by an individual and J is the total number of haplotypes. For identifiability, one haplotype (here indexed with j = 0) is set as the referent. If there are many rare haplotypes, these can either be combined into one haplotype class, so that n_{J−1} refers to the number of rare haplotypes carried, or, if the cumulative frequency of the rare haplotypes is quite small (say fewer than 10 are expected in the total sample), the rare haplotypes can be combined into the reference category by excluding them from Z(H). This coding assumes risk increases linearly on the log odds scale with each extra risk haplotype and has two nice properties. First, the resulting model is invariant to the choice of reference haplotype, which is typically chosen to be the most common haplotype. Second, this linear coding also provides a convenient single test of the global null hypothesis that variation in haplotypes at a locus is associated with disease risk: one simply compares the model including Z(H) to the model without it using a standard likelihood ratio, score, or Wald test. The resulting test statistic can be compared to a chi-squared distribution with J − 1 degrees of freedom. The other common haplotype coding relates risk to a particular haplotype, while allowing risk to differ for heterozygote and homozygote carriers of the putative risk haplotype j: Z(H) = (I[n_j = 1], I[n_j = 2])′. Here I[·] is an indicator function, and the reference category is noncarriers of haplotype j. This coding is useful when the primary aim of the analysis is characterization, that is, describing the effect of a particular haplotype hypothesized to be associated with disease risk, or when an individual haplotype is used as a surrogate marker for untyped SNPs (as in the "aggressive" tagging approach implemented in tagger). This regression approach can also be applied to continuous traits using the linear regression model
\[
m = \alpha + \beta' Z(H) + \gamma' X, \qquad \text{(iii)}
\]
where m is the mean trait value. In fact, the coding Z(H) could be used in any standard regression approach, including generalized linear models, generalized estimating equations for repeated measurements or correlated traits, conditional logistic regression, and Cox proportional hazards models. In particular, this makes testing or estimating gene-environment interaction effects relatively straightforward. For example, for a binary
exposure X taking values 0 or 1, one could include the standard cross-product interaction terms in the logistic model to assess whether haplotype effects differ across strata of X:

\[
\log\frac{m}{1-m} = \alpha + \beta' Z(H) + \gamma' X + \delta' X Z(H).
\]

For the additive coding, this leads to a J − 1 degree-of-freedom test of departures from an additive model on the log-odds scale for gene-environment interaction (H_0: δ = 0). In practice, subjects' haplotype pairs will not be known, but both the plug-in and marginal methods are based on these regression models and utilize some coding Z(H). Thus, the answer to the hypothetical question "If I could observe H, how would I analyze these data?" can help guide haplotype analysis.

Plug-in (A.K.A. Single Imputation) Methods

Plug-in methods attempt to solve the problem of unknown phase by replacing the unobserved coding Z(H) in the regression equation with some estimate Z*(G) based on the observed genotypes G. The simplest is to treat the most probable haplotype pair as the true haplotype pair, i.e., Z*(G) = Z(H_max), where H_max is the haplotype pair with maximum posterior probability conditional on the observed genotypes, Pr(H | G; q̂). As discussed above, these individual posterior probabilities are natural by-products of most haplotype frequency estimation algorithms. However, treating the most likely pair as observed induces measurement error: the resultant Z*(G) may not be equal to the true, unobserved Z(H), so parameter estimates are generally biased. The degree of bias depends on the degree of phase ambiguity. Furthermore, this approach does not account for the uncertainty in Z*(G), so confidence intervals for the haplotype regression parameters β are generally too narrow (49,54), potentially leading to an inflated type I error rate. A more attractive plug-in approach replaces Z(H) with its expected value, Z*(G) = E[Z(H) | G], where the expectation is over the posterior distribution Pr(H | G; q̂) (49–51). This approach has been shown to provide valid and efficient tests of the null hypothesis of no haplotype effect (58,59). Away from the null, this approach is in principle less efficient than the marginal methods discussed in the next section, and, because it does not account for the added variability induced by the estimation of q̂, it may also underestimate the variability in the regression parameters β̂. However, in situations of practical relevance—i.e., in regions of limited haplotype diversity, limited missing data, and modest genetic effect—the performance of this "expectation-substitution" method is nearly indistinguishable from that of comparable marginal methods (49,60). A closely related approach is based on weighted regression analysis (52). In this approach, for a subject with several compatible haplotype pairs, several records are created, each carrying one of the possible pairs. Records in this expanded data set then have known phase. A weighted likelihood analysis is then used to analyze the data: each record is given a weight corresponding to the conditional probability of that haplotype pair given the genotype, and a robust "sandwich" estimator is used to estimate the variance. All three of these approaches use estimated subject-specific haplotype probabilities conditional on observed genotypes, which are typically calculated assuming haplotypes are in Hardy-Weinberg equilibrium, using haplotype frequencies estimated from all subjects.
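As an illustration of the expectation-substitution idea, the following sketch (ours, with hypothetical function names; not from refs. 49–51) builds the expected additive coding E[Z(H) | G] from subject-specific posterior haplotype probabilities and plugs it into an ordinary logistic regression:

```python
import numpy as np
import statsmodels.api as sm

def expected_counts(posterior, haplotypes):
    """E[haplotype counts | G] from a dict mapping (h1, h2) pairs to Pr(H | G)."""
    e = np.zeros(len(haplotypes))
    for (h1, h2), p in posterior.items():
        e[haplotypes.index(h1)] += p
        e[haplotypes.index(h2)] += p
    return e

def plugin_logistic(posteriors, y, haplotypes, referent=0):
    """Regress case-control status y on expected haplotype counts.

    posteriors: one posterior dict per subject, e.g. from an EM-type routine
    like the one sketched earlier. The referent haplotype's column is dropped
    for identifiability, mirroring the additive coding Z(H).
    """
    Z = np.array([expected_counts(p, haplotypes) for p in posteriors])
    Z = np.delete(Z, referent, axis=1)
    return sm.Logit(y, sm.add_constant(Z)).fit(disp=False)
```

As the chapter notes, standard errors from such a fit ignore the uncertainty in q̂ and so may be slightly too small away from the null.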
For case-control samples (or other ascertained samples, e.g., subjects with extreme values of a continuous trait), the pooled sample may not be in Hardy-Weinberg equilibrium if haplotypes are indeed associated with disease risk. Furthermore, in principle the imputation should be done conditionally on all observed data, including
phenotypes, not just the genotype data, which none of these approaches (as we have sketched them) do. Still, these approaches appear to perform quite well in regions of limited haplotype diversity and modest genetic effect (49,60). If large haplotype effects are observed, the analysis should be repeated estimating frequencies separately in cases and controls (50). The primary advantage of these plug-in approaches has been their computational simplicity: given estimates of posterior haplotype probabilities, they can be implemented using standard statistical software, whereas the marginal approach requires specialized routines. However, the recent development of flexible software such as the R function haplo.glm (in the haplo.stats package) and stand-alone programs like chaplin (http://www.genetics.emory.edu/labs/epstein/software/chaplin/) and hapstat (http://www.bios.unc.edu/~lin/hapstat/) has made the marginal methods available to a wider community of researchers, who may not have the skills or time to implement these more sophisticated methods.
Marginal Methods

Marginal methods for haplotype association extend methods for haplotype frequency estimation by incorporating a penetrance function that models the trait distribution given haplotypes (and other observed covariates). Thus, if one is interested in studying a trait Y using a cross-sectional sample of individuals and is willing to assume that covariates X are independent of H conditional on genotypes G [a looser condition than assuming X is independent of G, although not foolproof: see discussion in (61)], the likelihood for the observed data is proportional to

\[
\Pr(Y, G, X \mid \theta, q) = \sum_{H \sim G} \Pr(Y \mid H, X; \theta)\,\Pr(H; q), \qquad \text{(iv)}
\]

where the sum is over haplotype pairs H consistent with G and θ is a vector of penetrance parameters [e.g., α, β, and γ in equation (iii)]. One advantage of this marginal approach is that both θ and q are estimated simultaneously, so the estimates of θ take the uncertainty in q̂ into account and vice versa. Also, assuming the penetrance and haplotype distribution Pr(H | q) are correctly modeled, maximum likelihood inference based on this likelihood is statistically efficient (54). There are a number of numeric methods available to calculate maximum likelihood estimates for θ and q, but most involve some application of the EM algorithm. For case-control data, the likelihood should in principle account for the sampling scheme. The classic result of Prentice and Pyke (62), that the prospective logistic model applied to retrospectively sampled case-control data yields unbiased and efficient odds ratio estimates, does not generally hold in the haplotype context (53,63). A variation of the cross-sectional likelihood (iv) that estimates haplotype frequencies in controls only is essentially unbiased when the disease is rare for all values of H and X (63,64). This prospective variation can be applied to case-control data with good results in many practical situations (53,65). If sampling fractions are known—e.g., if cases and controls are drawn from a known cohort—it is also possible to modify the prospective likelihood to remove the bias due to the case-control sampling (47,63). Alternatively, analysis of case-control data could be based on the retrospective likelihood Pr(G | Y; θ, q) (48). If all modeling assumptions hold, this retrospective approach yields the most precise estimates for both θ and q. However, deviations from the required assumption of Hardy-Weinberg equilibrium can lead to an intolerable degree of bias, although modeling departures from HWE can reduce this bias somewhat (65).
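To connect likelihood (iv) to code, here is a bare-bones sketch (our illustration; hapstat and related programs are far more sophisticated) of the prospective marginal log-likelihood for a binary trait with an additive haplotype coding and no covariates; it could be handed to a generic optimizer such as scipy.optimize.minimize to estimate the parameters jointly:

```python
import math

def marginal_loglik(alpha, beta, q, subjects):
    """Log of likelihood (iv) summed over subjects; binary trait, no covariates.

    beta: dict mapping haplotype -> per-copy log odds ratio (referent omitted)
    q: dict mapping haplotype -> population frequency (HWE assumed)
    subjects: list of (y, pairs), where pairs lists the haplotype pairs
              consistent with that subject's genotypes
    """
    ll = 0.0
    for y, pairs in subjects:
        lik = 0.0
        for h1, h2 in pairs:
            eta = alpha + beta.get(h1, 0.0) + beta.get(h2, 0.0)
            mu = 1.0 / (1.0 + math.exp(-eta))      # Pr(Y = 1 | H) under (ii)
            lik += (mu if y == 1 else 1.0 - mu) * q[h1] * q[h2]
        ll += math.log(lik)
    return ll
```

The inner sum over compatible pairs is exactly the sum over H ∼ G in (iv): penetrance times the HWE probability of the pair.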
The prospective likelihood can easily incorporate the main effects of observed covariates X and haplotype-covariate interactions through the penetrance function Pr(Y | H, X; θ) (64). Several authors (53,61,63,66) have developed sophisticated methods for fitting the retrospective likelihood Pr(G, X | Y; θ, q) that allow for covariate effects and haplotype-covariate interactions. The difficulty here is that the retrospective likelihood requires that the distribution Pr(X | G) (61) or Pr(X) be estimated. Lin et al. (61), for example, overcome this difficulty using elegant semiparametric methods. These retrospective approaches require additional assumptions, namely, that X is independent of H given G (61) or the somewhat stronger assumption that X and H are independent in the study base (63). If these assumptions hold, retrospective methods can be more powerful than the prospective likelihood, especially for gene-environment interaction parameters (61,63). The assumption of H-X independence is perhaps reasonable for many environmental exposures, but not for those whose distributions differ by ethnicity and not for covariates that might be caused by H, such as smoking behavior or body mass index, so the retrospective methods should be applied with care—e.g., parameter estimates from the retrospective methods should be compared to estimates from the prospective likelihood. Lin et al. (53) have implemented their haplotype association methods in the software hapstat for a range of study designs (cross-sectional, case-control, cohort, and nested case-control). [Although we have focused here on methods for unrelated subjects, analogous haplotype association methods for offspring-parent trios are available (67–69).] Finally, we note that all the marginal methods discussed here assume that all subjects' haplotypes are drawn from the same distribution, typically assumed to be in Hardy-Weinberg equilibrium. This will not be the case if the study contains substantial numbers of subjects from multiple ethnic groups, each of which has its own haplotype distribution: some haplotypes may be common in African-Americans but completely absent in Europeans, for example. In principle, these methods could be modified so that haplotype frequencies are estimated separately for each ethnic group. In practice, none of the standard "off the shelf" software packages implement this modification. This is one situation where plug-in methods retain a practical advantage, as it is relatively straightforward to calculate subject-specific haplotype probabilities by ethnicity.

"HAPLOTYPE ANALYSIS" WITHOUT HAPLOTYPES

In the previous section, we described methods for testing association between a trait and all haplotypes across a given set of markers, or between a trait and a particular haplotype. But whether this approach is the most appropriate will depend on the ultimate goal of the analysis. If investigators wish to test association between a particular haplotype and disease—perhaps because that haplotype is known to be functional, was observed to be associated with risk in a previous study, or was chosen as a surrogate for an untyped variant—then haplotype analysis is required. If the goal of the analysis is to comprehensively test for association between a trait and any common variant in a candidate region, however, haplotype analysis may not be the most convenient or the most powerful approach.
We discussed practical difficulties of haplotype analysis, such as the need to parse the studied region into "blocks" of limited haplotype diversity, in earlier sections. Here we briefly sketch two other methods for testing association between a trait and variation at a locus that do not explicitly test haplotypes but implicitly leverage the local linkage disequilibrium (haplotype) structure.
Imputing Untyped SNPs

New analytic methods combine data on a relatively small set of SNPs genotyped in the main study with a more comprehensive set of SNPs genotyped in an external sample (e.g., HapMap data) or internal substudy (e.g., a subset of participants who have their DNA sequenced) to infer alleles at untyped loci, regardless of how the genotyped SNPs were chosen (20,70–72). (In fact, these methods use sophisticated population genetics models to infer haplotypes, which are then used to impute the missing genotypes.) Because the untyped SNPs can often be inferred with high reliability, this approach can also be used to effectively increase the density of observed markers spanning candidate regions, which in turn should increase the informativeness of simple analytic approaches, such as testing each marker separately. These imputed SNPs could also be used in flexible multivariable regression models, as described below. This imputation approach relies heavily on the densely genotyped reference panel. Ideally, this panel should be drawn from a population with linkage disequilibrium patterns similar to those of the study base (e.g., the HapMap CEU panel for a study consisting of European subjects), although there is some evidence that, in the absence of a directly comparable panel, a "cosmopolitan" panel consisting of the pooled HapMap samples may perform well (20). Another practical limitation of this approach is that not all the genotyped SNPs may be in the reference panel. This is particularly true for "legacy" studies, where SNPs were selected from dbSNP without regard to linkage disequilibrium structure or from a private resequencing panel that may no longer be available to all investigators. Many of the SNPs in these studies may not be in the HapMap and so would be effectively excluded from any imputation procedure that relies on a HapMap reference panel.

Flexible Multivariable Regression Methods for Unphased SNP Genotypes

As mentioned above, one could test for association between a trait and multiple SNP markers spanning a region simply by regressing the trait on counts of minor alleles for all the SNPs simultaneously, for example, using multivariable linear or logistic regression. This approach can be as or more powerful than haplotype analysis in many situations, even when the causal variant has not been genotyped directly or is itself a haplotype (11). There are several potential drawbacks to this simple approach, however. First, most multivariable regression methods require complete data: subjects missing even one genotype will be excluded from the analysis, leading to an appreciable loss in power. It is thus desirable to impute missing SNP data in one way or another before performing association analyses. Missing genotypes could be imputed using the methods described in the previous section, using flexible techniques for imputing correlated data (73), or by calculating haplotypes for sets of SNPs in high linkage disequilibrium. Second, if there are a large number of correlated SNP markers spanning the candidate region(s), it may be difficult to estimate all parameters simultaneously, leading to highly variable parameter estimates and invalid tests. Finally, assuming an additive allelic model for each SNP and an additive interaction model across SNPs may not be the most powerful approach. Allowing for nonadditive (dominant, recessive) SNP effects or nonlinear SNP-SNP interactions may boost power to detect an association.
This may be due to causal interaction among loci (whether or not the risk alleles must be on the same haplotype to have an effect) or to the fact that a nonlinear model better captures information about a single, untyped variant.
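For concreteness, the three standard single-SNP codings look like this (a trivial sketch; the names are ours), with g the minor-allele count in {0, 1, 2}:

```python
def additive(g):
    return g                 # per-allele (log-additive) effect

def dominant(g):
    return int(g >= 1)       # carriers of at least one minor allele vs. none

def recessive(g):
    return int(g == 2)       # homozygous minor-allele carriers vs. all others
```

A multivariable model can mix these codings across SNPs, and products of coded genotypes give the nonlinear SNP-SNP interaction terms mentioned above.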
These difficulties have sparked research into multivariable regression methods that combine feature selection (eliminating "uninteresting" SNPs) and flexible modeling. One simple-minded approach might be simply to fit a forward stepwise regression (allowing pairwise product interaction terms to enter the model before either main effect) and then assess significance via a permutation procedure. More sophisticated approaches include penalized regression (74,75), logic regression (76), regression trees (77), and kernel machines (78). (The last approach is closely related to flexible multimarker tests based on measures of genetic similarity among cases and controls (79–81) but has the advantages that come with a regression framework, including simple adjustment for measured covariates.)

SUMMARY

The empirical fact of limited haplotype diversity among humans makes it possible to measure and test most common SNPs for association with disease-related traits using data on only a small subset of SNPs. Haplotypes of dense SNP markers—or, anticipating near-future technological developments (82), stretches of DNA sequence along a chromosome—may have direct functional relevance or be highly correlated with a causal variant. Because of this, methods that test directly for association between phased haplotypes and traits will remain useful to researchers who wish to discover and understand the genetic variants that contribute to disease risk. These methods continue to be refined, incorporating more sophisticated models for local recombination rates, mutation, and the ancestry of the causal variant (83). However, other methods that implicitly leverage local haplotype structure without using (unobserved) phase information are also useful and convenient, and they provide association tests that in some situations are as or more powerful than phased-haplotype-based tests. Because much remains unknown about the spectrum of causal genetic variants for complex diseases like cancer, it is a good time to "let a hundred flowers bloom" and allow diverse, theoretically sound, and practical analytic approaches to flourish.

REFERENCES

1. Frazer KA, Ballinger DG, Cox DR, et al. A second generation human haplotype map of over 3.1 million SNPs. Nature 2007; 449:851–861.
2. Altshuler D, Brooks LD, Chakravarti A, et al. A haplotype map of the human genome. Nature 2005; 437:1299–1320.
3. Conti DV, Gauderman WJ. SNPs, haplotypes, and model selection in a candidate gene region: the SIMPle analysis for multilocus data. Genet Epidemiol 2004; 27:429–441.
4. Cordell H, Clayton D. A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. Am J Hum Genet 2002; 70:124–141.
5. Beckmann L, Thomas DC, Fischer C, et al. Haplotype sharing analysis using Mantel statistics. Hum Hered 2005; 59:67–78.
6. Daly M, Rioux J, Schaffner S, et al. High-resolution haplotype structure in the human genome. Nat Genet 2001; 29:229–232.
7. Johnson G, Esposito L, Barratt B, et al. Haplotype tagging for the identification of common disease genes. Nat Genet 2001; 29:233–237.
8. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met 1995; 57:289–300.
9. Westfall P, Zaykin D, Young S. Multiple tests for genetic effects in association studies. In: Looney S, ed. Biostatistical Methods. Totowa, NJ: Humana Press, 2002.
10. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Stat 2001; 29:1165–1188.
11. Chapman JM, Cooper JD, Todd JA, et al. Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Hum Hered 2003; 56:18–31.
12. Iles MM. Quantification and correction of bias in tagging SNPs caused by insufficient sample size and marker density by means of haplotype-dropping. Genet Epidemiol 2007; 32:22–28.
13. Zhang K, Jin L. HaploBlockFinder: haplotype block analyses. Bioinformatics 2003; 19:1300–1301.
14. Wall J, Pritchard J. Haplotype blocks and linkage disequilibrium in the human genome. Nat Rev Genet 2003; 4:587–597.
15. Gabriel SB, Schaffner SF, Nguyen H, et al. The structure of haplotype blocks in the human genome. Science 2002; 296:2225–2229.
16. Rinaldo A, Bacanu SA, Devlin B, et al. Characterization of multilocus linkage disequilibrium. Genet Epidemiol 2005; 28:193–206.
17. Ke X, Hunt S, Tapper W, et al. The impact of SNP density on fine-scale patterns of linkage disequilibrium. Hum Mol Genet 2004; 13:577–588.
18. Wall JD, Pritchard JK. Assessing the performance of the haplotype block model of linkage disequilibrium. Am J Hum Genet 2003; 73:502–515.
19. Ayers KL, Sabatti C, Lange K. A dictionary model for haplotyping, genotype calling, and association testing. Genet Epidemiol 2007; 31:672–683.
20. Li Y, Abecasis G. Mach 1.0: rapid haplotype reconstruction and missing genotype inference. Am J Hum Genet 2006; S79:2290.
21. Carlson CS, Eberle MA, Rieder MJ, et al. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet 2004; 74:106–120.
22. de Bakker PI, Yelensky R, Pe'er I, et al. Efficiency and power in genetic association studies. Nat Genet 2005; 37:1217–1223.
23. Excoffier L, Slatkin M. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 1995; 12:921–927.
24. Niu T. Algorithms for inferring haplotypes. Genet Epidemiol 2004; 27:334–347.
25. Stram D, Haiman C, Hirschhorn J, et al. Choosing haplotype-tagging SNPs based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the multiethnic cohort study. Hum Hered 2003; 55:27–36.
26. Stephens M, Scheet P. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am J Hum Genet 2005; 76:449–462.
27. Scheet P, Stephens M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 2006; 78:629–644.
28. Marchini J, Cutler D, Patterson N, et al. A comparison of phasing algorithms for trios and unrelated individuals. Am J Hum Genet 2006; 78:437–450.
29. Zeggini E, Rayner W, Morris AP, et al. An evaluation of HapMap sample size and tagging SNP performance in large-scale empirical and simulated data sets. Nat Genet 2005; 37:1320–1322.
30. de Bakker PI, Burtt NP. Transferability of tag SNPs in genetic association studies in multiple populations. Nat Genet 2006; 38:1298–1303.
31. Conrad D, Jakobsson M, Coop G, et al. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat Genet 2006; 38:1251–1260.
32. Need A, Goldstein DB. Genome-wide tagging for everyone. Nat Genet 2006; 38:1227–1228.
33. Horne BD, Camp NJ. Principal component analysis for selection of optimal SNP-sets that capture intragenic genetic variation. Genet Epidemiol 2004; 26:11–21.
34. Meng Z, Zaykin DV, Xu CF, et al. Selection of genetic markers for association analyses, using linkage disequilibrium and haplotypes. Am J Hum Genet 2003; 73:115–130.
35. Sebastiani P, Lazarus R, Weiss ST, et al. Minimal haplotype tagging. Proc Natl Acad Sci U S A 2003; 100:9900–9905.
36. Weale M, Depondt C, MacDonald S, et al. Selection and evaluation of tagging SNPs in the neuronal-sodium-channel gene SCN1A: implications for linkage-disequilibrium mapping. Am J Hum Genet 2003; 73:551–565.
37. Stram DO. Tag SNP selection for association studies. Genet Epidemiol 2004; 27:365–374.
38. Zanke BW, Greenwood CM, Rangrej J, et al. Genome-wide association scan identifies a colorectal cancer susceptibility locus on chromosome 8q24. Nat Genet 2007; 39:989–994.
39. Tomlinson I, Webb E, Carvajal-Carmona L, et al. A genome-wide association scan of tag SNPs identifies a susceptibility variant for colorectal cancer at 8q24.21. Nat Genet 2007; 39:984–988.
40. Yeager M, Orr N, Hayes RB, et al. Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat Genet 2007; 39:645–649.
41. Hunter DJ, Kraft P, Jacobs KB, et al. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet 2007; 39:870–874.
42. Easton DF, Pooley KA, Dunning AM, et al. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 2007; 447:1087–1093.
43. Broderick P, Carvajal-Carmona L, Pittman AM, et al. A genome-wide association study shows that common alleles of SMAD7 influence colorectal cancer risk. Nat Genet 2007; 39:1315–1317.
44. Gudmundsson J, Sulem P, Manolescu A, et al. Genome-wide association study identifies a second prostate cancer susceptibility variant at 8q24. Nat Genet 2007; 39:631–637.
45. Schaid DJ. Evaluating associations of haplotypes with traits. Genet Epidemiol 2004; 27:348–364.
46. Lake S, Lyon H, Tantisira K, et al. Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Hum Hered 2003; 55:56–65.
47. Stram D, Pearce C, Bretsky P, et al. Modeling and E-M estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals. Hum Hered 2003; 55:179–190.
48. Epstein MP, Satten GA. Inference on haplotype effects in case-control studies using unphased genotype data. Am J Hum Genet 2003; 73:1316–1329.
49. Kraft P, Cox DG, Paynter RA, et al. Accounting for haplotype uncertainty in matched association studies: a comparison of simple and flexible techniques. Genet Epidemiol 2005; 28:261–272.
50. Cordell HJ. Estimation and testing of genotype and haplotype effects in case-control studies: comparison of weighted regression and multiple imputation procedures. Genet Epidemiol 2006; 30:259–275.
51. Zaykin D, Westfall P, Young S, et al. Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum Hered 2002; 53:79–91.
52. French B, Lumley T, Monks SA, et al. Simple estimates of haplotype relative risks in case-control data. Genet Epidemiol 2006; 30:485–494.
53. Lin DY, Zeng D, Millikan R. Maximum likelihood estimation of haplotype effects and haplotype-environment interactions in association studies. Genet Epidemiol 2005; 29:299–312.
54. Lin DY, Huang BE. The use of inferred haplotypes in downstream analyses. Am J Hum Genet 2007; 80:577–579.
55. Fallin D, Cohen A, Essioux L, et al. Genetic analysis of case/control data using estimated haplotype frequencies: application to APOE locus variation and Alzheimer's disease. Genome Res 2001; 11:143–151.
56. Sasieni PD. From genotypes to genes: doubling the sample size. Biometrics 1997; 53:1253–1261.
57. Schaid DJ. General score tests for associations of genetic markers with disease using cases and their parents. Genet Epidemiol 1996; 13:423–449.
58. Schaid D, Rowland C, Tines D, et al. Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet 2002; 70:425–434.
59. Xie R, Stram DO. Asymptotic equivalence between two score tests for haplotype-specific risk in general linear models. Genet Epidemiol 2005; 29:166–170.
60. Kraft P, Stram DO. Re: the use of inferred haplotypes in downstream analysis. Am J Hum Genet 2007; 81:863–865 (author reply 865–866).
61. Lin D, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies. J Am Stat Assoc 2006; 101:89–118.
62. Prentice R, Pyke R. Logistic disease incidence models and case-control studies. Biometrika 1979; 66:403–411.
63. Spinka C, Carroll RJ, Chatterjee N. Analysis of case-control studies of genetic and environmental factors with missing genetic information and haplotype-phase ambiguity. Genet Epidemiol 2005; 29:108–127.
64. Zhao L, Li S, Khalid N. A method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case-control studies. Am J Hum Genet 2003; 72:1231–1250.
65. Satten GA, Epstein MP. Comparison of prospective and retrospective methods for haplotype inference in case-control studies. Genet Epidemiol 2004; 27:192–201.
66. Allen AS, Satten GA. Robust estimation and testing of haplotype effects in case-control studies. Genet Epidemiol 2007; 32:29–40.
67. Chatterjee N, Kalaylioglu Z, Carroll R. Exploiting gene-environment independence in family-based case-control studies: increased power for detecting associations, interactions and joint effects. Genet Epidemiol 2005; 28:138–156.
68. Horvath S, Xu X, Lake SL, et al. Family-based tests for associating haplotypes with general phenotype data: application to asthma genetics. Genet Epidemiol 2004; 26:61–69.
69. Allen AS, Satten GA, Tsiatis AA. Locally-efficient robust estimation of haplotype-disease association in family-based studies. Biometrika 2005; 92:559–571.
70. Marchini J, Howie B, Myers S, et al. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 2007; 39:906–913.
71. Servin B, Stephens M. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet 2007; 3:e114.
72. Nicolae DL. Testing untyped alleles (TUNA): applications to genome-wide association studies. Genet Epidemiol 2006; 30:718–727.
73. Dai JY, Ruczinski I, LeBlanc M, et al. Imputation methods to improve inference in SNP association studies. Genet Epidemiol 2006; 30:690–702.
74. Tanck MW, Jukema JW, Zwinderman AH. Simultaneous estimation of gene-gene and gene-environment interactions for numerous loci using double penalized log-likelihood. Genet Epidemiol 2006; 30:645–651.
75. Park MY, Hastie T. Penalized logistic regression for detecting gene interactions. Biostatistics 2007; 9:30–50.
76. Kooperberg C, Ruczinski I. Identifying interacting SNPs using Monte Carlo logic regression. Genet Epidemiol 2005; 28:157–170.
77. Chen J, Yu K, Hsing A, et al. A partially linear tree-based regression model for assessing complex joint gene-gene and gene-environment effects. Genet Epidemiol 2007; 31:238–251.
78. Kwee L, Liu D, Lin X, et al. A powerful and flexible multi-locus association test for quantitative traits. Am J Hum Genet (in press).
79. Schaid DJ, McDonnell SK, Hebbring SJ, et al. Nonparametric tests of association of multiple genes with human disease. Am J Hum Genet 2005; 76:780–793.
80. Zaykin DV, Meng Z, Ehm MG. Contrasting linkage-disequilibrium patterns between cases and controls as a novel association-mapping method. Am J Hum Genet 2006; 78:737–746.
81. Wessel J, Schork NJ. Generalized genomic distance-based regression methodology for multilocus association analysis. Am J Hum Genet 2006; 79:792–806.
82. Harmon A. 6 billion bits of data about me, me, me! New York Times, June 3, 2007.
83. Minichiello MJ, Durbin R. Mapping trait loci by use of inferred ancestral recombination graphs. Am J Hum Genet 2006; 79:910–922.
15

Genomewide Association Studies

Michael B. Bracken, Andrew DeWan, and Josephine Hoh
Center for Perinatal, Pediatric, and Environmental Epidemiology, Yale University, New Haven, Connecticut, U.S.A.
INTRODUCTION

Genomewide association (GWA) studies, a hypothesis-free study design for associating complex diseases with particular genotypes, have come into use only very recently. They are increasingly seen to offer a more efficient strategy for identifying disease genes and for overcoming bias in the more traditional candidate gene approach. Some major successes in the use of GWA studies have already been documented. This commentary summarizes the recent rise of GWA studies, identifies some of their key characteristics, and points to aspects of their methodology where efficiency can be improved, particularly phenotype classification.

A number of excellent reviews of genetic epidemiology methods have been published, but for the most part they predate the widespread use of GWA studies (1–6). A workshop on GWA studies (7) considered some ways of making these designs more efficient, and Brookes, writing in 2001, commented, "As statistical genetic, genomic and computational technologies improve, it is likely that within one or two decades a corresponding 'hypothesis free,' comprehensive, and highly automated research strategy could turn out to be the most effective (although still limited) way to unravel the molecular basis of human disease" (8). As we shall see, rather than decades into the future, it was the following year that saw publication of the first hypothesis-free GWA study.

Earlier genetic epidemiology studies investigated a specified number, typically a few dozen to several hundred, of candidate genes located throughout the genome. These studies represent the more traditional scientific paradigm; however, many investigators now realize that candidate gene studies can be quite biased. Despite their limitations, candidate gene studies are still widely conducted and contribute to genomic epidemiology, and some of the issues in their design and analysis are the same as for GWA studies (2). In contrast, GWA studies specifically do not study candidate genes. The strategy is to interrogate potentially all DNA variants, such as the single nucleotide polymorphisms (SNPs), throughout the genome. The choice of SNPs initially investigated is essentially random: SNPs are selected either at a specified density or on the basis of linkage disequilibrium (LD) patterns in the genome to maximize information
gain with a minimum number of SNPs (tagSNPs). The number of SNPs analyzed has typically ranged from 10,000 to over 500,000, and will soon reach a million and above. GWA studies to date have primarily used case-control designs, although at least one GWA linkage family study has been reported (9).

LIMITATIONS IN CANDIDATE GENE STUDIES

It was not fully appreciated until relatively recently that candidate gene studies were often producing nonreproducible results. Because of the opportunity to simultaneously investigate a large number of SNPs, many published reports focused solely on those SNPs showing the largest and most statistically significant associations. In a classic series of papers, Ioannidis and colleagues (10–12) demonstrated that the first reported SNP associations were often the largest; subsequent investigators reported smaller or nonsignificant associations when larger studies were undertaken and more genetic material was available. Thompson et al. (13) reported a failure to replicate 19 of 20 candidate SNPs previously associated with atorvastatin response when these were reexamined in a genomewide analysis. Bias in the publication of candidate gene studies has been further amplified by errors in original studies and by publication bias in the in vitro (14) and animal literature (15–17), leading to substantial replication failure.

A BRIEF HISTORY OF GWA STUDIES

It was not until microarray technologies started to be developed in the early 1990s that the possibility for GWA studies arose (18). Later versions of these silicon chips allowed high-throughput microprocessing using several copies of probes to interrogate large numbers of SNPs accurately and at a reasonable cost. While the total number of reported SNPs in the human genome keeps rising as more sequence data become available, common SNPs can be found across the genome at intervals of approximately one SNP per 300 base pairs. Large numbers of SNPs have now been identified and are archived in publicly available databases.

The first large-scale GWA study was conducted by Ozaki et al. (19), who first investigated 94 cases and 658 controls to search for SNPs associated with myocardial infarction (MI), defined by meeting two of three strict clinical criteria. Controls were recruited from the general population at several Japanese medical institutions. Initially, 92,788 randomly selected SNPs from the Japanese Millennium Genome Project (20) were examined; 71% were successfully genotyped, yielding 65,671 usable SNPs. A nominal p value of 0.01, under recessive or dominant models, was used to exclude 99% of the SNP loci. In a second, larger panel (1133 cases and 1006 controls), all individual associations of the 1% of SNPs surviving the initial screen were nonreproducible. Only with linkage disequilibrium (LD) mapping and haplotype analysis were two SNPs in the lymphotoxin-alpha (LTA) gene shown to be associated with MI (RR, 1.78; 95% CI, 1.39, 2.15; p = 3.3 × 10⁻⁷). Functional analysis indicated that these polymorphisms increased induction of several cell adhesion molecules in smooth muscle cells of the coronary artery.

The HapMap, published in 2005 (21), facilitated not only SNP discovery but also the development of tagSNP panels. Because of the high LD among consecutive SNPs, some one million SNPs (tagSNPs) would capture all common variation in the genome, and microchips able to analyze one million SNPs are now in widespread use.
It has been suggested (22) that the first successful GWA study taking advantage of the HapMap data was reported in 2005 (23). In this study, 96 cases of age-related macular degeneration (AMD), selected from the Age-Related Eye Disease Study (AREDS) clinical trial if they had large drusen, diagnosed by photographic assessment, plus sight-threatening AMD, were compared with 50 controls. Controls were frequency matched on gender and smoking history (one of the few known risk factors for AMD) and were purposely older than cases to increase the likelihood of their remaining free of AMD. All subjects were "white, of northern European origin." A total of 116,204 randomly selected SNPs were studied, of which 103,611 on the 22 autosomal chromosomes were successfully genotyped. After Bonferroni correction (24), two SNPs in LD were associated with AMD; both were in an intron of the gene for complement factor H (CFH). All CFH exons were resequenced, and a polymorphism in exon 9 (rs1061170) in LD with the two intronic SNPs was most strongly associated with AMD. This study (23) yielded odds ratios (ORs) greater than 4.5 under a dominant model and greater than 6.0 under a recessive model. Perhaps most noteworthy about this study is the delineation of a homogeneous clinical phenotype (large drusen) and a well-characterized and well-matched control group.

The power of small studies to discover important disease genes is illustrated by another major discovery: HTRA1, the second gene associated with AMD (25). The study consisted of 104 cases and 130 controls, sufficient to demonstrate an OR of 10.40 (95% CI, 4.68, 23.14) for an SNP in the promoter region of HTRA1 (25,26). In this study, the researchers used a cross-ethnic approach, discovering the gene in a small cohort of Hong Kong Chinese subjects and then returning to Caucasian cohorts to validate the association in these as well. Analysis showed that the HTRA1 gene is important but was missed in the first study because several AMD subphenotypes are mixed together in Caucasian patients.

In another recent GWA study, variant rs11209026 in the interleukin-23 receptor (IL23R) gene was identified as strongly protective against Crohn's disease (CD) and ulcerative colitis (27). There were three significant SNPs, two located in the gene NOD2, previously identified as the CD locus by family linkage studies. The gene IL23 was already known to influence autoimmunity in mice: animals lacking IL23 do not develop colitis. A factor for success in this study was that the initial phenotype studied was ileal CD, diagnosed by several diagnostic criteria to maximize phenotypic homogeneity (28).

The AMD GWA study has spurred a flurry of interest and confidence among researchers hoping to discover the genetic predispositions for human complex traits (7,29). By 2007, several dozen studies had reported using some form of GWA methodology, mostly from consortium efforts that have collected large cohorts over many years. The traits under study include schizophrenia, juvenile- and adult-onset diabetes, obesity, and cardiovascular diseases. One of the largest numbers of SNPs studied is 770,000+, used to identify cis-regulatory regions controlling gene expression (30). Another intriguing recent finding among large-scale GWA studies of type 2 diabetes and cardiovascular disease phenotypes identified the same variants in the vicinity of the genes CDKN2A/CDKN2B associated with both complex diseases (31–35).
This surprising observation now challenges geneticists to understand how the same genes influence both of these phenotypes, a testament to the unanticipated hypotheses that will surely emerge from large-scale GWA studies.

Copy number variation (CNV) occurs by deletion or duplication of large numbers of nucleotides (many kilo- or megabases), which can result in changes to the number of copies of genes and to gene regulation. Most recently, variation in the copy number of
DNA sequences likely to be functionally important has been mapped in European, African, and Asian populations (36). Some 1400 regions, influencing approximately 14.5% of the genes thought to affect human disease, have been identified to date (37), and association and linkage studies using microarrays are also able to assess CNV in their samples. The first association between CNVs and a human complex disease was reported in autism (38), and more findings in other diseases are expected.

These studies have established some general criteria for conducting a successful GWA study, and it is useful to consider them in more detail. For the most part, they demonstrate ways in which the efficiency of the design may be improved. In this context, efficiency refers to maintaining adequate statistical power with the smallest possible number of cases and controls, and with minimal genotyping and other study costs.

IMPROVING THE EFFICIENCY OF GWA STUDIES

Genomic association studies are typically of case-control design: they associate a phenotype (a homogeneous disease state in traditional epidemiology) with a genotype (genetic polymorphism). Misclassification of either phenotype or genotype, or heterogeneous or indeterminate definition of cases and controls, is a major cause of reduced efficiency, typically leading to a concomitant increase in the sample size needed to detect statistically significant effects. If the detectable association can be substantially and plausibly increased by making GWA studies more efficient, by avoiding the phenotype "dilution" that leads to weak genetic effects with small RRs, the reduction in required sample size is dramatic. Simply put, to detect an RR of 1.2, 10,770 cases are needed. This number drops to 590 cases for OR = 2.0, 100 cases for RR = 4.0, and 48 cases for RR = 6.0 [assuming a 1:1 ratio of cases to controls, statistical power 1 − β = 0.90, two-sided type I error rate α = 0.05, and minor allele frequency (MAF) = 5%]; a sketch of this calculation follows below.

Phenotype

One of the surprises of the first GWA studies has been the large size of the ORs detected from quite small samples. Two examples of this characteristic come from research done on AMD, a disease that has been carefully characterized for specific phenotypes that are directly linked to defined manifestations of disease (39,40). The implication from the two AMD studies is quite profound, because it means that principal genes can be found in a carefully designed study. In complex diseases with heterogeneous phenotypes such as AMD, different population substructures may exhibit the disease as a result of different genes causing different pathophysiologic etiologies. Diseases like schizophrenia may have the same characteristic: the key genes may vary in different cohorts.

Risch and Zhang (41) have demonstrated the substantial increase in power obtained by selecting the affected sibling who expresses the higher end (e.g., the top 10th percentile) of the distribution of a quantitative trait (e.g., drusen size) and the unaffected sibling at the lower end of the trait spectrum. The principle of extreme discordant sibpair analysis was extended to case-control selection in the AMD GWA study: drusen size was used as a quantitative trait measure, all cases had extremely large drusen, and controls had none or only a few small drusen. Controls also were matched to cases as much as possible for known risk factors, as siblings presumably share the same environment (23). In sibpair analysis, this strategy was estimated to reduce genotyping by 10- to 40-fold (41).
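The sample-size arithmetic quoted above can be approximated with the classical two-proportion formula for an allelic case-control comparison. The minimal Python sketch below is illustrative only: it assumes an allelic test and a per-allele odds ratio, so its outputs bracket rather than reproduce the chapter's exact figures, which depend on the genetic model and test assumed.

```python
from statistics import NormalDist

def cases_needed(or_allele, maf=0.05, alpha=0.05, power=0.90):
    """Approximate cases per arm (1:1 case-control design) for a
    two-sided allelic test; counts alleles, then halves to people."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p0 = maf                                          # control allele frequency
    p1 = or_allele * p0 / (1 + p0 * (or_allele - 1))  # case allele frequency
    pbar = (p0 + p1) / 2
    num = (z_a * (2 * pbar * (1 - pbar)) ** 0.5
           + z_b * (p0 * (1 - p0) + p1 * (1 - p1)) ** 0.5) ** 2
    n_alleles = num / (p1 - p0) ** 2
    return round(n_alleles / 2)                       # two alleles per person

for effect in (1.2, 2.0, 4.0, 6.0):
    print(f"OR {effect}: ~{cases_needed(effect)} cases")
```

The scaling is the point of the exercise: sharpening the phenotype so that the detectable effect rises from 1.2 to 4 or 6 shrinks the required case series by roughly two orders of magnitude.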
In addition to using homogeneous case groups, considering phenotypes derived from the same biological pathways, or from the same or related embryological processes, is
likely to increase the likelihood of identifying genetic associations. These homologous phenotypes may be identified using molecular biomarkers. For example, E-cadherin expression may be an important prognostic phenotype for infiltrating ductal breast cancer (42). Studies of breast cancer phenotypes using E-cadherin expression to increase homogeneity will have an increased likelihood of identifying new genotypes associated with this form of breast cancer.

Cases with early onset of disease are generally more likely to be of genetic rather than environmental etiology. This was observed in breast cancer with BRCA1/2 mutations and has been seen in many other cancers, as well as in other conditions such as asthma and cardiovascular disease. Cases who have parents or siblings with the phenotype of interest are also more likely to represent a phenotype with a genetic etiology (43). Risch and Teng (44) have shown that selecting cases with affected sibs provides a more powerful test of association even when only one of the affected sibs is selected; this is due to the increased likelihood of the case carrying the disease allele. Whether multiple affected cases from the same family should be in the case group is more controversial (45). Using multiple case sibs in the case group may increase study power, but it undermines the usual assumption of independence among observations made in case-control studies. However, family-based association tests (FBAT) have been developed to deal with this type of study design (46).

Allele sharing, used to enrich the "at-risk" alleles in the case group, has been demonstrated to increase power by about 20% (47). Several options are available for selecting one case from among affected sibs: by degree of linkage, by extent of allele sharing, or at random on the basis of shared chromosome fragments among multiple cases within a family. This strategy may be effective if DNA has already been collected, such as in a case-control study nested within a larger cohort, or if multiple cases are being selected from the same family. If sibling cases are selected, statistical methods that control for the lack of independence among cases should be considered (48).

Selection bias can occur in case-control studies if the cases are from the same extended family. Among cases drawn from neighborhood hospitals, clinics, or patient series, it may not be infrequent for cases to be related; indeed, the cases may not know they are related. Genomic control may be used to identify relatives in a case series and to make appropriate corrections, by either deleting data from one of the relatives or making a statistical adjustment for the lack of independent observations.

It has been suggested that clinically more "severe" cases form superior phenotypes for genomic research. However, this may be true only if the criterion of homology is also met. A case group of severe asthma cases is unlikely to be genetically informative if it comprises a mix of patients with severe bronchial asthma, severe atopic asthma, and severe asthma due to specific environmental triggers such as cold or particular species of pollen or mold. Even measures of lung function may be assessing poor breathing resulting from a heterogeneous mix of diseases. Preferable asthma phenotypes would include responses to methacholine challenge, specific IgE responses, umbilical cord blood IgE, and other more specific biomarkers of disease.
Controls

Healthy or disease-free people have little incentive to take part in genetic research, and control samples are often more difficult to obtain than cases. For this reason, the use of historical controls for GWA studies has been proposed; these may be more acceptable here than in studies of environmental risk factors, which can change over time. More precise specification of
control groups who unambiguously do not have the case phenotype may equally improve the efficiency of GWA studies. Controls older than cases, for whom the likelihood of developing the disease is reduced, were used in the AMD studies; however, using much older controls may introduce some confounding from genotypes that are associated with survival rather than etiology. Controls having no relatives with the study phenotype will also enhance the chance of finding genes of interest. Sibling controls offer less powerful study designs than unrelated controls because disease allele frequencies are correlated in siblings (49). In studies of a specific cancer, it may be desirable to exclude subjects with a history of any cancer as controls, so as to eliminate any overlap in genetic etiology that would reduce the power to detect association.

Controls are usually derived from the same population as the cases to avoid population stratification, which occurs when allele frequencies vary across population subgroups and so may be falsely attributed to case status when ethnicity or race is not matched or controlled. This can be addressed in the study design phase or by the use of genomic controls. However, genomic control of population stratification is less preferable than incorporating control into the study design (i.e., selecting all cases and controls from the same ethnic group), which can be built into the power calculations for a study. In the study of Crohn's disease, ethnicity was stratified: the initial GWA study was conducted in 547 non-Jewish patients of European ancestry and 548 non-Jewish controls, producing an OR of 0.26 (95% CI, 0.15, 0.43) for IL23R rs11209026. The replication study of the same marker was done in 401 cases and 433 controls, all Jewish (OR, 0.45; 95% CI, 0.27, 0.73) (27).

To date, the most efficient use of controls is reported in a large GWA study of seven diseases, in which each case group was contrasted with the same control group (50). The GWA studies conducted in Iceland by deCODE Genetics have also used various subsets selected from one large population control cohort. Such practices may be expected to be adopted in most future GWA studies. However, this strategy is beneficial mainly for large-scale studies; in smaller studies it may lose efficiency, as it precludes matching, whose requirements may vary for each disease. Moreover, any errors in control group selection will confound all disease comparisons.
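The population-stratification hazard described above can be made concrete with a few lines of arithmetic. In this hypothetical sketch (the subpopulation allele frequencies and sampling mixes are invented for illustration), a SNP has no association with disease within either subpopulation, yet a sizable odds ratio appears simply because cases and controls sample the two subpopulations in different proportions.

```python
# Risk-allele frequency in two subpopulations; the SNP is null within each.
freq = {"A": 0.40, "B": 0.10}
case_mix = {"A": 0.80, "B": 0.20}  # cases drawn mostly from subpopulation A
ctrl_mix = {"A": 0.50, "B": 0.50}  # controls sampled evenly

def pooled(mix):
    """Allele frequency in a pooled sample with the given subpopulation mix."""
    return sum(mix[s] * freq[s] for s in freq)

p_case, p_ctrl = pooled(case_mix), pooled(ctrl_mix)
odds = lambda p: p / (1 - p)
print(f"spurious allelic OR = {odds(p_case) / odds(p_ctrl):.2f}")  # ~1.55
```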
REDUCING GENOTYPING AND LABORATORY ERROR

Even infrequent random errors in genotyping can substantially influence a study's power. Gordon et al. (51) estimated that a 1% random error rate requires 2% to 8% larger sample sizes. Standard laboratory practice should be followed to avoid bias or random error: case and control DNA should be mixed on the same microarray platforms, all genotyping should be masked to case-control status, concordance should be assessed among multiple genotyping operators, and reference samples should be used to confirm genotyping accuracy.

Additionally, call rates should be assessed on a per-sample and per-SNP basis. Samples that give consistently low call rates may indicate problems with DNA quality, whereas a low call rate for an SNP across all samples may indicate a poorly performing SNP. In either case, it is recommended that the data be excluded to reduce the introduction of genotyping errors. Hardy-Weinberg equilibrium (HWE) statistics should be calculated for all SNPs, as a significant deviation from HWE is often indicative of genotyping error despite high call rates for the SNP (52).
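As one illustration of the quality checks just listed, the following minimal sketch computes the Pearson chi-square statistic for departure from Hardy-Weinberg equilibrium from a SNP's genotype counts; the example counts and the p < 0.001 exclusion threshold are illustrative choices, not fixed conventions.

```python
def hwe_chisq(n_aa, n_ab, n_bb):
    """Pearson chi-square (1 df) for Hardy-Weinberg equilibrium,
    comparing observed genotype counts with counts expected from
    the estimated allele frequency."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)                 # frequency of allele A
    expected = (n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# A chi-square above ~10.83 corresponds to p < 0.001 on 1 df; SNPs beyond
# such a threshold (or with low call rates) would be flagged for exclusion.
print(f"{hwe_chisq(500, 400, 100):.2f}")            # ~2.27, not flagged
```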
OTHER CONSIDERATIONS FOR IMPROVING EFFICIENCY

Allele Frequency

While future investigations may study SNPs with rare allele frequencies, priority should currently be given to SNPs with allele frequencies between 5% and 50%, both to optimize the power of the needed studies and to focus research on the more common alleles that are likely to contribute substantially to the burden of complex disease in the population (53).

SNP Selection

GWA studies can include both functional and nonfunctional SNPs, but how these SNPs are selected can affect the power of the association study. The most cost-effective method of interrogating large numbers of SNPs is to use commercially available panels of SNPs arrayed on chips. The number of SNPs and their selection are largely dependent on the products currently available from various vendors, and several factors go into the decision to choose one product over another. Some SNP panels are based on a tagging approach, selected to optimize coverage based on patterns of LD observed in the HapMap samples. Such a panel is optimal for populations of a single ethnic origin, as it maximizes the use of LD information and avoids unnecessarily interrogating haplotype blocks with several SNPs.

Most of the high-density SNP genotyping platforms attempt to cover the genome in its entirety and are not genecentric. This approach does not assume that susceptibility loci will be located within a known coding or regulatory region, thus allowing a more hypothesis-free approach. A significant association within a coding region is more biologically plausible, but significant associations in regions harboring no obvious candidates may be equally valid. Ozaki et al. (19) analyzed exonic SNPs that covered transcribed sequences but not variation within regulatory sequence. The AMD studies initially screened panels of approximately evenly distributed SNPs across the genome; following the initial identification of these "marker" SNPs associated with AMD, sequencing of coding and regulatory regions led to the discovery of functional SNPs. Other strategies for initial SNP selection involve a "genecentric" approach based on functional or regulatory sites (54). These strategies require some prior knowledge of disease biology and are subject to some of the same limitations as the candidate gene studies described earlier. The obvious functional or disease-causing SNPs would be nonsynonymous (changing an amino acid residue in the protein sequence) or regulatory [changing gene expression, leading to excess or reduced messenger ribonucleic acid (mRNA) production].

Tagging avoids inefficiency and expense by not typing SNPs that are in complete LD. The degree of LD between alleles at two loci is described by r², and the sample size required to detect a disease association of fixed effect size is inversely proportional to r². Thus, r² = 0.5 requires twice the sample size of r² = 1; the latter indicates perfect LD and no loss of power from using a tagSNP instead of the disease-causing SNP itself. It has been shown that an r² greater than or equal to 0.8 is sufficient for tagSNP mapping (55). In genomic regions with high LD, using tagSNPs may reduce genotyping by 70% to 80%; where LD is low, every SNP may need to be genotyped to cover a region. As larger microarray chips are developed that can incorporate more variants across the whole genome, SNP selection will become less of an issue.
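The relation between r² and sample size can be sketched directly. In the hypothetical example below (the haplotype and allele frequencies are invented for illustration), r² is computed from two-locus haplotype frequencies, and the required case count for a design typing only the tagSNP is scaled up by 1/r² relative to typing the causal SNP itself.

```python
def r_squared(p_ab, p_a, p_b):
    """LD statistic r^2 for two biallelic loci, from the frequency of the
    A-B haplotype and the marginal frequencies of alleles A and B."""
    d = p_ab - p_a * p_b                  # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

r2 = r_squared(p_ab=0.18, p_a=0.20, p_b=0.25)
n_causal = 1000                           # cases if the causal SNP were typed
print(f"r^2 = {r2:.2f}; tagSNP design needs ~{round(n_causal / r2)} cases")
```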
The high LD rates in the genome identified by the HapMap have also allowed the successful development of imputation methods (5). These recently allowed interrogation of more than two million autosomal SNPs (including SNPs with MAF <5%), in addition to 315,000 genotyped SNPs, to identify 10 susceptibility variants for type 2 diabetes (32).

Different SNP genotyping platforms have very few overlapping SNPs. This poses an analytical problem if investigators switch platforms in the middle of a study or if samples from different platforms are combined in a meta-analysis. If samples genotyped for different sets of SNPs can be combined, the resulting increase in sample size will improve the efficiency of the studies. To achieve this goal, one requires powerful statistical and bioinformatic methods that can effectively use the empirical LD patterns and allele frequencies to infer knowledge from mixed SNP panels and to impute genotypes for nonoverlapping SNPs, as was achieved in the type 2 diabetes study mentioned above.

Reducing False Positives

This is perhaps the most widely discussed topic in GWA methodology. Analysis of 100,000 SNPs will produce some 5000 positive associations at α = 0.05, almost all of them spurious, and replication of the SNPs that screened positive will produce a further 250 positive associations, most of them also false. To adjust for multiple comparisons, the traditional Bonferroni procedure is applied (e.g., with 500,000 SNPs, use p = 0.05/500,000 = 10⁻⁷) (24). The Bonferroni correction is quite conservative, because a substantial portion of the human genome is in LD (that is, alleles are not all distributed randomly) and multiple SNPs are likely to be truly associated with disease. A preferred method uses the false discovery rate, which takes into account some of the linkage across the genome and thereby permits somewhat larger corrected p values (56–58).

It has been suggested that other stratifications at each locus also need to be counted toward the multiple comparisons (2). Analyses for dominant or recessive inheritance, haplotype or diplotype analyses, and stratification by gender, by other subgroups, or by phenotype variants all substantially inflate the multiple comparison problem. For example, if each of 500,000 alleles is stratified by gender and by model of inheritance, two million comparisons will have been made.

Genomic Control

The general idea is to correct for spurious associations that may arise simply because the case and control populations are not comprised of the same subpopulations (i.e., differential levels of admixture in cases and controls). This adjustment requires selecting neutral loci to compute a correction parameter. Several procedures have been proposed that involve correcting all SNP association statistics by either the mean or the median χ² value among these neutral loci (59,60). More recently, the use of ancestry informative markers has been proposed as a better approach for detecting and correcting for population stratification in case-control studies (61).

Multistage Designs

Perhaps the most common statistical device for increasing efficiency is the use of multistage designs (62,63). Typically, an initial screen may use 20% to 30% of the study sample, randomly selected, and impose a liberal uncorrected alpha value (0.05 or 0.1) for the first genomewide scan. A second screen of the positive SNPs is conducted in the remainder of the sample. An alternative strategy is to jointly analyze data from both screens, and this
has been shown to increase power to detect genetic associations, particularly when 30% or more of the sample is genotyped in the first screen and a relatively large number of markers (>1%) are followed up in the second screen (64). Another approach is to collect multiple independent samples. The first sample is used as the discovery sample, in which a complete GWA study is performed. All SNPs surpassing the initial significance threshold are subsequently typed in one or more replication samples to identify those SNPs that continue to exhibit positive association signals (27).

Control of Confounding

Random error is generally more important in GWA studies than bias. Comparison of individuals defined by their genotype is equivalent to a randomized comparison because of the random segregation of alleles at meiosis (except for some LD). Potentially confounding exposures to environmental and lifestyle risk factors should be random with respect to genotype, a concept known as Mendelian randomization (6,65). Mendelian randomization should eliminate confounding from other environmental risk factors as long as population stratification is avoided.

THE PROMISE OF A NEW PARADIGM

GWA studies represent a paradigm shift in the scientific method. Animal and in vitro studies now follow rather than precede observations in humans to confirm biological plausibility, replication is reported in the first published observations, and plans for confirmatory work and replication are expected in grant applications for GWA studies. Most distinctively, GWA studies are hypothesis-free and represent a search for disease-causing associations among massive numbers of possible associations, all in contradiction to long-held beliefs about how epidemiologic research should be conducted.

GWA studies are not without their critics. Terwilliger and Weiss (66) have commented, "Not only is it argued that we need know basically nothing substantial about the biology of a trait to do a mapping study, but it need not even aggregate in families, and, to the contrary, the study design is to compare unrelated cases with controls. Often this is now proposed as an attraction of a study design! A strange way to do science." Others have been more supportive. After publication of the first AMD genomic study (23), Science editorialized, "As promised, the Human Genome Project provides powerful new insights into human disease and raises many challenging questions" (67).

One of the key challenges is to improve the efficiency of GWA studies. This is done by improving the efficiency of individual studies, as described above, but also by improving the efficiency of the research enterprise itself. This can occur in at least four ways: ensuring that initially reported results are replicated in the first report of an association, demonstrating the biological plausibility of newly discovered mechanisms of disease, avoiding publication bias so that time is not spent following false leads, and rapidly and systematically reviewing association studies with frequent updating.

False positives are distinguished from true positives by performing replication studies in independent populations to determine whether a similar association is observed. If the new p value multiplied by the number of candidate SNPs taken to replication is less than 0.05, a true positive for that SNP may be declared; if the p value is large, say greater than 0.2, a false positive may be declared; if the p value is in between, further investigation is needed.
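The replication decision rule in the preceding paragraph translates directly into code. The sketch below is a literal rendering of the rule as stated, with Bonferroni-style scaling of the replication p value by the number of SNPs carried forward; the 0.05 and 0.2 thresholds come from the text, while the example inputs are hypothetical.

```python
def classify_replication(p_rep, n_candidates):
    """Classify a replication result per the rule in the text:
    scale the replication p value by the number of candidate SNPs."""
    if p_rep * n_candidates < 0.05:
        return "declare true positive"
    if p_rep > 0.2:
        return "declare false positive"
    return "needs further investigation"

print(classify_replication(p_rep=2e-4, n_candidates=30))  # true positive
print(classify_replication(p_rep=0.35, n_candidates=30))  # false positive
print(classify_replication(p_rep=0.05, n_candidates=30))  # indeterminate
```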
Replication of the initially observed association for HTRA1 was reported in the AMD paper itself for a Southeast Asian population from Hong Kong (25) and was extended to a Caucasian population in a companion paper (68). The paper associating IL23R with Crohn's disease reported replication in non-Jewish and Jewish cohorts and in a third, family-based study within the same paper (27). A recent GWA study identifying an allele (marker DG8S737-8) associated with prostate cancer, including Gleason stage 7 to 10 disease, included replications in five distinct populations (69). Box 1 describes a recent example of replication uncertainty in obesity genetics.

In contrast to traditional candidate gene studies, which are conducted following leads from molecular studies, GWA studies often perform biological investigations to lend plausibility to the newly identified gene. First, is the associated SNP in the vicinity of a coding region, or in a region conserved across the genomes of known species? Previous linkage and association studies are also examined to determine whether the associated SNP falls within a previously identified region. All of these investigations lend credence to a SNP association being a true positive rather than a false positive. If the associated SNP (or marker) is not itself a likely functional mutation, the candidate region or gene should be sequenced, or other means used, to identify functional variants. Lastly, biological experiments are performed to determine whether the presumed functional SNP has any biological effect on the candidate gene, using a variety of methods (e.g., in situ expression studies, cell culture-based expression studies, binding experiments) depending on the location of the SNP and the biological characteristics of the gene. The first AMD study included immunofluorescence investigations to localize complement factor H (CFH) protein in the human retina, as predicted from the associated alleles (23). The second AMD GWA study verified that the predicted transcription factors did bind to the HTRA1 promoter in human retinal pigment epithelium (25).

Publication bias has been commonly documented in candidate gene association studies (70), and much of the impulse to demand replication and biological confirmation in original reports is to avoid publication of false-positive studies. GWA studies are equally vulnerable to publication bias. One way to manage this would be to establish online repositories for reporting negative associations in an organized and readily searchable way.

Efficiency in the rapid synthesis and publication of systematic reviews of GWA studies is an important strategy for avoiding unnecessary duplication of effort, the following of false leads, and delay in achieving consensus as to which associations can be declared real. Electronic publication of systematic reviews, which allows their rapid updating, as has been done for clinical medicine by the Cochrane Collaboration, is an urgent priority (71) and one that is currently being addressed (72).

SNP mutations do not themselves cause disease. It is the resultant mutated protein, or an excess or deficit of protein, that causes disease. There are many more proteins than genes, and microarray processing is already able to analyze large numbers of them. It is inevitable that hypothesis-free association studies of proteins with disease will be reported with increasing frequency. Work being done now to improve the validity and precision of GWA studies will have direct relevance to these future areas of research.
The immediate future may see greater use of the more diverse genomes and smaller haplotype blocks seen in African populations, and rapid adoption of ultrahigh-throughput sequencing of the rarer minor alleles (73). With increased efficiency, hypothesis-free GWA research has the potential, as has already been shown to a limited degree, to produce major breakthroughs in our understanding of the complex causes of chronic diseases and in developing new therapies to treat them.
BOX 1  A COMMON OBESITY VARIANT: REPLICATED OR NOT?

Herbert et al. (74) reported that rs7566605, in an intron of INSIG2, was associated with adult and childhood obesity (BMI ≥ 30) after genotyping 86,604 SNPs in 694 offspring of the Framingham cohort (OR, 1.33; 95% CI, 1.20, 1.48; p = 0.0026). This result was confirmed in five further unrelated samples of varying ethnicity and age, but not in 2726 subjects from the Nurses' Health Study. Recently, several authors reported failures to replicate this result. Loos et al. (75) genotyped two separate cohorts (N = 4916 and 1683), and in linear models rs7566605 tended to be associated with lower BMI. Dina et al. (76) genotyped four sets of Caucasian children (449, 386, and 287 families and 4998 individuals), but no evidence of an association with BMI ≥ 30 was observed. In a third report examining homozygous carriers of the C allele of rs7566605 and BMI ≥ 30, the overall association was not observed, but a positive linear association with the SNP was found in individuals who were already overweight (77).

What may be causing these apparently discrepant results? The ORs are small and imprecise except in the very large studies; however, lack of power is unlikely to be the major problem since, in the nonreplicating studies, the ORs are very close to unity. Gene-environment interaction is often invoked to explain differing results, but an environmental risk factor has not been identified and would need to vary systematically across these cohorts. Gene-gene interaction is also a possibility, but no evidence for this has been observed in the analyses to date. There is some consistency across the existing studies for the SNP to have an effect only in subjects who are already overweight, which suggests that phenotypic heterogeneity may be an important explanation. BMI ≥ 30 is an objective and precise measure, but it can result from a variety of underlying conditions (e.g., types 1 and 2 diabetes, other conditions leading to insulin resistance, hypertension, hypertriglyceridemia, low HDL, impaired fasting glucose, inactivity due to other diseases), all of which are likely to have different genetic risk alleles. The net effect is to dilute the effect of any one SNP on the general phenotype. Phenotypes based on careful clinical characterization of BMI are likely to produce more homogeneous results and stronger associations.

REFERENCES

1. Clayton D, McKeigue PM. Epidemiological methods for studying genes and environmental factors in complex diseases. Lancet 2001; 358(9290):1356–1360.
2. Hattersley AT, McCarthy MI. What makes a good genetic association study? Lancet 2005; 366(9493):1315–1323.
3. Wang WY, Barratt BJ, Clayton DG, et al. Genome-wide association studies: theoretical and practical concerns. Nat Rev Genet 2005; 6(2):109–118.
4. Hirschhorn JN, Daly MJ. Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 2005; 6(2):95–108.
5. Marchini J, Donnelly P, Cardon LR. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet 2005; 37(4):413–417.
6. Colhoun HM, McKeigue PM, Davey Smith G. Problems of reporting genetic associations with complex outcomes. Lancet 2003; 361(9360):865–872.
7. Thomas DC, Haile RW, Duggan D. Recent developments in genomewide association scans: a workshop summary and review. Am J Hum Genet 2005; 77(3):337–345.
8. Brookes AJ. Rethinking genetic strategies to study complex diseases. Trends Mol Med 2001; 7(11):512–516.
9. Mani A, Radhakrishnan J, Wang H, et al. LRP6 mutation in a family with early coronary disease and metabolic risk factors. Science 2007; 315(5816):1278–1282.
10. Ioannidis JP. Contradicted and initially stronger effects in highly cited clinical research. JAMA 2005; 294(2):218–228.
11. Ioannidis J, Lau J. Evolution of treatment effects over time: empirical insight from recursive cumulative metaanalyses. Proc Natl Acad Sci U S A 2001; 98(3):831–836.
12. Trikalinos TA, Ntzani EE, Contopoulos-Ioannidis DG, et al. Establishment of genetic associations for complex diseases is independent of early study findings. Eur J Hum Genet 2004; 12(9):762–769.
13. Thompson JF, Man M, Johnson KJ, et al. An association study of 43 SNPs in 16 candidate genes with atorvastatin response. Pharmacogenomics J 2005; 5(6):352–358.
14. Buchanan AV, Weiss KM, Fullerton SM. Dissecting complex disease: the quest for the Philosopher's Stone? Int J Epidemiol 2006; 35(3):562–571.
15. Pound P, Ebrahim S, Sandercock P, et al. Where is the evidence that animal research benefits humans? BMJ 2004; 328(7438):514–517.
16. Hackam DG, Redelmeier DA. Translation of research evidence from animals to humans. JAMA 2006; 296(14):1731–1732.
17. Macleod MR, O'Collins T, Howells DW, et al. Pooling of animal experimental data reveals influence of study design and publication bias. Stroke 2004; 35(5):1203–1208.
18. Schena M, Shalon D, Davis RW, et al. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995; 270(5235):467–470.
19. Ozaki K, Ohnishi Y, Iida A, et al. Functional SNPs in the lymphotoxin-alpha gene that are associated with susceptibility to myocardial infarction. Nat Genet 2002; 32(4):650–654.
20. Haga H, Yamada R, Ohnishi Y, et al. Gene-based SNP discovery as part of the Japanese Millennium Genome Project: identification of 190,562 genetic variations in the human genome. Single-nucleotide polymorphism. J Hum Genet 2002; 47(11):605–610.
21. The International HapMap Consortium. A haplotype map of the human genome. Nature 2005; 437(7063):1299–1320.
22. Collins FS. Genomic Medicine: A Revolution in Medical Practice in the 21st Century. World Health Care Congress, 2006.
23. Klein RJ, Zeiss C, Chew EY, et al. Complement factor H polymorphism in age-related macular degeneration. Science 2005; 308(5720):385–389.
24. Bonferroni C. Teoria statistica delle classi e calcolo delle probabilità. In: Volume in Onore di Riccardo dalla Volta. Università di Firenze, 1937:1–62.
25. Dewan A, Liu M, Hartman SS, et al. HTRA1 promoter polymorphism in wet age-related macular degeneration. Science 2006; 314(5801):989–992.
26. Dewan A, Liu M, Hartman SS, et al. Online supporting material for HTRA1 promoter polymorphism in wet age-related macular degeneration. Science 2006; 314(5801):989–992.
27. Duerr RH, Taylor KD, Brant SR, et al. A genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science 2006; 314(5804):1461–1463.
28. Duerr RH, Taylor KD, Brant SR, et al. Supporting online material for a genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science 2006; 314(5804):1461–1463.
29. Palmer LJ, Cardon LR. Shaking the tree: mapping complex disease genes with linkage disequilibrium. Lancet 2005; 366(9492):1223–1234.
30. Cheung VG, Spielman RS, Ewens KG, et al. Mapping determinants of human gene expression by regional and genome-wide association. Nature 2005; 437(7063):1365–1369.
31. Zeggini E, Weedon MN, Lindgren CM, et al. Replication of genome-wide association signals in U.K. samples reveals risk loci for type 2 diabetes. Science 2007; 316:1336–1341.
32. Scott LJ, Mohlke KL, Bonnycastle LL, et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 2007; 316:1341–1345.
33. Saxena R, Voight BF, Lyssenko V, et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 2007; 316:1331–1336.
34. Helgadottir A, Thorleifsson G, Manolescu A, et al. A common variant on chromosome 9p21 affects the risk of myocardial infarction. Science 2007; 316:1491–1493.
35. McPherson R, Pertsemlidis A, Kavaslar N, et al. A common allele on chromosome 9 associated with coronary heart disease. Science 2007; 316:1488–1491.
36. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006; 444(7118):444–454.
37. Lupski JR. Structural variation in the human genome. N Engl J Med 2007; 356(11):1169–1171.
38. Sebat J, Lakshmi B, Malhotra D, et al. Strong association of de novo copy number mutations with autism. Science 2007; 316(5823):445–449.
39. de Jong PT. Age-related macular degeneration. N Engl J Med 2006; 355(14):1474–1485.
40. Rattner A, Nathans J. Macular degeneration: recent advances and therapeutic opportunities. Nat Rev Neurosci 2006; 7(11):860–872.
41. Risch N, Zhang H. Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 1995; 268(5217):1584–1589.
42. Gould Rothberg BE, Bracken MB. E-cadherin immunohistochemical expression as a prognostic factor in infiltrating ductal carcinoma of the breast: a systematic review and meta-analysis. Breast Cancer Res Treat 2006; 100(2):139–148.
43. Thompson D, Witte JS, Slattery M, et al. Increased power for case-control studies of single nucleotide polymorphisms through incorporation of family history and genetic constraints. Genet Epidemiol 2004; 27(3):215–224.
44. Risch N, Teng J. The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases I. DNA pooling. Genome Res 1998; 8(12):1273–1288.
45. Li M, Boehnke M, Abecasis GR. Efficient study designs for test of genetic association using sibship data and unrelated cases and controls. Am J Hum Genet 2006; 78(5):778–792.
46. Horvath S, Xu X, Laird NM. The family based association test method: strategies for studying general genotype–phenotype associations. Eur J Hum Genet 2001; 9(4):301–306.
47. Fingerlin TE, Boehnke M, Abecasis GR. Increasing the power and efficiency of disease-marker case-control association studies through use of allele-sharing information. Am J Hum Genet 2004; 74(3):432–443.
48. Slager SL, Schaid DJ. Evaluation of candidate genes in case-control studies: a statistical method to account for related subjects. Am J Hum Genet 2001; 68(6):1457–1462.
49. Boehnke M, Langefeld CD. Genetic association mapping based on discordant sib pairs: the discordant-alleles test. Am J Hum Genet 1998; 62(4):950–961.
50. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007; 447(7145):661–678.
51. Gordon D, Finch SJ, Nothnagel M, et al. Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Hum Hered 2002; 54(1):22–33.
52. Dewan A KR, Hoh J. Linkage disequilibrium maps and disease-association mapping. In: Linkage Disequilibrium and Association Mapping: Analysis and Applications (Methods in Molecular Biology, vol. 376). Totowa, NJ: Humana Press, 2007.
53. Lohmueller KE, Pearce CL, Pike M, et al. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat Genet 2003; 33(2):177–182.
54. Jorgenson E, Witte JS. A gene-centric approach to genome-wide association studies. Nat Rev Genet 2006; 7(11):885–891.
55. Carlson CS, Eberle MA, Rieder MJ, et al. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet 2004; 74(1):106–120.
56. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B 1995; 57(1):289–300.
57. Sabatti C, Service S, Freimer N. False discovery rate in linkage and association genome screens for complex disorders. Genetics 2003; 164(2):829–833.
58. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A 2003; 100(16):9440–9445.
59. Shmulewitz D, Zhang J, Greenberg DA. Case-control association studies in mixed populations: correcting using genomic control. Hum Hered 2004; 58(3–4):145–153.
60. Roeder K, Bacanu SA, Sonpar V, et al. Analysis of single-locus tests to detect gene/disease associations. Genet Epidemiol 2005; 28(3):207–219.
61. Enoch MA, Shen PH, Xu K, et al. Using ancestry-informative markers to define populations and detect population stratification. J Psychopharmacol 2006; 20(4 suppl):19–26.
62. Satagopan JM, Verbel DA, Venkatraman ES, et al. Two-stage designs for gene-disease association studies. Biometrics 2002; 58(1):163–170.
63. Satagopan JM, Venkatraman ES, Begg CB. Two-stage designs for gene-disease association studies with sample size constraints. Biometrics 2004; 60(3):589–597.
64. Skol AD, Scott LJ, Abecasis GR, et al. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet 2006; 38(2):209–213.
65. Salanti G, Sanderson S, Higgins JP. Obstacles and opportunities in meta-analysis of genetic association studies. Genet Med 2005; 7(1):13–20.
66. Terwilliger JD, Weiss KM. Confounding, ascertainment bias, and the blind quest for a genetic 'fountain of youth'. Ann Med 2003; 35(7):532–544.
67. Daiger SP. Genetics. Was the Human Genome Project worth the effort? Science 2005; 308(5720):362–364.
68. Yang Z, Camp NJ, Sun H, et al. A variant of the HTRA1 gene increases susceptibility to age-related macular degeneration. Science 2006; 314(5801):992–993.
69. Amundadottir LT, Sulem P, Gudmundsson J, et al. A common variant associated with prostate cancer in European and African populations. Nat Genet 2006; 38(6):652–658.
70. Keavney B, McKenzie C, Parish S, et al. Large-scale test of hypothesised associations between the angiotensin-converting-enzyme insertion/deletion polymorphism and myocardial infarction in about 5000 cases and 6000 controls. International Studies of Infarct Survival (ISIS) Collaborators. Lancet 2000; 355(9202):434–442.
71. Bracken MB. Genomic epidemiology of complex disease: the need for an electronic evidence-based approach to research synthesis. Am J Epidemiol 2005; 162(4):297–301.
72. Khoury MJ, Little J, Gwinn M, et al. On the synthesis and interpretation of consistent but weak gene-disease associations in the era of genome-wide association studies. Int J Epidemiol 2006; 36(2):439–445.
73. Abecasis G, Tam PK, Bustamante CD, et al. Human Genome Variation 2006: emerging views on structural variation and large-scale SNP analysis. Nat Genet 2007; 39(2):153–155.
74. Herbert A, Gerry NP, McQueen MB, et al. A common genetic variant is associated with adult and childhood obesity. Science 2006; 312(5771):279–283.
75. Loos RJ, Barroso I, O'Rahilly S, et al. Comment on "A common genetic variant is associated with adult and childhood obesity". Science 2007; 315(5809):187 (author reply 187).
76. Dina C, Meyre D, Samson C, et al. Comment on "A common genetic variant is associated with adult and childhood obesity". Science 2007; 315(5809):187 (author reply 187).
77. Rosskopf D, Bornhorst A, Rimmbach C, et al. Comment on "A common genetic variant is associated with adult and childhood obesity". Science 2007; 315(5809):187 (author reply 187).
16
Validation and Confirmation of Associations

John P. A. Ioannidis

Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece, and Department of Medicine, Tufts University School of Medicine, Boston, Massachusetts, U.S.A.
INTRODUCTION

The advent of molecular epidemiology has resulted in a flurry of postulated associations. The discovery of associations is continuously facilitated by the advent of ever more massive and efficient platforms for measuring biological factors of interest. At the same time, this has created an untamed plethora of postulated risk factors, only a fraction of which may be true (sufficiently "credible" in a Bayesian framework). A survey of the published literature shows that almost all epidemiological papers claim at least one finding to which they attribute statistical significance. An empirical evaluation (1) showed that 87% of epidemiological studies published in 2005 claimed at least one statistically significant result in their abstracts. For some fields in molecular epidemiology, the situation is even more extreme (2). For example, in an empirical survey of 340 cancer prognostic factor studies included in meta-analyses and of another 1575 articles on cancer prognostic factors published in 2005, the proportions of articles that claimed statistically significant prognostic effects in their abstracts were 90.6% and 95.8%, respectively. Even among the few studies that did not claim statistically significant prognostic effects, the majority either claimed statistically significant results for something else, or claimed significant effects based on trends, or at least offered some "apologies" that supported the probed associations based on other external, qualitative, or subjective evidence. Fully "negative" articles amounted to only 1.5% and 1.3% of the articles in the two data sets, respectively.

Based on this picture, it is difficult to argue that (almost) all of the probed molecular associations are truly "positive." This postulation is extremely unlikely. When more studies are performed on the same association and data are standardized so as to compare like analyses with like, nonreplication is a very common theme in molecular epidemiology. Empirical evidence on the problem of nonreplication comes from topics as diverse as genetic associations, microarrays, proteomics, and linkage studies (3–6). Lack of replication could reflect either genuine bias (causing false positives) or genuine
heterogeneity among the different populations and settings where the association is probed (7). These two forces may also coexist in the same body of evidence. A major challenge is to try to dissect how many of the proposed molecular associations are genuine and how many are just the consequence of bias. This is a very difficult task, and often there can be no straightforward answer. In fact, in most, if not all, circumstances, the best one can aim for is an appraisal of the approximate credibility of the association. Some objective and quantitative or semiquantitative methods may help in this appraisal, but some components of it unavoidably remain subjective and may vary among different observers and scientists.
TRADITIONAL CRITERIA AND BASIC CONCEPTS

Occasionally, the validation process in molecular epidemiology may be helped by examining the traditional "criteria" set by Bradford Hill as hints of causality (8). However, for much of molecular epidemiology, most of these criteria are irrelevant or difficult to apply (9). Temporality is often implicit and obvious, but this does not help much. For example, all nonacquired genetic risk factors are fixed at birth. For putative risk factors that are acquired, understanding their temporal emergence in the disease process may be more useful; this would require prospective cohort studies with multiple repeated measurements of the putative risk factors and of the outcomes they are thought to influence. Still, many biological alterations coexist as part of wider biological cascades, and it is often difficult to assess what caused what. Experimental support through randomization cannot typically be pursued for putative molecular risk factors, although admittedly Mendelian randomization brings the study of common genetic variants closer to randomized research than probably any other discipline in epidemiology (10).

Analogy and coherence "with generally known facts of the natural history and biology of the disease" are difficult to make use of in much of molecular epidemiology until we create a large database of solidly confirmed associations. Otherwise we run the risk of trying to fit the analogy and coherence criteria against data contaminated by a large proportion (or even a majority) of false knowledge. At the moment, it is also difficult to say whether specificity should be invoked as a means of identifying true associations. Some biological processes may be specific, but other associations may lack specificity or even show extreme nonspecificity; e.g., the same biological cascade may be involved in many different phenotypes and diseases.

Biological plausibility is extremely interesting to consider, and the advent of new biological methods offers good opportunities for juxtaposing epidemiological and biological evidence. However, in most cases we still have a very incomplete picture of biology in all its complexity, and there is a considerable risk that, post hoc, it is easy to invoke some kind of biological plausibility to support almost any research finding. Biological gradient is often difficult to prove, and it is uncertain whether it should be a prerequisite for many associations in molecular epidemiology. For example, in the vast majority of genetic associations there is no a priori rationale to imply that a specific genetic model, in particular an additive or multiplicative one with a linear trend, is the one that operates biologically. Fitting the data to select a specific model among several possible models runs the danger of spurious overfitting. Currently, there is no reason to believe that an association following a model that fits some biological gradient, e.g., a dose-allele response, is more plausible than an association that fits a different model. Strength has been a traditional hallmark criterion in epidemiology, but it is increasingly recognized that some claimed associations that supposedly show
large effect sizes may simply be the result of stronger bias operating in their generation. For many domains of molecular epidemiology, most postulated risk factors seem to have relatively small effect sizes that are almost indistinguishable from each other. Many "quanta" of such risk factors need to operate in tandem or interact with each other to generate a considerable clinical risk.

With these criteria having relatively limited applications for confirming associations in most instances, one is left with consistency and replication, and these need to be discussed in a fresh light in the context of molecular epidemiology. The current chapter takes the stance that validation in molecular epidemiology depends on and requires the juxtaposition and, if possible, the synthesis of pieces of evidence from diverse studies using the same or complementary methods to address a family of questions of interest. The chapter will focus primarily on the integration of data from different studies and the evaluation of between-study heterogeneity (a measure of consistency or lack thereof) through meta-analysis methods.

Some definitions here are useful. We define replication as any effort that aims to examine a previously proposed research finding, in a framework where the data from the original study and the replicating ones can be considered for quantitative synthesis through meta-analysis. Meta-analysis can always try to address the diversity of replication efforts (inconsistency, heterogeneity between studies), regardless of whether it can also produce reliable summary estimates by "pooling" the results of all studies. Conversely, one may use the term corroboration to define any effort that may strengthen or weaken the credibility of a previously proposed research finding, in a framework where the new data are obtained with too dissimilar or incompatible a method for a quantitative synthesis with the original data to be possible. The two terms are often used interchangeably, and we have to acknowledge that the line where data are too dissimilar to even consider for meta-analysis is often subjective.

META-ANALYSIS METHODS

Meta-analysis is a term used to describe the quantitative synthesis of information obtained from different studies. The aims of meta-analysis are usually twofold: first, to measure the extent of heterogeneity (inconsistency) among the different studies, and second, to try to arrive at some summary estimate of the effect of interest. There is a very wide literature on meta-analysis methods, and a detailed description is beyond the scope of this chapter. Methods include both parametric and nonparametric approaches and both frequentist and Bayesian implementations and variants thereof. I will present some common methods for meta-analyses of association data, and will also mention in brief some other common or emerging applications of meta-analysis in molecular epidemiology.

Meta-Analysis of Association Data

In the typical scenario there are k studies that may address the same association, each of which has an effect estimate d_i with a certain variance. The first quantitative step is to examine the extent of between-study heterogeneity. Cochran's Q statistic (11) is the most common test used to examine whether between-study heterogeneity is statistically significant or not. It is calculated as the weighted sum of the squared deviations of each study's effect estimate from the
common effect calculated for all studies under the assumption that no heterogeneity exists, i.e.,

Q = \sum_i w_i (d_i - d_+)^2

The weights w_i are the inverse of the variance of each effect estimate, as we discuss in more detail below. Q has an asymptotic χ² distribution with k − 1 degrees of freedom and is typically considered significant at the α = 0.10 level (12,13). However, even with this lenient level of significance, the test may be considerably underpowered in the large majority of meta-analyses conducted to date, where the amount of data (number of studies combined) is limited. Therefore, in most circumstances negative inferences that "no heterogeneity exists" may be misleading.

Another useful measure of inconsistency is the between-study variance τ². A moment-based estimate for τ² was proposed by DerSimonian and Laird (14) and can be adapted to different effect metrics. The formula for this is given by

\tau^2 = \max\left\{ \frac{Q - (k - 1)}{\sum_i w_i - \sum_i w_i^2 / \sum_i w_i},\; 0 \right\}

This estimate has the advantage of offering a direct measure of how much the effect sizes differ across studies. However, it does not avoid the limitation that it can have substantial uncertainty when data are limited. Moreover, this estimate depends on the metric and on the magnitude of the effect sizes. Another potentially more useful measure that does correct for this latter limitation is the ratio of the square root of the between-study variance over the summary effect size. This ratio measures how big the variability in effect sizes is compared with the most plausible common effect size.

Some inconsistency metrics do not depend on the effect metric and the number of studies. The most popular of these is I² (15–17), which expresses the percentage of between-study variability that is attributable to heterogeneity rather than chance, i.e.,
I^2 = \frac{\tau^2}{\tau^2 + s^2}
where s² denotes the within-study variance component. It can be shown that I² can also be calculated as 1 − (k − 1)/Q. Typical thresholds for this metric are often quoted, with values below 25% denoting low heterogeneity, 25% to 50% denoting modest heterogeneity, 50% to 75% large heterogeneity, and above 75% very large heterogeneity. However, these estimates can also have very large uncertainty when few studies are available, as in most meta-analyses. Thus, it has recently been recommended (17) that 95% confidence intervals for I² be shown routinely in meta-analyses.

Much has been said about whether combination of data from different studies is justified in the face of different amounts of between-study heterogeneity. This is probably largely a pseudodilemma. Importantly, statistical heterogeneity is only a mirror, occasionally a vague and distorted one, of clinical and biological heterogeneity. Clinical and biological heterogeneity should be examined on a case-by-case basis, and often important aspects of heterogeneity may be unknown or unmeasured. Lack of documented statistical heterogeneity does not guarantee that no clinical or biological heterogeneity exists. The presence of statistical heterogeneity, conversely, does not mean that a specific type of clinical or biological heterogeneity has been identified. Allowing for these caveats, a quantitative synthesis of the data is practically always feasible in a meta-analysis. Simply, one has to be cautious in the interpretation of the results when very large heterogeneity exists, is documented, or is suspected.
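To make these quantities concrete, here is a minimal Python sketch (ours, for illustration only) that computes Cochran's Q, the DerSimonian-Laird τ², and I² from a set of study estimates; the five effect estimates and variances at the bottom are hypothetical.

```python
import numpy as np

def heterogeneity(d, var_d):
    """Cochran's Q, DerSimonian-Laird tau^2, and I^2 for k study estimates."""
    d, var_d = np.asarray(d, float), np.asarray(var_d, float)
    k = len(d)
    w = 1.0 / var_d                            # inverse-variance weights
    d_plus = np.sum(w * d) / np.sum(w)         # common effect assumed under Q
    Q = np.sum(w * (d - d_plus) ** 2)          # Cochran's Q
    # moment-based between-study variance, truncated at zero
    tau2 = max((Q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)), 0.0)
    I2 = max(1.0 - (k - 1) / Q, 0.0) if Q > 0 else 0.0   # I^2 = [Q - (k-1)] / Q
    return Q, tau2, I2

# five hypothetical log odds ratios and their variances
Q, tau2, I2 = heterogeneity([0.25, 0.10, 0.40, -0.05, 0.30],
                            [0.02, 0.05, 0.04, 0.03, 0.06])
print(Q, tau2, I2)
```

With so few studies, the wide uncertainty of τ² and I² discussed above applies with full force to any numbers this sketch produces.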
Table 1  Examples of Commonly Used Effect Size Estimates and Their Variances in Meta-Analysis

Log odds ratio (log OR), population-based study:
  d = \log\frac{p_1}{1 - p_1} - \log\frac{p_2}{1 - p_2}
  var(d) = \frac{1}{p_1(1 - p_1)n_1} + \frac{1}{p_2(1 - p_2)n_2}

Mean difference (MD):
  d = m_1 - m_2
  var(d) = \frac{sd_1^2}{n_1} + \frac{sd_2^2}{n_2}

Standardized mean difference (Hedges's g):
  d = \frac{m_1 - m_2}{s_p}\left(1 - \frac{3}{4(n_1 + n_2) - 9}\right), where s_p = \sqrt{\frac{(n_1 - 1)sd_1^2 + (n_2 - 1)sd_2^2}{n_1 + n_2 - 2}}
  var(d) = \frac{n_1 + n_2}{n_1 n_2} + \frac{g^2}{2(n_1 + n_2 - 3.94)}

Two groups are assumed to be compared (group 1 and group 2). Abbreviations: p, the proportion with the molecular risk factor of interest; n, the total sample; m, the mean of the continuous (quantitative) trait/variable of interest; sd, standard deviation.
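As an illustration of how the Table 1 formulas are used in practice, the small Python sketch below computes a log odds ratio and a Hedges's g with their variances; the function names and the numerical inputs are ours, not from the chapter.

```python
import math

def log_or(p1, n1, p2, n2):
    """Log odds ratio and its variance (first row of Table 1).
    p: proportion with the risk factor in the group; n: group sample size."""
    d = math.log(p1 / (1 - p1)) - math.log(p2 / (1 - p2))
    var = 1 / (p1 * (1 - p1) * n1) + 1 / (p2 * (1 - p2) * n2)
    return d, var

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference with small-sample correction (third row)."""
    sp = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    g = ((m1 - m2) / sp) * (1 - 3 / (4 * (n1 + n2) - 9))
    var = (n1 + n2) / (n1 * n2) + g ** 2 / (2 * (n1 + n2 - 3.94))
    return g, var

print(log_or(0.30, 500, 0.25, 500))              # hypothetical allele frequencies
print(hedges_g(10.2, 3.1, 120, 9.5, 2.9, 110))   # hypothetical trait means
```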
The simplest approach to obtaining a summary effect is a fixed effects model (18,19). Fixed effects approaches assume that there is a single common effect size and that the observed between-study variability is entirely attributable to chance. The summary effect size d_+^F is obtained by the following weighting:

d_+^F = \sum_i W_i^F d_i = \frac{\sum_i w_i d_i}{\sum_i w_i}, \qquad W_i^F = \frac{w_i}{\sum_j w_j}, \quad i = 1, \ldots, k, \qquad w_i = \mathrm{var}(d_i)^{-1}

i.e., the weight of each study is the inverse of the variance of its effect. Table 1 shows some typical effect metrics and their variances. The variance of the summary effect d_+^F is given by

\mathrm{var}(d_+^F) = \left(\sum_i w_i\right)^{-1}

Fixed effects are counterintuitive when between-study heterogeneity is documented or suspected. Random effects models (20) assume that the results of different studies may differ among themselves. What we are then interested in is calculating an average effect size that is most typical of the distribution of effect sizes, and the variance of this distribution. The summary estimate for random effects, d_+^R, is obtained by a linear estimator similar to the one described above, but replacing the weights with

w_i^R = \left(\mathrm{var}(d_i) + \tau^2\right)^{-1}

Similarly, the variance of the summary effect d_+^R is given by var(d_+^R) = (\sum_i w_i^R)^{-1}.

In the absence of between-study heterogeneity, the fixed and random effects estimates coincide. In the presence of between-study heterogeneity, only random effects make sense, since the basic assumptions of fixed effects are violated. Therefore, random effects should be preferred in general, although both fixed and random effects estimates are easy to obtain and compare in any commercial statistical package. One caveat is that random effects may tend to give disproportionately more weight to smaller studies compared with fixed effects. If there is a suspicion that the evidence derived from smaller studies may be more biased than the evidence derived from larger studies, this weighting may result in more biased results overall. Random effects models are also unstable when studies involve very small numbers and zero counts in the 2 × 2 tables to be synthesized.
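The fixed and random effects computations are equally mechanical; the sketch below (ours, with hypothetical inputs in mind) returns both summaries so they can be compared side by side.

```python
import numpy as np

def pooled_estimates(d, var_d):
    """Fixed- and random-effects summaries by inverse-variance weighting."""
    d, var_d = np.asarray(d, float), np.asarray(var_d, float)
    k = len(d)
    wF = 1.0 / var_d                           # fixed-effects weights
    dF = np.sum(wF * d) / np.sum(wF)           # fixed-effect summary
    var_dF = 1.0 / np.sum(wF)
    Q = np.sum(wF * (d - dF) ** 2)
    tau2 = max((Q - (k - 1)) / (np.sum(wF) - np.sum(wF ** 2) / np.sum(wF)), 0.0)
    wR = 1.0 / (var_d + tau2)                  # DerSimonian-Laird weights
    dR = np.sum(wR * d) / np.sum(wR)           # random-effects summary
    var_dR = 1.0 / np.sum(wR)
    return (dF, var_dF), (dR, var_dR)
```

Note that when the estimated τ² is zero the two summaries coincide, matching the statement above; as τ² grows, the random-effects weights flatten and smaller studies gain relative influence.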
Besides these basic methods, many other approaches may be applied to the meta-analysis of parametric data on associations. There is an increasing use of Bayesian methods in particular, but their description goes beyond the scope of this chapter. The interested reader is referred to several relevant references (21–24).

There are many ways to show the results of meta-analyses, the most common being the traditional forest plot, where each study is shown by its effect size and 95% confidence interval, and the summary effect and its 95% confidence interval are also shown. Cumulative meta-analysis (25) orders the results of the studies to be combined according to some specific criterion (e.g., chronological) and then recalculates the summary estimate with the addition of one study at a time (or, for chronological ordering, at the end of each calendar year). Cumulative meta-analysis may be very useful in showing the strength of associations over time, as more data are obtained, and especially whether an association dissipates with the addition of new evidence or remains equally strong and consistently replicated (3). Recursive cumulative meta-analysis shows the relative change in the effect size at each update, rather than the updated effect itself, and may also be helpful in visualizing the trends and extent of changes in the summary effect (26). Another visual aid, the Galbraith plot, may be useful in showing outlier studies (27). For other more specialized graphs see Sutton et al. (28).

Meta-Analyses of Other Data in Molecular Epidemiology

I will simply mention here in brief some other methods that have been developed for the combination of types of data beyond typical parametric associations. Several methods have been developed for the combination of linkage data, such as those derived from genomewide scans. Most of these methods are not parametric. Typically used methods include meta-analysis of significance levels, Fisher's sum of logs, Stouffer's sum of zs, the respective weighted versions, the truncated p value product method, the multiple scan probability method, and the genome scan meta-analysis (GSMA) methods along with their implementation of heterogeneity testing [heterogeneity-based genome scan meta-analysis (HEGESMA)]. The interested reader is referred to the relevant references (29–37).

Another literature is rapidly developing for the combination of data from microarray studies of gene expression profiling and similar databases of multidimensional biology. These methods involve permutation tests (38,39), support vector machines (40), parametric tests and clustering (41), machine learning algorithms (42), rank-aggregation procedures (43), rank product methods (44), linear programming and decomposition procedures (45), and parametric meta-analysis methods (46). The interested reader is referred to the specific references for more information.

Biases in Meta-Analyses of Associations

All errors and biases that affect single studies can be carried over into the meta-analyses that try to combine these studies. Actually, meta-analysis offers a prime opportunity to examine robustly the data of the individual studies that are to be combined and the design and conduct of the experiments that have led to the collection of these data. When this is done prospectively and the meta-analysis is an anticipated common goal of several conducted studies, the errors can be efficiently minimized by advance planning. Much too often, though, meta-analysis is used to combine data after the fact, and it is not possible to change the design or conduct of past studies.
Improvements in the quality of the data may be limited by pragmatic constraints. Data may also be fragmented and only partially available. In this case, the meta-analysis may still serve the field well by carefully trying to record and interpret the potential errors that may have intervened in the single studies and by acknowledging the limitations of the collected evidence.
Besides errors and biases that affect single studies, there are some biases that affect whole fields of research at large, and these are particularly interesting from a meta-analysis perspective.

Publication bias refers to the preferential publication of studies on the basis of their results: studies with "negative" (statistically nonsignificant) findings are not published, even though they are of similar quality to studies with "positive" results. This may cause inflated summary results (47). Many statistical tests have been proposed that try to detect publication bias and/or adjust the results of meta-analyses for potential publication bias (48). Most of these tests examine whether small studies give different results from larger studies, under the assumption that larger studies will tend to be published regardless of their results, while smaller studies may be more likely to remain unpublished if their results are "negative" than if their results are "positive." This basic assumption may be problematic in molecular epidemiology, where there is often a very large array of analyses that can be performed to reach statistical significance (using different genetic models, different adjustments, subgroups, or interactions, to name a few options). Moreover, these tests require a large number of studies to have sufficient power, a condition that is not frequently met. Some of the traditionally used tests also have inflated type I errors, although newer versions bypass this problem. In all, these tests should be applied cautiously and their results should be interpreted with caution as "small study effects" and not as definitive evidence for the presence or absence of publication bias. Publication bias is very difficult to deal with retrospectively and can be effectively conquered only with prospective designs, including prospective registration of all studies and/or all-inclusive consortia (see below).
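The chapter does not prescribe a particular test, but Egger's regression is one widely used example of the small-study-effects tests described above; the sketch below (ours) is a bare-bones version that returns the regression intercept and slope, without the formal t-test on the intercept that one would normally add.

```python
import numpy as np

def egger_regression(d, var_d):
    """Egger-style small-study-effects check: regress the standardized effect
    (d / SE) on precision (1 / SE). An intercept far from zero suggests that
    small studies give systematically different results from large ones."""
    d, var_d = np.asarray(d, float), np.asarray(var_d, float)
    se = np.sqrt(var_d)
    X = np.column_stack([np.ones_like(se), 1.0 / se])  # [intercept, precision]
    y = d / se
    (intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)
    return intercept, slope
```

Consistent with the caveats above, a nonzero intercept from such a regression is evidence of "small study effects," not proof of publication bias, and the test is unreliable when only a handful of studies are available.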
Time-lag bias refers to the situation where the time to publication of a study depends on its results, with "positive" results being published faster than "negative" results. All data may eventually be published (so there is no publication bias strictly speaking), but the early evidence shows significant findings that are not validated when further data appear. Time-lag bias was first described in clinical research, affecting randomized clinical trials (49). The situation may be common in molecular epidemiology. Two variants are worth mentioning. First, one often encounters in the early literature a succession of extreme opposite results, refuting each other. This has been termed the Proteus phenomenon (50) and most likely reflects the interest in publishing contradictory results once a prominent new claim has been made, in an environment where rapid, massive testing of hypotheses is facilitated. The Proteus phenomenon has evolved in recent years to the point that the extremely contradictory results may appear even in the same publication. For example, it is typical in genomewide association studies to find many putative association signals in the first stage of testing, most of which are contradicted immediately upon the first replication efforts because of regression to the mean. The winner's curse (51) is another variant: on average, the first study that finds and proposes a new association may show an exaggerated effect compared with the true effect of the association. This is to be expected when new associations emerge from massive testing and only the lowest-hanging fruit are selected. The lowest-hanging fruit find themselves in this prominent position partly by true merit (true associations) and partly by chance. Replication efforts should eventually correct the contribution of chance, provided they are unbiased.

Selective outcome and analysis reporting bias occurs when outcomes/analyses with "negative" results are not reported, whereas emphasis is given to "positive" outcomes/analyses in the same study. This has been described as a major problem even in randomized trials (52,53), where, in theory, the room for manipulation of outcomes and analyses should be far more limited compared with observational research. A survey of epidemiological studies suggests that selective reporting is probably very common (1), and
the quality of reporting in epidemiological research has remained elliptical and suboptimal. In this regard, meta-analytic approaches offer a major advantage in that they try to standardize the analyses across the different combined data sets. Some of the "positive" findings may be dissipated during the standardization process. However, sometimes standardization is impossible after the fact, because the multiplicity of analyses and the extent of selective reporting are so large that it is impossible to harmonize the data from different studies. A meta-analysis may have to be aborted if there is no minimal concordance in analytical choices and definitions across the studies that one wishes to combine (54).

Language bias (55) ensues when, depending on their results, studies are selectively reported in local non-English-language journals rather than in international English-language journals. Most local-language journals are not indexed in major databases such as PubMed. The results of a meta-analysis may differ depending on how extensive an effort is made to retrieve and include non-English articles. A systematic comparison of the English-language international literature and the local Chinese-language literature (56) has shown that genetic association studies for common diseases invariably show large and significant effects in the Chinese literature. This may herald some strong biases in this local literature, and it is unlikely that this problem is confined to the Chinese language. A similar issue may arise for results that have remained unpublished but have nevertheless been presented in abstracts at various meetings. Here it is more likely that a selection bias exists in the direction of a larger proportion of these otherwise unpublished results being "negative." In all, retrospective meta-analyses may be considerably influenced by the choice of the boundaries of the "universe" from which evidence is drawn for data synthesis. This universe may often be a biased version of the whole universe of eligible data.

Collaborative Meta-Analyses, Consortia, and Meta-Analysis in the Setting of Genomewide Association Studies

Collaborative meta-analyses may bypass several of the biases alluded to above. Such collaborations are usually based on a consortium of investigators who decide to join forces and share data (57–59). Consortia may merge team-specific databases in which data have already been collected on specific questions of interest, or may design parts of, or the whole, protocol prospectively for new questions of interest. Often there is a limit on how much can be done prospectively, especially if the participating teams are not studies designed de novo but already have a history of conducting research on the basis of existing protocols, study designs, implementations, and sample collections. The term meta-analysis of individual participant data (MIPD) is used to describe analyses that combine information from several teams/studies with details available for each individual participant in these studies, rather than simply group summary data, as are typically available in published articles. Advantages of this approach include the standardization (or at least harmonization) of definitions of cases and controls, opportunities for better control of confounding, and unified and more flexible statistical analyses (60,61). Overall, there are a number of challenges that consortia of multiple teams of investigators come across.
Creative solutions to these challenges can improve the quality of the science and lead to more rapid advances in knowledge in the respective scientific fields (Table 2). Prospective coordinated efforts of consortia are increasingly popular in molecular epidemiology. As new molecular risk factors are proposed by single teams, consortia are able to appraise and replicate these newly proposed risk factors with large-scale evidence in a timely and reliable manner. Such "prospective" meta-analyses largely obviate the problems of publication bias and selective reporting for data within the consortium,
Table 2  Challenges Faced by Networks of Investigators in Human Genome Epidemiology and Possible Solutions

Challenge: Resources for establishing the initial infrastructure, supporting consortia implementation, and adding new partners.
Possible solutions: New and more flexible funding mechanisms: planning grants, collaborative research grants. Coordination among national and international funding agencies and foundations. Appropriate evaluation criteria for continuation of funding.

Challenge: Coordination: minimize administration to maximize scientific progress and avoid conflicts.
Possible solutions: Clear leadership structure: steering committee and working groups. Early development of policies and processes. Cutting-edge communication technology.

Challenge: Selection of target projects.
Possible solutions: Questions that can be uniquely addressed by collaborative groups. Preliminary supportive evidence. High-profile controversial hypotheses. Biological plausibility. Genomewide evidence. Eligibility criteria based on sample size. Sound and appropriate study design.

Challenge: Variable data and biospecimen quality from participating teams.
Possible solutions: Accurate phenotype outcome and genotype assessments. State-of-the-art biospecimen repositories.

Challenge: Handling of information from nonparticipating teams and of negative results.
Possible solutions: Integration of evidence across all teams and networks in a field. Comprehensive reporting to maintain transparency. Curated, updated encyclopedia of the knowledge base.

Challenge: Collection, management, and analysis of complex and heterogeneous data sets.
Possible solutions: Central informatics unit or coordinating center. "Think tank" for analytical challenges of retrospective and prospective data sets. Centralization of genotyping. Standardization or harmonization of phenotypical and genotypical data. Standardization of quality control protocols across participating teams.

Challenge: Anticipating future needs.
Possible solutions: Rapid integration of evolving high-throughput genomic technologies. Consideration of centralized platforms. Maximizing use of bioresources. Public-private partnerships. Development of analytical approaches for large and complex data sets.

Challenge: Communication and coordination.
Possible solutions: Web-based communication: web sites and portals. Teleconferences and meetings support.

Challenge: Scientific credits and career development.
Possible solutions: Upfront definition of publication policies. Mentorship of young investigators. Changes in tenure and authorship criteria.

Challenge: Access to the scientific community at large and transparency.
Possible solutions: Data sharing plans and policies. Support for release of public data sets. Availability and dissemination of both "positive" and "negative" results. Encyclopedia of knowledge.

Challenge: Peer review.
Possible solutions: Review criteria appropriate for interdisciplinary large science. Education of peer scientists on consortia issues. Inclusion of interdisciplinary expertise in Initial Review Groups.

Challenge: Informed consent.
Possible solutions: Anticipation of data and biospecimen sharing requirements and careful phrasing of informed consent. Sensitivity to local and national legislation.

Source: From Ref. 59.
provided that there is no conscious effort to selectively publish "positive" results. Prospective meta-analyses also allow better quality control, since the measurement of the molecular risk factor of interest can be performed with centralized quality control, or even simply at a single central laboratory with more robust procedures. This practice has become increasingly frequent with genomewide association studies, where the top-ranking polymorphisms from the early discovery phases are then tested across several other teams of investigators (62,63). Some of these consortia reflect long-established collaborations, while others may be more "opportunistic collaborations" that arise simply from the need to provide quick replication of findings on the way to rapid publication. There may be no further commitment to continue the collaboration on a more regular basis, other than as needs arise. Also, the membership of these consortia may not be fixed, but may be regulated by the ability of specific teams to contribute data quickly to document replication in a very competitive and fast-moving research environment. This is not necessarily a disadvantage, but it can become an issue when the availability of the data to be combined is a function of their results. Selective reporting and publication bias then reemerge despite the appearance of a prospective design. Finally, it is possible that for the same disease and molecular research field, several consortia may exist, with nonoverlapping or even overlapping membership.

More long-standing commitment to collaboration in the form of consortia requires more effort and ideally also considerable funding to maintain a suitable infrastructure. While there are several challenges in the process (Table 2), such a long-standing commitment is eventually worth it. One gain is the enhancement of communication and collaboration among diverse teams of investigators working on the same topic. Consortia of investigators may also assume a leading role in maintaining updated synopses of all evidence in their field, although this role may also be performed by other researchers (64,65).

A major challenge in the evolving research environment is to ensure that there is full transparency and ideally public availability of all data that are produced by the current massive-testing platforms. The Genetic Association Information Network (GAIN) is one such effort (66) that aims to enhance public availability of detailed data from genomewide association studies without compromising the rights of the original investigators to their data and proper exploitation of discovery. Transparent reporting of experiments is also fully recognized as a priority in several other fields, such as microarrays [e.g., the minimum information about a microarray experiment (MIAME) guidelines (67)] and other platforms of multidimensional biology. Microarray data are now routinely available in public repositories, and many leading journals require public deposition of data in suitable databases as a prerequisite for publication (68). However, empirical surveys suggest that public availability is far from complete as of the time of writing of this chapter (69,70). Moreover, more work is needed to optimize the use of publicly available raw data, enhance communication between primary investigators and secondary users, avoid misconception and mishandling of databases, and ensure that proper credit is given to all involved.
STANDARDS FOR VALIDATION: CREDIBILITY OF MOLECULAR ASSOCIATIONS

Calibrating Credibility

Assigning a credibility level to the findings of molecular epidemiological studies is a difficult task and entails some subjective interpretation of the evidence. We believe that a comprehensive meta-analysis should be a first step in the process, since it can convey
Table 3  Typical Credibility of Research Findings According to Effect Size and Extent of Replication

Effect size (relative risk)   Replication   Typical credibility (%)
Large (>5)                    None          10–60
                              Limited       30–80
                              Extensive     70–95
Moderate (2–5)                None          5–20
                              Limited       10–40
                              Extensive     50–90
Small (1.2–2)                 None          <5
                              Limited       2–20
                              Extensive     10–70
Very small (1–1.2)            None          <1
                              Limited       1–5
                              Extensive     2–30

Source: From Ref. 71.
information on how much evidence exists, how strong the effects are, and how consistently they seem to be replicated. Table 3 offers such a proposed subjective credibility rating for associations as a function of the effect size and the extent of replication (71). As shown, for each level of effect size and extent of replication, there is a considerable range of credibility. The reason is that, besides the amount of evidence and the extent and consistency of replication, it is extremely important to consider whether there is adequate protection from bias in the accumulated evidence. Bias can invalidate any association, no matter how strong and how (seemingly) extensively replicated, and it has been argued that bias is the major determinant of credibility in the molecular era. To illustrate this issue, in a Bayesian format, the probability of a research finding being true is generally given by

\frac{(1 - \beta)R + u\beta R}{R + \alpha - \beta R + u - u\alpha + u\beta R}

where R is the prestudy odds, 1 − β is the power of a study, α is the type I error, and u is a term for bias, denoting the proportion of associations that show up as "positive" because of bias among the total number of associations that do not truly exist (72). The probability of a research finding being true decreases with increasing bias u, and it cannot exceed 50% unless u is less than R. This means that bias must decrease below the prestudy odds of a research question for a "positive" result to be able to reach credibility exceeding 50% under any circumstances, even if studies of infinite sample size are conducted and their results are consistently replicated. If the same errors are repeated again and again, replication will be consistent, but this will not mean that the association is true. I will now examine a bit more closely some of the components of the credibility rating.
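A direct transcription of the credibility formula above into code makes it easy to explore how quickly bias erodes credibility; the function and the example values below are ours.

```python
def prob_finding_true(R, power, alpha, u):
    """Probability that a 'positive' research finding is true (Ref. 72).

    R: prestudy odds that the probed association is non-null
    power: 1 - beta, the power of the study
    alpha: type I error rate
    u: bias term (proportion of null associations reported as positive due to bias)
    """
    beta = 1 - power
    numerator = (1 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator

# hypothetical: a well-powered study, low prestudy odds, modest bias
print(prob_finding_true(R=0.01, power=0.8, alpha=0.05, u=0.10))  # about 0.05
```

With these illustrative inputs, even a well-powered, nominally significant finding has only a few percent probability of being true, because u greatly exceeds R; shrinking u below R is what lets credibility climb past 50%, exactly as stated above.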
Effect Size

Larger effects are likely to reach more stringent levels of statistical significance, higher Bayesian credibility estimates, and lower false-discovery rates (73), as compared with smaller effects tested with the same amount of evidence and exhibiting a similar replication profile. However, several caveats need to be acknowledged here. First, the majority of effect sizes in molecular epidemiology are small, and massive testing platforms, such as those employed in genomewide association studies, tend to reveal effect sizes that are minimal, at the edge of the analytical capability of epidemiological research. In the observed range of typical effects, small effects may often be indistinguishable from each other; i.e., an odds ratio of 1.14 may be very difficult or impossible to genuinely differentiate from an odds ratio of 1.19, even if very large studies are conducted. Second, in settings where the prestudy odds are very low, the average observed effects tend to reflect simply the average net bias, unless the bias is properly and meticulously minimized (72). In conditions where massive testing is accompanied by substantial bias, larger effects may simply mean that larger bias has intervened in the specific scientific field.

Amount of Evidence

Other things being equal, larger amounts of evidence increase the credibility of associations, regardless of whether one sees this from the viewpoint of statistical significance achieved, Bayesian credibility, or false discovery rate. Moreover, larger studies may be more protected from biases, especially selective reporting, although this may vary from one scientific field to another. A large number of larger studies also allows a better handle on estimating between-study heterogeneity, and thus statements about it are likely to be more robust. Many fields of molecular epidemiology have operated with very small sample sizes. For example, most applications of microarrays and other -omics multidimensional platforms were originally described in studies with sample sizes ranging between a dozen and a few hundred participants (74,75). Larger studies are still uncommon as of the time of writing of this chapter, but there is increasing realization that very large studies would be useful in moving these technologies forward (76). Other fields, such as genetic associations and pharmacogenomics, also started with similarly small studies, but they have already reached the point where large-scale single studies with many thousands of participants and consortia of teams with tens of thousands of participants are not uncommon. If single molecular effects are very small, then sufficient sample size may be a key limiting factor for demonstrating associations. As discussed above, credibility is dependent on study power; thus larger studies should help increase credibility, other aspects of the evidence being equal.

Replication and Consistency

As discussed already, replication is a key component of credibility assessment of proposed associations. Table 4 shows some proposed criteria for what constitutes "replication," particularly in the setting of genetic association studies, according to a published consensus (77). Similar qualitative appraisals could be developed for other types of molecular research. In the meta-analysis framework, replication consistency can be appraised with heterogeneity metrics (9), but one should be aware of the caveats discussed above for these metrics. Biological and clinical heterogeneity may not square with statistical heterogeneity. If some associations are genuinely different across different settings and populations, then strict replication may be impossible, other than within the very limited setting where homogeneity exists (7).

Protection from Bias

As discussed above, the assessment of the cumulative evidence is a prime opportunity to examine the extent of bias. This is a difficult exercise, and it is always possible that important information essential for appraising the extent of bias may be missing, or
Table 4  Suggested Criteria for Establishing Positive Replication

These criteria are intended for follow-up studies of initial reports of genotype-phenotype associations assessed by genomewide or candidate-gene approaches.

Replication studies should be of sufficient sample size to convincingly distinguish the proposed effect from no effect.
Replication studies should preferably be conducted in independent data sets, to avoid the tendency to split one well-powered study into two less conclusive ones.
The same or a very similar phenotype should be analyzed.
A similar population should be studied, and notable differences between the populations studied in the initial and attempted replication studies should be described.
A similar magnitude of effect and significance should be demonstrated, in the same direction, with the same SNP or an SNP in perfect or very high linkage disequilibrium with the prior SNP (r² close to 1.0).
Statistical significance should first be obtained using the genetic model reported in the initial study.
When possible, a joint or combined analysis should lead to a smaller p value than that seen in the initial report.
A strong rationale should be provided for selecting SNPs to be replicated from the initial study, including linkage-disequilibrium structure, putative functional data, or published literature.
Replication reports should include the same level of detail for study design and analysis plan as reported for the initial study.

Abbreviation: SNP, single nucleotide polymorphism. Source: From Ref. 77.
sources of bias may be latent and impossible to reveal. It is essentially impossible to reach 100% certainty about any molecular association, simply because latent bias can never be 100% excluded. However, meticulous study design, conduct, and analysis can help increase certainty in the truth of specific associations. Table 6 shows examples of biases that may need to be considered, in particular for genetic association studies, and the impact they may have on the cumulative evidence (9). Of note, bias will easily invalidate small effects, while more major bias is needed to nullify larger effects. However, there is no guarantee that any large effect is protected from bias, unless the sources thereof have been carefully eliminated. When evidence has already been summarized through meta-analysis, some diagnostic tests that can routinely be performed to examine whether the association is robust to possible bias include examining whether the results remain significant after the first study is excluded; evaluating whether significance remains after removal of the Hardy-Weinberg-deviating studies or after adjusting for such deviations (for genetic associations); and evaluating whether there are small-study effects (a possible hint of publication bias) or an excess of statistically significant findings in single studies (78). In the presence of any of these hints, or any other documentation of considerable error, the evidence should be considered susceptible to bias. However, no test can have perfect sensitivity and specificity for detecting bias, so there is some unavoidable subjectivity in this assessment.

Putting it Together

An overall assessment of the credibility of a molecular association needs to consider all the parameters listed above. Such a scheme has been developed, for example, for genetic associations of common variants (the Venice criteria), and it is shown in Table 5.
Table 5  Considerations for Epidemiological Credibility in the Assessment of Cumulative Evidence on Genetic Associations

Criterion: Amount of evidence
Categories:
  A. Large-scale evidence
  B. Moderate amount of evidence
  C. Little evidence
Proposed operationalization: Thresholds may be defined on the basis of sample size, power, or false discovery rate considerations. The frequency of the genetic variant of interest should be accounted for. As a simple rule, we suggest that category A requires over 1000 subjects (total number of cases and controls assuming a 1:1 ratio) evaluated in the least common genetic group of interest; B corresponds to 100–1000 subjects evaluated in this group; and C corresponds to <100 subjects evaluated in this group.(a)

Criterion: Replication
Categories:
  A. Extensive replication including at least one well-conducted meta-analysis with little between-study inconsistency
  B. Well-conducted meta-analysis with some methodological limitations or moderate between-study inconsistency
  C. No association; no independent replication; failed replication; scattered studies; flawed meta-analysis; or large inconsistency
Proposed operationalization: Between-study inconsistency entails statistical considerations (e.g., defined by metrics such as I², where values of 50% and above are considered large, and values of 25–50% are considered moderate inconsistency) and also epidemiological considerations for the similarity/standardization or at least harmonization of phenotyping, genotyping, and analytical models across studies.

Criterion: Protection from bias
Categories:
  A. Bias, if at all present, could affect the magnitude but probably not the presence of the association
  B. No obvious bias that may affect the presence of the association, but there is considerable missing information on the generation of evidence
  C. Considerable potential for or demonstrable bias that can affect even the presence or not of the association
Proposed operationalization: A prerequisite for A is that the bias due to phenotype measurement, genotype measurement, confounding (population stratification), and selective reporting (for meta-analyses) can be appraised as not being high (as shown in detail in Table 6), plus there is no other demonstrable bias in any other aspect of the design, analysis, or accumulation of the evidence that could invalidate the presence of the proposed association. In category B, although no strong biases are visible, there is no such assurance that major sources of bias have been minimized or accounted for, because information is missing on how phenotyping, genotyping, and confounding have been handled. Given that occult bias can never be ruled out completely, note that even in category A we use the qualifier "probably."

(a) For example, if the association pertains to the presence of homozygosity for a common variant, and if the frequency of homozygosity is 3%, then category A amount of evidence requires over 30,000 subjects, and category B between 3000 and 30,000.
Source: From Ref. 9.
Table 6  Typical Biases and Their Typical Impact on Associations Depending on the Status of the Evidence

Likelihood of bias to invalidate an observed association, by effect size
(columns: Small OR <1.15 | Typical OR 1.15–1.8 | Large OR >1.8)

Bias in phenotype definition
  Not reported what was done: Unknown | Unknown | Unknown
  Unclear phenotype definitions: Possible/high | Possible/high | Possible/high
  Clear, widely agreed definitions of phenotypes: Low/none | Low/none | Low/none
  Efforts for retrospective harmonization: Possible/high | Low | Low/none
  Prospective standardization of phenotypes: Low/none | Low/none | Low/none

Bias in genotyping
  Not reported what was done: Unknown | Unknown | Unknown
  No quality control checks: Possible/high | Low | Low
  Appropriate quality control checks: Low | Low | Low/none

Population stratification
  Not reported what was done: Unknown | Unknown | Unknown
  Nothing done(a): Possible/high | Possible/high | Possible/high
  Same descent group(b): Possible/high | Low | Low/none
  Adjustment for reported descent: Possible/high | Low | Low/none
  Family-based design: Low/none | Low/none | Low/none
  Genomic control, PCA, or a similar method: Low/none | Low/none | Low/none

Selective reporting biases
  Meta-analysis of published data: Possible/high | Possible | Possible
  Retrospective efforts to include unpublished data: Possible/high | Possible | Possible
  Meta-analysis within consortium: Low/none | Low/none | Low/none

(a) Including groups of clearly different descent without consideration of this diversity.
(b) The ethnic population structure may need to be considered also on a case-by-case basis.
Category decreases from A to B if the "Unknown" items are considered to be a major issue for the appraisal of the evidence. Any "Possible/high" item confers category C status. "Possible" (selective reporting biases for nonconsortium/prospective meta-analysis) does not necessarily decrease the category grade (from A to C); this may need to be appraised separately in each field and may be facilitated by using tests for selective reporting biases (tests for small-study effects and excess of significant studies), although probably no test has high sensitivity and specificity for such biases. Clear demonstrable biases in other aspects of the design, conduct, and analysis of the evidence (besides the four aspects considered in this table) also result in a shift to category C for protection from bias.
Abbreviations: OR, odds ratio; PCA, principal component analysis.
Source: From Ref. 9.
In this scheme, an association is considered to have strong epidemiological credibility if it scores top grades (three As) on amount of evidence, replication consistency, and protection from bias. Any C grade on any of the three axes results in weak epidemiological credibility, and any B without any C on any of the three axes results in moderate epidemiological credibility. These criteria are still preliminary, but they may be useful for operational purposes in appraising the strength of the evidence, at least in genetic epidemiology. Similar constructs may be developed for other fields of molecular epidemiology.
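The grading logic of this scheme is simple enough to state as code; the following sketch (ours) encodes only the three-axis rule just described, not the operationalizations in Table 5.

```python
def venice_grade(amount, replication, bias):
    """Overall epidemiological credibility from the three Venice grades.

    Each argument is 'A', 'B', or 'C' (Table 5). Three As give strong
    credibility; any C gives weak; any B without a C gives moderate."""
    grades = {amount, replication, bias}
    if 'C' in grades:
        return 'weak'
    if grades == {'A'}:
        return 'strong'
    return 'moderate'

assert venice_grade('A', 'A', 'A') == 'strong'
assert venice_grade('A', 'B', 'A') == 'moderate'
assert venice_grade('A', 'B', 'C') == 'weak'
```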
Finally, besides epidemiological credibility, one also has to consider biological plausibility and clinical (or even public health) relevance for each proposed association.

REFERENCES

1. Kavvoura FK, Liberopoulos G, Ioannidis JP. Selection in reported epidemiological risks: an empirical assessment. PLoS Med 2007; 4:e79.
2. Kyzas P, Denaxa-Kyza D, Ioannidis JPA. Almost all cancer prognostic marker studies report significant results. Eur J Cancer 2007; 43:2559–2579.
3. Ioannidis JP, Ntzani EE, Trikalinos TA, et al. Replication validity of genetic association studies. Nat Genet 2001; 29:306–309.
4. Hirschhorn JN, Lohmueller K, Byrne E, et al. A comprehensive review of genetic association studies. Genet Med 2002; 4:45–61.
5. Ransohoff DF. Lessons from controversy: ovarian cancer screening and serum proteomics. J Natl Cancer Inst 2005; 97:315–319.
6. Goring HH, Terwilliger JD, Blangero J. Large upward bias in estimation of locus-specific effects from genomewide scans. Am J Hum Genet 2001; 69:1357–1369.
7. Ioannidis JPA. Non-replication and inconsistency in the genome-wide association setting. Hum Hered 2007; 64:203–213.
8. Hill AB. The environment and disease: association or causation? Proc R Soc Med 1965; 58:295–300.
9. Ioannidis JP, Boffetta P, Little J, et al. Assessment of cumulative evidence on genetic associations: interim guidelines. Int J Epidemiol 2008; 37:120–132.
10. Davey Smith G, Ebrahim S. 'Mendelian randomization': can genetic epidemiology contribute to understanding environmental determinants of disease? Int J Epidemiol 2003; 32:1–22.
11. Cochran WG. The combination of estimates from different experiments. Biometrics 1954; 10:101–129.
12. Hardy RJ, Thompson SG. Detecting and describing heterogeneity in meta-analysis. Stat Med 1998; 17:841–856.
13. Higgins JP, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med 2002; 21:1539–1558.
14. DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials 1986; 7:177–188.
15. Higgins JP, Thompson SG, Deeks JJ, et al. Measuring inconsistency in meta-analyses. BMJ 2003; 327:557–560.
16. Huedo-Medina TB, Sanchez-Meca J, Marin-Martinez F, et al. Assessing heterogeneity in meta-analysis: Q statistic or I2 index? Psychol Methods 2006; 11:193–206.
17. Ioannidis JPA, Patsopoulos N, Evangelou E. Uncertainty of heterogeneity estimates in meta-analysis. BMJ 2007; 335:914–916.
18. Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst 1959; 22:719–748.
19. Yusuf S, Peto R, Lewis J, et al. Beta blockade during and after myocardial infarction: an overview of the randomized trials. Prog Cardiovasc Dis 1985; 27:335–371.
20. Egger M, Smith GD, Altman DG. Systematic Reviews in Health Care: Meta-Analysis in Context. London, UK: BMJ Publishing Group, 2001.
21. Spiegelhalter DJ, Thomas A, Best NG, et al. WinBUGS Version 1.4 User Manual. Cambridge: MRC Biostatistics Unit, 2003.
22. Spiegelhalter DJ, Abrams KR, Myles JP. Evidence synthesis. In: Spiegelhalter DJ, Abrams KR, Myles JP, eds. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Chichester: John Wiley & Sons, 2004.
23. Lambert PC, Sutton AJ, Burton PR, et al. How vague is vague? A simulation study of the impact of the use of vague prior distributions in MCMC using WinBUGS. Stat Med 2005; 24:2401–2428.
24. Minelli C, Thompson JR, Abrams KR, et al. Bayesian implementation of a genetic model-free approach to the meta-analysis of genetic association studies. Stat Med 2005; 24:3845–3861.
25. Antman EM, Lau J, Kupelnick B, et al. A comparison of results of meta-analyses of randomized control trials and recommendations of clinical experts. Treatments for myocardial infarction. JAMA 1992; 268:240–248.
26. Ioannidis J, Lau J. Evolution of treatment effects over time: empirical insight from recursive cumulative metaanalyses. Proc Natl Acad Sci USA 2001; 98:831–836.
27. Galbraith RF. A note on graphical presentation of estimated odds ratios from several clinical trials. Stat Med 1988; 7:889–894.
28. Sutton A, Abrams K, Jones D, et al. Methods for Meta-Analysis in Medical Research. Chichester, UK: Wiley, 2000.
29. Dempfle A, Loesgen S. Meta-analysis of linkage studies for complex diseases: an overview of methods and a simulation study. Ann Hum Genet 2004; 68:69–83.
30. Li Z, Rao DC. Random effects model for meta-analysis of multiple quantitative sibpair linkage studies. Genet Epidemiol 1996; 13:377–383.
31. Wise LH, Lanchbury JS, Lewis CM. Meta-analysis of genome searches. Ann Hum Genet 1999; 63:263–272.
32. Zintzaras E, Ioannidis JP. Heterogeneity testing in meta-analysis of genome searches. Genet Epidemiol 2005; 28:123–137.
33. Kruglyak L, Daly MJ, Reeve-Daly MP, et al. Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet 1996; 58:1347–1363.
34. Zaykin DV, Zhivotovsky LA, Westfall PH, et al. Truncated product method for combining P-values. Genet Epidemiol 2002; 22:170–185.
35. Dudbridge F, Koeleman BP. Rank truncated product of P-values, with application to genomewide association scans. Genet Epidemiol 2003; 25:360–366.
36. Etzel CJ, Guerra R. Meta-analysis of genetic-linkage analysis of quantitative-trait loci. Am J Hum Genet 2002; 71:56–65.
37. Wise LH, Lanchbury JS, Lewis CM. Meta-analysis of genome searches. Ann Hum Genet 1999; 63:263–272.
38. Rhodes DR, Yu J, Shanker K, et al. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci USA 2004; 101:9309–9314.
39. Zintzaras E, Ioannidis JPA. Meta-analysis for ranked discovery datasets: theoretical framework and empirical demonstration for microarrays. Comput Biol Chem 2008; 32:38–46.
40. Warnat P, Eils R, Brors B. Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics 2005; 6:265.
41. Smid M, Dorssers LC, Jenster G. Venn Mapping: clustering of heterologous microarray data based on the number of co-occurring differentially expressed genes. Bioinformatics 2003; 19:2065–2071.
42. Xu J, Li Y. Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics 2006; 22:2800–2805.
43. DeConde RP, Hawley S, Falcon S, et al. Combining results of microarray experiments: a rank aggregation approach. Stat Appl Genet Mol Biol 2006; 5:Article 15.
44. Hong F, Breitling R, McEntee CW, et al. RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics 2006; 22:2825–2827.
45. Wang Y, Joshi T, Zhang XS, et al. Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics 2006; 22:2413–2420.
46. Grutzmann R, Boriss H, Ammerpohl O, et al. Meta-analysis of microarray data on pancreatic cancer defines a set of commonly dysregulated genes. Oncogene 2005; 24:5079–5088.
47. Dickersin K, Min YI. Publication bias: the problem that won't go away. Ann N Y Acad Sci 1993; 703:135–146.
48. Rothstein HR, Sutton AJ, Borenstein M, eds. Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments. West Sussex: John Wiley & Sons, 2007.
49. Ioannidis JP. Effect of the statistical significance of results on the time to completion and publication of randomized efficacy trials. JAMA 1998; 279:281–286.
50. Ioannidis JP, Trikalinos TA. Early extreme contradictory estimates may appear in published research: the Proteus phenomenon in molecular genetics research and randomized trials. J Clin Epidemiol 2005; 58:543–549.
51. Lohmueller KE, Pearce CL, Pike M, et al. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat Genet 2003; 33:177–182.
52. Chan AW, Hrobjartsson A, Haahr MT, et al. Empirical evidence for selective reporting of outcomes in randomized trials: comparison of protocols to published articles. JAMA 2004; 291:2457–2465.
53. Chan AW, Krleza-Jeric K, Schmid I, et al. Outcome reporting bias in randomized trials funded by the Canadian Institutes of Health Research. CMAJ 2004; 171:735–740.
54. Contopoulos-Ioannidis DG, Alexiou GA, Gouvias TC, et al. An empirical evaluation of multifarious outcomes in pharmacogenetics: beta-2 adrenoceptor gene polymorphisms in asthma treatment. Pharmacogenet Genomics 2006; 16:705–711.
55. Egger M, Zellweger-Zahner T, Schneider M, et al. Language bias in randomised controlled trials published in English and German. Lancet 1997; 350:326–329.
56. Pan Z, Trikalinos TA, Kavvoura FK, et al. Local literature bias in genetic epidemiology: an empirical evaluation of the Chinese literature. PLoS Med 2005; 2:e334.
57. Ioannidis JP, Bernstein J, Boffetta P, et al. A network of investigator networks in human genome epidemiology. Am J Epidemiol 2005; 162:302–304.
58. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007; 447:661–678.
59. Seminara D, Khoury MJ, O'Brien TR, et al. The emergence of networks in human genome epidemiology: challenges and opportunities. Epidemiology 2007; 18:1–8.
60. Higgins JP, Whitehead A, Turner RM, et al. Meta-analysis of continuous outcome data from individual patients. Stat Med 2001; 20:2219–2241.
61. Whitehead A, Omar RZ, Higgins JP, et al. Meta-analysis of ordinal outcomes using individual patient data. Stat Med 2001; 20:2243–2260.
62. Saxena R, Voight BF, Lyssenko V, et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 2007; 316:1331–1336.
63. McPherson R, Pertsemlidis A, Kavaslar N, et al. A common allele on chromosome 9 associated with coronary heart disease. Science 2007; 316:1488–1491.
64. Ioannidis JP, Gwinn M, Little J, et al. A road map for efficient and reliable human genome epidemiology. Nat Genet 2006; 38:3–5.
65. Bertram L, McQueen MB, Mullin K, et al. Systematic meta-analyses of Alzheimer disease genetic association studies: the AlzGene database. Nat Genet 2007; 39:17–23.
66. The GAIN Collaborative Research Group (Manolio TA, Rodriguez LL, Brooks L, et al.). New models of collaboration in genome-wide association studies: the Genetic Association Information Network. Nat Genet 2007; 39:1045–1051.
67. Brazma A, Hingamp P, Quackenbush J, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 2001; 29:365–371.
68. Brazma A, Parkinson H, Sarkans U, et al. ArrayExpress: a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 2003; 31:68–71.
69. Ioannidis JPA, Trikalinos TA, Polyzos N. Selective discussion and transparency for data and processing in microarrays research findings for cancer outcomes. Eur J Cancer (in press).
70. Larsson O, Sandberg R. Lack of correct data format and comparability limits future integrative microarray research. Nat Biotechnol 2006; 24:1322–1323.
71. Ioannidis JP. Commentary: grading the credibility of molecular evidence for complex diseases. Int J Epidemiol 2006; 35:572–578.
72. Ioannidis JP. Why most published research findings are false. PLoS Med 2005; 2:e124.
73. Wacholder S, Chanock S, Garcia-Closas M, et al. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J Natl Cancer Inst 2004; 96:434–442.
74. Ntzani EE, Ioannidis JP. Predictive ability of DNA microarrays for cancer outcomes and correlates: an empirical assessment. Lancet 2003; 362:1439–1444.
75. Simon R, Radmacher MD, Dobbin K, et al. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003; 95:14–18.
76. Ein-Dor L, Zuk O, Domany E. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc Natl Acad Sci USA 2006; 103:5923–5928.
77. Chanock SJ, Manolio T, Boehnke M, et al. Replicating genotype-phenotype associations. Nature 2007; 447:655–660.
78. Ioannidis JP, Trikalinos TA. An exploratory test for an excess of significant findings. Clin Trials 2007; 4:245–253.
17
Models of Absolute Risk: Interpretation, Estimation, Validation, and Application

Mitchell H. Gail
Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, U.S.A.
INTRODUCTION

Definition and Uses of Absolute Risk

Absolute risk is the chance that an individual with specific risk factors and who is free of the disease of interest at age a will be diagnosed with it in a defined age interval (a, a + τ) of duration τ. Some investigators use the terms “crude risk” or “cumulative incidence” instead of absolute risk. We give a mathematical representation of absolute risk in the section “Estimating Absolute Risk Models from Various Types of Samples.” We will use invasive breast cancer as the disease of interest in this paper, but the ideas apply generally. For example, the breast cancer risk model of Gail and colleagues (1), as modified [“Model 2” in (2)] for the NCI Breast Cancer Risk Assessment Tool, predicts that a 50-year-old white woman who began menstruating at age 12, had a first live birth at age 31, had one previous breast biopsy without atypical hyperplasia, and had one first-degree relative with breast cancer would have an absolute risk of 2.3% of being diagnosed with breast cancer between ages 50 and 55 years and an absolute risk of 20.1% of being diagnosed between ages 50 and 90 years. Absolute risk is called “crude risk” in the competing risks literature (3) because it is reduced by the chance that the woman will die of non-breast cancer causes before developing invasive breast cancer. Sometimes, especially in the genetic epidemiologic literature, the “pure” risk is presented, which estimates the cumulative risk of disease if death from other causes were eliminated. This hypothetical construct seems less relevant to clinical management than crude risk. Pure risks are popular in part because they can be easily obtained from standard survival programs (e.g., 1 minus the Kaplan-Meier curve, with deaths treated as censoring events). Absolute risk increases with age at counseling, a, because the age-specific baseline hazard of invasive breast cancer increases with age. Absolute risk increases with the duration of the risk interval, τ, and absolute risk increases in the presence of specific risk factors, such as the number of affected first-degree relatives, that raise risk above baseline levels.
Absolute risk is useful in counseling a woman, because it allows her to put her actual risk of disease in perspective and to compare it to other health risks she might face. Absolute risk can be helpful in weighing the risks and benefits of a preventive intervention, such as tamoxifen, that reduces the risks of breast cancer and hip fracture but increases the risks of stroke, pulmonary emboli, deep vein thrombosis, and endometrial cancer. By weighing the net effects of tamoxifen on the absolute risks of these various health outcomes, one can gain insight into which types of women stand to gain most from tamoxifen, such as young women (who are at low risk of stroke and endometrial cancer) with high predicted invasive breast cancer risk (4). Absolute risk models can be used to design breast cancer prevention trials because the statistical power of such studies depends on the number of breast cancers that develop, which is proportional to the average absolute risk of the trial participants. Absolute risk models can also be used to assess the likely effects of interventions on the burden of disease in a population. For example, Freedman and colleagues (5) estimated that if all white women with a net favorable benefit/risk index took tamoxifen, nearly 30,000 incident invasive breast cancers could be prevented annually in the United States. All these applications pertain to the absolute risk of disease incidence. The concept of absolute risk is also very useful in the management of recently diagnosed patients. For example, the absolute risk of dying within 20 years from prostate cancer in a recently diagnosed man with a Gleason score of 2 to 4 is less than 5% (6); this fact influences clinical management. Somatic mutations and gene expression data may be useful in refining estimates of absolute risks of recurrence or of death in patients with newly diagnosed cancer. In this paper, however, we focus on the absolute risk of cancer incidence.

Data Sources and Types of Absolute Risk Models

In order to estimate risk factor–specific absolute risk, one needs follow-up data; case-control data alone are not sufficient. A representative prospective cohort provides direct data on absolute risk. If the risk factor is rare, however, such as a rare mutation, a huge number of individuals would need to be screened to identify a representative cohort of mutation carriers. If, in addition, the disease tends to occur in mid to late life and disease rates are modest or small, the cohort will need to be large, and long follow-up times will be required to observe the needed number of incident cases. One strategy is to use a retrospective cohort design in which members of a previously assembled cohort are traced to determine their health outcomes, with risk factors measured at baseline from previously stored information and blood for genotyping. Another strategy, described in the section “Estimating Absolute Risk Models from Various Types of Samples,” is to couple population-based case-control data with follow-up data from registries, such as the National Cancer Institute’s Surveillance, Epidemiology and End Results (SEER) Program, to estimate absolute risk. Even this strategy may require very large case-control studies to estimate absolute risk with the required precision if the genetic exposures are rare (7).
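To make the screening-effort point concrete, a back-of-envelope sketch follows, with purely hypothetical values for the carrier frequency, the absolute risk in carriers, and the number of incident cases required:

```python
# Hypothetical illustration: screening effort needed to observe a given
# number of incident cases among carriers of a rare mutation.
carrier_frequency = 0.002   # assumed: 1 carrier per 500 people screened
risk_in_carriers = 0.10     # assumed absolute risk over the follow-up period
events_needed = 50          # assumed number of incident cases for precision

n_screened = events_needed / (carrier_frequency * risk_in_carriers)
print(f"{n_screened:,.0f} people must be screened")  # 250,000
```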
The previous population-based sampling designs are ideal for estimating risks for men or women in the general population, but family-based designs may be preferred for two reasons: (1) family-based designs may be easier to conduct for rare mutations and (2) studies based on families with many affected members (“multiplex pedigrees”) may be more appropriate for counseling women from such families. This is true when familial aggregation of disease is caused by unidentified genetic, lifestyle, and environmental exposures in addition to the gene under study (“residual familial risk”). For example, the Breast Cancer Linkage Consortium assessed the penetrance of BRCA1 and BRCA2 mutations in multiplex pedigrees by analyzing markers linked to these genes conditional
on the disease status of pedigree members (8,9). Another family-based design is the kin-cohort design (7,10,11), in which randomly sampled case and control probands are genotyped and penetrance is estimated based on the disease histories of their first-degree relatives and the inferred distribution of genotypes in those relatives. The case-control family study extends data available in the kin-cohort design to include covariate information and genotypes in the first-degree relatives of case and control probands (12). As we shall discuss, estimates of absolute risks from such family studies will only agree with those from population-based studies if the analysis takes ascertainment into account and also correctly models residual sources of familial risk, apart from the mutation under study. In particular, estimates of relative risks obtained by conditioning on phenotypes within a family are family-specific; that is, they address the relative risk conferred by a genotype within a family. If risks vary across families for reasons not associated with the mutation under study, corresponding marginal relative risks for randomly selected mutation carriers and noncarriers in the general population will typically be smaller than family-specific relative risks. Likewise, estimates of relative risks from a population-based case-control study will typically be smaller than family-specific relative risks from case-control comparisons within a family (13,14).

Absolute risk models may differ because various types of data are used or because different approaches are used for modeling. Some models are empirical, whereas others rely on a genetic theory. The Gail model (1) for breast cancer is based on empirical modeling of relative odds in a logistic model. Family history data are confined to the number of affected first-degree relatives and its interaction with age at first live birth. The model does not include the numbers and ages of relatives, nor the ages at onset of breast cancer in affected relatives, but it does include information on age at menarche, age at first live birth, number of biopsies, and the presence of atypical hyperplasia on a biopsy. In contrast, the Claus model (15,16) and BRCAPRO (17,18) are based on a purely autosomal dominant model for breast cancer. Thus, more extensive family history data are used, and, in the case of BRCAPRO, information from assays for BRCA1 and BRCA2 in relatives or in the counselee can be incorporated. These models do not allow for residual sources of familial risk, apart from BRCA1 or BRCA2 mutations. More recent models, such as BOADICEA (19–21), which includes a polygenic component, and a model by Tyrer, Duffy, and Cuzick (22), which includes a dominant gene with low penetrance, allow for such residual familial risk.

ESTIMATING ABSOLUTE RISK MODELS FROM VARIOUS TYPES OF SAMPLES

Cohort Data

If a representative cohort is assembled to determine risk factors X, which may include genotypes or other risk factors, and if the cohort is followed prospectively to determine ages at breast cancer incidence and death from other causes, the analyst can estimate absolute risk in several ways. Letting h1(t) be the age-specific breast cancer hazard rate for a woman with all risk factors X at their lowest risk levels, rr{X(t)} be the relative risk of breast cancer for a woman with risk factor levels X(t) at age t, and h2(t) be the age-specific mortality rate from non-breast cancer causes, the absolute risk of breast cancer in the interval (a, a + τ) is

$$r(a, \tau, X) = \int_a^{a+\tau} rr\{X(t)\}\, h_1(t) \exp\left[-\int_a^t \left\{ rr\{X(u)\}\, h_1(u) + h_2(u) \right\} du \right] dt. \qquad (i)$$
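Equation (i) is straightforward to evaluate numerically once its components are specified. The following minimal sketch does so for hypothetical hazard functions; the functional forms of h1 and h2 and the constant relative risk are illustrative assumptions, not components of any fitted model:

```python
import numpy as np

def absolute_risk(a, tau, rr, h1, h2, step=0.01):
    """Numerically evaluate equation (i) on a grid of ages."""
    t = np.arange(a, a + tau, step)
    cause = rr * h1(t)                  # rr{X(t)} h1(t)
    total = cause + h2(t)               # hazard of breast cancer or competing death
    # exp of minus the integral of the total hazard from a to t (left Riemann sum)
    surv = np.exp(-(np.cumsum(total) - total) * step)
    return np.sum(cause * surv) * step

h1 = lambda t: 0.001 * (t / 50.0) ** 2          # hypothetical baseline incidence
h2 = lambda t: 0.002 * np.exp(0.08 * (t - 50))  # hypothetical competing mortality
print(absolute_risk(a=50, tau=5, rr=2.0, h1=h1, h2=h2))  # 5-year absolute risk
```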
Usually the relative risk is expressed as rr{X(t)} = exp{βX(t)}, where X(t) may include constant and age-varying covariates. Standard survival methods may be used to estimate h1(t), rr{X(t)}, and h2(t) from cohort data (23). We call the model in equation (i) a cause-specific relative hazards model, because it is expressed in terms of cause-specific hazards and corresponding relative risks. With cohort data, one can also model absolute risk in other ways. For example, Fine and Gray (24) used models such as

$$\log[-\log\{1 - r(a = 0, \tau, X = x)\}] = \gamma_0(\tau) + x\beta,$$

where $\gamma_0(\tau)$ is a known monotonic increasing function and r(a = 0, τ, X = x) is the absolute risk from age 0 to τ. The parameter β describes the effect of a unit increase in X on the absolute risk itself, rather than on the cause-specific hazards. A positive value of β may reflect an increase in the cause-specific risk of breast cancer, a decrease in the hazard from competing mortality, or both. Such models are hard to fit with censored data (24).

One way to reduce the screening effort required to assemble a cohort of women with a rare mutation is to recruit from “high risk” clinics attended by women with a strong family history of breast or ovarian cancer. The risk associated with a strong family history may reflect other unmeasured genetic factors, or familial environmental or behavioral factors, in addition to the gene under study. Even with prospective follow-up of the recruited women, such residual familial risk can lead to overestimates of absolute risk for mutation carriers in the general population, although these estimates might be appropriate for women in high-risk clinics. To be explicit, suppose that unidentified familial factors multiply the genotype(g)-specific hazard, h1g(t), of each member of a family by a positive random familial effect (“frailty”), b. The frailty has expectation

$$E(b) = \int b\, dG(b) = 1,$$

where G is the distribution of b. Then, if competing causes of death are not influenced by genotype g, the probability of remaining free of breast cancer at age t is proportional to

$$\int \exp\left\{-b \int_0^t h_{1g}(u)\, du\right\} dG(b).$$

It follows that the marginal hazard in the general population is

$$\bar{h}_{1g}(t) = \frac{\int h_{1g}(t) \exp\left\{-b \int_0^t h_{1g}(u)\, du\right\} b\, dG(b)}{\int \exp\left\{-b \int_0^t h_{1g}(u)\, du\right\} dG(b)}. \qquad (ii)$$

Thus, the average family- and genotype-specific hazard, E{b h1g(t)} = h1g(t), will not in general equal the marginal genotype-specific hazard in the general population, $\bar{h}_{1g}(t)$. Indeed, $\bar{h}_{1g}(t) \le h_{1g}(t)$, as can be demonstrated by recognizing that $\bar{h}_{1g}(t)$ is h1g(t) times $\int w(b)\, b\, dG(b)$, where the weights w(b) are positive and decreasing and achieve their maximum value of 1.0 at b = 0. Thus, $\bar{h}_{1g}(t)$ is less than or equal to h1g(t) times $\int b\, dG(b) = 1$.

An approach to shorten the time required for a cohort study is to reconstruct the cohort experience retrospectively. The retrospective cohort design has been used in occupational epidemiology to eliminate the need to wait for health outcomes. Ideally, a complete roster of the cohort is available from work records, and exposure data and
complete health outcome data are available from all cohort members, throughout the historical period of follow-up. Unfortunately, it is usually not possible to apply this ideal retrospective cohort paradigm in genetic epidemiologic studies. For example, although one may recruit women with breast cancer from lists of current clinic patients, it is usually not possible to reconstruct the entire cohort of women from whom these cases arose, let alone obtain complete follow-up and genotypes on members of this hypothetical retrospective cohort. Serious biases can arise from estimating absolute risks from imperfectly reconstructed retrospective cohorts. An extreme overestimate of absolute risk results from applying standard survival methods only to those members of a cohort who have experienced the health outcome (25), such as women recruited because they had breast cancer. Such analysis ignores the many women in the cohort who did not develop the disease.

Case-Control Designs

Studies Nested in Cohorts

A very attractive feature of the cause-specific relative hazard model for absolute risk in equation (i) is that its components can be estimated from different data sources, including case-control data. If cases are matched to controls in a nested case-control study within a cohort, survival methods are available to estimate the needed components (26). Likewise, the case-cohort design, in which cases are compared to members of a random subsample of the cohort (27), leads to convenient estimates of absolute risk (28,29).

Population-Based Case-Control Design

An increasingly used approach is to combine estimates of relative risks from case-control data with population registry data on age-specific cancer incidence and mortality to estimate absolute risk. If the case-control study is population-based, or if survey data on the joint distribution of risk factors in the general population are available, one can use relative risks from the case-control data to estimate age-specific population attributable risks, AR(t), for women aged t. If registry data are available on the age-specific hazard of breast cancer in the general population, $h_1^*(t)$, then the required baseline hazard can be computed from

$$h_1(t) = h_1^*(t)\{1 - AR(t)\}. \qquad (iii)$$
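As a concrete illustration of equation (iii), the short sketch below computes an attributable risk from a hypothetical population distribution of risk-factor strata and their relative risks, and then deflates a composite registry rate to a baseline hazard; all numbers are invented for illustration:

```python
import numpy as np

p = np.array([0.70, 0.25, 0.05])   # hypothetical population fractions of strata
rr = np.array([1.0, 1.8, 3.0])     # hypothetical relative risks vs lowest stratum

# If all strata share the baseline hazard h1(t), the composite rate is
# h1*(t) = h1(t) * E[rr], so AR(t) = 1 - 1/E[rr].
attributable_risk = 1.0 - 1.0 / np.dot(p, rr)
h1_star = 0.0026                   # hypothetical composite age-specific rate
h1_baseline = h1_star * (1.0 - attributable_risk)   # equation (iii)
print(round(attributable_risk, 3), h1_baseline)
```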
For example, Costantino and colleagues (2) describe how a modified “Gail Model 2” was developed. Researchers combined breast cancer incidence rates, $h_1^*(t)$, from the National Cancer Institute’s SEER Program, national mortality rates, h2(t), relative risks, rr{X(t)}, from a case-control study in the Breast Cancer Detection Demonstration Project (BCDDP) population, and data on the joint risk factor distribution in the general population from the Cancer and Steroid Hormone (CASH) Study (30) to estimate AR(t) and hence h1(t). With these ingredients, they computed absolute invasive breast cancer risk from equation (i).

Family-Based Case-Control Design

Sometimes relative risks are estimated by comparing cases and controls within a family, as in a discordant sibpair design. If the mutation is rare, incidence rates from the general population, $h_1^*(t)$, approximately equal rates for noncarriers, h1(t). Hence, it is not uncommon to estimate the genotype-specific hazard as $rr(X = g)\, h_1^*(t)$, where the relative
risk for genotype g, rr(X = g), has been estimated from a family-based case-control design. As mentioned previously, family-specific relative risks of this type can exceed population-based relative risks in the presence of familial effects apart from the gene under study. To illustrate, assume that 30% of families have b = 0.5, 50% have b = 1, and 20% have b = 1.75 in equation (ii), and assume h1g(t) is constant for carriers, g = 1, and noncarriers, g = 0. Let the family-specific relative risk be rr(X = 1) = h1g=1(t)/h1g=0(t) = 3.0. Then, equation (ii) yields (14) marginal relative risks $\bar{h}_{1g=1}(t)/\bar{h}_{1g=0}(t)$ as low as 2.22 as one varies t. Thus, family-specific relative risks may overestimate the risks of a mutation in the general population, and applying $rr(X = g)\, h_1^*(t)$ can lead to an overestimate of general population risks.

Family-Based Designs

In addition to the family-based case-control design just discussed, other family-based designs are used to estimate the absolute risk associated with mutations.

Kin-Cohort Design

A population-based strategy for estimating absolute risk is to obtain a random sample of cases and of controls from the population (called the “probands”) who agree to be genotyped and to provide information on the disease histories (phenotypes) of their first-degree relatives (10,11). The sample can be enriched in case probands, as long as case probands are a random sample of all cases and control probands are a random sample of all controls. This design was used to estimate the risk of breast cancer in Ashkenazi women with BRCA1 and BRCA2 founder mutations, but no formal sampling of case and control probands was attempted. The pure risk in carriers was estimated (10) as twice the cumulative pure risk in first-degree relatives of probands with mutations, about half of whom were expected to be mutation carriers, less the cumulative pure risk in first-degree relatives of probands without mutations, almost all of whom were expected to be free of mutation. To describe likelihood-based procedures for kin-cohort data, we let Yr be the vector of phenotypes for the relatives i = 1, 2, . . . , r. For time-to-event data, the components of this vector are (di, Ti), where di = 1 if the disease of interest occurred and 0 otherwise, and where Ti is the minimum of age at end of follow-up and age at disease incidence. The proband’s disease history is Y0, and the phenotypes of all family members are $Y = (Y_0, Y_1, \ldots, Y_r)^T = (Y_0, Y_r^T)^T$. The vector of genotypes is likewise $G = (G_0, G_r^T)^T$. To take ascertainment into account, one writes the likelihood for a family as

$$P(Y_r, G_0 \mid Y_0) = P(G_0 \mid Y_0) \sum_{G_r} P(G_r \mid G_0)\, P(Y_r \mid G_r, Y_0). \qquad (iv)$$
In equation (iv), the summation is over all joint genotypes of the relatives, and P(Gr | G0) is defined by Mendel’s laws. If each individual’s genotype is the only factor that influences his or her phenotype, then phenotypes are conditionally independent given G, and equation (iv) can be simplified because

$$P(Y_r \mid G_r, Y_0) = \prod_{i=1}^{r} P(Y_i \mid G_i). \qquad (v)$$
Under this “conditional independence” assumption, likelihood methods (7) applied to (iv) can be used to estimate the allele frequency and genotype-specific hazards, h1g(t). Under conditional independence, the family-specific and marginal hazards in the general population are equal.
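A minimal likelihood sketch may help fix ideas. The code below fits genotype-specific hazards to simulated kin-cohort data under the conditional-independence assumption of equations (iv) and (v), with several simplifying assumptions that are not part of the chapter's development: a rare autosomal dominant allele, constant hazards, first-degree relatives only, and Mendelian carrier probabilities of approximately 0.5 given a carrier proband and approximately 0 given a noncarrier proband.

```python
import numpy as np
from scipy.optimize import minimize

def surv_density(d, t, h):
    # P(Y_i | G_i) for a constant hazard h: h^d * exp(-h t)
    return (h ** d) * np.exp(-h * t)

def neg_loglik(log_h, d, t, proband_carrier):
    h0, h1 = np.exp(log_h)
    # Sum over the relative's unobserved genotype, as in equation (iv)
    p_carrier = np.where(proband_carrier, 0.5, 0.0)
    lik = p_carrier * surv_density(d, t, h1) + (1 - p_carrier) * surv_density(d, t, h0)
    return -np.sum(np.log(lik))

# Simulated relatives of genotyped probands, for illustration only
rng = np.random.default_rng(1)
n = 2000
proband_carrier = rng.random(n) < 0.5
relative_carrier = proband_carrier & (rng.random(n) < 0.5)
h_true = np.where(relative_carrier, 0.03, 0.005)
t_event = rng.exponential(1.0 / h_true)
t_cens = rng.uniform(40, 80, n)                 # censoring ages
t = np.minimum(t_event, t_cens)
d = (t_event <= t_cens).astype(float)

fit = minimize(neg_loglik, x0=np.log([0.01, 0.01]),
               args=(d, t, proband_carrier), method="Nelder-Mead")
print(np.exp(fit.x))   # estimates of (h0, h1), close to the true (0.005, 0.03)
```

Allowing flexible hazard shapes, unknown allele frequencies, or nonrandom ascertainment quickly leads back to the full likelihood (iv).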
Several biases can affect the estimates of the genotype-specific hazard from kin-cohort studies. If the tendency to participate as a proband and give blood for genotyping is greater in individuals who have several affected first-degree relatives than in individuals whose families have few or no affected relatives, estimates of h1g(t) will be too high, not only for mutation carriers but also for noncarriers (7,11). If the proband mistakenly reports disease in his or her relatives, hazards can be seriously overestimated, and if the proband neglects to report disease that has occurred, hazards will be underestimated (7,31,32). If the gene in question increases the risk of death from competing causes, the hazard for the disease of interest will be underestimated (13,33). This underestimation is even more severe if the risk of mortality following incidence of the disease of interest is higher in mutation carriers than noncarriers (13). (A version of this reference that includes figures to illustrate these points is available at http://www.springer.com/west/home/statistics/stats+life+sci?SGWID=4-10134-22-34953337-detailsPage=ppmmedia|erratum.) Several other factors can induce bias, including inaccuracy of asymptotic formulas with sample sizes that seem large but yield too little information (7,31,32) and failure of the reproducibility assumption, P(Yi | gi, g0) = P(Yi | gi) (13,34). Absolute risk in mutation carriers will be overestimated if one does not account for residual familial risk, especially if the kin-cohort study is enriched in case probands (7,12,14,32). The key ingredient in equation (iv), P(Yr | Gr, Y0), does not simplify to equation (v) in the presence of residual familial risk, because components of Yr are correlated with Y0 and with each other. In the special case that probands are a random sample from the source population, (Yr, Y0) can be regarded as a random sample from the population. Each of the r pairs (Yi, Y0) from a given family can also be regarded as randomly sampled, though not independent. Thus, for each family, a “composite likelihood”

$$P(g_0)\, P(Y_0 \mid g_0) \prod_{i=1}^{r} \sum_{g_i} P(Y_i \mid g_i)\, P(g_i \mid g_0)$$
can be constructed that yields unbiased estimates of the marginal absolute risk P(Yi | gi) (35). This elegant approach reduces, but does not eliminate, the upward bias in risk estimates for mutation carriers when case probands are overrepresented in the kin-cohort sample, as is usually the case. Begg (36) pointed out that overestimates of absolute risk from the mutation under study are to be expected when only case probands are used and when risk is heterogeneous in the population because of factors other than the gene under study. However, one can allow for residual familial risk in the model, for example, by including familial frailties, by including additional genetic components, such as a polygenic effect or an unmeasured Mendelian effect, or by using association parameters in copula models (12). Using such copula models, Chatterjee and colleagues (12) found that even kin-cohort designs in which all probands were cases yielded unbiased estimates of absolute risk if the correct model of residual familial risk (i.e., the correct copula model) was used for analysis. If only case probands were sampled, however, the results of this analysis were sensitive to misspecification of the copula model. Kin-cohort designs that sampled equal numbers of case and control probands yielded nearly unbiased absolute risk estimates that were much less sensitive to the choice of the copula model for residual familial risk; such balanced designs are therefore more robust to model misspecification. Chatterjee and colleagues noted that estimates of relative risk were more robust to misspecification of the copula model than estimates of absolute risk, especially if only case probands were sampled. For kin-cohort studies based on case probands only, Chatterjee and colleagues therefore recommended estimating relative risks, rr(g), with a copula model to
allow for residual familial risk, calculating attributable risk from the known allele frequencies and rr(g), and applying equation (iii) to population hazard rates to estimate

$$h_{1g}(t) = rr(g)\{1 - AR(t)\}\, h_1^*(t).$$

Note that unlike estimates of rr(g) obtained from family-based case-control data, the rr(g) from the kin-cohort design, based on an analysis that allows for residual familial risk, is not upwardly biased for the marginal relative risk and therefore applies to the general population.

Throughout this discussion we have assumed that the probands were genotyped. Without genotyping, Claus and colleagues analyzed breast cancer history data on relatives of population-based cases and controls in the CASH Study (15,16) by assuming an autosomal dominant model for genetic risk and no residual familial risk. Even before BRCA1 and BRCA2 had been identified, the results of this segregation analysis, based on P(Yr | Y0), yielded estimates of absolute risk for carriers that remain useful today. A disadvantage of this approach, compared with measuring the gene in probands, is that it is not possible to separate the risks from BRCA1 and BRCA2 from risks imparted by other unmeasured dominant mutations.

Other Family-Based Designs

The kin-cohort design is amenable to analysis because the ascertainment criteria (randomly selected case and control probands) are well understood, at least in principle. We have seen how the combination of residual familial risk and case-enriched proband ascertainment complicates the analysis of the kin-cohort design. These challenges are even more complex when pedigrees are recruited from high-risk clinics based on the fact that multiple family members are affected (“multiplex pedigrees”). Often the precise features that led to ascertainment of the family are not known. Estimation of absolute risk from a pedigree could be based on P(Y, G | A), where Y and G are the vectors of phenotypes and genotypes in the family and A is the ascertainment condition, which might depend in a complex way on Y. The quantity P(Y, G | A) is termed the “ascertainment corrected joint likelihood” by Kraft and Thomas (37), who also consider the “prospective likelihood,” P(Y | G, A), and the “retrospective likelihood,” P(G | Y, A) = P(G | Y); the last equality follows from the assumption that A is determined by Y alone. To avoid the difficulty of defining the ascertainment condition, analysts often use the retrospective likelihood. Use of P(G | Y) does not avoid the need to consider residual familial risks, however, because

$$P(G \mid Y) = P(Y \mid G)\, P(G) \Big/ \sum_G P(Y \mid G)\, P(G)$$

depends on P(Y | G), which, conditional on G, may include correlations among familial phenotypes that are induced by residual familial risk. Iversen and Chen (38) used P(Y, G | A) but assumed that A depended only on identified features of Y, such as the number of affected family members. Using external data, they estimated P(A) empirically, permitting inference from

$$P(Y, G \mid A) \propto P(Y, G)/P(A).$$

This approach does not avoid the need to consider residual familial risk, however. These issues are important when estimating the absolute risk in carriers of BRCA1 and BRCA2 mutations and in noncarriers. Data from a marker tightly linked to BRCA1 were obtained from the Breast Cancer Linkage Consortium (BCLC), consisting of
families with at least four cases of breast cancer diagnosed under age 60 or of ovarian cancer. The retrospective likelihood P(G | Y) was used, where the likelihood describes the marker pattern rather than BRCA1 mutations themselves. No attempt was made to model residual familial risk. The estimated pure breast cancer risk to age 70 was 0.85 in BRCA1 mutation carriers (8). A similar analysis of data from the BCLC (9) yielded an estimate of pure breast cancer risk to age 70 of 0.84 for carriers of BRCA2 mutations. In contrast, Antoniou and colleagues (39) analyzed data from 22 kin-cohort studies with case-only probands (female breast cancer cases in 16 studies, male breast cancer cases in two studies, and ovarian cancer cases in four studies). These probands were not selected on the basis of family history but only on the basis of their personal history of cancer. The combined data from these studies yielded an estimate of pure breast cancer risk to age 70 of 0.65 for BRCA1 mutation carriers and 0.45 for BRCA2 mutation carriers. Risks were somewhat higher in families whose proband had breast cancer than in families whose proband had ovarian cancer, as might be expected if residual familial breast cancer risk were present. Because no allowance had been made for residual familial risk, the authors noted that these risk estimates might overestimate risks in the general population, despite the fact that they were considerably lower than estimates from the BCLC. It is likely that these lower estimates are useful for counseling women who are considering being tested, or who have been tested, because only one or two relatives developed breast cancer or because a BRCA1 or BRCA2 mutation was found in a family with few affected members. The higher risk estimates found from the BCLC might be useful for counseling women in families with many affected members, but part of the risk in such families is likely due to other factors apart from BRCA1 or BRCA2 mutations.

One way to correct for residual familial correlation is to allow for a latent genetic effect in addition to the measured gene (19–22). Antoniou and colleagues (19–21) applied the likelihood (iv) to kin-cohort data from 1484 probands with breast cancer diagnosed under age 55 years and the retrospective likelihood P(G | Y) to 156 families containing two or more breast cancer cases, at least one of which was diagnosed under age 50. Letting k = 2, 1, or 0 index carriers of a BRCA2 mutation, carriers of a BRCA1 mutation, and noncarriers, respectively, and letting C denote a polygenic component with variance σ² (and hypergeometric parameter fixed at N = 3), Antoniou and colleagues modeled the genotype- and age-specific breast cancer hazard as

$$h_{1g=k}(t \mid C) = \lambda_k(t)\, rr_k(t) \exp\{C\}. \qquad (vi)$$
In this expression, λ1(t) and λ2(t) are age-specific breast cancer hazards for BRCA1 mutation carriers from the BCLC (8) and for BRCA2 mutation carriers from the BCLC (9), respectively, and rr1(t) and rr2(t) are relative risks that can change values over four age intervals for the BRCA1 and BRCA2 mutation carriers, respectively. For noncarriers, λ0(t) was a piecewise constant hazard and rr0 = 1. The model was constrained to fit the observed composite breast cancer incidence rates, $h_1^*(t)$, in England and Wales (1983–1987), which allowed the parameter estimates for λ0(t) to be expressed in terms of the remaining free parameter estimates. Thus, for breast cancer, there were four relative risk parameters for BRCA1, four relative risk parameters for BRCA2, two allele frequency parameters (one for BRCA1 and one for BRCA2), and one polygenic variance parameter, σ². In addition to these eleven parameters, five relative risk parameters were needed for similar modeling of ovarian cancer, resulting in a combined breast and ovarian model with 16 parameters estimated from the previously mentioned likelihoods [Table 5 in reference (20)]. Allowance for residual familial correlation in this model via σ², which was estimated as 1.291² = 1.667, may contribute to the different predictions of pure risk
of breast cancer and of the probability of carrying a mutation found with this model (BOADICEA) compared with autosomal dominant models that did not allow for residual familial risk (21).

Not only does inclusion of a parameter for residual familial correlation affect estimates of genotype-specific hazard rates, and hence estimates of pure risk and of mutation carrier probabilities based on family history, but such correlation parameters are also of intrinsic interest for estimating absolute risk. For example, suppose a counselee and her mother both tested negative for mutations of BRCA1 and BRCA2. The mother developed breast cancer at age tm = 45 years; the counselee has no sisters, and the family history is limited to first-degree relatives. The conditional distribution of Cm, the polygenic effect of the mother, given tm = 45, is

$$p_{C_m \mid T_m}(c_m \mid t_m) = \frac{\exp(c_m) \exp\{-\Lambda_0(t_m) \exp(c_m)\}\, \phi(c_m; 0, \sigma^2)}{\int \exp(c) \exp\{-\Lambda_0(t_m) \exp(c)\}\, \phi(c; 0, \sigma^2)\, dc}, \qquad (vii)$$

where $\phi(c_m; \mu, \sigma^2)$ is the normal density with mean μ and variance σ² and where $\Lambda_0(t)$ is the integral to age t of the baseline hazard λ0(t). Note that survival expressions involving ovarian cancer cancel from $p_{C_m \mid T_m}(c_m \mid t_m)$ and from the other quantities that we calculate below. The conditional distribution of the daughter’s polygenic effect, Cd, given Cm is normal with mean Cm/2 and variance 0.75σ², as follows from the fact (40) that the correlation between first-degree relatives is 0.5. The conditional density of Cd given tm is

$$p_{C_d \mid T_m}(c_d \mid t_m) = \int \phi(c_d; c_m/2, 0.75\sigma^2)\, p_{C_m \mid T_m}(c_m \mid t_m)\, dc_m. \qquad (viii)$$
The mean of Cd given tm is

$$E(C_d \mid t_m) = E(C_m/2 \mid t_m) = 0.5 \int c_m\, p_{C_m \mid T_m}(c_m \mid t_m)\, dc_m \equiv 0.5\, \mu_{C_m \mid T_m}, \qquad (ix)$$

and the conditional variance is

$$\mathrm{Var}(C_d \mid t_m) = \mathrm{Var}(C_m/2 \mid t_m) + 0.75\sigma^2 = 0.25 \int (c_m - \mu_{C_m \mid T_m})^2\, p(c_m \mid t_m)\, dc_m + 0.75\sigma^2. \qquad (x)$$
The unconditional hazard of breast cancer for the daughter at age td is

$$\bar{\lambda}_0(t_d) = \frac{\int \lambda_0(t_d) \exp(c) \exp\{-\Lambda_0(t_d) \exp(c)\}\, \phi(c; 0, \sigma^2)\, dc}{\int \exp\{-\Lambda_0(t_d) \exp(c)\}\, \phi(c; 0, \sigma^2)\, dc}, \qquad (xi)$$

whereas the corresponding conditional hazard given tm is

$$\bar{\lambda}_0(t_d \mid t_m) = \frac{\int \lambda_0(t_d) \exp(c) \exp\{-\Lambda_0(t_d) \exp(c)\}\, p_{C_d \mid T_m}(c \mid t_m)\, dc}{\int \exp\{-\Lambda_0(t_d) \exp(c)\}\, p_{C_d \mid T_m}(c \mid t_m)\, dc}. \qquad (xii)$$

To compute the desired familial relative risk (41),

$$FRR = \bar{\lambda}_0(t_d \mid t_m) / \bar{\lambda}_0(t_d), \qquad (xiii)$$

requires repeated numerical integration. If, however, breast cancer is rare in noncarriers, $\exp\{-\Lambda_0(t_d)\exp(c)\}$ can be set to unity, and

$$FRR \doteq \int \exp(c)\, p_{C_d \mid T_m}(c \mid t_m)\, dc \Big/ \int \exp(c)\, \phi(c; 0, \sigma^2)\, dc. \qquad (xiv)$$
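Before turning to the closed-form evaluation that follows, equation (xiv) can be checked by direct numerical integration. The sketch below does so under the rare-disease approximation (so the survival terms in equations (vii) and (viii) are set to unity), using σ² = 1.667 as estimated in the BOADICEA fit described earlier; the grid and integration scheme are arbitrary implementation choices:

```python
import numpy as np

sigma2 = 1.667                      # polygenic variance from the BOADICEA fit
c = np.linspace(-10, 10, 4001)
dc = c[1] - c[0]

def phi(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Equation (vii) with the survival term set to 1 (rare disease):
# p(c_m | t_m) is proportional to exp(c_m) * phi(c_m; 0, sigma^2)
p_cm = np.exp(c) * phi(c, 0.0, sigma2)
p_cm /= p_cm.sum() * dc

# Equation (viii): mix phi(c_d; c_m/2, 0.75 sigma^2) over the mother's density
p_cd = np.array([np.sum(phi(cd, c / 2.0, 0.75 * sigma2) * p_cm) * dc for cd in c])

num = np.sum(np.exp(c) * p_cd) * dc                   # numerator of (xiv)
den = np.sum(np.exp(c) * phi(c, 0.0, sigma2)) * dc    # denominator of (xiv)
print(num / den, np.exp(sigma2 / 2))                  # both approximately 2.30
```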
The denominator is exp(σ²/2), which follows from the moment generating function of φ(c; 0, σ²). Under the rare disease approximation, the numerator of equation (vii) has the kernel exp(cm)φ(cm; 0, σ²), which is that of a normal distribution with mean σ² and variance σ². Thus, the conditional mean and variance of Cd given tm are approximately σ²/2 and (σ²/4 + 3σ²/4) = σ², respectively. Given that $p_{C_m \mid T_m}(c \mid t_m)$ in equation (vii) is normal under the rare disease approximation, $p_{C_d \mid T_m}(c_d \mid t_m)$ in equation (viii) is also normal, because it is a normal mixture of normal distributions. Hence the numerator in equation (xiv) is exp(σ²/2 + σ²/2), and FRR ≐ exp(σ²/2) = exp(1.667/2) = 2.30. Thus, the fact that the mother had breast cancer more than doubles the hazard for the daughter, compared with the general population of noncarriers, even though both the mother and daughter are noncarriers. This is a consequence of residual familial risk. Claus and colleagues (42) noted that the relative risk associated with a history of breast cancer in a first-degree relative was 2.3, based on the population-based CASH case-control study. When they eliminated from the cases and controls all women whose estimated probability of carrying a mutation in BRCA1 or BRCA2 was more than 0.01, based on the BRCAPRO (17,18) algorithm for analyzing family history data, the relative risk from family history was only reduced to 2.1. This result indicates that BRCA1 and BRCA2 mutations accounted for only a small portion of the familial aggregation of breast cancer in the general population. Most of the familial aggregation in the general population is due to residual familial risk.

VALIDATION

Features that Can Be Validated with Case-Control Data

Before a model is widely used for counseling and the other applications described in the section “Definition and Uses of Absolute Risk,” it is highly desirable that the model be evaluated using independent data. Data from cases and controls can be used to assess certain features of the model, such as the relative risks rr{X(t)} in equation (i). Another feature that can be estimated from samples of cases and of controls is the concordance statistic, or area under the receiver operating characteristic curve (AUC), which is a measure of “discriminatory accuracy” (43,44). The AUC is the probability that the absolute risk estimated from a randomly selected case exceeds that of a randomly selected control. If all cases have high risks and all controls have low risks, the AUC is high, and there is said to be good discriminatory accuracy. To assess the importance of risk factors X(t) among women of similar age, an “age-specific AUC” can be computed by restricting the cases and controls to women aged 60 to 64, for example, at the beginning of follow-up. Most risk models, such as the Gail model, have modest age-specific AUC values, such as 0.6. Very strong risk factors must be included to increase the AUC (45). For example, Chen and colleagues (46) reported that adding mammographic density, a strong predictor of breast cancer, only increased the average age-specific AUC from 0.60 for the Gail model to 0.64 in the model with mammographic density. Some reports of models with higher AUC include age as one of the risk factors. Because age is a strong predictor of risk of cancer and of other diseases like heart disease, AUC values that are computed from cases and controls of all ages will be higher than age-specific AUC values, unless the cases and controls were matched on age. Relative risks and the AUC can also be estimated from cohort data.
In particular, the AUC can be estimated by comparing the estimated risks of those who develop disease with the estimated risks of those who do not develop disease.
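A minimal sketch of this comparison-based estimate, with hypothetical model-based risks for a handful of subjects, might look as follows; ties are counted as one half, which yields the usual concordance statistic:

```python
import numpy as np

def auc(risks_cases, risks_controls):
    # P(risk of a random case > risk of a random control), ties counted as 1/2
    diff = risks_cases[:, None] - risks_controls[None, :]
    return np.mean((diff > 0) + 0.5 * (diff == 0))

risks_cases = np.array([0.031, 0.012, 0.055, 0.020])            # hypothetical
risks_controls = np.array([0.010, 0.018, 0.009, 0.026, 0.015])  # hypothetical
print(auc(risks_cases, risks_controls))   # 0.80 for these invented values
```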
Features that Require Cohort Data for Validation

The term “calibration” describes the degree to which the expected number of cases (E) based on the absolute risk model agrees with the observed number of cases (O), over the entire population and in subgroups of the population. Cohorts are needed to obtain the observed counts, O. For example, Costantino and colleagues (2) computed the expected risks for each woman in the placebo arm of the Breast Cancer Prevention Trial, which was designed to determine whether tamoxifen prevented invasive breast cancer. Over the next five years, E = 159.0 invasive breast cancers were predicted by summing the individual risk estimates from Gail Model 2, and O = 155 were observed, yielding an E/O ratio of approximately 1.0. Similarly good calibration was found in other age groups and in other subgroups. Rockhill and colleagues (44) also found good calibration of Gail Model 2 in cohort data from the Nurses’ Health Study, but they noted its modest discriminatory accuracy. Cohort data can also be used to estimate measures of predictive accuracy, such as positive predictive value, negative predictive value, and classification accuracy. Gail and Pfeiffer (43) described these and other criteria for assessing models of absolute risk. They also described assessments based on a consideration of the clinical losses associated with true-positive, true-negative, false-positive, and false-negative outcomes. Cohort data are needed to assess expected losses. Gail and Pfeiffer concluded that models with modest discriminatory accuracy, such as Gail Model 2, can be valuable for assisting in a decision such as whether or not to take an intervention that has offsetting risks and benefits (e.g., tamoxifen), but that such models do not perform well for screening a general population to decide, for example, which women should get mammograms and which should not. Too many “low-risk” women will develop breast cancer for this to be a good strategy.

DISCUSSION

We defined absolute risk and distinguished it from the less clinically pertinent “pure” risk. Well-calibrated absolute risk models have important applications in counseling patients by giving them a realistic idea of their risk of disease. Such models can also assist a patient in deciding whether or not to take a preventive intervention that has risks and benefits, because the effects of the intervention on various health outcomes can be compared in terms of the changes induced by the intervention on their absolute risks. Absolute risk models have also been used to define eligibility criteria and compute power in designing intervention trials to prevent disease, and such models have been used to assess the likely consequences of a preventive intervention, such as tamoxifen, at the population level (5). The absolute risk of cause-specific mortality is also helpful in shaping clinical management after a disease develops. Absolute risk models have been advocated for advising women in their forties who are not being screened with mammography, and whose breast cancer risk equals or exceeds that of a 50-year-old woman, to consider beginning screening mammography (47). In general, however, models with modest discriminatory accuracy, such as the Gail model, should not be applied to a general population to decide who should and who should not have further diagnostic procedures or preventive interventions (43,44). Too many people judged to have modest or low risk will develop disease, and many people found to be at elevated risk would never develop disease.
In particular, one would not want to rely on a questionnaire-based absolute risk model with modest discriminatory
accuracy to decide who should be screened for elevated blood pressure or cholesterol or for colonic or breast neoplasia.

The available data influence the approach to estimating absolute risk. Cohort data give the most analytic flexibility for estimating absolute risk, but one can estimate absolute risk from equation (i) by combining relative risk and attributable risk information from case-control studies with population registry data on age-specific composite disease rates and death rates. Family-based designs are useful for estimating the absolute (and pure) risk from rare genetic variants. We have indicated the potential impact of residual familial risk in family-based studies. If residual familial risk is not allowed for, one tends to overestimate the absolute (and pure) risk associated with a measured mutation in the general population. Measures of residual familial risk also allow one to make better risk predictions by using not only the information on mutation status but also the residual predictive information in the family history, as illustrated at the end of the section “Family-Based Designs.” It is paradoxical that ignoring unmeasured genetic effects leads to underestimation of family-specific genetic relative risks (37,48) but to overestimation of the absolute risk in the general population for carriers of the measured mutation.

Considerable effort is being expended to identify genetic variants associated with breast cancer and other diseases. Genomewide association studies (49,50) and candidate gene approaches (51) have identified single nucleotide polymorphisms associated with breast cancer. The corresponding relative odds per minor allele are typically small, such as 1.2. Although such findings may yield insights into carcinogenesis and leads for prevention or treatment, it will be a challenge to use such information to improve the discriminatory accuracy of risk prediction models, because only strong risk factors can have much impact on discriminatory accuracy (45). A potential danger is the development of models based on exploratory data analyses covering many candidate genes and possibly including interactions with known exposures. Elaborate models of this type will typically “overfit” the data and appear to perform well in the original sample. It is especially important that such models be rigorously assessed with independent data.

One can question whether the concept of absolute risk is well defined for a mutation in a gene such as BRCA1. There are many variants in the gene, and a single absolute risk may not be appropriate. Moreover, other known factors (e.g., history of childbearing) and unknown factors (e.g., other unidentified genes) influence risk, and some of these factors may modify the risks associated with a mutation in BRCA1. Ideally, suitable data would be available on a variety of risk factors, such as reproductive history, results from breast biopsies, mammographic density, variants of single nucleotide polymorphisms associated with breast cancer, and highly penetrant mutations, such as those in BRCA1, to develop absolute risk models that account for the joint effects of such factors. Although progress is being made in determining whether reproductive and other factors act similarly in carriers of BRCA1 and BRCA2 mutations as in the general population, much remains to be learned.
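The limited payoff from common variants with per-allele relative odds near 1.2 can be made concrete. Under Hardy-Weinberg proportions and a multiplicative per-allele model (assumptions made here purely for illustration, with a hypothetical minor allele frequency), the AUC attainable from a single such SNP is barely above 0.5:

```python
import numpy as np

maf, per_allele_or = 0.30, 1.2      # hypothetical frequency and per-allele odds
q = np.array([(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2])  # control genotypes
w = q * per_allele_or ** np.arange(3)
p_case = w / w.sum()                # approximate genotype distribution in cases

auc = sum(p_case[i] * q[j] for i in range(3) for j in range(3) if i > j) \
      + 0.5 * sum(p_case[i] * q[i] for i in range(3))
print(auc)   # about 0.53
```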
The model by Tyrer, Duffy, and Cuzick (22) includes a variety of genetic and nongenetic risk factors, but the authors needed to make various assumptions about the joint effects of the risk factors. There remains a need for well-designed studies with information on genetic and various other risk factors on each subject. Such data would greatly facilitate the development and validation of models with multiple risk factors. If strong risk factors can be identified, such joint risk models have the potential to improve discriminatory accuracy.
REFERENCES

1. Gail MH, Brinton LA, Byar DP, et al. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. J Natl Cancer Inst 1989; 81(24):1879–1886.
2. Costantino JP, Gail MH, Pee D, et al. Validation studies for models projecting the risk of invasive and total breast cancer incidence. J Natl Cancer Inst 1999; 91(18):1541–1548.
3. Tsiatis AA. Competing risks. In: Armitage P, Colton T, eds. Encyclopedia of Biostatistics. 2nd ed. Chichester, England: John Wiley & Sons, 2005:1025–1035.
4. Gail MH, Costantino JP, Bryant J, et al. Weighing the risks and benefits of tamoxifen treatment for preventing breast cancer. J Natl Cancer Inst 1999; 91(21):1829–1846.
5. Freedman AN, Graubard BI, Rao SR, et al. Estimates of the number of US women who could benefit from tamoxifen for breast cancer chemoprevention. J Natl Cancer Inst 2003; 95(7):526–532.
6. Albertsen PC, Hanley JA, Fine J. 20-year outcomes following conservative management of clinically localized prostate cancer. JAMA 2005; 293(17):2095–2101.
7. Gail MH, Pee D, Benichou J, et al. Designing studies to estimate the penetrance of an identified autosomal dominant mutation: cohort, case-control, and genotyped-proband designs. Genet Epidemiol 1999; 16(1):15–39.
8. Easton DF, Ford D, Bishop DT, et al. Breast and ovarian cancer incidence in BRCA1-mutation carriers. Am J Hum Genet 1995; 56(1):265–271.
9. Ford D, Easton DF, Stratton M, et al. Genetic heterogeneity and penetrance analysis of the BRCA1 and BRCA2 genes in breast cancer families. Am J Hum Genet 1998; 62(3):676–689.
10. Struewing JP, Hartge P, Wacholder S, et al. The risk of cancer associated with specific mutations of BRCA1 and BRCA2 among Ashkenazi Jews. N Engl J Med 1997; 336(20):1401–1408.
11. Wacholder S, Hartge P, Struewing JP, et al. The kin-cohort study for estimating penetrance. Am J Epidemiol 1998; 148(7):623–630.
12. Chatterjee N, Kalaylioglu Z, Shih JH, et al. Case-control and case-only designs with genotype and family history data: estimating relative risk, residual familial aggregation, and cumulative risk. Biometrics 2006; 62(1):36–48.
13. Gail M, Chatterjee N. Some biases that may affect kin-cohort studies for estimating the risks from identified disease genes. In: Lin DY, Heagerty PJ, eds. Proceedings of the Second Seattle Symposium in Biostatistics: Analysis of Correlated Data. New York: Springer, 2004:175–187.
14. Gail M, Chatterjee N. Estimating the absolute risk of disease associated with identified mutations. In: Lin S, Zhao H, eds. Handbook on Analyzing Human Genetic Data: Computational Approaches and Software. New York: Springer (in press).
15. Claus EB, Risch N, Thompson WD. Genetic analysis of breast cancer in the Cancer and Steroid Hormone Study. Am J Hum Genet 1991; 48(2):232–242.
16. Claus EB, Risch N, Thompson WD. Autosomal dominant inheritance of early-onset breast cancer: implications for risk prediction. Cancer 1994; 73(3):643–651.
17. Berry DA, Iversen ES, Gudbjartsson DF, et al. BRCAPRO validation, sensitivity of genetic testing of BRCA1/BRCA2, and prevalence of other breast cancer susceptibility genes. J Clin Oncol 2002; 20(11):2701–2712.
18. Berry DA, Parmigiani G, Sanchez J, et al. Probability of carrying a mutation of breast-ovarian cancer gene BRCA1 based on family history. J Natl Cancer Inst 1997; 89(3):227–238.
19. Antoniou AC, Pharoah PDP, McMullan G, et al. Evidence for further breast cancer susceptibility genes in addition to BRCA1 and BRCA2 in a population-based study. Genet Epidemiol 2001; 21(1):1–18.
20. Antoniou AC, Pharoah PDP, McMullan G, et al. A comprehensive model for familial breast cancer incorporating BRCA1, BRCA2 and other genes. Br J Cancer 2002; 86(1):76–83.
21. Antoniou AC, Pharoah PDP, Smith P, et al. The BOADICEA model of genetic susceptibility to breast and ovarian cancer. Br J Cancer 2004; 91(8):1580–1590.
22. Tyrer J, Duffy SW, Cuzick J. A breast cancer prediction model incorporating familial and personal risk factors. Stat Med 2004; 23:1111–1130. Erratum in: Stat Med 2005; 24(1):156.
23. Prentice RL, Kalbfleisch JD, Peterson AV, et al. Analysis of failure times in presence of competing risks. Biometrics 1978; 34(4):541–554.
24. Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. J Am Stat Assoc 1999; 94(446):496–509.
25. Brookmeyer R, Gail MH. A method for obtaining short-term projections and lower bounds on the size of the AIDS epidemic. J Am Stat Assoc 1988; 83(402):301–308.
26. Langholz B, Borgan O. Estimation of absolute risk from nested case-control data. Biometrics 1997; 53:767–774. Erratum in: Biometrics 2003; 59(2):451.
27. Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 1986; 73(1):1–11.
28. Mark SD, Katki HA. Specifying and implementing nonparametric and semiparametric survival estimators in two-stage (nested) cohort studies with missing case data. J Am Stat Assoc 2006; 101(474):460–471.
29. Self SG, Prentice RL. Asymptotic distribution theory and efficiency results for case-cohort studies. Ann Stat 1988; 16(1):64–81.
30. Wingo PA, Ory HW, Layde PM, et al. The evaluation of the data-collection process for a multicenter, population-based, case-control design. Am J Epidemiol 1988; 128(1):206–217.
31. Gail MH, Pee D, Carroll R. Kin-cohort designs for gene characterization. J Natl Cancer Inst Monogr 1999; 26:55–60.
32. Gail MH, Pee D, Carroll R. Effects of violations of assumptions on likelihood methods for estimating the penetrance of an autosomal dominant mutation from kin-cohort studies. J Stat Plan Infer 2001; 96(1):167–177.
33. Chatterjee N, Hartge P, Wacholder S. Adjustment for competing risk in kin-cohort estimation. Genet Epidemiol 2003; 25(4):303–313.
34. Whittemore AS. Logistic regression of family data from case-control studies. Biometrika 1995; 82:57–67. Amendment in: Biometrika 1997; 84(4):989–990.
35. Chatterjee N, Wacholder S. A marginal likelihood approach for estimating penetrance from kin-cohort designs. Biometrics 2001; 57(1):245–252.
36. Begg CB. On the use of familial aggregation in population-based case probands for calculating penetrance. J Natl Cancer Inst 2002; 94(16):1221–1226.
37. Kraft P, Thomas DC. Bias and efficiency in family-based gene-characterization studies: conditional, prospective, retrospective, and joint likelihoods. Am J Hum Genet 2000; 66(3):1119–1131.
38. Iversen ES, Chen SN. Population-calibrated gene characterization: estimating age at onset distributions associated with cancer genes. J Am Stat Assoc 2005; 100(470):399–409.
39. Antoniou A, Pharoah PDP, Narod S, et al. Average risks of breast and ovarian cancer associated with BRCA1 or BRCA2 mutations detected in case series unselected for family history: a combined analysis of 22 studies. Am J Hum Genet 2003; 72(5):1117–1130.
40. Fisher RA. The correlation between relatives on the supposition of Mendelian inheritance. Trans R Soc Edinb 1918; 52:399–433.
41. Risch N. Linkage strategies for genetically complex traits. I. Multilocus models. Am J Hum Genet 1990; 46(2):222–228.
42. Claus EB, Schildkraut J, Iversen ES, et al. Effect of BRCA1 and BRCA2 on the association between breast cancer risk and family history. J Natl Cancer Inst 1998; 90(23):1824–1829.
43. Gail MH, Pfeiffer RM. On criteria for evaluating models of absolute risk. Biostatistics 2005; 6(2):227–239.
44. Rockhill B, Spiegelman D, Byrne C, et al. Validation of the Gail et al. model of breast cancer risk prediction and implications for chemoprevention. J Natl Cancer Inst 2001; 93(5):358–366.
45. Pepe MS, Janes H, Longton G, et al. Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol 2004; 159(9):882–890.
46. Chen JB, Pee D, Ayyagari R, et al. Projecting absolute invasive breast cancer risk in white women with a model that includes mammographic density. J Natl Cancer Inst 2006; 98(17):1215–1226.
47. Gail M, Rimer B. Risk-based recommendations for mammographic screening for women in their forties. J Clin Oncol 1998; 16(9):3105–3114. Erratum in: J Clin Oncol 1999; 17(2):740.
48. Pfeiffer RM, Gail MH, Pee D. Inference for covariates that accounts for ascertainment and random genetic effects in family studies. Biometrika 2001; 88(4):933–948.
49. Easton DF, Pooley KA, Dunning AM, et al. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 2007; 447(7148):1087–1093.
50. Hunter DJ, Kraft P, Jacobs KB, et al. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet 2007; 39(7):870–874.
51. Cox A, Dunning AM, Garcia-Closas M, et al. A common coding variant in CASP8 is associated with breast cancer risk. Nat Genet 2007; 39(3):352–358.
18
Reporting and Interpreting Results

Julian Little*
Department of Epidemiology and Community Medicine, University of Ottawa, Ottawa, Canada

*Canada Research Chair in Human Genome Epidemiology.
INTRODUCTION

Molecular epidemiology has been defined as the study of the distribution and determinants of disease in human populations using techniques of molecular biology and epidemiology (1). The field started to grow in the early 1980s (2). Studies in the 1980s were based primarily on markers of exposure to carcinogens, such as DNA and protein adducts, and then expanded into the analysis of somatic genetic alterations in tumor and precancerous tissues (1). Since then, there has been a great deal of work on genetic markers of susceptibility, initially focused on candidate genes and more recently on genomewide association studies. Also, since the 1980s, there has been the development of large-scale analyses of somatic mutations and a range of phenotypical assays, including gene expression arrays, genomic instability, and DNA repair. In some instances, molecular epidemiological investigation has added value to causal inference, for example, the clarification of the dose-response effects of aflatoxin exposure (1), the role of human papillomavirus infection in the etiology of cervical cancer (3), and the detection of protein and DNA adducts in workers exposed to chemicals such as ethylene oxide (1). However, in a worryingly high number of other instances, promising initial results have not been confirmed by subsequent investigations that have usually had greater statistical power to detect effects. Indeed, lack of replication became such an issue in genetic association studies (4–12) that it threatened to discredit the field. However, as research has matured, a somewhat greater proportion of associations with candidate gene variants has been replicated (11), and recently consistent findings have emerged from genomewide association studies (13–35).

A similar pattern of failure to confirm early promising findings has been observed in relation to prognostic markers (36). Indeed, the opening salvo of a commentary in 2005 was, “The number of cancer prognostic markers that have been validated as clinically useful is pitifully small, despite decades of effort and money invested in marker research” (37). This malaise in the field reflects broader concerns about observational epidemiology, with commentaries with titles such as “Epidemiology faces its limits” (38) and “Epidemiology: is it time to call it a day?” (39). These concerns have been
fuelled by results of observational studies being inconsistent (40); even when they are consistent, their results may not be borne out by the results of randomized controlled trials (41–45). In the cancer field in particular, there have been some crushing disappointments, notably in relation to beta-carotene (45,46) and hormone replacement therapy (47,48). Critical appraisals of observational studies in general have highlighted concerns about validity (41–44,49–51). Poor reporting has made it difficult to use empirical evidence to establish the methodological issues that are most critical to minimizing bias (40,51). In this chapter, therefore, we focus on the reporting of molecular epidemiological studies. As these studies tend to be interdisciplinary in nature, we also discuss the interpretation of study findings.

REPORTING

The adequate reporting of molecular epidemiological studies of cancer and other human diseases and of prognosis is important for (i) assembling empirical evidence regarding methodological biases that might affect this type of study, and thereby helping to improve study design and conduct in the longer term; (ii) minimizing the potential problems of selective reporting and publication bias; and (iii) facilitating the synthesis of knowledge. A diversity of study designs has been used to establish a portfolio of evidence (see chaps. 1–4). With regard to randomized controlled trials, the Consolidated Standards for Reporting Trials (CONSORT) have been developed and endorsed widely (52,53), and have been found to improve the reporting of such trials (54). These standards comprise a 22-item checklist and flow diagram. It is noteworthy that the development of CONSORT was stimulated by earlier concerns about poor quality of medical research (55), and poor reporting was found to be associated with biased estimates of effects (56). Statements about the reporting of nonrandomized evaluations of behavioral and public health interventions [Transparent Reporting of Evaluations with Nonrandomized Designs (TREND)] (57), diagnostic studies [Standards for Reporting of Diagnostic Accuracy (STARD)] (58), and microarray studies [Minimum Information About Microarray Experiments (MIAME)] (59) have adopted similar principles. Using a similar approach, reporting guidance for cross-sectional, case-control, and cohort studies [STrengthening the Reporting of OBservational studies in Epidemiology (STROBE)] has recently been developed (60–65). Of note is the emphasis on strengthening reporting, as distinct from developing reporting standards (66,67) and as distinct from focusing on how research should be done, as this might stifle methodological innovation (68). Of particular relevance to molecular epidemiology, it has been suggested that future versions of the STROBE guidance should include the consideration of incubation periods for risk factors and diseases, biological plausibility, and clear definition and presentation of results on host factors (69). Although several commentaries on the conduct and/or appraisal of genetic association studies have been published that cover issues in reporting (4,70–97), their recommendations differ. For example, some papers suggest that replication of findings should be part of any publication (4,70,73,74,80,84,92–94), while others consider this suggestion unnecessary or even unreasonable, such as when a novel hypothesis is tested in a large, well-conducted study (78,98–102).
In many cases, the guidance has focused on the conduct of genetic studies rather than on their reporting (70–72,74,76,77,79,80,83,88–90,93,94), or has focused on association studies for specific diseases (71,72,74,76,79,80,83,84,89–96). Despite increasing recognition of these problems, the quality of reporting of genetic association studies is not optimal (103–106). Therefore, guidance aimed at STrengthening the REporting of Genetic Associations (STREGA) is being developed.
As there is a strong dependence on general epidemiological principles in genetic association studies, this guidance is being developed as an extension of the STROBE guidance (STROBE Extension to Genetic Association Studies, STREGA) (www.hugenet.ca).

Following a joint National Cancer Institute–European Organization for Research and Treatment of Cancer (NCI-EORTC) meeting on cancer diagnostics in 2000, a working group was charged with addressing statistical issues of poor design and analysis and with the reporting of tumor marker prognostic studies. This group developed a list of 20 "REporting recommendations for tumor MARKer prognostic studies" (REMARK), which has been published in eight journals (36,107–113). As in CONSORT and the other guidelines that have adopted similar principles, these guidelines specify the information that should be provided about study design, a priori hypotheses, patient and specimen characteristics, assay methods, and methods of statistical analysis.

Publication Bias

Publication bias is the selective publication of studies based on the magnitude and direction of their findings (114). Research with statistically significant results is more likely to be submitted and published than work with null or nonsignificant results (115), and this bias has led to a preponderance of potentially spurious results in the literature (116). Publication bias is therefore a potentially serious problem for the integration of evidence in molecular epidemiology, and the problem may be even more prominent for more complex analyses, such as those encountered in relation to gene-environment and gene-gene interactions (117).

One method of controlling the problem of potential publication bias is to establish searchable registries of research at the stage of its initiation. Such a registry has been established centrally for randomized controlled trials (the International Standard Randomized Controlled Trial Number register) (www.controlledtrials.com). However, in view of the diversity of researchers involved in molecular epidemiological studies and of the settings in which these studies may be implemented, such a central registry may be less feasible. The challenges of managing research registries on studies of genetic associations and related interactions have been noted (118), and efforts are being made to address these by developing registries of studies or investigators through a network of networks, each with a defined structure (119).

Selective Reporting

A further issue is the selective reporting of the results of studies. There is empirical evidence of selective reporting in randomized controlled trials. Comparisons of the primary outcomes defined in trial protocols with those defined in published articles have shown that the reporting of trial outcomes is frequently incomplete and inconsistent with protocols, with statistically significant efficacy outcomes more likely to be reported than nonsignificant ones (120–122). In the field of pharmacogenetics, an evaluation of studies of genetic variation in responses to β2-agonist therapy in asthma identified almost 500 potential associations on which investigators could have reported, based on different endpoints, times of assessment, types of intervention, and genetic contrasts (123). Each study reported only a small minority of the potential associations, and it is not known how many more potential associations had been examined but not reported.
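The distorting effect can be made concrete with a hypothetical simulation (illustrative parameters only, not a reanalysis of the studies cited): when a study tests many truly null associations but writes up only those reaching p < 0.05, every reported "finding" is spurious and its apparent effect size is inflated.

```python
# Hypothetical simulation of selective reporting: a study tests many
# null associations but reports only those reaching p < 0.05; the
# reported effects are all spurious and their sizes are inflated.
import random, statistics

random.seed(1)
reported = []
for _ in range(500):                      # 500 candidate associations, all truly null
    a = [random.gauss(0, 1) for _ in range(30)]   # outcome in carriers
    b = [random.gauss(0, 1) for _ in range(30)]   # outcome in noncarriers
    diff = statistics.mean(a) - statistics.mean(b)
    se = (statistics.variance(a) / 30 + statistics.variance(b) / 30) ** 0.5
    if abs(diff / se) > 1.96:             # only "significant" results get written up
        reported.append(abs(diff))

print(len(reported))                        # roughly 25 spurious "findings" (about 5%)
print(round(statistics.mean(reported), 2))  # mean |difference| near 0.6 SD, though the truth is 0
```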
It might be expected that selective reporting would become an increasingly important problem with the application of array technologies in molecular epidemiological studies, and thereby increase the prevalence of spurious results.
However, experience with genome-wide association studies suggests that the problem of selective reporting may be addressed implicitly, because information on all genetic variants is collected concurrently and the entire database can be made available online (124). Many of the genome-wide association studies are likely to be identified as a result of the National Human Genome Research Institute's Genetic Association Information Network (GAIN) initiative (125). Moreover, the National Institutes of Health (NIH) recently announced a policy for sharing of data obtained in NIH-supported or NIH-conducted genomewide association studies (http://grants.nih.gov/grants/guide/notice-files/NOT-OD-07-088.html). While such mechanisms will facilitate data sharing and mitigate the potential problem of selective reporting, a potential concern about centralization of data in one site is that updates, corrections, and methodological issues may receive less attention than would be the case if these issues were the responsibility of the primary investigators and the purpose of the registry were to provide links to the investigators.

The concept of data sharing has also been developed for studies based on gene expression microarrays, with a condition of publication being that data be made available (59). It would seem reasonable to extend this concept to other types of molecular epidemiological studies. In addition to helping address the issue of selective publication, it would facilitate meta-analysis of individual subject data by enabling standardization of marker values, endpoint definitions, and eligibility criteria (37).

INTERPRETATION

Issues of interpretation arise at two levels at least: first, in single studies, and second, across studies. In addition, they divide broadly into the exclusion of noncausal explanations, the overall quantity and quality of the evidence base, consideration of possible publication and selective reporting biases, and causal inference.

Exclusion of Noncausal Explanations in Single Studies

The ability to exclude noncausal explanations depends in part on the quality of reporting of a study, although, in the context of randomized controlled trials, poorly reported methods do not necessarily mean that the methods themselves were poor (126). For randomized controlled trials, a systematic review of quality assessment concluded that evidence-based components such as allocation concealment and double blinding should be used to assess quality, and that topic-specific items should be part of the assessment (127). A potential problem with topic-specific quality assessment of trials in the molecular epidemiology field is the paucity of evidence specific to the field. Less attention has been given to tools for assessing the quality of observational studies (128). In commentaries about STROBE, it is notable that one of the authors emphasized that these reporting guidelines do not constitute an instrument to evaluate the quality of research (67), and it has been suggested that the authors of STROBE should expressly discourage the use of the guideline for the evaluation of studies or study results, and that "the blindly applied rule" should not "trump the creative exception" (66). Papers dealing with the critical appraisal of observational studies in general structure their discussion around the issues of assessing potential bias, confounding, and the role of chance (41–44,49,50).
With regard to molecular epidemiological studies in general, three issues may need emphasis. First, studies should consider the timing of assessment of exposure. For example, folate may be efficacious in the primary prevention of colorectal neoplasia, but supplemental folic acid increases the risk of advanced and multiple adenomas in persons who have had an adenoma removed (129,130). Second, the validity of surrogate markers of outcome is generally uncertain (131,132). Third, multiple testing and the prestudy odds of a true finding should be assessed. It would be useful to interpret the results in the context of how many biomarkers have been studied. However, corrections for multiple testing are not routinely adopted or accepted. Moreover, it is often difficult or impossible to determine the possible extent of selective reporting: investigators may test associations with a very large number of biomarkers, but may mention only the most promising ones.

It has been suggested that it would be useful to provide an estimate of the prestudy odds for an association, since this is key for interpreting the credibility of a "positive" finding (102,133,134). It makes a tremendous difference whether an association has been found as part of massive unselected screening of 100,000 polymorphisms (in which case the prestudy odds are less than 1:1000) or in a highly targeted test of a hypothesis supported by other data and a specific line of reasoning (in which case the prestudy odds will be much higher). Wacholder et al. (133) proposed the assessment of the false-positive report probability, calculated on the basis of the prior probability that a gene-disease association is real, the statistical power, and the observed p value. In addition, they suggested that the stringency of this probability should depend in part on the magnitude of the negative consequences of potentially incorrect decisions. For example, it might be less stringent for rare diseases or small initial studies, but more stringent for large studies or pooled analyses that attempt to be more definitive evaluations. One problem with this suggestion is that it would make integration of evidence from different studies very difficult. Other issues include the problem of false negatives (a false-negative report probability can be considered) (133) and possible overemphasis on controlling the false-positive rate (135).
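As a concrete illustration (a minimal sketch with hypothetical inputs, not a reanalysis of any study cited here), the false-positive report probability can be written as FPRP = α(1 − π)/[α(1 − π) + (1 − β)π], where π is the prior probability that the association is real, 1 − β is the statistical power, and α is the significance level attained:

```python
def fprp(prior, power, alpha):
    """False-positive report probability in the sense of Wacholder et al.:
    the probability that a 'positive' finding is false, given the prior
    probability that the association is real, the study's power, and the
    significance level (or observed p value)."""
    return (alpha * (1 - prior)) / (alpha * (1 - prior) + power * prior)

# Agnostic screening of 100,000 SNPs (prior ~ 1/10,000) versus a targeted
# candidate-gene hypothesis (prior ~ 1/10), both with 80% power at p = 0.05:
print(round(fprp(prior=0.0001, power=0.8, alpha=0.05), 3))  # ~0.998: almost surely false
print(round(fprp(prior=0.1,    power=0.8, alpha=0.05), 3))  # ~0.36: far more credible
```

The contrast between the two calls mirrors the prestudy-odds argument above: the same p value carries very different credibility depending on the prior.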
Many commentaries have been published on genetic association studies (70–72,74,76,77,79,80,83,88–90,93,94). In addition to the question of multiple testing already discussed, particular issues that have been flagged include the possible effects of population stratification, genotyping error, and departure from Hardy-Weinberg equilibrium.

Population Stratification

Population stratification is the presence within a population of subgroups among which allele/genotype frequencies and disease risks differ. When the groups compared in a study differ in the proportions of these subgroups, an association between the genotype and the disease under investigation may reflect the fact that the genotype is a marker for population subgroup rather than being causal. Population subgroup is a confounder in this situation, as it is associated with both genotype frequency and disease risk. There has been vigorous debate about the potential implications of population stratification for the validity of genetic association studies (136–150). Modeling the possible effect of population stratification (without adjusting for it) suggests that the effect is likely to be small in most situations (142,143,145–147). Meta-analyses of 43 gene-disease associations comprising 697 individual studies show consistent associations across groups of different ethnic origin (147), and so provide evidence against a large effect of population stratification, hidden or otherwise. However, as studies of association and interaction typically address moderate or small effects and hence require large sample sizes, even a small bias arising from population stratification may be important (148). Design methods (case-family control studies) and statistical methods (151) have been proposed to address population stratification, but so far few studies have used them. Most genomewide association studies to date, however, have either used family-based designs or used strict methods to control for stratification, such as genomic control and principal components analysis (16,152); this is probably essential for excluding bias when the identified genetic effects are very small (odds ratio <1.20).
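The confounding mechanism can be made concrete with a hypothetical two-subgroup example (illustrative counts only): the variant is unrelated to disease within each subgroup, yet the pooled analysis suggests an association because the subgroups differ in both allele frequency and baseline disease risk.

```python
# Hypothetical illustration of confounding by population subgroup:
# within each subgroup the variant is unrelated to disease, but the
# subgroups differ in both carrier frequency and disease risk, so the
# pooled (crude) odds ratio departs from 1.
def odds_ratio(cases_carrier, cases_noncarrier, controls_carrier, controls_noncarrier):
    return (cases_carrier * controls_noncarrier) / (cases_noncarrier * controls_carrier)

# Subgroup A: high carrier frequency, disease common; OR = 1 within stratum.
a = dict(cc=160, cn=40, kc=800, kn=200)   # cc/cn: carrier/noncarrier cases; kc/kn: controls
# Subgroup B: low carrier frequency, disease rare; OR = 1 within stratum.
b = dict(cc=10, cn=90, kc=100, kn=900)

print(odds_ratio(a["cc"], a["cn"], a["kc"], a["kn"]))   # 1.0
print(odds_ratio(b["cc"], b["cn"], b["kc"], b["kn"]))   # 1.0
pooled = odds_ratio(a["cc"] + b["cc"], a["cn"] + b["cn"],
                    a["kc"] + b["kc"], a["kn"] + b["kn"])
print(round(pooled, 2))                                  # ~1.6: spurious crude association
```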
Genotyping Error

In a 2005 commentary on the possible causes and consequences of genotyping error, it was observed that although an increasing number of researchers were aware of the difficulty, the effect of such error had largely been neglected (153). The extent of genotyping error has been reported to vary between about 1% and 30% (153–156). Nondifferential genotyping error will usually bias associations toward the null (157,158). The most marked bias occurs when genotyping sensitivity is poor and genotype prevalence is high (>85%), or when genotyping specificity is poor and genotype prevalence is low (<15%) (157). When exposure measurement error is substantial, genotyping errors of the order of 3% can lead to substantial underestimation of an interaction effect (159). When there are systematic differences in genotyping according to outcome status (differential error), serious bias in any direction may occur. While unblinded assessment may lead to differential misclassification, in genomewide association studies of single nucleotide polymorphisms (SNPs), shifts between cases and controls in the point clusters corresponding to each genotype, thought to be due to differences in DNA processing, have been reported (160); this is an expected consequence of violating the case-control principle that cases and controls should be handled identically. In this situation, even using blinded samples to determine the parameters for allele calling could lead to differential misclassification.
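The bias toward the null from nondifferential error can be shown with a small worked example (hypothetical sensitivity, specificity, and carrier frequencies, not values from the studies cited): applying the same imperfect assay to cases and controls attenuates the odds ratio.

```python
# Hypothetical sketch of nondifferential genotyping error biasing an
# odds ratio toward the null. True carrier frequencies: 30% in cases,
# 20% in controls (true OR ~ 1.71); the same imperfect assay
# (sensitivity 0.95, specificity 0.95) is applied to both groups.
def observed_carrier_freq(true_freq, sensitivity, specificity):
    # Observed "carrier" calls = true carriers correctly called
    # plus noncarriers falsely called carriers.
    return true_freq * sensitivity + (1 - true_freq) * (1 - specificity)

def odds(p):
    return p / (1 - p)

true_or = odds(0.30) / odds(0.20)
p_cases = observed_carrier_freq(0.30, 0.95, 0.95)      # 0.32
p_controls = observed_carrier_freq(0.20, 0.95, 0.95)   # 0.23
observed_or = odds(p_cases) / odds(p_controls)

print(round(true_or, 2))      # 1.71
print(round(observed_or, 2))  # ~1.58: attenuated toward 1
```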
Departure from Hardy-Weinberg Equilibrium

There is a divergence of views as to whether testing for departure from Hardy-Weinberg equilibrium is a useful method of detecting errors or peculiarities in the data. In particular, it has been suggested that deviation from Hardy-Weinberg equilibrium may be a sign of genotyping error (161–163). However, the statistical power to detect such errors by testing for departure from Hardy-Weinberg equilibrium is low, and in hypothetical data the presence of Hardy-Weinberg equilibrium was, in general, not altered by the introduction of genotyping error (164). Furthermore, the assumptions underlying Hardy-Weinberg equilibrium, including random mating, lack of selection according to genotype, and absence of mutation or gene flow, are rarely met in human populations (165,166). In 5 of 42 gene-disease associations assessed in meta-analyses of almost 600 studies, studies in which Hardy-Weinberg equilibrium was violated gave significantly different results from Hardy-Weinberg equilibrium-conforming studies (167). Moreover, that study suggested that exclusion of Hardy-Weinberg equilibrium-violating studies may result in loss of the statistical significance of some postulated gene-disease associations, and that adjustment for the magnitude of deviation from Hardy-Weinberg equilibrium may have the same consequence for some other gene-disease associations. Empirical assessments have found that 20% to 69% of genetic associations were reported with some indication of conformity with Hardy-Weinberg equilibrium, and that among some of these, there were limitations or errors in the assessment of Hardy-Weinberg equilibrium (163).
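For a biallelic marker, the usual check compares observed genotype counts with those expected under Hardy-Weinberg proportions using a one-degree-of-freedom chi-square statistic; a minimal sketch with hypothetical counts follows (exact tests are often preferred in practice):

```python
# Minimal chi-square test for departure from Hardy-Weinberg equilibrium
# in a sample of controls, given genotype counts for a biallelic SNP.
def hwe_chisq(n_aa, n_ab, n_bb):
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # frequency of allele A
    q = 1 - p
    expected = (n * p * p, n * 2 * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: a statistic above 3.84 would suggest departure
# from Hardy-Weinberg proportions at the 5% level (1 df); here it is
# ~0.1, consistent with equilibrium.
print(round(hwe_chisq(300, 500, 200), 2))
```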
Exclusion of Noncausal Explanations Across Studies

Quality assessment of the individual studies summarized in a systematic review is necessary to limit bias in conducting the review, gain insight into potential comparisons, and guide the interpretation of findings (168).

Overall Quantity and Quality of the Evidence Base

This is clearly important not only in interpretation across studies but also in the context of individual studies, as it facilitates identifying the contribution of a single study to the evidence base. Results from small or few studies should be interpreted with great caution, as it is common to see dissipation of early claimed effects (169), and early studies have no predictive power for the subsequent picture of the evidence (170). Efforts have been made to grade the overall quantity and quality of evidence on gene-disease associations (171–173).

Consideration of Possible Publication and Selective Reporting Biases

As discussed above, publication bias is potentially a serious problem for the integration of evidence. One method of minimizing the potential impact of publication bias is to identify studies through the "gray literature," which includes conference proceedings, books, abstracts, technical reports, and journals that may not be identified by electronic searches. However, a caveat in using some types of gray literature is that the material may not be peer reviewed, may be subject to modification and revision, and may provide insufficient information on study methods to assess the risk of bias. Evidence from randomized trials suggests that inclusion of gray literature tends to move the treatment effect toward a null result, but the direction of the effect is not always predictable (174).

Causal Inference

The most frequently quoted guidelines for inferring causation from observational studies of associations between exposures and disease were proposed in the 1960s (175,176). Of note, these were guidelines and were not intended to be strict criteria (176). However, they are still of value as an aid to thinking about causality. In a review of the application of these guidelines in cancer epidemiology, consistency, strength of association, dose response, and biological plausibility were the most frequently used, in descending order (177). In the literature on epidemiological methods, the most often mentioned guidelines are, in descending order, strength of association, temporality, consistency, biological plausibility, dose response, and specificity (178). In the literature on gene-disease associations, consistency (replication) has received the greatest emphasis (7,8,11).

Consistency

A lack of consistency, formally identified by tests for heterogeneity of effect among studies, may be indicative of underlying errors and biases operating differentially among studies (152,173,179).
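Such tests can be made concrete with a small sketch: Cochran's Q and the derived I² statistic, computed here from hypothetical study-level log odds ratios and standard errors (illustrative values only):

```python
# Minimal heterogeneity check across studies: Cochran's Q and I²,
# computed from study log odds ratios and their standard errors.
import math

def heterogeneity(log_ors, ses):
    weights = [1 / se ** 2 for se in ses]                # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, log_ors)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_ors))
    df = len(log_ors) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0        # share of variation beyond chance
    return pooled, q, i2

# Hypothetical inputs: five studies of the same gene-disease association.
log_ors = [math.log(x) for x in (1.4, 1.1, 2.0, 0.9, 1.5)]
ses = [0.15, 0.20, 0.25, 0.18, 0.30]
pooled, q, i2 = heterogeneity(log_ors, ses)
print(round(math.exp(pooled), 2), round(q, 1), round(i2, 2))  # ~1.26, Q ~8.2, I² ~0.51
```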
Differences among studies in the distributions of host characteristics are one source of heterogeneity (69). For example, hormonal alterations can affect ligand binding, enzyme activity, gene expression, and the metabolic pathways influenced by gene expression, and the expression of some genes depends on age, sex, ethnicity, and other factors (180). In addition, some inconsistency among the results of gene-disease association studies may be secondary to variation among studies in the prevalence of interacting environmental factors that have not been assessed (181,182). Linkage disequilibrium varies among populations (183) and could therefore be a source of heterogeneity among studies of gene-disease associations if the genetic variant investigated is not itself causal. Finally, differences among populations in the prevalence or variability of biomarkers may result in differences among studies in the statistical power to detect both the main effect of a biomarker and interactions involving it.

Strength

As noted by Rothman (184), once methodological factors are accounted for, the strength of an association is not a biologically consistent feature but rather a characteristic that depends on the relative prevalence of other causes. Many of the genetic variants so far identified as influencing susceptibility to common diseases are associated with a low relative risk (172). However, for other biomarkers, it is possible that relative risks will increase as the biomarker becomes a more precise measure of the underlying cause, as was the case, for example, for the assessment of human papillomavirus infection in relation to cervical neoplasia (185).

Dose Response

The development of quantitative assays has been a challenge in molecular epidemiology (1). Thus, information on the validity of biomarkers is crucial in the assessment of dose-response relationships. In the context of gene-disease associations, the value of considering dose-response relations will depend on information about the functional effect(s) of the relevant gene. As already noted, in the particular instance of gene-environment interaction, when multiple categories of dose are defined for the exposure, many different dose-response models can be tested in the data, and tests for interaction can be applied to the trends across strata; consequently, false-positive results are likely to be a problem.
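The multiplicity problem can be illustrated with a hypothetical simulation (illustrative dose scorings and sample sizes only): applying several rival scorings to null data inflates the chance of at least one "significant" trend well above the nominal 5%.

```python
# Hypothetical simulation: trying several dose scorings on null data
# inflates the false-positive rate for a trend above the nominal 5%.
import random

random.seed(2)

def trend_z(cases, controls, scores):
    """Cochran-Armitage-style trend statistic across exposure strata."""
    n = [c + k for c, k in zip(cases, controls)]
    total_cases, total = sum(cases), sum(n)
    p = total_cases / total
    sbar = sum(s * ni for s, ni in zip(scores, n)) / total
    num = sum(s * c for s, c in zip(scores, cases)) - p * sum(s * ni for s, ni in zip(scores, n))
    var = p * (1 - p) * sum(ni * (s - sbar) ** 2 for s, ni in zip(scores, n))
    return num / var ** 0.5

scorings = [(0, 1, 2, 3), (0, 1, 1, 1), (0, 0, 0, 1), (0, 1, 4, 9)]  # rival dose models
hits = 0
for _ in range(2000):                       # null data: same risk in every stratum
    cases = [sum(random.random() < 0.2 for _ in range(250)) for _ in range(4)]
    controls = [250 - c for c in cases]
    if any(abs(trend_z(cases, controls, s)) > 1.96 for s in scorings):
        hits += 1
print(hits / 2000)                          # ~0.1, roughly double the nominal 0.05
```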
Temporality

Although a correct time relation is specified in many methodological texts, it seems to have been used relatively seldom in causal inference (177). However, with the development of large-scale cohort studies involving the collection of biological samples (186,187), increased attention is being paid to critical periods of exposure or gene expression. In the situation of gene-disease associations, the disease could influence the result of a phenotypic assay of the genotype under investigation; this should not be a problem with polymerase chain reaction (PCR)-based methods, which assay the genotype directly. If data were available on the time window of gene expression, it would be relevant to consider this in relation to the age specificity of gene-disease relations. As a perhaps extreme example, if an association existed between a type of cancer in infants and the CYP1A1 or CYP1A2 genotype of the index child, it would probably be indirect (e.g., reflecting an effect of maternal genotype), because the enzymes coded by these genes are not expressed in the fetal liver (188,189).
Specificity

Specificity may not be a useful consideration for complex exposures, such as tobacco smoking, that may influence several outcomes; for markers of folate status and homocysteine level that are associated with many outcomes; or for genetic variants, such as cytochrome P450 gene variants, that may influence the metabolism of a variety of exposures.
Experimental Support

At present, more emphasis seems to be placed on contrasts between the results of randomized controlled trials and observational studies than on support for observational findings from such trials (41–45). Interestingly, however, a comparison between the results of these two classes of study on the effect of aspirin on the long-term risk of colorectal cancer showed that only the observational studies that collected and reported detailed data on dose, frequency, and duration of aspirin use identified the same strong protective effect observed in long-term follow-up of randomized trials (190). This observation highlights the importance of precise exposure assessment (191). Interspecies comparisons of biomarkers of carcinogenesis, using experimental animals for which carcinogenicity data are available, have been used to support the association between 1,3-butadiene and the biomarkers used in human studies; hence, interspecies comparisons have been proposed as a more general method for assessing the validity of biomarkers (192). In the context of gene-disease associations, experimental support is most likely to be derived from studies of gene expression in knockout or other experimental animals, from in vitro data on gene function, or from clinical trials of interventions aimed at normalizing the function or level of a product regulated by the gene.

Biological Plausibility

Considerable importance is accorded by many commentators to biological data, including data on gene function and mechanisms, in making causal inferences about gene-disease associations (83,193–198). Mechanistic data have also been considered in other types of epidemiological studies (69). "Analogy" and "coherence" are considered to be variants of, or equivalent to, biological plausibility (177,199). Weed (200) noted that consideration of biological plausibility bridges the gap between epidemiological evidence and diverse forms of biological evidence, and that it is likely to become increasingly important in causal assessments as molecular epidemiology permits more precise measurement of intracellular causal effects; on the other hand, there has been concern that some form of mechanistic evidence might be identified and used selectively to reinforce an assertion of causality. Cardon and Bell (6) note that a biological argument can be constructed for virtually any associated allele because of the "relative paucity of current understanding of the mechanisms of action of complex trait loci." While candidate gene studies are often based on some biological knowledge of the candidate gene, genomewide linkage and association studies initially identify variants without regard to biological function. However, the absence of mechanistic evidence, or of evidence of high quality, would not preclude the conclusion that an association is causal if other guidelines for causation were satisfied. As knowledge of the genome is not complete, biological plausibility may not be apparent (98,100,198,201,202). However, substantial information about known gene function is recorded routinely in genome annotation databases and has been amassed through a variety of techniques applied to a targeted 1% of the human genome in the Encyclopedia of DNA Elements (ENCODE) project (203). Concerns with some of the biological data include the extent of replication and the extent to which bias can be excluded as an explanation for the results (173,201,204,205).
As has proved to be the case for trials and observational evidence in humans, it seems important to apply systematic review methods to this area, both to integrate evidence rigorously and to understand the impact of study methods on results (205).
CONCLUSION

There has been widespread recognition of the importance of transparency in the reporting of randomized controlled trials, and this is now being extended to observational studies, including molecular epidemiological studies. There is some debate about the extent to which checklists developed to aid reporting should be used in the appraisal of studies, in part because of concerns about stifling methodological innovation. Increased transparency of reporting will facilitate the integration of evidence. Mechanisms are needed to deal with possible publication bias and selective reporting of study results, and distributed models based on disease- or pathway-specific consortia of investigators may be a feasible means of achieving these aims. A further challenge is the development of tools to assemble, in a systematic fashion, the diverse types of mechanistic evidence that might complement molecular epidemiological evidence in the process of causal inference.

REFERENCES

1. IARC Working Group. Workshop report. In: Buffler P, Rice J, Baan R, Bird M, Boffetta P, eds. Mechanisms of Carcinogenesis: Contributions of Molecular Epidemiology. Lyon: International Agency for Research on Cancer, 2004:1–27.
2. Perera FP, Weinstein IB. Molecular epidemiology and carcinogen-DNA adduct detection: new approaches to studies of human cancer causation. J Chronic Dis 1982; 35:581–600.
3. Munoz N, Bosch FX, de Sanjose S, et al. Epidemiologic classification of human papillomavirus types associated with cervical cancer. N Engl J Med 2003; 348:518–527.
4. Nature Genetics. Freely associating (editorial). Nat Genet 1999; 22:1–2.
5. Gambaro G, Anglani F, D'Angelo A. Association studies of genetic polymorphisms and complex disease. Lancet 2000; 355:308–311.
6. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001; 2:91–99.
7. Ioannidis JPA, Ntzani EE, Trikalinos TA, et al. Replication validity of genetic association studies. Nat Genet 2001; 29:306–309.
8. Hirschhorn JN, Lohmueller K, Byrne E, et al. A comprehensive review of genetic association studies. Genet Med 2002; 4:45–61.
9. Tabor HK, Risch NJ, Myers RM. Opinion: candidate-gene approaches for studying complex genetic traits: practical considerations. Nat Rev Genet 2002; 3:391–397.
10. Editorial. In search of genetic precision. Lancet 2003; 361(9355):357.
11. Lohmueller KE, Pearce CL, Pike M, et al. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat Genet 2003; 33:177–182.
12. Gorroochurn P, Hodge SE, Heiman GA, et al. Non-replication of association studies: "pseudofailures" to replicate? Genet Med 2007; 9:325–331.
13. Parkes M, Barrett JC, Prescott NJ, et al. Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn's disease susceptibility. Nat Genet 2007; 39:830–832.
14. Todd JA, Walker NM, Cooper JD, et al. Robust associations of four new chromosome regions from genome-wide analyses of type 1 diabetes. Nat Genet 2007; 39:857–864.
15. Zeggini E, Weedon MN, Lindgren CM, et al. Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science 2007; 316:1336–1341.
16. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007; 447:661–678.
17. Diabetes Genetics Initiative of Broad Institute of Harvard and MIT, Lund University, and Novartis Institutes of BioMedical Research, Saxena R, Voight BF, et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 2007; 316:1331–1336.
18. Scott LJ, Mohlke KL, Bonnycastle LL, et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science 2007; 316:1341–1345.
19. Helgadottir A, Thorleifsson G, Manolescu A, et al. A common variant on chromosome 9p21 affects the risk of myocardial infarction. Science 2007; 316:1491–1493.
20. McPherson R, Pertsemlidis A, Kavaslar N, et al. A common allele on chromosome 9 associated with coronary heart disease. Science 2007; 316:1488–1491.
21. Easton DF, Pooley KA, Dunning AM, et al. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature 2007; 447:1087–1093.
22. Hunter DJ, Kraft P, Jacobs KB, et al. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet 2007; 39:870–874.
23. Stacey SN, Manolescu A, Sulem P, et al. Common variants on chromosomes 2q35 and 16q12 confer susceptibility to estrogen receptor-positive breast cancer. Nat Genet 2007; 39:865–869.
24. Gudmundsson J, Sulem P, Steinthorsdottir V, et al. Two variants on chromosome 17 confer prostate cancer risk, and the one in TCF2 protects against type 2 diabetes. Nat Genet 2007; 39(8):977–983.
25. Haiman CA, Patterson N, Freedman ML, et al. Multiple regions within 8q24 independently affect risk for prostate cancer. Nat Genet 2007; 39:638–644.
26. Yeager M, Orr N, Hayes RB, et al. Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat Genet 2007; 39:645–649.
27. Zanke BW, Greenwood CM, Rangrej J, et al. Genome-wide association scan identifies a colorectal cancer susceptibility locus on chromosome 8q24. Nat Genet 2007; 39:989.
28. Tomlinson I, Webb E, Carvajal-Carmona L, et al. A genome-wide association scan of tag SNPs identifies a susceptibility variant for colorectal cancer at 8q24.21. Nat Genet 2007; 39:984.
29. Haiman CA, Le Marchand L, Yamamoto J, et al. A common genetic risk factor for colorectal and prostate cancer. Nat Genet 2007; 39:954.
30. Klein RJ, Zeiss C, Chew EY, et al. Complement factor H polymorphism in age-related macular degeneration. Science 2005; 308:385–389.
31. Haines JL, Hauser MA, Schmidt S, et al. Complement factor H variant increases the risk of age-related macular degeneration. Science 2005; 308:419–421.
32. Edwards AO, Ritter R III, Abel KJ, et al. Complement factor H polymorphism and age-related macular degeneration. Science 2005; 308:421–424.
33. Rioux JD, Xavier RJ, Taylor KD, et al. Genome-wide association study identifies new susceptibility loci for Crohn disease and implicates autophagy in disease pathogenesis. Nat Genet 2007; 39:596–604.
34. Libioulle C, Louis E, Hansoul S, et al. Novel Crohn disease locus identified by genome-wide association maps to a gene desert on 5p13.1 and modulates expression of PTGER4. PLoS Genet 2007; 3:e58.
35. Duerr RH, Taylor KD, Brant SR, et al. A genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science 2006; 314:1461–1463.
36. McShane LM, Altman DG, Sauerbrei W, et al. REporting recommendations for tumor MARKer prognostic studies (REMARK). Breast Cancer Res Treat 2006; 100:229–235.
37. McShane LM, Altman DG, Sauerbrei W. Identification of clinically useful cancer prognostic factors: what are we missing? J Natl Cancer Inst 2005; 97:1023–1025.
38. Taubes G. Epidemiology faces its limits. Science 1995; 269:164–169.
39. Davey Smith G, Ebrahim S. Epidemiology—is it time to call it a day? Int J Epidemiol 2001; 30:1–11.
40. von Elm E, Egger M. The scandal of poor epidemiological research. BMJ 2004; 329:868–869.
41. Pocock SJ, Elbourne DR. Randomized trials or observational tribulations? (comment). N Engl J Med 2000; 342:1907–1909.
42. Benson K, Hartz AJ. A comparison of observational studies and randomized, controlled trials (comment). N Engl J Med 2000; 342:1878–1886.
43. Concato J, Shah N, Horwitz RI. Randomized, controlled trials, observational studies, and the hierarchy of research designs (comment). N Engl J Med 2000; 342:1887–1892.
44. Concato J, Horwitz RI. Beyond randomised versus observational studies. Lancet 2004; 363:1660–1661.
45. Lawlor DA, Davey Smith G, Kundu D, et al. Those confounded vitamins: what can we learn from the differences between observational versus randomised trial evidence? Lancet 2004; 363:1724–1727.
46. IARC Working Group. IARC Handbooks of Cancer Prevention, Volume 2: Carotenoids. Lyon: IARC, 1998.
47. Bostick RM, Fosdick L, Grandits GA, et al. Colorectal epithelial cell proliferative kinetics and risk factors for colon cancer in sporadic adenoma patients. Cancer Epidemiol Biomarkers Prev 1997; 6:1011–1019.
48. Hulley S, Furberg C, Barrett-Connor E, et al. Noncardiovascular disease outcomes during 6.8 years of hormone therapy: Heart and Estrogen/progestin Replacement Study follow-up (HERS II). JAMA 2002; 288:58–66.
49. Lawlor DA, Smith GD, Bruckdorfer KR, et al. Those confounded vitamins: what can we learn from the differences between observational versus randomised trial evidence? Lancet 2004; 363:1724–1727.
50. Vandenbroucke JP. When are observational studies as credible as randomised trials? Lancet 2004; 363:1728–1731.
51. Pocock SJ, Collier TJ, Dandreo KJ, et al. Issues in the reporting of epidemiological studies: a survey of recent practice. BMJ 2004; 329:883.
52. Moher D, Schultz KF, Altman D. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. JAMA 2001; 285:1987–1991.
53. Altman DG, Schulz KF, Moher D, et al. The revised CONSORT statement for reporting randomized trials: explanation and elaboration. Ann Intern Med 2001; 134:663–694.
54. Plint AC, Moher D, Morrison A, et al. Does the CONSORT checklist improve the quality of reports of randomised controlled trials? A systematic review. Med J Aust 2006; 185:263–267.
55. Altman DG. The scandal of poor medical research. BMJ 1994; 308:283–284.
56. Gluud LL. Bias in clinical intervention research. Am J Epidemiol 2006; 163:493–501.
57. Des Jarlais DC, Lyles C, Crepaz N, TREND Group. Improving the reporting quality of nonrandomized evaluations of behavioral and public health interventions: the TREND statement. Am J Public Health 2004; 94:361–366.
58. Bossuyt PM, Reitsma JB, Bruns DE, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Ann Intern Med 2003; 138:40–44.
59. Brazma A, Hingamp P, Quackenbush J, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 2001; 29:356–371.
60. von Elm E, Altman DG, Egger M, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Ann Intern Med 2007; 147:573–577.
61. von Elm E, Altman DG, Egger M, et al. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. BMJ 2007; 335:806–808.
62. von Elm E, Altman DG, Egger M, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Epidemiology 2007; 18:800–804.
63. von Elm E, Altman DG, Egger M, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Lancet 2007; 370:1453–1457.
64. von Elm E, Altman DG, Egger M, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. PLoS Med 2007; 4:e296.
65. von Elm E, Altman DG, Egger M, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Prev Med 2007; 45:247–251.
66. Editors. Probing STROBE. Epidemiology 2007; 18:789–790.
67. Vandenbroucke JP. The making of STROBE. Epidemiology 2007; 18:797–799.
68. Rothman KJ, Poole C. Some guidelines on guidelines: they should come with expiration dates. Epidemiology 2007; 18:794–796.
69. Kuller LH, Goldstein BD. Suggestions for STROBE recommendations. Epidemiology 2007; 18:792–793.
70. Cardon L, Bell J. Association study designs for complex diseases. Nat Rev Genet 2001; 2:91–99.
71. Weiss S. Association studies in asthma genetics. Am J Respir Crit Care Med 2001; 164:2014–2015.
72. Weiss ST, Silverman EK, Palmer LJ. Case-control association studies in pharmacogenetics. Pharmacogenomics J 2001; 1:157–158.
73. Cooper DN, Nussbaum RL, Krawczak M. Proposed guidelines for papers describing DNA polymorphism-disease associations. Hum Genet 2002; 110:208.
74. Hegele R. SNP judgements and freedom of association. Arterioscler Thromb Vasc Biol 2002; 22:1058–1061.
75. Little J, Bradley L, Bray MS, et al. Reporting, appraising, and integrating data on genotype prevalence and gene-disease associations. Am J Epidemiol 2002; 156:300–310.
76. Romero R, Kuivaniemi H, Tromp G, et al. The design, execution, and interpretation of genetic association studies to decipher complex diseases. Am J Obstet Gynecol 2002; 187:1299–1312.
77. Colhoun HM, McKeigue PM, Davey Smith G. Problems of reporting genetic associations with complex outcomes. Lancet 2003; 361:865–872.
78. van Duijn CM, Porta M. Good prospects for genetic and molecular epidemiologic studies in the European Journal of Epidemiology. Eur J Epidemiol 2003; 18:285–286.
79. Crossman D, Watkins H. Jesting Pilate, genetic case-control association studies, and Heart. Heart 2004; 90:831–832.
80. Huizinga TW, Pisetsky DS, Kimberly RP. Associations, populations, and the truth: recommendations for genetic association studies in Arthritis & Rheumatism. Arthritis Rheum 2004; 50:2066–2071.
81. Little J. Reporting and review of human genome epidemiology studies. In: Khoury MJ, Little J, Burke W, eds. Human Genome Epidemiology: A Scientific Foundation for Using Genetic Information to Improve Health and Prevent Disease. New York: Oxford University Press, 2004:168–192.
82. Turkal A. Editorial policies and practices. J Clin Invest 2007. Available at http://www.jci.org/kiosk/publish/policies. Accessed February 7, 2008.
83. Rebbeck TR, Martinez ME, Sellers TA, et al. Genetic variation and cancer: improving the environment for publication of association studies. Cancer Epidemiol Biomarkers Prev 2004; 13:1985–1986.
84. Tan N, Mulley J, Berkovic S. Association studies in epilepsy: "the truth is out there." Epilepsia 2004; 45:1429–1442.
85. Framework for a fully powered risk engine. Nat Genet 2005; 37:1153.
86. Ehm MG, Nelson MR, Spurr NK. Guidelines for conducting and reporting whole genome/large-scale association studies. Hum Mol Genet 2005; 14:2485–2488.
87. Freimer NB, Sabatti C. Guidelines for association studies in Human Molecular Genetics. Hum Mol Genet 2005; 14:2481–2483.
88. Hattersley AT, McCarthy MI. What makes a good genetic association study? Lancet 2005; 366:1315–1323.
89. Manly K. Reliability of statistical associations between genes and disease. Immunogenetics 2005; 57:549–558.
90. Shen H, Liu Y, Liu P, et al. Nonreplication in genetic studies of complex diseases-lessons learned from studies of osteoporosis and tentative remedies. J Bone Miner Res 2005; 20:365–376.
91. Vitali S, Randolph A. Assessing the quality of case-control association studies on the genetic basis of sepsis. Pediatr Crit Care Med 2005; 6:S74–S77.
92. Wedzicha JA, Hall IP. Publishing genetic association studies in Thorax. Thorax 2005; 60:357.
93. Hall IP, Blakey JD. Genetic association studies in Thorax. Thorax 2005; 60:357–359.
94. DeLisi LE, Faraone SV. When is a "positive" association truly a "positive" in psychiatric genetics? A commentary based on issues debated at the World Congress of Psychiatric Genetics, Boston, October 12–18, 2005. Am J Med Genet B Neuropsychiatr Genet 2006; 141:319–322.
95. Saito YA, Talley NJ, de Andrade M, et al. Case-control genetic association studies in gastrointestinal disease: review and recommendations. Am J Gastroenterol 2006; 101:1379–1389.
96. Uhlig K, Menon V, Schmid CH. Recommendations for reporting of clinical research studies. Am J Kidney Dis 2007; 49:3–7.
97. NCI-NHGRI Working Group on Replication in Association Studies, Chanock SJ, Manolio T, et al. Replicating genotype-phenotype associations. Nature 2007; 447:655–660.
98. Begg CB. Reflections on publication criteria for genetic association studies. Cancer Epidemiol Biomarkers Prev 2005; 14:1364–1365.
99. Byrnes G, Gurrin L, Dowty J, et al. Publication policy or publication bias? Cancer Epidemiol Biomarkers Prev 2005; 14:1363.
100. Pharoah PD, Dunning AM, Ponder BA, et al. The reliable identification of disease-gene associations. Cancer Epidemiol Biomarkers Prev 2005; 14:1362.
101. Wacholder S. Publication environment and broad investigation of the genome. Cancer Epidemiol Biomarkers Prev 2005; 14:1361.
102. Whittemore AS. Genetic association studies: time for a new paradigm? Cancer Epidemiol Biomarkers Prev 2005; 14:1359–1360.
103. Bogardus ST Jr., Concato J, Feinstein AR. Clinical epidemiological quality in molecular genetic research: the need for methodological standards. JAMA 1999; 281:1919–1926.
104. Peters DL, Barber RC, Flood EM, et al. Methodologic quality and genotyping reproducibility in studies of tumor necrosis factor-308 G->A single nucleotide polymorphism and bacterial sepsis: implications for studies of complex traits. Crit Care Med 2003; 31:1691–1696.
105. Clark MF, Baudouin SV. A systematic review of the quality of genetic association studies in human sepsis. Intensive Care Med 2006; 32:1706–1712.
106. Lee W, Bindman J, Ford T, et al. Bias in psychiatric case-control studies: literature survey. Br J Psychiatry 2007; 190:204–209.
107. McShane LM, Altman DG, Sauerbrei W, et al. REporting recommendations for tumor MARKer prognostic studies (REMARK). Nat Clin Pract Urol 2005; 2:416–422.
108. McShane LM, Altman DG, Sauerbrei W, et al. REporting recommendations for tumour MARKer prognostic studies (REMARK). Br J Cancer 2005; 93:387–391.
109. McShane LM, Altman DG, Sauerbrei W, et al. Reporting recommendations for tumor marker prognostic studies (REMARK). J Natl Cancer Inst 2005; 97:1180–1184.
110. McShane LM, Altman DG, Sauerbrei W, et al. REporting recommendations for tumour MARKer prognostic studies (REMARK). Eur J Cancer 2005; 41:1690–1696.
111. McShane LM, Altman DG, Sauerbrei W, et al. Reporting recommendations for tumor marker prognostic studies (REMARK). Exp Oncol 2006; 28:99–105.
112. McShane LM, Altman DG, Sauerbrei W, et al. Reporting recommendations for tumor marker prognostic studies. J Clin Oncol 2005; 23:9067–9072.
113. McShane LM, Altman DG, Sauerbrei W, et al. REporting recommendations for tumor MARKer prognostic studies (REMARK). Nat Clin Pract Oncol 2005; 2:416–422.
114. Stroup DF, Thacker SB. Meta-analysis in epidemiology. In: Gail MH, Benichou J, eds. Encyclopedia of Epidemiologic Methods. New York: Wiley & Sons Publishers, 2000:557–570.
115. Easterbrook PJ, Berlin JA, Gopalan R, et al. Publication bias in clinical research. Lancet 1991; 337:867–872.
116. Begg CB, Berlin JA. Publication bias and dissemination of clinical research. J Natl Cancer Inst 1989; 81:107–115.
117. Little J, Higgins JPT, eds. The HuGENet™ HuGE Review Handbook, version 1.0. 2006:1–59. Available at http://www.hugenet.ca. Accessed January 30, 2008.
118. Ioannidis JP, Gwinn M, Little J, et al. A road map for efficient and reliable human genome epidemiology. Nat Genet 2006; 38:3–5.
119. Ioannidis JP, Bernstein J, Boffetta P, et al. A network of investigator networks in human genome epidemiology. Am J Epidemiol 2005; 162:302–304.
120. Chan AW, Hrobjartsson A, Haahr MT, et al. Empirical evidence for selective reporting of outcomes in randomized trials: comparison of protocols to published articles. JAMA 2004; 291:2457–2465.
121. Chan AW, Krleza-Jeric K, Schmid I, et al. Outcome reporting bias in randomized trials funded by the Canadian Institutes of Health Research. CMAJ 2004; 171:735–740.
122. Chan AW, Altman DG. Identifying outcome reporting bias in randomised trials on PubMed: review of publications and survey of authors. BMJ 2005; 330:753.
123. Contopoulos-Ioannidis DG, Alexiou GA, Gouvias TC, et al. An empirical evaluation of multifarious outcomes in pharmacogenetics: beta-2 adrenoceptor gene polymorphisms in asthma treatment. Pharmacogenet Genomics 2006; 16:705–711.
124. Lawrence RW, Evans DM, Cardon LR. Prospects and pitfalls in whole genome association studies. Philos Trans R Soc Lond B Biol Sci 2005; 360:1589–1595.
125. GAIN Collaborative Research Group, Manolio TA, Rodriguez LL, et al. New models of collaboration in genome-wide association studies: the Genetic Association Information Network. Nat Genet 2007; 39:1045–1051.
126. Soares HP, Daniels S, Kumar A, et al. Bad reporting does not mean bad methods for randomised trials: observational study of randomised controlled trials performed by the Radiation Therapy Oncology Group. BMJ 2004; 328:22–24.
127. Moher D, Cook DJ, Jadad AR, et al. Assessing the quality of reports of randomised trials: implications for the conduct of meta-analyses. Health Technol Assess 1999; 3:i–iv, 1–98.
128. Sanderson S, Tatt ID, Higgins JP. Tools for assessing quality and susceptibility to bias in observational studies in epidemiology: a systematic review and annotated bibliography. Int J Epidemiol 2007; 36:666–676.
129. Cole BF, Baron JA, Sandler RS, et al. Folic acid for the prevention of colorectal adenomas: a randomized clinical trial. JAMA 2007; 297:2351–2359.
130. Ulrich CM, Potter JD. Folate and cancer–timing is everything. JAMA 2007; 297:2408–2409.
131. Lassere MN, Johnson KR, Boers M, et al. Definitions and validation criteria for biomarkers and surrogate endpoints: development and testing of a quantitative hierarchical levels of evidence schema. J Rheumatol 2007; 34:607–615.
132. Lassere M, Johnson K, Hughes M, et al. Simulation studies of surrogate endpoint validation using single trial and multitrial statistical approaches. J Rheumatol 2007; 34:616–619.
133. Wacholder S, Chanock S, Garcia-Closas M, et al. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J Natl Cancer Inst 2004; 96:434–442.
134. Ioannidis JP. Why most published research findings are false. PLoS Med 2005; 2:e124.
135. Thomas DC, Clayton DG. Betting odds and genetic associations. J Natl Cancer Inst 2004; 96:421–423.
136. Knowler WC, Williams RC, Pettitt DJ, et al. Gm3,5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am J Hum Genet 1988; 43:520–536.
137. Gelernter J, Goldman D, Risch N. The A1 allele at the D2 dopamine receptor gene and alcoholism: a reappraisal. JAMA 1993; 269:1673–1677.
138. Kittles RA, Chen W, Panguluri RK, et al. CYP3A4-V and prostate cancer in African Americans: causal or confounding association because of population stratification? Hum Genet 2002; 110:553–560.
139. Thomas DC, Witte JS. Point: population stratification: a problem for case-control studies of candidate-gene associations? Cancer Epidemiol Biomarkers Prev 2002; 11:505–512.
140. Wacholder S, Chatterjee N, Hartge P. Joint effects of genes and environment distorted by selection biases: implications for hospital-based case-control studies. Cancer Epidemiol Biomarkers Prev 2002; 11:885–889.
141. Cardon LR, Palmer LJ. Population stratification and spurious allelic association. Lancet 2003; 361:598–604.
142. Wacholder S, Rothman N, Caporaso N. Population stratification in epidemiologic studies of common genetic variants and cancer: quantification of bias. J Natl Cancer Inst 2000; 92:1151–1158.
143. Ardlie KG, Lunetta KL, Seielstad M. Testing for population subdivision and association in four case-control studies. Am J Hum Genet 2002; 71:304–311.
144. Edland SD, Slager S, Farrer M. Genetic association studies in Alzheimer's disease research: challenges and opportunities. Stat Med 2004; 23:169–178.
145. Millikan RC. Re: population stratification in epidemiologic studies of common genetic variants and cancer: quantification of bias. J Natl Cancer Inst 2001; 93:156–157.
146. Wang Y, Localio R, Rebbeck TR. Evaluating bias due to population stratification in case-control association studies of admixed populations. Genet Epidemiol 2004; 27:14–20.
147. Ioannidis JP, Ntzani EE, Trikalinos TA. 'Racial' differences in genetic effects for complex diseases. Nat Genet 2004; 36:1312–1318.
148. Marchini J, Cardon LR, Phillips MS, et al. The effects of human population structure on large genetic association studies. Nat Genet 2004; 36:512–517.
149. Freedman ML, Reich D, Penney KL, et al. Assessing the impact of population stratification on genetic association studies. Nat Genet 2004; 36:388–393.
150. Khlat M, Cazes MH, Genin E, et al. Robustness of case-control studies of genetic factors to population stratification: magnitude of bias and type I error. Cancer Epidemiol Biomarkers Prev 2004; 13:1660–1664.
151. Balding DJ. A tutorial on statistical methods for population association studies. Nat Rev Genet 2006; 7:781–791.
152. Ioannidis JP. Non-replication and inconsistency in the genome-wide association setting. Hum Hered 2007; 64:203–213.
153. Pompanon F, Bonin A, Bellemain E, et al. Genotyping errors: causes, consequences and solutions. Nat Rev Genet 2005; 6:847–859.
154. Akey JM, Zhang K, Xiong M, et al. The effect that genotyping errors have on the robustness of common linkage-disequilibrium measures. Am J Hum Genet 2001; 68:1447–1456.
155. Dequeker E, Ramsden S, Grody WW, et al. Quality control in molecular genetic testing. Nat Rev Genet 2001; 2:717–723.
156. Mitchell AA, Cutler DJ, Chakravarti A. Undetected genotyping errors cause apparent overtransmission of common alleles in the transmission/disequilibrium test. Am J Hum Genet 2003; 72:598–610.
157. Rothman N, Stewart WF, Caporaso NE, et al. Misclassification of genetic susceptibility biomarkers: implications for case-control studies and cross-population comparisons. Cancer Epidemiol Biomarkers Prev 1993; 2:299–303.
158. Garcia-Closas M, Wacholder S, Caporaso N, et al. Inference issues in cohort and case-control studies of genetic effects and gene-environment interactions. In: Khoury MJ, Little J, Burke W, eds. Human Genome Epidemiology: A Scientific Foundation for Using Genetic Information to Improve Health and Prevent Disease. New York: Oxford University Press, 2004:127–144.
159. Wong MY, Day NE, Luan JA, et al. Estimation of magnitude in gene-environment interactions in the presence of measurement error. Stat Med 2004; 23:987–998.
160. Clayton DG, Walker NM, Smyth DJ, et al. Population structure, differential bias and genomic control in a large-scale, case-control association study. Nat Genet 2005; 37:1243–1246.
161. Xu J, Turner A, Little J, et al. Positive results in association studies are associated with departure from Hardy-Weinberg equilibrium: hint for genotyping error? Hum Genet 2002; 111:573–574.
162. Hosking L, Lumsden S, Lewis K, et al. Detection of genotyping errors by Hardy-Weinberg equilibrium testing. Eur J Hum Genet 2004; 12:395–399.
163. Salanti G, Amountza G, Ntzani EE, et al. Hardy-Weinberg equilibrium in genetic association studies: an empirical evaluation of reporting, deviations, and power. Eur J Hum Genet 2005; 13:840–848.
164. Zou GY, Donner A. The merits of testing Hardy-Weinberg equilibrium in the analysis of unmatched case-control data: a cautionary note. Ann Hum Genet 2006; 70:923–933.
165. Shoemaker J, Painter I, Weir BS. A Bayesian characterization of Hardy-Weinberg disequilibrium. Genetics 1998; 149:2079–2088.
166. Ayres KL, Balding DJ. Measuring departures from Hardy-Weinberg: a Markov chain Monte Carlo method for estimating the inbreeding coefficient. Heredity 1998; 80(pt 6):769–777.
167. Trikalinos TA, Salanti G, Khoury MJ, et al. Impact of violations and deviations in Hardy-Weinberg equilibrium on postulated gene-disease associations. Am J Epidemiol 2006; 163:300–309.
168. Higgins JPT, Green S, eds. Cochrane Handbook for Systematic Reviews of Interventions 4.2.5 (updated May 2005). Available at http://www.cochrane.org/resources/handbook/hbook.htm. Accessed February 7, 2008.
169. Ioannidis JPA, Ntzani EE, Trikalinos TA, et al. Replication validity of genetic association studies. Nat Genet 2001; 29:306–309.
170. Trikalinos TA, Ntzani EE, Contopoulos-Ioannidis DG, et al. Establishment of genetic associations for complex diseases is independent of early study findings. Eur J Hum Genet 2004; 12:762–769.
171. Ioannidis JP. Commentary: grading the credibility of molecular evidence for complex diseases. Int J Epidemiol 2006; 35:572–578.
172. Khoury MJ, Little J, Gwinn M, et al. On the synthesis and interpretation of consistent but weak gene-disease associations in the era of genome-wide association studies. Int J Epidemiol 2007; 36:439–445.
173. Ioannidis JP, Boffetta P, Little J, et al. Assessment of cumulative evidence on genetic associations: interim guidelines. Int J Epidemiol 2007 Sep 26 (Epub ahead of print).
174. Burdett S, Stewart LA, Tierney JF. Publication bias and meta-analyses: a practical example. Int J Technol Assess Health Care 2003; 19:129–134.
175. Surgeon General (Advisory Committee). Smoking and Health. Washington, DC: US Department of Health, Education and Welfare, 1964.
176. Hill AB. The environment and disease: association or causation? Proc R Soc Med 1965; 58:295–300.
177. Weed DL, Gorelic LS. The practice of causal inference in cancer epidemiology. Cancer Epidemiol Biomarkers Prev 1996; 5:303–311.
178. Potischman N, Weed DL. Causal criteria in nutritional epidemiology. Am J Clin Nutr 1999; 69:1309S–1314S.
179. Ioannidis JP, Patsopoulos NA, Evangelou E. Heterogeneity in meta-analyses of genome-wide association investigations. PLoS ONE 2007; 2:e841.
180. King HC, Sinha AA. Gene expression profile analysis by DNA microarrays: promise and pitfalls. JAMA 2001; 286:2280–2288.
181. Hunter DJ. Gene-environment interactions in human diseases. Nat Rev Genet 2005; 6:287–298.
182. Rebbeck TR, Khoury MJ, Potter JD. Genetic association studies of cancer: where do we go from here? Cancer Epidemiol Biomarkers Prev 2007; 16:864–865.
183. Ardlie KG, Kruglyak L, Seielstad M. Patterns of linkage disequilibrium in the human genome. Nat Rev Genet 2002; 3:299–309.
184. Rothman KJ. Modern Epidemiology. Boston/Toronto: Little, Brown and Company, 1986.
185. International Agency for Research on Cancer. IARC Monographs on the Evaluation of Carcinogenic Risks to Humans, Volume 64: Human Papillomaviruses. Lyon: IARC, 1995.
186. Frank J, Di Ruggiero E, McInnes RR, et al. Large life-course cohorts for characterizing genetic and environmental contributions: the need for more thoughtful designs. Epidemiology 2006; 17:595–598.
187. Ashbury FD, Kirsh VA, Kreiger N, et al. An invitation to develop Ontario's cancer research platform: report of the Ontario cancer cohort workshop. Chronic Dis Can 2006; 27:94–97.
188. Cresteil T. Onset of xenobiotic metabolism in children: toxicological implications. Food Addit Contam 1998; 15:45–51.
189. Sonnier M, Cresteil T. Delayed ontogenesis of CYP1A2 in the human liver. Eur J Biochem 1998; 251:893–898.
190. Flossmann E, Rothwell PM, British Doctors Aspirin Trial and the UK-TIA Aspirin Trial. Effect of aspirin on long-term risk of colorectal cancer: consistent evidence from randomised and observational studies. Lancet 2007; 369:1603–1613.
191. Rothwell PM, Bhatia M. Reporting of observational studies. BMJ 2007; 335:783–784.
192. Albertini RJ. Mechanistic insights from biomarker studies: somatic mutations and rodent/human comparisons following exposure to a potential carcinogen. IARC Sci Publ 2004; (157):153–177.
193. Weiss KM, Terwilliger JD. How many diseases does it take to map a gene with SNPs? Nat Genet 2000; 26:151–157.
194. Nature Genetics. Challenges for the 21st century. Nat Genet 2001; 29:353–354.
195. Cloninger CR. The discovery of susceptibility genes for mental disorders. Proc Natl Acad Sci USA 2002; 99:13365–13367.
292
Little
196. Glazier AM, Nadeau JH, Aitman TJ. Finding genes that underlie complex traits. Science 2002; 298:2345–2349. 197. Harrison PJ, Owen MJ. Genes for schizophrenia? Recent findings and their pathophysiological implications. Lancet 2003; 361:417–419. 198. Page GP, George V, Go RC, et al. “Are we there yet?”: Deciding when one has demonstrated specific genetic causation in complex diseases and quantitative traits. Am J Human Genet 2003; 73:711–719. 199. Schlesselman JJ. “Proof ” of cause and effect in epidemiologic studies: criteria for judgment. Prev Med 1987; 16:195–210. 200. Weed DL. On the use of causal criteria. Int J Epidemiol 1997; 26:1137–1141. 201. Rebbeck T, Spitz M, Wu X. Assessing the functions of genetic variants in candidate gene association studies. Nat Rev Genet 2004; 5:589–597. 202. Ioannidis JP, Kavvoura FK. Concordance of functional in vitro data and epidemiological associations in complex disease genetics. Genet Med 2006; 8:583–593. 203. ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007; 447:799–816. 204. Potter JD. At the interfaces of epidemiology, genetics and genomics. Nat Rev Genet 2001; 2: 142–147. 205. Pound P, Ebrahim S, Sandercock P, et al. Reviewing Animal Trials Systematically (RATS) Group. Where is the evidence that animal research benefits humans? BMJ 2004; 328:514–517.
Index
Ab initio modeling, 194
Absolute risk of disease incidence, 259
  definition and uses of, 259–260
  effects of tamoxifen on, 260
  models
    case-control designs, 263–264
    cohort data, 261–263
    data sources and types of, 260–261
    family-based design, 260, 261, 264–269. See also Kin-cohort design
    validation of, 269–270
ACTANE consortium, 23
Additive hazard model, 148–149
Adenomas, 32
  and colon cancer, 35
Admixture, 131–132
  linkage disequilibrium mapping, 134
  stratification (PS), 135–136, 137
Affymetrix systems, 70
Aflatoxin, 275
  exposure, 102
Aflatoxin-DNA adducts, 86
Age-Related Eye Disease Study (AREDS), 227
Age-related macular degeneration (AMD) GWA study, 227, 228, 234
AIM. See Ancestry, informative markers (AIM)
Alleles, 130
  CYP19 minor, 26
  frequencies, 231
  frequency of marker, 22
  sharing, 229
  transmitted, 24
Allelic discrimination, 69
Alu Yd6, 58
American Institute of Cancer Research, 41
Amine metabolites, urinary heterocyclic, 84
Amish, 131
AmpFLSTR identifiler amplification kit, 58
  assay, 73
AmpliChip CYP450, 116
Analogy, 240
Ancestry, 135–136
  informative markers (AIM), 135, 136–141
ARE. See Asymptotic relative efficiency (ARE)
Area under the receiver-operating curve (AUC), 269
Association studies, 239, 240, 248–253
  for gene-gene/environment interactions, 158–161
Assortative mating, 132
Asthma phenotypes, 229
Asymptotic relative efficiency (ARE), 26
ATM gene
  linkage disequilibrium in, 208
  63-SNP haplotypes in, 207
Attribute construction, 172
Atypical ductal hyperplasia, 35
AUC. See Area under the receiver-operating curve (AUC)
Austin Bradford Hill study, 30
Bagging, 162
β2-agonist therapy, 277
Bayesian network analysis, 200
Bayes’ theorem, 211
Benzo(a)pyrene [B(a)P]-DNA adducts, 86
Bevacizumab, 32
BHT. See Butylated hydroxyl toluene (BHT)
Biases
  in epidemiologic studies, 99–101, 156–158, 249, 253
  language, 246
  in meta-analyses of associations, 244–246
  protection from, 250–251
  publication, 245
  selective outcome and analysis reporting, 245
  winner’s curse, 245
Biological epistasis, 170–171
Biologically effective dose, 84–85
Biological specimens. See Specimens
Biomarkers
  and assessment of environmental exposures, 88–90
  biological plausibility, 90
  breast cancer, 35
  in clinical trials, 34–35
  of effect, 86–87
    application of, 82–84
    measuring, 88–90
  exposure and intermediate endpoints, 2–4
  and intervention study, 91–92
  misclassification in exposure analysis, 109–110
  in peripheral blood, 88
  risk, 35
  of susceptibility, 6–7
Biomarkers of exposure
  application of, 82–84
  biologically effective dose, 84–85
  and genotoxicity, 86
  internal dose, 84
  measuring concentration in organs and body fluid, 88–90
  specificity of, 85–86
  validation of, 85
Biosampling. See also Specimens
  collection and processing, 53–54
  DNA preparation, 56–57
  DNA quality control, 58–59
  NCI best practices, 59
  RNA preparation, 59
  storage, 54–55
  tissue microarrays, 56
  tracking, 55–56
  whole genome amplification (WGA), 57–58
Blocks, gene, 209–210
Blood specimens, 11–12
BMI phenotypes, 235
BOADICEA model, 261
Bonferroni correction, 232
BRCAPRO model, 261
Breast cancer. See also Absolute risk of disease incidence; Cancer
  biomarkers, 35
  phenotypes, E-cadherin expression for, 229
  prevention trial, 270
  risk prediction, 191
    estrogen metabolism pathway in, 191, 192, 195, 197–198
Buccal epithelial cells, 11
Buccal specimens, 11–12
Butylated hydroxyl toluene (BHT), 54
CaBIG. See Cancer Biomedical Informatics Grid™ (caBIG)
CALGB. See Cancer and leukemia group B (CALGB)
Camptothecin, 117
Cancer. See also Breast cancer
  pathway-based methods in molecular epidemiology of, 189–201
  susceptibility prediction study, 176–178
Cancer and leukemia group B (CALGB), 48
Cancer Biomedical Informatics Grid™ (caBIG), 56
Cancer genes, somatic mutations in, 87
Cancer prognosis
  case-control study of, 45–47
  cohort study of, 45–47
  factors, 42–45
  prospective observational study of, 47
  statistical methods for study of, 49–50
Cancer treatment, pharmacogenetics in, 113–124
Candidate gene approach, 116, 121
Candidate gene studies, limitations, 226
CART analysis. See Classification and regression trees (CART) analysis
Case-control studies
  designs for absolute risk models estimation, 263–264
  on gene-gene/environment interactions, 149–152
    biases in, 157, 158
    family-based, 154–156, 157, 158
  haplotype frequencies comparison, 215–216
Case-control study, 4–8
  biomarkers of susceptibility in, 6–7
  of cancer prognosis, 45–47
  and clinical outcomes, 7–8
  exposure assessment in, 4–6
  molecular classification of tumors in, 7
Case-only approach for gene-gene/environment interactions, 151, 152
Case-parents study design, 154, 155–156, 157
Case-parent trios and TDT, 24–25
Case-series design, 9–10, 121
Case-sibling association study design, 25–26
Case-siblings study design, 154, 155–156, 157
Caucasians
  human genome, 178–179
  TPMT in, 117
  VNTR polymorphism in, 119
Chromosomal aberrations, 87
Chromosomes, linkage on, 23
Cisplatin pharmacogenetics, 115, 120, 122–123
Classification and regression trees (CART), 49–50, 123, 162
Claus model, 261
Clinical outcomes, determination of, 7–8
Clinical trials
  ancillary study to, 48
  biomarkers in, 34–35
  prevention, 31–34
  randomized, 10
  structure of, 29–31
CNV. See Copy number variations (CNV)
Cochran’s Q statistic test, 241–242
Coherence in molecular epidemiology, 240
Cohort studies
  for absolute risk models estimation, 261–263
  of cancer prognosis, 45–47
  on gene-gene/environment interactions, 147–149
  prospective
    of exposure and biological specimens, 8–9
    and screening cohorts, 9
Colon cancer
  and adenomas, 35
  prevention trials, of dietary change, 32–33
Colonoscopy, 32
Consolidated Standards of Reporting Trials (CONSORT), 276, 277
Constructive induction, 172
Construct questionnaire validity, 107
Controls, for GWA studies, 229–230
Copula models, 265, 266
Copy number variations (CNV), 66–67, 227–228
Cornfield’s equality, 149
Cox regression models, 49
Cox’s proportional hazard model, 148
Criterion-related questionnaire validity, 106–107
Crohn’s disease, 230
Cronbach’s alpha, 104
Crossover effect, 147
Cross-sectional study, 2–4
“Crude risk.” See Absolute risk
“Cumulative incidence.” See Absolute risk
CYP19A1 gene, 208, 209
CYP2C9, 122
Darwin’s theory of natural selection, 130–131
Data-mining tools, 162–163
DeCode genetics, 230
Dendrograms, 174, 175–176, 177
Detoxification, against oxidative stress, 36
Diallelic markers, 214
Dichotomous analysis of exposure and disease risk, 106
Dietary intervention trials, 30
Dietary prevention trials, for patients with colon cancer, 32–33
Dihydropyrimidine dehydrogenase (DPD), 119–120
Dinucleotide (TA) repeat polymorphism, 117–118
Directional selection, 131
Discriminatory accuracy, 269
Disease susceptibility, 64
Disruptive selection, 131
Dizygotic twins, 20
DNA
  paraffin, 57
  preparation, 56–57
  quality control, 58–59
  repair capacity (DRC), 6, 7, 91, 115
  repair genes, 120
  sequencing, 191, 193, 194
  synthesis, 44
DNA adducts, 84–85
  bulky, 86, 90
  NOC-related, 90
Dose
  biologically effective, 84–85
  internal, 84
  response, 282
DPD. See Dihydropyrimidine dehydrogenase (DPD)
DRC. See DNA repair capacity (DRC)
Drug responses, factors of interindividual variability to, 114
Drusen size, 228
E-cadherin expression, for breast cancer phenotypes, 229
Effective dose, biologically, 84–85
Effective population size (Ne), 131
EGFR. See Epidermal growth factor receptor (EGFR)
Empirical Bayes (EB) analysis, 152
Energy expenditure and epidemiological outcome, 102–103
Environmental exposures, biomarkers and, 88–90
Environmental factors, effect on genetic interactions, 170–171
Enzyme systems and pharmacogenetics, 36
EPIC. See European Prospective Investigation into Cancer and Nutrition (EPIC)
Epidemiologic observation
  energy expenditure association, 102–103
  limitations of, 99
Epidermal growth factor receptor (EGFR), 120–121
Epistasis. See Gene-gene interaction
Epithelial cells, buccal, 11
Epstein-Barr virus, 53
Erlotinib (Tarceva), 120–121
Estrogen metabolism pathway
  in breast cancer risk prediction, 191, 192, 195. See also Breast cancer
  in silico model of, 197–198
Ethnicity/Race, 133
  admixture stratification (PS) and ancestry, 135–136, 137
European Prospective Investigation into Cancer and Nutrition (EPIC), 54
Expectation-maximization (EM) algorithm, 211–212
Expected numbers of cases/observed number of cases (E/O) ratio, 270
Exposure and disease risk
  accuracy of assessment, 100–101
  association analysis, 101–102, 106
Exposure assessment, in case-control study of molecular epidemiology, 4–6
Exposure biomarkers
  specificity of, 85–86
  validation of, 85
Extreme genotyping, 70
FA-containing supplements. See Folic acid (FA)-containing supplements
False positives in GWA studies, 232, 233
Familial aggregation, 20
Family-based association study, 24–26
  case-parent trios, 24–25
  case-sibling association study design, 25–26
  population stratification, 24
  restricted study designs, 26
Family-based association tests (FBAT), 229
Family-based studies
  for absolute risk model estimation, 260, 261, 264–269
  case-control approach for gene-gene/environment interactions, 154–156, 157, 158
FB1. See Fumonisin B1 (FB1)
FFPE tissue. See Formalin-fixed, paraffin-embedded (FFPE) tissue
Filter method, 178, 179
  for genomewide association analysis, 179–181
Finasteride, prostate cancer prevention trial of, 31–32
Fisher’s sum of logs method, 244
Fitness, 131
Fixation indices, 132–133
Fixed effects model, 243
FlexTree, 163
5-fluorouracil (5-FU), 118–120
5-fluorouracil (5-FU)-based treatment, 44–45
Focused interaction testing framework (FITF), 162, 172
Folate, 44–45, 278–279
Folic acid (FA)-containing supplements, 44
Formalin-fixed, paraffin-embedded (FFPE) tissue, 56
Founder effect, 131
5-FU-based treatment. See 5-fluorouracil (5-FU)-based treatment
Fumonisin B1 (FB1), 86
Functional SNP, 65
Gail model, 261, 270
  AUC values for, 269
Gail Model 2, 263, 270
Gefitinib (Iressa), 120–121
Gel electrophoresis, 59
Gene blocks, 209–210
Genecentric approach of SNP selection, 231
Gene-environment interactions and sample size, 14
Gene-gene/environment interactions, 170–171
  association studies for, 158–161
  case-control studies on, 149–152
    family-based, 154–156, 157, 158
  cohort studies on, 147–149
  dendrograms, 174, 175–176, 177
  graphs, 175
  higher-order interactions and data-mining tools, 162–163
  MDR for, 172–173
  models
    genomewide association analysis, 178–182
    statistical, 146–147, 173–176
  two-phase stratified sampling design for, 153
Genetic Association Information Network (GAIN), 248
Genetic association testing, 66
Genetic distance, 133
Genetic drift, 131
Genetic library utilities (GLU), 74–75
Genetic markers, 6. See also Biomarkers
Genetic polymorphisms, 42
Genetic programming, 181–182
Genetic structure of species, 132–133
Genetic variation
  MAF, 64–65
  SNP, 64–67
Genome scan meta-analyses (GSMA) method, 244
Genomewide association (GWA) study (GWAS), 63, 225–226, 278
  efficiency improvement in, 228–233
  history of, 226–228
  markers across genome in, 64
  in modeling interactions, 178–179
    filter strategy for, 179–181
    wrapper strategy for, 181–182
  as new scientific paradigm, 233–234
  reducing genotyping and laboratory error in, 230
Genome-wide panels of AIM for ancestry studies, 138–140
Genomic control in GWA studies, 232
Genotoxicity, 86
Genotype-guided clinical trial, 122
Genotyping, 116. See also Pharmacogenetics
  errors, 280
  extreme, 70
  and laboratory error reduction in GWA studies, 230
  platforms, 68–69
    high-density, 70
  and SNP, 67–71
GLU. See Genetic library utilities (GLU)
Glutathione S-transferase M1 (GSTM1) gene, 13
Glutathione S-transferase pi (GSTP1), 25
Glycophorin A, 87
Greenwood’s formula, 148
GSTM1 gene. See Glutathione S-transferase M1 (GSTM1) gene
GSTP1. See Glutathione S-transferase pi (GSTP1)
Guanidinium isothiocyanate, 59
GWAS. See Genomewide association studies (GWAS)
Hanahan and Weinberg study, 36
Hankinson and Santella study, 48
Haplotype analysis in pharmacogenetics, 116, 119
Haplotypes
  association analysis, 205–206, 208–210
    comparing haplotype frequencies in cases and controls, 215–216
    marginal methods, 218–219
    plug-in methods, 217–218
    in regression framework, 216–217
    without haplotypes, 219–221
  definition, 206
  inference from genotypes, 210–212
HapMap. See International Haplotype Mapping Project (HapMap)
Hardy-Weinberg equilibrium (HWE), 132, 133–134, 211, 230
  departure from, 280
HBV. See Hepatitis B virus (HBV)
Health Insurance Portability and Accountability Act (HIPAA), 50
Helicobacter pylori, 89
Hepatitis B virus (HBV), 88
Heterocyclic amine metabolites, urinary, 84
Heterogeneity-based genome scan meta-analysis (HEGESMA), 244
HGPIN. See High grade prostate intraepithelial neoplasia (HGPIN)
Hierarchical modeling for pathway-based molecular epidemiology, 191, 198–200, 201
Higher-order interactions, 162–163
High grade prostate intraepithelial neoplasia (HGPIN), 35
High-throughput genotype analysis, 65, 75
HIPAA. See Health Insurance Portability and Accountability Act (HIPAA)
Hispanics, 133
Homologous phenotypes, 229
Homology modeling, 194
Hospital control designs, for gene-gene/environment interactions, 158
HPRT. See Hypoxanthine-guanine phosphoribosyltransferase (HPRT)
HTRA1 gene, 227
“http://www.hapmap.org,” 191, 205
“http://www.r-project.org,” 153
Human genetic variation
  effect of different evolutionary forces, 130
Human genome, 129
  epidemiology investigators, challenges for, 247
Hybrid study design, 155–156, 157
Hypothesis-driven biomarkers, 1
Hypoxanthine-guanine phosphoribosyltransferase (HPRT), 87
IBD. See Identical by descent (IBD)
Identical by descent (IBD), 132
IL-6. See Interleukin-6 (IL-6)
Illumina systems, 70
Inbreeding, 132
Inference of molecular epidemiology studies, causal, 281–283
In silico
  model, 195, 197–198
  tools, 65
Institutional review boards (IRB), 47, 50
Interleukin-6 (IL-6), 53
Interleukin-23 receptor (IL23R) gene, 227
Internal dose, 84
International Haplotype Mapping Project (HapMap), 129, 191, 226–227
  Consortium, 178–179
  panels for tagging SNP selection, 213
International HapMap project, 66, 70
Intervention trials, 30–31
Invader UGT1A1 Assay, 118
IRB. See Institutional review boards (IRB)
Irinotecan, 117–118
Japanese Millennium Genome Project, 226
Kaplan-Meier estimation, 49
Kappa index, 108–109
Kin-cohort design, 10, 261, 264–266
Koreans, UGT genes in, 118
Laboratory information management systems (LIMS), 56, 74–75
Lifestyle factors and cancer prognosis, 43
Likelihood log-odds (LOD) scores, 22
LIMS. See Laboratory information management systems (LIMS)
Linkage analysis, 21–23
  model-based/parametric, 22–23
  model-free/nonparametric, 23
Linkage disequilibrium, 24, 208
Linkage equilibrium, 133–134
Linnaeus’s Systema Naturae, 133
Locus heterogeneity, 170
LOD scores. See Likelihood log-odds (LOD) scores
Logistic regression, 162–163, 178, 200
  haplotype analysis, 216–217
MAF. See Minor allele frequency (MAF)
MALDI-TOF mass spectrometry. See Matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry
Marginal methods for haplotype association, 218–219
Markov chain Monte Carlo (MCMC) method, 163, 200
Matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry, 69
Maximum likelihood estimation (MLE), 136, 137
Max-omnibus test, 161
MDR. See Multifactor dimensionality reduction (MDR)
Measurement errors in exposure assessment, 99–110
Mechanistic modeling for pathway-based molecular epidemiology, 195–198, 200–201
Mendelian disorders, 66
Mendelian inheritance, 19
Mendelian randomization, 233
6-Mercaptopurine, 116–117
Meta-analysis methods in molecular epidemiology, 241–248
Meta-analysis of individual participant data (MIPD), 246
Metabolic ratio, 115
5,10-methylenetetrahydrofolate reductase (MTHFR), 44, 120
Mexican Americans, 133
Microarray data, 248
Micronuclei, 87
Minimum Information About Microarray Experiments (MIAME), 276
  guidelines, 248
Minor allele frequency (MAF), 64–65
Model-based linkage analysis, 22–23
Model-free linkage analysis, 23
Molecular epidemiology
  calibrating credibility in studies of, 248–249
  defined, 275
  issues, 278–279
  papers, 239
  quality assessment in, 281–283
  studies
    interpretation, 278–283
    publication of, 277
    reporting in, 276–277
  results, selective reporting of, 277–278
  traditional criteria and concepts, 240–241
Monozygotic twins, 20
MTHFR. See 5,10-methylenetetrahydrofolate reductase (MTHFR)
Multifactor dimensionality reduction (MDR), 163, 172–173, 174
Multiple-testing penalty, 208–209
Multiplex pedigrees, 266
Multistage designs in GWA studies, 232–233
Multivariable regression methods for unphased SNP genotypes, 220–221
Mutations, 130
  somatic, in reporter and cancer genes, 87
N-acetyl transferase 2 (NAT2) gene, 13
NAT2 gene. See N-acetyl transferase 2 (NAT2) gene
National Cancer Institute (NCI)
  best practices for biosampling, 59
  Breast Cancer Risk Assessment Tool, 259
  common toxicity criteria, 48
  dietary guidelines, 32
  SNP500cancer program, 67–68
National Cancer Institute’s program. See Surveillance, Epidemiology and End Results (SEER) Program
National death index (NDI), 45
National Institutes of Health (NIH), 278
Natural selection, 130–131
NDI. See National death index (NDI)
Needle-in-a-haystack problem, 178
Neutral variants, 130
Neutropenia, 118
N-nitroso compounds (NOC), 90
NOC. See N-nitroso compounds (NOC)
NOC-related DNA adducts, 90
Nonlinear patterns of gene-gene interactions, 170, 172
Nonrandom mating, 132
Non-small cell lung cancer (NSCLC)
  DRC in, 115
  somatic mutations in, 120–121
NSCLC. See Non-small cell lung cancer (NSCLC)
Null hypothesis of no association, 208
Nurses’ Health Study, 235
Nutritional epidemiology, 102, 107
Obesity gene, 235
Observational study, prospective
  of cancer prognosis, 47
Odds-ratio interaction, 150–151, 159–160
Omnibus hypothesis-testing, 158, 159–161
Oxidative stress, 36
PAH. See Polycyclic aromatic hydrocarbon (PAH)
PAH-DNA adducts, 86
Paraffin DNA, 57
Paraffin-embedded tissue blocks, 12
Paraffin-embedded tumor samples, 12
Parametric physiologically based pharmacokinetic (PBPK) model, 190–191, 198–200, 201
Pathway-based molecular epidemiology
  DNA sequencing in, 191, 193
  hierarchical modeling in, 191, 198–200, 201
  mechanistic modeling for, 195–198, 200–201
  overview of models in, 190
  in silico modeling in, 198
Pathway-based polygenic genotyping, 122–123. See also Pharmacogenetics
PAXgene blood RNA system vacutainer tubes, 59
PBPK model. See Parametric physiologically based pharmacokinetic (PBPK) model
PCA. See Principal component analysis (PCA)
Penetrance, 10, 22, 171, 173, 180, 218, 219
Peripheral blood mononuclear cells (PBMC)-DPD activity, 119
Pharmacogenetics, 35–37, 113–114
  cisplatin, 115, 120, 122–123
  5-fluorouracil (5-FU), 118–120
  genotypic measurements, 116
  irinotecan and UGT1A1, 117–118
  issues and challenges in, 121–124
  6-mercaptopurine and TPMT, 116–117
  phenotypic measurements, 114–115
  somatic mutations and, 120–121
  and statistical issues, 36–37
PHASE software, 212
Phenocopy, 170
Phenotype in GWA studies, 228–229
Phenotyping, 114–115. See also Pharmacogenetics
Platinum agents, 120
PLINK, 74–75
Plug-in methods for haplotype association, 217–218
Polycyclic aromatic hydrocarbon (PAH), 84
Polymerase chain reaction (PCR), 282
Polymorphism, 130
Polyp prevention trials, 32–34
Polyps, adenomatous, 32–33, 34, 35
Population bottleneck, 131
Population stratification, 24, 122, 156, 230
Principal component analysis (PCA), 75
Probands, 261, 264, 265, 266, 267
Prostate cancer prevention trial of Finasteride, 32
Protein sequences, 193–194
Proteus phenomenon, 245
Puerto Ricans, 133
Quality assessment in molecular epidemiology studies, 281–283
Quality control
  of DNA, 58–59
  in laboratory, 71–74
Questionnaire assessment
  accuracy of, 102, 103–105
  assembly of, 102–103
  dimensions of validity in, 106–108
  in molecular epidemiology, 108–110
Quinone E2-3,4-Q, 195
Race/ethnicity, 133
  admixture stratification (PS) and ancestry, 135–136, 137
Racial factors and cancer prognosis, 43
Random effects model, 243
Random forest procedure, 162
Randomized controlled trials, 48–50
R-cran Web site, 153
RC-TDT. See Reconstruction combined TDT (RC-TDT)
Reaction norm, 170
Real-time PCR assay, 73
Reconstruction combined TDT (RC-TDT), 25
Recursive partitioning algorithm, 162
Reliability of questionnaire assessment for exposures, 103–105
Relief algorithm, 179–181
REPLI-g kit, 57
Reporter genes, somatic mutations in, 87
Restricted study designs, 26
Restriction fragment length polymorphism (RFLP), 69
RFLP. See Restriction fragment length polymorphism (RFLP)
Risch and Merikangas study, 63
RNAeasy kit, 59
RNA preparation, 59
“Rs7566605,” 235
Sample
  collection, 53–54
  size consideration, 13–14
  stability, 55
  storage, 54–55
  tracking, 55–56
Sampling designs, two-phase, 10
SAS PROC HAPLOTYPE, 211, 215
Schistosoma mansoni, 92
Screening cohorts, 9
Segregation analysis, 20–21
Selenium, for nonmelanoma skin cancer patients, 34
Selenomethionine, 35
Sexual selection, 131
Shannon entropy, 175
Single imputation methods for haplotype association, 217–218
Single-nucleotide polymorphism (SNP), 6, 24, 174, 191
  database for, 67
  data management and analysis, 74–75
  DPD, 119–120
  functional, 65
  genetic variation, 64–67
  and genotyping, 67–71
  genotyping errors, 230
  genotyping for, 116
  imputation of untyped, 220
  multivariable regression methods for unphased, 220–221
  selection for GWA studies, 231–232
  tagging. See Tagging SNP
Sister chromatid exchanges, 87
Smoking, 176, 177
SNP. See Single-nucleotide polymorphisms (SNP)
SNP assays, optimization of, 67–68
SNPHAP, 211
63-SNP haplotype, 206, 207, 208
Somatic mutations, 120–121. See also Pharmacogenetics
  in reporter and cancer genes, 87
SOP. See Standard operating procedures (SOP)
Specimens
  blood and buccal, 11–12
  design consideration in collection of, 10–11
  tissue, 12–13
  urine, 12
Stabilizing selection, 131
Standard operating procedures (SOP), 59
Standards for Reporting of Diagnostic Accuracy (STARD), 276
Statistical epistasis, 170, 171
Statistical methods for cancer prognosis, 49–50
Statistical models for gene-gene/environment interactions, 146–147, 173–176
Stouffer’s sum of zs method, 244
Stratified random sampling, 153
Strengthening the Reporting of Genetic Associations (STREGA), 276–277
Strengthening the Reporting of Observational studies in Epidemiology (STROBE), 276, 278
STRUCTURE, 29–31, 137
  analysis, 75
Surrogate markers, 279
Surveillance, Epidemiology and End Results (SEER) Program, 260, 263
Susceptibility genes, 158
Tagging SNP, 66, 206, 209–210, 211, 231. See also Haplotypes
  selection of, 212–215
Tamoxifen, 260, 270
TaqMan™, 69
TDT. See Transmission disequilibrium test (TDT)
Temporality, 240, 282
Therapeutic trials, 31–34
  polyp prevention trials, 32–34
  and toxicity, 31–32
Thiopurine S-methyltransferase (TPMT), 116–117
Thymidylate, 44
Thymidylate synthase (TS), 44, 118–119
Tissue microarrays (TMA), 12, 56
Tissue specimens, 12–13
TMA. See Tissue microarrays (TMA)
TNF-α. See Tumor necrosis factor alpha (TNF-α)
TNF gene. See Tumor necrosis factor (TNF) gene
Toxicity, 31–32
TPMT. See Thiopurine S-methyltransferase (TPMT)
TP53 tumor suppressor gene, 87
Transmission disequilibrium test (TDT), 24–25
Transparent Reporting of Evaluations with Nonrandomized Designs (TREND), 276
Tris EDTA solution, 55, 58
Trizol LS, 59
TS. See Thymidylate synthase (TS)
Tumor
  characteristics, folate-related, 44
  molecular classification in case-control study, 7
Tumor necrosis factor alpha (TNF-α), 53
Tumor necrosis factor (TNF) gene, 13
Tuned ReliefF algorithm (TuRF), 180–181
Twin study, 20
Two-phase sampling designs, 10
Two-step approach for gene-gene/environment interactions, 151, 152, 153
UGT. See Uridine diphosphoglucuronosyltransferases (UGT)
Uridine diphosphoglucuronosyltransferases (UGT), 117–118
Urine specimens, 12
Validation
  of absolute risk models of disease incidence, 269–270
  in molecular epidemiology, 240, 248–253
Validity of questionnaire assessment for exposures, 104–105
  dimensions of, 106–108
Variable number of tandem repeats (VNTR) polymorphism, 119
Vitamin K epoxide reductase complex 1 (VKORC1), 122
VNTR polymorphism. See Variable number of tandem repeats (VNTR) polymorphism
Wahlund effect, 132
Warfarin, 122
WGA. See Whole genome amplification (WGA)
WGA DNA, 73
Wheat bran fiber supplementation trials for adenomatous polyp patients, 33–34
WHEL study. See Women’s healthy eating and living (WHEL) study
Whole genome amplification (WGA), 57–58
Whole-genome studies, 124
Women’s Health Initiative dietary intervention study, 34
Women’s healthy eating and living (WHEL) study, 34–35, 49
Wrapper method, 178
  for genomewide association analysis, 181–182
Wright’s fixation indices, 132–133
“www.epistasis.org,” 163, 173, 175